## Milestone Project III

**Page Scraper** - Create an application that connects to a site, pulls out all links or images, and saves them to a list. *Optional: Organize the indexed content and don’t allow duplicates. Have it put the results into an easily searchable index file.*

In [1]:
%pip install requests beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [11]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import csv

In [12]:
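def scrape_page(url):
    """Fetch url and return (links, images) as lists of unique absolute URLs."""
    try:
        response = requests.get(url, timeout=10)  # Avoid hanging on an unresponsive server
        response.raise_for_status()  # Raise for 4xx/5xx responses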
    except requests.exceptions.RequestException as e:
        print(f"Error accessing {url}: {e}")
        return [], []

    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract and normalize all links
    links = set()
    for link in soup.find_all('a', href=True):
        full_link = urljoin(url, link['href'])
        links.add(full_link)

    # Extract and normalize all image sources
    images = set()
    for img in soup.find_all('img', src=True):
        full_img = urljoin(url, img['src'])
        images.add(full_img)

    return sorted(links), sorted(images)  # Sorted so the output order is deterministic
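
The `urljoin` call is what turns relative `href` and `src` values into absolute URLs before they go into the set. As a minimal sketch (not part of the notebook above), the hypothetical helper `normalize_url` shows how that resolution behaves. It also makes two assumptions about what counts as a duplicate or a useful entry: `#fragment` suffixes are dropped with `urldefrag`, and non-HTTP targets such as `mailto:` are skipped.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def normalize_url(base_url, raw_href):
    """Hypothetical helper: resolve raw_href against base_url.

    Returns None for non-HTTP(S) targets (an assumption about what to index).
    """
    absolute = urljoin(base_url, raw_href.strip())
    # Drop the "#fragment" part so in-page anchors collapse to one URL.
    absolute, _fragment = urldefrag(absolute)
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute


base = "https://books.toscrape.com/catalogue/page-2.html"
hrefs = [
    "page-3.html",
    "../index.html",
    "../index.html#top",
    "mailto:someone@example.com",
]

# A set comprehension removes duplicates; sorting gives a stable order.
unique = sorted({u for u in (normalize_url(base, h) for h in hrefs) if u})
print(unique)
```

Whether fragments and `mailto:` links should be kept depends on what the index is for, so treat both filters as optional.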


In [13]:
def save_to_csv(data, filename):
    """Write each item in data to filename as a one-column CSV with a "URL" header."""
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(["URL"])
        for item in data:
            writer.writerow([item])
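
For the optional "searchable index file" part of the brief, one possible layout is a single CSV that records the kind of each entry next to its URL. The sketch below is an assumption rather than part of the notebook: `save_index` and the `kind`/`url` column names are hypothetical, and the `example.com` URLs are sample data only.

```python
import csv
import os
import tempfile


def save_index(links, images, filename):
    """Hypothetical helper: write links and images to one sorted, duplicate-free CSV."""
    # A set of (kind, url) tuples removes duplicates within each kind.
    rows = {("link", u) for u in links} | {("image", u) for u in images}
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["kind", "url"])
        writer.writerows(sorted(rows))
    return len(rows)


# Sample data only; a temporary directory keeps the demo from touching real files.
sample_links = ["https://example.com/b", "https://example.com/a", "https://example.com/a"]
sample_images = ["https://example.com/logo.png"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.csv")
    count = save_index(sample_links, sample_images, path)
    with open(path, newline="", encoding="utf-8") as f:
        saved_rows = list(csv.reader(f))

print(count)
print(saved_rows)
```

Keeping everything in one file means a single search covers both links and images, at the cost of an extra column compared with the two separate files.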

In [14]:
url = input("Enter the website URL to scrape: ").strip()
links, images = scrape_page(url)

print(f"\n Found {len(links)} unique links and {len(images)} unique images.")


Enter the website URL to scrape:  https://books.toscrape.com/



 Found 73 unique links and 20 unique images.


In [15]:
save_to_csv(links, 'links.csv')
save_to_csv(images, 'images.csv')

In [16]:
print("\n Results saved as:")
print(" - links.csv")
print(" - images.csv")


 Results saved as:
 - links.csv
 - images.csv
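
Once the files exist, they can be searched with the standard `csv` module. The following is a minimal sketch, assuming the one-column layout with a `URL` header that `save_to_csv` writes. The helper name `search_index` is hypothetical, and the demo builds its own small sample file with made-up `example.com` URLs instead of reading the real `links.csv`.

```python
import csv
import os
import tempfile


def search_index(filename, term):
    """Hypothetical helper: return URLs in the CSV that contain term (case-insensitive)."""
    term = term.lower()
    with open(filename, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)  # Uses the "URL" header row as the key
        return [row["URL"] for row in reader if term in row["URL"].lower()]


# Sample data only, written in the same format save_to_csv produces.
sample_urls = [
    "https://example.com/catalogue/Travel/index.html",
    "https://example.com/catalogue/mystery/index.html",
    "https://example.com/travel-guide.html",
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "links.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["URL"])
        writer.writerows([u] for u in sample_urls)
    matches = search_index(path, "travel")

print(matches)
```

With the files from this notebook, a call such as `search_index('links.csv', 'catalogue')` would work the same way.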
