# Web Scraping with Python

This notebook guides you through the process of web scraping using Python. We'll use common libraries to fetch, parse, and extract data from websites.

## Table of Contents

1. [Introduction to Web Scraping](#Introduction-to-Web-Scraping)
2. [Required Libraries](#Import-Required-Libraries)
3. [Ethical Web Scraping](#Ethical-Web-Scraping)
4. [HTTP Basics](#HTTP-Basics)
5. [Fetch HTML Content](#Fetch-HTML-Content)
6. [Parse HTML with BeautifulSoup](#Parse-HTML-with-BeautifulSoup)
7. [Extract Specific Data](#Extract-Specific-Data)
8. [Handle Pagination](#Handle-Pagination)
9. [Structured Data: Scraping Tables](#Structured-Data-Scraping-Tables)
10. [Dynamic Websites and JavaScript](#Dynamic-Websites-and-JavaScript)
11. [Error Handling](#Error-Handling-and-Robust-Scraping)
12. [Practical Exercises](#Practical-Exercises)
13. [Conclusion](#Conclusion)

## Introduction to Web Scraping

Web scraping is the process of programmatically extracting data from websites. It's useful when you need to:

- Collect data that's not available through an API
- Monitor websites for changes
- Aggregate information from multiple sources
- Create datasets for analysis or machine learning

In this notebook, we'll cover the fundamentals of web scraping with Python, from making HTTP requests to parsing HTML and extracting useful information.

## Import Required Libraries

Import libraries such as requests and BeautifulSoup for web scraping.

In [6]:
# Import the necessary libraries for web scraping
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import re

## Ethical Web Scraping

Before we start scraping websites, it's important to understand the legal and ethical considerations involved:

### Legal and Ethical Considerations

1. **Terms of Service (ToS)**: Always review a website's Terms of Service. Many websites explicitly prohibit scraping.
2. **Copyright Laws**: Data on websites may be protected by copyright. Scraping and republishing it could be a violation.
3. **Rate Limiting**: Sending too many requests too quickly can overload a website's servers and degrade service for other users, with an effect similar to a denial-of-service attack.
4. **Personal Data**: Scraping personal data may violate privacy laws like GDPR or CCPA.

### Best Practices

1. **Check robots.txt**: This file tells web crawlers what parts of the site they can access.
2. **Use Delays**: Add time delays between requests to reduce server load.
3. **Identify Your Bot**: Set a proper User-Agent header that identifies your scraper.
4. **Cache Results**: Don't scrape the same content repeatedly.
5. **Use APIs When Available**: If a website offers an API, use it instead of scraping.
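
The robots.txt check in item 1 can be automated with the standard library's `urllib.robotparser`. The sketch below parses a made-up robots.txt and asks whether a given user agent may fetch a URL. The rules and the `example.com` URLs are illustrative assumptions, not taken from any real site, and `build_parser` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs offline
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
user_agent = "PythonScrapingTutorial/1.0"

# Ask whether this user agent may fetch each URL
public_ok = parser.can_fetch(user_agent, "https://example.com/page/1/")
private_ok = parser.can_fetch(user_agent, "https://example.com/private/data")
print(f"Public page allowed: {public_ok}")
print(f"Private page allowed: {private_ok}")

# Crawl-delay is optional; crawl_delay() returns None when it is absent
print(f"Requested crawl delay: {parser.crawl_delay(user_agent)}")
```

For a live site, calling `parser.set_url(...)` followed by `parser.read()` would download the file instead of parsing a local string.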

In [7]:
# Example: How to check robots.txt before scraping
import requests
from urllib.parse import urlparse


def check_robots_txt(url):
    # Build the site root (scheme + host) from the full URL
    parsed_url = urlparse(url)
    base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"

    # Fetch the robots.txt file (time out rather than hang on a slow server)
    robots_url = f"{base_url}/robots.txt"
    response = requests.get(robots_url, timeout=10)

    if response.status_code == 200:
        print(f"robots.txt found at {robots_url}")
        print("\nContent preview:")
        print(response.text[:500], "..." if len(response.text) > 500 else "")
        return response.text
    else:
        print(f"No robots.txt found at {robots_url}")
        return None


# Check robots.txt for our example site
robots_content = check_robots_txt("https://quotes.toscrape.com/")


# Function to set a proper User-Agent
def get_headers():
    # It's good practice to identify your scraper
    return {
        "User-Agent": "PythonScrapingTutorial/1.0 (Educational Purpose)",
        "Accept": "text/html,application/xhtml+xml,application/xml",
        "Accept-Language": "en-US,en;q=0.5",
    }

No robots.txt found at https://quotes.toscrape.com/robots.txt


## HTTP Basics

Understanding how HTTP works is essential for effective web scraping. Here are some key concepts:

- **HTTP Methods**: GET, POST, PUT, DELETE, etc.
- **Status Codes**: 200 (OK), 404 (Not Found), 403 (Forbidden), etc.
- **Headers**: Contain metadata about the request/response
- **Cookies**: Small pieces of data stored by the browser
- **Sessions**: Maintain state between multiple requests

In [8]:
# Example: Making different types of HTTP requests
import requests
from requests.exceptions import RequestException


# GET request (most common for scraping)
def make_get_request(url, headers=None, params=None):
    try:
        response = requests.get(url, headers=headers, params=params, timeout=10)
        response.raise_for_status()  # Raise exception for 4XX/5XX responses
        return response
    except RequestException as e:
        print(f"Error making GET request: {e}")
        return None


# POST request (for forms, login pages, etc.)
def make_post_request(url, data, headers=None):
    try:
        response = requests.post(url, data=data, headers=headers, timeout=10)
        response.raise_for_status()
        return response
    except RequestException as e:
        print(f"Error making POST request: {e}")
        return None


# Using a session to maintain cookies
def use_session(login_url, login_data, protected_url):
    with requests.Session() as session:
        # Login to the site
        try:
            # Sessions have no default timeout, so set one on each request
            login_resp = session.post(login_url, data=login_data, timeout=10)
            login_resp.raise_for_status()

            # Access protected page with the same session (cookies are reused)
            protected_resp = session.get(protected_url, timeout=10)
            protected_resp.raise_for_status()
            return protected_resp
        except RequestException as e:
            print(f"Session error: {e}")
            return None


# Example of examining response headers
sample_url = "https://quotes.toscrape.com/"
response = make_get_request(sample_url, headers=get_headers())
if response:
    print("Response Headers:")
    for header, value in response.headers.items():
        print(f"{header}: {value}")

Response Headers:
Date: Fri, 02 May 2025 20:21:27 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Strict-Transport-Security: max-age=0; includeSubDomains; preload
Content-Encoding: br


## Fetch HTML Content

Use the requests library to fetch HTML content from a given URL.

In [9]:
# Define the URL to scrape
url = "https://quotes.toscrape.com/"

# Send a GET request to the URL (with a timeout so the call cannot hang forever)
response = requests.get(url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    print(f"Successfully fetched the content from {url}")
    print(f"Response status code: {response.status_code}")
else:
    print(f"Failed to fetch the content. Status code: {response.status_code}")

# View the first 500 characters of the HTML content
print("\nHTML Content Preview:")
print(response.text[:500], "...")

Successfully fetched the content from https://quotes.toscrape.com/
Response status code: 200

HTML Content Preview:
<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
    
    
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div cla ...


## Parse HTML with BeautifulSoup

Parse the fetched HTML content using BeautifulSoup to create a navigable tree structure.

In [10]:
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Print the title of the webpage
print(f"Title of the webpage: {soup.title.text}")

# Print the structure of the HTML in a more readable format
print("\nStructure of the HTML:")
print(soup.prettify()[:500], "...")

Title of the webpage: Quotes to Scrape

Structure of the HTML:
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
    ...


## Extract Specific Data

Use BeautifulSoup methods to extract specific data such as titles, links, or tables from the HTML.

In [11]:
# Extract all quote elements
quotes = soup.find_all("div", class_="quote")

# Create empty lists to store data
quote_texts = []
quote_authors = []
quote_tags = []

# Extract data from each quote element
for quote in quotes:
    # Extract the quote text
    text = quote.find("span", class_="text").text
    quote_texts.append(text)

    # Extract the quote author
    author = quote.find("small", class_="author").text
    quote_authors.append(author)

    # Extract the quote tags
    tags = quote.find("div", class_="tags")
    tag_list = tags.find_all("a", class_="tag")
    tags_text = [tag.text for tag in tag_list]
    quote_tags.append(tags_text)

# Create a pandas DataFrame to organize the data
quotes_df = pd.DataFrame(
    {"Text": quote_texts, "Author": quote_authors, "Tags": quote_tags}
)

# Display the first few rows of the DataFrame
print(quotes_df.head())

# Basic analysis
print(f"\nTotal number of quotes extracted: {len(quotes_df)}")
print(f"Number of unique authors: {quotes_df['Author'].nunique()}")

                                                Text           Author  \
0  “The world as we have created it is a process ...  Albert Einstein   
1  “It is our choices, Harry, that show what we t...     J.K. Rowling   
2  “There are only two ways to live your life. On...  Albert Einstein   
3  “The person, be it gentleman or lady, who has ...      Jane Austen   
4  “Imperfection is beauty, madness is genius and...   Marilyn Monroe   

                                             Tags  
0        [change, deep-thoughts, thinking, world]  
1                            [abilities, choices]  
2  [inspirational, life, live, miracle, miracles]  
3              [aliteracy, books, classic, humor]  
4                    [be-yourself, inspirational]  

Total number of quotes extracted: 10
Number of unique authors: 8
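
Links can be extracted in much the same way as text. The sketch below runs on a small inline HTML snippet (made up for illustration and only loosely modelled on the quote markup above), reads each `href` attribute with `.get()`, and resolves it against a base URL with `urllib.parse.urljoin`. The helper name `extract_links` is hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Illustrative HTML; the real page's markup may differ
SAMPLE_HTML = """
<div class="quote">
  <span class="text">"A sample quote."</span>
  <small class="author">Jane Doe</small>
  <a href="/author/Jane-Doe">(about)</a>
  <div class="tags">
    <a class="tag" href="/tag/sample/page/1/">sample</a>
    <a class="tag">no-link</a>
  </div>
</div>
"""


def extract_links(html, base_url):
    """Return (link text, absolute URL) pairs for every <a> that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a"):
        href = anchor.get("href")  # .get() returns None instead of raising
        if href:
            links.append((anchor.get_text(strip=True), urljoin(base_url, href)))
    return links


links = extract_links(SAMPLE_HTML, "https://quotes.toscrape.com/")
for text, link_url in links:
    print(f"{text}: {link_url}")
```

Using `.get("href")` rather than `anchor["href"]` means anchors without a link are skipped instead of raising a `KeyError`.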


## Handle Pagination

Implement logic to handle pagination and scrape data across multiple pages.

In [12]:
# Function to scrape a single page
def scrape_page(url):
    # Send a GET request that identifies the scraper and cannot hang forever
    response = requests.get(url, headers=get_headers(), timeout=10)

    # Check if the request was successful
    if response.status_code != 200:
        print(f"Failed to fetch the content. Status code: {response.status_code}")
        return [], [], []

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")

    # Extract all quote elements
    quotes = soup.find_all("div", class_="quote")

    # Create empty lists to store data
    quote_texts = []
    quote_authors = []
    quote_tags = []

    # Extract data from each quote element
    for quote in quotes:
        # Extract the quote text
        text = quote.find("span", class_="text").text
        quote_texts.append(text)

        # Extract the quote author
        author = quote.find("small", class_="author").text
        quote_authors.append(author)

        # Extract the quote tags
        tags = quote.find("div", class_="tags")
        tag_list = tags.find_all("a", class_="tag")
        tags_text = [tag.text for tag in tag_list]
        quote_tags.append(tags_text)

    return quote_texts, quote_authors, quote_tags


# Initialize lists to store all data
all_texts = []
all_authors = []
all_tags = []

# Base URL
base_url = "https://quotes.toscrape.com/page/{}/"

# Number of pages to scrape
num_pages = 3

# Loop through pages
for page in range(1, num_pages + 1):
    page_url = base_url.format(page)
    print(f"Scraping page {page}: {page_url}")

    # Scrape the page
    texts, authors, tags = scrape_page(page_url)

    # Add data to the lists
    all_texts.extend(texts)
    all_authors.extend(authors)
    all_tags.extend(tags)

    # Sleep to be respectful to the website
    time.sleep(1)

# Create a pandas DataFrame with all the data
all_quotes_df = pd.DataFrame(
    {"Text": all_texts, "Author": all_authors, "Tags": all_tags}
)

# Display the DataFrame information
print(f"\nTotal number of quotes collected: {len(all_quotes_df)}")
print("\nFirst 5 quotes:")
print(all_quotes_df.head())

# Count quotes by author
author_counts = all_quotes_df["Author"].value_counts()
print("\nNumber of quotes by author:")
print(author_counts)

# Find most common tags
# Flatten the list of tags
all_tags_flat = [tag for tags_list in all_tags for tag in tags_list]
tag_counts = pd.Series(all_tags_flat).value_counts()
print("\nMost common tags:")
print(tag_counts.head(10))

Scraping page 1: https://quotes.toscrape.com/page/1/
Scraping page 2: https://quotes.toscrape.com/page/2/
Scraping page 3: https://quotes.toscrape.com/page/3/

Total number of quotes collected: 30

First 5 quotes:
                                                Text           Author  \
0  “The world as we have created it is a process ...  Albert Einstein   
1  “It is our choices, Harry, that show what we t...     J.K. Rowling   
2  “There are only two ways to live your life. On...  Albert Einstein   
3  “The person, be it gentleman or lady, who has ...      Jane Austen   
4  “Imperfection is beauty, madness is genius and...   Marilyn Monroe   

                                             Tags  
0        [change, deep-thoughts, thinking, world]  
1                            [abilities, choices]  
2  [inspirational, life, live, miracle, miracles]  
3              [aliteracy, books, classic, humor]  
4                    [be-yourself, inspirational]  

Number of quotes by author:
Author
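
The loop above scrapes a fixed number of pages. When the page count is not known in advance, another option is to follow the "Next" link until it disappears. The sketch below assumes the link sits inside an `<li class="next">` element, and it uses a fake two-page site so it runs offline. `crawl_pages`, `fetch_html`, and `FAKE_SITE` are hypothetical names introduced for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def crawl_pages(start_url, fetch_html, max_pages=50):
    """Follow 'next' links from start_url and return the URLs visited.

    fetch_html is any callable that takes a URL and returns HTML text,
    e.g. lambda u: requests.get(u, timeout=10).text for a live site.
    """
    visited = []
    url = start_url
    # max_pages guards against endless loops on sites with circular links
    while url and len(visited) < max_pages:
        soup = BeautifulSoup(fetch_html(url), "html.parser")
        visited.append(url)
        # Assumed markup: <li class="next"><a href="...">Next</a></li>
        next_link = soup.select_one("li.next > a")
        url = urljoin(url, next_link["href"]) if next_link else None
    return visited


# Fake two-page site so the sketch runs without network access
FAKE_SITE = {
    "https://example.com/page/1/": (
        '<ul><li class="next"><a href="/page/2/">Next</a></li></ul>'
    ),
    "https://example.com/page/2/": "<p>Last page, no next link</p>",
}

pages = crawl_pages("https://example.com/page/1/", FAKE_SITE.__getitem__)
print(pages)
```

On a real site, a `time.sleep()` call between requests inside `fetch_html` would keep the crawl polite.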

## Structured Data: Scraping Tables

Many websites present data in structured formats like tables. BeautifulSoup makes it relatively easy to extract this structured data.

### HTML Tables

HTML tables are defined with the `<table>` tag and contain rows (`<tr>`) and cells (`<td>` or `<th>` for headers). This structure makes them ideal for conversion to pandas DataFrames.

In [13]:
# Example: Scraping a table from a webpage
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of a page with a table (this is a sample with a simple HTML table)
table_url = "https://www.w3schools.com/html/html_tables.asp"

# Fetch the page
response = requests.get(table_url, headers=get_headers())

if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")

    # Find the table (in this case, the first table on the page)
    table = soup.find("table")

    if table:
        # Extract table headers
        headers = []
        header_row = table.find("tr")
        if header_row:
            headers = [
                header.text.strip() for header in header_row.find_all(["th", "td"])
            ]

        # Extract table rows
        rows = []
        data_rows = table.find_all("tr")[1:] if header_row else table.find_all("tr")

        for row in data_rows:
            cells = row.find_all(["td", "th"])
            row_data = [cell.text.strip() for cell in cells]
            rows.append(row_data)

        # Create pandas DataFrame
        if headers:
            df = pd.DataFrame(rows, columns=headers)
        else:
            df = pd.DataFrame(rows)

        print("Table extracted successfully:")
        print(df.head())
    else:
        print("No table found on the page")
else:
    print(f"Failed to fetch the page. Status code: {response.status_code}")

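# Alternative: Using pandas' built-in read_html function
# This function automatically extracts tables from HTML
# (pandas 2.1+ deprecates passing a raw HTML string, so wrap it in StringIO)
from io import StringIO

try:
    tables = pd.read_html(StringIO(response.text))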
    print(f"\nNumber of tables found using pd.read_html: {len(tables)}")
    print("\nFirst table using pd.read_html:")
    print(tables[0].head())
except Exception as e:
    print(f"Error using pd.read_html: {e}")

Table extracted successfully:
                        Company          Contact  Country
0           Alfreds Futterkiste     Maria Anders  Germany
1    Centro comercial Moctezuma  Francisco Chang   Mexico
2                  Ernst Handel    Roland Mendel  Austria
3                Island Trading    Helen Bennett       UK
4  Laughing Bacchus Winecellars  Yoshi Tannamuri   Canada

Number of tables found using pd.read_html: 2

First table using pd.read_html:
                        Company          Contact  Country
0           Alfreds Futterkiste     Maria Anders  Germany
1    Centro comercial Moctezuma  Francisco Chang   Mexico
2                  Ernst Handel    Roland Mendel  Austria
3                Island Trading    Helen Bennett       UK
4  Laughing Bacchus Winecellars  Yoshi Tannamuri   Canada



### Exercise: Scraping Structured Data

Try to scrape a table from a website of your choice. Here are some ideas:

1. World population data
2. Stock market data
3. Sports statistics
4. Weather data

Remember to check the website's robots.txt first!

In [14]:
# Exercise: Scrape a table of your choice
# URL = "your website with table"
# TODO: Complete the code to scrape a table from your chosen website

# Sample solution with a different website
# This is just an example - try finding your own table to scrape!
exercise_url = (
    "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
)

# Check robots.txt first
check_robots_txt(exercise_url)

# Now scrape a table
# Your code here...

robots.txt found at https://en.wikipedia.org/robots.txt

Content preview:
﻿# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapa ...



## Dynamic Websites and JavaScript

Many modern websites load content dynamically using JavaScript. This poses a challenge for basic web scraping because the requests library only fetches the initial HTML, not the content loaded by JavaScript.

There are several approaches to scrape dynamic websites:

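1. **Find the API endpoints**: Dynamic content is often loaded from JSON APIs that you can call directly. The Network tab of the browser's developer tools shows these requests.
2. **Use a headless browser**: Tools like Selenium or Playwright run a real browser engine without a visible window, so the page's JavaScript executes before you read the HTML.
3. **Use browser automation**: The same tools can drive a visible browser to click, scroll, or log in when content only appears after user interaction.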
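
When the data is not served by a separate endpoint, it is sometimes embedded in the page itself as a JavaScript variable inside a `<script>` tag. The sketch below assumes a page that assigns a JSON array to a variable named `data`. The HTML is invented for illustration, `extract_embedded_json` is a hypothetical helper, and the regular expression would need adjusting for any real site.

```python
import json
import re

from bs4 import BeautifulSoup

# Illustrative page: the data lives in a script tag, not in the HTML body
SAMPLE_HTML = """
<html><body>
<script>
    var data = [
        {"text": "First quote", "author": {"name": "Jane Doe"}},
        {"text": "Second quote", "author": {"name": "John Roe"}}
    ];
    render(data);
</script>
</body></html>
"""


def extract_embedded_json(html, var_name="data"):
    """Return the JSON array assigned to `var <var_name> = [...];`, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Non-greedy match up to the first "];", so this simple pattern would
    # break if a string inside the array itself contained "];"
    pattern = re.compile(
        r"var\s+" + re.escape(var_name) + r"\s*=\s*(\[.*?\]);", re.DOTALL
    )
    for script in soup.find_all("script"):
        match = pattern.search(script.string or "")
        if match:
            return json.loads(match.group(1))
    return None


records = extract_embedded_json(SAMPLE_HTML)
for record in records:
    print(f"{record['author']['name']}: {record['text']}")
```

Because the result is already structured data, it can be passed straight to `pd.DataFrame` or `pd.json_normalize` without any HTML parsing of the rendered page.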

In [15]:
# Example: Finding and using API endpoints that provide JSON data
import requests
import json

# Many dynamic sites load data from JSON APIs
# Let's fetch data from a sample JSON API
api_url = "https://jsonplaceholder.typicode.com/posts"
response = requests.get(api_url)

if response.status_code == 200:
    data = response.json()
    print(f"Number of posts: {len(data)}")
    print("\nFirst post:")
    print(json.dumps(data[0], indent=2))

    # Convert to DataFrame
    posts_df = pd.DataFrame(data)
    print("\nDataFrame of posts:")
    print(posts_df.head())
else:
    print(f"Failed to fetch data from API. Status code: {response.status_code}")

# Note: Selenium and Playwright are separate packages that must be installed first.
# The code below is wrapped in a string literal so it does not run without that setup.

"""
# Example using Selenium (requires installation: pip install selenium webdriver-manager)
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

# Set up Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run in headless mode (no UI)

# Set up the driver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)

# Navigate to the page
driver.get("https://quotes.toscrape.com/js/")

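# Wait (up to 10 seconds) for JavaScript to render the quotes,
# instead of sleeping for a fixed time that may be too short or too long
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "quote"))
)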

# Now extract the content after JavaScript has run
quotes = driver.find_elements(By.CLASS_NAME, "quote")
for i, quote in enumerate(quotes[:3]):
    text = quote.find_element(By.CLASS_NAME, "text").text
    author = quote.find_element(By.CLASS_NAME, "author").text
    print(f"Quote {i+1}: {text}")
    print(f"Author: {author}\n")

# Close the driver
driver.quit()
"""

Number of posts: 100

First post:
{
  "userId": 1,
  "id": 1,
  "title": "sunt aut facere repellat provident occaecati excepturi optio reprehenderit",
  "body": "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"
}

DataFrame of posts:
   userId  id                                              title  \
0       1   1  sunt aut facere repellat provident occaecati e...   
1       1   2                                       qui est esse   
2       1   3  ea molestias quasi exercitationem repellat qui...   
3       1   4                               eum et est occaecati   
4       1   5                                 nesciunt quas odio   

                                                body  
0  quia et suscipit\nsuscipit recusandae consequu...  
1  est rerum tempore vitae\nsequi sint nihil repr...  
2  et iusto sed quo iure\nvoluptatem occaecati om...  
3  ullam et saepe reici


## Error Handling and Robust Scraping

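Web scraping can be unpredictable: requests fail, servers time out, and page structure changes without notice. The example below combines several strategies for making a scraper more robust: request timeouts, retries with exponential backoff, checks for missing elements, and fallback CSS selectors.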

In [16]:
# Example: Robust scraping with error handling
import requests
from bs4 import BeautifulSoup
import time
from requests.exceptions import RequestException


def robust_scraper(url, max_retries=3, backoff_factor=2):
    """A robust web scraper with retry logic and error handling"""
    headers = get_headers()
    retries = 0

    while retries < max_retries:
        try:
            # Make the request with a timeout
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()  # Raise for 4XX/5XX status codes

            # Parse the content
            soup = BeautifulSoup(response.content, "html.parser")
            return soup

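        except RequestException as e:
            retries += 1
            if retries == max_retries:
                # No point in waiting when there are no attempts left
                print(f"Error: {e}.")
                break
            # Exponential backoff: 1, 2, 4, ... seconds with the default factor
            wait_time = backoff_factor ** (retries - 1)
            print(f"Error: {e}. Retrying in {wait_time} seconds...")
            time.sleep(wait_time)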

    print(f"Failed to retrieve {url} after {max_retries} attempts")
    return None


# Example usage
soup = robust_scraper("https://quotes.toscrape.com")
if soup:
    title = soup.title.text if soup.title else "No title found"
    print(f"Successfully scraped: {title}")

    # Defensive extraction: find_all returns an empty list (it does not raise)
    # when nothing matches, so check the result instead of catching an exception
    quotes = soup.find_all("div", class_="quote")
    if quotes:
        print(f"Found {len(quotes)} quotes")
    else:
        print("Could not extract quotes, page structure may have changed")


# Using a list of fallback CSS selectors in case the page structure changes
def extract_with_css(soup, selector):
    """Extract data using CSS selectors with error handling"""
    try:
        elements = soup.select(selector)
        return elements
    except Exception as e:
        print(f"Error extracting using selector '{selector}': {e}")
        return []


if soup:
    # Multiple selectors to try different approaches
    selectors = [
        ".quote .text",  # Primary selector
        "div.quote span.text",  # Alternative
        "[class='text']",  # Attribute selector
    ]

    # Try each selector until one works
    for selector in selectors:
        elements = extract_with_css(soup, selector)
        if elements:
            print(f"Found {len(elements)} elements using selector: {selector}")
            break

Successfully scraped: Quotes to Scrape
Found 10 quotes
Found 10 elements using selector: .quote .text
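
As an alternative to a hand-written retry loop like `robust_scraper`, `requests` can delegate retries to urllib3's `Retry` class through an `HTTPAdapter`. The sketch below builds such a session without making any request. The retry count, backoff factor, and status codes are illustrative choices rather than recommendations, `make_retrying_session` is a hypothetical helper, and the `allowed_methods` parameter assumes urllib3 1.26 or newer.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(total_retries=3, backoff_factor=1.0):
    """Return a requests.Session that retries transient failures."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,  # wait grows exponentially between tries
        status_forcelist=[429, 500, 502, 503, 504],  # retry on these statuses
        allowed_methods=["GET", "HEAD"],  # assumes urllib3 >= 1.26
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    # Same identifying User-Agent used elsewhere in this notebook
    session.headers.update(
        {"User-Agent": "PythonScrapingTutorial/1.0 (Educational Purpose)"}
    )
    return session


session = make_retrying_session()
adapter = session.get_adapter("https://quotes.toscrape.com/")
print(f"Retries configured: {adapter.max_retries.total}")

# Example usage (makes a network request, so it is left commented out):
# response = session.get("https://quotes.toscrape.com/", timeout=10)
```

Unlike `robust_scraper`, this approach does not retry on client errors such as 404, which usually will not succeed on a second attempt.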


## Practical Exercises

Let's put everything together with some practical exercises that combine the concepts we've learned.

### Exercise 1: Build a News Scraper

Create a scraper that extracts headlines, summaries, and links from a news website.

Requirements:
1. Extract at least 10 headlines
2. For each headline, get the summary/snippet and URL
3. Save the results to a CSV file
4. Be respectful of the website (add delays, proper headers)

Suggested sites: BBC, Reuters, NPR (check robots.txt first!)

In [17]:
# Exercise 1: News Scraper
# Your code here...

# Sample solution framework (you'll need to adapt this to your chosen site)
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time


def news_scraper(url):
    # TODO: Implement the news scraper
    # 1. Send a request to the URL
    # 2. Parse the HTML
    # 3. Extract headlines, summaries, and links
    # 4. Return the data

    # Placeholder for the solution
    return {"headlines": [], "summaries": [], "links": []}


# Test your scraper
# news_data = news_scraper('https://your-chosen-news-site.com')

### Exercise 2: Create a Web Monitor

Build a tool that monitors a webpage for changes and sends a notification when changes are detected.

Requirements:
1. Take a URL as input
2. Periodically check the webpage (e.g., every hour)
3. Compare with the previous version to detect changes
4. Print a notification when changes are detected

Bonus: Save the history of changes to a file
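
As a hint for requirement 3, one possible approach is to compare short hashes of the page rather than the full text. The sketch below is only one way to do it: collapsing whitespace before hashing is an assumption about what should count as a change, and `content_fingerprint` is a hypothetical name rather than part of the exercise framework.

```python
import hashlib


def content_fingerprint(text):
    """Return a SHA-256 hex digest of the text with whitespace collapsed."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


old_version = "<h1>Hello</h1>\n<p>World</p>"
reformatted = "<h1>Hello</h1>   <p>World</p>"  # same content, new spacing
new_version = "<h1>Hello</h1>\n<p>Everyone</p>"  # content actually changed

same_after_reformat = content_fingerprint(old_version) == content_fingerprint(reformatted)
same_after_edit = content_fingerprint(old_version) == content_fingerprint(new_version)
print(f"Unchanged after reformatting: {same_after_reformat}")
print(f"Unchanged after edit: {same_after_edit}")
```

Hashing the visible text (for example `soup.get_text()`) instead of the raw HTML may reduce false alarms from markup that changes on every request.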

In [18]:
# Exercise 2: Web Monitor
# Your code here...

# Sample solution framework
import requests
from bs4 import BeautifulSoup
import time
import hashlib


def get_page_content(url):
    # TODO: Fetch the page content
    pass


def generate_content_hash(content):
    # TODO: Generate a hash of the content to detect changes
    pass


def monitor_webpage(url, check_interval=3600):
    # TODO: Implement the monitoring logic
    # 1. Get the initial content
    # 2. Periodically check for new content
    # 3. Compare with previous content
    # 4. Notify if changes are detected
    pass


# Test your monitor (reduced interval for testing)
# monitor_webpage('https://example.com', check_interval=30)

## Conclusion

In this notebook, we've learned how to:
1. Import and use web scraping libraries
2. Fetch HTML content from a website
3. Parse the HTML using BeautifulSoup
4. Extract specific data from the parsed HTML
5. Handle pagination to scrape data across multiple pages
6. Scrape structured data like tables
7. Handle dynamic websites with JavaScript content
8. Implement error handling for robust scraping
9. Apply these concepts in practical exercises

Remember to always be respectful when scraping websites by:
- Reading and following the website's robots.txt file
- Adding delays between requests to avoid overloading the server
- Identifying your scraper with an appropriate user agent
- Only scraping data that is publicly available and legal to scrape

### Additional Resources

- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Requests Library Documentation](https://docs.python-requests.org/en/latest/)
- [Selenium Documentation](https://www.selenium.dev/documentation/)
- [Web Scraping Ethics and Legality](https://www.scrapingbee.com/blog/web-scraping-legal/)

Happy scraping!