# **Notebook 2: Web Scraping for Historians**
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jburnford/hist497_2025/blob/main/02_web_scraping.ipynb)

Welcome to web scraping! In this notebook, you'll learn to extract historical data from websites using Python. We'll work with real Canadian historical sources and build your skills step by step.

**What you'll learn:**
- How web pages are structured (HTML basics)
- Using Beautiful Soup to extract specific content
- Working with Canadian historical archives
- Building reusable code for research

**Ethics first:** Always respect websites. Check robots.txt files, don't overload servers, and respect copyright.


## Step 1: Setting Up Our Tools

Before we can scrape websites, we need to install and import the right libraries. Let's do this step by step.

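**First, define a helper for polite requests.** The install and import cells follow the ethics guidelines below; defining the function now is safe because it only looks up `requests` when it is called.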

In [26]:
# Example: How to add delays between requests
import time

# Friendly User-Agent so cultural institutions know who is visiting
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; HIST497-Student/1.0; +https://github.com/jburnford/hist497_2025)"
}

def respectful_get(url, delay=1, headers=None):
    """
    Make a web request with built-in delay for respectful scraping
    """
    print(f"⏳ Waiting {delay} second(s) before request...")
    time.sleep(delay)  # Wait before making request

    try:
        response = requests.get(url, headers=headers or DEFAULT_HEADERS, timeout=10)
        response.raise_for_status()
        print(f"✅ Request successful")
        return response
    except requests.exceptions.RequestException as e:
        print(f"❌ Request failed: {e}")
        return None

# We'll use this function for respectful scraping throughout the notebook
print("Respectful scraping function defined!")
print("💡 This adds delays between requests to be considerate to servers and sends a helpful User-Agent.")


Respectful scraping function defined!
💡 This adds delays between requests to be considerate to servers and sends a helpful User-Agent.


## 🤝 Ethical Web Scraping Guidelines

Before we start scraping, let's understand the ethics and best practices:

**✅ Always:**
- Check robots.txt (add /robots.txt to the site's root URL, for example `https://archive.org/robots.txt`)
- Add delays between requests (don't overwhelm servers)
- Use reasonable timeouts
- Respect copyright and terms of service
- Identify yourself with User-Agent headers when appropriate

**❌ Never:**
- Scrape faster than a human could browse
- Ignore error messages or blocks
- Scrape personal/private information
- Violate website terms of service
- Overload servers with rapid requests

**📖 For historical research:**
- Many archives encourage responsible academic use
- Always cite your digital sources properly
- Consider contacting archives for bulk data access
- Respect cultural sensitivities in historical materials
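
The robots.txt check can also be done in code. Here is a minimal sketch using Python's built-in `urllib.robotparser`. The sample rules and the `example.org` URLs are made up for illustration; for a live site you would call `set_url()` and `read()` on its real robots.txt instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example
sample_rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)  # live site: parser.set_url(".../robots.txt"); parser.read()

agent = "HIST497-Student"
allowed_article = parser.can_fetch(agent, "https://example.org/entry/article.html")
allowed_private = parser.can_fetch(agent, "https://example.org/private/notes.html")
requested_delay = parser.crawl_delay(agent)

print(f"Article allowed: {allowed_article}")
print(f"Private page allowed: {allowed_private}")
print(f"Requested crawl delay: {requested_delay}")
```

If a site requests a crawl delay, that number is a sensible value to pass as `delay` to `respectful_get()`.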

In [27]:
# This installs the libraries we need
!pip install requests beautifulsoup4 --quiet
print("Libraries installed successfully!")

Libraries installed successfully!


**Now, import them so we can use them:**

Copy this code into the cell below:
```python
import requests
from bs4 import BeautifulSoup
```
If you don't get an error and nothing happens after you click the run button, it worked.

In [28]:
# Import the libraries we need for web scraping
import requests
from bs4 import BeautifulSoup

print("Libraries imported successfully!")
print("✅ requests: Downloads web pages")
print("✅ BeautifulSoup: Parses HTML content")

Libraries imported successfully!
✅ requests: Downloads web pages
✅ BeautifulSoup: Parses HTML content


## Step 2: Your First Web Request

Let's start by downloading a web page. We'll use a Saskatchewan encyclopedia article about the Rupert's Land purchase.

Some websites, including library and archive sites, block anonymous scraping, so we send a friendly `User-Agent` string that identifies this class.

**Step 2a: Define the URL**

Copy this code:
```python
url = "https://esask.uregina.ca/entry/ruperts_land_purchase.html"
print(f"We're going to scrape: {url}")
print("📚 This is an Encyclopedia of Saskatchewan article about the Rupert's Land purchase")
print("🏔️ Great practice for prairie-focused historical research!")
```


In [30]:
# Define the URL for our Canadian historical source
url = "https://esask.uregina.ca/entry/ruperts_land_purchase.html"
print(f"We're going to scrape: {url}")
print("📚 This is an Encyclopedia of Saskatchewan article about the Rupert's Land purchase")
print("🏔️ Great practice for prairie-focused historical research!")


We're going to scrape: https://esask.uregina.ca/entry/ruperts_land_purchase.html
📚 This is an Encyclopedia of Saskatchewan article about the Rupert's Land purchase
🏔️ Great practice for prairie-focused historical research!


**Step 2b: Download the page**

Now let's actually download the web page. The `requests.get()` function fetches the page for us. We include the same `DEFAULT_HEADERS` so the site recognizes us as a respectful browser.

Copy this code:
```python
response = requests.get(url, headers=DEFAULT_HEADERS)
print(f"Status Code: {response.status_code}")  # 200 means success
print(f"Page downloaded! It contains {len(response.text)} characters.")
```


In [34]:
# Download the page with error handling
try:
    response = requests.get(url, headers=DEFAULT_HEADERS, timeout=10)  # 10 second timeout
    response.raise_for_status()  # Raises an exception for bad status codes
    print(f"✅ Success! Status Code: {response.status_code}")
    print(f"📄 Page downloaded! It contains {len(response.text):,} characters.")
except requests.exceptions.RequestException as e:
    print(f"❌ Error downloading page: {e}")
    print("💡 This could be due to:")
    print("   - No internet connection")
    print("   - Website is down")
    print("   - URL has changed")
    print("   - Server is blocking requests")
    print("   - Try sending headers=DEFAULT_HEADERS to share a friendly User-Agent")


✅ Success! Status Code: 200
📄 Page downloaded! It contains 9,457 characters.


**Step 2c: Preview the HTML**

Can you find the title and main article content?

In [51]:
# Show the first 5000 characters so we can inspect the HTML
print("📄 Previewing Encyclopedia of Saskatchewan HTML...")
print("-" * 50)
print(response.text[:5000])
print("-" * 50)
print("💡 Notice header navigation, metadata, and the main article content.")


📄 Previewing Encyclopedia of Saskatchewan HTML...
--------------------------------------------------
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>The Encyclopedia of Saskatchewan | Details</title>
<!--<base href="http://esask.uregina.ca/">-->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

<meta name="author" content="Cory Toth - Encyclopedia Of Saskatchewan">
<meta name="robots" content="INDEX, FOLLOW">
<meta name="document-state" content="Dynamic">
<link href="../assets/css/global.css" rel="stylesheet" type="text/css">
<script language="JavaScript" src="../assets/js/mercury.js" type="text/javascript"></script>
<script language="JavaScript" src="../assets/js/niftycube.js" type="text/javascript"></script>
<script type="text/javascript">window.onload=function(){Nifty("div.rounded,div.h1,span.h1,h1,.h2,h2,span.btn");}</script>
</head>

### 🔄 **Your Turn: Adaptation Exercise**

Now try the same steps with a different Canadian source. Copy the code from above and modify it to use this Internet Archive document about Saskatchewan:

```python
url = "https://archive.org/details/saskatchewan00sask"
```

Follow the same steps: define the URL, download the page with `headers=DEFAULT_HEADERS`, check the status code, and look at the first 500 characters using `print(response.text[:500])`.


In [None]:
# Your adaptation exercise here
# 1. Define the Saskatchewan URL
# 2. Download the page with requests.get()
# 3. Print the status code and character count
# 4. Show first 500 characters


## Step 3: Making Sense of HTML with Beautiful Soup

Raw HTML is hard to work with. Beautiful Soup parses it and makes it easy to extract what we need.

**Step 3a: Create your first "soup"**

Let's go back to our Encyclopedia article and parse it properly:

Copy this code:
```python
# Go back to the Encyclopedia article
url = "https://esask.uregina.ca/entry/ruperts_land_purchase.html"
response = requests.get(url, headers=DEFAULT_HEADERS)

# Create the soup object
soup = BeautifulSoup(response.text, 'html.parser')
print("Soup object created! Now we can easily extract content.")
```

In [49]:
# Go back to the Encyclopedia article
url = "https://esask.uregina.ca/entry/ruperts_land_purchase.html"
response = requests.get(url, headers=DEFAULT_HEADERS)

# Create the soup object
soup = BeautifulSoup(response.text, 'html.parser')
print("Soup object created! Now we can easily extract content.")

Soup object created! Now we can easily extract content.


**Step 3b: Extract the page title**

Let's start simple by getting the page title. In HTML, the title is in `<title>` tags.

Copy this code:
```python
title = soup.find('title')
print(f"Page title: {title.get_text()}")
```

In [48]:
title = soup.find('title')
print(f"Page title: {title.get_text()}")

Page title: The Encyclopedia of Saskatchewan | Details


**Step 3c: Get clean text (no HTML tags)**

The `.get_text()` method removes all HTML tags and gives us just the readable content.

In [52]:
# Extract all text content without HTML tags
clean_text = soup.get_text()
print(f"Clean text length: {len(clean_text)} characters")
print()
print("First 1500 characters of clean text:")
print(clean_text[:1500])




Clean text length: 3778 characters

First 1500 characters of clean text:



The Encyclopedia of Saskatchewan | Details











<%@include file="menu.html" %>


Welcome to the Encyclopedia of Saskatchewan. For assistance in exploring this site, please click here.



 

History


 
    If you have feedback regarding this entry please fill out our feedback form. 









Rupert's Land Purchase
 Before Saskatchewan became a province, it was part of the North-West Territories and its geographic and economic future was determined by the sale of Rupert's Land. Rupert's Land, the territory granted by the British Crown to the Hudson's Bay Company (HBC) in 1670, was purchased by the government of Canada in 1870: approximately 3 million hectares (or 7 million acres) were purchased for $1.5 million in Canadian currency (Â£300,000). The HBC was granted one-twentieth of the best farmland in the region, and the company held on to its most successful fur-trading operations. The Rupert's Land Purcha
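
The long runs of blank lines in that output come from whitespace between tags. The sketch below uses a made-up HTML snippet (not the encyclopedia page) to show how the `separator` and `strip` arguments of `get_text()` tidy this up.

```python
from bs4 import BeautifulSoup

# Invented sample page with extra whitespace between tags
sample_html = """
<html><head><title>Sample Entry</title></head>
<body>
  <h1>Rupert's Land Purchase</h1>

  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Default: keeps every newline and indent found between tags
raw_text = sample_soup.get_text()

# strip=True trims each piece of text and drops empty ones;
# separator="\n" puts each remaining piece on its own line
tidy_text = sample_soup.get_text(separator="\n", strip=True)

print(f"Raw length: {len(raw_text)}, tidy length: {len(tidy_text)}")
print(tidy_text)
```

The same two arguments should work on the `soup` object from the encyclopedia article, although navigation text will still appear alongside the article body.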

### 🔄 **Your Turn: Practice with Internet Archive**

Now apply the same Beautiful Soup steps to the Saskatchewan document. Copy and adapt the code above:

1. Create a soup object from the Saskatchewan URL
2. Extract and print the title
3. Get the clean text and show the first 500 characters

In [43]:
# Your practice: Apply Beautiful Soup to Saskatchewan document


## Step 4: Targeting Specific Content

Getting all the text is useful, but often we want specific parts. Let's learn to target particular HTML elements.

**Step 4a: Find all paragraphs**

Web articles usually organize their content in paragraphs (`<p>` tags). Let's find them:

Copy this code:
```python
# Go back to our Encyclopedia article soup
url = "https://esask.uregina.ca/entry/ruperts_land_purchase.html"
response = requests.get(url, headers=DEFAULT_HEADERS)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all paragraph tags
paragraphs = soup.find_all('p')
print(f"Found {len(paragraphs)} paragraphs")
```

In [53]:
# Go back to our Encyclopedia article soup
url = "https://esask.uregina.ca/entry/ruperts_land_purchase.html"
response = requests.get(url, headers=DEFAULT_HEADERS)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all paragraph tags
paragraphs = soup.find_all('p')
print(f"Found {len(paragraphs)} paragraphs")

Found 4 paragraphs


**Step 4b: Look at individual paragraphs**

Let's examine a few paragraphs to see what we're working with:

In [54]:
# Look at the first few paragraphs with validation
try:
    if 'paragraphs' not in locals():
        print("❌ Paragraphs not found - run the paragraph extraction cell first")
    elif not paragraphs:
        print("⚠️  No paragraph tags available to display")
    else:
        print("First 3 paragraphs:")
        for i in range(min(3, len(paragraphs))):
            para_text = paragraphs[i].get_text().strip()
            preview = para_text[:100] + ('...' if len(para_text) > 100 else '')
            print()
            print(f"Paragraph {i+1}: {preview}")
except Exception as e:
    print(f"❌ Error previewing paragraphs: {e}")


First 3 paragraphs:

Paragraph 1: Before Saskatchewan became a province, it was part of the North-West Territories and its geographic ...

Paragraph 2: The Métis people settled on the land, establishing their traditional patterns of hunting, trapping, ...

Paragraph 3: The Rupert's Land Purchase also adversely affected Indian populations in the North-West Territories ...


**Step 4c: Filter for substantial paragraphs**

Many paragraphs are short or empty. Let's filter for substantial ones (more than 50 characters):

Copy this code:
```python
# Filter for paragraphs with meaningful content
substantial_paras = []
for para in paragraphs:
    text = para.get_text().strip()
    if len(text) > 50:  # Only paragraphs with more than 50 characters
        substantial_paras.append(text)

print(f"Substantial paragraphs: {len(substantial_paras)}")
print(f"\nFirst substantial paragraph:")
print(substantial_paras[0])
```

In [56]:
# Filter for paragraphs with meaningful content
substantial_paras = []
for para in paragraphs:
    text = para.get_text().strip()
    if len(text) > 50:  # Only paragraphs with more than 50 characters
        substantial_paras.append(text)

print(f"Substantial paragraphs: {len(substantial_paras)}")
print(f"\nFirst substantial paragraph:")
print(substantial_paras[0])


Substantial paragraphs: 3

First substantial paragraph:
Before Saskatchewan became a province, it was part of the North-West Territories and its geographic and economic future was determined by the sale of Rupert's Land. Rupert's Land, the territory granted by the British Crown to the Hudson's Bay Company (HBC) in 1670, was purchased by the government of Canada in 1870: approximately 3 million hectares (or 7 million acres) were purchased for $1.5 million in Canadian currency (Â£300,000). The HBC was granted one-twentieth of the best farmland in the region, and the company held on to its most successful fur-trading operations. The Rupert's Land Purchase drastically altered the historic relationships that Saskatchewan Métis and First Nations peoples had with the land, the Canadian government, and the social environment in the prairie region. Indian and Métis people, who were not consulted about the sale, were seen as a deterrent to successful settlement of the west. The Métis, led by Lou

### 🔄 **Your Turn: Find Historical Quotes**

Historical articles often include quotes in special `<blockquote>` tags. Adapt the code above to:

1. Find all blockquote elements using `soup.find_all('blockquote')`
2. Extract their text and print them
3. Count how many historical quotes you found

Use the same Encyclopedia article soup object. Note that this particular article contains no block quotes, so a count of 0 is the expected result; the same code will find quotes on pages that do use `<blockquote>`.

In [57]:
# Your exercise: Find historical quotes in blockquotes
try:
    if 'soup' not in locals():
        print("❌ Soup object not found - run the Beautiful Soup creation cells first")
    else:
        # Find all blockquote elements
        blockquotes = soup.find_all('blockquote')
        print(f"📜 Found {len(blockquotes)} blockquote elements")

        if blockquotes:
            print("✅ Historical quotes found!")
            print("\n🔍 Analyzing each quote:")

            valid_quotes = []
            for i, quote in enumerate(blockquotes, 1):
                quote_text = quote.get_text().strip()

                if quote_text and len(quote_text) > 10:  # Filter out very short quotes
                    valid_quotes.append(quote_text)
                    print(f"\nQuote {i} ({len(quote_text)} characters):")
                    print("-" * 50)
                    # Show first 200 characters of quote
                    display_text = quote_text[:200] + "..." if len(quote_text) > 200 else quote_text
                    print(display_text)
                    print("-" * 50)

                    # Check for historical indicators
                    quote_lower = quote_text.lower()
                    # Include both 'metis' and 'métis' so accented spellings also match
                    historical_indicators = ['rupert', 'hudson', 'company', 'indigenous', 'metis', 'métis', 'cree', 'treaty', '1869', '1870']
                    found_indicators = [ind for ind in historical_indicators if ind in quote_lower]

                    if found_indicators:
                        print(f"📚 Historical indicators found: {', '.join(found_indicators)}")
                    else:
                        print("📝 No obvious historical indicators")
                else:
                    print(f"Quote {i}: (too short or empty)")

            print(f"\n📊 Summary:")
            print(f"   Total blockquotes: {len(blockquotes)}")
            print(f"   Valid historical quotes: {len(valid_quotes)}")

            if valid_quotes:
                avg_length = sum(len(q) for q in valid_quotes) / len(valid_quotes)
                print(f"   Average quote length: {avg_length:.0f} characters")

        else:
            print("❌ No blockquotes found on this page")
            print("💡 This might mean:")
            print("   - This site doesn't use blockquotes for quotes")
            print("   - Quotes might be in other tags (div, p with special classes)")
            print("   - Page structure is different than expected")

except Exception as e:
    print(f"❌ Error finding quotes: {e}")

📜 Found 0 blockquote elements
❌ No blockquotes found on this page
💡 This might mean:
   - This site doesn't use blockquotes for quotes
   - Quotes might be in other tags (div, p with special classes)
   - Page structure is different than expected
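
When a page has no `<blockquote>` tags, quotations may still be marked up another way. The sketch below uses an invented HTML snippet. The class name `quote` is hypothetical, so inspect the real page's HTML to see what its authors actually used.

```python
from bs4 import BeautifulSoup

# Invented snippet: quotes marked with a class instead of <blockquote>
sample_html = """
<div class="article">
  <p>Ordinary narrative paragraph.</p>
  <p class="quote">First quoted passage.</p>
  <div class="quote">Second quoted passage.</div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# CSS selector: any <p> or <div> carrying the (hypothetical) class "quote"
quote_tags = sample_soup.select("p.quote, div.quote")
quote_texts = [tag.get_text(strip=True) for tag in quote_tags]

print(f"Found {len(quote_texts)} quote-like elements")
for text in quote_texts:
    print(f"- {text}")
```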


## Step 5: Working with Links

Historical sources often link to primary documents. Let's learn to extract and analyze links.

**Step 5a: Find all links**

Links are in `<a>` tags with `href` attributes. Let's find them:

In [61]:

# Switch to an Active History blog post, which contains many links
url = "https://activehistory.ca/blog/2010/07/05/remembering-oka/"
response = requests.get(url, headers=DEFAULT_HEADERS)
soup = BeautifulSoup(response.text, 'html.parser')
# Find all links with validation
try:
    if 'soup' not in locals():
        print("❌ Soup object not found - run the Beautiful Soup creation cells first")
    else:
        # Find all links with href attributes
        all_links = soup.find_all('a', href=True)
        print(f"🔗 Found {len(all_links)} links with href attributes")

        if all_links:
            print("✅ Link extraction successful")

            # Analyze link types
            internal_links = 0
            external_links = 0
            archive_links = 0

            print("\n🔍 Analyzing first 5 links:")
            for i, link in enumerate(all_links[:5]):
                link_text = link.get_text().strip()
                link_url = link.get('href')

                # Classify link type
                if link_url.startswith('http'):
                    if 'esask.uregina.ca' in link_url:
                        link_type = "Internal"
                        internal_links += 1
                    elif 'archive.org' in link_url:
                        link_type = "Archive"
                        archive_links += 1
                    else:
                        link_type = "External"
                        external_links += 1
                else:
                    link_type = "Relative"
                    internal_links += 1

                print(f"\n{i+1}. [{link_type}] '{link_text[:50]}{'...' if len(link_text) > 50 else ''}'")
                print(f"    URL: {link_url[:80]}{'...' if len(link_url) > 80 else ''}")

            print(f"\n📊 Link Classification (first 5):")
            print(f"   Internal/Relative: {internal_links}")
            print(f"   External: {external_links}")
            print(f"   Archive links: {archive_links}")

        else:
            print("❌ No links with href found")
            print("💡 This might mean:")
            print("   - Page has no links")
            print("   - Links use different attributes")
            print("   - Page didn't load properly")

except Exception as e:
    print(f"❌ Error finding links: {e}")

🔗 Found 73 links with href attributes
✅ Link extraction successful

🔍 Analyzing first 5 links:

1. [External] ''
    URL: https://activehistory.ca/

2. [External] 'Active History'
    URL: https://activehistory.ca/

3. [External] 'Home'
    URL: http://activehistory.ca/

4. [External] 'About'
    URL: https://activehistory.ca/about/

5. [External] 'Features'
    URL: https://activehistory.ca/papers/

📊 Link Classification (first 5):
   Internal/Relative: 0
   External: 5
   Archive links: 0
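
Notice that every activehistory.ca link above was labelled External. The cell compares each link against `esask.uregina.ca`, while the page being scraped is on activehistory.ca. One way to avoid hard-coding the domain is to compare each link's host with the host of the page itself. Here is a sketch using only the standard library; the helper name `classify_link` and the sample hrefs are illustrative.

```python
from urllib.parse import urljoin, urlparse

def classify_link(page_url, href):
    """Label a link Internal, Archive, or External relative to page_url."""
    absolute = urljoin(page_url, href)  # resolves relative hrefs such as "/papers/"
    page_host = urlparse(page_url).netloc.lower()
    link_host = urlparse(absolute).netloc.lower()
    if link_host == page_host:
        return "Internal"
    if link_host == "archive.org" or link_host.endswith(".archive.org"):
        return "Archive"
    return "External"

page_url = "https://activehistory.ca/blog/2010/07/05/remembering-oka/"
sample_hrefs = [
    "https://activehistory.ca/about/",
    "/papers/",
    "https://archive.org/details/saskatchewan00sask",
    "http://www.canadiana.org/view/91943/321",
]
labels = [classify_link(page_url, href) for href in sample_hrefs]

for href, label in zip(sample_hrefs, labels):
    print(f"[{label}] {href}")
```

Comparing parsed hostnames is also stricter than a substring test, which would treat any URL that merely contains the text `archive.org` as an archive link.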


**Step 5b: Filter for historical document links**

Let's find links that point to historical archives or documents:

Copy this code:
```python
# Define domains that often contain historical documents
historical_domains = ['archive.org', 'canadiana.org', 'gutenberg.org', 'biographi.ca']

# Filter links
document_links = []
for link in all_links:
    href = link.get('href', '')
    # Check if any historical domain is in the URL
    for domain in historical_domains:
        if domain in href:
            document_links.append({
                'text': link.get_text().strip(),
                'url': href
            })
            break  # Don't add the same link twice

print(f"Historical document links found: {len(document_links)}")
```

In [63]:
# Define domains that often contain historical documents
historical_domains = ['archive.org', 'canadiana.org', 'gutenberg.org', 'biographi.ca']

# Filter links
document_links = []
for link in all_links:
    href = link.get('href', '')
    # Check if any historical domain is in the URL
    for domain in historical_domains:
        if domain in href:
            document_links.append({
                'text': link.get_text().strip(),
                'url': href
            })
            break  # Don't add the same link twice

print(f"Historical document links found: {len(document_links)}")

Historical document links found: 1


**Step 5c: Display the historical links**

In [64]:
# Show the historical document links we found
try:
    if 'document_links' not in locals():
        print("❌ Historical document links not available - run the filter cell first")
    elif not document_links:
        print("ℹ️ No historical document links were identified yet.")
    else:
        print("Historical Document Links:")
        print("=" * 50)
        for i, link in enumerate(document_links, 1):
            print(f"{i}. {link['text']}")
            print(f"   URL: {link['url']}")
            print()
except Exception as e:
    print(f"❌ Error displaying historical document links: {e}")


Historical Document Links:
1. acknowledged by the British in the 1760s
   URL: http://http://www.canadiana.org/view/91943/321
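
The URL in that result begins with `http://http://`, which is how the `href` was printed and will not load as written. Scraped links often need this kind of cleanup. Here is a minimal sketch; `fix_doubled_scheme` is a hypothetical helper that handles only this one pattern.

```python
import re

def fix_doubled_scheme(url):
    """Collapse a repeated scheme such as 'http://http://' down to the last one."""
    # One or more leading schemes followed by a final scheme: keep only the final one
    return re.sub(r"^(?:https?://)+(https?://)", r"\1", url)

broken_url = "http://http://www.canadiana.org/view/91943/321"
repaired_url = fix_doubled_scheme(broken_url)
untouched_url = fix_doubled_scheme("https://archive.org/details/saskatchewan00sask")

print(repaired_url)
print(untouched_url)
```

Keep the original value alongside the repaired one in your notes, so your citation records what the page actually contained.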



### 🔄 **Your Turn: Find PDF Links**

Use this blog post: "https://activehistory.ca/blog/2025/07/30/taking-care-of-the-truth-a-call-for-collaborative-community-engaged-residential-school-research/"

Many historical documents are available as PDFs. Adapt the link-finding code to:

1. Find all links that contain ".pdf" in their href
2. Store them in a list called `pdf_links`
3. Print how many PDF links you found
4. Display the first 3 PDF links

Hint: Use `if '.pdf' in href:` to check for PDF links.

## Step 6: Internet Archive Metadata

Internet Archive documents have rich metadata. Let's learn to extract it systematically.

**Step 6a: Get an Internet Archive page**

Let's work with our Saskatchewan document:

In [66]:
# Get Internet Archive page with validation
ia_url = "https://archive.org/details/saskatchewan00sask"

print("🏛️ Working with Internet Archive metadata...")
print(f"URL: {ia_url}")

try:
    # Use our respectful function with a bit longer delay for IA
    ia_response = respectful_get(ia_url, delay=2)

    if ia_response:
        print(f"✅ Status Code: {ia_response.status_code}")
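        print("✅ Internet Archive page loaded successfully!")

        # Parse the HTML so the next cells can use ia_soup
        ia_soup = BeautifulSoup(ia_response.text, 'html.parser')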

        # Basic validation
        if 'archive.org' in ia_response.url:
            print("✅ Confirmed: This is an Internet Archive page")
        else:
            print("⚠️  Warning: Response URL doesn't match Internet Archive")

        # Check content size
        content_size = len(ia_response.text)
        print(f"📄 Content size: {content_size:,} characters")

        if content_size > 10000:
            print("✅ Substantial content received")
        else:
            print("⚠️  Warning: Less content than expected")

    else:
        print("❌ Failed to load Internet Archive page")
        print("💡 Possible issues:")
        print("   - Internet Archive servers busy")
        print("   - Network connectivity issues")
        print("   - Document no longer available")

except Exception as e:
    print(f"❌ Error loading Internet Archive page: {e}")

🏛️ Working with Internet Archive metadata...
URL: https://archive.org/details/saskatchewan00sask
⏳ Waiting 2 second(s) before request...
✅ Request successful
✅ Status Code: 200
✅ Internet Archive page loaded successfully!
✅ Confirmed: This is an Internet Archive page
📄 Content size: 228,042 characters
✅ Substantial content received


**Step 6b: Extract the document title**

Internet Archive puts document titles in `<h1>` tags:

Copy this code:
```python
title = ia_soup.find('h1')
if title:
    document_title = title.get_text().strip()
    print(f"Document Title: {document_title}")
else:
    print("No title found")
```

In [68]:
title = ia_soup.find('h1')
if title:
    document_title = title.get_text().strip()
    print(f"Document Title: {document_title}")
else:
    print("No title found")

Document Title: Saskatchewan.


**Step 6c: Find metadata fields**

Internet Archive uses `<dt>` (definition term) and `<dd>` (definition description) tags for metadata:

Copy this code:
```python
# Find metadata terms and values
metadata_terms = ia_soup.find_all('dt')
metadata_values = ia_soup.find_all('dd')

print(f"Metadata fields found: {len(metadata_terms)}")
print(f"Metadata values found: {len(metadata_values)}")
```

In [69]:
# Find metadata terms and values
metadata_terms = ia_soup.find_all('dt')
metadata_values = ia_soup.find_all('dd')

print(f"Metadata fields found: {len(metadata_terms)}")
print(f"Metadata values found: {len(metadata_values)}")

Metadata fields found: 36
Metadata values found: 36


**Step 6d: Extract and display metadata**

In [70]:
# Create a dictionary to store metadata
try:
    if 'metadata_terms' not in locals() or 'metadata_values' not in locals():
        print("❌ Metadata fields not found - run the metadata extraction cell first")
    else:
        metadata = {}
        for term, value in zip(metadata_terms, metadata_values):
            term_text = term.get_text().strip()
            value_text = value.get_text().strip()
            metadata[term_text] = value_text
        if not metadata:
            print("ℹ️ No metadata pairs were captured - the page structure may have changed")
        else:
            print("Document Metadata:")
            print("=" * 40)
            important_fields = ['by', 'Publication date', 'Topics', 'Language']
            for field in important_fields:
                if field in metadata:
                    value = metadata[field]
                    if len(value) > 100:
                        value = value[:100] + '...'
                    print(f"{field}: {value}")
            extra_fields = [key for key in metadata if key not in important_fields]
            if extra_fields:
                print()
                print(f"🔍 Additional metadata fields captured: {len(extra_fields)}")
except Exception as e:
    print(f"❌ Error building metadata dictionary: {e}")


Document Metadata:
by: Saskatchewan. Dept. of Agriculture
Publication date: 1912
Topics: Saskatchewan -- Economic conditions -- History, Saskatchewan -- Agriculture, Saskatchewan -- Resourc...
Language: English

🔍 Additional metadata fields captured: 32
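
`zip()` pairs terms and values by position, which works here because the counts match (36 and 36). If a page ever has a `<dt>` without a `<dd>`, every later pair shifts. The sketch below shows a sturdier pairing on an invented snippet, using `find_next_sibling()`. Whether Internet Archive always keeps each `<dd>` as a direct sibling of its `<dt>` is an assumption to verify against the real page.

```python
from bs4 import BeautifulSoup

# Invented snippet: the middle term has no value
sample_html = """
<dl>
  <dt>Publication date</dt><dd>1912</dd>
  <dt>Empty field</dt>
  <dt>Language</dt><dd>English</dd>
</dl>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Positional pairing: misaligns once a value is missing
dt_texts = [dt.get_text(strip=True) for dt in sample_soup.find_all("dt")]
dd_texts = [dd.get_text(strip=True) for dd in sample_soup.find_all("dd")]
naive_pairs = dict(zip(dt_texts, dd_texts))

# Sibling pairing: accept a <dd> only if it comes before the next <dt>
paired = {}
for term in sample_soup.find_all("dt"):
    neighbour = term.find_next_sibling(["dt", "dd"])
    if neighbour is not None and neighbour.name == "dd":
        paired[term.get_text(strip=True)] = neighbour.get_text(strip=True)

print("zip pairing:    ", naive_pairs)
print("sibling pairing:", paired)
```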


### 🔄 **Your Turn: Create a Metadata Function**

Now create a reusable function that extracts metadata from any Internet Archive document. Fill in the missing parts:

```python
def extract_ia_metadata(url):
    """Extract title and metadata from an Internet Archive document"""
    # 1. Get the page with requests.get()
    # 2. Create soup with BeautifulSoup
    # 3. Extract title from h1 tag
    # 4. Extract metadata from dt/dd tags
    # 5. Return a dictionary with title and metadata
```

Test it with this annual report on Internet Archive (its metadata identifies it as a University of Saskatchewan report): `https://archive.org/details/annualreport191920nivuoft`

In [71]:
# Create a reusable Internet Archive metadata function
def extract_ia_metadata(url):
    """Extract title and metadata from an Internet Archive document"""
    try:
        print(f"🔍 Analyzing Internet Archive URL: {url}")

        # Extract item ID from URL
        if '/details/' in url:
            item_id = url.split('/details/')[-1]
            print(f"📋 Item ID extracted: {item_id}")
        else:
            return {'error': 'Invalid Internet Archive URL format'}

        # Method 1: Try Internet Archive Python library (preferred)
        try:
            import internetarchive as ia
            item = ia.get_item(item_id)

            # Extract key metadata
            metadata = {
                'method': 'IA Python Library',
                'title': item.metadata.get('title', 'No title'),
                'creator': item.metadata.get('creator', 'No creator'),
                'date': item.metadata.get('date', 'No date'),
                'subject': item.metadata.get('subject', 'No subject'),
                'description': item.metadata.get('description', 'No description')[:200] + '...' if item.metadata.get('description') else 'No description',
                'language': item.metadata.get('language', 'No language'),
                'files_count': len(list(item.files))
            }

            print("✅ Successfully extracted metadata using IA library")
            return metadata

        except ImportError:
            print("⚠️  IA library not available, falling back to web scraping...")
        except Exception as e:
            print(f"⚠️  IA library failed ({e}), falling back to web scraping...")

        # Method 2: Fallback to web scraping
        response = respectful_get(url, delay=2)
        if not response:
            return {'error': 'Could not download page'}

        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract metadata from HTML
        title_tag = soup.find('h1')
        title = title_tag.get_text().strip() if title_tag else 'No title found'

        # Try to find metadata fields
        metadata_dict = {'method': 'Web Scraping', 'title': title}

        # Look for dt/dd metadata pairs
        dt_tags = soup.find_all('dt')
        dd_tags = soup.find_all('dd')

        for dt, dd in zip(dt_tags, dd_tags):
            key = dt.get_text().strip().lower()
            value = dd.get_text().strip()

            # Map common fields
            if 'by' in key or 'creator' in key:
                metadata_dict['creator'] = value
            elif 'date' in key:
                metadata_dict['date'] = value
            elif 'topic' in key or 'subject' in key:
                metadata_dict['subject'] = value
            elif 'language' in key:
                metadata_dict['language'] = value

        print("✅ Successfully extracted metadata using web scraping")
        return metadata_dict

    except Exception as e:
        print(f"❌ Error extracting metadata: {e}")
        return {'error': str(e)}

# Test the function with University of Toronto annual report
uoft_url = "https://archive.org/details/annualreport191920nivuoft"
print("🎓 Testing with University of Toronto Annual Report 1919-20...")

result = extract_ia_metadata(uoft_url)

print(f"\n📊 Extracted Metadata:")
print("=" * 50)
for key, value in result.items():
    print(f"{key.title()}: {value}")

print(f"\n💡 This function demonstrates two approaches:")
print(f"   1. Internet Archive Python library (preferred)")
print(f"   2. Web scraping with Beautiful Soup (fallback)")
print(f"   Real research projects should use both for robustness!")

🎓 Testing with University of Toronto Annual Report 1919-20...
🔍 Analyzing Internet Archive URL: https://archive.org/details/annualreport191920nivuoft
📋 Item ID extracted: annualreport191920nivuoft
⚠️  IA library not available, falling back to web scraping...
⏳ Waiting 2 second(s) before request...
✅ Request successful
✅ Successfully extracted metadata using web scraping

📊 Extracted Metadata:
Method: Web Scraping
Title: Annual report - University of Saskatchewan
Creator: University of Saskatchewan
Date: 20080605175845
Subject: University of Saskatchewan
Language: English

💡 This function demonstrates two approaches:
   1. Internet Archive Python library (preferred)
   2. Web scraping with Beautiful Soup (fallback)
   Real research projects should use both for robustness!
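
**A note on the scraped date:** The `Date` value in the output above (`20080605175845`) has the shape of a 14-digit timestamp (YYYYMMDDHHMMSS). That suggests the scraper matched a scanning or upload date rather than a publication date, although this reading is an assumption. The sketch below shows one way to tell such values apart from plain years before you use them in an analysis. The helper name `parse_scraped_date` is hypothetical and is not part of the notebook's earlier code.

```python
from datetime import datetime

def parse_scraped_date(value):
    """Classify a scraped date string.

    Returns a (kind, parsed) tuple:
      ('timestamp', datetime) for 14-digit values such as 20080605175845
      ('year', int)           for 4-digit years such as 1920
      ('unknown', original)   for anything else
    """
    text = value.strip()
    if len(text) == 14 and text.isdigit():
        try:
            return 'timestamp', datetime.strptime(text, '%Y%m%d%H%M%S')
        except ValueError:
            # 14 digits that do not form a valid date and time
            return 'unknown', value
    if len(text) == 4 and text.isdigit():
        return 'year', int(text)
    return 'unknown', value

kind, parsed = parse_scraped_date("20080605175845")
print(kind, parsed)
```

A value classified as `timestamp` is a prompt to check the item page by hand for the actual publication date.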


## Step 7: Final Challenge - Your Research Project

Time to put it all together! Choose a Canadian historical source and conduct your own analysis.

**Available sources:**
- Encyclopedia of Saskatchewan article on the Rupert's Land Purchase: Research the historical quotes and themes
- Saskatchewan History: Analyze the metadata and publication context
- UofT Annual Report 1919-20: Extract institutional information

**Your research steps:**
1. Choose a source and research question
2. Use the scraping techniques you've learned
3. Extract specific information related to your question
4. Present your findings

In [72]:
# Using Internet Archive library - much more efficient!
print("🏛️ Accessing Internet Archive with the Python library...")

try:
    import internetarchive as ia
except ImportError:
    print("❌ Internet Archive library not installed yet - run the installation cell below first")
else:
    try:
        item_id = "saskatchewan00sask"  # The ID from the URL
        item = ia.get_item(item_id)
        print(f"✅ Successfully retrieved item: {item_id}")
        print(f"📚 Title: {item.metadata.get('title', 'No title')}")
        print(f"📅 Date: {item.metadata.get('date', 'No date')}")
        print(f"👤 Creator: {item.metadata.get('creator', 'No creator')}")
        print(f"📖 Subject: {item.metadata.get('subject', 'No subject')}")
        files = list(item.files)
        print()
        print(f"📁 Available files: {len(files)}")
        for i, file in enumerate(files[:5]):
            file_name = file.get('name', 'Unknown')
            file_format = file.get('format', 'Unknown')
            file_size = file.get('size', 'Unknown')
            print(f"   {i+1}. {file_name} ({file_format}) - {file_size} bytes")
        if len(files) > 5:
            print(f"   ... and {len(files) - 5} more files")
        print()
        print("🚀 Much easier than HTML scraping!")
        print("💡 The IA library gives us clean, structured data instantly")
    except Exception as e:
        print(f"❌ Error accessing Internet Archive: {e}")
        print("💡 This might be due to network issues or an unavailable item")


🏛️ Accessing Internet Archive with the Python library...
❌ Internet Archive library not installed yet - run the installation cell below first
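
**Going further with the file list:** Once the library is installed, the file records read in the cell above (through `file.get('name')`, `file.get('format')`, and `file.get('size')`) can be tallied to answer a question such as "What file formats are provided?". The sketch below runs offline on made-up records of the same shape. The sample values and the helper name `summarize_formats` are illustrative assumptions, as is the guess that sizes may arrive as strings or be missing.

```python
from collections import Counter

# Hypothetical file records, shaped like the dicts the cell above reads
sample_files = [
    {'name': 'example.pdf', 'format': 'Text PDF', 'size': '1048576'},
    {'name': 'example_djvu.txt', 'format': 'DjVuTXT', 'size': '204800'},
    {'name': 'example_meta.xml', 'format': 'Metadata', 'size': '2048'},
    {'name': 'example_files.xml', 'format': 'Metadata'},  # size missing
]

def summarize_formats(files):
    """Count files and total bytes per format, skipping missing or non-numeric sizes."""
    counts = Counter()
    total_bytes = Counter()
    for record in files:
        fmt = record.get('format', 'Unknown')
        counts[fmt] += 1
        size = str(record.get('size', ''))
        if size.isdigit():
            total_bytes[fmt] += int(size)
    return counts, total_bytes

counts, total_bytes = summarize_formats(sample_files)
for fmt, n in counts.most_common():
    print(f"{fmt}: {n} file(s), {total_bytes[fmt]:,} bytes")
```

To try it on a real item, pass `list(item.files)` in place of `sample_files`.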


In [73]:
# Install and import the Internet Archive library
# Install the library first (run this once per session)
print("📚 Installing Internet Archive Python library...")
!pip install internetarchive --quiet
print("✅ Installation complete!")

# Import the library
try:
    import internetarchive as ia
    print("✅ Internet Archive library imported successfully!")
    print("🔗 This library provides direct, efficient access to IA collections")
except ImportError as e:
    print(f"❌ Error importing Internet Archive library: {e}")
    print("💡 Try running: pip install internetarchive")

📚 Installing Internet Archive Python library...
✅ Installation complete!
✅ Internet Archive library imported successfully!
🔗 This library provides direct, efficient access to IA collections


## Step 8: Introduction to Internet Archive Python Library

While Beautiful Soup is excellent for scraping web pages, the Internet Archive provides a specialized Python library (`internetarchive`) that gives more efficient access to its collections. This makes it well suited to large-scale historical research projects.

**Why use the Internet Archive library?**
- Direct access to metadata without scraping HTML
- Faster downloads and better error handling
- Access to full-text search capabilities
- Bulk processing of large collections
- Respects Internet Archive's preferred access methods
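
**What "direct access to metadata" can look like:** The sketch below shows, under stated assumptions, the small pieces of plumbing involved: pulling the item identifier out of a `/details/` URL, building the address of the item's JSON metadata record, and flattening a field that may be a single string or a list. The helper names (`item_id_from_url`, `metadata_url`, `as_text`) and the sample subject values are hypothetical. Nothing here makes a network request, so it runs offline.

```python
from urllib.parse import urlparse

def item_id_from_url(url):
    """Return the identifier from an archive.org /details/<id> URL, or None."""
    parts = [p for p in urlparse(url).path.split('/') if p]
    if len(parts) >= 2 and parts[0] == 'details':
        return parts[1]
    return None

def metadata_url(item_id):
    """Build the address of the JSON metadata record for an item."""
    return f"https://archive.org/metadata/{item_id}"

def as_text(value, default='Not recorded'):
    """Flatten a metadata value that may be a string, a list of strings, or missing."""
    if not value:
        return default
    if isinstance(value, (list, tuple)):
        return '; '.join(str(v) for v in value)
    return str(value)

item_id = item_id_from_url("https://archive.org/details/saskatchewan00sask")
print(item_id)
print(metadata_url(item_id))
print(as_text(['History', 'Saskatchewan']))  # made-up subject values
```

In practice, `ia.get_item(item_id).metadata` hands you these fields directly, and a helper like `as_text` is mainly useful when printing or saving them.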

In [79]:
# Final Challenge: Your Historical Research Project

# Available Canadian historical sources:
sources = {
    "Rupert's Land Purchase (Encyclopedia of Saskatchewan)": {
        "url": "https://esask.uregina.ca/entry/ruperts_land_purchase.html",
        "type": "Online encyclopedia article",
        "research_questions": [
            "How is the Rupert's Land purchase narrated for a general audience?",
            "What Indigenous perspectives are included or missing?",
            "Which archival collections are linked for further research?"
        ]
    },
    "Saskatchewan History": {
        "url": "https://archive.org/details/saskatchewan00sask",
        "type": "Internet Archive document",
        "research_questions": [
            "What metadata is available about publication?",
            "What file formats are provided?",
            "Who was the creator/publisher?"
        ]
    },
    "UofT Annual Report 1919-20": {
        "url": "https://archive.org/details/annualreport191920nivuoft",
        "type": "Institutional document",
        "research_questions": [
            "What institutional information is captured?",
            "How is the document structured?",
            "What historical context does it provide?"
        ]
    }
}

# Choose your research project
print("🔬 Historical Web Scraping Research Project")
print("=" * 50)

print("📚 Available sources:")
for i, (name, info) in enumerate(sources.items(), 1):
    print(f"{i}. {name}")
    print(f"   Type: {info['type']}")
    print(f"   URL: {info['url'][:60]}...")
    print(f"   Sample questions:")
    for q in info['research_questions']:
        print(f"     - {q}")

print(f"🎯 Your research steps:")
print("1. Choose a source and research question")
print("2. Use appropriate scraping techniques")
print("3. Extract and validate data")
print("4. Analyze and present findings")

# Example research project - customize this!
chosen_source = "Rupert's Land Purchase (Encyclopedia of Saskatchewan)"  # Change this to your choice
my_url = sources[chosen_source]["url"]
my_question = "What historical themes appear in the Rupert's Land purchase article?"

print(f"📋 Example Research Project:")
print(f"Source: {chosen_source}")
print(f"Question: {my_question}")
print(f"URL: {my_url}")

# Conduct the research
print(f"🔍 Conducting research...")

theme_counts = {}

try:
    response = respectful_get(my_url)
    if response:
        soup = BeautifulSoup(response.text, 'html.parser')
        print("✅ Page successfully scraped")
        if "themes" in my_question.lower():
            paragraphs = soup.find_all('p')
            print(f"📊 Content Analysis:")
            print(f"   Paragraphs found: {len(paragraphs)}")
            all_text = soup.get_text().lower()
            themes = {
                'Economic': ['trade', 'commerce', 'company', 'profit'],
                'Indigenous perspectives': ['indigenous', 'cree', 'assiniboine', 'déné', 'nation'],
                'Colonial administration': ['british', 'government', 'hudson', 'london'],
                'Geographic': ['rupert', 'saskatchewan', 'manitoba', 'prairies', 'north-west'],
                'Temporal': ['1869', '1870', '19th century', 'confederation']
            }
            for theme_name, keywords in themes.items():
                count = sum(all_text.count(keyword) for keyword in keywords)
                if count > 0:
                    theme_counts[theme_name] = count
            if theme_counts:
                print(f"🎭 Historical Themes Found:")
                for theme, count in sorted(theme_counts.items(), key=lambda x: x[1], reverse=True):
                    print(f"   {theme}: {count} mentions")
            else:
                print("ℹ️ No theme keywords were detected - try adjusting the keyword lists")
        else:
            print("ℹ️ Customize the analysis section to match your research question.")
        print(f"📝 Research Findings:")
        print("1. Successfully scraped a Canadian historical source")
        print("2. Identified multiple historical themes in source material" if theme_counts else "2. Ready to adapt the analysis for your own question")
        print("3. Demonstrated effective web scraping for historical research")
        if theme_counts:
            most_common = max(theme_counts.items(), key=lambda x: x[1])
            print(f"4. Most prominent theme: {most_common[0]} ({most_common[1]} mentions)")
    else:
        print("❌ Could not complete research - page unavailable")
except Exception as e:
    print(f"❌ Research error: {e}")

print(f"🚀 Now it's your turn!")
print("Customize this code for your own research question:")
print("1. Change 'chosen_source' to your preferred source")
print("2. Modify 'my_question' to your research interest")
print("3. Adapt the analysis code for your specific question")
print("4. Add your own themes, keywords, or analysis methods")

print(f"💡 Remember to:")
print("- Validate your scraped data")
print("- Handle errors gracefully")
print("- Respect website terms of service")
print("- Cite your digital sources properly")


🔬 Historical Web Scraping Research Project
📚 Available sources:
1. Rupert's Land Purchase (Encyclopedia of Saskatchewan)
   Type: Online encyclopedia article
   URL: https://esask.uregina.ca/entry/ruperts_land_purchase.html...
   Sample questions:
     - How is the Rupert's Land purchase narrated for a general audience?
     - What Indigenous perspectives are included or missing?
     - Which archival collections are linked for further research?
2. Saskatchewan History
   Type: Internet Archive document
   URL: https://archive.org/details/saskatchewan00sask...
   Sample questions:
     - What metadata is available about publication?
     - What file formats are provided?
     - Who was the creator/publisher?
3. UofT Annual Report 1919-20
   Type: Institutional document
   URL: https://archive.org/details/annualreport191920nivuoft...
   Sample questions:
     - What institutional information is captured?
     - How is the document structured?
     - What historical context does it provi
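
**Refining the theme counts:** The research cell above counts keywords with `str.count`, which matches substrings, so a keyword like `nation` is also counted inside `national`. The sketch below shows one possible refinement that counts whole words only. The helper name `count_keyword` and the sample sentence are hypothetical.

```python
import re

def count_keyword(text, keyword):
    """Count whole-word, case-insensitive matches of keyword in text."""
    # Lookarounds require that no letter, digit, or underscore touches the keyword
    pattern = r'(?<!\w)' + re.escape(keyword) + r'(?!\w)'
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

sample = "The national government debated the nation's future; one nation objected."

substring_total = sample.lower().count('nation')   # also matches inside "national"
whole_word_total = count_keyword(sample, 'nation')  # skips "national"

print(f"Substring matches: {substring_total}")
print(f"Whole-word matches: {whole_word_total}")
```

To use it in the theme loop, replace `all_text.count(keyword)` with `count_keyword(all_text, keyword)`. Whole-word matching also misses plurals such as `nations`, so add those forms to the keyword lists if they matter for your question.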

## Summary: What You've Learned

🎉 **Congratulations!** You've mastered the fundamentals of web scraping for historical research:

**Technical Skills:**
- ✅ Making web requests with `requests.get()`
- ✅ Parsing HTML with Beautiful Soup
- ✅ Targeting specific elements (`find`, `find_all`)
- ✅ Extracting text, links, and metadata
- ✅ Building reusable functions for research

**Historical Sources:**
- ✅ Blog posts with embedded historical content
- ✅ Internet Archive documents with metadata
- ✅ Academic indexes and structured data

**Research Methods:**
- ✅ Systematic content extraction
- ✅ Metadata analysis for document context
- ✅ Building reproducible research workflows

**Next Steps:**
In Notebook 3, you'll learn advanced text analysis techniques to find patterns in the historical data you've scraped. We'll also go deeper into the Internet Archive Python library, introduced in Step 8, for more efficient access to large collections.

**Remember:**
- Always respect robots.txt and website terms of service
- Cite your digital sources properly
- Consider the limitations and context of digitized materials
- Use these skills responsibly for legitimate research purposes
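
**Checking robots.txt in code:** The first reminder above can be partly automated with Python's built-in `urllib.robotparser` module. The sketch below parses a made-up robots.txt so that it runs offline. The rules, the `example.org` URLs, and the short user-agent name are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs offline
sample_robots = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(sample_robots)

user_agent = "HIST497-Student"
test_urls = [
    "https://example.org/entry/article.html",
    "https://example.org/private/drafts.html",
]

for url in test_urls:
    allowed = parser.can_fetch(user_agent, url)
    print(f"{url}: {'allowed' if allowed else 'disallowed'}")

# The requested pause between requests, if the site states one
print("Crawl delay:", parser.crawl_delay(user_agent))
```

For a live site, call `parser.set_url(...)` with the site's robots.txt address and then `parser.read()` instead of `parse()`. A stated crawl delay can be passed as the `delay` argument of `respectful_get`.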