# **Notebook 2: Web Scraping for Historians**

Welcome to web scraping! In this notebook, you'll learn to extract historical data from websites using Python. We'll work with real Canadian historical sources and build your skills step by step.

**What you'll learn:**
- How web pages are structured (HTML basics)
- Using Beautiful Soup to extract specific content
- Working with Canadian historical archives
- Building reusable code for research

**Ethics first:** Always respect websites. Check robots.txt files, don't overload servers, and respect copyright.

## Step 1: Setting Up Our Tools

Before we can scrape websites, we need to install and import the right libraries. Let's do this step by step.

**First, install the libraries:**

In [None]:
# Example: How to add delays between requests
# Note: this function calls requests.get(), so run the import cell below before using it
import time

def respectful_get(url, delay=1):
    """
    Make a web request with built-in delay for respectful scraping
    """
    print(f"⏳ Waiting {delay} second(s) before request...")
    time.sleep(delay)  # Wait before making request
    
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        print("✅ Request successful")
        return response
    except requests.exceptions.RequestException as e:
        print(f"❌ Request failed: {e}")
        return None

# We'll use this function for respectful scraping throughout the notebook
print("Respectful scraping function defined!")
print("💡 This adds delays between requests to be considerate to servers.")

## 🤝 Ethical Web Scraping Guidelines

Before we start scraping, let's understand the ethics and best practices:

**✅ Always:**
- Check robots.txt (append `/robots.txt` to the site's root URL, e.g. `https://archive.org/robots.txt`)
- Add delays between requests (don't overwhelm servers)
- Use reasonable timeouts
- Respect copyright and terms of service
- Identify yourself with User-Agent headers when appropriate

**❌ Never:**
- Scrape faster than a human could browse
- Ignore error messages or blocks
- Scrape personal/private information
- Violate website terms of service
- Overload servers with rapid requests

**📖 For historical research:**
- Many archives encourage responsible academic use
- Always cite your digital sources properly
- Consider contacting archives for bulk data access
- Respect cultural sensitivities in historical materials
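
The robots.txt check from the list above can also be done in code. Here is a minimal sketch using Python's standard-library `urllib.robotparser`. The rules, the `example.org` URLs, and the `history-notebook` user-agent name are all made up for illustration, and the rules are parsed from a string so nothing is downloaded. For a real site you would point the parser at the live file with `set_url()` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules (not copied from any real site)
sample_rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)  # parse the rules directly instead of fetching them

agent = "history-notebook"  # assumed name for our scraper
allowed_page = parser.can_fetch(agent, "https://example.org/details/some-item")
blocked_page = parser.can_fetch(agent, "https://example.org/private/notes")
delay = parser.crawl_delay(agent)

print(f"Public page allowed: {allowed_page}")
print(f"Private page allowed: {blocked_page}")
print(f"Requested delay between requests: {delay} second(s)")
```

If a site publishes a crawl delay, that value is a sensible choice for the `delay` argument of `respectful_get`.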

In [None]:
# This installs the libraries we need
!pip install requests beautifulsoup4 --quiet
print("Libraries installed successfully!")

**Now, import them so we can use them:**

Copy this code into the cell below:
```python
import requests
from bs4 import BeautifulSoup
```
If the cell finishes without an error, the import worked. The two-line version above prints nothing on success.

In [None]:
# Import the libraries we need for web scraping
import requests
from bs4 import BeautifulSoup

print("Libraries imported successfully!")
print("✅ requests: Downloads web pages")
print("✅ BeautifulSoup: Parses HTML content")

## Step 2: Your First Web Request

Let's start by downloading a web page. We'll use a Canadian historical source - a Toronto Public Library blog post about Jesuit Relations.

**Step 2a: Define the URL**

Copy this code:
```python
url = "https://torontopubliclibrary.typepad.com/local-history-genealogy/2020/01/sainte-marie-among-the-hurons-selections-from-the-jesuit-relations-and-allied-documents.html"
print(f"We're going to scrape: {url}")
```

In [None]:
# Define the URL for our Canadian historical source
url = "https://torontopubliclibrary.typepad.com/local-history-genealogy/2020/01/sainte-marie-among-the-hurons-selections-from-the-jesuit-relations-and-allied-documents.html"
print(f"We're going to scrape: {url}")
print("\n📚 This is a Toronto Public Library blog post about Jesuit Relations")
print("🇨🇦 Perfect for learning Canadian historical web scraping!")

**Step 2b: Download the page**

Now let's actually download the web page. The `requests.get()` function fetches the page for us.

Copy this code:
```python
response = requests.get(url, timeout=10)  # give up after 10 seconds instead of hanging
print(f"Status Code: {response.status_code}")  # 200 means success
print(f"Page downloaded! It contains {len(response.text)} characters.")
```

In [None]:
# Download the page with error handling
try:
    response = requests.get(url, timeout=10)  # 10 second timeout
    response.raise_for_status()  # Raises an exception for bad status codes
    print(f"✅ Success! Status Code: {response.status_code}")
    print(f"📄 Page downloaded! It contains {len(response.text):,} characters.")
except requests.exceptions.RequestException as e:
    print(f"❌ Error downloading page: {e}")
    print("💡 This could be due to:")
    print("   - No internet connection")
    print("   - Website is down")
    print("   - URL has changed")
    print("   - Server is blocking requests")

**Step 2c: Look at the raw HTML**

Let's see what we actually downloaded. Warning: it's going to look messy!
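
Looking at raw HTML comes down to printing a slice of the downloaded text. The sketch below shows the idea with a small helper; `preview` and `sample_html` are names invented for this example, and the sample string stands in for a real page so the code runs without a network connection. In the notebook you would pass `response.text` instead.

```python
def preview(text, limit=500):
    """Return the first `limit` characters of text, marking any truncation."""
    if len(text) <= limit:
        return text
    return text[:limit] + "..."

# Made-up HTML standing in for a downloaded page
sample_html = "<html><head><title>Sample</title></head><body><p>Hello</p></body></html>"

print("First 40 characters of raw HTML:")
print("-" * 50)
print(preview(sample_html, limit=40))
print("-" * 50)
```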

In [None]:
# Your adaptation exercise: Saskatchewan Internet Archive document
saskatchewan_url = "https://archive.org/details/saskatchewan00sask"

print("🍁 Trying Saskatchewan historical document...")
print(f"URL: {saskatchewan_url}")

try:
    # Use our respectful function
    sask_response = respectful_get(saskatchewan_url)
    
    if sask_response:
        print(f"✅ Status Code: {sask_response.status_code}")
        print(f"📄 Downloaded {len(sask_response.text):,} characters")
        
        # Show first 500 characters
        print("\nFirst 500 characters of raw HTML:")
        print("-" * 50)
        print(sask_response.text[:500])
        print("-" * 50)
        print("💡 Notice how different this Internet Archive page looks compared to the blog!")
    else:
        print("❌ Failed to download Saskatchewan document")

except Exception as e:
    print(f"❌ Error: {e}")
    print("💡 If this fails, it might be due to network issues or site changes")

### 🔄 **Your Turn: Adaptation Exercise**

Now try the same steps with a different Canadian source. Copy the code from above and modify it to use this Internet Archive document about Saskatchewan:

```python
url = "https://archive.org/details/saskatchewan00sask"
```

Follow the same steps: define the URL, download the page, check the status code, and look at the first 500 characters.

In [None]:
# Your adaptation exercise here
# 1. Define the Saskatchewan URL
# 2. Download the page with requests.get()
# 3. Print the status code and character count
# 4. Show first 500 characters


## Step 3: Making Sense of HTML with Beautiful Soup

Raw HTML is hard to work with. Beautiful Soup parses it and makes it easy to extract what we need.

**Step 3a: Create your first "soup"**

Let's go back to our TPL blog post and parse it properly:

Copy this code:
```python
# Go back to the TPL blog post
url = "https://torontopubliclibrary.typepad.com/local-history-genealogy/2020/01/sainte-marie-among-the-hurons-selections-from-the-jesuit-relations-and-allied-documents.html"
response = requests.get(url, timeout=10)  # give up after 10 seconds instead of hanging

# Create the soup object
soup = BeautifulSoup(response.text, 'html.parser')
print("Soup object created! Now we can easily extract content.")
```

In [None]:
# Create soup object with validation
try:
    # Use our respectful function
    response = respectful_get(url)
    
    if response:
        # Create the soup object
        soup = BeautifulSoup(response.text, 'html.parser')
        print("✅ Soup object created successfully!")

        # Validate we got HTML content
        if soup.find('html'):
            print("✅ Valid HTML structure detected")
        else:
            print("⚠️  Warning: No HTML structure found - might be plain text")

        # Check if we got the expected content
        if len(soup.get_text().strip()) > 100:
            print(f"✅ Content validation: {len(soup.get_text().strip()):,} characters of text")
        else:
            print("⚠️  Warning: Very little text content found")
    else:
        print("❌ Could not create soup - request failed")

except Exception as e:
    print(f"❌ Error creating soup: {e}")

**Step 3b: Extract the page title**

Let's start simple by getting the page title. In HTML, the title is in `<title>` tags.

Copy this code:
```python
title = soup.find('title')
print(f"Page title: {title.get_text()}")
```

In [None]:
# Extract title with validation
try:
    title_tag = soup.find('title')
    
    if title_tag:
        title_text = title_tag.get_text().strip()
        print(f"✅ Page title found: {title_text}")

        # Validate the title makes sense
        if len(title_text) > 5:
            print("✅ Title validation: Reasonable length")
        else:
            print("⚠️  Warning: Title seems very short")

        # Check if it's related to our expected content
        if any(keyword in title_text.lower() for keyword in ['jesuit', 'sainte-marie', 'huron', 'toronto']):
            print("✅ Content validation: Title matches expected historical topic")
        else:
            print("⚠️  Warning: Title doesn't match expected content - check URL")
    else:
        print("❌ No title tag found")
        print("💡 This might mean:")
        print("   - Page structure is different than expected")
        print("   - We got redirected to a different page")
        print("   - Page failed to load properly")

except Exception as e:
    print(f"❌ Error extracting title: {e}")

**Step 3c: Get clean text (no HTML tags)**

The `.get_text()` method removes all HTML tags and gives us just the readable content.
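
By default `.get_text()` joins the text of neighbouring tags with nothing in between, so headings and paragraphs can run together. A small sketch of the difference, using a made-up HTML fragment (`sample_html` and `sample_soup` are names chosen for this example so they don't overwrite the notebook's `soup`):

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment standing in for a real page
sample_html = "<div><h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Default: text from adjacent tags is concatenated directly
run_together = sample_soup.get_text()

# separator puts a newline between pieces; strip=True trims each piece
one_per_line = sample_soup.get_text(separator="\n", strip=True)

print(repr(run_together))
print(repr(one_per_line))
```

The `separator` and `strip` arguments are often worth using when the text will be read or searched later, since word boundaries between elements are preserved.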

In [None]:
# Extract all text content without HTML tags
clean_text = soup.get_text()

print(f"Clean text length: {len(clean_text)} characters")
print("\nFirst 500 characters of clean text:")
print(clean_text[:500])
print("\nMuch better! Now we can read the actual content.")

### 🔄 **Your Turn: Practice with Internet Archive**

Now apply the same Beautiful Soup steps to the Saskatchewan document. Copy and adapt the code above:

1. Create a soup object from the Saskatchewan URL
2. Extract and print the title
3. Get the clean text and show the first 500 characters

In [None]:
# Your practice: Apply Beautiful Soup to Saskatchewan document
print("🍁 Analyzing Saskatchewan document with Beautiful Soup...")

try:
    # Create soup object for Saskatchewan document
    sask_soup = BeautifulSoup(sask_response.text, 'html.parser')
    print("✅ Saskatchewan soup created!")
    
    # Extract and print the title
    sask_title = sask_soup.find('title')
    if sask_title:
        title_text = sask_title.get_text().strip()
        print(f"📚 Title: {title_text}")

        # Validate it's an Internet Archive page
        if 'archive.org' in title_text.lower() or 'saskatchewan' in title_text.lower():
            print("✅ Confirmed: This is the Saskatchewan document")
        else:
            print("⚠️  Title doesn't match expectations")
    else:
        print("❌ No title found")
    
    # Get clean text and show first 500 characters
    clean_text = sask_soup.get_text()
    print(f"\n📄 Clean text length: {len(clean_text):,} characters")
    print("\nFirst 500 characters of clean text:")
    print("-" * 50)
    print(clean_text[:500])
    print("-" * 50)
    print("💡 Much more readable than raw HTML!")

except NameError:
    print("❌ Saskatchewan response not available - run the previous exercise first")
except Exception as e:
    print(f"❌ Error: {e}")

## Step 4: Targeting Specific Content

Getting all the text is useful, but often we want specific parts. Let's learn to target particular HTML elements.

**Step 4a: Find all paragraphs**

Blog posts organize content in paragraphs (`<p>` tags). Let's find them:

Copy this code:
```python
# Go back to our TPL blog soup
url = "https://torontopubliclibrary.typepad.com/local-history-genealogy/2020/01/sainte-marie-among-the-hurons-selections-from-the-jesuit-relations-and-allied-documents.html"
response = requests.get(url, timeout=10)  # give up after 10 seconds instead of hanging
soup = BeautifulSoup(response.text, 'html.parser')

# Find all paragraph tags
paragraphs = soup.find_all('p')
print(f"Found {len(paragraphs)} paragraphs")
```

In [None]:
# Find all paragraphs with validation
try:
    # Make sure we have our soup object
    if 'soup' not in locals():
        print("❌ Soup object not found - please run the previous Beautiful Soup cells first")
    else:
        # Find all paragraph tags
        paragraphs = soup.find_all('p')
        print(f"📝 Found {len(paragraphs)} paragraph tags")

        # Validate we found reasonable content
        if len(paragraphs) > 0:
            print("✅ Paragraph extraction successful")
            
            # Show some examples
            print(f"\nAnalyzing first 3 paragraphs:")
            for i in range(min(3, len(paragraphs))):
                para_text = paragraphs[i].get_text().strip()
                if para_text:  # Only show non-empty paragraphs
                    print(f"\nParagraph {i+1} ({len(para_text)} chars):")
                    print(f"'{para_text[:100]}{'...' if len(para_text) > 100 else ''}'")
                else:
                    print(f"\nParagraph {i+1}: (empty)")
        else:
            print("⚠️  Warning: No paragraphs found")
            print("💡 This might mean:")
            print("   - Page uses different HTML structure")
            print("   - Content is in different tags (div, article, etc.)")
            print("   - Page didn't load properly")

except Exception as e:
    print(f"❌ Error finding paragraphs: {e}")

**Step 4b: Look at individual paragraphs**

Let's examine a few paragraphs to see what we're working with:

In [None]:
# Look at the first few paragraphs
print("First 3 paragraphs:")
for i, para in enumerate(paragraphs[:3], 1):
    para_text = para.get_text().strip()
    # Only add "..." when the text was actually cut off
    suffix = "..." if len(para_text) > 100 else ""
    print(f"\nParagraph {i}: {para_text[:100]}{suffix}")

**Step 4c: Filter for substantial paragraphs**

Many paragraphs are short or empty. Let's filter for substantial ones (more than 50 characters):

Copy this code:
```python
# Filter for paragraphs with meaningful content
substantial_paras = []
for para in paragraphs:
    text = para.get_text().strip()
    if len(text) > 50:  # Only paragraphs with more than 50 characters
        substantial_paras.append(text)

print(f"Substantial paragraphs: {len(substantial_paras)}")
if substantial_paras:  # avoid an IndexError when nothing passed the filter
    print("\nFirst substantial paragraph:")
    print(substantial_paras[0])
```

In [None]:
# Filter for substantial paragraphs with validation
try:
    if 'paragraphs' not in locals():
        print("❌ Paragraphs not found - run the previous cell first")
    else:
        # Filter for paragraphs with meaningful content
        substantial_paras = []
        empty_count = 0
        
        for para in paragraphs:
            text = para.get_text().strip()
            if len(text) > 50:  # Only paragraphs with more than 50 characters
                substantial_paras.append(text)
            elif len(text) == 0:
                empty_count += 1
        
        print("📊 Paragraph Analysis:")
        print(f"   Total paragraphs found: {len(paragraphs)}")
        print(f"   Substantial paragraphs (>50 chars): {len(substantial_paras)}")
        print(f"   Empty paragraphs: {empty_count}")
        print(f"   Short paragraphs (1-50 chars): {len(paragraphs) - len(substantial_paras) - empty_count}")

        if substantial_paras:
            print("\n📖 First substantial paragraph:")
            print("-" * 60)
            print(substantial_paras[0])
            print("-" * 60)
            
            # Validate content quality
            first_para = substantial_paras[0].lower()
            if any(word in first_para for word in ['jesuit', 'huron', 'sainte-marie', 'canada', 'history']):
                print("✅ Content validation: Found expected historical keywords")
            else:
                print("⚠️  Content note: No obvious historical keywords found")
        else:
            print("❌ No substantial paragraphs found")
            print("💡 This might mean the content is structured differently")

except Exception as e:
    print(f"❌ Error filtering paragraphs: {e}")

### 🔄 **Your Turn: Find Historical Quotes**

Historical blog posts often include quotes in special `<blockquote>` tags. Adapt the code above to:

1. Find all blockquote elements using `soup.find_all('blockquote')`
2. Extract their text and print them
3. Count how many historical quotes you found

Use the same TPL blog post soup object.

In [None]:
# Your exercise: Find historical quotes in blockquotes
try:
    if 'soup' not in locals():
        print("❌ Soup object not found - run the Beautiful Soup creation cells first")
    else:
        # Find all blockquote elements
        blockquotes = soup.find_all('blockquote')
        print(f"📜 Found {len(blockquotes)} blockquote elements")

        if blockquotes:
            print("✅ Historical quotes found!")
            print("\n🔍 Analyzing each quote:")
            
            valid_quotes = []
            for i, quote in enumerate(blockquotes, 1):
                quote_text = quote.get_text().strip()
                
                if quote_text and len(quote_text) > 10:  # Filter out very short quotes
                    valid_quotes.append(quote_text)
                    print(f"\nQuote {i} ({len(quote_text)} characters):")
                    print("-" * 50)
                    # Show first 200 characters of quote
                    display_text = quote_text[:200] + "..." if len(quote_text) > 200 else quote_text
                    print(display_text)
                    print("-" * 50)
                    
                    # Check for historical indicators
                    quote_lower = quote_text.lower()
                    historical_indicators = ['jesuit', 'huron', 'savage', 'father', 'lord', 'god', '16', '17']
                    found_indicators = [ind for ind in historical_indicators if ind in quote_lower]
                    
                    if found_indicators:
                        print(f"📚 Historical indicators found: {', '.join(found_indicators)}")
                    else:
                        print("📝 No obvious historical indicators")
                else:
                    print(f"Quote {i}: (too short or empty)")

            print("\n📊 Summary:")
            print(f"   Total blockquotes: {len(blockquotes)}")
            print(f"   Valid historical quotes: {len(valid_quotes)}")
            
            if valid_quotes:
                avg_length = sum(len(q) for q in valid_quotes) / len(valid_quotes)
                print(f"   Average quote length: {avg_length:.0f} characters")
            
        else:
            print("❌ No blockquotes found on this page")
            print("💡 This might mean:")
            print("   - This blog doesn't use blockquotes for quotes")
            print("   - Quotes might be in other tags (div, p with special classes)")
            print("   - Page structure is different than expected")

except Exception as e:
    print(f"❌ Error finding quotes: {e}")

## Step 5: Working with Links

Historical sources often link to primary documents. Let's learn to extract and analyze links.

**Step 5a: Find all links**

Links are in `<a>` tags with `href` attributes. Let's find them:
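
An `href` value is not always a full address: it can be relative to the current page, relative to the site root, or just an anchor within the page. A short sketch of turning such values into absolute URLs with the standard-library `urllib.parse.urljoin`; the base URL and the sample `href` values are invented for illustration.

```python
from urllib.parse import urljoin

base_url = "https://example.org/blog/2020/post.html"  # hypothetical page address
sample_hrefs = [
    "/about",                                  # relative to the site root
    "../images/map.jpg",                       # relative to the current folder
    "#notes",                                  # anchor on the same page
    "https://archive.org/details/some-item",   # already absolute (made-up item)
]

absolute_urls = [urljoin(base_url, href) for href in sample_hrefs]
for href, absolute in zip(sample_hrefs, absolute_urls):
    print(f"{href!r} -> {absolute}")
```

Resolving links this way before classifying them means every URL can be compared by its domain, rather than treating everything that does not start with `http` as one group.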

In [None]:
# Find all links with validation
try:
    if 'soup' not in locals():
        print("❌ Soup object not found - run the Beautiful Soup creation cells first")
    else:
        # Find all links with href attributes
        all_links = soup.find_all('a', href=True)
        print(f"🔗 Found {len(all_links)} links with href attributes")

        if all_links:
            print("✅ Link extraction successful")
            
            # Analyze link types
            internal_links = 0
            external_links = 0
            archive_links = 0
            
            print("\n🔍 Analyzing first 5 links:")
            for i, link in enumerate(all_links[:5]):
                link_text = link.get_text().strip()
                link_url = link.get('href')
                
                # Classify link type
                if link_url.startswith('http'):
                    if 'torontopubliclibrary' in link_url:
                        link_type = "Internal"
                        internal_links += 1
                    elif 'archive.org' in link_url:
                        link_type = "Archive"
                        archive_links += 1
                    else:
                        link_type = "External"
                        external_links += 1
                else:
                    link_type = "Relative"
                    internal_links += 1
                
                print(f"\n{i+1}. [{link_type}] '{link_text[:50]}{'...' if len(link_text) > 50 else ''}'")
                print(f"    URL: {link_url[:80]}{'...' if len(link_url) > 80 else ''}")
            
            print("\n📊 Link Classification (first 5):")
            print(f"   Internal/Relative: {internal_links}")
            print(f"   External: {external_links}")
            print(f"   Archive links: {archive_links}")
            
        else:
            print("❌ No links with href found")
            print("💡 This might mean:")
            print("   - Page has no links")
            print("   - Links use different attributes")
            print("   - Page didn't load properly")
            
except Exception as e:
    print(f"❌ Error finding links: {e}")

**Step 5b: Filter for historical document links**

Let's find links that point to historical archives or documents:

Copy this code:
```python
# Define domains that often contain historical documents
historical_domains = ['archive.org', 'canadiana.ca', 'gutenberg.org', 'biographi.ca']

# Filter links
document_links = []
for link in all_links:
    href = link.get('href', '')
    # Check if any historical domain is in the URL
    for domain in historical_domains:
        if domain in href:
            document_links.append({
                'text': link.get_text().strip(),
                'url': href
            })
            break  # Don't add the same link twice

print(f"Historical document links found: {len(document_links)}")
```
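
The `domain in href` test is a plain substring check, so it also matches when a domain name merely appears somewhere else in the URL, such as in a query string. One possible stricter variant compares against the host name only, using the standard-library `urllib.parse.urlparse`. The helper name `matches_domain` and the test URLs are assumptions made for this sketch.

```python
from urllib.parse import urlparse

historical_domains = ['archive.org', 'canadiana.ca', 'gutenberg.org', 'biographi.ca']

def matches_domain(url, domains):
    """Return the first domain that matches the URL's host name, else None."""
    host = (urlparse(url).hostname or "").lower()
    for domain in domains:
        # Accept the domain itself or any subdomain such as www.biographi.ca
        if host == domain or host.endswith("." + domain):
            return domain
    return None

# Hypothetical URLs for illustration
print(matches_domain("https://archive.org/details/some-item", historical_domains))
print(matches_domain("https://www.biographi.ca/en/bio/example", historical_domains))
print(matches_domain("https://example.com/?ref=archive.org", historical_domains))
print(matches_domain("/relative/path", historical_domains))
```

Relative links have no host name, so they return `None` here; resolve them to absolute URLs first if they should be checked too.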

In [None]:
# Filter for historical document links with validation
try:
    if 'all_links' not in locals():
        print("❌ Links not found - run the previous link extraction cell first")
    else:
        # Define domains that often contain historical documents
        historical_domains = ['archive.org', 'canadiana.ca', 'gutenberg.org', 'biographi.ca', 'bac-lac.gc.ca']
        
        # Filter links
        document_links = []
        domain_counts = {}
        
        for link in all_links:
            href = link.get('href', '')
            link_text = link.get_text().strip()
            
            # Check if any historical domain is in the URL
            for domain in historical_domains:
                if domain in href:
                    document_links.append({
                        'text': link_text,
                        'url': href,
                        'domain': domain
                    })
                    
                    # Count by domain
                    domain_counts[domain] = domain_counts.get(domain, 0) + 1
                    break  # Don't add the same link twice
        
        print("🏛️ Historical document analysis:")
        print(f"   Total links checked: {len(all_links)}")
        print(f"   Historical document links found: {len(document_links)}")

        if document_links:
            print("\n📊 By domain:")
            for domain, count in domain_counts.items():
                print(f"   {domain}: {count} links")

            print("\n📚 Historical document links:")
            for i, link in enumerate(document_links[:5], 1):  # Show first 5
                print(f"\n{i}. {link['text'][:60]}{'...' if len(link['text']) > 60 else ''}")
                print(f"   Domain: {link['domain']}")
                print(f"   URL: {link['url'][:80]}{'...' if len(link['url']) > 80 else ''}")
            
            if len(document_links) > 5:
                print(f"\n... and {len(document_links) - 5} more historical links")
                
            # Validate link quality
            valid_links = [link for link in document_links if link['text'] and len(link['text']) > 3]
            print(f"\n✅ Quality check: {len(valid_links)}/{len(document_links)} links have meaningful text")

        else:
            print("❌ No historical document links found")
            print(f"💡 Searched for these domains: {', '.join(historical_domains)}")
            print("   This might mean:")
            print("   - This page doesn't link to major historical archives")
            print("   - Links use different URL structures")
            print("   - Need to add more domain patterns")
            
except Exception as e:
    print(f"❌ Error filtering historical links: {e}")

**Step 5c: Display the historical links**

In [None]:
# Show the historical document links we found
print("Historical Document Links:")
print("=" * 50)

for i, link in enumerate(document_links, 1):
    print(f"{i}. {link['text']}")
    print(f"   URL: {link['url']}")
    print()

### 🔄 **Your Turn: Find PDF Links**

Many historical documents are available as PDFs. Adapt the link-finding code to:

1. Find all links that contain ".pdf" in their href
2. Store them in a list called `pdf_links`
3. Print how many PDF links you found
4. Display the first 3 PDF links

Hint: Use `if '.pdf' in href:` to check for PDF links.
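
The hint's substring check is a good starting point, but it misses upper-case extensions and also matches addresses like `report.pdf.html`. A slightly stricter sketch looks only at the path part of the URL, using the standard-library `urllib.parse.urlparse`. The helper name `looks_like_pdf` and the sample links are made up for illustration.

```python
from urllib.parse import urlparse

def looks_like_pdf(href):
    """True when the path part of the URL ends in .pdf (case-insensitive)."""
    return urlparse(href).path.lower().endswith(".pdf")

# Hypothetical href values
sample_hrefs = [
    "https://example.org/docs/report.PDF?download=1",  # query string is ignored
    "reports/1885.pdf#page=3",                         # fragment is ignored
    "https://example.org/report.pdf.html",             # contains .pdf but is a web page
]

pdf_flags = [looks_like_pdf(href) for href in sample_hrefs]
for href, flag in zip(sample_hrefs, pdf_flags):
    print(f"{flag!s:5} {href}")
```

This still cannot detect PDFs served from addresses without a `.pdf` path (for example a download script), so treat it as a filter for likely candidates rather than a guarantee.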

In [None]:
# Your exercise: Find PDF links
try:
    if 'all_links' not in locals():
        print("❌ Links not found - run the previous link extraction cell first")
    else:
        # Find all links that contain ".pdf" in their href
        pdf_links = []
        
        for link in all_links:
            href = link.get('href', '')
            link_text = link.get_text().strip()
            
            if '.pdf' in href.lower():
                pdf_links.append({
                    'text': link_text,
                    'url': href
                })
        
        print("📄 PDF Link Analysis:")
        print(f"   Total links checked: {len(all_links)}")
        print(f"   PDF links found: {len(pdf_links)}")

        if pdf_links:
            print(f"\n✅ Found {len(pdf_links)} PDF documents!")

            print("\n📋 First 3 PDF links:")
            for i, pdf in enumerate(pdf_links[:3], 1):
                print(f"\n{i}. '{pdf['text'][:60]}{'...' if len(pdf['text']) > 60 else ''}'")
                print(f"   URL: {pdf['url']}")
                
                # Analyze URL for file type validation
                if pdf['url'].lower().endswith('.pdf'):
                    print("   ✅ Direct PDF link")
                else:
                    print("   ⚠️  URL contains .pdf but doesn't end with .pdf")
            
            if len(pdf_links) > 3:
                print(f"\n... and {len(pdf_links) - 3} more PDF links")
                
            # Quality validation
            valid_pdfs = [pdf for pdf in pdf_links if pdf['text'] and len(pdf['text']) > 3]
            print(f"\n📊 Quality check: {len(valid_pdfs)}/{len(pdf_links)} PDF links have meaningful text")

        else:
            print("❌ No PDF links found on this page")
            print("💡 This might mean:")
            print("   - Page doesn't link to PDF documents")
            print("   - PDFs are embedded differently")
            print("   - Links use different file extensions")
            print("   - Try looking for links with 'download', 'document', or 'file' in text")
            
            # Alternative search
            print("\n🔍 Searching for potential document links...")
            doc_keywords = ['download', 'document', 'file', 'report', 'manuscript']
            potential_docs = []
            
            for link in all_links:
                link_text = link.get_text().lower()
                if any(keyword in link_text for keyword in doc_keywords):
                    potential_docs.append(link.get_text().strip())
            
            if potential_docs:
                print(f"   Found {len(potential_docs)} links with document keywords:")
                for doc in potential_docs[:3]:
                    print(f"     - {doc[:60]}{'...' if len(doc) > 60 else ''}")
            else:
                print("   No obvious document links found")
                
except Exception as e:
    print(f"❌ Error finding PDF links: {e}")

## Step 6: Internet Archive Metadata

Internet Archive documents have rich metadata. Let's learn to extract it systematically.

**Step 6a: Get an Internet Archive page**

Let's work with our Saskatchewan document:

In [None]:
# Get Internet Archive page with validation
ia_url = "https://archive.org/details/saskatchewan00sask"

print("🏛️ Working with Internet Archive metadata...")
print(f"URL: {ia_url}")

try:
    # Use our respectful function with a bit longer delay for IA
    ia_response = respectful_get(ia_url, delay=2)
    
    if ia_response:
        print(f"✅ Status Code: {ia_response.status_code}")
        print("✅ Internet Archive page loaded successfully!")

        # Basic validation
        if 'archive.org' in ia_response.url:
            print("✅ Confirmed: This is an Internet Archive page")
        else:
            print("⚠️  Warning: Response URL doesn't match Internet Archive")

        # Check content size
        content_size = len(ia_response.text)
        print(f"📄 Content size: {content_size:,} characters")

        if content_size > 10000:
            print("✅ Substantial content received")
        else:
            print("⚠️  Warning: Less content than expected")

    else:
        print("❌ Failed to load Internet Archive page")
        print("💡 Possible issues:")
        print("   - Internet Archive servers busy")
        print("   - Network connectivity issues")
        print("   - Document no longer available")
        
except Exception as e:
    print(f"❌ Error loading Internet Archive page: {e}")

**Step 6b: Extract the document title**

Internet Archive puts document titles in `<h1>` tags:

Copy this code:
```python
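# Parse the Internet Archive page first (ia_response comes from Step 6a)
ia_soup = BeautifulSoup(ia_response.text, 'html.parser')

title = ia_soup.find('h1')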
if title:
    document_title = title.get_text().strip()
    print(f"Document Title: {document_title}")
else:
    print("No title found")
```

In [None]:
# Extract Internet Archive title with validation
try:
    if 'ia_response' not in locals() or not ia_response:
        print("❌ Internet Archive response not available - run the previous cell first")
    else:
        # Create soup for Internet Archive page
        ia_soup = BeautifulSoup(ia_response.text, 'html.parser')
        print("✅ Internet Archive soup created")
        
        # Find title - IA uses h1 for main document title
        title_tag = ia_soup.find('h1')
        
        if title_tag:
            document_title = title_tag.get_text().strip()
            print(f"üìö Document Title: {document_title}")
            
            # Validate title quality
            if len(document_title) > 5:
                print("‚úÖ Title validation: Reasonable length")
            else:
                print("‚ö†Ô∏è  Warning: Title seems very short")
                
            # Check for expected keywords
            title_lower = document_title.lower()
            if 'saskatchewan' in title_lower:
                print("✅ Content validation: Saskatchewan document confirmed")
            else:
                print("⚠️  Content note: Title doesn't contain 'Saskatchewan'")
                print(f"   This might be normal - could be a more specific title")

            # Check for historical time indicators
            historical_indicators = ['history', 'historical', '19', '18', 'century']
            found_indicators = [ind for ind in historical_indicators if ind in title_lower]
            if found_indicators:
                print(f"📖 Historical indicators: {', '.join(found_indicators)}")
            
        else:
            print("❌ No h1 title tag found")
            print("💡 Trying alternative title methods...")
            
            # Try alternative title extraction
            title_alternatives = [
                ia_soup.find('title'),  # HTML title tag
                ia_soup.find('h2'),     # Secondary heading
                ia_soup.find('div', class_='item-title')  # IA specific class
            ]
            
            for alt_title in title_alternatives:
                if alt_title:
                    alt_text = alt_title.get_text().strip()
                    if alt_text and len(alt_text) > 5:
                        print(f"📚 Alternative title found: {alt_text}")
                        break
            else:
                print("❌ No alternative titles found")

except Exception as e:
    print(f"❌ Error extracting title: {e}")
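
The fallback logic in the cell above can be practised offline before pointing it at a live page. Here is a minimal sketch that wraps the same idea in a small function. The function name `first_title` and the sample HTML are made up for illustration and are not taken from the real Internet Archive page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; note that it has no <h1>
sample_html = """
<html>
  <head><title>Saskatchewan : Internet Archive</title></head>
  <body><h2>About</h2></body>
</html>
"""

def first_title(soup, min_length=5):
    """Return the text of the first usable title-like tag, or None."""
    # Same order of preference as the cell above: h1, then <title>, then h2
    candidates = [soup.find('h1'), soup.find('title'), soup.find('h2')]
    for tag in candidates:
        if tag is None:
            continue
        text = tag.get_text(strip=True)
        if len(text) > min_length:
            return text
    return None

sample_soup = BeautifulSoup(sample_html, 'html.parser')
print(first_title(sample_soup))
```

Because the function returns `None` instead of printing, you can reuse it in a loop over many pages and decide for yourself what to do when no title is found.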

**Step 6c: Find metadata fields**

Internet Archive uses `<dt>` (definition term) and `<dd>` (definition description) tags for metadata:

Copy this code:
```python
# Find metadata terms and values
metadata_terms = ia_soup.find_all('dt')
metadata_values = ia_soup.find_all('dd')

print(f"Metadata fields found: {len(metadata_terms)}")
print(f"Metadata values found: {len(metadata_values)}")
```

**Step 6d: Extract and display metadata**

In [None]:
# Create a dictionary to store metadata
metadata = {}

# Pair up terms and values
for term, value in zip(metadata_terms, metadata_values):
    term_text = term.get_text().strip()
    value_text = value.get_text().strip()
    metadata[term_text] = value_text

# Display key metadata
print("Document Metadata:")
print("=" * 40)

# Show specific metadata fields we care about
important_fields = ['by', 'Publication date', 'Topics', 'Language']

for field in important_fields:
    if field in metadata:
        value = metadata[field]
        # Truncate very long values
        if len(value) > 100:
            value = value[:100] + "..."
        print(f"{field}: {value}")
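
Pairing terms and values with `zip()` assumes every `<dt>` has exactly one `<dd>`. When a term has two values, or none, the pairs drift out of step. A hedged alternative is to walk from each `<dt>` to the `<dd>` tags that follow it. The sketch below uses invented sample markup and a hypothetical helper name, `pair_metadata`; real pages may be structured differently:

```python
from bs4 import BeautifulSoup

# Invented sample markup: "Topics" has two <dd> values
sample_html = """
<dl>
  <dt>by</dt><dd>Example Author</dd>
  <dt>Topics</dt><dd>Agriculture</dd><dd>Prairie history</dd>
  <dt>Language</dt><dd>English</dd>
</dl>
"""

def pair_metadata(soup):
    """Map each <dt> to the <dd> siblings that follow it, up to the next <dt>."""
    metadata = {}
    for dt in soup.find_all('dt'):
        values = []
        for sibling in dt.find_next_siblings():
            if sibling.name == 'dt':
                break  # reached the next term, stop collecting
            if sibling.name == 'dd':
                values.append(sibling.get_text(strip=True))
        metadata[dt.get_text(strip=True)] = '; '.join(values)
    return metadata

sample_soup = BeautifulSoup(sample_html, 'html.parser')
for field, value in pair_metadata(sample_soup).items():
    print(f"{field}: {value}")
```

With `zip()`, the sample above would pair "Language" with "Prairie history". The sibling walk keeps each value attached to its own term.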

### 🔄 **Your Turn: Create a Metadata Function**

Now create a reusable function that extracts metadata from any Internet Archive document. Fill in the missing parts:

```python
def extract_ia_metadata(url):
    """Extract title and metadata from an Internet Archive document"""
    # 1. Get the page with requests.get()
    # 2. Create soup with BeautifulSoup
    # 3. Extract title from h1 tag
    # 4. Extract metadata from dt/dd tags
    # 5. Return a dictionary with title and metadata
```

Test it with the University of Toronto annual report: `https://archive.org/details/annualreport191920nivuoft`

In [None]:
# Create a reusable Internet Archive metadata function
def extract_ia_metadata(url):
    """Extract title and metadata from an Internet Archive document"""
    try:
        print(f"🔍 Analyzing Internet Archive URL: {url}")

        # Extract item ID from URL, ignoring any trailing path or query string
        if '/details/' in url:
            item_id = url.split('/details/')[-1].split('/')[0].split('?')[0]
            print(f"📋 Item ID extracted: {item_id}")
        else:
            return {'error': 'Invalid Internet Archive URL format'}
        
        # Method 1: Try Internet Archive Python library (preferred)
        try:
            import internetarchive as ia
            item = ia.get_item(item_id)
            
            # Extract key metadata
            metadata = {
                'method': 'IA Python Library',
                'title': item.metadata.get('title', 'No title'),
                'creator': item.metadata.get('creator', 'No creator'),
                'date': item.metadata.get('date', 'No date'),
                'subject': item.metadata.get('subject', 'No subject'),
                # str() guards against descriptions that arrive as a list rather than a string
                'description': (str(item.metadata['description'])[:200] + '...') if item.metadata.get('description') else 'No description',
                'language': item.metadata.get('language', 'No language'),
                'files_count': len(list(item.files))
            }
            
            print("✅ Successfully extracted metadata using IA library")
            return metadata

        except ImportError:
            print("⚠️  IA library not available, falling back to web scraping...")
        except Exception as e:
            print(f"⚠️  IA library failed ({e}), falling back to web scraping...")
        
        # Method 2: Fallback to web scraping
        response = respectful_get(url, delay=2)
        if not response:
            return {'error': 'Could not download page'}
            
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Extract metadata from HTML
        title_tag = soup.find('h1')
        title = title_tag.get_text().strip() if title_tag else 'No title found'
        
        # Try to find metadata fields
        metadata_dict = {'method': 'Web Scraping', 'title': title}
        
        # Look for dt/dd metadata pairs
        dt_tags = soup.find_all('dt')
        dd_tags = soup.find_all('dd')
        
        for dt, dd in zip(dt_tags, dd_tags):
            key = dt.get_text().strip().lower()
            value = dd.get_text().strip()
            
            # Map common fields
            if 'by' in key or 'creator' in key:
                metadata_dict['creator'] = value
            elif 'date' in key:
                metadata_dict['date'] = value
            elif 'topic' in key or 'subject' in key:
                metadata_dict['subject'] = value
            elif 'language' in key:
                metadata_dict['language'] = value
        
        print("✅ Successfully extracted metadata using web scraping")
        return metadata_dict

    except Exception as e:
        print(f"❌ Error extracting metadata: {e}")
        return {'error': str(e)}

# Test the function with University of Toronto annual report
uoft_url = "https://archive.org/details/annualreport191920nivuoft"
print("🎓 Testing with University of Toronto Annual Report 1919-20...")

result = extract_ia_metadata(uoft_url)

print(f"\n📊 Extracted Metadata:")
print("=" * 50)
for key, value in result.items():
    print(f"{key.title()}: {value}")

print(f"\n💡 This function demonstrates two approaches:")
print(f"   1. Internet Archive Python library (preferred)")
print(f"   2. Web scraping with Beautiful Soup (fallback)")
print(f"   Real research projects should use both for robustness!")
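
Item URLs copied from a browser often carry extra pieces after the identifier, such as `/page/n5/mode/2up` or `?view=theater`. Here is a minimal sketch of pulling out just the identifier with the standard library's `urllib.parse`. The helper name `ia_item_id` is hypothetical, and the second URL is an invented variation on the Saskatchewan address used earlier:

```python
from urllib.parse import urlparse

def ia_item_id(url):
    """Return the identifier from an archive.org /details/ URL, or None."""
    # Split the path into its non-empty segments, e.g. ['details', 'some-id', 'page', 'n5']
    parts = [segment for segment in urlparse(url).path.split('/') if segment]
    if 'details' in parts:
        position = parts.index('details')
        if position + 1 < len(parts):
            return parts[position + 1]
    return None

test_urls = [
    "https://archive.org/details/annualreport191920nivuoft",
    "https://archive.org/details/saskatchewan00sask/page/n5/mode/2up?view=theater",
    "https://archive.org/about",
]
for test_url in test_urls:
    print(f"{test_url} -> {ia_item_id(test_url)}")
```

Returning `None` for URLs without an identifier lets the calling code decide whether to skip the URL or report an error.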

## Step 7: Final Challenge - Your Research Project

Time to put it all together! Choose a Canadian historical source and conduct your own analysis.

**Available sources:**
- TPL Jesuit Relations blog: Research the historical quotes and themes
- Saskatchewan History: Analyze the metadata and publication context
- UofT Annual Report 1919-20: Extract institutional information

**Your research steps:**
1. Choose a source and research question
2. Use the scraping techniques you've learned
3. Extract specific information related to your question
4. Present your findings

In [None]:
# Using Internet Archive library - much more efficient!
print("🏛️ Accessing Internet Archive with the Python library...")

try:
    # Imported here so this cell also runs on its own (needs: pip install internetarchive)
    import internetarchive as ia

    # Get the Saskatchewan document using its identifier
    item_id = "saskatchewan00sask"  # The ID from the URL

    # Get the item (this is much faster than web scraping)
    item = ia.get_item(item_id)

    print(f"✅ Successfully retrieved item: {item_id}")
    print(f"📚 Title: {item.metadata.get('title', 'No title')}")
    print(f"📅 Date: {item.metadata.get('date', 'No date')}")
    print(f"👤 Creator: {item.metadata.get('creator', 'No creator')}")
    print(f"📖 Subject: {item.metadata.get('subject', 'No subject')}")
    
    # Show available files
    files = list(item.files)
    print(f"\n📁 Available files: {len(files)}")
    
    # Show first few files
    for i, file in enumerate(files[:5]):
        file_name = file.get('name', 'Unknown')
        file_format = file.get('format', 'Unknown')
        file_size = file.get('size', 'Unknown')
        print(f"   {i+1}. {file_name} ({file_format}) - {file_size} bytes")
    
    if len(files) > 5:
        print(f"   ... and {len(files) - 5} more files")
        
    print(f"\n🚀 Much easier than HTML scraping!")
    print(f"💡 The IA library gives us clean, structured data instantly")

except Exception as e:
    print(f"❌ Error accessing Internet Archive: {e}")
    print("💡 This might be due to:")
    print("   - Network connectivity issues")
    print("   - Item ID changed or removed")
    print("   - Internet Archive servers busy")
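
One thing to watch for when printing fields like `subject`: a metadata value may come back as a single string for one item and as a list of strings for another, so the output can look inconsistent. A small sketch of a normalizing helper follows. The name `as_list` and the sample records are hypothetical, shaped like `item.metadata` rather than copied from a real item:

```python
def as_list(value):
    """Normalize a metadata value to a list of strings."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return [str(entry) for entry in value]
    return [str(value)]

# Invented records: one with a single subject, one with several, one with none
sample_records = [
    {'title': 'Example report A', 'subject': 'Agriculture'},
    {'title': 'Example report B', 'subject': ['Agriculture', 'Education']},
    {'title': 'Example report C'},
]

for record in sample_records:
    subjects = as_list(record.get('subject'))
    label = ', '.join(subjects) if subjects else 'No subject'
    print(f"{record['title']}: {label}")
```

Normalizing early means the rest of your analysis (counting subjects, building tables) can always assume a list.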

In [None]:
# Install and import Internet Archive library
# First install the library (run this once)
print("📚 Installing Internet Archive Python library...")
!pip install internetarchive --quiet
print("✅ Install command finished")

# Import the library
try:
    import internetarchive as ia
    print("✅ Internet Archive library imported successfully!")
    print("🔗 This library provides direct, efficient access to IA collections")
except ImportError as e:
    print(f"❌ Error importing Internet Archive library: {e}")
    print("💡 Try running: pip install internetarchive")

## Step 8: Introduction to Internet Archive Python Library

While Beautiful Soup is excellent for scraping web pages, the Internet Archive provides a specialized Python library that makes accessing their collections much more efficient. This is perfect for large-scale historical research projects.

**Why use the Internet Archive library?**
- Direct access to metadata without scraping HTML
- Faster downloads and better error handling
- Access to full-text search capabilities
- Bulk processing of large collections
- Respects Internet Archive's preferred access methods

In [None]:
# Final Challenge: Your Historical Research Project

# Available Canadian historical sources:
sources = {
    "TPL Jesuit Relations": {
        "url": "https://torontopubliclibrary.typepad.com/local-history-genealogy/2020/01/sainte-marie-among-the-hurons-selections-from-the-jesuit-relations-and-allied-documents.html",
        "type": "Blog post with historical quotes",
        "research_questions": [
            "What themes appear in historical quotes?",
            "How many external historical links are provided?",
            "What time periods are mentioned?"
        ]
    },
    "Saskatchewan History": {
        "url": "https://archive.org/details/saskatchewan00sask",
        "type": "Internet Archive document",
        "research_questions": [
            "What metadata is available about publication?",
            "What file formats are provided?",
            "Who was the creator/publisher?"
        ]
    },
    "UofT Annual Report 1919-20": {
        "url": "https://archive.org/details/annualreport191920nivuoft",
        "type": "Institutional document",
        "research_questions": [
            "What institutional information is captured?",
            "How is the document structured?",
            "What historical context does it provide?"
        ]
    }
}

# Choose your research project
print("🔬 Historical Web Scraping Research Project")
print("=" * 50)

print("📚 Available sources:")
for i, (name, info) in enumerate(sources.items(), 1):
    print(f"\n{i}. {name}")
    print(f"   Type: {info['type']}")
    print(f"   URL: {info['url'][:60]}...")
    print(f"   Sample questions:")
    for q in info['research_questions']:
        print(f"     - {q}")

print(f"\n🎯 Your research steps:")
print("1. Choose a source and research question")
print("2. Use appropriate scraping techniques")
print("3. Extract and validate data") 
print("4. Analyze and present findings")

# Example research project - customize this!
chosen_source = "TPL Jesuit Relations"  # Change this to your choice
my_url = sources[chosen_source]["url"]
my_question = "What historical themes appear in the quoted text?"

print(f"\n📋 Example Research Project:")
print(f"Source: {chosen_source}")
print(f"Question: {my_question}")
print(f"URL: {my_url}")

# Conduct the research
print(f"\n🔍 Conducting research...")

# Defined up front so the findings step works even if your question is not about themes
theme_counts = {}

try:
    # Step 1: Get the page
    response = respectful_get(my_url)

    if response:
        soup = BeautifulSoup(response.text, 'html.parser')
        print("✅ Page successfully scraped")

        # Step 2: Extract relevant content (customize based on your question)
        if "themes" in my_question.lower():
            # Look for quotes and analyze themes
            blockquotes = soup.find_all('blockquote')
            paragraphs = soup.find_all('p')
            
            print(f"\n📊 Content Analysis:")
            print(f"   Blockquotes found: {len(blockquotes)}")
            print(f"   Paragraphs found: {len(paragraphs)}")
            
            # Analyze themes in text
            all_text = soup.get_text().lower()
            
            # Historical themes to look for
            themes = {
                'Religious': ['god', 'lord', 'jesus', 'prayer', 'church', 'faith'],
                'Indigenous peoples': ['huron', 'savage', 'indian', 'native', 'tribe'],
                'French colonial': ['french', 'france', 'jesuit', 'missionary'],
                'Geographic': ['canada', 'new france', 'sainte-marie', 'ontario'],
                'Temporal': ['1600', '1640', '1650', '17th century', 'century']
            }
            
            theme_counts = {}
            for theme_name, keywords in themes.items():
                count = sum(all_text.count(keyword) for keyword in keywords)
                if count > 0:
                    theme_counts[theme_name] = count
            
            print(f"\n🎭 Historical Themes Found:")
            for theme, count in sorted(theme_counts.items(), key=lambda x: x[1], reverse=True):
                print(f"   {theme}: {count} mentions")
        
        # Step 3: Present findings
        print(f"\n📝 Research Findings:")
        print("1. Successfully scraped Canadian historical content")
        print("2. Identified multiple historical themes in source material")
        print("3. Demonstrated effective web scraping for historical research")
        
        if theme_counts:
            most_common = max(theme_counts.items(), key=lambda x: x[1])
            print(f"4. Most prominent theme: {most_common[0]} ({most_common[1]} mentions)")
        
    else:
        print("❌ Could not complete research - page unavailable")

except Exception as e:
    print(f"❌ Research error: {e}")

# Your turn!
print(f"\n🚀 Now it's your turn!")
print("Customize this code for your own research question:")
print("1. Change 'chosen_source' to your preferred source")
print("2. Modify 'my_question' to your research interest") 
print("3. Adapt the analysis code for your specific question")
print("4. Add your own themes, keywords, or analysis methods")

print(f"\n💡 Remember to:")
print("- Validate your scraped data")
print("- Handle errors gracefully")
print("- Respect website terms of service")
print("- Cite your digital sources properly")
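
A caution about the theme counts in the cell above: `str.count()` matches substrings, so "god" is also counted inside "Godfrey" and "lord" inside "landlord". If that matters for your question, one option is to count whole words with a regular expression. This is a sketch only; the function name `count_theme_words` and the sample sentence are invented for illustration:

```python
import re

def count_theme_words(text, themes):
    """Count whole-word keyword matches per theme, ignoring case."""
    counts = {}
    for theme_name, keywords in themes.items():
        total = 0
        for keyword in keywords:
            # \b marks a word boundary, so 'god' will not match inside 'Godfrey'
            pattern = r'\b' + re.escape(keyword) + r'\b'
            total += len(re.findall(pattern, text, flags=re.IGNORECASE))
        if total:
            counts[theme_name] = total
    return counts

sample_text = "The Jesuit fathers praised God; Godfrey and the landlord did not."
sample_themes = {
    'Religious': ['god', 'lord'],
    'French colonial': ['jesuit', 'france'],
}

print("Whole words:", count_theme_words(sample_text, sample_themes))
print("Substrings: ", sum(sample_text.lower().count(k) for k in sample_themes['Religious']))
```

Neither method understands meaning, so treat the numbers as a starting point for close reading rather than a finding in themselves.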

## Summary: What You've Learned

🎉 **Congratulations!** You've mastered the fundamentals of web scraping for historical research:

**Technical Skills:**
- ✅ Making web requests with `requests.get()`
- ✅ Parsing HTML with Beautiful Soup
- ✅ Targeting specific elements (`find`, `find_all`)
- ✅ Extracting text, links, and metadata
- ✅ Building reusable functions for research

**Historical Sources:**
- ✅ Blog posts with embedded historical content
- ✅ Internet Archive documents with metadata
- ✅ Academic indexes and structured data

**Research Methods:**
- ✅ Systematic content extraction
- ✅ Metadata analysis for document context
- ✅ Building reproducible research workflows

**Next Steps:**
In Notebook 3, you'll learn advanced text analysis techniques to find patterns in the historical data you've scraped. We'll also build on the Internet Archive Python library from Step 8 to work with large collections more efficiently.

**Remember:**
- Always respect robots.txt and website terms of service
- Cite your digital sources properly
- Consider the limitations and context of digitized materials
- Use these skills responsibly for legitimate research purposes
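
The robots.txt reminder can also be checked in code. Python's standard library includes `urllib.robotparser` for this. The sketch below parses an invented robots.txt from a string so that it runs offline; the rules and the `example.org` URLs are placeholders, not the policy of any real archive:

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content for illustration
sample_robots = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

public_url = "https://example.org/details/some-item"
private_url = "https://example.org/private/notes.html"

print("Public page allowed: ", parser.can_fetch("*", public_url))
print("Private page allowed:", parser.can_fetch("*", private_url))
print("Requested delay (s): ", parser.crawl_delay("*"))
```

For a live site you would call `parser.set_url(...)` with the site's robots.txt address and then `parser.read()` instead of `parse()`, and you could pass the reported crawl delay to `respectful_get()` as its `delay`.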