# Scraping U.S. Senator Data from Wikipedia

This notebook demonstrates using `requests` and `BeautifulSoup` to scrape structured data
from a Wikipedia page, and using `re` to clean up the extracted text.

We will:
1. Fetch the HTML of the [List of current United States senators](https://en.wikipedia.org/wiki/List_of_current_United_States_senators)
2. Parse the senators table to extract names, states, and party affiliations
3. Clean up Wikipedia footnotes using regular expressions
4. Visit each senator's individual Wikipedia page to find their official Senate website URL

## Step 1: Imports and Configuration

In [None]:
import re
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests
from bs4 import BeautifulSoup

WIKI_BASE = "https://en.wikipedia.org"
HEADERS = {"User-Agent": "PYT200-064 Class Demo (educational project)"}

## Step 2: Fetch the Senators List Page

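Wikipedia expects a descriptive `User-Agent` header: requests that omit it, or that send only a library default such as `python-requests/x.y`, may be rejected with a **403 Forbidden** error.
Many websites behave the same way, which makes this one of the first real-world obstacles you'll encounter when scraping.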
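
One way to avoid passing the header on every call is to attach it to a `requests.Session`. The sketch below is an illustration rather than part of the notebook's pipeline: the `make_session` helper is a hypothetical name, and the code only prepares a request locally, so nothing is sent over the network.

```python
import requests

WIKI_BASE = "https://en.wikipedia.org"
HEADERS = {"User-Agent": "PYT200-064 Class Demo (educational project)"}


def make_session(headers):
    """Hypothetical helper: return a Session that sends `headers` on every request."""
    session = requests.Session()
    session.headers.update(headers)
    return session


session = make_session(HEADERS)

# Build the request without sending it, to confirm which headers would go out.
request = requests.Request(
    "GET", "{}/wiki/List_of_current_United_States_senators".format(WIKI_BASE)
)
prepared = session.prepare_request(request)
print(prepared.headers["User-Agent"])
```

Calling `session.get(...)` instead of `requests.get(...)` would then include the header automatically and reuse the underlying connection across requests.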

In [None]:
url = "{}/wiki/List_of_current_United_States_senators".format(WIKI_BASE)
response = requests.get(url, headers=HEADERS, timeout=10)  # fail instead of hanging forever
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
print("Page fetched: {} characters".format(len(response.text)))

## Step 3: Find the Senators Table

When this notebook was written, the page had **four** sortable tables: the first three are leadership summaries,
and the senators table is the fourth. Wikipedia's layout changes over time, so always inspect what `find_all`
returns rather than assuming a fixed count or position.

In [None]:
tables = soup.find_all("table", class_="sortable")
print("Found {} sortable tables".format(len(tables)))

# Show what each table looks like by its header row
for i, t in enumerate(tables):
    header_row = t.find("tr")
    headers = [th.get_text(strip=True)[:20] for th in header_row.find_all("th")[:5]]
    print("  Table {}: {}".format(i, headers))

In [None]:
table = tables[3]
tbody = table.find("tbody")
rows = tbody.find_all("tr")
print("Senator table has {} rows (1 header + {} data rows)".format(len(rows), len(rows) - 1))
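
Indexing with `tables[3]` breaks as soon as a table is added or removed. A more defensive alternative, sketched here on a hand-written HTML fragment, is to pick the table whose header row contains the labels we expect. The `find_table_with_headers` helper and the sample markup are assumptions for illustration; the real page's header labels should be checked against the output of the inspection loop above.

```python
from bs4 import BeautifulSoup

# Made-up fragment with placeholder names; not copied from Wikipedia.
SAMPLE_HTML = """
<table class="sortable"><tr><th>Office</th><th>Officer</th></tr>
<tr><td>President pro tempore</td><td>Example A</td></tr></table>
<table class="sortable"><tr><th>State</th><th>Senator</th><th>Party</th></tr>
<tr><td>Alabama</td><th>Example B</th><td>Republican</td></tr></table>
"""


def find_table_with_headers(soup, required):
    """Hypothetical helper: first sortable table whose header row has every label in `required`."""
    for table in soup.find_all("table", class_="sortable"):
        header_row = table.find("tr")
        if header_row is None:
            continue
        labels = {th.get_text(strip=True) for th in header_row.find_all("th")}
        if set(required) <= labels:
            return table
    return None


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
match = find_table_with_headers(sample_soup, ["State", "Senator", "Party"])
print(match.find("td").get_text(strip=True))
```

The helper returns `None` when no table matches, which makes a layout change fail loudly instead of silently parsing the wrong table.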

## Step 4: Understand the Row Structure

The table uses `rowspan="2"` on state cells, meaning each state name spans two rows
(one per senator). The second senator's row has **no state cell** — it's covered by the rowspan.

Senator names are in `<th>` elements (row headers), not `<td>`. This is semantically
correct HTML but surprises students who expect all data in `<td>` elements.

Let's inspect the first few rows to see this structure:

In [None]:
# Inspect the first 4 data rows to see the rowspan pattern
for row in rows[1:5]:
    cells = row.find_all(["td", "th"])
    for cell in cells[:5]:
        tag = cell.name
        rowspan = cell.get("rowspan", "-")
        text = cell.get_text(strip=True)[:25]
        print("  <{}> rowspan={}: \"{}\"".format(tag, rowspan, text))
    print("  ---")

## Step 5: Parse All Senators

We loop through every row, tracking the current state. When we see a `<td>` with
`rowspan="2"`, we know a new state has started. Senator names come from `<th>` elements.

For the party column, we navigate relative to the `<th>` using `find_next_sibling`
rather than counting column positions (which shift due to the rowspan).

In [None]:
senators = []
current_state = None

for row in rows:
    # State cells have rowspan="2", spanning both senator rows
    state_cell = row.find("td", rowspan="2")
    if state_cell:
        current_state = state_cell.get_text(strip=True)

    # Senator names are in <th> row header elements
    name_cell = row.find("th")
    if name_cell and current_state:
        name = name_cell.get_text(strip=True)
        wiki_link = name_cell.find("a")
        wiki_path = wiki_link["href"] if wiki_link else ""

        # Party is two <td> siblings after the <th>: an empty color cell, then the party name
        party_cell = name_cell.find_next_sibling("td").find_next_sibling("td")

        # Extract footnote references before getting the clean party text.
        # Collect every note so a second footnote does not overwrite the first.
        note_parts = []
        for sup in party_cell.find_all("sup"):
            link = sup.find("a")
            if link and link.get("href", "").startswith("#cite_note"):
                ref_id = link["href"].lstrip("#")
                ref_li = soup.find("li", id=ref_id)
                ref_text = ref_li.find("span", class_="reference-text") if ref_li else None
                if ref_text:
                    text = ref_text.get_text(" ", strip=True)
                    # Use re.sub to remove bracketed citation numbers like [15]
                    note_parts.append(re.sub(r"\s*\[\s*\d+\s*\]", "", text).strip())
            sup.decompose()  # Remove <sup> so it doesn't appear in party text
        notes = " ".join(note_parts)

        party = party_cell.get_text(strip=True)
        senators.append((name, current_state, party, wiki_path, notes))

print("Parsed {} senators".format(len(senators)))
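
The carry-forward logic is easier to verify on a tiny hand-written table than on the live page. The fragment below is made up for illustration (placeholder names, two states), but it mirrors the structure described above: a `rowspan="2"` state cell, names in `<th>`, and a color cell before the party cell.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real table; names are placeholders.
MINI_TABLE = """
<table><tbody>
<tr><th>State</th><th>Senator</th><th colspan="2">Party</th></tr>
<tr><td rowspan="2">Alabama</td><th><a href="/wiki/A_One">A One</a></th><td></td><td>Republican</td></tr>
<tr><th><a href="/wiki/A_Two">A Two</a></th><td></td><td>Republican</td></tr>
<tr><td rowspan="2">Alaska</td><th><a href="/wiki/B_One">B One</a></th><td></td><td>Democratic</td></tr>
<tr><th><a href="/wiki/B_Two">B Two</a></th><td></td><td>Independent</td></tr>
</tbody></table>
"""

mini_rows = BeautifulSoup(MINI_TABLE, "html.parser").find_all("tr")

parsed = []
state = None
for row in mini_rows:
    state_cell = row.find("td", rowspan="2")
    if state_cell:
        state = state_cell.get_text(strip=True)  # remembered for the next row too
    name_cell = row.find("th")
    if name_cell and state:
        # Skip the color cell, then read the party cell
        party_cell = name_cell.find_next_sibling("td").find_next_sibling("td")
        parsed.append(
            (name_cell.get_text(strip=True), state, party_cell.get_text(strip=True))
        )

for record in parsed:
    print(record)
```

The header row is skipped because `state` is still `None` when it is reached, which is the same reason the full loop can safely iterate over `rows` without slicing off the header.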

## Step 6: Display the Results

Let's see what we've collected so far — senator names, states, and parties.

In [None]:
print("{:<30} {:<20} {}".format("Senator", "State", "Party"))
print("-" * 65)
for name, state, party, wiki_path, notes in senators:
    print("{:<30} {:<20} {}".format(name, state, party))
print("\nTotal: {} senators".format(len(senators)))

## Step 7: A Closer Look at `re.sub` for Cleaning Footnotes

When this notebook was written, four senators had Wikipedia footnotes in their party cells. The footnote text
contains bracketed citation numbers like `[15]` that we need to strip.

The regex pattern `\s*\[\s*\d+\s*\]` matches:

| Component | Matches |
|---|---|
| `\s*` | Optional whitespace before the bracket |
| `\[` | Literal `[` (escaped because `[` is special in regex) |
| `\s*` | Optional whitespace inside the bracket |
| `\d+` | One or more digits |
| `\s*` | Optional whitespace before closing bracket |
| `\]` | Literal `]` |
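
To see the pattern in isolation, here is a small sketch that applies it to made-up strings. The `strip_citations` helper and the sample sentences are assumptions for illustration, not text taken from the page.

```python
import re

CITATION_RE = re.compile(r"\s*\[\s*\d+\s*\]")


def strip_citations(text):
    """Hypothetical helper: remove bracketed numeric citations such as [15] or [ 15 ]."""
    return CITATION_RE.sub("", text).strip()


sample = "Elected as an independent [ 15 ] and caucuses with the majority.[16]"
print(strip_citations(sample))

# Lettered notes such as [a] contain no digits, so the pattern leaves them alone.
print(strip_citations("See note [a] and [ 3 ]"))
```

Because the leading `\s*` is part of the match, the space before each bracket is removed along with the citation, so no double spaces are left behind.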

Let's see which senators have notes and what the cleanup looks like:

In [None]:
for name, state, party, wiki_path, notes in senators:
    if notes:
        print("{} ({})".format(name, state))
        print("  Party: {}".format(party))
        print("  Note:  {}".format(notes))
        print()

## Step 8: Fetch Senate Website URLs

Each senator's name links to their individual Wikipedia page (e.g., `/wiki/Katie_Britt`).
On that page, an **infobox** sidebar contains a "Website" row with their official Senate URL.

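We define a function to fetch one senator's website, then use `ThreadPoolExecutor` with 10 worker threads
to fetch the 100 pages concurrently, up to 10 at a time. That is roughly 10x faster than fetching them one by one.

Threads work well here because the task is **I/O-bound**: each thread spends most of its time waiting for Wikipedia to respond.
For **CPU-bound** work, you would use `ProcessPoolExecutor` instead, because in standard CPython builds the global interpreter lock keeps threads from running Python code in parallel.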

In [None]:
def get_senate_website(wiki_path):
    """Fetch a senator's Wikipedia page and extract the Senate website URL from the infobox."""
    url = "{}{}".format(WIKI_BASE, wiki_path)
    resp = requests.get(url, headers=HEADERS, timeout=10)  # a hung request would stall a worker thread
    resp.raise_for_status()

    page_soup = BeautifulSoup(resp.text, "html.parser")
    infobox = page_soup.find("table", class_="infobox")
    if infobox:
        for th in infobox.find_all("th"):
            if "Website" in th.get_text():
                td = th.find_next_sibling("td")
                if td:
                    link = td.find("a")
                    if link:
                        return link.get("href", "")
    return ""
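
Because `get_senate_website` mixes fetching and parsing, it can only be tried against the live site. One option, sketched below, is to move the parsing into its own function that takes an HTML string, so it can be checked on a hand-written infobox. The `extract_website` name, the sample markup, and the `example.gov` URL are assumptions, not taken from a real senator's page.

```python
from bs4 import BeautifulSoup

# Made-up infobox; the URL is a placeholder.
SAMPLE_PAGE = """
<table class="infobox vcard">
<tr><th>Born</th><td>January 1, 1970</td></tr>
<tr><th>Website</th><td><a href="https://www.example.gov/">Senate website</a></td></tr>
</table>
"""


def extract_website(html):
    """Hypothetical helper: return the first link in the infobox "Website" row, or ""."""
    page_soup = BeautifulSoup(html, "html.parser")
    infobox = page_soup.find("table", class_="infobox")
    if infobox is None:
        return ""
    for th in infobox.find_all("th"):
        if "Website" in th.get_text():
            td = th.find_next_sibling("td")
            link = td.find("a") if td else None
            if link:
                return link.get("href", "")
    return ""


print(extract_website(SAMPLE_PAGE))
print(repr(extract_website("<p>no infobox here</p>")))
```

With this split, the network function would shrink to a fetch followed by `extract_website(resp.text)`, and the parsing rules could be tested offline.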

In [None]:
print("Fetching website URLs for {} senators...".format(len(senators)))
websites = [""] * len(senators)

with ThreadPoolExecutor(max_workers=10) as executor:
    future_to_index = {}
    for i, senator in enumerate(senators):
        if senator[3]:  # only if we have a wiki path
            future = executor.submit(get_senate_website, senator[3])
            future_to_index[future] = i

    for future in as_completed(future_to_index):
        idx = future_to_index[future]
        try:
            websites[idx] = future.result()
        except Exception as e:
            print("  Warning: could not fetch website for {}: {}".format(senators[idx][0], e))

# Combine the website URLs with the senator data
senators = [
    (name, state, party, websites[i], notes)
    for i, (name, state, party, _, notes) in enumerate(senators)
]

print("Done!")
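
The speedup from threads on I/O-bound work can be checked without touching the network. In this sketch, `fake_fetch` is a hypothetical stand-in that sleeps for 50 ms instead of making a request; the timings it prints depend on the machine, so no specific numbers are claimed.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(page_id):
    """Stand-in for a network request: sleeps instead of contacting a server."""
    time.sleep(0.05)
    return page_id * 2


page_ids = list(range(10))

# One at a time: the waits add up.
start = time.perf_counter()
sequential = [fake_fetch(p) for p in page_ids]
sequential_time = time.perf_counter() - start

# Ten workers: the waits overlap.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    # map() returns results in input order, unlike as_completed()
    threaded = list(pool.map(fake_fetch, page_ids))
threaded_time = time.perf_counter() - start

print("sequential: {:.2f}s, threaded: {:.2f}s".format(sequential_time, threaded_time))
```

`pool.map` is used here because it preserves input order; the notebook's `as_completed` approach instead needs the `future_to_index` dictionary to put each result back in the right slot, but it lets each failure be handled individually.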

## Step 9: Final Results

In [None]:
print("{:<30} {:<20} {:<15} {}".format("Senator", "State", "Party", "Website"))
print("-" * 110)
for name, state, party, website, notes in senators:
    print("{:<30} {:<20} {:<15} {}".format(name, state, party, website))
print("\nTotal: {} senators".format(len(senators)))