## Scraping Wikipedia Natural Disasters
Follow this step-by-step workflow to collect, clean, and visualise a dataset pulled directly from Wikipedia.

### What you will learn
- Build polite HTTP requests with rotating user agents.
- Identify the right tables on a MediaWiki page before scraping.
- Convert raw HTML rows into a tidy pandas DataFrame.
- Clean numeric ranges, remove footnotes, and fix data types.
- Explore the resulting dataset with plots and a simple map.

### Step 0 — Import the libraries
We bring in web-scraping helpers, data wrangling tools, and visualisation packages that we will use throughout the notebook.

In [1]:
# Import libraries for HTTP requests, HTML parsing, and randomization
import requests  # For sending HTTP requests to fetch web pages
from bs4 import BeautifulSoup  # For parsing HTML content
import random  # For selecting random user agents

# Import libraries for data manipulation, regular expressions, and string handling
import pandas as pd  # For data manipulation and creating DataFrames
import re  # For regular expressions to clean text
from io import StringIO  # For reading HTML strings as file-like objects

# Import libraries for visualization, mapping, and geocoding
import folium  # For creating interactive maps
import seaborn as sns  # For statistical data visualization
import matplotlib.pyplot as plt  # For plotting graphs
from geopy.geocoders import Nominatim, Photon  # For geocoding locations
from geopy.extra.rate_limiter import RateLimiter  # For rate-limiting geocoding requests

### Step 1 — Rotate user agents
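Websites can respond differently depending on the browser profile they think is visiting. Rotating through a small list of browser-style user agents makes our requests less uniform. Rotation alone does not make a scraper polite: that comes from keeping the request rate low, and Wikimedia's User-Agent policy additionally asks automated clients to identify themselves with contact details.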

In [2]:
# Define a list of user agents to rotate through, mimicking different browsers
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Firefox/57.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Edge/16.16299',
]

### Step 2 — Helper to pick a user agent
Each request calls this helper to pick its header at random, so consecutive requests do not always send the same one.

In [3]:
# Define a function to randomly select a user agent from the list
def get_random_user_agent():
    """Return a random user-agent header for outgoing HTTP requests."""
    return random.choice(user_agents)

# Print an example of a selected user agent
print(f"Using User-Agent: {get_random_user_agent()}")

Using User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Firefox/57.0


### Step 3 — Target URL
We will scrape the live Wikipedia article so the dataset stays current: re-run the notebook to pick up new edits. If the page structure changes, the table selection and cleaning steps may need adjusting before the refresh works.

In [4]:
# Set the URL to fetch (this cell points at the NYT homepage RSS feed)
url = "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"

### Step 4 — Download the page and inspect its tables
1. Send a request with a random user agent and confirm the status code.
2. Parse the HTML with BeautifulSoup so we can navigate the document.
3. Collect only the tables that expose the columns `'Year', 'Death Toll', 'Event', 'Countries Affected', 'Type', 'Date'`.
4. Preview the first few rows before we tidy the data.
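
Step 3 of this list can be sketched as follows. This is a minimal illustration rather than the notebook's original cell: the helper name `find_matching_tables` and the sample HTML are hypothetical, and it assumes the header cells are `<th>` elements in each table's first row with no footnote markers attached.

```python
from bs4 import BeautifulSoup

# Columns the target tables are expected to expose (taken from the step list)
EXPECTED_COLUMNS = {"Year", "Death Toll", "Event", "Countries Affected", "Type", "Date"}

def find_matching_tables(html: str, expected=EXPECTED_COLUMNS):
    """Return the <table> tags whose first-row headers include every expected column.

    Hypothetical helper: assumes header cells are <th> elements in the first <tr>.
    """
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for table in soup.find_all("table"):
        first_row = table.find("tr")
        if first_row is None:
            continue  # skip tables without any rows
        headers = {th.get_text(" ", strip=True) for th in first_row.find_all("th")}
        if expected <= headers:  # subset test: all expected columns are present
            matches.append(table)
    return matches

# Made-up sample: one table with the expected headers, one without
sample_html = """
<table><tr><th>Year</th><th>Death Toll</th><th>Event</th>
<th>Countries Affected</th><th>Type</th><th>Date</th></tr></table>
<table><tr><th>Rank</th><th>Name</th></tr></table>
"""
print(len(find_matching_tables(sample_html)))
```

Filtering on header names instead of table position means the selection keeps working if unrelated tables are added to the page.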

In [5]:
# Step 1: Send HTTP request to the RSS feed with a random User-Agent
headers = {"User-Agent": get_random_user_agent()}
response = requests.get(url, headers=headers, timeout=10)  # timeout stops a stalled request from hanging the cell
print("Status Code:", response.status_code)

# Step 2: Parse the RSS XML content using BeautifulSoup
soup = BeautifulSoup(response.content, "xml")
print("✅ XML parsed successfully!")

# Step 3: Extract all article titles from the feed
titles = []
for item in soup.find_all("item"):
    title = item.title.get_text()
    titles.append(title)

print(f"✅ Found {len(titles)} article titles.")


Status Code: 200
✅ XML parsed successfully!
✅ Found 34 article titles.


### Step 5 — Extract the rows we care about
Two collapsible tables share the same structure (years 1900–2000 and 2001–present). We will loop through both, normalise their headers, clean each cell (removing footnotes and non-breaking spaces), and combine the results into a single DataFrame that we also persist to disk for reuse.
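
The cleaning and combining described here could look roughly like the sketch below. It starts from already-extracted cell text so it runs without a network call. The helper names `clean_cell` and `rows_to_frame`, the sample rows, and the CSV file name are all hypothetical, and the footnote pattern assumes markers appear in square brackets such as `[1]` or `[note 2]`.

```python
import re
import pandas as pd

FOOTNOTE = re.compile(r"\[[^\]]*\]")  # matches bracketed markers such as [1] or [note 2]

def clean_cell(text: str) -> str:
    """Strip footnote markers and non-breaking spaces, then collapse whitespace."""
    text = FOOTNOTE.sub("", text).replace("\xa0", " ")
    return " ".join(text.split())

def rows_to_frame(tables):
    """Combine (header, rows) pairs that share one structure into a single DataFrame."""
    frames = []
    for header, rows in tables:
        columns = [clean_cell(h) for h in header]  # normalise the header names
        cleaned = [[clean_cell(cell) for cell in row] for row in rows]
        frames.append(pd.DataFrame(cleaned, columns=columns))
    return pd.concat(frames, ignore_index=True)

# Placeholder rows standing in for the two scraped tables (not real figures)
table_a = (["Year", "Death Toll[a]", "Event"],
           [["1900", "6,000\u20139,000[1]", "Example\xa0storm"]])
table_b = (["Year", "Death Toll", "Event"],
           [["2001", "20,000", "Example quake [note 2]"]])

df = rows_to_frame([table_a, table_b])
print(df)

# To persist the result for reuse (hypothetical file name):
# df.to_csv("natural_disasters.csv", index=False)
```

Cleaning the headers with the same function as the cells keeps the two tables' column names identical, which is what lets `pd.concat` stack them without creating extra columns.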

In [None]:
# Step 4: Display the first 10 article titles
print("\nTop 10 Article Titles:\n")
for i, title in enumerate(titles[:10], start=1):
    print(f"{i}) {title}")


## Step 6 — Clean and explore the dataset

### Convert the death toll into comparable integers
Many rows record a range such as `6,000–9,000`. We will keep the lower bound so we can sort and chart consistently.

In [None]:
# Define a function to extract the lower bound of death toll ranges
def extract_lower_bound(death_toll: str) -> int:
    """Return the first numeric value found in a death toll string, or 0 if there is none."""
    # Allow any number of comma-separated groups so "1,000,000" is not truncated to "1,000"
    match = re.search(r"(\d+(?:,\d+)*)", str(death_toll))
    return int(match.group(1).replace(",", "")) if match else 0  # Strip commas, then convert to int

# Apply the function to the Death Toll column
df["Death Toll"] = df["Death Toll"].apply(extract_lower_bound)
# Display the updated DataFrame head
display(df.head())

### Inspect the DataFrame shape
Confirm the number of rows (disasters) and columns recorded.

In [None]:
# Get and display the shape of the DataFrame (rows, columns)
df.shape

### Column names
Double-check that our cleaned headers are the ones we expect.

In [None]:
# Display the column names of the DataFrame
df.columns

### Summary statistics
A quick numerical overview highlights the distribution of the cleaned death toll values.

In [None]:
# Generate and display descriptive statistics for the DataFrame
df.describe()

### Check for missing values
Missing data may signal parsing issues or follow-up cleaning tasks.

In [None]:
# Count and display the number of missing values per column
df.isnull().sum()

### Data types
Verify each column has the expected pandas dtype before visualising.

In [None]:
# Display the data types of each column
df.dtypes

### Enforce numeric dtypes
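Casting `Year` to integers and parsing `Date` as datetimes makes sure pandas treats them as numbers and dates, not strings. `Death Toll` was already converted to integers earlier in this step.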

In [None]:
# Convert Year to integer type
df['Year'] = df['Year'].astype(int)
# Convert Date to datetime type, coercing errors
df['Date'] = pd.to_datetime(df['Date'], format="%Y-%m-%d", errors='coerce')
# Display updated data types
df.dtypes

### Unique countries affected
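How many distinct values appear in the `Countries Affected` column? A row that lists several countries counts as its own value, so the total is not a count of individual countries.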

In [None]:
# Count unique values in 'Countries Affected' column
unique_countries_count = df['Countries Affected'].nunique()
# Print the count
print(f"{unique_countries_count} unique countries or regions recorded.")

### Unique disaster types
A quick count shows the breadth of disaster categories recorded on the page.

In [None]:
# Count unique values in 'Type' column
unique_disaster_types_count = df['Type'].nunique()
# Print the count
print(f"{unique_disaster_types_count} distinct disaster types captured.")

### Optional: display the full table
Expand pandas' display settings if you want to inspect every column and row inline. Be cautious—large tables can slow down the notebook.
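
One way to do this without changing the settings for the rest of the notebook is `pd.option_context`, sketched below. The `demo` frame is a stand-in for the scraped `df`, and `repr` is used so the example also runs outside a notebook.

```python
import pandas as pd

# Stand-in frame with more rows than pandas shows by default
demo = pd.DataFrame({"Year": range(1900, 1980), "Death Toll": range(80)})

short_view = repr(demo)  # default settings: the middle rows are elided

# option_context lifts the limits only inside the block and restores them afterwards
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    full_view = repr(demo)  # in a notebook, call display(demo) here instead

print(len(short_view.splitlines()), len(full_view.splitlines()))
```

Because the settings are restored when the block exits, later cells keep the compact default view.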

## Step 7 — Visualise the impact

### Year vs. death toll
Plotting each disaster as a point shows temporal clusters and highlights extreme events by type.
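
A scatter plot along these lines could be drawn as sketched below. The `demo` frame uses placeholder values, not real figures, in place of the cleaned `df`, and the logarithmic y-axis is an assumption about what reads best when tolls span several orders of magnitude.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder rows standing in for the cleaned df (values are illustrative only)
demo = pd.DataFrame({
    "Year": [1905, 1931, 1970, 2004, 2010],
    "Death Toll": [1_200, 50_000, 8_000, 30_000, 400],
    "Type": ["Earthquake", "Flood", "Flood", "Earthquake, Tsunami", "Earthquake"],
})

fig, ax = plt.subplots(figsize=(10, 6))
# One point per disaster, coloured by its type
sns.scatterplot(data=demo, x="Year", y="Death Toll", hue="Type", ax=ax)
ax.set_yscale("log")  # keeps small events visible next to extreme ones
ax.set_title("Death toll by year and disaster type")
```

In a notebook the figure renders when the cell finishes; pass `df` instead of `demo` once the real dataset is loaded.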

In [8]:
# Step 0: Import required libraries
import requests
from bs4 import BeautifulSoup

# Step 1: Set the RSS feed URL for NYT homepage
url = "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"

# Step 2: Send HTTP request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers)
print("Status Code:", response.status_code)

# Step 3: Parse XML content using BeautifulSoup
soup = BeautifulSoup(response.content, "xml")
print("✅ XML parsed successfully!")

# Step 4: Extract article titles from <item> tags
titles = [item.title.get_text() for item in soup.find_all("item")]

# Step 5: Display the top 10 article titles in numbered format
print("\nTop 10 Article Titles:\n")
for i, title in enumerate(titles[:10], start=1):
    print(f"{i}) {title}")


Status Code: 200
✅ XML parsed successfully!

Top 10 Article Titles:

1) Companies Have Shielded Buyers From Tariffs. But Not for Long.
2) A Major Crypto Pardon, and the N.B.A. Gambling Scandal With Mob Ties
3) Letitia James Case Shows Ruthlessness of Justice Dept. in Trump’s Grip
4) Letitia James to Appear in Court as Battle Over Trump-Urged Prosecution Begins
5) Rebuilding Israeli-Held Parts of Gaza: Workable or Another U.S. Pipe Dream?
6) Who Were the Palestinian Prisoners Freed by Israel?
7) N.B.A. Gambling Scandal Reflects America’s Obsession With Sports Betting
8) NBA Gambling Scandal: What We Know
9) Can Ken Burns Win the American Revolution?
10) On a Roll, European Leaders Meet to Bolster Support for Ukraine


### Total deaths by disaster type
Aggregating by type reveals which kinds of disasters have historically been most deadly.
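
The aggregation itself is a single `groupby`, sketched here on placeholder values (not real figures) standing in for the cleaned `df`.

```python
import pandas as pd

# Placeholder rows; in the notebook this would be the cleaned df
demo = pd.DataFrame({
    "Type": ["Earthquake", "Flood", "Earthquake", "Heat wave"],
    "Death Toll": [1_000, 5_000, 2_500, 300],
})

# Sum the tolls per type and list the deadliest category first
deaths_by_type = (
    demo.groupby("Type")["Death Toll"]
        .sum()
        .sort_values(ascending=False)
)
print(deaths_by_type)

# To chart it with seaborn, for example:
# sns.barplot(x=deaths_by_type.values, y=deaths_by_type.index)
```

Sorting before plotting means the bars appear in rank order without any extra configuration.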

In [9]:
# Step 0: Import libraries
import requests
from bs4 import BeautifulSoup

# Step 1: Set the NYT homepage URL
url = "https://www.nytimes.com/"

# Step 2: Send request with a User-Agent header
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers)
print("Status Code:", response.status_code)

# Step 3: Parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")
print("✅ HTML parsed successfully!")

# Step 4: Collect the text of every non-empty <h2> element
titles = []

# NYT's generated class names (such as 'css-78b01r' or 'esl82me1') are not stable, so we match
# on the tag alone; this also picks up section labels, not only article headlines
for h2 in soup.find_all("h2"):
    text = h2.get_text(strip=True)
    if text:
        titles.append(text)

# Step 5: Display top 10 headlines
print("\nTop 10 Headlines from NYT Homepage:\n")
for i, title in enumerate(titles[:10], start=1):
    print(f"{i}) {title}")


Status Code: 200
✅ HTML parsed successfully!

Top 10 Headlines from NYT Homepage:

1) Live
2) Top Stories
3) Watch Today’s Videos
4) More News
5) The AthleticSports coverage
6) Well
7) Culture and Lifestyle
8) AudioPodcasts and narrated articles
9) GamesDaily puzzles
10) Site Index


### Total deaths by country or region
Summing per location highlights where disasters have had the heaviest tolls.
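
Because one row can list several countries, the column has to be split before summing. The sketch below uses placeholder values (not real figures) and assumes countries are comma-separated, as the mapping step later in the notebook does. It also assumes a multi-country event's full toll is credited to every country listed, which overstates the totals for shared events.

```python
import pandas as pd

# Placeholder rows; in the notebook this would be the cleaned df
demo = pd.DataFrame({
    "Countries Affected": ["India, Bangladesh", "India", "Japan"],
    "Death Toll": [1_000, 200, 50],
})

deaths_by_country = (
    demo.assign(Country=demo["Countries Affected"].str.split(","))
        .explode("Country")                                  # one row per listed country
        .assign(Country=lambda d: d["Country"].str.strip())  # drop spaces around names
        .groupby("Country")["Death Toll"]
        .sum()
        .sort_values(ascending=False)
)
print(deaths_by_country)
```

If shared events should not be double-counted, one alternative is to divide each toll by the number of listed countries before exploding, although that is also only an approximation.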

In [12]:
# Step 1: Import libraries
import requests
from bs4 import BeautifulSoup

# Step 2: Set the NYT RSS feed URL
url = "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"

# Step 3: Fetch the RSS feed and raise an HTTPError if the request failed
response = requests.get(url, timeout=10)
response.raise_for_status()  # preferred over exit(), which would shut down the notebook kernel

# Step 4: Parse XML content
soup = BeautifulSoup(response.content, "xml")  # parse as XML

# Step 5: Extract top 10 article titles
items = soup.find_all("item")
top_titles = [item.title.text.strip() for item in items[:10]]

# Step 6: Display top 10 titles
print("Top 10 Article Titles:\n")
for i, title in enumerate(top_titles, 1):
    print(f"{i}) {title}")


Top 10 Article Titles:

1) Companies Have Shielded Buyers From Tariffs. But Not for Long.
2) A Major Crypto Pardon, and the N.B.A. Gambling Scandal With Mob Ties
3) Letitia James Case Shows Ruthlessness of Justice Dept. in Trump’s Grip
4) Letitia James to Appear in Court as Battle Over Trump-Urged Prosecution Begins
5) Rebuilding Israeli-Held Parts of Gaza: Workable or Another U.S. Pipe Dream?
6) Who Were the Palestinian Prisoners Freed by Israel?
7) N.B.A. Gambling Scandal Reflects America’s Obsession With Sports Betting
8) NBA Gambling Scandal: What We Know
9) Can Ken Burns Win the American Revolution?
10) On a Roll, European Leaders Meet to Bolster Support for Ukraine


### Recent 20-year view
Filter to the last two decades to surface modern disaster hotspots.
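
A minimal sketch of that filter follows, on placeholder values (not real figures). It assumes the 20-year window is measured back from the most recent year in the data rather than from the current calendar year.

```python
import pandas as pd

# Placeholder rows; in the notebook this would be the cleaned df
demo = pd.DataFrame({
    "Year": [1990, 2008, 2015, 2023],
    "Countries Affected": ["Iran", "China", "Nepal", "Turkey"],
    "Death Toll": [400, 900, 90, 500],
})

latest_year = demo["Year"].max()
recent = demo[demo["Year"] > latest_year - 20]  # keep only the last 20 years

# Rank locations by total deaths within the window
hotspots = (
    recent.groupby("Countries Affected")["Death Toll"]
          .sum()
          .sort_values(ascending=False)
)
print(hotspots)
```

Anchoring the window to the data keeps the result reproducible when the notebook is re-run in a later year against a saved copy of the dataset.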

In [13]:
# Step 0: Import libraries
import requests
from bs4 import BeautifulSoup

# Step 1: Set the RSS feed URL
url = "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"

# Step 2: Fetch the RSS feed
response = requests.get(url)
print("Status Code:", response.status_code)

# Step 3: Parse the RSS XML
soup = BeautifulSoup(response.content, "xml")  # Use "xml" parser for RSS

# Step 4: Extract top 10 article titles
items = soup.find_all("item")[:10]  # Get first 10 articles
titles = [item.title.text for item in items]

# Step 5: Display titles
print("\nTop 10 Article Titles:\n")
for i, title in enumerate(titles, start=1):
    print(f"{i}) {title}")


Status Code: 200

Top 10 Article Titles:

1) Companies Have Shielded Buyers From Tariffs. But Not for Long.
2) A Major Crypto Pardon, and the N.B.A. Gambling Scandal With Mob Ties
3) Letitia James Case Shows Ruthlessness of Justice Dept. in Trump’s Grip
4) Letitia James to Appear in Court as Battle Over Trump-Urged Prosecution Begins
5) Rebuilding Israeli-Held Parts of Gaza: Workable or Another U.S. Pipe Dream?
6) Who Were the Palestinian Prisoners Freed by Israel?
7) N.B.A. Gambling Scandal Reflects America’s Obsession With Sports Betting
8) NBA Gambling Scandal: What We Know
9) The Wider Costs of the N.B.A. Insider-Trading Scandal
10) Can Ken Burns Win the American Revolution?


### Disaster impact over time
Line charts reveal trends over time and make it easier to spot slow-moving crises. When several events of the same type fall in one year, seaborn plots their mean with a confidence band.

In [None]:
# Create a line plot of Year vs Death Toll, colored by Type
sns.relplot(
    data=df, kind="line",
    x="Year", y="Death Toll",
    hue="Type")

In [None]:
# Create a line plot excluding common types like Earthquake and Earthquake, Tsunami
sns.relplot(
    data=df[~df['Type'].isin(['Earthquake', 'Earthquake, Tsunami'])], kind="line",
    x="Year", y="Death Toll",
    hue="Type")

### Top 10 deadliest events
Sort the dataset to spotlight the most catastrophic disasters on record.

In [None]:
# Get the top 10 events by Death Toll
top_events = df.nlargest(10, 'Death Toll')
# Create a horizontal bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='Death Toll', y='Event', data=top_events, palette='viridis', hue='Event', legend=False)
plt.title('Top 10 Events with Highest Death Toll')
plt.xlabel('Death Toll')
plt.ylabel('Event')
plt.show()

### Interactive map of affected countries
Geocode each country and plot markers. This step uses rate-limited lookups (update the user agent with your contact information) and may take a couple of minutes to run.

In [None]:
# Set a user agent for geocoding services (required by Nominatim)
USER_AGENT = "DisasterGeoMapper/1.0 (contact: your_email@example.com)"

# Create a base folium map centered at (0,0) with zoom level 2
m = folium.Map(location=[0, 0], zoom_start=2)

# Initialize geocoders with user agent and timeout
nominatim = Nominatim(user_agent=USER_AGENT, timeout=10)
photon = Photon(user_agent=USER_AGENT, timeout=10)

# Apply rate limiting to geocoding functions
nominatim_geocode = RateLimiter(nominatim.geocode, min_delay_seconds=1, swallow_exceptions=True)
photon_geocode = RateLimiter(photon.geocode, min_delay_seconds=1, swallow_exceptions=True)

# Cache for geocoded locations to avoid repeated requests
location_cache = {}

# Function to geocode a country name with caching and fallback
def geocode_country(name: str):
    """Safely geocode a country name with caching and fallback between Nominatim and Photon."""
    if name in location_cache:
        return location_cache[name]
    
    location = nominatim_geocode(name)
    if location is None:
        location = photon_geocode(name)
    
    location_cache[name] = location
    return location

# Dictionary to aggregate disasters by country
disasters_by_country = {}

# Loop through each row in the DataFrame
for _, row in df.iterrows():
    year = row['Year']
    death_toll = row['Death Toll']
    countries = row['Countries Affected']
    event = row['Event']

    # Split countries if multiple are listed
    for country in countries.split(','):
        country = country.strip()
        if not country or country.lower() in {"various", "unknown"}:
            continue
        
        # Geocode the country
        location = geocode_country(country)
        if location is None:
            continue
        
        latitude, longitude = location.latitude, location.longitude
        # Create tooltip with disaster details
        tooltip = f"Year: {year}<br>Death Toll: {death_toll}<br>Country: {country}<br>Event: {event}"
        # Append to list for this country
        disasters_by_country.setdefault(country, []).append((latitude, longitude, tooltip))

# Add one marker per country; every entry for a country shares the same geocoded coordinates
for country, disasters in disasters_by_country.items():
    country_latitude, country_longitude, _ = disasters[0]
    country_tooltip = "<br>".join(tooltip for _, _, tooltip in disasters)
    folium.Marker([country_latitude, country_longitude], tooltip=country_tooltip).add_to(m)

In [None]:
# Display the interactive map in the notebook
m

In [None]:
# Save the map to an HTML file
m.save('impact_by_country_map.html')