# Web Scraping Movie Data

In this notebook, we'll walk through the process of scraping movie data from "The Numbers" website, covering films released between 2000 and 2025. We'll use various Python libraries to handle challenges like anti-bot protections, extract structured data from HTML, and process everything efficiently using concurrency.

## Setting Up Our Scraping Environment

First we install `cloudscraper`, a library that helps bypass Cloudflare anti-bot pages. This is crucial since many modern websites employ protection mechanisms that detect and block automated scraping attempts.

In [None]:
!pip install cloudscraper



## Importing Required Libraries

Then we import standard library modules for date handling and concurrency, along with third-party modules like BeautifulSoup, pandas, `cloudscraper`, and others used throughout the notebook. Each of these libraries serves a specific purpose:
- BeautifulSoup helps us parse HTML content
- cloudscraper bypasses anti-bot protection
- pandas will organize our data
- ThreadPoolExecutor enables parallel processing
- tqdm provides progress bars to visualize our scraping progress

In [None]:
# Standard library imports
from datetime import datetime, timedelta
import concurrent.futures  # scrape_year below refers to concurrent.futures.* by its full path
from concurrent.futures import ThreadPoolExecutor

# Third-party imports
from bs4 import BeautifulSoup
import cloudscraper
import pandas as pd
from requests.adapters import HTTPAdapter
from tqdm import tqdm
from urllib3.util.retry import Retry

## Creating a Robust Scraper Session

This function creates and returns a `cloudscraper` session with specified retry logic, enabling repeated attempts for specific HTTP errors. We configure it with sensible defaults:
- 5 total retries for failed requests
- Exponential backoff to avoid overwhelming the server
- Retry on common error codes (403, 429, 500, etc.)
- A realistic user agent to appear more like a normal browser
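
To make the backoff bullet concrete, here is a minimal sketch of the delays a `Retry` object with these defaults would roughly produce. It assumes the rule described in urllib3's documentation (no delay before the first retry, then `backoff_factor * 2 ** (retry_number - 1)`, capped at a maximum). The helper name `backoff_schedule` is hypothetical, and exact behaviour can differ between urllib3 versions.

```python
def backoff_schedule(total_retries=5, backoff_factor=1, backoff_max=120):
    """Approximate sleep (in seconds) before each retry attempt."""
    delays = []
    for retry_number in range(1, total_retries + 1):
        if retry_number == 1:
            delay = 0  # assumption: the first retry is sent without waiting
        else:
            delay = backoff_factor * 2 ** (retry_number - 1)
        delays.append(min(delay, backoff_max))  # never wait longer than the cap
    return delays


print(backoff_schedule())  # [0, 2, 4, 8, 16]
```

With the defaults used in this notebook, a request that keeps failing would therefore wait about 30 seconds in total before giving up.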

In [None]:
def get_cloudscraper_session(
    total_retries=5,
    backoff_factor=1,
    status_forcelist=(403, 429, 500, 502, 503, 504),
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
):
    """
    Returns a CloudScraper session configured to retry on specific HTTP errors.
    """

    # Create a CloudScraper session (handles Cloudflare anti-bot challenges).
    scraper = cloudscraper.create_scraper()

    # Override the default User-Agent header with the one passed in.
    scraper.headers.update({"User-Agent": user_agent})

    # Configure the Retry strategy.
    retries = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retries)

    # Mount the adapter to handle both HTTP and HTTPS.
    scraper.mount("http://", adapter)
    scraper.mount("https://", adapter)

    return scraper


## Fetching and Parsing Web Pages

Despite its name, `extract_movies` does not extract anything itself: it fetches a URL and returns the parsed page as a BeautifulSoup object, raising an `HTTPError` if the status code indicates a failure. It is a thin wrapper that handles session setup and HTML parsing in one call, which keeps the rest of the code readable.

In [None]:
def extract_movies(url, session=None):
    """
    Fetches and returns a BeautifulSoup object for the given URL,
    using a CloudScraper session configured with retries.
    """
    if session is None:
        # Fallback to a default session if none provided
        session = get_cloudscraper_session()

    response = session.get(url)
    # Raises an HTTPError if the status is 4xx or 5xx.
    response.raise_for_status()

    return BeautifulSoup(response.text, "html.parser")


## Extracting Movie Data from Pages

Next, `extract_movies_with_links` walks the rows of the results table on The Numbers, extracts each movie title with its link, and stores the pair in a dictionary. This maps each title to its relative detail-page URL, so we can visit every film's page programmatically. Because the dictionary is keyed by title, two films with the same title on one page would collapse into a single entry.

In [None]:
def extract_movies_with_links(soup):
    movie_dict = {}

    # Loop through each row in the table (excluding the header)
    rows = soup.find_all("tr")[1:]  # Skip the header row

    for row in rows:
        # Find the column that contains the movie name (it is in <b><a> tag)
        movie_cell = row.find("td", {"class": "data"})
        if movie_cell:
            movie_name_tag = movie_cell.find_next(
                "a"
            )  # Next <a> tag in document order (not necessarily inside this cell)
            if movie_name_tag:
                movie_name = movie_name_tag.text.strip()  # Get the movie name
                movie_link = movie_name_tag["href"]  # Get the href link
                movie_dict[movie_name] = movie_link  # Add to dictionary

    return movie_dict
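
As a quick check of the row-walking logic, the sketch below runs a condensed restatement of the function against a hand-written table. The HTML snippet, the movie names, and the helper name `titles_to_links` are all made up for illustration; the real page layout may differ. It also shows one way to turn the relative links into absolute URLs with `urljoin`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical table: a header row, then a rank cell followed by a linked title.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Movie</th></tr>
  <tr><td class="data">1</td><td><b><a href="/movie/Example-One">Example One</a></b></td></tr>
  <tr><td class="data">2</td><td><b><a href="/movie/Example-Two">Example Two</a></b></td></tr>
</table>
"""


def titles_to_links(soup):
    """Condensed version of extract_movies_with_links, for demonstration only."""
    links = {}
    for row in soup.find_all("tr")[1:]:  # skip the header row
        cell = row.find("td", class_="data")
        link = cell.find_next("a") if cell else None  # next <a> in document order
        if link:
            links[link.get_text(strip=True)] = link["href"]
    return links


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
relative_links = titles_to_links(soup)
absolute_links = {
    title: urljoin("https://www.the-numbers.com", href)
    for title, href in relative_links.items()
}
print(absolute_links)
```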


## Processing Detailed Movie Information

This function extracts additional information from a mobile layout page, such as synopsis, budget, running time, cast/crew details, and more. We've designed it to handle various data formats and structures on the page, extracting information from:
- Synopsis sections
- Financial metrics tables
- Movie details tables
- Cast listings
- Production and technical credits

The function organizes all this information into a structured dictionary that we'll later incorporate into our dataset.
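
The function below keeps every metric as text. One detail worth isolating is the opening-weekend cell, which holds both a dollar amount and a percentage in parentheses. The sketch here mirrors that splitting step and adds an optional numeric conversion. The sample string and the helper names `split_opening_weekend` and `money_to_int` are assumptions for illustration, not part of the notebook.

```python
def split_opening_weekend(value):
    """Split an opening-weekend cell into (amount_text, percent_text)."""
    if "(" in value and ")" in value:
        amount, paren_part = value.split("(", 1)
        # Keep only the first token inside the parentheses, e.g. "35.4%".
        percent = paren_part.split(")")[0].split()[0]
        return amount.strip(), percent.strip()
    return value.strip(), ""


def money_to_int(text):
    """Convert text such as '$12,345,678' to an int; return None for empty text."""
    digits = text.replace("$", "").replace(",", "").strip()
    return int(digits) if digits else None


amount, percent = split_opening_weekend("$12,345,678 (35.4% of total gross)")
print(amount, percent, money_to_int(amount))
```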

In [None]:
def extract_movie_info_mobile(soup):
    """
    Extracts the desired movie information from a mobile-layout page.

    Returns a dictionary with the following keys:
      - synopsis: the movie synopsis text.
      - opening_weekend: monetary value for opening weekend.
      - percent_of_total_gross: the percentage (e.g. "35.4%") from the same cell.
      - production_budget: text as appears in the metrics.
      - domestic_release_date: the first date from the Domestic Releases row.
      - running_time: running time (e.g. "98 minutes").
      - keywords: a comma-separated string of keywords.
      - source: the source text.
      - genre: genre text.
      - production_method: production method text.
      - creative_type: creative type text.
      - production_financing_companies: financing companies text.
      - production_countries: production countries.
      - languages: the languages.
      - lead_ensemble_members: list of dicts with "actor" and "role"
      - production_technical_credits: dict mapping each role to a list of people (credit rows are read until an <hr> is encountered)

    Args:
        soup (BeautifulSoup): Parsed HTML of the movie page (mobile layout).

    Returns:
        dict: A dictionary with the extracted movie information.
    """
    result = {}

    summary = soup.find("div", id="summary_mobile")
    if summary:
        # Extract Synopsis: find the <h2> with "Synopsis" and its next <p>.
        syn_h2 = summary.find("h2", string=lambda s: s and "Synopsis" in s)
        if syn_h2:
            syn_p = syn_h2.find_next_sibling("p")
            # In case the <p> includes extra content (like "Metrics"), split at "Metrics"
            result["synopsis"] = (
                syn_p.get_text(strip=True).split("Metrics")[0] if syn_p else ""
            )
        else:
            result["synopsis"] = ""

        # Extract Metrics.
        # Skip the table with id "movie_ratings" if present.
        metrics_table = summary.find("table", id=lambda i: i != "movie_ratings")
        result["opening_weekend"] = ""
        result["percent_of_total_gross"] = ""
        result["production_budget"] = ""
        if metrics_table:
            for row in metrics_table.find_all("tr"):
                cells = row.find_all("td")
                if len(cells) >= 2:
                    label = cells[0].get_text(strip=True).replace("\xa0", " ")
                    value = cells[1].get_text(" ", strip=True)
                    if "Opening Weekend:" in label:
                        if "(" in value and ")" in value:
                            open_val, paren_part = value.split("(", 1)
                            percent = paren_part.split(")")[0].split()[0]
                            result["opening_weekend"] = open_val.strip()
                            result["percent_of_total_gross"] = percent.strip()
                        else:
                            result["opening_weekend"] = value.strip()
                    elif "Production Budget:" in label:
                        result["production_budget"] = value.strip()

        # Extract Movie Details.
        details_h2 = summary.find("h2", string=lambda s: s and "Movie Details" in s)
        # Initialize keys with empty values.
        for field in [
            "domestic_release_date",
            "running_time",
            "keywords",
            "source",
            "genre",
            "production_method",
            "creative_type",
            "production_financing_companies",
            "production_countries",
            "languages",
        ]:
            result[field] = ""
        if details_h2:
            details_table = details_h2.find_next("table")
            if details_table:
                for row in details_table.find_all("tr"):
                    cells = row.find_all("td")
                    if len(cells) >= 2:
                        lab = cells[0].get_text(strip=True).replace("\xa0", " ")
                        val = cells[1].get_text(" ", strip=True)
                        if lab.startswith("Domestic Releases:"):
                            # Extract the first date before any parenthesis.
                            result["domestic_release_date"] = val.split("(")[0].strip()
                        elif lab.startswith("Running Time:"):
                            result["running_time"] = val
                        elif lab.startswith("Keywords:"):
                            result["keywords"] = val
                        elif lab.startswith("Source:"):
                            result["source"] = val
                        elif lab.startswith("Genre:"):
                            result["genre"] = val
                        elif lab.startswith("Production Method:"):
                            result["production_method"] = val
                        elif lab.startswith("Creative Type:"):
                            result["creative_type"] = val
                        elif lab.startswith("Production/Financing Companies:"):
                            result["production_financing_companies"] = val
                        elif lab.startswith("Production Countries:"):
                            result["production_countries"] = val
                        elif lab.startswith("Languages:"):
                            result["languages"] = val
    else:
        # If no summary container is found.
        for key in [
            "synopsis",
            "opening_weekend",
            "percent_of_total_gross",
            "production_budget",
            "domestic_release_date",
            "running_time",
            "keywords",
            "source",
            "genre",
            "production_method",
            "creative_type",
            "production_financing_companies",
            "production_countries",
            "languages",
        ]:
            result[key] = ""

    # Extract Lead Ensemble Members and Production/Technical Credits
    # For the mobile layout, assume these appear in a container with id "cast-and-crew_mobile".
    cast_mobile = soup.find("div", id="cast-and-crew_mobile")
    lead_members = []
    prod_tech_credits = []
    role_dict = {}

    if cast_mobile:
        # Extract lead ensemble members.

        lead_header = cast_mobile.find("h1", string=lambda s: s and "Leading Cast" in s)
        if lead_header:
            lead_table = lead_header.find_next("table")
            if lead_table:
                for row in lead_table.find_all("tr"):
                    cells = row.find_all("td")
                    if len(cells) >= 3:
                        actor = cells[0].get_text(strip=True)
                        role = cells[2].get_text(strip=True)
                        lead_members.append({"actor": actor, "role": role})
        # Extract production and technical credits.
        prodtech_header = cast_mobile.find(
            "h1", string=lambda s: s and "Production and Technical Credits" in s
        )

        if prodtech_header:
            credits_table = prodtech_header.find_next("table")
            if credits_table:
                for row in credits_table.find_all("tr"):
                    # Stop processing if a horizontal rule (<hr>) is found in this row.
                    if row.find("hr"):
                        break
                    cells = row.find_all("td")
                    if cells:
                        # Join all cell texts with " | " as separator.
                        row_text = " | ".join(
                            cell.get_text(strip=True) for cell in cells
                        )
                        prod_tech_credits.append(row_text)

                # Only the first collected row is parsed; guard against an empty
                # table so that indexing does not raise an IndexError.
                credit_list = (
                    prod_tech_credits[0].split(" | ") if prod_tech_credits else []
                )

                # Loop through the list to pair people with their roles
                for i in range(0, len(credit_list), 3):
                    person = credit_list[i].strip()
                    role = (
                        credit_list[i + 2].strip() if i + 2 < len(credit_list) else ""
                    )

                    # Add the person to the list for the role in the dictionary
                    if role not in role_dict:
                        role_dict[role] = []
                    role_dict[role].append(person)
    result["lead_ensemble_members"] = lead_members

    result["production_technical_credits"] = role_dict

    return result
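
The role-pairing loop at the end of the function can be tried in isolation. This is a minimal sketch that assumes, as the code above does, that the flattened credits come in groups of three (person, a separator cell, role). The names and the helper `group_people_by_role` are hypothetical, and `defaultdict` is swapped in for the explicit membership check.

```python
from collections import defaultdict


def group_people_by_role(credit_list):
    """Group a flat [person, separator, role, ...] list into {role: [people]}."""
    roles = defaultdict(list)
    for i in range(0, len(credit_list), 3):
        person = credit_list[i].strip()
        # The role sits two cells after the person; fall back to "" if it is missing.
        role = credit_list[i + 2].strip() if i + 2 < len(credit_list) else ""
        roles[role].append(person)
    return dict(roles)


credit_list = [
    "Jane Doe", "", "Director",
    "John Roe", "", "Producer",
    "Ann Poe", "", "Producer",
]
print(group_people_by_role(credit_list))
```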


## Setting Up and Executing the Scraping Process

Now we choose the year range (2000-2025) and define a concurrency setup with 100 workers. We create functions to:
1. Scrape entire years of movie data
2. Extract detailed information for individual movies

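By using `ThreadPoolExecutor`, we parallelize the scraping of individual movie details while processing years sequentially. Handling one year at a time caps the number of in-flight requests at `num_workers`; 100 simultaneous requests is still a heavy load for a single site, so consider lowering the value if the server starts returning 403 or 429 responses.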

After collecting all the movie data, we organize everything into a pandas DataFrame for analysis and export.
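
The fan-out pattern described above can be sketched without any network access. In this illustration `fetch_details` is a stand-in for a real page fetch and `scrape_all` is a hypothetical helper; the point is that wrapping `future.result()` in `try`/`except` lets one failed page be recorded instead of aborting the whole year.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_details(title):
    """Stand-in for a page fetch; fails for one title to show error handling."""
    if title == "Broken Page":
        raise ValueError(f"could not fetch {title}")
    return {"movie_title": title, "title_length": len(title)}


def scrape_all(titles, max_workers=8):
    """Run fetch_details concurrently and separate successes from failures."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_title = {executor.submit(fetch_details, t): t for t in titles}
        for future in as_completed(future_to_title):
            title = future_to_title[future]
            try:
                results.append(future.result())
            except Exception as exc:  # record the failure and keep going
                failures.append((title, str(exc)))
    return results, failures


results, failures = scrape_all(["Example One", "Broken Page", "Example Two"])
print(len(results), "succeeded;", len(failures), "failed")
```

Results arrive in completion order, not submission order, so keep an identifying field such as the title or URL in every record.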

In [None]:
start_year = 2000
end_year = 2025
num_workers = 100
base_url = "https://www.the-numbers.com"
all_movies_data = []


# Function to scrape a single year's movie data
def scrape_year(year):
    print(f"Scraping year {year}")
    url = f"{base_url}/box-office-records/worldwide/all-movies/cumulative/released-in-{year}"

    # Fetch and parse the listing page for the year
    soup = extract_movies(url)
    # Map each movie title to its relative detail-page URL
    movies = extract_movies_with_links(soup)

    year_movie_data = []

    # Function to scrape individual movie details
    def scrape_movie_details(movie_title, movie_url):
        full_url = f"{base_url}{movie_url}"
        movie_info = extract_movie_info_mobile(extract_movies(full_url))
        movie_info["movie_url"] = full_url  # Include the movie URL in the data
        movie_info["movie_title"] = movie_title  # Include the movie title in the data
        return movie_info

    # Use ThreadPoolExecutor to parallelize the scraping of individual movie details
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Submit tasks to scrape each movie concurrently
        futures = [
            executor.submit(scrape_movie_details, movie_title, movie_url)
            for movie_title, movie_url in tqdm(movies.items())
        ]

        # Wait for all tasks to complete and collect the results
        for future in concurrent.futures.as_completed(futures):
            year_movie_data.append(future.result())

    return year_movie_data


# Sequential loop over years, parallelizing the movie detail scraping
for year in range(start_year, end_year + 1):
    # Scrape the movies for the current year
    year_movie_data = scrape_year(year)

    # Add the year data to the all_movies_data list
    all_movies_data.extend(year_movie_data)

# Create a pandas DataFrame from the list of movie data
df = pd.DataFrame(all_movies_data)


Scraping year 2017


100%|██████████| 100/100 [00:19<00:00,  5.17it/s]


Scraping year 2018


100%|██████████| 100/100 [00:16<00:00,  6.21it/s]


Scraping year 2019


100%|██████████| 100/100 [00:22<00:00,  4.50it/s]


Scraping year 2020


100%|██████████| 14/14 [00:00<00:00, 165.49it/s]


Scraping year 2021


100%|██████████| 62/62 [00:07<00:00,  8.80it/s]


Scraping year 2022


100%|██████████| 59/59 [00:01<00:00, 47.15it/s]


Scraping year 2023


100%|██████████| 100/100 [00:15<00:00,  6.30it/s]


Scraping year 2024


100%|██████████| 100/100 [00:14<00:00,  7.00it/s]


Scraping year 2025


0it [00:00, ?it/s]


## Examining Our Collected Dataset

This cell displays data type and null-count information for the generated DataFrame. The run produced 4,625 rows and 18 columns for films released between 2000 and 2025, with details about each film's synopsis, budget, cast, production team, and more. As the progress log above shows, the 2025 listing returned no movies at the time of scraping.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4625 entries, 0 to 4624
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   synopsis                        3623 non-null   object
 1   opening_weekend                 4329 non-null   object
 2   percent_of_total_gross          4329 non-null   object
 3   production_budget               4134 non-null   object
 4   domestic_release_date           4402 non-null   object
 5   running_time                    4545 non-null   object
 6   keywords                        4546 non-null   object
 7   source                          4614 non-null   object
 8   genre                           4621 non-null   object
 9   production_method               4621 non-null   object
 10  creative_type                   4618 non-null   object
 11  production_financing_companies  4062 non-null   object
 12  production_countries            4603 non-null   

## Saving Our Dataset

We export our comprehensive movie dataset to a CSV file for future analysis or as input to other data processing pipelines.

In [None]:
df.to_csv("final_df.csv")
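
One thing to keep in mind when reloading the file: CSV stores everything as text, so nested columns such as `lead_ensemble_members` come back as strings, and empty strings are read as missing values by default. The sketch below shows one possible way to handle this by serialising nested values as JSON before saving. The sample row and the in-memory buffer are assumptions for illustration; this is a suggestion, not the approach the notebook takes.

```python
import io
import json

import pandas as pd

# Hypothetical one-row dataset with an empty synopsis and a nested cast column.
df_demo = pd.DataFrame(
    [
        {
            "movie_title": "Example One",
            "synopsis": "",
            "lead_ensemble_members": [{"actor": "A. Actor", "role": "Hero"}],
        }
    ]
)

# Serialise the nested column as JSON so it can be parsed back reliably.
df_out = df_demo.copy()
df_out["lead_ensemble_members"] = df_out["lead_ensemble_members"].apply(json.dumps)

buffer = io.StringIO()  # stands in for a file on disk
df_out.to_csv(buffer, index=False)  # index=False avoids an extra unnamed column
buffer.seek(0)

df_back = pd.read_csv(buffer)
df_back["lead_ensemble_members"] = df_back["lead_ensemble_members"].apply(json.loads)

print(df_back.loc[0, "lead_ensemble_members"])
print(pd.isna(df_back.loc[0, "synopsis"]))  # the empty string is read back as NaN
```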