# Highest-Grossing Films Data Extraction and Storage in MongoDB

This notebook extracts data from Wikipedia's "List of highest-grossing films" and stores it in a MongoDB database.

In [11]:
!pip install aiohttp motor beautifulsoup4 nest_asyncio requests

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting requests
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting charset-normalizer<4,>=2 (from requests)
  Downloading charset_normalizer-3.4.1-cp311-cp311-macosx_10_9_universal2.whl.metadata (35 kB)
Collecting urllib3<3,>=1.21.1 (from requests)
  Downloading urllib3-2.3.0-py3-none-any.whl.metadata (6.5 kB)
Collecting certifi>=2017.4.17 (from requests)
  Downloading certifi-2025.1.31-py3-none-any.whl.metadata (2.5 kB)
Downloading requests-2.32.3-py3-none-any.whl (64 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m64.9/64.9 kB[0m [31m291.8 kB/s[0m eta [36m0:00:00[0m--:--[0m
[?25hDownloading certifi-2025.1.31-py3-none-any.whl (166 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m166.4/166.4 kB[0m [31m532.7 kB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hDownloading charset_normalizer-3.4.1-cp311-cp311-macosx_10_9_universal2.whl (194 kB)

In [12]:
import re
import json
import asyncio
import aiohttp
from bs4 import BeautifulSoup
from motor.motor_asyncio import AsyncIOMotorClient

import nest_asyncio
nest_asyncio.apply() # Enable nested asyncio loops in Colab

## Constants and URLs
- **BASE_URL:** Base URL for Wikipedia.
- **MAIN_URL:** URL for the Wikipedia page listing highest-grossing films.


In [13]:
BASE_URL = "https://en.wikipedia.org"
MAIN_URL = "https://en.wikipedia.org/wiki/List_of_highest-grossing_films"
MONGODB_URI = "<ADD YOUR MONGODB URI HERE>"

## Step 1: Parse the Main Page
We use `parse_main_page()` to fetch the main Wikipedia page and extract basic details for each film.
The details include:
- Film title  
- Release year  
- Box office revenue  
- URL to the film's detailed page  

placeholders for additional fields (directors, country, production companies, image URL).

In [14]:
def parse_main_page():
    import requests 
    response = requests.get(MAIN_URL)
    if response.status_code != 200:
        raise Exception(f"Failed to load page {MAIN_URL}")
    main_soup = BeautifulSoup(response.content, "html.parser")
    table = main_soup.find("table", class_="wikitable")
    films = []
    # Loop through each table row (skip header)
    for row in table.find_all("tr")[1:]:
        cells = row.find_all(["th", "td"])
        if len(cells) < 5:
            continue
        # Extract title and film URL from the third cell.
        title_cell = cells[2]
        title_link = title_cell.find("a")
        if title_link:
            title = title_link.get_text(strip=True)
            relative_link = title_link.get("href")
            film_url = BASE_URL + relative_link
        else:
            title = title_cell.get_text(strip=True)
            film_url = None

        # Extract Box Office from the fourth cell.
        box_office = cells[3].get_text(strip=True)
        # Extract Release Year from the fifth cell.
        release_year_text = cells[4].get_text(strip=True)
        try:
            release_year = int(re.search(r'\d{4}', release_year_text).group())
        except Exception:
            release_year = None

        film_record = {
            "title": title,
            "release_year": release_year,
            "box_office": box_office,
            "film_url": film_url,
            "directors": None,
            "country": None,
            "production_companies": None,
            "image_url": None,
        }
        films.append(film_record)
    return films

## Step 2: Film Page Scraping
The asynchronous functions below use `aiohttp` to fetch each film's page concurrently and extract:
- **Image URL:** Found in the "infobox-image" cell.
- **Directors:** From the row labeled "Directed by".
- **Country:** We extract the first country listed.
- **Production Companies:** Optional information.
The function `fetch()` performs the HTTP GET request, and `scrape_film_page()` processes the HTML with BeautifulSoup.


In [15]:
async def fetch(session, url):
    """Fetches URL text asynchronously."""
    async with session.get(url) as response:
        if response.status != 200:
            print(f"Failed to load {url}: {response.status}")
            return None
        return await response.text()

In [16]:
async def scrape_film_page(session, film):
    """Scrape additional details for a single film."""
    url = film["film_url"]
    if not url:
        return film
    html = await fetch(session, url)
    if not html:
        return film
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_="infobox")
    if not infobox:
        return film

    # Extract the image URL from the infobox.
    image_url = None
    image_cell = infobox.find("td", class_="infobox-image")
    if image_cell:
        img = image_cell.find("img")
        if img and img.has_attr("src"):
            src = img["src"]
            if src.startswith("//"):
                image_url = "https:" + src
            elif src.startswith("http"):
                image_url = src
            else:
                image_url = BASE_URL + src

    directors = None
    country = None
    production_companies = None

    # Loop through each row in the infobox for additional details.
    for row in infobox.find_all("tr"):
        header = row.find("th")
        data_cell = row.find("td")
        if not header or not data_cell:
            continue
        header_text = header.get_text(strip=True)
        # Extract Directors
        if "Directed by" in header_text:
            directors = data_cell.get_text(separator=", ", strip=True)
        # Extract Country (only the first country)
        if header_text.lower() in ["country", "countries"]:
            raw_country = data_cell.get_text(separator=", ", strip=True)
            country = raw_country.split(",")[0].strip()
        # Extract Production Companies (optional)
        if "Production" in header_text and "company" in header_text.lower():
            production_companies = data_cell.get_text(separator=", ", strip=True)

    film["directors"] = directors
    film["country"] = country
    film["production_companies"] = production_companies
    film["image_url"] = image_url
    return film

## Step 3: Push Data to MongoDB
We use Motor (an asynchronous MongoDB driver) to connect to MongoDB and insert film records.
The schema for each document is as follows:
- **id:** Auto-incremented integer (assigned in our code).
- **title:** Film title.
- **release_year:** Year of release.
- **director:** Director(s).
- **box_office:** Box office revenue.
- **country:** Country of origin.

Optional fields include production companies and the image URL.

In [17]:
async def push_to_mongodb(films):
    """
    Push the film records into MongoDB using Motor.
    """
    client = AsyncIOMotorClient(MONGODB_URI)
    db = client.filmdb
    collection = db.films

    # Clear the collection for a fresh run
    await collection.delete_many({})

    for idx, film in enumerate(films, start=1):
        film["id"] = idx  # Add id field

    # Insert the film records into the collection.
    result = await collection.insert_many(films)
    print(f"Inserted {len(result.inserted_ids)} documents into MongoDB.")


## Step 4: Main Function to Run the Workflow
1. Parses the main page synchronously to obtain a list of films.
2. Uses an aiohttp ClientSession to concurrently scrape each film's page for additional details.
4. Pushes the combined data to MongoDB.

In [18]:
films = parse_main_page() 
async with aiohttp.ClientSession() as session:
    tasks = []
    for film in films:
        if film["film_url"]:
            task = asyncio.create_task(scrape_film_page(session, film))
            tasks.append(task)
    updated_films = await asyncio.gather(*tasks)
    # Update the film records with the updated details.
    films_dict = {film["title"]: film for film in films}
    for film in updated_films:
        films_dict[film["title"]] = film
    films = list(films_dict.values())

for film in films:
    print("--------------------------------------------------")
    print(f"Title: {film['title']}")
    print(f"Release Year: {film['release_year']}")
    print(f"Box Office: {film['box_office']}")
    print(f"Directed by: {film.get('directors', 'Not found')}")
    print(f"Country: {film.get('country', 'Not found')}")
    print(f"Production Companies: {film.get('production_companies', 'Not found')}")
    print(f"Image URL: {film.get('image_url', 'Not found')}")
    print("--------------------------------------------------\n")

# Push the data to MongoDB.
await push_to_mongodb(films)


--------------------------------------------------
Title: Avatar
Release Year: 2009
Box Office: $2,923,706,026
Directed by: James Cameron
Country: United Kingdom
Production Companies: None
Image URL: https://upload.wikimedia.org/wikipedia/en/thumb/d/d6/Avatar_%282009_film%29_poster.jpg/220px-Avatar_%282009_film%29_poster.jpg
--------------------------------------------------

--------------------------------------------------
Title: Avengers: Endgame
Release Year: 2019
Box Office: $2,797,501,328
Directed by: Anthony Russo, Joe Russo
Country: United States
Production Companies: Marvel Studios
Image URL: https://upload.wikimedia.org/wikipedia/en/0/0d/Avengers_Endgame_poster.jpg
--------------------------------------------------

--------------------------------------------------
Title: Avatar: The Way of Water
Release Year: 2022
Box Office: $2,320,250,281
Directed by: James Cameron
Country: United States
Production Companies: None
Image URL: https://upload.wikimedia.org/wikipedia/en/thum

## Step 5: Export films to JSON

In [None]:

async def export_films_to_json():
    client = AsyncIOMotorClient(MONGODB_URI)
    db = client.filmdb
    collection = db.films

    # Retrieve all film documents from the collection.
    films = await collection.find().to_list(length=None)
    
    # Optionally, remove the MongoDB _id field if not needed.
    for film in films:
        film.pop("_id", None)
    
    # Write the documents to a JSON file.
    with open("films.json", "w", encoding="utf-8") as f:
        json.dump(films, f, ensure_ascii=False, indent=4)
    
    print(f"Exported {len(films)} documents to films.json")

if __name__ == '__main__':
    asyncio.run(export_films_to_json())
