# Enhancing Our Movie Dataset with Wikipedia and TMDB

In this notebook, we'll enrich our existing movie dataset by gathering additional information from two valuable sources: Wikipedia and The Movie Database (TMDB) API. Many of our movie records have missing information like synopses, budget details, or cast information. We'll address these gaps systematically using web scraping and API techniques.

In [62]:
import pandas as pd

## Loading and Examining Our Dataset

First, we need to understand what data we already have and identify the gaps we need to fill. We'll load our previously scraped movie dataset and examine its structure to identify missing values across different columns.

In [94]:
df = pd.read_csv("updated_movies.csv")

In [64]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4625 entries, 0 to 4624
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   synopsis                        3788 non-null   object
 1   opening_weekend                 3681 non-null   object
 2   percent_of_total_gross          3681 non-null   object
 3   production_budget               3116 non-null   object
 4   domestic_release_date           3879 non-null   object
 5   running_time                    4296 non-null   object
 6   keywords                        4056 non-null   object
 7   source                          4528 non-null   object
 8   genre                           4570 non-null   object
 9   production_method               4568 non-null   object
 10  creative_type                   4537 non-null   object
 11  production_financing_companies  3069 non-null   object
 12  production_countries            4522 non-null   

In [None]:
df = df.drop(columns=["Unnamed: 0"])

## Examining Specific Missing Data

Before implementing our scraping strategy, let's take a closer look at some specific columns with missing data. We're particularly interested in production companies and budget information, as these fields are often incomplete in our original dataset but readily available on Wikipedia and TMDB.

In [117]:
missing_data = df.isnull().sum().sort_values(ascending=False)
print(missing_data)

production_budget                 1435
production_financing_companies    1397
percent_of_total_gross             944
opening_weekend                    944
synopsis                           837
domestic_release_date              746
keywords                           569
running_time                       329
languages                          269
production_countries               103
source                              97
creative_type                       88
production_method                   57
genre                               55
lead_ensemble_members                0
production_technical_credits         0
movie_url                            0
movie_title                          0
dtype: int64
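
Note that `isnull()` only counts true NaN values. The zero counts for `lead_ensemble_members` and `production_technical_credits` can therefore hide rows that hold an empty placeholder such as the string `[]` (the `df.head()` output later in this notebook shows one for Dolphins). Below is a minimal sketch of a placeholder-aware count using only the standard library. The helper name `count_effectively_missing` and the placeholder set are assumptions for illustration.

```python
import math

# Strings treated as "no data" (an assumed set; adjust to the dataset)
PLACEHOLDERS = {"", "[]", "{}", "nan", "none"}


def count_effectively_missing(values):
    """Count values that are None, NaN, or an empty-looking placeholder string."""
    missing = 0
    for value in values:
        if value is None:
            missing += 1
        elif isinstance(value, float) and math.isnan(value):
            missing += 1
        elif isinstance(value, str) and value.strip().lower() in PLACEHOLDERS:
            missing += 1
    return missing


# Made-up sample values for illustration
sample = ["[{'actor': 'A', 'role': 'B'}]", "[]", float("nan"), None]
print(count_effectively_missing(sample))  # 3
```

On the DataFrame, this could be called as `count_effectively_missing(df["lead_ensemble_members"])`.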


## Setting Up TMDB API Access

To access movie data from The Movie Database (TMDB), we first install `tmdbv3api`, a third-party Python wrapper around the TMDB API. The API will help us fill in missing synopses and other film details.

In [16]:
!pip install tmdbv3api

Collecting tmdbv3api
  Obtaining dependency information for tmdbv3api from https://files.pythonhosted.org/packages/35/fb/9d575292bb7794a7a85bcdbf6c09928aae5ca2ae9f684f7fbbd902e281c4/tmdbv3api-1.9.0-py3-none-any.whl.metadata
  Downloading tmdbv3api-1.9.0-py3-none-any.whl.metadata (8.0 kB)
Downloading tmdbv3api-1.9.0-py3-none-any.whl (25 kB)
Installing collected packages: tmdbv3api
Successfully installed tmdbv3api-1.9.0


## Enriching Synopses using TMDB API

Many of our movie records lack synopses, which are crucial for understanding a film's content. Here we use the TMDB API to search for each movie by title, keep the first result whose release year matches ours, and take its overview as the synopsis.

Because many movies share a title but were released in different years, we parse our `domestic_release_date` strings (e.g., "June 30th, 2000") into dates and compare years before accepting a result. This reduces mismatches but does not eliminate them: any TMDB entry returned for the same search term and year can match, including making-of features and promotional specials, so the retrieved synopses should be spot-checked.

In [None]:
import os
import re
import time
from datetime import datetime

import pandas as pd
from tmdbv3api import TMDb, Movie

# TMDB setup: read the API key from the environment instead of hardcoding it
# (the variable name TMDB_API_KEY is our own choice)
tmdb = TMDb()
tmdb.api_key = os.environ["TMDB_API_KEY"]
tmdb.language = "en"
movie_search = Movie()


# Parse our release date format, e.g. "June 30th, 2000" or "June 30, 2000"
def parse_our_date(date_str):
    if not isinstance(date_str, str):
        return None
    # Drop the ordinal suffix (st, nd, rd, th) that follows the day number
    cleaned = re.sub(r"(\d+)(?:st|nd|rd|th)\b", r"\1", date_str.strip())
    try:
        return datetime.strptime(cleaned, "%B %d, %Y").date()
    except ValueError:
        return None


# Get synopsis only if release date matches
def get_synopsis(title, release_date_str):
    our_date = parse_our_date(release_date_str)
    if not our_date:
        return None

    try:
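        results = movie_search.search(title)
        for r in results:
            # Some results have no release date or an empty overview; skip them
            tmdb_date = getattr(r, "release_date", None)
            overview = getattr(r, "overview", None)
            if not tmdb_date or not overview:
                continue
            tmdb_parsed = datetime.strptime(tmdb_date, "%Y-%m-%d").date()
            if tmdb_parsed.year == our_date.year:
                return overview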
    except Exception as e:
        print(f"Error with {title}: {e}")

    return None


# Only rows where synopsis is missing and we have a release date
mask = df["synopsis"].isnull() & df["domestic_release_date"].notnull()
rows_to_process = df[mask]

for idx, row in rows_to_process.iterrows():
    title = row["movie_title"]
    release_date_str = row["domestic_release_date"]
    synopsis = get_synopsis(title, release_date_str)
    print(title, release_date_str, synopsis)
    if synopsis:
        df.at[idx, "synopsis"] = synopsis
    time.sleep(0.25)  # Respect rate limits

The Perfect Storm June 30th, 2000 In October 1991, a confluence of weather conditions combined to form a killer storm in the North Atlantic. Caught in the storm was the sword-fishing boat Andrea Gail.
Dolphins October 20th, 2000 From the banks of the Bahamas to the seas of Argentina, we go underwater to meet dolphins. Two scientists who study dolphin communication and behaviour lead us on encounters in the wild. Featuring the music of Sting. Nominated for an Academy Award®, Best Documentary, Short Subject, 2000.
X-Men July 14th, 2000 While Senator Kelly addresses a senate committee about the supposed mutant menace, we learn about the making of the movie, X-Men.
Pokemon 2000 July 21st, 2000 A promotional concert/behind the scenes special for the American release of Pokémon: The Movie 2000.
U-571 April 21st, 2000 In the midst of World War II, the battle under the sea rages and the Nazis have the upper hand as the Allies are unable to crack their war codes. However, after a wrecked U-boat
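
The output above includes questionable matches: for X-Men and Pokemon 2000, the returned overviews describe a making-of feature and a promotional special rather than the films themselves. One possible refinement, sketched here under assumptions, is to require that the normalized title also matches before accepting a result. The candidate dictionaries are a stand-in for TMDB search results, and `pick_best_match` and `normalize_title` are hypothetical helpers, not part of `tmdbv3api`.

```python
import re
import unicodedata


def normalize_title(title):
    """Lowercase, strip accents and punctuation so 'Pokémon' equals 'Pokemon'."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", " ", ascii_only.lower()).strip()


def pick_best_match(our_title, our_year, candidates):
    """Return the first candidate whose normalized title and release year both match."""
    wanted = normalize_title(our_title)
    for candidate in candidates:
        release_date = candidate.get("release_date") or ""
        if not release_date.startswith(str(our_year)):
            continue
        if normalize_title(candidate.get("title", "")) == wanted:
            return candidate
    return None


# Made-up candidates for illustration only
candidates = [
    {"title": "Example Movie: Behind the Scenes", "release_date": "2000-07-01",
     "overview": "A making-of special."},
    {"title": "Example Movie", "release_date": "1985-03-02",
     "overview": "An older film with the same title."},
    {"title": "Éxample Movie", "release_date": "2000-07-14",
     "overview": "The film we want."},
]

match = pick_best_match("Example Movie", 2000, candidates)
print(match["overview"] if match else None)  # The film we want.
```

Exact title matching is deliberately strict: it avoids featurettes that merely contain the film's name, at the cost of missing films that TMDB lists under a different title.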

## Saving Our Progress and Examining Updates

After enriching our dataset with TMDB synopses, we'll save the intermediate results to a CSV file and examine a sample of the updated records. This ensures we don't lose our progress if something goes wrong in subsequent steps, and it allows us to verify that the synopsis enrichment worked correctly.

In [115]:
df.to_csv("updated_movies2.csv", index=False)

In [116]:
df.head()

Unnamed: 0,synopsis,opening_weekend,percent_of_total_gross,production_budget,domestic_release_date,running_time,keywords,source,genre,production_method,creative_type,production_financing_companies,production_countries,languages,lead_ensemble_members,production_technical_credits,movie_url,movie_title
0,"Greg Focker is ready to marry his girlfriend, ...","$28,623,300",17.2%,"$55,000,000 (worldwide box office is 6.0 times...","October 6th, 2000",108 minutes,"In-Laws / Future In-Laws , Frat Pack , Family ...",Based on Movie,Comedy,Live Action,Contemporary Fiction,"Universal Pictures , DreamWorks Pictures , Nan...",United States,English,"[{'actor': 'Robert De Niro', 'role': 'Jack Byr...","{'Director': ['Jay Roach'], 'Screenwriter': ['...",https://www.the-numbers.com/movie/Meet-the-Par...,Meet the Parents
1,"In October 1991, a confluence of weather condi...","$41,325,042",22.6%,"$120,000,000 (worldwide box office is 2.7 time...","June 30th, 2000",129 minutes,"Visual Effects , Disaster , Extreme Weather , ...",Based on Factual Book/Article,Drama,Live Action,Dramatization,"Baltimore Spring Creek Pictures , Radiant Prod...",United States,English,"[{'actor': 'George Clooney', 'role': 'Captain ...","{'Director': ['Wolfgang Petersen'], 'Screenwri...",https://www.the-numbers.com/movie/Perfect-Stor...,The Perfect Storm
2,From the banks of the Bahamas to the seas of A...,,,,"October 20th, 2000",39 minutes,"Nature Documentary , Underwater , Animal Lead ...",Based on Real Life Events,Documentary,Live Action,Factual,[MacGillivray Freeman Films],United States,English,[],"{'Director': ['Greg MacGillivray'], 'Screenwri...",https://www.the-numbers.com/movie/Dolphins-(20...,Dolphins
3,Advertising executive Nick Marshall is as cock...,"$33,614,543",18.4%,"$65,000,000 (worldwide box office is 5.8 times...","December 15th, 2000",126 minutes,"Romance , Psychics , Battle of the Sexes , Set...",Original Screenplay,Romantic Comedy,Live Action,Contemporary Fiction,"Paramount Pictures , Icon Productions , Wind D...",United States,English,"[{'actor': 'Mel Gibson', 'role': 'Nick Marshal...","{'Director': ['Nancy Meyers'], 'Screenwriter':...",https://www.the-numbers.com/movie/What-Women-W...,What Women Want
4,When a devious mastermind embroils them in a p...,"$40,128,550",32.0%,"$90,000,000 (worldwide box office is 2.9 times...","November 3rd, 2000",98 minutes,"Visual Effects , Action Comedy , Private Inves...",Based on TV,Action,Live Action,Contemporary Fiction,"Columbia Pictures , Leonard Goldberg , Flower ...",United States,English,"[{'actor': 'Cameron Diaz', 'role': 'Natalie'},...","{'Director': ['McG*'], 'Screenwriter': ['Ryan ...",https://www.the-numbers.com/movie/Charlies-Ang...,Charlie's Angels
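
Since the point of this checkpoint is to survive a failure, it can help to make the write itself safe: if the kernel dies partway through `to_csv`, a partially written file could replace a good one. Here is a minimal sketch, assuming a hypothetical helper `atomic_write` that writes to a temporary file in the same directory and then renames it over the target.

```python
import os
import tempfile


def atomic_write(path, write_fn):
    """Call write_fn(tmp_path), then atomically move the result to path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)  # write_fn reopens the file by name
    try:
        write_fn(tmp_path)
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        # Clean up the temporary file and leave any existing target untouched
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def write_demo(tmp_path):
    # Stand-in for df.to_csv; the row content is made up
    with open(tmp_path, "w", encoding="utf-8") as handle:
        handle.write("movie_title,synopsis\nDolphins,example synopsis\n")


demo_path = os.path.join(tempfile.gettempdir(), "atomic_write_demo.csv")
atomic_write(demo_path, write_demo)
```

With the DataFrame from this notebook, the call would look like `atomic_write("updated_movies2.csv", lambda p: df.to_csv(p, index=False))`.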


## Preparing for Wikipedia Scraping

Next we install `tabulate`, a package that formats tabular data as readable plain-text tables. This will be useful when examining the results of our Wikipedia data extraction.

In [10]:
pip install tabulate

Collecting tabulate
  Obtaining dependency information for tabulate from https://files.pythonhosted.org/packages/40/44/4a5f08c96eb108af5cb50b41f76142f0afa346dfa99d5296fe7202a11854/tabulate-0.9.0-py3-none-any.whl.metadata
  Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB)
Downloading tabulate-0.9.0-py3-none-any.whl (35 kB)
Installing collected packages: tabulate
Successfully installed tabulate-0.9.0
Note: you may need to restart the kernel to use updated packages.


## Building a Wikipedia Data Extraction System

For information that isn't available through TMDB, we'll turn to Wikipedia. Wikipedia articles about films typically contain a wealth of structured information in their infoboxes, including production companies, budget figures, cast lists, and crew credits.

We're creating a comprehensive system that:
1. Constructs appropriate Wikipedia page titles (e.g., "Avatar (2009 film)")
2. Fetches and parses the HTML content of Wikipedia articles
3. Extracts structured data from movie infoboxes
4. Formats data into consistent structures for our database
5. Handles special cases like monetary values in different currencies
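
Step 1 deserves a note: not every film article is titled "Title (year film)". Wikipedia generally uses that form only when disambiguation is needed, and otherwise uses "Title (film)" or the bare title. Here is a sketch of a fallback order, on the assumption that trying several candidates is acceptable. `candidate_wikipedia_titles` is a hypothetical helper; the bare title may resolve to a non-film article, so it comes last.

```python
def candidate_wikipedia_titles(title, year):
    """Return page titles to try, from most to least specific."""
    title = title.strip()
    return [f"{title} ({year} film)", f"{title} (film)", title]


print(candidate_wikipedia_titles("Avatar", 2009))
# ['Avatar (2009 film)', 'Avatar (film)', 'Avatar']
```

Each candidate could be passed to `wiki.page(...)` in turn, stopping at the first one for which `page.exists()` is true.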

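The most complex part is parsing financial information, as Wikipedia lists budgets and box office figures in various formats (e.g., "$25 million", "US$13–20 million", or "₹300 crore"). Our parsing logic averages ranges, expands "million" and "billion", and converts rupee amounts given in crore to USD at a fixed approximate rate. Amounts in euros or pounds are parsed numerically but not converted, so those values should be treated with caution.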
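
As a worked example of the rupee conversion: one crore is 10,000,000, so "₹300 crore" is 3,000,000,000 rupees. At an assumed rate of 83 rupees per dollar (the rough figure this notebook uses; real rates vary over time), that comes to about 36.1 million USD. The helper name `crore_to_usd` is illustrative.

```python
RUPEES_PER_CRORE = 10_000_000
INR_PER_USD = 83  # assumed fixed rate; actual rates change daily


def crore_to_usd(crore, inr_per_usd=INR_PER_USD):
    """Convert an amount in crore rupees to US dollars at a fixed rate."""
    return crore * RUPEES_PER_CRORE / inr_per_usd


print(f"${crore_to_usd(300):,.0f}")  # $36,144,578
```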

In [None]:
import wikipediaapi
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

wiki = wikipediaapi.Wikipedia(user_agent="CoolBot/0.0", language="en")


def build_wikipedia_title(title, year):
    return f"{title} ({year} film)"


def fetch_wikipedia_html(title, year):
    page_title = build_wikipedia_title(title, year)
    page = wiki.page(page_title)
    if page.exists():
        url = page.fullurl
        # Identify ourselves as in the API client, and don't hang on slow responses
        response = requests.get(
            url, headers={"User-Agent": "CoolBot/0.0"}, timeout=10
        )
        if response.status_code == 200:
            return BeautifulSoup(response.content, "html.parser")
    return None


def is_missing(val):
    # Check containers first: pd.isna() on a list returns an array, whose
    # truth value is ambiguous and would raise before we reach the length check
    if isinstance(val, (list, dict, set)):
        return len(val) == 0

    if val is None:
        return True

    # Catch literal "[]", "{}", "nan", etc.
    if isinstance(val, str):
        return val.strip().lower() in {"", "[]", "{}", "nan", "none"}

    try:
        return bool(pd.isna(val))
    except (TypeError, ValueError):
        return False


def extract_infobox_data(soup):
    data = {}
    # A CSS selector matches both classes regardless of order or extra classes
    infobox = soup.select_one("table.infobox.vevent")
    if not infobox:
        return data

    for row in infobox.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        if header and value:
            key = header.get_text(strip=True).lower()

            # If the value is a list (<ul><li>...</li></ul>), extract individual items
            if value.find("ul"):
                items = [li.get_text(strip=True) for li in value.find_all("li")]
                data[key] = items
            else:
                val = value.get_text(separator=", ", strip=True)
                data[key] = val
    return data


def format_financing_companies(info):
    entries = []

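    # get_text(strip=True) can fuse a header split over two lines into one word,
    # so accept both the fused and the spaced forms, singular and plural
    for key in [
        "productioncompany",
        "productioncompanies",
        "production company",
        "production companies",
        "distributed by",
    ]:
        val = info.get(key)
        if not val:
            continue

        if isinstance(val, list):
            raw_list = val
        else:
            # Clean malformed ref markers like [, 1, ] or [, 12, ]
            cleaned = re.sub(r"\[\s*,?\s*\w+\s*,?\s*\]", "", val)
            raw_list = [v.strip() for v in cleaned.split(",") if v.strip()]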

        entries.extend(raw_list)

    return entries if entries else None


def format_technical_credits(info):
    role_map = {
        "directed by": "Director",
        "written by": "Screenwriter",
        "screenplay by": "Screenwriter",
        "story by": "Story Creator",
        "produced by": "Producer",
        "cinematography": "Cinematographer",
        "edited by": "Editor",
        "music by": "Composer",
        "production design": "Production Designer",
        "costume design": "Costume Designer",
        "distributed by": "Distributor",
    }

    credits = {}
    for key, val in info.items():
        std_role = role_map.get(key.lower())
        if not std_role:
            continue

        # Handle lists and strings
        if isinstance(val, list):
            names = [v.strip() for v in val if isinstance(v, str) and v.strip()]
        elif isinstance(val, str):
            names = [v.strip() for v in re.split(r",| and |;| & ", val) if v.strip()]
        else:
            continue  # Skip weird types

        if names:
            credits.setdefault(std_role, []).extend(names)

    return credits if credits else None


def format_cast(info):
    cast_key = next((k for k in ["starring", "cast"] if k in info), None)
    if not cast_key:
        return []

    cast_val = info[cast_key]

    # If it's a list already (from <ul><li>...</li></ul>)
    if isinstance(cast_val, list):
        cast_list = cast_val
    else:
        cast_str = cast_val.replace("\xa0", " ")
        cast_list = [a.strip() for a in cast_str.split(",") if a.strip()]

    parsed_cast = []
    for entry in cast_list:
        match = re.match(r"(.+?)\s+as\s+(.+)", entry, re.IGNORECASE)
        if match:
            actor, role = match.groups()
            parsed_cast.append({"actor": actor.strip(), "role": role.strip()})
        else:
            parsed_cast.append({"actor": entry.strip(), "role": ""})

    return parsed_cast


INR_PER_USD = 83  # Rough 2025 exchange rate


def parse_money(value):
    """
    Parses money values in various formats and returns a float.
    Supports:
      - $25 million
      - US$13–20 million (ranges are averaged)
      - ₹300 crore → converted to USD
    Only rupee (crore) amounts are converted; € and £ amounts are returned
    in their original currency.
    Returns None if the value can't be parsed.
    """
    if not isinstance(value, str):
        return None

    # Remove footnote artifacts like [, 1, ]
    cleaned = re.sub(r"\[\s*,?.*?,?\s*\]", "", value)
    cleaned = cleaned.replace("\xa0", " ").strip()

    # A number must start with a digit, so stray "." or "," never match
    number = r"(\d[\d,]*(?:\.\d+)?)"

    def to_float(text):
        return float(text.replace(",", ""))

    # INR CRORE FORMAT
    match = re.search(
        rf"(?:₹|Rs\.?)\s*{number}(?:\s*[–-]\s*{number})?\s*crore",
        cleaned,
        re.IGNORECASE,
    )
    if match:
        low, high = match.groups()
        avg = (to_float(low) + to_float(high)) / 2 if high else to_float(low)
        return avg * 10_000_000 / INR_PER_USD

    # USD / EURO / GBP FORMAT
    match = re.search(
        rf"(?:US\$|\$|€|£)?\s*{number}(?:\s*[–-]\s*{number})?\s*(million|billion)?",
        cleaned,
        re.IGNORECASE,
    )
    if not match:
        return None

    low, high, unit = match.groups()
    avg = (to_float(low) + to_float(high)) / 2 if high else to_float(low)
    multiplier = {"million": 1_000_000, "billion": 1_000_000_000}.get(
        (unit or "").lower(), 1
    )
    return avg * multiplier


def format_budget(info):
    budget_str = info.get("budget")
    if not budget_str:
        return None

    b = parse_money(budget_str)
    if not b:
        return None

    gross = None
    for key in ["box office", "gross revenue"]:
        gross_str = info.get(key)
        if gross_str:
            g = parse_money(gross_str)
            if g:
                gross = g
                break

    formatted_budget = f"${int(b):,}"
    if gross:
        ratio = round(gross / b, 1)
        return f"{formatted_budget} (worldwide box office is {ratio} times production budget)"

    return formatted_budget


def fill_movie_data_from_wiki(row):
    title = row.get("movie_title")
    release_date = row.get("domestic_release_date")
    year = None
    try:
        year = int(re.search(r"\d{4}", release_date).group())
    except (AttributeError, TypeError):  # No 4-digit year, or date is NaN
        return row  # Can't proceed without year

    print(title, year)

    soup = fetch_wikipedia_html(title, year)
    if not soup:
        return row

    info = extract_infobox_data(soup)
    print(info)

    print(row)
    print("********")

    if is_missing(row.get("production_technical_credits")):
        credits = format_technical_credits(info)
        if credits:
            row["production_technical_credits"] = credits

    if is_missing(row.get("lead_ensemble_members")):
        cast = format_cast(info)
        if cast:
            row["lead_ensemble_members"] = cast

    if is_missing(row.get("production_budget")):
        budget = format_budget(info)
        if budget:
            row["production_budget"] = budget

    if is_missing(row.get("production_financing_companies")):
        financing = format_financing_companies(info)
        if financing:
            row["production_financing_companies"] = financing

    print(row)

    return row
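
One detail worth making concrete is the footnote cleanup. Because `get_text(separator=", ")` joins every text node with a comma, a citation marker such as `[1]` can come out as `[, 1, ]` in the extracted string. Here is a small sketch of stripping such markers before splitting on commas. `strip_footnotes` is a hypothetical helper and the sample string is made up.

```python
import re

# Matches "[1]", "[ 12 ]", "[a]" and the comma-mangled form "[, 1, ]"
FOOTNOTE = re.compile(r"\[\s*,?\s*\w+\s*,?\s*\]")


def strip_footnotes(text):
    """Remove citation markers and tidy the leftover separators."""
    without_refs = FOOTNOTE.sub("", text)
    parts = [part.strip() for part in without_refs.split(",")]
    return ", ".join(part for part in parts if part)


print(strip_footnotes("Studio One, [, 1, ], Studio Two, [, 12, ]"))
# Studio One, Studio Two
```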

## Applying Wikipedia Data Extraction to Fill Missing Information

Now we'll apply our Wikipedia extraction system to the movies that still have missing information. We're targeting four key fields:
- Production budgets
- Cast members
- Technical credits (directors, writers, etc.)
- Production and financing companies

For each movie with missing data in these fields, we'll attempt to locate its Wikipedia page, extract the relevant information, and update our dataset accordingly. This process complements the TMDB data we gathered earlier, creating a more complete picture of each film.

In [None]:
target_fields = [
    "production_budget",
    "lead_ensemble_members",
    "production_technical_credits",
    "production_financing_companies",
]

# This checks per row if any target column is "missing"
filtered_df = df[
    df[target_fields].apply(lambda row: any(is_missing(v) for v in row), axis=1)
]

for idx, row in filtered_df.iterrows():
    df.loc[idx] = fill_movie_data_from_wiki(row)
    print(f"Updated row {idx}")
    print("-------------------------")

Dolphins 2000
{'directed by': 'Greg MacGillivray', 'written by': 'Tim Cahill, Stephen Judson', 'narrated by': 'Pierce Brosnan', 'cinematography': 'Greg MacGillivray, Brad Ohlund', 'edited by': 'Stephen Judson', 'music by': 'Sting', 'distributed by': 'MacGillivray Freeman Films', 'release date': ['April\xa014,\xa02000(2000-04-14)'], 'running time': '39 min.', 'country': 'United States', 'language': 'English'}
synopsis                          From the banks of the Bahamas to the seas of A...
opening_weekend                                                                 NaN
percent_of_total_gross                                                          NaN
production_budget                                                               NaN
domestic_release_date                                            October 20th, 2000
running_time                                                             39 minutes
keywords                          Nature Documentary , Underwater , Animal Lead ...


## Saving Our Fully Enriched Dataset

With both TMDB and Wikipedia data incorporated, our movie dataset is now much more complete. We'll save the final enriched dataset to `updated_movies2.csv`, which overwrites the intermediate checkpoint written after the TMDB step, leaving it ready for analysis or further processing. The combined data from multiple sources gives us a richer set of attributes for each movie, enabling more comprehensive analysis of trends in the film industry from 2000 to 2025.

In [None]:
df.to_csv("updated_movies2.csv", index=False)