## Get Data

This notebook merges two datasets: Box Office Mojo (boxofficemojo.com) and The Movie Database (TMDb). The table below summarizes what each one provides:

| Feature / Data Point                                  | **Box Office Mojo (BOM)** 🏛               | **TMDb (The Movie Database)** 🎬                       |
| ----------------------------------------------------- | ------------------------------------------ | ------------------------------------------------------ |
| **Domestic box office grosses**                       | ✅ Accurate (daily/weekly/yearly US/Canada) | ❌ Not provided (only worldwide revenue, often missing) |
| **Worldwide box office**                              | ⚠️ Limited / inconsistent                  | ✅ Available (but often incomplete)                     |
| **Theater counts**                                    | ✅ Number of theaters per release           | ❌ Not available                                        |
| **Budgets**                                           | ❌ Rarely included                          | ✅ Included (when known)                                |
| **Genres**                                            | ❌ Not available                            | ✅ Rich genre metadata (IDs + names)                    |
| **Languages**                                         | ❌ Not available                            | ✅ Original language + spoken languages                 |
| **Production companies/countries**                    | ❌ Not available                            | ✅ Provided                                             |
| **Movie metadata** (runtime, overview, posters, etc.) | ❌ Not available                            | ✅ Extensive metadata                                   |
| **Upcoming movies (future years)**                    | ❌ Only past releases                       | ✅ Includes “In Production”, “Planned”, 2026+           |
| **Filters** (e.g. exclude TV, docs, non-English)      | ❌ No filters                               | ✅ Yes (via metadata)                                   |
| **Coverage**                                          | ✅ US theatrical releases only              | ✅ Worldwide releases (films + TV)                      |


### Setup

In [1]:
# ---- 0) Setup / imports (install if missing) ----
import subprocess, sys, os, time, json, requests
import pandas as pd
import numpy as np
import warnings
from datetime import datetime
from bs4 import BeautifulSoup

warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)

from pathlib import Path

# make sure the current folder (code/) is on sys.path
sys.path.insert(0, str(Path.cwd()))

import importlib, movie_lists
print(movie_lists.__file__) 
importlib.reload(movie_lists)        

from movie_lists import (
    MARVEL_MCU_FILMS, DC_FILMS, STAR_WARS_FILMS, FAST_FURIOUS_FILMS, 
    WIZARDING_WORLD_FILMS, ALL_LIVE_ACTION_REMAKES,
    MEDIA_ADAPTATIONS, ALL_SUPERHERO_FILMS,
    REMAKE_PATTERNS, REMAKE_TITLE_INDICATORS,
    FRANCHISE_SEQUELS, TITLE_CORRECTIONS
)

def _ensure(pkg, import_name=None):
    try:
        __import__(import_name or pkg.replace("-", "_"))
        print(f"✅ {pkg} already installed")
    except Exception:
        print(f"📦 Installing {pkg}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", pkg])
        print(f"✅ {pkg} installed")

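# beautifulsoup4 is imported as `bs4`, so pass the import name explicitly;
# otherwise the check fails and pip runs on every execution.
for pkg, import_name in [("requests", None), ("beautifulsoup4", "bs4")]:
    _ensure(pkg, import_name)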


/Users/jasmineplows/Documents/California/Projects/box_office/code/movie_lists.py
✅ requests already installed
📦 Installing beautifulsoup4...
✅ beautifulsoup4 installed


In [2]:
# ---- 1) Globals ----
DATA_DIR = "../data"
os.makedirs(DATA_DIR, exist_ok=True)

START_YEAR = 2015
END_YEAR = 2026
FORCE_REFRESH = False   # flip to True to re-scrape/re-fetch


In [3]:
# ---- 2) Cache wrapper - so you don't have to rescrape every time ----
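def load_or_build_csv(path, builder_fn, *, force=None, name="dataset"):
    """
    If `path` exists and we are not forcing a refresh, load the cached CSV.
    Otherwise, call `builder_fn()` -> DataFrame, save it to CSV, and return it.
    `force=None` reads FORCE_REFRESH at call time, so flipping the global
    takes effect without re-running this cell.
    """
    if force is None:
        force = FORCE_REFRESH
    try:
        if (not force) and os.path.exists(path) and os.path.getsize(path) > 0: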
            print(f"🗂️  Using cached {name}: {os.path.relpath(path)}")
            return pd.read_csv(path)
    except Exception as e:
        print(f"⚠️  Cache read issue for {name}: {e} — rebuilding")

    print(f"🔄 Building {name} …")
    df = builder_fn()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)  # dirname is "" for a bare filename
    df.to_csv(path, index=False)
    print(f"💾 Saved {name} → {os.path.relpath(path)}  ({len(df)} rows)")
    return df
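
The load-or-build pattern above can be exercised without any network access. The sketch below is a pandas-free stand-in, not the notebook's function: the names `load_or_build_rows` and `fake_builder` are hypothetical, and the single row is made up. It shows the builder being skipped on the second call and re-run when `force=True`.

```python
import csv
import os
import tempfile

def load_or_build_rows(path, builder_fn, *, force=False):
    """Return cached rows from `path`, or call `builder_fn()` and cache its result."""
    if not force and os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    rows = builder_fn()  # assumed to return a non-empty list of dicts
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows

calls = {"n": 0}

def fake_builder():
    """Hypothetical stand-in for a slow scrape; counts how often it runs."""
    calls["n"] += 1
    return [{"title": "Example Film", "domestic_revenue": "123"}]

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "data", "demo.csv")
    first = load_or_build_rows(cache_path, fake_builder)               # builds and saves
    second = load_or_build_rows(cache_path, fake_builder)              # served from the cache
    third = load_or_build_rows(cache_path, fake_builder, force=True)   # rebuilt on request
    builder_runs = calls["n"]
```

Counting builder calls like this is a cheap way to confirm that a cached run really skips the scrape.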


### Box Office Mojo Scrape Function

In [4]:
# ---- 3) Fetch All-Time Domestic Grosses (lifetime) + distributors (integrated) ----
import time, warnings
from bs4 import BeautifulSoup

def _norm_title_local(title: str) -> str:
    if pd.isna(title): return title
    t = str(title).strip()
    for ep in ["Episode I - ","Episode II - ","Episode III - ",
               "Episode IV - ","Episode V - ","Episode VI - ",
               "Episode VII - ","Episode VIII - ","Episode IX - "]:
        t = t.replace(ep, "")
    t = t.replace(" & ", " and ")
    return " ".join(t.split())

def _fetch_bom_distributors_for_year(year, max_pages=50, per_page=200, sleep=0.35, retries=3, retry_sleep=1.0):
    base = f"https://www.boxofficemojo.com/year/{year}/"
    headers = {
        "User-Agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/127.0.0.0 Safari/537.36"),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Connection": "keep-alive",
    }
    frames, offset = [], 0
    for _ in range(max_pages):
        url = base if offset == 0 else f"{base}?offset={offset}"

        html = None
        last_exc = None
        for _try in range(retries):
            try:
                r = requests.get(url, headers=headers, timeout=20)
                if r.status_code == 200 and ("mojo-body-table" in r.text or "<table" in r.text):
                    html = r.text
                    break
                last_exc = RuntimeError(f"HTTP {r.status_code} / unexpected content")
            except Exception as e:
                last_exc = e
            time.sleep(retry_sleep)
        if html is None:
            break

        soup = BeautifulSoup(html, "html.parser")
        table = soup.select_one("table.a-bordered.a-horizontal-stripes.mojo-body-table")
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=UserWarning)
            if table is not None:
                tables = pd.read_html(str(table))
            else:
                tables = pd.read_html(html)

        # table with Distributor + Release/Release Group
        candidates = [t for t in tables if ("Distributor" in t.columns and
                                            (("Release" in t.columns) or ("Release Group" in t.columns)))]
        if not candidates:
            break
        df = candidates[0].copy()
        if df.empty:
            break

        df["title"] = df["Release"] if "Release" in df.columns else df["Release Group"]
        df["release_year"] = year
        df = df[["title", "release_year", "Distributor"]].rename(columns={"Distributor": "distributor"})
        frames.append(df)

        if len(df) < per_page:
            break
        offset += per_page
        time.sleep(sleep)

    out = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=["title","release_year","distributor"])
    out["title_normalized"] = out["title"].astype(str).apply(_norm_title_local)
    out = (out.sort_values(["release_year","distributor"])
             .drop_duplicates(subset=["title_normalized","release_year"], keep="first"))
    return out[["title_normalized","release_year","distributor"]]

def fetch_alltime_domestic(max_pages=50, sleep=0.35, per_page=200, retries=3, retry_sleep=1.0):
    """
    Scrape All-Time Domestic lifetime list AND attach distributor (via year pages for START_YEAR..END_YEAR).
    Returns columns: ['title','domestic_revenue','release_year','rank','distributor']
    """
    endpoints = [
        "https://www.boxofficemojo.com/chart/domestic/",
        "https://www.boxofficemojo.com/chart/top_lifetime_gross/",
        "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=NA",
    ]
    headers = {
        "User-Agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/127.0.0.0 Safari/537.36"),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Connection": "keep-alive",
    }

    frames, used_base = [], None
    for base in endpoints:
        frames.clear()
        used_base = base
        offset = 0
        for _ in range(max_pages):
            url = f"{base}&offset={offset}" if ("?" in base and offset) else (f"{base}?offset={offset}" if offset else base)
            html, last_exc = None, None
            for _try in range(retries):
                try:
                    r = requests.get(url, headers=headers, timeout=20)
                    if r.status_code == 200 and r.text and ("mojo-body-table" in r.text or "<table" in r.text):
                        html = r.text
                        break
                    last_exc = RuntimeError(f"HTTP {r.status_code}")
                except Exception as e:
                    last_exc = e
                time.sleep(retry_sleep)
            if html is None:
                if offset == 0:
                    frames.clear()
                break

            soup = BeautifulSoup(html, "html.parser")
            table = soup.select_one("table.a-bordered.a-horizontal-stripes.mojo-body-table")
            with warnings.catch_warnings():
                warnings.simplefilter("ignore", category=UserWarning)
                tables = pd.read_html(str(table) if table is not None else html)

            def _norm_cols(df): return [str(c).strip() for c in df.columns]
            candidates = []
            for t in tables:
                t.columns = _norm_cols(t)
                cols = {c.lower() for c in t.columns}
                if "title" in cols and ("lifetime gross" in cols or "gross" in cols) and ("year" in cols or "release year" in cols or "rank" in cols):
                    candidates.append(t)
            if not candidates:
                break

            df = candidates[0]
            if df.empty:
                break
            frames.append(df)

            if len(df) < per_page:
                break
            offset += per_page
            time.sleep(sleep)

        if frames:
            break

    if not frames:
        raise RuntimeError("No data scraped from Box Office Mojo across all endpoints")

    alltime = pd.concat(frames, ignore_index=True)
    rename_candidates = {
        "Title": "title",
        "Lifetime Gross": "domestic_revenue",
        "Gross": "domestic_revenue",
        "Year": "release_year",
        "Release Year": "release_year",
        "Rank": "rank",
    }
    alltime.columns = [str(c).strip() for c in alltime.columns]
    for k, v in list(rename_candidates.items()):
        if k in alltime.columns:
            alltime = alltime.rename(columns={k: v})

    for col in ["title", "domestic_revenue", "release_year", "rank"]:
        if col not in alltime.columns:
            alltime[col] = np.nan

    alltime["title"] = alltime["title"].astype(str).str.strip()
    alltime["domestic_revenue"] = pd.to_numeric(alltime["domestic_revenue"].astype(str).str.replace(r"[\$,]", "", regex=True), errors="coerce")
    alltime["release_year"] = pd.to_numeric(alltime["release_year"], errors="coerce").astype("Int64")
    alltime["rank"] = pd.to_numeric(alltime["rank"], errors="coerce").astype("Int64")
    alltime = alltime.dropna(subset=["title", "domestic_revenue"], how="any")

    # ALWAYS attach distributors for your modeling window
    y0 = int(START_YEAR)
    y1 = int(END_YEAR)
    print(f"  • Attaching distributors from BOM year pages ({y0}–{y1}) …")
    dist_idx = []
    for y in range(y0, y1 + 1):
        dfy = _fetch_bom_distributors_for_year(y)
        if not dfy.empty:
            dist_idx.append(dfy)
    if dist_idx:
        dist_idx = pd.concat(dist_idx, ignore_index=True)
        alltime["title_normalized"] = alltime["title"].apply(_norm_title_local)
        alltime = alltime.merge(dist_idx, on=["title_normalized","release_year"], how="left")
        alltime = alltime.drop(columns=["title_normalized"])
    else:
        alltime["distributor"] = np.nan

    print(f"📊 All-Time Domestic dataset ready from {used_base} : {alltime.shape}")
    return alltime
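
The paging logic in `fetch_alltime_domestic` packs URL construction into a single conditional expression. As an illustration only, the same rules can be written out step by step. The helper names `page_url` and `page_urls` are hypothetical, and the page sizes are simulated rather than fetched.

```python
def page_url(base: str, offset: int) -> str:
    """Append ?offset= or &offset= depending on whether `base` already has a query string."""
    if offset == 0:
        return base
    sep = "&" if "?" in base else "?"
    return f"{base}{sep}offset={offset}"

def page_urls(base: str, rows_per_page: list[int], per_page: int = 200) -> list[str]:
    """Simulate the paging loop: stop after the first page shorter than `per_page`."""
    urls, offset = [], 0
    for n_rows in rows_per_page:
        urls.append(page_url(base, offset))
        if n_rows < per_page:   # a short page means there is nothing left to fetch
            break
        offset += per_page
    return urls

base = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=NA"
# Simulated row counts per page; the fourth page is never requested
urls = page_urls(base, [200, 200, 57, 200])
```

Here `urls` holds three entries: the base URL, then the same URL with `&offset=200` and `&offset=400`.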


### TMDb API Fetch Function

In [5]:
# ---- 4) Fetch TMDb (v3/v4) ----
def fetch_tmdb_movies(api_key, start_year=2015, end_year=2026,
                      include_upcoming_pass=True, max_pages_per_year=5,
                      region_us=True, min_vote_count=0, sleep_sec=0.25):
    """
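    Fetch movies from TMDb's /discover/movie endpoint, one release year at a time.

    `api_key` may be a v3 API key (sent as the `api_key` query parameter) or a
    v4-style token (sent as a Bearer header). Both go to the same v3 base URL.
    Note: `include_upcoming_pass` is accepted but not used in the body below.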
    """
    if not api_key or api_key == "YOUR_TMDB_API_KEY_HERE":
        raise RuntimeError("TMDb API key missing — add it to config.json or env")

    is_v4 = str(api_key).startswith("eyJ")  # v4 tokens look like JWTs
    base_url = "https://api.themoviedb.org/3"
    headers = {"accept": "application/json"}
    if is_v4:
        headers["Authorization"] = f"Bearer {api_key}"

    all_movies = []
    for year in range(start_year, end_year + 1):
        for page in range(1, max_pages_per_year + 1):
            params = {
                "primary_release_year": year,
                "page": page,
                "language": "en-US",
                "include_adult": "false"
            }
            if region_us:
                params["region"] = "US"
            if min_vote_count > 0:
                params["vote_count.gte"] = min_vote_count
            if not is_v4:  # v3 key
                params["api_key"] = api_key

            try:
                r = requests.get(f"{base_url}/discover/movie", headers=headers, params=params, timeout=20)
                if r.status_code != 200:
                    print(f"    Error {r.status_code} year={year} page={page}: {r.text[:200]}")
                    break
                data = r.json()
                all_movies.extend(data.get("results", []))
                if page >= data.get("total_pages", 1):
                    break
            except Exception as e:
                print(f"    Request failed year={year} page={page}: {e}")
                break
            time.sleep(sleep_sec)

    if not all_movies:
        raise RuntimeError("No movies fetched from TMDb")

    df = pd.DataFrame(all_movies)
    if "release_date" in df.columns:
        df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
        df["release_year"] = df["release_date"].dt.year
    return df
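
The key-type check above decides how the credential is sent. A minimal sketch of just that decision follows. The helper name `tmdb_auth` is hypothetical, and both credentials are placeholders.

```python
def tmdb_auth(api_key: str) -> tuple[dict, dict]:
    """Return (headers, params) for a TMDb request, mirroring the key-type check above."""
    headers = {"accept": "application/json"}
    params = {}
    if str(api_key).startswith("eyJ"):
        # JWT-looking token: send it as a Bearer header
        headers["Authorization"] = f"Bearer {api_key}"
    else:
        # Short v3 key: send it as a query parameter
        params["api_key"] = api_key
    return headers, params

# Placeholder credentials, for illustration only
v4_headers, v4_params = tmdb_auth("eyJ-placeholder-token")
v3_headers, v3_params = tmdb_auth("0123456789abcdef")
```

Keeping the credential out of `params` in the token case avoids sending it in the URL.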


### Run

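Note: the next cell needs a TMDb credential. It looks for a `config.json` containing `TMDB_V4_TOKEN` or `TMDB_API_KEY` in the current directory and up to three parent directories, then falls back to environment variables with the same names.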
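
As a sketch of the file shape the loader in the next cell reads, `config.json` is a flat JSON object. The values here are placeholders, and writing the file from Python is only for illustration. When both keys are present and non-empty, `TMDB_V4_TOKEN` is used ahead of `TMDB_API_KEY`.

```python
import json
import os
import tempfile

example_cfg = {
    "TMDB_API_KEY": "YOUR_TMDB_API_KEY_HERE",   # placeholder v3 key
    "TMDB_V4_TOKEN": "",                        # optional; used first when non-empty
}

with tempfile.TemporaryDirectory() as tmp:
    cfg_path = os.path.join(tmp, "config.json")
    with open(cfg_path, "w", encoding="utf-8") as f:
        json.dump(example_cfg, f, indent=2)
    with open(cfg_path, "r", encoding="utf-8") as f:
        cfg = json.load(f)

# Same precedence as the notebook's lookup: token first, then v3 key
key = cfg.get("TMDB_V4_TOKEN") or cfg.get("TMDB_API_KEY")
```

The placeholder string has to be replaced with a real key before fetching, since `fetch_tmdb_movies` rejects `YOUR_TMDB_API_KEY_HERE`.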

In [6]:
# ---- 5) Load or build datasets with cache (BOM now includes distributor) ----
ALLTIME_CSV = os.path.join(DATA_DIR, "boxoffice_alltime_domestic.csv")
TMDB_CSV   = os.path.join(DATA_DIR, "tmdb_filtered.csv")

domestic_df = load_or_build_csv(
    ALLTIME_CSV,
    builder_fn=fetch_alltime_domestic,
    name="All-Time Domestic (lifetime + distributors)"
)

def load_tmdb_key():
    key = None
    for up in ["", "..", "../..", "../../.."]:
        cfg_path = os.path.join(os.getcwd(), up, "config.json")
        if os.path.exists(cfg_path):
            try:
                with open(cfg_path, "r", encoding="utf-8") as f:
                    cfg = json.load(f)
                key = cfg.get("TMDB_V4_TOKEN") or cfg.get("TMDB_API_KEY")
                if key:
                    print(f"🔑 Loaded TMDb key from {cfg_path}")
                    break
            except Exception as e:
                print(f"⚠️ Could not parse {cfg_path}: {e}")
    if not key:
        key = os.getenv("TMDB_V4_TOKEN") or os.getenv("TMDB_API_KEY")
    if not key:
        raise RuntimeError("❌ No TMDb API key found in config.json or environment!")
    return key.strip().strip('"').strip("'")

def build_tmdb_filtered():
    TMDB_API_KEY = load_tmdb_key()
    return fetch_tmdb_movies(
        TMDB_API_KEY,
        start_year=START_YEAR,
        end_year=END_YEAR,
        include_upcoming_pass=False,
        max_pages_per_year=100,
        region_us=True,
        min_vote_count=0,
        sleep_sec=0.2
    )

tmdb_df = load_or_build_csv(
    TMDB_CSV,
    builder_fn=build_tmdb_filtered,
    name="TMDb (filtered)"
)

print(f"\n✅ Domestic: {domestic_df.shape}, TMDb: {tmdb_df.shape}")
display(domestic_df.head())
display(tmdb_df.head())


🗂️  Using cached All-Time Domestic (lifetime + distributors): ../data/boxoffice_alltime_domestic.csv
🗂️  Using cached TMDb (filtered): ../data/tmdb_filtered.csv

✅ Domestic: (10000, 5), TMDb: (22682, 15)


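|     | rank | title                                      | domestic_revenue | release_year | distributor                         |
| --- | ---- | ------------------------------------------ | ---------------- | ------------ | ----------------------------------- |
| 0   | 1    | Star Wars: Episode VII - The Force Awakens | 936662225        | 2015         | Walt Disney Studios Motion Pictures |
| 1   | 2    | Avengers: Endgame                          | 858373000        | 2019         | Walt Disney Studios Motion Pictures |
| 2   | 3    | Spider-Man: No Way Home                    | 814866759        | 2021         | Sony Pictures Releasing             |
| 3   | 4    | Avatar                                     | 785221649        | 2009         |                                     |
| 4   | 5    | Top Gun: Maverick                          | 718732821        | 2022         | Paramount Pictures                  |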


Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count,release_year
0,False,/7IGKrY1f1KfwMipx9wZC4NRgIdF.jpg,"[18, 10749, 53]",216015,en,Fifty Shades of Grey,When college senior Anastasia Steele steps in ...,22.5569,/63kGofUkt1Mx0SIL4XI4Z5AoSgt.jpg,2015-02-13,Fifty Shades of Grey,False,5.88,12107,2015
1,False,/7vy2K7cRU2RxRm2s5HTnWFFtZR8.jpg,"[10749, 18]",351523,ko,동창회의 목적,"Dongchul, who is managing a small bar, is alwa...",19.0808,/AjlwGaT6tDs2fBIgFaepYiDhL6A.jpg,2015-08-06,Purpose of Reunion,False,6.4,29,2015
2,False,/kIBK5SKwgqIIuRKhhWrJn3XkbPq.jpg,"[28, 12, 878]",99861,en,Avengers: Age of Ultron,When Tony Stark tries to jumpstart a dormant p...,16.7837,/4ssDuvEDkSArWEdyBl2X5EHvYKU.jpg,2015-05-01,Avengers: Age of Ultron,False,7.271,23726,2015
3,False,/q7vmCCmyiHnuKGMzHZlr0fD44b9.jpg,"[10749, 14, 10751, 18]",150689,en,Cinderella,"When her father unexpectedly passes away, youn...",15.9833,/j91LJmcWo16CArFOoapsz84bwxb.jpg,2015-03-13,Cinderella,False,6.826,7222,2015
4,False,/hUpHXyLRNvtt0AAwdPmUsSQQKB8.jpg,"[35, 27]",273477,en,Scouts Guide to the Zombie Apocalypse,Three scouts and lifelong friends join forces ...,15.7602,/lUKvvSnjFlazrdh6wyHxHrdMknD.jpg,2015-10-30,Scouts Guide to the Zombie Apocalypse,False,6.544,2006,2015


### Merge datasets

In [7]:
# ============================================
# 6) Merge TMDb with All-Time Domestic (exact + fuzzy), carry distributor
# ============================================

# Install rapidfuzz if needed (for fuzzy merge fallback)
import subprocess, sys, warnings
try:
    from rapidfuzz import process, fuzz
except Exception:
    print("📦 Installing rapidfuzz…")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "rapidfuzz"])
    from rapidfuzz import process, fuzz

def normalize_title(title: str) -> str:
    if pd.isna(title): 
        return title
    t = str(title).strip()
    for ep in ["Episode I - ","Episode II - ","Episode III - ",
               "Episode IV - ","Episode V - ","Episode VI - ",
               "Episode VII - ","Episode VIII - ","Episode IX - "]:
        t = t.replace(ep, "")
    t = t.replace(" & ", " and ")
    return " ".join(t.split())

# ---- Clean/standardize Domestic (lifetime) ----
domestic_clean = domestic_df.copy()
for col in ["title", "release_year", "domestic_revenue", "rank", "distributor"]:
    if col not in domestic_clean.columns:
        domestic_clean[col] = np.nan

domestic_clean["title"] = domestic_clean["title"].astype(str).str.strip()
domestic_clean["title_normalized"] = domestic_clean["title"].apply(normalize_title)
domestic_clean["release_year"] = pd.to_numeric(domestic_clean["release_year"], errors="coerce").astype("Int64")
domestic_clean["domestic_revenue"] = pd.to_numeric(domestic_clean["domestic_revenue"], errors="coerce")
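# Nullable string dtype keeps missing distributors as NA instead of turning them into the text "nan"
domestic_clean["distributor"] = domestic_clean["distributor"].astype("string").str.strip()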

# Collapse to one row per (title_normalized, release_year)
domestic_keyed = (
    domestic_clean
    .dropna(subset=["title_normalized", "release_year"])
    .groupby(["title_normalized", "release_year"], as_index=False)
    .agg({
        "title": "first",
        "domestic_revenue": "max",
        "rank": "min",
        "distributor": "first"
    })
)

# ---- Clean/standardize TMDb ----
tmdb_clean = tmdb_df.copy()
if "release_date" in tmdb_clean.columns:
    tmdb_clean["release_date"] = pd.to_datetime(tmdb_clean["release_date"], errors="coerce")
    year_from_date = tmdb_clean["release_date"].dt.year
    if "release_year" in tmdb_clean.columns:
        # Keep an existing release_year; use the date-derived year only where it is missing
        tmdb_clean["release_year"] = tmdb_clean["release_year"].fillna(year_from_date)
    else:
        tmdb_clean["release_year"] = year_from_date
tmdb_clean["release_year"] = pd.to_numeric(tmdb_clean["release_year"], errors="coerce").astype("Int64")

# English-language originals only; a missing column or value is treated as "en"
if "original_language" not in tmdb_clean.columns:
    tmdb_clean["original_language"] = "en"
tmdb_clean["original_language"] = tmdb_clean["original_language"].fillna("en")
tmdb_clean = tmdb_clean[tmdb_clean["original_language"] == "en"].copy()

if "genres" not in tmdb_clean.columns:
    if "genre_ids" in tmdb_clean.columns:
        tmdb_clean["genres"] = tmdb_clean["genre_ids"].astype(str)
    else:
        tmdb_clean["genres"] = ""

def _contains_genre_ids_as_text(s, ids=("99", "10770")):
    st = str(s);  return any(f"{gid}" in st for gid in ids)

tmdb_clean = tmdb_clean[~tmdb_clean["genres"].apply(_contains_genre_ids_as_text)].copy()

tmdb_clean["title"] = tmdb_clean["title"].astype(str).str.strip()
tmdb_clean["title_normalized"] = tmdb_clean["title"].apply(normalize_title)

tmdb_keyed = (
    tmdb_clean.sort_values(["release_year","vote_count","popularity"], ascending=[True, False, False])
              .drop_duplicates(subset=["title_normalized","release_year"], keep="first")
)

print("🔗 Exact merge TMDb ⟷ Domestic (lifetime)…")
merged_df = pd.merge(
    tmdb_keyed,
    domestic_keyed[["title_normalized","release_year","domestic_revenue","rank","distributor"]],
    on=["title_normalized","release_year"],
    how="left",
    suffixes=("", "_domestic"),
)
exact_hits = merged_df["domestic_revenue"].notna().sum()
print(f"✅ Exact matches: {exact_hits:,}")

# ---------- Fuzzy fallback (same-year only) ----------
def fuzzy_fill_domestic(merged, domestic, score_cutoff=90):
    dom_by_year = {}
    dom = domestic[["title_normalized", "release_year", "domestic_revenue", "rank", "distributor"]].dropna(subset=["title_normalized"])
    for y, sub in dom.groupby("release_year"):
        dom_by_year[int(y)] = (sub["title_normalized"].tolist(), sub.index.tolist())

    added = 0
    missing_mask = merged["domestic_revenue"].isna()
    groups = merged[missing_mask].groupby("release_year").groups

    for y, idxs in groups.items():
        if pd.isna(y): continue
        y = int(y)
        if y not in dom_by_year: continue
        titles_dom, idxs_dom = dom_by_year[y]
        if not titles_dom: continue

        for ridx in idxs:
            q = merged.at[ridx, "title_normalized"]
            if not isinstance(q, str) or not q: continue
            # Use the function argument rather than a hard-coded 90
            match = process.extractOne(q, titles_dom, scorer=fuzz.WRatio, score_cutoff=score_cutoff)
            if not match: continue
            _, score, pos = match
            dom_idx = idxs_dom[pos]
            merged.at[ridx, "domestic_revenue"] = domestic.loc[dom_idx, "domestic_revenue"]
            merged.at[ridx, "rank"] = domestic.loc[dom_idx, "rank"]
            merged.at[ridx, "distributor"] = domestic.loc[dom_idx, "distributor"]
            added += 1
    return merged, added

print("🧪 Fuzzy matching unmatched rows (same year)…")
merged_df, fuzzy_added = fuzzy_fill_domestic(merged_df, domestic_keyed, score_cutoff=90)
print(f"➕ Fuzzy matches added: {fuzzy_added:,}")

# Final safety: drop any remaining duplicates on (title, release_year) keeping highest domestic
if merged_df.duplicated(subset=["title","release_year"], keep=False).any():
    merged_df = (merged_df
        .sort_values(["release_year","domestic_revenue"], ascending=[True, False])
        .drop_duplicates(subset=["title","release_year"], keep="first"))

# Canonical revenue columns
merged_df["revenue_domestic"] = pd.to_numeric(merged_df["domestic_revenue"], errors="coerce")
merged_df["revenue"] = merged_df["revenue_domestic"]  # modeling target = lifetime domestic

print(f"✅ Merge complete. Rows: {len(merged_df):,}")
display(merged_df.head(5))


🔗 Exact merge TMDb ⟷ Domestic (lifetime)…
✅ Exact matches: 1,354
🧪 Fuzzy matching unmatched rows (same year)…
➕ Fuzzy matches added: 178
✅ Merge complete. Rows: 7,587


Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count,release_year,genres,title_normalized,domestic_revenue,rank,distributor,revenue_domestic,revenue
0,False,/j82SfPOqas5rvHbAHAPq7OgaSir.jpg,"[18, 10749, 10752]",271674,en,Suite Française,"France, 1940. In the first days of occupation,...",3.7352,/mPgijY9f79FfD7K8MmzsynXpq0R.jpg,2014-11-05,Suite Française,False,7.258,1123,2014,"[18, 10749, 10752]",Suite Française,,,,,
1,False,/nxWUUSYOTEAjARGUI9EcTBSMo2u.jpg,"[28, 80, 18, 9648, 53]",241765,en,The Outsider,Revolves around a British military contractor ...,2.8373,/iYRFAxQpB0VVVO17cPC8nEyZiLN.jpg,2014-03-11,The Outsider,False,4.5,65,2014,"[28, 80, 18, 9648, 53]",The Outsider,,,,,
2,False,/qiewTS30GsKTFXKTL55soJYuLh4.jpg,"[9648, 35, 10749]",347968,en,One-Minute Time Machine,Every time the beautiful Regina rejects his ad...,2.8858,/2JQUM8bNjGFlwilokqmZ0AvvsY4.jpg,2014-03-29,One-Minute Time Machine,False,6.6,34,2014,"[9648, 35, 10749]",One-Minute Time Machine,,,,,
3,False,/sFEMTxcgbtFySdJvsNiwaaIlZVP.jpg,"[18, 9648]",298714,en,X: Past Is Present,When a middle-aged filmmaker meets an alluring...,3.155,/cJXEhJBO81fBaisJt3qsw4RaDZG.jpg,2014-11-18,X: Past Is Present,False,5.6,13,2014,"[18, 9648]",X: Past Is Present,,,,,
4,False,,[18],404278,en,Skinship,"Set against a technological backdrop, in a tim...",4.0211,,2014-05-01,Skinship,False,0.0,0,2014,[18],Skinship,,,,,
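
To make the two-stage matching concrete, here is a small self-contained sketch on made-up rows. It swaps in the standard library's `difflib.SequenceMatcher` for rapidfuzz's `WRatio`, so its scores are not comparable with the cell above. The revenue figures, the `match_revenue` helper, and the `0.85` cutoff are illustrative assumptions.

```python
from difflib import SequenceMatcher

EPISODE_PREFIXES = ["Episode I - ", "Episode II - ", "Episode III - ",
                    "Episode IV - ", "Episode V - ", "Episode VI - ",
                    "Episode VII - ", "Episode VIII - ", "Episode IX - "]

def normalize_title(title: str) -> str:
    """Same rules as the notebook: drop episode prefixes, ' & ' -> ' and ', squeeze spaces."""
    t = title.strip()
    for ep in EPISODE_PREFIXES:
        t = t.replace(ep, "")
    t = t.replace(" & ", " and ")
    return " ".join(t.split())

def match_revenue(title, year, domestic, cutoff=0.85):
    """Stage 1: exact (normalized title, year) key. Stage 2: best fuzzy match in the same year."""
    key = (normalize_title(title), year)
    if key in domestic:
        return domestic[key], "exact"
    best_score, best_key = 0.0, None
    for cand in domestic:
        if cand[1] != year:          # never match across release years
            continue
        score = SequenceMatcher(None, key[0], cand[0]).ratio()
        if score > best_score:
            best_score, best_key = score, cand
    if best_key is not None and best_score >= cutoff:
        return domestic[best_key], "fuzzy"
    return None, "unmatched"

# Made-up revenue figures keyed by (normalized title, year)
domestic = {
    (normalize_title("Star Wars: Episode VII - The Force Awakens"), 2015): 100,
    (normalize_title("Spider-Man: No Way Home"), 2021): 200,
}
queries = [
    ("Star Wars: The Force Awakens", 2015),   # exact once the episode prefix is dropped
    ("Spider Man: No Way Home", 2021),        # one character off, same year
    ("Spider Man: No Way Home", 2020),        # same title, wrong year
    ("Some Other Film", 2021),                # unrelated title
]
results = [match_revenue(t, y, domestic) for t, y in queries]
```

The first query matches exactly, the second through the fuzzy fallback, and the last two stay unmatched. Restricting candidates to the same year keeps remakes and sequels from borrowing each other's revenue, but near-identical titles within a year can still collide.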


In [8]:
# Check for and remove duplicate entries
print("Checking for duplicate entries...")

# Check for Twisters variants
twisters_check = merged_df[merged_df['title'].str.contains('Twisters', case=False, na=False)]
print(f"Twisters entries found: {len(twisters_check)}")
if len(twisters_check) > 0:
    print(twisters_check[['title', 'release_year', 'revenue_domestic']])

# Fix "The Twisters" to "Twisters" (remove "The " prefix)
merged_df.loc[merged_df['title'] == 'The Twisters', 'title'] = 'Twisters'
merged_df.loc[merged_df['title'] == 'The Twisters (2024)', 'title'] = 'Twisters'

# Remove exact duplicates (same title, year, revenue)
initial_count = len(merged_df)
merged_df = merged_df.drop_duplicates(subset=['title', 'release_year', 'revenue_domestic']).copy()
final_count = len(merged_df)

if initial_count != final_count:
    print(f"\nRemoved {initial_count - final_count} duplicate entries")
    print(f"Dataset size: {initial_count} → {final_count}")
else:
    print("\nNo exact duplicate entries found")

# For remaining duplicates with same title and year but different revenue, keep the higher revenue
duplicates_title_year = merged_df[merged_df.duplicated(subset=['title', 'release_year'], keep=False)]
if len(duplicates_title_year) > 0:
    print(f"\nFound {len(duplicates_title_year)} entries with same title/year but different revenue:")
    print(duplicates_title_year[['title', 'release_year', 'revenue_domestic']].sort_values(['title', 'release_year']))
    
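    # Keep the entry with the highest revenue for each title/year combination.
    # sort + drop_duplicates is used instead of groupby().idxmax(), which drops rows
    # with a missing release_year and fails on groups whose revenue is all NaN.
    merged_df = (merged_df
        .sort_values("revenue_domestic", ascending=False, na_position="last")
        .drop_duplicates(subset=['title', 'release_year'], keep="first")
        .sort_index()
        .copy())
    final_count_after_merge = len(merged_df)
    print(f"After keeping highest revenue entries: {final_count} → {final_count_after_merge}")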

# Re-check Twisters after cleanup
twisters_check_after = merged_df[merged_df['title'].str.contains('Twisters', case=False, na=False)]
print(f"\nTwisters entries after cleanup: {len(twisters_check_after)}")
if len(twisters_check_after) > 0:
    print(twisters_check_after[['title', 'release_year', 'revenue_domestic']])

# Apply systematic title corrections
print(f"\n🔧 Applying systematic title corrections...")

def apply_title_corrections(merged_df):
    """Apply systematic title corrections using TITLE_CORRECTIONS dictionary"""
    corrections_applied = 0
    
    for incorrect_title, correct_title in TITLE_CORRECTIONS.items():
        # Look for exact matches or partial matches that might be truncated
        mask = merged_df['title'].str.contains(incorrect_title, case=False, na=False) & (merged_df['title'] != correct_title)
        
        if mask.any():
            # Only fix if the found title is significantly shorter (likely truncated)
            problematic_entries = merged_df[mask]
            for idx, row in problematic_entries.iterrows():
                if len(row['title']) < len(correct_title) * 0.8:  # Title is much shorter than expected
                    print(f"✅ Title correction: '{row['title']}' → '{correct_title}' ({row['release_year']})")
                    merged_df.loc[idx, 'title'] = correct_title
                    corrections_applied += 1
    
    if corrections_applied > 0:
        print(f"Applied {corrections_applied} title corrections")
    else:
        print("No title corrections needed")
    
    return merged_df

merged_df = apply_title_corrections(merged_df)

Checking for duplicate entries...
Twisters entries found: 2
             title  release_year  revenue_domestic
5894      Twisters          2024       267762265.0
6156  The Twisters          2024       267762265.0

Removed 1 duplicate entries
Dataset size: 7587 → 7586

Twisters entries after cleanup: 1
         title  release_year  revenue_domestic
5894  Twisters          2024       267762265.0

🔧 Applying systematic title corrections...
✅ Title correction: 'Hedgehog' → 'Sonic the Hedgehog 3' (2024)
✅ Title correction: 'The Wilds' → 'The Wild Robot' (2022)
✅ Title correction: 'Deadpool' → 'Deadpool & Wolverine' (2016)
✅ Title correction: 'Deadpool 2' → 'Deadpool & Wolverine' (2018)
Applied 4 title corrections
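
The log above shows 'Deadpool' (2016) and 'Deadpool 2' (2018) being renamed to 'Deadpool & Wolverine'. These look like false positives, since they are separate films and the substring test is not tied to a release year. One possible tightening, sketched here on plain dictionaries, keys each correction by the exact stored title and year. The `(title, year)` mapping format, the sample entry, and the helper name `apply_corrections` are assumptions, not the structure of `TITLE_CORRECTIONS` in `movie_lists.py`.

```python
# Hypothetical format: (title as stored, release year) -> corrected title
CORRECTIONS_BY_YEAR = {
    ("Deadpool and Wolverine", 2024): "Deadpool & Wolverine",
}

def apply_corrections(rows, corrections):
    """Rename only rows whose exact (title, year) pair is listed; return the number changed."""
    changed = 0
    for row in rows:
        fixed = corrections.get((row["title"], row["release_year"]))
        if fixed is not None and fixed != row["title"]:
            row["title"] = fixed
            changed += 1
    return changed

rows = [
    {"title": "Deadpool", "release_year": 2016},
    {"title": "Deadpool 2", "release_year": 2018},
    {"title": "Deadpool and Wolverine", "release_year": 2024},
]
n_changed = apply_corrections(rows, CORRECTIONS_BY_YEAR)
```

Only the 2024 row is renamed here, and the two earlier films keep their titles. The trade-off is that every correction has to be listed explicitly, so truncated titles are no longer caught automatically.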


### Filter + Export
Once the cell below has saved the merged dataset, continue with `2_feature_engineering.ipynb`.

In [9]:
# ============================================
# 7) Final filters + save + summary
# ============================================

final_df = merged_df.copy()

# Modeling window
final_df = final_df[final_df["release_year"].between(START_YEAR, END_YEAR)]

# Must have lifetime domestic revenue
final_df = final_df[final_df["revenue_domestic"].notna() & (final_df["revenue_domestic"] > 0)]

# English-only (defensive)
if "original_language" in final_df.columns:
    final_df = final_df[final_df["original_language"] == "en"]

# Drop TV/docs leftovers
if "genres" in final_df.columns:
    final_df = final_df[~final_df["genres"].astype(str).str.contains("99|10770|Documentary|TV", case=False, na=False)]

# Save
MERGED_CSV = os.path.join(DATA_DIR, "dataset_domestic_lifetime_merged.csv")
final_df.to_csv(MERGED_CSV, index=False)
print(f"\n💾 Saved merged dataset → {MERGED_CSV}")

# Summary
print(f"Total movies: {len(final_df):,}")
if len(final_df) > 0:
    yr_min = int(final_df["release_year"].min())
    yr_max = int(final_df["release_year"].max())
    rev_min = final_df["revenue_domestic"].min()
    rev_max = final_df["revenue_domestic"].max()
    rev_avg = final_df["revenue_domestic"].mean()

    print(f"Year range: {yr_min}–{yr_max}")
    print(f"Lifetime domestic range: ${rev_min:,.0f} — ${rev_max:,.0f}")
    print(f"Average lifetime domestic: ${rev_avg:,.0f}")

    cols = ["title","release_year","distributor","revenue_domestic"]
    cols = [c for c in cols if c in final_df.columns]
    display(final_df.nlargest(10, "revenue_domestic")[cols].reset_index(drop=True))
else:
    print("⚠️ Final dataset is empty — review filters and merge keys.")



💾 Saved merged dataset → ../data/dataset_domestic_lifetime_merged.csv
Total movies: 1,531
Year range: 2015–2025
Lifetime domestic range: $521,202 — $936,662,225
Average lifetime domestic: $60,442,440


|     | title                        | release_year | distributor                         | revenue_domestic |
| --- | ---------------------------- | ------------ | ----------------------------------- | ---------------- |
| 0   | Star Wars: The Force Awakens | 2015         | Walt Disney Studios Motion Pictures | 936662225.0      |
| 1   | Avengers: Endgame            | 2019         | Walt Disney Studios Motion Pictures | 858373000.0      |
| 2   | Spider-Man: No Way Home      | 2021         | Sony Pictures Releasing             | 814866759.0      |
| 3   | Top Gun: Maverick            | 2022         | Paramount Pictures                  | 718732821.0      |
| 4   | Black Panther                | 2018         | Walt Disney Studios Motion Pictures | 700426566.0      |
| 5   | Avatar: The Way of Water     | 2022         | 20th Century Studios                | 684075767.0      |
| 6   | Avengers: Infinity War       | 2018         | Walt Disney Studios Motion Pictures | 678815482.0      |
| 7   | Jurassic World               | 2015         | Universal Pictures                  | 653406625.0      |
| 8   | Inside Out 2                 | 2024         | Walt Disney Studios Motion Pictures | 652980194.0      |
| 9   | Deadpool & Wolverine         | 2024         | Walt Disney Studios Motion Pictures | 636745858.0      |


The table above should match the screenshot below, ignoring films released before 2015 (the start of the modeling window).

![Box Office Mojo as of 09/21/2025](../box-office-mojo.png)
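
The documentary and TV-movie filters used earlier test for the IDs `99` and `10770` as substrings of the stringified `genre_ids` list. That works for the IDs seen in this data, but a token-based check is less fragile if a longer ID containing those digits ever appears. A minimal sketch, with a hypothetical helper name `has_excluded_genre`:

```python
import re

EXCLUDED_GENRE_IDS = {"99", "10770"}   # TMDb IDs for Documentary and TV Movie

def has_excluded_genre(genres_text) -> bool:
    """True if any whole genre ID in the text is in the excluded set."""
    ids = set(re.findall(r"\d+", str(genres_text)))
    return not ids.isdisjoint(EXCLUDED_GENRE_IDS)

drama_doc = has_excluded_genre("[18, 99]")          # documentary tagged alongside drama
tv_movie = has_excluded_genre("[10770]")            # TV movie
plain_drama = has_excluded_genre("[18, 10749]")     # neither excluded ID present
lookalike = has_excluded_genre("[1990]")            # contains "99" only as a substring
```

A plain substring test would flag the last case, while the token-based check does not.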
