# Project 3: The "Movie DNA" Galaxy Explorer
## Part 1: Enriching Data with Keywords from TMDB

Our goal is to create a rich "fingerprint" for each film, and that requires more than just genre and cast. We need to understand the film's plot and themes. In this step, we will use the TMDB API to fetch descriptive keywords and other useful metadata (like poster paths) for every Pre-Code film in our dataset.

**Methodology:**
1.  **Load Data:** Start with our `hollywood_df.pkl` file.
2.  **Connect to TMDB:** Use the `tmdbsimple` library and our API key to connect to The Movie Database.
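3.  **Fetch Keywords:** For each movie (identified by its IMDb ID, `tconst`), we first look up the matching TMDB entry, then request that entry's keywords and poster path.
4.  **Handle Missing Data:** Some older films have no TMDB entry or no keywords. Instead of raising, the fetch function returns a status string (`not_found` or `api_error`) in place of the keywords, so one bad record cannot stop the whole run.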
5.  **Save Enriched Data:** We will save the result to a new file, `hollywood_df_enriched.pkl`, to use in the next steps.
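
Because step 4 stores status strings in the same column as real keywords, later steps need a way to tell them apart. The following is a minimal sketch of one way to do that, not part of the pipeline itself. The `sample_df` rows and the `keyword_status` helper are hypothetical. The sentinel names are taken from the fetch function in the code cell below.

```python
import pandas as pd

# Hypothetical rows that mimic the enriched table; real values come from TMDB.
sample_df = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003", "tt0000004"],
    "keywords": ["gangster prohibition", "not_found", "api_error", ""],
})

# Status strings the fetch function returns in place of real keywords.
SENTINELS = {"not_found", "api_error", "NO_API_KEY"}

def keyword_status(value):
    """Classify a keywords cell as one of the sentinels, 'empty', or 'ok'."""
    if value in SENTINELS:
        return value
    if not value:
        return "empty"
    return "ok"

# Count how many films fall into each category.
status = sample_df["keywords"].map(keyword_status)
status_counts = status.value_counts()
print(status_counts)

# Keep only the films that have usable keywords.
usable_df = sample_df[status == "ok"]
```

Checking these counts after the enrichment run shows how much of the dataset TMDB actually covers before any text features are built on the keywords.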

In [None]:
import pandas as pd
import os
import tmdbsimple as tmdb
from tqdm.auto import tqdm
from dotenv import load_dotenv

# --- 1. Load Environment Variables and Setup TMDB API ---
# This will find the .env file in your project root and load the keys.
load_dotenv()

tmdb.API_KEY = os.getenv('TMDB_API_KEY')

if tmdb.API_KEY:
    print("TMDB API key loaded successfully from .env file.")
else:
    print("Error: Could not load TMDB_API_KEY from .env file.")
    print("Please ensure your .env file exists in the project root and contains the key.")

# --- 2. Load our Hollywood DataFrame ---
HOLLYWOOD_DF_PATH = "../data/processed/hollywood_df.pkl"
hollywood_df = pd.read_pickle(HOLLYWOOD_DF_PATH)
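# Deduplicate on tconst (the unique IMDb ID) so that distinct films sharing a title, such as remakes, are kept.
unique_movies_df = hollywood_df[['tconst', 'primaryTitle', 'startYear']].drop_duplicates(subset=['tconst']).reset_index(drop=True)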

# --- 3. Function to Fetch Keywords for a tconst ---
def get_keywords_from_tmdb(tconst):
    """Return (keywords, poster_path) for an IMDb ID, or (status, '') on failure."""
    if not tmdb.API_KEY:
        return "NO_API_KEY", ""
    try:
        # Map the IMDb ID to a TMDB entry.
        response = tmdb.Find(tconst).info(external_source='imdb_id')
        if not response['movie_results']:
            return "not_found", ""
        match = response['movie_results'][0]
        # Fetch the keyword list for the matched TMDB movie.
        keywords = tmdb.Movies(match['id']).keywords()['keywords']
        keyword_str = ' '.join(k['name'] for k in keywords)
        # poster_path can be present but null, so fall back to ''.
        poster_path = match.get('poster_path') or ''
        return keyword_str, poster_path
    except Exception:
        # Network failures, rate limits, and unexpected payloads all map to one status.
        return "api_error", ""

# --- 4. Loop Through Movies and Enrich Data ---
if tmdb.API_KEY:
    tqdm.pandas(desc="Fetching Keywords from TMDB")
    results = unique_movies_df['tconst'].progress_apply(get_keywords_from_tmdb)
    unique_movies_df[['keywords', 'poster_path']] = pd.DataFrame(results.tolist(), index=unique_movies_df.index)

    # --- 5. Save the Enriched Data ---
    ENRICHED_DF_PATH = "../data/processed/hollywood_df_enriched.pkl"
    unique_movies_df.to_pickle(ENRICHED_DF_PATH)

    print(f"\nEnrichment complete. Saved {len(unique_movies_df)} movies with keyword data.")
    print("Sample of enriched data:")
    display(unique_movies_df.head())
else:
    print("\nSkipping data enrichment because TMDB API key was not found.")

TMDB API key loaded successfully from .env file.


Fetching Keywords from TMDB:   0%|          | 0/4514 [00:00<?, ?it/s]
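
With several thousand requests in one run, some `api_error` results are likely to be transient (timeouts, rate limits) and not permanent. The notebook does not retry them. As a minimal sketch of one possible extension, the hypothetical `with_retries` wrapper below re-calls a fetch function with exponential backoff whenever it returns the `api_error` status. The `flaky_fetch` stub stands in for the real TMDB call so that the example runs offline.

```python
import time

def with_retries(fetch, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Wrap fetch so that an "api_error" result is retried with exponential backoff."""
    def wrapped(tconst):
        result = ("api_error", "")
        for attempt in range(max_attempts):
            result = fetch(tconst)
            if result[0] != "api_error":
                return result
            # Wait longer after each failure, but not after the final attempt.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
        return result
    return wrapped

# Hypothetical stand-in for the TMDB fetch: fails twice, then succeeds.
calls = {"n": 0}

def flaky_fetch(tconst):
    calls["n"] += 1
    if calls["n"] < 3:
        return "api_error", ""
    return "gangster prohibition", "/poster.jpg"

# Pass a no-op sleep so the demonstration does not actually wait.
robust_fetch = with_retries(flaky_fetch, sleep=lambda seconds: None)
result = robust_fetch("tt0000001")
print(result, "after", calls["n"], "calls")
```

In the notebook this would be applied as `progress_apply(with_retries(get_keywords_from_tmdb))`. Films that still return `api_error` after the final attempt keep that status and can be re-run later.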