# Project 3: The "Movie DNA" Galaxy Explorer
## Part 1: Enriching Data with Keywords from TMDB

Our goal is to create a rich "fingerprint" for each film, and that requires more than just genre and cast. We need to understand the film's plot and themes. In this step, we will use the TMDB API to fetch descriptive keywords and other useful metadata (like poster paths) for every Pre-Code film in our dataset.

**Methodology:**
1.  **Load Data:** Start with our `hollywood_df.pkl` file.
2.  **Connect to TMDB:** Use the `tmdbsimple` library and our API key to connect to The Movie Database.
3.  **Fetch Keywords:** For each movie, we look up its TMDB entry by IMDb ID (`tconst`), then request the keywords for that entry.
4.  **Handle Missing Data:** Some older films have no TMDB entry, and others have an entry but no keywords. Our code must handle both cases without interrupting the run.
5.  **Save Enriched Data:** We will save the result to a new file, `hollywood_df_enriched.pkl`, to use in the next steps.
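
The missing-data handling in step 4 can be sketched without touching the network. The helper below is a hypothetical illustration, not part of the notebook: `classify_lookup` and the `Enrichment` tuple are assumed names, and the dictionary shapes are assumed to mirror the Find and keywords payloads this project reads (`movie_results`, `poster_path`, `keywords`, `name`). It keeps a separate status field so that failure markers never mix with real keyword text.

```python
from typing import NamedTuple, Optional


class Enrichment(NamedTuple):
    """Hypothetical result record: lookup status plus the fetched fields."""
    status: str       # "ok", "no_keywords", or "not_found"
    keywords: str
    poster_path: str


def classify_lookup(find_response: dict, keywords_response: Optional[dict]) -> Enrichment:
    """Turn raw TMDB-style payloads into a status plus clean text fields."""
    results = find_response.get("movie_results") or []
    if not results:
        # TMDB has no entry for this IMDb ID.
        return Enrichment("not_found", "", "")

    # A poster_path of None (JSON null) is normalized to an empty string.
    poster_path = results[0].get("poster_path") or ""

    names = [k["name"] for k in (keywords_response or {}).get("keywords", [])]
    if not names:
        # The film exists on TMDB but has no keywords attached.
        return Enrichment("no_keywords", "", poster_path)

    return Enrichment("ok", " ".join(names), poster_path)


# Example with hand-written payloads (not real API output).
found = classify_lookup(
    {"movie_results": [{"id": 1, "poster_path": None}]},
    {"keywords": [{"name": "silent film"}, {"name": "sports"}]},
)
missing = classify_lookup({"movie_results": []}, None)
print(found)
print(missing)
```

Keeping the status in its own field means a later step can filter on it directly instead of checking the keyword text for special values.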

In [3]:
import pandas as pd
import os
import tmdbsimple as tmdb
from tqdm.auto import tqdm
from dotenv import load_dotenv

# --- 1. Load Environment Variables and Setup TMDB API ---
# This will find the .env file in your project root and load the keys.
load_dotenv()

tmdb.API_KEY = os.getenv('TMDB_API_KEY')

if tmdb.API_KEY:
    print("TMDB API key loaded successfully from .env file.")
else:
    print("Error: Could not load TMDB_API_KEY from .env file.")
    print("Please ensure your .env file exists in the project root and contains the key.")

# --- 2. Load our Hollywood DataFrame ---
HOLLYWOOD_DF_PATH = "../data/processed/hollywood_df.pkl"
hollywood_df = pd.read_pickle(HOLLYWOOD_DF_PATH)
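# Note: deduplicating on primaryTitle keeps only the first film for each title,
# so distinct films that share a title (e.g. remakes) collapse into one row.
unique_movies_df = hollywood_df[['tconst', 'primaryTitle', 'startYear']].drop_duplicates(subset=['primaryTitle']).reset_index(drop=True)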

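# --- 3. Function to Fetch Keywords for a tconst ---
def get_keywords_from_tmdb(tconst):
    """Return (keywords, poster_path) for an IMDb ID.

    On failure, the first element is a sentinel string
    ("NO_API_KEY", "not_found", or "api_error") instead of keywords.
    """
    if not tmdb.API_KEY:
        return "NO_API_KEY", ""
    try:
        # Step 1: map the IMDb ID (tconst) to a TMDB movie entry.
        find = tmdb.Find(tconst)
        response = find.info(external_source='imdb_id')
        if not response['movie_results']:
            return "not_found", ""
        result = response['movie_results'][0]
        # Step 2: fetch the keywords for that TMDB movie ID.
        movie = tmdb.Movies(result['id'])
        keywords = movie.keywords()['keywords']
        keyword_str = ' '.join(k['name'] for k in keywords)
        # poster_path can be present but null, so normalize None to ''.
        poster_path = result.get('poster_path') or ''
        return keyword_str, poster_path
    except Exception:
        # Network errors, rate limits, and malformed responses all land here.
        return "api_error", ""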

# --- 4. Loop Through Movies and Enrich Data ---
if tmdb.API_KEY:
    tqdm.pandas(desc="Fetching Keywords from TMDB")
    results = unique_movies_df['tconst'].progress_apply(get_keywords_from_tmdb)
    unique_movies_df[['keywords', 'poster_path']] = pd.DataFrame(results.tolist(), index=unique_movies_df.index)

    # --- 5. Save the Enriched Data ---
    ENRICHED_DF_PATH = "../data/processed/hollywood_df_enriched.pkl"
    unique_movies_df.to_pickle(ENRICHED_DF_PATH)

    print(f"\nEnrichment complete. Saved {len(unique_movies_df)} movies with keyword data.")
    print("Sample of enriched data:")
    display(unique_movies_df.head())
else:
    print("\nSkipping data enrichment because TMDB API key was not found.")

TMDB API key loaded successfully from .env file.


Enrichment complete. Saved 4514 movies with keyword data.
Sample of enriched data:


| | tconst | primaryTitle | startYear | keywords | poster_path |
|---|---|---|---|---|---|
| 0 | tt0017578 | The Wrecker | 1929 | | /oGCsBdjxDq7b4eTpTsjJrA6VayX.jpg |
| 1 | tt0018362 | The Scar of Shame | 1929 | marriage contract prison escape class differen... | /tosJ21bDxJzvg2JcTuKQMqWglvM.jpg |
| 2 | tt0018588 | Three Loves | 1929 | black and white silent film | /ncyxOfdoS0Rz9VWHxyx6HLyU5nB.jpg |
| 3 | tt0018630 | After the Verdict | 1929 | sports | /dOGxHjXBFp1D59hZLzBl9Gheg20.jpg |
| 4 | tt0018685 | The Bellamy Trial | 1929 | | |
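
Rows 0 and 4 in the sample have no keywords, so it is worth checking how much of the dataset is actually covered before moving on. The following is a minimal sketch: `keyword_coverage` is a hypothetical helper, and the sentinel names are the ones `get_keywords_from_tmdb` returns in place of keywords.

```python
from collections import Counter
from typing import Iterable, Optional

# Failure markers that get_keywords_from_tmdb stores in place of keywords.
SENTINELS = {"NO_API_KEY", "not_found", "api_error"}


def keyword_coverage(values: Iterable[Optional[str]]) -> Counter:
    """Count how many entries have keywords, are empty, or hold a sentinel."""
    counts = Counter()
    for value in values:
        if not isinstance(value, str) or not value.strip():
            counts["empty"] += 1          # None, NaN, or blank string
        elif value in SENTINELS:
            counts[value] += 1            # one bucket per failure marker
        else:
            counts["has_keywords"] += 1
    return counts


# Hand-written sample values, not the real dataset.
sample = ["", "marriage contract prison escape", "sports", "not_found", None]
coverage = keyword_coverage(sample)
print(coverage)
```

On the real data this could be called as `keyword_coverage(unique_movies_df['keywords'])`.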


## Part 2: Engineering the "Movie DNA" with AI

With our enriched dataset, we can now perform the core machine learning task. We will use a pre-trained Sentence Transformer model to read the plot keywords for each film and convert them into a fixed-length numeric vector, also known as an "embedding." This vector is the film's "DNA": films with similar keywords should end up with similar vectors, which gives us a numeric way to compare themes.

**Methodology:**
1.  **Load Enriched Data:** We'll load the `hollywood_df_enriched.pkl` file we created in the previous step.
2.  **Instantiate AI Model:** We will load a compact, widely used model (`all-MiniLM-L6-v2`) from the `sentence-transformers` library. The first time this runs, it will download the model files (under 100 MB).
3.  **Generate Embeddings:** We will feed the `keywords` column into the model. The model will output a 384-dimensional vector for each film.
4.  **Save the DNA:** We will save these embeddings to a file so we don't have to re-calculate them every time. This is a crucial step in any ML pipeline.
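
Step 4 amounts to a load-or-compute pattern. The sketch below assumes NumPy's `.npy` format for the saved file; `load_or_compute` and the `compute` callable are hypothetical names, and the stand-in function returns zeros rather than real embeddings so the example runs without downloading a model.

```python
import os
import tempfile
from typing import Callable

import numpy as np


def load_or_compute(path: str, compute: Callable[[], np.ndarray]) -> np.ndarray:
    """Return the array cached at `path`, computing and saving it if absent."""
    if os.path.exists(path):
        return np.load(path)
    embeddings = compute()
    np.save(path, embeddings)
    return embeddings


calls = []


def fake_encode() -> np.ndarray:
    """Stand-in for the real encoding step; records each invocation."""
    calls.append(1)
    return np.zeros((3, 384), dtype=np.float32)


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "movie_dna_embeddings.npy")
    first = load_or_compute(cache_path, fake_encode)   # computes and saves
    second = load_or_compute(cache_path, fake_encode)  # loads from disk

print(f"encode calls: {len(calls)}, shape: {second.shape}")
```

In the notebook, `compute` could be something like `lambda: model.encode(corpus)`. A cache like this goes stale if the keywords or the model change, so the saved file would need to be deleted in that case.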

In [None]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm
import os

# --- 1. Load the Enriched Data ---
ENRICHED_DF_PATH = "../data/processed/hollywood_df_enriched.pkl"
enriched_df = pd.read_pickle(ENRICHED_DF_PATH)

print("Enriched movie data loaded successfully.")

# --- 2. Prepare the Text Data ---
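# Fill any missing keywords with an empty string so the model can process them.
# Also blank out the failure sentinels written in Part 1 so they are not
# embedded as if they were real keywords.
enriched_df['keywords'] = (
    enriched_df['keywords']
    .fillna('')
    .replace(['NO_API_KEY', 'not_found', 'api_error'], '')
)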

# Create a list of all keyword strings to feed to the model
corpus = enriched_df['keywords'].tolist()

# --- 3. Instantiate and Use the Transformer Model ---
# This model is small but powerful, great for our use case.
# The model will be downloaded from the internet the first time you run this.
print("Loading Sentence Transformer model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model loaded.")

# --- 4. Generate the Embeddings (The "Movie DNA") ---
# The model.encode() function will process the text and output the vectors.
# show_progress_bar=True displays a progress bar while encoding.
print("Generating movie DNA embeddings... (This may take a minute)")
movie_dna_embeddings = model.encode(corpus, show_progress_bar=True)

# --- 5. Save the Embeddings ---
EMBEDDINGS_PATH = "../data/processed/movie_dna_embeddings.npy"
np.save(EMBEDDINGS_PATH, movie_dna_embeddings)

print("\nMovie DNA creation complete!")
print(f"Shape of our DNA matrix: {movie_dna_embeddings.shape}")
print(f"(This means {movie_dna_embeddings.shape[0]} movies, each with a {movie_dna_embeddings.shape[1]}-dimension DNA vector)")
print(f"Embeddings saved to: {EMBEDDINGS_PATH}")