# Project 3: The "Movie DNA" Galaxy Explorer
## Part 1: Enriching Data with Keywords from TMDB

Our goal is to create a rich "fingerprint" for each film, and that requires more than just genre and cast. We need to understand the film's plot and themes. In this step, we will use the TMDB API to fetch descriptive keywords and other useful metadata (like poster paths) for every Pre-Code film in our dataset.

**Methodology:**
1.  **Load Data:** Start with our `hollywood_df.pkl` file.
2.  **Connect to TMDB:** Use the `tmdbsimple` library and our API key to connect to The Movie Database.
3.  **Fetch Keywords:** For each movie, we will use its IMDb ID (`tconst`) to look up the matching TMDB entry, then request that entry's keywords and poster path.
4.  **Handle Missing Data:** Some older films may not have entries or keywords. Our code must handle these cases gracefully.
5.  **Save Enriched Data:** We will save the result to a new file, `hollywood_df_enriched.pkl`, to use in the next steps.
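
Step 3 ends with each film's keyword list flattened into a single string. The sketch below runs that flattening on a hand-written payload shaped like the `keywords` list the notebook reads from TMDB. The sample entries and the helper name `flatten_keywords` are illustrative assumptions, not real API output. It also shows a comma-separated variant, which keeps multi-word keywords distinguishable; the notebook itself joins on spaces.

```python
def flatten_keywords(keyword_items, sep=" "):
    """Join the 'name' field of each keyword dict into one string."""
    names = [item["name"].strip() for item in keyword_items if item.get("name")]
    return sep.join(names)


# Hand-written stand-in for movie.keywords()["keywords"]; not real API output.
sample_payload = [
    {"id": 1, "name": "black and white"},
    {"id": 2, "name": "silent film"},
]

space_joined = flatten_keywords(sample_payload)            # what the notebook does
comma_joined = flatten_keywords(sample_payload, sep=", ")  # keeps keyword boundaries
no_keywords = flatten_keywords([])                         # film with no keywords

print(repr(space_joined))
print(repr(comma_joined))
print(repr(no_keywords))
```

With a space separator, "black and white" and "silent film" become indistinguishable from five unrelated words, which is worth keeping in mind when reading the embeddings later.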

In [6]:
import pandas as pd
import os
import tmdbsimple as tmdb
from tqdm.auto import tqdm
from dotenv import load_dotenv
import json # We need this library for caching

# --- 1. Load Environment Variables and Setup TMDB API ---
load_dotenv()
tmdb.API_KEY = os.getenv('TMDB_API_KEY')
if tmdb.API_KEY:
    print("TMDB API key loaded successfully from .env file.")
else:
    print("Error: Could not load TMDB_API_KEY from .env file.")

# --- 2. Load Hollywood DataFrame & Setup Cache Directory ---
HOLLYWOOD_DF_PATH = "../data/processed/hollywood_df.pkl"
hollywood_df = pd.read_pickle(HOLLYWOOD_DF_PATH)
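# Note: de-duplicating on primaryTitle keeps one row per title, so distinct films that
# share a title (for example, remakes) collapse into one; use subset=['tconst'] to keep them all.
unique_movies_df = hollywood_df[['tconst', 'primaryTitle', 'startYear', 'genres']].drop_duplicates(subset=['primaryTitle']).reset_index(drop=True)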
print(f"Loaded {len(unique_movies_df)} unique movies from Hollywood DataFrame.")

# Define the directory where we will store our cached results
CACHE_DIR = "../data/tmdb_cache"
os.makedirs(CACHE_DIR, exist_ok=True)
print(f"Using cache directory: {CACHE_DIR}")

# --- 3. Function to Fetch Keywords (with Caching) ---
def get_keywords_from_tmdb(tconst, cache_dir):
    """
    Fetches keywords and poster path for a tconst, using a local file cache.
    """
    cache_filepath = os.path.join(cache_dir, f"{tconst}.json")

    # First, check if the result is already in our cache
    if os.path.exists(cache_filepath):
        with open(cache_filepath, 'r') as f:
            cached_data = json.load(f)
            # Return the cached keywords and poster path
            return cached_data.get('keywords', ''), cached_data.get('poster_path', '')

    # If not in cache, proceed with the API call
    if not tmdb.API_KEY:
        return "NO_API_KEY", ""
    try:
        find = tmdb.Find(tconst)
        response = find.info(external_source='imdb_id')

        if not response['movie_results']:
            # No TMDB entry for this IMDb ID: cache the miss with empty keywords
            # so placeholder text is never embedded as if it were a real keyword.
            result_to_cache = {"status": "not_found", "keywords": "", "poster_path": ""}
        else:
            movie_result = response['movie_results'][0]
            movie = tmdb.Movies(movie_result['id'])
            keywords = movie.keywords()['keywords']
            keyword_str = ' '.join(k['name'] for k in keywords)
            # poster_path can be present but null, so fall back to ''
            poster_path = movie_result.get('poster_path') or ''
            result_to_cache = {"status": "success", "keywords": keyword_str, "poster_path": poster_path}

    except Exception as exc:
        # Do not cache failures: a transient network or rate-limit error
        # should be retried on the next run, not stored permanently.
        tqdm.write(f"TMDB request failed for {tconst}: {exc}")
        return "", ""

    # Save the result to the cache file before returning
    with open(cache_filepath, 'w') as f:
        json.dump(result_to_cache, f)

    return result_to_cache['keywords'], result_to_cache['poster_path']

# --- 4. Loop Through Movies and Enrich Data ---
if tmdb.API_KEY:
    tqdm.pandas(desc="Fetching Keywords from TMDB (with Cache)")
    # We pass the cache directory to our function using a lambda
    results = unique_movies_df['tconst'].progress_apply(lambda tconst: get_keywords_from_tmdb(tconst, CACHE_DIR))
    unique_movies_df[['keywords', 'poster_path']] = pd.DataFrame(results.tolist(), index=unique_movies_df.index)

    # --- 5. Save the Enriched Data ---
    ENRICHED_DF_PATH = "../data/processed/hollywood_df_enriched.pkl"
    unique_movies_df.to_pickle(ENRICHED_DF_PATH)

    print(f"\nEnrichment complete. Saved {len(unique_movies_df)} movies with keyword data.")
    print("Sample of enriched data:")
    display(unique_movies_df.head())
else:
    print("\nSkipping data enrichment because TMDB API key was not found.")

TMDB API key loaded successfully from .env file.
Loaded 4514 unique movies from Hollywood DataFrame.
Using cache directory: ../data/tmdb_cache




Enrichment complete. Saved 4514 movies with keyword data.
Sample of enriched data:


|   | tconst | primaryTitle | startYear | genres | keywords | poster_path |
|---|---|---|---|---|---|---|
| 0 | tt0017578 | The Wrecker | 1929 | Crime,Drama | | /oGCsBdjxDq7b4eTpTsjJrA6VayX.jpg |
| 1 | tt0018362 | The Scar of Shame | 1929 | Crime,Drama,Romance | marriage contract prison escape class differen... | /tosJ21bDxJzvg2JcTuKQMqWglvM.jpg |
| 2 | tt0018588 | Three Loves | 1929 | Drama | black and white silent film | /ncyxOfdoS0Rz9VWHxyx6HLyU5nB.jpg |
| 3 | tt0018630 | After the Verdict | 1929 | Drama,Romance,Sport | sports | /dOGxHjXBFp1D59hZLzBl9Gheg20.jpg |
| 4 | tt0018685 | The Bellamy Trial | 1929 | Adventure,Crime,Drama | | |
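
Because every lookup is written to one JSON file per `tconst`, the cache directory doubles as a record of how the enrichment went. The sketch below tallies the `status` field across cache files. The helper name `summarize_cache` is an assumption, and the demo writes made-up records to a temporary directory rather than reading the real `../data/tmdb_cache`.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def summarize_cache(cache_dir):
    """Count cached TMDB results by their 'status' field."""
    counts = Counter()
    for path in Path(cache_dir).glob("*.json"):
        try:
            with open(path, "r", encoding="utf-8") as f:
                record = json.load(f)
        except (OSError, json.JSONDecodeError):
            # A half-written or corrupt file is counted rather than crashing the audit
            counts["unreadable"] += 1
            continue
        counts[record.get("status", "missing_status")] += 1
    return counts


# Demo on a throwaway directory with invented records (IDs are placeholders).
with tempfile.TemporaryDirectory() as demo_dir:
    demo_records = {
        "tt0000001": {"status": "success", "keywords": "sports", "poster_path": ""},
        "tt0000002": {"status": "not_found", "keywords": "", "poster_path": ""},
        "tt0000003": {"status": "not_found", "keywords": "", "poster_path": ""},
    }
    for tconst, record in demo_records.items():
        (Path(demo_dir) / f"{tconst}.json").write_text(json.dumps(record), encoding="utf-8")
    demo_counts = summarize_cache(demo_dir)

print(dict(demo_counts))
```

Pointing `summarize_cache` at the notebook's `CACHE_DIR` would show how many films had no TMDB match, which helps judge how much of the later map rests on real keyword data.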


## Part 2: Engineering the "Movie DNA" with AI

With our enriched dataset, we can now perform the core machine learning task. We will use a pre-trained Sentence Transformer model to read the keywords for each film and convert them into a fixed-length numeric vector, also known as an "embedding." Texts with similar meaning map to nearby vectors, so this vector acts as the film's "DNA," capturing its thematic content in a form that can be compared numerically.

**Methodology:**
1.  **Load Enriched Data:** We'll load the `hollywood_df_enriched.pkl` file we created in the previous step.
2.  **Instantiate AI Model:** We will load a compact, general-purpose sentence-embedding model (`all-MiniLM-L6-v2`) from the `sentence-transformers` library. The first time this runs, it will download the model files (roughly 90 MB of weights).
3.  **Generate Embeddings:** We will feed the `keywords` column into the model. The model will output a 384-dimensional vector for each film.
4.  **Save the DNA:** We will save these embeddings to a file so later runs can load them instead of re-encoding every film.
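
Step 3 embeds the `keywords` column as-is, so every film with an empty keyword string receives the same vector. One possible variation, which this notebook does not use, is to fall back to other columns when keywords are missing. The sketch below shows the idea; the helper name `build_dna_text`, the placeholder set, and the fallback format are all assumptions made for illustration.

```python
# Strings that carry no thematic information (assumed placeholder values)
PLACEHOLDERS = {"", "not_found", "api_error", "NO_API_KEY"}


def build_dna_text(keywords, genres="", title=""):
    """Return the text to embed for one film.

    Uses the keywords when they exist; otherwise falls back to
    "<title> <genres>" so the film does not get an empty input.
    """
    text = keywords.strip() if isinstance(keywords, str) else ""
    if text not in PLACEHOLDERS:
        return text
    genre_source = genres if isinstance(genres, str) else ""
    genre_text = " ".join(g.strip() for g in genre_source.split(",") if g.strip())
    title_text = title.strip() if isinstance(title, str) else ""
    return " ".join(part for part in (title_text, genre_text) if part)


with_keywords = build_dna_text("sports", "Drama,Romance,Sport", "After the Verdict")
without_keywords = build_dna_text("", "Adventure,Crime,Drama", "The Bellamy Trial")
nothing_at_all = build_dna_text(None, None, "")

print(repr(with_keywords))
print(repr(without_keywords))
print(repr(nothing_at_all))
```

The trade-off is that fallback rows are positioned by title and genre rather than by theme, so mixing both kinds of rows in one map makes the clusters harder to interpret.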

In [7]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm
import os

# --- 1. Define Paths and Load Enriched Data ---
ENRICHED_DF_PATH = "../data/processed/hollywood_df_enriched.pkl"
EMBEDDINGS_PATH = "../data/processed/movie_dna_embeddings.npy"

enriched_df = pd.read_pickle(ENRICHED_DF_PATH)
print("Enriched movie data loaded successfully.")

# --- 2. Check for Cached Embeddings ---
if os.path.exists(EMBEDDINGS_PATH):
    print(f"Found cached 'Movie DNA' embeddings. Loading from: {EMBEDDINGS_PATH}")
    movie_dna_embeddings = np.load(EMBEDDINGS_PATH)
else:
    print("No cached embeddings found. Generating new ones...")
    
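    # --- Prepare the Text Data ---
    # Treat missing values and placeholder strings from the fetch step as "no keywords"
    # so they are not embedded as if they were real keywords.
    placeholders = {'not_found': '', 'api_error': '', 'NO_API_KEY': ''}
    enriched_df['keywords'] = enriched_df['keywords'].fillna('').replace(placeholders)
    corpus = enriched_df['keywords'].tolist()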

    # --- Instantiate and Use the Transformer Model ---
    print("Loading Sentence Transformer model (this may download the model)...")
    model = SentenceTransformer('all-MiniLM-L6-v2')
    print("Model loaded.")

    # --- Generate the Embeddings (The "Movie DNA") ---
    print("Generating movie DNA embeddings... (This may take a minute)")
    movie_dna_embeddings = model.encode(corpus, show_progress_bar=True)

    # --- Save the Embeddings to the Cache ---
    np.save(EMBEDDINGS_PATH, movie_dna_embeddings)
    print(f"Embeddings saved to cache: {EMBEDDINGS_PATH}")

print("\nMovie DNA creation complete!")
print(f"Shape of our DNA matrix: {movie_dna_embeddings.shape}")

Enriched movie data loaded successfully.
Found cached 'Movie DNA' embeddings. Loading from: ../data/processed/movie_dna_embeddings.npy

Movie DNA creation complete!
Shape of our DNA matrix: (4514, 384)
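
Each row of the `(4514, 384)` matrix is one film's vector. Sentence embeddings are commonly compared with cosine similarity, which measures the angle between two vectors and ignores their length. The sketch below illustrates that comparison on invented three-dimensional vectors standing in for the real 384-dimensional rows; `cosine_similarity` and `most_similar` are hypothetical helpers, not part of the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def most_similar(query_index, vectors, top_k=2):
    """Indices of the top_k vectors most similar to vectors[query_index]."""
    scores = [
        (cosine_similarity(vectors[query_index], vec), idx)
        for idx, vec in enumerate(vectors)
        if idx != query_index
    ]
    scores.sort(reverse=True)
    return [idx for _, idx in scores[:top_k]]


# Invented toy vectors; real rows would come from movie_dna_embeddings.
toy_vectors = [
    [1.0, 0.0, 0.0],   # film 0
    [0.9, 0.1, 0.0],   # film 1: almost the same direction as film 0
    [0.0, 1.0, 0.0],   # film 2: orthogonal to film 0
    [-1.0, 0.0, 0.0],  # film 3: opposite direction to film 0
]

neighbours = most_similar(0, toy_vectors, top_k=2)
print(neighbours)
```

On the full matrix a vectorized NumPy version would be far faster; the pure-Python form is only meant to make the arithmetic visible.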


## Part 3: Mapping the Galaxy with t-SNE

Our movie DNA lives in a 384-dimensional space, which is impossible to visualize directly. In this step, we will use **t-SNE**, a nonlinear dimensionality reduction algorithm, to project these vectors down to 2D (an 'x' and 'y' coordinate for each movie). t-SNE tries to keep each film close to its nearest neighbors, so thematically similar movies land near one another and form a "galaxy" of clusters we can explore. Distances between far-apart clusters, and the axes themselves, carry no direct meaning.

**Methodology:**
1.  **Load Data:** We will load our cached `movie_dna_embeddings.npy` file.
2.  **Instantiate t-SNE:** We will configure the t-SNE model from `scikit-learn`.
3.  **Run Reduction:** We'll apply the `fit_transform` method to our embeddings. This is a computationally intensive step.
4.  **Cache Results:** Just like before, we will save the resulting 2D coordinates to a file to avoid re-running this expensive step.
5.  **Merge and Save:** We will merge these new 'x' and 'y' coordinates back into our main movie DataFrame for the final visualization step.
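
The configuration in step 2 depends on the installed scikit-learn version: as far as the changelog indicates, releases from 1.5 onward name the iteration-count argument `max_iter`, while earlier releases call it `n_iter`. The sketch below picks the keyword by inspecting the constructor signature. `iteration_kwarg` is a hypothetical helper, and the two stand-in classes exist only so the example runs without scikit-learn installed.

```python
import inspect


def iteration_kwarg(estimator_cls, n_iterations):
    """Return {'max_iter': n} if the constructor accepts it, else {'n_iter': n}."""
    params = inspect.signature(estimator_cls.__init__).parameters
    name = "max_iter" if "max_iter" in params else "n_iter"
    return {name: n_iterations}


class NewStyleEstimator:
    """Stand-in for a constructor that accepts max_iter (and a deprecated n_iter)."""
    def __init__(self, n_components=2, max_iter=None, n_iter="deprecated"):
        self.max_iter = max_iter


class OldStyleEstimator:
    """Stand-in for a constructor that only knows n_iter."""
    def __init__(self, n_components=2, n_iter=1000):
        self.n_iter = n_iter


new_kwargs = iteration_kwarg(NewStyleEstimator, 1000)
old_kwargs = iteration_kwarg(OldStyleEstimator, 1000)
print(new_kwargs, old_kwargs)
```

In the notebook this could be used as `TSNE(n_components=2, perplexity=30, init='pca', random_state=42, **iteration_kwarg(TSNE, 1000))`, which keeps the cell working across versions without pinning one spelling.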

In [8]:
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
import os

# --- 1. Define Paths and Check for Cached Coordinates ---
ENRICHED_DF_PATH = "../data/processed/hollywood_df_enriched.pkl"
EMBEDDINGS_PATH = "../data/processed/movie_dna_embeddings.npy"
TSNE_COORDS_PATH = "../data/processed/tsne_2d_coordinates.npy"

# Load the data we'll need
enriched_df = pd.read_pickle(ENRICHED_DF_PATH)
movie_dna_embeddings = np.load(EMBEDDINGS_PATH)

# --- 2. Run t-SNE or Load from Cache ---
if os.path.exists(TSNE_COORDS_PATH):
    print(f"Found cached t-SNE coordinates. Loading from: {TSNE_COORDS_PATH}")
    tsne_coords = np.load(TSNE_COORDS_PATH)
else:
    print("No cached t-SNE coordinates found. Running dimensionality reduction...")
    print("This is computationally intensive and will take several minutes.")
    
    # Configure the t-SNE model
    tsne = TSNE(
        n_components=2,          # We want a 2D map
        perplexity=30,           # scikit-learn's default; roughly how many neighbors each point attends to
        init='pca',              # Initialize with PCA for better stability
        max_iter=1000,           # Number of iterations (this argument was named n_iter before scikit-learn 1.5)
        random_state=42,         # For reproducible results
        verbose=1                # To see the progress
    )
    
    # Run the model
    tsne_coords = tsne.fit_transform(movie_dna_embeddings)
    
    # Save the results to our cache
    np.save(TSNE_COORDS_PATH, tsne_coords)
    print(f"t-SNE coordinates saved to cache: {TSNE_COORDS_PATH}")

# --- 3. Merge Coordinates into our Main DataFrame ---
print("\nMerging 2D coordinates into the main movie DataFrame...")

# Create a DataFrame from our 2D coordinates
coords_df = pd.DataFrame(tsne_coords, columns=['x', 'y'])

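# Merge it with our original enriched data (rows must line up one-to-one)
assert len(coords_df) == len(enriched_df), "Cached coordinates do not match the DataFrame; delete the cache and re-run."
final_galaxy_df = pd.concat([enriched_df.reset_index(drop=True), coords_df], axis=1)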

# Save this final, fully-processed DataFrame
FINAL_DF_PATH = "../data/processed/hollywood_galaxy_df.pkl"
final_galaxy_df.to_pickle(FINAL_DF_PATH)

print(f"Final DataFrame for visualization saved to: {FINAL_DF_PATH}")
print("Sample of the final data with 'x' and 'y' coordinates:")
display(final_galaxy_df.head())

Found cached t-SNE coordinates. Loading from: ../data/processed/tsne_2d_coordinates.npy

Merging 2D coordinates into the main movie DataFrame...
Final DataFrame for visualization saved to: ../data/processed/hollywood_galaxy_df.pkl
Sample of the final data with 'x' and 'y' coordinates:


|   | tconst | primaryTitle | startYear | genres | keywords | poster_path | x | y |
|---|---|---|---|---|---|---|---|---|
| 0 | tt0017578 | The Wrecker | 1929 | Crime,Drama | | /oGCsBdjxDq7b4eTpTsjJrA6VayX.jpg | 2.520578 | 35.050861 |
| 1 | tt0018362 | The Scar of Shame | 1929 | Crime,Drama,Romance | marriage contract prison escape class differen... | /tosJ21bDxJzvg2JcTuKQMqWglvM.jpg | -9.107474 | -16.665449 |
| 2 | tt0018588 | Three Loves | 1929 | Drama | black and white silent film | /ncyxOfdoS0Rz9VWHxyx6HLyU5nB.jpg | 31.801432 | -39.114357 |
| 3 | tt0018630 | After the Verdict | 1929 | Drama,Romance,Sport | sports | /dOGxHjXBFp1D59hZLzBl9Gheg20.jpg | 11.879026 | 0.767587 |
| 4 | tt0018685 | The Bellamy Trial | 1929 | Adventure,Crime,Drama | | | -2.03462 | 34.347107 |


## Part 4: Building the Interactive Galaxy Explorer 🌌

This is the final step. We will now take our 2D coordinates and all our enriched metadata to build the final interactive visualization using **Plotly**. The result will be a "galaxy map" of Pre-Code Hollywood films.

**Methodology:**
1.  **Load Final Data:** We will load our final `hollywood_galaxy_df.pkl` file, which contains all movie info and their 'x' and 'y' coordinates.
2.  **Prepare for Plotting:** We will create a `primary_genre` column to color the movie points effectively.
3.  **Create Scatter Plot:** We will use `plotly.express` to create the scatter plot.
4.  **Customize Interactivity:** We will configure the hover-over tooltips to display rich information for each movie, including title, year, keywords, and the poster path.
5.  **Style and Display:** We'll apply a dark theme and other styling to make the visualization beautiful and easy to explore.
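
The `poster_path` values stored in Part 1 are relative paths such as `/oGCsBdjxDq7b4eTpTsjJrA6VayX.jpg`, so they need a base URL and an image size before they point at an actual image. The sketch below builds such a URL. The helper name `poster_url` is hypothetical, and the hard-coded base URL and `w342` size are assumptions based on TMDB's image conventions; the authoritative values come from TMDB's configuration endpoint.

```python
def poster_url(poster_path, size="w342", base_url="https://image.tmdb.org/t/p/"):
    """Build a full poster URL from a TMDB-relative path, or None if there is no poster."""
    if not isinstance(poster_path, str) or not poster_path.strip():
        return None  # covers '', None, and NaN from pandas
    return f"{base_url}{size}/{poster_path.strip().lstrip('/')}"


with_poster = poster_url("/oGCsBdjxDq7b4eTpTsjJrA6VayX.jpg")
without_poster = poster_url("")

print(with_poster)
print(without_poster)
```

Returning `None` for missing posters lets the caller decide whether to skip the image or show a placeholder.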

In [11]:
import pandas as pd
import plotly.express as px

# --- 1. Load the Final, Processed Data ---
FINAL_DF_PATH = "../data/processed/hollywood_galaxy_df.pkl"
final_galaxy_df = pd.read_pickle(FINAL_DF_PATH)
print("Final galaxy data loaded successfully.")

# --- 2. Prepare Data for Visualization ---
# Create the primary_genre column
final_galaxy_df['primary_genre'] = final_galaxy_df['genres'].fillna('Unknown').str.split(',').str.get(0)

# Keep only the larger genres so the main clusters stay readable
genre_counts = final_galaxy_df['primary_genre'].value_counts()
# Plot only genres that have more than 15 movies
major_genres = genre_counts[genre_counts > 15].index
plot_df = final_galaxy_df[final_galaxy_df['primary_genre'].isin(major_genres)]
print(f"Plotting {len(plot_df)} movies from major genres for a clearer view.")

# --- 3. Create the Interactive Scatter Plot ---
fig = px.scatter(
    plot_df,
    x='x',
    y='y',
    color='primary_genre',
    hover_name='primaryTitle',
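    # Show year, keywords, and poster path in the tooltip; hide the raw t-SNE coordinates
    hover_data={'startYear': True, 'keywords': True, 'poster_path': True, 'x': False, 'y': False},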
    title="An Interactive Galaxy of Pre-Code Hollywood Films"
)

# --- 4. Style the Visualization ---
fig.update_layout(
    template='plotly_dark',
    legend_title_text='Primary Genre',
    title={'y':0.95, 'x':0.5, 'xanchor': 'center', 'yanchor': 'top'},
    xaxis_title=None, # The axes have no direct meaning, so we hide the titles
    yaxis_title=None
)
fig.update_traces(marker=dict(size=6, opacity=0.9, line=dict(width=0.5, color='DarkSlateGrey')))

# --- 5. Add Annotations to Label Key Clusters ---
# These coordinates are estimates based on the plot layout.
# They add a layer of human-readable meaning to the clusters.
annotations = [
    dict(x=-25, y=-25, text="<b>Drama & Romance Cluster</b>", showarrow=False, font=dict(color="white", size=14)),
    dict(x=35, y=-15, text="<b>Comedy & Musical<br>Neighborhood</b>", showarrow=False, font=dict(color="white", size=14)),
    dict(x=-5, y=40, text="<b>Crime & Mystery<br>Territory</b>", showarrow=False, font=dict(color="white", size=14))
]
fig.update_layout(annotations=annotations)


# --- 6. Show the Final Plot ---
fig.show()

Final galaxy data loaded successfully.
Plotting 4451 movies from major genres for a clearer view.
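
The annotation coordinates in the cell above are hand-estimated from the rendered plot, so they go stale whenever t-SNE is re-run with different settings. A sketch of one alternative is to place each label at the median position of a genre's points. The helper name `genre_label_positions` and the toy points are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def genre_label_positions(points):
    """Map each genre to the (median x, median y) of its points.

    points: iterable of (genre, x, y) tuples.
    """
    xs = defaultdict(list)
    ys = defaultdict(list)
    for genre, x, y in points:
        xs[genre].append(x)
        ys[genre].append(y)
    return {genre: (median(xs[genre]), median(ys[genre])) for genre in xs}


# Invented coordinates, not taken from the real t-SNE output.
toy_points = [
    ("Drama", -30.0, -20.0),
    ("Drama", -20.0, -30.0),
    ("Drama", -25.0, -25.0),
    ("Comedy", 30.0, -10.0),
    ("Comedy", 40.0, -20.0),
]

positions = genre_label_positions(toy_points)
print(positions)
```

With the real data, the tuples could be produced by `plot_df[['primary_genre', 'x', 'y']].itertuples(index=False, name=None)`. The median is used rather than the mean so that a few stray films far from the main cluster do not drag the label away from it.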
