
### **Purpose**

This script enriches the MovieLens movie dataset (`movies.dat`) with detailed movie metadata from The Movie Database (TMDB) API: movie overviews, genres, poster and backdrop image URLs, cast and director information, keywords, user ratings (vote average and vote count), and trailer links. The enriched dataset serves as the foundation for building content-based, collaborative, and hybrid recommender systems.

### **Methodology**

1. **Load MovieLens Movie Data**
   The script loads the `movies.dat` file, which contains basic movie information including `movieId`, `title`, and `genres`.

2. **Clean Titles and Extract Years**
   It processes the movie titles to remove the year from the title string and separately extracts the release year to improve search accuracy when querying TMDB.

3. **Query TMDB API**
   For each movie, it sends a search request to TMDB using the cleaned title and release year. If the search returns any results, the first one is taken as the match and its TMDB ID is recorded.

4. **Retrieve Detailed Metadata**
   From the search result, and from follow-up requests keyed by the TMDB ID, the script collects:

   * Overview (plot summary)
   * Poster and backdrop image paths
   * Vote average and vote count
   * Genre IDs, which are then mapped to readable genre names
   * Top 3 cast members
   * Director(s)
   * Associated keywords
   * YouTube trailer link (if available)

5. **Construct and Save Enriched Dataset**
   All metadata is compiled into a structured format and merged with the original MovieLens data. The final dataset is saved as `movies_enriched_full.csv` for downstream use in recommendation models.

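The title-cleaning step (step 2) can also be sketched with a regular expression instead of fixed string offsets. This is a minimal sketch, not the notebook's implementation: `split_title` and `move_article` are hypothetical helpers, and moving a trailing article ("American President, The") to the front is an assumption about what might match TMDB titles better, not something the script below does.

```python
import re

# Matches "<title> (<4-digit year>)" at the end of a MovieLens title string.
_TITLE_YEAR = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$")
# Matches a trailing article such as ", The" (hypothetical normalization).
_TRAILING_ARTICLE = re.compile(r"^(?P<rest>.+), (?P<article>The|A|An)$")

def split_title(raw):
    """Return (title_without_year, year), with year=None when no year is found."""
    m = _TITLE_YEAR.match(raw.strip())
    if m is None:
        return raw.strip(), None
    return m.group("title"), int(m.group("year"))

def move_article(title):
    """Turn 'American President, The' into 'The American President'."""
    m = _TRAILING_ARTICLE.match(title)
    if m is None:
        return title
    return f"{m.group('article')} {m.group('rest')}"

title, year = split_title("American President, The (1995)")
print(move_article(title), year)
```

Titles without a trailing year come back unchanged with `None` as the year, so the caller can decide whether to search TMDB without a year filter.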

In [None]:
import pandas as pd
import requests
from tqdm import tqdm
import time

# ---------------------------------------
# CONFIG
# ---------------------------------------
BASE_URL = "https://api.themoviedb.org/3"
IMAGE_BASE = "https://image.tmdb.org/t/p/w500"

# Read the TMDB v4 bearer token from the environment instead of hardcoding it.
# TMDB_BEARER_TOKEN is an assumed variable name; set it before running this cell.
import os

HEADERS = {
    "Authorization": f"Bearer {os.environ['TMDB_BEARER_TOKEN']}"
}

# TMDB genre ID to name mapping
GENRE_ID_TO_NAME = {
    28: "Action", 12: "Adventure", 16: "Animation", 35: "Comedy", 80: "Crime",
    99: "Documentary", 18: "Drama", 10751: "Family", 14: "Fantasy", 36: "History",
    27: "Horror", 10402: "Music", 9648: "Mystery", 10749: "Romance", 878: "Science Fiction",
    10770: "TV Movie", 53: "Thriller", 10752: "War", 37: "Western"
}

# ---------------------------------------
# STEP 1: Load MovieLens .dat Files
# ---------------------------------------

# Load movies.dat - format: MovieID::Title::Genres
movies_df = pd.read_csv("movies.dat", sep="::", engine='python', header=None, names=["movieId", "title", "genres"], encoding="latin-1")

# ---------------------------------------
# STEP 2: Clean Movie Titles and Extract Year
# ---------------------------------------

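def extract_year(title):
    # MovieLens titles end with "(YYYY)"; slice out the four characters before the final ")".
    if "(" in title:
        try:
            return int(title.strip()[-5:-1])
        except ValueError:  # the trailing parenthetical was not a year
            return None
    return None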

def clean_title(title):
    if "(" in title:
        return title[:title.rfind("(")].strip()
    return title.strip()

movies_df["year"] = movies_df["title"].apply(extract_year)
movies_df["clean_title"] = movies_df["title"].apply(clean_title)

# ---------------------------------------
# STEP 3: TMDB Metadata Functions
# ---------------------------------------

# Search for movie in TMDB
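def search_tmdb(title, year):
    url = f"{BASE_URL}/search/movie"
    params = {"query": title}
    if pd.notna(year):  # omit the year filter when no year could be parsed
        params["year"] = int(year)
    response = requests.get(url, headers=HEADERS, params=params, timeout=10)
    response.raise_for_status()  # surface auth and rate-limit errors instead of silently returning no match
    r = response.json()
    # Take the first search result as the match
    if r.get("results"):
        return r["results"][0]
    return None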

# Get full metadata from TMDB
def get_full_tmdb_metadata(tmdb_id):
    metadata = {}

    # Credits (cast, crew)
    credits = requests.get(f"{BASE_URL}/movie/{tmdb_id}/credits", headers=HEADERS).json()
    cast = [c["name"] for c in credits.get("cast", [])[:3]]
    directors = [c["name"] for c in credits.get("crew", []) if c.get("job") == "Director"]

    # Keywords
    keywords = requests.get(f"{BASE_URL}/movie/{tmdb_id}/keywords", headers=HEADERS).json()
    keyword_list = [k["name"] for k in keywords.get("keywords", [])]

    # Videos (trailers)
    videos = requests.get(f"{BASE_URL}/movie/{tmdb_id}/videos", headers=HEADERS).json()
    trailer_links = [
        f"https://www.youtube.com/watch?v={v['key']}"
        for v in videos.get("results", [])
        # .get avoids a KeyError if an entry lacks "site" or "type"
        if v.get("site") == "YouTube" and v.get("type") == "Trailer"
    ]

    # Final metadata dictionary
    metadata["top_3_cast"] = ", ".join(cast)
    metadata["directors"] = ", ".join(directors)
    metadata["keywords"] = ", ".join(keyword_list)
    metadata["trailer_link"] = trailer_links[0] if trailer_links else None

    return metadata

# ---------------------------------------
# STEP 4: Enrich Movie Data
# ---------------------------------------

enriched = []

for _, row in tqdm(movies_df.iterrows(), total=len(movies_df)):
    movie_data = search_tmdb(row["clean_title"], row["year"])

    if movie_data:
        tmdb_id = movie_data["id"]
        extra = get_full_tmdb_metadata(tmdb_id)

        genre_ids = movie_data.get("genre_ids", [])
        genre_names = [GENRE_ID_TO_NAME.get(gid, str(gid)) for gid in genre_ids]

        enriched.append({
            "tmdb_id": tmdb_id,
            "overview": movie_data.get("overview", ""),
            "poster_path": IMAGE_BASE + movie_data.get("poster_path", "") if movie_data.get("poster_path") else None,
            "backdrop_path": IMAGE_BASE + movie_data.get("backdrop_path", "") if movie_data.get("backdrop_path") else None,
            "vote_average": movie_data.get("vote_average", None),
            "vote_count": movie_data.get("vote_count", None),
            "tmdb_genres": ", ".join(genre_names),
            **extra
        })
    else:
        enriched.append({
            "tmdb_id": None,
            "overview": None,
            "poster_path": None,
            "backdrop_path": None,
            "vote_average": None,
            "vote_count": None,
            "tmdb_genres": None,
            "top_3_cast": None,
            "directors": None,
            "keywords": None,
            "trailer_link": None
        })

    time.sleep(0.25)  # Respect TMDB API rate limits

# ---------------------------------------
# STEP 5: Save Final Dataset
# ---------------------------------------

enriched_df = pd.DataFrame(enriched)
final_df = pd.concat([movies_df, enriched_df], axis=1)
final_df.to_csv("movies_enriched_full.csv", index=False)

print("DONE: Saved as 'movies_enriched_full.csv'")


100%|██████████| 3883/3883 [34:55<00:00,  1.85it/s]

✅ DONE: Saved as 'movies_enriched_full.csv'

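The enrichment loop issues four requests per matched movie (search, credits, keywords, videos). TMDB's movie-details endpoint accepts an `append_to_response` query parameter that bundles sub-resources into a single response, which could reduce this to two requests. The sketch below is not the notebook's code: `details_url` and `parse_bundle` are hypothetical helpers, the movie ID is a placeholder, and the payload layout (`credits`, `keywords.keywords`, `videos.results`) is an assumption that should be checked against a live response.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.themoviedb.org/3"

def details_url(tmdb_id):
    """Build one details URL that also requests credits, keywords, and videos."""
    query = urlencode({"append_to_response": "credits,keywords,videos"}, safe=",")
    return f"{BASE_URL}/movie/{tmdb_id}?{query}"

def parse_bundle(payload):
    """Extract the same four fields as get_full_tmdb_metadata from a bundled payload."""
    credits = payload.get("credits", {})
    cast = [c["name"] for c in credits.get("cast", [])[:3]]
    directors = [c["name"] for c in credits.get("crew", []) if c.get("job") == "Director"]
    keywords = [k["name"] for k in payload.get("keywords", {}).get("keywords", [])]
    trailers = [
        f"https://www.youtube.com/watch?v={v['key']}"
        for v in payload.get("videos", {}).get("results", [])
        if v.get("site") == "YouTube" and v.get("type") == "Trailer"
    ]
    return {
        "top_3_cast": ", ".join(cast),
        "directors": ", ".join(directors),
        "keywords": ", ".join(keywords),
        "trailer_link": trailers[0] if trailers else None,
    }

# Hand-written stand-in for a response body (not real TMDB data).
sample = {
    "credits": {
        "cast": [{"name": "Actor One"}, {"name": "Actor Two"},
                 {"name": "Actor Three"}, {"name": "Actor Four"}],
        "crew": [{"name": "Some Director", "job": "Director"},
                 {"name": "Some Editor", "job": "Editor"}],
    },
    "keywords": {"keywords": [{"name": "friendship"}, {"name": "toys"}]},
    "videos": {"results": [{"site": "YouTube", "type": "Trailer", "key": "abc123"}]},
}

print(details_url(123))
print(parse_bundle(sample))
```

Only the URL construction and the parsing are shown; the actual request would go through `requests.get` with the same `HEADERS` as the rest of the script.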




## **Content-Based Filtering (CBF)**

### ***CBF Functions***

In [2]:
# --- 1. Imports & Config ---
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score
import ast
import pickle
from joblib import Parallel, delayed
from tqdm import tqdm

# Set display options
pd.set_option('display.max_colwidth', None)

# --- 2. Load Movie Data ---
def load_movie_data(filepath):
    df = pd.read_csv(filepath)
    print(f"Loaded {len(df)} movies.")
    return df

# --- 3. Create Feature String Per Movie (based on your data format) ---
def create_feature_string(df):
    def clean_text(col):
        return col.fillna('').str.replace(r'\s+', '', regex=True).str.replace(',', ' ')

    df['cbf_features'] = (
        clean_text(df['tmdb_genres']) + ' ' +
        clean_text(df['keywords']) + ' ' +
        clean_text(df['top_3_cast']) + ' ' +
        clean_text(df['directors'])
    )

    return df[['movieId', 'title', 'cbf_features']]

# --- 4. Vectorize Features (TF-IDF and Count) ---
def vectorize_features(text_series, method='tfidf'):
    if method == 'tfidf':
        vectorizer = TfidfVectorizer(stop_words='english')
    elif method == 'count':
        vectorizer = CountVectorizer(stop_words='english')
    else:
        raise ValueError("Method must be 'tfidf' or 'count'")

    matrix = vectorizer.fit_transform(text_series)
    print(f"{method.upper()} vectorization complete. Shape: {matrix.shape}")
    return matrix, vectorizer

# --- 5. Compute Similarity Matrix ---
def compute_cosine_similarity(matrix):
    sim = cosine_similarity(matrix)
    print("Cosine similarity computed.")
    return sim

# --- 6. Save Matrix and Preview Similarities ---
def preview_similar_movies(sim_matrix, movie_df, movie_idx=0, top_n=5):
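    # Exclude the query movie by index; with ties or all-zero rows it is not always ranked first.
    sim_scores = [(i, s) for i, s in enumerate(sim_matrix[movie_idx]) if i != movie_idx]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    top_movies = sim_scores[:top_n]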

    print(f"\nTop {top_n} movies similar to: {movie_df.iloc[movie_idx]['title']}")
    for idx, score in top_movies:
        print(f"{movie_df.iloc[idx]['title']} — Similarity Score: {score:.3f}")

def save_matrix(matrix, filename):
    with open(filename, 'wb') as f:
        pickle.dump(matrix, f)
    print(f"Saved similarity matrix to: {filename}")

# Preview Top Matches
def preview_similar_movies_df(sim_matrix, movie_df, movie_idx=0, top_n=5):
    # Exclude the query movie by index; with ties or all-zero rows it is not always ranked first.
    sim_scores = [(i, s) for i, s in enumerate(sim_matrix[movie_idx]) if i != movie_idx]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    top_movies = sim_scores[:top_n]

    results = []
    for idx, score in top_movies:
        results.append({
            'Rank': len(results)+1,
            'Movie Title': movie_df.iloc[idx]['title'],
            'Similarity Score': round(score, 4)
        })

    return pd.DataFrame(results)

# --- Vectorize for Binary (Jaccard) ---
def binary_vectorize(text_series):
    vectorizer = CountVectorizer(binary=True, stop_words='english')
    matrix = vectorizer.fit_transform(text_series)
    print(f"Binary Count vectorization complete. Shape: {matrix.shape}")
    return matrix.toarray(), vectorizer

# --- Compute Jaccard Similarity (Pairwise, Parallel) ---
def jaccard_pairwise_parallel(matrix):
    n = matrix.shape[0]
    sim_matrix = np.zeros((n, n))

    def jaccard_row(i):
        a = matrix[i]
        row_sim = np.zeros(n)
        for j in range(i, n):
            b = matrix[j]
            intersection = np.logical_and(a, b).sum()
            union = np.logical_or(a, b).sum()
            score = intersection / union if union > 0 else 0.0
            row_sim[j] = score
        return i, row_sim

    results = Parallel(n_jobs=-1)(
        delayed(jaccard_row)(i) for i in tqdm(range(n), desc="Jaccard Similarity")
    )

    for i, row in results:
        sim_matrix[i, i:] = row[i:]
        sim_matrix[i:, i] = row[i:]

    print("Jaccard similarity matrix built.")
    return sim_matrix

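`create_feature_string` removes the whitespace inside each comma-separated entry before turning the commas into spaces, so a multiword genre or name survives as a single token. A plain-Python sketch of the same transformation on one value (the helper name `to_tokens` and the sample strings are illustrative, not taken from the dataset):

```python
import re

def to_tokens(value):
    """Mirror clean_text for one string: drop whitespace, then separate entries with spaces."""
    return re.sub(r"\s+", "", value or "").replace(",", " ")

print(to_tokens("Science Fiction, Comedy"))  # ScienceFiction Comedy
print(to_tokens("Jane Doe, John Roe"))       # JaneDoe JohnRoe
print(repr(to_tokens(None)))                 # ''
```

Without this step the vectorizer would split "Science Fiction" into two tokens, and two films could look similar merely because unrelated cast members share a first name.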

### ***CBF Function Implementation***

In [3]:
# Load Data
movie_df = load_movie_data('movies_enriched_full.csv')

# Create Feature Strings
movie_df = create_feature_string(movie_df)

# Vectorize with TF-IDF
tfidf_matrix, tfidf_vectorizer = vectorize_features(movie_df['cbf_features'], method='tfidf')

# Compute Cosine Similarity
cosine_sim_matrix = compute_cosine_similarity(tfidf_matrix)

similar_df = preview_similar_movies_df(cosine_sim_matrix, movie_df, movie_idx=10, top_n=5)
print(f"\nTop 5 movies similar to: {movie_df.iloc[10]['title']}\n")
display(similar_df)

# Convert to DataFrame for preview (optional)
cosine_sim_df = pd.DataFrame(cosine_sim_matrix, index=movie_df['title'], columns=movie_df['title'])

# Display first 5x5 block
print(f"\nTop 5 movies similarity matrix:\n{cosine_sim_df.iloc[:5, :5].round(3)}")

# Save Matrix
save_matrix(cosine_sim_matrix, 'cbf_cosine_similarity_tfidf.pkl')


Loaded 3883 movies.
TFIDF vectorization complete. Shape: (3883, 17113)
Cosine similarity computed.

Top 5 movies similar to: American President, The (1995)



   Rank                 Movie Title  Similarity Score
0     1     Story of Us, The (1999)            0.1907
1     2        Mars Attacks! (1996)            0.1731
2     3  My Fellow Americans (1996)            0.1588
3     4       Primary Colors (1998)            0.1550
4     5            First Kid (1996)            0.1456



Top 5 movies similarity matrix:
title                               Toy Story (1995)  Jumanji (1995)  \
title                                                                  
Toy Story (1995)                               1.000           0.022   
Jumanji (1995)                                 0.022           1.000   
Grumpier Old Men (1995)                        0.004           0.000   
Waiting to Exhale (1995)                       0.004           0.000   
Father of the Bride Part II (1995)             0.014           0.014   

title                               Grumpier Old Men (1995)  \
title                                                         
Toy Story (1995)                                      0.004   
Jumanji (1995)                                        0.000   
Grumpier Old Men (1995)                               1.000   
Waiting to Exhale (1995)                              0.015   
Father of the Bride Part II (1995)                    0.031   

title               

### ***Evaluate Content-Based Movie Similarity Using Cosine and Jaccard Metrics***

**Purpose:**

This module **evaluates and compares two similarity metrics**, **Cosine Similarity** and **Jaccard Similarity**, for a **Content-Based Filtering (CBF)** recommender system built on the MovieLens 1M dataset enriched with TMDB metadata (genres, keywords, cast, and directors).

**Key Objectives:**

* Transform descriptive movie features into machine-readable vectors (binary and weighted).
* Compute pairwise similarity scores using:

  * **Cosine Similarity** on TF-IDF or Count vectors.
  * **Jaccard Similarity** on binary vectors, parallelized for speed.
* Visualize and compare the distribution of similarity scores.
* Use the resulting similarity matrices for movie-to-movie recommendation and hybrid model fusion.

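To make the difference between the two metrics concrete, here is a small worked example on token sets. It is a sketch under simplifying assumptions: the tokens are made up, and the cosine shown is computed on binary indicator vectors, whereas the notebook's cosine matrix uses TF-IDF weights and will therefore give different numbers.

```python
import math

def jaccard(a, b):
    """Shared tokens divided by all distinct tokens; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def binary_cosine(a, b):
    """Cosine similarity of the 0/1 indicator vectors of two token sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Illustrative feature tokens, not real dataset rows.
movie_a = {"comedy", "romance", "actorone"}
movie_b = {"comedy", "drama", "actorone", "actortwo"}

print(round(jaccard(movie_a, movie_b), 4))        # 0.4     (2 shared / 5 distinct)
print(round(binary_cosine(movie_a, movie_b), 4))  # 0.5774  (2 / sqrt(3 * 4))
```

On binary vectors Jaccard never exceeds cosine, because the size of the union is at least the geometric mean of the two set sizes. This is one reason to expect the Jaccard score distribution to sit lower than the cosine one.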

In [4]:
# Step 1: Vectorize
binary_matrix, _ = binary_vectorize(movie_df['cbf_features'])

# Step 2: Compute Similarity (Parallel + Progress Bar)
jaccard_sim_matrix = jaccard_pairwise_parallel(binary_matrix)

save_matrix(jaccard_sim_matrix, 'cbf_jaccard_similarity.pkl')



Binary Count vectorization complete. Shape: (3883, 17113)


Jaccard Similarity: 100%|██████████| 3883/3883 [06:26<00:00, 10.04it/s] 


Jaccard similarity matrix built.
Saved similarity matrix to: cbf_jaccard_similarity.pkl

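The pairwise loop above took about six and a half minutes for 3,883 movies. Because the vectors are binary, the intersection counts for all pairs can be read off a single matrix product, and the union sizes follow from the row sums. A minimal NumPy sketch of that idea (the function name `jaccard_matrix` is hypothetical, and this version has not been benchmarked against the notebook's implementation):

```python
import numpy as np

def jaccard_matrix(binary):
    """All-pairs Jaccard similarity for a dense 0/1 matrix of shape (n_items, n_tokens)."""
    X = np.asarray(binary, dtype=np.float32)
    inter = X @ X.T                      # inter[i, j] = number of tokens shared by items i and j
    sizes = X.sum(axis=1)                # number of tokens per item
    union = sizes[:, None] + sizes[None, :] - inter
    sim = np.zeros_like(inter)
    # Leave 0.0 where both items have no tokens, matching the loop's union == 0 case.
    np.divide(inter, union, out=sim, where=union > 0)
    return sim

toy = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
])
print(jaccard_matrix(toy).round(3))
```

The result is dense and symmetric, so it can be passed to `save_matrix` in the same way as the output of `jaccard_pairwise_parallel`.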

In [7]:
# --- Required Libraries ---
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# --- Comparison Plot + Stats ---
def compare_similarity_matrices(cosine_sim, jaccard_sim):
    # Validate dimensions
    assert cosine_sim.shape == jaccard_sim.shape, "Matrices must be the same shape"

    # Extract upper triangle (excluding diagonal)
    cosine_vals = cosine_sim[np.triu_indices_from(cosine_sim, k=1)]
    jaccard_vals = jaccard_sim[np.triu_indices_from(jaccard_sim, k=1)]

    # Build long-form DataFrame
    df_long = pd.DataFrame({
        'score': np.concatenate([cosine_vals, jaccard_vals]),
        'metric': ['Cosine'] * len(cosine_vals) + ['Jaccard'] * len(jaccard_vals)
    })

    # Plot distributions
    sns.histplot(data=df_long, x='score', hue='metric', bins=50, kde=True, element='step', stat='density')
    plt.title("Cosine vs Jaccard Similarity Distribution")
    plt.xlabel("Similarity Score")
    plt.ylabel("Density")
    plt.grid(True)
    plt.show()

    # Summary statistics
    return df_long.groupby('metric')['score'].describe()
