<a href="https://colab.research.google.com/github/hawa1983/DATA-612/blob/main/Final_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Purpose**

This script enriches the MovieLens movie dataset (`movies.dat`) with detailed movie metadata from The Movie Database (TMDB) API. The metadata includes movie overviews, genres, poster and backdrop image URLs, cast and director information, keywords, TMDB vote averages and vote counts, and trailer links. The enriched dataset serves as the foundation for building content-based, collaborative, and hybrid recommender systems.

### **Methodology**

1. **Load MovieLens Movie Data**
   The script loads the `movies.dat` file, which contains basic movie information including `movieId`, `title`, and `genres`.

2. **Clean Titles and Extract Years**
   It processes the movie titles to remove the year from the title string and separately extracts the release year to improve search accuracy when querying TMDB.

3. **Query TMDB API**
   For each movie, it sends a search request to TMDB using the cleaned title and release year. If a match is found, it retrieves the movie’s TMDB ID.

4. **Retrieve Detailed Metadata**
   From the search result, the script takes:

   * Overview (plot summary)
   * Poster and backdrop image paths
   * Genre IDs, which are then mapped to readable genre names
   * Vote average and vote count

   Using the TMDB ID, it then fetches:

   * Top 3 cast members
   * Director(s)
   * Associated keywords
   * YouTube trailer link (if available)

5. **Construct and Save Enriched Dataset**
   All metadata is compiled into a structured format and merged with the original MovieLens data. The final dataset is saved as `movies_enriched_full.csv` for downstream use in recommendation models.
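
The title-cleaning step (step 2) can be sketched with a regular expression. This is a minimal illustration, not the code the script runs below: the helper name `split_title_year` and the pattern `YEAR_SUFFIX` are hypothetical, and the sketch assumes the year always appears as a trailing `(YYYY)`.

```python
import re

# Matches a trailing four-digit year in parentheses, e.g. "Toy Story (1995)".
YEAR_SUFFIX = re.compile(r"\s*\((\d{4})\)\s*$")

def split_title_year(raw_title):
    """Return (clean_title, year); year is None when no trailing "(YYYY)" exists."""
    match = YEAR_SUFFIX.search(raw_title)
    if match is None:
        return raw_title.strip(), None
    return raw_title[:match.start()].strip(), int(match.group(1))

print(split_title_year("Toy Story (1995)"))
print(split_title_year("Professional, The (a.k.a. Leon: The Professional) (1994)"))
```

Unlike cutting at the last `(`, this only strips a suffix that really is a four-digit year, so alternate titles in parentheses are left in place.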


In [2]:
import pandas as pd
import requests
from tqdm import tqdm
import time

# ---------------------------------------
# CONFIG
# ---------------------------------------
BASE_URL = "https://api.themoviedb.org/3"
IMAGE_BASE = "https://image.tmdb.org/t/p/w500"

# Read your TMDB Bearer Token (v4) from the environment instead of hard-coding it.
# The variable name TMDB_BEARER_TOKEN is an assumption; set it before running this cell.
import os

HEADERS = {
    "Authorization": f"Bearer {os.environ['TMDB_BEARER_TOKEN']}"
}

# TMDB genre ID to name mapping
GENRE_ID_TO_NAME = {
    28: "Action", 12: "Adventure", 16: "Animation", 35: "Comedy", 80: "Crime",
    99: "Documentary", 18: "Drama", 10751: "Family", 14: "Fantasy", 36: "History",
    27: "Horror", 10402: "Music", 9648: "Mystery", 10749: "Romance", 878: "Science Fiction",
    10770: "TV Movie", 53: "Thriller", 10752: "War", 37: "Western"
}

# ---------------------------------------
# STEP 1: Load MovieLens .dat Files
# ---------------------------------------

# Load movies.dat - format: MovieID::Title::Genres
movies_df = pd.read_csv("movies.dat", sep="::", engine='python', header=None, names=["movieId", "title", "genres"], encoding="latin-1")

# ---------------------------------------
# STEP 2: Clean Movie Titles and Extract Year
# ---------------------------------------

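def extract_year(title):
    # MovieLens titles end with "(YYYY)"; return None when that suffix is missing or malformed.
    if "(" in title:
        try:
            return int(title.strip()[-5:-1])
        except ValueError:
            return None
    return None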

def clean_title(title):
    if "(" in title:
        return title[:title.rfind("(")].strip()
    return title.strip()

movies_df["year"] = movies_df["title"].apply(extract_year)
movies_df["clean_title"] = movies_df["title"].apply(clean_title)

# ---------------------------------------
# STEP 3: TMDB Metadata Functions
# ---------------------------------------

# Search for movie in TMDB
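def search_tmdb(title, year):
    url = f"{BASE_URL}/search/movie"
    params = {"query": title}
    if pd.notna(year):
        params["year"] = int(year)  # omit the year filter when no year was parsed
    response = requests.get(url, headers=HEADERS, params=params, timeout=10)
    if not response.ok:
        return None  # treat HTTP errors (e.g. rate limiting) as "no match"
    results = response.json().get("results")
    return results[0] if results else None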

# Get full metadata from TMDB
def get_full_tmdb_metadata(tmdb_id):
    metadata = {}

    # Credits (cast, crew)
    credits = requests.get(f"{BASE_URL}/movie/{tmdb_id}/credits", headers=HEADERS).json()
    cast = [c["name"] for c in credits.get("cast", [])[:3]]
    directors = [c["name"] for c in credits.get("crew", []) if c.get("job") == "Director"]

    # Keywords
    keywords = requests.get(f"{BASE_URL}/movie/{tmdb_id}/keywords", headers=HEADERS).json()
    keyword_list = [k["name"] for k in keywords.get("keywords", [])]

    # Videos (trailers)
    videos = requests.get(f"{BASE_URL}/movie/{tmdb_id}/videos", headers=HEADERS).json()
    trailer_links = [
        f"https://www.youtube.com/watch?v={v['key']}"
        for v in videos.get("results", [])
        if v.get("site") == "YouTube" and v.get("type") == "Trailer"
    ]

    # Final metadata dictionary
    metadata["top_3_cast"] = ", ".join(cast)
    metadata["directors"] = ", ".join(directors)
    metadata["keywords"] = ", ".join(keyword_list)
    metadata["trailer_link"] = trailer_links[0] if trailer_links else None

    return metadata

# ---------------------------------------
# STEP 4: Enrich Movie Data
# ---------------------------------------

enriched = []

for _, row in tqdm(movies_df.iterrows(), total=len(movies_df)):
    movie_data = search_tmdb(row["clean_title"], row["year"])

    if movie_data:
        tmdb_id = movie_data["id"]
        extra = get_full_tmdb_metadata(tmdb_id)

        genre_ids = movie_data.get("genre_ids", [])
        genre_names = [GENRE_ID_TO_NAME.get(gid, str(gid)) for gid in genre_ids]

        enriched.append({
            "tmdb_id": tmdb_id,
            "overview": movie_data.get("overview", ""),
            "poster_path": IMAGE_BASE + movie_data.get("poster_path", "") if movie_data.get("poster_path") else None,
            "backdrop_path": IMAGE_BASE + movie_data.get("backdrop_path", "") if movie_data.get("backdrop_path") else None,
            "vote_average": movie_data.get("vote_average", None),
            "vote_count": movie_data.get("vote_count", None),
            "tmdb_genres": ", ".join(genre_names),
            **extra
        })
    else:
        enriched.append({
            "tmdb_id": None,
            "overview": None,
            "poster_path": None,
            "backdrop_path": None,
            "vote_average": None,
            "vote_count": None,
            "tmdb_genres": None,
            "top_3_cast": None,
            "directors": None,
            "keywords": None,
            "trailer_link": None
        })

    time.sleep(0.25)  # Respect TMDB API rate limits

# ---------------------------------------
# STEP 5: Save Final Dataset
# ---------------------------------------

enriched_df = pd.DataFrame(enriched)
final_df = pd.concat([movies_df, enriched_df], axis=1)
final_df.to_csv("movies_enriched_full.csv", index=False)

print("DONE: Saved as 'movies_enriched_full.csv'")


FileNotFoundError: [Errno 2] No such file or directory: 'movies.dat'

## **Personalized Content-Based Movie Recommendation System**

This Python script implements a **Content-Based Filtering (CBF)** system enhanced with **personalized recommendations** using user-specific rating profiles. Built using the MovieLens 1M dataset and enriched metadata, the pipeline performs vectorization, similarity computation, and profile-based predictions.

**What This Script Does**

* **Module 1–2**: Load essential libraries and enriched movie data.
* **Module 3**: Load user ratings and demographics.
* **Module 4**: Engineer features combining genres, cast, crew, keywords, and movie overviews.
* **Module 5**: Transform content into TF-IDF, Count, or Binary vectors, and compute pairwise similarities using Cosine or Jaccard metrics.
* **Module 6**: Construct a weighted content profile per user based on past ratings.
* **Module 7**: Recommend top-N movies similar to the user profile, excluding already seen titles.

**Techniques Used**

* **Text Vectorization**: TF-IDF, CountVectorizer, Binary Count
* **Similarity Metrics**: Cosine Similarity, Jaccard Similarity
* **Personalization**: Weighted vector averaging based on each user’s rated items
* **Parallelization**: Speeds up Jaccard similarity computation using joblib

**Use Cases**

* Personalized recommendations for new users with a few ratings (cold-start)
* Improving diversity and relevance in suggested movies
* Generating fallback content suggestions in hybrid recommender systems
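
The personalization idea listed above (a rating-weighted average of item vectors, compared to candidates by cosine similarity) can be sketched in plain Python. The three-term vectors and titles below are made-up toy values, and the helper names are illustrative only; the actual pipeline works on TF-IDF matrices.

```python
import math

def weighted_profile(vectors, ratings):
    """Rating-weighted average of item vectors (all vectors share one length)."""
    total = sum(ratings)
    dims = len(vectors[0])
    return [sum(r * v[d] for v, r in zip(vectors, ratings)) / total for d in range(dims)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical term weights in the order [action, comedy, romance]
rated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
stars = [5, 1]  # the user loved the first movie and disliked the second
profile = weighted_profile(rated, stars)

candidates = {"action film": [1.0, 0.0, 0.0], "rom-com": [0.0, 0.5, 0.5]}
ranked = sorted(candidates, key=lambda t: cosine(profile, candidates[t]), reverse=True)
print(ranked)
```

Because the weights are raw ratings, a 1-star movie still pulls the profile slightly toward its own content; centering ratings first would be one way to avoid that, at the cost of negative weights.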

In [3]:
# ==============================
# Module 1: Imports & Configuration
# ==============================
# Purpose: Load all required libraries and set global display settings.
# Application: Enables necessary tools for data manipulation, modeling, and visualization.

import pandas as pd
import numpy as np
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from joblib import Parallel, delayed
from tqdm import tqdm

pd.set_option('display.max_colwidth', None)

# ==============================
# Module 2: Load Movie Data
# ==============================
# Purpose: Load enriched movie metadata from a CSV file.
# Application: Used as the content base for movie features in recommender systems.

def load_movie_data(filepath):
    df = pd.read_csv(filepath)
    print(f"Loaded {len(df)} movies.")
    return df

# ==============================
# Module 3: Load User Ratings and Demographics
# ==============================
# Purpose: Load MovieLens ratings and user demographic info.
# Application: Used to identify which movies users rated and to segment users by profile.

def load_user_data(ratings_path, users_path):
    ratings = pd.read_csv(ratings_path, sep="::", engine="python",
                          names=["userId", "movieId", "rating", "timestamp"])
    users = pd.read_csv(users_path, sep="::", engine="python",
                        names=["userId", "gender", "age", "occupation", "zip"])
    print(f"Loaded {len(ratings)} ratings and {len(users)} users.")
    return ratings, users

# ==============================
# Module 4: Feature Engineering
# ==============================
# Purpose: Merge movie metadata into a single feature string.
# Application: Used as input to text vectorization for computing similarity.

def create_feature_string(df):
    def split_and_clean(col, delimiter='|'):
        return col.fillna('').str.replace(r'\s+', '', regex=True).str.split(delimiter)

    genre_list_1 = split_and_clean(df['genres'], delimiter='|')
    genre_list_2 = split_and_clean(df['tmdb_genres'], delimiter=',')
    merged_genres = [
        ' '.join(sorted(set(g1 or []) | set(g2 or [])))
        for g1, g2 in zip(genre_list_1, genre_list_2)
    ]

    def clean_text(col):
        return col.fillna('').str.replace(r'\s+', '', regex=True).str.replace(',', ' ')

    overview_clean = df['overview'].fillna('').str.lower().str.replace(r'[^\w\s]', '', regex=True)

    df['cbf_features'] = (
        pd.Series(merged_genres, index=df.index) + ' ' +
        clean_text(df['keywords']) + ' ' +
        clean_text(df['top_3_cast']) + ' ' +
        clean_text(df['directors']) + ' ' +
        overview_clean
    )

    return df[['movieId', 'title', 'cbf_features']]

# ==============================
# Module 5: Vectorization & Similarity
# ==============================
# Purpose: Convert feature strings into vectors and compute pairwise similarity.
# Application: Supports similarity scoring between movies.

def vectorize_features(text_series, method='tfidf'):
    if method == 'tfidf':
        vectorizer = TfidfVectorizer(stop_words='english')
    elif method == 'count':
        vectorizer = CountVectorizer(stop_words='english')
    else:
        raise ValueError("Method must be 'tfidf' or 'count'")

    matrix = vectorizer.fit_transform(text_series)
    print(f"{method.upper()} vectorization complete. Shape: {matrix.shape}")
    return matrix, vectorizer

def binary_vectorize(text_series):
    vectorizer = CountVectorizer(binary=True, stop_words='english')
    matrix = vectorizer.fit_transform(text_series)
    print(f"Binary Count vectorization complete. Shape: {matrix.shape}")
    return matrix.toarray(), vectorizer

def compute_cosine_similarity(matrix):
    sim = cosine_similarity(matrix)
    print("Cosine similarity computed.")
    return sim

def jaccard_pairwise_parallel(matrix):
    n = matrix.shape[0]
    sim_matrix = np.zeros((n, n))

    def jaccard_row(i):
        a = matrix[i]
        row_sim = np.zeros(n)
        for j in range(i, n):
            b = matrix[j]
            intersection = np.logical_and(a, b).sum()
            union = np.logical_or(a, b).sum()
            score = intersection / union if union > 0 else 0.0
            row_sim[j] = score
        return i, row_sim

    results = Parallel(n_jobs=-1)(
        delayed(jaccard_row)(i) for i in tqdm(range(n), desc="Jaccard Similarity")
    )

    for i, row in results:
        sim_matrix[i, i:] = row[i:]
        sim_matrix[i:, i] = row[i:]

    print("Jaccard similarity matrix built.")
    return sim_matrix

def save_matrix(matrix, filename):
    with open(filename, 'wb') as f:
        pickle.dump(matrix, f)
    print(f"Saved similarity matrix to: {filename}")

# ==============================
# Module 6: Build User Profile
# ==============================
# Purpose: Create a personalized vector from a user's rated movies.
# Application: Encapsulates a user's preferences for use in recommendation.

def build_user_profile(user_id, ratings, tfidf_matrix, movie_df):
    user_ratings = ratings[ratings['userId'] == user_id]
    rated_movies = movie_df[movie_df['movieId'].isin(user_ratings['movieId'])]
    indices = rated_movies.index.tolist()
    weights = user_ratings.set_index('movieId').loc[rated_movies['movieId']]['rating'].values
    profile = np.average(tfidf_matrix[indices].toarray(), axis=0, weights=weights)
    return profile.reshape(1, -1)

# ==============================
# Module 7: Personalized Recommendation
# ==============================
# Purpose: Generate recommendations personalized to the user profile.
# Application: This module uses user ratings to build a content preference profile and recommend similar movies.

def recommend_movies(user_id, ratings, tfidf_matrix, movie_df, top_n=50):
    user_profile = build_user_profile(user_id, ratings, tfidf_matrix, movie_df)
    sims = cosine_similarity(user_profile, tfidf_matrix).flatten()
    user_seen = ratings[ratings['userId'] == user_id]['movieId'].tolist()
    unseen_indices = movie_df[~movie_df['movieId'].isin(user_seen)].index
    top_indices = unseen_indices[np.argsort(sims[unseen_indices])[-top_n:][::-1]]
    return movie_df.iloc[top_indices][['movieId', 'title']], sims[top_indices]


### ***1. Run CBF Data Preparation***


In [4]:
# --- Run CBF Data Preparation ---
movie_df = load_movie_data('movies_enriched_full.csv')
movie_df = create_feature_string(movie_df)

# Optional: preview one example
print("\nSample feature string:")
print(movie_df.loc[0, ['title', 'cbf_features']])


Loaded 3883 movies.

Sample feature string:
title           Toy Story (1995)
cbf_features    Adventure Animation Children's Comedy Family rescue friendship mission jealousy villain bullying elementaryschool rivalry anthropomorphism friends computeranimation buddy walkietalkie toycar boynextdoor newtoy neighborhood toycomestolife resourcefulness toy TomHanks TimAllen DonRickles JohnLasseter led by woody andys toys live 

### ***2. Run Vectorization & Similarity Calculation***


In [5]:
# --- Run TF-IDF Vectorization and Cosine Similarity ---
tfidf_matrix, tfidf_vectorizer = vectorize_features(movie_df['cbf_features'], method='tfidf')
cosine_sim_matrix = compute_cosine_similarity(tfidf_matrix)
save_matrix(cosine_sim_matrix, 'cbf_cosine_similarity_tfidf.pkl')

# --- Run Count Vectorization and Cosine Similarity (Optional) ---
count_matrix, count_vectorizer = vectorize_features(movie_df['cbf_features'], method='count')
cosine_sim_count = compute_cosine_similarity(count_matrix)
save_matrix(cosine_sim_count, 'cbf_cosine_similarity_count.pkl')

# --- Run Binary Vectorization and Jaccard Similarity ---
binary_matrix, _ = binary_vectorize(movie_df['cbf_features'])
jaccard_sim_matrix = jaccard_pairwise_parallel(binary_matrix)
save_matrix(jaccard_sim_matrix, 'cbf_jaccard_similarity.pkl')


TFIDF vectorization complete. Shape: (3883, 33424)
Cosine similarity computed.
Saved similarity matrix to: cbf_cosine_similarity_tfidf.pkl
COUNT vectorization complete. Shape: (3883, 33424)
Cosine similarity computed.
Saved similarity matrix to: cbf_cosine_similarity_count.pkl
Binary Count vectorization complete. Shape: (3883, 33424)


Jaccard Similarity: 100%|██████████| 3883/3883 [17:31<00:00,  3.69it/s] 


Jaccard similarity matrix built.
Saved similarity matrix to: cbf_jaccard_similarity.pkl
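
For binary term vectors, the Jaccard score computed above is the same as the overlap of two token sets: shared tokens divided by distinct tokens. A small sketch, with made-up token lists standing in for real feature strings:

```python
def jaccard(tokens_a, tokens_b):
    """|A ∩ B| / |A ∪ B| on token sets; defined as 0.0 when both sets are empty."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical, shortened feature strings
movie_one = "animation comedy toy friendship".split()
movie_two = "animation comedy ant colony".split()

# 2 shared tokens out of 6 distinct tokens
print(jaccard(movie_one, movie_two))
```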


### ***3. Run Personalized Recommendation for a User***


In [8]:
# --- Load User Data ---
ratings, users = load_user_data('ratings.dat', 'users.dat')

# --- Generate Recommendations for a Sample User ---
user_id = 5549
recommended_df, similarity_scores = recommend_movies(user_id, ratings, tfidf_matrix, movie_df, top_n=50)

print(f"\nTop 50 personalized movie recommendations for User {user_id}:")
for i, (title, score) in enumerate(zip(recommended_df['title'], similarity_scores)):
    print(f"{i+1}. {title} — Score: {score:.4f}")



Loaded 1000209 ratings and 6040 users.

Top 50 personalized movie recommendations for User 5549:
1. Professional, The (a.k.a. Leon: The Professional) (1994) — Score: 0.3010
2. Blood In, Blood Out (a.k.a. Bound by Honor) (1993) — Score: 0.2915
3. F/X 2 (1992) — Score: 0.2337
4. Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) — Score: 0.2261
5. Othello (1952) — Score: 0.2261
6. White Boys (1999) — Score: 0.2261
7. Phantom Love (Ai No Borei) (1978) — Score: 0.2261
8. All the Rage (a.k.a. It's the Rage) (1999) — Score: 0.2261
9. Black Tights (Les Collants Noirs) (1960) — Score: 0.2261
10. Forbidden Christ, The (Cristo proibito, Il) (1950) — Score: 0.2261
11. Story of Xinghua, The (1993) — Score: 0.2261
12. Day the Sun Turned Cold, The (Tianguo niezi) (1994) — Score: 0.2261
13. Ciao, Professore! (Io speriamo che me la cavo ) (1993) — Score: 0.2261
14. Two Moon Juction (1988) — Score: 0.2261
15. Naturally Native (1998) — Score: 0.2261
16. Jules and Jim (Jules et Jim) (1961) — Score: 0.2

**Collaborative Filtering (UBCF, IBCF, SVD, ALS)**

**Objective:**
This notebook implements **Collaborative Filtering** techniques for personalized movie recommendation. Unlike Content-Based Filtering, Collaborative Filtering relies on patterns in *user behavior*—how users rate movies—and not on the content of the movies themselves.

**Key Techniques Covered:**

* **Memory-Based Filtering:**

  * *User-User Collaborative Filtering (UBCF)*: Finds users with similar tastes.
  * *Item-Item Collaborative Filtering (IBCF)*: Finds items similar to what a user has liked.
  
* **Model-Based Filtering:**

  * *SVD (Surprise)*: Learns latent features from the rating matrix.
  * *ALS (PySpark)*: Scalable factorization method for large datasets.

**Core Tasks:**

* Build user-item rating matrices
* Compute cosine similarity for UBCF/IBCF
* Tune and evaluate SVD and ALS models
* Handle missing data and sparse matrix issues
* Save predictions and evaluation scores for future use

**Application:**
These methods serve as the backbone for Netflix-style recommendation engines, helping tailor suggestions based on the tastes of millions of users.
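
As a rough illustration of the user-user idea, the sketch below computes cosine similarity on mean-centered ratings for three hypothetical users. The names and ratings are invented, and the dict-based helpers are only a stand-in for the matrix code used later in the notebook.

```python
import math

def centered(ratings):
    """Subtract the user's mean rating from each rated item (dict: item -> rating)."""
    mean = sum(ratings.values()) / len(ratings)
    return {item: r - mean for item, r in ratings.items()}

def user_similarity(u, v):
    """Cosine similarity of two mean-centered rating dicts; unrated items count as 0."""
    cu, cv = centered(u), centered(v)
    dot = sum(cu[i] * cv[i] for i in cu.keys() & cv.keys())
    norm = math.sqrt(sum(x * x for x in cu.values())) * math.sqrt(sum(x * x for x in cv.values()))
    return dot / norm if norm else 0.0

alice = {"A": 5, "B": 3, "C": 1}
bob = {"A": 4, "B": 3, "C": 2}    # same ordering of preferences as alice, milder scores
carol = {"A": 1, "B": 3, "C": 5}  # opposite ordering

print(user_similarity(alice, bob))    # close to +1
print(user_similarity(alice, carol))  # close to -1
```

Centering removes each user's rating bias, so a generous rater and a strict rater with the same preference order still come out as similar.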


In [9]:
# ==============================
# Collaborative Filtering (UBCF, IBCF, SVD, ALS)
# ==============================
# Purpose: Implement memory-based and model-based collaborative filtering approaches.
# Application: Recommend movies by identifying similar users (User-User), similar items (Item-Item), or through latent factor models (SVD and ALS).

# ==============================
# Module 8: Memory-Based Collaborative Filtering
# ==============================
# Purpose: Compute User-User and Item-Item similarity.
# Application: Memory-based collaborative filtering using cosine similarity.

from sklearn.metrics.pairwise import pairwise_distances

# --- Create User-Item Rating Matrix ---
def create_user_item_matrix(ratings):
    user_item_matrix = ratings.pivot(index='userId', columns='movieId', values='rating')
    return user_item_matrix.fillna(0)

# --- Compute Similarity ---
def compute_similarity(matrix, kind='user'):
    if kind == 'user':
        sim = 1 - pairwise_distances(matrix, metric='cosine')
    elif kind == 'item':
        sim = 1 - pairwise_distances(matrix.T, metric='cosine')
    else:
        raise ValueError("kind must be 'user' or 'item'")
    print(f"{kind.title()}-based similarity computed.")
    return sim

# ==============================
# Module 9: Model-Based CF using Surprise (SVD)
# ==============================
# Purpose: Use matrix factorization to learn latent user and item features.
# Application: Model-based collaborative filtering using SVD and tuning.

from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split, GridSearchCV
from surprise.accuracy import rmse

# --- Prepare Surprise Dataset ---
def prepare_surprise_data(ratings):
    reader = Reader(rating_scale=(1, 5))  # MovieLens 1M ratings are whole stars from 1 to 5
    return Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# --- Train and Tune SVD Model ---
def tune_svd_model(data):
    param_grid = {'n_factors': [50, 100], 'lr_all': [0.005, 0.01], 'reg_all': [0.02, 0.1]}
    gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
    gs.fit(data)
    print(f"Best RMSE: {gs.best_score['rmse']} with params: {gs.best_params['rmse']}")
    return gs.best_estimator['rmse']

# --- Evaluate SVD ---
def evaluate_svd(model, data):
    trainset, testset = train_test_split(data, test_size=0.2)
    model.fit(trainset)
    predictions = model.test(testset)
    score = rmse(predictions)
    return predictions, score

# ==============================
# Module 10: Model-Based CF using PySpark ALS
# ==============================
# Purpose: Implement ALS model using PySpark for scalability.
# Application: Large-scale collaborative filtering with alternating least squares.

from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator

# --- Create Spark Session ---
def start_spark():
    spark = SparkSession.builder \
        .appName("ALSModel") \
        .getOrCreate()
    return spark

# --- Prepare DataFrame for ALS ---
def prepare_als_data(spark, ratings):
    return spark.createDataFrame(ratings[['userId', 'movieId', 'rating']])

# --- Train ALS Model ---
def train_als_model(data):
    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
              rank=10, maxIter=10, regParam=0.1, coldStartStrategy="drop",
              nonnegative=True)
    model = als.fit(data)
    return model

# --- Evaluate ALS ---
def evaluate_als(model, data):
    predictions = model.transform(data)
    evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                    predictionCol="prediction")
    rmse_val = evaluator.evaluate(predictions)
    print(f"ALS RMSE: {rmse_val}")
    return predictions, rmse_val

# ==============================
# Module 11: Save Predictions and Scores
# ==============================
# Purpose: Save output predictions and evaluation metrics.
# Application: Reuse and visualize predictions for report or dashboard.

def save_predictions(predictions, filename):
    import pickle
    with open(filename, 'wb') as f:
        pickle.dump(predictions, f)
    print(f"Saved predictions to {filename}")


ModuleNotFoundError: No module named 'surprise'

### ***Module 8: Memory-Based Collaborative Filtering***



In [10]:
# ==============================
# Module 8: Memory-Based Collaborative Filtering
# ==============================
# Purpose: Compute similarity between users (User-User CF) or items (Item-Item CF) based on their ratings.
# Application: This module enables collaborative filtering without training complex models by leveraging cosine similarity
# on the user-item rating matrix. This is useful for smaller datasets and provides interpretable recommendations.

"""
This module builds the foundational logic for memory-based collaborative filtering.
It creates a zero-filled (dense) user-item matrix from MovieLens ratings data and computes
pairwise similarity between either users or items using cosine similarity.

Use this for:
- Generating recommendations by identifying similar users (UBCF)
- Identifying related items (IBCF) based on user preferences
- Quick prototyping and interpretable baseline models
"""

import pandas as pd
import numpy as np

# --- Step 1: Create User-Item Rating Matrix ---
def create_user_item_matrix(ratings_df):
    """
    Convert ratings DataFrame to a pivoted user-item matrix.
    Missing ratings are filled with 0, so a 0 means "not rated".
    """
    user_item_matrix = ratings_df.pivot(index='userId', columns='movieId', values='rating')
    return user_item_matrix.fillna(0)

# --- Step 2: Cosine Similarity Calculation (Manual) ---
def cosine_similarity_manual(A):
    """
    Compute pairwise cosine similarity between rows of matrix A using basic arithmetic.
    A: numpy array of shape (n_samples, n_features)
    Returns: cosine similarity matrix (n_samples x n_samples)
    """
    dot_product = np.dot(A, A.T)
    norm = np.linalg.norm(A, axis=1, keepdims=True)
    denominator = np.dot(norm, norm.T)

    # To avoid divide-by-zero, replace zeros in denominator with small epsilon
    denominator[denominator == 0] = 1e-9
    similarity = dot_product / denominator
    return similarity

# --- Step 3: Compute Similarity Matrix ---
def compute_memory_based_similarity(user_item_matrix, kind='user'):
    """
    Compute cosine similarity between users or items.
    For users, applies user-level centering before computing similarity.
    """
    if kind == 'user':
        # --- Step A: Replace 0s with NaNs for mean calculation ---
        centered = user_item_matrix.replace(0, np.nan)
        user_means = centered.mean(axis=1)

        # --- Step B: Center ratings around user means and fill NaNs with 0 ---
        matrix = centered.sub(user_means, axis=0).fillna(0).values

        sim = cosine_similarity_manual(matrix)
        print("User-User similarity computed using cosine on centered ratings.")

    elif kind == 'item':
        matrix = user_item_matrix.values.T
        sim = cosine_similarity_manual(matrix)
        print("Item-Item similarity computed using cosine.")

    else:
        raise ValueError("kind must be 'user' or 'item'")

    return pd.DataFrame(
        sim,
        index=user_item_matrix.index if kind == 'user' else user_item_matrix.columns,
        columns=user_item_matrix.index if kind == 'user' else user_item_matrix.columns
    )

# --- Optional: Recommend Top-K Similar Users or Items ---
def get_top_k_similar(sim_matrix, entity_id, k=5):
    """
    Given a similarity matrix, return the top-k most similar users/items to the specified entity_id.
    """
    sim_scores = sim_matrix.loc[entity_id].drop(entity_id)  # exclude self
    return sim_scores.sort_values(ascending=False).head(k)

# --- Usage ---
ubcf_user_item = create_user_item_matrix(ratings)
ubcf_user_sim = compute_memory_based_similarity(ubcf_user_item, kind='user')
ubcf_top_k_users = get_top_k_similar(ubcf_user_sim, entity_id=1, k=50)



User-User similarity computed using cosine on centered ratings.


#### *Top-50 recommendations using User-Based Collaborative Filtering (UBCF) for user_id = 5549.*

In [11]:
# ==============================
# Module X: Predict UBCF Scores with Movie Titles
# ==============================
# Purpose: Generate UBCF-based predicted ratings for unseen movies, with optional display of movie titles.
# Application: Produces top-N personalized movie recommendations using cosine similarity of user-user interactions.

import pandas as pd
import numpy as np

# --- Load movie titles mapping ---
movies = pd.read_csv("movies_enriched_full.csv")  # Adjust path if needed
movie_id_to_title = dict(zip(movies['movieId'], movies['title']))

# --- UBCF Prediction Function with Movie Titles ---
def predict_ubcf_scores(user_id, user_item_matrix, similarity_matrix, movie_id_to_title=None):
    """
    Score unseen movies for a given user using UBCF (cosine similarity).
    Each score is the similarity-weighted average of the other users' ratings,
    with unrated entries counted as 0, so it is a ranking score rather than a
    rating on the 1-5 scale.

    Parameters:
    - user_id: int, target user
    - user_item_matrix: DataFrame, user-item rating matrix
    - similarity_matrix: DataFrame, cosine similarity between users
    - movie_id_to_title: dict (optional), maps movieId to title for display

    Returns:
    - Series of scores for unseen movies, optionally with movie titles as index
    """
    user_vector = user_item_matrix.loc[user_id]
    sim_vector = similarity_matrix.loc[user_id]

    # Exclude the target user; drop() returns copies, so the inputs are not modified
    neighbor_ratings = user_item_matrix.drop(index=user_id)
    sim_scores_dropped = sim_vector.drop(index=user_id)

    # Align index
    sim_scores_dropped = sim_scores_dropped.loc[neighbor_ratings.index]

    # Mask unrated items
    unrated_mask = user_vector == 0

    # Compute weighted average prediction
    numerator = np.dot(sim_scores_dropped.values, neighbor_ratings.values)
    denominator = np.abs(sim_scores_dropped.values).sum()
    denominator = denominator if denominator != 0 else 1e-9
    predicted_scores = numerator / denominator

    # Filter for unrated movies only
    pred_series = pd.Series(predicted_scores, index=user_item_matrix.columns)
    unseen_preds = pred_series[unrated_mask].sort_values(ascending=False)

    # Map movieId to title if available
    if movie_id_to_title:
        unseen_preds.index = [movie_id_to_title.get(mid, f"MovieID {mid}") for mid in unseen_preds.index]

    return unseen_preds

# --- Generate Top 50 Recommendations ---
ubcf_top_50_recommendations = predict_ubcf_scores(
    user_id=5549,
    user_item_matrix=ubcf_user_item,
    similarity_matrix=ubcf_user_sim,
    movie_id_to_title=movie_id_to_title
).head(50)

# --- Display Recommendations ---
# --- Convert to DataFrame for Display ---
ubcf_recommendation_df = pd.DataFrame({
    "Rank": range(1, len(ubcf_top_50_recommendations) + 1),
    "Movie Title": ubcf_top_50_recommendations.index,
    "Predicted Rating": ubcf_top_50_recommendations.values
})

# --- Display DataFrame ---
print("\nTop 50 UBCF Recommendations for User 5549:\n")
print(ubcf_recommendation_df.to_string(index=False))




Top 50 UBCF Recommendations for User 5549:

 Rank                               Movie Title  Predicted Rating
    1            Godfather: Part II, The (1974)          0.433388
    2 Butch Cassidy and the Sundance Kid (1969)          0.326463
    3                           Die Hard (1988)          0.306014
    4          Hunt for Red October, The (1990)          0.301117
    5                      Fugitive, The (1993)          0.299904
    6                  Wizard of Oz, The (1939)          0.280176
    7 Indiana Jones and the Last Crusade (1989)          0.276143
    8                              Rocky (1976)          0.275978
    9                         Casablanca (1942)          0.259601
   10                 Dances with Wolves (1990)          0.259482
   11 Star Wars: Episode IV - A New Hope (1977)          0.255633
   12                      Lethal Weapon (1987)          0.247489
   13                     Cool Hand Luke (1967)          0.243840
   14                          
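
The predicted ratings above sit well below the 1-5 rating scale. One possible explanation, assuming unrated entries are stored as 0 in the user-item matrix (as `unrated_mask = user_vector == 0` suggests), is that those zeros enter the numerator while the denominator sums every neighbor's similarity. The sketch below uses a made-up three-neighbor example to compare that computation with a variant that normalizes only over neighbors who actually rated each item. All names and values are illustrative and not taken from the notebook.

```python
import numpy as np

# Hypothetical toy data: 3 neighbors x 2 movies, 0 means "not rated"
toy_neighbor_ratings = np.array([
    [5.0, 0.0],
    [4.0, 0.0],
    [0.0, 3.0],
])
toy_similarities = np.array([0.9, 0.6, 0.5])

# Same form as the function above: zeros count as ratings,
# and the denominator sums the similarities of all neighbors
toy_numerator = toy_similarities @ toy_neighbor_ratings
plain_scores = toy_numerator / np.abs(toy_similarities).sum()

# Masked variant: normalize per item over neighbors who rated that item
rated_mask = (toy_neighbor_ratings > 0).astype(float)
masked_denominator = np.abs(toy_similarities) @ rated_mask
masked_denominator[masked_denominator == 0] = 1e-9  # avoid division by zero
masked_scores = toy_numerator / masked_denominator

print("plain :", plain_scores)
print("masked:", masked_scores)
```

In this toy case the masked scores stay on the original rating scale, whereas the plain scores shrink as more neighbors leave an item unrated. The ranking produced by the two variants can differ, so this is a design choice rather than a drop-in fix.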

## **Module 9: Model-Based Collaborative Filtering using Surprise (SVD)**

In [None]:
# ==============================
# Module 9: Model-Based CF using Surprise (SVD)
# ==============================
# Purpose: Apply matrix factorization using Singular Value Decomposition (SVD) to uncover latent features in user-item interactions.
# Application: This module creates a powerful and compact user-item interaction model, ideal for mid-sized datasets and tuning.

"""
This module uses the Surprise library to implement and tune SVD models. It converts user ratings
into latent feature space representations, capturing hidden patterns in user preferences and item attributes.

Use this for:
- Highly personalized recommendations
- Compact representation of large rating matrices
- Performing hyperparameter tuning for optimal performance (e.g., via GridSearchCV)
- Getting performance insights via RMSE evaluation
"""
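
To make the latent-factor idea concrete, here is a minimal NumPy sketch of a rank-k approximation on a tiny, made-up rating matrix. It uses `np.linalg.svd` rather than Surprise, so it illustrates the underlying factorization only, not Surprise's `SVD` algorithm (which is fitted by stochastic gradient descent on the observed ratings). The data and the helper name `low_rank_approx` are assumptions made for the example.

```python
import numpy as np

# Hypothetical dense toy matrix: 4 users x 3 movies (every entry observed)
toy_ratings = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 2.0],
    [1.0, 2.0, 5.0],
    [2.0, 1.0, 4.0],
])

# Center each user's ratings so the factors model deviations from the user mean
toy_user_means = toy_ratings.mean(axis=1, keepdims=True)
toy_centered = toy_ratings - toy_user_means

def low_rank_approx(matrix, k):
    """Return the rank-k truncated SVD reconstruction of `matrix`."""
    U, sigma, Vt = np.linalg.svd(matrix, full_matrices=False)
    return (U[:, :k] * sigma[:k]) @ Vt[:k, :]

toy_rmse_by_k = {}
for k in (1, 2, 3):
    approx = low_rank_approx(toy_centered, k) + toy_user_means
    toy_rmse_by_k[k] = float(np.sqrt(np.mean((approx - toy_ratings) ** 2)))
    print(f"k={k}: reconstruction RMSE={toy_rmse_by_k[k]:.4f}")
```

The reconstruction error here is measured on the same entries used to build the factorization, so it can only shrink as k grows. The held-out evaluation later in this module is what guards against overfitting.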


### ***Step 4: Convert to Sparse Format***

In [None]:
from scipy.sparse import csr_matrix

# --- Step 1: Replace 0s with NaN for centering ---
matrix_centered = ubcf_user_item.replace(0, np.nan)
user_means = matrix_centered.mean(axis=1)

# --- Step 2: Subtract user mean to center the ratings ---
matrix_centered = matrix_centered.sub(user_means, axis=0).fillna(0)

# --- Step 3: Convert to sparse matrix format for SVD ---
sparse_matrix = csr_matrix(matrix_centered.values)

# --- Step 4: Output shape ---
print(f"Sparse Matrix Shape: {sparse_matrix.shape}")


In [None]:
from scipy.sparse.linalg import svds
from sklearn.metrics import mean_squared_error
from scipy.sparse import csr_matrix
import numpy as np
import matplotlib.pyplot as plt

# --- Function: Split Sparse Matrix into Train/Test ---
def train_test_split_sparse(matrix, test_ratio=0.2, seed=42):
    np.random.seed(seed)
    train = matrix.copy().toarray()
    test = np.zeros_like(train)

    for user in range(matrix.shape[0]):
        rated_items = matrix[user].nonzero()[1]
        if len(rated_items) > 1:
            test_size = max(1, int(len(rated_items) * test_ratio))
            test_items = np.random.choice(rated_items, size=test_size, replace=False)
            train[user, test_items] = 0.0
            test[user, test_items] = matrix[user, test_items].toarray()

    return csr_matrix(train), csr_matrix(test)

# --- Function: Compute RMSE ---

def compute_rmse(pred_matrix, test_matrix):
    test_nonzero = test_matrix.nonzero()

    pred = pred_matrix[test_nonzero].flatten()
    actual = np.array(test_matrix[test_nonzero]).flatten()

    # print(f"Shape of pred: {pred.shape}")
    # print(f"Shape of actual: {actual.shape}")

    return np.sqrt(mean_squared_error(actual, pred))


# --- Step 1: Split Centered Sparse Matrix ---
train_matrix, test_matrix = train_test_split_sparse(sparse_matrix)

# --- Step 2: Evaluate SVD across k-values ---
# k from 10 to 100 in steps of 10, then from 120 to 500 in steps of 20
k_values = list(range(10, 101, 10)) + list(range(120, 501, 20))

rmse_scores = []

for k in k_values:
    U, sigma, Vt = svds(train_matrix, k=k)
    sigma_diag = np.diag(sigma)
    pred_matrix = np.dot(np.dot(U, sigma_diag), Vt)

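    # test_matrix holds mean-centered ratings, so evaluate RMSE before
    # restoring user means; otherwise predictions and actuals are on different scales
    rmse = compute_rmse(pred_matrix, test_matrix)
    rmse_scores.append(rmse)

    # Restore user means (aligned by row order, not userId)
    pred_matrix += user_means.values[:, np.newaxis]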
    # print(f"k = {k}, RMSE = {rmse:.4f}")

# --- Step 3: Plot ---
plt.figure(figsize=(10, 6))
plt.plot(k_values, rmse_scores, marker='o')
plt.title("SVD: Number of Latent Factors vs. RMSE")
plt.xlabel("Number of Latent Factors (k)")
plt.ylabel("RMSE")
plt.grid(True)
plt.xticks(k_values)
plt.show()
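
Once the loop has filled `rmse_scores`, the best-performing k can also be read off programmatically rather than from the plot. A small sketch, using made-up scores in place of the real results (the `example_` names and values are placeholders, not output from this notebook):

```python
import numpy as np

# Hypothetical values standing in for the k_values / rmse_scores computed above
example_k_values = [10, 20, 30, 40, 50]
example_rmse_scores = [1.02, 0.97, 0.95, 0.96, 0.99]

# Index of the lowest RMSE, then the k it corresponds to
best_index = int(np.argmin(example_rmse_scores))
best_k = example_k_values[best_index]
print(f"Best k = {best_k} (RMSE = {example_rmse_scores[best_index]:.4f})")
```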


In [None]:
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# --- Helper Functions ---
def train_test_split_sparse(matrix, test_ratio=0.2, seed=42):
    np.random.seed(seed)
    train = matrix.copy().toarray()
    test = np.zeros_like(train)

    for user in range(matrix.shape[0]):
        rated_items = matrix[user].nonzero()[1]
        if len(rated_items) > 1:
            test_size = max(1, int(len(rated_items) * test_ratio))
            test_items = np.random.choice(rated_items, size=test_size, replace=False)
            train[user, test_items] = 0.0
            test[user, test_items] = matrix[user, test_items].toarray()

    return csr_matrix(train), csr_matrix(test)

def compute_rmse(pred_matrix, test_matrix):
    test_nonzero = test_matrix.nonzero()
    pred = pred_matrix[test_nonzero].flatten()
    actual = np.array(test_matrix[test_nonzero]).flatten()
    return np.sqrt(mean_squared_error(actual, pred))

# --- Assume you already have: sparse_matrix, user_means ---
train_matrix, test_matrix = train_test_split_sparse(sparse_matrix)

# --- k values ---
k_values = list(range(10, 101, 10)) + list(range(120, 301, 20))
results = []

for k in k_values:
    U, sigma, Vt = svds(train_matrix, k=k)
    sigma_diag = np.diag(sigma)
    pred_matrix = np.dot(np.dot(U, sigma_diag), Vt)
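    # test_matrix holds mean-centered ratings, so score before restoring user means
    rmse = compute_rmse(pred_matrix, test_matrix)
    results.append((k, rmse, sigma))

    # Restore user means (aligned by row order, not userId)
    pred_matrix += user_means.values[:, np.newaxis]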
    # print(f"k = {k}, RMSE = {rmse:.4f}")

# --- Extract Singular Values from Highest-k SVD ---
# Take the run with the largest k and sort its singular values in descending order
_, _, sigma_top = max(results, key=lambda r: r[0])
sigma_full = np.sort(np.asarray(sigma_top))[::-1]

# --- Cumulative Energy ---
# Energy is the squared singular value. The ratios are relative to the
# retained top-k spectrum, not to the full rank of the rating matrix.
energy = sigma_full ** 2
total_energy = np.sum(energy)
explained_variance_ratio = energy / total_energy
cumulative_energy = np.cumsum(explained_variance_ratio)

# --- DataFrame for Plotting ---
energy_df = pd.DataFrame({
    'k': np.arange(1, len(sigma_full) + 1),
    'SingularValue': sigma_full,
    'ExplainedVarianceRatio': explained_variance_ratio,
    'CumulativeEnergy': cumulative_energy
})

# --- Plot ---
plt.figure(figsize=(8, 4))
sns.lineplot(data=energy_df, x='k', y='CumulativeEnergy', marker='o')
plt.title("Cumulative Energy vs. Number of Latent Dimensions", loc='left')
plt.xlabel("Number of Latent Dimensions (k)")
plt.ylabel("Cumulative Energy (Explained Variance)")
plt.axhline(y=0.9, color='red', linestyle='--', label='90% Variance')
plt.axhline(y=0.95, color='green', linestyle='--', label='95% Variance')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
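
The 90% and 95% reference lines can also be turned into concrete values of k. The sketch below assumes singular values sorted in descending order and takes energy to be the squared singular value. The numbers and the helper `smallest_k_for` are hypothetical and only illustrate the lookup.

```python
import numpy as np

# Hypothetical singular values, sorted in descending order
example_sigma = np.array([10.0, 6.0, 3.0, 2.0, 1.0])

# Energy taken as the squared singular value (an assumption of this sketch)
example_energy = example_sigma ** 2
example_cumulative = np.cumsum(example_energy) / example_energy.sum()

def smallest_k_for(threshold, cumulative):
    """Smallest number of dimensions whose cumulative share reaches `threshold`."""
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))  # clamp in case of floating-point shortfall

k_90 = smallest_k_for(0.90, example_cumulative)
k_95 = smallest_k_for(0.95, example_cumulative)
print(f"k for 90% energy: {k_90}, k for 95% energy: {k_95}")
```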


In [None]:
def compute_rmse(pred_matrix, test_matrix):
    # Get test matrix nonzero indices
    test_nonzero = test_matrix.nonzero()

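    # Extract predicted and actual ratings as flat 1-D arrays
    # (indexing a sparse matrix returns a 2-D np.matrix, so flatten both)
    pred = np.asarray(pred_matrix[test_nonzero]).ravel()
    actual = np.asarray(test_matrix[test_nonzero]).ravel()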

    # Print debug info
    print(f"Shape of pred: {pred.shape}")
    print(f"Shape of actual: {actual.shape}")
    print(f"First 5 pred values: {pred[:5]}")
    print(f"First 5 actual values: {actual[:5]}")

    # Check for consistent length
    if pred.shape != actual.shape:
        raise ValueError(f"Shape mismatch: pred {pred.shape}, actual {actual.shape}")

    # Compute RMSE
    return np.sqrt(mean_squared_error(actual, pred))


### ***Module 10: Model-Based Collaborative Filtering using PySpark ALS***

In [None]:
# ==============================
# Module 10: Model-Based CF using PySpark ALS
# ==============================
# Purpose: Implement Alternating Least Squares (ALS) for collaborative filtering in distributed environments using PySpark.
# Application: Designed for scalability, this module is best suited for handling very large datasets with missing values and
# sparse interactions.

"""
This module implements ALS using PySpark's MLlib, making it suitable for distributed computing.
ALS alternately holds the user factors or the item factors fixed and solves a regularized
least-squares problem for the other, iterating toward a low-rank factorization of the user-item rating matrix.

Use this for:
- Training collaborative filtering models on large-scale datasets
- Dropping NaN predictions for users or items unseen during training via coldStartStrategy='drop'
- Running evaluations on distributed clusters using RMSE
- Integration with large production systems requiring scalable solutions
"""
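
The alternating scheme described above can be sketched in plain NumPy on a toy matrix. This illustrates the idea only; it is not PySpark's implementation, and the function names, regularization value, and data below are assumptions made for the example.

```python
import numpy as np

def als_step(ratings, mask, fixed_factors, reg):
    """Solve one ridge least-squares problem per row, holding `fixed_factors` constant."""
    n_rows, rank = ratings.shape[0], fixed_factors.shape[1]
    solved = np.zeros((n_rows, rank))
    for row in range(n_rows):
        observed = mask[row]                  # columns with a rating in this row
        F = fixed_factors[observed]           # factors of those columns
        A = F.T @ F + reg * np.eye(rank)
        b = F.T @ ratings[row, observed]
        solved[row] = np.linalg.solve(A, b)
    return solved

def als_objective(ratings, mask, users, items, reg):
    """Squared error on observed entries plus L2 penalty on both factor matrices."""
    errors = (ratings - users @ items.T)[mask]
    return float(np.sum(errors ** 2) + reg * (np.sum(users ** 2) + np.sum(items ** 2)))

# Hypothetical toy data: 0 marks a missing rating
toy = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 0.0, 1.0, 2.0],
    [1.0, 2.0, 5.0, 0.0],
    [0.0, 1.0, 4.0, 5.0],
])
toy_mask = toy > 0

rng = np.random.default_rng(0)
rank, reg = 2, 0.1
user_factors = rng.normal(scale=0.1, size=(toy.shape[0], rank))
item_factors = rng.normal(scale=0.1, size=(toy.shape[1], rank))

objective_history = []
for _ in range(10):
    # Fix item factors and solve for users, then fix user factors and solve for items
    user_factors = als_step(toy, toy_mask, item_factors, reg)
    item_factors = als_step(toy.T, toy_mask.T, user_factors, reg)
    objective_history.append(als_objective(toy, toy_mask, user_factors, item_factors, reg))

print([round(value, 4) for value in objective_history])
```

Because each half-step exactly minimizes the regularized objective over one block of factors, the objective cannot increase from one sweep to the next. Only observed entries enter the least-squares problems, which is how this formulation copes with sparse interactions.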


### ***Module 11: Save Predictions and Scores***

In [None]:
# ==============================
# Module 11: Save Predictions and Scores
# ==============================
# Purpose: Persist model prediction outputs and evaluation metrics for downstream use.
# Application: Enables reuse of trained models and their predictions across systems, visualizations, or report pipelines.

"""
This utility module provides a function to save predictions or evaluation results to disk
using Python’s pickle serialization. It ensures you can reuse prediction results without rerunning
expensive model training or inference steps.

Use this for:
- Storing model outputs for dashboards or final reporting
- Supporting reproducibility and offline analysis
- Sharing results with stakeholders or across systems
"""
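
A minimal sketch of such a save/load utility, using only the standard library, might look like the following. The function names `save_results` and `load_results`, the file layout, and the sample payload are hypothetical and not taken from the notebook.

```python
import pickle
import tempfile
from pathlib import Path

def save_results(obj, path):
    """Serialize `obj` to `path` with pickle, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return path

def load_results(path):
    """Load an object previously written by `save_results`."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)

# Usage with a placeholder payload (the values are illustrative, not real results)
placeholder_payload = {"model": "example", "scores": [0.1, 0.2, 0.3]}
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_results(placeholder_payload, Path(tmp_dir) / "outputs" / "example.pkl")
    restored_payload = load_results(saved_path)

print(restored_payload == placeholder_payload)
```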
