
### **Purpose**

The purpose of this script is to enrich the MovieLens movie dataset (`movies.dat`) with detailed movie metadata from The Movie Database (TMDB) API. This metadata includes movie overviews, genres, poster and backdrop image URLs, cast and director information, keywords, TMDB vote averages and vote counts, and trailer links. The enriched dataset will serve as the foundation for building content-based, collaborative, and hybrid recommender systems.

### **Methodology**

1. **Load MovieLens Movie Data**
   The script loads the `movies.dat` file, which contains basic movie information including `movieId`, `title`, and `genres`.

2. **Clean Titles and Extract Years**
   It processes the movie titles to remove the year from the title string and separately extracts the release year to improve search accuracy when querying TMDB.

3. **Query TMDB API**
   For each movie, it sends a search request to TMDB using the cleaned title and release year. If a match is found, it retrieves the movie’s TMDB ID.

4. **Retrieve Detailed Metadata**
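   The overview, image paths, genre IDs, and vote statistics come from the search result itself. Using the TMDB ID, the script then fetches credits, keywords, and videos. Together, the collected fields are:

   * Overview (plot summary)
   * Poster and backdrop image paths (converted to full image URLs)
   * Genre IDs, which are then mapped to readable genre names
   * Average vote and vote count
   * Top 3 cast members
   * Director(s)
   * Associated keywords
   * YouTube trailer link (if available)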

5. **Construct and Save Enriched Dataset**
   All metadata is compiled into a structured format and merged with the original MovieLens data. The final dataset is saved as `movies_enriched_full.csv` for downstream use in recommendation models.
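
Step 2 can be sketched with a regular expression. This is a minimal illustration rather than the script's own implementation: `split_title_year` and `move_trailing_article` are hypothetical helpers, and moving a trailing article to the front ("Usual Suspects, The" to "The Usual Suspects") is an assumption about what may help TMDB search, not something the script does.

```python
import re

# Trailing "(YYYY)" as used by MovieLens titles, e.g. "Toy Story (1995)"
_YEAR_RE = re.compile(r"\s*\((\d{4})\)\s*$")
# Titles stored with the article at the end, e.g. "Usual Suspects, The"
_ARTICLE_RE = re.compile(r"^(?P<body>.+), (?P<article>The|A|An)$")


def split_title_year(raw_title):
    """Return (clean_title, year); year is an int, or None when no year suffix exists."""
    match = _YEAR_RE.search(raw_title)
    if match is None:
        return raw_title.strip(), None
    return raw_title[:match.start()].strip(), int(match.group(1))


def move_trailing_article(title):
    """Turn 'Usual Suspects, The' into 'The Usual Suspects'; leave other titles alone."""
    match = _ARTICLE_RE.match(title)
    if match is None:
        return title
    return f"{match.group('article')} {match.group('body')}"


title, year = split_title_year("Usual Suspects, The (1995)")
query = move_trailing_article(title)
```

Anchoring the pattern at the end of the string means parentheses elsewhere in a title (for example an alternate title) are left untouched.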


In [None]:
import pandas as pd
import requests
from tqdm import tqdm
import time

# ---------------------------------------
# CONFIG
# ---------------------------------------
BASE_URL = "https://api.themoviedb.org/3"
IMAGE_BASE = "https://image.tmdb.org/t/p/w500"

# Read your TMDB Bearer Token (v4) from the environment instead of hardcoding it.
# TMDB_BEARER_TOKEN is an assumed variable name; set it before running this cell.
import os

HEADERS = {
    "Authorization": f"Bearer {os.environ['TMDB_BEARER_TOKEN']}"
}

# TMDB genre ID to name mapping
GENRE_ID_TO_NAME = {
    28: "Action", 12: "Adventure", 16: "Animation", 35: "Comedy", 80: "Crime",
    99: "Documentary", 18: "Drama", 10751: "Family", 14: "Fantasy", 36: "History",
    27: "Horror", 10402: "Music", 9648: "Mystery", 10749: "Romance", 878: "Science Fiction",
    10770: "TV Movie", 53: "Thriller", 10752: "War", 37: "Western"
}

# ---------------------------------------
# STEP 1: Load MovieLens .dat Files
# ---------------------------------------

# Load movies.dat - format: MovieID::Title::Genres
movies_df = pd.read_csv("movies.dat", sep="::", engine='python', header=None, names=["movieId", "title", "genres"], encoding="latin-1")

# ---------------------------------------
# STEP 2: Clean Movie Titles and Extract Year
# ---------------------------------------

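def extract_year(title):
    # MovieLens titles end with the year in parentheses, e.g. "Toy Story (1995)"
    if "(" in title:
        try:
            return int(title.strip()[-5:-1])
        except ValueError:  # the four characters before the last one are not a number
            return None
    return None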

def clean_title(title):
    if "(" in title:
        return title[:title.rfind("(")].strip()
    return title.strip()

movies_df["year"] = movies_df["title"].apply(extract_year)
movies_df["clean_title"] = movies_df["title"].apply(clean_title)

# ---------------------------------------
# STEP 3: TMDB Metadata Functions
# ---------------------------------------

# Search for movie in TMDB
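def search_tmdb(title, year):
    url = f"{BASE_URL}/search/movie"
    params = {"query": title}
    # Send the year only when one was parsed; pandas may store the column as float/NaN
    if pd.notna(year):
        params["year"] = int(year)
    response = requests.get(url, headers=HEADERS, params=params, timeout=10)
    r = response.json()
    if r.get("results"):
        return r["results"][0]  # take the first (best-ranked) match
    return None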

# Get full metadata from TMDB
def get_full_tmdb_metadata(tmdb_id):
    metadata = {}

    # Credits (cast, crew)
    credits = requests.get(f"{BASE_URL}/movie/{tmdb_id}/credits", headers=HEADERS).json()
    cast = [c["name"] for c in credits.get("cast", [])[:3]]
    directors = [c["name"] for c in credits.get("crew", []) if c.get("job") == "Director"]

    # Keywords
    keywords = requests.get(f"{BASE_URL}/movie/{tmdb_id}/keywords", headers=HEADERS).json()
    keyword_list = [k["name"] for k in keywords.get("keywords", [])]

    # Videos (trailers)
    videos = requests.get(f"{BASE_URL}/movie/{tmdb_id}/videos", headers=HEADERS).json()
    trailer_links = [
        f"https://www.youtube.com/watch?v={v['key']}"
        for v in videos.get("results", [])
        if v["site"] == "YouTube" and v["type"] == "Trailer"
    ]

    # Final metadata dictionary
    metadata["top_3_cast"] = ", ".join(cast)
    metadata["directors"] = ", ".join(directors)
    metadata["keywords"] = ", ".join(keyword_list)
    metadata["trailer_link"] = trailer_links[0] if trailer_links else None

    return metadata

# ---------------------------------------
# STEP 4: Enrich Movie Data
# ---------------------------------------

enriched = []

for _, row in tqdm(movies_df.iterrows(), total=len(movies_df)):
    movie_data = search_tmdb(row["clean_title"], row["year"])

    if movie_data:
        tmdb_id = movie_data["id"]
        extra = get_full_tmdb_metadata(tmdb_id)

        genre_ids = movie_data.get("genre_ids", [])
        genre_names = [GENRE_ID_TO_NAME.get(gid, str(gid)) for gid in genre_ids]

        enriched.append({
            "tmdb_id": tmdb_id,
            "overview": movie_data.get("overview", ""),
            "poster_path": IMAGE_BASE + movie_data.get("poster_path", "") if movie_data.get("poster_path") else None,
            "backdrop_path": IMAGE_BASE + movie_data.get("backdrop_path", "") if movie_data.get("backdrop_path") else None,
            "vote_average": movie_data.get("vote_average", None),
            "vote_count": movie_data.get("vote_count", None),
            "tmdb_genres": ", ".join(genre_names),
            **extra
        })
    else:
        enriched.append({
            "tmdb_id": None,
            "overview": None,
            "poster_path": None,
            "backdrop_path": None,
            "vote_average": None,
            "vote_count": None,
            "tmdb_genres": None,
            "top_3_cast": None,
            "directors": None,
            "keywords": None,
            "trailer_link": None
        })

    time.sleep(0.25)  # Respect TMDB API rate limits

# ---------------------------------------
# STEP 5: Save Final Dataset
# ---------------------------------------

enriched_df = pd.DataFrame(enriched)
final_df = pd.concat([movies_df, enriched_df], axis=1)
final_df.to_csv("movies_enriched_full.csv", index=False)

print("DONE: Saved as 'movies_enriched_full.csv'")


FileNotFoundError: [Errno 2] No such file or directory: 'movies.dat'

## **Personalized Content-Based Movie Recommendation System**

This Python script implements a **Content-Based Filtering (CBF)** system enhanced with **personalized recommendations** using user-specific rating profiles. Built using the MovieLens 1M dataset and enriched metadata, the pipeline performs vectorization, similarity computation, and profile-based predictions.

**What This Script Does**

* **Module 1–2**: Load essential libraries and enriched movie data.
* **Module 3**: Load user ratings and demographics.
* **Module 4**: Engineer features combining genres, cast, crew, keywords, and movie overviews.
* **Module 5**: Transform content into TF-IDF, Count, or Binary vectors, and compute pairwise similarities using Cosine or Jaccard metrics.
* **Module 6**: Construct a weighted content profile per user based on past ratings.
* **Module 7**: Recommend top-N movies similar to the user profile, excluding already seen titles.

**Techniques Used**

* **Text Vectorization**: TF-IDF, CountVectorizer, Binary Count
* **Similarity Metrics**: Cosine Similarity, Jaccard Similarity
* **Personalization**: Weighted vector averaging based on each user’s rated items
* **Parallelization**: Speeds up Jaccard similarity computation using joblib

**Use Cases**

* Personalized recommendations for new users with a few ratings (cold-start)
* Improving diversity and relevance in suggested movies
* Generating fallback content suggestions in hybrid recommender systems
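
The rating-weighted profile behind the personalization step can be illustrated without any libraries. The sketch below is a toy, not the notebook's implementation: the three-dimensional feature space, the item ids, and the helper names are made up for illustration, whereas the notebook works on TF-IDF or count vectors with tens of thousands of dimensions.

```python
import math


def weighted_profile(item_vectors, ratings):
    """Average the vectors of the rated items, weighting each by its rating."""
    total = sum(ratings.values())
    dim = len(next(iter(item_vectors.values())))
    profile = [0.0] * dim
    for item_id, rating in ratings.items():
        for k, value in enumerate(item_vectors[item_id]):
            profile[k] += rating * value / total
    return profile


def cosine(a, b):
    """Cosine similarity of two dense vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(item_vectors, ratings, top_n=2):
    """Rank items the user has not rated by similarity to the user's profile."""
    profile = weighted_profile(item_vectors, ratings)
    unseen = [i for i in item_vectors if i not in ratings]
    scored = [(i, cosine(profile, item_vectors[i])) for i in unseen]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy feature space: [action, romance, space]
item_vectors = {
    "A": [1.0, 0.0, 1.0],
    "B": [0.0, 1.0, 0.0],
    "C": [1.0, 0.0, 0.0],
    "D": [0.0, 1.0, 1.0],
}
ratings = {"A": 5, "B": 1}  # the user liked A and disliked B
ranking = recommend(item_vectors, ratings)
```

Because item A carries five times the weight of item B, the profile leans toward action and space, so the action-only item C ranks ahead of D.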

In [9]:
# ==============================
# Module 1: Imports & Configuration
# ==============================
# Purpose: Load all required libraries and set global display settings.
# Application: Enables necessary tools for data manipulation, modeling, and visualization.

import pandas as pd
import numpy as np
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from joblib import Parallel, delayed
from tqdm import tqdm

pd.set_option('display.max_colwidth', None)

# ==============================
# Module 2: Load Movie Data
# ==============================
# Purpose: Load enriched movie metadata from a CSV file.
# Application: Used as the content base for movie features in recommender systems.

def load_movie_data(filepath):
    df = pd.read_csv(filepath)
    print(f"Loaded {len(df)} movies.")
    return df

# ==============================
# Module 3: Load User Ratings and Demographics
# ==============================
# Purpose: Load MovieLens ratings and user demographic info.
# Application: Used to identify which movies users rated and to segment users by profile.

def load_user_data(ratings_path, users_path):
    ratings = pd.read_csv(ratings_path, sep="::", engine="python",
                          names=["userId", "movieId", "rating", "timestamp"])
    users = pd.read_csv(users_path, sep="::", engine="python",
                        names=["userId", "gender", "age", "occupation", "zip"])
    print(f"Loaded {len(ratings)} ratings and {len(users)} users.")
    return ratings, users

# ==============================
# Module 4: Feature Engineering
# ==============================
# Purpose: Merge movie metadata into a single feature string.
# Application: Used as input to text vectorization for computing similarity.

def create_feature_string(df):
    def split_and_clean(col, delimiter='|'):
        return col.fillna('').str.replace(r'\s+', '', regex=True).str.split(delimiter)

    genre_list_1 = split_and_clean(df['genres'], delimiter='|')
    genre_list_2 = split_and_clean(df['tmdb_genres'], delimiter=',')
    merged_genres = [
        ' '.join(sorted(set(g1 or []) | set(g2 or [])))
        for g1, g2 in zip(genre_list_1, genre_list_2)
    ]

    def clean_text(col):
        return col.fillna('').str.replace(r'\s+', '', regex=True).str.replace(',', ' ')

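    # Raw string avoids an invalid-escape warning for \w and \s
    overview_clean = df['overview'].fillna('').str.lower().str.replace(r'[^\w\s]', '', regex=True)

    df['cbf_features'] = (
        # Reuse df's index so the genre strings line up row by row with the other columns
        pd.Series(merged_genres, index=df.index) + ' ' +
        clean_text(df['keywords']) + ' ' +
        clean_text(df['top_3_cast']) + ' ' +
        clean_text(df['directors']) + ' ' +
        overview_clean
    )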

    return df[['movieId', 'title', 'cbf_features']]

# ==============================
# Module 5: Vectorization & Similarity
# ==============================
# Purpose: Convert feature strings into vectors and compute pairwise similarity.
# Application: Supports similarity scoring between movies.

def vectorize_features(text_series, method='tfidf'):
    if method == 'tfidf':
        vectorizer = TfidfVectorizer(stop_words='english')
    elif method == 'count':
        vectorizer = CountVectorizer(stop_words='english')
    else:
        raise ValueError("Method must be 'tfidf' or 'count'")

    matrix = vectorizer.fit_transform(text_series)
    print(f"{method.upper()} vectorization complete. Shape: {matrix.shape}")
    return matrix, vectorizer

def binary_vectorize(text_series):
    vectorizer = CountVectorizer(binary=True, stop_words='english')
    matrix = vectorizer.fit_transform(text_series)
    print(f"Binary Count vectorization complete. Shape: {matrix.shape}")
    return matrix.toarray(), vectorizer

def compute_cosine_similarity(matrix):
    sim = cosine_similarity(matrix)
    print("Cosine similarity computed.")
    return sim

def jaccard_pairwise_parallel(matrix):
    n = matrix.shape[0]
    sim_matrix = np.zeros((n, n))

    def jaccard_row(i):
        a = matrix[i]
        row_sim = np.zeros(n)
        for j in range(i, n):
            b = matrix[j]
            intersection = np.logical_and(a, b).sum()
            union = np.logical_or(a, b).sum()
            score = intersection / union if union > 0 else 0.0
            row_sim[j] = score
        return i, row_sim

    results = Parallel(n_jobs=-1)(
        delayed(jaccard_row)(i) for i in tqdm(range(n), desc="Jaccard Similarity")
    )

    for i, row in results:
        sim_matrix[i, i:] = row[i:]
        sim_matrix[i:, i] = row[i:]

    print("Jaccard similarity matrix built.")
    return sim_matrix

def save_matrix(matrix, filename):
    with open(filename, 'wb') as f:
        pickle.dump(matrix, f)
    print(f"Saved similarity matrix to: {filename}")

# ==============================
# Module 6: Build User Profile
# ==============================
# Purpose: Create a personalized vector from a user's rated movies.
# Application: Encapsulates a user's preferences for use in recommendation.

def build_user_profile(user_id, ratings, tfidf_matrix, movie_df):
    user_ratings = ratings[ratings['userId'] == user_id]
    rated_movies = movie_df[movie_df['movieId'].isin(user_ratings['movieId'])]
    indices = rated_movies.index.tolist()
    weights = user_ratings.set_index('movieId').loc[rated_movies['movieId']]['rating'].values
    profile = np.average(tfidf_matrix[indices].toarray(), axis=0, weights=weights)
    return profile.reshape(1, -1)

# ==============================
# Module 7: Personalized Recommendation
# ==============================
# Purpose: Generate recommendations personalized to the user profile.
# Application: This module uses user ratings to build a content preference profile and recommend similar movies.

def recommend_movies(user_id, ratings, tfidf_matrix, movie_df, top_n=50):
    user_profile = build_user_profile(user_id, ratings, tfidf_matrix, movie_df)
    sims = cosine_similarity(user_profile, tfidf_matrix).flatten()
    user_seen = ratings[ratings['userId'] == user_id]['movieId'].tolist()
    unseen_indices = movie_df[~movie_df['movieId'].isin(user_seen)].index
    top_indices = unseen_indices[np.argsort(sims[unseen_indices])[-top_n:][::-1]]
    return movie_df.iloc[top_indices][['movieId', 'title']], sims[top_indices]


***Content-Based Similarity Recommendations***

Purpose:
Generate item recommendations using multiple content-based similarity strategies. Each set of recommendations is labeled by model type for downstream evaluation and comparison.

Methodology:
1. Load enriched movie metadata and user ratings.
2. Create combined feature strings using genres, keywords, cast, directors, and overview.
3. Vectorize the features using three methods: TF-IDF, Count, and Binary.
4. Compute pairwise similarity:
   - Cosine similarity for TF-IDF and Count vectors
   - Jaccard similarity for binary vectors
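5. For a given user, identify previously seen movies and score unseen ones. The TF-IDF and Count models score each unseen movie by its cosine similarity to the user's rating-weighted profile (`recommend_movies`); the Jaccard model uses the mean similarity to the seen set (`recommend_from_similarity_matrix`).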
6. Return top-N recommendations as labeled DataFrames including: movieId, title, predicted score, and model name.
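
Scoring by mean similarity to the seen set (step 5) can be sketched on a tiny precomputed matrix. The matrix values and the function name `score_unseen_by_mean_similarity` are illustrative assumptions. Note that this scoring ignores how highly the user rated each seen movie; it only uses which movies were seen.

```python
def score_unseen_by_mean_similarity(sim, seen, top_n=2):
    """sim: square list of lists; seen: set of row indices the user has rated."""
    if not seen:
        return []  # no history, nothing to compare against
    unseen = [i for i in range(len(sim)) if i not in seen]
    scored = [(i, sum(sim[i][j] for j in seen) / len(seen)) for i in unseen]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Symmetric toy similarity matrix for four movies
sim = [
    [1.0, 0.2, 0.8, 0.1],
    [0.2, 1.0, 0.4, 0.6],
    [0.8, 0.4, 1.0, 0.3],
    [0.1, 0.6, 0.3, 1.0],
]
ranking = score_unseen_by_mean_similarity(sim, seen={0, 1})
```

Movie 2 scores (0.8 + 0.4) / 2 and movie 3 scores (0.1 + 0.6) / 2, so movie 2 is ranked first.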



In [10]:
# ==============================
# Module 8: Content-Based Similarity Recommendations (Multi-Model)
# ==============================

# ==============================
# Step 1: Load Movie & User Data
# ==============================

movie_df = load_movie_data("movies_enriched_full.csv")
ratings, users = load_user_data("ratings.dat", "users.dat")

# Recreate CBF Features
movie_df = create_feature_string(movie_df)

# ==============================
# Step 2: Vectorize & Compute Similarities
# ==============================

# TF-IDF
tfidf_matrix, _ = vectorize_features(movie_df['cbf_features'], method='tfidf')
cosine_sim_matrix = compute_cosine_similarity(tfidf_matrix)

# Count
count_matrix, _ = vectorize_features(movie_df['cbf_features'], method='count')
cosine_sim_count = compute_cosine_similarity(count_matrix)

# Binary + Jaccard
binary_matrix, _ = binary_vectorize(movie_df['cbf_features'])
jaccard_sim_matrix = jaccard_pairwise_parallel(binary_matrix)

# ==============================
# Step 3: Recommendation Functions
# ==============================

def recommend_from_similarity_matrix(user_id, ratings, sim_matrix, movie_df, model_label, top_n=50):
    """
    Purpose: Recommend top-N movies to a user based on item similarity using a precomputed similarity matrix.
    Returns: DataFrame with movieId, title, score, and model label for downstream use.
    """
    seen_movie_ids = ratings[ratings['userId'] == user_id]['movieId'].tolist()
    seen_indices = movie_df[movie_df['movieId'].isin(seen_movie_ids)].index.tolist()
    unseen_indices = movie_df[~movie_df['movieId'].isin(seen_movie_ids)].index.tolist()

    if not seen_indices:
        print(f"No ratings found for user {user_id}.")
        return pd.DataFrame(columns=['movieId', 'title', 'score', 'model'])

    mean_sims = sim_matrix[unseen_indices][:, seen_indices].mean(axis=1)
    top_indices = np.argsort(mean_sims)[-top_n:][::-1]
    top_movie_indices = np.array(unseen_indices)[top_indices]

    return pd.DataFrame({
        'movieId': movie_df.iloc[top_movie_indices]['movieId'].values,
        'title': movie_df.iloc[top_movie_indices]['title'].values,
        'score': mean_sims[top_indices],
        'model': model_label
    })

# ==============================
# Step 4: Generate Recommendations (Labeled Outputs)
# ==============================

user_id = 5549

# TF-IDF + Cosine
df_tfidf, scores_tfidf = recommend_movies(user_id, ratings, tfidf_matrix, movie_df, top_n=50)
df_tfidf_cosine = pd.DataFrame({
    'movieId': df_tfidf['movieId'].values,
    'title': df_tfidf['title'].values,
    'score': scores_tfidf,
    'model': 'TF-IDF + Cosine'
})

# Count + Cosine
df_count, scores_count = recommend_movies(user_id, ratings, count_matrix, movie_df, top_n=50)
df_count_cosine = pd.DataFrame({
    'movieId': df_count['movieId'].values,
    'title': df_count['title'].values,
    'score': scores_count,
    'model': 'Count + Cosine'
})

# Binary + Jaccard
df_jaccard_binary = recommend_from_similarity_matrix(
    user_id, ratings, jaccard_sim_matrix, movie_df,
    model_label="Binary + Jaccard", top_n=50
)


Loaded 3883 movies.
Loaded 1000209 ratings and 6040 users.
TFIDF vectorization complete. Shape: (3883, 33424)
Cosine similarity computed.
COUNT vectorization complete. Shape: (3883, 33424)
Cosine similarity computed.
Binary Count vectorization complete. Shape: (3883, 33424)


Jaccard Similarity: 100%|██████████| 3883/3883 [17:44<00:00,  3.65it/s]


Jaccard similarity matrix built.
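
The progress bar above reports about 17 minutes 44 seconds for the pairwise Python loop. As a hedged sketch, the same Jaccard scores can be obtained from one matrix product, because for 0/1 rows the intersection size is a dot product and the union is the sum of both row counts minus the intersection. `jaccard_matrix` is a hypothetical name and the sketch has not been run against the notebook's data; it assumes the dense binary matrix fits in memory.

```python
import numpy as np


def jaccard_matrix(binary):
    """Pairwise Jaccard similarity between the rows of a 0/1 matrix."""
    b = np.asarray(binary, dtype=np.float64)
    intersection = b @ b.T                      # shared active features per pair
    row_sums = b.sum(axis=1)                    # active features per row
    union = row_sums[:, None] + row_sums[None, :] - intersection
    # Pairs with an empty union (two all-zero rows) get similarity 0,
    # matching the convention used in jaccard_pairwise_parallel.
    return np.divide(intersection, union,
                     out=np.zeros_like(intersection), where=union > 0)


toy = [[1, 1, 0],
       [1, 0, 0],
       [0, 0, 0]]
toy_sim = jaccard_matrix(toy)
```

For large vocabularies, the same product could be computed on a SciPy sparse matrix to avoid densifying the input.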


### **Memory-based collaborative filtering module (UBCF, IBCF)**

***Purpose:***

This module implements **memory-based collaborative filtering** using **user-user** or **item-item** similarity. It addresses **user bias** by normalizing ratings through mean-centering and optionally **rescaling predictions** to the original rating scale for interpretability.

***Methodology:***

1. **Rating Matrix Construction**:

   * A user-item matrix is built from raw MovieLens-style ratings data.
   * For `kind='user'`, ratings are mean-centered per user to reduce bias from lenient or strict raters.
   * For `kind='item'`, the recommender applies no further normalization. Note that the application cell passes the same mean-centered matrix to both modes, so item similarities and item-based scores are computed from centered ratings.

2. **Similarity Computation**:

   * Cosine similarity is computed either:

     * **Across users** for user-based CF (`kind='user'`)
     * **Across items** for item-based CF (`kind='item'`)
   * `sklearn.metrics.pairwise_distances` is used to derive similarity as `1 - cosine_distance`.

3. **Prediction Generation**:

   * For **user-based CF**:

     * Ratings from similar users are weighted by similarity and averaged.
     * The user’s mean rating is **added back** to restore predictions to the original scale (e.g., 1–5).
   * For **item-based CF**:

     * A user’s own ratings are used to compute scores for similar items.
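     * No mean is added back. Because the application cell supplies mean-centered ratings, the resulting scores are deviations from the user's mean and are suited to ranking rather than to reading as 1–5 predictions.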

4. **Top-N Recommendations**:

   * The system filters out movies the user has already rated.
   * It ranks unseen movies by predicted score and returns the top-N recommendations.
   * Each recommendation is labeled with the model type (`User-Based CF` or `Item-Based CF`) for downstream tracking.
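
The mean-centering and re-centering steps for user-based CF can be sketched in pure Python. This is the textbook variant, in which the weighted average runs only over users who rated the target item; the notebook's vectorized code instead divides by the sum of all absolute similarities. The function names and toy ratings are assumptions for illustration.

```python
import math


def centered(ratings):
    """ratings: {user: {item: rating}} -> (deviations from each user's mean, user means)."""
    means = {u: sum(r.values()) / len(r) for u, r in ratings.items()}
    dev = {u: {i: x - means[u] for i, x in r.items()} for u, r in ratings.items()}
    return dev, means


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts (missing entries are 0)."""
    dot = sum(a[i] * b[i] for i in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def predict_user_based(ratings, user, item):
    """Predict a rating as user mean + similarity-weighted average of neighbours' deviations."""
    dev, means = centered(ratings)
    num = den = 0.0
    for other, other_dev in dev.items():
        if other == user or item not in other_dev:
            continue
        s = cosine(dev[user], other_dev)
        num += s * other_dev[item]
        den += abs(s)
    if den == 0:
        return means[user]  # no usable neighbours: fall back to the user's mean
    return means[user] + num / den  # add the mean back to return to the rating scale


toy_ratings = {
    "u1": {"m1": 5, "m2": 3},
    "u2": {"m1": 4, "m2": 2, "m3": 5},
    "u3": {"m1": 1, "m2": 5, "m3": 2},
}
prediction = predict_user_based(toy_ratings, "u1", "m3")
```

Here `u2` agrees with `u1` and liked `m3`, while `u3` disagrees with `u1` and disliked `m3`; both push the prediction above `u1`'s mean of 4.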

In [5]:
# ==============================
# Module 8: Memory-Based Collaborative Filtering (Bias-Normalized)
# ==============================
# Purpose: Compute user-user or item-item similarity from the rating matrix.
# Application: Real-time, interpretable recommendations with optional bias correction.

from sklearn.metrics.pairwise import pairwise_distances
import numpy as np
import pandas as pd

# --- Create Mean-Centered User-Item Matrix ---
def create_normalized_user_item_matrix(ratings):
    """
    Purpose: Create a user-item matrix with ratings mean-centered per user.
    Application: Reduces bias from generous or harsh raters.
    """
    matrix = ratings.pivot(index='userId', columns='movieId', values='rating')
    user_means = matrix.mean(axis=1)
    return matrix.sub(user_means, axis=0).fillna(0), user_means

# --- Compute Cosine Similarity ---
def compute_similarity(matrix, kind='user'):
    """
    Purpose: Compute pairwise cosine similarity between users or items.
    Application: Support for User-User or Item-Item collaborative filtering.
    """
    if kind == 'user':
        sim = 1 - pairwise_distances(matrix, metric='cosine')
    elif kind == 'item':
        sim = 1 - pairwise_distances(matrix.T, metric='cosine')
    else:
        raise ValueError("kind must be 'user' or 'item'")

    print(f"{kind.title()}-based similarity computed. Shape: {sim.shape}")
    return sim

# --- Generate Top-N Recommendations with Rescaled Predictions ---
def recommend_memory_based(user_id, user_item_matrix, user_means, similarity_matrix, kind='user', top_n=50):
    """
    Purpose: Recommend items using normalized ratings and return predictions on original scale.
    Application: User-User or Item-Item CF with appropriate bias handling.
    """
    model_label = f"{kind.title()}-Based CF"

    if kind == 'user':
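        # User-based: normalize ratings and add back mean after prediction
        # Look up the user's row position instead of assuming userIds run 1..N without gaps
        user_pos = user_item_matrix.index.get_loc(user_id)
        user_sim_scores = similarity_matrix[user_pos]
        normalized_ratings = user_item_matrix.values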

        weighted_scores = user_sim_scores @ normalized_ratings
        sum_weights = np.abs(user_sim_scores).sum()

        if sum_weights == 0:
            print("No similar users found.")
            return pd.DataFrame(columns=['movieId', 'score', 'model'])

        predicted_ratings = weighted_scores / sum_weights
        user_seen = user_item_matrix.loc[user_id]
        # Caveat: after centering, a rating equal to the user's mean is also 0 and is treated as unseen
        unseen_mask = user_seen == 0
        recs = pd.Series(predicted_ratings, index=user_item_matrix.columns)[unseen_mask]\
            .sort_values(ascending=False).head(top_n)

        # Re-center predictions to original scale
        recs += user_means.loc[user_id]

    elif kind == 'item':
        # Item-based: do NOT add back user mean
        user_ratings = user_item_matrix.loc[user_id]
        scores = user_ratings @ similarity_matrix
        sum_weights = (user_ratings != 0) @ np.abs(similarity_matrix)

        with np.errstate(divide='ignore', invalid='ignore'):
            predicted_ratings = np.true_divide(scores, sum_weights)
            predicted_ratings[sum_weights == 0] = 0

        unseen_mask = user_ratings == 0
        recs = pd.Series(predicted_ratings, index=user_item_matrix.columns)[unseen_mask]\
            .sort_values(ascending=False).head(top_n)

    else:
        raise ValueError("kind must be 'user' or 'item'")

    return pd.DataFrame({
        'movieId': recs.index,
        'score': recs.values,
        'model': model_label
    })


***Application of UBCF and IBCF***

In [7]:
# Step 1: Create bias-normalized matrix
user_item_matrix, user_means = create_normalized_user_item_matrix(ratings)

# Step 2: Compute similarity matrices
user_sim_matrix = compute_similarity(user_item_matrix, kind='user')
item_sim_matrix = compute_similarity(user_item_matrix, kind='item')

# Step 3: Generate recommendations
user_cf_recs = recommend_memory_based(5549, user_item_matrix, user_means, user_sim_matrix, kind='user', top_n=50)
item_cf_recs = recommend_memory_based(5549, user_item_matrix, user_means, item_sim_matrix, kind='item', top_n=50)

# Optional: Merge movie titles
# user_cf_recs = user_cf_recs.merge(movies_df[['movieId', 'title']], on='movieId', how='left')


User-based similarity computed. Shape: (6040, 6040)
Item-based similarity computed. Shape: (3706, 3706)
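
The item-based scoring rule can be sketched as a similarity-weighted average of the user's own ratings. The sketch below uses raw (uncentered) toy ratings, so with non-negative similarities the result stays within the rating range; `predict_item_based` and the similarity values are assumptions for illustration.

```python
def predict_item_based(user_ratings, item_sim, target):
    """user_ratings: {item: rating}; item_sim: {item: {other_item: similarity}}."""
    num = den = 0.0
    for item, rating in user_ratings.items():
        s = item_sim[target].get(item, 0.0)
        num += s * rating
        den += abs(s)
    return num / den if den else 0.0  # 0.0 when no rated item is similar to the target


# m3 is very similar to m1 (rated 5) and barely similar to m2 (rated 1)
toy_item_sim = {"m3": {"m1": 0.9, "m2": 0.1}}
toy_user_ratings = {"m1": 5, "m2": 1}
score = predict_item_based(toy_user_ratings, toy_item_sim, "m3")
```

The score is (0.9 × 5 + 0.1 × 1) / (0.9 + 0.1), which is dominated by the rating of the most similar item.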


## **Model-Based Filtering:**

  * *SVD (Surprise)*: Learns latent features from the rating matrix.
  * *ALS (PySpark)*: Scalable factorization method for large datasets.


### **Module 9: Model-Based Collaborative Filtering (SVD using Surprise)**

**Purpose:**
Use matrix factorization (SVD) to learn latent user/item features from the rating matrix.

**Application:**
- Accurate, scalable recommendations for sparse datasets using user/item embeddings.
- Suitable for small to medium datasets.
- Optimized via `GridSearchCV` for hyperparameter tuning.
- Latent factors per user and item can be inspected, although individual factors rarely map to a clear human-readable meaning.
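
To make "latent features" concrete, the sketch below trains a small biased matrix factorization with stochastic gradient descent, using the prediction rule documented for Surprise's `SVD` (global mean + user bias + item bias + dot product of factor vectors). It is a pure-Python illustration under assumed toy data and hyperparameters, not Surprise's implementation; `train_biased_mf` and `rmse` are hypothetical helpers.

```python
import random


def train_biased_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """ratings: list of (user, item, rating). Returns (predict function, global mean)."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        # Unknown users or items fall back to the terms that are available
        dot = sum(a * b for a, b in zip(p[u], q[i])) if u in p and i in q else 0.0
        return mu + bu.get(u, 0.0) + bi.get(i, 0.0) + dot

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for k in range(n_factors):
                pu, qi = p[u][k], q[i][k]
                p[u][k] += lr * (err * qi - reg * pu)
                q[i][k] += lr * (err * pu - reg * qi)
    return predict, mu


def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    pairs = list(pairs)
    return (sum((a - b) ** 2 for a, b in pairs) / len(pairs)) ** 0.5


toy = [("u1", "m1", 5), ("u1", "m2", 4), ("u2", "m1", 4), ("u2", "m3", 1),
       ("u3", "m2", 5), ("u3", "m3", 2), ("u4", "m1", 5), ("u4", "m3", 1)]
predict, mu = train_biased_mf(toy)
baseline_rmse = rmse((r, mu) for _, _, r in toy)          # always predict the global mean
fitted_rmse = rmse((r, predict(u, i)) for u, i, r in toy)  # training error after SGD
```

Both errors here are measured on the training data, so the comparison only shows that the factors and biases fit the observed ratings; it says nothing about generalization, which is what the held-out split in `evaluate_svd` is for.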



In [18]:
# Step 1: Pin NumPy below 2.0; the scikit-surprise build used here was compiled against NumPy 1.x
!pip install "numpy<2"

# Step 2: Reinstall scikit-surprise after numpy is set up correctly
!pip uninstall -y scikit-surprise
!pip install scikit-surprise

# Step 3: Restart the runtime so the newly installed NumPy is the one that gets imported

Found existing installation: scikit-surprise 1.1.4
Uninstalling scikit-surprise-1.1.4:
  Successfully uninstalled scikit-surprise-1.1.4
Collecting scikit-surprise
  Using cached scikit_surprise-1.1.4-cp311-cp311-linux_x86_64.whl
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.4


In [17]:
# ==============================
# Module 9: Model-Based Collaborative Filtering (SVD using Surprise)
# ==============================
# Purpose: Use matrix factorization (SVD) to learn latent user/item features from the rating matrix.
# Application: Accurate, scalable recommendations for sparse datasets using user/item embeddings.

# ==============================
# Imports (scikit-surprise is installed in the previous cell)
# ==============================

from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split, GridSearchCV
from surprise.accuracy import rmse
import pandas as pd

# --- Prepare Surprise Dataset ---
def prepare_surprise_data(ratings):
    """
    Convert ratings DataFrame into Surprise's internal format.
    """
    reader = Reader(rating_scale=(1, 5))  # MovieLens 1M uses whole-star ratings from 1 to 5
    return Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# --- Tune SVD Model with Grid Search ---
def tune_svd_model(data):
    """
    Use GridSearchCV to find best hyperparameters for SVD based on RMSE.
    """
    param_grid = {
        'n_factors': [50, 100],
        'lr_all': [0.005, 0.01],
        'reg_all': [0.02, 0.1]
    }
    gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
    gs.fit(data)
    print(f"Best RMSE: {gs.best_score['rmse']} with params: {gs.best_params['rmse']}")
    return gs.best_estimator['rmse']

# --- Train and Evaluate SVD ---
def evaluate_svd(model, data):
    """
    Split data, train SVD model, predict, and return RMSE score and predictions.
    """
    trainset, testset = train_test_split(data, test_size=0.2)
    model.fit(trainset)
    predictions = model.test(testset)
    score = rmse(predictions)
    return predictions, score



A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/dist-packages/colab_kernel_launcher.py", line 37, in <module>
    ColabKernelApp.launch_instance()
  File "/usr/local/lib/python3.11/dist-packages/traitlets/config/application.py", line 992, in launch_instance
    app.start()
  File "/usr/local/lib/python3.11/dist-packages/ipykernel/kernelapp.py", line 712, in start
    self.io_loop.start()
  File "/usr/local/lib/python3.11/dist-package

ImportError: numpy.core.multiarray failed to import (auto-generated because you didn't call 'numpy.import_array()' after cimporting numpy; use '<void>numpy._import_array' to disable if you are certain you don't need it).

### **Model-Based Collaborative Filtering (ALS using PySpark)**

**Purpose:**
Use Alternating Least Squares (ALS) to learn latent user/item features at scale.

**Application:**
- Distributed recommendation system for large-scale datasets.
- Runs on Apache Spark for horizontal scalability.
- Handles sparsity well using factorization.
- Suited to production-level systems with massive data, where the model is trained in batch and recommendations are typically precomputed.
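
The alternating idea behind ALS can be shown with a single latent factor, where each least-squares step has a closed form. This is a simplified sketch under stated assumptions: rank 1, plain L2 regularization, and toy data. Spark's `ALS` uses a higher rank, solves a small linear system per user and item, and handles regularization and non-negativity in its own way. `als_rank1` and `objective` are hypothetical names.

```python
def als_rank1(ratings, n_iters=20, reg=0.1):
    """ratings: {(user, item): rating}. Returns (user_factor, item_factor) dicts."""
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    p = {u: 1.0 for u in users}
    q = {i: 1.0 for i in items}
    for _ in range(n_iters):
        # Fix the item factors and solve each user's regularized least squares exactly
        for u in users:
            num = sum(r * q[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(q[i] ** 2 for (uu, i) in ratings if uu == u)
            p[u] = num / den
        # Fix the user factors and solve for each item in the same way
        for i in items:
            num = sum(r * p[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(p[u] ** 2 for (u, ii) in ratings if ii == i)
            q[i] = num / den
    return p, q


def objective(ratings, p, q, reg):
    """Squared error on observed ratings plus the L2 penalty on all factors."""
    fit = sum((r - p[u] * q[i]) ** 2 for (u, i), r in ratings.items())
    penalty = reg * (sum(v * v for v in p.values()) + sum(v * v for v in q.values()))
    return fit + penalty


# Toy ratings with exact rank-1 structure: user scale (2, 1) times item scale (2, 1)
toy = {("u1", "m1"): 4, ("u1", "m2"): 2, ("u2", "m1"): 2, ("u2", "m2"): 1}
p1, q1 = als_rank1(toy, n_iters=1)
p20, q20 = als_rank1(toy, n_iters=20)
```

Each half-step minimizes the objective exactly with the other side held fixed, so the objective never increases from one iteration to the next; the regularization term shrinks the reconstructed ratings slightly below the observed values.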


In [16]:
# ==============================
# Module 10: Model-Based Collaborative Filtering (ALS using PySpark)
# ==============================
# Purpose: Use Alternating Least Squares (ALS) to learn latent user/item features at scale.
# Application: Distributed recommendation on large datasets using Spark MLlib.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# --- Start Spark Session ---
def start_spark():
    spark = SparkSession.builder \
        .appName("ALSModel") \
        .getOrCreate()
    return spark

# --- Convert Ratings to Spark DataFrame ---
def prepare_als_data(spark, ratings):
    return spark.createDataFrame(ratings[['userId', 'movieId', 'rating']])

# --- Train ALS Model ---
def train_als_model(data):
    als = ALS(
        userCol="userId", itemCol="movieId", ratingCol="rating",
        rank=10, maxIter=10, regParam=0.1,
        coldStartStrategy="drop", nonnegative=True
    )
    model = als.fit(data)
    return model

# --- Evaluate ALS Model ---
def evaluate_als(model, data):
    predictions = model.transform(data)
    evaluator = RegressionEvaluator(
        metricName="rmse", labelCol="rating", predictionCol="prediction"
    )
    rmse_val = evaluator.evaluate(predictions)
    print(f"ALS RMSE: {rmse_val}")
    return predictions, rmse_val
