
# 1. Import Required Libraries

In [17]:
# This section imports all necessary libraries for data processing, similarity computation, evaluation, and visualization.
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import normalize
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split


# 2. Load and Preprocess Data

- Downloads smaller, pre-filtered versions of the ratings and movies datasets from GitHub.
- These files contain fewer rows, so they fit comfortably within Colab's memory limits.
- `pd.read_csv()` loads them into DataFrames named `ratings` and `movies`.

In [18]:
# Step 1: Load subset datasets
!wget -q https://raw.githubusercontent.com/hawa1983/DATA-612/refs/heads/main/movies_subset.csv
!wget -q https://raw.githubusercontent.com/hawa1983/DATA-612/refs/heads/main/ratings_subset.csv

ratings = pd.read_csv("ratings_subset.csv")
movies = pd.read_csv("movies_subset.csv")

# Step 2: Convert genres to multi-hot encoded format
# This block prepares genre data for content-based filtering.
# genres_list: Converts the genre string (e.g., 'Action|Adventure') into a Python list.
# all_genres: Builds a sorted list of all unique genres in the dataset.
# The loop creates a new column for each genre (multi-hot encoding):
# If a movie has that genre, it gets a 1, else 0.

movies['genres'] = movies['genres'].fillna('')
movies['genres_list'] = movies['genres'].apply(lambda x: x.split('|'))

all_genres = sorted(set(genre for sublist in movies['genres_list'] for genre in sublist))
for genre in all_genres:
    movies[genre] = movies['genres_list'].apply(lambda x: 1 if genre in x else 0)

# Step 3: Merge movie features with ratings
# Merges the processed movies DataFrame (now with genre vectors) with ratings.
# This results in movie_data, a dataset where each row contains:
    ## The user ID
    ## The movie's genre indicators (1s and 0s)
    ## The rating the user gave that movie

movie_data = pd.merge(movies.drop(columns=['genres_list']), ratings, on='movieId')

print("Shape of merged dataset:", movie_data.shape)
print(movie_data.head())


Shape of merged dataset: (100000, 26)
   movieId             title                                       genres  \
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   

   (no genres listed)  Action  Adventure  Animation  Children  Comedy  Crime  \
0                   0       0          1          1         1       1      0   
1                   0       0          1          1         1       1      0   
2                   0       0          1          1         1       1      0   
3                   0       0          1          1         1       1      0   
4                   0       0          1          1         1       1      0   

   ...  Musical  M

# 3. Content-Based Filtering Using Genre Vectors and Cosine Similarity

This code implements a **content-based recommender system** using movie genres. Each movie is represented as a binary (multi-hot) vector based on its associated genres (e.g., Action, Comedy, Drama). The steps include:

* Normalizing the genre vectors using **L2 norm** so that each vector has unit length.
* Calculating **cosine similarity** between movie vectors to measure how similar their genre compositions are.
* Creating a function that, given a movie title, returns the top-N most similar movies (excluding itself) based purely on genre similarity.

This technique does not rely on user ratings — instead, it recommends items that are similar in content (genre) to a given movie.
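
As a small illustration of the first two steps, the sketch below uses four made-up movies and a hypothetical genre order (Action, Comedy, Crime, Thriller); none of these values come from the dataset. It shows that once the rows have unit length, a plain dot product gives the same numbers as `cosine_similarity`.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical genre order: Action, Comedy, Crime, Thriller
toy_genres = np.array([
    [1, 0, 1, 1],   # movie A: Action|Crime|Thriller
    [1, 0, 1, 1],   # movie B: same genre set as A
    [1, 0, 0, 1],   # movie C: Action|Thriller
    [0, 1, 0, 0],   # movie D: Comedy
])

# L2-normalize each row so that every genre vector has length 1
toy_normalized = normalize(toy_genres, norm='l2')

# For unit-length rows, the dot product is the cosine similarity
toy_dot = toy_normalized @ toy_normalized.T
toy_cos = cosine_similarity(toy_genres)

print(np.round(toy_dot, 3))
```

Movies A and B share every genre and therefore score 1, while A and D share none and score 0. Ties like A and B are common with genre-only features, because many movies carry exactly the same genre set.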


In [19]:
# Imports tools for normalizing feature vectors and computing similarity between them.
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

# Use only unique movie rows for similarity matrix
# Copies the movies DataFrame and resets the index to ensure each movie is uniquely indexed.
unique_movies = movies.copy().reset_index(drop=True)

# Normalize genre matrix
# Extracts the genre vectors for each movie (multi-hot encoded).
# Applies L2 normalization so that all genre vectors have a length of 1 (helps with cosine similarity).
genre_cols = all_genres
genre_matrix = unique_movies[genre_cols].values
genre_matrix_normalized = normalize(genre_matrix, norm='l2')

# Create title-to-index map for unique movies
# Maps each movie title to its row index, which is used to look up vector positions.
# Duplicate titles keep only their first row, so a lookup always returns a single index.
movie_idx = pd.Series(unique_movies.index, index=unique_movies['title'])
movie_idx = movie_idx[~movie_idx.index.duplicated(keep='first')]

# Define function to get recommendations
# Defines a function that takes a movie title and returns the top N most similar movies.
def get_recommendations(title, topN=10):
    if title not in movie_idx:
        return f"Movie '{title}' not found in dataset." # Returns an error message if the title is not in the dataset.

    # Retrieves the normalized genre vector for the given title and compares it to all other movie vectors using cosine similarity.
    idx = movie_idx[title]
    query_vector = genre_matrix_normalized[idx].reshape(1, -1)
    sim_scores = cosine_similarity(query_vector, genre_matrix_normalized)[0]

    # Drops the query movie by index, then sorts similarity scores in descending order.
    # (Movies with identical genres also score 1, so the query is not guaranteed to sort first.)
    sim_scores = [(i, s) for i, s in enumerate(sim_scores) if i != idx]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:topN]

    # Retrieves the titles of the top-N similar movies and returns them as recommendations.
    movie_indices = [i[0] for i in sim_scores]
    return unique_movies['title'].iloc[movie_indices]

# Sample 10 titles from the dataset to explore
print("Available sample titles:")
print(unique_movies['title'].sample(10, random_state=42).to_list())

# Get recommendations based on genre similarity to 'Heat (1995)'
print("\nContent-based Recommendations for 'Heat (1995)':")
print(get_recommendations('Heat (1995)'))


Available sample titles:
['Murder on the Orient Express (2017)', 'Rhapsody in August (Hachi-gatsu no kyôshikyoku) (1991)', 'First Position (2011)', 'Wait Until Dark (1967)', 'Coffy (1973)', 'The Philadelphia Experiment (2012)', 'Vampires (1998)', 'Mr. Nobody (2009)', 'Live by Night (2017)', 'Paterson']

Content-based Recommendations for 'Heat (1995)':
22                       Assassins (1995)
142     Die Hard: With a Vengeance (1995)
158                       Net, The (1995)
249           Natural Born Killers (1994)
410                 Judgment Night (1993)
503                         Batman (1989)
804                       Die Hard (1988)
1314                     Hard Rain (1998)
1324      Replacement Killers, The (1998)
1334                 U.S. Marshals (1998)
Name: title, dtype: object


# Hybrid Content-Based Rating Prediction Using Genre Similarity, User Behavior, and Fallback Handling

This code demonstrates a *hybrid recommendation system* that combines **content-based filtering using genre similarity** with **user-specific ratings**: each prediction weights the user's own past ratings by how similar those movies' genres are to the target movie. The objective is to predict how much a user will like a movie they haven't seen, based on the genres of that movie and their past rating behavior.

The prediction process incorporates a **fallback mechanism** and **debug printouts** to gracefully handle edge cases where standard hybrid predictions aren’t possible. These cases include users with no rating history, movies not present in the similarity matrix, or when no meaningful similarity is found.

#### How it Works:

1. **Genre Vector Normalization**:

   * The genre columns are multi-hot encoded (e.g., Action, Comedy, etc.).
   * Each movie’s genre vector is normalized using L2 norm, so that the dot product of two vectors equals their cosine similarity.

2. **Genre-Based Similarity Matrix**:

   * Cosine similarity is computed between all pairs of movies based on genre vectors.

3. **Mapping Setup**:

   * The code builds lookup maps between `movieId` and its corresponding row index in the genre matrix to allow fast access.

4. **Hybrid Prediction Function**:

   * For a given `user_id` and `movie_id`, the function:

     * Retrieves all movies rated by the user.
     * Finds the top-K rated movies that are most genre-similar to the target movie.
     * Computes a **weighted average of the ratings**, where the weights are the genre similarity scores.
     * **If no such ratings or similarities are available**, the function **falls back to the global average rating of the movie**.
     * Each fallback trigger is logged with a `[Debug]` message.

5. **Application**:

   * The model is tested on a sample user and generates predicted ratings for movies that are most similar in genre to a reference movie (e.g., *Heat (1995)*).

This hybrid approach offers:

* Personalization from the user's own rating history.
* Interpretability from content-based features (genres).
* Robustness from fallback logic to handle cold-starts or sparse data situations.
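
The weighted-average step and its fallback can be sketched in isolation. The helper name `weighted_rating` and the toy similarity and rating values below are hypothetical and only illustrate the arithmetic; they are not part of the notebook's data.

```python
import numpy as np
import pandas as pd

def weighted_rating(similarities, ratings, k=10, fallback=np.nan):
    """Similarity-weighted mean of the k most similar rated items (hypothetical helper)."""
    top = similarities.sort_values(ascending=False).head(k)
    total = top.sum()
    if total <= 0:
        # No usable similarity: return the supplied fallback (e.g. a global average)
        return fallback
    return float(np.dot(top.values, ratings.loc[top.index].values) / total)

# Toy data keyed by movieId: genre similarity to the target, and the user's ratings
demo_sims = pd.Series({10: 1.0, 20: 0.5, 30: 0.0})
demo_ratings = pd.Series({10: 4.0, 20: 2.0, 30: 5.0})

# Uses movies 10 and 20: (1.0 * 4.0 + 0.5 * 2.0) / (1.0 + 0.5)
demo_pred = weighted_rating(demo_sims, demo_ratings, k=2)
print(round(demo_pred, 3))
```

Movie 30 has zero genre similarity to the target, so its 5-star rating carries no weight even when `k` is large enough to include it.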


In [33]:
# --- Required Libraries ---
import pandas as pd
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

# --- Step 1: Normalize Genre Matrix ---
# Make a clean copy of the movie list and normalize genre vectors
unique_movies = movies.copy().reset_index(drop=True)
genre_cols = all_genres
genre_matrix = unique_movies[genre_cols].values
genre_matrix_normalized = normalize(genre_matrix, norm='l2')

# --- Step 2: Create Mapping Between movieId and genre_matrix Index ---
movieId_to_index = dict(zip(unique_movies['movieId'], unique_movies.index))
index_to_movieId = dict(zip(unique_movies.index, unique_movies['movieId']))
movie_idx = pd.Series(unique_movies.index, index=unique_movies['title'])
movie_idx = movie_idx[~movie_idx.index.duplicated(keep='first')]  # keep one row per title

# --- Step 3: Compute Cosine Similarity Between Movies ---
genre_sim_matrix = cosine_similarity(genre_matrix_normalized)

# --- Step 4: Define Hybrid Prediction Function with Fallback ---
def predict_rating_genre_weighted(user_id, target_movie_id, k=10):
    """
    Predicts a user's rating for a given movie using a hybrid method combining
    genre similarity and past user ratings. Falls back to the movie's global
    average rating if no valid prediction is possible.
    """

    # Check if user and movie exist in the dataset
    if user_id not in user_movie_matrix.index or target_movie_id not in movieId_to_index:
        print(f"[Debug] Invalid user_id {user_id} or movie_id {target_movie_id}. Returning NaN.")
        return np.nan

    # Get user's past ratings
    user_ratings = user_movie_matrix.loc[user_id].dropna()
    if user_ratings.empty:
        print(f"[Fallback 1] User {user_id} has no ratings. Using global average for movieId {target_movie_id}.")
        return movie_data[movie_data['movieId'] == target_movie_id]['rating'].mean()

    # Get the index of the target movie
    target_idx = movieId_to_index[target_movie_id]

    # Find indices for rated movies that exist in the similarity matrix
    rated_movie_indices = [movieId_to_index[mid] for mid in user_ratings.index if mid in movieId_to_index]
    if not rated_movie_indices:
        print(f"[Fallback 2] User {user_id} rated movies not found in genre matrix. Using global average for movieId {target_movie_id}.")
        return movie_data[movie_data['movieId'] == target_movie_id]['rating'].mean()

    # Compute similarity scores between target movie and rated movies
    sims = genre_sim_matrix[target_idx, rated_movie_indices]
    sims_series = pd.Series(sims, index=[index_to_movieId[i] for i in rated_movie_indices])

    # Take top-k most similar rated movies
    top_similar = sims_series.sort_values(ascending=False).head(k)
    top_ratings = user_ratings[top_similar.index]

    # Compute weighted prediction
    weighted_sum = np.dot(top_similar.values, top_ratings.values)
    normalization = np.sum(top_similar.values)

    if normalization > 0:
        return weighted_sum / normalization
    else:
        print(f"[Fallback 3] No similarity found for user {user_id} and movieId {target_movie_id}. Using global average.")
        return movie_data[movie_data['movieId'] == target_movie_id]['rating'].mean()

# --- Step 5: Predict Ratings for Similar Movies to "Heat (1995)" ---

# Choose the target movie
target_movie = 'Heat (1995)'

# Print a few sample titles
print("Available sample titles:")
print(unique_movies['title'].sample(10, random_state=42).to_list())

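# Get top 10 similar movies by genre (cosine similarity)
# Note: movies with identical genres tie at similarity 1, and argsort does not guarantee
# the order of ties, so skipping position 0 may not remove the target movie itself.
idx = movie_idx[target_movie]
sim_scores = cosine_similarity(genre_matrix_normalized[idx].reshape(1, -1), genre_matrix_normalized)[0]
sim_indices = np.argsort(sim_scores)[::-1][1:11]  # skip the top-ranked entry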
top_similar_movie_ids = unique_movies.loc[sim_indices, 'movieId']
top_similar_titles = unique_movies.loc[sim_indices, 'title']

# Display top similar movies
print(f"\nTop 10 Genre-Similar Movies to '{target_movie}':")
print(top_similar_titles)

# Pick a sample user
user_id = 1
print(f"\nPredicted Ratings for User {user_id} Using Genre-Weighted Hybrid Model with Fallback:\n")

# Predict and print the ratings
for movie_id, title in zip(top_similar_movie_ids, top_similar_titles):
    pred = predict_rating_genre_weighted(user_id=user_id, target_movie_id=movie_id, k=100)
    print(f"{title:<45} Predicted Rating: {pred:.2f}")


Available sample titles:
['Murder on the Orient Express (2017)', 'Rhapsody in August (Hachi-gatsu no kyôshikyoku) (1991)', 'First Position (2011)', 'Wait Until Dark (1967)', 'Coffy (1973)', 'The Philadelphia Experiment (2012)', 'Vampires (1998)', 'Mr. Nobody (2009)', 'Live by Night (2017)', 'Paterson']

Top 10 Genre-Similar Movies to 'Heat (1995)':
22                          Assassins (1995)
158                          Net, The (1995)
8497     Sin City: A Dame to Kill For (2014)
4758                         Trespass (1992)
9732           John Wick: Chapter Two (2017)
4822                    Punisher, The (2004)
3868                              xXx (2002)
10009          Brawl in Cell Block 99 (2017)
3817                   Gangster No. 1 (2000)
5850                    Transporter 2 (2005)
Name: title, dtype: object

Predicted Ratings for User 1 Using Genre-Weighted Hybrid Model with Fallback:

[Debug] Invalid user_id 1 or movie_id 23. Returning NaN.
Assassins (1995)                   

## Comparison of Overlapping and Divergent Recommendations

The two top-10 lists for *Heat (1995)* overlap in part and differ in part:

#### Similar Recommendations from Both Methods

* **Assassins (1995)**
* **Net, The (1995)**

Both lists are ranked by the same genre-based cosine similarity, so these shared titles are movies whose genre profile closely matches that of *Heat (1995)*.

#### Recommendations Unique to Each Method

**Only in Content-Based (Cosine Genre Similarity):**

* *Die Hard (1988)*
* *Batman (1989)*
* *U.S. Marshals (1998)*

**Only in Hybrid Model (Genre + Ratings Fallback):**

* *Sin City: A Dame to Kill For (2014)*
* *John Wick: Chapter Two (2017)*
* *Transporter 2 (2005)*

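These differences most likely come from tie-breaking rather than from the rating data. Both lists rank movies by the same cosine similarity over genre vectors, and many movies tie at the maximum score. The content-based function sorts with Python's stable `sorted()`, which keeps ties in index order, while the hybrid cell uses `np.argsort(...)[::-1]`, which orders ties differently. User ratings only enter afterwards, when `predict_rating_genre_weighted` scores each candidate.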


# Optimized Jaccard Similarity for Content-Based Filtering

This block introduces a more efficient method for computing **Jaccard similarity** between movies based on their genre information. Unlike the traditional nested-loop approach, this implementation uses the `pdist()` function from `scipy.spatial.distance` to compute all pairwise Jaccard distances in a **fully vectorized** manner. The result is a symmetric similarity matrix, which is then used to identify the most similar movies to a given title. This optimization drastically reduces computation time and is highly recommended for medium-to-large datasets.

Note that `scipy.spatial.distance.pdist()` **still calculates all pairwise similarities**; it simply avoids the Python-level loop.

Here’s how it works:

* `pdist(binary_matrix, metric='jaccard')` computes the **Jaccard distance** (which is `1 - Jaccard similarity`) between **all unique pairs** of rows (i.e., movies) in the binary genre matrix.
* The output is a **condensed distance matrix** — a flat array containing the upper triangle of the full pairwise distance matrix.
* This condensed matrix is converted back into a full square **symmetric matrix** using `squareform()`, giving us the distance between all pairs.
* We then compute similarity as `1 - distance`.

Every possible movie-to-movie similarity is calculated — but with optimized vectorized operations under the hood, which is much faster than nested Python loops.
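
To make the condensed and square forms concrete, here is a minimal sketch on three made-up binary genre rows (illustrative values, not taken from the dataset).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three toy movies described by four hypothetical genre flags
toy_rows = np.array([
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
], dtype=bool)

# Condensed form: one distance per unique pair, n * (n - 1) / 2 = 3 values
toy_condensed = pdist(toy_rows, metric='jaccard')

# Square form has zeros on the diagonal, so similarity = 1 - distance gives ones there
toy_jaccard_sim = 1 - squareform(toy_condensed)

print(toy_condensed.shape)
print(np.round(toy_jaccard_sim, 3))
```

The first two rows share two of three distinct genres, so their similarity is 2/3; the third row shares nothing with either and scores 0.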



In [28]:
from scipy.spatial.distance import pdist, squareform
import numpy as np
import pandas as pd

# Step 1: Prepare genre binary matrix
unique_movies = movies.copy().reset_index(drop=True)
genre_cols = all_genres
genre_matrix = unique_movies[genre_cols].astype(bool).astype(int).values  # ensure binary format

# Step 2: Compute Jaccard distance (1 - similarity)
# pdist returns a condensed distance matrix; squareform converts it to square form
jaccard_distance = pdist(genre_matrix, metric='jaccard')  # returns 1 - Jaccard similarity
jaccard_sim_matrix = 1 - squareform(jaccard_distance)      # convert to full similarity matrix

# Step 3: Create mapping from title to matrix index (one row per title)
movie_idx = pd.Series(unique_movies.index, index=unique_movies['title'])
movie_idx = movie_idx[~movie_idx.index.duplicated(keep='first')]

# Step 4: Define recommendation function
def get_recommendations_jaccard(title, topN=10):
    if title not in movie_idx:
        return f"Movie '{title}' not found in dataset."

    idx = movie_idx[title]
    sim_scores = [(i, s) for i, s in enumerate(jaccard_sim_matrix[idx]) if i != idx]  # exclude self by index
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:topN]
    top_indices = [i[0] for i in sim_scores]
    return unique_movies['title'].iloc[top_indices]

# Step 5: Try a sample movie
print("Sample titles:", unique_movies['title'].sample(5, random_state=42).to_list())
print("\nJaccard Recommendations for 'Heat (1995)':")
print(get_recommendations_jaccard("Heat (1995)"))


Sample titles: ['Murder on the Orient Express (2017)', 'Rhapsody in August (Hachi-gatsu no kyôshikyoku) (1991)', 'First Position (2011)', 'Wait Until Dark (1967)', 'Coffy (1973)']

Jaccard Recommendations for 'Heat (1995)':
22                       Assassins (1995)
142     Die Hard: With a Vengeance (1995)
158                       Net, The (1995)
249           Natural Born Killers (1994)
410                 Judgment Night (1993)
503                         Batman (1989)
804                       Die Hard (1988)
1314                     Hard Rain (1998)
1324      Replacement Killers, The (1998)
1334                 U.S. Marshals (1998)
Name: title, dtype: object


# Comparison of Content-Based Recommendations: Cosine vs. Jaccard Similarity

Both the **cosine similarity** and **Jaccard similarity** methods returned *identical top-10 movie recommendations* for the query movie **"Heat (1995)"**. This indicates that in the context of the MovieLens genre-based content filtering:

* **Both methods effectively captured the same neighborhood of similar films**.
* The movies recommended (e.g., *Assassins*, *Die Hard*, *The Net*, *Natural Born Killers*) suggest that the genre combinations for these titles closely match those of *Heat (1995)*.
* While **cosine similarity** operates on normalized multi-hot vectors and measures angular proximity,
  **Jaccard similarity** measures the overlap in genre tags directly.

### Key Takeaway:

Despite their different mathematical underpinnings, both methods **produced the same results** because:

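* Both metrics reach their maximum of 1 exactly when two movies have identical genre sets, so every movie that shares *Heat*'s exact genres ties for first place under either metric. The ascending row indices in both outputs suggest that all ten recommendations are such ties, listed in index order by the stable sort.
* Beyond exact matches, both metrics increase with the **overlap of genre labels**, so their rankings tend to agree on binary (multi-hot) vectors.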

However:

* **Cosine similarity** on L2-normalized vectors reduces to a single matrix product, which scales well and also works with sparse matrices.
* **Jaccard similarity is slower** when computed pairwise using loops, though vectorized solutions like `pdist()` improve it significantly.

You can safely use either in this binary genre context, but for large-scale systems, cosine is typically preferred for efficiency.
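
For binary vectors the two metrics have simple set-based forms: cosine is the overlap divided by the geometric mean of the two set sizes, and Jaccard is the overlap divided by the size of the union. The sketch below uses hypothetical helper names and made-up genre flags to show that they agree on exact matches and differ in value otherwise.

```python
import numpy as np

def cosine_binary(a, b):
    """Cosine similarity of two boolean vectors: |A ∩ B| / sqrt(|A| * |B|)."""
    return np.sum(a & b) / np.sqrt(a.sum() * b.sum())

def jaccard_binary(a, b):
    """Jaccard similarity of two boolean vectors: |A ∩ B| / |A ∪ B|."""
    return np.sum(a & b) / np.sum(a | b)

# Hypothetical genre flags
query = np.array([1, 0, 1, 1], dtype=bool)     # three genres
same = np.array([1, 0, 1, 1], dtype=bool)      # identical genre set
partial = np.array([1, 0, 0, 1], dtype=bool)   # two of the three genres

print(cosine_binary(query, same), jaccard_binary(query, same))
print(round(cosine_binary(query, partial), 4), round(jaccard_binary(query, partial), 4))
```

Both metrics give 1 for the identical set and a lower value for the partial match, so the two rankings agree here even though the scores differ.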


# 4. User-User Collaborative Filtering with Cosine Similarity (Memory-Efficient Version)

This block implements **User-User Collaborative Filtering** while addressing memory constraints in Google Colab. It includes two major optimizations:

* Limits the dataset to users with `userId <= 50,000` to keep the similarity matrix size manageable.
* Uses **cosine similarity** (instead of Pearson correlation), which is more memory-efficient and suitable for sparse rating data.

The code builds a **mean-centered user-item matrix**, computes cosine similarity between users, and defines a prediction function based on the top-k most similar users who rated a given movie. This approach estimates a target user’s rating by combining how similar users rated the same item, adjusted for their own average rating.
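
The effect of mean-centering can be seen on a tiny made-up rating matrix (three hypothetical users, three hypothetical movie IDs). Users 1 and 2 rate on different scales but agree on which movie is better, while user 3 has the opposite preference.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item matrix: rows = users, columns = movieIds, NaN = not rated
toy_user_item = pd.DataFrame(
    {101: [5.0, 4.0, 1.0], 102: [3.0, 2.0, 5.0], 103: [np.nan, 3.0, 3.0]},
    index=pd.Index([1, 2, 3], name='userId'),
)

# Center each user's ratings on that user's own mean; unrated cells become 0
toy_means = toy_user_item.mean(axis=1)
toy_centered = toy_user_item.sub(toy_means, axis=0).fillna(0)

toy_user_sim = pd.DataFrame(
    cosine_similarity(toy_centered),
    index=toy_user_item.index,
    columns=toy_user_item.index,
)
print(toy_user_sim.round(3))
```

After centering, users 1 and 2 have the same deviation pattern and a similarity of 1, while users 1 and 3 point in opposite directions and score -1.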


In [20]:
# Option 1: Reduce number of users to avoid excessive memory usage
# This line limits the dataset to users with IDs up to 50,000.
# It keeps the similarity matrix manageable in size (<= 50,000x50,000).
ratings_small = ratings[ratings['userId'] <= 50000]

# Create a user-item matrix and normalize ratings
# Creates a matrix where:
# Rows = users
# Columns = movies
# Cells = user ratings
# If a user hasn't rated a movie, the cell will be NaN.
user_movie_matrix = ratings_small.pivot(index='userId', columns='movieId', values='rating')

# Computes the average rating per user.
# This is used to center each user’s ratings around their personal mean.
user_means = user_movie_matrix.mean(axis=1)

# Subtracts the user’s mean from their ratings → i.e., mean-centering.
# Replaces missing ratings (NaN) with 0 (so they don't contribute to similarity).
user_movie_centered = user_movie_matrix.sub(user_means, axis=0).fillna(0)

# Option 3: Compute cosine similarity instead of Pearson correlation
# Cosine similarity handles sparse matrices more efficiently than correlation
# and is more memory-efficient for user-user comparisons.
from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity between users
user_similarity = cosine_similarity(user_movie_centered)
# Stores the similarity scores in a DataFrame, where:
# Rows and columns = userId
# Cell (i, j) = similarity between user i and user j
user_sim_df = pd.DataFrame(user_similarity, index=user_movie_matrix.index, columns=user_movie_matrix.index)

# Define prediction function
# Predicts how much user_id would rate movie_id using k nearest neighbors (top-k similar users).
def predict_user_user(user_id, movie_id, k=10):
    # If the user or the movie doesn't exist in the matrix, return NaN.
    if user_id not in user_sim_df.index or movie_id not in user_movie_matrix.columns:
        return np.nan

    # Get the similarity scores for the target user to all other users (excluding themselves).
    sims = user_sim_df[user_id].drop(user_id)
    # Find which users actually rated the target movie.
    rated_users = user_movie_matrix[movie_id].dropna()
    # Keep only similarities for other users who rated the movie.
    # The intersection avoids a KeyError when the target user has rated the movie too.
    sims = sims[sims.index.intersection(rated_users.index)]
    # Select the top-k most similar users.
    top_users = sims.sort_values(ascending=False).head(k)
    # Get those users’ ratings for the target movie, centered around their mean.
    top_ratings = user_movie_matrix.loc[top_users.index, movie_id] - user_means[top_users.index]

    # Final prediction:
    ## Adds the weighted average of the top users' deviations back to the target user’s mean.
    ## This estimates how the target user would rate the movie.
    ## Returns NaN when the similarity weights do not sum to a positive value.
    if top_users.sum() <= 0:
        return np.nan
    return user_means[user_id] + np.dot(top_users, top_ratings) / top_users.sum()


# 5. Item-Item Collaborative Filtering with Adjusted Cosine Similarity

This code implements **item-item collaborative filtering** using **adjusted cosine similarity**, which accounts for differences in users’ individual rating scales. The steps include:

* **Transposing** the user-item matrix so that movies are represented as rows.
* **Centering** each movie's vector by subtracting each user's mean rating — this adjustment ensures that the similarity metric reflects agreement in rating patterns, not absolute values.
* **Computing cosine similarity** between movie vectors to identify similar items.
* Defining a **prediction function** that estimates how a user would rate a target movie by aggregating their ratings for similar movies (weighted by similarity).

This memory-efficient approach allows personalized movie recommendations by looking at how similar a target movie is to other movies the user has already rated.
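
The difference between plain and adjusted cosine can be illustrated on a small made-up matrix (hypothetical users `u1` to `u3` and movies `A` to `C`). Raw ratings are all positive, so plain cosine rates every pair of movies as similar; subtracting each user's mean first exposes where users actually disagree.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item matrix: rows = users, columns = movies
toy_matrix = pd.DataFrame(
    {'A': [5.0, 4.0, 2.0], 'B': [5.0, 3.0, 1.0], 'C': [2.0, 2.0, 3.0]},
    index=pd.Index(['u1', 'u2', 'u3'], name='userId'),
)

# Adjusted cosine: subtract each user's mean, then compare movie vectors (rows after transpose)
toy_user_means = toy_matrix.mean(axis=1)
toy_centered_items = toy_matrix.sub(toy_user_means, axis=0).T
adjusted_sim = pd.DataFrame(
    cosine_similarity(toy_centered_items),
    index=toy_centered_items.index,
    columns=toy_centered_items.index,
)

# Plain cosine on the raw ratings, for comparison
raw_sim = pd.DataFrame(
    cosine_similarity(toy_matrix.T),
    index=toy_matrix.columns,
    columns=toy_matrix.columns,
)

print(adjusted_sim.round(3))
print(raw_sim.round(3))
```

In this toy data, movies A and C look similar under plain cosine but come out negatively related once user rating scales are removed.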

In [21]:
# Create user-movie matrix (assumes user_movie_matrix is already defined earlier)

# Transpose and center by user mean
# Transposes the matrix so that:
# Rows = movies (items)
# Columns = users
# This format is needed to compute item-item similarity.
movie_user_matrix = user_movie_matrix.T

# Center each row (movie) by subtracting the mean rating given by each user.
# This is known as adjusted cosine similarity — it accounts for differences in user rating scales.
movie_user_centered = movie_user_matrix.sub(user_movie_matrix.mean(axis=1), axis=1).fillna(0)

# Compute cosine similarity between items (movies)
# Cosine similarity is computed between each pair of movie vectors.
# The result is a similarity matrix where:
# Rows and columns = movieId
# Cell (i, j) = similarity between movie i and movie j
from sklearn.metrics import pairwise_distances
item_similarity = 1 - pairwise_distances(movie_user_centered, metric='cosine')
item_sim_df = pd.DataFrame(item_similarity, index=movie_user_matrix.index, columns=movie_user_matrix.index)

# Define prediction function
# Predicts how much user_id would rate movie_id using k nearest neighbors (top-k similar items).
def predict_item_item(user_id, movie_id, k=10):
    # If the user or the movie is not in the matrix, return NaN
    if user_id not in user_movie_matrix.index or movie_id not in user_movie_matrix.columns:
        return np.nan

    # Get all other movies that this user has rated
    # (the target movie is dropped so its own rating cannot leak into the prediction)
    user_ratings = user_movie_matrix.loc[user_id].dropna().drop(movie_id, errors='ignore')

    # If user has not rated any other movies, return NaN
    if user_ratings.empty:
        return np.nan

    # Get similarity scores between the target movie and all movies the user has rated
    sims = item_sim_df.loc[movie_id, user_ratings.index]

    # Select the top-k most similar items
    top_items = sims.sort_values(ascending=False).head(k)

    # Get the user’s ratings for those items
    top_ratings = user_ratings[top_items.index]

    # Compute the weighted average rating, weighted by item similarity
    prediction = np.dot(top_items, top_ratings) / top_items.sum() if top_items.sum() > 0 else np.nan
    return prediction


# Item-Item Collaborative Filtering using Jaccard Similarity

This code implements item-item collaborative filtering by measuring the similarity between movies using the Jaccard index. Jaccard similarity is a metric that compares the similarity of two sets by dividing the size of their intersection by the size of their union. In this context, it is used to compare movies based on whether users have rated them or not—ignoring the actual rating values.

To apply Jaccard similarity:

1. The user-item ratings matrix is converted into a binary format (1 if a user rated a movie, 0 otherwise).
2. Each pair of movies is compared using Jaccard similarity, which is computed as the ratio of the number of users who rated both movies to the number of users who rated either movie.
3. A prediction function is defined to estimate how much a user would rate a given movie, based on the user’s ratings for the most similar movies, weighted by Jaccard similarity.

This method is particularly useful when the presence or absence of a rating is more important than the rating value itself. It only needs a binary rated/not-rated matrix, which is compact for sparse datasets, although the nested pairwise loop in this implementation becomes slow when many movies are present.
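
As a possible alternative to comparing movies pair by pair, the intersection counts for all pairs can be obtained from one matrix product of the binary matrix with its transpose. The sketch below is an assumption-labelled variant, not the notebook's own implementation: the function name `jaccard_item_similarity` and the toy matrix are hypothetical, and the result is a dense movies-by-movies matrix, so memory still grows with the square of the number of movies.

```python
import numpy as np
import pandas as pd

def jaccard_item_similarity(binary_items):
    """Jaccard similarity between rows of a 0/1 DataFrame (rows = movies, columns = users)."""
    B = binary_items.to_numpy(dtype=np.int64)
    intersection = B @ B.T                      # users who rated both movies
    counts = B.sum(axis=1)                      # users who rated each movie
    union = counts[:, None] + counts[None, :] - intersection
    with np.errstate(divide='ignore', invalid='ignore'):
        # Pairs with an empty union get similarity 0 instead of NaN
        sim = np.where(union > 0, intersection / union, 0.0)
    return pd.DataFrame(sim, index=binary_items.index, columns=binary_items.index)

# Toy binary matrix: 1 if the user rated the movie, 0 otherwise
toy_binary = pd.DataFrame(
    [[1, 1, 0, 1],
     [1, 0, 0, 1],
     [0, 0, 1, 0]],
    index=['m1', 'm2', 'm3'],
    columns=['u1', 'u2', 'u3', 'u4'],
)

toy_item_jaccard = jaccard_item_similarity(toy_binary)
print(toy_item_jaccard.round(3))
```

Movies `m1` and `m2` share two raters out of three distinct raters, giving 2/3; `m3` shares no raters with the others and scores 0.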

In [24]:
# Required library for Jaccard similarity
from sklearn.metrics import jaccard_score

# ---- SUBSET THE DATASET TO AVOID MEMORY ISSUES ----

# Reduce to users with userId ≤ 10,000 for memory-efficient computation
ratings_small = ratings[ratings['userId'] <= 10000]

# Create user-movie matrix: rows = users, columns = movies, values = ratings
user_movie_matrix = ratings_small.pivot(index='userId', columns='movieId', values='rating')

# Step 1: Convert to Binary Rating Matrix
# Convert to binary: 1 if user rated a movie, 0 otherwise
# Transpose so that rows = movies, columns = users
movie_user_binary = user_movie_matrix.notna().astype(int).T

# Step 2: Compute Jaccard Similarity Matrix Between Movies
# Initialize empty DataFrame to store Jaccard similarities
movie_ids = movie_user_binary.index.tolist()
jaccard_sim_df = pd.DataFrame(index=movie_ids, columns=movie_ids, dtype=float)

# Compute Jaccard similarity for each pair of movies
# This may take time depending on how many movies are present
for i in range(len(movie_ids)):
    for j in range(i, len(movie_ids)):
        movie_i = movie_user_binary.loc[movie_ids[i]].values
        movie_j = movie_user_binary.loc[movie_ids[j]].values
        sim = jaccard_score(movie_i, movie_j)
        jaccard_sim_df.iloc[i, j] = sim
        jaccard_sim_df.iloc[j, i] = sim  # Ensure symmetry

# Step 3: Define Prediction Function Using Jaccard Similarity
# Predict rating for a given user and movie using top-k similar movies
def predict_item_item_jaccard(user_id, movie_id, k=10):
    # Check if movie exists in the matrix
    if movie_id not in user_movie_matrix.columns:
        return np.nan

    # Get movies the user has rated
    user_ratings = user_movie_matrix.loc[user_id].dropna()
    if user_ratings.empty:
        return np.nan

    # Get Jaccard similarity between target movie and movies the user has rated
    sims = jaccard_sim_df.loc[movie_id, user_ratings.index]

    # Select top-k similar movies
    top_items = sims.sort_values(ascending=False).head(k)
    top_ratings = user_ratings[top_items.index]

    # Weighted average of ratings using Jaccard similarity as weights
    prediction = np.dot(top_items, top_ratings) / top_items.sum() if top_items.sum() > 0 else np.nan
    return prediction





KeyboardInterrupt: 

# 6. Evaluation: RMSE Comparison

In [None]:
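# Split data for evaluation
# Note: train is not used below; the similarity matrices were built earlier from the ratings before this split.
train, test = train_test_split(ratings, test_size=0.2, random_state=42)

# Predict ratings using both collaborative methods
user_preds = test.apply(lambda row: predict_user_user(row['userId'], row['movieId']), axis=1)
item_preds = test.apply(lambda row: predict_item_item(row['userId'], row['movieId']), axis=1)

# Calculate RMSE only on rows where a prediction is available,
# so that the true and predicted ratings stay aligned and have equal length
user_mask = user_preds.notna()
item_mask = item_preds.notna()
user_rmse = np.sqrt(mean_squared_error(test.loc[user_mask, 'rating'], user_preds[user_mask]))
item_rmse = np.sqrt(mean_squared_error(test.loc[item_mask, 'rating'], item_preds[item_mask]))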

# Plot RMSE comparison
plt.figure(figsize=(8, 4))
plt.bar(['User-User (Mean-Centered Cosine)', 'Item-Item (Adjusted Cosine)'], [user_rmse, item_rmse], color=['blue', 'green'])
plt.title("RMSE Comparison of Collaborative Filtering Models")
plt.ylabel("RMSE")
plt.show()


# 7. Summary Output

In [None]:
print("\n--- Summary ---")
print(f"User-User RMSE (Mean-Centered Cosine): {user_rmse:.4f}")
print(f"Item-Item RMSE (Adjusted Cosine): {item_rmse:.4f}")
print("Content-Based filtering used L2-normalized cosine similarity on genre vectors.")
print("User-user filtering used cosine similarity on mean-centered ratings.")
print("Item-item filtering used adjusted cosine similarity with user-centered item vectors.")
