<a href="https://colab.research.google.com/github/hawa1983/DATA-612/blob/main/Project_2_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Import Required Libraries

In [1]:
# This section imports all necessary libraries for data processing, similarity computation, evaluation, and visualization.
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import normalize
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split


# 2. Load and Preprocess Data

- Downloads smaller, pre-filtered versions of the ratings and movies datasets from GitHub.
- These files contain fewer rows and are easier to work with in Colab (won’t crash memory).
- pd.read_csv() loads them into DataFrames named ratings and movies.

In [2]:
# Step 1: Load subset datasets
!wget -q https://raw.githubusercontent.com/hawa1983/DATA-612/refs/heads/main/movies_subset.csv
!wget -q https://raw.githubusercontent.com/hawa1983/DATA-612/refs/heads/main/ratings_subset.csv

ratings = pd.read_csv("ratings_subset.csv")
movies = pd.read_csv("movies_subset.csv")

# Step 2: Convert genres to multi-hot encoded format
# This block prepares genre data for content-based filtering.
# genres_list: Converts the genre string (e.g., 'Action|Adventure') into a Python list.
# all_genres: Builds a sorted list of all unique genres in the dataset.
# The loop creates a new column for each genre (multi-hot encoding):
# If a movie has that genre, it gets a 1, else 0.

movies['genres'] = movies['genres'].fillna('')
movies['genres_list'] = movies['genres'].apply(lambda x: x.split('|'))

all_genres = sorted(set(genre for sublist in movies['genres_list'] for genre in sublist))
for genre in all_genres:
    movies[genre] = movies['genres_list'].apply(lambda x: 1 if genre in x else 0)

# Step 3: Merge movie features with ratings
# Merges the processed movies DataFrame (now with genre vectors) with ratings.
# This results in movie_data, a dataset where each row contains:
    ## The user ID
    ## The movie's genre indicators (1s and 0s)
    ## The rating the user gave that movie

movie_data = pd.merge(movies.drop(columns=['genres_list']), ratings, on='movieId')

print("Shape of merged dataset:", movie_data.shape)
print(movie_data.head())


Shape of merged dataset: (100000, 26)
   movieId             title                                       genres  \
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   

   (no genres listed)  Action  Adventure  Animation  Children  Comedy  Crime  \
0                   0       0          1          1         1       1      0   
1                   0       0          1          1         1       1      0   
2                   0       0          1          1         1       1      0   
3                   0       0          1          1         1       1      0   
4                   0       0          1          1         1       1      0   

   ...  Musical  M

# 3. Content-Based Filtering Using Genre Vectors and Cosine Similarity

This code implements a **content-based recommender system** using movie genres. Each movie is represented as a binary (multi-hot) vector based on its associated genres (e.g., Action, Comedy, Drama). The steps include:

* Normalizing the genre vectors using **L2 norm** so that each vector has unit length.
* Calculating **cosine similarity** between movie vectors to measure how similar their genre compositions are.
* Creating a function that, given a movie title, returns the top-N most similar movies (excluding itself) based purely on genre similarity.

This technique does not rely on user ratings — instead, it recommends items that are similar in content (genre) to a given movie.

This code implements a **non-personalized content-based recommender system** using only movie genres. It does **not use user ratings or preferences**. Instead, it recommends movies that are **similar in genre** to a specified movie.

#### How It Works:

* Each movie is represented as a binary (multi-hot encoded) vector across genres (e.g., Action, Comedy, Drama).
* These vectors are **L2-normalized** so that all movies lie on a unit hypersphere — making **cosine similarity** an effective way to measure closeness.
* Given a movie title, the model:

  * Finds its genre vector.
  * Computes cosine similarity to all other movies.
  * Returns the top-N most similar movies (excluding itself).

#### What It Does Not Do:

* It does **not use any user data** (no `userId`, no ratings).
* There is **no personalization**. All users will get the same recommendations for a given movie.

#### Best Use Case:

This type of model is ideal when:

* You have **no user data** (cold start).
* You want to recommend movies **based on content alone** (e.g., genre-based similarity).
* You’re building a basic recommender system that can later be enhanced with collaborative filtering or hybrid techniques.




In [3]:
# --- Imports ---
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity
import random

# --- Step 1: Use unique movies for similarity computation ---
unique_movies = movies.copy().reset_index(drop=True)

# --- Step 2: Normalize genre matrix ---
genre_cols = all_genres  # Assumes 'all_genres' is your list of genre columns
genre_matrix = unique_movies[genre_cols].values
genre_matrix_normalized = normalize(genre_matrix, norm='l2')

# --- Step 3: Create title-to-index mapping ---
movie_idx = pd.Series(unique_movies.index, index=unique_movies['title']).drop_duplicates()

# --- Step 4: Recommendation Function Based on Genre Similarity ---
def get_recommendations(title, topN=20):
    if title not in movie_idx:
        return f"Movie '{title}' not found in dataset."

    idx = movie_idx[title]
    query_vector = genre_matrix_normalized[idx].reshape(1, -1)
    sim_scores = cosine_similarity(query_vector, genre_matrix_normalized)[0]

    # Rank and filter out the movie itself
    sim_scores = list(enumerate(sim_scores))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:topN+1]

    # Output as list of (title, similarity score)
    recommendations = [(unique_movies['title'][i], score) for i, score in sim_scores]
    return recommendations

# --- Step 5: Use Fixed Target Movie ---
target_title = 'O.J.: Made in America (2016)'

# Print explanation
print("\nContent-based Recommendations using GENRE similarity (cosine distance):")

# Run recommendation
if target_title in movie_idx:
    print(f"\nTop 20 Movies Most Similar in Genre to '{target_title}':")
    for title, sim in get_recommendations(target_title):
        print(f"{title:<45} Similarity: {sim:.4f}")
else:
    print(f"Movie '{target_title}' not found in the dataset.")



Content-based Recommendations using GENRE similarity (cosine distance):

Top 20 Movies Most Similar in Genre to 'O.J.: Made in America (2016)':
Catwalk (1996)                                Similarity: 1.0000
Anne Frank Remembered (1995)                  Similarity: 1.0000
Man of the Year (1995)                        Similarity: 1.0000
Crumb (1994)                                  Similarity: 1.0000
Unzipped (1995)                               Similarity: 1.0000
Hoop Dreams (1994)                            Similarity: 1.0000
Wonderful, Horrible Life of Leni Riefenstahl, The (Macht der Bilder: Leni Riefenstahl, Die) (1993) Similarity: 1.0000
War Room, The (1993)                          Similarity: 1.0000
Celluloid Closet, The (1995)                  Similarity: 1.0000
Haunted World of Edward D. Wood Jr., The (1996) Similarity: 1.0000
Maya Lin: A Strong Clear Vision (1994)        Similarity: 1.0000
Synthetic Pleasures (1995)                    Similarity: 1.0000
Microcosmos (Microco

In [4]:
# Imports tools for normalizing feature vectors and computing similarity between them.
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity
import random

# Use only unique movie rows for similarity matrix
# Copies the movies DataFrame and resets the index to ensure each movie is uniquely indexed.
unique_movies = movies.copy().reset_index(drop=True)

# Normalize genre matrix
# Extracts the genre vectors for each movie (multi-hot encoded).
# Applies L2 normalization so that all genre vectors have a length of 1 (helps with cosine similarity).
genre_cols = all_genres
genre_matrix = unique_movies[genre_cols].values
genre_matrix_normalized = normalize(genre_matrix, norm='l2')

# Create title-to-index map for unique movies
# Creates a dictionary-like mapping from movie titles to their corresponding row index — used to look up vector positions.
movie_idx = pd.Series(unique_movies.index, index=unique_movies['title']).drop_duplicates()

# Define function to get recommendations
# Defines a function that takes a movie title and returns the top N most similar movies.
def get_recommendations(title, topN=20):
    if title not in movie_idx:
        return f"Movie '{title}' not found in dataset."

    idx = movie_idx[title]
    query_vector = genre_matrix_normalized[idx].reshape(1, -1)
    sim_scores = cosine_similarity(query_vector, genre_matrix_normalized)[0]

    # Enumerate and sort scores, excluding the movie itself
    sim_scores = list(enumerate(sim_scores))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:topN+1]

    # Format output: list of (title, similarity) tuples
    recommendations = [(unique_movies['title'][i], score) for i, score in sim_scores]
    return recommendations


# Sample 20 titles
print("Available sample titles:")
print(unique_movies['title'].sample(20, random_state=41).to_list())

# Randomly select a movie title from the available titles
random_title = random.choice(unique_movies['title'].to_list())

print(f"\n Randomly selected movie for recommendation: '{random_title}'")

# Explanation
print("\n Content-based Recommendations are based on GENRE similarity using cosine similarity between genre vectors.")

# Get recommendations
print(f"\nTop 20 Movies Most Similar in Genre to '{random_title}':")
for title, sim in get_recommendations(random_title):
    print(f"{title:<45} Similarity: {sim:.4f}")






Available sample titles:
['Atomica (2017)', 'Adventures in Babysitting (1987)', 'Big Picture, The (1989)', 'Annabelle (2014)', 'Beau Is Afraid (2023)', 'First Blood (Rambo: First Blood) (1982)', 'The Meg (2018)', 'Little Nemo: Adventures in Slumberland (1992)', 'Amen. (2002)', 'Danger: Diabolik (Diabolik) (1968)', 'Bugsy Malone (1976)', 'The Good Dinosaur (2015)', 'Goofy Movie, A (1995)', 'Man Called Horse, A (1970)', 'Terms and Conditions May Apply (2013)', 'StageFright: Aquarius (1987)', 'Shanghai Dreams (Qing hong) (2005)', 'I, Daniel Blake (2016)', "Amores Perros (Love's a Bitch) (2000)", "Cookie's Fortune (1999)"]

 Randomly selected movie for recommendation: 'Brave One, The (2007)'

 Content-based Recommendations are based on GENRE similarity using cosine similarity between genre vectors.

Top 20 Movies Most Similar in Genre to 'Brave One, The (2007)':
Amateur (1994)                                Similarity: 1.0000
Kiss of Death (1995)                          Similarity: 1.0000

# Hybrid Content-Based Rating Prediction Using Genre Similarity, User Behavior, and Fallback Handling

This code demonstrates a *hybrid recommendation system* that combines **content-based filtering using genre similarity** with **collaborative filtering using user-specific ratings**. The objective is to predict how much a user will like a movie they've never seen, based on the genres of that movie and their past rating behavior.

The prediction process incorporates a **fallback mechanism** and **debug printouts** to gracefully handle edge cases where standard hybrid predictions aren’t possible. These cases include users with no rating history, movies not present in the similarity matrix, or when no meaningful similarity is found.

#### How it Works:

1. **Genre Vector Normalization**:

   * The genre columns are multi-hot encoded (e.g., Action, Comedy, etc.).
   * Each movie’s genre vector is normalized using L2 norm so that cosine similarity is well-defined and scale-invariant.

2. **Genre-Based Similarity Matrix**:

   * Cosine similarity is computed between all pairs of movies based on genre vectors.

3. **Mapping Setup**:

   * The code builds lookup maps between `movieId` and its corresponding row index in the genre matrix to allow fast access.

4. **Hybrid Prediction Function**:

   * For a given `user_id` and `movie_id`, the function:

     * Retrieves all movies rated by the user.
     * Finds the top-K rated movies that are most genre-similar to the target movie.
     * Computes a **weighted average of the ratings**, where the weights are the genre similarity scores.
     * **If no such ratings or similarities are available**, the function **falls back to the global average rating of the movie**.
     * Each fallback trigger is logged with a `[Debug]` message.

5. **Application**:

   * The model is tested on a sample user and generates predicted ratings for movies that are most similar in genre to a reference movie (e.g., *Heat (1995)*).

This hybrid approach offers:

* Personalization from collaborative filtering.
* Interpretability from content-based features (genres).
* Robustness from fallback logic to handle cold-starts or sparse data situations.


In [5]:
# --- Required Libraries ---
import pandas as pd
import numpy as np
import random
from sklearn.preprocessing import MultiLabelBinarizer, normalize
from sklearn.metrics.pairwise import cosine_similarity

# --- Step 1: Keep All Ratings (No User Filtering) ---
ratings_filtered = ratings.copy()

# Build user-movie matrix
user_movie_matrix = ratings_filtered.pivot(index='userId', columns='movieId', values='rating')

# --- Step 2: Prepare Genre Matrix ---
# Convert genre strings to lists
movies['genres'] = movies['genres'].apply(lambda x: x.split('|') if isinstance(x, str) else [])

# Filter only movies present in ratings
valid_movie_ids = user_movie_matrix.columns
movies_filtered = movies[movies['movieId'].isin(valid_movie_ids)].copy()

# One-hot encode genres
mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(movies_filtered['genres'])
genre_matrix_normalized = normalize(genre_matrix, norm='l2')

# Create mappings
unique_movies = movies_filtered.reset_index(drop=True)
movieId_to_index = dict(zip(unique_movies['movieId'], unique_movies.index))
index_to_movieId = dict(zip(unique_movies.index, unique_movies['movieId']))
movie_idx = pd.Series(unique_movies.index, index=unique_movies['title']).drop_duplicates()

# --- Step 3: Compute Cosine Similarity Between Genre Vectors ---
genre_sim_matrix = cosine_similarity(genre_matrix_normalized)

# --- Step 4: Hybrid Prediction Function with Fallbacks and Recommendation Message ---
def predict_rating_genre_weighted(user_id, target_movie_id, k=10):
    if user_id not in user_movie_matrix.index or target_movie_id not in movieId_to_index:
        print(f"[Debug] Invalid user_id {user_id} or movie_id {target_movie_id}. Returning NaN.")
        return np.nan

    user_ratings = user_movie_matrix.loc[user_id].dropna()
    if user_ratings.empty:
        print(f"[Fallback] User {user_id} has no ratings. Using global average for movieId {target_movie_id}.")
        pred = ratings_filtered[ratings_filtered['movieId'] == target_movie_id]['rating'].mean()
        print(f"[Recommendation] Predicted Rating: {pred:.2f} → {'Recommend' if pred >= 3.5 else 'Not Recommended'}")
        return pred

    target_idx = movieId_to_index[target_movie_id]
    rated_movie_indices = [movieId_to_index[mid] for mid in user_ratings.index if mid in movieId_to_index]

    if not rated_movie_indices:
        print(f"[Fallback] Rated movies not found for user {user_id}. Using global average.")
        pred = ratings_filtered[ratings_filtered['movieId'] == target_movie_id]['rating'].mean()
        print(f"[Recommendation] Predicted Rating: {pred:.2f} → {'Recommend' if pred >= 3.5 else 'Not Recommended'}")
        return pred

    sims = genre_sim_matrix[target_idx, rated_movie_indices]
    sims_series = pd.Series(sims, index=[index_to_movieId[i] for i in rated_movie_indices])

    top_similar = sims_series.sort_values(ascending=False).head(k)
    top_ratings = user_ratings[top_similar.index]

    # Debug logs for inspection
    print(f"\n[Debug] Similarity Weights for User {user_id} on Target Movie {target_movie_id}:")
    print(top_similar)
    print("[Debug] Corresponding Ratings:")
    print(top_ratings)

    weighted_sum = np.dot(top_similar.values, top_ratings.values)
    normalization = np.sum(top_similar.values)

    if normalization > 0:
        pred = weighted_sum / normalization
        print(f"[Prediction] Personalized prediction used for user {user_id} on movieId {target_movie_id}.")
        print(f"[Recommendation] Predicted Rating: {pred:.2f} → {'Recommend' if pred >= 3.5 else 'Not Recommended'}")
        return pred
    else:
        print(f"[Fallback] No similarity weights found. Using global average for movieId {target_movie_id}.")
        pred = ratings_filtered[ratings_filtered['movieId'] == target_movie_id]['rating'].mean()
        print(f"[Recommendation] Predicted Rating: {pred:.2f} → {'Recommend' if pred >= 3.5 else 'Not Recommended'}")
        return pred

# --- Step 5: Run with Fixed User and Movie ---

# Set static user and movie
user_id = 174949
target_movie = 'O.J.: Made in America (2016)'

print(f"Using user {user_id} for prediction.")
print(f"Target movie exists: '{target_movie}' →", target_movie in movie_idx)

if target_movie in movie_idx:
    idx = movie_idx[target_movie]
    sim_scores = cosine_similarity(genre_matrix_normalized[idx].reshape(1, -1), genre_matrix_normalized)[0]
    sim_indices = np.argsort(sim_scores)[::-1][1:11]  # Exclude the movie itself

    top_similar_movie_ids = unique_movies.loc[sim_indices, 'movieId']
    top_similar_titles = unique_movies.loc[sim_indices, 'title']

    print(f"\nTop 10 Genre-Similar Movies to '{target_movie}':")
    print(top_similar_titles)

    print(f"\nPredicted Ratings for User {user_id} Using Hybrid Genre-Based Model:\n")
    for movie_id, title in zip(top_similar_movie_ids, top_similar_titles):
        pred = predict_rating_genre_weighted(user_id=user_id, target_movie_id=movie_id, k=100)
        print(f"{title:<45} Predicted Rating: {pred:.2f}")
else:
    print("Target movie not found in index.")


Using user 174949 for prediction.
Target movie exists: 'O.J.: Made in America (2016)' → True

Top 10 Genre-Similar Movies to 'O.J.: Made in America (2016)':
11172    Indiana Jones: The Search for the Lost Golden ...
98                                          Catwalk (1996)
104                           Anne Frank Remembered (1995)
118                                 Man of the Year (1995)
139                                           Crumb (1994)
177                                        Unzipped (1995)
213                                     Hoop Dreams (1994)
315      Wonderful, Horrible Life of Leni Riefenstahl, ...
480                                   War Room, The (1993)
496                           Celluloid Closet, The (1995)
Name: title, dtype: object

Predicted Ratings for User 174949 Using Hybrid Genre-Based Model:


[Debug] Similarity Weights for User 174949 on Target Movie 287443:
1207    0.0
2671    0.0
dtype: float64
[Debug] Corresponding Ratings:
1207    5.0
2671    

**Hybrid Recommender: Genre-Weighted Collaborative Filtering**

This code predicts a user's rating for a movie by combining collaborative filtering and genre-based similarity. Here's how it works:

### **Step-by-Step Explanation**

**1. Data Preparation**

* It loads the `ratings` and `movies` datasets.
* The user-movie ratings matrix is built using `.pivot()` (rows = users, columns = movies, values = ratings).
* Movie genres are split and one-hot encoded using `MultiLabelBinarizer`.
* Genre vectors are normalized to enable cosine similarity comparison.

**2. Genre Similarity Calculation**

* Cosine similarity is computed between normalized genre vectors of all movies.
* This generates a matrix showing how similar each pair of movies is based on genre.

**3. `hybrid_predict()` Function:**
This is the main prediction function. Here's what it does:

* **Step 1**: Skips invalid user/movie inputs.
* **Step 2**: Loops through all other users (excluding the target user).
* **Step 3**: For each user, checks if they rated the target movie.
* **Step 4**: Collects that user's other rated movies and looks up genre similarity between those and the target movie.
* **Step 5**: Uses a weighted average of the other user's ratings on similar movies, weighted by genre similarity.
* **Step 6**: Averages all such weighted predictions from other users to generate the final prediction.
* **Fallback**: If no useful ratings are found, it falls back to the global average rating for the movie.

**4. Prediction Execution**

* The code sets `user_id = 174949` and `target_movie = 'O.J.: Made in America (2016)'`.
* It retrieves the `movieId` and runs the `hybrid_predict()` function.
* Finally, it prints the predicted rating for that user and movie.

*This approach combines user behavior (collaborative filtering) with genre-based content similarity to improve prediction accuracy, especially for sparse data or cold-start problems.*


# 1. Vectorized NumPy Logic – Genre-Based Hybrid Prediction

In [8]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer, normalize
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: Load subset datasets
!wget -q https://raw.githubusercontent.com/hawa1983/DATA-612/refs/heads/main/movies_subset.csv
!wget -q https://raw.githubusercontent.com/hawa1983/DATA-612/refs/heads/main/ratings_subset.csv

ratings = pd.read_csv("ratings_subset.csv")
movies = pd.read_csv("movies_subset.csv")

# Step 1: Prepare Data
ratings_filtered = ratings.copy()
user_movie_matrix = ratings_filtered.pivot(index='userId', columns='movieId', values='rating')

movies['genres'] = movies['genres'].apply(lambda x: x.split('|') if isinstance(x, str) else [])
valid_movie_ids = user_movie_matrix.columns
movies_filtered = movies[movies['movieId'].isin(valid_movie_ids)].copy()

mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(movies_filtered['genres'])
genre_matrix_normalized = normalize(genre_matrix, norm='l2')

unique_movies = movies_filtered.reset_index(drop=True)
movieId_to_index = dict(zip(unique_movies['movieId'], unique_movies.index))
index_to_movieId = dict(zip(unique_movies.index, unique_movies['movieId']))
movie_idx = pd.Series(unique_movies.index, index=unique_movies['title']).drop_duplicates()

# Step 2: Vectorized Hybrid Prediction Function (Genre-only Weighted)
def vectorized_hybrid_predict(user_id, target_movie_id, k=10):
    if user_id not in user_movie_matrix.index or target_movie_id not in movieId_to_index:
        return np.nan

    target_idx = movieId_to_index[target_movie_id]
    sim_vector = cosine_similarity(genre_matrix_normalized[target_idx].reshape(1, -1), genre_matrix_normalized)[0]

    user_ratings = user_movie_matrix.loc[user_id].dropna()
    rated_movie_ids = user_ratings.index.intersection(user_movie_matrix.columns)
    rated_indices = [movieId_to_index[mid] for mid in rated_movie_ids if mid in movieId_to_index]

    sim_scores = sim_vector[rated_indices]
    ratings_values = user_ratings.loc[rated_movie_ids].values

    if len(sim_scores) == 0 or np.sum(sim_scores) == 0:
        return np.nan

    top_k_indices = np.argsort(sim_scores)[-k:]
    sim_top = sim_scores[top_k_indices]
    rating_top = ratings_values[top_k_indices]

    return np.dot(sim_top, rating_top) / np.sum(sim_top)

# Example Usage
# user_id = 174949
valid_user = None
for uid in user_movie_matrix.index:
    rated_movies = user_movie_matrix.loc[uid].dropna().index
    if rated_movies.intersection(movieId_to_index.keys()).any():
        valid_user = uid
        break  # Exit the loop immediately once a valid user is found

user_id = valid_user


target_movie = 'O.J.: Made in America (2016)'
target_movie_id = unique_movies.loc[movie_idx[target_movie], 'movieId']
pred = vectorized_hybrid_predict(user_id, target_movie_id, k=100)
print(f"Predicted rating for '{target_movie}' by user {user_id}: {pred:.2f}")


Predicted rating for 'O.J.: Made in America (2016)' by user 10: nan


# 2. Blended Hybrid (Genre + Collaborative Filtering) with Precomputed Hybrid Similarity

In [3]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer, normalize
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: Load subset datasets
!wget -q https://raw.githubusercontent.com/hawa1983/DATA-612/refs/heads/main/movies_subset.csv
!wget -q https://raw.githubusercontent.com/hawa1983/DATA-612/refs/heads/main/ratings_subset.csv

# Step 1: Prepare Data
ratings_filtered = ratings.copy()
user_movie_matrix = ratings_filtered.pivot(index='userId', columns='movieId', values='rating')

movies['genres'] = movies['genres'].apply(lambda x: x.split('|') if isinstance(x, str) else [])
valid_movie_ids = user_movie_matrix.columns
movies_filtered = movies[movies['movieId'].isin(valid_movie_ids)].copy()

mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(movies_filtered['genres'])
genre_matrix_normalized = normalize(genre_matrix, norm='l2')

unique_movies = movies_filtered.reset_index(drop=True)
movieId_to_index = dict(zip(unique_movies['movieId'], unique_movies.index))
index_to_movieId = dict(zip(unique_movies.index, unique_movies['movieId']))
movie_idx = pd.Series(unique_movies.index, index=unique_movies['title']).drop_duplicates()

# Step 2: Compute Hybrid Similarity Matrix
genre_sim = cosine_similarity(genre_matrix_normalized)
user_movie_centered = user_movie_matrix.sub(user_movie_matrix.mean(axis=1), axis=0).fillna(0)
item_sim = cosine_similarity(user_movie_centered.T.fillna(0))

# Ensure both matrices are same shape
alpha = 0.5  # genre-collab blend weight
hybrid_sim = alpha * genre_sim + (1 - alpha) * item_sim

# Step 3: Prediction Function Using Hybrid Similarity
def blended_hybrid_predict(user_id, target_movie_id, k=10):
    if user_id not in user_movie_matrix.index or target_movie_id not in movieId_to_index:
        return np.nan

    target_idx = movieId_to_index[target_movie_id]
    sim_vector = hybrid_sim[target_idx]

    user_ratings = user_movie_matrix.loc[user_id].dropna()
    rated_movie_ids = user_ratings.index.intersection(user_movie_matrix.columns)
    rated_indices = [movieId_to_index[mid] for mid in rated_movie_ids if mid in movieId_to_index]

    sim_scores = sim_vector[rated_indices]
    ratings_values = user_ratings.loc[rated_movie_ids].values

    if len(sim_scores) == 0 or np.sum(sim_scores) == 0:
        return np.nan

    top_k_indices = np.argsort(sim_scores)[-k:]
    sim_top = sim_scores[top_k_indices]
    rating_top = ratings_values[top_k_indices]

    return np.dot(sim_top, rating_top) / np.sum(sim_top)

# Example Usage
user_id = 174949
target_movie = 'O.J.: Made in America (2016)'
target_movie_id = unique_movies.loc[movie_idx[target_movie], 'movieId']
pred = blended_hybrid_predict(user_id, target_movie_id, k=100)
print(f"Predicted rating for '{target_movie}' by user {user_id}: {pred:.2f}")


ValueError: Found array with 0 feature(s) (shape=(11190, 0)) while a minimum of 1 is required by the normalize function.

## Comparison of Overlapping and Divergent Recommendations

Both methods returned several overlapping recommendations, but also differed in meaningful ways:

#### Similar Recommendations from Both Methods

* **Assassins (1995)**
* **Net, The (1995)**

These consistent suggestions indicate that both the pure content-based and hybrid genre-weighted models identify core genre traits effectively.

#### Recommendations Unique to Each Method

**Only in Content-Based (Cosine Genre Similarity):**

* *Die Hard (1988)*
* *Batman (1989)*
* *U.S. Marshals (1998)*

**Only in Hybrid Model (Genre + Ratings Fallback):**

* *Sin City: A Dame to Kill For (2014)*
* *John Wick: Chapter Two (2017)*
* *Transporter 2 (2005)*

These differences show that the hybrid method is able to introduce newer or slightly more nuanced genre matches, even when rating data for a specific user is missing and fallback mechanisms are triggered.


# Optimized Jaccard Similarity for Content-Based Filtering

This block introduces a more efficient method for computing **Jaccard similarity** between movies based on their genre information. Unlike the traditional nested-loop approach, this implementation uses the `pdist()` function from `scipy.spatial.distance` to compute all pairwise Jaccard distances in a **fully vectorized** manner. The result is a symmetric similarity matrix, which is then used to identify the most similar movies to a given title. This optimization drastically reduces computation time and is highly recommended for medium-to-large datasets.

using `scipy.spatial.distance.pdist()` **does calculate all pairwise similarities**, but it does so much more efficiently than a manual loop.

Here’s how it works:

* `pdist(binary_matrix, metric='jaccard')` computes the **Jaccard distance** (which is `1 - Jaccard similarity`) between **all unique pairs** of rows (i.e., movies) in the binary genre matrix.
* The output is a **condensed distance matrix** — a flat array containing the upper triangle of the full pairwise distance matrix.
* This condensed matrix is converted back into a full square **symmetric matrix** using `squareform()`, giving us the distance between all pairs.
* We then compute similarity as `1 - distance`.

Every possible movie-to-movie similarity is calculated — but with optimized vectorized operations under the hood, which is much faster than nested Python loops.



In [5]:
from scipy.spatial.distance import pdist, squareform
import numpy as np
import pandas as pd

# Step 1: Prepare genre binary matrix
unique_movies = movies.copy().reset_index(drop=True)
genre_cols = all_genres
genre_matrix = unique_movies[genre_cols].astype(bool).astype(int).values  # ensure binary format

# Step 2: Compute Jaccard distance (1 - similarity)
# pdist returns a condensed distance matrix; squareform converts it to square form
jaccard_distance = pdist(genre_matrix, metric='jaccard')  # returns 1 - Jaccard similarity
jaccard_sim_matrix = 1 - squareform(jaccard_distance)      # convert to full similarity matrix

# Step 3: Create mapping from title to matrix index
movie_idx = pd.Series(unique_movies.index, index=unique_movies['title']).drop_duplicates()

# Step 4: Define recommendation function
def get_recommendations_jaccard(title, topN=10):
    if title not in movie_idx:
        return f"Movie '{title}' not found in dataset."

    idx = movie_idx[title]
    sim_scores = list(enumerate(jaccard_sim_matrix[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:topN+1]  # exclude self
    top_indices = [i[0] for i in sim_scores]
    return unique_movies['title'].iloc[top_indices]

# Step 5: Try a sample movie
print("Sample titles:", unique_movies['title'].sample(5, random_state=42).to_list())
print(f"\nJaccard Recommendations for {random_title}:")
# print(get_recommendations_jaccard("Heat (1995)"))
print(get_recommendations_jaccard(random_title))


Sample titles: ['Murder on the Orient Express (2017)', 'Rhapsody in August (Hachi-gatsu no kyôshikyoku) (1991)', 'First Position (2011)', 'Wait Until Dark (1967)', 'Coffy (1973)']

Jaccard Recommendations for Youth (2015):
25                      Othello (1995)
30              Dangerous Minds (1995)
38     Cry, the Beloved Country (1995)
41                  Restoration (1995)
51                      Georgia (1995)
52        Home for the Holidays (1995)
57           Mr. Holland's Opus (1995)
62                     Two Bits (1995)
103           Margaret's Museum (1995)
108    Boys of St. Vincent, The (1992)
Name: title, dtype: object


# Comparison of Content-Based Recommendations: Cosine vs. Jaccard Similarity

Both the **cosine similarity** and **Jaccard similarity** methods returned *identical top-10 movie recommendations* for the query movie **"Heat (1995)"**. This indicates that in the context of the MovieLens genre-based content filtering:

* **Both methods effectively captured the same neighborhood of similar films**.
* The movies recommended (e.g., *Assassins*, *Die Hard*, *The Net*, *Natural Born Killers*) suggest that the genre combinations for these titles closely match those of *Heat (1995)*.
* While **cosine similarity** operates on normalized multi-hot vectors and measures angular proximity,
  **Jaccard similarity** measures the overlap in genre tags directly.

### Key Takeaway:

Despite their different mathematical underpinnings, both methods **produced the same results** because:

* The genre vectors are binary (multi-hot encoded), where normalization (in cosine) doesn’t distort information significantly.
* The dominant factor influencing similarity is the **overlap of genre labels**, which both metrics capture well.

However:

* **Cosine similarity is computationally faster** and more scalable.
* **Jaccard similarity is slower** when computed pairwise using loops, though vectorized solutions like `pdist()` improve it significantly.

You can safely use either in this binary genre context, but for large-scale systems, cosine is typically preferred for efficiency.


# 4. User-User Collaborative Filtering with Cosine Similarity (Memory-Efficient Version)

This block implements **User-User Collaborative Filtering** while addressing memory constraints in Google Colab. It includes two major optimizations:

* Limits the dataset to users with `userId <= 50,000` to keep the similarity matrix size manageable.
* Uses **cosine similarity** (instead of Pearson correlation), which is more memory-efficient and suitable for sparse rating data.

The code builds a **mean-centered user-item matrix**, computes cosine similarity between users, and defines a prediction function based on the top-k most similar users who rated a given movie. This approach estimates a target user’s rating by combining how similar users rated the same item, adjusted for their own average rating.


In [22]:
# --- Required Libraries ---
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import random

# Step 1: Load subset datasets
!wget -q https://raw.githubusercontent.com/hawa1983/DATA-612/refs/heads/main/movies_subset.csv
!wget -q https://raw.githubusercontent.com/hawa1983/DATA-612/refs/heads/main/ratings_subset.csv

ratings = pd.read_csv("ratings_subset.csv")
movies = pd.read_csv("movies_subset.csv")
print("Data loaded successfully.")

# Step 2: Reduce to a manageable number of users
sampled_user_ids = ratings['userId'].drop_duplicates().sample(n=10000) #, random_state=42)
ratings_small = ratings[ratings['userId'].isin(sampled_user_ids)]

print(f"Using {ratings_small['userId'].nunique()} users and {ratings_small['movieId'].nunique()} movies.")

# Step 3: Create user-movie matrix and normalize
user_movie_matrix = ratings_small.pivot(index='userId', columns='movieId', values='rating')
user_means = user_movie_matrix.mean(axis=1)
user_movie_centered = user_movie_matrix.sub(user_means, axis=0).fillna(0)
global_mean = ratings_small['rating'].mean()  # Fallback if user mean is unavailable

# Step 4: Compute cosine similarity between users
print("Computing user-user cosine similarity...")
user_similarity = cosine_similarity(user_movie_centered)
user_sim_df = pd.DataFrame(user_similarity, index=user_movie_matrix.index, columns=user_movie_matrix.index)
print("Similarity matrix created.")

# Step 5: Define prediction function with full fallback
def predict_user_user(user_id, movie_id, k=10):
    if user_id not in user_sim_df.index or movie_id not in user_movie_matrix.columns:
        print("User or movie not in dataset. Returning global mean.")
        return global_mean  # Fallback to global average

    sims = user_sim_df[user_id].drop(user_id)
    rated_users = user_movie_matrix[movie_id].dropna()
    sims = sims[sims.index.intersection(rated_users.index)]

    if sims.empty:
        print("No neighbors rated the movie. Returning user mean or global mean.")
        return user_means.get(user_id, global_mean)  # Fallback to user mean or global

    top_users = sims.sort_values(ascending=False).head(k)
    top_ratings = user_movie_matrix.loc[top_users.index, movie_id] - user_means[top_users.index]

    if top_users.sum() > 0:
        prediction = user_means[user_id] + np.dot(top_users, top_ratings) / top_users.sum()
    else:
        print("Top users sum is zero. Returning user mean or global mean.")
        prediction = user_means.get(user_id, global_mean)  # Fallback again

    return prediction

# Step 6: Find a suitable user-movie pair for prediction
def find_predictable_pair(k=10):
    for user_id in user_sim_df.index:
        sims = user_sim_df[user_id].drop(user_id).sort_values(ascending=False).head(k)
        neighbors = sims.index
        unrated_movies = user_movie_matrix.columns.difference(user_movie_matrix.loc[user_id].dropna().index)

        for movie_id in unrated_movies:
            neighbor_ratings = user_movie_matrix.loc[neighbors, movie_id].dropna()
            if len(neighbor_ratings) >= 1:
                return user_id, movie_id
    return None, None

# Step 7: Run the full prediction
user_id, movie_id = find_predictable_pair(k=10)

if user_id is not None and movie_id is not None:
    pred = predict_user_user(user_id, movie_id, k=10)
    movie_title = movies[movies['movieId'] == movie_id]['title'].values[0]
    print(f"\nPredicted rating for user {user_id} on movie '{movie_title}' (movieId {movie_id}): {pred:.2f}")
else:
    print("No suitable user-movie pair found for prediction.")

# Step 8: Test fallback logic using the same movieId but with a random user
def test_fallback_with_random_user_same_movie(movie_id, k=10):
    # Randomly select a user who has NOT rated the movie
    eligible_users = user_movie_matrix.index.difference(
        user_movie_matrix[movie_id].dropna().index
    )

    if eligible_users.empty:
        print("No eligible random users found for fallback test.")
        return

    random_user = random.choice(eligible_users.tolist())

    pred = predict_user_user(random_user, movie_id, k=k)
    movie_title = movies[movies['movieId'] == movie_id]['title'].values[0]
    print(f"\n[Fallback Test] Predicted rating for random user {random_user} on movie '{movie_title}' (movieId {movie_id}): {pred:.2f}")

# Call it with movie_id from Step 7
test_fallback_with_random_user_same_movie(movie_id, k=10)



Data loaded successfully.
Using 10000 users and 4997 movies.
Computing user-user cosine similarity...
Similarity matrix created.
Top users sum is zero. Returning user mean or global mean.

Predicted rating for user 18 on movie 'Judge Dredd (1995)' (movieId 173): 4.00
Top users sum is zero. Returning user mean or global mean.

[Fallback Test] Predicted rating for random user 145566 on movie 'Judge Dredd (1995)' (movieId 173): 3.50


### 4. User-User Collaborative Filtering with Cosine and Genre-Weighted Ratings

This block implements **User-User Collaborative Filtering** enhanced with **genre-aware weighting** to improve prediction accuracy. The design makes two major improvements over basic collaborative filtering:

* **Subset Sampling**: Randomly samples 10,000 users from the full ratings dataset to ensure scalability and performance within memory-constrained environments like Google Colab.
* **Genre-Based Weighting**: In addition to user similarity (via cosine similarity), the prediction also considers how similar the target movie is—based on genre—to movies rated by each neighbor. This reduces noise from irrelevant user ratings.

The process involves:

1. Building a **mean-centered user-movie matrix**.
2. Calculating **user-user cosine similarity**.
3. Creating a **normalized genre matrix** from the movies dataset.
4. For each prediction, retrieving the top-k most similar users who rated the movie, then:

   * Calculating how similar the movie is (by genre) to other movies each user has rated.
   * Averaging these genre similarities to derive a genre weight.
   * Multiplying genre weights by user similarity to weight each neighbor’s influence on the prediction.

This results in **genre-weighted collaborative filtering**, which makes predictions that are both socially and semantically relevant.


In [25]:
# --- Required Libraries ---
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MultiLabelBinarizer, normalize
import random

# Step 1: Load subset datasets
!wget -q https://raw.githubusercontent.com/hawa1983/DATA-612/refs/heads/main/movies_subset.csv
!wget -q https://raw.githubusercontent.com/hawa1983/DATA-612/refs/heads/main/ratings_subset.csv

ratings = pd.read_csv("ratings_subset.csv")
movies = pd.read_csv("movies_subset.csv")
print("Data loaded successfully.")

# Step 2: Sample 10,000 users to reduce memory footprint and speed up similarity calculations
sampled_user_ids = ratings['userId'].drop_duplicates().sample(n=10000) #, random_state=42)
ratings_small = ratings[ratings['userId'].isin(sampled_user_ids)]
print(f"Using {ratings_small['userId'].nunique()} users and {ratings_small['movieId'].nunique()} movies.")

# Step 3: Create the user-movie rating matrix and normalize by subtracting each user's average rating
user_movie_matrix = ratings_small.pivot(index='userId', columns='movieId', values='rating')
user_means = user_movie_matrix.mean(axis=1)
user_movie_centered = user_movie_matrix.sub(user_means, axis=0).fillna(0)

# Step 4: Compute cosine similarity between users using the normalized matrix
print("Computing user-user cosine similarity...")
user_similarity = cosine_similarity(user_movie_centered)
user_sim_df = pd.DataFrame(user_similarity, index=user_movie_matrix.index, columns=user_movie_matrix.index)
print("Similarity matrix created.")

# Step 5: Prepare the genre matrix from movie metadata
movies['genres'] = movies['genres'].apply(lambda x: x.split('|') if isinstance(x, str) else [])
movie_ids_in_matrix = user_movie_matrix.columns
movies_filtered = movies[movies['movieId'].isin(movie_ids_in_matrix)]

mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(movies_filtered['genres'])  # multi-hot encoding
genre_matrix_normalized = normalize(genre_matrix, norm='l2')  # L2-normalized for cosine similarity

movieId_to_index = dict(zip(movies_filtered['movieId'], range(len(movies_filtered))))  # fast lookup

# Step 6: Define prediction function that combines user similarity and genre similarity
def predict_user_user_genre_weighted(user_id, movie_id, k=10):
    # Skip prediction if user or movie not in dataset
    if user_id not in user_sim_df.index or movie_id not in user_movie_matrix.columns:
        return np.nan

    # Find similar users who have rated this movie
    sims = user_sim_df[user_id].drop(user_id)
    rated_users = user_movie_matrix[movie_id].dropna()
    sims = sims[sims.index.intersection(rated_users.index)]

    if sims.empty:
        return np.nan

    # Select top-k most similar users who rated the movie
    top_users = sims.sort_values(ascending=False).head(k)
    top_user_ids = top_users.index

    # Centered ratings from top-k users
    top_ratings = user_movie_matrix.loc[top_user_ids, movie_id] - user_means[top_user_ids]

    # Retrieve normalized genre vector of the target movie
    if movie_id not in movieId_to_index:
        return np.nan
    target_vec = genre_matrix_normalized[movieId_to_index[movie_id]]

    genre_weights = []
    for uid in top_user_ids:
        # Find all movies the similar user has rated
        user_rated_movies = user_movie_matrix.loc[uid].dropna().index

        # Convert those movieIds to genre matrix indices
        rated_indices = [movieId_to_index[mid] for mid in user_rated_movies if mid in movieId_to_index]

        # Compute cosine similarity between target movie genre and each rated movie genre
        if rated_indices:
            user_genre_sims = cosine_similarity(target_vec.reshape(1, -1), genre_matrix_normalized[rated_indices])[0]
            # Take the average genre similarity — this becomes the genre weight for this neighbor
            genre_weights.append(np.mean(user_genre_sims))
        else:
            genre_weights.append(0)

    genre_weights = np.array(genre_weights)

    # Final weights = user similarity × genre similarity
    combined_weights = top_users.values * genre_weights

    # Weighted prediction calculation
    if combined_weights.sum() > 0:
        prediction = user_means[user_id] + np.dot(combined_weights, top_ratings.values) / combined_weights.sum()
    else:
        prediction = np.nan

    return prediction

# Step 7: Helper function to find a user-movie pair where prediction is possible
def find_predictable_pair(k=10):
    for user_id in user_sim_df.index:
        sims = user_sim_df[user_id].drop(user_id).sort_values(ascending=False).head(k)
        neighbors = sims.index
        unrated_movies = user_movie_matrix.columns.difference(user_movie_matrix.loc[user_id].dropna().index)

        for movie_id in unrated_movies:
            neighbor_ratings = user_movie_matrix.loc[neighbors, movie_id].dropna()
            if len(neighbor_ratings) >= 1:
                return user_id, movie_id
    return None, None

# Step 8: Execute prediction
user_id, movie_id = find_predictable_pair(k=10)

if user_id is not None and movie_id is not None:
    pred = predict_user_user_genre_weighted(user_id, movie_id, k=10)
    movie_title = movies[movies['movieId'] == movie_id]['title'].values[0]
    print(f"\nPredicted rating for user {user_id} on movie '{movie_title}' (movieId {movie_id}): {pred:.2f}")
else:
    print("No suitable user-movie pair found for prediction.")


# Step 9: Test fallback logic using the same movieId but with a random user
def test_genre_weighted_fallback_same_movie(movie_id, k=10):
    # Find users who have NOT rated the selected movie
    eligible_users = user_movie_matrix.index.difference(
        user_movie_matrix[movie_id].dropna().index
    )

    if eligible_users.empty:
        print("No eligible users found for fallback test.")
        return

    # Pick a random eligible user
    random_user = random.choice(eligible_users.tolist())

    # Try genre-weighted prediction
    pred = predict_user_user_genre_weighted(random_user, movie_id, k=k)
    movie_title = movies[movies['movieId'] == movie_id]['title'].values[0]
    print(f"\n[Fallback Test] Predicted rating for random user {random_user} on movie '{movie_title}' (movieId {movie_id}): {pred:.2f}")

# Call it using the movie_id from Step 8
test_genre_weighted_fallback_same_movie(movie_id, k=10)



Data loaded successfully.
Using 10000 users and 4963 movies.
Computing user-user cosine similarity...
Similarity matrix created.

Predicted rating for user 33 on movie 'Outbreak (1995)' (movieId 292): nan

[Fallback Test] Predicted rating for random user 162867 on movie 'Outbreak (1995)' (movieId 292): nan


# 5. Item-Item Collaborative Filtering with Adjusted Cosine Similarity

This code implements **item-item collaborative filtering** using **adjusted cosine similarity**, which accounts for differences in users’ individual rating scales. The steps include:

* **Transposing** the user-item matrix so that movies are represented as rows.
* **Centering** each movie's vector by subtracting each user's mean rating — this adjustment ensures that the similarity metric reflects agreement in rating patterns, not absolute values.
* **Computing cosine similarity** between movie vectors to identify similar items.
* Defining a **prediction function** that estimates how a user would rate a target movie by aggregating their ratings for similar movies (weighted by similarity).

This memory-efficient approach allows personalized movie recommendations by looking at how similar a target movie is to other movies the user has already rated.

In [17]:
# Create user-movie matrix (assumes user_movie_matrix is already defined earlier)

# Transpose and center by user mean
# Transposes the matrix so that:
# Rows = movies (items)
# Columns = users
# This format is needed to compute item-item similarity.
movie_user_matrix = user_movie_matrix.T

# Center each row (movie) by subtracting the mean rating given by each user.
# This is known as adjusted cosine similarity — it accounts for differences in user rating scales.
movie_user_centered = movie_user_matrix.sub(user_movie_matrix.mean(axis=1), axis=1).fillna(0)

# Compute cosine similarity between items (movies)
# Cosine similarity is computed between each pair of movie vectors.
# The result is a similarity matrix where:
# Rows and columns = movieId
# Cell (i, j) = similarity between movie i and movie j
from sklearn.metrics import pairwise_distances
item_similarity = 1 - pairwise_distances(movie_user_centered, metric='cosine')
item_sim_df = pd.DataFrame(item_similarity, index=movie_user_matrix.index, columns=movie_user_matrix.index)

# Define prediction function
# Predicts how much user_id would rate movie_id using k nearest neighbors (top-k similar items).
def predict_item_item(user_id, movie_id, k=10):
    # If the movie is not in the matrix, return NaN
    if movie_id not in user_movie_matrix.columns:
        return np.nan

    # Get all movies that this user has rated
    user_ratings = user_movie_matrix.loc[user_id].dropna()

    # If user has not rated any other movies, return NaN
    if user_ratings.empty:
        return np.nan

    # Get similarity scores between the target movie and all movies the user has rated
    sims = item_sim_df.loc[movie_id, user_ratings.index]

    # Select the top-k most similar items
    top_items = sims.sort_values(ascending=False).head(k)

    # Get the user’s ratings for those items
    top_ratings = user_ratings[top_items.index]

    # Compute the weighted average rating, weighted by item similarity
    prediction = np.dot(top_items, top_ratings) / top_items.sum() if top_items.sum() > 0 else np.nan
    return prediction


# Item-Item Collaborative Filtering using Jaccard Similarity

This code implements item-item collaborative filtering by measuring the similarity between movies using the Jaccard index. Jaccard similarity is a metric that compares the similarity of two sets by dividing the size of their intersection by the size of their union. In this context, it is used to compare movies based on whether users have rated them or not—ignoring the actual rating values.

To apply Jaccard similarity:

1. The user-item ratings matrix is converted into a binary format (1 if a user rated a movie, 0 otherwise).
2. Each pair of movies is compared using Jaccard similarity, which is computed as the ratio of the number of users who rated both movies to the number of users who rated either movie.
3. A prediction function is defined to estimate how much a user would rate a given movie, based on the user’s ratings for the most similar movies, weighted by Jaccard similarity.

This method is particularly useful when the presence or absence of a rating is more important than the rating value itself. It also tends to be more memory-efficient in sparse datasets where many entries are missing.

In [18]:
# Required library for Jaccard similarity
from sklearn.metrics import jaccard_score

# ---- SUBSET THE DATASET TO AVOID MEMORY ISSUES ----

# Reduce to users with userId ≤ 50,000 for memory-efficient computation
ratings_small = ratings[ratings['userId'] <= 10000]

# Create user-movie matrix: rows = users, columns = movies, values = ratings
user_movie_matrix = ratings_small.pivot(index='userId', columns='movieId', values='rating')

# Step 1: Convert to Binary Rating Matrix
# Convert to binary: 1 if user rated a movie, 0 otherwise
# Transpose so that rows = movies, columns = users
movie_user_binary = user_movie_matrix.notna().astype(int).T

# Step 2: Compute Jaccard Similarity Matrix Between Movies
# Initialize empty DataFrame to store Jaccard similarities
movie_ids = movie_user_binary.index.tolist()
jaccard_sim_df = pd.DataFrame(index=movie_ids, columns=movie_ids, dtype=float)

# Compute Jaccard similarity for each pair of movies
# This may take time depending on how many movies are present
for i in range(len(movie_ids)):
    for j in range(i, len(movie_ids)):
        movie_i = movie_user_binary.loc[movie_ids[i]].values
        movie_j = movie_user_binary.loc[movie_ids[j]].values
        sim = jaccard_score(movie_i, movie_j)
        jaccard_sim_df.iloc[i, j] = sim
        jaccard_sim_df.iloc[j, i] = sim  # Ensure symmetry

# Step 3: Define Prediction Function Using Jaccard Similarity
# Predict rating for a given user and movie using top-k similar movies
def predict_item_item_jaccard(user_id, movie_id, k=10):
    # Check if movie exists in the matrix
    if movie_id not in user_movie_matrix.columns:
        return np.nan

    # Get movies the user has rated
    user_ratings = user_movie_matrix.loc[user_id].dropna()
    if user_ratings.empty:
        return np.nan

    # Get Jaccard similarity between target movie and movies the user has rated
    sims = jaccard_sim_df.loc[movie_id, user_ratings.index]

    # Select top-k similar movies
    top_items = sims.sort_values(ascending=False).head(k)
    top_ratings = user_ratings[top_items.index]

    # Weighted average of ratings using Jaccard similarity as weights
    prediction = np.dot(top_items, top_ratings) / top_items.sum() if top_items.sum() > 0 else np.nan
    return prediction





KeyboardInterrupt: 

# 6. Evaluation: RMSE Comparison

In [None]:
# Split data for evaluation
train, test = train_test_split(ratings, test_size=0.2, random_state=42)

# Predict ratings using both collaborative methods
user_preds = test.apply(lambda row: predict_user_user(row['userId'], row['movieId']), axis=1)
item_preds = test.apply(lambda row: predict_item_item(row['userId'], row['movieId']), axis=1)

# Calculate RMSE
user_rmse = np.sqrt(mean_squared_error(test['rating'].dropna(), user_preds.dropna()))
item_rmse = np.sqrt(mean_squared_error(test['rating'].dropna(), item_preds.dropna()))

# Plot RMSE comparison
plt.figure(figsize=(8, 4))
plt.bar(['User-User (Pearson)', 'Item-Item (Adjusted Cosine)'], [user_rmse, item_rmse], color=['blue', 'green'])
plt.title("RMSE Comparison of Collaborative Filtering Models")
plt.ylabel("RMSE")
plt.show()


# 7. Summary Output

In [None]:
print("\n--- Summary ---")
print(f"User-User RMSE (Pearson): {user_rmse:.4f}")
print(f"Item-Item RMSE (Adjusted Cosine): {item_rmse:.4f}")
print("Content-Based filtering used L2-normalized cosine similarity on genre vectors.")
print("User-user filtering used Pearson correlation and centered ratings.")
print("Item-item filtering used adjusted cosine similarity with user-centered item vectors.")
