# **Movie Recommendation System Project**


**📊 Project Overview**

I built a movie recommendation system that compares user-based collaborative filtering, item-based collaborative filtering, and matrix factorization (SVD). It uses the MovieLens 100K ratings (`u.data`) and movie titles (`u.item`) to generate personalized recommendations from each user's rating history.

**🛠️ Technical Implementation**

**Data Processing**
- Loaded and merged user ratings data with movie metadata
- Created a user-item matrix of 943 users by 1664 distinct movie titles (the matrix is keyed by title, not by movie ID)
- Filled unrated entries with zeros

**Recommendation Algorithms Implemented:**

*User-Based Collaborative Filtering*
- Computed cosine similarity between users
- Generated recommendations based on similar users' preferences
- Achieved Precision@10: 0.0506
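
To make the similarity step concrete, here is a minimal pure-Python sketch of cosine similarity between two users' rating vectors. The three users and their ratings are made up for illustration, and the helper name `cosine_sim` is mine; the notebook itself uses scikit-learn's `cosine_similarity` on the full matrix.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a user with no ratings is treated as similar to nobody
    return dot / (norm_u * norm_v)

# Hypothetical ratings for three users over four movies (0 = not rated)
alice = [5, 4, 0, 1]
bob = [4, 5, 0, 0]
carol = [0, 1, 5, 4]

sim_alice_bob = cosine_sim(alice, bob)      # high: they rate the same movies alike
sim_alice_carol = cosine_sim(alice, carol)  # low: their tastes barely overlap
```

Because unrated movies are stored as zeros, two users only accumulate similarity on movies they have both rated, while every rated movie still adds to the vector norms.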

*Item-Based Collaborative Filtering*
- Calculated item similarity matrix
- Recommended movies similar to those users have liked
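- Reported Precision@10: 0.0506; this matches the user-based score exactly because the recorded run scored the user-based recommendations, so it is not an independent item-based result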
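
One common way to turn an item similarity matrix into recommendations is to score each unseen movie by a similarity-weighted sum of the user's own ratings. The sketch below shows that idea on a tiny made-up matrix; the movie labels, ratings, and function names are hypothetical, and this is one possible scoring rule rather than the only one.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical user-item ratings: rows are users, columns are movies (0 = not rated)
movies = ["A", "B", "C", "D"]
ratings = [
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
]

def item_vector(j):
    """Column j of the matrix: every user's rating of one movie."""
    return [row[j] for row in ratings]

def item_based_scores(user_row):
    """Score each unseen movie by similarity to the movies the user rated."""
    scores = {}
    for j, title in enumerate(movies):
        if user_row[j] != 0:
            continue  # skip movies the user has already rated
        scores[title] = sum(
            cosine_sim(item_vector(j), item_vector(i)) * r
            for i, r in enumerate(user_row)
            if r != 0
        )
    return scores

new_user = [5, 0, 0, 1]  # liked A, disliked D
scores = item_based_scores(new_user)
best = max(scores, key=scores.get)  # "B", which is rated much like A
```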

*Matrix Factorization (SVD)*
- Implemented Singular Value Decomposition for latent factor modeling
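- Precision@10: 0.5786, measured by ranking only the movies in each user's held-out test set, so it is not directly comparable to the collaborative filtering figures above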
- RMSE: 0.9621
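
For intuition, here is a sketch of the biased matrix factorization prediction rule, estimate = global mean + user bias + item bias + dot product of the latent factors, which is my understanding of how Surprise's `SVD` forms its estimates. All parameter values below are invented; in practice they are learned during training.

```python
# Hypothetical learned parameters (real values come from fitting the model)
global_mean = 3.5
user_bias = {"u1": -0.1}
item_bias = {"Heat (1995)": 0.4}
user_factors = {"u1": [0.2, -0.5]}
item_factors = {"Heat (1995)": [1.0, 0.4]}

def predict(user, item):
    """Estimate a rating; unknown users or items simply contribute nothing."""
    est = global_mean
    known_user = user in user_factors
    known_item = item in item_factors
    if known_user:
        est += user_bias[user]
    if known_item:
        est += item_bias[item]
    if known_user and known_item:
        est += sum(p * q for p, q in zip(user_factors[user], item_factors[item]))
    return est

known = predict("u1", "Heat (1995)")
# Two items the model has never seen get the same estimate for a given user
unknown_a = predict("u1", 12345)
unknown_b = predict("u1", 67890)
```

The last two calls illustrate why the item identifiers passed at prediction time must match the ones used for training: any unrecognized item falls back to the same user-level baseline.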

**Evaluation Metrics**
- Precision@K to measure recommendation quality
- RMSE for rating prediction accuracy
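
As a quick reference, both metrics can be sketched in a few lines of pure Python. The recommendation list, relevant set, and rating pairs are toy values chosen for illustration.

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

# Toy data: five ranked recommendations, three relevant items
recommended = ["A", "B", "C", "D", "E"]
relevant = {"B", "E", "Z"}
p_at_5 = precision_at_k(recommended, relevant, k=5)  # 2 hits out of 5

# Toy (true, estimated) rating pairs
error = rmse([(4, 3.5), (2, 3.0), (5, 4.5)])
```

Precision@K depends heavily on the pool of candidates being ranked, so scores are only comparable between models when both rank the same candidate set.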

**📈 Key Results**
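On the reported numbers the SVD model scores highest (Precision@10 of 0.5786 versus 0.0506), but the figures are not measured the same way: the collaborative filtering evaluation ranks unseen movies drawn from the neighbors' ratings, while the SVD evaluation ranks only the movies already in each user's test set. The gap therefore overstates the advantage of matrix factorization, and a like-for-like evaluation is needed to quantify it.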

**💡 Skills Demonstrated**
- Data preprocessing and feature engineering
- Collaborative filtering algorithms
- Matrix factorization techniques
- Model evaluation and validation
- Python, pandas, numpy, scikit-learn, scikit-surprise

In [None]:
import pandas as pd
import numpy as np

In [None]:
ratings = pd.read_csv("u.data", sep="\t", names=["user_id","movie_id","rating","timestamp"])
movies = pd.read_csv("u.item", sep="|", encoding="latin-1", header=None, usecols=[0,1], names=["movie_id","title"])


In [None]:
movies.head()

Unnamed: 0,movie_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [None]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [None]:
df = pd.merge(ratings, movies, on="movie_id")
df.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,title
0,196,242,3,881250949,Kolya (1996)
1,186,302,3,891717742,L.A. Confidential (1997)
2,22,377,1,878887116,Heavyweights (1994)
3,244,51,2,880606923,Legends of the Fall (1994)
4,166,346,1,886397596,Jackie Brown (1997)


In [None]:
# Build User–Item matrix (rows = users, cols = movies, values = ratings)
user_item_matrix = df.pivot_table(index='user_id', columns='title', values='rating')

# Replace NaN with 0
user_item_filled = user_item_matrix.fillna(0)

# Show shape of the filled matrix
print("User–Item Matrix (filled with 0) shape:", user_item_filled.shape)

# Display sample (first 5 users x 10 movies)
print("\nSample User–Item Matrix (filled with 0):")
print(user_item_filled.iloc[:5, :10])


User–Item Matrix (filled with 0) shape: (943, 1664)

Sample User–Item Matrix (filled with 0):
title    'Til There Was You (1997)  1-900 (1994)  101 Dalmatians (1996)  \
user_id                                                                   
1                              0.0           0.0                    2.0   
2                              0.0           0.0                    0.0   
3                              0.0           0.0                    0.0   
4                              0.0           0.0                    0.0   
5                              0.0           0.0                    2.0   

title    12 Angry Men (1957)  187 (1997)  2 Days in the Valley (1996)  \
user_id                                                                 
1                        5.0         0.0                          0.0   
2                        0.0         0.0                          0.0   
3                        0.0         2.0                          0.0   
4              

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

user_similarity = cosine_similarity(user_item_filled)

user_similarity_df = pd.DataFrame(
    user_similarity,
    index=user_item_filled.index,
    columns=user_item_filled.index
)


print("User–User Cosine Similarity (sample 10x10):")
print(user_similarity_df.iloc[:10, :10])


User–User Cosine Similarity (sample 10x10):
user_id        1         2         3         4         5         6         7   \
user_id                                                                         
1        1.000000  0.168937  0.048388  0.064561  0.379670  0.429682  0.443097   
2        0.168937  1.000000  0.113393  0.179694  0.073623  0.242106  0.108604   
3        0.048388  0.113393  1.000000  0.349781  0.021592  0.074018  0.067423   
4        0.064561  0.179694  0.349781  1.000000  0.031804  0.068431  0.091507   
5        0.379670  0.073623  0.021592  0.031804  1.000000  0.238636  0.374733   
6        0.429682  0.242106  0.074018  0.068431  0.238636  1.000000  0.493529   
7        0.443097  0.108604  0.067423  0.091507  0.374733  0.493529  1.000000   
8        0.320079  0.104257  0.084419  0.188060  0.248930  0.202514  0.285815   
9        0.078385  0.162470  0.062039  0.101284  0.056847  0.184997  0.146092   
10       0.377733  0.161273  0.066217  0.060859  0.201427  0.5548

In [None]:
# Function to recommend top-rated unseen movies for a given user
def recommend_movies(user_id, df, user_similarity_df, top_n=5, top_k=10):
    """
    Recommend up to top_k unseen movies, ranked by their mean rating
    among the top_n users most similar to user_id.
    """
    # Rank all other users by similarity to the target user
    similar_users = user_similarity_df[user_id].drop(index=user_id)
    similar_users = similar_users.sort_values(ascending=False)
    top_similar_users = similar_users.head(top_n).index.tolist()

    # Movies the target user has already rated
    movies_seen = set(df[df['user_id'] == user_id]['title'])

    # Ratings given by the neighbors, restricted to movies the user has not seen
    similar_users_ratings = df[df['user_id'].isin(top_similar_users)]
    candidate_movies = similar_users_ratings[~similar_users_ratings['title'].isin(movies_seen)]

    # Score each candidate by its mean rating among the neighbors
    movie_scores = candidate_movies.groupby('title')['rating'].mean()

    return movie_scores.sort_values(ascending=False).head(top_k)

recommendations = recommend_movies(1, df, user_similarity_df, top_n=5, top_k=10)
print("Top 10 Recommended Movies for User 1:\n")
print(recommendations)


Top 10 Recommended Movies for User 1:

title
Sophie's Choice (1982)                5.0
Walk in the Clouds, A (1995)          5.0
Hamlet (1996)                         5.0
It's a Wonderful Life (1946)          5.0
People vs. Larry Flynt, The (1996)    5.0
Stealing Beauty (1996)                5.0
Casablanca (1942)                     5.0
Emma (1996)                           5.0
Chinatown (1974)                      5.0
Titanic (1997)                        5.0
Name: rating, dtype: float64


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Split the dataset into Train and Test
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Build User–Item matrix from training set only
train_user_item = train_df.pivot_table(index='user_id', columns='title', values='rating').fillna(0)

# Compute user-user similarity using Cosine Similarity
user_similarity = cosine_similarity(train_user_item)
user_similarity_df = pd.DataFrame(user_similarity, index=train_user_item.index, columns=train_user_item.index)


# Function to calculate Precision@K for a single user
def precision_at_k(user_id, train_df, test_df, user_similarity_df, k=10, top_n=5):
    """
    Compute Precision@K for one user
    """
    # Get top-K recommended movies
    recommended = recommend_movies(user_id, train_df, user_similarity_df, top_n=top_n, top_k=k)
    recommended_movies = set(recommended.index)

    # Movies the user actually rated as "relevant" (e.g. rating >= 4) in the test set
    relevant_movies = set(test_df[(test_df['user_id']==user_id) & (test_df['rating'] >= 4)]['title'])

    # Count overlap (hits) between recommended and relevant movies
    hits = recommended_movies.intersection(relevant_movies)

    # Precision = hits / K
    if len(recommended_movies) > 0:
        precision = len(hits) / k
    else:
        precision = 0.0

    return precision


# Function to evaluate the whole system (average precision across users)
def evaluate_system(train_df, test_df, user_similarity_df, k=10, top_n=5):
    user_ids = test_df['user_id'].unique()
    precisions = []
    for user in user_ids:
        p = precision_at_k(user, train_df, test_df, user_similarity_df, k=k, top_n=top_n)
        precisions.append(p)
    return sum(precisions)/len(precisions)


# Example: Evaluate Precision@10
precision_score = evaluate_system(train_df, test_df, user_similarity_df, k=10, top_n=5)
print(f"Average Precision@10: {precision_score:.4f}")


Average Precision@10: 0.0506


# Implement item-based collaborative filtering

In [None]:
# Precision@K for Item-Based CF
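def precision_at_k_item(user_id, train_matrix, test_df, item_similarity_df, k=10):
    """
    Compute Precision@K for one user using Item-Based CF
    """
    # A user with no training ratings has nothing to base recommendations on
    if user_id not in train_matrix.index:
        return 0.0

    # Score each movie by its similarity to the movies the user rated,
    # weighted by those ratings (unrated movies are 0 and contribute nothing).
    # The 0.0506 recorded below came from a run that called the user-based
    # recommend_movies() here instead, so it is not an item-based score.
    user_ratings = train_matrix.loc[user_id]
    scores = item_similarity_df.dot(user_ratings)

    # Keep only movies the user has not rated in training, then take the top K
    scores = scores[user_ratings == 0]
    recommended_movies = set(scores.sort_values(ascending=False).head(k).index)

    # Relevant movies = movies rated >= 4 in the test set
    relevant_movies = set(test_df[(test_df['user_id'] == user_id) & (test_df['rating'] >= 4)]['title'])

    # Hits = intersection between recommended and relevant
    hits = recommended_movies.intersection(relevant_movies)

    # Precision = hits / K
    precision = len(hits) / k if k > 0 else 0.0
    return precision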


# Evaluate Item-Based CF across all users
def evaluate_item_based(train_df, test_df, k=10):
    # Build train user-item matrix
    train_user_item = train_df.pivot_table(index='user_id', columns='title', values='rating').fillna(0)

    # Compute item similarity
    item_similarity = cosine_similarity(train_user_item.T)
    item_similarity_df = pd.DataFrame(item_similarity, index=train_user_item.columns, columns=train_user_item.columns)

    user_ids = test_df['user_id'].unique()
    precisions = []
    for user in user_ids:
        p = precision_at_k_item(user, train_user_item, test_df, item_similarity_df, k=k)
        precisions.append(p)

    return sum(precisions) / len(precisions)


# Example: Evaluate Precision@10 for Item-Based CF
precision_item = evaluate_item_based(train_df, test_df, k=10)
print(f"Average Precision@10 (Item-Based CF): {precision_item:.4f}")


Average Precision@10 (Item-Based CF): 0.0506


# Try matrix factorization (SVD)

In [None]:
!pip install numpy==1.26.4
!pip install scikit-surprise --no-cache-dir




In [None]:
# Install surprise if not available
# !pip install scikit-surprise

from surprise import SVD, Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy

# Prepare dataset for Surprise
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'title', 'rating']], reader)

# Train-test split
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Initialize SVD model
svd = SVD(n_epochs=50, lr_all=0.005, reg_all=0.02)
svd.fit(trainset)

# Make predictions
predictions = svd.test(testset)

# Evaluate with RMSE
rmse = accuracy.rmse(predictions)

# Precision@K
def precision_at_k_svd(predictions, k=10, threshold=4.0):
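    """
    Compute Precision@K for SVD predictions.
    Ranks only the items in each user's test set; users with fewer than
    k test ratings are still divided by k.
    threshold = true rating at or above which an item counts as relevant
    """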
    # Map user -> list of predictions
    from collections import defaultdict
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = []
    for uid, user_ratings in user_est_true.items():
        # Sort by estimated rating
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        top_k = user_ratings[:k]

        # Hits = relevant among top_k
        relevant = sum((true_r >= threshold) for (_, true_r) in top_k)
        precisions.append(relevant / k)

    return sum(precisions) / len(precisions)


# Evaluate Precision@10
precision_svd = precision_at_k_svd(predictions, k=10, threshold=4.0)
print(f"Average Precision@10 (SVD): {precision_svd:.4f}")


RMSE: 0.9621
Average Precision@10 (SVD): 0.5786


In [None]:
def recommend_movies_for_user(user_id, model, movies_df, ratings_df, num_recommendations):
    """
    Recommend movies for a given user using a trained Surprise model (e.g., SVD).

    user_id            : ID of the target user
    model              : Trained Surprise model (e.g., SVD), fitted with titles as item IDs
    movies_df          : DataFrame with movie_id, title
    ratings_df         : DataFrame with user_id, movie_id, rating
    num_recommendations: Number of movies to recommend
    """

    # Titles the user has already rated
    rated_movie_ids = ratings_df.loc[ratings_df['user_id'] == user_id, 'movie_id']
    rated_titles = set(movies_df.loc[movies_df['movie_id'].isin(rated_movie_ids), 'title'])

    # Titles the user hasn't rated yet
    titles_to_predict = [t for t in movies_df['title'].unique() if t not in rated_titles]

    # Predict by title: the model was fitted on (user_id, title, rating), so a
    # movie_id is an unknown item to it and every movie would receive the same
    # baseline estimate (the constant 3.434088 in the output recorded below)
    predictions = [model.predict(user_id, title) for title in titles_to_predict]

    # Sort by estimated rating in descending order
    predictions.sort(key=lambda x: x.est, reverse=True)

    # Top N recommendations
    top_recommendations = predictions[:num_recommendations]

    # Build a DataFrame of titles and predicted ratings
    return pd.DataFrame(
        [(pred.iid, pred.est) for pred in top_recommendations],
        columns=['title', 'predicted_rating']
    )


In [None]:
user_id = 1
recommended_movies = recommend_movies_for_user(user_id, svd, movies, ratings, num_recommendations=10)
print("🎬 Top 10 Recommended Movies for User 1:\n")
print(recommended_movies)


🎬 Top 10 Recommended Movies for User 1:

                                             title  predicted_rating
0                                      Heat (1995)          3.434088
1                                   Sabrina (1995)          3.434088
2                     Sense and Sensibility (1995)          3.434088
3                         Leaving Las Vegas (1995)          3.434088
4                               Restoration (1995)          3.434088
5                              Bed of Roses (1996)          3.434088
6  Once Upon a Time... When We Were Colored (1995)          3.434088
7                     Up Close and Personal (1996)          3.434088
8                           River Wild, The (1994)          3.434088
9                           Time to Kill, A (1996)          3.434088
