# Task 5: Movie Recommendation System  

In this task, I built a **movie recommendation system** using the **MovieLens 100K dataset**.  
The system recommends movies based on **user similarity** and **item similarity**, and I also experimented with **matrix factorization (SVD)**.

Covered topics:  
- NumPy  
- pandas  
- scikit-learn  
- Recommendation Systems (User-Based CF, Item-Based CF, Matrix Factorization)


## Step 1: Importing Libraries
I imported the libraries needed for the recommendation system: `pandas` for data handling, `numpy` for numerical operations, `cosine_similarity` from scikit-learn for similarity calculations, `train_test_split` for splitting the data, and `TruncatedSVD` for matrix factorization.

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.decomposition import TruncatedSVD

## Step 2: Upload, Load, and Explore the Dataset
I downloaded the MovieLens 100K dataset from Kaggle.  
The archive contains multiple files inside an `ml-100k` folder.  
I uploaded the zip file to Google Colab and extracted its contents.  
For this task, I used `u.data` (ratings) and `u.item` (movie details).

In [2]:
from google.colab import files
import zipfile

# Upload the ml-100k.zip file from your laptop
uploaded = files.upload()

# Extract into a folder named "ml-100k"
with zipfile.ZipFile("ml-100k.zip", "r") as zip_ref:
    zip_ref.extractall("ml-100k")

Saving ml-100k.zip to ml-100k.zip


In [6]:
!ls ml-100k


ml-100k


In [7]:
ratings = pd.read_csv("ml-100k/ml-100k/u.data", sep="\t",
                      names=["user_id", "movie_id", "rating", "timestamp"])

movies = pd.read_csv("ml-100k/ml-100k/u.item", sep="|", encoding="latin-1", header=None,
                     names=["movie_id", "title"], usecols=[0, 1])

data = pd.merge(ratings, movies, on="movie_id")
data.head()


|   | user_id | movie_id | rating | timestamp | title |
|---|---------|----------|--------|-----------|-------|
| 0 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
| 1 | 186 | 302 | 3 | 891717742 | L.A. Confidential (1997) |
| 2 | 22 | 377 | 1 | 878887116 | Heavyweights (1994) |
| 3 | 244 | 51 | 2 | 880606923 | Legends of the Fall (1994) |
| 4 | 166 | 346 | 1 | 886397596 | Jackie Brown (1997) |


## Step 3: Split Data and Create Matrices
I split the dataset into training and test sets (80/20), then used `pivot_table` to build a user-item matrix for each set, with users as rows, movie IDs as columns, and ratings as values. Movies a user has not rated appear as `NaN`.

In [9]:
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Create user-item matrices using movie_id
train_matrix = train_data.pivot_table(index="user_id", columns="movie_id", values="rating")
test_matrix = test_data.pivot_table(index="user_id", columns="movie_id", values="rating")
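
To make the shape of these matrices concrete, here is a minimal sketch on a made-up four-row ratings table. The `toy` and `toy_matrix` names and all values are assumptions for illustration only and are not part of the MovieLens data.

```python
import pandas as pd

# Toy ratings in the same long format as u.data (values are made up).
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "movie_id": [10, 20, 10, 30],
    "rating":   [5, 3, 4, 2],
})

# Rows become users, columns become movies; unrated pairs are NaN.
toy_matrix = toy.pivot_table(index="user_id", columns="movie_id", values="rating")

print(toy_matrix.shape)                     # (3, 3)
print(int(toy_matrix.isna().sum().sum()))   # 5 of the 9 cells are unrated
```

The real matrices are sparse in the same way: most user-movie pairs have no rating.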

## Step 4: User-Based Collaborative Filtering
I computed user-to-user cosine similarity on the zero-filled training matrix, then defined a `recommend_movies` function that suggests the top-k unseen movies for a user, scored by a similarity-weighted average of other users' ratings.

In [10]:
user_similarity = cosine_similarity(train_matrix.fillna(0).to_numpy())  # fillna(0) already removes every NaN
user_similarity = pd.DataFrame(user_similarity, index=train_matrix.index, columns=train_matrix.index)
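
As a small illustration of what cosine similarity returns on zero-filled rating rows, the sketch below uses a made-up 3x3 matrix (`toy_ratings` is a hypothetical name). Two users with identical rows get similarity 1, and users with no movies in common get similarity 0.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows are users, columns are movies; 0 stands for "not rated" after fillna(0).
toy_ratings = np.array([
    [5.0, 0.0, 3.0],
    [5.0, 0.0, 3.0],   # identical to user 0
    [0.0, 4.0, 0.0],   # shares no rated movie with users 0 and 1
])

# Pairwise cosine similarity between all rows (users).
toy_sim = cosine_similarity(toy_ratings)
print(np.round(toy_sim, 2))
```

One design consequence worth keeping in mind: because missing ratings are encoded as 0, only co-rated movies contribute to the dot product, while every rated movie contributes to the vector norms.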

In [11]:
def recommend_movies(user_id, k=5):
    # Users absent from the training matrix have no similarity scores
    if user_id not in train_matrix.index:
        return []

    # Keep only neighbours with positive similarity, excluding the user themselves
    sim_scores = user_similarity[user_id].drop(user_id)
    sim_scores = sim_scores[sim_scores > 0]

    weighted_scores = {}
    sim_sums = {}

    # Accumulate similarity-weighted ratings for movies the user has not rated
    for neighbor, sim in sim_scores.items():
        neighbor_ratings = train_matrix.loc[neighbor]
        for movie, rating in neighbor_ratings.dropna().items():
            if pd.isna(train_matrix.loc[user_id, movie]):
                weighted_scores[movie] = weighted_scores.get(movie, 0) + sim * rating
                sim_sums[movie] = sim_sums.get(movie, 0) + sim

    # Predicted rating = weighted sum of ratings / sum of similarities
    predictions = {m: weighted_scores[m] / sim_sums[m] for m in weighted_scores if sim_sums[m] > 0}
    top_movies = sorted(predictions.items(), key=lambda x: x[1], reverse=True)[:k]
    # Map movie IDs back to titles (index built once instead of per movie)
    titles = movies.set_index("movie_id")["title"]
    recommendations = [(titles.loc[mid], score) for mid, score in top_movies]
    return recommendations
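
The prediction inside `recommend_movies` is a similarity-weighted average. The sketch below works through that arithmetic on hypothetical neighbours (the similarity and rating values are made up). It also shows that when only one neighbour has rated a movie, the similarity cancels out and the prediction equals that neighbour's rating, which is one way a score of exactly 5.0 can arise.

```python
# Hypothetical neighbours of one target user: (similarity, rating for the same movie).
neighbours = [(0.8, 5.0), (0.4, 2.0)]

weighted_sum = sum(sim * rating for sim, rating in neighbours)  # 0.8*5 + 0.4*2
sim_total = sum(sim for sim, _ in neighbours)                   # 0.8 + 0.4
prediction = weighted_sum / sim_total
print(round(prediction, 2))  # 4.0

# A single neighbour: the similarity cancels, leaving the neighbour's own rating.
single = [(0.05, 5.0)]
single_prediction = sum(s * r for s, r in single) / sum(s for s, _ in single)
print(round(single_prediction, 2))  # 5.0
```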

## Step 5: Evaluation Metrics
I defined `precision_at_k` and `recall_at_k` to evaluate recommendation quality. A hit is a recommended movie that the user rated 4 or higher in the test set. Precision divides the hits by k, and recall divides them by the number of relevant movies.

In [15]:
def precision_at_k(user_id, k=5):
    if user_id not in train_matrix.index or user_id not in test_matrix.index:  # avoid KeyError on test_matrix.loc
        return 0
    recommended = recommend_movies(user_id, k)
    if not recommended:
        return 0
    recommended_movies = [movies[movies["title"] == title]["movie_id"].values[0] for title, _ in recommended]
    relevant_movies = test_matrix.loc[user_id][test_matrix.loc[user_id] >= 4].dropna().index.tolist()
    if not relevant_movies:
        return 0
    hits = len(set(recommended_movies) & set(relevant_movies))
    return hits / k

def recall_at_k(user_id, k=5):
    if user_id not in train_matrix.index or user_id not in test_matrix.index:  # avoid KeyError on test_matrix.loc
        return 0
    recommended = recommend_movies(user_id, k)
    if not recommended:
        return 0
    recommended_movies = [movies[movies["title"] == title]["movie_id"].values[0] for title, _ in recommended]
    relevant_movies = test_matrix.loc[user_id][test_matrix.loc[user_id] >= 4].dropna().index.tolist()
    if not relevant_movies:
        return 0
    hits = len(set(recommended_movies) & set(relevant_movies))
    return hits / len(relevant_movies) if relevant_movies else 0
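
Both metrics reduce to a set intersection on movie IDs. The sketch below is a hypothetical helper (`precision_recall_at_k` is not part of the notebook) that works directly on IDs, which would avoid the title-to-ID lookup and any ambiguity if two `movie_id` values happened to share a title. The ID lists in the usage lines are made up.

```python
def precision_recall_at_k(recommended_ids, relevant_ids, k):
    """Return (precision@k, recall@k) for one user, working on movie IDs."""
    top_k = list(recommended_ids)[:k]
    relevant = set(relevant_ids)
    hits = len(set(top_k) & relevant)
    precision = hits / k
    # Recall is undefined without relevant items; report 0.0 in that case.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Two of the five recommendations appear among the four relevant movies.
p, r = precision_recall_at_k([50, 181, 100, 7, 12], [181, 12, 300, 301], k=5)
print(p, r)  # 0.4 0.5
```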

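## Step 6: Aggregate Evaluation
I evaluated the user-based recommender on the first 100 users that appear in the test set (limited for efficiency). The loop averages only non-zero scores, so the reported means describe users with at least one hit rather than all 100 users.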

In [17]:
k = 5
users = test_data["user_id"].unique()[:100]  # Limit to 100 users
precisions = []
recalls = []

# Relevant (rating >= 4) test ratings for these users; kept for inspection, not used by the loop below
relevant = test_matrix[test_matrix >= 4].stack().reset_index()
relevant.columns = ["user_id", "movie_id", "rating"]  # Rename columns explicitly
relevant = relevant[relevant["user_id"].isin(users)].set_index("user_id")

for user in users:
    p = precision_at_k(user, k)
    r = recall_at_k(user, k)
    if p is not None and p > 0:  # Skip None or zero
        precisions.append(p)
    if r is not None and r > 0:
        recalls.append(r)

print(f"Mean Precision@{k}: {np.mean(precisions) if precisions else 0:.4f}")
print(f"Mean Recall@{k}: {np.mean(recalls) if recalls else 0:.4f}")

Mean Precision@5: 0.2000
Mean Recall@5: 0.0283
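
Because zero scores are skipped before averaging, these figures are conditional on a user having at least one hit. The sketch below uses made-up per-user precision values (not results from this notebook) to show how much the two ways of averaging can differ.

```python
import numpy as np

# Hypothetical per-user precision@5 values; most users have no hits.
per_user_precision = [0.0, 0.2, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]

# Average over every evaluated user, including those with zero hits.
mean_all = np.mean(per_user_precision)

# Average over users with at least one hit, as in the loop above.
mean_nonzero = np.mean([p for p in per_user_precision if p > 0])

print(f"all users: {mean_all:.2f}, non-zero only: {mean_nonzero:.2f}")
```

Averaging over all evaluated users is the more common convention when comparing recommenders, since it also penalises users who received no relevant recommendation.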


## Bonus Step 7: Item-Based Collaborative Filtering
I implemented item-based collaborative filtering by computing cosine similarity between movies (the columns of the training matrix) and wrote a `recommend_item_based` function that scores each unseen movie by a similarity-weighted average of the user's own ratings.

In [18]:
item_similarity = cosine_similarity(train_matrix.T.fillna(0).to_numpy())  # fillna(0) already removes every NaN
item_similarity = pd.DataFrame(item_similarity, index=train_matrix.columns, columns=train_matrix.columns)

def recommend_item_based(user_id, k=5):
    if user_id not in train_matrix.index:
        return []
    user_ratings = train_matrix.loc[user_id].dropna()
    scores = {}
    sim_sums = {}
    for movie, rating in user_ratings.items():
        sim_movies = item_similarity[movie].drop(movie)
        sim_movies = sim_movies[sim_movies > 0]
        for sim_movie, sim in sim_movies.items():
            if pd.isna(train_matrix.loc[user_id, sim_movie]):
                scores[sim_movie] = scores.get(sim_movie, 0) + sim * rating
                sim_sums[sim_movie] = sim_sums.get(sim_movie, 0) + sim
    predictions = {m: scores[m] / sim_sums[m] for m in scores if sim_sums[m] > 0}
    top_movies = sorted(predictions.items(), key=lambda x: x[1], reverse=True)[:k]
    recommendations = [(movies.set_index("movie_id").loc[mid]["title"], score) for mid, score in top_movies]
    return recommendations
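
The nested loops in `recommend_item_based` can also be expressed with array operations. The sketch below is one possible vectorized form for a single user, under the assumption of a made-up 3x3 similarity matrix and rating vector (`item_sim` and `user_ratings` are hypothetical names). It applies the same rule: positive similarities only, and scores only for unrated movies.

```python
import numpy as np

# Hypothetical item-item similarities for 3 movies (symmetric, 1.0 on the diagonal).
item_sim = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.25],
    [0.0, 0.25, 1.0],
])
# One user's ratings; NaN marks an unrated movie.
user_ratings = np.array([4.0, np.nan, 2.0])

rated = ~np.isnan(user_ratings)
sims = np.clip(item_sim[:, rated], 0.0, None)   # keep only positive similarities
weighted = sims @ user_ratings[rated]           # sum of sim * rating per candidate
totals = sims.sum(axis=1)                       # sum of sim per candidate

with np.errstate(divide="ignore", invalid="ignore"):
    predicted = np.where(totals > 0, weighted / totals, np.nan)
predicted[rated] = np.nan                       # only score unseen movies

print(np.round(predicted, 3))  # movie 1: (0.5*4 + 0.25*2) / 0.75 = 3.333
```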

## Bonus Step 8: Matrix Factorization with SVD
I applied truncated SVD with 50 components to the zero-filled training matrix to obtain user and item factors, and defined a `recommend_svd` function that ranks unseen movies by the dot product of a user's factors with the item factors.

In [19]:
matrix_filled = train_matrix.fillna(0)
svd = TruncatedSVD(n_components=50, random_state=42)
user_factors = svd.fit_transform(matrix_filled)
item_factors = svd.components_.T

def recommend_svd(user_id, k=5):
    if user_id not in train_matrix.index:
        return []
    user_idx = list(train_matrix.index).index(user_id)
    user_vec = user_factors[user_idx]
    pred_ratings = np.dot(user_vec, item_factors.T)
    seen_movies = train_matrix.loc[user_id].dropna().index
    unseen_movies = [m for m in train_matrix.columns if m not in seen_movies]
    pred_dict = {m: pred_ratings[list(train_matrix.columns).index(m)] for m in unseen_movies}
    top_movies = sorted(pred_dict.items(), key=lambda x: x[1], reverse=True)[:k]
    recommendations = [(movies.set_index("movie_id").loc[mid]["title"], score) for mid, score in top_movies]
    return recommendations
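
To show how the two factor matrices relate to the original ratings, the sketch below fits `TruncatedSVD` on a made-up 5x4 matrix (all `toy_*` names and values are assumptions) and multiplies the factors back together. The product is a low-rank approximation of the zero-filled input, so unrated cells are treated as ratings of 0. That may be one reason the SVD scores sit on a lower scale than the other two methods.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Made-up zero-filled ratings: 5 users x 4 movies with two rough taste groups.
toy = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 4.0],
    [0.0, 0.0, 4.0, 3.0],
    [5.0, 5.0, 0.0, 1.0],
])

toy_svd = TruncatedSVD(n_components=2, random_state=0)
toy_user_factors = toy_svd.fit_transform(toy)   # shape (5, 2)
toy_item_factors = toy_svd.components_.T        # shape (4, 2)

# Rank-2 approximation of the input; each row holds one user's predicted scores.
approx = toy_user_factors @ toy_item_factors.T
print(approx.shape)
print(np.round(approx, 1))
```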

## Step 9: Test All Methods
I tested all recommendation methods for a sample user to compare their outputs.

In [20]:
user_id = 1
print("User-Based:", recommend_movies(user_id, k=5))
print("Item-Based:", recommend_item_based(user_id, k=5))
print("SVD:", recommend_svd(user_id, k=5))

User-Based: [('Prefontaine (1997)', 5.000000000000001), ('Perfect Candidate, A (1996)', 5.0), ('Marlene Dietrich: Shadow and Light (1996) ', 5.0), ('Delta of Venus (1994)', 5.0), ('Saint of Fort Washington, The (1993)', 5.0)]
Item-Based: [('Further Gesture, A (1996)', 4.535520418144707), ("C'est arrivé près de chez vous (1992)", 4.425031280328295), ("Some Mother's Son (1996)", 4.270570244345701), ('Guantanamera (1994)', 4.257253244849679), ('A Chef in Love (1996)', 4.247623734753724)]
SVD: [('Heat (1995)', np.float64(3.278345490826269)), ('Blues Brothers, The (1980)', np.float64(3.0351139507680998)), ('Piano, The (1993)', np.float64(2.8231493097487643)), ('Reservoir Dogs (1992)', np.float64(2.717629487031369)), ('My Left Foot (1989)', np.float64(2.705065114242167))]
