Movie Recommendation Using User-Based Collaborative Filtering
What is User-Based Collaborative Filtering?
User-based collaborative filtering is a technique used in recommender systems to provide personalized recommendations to users based on their preferences and the preferences of similar users. It is a form of collaborative filtering that focuses on the similarity between users rather than items.
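
As a minimal sketch of this idea, assuming a made-up three-user ratings table that is not part of the data loaded below: similarity between users is measured on their rating vectors, and a score for an unseen movie is a similarity-weighted average of what the other users gave it. The helper name `cosine` and the toy numbers are illustrative only.

```python
import numpy as np

# Toy ratings (rows: users A, B, C; columns: 4 movies); 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user A (target, has not rated movie index 2)
    [5.0, 5.0, 4.0, 1.0],   # user B: tastes close to A
    [1.0, 1.0, 2.0, 5.0],   # user C: tastes far from A
])

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosine(ratings[0], ratings[1])
sim_ac = cosine(ratings[0], ratings[2])

# Score movie index 2 for A as a similarity-weighted average of B's and C's ratings.
pred = (sim_ab * ratings[1, 2] + sim_ac * ratings[2, 2]) / (sim_ab + sim_ac)
print(round(sim_ab, 3), round(sim_ac, 3), round(pred, 2))
```

Because B is more similar to A than C is, B's rating pulls the score further than C's does.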

Necessary Libraries

In [2]:
# Import required libraries

import pandas as pd
import numpy as np

from sklearn.metrics.pairwise import cosine_similarity

Importing data

In [5]:
movies_df=pd.read_csv(r'C:\Users\hp\Desktop\IT projects\Movie recommendation p1\ml-latest-small\movies.csv')

In [6]:
movies_df.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
ratings_df=pd.read_csv(r'C:\Users\hp\Desktop\IT projects\Movie recommendation p1\ml-latest-small\ratings.csv')

In [8]:
ratings_df.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [9]:
tags_df=pd.read_csv(r'C:\Users\hp\Desktop\IT projects\Movie recommendation p1\ml-latest-small\tags.csv')

In [10]:
tags_df.head()

,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [11]:
links_df=pd.read_csv(r'C:\Users\hp\Desktop\IT projects\Movie recommendation p1\ml-latest-small\links.csv')

In [12]:
links_df.head()

,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [13]:
movies_df.shape


(9742, 3)

In [14]:
links_df.shape

(9742, 3)

In [15]:
tags_df.shape

(3683, 4)

In [16]:
ratings_df.shape

(100836, 4)

In [17]:
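# Merge the movie and rating data sets.
# how="left" keeps every movie, so movies with no ratings get NaN in userId, rating and timestamp
# (this is also why userId shows up as a float below).

movies = movies_df.merge(ratings_df, how="left", on="movieId")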

In [18]:
movies.head()

,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,964982700.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,847435000.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,1106636000.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,1510578000.0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,1305696000.0


In [19]:
movies.shape

(100854, 6)
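
The merged frame has 100854 rows against 100836 in ratings_df, a difference of 18. That is what a left merge produces when some movies have no ratings: each such movie contributes one row with NaN in the rating columns. A small sketch on toy frames (the names `toy_movies` and `toy_ratings` are made up for illustration) shows the effect and one way to count those rows with the `indicator` option of `merge`:

```python
import pandas as pd

toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame({"userId": [10, 11], "movieId": [1, 1], "rating": [4.0, 3.5]})

# indicator=True adds a "_merge" column saying where each row came from.
merged = toy_movies.merge(toy_ratings, how="left", on="movieId", indicator=True)

# Rows marked "left_only" are movies that nobody rated.
unrated = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unrated))
print(merged["userId"].dtype)  # the NaN values force userId to a float dtype
```

Running the same check on `movies_df` and `ratings_df` would be one way to confirm where the extra rows come from.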

In [20]:
# Find duplicate rows across all columns
duplicates_all = movies[movies.duplicated(keep=False)]
duplicates_all.info()



<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   movieId    0 non-null      int64  
 1   title      0 non-null      object 
 2   genres     0 non-null      object 
 3   userId     0 non-null      float64
 4   rating     0 non-null      float64
 5   timestamp  0 non-null      float64
dtypes: float64(3), int64(1), object(2)
memory usage: 0.0+ bytes


In [21]:
# Print all column names in the movies DataFrame
for col in movies.columns:
    print(col)

movieId
title
genres
userId
rating
timestamp


Similarity between users based on their ratings

In [22]:
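# Pivot the DataFrame to create a user-item matrix (rows: users, columns: movies).
# Movies without ratings have userId NaN after the left merge, so they end up in a single NaN row.
user_item_matrix = movies.pivot(index='userId', columns='movieId', values='rating')
# Fill missing ratings with 0 (cosine similarity then treats "not rated" as a rating of 0;
# filling with the mean rating is another strategy)
user_item_matrix = user_item_matrix.fillna(0)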


In [23]:
user_item_matrix.head()

userId / movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
NaN,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
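
The first row of the matrix above has no user behind it: it collects the movies that nobody rated. A minimal sketch of one way to avoid that placeholder row, assuming it is acceptable to drop unrated movies before pivoting (the toy frame and the name `rated` are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "userId":  [1.0, 1.0, 2.0, np.nan],   # last row: a movie with no ratings
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 5.0, 3.0, np.nan],
})

# Keep only rows that belong to a real user, then restore integer user IDs.
rated = toy.dropna(subset=["userId"]).astype({"userId": int})

matrix = rated.pivot(index="userId", columns="movieId", values="rating").fillna(0)
print(matrix)
```

The trade-off is that unrated movies (movie 30 here) no longer appear as columns, which does not matter for recommendations because no neighbor can contribute a score for them.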


Calculate Cosine Similarity

In [24]:
# Calculate cosine similarity between users
cosine_sim = cosine_similarity(user_item_matrix)

# Convert the result to a DataFrame
cosine_sim_df = pd.DataFrame(cosine_sim, index=user_item_matrix.index, columns=user_item_matrix.index)

In [25]:
cosine_sim_df.head()

userId,NaN,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,...,601.0,602.0,603.0,604.0,605.0,606.0,607.0,608.0,609.0,610.0
NaN,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,0.0,1.0,0.027283,0.05972,0.194395,0.12908,0.128152,0.158744,0.136968,0.064263,...,0.080554,0.164455,0.221486,0.070669,0.153625,0.164191,0.269389,0.291097,0.093572,0.145321
2.0,0.0,0.027283,1.0,0.0,0.003726,0.016614,0.025333,0.027585,0.027257,0.0,...,0.202671,0.016866,0.011997,0.0,0.0,0.028429,0.012948,0.046211,0.027565,0.102427
3.0,0.0,0.05972,0.0,1.0,0.002251,0.00502,0.003936,0.0,0.004941,0.0,...,0.005048,0.004892,0.024992,0.0,0.010694,0.012993,0.019247,0.021128,0.0,0.032119
4.0,0.0,0.194395,0.003726,0.002251,1.0,0.128659,0.088491,0.11512,0.062969,0.011361,...,0.085938,0.128273,0.307973,0.052985,0.084584,0.200395,0.131746,0.149858,0.032198,0.107683
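
Each entry in this table is the cosine of the angle between two users' rating vectors: the dot product divided by the product of the vector lengths. The NumPy sketch below is an illustrative re-implementation on toy data, not the scikit-learn code; the function name `cosine_similarity_matrix` is made up. It treats an all-zero row as having similarity 0 to everything, which is consistent with the all-zero NaN row in the output above.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X; all-zero rows get similarity 0."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0      # avoid 0/0; a zero row stays zero after scaling
    unit = X / norms             # scale every row to unit length
    return unit @ unit.T         # dot products of unit vectors are cosines

X = [
    [0, 0, 0],   # placeholder row with no ratings
    [4, 0, 4],
    [4, 0, 4],   # same ratings as the row above
    [0, 5, 0],   # no movie in common with the two rows above
]
S = cosine_similarity_matrix(X)
print(S.round(3))
```

Identical rating vectors give 1, vectors with no movie in common give 0, and the matrix is symmetric.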


In [26]:
# Define the number of neighbors (e.g., top 5 similar users)
N = 5

def get_top_neighbors(similarity_df, N):
    # For each user, take the N + 1 most similar users and skip the first one,
    # which is normally the user themselves (self-similarity is 1.0)
    neighbors = similarity_df.apply(lambda x: x.nlargest(N + 1).index[1:N + 1].tolist(), axis=1)
    return neighbors

top_neighbors = get_top_neighbors(cosine_sim_df, N)

# Convert to a DataFrame for easier viewing.
# Note: this resets the index to 0, 1, 2, ...; row i lines up with userId i only because
# the NaN placeholder row sits at position 0 (assuming userIds are consecutive from 1).
top_neighbors_df = pd.DataFrame(top_neighbors.tolist(), columns=[f'Neighbor_{i + 1}' for i in range(N)])

print(top_neighbors_df.head())

   Neighbor_1  Neighbor_2  Neighbor_3  Neighbor_4  Neighbor_5
0         1.0         2.0         3.0         4.0         5.0
1       266.0       313.0       368.0        57.0        91.0
2       366.0       417.0       378.0       550.0       189.0
3       313.0       377.0       532.0       527.0       312.0
4       391.0       603.0       156.0       275.0       597.0
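
In the table above, row 0 belongs to the NaN placeholder row rather than to a user, and the row labels are positions, not userIds. A hedged sketch of a variant that keeps the user labels as the index and excludes each user explicitly instead of relying on them being ranked first; the function name `top_neighbors_by_user` and the toy similarity matrix are illustrative, and the sketch assumes the similarity matrix has no NaN label:

```python
import pandas as pd

def top_neighbors_by_user(similarity_df, n):
    """Return a DataFrame indexed by user whose columns hold the n most similar other users."""
    rows = {}
    for user, sims in similarity_df.iterrows():
        others = sims.drop(labels=user)              # exclude the user explicitly
        rows[user] = others.nlargest(n).index.tolist()
    return pd.DataFrame.from_dict(
        rows, orient="index",
        columns=[f"Neighbor_{i + 1}" for i in range(n)],
    )

users = [1, 2, 3, 4]
sim = pd.DataFrame(
    [[1.0, 0.9, 0.1, 0.4],
     [0.9, 1.0, 0.2, 0.3],
     [0.1, 0.2, 1.0, 0.8],
     [0.4, 0.3, 0.8, 1.0]],
    index=users, columns=users,
)
neighbors = top_neighbors_by_user(sim, 2)
print(neighbors)
```

With the index preserved, `neighbors.loc[user_id]` looks a user up by label, so it does not depend on how the rows happen to be ordered.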


Recommendation generation

In [27]:
def recommend_items_with_titles(user_id, user_item_matrix, top_neighbors, similarity_df, movies_df, num_recommendations=5):
    # Get the user's neighbors
    neighbors = top_neighbors.loc[user_id]
    
    # Get ratings from neighbors
    neighbor_ratings = user_item_matrix.loc[neighbors]
    
    # Get similarity scores for neighbors
    sim_scores = similarity_df.loc[user_id, neighbors]
    
    # Sum the neighbors' ratings for each movie, weighted by similarity
    weighted_ratings = neighbor_ratings.T.dot(sim_scores)
    
    # Remove already rated items from recommendations
    user_rated_items = user_item_matrix.loc[user_id][user_item_matrix.loc[user_id] > 0].index
    weighted_ratings = weighted_ratings.drop(user_rated_items)
    
    # Min-max scale the scores to the range 1 to 5 (the best candidate always gets 5.0)
    min_rating = 1
    max_rating = 5
    if weighted_ratings.max() > weighted_ratings.min():  # Prevent division by zero
        normalized_ratings = ((weighted_ratings - weighted_ratings.min()) / 
                              (weighted_ratings.max() - weighted_ratings.min()) * 
                              (max_rating - min_rating) + min_rating)
    else:
        normalized_ratings = weighted_ratings  # Keep original if all scores are equal

    # Get the top N recommendations
    recommendations = normalized_ratings.nlargest(num_recommendations)
    
    # Map movie IDs to titles (.copy() avoids pandas' SettingWithCopyWarning)
    recommended_movie_ids = recommendations.index
    recommended_movies = movies_df[movies_df['movieId'].isin(recommended_movie_ids)].copy()
    
    # Attach the scores by position. Note: the rows follow movies_df order while the scores
    # are sorted in descending order, so a score is not guaranteed to sit next to its own movie.
    recommended_movies['predicted_rating'] = recommendations.values
    return recommended_movies[['movieId', 'title', 'predicted_rating']]

# Get recommendations for user 1 with movie titles
recommendations_for_1_with_titles = recommend_items_with_titles(1, user_item_matrix, top_neighbors_df, cosine_sim_df, movies_df)
print(recommendations_for_1_with_titles)

      movieId                              title  predicted_rating
474       541                Blade Runner (1982)          5.000000
507       589  Terminator 2: Judgment Day (1991)          4.581711
793      1036                    Die Hard (1988)          4.336589
902      1200                      Aliens (1986)          4.335730
1211     1610   Hunt for Red October, The (1990)          4.332638
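
In the printed table the movieIds are in ascending order while the scores are in descending order. That is consistent with the scores being attached to the filtered movies_df rows by position rather than by movieId, so the pairing of title and score should be read with caution. A minimal sketch of joining on movieId instead, using a hypothetical helper `attach_scores` and toy data:

```python
import pandas as pd

def attach_scores(scores, movies_df):
    """Join scores (a Series indexed by movieId) to titles by movieId, keeping score order."""
    out = scores.rename("predicted_rating").rename_axis("movieId").reset_index()
    out = out.merge(movies_df[["movieId", "title"]], on="movieId", how="left")
    return out[["movieId", "title", "predicted_rating"]]

toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})

# Scores as nlargest() would return them: best first, indexed by movieId.
toy_scores = pd.Series({3: 5.0, 1: 4.2})

result = attach_scores(toy_scores, toy_movies)
print(result)
```

Here movie 3 ("C") keeps its score of 5.0. A positional assignment onto the filtered `toy_movies` rows would have given 5.0 to movie 1 ("A") instead.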


User interaction

In [28]:
import ipywidgets as widgets
from IPython.display import display

user_id_widget = widgets.IntText(value=1, description='User ID:')
button = widgets.Button(description="Get Recommendations")

def on_button_click(b):
    user_id = user_id_widget.value
    recommendations = recommend_items_with_titles(user_id, user_item_matrix, top_neighbors_df, cosine_sim_df, movies_df)
    display(recommendations)

button.on_click(on_button_click)
display(user_id_widget, button)


IntText(value=1, description='User ID:')

Button(description='Get Recommendations', style=ButtonStyle())

,movieId,title,predicted_rating
474,541,Blade Runner (1982),5.0
507,589,Terminator 2: Judgment Day (1991),4.581711
793,1036,Die Hard (1988),4.336589
902,1200,Aliens (1986),4.33573
1211,1610,"Hunt for Red October, The (1990)",4.332638


,movieId,title,predicted_rating
97,110,Braveheart (1995),5.0
418,480,Jurassic Park (1993),4.825517
990,1291,Indiana Jones and the Last Crusade (1989),4.75648
4800,7153,"Lord of the Rings: The Return of the King, The...",4.64242
5917,33794,Batman Begins (2005),4.554763


,movieId,title,predicted_rating
134,161,Crimson Tide (1995),5.0
217,253,Interview with the Vampire: The Vampire Chroni...,4.203821
244,282,Nell (1994),3.904988
461,527,Schindler's List (1993),3.860807
510,593,"Silence of the Lambs, The (1991)",3.852854


,movieId,title,predicted_rating
35,39,Clueless (1995),5.0
260,300,Quiz Show (1994),4.826924
463,529,Searching for Bobby Fischer (1993),4.520733
483,551,"Nightmare Before Christmas, The (1993)",4.059761
546,648,Mission: Impossible (1996),3.863885
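
The IntText widget accepts any integer, including IDs that do not belong to a user (for example 0, which in this notebook corresponds to the NaN placeholder row). A small sketch of a guard, assuming a hypothetical helper `validate_user_id` that checks the ID against the index of the user-item matrix:

```python
import numpy as np
import pandas as pd

def validate_user_id(user_id, user_index):
    """Return True if user_id labels a real user (NaN placeholder labels are ignored)."""
    known = set(pd.Index(user_index).dropna())
    return user_id in known

# Toy index shaped like the one in this notebook: a NaN label followed by float userIds.
index = pd.Index([np.nan, 1.0, 2.0, 3.0], name="userId")

print(validate_user_id(1, index))    # a real user
print(validate_user_id(0, index))    # not a user
print(validate_user_id(999, index))  # not a user
```

Inside `on_button_click`, a check like `validate_user_id(user_id, user_item_matrix.index)` could run before calling `recommend_items_with_titles`, printing a short message instead of raising an error for unknown IDs.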


