
## Movielens - 100K Dataset

The MovieLens 100K dataset has been a standard benchmark for recommender systems for more than 20 years, which makes it a good starting point for learning about them. For non-commercial, personalised movie recommendations, see the website: https://movielens.org/

This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from this data set.

## Data Description


**Ratings** -- The full u data set: 100,000 ratings by 943 users on 1682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered. This is a comma-separated list of

`user id | item id | rating | timestamp`

The timestamps are Unix seconds since 1/1/1970 UTC.
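
As a small illustration of the timestamp format, the sketch below converts Unix seconds into readable dates with pandas. The two rows are copied from the first rows of the ratings file shown later in this notebook; the `sample` and `rated_at` names are only illustrative.

```python
import pandas as pd

# Two rows copied from the ratings file; column names follow the notebook
sample = pd.DataFrame({
    "user_id": [196, 186],
    "movie_id": [242, 302],
    "rating": [3, 3],
    "unix_timestamp": [881250949, 891717742],
})

# unit="s" tells pandas the integers are seconds since 1970-01-01 UTC
sample["rated_at"] = pd.to_datetime(sample["unix_timestamp"], unit="s")

print(sample[["unix_timestamp", "rated_at"]])
```

Both dates should fall inside the collection window mentioned above (September 1997 to April 1998).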


**Movie Information** -- Information about the items (movies); this is a comma-separated list of

`movie id | movie title | release date | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western`

The last 19 fields are the genres: a 1 indicates the movie is of that genre, a 0 indicates it is not. Movies can be in several genres at once.
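
To make the 0/1 genre encoding concrete, here is a minimal sketch that turns the flags back into a list of genre names per movie. The two-row frame is a hypothetical excerpt with only three of the 19 genre columns, and the flag values are for illustration only.

```python
import pandas as pd

# Hypothetical excerpt: only three of the 19 genre columns, illustrative flags
movies = pd.DataFrame({
    "movie id": [1, 2],
    "movie title": ["Toy Story (1995)", "GoldenEye (1995)"],
    "Action": [0, 1],
    "Animation": [1, 0],
    "Comedy": [1, 0],
})
genre_cols = ["Action", "Animation", "Comedy"]

# For each movie, keep the names of the genre columns whose flag is 1
movies["genres"] = [
    [genre for genre, flag in zip(genre_cols, flags) if flag == 1]
    for flags in movies[genre_cols].to_numpy()
]

print(movies[["movie title", "genres"]])
```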


**User Demographics** -- Demographic information about the users; this is a comma-separated list of

`user id | age | gender | occupation | zip code`

## 1. Reading Dataset <a class="anchor" id="Reading-Dataset"></a>

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

In [2]:
#Reading ratings file:
ratings = pd.read_csv('ratings.csv')

#Reading Movie Info File
movie_info = pd.read_csv('movie_info.csv')

## 2.  Merging Movie information to ratings dataframe <a class="anchor" id="merge"></a>

The movie names are contained in a separate file. Let's merge that data with the ratings and store the result in the ratings dataframe. The idea is to bring the movie title into the ratings dataframe, as it will be useful later on.

In [3]:
ratings = ratings.merge(movie_info[['movie id','movie title']], how='left', left_on = 'movie_id', right_on = 'movie id')

In [4]:
ratings.head()

,user_id,movie_id,rating,unix_timestamp,movie id,movie title
0,196,242,3,881250949,242,Kolya (1996)
1,186,302,3,891717742,302,L.A. Confidential (1997)
2,22,377,1,878887116,377,Heavyweights (1994)
3,244,51,2,880606923,51,Legends of the Fall (1994)
4,166,346,1,886397596,346,Jackie Brown (1997)
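
Because this is a left merge, a rating whose movie id has no match in the movie file would still be kept, just with a missing title. One way to check for that is sketched below on toy frames, using the `indicator` option of `merge`; the movie id 9999 is deliberately made up so that one row fails to match.

```python
import pandas as pd

# Toy frames; movie_id 9999 is a made-up id with no entry in the movie file
ratings_toy = pd.DataFrame({
    "user_id": [196, 186, 22],
    "movie_id": [242, 302, 9999],
    "rating": [3, 3, 1],
})
info_toy = pd.DataFrame({
    "movie id": [242, 302],
    "movie title": ["Kolya (1996)", "L.A. Confidential (1997)"],
})

# indicator=True adds a "_merge" column saying where each row came from
merged = ratings_toy.merge(
    info_toy, how="left", left_on="movie_id", right_on="movie id", indicator=True
)

# Rows marked "left_only" are ratings without a matching movie title
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), "rows after merge,", len(unmatched), "without a title")
```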


Let's also combine the movie id and movie title, separated by ': ', and store the result in a new column named movie.

In [5]:
ratings['movie'] = ratings['movie_id'].map(str) + ': ' + ratings['movie title'].map(str)

In [6]:
ratings.columns

Index(['user_id', 'movie_id', 'rating', 'unix_timestamp', 'movie id',
       'movie title', 'movie'],
      dtype='object')

We keep the columns movie, user_id and rating in the ratings dataframe and drop all others.

In [7]:
ratings = ratings.drop(['movie id', 'movie title', 'movie_id','unix_timestamp'], axis = 1)

In [8]:
ratings = ratings[['user_id','movie','rating']]

## 3. Creating Train & Test Data & Setting Evaluation Metric <a class="anchor" id="eval"></a>
To test how well a given rating prediction method does, we first need to define our train and test sets. We will use only the train set to build the different models, and evaluate each model on the test set.

In [9]:
#Assign X as the original ratings dataframe
X = ratings.copy()

#Split into training and test datasets
X_train, X_test = train_test_split(X, test_size = 0.25, random_state=42)

In [10]:
#Function that computes the root mean squared error (or RMSE)
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
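
As a quick sanity check of what RMSE measures, the sketch below works one small example by hand with NumPy only. The ratings and predictions are made-up numbers chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Made-up true ratings and predictions for four user-movie pairs
y_true = np.array([4, 4, 2, 5])
y_pred = np.array([3, 5, 2, 3])

# Squared errors are 1, 1, 0 and 4, so their mean is 6 / 4 = 1.5
squared_errors = (y_true - y_pred) ** 2
manual_rmse = np.sqrt(squared_errors.mean())

print(squared_errors, manual_rmse)
```

Note how the single error of 2 contributes four times as much as each error of 1, which is why RMSE penalises large misses more heavily.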

## 4. Simple Baseline using average of all ratings <a class="anchor" id="simplebaseline"></a>

A simple baseline predicts the average of all available ratings for every user-movie pair in the test set. Its RMSE gives us a reference score: the more complex techniques we try later should beat it, and if they do not, something probably needs to change.

In [11]:
#Define the baseline model to always return average of all available ratings
def baseline(user_id, movie):
    return X_train['rating'].mean()

In [12]:
#Function to compute the RMSE score obtained on the test set by a model
def rmse_score(model):

    #Construct a list of user-movie tuples from the test dataset
    id_pairs = zip(X_test['user_id'], X_test['movie'])

    #Predict the rating for every user-movie tuple
    y_pred = np.array([model(user, movie) for (user, movie) in id_pairs])

    #Extract the actual ratings given by the users in the test data
    y_true = np.array(X_test['rating'])

    #Return the final RMSE score
    return rmse(y_true, y_pred)

In [13]:
rmse_score(baseline)

1.1244396573898978

## 5. Item based Collaborative filtering with simple item mean <a class="anchor" id="itemmean"></a>
In item-based CF we discussed the steps for using a weighted mean of similar items' ratings. Let's first try something simpler: predicting with the plain average of all ratings a particular user has given to other movies. To do that, we first create the ratings matrix using the pandas pivot_table function.

In [14]:
#Build the ratings matrix using pivot_table function
r_matrix = X_train.pivot_table(values='rating', index='user_id', columns='movie')

r_matrix.head()

user_id,1000: Lightning Jack (1994),"1001: Stupids, The (1996)","1002: Pest, The (1997)",1003: That Darn Cat! (1997),1004: Geronimo: An American Legend (1993),"1005: Double vie de Véronique, La (Double Life of Veronique, The) (1991)",1006: Until the End of the World (Bis ans Ende der Welt) (1991),1007: Waiting for Guffman (1996),1008: I Shot Andy Warhol (1996),1009: Stealing Beauty (1996),...,992: Head Above Water (1996),993: Hercules (1997),"994: Last Time I Committed Suicide, The (1997)","995: Kiss Me, Guido (1997)","996: Big Green, The (1995)",997: Stuart Saves His Family (1995),998: Cabin Boy (1994),999: Clean Slate (1994),99: Snow White and the Seven Dwarfs (1937),9: Dead Man Walking (1995)
1,,,,,,,,,,,...,,,,,,,,,3.0,5.0
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
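
To see what `pivot_table` does here, the following sketch builds the same kind of user-by-movie matrix from four made-up ratings. Pairs without a rating come out as NaN, and the columns are sorted as strings, which is also why "1000: ..." appears before "9: ..." in the output above.

```python
import pandas as pd

# Four made-up ratings in the same long format as X_train
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "movie": [
        "50: Star Wars (1977)",
        "1: Toy Story (1995)",
        "50: Star Wars (1977)",
        "1: Toy Story (1995)",
    ],
    "rating": [5, 4, 3, 2],
})

# One row per user, one column per movie, NaN where no rating exists
toy_matrix = toy.pivot_table(values="rating", index="user_id", columns="movie")

# Fraction of user-movie cells that are empty
sparsity = toy_matrix.isna().to_numpy().mean()

print(toy_matrix)
print("share of missing cells:", sparsity)
```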


In [15]:
#Item Based Collaborative Filter using Mean Ratings
def cf_item_mean(user_id, movie):

    #Compute the mean of all the ratings given by the user
    mean_rating = r_matrix.loc[user_id].mean()

    return mean_rating

In [16]:
#Compute RMSE for the Mean model
rmse_score(cf_item_mean)

1.044885130655045

The RMSE that we get from this simple technique is slightly lower than that of the simple user mean discussed in the last module. Now let us check item-based collaborative filtering with a weighted mean over the most similar items.

## 6. Item based Collaborative filtering with similarity weighted mean <a class="anchor" id="itemwmean"></a>
Now let's use cosine similarity and evaluate item-based filtering with a similarity-weighted mean. Cosine similarity ranges from -1 to 1 in general, but because ratings are non-negative it lies between 0 and 1 here. The sklearn function that we are going to use does not accept missing values, so to create the item-item matrix we fill all the missing values in the user-item matrix with 0.
This means that every user-movie pair without a rating is treated as a rating of 0.
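
For intuition, the sketch below computes the cosine similarity between two item columns by hand and compares it with sklearn's `cosine_similarity` on a tiny zero-filled matrix. The matrix `R` is made up; rows are users and columns are items, mirroring the layout of the ratings matrix.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up ratings: rows are users, columns are items, 0 means "not rated"
R = np.array([
    [5.0, 4.0, 0.0],
    [3.0, 0.0, 0.0],
    [0.0, 2.0, 1.0],
])

# Cosine similarity of item 0 and item 1: dot product over product of norms
item_a, item_b = R[:, 0], R[:, 1]
manual = item_a @ item_b / (np.linalg.norm(item_a) * np.linalg.norm(item_b))

# Transposing makes each item a row, so the result is an item-item matrix
item_sim = cosine_similarity(R.T)

print(manual, item_sim[0, 1])
```

Items 0 and 2 share no user, so their similarity is 0 even though that only reflects missing data, which is one known drawback of filling with zeros.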

In [17]:
#Create a dummy ratings matrix with all null values imputed to 0
r_matrix_dummy = r_matrix.copy().fillna(0)

In [18]:
#Compute the cosine similarity matrix using the dummy ratings matrix
cosine_sim = cosine_similarity(r_matrix_dummy.T, r_matrix_dummy.T)

#Convert into pandas dataframe
cosine_sim = pd.DataFrame(cosine_sim, index=r_matrix.columns, columns=r_matrix.columns)

cosine_sim.head(5)

movie,1000: Lightning Jack (1994),"1001: Stupids, The (1996)","1002: Pest, The (1997)",1003: That Darn Cat! (1997),1004: Geronimo: An American Legend (1993),"1005: Double vie de Véronique, La (Double Life of Veronique, The) (1991)",1006: Until the End of the World (Bis ans Ende der Welt) (1991),1007: Waiting for Guffman (1996),1008: I Shot Andy Warhol (1996),1009: Stealing Beauty (1996),...,992: Head Above Water (1996),993: Hercules (1997),"994: Last Time I Committed Suicide, The (1997)","995: Kiss Me, Guido (1997)","996: Big Green, The (1995)",997: Stuart Saves His Family (1995),998: Cabin Boy (1994),999: Clean Slate (1994),99: Snow White and the Seven Dwarfs (1937),9: Dead Man Walking (1995)
1000: Lightning Jack (1994),1.0,0.230365,0.102062,0.081111,0.0,0.0,0.0,0.073011,0.0,0.0,...,0.549972,0.054554,0.0,0.0,0.218797,0.111187,0.140488,0.202031,0.090591,0.0
"1001: Stupids, The (1996)",0.230365,1.0,0.0,0.0,0.0,0.0,0.0,0.193947,0.006421,0.040001,...,0.325785,0.056553,0.0,0.0,0.0,0.256137,0.0,0.0,0.105294,0.05417
"1002: Pest, The (1997)",0.102062,0.0,1.0,0.066227,0.0,0.0,0.0,0.0,0.06069,0.0,...,0.0,0.0,0.0,0.0,0.029775,0.030261,0.19118,0.0,0.013449,0.056888
1003: That Darn Cat! (1997),0.081111,0.0,0.066227,1.0,0.0,0.0,0.0,0.0,0.0,0.056336,...,0.0,0.0,0.0,0.077557,0.236624,0.144296,0.395029,0.0,0.042751,0.018838
1004: Geronimo: An American Legend (1993),0.0,0.0,0.0,0.0,1.0,0.065556,0.106504,0.053298,0.0,0.056336,...,0.0,0.053099,0.0,0.0,0.011831,0.15632,0.0,0.016387,0.056111,0.043326


Using cosine similarity, we have estimated the similarity between each pair of items, and we can use it to find the movies most similar to any given movie.

In [19]:
#Checking movies most similar to Star Wars
cosine_sim['50: Star Wars (1977)'].sort_values(ascending = False)[1:6]

movie,50: Star Wars (1977)
181: Return of the Jedi (1983),0.654124
"172: Empire Strikes Back, The (1980)",0.520989
174: Raiders of the Lost Ark (1981),0.520819
1: Toy Story (1995),0.51373
100: Fargo (1996),0.513665


Without being told that Return of the Jedi and The Empire Strikes Back belong to the same universe as Star Wars, cosine similarity has ranked these movies at the top. It is quite interesting how user preferences alone can be used to uncover such hidden information.

Now we have the item-item similarities stored in the matrix cosine_sim. We will define a function to predict the unknown ratings in the test set using item-based collaborative filtering, with cosine as the similarity measure and using all of the user's ratings of other items. For each user-movie pair:
1. Check whether the movie is in the train set; if it is not, predict the mean rating of the train set
2. Extract the cosine similarity values from the matrix cosine_sim
3. Drop all the unrated items from both the similarity scores and the ratings, as they cannot contribute to the prediction
4. Use the prediction formula to make rating predictions
<img src="Picture 1.png" style="width: 500px;"/>
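
In case the image does not render, the weighted mean computed by the function below can be written as follows. The notation is our own and is not taken from the image: $\hat{r}_{u,i}$ is the predicted rating of user $u$ for item $i$, $I_u$ is the set of items that user $u$ has rated in the train set, and $\mathrm{sim}(i,j)$ is the cosine similarity between items $i$ and $j$.

$$
\hat{r}_{u,i} = \frac{\sum_{j \in I_u} \mathrm{sim}(i,j)\, r_{u,j}}{\sum_{j \in I_u} \mathrm{sim}(i,j)}
$$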

In [20]:
#Item Based Collaborative Filter using Weighted Mean Ratings
def cf_item_wmean(user_id, movie_id):

    #Check if movie_id exists in r_matrix
    if movie_id in r_matrix:

        #Get the similarity scores for the item in question with every other item
        sim_scores = cosine_sim[movie_id]

        #Get the movie ratings for the user in question
        m_ratings = r_matrix.loc[user_id]

        #Extract the indices containing NaN in the m_ratings series
        idx = m_ratings[m_ratings.isnull()].index

        #Drop the NaN values from the m_ratings Series (removing unrated items)
        m_ratings = m_ratings.dropna()

        #Drop the corresponding cosine scores from the sim_scores series
        sim_scores = sim_scores.drop(idx)

        #Compute the final weighted mean
        wmean_rating = np.dot(sim_scores, m_ratings)/ sim_scores.sum()

    else:
        #Default to average rating in the absence of any information on the movie in train set
        wmean_rating = X_train['rating'].mean()

    return wmean_rating

In [21]:
rmse_score(cf_item_wmean)

1.0166073087455623

We see that not much has improved here, so we could go ahead and try to constrain the neighbourhood, as well as use different similarity measures such as adjusted cosine similarity, to improve the score.
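
One possible way to constrain the neighbourhood is sketched below: only the k rated items most similar to the target item contribute to the weighted mean. This is an illustrative variant, not the method used in the rest of this notebook; the function name `topk_weighted_mean`, its `fallback` argument and the toy similarity values are all hypothetical.

```python
import numpy as np
import pandas as pd

def topk_weighted_mean(sim_scores, user_ratings, k, fallback):
    """Weighted mean over the k rated items most similar to the target item."""
    # Keep only the items this user has actually rated
    rated = user_ratings.dropna()

    # Similarities to those rated items, restricted to the k largest
    sims = sim_scores.reindex(rated.index).nlargest(k)

    # With no usable neighbours the ratio is undefined, so use the fallback
    if sims.sum() == 0:
        return fallback

    weights = sims.to_numpy()
    values = rated.loc[sims.index].to_numpy()
    return float(np.dot(weights, values) / weights.sum())

# Toy data: similarity of items A-D to the target item, and one user's ratings
sim_to_target = pd.Series({"A": 0.9, "B": 0.1, "C": 0.6, "D": 0.8})
ratings_by_user = pd.Series({"A": 5.0, "B": 1.0, "C": 4.0, "D": np.nan})

pred_k2 = topk_weighted_mean(sim_to_target, ratings_by_user, k=2, fallback=3.5)
pred_k3 = topk_weighted_mean(sim_to_target, ratings_by_user, k=3, fallback=3.5)
print(pred_k2, pred_k3)
```

With k=2 the weakly similar item B is ignored, so the prediction is pulled less towards its low rating than with k=3.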

## 7. Importing Surprise & Loading Dataset <a class="anchor" id="dataload"></a>

In [22]:
!pip install "numpy<2"




In [23]:
#Install surprise package if not already installed
%pip install scikit-surprise

#Importing functions to be used in this notebook from Surprise Package
from surprise import Dataset, Reader
from surprise.model_selection import GridSearchCV
from surprise.prediction_algorithms import KNNWithMeans



To load a dataset from a pandas dataframe within Surprise, you will need the load_from_df() method.
1. You will also need a `Reader` object and the `rating_scale` parameter must be specified.
2. The dataframe here must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings in this order.
3. Each row thus corresponds to a given rating. This is not restrictive as you can reorder the columns of your dataframe easily.
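
As a small illustration of the column-order requirement, the sketch below reorders a toy dataframe into the (user, item, rating) order using plain pandas and checks that the ratings fit the 1-5 scale. The `shuffled` frame is made up for the example; only the column names follow this notebook.

```python
import pandas as pd

# Made-up frame whose columns are in the wrong order
shuffled = pd.DataFrame({
    "rating": [3, 1],
    "movie": ["242: Kolya (1996)", "377: Heavyweights (1994)"],
    "user_id": [196, 22],
})

# Selecting with a list of names returns the columns in that order
ordered = shuffled[["user_id", "movie", "rating"]]

# Check the ratings against the scale that will be given to the Reader
within_scale = ordered["rating"].between(1, 5).all()

print(list(ordered.columns), within_scale)
```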

In [24]:
#Assign X as the original ratings dataframe
X = ratings.copy()

#Split into training and test datasets
X_train, X_test = train_test_split(X, test_size = 0.25, random_state=42)

#Reader object to import ratings from X_train
reader = Reader(rating_scale=(1, 5))

#Storing Data in surprise format from X_train
data = Dataset.load_from_df(X_train[['user_id','movie','rating']], reader)

## 8. Grid Search for Neighbourhood size and similarity measure <a class="anchor" id="gridsearch"></a>

The `cross_validate()` function reports accuracy metrics over a cross-validation procedure for a given set of parameters. If you want to know which parameter combination yields the best results, the `GridSearchCV` class comes to the rescue.

Given a dict of parameters, this class exhaustively tries all the combinations of parameters and reports the best parameters for any accuracy measure (averaged over the different splits). It is heavily inspired by scikit-learn’s GridSearchCV.
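
To get a feel for what "exhaustively" means, the sketch below enumerates the combinations for the grid used in the next cell with `itertools.product`. It only mimics the enumeration for illustration and does not call Surprise; the variable names are our own.

```python
from itertools import product

# Same values as the parameter grid used in the next cell
k_values = list(range(1, 50, 5))
sim_names = ["cosine", "pearson"]
user_based_flags = [True, False]

# Every combination of neighbourhood size, similarity measure and CF direction
combos = [
    {"k": k, "sim_options": {"name": name, "user_based": user_based}}
    for k, name, user_based in product(k_values, sim_names, user_based_flags)
]

# Each combination is fitted once per cross-validation fold
n_folds = 5
print(len(combos), "combinations,", len(combos) * n_folds, "model fits")
```

The number of fits grows multiplicatively with every parameter added, which is worth keeping in mind before widening the grid.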

In [25]:
#Define the parameter grid: k is the neighbourhood size, and we try 2 similarity measures with KNNWithMeans.
#user_based is set to both True and False to try user-based and item-based collaborative filtering
#and check which performs better (the 5 folds are set through cv below)
param_grid = {"k":list(range(1,50,5)),
              "sim_options":{"name":["cosine","pearson"],'user_based': [True,False]}}

#Trying to find the best set of hyperparameters using Grid Search
gs = GridSearchCV(KNNWithMeans,
                  param_grid,
                  measures=['rmse'],
                  cv=5,
                  n_jobs = -1)

#We fit the grid search on data to find out the best score
gs.fit(data)

#Printing the best score
print(gs.best_score['rmse'])

#Printing the best set of parameters
print(gs.best_params['rmse'])

0.9500464123876704
{'k': 46, 'sim_options': {'name': 'cosine', 'user_based': False}}


This is good: cross-validation shows some improvement with item-based filtering, so let's now check the same on the test data. Note that KNNWithMeans mean-centers the ratings before combining them (with `user_based` set to False, around each item's mean), which is similar in spirit to adjusted cosine similarity, although adjusted cosine centers on each user's mean instead.

Now, using the best parameters, we can fit the model on the complete train set and check its performance on the test set.

## 9. Fitting Model on complete train set & checking performance on test data<a class="anchor" id="testperf"></a>

In [26]:
#Defining similarity measure as per the best parameters
sim_options = {'name': 'cosine', 'user_based': False}

#Fitting the model on train data
model = KNNWithMeans(k = 46, sim_options = sim_options)

#build_full_trainset() makes KNNWithMeans fit on the complete train set instead of a part of it,
#as happens in cross-validation
model.fit(data.build_full_trainset())

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7acc71bd2c90>

In [27]:
#id pairs for test set
id_pairs = zip(X_test['user_id'], X_test['movie'])

#Making predictions for test set using predict method from Surprise
y_pred = [model.predict(uid = user, iid = movie).est for (user, movie) in id_pairs]

#Actual rating values for test set
y_true = X_test['rating']

# Checking performance on test set
rmse(y_true, y_pred)

0.9451846317639762