
In [1]:
# Downloading the Ratings Dataset
!wget https://raw.githubusercontent.com/prathmesh-trip/RecommendationEngine/main/ratings.csv

--2022-08-31 10:28:52--  https://raw.githubusercontent.com/prathmesh-trip/RecommendationEngine/main/ratings.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2483723 (2.4M) [text/plain]
Saving to: ‘ratings.csv’


2022-08-31 10:28:52 (37.3 MB/s) - ‘ratings.csv’ saved [2483723/2483723]



In [12]:
# Downloading the Movies Dataset
!wget https://raw.githubusercontent.com/susanli2016/Machine-Learning-with-Python/master/movielens_data/movies.csv

--2022-08-31 10:35:42--  https://raw.githubusercontent.com/susanli2016/Machine-Learning-with-Python/master/movielens_data/movies.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 458390 (448K) [text/plain]
Saving to: ‘movies.csv.1’


2022-08-31 10:35:42 (12.2 MB/s) - ‘movies.csv.1’ saved [458390/458390]



In [3]:
# Installing the scikit surprise package
!pip install scikit-surprise

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
     |████████████████████████████████| 11.8 MB 4.4 MB/s
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1633994 sha256=54fc82700dbdf5f0ce066b46966eeea1f1c2f8d6be995a18379ca86e5876a364
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


## Preparing Data for Collaborative Filtering

In [5]:
# Importing the basic libraries
import pandas as pd
import numpy as np

In [13]:
# Reading the movies dataset
movies = pd.read_csv('/content/movies.csv',on_bad_lines='skip')

# Check the shape of the dataframe
print(movies.shape)

# Check the first few rows of dataframe
movies.head(10)

(9125, 3)


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [15]:
# Read the ratings dataset and check the first few rows
ratings = pd.read_csv('/content/ratings.csv')

ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [16]:
# Remove the timestamp column
ratings.drop(['timestamp'], axis = 1, inplace = True)

# Check first few rows again
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [17]:
# Now, we write the updated rating data to a CSV file, which Surprise can load with Dataset.load_from_file
# We omit the header row, so that every line of the file is a plain 'user,item,rating' record
# We also set index to False, so that the DataFrame index is not written as an extra column

ratings.to_csv('ratings_modified.csv',
               header = False,
               index = False)

In [18]:
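# Let's take a look at the modified dataset before feeding it to the surprise data reader
# The file has no header row, so we pass header=None; otherwise the first record would be read as column names
x = pd.read_csv('ratings_modified.csv', header = None)
x.head()

Unnamed: 0,0,1,2
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0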


In [19]:
# Let's load the data with the surprise reader
from surprise import Dataset, Reader

# Let's first specify the file path and reader parameters required for loading the data
# Note: if the ratings include half-star values below 1 (MovieLens uses a 0.5-5.0 scale),
# rating_scale = (0.5, 5) would match the data more closely
file_path = 'ratings_modified.csv'
reader = Reader(line_format = 'user item rating',
                sep= ',',
                rating_scale = (1,5))

# Let's load the dataset. The result is a surprise Dataset object, so it cannot be inspected like a DataFrame
data = Dataset.load_from_file(file_path, reader = reader)

In [20]:
# Let's build the training dataset
train = data.build_full_trainset()

# Let's get the number of users and items
print("Number of users in Database : ", train.n_users)
print("Number of items in Database : ", train.n_items)

# There are far fewer users than items, so the user-user similarity matrix is much smaller than the item-item one
# This makes user based collaborative filtering a reasonable choice here

Number of users in Database :  610
Number of items in Database :  9724
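
The comment above about having fewer users than items can be made concrete with a little arithmetic. The sketch below compares the size of a user-user similarity matrix with an item-item one for the counts printed above. It assumes the matrix is stored densely with 8-byte floats, which is an assumption for illustration and may not match Surprise's internals exactly.

```python
# Counts reported above by train.n_users and train.n_items
n_users, n_items = 610, 9724

# A user-user similarity matrix has one entry per pair of users,
# an item-item matrix has one entry per pair of items.
user_sim_entries = n_users ** 2
item_sim_entries = n_items ** 2

BYTES_PER_ENTRY = 8  # assumption: dense float64 storage

for label, entries in [("user-based", user_sim_entries),
                       ("item-based", item_sim_entries)]:
    megabytes = entries * BYTES_PER_ENTRY / 1e6
    print(f"{label}: {entries:,} entries, ~{megabytes:.1f} MB")
```

Under these assumptions the user-based matrix needs roughly 3 MB, against roughly 756 MB for the item-based one.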


## Implementation of User Based Collaborative Filtering

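- Using the **surprise** package
- Algorithm : KNNWithMeans, a k-nearest-neighbours algorithm that takes each user's mean rating into account, used here for user based collaborative filtering
  - Underlying assumption : Users who rated movies similarly in the past (i.e. are close to each other) are likely to rate other movies similarly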
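
To make the idea concrete, here is a minimal pure-Python sketch of a mean-centred nearest-neighbour prediction with Pearson similarity. It is a simplified illustration, not Surprise's implementation: the toy `ratings` data and the helper names `mean_rating`, `pearson` and `predict` are hypothetical, and details such as `min_k` and clipping to the rating scale are left out.

```python
from math import sqrt

# Toy ratings (hypothetical data): user -> {item: rating}
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob":   {"A": 4.0, "B": 2.0, "C": 3.0, "D": 4.0},
    "carol": {"A": 1.0, "B": 5.0, "C": 2.0, "D": 1.0},
}

def mean_rating(user):
    """Average of all ratings given by a user."""
    vals = ratings[user].values()
    return sum(vals) / len(vals)

def pearson(u, v):
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    ru = [ratings[u][i] for i in common]
    rv = [ratings[v][i] for i in common]
    mu, mv = sum(ru) / len(ru), sum(rv) / len(rv)
    num = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    den = sqrt(sum((a - mu) ** 2 for a in ru)) * sqrt(sum((b - mv) ** 2 for b in rv))
    return num / den if den else 0.0

def predict(user, item, k=2):
    """User's mean plus the similarity-weighted average of the
    neighbours' deviations from their own means."""
    base = mean_rating(user)
    # Candidate neighbours: other users who rated the item
    neighbours = [(pearson(user, v), v) for v in ratings
                  if v != user and item in ratings[v]]
    # Keep the k most similar, then only those with positive similarity
    nearest = sorted(neighbours, reverse=True)[:k]
    positive = [(s, v) for s, v in nearest if s > 0]
    if not positive:
        return base  # no usable neighbours: fall back to the user's mean
    num = sum(s * (ratings[v][item] - mean_rating(v)) for s, v in positive)
    den = sum(s for s, _ in positive)
    return base + num / den

print(round(predict("alice", "D"), 2))
```

Here bob rates like alice (just one star lower), while carol's tastes run the opposite way, so only bob contributes. Bob rated D 0.75 above his own mean, so alice's estimate is her own mean (4.0) plus 0.75.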

In [23]:
# Importing the module
from surprise import KNNWithMeans

# User Based Collaborative Filtering
# sim_options -- refers to similarity options
my_sim_option = {'name' : 'pearson', # Correlation measure to be used
                 'user_based' : True} # Set user based collaborative filtering option as true

# KNN model as backend
algo = KNNWithMeans(k=15, # Max number of neighbors to take into account for aggregation
                    min_k = 5, # Min number of neighbors to take into account for aggregation
                    sim_options = my_sim_option, # How to measure similarity
                    verbose = True)

In [24]:
# Running Cross validation and evaluating model performance
from surprise.model_selection import cross_validate

# Cross Validation
results = cross_validate(algo = algo,
                         data = data,
                         measures = ['RMSE'], # Accuracy Metric RMSE
                         cv = 5, # 5 fold cross validation being used, where data is split into 5 groups
                         return_train_measures = True)

print(results['test_rmse'].mean())

Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
0.8945631606125983
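
The number printed above is the mean of the per-fold test RMSE values. As a rough illustration of what is being computed, the sketch below implements RMSE and a simple 5-fold split in plain Python. The helper names `rmse` and `k_fold_indices` are hypothetical, and Surprise's own splitting may shuffle and partition the data differently.

```python
import random
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true rating, estimated rating) pairs."""
    return sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal groups
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# Three hypothetical (true, estimated) pairs
example = [(4.0, 3.5), (3.0, 3.5), (5.0, 4.0)]
print(rmse(example))

# Every rating index lands in exactly one test fold
splits = list(k_fold_indices(10, k=5))
print([sorted(test) for _, test in splits])
```

Because the errors are squared before averaging, one large miss (the 5.0 predicted as 4.0) weighs more than the two half-star misses combined.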


In [26]:
# The mean test RMSE is about 0.89 across the 5 folds, so we train the model on the full training set
algo.fit(train)

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7fa2bddf9f50>

## Making Predictions and Interpreting them

In [None]:
# Building a mapping of movie ID and movie title
# The surprise package returns raw user IDs and movie IDs (as strings), not titles

# Creating a dictionary to map the movie ID (as a string) to the movie title
movie_id_to_title_map = {}

for m_id, title in zip(movies['movieId'].values, movies['title'].values):
  movie_id_to_title_map[str(m_id)] = title

# Let's check the mapping now
movie_id_to_title_map

In [33]:
# Realtime Prediction

# Check what rating User 1 would give to Movie ID 31
# Raw IDs are passed as strings because the data was loaded from a file
val = algo.predict(uid = '1', iid = '31')
print(val)
print(movie_id_to_title_map[val.iid], val.est)

# User 1 is predicted to rate the movie Dangerous Minds about 4.03 stars

user: 1          item: 31         r_ui = None   est = 4.03   {'actual_k': 15, 'was_impossible': False}
Dangerous Minds (1995) 4.032384425265291


In [34]:
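# Making a function to recommend the top-N movies for every user in the database
from collections import defaultdict

def get_top_n(predictions, n=10):
  """Return a dict mapping each user ID to their n highest-estimated (movie ID, rating) pairs."""
  # Group the estimated ratings by user
  top_n = defaultdict(list)
  for uid, iid, true_r, est, _ in predictions:
    top_n[uid].append((iid,est))

  # Sort each user's predictions by estimated rating (highest first) and keep the top n
  for uid, user_ratings in top_n.items():
    user_ratings.sort(key = lambda x : x[1], reverse = True)
    top_n[uid] = user_ratings[:n]

  return top_n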
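
The grouping-and-sorting step can be tried without a trained model. The sketch below is a variant that uses `heapq.nlargest` instead of a full sort, run on a handful of made-up prediction tuples. The function name `top_n_with_heap` and the `fake_predictions` data are hypothetical. The tuples only mimic the `(uid, iid, r_ui, est, details)` layout that the function above unpacks.

```python
import heapq
from collections import defaultdict

def top_n_with_heap(predictions, n=10):
    """Group (item, estimate) pairs by user and keep the n highest estimates."""
    per_user = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        per_user[uid].append((iid, est))
    # nlargest avoids sorting every user's full list when n is small
    return {uid: heapq.nlargest(n, items, key=lambda pair: pair[1])
            for uid, items in per_user.items()}

# Made-up predictions: (user ID, movie ID, true rating, estimate, details)
fake_predictions = [
    ("1", "10", None, 3.9, {}),
    ("1", "20", None, 4.7, {}),
    ("1", "30", None, 4.1, {}),
    ("2", "10", None, 2.5, {}),
    ("2", "30", None, 4.4, {}),
]

top2 = top_n_with_heap(fake_predictions, n=2)
print(top2["1"])
print(top2["2"])
```

For user "1" this keeps movies "20" and "30" and drops "10", whose estimate is the lowest of the three.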

In [35]:
# Build the anti-test set
# The anti-test set contains every (user, movie) pair for which the user has NOT rated the movie in the training set
# Example: for User 1, it contains all the movies in the training set that User 1 has not rated

testdata = train.build_anti_testset()
predictions = algo.test(testdata)
top_n = get_top_n(predictions, n = 10)
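
To illustrate what an anti-test set contains, here is a small pure-Python sketch on made-up data. The `rated` dictionary and the function `build_anti_pairs` are hypothetical and only mimic the idea: the third element of each tuple is a placeholder rating, for which this sketch uses an arbitrary fill value.

```python
# Made-up training data: user -> set of movies the user has rated
rated = {
    "u1": {"A", "B"},
    "u2": {"B"},
    "u3": {"A", "C"},
}

def build_anti_pairs(rated, fill):
    """Return (user, item, fill) for every item a user has NOT rated."""
    all_items = set().union(*rated.values())
    pairs = []
    for user in sorted(rated):
        for item in sorted(all_items - rated[user]):
            pairs.append((user, item, fill))
    return pairs

anti = build_anti_pairs(rated, fill=3.5)  # 3.5 is an arbitrary placeholder
for row in anti:
    print(row)
```

With 3 users and 3 movies there are 9 possible pairs, 5 of which are already rated, so the anti-test set holds the remaining 4. On the real data the same construction yields a very large number of pairs, which is why predicting over the whole anti-test set can take a while.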

In [47]:
# Let's create a function to print all the movies a user has rated, along with the ratings
def PreviousMoviesUserWatched(user_df , user_id , item_map):
    user_df = user_df[user_df.iloc[: , 0] == user_id]
    for movie , rating in zip(user_df.iloc[:,1].values , user_df.iloc[:,2].values):
        print(item_map[str(movie)] , rating)

In [44]:
# Create a function to show the movies predicted for the user, based on the movies watched previously
def UserPredictions(user_id, top_n, item_map):
  print("Predictions for user ID : ", user_id)
  user_ratings = top_n[user_id]
  for item_id, rating in user_ratings :
    print(item_map[item_id], " : ", rating)

In [48]:
# PreviousMoviesUserWatched(ratings, 4, movie_id_to_title_map)

In [45]:
UserPredictions('1' , top_n , movie_id_to_title_map)

Predictions for user ID :  1
Shawshank Redemption, The (1994)  :  5
Departed, The (2006)  :  5
Exit Through the Gift Shop (2010)  :  5
Inside Job (2010)  :  5
Wallace & Gromit: The Best of Aardman Animation (1996)  :  5
My Fair Lady (1964)  :  5
Thing, The (1982)  :  5
Postman, The (Postino, Il) (1994)  :  5
Crumb (1994)  :  5
Shallow Grave (1994)  :  5
