## Collaborative Filtering - MovieLens Dataset

**Meta Data** of the MovieLens Dataset:



First, we import the libraries:

*  NumPy - for numerical array operations.
*  Pandas - for data manipulation.
*  Matplotlib - for data visualization.
*  Seaborn - for data visualization.

In [1]:
# importing the libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

Next, we read each dataset into its own pandas DataFrame using the read_csv() function from the pandas library. We also view the first few rows of each DataFrame with head() to get a glimpse of its contents.

In [2]:
# reading the movies dataset
movies = pd.read_csv('D:/200968182_DA/movies.csv')

In [3]:
# viewing the first few rows of the movies dataset
movies.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [4]:
# reading the ratings dataset
ratings = pd.read_csv('D:/200968182_DA/ratings.csv')

In [5]:
# viewing the first few rows of the ratings dataset
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


In [6]:
# reading the links dataset
links = pd.read_csv('D:/200968052_DA/links.csv')

In [7]:
# viewing the first few rows of the links dataset
links.head()

| | movieId | imdbId | tmdbId |
|---|---|---|---|
| 0 | 1 | 114709 | 862.0 |
| 1 | 2 | 113497 | 8844.0 |
| 2 | 3 | 113228 | 15602.0 |
| 3 | 4 | 114885 | 31357.0 |
| 4 | 5 | 113041 | 11862.0 |


In [8]:
# reading the tags dataset
tags = pd.read_csv('D:/200968052_DA/tags.csv')

In [9]:
# viewing the first few rows of the tags dataset
tags.head()

| | userId | movieId | tag | timestamp |
|---|---|---|---|---|
| 0 | 2 | 60756 | funny | 1445714994 |
| 1 | 2 | 60756 | Highly quotable | 1445714996 |
| 2 | 2 | 60756 | will ferrell | 1445714992 |
| 3 | 2 | 89774 | Boxing story | 1445715207 |
| 4 | 2 | 89774 | MMA | 1445715200 |


#### **Q2.** Read the "ratings.csv" file and create a pivot table with index='userId', columns='movieId', values='rating'.

To solve Question 2, we use the pivot_table() function from the pandas library and pass 'userId', 'movieId', and 'rating' to the index, columns, and values parameters respectively. This way, each row represents a user and each column represents a movie.

In [10]:
# creating a pivot table
pd.pivot_table(ratings, index='userId', columns='movieId', values='rating')

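| userId \ movieId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 193565 | 193567 | 193571 | 193573 | 193579 | 193581 | 193583 | 193585 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.0 | NaN | 4.0 | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 606 | 2.5 | NaN | NaN | NaN | NaN | NaN | 2.5 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 607 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 608 | 2.5 | 2.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 609 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |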
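
To make the reshaping concrete, here is a minimal sketch on a tiny, made-up ratings frame in the same long format as ratings.csv. The frame `toy_ratings` and its values are assumptions for illustration only and are not taken from MovieLens.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 5.0, 3.0, 2.5, 4.5],
})

# One row per user, one column per movie; user/movie pairs without a rating become NaN
toy_pivot = pd.pivot_table(toy_ratings, index="userId", columns="movieId", values="rating")
print(toy_pivot)

# Number of missing (unrated) cells in the 3 x 3 table
print(toy_pivot.isna().sum().sum())
```

Note that pivot_table() aggregates duplicate (userId, movieId) pairs with the mean by default; when each pair occurs once, as assumed here, the cell simply holds that rating.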


As we can see, there are a lot of NaN or missing values in the dataset. To handle the missing values, we will fill them with a constant. The constant has to be a value that does not clash with the values in the dataset. To do so, let's find the ratings present in the dataset.

In [11]:
# finding all the values in the rating column
ratings.rating.unique()

array([4. , 5. , 3. , 2. , 1. , 4.5, 3.5, 2.5, 0.5, 1.5])

As we can see, no user has given a rating of 0 to any movie. Hence, we use the constant 0 to fill in the missing values, and assign the pivot table to a variable named df for further analysis.

In [12]:
# filling the missing values in the pivot table with 0 and assigning it to a variable
df = pd.pivot_table(ratings, index='userId', columns='movieId', values='rating', fill_value=0)
df

| userId \ movieId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 193565 | 193567 | 193571 | 193573 | 193579 | 193581 | 193583 | 193585 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.0 | 0.0 | 4.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |
| 5 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 606 | 2.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.5 | 0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |
| 607 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |
| 608 | 2.5 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 4.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |
| 609 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 4.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |


As we can see, the missing values are now filled with 0. Hence, a value greater than 0 means that the user has rated the movie, and a value of 0 means that the user has not rated it. Filling the missing values with 0 also makes the pivot table fully numeric, so that it can be passed to the distance functions used in the following steps.
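
Since 0 now stands for "not rated", it can help to keep track of which cells hold real ratings. The sketch below, again on a made-up frame (`toy_ratings` is an assumption, not MovieLens data), builds the zero-filled table, derives a boolean mask of rated cells, and measures how sparse the table is.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 5.0, 3.0, 2.5, 4.5],
})

# Zero-filled user-by-movie table, as in the step above
filled = pd.pivot_table(toy_ratings, index="userId", columns="movieId",
                        values="rating", fill_value=0)

# True where a user actually rated a movie (valid because no real rating equals 0)
rated_mask = filled.gt(0)

# Fraction of user/movie pairs that carry no rating
sparsity = 1 - rated_mask.to_numpy().mean()
print(rated_mask)
print(f"sparsity: {sparsity:.3f}")
```

The mask is only meaningful because the lowest rating in the data is 0.5; with a scale that included 0, a different fill constant or an explicit mask built from the NaN version would be needed.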

#### **Q3, 4, 5.** Importing the required packages.

To solve the remaining questions, we import the following functions to compute the distances, and hence the similarity, between users.

In [13]:
# importing the packages
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine, correlation

#### **Q6.** Find the 5 most similar users for the user with user ID 10.

To solve Question 6, we first compute the pairwise cosine distances between the rows of the pivot table, that is, between the users. To do this, we use the pairwise_distances() function from the scikit-learn library and set the metric parameter to 'cosine'.

In [14]:
# computing pairwise cosine distances between the users
distance_matrix = pairwise_distances(df, metric='cosine')
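
As a small sanity check of what the cosine metric returns, the sketch below uses a made-up 3 x 3 rating matrix (the array `toy` is an assumption for illustration). It shows that the value is a distance, 1 minus the cosine similarity, so identical rating vectors are 0 apart and vectors with no movies in common are 1 apart.

```python
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.metrics import pairwise_distances

# Hypothetical zero-filled ratings: rows are users, columns are movies
toy = np.array([
    [4.0, 0.0, 5.0],
    [4.0, 0.0, 5.0],   # identical to row 0
    [0.0, 3.0, 0.0],   # no rated movies in common with row 0
])

dist = pairwise_distances(toy, metric="cosine")

# The same quantity computed by hand for rows 0 and 2: 1 - cos(angle)
manual = 1 - toy[0] @ toy[2] / (np.linalg.norm(toy[0]) * np.linalg.norm(toy[2]))

print(dist.round(3))
print(manual, cosine(toy[0], toy[2]))
```

A smaller value therefore means a more similar pair of users.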

Next, we select row 9 of the resultant distance matrix, which corresponds to user 10. We do this because arrays in Python are 0-indexed, while the user IDs start from 1 (this assumes the user IDs run contiguously from 1). Because pairwise_distances() returns cosine distances, that is, 1 minus the cosine similarity, the most similar users are the ones with the smallest distance. We therefore sort the row in ascending order with the argsort() function, which returns the positions of the sorted values. Position 0 of the sorted result is user 10 itself, at distance 0, so we slice positions 1 to 5 to get the five nearest users. After slicing, we add 1 to the positions to convert them back to user IDs.

In [15]:
# finding the top 5 most similar users (smallest cosine distance first)
distance_matrix[9].argsort()[1:6]+1

The array returned by this cell holds the user IDs of the five users most similar to user 10.
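
The neighbour lookup can also be written against the pivot table's index labels instead of row positions, which avoids the position-to-ID arithmetic. The following is a minimal sketch under stated assumptions: `most_similar` is a hypothetical helper, and `toy_pivot` is a made-up zero-filled table whose user IDs are deliberately not contiguous.

```python
import pandas as pd
from sklearn.metrics import pairwise_distances

# Hypothetical zero-filled user-by-movie table with non-contiguous user IDs
toy_pivot = pd.DataFrame(
    [[5, 4, 0, 0],
     [5, 5, 0, 0],
     [0, 0, 4, 5],
     [4, 0, 1, 0]],
    index=pd.Index([1, 2, 5, 9], name="userId"),
    columns=pd.Index([10, 20, 30, 40], name="movieId"),
    dtype=float,
)

def most_similar(pivot, label, k=5):
    """Return the k rows closest to `label` by cosine distance (hypothetical helper)."""
    dist = pairwise_distances(pivot, metric="cosine")
    pos = pivot.index.get_loc(label)                  # row position of the label
    row = pd.Series(dist[pos], index=pivot.index)     # distances keyed by user ID
    return row.drop(label).nsmallest(k)               # exclude the user itself

neighbours = most_similar(toy_pivot, 1, k=2)
print(neighbours)
```

In this toy table the two nearest rows sit at positions 1 and 3, so adding 1 to the positions would report users 2 and 4, and user 4 does not exist. Adding 1 is only valid when the IDs run contiguously from 1.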

In [16]:
# finding the movies rated by both user 2 and user 338
set(ratings.loc[ratings.userId==2, 'movieId']).intersection(set(ratings.loc[ratings.userId==338, 'movieId']))

{318, 6874}

In [17]:
# comparing the ratings given by both users to the movies they have in common
print('Movie ID 318: ', movies.loc[movies.movieId==318, 'title'].values[0])
print('Rating by User 2: ', ratings.loc[((ratings.userId==2) & (ratings.movieId==318)), 'rating'].values[0])
print('Rating by User 338: ', ratings.loc[((ratings.userId==338) & (ratings.movieId==318)), 'rating'].values[0])
print()
print('Movie ID 6874: ', movies.loc[movies.movieId==6874, 'title'].values[0])
print('Rating by User 2: ', ratings.loc[((ratings.userId==2) & (ratings.movieId==6874)), 'rating'].values[0])
print('Rating by User 338: ', ratings.loc[((ratings.userId==338) & (ratings.movieId==6874)), 'rating'].values[0])

Movie ID 318:  Shawshank Redemption, The (1994)
Rating by User 2:  3.0
Rating by User 338:  5.0

Movie ID 6874:  Kill Bill: Vol. 1 (2003)
Rating by User 2:  4.0
Rating by User 338:  4.5


In [18]:
# finding the movies that both users rated 4.0 or higher
set(ratings.loc[((ratings.userId==2) & (ratings.rating>=4.0)), 'movieId']).intersection(set(ratings.loc[((ratings.userId==338) & (ratings.rating>=4.0)), 'movieId']))

{6874}

In [19]:
print('Movie ID 6874: ', movies.loc[movies.movieId==6874, 'title'].values[0])

Movie ID 6874:  Kill Bill: Vol. 1 (2003)
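
The set intersections and per-movie lookups above can be condensed into a single merge that lists every shared movie with both ratings side by side. This is a sketch on made-up data: `shared_ratings` is a hypothetical helper and `toy_ratings` is an assumed frame with the same columns as ratings.csv.

```python
import pandas as pd

# Hypothetical toy ratings with the same columns as ratings.csv (timestamp omitted)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2],
    "movieId": [10, 20, 30, 20, 30, 40],
    "rating":  [3.0, 3.0, 4.0, 5.0, 4.5, 2.0],
})

def shared_ratings(ratings, user_a, user_b):
    """Movies rated by both users, one row per movie (hypothetical helper)."""
    a = ratings.loc[ratings["userId"] == user_a, ["movieId", "rating"]]
    b = ratings.loc[ratings["userId"] == user_b, ["movieId", "rating"]]
    # Inner join on movieId keeps only the movies present for both users
    return a.merge(b, on="movieId", suffixes=(f"_user{user_a}", f"_user{user_b}"))

shared = shared_ratings(toy_ratings, 1, 2)

# Movies that both users rated 4.0 or higher
both_liked = shared[(shared["rating_user1"] >= 4.0) & (shared["rating_user2"] >= 4.0)]
print(shared)
print(both_liked)
```

Joining the result with the movies DataFrame on movieId would add the titles without a separate lookup per movie.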


In [20]:
# creating a pivot table with movies as rows and users as columns for item-based similarity
df = pd.pivot_table(ratings, index='movieId', columns='userId', values='rating', fill_value=0)
df

| movieId \ userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4 | 0.0 | 0.0 | 0 | 4 | 0 | 4.5 | 0 | 0 | 0.0 | ... | 4.0 | 0 | 4 | 3 | 4.0 | 2.5 | 4 | 2.5 | 3 | 5.0 |
| 2 | 0 | 0.0 | 0.0 | 0 | 0 | 4 | 0.0 | 4 | 0 | 0.0 | ... | 0.0 | 4 | 0 | 5 | 3.5 | 0.0 | 0 | 2.0 | 0 | 0.0 |
| 3 | 4 | 0.0 | 0.0 | 0 | 0 | 5 | 0.0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 2.0 | 0 | 0.0 |
| 4 | 0 | 0.0 | 0.0 | 0 | 0 | 3 | 0.0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 5 | 0 | 0.0 | 0.0 | 0 | 0 | 5 | 0.0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 0 | 3 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 193581 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 193583 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 193585 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 193587 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0.0 |


In [21]:
# computing pairwise correlation distances between the movies
distance_matrix = pairwise_distances(df, metric='correlation')
distance_matrix
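
For reference, the correlation metric is 1 minus the Pearson correlation between two rows, so it ranges from 0 (perfectly correlated) to 2 (perfectly anti-correlated). The sketch below checks this on a small made-up matrix (`toy` is an assumption for illustration) by comparing against np.corrcoef().

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical ratings: rows are movies, columns are users
toy = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with row 0
    [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with row 0
])

dist = pairwise_distances(toy, metric="correlation")

# Pearson correlation between every pair of rows
pearson = np.corrcoef(toy)

print(dist.round(3))
print((1 - pearson).round(3))
```

Unlike the cosine metric, each row is mean-centred before the comparison, so a movie's overall rating level does not affect the result.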

In [22]:
# listing all the movie titles that contain 'Godfather'
for movie in movies.title:
    if('Godfather' in str(movie)):
        print(movie)

Godfather, The (1972)
Godfather: Part II, The (1974)
Godfather: Part III, The (1990)
Tokyo Godfathers (2003)
The Godfather Trilogy: 1972-1990 (1992)


In [23]:
# finding the movieId of 'Godfather, The (1972)'
movies.loc[(movies.title=='Godfather, The (1972)'), 'movieId'].values[0]

858

In [24]:
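# movieIds are not contiguous, so we look up the row position of movieId 858 by label
# and map the sorted positions back to movieIds through the index of the pivot table
df.index[distance_matrix[df.index.get_loc(858)].argsort()[1:6]]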
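
Because the movie pivot table only has rows for rated movies, a movie's row position generally differs from its movieId. The sketch below illustrates the difference and maps the nearest neighbours back to titles. The ratings in `item_pivot` are made up, and the four titles are reused from the outputs above purely as labels.

```python
import pandas as pd
from sklearn.metrics import pairwise_distances

# Hypothetical movie-by-user table; note the gap between movieId 3 and 858
item_pivot = pd.DataFrame(
    [[5, 4, 1, 0],
     [4, 5, 0, 2],
     [1, 0, 5, 4],
     [5, 5, 0, 1]],
    index=pd.Index([1, 2, 3, 858], name="movieId"),
    dtype=float,
)
titles = pd.Series({
    1: "Toy Story (1995)",
    2: "Jumanji (1995)",
    3: "Grumpier Old Men (1995)",
    858: "Godfather, The (1972)",
})

dist = pairwise_distances(item_pivot, metric="correlation")

pos = item_pivot.index.get_loc(858)                        # row position, not 857
neighbour_ids = item_pivot.index[dist[pos].argsort()[1:3]]  # skip the movie itself
print(pos)
print(titles.loc[neighbour_ids])
```

Skipping the first sorted position assumes the movie itself is the closest row, which holds as long as no other movie has a perfectly correlated rating vector.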