# Collaborative Filtering with the MovieLens Dataset

Dataset from MovieLens: https://grouplens.org/datasets/movielens/100k
* [User-based Collaborative Filtering](#userBased)
* [Item-based Collaborative Filtering](#itemBased)
* [Supporting Materials](#references)

In [1]:
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

In [2]:
# loading data from movielens dataset
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=['userId', 'movieId', 'rating'], usecols=range(3), encoding="ISO-8859-1")
movies = pd.read_csv('ml-100k/u.item', sep='|', names=['movieId', 'title'], usecols=range(2), encoding="ISO-8859-1")

data = pd.merge(movies, ratings)
data.head()

,movieId,title,userId,rating
0,1,Toy Story (1995),308,4
1,1,Toy Story (1995),287,5
2,1,Toy Story (1995),148,4
3,1,Toy Story (1995),280,4
4,1,Toy Story (1995),66,3


In [3]:
n_ratings = len(data)
n_movies = len(data["title"].unique())

print(f"There are {n_ratings} ratings in a total of {n_movies} movies in this dataset.")

There are 100000 ratings in a total of 1664 movies in this dataset.


In [4]:
movie = "Psycho (1960)"
#movie = "2001: A Space Odyssey (1968)"
#movie = 'Star Wars (1977)'

# setting the minimum number of ratings for a movie to be considered in our algorithm
min_num_ratings = 100

<a id="userBased"></a>
## User-based Collaborative Filtering
User-based collaborative filtering relies on the idea that users who have rated items similarly so far are likely to rate items similarly in the future. Neighbouring users are identified by their similarity to the active user, each candidate item is scored from those neighbours' ratings, and the items with the highest scores are recommended.
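The idea can be sketched on a tiny, made-up rating matrix. This is only an illustration under stated assumptions, not the method used in the steps below: the names `toy`, `active`, `user_sims`, `neighbours` and `predicted` are hypothetical, and the similarity-weighted average is just one common way to score an item.

```python
import numpy as np
import pandas as pd

# Toy user x item matrix (made-up ratings; NaN = not rated)
toy = pd.DataFrame(
    {
        "A": [5, 4, 1, 5],
        "B": [4, 5, 2, 4],
        "C": [1, 2, 5, 1],
        "D": [np.nan, 5, 1, 4],
    },
    index=["u1", "u2", "u3", "u4"],
    dtype=float,
)
active = "u1"  # the user we want a score for; u1 has not rated item D

# Pearson similarity between the active user and every other user.
# DataFrame.corr works on columns, so the matrix is transposed first.
user_sims = toy.T.corr(method="pearson")[active].drop(active)

# Keep positively correlated neighbours who have rated item D
has_rated_d = toy.loc[user_sims.index, "D"].notna()
neighbours = user_sims[(user_sims > 0) & has_rated_d]

# Score item D as the similarity-weighted average of the neighbours' ratings
predicted = (neighbours * toy.loc[neighbours.index, "D"]).sum() / neighbours.sum()

print(neighbours)
print(predicted)
```

In this toy data `u3` rates in the opposite direction to `u1`, so it is excluded, and the score for item D is driven by `u2` and `u4`.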

### __Step 1__: Create a user / movie rating matrix

In [5]:
user_rating_matrix = data.pivot_table(index=['userId'], columns=['title'], values='rating')
user_rating_matrix.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
userId
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,2.0,,,,,4.0,,,...,,,,4.0,,,,,4.0,


### __Step 2__: Data normalization
Some users tend to give higher ratings and others lower ones (e.g. some users give 5 stars to any movie they enjoyed, while others reserve that rating for their absolute favourites). We compensate for this through normalization, that is, adjusting the rating scale so that it is comparable across users.

Note, however, that Pearson correlation already centres and scales each of the two series it compares, so the pandas Pearson correlation function performs this kind of adjustment implicitly. The normalized matrix computed below is shown for illustration and is not used in the later steps.
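As a small check of the point about Pearson correlation, the sketch below uses made-up ratings from two hypothetical users (`strict` and `generous`) and an illustrative helper `z_score`. It suggests that z-scoring two series before correlating them does not change their Pearson correlation.

```python
import pandas as pd

# Made-up ratings of the same five movies by two hypothetical users
strict = pd.Series([1.0, 2.0, 2.0, 3.0, 1.0])
generous = pd.Series([3.0, 5.0, 4.0, 5.0, 3.0])


def z_score(s):
    """Subtract the mean and divide by the standard deviation."""
    return (s - s.mean()) / s.std()


# Correlation on the raw ratings
raw_corr = strict.corr(generous, method="pearson")

# Correlation after z-scoring each user's ratings
norm_corr = z_score(strict).corr(z_score(generous), method="pearson")

print(raw_corr, norm_corr)
```

The two values agree up to floating-point error, because Pearson correlation is unaffected by shifting a series or rescaling it by a positive factor.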

In [6]:
normalization = "Z-score"
#normalization = "Mean Subtraction"

if normalization == "Mean Subtraction":
    user_avg = user_rating_matrix.mean(axis=1)
    norm_rating_matrix = user_rating_matrix.subtract(user_avg, axis = 'rows')
elif normalization == "Z-score":
    user_avg = user_rating_matrix.mean(axis=1)
    user_std = user_rating_matrix.std(axis=1)
    norm_rating_matrix = user_rating_matrix.subtract(user_avg, axis = 'rows').divide(user_std, axis = 'rows')
    
norm_rating_matrix

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
userId
1,,,-1.270831,1.104308,,,-0.479118,0.312595,,,...,,,,1.104308,-0.479118,,,,0.312595,
2,,,,,,,,,-2.60505,,...,,,,,,,,,,
3,,,,,-0.634553,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,-0.641460,,,,,0.825932,,,...,,,,0.825932,,,,,0.825932,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,,,,,,,,,,,...,,,,,,,,,,
940,,,,,,,,,,,...,,,,,,,,,,
941,,,,,,,,,,,...,,,,,,,,,,
942,,,,,,,,-1.645849,,-1.645849,...,,,,,,,,,,


### __Step 3__: Getting all the ratings for the movie we want recommendations for

In [7]:
movie_rating = user_rating_matrix[movie]
movie_rating

userId
1      4.0
2      NaN
3      NaN
4      NaN
5      3.0
      ... 
939    NaN
940    NaN
941    NaN
942    NaN
943    2.0
Name: Psycho (1960), Length: 943, dtype: float64

### __Step 4__: Getting similarity scores for each movie

A similarity score quantifies how alike two sets of ratings are; it is closely related to a distance, with a higher similarity corresponding to a smaller distance.
We use a correlation-based similarity metric, the Pearson correlation: `corrwith` compares the chosen movie's ratings with every other movie's ratings across users.
Pearson correlation is a reasonable choice here under the assumption that each user's ratings are roughly normally distributed.

__Note__: at this stage, our results are probably polluted by movies with very few ratings from people who happened to also like our chosen movie. We address this with a "popularity" threshold in __Step 5__ and __Step 6__.
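To see why sparsely rated movies pollute the ranking, consider the made-up example below, where the series `popular` and `obscure` are hypothetical. When only two users have rated both movies, the Pearson correlation is computed from two points and is therefore exactly +1 or -1, whatever the ratings mean in practice.

```python
import numpy as np
import pandas as pd

# Made-up ratings from six users; only two of them rated the obscure movie
popular = pd.Series([4.0, 5.0, 3.0, 4.0, 2.0, 5.0])
obscure = pd.Series([np.nan, 5.0, np.nan, 4.0, np.nan, np.nan])

# pandas only uses the rows where both series have a value
overlap = int((popular.notna() & obscure.notna()).sum())
score = popular.corr(obscure, method="pearson")

# Requiring a minimum overlap turns the unreliable score into NaN
guarded = popular.corr(obscure, method="pearson", min_periods=3)

print(overlap, score, guarded)
```

This is the pattern visible in the output of this step, where the top of the list is filled with scores of 1.0.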

In [8]:
sim_scores_df = user_rating_matrix.corrwith(movie_rating, method='pearson')
sim_scores_df = sim_scores_df.dropna().reset_index().rename(columns={0: 'score'})

# deleting our chosen (and already watched) movie from our considerations
sim_scores_df = sim_scores_df[sim_scores_df["title"]!=movie]

# sorting items from the most similar to the least
sim_scores_df = pd.DataFrame(sim_scores_df).sort_values(by=["score"], ascending=False)
sim_scores_df.head(10)

,title,score
0,'Til There Was You (1997),1.0
10,8 Seconds (1994),1.0
403,Far From Home: The Adventures of Yellow Dog (1...,1.0
355,Dream With the Fishes (1997),1.0
1006,Romper Stomper (1992),1.0
1016,"Run of the Country, The (1995)",1.0
1306,Year of the Horse (1997),1.0
766,Maya Lin: A Strong Clear Vision (1994),1.0
585,I Like It Like That (1994),1.0
230,"Cement Garden, The (1993)",1.0


### __Step 5__: Getting a list of all the movies that meet our established minimum number of ratings

In [9]:
popular_movies = data[["title", "movieId"]].groupby(['title']).count().sort_values(by=["movieId"], ascending=False).reset_index()
popular_movies = popular_movies[popular_movies["movieId"]>= min_num_ratings]

print(f"There are {len(popular_movies)} movies we will consider for our recommendation algorithm.")

There are 338 movies we will consider for our recommendation algorithm.
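The same popularity filter can be sketched more compactly with `Series.value_counts`. The example below runs on a made-up frame `toy` that reuses the column names of `data`; `toy_min_ratings` is a hypothetical threshold chosen for the toy data, not the notebook's `min_num_ratings`.

```python
import pandas as pd

# Made-up long-format ratings with the same column names as `data`
toy = pd.DataFrame(
    {
        "title": ["A", "A", "A", "B", "B", "C"],
        "userId": [1, 2, 3, 1, 2, 1],
        "rating": [4, 5, 3, 2, 4, 5],
    }
)
toy_min_ratings = 2  # hypothetical threshold for the toy data

# Number of ratings per title
counts = toy["title"].value_counts()

# Titles that reach the threshold
popular_titles = counts[counts >= toy_min_ratings].index.tolist()

print(popular_titles)
```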


### __Step 6__: Eliminating movies with few ratings and getting our final recommendations. 

In [10]:
recommendations = sim_scores_df[sim_scores_df["title"].isin(popular_movies['title'])].sort_values(by=["score"], ascending=False)
recommendations["title"][:10]

662                       Kingpin (1996)
367                       Ed Wood (1994)
1155                  Taxi Driver (1976)
802                        Mother (1996)
329         Devil's Advocate, The (1997)
218                        Carrie (1976)
868          Nutty Professor, The (1996)
961                   Raging Bull (1980)
748     Manchurian Candidate, The (1962)
208                Cable Guy, The (1996)
Name: title, dtype: object

<a id="itemBased"></a>
## Item-based Collaborative Filtering

### __Step 1__: Create a user / movie rating matrix

In [11]:
user_rating_matrix = data.pivot_table(index=['userId'], columns=['title'], values='rating')

### __Step 2__: Creating a correlation matrix between movies

In [12]:
corr_matrix = user_rating_matrix.corr(method='pearson', min_periods=min_num_ratings)
corr_matrix.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
title
'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
1-900 (1994),,,,,,,,,,,...,,,,,,,,,,
101 Dalmatians (1996),,,1.0,,,,,,,,...,,,,,,,,,,
12 Angry Men (1957),,,,1.0,,,,,,,...,,,,,,,,,,
187 (1997),,,,,,,,,,,...,,,,,,,,,,
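The effect of `min_periods` can be sketched on a small, made-up matrix. The frame `toy` and the names `loose` and `strict` are illustrative. A pair of columns with fewer overlapping ratings than `min_periods` gets NaN instead of a correlation, which would also explain why some diagonal entries above are NaN: those movies have fewer than `min_num_ratings` ratings in total.

```python
import numpy as np
import pandas as pd

# Made-up user x movie matrix:
# X and Y share four raters, Z shares only two raters with X or Y
toy = pd.DataFrame(
    {
        "X": [5.0, 4.0, 1.0, 2.0, np.nan],
        "Y": [4.0, 5.0, 2.0, 1.0, np.nan],
        "Z": [np.nan, np.nan, 1.0, 3.0, 5.0],
    }
)

# No real constraint: any pair with at least one shared rater gets a value
loose = toy.corr(method="pearson", min_periods=1)

# Require at least three shared raters per pair
strict = toy.corr(method="pearson", min_periods=3)

print(loose)
print(strict)
```

With the stricter setting, the X/Z and Y/Z entries become NaN, while the X/Y correlation is unchanged.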


### __Step 3__: Sorting similarity scores and dropping the movie we have already watched

In [13]:
sims = corr_matrix[movie].dropna().sort_values(ascending=False).drop(movie)
sims[:10]

title
Taxi Driver (1976)            0.458875
Rear Window (1954)            0.407264
North by Northwest (1959)     0.363540
Godfather, The (1972)         0.336903
Birds, The (1963)             0.324694
GoodFellas (1990)             0.309107
Graduate, The (1967)          0.307563
Chinatown (1974)              0.306500
Clockwork Orange, A (1971)    0.304095
Jaws (1975)                   0.293453
Name: Psycho (1960), dtype: float64

<a id="references"></a>
## References:
* https://towardsdatascience.com/user-user-collaborative-filtering-for-jokes-recommendation-b6b1e4ec8642
* https://towardsdatascience.com/recommender-systems-item-customer-collaborative-filtering-ff0c8f41ae8a
* https://www.analyticsvidhya.com/blog/2021/05/item-based-collaborative-filtering-build-your-own-recommender-system/
* https://towardsdatascience.com/comprehensive-guide-on-item-based-recommendation-systems-d67e40e2b75d
* https://towardsdatascience.com/item-based-collaborative-filtering-in-python-91f747200fab