# Collaborative Filtering for Recommendation Systems
## User-based filtering

If we have a dataset of users who have watched movies, bought items, and so on, we can use the preferences of similar users to recommend items to a target user.
In this example, we use a movie rating dataset to recommend new movies to users.

## Import packages

In [1]:
# Data processing
import pandas as pd
import numpy as np
import scipy.stats
# Visualization
import seaborn as sns
# Similarity
from sklearn.metrics.pairwise import cosine_similarity

## Load data

In [2]:
# The data comes from https://grouplens.org/datasets/movielens/
# We use only two files: ratings.csv and movies.csv.

# Read in data
ratings = pd.read_csv('../data/ml-latest-small/ratings.csv')
movies = pd.read_csv('../data/ml-latest-small/movies.csv')

In [3]:
# Take a look at the rating data
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [4]:
# Take a look at the movies data
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
# Get some information about our data

print('unique users #:', ratings['userId'].nunique())
print('unique movies #:', ratings['movieId'].nunique())
print('unique ratings #:', ratings['rating'].nunique())
print('unique ratings:', sorted(ratings['rating'].unique()))

unique users #: 610
unique movies #: 9724
unique ratings #: 10
unique ratings: [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]


In [6]:
# Merge ratings and movies datasets
df = pd.merge(ratings, movies, on='movieId', how='inner')
df.head()

,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


## Transform the dataset into a matrix format

In [7]:
# The rows of the matrix are users, and the columns are movies.
# Each value is the user's rating of the movie if there is one; otherwise it is NaN.
matrix_user_item = df.pivot_table(index='userId', columns='movieId', values='rating')
matrix_user_item.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,
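
To make the reshaping step concrete, here is a minimal sketch on a toy table. The `toy_ratings` values are made up for illustration and are not taken from MovieLens. Note that `pivot_table` aggregates with the mean by default, so duplicate (user, movie) pairs would be averaged rather than raising an error.

```python
import pandas as pd

# Toy long-format ratings (made-up values, not from MovieLens)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 5.0, 3.0, 2.5, 4.5],
})

# One row per user, one column per movie; missing (user, movie) pairs become NaN
toy_matrix = toy_ratings.pivot_table(index="userId", columns="movieId", values="rating")
print(toy_matrix)
```

User 2 never rated movie 20, so that cell is NaN in `toy_matrix`.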


## Convert ratings to binary values.

In [8]:
# If rating >= 3, set the value to 1; otherwise 0.
# Unrated (NaN) entries also become 0, because NaN >= 3 evaluates to False.
matrix = matrix_user_item.copy()
matrix[:] = np.where(matrix >=3, 1, 0)

In [9]:
matrix.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId
1,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
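
The binarization above puts "not rated" and "rated below 3" in the same bucket. The sketch below shows this on a made-up toy matrix, together with one possible alternative that keeps NaN for unrated movies. The alternative (`liked_keep_nan`) is an assumption for illustration and is not what this notebook uses.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix (made-up values); NaN means "not rated"
toy = pd.DataFrame(
    {10: [4.0, np.nan, 2.0], 20: [np.nan, 3.0, 5.0]},
    index=pd.Index([1, 2, 3], name="userId"),
)

# Same rule as the notebook: NaN >= 3 is False, so unrated entries become 0
liked = pd.DataFrame(np.where(toy >= 3, 1, 0), index=toy.index, columns=toy.columns)

# Hypothetical alternative: binarize but keep NaN where there is no rating
liked_keep_nan = (toy >= 3).astype(float).where(toy.notna())

print(liked)
print(liked_keep_nan)
```

Which variant is preferable depends on whether "did not watch" should count as a negative signal when comparing users.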


# Identify similar users

We can use one of the following methods to identify similar users.

## 1. Identify similar users using Pearson correlation

In [10]:
user_similarity_pearson = matrix.T.corr()
user_similarity_pearson.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
userId
1,1.0,0.017802,0.040994,0.147724,0.118053,0.096331,0.136577,0.133735,0.048678,-0.003592,...,0.058251,0.138752,0.174025,0.067439,0.120601,0.105963,0.257209,0.244912,0.090242,0.086501
2,0.017802,1.0,-0.002272,-0.006975,0.027143,0.01351,0.013513,0.025938,-0.003126,0.049795,...,0.206636,0.012001,-0.001077,-0.005269,-0.007247,0.014524,0.00762,0.036485,0.028482,0.083128
3,0.040994,-0.002272,1.0,-0.005693,-0.002768,-0.007604,-0.004478,-0.00287,-0.002551,-0.004606,...,-0.004412,-0.004773,0.005014,-0.0043,0.011835,-0.006599,0.012234,-0.000653,-0.002661,0.013694
4,0.147724,-0.006975,-0.005693,1.0,0.115175,0.055245,0.093989,0.038909,0.00558,0.015799,...,0.048917,0.086563,0.236024,0.026831,0.058189,0.141236,0.101578,0.09774,0.017544,0.049093
5,0.118053,0.027143,-0.002768,0.115175,1.0,0.260769,0.102658,0.383165,-0.003807,0.023511,...,0.056807,0.345075,0.087951,0.253601,0.122297,0.084568,0.136768,0.09969,0.256999,0.040823
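
As a small illustration of what `matrix.T.corr()` computes, the sketch below runs the same call on a made-up 3-user toy matrix. `DataFrame.corr()` correlates columns, which is why the matrix is transposed first so that each user becomes a column.

```python
import pandas as pd

# Toy binary "liked" matrix (made-up values): rows are users, columns are movies
toy_liked = pd.DataFrame(
    [[1, 0, 1, 0],
     [1, 0, 1, 1],
     [0, 1, 0, 1]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=[10, 20, 30, 40],
    dtype=float,
)

# corr() works column-wise, so transpose to get a user-by-user matrix
toy_pearson = toy_liked.T.corr()
print(toy_pearson)
```

Users 1 and 3 like exactly opposite sets of movies, so their Pearson correlation is -1, while every user correlates perfectly with themselves.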


## 2. Identify similar users using cosine similarity

In [11]:
user_similarity_cosine_vals = cosine_similarity(matrix.fillna(0))
user_similarity_cosine = pd.DataFrame(data=user_similarity_cosine_vals, index=matrix.index, columns=matrix.index)
user_similarity_cosine.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
userId
1,1.0,0.025603,0.047036,0.164717,0.126211,0.120263,0.150023,0.142017,0.05704,0.012685,...,0.072808,0.153089,0.208015,0.081469,0.138825,0.147913,0.272311,0.271855,0.098421,0.132866
2,0.025603,1.0,0.0,0.0,0.030429,0.022448,0.018871,0.029348,0.0,0.055048,...,0.210644,0.017716,0.013835,0.0,0.0,0.030567,0.01459,0.048154,0.031639,0.0961
3,0.047036,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.016944,0.0,0.017568,0.007487,0.017869,0.009829,0.0,0.027694
4,0.164717,0.0,0.0,1.0,0.122352,0.076721,0.106231,0.047203,0.013271,0.029512,...,0.061599,0.099731,0.261456,0.039489,0.074981,0.174527,0.117327,0.125856,0.025443,0.09092
5,0.126211,0.030429,0.0,0.122352,1.0,0.26742,0.10853,0.385794,0.0,0.030151,...,0.062932,0.349334,0.102299,0.258199,0.129636,0.100453,0.143839,0.112095,0.259938,0.060377
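
For intuition, cosine similarity between two users is the dot product of their rating vectors divided by the product of the vector lengths. The following is a hand-rolled NumPy sketch of that formula on made-up data. The helper name `cosine_sim_matrix` is hypothetical, and mapping all-zero rows to similarity 0 is a choice made here for illustration.

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    Xn = X / norms           # scale every row to unit length
    return Xn @ Xn.T         # dot products of unit vectors are cosines

# Toy binary "liked" matrix (made-up values): 3 users x 4 movies
toy = np.array([[1, 0, 1, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 1]])
sims = cosine_sim_matrix(toy)
print(sims.round(3))
```

Unlike Pearson correlation, cosine similarity on 0/1 vectors is never negative: users with no liked movie in common get 0 rather than a negative score.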


In [12]:
# Choose Pearson or cosine similarity
user_similarity = user_similarity_cosine

# Find similar users

In [13]:
# Pick a user ID
picked_userid = 5
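# Remove picked user ID from the candidate list
# (reassign instead of inplace=True so user_similarity_cosine is left intact and the cell can be re-run)
user_similarity = user_similarity.drop(index=picked_userid)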
# Take a look at the data
user_similarity.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
userId
1,1.0,0.025603,0.047036,0.164717,0.126211,0.120263,0.150023,0.142017,0.05704,0.012685,...,0.072808,0.153089,0.208015,0.081469,0.138825,0.147913,0.272311,0.271855,0.098421,0.132866
2,0.025603,1.0,0.0,0.0,0.030429,0.022448,0.018871,0.029348,0.0,0.055048,...,0.210644,0.017716,0.013835,0.0,0.0,0.030567,0.01459,0.048154,0.031639,0.0961
3,0.047036,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.016944,0.0,0.017568,0.007487,0.017869,0.009829,0.0,0.027694
4,0.164717,0.0,0.0,1.0,0.122352,0.076721,0.106231,0.047203,0.013271,0.029512,...,0.061599,0.099731,0.261456,0.039489,0.074981,0.174527,0.117327,0.125856,0.025443,0.09092
6,0.120263,0.022448,0.0,0.076721,0.26742,1.0,0.057189,0.329074,0.020004,0.016682,...,0.01741,0.408036,0.092238,0.380952,0.095634,0.092632,0.145903,0.138633,0.191759,0.047967


In [14]:
# Number of similar users
n = 10
# User similarity threshold
user_similarity_threshold = 0.25
# Get top n similar users
similar_users = user_similarity[user_similarity[picked_userid]>user_similarity_threshold][picked_userid].sort_values(ascending=False)[:n]
# Print out top n similar users
print(f'Top {n} similar users for user {picked_userid}: ', similar_users)

Top 10 similar users for user 5:  userId
470    0.512652
229    0.509902
235    0.494032
565    0.456435
468    0.456435
142    0.441894
512    0.433614
455    0.426401
58     0.420644
117    0.416422
Name: 5, dtype: float64
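
The selection above can be wrapped in a small function so that it works for any user without modifying the similarity matrix. This is only a sketch: the name `top_similar_users` and the toy similarity values are made up for illustration.

```python
import pandas as pd

def top_similar_users(similarity, user_id, n=10, threshold=0.25):
    """Return up to n users whose similarity to user_id exceeds threshold."""
    # Take the column for user_id and exclude the user themselves
    scores = similarity[user_id].drop(index=user_id, errors="ignore")
    # Keep only sufficiently similar users, most similar first
    scores = scores[scores > threshold]
    return scores.sort_values(ascending=False).head(n)

# Toy symmetric similarity matrix (made-up values)
ids = pd.Index([1, 2, 3, 4], name="userId")
toy_sim = pd.DataFrame(
    [[1.0, 0.6, 0.1, 0.4],
     [0.6, 1.0, 0.3, 0.2],
     [0.1, 0.3, 1.0, 0.5],
     [0.4, 0.2, 0.5, 1.0]],
    index=ids, columns=ids,
)

result = top_similar_users(toy_sim, user_id=1, n=2)
print(result)
```

For user 1, user 3 falls below the threshold, so the result contains users 2 and 4 in that order.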


## Narrow down candidate items

1. Remove the items already associated with the target user.
2. Keep only the items associated with the similar users.

In [15]:
# Keep only the row for userId == picked_userid in the user-item matrix and drop the movies this user has not rated
picked_userid_watched = matrix_user_item[matrix_user_item.index == picked_userid].dropna(axis=1, how='all')
picked_userid_watched

movieId,1,21,34,36,39,50,58,110,150,153,...,534,588,589,590,592,594,595,596,597,608
userId
5,4.0,4.0,4.0,4.0,3.0,4.0,5.0,4.0,3.0,3.0,...,3.0,4.0,3.0,5.0,3.0,5.0,5.0,5.0,3.0,3.0


In [16]:
# Movies that similar users watched. Remove movies that none of the similar users have watched
similar_user_movies = matrix_user_item[matrix_user_item.index.isin(similar_users.index)].dropna(axis=1, how='all')
similar_user_movies


movieId,1,2,3,5,6,7,10,11,14,17,...,880,986,1022,1023,1027,1028,1035,1036,1073,1079
userId
58,,,3.0,4.0,,5.0,,,,,...,,,,,,,,,5.0,
117,,3.0,3.0,3.0,3.0,4.0,3.0,4.0,,3.0,...,3.0,3.0,4.0,3.0,3.0,4.0,4.0,3.0,,4.0
142,,,,,,,,,,,...,,,,,,,,,,
229,5.0,,,3.0,,,4.0,,,,...,,,,,,,,,,
235,,,,,,,2.0,4.0,,,...,,,,,,,,,,
455,,,,,,,,4.0,,,...,,,,,,,,,,
468,4.0,,,,,,,,,,...,,,,,,,,,,
470,4.0,3.0,3.0,3.0,3.0,3.0,3.0,,4.0,,...,,,,,,,,,4.0,
512,,3.0,,,,,,,,,...,,,,,,,,,,
565,,,,,,,,,,,...,,,,,,,,,,


In [17]:
# Remove the watched movies from the movie list
# errors='ignore' drops columns if they exist without giving an error message
similar_user_movies.drop(picked_userid_watched.columns,axis=1, inplace=True, errors='ignore')
# Take a look at the data
similar_user_movies

movieId,2,3,5,6,7,10,11,14,17,19,...,880,986,1022,1023,1027,1028,1035,1036,1073,1079
userId
58,,3.0,4.0,,5.0,,,,,1.0,...,,,,,,,,,5.0,
117,3.0,3.0,3.0,3.0,4.0,3.0,4.0,,3.0,2.0,...,3.0,3.0,4.0,3.0,3.0,4.0,4.0,3.0,,4.0
142,,,,,,,,,,,...,,,,,,,,,,
229,,,3.0,,,4.0,,,,3.0,...,,,,,,,,,,
235,,,,,,2.0,4.0,,,,...,,,,,,,,,,
455,,,,,,,4.0,,,,...,,,,,,,,,,
468,,,,,,,,,,,...,,,,,,,,,,
470,3.0,3.0,3.0,3.0,3.0,3.0,,4.0,,3.0,...,,,,,,,,,4.0,
512,3.0,,,,,,,,,,...,,,,,,,,,,
565,,,,,,,,,,3.0,...,,,,,,,,,,
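
The two narrowing steps can be traced on a toy matrix small enough to check by hand. All values, and the names `target` and `neighbours`, are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy user-item rating matrix (made-up values); NaN means "not rated"
toy = pd.DataFrame(
    {10: [4.0, 5.0, np.nan, 3.0],
     20: [np.nan, 4.0, 2.0, np.nan],
     30: [3.0, np.nan, np.nan, np.nan],
     40: [np.nan, np.nan, np.nan, 5.0]},
    index=pd.Index([1, 2, 3, 4], name="userId"),
)
target, neighbours = 1, [2, 3]

# Movies the target user has already rated (10 and 30)
watched = toy.loc[target].dropna().index

# Movies rated by at least one similar user (10 and 20)
candidates = toy.loc[neighbours].dropna(axis=1, how="all")

# Remove what the target has already seen; errors="ignore" skips absent columns
candidates = candidates.drop(columns=watched, errors="ignore")
print(candidates)
```

Only movie 20 survives: movie 10 was already rated by the target, movie 30 was rated by nobody among the neighbours, and movie 40 was rated only by user 4, who is not a neighbour.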


# Recommend items

We can use one of the following methods to decide which items to recommend to the target user.

## 1. Recommend based on the number of similar users that have watched a movie.

In [18]:
# Binarize the similar users' ratings: 1 if rating >= 3, else 0 (NaN also becomes 0)
matrix_similar_users = similar_user_movies.copy()
matrix_similar_users[:] = np.where(similar_user_movies >= 3, 1, 0)
# Count, per movie, how many similar users rated it 3 or higher
items_watch_count = matrix_similar_users.sum(axis=0)

# Build a DataFrame of movie IDs and their counts
item_score = pd.DataFrame({
    "movieId": items_watch_count.index,
    "watch_count": items_watch_count.values,
})
# Sort the movies by score
ranked_item_score = item_score.sort_values(by='watch_count', ascending=False)
# Select top m items
m = 10
recommended_items = ranked_item_score[:m]
recommended_items

,movieId,watch_count
87,356,9.0
32,165,9.0
94,377,9.0
119,539,8.0
49,225,8.0
106,480,8.0
99,434,8.0
103,454,7.0
14,32,7.0
71,292,7.0
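
The count-based score can also be written as a single comparison followed by a column sum. The sketch below does this on made-up data; `toy_candidates` is a hypothetical stand-in for the candidate matrix.

```python
import numpy as np
import pandas as pd

# Toy candidate matrix (made-up values): rows are similar users, columns are movies
toy_candidates = pd.DataFrame(
    {20: [4.0, 2.0, 3.5], 40: [np.nan, 5.0, np.nan], 50: [3.0, 4.0, 4.5]},
    index=pd.Index([2, 3, 4], name="userId"),
)

# True where a similar user rated the movie 3 or higher; NaN compares as False
like_counts = (toy_candidates >= 3).sum(axis=0).sort_values(ascending=False)
print(like_counts)
```

Movie 50 ranks first because all three similar users rated it at least 3.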


## 2. Recommend based on movie rating of similar users

In [19]:
# A dictionary to store item scores
item_score = {}

# Loop through items
for i in similar_user_movies.columns:
  # Get the ratings for movie i
  movie_rating = similar_user_movies[i]
  # Create a variable to store the score
  total = 0
  # Create a variable to store the number of scores
  count = 0
  # Loop through similar users
  for u in similar_users.index:
    # If the movie has a rating from this user
    if not pd.isna(movie_rating[u]):
      # Score is the user similarity multiplied by the movie rating
      score = similar_users[u] * movie_rating[u]
      # Add the score to the total score for the movie so far
      total += score
      # Add 1 to the count
      count +=1
  # Get the average score for the item
  item_score[i] = total / count
# Convert dictionary to pandas dataframe
item_score = pd.DataFrame(item_score.items(), columns=['movieId', 'movie_score'])
    
# Sort the movies by score
ranked_item_score = item_score.sort_values(by='movie_score', ascending=False)
# Select top m movies
m = 10
recommended_items = ranked_item_score[:m]
recommended_items

,movieId,movie_score
11,25,2.470161
129,593,2.183571
45,218,2.103222
121,543,2.103222
46,222,2.103222
110,491,2.082112
172,1073,2.076916
81,342,2.071415
7,14,2.05061
111,494,2.05061
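
The nested loop computes, for each movie, the mean of `similarity * rating` over the similar users who rated it. The sketch below reproduces that with vectorized pandas operations on made-up data. It also shows a similarity-weighted average, which divides by the sum of similarities instead of the count; that variant is an assumption added for comparison and is not the scoring used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy inputs (made-up values): ratings by similar users, and their similarity scores
toy_candidates = pd.DataFrame(
    {20: [4.0, 2.0], 40: [np.nan, 5.0]},
    index=pd.Index([2, 3], name="userId"),
)
toy_similar = pd.Series([0.5, 0.4], index=pd.Index([2, 3], name="userId"))

# similarity * rating for every (user, movie) pair; NaN stays NaN
weighted = toy_candidates.mul(toy_similar, axis=0)

# Same quantity as total / count in the loop: the mean over users who rated the movie
mean_score = weighted.mean(axis=0)

# Alternative (assumption): divide by the summed similarity of the raters instead
sim_mass = toy_candidates.notna().mul(toy_similar, axis=0).sum(axis=0)
weighted_avg = weighted.sum(axis=0) / sim_mass

print(mean_score)
print(weighted_avg)
```

The weighted average stays on the original rating scale (0.5 to 5), whereas the loop's score shrinks with the similarity values, which is why the `movie_score` values above are all below 3.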


source: https://medium.com/grabngoinfo/recommendation-system-user-based-collaborative-filtering-a2e76e3e15c4 