<h1 align="center"><font size="5">COLLABORATIVE FILTERING</font></h1>

**Collaborative Filtering** is a technique that uses the preferences of other users to recommend items to the input user. It finds users whose preferences and opinions are similar to the input user's and then recommends items that those users liked. <br>
<br>
Collaborative Filtering is one approach to building a recommendation system. <br>
<br>
**Recommendation systems** are a collection of algorithms used to recommend items to users based on information taken from the user.  <br>
<br>
In this notebook, we explore recommendation systems based on Collaborative Filtering and implement a simple version on a small movies dataset.

The first step, as in most analyses, is to import the libraries needed to process and analyze the data.

In [0]:
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Data Description

In [0]:
movies_df = pd.read_csv('movies.csv', sep=";")
ratings_df = pd.read_csv('ratings.csv', sep=";")

### Movies Data


In [105]:
movies_df

Unnamed: 0,MovieId,Title
0,0,AADC 2
1,1,Gundala
2,2,Dilan 1991
3,3,Bumi Manusia
4,4,Dua Garis Biru
5,5,Avengers End Game
6,6,The Lion King
7,7,Aladdin
8,8,Spiderman Far From Home
9,9,Captain Marvel


This dataframe has two features: the movie title and its MovieId. The dataset contains 10 movies: 5 Indonesian movies and 5 non-Indonesian movies. <br>
- MovieId 0 to 4 are Indonesian movies
- MovieId 5 to 9 are non-Indonesian movies

### Ratings Data


In [106]:
ratings_df.head()

Unnamed: 0,UserId,Name,MovieId,Title,Ratings
0,1,Hania,0,AADC 2,3
1,1,Hania,1,Gundala,5
2,1,Hania,2,Dilan 1991,4
3,1,Hania,3,Bumi Manusia,4
4,1,Hania,4,Dua Garis Biru,4


The ratings dataframe holds the ratings that users gave to movies they have already watched. <br>
Its features are the UserId and Name of the user, the MovieId and Title of the movie, and the rating (Ratings) the user gave to that movie.

## Collaborative Filtering
The process for creating a User Based recommendation system is as follows:
- Select a user together with the movies that user has watched
- Based on the user's ratings, find the top X neighbours
- Get the watched-movie records of each neighbour
- Calculate a similarity score using some formula
- Recommend the items with the highest score

### Input User

This is the data for a new movie reviewer. From this data, **we want to find out which other movies this user should watch.**

In [107]:
userInput = [
            {'Title':'AADC 2', 'Ratings':3},
            {'Title':'Dilan 1991', 'Ratings':2},
            {'Title':'Dua Garis Biru', 'Ratings':4},
            {'Title':'Avengers End Game', 'Ratings':5},
            {'Title':'Captain Marvel', 'Ratings':3}
         ] 
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,Ratings,Title
0,3,AADC 2
1,2,Dilan 1991
2,4,Dua Garis Biru
3,5,Avengers End Game
4,3,Captain Marvel


From this data, we can see that this user leans towards Indonesian movies: 60% (3 of 5) of the movies the user has rated are Indonesian.
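The 60% figure can be checked with a short calculation. The sketch below is self-contained: it re-enters the five input titles with the MovieIds listed in the movies table and relies on the convention stated earlier that MovieId 0 to 4 are Indonesian movies. The names `rated` and `indonesian_share` are introduced here for illustration only.

```python
import pandas as pd

# The input user's titles, with MovieIds copied from the movies table above.
rated = pd.DataFrame({
    'MovieId': [0, 2, 4, 5, 9],
    'Title': ['AADC 2', 'Dilan 1991', 'Dua Garis Biru',
              'Avengers End Game', 'Captain Marvel'],
})

# MovieId 0 to 4 are the Indonesian movies in this dataset.
indonesian_share = (rated['MovieId'] <= 4).mean()
print(f"{indonesian_share:.0%}")  # 60%
```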

### Add MovieId to input user
After entering the new user's data, the first step is to extract the input movies' IDs from the movies dataframe and add them to the input dataframe.

We can achieve this by first filtering for the rows that contain the input movies' titles and then merging this subset with the input dataframe.

In [108]:
#Keep only the movies whose title appears in the input
inputId = movies_df[movies_df['Title'].isin(inputMovies['Title'].tolist())]
#Merge on Title (the only shared column) so that each input rating gets its MovieId
inputMovies = pd.merge(inputId, inputMovies, on='Title')
inputMovies

Unnamed: 0,MovieId,Title,Ratings
0,0,AADC 2,3
1,2,Dilan 1991,2
2,4,Dua Garis Biru,4
3,5,Avengers End Game,5
4,9,Captain Marvel,3


### Users who have seen the same movies
Now that the input has movie IDs, the next step is to get the subset of users who have watched and rated the movies in the input.


In [109]:
userSubset = ratings_df[ratings_df['MovieId'].isin(inputMovies['MovieId'].tolist())]
userSubset.head()

Unnamed: 0,UserId,Name,MovieId,Title,Ratings
0,1,Hania,0,AADC 2,3
2,1,Hania,2,Dilan 1991,4
4,1,Hania,4,Dua Garis Biru,4
5,2,Topik,5,Avengers End Game,5
8,2,Topik,9,Captain Marvel,2


Group the rows by **UserId**

In [0]:
userSubsetGroup = userSubset.groupby('UserId')

Let's look at one of the users, e.g. the one with UserId 4

In [111]:
userSubsetGroup.get_group(4)

Unnamed: 0,UserId,Name,MovieId,Title,Ratings
11,4,Frans,0,AADC 2,4
13,4,Frans,2,Dilan 1991,4
15,4,Frans,4,Dua Garis Biru,3
16,4,Frans,5,Avengers End Game,5
20,4,Frans,9,Captain Marvel,4


Sort these groups so that the users who share the most movies with the input user come first. Because we will not go through every single user, this ordering makes sure the users we do keep are the ones with the most overlap, which gives a richer recommendation.

In [0]:
userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)

These are the first three groups, i.e. the users who have watched the most movies in common with the input user:

In [113]:
userSubsetGroup[0:3]

[(4,     UserId   Name  MovieId              Title  Ratings
  11       4  Frans        0             AADC 2        4
  13       4  Frans        2         Dilan 1991        4
  15       4  Frans        4     Dua Garis Biru        3
  16       4  Frans        5  Avengers End Game        5
  20       4  Frans        9     Captain Marvel        4),
 (5,     UserId   Name  MovieId              Title  Ratings
  21       5  Indra        0             AADC 2        3
  22       5  Indra        2         Dilan 1991        2
  23       5  Indra        4     Dua Garis Biru        5
  24       5  Indra        5  Avengers End Game        5
  27       5  Indra        9     Captain Marvel        5),
 (10,     UserId   Name  MovieId              Title  Ratings
  51      10  Putri        0             AADC 2        4
  53      10  Putri        2         Dilan 1991        2
  55      10  Putri        4     Dua Garis Biru        3
  56      10  Putri        5  Avengers End Game        5
  58      10  Put

### Similarity Of Users

In this part, we compare every user with the input user and find the ones that are most similar.  
We measure how similar each user is to the input user with the **Pearson Correlation Coefficient**, which measures the strength of the linear association between two variables.
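For reference, the coefficient can be written in the sums-of-squares form below, where $x_i$ are the input user's ratings, $y_i$ are the other user's ratings for the same movies, and $n$ is the number of movies they have in common. The symbols $S_{xx}$, $S_{yy}$ and $S_{xy}$ are named to match the variables `Sxx`, `Syy` and `Sxy` in the code of this notebook.

$$
r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}, \qquad
S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \qquad
S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}, \qquad
S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}
$$

The value of $r$ lies between -1 and 1. When $S_{xx}$ or $S_{yy}$ is zero (a user gave every common movie the same rating), the coefficient is undefined, and this notebook assigns 0 in that case.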

We select a subset of users to iterate through. This limit is imposed so that we do not spend too much time going through every single user. With only 22 users in this subset, the cutoff of 24 keeps all of them.

In [0]:
userSubsetGroup = userSubsetGroup[0:24]

In [0]:
#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient
pearsonCorrelationDict = {}

#For every user group in our subset
for name, group in userSubsetGroup:
    #Let's start by sorting the input and current user group so the values aren't mixed up later on
    group = group.sort_values(by='MovieId')
    inputMovies = inputMovies.sort_values(by='MovieId')
    #Get the N for the formula
    nRatings = len(group)
    #Get the review scores for the movies that they both have in common
    temp_df = inputMovies[inputMovies['MovieId'].isin(group['MovieId'].tolist())]
    #And then store them in a temporary buffer variable in a list format to facilitate future calculations
    tempRatingList = temp_df['Ratings'].tolist()
    #Let's also put the current user group reviews in a list format
    tempGroupList = group['Ratings'].tolist()
    #Now let's calculate the pearson correlation between two users, so called, x and y
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    
    #If the denominator is different than zero, then divide, else, 0 correlation.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
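As a sanity check, the sums-of-squares computation in the loop above can be compared with NumPy's built-in `np.corrcoef`. The sketch below wraps the same arithmetic in a helper named `pearson` (a name introduced here for illustration, not part of the notebook) and applies it to the input user's ratings and those of UserId 4 (Frans), both copied from the tables shown earlier.

```python
from math import sqrt
import numpy as np

def pearson(x, y):
    """Pearson correlation in the sums-of-squares form used in the loop above."""
    n = len(x)
    sxx = sum(i**2 for i in x) - sum(x)**2 / n
    syy = sum(j**2 for j in y) - sum(y)**2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    # Same convention as the notebook: zero variance gives a correlation of 0.
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

input_ratings = [3, 2, 4, 5, 3]  # input user, MovieId 0, 2, 4, 5, 9
frans_ratings = [4, 4, 3, 5, 4]  # UserId 4 (Frans), same movies

r_manual = pearson(input_ratings, frans_ratings)
r_numpy = float(np.corrcoef(input_ratings, frans_ratings)[0, 1])
print(round(r_manual, 4), round(r_numpy, 4))  # 0.3101 0.3101
```

Both values agree with the similarity of about 0.3101 that this notebook reports for UserId 4.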


In [116]:
pearsonCorrelationDict.items()

dict_items([(4, 0.3100868364730211), (5, 0.7752170911825527), (10, 0.8076923076923078), (14, -0.2941742027072765), (16, 0.7399400733959447), (7, 0.760885910252682), (8, 0.13483997249264842), (9, 0.6622661785325219), (11, 0.760885910252682), (17, 0.6882472016116852), (19, 0.30151134457776363), (1, 0.0), (13, 0), (15, 0.7559289460184533), (18, 0.49999999999999667), (22, 0), (23, 0.866025403784439), (2, 1.0), (3, -1.0), (6, 1.0), (12, 0), (21, 0)])

In [120]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['UserId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,UserId
0,0.310087,4
1,0.775217,5
2,0.807692,10
3,-0.294174,14
4,0.73994,16


The top users are the ones whose ratings are most similar to the input user's, i.e. the people with the most similar taste in movies.

In [74]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:24]
topUsers.head()

Unnamed: 0,similarityIndex,UserId
19,1.0,6
17,1.0,2
16,0.866025,23
2,0.807692,10
1,0.775217,5


From this result, the input user could watch movies together with UserId 6 and UserId 2, that is, Andre and Topik, because they have the highest similarity index with the input user.
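One caveat when reading these scores, offered as an observation rather than as part of the original analysis: the Pearson coefficient of two users who share only two movies is always exactly 1 or -1 (unless one of them gave both movies the same rating), because two points always lie on a straight line. In the subset shown earlier, Topik shares only Avengers End Game and Captain Marvel with the input user, so the perfect score rests on just two ratings. A minimal check, with the rating pairs copied from the tables above:

```python
import numpy as np

input_two = [5, 3]  # input user: Avengers End Game, Captain Marvel
topik_two = [5, 2]  # UserId 2 (Topik): the same two movies

# With only two data points the correlation is forced to +1 or -1.
r_two = float(np.corrcoef(input_two, topik_two)[0, 1])
print(round(r_two, 6))  # 1.0
```

A score of 1.0 from two shared movies is therefore weaker evidence of similar taste than, for example, 0.81 from five shared movies.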

### Recommend Movie For User

A movie can be recommended to the user by taking the weighted average of the movie's ratings, using the Pearson correlation as the weight. The first step is to get the movies watched by the users in **topUsers** from the ratings dataframe, so that every rating sits next to its user's **similarityIndex**. This is achieved below by merging the two tables.

In [75]:
topUsersRating=topUsers.merge(ratings_df, left_on='UserId', right_on='UserId', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,UserId,Name,MovieId,Title,Ratings
0,1.0,6,Andre,5,Avengers End Game,4
1,1.0,6,Andre,6,The Lion King,2
2,1.0,6,Andre,7,Aladdin,5
3,1.0,6,Andre,8,Spiderman Far From Home,4
4,1.0,6,Andre,9,Captain Marvel,3


The next step is to multiply each movie rating by its weight (the similarity index), then sum the weighted ratings and divide by the sum of the weights.

This is done by multiplying two columns, grouping the dataframe by MovieId, and then dividing two columns.

The result reflects the opinion of all similar users about each candidate movie for the input user.
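To make the arithmetic concrete, here is a minimal sketch on hypothetical data (two made-up neighbours and two made-up movies, not taken from this notebook's dataset). For each movie, the score is the sum of similarity times rating, divided by the sum of the similarities.

```python
import pandas as pd

# Hypothetical neighbours and ratings, chosen so the arithmetic is easy to follow.
toy = pd.DataFrame({
    'UserId':          [1, 2, 1, 2],
    'MovieId':         [10, 10, 11, 11],
    'similarityIndex': [1.0, 0.5, 1.0, 0.5],
    'Ratings':         [4, 2, 3, 5],
})

# Weight each rating by the similarity of the user who gave it.
toy['weightedRating'] = toy['similarityIndex'] * toy['Ratings']

# Per movie: sum of weights and sum of weighted ratings, then their ratio.
sums = toy.groupby('MovieId')[['similarityIndex', 'weightedRating']].sum()
toy_scores = sums['weightedRating'] / sums['similarityIndex']
print(toy_scores.round(3))
```

For movie 10 this gives (4 × 1.0 + 2 × 0.5) / 1.5 ≈ 3.33, and for movie 11 it gives (3 × 1.0 + 5 × 0.5) / 1.5 ≈ 3.67, so the more similar neighbour's opinion counts for more.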

In [76]:
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['Ratings']
topUsersRating.head()

Unnamed: 0,similarityIndex,UserId,Name,MovieId,Title,Ratings,weightedRating
0,1.0,6,Andre,5,Avengers End Game,4,4.0
1,1.0,6,Andre,6,The Lion King,2,2.0
2,1.0,6,Andre,7,Aladdin,5,5.0
3,1.0,6,Andre,8,Spiderman Far From Home,4,4.0
4,1.0,6,Andre,9,Captain Marvel,3,3.0


In [77]:
tempTopUsersRating = topUsersRating.groupby('MovieId')[['similarityIndex','weightedRating']].sum()
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

Unnamed: 0_level_0,sum_similarityIndex,sum_weightedRating
MovieId,Unnamed: 1_level_1,Unnamed: 2_level_1
0,7.013424,25.926711
1,5.691759,20.216037
2,6.101816,18.8204
3,0.958445,4.278706
4,2.775113,11.792006


In [78]:
#Creates an empty dataframe
recommendation_df = pd.DataFrame()
#Now we take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['MovieId'] = tempTopUsersRating.index
recommendation_df.head()

Unnamed: 0_level_0,weighted average recommendation score,MovieId
MovieId,Unnamed: 1_level_1,Unnamed: 2_level_1
0,3.696727,0
1,3.551808,1
2,3.084393,2
3,4.464217,3
4,4.249198,4


Now let's sort it and see the top 3 movies that the algorithm recommended

In [81]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head(3)

Unnamed: 0_level_0,weighted average recommendation score,MovieId
MovieId,Unnamed: 1_level_1,Unnamed: 2_level_1
5,5.015278,5
7,4.507684,7
3,4.464217,3


In [86]:
movies_df.loc[movies_df['MovieId'].isin(recommendation_df.head(3)['MovieId'].tolist())]

Unnamed: 0,MovieId,Title
3,3,Bumi Manusia
5,5,Avengers End Game
7,7,Aladdin


We can see that the top 3 movies recommended for this user are **Bumi Manusia, Avengers End Game, and Aladdin**.

## Conclusion

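The next movies that the user should watch are **Bumi Manusia, Avengers End Game and Aladdin**. The user could also watch them together with **Andre or Topik**. <br>
<br>
Note: Avengers End Game appears in the list even though the user has already watched it, because this implementation does not exclude movies the user has already rated. Since the user gave it a rating of 5, it can be read as a suggestion to watch the movie again. Its score of 5.015 is also slightly above 5; assuming a 1-to-5 rating scale, a weighted average can only leave that range when some weights are negative, as the similarity indices of UserId 3 and UserId 14 are.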
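If rewatch suggestions are not wanted, one option is to drop the movies the input user has already rated before ranking. The sketch below is an assumption about how that could be done, not part of the original analysis. It uses only the scores displayed in this notebook, so movies 6, 8 and 9 are left out; their scores are not shown, but they rank below the top 3. The names `shown_scores`, `watched_ids` and `top_unseen` are introduced here for illustration.

```python
import pandas as pd

# Scores copied from the outputs above, keyed by MovieId.
# Movies 6, 8 and 9 are omitted because their scores are not displayed.
shown_scores = pd.Series({
    0: 3.696727, 1: 3.551808, 2: 3.084393, 3: 4.464217,
    4: 4.249198, 5: 5.015278, 7: 4.507684,
}, name='score')

watched_ids = [0, 2, 4, 5, 9]  # MovieIds the input user has already rated

# Keep only movies the user has not rated, then rank by score.
unseen = shown_scores[~shown_scores.index.isin(watched_ids)]
top_unseen = unseen.sort_values(ascending=False).head(2)
print(top_unseen.index.tolist())  # [7, 3]
```

With this filter the two leading suggestions are Aladdin (MovieId 7) and Bumi Manusia (MovieId 3); the third place cannot be determined from the outputs shown here.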