This notebook uses collaborative filtering, one of the two widely adopted approaches to building recommender systems, to recommend movies to a user.

The two approaches are:

#### 1. Content Based Recommenders
Recommendations are based on a profile of the user's preferences and tastes, which is matched against attributes of the items themselves.

#### 2. Collaborative Filtering
The user is matched to similar users (based on preferences), and the system recommends items that those similar users have liked. Only users are compared, so, unlike content-based filtering, there is no need to extract information from the items themselves.

# Collaborative Filtering

Collaborative filtering in its user-based form is also known as user-user filtering, because the technique looks for similar users. There are two popular approaches to collaborative filtering:

1. User Based Collaborative Filtering - This is based on the similarity of users (i.e. their preferences). There is an active user at whom the recommendation is aimed. The engine first looks for similar users, i.e. users who share the active user's rating patterns and preferences.

2. Item Based Collaborative Filtering - This is based on finding similarity among items and building a neighbourhood of items, i.e. if a user liked one item, they might also like the neighbouring items. The criterion for building the neighbourhood is not the items' content but the users who rated them.
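To make the distinction concrete, here is a minimal sketch on a made-up rating matrix (the users, items, and values are invented for illustration and are not taken from the dataset used below). Correlating rows compares users; correlating columns compares items.

```python
import pandas as pd

# Toy user-item rating matrix (rows: users, columns: items); values are made up.
ratings = pd.DataFrame(
    {
        "item_a": [5.0, 4.0, 1.0],
        "item_b": [4.0, 5.0, 2.0],
        "item_c": [1.0, 2.0, 5.0],
    },
    index=["user_1", "user_2", "user_3"],
)

# User-based: correlate rows (users). DataFrame.corr() works column-wise,
# so the matrix is transposed first.
user_similarity = ratings.T.corr(method="pearson")

# Item-based: correlate columns (items) directly.
item_similarity = ratings.corr(method="pearson")

print(user_similarity.round(2))
print(item_similarity.round(2))
```

In this toy matrix user_1 and user_2 come out as similar, while user_3 rates in the opposite direction to user_1.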

Here, we will use user-based collaborative filtering on the ratings dataset available to us.

The process, step by step, is as follows:

#### 1. Data Acquisition

In [1]:
import pandas as pd
import numpy as np
from math import sqrt

In [24]:
movies_df = pd.read_csv('../input/moviesdataset/movies.csv')
ratings_df = pd.read_csv('../input/moviesdataset/ratings.csv')
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


#### 2. Data Modification
We will transform this dataset a bit.

First, we will remove the year from the movie title and store it in a separate column.
Next, we will drop the genres column, as we don't need it for this recommendation system.

Finally, we won't need the timestamp column in the ratings dataframe, so we'll drop that as well.

In [3]:
#Extract the parenthesised year, e.g. (1995), into a new column
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
#Strip the parentheses from the year
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)
#Remove the year text from the title column (regex=True is required in recent pandas versions)
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
#Strip any leading or trailing whitespace left behind
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())
#Drop the genres column (axis must be passed by keyword in recent pandas versions)
movies_df = movies_df.drop('genres', axis=1)

movies_df.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995
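As a side note, the same clean-up can be sketched in a single pass on a toy Series. The titles below are stand-ins, and "Untitled Film" is a hypothetical row added to show what happens when a title carries no year.

```python
import pandas as pd

titles = pd.Series(["Toy Story (1995)", "Jumanji (1995)", "Untitled Film"])

# Capture a four-digit year inside parentheses; rows without a match become NaN.
years = titles.str.extract(r"\((\d{4})\)", expand=False)

# Remove the parenthesised year, then trim leftover whitespace.
clean_titles = titles.str.replace(r"\(\d{4}\)", "", regex=True).str.strip()

print(pd.DataFrame({"title": clean_titles, "year": years}))
```

Titles without a year keep their text unchanged and get a missing value in the year column, which may need handling before any later join on year.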


In [4]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
#Drop the timestamp column (axis must be passed by keyword in recent pandas versions)
ratings_df = ratings_df.drop('timestamp', axis=1)
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


Now, the process for generating recommendations would involve the following steps:

1. Select users who have rated movies present in our input
2. Based on these users' ratings, identify the top X neighbours, i.e. the users whose ratings are most similar to the input ratings
3. For each computed neighbour, retrieve the movies they have rated
4. Compute a similarity-weighted score for each of these movies
5. Recommend the movies with the highest score

#### 3. Creating input user

In [21]:
input_dict = [
            {'title':"Dark Knight, The", 'rating':5},
            {'title':'Memento', 'rating':4.5},
            {'title':'Prestige, The', 'rating':4},
            {'title':"Man of Steel", 'rating':5},
            {'title':'Avengers: Age of Ultron', 'rating':4}
         ] 
input_movies = pd.DataFrame(input_dict)
input_movies

Unnamed: 0,title,rating
0,"Dark Knight, The",5.0
1,Memento,4.5
2,"Prestige, The",4.0
3,Man of Steel,5.0
4,"Avengers, The",4.0


Our ratings DataFrame only has movie IDs. So, to be able to find neighbours, we need to add movie IDs to our input dataframe as well.
To do this, we filter the movies dataset down to the rows corresponding to the input movies and then merge the result with the input dataframe.

In [22]:
filtered_df = movies_df[movies_df['title'].isin(input_movies['title'].tolist())]
input_with_id = pd.merge(filtered_df, input_movies)
input_with_id = input_with_id.drop('year', axis=1)
input_with_id

Unnamed: 0,movieId,title,rating
0,2153,"Avengers, The",4.0
1,89745,"Avengers, The",4.0
2,4226,Memento,4.5
3,48780,"Prestige, The",4.0
4,58559,"Dark Knight, The",5.0
5,103042,Man of Steel,5.0
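Note that a title-only merge can match more than one movie, as with the two "Avengers, The" rows above. One possible way to avoid this, sketched here on toy stand-ins for movies_df and input_movies, is to merge on title and year together. The `year` column in the input is a hypothetical addition (the input above has no such column), and the year values are illustrative.

```python
import pandas as pd

# Toy stand-ins for movies_df and input_movies; the 'year' values in the
# input are a hypothetical addition used only for disambiguation.
movies = pd.DataFrame(
    {
        "movieId": [2153, 89745, 4226],
        "title": ["Avengers, The", "Avengers, The", "Memento"],
        "year": ["1998", "2012", "2000"],
    }
)
user_input = pd.DataFrame(
    {
        "title": ["Avengers, The", "Memento"],
        "year": ["2012", "2000"],
        "rating": [4.0, 4.5],
    }
)

# Merging on both columns keeps only the intended release of each title.
matched = movies.merge(user_input, on=["title", "year"], how="inner")
print(matched)
```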


#### 4. Finding users who've rated all or some of these input movies
Now, using the input movies dataframe, we can select from the ratings dataframe the users who have rated these movies.

In [8]:
user_df = ratings_df[ratings_df['movieId'].isin(input_with_id['movieId'].tolist())]
user_df

Unnamed: 0,userId,movieId,rating
16,1,296,3.0
244,2,74458,4.0
255,2,109487,3.0
320,4,296,1.0
533,5,296,5.0
...,...,...,...
99552,610,296,5.0
100220,610,48780,4.0
100429,610,74458,4.5
100495,610,85414,3.0


In [9]:
#Group user_df by userId to get, for each user, the input movies they have rated
user_groups = user_df.groupby('userId')
#Sort the groups so that users who have rated the most input movies come first and so get higher priority in the selection
user_groups = sorted(user_groups, key=lambda x: len(x[1]), reverse=True)
user_groups[0:4]

[(18,
        userId  movieId  rating
  1796      18      296     4.0
  2087      18    48780     4.5
  2152      18    74458     4.5
  2173      18    85414     3.5
  2217      18   109487     4.5),
 (62,
        userId  movieId  rating
  8797      62      296     4.5
  8945      62    48780     5.0
  9003      62    74458     4.0
  9018      62    85414     4.5
  9071      62   109487     5.0),
 (105,
         userId  movieId  rating
  16226     105      296     5.0
  16605     105    48780     5.0
  16699     105    74458     5.0
  16727     105    85414     3.5
  16812     105   109487     4.0),
 (249,
         userId  movieId  rating
  36398     249      296     4.0
  36878     249    48780     3.5
  37039     249    74458     5.0
  37107     249    85414     3.5
  37297     249   109487     5.0)]

#### 5. Comparing and finding similar users
The next step in our algorithm involves finding the most similar users to our input user, by computing a similarity score on the basis of the ratings provided to these input movies. We'll be using the Pearson Correlation Coefficient to compute similarity between these users and our input user, since it conveniently measures the strength of linear correlation between two variables.
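For reference, the coefficient can be written in the computational form that the code in this section follows, where $x_i$ are the input user's ratings, $y_i$ are the other user's ratings, and $n$ is the number of movies the two have in common (the symbols $S_{xx}$, $S_{yy}$, $S_{xy}$ mirror the variable names `Sxx`, `Syy`, `Sxy` used in the code):

$$
r_{xy} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}},
\qquad
S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n},
\qquad
S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n},
\qquad
S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}
$$

The coefficient is undefined when $S_{xx}$ or $S_{yy}$ is zero, i.e. when one of the two users gave the same rating to every movie in common.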

We chose Pearson correlation over other similarity measures because it is invariant to shifting and positive scaling: adding a constant to all ratings, or multiplying them all by a positive constant, leaves the coefficient unchanged (multiplying by a negative constant flips its sign). This property matters for our recommendation system because two users may rate the same items on very different absolute scales and still have similar tastes.
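A quick way to check this behaviour is with `numpy.corrcoef` on made-up ratings. The arrays below are invented for illustration only.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0])  # ratings from one user
y = 0.5 * x + 2.0                   # same tastes, shifted and rescaled
z = -x                              # negative scaling reverses the ordering

r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]

print(r_xy, r_xz)
```

The first coefficient stays at (numerically) 1 despite the shift and rescaling, while the second is (numerically) -1.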

We will compute the similarity for a subset of 100 users only, since iterating over all users would be redundant and computationally expensive.

In [10]:
user_groups = user_groups[0:100]
#Pearson correlation between the input user and each user in the subset, stored as {userId: coefficient}
pearsonCorrelationDict = {}

#Sort the input once by movieId so its ratings line up with each sorted group
input_with_id = input_with_id.sort_values(by='movieId')

#For every user group in our subset
for user_id, group in user_groups:
    #Sort the current user's ratings by movieId to match the input ordering
    group = group.sort_values(by='movieId')
    #N in the formula: the number of movies this user has in common with the input
    nRatings = len(group)
    #Get the review scores for the movies that they both have in common
    temp_df = input_with_id[input_with_id['movieId'].isin(group['movieId'].tolist())]
    #And then store them in a temporary buffer variable in a list format to facilitate future calculations
    tempMovieRatingList = temp_df['rating'].tolist()
    #Let's also put the current user group reviews in a list format
    tempGroupRatingList = group['rating'].tolist()
    #Now let's calculate the pearson correlation between two users, so called, x and y
    Sxx = sum([i**2 for i in tempMovieRatingList]) - pow(sum(tempMovieRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupRatingList]) - pow(sum(tempGroupRatingList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempMovieRatingList, tempGroupRatingList)) - sum(tempMovieRatingList)*sum(tempGroupRatingList)/float(nRatings)
    
    #If the denominator is different than zero, then divide, else, 0 correlation.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[user_id] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[user_id] = 0
        
pearsonCorrelationDict.items()

dict_items([(18, -0.8385254915624226), (62, -0.5976143046671957), (105, -0.17677669529663687), (249, -0.3296902366978937), (279, -0.38348249442368487), (305, 0.6593804733957874), (352, 0.0), (573, -0.4564354645876406), (610, 0.15811388300841897), (15, -0.7071067811865475), (50, 0.8703882797784892), (63, 0.760885910252682), (122, 0), (123, 0.0), (211, -0.1654758480803738), (233, 0.9045340337332909), (247, -0.5222329678670935), (274, 0.899228803025897), (298, 0.8058229640253802), (318, -0.8703882797784892), (339, -0.8703882797784892), (414, 0.0), (483, -0.5773502691896258), (560, 0.8528028654224417), (561, 0.7385489458759964), (582, -0.7644707871564383), (596, -0.5222329678670935), (599, 0.457495710997814), (65, 0), (68, -0.9819805060619652), (80, -0.8660254037844264), (103, 0.9999999999999947), (177, 0.8660254037844264), (212, -0.7559289460184498), (227, 0.8660254037844402), (317, 1.0), (326, 0.6546536707079778), (332, 1.0000000000000107), (334, 0.3273268353539889), (357, 0.0), (378, -0
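The same computation can be wrapped in a small helper and checked against `numpy.corrcoef`. This is a sketch only: the function name `pearson` and the sample rating lists are assumptions made for illustration, not part of the notebook above.

```python
from math import sqrt

import numpy as np


def pearson(x, y):
    """Pearson correlation via the Sxx/Syy/Sxy sums; returns 0.0 when undefined."""
    n = len(x)
    sxx = sum(i * i for i in x) - sum(x) ** 2 / n
    syy = sum(j * j for j in y) - sum(y) ** 2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


a = [5, 4.5, 4, 5, 4]
b = [4, 4.5, 4.5, 3.5, 4.5]

full = pearson(a, b)
reference = np.corrcoef(a, b)[0, 1]   # should agree with the helper

two_points = pearson([5, 4], [3, 2.5])  # only two movies in common
constant = pearson([5, 4], [3, 3])      # the other user gave identical ratings

print(full, reference, two_points, constant)
```

With only two movies in common, the coefficient is always +1 or -1 (or undefined), so a perfect score from a user with few overlapping ratings carries little evidence. The zero entries correspond to users whose ratings over the common movies did not vary, or to cases where the covariance term is exactly zero.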

In [12]:
#Convert the dictionary into a DataFrame with one similarity score per userId

pearson_df = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearson_df.columns = ['similarity_index']
pearson_df['userId'] = pearson_df.index
pearson_df.index = range(len(pearson_df))
pearson_df.head()

Unnamed: 0,similarity_index,userId
0,-0.838525,18
1,-0.597614,62
2,-0.176777,105
3,-0.32969,249
4,-0.383482,279


In [14]:
#Now we will find the top 50 similar users, i.e. top 50 users from the pearson_df sorted by descending order of similarity_index
most_similar_users = pearson_df.sort_values(by='similarity_index', ascending=False)[0:50]
most_similar_users.head()

Unnamed: 0,similarity_index,userId
37,1.0,332
35,1.0,317
89,1.0,323
90,1.0,351
69,1.0,166


Our next step is to compute weighted ratings for all the movies watched by these 50 similar users and, from among those movies, pick the ones with the highest weighted recommendation score.

To do this, we first collect the movies rated by the similar users, then multiply each rating by that user's similarity_index with our input user (computed above) to get a weighted rating.

Finally, for each movie, we sum the weighted ratings across users and divide by the sum of the weights (i.e. the users' similarity indexes) to obtain the weighted recommendation score.
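The arithmetic can be illustrated on a toy table before applying it to the real data. The movie IDs, weights, and ratings below are made up.

```python
import pandas as pd

# Toy neighbour ratings: two movies; the weights are the users' similarity scores.
df = pd.DataFrame(
    {
        "movieId": [1, 1, 2],
        "similarity_index": [1.0, 0.5, 0.5],
        "rating": [4.0, 2.0, 5.0],
    }
)
df["weighted_rating"] = df["similarity_index"] * df["rating"]

# Per movie: sum of weights and sum of weighted ratings, then their ratio.
totals = df.groupby("movieId")[["similarity_index", "weighted_rating"]].sum()
score = totals["weighted_rating"] / totals["similarity_index"]

print(score)
```

Movie 1 scores (1.0 * 4.0 + 0.5 * 2.0) / (1.0 + 0.5), while movie 2, rated by a single neighbour, simply takes that neighbour's rating as its score. This is why movies with very few raters can reach the maximum score.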

To get all the movies watched by similar users, we will merge the two dataframes - most_similar_users and ratings_df

In [15]:
movies_and_ratings_by_similar_users = most_similar_users.merge(ratings_df, left_on='userId', right_on='userId', how='inner')
movies_and_ratings_by_similar_users.head()

Unnamed: 0,similarity_index,userId,movieId,rating
0,1.0,332,1,4.0
1,1.0,332,16,3.5
2,1.0,332,32,2.5
3,1.0,332,47,4.0
4,1.0,332,50,4.0


In [16]:
#multiplying ratings by their corresponding users' similarity_indexes i.e. weights to get the weighted ratings
movies_and_ratings_by_similar_users['weighted_ratings'] = movies_and_ratings_by_similar_users['similarity_index']*movies_and_ratings_by_similar_users['rating']
movies_and_ratings_by_similar_users.head()

Unnamed: 0,similarity_index,userId,movieId,rating,weighted_ratings
0,1.0,332,1,4.0,4.0
1,1.0,332,16,3.5,3.5
2,1.0,332,32,2.5,2.5
3,1.0,332,47,4.0,4.0
4,1.0,332,50,4.0,4.0


In [17]:
#Group by movie and sum both the similarity indexes (weights) and the weighted ratings,
#giving the sum of weights and the sum of weighted ratings per movie
aggregated_ratings_and_weights_for_movies = movies_and_ratings_by_similar_users.groupby('movieId')[['similarity_index', 'weighted_ratings']].sum()
aggregated_ratings_and_weights_for_movies.columns=['sum_of_weights', 'sum_of_weighted_ratings']
aggregated_ratings_and_weights_for_movies.head()

movieId,sum_of_weights,sum_of_weighted_ratings
1,24.007199,86.020278
2,15.426502,47.487068
3,3.457496,11.186244
5,5.104574,12.39866
6,8.912768,36.708243


In [18]:
#Now we will divide the weighted ratings with the weights to find the final weighted average which will be our recommendation_score
recommendation_df = pd.DataFrame()
recommendation_df['weighted_average_recommendation_score'] = aggregated_ratings_and_weights_for_movies['sum_of_weighted_ratings']/aggregated_ratings_and_weights_for_movies['sum_of_weights']
recommendation_df['movie_id'] = aggregated_ratings_and_weights_for_movies.index
recommendation_df.head()

movieId,weighted_average_recommendation_score,movie_id
1,3.583103,1
2,3.078278,2
3,3.23536,3
5,2.428931,5
6,4.118613,6


In [19]:
#We will now filter the movies_df for top 20 of these movie id's in our recommendation dataframe (top chosen on basis of 
#highest weighted recommendation score), to constitute our final recommendation.

recommendation_df = recommendation_df.sort_values(by='weighted_average_recommendation_score', ascending=False)
recommendation_df.head(20)

movieId,weighted_average_recommendation_score,movie_id
7121,5.0,7121
955,5.0,955
62293,5.0,62293
3451,5.0,3451
77846,5.0,77846
25906,5.0,25906
932,5.0,932
926,5.0,926
971,5.0,971
93008,5.0,93008


In [20]:
movies_df[movies_df['movieId'].isin(recommendation_df.head(20)['movie_id'].tolist())]

Unnamed: 0,movieId,title,year
707,926,All About Eve,1950
713,932,"Affair to Remember, An",1957
735,955,Bringing Up Baby,1938
744,971,Cat on a Hot Tin Roof,1958
1417,1939,"Best Years of Our Lives, The",1946
2582,3451,Guess Who's Coming to Dinner,1967
3253,4396,"Cannonball Run, The",1981
3691,5088,"Going Places (Valseuses, Les)",1974
4782,7121,Adam's Rib,1949
5429,25906,Mr. Skeffington,1944
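Filtering movies_df with `isin` returns the titles in movies_df order rather than in ranking order. If the ranking should be preserved, one option is a left merge from the ranked frame, sketched here on toy stand-ins for recommendation_df and movies_df (the scores are made up).

```python
import pandas as pd

# Toy stand-ins for recommendation_df (already sorted by score) and movies_df.
recs = pd.DataFrame(
    {
        "movie_id": [955, 926, 932],
        "weighted_average_recommendation_score": [5.0, 4.8, 4.6],
    }
)
movies = pd.DataFrame(
    {
        "movieId": [926, 932, 955],
        "title": ["All About Eve", "Affair to Remember, An", "Bringing Up Baby"],
    }
)

# A left merge keeps the row order of the left (ranked) frame.
top = recs.merge(movies, left_on="movie_id", right_on="movieId", how="left")

print(top[["title", "weighted_average_recommendation_score"]])
```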
