# Advanced Recommendation System

## The Data

You can download the dataset [here](http://files.grouplens.org/datasets/movielens/ml-100k.zip) or just use the `u.data` file.

In [1]:
import numpy as np
import pandas as pd

In [2]:
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=column_names)

In [3]:
df.head()

| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 0 | 50 | 5 | 881250949 |
| 1 | 0 | 172 | 5 | 881250949 |
| 2 | 0 | 133 | 1 | 881250949 |
| 3 | 196 | 242 | 3 | 881250949 |
| 4 | 186 | 302 | 3 | 891717742 |


We can use the `Movie_Id_Titles` file to look up the movie names and merge them into the dataframe.

In [4]:
movie_titles = pd.read_csv('Movie_Id_Titles')
movie_titles.head()

| | item_id | title |
|---|---|---|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | GoldenEye (1995) |
| 2 | 3 | Four Rooms (1995) |
| 3 | 4 | Get Shorty (1995) |
| 4 | 5 | Copycat (1995) |


In [5]:
df = pd.merge(df, movie_titles, on='item_id')
df.head()

| | user_id | item_id | rating | timestamp | title |
|---|---|---|---|---|---|
| 0 | 0 | 50 | 5 | 881250949 | Star Wars (1977) |
| 1 | 290 | 50 | 5 | 880473582 | Star Wars (1977) |
| 2 | 79 | 50 | 4 | 891271545 | Star Wars (1977) |
| 3 | 2 | 50 | 5 | 888552084 | Star Wars (1977) |
| 4 | 8 | 50 | 5 | 879362124 | Star Wars (1977) |


In [6]:
n_users = df.user_id.nunique()
n_items = df.item_id.nunique()

print('Number of Users: '+str(n_users))
print('Number of Movies: '+str(n_items))

Number of Users: 944
Number of Movies: 1682


## Train Test Split

We'll evaluate the model with a train/test split, as we have done before. However, instead of the classic `X_train`, `y_train`, ... split into features and labels, we split the ratings dataframe itself into two sets of rows.

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
train_data, test_data = train_test_split(df, test_size=0.25)
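
Because the split is random, the two sets change from run to run. A minimal sketch, assuming you want a reproducible split: pass `random_state` (the value 42 is arbitrary). The small `toy` frame below is made-up data standing in for `df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the ratings dataframe (hypothetical values).
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 4, 4],
    'item_id': [10, 20, 10, 30, 20, 30, 10, 20],
    'rating':  [5, 3, 4, 2, 1, 5, 4, 3],
})

# Individual ratings (rows) are split, not users or items.
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)

print(len(toy_train), len(toy_test))
```

Since rows rather than users are split, the same user can appear in both sets, which is why the train and test matrices built later share the same `n_users` x `n_items` shape.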

## Memory Based Collaborative Filtering

Memory Based CF can be divided into **user-item** and **item-item** filtering.

- *Item-Item*: "Users who liked this item also liked..."
- *User-Item*: "Users who are similar to you also liked..."

In both cases, you create a user-item matrix built from the entire dataset.

The similarity values between items in *item-item* are measured by observing all the users who have rated both items.

The similarity values between users in *user-item* are measured by observing all the items rated by both users.

A similarity measure commonly used in recommender systems is *cosine similarity*, where ratings are considered vectors in `n`-dimensional space and the similarity is calculated based on the angle between these vectors.
<img class="aligncenter size-thumbnail img-responsive" src="https://latex.codecogs.com/gif.latex?s_u^{cos}(u_k,u_a)=\frac{u_k&space;\cdot&space;u_a&space;}{&space;\left&space;\|&space;u_k&space;\right&space;\|&space;\left&space;\|&space;u_a&space;\right&space;\|&space;}&space;=\frac{\sum&space;x_{k,m}x_{a,m}}{\sqrt{\sum&space;x_{k,m}^2\sum&space;x_{a,m}^2}}"/>
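
As a small illustration of the formula, here is a sketch that computes cosine similarity directly with NumPy. The helper name `cosine_sim` and the rating vectors are made up for this example.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two rating vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical rating vectors over four items (0 = not rated).
u_k = np.array([5.0, 3.0, 0.0, 1.0])
u_a = np.array([4.0, 0.0, 0.0, 1.0])
u_b = np.array([10.0, 6.0, 0.0, 2.0])  # same direction as u_k, twice the length

print(cosine_sim(u_k, u_a))  # partial overlap, so below 1
print(cosine_sim(u_k, u_b))  # same direction, so (numerically) 1
```

Only the angle matters: scaling a vector, as with `u_b`, does not change the similarity.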

The first step will be to create the user-item matrices.

In [9]:
train_data_matrix = np.zeros((n_users, n_items))
test_data_matrix = np.zeros((n_users, n_items))

In [10]:
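# Fill the training matrix: rows are users, columns are items.
# IDs are shifted by 1 to get 0-based indices. Note that user_id 0, which
# appears in this file, maps to index -1, i.e. the last row.
for line in train_data.itertuples():
    train_data_matrix[line.user_id - 1, line.item_id - 1] = line.rating

In [11]:
# Same construction for the test matrix.
for line in test_data.itertuples():
    test_data_matrix[line.user_id - 1, line.item_id - 1] = line.rating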
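
The loops above can also be written with NumPy fancy indexing. The sketch below is an alternative, not the notebook's method, and it assumes IDs are 1-based and contiguous; the `toy` frame and its values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with 1-based, contiguous IDs.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'item_id': [1, 3, 2, 3],
    'rating':  [5, 2, 4, 1],
})
n_u, n_i = 3, 3

# Loop version, mirroring the cells above.
loop_matrix = np.zeros((n_u, n_i))
for row in toy.itertuples(index=False):
    loop_matrix[row.user_id - 1, row.item_id - 1] = row.rating

# Vectorized version: index with arrays of row and column positions.
vec_matrix = np.zeros((n_u, n_i))
rows = toy['user_id'].to_numpy() - 1
cols = toy['item_id'].to_numpy() - 1
vec_matrix[rows, cols] = toy['rating'].to_numpy()

print(np.array_equal(loop_matrix, vec_matrix))
```

Entries left at 0 mean "not rated", so both versions produce a mostly empty matrix on real data.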

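We can use `pairwise_distances` from sklearn for this. Note that with `metric='cosine'` it returns the cosine *distance*, which is 1 minus the cosine similarity, so we subtract the result from 1.

In [12]:
from sklearn.metrics.pairwise import pairwise_distances

In [13]:
# pairwise_distances returns cosine distance; 1 - distance gives cosine similarity.
user_similarity = 1 - pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = 1 - pairwise_distances(train_data_matrix.T, metric='cosine')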
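
A quick check of how cosine distance and cosine similarity relate in sklearn, using a made-up toy matrix: `pairwise_distances(..., metric='cosine')` gives the distance, and `cosine_similarity` gives the similarity.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity

# Hypothetical user-item matrix: 3 users, 4 items.
toy_matrix = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
])

dist = pairwise_distances(toy_matrix, metric='cosine')  # cosine distance
sim = cosine_similarity(toy_matrix)                     # cosine similarity

# Distance and similarity are complements of each other.
print(np.allclose(1 - dist, sim))
# Each user is at distance 0 from itself.
print(np.allclose(np.diag(dist), 0))
```

Passing `toy_matrix.T` instead compares columns, which is how the item-item matrix is obtained.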

These matrices enable us to make predictions. For user-based CF we apply the following formula:

<img class="aligncenter size-thumbnail img-responsive" src="https://latex.codecogs.com/gif.latex?\hat{x}_{k,m}&space;=&space;\bar{x}_{k}&space;&plus;&space;\frac{\sum\limits_{u_a}&space;sim_u(u_k,&space;u_a)&space;(x_{a,m}&space;-&space;\bar{x_{u_a}})}{\sum\limits_{u_a}|sim_u(u_k,&space;u_a)|}"/>

You can look at the similarity between users *k* and *a* as weights that are multiplied by the ratings of a similar user *a*, corrected for the average rating of that user.

Some users tend to give consistently high or consistently low ratings to all movies. For these users, the relative difference between their ratings is more important than the absolute values.
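
To make the mean correction concrete, here is a worked sketch of the user-based formula for a single user and item. The ratings and the similarity values in `sim_k` are invented for illustration, and every entry is treated as an actual rating to keep the arithmetic simple.

```python
import numpy as np

# Rows are users, columns are items (toy values).
ratings = np.array([
    [5.0, 4.0, 5.0],   # user k: rates everything high
    [2.0, 1.0, 2.0],   # user a: rates everything low
    [3.0, 5.0, 1.0],   # user b
])
# Hypothetical similarities between user k and every user (k itself first).
sim_k = np.array([1.0, 0.9, 0.2])

means = ratings.mean(axis=1)       # each user's average rating
m = 1                              # item to predict for user k
centered = ratings[:, m] - means   # how far each user's rating of m is from their mean

# Weighted average of the deviations, added back onto user k's own mean.
pred = means[0] + sim_k.dot(centered) / np.abs(sim_k).sum()
print(pred)
```

Users k and a give very different absolute ratings to item `m` (4 versus 1), but after centering both sit the same amount below their own average, which is the signal the formula uses.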

When making a prediction for **item-based CF**, there is no need to correct for the user's average rating, because only the querying user's own ratings are used.

<img class="aligncenter size-thumbnail img-responsive" src="https://latex.codecogs.com/gif.latex?\hat{x}_{k,m}&space;=&space;\frac{\sum\limits_{i_b}&space;sim_i(i_m,&space;i_b)&space;(x_{k,b})&space;}{\sum\limits_{i_b}|sim_i(i_m,&space;i_b)|}"/>
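
The item-based formula is a similarity-weighted average of the user's own ratings. A minimal sketch for one user and one target item, with made-up ratings and similarity values:

```python
import numpy as np

# Ratings that user k gave to three items (toy values).
user_k = np.array([5.0, 3.0, 1.0])
# Hypothetical similarities between the target item m and each of the three items.
sim_m = np.array([1.0, 0.8, 0.1])

# Weighted average: items more similar to m contribute more.
pred = sim_m.dot(user_k) / np.abs(sim_m).sum()
print(pred)
```

With non-negative weights the prediction always lies between the user's lowest and highest rating.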

In [14]:
def predict(ratings, similarity, flag='user'):
    if flag == 'user':
        # Per-user mean over all items (unrated entries count as 0 here).
        mean_user_rating = ratings.mean(axis=1)
        # np.newaxis turns the means into a column of shape (n_users, 1)
        # so they broadcast against the (n_users, n_items) ratings matrix.
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif flag == 'item':
        # Similarity-weighted average of each user's ratings; no mean correction.
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred

And now we apply the `predict` function in each case.

In [15]:
item_prediction = predict(train_data_matrix, item_similarity, flag='item')
user_prediction = predict(train_data_matrix, user_similarity, flag='user')