# Recommendation Systems


##  Brief Introduction :

In this notebook, I have built a recommender system using the MovieLens 100k dataset, available at https://grouplens.org/datasets/movielens/100k/. The dataset folder contains a number of files. I will be using the 'ua.base' file, which contains 90,570 ratings, and the 'ua.test' file, which contains 9,430.

I will first build user-user and item-item collaborative filtering recommenders, and then try model-based collaborative filtering using Singular Value Decomposition.

##  Importing libraries : 

In [1]:
import pandas as pd
import numpy as np

from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics import mean_squared_error

import math

##  Getting the datasets & setting the column names :

In [2]:
rs_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']

ratings_base = pd.read_csv('Data/ua.base', sep='\t', names=rs_cols, encoding='latin-1')
ratings_test = pd.read_csv('Data/ua.test', sep='\t', names=rs_cols, encoding='latin-1')

In [3]:
ratings_base.head()

| | user_id | movie_id | rating | unix_timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 5 | 874965758 |
| 1 | 1 | 2 | 3 | 876893171 |
| 2 | 1 | 3 | 4 | 878542960 |
| 3 | 1 | 4 | 3 | 876893119 |
| 4 | 1 | 5 | 3 | 889751712 |


In [25]:
ratings_test.head()

| | user_id | movie_id | rating | unix_timestamp |
|---|---|---|---|---|
| 0 | 1 | 20 | 4 | 887431883 |
| 1 | 1 | 33 | 4 | 878542699 |
| 2 | 1 | 61 | 4 | 878542420 |
| 3 | 1 | 117 | 3 | 874965739 |
| 4 | 1 | 155 | 2 | 878542201 |


#### Dataset description :

- column user_id : ids of users, starting from 1,
- column movie_id : ids of movies, starting from 1, and
- 'rating' column : the corresponding ratings.

Let us figure out how many unique users and unique movies (items) there are.

In [4]:
ratings_base.describe()

| | user_id | movie_id | rating | unix_timestamp |
|---|---|---|---|---|
| count | 90570.0 | 90570.0 | 90570.0 | 90570.0 |
| mean | 461.494038 | 428.104891 | 3.523827 | 883507300.0 |
| std | 266.004364 | 333.088029 | 1.126073 | 5341684.0 |
| min | 1.0 | 1.0 | 1.0 | 874724700.0 |
| 25% | 256.0 | 174.0 | 3.0 | 879448400.0 |
| 50% | 442.0 | 324.0 | 4.0 | 882814300.0 |
| 75% | 682.0 | 636.0 | 4.0 | 888204900.0 |
| max | 943.0 | 1682.0 | 5.0 | 893286600.0 |


In [5]:
n_users_base = ratings_base['user_id'].unique().max()
n_items_base = ratings_base['movie_id'].unique().max()

n_users_base,n_items_base

(943, 1682)

####  Findings 1 :
User ids in the training set run up to 943 and movie ids up to 1682, so we treat the training set as covering 943 users and 1682 movies.
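The cell above takes the largest id rather than counting distinct ids. The two agree only when the ids are contiguous from 1. A minimal sketch on made-up ids (the array `toy_ids` is purely illustrative) shows the difference:

```python
import numpy as np

# Hypothetical ids with a gap: 3 and 4 never occur.
toy_ids = np.array([1, 2, 2, 5])

n_distinct = np.unique(toy_ids).size  # number of different ids present
largest_id = toy_ids.max()            # what unique().max() reports

print(n_distinct, largest_id)
```

For sizing a matrix indexed by `id - 1`, the largest id is the safer choice, since it guarantees every id has a row or column.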

In [6]:
ratings_test.describe()

| | user_id | movie_id | rating | unix_timestamp |
|---|---|---|---|---|
| count | 9430.0 | 9430.0 | 9430.0 | 9430.0 |
| mean | 472.0 | 400.800954 | 3.587805 | 883735400.0 |
| std | 272.234934 | 306.859789 | 1.12024 | 5360562.0 |
| min | 1.0 | 1.0 | 1.0 | 874724700.0 |
| 25% | 236.0 | 182.0 | 3.0 | 879451500.0 |
| 50% | 472.0 | 303.0 | 4.0 | 883390400.0 |
| 75% | 708.0 | 566.0 | 4.0 | 888637800.0 |
| max | 943.0 | 1664.0 | 5.0 | 893286600.0 |


In [7]:
n_users_test = ratings_test['user_id'].unique().max()
n_items_test = ratings_test['movie_id'].unique().max()
n_users_test,n_items_test

(943, 1664)

## Creating : user - item matrix

User ids in the test set run up to 943 and movie ids up to 1664.
- Now let us go ahead and create our user-item matrices, train_matrix and test_matrix, with one row per user and one column per movie. Note that test_matrix has only 1664 columns while train_matrix has 1682; this is harmless here because test_matrix is only used to pick out cells of the larger prediction matrix.
- The cells of each matrix are filled with the rating the user has given to the movie.
    - If a user has not rated a movie, the cell is filled with 0.

In [8]:
train_matrix = np.zeros((n_users_base, n_items_base))
for line in ratings_base.itertuples():
    train_matrix[line[1]-1,line[2]-1] = line[3]

test_matrix = np.zeros((n_users_test, n_items_test))
for line in ratings_test.itertuples():
    test_matrix[line[1]-1,line[2]-1] = line[3]
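The same fill can be written without a Python loop. The following is a minimal sketch on a made-up ratings array (`toy_ratings` and its values are assumptions for illustration, not rows from the dataset), using NumPy integer-array indexing:

```python
import numpy as np

# Toy (user_id, movie_id, rating) rows; ids start at 1, as in the dataset.
toy_ratings = np.array([
    [1, 1, 5],
    [1, 3, 4],
    [2, 2, 3],
    [3, 1, 2],
])

n_users = toy_ratings[:, 0].max()
n_items = toy_ratings[:, 1].max()

# Fill every rated cell in one assignment; unrated cells stay 0.
toy_matrix = np.zeros((n_users, n_items))
toy_matrix[toy_ratings[:, 0] - 1, toy_ratings[:, 1] - 1] = toy_ratings[:, 2]

print(toy_matrix)
```

This assumes each (user, movie) pair occurs at most once; with duplicates, only one of the ratings would survive.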

## user-user  based collaborative filtering

The first approach we try is user-user based collaborative filtering. In this method, we first create a similarity matrix which specifies the similarity between two users based on the ratings they have given to different movies. 

- We use the "cosine similarity" metric, which computes the dot product of the two users' rating vectors divided by the product of their norms.
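To make the metric concrete, here is a small sketch that computes cosine similarity by hand with NumPy on a made-up matrix. The helper `cosine_similarity_rows` is hypothetical and not part of the notebook. It also shows the relation to cosine distance, which is what scikit-learn's `pairwise_distances(..., metric='cosine')` returns: distance = 1 - similarity.

```python
import numpy as np

def cosine_similarity_rows(matrix):
    """Cosine similarity between every pair of rows (hypothetical helper)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = matrix / norms
    return unit @ unit.T

toy = np.array([
    [5.0, 0.0, 4.0],
    [5.0, 0.0, 4.0],  # identical to row 0
    [0.0, 3.0, 0.0],  # shares no rated movie with row 0
])

sim = cosine_similarity_rows(toy)
dist = 1.0 - sim  # cosine distance

print(np.round(sim, 3))
print(np.round(dist, 3))
```

Identical rows have similarity 1 and distance 0, and rows with no overlap have similarity 0 and distance 1. A zero diagonal is therefore the signature of a distance matrix, not a similarity matrix.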

In [19]:
train_matrix.shape

(943, 1682)

In [9]:
user_similarity = pairwise_distances(train_matrix, metric='cosine')
print('shape: ',user_similarity.shape)

user_similarity

shape:  (943, 943)


array([[0.        , 0.85324924, 0.9493235 , ..., 0.96129522, 0.8272823 ,
        0.61960392],
       [0.85324924, 0.        , 0.87419215, ..., 0.82629308, 0.82681535,
        0.91905667],
       [0.9493235 , 0.87419215, 0.        , ..., 0.97201154, 0.87518372,
        0.97030738],
       ...,
       [0.96129522, 0.82629308, 0.97201154, ..., 0.        , 0.96004871,
        0.98085615],
       [0.8272823 , 0.82681535, 0.87518372, ..., 0.96004871, 0.        ,
        0.85528944],
       [0.61960392, 0.91905667, 0.97030738, ..., 0.98085615, 0.85528944,
        0.        ]])

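- The matrix has a shape of 943 x 943 as expected, with each cell corresponding to a pair of users. Strictly speaking, `pairwise_distances` with `metric='cosine'` returns cosine distance (1 minus cosine similarity), which is why the diagonal above is 0; subtracting the result from 1 gives similarities, where larger values mean more similar users.

- Now we will write a prediction function which will predict the values in the user-item matrix. We will only consider the top n users most similar to a user when making predictions for that user. Because `argsort` sorts in ascending order, the slice `[:, -n_similar:]` picks the n largest entries of each row, so the function expects a matrix in which larger values mean greater similarity.

- Here, we normalise the ratings of users by subtracting the mean rating of a user from every rating given by that user.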

\begin{equation*}
\hat{x}_{k,m} = \bar{x}_{k} + \frac{\sum\limits_{u_a} sim_u(u_k, u_a) (x_{a,m} - \bar{x}_{a})}{\sum\limits_{u_a}|sim_u(u_k, u_a)|}
\end{equation*}
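The formula can be traced on a tiny example. The sketch below uses made-up ratings and similarity values, and a hypothetical helper `predict_for_user`. It uses all other users as neighbours instead of the top n, and it computes each user's mean over rated movies only, which is an assumption that differs from the notebook's code.

```python
import numpy as np

# Toy user-item matrix (0 = not rated) and assumed similarity values.
ratings = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 2.0],
    [1.0, 5.0, 4.0],
])
sim = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.4],
    [0.2, 0.4, 1.0],
])

def predict_for_user(k, ratings, sim):
    """Predict all ratings of user k from the other users (hypothetical helper)."""
    # Assumption: mean over rated movies only. The notebook code averages
    # over all cells, including the zeros.
    rated = ratings > 0
    means = ratings.sum(axis=1) / rated.sum(axis=1)
    centered = np.where(rated, ratings - means[:, np.newaxis], 0.0)

    neighbours = [a for a in range(ratings.shape[0]) if a != k]
    weights = sim[k, neighbours]
    numerator = weights @ centered[neighbours, :]
    denominator = np.abs(weights).sum()
    return means[k] + numerator / denominator

pred_user0 = predict_for_user(0, ratings, sim)
print(pred_user0)
```

User 0's own mean is 4; the neighbours' mean-centred ratings then shift each prediction up or down in proportion to their similarity weights.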

In [14]:
def predict_user_user(train_matrix, user_similarity, n_similar = 30):
    
    similar_n = user_similarity.argsort()[:,-n_similar:][:,::-1]
    pred = np.zeros((n_users_base,n_items_base))
    
    for i,users in enumerate(similar_n):
        similar_users_indexes = users
        similarity_n = user_similarity[i,similar_users_indexes]
        matrix_n = train_matrix[similar_users_indexes,:]
        rated_items = similarity_n[:,np.newaxis].T.dot(matrix_n - matrix_n.mean(axis=1)[:,np.newaxis])/ similarity_n.sum()
        pred[i,:]  = rated_items
        
    return pred

- We will use this function to find the predicted (mean-centred) ratings and then add the average rating of every user to get the final predicted ratings. Note that `train_matrix.mean(axis=1)` averages over all 1682 cells of a row, including the zeros that mark unrated movies.

- Here, we consider the top 50 users most similar to our user and use their ratings to predict our user's ratings.

In [15]:

predictions = predict_user_user(train_matrix,user_similarity, 50) + train_matrix.mean(axis=1)[:, np.newaxis]
print('predictions shape ',predictions.shape)

predictions

predictions shape  (943, 1682)


array([[ 0.53079191,  0.53079191,  0.53079191, ...,  0.53079191,
         0.53079191,  0.53079191],
       [ 0.27556554,  0.17581381, -0.00189689, ..., -0.00189689,
        -0.00189689, -0.00189689],
       [ 1.17064209,  0.07064209,  0.01064209, ...,  0.01064209,
         0.01064209,  0.01064209],
       ...,
       [-0.0479786 , -0.0479786 , -0.0479786 , ..., -0.0479786 ,
        -0.0479786 , -0.0479786 ],
       [ 0.8909642 ,  0.12995357,  0.12995357, ...,  0.12995357,
         0.12995357,  0.12995357],
       [ 0.27315101,  0.27315101,  0.27315101, ...,  0.31315101,
         0.27315101,  0.27315101]])

- Let us consider only those cells which are non-zero in the test matrix, i.e. the ratings actually present in the test set, and use them to measure the error of our model.

In [16]:
predicted_ratings = predictions[test_matrix.nonzero()]

test_truth = test_matrix[test_matrix.nonzero()]

####  error computation

In [17]:
math.sqrt(mean_squared_error(predicted_ratings,test_truth))

3.507744099069281
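The error computation above amounts to a root mean squared error restricted to the rated test cells. A minimal NumPy-only sketch, with a hypothetical helper `rmse_on_rated` and made-up matrices, is:

```python
import numpy as np

def rmse_on_rated(predictions, truth):
    """RMSE over the cells where truth holds a rating (non-zero); hypothetical helper."""
    rows, cols = truth.nonzero()
    errors = predictions[rows, cols] - truth[rows, cols]
    return float(np.sqrt(np.mean(errors ** 2)))

toy_truth = np.array([[4.0, 0.0], [0.0, 2.0]])
toy_pred = np.array([[3.0, 9.9], [9.9, 4.0]])  # 9.9 cells are ignored

toy_rmse = rmse_on_rated(toy_pred, toy_truth)
print(toy_rmse)
```

Only the two rated cells contribute (errors of -1 and 2); predictions for unrated cells do not affect the score.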

## item-item  based collaborative filtering

Now, I will go on and try item-item based collaborative filtering. This method finds the similarity between items instead of users, again using cosine similarity.

- Using the similarity between items and the user's ratings for similar items, we find the predicted ratings for unrated items. Let us make the item similarity matrix.

In [33]:
item_similarity = pairwise_distances(train_matrix.T, metric = 'cosine')
item_similarity.shape

(1682, 1682)

In [35]:
item_similarity

array([[0.        , 0.59704074, 0.66673863, ..., 1.        , 0.94919585,
        0.94919585],
       [0.59704074, 0.        , 0.7308149 , ..., 1.        , 0.91844091,
        0.91844091],
       [0.66673863, 0.7308149 , 0.        , ..., 1.        , 1.        ,
        0.90098525],
       ...,
       [1.        , 1.        , 1.        , ..., 0.        , 1.        ,
        1.        ],
       [0.94919585, 0.91844091, 1.        , ..., 1.        , 0.        ,
        1.        ],
       [0.94919585, 0.91844091, 0.90098525, ..., 1.        , 1.        ,
        0.        ]])

- The matrix has a shape of 1682 x 1682 as expected, with each cell corresponding to a pair of items. As before, `pairwise_distances` returns cosine distance, so the diagonal is 0 and a value of 1 means the two items share no raters.
- Now we will write a prediction function which will predict the values in the user-item matrix.
- We will only consider the top n items most similar to an item when making predictions.
- Here we don't need to normalise the ratings of users, as we are using items to make predictions instead of users.

\begin{equation*}
\hat{x}_{k,m} = \frac{\sum\limits_{i_b} sim_i(i_m, i_b) (x_{k,b}) }{\sum\limits_{i_b}|sim_i(i_m, i_b)|}
\end{equation*}
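As a worked instance of the formula, the sketch below applies it to a made-up 2 x 3 rating matrix with assumed item similarities. The helper `predict_item_based` is hypothetical and uses all items instead of only the top n.

```python
import numpy as np

ratings = np.array([
    [5.0, 4.0, 0.0],
    [0.0, 2.0, 3.0],
])
# Assumed item-item similarities (symmetric, 1 on the diagonal).
item_sim = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
])

def predict_item_based(ratings, item_sim):
    """Similarity-weighted average of each user's ratings (hypothetical helper)."""
    # numerator[k, m] = sum over b of sim(m, b) * x[k, b]
    numerator = ratings @ item_sim.T
    # denominator[m] = sum over b of |sim(m, b)|
    denominator = np.abs(item_sim).sum(axis=1)
    return numerator / denominator

pred = predict_item_based(ratings, item_sim)
print(np.round(pred, 3))
```

Because unrated cells are stored as 0, they enter the weighted average as if they were ratings of 0, which pulls predictions towards 0 whenever a user has rated few of the similar items.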

In [41]:
def predict_item_item(train_matrix, item_similarity, n_similar=30):
    
    similar_n = item_similarity.argsort()[:,-n_similar:][:,::-1]
    print('similar_n shape: ', similar_n.shape)
    pred = np.zeros((n_users_base,n_items_base))
    
    
    for i,items in enumerate(similar_n):
        similar_items_indexes = items
        similarity_n = item_similarity[i,similar_items_indexes]
        matrix_n = train_matrix[:,similar_items_indexes]
        rated_items = matrix_n.dot(similarity_n)/similarity_n.sum()
        pred[:,i]  = rated_items
        
    return pred

We will use this function to find the predicted ratings.
- Here, we consider the top 50 items most similar to each item and use the user's ratings of those items to predict the rating.

In [43]:
predictions = predict_item_item(train_matrix,item_similarity,50)
print('predictions shape ',predictions.shape)

predictions

similar_n shape:  (1682, 50)
predictions shape  (943, 1682)


array([[0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.66],
       [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.  ],
       ...,
       [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , ..., 0.1 , 0.08, 0.08],
       [0.  , 0.  , 0.  , ..., 0.44, 0.2 , 0.06]])

Again, let us consider only those cells which are non-zero in the test matrix and use them to measure the error of our model.

In [44]:
predicted_ratings = predictions[test_matrix.nonzero()]
test_truth = test_matrix[test_matrix.nonzero()]

math.sqrt(mean_squared_error(predicted_ratings,test_truth))

3.749688827167227

## Getting recommendations for user

In the next part we get recommendations for a user based on the highest predicted ratings for that user. Let us get predictions for the user with user id 45. I am using the predictions from the item-item collaborative filtering model for this.

In [20]:
user_id = 45
user_ratings = predictions[user_id-1,:]

We extract the indices of the movies in the matrix which have not been rated by the user i.e. value is 0 and get their predicted ratings. 

In [21]:
train_unkown_indices = np.where(train_matrix[user_id-1,:] == 0)[0]
train_unkown_indices

array([   1,    2,    3, ..., 1679, 1680, 1681], dtype=int64)

In [22]:
user_recommendations = user_ratings[train_unkown_indices]

In [23]:
user_recommendations.shape

(1644,)

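We go on and print the top 5 recommendations. Note that `argsort` is applied to `user_recommendations`, which holds only the unrated movies, so the printed numbers are positions in that filtered array plus one. To obtain actual movie ids, those positions have to be looked up in `train_unkown_indices` before adding 1.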

In [24]:
print('\nRecommendations for user {} are the movies: \n'.format(user_id))

for movie_id in user_recommendations.argsort()[-5:][: : -1]:
    print(movie_id +1)


Recommendations for user 45 are the movies: 

282
293
280
267
254
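Since the scores are sorted after filtering to unrated movies, the sorted positions refer to the filtered array. A minimal sketch of mapping them back to movie ids, on made-up rows (all names and values here are illustrative assumptions), could look like this:

```python
import numpy as np

toy_train_row = np.array([4.0, 0.0, 5.0, 0.0, 0.0])  # 0 = not rated
toy_pred_row = np.array([3.9, 2.0, 4.8, 4.5, 3.0])   # predicted ratings

unknown_indices = np.where(toy_train_row == 0)[0]    # columns of unrated movies
unknown_scores = toy_pred_row[unknown_indices]       # their predicted ratings

# Positions of the 2 best scores within the filtered array, best first.
top_positions = unknown_scores.argsort()[-2:][::-1]
# Map back to original columns, then to 1-based movie ids.
top_movie_ids = unknown_indices[top_positions] + 1

print(top_movie_ids)
```

Printing `top_positions + 1` instead would report 2 and 3 here, and movie 3 has already been rated by this toy user.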


## Trying : singular value decomposition

Having tried both memory-based methods, i.e. user-user and item-item collaborative filtering, we now try a model-based method.
- Singular value decomposition (SVD) is a matrix factorisation technique. Keeping only the largest singular values gives a low-rank approximation of the rating matrix, and the entries of that approximation serve as predicted ratings for the missing values.
- It decomposes a matrix into three matrices: two rectangular ones and, in the middle, a diagonal matrix.

\begin{equation*}
X=U \times S \times V^T
\end{equation*}
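To illustrate the decomposition and its truncation, here is a small sketch on a made-up 4 x 4 matrix. It uses `np.linalg.svd` in place of the notebook's `scipy.sparse.linalg.svds`, purely to keep the example dependency-light; the idea of keeping the top k singular values is the same.

```python
import numpy as np

toy = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

# Full (thin) SVD; np.linalg.svd returns singular values in descending order.
u, s, vt = np.linalg.svd(toy, full_matrices=False)

# Using all singular values reproduces the matrix.
full = u @ np.diag(s) @ vt

# Keeping only the k largest gives a rank-k approximation.
k = 2
approx = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

print(np.round(s, 3))
print(np.round(approx, 2))
```

The cells that were 0 in `toy` generally become non-zero in `approx`; those filled-in values are what the notebook treats as predicted ratings.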

In [51]:
import scipy.sparse as sp
from scipy.sparse.linalg import svds

In [52]:
u, s, vt = svds(train_matrix, k = 20)

In [53]:
u.shape, s.shape, vt.shape

((943, 20), (20,), (20, 1682))

In [54]:
s_diag_matrix = np.diag(s)

#### We get the predictions by finding the dot product of the three matrices.

In [55]:
predictions_svd = np.dot(np.dot(u,s_diag_matrix),vt)

In [56]:
predictions_svd.shape

(943, 1682)

In [57]:
predicted_ratings_svd = predictions_svd[test_matrix.nonzero()]
test_truth = test_matrix[test_matrix.nonzero()]

math.sqrt(mean_squared_error(predicted_ratings_svd,test_truth))

2.8258075694458307

### The root mean square error is lowest with this method (2.83, against 3.51 for user-user and 3.75 for item-item).
- Let us now get the recommendations for user 33, in the same way as before.

In [60]:
user_id = 33
user_ratings = predictions_svd[user_id-1,:]
train_unkown_indices = np.where(train_matrix[user_id-1,:] == 0)[0]
user_recommendations = user_ratings[train_unkown_indices]

user_recommendations.shape

(1668,)

In [61]:
print('\nRecommendations for user {} are the movies: \n'.format(user_id))

for movie_id in user_recommendations.argsort()[-5:][: : -1]:
    print(movie_id +1)


Recommendations for user 33 are the movies: 

257
321
736
325
319


All about recommendation systems!