In [1]:
# `evaluate` is not used below and is no longer available in recent Surprise releases
from surprise import SVD, KNNBasic, Dataset, similarities, Reader
from surprise.model_selection import GridSearchCV, cross_validate
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
#import math
from tabulate import tabulate
#%matplotlib inline


## User-User Based Recommendation.
In this recommender system, recommendations are based on similarities between users. The similarity can be calculated with cosine similarity, Euclidean distance, or a nearest-neighbour algorithm. This notebook also uses one of the few Python packages available for building recommenders, called "surprise".
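To make the two similarity measures concrete, here is a minimal sketch in plain Python. The three users and their ratings are made up for illustration, and the helper names `cosine_sim` and `euclidean_dist` are hypothetical (they are not part of surprise or scikit-learn).

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two rating vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)

def euclidean_dist(u, v):
    """Straight-line distance between two rating vectors (0.0 = identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical ratings of three users for the same four movies
alice = [5, 4, 1, 1]
bob = [4, 5, 1, 2]
carol = [1, 1, 5, 4]

sim_ab = cosine_sim(alice, bob)
sim_ac = cosine_sim(alice, carol)
dist_ab = euclidean_dist(alice, bob)
dist_ac = euclidean_dist(alice, carol)
print(sim_ab, sim_ac, dist_ab, dist_ac)
```

With these toy vectors, alice is closer to bob than to carol under both measures: a higher cosine similarity and a smaller Euclidean distance.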

The first step in the analysis is to read the data from a GitHub repository. Two tables are stored in pandas dataframes: one holds the movie ratings, and the other holds the movie titles, which can be mapped to the ID numbers in the ratings table.

In [2]:
algo = SVD()


In [3]:
df1 = pd.read_csv('https://raw.githubusercontent.com/angus001/Data643/master/Project2/ratings2.csv', sep=',')
movie_names = pd.read_csv ('https://raw.githubusercontent.com/angus001/Data643/master/Project2/movies.csv',sep = ',')
df1.head()


Unnamed: 0,userId,movieId,Id2,rating,timestamp
0,7,1,1,3.0,851866703
1,9,1,1,4.0,938629179
2,13,1,1,5.0,1331380058
3,15,1,1,2.0,997938310
4,19,1,1,3.0,855190091


In [4]:
movie_names.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
reader = Reader(rating_scale = (1,5))
df1a = df1[['userId','Id2','rating']].head(9000)

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}

df2 = Dataset.load_from_df(df1a[['userId','Id2','rating']], reader)
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(df2)
print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

0.926769964319
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
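The grid search above ranks parameter combinations by RMSE and MAE averaged over the cross-validation folds. As a rough illustration of what these two error measures compute, here is a small sketch on made-up ratings; the helper names and the numbers are assumptions for the example, not values from the dataset.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalises large misses more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: average size of the miss, in rating points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical true ratings and model estimates for four user-movie pairs
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
print(rmse_value, mae_value)
```

Read this way, the RMSE of about 0.93 printed above means the SVD estimates miss by a little under one rating point on the 1 to 5 scale, in the root-mean-square sense.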


## Nearest Neighbour or KNN method


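The second method uses a nearest-neighbour (KNN) model with cosine similarity between users. In this example a prediction is requested for user '2' and movie '10', whose actual rating is 4. The returned estimate is 3.54, but the output reports `was_impossible: True`: the IDs are passed as strings, whereas the IDs loaded from the dataframe are integers, so the model does not recognise the user or item. The 3.54 is therefore the model's default estimate (the mean of all training ratings), not a neighbour-based prediction.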

In [6]:
trainset = df2.build_full_trainset()
testset = trainset.build_anti_testset()
sim_options={'name':'cosine', 'user_based':True}
algo2 = KNNBasic(sim_options =sim_options)

algo2.fit(trainset)
predictions = algo2.test(testset)

userid = str(2)
itemid = str(10)
actual_rating = 4
algo2.predict(userid,itemid, actual_rating)


Computing the cosine similarity matrix...
Done computing similarity matrix.


Prediction(uid='2', iid='10', r_ui=4, est=3.5390000000000001, details={'was_impossible': True, 'reason': 'User and/or item is unkown.'})
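The behaviour seen in this output can be reproduced with a simplified sketch of a user-based KNN prediction: a similarity-weighted average of the neighbours' ratings, with a fallback to the global mean when the user or item is unknown. This is an illustration in plain Python on made-up ratings, not Surprise's implementation; the function names and the toy data are assumptions.

```python
import math

# Hypothetical toy ratings: {user_id: {movie_id: rating}}, with integer IDs
ratings = {
    1: {10: 4.0, 20: 5.0, 30: 1.0},
    2: {10: 5.0, 20: 4.0},
    3: {10: 1.0, 20: 2.0, 30: 5.0},
}

def cosine_sim(a, b):
    """Cosine similarity over the movies both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in common))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)

def global_mean(ratings):
    values = [r for user in ratings.values() for r in user.values()]
    return sum(values) / len(values)

def predict(ratings, uid, iid, k=2):
    """Return (estimate, was_impossible) for one user-movie pair."""
    known_items = {m for user in ratings.values() for m in user}
    if uid not in ratings or iid not in known_items:
        return global_mean(ratings), True   # unknown ID: default estimate
    # Similarity and rating of every other user who rated this movie
    neighbours = [(cosine_sim(ratings[uid], other), other[iid])
                  for vid, other in ratings.items()
                  if vid != uid and iid in other]
    neighbours = sorted(neighbours, reverse=True)[:k]
    weight = sum(sim for sim, _ in neighbours)
    if weight == 0:
        return global_mean(ratings), True
    return sum(sim * r for sim, r in neighbours) / weight, False

est_int, impossible_int = predict(ratings, 2, 30)      # integer IDs: known
est_str, impossible_str = predict(ratings, '2', '30')  # string IDs: unknown
print(est_int, impossible_int, est_str, impossible_str)
```

The call with string IDs falls back to the global mean and is flagged as impossible, while the call with integer IDs returns a neighbour-weighted estimate. Passing `2` and `10` as integers to `algo2.predict` would be the corresponding change in the cell above.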

## SVD Singular Value Decomposition

The idea of singular value decomposition (SVD) is similar to principal component analysis (PCA). Both algorithms reduce the dimensionality of the dataset, keeping the core features that drive the model's predictions while removing noise. SVD factorizes a matrix into three matrices: a matrix of left singular vectors, a diagonal matrix of singular values, and the transpose of a matrix of right singular vectors. Truncated SVD keeps only the largest singular values, so the number of components must be specified. In the calculation below, the number of components is set to 120.
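The factorization can be illustrated on a small matrix with NumPy. This is a sketch on a made-up 4-user by 5-movie rating matrix, using `np.linalg.svd` rather than scikit-learn's `TruncatedSVD`; keeping the first `k` columns scaled by their singular values gives user coordinates analogous to what `fit_transform` returns.

```python
import numpy as np

# Hypothetical 4-user x 5-movie rating matrix (0 = not rated)
R = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 0, 0, 1],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 5, 0],
], dtype=float)

# Full (thin) SVD: R = U @ diag(s) @ Vt, singular values sorted descending
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2                                  # number of components to keep
user_factors = U[:, :k] * s[:k]        # users in the k-dimensional latent space
R_k = user_factors @ Vt[:k, :]         # best rank-k approximation of R

full_error = np.linalg.norm(R - (U * s) @ Vt)   # ~0: all components kept
rank_k_error = np.linalg.norm(R - R_k)          # error from dropped components
print(user_factors.shape, full_error, rank_k_error)
```

The rank-`k` error depends only on the discarded singular values, which is why dropping the smallest ones removes the least information.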

In [7]:
#Using Truncated SVD (singular value decomposition)

movie_rating_matrix = pd.merge(df1,movie_names, on = 'movieId')
movie_rating_matrix.head()

Unnamed: 0,userId,movieId,Id2,rating,timestamp,title,genres
0,7,1,1,3.0,851866703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,9,1,1,4.0,938629179,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,13,1,1,5.0,1331380058,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,1,2.0,997938310,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,19,1,1,3.0,855190091,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [8]:
df5 = movie_rating_matrix[['Id2','title']]

movie_names_ID2 = df5.drop_duplicates()
# Chain rename() onto the selection so df3 is a new dataframe rather than a
# slice of movie_rating_matrix (this avoids pandas' SettingWithCopyWarning)
df3 = movie_rating_matrix[['userId','Id2','rating','title']].rename(columns={'Id2':'movie_id'})
df3.head()


Unnamed: 0,userId,movie_id,rating,title
0,7,1,3.0,Toy Story (1995)
1,9,1,4.0,Toy Story (1995)
2,13,1,5.0,Toy Story (1995)
3,15,1,2.0,Toy Story (1995)
4,19,1,3.0,Toy Story (1995)


In [17]:
rating_matrix = df3.pivot_table(values = 'rating', index='userId', columns = 'title', fill_value = 0)
rating_matrix1 = df3.pivot_table(values = 'rating', index='movie_id', columns = 'userId', fill_value = 0)

In [10]:
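df4 = rating_matrix.T  # movies x users view of the ratings (not used below)
SVD1 = TruncatedSVD(n_components = 120)
# Despite its name, this holds each user's coordinates in the 120-dimensional
# latent space (one row per user), not a transpose of the rating matrix
transposed_matrix = SVD1.fit_transform(rating_matrix)
transposed_matrix.shape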

(671, 120)

In [19]:
#build correlation between users based on rating
corr_matrix = np.corrcoef(transposed_matrix)
corr_matrix[:3]

array([[ 1.        ,  0.01965178, -0.00902512, ...,  0.09817083,
         0.04248892,  0.08837197],
       [ 0.01965178,  1.        ,  0.34187875, ...,  0.05362721,
         0.43360384,  0.19057733],
       [-0.00902512,  0.34187875,  1.        , ...,  0.37813395,
         0.55849335,  0.4715013 ]])
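One way such a correlation matrix could be used is to look up the users most correlated with a given user. The sketch below uses a made-up latent matrix in place of `transposed_matrix`, and the helper `most_similar_users` is a hypothetical name introduced for illustration.

```python
import numpy as np

# Hypothetical latent coordinates for four users (rows) in three dimensions
latent = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.1, 6.2],
    [3.0, 1.0, -2.0],
    [1.5, 2.9, 4.4],
])

# Row-wise Pearson correlation: entry (i, j) compares user i with user j
corr = np.corrcoef(latent)

def most_similar_users(corr, user_index, top_n=2):
    """Row indices of the users most correlated with user_index, best first."""
    scores = corr[user_index].copy()
    scores[user_index] = -np.inf       # exclude the user themself
    return np.argsort(scores)[::-1][:top_n]

neighbours = most_similar_users(corr, 0)
print(neighbours)
```

Note that the indices returned are row positions in the matrix, so they would still need to be mapped back to `userId` values through the index of `rating_matrix`.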

A separate function maps the recommended movie IDs to the actual movie titles. In the example below, the top 3 recommended movies are extracted for specific user IDs, and the titles of those movies are then displayed.

In [13]:
#A function to get top recommended items

from collections import defaultdict

def get_top3_recommendation(predictions, topN = 3):
    top_recs =defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_recs[uid].append((iid, est))
    for uid, user_ratings in top_recs.items():
        user_ratings.sort(key = lambda x: x[1], reverse = True)
        top_recs[uid] = user_ratings[:topN]

    return top_recs
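The grouping and sorting logic can be checked on a handful of hand-written predictions. The sketch below restates the same top-N idea under a hypothetical name, `top_n_per_user`, and uses a `namedtuple` stand-in with the five fields that the loop unpacks; the IDs and estimates are made up.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same five fields as the tuples unpacked in the loop
Prediction = namedtuple('Prediction', ['uid', 'iid', 'r_ui', 'est', 'details'])

def top_n_per_user(predictions, n=3):
    """Group predictions by user and keep the n highest estimates for each."""
    top = defaultdict(list)
    for uid, iid, _, est, _ in predictions:
        top[uid].append((iid, est))
    for uid, user_ratings in top.items():
        user_ratings.sort(key=lambda pair: pair[1], reverse=True)
        top[uid] = user_ratings[:n]
    return top

# Hypothetical predictions: four candidate movies for user 7, two for user 9
toy_predictions = [
    Prediction(7, 51, None, 4.8, {}),
    Prediction(7, 158, None, 4.1, {}),
    Prediction(7, 162, None, 4.5, {}),
    Prediction(7, 3, None, 2.0, {}),
    Prediction(9, 51, None, 4.9, {}),
    Prediction(9, 237, None, 3.7, {}),
]

top = top_n_per_user(toy_predictions, n=3)
print(dict(top))
```

User 7 keeps the three highest-scoring movies in descending order of estimate, and user 9 keeps both candidates because fewer than three are available.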

In [14]:
top_recommendation = get_top3_recommendation(predictions, topN = 3)

In [15]:
#create a new dataframe with new ID for movies title

movie_names_ID2 = movie_names_ID2.reset_index()
movie_names_ID2 = movie_names_ID2[['Id2','title']]
movie_names_ID2.head()
idlist = [1,2,3]
movie_names_ID2.loc[movie_names_ID2['Id2'].isin (idlist)]


Unnamed: 0,Id2,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)


The final results of the user-user based recommendation show that, for similar users, the recommended movies overlap or resemble one another. For users 7, 9 and 13, all three lists include Lamerica (1994); users 7 and 13 also share Mute Witness (1994), and users 9 and 13 share Priest (1994).

In [16]:
# Take the first three users and print the titles of their recommended movies
first2 = {k:top_recommendation[k] for k in list(top_recommendation)[:3]}
first2

for uid, user_ratings in first2.items():
    print ('Recommendation for Person ID:%d' %uid)
    movieid = []
    for (iid,_) in user_ratings:
        movieid.append(iid)
    # isin() returns rows in Id2 order, not in order of estimated rating
    df2a =pd.DataFrame(movie_names_ID2.loc[movie_names_ID2['Id2'].isin (movieid)])
    #print (df2a)
    print (tabulate(df2a, headers = 'keys', tablefmt = 'psql'))
 

Recommendation for Person ID:7
+-----+-------+-----------------------------+
|     |   Id2 | title                       |
|-----+-------+-----------------------------|
|  50 |    51 | Lamerica (1994)             |
| 157 |   158 | Love & Human Remains (1993) |
| 161 |   162 | Mute Witness (1994)         |
+-----+-------+-----------------------------+
Recommendation for Person ID:9
+-----+-------+------------------+
|     |   Id2 | title            |
|-----+-------+------------------|
|  50 |    51 | Lamerica (1994)  |
| 236 |   237 | Enfer, L' (1994) |
| 267 |   268 | Priest (1994)    |
+-----+-------+------------------+
Recommendation for Person ID:13
+-----+-------+---------------------+
|     |   Id2 | title               |
|-----+-------+---------------------|
|  50 |    51 | Lamerica (1994)     |
| 161 |   162 | Mute Witness (1994) |
| 267 |   268 | Priest (1994)       |
+-----+-------+---------------------+
