# Building and Testing Recommender Systems With Surprise

Surprise is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

<http://surpriselib.com/>

In [65]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [70]:
movies = pd.read_csv(r'C:\Users\delchain_default\Documents\GitHub\Python-Notes\Machine Learning\Recommender System (Advanced)\Movie_data.csv')

movies.head()

Unnamed: 0.1,Unnamed: 0,user_id,item_id,rating,timestamp,title
0,0,0,50,5,881250949,Star Wars (1977)
1,1,290,50,5,880473582,Star Wars (1977)
2,2,79,50,4,891271545,Star Wars (1977)
3,3,2,50,5,888552084,Star Wars (1977)
4,4,8,50,5,879362124,Star Wars (1977)


In [74]:
users = movies.groupby('user_id')['rating'].count().reset_index().sort_values('rating', ascending=False)
users

Unnamed: 0,user_id,rating
405,405,737
655,655,685
13,13,636
450,450,540
276,276,518
...,...,...
571,571,20
19,19,20
888,888,20
895,895,20


Most users gave relatively few ratings and very few users gave many: the least active users shown above have 20 ratings, while the most active user gave 737.

The number of ratings per movie and the number of ratings per user share the same distribution: both decay exponentially.

To reduce the dimensionality of the dataset, we will filter out movies that are rarely rated and users who rarely rate.

In [79]:
min_movie_ratings = 50
filter_movies = movies['title'].value_counts() > min_movie_ratings
filter_movies = filter_movies[filter_movies].index.tolist()

min_user_ratings = 50
filter_users = movies['user_id'].value_counts() > min_user_ratings
filter_users = filter_users[filter_users].index.tolist()

df = movies[(movies['title'].isin(filter_movies)) & (movies['user_id'].isin(filter_users))]
print('The original data frame shape:\t{}'.format(movies.shape))
print('The new data frame shape:\t{}'.format(df.shape))

The original data frame shape:	(100003, 6)
The new data frame shape:	(73134, 6)


In [118]:
df['title'].nunique()

596

In [80]:
df

Unnamed: 0.1,Unnamed: 0,user_id,item_id,rating,timestamp,title
1,1,290,50,5,880473582,Star Wars (1977)
2,2,79,50,4,891271545,Star Wars (1977)
3,3,2,50,5,888552084,Star Wars (1977)
4,4,8,50,5,879362124,Star Wars (1977)
5,5,274,50,5,878944679,Star Wars (1977)
...,...,...,...,...,...,...
97645,97645,493,881,1,884130009,Money Talks (1997)
97647,97647,697,881,2,882621523,Money Talks (1997)
97649,97649,629,881,3,880116023,Money Talks (1997)
97652,97652,889,881,3,880176717,Money Talks (1997)


## Surprise

To load a data set from the above pandas data frame, we will use the load_from_df() method. We will also need a Reader object, and its rating_scale parameter must be specified. The data frame must have three columns, corresponding to the user ids, the item ids, and the ratings, in this order. Each row thus corresponds to a given rating.

In [81]:
from surprise import Reader
from surprise import Dataset

from surprise.model_selection import cross_validate
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering

from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import train_test_split

In [133]:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'title', 'rating']], reader)

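The Reader class is used to parse a file containing ratings.

Such a file is assumed to specify only one rating per line, and each line needs to respect the following structure: `user ; item ; rating ; [timestamp]`, where the brackets mark the timestamp as optional and the field order and separator are configurable. Since we load from a data frame here rather than from a file, only rating_scale is needed.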

In [134]:
benchmark = []

# Iterate over all algorithms (NMF() is left out)
algorithms = [SVD(), SVDpp(), SlopeOne(), NormalPredictor(), KNNBaseline(),
              KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]
for algorithm in algorithms:
    # Perform 3-fold cross-validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)

    # Average the per-fold results and record the algorithm name.
    # pd.concat replaces Series.append, which was removed in pandas 2.0.
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([algorithm.__class__.__name__], index=['Algorithm'])])
    benchmark.append(tmp)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...


In [136]:

surprise_results = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')
surprise_results
    

Algorithm,test_rmse,fit_time,test_time
SVDpp,0.904197,51.608334,2.03639
KNNBaseline,0.909188,0.194483,3.200658
SVD,0.919603,1.969649,0.130978
SlopeOne,0.920265,0.161577,1.781889
KNNWithMeans,0.921643,0.160902,2.651117
KNNWithZScore,0.922562,0.191471,2.867672
BaselineOnly,0.922689,0.060172,0.09541
CoClustering,0.930808,0.828451,0.112698
KNNBasic,0.946183,0.15226,2.464787
NormalPredictor,1.456741,0.052526,0.128323
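
The `test_rmse` column is a root-mean-square error: the square root of the average squared difference between true and estimated ratings, so lower is better. As a minimal sketch of that formula, the snippet below computes it by hand. The function name `rmse_by_hand` and the toy ratings are assumptions made for illustration; they are not part of Surprise or of this data set.

```python
import math


def rmse_by_hand(pairs):
    """Root-mean-square error over (true_rating, estimated_rating) pairs."""
    if not pairs:
        raise ValueError("need at least one prediction")
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical (true, estimated) ratings, not taken from the notebook's data.
toy_predictions = [(3.0, 3.5), (5.0, 4.0), (1.0, 2.0), (4.0, 4.0)]
toy_rmse = rmse_by_hand(toy_predictions)
print(toy_rmse)  # squared errors 0.25, 1, 1, 0 -> mean 0.5625 -> 0.75
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.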


## Train and Predict

SVDpp produces the best results; however, it is very time consuming. Let's train and predict using KNNBaseline, which comes second on RMSE and fits far faster.

In [137]:
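print('Using KNNBaseline')
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
# Note: bsl_options is defined above but not passed to the algorithm, so the
# defaults are used here. Use KNNBaseline(bsl_options=bsl_options) to apply it.
algo = KNNBaseline()
cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)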

Using KNNBaseline
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.


{'test_rmse': array([0.9062596 , 0.90635535, 0.91391631]),
 'fit_time': (0.21741890907287598, 0.19944000244140625, 0.1965010166168213),
 'test_time': (3.2124080657958984, 3.241358518600464, 3.121652364730835)}
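
The dictionary returned by `cross_validate` holds one value per fold, so a single score requires averaging. A small sketch of that summary step, using only the standard library and the three `test_rmse` values printed above (the variable names are illustrative):

```python
from statistics import mean, stdev

# Per-fold test RMSE values copied from the cross_validate output above.
fold_rmse = [0.9062596, 0.90635535, 0.91391631]

mean_rmse = mean(fold_rmse)
spread = stdev(fold_rmse)  # sample standard deviation across the 3 folds

print(f"mean RMSE: {mean_rmse:.4f} (+/- {spread:.4f})")
```

The benchmark loop earlier does the same kind of averaging with `mean(axis=0)`; its KNNBaseline figure differs slightly because the folds are drawn at random on each run.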

We use train_test_split() to sample a trainset and a testset of given sizes, with RMSE as the accuracy metric. We then call the fit() method, which trains the algorithm on the trainset, and the test() method, which returns the predictions made on the testset.

In [138]:
trainset, testset = train_test_split(data, test_size=0.25)
algo = KNNBaseline()

predictions = algo.fit(trainset).test(testset)
accuracy.rmse(predictions)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8962


0.8961858518195142

To inspect our predictions in detail, we are going to build a pandas data frame with all the predictions. The following code was largely taken from this notebook.

In [139]:
trainset = algo.trainset
print(algo.__class__.__name__)

KNNBaseline


In [140]:
def get_Iu(user_id):
    """Return the number of items rated by the given user.

    user_id: the raw id of the user.
    """
    try:
        return len(trainset.ur[trainset.to_inner_uid(user_id)])
    except ValueError:  # user was not part of the trainset
        return 0


def get_Ui(item_id):
    """Return the number of users that have rated the given item.

    item_id: the raw id of the item (here, the movie title).
    """
    try:
        return len(trainset.ir[trainset.to_inner_iid(item_id)])
    except ValueError:  # item was not part of the trainset
        return 0


df1 = pd.DataFrame(predictions, columns=['user_id', 'item_id', 'real', 'est', 'details'])
df1['Item_USer'] = df1.user_id.apply(get_Iu)   # number of items rated by the user
df1['User_Item'] = df1.item_id.apply(get_Ui)   # number of users who rated the item
df1['err'] = abs(df1.est - df1.real)           # absolute prediction error

In [141]:

df1.head()

Unnamed: 0,user_id,item_id,real,est,details,Item_USer,User_Item,err
0,504,Four Weddings and a Funeral (1994),3.0,3.890772,"{'actual_k': 40, 'was_impossible': False}",143,182,0.890772
1,361,"Birds, The (1963)",3.0,3.791958,"{'actual_k': 40, 'was_impossible': False}",86,117,0.791958
2,650,Highlander (1986),3.0,2.998583,"{'actual_k': 40, 'was_impossible': False}",184,111,0.001417
3,234,Batman (1989),1.0,3.007702,"{'actual_k': 40, 'was_impossible': False}",252,146,2.007702
4,383,Bob Roberts (1992),4.0,4.331532,"{'actual_k': 40, 'was_impossible': False}",53,63,0.331532


In [142]:
# What is 'details'?
# What I am wondering: if I create a user with a couple of movie ratings, how do I issue recommendations for this user?
# We used cross-validation to validate the model.
# How can we optimize the KNN parameters? Is there a GridSearch method available to improve the fit?
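
On the recommendation question, one common approach is to estimate ratings for the items a user has not rated yet and keep the highest estimates. The sketch below shows only that ranking step, in plain Python. The `Prediction` namedtuple is a stand-in whose field names mirror the `df1` columns, and `top_n_per_user` and the toy data are hypothetical. With Surprise itself, the unrated pairs could presumably be obtained from `trainset.build_anti_testset()` and scored with `algo.test()`; a brand-new user's ratings would likely need to be added to the data frame before fitting.

```python
from collections import defaultdict, namedtuple

# Stand-in for the prediction tuples returned by test(); the field names
# mirror the columns used when building df1 and are an assumption here.
Prediction = namedtuple("Prediction", ["user_id", "item_id", "real", "est", "details"])


def top_n_per_user(predictions, n=2):
    """Map each user to their n items with the highest estimated rating."""
    by_user = defaultdict(list)
    for p in predictions:
        by_user[p.user_id].append((p.item_id, p.est))
    return {
        user: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for user, items in by_user.items()
    }


# Hypothetical predictions, not taken from the notebook's output.
toy = [
    Prediction(1, "A", None, 4.2, {}),
    Prediction(1, "B", None, 3.1, {}),
    Prediction(1, "C", None, 4.8, {}),
    Prediction(2, "A", None, 2.5, {}),
]
top = top_n_per_user(toy, n=2)
print(top)
```

Ranking by `est` rather than by error matters here: for unrated items there is no true rating to compare against, so the estimate is the only signal available.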

In [143]:

best_predictions = df1.sort_values(by='err')[:20]
worst_predictions = df1.sort_values(by='err')[-20:]


In [144]:
df1.shape

(18284, 8)

In [145]:
best_predictions

Unnamed: 0,user_id,item_id,real,est,details,Item_USer,User_Item,err
17487,130,Braveheart (1995),5.0,5.0,"{'actual_k': 40, 'was_impossible': False}",203,210,0.0
7597,457,"Wrong Trousers, The (1993)",5.0,5.0,"{'actual_k': 40, 'was_impossible': False}",186,77,0.0
2094,507,Titanic (1997),5.0,5.0,"{'actual_k': 40, 'was_impossible': False}",37,158,0.0
7480,330,Schindler's List (1993),5.0,5.0,"{'actual_k': 40, 'was_impossible': False}",103,196,0.0
3257,16,"Usual Suspects, The (1995)",5.0,5.0,"{'actual_k': 40, 'was_impossible': False}",88,180,0.0
8539,907,"Great Escape, The (1963)",5.0,5.0,"{'actual_k': 40, 'was_impossible': False}",100,87,0.0
14338,152,Braveheart (1995),5.0,5.0,"{'actual_k': 40, 'was_impossible': False}",74,210,0.0
7433,821,Casablanca (1942),5.0,5.0,"{'actual_k': 40, 'was_impossible': False}",43,164,0.0
3355,118,Apocalypse Now (1979),5.0,5.0,"{'actual_k': 40, 'was_impossible': False}",40,150,0.0
11367,532,Schindler's List (1993),5.0,5.0,"{'actual_k': 40, 'was_impossible': False}",135,196,0.0


In [146]:
worst_predictions

Unnamed: 0,user_id,item_id,real,est,details,Item_USer,User_Item,err
14333,207,Vegas Vacation (1997),5.0,1.929244,"{'actual_k': 40, 'was_impossible': False}",145,43,3.070756
11690,246,Pulp Fiction (1994),1.0,4.074338,"{'actual_k': 40, 'was_impossible': False}",121,278,3.074338
4748,867,Leaving Las Vegas (1995),1.0,4.083843,"{'actual_k': 40, 'was_impossible': False}",65,167,3.083843
978,887,Private Parts (1997),1.0,4.084153,"{'actual_k': 40, 'was_impossible': False}",100,64,3.084153
6736,189,Local Hero (1983),1.0,4.086916,"{'actual_k': 40, 'was_impossible': False}",120,47,3.086916
1628,562,One Flew Over the Cuckoo's Nest (1975),1.0,4.111777,"{'actual_k': 40, 'was_impossible': False}",50,188,3.111777
17369,314,Pulp Fiction (1994),1.0,4.116691,"{'actual_k': 40, 'was_impossible': False}",124,278,3.116691
9277,588,Groundhog Day (1993),1.0,4.164159,"{'actual_k': 40, 'was_impossible': False}",143,207,3.164159
8103,559,Braveheart (1995),1.0,4.165262,"{'actual_k': 40, 'was_impossible': False}",52,210,3.165262
7307,145,Face/Off (1997),1.0,4.165268,"{'actual_k': 40, 'was_impossible': False}",174,112,3.165268
