# Importation of packages and loading of `main_df`

In [2]:
# importation of packages
import pandas as pd
import sklearn

In [3]:
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [4]:
!pip install scikit-surprise

from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD
from surprise.model_selection import cross_validate
from surprise.reader import Reader
from surprise import Dataset
from surprise import accuracy



Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1634004 sha256=56b3096b43cf8d9d99b2fc5e03018fc39e69bde3e168a3919a9d6c6b85f1003e
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


In [5]:
# import reviews dataset
file_id = '1ig3CDboWXJOQOXNVy0Kt71ll2r2sQEY5'  # Google Drive file ID taken from the shareable link

downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('main_df_updated.csv')  
main_df = pd.read_csv('main_df_updated.csv')
# Dataset is now stored in a Pandas Dataframe


# Checking of `main_df` and creation of `dataset` dataframe

In [6]:
# check reviews dataframe
main_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26408 entries, 0 to 26407
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          26408 non-null  int64  
 1   productID           26408 non-null  object 
 2   reviewerID          26408 non-null  object 
 3   vote                26408 non-null  object 
 4   rating              26408 non-null  int64  
 5   date                26408 non-null  object 
 6   helpfulness_votes   26408 non-null  int64  
 7   total_votes         26408 non-null  int64  
 8   helpfulness_rating  15724 non-null  float64
 9   category            26408 non-null  object 
 10  brand               0 non-null      float64
 11  title               26408 non-null  object 
 12  year                26408 non-null  int64  
 13  month               26408 non-null  int64  
 14  month year          26408 non-null  object 
dtypes: float64(2), int64(6), object(7)
memory usage: 3.0+

In [7]:
# ensure the rating column is an integer (info() above already reports int64)
main_df['rating'] = main_df['rating'].astype(int)


In [8]:
# check
main_df.head()

Unnamed: 0.1,Unnamed: 0,productID,reviewerID,vote,rating,date,helpfulness_votes,total_votes,helpfulness_rating,category,brand,title,year,month,month year
0,437,B00000DMAT,A1LBAC84TLIGAX,"[3, 3]",5,2007-12-12,3,3,100.0,Nintendo 64 Games,,GoldenEye 007,2007,12,2007-12
1,439,B00000DMAT,AMP7TQRWAIE84,"[1, 1]",4,2009-05-31,1,1,100.0,Nintendo 64 Games,,GoldenEye 007,2009,5,2009-05
2,440,B00000DMAT,A1G0VFQ9198IUF,"[2, 2]",5,2012-07-30,2,2,100.0,Nintendo 64 Games,,GoldenEye 007,2012,7,2012-07
3,441,B00000DMAT,A2N1C9JKI2C5XD,"[0, 0]",4,2002-07-28,0,0,,Nintendo 64 Games,,GoldenEye 007,2002,7,2002-07
4,443,B00000DMAT,A1AR8HYZ17T5H7,"[0, 0]",5,2002-04-10,0,0,,Nintendo 64 Games,,GoldenEye 007,2002,4,2002-04


In [9]:
main_df = main_df.drop(['Unnamed: 0'], axis=1)

In [10]:
# Set the reader with accurate rating scale
reader = Reader(rating_scale=(1, 5))

# Set the dataset
dataset = Dataset.load_from_df(main_df[["reviewerID", "productID", "rating"]], reader)
dataset

<surprise.dataset.DatasetAutoFolds at 0x7f8a6e49f3d0>

The reader is set to a 1-5 scale because Amazon uses a 5-star rating system.

# Algorithm tuning with GridSearchCV

In this section, I tune the algorithm with GridSearchCV.

I fit GridSearchCV on `dataset` and then read off the parameters that give the best FCP score.

In [11]:
# Import GridSearchCV for algorithm tuning
from surprise.model_selection import GridSearchCV

# Set the parameter grid
param_grid = {
    'n_factors': [100, 150], 
    'n_epochs': [10, 20],
    'lr_all': [0.0005, 0.1],
    'biased': [False] }

# Set up GridSearchCV with 3-fold cross-validation
GS = GridSearchCV(FunkSVD, param_grid, measures=['fcp'], cv=3)

# Fit the model
GS.fit(dataset)

In [12]:
# Check the FCP accuracy score (1.0 is ideal and 0 is worst)
GS.best_score['fcp']

0.613101556668934

In [13]:
# Check the best parameters
GS.best_params['fcp']

{'n_factors': 150, 'n_epochs': 20, 'lr_all': 0.1, 'biased': False}

As shown above, the parameters that give the best FCP score are:


*   `n_factors`: 150
*   `n_epochs`: 20
*   `lr_all`: 0.1
*   `biased`: False

These are the parameters that will be used during the train-test split with FunkSVD.
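
To get a feel for how much work the grid search above does, the following sketch enumerates the parameter combinations with the standard library. It only mirrors the `param_grid` and `cv=3` values from the cell above and does not call Surprise; the names `combinations`, `cv_folds` and `n_fits` are illustrative.

```python
from itertools import product

# Same grid as in the GridSearchCV cell above.
param_grid = {
    "n_factors": [100, 150],
    "n_epochs": [10, 20],
    "lr_all": [0.0005, 0.1],
    "biased": [False],
}
cv_folds = 3  # matches cv=3

# Cartesian product of all value lists, one dict per candidate model.
names = list(param_grid)
combinations = [
    dict(zip(names, values)) for values in product(*param_grid.values())
]

# Every candidate is trained and scored once per fold.
n_fits = len(combinations) * cv_folds
print(len(combinations), n_fits)
```

Each combination is fitted once per fold, so adding a value to any list multiplies the runtime accordingly.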



# Train-test split with FunkSVD using 25% test size

In this section, I run a train-test split with Funk SVD, using a test size of 25% and the parameters that gave the best FCP score in the previous section.

In [14]:
# Import train_test_split
from surprise.model_selection import train_test_split

# Split train test set
trainset, testset = train_test_split(dataset, test_size=0.25)

# Set the algorithm
svd = FunkSVD(n_factors=150, 
                 n_epochs=20, 
                 lr_all=0.1, 
                 biased=False,
                 verbose=0)
# Fit train set
svd.fit(trainset)

# Test the algorithm using test set
pred = svd.test(testset)

In [15]:
# Put pred result in a dataframe
df_prediction = pd.DataFrame(pred, columns=['reviewerID',
                                                     'productID',
                                                     'actual',
                                                     'prediction',
                                                     'details'])

# Calculate the difference of actual and prediction into diff column
df_prediction['diff'] = abs(df_prediction['prediction'] - 
                            df_prediction['actual'])

In [16]:
# Check the df_prediction
df_prediction.head()

Unnamed: 0,reviewerID,productID,actual,prediction,details,diff
0,A20DZX38KRBIT8,B00004SVYQ,5.0,2.639222,{'was_impossible': False},2.360778
1,A1LRMNOS0FZK0T,B0000657SP,5.0,5.0,{'was_impossible': False},0.0
2,A2290OIJTU42QP,B00002STXN,5.0,4.199655,{'was_impossible': False},0.800345
3,A1Y5LUJZ8879PP,B0009A4EVM,3.0,4.346253,{'was_impossible': False},1.346253
4,A1OMXVXXP07F05,B0050SY77E,5.0,3.648412,{'was_impossible': False},1.351588


In [17]:
# See the best 10 predictions
df_prediction.sort_values(by='diff')[:10]

Unnamed: 0,reviewerID,productID,actual,prediction,details,diff
2422,A3GKMQFL05Z79K,B000084318,5.0,5.0,{'was_impossible': False},0.0
1498,A3KZZ9JRY9FROO,B0007TS24U,5.0,5.0,{'was_impossible': False},0.0
2374,A1AFBLHAJXW2MS,B001E8VB6O,5.0,5.0,{'was_impossible': False},0.0
2952,A3GKMQFL05Z79K,B000B69E9G,5.0,5.0,{'was_impossible': False},0.0
3580,A1RMGCJY22YIMZ,B00178630A,1.0,1.0,{'was_impossible': False},0.0
2368,A2PSEMWT9TR272,B00269DXCK,5.0,5.0,{'was_impossible': False},0.0
4685,A2KIWW4WKIQ08Z,B001TOQ8R0,1.0,1.0,{'was_impossible': False},0.0
4694,A215WH6RUDUCMP,B002AU0HZQ,5.0,5.0,{'was_impossible': False},0.0
5369,A2UWU9TMO9N7AX,B002NN7AKU,5.0,5.0,{'was_impossible': False},0.0
2326,A1VW4NKCLT1D0T,B00004Y57G,5.0,5.0,{'was_impossible': False},0.0


In [18]:
# See the worst 10 predictions
df_prediction.sort_values(by='diff')[-10:]

Unnamed: 0,reviewerID,productID,actual,prediction,details,diff
3171,A3IXPFN5DU9Z9L,B000X9FV5M,5.0,1.175346,{'was_impossible': False},3.824654
6502,A4VF4V6A4W0H7,B002CZ38KA,5.0,1.156028,{'was_impossible': False},3.843972
3534,AZAH84SERW5GR,B00007KUUD,5.0,1.075935,{'was_impossible': False},3.924065
3712,A2YNK8YBJQ6DY9,B001TOQ8R0,5.0,1.028452,{'was_impossible': False},3.971548
5530,A1JSHCJ6HI1XVV,B001TOQ8R0,5.0,1.0,{'was_impossible': False},4.0
5844,A27GR13SEFSTKN,B001TOQ8R0,5.0,1.0,{'was_impossible': False},4.0
5345,A151R01M3AL59K,B000BC38LA,5.0,1.0,{'was_impossible': False},4.0
4003,A3NG1G3P89FI60,B007FTE2VW,5.0,1.0,{'was_impossible': False},4.0
5774,A2KGZBP46QI6CL,B001BNFQKO,5.0,1.0,{'was_impossible': False},4.0
706,ATBR7F7455PTZ,B0053BCO00,5.0,1.0,{'was_impossible': False},4.0


In [19]:
# Check total rows with same actual and prediction ratings
df_prediction[df_prediction['diff'] <= 0]

Unnamed: 0,reviewerID,productID,actual,prediction,details,diff
1,A1LRMNOS0FZK0T,B0000657SP,5.0,5.0,{'was_impossible': False},0.0
9,A2II09GQGWOMTQ,B0000657SP,5.0,5.0,{'was_impossible': False},0.0
90,A1187K7PTO0C3D,B0002A6CQ4,5.0,5.0,{'was_impossible': False},0.0
105,A1DDG2R80UWTPI,B002I0IVC4,5.0,5.0,{'was_impossible': False},0.0
196,A2LHTGEN0KRG0K,B003O6E800,5.0,5.0,{'was_impossible': False},0.0
...,...,...,...,...,...,...
6350,A3284KYDZ00BZA,B001BNFQKO,1.0,1.0,{'was_impossible': False},0.0
6378,APS7IH14C8AZ9,B00000F1GM,5.0,5.0,{'was_impossible': False},0.0
6391,AXIQ99RS1E2JW,B0050SX0UY,5.0,5.0,{'was_impossible': False},0.0
6426,A15U64VGUV6RBF,B0050SXX88,5.0,5.0,{'was_impossible': False},0.0


Below, we can see that only about 2% of the predictions match the actual rating exactly. Exact matches are rare because the predicted ratings are floats; the exact matches listed above are all at 1.0 or 5.0, the two ends of the rating scale.

In [20]:
(df_prediction['diff'] == 0).mean()

0.019842471978188427

To account for the predicted ratings being floats, we count a prediction as close when it is within ±1 of the actual rating.

Below, we can see that about 62% of the predictions fall within that threshold. Now we can move to the next section, where we build the full train set.

In [21]:
(df_prediction["diff"] <= 1).mean()

0.6161769160860345
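
The two shares computed above can be reproduced by hand on a handful of rows. The sketch below uses the five (actual, prediction) pairs shown in the `df_prediction.head()` output earlier and plain Python lists instead of pandas; the variable names are illustrative.

```python
# (actual, prediction) pairs copied from the df_prediction.head() output.
pairs = [
    (5.0, 2.639222),
    (5.0, 5.0),
    (5.0, 4.199655),
    (3.0, 4.346253),
    (5.0, 3.648412),
]

# Absolute error per prediction, like the `diff` column.
diffs = [abs(prediction - actual) for actual, prediction in pairs]

# Share of exact matches and share within the +-1 threshold.
exact_share = sum(d == 0 for d in diffs) / len(diffs)
within_one_share = sum(d <= 1 for d in diffs) / len(diffs)
print(exact_share, within_one_share)
```

On these five rows, one prediction is exact and two are within one star, which is the same calculation as `.mean()` on a boolean pandas Series.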

# Full trainset and full testset

In [22]:
# Build full trainset
full_trainset = dataset.build_full_trainset()

# Build the SVD algorithm
svd = FunkSVD(n_factors=150, 
                 n_epochs=20, 
                 lr_all=0.1,    
                 biased=False, 
                 verbose=0)

# Fit with full trainset
svd.fit(full_trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f8a6d399a10>

In [23]:
# Define the full test set
full_testset = full_trainset.build_anti_testset(fill=-1)

In [24]:
# Set the prediction
prediction = svd.test(full_testset)

In [25]:
# Put into a dataframe
df_prediction = pd.DataFrame(prediction, columns=['reviewerID',
                                                     'productID',
                                                     'actual',
                                                     'prediction',
                                                     'details'])

In [26]:
df_prediction.head()

Unnamed: 0,reviewerID,productID,actual,prediction,details
0,A1LBAC84TLIGAX,B00000DMB3,-1.0,4.103668,{'was_impossible': False}
1,A1LBAC84TLIGAX,B00000F1GM,-1.0,3.837017,{'was_impossible': False}
2,A1LBAC84TLIGAX,B00000I1BJ,-1.0,4.21217,{'was_impossible': False}
3,A1LBAC84TLIGAX,B00000I1BY,-1.0,3.075118,{'was_impossible': False}
4,A1LBAC84TLIGAX,B00000INR2,-1.0,4.282903,{'was_impossible': False}
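
The `actual` value of -1.0 in the table above is the `fill` placeholder: the anti-testset contains every reviewer-product pair that has no rating in the train set. A minimal pure-Python sketch of that idea, using made-up toy ratings (the IDs `u1`, `p1`, etc. are hypothetical and this is not Surprise's own implementation):

```python
# Hypothetical toy ratings: (reviewer, product, rating).
ratings = [
    ("u1", "p1", 5),
    ("u1", "p2", 4),
    ("u2", "p1", 1),
    ("u3", "p3", 3),
]
fill = -1  # placeholder rating, as in build_anti_testset(fill=-1)

users = sorted({user for user, _, _ in ratings})
items = sorted({item for _, item, _ in ratings})
rated = {(user, item) for user, item, _ in ratings}

# Every (user, item) pair that was NOT rated, tagged with the placeholder.
anti_testset = [
    (user, item, fill)
    for user in users
    for item in items
    if (user, item) not in rated
]
print(len(anti_testset))
```

With 3 reviewers, 3 products and 4 known ratings, 5 unrated pairs remain; these are the pairs the model is asked to score.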


# Analysis of latent factors

Before evaluating the model, we will examine the latent factors by selecting a reviewer.

The user matrix is stored under the `svd.pu` attribute.

In [27]:
# user matrix
user_matrix = svd.pu
user_matrix.shape

(2536, 150)

The product matrix is stored under the `svd.qi` attribute.

In [28]:
# product matrix
product_matrix = svd.qi.T
product_matrix.shape

(150, 734)

From the user and product matrices, we can see that there are 150 latent variables.
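
As far as I understand Surprise's SVD, with `biased=False` the estimate for a known reviewer and product is the dot product of their two latent vectors, and the result is clipped to the rating scale by default. The sketch below illustrates this with three made-up factors instead of 150; `predict_rating` and all vectors are hypothetical.

```python
def predict_rating(user_factors, item_factors, low=1.0, high=5.0):
    """Dot product of two latent vectors, clipped to the rating scale."""
    raw = sum(p * q for p, q in zip(user_factors, item_factors))
    return min(high, max(low, raw))

# Made-up latent vectors (the real model uses 150 factors per vector).
p_u = [1.5, -0.5, 2.0]      # one row of the user matrix
q_mid = [1.0, 0.5, 1.25]    # raw score 3.75, inside the scale
q_high = [2.0, 0.0, 2.0]    # raw score 7.0, clipped down to 5.0
q_low = [-1.0, 1.0, 0.0]    # raw score -2.0, clipped up to 1.0

print(predict_rating(p_u, q_mid))
print(predict_rating(p_u, q_high))
print(predict_rating(p_u, q_low))
```

Clipping is one plausible reason why so many predictions in the tables are exactly 5.0 or 1.0.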

Below, we will get the predictions for reviewer `A29IKPDK3G334J`. This reviewer rated eight products.

In [35]:
# check reviewerID `A29IKPDK3G334J`'s ratings
main_df.loc[main_df['reviewerID'] == 'A29IKPDK3G334J']

Unnamed: 0,productID,reviewerID,vote,rating,date,helpfulness_votes,total_votes,helpfulness_rating,category,brand,title,year,month,month year
5182,B0000696CZ,A29IKPDK3G334J,"[0, 0]",5,2007-12-21,0,0,,PlayStation 2 Games,,Grand Theft Auto Vice City,2007,12,2007-12
11031,B000FQ2DTA,A29IKPDK3G334J,"[3, 8]",5,2010-03-09,3,8,38.0,PlayStation 3 Games,,Final Fantasy XIII - Playstation 3,2010,3,2010-03
12150,B000GPVUQ2,A29IKPDK3G334J,"[2, 9]",5,2006-10-12,2,9,22.0,PlayStation 2 Games,,Mortal Kombat Armageddon - PlayStation 2,2006,10,2006-10
12627,B000JLIXIG,A29IKPDK3G334J,"[5, 7]",5,2006-12-25,5,7,71.0,PlayStation 3 Games,,Resistance: Fall of Man - Playstation 3,2006,12,2006-12
12801,B000K9OR4Q,A29IKPDK3G334J,"[8, 11]",5,2007-10-05,8,11,73.0,PlayStation 3 Games,,Lair - Playstation 3,2007,10,2007-10
21055,B002BSA1C6,A29IKPDK3G334J,"[7, 14]",5,2010-11-24,7,14,50.0,PlayStation 3 Games,,Gran Turismo 5 - Playstation 3,2010,11,2010-11
21384,B002BSC4ZS,A29IKPDK3G334J,"[4, 10]",5,2010-09-02,4,10,40.0,Wii Games,,Metroid: Other M,2010,9,2010-09
21601,B002CZ38KA,A29IKPDK3G334J,"[1, 1]",5,2010-03-21,1,1,100.0,PlayStation 3 Games,,Heavy Rain - Greatest Hits,2010,3,2010-03


In [36]:
# Check reviewerID `A29IKPDK3G334J` predictions
df = df_prediction[df_prediction['reviewerID'] == 'A29IKPDK3G334J']\
    .sort_values(by=['prediction'], ascending=False)\
    .head(20)

display(df)

Unnamed: 0,reviewerID,productID,actual,prediction,details
871015,A29IKPDK3G334J,B001E8VB6O,-1.0,5.0,{'was_impossible': False}
870927,A29IKPDK3G334J,B000X25GW2,-1.0,5.0,{'was_impossible': False}
870944,A29IKPDK3G334J,B000ZK9QD2,-1.0,5.0,{'was_impossible': False}
870957,A29IKPDK3G334J,B0013RATNM,-1.0,5.0,{'was_impossible': False}
870963,A29IKPDK3G334J,B0014X7SQ6,-1.0,5.0,{'was_impossible': False}
870964,A29IKPDK3G334J,B0015AARJI,-1.0,5.0,{'was_impossible': False}
870981,A29IKPDK3G334J,B00184219U,-1.0,5.0,{'was_impossible': False}
870993,A29IKPDK3G334J,B001B1W3GG,-1.0,5.0,{'was_impossible': False}
870998,A29IKPDK3G334J,B001BX6JUA,-1.0,5.0,{'was_impossible': False}
870999,A29IKPDK3G334J,B001C6GVI6,-1.0,5.0,{'was_impossible': False}


In [37]:
merge_df = df.merge(main_df[['productID', 'title', 'category']].drop_duplicates(), how='left', left_on=['productID'], right_on=['productID'])

# check recommendations for reviewerID `A29IKPDK3G334J`
merge_df

Unnamed: 0,reviewerID,productID,actual,prediction,details,title,category
0,A29IKPDK3G334J,B001E8VB6O,-1.0,5.0,{'was_impossible': False},Batman: Arkham Asylum - Playstation 3,PlayStation 3 Games
1,A29IKPDK3G334J,B000X25GW2,-1.0,5.0,{'was_impossible': False},No More Heroes,Wii Games
2,A29IKPDK3G334J,B000ZK9QD2,-1.0,5.0,{'was_impossible': False},Gears of War 2 - Xbox 360,Xbox 360 Games
3,A29IKPDK3G334J,B0013RATNM,-1.0,5.0,{'was_impossible': False},Just Cause 2 - Xbox 360,Xbox 360 Games
4,A29IKPDK3G334J,B0014X7SQ6,-1.0,5.0,{'was_impossible': False},Crisis Core: Final Fantasy VII - Sony PSP,Sony PSP Games
5,A29IKPDK3G334J,B0015AARJI,-1.0,5.0,{'was_impossible': False},PlayStation 3 Dualshock 3 Wireless Controller ...,PlayStation 3 Games
6,A29IKPDK3G334J,B00184219U,-1.0,5.0,{'was_impossible': False},Final Fantasy IV,Nintendo DS Games
7,A29IKPDK3G334J,B001B1W3GG,-1.0,5.0,{'was_impossible': False},Bioshock - Playstation 3,PlayStation 3 Games
8,A29IKPDK3G334J,B001BX6JUA,-1.0,5.0,{'was_impossible': False},Rock Band 2 - Xbox 360 (Game only),Xbox 360 Games
9,A29IKPDK3G334J,B001C6GVI6,-1.0,5.0,{'was_impossible': False},Shin Megami Tensei: Persona 4 - PlayStation 2,PlayStation 2 Games


Above are the recommendations that we can make to reviewer `A29IKPDK3G334J` based on their own reviews. The suggested video games are quite similar to those the reviewer rated 5.

# Evaluation

The Fraction of Concordant Pairs (FCP) score is 61%. FCP measures the fraction of pairs of products rated by the same reviewer whose relative order is predicted correctly (i.e. the product the reviewer actually rated higher also receives the higher predicted rating). This is acceptable for our purposes, as we are not aiming for 100% accuracy. We want to be able to recommend new products to customers too, not just near-identical ones (e.g. recommending Tekken 3 and Tekken 4 after they liked Tekken 2).

In [32]:
# FCP
FCP = accuracy.fcp(pred, verbose=False)
print(FCP) 

0.6130780459196109
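
To make the FCP definition concrete, here is a simplified pure-Python sketch. It pools concordant and discordant pairs over all reviewers, whereas Surprise's `accuracy.fcp` averages per-user counts, so the two can differ slightly; the function name and toy data are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def fcp(predictions):
    """predictions: iterable of (user, actual_rating, estimated_rating)."""
    by_user = defaultdict(list)
    for user, actual, estimate in predictions:
        by_user[user].append((actual, estimate))

    concordant = discordant = 0
    for rated in by_user.values():
        for (a1, e1), (a2, e2) in combinations(rated, 2):
            if a1 == a2:
                continue  # no true preference between the two items
            if (a1 - a2) * (e1 - e2) > 0:
                concordant += 1  # predicted order agrees with actual order
            else:
                discordant += 1  # wrong order, or a tie in the estimates

    total = concordant + discordant
    return concordant / total if total else float("nan")

# Hypothetical toy predictions.
toy = [
    ("u1", 5, 4.5), ("u1", 3, 3.0), ("u1", 1, 3.5),
    ("u2", 4, 4.0), ("u2", 2, 2.5),
]
toy_fcp = fcp(toy)
print(toy_fcp)
```

In the toy data, three of the four comparable pairs are ordered correctly. Only the order within each reviewer's ratings matters, not how close the predicted values are.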


The Root Mean Squared Error (RMSE) score is 1.14. The lower the score, the more accurate the recommender system. As mentioned above, we need to tolerate some inaccuracy in order to introduce novelty into the list of recommended products.

In [33]:
# RMSE
RMSE = accuracy.rmse(pred, verbose=False)
print(RMSE)

1.1424525170520687


The Mean Absolute Error (MAE) score is 0.92. This metric is the average absolute difference between the actual and predicted ratings, taken over all observations.

In [34]:
# MAE
MAE = accuracy.mae(pred, verbose=False)
print(MAE)

0.9162591214162541
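
For reference, both error metrics can be written in a few lines of standard-library Python. This is a minimal sketch on made-up ratings, not the Surprise implementation; `rmse`, `mae` and the toy lists are illustrative names.

```python
import math

def rmse(actual, predicted):
    """Root of the mean squared difference; penalises large misses more."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean of the absolute differences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

# Hypothetical toy ratings: errors are 1, 0 and 2 stars.
actual = [5.0, 3.0, 4.0]
predicted = [4.0, 3.0, 2.0]

toy_rmse = rmse(actual, predicted)
toy_mae = mae(actual, predicted)
print(toy_rmse, toy_mae)
```

RMSE is never smaller than MAE on the same data, which matches the ordering of the two scores above (1.14 versus 0.92).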
