# Recommendation Systems project

## Introduction

**DOMAIN**: Smartphone, Electronics

**CONTEXT**: India is the second-largest smartphone market in the world after China. About 134 million smartphones were sold across India
in 2017, and sales are estimated to rise to about 442 million in 2022. India ranked second in Asia Pacific for the average time that
smartphone users spend on the mobile web. The combination of very high sales volumes and this consumer behaviour has
made India a very attractive market for foreign vendors. In terms of consumer behaviour, 97% of consumers turn to a search engine when they
are buying a product, versus 15% who turn to social media. If a seller succeeds in showing smartphones that match a user's behaviour or choices in the
right place, there is a 90% chance that the user will enquire about them. This case study aims to build a recommendation system
based on an individual consumer's behaviour or choices.

**DATA DESCRIPTION**:

• author : name of the person who gave the rating

• country : country the person who gave the rating belongs to

• date : date of the rating

• domain: website from which the rating was taken

• extract: rating content

• lang: language in which the rating was given

• product: name of the product/mobile phone for which the rating was given

• score: average rating for the phone

• score_max: highest rating given for the phone

• source: source from where the rating was taken

Import the necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
!pip install scikit-surprise



In [3]:
from collections import defaultdict
from surprise import SVD
from surprise import KNNWithMeans
from surprise import Dataset
from surprise import accuracy
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

In [4]:
phone1 = pd.read_csv('phone_user_review_file_1.csv', encoding='latin-1')
phone2 = pd.read_csv('phone_user_review_file_2.csv', encoding='latin-1')
phone3 = pd.read_csv('phone_user_review_file_3.csv', encoding='latin-1')
phone4 = pd.read_csv('phone_user_review_file_4.csv', encoding='latin-1')
phone5 = pd.read_csv('phone_user_review_file_5.csv', encoding='latin-1')
phone6 = pd.read_csv('phone_user_review_file_6.csv', encoding='latin-1')

In [5]:
phone1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374910 entries, 0 to 374909
Data columns (total 11 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   phone_url  374910 non-null  object 
 1   date       374910 non-null  object 
 2   lang       374910 non-null  object 
 3   country    374910 non-null  object 
 4   source     374910 non-null  object 
 5   domain     374910 non-null  object 
 6   score      366691 non-null  float64
 7   score_max  366691 non-null  float64
 8   extract    371934 non-null  object 
 9   author     371641 non-null  object 
 10  product    374910 non-null  object 
dtypes: float64(2), object(9)
memory usage: 31.5+ MB


In [6]:
phone1.shape

(374910, 11)

In [7]:
phone2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114925 entries, 0 to 114924
Data columns (total 11 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   phone_url  114925 non-null  object 
 1   date       114925 non-null  object 
 2   lang       114925 non-null  object 
 3   country    114925 non-null  object 
 4   source     114925 non-null  object 
 5   domain     114925 non-null  object 
 6   score      112166 non-null  float64
 7   score_max  112166 non-null  float64
 8   extract    113965 non-null  object 
 9   author     113290 non-null  object 
 10  product    114925 non-null  object 
dtypes: float64(2), object(9)
memory usage: 9.6+ MB


In [8]:
phone2.shape

(114925, 11)

Merge all the datasets into one dataframe

In [9]:
data_df = pd.concat([phone1, phone2, phone3, phone4, phone5, phone6], axis=0)
data_df.head()

| | phone_url | date | lang | country | source | domain | score | score_max | extract | author | product |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /cellphones/samsung-galaxy-s8/ | 5/2/2017 | en | us | Verizon Wireless | verizonwireless.com | 10.0 | 10.0 | As a diehard Samsung fan who has had every Sam... | CarolAnn35 | Samsung Galaxy S8 |
| 1 | /cellphones/samsung-galaxy-s8/ | 4/28/2017 | en | us | Phone Arena | phonearena.com | 10.0 | 10.0 | Love the phone. the phone is sleek and smooth ... | james0923 | Samsung Galaxy S8 |
| 2 | /cellphones/samsung-galaxy-s8/ | 5/4/2017 | en | us | Amazon | amazon.com | 6.0 | 10.0 | Adequate feel. Nice heft. Processor's still sl... | R. Craig | Samsung Galaxy S8 (64GB) G950U 5.8" 4G LTE Unl... |
| 3 | /cellphones/samsung-galaxy-s8/ | 5/2/2017 | en | us | Samsung | samsung.com | 9.2 | 10.0 | Never disappointed. One of the reasons I've be... | Buster2020 | Samsung Galaxy S8 64GB (AT&T) |
| 4 | /cellphones/samsung-galaxy-s8/ | 5/11/2017 | en | us | Verizon Wireless | verizonwireless.com | 4.0 | 10.0 | I've now found that i'm in a group of people t... | S Ate Mine | Samsung Galaxy S8 |


## Data Preparation and Analysis

In [10]:
data_df.shape

(1415133, 11)

In [11]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1415133 entries, 0 to 163836
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   phone_url  1415133 non-null  object 
 1   date       1415133 non-null  object 
 2   lang       1415133 non-null  object 
 3   country    1415133 non-null  object 
 4   source     1415133 non-null  object 
 5   domain     1415133 non-null  object 
 6   score      1351644 non-null  float64
 7   score_max  1351644 non-null  float64
 8   extract    1395772 non-null  object 
 9   author     1351931 non-null  object 
 10  product    1415132 non-null  object 
dtypes: float64(2), object(9)
memory usage: 129.6+ MB


In [12]:
data_df.describe()

| | score | score_max |
|---|---|---|
| count | 1351644.0 | 1351644.0 |
| mean | 8.00706 | 10.0 |
| std | 2.616121 | 0.0 |
| min | 0.2 | 10.0 |
| 25% | 7.2 | 10.0 |
| 50% | 9.2 | 10.0 |
| 75% | 10.0 | 10.0 |
| max | 10.0 | 10.0 |


In [13]:
data_df['product'].value_counts(ascending=False)

Lenovo Vibe K4 Note (White,16GB)           5226
Lenovo Vibe K4 Note (Black, 16GB)          4390
OnePlus 3 (Graphite, 64 GB)                4103
OnePlus 3 (Soft Gold, 64 GB)               3563
Huawei P8lite zwart / 16 GB                2707
                                           ... 
JiaYu G4 Advanced (Black)                     1
SAMSUNG M140 CEP TELEFONU                     1
Samsung SGH-M140                              1
Sony Xperia tipo schwarz                      1
HTC Desire 601 - 8GB - Black Smartphone       1
Name: product, Length: 61313, dtype: int64

All features except score and score_max (both float) are of object type.

score_max holds the maximum possible rating and is constant at 10 throughout.

The same product appears under multiple names that differ in details such as colour or storage. These can be treated as a single product; for example, the white and black variants of the Lenovo Vibe K4 Note are one product.

Round scores to the nearest integer

In [14]:
data_df['score'] = round(data_df['score']).astype('Int64')

In [15]:
data_df.head()

| | phone_url | date | lang | country | source | domain | score | score_max | extract | author | product |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /cellphones/samsung-galaxy-s8/ | 5/2/2017 | en | us | Verizon Wireless | verizonwireless.com | 10 | 10.0 | As a diehard Samsung fan who has had every Sam... | CarolAnn35 | Samsung Galaxy S8 |
| 1 | /cellphones/samsung-galaxy-s8/ | 4/28/2017 | en | us | Phone Arena | phonearena.com | 10 | 10.0 | Love the phone. the phone is sleek and smooth ... | james0923 | Samsung Galaxy S8 |
| 2 | /cellphones/samsung-galaxy-s8/ | 5/4/2017 | en | us | Amazon | amazon.com | 6 | 10.0 | Adequate feel. Nice heft. Processor's still sl... | R. Craig | Samsung Galaxy S8 (64GB) G950U 5.8" 4G LTE Unl... |
| 3 | /cellphones/samsung-galaxy-s8/ | 5/2/2017 | en | us | Samsung | samsung.com | 9 | 10.0 | Never disappointed. One of the reasons I've be... | Buster2020 | Samsung Galaxy S8 64GB (AT&T) |
| 4 | /cellphones/samsung-galaxy-s8/ | 5/11/2017 | en | us | Verizon Wireless | verizonwireless.com | 4 | 10.0 | I've now found that i'm in a group of people t... | S Ate Mine | Samsung Galaxy S8 |
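
Rounding has a side effect worth checking: the lowest raw score in the `describe()` output is 0.2, which rounds to 0. The sketch below uses Python's built-in `round`, which resolves exact halves to the nearest even integer; NumPy and pandas rounding follow the same half-to-even rule. Apart from 0.2 and 9.2, the sample values are illustrative and not taken from the dataset.

```python
# Illustrative raw scores on the 0-10 scale (0.2 is the dataset minimum).
raw_scores = [0.2, 4.0, 7.5, 8.5, 9.2]

# Built-in round() resolves exact halves to the nearest even integer,
# so 7.5 and 8.5 both become 8.
rounded = [round(s) for s in raw_scores]

print(rounded)
```

Because a rounded score can be 0, the minimum after rounding should be compared with whatever rating scale the model is later told to expect.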


Check for missing values and impute them

In [16]:
data_df.isnull().sum()

phone_url        0
date             0
lang             0
country          0
source           0
domain           0
score        63489
score_max    63489
extract      19361
author       63202
product          1
dtype: int64

In [17]:
data_df['author'].value_counts(ascending=False)

Amazon Customer                    76978
Cliente Amazon                     19304
e-bit                               8663
Client d'Amazon                     7716
Amazon Kunde                        4750
                                   ...  
Julen Crespo Arocena                   1
Sigfrido                               1
Alex Torres                            1
Maria Concepcion AndrÃ©s LÃ³pez        1
claudia0815                            1
Name: author, Length: 801103, dtype: int64
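
Names such as `AndrÃ©s LÃ³pez` in the output above are the typical result of UTF-8 bytes being decoded as latin-1, which suggests (but does not prove) that at least some of the source files are UTF-8. Under that assumption, a minimal sketch of a repair step could look like this; `repair_mojibake` is a hypothetical helper, not part of the notebook.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was decoded as latin-1; return the input unchanged on failure."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a latin-1/UTF-8 mix-up (or bytes were lost), so keep the original.
        return text

# The garbled author name from the output above, written with escapes.
garbled = "Maria Concepcion Andr\u00c3\u00a9s L\u00c3\u00b3pez"
print(repair_mojibake(garbled))
print(repair_mojibake("Amazon Customer"))  # plain ASCII passes through unchanged
```

If the assumption holds, applying such a function to the `author` column before counting would merge names that differ only by encoding damage.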

In [18]:
data_df['phone_url'].value_counts(ascending=False).head(50)

/cellphones/samsung-galaxy-s-iii/                                 17093
/cellphones/apple-iphone-5s/                                      16379
/cellphones/samsung-galaxy-s6/                                    16145
/cellphones/samsung-galaxy-s5/                                    16082
/cellphones/samsung-galaxy-s7-edge/                               15917
/cellphones/motorola-moto-g/                                      14476
/cellphones/samsung-galaxy-s7-789999/                             13488
/cellphones/samsung-i9500-galaxy-s-iv/                            13161
/cellphones/huawei-p8-lite/                                       12629
/cellphones/lenovo-vibe-k4-note/                                   9662
/cellphones/samsung-galaxy-s4-mini-gt-i9190-gt-i9192-dual-sim/     9027
/cellphones/samsung-galaxy-s6-edge-sm-g925f/                       8844
/cellphones/apple-iphone-4s/                                       8602
/cellphones/samsung-galaxy-s3-mini/                             

In [19]:
data_df[data_df["phone_url"]=='/cellphones/samsung-galaxy-s-iii/'][['product']].value_counts()

product                                                                                                                                         
Samsung Galaxy Express I8730                                                                                                                        2685
Samsung Galaxy S III 16GB (Virgin Mobile)                                                                                                            730
Samsung Galaxy S III                                                                                                                                 689
Samsung Galaxy S III 16GB (Straight Talk)                                                                                                            556
Samsung Galaxy S III i9300 Smartphone 16 GB (12,2 cm (4,8 Zoll) HD Super-AMOLED-Touchscreen, 8 Megapixel Kamera, Micro-SIM, Android 4.0) schwarz     381
                                                                                          

In [20]:
data_df[data_df["phone_url"]=='/cellphones/apple-iphone-5s/'][['product']].value_counts()

product                                                                                                              
Apple iPhone 5s (Silver, 16GB)                                                                                           1603
Apple iPhone 5s 16GB (ÑÐµÑÐµÐ±ÑÐ¸ÑÑÑÐ¹)                                                                            1355
Apple iPhone 5s GSM Unlocked Cellphone, 16 GB, Space Gray                                                                1273
Apple iPhone 5s (Gold, 16GB)                                                                                              682
Apple iPhone 5s 16GB (ÑÐµÑÑÐ¹ ÐºÐ¾ÑÐ¼Ð¾Ñ)                                                                            592
                                                                                                                         ... 
Apple iPhone 5S (4", 16GB, 8MP, Space Grau)                                                                                 1


There are multiple names for the same product, so use the phone name and model taken from the 'phone_url' column rather than the extra details in the 'product' column

In [21]:
data_df['phone'] = data_df['phone_url'].str.split("/").apply(lambda col: col[2]).replace('-', ' ', regex=True)
data_df['product'] = data_df['phone']
data_df['phone'].unique()

array(['samsung galaxy s8', 'samsung galaxy s6 edgeplus',
       'samsung galaxy s8 plus', ..., 'siemens c10', 'maxon mx 3204',
       'alcatel ot club_1187'], dtype=object)
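
To make the transformation concrete, here is the same slug extraction applied to single URLs in plain Python. `phone_name_from_url` is a hypothetical helper used only for illustration; the two URLs are taken from the `phone_url` counts shown earlier.

```python
def phone_name_from_url(phone_url):
    """Return the URL slug (third '/'-separated field) with hyphens replaced by spaces."""
    slug = phone_url.split("/")[2]
    return slug.replace("-", " ")

print(phone_name_from_url("/cellphones/samsung-galaxy-s8/"))
print(phone_name_from_url("/cellphones/samsung-galaxy-s7-789999/"))
```

The second call shows that any numeric suffix in the slug is carried into the product name unchanged.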

In [22]:
product = data_df['product'].value_counts()
print('Distribution of number of ratings per item: \n',product)

Distribution of number of ratings per item: 
 samsung galaxy s iii      17093
apple iphone 5s           16379
samsung galaxy s6         16145
samsung galaxy s5         16082
samsung galaxy s7 edge    15917
                          ...  
blu quattro 5 7 hd            1
lg vx3450                     1
blackberry 7105t              1
o2 xda nova                   1
samsung chrono 2              1
Name: product, Length: 5556, dtype: int64


In [23]:
data_df.isnull().sum()

phone_url        0
date             0
lang             0
country          0
source           0
domain           0
score        63489
score_max    63489
extract      19361
author       63202
product          0
phone            0
dtype: int64

Check for duplicate rows in the data (they are dropped in a later step)

In [24]:
data_df.duplicated().sum()

24895

Impute null 'scores' with the median

In [25]:
data_df['score'] = data_df['score'].fillna(data_df['score'].median())

Drop the rows with null values

In [26]:
data_df.dropna(inplace=True)

In [27]:
data_df.shape

(1275917, 12)

Remove rows where author is unknown or Anonymous

In [28]:
data_df = data_df[(data_df["author"] != 'Anonymous')&(data_df["author"] != 'unknown')&(data_df["author"] != 'Anonymous ')]

In [29]:
data_df.shape

(1270117, 12)
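
The filter above lists each spelling explicitly, including one with a trailing space. A slightly more general sketch normalizes case and whitespace before comparing. Note that this is broader than the notebook's filter (it is case-insensitive), and `PLACEHOLDER_AUTHORS` and `is_placeholder_author` are hypothetical names.

```python
PLACEHOLDER_AUTHORS = {"anonymous", "unknown"}  # assumed list; extend as needed

def is_placeholder_author(name):
    """True if the name, ignoring case and surrounding spaces, is a known placeholder."""
    return name.strip().lower() in PLACEHOLDER_AUTHORS

authors = ["Anonymous", "Anonymous ", "unknown", "ANONYMOUS", "CarolAnn35"]
kept = [a for a in authors if not is_placeholder_author(a)]
print(kept)
```

On a DataFrame the same idea can be written with `data_df['author'].str.strip().str.lower().isin(...)`.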

Drop duplicates

In [30]:
data_df = data_df.drop_duplicates()

Keep a random sample of 1 million rows

In [31]:
finaldata1_df = data_df.sample(n=1000000, random_state=612)

Drop irrelevant features. Keep features like Author, Product, and Score

In [32]:
relevant_features=['author','product','score']

In [33]:
finaldata_df = finaldata1_df.loc[:,relevant_features]

In [34]:
finaldata_df.shape

(1000000, 3)

In [35]:
finaldata_df.describe()

| | score |
|---|---|
| count | 1000000.0 |
| mean | 7.999746 |
| std | 2.62816 |
| min | 0.0 |
| 25% | 7.0 |
| 50% | 9.0 |
| 75% | 10.0 |
| max | 10.0 |


In [36]:
finaldata_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 47055 to 188972
Data columns (total 3 columns):
 #   Column   Non-Null Count    Dtype 
---  ------   --------------    ----- 
 0   author   1000000 non-null  object
 1   product  1000000 non-null  object
 2   score    1000000 non-null  Int64 
dtypes: Int64(1), object(2)
memory usage: 31.5+ MB


Identify the most rated products

In [37]:
# Top 10 most rated products
finaldata_df['product'].value_counts(ascending=False)[:10]

samsung galaxy s6            12399
samsung galaxy s7 edge       11973
apple iphone 5s              11687
samsung galaxy s5            11553
motorola moto g              11409
samsung galaxy s iii         10794
samsung galaxy s7 789999     10201
huawei p8 lite                9850
samsung i9500 galaxy s iv     9688
lenovo vibe k4 note           7724
Name: product, dtype: int64

Identify the users with most number of reviews

In [38]:
# Top 10 users with most number of reviews
finaldata_df['author'].value_counts(ascending=False)[:10]

Amazon Customer    61346
Cliente Amazon     15334
e-bit               6699
Client d'Amazon     6124
Amazon Kunde        3778
einer Kundin        2084
einem Kunden        1524
David                747
Marco                680
Alex                 630
Name: author, dtype: int64

Select data with products having more than 50 ratings and users who have given more than 50 ratings

In [39]:
topauthors = finaldata_df['author'].value_counts()
top50author = topauthors[topauthors > 50].index.tolist()

topproducts = finaldata_df['product'].value_counts()
top50product = topproducts[topproducts > 50].index.tolist()

finaltop50_df = finaldata_df[(finaldata_df['author'].isin(top50author)) & (finaldata_df['product'].isin(top50product))]
finaltop50_df.shape

(165453, 3)
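
The selection logic can be illustrated on a toy list of ratings. This is a plain-Python sketch of the same idea, with a hypothetical helper `filter_by_activity` and made-up data; as in the cell above, both counts are computed on the full data before any rows are removed.

```python
from collections import Counter

def filter_by_activity(ratings, min_count):
    """Keep (author, product, score) rows whose author and product each occur more than min_count times."""
    author_counts = Counter(author for author, _, _ in ratings)
    product_counts = Counter(product for _, product, _ in ratings)
    return [
        (author, product, score)
        for author, product, score in ratings
        if author_counts[author] > min_count and product_counts[product] > min_count
    ]

# Made-up ratings: "phone a" has 3 ratings, "phone b" has 2, "phone c" has 1.
toy = [
    ("ann", "phone a", 9), ("ann", "phone b", 7), ("ann", "phone c", 8),
    ("bob", "phone a", 10), ("bob", "phone b", 6),
    ("cy", "phone a", 5),
]
filtered = filter_by_activity(toy, min_count=1)
print(filtered)
```

In the toy data "phone a" has three ratings before filtering but only two afterwards, because one of its raters is removed. A product that passes the threshold can therefore be left with fewer rows than the threshold in the filtered subset.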

**Build a popularity based model and recommend top 5 mobile phones**

Build the model on products with more than 50 ratings and authors with more than 50 reviews

In [40]:
# Determine average score for each product
ratings = pd.DataFrame(finaltop50_df.groupby('product')['score'].mean())

# Determine count of ratings for each product
ratings['count'] = finaltop50_df.groupby('product')['score'].count()

# Sort by average score and count of ratings to get the most popular phones
ratings = ratings.sort_values(by=['score','count'], ascending=[False,False])

ratings.head()

| product | score | count |
|---|---|---|
| samsung f400 | 10.0 | 10 |
| nokia 5140i | 10.0 | 9 |
| nokia 5250 | 10.0 | 9 |
| pantech caper | 10.0 | 7 |
| huawei ideos x5 | 10.0 | 4 |


Build model for all products

In [41]:
# Determine average score for each product
overall_ratings = pd.DataFrame(finaldata_df.groupby('product')['score'].mean())

# Determine count of ratings for each product
overall_ratings['count'] = finaldata_df.groupby('product')['score'].count()

# Sort by average score and count of ratings to get the most popular phones
overall_ratings = overall_ratings.sort_values(by=['score','count'], ascending=[False,False])

overall_ratings.head()

| product | score | count |
|---|---|---|
| xiaomi mi 5s | 10.0 | 11 |
| verykool t742 | 10.0 | 7 |
| thl w7s | 10.0 | 6 |
| lg p503 | 10.0 | 5 |
| supersonic sc 150 | 10.0 | 5 |
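
Sorting by mean score puts products with a handful of perfect ratings at the top, as both popularity tables show. One common remedy, which this notebook does not use, is a damped (Bayesian) average that pulls each product's mean toward the global mean until it has enough ratings. The sketch below uses made-up products, an assumed global mean and an assumed prior weight `M`; only the formula is the point.

```python
def weighted_rating(mean_score, count, global_mean, m):
    """Damped mean: (count * mean_score + m * global_mean) / (count + m)."""
    return (count * mean_score + m * global_mean) / (count + m)

GLOBAL_MEAN = 8.0   # assumed overall mean score
M = 50              # assumed prior weight, in "virtual" ratings

# (product, mean score, number of ratings); illustrative values only
products = [("few perfect ratings", 10.0, 11), ("many good ratings", 9.0, 5000)]

ranked = sorted(
    products,
    key=lambda p: weighted_rating(p[1], p[2], GLOBAL_MEAN, M),
    reverse=True,
)
for name, mean_score, count in ranked:
    print(name, round(weighted_rating(mean_score, count, GLOBAL_MEAN, M), 3))
```

With these numbers the product with 5000 ratings ranks above the one with 11 perfect ratings, which is usually closer to what "popular" is meant to capture.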


**Build a collaborative filtering model using SVD**

In [42]:
data = Dataset.load_from_df(finaltop50_df[['author','product','score']], Reader(rating_scale=(1, 10)))
data.df.head()

| | author | product | score |
|---|---|---|---|
| 218434 | Wolfgang | lg gd880 mini | 10 |
| 336129 | Carlos | motorola moto g | 10 |
| 37957 | Client d'Amazon | samsung galaxy s7 789999 | 10 |
| 58575 | Amazon Customer | lenovo a2010 | 10 |
| 369807 | e-bit | samsung galaxy grand prime | 10 |


In [43]:
trainset, testset = train_test_split(data, test_size=.25,random_state=123)

In [44]:
svd_model = SVD(n_factors=5, biased=False, random_state=123)
svd_model.fit(trainset)
test_pred = svd_model.test(testset)

Evaluate the collaborative model

In [45]:
svd_rmse = accuracy.rmse(test_pred)

RMSE: 2.8687


**Collaborative filtering model using kNNWithMeans_Item based**

In [46]:
knn_item = KNNWithMeans(k=10, sim_options={ 'user_based': False})
knn_item.fit(trainset)
knn_item_pred = knn_item.test(testset)

Computing the msd similarity matrix...
Done computing similarity matrix.


In [47]:
knn_item_rmse = accuracy.rmse(knn_item_pred)

RMSE: 2.8929


**Collaborative filtering model using kNNWithMeans_User based**

In [48]:
knn_user = KNNWithMeans(k=10, sim_options={ 'user_based': True})
knn_user.fit(trainset)
knn_user_pred = knn_user.test(testset)

Computing the msd similarity matrix...
Done computing similarity matrix.


In [49]:
knn_user_rmse = accuracy.rmse(knn_user_pred)

RMSE: 2.8711


In [50]:
print('RMSE for SVD: ',svd_rmse,'\nRMSE for KNN with Means item based: ',knn_item_rmse, '\nRMSE for KNN with Means user based:',knn_user_rmse)

RMSE for SVD:  2.8687414864304 
RMSE for KNN with Means item based:  2.8928916844465125 
RMSE for KNN with Means user based: 2.871138461778053


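On this single 75/25 split, SVD has the lowest RMSE (2.8687), narrowly ahead of user-based KNN with means (2.8711) and item-based KNN with means (2.8929). The gap between the best and worst model is only about 0.024 on a 10-point scale, so the ranking should be treated as tentative.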
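
For reference, the RMSE values above are the square root of the mean squared difference between actual and predicted ratings, while the mean absolute error (MAE) is the plain average of the absolute differences. A minimal sketch on made-up (actual, predicted) pairs shows how the two differ:

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return math.sqrt(sum((actual - predicted) ** 2 for actual, predicted in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (actual, predicted) pairs."""
    return sum(abs(actual - predicted) for actual, predicted in pairs) / len(pairs)

# Illustrative (actual, predicted) pairs, not taken from the notebook's test set.
pairs = [(10, 9.0), (8, 8.0), (2, 7.0), (10, 10.0)]
print("RMSE:", rmse(pairs))
print("MAE: ", mae(pairs))
```

The single large miss (2 predicted as 7) dominates the RMSE, which is why RMSE is never smaller than MAE on the same predictions.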

Inspect the predicted scores for test users and compare their average with the actual ratings.

In [51]:
test_pred[0]

Prediction(uid='Ð\x90Ð½Ð°Ñ\x81Ñ\x82Ð°Ñ\x81Ð¸Ñ\x8f', iid='apple iphone 5s', r_ui=10.0, est=9.48552336303562, details={'was_impossible': False})

In [52]:
svd_pred_df = pd.DataFrame(test_pred, columns=['uid', 'iid', 'rui', 'est', 'details'])
print('Average rating based on SVD')
print('average prediction for test users: ', svd_pred_df['est'].mean())
print('average rating by test users: ', svd_pred_df['rui'].mean())
print('average prediction error for test users: ', (svd_pred_df['rui'] - svd_pred_df['est']).abs().mean())

Average rating based on SVD
average prediction for test users:  7.997427815215116
average rating by test users:  7.761773522870128
average prediction error for test users:  2.2386185286390776


In [53]:
knn_item_df = pd.DataFrame(knn_item_pred, columns=['uid', 'iid', 'rui', 'est', 'details'])
print('Average rating based on KNN with means item based')
print('average prediction for test users: ', knn_item_df['est'].mean())
print('average rating by test users: ', knn_item_df['rui'].mean())
print('average prediction error for test users: ', (knn_item_df['rui'] - knn_item_df['est']).abs().mean())

Average rating based on KNN with means item based
average prediction for test users:  7.750485011789254
average rating by test users:  7.761773522870128
average prediction error for test users:  2.286957897278912


In [54]:
knn_user_df = pd.DataFrame(knn_user_pred, columns=['uid', 'iid', 'rui', 'est', 'details'])
print('Average rating based on KNN with means user based')
print('average prediction for test users: ', knn_user_df['est'].mean())
print('average rating by test users: ', knn_user_df['rui'].mean())
print('average prediction error for test users: ', (knn_user_df['rui'] - knn_user_df['est']).abs().mean())

Average rating based on KNN with means user based
average prediction for test users:  7.822884012039611
average rating by test users:  7.761773522870128
average prediction error for test users:  2.2531803345853607


Inferences:

Most popular phone (average score of 10 with the largest number of ratings):
 * Overall: xiaomi mi 5s
 * Among users and products with more than 50 ratings: samsung f400

The data is skewed by the very large number of reviews attributed to 'Amazon Customer' and its localized variants such as 'Cliente Amazon'. Each of these labels covers many different customers but is treated as a single user, so it would be better to have the actual names of the customers who reviewed each product.

On the single train/test split, SVD gave a slightly lower RMSE than item-based and user-based KNN.

Recommend top 5 products for test users (the ranking below uses the actual rating 'r_ui' from the test set, not the predicted score 'est')

In [55]:
tp = pd.DataFrame(test_pred)
tp.head()

| | uid | iid | r_ui | est | details |
|---|---|---|---|---|---|
| 0 | ÐÐ½Ð°ÑÑÐ°ÑÐ¸Ñ | apple iphone 5s | 10.0 | 9.485523 | {'was_impossible': False} |
| 1 | ÐÐ»ÐµÐºÑÐ°Ð½Ð´Ñ | sony xperia acro s | 10.0 | 7.69637 | {'was_impossible': False} |
| 2 | Client d'Amazon | huawei honor 7 | 10.0 | 9.404716 | {'was_impossible': False} |
| 3 | Amazon Customer | xiaomi mi 4i | 10.0 | 8.060954 | {'was_impossible': False} |
| 4 | Cliente Amazon | xiaomi redmi note | 8.0 | 5.735508 | {'was_impossible': False} |


In [56]:
top = tp.groupby(['uid','iid']).agg({'r_ui': np.average}).sort_values(['uid','r_ui','iid'], ascending=[False,False,True])
top

| uid | iid | r_ui |
|---|---|---|
| Ð®ÑÐ¸Ð¹ | apple iphone 5s | 10.0 |
| Ð®ÑÐ¸Ð¹ | asus zenfone 2 | 10.0 |
| Ð®ÑÐ¸Ð¹ | htc desire x | 10.0 |
| Ð®ÑÐ¸Ð¹ | htc one m7 | 10.0 |
| Ð®ÑÐ¸Ð¹ | motorola moto g | 10.0 |
| ... | ... | ... |
| # | nokia 5310 xm | 6.0 |
| # | nokia n8 | 6.0 |
| # | samsung b5722 | 6.0 |
| # | samsung b7722 | 6.0 |


In [57]:
top5 = top.groupby('uid').head(5)
print(top5)

                             r_ui
uid      iid                     
Ð®ÑÐ¸Ð¹ apple iphone 5s     10.0
         asus zenfone 2      10.0
         htc desire x        10.0
         htc one m7          10.0
         motorola moto g     10.0
...                           ...
#        apple iphone 4      10.0
         nokia 5230          10.0
         nokia 5233          10.0
         nokia 6120 classic  10.0
         nokia 6680          10.0

[3095 rows x 1 columns]


In [58]:
# Retrieve top 5 products for all users
for item,value in top5.iterrows():
  print(item)

('Ð®Ñ\x80Ð¸Ð¹', 'apple iphone 5s')
('Ð®Ñ\x80Ð¸Ð¹', 'asus zenfone 2')
('Ð®Ñ\x80Ð¸Ð¹', 'htc desire x')
('Ð®Ñ\x80Ð¸Ð¹', 'htc one m7')
('Ð®Ñ\x80Ð¸Ð¹', 'motorola moto g')
('Ð®Ð»Ð¸Ñ\x8f', 'apple iphone 5s')
('Ð®Ð»Ð¸Ñ\x8f', 'apple iphone 6')
('Ð®Ð»Ð¸Ñ\x8f', 'apple iphone se')
('Ð®Ð»Ð¸Ñ\x8f', 'htc one sv')
('Ð®Ð»Ð¸Ñ\x8f', 'huawei g610')
('Ð¢Ð°Ñ\x82Ñ\x8cÑ\x8fÐ½Ð°', 'apple iphone 5s')
('Ð¢Ð°Ñ\x82Ñ\x8cÑ\x8fÐ½Ð°', 'htc one mini')
('Ð¢Ð°Ñ\x82Ñ\x8cÑ\x8fÐ½Ð°', 'nokia c2 03')
('Ð¢Ð°Ñ\x82Ñ\x8cÑ\x8fÐ½Ð°', 'samsung galaxy a5')
('Ð¢Ð°Ñ\x82Ñ\x8cÑ\x8fÐ½Ð°', 'samsung galaxy j1 2016 4 5 sm j120')
('Ð¡ÐµÑ\x80Ð³ÐµÐ¹', 'apple iphone 6 plus')
('Ð¡ÐµÑ\x80Ð³ÐµÐ¹', 'apple iphone 6s')
('Ð¡ÐµÑ\x80Ð³ÐµÐ¹', 'htc wildfire s')
('Ð¡ÐµÑ\x80Ð³ÐµÐ¹', 'huawei honor 2')
('Ð¡ÐµÑ\x80Ð³ÐµÐ¹', 'huawei u8500')
('Ð¡Ð²ÐµÑ\x82Ð»Ð°Ð½Ð°', 'lenovo ideaphone s960')
('Ð¡Ð²ÐµÑ\x82Ð»Ð°Ð½Ð°', 'lg gt540')
('Ð¡Ð²ÐµÑ\x82Ð»Ð°Ð½Ð°', 'nokia asha 200')
('Ð¡Ð²ÐµÑ\x82Ð»Ð°Ð½Ð°', 'samsung galaxy a3 819970')
('Ð¡Ð²ÐµÑ\x82Ð»Ð°Ð½Ð°', 'samsung galaxy s4 mini

In [59]:
# Retrieve top 5 products for a particular user - Florian
for item,value in top5.iterrows():
  if item[0] == 'Florian':
    print(item)

('Florian', 'cubot x9')
('Florian', 'htc 7 trophy')
('Florian', 'lenovo moto g4')
('Florian', 'motorola moto g')
('Florian', 'nokia lumia 1320')


In [60]:
# Retrieve top 5 products for a particular user - Dave
for item,value in top5.iterrows():
  if item[0] == 'Dave':
    print(item)

('Dave', 'apple iphone 6s')
('Dave', 'apple iphone 7 plus')
('Dave', 'asus zenfone 2')
('Dave', 'elephone p9000')
('Dave', 'htc 10')
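
The lists above are ordered by `r_ui`, the rating each user actually gave in the test set. A recommender would normally rank by the model's estimate instead. The sketch below does that for tuples shaped like Surprise's `(uid, iid, r_ui, est, details)` predictions; `top_n_by_estimate` is a hypothetical helper and the predictions are made up.

```python
from collections import defaultdict

def top_n_by_estimate(predictions, n=5):
    """Map each user id to its n (item id, estimate) pairs with the highest estimates."""
    per_user = defaultdict(list)
    for uid, iid, r_ui, est, details in predictions:
        per_user[uid].append((iid, est))
    return {
        uid: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for uid, items in per_user.items()
    }

# Illustrative predictions; the estimates deliberately disagree with r_ui.
predictions = [
    ("Dave", "phone a", 10.0, 7.1, {}),
    ("Dave", "phone b", 6.0, 9.3, {}),
    ("Dave", "phone c", 8.0, 8.2, {}),
    ("Florian", "phone a", 9.0, 6.5, {}),
]
top2 = top_n_by_estimate(predictions, n=2)
print(top2)
```

Ranking test-set predictions still only covers products the user has already rated. To recommend unseen products, predictions would be needed for user/product pairs absent from the training data; Surprise's `trainset.build_anti_testset()` is one way to generate those pairs, although the result can be very large.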


Use cross-validation to get a more reliable estimate of each model's RMSE

In [61]:
svd_cv = cross_validate(svd_model,data, measures=['RMSE'], cv=3, verbose=False)
svd_cv['test_rmse'].mean()

2.8704490027990044

In [62]:
knn_item_cv = cross_validate(knn_item,data, measures=['RMSE'], cv=3, verbose=False)
knn_item_cv['test_rmse'].mean()

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


2.8798326746914995

In [63]:
knn_user_cv = cross_validate(knn_user,data, measures=['RMSE'], cv=3, verbose=False)
knn_user_cv['test_rmse'].mean()

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


2.867927771050414

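With 3-fold cross-validation, the user-based KNN model has the lowest mean RMSE (2.868), followed by SVD (2.870) and item-based KNN (2.880). The three models are within about 0.012 of each other, and the ranking differs from the single-split result, so none of them is clearly better on this dataset.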