***
# Recommendation System
***


### - Recommendation systems can be content based, user based, item based, or model based.

### - Content based recommendation systems analyze the content of products and suggest products with similar content. The major drawback is that only products very similar to each other are recommended.

### - User based recommendation systems measure how similar one user is to another based on their product ratings. The drawbacks are that users tend to change their opinions over time, and a new user has no ratings to support any recommendation.

### - Item based recommendation systems measure how one item or product relates to others, for instance by popularity or rating. The drawbacks are that not all items have ratings, and a new product has no data to support any recommendation.

### - Model based recommendation systems rely on a matrix factorization approach, where the relationships between users and items are condensed into two low-rank factor matrices, one for users and one for items. This approach mitigates the sparsity issues of user based and item based recommendation systems.
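To make the matrix factorization idea concrete, here is a minimal, dependency-free sketch that learns one small factor vector per user and per item with stochastic gradient descent. The toy ratings, the function names (`factorize`, `predict`, `mse`), and the hyperparameters are all assumptions for illustration; this notebook does not use this method.

```python
import random

def factorize(ratings, n_users, n_items, k=2, steps=500, lr=0.01, reg=0.02, seed=0):
    """Learn user factors P and item factors Q so that P[u] . Q[i] approximates rating r."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on the squared error with L2 regularization
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Predicted rating is the dot product of the user and item factor vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def mse(ratings, P, Q):
    return sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings)

# Hypothetical (user, item, rating) triples; most user-item pairs are unobserved
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

P0, Q0 = factorize(ratings, 3, 3, steps=0)   # untrained factors (same seed)
P, Q = factorize(ratings, 3, 3)              # trained factors

print('MSE before training:', mse(ratings, P0, Q0))
print('MSE after training: ', mse(ratings, P, Q))
print('Predicted rating for user 0 on item 2:', predict(P, Q, 0, 2))
```

Because every user and item gets a vector, the model can score pairs that were never rated, which is why this family of methods copes better with sparse rating data.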

## The objective is to build product based collaborative filtering models from Amazon review data in the pet supplies category.

### The data was taken from http://snap.stanford.edu/data/web-Amazon-links.html

# Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from wordcloud import WordCloud, STOPWORDS

from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression
from sklearn import neighbors
from scipy.spatial.distance import cosine

from sklearn.metrics import mean_squared_error
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

import re
import string

In [2]:
df = pd.read_csv('Pet_Supplies.csv')
df.head()

Unnamed: 0,productId,title,price,userId,profileName,helpfulness,score,time,summary,text
0,B000O1CRYW,Orbee Tuff Ball Orange - SMALL,6.95,A2FEQ9XL6ML51C,Just an everyday Dad,1/1,5.0,1286064000,"Little Ball, for Little Dogs...","Great Toy, hard to find! We get ours online h..."
1,B000O1CRYW,Orbee Tuff Ball Orange - SMALL,6.95,A183LI95B2WNUQ,"V. J. Mcmillen ""vmcmillen""",1/1,5.0,1230249600,glow ball,I have bought several of these small Orbee Tu...
2,B000O1CRYW,Orbee Tuff Ball Orange - SMALL,6.95,A1LSSENM0XIQQR,gcoronado4,0/0,3.0,1309046400,Too Big,It is a quality ball but the small is still t...
3,B000O1CRYW,Orbee Tuff Ball Orange - SMALL,6.95,A2E5PZE1PZVK38,jerry,0/0,5.0,1308873600,no good,I gave it 5 stars because my little dog had s...
4,B0002ARHAE,Kent Marine Pro-Clear Freshwater Clarifier,3.73,A3PXLJE4OPIQTY,"M. Thomas ""sea_anemone""",0/0,5.0,1356912000,Best clarifier ever,I've used many products to try and help the w...


In [3]:
# Remove unknown userId
df = df[df['userId'].str.contains('unknown') == False]

In [4]:
# Impute missing values of title
df['title'] = df['title'].fillna(value = 'unknown')

In [5]:
# Remove duplicate (productId, userId) pairs, keeping the last row found.
# This is the latest review only if the rows are ordered by time.
df = df.drop_duplicates(subset=['productId', 'userId'], keep='last')
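Keeping the last row per pair yields the most recent review only when the rows are already in time order, and the source does not say whether the file is sorted. The following is a small plain-Python sketch of the intended logic, with hypothetical rows and a hypothetical helper `keep_latest`; in pandas the same effect would come from sorting on the time column before dropping duplicates.

```python
# Hypothetical rows: (productId, userId, time, score), deliberately not in time order
rows = [
    ('P1', 'U1', 1309046400, 3.0),
    ('P1', 'U1', 1230249600, 5.0),  # older review that appears later in the file
    ('P1', 'U2', 1286064000, 4.0),
]

def keep_latest(rows):
    """Keep one row per (productId, userId) pair: the one with the largest timestamp."""
    latest = {}
    # Visiting rows in ascending time order means later reviews overwrite earlier ones
    for row in sorted(rows, key=lambda r: r[2]):
        product_id, user_id = row[0], row[1]
        latest[(product_id, user_id)] = row
    return list(latest.values())

deduped = keep_latest(rows)
for row in deduped:
    print(row)
```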

In [6]:
# Rename the columns
df.rename(columns = {'productId' : 'ProductID', 'title' : 'Title', 'price' : 'Price', 'profileName' : 'Profile_Name',
                    'userId' : 'UserID', 'helpfulness' : 'Helpfulness', 'score' : 'Rating', 
                    'time' : 'Time', 'summary' : 'Summary', 'text' : 'Text'}, inplace=True)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 215244 entries, 0 to 217169
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   ProductID     215244 non-null  object 
 1   Title         215244 non-null  object 
 2   Price         215244 non-null  object 
 3   UserID        215244 non-null  object 
 4   Profile_Name  215244 non-null  object 
 5   Helpfulness   215244 non-null  object 
 6   Rating        215244 non-null  float64
 7   Time          215244 non-null  int64  
 8   Summary       215244 non-null  object 
 9   Text          215244 non-null  object 
dtypes: float64(1), int64(1), object(8)
memory usage: 18.1+ MB


# Product based collaborative filtering

In [8]:
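count = df.groupby('ProductID', as_index=False).count()
# numeric_only=True averages only Rating and Time; recent pandas versions raise an error without it
mean = df.groupby('ProductID', as_index=False).mean(numeric_only=True)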

print(count.shape)
print(mean.head(10))

(17493, 10)
     ProductID    Rating          Time
0   0595281370  4.538462  1.256821e+09
1   0793821053  4.750000  1.217700e+09
2   0975412809  4.607143  1.259462e+09
3   1903492022  4.500000  1.315786e+09
4   1903499920  4.000000  1.198310e+09
5   2272800020  2.000000  1.189987e+09
6   2272800022  4.500000  1.258805e+09
7   2272800024  2.500000  1.254917e+09
8   2272800025  4.250000  1.274432e+09
9   2272800033  4.000000  1.226899e+09


In [9]:
count

Unnamed: 0,ProductID,Title,Price,UserID,Profile_Name,Helpfulness,Rating,Time,Summary,Text
0,0595281370,13,13,13,13,13,13,13,13,13
1,0793821053,4,4,4,4,4,4,4,4,4
2,0975412809,28,28,28,28,28,28,28,28,28
3,1903492022,2,2,2,2,2,2,2,2,2
4,1903499920,3,3,3,3,3,3,3,3,3
...,...,...,...,...,...,...,...,...,...,...
17488,B000TZ0YBG,1,1,1,1,1,1,1,1,1
17489,B000VNV6H2,4,4,4,4,4,4,4,4,4
17490,B004UNGISQ,1,1,1,1,1,1,1,1,1
17491,B00068K9P2,1,1,1,1,1,1,1,1,1


In [10]:
# Merge dataframes
df_merged = pd.merge(df, count, how='right', on=['ProductID'])
print(df_merged.shape)
df_merged.head()

(215244, 19)


Unnamed: 0,ProductID,Title_x,Price_x,UserID_x,Profile_Name_x,Helpfulness_x,Rating_x,Time_x,Summary_x,Text_x,Title_y,Price_y,UserID_y,Profile_Name_y,Helpfulness_y,Rating_y,Time_y,Summary_y,Text_y
0,595281370,Rabbit Health in the 21st Century Second Edition,16.24,A3REM89A39TWW9,Julie,25/25,5.0,1126569600,Know How to Tell When Your Rabbit is Sick,Since rabbits are prey animals they hide illn...,13,13,13,13,13,13,13,13,13
1,595281370,Rabbit Health in the 21st Century Second Edition,16.24,AAYOEBTSJ5E1G,"""kevnkel""",26/28,5.0,1070323200,Great Read About Common Medical Problems in R...,This is a wonderful book for any bunny parent...,13,13,13,13,13,13,13,13,13
2,595281370,Rabbit Health in the 21st Century Second Edition,16.24,A9C6G0K20H9WK,"Margi J. Winters ""rabbit mommy""",17/17,5.0,1162771200,Best bunny health book available,This is the second time I purchased this book...,13,13,13,13,13,13,13,13,13
3,595281370,Rabbit Health in the 21st Century Second Edition,16.24,A37ULHRZOSIIPN,H. Bringhurst,5/5,5.0,1276473600,Read and Used Frequently,The HRS chapter we adopted from considers thi...,13,13,13,13,13,13,13,13,13
4,595281370,Rabbit Health in the 21st Century Second Edition,16.24,A33ZX9TLDJNEIK,Charalampos D. Makripodis,3/3,4.0,1265414400,A must-have book for all the rabbit owners,This book covers all the health-related issue...,13,13,13,13,13,13,13,13,13


In [11]:
# Copy the columns needed under clearer names
df_merged['Total_Reviewers'] = df_merged['UserID_y']
df_merged['Overall_Rating'] = df_merged['Rating_x']
df_merged['Review_Summary'] = df_merged['Summary_x']

df_new = df_merged[['ProductID', 'Review_Summary', 'Overall_Rating', 'Total_Reviewers']]
print(df_new.shape)
df_new.head(10)

(215244, 4)


Unnamed: 0,ProductID,Review_Summary,Overall_Rating,Total_Reviewers
0,595281370,Know How to Tell When Your Rabbit is Sick,5.0,13
1,595281370,Great Read About Common Medical Problems in R...,5.0,13
2,595281370,Best bunny health book available,5.0,13
3,595281370,Read and Used Frequently,5.0,13
4,595281370,A must-have book for all the rabbit owners,4.0,13
5,595281370,Great read for Rabbit Owners,5.0,13
6,595281370,Rabbit Health in the 21st Century,5.0,13
7,595281370,Good book,4.0,13
8,595281370,worth buying,4.0,13
9,595281370,Great book and a must have item for any bunny...,4.0,13


### Selecting products which have at least 100 reviews

In [12]:
print(df_merged.shape)
df_merged = df_merged.sort_values(by='Total_Reviewers', ascending=False)
df_count = df_merged[df_merged['Total_Reviewers'] >= 100]
df_count.shape

(215244, 22)


(95266, 22)

In [13]:
df_prod_unique = df_merged['ProductID'].unique()
df_prod_unique.shape

(17493,)

### Grouping all the review summaries by ProductID

In [14]:
df_product_review = df.groupby('ProductID', as_index=False).mean(numeric_only=True)
print(df_product_review.shape)

# Collect each product's review summaries into one list
product_review_summary = df_count.groupby('ProductID')['Review_Summary'].apply(list)
# print(product_review_summary)

product_review_summary = pd.DataFrame(product_review_summary)
print(product_review_summary)

product_review_summary.to_csv('Product_Review_Summary.csv')

(17493, 3)
                                                Review_Summary
ProductID                                                     
 7310172001  [ Chihuahuas Favorite,  Loves liver treats,  O...
 7310172101  [ MY DOG LOVES THEM AND NEVER TIRES OF THEM,  ...
 B00004RA8P  [ My Favorite Feeder,  Perky Pet 30 oz feeder,...
 B00004RBDU  [ Excellent,  It works,  Flea trapper is great...
 B00004ZAVR  [ Great birdfeeder,  Excellent customer servic...
...                                                        ...
 B000Q5KM1G  [ Poor customer service,  danger,  advantix su...
 B000Q7AH3W  [ Outstanding product,  Not for smart chewers,...
 B000QFMYWQ  [ Just an average product...,  Wonderful Cage!...
 B000QFT1R2  [ Does the Trick,  Nice crate and big enough f...
 B000S6XSA0  [ Slowed down my little dog...,  Did the trick...

[328 rows x 1 columns]


In [15]:
df_product_review.head()

Unnamed: 0,ProductID,Rating,Time
0,595281370,4.538462,1256821000.0
1,793821053,4.75,1217700000.0
2,975412809,4.607143,1259462000.0
3,1903492022,4.5,1315786000.0
4,1903499920,4.0,1198310000.0


## Create a dataframe only with features needed

In [16]:
df_rec = pd.read_csv('Product_Review_Summary.csv')
print(df_rec.shape)

df_rec = pd.merge(df_rec, df_product_review, on='ProductID', how='inner')
print(df_rec.head(10))

(328, 2)
     ProductID                                     Review_Summary    Rating  \
0   7310172001  [' Chihuahuas Favorite', ' Loves liver treats'...  4.753846   
1   7310172101  [' MY DOG LOVES THEM AND NEVER TIRES OF THEM',...  4.753846   
2   B00004RA8P  [' My Favorite Feeder', ' Perky Pet 30 oz feed...  4.146617   
3   B00004RBDU  [' Excellent', ' It works', ' Flea trapper is ...  3.881720   
4   B00004ZAVR  [' Great birdfeeder', ' Excellent customer ser...  3.728261   
5   B00004ZAW4  [' Awesome price for this feeder', ' Garden so...  3.418182   
6   B00005MF9T  [' Love the idea, hate the hassle', ' Junk Mai...  3.047847   
7   B00005MF9U  [' scoop litter, or clean the rake?', ' Mixed ...  3.188889   
8   B00005MF9V  [' my cats love it', " It's huge!!!", ' You wo...  2.561290   
9   B00005OU62  [" I'D LIKE TO MEET THE GENUIS WHO INVENTED TH...  3.054054   

           Time  
0  1.293693e+09  
1  1.293693e+09  
2  1.284127e+09  
3  1.328259e+09  
4  1.256450e+09  
5  1.232974e+

In [17]:
df_rec = df_rec[['ProductID', 'Review_Summary', 'Rating']]
df_rec.head(10)

Unnamed: 0,ProductID,Review_Summary,Rating
0,7310172001,"[' Chihuahuas Favorite', ' Loves liver treats'...",4.753846
1,7310172101,"[' MY DOG LOVES THEM AND NEVER TIRES OF THEM',...",4.753846
2,B00004RA8P,"[' My Favorite Feeder', ' Perky Pet 30 oz feed...",4.146617
3,B00004RBDU,"[' Excellent', ' It works', ' Flea trapper is ...",3.88172
4,B00004ZAVR,"[' Great birdfeeder', ' Excellent customer ser...",3.728261
5,B00004ZAW4,"[' Awesome price for this feeder', ' Garden so...",3.418182
6,B00005MF9T,"[' Love the idea, hate the hassle', ' Junk Mai...",3.047847
7,B00005MF9U,"[' scoop litter, or clean the rake?', ' Mixed ...",3.188889
8,B00005MF9V,"[' my cats love it', "" It's huge!!!"", ' You wo...",2.56129
9,B00005OU62,"["" I'D LIKE TO MEET THE GENUIS WHO INVENTED TH...",3.054054


## Text cleaning of Review_Summary column

In [18]:
# Normalize a summary: lowercase it and replace every run of non-letter characters with one space
col_regex = re.compile('[^a-z]+')
def clean_review(Text):
    Text = Text.lower()
    Text = col_regex.sub(' ', Text).strip()
    return Text
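As a quick check of what this cleaning step does, the sketch below reapplies the same regular expression to a made-up input that mimics the stringified list stored in `Review_Summary`. The sample string is hypothetical. Note that digits are removed along with punctuation, and apostrophes split words.

```python
import re

col_regex = re.compile('[^a-z]+')

def clean_review(text):
    """Lowercase the text and collapse every run of non-letter characters into one space."""
    text = text.lower()
    return col_regex.sub(' ', text).strip()

# Hypothetical input shaped like the stringified lists read back from the CSV
sample = "[' It's huge!!!', ' Perky Pet 30 oz feeder']"
cleaned = clean_review(sample)
print(cleaned)  # it s huge perky pet oz feeder
```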

In [19]:
# Clean the summaries, drop rows that share an identical average rating
# (such as 7310172001 and 7310172101 above), and reset the index
df_rec['Summary_Cleaned'] = df_rec['Review_Summary'].apply(clean_review)
df_rec = df_rec.drop_duplicates(['Rating'], keep='last')
df_rec = df_rec.reset_index()

print(df_rec.shape)
df_rec.head(10)

(313, 5)


Unnamed: 0,index,ProductID,Review_Summary,Rating,Summary_Cleaned
0,1,7310172101,"[' MY DOG LOVES THEM AND NEVER TIRES OF THEM',...",4.753846,my dog loves them and never tires of them the ...
1,2,B00004RA8P,"[' My Favorite Feeder', ' Perky Pet 30 oz feed...",4.146617,my favorite feeder perky pet oz feeder great f...
2,3,B00004RBDU,"[' Excellent', ' It works', ' Flea trapper is ...",3.88172,excellent it works flea trapper is great i lik...
3,4,B00004ZAVR,"[' Great birdfeeder', ' Excellent customer ser...",3.728261,great birdfeeder excellent customer service im...
4,5,B00004ZAW4,"[' Awesome price for this feeder', ' Garden so...",3.418182,awesome price for this feeder garden song the ...
5,6,B00005MF9T,"[' Love the idea, hate the hassle', ' Junk Mai...",3.047847,love the idea hate the hassle junk maid is mor...
6,7,B00005MF9U,"[' scoop litter, or clean the rake?', ' Mixed ...",3.188889,scoop litter or clean the rake mixed feelings ...
7,8,B00005MF9V,"[' my cats love it', "" It's huge!!!"", ' You wo...",2.56129,my cats love it it s huge you would think payi...
8,9,B00005OU62,"["" I'D LIKE TO MEET THE GENUIS WHO INVENTED TH...",3.054054,i d like to meet the genuis who invented this ...
9,10,B000062WUT,[' Funny Toy Dog loves and hasnt fallen apart'...,4.210526,funny toy dog loves and hasnt fallen apart dog...


In [20]:
reviews = df_rec['Summary_Cleaned'] 
count_vector = CountVectorizer(max_features = 300, stop_words='english') 
transformed_reviews = count_vector.fit_transform(reviews) 
print(transformed_reviews.A.shape)
df_reviews = pd.DataFrame(transformed_reviews.A, columns=count_vector.get_feature_names())
df_reviews = df_reviews.astype(int)
df_reviews.head(10)

(313, 300)


Unnamed: 0,absolutely,actually,advertised,amazing,amazon,angel,angels,anti,away,awesome,...,wonderful,work,worked,working,works,worth,wow,wrong,year,years
0,1,0,1,2,0,0,0,0,0,1,...,1,0,0,0,0,1,0,1,1,0
1,0,0,0,0,0,0,0,0,1,0,...,2,0,0,1,2,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,2,...,0,14,4,2,27,0,1,2,0,0
3,0,0,0,0,1,0,0,0,0,1,...,2,1,1,0,0,0,0,0,0,1
4,0,0,1,0,0,0,0,0,0,1,...,0,2,0,0,5,1,1,0,0,0
5,1,0,0,1,1,0,0,0,0,2,...,1,9,2,2,8,9,0,2,2,3
6,2,1,1,1,1,0,0,0,2,4,...,3,19,8,5,37,22,0,2,2,9
7,0,1,0,0,0,0,0,0,0,0,...,0,3,1,0,4,6,0,0,0,1
8,1,0,0,0,0,0,0,0,0,1,...,3,5,1,0,10,5,1,0,1,3
9,1,1,0,3,0,0,0,0,0,5,...,1,0,0,1,1,2,0,0,0,0
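The table above is a bag-of-words count matrix: one row per product, one column per retained term. A simplified plain-Python sketch of the same idea follows. The helper `count_matrix`, the documents, and the tiny stop word set are assumptions for illustration; `CountVectorizer` uses its own tokenizer and a much larger built-in stop word list.

```python
from collections import Counter

def count_matrix(docs, max_features, stop_words=frozenset()):
    """Build a term-count matrix over the max_features most frequent terms."""
    # Split on whitespace, dropping single-character tokens and stop words
    tokenized = [[w for w in doc.split() if len(w) > 1 and w not in stop_words]
                 for doc in docs]
    totals = Counter(w for doc in tokenized for w in doc)
    # Keep the most frequent terms, then order the columns alphabetically
    vocab = sorted(w for w, _ in totals.most_common(max_features))
    rows = []
    for doc in tokenized:
        doc_counts = Counter(doc)
        rows.append([doc_counts[w] for w in vocab])
    return vocab, rows

# Hypothetical cleaned summaries
docs = ['great feeder great price', 'works great', 'feeder works for my dog']
vocab, rows = count_matrix(docs, max_features=3, stop_words={'for', 'my'})

print(vocab)
for row in rows:
    print(row)
```

Each row can then be treated as a point in a space with one dimension per term, which is the representation the nearest-neighbor search works on.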


In [21]:
# Save dataframe to csv 
df_reviews.to_csv('Reviews.csv')

In [22]:
# Convert the count dataframe to a NumPy array
df_split = np.array(df_reviews)

# Create train and test sets with a sequential (unshuffled) 80/20 split
tpercent = 0.8
tsize = int(np.floor(tpercent * len(df_reviews)))
df_reviews_train = df_split[:tsize]
df_reviews_test = df_split[tsize:]

#len of train and test
len_train = len(df_reviews_train)
len_test = len(df_reviews_test)

In [23]:
# Size of sets
print(len_train)
print(len_test)

250
63
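The split sizes follow directly from flooring 0.8 × 313. Because the rows are taken in order, the test set consists of the last products in the table rather than a random sample. The sketch below reproduces the sequential split on stand-in data and shows a shuffled variant for comparison; `shuffled_split` and its seed are assumptions, not something this notebook does.

```python
import math
import random

def sequential_split(rows, train_fraction):
    """The first train_fraction of the rows become training data, the rest test data."""
    size = int(math.floor(train_fraction * len(rows)))
    return rows[:size], rows[size:]

def shuffled_split(rows, train_fraction, seed=42):
    """Same split sizes, after shuffling a copy of the rows with a fixed seed."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return sequential_split(shuffled, train_fraction)

rows = list(range(313))  # stand-in for the 313 product rows

train, test = sequential_split(rows, 0.8)
print(len(train), len(test))      # 250 63

s_train, s_test = shuffled_split(rows, 0.8)
print(len(s_train), len(s_test))  # 250 63
```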


# Using k-nearest neighbors to find the two most similar products

### K-nearest neighbors (KNN) finds, for a given product, the products whose feature vectors lie closest to it. Here each product is represented by the word counts of its review summaries, and closeness is the Euclidean distance between those count vectors. For rating prediction, KNN then combines the ratings of the top-k nearest neighbors.
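The neighbor search itself can be sketched in a few lines of plain Python: compute the Euclidean distance from the query vector to every training vector and keep the k smallest. The vectors and the helper `k_nearest` are hypothetical; the library's tree-based searches are expected to find the same neighbors, only faster.

```python
import math

def k_nearest(train, query, k):
    """Return the k (distance, row index) pairs closest to the query vector."""
    distances = [(math.dist(row, query), idx) for idx, row in enumerate(train)]
    distances.sort()
    return distances[:k]

# Hypothetical word-count vectors for four training products
train = [[1, 0, 2], [0, 3, 0], [1, 1, 2], [5, 5, 5]]
query = [1, 0, 1]  # word counts of the product we want neighbors for

nearest = k_nearest(train, query, k=2)
for distance, idx in nearest:
    print(f'training row {idx} at distance {distance:.4f}')
```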

In [24]:
neighbor = NearestNeighbors(n_neighbors=3, algorithm='ball_tree').fit(df_reviews_train)

# Find the k nearest neighbors of each training point (the first neighbor is the point itself, at distance 0)
distances, indices = neighbor.kneighbors(df_reviews_train)
print(distances, indices)

[[   0.            5.56776436   56.95612346]
 [   0.           76.95453203   79.15175298]
 [   0.           43.23193264   44.15880433]
 [   0.           67.57218363   70.87312608]
 [   0.           33.42154993   40.73082371]
 [   0.           28.37252192   37.52332608]
 [   0.          134.11934983  148.02702456]
 [   0.           25.37715508   28.96549672]
 [   0.           28.37252192   28.68797658]
 [   0.           37.73592453   53.43220003]
 [   0.           18.05547009   21.9544984 ]
 [   0.           43.23193264   52.30678732]
 [   0.           27.31300057   27.91057147]
 [   0.           60.49793385   71.8748913 ]
 [   0.           28.75760769   30.59411708]
 [   0.           57.24508713   60.35726965]
 [   0.           21.30727575   24.77902339]
 [   0.           30.5450487    30.5450487 ]
 [   0.          133.78340704  215.73826735]
 [   0.          115.83177457  119.97082979]
 [   0.          136.43313381  182.06317585]
 [   0.           64.83826031   70.3562364 ]
 [   0.   

In [25]:
# Find the two most related products for each test product
for i in range(len_test):
    # kneighbors returns (distances, indices); indices has shape (1, n_neighbors)
    _, related_product_list = neighbor.kneighbors([df_reviews_test[i]])

    # Row positions of the two closest training products
    first_related_product = int(related_product_list[0][0])
    second_related_product = int(related_product_list[0][1])
    
    print ('The average rating of product reviews for{} is {}'.format(df_rec['ProductID'][len_train + i], df_rec['Rating'][len_train + i]))
    print ('The first similar product is{}, with average rating {}'.format(df_rec['ProductID'][first_related_product], df_rec['Rating'][first_related_product]))
    print ('The second similar product is{}, with average rating {}'.format(df_rec['ProductID'][second_related_product], df_rec['Rating'][second_related_product]))
    print ('---------------------------------------------------------------------------------')

The average rating of product reviews for B000H3VAQ8 is 3.7559395248380127
The first similar product is B00061RITE, with average rating 3.334928229665072
The second similar product is B00068R98C, with average rating 2.4350132625994694
---------------------------------------------------------------------------------
The average rating of product reviews for B000H5AU4Y is 4.217142857142857
The first similar product is B00063466U, with average rating 4.240875912408759
The second similar product is B0009YWLCM, with average rating 4.187279151943463
---------------------------------------------------------------------------------
The average rating of product reviews for B000H6AK7A is 1.7066666666666668
The first similar product is B00005MF9T, with average rating 3.047846889952153
The second similar product is B00005OU62, with average rating 3.054054054054054
---------------------------------------------------------------------------------
The average rating of product reviews for B000HHJEM6

In [26]:
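# Print the rating of one particular product (row 123).
# Note: first_related_product and second_related_product still hold the neighbors
# of the last test product from the loop above, not the neighbors of row 123.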

print ('The average rating of product reviews for{} is {}'.format(df_rec['ProductID'][123], df_rec['Rating'][123]))
print ('The first similar product is{}, with average rating {}'.format(df_rec['ProductID'][first_related_product], df_rec['Rating'][first_related_product]))
print ('The second similar product is{}, with average rating {}'.format(df_rec['ProductID'][second_related_product], df_rec['Rating'][second_related_product]))

The average rating of product reviews for B00063KG5K is 4.034965034965035
The first similar product is B000ANOT9U, with average rating 3.47787610619469
The second similar product is B0002DIRYG, with average rating 4.303921568627451


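### In the rating prediction step below, the KNN classifier measures the Euclidean distance between products to determine how close they are. It then classifies each test product by finding its k nearest neighbors in the training set and picking the class with the largest vote among them. With weights='distance', closer neighbors count more than distant ones.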
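The difference between a plain majority vote and a distance-weighted vote can be seen in a small sketch. The neighbor list and both helper functions are hypothetical, and the zero-distance handling is a simplification of what a library implementation does.

```python
from collections import Counter, defaultdict

def majority_vote(neighbor_labels):
    """Each neighbor counts once; the most common label wins."""
    labels = [label for _, label in neighbor_labels]
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(neighbor_labels):
    """Each neighbor's vote is weighted by the inverse of its distance."""
    scores = defaultdict(float)
    for distance, label in neighbor_labels:
        if distance == 0:
            return label  # simplification: an exact match decides the label
        scores[label] += 1.0 / distance
    return max(scores, key=scores.get)

# Hypothetical neighbors as (distance, truncated rating) pairs
neighbor_labels = [(1.0, 4), (2.0, 3), (2.5, 3)]

print(majority_vote(neighbor_labels))  # 3: two of the three neighbors are rated 3
print(weighted_vote(neighbor_labels))  # 4: weight 1.0 beats 0.5 + 0.4
```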

# Rating prediction

In [27]:
import warnings
warnings.filterwarnings('ignore')

In [28]:
def evaluate(df):
    # Note: `df` is unused; the function reads the global train/test arrays and df_rec

    n_neighbors = [3, 5]

    # astype(int) truncates each average rating (e.g. 4.75 -> 4) to form the class labels
    df_train_target = df_rec['Rating'][:len_train].astype(int)
    df_test_target = df_rec['Rating'][len_train:len_train+len_test].astype(int)

    for i in n_neighbors:
        # weights='distance' weights each neighbor's vote by the inverse of its distance
        knn_clf = neighbors.KNeighborsClassifier(i, weights='distance')
        knn_clf.fit(df_reviews_train, df_train_target)
        knn_pred = knn_clf.predict(df_reviews_test)

        print(f'Report for k = {i} \n')
        print('Accuracy:', accuracy_score(df_test_target, knn_pred))
        print('RMSE:', np.sqrt(mean_squared_error(df_test_target, knn_pred)))
        print()
        print(classification_report(df_test_target, knn_pred))
        print('-----------------------------------------------------------')
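For reference, the two headline metrics are simple to compute by hand. The sketch below uses made-up average ratings and predictions to show how truncation turns averages into class labels, and how accuracy and RMSE are derived from them. The helper names and data are assumptions.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted labels."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

# Hypothetical average ratings; int() truncates, just like astype(int) on positive floats
avg_ratings = [4.75, 3.88, 1.71, 4.03]
labels = [int(r) for r in avg_ratings]
predictions = [4, 4, 3, 4]  # hypothetical classifier output

print(labels)                         # [4, 3, 1, 4]
print(accuracy(labels, predictions))  # 0.5
print(rmse(labels, predictions))
```

Accuracy treats every miss the same, while RMSE penalizes a prediction of 3 for a true label of 1 more than a prediction of 4 for a true label of 3, so the two metrics can rank models differently.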

In [29]:
evaluate(df_reviews)

Report for k = 3 

Accuracy: 0.7619047619047619
RMSE: 0.5345224838248488

              precision    recall  f1-score   support

           1       0.00      0.00      0.00         1
           3       0.50      0.60      0.55        15
           4       0.87      0.83      0.85        47

    accuracy                           0.76        63
   macro avg       0.46      0.48      0.46        63
weighted avg       0.77      0.76      0.76        63

-----------------------------------------------------------
Report for k = 5 

Accuracy: 0.7301587301587301
RMSE: 0.563436169819011

              precision    recall  f1-score   support

           1       0.00      0.00      0.00         1
           3       0.44      0.53      0.48        15
           4       0.84      0.81      0.83        47

    accuracy                           0.73        63
   macro avg       0.43      0.45      0.44        63
weighted avg       0.74      0.73      0.73        63

-------------------------------

## Another trial using different train size, algorithms, and number of k

In [30]:
# Convert the count dataframe to a NumPy array
df_split = np.array(df_reviews)

# Create train and test set. Now, the train set size is 70%
tpercent = 0.7
tsize = int(np.floor(tpercent * len(df_reviews)))
df_reviews_train = df_split[:tsize]
df_reviews_test = df_split[tsize:]

#len of train and test
len_train = len(df_reviews_train)
len_test = len(df_reviews_test)

In [31]:
print(len_train)
print(len_test)

219
94


In [32]:
neighbor = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(df_reviews_train)

# Find the k nearest neighbors of each training point (the first neighbor is the point itself, at distance 0)
distances, indices = neighbor.kneighbors(df_reviews_train)

In [33]:
# Find the two most related products for each test product
for i in range(len_test):
    # kneighbors returns (distances, indices); indices has shape (1, n_neighbors)
    _, related_product_list = neighbor.kneighbors([df_reviews_test[i]])

    # Row positions of the two closest training products
    first_related_product = int(related_product_list[0][0])
    second_related_product = int(related_product_list[0][1])
    
    print ('The average rating of product reviews for{} is {}'.format(df_rec['ProductID'][len_train + i], df_rec['Rating'][len_train + i]))
    print ('The first similar product is{}, with average rating {}'.format(df_rec['ProductID'][first_related_product], df_rec['Rating'][first_related_product]))
    print ('The second similar product is{}, with average rating {}'.format(df_rec['ProductID'][second_related_product], df_rec['Rating'][second_related_product]))
    print ('---------------------------------------------------------------------------------')

The average rating of product reviews for B000ER3QM8 is 4.301724137931035
The first similar product is B0002ASM94, with average rating 4.327102803738318
The second similar product is B0002DGLNK, with average rating 4.432
---------------------------------------------------------------------------------
The average rating of product reviews for B000ERNO0M is 4.293906810035843
The first similar product is B0002AR18M, with average rating 4.366666666666666
The second similar product is B000AUJFHE, with average rating 4.359605911330049
---------------------------------------------------------------------------------
The average rating of product reviews for B000F0VZV8 is 4.614634146341463
The first similar product is B0006GW0YC, with average rating 4.626436781609195
The second similar product is B000CMJTBM, with average rating 4.006410256410256
---------------------------------------------------------------------------------
The average rating of product reviews for B000F1OS20 is 3.260765550

In [34]:
df_train_target = df_rec['Rating'][:len_train]
df_test_target = df_rec['Rating'][len_train:len_train+len_test]

df_train_target = df_train_target.astype(int)
df_test_target = df_test_target.astype(int)

n_neighbors = 5

knn_clf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
knn_clf.fit(df_reviews_train, df_train_target)
knn_pred = knn_clf.predict(df_reviews_test)

print(classification_report(df_test_target, knn_pred))
print ('Accuracy:', accuracy_score(df_test_target, knn_pred))
print('RMSE:', np.sqrt(mean_squared_error(df_test_target, knn_pred)))

              precision    recall  f1-score   support

           1       0.00      0.00      0.00         1
           3       0.67      0.52      0.58        27
           4       0.82      0.91      0.86        66

    accuracy                           0.79        94
   macro avg       0.50      0.48      0.48        94
weighted avg       0.77      0.79      0.77        94

Accuracy: 0.7872340425531915
RMSE: 0.4946522526622413


## KNN with k = 3, algorithm = brute

In [35]:
neighbor = NearestNeighbors(n_neighbors=3, algorithm='brute').fit(df_reviews_train)

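# Find the k nearest neighbors of each training point.
# Note: this search object is separate from the classifier below, which keeps its default algorithm='auto'.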
distances, indices = neighbor.kneighbors(df_reviews_train)

In [36]:
df_train_target = df_rec['Rating'][:len_train]
df_test_target = df_rec['Rating'][len_train:len_train+len_test]

df_train_target = df_train_target.astype(int)
df_test_target = df_test_target.astype(int)

n_neighbors = 3

knn_clf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
knn_clf.fit(df_reviews_train, df_train_target)
knn_pred = knn_clf.predict(df_reviews_test)

print(classification_report(df_test_target, knn_pred))
print ('Accuracy:', accuracy_score(df_test_target, knn_pred))
print('RMSE:', np.sqrt(mean_squared_error(df_test_target, knn_pred)))

              precision    recall  f1-score   support

           1       0.00      0.00      0.00         1
           2       0.00      0.00      0.00         0
           3       0.69      0.67      0.68        27
           4       0.87      0.88      0.87        66

    accuracy                           0.81        94
   macro avg       0.39      0.39      0.39        94
weighted avg       0.81      0.81      0.81        94

Accuracy: 0.8085106382978723
RMSE: 0.5052911526399113


## KNN with k = 5, algorithm = kd_tree

In [37]:
neighbor = NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(df_reviews_train)

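# Find the k nearest neighbors of each training point.
# Note: this search object is separate from the classifier below, which keeps its default algorithm='auto'.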
distances, indices = neighbor.kneighbors(df_reviews_train)

In [38]:
df_train_target = df_rec['Rating'][:len_train]
df_test_target = df_rec['Rating'][len_train:len_train+len_test]

df_train_target = df_train_target.astype(int)
df_test_target = df_test_target.astype(int)

n_neighbors = 5

knn_clf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
knn_clf.fit(df_reviews_train, df_train_target)
knn_pred = knn_clf.predict(df_reviews_test)

print(classification_report(df_test_target, knn_pred))
print ('Accuracy:', accuracy_score(df_test_target, knn_pred))
print('RMSE:', np.sqrt(mean_squared_error(df_test_target, knn_pred)))

              precision    recall  f1-score   support

           1       0.00      0.00      0.00         1
           3       0.67      0.52      0.58        27
           4       0.82      0.91      0.86        66

    accuracy                           0.79        94
   macro avg       0.50      0.48      0.48        94
weighted avg       0.77      0.79      0.77        94

Accuracy: 0.7872340425531915
RMSE: 0.4946522526622413


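### Using k = 3 and k = 5 with different search algorithms gives no significant difference. This is expected: `ball_tree`, `kd_tree`, and `brute` are all exact searches that return the same neighbors (up to ties) and differ only in speed, and the classifier cells leave the algorithm at its default. With the 70/30 split, the model predicts the truncated rating with an accuracy of about 79% to 81% and an RMSE of about 0.49 to 0.51. Class 4 alone accounts for 66 of the 94 test products (about 70%), so the gain over always predicting 4 is modest.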