### Recommender System:
The approach is the item-based K-nearest neighbor (KNN) algorithm. The idea is as follows: 
to estimate the rating a user would give a product, we find other products that are similar to it and infer the rating from the user's ratings of those similar products. KNN finds the K nearest neighbors of each product under a chosen similarity function and uses the similarity-weighted mean of their ratings as the prediction. 
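
The weighted-mean idea described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper names `cosine_similarity` and `predict_rating`, the toy feature vectors, and the choice of cosine similarity are all hypothetical and are not taken from the notebook's own pipeline.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def predict_rating(target_vec, rated_items, k=2):
    """Predict a user's rating for a target product.

    rated_items is a list of (feature_vector, user_rating) pairs for
    products the user has already rated (hypothetical toy format).
    """
    scored = [(cosine_similarity(target_vec, vec), rating)
              for vec, rating in rated_items]
    # Keep the k products most similar to the target.
    top_k = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    total_similarity = sum(sim for sim, _ in top_k)
    if total_similarity == 0:
        return None  # no similar product to base a prediction on
    # Similarity-weighted mean of the user's ratings on those products.
    return sum(sim * rating for sim, rating in top_k) / total_similarity

# Hypothetical toy data: three products the user has rated.
rated = [([1, 0, 1], 5.0), ([1, 1, 0], 3.0), ([0, 0, 1], 1.0)]
prediction = predict_rating([1, 0, 1], rated, k=2)
print(round(prediction, 3))
```

The most similar product dominates the prediction, while the second neighbor pulls it towards its own rating in proportion to its similarity.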

In [96]:
#importing the required Libraries
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import DataFrame 
import nltk

from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression
from sklearn import neighbors
from scipy.spatial.distance import cosine
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

import re
import string
from wordcloud import WordCloud, STOPWORDS
from sklearn.metrics import mean_squared_error

In [97]:
#Reading the dataset
reviews= pd.read_json('Beauty_5.json',lines=True)

In [98]:
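#getting the count of reviews and the mean of the numeric columns per Product ID
count = reviews.groupby("asin", as_index=False).count()
#numeric_only=True keeps mean() from failing on text columns in recent pandas versions
mean = reviews.groupby("asin", as_index=False).mean(numeric_only=True)
#Merging the counts into the dataset
reviewsMerged = pd.merge(reviews, count, how='right', on=['asin'])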
reviewsMerged.head(5)

Unnamed: 0,asin,helpful_x,overall_x,reviewText_x,reviewTime_x,reviewerID_x,reviewerName_x,summary_x,unixReviewTime_x,helpful_y,overall_y,reviewText_y,reviewTime_y,reviewerID_y,reviewerName_y,summary_y,unixReviewTime_y
0,7806397051,"[3, 4]",1,Very oily and creamy. Not at all what I expect...,"01 30, 2014",A1YJEY40YUW4SE,Andrea,Don't waste your money,1391040000,8,8,8,8,8,8,8,8
1,7806397051,"[1, 1]",3,This palette was a decent price and I was look...,"04 18, 2014",A60XNB876KYML,Jessica H.,OK Palette!,1397779200,8,8,8,8,8,8,8,8
2,7806397051,"[0, 1]",4,The texture of this concealer pallet is fantas...,"09 6, 2013",A3G6XNM240RMWA,Karen,great quality,1378425600,8,8,8,8,8,8,8,8
3,7806397051,"[2, 2]",2,I really can't tell what exactly this thing is...,"12 8, 2013",A1PQFP6SAJ6D80,Norah,Do not work on my face,1386460800,8,8,8,8,8,8,8,8
4,7806397051,"[0, 0]",3,"It was a little smaller than I expected, but t...","10 19, 2013",A38FVHZTNQ271F,Nova Amor,It's okay.,1382140800,8,8,8,8,8,8,8,8


In [99]:
#copy the merged columns under clearer names
reviewsMerged["totalReviewers"] = reviewsMerged["reviewerID_y"]
reviewsMerged["overallScore"] = reviewsMerged["overall_x"]
reviewsMerged["summaryReview"] = reviewsMerged["summary_x"]


**Selecting products which have at least 100 reviews**

In [100]:
#Sorting by number of reviews and keeping the products that have 100 or more reviews
reviewsMerged = reviewsMerged.sort_values(by='totalReviewers', ascending=False)
reviewsCount = reviewsMerged[reviewsMerged.totalReviewers >= 100]
reviewsCount.head(5)

Unnamed: 0,asin,helpful_x,overall_x,reviewText_x,reviewTime_x,reviewerID_x,reviewerName_x,summary_x,unixReviewTime_x,helpful_y,overall_y,reviewText_y,reviewTime_y,reviewerID_y,reviewerName_y,summary_y,unixReviewTime_y,totalReviewers,overallScore,summaryReview
112506,B004OHQR1Q,"[7, 9]",1,first off.... i ordered these expecting there ...,"10 24, 2013",A3BP5ZF51CHZOE,Caitlyn Johnson,crap!,1382572800,431,431,431,431,431,431,431,431,431,1,crap!
112578,B004OHQR1Q,"[0, 0]",5,I use this all the time helps out with my desi...,"06 25, 2013",A3796JLADKK5Z7,Freckvanilla,YOU NEED THIS!!,1372118400,431,431,431,431,431,431,431,431,431,5,YOU NEED THIS!!
112580,B004OHQR1Q,"[0, 0]",5,Love the array of colors and the different siz...,"05 6, 2013",A22NETHJ4KTWJP,gadgetGirl,Great product at a very affordable price,1367798400,431,431,431,431,431,431,431,431,431,5,Great product at a very affordable price
112581,B004OHQR1Q,"[0, 0]",5,I would recommend these dotting tools to anyon...,"04 22, 2014",A1L9ZTGO75717E,gatorsgate,Dotting 5 X 2 Way Marbleizing Dotting Pen Set,1398124800,431,431,431,431,431,431,431,431,431,5,Dotting 5 X 2 Way Marbleizing Dotting Pen Set
112582,B004OHQR1Q,"[0, 0]",5,These are easy to use. Gets the job done and c...,"08 7, 2013",A3DDZQYUAE9WNP,gee,Love the look.,1375833600,431,431,431,431,431,431,431,431,431,5,Love the look.


### Grouping all the summary reviews by product ID
We group the review summaries by product ID, so that each product with at least 100 reviews gets a single list of its summaries, and store the result in a CSV file. 


In [108]:
#Grouping the summary reviews by product ID
#numeric_only=True keeps mean() from failing on text columns in recent pandas versions
dfProductReview = reviews.groupby("asin", as_index=False).mean(numeric_only=True)
ProductReviewSummary = reviewsCount.groupby("asin")["summaryReview"].apply(list)
ProductReviewSummary = pd.DataFrame(ProductReviewSummary)
ProductReviewSummary.to_csv("ProductReviewSummary.csv")

In [109]:
dfProductReview.head(5)

Unnamed: 0,asin,overall,unixReviewTime
0,7806397051,2.625,1382087000.0
1,9759091062,3.090909,1390930000.0
2,9788072216,5.0,1342552000.0
3,9790790961,4.333333,1378858000.0
4,9790794231,3.6,1298212000.0


### Create a dataframe with selected columns

In [110]:
#Getting Product Id , SummaryReview and Overall review 
df3 =pd.read_csv("ProductReviewSummary.csv")
df3 = pd.merge(df3, dfProductReview, on="asin", how='inner')
df3 = df3[['asin','summaryReview','overall']]
df3.head()

Unnamed: 0,asin,summaryReview,overall
0,B0000530ED,"['Great Color', 'This Is *Deep-Burgundy-Plum* ...",4.009709
1,B0000632EN,"['Nice small size', 'This product is really ni...",3.802721
2,B0000CC64W,"['Love it', 'Light', 'A skin must-have', 'Noti...",4.314685
3,B000142C1A,"['Beautiful!', 'Pretty - more orange in person...",4.482456
4,B000142FVW,"['my perfect ""Every-day"" soft neutral', ""Looks...",4.55298


### Text Cleaning - Summary column
For text formatting we use regular expressions. A regular expression (regex) is a sequence of characters that defines a search pattern, and it can be used to check whether a string contains that pattern. Python's built-in re module provides regex support; its sub() function replaces every match with a string of your choice. We also apply strip(), which removes leading and trailing whitespace from a string. 

In [111]:
#function for cleaning the summary text: lowercase it and replace every run of non-letters with a single space
regEx = re.compile('[^a-z]+')
def cleanReviews(reviewText):
    reviewText = reviewText.lower()
    reviewText = regEx.sub(' ', reviewText).strip()
    return reviewText
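
As a quick check of what this cleaning step does, the snippet below applies the same pattern to a stringified list of summaries, which is the shape the `summaryReview` column has after being read back from the CSV. The function is repeated under the hypothetical name `clean_reviews` so the snippet runs on its own, and the sample string is made up for illustration.

```python
import re

# Same pattern as above: one or more characters that are not a lowercase letter.
NON_LETTERS = re.compile('[^a-z]+')

def clean_reviews(text):
    """Lowercase the text and collapse every run of non-letters into one space."""
    text = text.lower()
    return NON_LETTERS.sub(' ', text).strip()

# Hypothetical sample: a list of summaries stored as a single string.
sample = "['Great Color!', 'Love it :)']"
cleaned = clean_reviews(sample)
print(cleaned)  # great color love it
```

Brackets, quotes, punctuation and digits all disappear, so the individual summaries of a product are merged into one space-separated string of words.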

In [112]:
#clean the summaries, drop rows that share the same average rating, and reset the index
df3["summaryClean"] = df3["summaryReview"].apply(cleanReviews)
df3 = df3.drop_duplicates(['overall'], keep='last')
df3 = df3.reset_index()

First, we need to transform the cleaned review summaries into a format that a KNN model can consume: a numeric array.
##### CountVectorizer:
The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary. Each document is encoded as a vector whose length is the size of the vocabulary and whose entries are integer counts of how many times each word appears in the document. Stop words are words like “and”, “the”, “him”, which are presumed to be uninformative in representing the content of a text and may be removed so that they are not mistaken for predictive signal. Sometimes, however, such words are useful for prediction, for example when classifying writing style or personality.
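
To make the encoding concrete, here is a plain-Python sketch of the same bag-of-words idea. It is a simplified stand-in, not scikit-learn's implementation: the helper names `fit_vocabulary` and `transform`, the whitespace tokenization, and the tiny stop-word set are assumptions made for illustration.

```python
from collections import Counter

def fit_vocabulary(documents, stop_words=frozenset(), max_features=None):
    """Build a sorted vocabulary from the most frequent non-stop words."""
    counts = Counter()
    for doc in documents:
        counts.update(word for word in doc.split() if word not in stop_words)
    # most_common(None) returns every word; otherwise keep the top max_features.
    kept = [word for word, _ in counts.most_common(max_features)]
    return sorted(kept)

def transform(documents, vocabulary):
    """Encode each document as a list of word counts, one slot per vocabulary word."""
    index = {word: i for i, word in enumerate(vocabulary)}
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for word in doc.split():
            if word in index:
                row[index[word]] += 1
        rows.append(row)
    return rows

# Hypothetical cleaned summaries.
docs = ["great color love the color", "love it and the price"]
vocab = fit_vocabulary(docs, stop_words={"the", "and", "it"})
matrix = transform(docs, vocab)
print(vocab)   # ['color', 'great', 'love', 'price']
print(matrix)  # [[2, 1, 1, 0], [0, 0, 1, 1]]
```

Each row has one entry per vocabulary word, which is the array shape the nearest-neighbor model expects.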

In [147]:
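#applying CountVectorizer to the cleaned summary data 
reviews = df3["summaryClean"] 
countVector = CountVectorizer(max_features = 300, stop_words='english') 
transformedReviews = countVector.fit_transform(reviews) 

#toarray() and get_feature_names_out() replace the older .A and get_feature_names(),
#which recent SciPy and scikit-learn releases no longer provide (needs scikit-learn >= 1.0)
dfReviews = DataFrame(transformedReviews.toarray(), columns=countVector.get_feature_names_out())
dfReviews = dfReviews.astype(int)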


In [114]:
#save 
dfReviews.to_csv("dfReviews.csv")

In [115]:
# First let's create a dataset called X
X = np.array(dfReviews)
 # create train and test
tpercent = 0.9
tsize = int(np.floor(tpercent * len(dfReviews)))
dfReviews_train = X[:tsize]
dfReviews_test = X[tsize:]
#len of train and test
lentrain = len(dfReviews_train)
lentest = len(dfReviews_test)

In [116]:
# Sizes of the training and test sets
print(lentrain)
print(lentest)

154
18


### KNN
To implement item-based collaborative filtering, KNN is a common go-to model and a good baseline for recommender system development. We use the unsupervised nearest-neighbor learner from sklearn.neighbors with the ball tree algorithm.

The ball tree nearest-neighbor algorithm examines nodes in depth-first order, starting at the root. During the search, the algorithm maintains a max-first priority queue (often implemented with a heap), denoted Q here, of the k nearest points encountered so far. At each node B, it may perform one of three operations, before finally returning an updated version of the priority queue:

1. If the distance from the test point t to the current node B is greater than the distance from t to the furthest point in Q, ignore B and return Q.

2. If B is a leaf node, scan through every point enumerated in B and update the nearest-neighbor queue appropriately. Return the updated queue.

3. If B is an internal node, call the algorithm recursively on B's two children, searching the child whose center is closer to t first. Return the queue after each of these calls has updated it in turn.
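
The three cases above can be sketched as a small pure-Python ball tree. This is a teaching sketch under assumptions, not scikit-learn's implementation: the `Ball` class, the `knn_search` function, the median split along the widest dimension, and the sample points are all hypothetical. The queue Q is a heap of negated distances, so its root is always the furthest of the k best points found so far.

```python
import heapq
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class Ball:
    """A ball tree node: a center, a radius, and either points (leaf) or two children."""
    def __init__(self, points, leaf_size=2):
        dims = len(points[0])
        self.center = [sum(p[d] for p in points) / len(points) for d in range(dims)]
        self.radius = max(dist(self.center, p) for p in points)
        self.points = points
        self.children = []
        if len(points) > leaf_size:
            # Split at the median along the dimension with the greatest spread.
            spread = [max(p[d] for p in points) - min(p[d] for p in points)
                      for d in range(dims)]
            axis = spread.index(max(spread))
            ordered = sorted(points, key=lambda p: p[axis])
            mid = len(ordered) // 2
            self.children = [Ball(ordered[:mid], leaf_size),
                             Ball(ordered[mid:], leaf_size)]

def knn_search(t, k, node, queue=None):
    """Return a heap of (-distance, point) holding the k nearest points to t."""
    if queue is None:
        queue = []
    # Case 1: the whole ball is further away than the current k-th best point.
    if len(queue) == k and dist(t, node.center) - node.radius >= -queue[0][0]:
        return queue
    # Case 2: leaf node, so check every point it contains.
    if not node.children:
        for p in node.points:
            d = dist(t, p)
            if len(queue) < k:
                heapq.heappush(queue, (-d, p))
            elif d < -queue[0][0]:
                heapq.heapreplace(queue, (-d, p))
        return queue
    # Case 3: internal node, so search the child with the closer center first.
    for child in sorted(node.children, key=lambda c: dist(t, c.center)):
        knn_search(t, k, child, queue)
    return queue

points = [(0, 0), (1, 1), (2, 2), (5, 5), (6, 5), (9, 9), (1, 0)]
query = (1.2, 0.9)
result = knn_search(query, 3, Ball(points))
nearest = [p for _, p in sorted(result, reverse=True)]  # closest first
print(nearest)
```

Searching the closer child first tends to fill the queue with good candidates early, which makes the pruning test in case 1 succeed more often on the remaining balls.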

In [117]:
# Applying the nearest neighbour Algorithm
neighbor = NearestNeighbors(n_neighbors=3, algorithm='ball_tree').fit(dfReviews_train)

# Let's find the k nearest neighbors of each training point by calling kneighbors() on the training data.
distances, indices = neighbor.kneighbors(dfReviews_train)

In [148]:
#find most related products
for i in range(lentest):
    # Find the training products whose review-summary word counts are closest
    # to those of test product i; neighbor_indices[0] holds their row positions.
    _, neighbor_indices = neighbor.kneighbors([dfReviews_test[i]])
    first_related_product = int(neighbor_indices[0][0])
    second_related_product = int(neighbor_indices[0][1])
    
    print ("Based on product reviews, for ", df3["asin"][lentrain + i] ," average rating is ",df3["overall"][lentrain + i])
    print ("The first similar product is ", df3["asin"][first_related_product] ," average rating is ",df3["overall"][first_related_product])
    print ("The second similar product is ", df3["asin"][second_related_product] ," average rating is ",df3["overall"][second_related_product])
    print ("-----------------------------------------------------------")

Based on product reviews, for  B00AHF1GK6  average rating is  4.181286549707602
The first similar product is  B007BJ3KQ4  average rating is  4.3700787401574805
The second similar product is  B001RMP7M6  average rating is  4.655172413793103
-----------------------------------------------------------
Based on product reviews, for  B00AHF1GTM  average rating is  4.336633663366337
The first similar product is  B0090UJFYI  average rating is  4.245454545454545
The second similar product is  B0030HKJ8I  average rating is  4.324137931034483
-----------------------------------------------------------
Based on product reviews, for  B00AO379NE  average rating is  4.401869158878505
The first similar product is  B004LUZ956  average rating is  4.024390243902439
The second similar product is  B005XIDZHO  average rating is  3.4554455445544554
-----------------------------------------------------------
Based on product reviews, for  B00AO4E9E0  average rating is  4.388349514563107
The first similar pro

Here, for each test product we get the two most similar products, based on their review summaries. 

Let's predict the review score. 

## Predicting Review Score and getting the accuracy of the model with k = 3

In [119]:
df5_train_target = df3["overall"][:lentrain]
df5_test_target = df3["overall"][lentrain:lentrain+lentest]
# astype(int) truncates the average rating to an integer class, e.g. 4.18 -> 4 and 3.46 -> 3
df5_train_target = df5_train_target.astype(int)
df5_test_target = df5_test_target.astype(int)

n_neighbors = 3
knnclf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
knnclf.fit(dfReviews_train, df5_train_target)
knnpreds_test = knnclf.predict(dfReviews_test)

print(classification_report(df5_test_target, knnpreds_test))

              precision    recall  f1-score   support

           3       0.00      0.00      0.00         0
           4       1.00      0.94      0.97        18

   micro avg       0.94      0.94      0.94        18
   macro avg       0.50      0.47      0.49        18
weighted avg       1.00      0.94      0.97        18





### Accuracy of the model

In [120]:
print (accuracy_score(df5_test_target, knnpreds_test))

0.9444444444444444


In [121]:
print(mean_squared_error(df5_test_target, knnpreds_test))

0.05555555555555555
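
Both numbers can be reproduced by hand. The classification report lists 18 test products, all in class 4, with a recall of 0.94, which is consistent with 17 products predicted as 4 and one predicted as 3. The label lists below are reconstructed from the report under that assumption; they are not the actual prediction array, and the helper names `accuracy` and `mse` are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted labels."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Assumed reconstruction: 18 true labels of 4, one of them predicted as 3.
y_true = [4] * 18
y_pred = [4] * 17 + [3]

acc = accuracy(y_true, y_pred)   # 17 / 18
err = mse(y_true, y_pred)        # (4 - 3)^2 / 18 = 1 / 18
print(acc, err)
```

Because the only error is off by exactly one class, the MSE equals the error rate, 1 - accuracy.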


That is the accuracy of the model for K = 3. Let's try K = 5 on the same data.

### Predicting Review Score with k = 5

In [136]:
df5_train_target = df3["overall"][:lentrain]
df5_test_target = df3["overall"][lentrain:lentrain+lentest]
df5_train_target = df5_train_target.astype(int)
df5_test_target = df5_test_target.astype(int)

n_neighbors = 5
knnclf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
knnclf.fit(dfReviews_train, df5_train_target)
knnpreds_test = knnclf.predict(dfReviews_test)
#print (knnpreds_test)

print(classification_report(df5_test_target, knnpreds_test))

              precision    recall  f1-score   support

           3       0.00      0.00      0.00         0
           4       1.00      0.81      0.89        26

   micro avg       0.81      0.81      0.81        26
   macro avg       0.50      0.40      0.45        26
weighted avg       1.00      0.81      0.89        26





In [137]:
print (accuracy_score(df5_test_target, knnpreds_test))

0.8076923076923077


In [138]:
print(mean_squared_error(df5_test_target, knnpreds_test))

0.19230769230769232


### Predicting review scores with an 85-15 train-test split and k = 5

In [149]:
# First let's create a dataset called X
X = np.array(dfReviews)
 # create train and test
tpercent = 0.85
tsize = int(np.floor(tpercent * len(dfReviews)))
dfReviews_train = X[:tsize]
dfReviews_test = X[tsize:]
#len of train and test
lentrain = len(dfReviews_train)
lentest = len(dfReviews_test)

In [150]:
# Next we instantiate a nearest-neighbor object, call it neighbor, and fit it to the training data.
neighbor = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(dfReviews_train)

# Let's find the k nearest neighbors of each training point by calling kneighbors() on the training data.
distances, indices = neighbor.kneighbors(dfReviews_train)

In [151]:
#find most related products
for i in range(lentest):
    # Find the training products whose review-summary word counts are closest
    # to those of test product i; neighbor_indices[0] holds their row positions.
    _, neighbor_indices = neighbor.kneighbors([dfReviews_test[i]])
    first_related_product = int(neighbor_indices[0][0])
    second_related_product = int(neighbor_indices[0][1])
    
    print ("Based on product reviews, for ", df3["asin"][lentrain + i] ," average rating is ",df3["overall"][lentrain + i])
    print ("The first similar product is ", df3["asin"][first_related_product] ," average rating is ",df3["overall"][first_related_product])
    print ("The second similar product is ", df3["asin"][second_related_product] ," average rating is ",df3["overall"][second_related_product])
    print ("-----------------------------------------------------------")

Based on product reviews, for  B00AHF1GK6  average rating is  4.181286549707602
The first similar product is  B007BJ3KQ4  average rating is  4.3700787401574805
The second similar product is  B001RMP7M6  average rating is  4.655172413793103
-----------------------------------------------------------
Based on product reviews, for  B00AHF1GTM  average rating is  4.336633663366337
The first similar product is  B0090UJFYI  average rating is  4.245454545454545
The second similar product is  B0030HKJ8I  average rating is  4.324137931034483
-----------------------------------------------------------
Based on product reviews, for  B00AO379NE  average rating is  4.401869158878505
The first similar product is  B004LUZ956  average rating is  4.024390243902439
The second similar product is  B005XIDZHO  average rating is  3.4554455445544554
-----------------------------------------------------------
Based on product reviews, for  B00AO4E9E0  average rating is  4.388349514563107
The first similar pro

In [152]:
df5_train_target = df3["overall"][:lentrain]
df5_test_target = df3["overall"][lentrain:lentrain+lentest]
df5_train_target = df5_train_target.astype(int)
df5_test_target = df5_test_target.astype(int)

n_neighbors = 5
knnclf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
knnclf.fit(dfReviews_train, df5_train_target)
knnpreds_test = knnclf.predict(dfReviews_test)
#print (knnpreds_test)

print(classification_report(df5_test_target, knnpreds_test))

              precision    recall  f1-score   support

           3       0.00      0.00      0.00         0
           4       1.00      0.81      0.89        26

   micro avg       0.81      0.81      0.81        26
   macro avg       0.50      0.40      0.45        26
weighted avg       1.00      0.81      0.89        26





In [153]:
print (accuracy_score(df5_test_target, knnpreds_test))

0.8076923076923077


In [154]:
print(mean_squared_error(df5_test_target, knnpreds_test))

0.19230769230769232


We got the best accuracy of the model with k=3.

## Conclusion:
We used the item-based approach for our product recommender system. We preferred it over the user-based approach because the user-based approach is often harder to scale given the dynamic nature of users, whereas items usually don’t change much, so item similarities can often be computed offline and served without constant re-training.

We predicted the review score of the product with n_neighbors = 3 and a 90-10 train-test split. 
The accuracy of the model was 0.9444444444444444 and the MSE was 0.05555555555555555.

Then, we predicted the review score with n_neighbors = 5, intended as a 90-10 train-test split. 
The accuracy of the model was 0.8076 and the MSE was 0.19. Note that the classification report for this run lists 26 test products, which matches the 85-15 split rather than the 18 test products of the 90-10 split, so this run appears to have used the 85-15 data.

We also predicted the review score with n_neighbors = 5 and an 85-15 train-test split. The accuracy of the model was 0.8076923076923077 and the MSE was 0.19230769230769232, identical to the previous run. Because both k = 5 runs appear to have used the same 26-product test set, these results do not show whether changing the split affects accuracy.