# **PROBLEM STATEMENT**

Ebuss, an e-commerce company, wants to grow quickly and become a major player in a market where it competes with established leaders such as Amazon and Flipkart.
To do so, Ebuss wants to build a model that improves the recommendations shown to users based on their past reviews and ratings.

To do this, the following tasks are planned for building a sentiment-based product recommendation system:

1. Data sourcing and sentiment analysis
2. Building a recommendation system
3. Improving the recommendations using the sentiment analysis model
4. Deploying the end-to-end project with a user interface

This notebook has the following major sections:

1. Exploratory data analysis
2. Data preprocessing
3. Feature extraction
4. Training a text classification model
5. Building a recommendation system
6. Improving the recommendations using the sentiment analysis model



# **1. Exploratory Data Analysis**

In [1]:
 #Importing libraries
import pathlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
## Mounting google drive for running the notebook on Google Colab

from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
cd gdrive/My Drive/Colab Notebooks/Capstone project/

In [None]:
#Importing labeled data for training the classifier
master_df = pd.read_csv('sample30.csv')

In [None]:
master_df.head()

In [None]:
#Checking the datatype of columns

master_df.info()

In [None]:
#We will drop the columns with high number of missing values, which are not so significant:
# reviews_date, reviews_doRecommend, reviews_userCity, reviews_userProvince

master_df.drop(['reviews_date', 'reviews_doRecommend', 'reviews_userCity', 'reviews_userProvince'], axis=1, inplace=True)
master_df.info()

In [None]:
#Lets check for null/missing values in the data

master_df.isnull().sum()

In [None]:
#Column 'reviews_didPurchase' has large number of missing values. 
#Lets check the distribution of reviews over 'reviews_didPurchase'

#Filling missing values with the placeholder string 'Null'
master_df['reviews_didPurchase'] = master_df['reviews_didPurchase'].fillna('Null')

#Checking distribution of reviews_didPurchase
plt.figure(figsize=(8,6))
sns.set_theme(style="darkgrid")
ax = sns.countplot(x=master_df['reviews_didPurchase'], palette="Set3")
ax.set_xlabel(xlabel="Customers who purchased the product", fontsize=15)
ax.set_ylabel(ylabel='Count of Reviews', fontsize=15)
ax.axes.set_title('Reviews distributed over customers who did purchase or did not', fontsize=15)
ax.tick_params(labelsize=13)
plt.show()

In [None]:
#Lets check the count of reviews distributed over purchase

print("Count of Reviews which are related to a Purchase:")
master_df['reviews_didPurchase'].value_counts()

**Observation:** Very few reviews are based on an actual purchase, and almost 50% of the values are missing, so this column is not very informative. We will drop it later.

In [None]:
#There is one row without user_sentiment label. We will drop the row later. 

master_df[master_df['user_sentiment'].isna()]

In [None]:
##Looking at unique values in Key columns

for i in ['brand', 'categories', 'manufacturer', 'name','reviews_username', 'reviews_rating', 'user_sentiment']:
  print("No. of unique %s is: %s" %(i, master_df[i].nunique()))

In [None]:
#Checking top 10 most purchased product
result = master_df[master_df['reviews_didPurchase'] == True]
result['name'].value_counts()[0:10].plot(kind = 'barh', figsize=[15,10], fontsize=15,color='Blue').invert_yaxis()

In [None]:
#Checking top 10 most trusted brands based on the positive review
from matplotlib import cm
result = master_df[(master_df.user_sentiment=="Positive")]
result['brand'].value_counts()[0:10].plot(kind = 'barh', figsize=[15,10], fontsize=15,color='Green').invert_yaxis()

In [None]:
#Checking top 10 most badly rated brands based on the negative review
from matplotlib import cm
result = master_df[(master_df.user_sentiment=="Negative")]
result['brand'].value_counts()[0:10].plot(kind = 'barh', figsize=[15,10], fontsize=15,color='Red').invert_yaxis()

**Observation:** Clorox has the highest number of both positive and negative reviews. It appears to be the most reviewed brand, followed by Warner Home Video.

In [None]:
#Lets check Brand vs Rating

plt.figure(figsize=(10,8))
ax = sns.countplot(y=master_df['brand'], hue=master_df['reviews_rating'], order=master_df['brand'].value_counts().iloc[:20].index)
ax.set_xlabel(xlabel="reviews_rating", fontsize=15)
ax.set_ylabel(ylabel='brand', fontsize=15)
ax.axes.set_title('Brand vs Rating, grouped on Rating', fontsize=15)
ax.tick_params(labelsize=13)
plt.grid()
plt.show()

**Observation:** This plot confirms that Clorox is the most reviewed brand.


In [None]:
#overall ratings as per Reviews for all the products
sns.countplot(x = 'reviews_rating', data=master_df,palette = 'dark').set_title('Ratings Trend - Count of Reviews by Ratings', fontsize=14)

In [None]:
#Checking the count of ratings
master_df['reviews_rating'].value_counts()

**Observation:** Most reviews are highly rated (rating 5) 

In [None]:
master_df['categories'].nunique()

In [None]:
# Distribution of Reviews by word length - helps understand the strength of sentiment

f = plt.figure(figsize=(8,5))
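#Taking a copy so that adding the 'reviewLength' column does not trigger SettingWithCopyWarning
df_reviews = master_df[['id','reviews_username','reviews_text','reviews_title','reviews_rating']].copy()
df_reviews['reviewLength'] = df_reviews['reviews_text'].apply(lambda x: len(x.split()))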

reviews_word_length = df_reviews.groupby(pd.cut(df_reviews.reviewLength, np.arange(0,330,30))).count()
reviews_word_length = reviews_word_length.rename(columns={'reviewLength':'count'})
reviews_word_length = reviews_word_length.reset_index()

reviewLengthChart = sns.barplot(x='reviewLength',y='count',data=reviews_word_length,palette = 'dark')
reviewLengthChart.set_title('Distribution of Reviews by Word Length', fontsize=15)
reviewLengthChart.set_xticklabels(reviewLengthChart.get_xticklabels(), rotation = 45, horizontalalignment = 'right')

f.tight_layout()

In [None]:
# Distribution of Review Lengths by Ratings - Demonstrates How Length of Reviews relates to Ratings
# As most reviews are less than 150 words (shown in plot above), we will consider data with review length < 150 only

df_reviews = df_reviews[df_reviews['reviewLength'] < 150]

f = plt.figure(figsize=(8,5))

# Distribution of Length of Reviews by Rating - Box Plot
reviewLength_vs_Rating = df_reviews[['id','reviewLength','reviews_rating']]
reviewLength_vs_Rating = sns.boxplot(x='reviews_rating', y='reviewLength', data=reviewLength_vs_Rating)
reviewLength_vs_Rating.set_title('Review Length vs Overall Rating', fontsize=15)

f.tight_layout()

**Observation:** Low-rated reviews tend to be longer than high-rated reviews.

In [None]:
# Distribution of number of reviews written by each user

user_reviews_df = master_df[['reviews_username','id']]
user_reviews_df = user_reviews_df.groupby(['reviews_username']).count().reset_index()
user_reviews_df = user_reviews_df.sort_values('id',ascending = False)
user_reviews_df = user_reviews_df.rename(columns={'id':'review count'})
user_reviews_df.head()

# **2. Data Preprocessing**

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk import FreqDist
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
lemmatizer = nltk.stem.WordNetLemmatizer()
wordnet_lemmatizer = WordNetLemmatizer()
import re

In [None]:
#Checking duplicates for username and unique identity number 
duplicates = master_df[master_df.duplicated(subset=["reviews_username","id"])]
duplicates.reviews_username.value_counts()

In [None]:
# Lets look at the user 'byamazon customer' as it shows lot of duplicates

master_df[master_df['reviews_username'] == 'byamazon customer']

**Observation:** User 'byamazon customer' has given multiple reviews of the same product. Since these reviews may be genuine, we will take the average of the ratings given by a user per product.
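
The averaging and de-duplication described above can be sketched on a small toy frame. The data below is made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy data (hypothetical): 'alice' reviewed product p1 twice
toy = pd.DataFrame({
    "id": ["p1", "p1", "p1", "p2"],
    "reviews_username": ["alice", "alice", "bob", "alice"],
    "reviews_rating": [5, 2, 4, 3],
})

# Every row receives the mean rating of its (product, user) group
toy["avg_ratings"] = (
    toy.groupby(["id", "reviews_username"])["reviews_rating"]
       .transform("mean")
       .round(2)
)

# Keep one row per (product, user); the averaged rating is preserved on it
deduped = toy.drop_duplicates(subset=["id", "reviews_username"], keep="first")
print(deduped)
```

Because `transform` writes the group mean onto every row, dropping the later duplicates does not lose the averaged rating.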


In [None]:
# Take Average of Ratings

master_df['avg_ratings'] = master_df.groupby(['id','reviews_username'])['reviews_rating'].transform('mean')
master_df['avg_ratings']= master_df['avg_ratings'].round(2)
master_df[['id','reviews_username','reviews_rating','avg_ratings']]

In [None]:
# We will delete duplicate Reviews for same product ID and User (Reviewer/Shopper)
# Copying final data to another dataframe, which will be used from here on 

df_final =  master_df.drop_duplicates(subset=["reviews_username","id"],keep="first")
df_final.head()

In [None]:
# Lets check if duplicates for User = 'byamazon customer' got deleted

df_final[df_final['reviews_username'] == 'byamazon customer']

Duplicate reviews are removed. The user "byamazon customer" now has only 1 review for one product.
 

In [None]:
#Lets see how much data we lost in removing the duplicates

size_diff = df_final['id'].size/master_df['id'].size

print("%.2f%% reduction in data post duplicate review deletion"%((1-size_diff)*100))
print("Revised size of data = ",df_final['id'].size,"rows ")

In [None]:
df_final.info()

Both the review title and the review text contain text that is useful for sentiment analysis, so we will combine the two columns.


In [None]:
#Combining reviews_title and reviews_text and saving the result as a new column "user_reviews"
#Titles and texts are joined with '. ', which puts a period after each review title
#Missing review titles are replaced with an empty string, and the leading '. ' this produces is stripped

df_final['reviews_title'] = df_final['reviews_title'].fillna('')
df_final['user_reviews'] = df_final[['reviews_title', 'reviews_text']].agg('. '.join, axis=1).str.lstrip('. ')
df_final.head()

We will define some functions for cleaning of review text

In [None]:
#Defining function for removing html tags

def striphtml(data):
    p = re.compile('<.*?>')
    return p.sub('',data)

In [None]:
#Defining function for removing punctuation marks
#Inside a character class '|' is a literal character, not an alternation operator, so it is listed only once

def strippunc(data):
    p = re.compile(r'[?!\'"#.,)(|/~%*]')
    return p.sub('',data)

In [None]:
#Initializing stopwords and SnowballStemmer

stop = stopwords.words('english') #All the stopwords in English language
snow = SnowballStemmer('english')

In [None]:
#Defining function to convert NLTK tags to WordNet tags

# Function: NLTK tags to Wordnet tags

def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

In [None]:
#Defining function to tokenize the sentence, POS-tag each token and lemmatize it

def lemmatize_sentence(sentence):
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged) #tuple of (token, wordnet_tag)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
#If no WordNet tag is available, stem the token with the Snowball stemmer; otherwise lemmatize it using the tag
        if tag is None: 
            lemmatized_sentence.append(snow.stem(word)) 
        else:
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag)) #lemmatize the token
    return " ".join(lemmatized_sentence)

In [None]:
#Defining function to carry out preprocessing

def preprocess_text(text, stem=False): 
#transforming text to lower case 
  text = text.lower()
#calling function to remove HTML Tags            
  text = striphtml(text)
#calling function to remove Punctuation           
  text = strippunc(text)           
  return lemmatize_sentence(text)

In [None]:
#Preprocessing the dataset, creating a 'Review' column which will be used for further analysis

#Copying to new dataframe 
df_main = df_final.copy(deep = True)
#Creating new column - Review
df_main['Review'] = df_main['user_reviews'].map(preprocess_text)
#Removing stop words from the new column - Review
df_main['Review'] = df_main['Review'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
df_main.head()

In [None]:
#Defining function for plotting common Words in given column

def common_wds(column, terms, title_label):
  all_words_column = ' '.join([text for text in column])
  all_words_column = all_words_column.split()

  fr_dist = FreqDist(all_words_column)
  words_df = pd.DataFrame({'word':list(fr_dist.keys()), 'count':list(fr_dist.values())})

  word_rank = words_df.nlargest(columns="count", n = terms)   # Select the `terms` most frequent words
  plt.figure(figsize=(10,5))
  ax = sns.scatterplot(data=word_rank, x= "count", y = "word", color = "darkred")
  ax.set(ylabel = 'common words')
  plt.title(title_label, fontsize = 14)
  plt.grid()
  plt.show()

In [None]:
#Plotting Common Words in Review column ranked upto 20, using the above defined function

common_wds(df_main['Review'],20,'Common Words in Review')

In [None]:
#Defining function for plotting least occurring words in given column

def rare_wds(column, terms, title_label):
  all_words_column = ' '.join([text for text in column])
  all_words_column = all_words_column.split()

  fr_dist = FreqDist(all_words_column)
  words_df = pd.DataFrame({'word':list(fr_dist.keys()), 'count':list(fr_dist.values())})

  # selecting the `terms` least frequent words
  word_rank = words_df.nsmallest(columns="count", n = terms) 
  plt.figure(figsize=(10,5))
  ax = sns.scatterplot(data=word_rank, x= "count", y = "word", color = "darkred")
  ax.set(ylabel = 'rare words')
  plt.title(title_label, fontsize = 14)
  plt.grid()
  plt.show()

In [None]:
#Plotting Rare Words in Review column ranked upto 20, using function defined above

rare_wds(df_main['Review'],20, 'Rare Words in Review')

In [None]:
#Using wordcloud to view the most frequent words in the Review column

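from wordcloud import WordCloud, STOPWORDS
#Using a separate name so that nltk's 'stopwords' corpus imported earlier is not shadowed
wc_stopwords = set(STOPWORDS)
from matplotlib import pyplot as plt

def show_wordcloud(data, title = None):
    wordcloud = WordCloud(
        background_color='black',
        stopwords=wc_stopwords,
        max_words=200,
        max_font_size=40, 
        scale=3,
        random_state=1 # fixed seed so that the layout is reproducible
).generate(' '.join(data.astype(str))) # join all reviews; str(data) would only give the truncated Series repr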

    fig = plt.figure(1, figsize=(15, 15))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()

show_wordcloud(df_main['Review'])

# **3. Feature extraction**

In [None]:
#Importing necessary libraries

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import mean_squared_error, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc

In [None]:
#Keeping only the relevant columns 

df_main=df_main[['Review','reviews_rating','user_sentiment']]
data=df_main
data.head()

In [None]:
#Lets check the new data 

data.info()

No missing values, data looks good

In [None]:
#Saving data for future purpose

import pickle
with open("data.pkl", "wb") as f:
    pickle.dump(data, f)

In [None]:
#Lets do Feature Extraction using TF-IDF vectorization

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2))
tfidf_vectorizer.fit(data['Review'])
X = tfidf_vectorizer.transform(data['Review'])
y = data['user_sentiment']
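
To make the vectorization step more concrete, here is a minimal pure-Python sketch of TF-IDF over unigrams and bigrams. It assumes plain whitespace tokenization and the smoothed IDF formula documented for scikit-learn, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation. The helper names `unigrams_and_bigrams` and `tfidf` are hypothetical, and the real `TfidfVectorizer` tokenizes differently, so the numbers are illustrative only.

```python
import math
from collections import Counter

def unigrams_and_bigrams(text):
    """Return the unigrams and bigrams of a whitespace-tokenized text."""
    tokens = text.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

docs = ["great product", "great price great value", "bad product"]
doc_terms = [Counter(unigrams_and_bigrams(d)) for d in docs]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(term for terms in doc_terms for term in terms)

def tfidf(terms):
    """Weight raw term counts by smoothed IDF, then L2-normalise the vector."""
    weights = {
        t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1)
        for t, tf in terms.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(doc_terms[0])
print(vec)
```

The bigram "great product" occurs in only one document, so it receives a higher weight than the unigrams "great" and "product", which each occur in two.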


In [None]:
#Saving the vocabulary used in tf-idf vectorizer as features

pickle.dump(tfidf_vectorizer.vocabulary_, open("features.pkl","wb"))

In [None]:
#Saving tf-idf vectorizer

pickle.dump(tfidf_vectorizer, open("tfidf.pkl", "wb"))

In [None]:
#Lets split train test data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75)

Firstly, we will check the class imbalance in the data, handle it (if present) and then proceed with the training 

In [None]:
#Checking Class Imbalance 

data.groupby(['user_sentiment']).count()

There is a large difference between the number of positive and negative labels, so class imbalance is present. We will use the SMOTE technique to handle it.

**Handling Class Imbalance**

In [None]:
from collections import Counter
from imblearn import over_sampling
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
counter = Counter(y_train)
print("Before", counter)

#oversampling using SMOTE
smote = SMOTE()
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

counter = Counter(y_train_sm)
print("After", counter)
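
SMOTE balances the classes by creating synthetic minority samples, each interpolated between a minority point and one of its nearest minority neighbours. The following is a minimal NumPy sketch of that idea on dense toy data. The helper `smote_like_sample` is hypothetical and is not the imblearn implementation.

```python
import numpy as np

def smote_like_sample(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (illustrative only)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Euclidean distance from sample i to every minority sample
        dists = np.linalg.norm(minority - minority[i], axis=1)
        # Position 0 of the sorted order is the sample itself, so skip it
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment between i and j
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_sample(minority)
print(new_points)
```

Each synthetic point lies on the segment between two existing minority samples, so the new points stay inside the region the minority class already occupies.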

# **4. Training a text classification model**

We need to build at least three ML models. We then need to analyse the performance of each of these models and choose the best model. At least three out of the following four models need to be built. 
1. Logistic regression
2. Random forest
3. XGBoost
4. Naive Bayes

In [None]:
#Importing libraries

from sklearn.neighbors import NearestNeighbors
from sklearn import neighbors
from scipy.spatial.distance import cosine
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import confusion_matrix,ConfusionMatrixDisplay
from nltk.stem.porter import PorterStemmer
from wordcloud import WordCloud, STOPWORDS
import pickle

**Defining metrics for model evaluation**

We will now define the metrics based on which models will be evaluated. 

We will look at the accuracy of the model, which tells us what fraction of predictions is correct.

From the consumer's point of view, recommending products with negative sentiment will make consumers lose interest in the recommended products. The positive predictive value should therefore be high, so we will look at the precision of the model.

At the same time, failing to recommend products with positive sentiment means lost business. So the sensitivity (recall) of the model should also be high.

Since we care about both precision and sensitivity, the F1-score will also be useful.
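
As a quick worked example of these four metrics, the sketch below computes them from the counts of a binary confusion matrix. The counts are made up for illustration and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tn, fp, fn, tp):
    """Return accuracy, precision, sensitivity and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)       # share of predicted positives that are correct
    sensitivity = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, precision, sensitivity, f1

# Made-up counts: 120 true positives, 50 true negatives, 10 false positives, 20 false negatives
acc, pre, sen, f1 = binary_metrics(tn=50, fp=10, fn=20, tp=120)
print(f"accuracy={acc:.4f} precision={pre:.4f} sensitivity={sen:.4f} f1={f1:.4f}")
```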

In [None]:
#Defining a function for creating confusion matrix and displaying scores
#It will be useful in evaluating all the models

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix

def display_score(classifier):
    y_pred = classifier.predict(X_test)
    #Rows/columns follow the sorted labels, so index 0 = 'Negative' and index 1 = 'Positive'
    cm = confusion_matrix(y_test, y_pred)
    #ConfusionMatrixDisplay.from_estimator (scikit-learn >= 1.0) replaces plot_confusion_matrix, which was removed in 1.2
    ConfusionMatrixDisplay.from_estimator(classifier, X_test, y_test,include_values=True,values_format='g',cmap=plt.cm.Blues) 
    p_acc = float(accuracy_score(y_test, y_pred))  
    p_sen = float(cm[1][1]/sum(cm[1]))                # sensitivity = true positives/(true positives + false negatives)
    p_pre = float(cm[1][1]/((cm[1][1])+(cm[0][1])))   # precision = true positives/(true positives + false positives)
    p_f1s = 2*(p_pre * p_sen)/(p_pre + p_sen)         # F1 = 2*((precision*sensitivity)/(precision+sensitivity))
    print(classifier)
    print('\n')
    print(f"Accuracy is {p_acc:.4f}")
    print(f"Sensitivity is {p_sen:.4f}")
    print(f"Precision is {p_pre:.4f}")
    print(f"F1 Score is {p_f1s:.4f}")
    return p_acc, p_sen, p_pre, p_f1s

**Model 1 - Logistic Regression**

In [None]:
#Lets try out different values of C (the inverse of the regularization strength) and select the best one

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    lr = LogisticRegression(C=c)
    lr.fit(X_train_sm, y_train_sm)
    cm = confusion_matrix(y_test, lr.predict(X_test))
    print('Sensitivity for C = {0} is {1}'.format(c, cm[1][1]/sum(cm[1])))
    print('Specificity for C = {0} is {1}'.format(c, cm[0][0]/sum(cm[0])))

**Observation:** We will take C=0.05, as it gives the best sensitivity and specificity among the values tried.

In [None]:
final_lr = LogisticRegression(C=0.05)
final_lr.fit(X_train_sm, y_train_sm)

In [None]:
df_lr = display_score(final_lr)
df_lr

**Model 2 - Random Forest**

In [None]:
#Fitting a Random Forest classifier without any hyperparameter tuning

from sklearn.ensemble import RandomForestClassifier
import time

from sklearn.model_selection import GridSearchCV
rf = RandomForestClassifier()
rf.fit(X_train_sm, y_train_sm)

In [None]:
df_rf = display_score(rf)
df_rf

In [None]:
#Fitting a Random Forest classifier with various hyperparameters

#parameter grid based on the results of random search 
param_grid = {
    'max_depth': [15, 20],
    'min_samples_leaf': [100,200],
    'min_samples_split': [200,400],
    'n_estimators': [100, 300]
}


final_rf = RandomForestClassifier()

# Instantiate the grid search model
rf_tuned = GridSearchCV(estimator = final_rf, param_grid = param_grid, scoring='roc_auc', cv = 3, n_jobs = -1,verbose = 1)
rf_tuned.fit(X_train_sm, y_train_sm)

In [None]:
#Checking the best hyperparameters

print("Best mean cross-validated AUC-ROC score: ", rf_tuned.best_score_)
print("Best hyperparameters: ", rf_tuned.best_params_)

In [None]:
#Test data performance metrics
df_rft = display_score(rf_tuned)
df_rft

**Model 3 - XGBoost**

In [None]:
#Fitting an XGBoost classifier without any hyperparameter tuning

# importing libraries for XGBoost classifier
import xgboost as xgb
from xgboost import XGBClassifier

final_xgb = XGBClassifier(booster='gbtree')
final_xgb.fit(X_train_sm, y_train_sm)

In [None]:
#Displaying Confusion matrix Scores

df_xgb = display_score(final_xgb)
df_xgb

In [None]:
#Fitting an XGBoost classifier with various custom hyperparameters.

param_grid = {'learning_rate': [0.001, 0.01], 'max_depth':[ 5, 10],  'n_estimators':[1, 3]}

final_xgb = XGBClassifier(booster='gbtree')

# set up GridSearchCV()
xgb_tuned = GridSearchCV(estimator = final_xgb, 
                        param_grid = param_grid, 
                        scoring= 'roc_auc', 
                        cv =3, 
                        verbose = 1,
                        return_train_score=True)

xgb_tuned.fit(X_train_sm, y_train_sm)

In [None]:
#printing best hyperparameters

print("Best mean cross-validated AUC-ROC score: ", xgb_tuned.best_score_)
print("Best hyperparameters: ", xgb_tuned.best_params_)

In [None]:
#Displaying Confusion matrix Scores

df_xgbt = display_score(xgb_tuned)
df_xgbt

**Model 4 - Naive Bayes**

In [None]:
#Fitting Naive Bayes Model

nb=MultinomialNB()
nb.fit(X_train_sm, y_train_sm)

In [None]:
#Test Data Performance Metrics

df_nb = display_score(nb)
df_nb

We are done with model training. Let us now compare the metrics we obtained for each model, based on which we can select our final model for the Sentiment Classification. 

In [None]:
#Displaying metrics in tabular form
#Index: 0=Accuracy, 1=Sensitivity, 2=Precision, 3=F1Score

results = {('LR'):[df_lr[0],df_lr[1],df_lr[2],df_lr[3]],
           ('NB'):[df_nb[0],df_nb[1],df_nb[2],df_nb[3]],
           ('XGB'):[df_xgb[0],df_xgb[1],df_xgb[2],df_xgb[3]],
           ('XGB Tuned'):[df_xgbt[0],df_xgbt[1],df_xgbt[2],df_xgbt[3]],
           ('RF'):[df_rf[0],df_rf[1],df_rf[2],df_rf[3]],
           ('RF Tuned'):[df_rft[0],df_rft[1],df_rft[2],df_rft[3]]
          }
pd.DataFrame(results, index=['Accuracy', 'Sensitivity', 'Precision', 'F1Score'])

**Model selection**

If we look at accuracy, all models except XGB-tuned are comparable.

As discussed when defining the evaluation metrics, both precision and sensitivity need to be high. Naive Bayes and Random Forest (without tuning) seem to be the best options in this case.

As the F1 score is the harmonic mean of precision and recall and gives them equal weight, a high F1 score means both are high. NB and RF (without tuning) seem to have the best F1-scores.

Considering all the evaluation points above, **Naive Bayes** and **Random Forest without tuning** seem to be the best choices.
I am selecting **Naive Bayes as the final model** here, as the size of the pickle file for the untuned Random Forest model may be a problem when uploading to GitHub.

In [None]:
#Saving the final model 

#pickle.dump returns None, so its result is not assigned to a variable
with open('naive_bayes_model.pkl', 'wb') as f:
    pickle.dump(nb, f)

# **5. Building a recommendation system**

We will build the following types of recommendation systems.

1. User-based recommendation system
2. Item-based recommendation system

Then we will analyse the recommendation systems and select the one that is best suited in this case. 

In [None]:
# Importing Libraries

from sklearn.metrics.pairwise import pairwise_distances

In [None]:
#Lets start with reading the original data file, as we did some data cleaning and preprocessing to the already read file for sentiment analysis.

ratings = pd.read_csv("sample30.csv", sep=',')
ratings.head()

In [None]:
#We will keep only the relevant columns, i.e. id, reviews_rating and username

ratings=ratings[['id', 'reviews_rating', 'reviews_username']]

In [None]:
#Checking for the null values

ratings.info()

There are 63 records with missing username. We will drop these records. 

In [None]:
#Dropping missing values
ratings = ratings[~ratings.reviews_username.isna()]
#Renaming the columns for ease to handle
ratings.columns=['productId', 'rating', 'user']
ratings.head()

In [None]:
ratings.info()

In [None]:
#Splitting the data into train and test datasets

train, test = train_test_split(ratings, test_size=0.30, random_state=12)

print(train.shape)
print(test.shape)

In [None]:
#Pivot the train ratings dataset into matrix format in which columns are productId and the rows are username

df_pivot = train.pivot_table(
    index='user',
    columns='productId',
    values='rating'
).fillna(0)

df_pivot.head(3)

**Creating dummy train & dummy test dataset**

These datasets will be used for prediction and evaluation:
- Dummy train will be used later to predict ratings for products that the user has not rated. To ignore products the user has already rated, they are marked as 0; products not rated by the user are marked as 1.

- Dummy test will be used for evaluation. To evaluate, we only make predictions for products the user has rated, so these are marked as 1. This is the opposite of dummy_train.
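
The two masks described above can be sketched with NumPy on a toy matrix. For brevity both masks are derived from the same made-up matrix here, whereas the notebook builds them from the train and test splits respectively. The predicted scores are invented for illustration.

```python
import numpy as np

# Rows are users, columns are products; np.nan means "not rated"
ratings = np.array([[5.0, np.nan, 3.0],
                    [np.nan, 4.0, np.nan]])

rated = ~np.isnan(ratings)
dummy_train_mask = (~rated).astype(int)   # 1 = not rated yet (candidate for recommendation)
dummy_test_mask = rated.astype(int)       # 1 = already rated (usable for evaluation)

# Made-up predicted scores for every user-product pair
predicted = np.array([[4.2, 3.9, 3.1],
                      [2.5, 4.4, 3.7]])

print(predicted * dummy_train_mask)   # scores kept only for unrated products
print(predicted * dummy_test_mask)    # scores kept only for rated products
```

Element-wise multiplication by the mask zeroes out the entries that should be ignored at each stage.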

In [None]:
#Copying the train dataset into dummy_train
dummy_train = train.copy()
dummy_train.head(5)

In [None]:
#Products already rated by the user are marked as 0 here; unrated products get 1 when the pivoted matrix is filled

dummy_train['rating'] = dummy_train['rating'].apply(lambda x: 0 if x>=1 else 1)

In [None]:
#Converting the dummy train dataset into matrix format.

dummy_train = dummy_train.pivot_table(
    index='user',
    columns='productId',
    values='rating'
).fillna(1)


dummy_train.head()

# **User Based Similarity**

**Cosine Similarity**

Cosine similarity quantifies the similarity between two vectors as the cosine of the angle between them. Here the vectors are the users' rating vectors.
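
As a small illustration, cosine similarity between vectors u and v is (u · v) / (‖u‖ ‖v‖). The sketch below computes it with NumPy for three made-up rating vectors, and `cosine_similarity` is a hypothetical helper, not the scikit-learn function. Note that the formula is undefined for an all-zero vector.

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||)"""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up rating vectors over three products (0 = not rated)
user_a = np.array([5.0, 0.0, 3.0])
user_b = np.array([4.0, 0.0, 2.0])
user_c = np.array([0.0, 5.0, 0.0])

sim_ab = cosine_similarity(user_a, user_b)   # similar tastes, close to 1
sim_ac = cosine_similarity(user_a, user_c)   # no products in common, equal to 0
print(sim_ab, sim_ac)
```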

In [None]:
df_pivot.index.nunique()

In [None]:
#User Similarity Matrix via pairwise_distance function

user_correlation = 1 - pairwise_distances(df_pivot, metric='cosine')
user_correlation[np.isnan(user_correlation)] = 0
print(user_correlation)

In [None]:
user_correlation.shape

**Adjusted Cosine**

Adjusted cosine similarity is a modified version of vector-based similarity that accounts for the fact that different users have different rating schemes. In other words, some users tend to rate items highly in general, while others tend to give lower ratings. To correct for this, we subtract each user's average rating from that user's ratings for the different products.

Here we keep the NaN values (instead of filling them with 0), so that the mean is calculated only over the products actually rated by the user.
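
The mean-centring step can be sketched on a toy matrix as follows. The values are made up, and the final fill with 0 mirrors what is done before the similarity is computed.

```python
import numpy as np

# Rows are users, columns are products; np.nan means "not rated"
ratings = np.array([[5.0, 4.0, np.nan],
                    [2.0, np.nan, 1.0]])

# Mean over the rated products only: NaN entries are skipped
user_mean = np.nanmean(ratings, axis=1)

# Subtract each user's mean from that user's row; NaN stays NaN
centred = ratings - user_mean[:, np.newaxis]

# Unrated entries become 0, i.e. "no deviation from the user's mean"
centred_filled = np.nan_to_num(centred, nan=0.0)
print(user_mean)
print(centred_filled)
```

After centring, a generous rater (mean 4.5) and a strict rater (mean 1.5) end up with the same deviation for the first product, which is the point of the adjustment.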

In [None]:
# Create a user-product matrix.

df_pivot = train.pivot_table(
    index='user',
    columns='productId',
    values='rating'
)

In [None]:
#Normalising the product ratings of each user around a mean of 0

mean = np.nanmean(df_pivot, axis=1)
df_subtracted = (df_pivot.T-mean).T

df_subtracted.head()

In [None]:
#Creating the User Similarity Matrix using pairwise_distance function

user_correlation = 1 - pairwise_distances(df_subtracted.fillna(0), metric='cosine')
user_correlation[np.isnan(user_correlation)] = 0

print(user_correlation)

In [None]:
user_correlation.shape

**Prediction - User User**

We make predictions using only the users that are positively correlated with the current user, since we are interested in the users who are most similar to them. Correlation values less than 0 are therefore set to 0.

In [None]:
user_correlation[user_correlation<0]=0
user_correlation

The predicted rating of a user for a product (whether or not the user has rated it) is the sum of the users' ratings for that product, weighted by their correlation with the user.
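
A toy version of this weighted sum, using made-up similarity and rating matrices:

```python
import numpy as np

# similarity[i, j]: similarity between user i and user j (negatives already set to 0)
similarity = np.array([[1.0, 0.5],
                       [0.5, 1.0]])

# ratings[j, p]: rating of user j for product p, with 0 for "not rated"
ratings = np.array([[4.0, 0.0],
                    [0.0, 2.0]])

# scores[i, p] = sum over j of similarity[i, j] * ratings[j, p]
scores = similarity @ ratings
print(scores)
```

The result is not divided by the sum of the similarities, so the values act as ranking scores rather than ratings on the 1 to 5 scale.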

In [None]:
user_predicted_ratings = np.dot(user_correlation, df_pivot.fillna(0))
user_predicted_ratings

In [None]:
user_predicted_ratings.shape

Since we are interested only in the products not rated by the user, we will ignore the products rated by the user by setting it to zero.

In [None]:
user_final_rating = np.multiply(user_predicted_ratings,dummy_train)
user_final_rating.head()

**Finding the top 20 recommendations for the user**

In [None]:
#Lets take a sample username from the given dataset as input

user_input='joshua'
print(user_input)

In [None]:
#Top 20 recommendations
d = user_final_rating.loc[user_input].sort_values(ascending=False)[0:20]
d

**Evaluation - User User**

Evaluation follows the same steps as prediction. The only difference is that we evaluate on the products already rated by the user instead of predicting for the products not rated by the user.

In [None]:
# Find out the common users of test and train dataset.
common = test[test.user.isin(train.user)]
common.shape

In [None]:
common.head()

In [None]:
#Converting into the user-product matrix

common_user_based_matrix = common.pivot_table(index='user', columns='productId', values='rating')

In [None]:
#Converting the user_correlation matrix into dataframe

user_correlation_df = pd.DataFrame(user_correlation)
user_correlation_df.head(2)

In [None]:
user_correlation_df['user'] = df_subtracted.index
user_correlation_df.set_index('user',inplace=True)
user_correlation_df.head(2)

In [None]:
list_name = common.user.tolist()

user_correlation_df.columns = df_subtracted.index.tolist()

user_correlation_df_1 =  user_correlation_df[user_correlation_df.index.isin(list_name)]

In [None]:
user_correlation_df_1.shape

In [None]:
#Transposing df_1 and keeping only the rows for the common users, so both axes are restricted to them
user_correlation_df_2 = user_correlation_df_1.T[user_correlation_df_1.T.index.isin(list_name)]

In [None]:
#Taking transpose of df_2 

user_correlation_df_3 = user_correlation_df_2.T
user_correlation_df_3.head(2)

In [None]:
user_correlation_df_3.shape

In [None]:
#Taking users which are positively correlated with other users

user_correlation_df_3[user_correlation_df_3<0]=0

common_user_predicted_ratings = np.dot(user_correlation_df_3, common_user_based_matrix.fillna(0))
common_user_predicted_ratings

In [None]:
#Creating dummy copy to mark the products which are already rated by the user as 1 

dummy_test = common.copy()

dummy_test['rating'] = dummy_test['rating'].apply(lambda x: 1 if x>=1 else 0)

dummy_test = dummy_test.pivot_table(index='user', columns='productId', values='rating').fillna(0)

In [None]:
dummy_test.shape

In [None]:
#Multiplying predicted_ratings df with dummy_test so we are left with ratings of the products which are already rated by the user, 
#and others will be set to 0

common_user_predicted_ratings = np.multiply(common_user_predicted_ratings,dummy_test)
common_user_predicted_ratings.head(2)

We calculate the RMSE only for the products rated by the user. Before computing it, the predicted ratings are normalised to the (1, 5) range.
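
A minimal sketch of an RMSE restricted to rated entries, using made-up matrices. The predictions are assumed to be on the 1 to 5 scale already, so the rescaling step is left out.

```python
import numpy as np

# np.nan marks user-product pairs without a rating
actual = np.array([[5.0, np.nan],
                   [np.nan, 3.0],
                   [1.0, 4.0]])
predicted = np.array([[4.0, np.nan],
                      [np.nan, 3.0],
                      [2.0, 2.0]])

squared_error = (actual - predicted) ** 2

# Count and sum only the entries where a rating exists
n_rated = np.count_nonzero(~np.isnan(squared_error))
rmse = np.sqrt(np.nansum(squared_error) / n_rated)
print(rmse)
```

Dividing by the number of rated entries, rather than by the full matrix size, keeps the unrated cells from diluting the error.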

In [None]:
from sklearn.preprocessing import MinMaxScaler
from numpy import *

#Making a copy of common_user_predicted_ratings and normalizing the ratings to the (1, 5) range

X  = common_user_predicted_ratings.copy() 
#Non-positive entries become NaN, so the scaler ignores them
X = X[X>0]

scaler = MinMaxScaler(feature_range=(1, 5))
print(scaler.fit(X))
y = (scaler.transform(X))

print(y)

In [None]:
common_ = common.pivot_table(index='user', columns='productId', values='rating')

In [None]:
#Finding total non-NaN value
total_non_nan = np.count_nonzero(~np.isnan(y))

In [None]:
#Calculating and printing rmse for evaluation
#np.nansum skips the NaN cells, i.e. the products the user has not rated

rmse = (np.nansum(((common_ - y)**2).to_numpy())/total_non_nan)**0.5
print(rmse)

Note the RMSE here. We will compare it with the RMSE of the item-based similarity recommendation system.

# **Item Based Similarity**

Here we take the transpose of the rating matrix so that the ratings are normalised around the mean of each product ID. In the user-based similarity, we took the mean for each user rather than for each product.
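
To make the per-product centring concrete, here is a small sketch on a made-up rating matrix. It subtracts each product's mean rating and then computes the cosine similarity between products, keeping only positive values. The data and variable names are illustrative, and plain NumPy is used here in place of `pairwise_distances`.

```python
import numpy as np
import pandas as pd

# Made-up ratings: rows are users, columns are products, NaN means "not rated"
ratings = pd.DataFrame(
    {"p1": [5.0, 3.0, np.nan], "p2": [4.0, np.nan, 2.0], "p3": [np.nan, 1.0, 5.0]},
    index=["u1", "u2", "u3"],
)

item_matrix = ratings.T                        # products x users
item_means = item_matrix.mean(axis=1)          # mean rating per product, NaN skipped
centred = item_matrix.sub(item_means, axis=0)  # adjusted (mean-centred) ratings

# Cosine similarity between product rows, treating missing ratings as 0
filled = centred.fillna(0).to_numpy()
norms = np.linalg.norm(filled, axis=1, keepdims=True)
norms[norms == 0] = 1.0                        # guard against all-zero rows
unit = filled / norms
item_similarity = pd.DataFrame(
    unit @ unit.T, index=item_matrix.index, columns=item_matrix.index
).clip(lower=0)                                # keep positive similarities only

print(item_similarity.round(2))
```

In this toy data `p2` and `p3` are rated in opposite directions by the one user they share, so their similarity is negative before clipping and ends up as 0.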

In [None]:
df_pivot = train.pivot_table(
    index='user',
    columns='productId',
    values='rating'
).T

df_pivot.head(2)

In [None]:
#Normalising the ratings of each product for use with the Adjusted Cosine similarity

mean = np.nanmean(df_pivot, axis=1)
df_subtracted = (df_pivot.T-mean).T

df_subtracted.head(2)

In [None]:
#Finding the Cosine Similarity using pairwise distances approach

# Item Similarity Matrix
item_correlation = 1 - pairwise_distances(df_subtracted.fillna(0), metric='cosine')
item_correlation[np.isnan(item_correlation)] = 0
print(item_correlation)
print(item_correlation.shape)

In [None]:
#Keeping only positive correlations - negative values are set to 0

item_correlation[item_correlation<0]=0
item_correlation

**Prediction - Item Item**

In [None]:
#Predicting the rating based on item similarity

item_predicted_ratings = np.dot((df_pivot.fillna(0).T),item_correlation)
item_predicted_ratings

In [None]:
item_predicted_ratings.shape

In [None]:
dummy_train.shape

In [None]:
#Filtering the rating only for the products not rated by the user for recommendation, by multiplying with dummy_train

item_final_rating = np.multiply(item_predicted_ratings,dummy_train)
item_final_rating.head()

In [None]:
# Take a random user ID from dataset as input

user_input='zipperdoo'
print(user_input)

In [None]:
#Recommending the Top 20 products to the user.
d = item_final_rating.loc[user_input].sort_values(ascending=False)[0:20]
d

**Evaluation - Item Item**

Evaluation follows the same steps as prediction. The only difference is that we evaluate on the products already rated by the user, instead of predicting for the products the user has not rated.
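
The contrast between the two masks can be sketched on a toy example. The names `rated_mask`, `unrated_mask`, `for_recommendation` and `for_evaluation` are hypothetical and the numbers are made up. The point is only that recommendation keeps scores for unrated products, while evaluation keeps scores for rated ones.

```python
import numpy as np
import pandas as pd

# Made-up ratings: NaN means the user has not rated the product
ratings = pd.DataFrame(
    [[5.0, np.nan, 3.0], [np.nan, 4.0, np.nan]],
    index=["u1", "u2"], columns=["p1", "p2", "p3"],
)
# Made-up raw prediction scores for every user-product pair
predicted = pd.DataFrame(
    [[2.1, 0.8, 1.4], [0.6, 1.9, 1.1]],
    index=["u1", "u2"], columns=["p1", "p2", "p3"],
)

rated_mask = ratings.notna().astype(int)  # 1 where the user rated the product
unrated_mask = 1 - rated_mask             # 1 where the user did not rate it

for_recommendation = predicted * unrated_mask  # scores for unseen products only
for_evaluation = predicted * rated_mask        # scores for rated products only

print(for_recommendation)
print(for_evaluation)
```

The two results are complementary: every score survives in exactly one of them.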

In [None]:
# Find the products common to the test and train datasets.
common =  test[test.productId.isin(train.productId)]
common.shape

In [None]:
common.head()

In [None]:
#Converting into the user-product matrix, taking transpose

common_item_based_matrix = common.pivot_table(index='user', columns='productId', values='rating').T
common_item_based_matrix.shape

In [None]:
#Converting the item_correlation matrix into dataframe

item_correlation_df = pd.DataFrame(item_correlation)
item_correlation_df.head(2)

In [None]:
item_correlation_df['productId'] = df_subtracted.index
item_correlation_df.set_index('productId',inplace=True)
item_correlation_df.head(2)

In [None]:
list_name = common.productId.tolist()

item_correlation_df.columns = df_subtracted.index.tolist()

item_correlation_df_1 =  item_correlation_df[item_correlation_df.index.isin(list_name)]

In [None]:
item_correlation_df_1.shape

In [None]:
item_correlation_df_2 = item_correlation_df_1.T[item_correlation_df_1.T.index.isin(list_name)]

item_correlation_df_3 = item_correlation_df_2.T

In [None]:
item_correlation_df_3.head(2)

In [None]:
item_correlation_df_3.shape

In [None]:
#Keeping only positive correlations between items (negative values are set to 0)

item_correlation_df_3[item_correlation_df_3<0]=0

common_item_predicted_ratings = np.dot(item_correlation_df_3, common_item_based_matrix.fillna(0))
common_item_predicted_ratings

In [None]:
common_item_predicted_ratings.shape

In [None]:
#Creating dummy copy to mark the products which are already rated by the user as 1 

dummy_test = common.copy()

dummy_test['rating'] = dummy_test['rating'].apply(lambda x: 1 if x>=1 else 0)

dummy_test = dummy_test.pivot_table(index='user', columns='productId', values='rating').T.fillna(0)


In [None]:
dummy_test.shape

In [None]:
#Multiplying predicted_ratings df with dummy_test so we are left with ratings of the products which are already rated by the user, 
#and others will be set to 0

common_item_predicted_ratings = np.multiply(common_item_predicted_ratings,dummy_test)
common_item_predicted_ratings.head(2)

Calculating the RMSE only for the products the user has rated. Before computing the RMSE, the predicted ratings are normalised to the (1, 5) range.
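
One detail worth keeping in mind for this normalisation step is that scikit-learn's `MinMaxScaler` rescales each column independently and leaves NaN entries as NaN. The sketch below imitates that behaviour with plain NumPy on made-up scores and contrasts it with a single global rescaling. The helper names `scale_per_column` and `scale_globally` are hypothetical.

```python
import numpy as np

# Made-up prediction scores; NaN marks a cell that was filtered out
scores = np.array([[0.2, 4.0],
                   [0.6, np.nan],
                   [1.0, 8.0]])


def scale_per_column(x, lo=1.0, hi=5.0):
    """Rescale every column to [lo, hi] on its own, ignoring NaN."""
    col_min = np.nanmin(x, axis=0)
    col_max = np.nanmax(x, axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return lo + (hi - lo) * (x - col_min) / span


def scale_globally(x, lo=1.0, hi=5.0):
    """Rescale the whole matrix to [lo, hi] with one shared min and max."""
    x_min, x_max = np.nanmin(x), np.nanmax(x)
    span = (x_max - x_min) or 1.0
    return lo + (hi - lo) * (x - x_min) / span


per_column = scale_per_column(scores)
global_scaled = scale_globally(scores)

print(per_column)
print(global_scaled)
```

With per-column scaling, the largest score in every column maps to 5 regardless of how it compares with other columns. With global scaling, only the largest score in the whole matrix does. Which behaviour is appropriate depends on whether the columns are meant to be comparable.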

In [None]:
from sklearn.preprocessing import MinMaxScaler
from numpy import *

#Making a copy of common_item_predicted_ratings and normalizing the rating to (1,5) range

X  = common_item_predicted_ratings.copy() 
X = X[X>0]

scaler = MinMaxScaler(feature_range=(1, 5))
print(scaler.fit(X))
y = (scaler.transform(X))

print(y)

In [None]:
#Finding total non-NaN value
total_non_nan = np.count_nonzero(~np.isnan(y))

In [None]:
common_ = common.pivot_table(index='user', columns='productId', values='rating').T

In [None]:
#Calculating and printing rmse for evaluation
#np.nansum skips the NaN cells, i.e. the products the user has not rated

rmse = (np.nansum(((common_ - y)**2).to_numpy())/total_non_nan)**0.5
print(rmse)

**Selecting the recommendation system**

RMSE for the user-similarity-based system = 2.589725958923943

RMSE for the item-similarity-based system = 3.5462471410112615

Based on the RMSE values, we select the user-similarity-based recommendation system, as its RMSE is smaller.
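
The selected rating matrix is what gets persisted for later use. As a rough sketch of how such a pickled matrix might be reloaded and queried, the example below round-trips a made-up `user_final_rating` through a temporary file and returns the highest-scoring products for one user. The helper `top_n`, the user names and the scores are all hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd

# Made-up stand-in for the real user_final_rating matrix (users x products)
user_final_rating = pd.DataFrame(
    [[0.0, 2.4, 1.1, 3.0], [1.7, 0.0, 0.9, 0.0]],
    index=["alice", "bob"],
    columns=["p1", "p2", "p3", "p4"],
)


def top_n(ratings: pd.DataFrame, user: str, n: int = 20) -> pd.Series:
    """Return the n highest predicted ratings for a known user."""
    if user not in ratings.index:
        raise KeyError(f"unknown user: {user}")
    return ratings.loc[user].sort_values(ascending=False).head(n)


# Save and reload through a temporary file so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "user_final_rating.pkl")
    with open(path, "wb") as fh:
        pickle.dump(user_final_rating.astype("float32"), fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

recommendations = top_n(restored, "alice", n=2)
print(recommendations)
```

Checking that the user exists before indexing avoids an opaque lookup failure when the input name is not in the training data.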

In [None]:
# Saving the selected user-based rating matrix
import pickle

with open('user_final_rating.pkl', 'wb') as f:
    pickle.dump(user_final_rating.astype('float32'), f)

-- End of the notebook -- 