# Business Understanding

## Problem Statement

You are working as a Machine Learning Engineer at an e-commerce company named 'Ebuss', and you are required to build a model that improves the recommendations shown to users based on their past reviews and ratings.

To do this, you need to build a sentiment-based product recommendation system using the following steps:

1. Data sourcing and sentiment analysis

2. Building a recommendation system

3. Improving the recommendations using the sentiment analysis model

4. Deploying the end-to-end project with a user interface
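
As a rough illustration of step 3, the sketch below re-ranks a candidate list from a recommender by the share of reviews that a sentiment model labels positive. The function name `rerank_by_sentiment`, the product names, and the labels are hypothetical placeholders, not data or code from this project.

```python
from collections import defaultdict

def rerank_by_sentiment(candidates, review_sentiments, top_n=5):
    """Order candidate products by their share of positive reviews.

    candidates: product names proposed by a recommender (hypothetical input).
    review_sentiments: (product, label) pairs, label 'Positive' or 'Negative'.
    """
    positive = defaultdict(int)
    total = defaultdict(int)
    for product, label in review_sentiments:
        total[product] += 1
        if label == 'Positive':
            positive[product] += 1

    def positive_share(product):
        # Products without any reviews get a share of 0.0
        return positive[product] / total[product] if total[product] else 0.0

    # sorted() is stable, so ties keep the recommender's original order
    return sorted(candidates, key=positive_share, reverse=True)[:top_n]

# Made-up example data
candidates = ['product_a', 'product_b', 'product_c']
reviews = [
    ('product_a', 'Positive'), ('product_a', 'Negative'),
    ('product_b', 'Positive'), ('product_b', 'Positive'),
    ('product_c', 'Negative'),
]
top = rerank_by_sentiment(candidates, reviews, top_n=2)
print(top)
```

Ranking by positive share is only one possible design. A real system might also weight the share by the number of reviews so that a product with a single positive review does not outrank well-reviewed ones.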

## End Goals 

An end-to-end Jupyter Notebook containing the entire code of the recommendation system, including the following:

* Data cleaning steps
* Text preprocessing
* Feature extraction
* Three ML models built for sentiment analysis
* Two recommendation systems and their evaluations



# Data Understanding

In [15]:
import re
import string
import sys
from collections import Counter

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize

In [16]:
df = pd.read_csv('input/sample30.csv')

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   id                    30000 non-null  object
 1   brand                 30000 non-null  object
 2   categories            30000 non-null  object
 3   manufacturer          29859 non-null  object
 4   name                  30000 non-null  object
 5   reviews_date          29954 non-null  object
 6   reviews_didPurchase   15932 non-null  object
 7   reviews_doRecommend   27430 non-null  object
 8   reviews_rating        30000 non-null  int64 
 9   reviews_text          30000 non-null  object
 10  reviews_title         29810 non-null  object
 11  reviews_userCity      1929 non-null   object
 12  reviews_userProvince  170 non-null    object
 13  reviews_username      29937 non-null  object
 14  user_sentiment        29999 non-null  object
dtypes: int64(1), object(14)
memory usage

In [18]:
df['user_sentiment'].value_counts()

Positive    26632
Negative     3367
Name: user_sentiment, dtype: int64
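
The counts above show a strong class imbalance, which matters when the sentiment models are trained and evaluated. As a minimal sketch (not part of the original notebook), the snippet below computes each class's share and "balanced" class weights of the form `n_samples / (n_classes * class_count)` from the counts printed above. The variable names are illustrative.

```python
# Counts copied from the value_counts() output above
counts = {'Positive': 26632, 'Negative': 3367}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of each class among the labeled rows
shares = {label: count / n_samples for label, count in counts.items()}

# "Balanced" weights: rarer classes get proportionally larger weights
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label in counts:
    print(f"{label}: share={shares[label]:.3f}, weight={weights[label]:.3f}")
```

With shares this skewed, plain accuracy can look high for a model that mostly predicts the majority class, so per-class metrics are worth checking alongside it.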

In [19]:
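# One row has no sentiment label; fill it with the majority class ('Positive')
df['user_sentiment'] = df['user_sentiment'].fillna('Positive')
# Missing titles become a blank string so they can be concatenated with the review text
df['reviews_title'] = df['reviews_title'].fillna(' ')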

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   id                    30000 non-null  object
 1   brand                 30000 non-null  object
 2   categories            30000 non-null  object
 3   manufacturer          29859 non-null  object
 4   name                  30000 non-null  object
 5   reviews_date          29954 non-null  object
 6   reviews_didPurchase   15932 non-null  object
 7   reviews_doRecommend   27430 non-null  object
 8   reviews_rating        30000 non-null  int64 
 9   reviews_text          30000 non-null  object
 10  reviews_title         30000 non-null  object
 11  reviews_userCity      1929 non-null   object
 12  reviews_userProvince  170 non-null    object
 13  reviews_username      29937 non-null  object
 14  user_sentiment        30000 non-null  object
dtypes: int64(1), object(14)
memory usage

# Data Preparation

In [120]:
df_master = df[['user_sentiment']].copy()
df_master['merged'] = df['reviews_title'] + " " + df['reviews_text']
df_master.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   user_sentiment  30000 non-null  object
 1   merged          30000 non-null  object
dtypes: object(2)
memory usage: 468.9+ KB


In [121]:
df_master.head()

Unnamed: 0,user_sentiment,merged
0,Positive,Just Awesome i love this album. it's very good...
1,Positive,Good Good flavor. This review was collected as...
2,Positive,Good Good flavor.
3,Negative,Disappointed I read through the reviews on her...
4,Negative,Irritation My husband bought this gel for us. ...


In [122]:
# Lowercase the text and remove punctuation/special characters
df_master['merged'] = df_master['merged'].str.lower()
df_master['merged'] = df_master['merged'].str.translate(str.maketrans('', '', string.punctuation))

In [123]:
df_master.head()

Unnamed: 0,user_sentiment,merged
0,Positive,just awesome i love this album its very good m...
1,Positive,good good flavor this review was collected as ...
2,Positive,good good flavor
3,Negative,disappointed i read through the reviews on her...
4,Negative,irritation my husband bought this gel for us t...
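
Deleting punctuation outright, as the cell above does, also fuses text that was separated only by punctuation. The sketch below contrasts deletion with replacing punctuation by a space. The sample string and the helper names are illustrative, and the notebook itself uses the deletion approach.

```python
import string

# Translation table that deletes punctuation (same effect as the cell above)
DELETE_PUNCT = str.maketrans('', '', string.punctuation)
# Alternative table that maps each punctuation character to a space
SPACE_PUNCT = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def strip_punctuation(text):
    """Lowercase the text and delete every punctuation character."""
    return text.lower().translate(DELETE_PUNCT)

def space_punctuation(text):
    """Lowercase the text and replace punctuation with single spaces."""
    # split()/join() collapses the extra whitespace introduced by the substitution
    return ' '.join(text.lower().translate(SPACE_PUNCT).split())

sample = "Great price...fast shipping, it's perfect!"
deleted = strip_punctuation(sample)
spaced = space_punctuation(sample)
print(deleted)  # great pricefast shipping its perfect
print(spaced)   # great price fast shipping it s perfect
```

Neither variant is strictly better: deletion keeps contractions as one token (`its`) but can glue words together (`pricefast`), while substitution keeps words apart but splits contractions.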


In [124]:
# Tokenize (word_tokenize requires NLTK's Punkt tokenizer data to be downloaded)
df_master['merged'] = df_master['merged'].apply(word_tokenize)
df_master.head()

Unnamed: 0,user_sentiment,merged
0,Positive,"[just, awesome, i, love, this, album, its, ver..."
1,Positive,"[good, good, flavor, this, review, was, collec..."
2,Positive,"[good, good, flavor]"
3,Negative,"[disappointed, i, read, through, the, reviews,..."
4,Negative,"[irritation, my, husband, bought, this, gel, f..."


In [125]:
# Remove the stop words
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes each membership test O(1)

df_master['merged'] = df_master['merged'].apply(lambda x: [i for i in x if i not in stop])
df_master.head()

[nltk_data] Downloading package stopwords to C:\Users\Octillion
[nltk_data]     0017\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,user_sentiment,merged
0,Positive,"[awesome, love, album, good, hip, hop, side, c..."
1,Positive,"[good, good, flavor, review, collected, part, ..."
2,Positive,"[good, good, flavor]"
3,Negative,"[disappointed, read, reviews, looking, buying,..."
4,Negative,"[irritation, husband, bought, gel, us, gel, ca..."
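
One caveat for sentiment work: NLTK's English stop-word list contains negations such as "not" and "no", so filtering with the full list can drop words that flip a review's polarity. The sketch below uses a small hand-written stand-in for the stop list (an assumption, so that it runs without NLTK data) and keeps a hypothetical set of negation words.

```python
# Tiny stand-in for stopwords.words('english'); the real list is much longer
STOP_WORDS = {'i', 'this', 'is', 'the', 'it', 'a', 'not', 'no', 'was'}
# Hypothetical set of negations to keep for sentiment analysis
NEGATIONS = {'not', 'no', 'nor', 'never'}

def remove_stop_words(tokens, keep=frozenset()):
    """Drop stop words from a token list, except those listed in `keep`."""
    effective_stop = STOP_WORDS - set(keep)
    return [t for t in tokens if t not in effective_stop]

tokens = ['this', 'is', 'not', 'a', 'good', 'product']
without_negation = remove_stop_words(tokens)
with_negation = remove_stop_words(tokens, keep=NEGATIONS)
print(without_negation)  # ['good', 'product']
print(with_negation)     # ['not', 'good', 'product']
```

Whether keeping negations helps depends on the features used later: with unigram features the effect may be small, while n-gram features can capture pairs such as "not good".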


### Lemmatization & Stemming

In [126]:
df_master['lem_stem'] = df_master['merged'].apply(lambda x: " ".join(x))
df_master.head()

Unnamed: 0,user_sentiment,merged,lem_stem
0,Positive,"[awesome, love, album, good, hip, hop, side, c...",awesome love album good hip hop side current p...
1,Positive,"[good, good, flavor, review, collected, part, ...",good good flavor review collected part promotion
2,Positive,"[good, good, flavor]",good good flavor
3,Negative,"[disappointed, read, reviews, looking, buying,...",disappointed read reviews looking buying one c...
4,Negative,"[irritation, husband, bought, gel, us, gel, ca...",irritation husband bought gel us gel caused ir...


In [127]:
nltk.download('wordnet')

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]


[nltk_data] Downloading package wordnet to C:\Users\Octillion
[nltk_data]     0017\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [128]:
df_master['lem_stem'] = df_master.lem_stem.apply(lemmatize_text)

In [129]:
df_master.head()

Unnamed: 0,user_sentiment,merged,lem_stem
0,Positive,"[awesome, love, album, good, hip, hop, side, c...","[awesome, love, album, good, hip, hop, side, c..."
1,Positive,"[good, good, flavor, review, collected, part, ...","[good, good, flavor, review, collected, part, ..."
2,Positive,"[good, good, flavor]","[good, good, flavor]"
3,Negative,"[disappointed, read, reviews, looking, buying,...","[disappointed, read, review, looking, buying, ..."
4,Negative,"[irritation, husband, bought, gel, us, gel, ca...","[irritation, husband, bought, gel, u, gel, cau..."


In [130]:
#df1 = pd.DataFrame(['this was cheesy', 'she likes these books', 'wow this is great'], columns=['text'])
#df1['text_lemmatized'] = df1.text.apply(lemmatize_text)
#df1

In [131]:
#from nltk.stem.snowball import SnowballStemmer
#stemmer = SnowballStemmer("english")
#print(stemmer.stem("Blessing"))
#print(stemmer.stem("reached"))

In [132]:
df_master['lem_stem'] = df_master['lem_stem'].apply(lambda x: " ".join(x))
df_master['lem_stem'].head()

0    awesome love album good hip hop side current p...
1     good good flavor review collected part promotion
2                                     good good flavor
3    disappointed read review looking buying one co...
4    irritation husband bought gel u gel caused irr...
Name: lem_stem, dtype: object

In [133]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

def stemming_text(text):
    return [stemmer.stem(w) for w in w_tokenizer.tokenize(text)]

In [134]:
df_master['lem_stem'] = df_master.lem_stem.apply(stemming_text)
df_master.head()

Unnamed: 0,user_sentiment,merged,lem_stem
0,Positive,"[awesome, love, album, good, hip, hop, side, c...","[awesom, love, album, good, hip, hop, side, cu..."
1,Positive,"[good, good, flavor, review, collected, part, ...","[good, good, flavor, review, collect, part, pr..."
2,Positive,"[good, good, flavor]","[good, good, flavor]"
3,Negative,"[disappointed, read, reviews, looking, buying,...","[disappoint, read, review, look, buy, one, cou..."
4,Negative,"[irritation, husband, bought, gel, us, gel, ca...","[irrit, husband, bought, gel, u, gel, caus, ir..."


### Spell Checker

In [135]:
from spellchecker import SpellChecker
spell = SpellChecker()
#spell.correction('awesom')

In [136]:
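def correct_spellings(text):
    """Replace words unknown to the spell checker with their most likely correction."""
    words = text.split()
    misspelled_words = spell.unknown(words)
    corrected_text = []
    for word in words:
        if word in misspelled_words:
            # correction() can return None when no candidate is found; keep the word then
            corrected_text.append(spell.correction(word) or word)
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)

correct_spellings("speling correctin")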

'spelling correcting'

In [137]:
df_master['lem_stem'] = df_master['lem_stem'].apply(lambda x: " ".join(x))
df_master.head()

Unnamed: 0,user_sentiment,merged,lem_stem
0,Positive,"[awesome, love, album, good, hip, hop, side, c...",awesom love album good hip hop side current po...
1,Positive,"[good, good, flavor, review, collected, part, ...",good good flavor review collect part promot
2,Positive,"[good, good, flavor]",good good flavor
3,Negative,"[disappointed, read, reviews, looking, buying,...",disappoint read review look buy one coupl lubr...
4,Negative,"[irritation, husband, bought, gel, us, gel, ca...",irrit husband bought gel u gel caus irrit felt...


In [138]:
df_master['spell_checked'] = df_master.lem_stem.apply(correct_spellings)
df_master.head()

Unnamed: 0,user_sentiment,merged,lem_stem,spell_checked
0,Positive,"[awesome, love, album, good, hip, hop, side, c...",awesom love album good hip hop side current po...,awesome love album good hip hop side current p...
1,Positive,"[good, good, flavor, review, collected, part, ...",good good flavor review collect part promot,good good flavor review collect part promote
2,Positive,"[good, good, flavor]",good good flavor,good good flavor
3,Negative,"[disappointed, read, reviews, looking, buying,...",disappoint read review look buy one coupl lubr...,disappoint read review look buy one coupl rubr...
4,Negative,"[irritation, husband, bought, gel, us, gel, ca...",irrit husband bought gel u gel caus irrit felt...,ifrit husband bought gel u gel caus ifrit felt...

Note that the spell checker runs on stemmed text here, so it treats stems as misspellings and can replace them with unrelated words: in the output above, `irrit` (from "irritation") becomes `ifrit`. Applying spell correction before stemming avoids this.


In [140]:
#df_master.to_csv('clean_data.csv', index=False)  # index=False avoids writing an extra 'Unnamed: 0' column

### Other

Other text patterns to consider cleaning:

* Hashtags
* Email addresses
* Links
* Numbers
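
The items listed above can be handled with regular expressions. The sketch below is one possible approach, not code from the original notebook: the patterns are deliberately simplified, and the names `remove_noise`, `URL_RE`, `EMAIL_RE`, `HASHTAG_RE`, and `NUMBER_RE` are hypothetical.

```python
import re

# Simplified illustrative patterns; real-world URLs and emails vary more widely
URL_RE = re.compile(r'https?://\S+|www\.\S+')
EMAIL_RE = re.compile(r'\S+@\S+\.\S+')
HASHTAG_RE = re.compile(r'#\w+')
NUMBER_RE = re.compile(r'\d+')

def remove_noise(text):
    """Strip links, email addresses, hashtags and numbers, then tidy whitespace."""
    # URLs and emails go first, before their digits or '#' fragments are touched
    for pattern in (URL_RE, EMAIL_RE, HASHTAG_RE, NUMBER_RE):
        text = pattern.sub(' ', text)
    return ' '.join(text.split())

sample = "Loved it #bestbuy see https://example.com or mail me@example.com 10 out of 10"
cleaned = remove_noise(sample)
print(cleaned)  # Loved it see or mail out of
```

A step like this would need to run before punctuation is stripped, because the `#`, `@`, and `://` markers that the patterns rely on are themselves punctuation.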

# Data Modeling

In [None]:
df_clean = pd.read_csv('clean_data.csv')
df_clean = df_clean.drop(columns=['merged', 'lem_stem'])

In [None]:
df_clean.head()