# In this Jupyter notebook we perform Sentiment Analysis for the Amazon Software dataset using Bag-of-Words. 

The Amazon Software dataset is taken from this link: https://nijianmo.github.io/amazon/index.html 

We use BOW (Bag-of-Words) to perform Sentiment Analysis. 
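As a minimal sketch of what a bag-of-words representation looks like, the example below vectorizes two toy sentences (hypothetical, not taken from the Amazon dataset). It also shows the TF-IDF variant, since the pipelines later in this notebook use `TfidfVectorizer`, which can be seen as a re-weighted bag-of-words.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two toy documents (hypothetical, not taken from the Amazon dataset)
docs = ["good software good price", "bad software"]

# Plain bag-of-words: one column per vocabulary term, raw term counts as values
count_vect = CountVectorizer()
bow = count_vect.fit_transform(docs).toarray()

# Vocabulary terms ordered by their column index
vocab = sorted(count_vect.vocabulary_, key=count_vect.vocabulary_.get)
print(vocab)
print(bow)

# TF-IDF re-weights the same columns and L2-normalizes each row by default
tfidf = TfidfVectorizer().fit_transform(docs).toarray()
row_norms = np.linalg.norm(tfidf, axis=1)
print(row_norms)
```

Word order is discarded in both representations; only the presence and frequency of each term is kept.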

We use two sources of data: 

(1) one gzip-compressed file (i.e., "Software.json.gz") contains the main dataset, i.e., the reviews of the clients;

(2) another gzip-compressed file (i.e., "meta_Software.json.gz") contains the metadata: the "title" of the product (i.e., the name of the product), the brand of the product and the main category of the product. 

The two datasets are merged together using the product ID which is included in both datasets. 

Description of the main variables in the main dataset (i.e., "Software.json.gz"): 

- reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- asin - ID of the product, e.g. 0000013714
- reviewerName - name of the reviewer
- summary - summary of the review
- reviewText - text of the review
- overall - rating of the product
- unixReviewTime - time of the review (unix time)
- reviewTime - time of the review (raw)

Description of the main variables in the metadata dataset (i.e., "meta_Software.json.gz"): 

- asin - ID of the product, e.g. 0000013714
- title - name of the product
- brand - brand name
- main_cat - main category of the product (e.g., "Software"; "All Electronics")

These sections are included in this Jupyter notebook: 
- In Section 1 we prepare the two datasets and we merge them using the product ID.  
- In Section 2 we create the binary rating variable. The binary rating variable is used as the target of the BOW (bag-of-words) model in the following sections. 
- In Section 3 we apply pre-processing to the clients' reviews. We delete stopwords and numbers, and we use lower-case letters for all reviews; we remove punctuation, and we apply lemmatization. 
- In Section 4 we perform Sentiment Analysis. In Section 4 we use Grid search (i.e., GridSearchCV) in order to select the best (most predictive) model. We also use a pipeline within GridSearchCV. 
- In Section 5 we use grid search together with SMOTE (Synthetic Minority Over-sampling Technique); in this section we use the unbalanced dataset, and we correct the class imbalance through the use of SMOTE. 
- In Section 6 we draw a few conclusions about the effect of SMOTE on the "precision" and the "recall" of the estimator. 

# *Section 1: Preparing the Amazon (software) dataset*

# Importing libraries

In [1]:
# Dataframe
import pandas as pd
import json

# Array
import numpy as np

# Decompress the file
import gzip

# Visualizations
import matplotlib.pyplot as plt
#from matplotlib.colors import ListedColormap
import seaborn as sns
import matplotlib.colors as colors
%matplotlib inline

# Datetime
#from datetime import datetime

## Warnings
import warnings
from scipy import stats
warnings.filterwarnings('ignore')


In [2]:
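def parse(path):
    # Stream the gzip file line by line; each line holds one JSON record.
    # The "with" block closes the file once the generator is exhausted.
    with gzip.open(path, 'rb') as g:
        for line in g:
            yield json.loads(line)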
        
def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient = 'index')
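As a possible alternative to the hand-written loader, pandas can read gzip-compressed JSON-lines files directly. The sketch below is an assumption-laden illustration: it builds a tiny temporary file with hypothetical records (the file name `toy_reviews.json.gz` and the values are made up) and reads it back with `pd.read_json`.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Build a tiny gzip-compressed JSON-lines file (hypothetical records)
records = [
    {"asin": "B00TOY0001", "overall": 5.0, "reviewText": "Great"},
    {"asin": "B00TOY0002", "overall": 2.0, "reviewText": "Poor"},
]
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy_reviews.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# lines=True parses one JSON object per line; the compression is stated explicitly
toy_df = pd.read_json(path, lines=True, compression="gzip")
print(toy_df.shape)
print(toy_df["asin"].tolist())
```

Note that `pd.read_json` infers column types on its own, so the resulting dtypes may differ from those produced by `json.loads`; it is worth checking ID columns such as `asin` after loading.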

# Import the main dataset which includes the reviews and the ratings: The main dataset is split across 2 gzip files: "Software_df1.json.gz" and "Software_df2.json.gz"

In [3]:
review_df1 = getDF('data/Software_df1.json.gz')

print(review_df1.shape)
review_df1.head(2)

(185001, 12)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,4.0,True,"03 11, 2014",A240ORQ2LF9LUI,77613252,{'Format:': ' Loose Leaf'},Michelle W,The materials arrived early and were in excell...,Material Great,1394496000,,
1,4.0,True,"02 23, 2014",A1YCCU0YRLS0FE,77613252,{'Format:': ' Loose Leaf'},Rosalind White Ames,I am really enjoying this book with the worksh...,Health,1393113600,,


In [4]:
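###########################
### Combine the two parts of the main dataset and drop duplicated records
###########################
# Assumption: the second part sits next to the first one under 'data/'
review_df2 = getDF('data/Software_df2.json.gz')
review_df = pd.concat([review_df1, review_df2], ignore_index=True)
review_df = review_df.drop_duplicates(subset='reviewText', keep='first')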
print(review_df.shape)

(421433, 12)


In [5]:
###########################
### Rename column "overall" to "Rating" 
###########################
review_df = review_df.rename(columns={'overall':'Rating'})

In [6]:
###########################
### Save it as CSV file
###########################
review_df.to_csv('data/review_df.csv',index=False)

# Import the metadata to extract the title, brand and main category of the product. The "title" is the name of the product. 

In [7]:
metadata = getDF('data/meta_Software.json.gz')
print(metadata.shape)
metadata.head(2)

(26790, 18)


Unnamed: 0,category,tech1,description,fit,title,also_buy,image,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,details
0,[],,[],,HOLT PHYSICS LESSON PRESENTATION CD-ROM QUICK ...,[],[],,HOLT. RINEHART AND WINSTON,[],"25,550 in Software (",[],Software,,</div>,.a-box-inner{background-color:#fff}#alohaBuyBo...,30672120,
1,[],,"[, <b>Latin rhythms that will get your kids si...",,"Sing, Watch, &amp; Learn Spanish (DVD + Guide)...",[],[https://images-na.ssl-images-amazon.com/image...,,McGraw Hill,[],"15,792 in Software (",[],Software,,</div>,,71480935,


In [8]:
###########################
### Drop duplicated records
###########################
metadata = metadata.drop_duplicates(subset='title', keep='first')
print(metadata.shape)

(21110, 18)


In [10]:
###########################
### Keep only 'title', 'brand', 'asin' (i.e., the product ID) and 'main_cat' for the metadata dataframe 
###########################
metadata = metadata[['title', 'brand', 'asin', 'main_cat']]

In [41]:
###########################
### Save it as CSV file
###########################
metadata.to_csv('data/metadata.csv',index=False)

## Data Wrangling

In [11]:
#############################
####  Data cleaning: check missing values 
#############################
print(review_df.isna().sum())
print('')
print(metadata.isna().sum())

Rating                 0
verified               0
reviewTime             0
reviewerID             0
asin                   0
style             206584
reviewerName          21
reviewText             1
summary               40
unixReviewTime         0
vote              298680
image             419964
dtype: int64

title       0
brand       0
asin        0
main_cat    0
dtype: int64


In [12]:
###############################
### Drop rows with missing reviewText and summary
################################
review_df = review_df.dropna(subset=['reviewText', 'summary'])
review_df.isna().sum()

Rating                 0
verified               0
reviewTime             0
reviewerID             0
asin                   0
style             206565
reviewerName          21
reviewText             0
summary                0
unixReviewTime         0
vote              298647
image             419923
dtype: int64

**We merge the two datasets using the product ID (i.e., "asin").**

**Some of the products (i.e., "asin") that are included in the main dataset are not included in the metadata.** 

**Thus, we use inner join on "asin" because we want to have all information (e.g., information on "title", "brand", "main category") for all the reviews included in the final dataset.**  
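The effect of the inner join described above can be sketched on toy data. The frames and values below are hypothetical; the sketch first counts the reviews that have no metadata (using `indicator=True` on a left join) and then performs the inner join, with `validate="many_to_one"` as an optional safeguard against duplicated product IDs in the metadata.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (hypothetical values)
reviews = pd.DataFrame({"asin": ["A1", "A1", "A2", "A3"],
                        "reviewText": ["ok", "good", "bad", "fine"]})
meta = pd.DataFrame({"asin": ["A1", "A2"],
                     "title": ["Product one", "Product two"]})

# A left join with indicator=True flags reviews without metadata as 'left_only'
check = reviews.merge(meta, how="left", on="asin", indicator=True)
n_unmatched = int((check["_merge"] == "left_only").sum())

# The inner join keeps only the reviews whose asin appears in both frames;
# validate raises an error if an asin occurs more than once in the metadata
merged = reviews.merge(meta, how="inner", on="asin", validate="many_to_one")
print(n_unmatched, merged.shape)
```

Here one review (product `A3`) has no metadata, so the inner join returns three of the four rows.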

In [13]:
review_df = review_df.merge(metadata, how='inner', on='asin')

In [13]:
review_df.shape

(410492, 15)

In [14]:
review_df.isna().sum()

Rating                 0
verified               0
reviewTime             0
reviewerID             0
asin                   0
style             203413
reviewerName          21
reviewText             0
summary                0
unixReviewTime         0
vote              290118
image             409064
title                  0
brand                  0
main_cat               0
dtype: int64

In [14]:
###############################
### Drop the columns 'image', 'vote', 'style' and 'verified', which are not used in the analysis
################################
review_df = review_df.drop(columns=['image', 'vote', 'style', 'verified'])

In [15]:
print(review_df.shape)
review_df.head(2)

(410492, 11)


Unnamed: 0,Rating,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,title,brand,main_cat
0,4.0,"03 11, 2014",A240ORQ2LF9LUI,77613252,Michelle W,The materials arrived early and were in excell...,Material Great,1394496000,Connect Personal Health with LearnSmart 1 Seme...,McGraw-Hill Humanities/Social Sciences/Languages,Software
1,4.0,"02 23, 2014",A1YCCU0YRLS0FE,77613252,Rosalind White Ames,I am really enjoying this book with the worksh...,Health,1393113600,Connect Personal Health with LearnSmart 1 Seme...,McGraw-Hill Humanities/Social Sciences/Languages,Software


**There are lots of categories (i.e., main_cat). Since we want to focus only on software, we discard observations for products that do not belong to the category of software** 

In [16]:
review_df.main_cat.unique()

array(['Software', 'Books', 'Video Games', 'Movies &amp; TV',
       'Cell Phones &amp; Accessories', 'Office Products',
       'Toys &amp; Games', 'All Electronics', 'Cell Phones & Accessories',
       'GPS & Navigation', 'Movies & TV', 'Toys & Games',
       'Musical Instruments', 'GPS &amp; Navigation',
       'Arts, Crafts &amp; Sewing', 'Home Audio &amp; Theater',
       'Camera & Photo', 'Car Electronics', 'Arts, Crafts & Sewing',
       'Home Audio & Theater', 'Computers', 'Tools & Home Improvement',
       'Amazon Home', 'Camera &amp; Photo', 'Baby', 'Pet Supplies',
       'Tools &amp; Home Improvement',
       '<img src="https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png" class="nav-categ-image" alt="Digital Music"/>',
       'Automotive', 'Health & Personal Care', 'Sports &amp; Outdoors',
       'Health &amp; Personal Care', 'Sports & Outdoors',
       '<img src="https://m.media-amazon.com/images/G/01/digital/music

In [17]:
review_df = review_df[review_df.main_cat == 'Software'].copy()  # copy, so that new columns can be added safely

In [18]:
print(review_df.shape)
review_df.head(2)

(389927, 11)


Unnamed: 0,Rating,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,title,brand,main_cat
0,4.0,"03 11, 2014",A240ORQ2LF9LUI,77613252,Michelle W,The materials arrived early and were in excell...,Material Great,1394496000,Connect Personal Health with LearnSmart 1 Seme...,McGraw-Hill Humanities/Social Sciences/Languages,Software
1,4.0,"02 23, 2014",A1YCCU0YRLS0FE,77613252,Rosalind White Ames,I am really enjoying this book with the worksh...,Health,1393113600,Connect Personal Health with LearnSmart 1 Seme...,McGraw-Hill Humanities/Social Sciences/Languages,Software


# We create the client's review text, which is the concatenation of 'summary' and 'reviewText': 

In [19]:
review_df['review_2.0'] = review_df['summary'] + " " + review_df['reviewText'] 
review_df.head(1)

Unnamed: 0,Rating,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,title,brand,main_cat,review_2.0
0,4.0,"03 11, 2014",A240ORQ2LF9LUI,77613252,Michelle W,The materials arrived early and were in excell...,Material Great,1394496000,Connect Personal Health with LearnSmart 1 Seme...,McGraw-Hill Humanities/Social Sciences/Languages,Software,Material Great The materials arrived early and...





# *Section 2: we create the binary rating variable. The binary rating variable is used as the target in the following sections.* 

In [20]:
review_df['Rating'].value_counts()

5.0    171743
1.0     93061
4.0     62584
3.0     34306
2.0     28233
Name: Rating, dtype: int64

We use rating = {1,2,3} as bad/neutral rating, and we use rating = {4,5} as good rating. 
See the discussion at https://sellercentral.amazon.com/forums/t/does-a-neutral-3-star-rating-on-your-feedback-count-against-odr/1081/14
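The thresholding rule above can be sketched with a vectorized comparison on a toy series (the values are illustrative only):

```python
import pandas as pd

ratings = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="Rating")

# 1 for good ratings (4 or 5 stars), 0 for bad/neutral ratings (1 to 3 stars)
rating_binary = (ratings >= 4).astype(int)
print(rating_binary.tolist())
```

This gives the same labels as applying a function row by row, but the comparison is evaluated on the whole column at once.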

In [21]:
review_df['Rating_binary'] = review_df['Rating'].apply(lambda x: 0 if x < 4 else 1)
review_df.head(1)

Unnamed: 0,Rating,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,title,brand,main_cat,review_2.0,Rating_binary
0,4.0,"03 11, 2014",A240ORQ2LF9LUI,77613252,Michelle W,The materials arrived early and were in excell...,Material Great,1394496000,Connect Personal Health with LearnSmart 1 Seme...,McGraw-Hill Humanities/Social Sciences/Languages,Software,Material Great The materials arrived early and...,1


In [22]:
review_df['Rating_binary'].value_counts()


1    234327
0    155600
Name: Rating_binary, dtype: int64

# *Section 3: Apply pre-processing to the clients' reviews* 


In [23]:
# Replace newline characters with a space: 
#df['review_2.0'] = df['review_2.0'].str.replace(r'\n', ' ')
def remove_symbols(mystring):
    # '\n' (not the raw string r'\n') matches an actual newline character
    new_string = mystring.replace('\n', ' ')
    return new_string 


# remove numbers: 
def remove_nmbrs(mystring):
    # Iterating over a string yields single characters, so this drops every digit
    mystring_no_numbers = ''.join(ch for ch in mystring if not ch.isdigit())
    return mystring_no_numbers 


#df['review_2.0'] = df['review_2.0'].astype(str)
def stringify(mystring):
    new_string = str(mystring) 
    return new_string 



# Remove Punctuation:
def punct(mystring):
    import string 
    list_s_p = string.punctuation
    for punctuation in list_s_p:
        mystring = mystring.replace(punctuation, ' ')
    return mystring


# lower-case: 
def lower_it(mystring):
    lowered_mystring = mystring.lower()
    return lowered_mystring 



#Remove StopWords:
def stopwords(mystring):
    from nltk.corpus import stopwords 
    from nltk.tokenize import word_tokenize
    stop_words = set(stopwords.words('english')) 
    word_tokens = word_tokenize(mystring) 
    splitting_string = [w for w in word_tokens if w not in stop_words] #creates a list!
    text  = ' '.join(word for word in splitting_string)
    return text 
 

#Lemmatize:
def lem(mystring):
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    splitting_text = mystring.split()
    lemmatized = [lemmatizer.lemmatize(word) for word in splitting_text]
    mystring = ' '.join(word for word in lemmatized)
    return mystring


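def preprocessing(text):
    # Convert to str first, so that the string methods below also work on non-string values (e.g., NaN)
    text = stringify(text)
    text = remove_symbols(text)
    text = remove_nmbrs(text)

    text = punct(text)
    text = lower_it(text)
    text = stopwords(text)
    text = lem(text)
    return text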
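To make the order of the cleaning steps concrete, here is a self-contained sketch that does not depend on NLTK. It is a simplification under stated assumptions: `STOP_WORDS` is a hypothetical mini stop-word list (the notebook uses NLTK's English list), tokens are split on whitespace, and lemmatization is left out.

```python
import string

# Hypothetical mini stop-word list; the notebook uses NLTK's English list instead
STOP_WORDS = frozenset({"the", "a", "an", "and", "is", "it", "was", "this", "in"})

# Translation table that maps every punctuation character to a space
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def simple_preprocess(text):
    # Make sure we work on a string and turn real newlines into spaces
    text = str(text).replace("\n", " ")
    # Drop every digit character
    text = "".join(ch for ch in text if not ch.isdigit())
    # Replace punctuation by spaces, then lower-case
    text = text.translate(PUNCT_TABLE).lower()
    # Remove stop words (lemmatization is omitted in this sketch)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


cleaned = simple_preprocess("This is GREAT!\nIt was installed in 5 minutes.")
print(cleaned)  # great installed minutes
```

Lower-casing before the stop-word filter matters, because the stop-word list contains lower-case entries only.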

In [67]:
review_df['review_2.0'] =  review_df['review_2.0'].map(preprocessing)

In [68]:
review_df.head(3)

Unnamed: 0,Rating,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,title,brand,main_cat,review_2.0,Rating_binary
0,4.0,"03 11, 2014",A240ORQ2LF9LUI,77613252,Michelle W,The materials arrived early and were in excell...,Material Great,1394496000,Connect Personal Health with LearnSmart 1 Seme...,McGraw-Hill Humanities/Social Sciences/Languages,Software,material great material arrived early excellen...,1
1,4.0,"02 23, 2014",A1YCCU0YRLS0FE,77613252,Rosalind White Ames,I am really enjoying this book with the worksh...,Health,1393113600,Connect Personal Health with LearnSmart 1 Seme...,McGraw-Hill Humanities/Social Sciences/Languages,Software,health really enjoying book worksheet make rev...,1
2,1.0,"02 17, 2014",A1BJHRQDYVAY2J,77613252,Allan R. Baker,"IF YOU ARE TAKING THIS CLASS DON""T WASTE YOUR ...",ARE YOU KIDING ME?,1392595200,Connect Personal Health with LearnSmart 1 Seme...,McGraw-Hill Humanities/Social Sciences/Languages,Software,kiding taking class waste money called book bo...,0


# *Section 4: Grid Search using a pipeline which is composed of (1) TfidfVectorizer; (2) Algorithm for classification.*
# *In this section we use a balanced dataset, i.e., the number of 0s is the same as the number of 1s.*


In [80]:
review_df_balanced = review_df.copy()
print(review_df_balanced.shape)
review_df_balanced['Rating_binary'].value_counts()

(389927, 13)


1    234327
0    155600
Name: Rating_binary, dtype: int64

In [81]:
###############################
#### Balancing the dataframe: 
################################

num = (review_df_balanced['Rating_binary'].value_counts()).min()
print(num )
#155600

df_pos = review_df_balanced[review_df_balanced['Rating_binary'] == 1].sample(num, random_state=0)
df_neg = review_df_balanced[review_df_balanced['Rating_binary'] == 0].sample(num, random_state=0)
review_df_balanced = pd.concat([df_pos, df_neg], verify_integrity=True)
print(review_df_balanced['Rating_binary'].value_counts())

155600
1    155600
0    155600
Name: Rating_binary, dtype: int64
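The same kind of undersampling can be sketched in one step with `DataFrameGroupBy.sample` (available in pandas 1.1 and later). The toy frame below is hypothetical; the sketch draws as many rows from each class as the smallest class contains.

```python
import pandas as pd

# Toy unbalanced frame: 7 positive and 3 negative reviews (hypothetical)
toy = pd.DataFrame({
    "review": [f"r{i}" for i in range(10)],
    "Rating_binary": [1] * 7 + [0] * 3,
})

# Size of the smallest class
n_min = int(toy["Rating_binary"].value_counts().min())

# Draw n_min rows from every class with a fixed seed for reproducibility
balanced = toy.groupby("Rating_binary").sample(n=n_min, random_state=0)
print(balanced["Rating_binary"].value_counts().to_dict())
```

Fixing `random_state` makes the balanced sample, and therefore the cross-validation scores that follow, reproducible across runs.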


# **Baseline Estimator (sklearn.dummy.DummyClassifier)**

In [84]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
#from imblearn.pipeline import Pipeline

count_vect = TfidfVectorizer(ngram_range=(1, 1), min_df=1,  max_df = 1000000) 
dummy = DummyClassifier()  

pipeline_dummy = Pipeline(steps=[
        ('vectorizer', count_vect),
        ('dummy', dummy)])
parameters = {}
# Perform simply CV (Cross-validation) without searching the best parameters:
grid_search = GridSearchCV(pipeline_dummy, parameters, n_jobs=-1, 
                           verbose=1, scoring = "accuracy", 
                           refit=True, cv=5)

X = review_df_balanced['review_2.0']
y = review_df_balanced['Rating_binary']

grid_search.fit(X,y)
grid_search.best_params_
grid_search.best_score_
print(grid_search.best_params_)
print(grid_search.best_score_)
#0.5


Fitting 5 folds for each of 1 candidates, totalling 5 fits
{}
0.5


**The accuracy of the baseline estimator is 0.5: I.e., only 50% of the target is correctly predicted.**
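Why a dummy baseline lands at 0.5 on a balanced dataset can be sketched on synthetic labels. The example below assumes the `most_frequent` strategy (the notebook relies on the library default): a classifier that always predicts the same class is right on exactly half of a balanced test fold.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Balanced toy labels; the features are irrelevant for a dummy classifier
y = np.array([0] * 50 + [1] * 50)
X = np.zeros((100, 1))

# 'most_frequent' always predicts a single class, so half of a balanced
# test fold is classified correctly
dummy = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(dummy, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

Any real model should be judged against this reference value rather than against zero.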

# Naive Bayes with GridSearchCV for the "alpha" parameter. 

In [85]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
#from imblearn.pipeline import Pipeline

count_vect = TfidfVectorizer(ngram_range=(1, 1), min_df=1,  max_df = 1000000) 
naive_bayes = MultinomialNB()  

pipeline_dummy = Pipeline(steps=[
        ('vectorizer', count_vect),
        ('nb', naive_bayes)])

parameters = {'nb__alpha': (0, 0.01, 0.05, 0.1, 0.3, 0.5, 1),
              }
# Grid search over "alpha" with 5-fold cross-validation:
grid_search = GridSearchCV(pipeline_dummy, parameters, n_jobs=-1, 
                           verbose=1, scoring = "accuracy", 
                           refit=True, cv=5)

X = review_df_balanced['review_2.0']
y = review_df_balanced['Rating_binary']

grid_search.fit(X,y)
grid_search.best_params_
grid_search.best_score_
print(grid_search.best_params_)
print(grid_search.best_score_)

#Fitting 5 folds for each of 7 candidates, totalling 35 fits
#{'nb__alpha': 1}
#0.8554016709511568


Fitting 5 folds for each of 7 candidates, totalling 35 fits
{'nb__alpha': 1}
0.8554016709511568


**The best accuracy of the Naive Bayes estimator is 0.855: I.e., 85.5% of the target is correctly predicted. This is a fairly large improvement with respect to the baseline estimator.**
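Besides the best score, it can be useful to look at the scores of all candidates. The sketch below runs a small grid search over `alpha` on a toy corpus (hypothetical reviews, not the Amazon data) and tabulates `cv_results_`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical reviews): 20 positive and 20 negative texts
texts = (["great product works well", "love this software"] * 10
         + ["terrible product crashes", "waste of money"] * 10)
labels = [1] * 20 + [0] * 20

pipe = Pipeline(steps=[("vectorizer", TfidfVectorizer()),
                       ("nb", MultinomialNB())])
search = GridSearchCV(pipe, {"nb__alpha": [0.01, 0.1, 1]},
                      scoring="accuracy", cv=5)
search.fit(texts, labels)

# One row per candidate, with the mean and spread of the fold scores
results = pd.DataFrame(search.cv_results_)[
    ["param_nb__alpha", "mean_test_score", "std_test_score", "rank_test_score"]]
print(results)
```

Comparing `mean_test_score` with `std_test_score` shows whether the difference between two candidates is larger than the fold-to-fold variation.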

# Logistic regression with GridSearchCV for the "penalty" parameter. 

In [89]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
#from imblearn.pipeline import Pipeline

count_vect = TfidfVectorizer(ngram_range=(1, 1), min_df=1,  max_df = 1000000) 
log_reg = LogisticRegression()  

pipeline_dummy = Pipeline(steps=[
        ('vectorizer', count_vect),
        ('log', log_reg)])

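# Note: 'l1' is only fitted by solvers that support it (e.g., 'liblinear' or 'saga'); with the
# default 'lbfgs' solver (scikit-learn >= 0.22) the 'l1' candidate fails and only 'l2' is scored.
parameters = {'log__penalty': ['l1','l2']
              }
# Grid search over the penalty with 5-fold cross-validation: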
grid_search = GridSearchCV(pipeline_dummy, parameters, n_jobs=-1, 
                           verbose=1, scoring = "accuracy", 
                           refit=True, cv=5)

X = review_df_balanced['review_2.0']
y = review_df_balanced['Rating_binary']

grid_search.fit(X,y)
grid_search.best_params_
grid_search.best_score_
print(grid_search.best_params_)
print(grid_search.best_score_)

#Fitting 5 folds for each of 2 candidates, totalling 10 fits 
#0.89


Fitting 5 folds for each of 2 candidates, totalling 10 fits
{'log__penalty': 'l2'}
0.9003566838046272


**The best accuracy of the Logistic Regression is around 0.90, i.e., around 90% of the target is correctly predicted. This is a fairly large improvement with respect to the baseline estimator.**
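Whether both penalties are really compared depends on the solver: scikit-learn's `liblinear` and `saga` solvers support the `l1` (Lasso-type) penalty, whereas `lbfgs` supports only `l2` (Ridge-type) or no penalty. The sketch below uses a toy corpus (hypothetical reviews) and `solver="liblinear"`, which is a choice made for this illustration and not the notebook's setting, so that both candidates receive a score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical reviews): 20 positive and 20 negative texts
texts = (["great product works well", "love this software"] * 10
         + ["terrible product crashes", "waste of money"] * 10)
labels = [1] * 20 + [0] * 20

# liblinear supports both the l1 and the l2 penalty
pipe = Pipeline(steps=[("vectorizer", TfidfVectorizer()),
                       ("log", LogisticRegression(solver="liblinear"))])
search = GridSearchCV(pipe, {"log__penalty": ["l1", "l2"]},
                      scoring="accuracy", cv=5)
search.fit(texts, labels)

# Print the mean cross-validated accuracy of each penalty
for penalty, score in zip(search.cv_results_["param_log__penalty"],
                          search.cv_results_["mean_test_score"]):
    print(penalty, score)
```

A missing (NaN) score for one of the candidates in `cv_results_` is a sign that the chosen solver could not fit that penalty.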

# Decision tree algorithm with GridSearchCV for the "max_depth" parameter. 

In [91]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
#from imblearn.pipeline import Pipeline

count_vect = TfidfVectorizer(ngram_range=(1, 1), min_df=1,  max_df = 1000000) 
tree = DecisionTreeClassifier()  

pipeline_dummy = Pipeline(steps=[
        ('vectorizer', count_vect),
        ('tree', tree)])

parameters = {'tree__max_depth': [2, 3, 6, 8]
              }

# Grid search over "max_depth" with 5-fold cross-validation:
grid_search = GridSearchCV(pipeline_dummy, parameters, n_jobs=-1, 
                           verbose=1, scoring = "accuracy", 
                           refit=True, cv=5)

X = review_df_balanced['review_2.0']
y = review_df_balanced['Rating_binary']

grid_search.fit(X,y)
grid_search.best_params_
grid_search.best_score_
print(grid_search.best_params_)
print(grid_search.best_score_)

#Fitting 5 folds for each of 4 candidates, totalling 20 fits
#{'tree__max_depth': 8}
#0.7578374035989717


Fitting 5 folds for each of 4 candidates, totalling 20 fits
{'tree__max_depth': 8}
0.7578374035989717


**The performance of the decision tree is worse than Logistic Regression.**

# XGBoost classifier with GridSearchCV. 

In [93]:
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier     
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
#from imblearn.pipeline import Pipeline

count_vect = TfidfVectorizer(ngram_range=(1, 1), min_df=1,  max_df = 1000000) 
xgboost = XGBClassifier(learning_rate=0.02, n_estimators=100, objective='binary:logistic',
                    silent=True, nthread=-1)

pipeline_dummy = Pipeline(steps=[
        ('vectorizer', count_vect),
        ('xgboost', xgboost)])

parameters = {'xgboost__max_depth': [6,7]
              }

# Grid search over "max_depth" with 5-fold cross-validation:
grid_search = GridSearchCV(pipeline_dummy, parameters, n_jobs=-1, 
                           verbose=1, scoring = "accuracy", 
                           refit=True, cv=5)

X = review_df_balanced['review_2.0']
y = review_df_balanced['Rating_binary']

grid_search.fit(X,y)
grid_search.best_params_
grid_search.best_score_
print(grid_search.best_params_)
print(grid_search.best_score_)

#Fitting 5 folds for each of 2 candidates, totalling 10 fits
#{'xgboost__max_depth': 7}
#0.7877988431876606

Fitting 5 folds for each of 2 candidates, totalling 10 fits
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


{'xgboost__max_depth': 7}
0.7877988431876606


**The performance of XGBoost classifier is worse than Logistic Regression.**

# Conclusion: Logistic Regression is the best algorithm for our task. Moreover, the Ridge penalty (i.e., the l2 penalty) is the best penalty for Logistic Regression.

# Thus, we save the Logistic Regression model with the Ridge (l2) penalty as our favorite estimator. We use joblib to save the best model. 

In [96]:
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
#from imblearn.pipeline import Pipeline

count_vect = TfidfVectorizer(ngram_range=(1, 1), min_df=1,  max_df = 1000000) 
log_reg = LogisticRegression()  

pipeline_dummy = Pipeline(steps=[
        ('vectorizer', count_vect),
        ('log', log_reg)])

parameters = {'log__penalty': ['l1','l2']
              }
# Grid search over the penalty with 5-fold cross-validation:
grid_search = GridSearchCV(pipeline_dummy, parameters, n_jobs=-1, 
                           verbose=1, scoring = "accuracy", 
                           refit=True, cv=5)

X = review_df_balanced['review_2.0']
y = review_df_balanced['Rating_binary']

grid_search.fit(X,y)
grid_search.best_params_
grid_search.best_score_
print(grid_search.best_params_)
print(grid_search.best_score_)

#Fitting 5 folds for each of 2 candidates, totalling 10 fits 
#0.89

Fitting 5 folds for each of 2 candidates, totalling 10 fits
{'log__penalty': 'l2'}
0.8909254498714653


In [106]:
import joblib
# save the model to disk
filename = 'finalized_model.joblib'
#joblib.dump(grid_search.best_estimator_, filename, compress = 1)
joblib.dump(grid_search.best_estimator_, filename)

['finalized_model.joblib']
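As a small, self-contained illustration of the save/load round trip, the sketch below fits a toy pipeline (hypothetical texts, default settings), stores it in a temporary directory and checks that the reloaded pipeline gives the same predictions. The file name `toy_model.joblib` is made up for this example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical reviews)
texts = ["great product", "love it", "terrible product", "waste of money"] * 5
labels = [1, 1, 0, 0] * 5

pipe = Pipeline(steps=[("vectorizer", TfidfVectorizer()),
                       ("log", LogisticRegression())]).fit(texts, labels)

# Persist the whole fitted pipeline (vectorizer vocabulary + classifier weights)
path = os.path.join(tempfile.mkdtemp(), "toy_model.joblib")
joblib.dump(pipe, path)

# The reloaded pipeline accepts pre-processed strings directly
reloaded = joblib.load(path)
same = bool((reloaded.predict(texts) == pipe.predict(texts)).all())
print(same)
```

Because the vectorizer is stored together with the classifier, new reviews only need the same text pre-processing before they are passed to `predict`. A saved model should be loaded with a scikit-learn version compatible with the one used for fitting.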

# Out-of-sample prediction: We use 2 reviews from Amazon UK (Software) to further test our model (The above model was estimated with data from Amazon.com): 

**Example of review with bad rating:**

In [114]:
# load the model from disk
loaded_model = joblib.load(filename)

#I use one example of a review from Amazon UK (Software): https://www.amazon.co.uk/Microsoft-Professional-Genuine-Lifetime-Product/dp/B00WYPDA4C/ref=sr_1_16?dchild=1&keywords=microsoft+office&qid=1608730193&s=software&sr=1-16
#This review has a bad rating (i.e., 2 stars):
example_of_review = 'When I received this product, I thought I would get the actual disk from Microsoft. However, it was a burned disk with instructions on how to download it and activate it. The instructions were very confusing and when I put in my product key, it said it was not a valid key. Got on the phone and could get no help, so had to send it back and buy from someone else. Very happy with the new product which was office 2010. If you buy this hope you have better luck than me.'

#Apply the preprocessing: 
example_of_review =  preprocessing(example_of_review)

#Transform the "review" into a list (or an iterable) containing a single element: 
final_review = [example_of_review]
result = loaded_model.predict(final_review)
print(result)

[0]


**Example of review with good rating:**

In [116]:
# load the model from disk
loaded_model = joblib.load(filename)

#I use one example of a review from Amazon UK (Software): https://www.amazon.co.uk/Microsoft-Professional-Genuine-Lifetime-Product/dp/B00WYPDA4C/ref=sr_1_16?dchild=1&keywords=microsoft+office&qid=1608730193&s=software&sr=1-16
#This review has a good rating (i.e., 5 stars):
example_of_review = 'I don\'t usually write reviews but this compnay deserves a big shout our for great customer service. I am not computer savey and was really messing up installation, so i called the tech support and ended up talking to Mitchell he was patient and didn\'t make me feel like and idoit for messing things up. If i hadn\'t been for him this review would have been much different. He went out of his way to make sure eveything was installed right and even made called microsoft because i had messed things up so bad. Thank you thank you for companys with intergrity and great customer service.'

#Apply the preprocessing: 
example_of_review =  preprocessing(example_of_review)

#Transform the "review" into a list (or an iterable) containing a single element: 
final_review = [example_of_review]
result = loaded_model.predict(final_review)
print(result)

[1]


# Conclusion for out-of-sample prediction: both cases are correctly predicted.

# *Section 5: Grid Search using a pipeline which is composed of (1) TfidfVectorizer; (2) SMOTE (Synthetic Minority Over-sampling Technique); (3) Logistic regression. In this section we use the unbalanced dataset, and we do not undersample it because SMOTE handles the class imbalance.*

# We use only 10% of the original (unbalanced) dataset, because we ran into memory issues when estimating the model on the full (unbalanced) dataset. 

**Below, we use the best model that was found in Section 4, but we additionally use SMOTE on the unbalanced dataset.**

In order to use SMOTE, we need to: 
- use "from imblearn.pipeline import Pipeline";
- import SMOTE from  imblearn.over_sampling; 
- use the unbalanced dataset which contains 10% of the original observations (i.e., review_df_restricted). See below for more details about review_df_restricted. 

All other settings are exactly the same as for our preferred model from Section 4. 
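To give an intuition for what SMOTE does, the following NumPy sketch reproduces its core idea on toy points: a synthetic minority sample is created by interpolating between a minority point and one of its nearest minority-class neighbours. This is a simplified illustration with hypothetical data and a made-up helper name (`smote_like_samples`); it is not the imblearn implementation used in this section. In the imblearn pipeline, the sampler is applied only while fitting, so the validation folds and new reviews are never resampled.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points in 2-D (hypothetical values)
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])


def smote_like_samples(X, n_new, k=2, rng=rng):
    """Create n_new synthetic points by interpolating from a random seed point
    towards a random one of its k nearest minority-class neighbours."""
    # Pairwise Euclidean distances between minority points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X))            # index of the seed point
        j = neighbours[i, rng.integers(k)]  # one of its k nearest neighbours
        lam = rng.random()                  # interpolation weight in [0, 1)
        new_points.append(X[i] + lam * (X[j] - X[i]))
    return np.array(new_points)


synthetic = smote_like_samples(X_min, n_new=5)
print(synthetic)
```

Every synthetic point lies on a line segment between two existing minority points, so the minority class is enlarged without duplicating observations exactly.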

In [29]:
print(review_df.shape)
#(389927, 13)

# Keep 10% of the observations (integer division rounds down)
n_obs = len(review_df) // 10
review_df_restricted  = review_df.sample(n_obs, random_state=1)
print(review_df_restricted.shape)

(389927, 13)
(38992, 13)


In [30]:
#Pre-processing was already applied above to the full dataset. 
#review_df_restricted['review_2.0'] =  review_df_restricted['review_2.0'].map(preprocessing)

review_df_restricted['Rating_binary'].value_counts()

1    23426
0    15566
Name: Rating_binary, dtype: int64

In [37]:
# testing the model with gridsearchCV

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
#from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline

from imblearn.over_sampling import SMOTE 
from sklearn.linear_model import LogisticRegression

count_vect = TfidfVectorizer(ngram_range=(1, 1), min_df=1,  max_df = 1000000)
log_reg = LogisticRegression() 

pipeline_log_reg = Pipeline(steps=[
        ('vectorizer', count_vect),
        ('sampling', SMOTE()),
        ('log_reg', log_reg)])

parameters = {'log_reg__penalty': ['l2']} 

# Perform cross validation: 
grid_search = GridSearchCV(pipeline_log_reg, parameters, n_jobs=-1, 
                           verbose=1, scoring = "accuracy", 
                           refit=True, cv=5)

X = review_df_restricted['review_2.0']
y = review_df_restricted['Rating_binary']

grid_search.fit(X,y)
grid_search.best_params_
grid_search.best_score_
print(grid_search.best_params_)
print(grid_search.best_score_)

Fitting 5 folds for each of 1 candidates, totalling 5 fits
{'log_reg__penalty': 'l2'}
0.8764619239989765


# We also save our model which is based on SMOTE in a joblib file.

In [38]:
import joblib
# save the model to disk
filename = 'finalized_model_SMOTE.joblib'
#joblib.dump(grid_search.best_estimator_, filename, compress = 1)
joblib.dump(grid_search.best_estimator_, filename)

['finalized_model_SMOTE.joblib']

# Out-of-sample prediction: We use 2 reviews from Amazon UK (Software) to further test our model (The above model was estimated with data from Amazon.com): 

**Example of review with bad rating:**

In [39]:
# load the model from disk
loaded_model = joblib.load('finalized_model_SMOTE.joblib')

#I use one example of a review from Amazon UK (Software): https://www.amazon.co.uk/Microsoft-Professional-Genuine-Lifetime-Product/dp/B00WYPDA4C/ref=sr_1_16?dchild=1&keywords=microsoft+office&qid=1608730193&s=software&sr=1-16
#This review has a bad rating (i.e., 2 stars):
example_of_review = 'When I received this product, I thought I would get the actual disk from Microsoft. However, it was a burned disk with instructions on how to download it and activate it. The instructions were very confusing and when I put in my product key, it said it was not a valid key. Got on the phone and could get no help, so had to send it back and buy from someone else. Very happy with the new product which was office 2010. If you buy this hope you have better luck than me.'

#Apply the preprocessing: 
example_of_review =  preprocessing(example_of_review)

#Transform the "review" into a list (or an iterable) containing a single element: 
final_review = [example_of_review]
result = loaded_model.predict(final_review)
print(result)

[0]


**Example of review with good rating:**

In [40]:
# load the model from disk
loaded_model = joblib.load('finalized_model_SMOTE.joblib')

#I use one example of a review from Amazon UK (Software): https://www.amazon.co.uk/Microsoft-Professional-Genuine-Lifetime-Product/dp/B00WYPDA4C/ref=sr_1_16?dchild=1&keywords=microsoft+office&qid=1608730193&s=software&sr=1-16
#This review has a good rating (i.e., 5 stars):
example_of_review = 'I don\'t usually write reviews but this compnay deserves a big shout our for great customer service. I am not computer savey and was really messing up installation, so i called the tech support and ended up talking to Mitchell he was patient and didn\'t make me feel like and idoit for messing things up. If i hadn\'t been for him this review would have been much different. He went out of his way to make sure eveything was installed right and even made called microsoft because i had messed things up so bad. Thank you thank you for companys with intergrity and great customer service.'

#Apply the preprocessing: 
example_of_review =  preprocessing(example_of_review)

#Transform the "review" into a list (or an iterable) containing a single element: 
final_review = [example_of_review]
result = loaded_model.predict(final_review)
print(result)

[1]


# Conclusion for out-of-sample prediction: both cases are correctly predicted.

# *In Section 6 we draw a few conclusions about the effect of SMOTE on the "precision" and the "recall" of the estimator.*


# Section 6.1) SMOTE is used to "balance" the dataset

**We use "precision" as the scoring measure  (and do not use accuracy), and we estimate the model using SMOTE.**

**We also show the confusion matrix using SMOTE** 

In [46]:
# testing the model with gridsearchCV

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
#from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline

from imblearn.over_sampling import SMOTE 
from sklearn.linear_model import LogisticRegression

count_vect = TfidfVectorizer(ngram_range=(1, 1), min_df=1,  max_df = 1000000)
log_reg = LogisticRegression() 

pipeline_log_reg = Pipeline(steps=[
        ('vectorizer', count_vect),
        ('sampling', SMOTE()),
        ('log_reg', log_reg)])

parameters = {'log_reg__penalty': ['l2']} 

# Perform cross validation: 
grid_search = GridSearchCV(pipeline_log_reg, parameters, n_jobs=-1, 
                           verbose=1, scoring = "precision", 
                           refit=True, cv=5)

X = review_df_restricted['review_2.0']
y = review_df_restricted['Rating_binary']

grid_search.fit(X, y)
# Print the best parameter setting and its mean cross-validated precision.
print(grid_search.best_params_)
print(grid_search.best_score_)
#Fitting 5 folds for each of 1 candidates, totalling 5 fits

# precision with SMOTE: 
#{'log_reg__penalty': 'l2'}
#0.9107987821384562

Fitting 5 folds for each of 1 candidates, totalling 5 fits
{'log_reg__penalty': 'l2'}
0.9107987821384562


In [47]:
from sklearn.metrics import confusion_matrix
y_pred = grid_search.best_estimator_.predict(X)
confusion_matrix(y, y_pred)

array([[14128,  1438],
       [ 2170, 21256]])

# Conclusion: the cross-validated "precision" is 0.910 using SMOTE, and the confusion matrix (computed in-sample, on the full dataset) is:

# [14128,  1438],

# [ 2170, 21256]

# where TN = 14128; TP = 21256; FP (false positives) = 1438; FN (false negatives) = 2170. 
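
The labels above follow scikit-learn's layout for a binary confusion matrix: rows are the true classes and columns are the predicted classes, so the matrix reads [[TN, FP], [FN, TP]]. As a minimal sketch, the snippet below hard-codes the matrix reported above, unpacks the four cells, and recovers the number of true 0s and true 1s from the row sums. The variable names are illustrative and do not appear in the notebook.

```python
# Confusion matrix reported above for the SMOTE pipeline
# (rows: true class 0/1, columns: predicted class 0/1).
cm_smote = [[14128, 1438],
            [2170, 21256]]

# Unpack in scikit-learn's order: [[TN, FP], [FN, TP]].
(tn, fp), (fn, tp) = cm_smote

# Row sums give the number of reviews whose true label is 0 or 1.
n_true_zeros = tn + fp
n_true_ones = fn + tp
share_of_ones = n_true_ones / (n_true_zeros + n_true_ones)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"true 0s: {n_true_zeros}, true 1s: {n_true_ones}")
print(f"share of 1s: {share_of_ones:.1%}")
```

With the real objects from the notebook, the same unpacking can be written as `tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()`.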

# *Section 6.2) SMOTE is OMITTED below.* 

**We use "precision" as the scoring measure  (and do not use accuracy), and we estimate the model WITHOUT SMOTE.**

**We also show the confusion matrix WITHOUT SMOTE** 

In [52]:
# testing the model with gridsearchCV

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
#from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline

from imblearn.over_sampling import SMOTE 
from sklearn.linear_model import LogisticRegression

count_vect = TfidfVectorizer(ngram_range=(1, 1), min_df=1,  max_df = 1000000)
log_reg = LogisticRegression() 

pipeline_log_reg = Pipeline(steps=[
        ('vectorizer', count_vect),
        #('sampling', SMOTE()),
        ('log_reg', log_reg)])

parameters = {'log_reg__penalty': ['l2']} 

# Perform cross validation: 
grid_WITHOUT_SMOTE = GridSearchCV(pipeline_log_reg, parameters, n_jobs=-1, 
                           verbose=1, scoring = "precision", 
                           refit=True, cv=5)

X = review_df_restricted['review_2.0']
y = review_df_restricted['Rating_binary']

grid_WITHOUT_SMOTE.fit(X, y)
# Print the best parameter setting and its mean cross-validated precision.
print(grid_WITHOUT_SMOTE.best_params_)
print(grid_WITHOUT_SMOTE.best_score_)
#Fitting 5 folds for each of 1 candidates, totalling 5 fits
#{'log_reg__penalty': 'l2'}
#0.8900174434732577




Fitting 5 folds for each of 1 candidates, totalling 5 fits
{'log_reg__penalty': 'l2'}
0.8900174434732577



In [53]:
from sklearn.metrics import confusion_matrix
y_pred_WITHOUT_SMOTE = grid_WITHOUT_SMOTE.best_estimator_.predict(X)
confusion_matrix(y, y_pred_WITHOUT_SMOTE)

#array([[13517,  2049],
#       [ 1516, 21910]])

array([[13517,  2049],
       [ 1516, 21910]])

# Conclusion: the cross-validated "precision" is 0.890 when we omit SMOTE from our pipeline, and the confusion matrix (computed in-sample, on the full dataset) is:

# [13517,  2049],

# [ 1516, 21910]

# where TN = 13517; TP = 21910; FP (false positives) = 2049; FN (false negatives) = 1516. 

# *Section 6.3) Comparison of the estimator with and without SMOTE.* 

**The precision is higher in the case that we use SMOTE (compared to the case that we omit SMOTE).**

# Explanation: 
- The dataset (review_df_restricted) contains more 1s than 0s (zeros): 23426 versus 15566, as the row sums of the confusion matrices show.
- SMOTE forces the estimator to see more 0s (zeros) as the target variable (compared to the case that we omit SMOTE from the pipeline). 
- As a result, the number of FP (false positives) is considerably lower for the model that includes SMOTE (1438) than for the model that omits SMOTE from the pipeline (2049). This is exactly what we would expect from the use of SMOTE in our case!
- Precision's formula: TP / (TP + FP). 
- In conclusion, SMOTE forces the estimator to see more 0s (zeros), and this implies that the number of FP (false positives) decreases by almost 30% (from 2049 to 1438) compared to the number of FP for the non-SMOTEd model. Finally, the fact that the number of FP is much lower for the SMOTEd model implies that the "Precision" is higher for the SMOTEd model. This is also as expected for our specific dataset. 
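
The arithmetic behind the bullet points above can be checked directly from the two confusion matrices. The sketch below is illustrative only: the dictionaries and the `precision` helper are hypothetical names, and the counts are the in-sample values reported earlier, so the resulting precisions come out higher than the cross-validated scores of 0.910 and 0.890.

```python
# Counts taken from the two confusion matrices reported above (in-sample).
with_smote = {"tp": 21256, "fp": 1438}
without_smote = {"tp": 21910, "fp": 2049}

def precision(counts):
    """Precision = TP / (TP + FP)."""
    return counts["tp"] / (counts["tp"] + counts["fp"])

# Relative drop in false positives when SMOTE is added to the pipeline.
fp_drop = (without_smote["fp"] - with_smote["fp"]) / without_smote["fp"]

precision_with = precision(with_smote)
precision_without = precision(without_smote)

print(f"FP drop with SMOTE: {fp_drop:.1%}")
print(f"precision with SMOTE:    {precision_with:.3f}")
print(f"precision without SMOTE: {precision_without:.3f}")
```

The ordering matches the cross-validated scores: fewer false positives give the SMOTE pipeline the higher precision.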

# Final remark on the tradeoff between Recall and Precision:

**On the other hand, the recall is lower in the case of the SMOTEd model (compared to the case that we omit SMOTE from the pipeline).**

**Recall's formula: TP / (TP + FN).**

**Recall of the SMOTEd model = 21256 / ( 21256 + 2170) = 0.907**

**Recall of the model that omits SMOTE = 21910 / (21910 + 1516) = 0.935**

**This is also as expected. Indeed, the model without SMOTE "sees" fewer 0s (zeros) than the SMOTEd model, and thus the number of FN (false negatives) is much lower in the case that we omit SMOTE.**
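
The two recall values, and the size of the trade-off, can be reproduced in the same way. This is a small sketch using the in-sample counts from the confusion matrices above; the `recall` helper and the variable names are assumptions made for illustration.

```python
# Counts taken from the two confusion matrices reported above (in-sample).
with_smote = {"tp": 21256, "fn": 2170}
without_smote = {"tp": 21910, "fn": 1516}

def recall(counts):
    """Recall = TP / (TP + FN)."""
    return counts["tp"] / (counts["tp"] + counts["fn"])

recall_with = recall(with_smote)
recall_without = recall(without_smote)

# Relative increase in false negatives when SMOTE is added to the pipeline.
fn_increase = (with_smote["fn"] - without_smote["fn"]) / without_smote["fn"]

print(f"recall with SMOTE:    {recall_with:.3f}")
print(f"recall without SMOTE: {recall_without:.3f}")
print(f"FN increase with SMOTE: {fn_increase:.1%}")
```

Whether this exchange of recall for precision is worthwhile depends on which error is more costly for the application: wrongly flagging a bad review as good (FP) or wrongly flagging a good review as bad (FN).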