## Data Dictionary

An overview of the features in our dataset.

| Feature     	| Type 	| Description                                                       	|
|:-------------	|:------	|:-------------------------------------------------------------------	|
| content     	| obj  	| Raw text containing user reviews                                  	|
| content_stem 	| obj  	| Pre-processed text for modeling                                   	|
| score       	| int  	| No. of star ratings the user gave (1-5)                           	|
| target      	| int  	| Target variable <br>Positive sentiment: 0<br>Negative sentiment: 1 	|

## Import Libraries

In [11]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Libraries for classical machine learning
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Libraries for deep learning
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Flatten, Embedding, Dropout, Bidirectional, SpatialDropout1D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import ModelCheckpoint

# Libraries for topic modeling
from pprint import pprint
import gensim
import spacy
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from wordcloud import WordCloud
import matplotlib.colors as mcolors
from nltk.corpus import stopwords
import pyLDAvis.gensim

%matplotlib inline

## Load Data

Load in Shopee's Google Play reviews that we have collected, cleaned and pre-processed.

In [18]:
# Read the clean dataset
reviews = pd.read_csv('../SentimentModel/clean_train.csv')

In [19]:
# View the first 5 rows of our dataset
reviews.head()

|   | content | content_stem | score | target |
|:--|:--------|:-------------|:------|:-------|
| 0 | Highly Recommended The only shopping app that ... | highli recommend shop love shop reliabl custom... | 5 | 0 |
| 1 | In my experiences, Shopee offers better price ... | experi offer better price product reason unkno... | 5 | 0 |
| 2 | So far so good, but the delivery can be quite ... | far good deliveri quit slow time even though l... | 4 | 0 |
| 3 | Facing really bad experience with Shopee wareh... | face realli experi warehous shown return track... | 1 | 1 |
| 4 | Edit to 3 star: Shopee finally refunded my ite... | edit star final refund return declar miss cour... | 3 | 1 |


In [20]:
# 3040 documents in our dataset
reviews.shape

(3040, 4)

In [21]:
# Check the data types
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3040 entries, 0 to 3039
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   content       3040 non-null   object
 1   content_stem  3040 non-null   object
 2   score         3040 non-null   int64 
 3   target        3040 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 95.1+ KB


In [22]:
# Establish our baseline score
reviews['target'].value_counts(normalize=True)

target
0    0.526316
1    0.473684
Name: proportion, dtype: float64

Given that the majority class is class 0 (positive sentiment), our **baseline score** is an accuracy of **0.53**, which is what a model would achieve by always predicting class 0. The baseline score will serve as a point of comparison when evaluating our models.
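The baseline can be read directly off the class proportions printed above: it is the accuracy of a classifier that always predicts the most common class. A minimal sketch, assuming the class counts implied by 3040 rows and the printed proportions (1600 and 1440); `majority_baseline` is a hypothetical helper, not part of the notebook:

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_class, accuracy) for a model that always predicts the most common label."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)

# Assumed class counts: 3040 rows with 52.6316% in class 0 and 47.3684% in class 1
labels = [0] * 1600 + [1] * 1440

majority_class, baseline_acc = majority_baseline(labels)
print(majority_class, round(baseline_acc, 4))  # 0 0.5263
```

Any model worth keeping should beat this figure by a clear margin on the validation set.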

## Pre-Modeling

In [23]:
X = reviews['content_stem']
y = reviews['target']

In [24]:
# Perform train test split so that we can train, score and tune our models' hyperparameters 
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [25]:
X_train.shape

(2432,)

In [26]:
X_val.shape

(608,)

In [27]:
y_train.value_counts(normalize=True)

target
0    0.526316
1    0.473684
Name: proportion, dtype: float64

In [28]:
y_val.value_counts(normalize=True)

target
0    0.526316
1    0.473684
Name: proportion, dtype: float64
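The identical class proportions in the training and validation sets are the effect of `stratify=y`. The idea can be sketched in plain Python as below. This is an illustration only, not scikit-learn's implementation; `stratified_split` is a hypothetical helper and the labels are the assumed class counts (1600 and 1440) rather than the real data:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so that each class keeps the same share in both parts."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)  # hold out the same fraction of every class
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Assumed class counts matching the proportions printed above
labels = [0] * 1600 + [1] * 1440
train_idx, val_idx = stratified_split(labels)

print(len(train_idx), len(val_idx))                          # 2432 608
print(sum(labels[i] == 0 for i in val_idx) / len(val_idx))   # share of class 0 in the validation part
```

Because every class is sampled at the same rate, the held-out part contains 320 positive and 288 negative reviews, which matches the supports reported for the validation set.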

In [29]:
# Use count vectorizer to check how many unique words there are
cvec = CountVectorizer(stop_words='english') 
cvec_df = pd.DataFrame(cvec.fit_transform(X_train).todense(), columns=cvec.get_feature_names_out())
cvec_df.shape

(2432, 3483)

There are 3483 unique words in our training corpus after English stop words are removed.

In [30]:
# Helper that takes the actual y values and the model predictions,
# then prints the classification report and the confusion matrix.
# `dataset` is a label for the printout, e.g. 'validation set' or 'test set'.

def cmat(actual_y, predictions, dataset):
    
    # Create a classification report
    print('Classification report for', dataset)
    print(classification_report(actual_y, predictions))
    print('')
    
    # Create a confusion matrix
    cm = confusion_matrix(actual_y, predictions)
    cm_df = pd.DataFrame(cm, columns=['Predicted Positive Review','Predicted Negative Review'], index=['Actual Positive Review', 'Actual Negative Review'])
    print('Confusion matrix for', dataset)
    print(cm_df)

## Modeling

We will explore both classical machine learning and deep learning for sentiment analysis. The production model will be selected based on accuracy and recall on the validation set. 

## Classical Machine Learning

We will use the Bag of Words (BoW) representation to extract features from the text, using two vectorizers: CountVectorizer and TfidfVectorizer. CountVectorizer tokenizes the text and counts how often each word occurs in each document. TF-IDF, on the other hand, weights each word by how important it is to one document relative to all other documents: words that occur often in one document but appear in few other documents receive higher weights and tend to carry more predictive power.
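The difference between raw counts and TF-IDF weights can be illustrated in a few lines of plain Python. This is a sketch on a hypothetical toy corpus, using the smoothed idf formula `ln((1 + n) / (1 + df)) + 1`; it omits the L2 normalisation that `TfidfVectorizer` applies by default, so the numbers will not match scikit-learn's output:

```python
import math
from collections import Counter

# Hypothetical toy corpus, loosely styled after the stemmed reviews
docs = [
    "good deliveri fast good price",
    "deliveri slow refund slow",
    "good app",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Bag of Words: raw term counts per document
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(doc_counts):
    """Weight each term count by a smoothed inverse document frequency."""
    return {
        term: tf * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, tf in doc_counts.items()
    }

weights = [tfidf(c) for c in counts]

print(dict(counts[1]))
print({term: round(w, 3) for term, w in weights[1].items()})
```

In the second document, `slow` gets the largest weight because it is frequent there and absent elsewhere, while `deliveri` is down-weighted because it also appears in another document.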

After vectorizing, we will fit Logistic Regression, Naive Bayes and Support Vector Machine models on the training data and evaluate each model's performance on the validation set.

### Count Vectorizer & Logistic Regression

In [31]:
# Create a pipeline with Count Vectorizer and Logistic Regression
pipe_cvec_lr = Pipeline([
    ('cvec', CountVectorizer(stop_words='english')), 
    ('lr', LogisticRegression(random_state=42))
])

# Search over the following values of hyperparameters:
pipe_cvec_lr_params = {
    'cvec__max_features': [300], #100,200
    'cvec__min_df': [2,3], 
    'cvec__max_df': [.9,.95], 
#     'cvec__ngram_range':[(1,1),(1,2)],  
    'lr__penalty': ['l2'],
    'lr__C': [.01,.1]
}

# Instantiate GridSearchCV
gs_cvec_lr = GridSearchCV(pipe_cvec_lr, # Objects to optimise
                          param_grid = pipe_cvec_lr_params, # Hyperparameters for tuning
                          cv=10) # 10-fold cross validation

# Fit model on to training data
gs_cvec_lr.fit(X_train, y_train)

# Generate predictions on validation set
cvec_lr_pred = gs_cvec_lr.predict(X_val)

# Print best parameters
print('Best parameters: ', gs_cvec_lr.best_params_)

# Print accuracy scores
print('Best CV score: ', gs_cvec_lr.best_score_)
print('Training score:', gs_cvec_lr.score(X_train, y_train))
print('Validation score:', gs_cvec_lr.score(X_val, y_val))
print('')

# Print classification report and confusion matrix
cmat(y_val, cvec_lr_pred, 'validation set')

Best parameters:  {'cvec__max_df': 0.9, 'cvec__max_features': 300, 'cvec__min_df': 2, 'lr__C': 0.1, 'lr__penalty': 'l2'}
Best CV score:  0.839227889091277
Training score: 0.8667763157894737
Validation score: 0.8305921052631579

Classification report for validation set
              precision    recall  f1-score   support

           0       0.83      0.86      0.84       320
           1       0.83      0.80      0.82       288

    accuracy                           0.83       608
   macro avg       0.83      0.83      0.83       608
weighted avg       0.83      0.83      0.83       608


Confusion matrix for validation set
                        Predicted Positive Review  Predicted Negative Review
Actual Positive Review                        274                         46
Actual Negative Review                         57                        231
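The figures in the classification report can be recomputed by hand from the confusion matrix above, which is a useful sanity check on how the rows and columns are labelled. A small sketch that treats negative reviews (class 1) as the class of interest; the variable names are illustrative:

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted)
true_neg_hits = 231   # actual negative, predicted negative
missed_neg = 57       # actual negative, predicted positive
false_alarms = 46     # actual positive, predicted negative
true_pos_hits = 274   # actual positive, predicted positive

total = true_neg_hits + missed_neg + false_alarms + true_pos_hits

accuracy = (true_neg_hits + true_pos_hits) / total
recall_neg = true_neg_hits / (true_neg_hits + missed_neg)        # share of negative reviews caught
precision_neg = true_neg_hits / (true_neg_hits + false_alarms)   # share of negative predictions that are right
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

print(f"accuracy={accuracy:.4f} recall={recall_neg:.2f} "
      f"precision={precision_neg:.2f} f1={f1_neg:.2f}")
```

The results agree with the row for class 1 in the report (precision 0.83, recall 0.80, f1-score 0.82) and with the validation score of about 0.8306.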


### TF-IDF & Logistic Regression

In [32]:
# Create a pipeline with TF-IDF and Logistic Regression
pipe_tvec_lr = Pipeline([
    ('tvec', TfidfVectorizer(stop_words='english')),
    ('lr', LogisticRegression(random_state=42))
])

# Search over the following values of hyperparameters:
pipe_tvec_lr_params = {
    'tvec__max_features': [300], #100,200
    'tvec__min_df': [2,3], #2,3 
    'tvec__max_df': [.9,.95], 
#     'tvec__ngram_range':[(1,1),(1,2)],  
    'lr__penalty': ['l2'],
    'lr__C': [.1, 1] #.1, .01
}

# Instantiate GridSearchCV
gs_tvec_lr = GridSearchCV(pipe_tvec_lr, # Objects to optimise
                          param_grid = pipe_tvec_lr_params, # Hyperparameters for tuning
                          cv=10) # 10-fold cross validation

# Fit model on to training data
gs_tvec_lr.fit(X_train, y_train)

# Generate predictions on validation set
tvec_lr_pred = gs_tvec_lr.predict(X_val)

# Print best parameters
print('Best parameters: ', gs_tvec_lr.best_params_)

# Print accuracy scores
print('Best CV score: ', gs_tvec_lr.best_score_)
print('Training score:', gs_tvec_lr.score(X_train, y_train))
print('Validation score:', gs_tvec_lr.score(X_val, y_val))
print('')

# Print classification report and confusion matrix
cmat(y_val, tvec_lr_pred, 'validation set')

Best parameters:  {'lr__C': 1, 'lr__penalty': 'l2', 'tvec__max_df': 0.9, 'tvec__max_features': 300, 'tvec__min_df': 2}
Best CV score:  0.8396545908385618
Training score: 0.8704769736842105
Validation score: 0.8355263157894737

Classification report for validation set
              precision    recall  f1-score   support

           0       0.85      0.83      0.84       320
           1       0.82      0.84      0.83       288

    accuracy                           0.84       608
   macro avg       0.84      0.84      0.84       608
weighted avg       0.84      0.84      0.84       608


Confusion matrix for validation set
                        Predicted Positive Review  Predicted Negative Review
Actual Positive Review                        266                         54
Actual Negative Review                         46                        242
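To get a feel for the cost of each grid search in this section, it helps to count how many parameter combinations are tried. The sketch below enumerates the grid used for the TF-IDF and Logistic Regression pipeline with `itertools.product`; it only counts combinations and does not fit anything:

```python
from itertools import product

# Same grid as pipe_tvec_lr_params above
param_grid = {
    'tvec__max_features': [300],
    'tvec__min_df': [2, 3],
    'tvec__max_df': [0.9, 0.95],
    'lr__penalty': ['l2'],
    'lr__C': [0.1, 1],
}
n_folds = 10  # cv=10 in the GridSearchCV call

# Every combination of one value per hyperparameter
combinations = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]
n_fits = len(combinations) * n_folds

print(len(combinations), n_fits)  # 8 80
```

With the default `refit=True`, `GridSearchCV` then fits the best combination once more on the full training data, which is why `gs_tvec_lr.predict` can be called directly afterwards.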


### Count Vectorizer & Naive Bayes

In [33]:
# Create a pipeline with Count Vectorizer and Naive Bayes
pipe_cvec_nb = Pipeline([
    ('cvec', CountVectorizer(stop_words='english')),
    ('nb', MultinomialNB())
])

# Search over the following values of hyperparameters:
pipe_cvec_nb_params = {
    'cvec__max_features': [500], #200
    'cvec__min_df': [2,3],
    'cvec__max_df': [.9,.95], 
#     'cvec__ngram_range':[(1,1),(1,2)],  
}

# Instantiate GridSearchCV
gs_cvec_nb = GridSearchCV(pipe_cvec_nb, # Objects to optimise
                          param_grid = pipe_cvec_nb_params, # Hyperparameters for tuning
                          cv=10) # 10-fold cross validation

# Fit model on to training data
gs_cvec_nb.fit(X_train, y_train)

# Generate predictions on validation set
cvec_nb_pred = gs_cvec_nb.predict(X_val)

# Print best parameters
print('Best parameters: ', gs_cvec_nb.best_params_)

# Print accuracy scores
print('Best CV score: ', gs_cvec_nb.best_score_)
print('Training score:', gs_cvec_nb.score(X_train, y_train))
print('Validation score:', gs_cvec_nb.score(X_val, y_val))
print('')

# Print classification report and confusion matrix
cmat(y_val, cvec_nb_pred, 'validation set')

Best parameters:  {'cvec__max_df': 0.9, 'cvec__max_features': 500, 'cvec__min_df': 3}
Best CV score:  0.8400593671996223
Training score: 0.8560855263157895
Validation score: 0.8388157894736842

Classification report for validation set
              precision    recall  f1-score   support

           0       0.85      0.84      0.85       320
           1       0.83      0.84      0.83       288

    accuracy                           0.84       608
   macro avg       0.84      0.84      0.84       608
weighted avg       0.84      0.84      0.84       608


Confusion matrix for validation set
                        Predicted Positive Review  Predicted Negative Review
Actual Positive Review                        269                         51
Actual Negative Review                         47                        241


### TF-IDF & Naive Bayes

In [34]:
# Create a pipeline with TF-IDF and Naive Bayes
pipe_tvec_nb = Pipeline([
    ('tvec', TfidfVectorizer(stop_words='english')),
    ('nb', MultinomialNB())
])

# Search over the following values of hyperparameters:
pipe_tvec_nb_params = {
    'tvec__max_features': [500], #200
    'tvec__min_df': [2,3], #
    'tvec__max_df': [.9,.95], 
#     'tvec__ngram_range':[(1,1),(1,2)],  
}

# Instantiate GridSearchCV
gs_tvec_nb = GridSearchCV(pipe_tvec_nb, # Objects to optimise
                        param_grid = pipe_tvec_nb_params, # Hyperparameters for tuning
                        cv=10) # 10-fold cross validation

# Fit model on to training data
gs_tvec_nb.fit(X_train, y_train)

# Generate predictions on validation set
tvec_nb_pred = gs_tvec_nb.predict(X_val)

# Print best parameters
print('Best parameters: ', gs_tvec_nb.best_params_)

# Print accuracy scores
print('Best CV score: ', gs_tvec_nb.best_score_)
print('Training score:', gs_tvec_nb.score(X_train, y_train))
print('Validation score:', gs_tvec_nb.score(X_val, y_val))
print('')

# Print classification report and confusion matrix
cmat(y_val, tvec_nb_pred, 'validation set')

Best parameters:  {'tvec__max_df': 0.9, 'tvec__max_features': 500, 'tvec__min_df': 2}
Best CV score:  0.8380068137354112
Training score: 0.8614309210526315
Validation score: 0.8355263157894737

Classification report for validation set
              precision    recall  f1-score   support

           0       0.84      0.84      0.84       320
           1       0.83      0.83      0.83       288

    accuracy                           0.84       608
   macro avg       0.84      0.84      0.84       608
weighted avg       0.84      0.84      0.84       608


Confusion matrix for validation set
                        Predicted Positive Review  Predicted Negative Review
Actual Positive Review                        270                         50
Actual Negative Review                         50                        238


In [35]:
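# Predict class probabilities for two example reviews
# (raw, unstemmed text is passed here, whereas the model was trained on `content_stem`)
text = 'The product is strong and great!'
probs = gs_tvec_nb.predict_proba([text, 'This product is bad and ugly, slow, hard'])
print('Predicted probabilities:', probs)  # columns follow the class order: [class 0, class 1]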

Predicted probabilities: [[0.88685368 0.11314632]
 [0.5161704  0.4838296 ]]
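Each row of the output holds the probabilities for one review, and the columns follow the classifier's sorted class labels, so column 0 is class 0 (positive) and column 1 is class 1 (negative). A small sketch that turns the printed probabilities into labels; the values are copied from the output above and `class_names` is an illustrative mapping:

```python
# Probabilities copied from the output above: one row per review, columns = [class 0, class 1]
probs = [
    [0.88685368, 0.11314632],
    [0.5161704, 0.4838296],
]
class_names = {0: 'positive', 1: 'negative'}

predictions = []
for row in probs:
    # The predicted class is the column with the highest probability
    predicted = max(range(len(row)), key=row.__getitem__)
    predictions.append(class_names[predicted])
    print(class_names[predicted], round(row[predicted], 3))
```

The second review is classified as positive with a probability of only about 0.52, so the model is close to undecided on it despite the negative wording.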


### Count Vectorizer & SVC

In [36]:
# Create a pipeline with Count Vectorizer and SVC
pipe_cvec_svc = Pipeline([
    ('cvec', CountVectorizer(stop_words='english')),
    ('svc', SVC(random_state=42))
])

# Search over the following values of hyperparameters:
pipe_cvec_svc_params = {
    'cvec__max_features': [300], #200,500
    'cvec__min_df': [2,3], 
    'cvec__max_df': [.9,.95], 
#     'cvec__ngram_range':[(1,1),(1,2)],  
    'svc__kernel': ['linear'], #'poly', 'rbf'
#     'svc__degree': [3],
    'svc__C': [.1]
}

# Instantiate GridSearchCV
gs_cvec_svc = GridSearchCV(pipe_cvec_svc, # Objects to optimise
                        param_grid = pipe_cvec_svc_params, # Hyperparameters for tuning
                        cv=10) # 10-fold cross validation

# Fit model on to training data
gs_cvec_svc.fit(X_train, y_train)

# Generate predictions on validation set
cvec_svc_pred = gs_cvec_svc.predict(X_val)

# Print best parameters
print('Best parameters: ', gs_cvec_svc.best_params_)

# Print accuracy scores
print('Best CV score: ', gs_cvec_svc.best_score_)
print('Training score:', gs_cvec_svc.score(X_train, y_train))
print('Validation score:', gs_cvec_svc.score(X_val, y_val))
print('')

# Print classification report and confusion matrix
cmat(y_val, cvec_svc_pred, 'validation set')

Best parameters:  {'cvec__max_df': 0.9, 'cvec__max_features': 300, 'cvec__min_df': 2, 'svc__C': 0.1, 'svc__kernel': 'linear'}
Best CV score:  0.8347028266882548
Training score: 0.8795230263157895
Validation score: 0.8338815789473685

Classification report for validation set
              precision    recall  f1-score   support

           0       0.82      0.87      0.85       320
           1       0.85      0.80      0.82       288

    accuracy                           0.83       608
   macro avg       0.83      0.83      0.83       608
weighted avg       0.83      0.83      0.83       608


Confusion matrix for validation set
                        Predicted Positive Review  Predicted Negative Review
Actual Positive Review                        278                         42
Actual Negative Review                         59                        229


### TF-IDF & SVC

In [37]:
# Create a pipeline with TF-IDF Vectorizer and SVC
pipe_tvec_svc = Pipeline([
    ('tvec', TfidfVectorizer(stop_words='english')),
    ('svc', SVC(probability=True, random_state=42)) 
])

# Search over the following values of hyperparameters:
pipe_tvec_svc_params = {
    'tvec__max_features': [800], #200,500
    'tvec__min_df': [2,3], 
    'tvec__max_df': [.9,.95], 
#     'tvec__ngram_range':[(1,1),(1,2)],  
    'svc__kernel': ['linear'], #'poly', 'rbf'
#     'svc__degree': [3],
    'svc__C': [.1] # .01
}

# Instantiate GridSearchCV
gs_tvec_svc = GridSearchCV(pipe_tvec_svc, # Objects to optimise
                          param_grid = pipe_tvec_svc_params, # Hyperparameters for tuning
                          cv=10) # 10-fold cross validation

# Fit model on to training data
gs_tvec_svc.fit(X_train, y_train)

# Generate predictions on validation set
tvec_svc_pred = gs_tvec_svc.predict(X_val)

# Print best parameters
print('Best parameters: ', gs_tvec_svc.best_params_)

# Print accuracy scores
print('Best CV score: ', gs_tvec_svc.best_score_)
print('Training score:', gs_tvec_svc.score(X_train, y_train))
print('Validation score:', gs_tvec_svc.score(X_val, y_val))
print('')

# Print classification report and confusion matrix
cmat(y_val, tvec_svc_pred, 'validation set')

Best parameters:  {'svc__C': 0.1, 'svc__kernel': 'linear', 'tvec__max_df': 0.9, 'tvec__max_features': 800, 'tvec__min_df': 2}
Best CV score:  0.8380068137354112
Training score: 0.8630756578947368
Validation score: 0.84375

Classification report for validation set
              precision    recall  f1-score   support

           0       0.88      0.81      0.85       320
           1       0.81      0.88      0.84       288

    accuracy                           0.84       608
   macro avg       0.85      0.85      0.84       608
weighted avg       0.85      0.84      0.84       608


Confusion matrix for validation set
                        Predicted Positive Review  Predicted Negative Review
Actual Positive Review                        259                         61
Actual Negative Review                         34                        254


## Evaluate Production Model on Test Set

Finally, we will evaluate our production model, the TF-IDF & Naive Bayes pipeline (`gs_tvec_nb`), on the test set.

In [39]:
# Read test set into a dataframe
test = pd.read_csv('../SentimentModel/clean_test.csv')

In [40]:
# There are 760 documents in our test set
test.shape

(760, 4)

In [41]:
# The class balance in our test set matches that of our training set because the split was stratified
test['target'].value_counts(normalize=True)

target
0    0.526316
1    0.473684
Name: proportion, dtype: float64

In [42]:
# Establish our X and y variables
X_test = test['content_stem']
y_test = test['target']

In [43]:
# Generate predictions on test set
test_pred = gs_tvec_nb.predict(X_test)

In [44]:
print('Evaluation metrics for test set')
print('')
print('Accuracy score: ', accuracy_score(y_test, test_pred))
print('')

# Print classification report and confusion matrix
cmat(y_test, test_pred, 'test set')

Evaluation metrics for test set

Accuracy score:  0.8592105263157894

Classification report for test set
              precision    recall  f1-score   support

           0       0.87      0.87      0.87       400
           1       0.85      0.85      0.85       360

    accuracy                           0.86       760
   macro avg       0.86      0.86      0.86       760
weighted avg       0.86      0.86      0.86       760


Confusion matrix for test set
                        Predicted Positive Review  Predicted Negative Review
Actual Positive Review                        347                         53
Actual Negative Review                         54                        306


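Given that our production model achieves an accuracy of 0.86 on the test set, with a recall of 0.85 on negative reviews and 0.87 on positive reviews, and that these scores are in line with its validation accuracy of 0.84, we can conclude that the model generalises well to unseen data.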