## Sentiment Classification AU Reviews Data (BOW, non-Deep Learning)

This notebook covers two good approaches to perform sentiment classification - Naive Bayes and Logistic Regression. We will train AU reviews data on both.

As a rule of thumb, reviews rated 3 stars and above are treated as **positive** (label `0`), and reviews rated below 3 stars as **negative** (label `1`).

In [1]:
%pip install gensim

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
import gzip
import json
import matplotlib.pyplot as plt
import numpy as np
import re
import random
import pandas as pd
import seaborn as sns
import gensim
import spacy
import nltk
import gensim.downloader
from collections import Counter, defaultdict
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB
from sklearn.metrics import f1_score, classification_report, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
from spacy_langdetect import LanguageDetector
from spacy.language import Language
from gensim.models.word2vec import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from tqdm import tqdm

In [3]:
RANDOM_SEED = 33

In [4]:
reviews = pd.read_pickle("assets/au_reviews.pkl")
reviews.head()

Unnamed: 0,date,review,rating,app
0,2020-07-11,I’ve been a DoorDash user for a while now and ...,3,DoorDash
1,2020-05-26,I ordered a meal for delivery and after 1:30 I...,1,DoorDash
2,2020-09-03,"I have gotten three orders from Doordash, all ...",1,DoorDash
3,2021-08-13,The delay and customer support I experienced w...,1,DoorDash
4,2021-11-01,I have had countless problems using DoorDash s...,1,DoorDash


In [5]:
reviews['label'] = np.where(reviews['rating'] >= 3, 0, 1)

In [6]:
reviews.head()

Unnamed: 0,date,review,rating,app,label
0,2020-07-11,I’ve been a DoorDash user for a while now and ...,3,DoorDash,0
1,2020-05-26,I ordered a meal for delivery and after 1:30 I...,1,DoorDash,1
2,2020-09-03,"I have gotten three orders from Doordash, all ...",1,DoorDash,1
3,2021-08-13,The delay and customer support I experienced w...,1,DoorDash,1
4,2021-11-01,I have had countless problems using DoorDash s...,1,DoorDash,1


## 1. Data Processing

Check the dataset size:

In [7]:
print(len(reviews))

626377


And the type of apps:

In [8]:
app_list = list(reviews['app'].unique())
app_list

['DoorDash', 'UberEats', 'Deliveroo', 'MenuLog', 'Grubhub']

Let's also get a sense of our dataset's balance:

In [9]:
reviews['label'].value_counts(normalize=True)

0    0.662462
1    0.337538
Name: label, dtype: float64

In [10]:
# By app

for app in app_list:
    print(reviews[reviews['app'] == app]['label'].value_counts(normalize=True))

0    0.760423
1    0.239577
Name: label, dtype: float64
0    0.648032
1    0.351968
Name: label, dtype: float64
0    0.651189
1    0.348811
Name: label, dtype: float64
0    0.682461
1    0.317539
Name: label, dtype: float64
0    0.56053
1    0.43947
Name: label, dtype: float64
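
The loop above prints five unlabeled series, so it is hard to tell which output belongs to which app. A minimal sketch of an alternative, assuming the same `app` and `label` columns; the toy frame below is hypothetical and only stands in for `reviews`:

```python
import pandas as pd

# Hypothetical toy data standing in for the real `reviews` frame.
toy = pd.DataFrame({
    "app": ["A", "A", "A", "A", "B", "B"],
    "label": [0, 0, 0, 1, 0, 1],
})

# Because label is 0/1, its mean per app is the share of label 1 (negative).
negative_share = toy.groupby("app")["label"].mean()
print(negative_share)
```

On the real data, `reviews.groupby("app")["label"].mean()` should give one labeled number per app.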


The share of positive reviews varies somewhat between apps, from about 56% to 76%, but positive reviews are the majority for every app. Overall, the dataset is imbalanced, with positive reviews making up about 66% of it. Let's also check for null values.

In [11]:
reviews.isnull().sum()

date       0
review    54
rating     0
app        0
label      0
dtype: int64

In [12]:
reviews = reviews.dropna()

In [13]:
df_proc = reviews.copy()
df_proc.drop(columns=['date', 'rating', 'app'], inplace=True)
df_proc.head()

Unnamed: 0,review,label
0,I’ve been a DoorDash user for a while now and ...,0
1,I ordered a meal for delivery and after 1:30 I...,1
2,"I have gotten three orders from Doordash, all ...",1
3,The delay and customer support I experienced w...,1
4,I have had countless problems using DoorDash s...,1


In [14]:
X = df_proc['review']
y = df_proc['label']

## 2. Train Corpus on Word2Vec Model

It's unclear to me whether we should create the word2vec model on the entire corpus, or just the train dataset. From 655 class it seems the entire corpus was used, so let's give it a shot.

Also referencing: https://github.com/nadbordrozd/blog_stuff/blob/master/classification_w2v/benchmarking_python3.ipynb

In [15]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/meln/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
#https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
stop_words = set(stopwords.words('english'))

In [17]:
all_tokenized_reviews = []

for review in tqdm(X):
    tokens = [token for token in re.findall(r'\w+', review) if token not in stop_words]
    all_tokenized_reviews.append(tokens)

100%|██████████| 626323/626323 [00:05<00:00, 107859.88it/s]


In [18]:
full_model = Word2Vec(sentences=all_tokenized_reviews, vector_size=100, 
                      window=2, min_count=100, workers=4, seed=RANDOM_SEED)

In [19]:
full_model.save("word2vec_au.model")

In [42]:
full_model_kv = full_model.wv

We will split the dataset into `train`, `test`, and `dev`, with 80%, 10%, 10% ratio, respectively.

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)
X_test, X_dev, y_test, y_dev = train_test_split(X_test, y_test, test_size=0.5, random_state=RANDOM_SEED)

In [21]:
len(X_train)

501058

In [22]:
len(X_dev)

62633

In [23]:
len(X_test)

62632
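
The three sizes printed above can be reproduced with plain arithmetic. This is a sketch that assumes a fractional `test_size` is rounded up to whole rows, which is consistent with the numbers shown; the variable names are illustrative only.

```python
import math

n_total = 626323  # rows left after dropping the 54 null reviews

# Assumption: a fractional test_size is rounded up to whole rows.
n_holdout = math.ceil(0.2 * n_total)
n_train = n_total - n_holdout

# Second split: the rounded-up half is the part assigned to X_dev.
n_dev = math.ceil(0.5 * n_holdout)
n_test = n_holdout - n_dev

print(n_train, n_dev, n_test)
```

This also explains why `X_dev` ends up one row larger than `X_test`: the held-out 20% is an odd number of rows.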

In [24]:
X_train.iloc[0]

'Nice apk but some restaurants supply bad quality food'

In [25]:
X_test.iloc[0]

'dope food app'

## 3. Word Embeddings Approach on Logistic Regression

This section explores the use of word embeddings as feature extraction. We'll be working with dense representations of documents instead of the bag-of-words representations we used earlier. To do this, we'll use the average (or mean) word vector of a document and classify from those representations.

As a first step, let's tokenize the reviews using regular expressions. Since we're going to be computing an average word vector, let's also remove stop words, using NLTK's list of English stop words. These words shouldn't affect our classification decision, so removing them avoids any noise they might add to the average. Note that all of the stop words in NLTK's list are lower-cased, so stop words in the reviews that are capitalized (for example "The" at the start of a sentence) will not match without some further processing.
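
The case-sensitivity caveat can be illustrated with a small sketch. The stop-word set and the `tokenize` helper below are hypothetical; the notebook itself filters against NLTK's list without lowercasing.

```python
import re

# Hypothetical mini stop-word set; the notebook uses NLTK's English list.
toy_stop_words = {"the", "is", "but", "some"}

def tokenize(text, stop_words, lowercase=True):
    """Split on word characters and drop stop words."""
    tokens = re.findall(r"\w+", text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in stop_words]

# Without lowercasing, the capitalized "The" slips past the filter.
print(tokenize("The food is great", toy_stop_words, lowercase=False))
print(tokenize("The food is great", toy_stop_words))
```

Lowercasing also changes which tokens match an embedding vocabulary, since some pre-trained vocabularies are case-sensitive, so it is worth checking the effect on the dev set rather than assuming it helps.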

We'll be using our own corpus to train a Word2Vec model. We'll also be using a few pre-trained embeddings: `word2vec-google-news-300`, `glove-wiki-gigaword-100`, and `glove-wiki-gigaword-300`.

In [26]:
gnews = gensim.downloader.load('word2vec-google-news-300')

In [28]:
gnews.vector_size

300

In [35]:
glove_small = gensim.downloader.load('glove-wiki-gigaword-100')



In [36]:
glove_small.vector_size

100

In [37]:
glove_big = gensim.downloader.load('glove-wiki-gigaword-300')



In [38]:
glove_big.vector_size

300

In [29]:
def generate_dense_features(tokenized_texts, word_vectors): 
    #HINT: Create an empty list to hold your results 
        #HINT:Iterate through each item in tokenized_text
            #HINT:Create a list that contains current item(s) if found in word_vectors
            #HINT:if the length of this list is greater than zero:
                #HINT:We set this as a feature, this is done by using numpy's mean function and append it to our results list 
            #HINT:Otherwise: create a vector of numpy zeros using word_vectors.vector_size as the parameter and append it to the results list
    #HINT:Return the results list as a numpy array (data type)

    res = []
    for token in tokenized_texts:
        items_in_vocab = [item for item in token if item in word_vectors]
        if len(items_in_vocab) > 0:
            res.append(np.mean(word_vectors[items_in_vocab], axis=0))
        else:
            res.append(np.zeros(word_vectors.vector_size))
    return np.array(res)
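
To see what the function returns, here is a toy run. `ToyVectors` is a hypothetical stand-in that only mimics the three things the function needs from a gensim vectors object (membership test, list lookup, `vector_size`); the function body is repeated so the sketch runs on its own.

```python
import numpy as np

class ToyVectors:
    """Hypothetical stand-in for a gensim KeyedVectors object."""
    def __init__(self, table):
        self.table = {w: np.asarray(v, dtype=float) for w, v in table.items()}
        self.vector_size = len(next(iter(self.table.values())))

    def __contains__(self, word):
        return word in self.table

    def __getitem__(self, words):
        # Mirror the list lookup used in the notebook: return a 2-D array.
        return np.array([self.table[w] for w in words])

def generate_dense_features(tokenized_texts, word_vectors):
    # Same logic as the notebook's function, repeated so this runs standalone.
    res = []
    for tokens in tokenized_texts:
        in_vocab = [t for t in tokens if t in word_vectors]
        if in_vocab:
            res.append(np.mean(word_vectors[in_vocab], axis=0))
        else:
            res.append(np.zeros(word_vectors.vector_size))
    return np.array(res)

vectors = ToyVectors({"good": [1.0, 0.0], "food": [0.0, 1.0]})
features = generate_dense_features([["good", "food", "apk"], ["apk"]], vectors)
print(features)
```

The first document averages the two known words and ignores the unknown `apk`; the second has no known words and falls back to a zero vector.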

In [31]:
tokenized_train_items = []
for review in tqdm(X_train):
    tokens = [token for token in re.findall(r'\w+', review) if token not in stop_words]
    tokenized_train_items.append(tokens)

100%|██████████| 501058/501058 [00:04<00:00, 100831.14it/s]


In [32]:
len(set(tokenized_train_items[0]))

7

In [33]:
tokenized_train_items[0]

['Nice', 'apk', 'restaurants', 'supply', 'bad', 'quality', 'food']
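
Tokens that are missing from an embedding's vocabulary are silently dropped before averaging, so it can help to know how many tokens actually get a vector. A rough sketch, where `vocab_coverage` and `toy_vocab` are hypothetical names and the toy vocabulary is made up:

```python
def vocab_coverage(tokenized_texts, vocab):
    """Fraction of all tokens that are present in `vocab`."""
    total = sum(len(tokens) for tokens in tokenized_texts)
    if total == 0:
        return 0.0
    known = sum(1 for tokens in tokenized_texts for t in tokens if t in vocab)
    return known / total

# Hypothetical vocabulary that happens to lack "apk".
toy_vocab = {"Nice", "restaurants", "supply", "bad", "quality", "food"}
sample = [["Nice", "apk", "restaurants", "supply", "bad", "quality", "food"]]
print(vocab_coverage(sample, toy_vocab))
```

Since the check only relies on the `in` operator, the same helper should work with the vector objects used in this notebook in place of `toy_vocab`.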

In [34]:
tokenized_dev_items = []
for review in tqdm(X_dev):
    tokens = [token for token in re.findall(r'\w+', review) if token not in stop_words]
    tokenized_dev_items.append(tokens)

100%|██████████| 62633/62633 [00:00<00:00, 136940.28it/s]


In [39]:
def train_model(clf):
    print("_" * 80)
    print("Training: ")
    clf.fit(X_train_wp, y_train)
    y_dev_pred = clf.predict(X_dev_wp)
    
    score = accuracy_score(y_dev, y_dev_pred)
    print("accuracy:   %0.3f" % score)
    
    print("classification report:")
    print(classification_report(y_dev, y_dev_pred))
    
    print("confusion matrix:")
    print(confusion_matrix(y_dev, y_dev_pred))
    print("Training Complete")
    print()
    
    clf_descr = str(clf).split("(")[0]
    return clf_descr, score, y_dev_pred

It's training time!

### 3.1 Word2Vec on reviews corpus

In [43]:
X_train_wp = generate_dense_features(tokenized_train_items, full_model_kv)

In [44]:
print(X_train_wp.shape)

(501058, 100)


In [45]:
X_dev_wp = generate_dense_features(tokenized_dev_items, full_model_kv)

In [46]:
print(X_dev_wp.shape)

(62633, 100)


In [47]:
preds = {} # A dict to store our dev set predictions

for clf, name in (
    (DummyClassifier(strategy='uniform', random_state=RANDOM_SEED), "Uniform Classifier"),
    (DummyClassifier(strategy='most_frequent', random_state=RANDOM_SEED), "Most Frequent Classifier"),
    (LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000, random_state=RANDOM_SEED), "Logistic Regression")
):
    print("=" * 80)
    print("Training Results")
    print(name)
    mod = train_model(clf)
    preds[name] = mod[2]

Training Results
Uniform Classifier
________________________________________________________________________________
Training: 
accuracy:   0.499
classification report:
              precision    recall  f1-score   support

           0       0.66      0.50      0.57     41413
           1       0.34      0.50      0.40     21220

    accuracy                           0.50     62633
   macro avg       0.50      0.50      0.49     62633
weighted avg       0.55      0.50      0.51     62633

confusion matrix:
[[20678 20735]
 [10627 10593]]
Training Complete

Training Results
Most Frequent Classifier
________________________________________________________________________________
Training: 
accuracy:   0.661
classification report:




              precision    recall  f1-score   support

           0       0.66      1.00      0.80     41413
           1       0.00      0.00      0.00     21220

    accuracy                           0.66     62633
   macro avg       0.33      0.50      0.40     62633
weighted avg       0.44      0.66      0.53     62633

confusion matrix:
[[41413     0]
 [21220     0]]
Training Complete

Training Results
Logistic Regression
________________________________________________________________________________
Training: 
accuracy:   0.891
classification report:
              precision    recall  f1-score   support

           0       0.96      0.87      0.91     41413
           1       0.79      0.93      0.85     21220

    accuracy                           0.89     62633
   macro avg       0.87      0.90      0.88     62633
weighted avg       0.90      0.89      0.89     62633

confusion matrix:
[[36041  5372]
 [ 1475 19745]]
Training Complete



Pretty good result!
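
As a sanity check, the headline numbers in the Logistic Regression report can be recomputed from its confusion matrix alone. This is a small sketch using only the four counts printed above:

```python
import numpy as np

# Dev-set confusion matrix printed above (rows: actual, columns: predicted).
cm = np.array([[36041, 5372],
               [1475, 19745]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all actually in that class
f1 = 2 * precision * recall / (precision + recall)

print(round(float(accuracy), 3), f1.round(2), round(float(f1.mean()), 2))
```

The unweighted mean of the two per-class F1 values is the "macro avg" row of the report, which is the figure used to compare embeddings in the summary.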

In [48]:
# Create a dataframe for mis-classifications
def create_mis_classification_df(name):
    mis_class = pd.DataFrame(X_dev)
    mis_class['Actual'] = y_dev
    mis_class['Predicted'] = preds[name]
    mis_class = mis_class[mis_class['Actual'] != mis_class['Predicted']]
    return mis_class

In [49]:
mis_class_logreg = create_mis_classification_df('Logistic Regression')

In [50]:
mis_class_logreg.sample(10).values

array([["Orders are generally not as advertised, recently vehicles don't show on map. Seema it has gone down quality.",
        0, 1],
       ['Not everywhere in my area yet but getting there and very nice app',
        0, 1],
       ['I got my chic FIL e', 0, 1],
       ["I no technophobe but even I could navigate through this app and for someone of my era that's saying something!",
        0, 1],
       ["3 star cause why there is not internet banking ??? I don't like to add cards in apps ... Internet banking is more handy for me but it's not available..",
        0, 1],
       ['Lately the app has been playing up, pictures and stores wont show up. Also feel like the prices of delivery have gone up which is why i havnt used the app as much lately.',
        0, 1],
       ["The app runs perfectly and I've never had a problem with the actual app, my only problem is the massive price increase on the food compared to at the restaurant, it's almost double the price excluding delivery and 

### 3.2 Google News

In [51]:
X_train_wp = generate_dense_features(tokenized_train_items, gnews)
X_dev_wp = generate_dense_features(tokenized_dev_items, gnews)

In [55]:
preds = {} # A dict to store our dev set predictions
print("=" * 80)
print("Training Results")
print("Logistic Regression")
clf = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000, random_state=RANDOM_SEED)
mod = train_model(clf)
preds["Logistic Regression"] = mod[2]  # explicit key; `name` was a leftover loop variable

Training Results
Logistic Regression
________________________________________________________________________________
Training: 
accuracy:   0.879
classification report:
              precision    recall  f1-score   support

           0       0.96      0.85      0.90     41413
           1       0.77      0.93      0.84     21220

    accuracy                           0.88     62633
   macro avg       0.86      0.89      0.87     62633
weighted avg       0.89      0.88      0.88     62633

confusion matrix:
[[35404  6009]
 [ 1570 19650]]
Training Complete



In [56]:
mis_class_logreg_gnews = create_mis_classification_df('Logistic Regression')

In [57]:
mis_class_logreg_gnews.sample(10).values

array([['Very difficult to use. Frustrating and limited choice.', 1, 0],
       ["I changed my phone and number..I logged onto my old phone, changed my details.. And when at checkout.. It won't let me 🤣",
        0, 1],
       ["Nicely delivered, but overpriced. I wish Uber eats would stop charging bs fees. A delivery fee and a service fee, forgot to mention on top of that you gotta pay a tip. Seriously ! I'm fine with tips, but everything else is bs.",
        0, 1],
       ['Had trouble adding payment method', 0, 1],
       ["It charged me twice for one order! I had to jump through hoops to get a refund and I was refunded everything but the tip! Why should I be charged a tip fee for something that never arrived? I won't be ordering through uber eats again",
        0, 1],
       ['I do like this app, but for some reason, in some cases, the app menu does not offer all of the items (or the customizations) that you can order directly from the restaurant website.',
        0, 1],
    

### 3.3 Glove Small

In [58]:
X_train_wp = generate_dense_features(tokenized_train_items, glove_small)
X_dev_wp = generate_dense_features(tokenized_dev_items, glove_small)

In [59]:
preds = {} # A dict to store our dev set predictions
print("=" * 80)
print("Training Results")
print("Logistic Regression")
clf = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000, random_state=RANDOM_SEED)
mod = train_model(clf)
preds["Logistic Regression"] = mod[2]  # explicit key; `name` was a leftover loop variable

Training Results
Logistic Regression
________________________________________________________________________________
Training: 
accuracy:   0.826
classification report:
              precision    recall  f1-score   support

           0       0.92      0.81      0.86     41413
           1       0.70      0.86      0.77     21220

    accuracy                           0.83     62633
   macro avg       0.81      0.84      0.82     62633
weighted avg       0.84      0.83      0.83     62633

confusion matrix:
[[33424  7989]
 [ 2893 18327]]
Training Complete



In [60]:
mis_class_logreg_glove_small = create_mis_classification_df('Logistic Regression')

In [61]:
mis_class_logreg_glove_small.sample(10).values

array([['Excelente opcion para los que no tienen tiempo de salir a comer. Ordenar es extremadamente sencillo.. Ya no mas orden incorrecta porque la persona en el telefono no entendió bien!!',
        0, 1],
       ["It is so hard to use, my home address/delivery details get deleted with each new update. I have seen my rating drop because the delivery person cannot access my apartment as they need a buzzer code which each update removes. The GPS doesn't always work so tracking orders is hard. I didn't realize food was delivered until I received the noticed that it had as I couldn't track it.",
        0, 1],
       ['Works without fail. Brilliant for lazy buggers like me or just those that are time deprived. I order on the way home from work to a usual short wait.',
        0, 1],
       ["If you want an app that'll pester you, sometimes multiple times a day, with useless adds for their own services, then you're in the right place. Many of the promotions they are 'offering' are only go

### 3.4 Glove Large

In [62]:
X_train_wp = generate_dense_features(tokenized_train_items, glove_big)
X_dev_wp = generate_dense_features(tokenized_dev_items, glove_big)

In [63]:
preds = {} # A dict to store our dev set predictions
print("=" * 80)
print("Training Results")
print("Logistic Regression")
clf = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000, random_state=RANDOM_SEED)
mod = train_model(clf)
preds["Logistic Regression"] = mod[2]  # explicit key; `name` was a leftover loop variable

Training Results
Logistic Regression
________________________________________________________________________________
Training: 
accuracy:   0.849
classification report:
              precision    recall  f1-score   support

           0       0.93      0.83      0.88     41413
           1       0.73      0.88      0.80     21220

    accuracy                           0.85     62633
   macro avg       0.83      0.86      0.84     62633
weighted avg       0.86      0.85      0.85     62633

confusion matrix:
[[34405  7008]
 [ 2477 18743]]
Training Complete



In [64]:
mis_class_logreg_glove_big = create_mis_classification_df('Logistic Regression')

In [65]:
mis_class_logreg_glove_big.sample(10).values

array([["I ordered bread today from Cobb's bakery & was basically sent stale stuff! When I 1st complained to uber eats they said they couldn't give me a refund because I didn't provide enough information! I don't know what else I was supposed to do! I tried to reach out again & I've been told their team has to review my complaint! That was over 2 hours ago! How long does it take to get a refund! Got a feeling I've getting screwed!",
        0, 1],
       ['It has been a lifesaver this year. And the couple times there was a error, a missing item. They refunded me right away. Love it when they offer specials!',
        0, 1],
       ['need to improve on delivery time', 1, 0],
       ["Couldn't any more info.. UP?", 1, 0],
       ["I'm getting fat. Thanks Uber Eats! :)", 0, 1],
       ['Certain items are not included in some store/restaurants menus. I don’t know if this is an issue caused by Menulog or the individual stores, but I find it annoying either way.',
        0, 1],
       ['t

### 3.5 Summary

Perhaps unsurprisingly, the model achieved the highest macro-average F1-score (0.88) with the Word2Vec model trained on the reviews corpus. However, it's also striking that the pre-trained embeddings did nearly as well: Google News achieved a macro F1-score of 0.87, while GloVe Big achieved 0.84 and GloVe Small 0.82. While this is still inferior to the bag-of-words approach, it shows just how effective the word embeddings method can be.

Unfortunately, it is a lot harder to understand the mis-classifications. Except for those reviews where the rating given isn't an accurate reflection of the sentiment, it's hard to understand the rest.
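
For reference, the dev-set scores from sections 3.1 to 3.4 can be collected in one place. The numbers are copied from the reports above; the dictionary layout and display names are just one possible way to organize them.

```python
# Dev-set scores copied from the classification reports in sections 3.1 to 3.4.
results = {
    "Word2Vec (reviews corpus)": {"accuracy": 0.891, "macro_f1": 0.88},
    "Google News": {"accuracy": 0.879, "macro_f1": 0.87},
    "GloVe small (100d)": {"accuracy": 0.826, "macro_f1": 0.82},
    "GloVe big (300d)": {"accuracy": 0.849, "macro_f1": 0.84},
}

# Rank the embeddings by macro-average F1, best first.
ranked = sorted(results.items(), key=lambda kv: kv[1]["macro_f1"], reverse=True)
for name, scores in ranked:
    print(f"{name:<28} acc={scores['accuracy']:.3f}  macro-F1={scores['macro_f1']:.2f}")
```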

### 3.6 Tune Logistic Regression Params

We'll try tuning Logistic Regression's `C` parameter again, as we did for BOW. I really like the fact that the Google News embeddings did so well without any training on our corpus, so let's use them as the baseline.

In [66]:
X_train_wp = generate_dense_features(tokenized_train_items, gnews)
X_dev_wp = generate_dense_features(tokenized_dev_items, gnews)

In [None]:
clf = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=3000, random_state=RANDOM_SEED)
param_grid = {'C' : np.logspace(-4, 4, 20)}


print("=" * 80)
print("LogReg Grid Search")
clf = GridSearchCV(clf, param_grid = param_grid, cv = 5, verbose=True, n_jobs=-1)
best_clf = clf.fit(X_train_wp, y_train)
print(clf.best_params_)
y_dev_pred = clf.predict(X_dev_wp)
score = accuracy_score(y_dev, y_dev_pred)
print("accuracy:   %0.3f" % score)
print("classification report:")
print(classification_report(y_dev, y_dev_pred))
print("confusion matrix:")
print(confusion_matrix(y_dev, y_dev_pred))
print("Training Complete")
print()

LogReg Grid Search
Fitting 5 folds for each of 20 candidates, totalling 100 fits
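
To make the search space concrete, the grid used above can be inspected directly. A small sketch using the same `np.logspace` call; `c_grid` and `step` are illustrative names:

```python
import numpy as np

c_grid = np.logspace(-4, 4, 20)  # same call as in the grid search cell

print(len(c_grid), c_grid[0], c_grid[-1])

# Neighbouring values differ by a constant factor of 10 ** (8 / 19).
step = c_grid[1] / c_grid[0]
print(step)
```

The 20 candidates times 5 folds give the 100 fits reported in the output. In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so the small end of this grid means strong regularization.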
