## Sentiment Classification AU Reviews Data (WE, non-Deep Learning)

This notebook applies word embeddings (WE) to sentiment classification of the AU reviews data, using Logistic Regression as the classifier and two dummy classifiers as baselines.

As a rule of thumb, reviews rated 3 stars and above are treated as **positive** (label `0`), and reviews rated below 3 stars as **negative** (label `1`).

In [1]:
%pip install gensim

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
import gzip
import json
import matplotlib.pyplot as plt
import numpy as np
import re
import random
import pandas as pd
import seaborn as sns
import gensim
import spacy
import nltk
import gensim.downloader
from collections import Counter, defaultdict
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB
from sklearn.metrics import f1_score, classification_report, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
from spacy_langdetect import LanguageDetector
from spacy.language import Language
from gensim.models.word2vec import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from tqdm import tqdm

In [3]:
RANDOM_SEED = 33

In [4]:
reviews = pd.read_pickle("assets/au_reviews.pkl")
reviews.head()

Unnamed: 0,date,review,rating,app
0,2020-07-11,I’ve been a DoorDash user for a while now and ...,3,DoorDash
1,2020-05-26,I ordered a meal for delivery and after 1:30 I...,1,DoorDash
2,2020-09-03,"I have gotten three orders from Doordash, all ...",1,DoorDash
3,2021-08-13,The delay and customer support I experienced w...,1,DoorDash
4,2021-11-01,I have had countless problems using DoorDash s...,1,DoorDash


In [5]:
reviews['label'] = np.where(reviews['rating'] >= 3, 0, 1)

In [6]:
reviews.head()

Unnamed: 0,date,review,rating,app,label
0,2020-07-11,I’ve been a DoorDash user for a while now and ...,3,DoorDash,0
1,2020-05-26,I ordered a meal for delivery and after 1:30 I...,1,DoorDash,1
2,2020-09-03,"I have gotten three orders from Doordash, all ...",1,DoorDash,1
3,2021-08-13,The delay and customer support I experienced w...,1,DoorDash,1
4,2021-11-01,I have had countless problems using DoorDash s...,1,DoorDash,1


## 1. Data Processing

Check the dataset size:

In [7]:
print(len(reviews))

626377


And the type of apps:

In [8]:
app_list = list(reviews['app'].unique())
app_list

['DoorDash', 'UberEats', 'Deliveroo', 'MenuLog', 'Grubhub']

Let's also get a sense of our dataset's balance:

In [9]:
reviews['label'].value_counts(normalize=True)

0    0.662462
1    0.337538
Name: label, dtype: float64

In [10]:
# By app

for app in app_list:
    print(reviews[reviews['app'] == app]['label'].value_counts(normalize=True))

0    0.760423
1    0.239577
Name: label, dtype: float64
0    0.648032
1    0.351968
Name: label, dtype: float64
0    0.651189
1    0.348811
Name: label, dtype: float64
0    0.682461
1    0.317539
Name: label, dtype: float64
0    0.56053
1    0.43947
Name: label, dtype: float64
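The per-app loop above prints one unlabeled block per app, which makes the numbers hard to match to app names. A minimal sketch of a more compact alternative, using a made-up toy frame in place of `reviews` (the rows and app names `A` and `B` are hypothetical): because the label is 0 or 1, the mean of `label` within each app is the share of negative reviews.

```python
import pandas as pd

# Toy stand-in for the `reviews` frame; these rows are invented for illustration.
toy = pd.DataFrame({
    "app":   ["A", "A", "A", "A", "B", "B"],
    "label": [0, 0, 0, 1, 0, 1],
})

# label == 1 means negative, so the group mean is the negative share per app.
negative_share = toy.groupby("app")["label"].mean()
positive_share = 1 - negative_share

print(pd.DataFrame({"positive": positive_share, "negative": negative_share}))
```

The result is indexed by app name, so each proportion stays attached to the app it belongs to.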


The class split varies by app, with positive reviews ranging from about 56% to 76%. Overall, the dataset is imbalanced, with positive reviews making up about 66% of it. Let's also check for null values.

In [11]:
reviews.isnull().sum()

date       0
review    54
rating     0
app        0
label      0
dtype: int64

In [12]:
reviews = reviews.dropna()

In [13]:
df_proc = reviews.copy()
df_proc.drop(columns=['date', 'rating', 'app'], inplace=True)
df_proc.head()

Unnamed: 0,review,label
0,I’ve been a DoorDash user for a while now and ...,0
1,I ordered a meal for delivery and after 1:30 I...,1
2,"I have gotten three orders from Doordash, all ...",1
3,The delay and customer support I experienced w...,1
4,I have had countless problems using DoorDash s...,1


In [14]:
X = df_proc['review']
y = df_proc['label']

## 2. Train Corpus on Word2Vec Model

It's unclear to me whether we should create the word2vec model on the entire corpus, or just the train dataset. From 655 class it seems the entire corpus was used, so let's give it a shot.

Also referencing: https://github.com/nadbordrozd/blog_stuff/blob/master/classification_w2v/benchmarking_python3.ipynb

In [15]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/meln/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
#https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
stop_words = set(stopwords.words('english'))

In [17]:
all_tokenized_reviews = []

for review in tqdm(X):
    tokens = [token for token in re.findall(r'\w+', review) if token not in stop_words]
    all_tokenized_reviews.append(tokens)

100%|██████████| 626323/626323 [00:05<00:00, 110167.67it/s]


In [18]:
full_model = Word2Vec(sentences=all_tokenized_reviews, vector_size=100, 
                      window=2, min_count=100, workers=4, seed=RANDOM_SEED)

In [19]:
full_model.save("word2vec_au.model")

In [20]:
full_model_kv = full_model.wv

We will split the dataset into `train`, `test`, and `dev` sets in an 80%/10%/10% ratio.

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)
X_test, X_dev, y_test, y_dev = train_test_split(X_test, y_test, test_size=0.5, random_state=RANDOM_SEED)

In [22]:
len(X_train)

501058

In [23]:
len(X_dev)

62633

In [24]:
len(X_test)

62632
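The three sizes above follow from applying `test_size=0.2` and then `test_size=0.5` to the 626,323 non-null reviews. A small arithmetic sketch, assuming scikit-learn rounds the held-out portion up and gives the remainder to the other side of the split:

```python
import math

n_total = 626323  # rows left after dropping null reviews

# First split: test_size=0.2 holds out ceil(0.2 * n_total) rows.
n_holdout = math.ceil(0.2 * n_total)
n_train = n_total - n_holdout

# Second split: test_size=0.5 sends ceil(0.5 * n_holdout) rows to dev
# (the "test" side of that call) and leaves the rest as the final test set.
n_dev = math.ceil(0.5 * n_holdout)
n_test = n_holdout - n_dev

print(n_train, n_dev, n_test)
```

This reproduces the one-row difference between the dev and test sets.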

In [25]:
X_train.iloc[0]

'Nice apk but some restaurants supply bad quality food'

In [26]:
X_test.iloc[0]

'dope food app'

## 3. Word Embeddings Approach on Logistic Regression

This section explores the use of word embeddings as feature extraction. We'll be working with dense representations of documents instead of the bag-of-words representations we used earlier. To do this, we'll use the average (or mean) word vector of a document and classify from those representations.
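To make the averaging concrete, here is a minimal sketch with hypothetical 3-dimensional embeddings (the `toy_vectors` dictionary and `mean_vector` helper are illustrative names, not part of this notebook). Out-of-vocabulary tokens are skipped, and a document with no known tokens falls back to a zero vector.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; real models here use 100 or 300 dimensions.
toy_vectors = {
    "great": np.array([1.0, 0.0, 2.0]),
    "food":  np.array([0.0, 2.0, 0.0]),
    "app":   np.array([2.0, 1.0, 1.0]),
}

def mean_vector(tokens, vectors, dim=3):
    """Average the vectors of in-vocabulary tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

doc_vec = mean_vector(["great", "food", "app", "unknownword"], toy_vectors)
empty_vec = mean_vector(["unknownword"], toy_vectors)

print(doc_vec)    # [1. 1. 1.]
print(empty_vec)  # [0. 0. 0.]
```

Every document ends up with a fixed-length vector regardless of how many tokens it has, which is what lets a linear classifier consume it directly.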

As a first step, let's tokenize the reviews using regular expressions. Since we're going to compute an average word vector, we'll also remove stop words, using NLTK's list of English stop words. These words shouldn't affect our classification decision, so removing them avoids any noise they might add. Note that all of the stop words in NLTK's list are lower-cased, while some stop words in the reviews are not (for example at the start of a sentence), so those won't be matched without further processing.
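The casing caveat is easy to see on a single sentence. The sketch below uses a small hypothetical stop list in place of NLTK's and compares a case-sensitive filter with one that lower-cases each token only for the membership test. The second variant is an optional alternative, not what the cells in this notebook do.

```python
import re

# Small hypothetical stop list standing in for NLTK's lower-cased English list.
toy_stop_words = {"the", "is", "but", "i"}

review = "The app is great But I want more restaurants"
tokens = re.findall(r"\w+", review)

# Case-sensitive filter: "The", "But" and "I" slip through.
case_sensitive = [t for t in tokens if t not in toy_stop_words]

# Lower-casing only for the comparison removes them while keeping original casing.
case_insensitive = [t for t in tokens if t.lower() not in toy_stop_words]

print(case_sensitive)
print(case_insensitive)
```

Keeping the original casing in the output matters for case-sensitive embedding vocabularies, where `Nice` and `nice` may map to different vectors.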

We'll use the Word2Vec model trained on our own corpus, as well as three pre-trained embedding models loaded through gensim: `word2vec-google-news-300`, `glove-wiki-gigaword-100`, and `glove-wiki-gigaword-300`.

In [27]:
gnews = gensim.downloader.load('word2vec-google-news-300')

In [28]:
gnews.vector_size

300

In [29]:
glove_small = gensim.downloader.load('glove-wiki-gigaword-100')

In [30]:
glove_small.vector_size

100

In [31]:
glove_big = gensim.downloader.load('glove-wiki-gigaword-300')

In [32]:
glove_big.vector_size

300

In [33]:
def generate_dense_features(tokenized_texts, word_vectors):
    """Return one mean word vector per document as a 2-D numpy array.

    Tokens missing from `word_vectors` are skipped; a document with no
    in-vocabulary tokens is represented by a zero vector.
    """
    res = []
    for tokens in tokenized_texts:
        # Keep only the tokens the embedding model knows about
        items_in_vocab = [item for item in tokens if item in word_vectors]
        if len(items_in_vocab) > 0:
            # Average the word vectors to get a single document vector
            res.append(np.mean(word_vectors[items_in_vocab], axis=0))
        else:
            res.append(np.zeros(word_vectors.vector_size))
    return np.array(res)

In [34]:
tokenized_train_items = []
for review in tqdm(X_train):
    tokens = [token for token in re.findall(r'\w+', review) if token not in stop_words]
    tokenized_train_items.append(tokens)

100%|██████████| 501058/501058 [00:04<00:00, 100296.93it/s]


In [35]:
len(set(tokenized_train_items[0]))

7

In [36]:
tokenized_train_items[0]

['Nice', 'apk', 'restaurants', 'supply', 'bad', 'quality', 'food']

In [37]:
tokenized_dev_items = []
for review in tqdm(X_dev):
    tokens = [token for token in re.findall(r'\w+', review) if token not in stop_words]
    tokenized_dev_items.append(tokens)

100%|██████████| 62633/62633 [00:00<00:00, 136772.80it/s]


In [38]:
def train_model(clf):
    print("_" * 80)
    print("Training: ")
    clf.fit(X_train_wp, y_train)
    y_dev_pred = clf.predict(X_dev_wp)
    
    score = accuracy_score(y_dev, y_dev_pred)
    print("accuracy:   %0.3f" % score)
    
    print("classification report:")
    print(classification_report(y_dev, y_dev_pred))
    
    print("confusion matrix:")
    print(confusion_matrix(y_dev, y_dev_pred))
    print("Training Complete")
    print()
    
    clf_descr = str(clf).split("(")[0]
    return clf_descr, score, y_dev_pred

It's training time!

### 3.1 Word2Vec on reviews corpus

In [39]:
X_train_wp = generate_dense_features(tokenized_train_items, full_model_kv)

In [40]:
print(X_train_wp.shape)

(501058, 100)


In [41]:
X_dev_wp = generate_dense_features(tokenized_dev_items, full_model_kv)

In [42]:
print(X_dev_wp.shape)

(62633, 100)


In [43]:
preds = {} # A dict to store our dev set predictions

for clf, name in (
    (DummyClassifier(strategy='uniform', random_state=RANDOM_SEED), "Uniform Classifier"),
    (DummyClassifier(strategy='most_frequent', random_state=RANDOM_SEED), "Most Frequent Classifier"),
    (LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000, random_state=RANDOM_SEED), "Logistic Regression")
):
    print("=" * 80)
    print("Training Results")
    print(name)
    mod = train_model(clf)
    preds[name] = mod[2]

Training Results
Uniform Classifier
________________________________________________________________________________
Training: 
accuracy:   0.499
classification report:
              precision    recall  f1-score   support

           0       0.66      0.50      0.57     41413
           1       0.34      0.50      0.40     21220

    accuracy                           0.50     62633
   macro avg       0.50      0.50      0.49     62633
weighted avg       0.55      0.50      0.51     62633

confusion matrix:
[[20678 20735]
 [10627 10593]]
Training Complete

Training Results
Most Frequent Classifier
________________________________________________________________________________
Training: 
accuracy:   0.661
classification report:


              precision    recall  f1-score   support

           0       0.66      1.00      0.80     41413
           1       0.00      0.00      0.00     21220

    accuracy                           0.66     62633
   macro avg       0.33      0.50      0.40     62633
weighted avg       0.44      0.66      0.53     62633

confusion matrix:
[[41413     0]
 [21220     0]]
Training Complete

Training Results
Logistic Regression
________________________________________________________________________________
Training: 
accuracy:   0.890
classification report:
              precision    recall  f1-score   support

           0       0.96      0.87      0.91     41413
           1       0.79      0.93      0.85     21220

    accuracy                           0.89     62633
   macro avg       0.87      0.90      0.88     62633
weighted avg       0.90      0.89      0.89     62633

confusion matrix:
[[36029  5384]
 [ 1487 19733]]
Training Complete



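Pretty good result! Logistic Regression reaches 0.89 accuracy and a macro average F1-score of 0.88 on the dev set, well above both dummy baselines.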

In [44]:
# Create a dataframe for mis-classifications
def create_mis_classification_df(name):
    mis_class = pd.DataFrame(X_dev)
    mis_class['Actual'] = y_dev
    mis_class['Predicted'] = preds[name]
    mis_class = mis_class[mis_class['Actual'] != mis_class['Predicted']]
    return mis_class

In [45]:
mis_class_logreg = create_mis_classification_df('Logistic Regression')

In [46]:
mis_class_logreg.sample(10).values

array([['But they should live chat like zomato does U have any valid number to call',
        0, 1],
       ['Customer helpfull', 0, 1],
       ['Ghatiya app not useful', 1, 0],
       ['Would rate 5 if they added Google pay as an option.', 0, 1],
       ['i have a trouble to login with my number only', 0, 1],
       ['Great app but super expensive.', 1, 0],
       ["Door dash sucks but I'm hungry and lazy so take my money you pigs",
        1, 0],
       ['Que tal dar condições DECENTES aos trabalhadores?', 1, 0],
       ["Very convenient way to get delivery. My only complaint is that you have to tip the dasher before they even deliver. I haven't had many bad experiences but when I have, the dasher was already tipped 18%. Also, add a custom tip option. 15% is for a waiter. 15% is the lowest tip option, which is fine when ordering for one, when ordering 100 dollars or more of food for a guy to drive five minutes up the road and get 18% it gets a little over the top.",
        0, 1],
  

### 3.2 Google News

In [47]:
X_train_wp = generate_dense_features(tokenized_train_items, gnews)
X_dev_wp = generate_dense_features(tokenized_dev_items, gnews)

In [48]:
preds = {} # A dict to store our dev set predictions
print("=" * 80)
print("Training Results")
print("Logistic Regression")
clf = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000, random_state=RANDOM_SEED)
mod = train_model(clf)
preds["Logistic Regression"] = mod[2]  # explicit key; `name` was only a leftover loop variable

Training Results
Logistic Regression
________________________________________________________________________________
Training: 
accuracy:   0.879
classification report:
              precision    recall  f1-score   support

           0       0.96      0.85      0.90     41413
           1       0.77      0.93      0.84     21220

    accuracy                           0.88     62633
   macro avg       0.86      0.89      0.87     62633
weighted avg       0.89      0.88      0.88     62633

confusion matrix:
[[35404  6009]
 [ 1570 19650]]
Training Complete



In [49]:
mis_class_logreg_gnews = create_mis_classification_df('Logistic Regression')

In [50]:
mis_class_logreg_gnews.sample(10).values

array([['Some options like cancel could be clearer their support service has dropped since first started',
        0, 1],
       ["It's an OK app the only thing is when I order a McDonald's the food it's always mostly cold otherwise I would give 5 starts",
        0, 1],
       ["Not sure why deliveroo charged me 300 fils when I scanned my card for setting up payment method, no disclaimer was mentioned before this. Apart from that the app navigation and usuage is easy and user friendly. But I definitely didn't like to be charged for something I wasnt aware off or reason why.",
        0, 1],
       ["Some places don't care and get the orders wrong just because they don't read the order. It's frustrating. It's not the app or the drivers fault.",
        0, 1],
       ["Can't order some food in the menu.", 0, 1],
       ["The customer service isn't very good", 1, 0],
       ['It won’t let me add my card and it says they allow commonwealth bank cards but every time I go to save it it just

### 3.3 Glove Small

In [51]:
X_train_wp = generate_dense_features(tokenized_train_items, glove_small)
X_dev_wp = generate_dense_features(tokenized_dev_items, glove_small)

In [52]:
preds = {} # A dict to store our dev set predictions
print("=" * 80)
print("Training Results")
print("Logistic Regression")
clf = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000, random_state=RANDOM_SEED)
mod = train_model(clf)
preds["Logistic Regression"] = mod[2]  # explicit key; `name` was only a leftover loop variable

Training Results
Logistic Regression
________________________________________________________________________________
Training: 
accuracy:   0.826
classification report:
              precision    recall  f1-score   support

           0       0.92      0.81      0.86     41413
           1       0.70      0.86      0.77     21220

    accuracy                           0.83     62633
   macro avg       0.81      0.84      0.82     62633
weighted avg       0.84      0.83      0.83     62633

confusion matrix:
[[33424  7989]
 [ 2893 18327]]
Training Complete



In [53]:
mis_class_logreg_glove_small = create_mis_classification_df('Logistic Regression')

In [54]:
mis_class_logreg_glove_small.sample(10).values

array([['Yes of course, anything that will help others, teach kids, bring people and communities together . It will only be good for us everyone...',
        0, 1],
       ['Excellent user friendly, accurate', 0, 1],
       ['Awesome food del app', 0, 1],
       ['Bad', 1, 0],
       ['Not been able to use at all yet as will not accept payment card with mo explanation as to why.',
        0, 1],
       ['No complaints here, never had a bad experience with uber eats',
        0, 1],
       ['Never had a problem with the app itself. Easy to understand ordering process. Love that I get rewards for some orders.',
        0, 1],
       ['can never checkout with eats. straught forward ordering with deliveroo. 1st time user',
        0, 1],
       ["Good but I'd like to be able to take my card off of the app as it says you cannot delete your only active.",
        0, 1],
       ['The app makes me log in every single time I open it 🤨', 0, 1]],
      dtype=object)

### 3.4 Glove Large

In [55]:
X_train_wp = generate_dense_features(tokenized_train_items, glove_big)
X_dev_wp = generate_dense_features(tokenized_dev_items, glove_big)

In [56]:
preds = {} # A dict to store our dev set predictions
print("=" * 80)
print("Training Results")
print("Logistic Regression")
clf = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000, random_state=RANDOM_SEED)
mod = train_model(clf)
preds["Logistic Regression"] = mod[2]  # explicit key; `name` was only a leftover loop variable

Training Results
Logistic Regression
________________________________________________________________________________
Training: 
accuracy:   0.849
classification report:
              precision    recall  f1-score   support

           0       0.93      0.83      0.88     41413
           1       0.73      0.88      0.80     21220

    accuracy                           0.85     62633
   macro avg       0.83      0.86      0.84     62633
weighted avg       0.86      0.85      0.85     62633

confusion matrix:
[[34405  7008]
 [ 2477 18743]]
Training Complete



In [57]:
mis_class_logreg_glove_big = create_mis_classification_df('Logistic Regression')

In [58]:
mis_class_logreg_glove_big.sample(10).values

array([['Cancel order not available and all systems manually reading after step following .. other app to different...',
        0, 1],
       ['I like being able to track the orders. Not a huge fan of the fact that half the time drivers on bicycles end up being different people in cars but I live inWest Philly so whatever.',
        0, 1],
       ['Bad', 1, 0],
       ['Best app on my phone. Feed me!', 0, 1],
       ['Always prompt never late', 0, 1],
       ['Not what expected to be and not to polite. Slow...Too many ads cost too much rather use local treats!',
        1, 0],
       ["Was really impressed with their customer service. One thing to note, if you are trying to reschedule a delivery let them know what timezone you are in - to make it easier for them. Also, they were super helpful when I couldn't place a tip (I had an old version on the app). They manually entered the tip for me.",
        0, 1],
       ['Excelente app para entrega de comida. En estos tiempos de emergencia

### 3.5 Summary

Perhaps unsurprisingly, Logistic Regression achieved its highest macro average F1-score (0.88) with the Word2Vec model trained on the reviews corpus. It's still striking how close the pre-trained embeddings came: Google News achieved a macro average F1-score of 0.87, GloVe Big 0.84, and GloVe Small 0.82. While this is still inferior to the bag-of-words approach, it shows just how effective the word embeddings method can be.
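For reference, the macro average F1-score quoted here can be recomputed from a confusion matrix. The sketch below uses the dev-set matrix reported in section 3.1 for the corpus-trained Word2Vec features. It is a plain numpy recomputation, not a replacement for `classification_report`.

```python
import numpy as np

# Dev-set confusion matrix from section 3.1 (rows = actual, columns = predicted).
cm = np.array([[36029, 5384],
               [1487, 19733]])

tp = np.diag(cm).astype(float)
precision = tp / cm.sum(axis=0)   # correct predictions / all predictions per class
recall = tp / cm.sum(axis=1)      # correct predictions / all actual items per class
f1 = 2 * precision * recall / (precision + recall)

# Macro average: unweighted mean of the per-class F1 scores.
macro_f1 = f1.mean()
print(np.round(f1, 2), round(float(macro_f1), 2))
```

Because the macro average weights both classes equally, it is less flattering than accuracy on an imbalanced dataset like this one.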

Unfortunately, the mis-classifications are a lot harder to interpret. Apart from reviews whose rating doesn't reflect the sentiment of the text, it's hard to see why the model got the rest wrong.
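One possible way to triage mis-classifications is to rank them by how confident the model was. This is a sketch on made-up numbers: the `proba_negative` column is hypothetical and stands in for what `clf.predict_proba(X_dev_wp)[:, 1]` would return for the negative class.

```python
import pandas as pd

# Invented probabilities for illustration; the review texts are taken from
# the samples shown earlier, but the numbers are not model output.
errors = pd.DataFrame({
    "review": ["Bad", "Great app but super expensive.", "Customer helpfull"],
    "Actual": [1, 1, 0],
    "Predicted": [0, 0, 1],
    "proba_negative": [0.48, 0.10, 0.55],
})

# Distance from the 0.5 decision threshold: a large margin means confidently wrong.
errors["margin"] = (errors["proba_negative"] - 0.5).abs()
confidently_wrong = errors.sort_values("margin", ascending=False)

print(confidently_wrong[["review", "Actual", "Predicted", "margin"]])
```

Errors near the threshold are often borderline reviews, while confidently wrong ones are more likely to point at label noise or vocabulary the embeddings handle poorly.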

### 3.6 Tune Logistic Regression Params

We'll tune Logistic Regression's `C` parameter (the inverse of the regularization strength) again, as we did for BOW. I really like the fact that the Google News embeddings did so well, so let's use them for tuning.

In [59]:
X_train_wp = generate_dense_features(tokenized_train_items, gnews)
X_dev_wp = generate_dense_features(tokenized_dev_items, gnews)

In [60]:
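clf = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=3000, random_state=RANDOM_SEED)

# NOTE: the original `param_grid` definition is missing from this cell.
# Assumed grid: 20 log-spaced values of C, consistent with the "20 candidates"
# and the best C reported in the output below.
param_grid = {'C': np.logspace(-4, 4, 20)}
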
print("=" * 80)
print("LogReg Grid Search")
clf = GridSearchCV(clf, param_grid = param_grid, cv = 5, verbose=True, n_jobs=-1)
best_clf = clf.fit(X_train_wp, y_train)
print(clf.best_params_)
y_dev_pred = best_clf.predict(X_dev_wp)
score = accuracy_score(y_dev, y_dev_pred)
print("accuracy:   %0.3f" % score)
print("classification report:")
print(classification_report(y_dev, y_dev_pred))
print("confusion matrix:")
print(confusion_matrix(y_dev, y_dev_pred))
print("Training Complete")
print()

LogReg Grid Search
Fitting 5 folds for each of 20 candidates, totalling 100 fits
{'C': 78.47599703514607}
accuracy:   0.879
classification report:
              precision    recall  f1-score   support

           0       0.96      0.86      0.90     41413
           1       0.77      0.93      0.84     21220

    accuracy                           0.88     62633
   macro avg       0.86      0.89      0.87     62633
weighted avg       0.89      0.88      0.88     62633

confusion matrix:
[[35426  5987]
 [ 1572 19648]]
Training Complete
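The 20 candidates and the best `C` reported above are consistent with a log-spaced grid. As a quick check, assuming the grid was `np.logspace(-4, 4, 20)` (an assumption inferred from the output, not a definition taken from the notebook):

```python
import numpy as np

# Assumed grid, not taken from the notebook: 20 log-spaced values of C.
assumed_grid = {"C": np.logspace(-4, 4, 20)}

best_c = 78.47599703514607  # value reported by the grid search above

# Look for the reported best C among the assumed candidates.
matches = np.isclose(assumed_grid["C"], best_c)
match_index = int(np.argmax(matches))

print(len(assumed_grid["C"]), match_index)
```

The best value sits in the upper part of the grid, meaning relatively weak regularization, and the dev-set scores barely change from the default `C=1.0` run in section 3.2.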



## 4. Running on test set

In [61]:
tokenized_test_items = []
for review in tqdm(X_test):
    tokens = [token for token in re.findall(r'\w+', review) if token not in stop_words]
    tokenized_test_items.append(tokens)

100%|██████████| 62632/62632 [00:00<00:00, 101958.08it/s]


In [62]:
X_test_wp = generate_dense_features(tokenized_test_items, gnews)

In [63]:
y_test_pred = best_clf.predict(X_test_wp)
score = accuracy_score(y_test, y_test_pred)
print("accuracy:   %0.3f" % score)
print("classification report:")
print(classification_report(y_test, y_test_pred))
print("confusion matrix:")
print(confusion_matrix(y_test, y_test_pred))
print("Training Complete")
print()

accuracy:   0.877
classification report:
              precision    recall  f1-score   support

           0       0.96      0.85      0.90     41472
           1       0.76      0.92      0.84     21160

    accuracy                           0.88     62632
   macro avg       0.86      0.89      0.87     62632
weighted avg       0.89      0.88      0.88     62632

confusion matrix:
[[35346  6126]
 [ 1594 19566]]
Training Complete



In [65]:
best_clf.best_params_

{'C': 78.47599703514607}