## Sentiment Classification SG Reviews Data (WE, non-Deep Learning)

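This notebook performs sentiment classification on the SG reviews data using word embeddings (WE). Naive Bayes and Logistic Regression are both good non-deep-learning approaches to this task; the cells below train Logistic Regression, with dummy classifiers as baselines.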

As a rule of thumb, reviews rated 3 stars and above are treated as **positive**, and reviews below 3 stars as negative.
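
The rule of thumb can be written as a one-line function. This is only a sketch: the function name and the string labels are made up for illustration, and how the rule maps onto the 0/1 values in the `label` column isn't shown in this notebook, so no numeric encoding is assumed here.

```python
def sentiment_from_stars(stars, threshold=3):
    """Return 'positive' for ratings at or above the threshold, else 'negative'."""
    return "positive" if stars >= threshold else "negative"


# Apply the rule to every possible star rating, 1 through 5
labels = [sentiment_from_stars(stars) for stars in range(1, 6)]
print(labels)
```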

In [1]:
%pip install gensim

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
import gzip
import json
import matplotlib.pyplot as plt
import numpy as np
import re
import random
import pandas as pd
import seaborn as sns
import gensim
import spacy
import nltk
import gensim.downloader
from collections import Counter, defaultdict
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB
from sklearn.metrics import f1_score, classification_report, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
from spacy_langdetect import LanguageDetector
from spacy.language import Language
from gensim.models.word2vec import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from tqdm import tqdm

In [3]:
RANDOM_SEED = 33

We'll reuse the filtered SG data from the BOW notebook, which already excludes non-English reviews.

In [4]:
reviews = pd.read_csv("assets/reviews_sg_filtered.csv")
reviews.head()

Unnamed: 0.1,Unnamed: 0,review,label,language,cof_score
0,0,"Used to be a good app been using for years, no...",0,en,0.999996
1,1,Grab app is convenient because you can use mul...,1,en,0.999999
2,2,I used to love the subscription plans that the...,1,en,0.999999
3,3,I ordered a grabfood and one of the 3 items ar...,1,en,0.999998
4,4,This platform gives too much power to restaura...,1,en,0.999998


In [5]:
reviews.drop(columns=['Unnamed: 0', 'language', 'cof_score'], inplace=True)

In [7]:
reviews.head()

Unnamed: 0,review,label
0,"Used to be a good app been using for years, no...",0
1,Grab app is convenient because you can use mul...,1
2,I used to love the subscription plans that the...,1
3,I ordered a grabfood and one of the 3 items ar...,1
4,This platform gives too much power to restaura...,1


In [8]:
print(len(reviews))

374633


In [9]:
reviews = reviews.dropna()

In [10]:
df_proc = reviews.copy()
df_proc.head()

Unnamed: 0,review,label
0,"Used to be a good app been using for years, no...",0
1,Grab app is convenient because you can use mul...,1
2,I used to love the subscription plans that the...,1
3,I ordered a grabfood and one of the 3 items ar...,1
4,This platform gives too much power to restaura...,1


In [11]:
X = df_proc['review']
y = df_proc['label']

## 2. Train Word2Vec Model on Corpus

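It's unclear to me whether the Word2Vec model should be trained on the entire corpus or only on the train split. The 655 class appears to have used the entire corpus, so let's give that a shot. Note that this means the embeddings also see the text of the dev and test reviews (though not their labels).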

Also referencing: https://github.com/nadbordrozd/blog_stuff/blob/master/classification_w2v/benchmarking_python3.ipynb

In [12]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/meln/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
#https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
stop_words = set(stopwords.words('english'))

In [14]:
all_tokenized_reviews = []

for review in tqdm(X):
    tokens = [token for token in re.findall(r'\w+', review) if token not in stop_words]
    all_tokenized_reviews.append(tokens)

100%|██████████| 374633/374633 [00:03<00:00, 120609.41it/s]


In [15]:
full_model = Word2Vec(sentences=all_tokenized_reviews, vector_size=100, 
                      window=2, min_count=100, workers=4, seed=RANDOM_SEED)

In [16]:
full_model.save("word2vec_sg.model")

In [17]:
full_model_kv = full_model.wv

We will split the dataset into `train`, `test`, and `dev`, with 80%, 10%, 10% ratio, respectively.

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)
X_test, X_dev, y_test, y_dev = train_test_split(X_test, y_test, test_size=0.5, random_state=RANDOM_SEED)

In [19]:
len(X_train)

299706

In [20]:
len(X_dev)

37464

In [21]:
len(X_test)

37463
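
The three sizes above can be checked by hand. The sketch below assumes that scikit-learn rounds the held-out portion up when `test_size` is a fraction, which is consistent with the counts printed above; `split_sizes` is a hypothetical helper, not a library function.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_first, n_second), rounding the second (held-out) part up."""
    n_second = math.ceil(test_fraction * n_samples)
    return n_samples - n_second, n_second


# First split: 80% train, 20% held out
n_train, n_holdout = split_sizes(374633, 0.2)
# Second split: the held-out part is halved; the second output becomes dev
n_test, n_dev = split_sizes(n_holdout, 0.5)
print(n_train, n_dev, n_test)
```

Because the held-out part has an odd number of rows, `dev` ends up one review larger than `test`.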

In [22]:
X_train.iloc[0]

'Great'

In [23]:
X_test.iloc[0]

'Food pandas have made it very easy to order food and drink.'

## 3. Word Embeddings Approach on Logistic Regression

This section explores the use of word embeddings for feature extraction. We'll work with dense representations of documents instead of the bag-of-words representations we used earlier. To do this, we'll compute the average (mean) word vector of each document and classify from those representations.
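
To make the averaging concrete, here is a toy example in plain Python. The three-dimensional vectors and the words are invented for illustration; the real models in this notebook have 100 or 300 dimensions.

```python
# Toy 3-dimensional embeddings; words and values are made up for illustration
toy_vectors = {
    "great": [0.9, 0.1, 0.0],
    "food": [0.1, 0.8, 0.3],
    "slow": [-0.7, 0.2, 0.5],
}


def mean_vector(tokens, vectors, size=3):
    """Average the vectors of known tokens; return all zeros if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values dimension by dimension
    return [sum(dim) / len(known) for dim in zip(*known)]


# "yum" is not in the toy vocabulary, so only "great" and "food" are averaged
doc_vector = mean_vector(["great", "food", "yum"], toy_vectors)
print(doc_vector)
```

Word order is lost in the average, so "great food" and "food great" get the same document vector.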

As a first step, let's tokenize the reviews using regular expressions. Since we're going to compute an average word vector, let's also remove stop words, using NLTK's list of English stop words. These words shouldn't affect our classification decision, so removing them avoids any noise they might add. Note that all of the stop words in NLTK's list are lower-cased, so stop words in the reviews that are not entirely lower-cased (for example, a sentence-initial "The") will not match without further processing.
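
One way to handle the casing issue is to lower-case each token only for the stop-word comparison. This is a sketch of that option, not what the cells in this notebook do (they compare tokens as-is); the small `STOP_WORDS` set is a stand-in for NLTK's list so that the example is self-contained.

```python
import re

# Tiny stand-in for NLTK's English stop-word list (an assumption for this sketch)
STOP_WORDS = {"the", "is", "a", "this", "and"}


def tokenize(text, stop_words=STOP_WORDS):
    """Split on word characters and drop stop words, comparing in lower case."""
    return [tok for tok in re.findall(r"\w+", text) if tok.lower() not in stop_words]


# 'The' is dropped even though it is capitalised; kept tokens retain their casing
print(tokenize("The app is great"))
```

Kept tokens keep their original casing here, which matters if the embedding vocabulary is case-sensitive.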

We'll use the Word2Vec model trained on our own corpus above. We'll also use a few pre-trained models: `word2vec-google-news-300`, `glove-wiki-gigaword-100`, and `glove-wiki-gigaword-300`.

In [24]:
gnews = gensim.downloader.load('word2vec-google-news-300')

In [25]:
gnews.vector_size

300

In [26]:
glove_small = gensim.downloader.load('glove-wiki-gigaword-100')

In [27]:
glove_small.vector_size

100

In [28]:
glove_big = gensim.downloader.load('glove-wiki-gigaword-300')

In [29]:
glove_big.vector_size

300

In [30]:
def generate_dense_features(tokenized_texts, word_vectors):
    """Return one mean word vector per document as a 2-D numpy array.

    Documents with no in-vocabulary tokens get an all-zero vector.
    """
    res = []
    for tokens in tokenized_texts:
        # Keep only the tokens that the embedding model knows about
        items_in_vocab = [item for item in tokens if item in word_vectors]
        if len(items_in_vocab) > 0:
            # Average the word vectors to get a single document vector
            res.append(np.mean(word_vectors[items_in_vocab], axis=0))
        else:
            # Fall back to zeros of the embedding dimensionality
            res.append(np.zeros(word_vectors.vector_size))
    return np.array(res)
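
Since documents with no in-vocabulary tokens collapse to a zero vector, it can be useful to check how much of the text each embedding model actually covers. The helper below is a hypothetical addition, not part of the original notebook; it only relies on `in`, the same membership test used in `generate_dense_features`, and is shown here on a toy vocabulary.

```python
def vocab_coverage(tokenized_texts, vocab):
    """Return (share of tokens found in vocab, share of documents with no match)."""
    total = found = empty_docs = 0
    for tokens in tokenized_texts:
        hits = sum(1 for tok in tokens if tok in vocab)
        total += len(tokens)
        found += hits
        if hits == 0:
            empty_docs += 1
    token_share = found / total if total else 0.0
    doc_share = empty_docs / len(tokenized_texts) if tokenized_texts else 0.0
    return token_share, doc_share


# Toy data: the lower-cased vocabulary does not contain the capitalised 'Great'
toy_vocab = {"great", "food", "app"}
toy_docs = [["Great"], ["great", "food"], ["slow", "app"]]
token_share, doc_share = vocab_coverage(toy_docs, toy_vocab)
print(token_share, doc_share)
```

In the toy data, the one-word review `['Great']` has no match because of its casing, so it would be represented by zeros.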

In [31]:
tokenized_train_items = []
for review in tqdm(X_train):
    tokens = [token for token in re.findall(r'\w+', review) if token not in stop_words]
    tokenized_train_items.append(tokens)

100%|██████████| 299706/299706 [00:02<00:00, 101077.49it/s]


In [32]:
len(set(tokenized_train_items[0]))

1

In [33]:
tokenized_train_items[0]

['Great']

In [34]:
tokenized_dev_items = []
for review in tqdm(X_dev):
    tokens = [token for token in re.findall(r'\w+', review) if token not in stop_words]
    tokenized_dev_items.append(tokens)

100%|██████████| 37464/37464 [00:00<00:00, 145539.68it/s]


In [35]:
def train_model(clf):
    print("_" * 80)
    print("Training: ")
    clf.fit(X_train_wp, y_train)
    y_dev_pred = clf.predict(X_dev_wp)
    
    score = accuracy_score(y_dev, y_dev_pred)
    print("accuracy:   %0.3f" % score)
    
    print("classification report:")
    print(classification_report(y_dev, y_dev_pred))
    
    print("confusion matrix:")
    print(confusion_matrix(y_dev, y_dev_pred))
    print("Training Complete")
    print()
    
    clf_descr = str(clf).split("(")[0]
    return clf_descr, score, y_dev_pred

It's training time!

### 3.1 Word2Vec on reviews corpus

In [36]:
X_train_wp = generate_dense_features(tokenized_train_items, full_model_kv)

In [37]:
print(X_train_wp.shape)

(299706, 100)


In [38]:
X_dev_wp = generate_dense_features(tokenized_dev_items, full_model_kv)

In [39]:
print(X_dev_wp.shape)

(37464, 100)


In [40]:
preds = {} # A dict to store our dev set predictions

for clf, name in (
    (DummyClassifier(strategy='uniform', random_state=RANDOM_SEED), "Uniform Classifier"),
    (DummyClassifier(strategy='most_frequent', random_state=RANDOM_SEED), "Most Frequent Classifier"),
    (LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000, random_state=RANDOM_SEED), "Logistic Regression")
):
    print("=" * 80)
    print("Training Results")
    print(name)
    mod = train_model(clf)
    preds[name] = mod[2]

Training Results
Uniform Classifier
________________________________________________________________________________
Training: 
accuracy:   0.499
classification report:
              precision    recall  f1-score   support

           0       0.66      0.50      0.57     24645
           1       0.34      0.50      0.41     12819

    accuracy                           0.50     37464
   macro avg       0.50      0.50      0.49     37464
weighted avg       0.55      0.50      0.51     37464

confusion matrix:
[[12280 12365]
 [ 6402  6417]]
Training Complete

Training Results
Most Frequent Classifier
________________________________________________________________________________
Training: 
accuracy:   0.658
classification report:
              precision    recall  f1-score   support

           0       0.66      1.00      0.79     24645
           1       0.00      0.00      0.00     12819

    accuracy                           0.66     37464
   macro avg       0.33      0.50      0.40

[output truncated]

Training Results
Logistic Regression
________________________________________________________________________________
Training: 
accuracy:   0.869
classification report:
              precision    recall  f1-score   support

           0       0.95      0.84      0.89     24645
           1       0.75      0.92      0.83     12819

    accuracy                           0.87     37464
   macro avg       0.85      0.88      0.86     37464
weighted avg       0.88      0.87      0.87     37464

confusion matrix:
[[20813  3832]
 [ 1066 11753]]
Training Complete



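Pretty good result! Logistic Regression reaches 0.869 accuracy on the dev set, well above the 0.658 of the most-frequent baseline.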
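
The figures in the classification report can be reproduced from the confusion matrix alone. The sketch below uses the Logistic Regression matrix printed above and assumes the usual scikit-learn layout (rows are actual classes, columns are predicted classes); `report_from_confusion` is a hypothetical helper.

```python
def report_from_confusion(matrix):
    """Compute accuracy, per-class (precision, recall, F1) and macro F1
    from a 2x2 confusion matrix with actual classes as rows."""
    total = sum(sum(row) for row in matrix)
    accuracy = (matrix[0][0] + matrix[1][1]) / total
    per_class = {}
    for c in (0, 1):
        predicted_c = matrix[0][c] + matrix[1][c]  # column sum
        actual_c = sum(matrix[c])                  # row sum
        precision = matrix[c][c] / predicted_c
        recall = matrix[c][c] / actual_c
        f1 = 2 * precision * recall / (precision + recall)
        per_class[c] = (precision, recall, f1)
    macro_f1 = (per_class[0][2] + per_class[1][2]) / 2
    return accuracy, per_class, macro_f1


accuracy, per_class, macro_f1 = report_from_confusion([[20813, 3832], [1066, 11753]])
print(round(accuracy, 3), round(macro_f1, 2))
```

The macro average weights both classes equally, which is why it is a useful summary for this imbalanced dev set.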

In [41]:
# Create a dataframe for mis-classifications
def create_mis_classification_df(name):
    mis_class = pd.DataFrame(X_dev)
    mis_class['Actual'] = y_dev
    mis_class['Predicted'] = preds[name]
    mis_class = mis_class[mis_class['Actual'] != mis_class['Predicted']]
    return mis_class

In [42]:
mis_class_logreg = create_mis_classification_df('Logistic Regression')

In [43]:
mis_class_logreg.sample(10).values

array([['I think their systems are slow, app needs some change', 0, 1],
       ['I have been ordering from deliveroo for around 12 months and have NEVER had a problem. Most of the time the order comes a bit quicker but never late.!',
        0, 1],
       ["Didn't cover near by all restaurants sometime it does sometimes not",
        0, 1],
       ['Bhai Delivery to Late night tak rakho', 0, 1],
       ['Terrible CEO in handling Foodpanda Rider in Malaysia... So arrogant',
        0, 1],
       ['I like it and has now become a necessary mode of transportation but lack of competition has made it more expensive than before. Watch out for the occasional promo codes.',
        0, 1],
       ["GPS never works properly, so never know when the car will arrive, guess quality and monopoly don't go together.",
        0, 1],
       ['So much easy to get foods and transportation with GRAB. How about getting sea cruises and short trips around Singapore. Thanks',
        1, 0],
       ['Food', 1, 0

### 3.2 Google News

In [44]:
X_train_wp = generate_dense_features(tokenized_train_items, gnews)
X_dev_wp = generate_dense_features(tokenized_dev_items, gnews)

In [45]:
preds = {} # A dict to store our dev set predictions
print("=" * 80)
print("Training Results")
print("Logistic Regression")
clf = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000, random_state=RANDOM_SEED)
mod = train_model(clf)
preds["Logistic Regression"] = mod[2]  # explicit key; `name` was a leftover loop variable

Training Results
Logistic Regression
________________________________________________________________________________
Training: 
accuracy:   0.863
classification report:
              precision    recall  f1-score   support

           0       0.95      0.83      0.89     24645
           1       0.74      0.92      0.82     12819

    accuracy                           0.86     37464
   macro avg       0.85      0.88      0.85     37464
weighted avg       0.88      0.86      0.87     37464

confusion matrix:
[[20564  4081]
 [ 1053 11766]]
Training Complete



In [46]:
mis_class_logreg_gnews = create_mis_classification_df('Logistic Regression')

In [47]:
mis_class_logreg_gnews.sample(10).values

array([["Couldn't add visa, amex or mastercard...", 1, 0],
       ['One day i was hungry in midnight and thought what to do suddenly i see an app on my f.b page first i download app than wait for 15mnt my order on my door thank a lot food panda',
        0, 1],
       ["sometimes the app doesn't recognizes my address or there are some other store/restaurant that is not available the next day after you order.",
        0, 1],
       ['I was forced to use this app because the competing app Z stopped working and their customer service was terrible. This app works very smoothly and with no hassles at all',
        0, 1],
       ["App is good. Delivery isn't... just for my food to get to me from the place i ordered...which is literally 2 minutes away. It took 40 minutes. I have no idea what the person delivering it was doing...just going round in circles...only to go to the wrong address then say they were waiting outside the whole time...when we were watching them on the live gps going aro

### 3.3 Glove Small

In [48]:
X_train_wp = generate_dense_features(tokenized_train_items, glove_small)
X_dev_wp = generate_dense_features(tokenized_dev_items, glove_small)

In [49]:
preds = {} # A dict to store our dev set predictions
print("=" * 80)
print("Training Results")
print("Logistic Regression")
clf = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000, random_state=RANDOM_SEED)
mod = train_model(clf)
preds["Logistic Regression"] = mod[2]  # explicit key; `name` was a leftover loop variable

Training Results
Logistic Regression
________________________________________________________________________________
Training: 
accuracy:   0.799
classification report:
              precision    recall  f1-score   support

           0       0.90      0.78      0.84     24645
           1       0.67      0.83      0.74     12819

    accuracy                           0.80     37464
   macro avg       0.78      0.81      0.79     37464
weighted avg       0.82      0.80      0.80     37464

confusion matrix:
[[19337  5308]
 [ 2224 10595]]
Training Complete



In [50]:
mis_class_logreg_glove_small = create_mis_classification_df('Logistic Regression')

In [51]:
mis_class_logreg_glove_small.sample(10).values

array([['Cannot located my delivery location at all', 1, 0],
       ["Yasmin to a few weeks ago but the UK to the season with salt in my love. The new one for a bit late but it's still available on a few weeks back on Monday and the rest of the",
        0, 1],
       ['What happened to money transfer?', 0, 1],
       ['dear developer why prepaid top up cannot be used?', 0, 1],
       ["Hopefully the driver finds our address this time, they haven't on the last 8 orders. ( not holding my breath). Driver's never read the delivery notes.",
        0, 1],
       ['Should allow history of riders, drivers and restaurants so feedback / rating / tips can be given after giving them some thought on the quality of service / satisfaction level (rather than having to respond almost immediately after the transaction)',
        0, 1],
       ['Cannot change number and no other log in ways', 1, 0],
       ['Stupid Service', 1, 0],
       ['Easy to order.', 0, 1],
       ['Almost no bugs. Runs very smo

### 3.4 Glove Large

In [52]:
X_train_wp = generate_dense_features(tokenized_train_items, glove_big)
X_dev_wp = generate_dense_features(tokenized_dev_items, glove_big)

In [53]:
preds = {} # A dict to store our dev set predictions
print("=" * 80)
print("Training Results")
print("Logistic Regression")
clf = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000, random_state=RANDOM_SEED)
mod = train_model(clf)
preds["Logistic Regression"] = mod[2]  # explicit key; `name` was a leftover loop variable

Training Results
Logistic Regression
________________________________________________________________________________
Training: 
accuracy:   0.821
classification report:
              precision    recall  f1-score   support

           0       0.91      0.81      0.86     24645
           1       0.69      0.85      0.76     12819

    accuracy                           0.82     37464
   macro avg       0.80      0.83      0.81     37464
weighted avg       0.84      0.82      0.82     37464

confusion matrix:
[[19857  4788]
 [ 1926 10893]]
Training Complete



In [54]:
mis_class_logreg_glove_big = create_mis_classification_df('Logistic Regression')

In [55]:
mis_class_logreg_glove_big.sample(10).values

array([['They need more improvement this app', 1, 0],
       ['I really like this app. I use it almost every weekend and never been disappointed. Thanks food panda team for this app..',
        0, 1],
       ['lund app h mc bc', 1, 0],
       ["Problem.... Apps... I don't know how to track my order... No contact no.... And contact number appear fake....",
        0, 1],
       ['When will you guys begin services in Poipet City, Cambodia?', 0,
        1],
       ['I have been using deliveroo for years without any issues', 0, 1],
       ['Disgut this app', 1, 0],
       ["Great interface as it is very user friendly. Saving of address is also very intuitive. Edit: Recently, the app has gone to the dumps. Plus, no one cares what your order is delayed / cancelled by the vendors. Foodpanda just promises to refund you 'soon'. I'm not sure how that will help me feed my elderly parents who were waiting for dinner only for it to be cancelled only after half hour of waiting. Recommendation : It w

### 3.5 Summary

Perhaps unsurprisingly, Logistic Regression achieved its highest macro-average F1-score (0.86) with the Word2Vec model trained on the reviews corpus. However, it's striking that the pre-trained embeddings came close: Google News reached a macro-average F1-score of 0.85, GloVe Large 0.81, and GloVe Small 0.79. While this is still inferior to the bag-of-words approach, it shows just how effective the word embeddings method can be.

Unfortunately, the mis-classifications are a lot harder to interpret. Apart from reviews where the rating clearly doesn't reflect the sentiment of the text, it's hard to see why the rest were mis-classified, since an averaged dense vector has no per-word features to inspect.

### 3.6 Tune Logistic Regression Params

We'll try tuning the Logistic Regression `C` parameter again, as in the BOW notebook. The Google News embeddings did nearly as well as the corpus-trained ones, so let's use them as the baseline.

In [56]:
X_train_wp = generate_dense_features(tokenized_train_items, gnews)
X_dev_wp = generate_dense_features(tokenized_dev_items, gnews)

In [58]:
clf = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=3000, random_state=RANDOM_SEED)
param_grid = {'C' : np.logspace(-4, 4, 20)}

print("=" * 80)
print("LogReg Grid Search")
clf = GridSearchCV(clf, param_grid = param_grid, cv = 5, verbose=True, n_jobs=-1)
best_clf = clf.fit(X_train_wp, y_train)
print(clf.best_params_)
y_dev_pred = best_clf.predict(X_dev_wp)
score = accuracy_score(y_dev, y_dev_pred)
print("accuracy:   %0.3f" % score)
print("classification report:")
print(classification_report(y_dev, y_dev_pred))
print("confusion matrix:")
print(confusion_matrix(y_dev, y_dev_pred))
print("Training Complete")
print()

LogReg Grid Search
Fitting 5 folds for each of 20 candidates, totalling 100 fits
{'C': 1438.44988828766}
accuracy:   0.863
classification report:
              precision    recall  f1-score   support

           0       0.95      0.83      0.89     24645
           1       0.74      0.92      0.82     12819

    accuracy                           0.86     37464
   macro avg       0.85      0.88      0.86     37464
weighted avg       0.88      0.86      0.87     37464

confusion matrix:
[[20577  4068]
 [ 1058 11761]]
Training Complete
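
To see where the selected `C` sits in the search grid, the grid can be rebuilt in plain Python. This is a sketch: the list comprehension should give the same 20 log-spaced values as `np.logspace(-4, 4, 20)` up to floating-point rounding. In scikit-learn, `C` is the inverse of the regularization strength, so larger values mean weaker regularization.

```python
# 20 log-spaced values from 1e-4 to 1e4, mirroring np.logspace(-4, 4, 20)
c_grid = [10 ** (-4 + 8 * k / 19) for k in range(20)]

best_c = 1438.44988828766  # value reported by the grid search above
# Find the grid point closest to the reported value
best_index = min(range(len(c_grid)), key=lambda k: abs(c_grid[k] - best_c))
print(best_index, c_grid[best_index])
```

The selected value is the 18th of the 20 grid points, so the search prefers weak regularization. Dev accuracy stays at 0.863, the same as with the default `C`.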



## 4. Running on test set

In [59]:
tokenized_test_items = []
for review in tqdm(X_test):
    tokens = [token for token in re.findall(r'\w+', review) if token not in stop_words]
    tokenized_test_items.append(tokens)

100%|██████████| 37463/37463 [00:00<00:00, 118152.38it/s]


In [60]:
X_test_wp = generate_dense_features(tokenized_test_items, gnews)

In [61]:
y_test_pred = best_clf.predict(X_test_wp)
score = accuracy_score(y_test, y_test_pred)
print("accuracy:   %0.3f" % score)
print("classification report:")
print(classification_report(y_test, y_test_pred))
print("confusion matrix:")
print(confusion_matrix(y_test, y_test_pred))
print("Training Complete")
print()

accuracy:   0.865
classification report:
              precision    recall  f1-score   support

           0       0.95      0.84      0.89     24612
           1       0.75      0.92      0.82     12851

    accuracy                           0.86     37463
   macro avg       0.85      0.88      0.86     37463
weighted avg       0.88      0.86      0.87     37463

confusion matrix:
[[20568  4044]
 [ 1026 11825]]
Training Complete



In [62]:
best_clf.best_params_

{'C': 1438.44988828766}