<a href="https://colab.research.google.com/github/noahnguyen2004/Scotiabank-Customer-App-Review-Sentiment-Analysis/blob/main/customer_review_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import model_selection, preprocessing, linear_model, metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import ensemble
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from xgboost import XGBClassifier

from IPython.display import display   # more user-friendly dataframe display


import nltk                           # tagging (e.g. positive, neutral, negative) classification
nltk.download('stopwords')
from nltk.corpus import stopwords     # stopwords to eliminate words that don't convey important information
from textblob import Word             # textblob for sentiment analysis
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

from transformers import pipeline
from gensim import corpora
from gensim.models import LdaModel

from termcolor import colored
from warnings import filterwarnings
filterwarnings('ignore')

from sklearn import set_config
set_config(print_changed_only = False)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


## Loading data

In [30]:
df = pd.read_csv('/content/drive/MyDrive/Scotiabank customer review datathon/Winter 2024 Scotia DSD Data Set.csv', delimiter = ';', encoding = 'utf-8', on_bad_lines = 'skip')

In [31]:
# make an independent copy of the existing data. From now on we will work with df_copy
df_copy = df.copy()
df_copy

Unnamed: 0,Review_ID,Date,Rating,Review_Language,Version,Review_Likes,Review
0,0,2022-04-20 11:38:29,1,en,20.37.2,0,"Worst bank on the planet. Liars, cheats, and t..."
1,1,2023-03-25 19:10:42,5,en,20.47.0,0,App is great.
2,2,2022-05-31 00:54:40,1,en,20.38.1,0,Barely works. Barely. Stopped giving out notif...
3,3,2021-06-18 13:16:44,5,en,20.27.0,0,Really easy for a technophobe
4,4,2023-11-29 13:54:17,1,en,2310.0.1,0,Sucks
...,...,...,...,...,...,...,...
9171,9171,2021-04-20 10:16:28,5,en,20.25.1,0,Great app. Easy to use!
9172,9172,2023-03-05 10:27:12,5,en,20.47.0,0,Great App Top Notch Top Notch!
9173,9173,2023-03-14 15:28:08,5,en,20.47.0,0,It works like this to should
9174,9174,2022-10-08 15:08:03,1,en,,0,This bank insists on barriers that prevents di...


## Preprocessing data

### Drop the ID column

In [32]:
df_copy = df_copy.drop('Review_ID', axis = 1)

In [33]:
df_copy

Unnamed: 0,Date,Rating,Review_Language,Version,Review_Likes,Review
0,2022-04-20 11:38:29,1,en,20.37.2,0,"Worst bank on the planet. Liars, cheats, and t..."
1,2023-03-25 19:10:42,5,en,20.47.0,0,App is great.
2,2022-05-31 00:54:40,1,en,20.38.1,0,Barely works. Barely. Stopped giving out notif...
3,2021-06-18 13:16:44,5,en,20.27.0,0,Really easy for a technophobe
4,2023-11-29 13:54:17,1,en,2310.0.1,0,Sucks
...,...,...,...,...,...,...
9171,2021-04-20 10:16:28,5,en,20.25.1,0,Great app. Easy to use!
9172,2023-03-05 10:27:12,5,en,20.47.0,0,Great App Top Notch Top Notch!
9173,2023-03-14 15:28:08,5,en,20.47.0,0,It works like this to should
9174,2022-10-08 15:08:03,1,en,,0,This bank insists on barriers that prevents di...


### Check the language of each customer's review

In [6]:
df_copy['Review_Language'].value_counts()

en    9176
Name: Review_Language, dtype: int64

Every review is in English, so this column carries no information. We therefore drop it.

In [34]:
df_copy = df_copy.drop('Review_Language', axis = 1)

In [35]:
df_copy

Unnamed: 0,Date,Rating,Version,Review_Likes,Review
0,2022-04-20 11:38:29,1,20.37.2,0,"Worst bank on the planet. Liars, cheats, and t..."
1,2023-03-25 19:10:42,5,20.47.0,0,App is great.
2,2022-05-31 00:54:40,1,20.38.1,0,Barely works. Barely. Stopped giving out notif...
3,2021-06-18 13:16:44,5,20.27.0,0,Really easy for a technophobe
4,2023-11-29 13:54:17,1,2310.0.1,0,Sucks
...,...,...,...,...,...
9171,2021-04-20 10:16:28,5,20.25.1,0,Great app. Easy to use!
9172,2023-03-05 10:27:12,5,20.47.0,0,Great App Top Notch Top Notch!
9173,2023-03-14 15:28:08,5,20.47.0,0,It works like this to should
9174,2022-10-08 15:08:03,1,,0,This bank insists on barriers that prevents di...


In [54]:
df_copy['Date'] = pd.to_datetime(df_copy['Date'])
df_copy['Year'] = df_copy['Date'].dt.year
df_copy['Month'] = df_copy['Date'].dt.month
df_copy['Day'] = df_copy['Date'].dt.day
df_copy['Time'] = df_copy['Date'].dt.time

In [55]:
df_copy

Unnamed: 0,Date,Rating,Version,Review_Likes,Review,Year,Month,Day,Time
0,2022-04-20 11:38:29,1,20.37.2,0,"Worst bank on the planet. Liars, cheats, and t...",2022,4,20,11:38:29
1,2023-03-25 19:10:42,5,20.47.0,0,App is great.,2023,3,25,19:10:42
2,2022-05-31 00:54:40,1,20.38.1,0,Barely works. Barely. Stopped giving out notif...,2022,5,31,00:54:40
3,2021-06-18 13:16:44,5,20.27.0,0,Really easy for a technophobe,2021,6,18,13:16:44
4,2023-11-29 13:54:17,1,2310.0.1,0,Sucks,2023,11,29,13:54:17
...,...,...,...,...,...,...,...,...,...
9171,2021-04-20 10:16:28,5,20.25.1,0,Great app. Easy to use!,2021,4,20,10:16:28
9172,2023-03-05 10:27:12,5,20.47.0,0,Great App Top Notch Top Notch!,2023,3,5,10:27:12
9173,2023-03-14 15:28:08,5,20.47.0,0,It works like this to should,2023,3,14,15:28:08
9174,2022-10-08 15:08:03,1,,0,This bank insists on barriers that prevents di...,2022,10,8,15:08:03


In [61]:
# reorder column
column_names = ['Date', 'Year', 'Month', 'Day', 'Time', 'Rating', 'Version', 'Review_Likes', 'Review']
df_copy = df_copy.reindex(columns = column_names)

In [62]:
df_copy

Unnamed: 0,Date,Year,Month,Day,Time,Rating,Version,Review_Likes,Review
0,2022-04-20 11:38:29,2022,4,20,11:38:29,1,20.37.2,0,"Worst bank on the planet. Liars, cheats, and t..."
1,2023-03-25 19:10:42,2023,3,25,19:10:42,5,20.47.0,0,App is great.
2,2022-05-31 00:54:40,2022,5,31,00:54:40,1,20.38.1,0,Barely works. Barely. Stopped giving out notif...
3,2021-06-18 13:16:44,2021,6,18,13:16:44,5,20.27.0,0,Really easy for a technophobe
4,2023-11-29 13:54:17,2023,11,29,13:54:17,1,2310.0.1,0,Sucks
...,...,...,...,...,...,...,...,...,...
9171,2021-04-20 10:16:28,2021,4,20,10:16:28,5,20.25.1,0,Great app. Easy to use!
9172,2023-03-05 10:27:12,2023,3,5,10:27:12,5,20.47.0,0,Great App Top Notch Top Notch!
9173,2023-03-14 15:28:08,2023,3,14,15:28:08,5,20.47.0,0,It works like this to should
9174,2022-10-08 15:08:03,2022,10,8,15:08:03,1,,0,This bank insists on barriers that prevents di...


### Convert customer review into lowercase

In [63]:
def lowercase(col):
    return col.apply(lambda x: x.lower() if isinstance(x, str) else x)

In [64]:
df_copy['Review'] = lowercase(df_copy['Review'])
df_copy['Review']

0       worst bank on the planet. liars, cheats, and t...
1                                           app is great.
2       barely works. barely. stopped giving out notif...
3                           really easy for a technophobe
4                                                   sucks
                              ...                        
9171                              great app. easy to use!
9172                       great app top notch top notch!
9173                         it works like this to should
9174    this bank insists on barriers that prevents di...
9175                                      very convenient
Name: Review, Length: 9176, dtype: object

In [65]:
df_copy

Unnamed: 0,Date,Year,Month,Day,Time,Rating,Version,Review_Likes,Review
0,2022-04-20 11:38:29,2022,4,20,11:38:29,1,20.37.2,0,"worst bank on the planet. liars, cheats, and t..."
1,2023-03-25 19:10:42,2023,3,25,19:10:42,5,20.47.0,0,app is great.
2,2022-05-31 00:54:40,2022,5,31,00:54:40,1,20.38.1,0,barely works. barely. stopped giving out notif...
3,2021-06-18 13:16:44,2021,6,18,13:16:44,5,20.27.0,0,really easy for a technophobe
4,2023-11-29 13:54:17,2023,11,29,13:54:17,1,2310.0.1,0,sucks
...,...,...,...,...,...,...,...,...,...
9171,2021-04-20 10:16:28,2021,4,20,10:16:28,5,20.25.1,0,great app. easy to use!
9172,2023-03-05 10:27:12,2023,3,5,10:27:12,5,20.47.0,0,great app top notch top notch!
9173,2023-03-14 15:28:08,2023,3,14,15:28:08,5,20.47.0,0,it works like this to should
9174,2022-10-08 15:08:03,2022,10,8,15:08:03,1,,0,this bank insists on barriers that prevents di...


### Print out all common stopwords in English obtained by NLTK

In [66]:
nltk.download('stopwords')
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Tokenize customer reviews

In [67]:
def review_split(col):
  '''
  Split each review into a list of whitespace-separated words
  '''
  # apply() returns a new Series instead of overwriting `col` element by element
  return col.apply(lambda x: x.split() if isinstance(x, str) else x)

In [68]:
df_copy['Review'] = review_split(df_copy['Review'])

In [69]:
df_copy

Unnamed: 0,Date,Year,Month,Day,Time,Rating,Version,Review_Likes,Review
0,2022-04-20 11:38:29,2022,4,20,11:38:29,1,20.37.2,0,"[worst, bank, on, the, planet., liars,, cheats..."
1,2023-03-25 19:10:42,2023,3,25,19:10:42,5,20.47.0,0,"[app, is, great.]"
2,2022-05-31 00:54:40,2022,5,31,00:54:40,1,20.38.1,0,"[barely, works., barely., stopped, giving, out..."
3,2021-06-18 13:16:44,2021,6,18,13:16:44,5,20.27.0,0,"[really, easy, for, a, technophobe]"
4,2023-11-29 13:54:17,2023,11,29,13:54:17,1,2310.0.1,0,[sucks]
...,...,...,...,...,...,...,...,...,...
9171,2021-04-20 10:16:28,2021,4,20,10:16:28,5,20.25.1,0,"[great, app., easy, to, use!]"
9172,2023-03-05 10:27:12,2023,3,5,10:27:12,5,20.47.0,0,"[great, app, top, notch, top, notch!]"
9173,2023-03-14 15:28:08,2023,3,14,15:28:08,5,20.47.0,0,"[it, works, like, this, to, should]"
9174,2022-10-08 15:08:03,2022,10,8,15:08:03,1,,0,"[this, bank, insists, on, barriers, that, prev..."


### Filling missing data

In [70]:
def nan_value_count(df):
  '''
    Count the missing values in each column
  '''
  # isna().sum() counts NaNs per column; to_frame() names the resulting column
  return df.isna().sum().to_frame('Number of missing values')

In [71]:
nan_value_count(df_copy)

Unnamed: 0,Number of missing values
Date,0
Year,0
Month,0
Day,0
Time,0
Rating,0
Version,693
Review_Likes,0
Review,0


The Version column has 693 missing values; every other column is complete. We fill those records with 0 as a placeholder for an unknown version.

In [72]:
df_copy['Version'] = df_copy['Version'].fillna(0)

### Filter out stopwords and lemmatize the customer reviews

In [75]:
# all common English stopwords (context independent)
stop_words = set(stopwords.words('english')).union(ENGLISH_STOP_WORDS)
lemmatizer = WordNetLemmatizer()

In [76]:
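def filter_stopword(text):
  '''
  Drop stopwords and non-alphabetic tokens from a tokenized review, lemmatize
  the remaining words and join them back into a single string.
  Note: tokens with punctuation attached (e.g. "great.") fail isalpha() and are dropped too.
  '''
  tokens = [lemmatizer.lemmatize(word) for word in text if word.isalpha() and word not in stop_words]
  return ' '.join(tokens)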

In [77]:
df_copy['Review'] = df_copy['Review'].apply(filter_stopword)

In [78]:
df_copy

Unnamed: 0,Date,Year,Month,Day,Time,Rating,Version,Review_Likes,Review
0,2022-04-20 11:38:29,2022,4,20,11:38:29,1,20.37.2,0,worst bank sell info frauded told account clos...
1,2023-03-25 19:10:42,2023,3,25,19:10:42,5,20.47.0,0,app
2,2022-05-31 00:54:40,2022,5,31,00:54:40,1,20.38.1,0,barely stopped giving notification recent upda...
3,2021-06-18 13:16:44,2021,6,18,13:16:44,5,20.27.0,0,really easy technophobe
4,2023-11-29 13:54:17,2023,11,29,13:54:17,1,2310.0.1,0,suck
...,...,...,...,...,...,...,...,...,...
9171,2021-04-20 10:16:28,2021,4,20,10:16:28,5,20.25.1,0,great easy
9172,2023-03-05 10:27:12,2023,3,5,10:27:12,5,20.47.0,0,great app notch
9173,2023-03-14 15:28:08,2023,3,14,15:28:08,5,20.47.0,0,work like
9174,2022-10-08 15:08:03,2022,10,8,15:08:03,1,0,0,bank insists barrier prevents disabled custome...
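
The table above shows that "app is great." was reduced to just "app": after a whitespace split the token is "great." with its period attached, and `isalpha()` rejects it. The sketch below contrasts that behaviour with a regex-based tokenizer, using only the standard library. `DEMO_STOP_WORDS`, `tokens_split_isalpha` and `tokens_regex` are hypothetical names introduced for illustration, and the tiny stopword set stands in for the much larger NLTK and scikit-learn union used in the notebook.

```python
import re

# Hypothetical, tiny stopword set for illustration only; the notebook uses
# the union of the NLTK and scikit-learn English stopword lists.
DEMO_STOP_WORDS = {"is", "to", "the", "a", "for", "on"}

def tokens_split_isalpha(text):
    """Mimic the notebook's path: whitespace split, then keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha() and w not in DEMO_STOP_WORDS]

def tokens_regex(text):
    """Alternative: extract alphabetic runs so trailing punctuation does not discard a word."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in DEMO_STOP_WORDS]

print(tokens_split_isalpha("App is great."))  # ['app']
print(tokens_regex("App is great."))          # ['app', 'great']
```

One trade-off of the regex variant is that contractions such as "don't" are split into "don" and "t", so it may need extra handling before it replaces the original step.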


### Tokenize again after filtering out stopwords from each review

In [79]:
reviews = [review.split() for review in df_copy['Review']]

In [117]:
# reviews

### Create a dictionary for Tokenized reviews

In [80]:
review_dict = corpora.Dictionary(reviews)     # reviews are tokenized => corpus

In [81]:
print("Length of Review dictionary: ", len(review_dict))
print(review_dict)

Length of Review dictionary:  5056
Dictionary<5056 unique tokens: ['account', 'allowed', 'bank', 'business', 'close']...>


### Bag of Words (BoW)

In [82]:
# create a bag of words
BoW = [review_dict.doc2bow(review) for review in reviews]

In [118]:
# BoW
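
To make the bag-of-words representation concrete: `doc2bow` turns each tokenized review into a list of `(token_id, count)` pairs. The following is a minimal standard-library sketch of the same idea, not gensim's implementation. `build_vocab`, `to_bow` and the demo documents are hypothetical, and the ids assigned here will generally differ from the ids in `review_dict`.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Hypothetical mini-corpus of two tokenized reviews
demo_docs = [["great", "app", "notch"], ["bank", "app", "app"]]
demo_vocab = build_vocab(demo_docs)

print(demo_vocab)                         # {'great': 0, 'app': 1, 'notch': 2, 'bank': 3}
print(to_bow(demo_docs[1], demo_vocab))   # [(1, 2), (3, 1)]
```

Word order is discarded: the second review is represented only by the fact that "app" occurs twice and "bank" once.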

### Gather topics from customer reviews

In [83]:
lda_model = LdaModel(BoW, num_topics = 50, id2word = review_dict, passes = 15, random_state = 42)

In [84]:
topics = lda_model.print_topics(num_words = 10)
for topic in topics:
  print(topic)

(44, '0.318*"banking" + 0.137*"app" + 0.116*"online" + 0.109*"mobile" + 0.053*"fingerprint" + 0.033*"website" + 0.029*"far" + 0.026*"completely" + 0.018*"person" + 0.016*"unable"')
(49, '0.433*"bank" + 0.130*"best" + 0.059*"thank" + 0.048*"app" + 0.032*"branch" + 0.030*"long" + 0.024*"simply" + 0.020*"glitchy" + 0.018*"far" + 0.016*"free"')
(6, '0.242*"account" + 0.172*"access" + 0.051*"garbage" + 0.049*"app" + 0.046*"bank" + 0.038*"fixed" + 0.035*"download" + 0.031*"switch" + 0.027*"dont" + 0.022*"get"')
(46, '0.133*"breakfast" + 0.090*"slow" + 0.086*"support" + 0.063*"poor" + 0.059*"comfortable" + 0.038*"room" + 0.037*"looking" + 0.030*"thought" + 0.027*"staying" + 0.022*"choice"')
(27, '0.377*"service" + 0.267*"customer" + 0.045*"maintenance" + 0.024*"worst" + 0.020*"cancel" + 0.020*"seen" + 0.017*"hit" + 0.014*"sudden" + 0.013*"amazing" + 0.009*"freezing"')
(14, '0.229*"need" + 0.123*"keep" + 0.100*"able" + 0.096*"app" + 0.065*"crashing" + 0.047*"bug" + 0.026*"update" + 0.020*"go" 

### What do all those numbers mean?

The output looks dense at first, so let's take a closer look. Each line is a pair: the number at the very beginning is the topic number, and the string after it lists that topic's top words. The number joined to each quoted word by `*` is the weight of that word in the topic. A topic has no name of its own; its meaning is determined by the words assigned to it.

For example, "banking", "app", "online", etc. are associated with topic 44 in the output above. From these words we can infer a name for topic 44, for instance something about online and mobile banking. We can do the same with the other topic numbers.

Each weight is the probability of the word under that topic, so a larger weight means the word is more characteristic of the topic. To decide which words represent a topic, we can keep only the words whose weight reaches a chosen probability threshold.
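
A minimal sketch of the threshold idea, assuming the topic is available as a list of `(word, weight)` pairs (gensim's `lda_model.show_topic(44, topn=10)` returns pairs of this shape). The weights below are copied from the printed output for topic 44. The helper name `words_above_threshold` and the cut-off of 0.10 are assumptions for illustration, not values from the notebook.

```python
def words_above_threshold(topic_terms, threshold):
    """Keep the words whose topic weight is at least `threshold`."""
    return [word for word, weight in topic_terms if weight >= threshold]

# Weights copied from the printed output for topic 44
topic_44 = [
    ("banking", 0.318), ("app", 0.137), ("online", 0.116), ("mobile", 0.109),
    ("fingerprint", 0.053), ("website", 0.033), ("far", 0.029),
    ("completely", 0.026), ("person", 0.018), ("unable", 0.016),
]

# 0.10 is a hypothetical cut-off; tune it per dataset
print(words_above_threshold(topic_44, threshold=0.10))
# ['banking', 'app', 'online', 'mobile']
```

A fixed threshold treats every topic the same way; topics whose weight is spread thinly over many words may end up with no words at all, in which case keeping the top few words per topic is a reasonable fallback.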

Notice that only 20 topics are printed even though `num_topics` was set to 50 when training. This is a display limit, not a training failure: `print_topics` has its own `num_topics` argument, which defaults to 20, so it shows a subset of the 50 topics the model learned. Passing `num_topics = -1` to `print_topics` prints all of them.

With these 20 printed topics, we assign a name to each topic number based on the words that carry the highest weight (probability) in that topic.

### Naming each of the 20 topics
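
One possible starting point for naming, sketched under assumptions: parse the string that `print_topics` produces and join the heaviest words into a draft label, which a person can then replace with a proper name. `parse_topic` and `draft_name` are hypothetical helpers, and the sample string is the first four terms of the printed output for topic 27.

```python
import re

def parse_topic(topic_string):
    """Turn '0.318*"banking" + 0.137*"app"' into [('banking', 0.318), ('app', 0.137)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

def draft_name(topic_string, top_n=3):
    """Join the `top_n` heaviest words as a placeholder topic name."""
    terms = sorted(parse_topic(topic_string), key=lambda pair: pair[1], reverse=True)
    return " / ".join(word for word, _ in terms[:top_n])

# First four terms of topic 27 as printed above
topic_27 = '0.377*"service" + 0.267*"customer" + 0.045*"maintenance" + 0.024*"worst"'

print(parse_topic(topic_27)[:2])  # [('service', 0.377), ('customer', 0.267)]
print(draft_name(topic_27))       # service / customer / maintenance
```

The draft label is only a summary of the top words; a readable name such as "customer service" still needs human judgment.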

### Create a Sentiment list & Label it to each Customer's review

In [109]:
# five target sentiment levels; the default model below only returns POSITIVE or NEGATIVE
sentiment = {'Sentiment': ['Very positive', 'Positive', 'Neutral', 'Negative', 'Very negative']}
sentiment_analyzer = pipeline('sentiment-analysis')

def label_sentiment(review):
  '''
    Label a review with the sentiment predicted by the pipeline
    (POSITIVE or NEGATIVE for the default model)
  '''
  result = sentiment_analyzer(review)     # list with one dict: {'label': ..., 'score': ...}
  sentiment_label = result[0]['label']
  return sentiment_label

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
# # assign sentiment labels to each review (stored in a new column so the review text is kept)
# df_copy['Sentiment'] = df_copy['Review'].apply(label_sentiment)
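
The `sentiment` list defines five levels, while the default pipeline returns only a binary label plus a confidence score. One rough way to bridge the two, shown here as a sketch rather than an established method, is to bucket the `{'label', 'score'}` result by confidence. The function `five_level_sentiment` and its cut-offs `strong` and `weak` are hypothetical; the score measures the model's confidence, not the intensity of the sentiment, so this mapping is only an approximation.

```python
def five_level_sentiment(result, strong=0.95, weak=0.60):
    """Map a {'label': ..., 'score': ...} dict to one of five sentiment levels.

    `strong` and `weak` are hypothetical cut-offs chosen for illustration.
    """
    label, score = result["label"], result["score"]
    if score < weak:
        # Low confidence either way is treated as neutral
        return "Neutral"
    if label == "POSITIVE":
        return "Very positive" if score >= strong else "Positive"
    return "Very negative" if score >= strong else "Negative"

# Hand-written example results in the shape the pipeline returns
print(five_level_sentiment({"label": "POSITIVE", "score": 0.99}))  # Very positive
print(five_level_sentiment({"label": "NEGATIVE", "score": 0.80}))  # Negative
print(five_level_sentiment({"label": "NEGATIVE", "score": 0.55}))  # Neutral
```

In the notebook this would be applied to `sentiment_analyzer(review)[0]`. A model trained on star ratings or on a neutral class would be a more principled way to obtain five levels.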