# Preprocessing + Topic Modeling (Mixed Language Approach)
This notebook demonstrates topic modeling of GCash App Store reviews using Latent Dirichlet Allocation (LDA). The approach deliberately ignores the multilingual nature of the reviews and moves directly to LDA after data cleaning. The notebook is divided into four parts:

1. Data Cleaning
2. Text Cleaning
3. Modeling (LDA)
4. Exploration of Results

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
!ls "/content/drive/My Drive/Data/Reviews/App Store/GCash"

app_info.csv	     gcash_0613-1427.csv	 Old
gcash_0604-2124.csv  gcash_appstore_reviews.csv


In [None]:
!pip install emot contractions pyLDAvis lda
!python3 -m spacy download en

In [None]:
# importing libraries
import pandas as pd
import numpy as np 
import regex as re
import datetime
import json

from emot.emo_unicode import UNICODE_EMO, EMOTICONS
import contractions


from spacy.lang.tl.stop_words import STOP_WORDS as tl_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop

#nltk
import nltk
from nltk.corpus import stopwords

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import tqdm

# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim_models  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)



In [None]:
# importing the dataset
df = pd.read_csv('/content/drive/My Drive/Data/Reviews/App Store/GCash/gcash_0613-1427.csv')

In [None]:
df.shape

(4867, 9)

In [None]:
df.dtypes

title                object
date                 object
rating                int64
userName             object
isEdited               bool
review               object
app_name             object
app_id                int64
developerResponse    object
dtype: object

In [None]:
df.head(2)

Unnamed: 0,title,date,rating,userName,isEdited,review,app_name,app_id,developerResponse
0,It is starting to become pretty annoying,2018-12-29 15:24:59,3,Fhayerie0406,False,Almost everything is smooth except for three t...,gcash,520020791,
1,Actually ok.,2020-05-16 07:33:32,5,i don't usually rate/review,False,This app gets so many negative reviews but it’...,gcash,520020791,


## Part 1: Data Exploration + Cleaning

In [None]:
# Checking for null values
df.isnull().sum() 

title                   0
date                    0
rating                  0
userName                0
isEdited                0
review                  0
app_name                0
app_id                  0
developerResponse    4718
dtype: int64

In [None]:
# Dropping unwanted columns [userName, app_name, app_id, developerResponse, isEdited]
df = df.drop(['userName', 'app_name', 'app_id', 'developerResponse', 'isEdited'], axis=1)

In [None]:
# Splitting datetime column + Checking date range

# The code below splits the timestamp column `date` into separate Dates and Time columns
df['Dates'] = pd.to_datetime(df['date']).dt.date
df['Time'] = pd.to_datetime(df['date']).dt.time
df = df.drop('date', axis=1)

In [None]:
df.sort_values(by='Dates', ascending=False).head(1)

Unnamed: 0,title,rating,review,Dates,Time
2570,Annoying,2,I can’t open my GCash account 4days after i ac...,2021-06-11,09:05:47


In [None]:
df.sort_values(by='Dates', ascending=True).head(1)

Unnamed: 0,title,rating,review,Dates,Time
4134,App does not work,1,App does not work. It hangs and automatically ...,2012-05-23,16:24:07


In the section above, we can see that the dataset ranges from 2012-05-23 to 2021-06-11.

In [None]:
df.dtypes

title     object
rating     int64
review    object
Dates     object
Time      object
dtype: object

In [None]:
# Joining the title and review column

df['final_review'] = df.title.map(str) + " " + df.review

In [None]:
df = df.drop(['review','title'], axis=1)

In [None]:
df['word_count'] = df['final_review'].str.split().str.len()

In [None]:
df.head(1)

Unnamed: 0,rating,Dates,Time,final_review,word_count
0,3,2018-12-29,15:24:59,It is starting to become pretty annoying Almos...,234


In [None]:
df2 = df.copy()

In [None]:
# Keep reviews dated after 2020-01-01 and before 2021-05-31
# (strict comparisons, so both boundary dates are excluded)
# import datetime
df2 = df2[(df2['Dates']>datetime.date(2020,1,1)) & (df2['Dates']<datetime.date(2021,5,31))] 

In [None]:
df2.shape

(2498, 5)
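
The filter above uses strict comparisons, so reviews dated exactly 2020-01-01 or 2021-05-31 are left out. A minimal sketch of the difference between exclusive and inclusive bounds, using made-up dates rather than the review data (`dates`, `start` and `end` are hypothetical names):

```python
import datetime

# Hypothetical dates for illustration only (not taken from the dataset)
dates = [
    datetime.date(2019, 12, 31),
    datetime.date(2020, 1, 1),
    datetime.date(2020, 6, 15),
    datetime.date(2021, 5, 31),
    datetime.date(2021, 6, 1),
]
start, end = datetime.date(2020, 1, 1), datetime.date(2021, 5, 31)

exclusive = [d for d in dates if start < d < end]    # strict bounds, as in the cell above
inclusive = [d for d in dates if start <= d <= end]  # keeps both boundary days

print(len(exclusive), len(inclusive))
```

If the boundary days should be kept, switching to `>=` and `<=` in the pandas filter would be one way to do it.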

In [None]:
df2.sort_values(by='word_count')

Unnamed: 0,rating,Dates,Time,final_review,word_count
4491,1,2021-01-21,02:28:07,Poor WTF,2
4308,5,2020-06-14,14:51:30,omsim 🔥🔥🔥🔥,2
4325,1,2020-04-17,12:20:45,Nakputaka Nakputaka,2
4539,1,2020-09-24,02:47:09,Habducy Hansjaknahd,2
4230,5,2021-05-12,16:22:47,Yes Kaksnsksksk,2
...,...,...,...,...,...
3727,1,2020-01-04,12:59:40,Don’t trust your money with them If there is a...,305
3751,2,2021-04-24,01:39:51,"The best in PH, because it’s the only thing we...",343
3206,1,2020-02-15,11:35:03,Too many issues/Customer Service unhelpful Tho...,349
3778,1,2020-03-18,14:44:27,"App is horrible Hi, this is my 2nd time of sen...",406


In [None]:
df2['word_count'].value_counts().sort_index().head(10)

2     14
3      4
4     22
5     23
6     34
7     31
8     53
9     49
10    46
11    48
Name: word_count, dtype: int64

In [None]:
df2.loc[df2['word_count'].isin([2])]

Unnamed: 0,rating,Dates,Time,final_review,word_count
3670,1,2021-02-03,01:50:17,-0 Disappointing.,2
3927,5,2021-04-25,03:08:11,Good Good,2
3943,2,2020-09-28,18:43:16,Scammers. Trash.,2
4006,1,2021-01-28,09:19:40,😤 😤,2
4007,1,2021-04-15,10:51:18,. 🤦🏻‍♀️,2
4230,5,2021-05-12,16:22:47,Yes Kaksnsksksk,2
4272,5,2020-12-11,08:23:09,Mae0210 5star,2
4286,1,2020-10-05,15:50:45,trashhhhh trashhhhhh,2
4299,1,2020-07-26,01:10:49,ULOL Kakainis,2
4308,5,2020-06-14,14:51:30,omsim 🔥🔥🔥🔥,2


## Part 2: Text Cleaning

With the initial cleaning of the DataFrame done, we can move on to cleaning the text for topic modeling.

The approach in this notebook is as follows:
1. Do the necessary text cleaning steps (remove encodings, lowercase, strip punctuation, expand contractions, etc.)
2. Remove Filipino stop words and possibly lemmatize Filipino words
3. Remove English stop words and lemmatize
4. Tokenization and other methods
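
As a rough picture of how these steps chain together, here is a reduced sketch in plain Python. It is not the code used in this notebook: `clean_review` and `STOP_WORDS` are hypothetical names, the stop word list is a tiny stand-in, and emoji removal, contraction expansion and lemmatization are left out.

```python
import re

# Hypothetical mini stop word list; the notebook builds a much larger one later
STOP_WORDS = {"the", "is", "ang", "po", "na"}

def clean_review(text):
    """Apply a reduced version of the cleaning steps to one review."""
    text = re.sub(r"[\r\n\t]+", " ", text)       # drop line breaks and tabs
    text = text.lower()                          # normalize case
    text = re.sub(r"https?://\S+", "", text)     # strip URLs
    tokens = re.findall(r"[a-z]+", text)         # keep alphabetic tokens only
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_review("Ang app is SLOW po!\r\nSee https://example.com"))
```

The order matters in this sketch: lowercasing happens before stop word removal so that `Ang` and `ang` are treated the same.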

In [None]:
df2.head(4)

Unnamed: 0,rating,Dates,Time,final_review,word_count
1,5,2020-05-16,07:33:32,Actually ok. This app gets so many negative re...,205
2,3,2020-02-04,00:25:57,Cash In Concern A little bit disappointed when...,186
6,5,2020-06-07,04:57:17,Gcash user since 2018 I've been using gcash fo...,173
7,1,2020-04-26,04:57:47,GCash is Losing its Prime Okay GCash dati 3-4 ...,250


#### Remove emojis and emoticons

In [None]:
# Remove emojis and emoticons

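# Code block 1: remove emojis
# Several ranges below are intentionally broad and overlap, so some
# non-emoji symbols and non-Latin scripts are stripped as well.
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # box drawing, shapes, misc. symbols, arrows
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # enclosed characters (very broad range)
                               u"\U0001f926-\U0001f937"  # gestures (facepalm to shrug)
                               u"\U00010000-\U0010ffff"  # all supplementary-plane characters
                               u"\u2640-\u2642"          # gender symbols
                               u"\u2600-\u2B55"          # misc. symbols
                               u"\u200d"                 # zero-width joiner
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"                 # variation selector-16
                               u"\u3030"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)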

#Example
remove_emoji("Have fun with NLP! 😃😃")

'Have fun with NLP! '

In [None]:
# apply the remove emoji function to the review column
df2['final_review'] = df2['final_review'].apply(lambda x: remove_emoji(x))

In [None]:
df2.loc[df2['word_count'].isin([1])]

Unnamed: 0,rating,Dates,Time,final_review,word_count


In [None]:
# Code block 2: remove emoticons

# from emot.emo_unicode import UNICODE_EMO, EMOTICONS

# Function for removing emoticons
def remove_emoticons(text):
    emoticon_pattern = re.compile(u'(' + u'|'.join(k for k in EMOTICONS) + u')')
    return emoticon_pattern.sub(r'', text)


In [None]:
#Example
remove_emoticons("I used to play Super Mario games :)")

'I used to play Super Mario games '

In [None]:
# Applying remove emoticons code
df2['final_review'] = df2['final_review'].apply(lambda x: remove_emoticons(x))

In [None]:
df2['word_count'] = df2['final_review'].str.split().str.len()

In [None]:
df2['word_count'].value_counts().sort_index().head(6)

0     1
1     2
2    11
3     5
4    22
5    28
Name: word_count, dtype: int64

In [None]:
df2.loc[df2['word_count'].isin([2])]

Unnamed: 0,rating,Dates,Time,final_review,word_count
3670,1,2021-02-03,01:50:17,-0 Disappointing.,2
3927,5,2021-04-25,03:08:11,Good Good,2
3943,2,2020-09-28,18:43:16,Scammers. Trash.,2
4230,5,2021-05-12,16:22:47,Yes Kaksnsksksk,2
4272,5,2020-12-11,08:23:09,Mae0210 5star,2
4286,1,2020-10-05,15:50:45,trashhhhh trashhhhhh,2
4299,1,2020-07-26,01:10:49,ULOL Kakainis,2
4325,1,2020-04-17,12:20:45,Nakputaka Nakputaka,2
4463,4,2021-04-12,08:32:31,svdtstbgdvyvg tvtvyvttttt,2
4491,1,2021-01-21,02:28:07,Poor WTF,2


#### Remove \n, \t, \r

In [None]:
df2['final_review'][12]

'Best Finance App! GCash is a big help, indeed. I’ve been using using this app since 2017 and it really helped me to be more financially literate for it has a lot of features that will enable youth professionals to efficiently manage their finances. \r\n\r\nSome of the best features include:\r\n- It allows you to buy load and earn rebates! \r\n- Upon registering to promo, you can earn coupons which you can use on scan to pay features\r\n- allows you to save with high interest, and invest too! \r\n- easily transfer funds to bank with real-time posting and no extra fees! \r\n- Scan to pay is very useful\r\n- Secured and convenient e-wallet\r\n- Wide-range of coverage, a lot of establishments are using GCash already!\r\n- You can help nature by just loading or walking! \r\n- Smooth application\r\n\r\nWhy 4 star not 5 then? \r\nI think there is a need for the following upgrades: \r\n\r\n1. GCredit\r\nAllow users to pay it in installment. Like for 5,000 they can choose of to pay it within 3

In [None]:
# Remove \r \n \t
df2['final_review'] = df2['final_review'].replace(r'\r+|\n+|\t+',' ', regex=True)

In [None]:
df2['final_review'][12]

'Best Finance App! GCash is a big help, indeed. I’ve been using using this app since 2017 and it really helped me to be more financially literate for it has a lot of features that will enable youth professionals to efficiently manage their finances.     Some of the best features include:  - It allows you to buy load and earn rebates!   - Upon registering to promo, you can earn coupons which you can use on scan to pay features  - allows you to save with high interest, and invest too!   - easily transfer funds to bank with real-time posting and no extra fees!   - Scan to pay is very useful  - Secured and convenient e-wallet  - Wide-range of coverage, a lot of establishments are using GCash already!  - You can help nature by just loading or walking!   - Smooth application    Why 4 star not 5 then?   I think there is a need for the following upgrades:     1. GCredit  Allow users to pay it in installment. Like for 5,000 they can choose of to pay it within 3 to 5 months.     Include card for

#### Convert reviews to lowercase

In [None]:
# Lowercase every word; the split/join also collapses repeated whitespace
df2['final_review'] = df2['final_review'].apply(lambda review: " ".join(word.lower() for word in review.split()))
df2['final_review'].head()

1    actually ok. this app gets so many negative re...
2    cash in concern a little bit disappointed when...
6    gcash user since 2018 i've been using gcash fo...
7    gcash is losing its prime okay gcash dati 3-4 ...
8    read please!! those others who doesn’t like gc...
Name: final_review, dtype: object

#### Removing URLs

In [None]:
# import re

def remove_urls(text):
    """Strip URLs of the form [http(s)]://... from a string."""
    return re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', text, flags=re.MULTILINE)

df2['final_review'] = df2.final_review.apply(remove_urls)
df2.final_review.head()

1    actually ok. this app gets so many negative re...
2    cash in concern a little bit disappointed when...
6    gcash user since 2018 i've been using gcash fo...
7    gcash is losing its prime okay gcash dati 3-4 ...
8    read please!! those others who doesn’t like gc...
Name: final_review, dtype: object

#### Expand Contractions

Contractions are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe.

Let’s have a look at some examples:
* we’re = we are
* we’ve = we have
* I’d = I would

Note: This step needs to be done before word tokenization. NLTK's word tokenizer has built-in handling for contractions, but it only splits them into separate pieces without expanding them. Expanding gives cleaner tokens than simply splitting.
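
To make the difference concrete, the sketch below contrasts splitting with expanding on a short sentence. It uses plain Python only: `CONTRACTIONS` is a tiny hypothetical mapping, and `split_contractions` merely imitates the Treebank-style splitting described above rather than calling NLTK.

```python
import re

# Tiny hypothetical mapping for illustration; the `contractions` package
# ships a far larger one
CONTRACTIONS = {"aren't": "are not", "i'd": "i would", "she'll": "she will"}

def split_contractions(text):
    """Treebank-style splitting: detach the clitic but keep it contracted."""
    return re.sub(r"(\w+)(n't|'ll|'d)\b", r"\1 \2", text).split()

def expand_contractions(text):
    """Replace each known contraction with its full form, then split on spaces."""
    words = [CONTRACTIONS.get(w, w) for w in text.lower().split()]
    return " ".join(words).split()

sample = "she'll say they aren't ready"
print(split_contractions(sample))   # leaves fragments such as "'ll" and "n't"
print(expand_contractions(sample))  # yields ordinary words only
```

The fragments left by splitting would either survive as noise tokens or be dropped later, whereas the expanded words can be matched against a regular stop word list.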

In [None]:
# import contractions

# Example text 
text = ''' She'll be airport in 30 mins. We are supposed to catch the arrival, aren't we?  
          I'd love to welcome her personally. It'll be an awesome vacation.'''
  
# creating an empty list
expanded_words = []
for word in text.split():
    # using contractions.fix to expand the shortened words
    expanded_words.append(contractions.fix(word))
    
expanded_text = ' '.join(expanded_words) 
print('Original text: ' + text) 
print('\n') 
print('Expanded_text: ' + expanded_text)

Original text:  She'll be airport in 30 mins. We are supposed to catch the arrival, aren't we?  
          I'd love to welcome her personally. It'll be an awesome vacation.


Expanded_text: she will be airport in 30 mins. We are supposed to catch the arrival, are not we? I would love to welcome her personally. it will be an awesome vacation.


In [None]:
df2['final_review'] = df2['final_review'].apply(lambda x: contractions.fix(x))
df2.final_review.sample(5)

200     Suggestion I hope there would be automatic deb...
1174    My Account got bugged My Account was okay with...
2162    Innacurate interest incurred Paid for somethin...
1824    UPRGADE IN PROGRESS i'm so disappointed in thi...
4784    NEVER USE THEIR PAY BILLS!!! They claim that i...
Name: final_review, dtype: object

#### Removing 1 and 2 word reviews + Separating good and bad reviews into separate dataframes

In [None]:
df2 = df2.reset_index()

In [None]:
df2.final_review[11]

'was good, but... i use gcash to pay my bills. my bank (hsbc, is not one of the banks that has direct communication with gcash, but i was able to cash in via the mastercard/visa option. now they removed it. since hsbc online does not support gcash, i need to physically withdraw money, and cash in via over the counter. this is a shame because of the current pandemic, i want to stay away from people as much as possible. now it is two crowded areas i have to go to. another thing, 7-11’s cash in is a confusing robbery! i did two transactions, because they only allowed up to 10k for a transaction. the 10k cost 60php but the 9.1k transaction was 180php! what?! one more, i linked it with my app store. i bought an in game product, got deducted, but nothing was credited to my account. i contacted the ticket desk, but there has been no reply except for an automated thank you note. convenience was the only thing you had going for you, and you are messing it up just like all other globe services.'

In [None]:
df2['word_count'].value_counts().sort_index().head(6)

0     1
1     2
2    11
3     5
4    22
5    28
Name: word_count, dtype: int64

In [None]:
df2.loc[df2['word_count'].isin([3])]

Unnamed: 0,index,rating,Dates,Time,final_review,word_count
1416,2715,2,2020-05-01,01:32:09,system always down!!,3
1907,3603,4,2021-01-18,15:06:28,gcash 5 30,3
2061,3872,1,2020-04-21,02:23:58,worst very unreliable,3
2133,3986,1,2020-07-05,04:15:02,no freebies scam,3
2168,4237,1,2021-04-14,16:40:57,stup1d app mandarambong,3


In [None]:
df2['rating'].value_counts()

1    1456
2     373
3     358
5     182
4     129
Name: rating, dtype: int64

In [None]:
df3 = df2.copy()

df3 = df3[~df3['word_count'].isin([0,1,2])]

In [None]:
df2.shape

(2498, 6)

In [None]:
df3.shape 

(2484, 6)

In [None]:
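# Build each mask from df3 itself: df has a different index (df3 was filtered
# and re-indexed), so a mask taken from df would be aligned to the wrong rows
df_goodreviews = df3.loc[df3['rating'].isin([5])].copy()
df_badreviews = df3.loc[df3['rating'].isin([1])].copy()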

In [None]:
print(f'df_good_reviews: {df_goodreviews.shape}')
print(f'df_bad_reviews: {df_badreviews.shape}')

df_good_reviews: (354, 6)
df_bad_reviews: (1073, 6)
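
When splitting a frame by rating, it helps to build the boolean mask from the same frame that is being indexed. pandas aligns a boolean Series to the target by index label, not by position, so a mask taken from a differently indexed frame can return unintended rows without raising an error. A minimal sketch with made-up ratings (`full` and `subset` are hypothetical names):

```python
import pandas as pd

# Made-up ratings for illustration only
full = pd.DataFrame({"rating": [5, 1, 1, 5, 1, 5]})

# Filtered copy with a fresh 0..n-1 index, similar to a reset_index() step
subset = full.iloc[[1, 3, 5]].reset_index(drop=True)   # ratings: 1, 5, 5

# Mask from the other frame: aligned by label, so the wrong row comes back
mismatched = subset.loc[full["rating"].isin([5])]

# Mask from the same frame: selects exactly the 5-star rows
matched = subset.loc[subset["rating"].isin([5])]

print(mismatched["rating"].tolist(), matched["rating"].tolist())
```

A quick sanity check after such a split is to compare the resulting row counts with `value_counts()` on the rating column.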


## Part 3: Modeling

#### Download nltk English stopwords and spacy model (EN for lemmatization)

In [None]:
# Download the NLTK stop word lists
# import nltk
nltk.download('stopwords')

# The spaCy English model is downloaded in the setup cell at the top:
# !python3 -m spacy download en

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

#### Prepare Stop Words (English + Tagalog)

Sources for English stopwords: 
1. nltk library

Sources for Tagalog stopwords:
1. Typical Stop Words: https://github.com/stopwords-iso/stopwords-tl/blob/master/stopwords-tl.json
2. Profanity : https://github.com/jromest/filipino-badwords-list/blob/master/src/filipino-badwords-list.js
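
Since the stop words come from several sources, entries can overlap or carry stray whitespace. One possible way to merge them, sketched below with short hypothetical lists in place of the real sources, is to normalize each entry and collect everything in a set.

```python
# Hypothetical word lists standing in for the NLTK and Tagalog sources
english = ["the", "is", "and", "can"]
tagalog = ["ang", "ng", "sa", "can"]     # overlaps with the English list on purpose
domain = ["gcash", "app", " wala "]      # stray whitespace on purpose

# Normalize entries, then merge into a set: duplicates collapse and
# membership checks no longer scan a list
stop_set = {w.strip().lower() for w in english + tagalog + domain}

tokens = ["gcash", "wala", "load", "ang", "bagal"]
kept = [t for t in tokens if t not in stop_set]
print(kept)
```

Without the `strip()` call, the entry `" wala "` would never match the token `"wala"`, so the word would slip through the filter.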

In [None]:
# from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [None]:
# import json
with open('/content/drive/My Drive/Data/stopwords/tagalog-sw') as file:
    tl_stopwords = json.load(file)

In [None]:
# Extend stopwords to include Tagalog + common GCash-related words
stop_words.extend(tl_stopwords)
stop_words.extend(['gcash','g-cash','po', 'app', 'would','can'])
stop_words.extend(['nyo','naman','yung', 'di', 'wala','mag', 'nag', 'pag', 'kayo', 'lang', 'ung', 'niyo', 'tapos', 'ba', 'mo', 'please', 'pls'])
stop_words.extend(['nga','yan','yun', 'akong', 'inyo', 'ur', 'star', 'nalang', 'kasi', 'talaga'])
stop_words.extend(['nmn', 'nlng','un','blah', 'tru', 'pwede','like','one','globe', 'guys', 'even', 'seems', 'hi', 'guess'])

#### Tokenize and Clean Up Text some more

In [None]:
# Check columns
df_badreviews.columns

Index(['index', 'rating', 'Dates', 'Time', 'final_review', 'word_count'], dtype='object')

In [None]:
# Convert the review columns into a list
bad_reviews = df_badreviews.final_review.values.tolist()
good_reviews = df_goodreviews.final_review.values.tolist()

In [None]:
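# Define a function to tokenize each review (simple_preprocess lowercases the text,
# drops punctuation, and discards very short or very long tokens)
def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True additionally strips accent marks from the tokens
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)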

br_words = list(sent_to_words(bad_reviews))
gr_words = list(sent_to_words(good_reviews))
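
A note on `deacc=True`: in gensim this option refers to de-accenting, that is, removing accent marks from characters, while punctuation is dropped by the tokenizer either way. The sketch below shows the idea of de-accenting with the standard library only; `strip_accents` is a hypothetical helper, not gensim's own code.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters and drop the combining marks (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Accents are removed, punctuation is left untouched by this step
print(strip_accents("señor, café!"))
```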

#### Create Bigram and Trigram Models

Bigrams are pairs of words that frequently occur together in the documents; trigrams are sequences of three such words.

Gensim’s Phrases model can build bigrams, trigrams, quadgrams and more. Its two most important arguments are *min_count* and *threshold*: the higher their values, the harder it is for words to be combined into phrases.
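
To see why higher values make phrases rarer, the sketch below scores adjacent word pairs in the spirit of the default Phrases scorer: (count of the pair minus *min_count*) times the vocabulary size, divided by the product of the two word counts. It is a simplified plain-Python approximation, not gensim's implementation; `phrase_scores` is a hypothetical helper, the documents are made up, and gensim's own vocabulary bookkeeping differs in detail.

```python
from collections import Counter

def phrase_scores(docs, min_count=1):
    """Score adjacent pairs: (count(ab) - min_count) * vocab_size / (count(a) * count(b))."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    return {
        pair: (n - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
    }

# Made-up tokenized reviews for illustration
docs = [
    ["customer", "service", "is", "slow"],
    ["customer", "service", "never", "replies"],
    ["app", "is", "slow"],
]
scores = phrase_scores(docs, min_count=1)
threshold = 1.0
phrases = sorted(pair for pair, s in scores.items() if s > threshold)
print(phrases)
```

In this toy example only pairs seen more than once score above zero, and raising either `min_count` or `threshold` far enough leaves no phrases at all.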

##### Bigram and Trigram for good reviews

In [None]:
# Build the bigram and trigram models
g_bigram = gensim.models.Phrases(gr_words, min_count=5, threshold=80) # higher threshold fewer phrases.
g_trigram = gensim.models.Phrases(g_bigram[gr_words], threshold=80)  

# Faster way to get a sentence clubbed as a trigram/bigram
g_bigram_mod = gensim.models.phrases.Phraser(g_bigram)
g_trigram_mod = gensim.models.phrases.Phraser(g_trigram)

# See trigram example
print(g_trigram_mod[g_bigram_mod[gr_words[0]]])



['cash', 'in', 'concern', 'little', 'bit', 'disappointed', 'when', 'cashed', 'in', 'via', 'debit_card', 'waited', 'mins', 'to', 'receive', 'the', 'otp', 'had', 'to', 'restart', 'the', 'application', 'but', 'still', 'same', 'scenario', 'happened', 'when', 'checked', 'my', 'account', 'was', 'already', 'credited', 'by', 'my', 'bank', 'and', 'so', 'reached', 'for', 'customer', 'support', 'since', 'it', 'was', 'not', 'funded', 'in', 'my', 'gcash', 'account', 'the', 'first', 'one', 'talked', 'to', 'said', 'that', 'they', 'had', 'no', 'transactions', 'so', 'need', 'to', 'call', 'my', 'issuing', 'bank', 'called', 'them', 'and', 'said', 'that', 'if', 'it', 'was', 'already', 'deducted', 'on', 'my', 'account', 'meaning', 'it', 'was', 'successful', 'transaction', 'and', 'need', 'to', 'talk', 'to', 'gcash', 'so', 'tried', 'reaching', 'out', 'to', 'gcash', 'customer', 'support', 'but', 'still', 'no', 'update', 'on', 'my', 'concern', 'this', 'was', 'always', 'concern', 'to', 'me', 'whenever', 'cash',

##### Bigram and Trigram for bad reviews

In [None]:
# Build the bigram and trigram models
b_bigram = gensim.models.Phrases(br_words, min_count=5, threshold=80) # higher threshold fewer phrases.
b_trigram = gensim.models.Phrases(b_bigram[br_words], threshold=80)  

# Faster way to get a sentence clubbed as a trigram/bigram
b_bigram_mod = gensim.models.phrases.Phraser(b_bigram)
b_trigram_mod = gensim.models.phrases.Phraser(b_trigram)

# See trigram example
print(b_trigram_mod[b_bigram_mod[br_words[3]]])



['best', 'finance', 'app', 'gcash', 'is', 'big', 'help', 'indeed', 'have', 'been', 'using', 'using', 'this', 'app', 'since', 'and', 'it', 'really', 'helped', 'me', 'to', 'be', 'more', 'financially', 'literate', 'for', 'it', 'has', 'lot', 'of', 'features', 'that', 'will', 'enable', 'youth', 'professionals', 'to', 'efficiently', 'manage', 'their', 'finances', 'some', 'of', 'the', 'best', 'features', 'include', 'it', 'allows', 'you', 'to', 'buy_load', 'and', 'earn', 'rebates', 'upon', 'registering', 'to', 'promo', 'you', 'can', 'earn', 'coupons', 'which', 'you', 'can', 'use', 'on', 'scan', 'to', 'pay', 'features', 'allows', 'you', 'to', 'save', 'with', 'high', 'interest', 'and', 'invest', 'too', 'easily', 'transfer', 'funds', 'to', 'bank', 'with', 'real', 'time', 'posting', 'and', 'no', 'extra', 'fees', 'scan', 'to', 'pay', 'is', 'very', 'useful', 'secured', 'and', 'convenient', 'wallet', 'wide', 'range', 'of', 'coverage', 'lot', 'of', 'establishments', 'are', 'using', 'gcash', 'already',

#### Remove Stopwords, Make Bigrams and Lemmatize (Eng. words only)

##### Functionalizing the steps

Note: lemmatization gave errant results on the Tagalog words in the corpus, so the lemmatization step here relies on spaCy's English model only.

In [None]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def g_make_bigrams(texts):
    return [g_bigram_mod[doc] for doc in texts]

def g_make_trigrams(texts):
    return [g_trigram_mod[g_bigram_mod[doc]] for doc in texts]

def b_make_bigrams(texts):
    return [b_bigram_mod[doc] for doc in texts]

def b_make_trigrams(texts):
    return [b_trigram_mod[b_bigram_mod[doc]] for doc in texts]

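def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize tokens with spaCy, keeping only tokens with an allowed POS tag.

    Relies on the global `nlp` pipeline, which must be loaded before calling.
    Tag reference: https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out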

##### Applying the functions to tokenized bad and good reviews


In [None]:
# APPLYING TO TOKENIZED GOOD REVIEWS

# Remove Stop Words
gr_words_nostops = remove_stopwords(gr_words)

# Tagalog lemmatization is left out because it gave errant results on the Tagalog words;
# only spaCy's English model is used.
# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
# python3 -m spacy download en
nlp = spacy.load('en', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
gr_lemmatized = lemmatization(gr_words_nostops, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])


# Form Bigrams
gr_words_bigrams = g_make_bigrams(gr_lemmatized)

# Form Trigrams
gr_words_trigrams = g_make_trigrams(gr_words_bigrams)



In [None]:
gr_words_trigrams[13]

['bad',
 'ever',
 'terrible',
 'unreliable',
 'time',
 'emergency',
 'always',
 'crash',
 'can',
 'use',
 'money',
 'buy_load',
 'transfer',
 'money',
 'user',
 'can',
 'transfer',
 'money',
 'bank',
 'decrease',
 'money',
 'transfer',
 'successful',
 'money',
 'credit',
 'day',
 'unreliable',
 'especially',
 'time',
 'badly',
 'need',
 'customer_service',
 'support',
 'know',
 'can',
 'use',
 'gcredit',
 'service',
 'user',
 'friendly',
 'especially',
 'try',
 'cash',
 'use',
 'go',
 'always',
 'uninstall',
 'transfer',
 'money']

In [None]:
# APPLYING TO TOKENIZED BAD REVIEWS

# Remove Stop Words
br_words_nostops = remove_stopwords(br_words)

# Tagalog lemmatization is left out because it gave errant results on the Tagalog words;
# only spaCy's English model is used.
# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
# python3 -m spacy download en
nlp = spacy.load('en', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
br_lemmatized = lemmatization(br_words_nostops, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

# Form Bigrams
br_words_bigrams = b_make_bigrams(br_lemmatized)

br_words_trigrams = b_make_trigrams(br_words_bigrams)





In [None]:
br_words_trigrams[3]


['good',
 'finance',
 'big',
 'help',
 'indeed',
 'use',
 'use',
 'really',
 'help',
 'financially',
 'literate',
 'lot',
 'feature',
 'enable',
 'youth',
 'professional',
 'efficiently',
 'manage',
 'finance',
 'good',
 'feature',
 'include',
 'allow',
 'buy_load',
 'earn',
 'rebate',
 'register',
 'promo',
 'earn',
 'coupon',
 'use',
 'scan',
 'pay',
 'feature',
 'allow',
 'save',
 'high',
 'interest',
 'invest',
 'easily',
 'transfer',
 'fund',
 'bank',
 'real',
 'time',
 'post',
 'extra',
 'fee',
 'scan',
 'pay',
 'useful',
 'secure',
 'convenient',
 'wallet',
 'wide',
 'range',
 'coverage',
 'lot',
 'establishment',
 'use',
 'already',
 'help',
 'nature',
 'loading',
 'walk',
 'smooth',
 'application',
 'think',
 'need',
 'follow',
 'upgrade',
 'gcredit',
 'allow',
 'user',
 'pay',
 'installment',
 'choose',
 'pay',
 'month',
 'include',
 'card',
 'customer_service',
 'know',
 'customer_service',
 'really',
 'honestly',
 'user',
 'interface',
 'security',
 'allow',
 'user',
 'log'

#### Create the Dictionary and Corpus needed for Topic Modeling

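The two main inputs to the LDA topic model are the dictionary (`id2word`), which maps each unique token to an integer id, and the corpus, which stores each document as a list of (token id, count) pairs. Let’s create them.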
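
For intuition, the sketch below builds both structures by hand for three made-up documents. It uses only the standard library: `token2id` and `doc2bow` here are simplified stand-ins for what gensim's `Dictionary` provides, and the ids gensim assigns may come out in a different order.

```python
from collections import Counter

# Made-up tokenized documents for illustration
docs = [
    ["cash", "transfer", "bank", "cash"],
    ["bank", "verify", "account"],
    ["cash", "account", "verify"],
]

# Dictionary: one integer id per unique token (assigned here in sorted order)
vocab = sorted({tok for doc in docs for tok in doc})
token2id = {tok: i for i, tok in enumerate(vocab)}

def doc2bow(doc):
    """Return sorted (token_id, count) pairs for tokens present in the dictionary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

corpus = [doc2bow(doc) for doc in docs]
print(token2id)
print(corpus[0])
```

Tokens that were filtered out of the dictionary simply do not appear in the bag-of-words vectors, which is how dictionary filtering shrinks the corpus.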

##### Good Reviews -  Dictionary and Corpus

In [None]:
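# Create Dictionary
g_id2word = corpora.Dictionary(gr_words_trigrams)
# Drop tokens found in fewer than 10 documents or in more than 35% of all documents
g_id2word.filter_extremes(no_below=10, no_above=0.35)
g_id2word.compactify()  # reassign ids so there are no gaps after filtering

# Create Corpus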
g_texts = gr_words_trigrams

# Term Document Frequency
g_corpus = [g_id2word.doc2bow(text) for text in g_texts]

# View
print(g_corpus[:1])

[[(0, 3), (1, 2), (2, 2), (3, 2), (4, 1), (5, 5), (6, 1), (7, 1), (8, 2), (9, 1), (10, 1), (11, 1), (12, 1), (13, 2), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 3), (20, 1), (21, 1), (22, 2), (23, 1), (24, 2), (25, 2), (26, 2), (27, 2), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1)]]


##### Bad Reviews -  Dictionary and Corpus

In [None]:
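# Create Dictionary
b_id2word = corpora.Dictionary(br_words_trigrams)
# Drop tokens found in fewer than 10 documents or in more than 35% of all documents
b_id2word.filter_extremes(no_below=10, no_above=0.35)
b_id2word.compactify()  # reassign ids so there are no gaps after filtering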

# Create Corpus
b_texts = br_words_trigrams

# Term Document Frequency
b_corpus = [b_id2word.doc2bow(text) for text in b_texts]

# View
print(b_corpus[:1])

[[(0, 2), (1, 2), (2, 1), (3, 2), (4, 4), (5, 1), (6, 1), (7, 4), (8, 1), (9, 2), (10, 2), (11, 1), (12, 1), (13, 4), (14, 1), (15, 1), (16, 1), (17, 1), (18, 3), (19, 1), (20, 2), (21, 1), (22, 1)]]


#### Building the Topic Model

##### Optimizing Number of Topics for Good Reviews

In [None]:
# Build initial LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=g_corpus,
                                           id2word=g_id2word,
                                           num_topics=6, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

In [None]:
print(json.dumps(lda_model.print_topics(), indent=1))

[
 [
  0,
  "0.257*\"would\" + 0.092*\"verification\" + 0.079*\"work\" + 0.066*\"student\" + 0.055*\"back\" + 0.049*\"option\" + 0.042*\"need\" + 0.041*\"want\" + 0.032*\"long\" + 0.024*\"verify\""
 ],
 [
  1,
  "0.044*\"get\" + 0.043*\"email\" + 0.042*\"open\" + 0.040*\"number\" + 0.037*\"issue\" + 0.031*\"still\" + 0.029*\"say\" + 0.028*\"wait\" + 0.027*\"send\" + 0.027*\"time\""
 ],
 [
  2,
  "0.214*\"update\" + 0.060*\"fix\" + 0.056*\"work\" + 0.056*\"need\" + 0.054*\"always\" + 0.054*\"load\" + 0.046*\"user\" + 0.038*\"crash\" + 0.036*\"really\" + 0.035*\"can\""
 ],
 [
  3,
  "0.099*\"use\" + 0.079*\"service\" + 0.051*\"time\" + 0.051*\"money\" + 0.050*\"ticket\" + 0.047*\"can\" + 0.044*\"ask\" + 0.042*\"bad\" + 0.042*\"try\" + 0.036*\"much\""
 ],
 [
  4,
  "0.136*\"account\" + 0.087*\"verify\" + 0.077*\"can\" + 0.041*\"say\" + 0.033*\"fix\" + 0.031*\"mpin\" + 0.031*\"fully\" + 0.030*\"access\" + 0.026*\"try\" + 0.024*\"soon\""
 ],
 [
  5,
  "0.060*\"cash\" + 0.060*\"money\" + 0.0

In [None]:
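# supporting function
def compute_coherence_values(corpus, dictionary, k, a, b, texts=None):
    """Train an LDA model with k topics and return its c_v coherence score."""
    # Fall back to the good-review tokens used in the original cell
    if texts is None:
        texts = gr_words_trigrams

    lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=k, 
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=a,
                                           eta=b)

    # Score against the dictionary passed in, not the global g_id2word
    coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')

    return coherence_model_lda.get_coherence()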
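
The supporting function relies on gensim's `c_v` coherence, which combines sliding-window co-occurrence counts, NPMI, and cosine similarity. As a rough illustration of what a coherence score rewards, the sketch below implements the much simpler UMass-style measure by hand (gensim exposes it as `coherence='u_mass'`). This is a swapped-in technique, not the `c_v` computation used in this notebook. `umass_coherence` is a hypothetical helper, averaging over word pairs is a choice made for this sketch, and the documents are toy data.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic.

    Averages log((D(earlier, later) + 1) / D(earlier)) over all ordered
    pairs of top words, where D counts documents containing the words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every given word
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    for earlier, later in combinations(top_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # skip words that never occur in the corpus
        scores.append(math.log((doc_freq(earlier, later) + 1) / denom))
    return sum(scores) / len(scores) if scores else 0.0


# Toy tokenized documents, for illustration only
toy_docs = [
    ["verify", "account", "student"],
    ["verify", "account"],
    ["update", "crash"],
    ["update", "crash", "load"],
]

coherent = umass_coherence(["verify", "account"], toy_docs)
mixed = umass_coherence(["verify", "crash"], toy_docs)
print(coherent > mixed)  # True: words that co-occur score higher
```
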

In [None]:
# #Optimizing topic models

# import numpy as np
# import tqdm
# grid = {}
# grid['Validation_Set'] = {}
# # Topics range
# min_topics = 2
# max_topics = 8
# step_size = 1
# topics_range = range(min_topics, max_topics, step_size)
# # Alpha parameter
# alpha = list(np.arange(0.01, 1, 0.3))
# alpha.append('symmetric')
# alpha.append('asymmetric')
# # Beta parameter
# beta = list(np.arange(0.01, 1, 0.3))
# beta.append('symmetric')
# # Validation sets
# num_of_docs = len(g_corpus)
# corpus_sets = [# gensim.utils.ClippedCorpus(corpus, num_of_docs*0.25), 
#                # gensim.utils.ClippedCorpus(corpus, num_of_docs*0.5), 
#                #gensim.utils.ClippedCorpus(g_corpus, num_of_docs*0.75), 
#                g_corpus]
# corpus_title = ['100% Corpus']  # one title per active entry in corpus_sets
# model_results = {'Validation_Set': [],
#                  'Topics': [],
#                  'Alpha': [],
#                  'Beta': [],
#                  'Coherence': []
#                 }
# # Can take a long time to run
# if 1 == 1:
#     pbar = tqdm.tqdm(total=len(corpus_sets) * len(topics_range) * len(alpha) * len(beta))
    
#     # iterate through validation corpora
#     for i in range(len(corpus_sets)):
#         # iterate through number of topics
#         for k in topics_range:
#             # iterate through alpha values
#             for a in alpha:
#                 # iterate through beta values
#                 for b in beta:
#                     # get the coherence score for the given parameters
#                     cv = compute_coherence_values(corpus=corpus_sets[i], dictionary=g_id2word, 
#                                                   k=k, a=a, b=b)
#                     # Save the model results
#                     model_results['Validation_Set'].append(corpus_title[i])
#                     model_results['Topics'].append(k)
#                     model_results['Alpha'].append(a)
#                     model_results['Beta'].append(b)
#                     model_results['Coherence'].append(cv)
                    
#                     pbar.update(1)
#     pd.DataFrame(model_results).to_csv('gcash_gr_lda_tuning_results.csv', index=False)
#     pbar.close()
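
The grid search in the commented-out cell trains one model per combination of corpus, topic count, alpha, and beta, so its running time depends on the size of that grid. A minimal sketch for counting the combinations ahead of time, assuming only the full corpus is evaluated; the alpha and beta lists are written out by hand and approximate what `np.arange(0.01, 1, 0.3)` produces.

```python
from itertools import product

# Same search space as the commented-out cell, written out by hand.
# np.arange(0.01, 1, 0.3) yields approximately 0.01, 0.31, 0.61, 0.91.
topics_range = range(2, 8, 1)
alpha = [0.01, 0.31, 0.61, 0.91, 'symmetric', 'asymmetric']
beta = [0.01, 0.31, 0.61, 0.91, 'symmetric']
corpus_title = ['100% Corpus']  # assumption: only the full corpus is active

# One tuple per model that the grid search would train
grid = list(product(corpus_title, topics_range, alpha, beta))
print(len(grid))  # 180
```

With six topic counts, six alpha values, and five beta values, a single corpus gives 180 models. The `alpha=0.61` and `eta=0.91` used in the later model are both points on this grid.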

In [None]:
# model_cv_df_gr = pd.DataFrame(model_results) 
# model_cv_df_gr.sort_values(by='Coherence',ascending=False)

In [None]:
# Build optimized LDA model

lda_model_experimental = gensim.models.LdaMulticore(corpus=g_corpus,
                                           id2word=g_id2word,
                                           num_topics=6, 
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=0.61,
                                           eta=0.91)

  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)

In [None]:
print(json.dumps(lda_model_experimental.print_topics(), indent=1))

[
 [
  0,
  "0.081*\"work\" + 0.075*\"pay\" + 0.050*\"bill\" + 0.037*\"load\" + 0.035*\"good\" + 0.034*\"problem\" + 0.032*\"use\" + 0.030*\"transaction\" + 0.024*\"feature\" + 0.024*\"really\""
 ],
 [
  1,
  "0.055*\"get\" + 0.044*\"email\" + 0.044*\"issue\" + 0.039*\"still\" + 0.038*\"try\" + 0.038*\"help\" + 0.038*\"number\" + 0.033*\"time\" + 0.032*\"receive\" + 0.026*\"wait\""
 ],
 [
  2,
  "0.171*\"update\" + 0.097*\"can\" + 0.049*\"error\" + 0.046*\"always\" + 0.043*\"keep\" + 0.036*\"transaction\" + 0.032*\"access\" + 0.027*\"user\" + 0.026*\"log\" + 0.024*\"say\""
 ],
 [
  3,
  "0.059*\"money\" + 0.055*\"use\" + 0.041*\"send\" + 0.035*\"time\" + 0.035*\"ticket\" + 0.034*\"service\" + 0.031*\"take\" + 0.031*\"mpin\" + 0.031*\"ask\" + 0.028*\"bad\""
 ],
 [
  4,
  "0.127*\"account\" + 0.101*\"would\" + 0.092*\"verify\" + 0.039*\"verification\" + 0.035*\"option\" + 0.035*\"need\" + 0.028*\"fully\" + 0.027*\"back\" + 0.026*\"student\" + 0.021*\"make\""
 ],
 [
  5,
  "0.088*\"can\" 

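One quick way to see how the tuned model relates to the initial one is to compare the top words of two topics. The sketch below computes the Jaccard overlap between topic 0 of the initial model and topic 4 of the tuned model, using the word lists printed in the outputs above. `jaccard` is a hypothetical helper written for this illustration, and pairing these two topics is a choice made by eye, not something the models report.

```python
def jaccard(words_a, words_b):
    """Jaccard similarity between two collections of top words."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Top 10 words of topic 0 from the initial model (lda_model)
initial_topic_0 = ["would", "verification", "work", "student", "back",
                   "option", "need", "want", "long", "verify"]

# Top 10 words of topic 4 from the tuned model (lda_model_experimental)
tuned_topic_4 = ["account", "would", "verify", "verification", "option",
                 "need", "fully", "back", "student", "make"]

print(round(jaccard(initial_topic_0, tuned_topic_4), 3))  # 0.538
```

Seven of the thirteen distinct words are shared, which suggests both models isolate a similar verification-related theme.
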
##### Optimizing Number of Topics for Bad Reviews

In [None]:
# Build LDA model
lda_model2 = gensim.models.ldamodel.LdaModel(corpus=b_corpus,
                                           id2word=b_id2word,
                                           num_topics=6, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)



  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)

In [None]:
print(json.dumps(lda_model2.print_topics(), indent=1))

[
 [
  0,
  "0.101*\"update\" + 0.081*\"pay\" + 0.063*\"bill\" + 0.056*\"work\" + 0.044*\"link\" + 0.040*\"time\" + 0.034*\"use\" + 0.031*\"phone\" + 0.024*\"useless\" + 0.023*\"name\""
 ],
 [
  1,
  "0.067*\"load\" + 0.059*\"error\" + 0.054*\"receive\" + 0.052*\"customer_service\" + 0.046*\"money\" + 0.042*\"bad\" + 0.041*\"poor\" + 0.036*\"system\" + 0.032*\"service\" + 0.030*\"balance\""
 ],
 [
  2,
  "0.080*\"can\" + 0.068*\"account\" + 0.046*\"send\" + 0.042*\"try\" + 0.037*\"fix\" + 0.036*\"code\" + 0.034*\"email\" + 0.029*\"money\" + 0.026*\"say\" + 0.025*\"use\""
 ],
 [
  3,
  "0.222*\"would\" + 0.127*\"verification\" + 0.111*\"student\" + 0.061*\"verify\" + 0.049*\"option\" + 0.039*\"open\" + 0.035*\"process\" + 0.027*\"slow\" + 0.023*\"valid\" + 0.023*\"face\""
 ],
 [
  4,
  "0.061*\"use\" + 0.056*\"transaction\" + 0.055*\"bank\" + 0.044*\"cash\" + 0.044*\"make\" + 0.039*\"transfer\" + 0.034*\"mpin\" + 0.032*\"thank\" + 0.031*\"fully_verify\" + 0.030*\"need\""
 ],
 [
  5,
  "

In [None]:
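# supporting function
def compute_coherence_values(corpus, dictionary, k, a, b, texts=None):
    """Train an LDA model with k topics and return its c_v coherence score."""
    # Defaults to the good-review tokens; pass the bad-review tokens when scoring b_corpus
    if texts is None:
        texts = gr_words_trigrams

    lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=k, 
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=a,
                                           eta=b)

    # Score against the dictionary passed in, not the global g_id2word
    coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')

    return coherence_model_lda.get_coherence()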

#### Visualizing LDA model results

##### Plotting topic models using pyLDAvis

In [None]:
# Visualize the topics of the good-review model
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, g_corpus, g_id2word)
vis

In [None]:
# Visualize the topics of the bad-review model
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model2, b_corpus, b_id2word)
vis

##### Checking time series distribution of reviews [ January 1 2020 to May 31 2021 ]