# Preprocessing + Topic Modeling (Mixed Language Approach)
This notebook serves as a demo for topic modeling GCash App Store Reviews using LDA. The approach of this notebook is to ignore the multi-lingual nature of reviews and move directly to LDA after data cleaning. This notebook will be divided into four parts:

1. Data Cleaning
2. Text Cleaning
3. Modeling (LDA)
4. Exploration of Results

In [None]:
!pip install emot contractions pyLDAvis lda
!python3 -m spacy download en

In [181]:
# importing libraries
import pandas as pd
import numpy as np 
import regex as re
import datetime
import json

from emot.emo_unicode import UNICODE_EMO, EMOTICONS
import contractions


from spacy.lang.tl.stop_words import STOP_WORDS as tl_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop

#nltk
import nltk
from nltk.corpus import stopwords

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import nltk

# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim_models  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

In [122]:
# importing the dataset
df = pd.read_csv('gcash_0613-1427.csv')

In [123]:
df.shape

(4867, 9)

In [124]:
df.dtypes

title                object
date                 object
rating                int64
userName             object
isEdited               bool
review               object
app_name             object
app_id                int64
developerResponse    object
dtype: object

In [125]:
df.head(2)

Unnamed: 0,title,date,rating,userName,isEdited,review,app_name,app_id,developerResponse
0,It is starting to become pretty annoying,2018-12-29 15:24:59,3,Fhayerie0406,False,Almost everything is smooth except for three t...,gcash,520020791,
1,Actually ok.,2020-05-16 07:33:32,5,i don't usually rate/review,False,This app gets so many negative reviews but it’...,gcash,520020791,


## Part 1: Data Exploration + Cleaning

In [126]:
# Checking for null values
df.isnull().sum() 

title                   0
date                    0
rating                  0
userName                0
isEdited                0
review                  0
app_name                0
app_id                  0
developerResponse    4718
dtype: int64

In [127]:
# Dropping unwanted columns [userName, app_name, app_id, developerResponse]
df = df.drop(['userName', 'app_name', 'app_id', 'developerResponse', 'isEdited'], axis=1)

In [128]:
# Splitting datetime column + Checking date range

# Code block below separated the timestamp [[date]] column into Date and Time column respectively
df['Dates'] = pd.to_datetime(df['date']).dt.date
df['Time'] = pd.to_datetime(df['date']).dt.time
df = df.drop('date', axis=1)

In [129]:
df.sort_values(by='Dates', ascending=False).head(1)

Unnamed: 0,title,rating,review,Dates,Time
2570,Annoying,2,I can’t open my GCash account 4days after i ac...,2021-06-11,09:05:47


In [130]:
df.sort_values(by='Dates', ascending=True).head(1)

Unnamed: 0,title,rating,review,Dates,Time
4134,App does not work,1,App does not work. It hangs and automatically ...,2012-05-23,16:24:07


In the section above, we can see that the dataset ranges from 2012-05-23 to 2021-06-11.

In [131]:
df.dtypes

title     object
rating     int64
review    object
Dates     object
Time      object
dtype: object

In [132]:
# Joining the title and review column

df['final_review'] = df.title.map(str) + " " + df.review

In [133]:
df = df.drop(['review','title'], axis=1)

In [134]:
df['word_count'] = df['final_review'].str.split().str.len()

In [135]:
df.head(1)

Unnamed: 0,rating,Dates,Time,final_review,word_count
0,3,2018-12-29,15:24:59,It is starting to become pretty annoying Almos...,234


In [136]:
df2 = df.copy()

In [137]:
# Filter pandas df for reviews from january 2020 to may 2021
# import datetime
df2 = df2[(df2['Dates']>datetime.date(2020,1,1)) & (df2['Dates']<datetime.date(2021,5,31))] 

In [138]:
df2.shape

(2498, 5)

In [139]:
df2.sort_values(by='word_count')

Unnamed: 0,rating,Dates,Time,final_review,word_count
4491,1,2021-01-21,02:28:07,Poor WTF,2
4308,5,2020-06-14,14:51:30,omsim 🔥🔥🔥🔥,2
4325,1,2020-04-17,12:20:45,Nakputaka Nakputaka,2
4539,1,2020-09-24,02:47:09,Habducy Hansjaknahd,2
4230,5,2021-05-12,16:22:47,Yes Kaksnsksksk,2
...,...,...,...,...,...
3727,1,2020-01-04,12:59:40,Don’t trust your money with them If there is a...,305
3751,2,2021-04-24,01:39:51,"The best in PH, because it’s the only thing we...",343
3206,1,2020-02-15,11:35:03,Too many issues/Customer Service unhelpful Tho...,349
3778,1,2020-03-18,14:44:27,"App is horrible Hi, this is my 2nd time of sen...",406


In [140]:
df2['word_count'].value_counts().sort_index().head(10)

2     14
3      4
4     22
5     23
6     34
7     31
8     53
9     49
10    46
11    48
Name: word_count, dtype: int64

In [141]:
df2.loc[df2['word_count'].isin([2])]

Unnamed: 0,rating,Dates,Time,final_review,word_count
3670,1,2021-02-03,01:50:17,-0 Disappointing.,2
3927,5,2021-04-25,03:08:11,Good Good,2
3943,2,2020-09-28,18:43:16,Scammers. Trash.,2
4006,1,2021-01-28,09:19:40,😤 😤,2
4007,1,2021-04-15,10:51:18,. 🤦🏻‍♀️,2
4230,5,2021-05-12,16:22:47,Yes Kaksnsksksk,2
4272,5,2020-12-11,08:23:09,Mae0210 5star,2
4286,1,2020-10-05,15:50:45,trashhhhh trashhhhhh,2
4299,1,2020-07-26,01:10:49,ULOL Kakainis,2
4308,5,2020-06-14,14:51:30,omsim 🔥🔥🔥🔥,2


## Part 2: Text Cleaning

Now that we have done the initial cleaning of the df, let us now proceed to text cleaning for topic modeling purposes.

My approach in this notebook would be as follows:
1. Do the necessary text cleaning steps (remove encodings, lowercase, strip punctutation, expand contractions etc)
2. Remove Filipino stop words and possibly lemmatize Filipino words
3. Remove English stop words and lemmatize
4. Tokenization and other methods

In [142]:
df2.head(4)

Unnamed: 0,rating,Dates,Time,final_review,word_count
1,5,2020-05-16,07:33:32,Actually ok. This app gets so many negative re...,205
2,3,2020-02-04,00:25:57,Cash In Concern A little bit disappointed when...,186
6,5,2020-06-07,04:57:17,Gcash user since 2018 I've been using gcash fo...,173
7,1,2020-04-26,04:57:47,GCash is Losing its Prime Okay GCash dati 3-4 ...,250


#### Remove emojis and emoticons

In [143]:
# Remove emojis and emoticons

# Code block 1: remove emojis
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # chinese char
                               u"\U00002702-\U000027B0"
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # dingbats
                               u"\u3030"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

#Example
remove_emoji("Have fun with NLP! 😃😃")

'Have fun with NLP! '

In [144]:
# apply the remove emoji function to the review column
df2['final_review'] = df2['final_review'].apply(lambda x: remove_emoji(x))

In [145]:
df2.loc[df2['word_count'].isin([1])]

Unnamed: 0,rating,Dates,Time,final_review,word_count


In [146]:
# Code block 2: remove emoticons

# from emot.emo_unicode import UNICODE_EMO, EMOTICONS

# Function for removing emoticons
def remove_emoticons(text):
    emoticon_pattern = re.compile(u'(' + u'|'.join(k for k in EMOTICONS) + u')')
    return emoticon_pattern.sub(r'', text)


In [147]:
#Example
remove_emoticons("I used to play Super Mario games :)")

'I used to play Super Mario games '

In [148]:
# Applying remove emoticons code
df2['final_review'] = df2['final_review'].apply(lambda x: remove_emoticons(x))

In [149]:
df2['word_count'] = df2['final_review'].str.split().str.len()

In [150]:
df2['word_count'].value_counts().sort_index().head(6)

0     1
1     2
2    11
3     5
4    22
5    28
Name: word_count, dtype: int64

In [151]:
df2.loc[df2['word_count'].isin([2])]

Unnamed: 0,rating,Dates,Time,final_review,word_count
3670,1,2021-02-03,01:50:17,-0 Disappointing.,2
3927,5,2021-04-25,03:08:11,Good Good,2
3943,2,2020-09-28,18:43:16,Scammers. Trash.,2
4230,5,2021-05-12,16:22:47,Yes Kaksnsksksk,2
4272,5,2020-12-11,08:23:09,Mae0210 5star,2
4286,1,2020-10-05,15:50:45,trashhhhh trashhhhhh,2
4299,1,2020-07-26,01:10:49,ULOL Kakainis,2
4325,1,2020-04-17,12:20:45,Nakputaka Nakputaka,2
4463,4,2021-04-12,08:32:31,svdtstbgdvyvg tvtvyvttttt,2
4491,1,2021-01-21,02:28:07,Poor WTF,2


#### Remove \n, \t, \r

In [152]:
df2['final_review'][12]

'Best Finance App! GCash is a big help, indeed. I’ve been using using this app since 2017 and it really helped me to be more financially literate for it has a lot of features that will enable youth professionals to efficiently manage their finances. \r\n\r\nSome of the best features include:\r\n- It allows you to buy load and earn rebates! \r\n- Upon registering to promo, you can earn coupons which you can use on scan to pay features\r\n- allows you to save with high interest, and invest too! \r\n- easily transfer funds to bank with real-time posting and no extra fees! \r\n- Scan to pay is very useful\r\n- Secured and convenient e-wallet\r\n- Wide-range of coverage, a lot of establishments are using GCash already!\r\n- You can help nature by just loading or walking! \r\n- Smooth application\r\n\r\nWhy 4 star not 5 then? \r\nI think there is a need for the following upgrades: \r\n\r\n1. GCredit\r\nAllow users to pay it in installment. Like for 5,000 they can choose of to pay it within 3

In [153]:
# Remove \r \n \t
df2['final_review'] = df2['final_review'].replace(r'\r+|\n+|\t+',' ', regex=True)

In [154]:
df2['final_review'][12]

'Best Finance App! GCash is a big help, indeed. I’ve been using using this app since 2017 and it really helped me to be more financially literate for it has a lot of features that will enable youth professionals to efficiently manage their finances.     Some of the best features include:  - It allows you to buy load and earn rebates!   - Upon registering to promo, you can earn coupons which you can use on scan to pay features  - allows you to save with high interest, and invest too!   - easily transfer funds to bank with real-time posting and no extra fees!   - Scan to pay is very useful  - Secured and convenient e-wallet  - Wide-range of coverage, a lot of establishments are using GCash already!  - You can help nature by just loading or walking!   - Smooth application    Why 4 star not 5 then?   I think there is a need for the following upgrades:     1. GCredit  Allow users to pay it in installment. Like for 5,000 they can choose of to pay it within 3 to 5 months.     Include card for

#### Convert reviews to lowercase

In [155]:
df2['final_review'] = df2['final_review'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df2['final_review'].head()

1    actually ok. this app gets so many negative re...
2    cash in concern a little bit disappointed when...
6    gcash user since 2018 i've been using gcash fo...
7    gcash is losing its prime okay gcash dati 3-4 ...
8    read please!! those others who doesn’t like gc...
Name: final_review, dtype: object

#### Removing URLS

In [156]:
# import re

def remove_urls (vTEXT):
    vTEXT = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', vTEXT, flags=re.MULTILINE)
    return(vTEXT)

df2['final_review'] = df2.final_review.apply(remove_urls)
df2.final_review.head()

1    actually ok. this app gets so many negative re...
2    cash in concern a little bit disappointed when...
6    gcash user since 2018 i've been using gcash fo...
7    gcash is losing its prime okay gcash dati 3-4 ...
8    read please!! those others who doesn’t like gc...
Name: final_review, dtype: object

#### Expand Contractions

Contractions are words or combinations of words which are shortened by dropping letters and replacing them by an apostrophe. 

Let’s have a look at some examples:
* we’re = we are
* we’ve = we have
* I’d = I would

Note: This step needs to be done before word tokenizer because NLTK word tokenizer has in-built methods for dealing with contractions. However, NLTK word tokenizer's approach separates contractions without expanding. Expanding is a better method than simply separating.

In [157]:
# import contractions

# Example text 
text = ''' She'll be airport in 30 mins. We are supposed to catch the arrival, aren't we?  
          I'd love to welcome her personally. It'll be an awesome vacation.'''
  
# creating an empty list 
expanded_words = []     
for word in text.split(): 
  # using contractions.fix to expand the shotened words 
  expanded_words.append(contractions.fix(word))    
    
expanded_text = ' '.join(expanded_words) 
print('Original text: ' + text) 
print('\n') 
print('Expanded_text: ' + expanded_text)

Original text:  She'll be airport in 30 mins. We are supposed to catch the arrival, aren't we?  
          I'd love to welcome her personally. It'll be an awesome vacation.


Expanded_text: she will be airport in 30 mins. We are supposed to catch the arrival, are not we? I would love to welcome her personally. it will be an awesome vacation.


In [158]:
df['final_review'] = df['final_review'].apply(lambda x: contractions.fix(x))
df.final_review.sample(5)

3795    buy load just want to ask what happened po? sa...
4237                              stup1d app Mandarambong
4086    Crashing It crashes everytime I buy load or se...
3432    blaming users for tech issues please just make...
2703    Customer service Medyo makulit ako at maraming...
Name: final_review, dtype: object

#### Remove punctuation

In [159]:
df2['final_review'] = df2['final_review'].str.replace(r'[^\w\s]','')

  """Entry point for launching an IPython kernel.


In [160]:
df2['final_review'][12]

'best finance app gcash is a big help indeed ive been using using this app since 2017 and it really helped me to be more financially literate for it has a lot of features that will enable youth professionals to efficiently manage their finances some of the best features include  it allows you to buy load and earn rebates  upon registering to promo you can earn coupons which you can use on scan to pay features  allows you to save with high interest and invest too  easily transfer funds to bank with realtime posting and no extra fees  scan to pay is very useful  secured and convenient ewallet  widerange of coverage a lot of establishments are using gcash already  you can help nature by just loading or walking  smooth application why 4 star not 5 then i think there is a need for the following upgrades 1 gcredit allow users to pay it in installment like for 5000 they can choose of to pay it within 3 to 5 months include card for gcredit 2 customer service you know what to do customer servic

In [161]:
df2.dtypes

rating           int64
Dates           object
Time            object
final_review    object
word_count       int64
dtype: object

#### Removing 1 and 2 word reviews + Separating good and bad reviews into separate dataframes

In [162]:
df2 = df2.reset_index()

In [163]:
df2.final_review[11]

'was good but i use gcash to pay my bills my bank hsbc is not one of the banks that has direct communication with gcash but i was able to cash in via the mastercardvisa option now they removed it since hsbc online does not support gcash i need to physically withdraw money and cash in via over the counter this is a shame because of the current pandemic i want to stay away from people as much as possible now its two crowded areas i have to go to another thing 711s cash in is a confusing robbery i did two transactions because they only allowed up to 10k for a transaction the 10k cost 60php but the 91k transaction was 180php what one more i linked it with my app store i bought an in game product got deducted but nothing was credited to my account i contacted the ticket desk but there has been no reply except for an automated thank you note convenience was the only thing you had going for you and youre messing it up just like all other globe services'

In [164]:
df2['word_count'].value_counts().sort_index().head(6)

0     1
1     2
2    11
3     5
4    22
5    28
Name: word_count, dtype: int64

In [165]:
df2.loc[df2['word_count'].isin([3])]

Unnamed: 0,index,rating,Dates,Time,final_review,word_count
1416,2715,2,2020-05-01,01:32:09,system always down,3
1907,3603,4,2021-01-18,15:06:28,gcash 5 30,3
2061,3872,1,2020-04-21,02:23:58,worst very unreliable,3
2133,3986,1,2020-07-05,04:15:02,no freebies scam,3
2168,4237,1,2021-04-14,16:40:57,stup1d app mandarambong,3


In [166]:
df3 = df2.copy()

df3 = df3[~df3['word_count'].isin([0,1,2,3])]

In [167]:
df2.shape

(2498, 6)

In [168]:
df3.shape 

(2479, 6)

In [169]:
df_goodreviews = df3.loc[df['rating'].isin([4,5])].copy()
df_badreviews = df3.loc[df['rating'].isin([1,2,3])].copy()

In [170]:
print(f'df_good_reviews: {df_goodreviews.shape}')
print(f'df_bad_reviews: {df_badreviews.shape}')

df_good_reviews: (555, 6)
df_bad_reviews: (1924, 6)


## Part 3: Modeling

#### Download nltk English stopwords and spacy model (EN for lemmatization)

In [175]:
# Run in python console
# import nltk
nltk.download('stopwords')

# Run in terminal or command prompt
#!python3 -m spacy download |en

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

#### Prepare Stop Words (English + Tagalog)

Sources for English stopwords: 
1. nltk library

Sources for Tagalog stopwords:
1. Typical Stop Words: https://github.com/stopwords-iso/stopwords-tl/blob/master/stopwords-tl.json
2. Profanity : https://github.com/jromest/filipino-badwords-list/blob/master/src/filipino-badwords-list.js

In [177]:
# from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [236]:
# import json
with open('tagalog-sw') as file:
    tl_stopwords = json.load(file)

In [336]:
# Extend stopwords to include tagalog + common gcash related words 
stop_words.extend(tl_stopwords)
stop_words.extend(['gcash','g-cash','po', 'app', 'would','can'])
stop_words.extend(['nyo ','naman','yung', 'di', 'wala ','mag', 'nag', 'pag', 'kayo', 'lang', 'ung', 'niyo', 'tapos', 'ba', 'mo'])

#### Tokenize and Clean Up Text some more

In [337]:
df_badreviews.columns

Index(['index', 'rating', 'Dates', 'Time', 'final_review', 'word_count'], dtype='object')

In [338]:
bad_reviews = df_badreviews.final_review.values.tolist()
good_reviews = df_goodreviews.final_review.values.tolist()

In [339]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuations

br_words = list(sent_to_words(bad_reviews))
gr_words = list(sent_to_words(good_reviews))

#### Create Bigram and Trigram Models

Bigrams are two words frequently occurring together in the document. Trigrams are 3 words frequently occurring.

Gensim’s Phrases model can build and implement the bigrams, trigrams, quadgrams and more. The two important arguments to Phrases are *min_count* and *threshold*. The higher the values of these param, the harder it is for words to be combined to bigrams.

###### Bigram and Trigram for good reviews

In [340]:
# Build the bigram and trigram models
g_bigram = gensim.models.Phrases(gr_words, min_count=5, threshold=80) # higher threshold fewer phrases.
g_trigram = gensim.models.Phrases(g_bigram[gr_words], threshold=80)  

# Faster way to get a sentence clubbed as a trigram/bigram
g_bigram_mod = gensim.models.phrases.Phraser(g_bigram)
g_trigram_mod = gensim.models.phrases.Phraser(g_trigram)

# See trigram example
print(g_trigram_mod[g_bigram_mod[gr_words[0]]])



['cash', 'in', 'concern', 'little', 'bit', 'disappointed', 'when', 'cashed', 'in', 'via', 'debit_card', 'waited', 'mins', 'to', 'receive', 'the', 'otp', 'had', 'to', 'restart', 'the', 'application', 'but', 'still', 'same', 'scenario', 'happened', 'when', 'checked', 'my', 'account', 'was', 'already', 'credited', 'by', 'my', 'bank', 'and', 'so', 'reached', 'for', 'customer', 'support', 'since', 'it', 'was', 'not', 'funded', 'in', 'my', 'gcash', 'account', 'the', 'first', 'one', 'talked', 'to', 'said', 'that', 'they', 'had', 'no', 'transactions', 'so', 'need', 'to', 'call', 'my', 'issuing', 'bank', 'called', 'them', 'and', 'said', 'that', 'if', 'it', 'was', 'already', 'deducted', 'on', 'my', 'account', 'meaning', 'it', 'was', 'successful', 'transaction', 'and', 'need', 'to', 'talk', 'to', 'gcash', 'so', 'tried', 'reaching', 'out', 'to', 'gcash', 'customer', 'support', 'but', 'still', 'no', 'update', 'on', 'my', 'concern', 'this', 'was', 'always', 'concern', 'to', 'me', 'whenever', 'cash',

##### Bigram and Trigam for bad reviews

In [341]:
# Build the bigram and trigram models
b_bigram = gensim.models.Phrases(br_words, min_count=5, threshold=80) # higher threshold fewer phrases.
b_trigram = gensim.models.Phrases(b_bigram[br_words], threshold=80)  

# Faster way to get a sentence clubbed as a trigram/bigram
b_bigram_mod = gensim.models.phrases.Phraser(b_bigram)
b_trigram_mod = gensim.models.phrases.Phraser(b_trigram)

# See trigram example
print(b_trigram_mod[b_bigram_mod[br_words[3]]])



['read', 'please', 'those', 'others', 'who', 'doesnt', 'like', 'gcash', 'dont', 'expect', 'to', 'much', 'of', 'gcash', 'because', 'not', 'everything', 'want', 'can', 'give', 'it', 'to', 'those', 'others', 'who', 'have', 'an', 'error', 'maybe', 'give', 'little', 'more', 'patience', 'think', 'of', 'it', 'as', 'the', 'time', 'is', 'giving', 'challenge', 'like', 'if', 'can', 'give', 'more', 'patience', 'or', 'not', 'you', 'just', 'really', 'need', 'time', 'and', 'maybe', 'try', 'to', 'solve', 'whats', 'the', 'problem', 'first', 'before', 'you', 'dramatically', 'go', 'crazy', 'anyways', 'other', 'than', 'that', 'base', 'on', 'my', 'experience', 'just', 'dont', 'get', 'why', 'people', 'dont', 'realize', 'that', 'we', 'are', 'all', 'hurting', 'but', 'nobody', 'cares', 'at', 'all', 'maybe', 'at_least', 'try', 'to', 'solve', 'your', 'problems', 'with', 'those', 'people', 'close', 'to', 'you', 'and', 'dont', 'make', 'up', 'things', 'thats', 'not', 'even', 'true', 'from', 'the', 'start', 'the', '

#### Remove Stopwords, Make Bigrams and Lemmatize (Eng. words only)

##### Functionalizing the steps

In [342]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def g_make_bigrams(texts):
    return [g_bigram_mod[doc] for doc in texts]

def g_make_trigrams(texts):
    return [g_trigram_mod[g_bigram_mod[doc]] for doc in texts]

def b_make_bigrams(texts):
    return [b_bigram_mod[doc] for doc in texts]

def b_make_trigrams(texts):
    return [b_trigram_mod[b_bigram_mod[doc]] for doc in texts]

# def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
#     """https://spacy.io/api/annotation"""
#     texts_out = []
#     for sent in texts:
#         doc = nlp(" ".join(sent)) 
#         texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
#     return texts_out

##### Applying the functions to tokenized bad and good reviews


In [None]:
# APPLYING TO TOKENIZED GOOD REVIEWS

# Remove Stop Words
gr_words_nostops = remove_stopwords(gr_words)

# REMOVED LEMMATIZATION COS IT AFFECTS THE TAGALOG WORDS / TOPICS LOSE MEANING
# # Initialize spacy 'en' model, keeping only tagger component (for efficiency)
# # python3 -m spacy download en
# nlp = spacy.load('en', disable=['parser', 'ner'])

# # Do lemmatization keeping only noun, adj, vb, adv
# gr_lemmatized = lemmatization(gr_words_nostops, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])


# Form Bigrams
gr_words_bigrams = g_make_bigrams(gr_words_nostops)


In [344]:
gr_words_bigrams[13]

['inferior', 'functionality', 'useful', 'stay', 'away', 'new', 'review', 'cant', 'link', 'bank', 'account', 'system', 'keeps', 'rejecting', 'mobile', 'banking', 'password', 'even', 'im', 'inputting', 'correct', 'one', 'got', 'locked', 'mobile', 'banking', 'account', 'trying', 'three', 'times', 'many', 'errors', 'inconvenient', 'making', 'hard', 'users', 'got', 'locked', 'forgot', 'mpin', 'security', 'questions', 'forget', 'mpin', 'totally', 'ridiculous', 'almost', 'impossible', 'recover', 'account', 'happens', 'locked', 'hours', 'try', 'making', 'things', 'inconvenient', 'users', 'cashin', 'function', 'online', 'banking', 'work', 'careful', 'receive', 'money', 'wallet', 'credit', 'card', 'linking', 'working', 'well', 'useful', 'email', 'support', 'equally', 'unhelpful', 'provides', 'totally', 'useless', 'answers', 'worthwhile', 'download', 'stay', 'away']

In [345]:
gr_lemmatized[13]

['inferior', 'functionality', 'stay', 'away', 'new', 'review', 'can', 'link', 'bank', 'account', 'system', 'keep', 'reject', 'mobile', 'banking', 'password', 'even', 'be', 'inputte', 'correct', 'get', 'lock', 'mobile', 'banking', 'account', 'try', 'time', 'many', 'error', 'inconvenient', 'make', 'hard', 'user', 'get', 'lock', 'forget', 'mpin', 'security', 'question', 'forget', 'mpin', 'totally', 'ridiculous', 'almost', 'impossible', 'recover', 'account', 'happen', 'lock', 'hour', 'try', 'make', 'thing', 'inconvenient', 'user', 'cashin', 'function', 'online', 'banking', 'work', 'careful', 'receive', 'money', 'wallet', 'credit', 'card', 'link', 'work', 'well', 'useful', 'email', 'support', 'equally', 'unhelpful', 'provide', 'totally', 'useless', 'answer', 'worthwhile', 'download', 'stay', 'away']

In [346]:
# APPLYING TO TOKENIZED BAD REVIEWS

# Remove Stop Words
br_words_nostops = remove_stopwords(br_words)

# REMOVED LEMMATIZATION COS IT AFFECTS THE TAGALOG WORDS / TOPICS LOSE MEANING
# # Initialize spacy 'en' model, keeping only tagger component (for efficiency)
# # python3 -m spacy download en
# nlp = spacy.load('en', disable=['parser', 'ner'])

# # Do lemmatization keeping only noun, adj, vb, adv
# br_lemmatized = lemmatization(br_words_nostops, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

# Form Bigrams
br_words_bigrams = b_make_bigrams(br_words_nostops)




[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):
  for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):
  for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):
  for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):
  for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):
  for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):
  for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):
  for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):
  for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):
  for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):
  for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):
  for entry_point in AVAILABLE_ENTRY_POINTS.

In [347]:
br_words_bigrams[3]


['read', 'please', 'others', 'doesnt', 'like', 'dont', 'expect', 'much', 'everything', 'want', 'give', 'others', 'error', 'maybe', 'give', 'little', 'patience', 'think', 'time', 'giving', 'challenge', 'like', 'give', 'patience', 'really', 'need', 'time', 'maybe', 'try', 'solve', 'whats', 'problem', 'first', 'dramatically', 'go', 'crazy', 'anyways', 'base', 'experience', 'dont', 'get', 'people', 'dont', 'realize', 'hurting', 'nobody', 'cares', 'maybe', 'least', 'try', 'solve', 'problems', 'people', 'close', 'dont', 'make', 'things', 'thats', 'even', 'true', 'start', 'girl', 'hurting', 'girls', 'boys', 'well', 'things', 'said', 'right', 'im', 'happy', 'using', 'thats', 'bye', 'anyways', 'sorry', 'telling', 'feelings', 'cant', 'take', 'anymore']

In [348]:
br_lemmatized[3]

['read', 'other', 'expect', 'much', 'want', 'give', 'other', 'error', 'maybe', 'give', 'little', 'patience', 'think', 'time', 'give', 'challenge', 'give', 'patience', 'really', 'need', 'time', 'maybe', 'try', 'solve', 's', 'problem', 'first', 'dramatically', 'go', 'crazy', 'anyways', 'base', 'experience', 'people', 'realize', 'hurt', 'care', 'maybe', 'least', 'try', 'solve', 'problem', 'people', 'close', 'make', 'thing', 's', 'even', 'true', 'start', 'girl', 'hurt', 'girl', 'boy', 'well', 'thing', 'say', 'be', 'happy', 'use', 's', 'sorry', 'tell', 'feeling', 'can', 'take', 'anymore']

#### Create the Dictionary and Corpus needed for Topic Modeling

The two main inputs to the LDA topic model are the dictionary(id2word) and the corpus. Let’s create them.

##### Good Reviews -  Dictionary and Corpus

In [349]:
# Create Dictionary
g_id2word = corpora.Dictionary(gr_words_bigrams)

# Create Corpus
g_texts = gr_words_bigrams

# Term Document Frequency
g_corpus = [g_id2word.doc2bow(text) for text in g_texts]

# View
print(g_corpus[:1])

[[(0, 3), (1, 2), (2, 2), (3, 1), (4, 2), (5, 1), (6, 1), (7, 1), (8, 1), (9, 4), (10, 1), (11, 1), (12, 1), (13, 3), (14, 1), (15, 2), (16, 2), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 3), (35, 1), (36, 2), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 2), (45, 1), (46, 1), (47, 1), (48, 2), (49, 1), (50, 2), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 2), (60, 1), (61, 1), (62, 1)]]


##### Bad Reviews -  Dictionary and Corpus

In [350]:
# Create Dictionary
b_id2word = corpora.Dictionary(br_words_bigrams)

# Create Corpus
b_texts = br_words_bigrams

# Term Document Frequency
b_corpus = [b_id2word.doc2bow(text) for text in b_texts]

# View
print(b_corpus[:1])

[[(0, 1), (1, 2), (2, 2), (3, 4), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 2), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 2), (35, 1), (36, 1), (37, 2), (38, 2), (39, 1), (40, 1), (41, 2), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 2), (53, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 2), (65, 2), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 2), (77, 1), (78, 2), (79, 1), (80, 2)]]


#### Building the Topic Model

##### Model for Good Reviews

In [None]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=g_corpus,
                                           id2word=g_id2word,
                                           num_topics=4, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

In [371]:
print(json.dumps(lda_model.print_topics(), indent=1))

[
 [
  0,
  "0.027*\"id\" + 0.016*\"please\" + 0.014*\"cant\" + 0.012*\"use\" + 0.010*\"always\" + 0.009*\"need\" + 0.009*\"verify\" + 0.008*\"using\" + 0.008*\"valid\" + 0.008*\"student_id\""
 ],
 [
  1,
  "0.023*\"account\" + 0.023*\"cant\" + 0.012*\"fix\" + 0.011*\"code\" + 0.010*\"still\" + 0.010*\"number\" + 0.010*\"tried\" + 0.008*\"already\" + 0.008*\"please\" + 0.008*\"mpin\""
 ],
 [
  2,
  "0.026*\"money\" + 0.010*\"bank\" + 0.010*\"service\" + 0.010*\"guys\" + 0.009*\"every\" + 0.009*\"send\" + 0.008*\"email\" + 0.008*\"load\" + 0.008*\"us\" + 0.007*\"transfer\""
 ],
 [
  3,
  "0.028*\"nyo\" + 0.019*\"update\" + 0.013*\"pera\" + 0.011*\"service\" + 0.010*\"sana\" + 0.009*\"star\" + 0.008*\"bulok\" + 0.008*\"wala\" + 0.007*\"paying_bills\" + 0.006*\"load\""
 ]
]


##### Model for Bad Reviews

In [378]:
# Build LDA model
lda_model2 = gensim.models.ldamodel.LdaModel(corpus=b_corpus,
                                           id2word=b_id2word,
                                           num_topics=6, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt

In [379]:
print(json.dumps(lda_model2.print_topics(), indent=1))

[
 [
  0,
  "0.045*\"nyo\" + 0.024*\"wala\" + 0.013*\"sana\" + 0.011*\"bpi\" + 0.009*\"yun\" + 0.009*\"bulok\" + 0.008*\"talaga\" + 0.007*\"kwenta\" + 0.007*\"yan\" + 0.007*\"laging\""
 ],
 [
  1,
  "0.020*\"php\" + 0.015*\"cashin\" + 0.014*\"withdraw\" + 0.012*\"pesos\" + 0.010*\"limit\" + 0.009*\"nga\" + 0.009*\"transaction\" + 0.009*\"star\" + 0.009*\"friends\" + 0.009*\"month\""
 ],
 [
  2,
  "0.022*\"fix\" + 0.022*\"please\" + 0.021*\"use\" + 0.020*\"cash\" + 0.018*\"update\" + 0.017*\"cant\" + 0.017*\"load\" + 0.013*\"using\" + 0.013*\"error\" + 0.012*\"time\""
 ],
 [
  3,
  "0.039*\"pera\" + 0.025*\"saying\" + 0.022*\"globe\" + 0.020*\"card\" + 0.016*\"paypal\" + 0.015*\"everytime\" + 0.015*\"link\" + 0.014*\"couldnt\" + 0.014*\"mastercard\" + 0.013*\"keeps\""
 ],
 [
  4,
  "0.052*\"id\" + 0.051*\"verification\" + 0.023*\"student\" + 0.022*\"verified\" + 0.020*\"fully_verified\" + 0.017*\"login\" + 0.017*\"last\" + 0.015*\"linked\" + 0.014*\"updated\" + 0.011*\"account\""
 ],
 [

In [380]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, g_corpus, g_id2word)
vis

In [381]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model2, b_corpus, b_id2word)
vis