# Preprocessing + Topic Modeling (Mixed Language Approach)
This notebook demonstrates topic modeling of GCash App Store reviews with Latent Dirichlet Allocation (LDA). It ignores the multilingual nature of the reviews and moves directly to LDA after data cleaning. The notebook is divided into four parts:

1. Data Cleaning
2. Text Cleaning
3. Modeling (LDA)
4. Exploration of Results

In [None]:
# importing libraries
import pandas as pd
import regex as re
# !pip install emot
from emot.emo_unicode import UNICODE_EMO, EMOTICONS
# ! pip install contractions
import contractions

In [None]:
# importing the dataset
df = pd.read_csv('gcash_0613-1427.csv')

In [None]:
df.shape

(4867, 9)

In [None]:
df.dtypes

title                object
date                 object
rating                int64
userName             object
isEdited               bool
review               object
app_name             object
app_id                int64
developerResponse    object
dtype: object

In [None]:
df.head(3)

Unnamed: 0,title,date,rating,userName,isEdited,review,app_name,app_id,developerResponse
0,It is starting to become pretty annoying,2018-12-29 15:24:59,3,Fhayerie0406,False,Almost everything is smooth except for three t...,gcash,520020791,
1,Actually ok.,2020-05-16 07:33:32,5,i don't usually rate/review,False,This app gets so many negative reviews but it’...,gcash,520020791,
2,Cash In Concern,2020-02-04 00:25:57,3,Mhei Castro,False,A little bit disappointed when I cashed in via...,gcash,520020791,


## Part 1: Data Cleaning

In [None]:
# Checking for null values
df.isnull().sum() 

title                   0
date                    0
rating                  0
userName                0
isEdited                0
review                  0
app_name                0
app_id                  0
developerResponse    4718
dtype: int64

In [None]:
# Dropping unwanted columns [userName, app_name, app_id, developerResponse, isEdited]
df = df.drop(['userName', 'app_name', 'app_id', 'developerResponse', 'isEdited'], axis=1)

In [None]:
# Splitting datetime column + Checking date range

# The code below splits the timestamp column `date` into separate Dates and Time columns
df['Dates'] = pd.to_datetime(df['date']).dt.date
df['Time'] = pd.to_datetime(df['date']).dt.time
df = df.drop('date', axis=1)

In [None]:
df.sort_values(by='Dates', ascending=False).head(1)

Unnamed: 0,title,rating,review,Dates,Time
2570,Annoying,2,I can’t open my GCash account 4days after i ac...,2021-06-11,09:05:47


In [None]:
df.sort_values(by='Dates', ascending=True).head(1)

Unnamed: 0,title,rating,review,Dates,Time
4134,App does not work,1,App does not work. It hangs and automatically ...,2012-05-23,16:24:07


In the section above, we can see that the dataset ranges from 2012-05-23 to 2021-06-11.

In [None]:
df.dtypes

title     object
rating     int64
review    object
Dates     object
Time      object
dtype: object

In [None]:
# Joining the title and review column

df['final_review'] = df.title.map(str) + " " + df.review

In [None]:
df = df.drop(['review','title'], axis=1)

## Part 2: Text Cleaning

With the initial cleaning of the DataFrame done, we can proceed to text cleaning for topic modeling.

The approach in this notebook is as follows:
1. Do the necessary text cleaning steps (remove encodings, lowercase, strip punctuation, expand contractions, etc.)
2. Remove Filipino stop words and possibly lemmatize Filipino words
3. Remove English stop words and lemmatize
4. Tokenization and other methods
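
The basic cleaning steps in item 1 can be sketched as one function. This is a minimal, standard-library-only illustration rather than the exact cells that follow: the name `basic_clean` is hypothetical, and emoji, emoticon, and contraction handling are left out because they rely on the extra packages used later.

```python
import re

def basic_clean(text):
    """Lowercase, drop URLs, collapse whitespace, and strip punctuation."""
    text = text.lower()
    # Drop anything that looks like an http(s) URL
    text = re.sub(r'https?://\S+', '', text)
    # Turn line breaks and tabs into single spaces so words stay separated
    text = re.sub(r'\s+', ' ', text)
    # Keep only word characters and whitespace
    text = re.sub(r'[^\w\s]', '', text)
    return text.strip()

print(basic_clean("Cash In Concern!\nSee https://example.com for details."))
```

The order matters in this sketch: URLs are removed before punctuation, since stripping punctuation first would break the `://` that the URL pattern looks for.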

In [None]:
df.head(4)

Unnamed: 0,rating,Dates,Time,final_review
0,3,2018-12-29,15:24:59,It is starting to become pretty annoying Almos...
1,5,2020-05-16,07:33:32,Actually ok. This app gets so many negative re...
2,3,2020-02-04,00:25:57,Cash In Concern A little bit disappointed when...
3,1,2019-12-10,09:43:52,Frustrating and slow and you’re a bunch of thi...


#### Remove emojis and emoticons

In [None]:
# Remove emojis and emoticons

# Code block 1: remove emojis
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags 
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

#Example
remove_emoji("Have fun with NLP! 😃😃")

'Have fun with NLP! '
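
One thing to be aware of: the last range in the pattern, `\U000024C2-\U0001F251`, spans far more than emoji. Among other things it covers the CJK ideograph blocks, so such characters would be stripped as well. For these reviews that probably has little effect, but a quick check makes the behavior visible. The sketch below uses the standard `re` module and two hypothetical names, `broad` and `narrow`; the narrower pattern is an assumption about which blocks matter, not a complete emoji definition.

```python
import re

# Same ranges as remove_emoji above, including the very wide final range
broad = re.compile("["
                   "\U0001F600-\U0001F64F"
                   "\U0001F300-\U0001F5FF"
                   "\U0001F680-\U0001F6FF"
                   "\U0001F1E0-\U0001F1FF"
                   "\U00002702-\U000027B0"
                   "\U000024C2-\U0001F251"
                   "]+")

# Narrower alternative: the wide range is dropped, one more emoji block is added
narrow = re.compile("["
                    "\U0001F600-\U0001F64F"
                    "\U0001F300-\U0001F5FF"
                    "\U0001F680-\U0001F6FF"
                    "\U0001F1E0-\U0001F1FF"
                    "\U00002702-\U000027B0"
                    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
                    "]+")

sample = "ok 中文 😃"
print(repr(broad.sub('', sample)))   # the CJK characters are removed too
print(repr(narrow.sub('', sample)))  # only the emoji is removed
```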

In [None]:
# apply the remove emoji function to the review column
df['final_review'] = df['final_review'].apply(lambda x: remove_emoji(x))

In [None]:
# Code block 2: remove emoticons

!pip install emot
from emot.emo_unicode import UNICODE_EMO, EMOTICONS

# Function for removing emoticons
def remove_emoticons(text):
    emoticon_pattern = re.compile(u'(' + u'|'.join(k for k in EMOTICONS) + u')')
    return emoticon_pattern.sub(r'', text)




In [None]:
#Example
remove_emoticons("I used to play Super Mario games :)")

'I used to play Super Mario games '
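
The pattern above is an alternation of every key in `EMOTICONS`. A minimal sketch of the same idea with a small hand-written dictionary (`SAMPLE_EMOTICONS` and `strip_emoticons` are hypothetical names, not part of `emot`) shows two details worth keeping in mind when the keys are plain strings: escape them with `re.escape`, and try longer emoticons first so that `:))` is not matched as `:)` with a stray `)` left behind.

```python
import re

# Hypothetical sample; the real EMOTICONS dictionary from emot is much larger
SAMPLE_EMOTICONS = {":)": "happy", ":))": "very happy", ":(": "sad", ":D": "laughing"}

# Longest keys first, each escaped so characters like ( and ) are literal
_keys = sorted(SAMPLE_EMOTICONS, key=len, reverse=True)
_pattern = re.compile("|".join(re.escape(k) for k in _keys))

def strip_emoticons(text):
    return _pattern.sub("", text)

print(repr(strip_emoticons("I used to play Super Mario games :)")))
print(repr(strip_emoticons("great app :))")))
```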

In [None]:
# Applying remove emoticons code
df['final_review'] = df['final_review'].apply(lambda x: remove_emoticons(x))

#### Remove \n, \t, \r

In [None]:
df['final_review'][12]

'Best Finance App! GCash is a big help, indeed. I’ve been using using this app since 2017 and it really helped me to be more financially literate for it has a lot of features that will enable youth professionals to efficiently manage their finances. \n\nSome of the best features include:\n- It allows you to buy load and earn rebates! \n- Upon registering to promo, you can earn coupons which you can use on scan to pay features\n- allows you to save with high interest, and invest too! \n- easily transfer funds to bank with real-time posting and no extra fees! \n- Scan to pay is very useful\n- Secured and convenient e-wallet\n- Wide-range of coverage, a lot of establishments are using GCash already!\n- You can help nature by just loading or walking! \n- Smooth application\n\nWhy 4 star not 5 then? \nI think there is a need for the following upgrades: \n\n1. GCredit\nAllow users to pay it in installment. Like for 5,000 they can choose of to pay it within 3 to 5 months. \n\nInclude card for

In [None]:
# Remove \r \n \t
# Note: the empty replacement string fuses the words on either side of a line break
df['final_review'] = df['final_review'].replace(r'\r+|\n+|\t+','', regex=True)

In [None]:
df['final_review'][12]

'Best Finance App! GCash is a big help, indeed. I’ve been using using this app since 2017 and it really helped me to be more financially literate for it has a lot of features that will enable youth professionals to efficiently manage their finances. Some of the best features include:- It allows you to buy load and earn rebates! - Upon registering to promo, you can earn coupons which you can use on scan to pay features- allows you to save with high interest, and invest too! - easily transfer funds to bank with real-time posting and no extra fees! - Scan to pay is very useful- Secured and convenient e-wallet- Wide-range of coverage, a lot of establishments are using GCash already!- You can help nature by just loading or walking! - Smooth applicationWhy 4 star not 5 then? I think there is a need for the following upgrades: 1. GCreditAllow users to pay it in installment. Like for 5,000 they can choose of to pay it within 3 to 5 months. Include card for GCredit2. Customer ServiceYou know wh
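
Because the line breaks are replaced with an empty string, words on either side of a break end up fused, as in `applicationWhy` and `GCreditAllow` above. If that is not intended, replacing each run of `\r`, `\n`, or `\t` with a single space keeps them apart. A small sketch with the standard `re` module, using a fragment of the review above as the sample string:

```python
import re

text = "- Smooth application\n\nWhy 4 star not 5 then? \nI think"

fused = re.sub(r'\r+|\n+|\t+', '', text)   # same pattern and replacement as the cell above
spaced = re.sub(r'[\r\n\t]+', ' ', text)   # one space per run of line breaks or tabs

print(fused)
print(spaced)
```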

#### Convert reviews to lowercase

In [None]:
df['final_review'] = df['final_review'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['final_review'].head()

0    it is starting to become pretty annoying almos...
1    actually ok. this app gets so many negative re...
2    cash in concern a little bit disappointed when...
3    frustrating and slow and you’re a bunch of thi...
4    payments don’t post. my payment via gcash did ...
Name: final_review, dtype: object

#### Removing URLs

In [None]:
import re

def remove_urls(vTEXT):
    # Strip anything of the form scheme://... (the http/https scheme is optional in the pattern)
    vTEXT = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', vTEXT, flags=re.MULTILINE)
    return vTEXT

df['final_review'] = df.final_review.apply(remove_urls)
df.final_review.head()

0    it is starting to become pretty annoying almos...
1    actually ok. this app gets so many negative re...
2    cash in concern a little bit disappointed when...
3    frustrating and slow and you’re a bunch of thi...
4    payments don’t post. my payment via gcash did ...
Name: final_review, dtype: object

#### Expand Contractions

Contractions are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe.

Let’s have a look at some examples:
* we’re = we are
* we’ve = we have
* I’d = I would

Note: This step should be done before word tokenization. The NLTK word tokenizer has built-in handling for contractions, but it only separates them into pieces without expanding them. Expanding is a better method than simply separating.
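
To make the difference concrete: a tokenizer that only separates contractions would turn `aren't` into pieces such as `are` and `n't`, whereas expansion yields `are not`. The sketch below illustrates expansion with a tiny hand-written mapping; `SAMPLE_CONTRACTIONS` and `expand` are hypothetical names, and this is not how the `contractions` package is implemented. It also maps the typographic apostrophe (’), which appears in many of the reviews, to the ASCII one before looking words up.

```python
import re

# Hypothetical mini-dictionary; a real list would be far longer
SAMPLE_CONTRACTIONS = {"we're": "we are", "we've": "we have",
                       "i'd": "i would", "aren't": "are not"}

def expand(text):
    # Map the typographic apostrophe to the ASCII one before lookup
    text = text.replace("\u2019", "'")
    # Replace each word found in the mapping; leave every other word unchanged
    return re.sub(r"[\w']+",
                  lambda m: SAMPLE_CONTRACTIONS.get(m.group(0).lower(), m.group(0)),
                  text)

print(expand("we\u2019re sure we've paid, aren't we?"))
```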

In [None]:
!pip install contractions



In [None]:
import contractions

# Example text 
text = ''' She'll be airport in 30 mins. We are supposed to catch the arrival, aren't we?  
          I'd love to welcome her personally. It'll be an awesome vacation.'''
  
# creating an empty list 
expanded_words = []     
for word in text.split(): 
  # using contractions.fix to expand the shortened words 
  expanded_words.append(contractions.fix(word))    
    
expanded_text = ' '.join(expanded_words) 
print('Original text: ' + text) 
print('\n') 
print('Expanded_text: ' + expanded_text)

Original text:  She'll be airport in 30 mins. We are supposed to catch the arrival, aren't we?  
          I'd love to welcome her personally. It'll be an awesome vacation.


Expanded_text: she will be airport in 30 mins. We are supposed to catch the arrival, are not we? I would love to welcome her personally. it will be an awesome vacation.


In [None]:
df['final_review'] = df['final_review'].apply(lambda x: contractions.fix(x))
df.final_review.sample(5)

4349    bwesit na apps nag hihingi nang passcode eh wa...
1192    new update review i am using gcash for quite s...
2946    thanks to gcash ultimately the best!!!i receiv...
4787    fix this!! why i can not enter authentication ...
1470    authentication error i cannot access my gcash ...
Name: final_review, dtype: object

#### Remove punctuation

In [None]:
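# regex=True is passed explicitly; newer pandas versions treat the pattern as a literal string by default
df['final_review'] = df['final_review'].str.replace(r'[^\w\s]', '', regex=True)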
df['final_review'].head()

0    it is starting to become pretty annoying almos...
1    actually ok this app gets so many negative rev...
2    cash in concern a little bit disappointed when...
3    frustrating and slow and you are a bunch of th...
4    payments do not post my payment via gcash did ...
Name: final_review, dtype: object

#### Removing English + Tagalog Stopwords and Lemmatizing English words

In [None]:
from spacy.lang.tl.stop_words import STOP_WORDS as tl_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop

final_stopwords_list = list(tl_stop) + list(en_stop) + ['gcash', 'g-cash', '\'','-', '-pron-', 'ake','d','g','m','ve']

In [None]:
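import spacy
# 'en' is a shortcut link that only exists in spaCy 2.x and typically points to en_core_web_sm;
# loading the model by its full name works in both spaCy 2.x and 3.x
nlp = spacy.load('en_core_web_sm')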

In [None]:
def tokenizer(text):
    return [token.lemma_.lower() for token in nlp(text)]

#### Feature Extraction (TF-IDF + SVD + UMAP)

In [None]:
!pip install umap-learn  # `import umap` is provided by umap-learn; the PyPI package named `umap` is an unrelated project


In [None]:
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
import umap

vectorizer = TfidfVectorizer(min_df=10, max_features=10000, tokenizer=tokenizer, stop_words=final_stopwords_list, ngram_range=(1, 2))
vz = vectorizer.fit_transform(df['final_review'])



In [None]:
vz[0]

<1x1880 sparse matrix of type '<class 'numpy.float64'>'
	with 77 stored elements in Compressed Sparse Row format>
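
Note that `min_df` counts documents, not total occurrences: with `min_df=10`, a term is kept only if it appears in at least 10 reviews. The sketch below reproduces that filter by hand on a toy corpus using only the standard library; the corpus and the `document_frequency` helper are made up for illustration, and whitespace splitting stands in for the spaCy tokenizer.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df_counts = Counter()
    for doc in docs:
        # set() so a term repeated inside one document is counted once
        df_counts.update(set(doc.split()))
    return df_counts

docs = ["load load load promo", "cash in load", "cash in error"]
counts = document_frequency(docs)

min_df = 2
kept = sorted(term for term, n in counts.items() if n >= min_df)
print(kept)  # 'load' occurs 4 times overall but in only 2 documents
```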

In [None]:
!pip install lda



In [None]:
df_goodreviews = df.loc[df['rating'].isin([4,5])]

In [None]:
df_goodreviews

Unnamed: 0,rating,Dates,Time,final_review
1,5,2020-05-16,07:33:32,actually ok this app gets so many negative rev...
6,5,2020-06-07,04:57:17,gcash user since 2018 I have been using gcash ...
8,5,2021-01-12,09:46:26,read please those others who does not like gca...
12,4,2020-04-21,09:25:07,best finance app gcash is a big help indeed I ...
15,5,2019-09-23,13:15:48,smooth and convenient disclaimer not to demeri...
...,...,...,...,...
4826,4,2018-06-19,06:41:03,qr code scan for electric water bill hopefull...
4827,5,2018-08-16,07:34:45,sharing code problem i can not share my reffer...
4835,5,2018-05-04,04:02:13,gcash on mobile i can buy load pay at robinson...
4839,5,2018-02-17,03:03:45,fix this i can not log in in my iphone 4 runn...


In [None]:
df_badreviews = df.loc[df['rating'].isin([1,2,3])]

In [None]:
df_badreviews

Unnamed: 0,rating,Dates,Time,final_review
0,3,2018-12-29,15:24:59,it is starting to become pretty annoying almos...
2,3,2020-02-04,00:25:57,cash in concern a little bit disappointed when...
3,1,2019-12-10,09:43:52,frustrating and slow and you are a bunch of th...
4,1,2018-06-26,18:43:31,payments do not post my payment via gcash did ...
5,1,2019-08-11,06:45:05,forgotten mpin once you forgot your mpin you h...
...,...,...,...,...
4862,1,2016-01-16,11:52:50,fix this this is one of the crucial apps for g...
4863,1,2015-11-09,08:20:37,can not access because they can not process my...
4864,1,2015-09-19,14:58:01,waste of time downloading compatibility says c...
4865,1,2015-09-09,11:45:00,can not log in useless appcan not access my gc...


## Part 3: Modeling (LDA)

In [None]:
!pip install pyLDAvis


In [None]:
import numpy as np
import pyLDAvis



In [None]:
from sklearn.feature_extraction.text import CountVectorizer 
import lda

# create document term matrix
min_df = 5 # minimum number of documents a term must appear in to be kept
max_features = 10000 # max number of unique words, e.g 10000

vectorizer = CountVectorizer(
              min_df=min_df, 
              max_features=max_features,
              stop_words =  final_stopwords_list,
              tokenizer=tokenizer, # use tokenizer we defined before
              ngram_range=(1,1)) # unigrams only; (1,2) would add bigrams
X = vectorizer.fit_transform(df_goodreviews['final_review'])


# build LDA model
n_topics = 4 # pick the number of topics, e.g 5
n_iter = 2000 # number of learning iterations, e.g 2000

lda_model = lda.LDA(
              n_topics=n_topics, 
              n_iter=n_iter)
X_topics = lda_model.fit_transform(X) # X is document term matrix


# Print the topics found by the LDA model
n_top_words = 10
topic_summaries = []



In [None]:
topic_word = lda_model.topic_word_  # get the topic words
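# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available from 1.0)
vocab = vectorizer.get_feature_names()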
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: app add update feature user good great love option like
Topic 1: app pay use bill convenient money bank easy transfer good
Topic 2: card load 5 yung good cash credit po star thank
Topic 3: app account use update fix transaction verify try hope help


In [None]:

# a function to prepare the inputs pyLDAvis.prepare expects,
# using the fitted model and the document-term matrix X
def prepareLDAData(lda_model, X, vocab):
    data = {
        'vocab': vocab,
        'doc_topic_dists': lda_model.doc_topic_,
        'doc_lengths': np.asarray(X.sum(axis=1)).ravel(),     # counted tokens per review
        'term_frequency': np.asarray(X.sum(axis=0)).ravel(),  # corpus-wide count of each term
        'topic_term_dists': lda_model.components_
    } 
    return data

# prepare and display the pyLDAvis visualization
lda_data = prepareLDAData(lda_model=lda_model, X=X, vocab=vocab)
prepared_data = pyLDAvis.prepare(**lda_data)
pyLDAvis.display(prepared_data)