# Topic Modeling of App Reviews

## Data Preparation

First, we load the CSV, keep only the Spanish comments, and join the "Problema" and "comment" fields, which are the two fields that contain user-written text.

We also drop empty and very short texts (under 25 characters) to avoid trivial comments such as "me gusta".

In [157]:
import pandas as pd
import numpy as np

df = pd.read_csv("data.csv", encoding = "UTF-16", sep="\t")

df = df[df["Language"] == "es"]
df = df[["Problema", "comment"]]

df = df.replace(np.nan, '', regex=True)
df["text"] = [ x + y for x, y in zip(df["Problema"], df["comment"]) ]
df = df[(df["text"].str.strip() != "") & (df["text"].str.len() >= 25)]

print("Total: ", len(df["text"]))
print(df["text"][:5])

Total:  1204
2     No puedo modificar los datos personales. Mi ap...
3                   poder enviar tu tarjeta de embarque
8           Más interacción pero mejor que la anterior 
11                            no puedo crear un usuario
12    Está mucho mejor, es más fácil acceder a ver m...
Name: text, dtype: object
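
The cell above concatenates the two fields with no separator, so when both are filled the last word of one runs into the first word of the other. A minimal sketch of the same step with a space in between, on a made-up two-row frame (the `toy` data is hypothetical and not taken from `data.csv`):

```python
import pandas as pd

# Hypothetical rows standing in for the real CSV
toy = pd.DataFrame({
    "Problema": ["No puedo entrar", None],
    "comment": [None, "La tarjeta de embarque no se envia"],
})

# Missing fields become empty strings so they can be concatenated
toy = toy.fillna("")

# Join with a space, then trim the space left over when one side is empty
toy["text"] = (toy["Problema"] + " " + toy["comment"]).str.strip()

# Same length filter as in the notebook: keep texts of 25+ characters
kept = toy[toy["text"].str.len() >= 25]
print(kept["text"].tolist())
```

Note that adding a separator changes text lengths slightly, so the number of reviews that pass the filter may differ from the total shown above.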


We need NLTK for its list of Spanish stop words: function words such as "de", "la" and "que" that carry little meaning on their own and would otherwise distort the results.

On the first run, uncomment the download line below to fetch the stop-word corpus, which includes the Spanish list.

In [158]:
import nltk

# uncomment on the first run; nltk.download('stopwords') fetches only the stop-word lists
# nltk.download()

stop_words = nltk.corpus.stopwords.words('spanish')

print(stop_words[:10])

['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'del', 'se']


Now we remove the stop words, plus a few banned words of our own, because they would only add noise to the model. We also strip accents, because some reviewers use them and some don't.

In [159]:
import unidecode

# banned words, we don't want to cluster around those
banned_words = ['aplicacion', 'app', 'anterior', 'actualizacion', 'actualizar', 'version']

text = [" ".join([unidecode.unidecode(word) for word in text.split(" ")
                  if word not in stop_words
                  and unidecode.unidecode(word) not in banned_words
                 ]) for text in df["text"]]

text[:10]

['No puedo modificar datos personales. Mi apellido incompleto. Soy CLAUDIA LEON PRADO ',
 'poder enviar tarjeta embarque',
 ' Mas interaccion mejor ',
 'puedo crear usuario',
 'Esta mejor, facil acceder ver kilometros. ',
 'En aparece opcion postular upgrade y, si postulando, caso, permite revisar listado postulacion. ',
 'Estoy tratando ver vuelo viaja esposo Lima Tacna saber si salio hora encuentro daba vuelo x dia dias anteriores ,posteriores.Ahora opcion...la pueden agregar xfavor ',
 ' Muy buena nueva ',
 'Que cuidado maletas  ',
 'No sirve comprar pasajes  ']
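
In the output above, capitalized stop words such as "En" and "Que" survive because the membership test is case-sensitive. The sketch below shows one possible case-insensitive variant. It is an illustration under assumptions: `STOP_WORDS` and `BANNED_WORDS` are tiny hypothetical stand-ins for the real lists, and accents are stripped with the standard-library `unicodedata` module rather than `unidecode`.

```python
import unicodedata

# Hypothetical stand-ins for the NLTK Spanish list and the banned list
STOP_WORDS = {"no", "de", "la", "los", "que", "en", "mi", "muy"}
BANNED_WORDS = {"aplicacion", "app", "version"}


def strip_accents(word):
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean(review):
    kept = []
    for word in review.split():
        normalized = strip_accents(word).lower()
        # Compare lowercased forms so "No" and "no" are treated alike
        if word.lower() in STOP_WORDS or normalized in STOP_WORDS:
            continue
        if normalized in BANNED_WORDS:
            continue
        kept.append(strip_accents(word))
    return " ".join(kept)


cleaned = clean("No puedo modificar los datos de la Aplicación")
print(cleaned)
```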

Next we tokenize and stem the words to get at their root meaning, so that different genders or verb inflections of a word are not counted as distinct words when they probably refer to the same thing.

Tokens and their stems are saved in a vocabulary frame for later, so we can map a stem back to a readable word.

In [160]:
from nltk.stem.snowball import SnowballStemmer
import re

stemmer = SnowballStemmer("spanish")

def tokenize(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

def stem(tokens):
    return [stemmer.stem(t) for t in tokens]

totalvocab_stemmed = []
totalvocab_tokenized = []
for i in text:
    allwords_tokenized = tokenize(i)
    totalvocab_tokenized.extend(allwords_tokenized)
    
    allwords_stemmed = stem(allwords_tokenized)
    totalvocab_stemmed.extend(allwords_stemmed)
    

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')
print(vocab_frame.head())

there are 9224 items in vocab_frame
               words
no                no
pued           puedo
modific    modificar
dat            datos
personal  personales


## Fitting the Model

Now we pass the text to a TfidfVectorizer, where IDF weighting makes the model focus on the distinctive, not-so-common words. We also build n-grams of up to three words, so that something like "check in" is treated as a single term instead of two.

Then we pass the result to a Latent Dirichlet Allocation (LDA) model, which should discover topics automatically and group the reviews around them.
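
To make the two ideas concrete, here is a small pure-Python sketch of n-gram generation and of the smoothed IDF weight, `ln((1 + n) / (1 + df)) + 1`, which is the formula scikit-learn documents for its default `smooth_idf=True` setting. The helper names and the three toy documents are hypothetical, and the sketch leaves out the term-frequency and normalization steps.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    # Every run of n consecutive tokens, joined by a space
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def smooth_idf(docs):
    # Document frequency: in how many documents each term appears
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }


raw = ["check in falla", "check in lento", "tarjeta embarque falla"]
docs = [ngrams(text.split()) for text in raw]
idf = smooth_idf(docs)

# Terms found in fewer documents get a higher weight
for term in ("check in", "tarjeta embarque"):
    print(term, round(idf[term], 3))
```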

In [161]:
%%time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation


def tokenize_and_stem(text):
    return stem(tokenize(text))


vectorizer = TfidfVectorizer(max_df=0.6, max_features=200000,
                             min_df=5, stop_words=stop_words,
                             use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

X = vectorizer.fit_transform(text).todense()
print(X.shape)

# on scikit-learn >= 1.0, use vectorizer.get_feature_names_out() instead
terms = vectorizer.get_feature_names()
print(terms[90:100])

K = 7
model = LatentDirichletAllocation(n_components=K, max_iter=100, random_state=8)
model.fit(X)

model

(1204, 422)
['count', 'cre', 'cre usuari', 'cuand', 'cuelg', 'cuent', 'da', 'dan', 'dat', 'dat personal']




CPU times: user 18.7 s, sys: 203 ms, total: 18.9 s
Wall time: 19.3 s


## Visualization

Here we use pyLDAvis, which gives us an off-the-shelf visualization for LDA. The largest circles are the topics with the most reviews in them.

You can click on any circle to see the most relevant words for that topic, and you can lower the relevance slider (λ) to favor words that are specific to that topic rather than frequent across all of them.

This is a great visualization because it shows three things:
- The biggest topics, which are the ones we should be most concerned about
- Which topics overlap and which talk about completely distinct subjects
- Which words are in each topic, giving us a hint of what it is about
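
The relevance slider is based on the score from the LDAvis paper, `λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w))`. With λ = 1 terms are ranked by their probability within the topic; with λ near 0 they are ranked by how specific they are to it. A minimal sketch with made-up probabilities (the `toy_terms` values are hypothetical):

```python
import math


def relevance(p_term_given_topic, p_term, lam):
    # Weighted mix of in-topic probability and lift over the corpus
    return (lam * math.log(p_term_given_topic)
            + (1 - lam) * math.log(p_term_given_topic / p_term))


# Hypothetical values: term -> (p(term | topic), p(term) in the corpus)
toy_terms = {
    "vuelo": (0.10, 0.09),     # frequent everywhere
    "embarque": (0.06, 0.01),  # rarer overall, concentrated in this topic
}


def rank(lam):
    return sorted(
        toy_terms,
        key=lambda term: relevance(*toy_terms[term], lam),
        reverse=True,
    )


by_probability = rank(1.0)
by_exclusivity = rank(0.0)
print(by_probability, by_exclusivity)
```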

In [162]:
import pyLDAvis
import pyLDAvis.sklearn  # renamed to pyLDAvis.lda_model in pyLDAvis 3.4
pyLDAvis.enable_notebook()

pyLDAvis.sklearn.prepare(model, X, vectorizer)

Of course, we don't only want to look at the chart above; we want to actually read the reviews in each topic. So we build a dataframe with the original text plus the topic (cluster) number and score assigned to it. The score is the weight of the review's dominant topic in the document-topic distribution returned by LDA. Reviews scoring below 0.6 are moved to an extra "Others" cluster (index K), because we only want to keep the most focused ones in each topic.

In [163]:
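result = model.transform(X)  # document-topic distribution, one row per review
clusters = [ np.argmax(score) for score in result ]  # dominant topic
scores = [ max(score) for score in result ]          # weight of that topic

df["cluster"] = clusters
df["score"] = scores

# reviews without a dominant topic go to the extra "Others" cluster (index K)
df.loc[(df["score"] < 0.6), "cluster"] = K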

df[["text","cluster","score"]].head()

| | text | cluster | score |
|---|---|---|---|
| 2 | No puedo modificar los datos personales. Mi ap... | 5 | 0.727286 |
| 3 | poder enviar tu tarjeta de embarque | 0 | 0.733866 |
| 8 | Más interacción pero mejor que la anterior | 5 | 0.644139 |
| 11 | no puedo crear un usuario | 4 | 0.705803 |
| 12 | Está mucho mejor, es más fácil acceder a ver m... | 7 | 0.483479 |
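
The assignment rule can be shown on its own with plain lists. In this sketch, `assign_topics` and `toy_doc_topic` are hypothetical names; each row stands for one review's topic weights, and a review whose strongest topic is below the threshold gets the extra index `n_topics`, which plays the role of "Others".

```python
def assign_topics(doc_topic, threshold=0.6):
    n_topics = len(doc_topic[0])
    assignments = []
    for weights in doc_topic:
        # Index and weight of the strongest topic for this review
        best = max(range(n_topics), key=lambda k: weights[k])
        score = weights[best]
        # Unfocused reviews fall into the extra "Others" bucket
        cluster = best if score >= threshold else n_topics
        assignments.append((cluster, score))
    return assignments


toy_doc_topic = [
    [0.73, 0.17, 0.10],  # clearly about topic 0
    [0.40, 0.35, 0.25],  # spread out, no dominant topic
]
assignments = assign_topics(toy_doc_topic)
print(assignments)
```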


Then we get the main words for each cluster, some example reviews, and the cluster sizes.

In [164]:
# copied from https://github.com/bmabey/pyLDAvis/blob/master/pyLDAvis/sklearn.py
def _get_topic_term_dists(lda_model):
    return _row_norm(lda_model.components_)

def _row_norm(dists):
    # row normalization function required
    # for doc_topic_dists and topic_term_dists
    return dists / dists.sum(axis=1)[:, None]

topic_term_dists = _get_topic_term_dists(model)

# sort term indices by descending probability within each topic
order_centroids = topic_term_dists.argsort()[:, ::-1] 

cluster_words = [
    ", ".join([
        vocab_frame.loc[terms[ind].split(' ')].values.tolist()[0][0] for ind in order_centroids[i, :6]
    ]) for i in range(K)
]
cluster_words.append("Others")

print(cluster_words[0])

cluster_examples = [
    df[df["cluster"] == i]["text"].values.tolist() for i in range(len(cluster_words))
]

print(cluster_examples[0][0])

cluster_sizes = [ len(df[df["cluster"] == i]) for i in range(len(cluster_words)) ]

print(cluster_sizes)

hacer, tarjeta, embarque, asientos, in, check
poder enviar tu tarjeta de embarque
[149, 18, 54, 58, 97, 192, 82, 554]
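
To make the row normalization and term ranking concrete, here is a small sketch on a made-up topic-term matrix. The names `top_terms_per_topic`, `toy_terms` and `toy_components` are hypothetical; in the notebook the matrix comes from `model.components_`.

```python
def top_terms_per_topic(components, terms, n_top=2):
    result = []
    for row in components:
        # Normalize the row so the weights sum to 1 (a probability per term)
        total = sum(row)
        probs = [value / total for value in row]
        # Term indices from most to least probable
        order = sorted(range(len(terms)), key=lambda j: probs[j], reverse=True)
        result.append([(terms[j], round(probs[j], 2)) for j in order[:n_top]])
    return result


toy_terms = ["check", "tarjet", "usuari"]
toy_components = [
    [6.0, 3.0, 1.0],  # topic 0 leans on "check" and "tarjet"
    [1.0, 1.0, 8.0],  # topic 1 leans on "usuari"
]
top = top_terms_per_topic(toy_components, toy_terms)
print(top)
```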


Finally, we print everything so you can read it:

In [165]:
data = { 'text': df["text"].values.tolist() }

frame = pd.DataFrame(data, index = [clusters], columns = ['text'])

print("Top terms per cluster:")
print()

for i in range(len(cluster_sizes)):
    print("Cluster %d total:" % i, cluster_sizes[i])
    print() #add whitespace
    
    print(cluster_words[i])
    print() #add whitespace
    
    print("Cluster %d examples:" % i)
    for title in cluster_examples[i][:15]:
        print('- %s' % title)
    print() #add whitespace
    print() #add whitespace

Top terms per cluster:

Cluster 0 total: 149

hacer, tarjeta, embarque, asientos, in, check

Cluster 0 examples:
- poder enviar tu tarjeta de embarque
- no asocia mi check in  y mi categoria de socio, no puedo elegir asientos, es un real problema
- La app anterior era mejor, en esta no hay donde hacer check in ni ver flight status de otros vuelos LATAM  
- No tengo la opción de check in 
- debería actualizar la aplicación anterior... porque tiene que ser otra? me pide datos y todo eso... una tontería!!!
- La debieron lanzar completamente funcional Parece que la actualización demora un mes en lo volado 
- no puedo entrar a mi cuenta
- La tarjeta de embarque no indica q soy Preferente  
- No se figuran todas las reservas. Cuando las añado, me dice que no puede y están trabajado en solucionarlo.  
- No puedo reenviar la tarjeta de embarque a mi jefe
- La tarjeta de embarque no indica la categoría de socio 
-  Todo súper, pero dónde está el botón de check in? 
- no pude iniciar sesion ya e

One problem with unsupervised clustering is that you have to choose the number of clusters beforehand. Above we chose a small number, which may produce topics that are actually talking about two different things.

On the other hand, if we choose too many topics, we might split those mixed topics into smaller ones but also end up with several topics that are actually talking about the same thing.

We can see this effect on the chart. Take a look with K = 30:

In [166]:
import pyLDAvis
import pyLDAvis.sklearn
K2 = 30

model2 = LatentDirichletAllocation(n_components=K2, max_iter=100, random_state=0)
model2.fit(X)

pyLDAvis.enable_notebook()

pyLDAvis.sklearn.prepare(model2, X, vectorizer)
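
Besides inspecting the chart, one hedged way to compare candidate values of K is to fit a model per value and look at a fit score such as perplexity (lower is better), ideally on reviews held out from training. The sketch below runs on a random toy count matrix; `perplexity_by_k`, `X_toy` and the candidate values are illustrative assumptions, not part of the notebook, and it scores the training data only to stay short.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation


def perplexity_by_k(X, candidate_ks, random_state=0):
    scores = {}
    for k in candidate_ks:
        lda = LatentDirichletAllocation(
            n_components=k, max_iter=20, random_state=random_state
        )
        lda.fit(X)
        # Perplexity of the fitted model on X; lower means a better fit
        scores[k] = lda.perplexity(X)
    return scores


# Hypothetical document-term counts: 40 documents, 15 terms
rng = np.random.RandomState(0)
X_toy = rng.poisson(1.0, size=(40, 15))

scores = perplexity_by_k(X_toy, [2, 4, 6])
for k, value in scores.items():
    print(k, round(value, 2))
```

A number like this is only a rough guide. Reading the example reviews per topic, as done above, is still the best check that the topics make sense.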



## Credits

I only managed to create this notebook thanks to these two awesome tutorials, which I basically copied for this topic modelling work:

1. Document Clustering with Python - http://brandonrose.org/clustering
2. Modern NLP in Python - https://www.youtube.com/watch?v=6zm9NC9uRkk

And thanks to the [@datasciencepython](https://web.telegram.org/#/im?p=@datasciencepython) group on Telegram for pointing me in the right direction.