# Topic Modelling Demo

Exploring topic modelling using:
- Latent Dirichlet Allocation (LDA), following both the bag-of-words and TF-IDF approaches

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

For this test case we will use the US Consumer Finance Complaints dataset, which holds verbatim complaints as well as product information, among other fields.
In this exercise we focus on the customer verbatim and try to distinguish topics using an unsupervised learning approach.

In [None]:
data = pd.read_csv('../input/consumer_complaints.csv')

In [None]:
data

We take a subset of our data

In [None]:
verbatim_product = data[['consumer_complaint_narrative','product']]

In [None]:
verbatim_product.head(5)

We remove any complaints that don't have a verbatim to analyse.

In [None]:
filtered_verbatim = verbatim_product.dropna()
filtered_verbatim.head(2)

How many complaints are there with Verbatim?

In [None]:
len(filtered_verbatim.consumer_complaint_narrative)

Which products appear most often in these complaints?

In [None]:
filtered_verbatim['product'].value_counts()

Can we see this as a chart?

In [None]:
filtered_verbatim['product'].value_counts().plot(kind='bar')

Let's select a single complaint to start doing some NLP on

In [None]:
complaint = filtered_verbatim.iloc[1]['consumer_complaint_narrative']
pd.options.display.max_colwidth = 1000
print(complaint)

Let's import the libraries required for our natural language processing

In [None]:
import spacy #for our NLP processing
import nltk #to use the stopwords library
import string # for a list of all punctuation
from nltk.corpus import stopwords # for a list of stopwords

Now we can load and use spacy to analyse our complaint

In [None]:
nlp = spacy.load('en_core_web_sm')
text = nlp(complaint)

In [None]:
text

Let's start by tokenising our complaint, i.e. splitting it into individual words

In [None]:
tokens = [tok for tok in text]
tokens[:5]  # a Python list has no .head(); slice the first five tokens instead

To reduce the number of distinct forms of the same word in our bag of words, let's lemmatize the tokens

In [None]:
tokens = [tok.lemma_ for tok in text]
tokens

To ensure our words match up and there are no sneaky spaces, let's strip any whitespace around the tokens and lowercase the text, so that words like 'Credit' and 'credit' are matched

In [None]:
tokens = [tok.lemma_.lower().strip() for tok in text]
tokens

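Now let's get rid of all the -PRON- lemmas, as they add no value to our analysis. (-PRON- is the placeholder lemma that spaCy 2.x assigns to pronouns; spaCy 3.x no longer emits it, so there the filter simply has no effect.)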

In [None]:
tokens = [tok.lemma_.lower().strip() for tok in text if tok.lemma_ != '-PRON-']
tokens

Let's now use another library, NLTK, to get a list of stopwords (think: I, me, you, they, etc.). These are words that won't add much value to our analysis: fillers between the important words.

In [None]:
stop_words = stopwords.words('english')
punctuations = string.punctuation
stop_words

We now work on the tokens in our tokens list (no longer on the spaCy nlp(text) object) and remove any punctuation and stop_words

In [None]:
tokens = [tok for tok in tokens if tok not in stop_words and tok not in punctuations]
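
One detail of the filter above is worth seeing on toy data: `string.punctuation` is a single string, so `tok not in punctuations` is a substring test. The sketch below uses a tiny hand-written stop-word list (an assumption for illustration; the notebook uses NLTK's full English list) to show what that implies for empty and multi-character tokens, and one possible stricter variant.

```python
import string

# Hypothetical miniature stop-word list; the notebook uses NLTK's English list.
stop_words = ["i", "me", "my", "the", "a", "to", "be"]
punctuations = string.punctuation  # one long string, not a list of tokens

tokens = ["i", "call", "the", "bank", ",", "", "...", "to", "dispute", "charge", "!"]

# Same comprehension as in the notebook cell: a substring test against the
# punctuation string. Empty strings are dropped, but "..." is kept because
# it is not a contiguous substring of string.punctuation.
kept = [tok for tok in tokens if tok not in stop_words and tok not in punctuations]
print(kept)

# Stricter variant: drop empty tokens and tokens made up only of punctuation.
punct_chars = set(string.punctuation)
kept_strict = [
    tok for tok in tokens
    if tok and tok not in stop_words and not all(ch in punct_chars for ch in tok)
]
print(kept_strict)
```

Whether the stricter variant is preferable depends on how much multi-character punctuation the tokenizer produces for this corpus.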

And there we have it! We've tokenized, lemmatized, lowercased, stripped whitespace, and removed stopwords and punctuation

In [None]:
tokens

Let's put it all together into a function so we can apply these steps to every complaint

In [None]:
def cleanup_text(complaint):
    doc = nlp(complaint, disable=['parser', 'ner'])
    tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
    tokens = [tok for tok in tokens if tok not in stop_words and tok not in punctuations]
    return tokens


We'll look to see if this looks correct on the first 100 complaints - pre-check  
We'll also declare **doc_sample** as the complaints verbatim

In [None]:
limit = 100
doc_sample = filtered_verbatim.consumer_complaint_narrative
print('tokenized and lemmatized document: ')

# Slice first so that exactly `limit` complaints are printed
# (breaking on `idx == limit` would print limit + 1 of them)
for complaint in doc_sample.iloc[:limit]:
    print(cleanup_text(complaint))
    

In the interest of time in this interactive demo, we'll run the rest of it on 10k complaints

In [None]:
doc_sample = doc_sample.iloc[:10000]  # first 10,000 complaints, selected by position
#doc_sample = doc_sample[:]

We can now apply our function to doc_sample and process our 10k complaints using the .map function

In [None]:
processed_docs = doc_sample.map(cleanup_text)

# Bag of Words

The dictionary encapsulates the mapping between normalized words and their integer ids

In [None]:
import gensim
dictionary = gensim.corpora.Dictionary(processed_docs)

The dictionary is then filtered to remove extreme values using the following parameters:
- *no_below* is an absolute number: words appearing in fewer than 10 documents are removed from the analysis
- *no_above* is a fraction: words appearing in more than 50% of the documents are removed from the analysis
- *keep_n* caps the vocabulary: after the two filters above, only the 100,000 most frequent words are kept

In [None]:
dictionary.filter_extremes(no_below=10, no_above=0.5, keep_n=100000)
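
The filtering rule described above counts documents, not total occurrences. A minimal pure-Python sketch of that rule on toy data follows; the function name `filter_by_document_frequency` and the toy documents are made up for illustration, and this is not gensim's implementation.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Return the set of tokens that survive a document-frequency filter."""
    num_docs = len(docs)
    # Count each token once per document, however often it repeats inside it.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {
        tok for tok, df in doc_freq.items()
        if df >= no_below and df / num_docs <= no_above
    }

toy_docs = [
    ["credit", "card", "fee", "fee", "fee"],
    ["credit", "report", "error"],
    ["credit", "card", "late"],
    ["credit", "loan", "late"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

Here "fee" occurs three times but in only one document, so it fails `no_below=2`, while "credit" appears in every document and fails `no_above=0.5`.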

Each processed complaint is then converted to bag-of-words format using the dictionary: a list of (word id, count) pairs

In [None]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
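
To make the bag-of-words format concrete, here is a small pure-Python sketch of the same idea. The helpers `build_token2id` and `to_bow` are hypothetical names for this illustration, and the id assignment is arbitrary and need not match the ids gensim would assign.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

toy_docs = [["credit", "card", "fee", "fee"], ["credit", "report"]]
token2id = build_token2id(toy_docs)
bow = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)
print(bow)
```

Tokens that are not in the mapping (for example because they were filtered out of the dictionary) simply do not appear in the resulting pairs.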

We display an example of the bag-of-words format on a single complaint to ensure it worked (here on complaint 4310)

In [None]:
bow_doc_4310 = bow_corpus[4310]

# Each entry is a (word id, count) pair
for word_id, count in bow_doc_4310:
    print("Word {} (\"{}\") appears {} time.".format(word_id, dictionary[word_id], count))

# LDA

Latent Dirichlet allocation (LDA) is an **unsupervised** algorithm: only the words in the documents are modeled.
- The goal is to _infer topics that maximize the likelihood (or the posterior probability) of the collection_.
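
The goal stated above can be made concrete with the standard LDA generative model. The notation below ($K$ topics, $D$ documents, Dirichlet hyperparameters $\alpha$ and $\eta$, topic-word distributions $\beta_k$, document-topic distributions $\theta_d$) is the conventional one and is not taken from this notebook; treat it as a summary sketch rather than a derivation.

$$
\begin{aligned}
\beta_k &\sim \mathrm{Dirichlet}(\eta), \quad k = 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), \quad d = 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Categorical}(\beta_{z_{d,n}})
\end{aligned}
$$

$$
p(\beta, \theta, z, w \mid \alpha, \eta) = \prod_{k=1}^{K} p(\beta_k \mid \eta) \prod_{d=1}^{D} p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{z_{d,n}})
$$

Only the words $w_{d,n}$ are observed; inference estimates the posterior over $\beta$, $\theta$ and $z$ given those words.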

## Running LDA on Bag of Words

The LDA algorithm has a number of parameters that can be used to calibrate the output:
- num_topics: the number of topics to extract. In this example we prescribe 10; in a previous run without a prescribed number, LDA produced 99 clusters, which is not very informative for our use case
- id2word: the previously defined dictionary, mapping word IDs to words
- workers: the number of worker processes used for parallelisation
- chunksize: the number of documents to use in each training chunk
- passes: the number of passes through the corpus during training
- alpha: can be set to a 1D array, of length equal to the number of expected topics, that expresses our a priori belief about each topic's probability
- decay: a number in (0.5, 1] that weights what percentage of the previous lambda value is forgotten when each new document is examined
- iterations: the maximum number of iterations through the corpus when inferring the topic distribution of a corpus
- gamma_threshold: the minimum change in the value of the gamma parameters required to continue iterating

In [None]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2)
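
The decay parameter listed above is easier to read with numbers. The sketch below assumes the step-size schedule from online variational Bayes for LDA, rho_t = (offset + t) ** (-decay); the helper names and the default `offset=1.0` are choices made for this illustration, and gensim's exact update counting may differ.

```python
def learning_rate(update, decay, offset=1.0):
    """Step size rho_t = (offset + t) ** (-decay) for update number t."""
    if not 0.5 < decay <= 1.0:
        raise ValueError("decay must lie in (0.5, 1]")
    return (offset + update) ** (-decay)

def blend(old_lambda, new_estimate, rho):
    """Weighted update: a fraction rho of the old value is replaced by the new estimate."""
    return (1.0 - rho) * old_lambda + rho * new_estimate

# Compare how quickly the step size shrinks for a low and a high decay.
for decay in (0.6, 1.0):
    rates = [round(learning_rate(t, decay), 3) for t in range(4)]
    print(decay, rates)
```

A larger decay shrinks the step size faster, so later chunks of documents change the topics less.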

Let's display our topics

In [None]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
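
Each topic printed above is a single string of weighted words. If the weights are needed as numbers, the string can be parsed; the sketch below assumes the `weight*"word" + weight*"word"` format and uses a hand-written example string, not real model output. The helper name `parse_topic` is hypothetical.

```python
import re

# Matches one weight*"word" term in a topic string.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn a topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

# Hypothetical topic string, written by hand in the assumed format.
example = '0.040*"credit" + 0.020*"report" + 0.010*"account"'
print(parse_topic(example))
```

gensim's `show_topic` method returns (word, probability) pairs directly, which avoids the parsing step when the model object is at hand.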

We can also use pyLDAvis to inspect our outputs in a more interactive way

In [None]:
import pyLDAvis
# Note: newer pyLDAvis releases name this module pyLDAvis.gensim_models
import pyLDAvis.gensim as gensimvis

In [None]:
vis_data = gensimvis.prepare(lda_model, bow_corpus, dictionary)
pyLDAvis.display(vis_data)