### Topic Modelling with Gensim!

This notebook will walk you through the entire process of analysing your text - from pre-processing to creating your topic models and visualising them. 

Python offers a very rich suite of NLP and CL (computational linguistics) tools, and we will illustrate these to the best of our capabilities.
Let's start by setting up our imports.

We will be needing: 
```
- Gensim
- matplotlib
- spaCy
- pyLDAvis
```


In [1]:
# Suppressing warnings
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

# Importing Libraries
import os

import matplotlib.pyplot as plt
import gensim
import numpy as np
import spacy
import en_core_web_sm

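from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel
# Note: gensim.models.wrappers (including LdaMallet) was removed in Gensim 4.0,
# so this import only works on Gensim 3.x. LdaMallet is not used below.
from gensim.models.wrappers import LdaMallet
from gensim.corpora import Dictionary
# Note: pyLDAvis 3.x renamed this module to pyLDAvis.gensim_models.
import pyLDAvis.gensim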




For this tutorial, we will be using the Lee corpus, which is a shortened version of the [Lee Background Corpus](http://www.socsci.uci.edu/~mdlee/lee_pincombe_welsh_document.PDF). The shortened version consists of 300 documents selected from the Australian Broadcasting Corporation's news mail service. It contains the texts of headline stories from around 2000-2001.

We should keep in mind that we can use pretty much any textual dataset and proceed in the same way.

### Pre-processing data!

It's often said of Machine Learning and NLP algorithms: garbage in, garbage out. We can't have state-of-the-art results without data that is just as good. Let's spend this section cleaning and understanding our data set.
NLTK is a popular choice for pre-processing, but it is rather [outdated](https://explosion.ai/blog/dead-code-should-be-buried), so we will be checking out spaCy, an industry-grade text-processing package.

For good measure, let's add some stop words. It's a newspaper corpus, so we are likely to come across variations of 'said' and 'Mister', which will not really add any value to the topic models.

Voila! With the `English` pipeline, all the heavy lifting has been done. Let's see what went on under the hood.

It seems like nothing, right? But spaCy's internal data structure has done all the work for us. Let's see how we can create our corpus. You can check out what a gensim corpus looks like [here](google.com).

And this is the magic of spaCy - just like that, we've managed to get rid of stop words and punctuation markers, and added the lemmatized form of each word. There's a lot more we can do with spaCy, which I would really recommend checking out.

Sometimes topic models make more sense when 'New' and 'York' are treated as 'New_York' - we can do this by creating a bigram model and modifying our corpus accordingly.
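
As a rough illustration of the idea, here is a plain-Python sketch that joins adjacent word pairs occurring at least `min_count` times across the corpus. It is not Gensim's bigram scoring, which is more refined; the function name `merge_bigrams`, the `min_count` threshold and the toy documents are all assumptions made for this example.

```python
from collections import Counter


def merge_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least min_count times (toy heuristic)."""
    # Count every adjacent pair of tokens across all documents.
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))
    frequent = {pair for pair, count in pair_counts.items() if count >= min_count}

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            # Merge a frequent pair into one token and skip past both words.
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


# Hypothetical, already lemmatized documents.
toy_docs = [
    ["new", "york", "mayor", "speak"],
    ["storm", "hit", "new", "york"],
    ["new", "policy", "announce"],
]
bigram_docs = merge_bigrams(toy_docs)
print(bigram_docs)
```

Only the pair that recurs ('new', 'york') is merged; the lone 'new' in the third document is left untouched.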

In [2]:
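# txtpreprocess is a local helper module that is not included here; judging by
# its log output, it reads the file, adds stop words and tags the text.
import txtpreprocess as tp
texts = tp.txtpreprocess()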

File read!
Stop words added!
Text tagged!
5000 documents read!
10000 documents read!
...
215000 docum

In [3]:
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

We're now done with a very important part of any text analysis - cleaning the data and setting up the corpus. Keep in mind that we created the corpus the way we did because that's how Gensim requires it. Most algorithms still require the data set to be cleaned the way we did: by removing stop words and numbers, adding the lemmatized form of each word, and using bigrams.
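
To make the corpus format concrete: each document becomes a list of `(token_id, count)` pairs. The plain-Python sketch below mimics that idea without Gensim. `build_vocab` and `to_bow` are hypothetical helper names, and the id assignment used here (order of first appearance) is an assumption that need not match the ids `Dictionary` assigns.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Turn a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# Hypothetical pre-processed documents.
toy_texts = [["bank", "river", "bank"], ["river", "flood"]]
token2id = build_vocab(toy_texts)
toy_corpus = [to_bow(doc, token2id) for doc in toy_texts]
print(token2id)
print(toy_corpus)
```

Word order is discarded; only which tokens occur, and how often, survives in the bag-of-words representation.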

### LSI

LSI stands for Latent Semantic Indexing - it is a popular information retrieval method which works by decomposing the original document-term matrix so that only the key topics are retained. Gensim's implementation uses SVD (singular value decomposition).
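
To give a feel for what the decomposition recovers, the sketch below uses power iteration in plain Python to approximate the leading singular value and right singular vector of a tiny document-term matrix. This is only an illustration under toy assumptions (the matrix, the helper name `top_singular_pair` and the fixed iteration count are invented here); it is not how Gensim computes its SVD.

```python
import math


def top_singular_pair(matrix, iterations=100):
    """Approximate the largest singular value and its right singular vector."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # start from a uniform unit vector
    for _ in range(iterations):
        # u = A v, then w = A^T u, i.e. one multiplication by A^T A.
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The singular value is the length of A v for the converged unit vector v.
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, v


# Rows are documents, columns are terms (hypothetical counts).
terms = ["server", "client", "image", "button"]
doc_term = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 1],
]
sigma, v = top_singular_pair(doc_term)
top_topic = sorted(zip(terms, v), key=lambda pair: -pair[1])
print(round(sigma, 3), [(term, round(weight, 3)) for term, weight in top_topic])
```

The strongest "topic" loads almost entirely on the two terms that co-occur in the first two documents, which is the kind of structure LSI keeps when it truncates the decomposition.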

In [4]:
#lsimodel = LsiModel(corpus=corpus, num_topics=10, id2word=dictionary)

In [5]:
#lsimodel.show_topics(num_topics=5)  # Showing only the top 5 topics

### HDP

HDP, the Hierarchical Dirichlet Process, is an unsupervised topic model which figures out the number of topics on its own.

In [6]:
#hdpmodel = HdpModel(corpus=corpus, id2word=dictionary)

In [7]:
#hdpmodel.show_topics()

### LDA

LDA, or Latent Dirichlet Allocation, is arguably the most famous topic modelling algorithm out there. Here we create a simple topic model with 10 topics.

In [8]:

ldamodel = LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)

In [9]:
ldamodel.show_topics()

[(0,
  '0.051*"image" + 0.029*" " + 0.022*"view" + 0.018*"button" + 0.017*"not" + 0.016*"code" + 0.013*"item" + 0.011*"problem" + 0.011*"screen" + 0.010*"text"'),
 (1,
  '0.041*"error" + 0.029*" " + 0.023*"project" + 0.017*"not" + 0.017*"test" + 0.014*"code" + 0.010*"version" + 0.009*"library" + 0.009*"problem" + 0.009*"module"'),
 (2,
  '0.027*"list" + 0.025*" " + 0.014*"element" + 0.013*"value" + 0.010*"point" + 0.009*"time" + 0.008*"datum" + 0.007*"number" + 0.006*"b" + 0.006*"question"'),
 (3,
  '0.058*"class" + 0.046*"object" + 0.038*"method" + 0.029*"\n  " + 0.019*" " + 0.018*"model" + 0.017*"type" + 0.016*"property" + 0.013*"error" + 0.011*"exception"'),
 (4,
  '0.031*"datum" + 0.022*"server" + 0.021*"client" + 0.019*"json" + 0.017*"api" + 0.017*"xml" + 0.015*"key" + 0.015*"request" + 0.014*"message" + 0.013*"thread"'),
 (5,
  '0.059*"app" + 0.051*"user" + 0.027*"application" + 0.024*"android" + 0.014*"service" + 0.011*" " + 0.011*"device" + 0.010*"email" + 0.009*"not" + 0.008*"
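
The formatted strings above pack each topic into `weight*"word"` terms joined by `+`. If you want them as data, a small regular expression can pull them apart, as in this sketch. The helper name `parse_topic` is hypothetical, and the sample string is a shortened copy of topic 4 above. Calling `show_topics(formatted=False)` returns the pairs directly and avoids the parsing altogether.

```python
import re

# One term looks like: 0.031*"datum"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]*)"')


def parse_topic(topic_string):
    """Split a formatted topic string into (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


sample = '0.031*"datum" + 0.022*"server" + 0.021*"client" + 0.019*"json"'
parsed_terms = parse_topic(sample)
print(parsed_terms)
```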

### pyLDAvis 

Thanks to pyLDAvis, we can visualise our topic models in a really handy way. All we need to do is enable our notebook and prepare the object.

In [10]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  topic_term_dists = topic_term_dists.ix[topic_order]





### Topic Coherence

Topic Coherence is a new Gensim functionality that helps us identify which topic model is 'better'.
Because it returns a score, we can compare different topic models trained on the same corpus. We use the same example from the news classification notebook to plot a graph comparing the topic models we have created.
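
To show what kind of quantity a coherence score is, here is a plain-Python sketch of a UMass-style measure based on document co-occurrence counts. It is a simplified illustration, not Gensim's `CoherenceModel`, whose measures and pipeline differ in detail; the function name `umass_style_coherence` and the toy documents are assumptions for this example.

```python
import math


def umass_style_coherence(topic_words, documents):
    """Average of log((D(w_i, w_j) + 1) / D(w_j)) over ordered word pairs."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for doc in doc_sets if all(word in doc for word in words))

    scores = []
    for i in range(1, len(topic_words)):
        for j in range(i):
            w_i, w_j = topic_words[i], topic_words[j]
            d_j = doc_freq(w_j)
            if d_j == 0:
                continue  # skip words that never occur in the documents
            scores.append(math.log((doc_freq(w_i, w_j) + 1) / d_j))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical pre-processed documents.
toy_documents = [
    ["server", "client", "request"],
    ["server", "client"],
    ["image", "button"],
    ["image", "view", "button"],
]
coherent_score = umass_style_coherence(["server", "client"], toy_documents)
mixed_score = umass_style_coherence(["server", "image"], toy_documents)
print(coherent_score, mixed_score)
```

A topic whose top words share documents scores higher than one that mixes words from unrelated documents, and that difference is what lets us rank models without reading every topic by hand.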

In [11]:
#lsitopics = [[word for word, prob in topic] for topicid, topic in lsimodel.show_topics(formatted=False)]

#hdptopics = [[word for word, prob in topic] for topicid, topic in hdpmodel.show_topics(formatted=False)]

#ldatopics = [[word for word, prob in topic] for topicid, topic in ldamodel.show_topics(formatted=False)]

In [12]:
#lsi_coherence = CoherenceModel(topics=lsitopics[:10], texts=texts, dictionary=dictionary, window_size=10).get_coherence()

#hdp_coherence = CoherenceModel(topics=hdptopics[:10], texts=texts, dictionary=dictionary, window_size=10).get_coherence()

#lda_coherence = CoherenceModel(topics=ldatopics, texts=texts, dictionary=dictionary, window_size=10).get_coherence()

In [13]:
def evaluate_bar_graph(coherences, indices):
    """
    Function to plot bar graph.
    
    coherences: list of coherence values
    indices: Indices to be used to mark bars. Length of this and coherences should be equal.
    """
    assert len(coherences) == len(indices)
    n = len(coherences)
    x = np.arange(n)
    plt.bar(x, coherences, width=0.2, tick_label=indices, align='center')
    plt.xlabel('Models')
    plt.ylabel('Coherence Value')

In [14]:
#evaluate_bar_graph([lsi_coherence, hdp_coherence, lda_coherence],
#                   ['LSI', 'HDP', 'LDA'])

We can see that topic coherence helps us get past manually inspecting our topic models - we can now keep fine-tuning our models and compare them to see which performs best.



In [15]:
pyLDAvis.save_html(pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary), "demo2.html")



In [16]:
#print(lsi_coherence)
#print(hdp_coherence)
#print(lda_coherence)