This tutorial will guide you through the process of analyzing your textual data - from pre-processing it - applying topic modelling algorithms - evaluate the topic model manually and automatically - analyzing it's few applications through visualization.

In [1]:
import gensim
import os, re
import numpy as np
import pandas as pd
import pyLDAvis.gensim

import nltk
nltk.download('wordnet') # download wordnet to be used in lemmatization
from nltk.stem import WordNetLemmatizer

from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation
from gensim.corpora import Dictionary
from gensim.models.callbacks import CoherenceMetric, DiffMetric, PerplexityMetric, ConvergenceMetric

Using TensorFlow backend.


[nltk_data] Downloading package wordnet to /Users/parul/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


For this tutorial, we will use the kaggle's [fake news dataset](https://www.kaggle.com/mrisdal/fake-news).

## Pre-process

This is one of the most important step in analyzing the text data. If the preprocessing is not good, the algorithm can't do much since we are feeding it a lot of noise, or in other words, **Garbage In Garbage Out**. So let's first clean our data using the following techniques:

1. Stopword removal
2. Strip punctuation
3. Bigram collocation detection (frequently co-occuring tokens). Ex. our topic model will make more sense when 'New' and 'York' are treated as 'New_York'.
4. Lemmatization (converting word to its dictionary form)

In [2]:
df_fake = pd.read_csv('fake.csv')
df_fake[['title', 'text', 'language']].head()
df_fake = df_fake.loc[(pd.notnull(df_fake.text)) & (df_fake.language == 'english')]

# Convert data to tokenized list as required by gensim
texts = []
for line in df_fake.text:
    lowered = line.lower()
    words = re.findall(r'\w+', lowered, flags = re.UNICODE | re.LOCALE)
    texts.append(words)

# for bigram collocation detection
bigram = gensim.models.Phrases(texts)

def preprocess(texts):
    # remove punctuation
    texts = [strip_punctuation(' '.join(line)).split() for line in texts]
    # remove stopwords
    texts = [remove_stopwords(' '.join(line)).split() for line in texts]
    # collocation detection
    texts = [bigram[line] for line in texts]
    # lemmatization 
    lemmatizer = WordNetLemmatizer()
    texts = [[word for word in lemmatizer.lemmatize(' '.join(line), pos='v').split()] for line in texts]
    
    return texts

# pre-processing
texts = preprocess(texts)

# split into training, holdout and test data
training_texts = texts[:5000]
holdout_texts = texts[5000:7500]
test_texts = texts[7500:]

# create dictionary mappings for training data
dictionary = Dictionary(training_texts)

# create corpus using training data's dictionary mappings
training_corpus = [dictionary.doc2bow(text) for text in training_texts]
holdout_corpus = [dictionary.doc2bow(text) for text in holdout_texts]
test_corpus = [dictionary.doc2bow(text) for text in test_texts]



## Train

Now let's train our topic model which is just a matter of single line with gensim. 

But wait! a cool functionality has recently been added to gensim which can be used to monitor the training progress of our topic model. It basically plots the training statistics (using the evaluation metrics: Coherence, Perplexity, Topic difference and Convergence), which can be used to know if our model has been sufficiently trained or there's still a scope of optimizing it more.

We will need to start the Visdom server for visualization:

`python -m visdom.server`

Visdom browser can now be accessed at http://localhost:8097.

In [3]:
# define perplexity callback for hold_out and test corpus
pl_holdout = PerplexityMetric(corpus=holdout_corpus, logger="visdom", title="Perplexity (hold_out)")
pl_test = PerplexityMetric(corpus=test_corpus, logger="visdom", title="Perplexity (test)")

# remaining metrics
ch_umass = CoherenceMetric(corpus=training_corpus, coherence="u_mass", logger="visdom", title="Coherence (u_mass)")
ch_cv = CoherenceMetric(corpus=training_corpus, texts=training_texts, coherence="c_v", logger="visdom", title="Coherence (c_v)")
diff_kl = DiffMetric(distance="kullback_leibler", logger="visdom", title="Diff (kullback_leibler)")
convergence_jc = ConvergenceMetric(distance="jaccard", logger="visdom", title="Convergence (jaccard)")

callbacks = [pl_holdout, pl_test, ch_umass, ch_cv, diff_kl, convergence_jc]

In [9]:
# training LDA model
lda_model = LdaModel(corpus=training_corpus, id2word=dictionary, num_topics=35, passes=50, chunksize=1500, iterations=200, alpha='auto', callbacks=callbacks)

<img src="visdom_graph.png">

In [10]:
lda_model.show_topics(num_topics=5)  # Showing only the top 5 topics

[(29,
  '0.053*"brain_force" + 0.044*"http_www" + 0.029*"health" + 0.028*"widget" + 0.028*"wellness_infowars" + 0.028*"force_html" + 0.028*"utm_source" + 0.028*"infowarsstore_com" + 0.028*"utm_campaign" + 0.028*"utm_medium"'),
 (14,
  '0.013*"general" + 0.012*"navy_retired" + 0.011*"air_force" + 0.008*"army_retired" + 0.006*"modi" + 0.006*"u_s" + 0.006*"election" + 0.006*"retired_major" + 0.005*"jewish_press" + 0.005*"tape"'),
 (3,
  '0.036*"s" + 0.026*"hillary" + 0.012*"t" + 0.012*"hillary_clinton" + 0.012*"people" + 0.011*"donald_trump" + 0.011*"trump" + 0.009*"going" + 0.009*"election" + 0.009*"think"'),
 (10,
  '0.032*"s" + 0.014*"clinton" + 0.006*"public" + 0.005*"government" + 0.005*"u_s" + 0.004*"state" + 0.004*"campaign" + 0.004*"president" + 0.004*"hillary_clinton" + 0.004*"money"'),
 (6,
  '0.042*"s" + 0.011*"said" + 0.011*"t" + 0.010*"like" + 0.009*"time" + 0.008*"people" + 0.008*"know" + 0.006*"going" + 0.006*"think" + 0.005*"m"')]

We can also log the training statistics or access them after the model is trained, for any other custom use case. You can refer to [this](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Training_visualizations.ipynb) notebook for further doc on how to use the API.

Other topic modelling algorithms available in gensim:

In [11]:
# decompose the original matrix of words to maintain key topics
lsi_model = LsiModel(corpus=training_corpus, num_topics=35, id2word=dictionary)
lsi_model.show_topics(num_topics=5)

[(0,
  '0.751*"s" + 0.190*"people" + 0.156*"trump" + 0.115*"said" + 0.114*"clinton" + 0.113*"t" + 0.102*"like" + 0.092*"u_s" + 0.088*"world" + 0.087*"time"'),
 (1,
  '-0.364*"trump" + 0.286*"arabs" + 0.260*"jewish" + 0.244*"arab" + 0.235*"morris" + -0.212*"clinton" + 0.164*"jews" + 0.151*"palestine" + 0.144*"israel" + -0.131*"hillary_clinton"'),
 (2,
  '-0.325*"people" + -0.243*"god" + 0.238*"clinton" + 0.226*"trump" + 0.200*"arabs" + 0.172*"arab" + 0.167*"morris" + -0.138*"t" + -0.137*"like" + -0.133*"world"'),
 (3,
  '0.560*"trump" + -0.283*"u_s" + -0.223*"syria" + 0.206*"donald_trump" + -0.183*"s" + 0.178*"people" + -0.176*"russia" + 0.154*"arabs" + 0.148*"jewish" + 0.128*"morris"'),
 (4,
  '0.374*"trump" + -0.370*"s" + 0.252*"syria" + 0.247*"people" + 0.210*"war" + 0.188*"u_s" + 0.176*"russia" + -0.159*"lesley_stahl" + 0.146*"world" + 0.129*"government"')]

In [8]:
# a completely unsupervised algorithm which figures out the no. of topics on it's own
hdp_model = HdpModel(corpus=training_corpus, id2word=dictionary)
hdp_model.show_topics(num_topics=5)

[(0,
  '0.020*s + 0.005*people + 0.005*trump + 0.004*said + 0.003*clinton + 0.003*t + 0.003*like + 0.003*time + 0.003*world + 0.003*u_s + 0.002*hillary_clinton + 0.002*government + 0.002*new + 0.002*state + 0.002*election + 0.002*2016 + 0.002*hillary + 0.002*president + 0.002*war + 0.002*russia'),
 (1,
  '0.017*s + 0.005*said + 0.005*trump + 0.004*people + 0.003*clinton + 0.003*t + 0.003*black + 0.002*2016 + 0.002*hillary_clinton + 0.002*u_s + 0.002*russia + 0.002*war + 0.002*time + 0.002*like + 0.002*new + 0.002*state + 0.002*election + 0.002*donald_trump + 0.002*government + 0.002*syria'),
 (2,
  '0.012*s + 0.004*clinton + 0.004*trump + 0.003*said + 0.002*2016 + 0.002*time + 0.002*hillary_clinton + 0.002*people + 0.002*election + 0.002*state + 0.001*like + 0.001*t + 0.001*government + 0.001*goldman_sachs + 0.001*donald_trump + 0.001*america + 0.001*university + 0.001*city + 0.001*president + 0.001*fbi'),
 (3,
  '0.007*s + 0.002*white + 0.002*new + 0.002*world + 0.002*2016 + 0.002*peo

## Evaluation

### Manual:

We would like to know if the correct thing has been learned, does the topics inferred make sense as per our text data.

Thanks to pyLDAvis, we can visualise our topic models in a really handy way and inspect what words the topics consist of or how similar the topics are.

In [16]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(lda_model, training_corpus, dictionary, sort_topics=False)

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
  topic_term_dists = topic_term_dists.ix[topic_order]


### Automatic:

Coherence is often used to get past the manual inspection and objectively compare the topic models. By returning a score, we can compare between different topic models of the same corpus.

In [13]:
# LDA model coherence
CoherenceModel(lda_model, texts=training_texts, dictionary=dictionary, window_size=10).get_coherence()

0.43647287708995514

In [14]:
# LSI model coherence
CoherenceModel(lsi_model, texts=training_texts, dictionary=dictionary, window_size=10).get_coherence()

0.30212940791550469

In [15]:
# HDP model coherence
CoherenceModel(hdp_model, texts=training_texts, dictionary=dictionary, window_size=10).get_coherence()

0.34361949892725863