# Demonstration of the topic coherence pipeline in Gensim

## Introduction

We will compute the `c_v` coherence for two LDA models: a "good" one and a "bad" one. The good model is trained with 200 iterations over 5 passes of the corpus, the bad one with a single iteration. In theory, the good model should therefore come up with better, more human-interpretable topics, so its coherence score should be higher (better) than that of the bad model.

In [1]:
from __future__ import print_function

import logging
import warnings

from gensim.models import CoherenceModel, LdaModel
from gensim.corpora import Dictionary

warnings.filterwarnings('ignore')  # Suppress all warnings to keep the output readable

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

import load_lee_background_corpus as load_texts

### Set up corpus

In [2]:
texts = load_texts.get_train_texts()

/Users/home/Desenvolvimento/anaconda3/lib/python3.6/site-packages/gensim/test/test_data


2017-11-22 00:45:06,203 : INFO : collecting all words and their counts
2017-11-22 00:45:06,206 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2017-11-22 00:45:06,250 : INFO : collected 20429 word types from a corpus of 19878 words (unigram + bigrams) and 300 sentences
2017-11-22 00:45:06,251 : INFO : using 20429 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>


Returning 300 training texts


In [3]:
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

2017-11-22 00:45:09,085 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2017-11-22 00:45:09,125 : INFO : built Dictionary(4431 unique tokens: ['hundreds', 'people', 'homes', 'southern', 'highlands']...) from 300 documents (total 18861 corpus positions)
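
To make the bag-of-words step concrete, here is a minimal pure-Python sketch of what a token-to-id mapping and a `doc2bow`-style conversion do. It is an illustration only, not Gensim's implementation: the helper names `build_token2id` and `to_bow` and the toy texts are hypothetical, and Gensim may assign ids in a different order.

```python
from collections import Counter


def build_token2id(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(text, token2id):
    """Return sorted (token_id, count) pairs for the known tokens in `text`."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())


# Hypothetical toy data, standing in for the tokenized training texts.
toy_docs = [
    ["human", "computer", "interface"],
    ["computer", "system", "computer"],
]
toy_token2id = build_token2id(toy_docs)
toy_corpus = [to_bow(doc, toy_token2id) for doc in toy_docs]
print(toy_corpus)
```

Each document becomes a sparse list of `(token_id, count)` pairs, which is the shape of input an LDA model is trained on.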


### Set up two topic models

We'll set up two LDA topic models, a good one and a bad one. To build the "good" model, we simply train it with more iterations (and more passes over the corpus) than the bad one. Since it should then produce more human-interpretable topics, its coherence should in theory be higher than that of the bad model.

In [4]:
goodLdaModel = LdaModel(corpus=corpus, id2word=dictionary, iterations=200, passes=5, num_topics=10)
badLdaModel = LdaModel(corpus=corpus, id2word=dictionary, iterations=1, num_topics=10)

2017-11-22 00:45:09,164 : INFO : using symmetric alpha at 0.1
2017-11-22 00:45:09,168 : INFO : using symmetric eta at 0.00022568269013766644
2017-11-22 00:45:09,172 : INFO : using serial LDA version on this node
2017-11-22 00:45:09,654 : INFO : running online (multi-pass) LDA training, 10 topics, 5 passes over the supplied corpus of 300 documents, updating model once every 300 documents, evaluating perplexity every 300 documents, iterating 200x with a convergence threshold of 0.001000
2017-11-22 00:45:10,280 : DEBUG : bound: at document #0
2017-11-22 00:45:13,727 : INFO : -11.613 per-word bound, 3132.8 perplexity estimate based on a held-out corpus of 300 documents with 18861 words
2017-11-22 00:45:13,730 : INFO : PROGRESS: pass 0, at document #300/300
2017-11-22 00:45:13,732 : DEBUG : performing inference on a chunk of 300 documents
2017-11-22 00:45:16,011 : DEBUG : 225/300 documents converged within 200 iterations
2017-11-22 00:45:16,013 : DEBUG : updating topics
2017-11-22 00:45:16,

2017-11-22 00:45:24,474 : INFO : topic #7 (0.100): 0.013*"taliban" + 0.010*"afghanistan" + 0.008*"government" + 0.008*"australia" + 0.008*"man" + 0.007*"forces" + 0.007*"india" + 0.006*"pakistan" + 0.006*"people" + 0.005*"united_states"
2017-11-22 00:45:24,477 : INFO : topic #3 (0.100): 0.019*"arafat" + 0.011*"israeli" + 0.010*"government" + 0.008*"people" + 0.007*"israel" + 0.007*"test" + 0.007*"sharon" + 0.006*"hamas" + 0.006*"attacks" + 0.006*"west_bank"
2017-11-22 00:45:24,480 : INFO : topic #1 (0.100): 0.012*"australia" + 0.011*"metres" + 0.007*"test" + 0.006*"south_africa" + 0.006*"innings" + 0.005*"adelaide" + 0.005*"day" + 0.005*"event" + 0.005*"wicket" + 0.005*"water"
2017-11-22 00:45:24,482 : INFO : topic diff=0.070731, rho=0.408248
2017-11-22 00:45:24,484 : INFO : using symmetric alpha at 0.1
2017-11-22 00:45:24,487 : INFO : using symmetric eta at 0.00022568269013766644
2017-11-22 00:45:24,490 : INFO : using serial LDA version on this node
2017-11-22 00:45:25,015 : INFO : ru

### Set up the coherence models

In [5]:
goodcm = CoherenceModel(model=goodLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
badcm = CoherenceModel(model=badLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')

2017-11-22 00:45:27,232 : DEBUG : Setting topics to those of the model: LdaModel(num_terms=4431, num_topics=10, decay=0.5, chunksize=2000)
2017-11-22 00:45:27,239 : DEBUG : Setting topics to those of the model: LdaModel(num_terms=4431, num_topics=10, decay=0.5, chunksize=2000)


### Pipeline parameters for C_V coherence

In [6]:
print(goodcm)

Coherence_Measure(seg=<function s_one_set at 0x1a11e1ee18>, prob=<function p_boolean_sliding_window at 0x1a11e32048>, conf=<function cosine_similarity at 0x1a11e7ae18>, aggr=<function arithmetic_mean at 0x1a11e74400>)
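
The printed `Coherence_Measure` names the four stages that `c_v` chains together: segmentation (`s_one_set`), probability estimation (`p_boolean_sliding_window`), a confirmation measure (`cosine_similarity`), and aggregation (`arithmetic_mean`). The following is a minimal pure-Python sketch of that four-stage idea on a toy corpus, assuming NPMI-based context vectors as in the usual description of `c_v`. It is not Gensim's implementation: every function name and the toy data are hypothetical, and edge cases are handled more crudely than in the library.

```python
import math


def sliding_windows(doc, size):
    """Yield each window of `size` consecutive tokens as a set (presence only)."""
    if len(doc) <= size:
        yield set(doc)
        return
    for start in range(len(doc) - size + 1):
        yield set(doc[start:start + size])


def probability(windows, *words):
    """Fraction of windows that contain every word in `words`."""
    hits = sum(all(w in win for w in words) for win in windows)
    return hits / len(windows)


def npmi(a, b, windows):
    """Normalized pointwise mutual information of two words, in [-1, 1]."""
    joint = probability(windows, a, b)
    if joint == 0.0:
        return -1.0  # the words never share a window
    if joint == 1.0:
        return 1.0  # the words share every window
    pmi = math.log(joint / (probability(windows, a) * probability(windows, b)))
    return pmi / -math.log(joint)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def topic_coherence(topic, windows):
    """Score one topic, given as a list of its top words."""
    # Context vector of each word: its NPMI with every word of the topic.
    vectors = [[npmi(w, other, windows) for other in topic] for w in topic]
    # One-set segmentation: compare each word with the whole topic,
    # whose context vector is the sum of the word vectors.
    whole = [sum(column) for column in zip(*vectors)]
    scores = [cosine(vec, whole) for vec in vectors]
    return sum(scores) / len(scores)  # aggregation: arithmetic mean


def model_coherence(topics, texts, window_size):
    """Average the topic scores over all topics of a model."""
    windows = [win for doc in texts for win in sliding_windows(doc, window_size)]
    return sum(topic_coherence(t, windows) for t in topics) / len(topics)


# Hypothetical toy corpus: two "pet" documents and two "finance" documents.
toy_texts = [
    ["cat", "dog", "pet", "vet"],
    ["cat", "dog", "pet", "food"],
    ["stock", "market", "bank", "trade"],
    ["stock", "market", "bank", "loan"],
]
print(model_coherence([["cat", "dog", "pet"]], toy_texts, window_size=4))
print(model_coherence([["cat", "market", "loan"]], toy_texts, window_size=4))
```

In this toy setting the first topic, whose words always occur together, scores higher than the second, which mixes words from unrelated documents. That is the same kind of gap the good and bad LDA models are expected to show.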


### Print coherence values

In [7]:
print(goodcm.get_coherence())
print(badcm.get_coherence())

2017-11-22 00:45:27,269 : INFO : using ParallelWordOccurrenceAccumulator(processes=3, batch_size=64) to estimate probabilities from sliding windows
2017-11-22 00:45:27,436 : DEBUG : completed batch 0; 64 documents processed (92 virtual)
2017-11-22 00:45:27,504 : DEBUG : completed batch 0; 64 documents processed (258 virtual)
2017-11-22 00:45:27,549 : DEBUG : completed batch 0; 64 documents processed (305 virtual)
2017-11-22 00:45:27,555 : DEBUG : observed sentinel value; terminating
2017-11-22 00:45:27,562 : DEBUG : finished all batches; 64 documents processed (305 virtual)
2017-11-22 00:45:27,566 : INFO : serializing accumulator to return to master...
2017-11-22 00:45:27,572 : INFO : accumulator serialized
2017-11-22 00:45:27,649 : DEBUG : completed batch 1; 107 documents processed (484 virtual)
2017-11-22 00:45:27,652 : DEBUG : observed sentinel value; terminating
2017-11-22 00:45:27,653 : DEBUG : completed batch 1; 128 documents processed (332 virtual)
2017-11-22 00:45:27,655 : DEBU

0.372587195389


2017-11-22 00:45:28,761 : DEBUG : completed batch 0; 64 documents processed (92 virtual)
2017-11-22 00:45:28,797 : DEBUG : completed batch 0; 64 documents processed (305 virtual)
2017-11-22 00:45:28,801 : DEBUG : completed batch 0; 64 documents processed (258 virtual)
2017-11-22 00:45:28,807 : DEBUG : observed sentinel value; terminating
2017-11-22 00:45:28,810 : DEBUG : finished all batches; 64 documents processed (258 virtual)
2017-11-22 00:45:28,813 : INFO : serializing accumulator to return to master...
2017-11-22 00:45:28,832 : DEBUG : completed batch 1; 128 documents processed (333 virtual)
2017-11-22 00:45:28,819 : INFO : accumulator serialized
2017-11-22 00:45:28,836 : DEBUG : observed sentinel value; terminating
2017-11-22 00:45:28,846 : INFO : accumulator serialized
2017-11-22 00:45:28,839 : DEBUG : finished all batches; 128 documents processed (333 virtual)
2017-11-22 00:45:28,842 : INFO : serializing accumulator to return to master...
2017-11-22 00:45:28,870 : DEBUG : compl

0.272436160451
