# Performing Model Selection Using Topic Coherence

This notebook will perform topic modeling on the 20 Newsgroups corpus using LDA. We will perform model selection (over the number of topics) using topic coherence as our evaluation metric. This will showcase some of the features of the topic coherence pipeline implemented in `gensim`. In particular, we will see several features of the `CoherenceModel`.

In [1]:
from __future__ import print_function

import os
import re

from gensim.corpora import TextCorpus, MmCorpus
from gensim import utils, models
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import deaccent



## Parsing the Dataset

The 20 Newsgroups dataset uses a hierarchical directory structure to store the articles. The structure looks something like this:
```
20news-18828/
|-- alt.atheism
|   |-- 49960
|   |-- 51060
|   |-- 51119
|-- comp.graphics
|   |-- 37261
|   |-- 37913
|   |-- 37914
|-- comp.os.ms-windows.misc
|   |-- 10000
|   |-- 10001
|   |-- 10002
```

The files are in the newsgroup markup format, which includes some headers, quoting of previous messages in the thread, and possibly PGP signature blocks. The message body itself is raw text, which requires preprocessing. The code immediately below is an adaptation of [an active PR](https://github.com/RaRe-Technologies/gensim/pull/1388) for parsing hierarchical directory structures into corpora. The code just below that builds on this basic corpus parser to handle the newsgroup-specific text parsing.

In [2]:
class TextDirectoryCorpus(TextCorpus):
    """Read documents recursively from a directory,
    where each file is interpreted as a plain text document.
    """
    
    def iter_filepaths(self):
        """Lazily yield the path of every file found under `self.input`,
        walking the directory structure recursively.
        """
        for dirpath, dirnames, filenames in os.walk(self.input):
            for name in filenames:
                yield os.path.join(dirpath, name)
                
    def getstream(self):
        for path in self.iter_filepaths():
            with utils.smart_open(path) as f:
                doc_content = f.read()
            yield doc_content
    
    def preprocess_text(self, text):
        text = deaccent(
            lower_to_unicode(
                strip_multiple_whitespaces(text)))
        tokens = simple_tokenize(text)
        return remove_short(
            remove_stopwords(tokens))
        
    def get_texts(self):
        """Iterate over the collection, yielding one document at a time. A document
        is a sequence of words (strings) that can be fed into `Dictionary.doc2bow`.
        Override this function to match your input (parse input files, do any
        text preprocessing, lowercasing, tokenizing etc.). There will be no further
        preprocessing of the words coming out of this function.
        """
        lines = self.getstream()
        if self.metadata:
            for lineno, line in enumerate(lines):
                yield self.preprocess_text(line), (lineno,)
        else:
            for line in lines:
                yield self.preprocess_text(line)

    
def remove_stopwords(tokens, stopwords=STOPWORDS):
    return [token for token in tokens if token not in stopwords]

def remove_short(tokens, minsize=3):
    return [token for token in tokens if len(token) >= minsize]

def lower_to_unicode(text):
    return utils.to_unicode(text.lower(), 'ascii', 'ignore')

RE_WHITESPACE = re.compile(r"(\s)+", re.UNICODE)
def strip_multiple_whitespaces(text):
    return RE_WHITESPACE.sub(" ", text)

PAT_ALPHABETIC = re.compile(r'(((?![\d])\w)+)', re.UNICODE)  # raw string avoids invalid-escape warnings
def simple_tokenize(text):
    for match in PAT_ALPHABETIC.finditer(text):
        yield match.group()
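
To make the preprocessing steps concrete, here is a minimal, standalone sketch of the same pipeline on a single sentence. It reuses the two regular expressions from the cell above, but `DEMO_STOPWORDS` and `demo_preprocess` are hypothetical names introduced only for this illustration; the tiny stopword set stands in for gensim's `STOPWORDS`, and the `deaccent` and unicode-conversion steps are left out.

```python
import re

# Tiny stand-in for gensim's STOPWORDS (an assumption for this demo only).
DEMO_STOPWORDS = frozenset(["the", "at", "it", "is", "and"])

RE_WHITESPACE = re.compile(r"(\s)+", re.UNICODE)
PAT_ALPHABETIC = re.compile(r"(((?![\d])\w)+)", re.UNICODE)


def demo_preprocess(text, stopwords=DEMO_STOPWORDS, minsize=3):
    """Collapse whitespace, lowercase, tokenize, then filter the tokens."""
    text = RE_WHITESPACE.sub(" ", text).lower()
    # Tokens are runs of word characters that are not digits.
    tokens = (match.group() for match in PAT_ALPHABETIC.finditer(text))
    # Drop stopwords and tokens shorter than `minsize` characters.
    return [tok for tok in tokens if tok not in stopwords and len(tok) >= minsize]


demo_tokens = demo_preprocess("The  GPU renders 3D   graphics at 60fps, it's fast!")
print(demo_tokens)
```

Note how digits split tokens ("60fps" contributes only "fps") and how the fragments left over from "3D" and "it's" are too short to survive the length filter.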

In [3]:
class NewsgroupCorpus(TextDirectoryCorpus):
    """Parse 20 Newsgroups dataset."""

    def extract_body(self, text):
        return strip_newsgroup_header(
            strip_newsgroup_footer(
                strip_newsgroup_quoting(text)))

    def preprocess_text(self, text):
        body = self.extract_body(text)
        return super(NewsgroupCorpus, self).preprocess_text(body)


def strip_newsgroup_header(text):
    """Given text in "news" format, strip the headers, by removing everything
    before the first blank line.
    """
    _before, _blankline, after = text.partition('\n\n')
    return after


_QUOTE_RE = re.compile(r'(writes in|writes:|wrote:|says:|said:'
                       r'|^In article|^Quoted from|^\||^>)')
def strip_newsgroup_quoting(text):
    """Given text in "news" format, strip lines beginning with the quote
    characters > or |, plus lines that often introduce a quoted section
    (for example, because they contain the string 'writes:'.)
    """
    good_lines = [line for line in text.split('\n')
                  if not _QUOTE_RE.search(line)]
    return '\n'.join(good_lines)


_PGP_SIG_BEGIN = "-----BEGIN PGP SIGNATURE-----"
def strip_newsgroup_footer(text):
    """Given text in "news" format, attempt to remove a signature block."""
    try:
        return text[:text.index(_PGP_SIG_BEGIN)]
    except ValueError:
        return text
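
As a quick sanity check, the three stripping steps can be tried on a made-up message. The sketch below is self-contained: `demo_extract_body` is a hypothetical, condensed copy of the helpers above (same regular expression, same order of operations), and the message is invented for illustration.

```python
import re

_QUOTE_RE = re.compile(r'(writes in|writes:|wrote:|says:|said:'
                       r'|^In article|^Quoted from|^\||^>)')
_PGP_SIG_BEGIN = "-----BEGIN PGP SIGNATURE-----"


def demo_extract_body(text):
    """Condensed copy of the quoting -> footer -> header stripping steps."""
    # 1. Drop quoted lines and lines that introduce a quotation.
    text = '\n'.join(line for line in text.split('\n')
                     if not _QUOTE_RE.search(line))
    # 2. Cut everything from the PGP signature marker onward, if present.
    sig_start = text.find(_PGP_SIG_BEGIN)
    if sig_start != -1:
        text = text[:sig_start]
    # 3. Keep only what follows the first blank line (the headers end there).
    _headers, _blank, body = text.partition('\n\n')
    return body


# An invented message with headers, a quoted reply, and a signature block.
demo_message = '\n'.join([
    "From: alice@example.com",
    "Subject: Re: rendering question",
    "",
    "In article <123@example.com> bob@example.com writes:",
    "> Which library should I use?",
    "I would start with the simplest one.",
    "-----BEGIN PGP SIGNATURE-----",
    "abc123",
    "-----END PGP SIGNATURE-----",
])

demo_body = demo_extract_body(demo_message)
print(repr(demo_body))
```

Only the author's own sentence should remain; the headers, the attribution line, the quoted line, and the signature block are all removed.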

### Loading the Dataset

Now that we have defined the necessary code for parsing the dataset, let's load it up and serialize it into Matrix Market format. We'll do this because we want to train LDA on it with several different parameter settings, and this will allow us to avoid repeating the preprocessing.
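
For readers unfamiliar with Matrix Market, here is a rough pure-Python sketch of the coordinate text layout: a header line, a size line, and then one `row column value` triple per nonzero entry with 1-based indices. The helpers `write_mm` and `read_mm` are hypothetical and are not gensim's implementation; the sketch assumes documents are rows and term ids are columns.

```python
import io


def write_mm(bow_docs, num_terms):
    """Serialize bag-of-words documents as Matrix Market coordinate text."""
    num_nonzero = sum(len(doc) for doc in bow_docs)
    out = io.StringIO()
    out.write("%%MatrixMarket matrix coordinate real general\n")
    out.write("%d %d %d\n" % (len(bow_docs), num_terms, num_nonzero))
    for doc_no, doc in enumerate(bow_docs):
        for term_id, weight in sorted(doc):
            # Matrix Market indices are 1-based.
            out.write("%d %d %s\n" % (doc_no + 1, term_id + 1, weight))
    return out.getvalue()


def read_mm(mm_text):
    """Parse the text back into a list of bag-of-words documents."""
    lines = [line for line in mm_text.splitlines() if not line.startswith("%")]
    num_docs, _num_terms, _num_nonzero = (int(x) for x in lines[0].split())
    docs = [[] for _ in range(num_docs)]
    for line in lines[1:]:
        doc_no, term_id, weight = line.split()
        docs[int(doc_no) - 1].append((int(term_id) - 1, float(weight)))
    return docs


# Three toy documents over a four-term vocabulary: (term_id, count) pairs.
demo_bow = [[(0, 2.0), (3, 1.0)], [(1, 1.0)], [(2, 4.0), (3, 1.0)]]
demo_mm_text = write_mm(demo_bow, num_terms=4)
print(demo_mm_text)
```

Because the file already holds the tokenized, dictionary-mapped counts, reading it back skips all of the text parsing.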

In [4]:
# Replace data_path with path to your own copy of the corpus.
# You can download it from here: http://qwone.com/~jason/20Newsgroups/
# I'm using the original, called: 20news-19997.tar.gz

home = os.path.expanduser('~')
data_dir = os.path.join(home, 'workshop', 'nlp', 'data')
data_path = os.path.join(data_dir, '20_newsgroups')

In [5]:
%%time

corpus = NewsgroupCorpus(data_path)
corpus.dictionary.filter_extremes(no_below=5, no_above=0.8)
dictionary = corpus.dictionary
print(len(corpus))
print(dictionary)

19998
Dictionary(24595 unique tokens: [u'woods', u'hanging', u'woody', u'localized', u'gaa']...)
CPU times: user 21.7 s, sys: 2.88 s, total: 24.6 s
Wall time: 30.9 s


In [6]:
%%time

mm_path = os.path.join(data_dir, '20_newsgroups.mm')
MmCorpus.serialize(mm_path, corpus, id2word=dictionary)
mm_corpus = MmCorpus(mm_path)  # load back in to use for LDA training

CPU times: user 22.8 s, sys: 842 ms, total: 23.6 s
Wall time: 24 s


## Training the Models

Our goal is to determine which number of topics produces the most coherent topics for the 20 Newsgroups corpus. The corpus is roughly 20,000 documents. If we used 100 topics and the documents were evenly distributed among topics, we'd have clusters of 200 documents. This seems like a reasonable upper bound. In this case, the corpus actually has categories, defined by the first-level directory structure. This can be seen in the directory structure shown above, and three examples are: `alt.atheism`, `comp.graphics`, and `comp.os.ms-windows.misc`. There are 20 of these (hence the name of the dataset), so we'll use 20 as our lower bound for the number of topics.

One could argue that we already know the model should have 20 topics. I'll argue there may be additional categorizations within each newsgroup and we might hope to capture those by using more topics. We'll step by increments of 10 from 20 to 100.
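
The back-of-the-envelope reasoning behind these bounds can be written out as a short sketch. It only uses the document count printed when the corpus was loaded and assumes, unrealistically, that documents spread evenly over topics.

```python
num_docs = 19998  # document count printed when the corpus was loaded above
candidate_num_topics = list(range(20, 101, 10))  # 20, 30, ..., 100

for k in candidate_num_topics:
    # Integer division gives the rough cluster size under an even split.
    print("k=%3d -> about %4d documents per topic" % (k, num_docs // k))
```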

In [16]:
%%time

trained_models = {}
for num_topics in range(20, 101, 10):
    print("Training LDA(k=%d)" % num_topics)
    lda = models.LdaMulticore(
        mm_corpus, id2word=dictionary, num_topics=num_topics, workers=4,
        passes=10, iterations=100, random_state=42, eval_every=None,
        alpha='asymmetric',  # shown to be better than symmetric in most cases
        decay=0.5, offset=64  # best params from Hoffman paper
    )
    trained_models[num_topics] = lda

Training LDA(k=20)
Training LDA(k=30)
Training LDA(k=40)
Training LDA(k=50)
Training LDA(k=60)
Training LDA(k=70)
Training LDA(k=80)
Training LDA(k=90)
Training LDA(k=100)
CPU times: user 20min 38s, sys: 3min 16s, total: 23min 55s
Wall time: 23min 43s


## Evaluation Using Coherence

Now we get to the heart of this notebook. In this section, we'll evaluate each of our LDA models using topic coherence. Coherence is a measure of how interpretable the topics are to humans. It is based on the representation of topics as the top-N most probable words for a particular topic. More specifically, given the topic-term matrix for LDA, we sort each topic from highest to lowest term weights and then select the first N terms.
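
As a small illustration of the top-N representation, the sketch below sorts each row of a toy topic-term matrix and keeps the first N terms. The vocabulary, the weights, and the helper `demo_top_terms` are all hypothetical; the real topic-term matrix comes from the trained LDA model.

```python
demo_vocab = ["space", "nasa", "orbit", "windows", "file", "driver"]

# Hypothetical topic-term weights: one row per topic, one column per term.
demo_topic_term = [
    [0.40, 0.30, 0.20, 0.05, 0.03, 0.02],
    [0.02, 0.03, 0.05, 0.35, 0.30, 0.25],
]


def demo_top_terms(topic_term, vocab, topn):
    """Return the `topn` highest-weighted terms of each topic, best first."""
    top_lists = []
    for weights in topic_term:
        ranked_ids = sorted(range(len(weights)),
                            key=lambda term_id: weights[term_id], reverse=True)
        top_lists.append([vocab[term_id] for term_id in ranked_ids[:topn]])
    return top_lists


demo_top = demo_top_terms(demo_topic_term, demo_vocab, topn=3)
print(demo_top)
```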

Coherence essentially measures how similar these words are to each other. There are various methods for doing this, most of which have been explored in the paper ["Exploring the Space of Topic Coherence Measures"](https://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf). The authors performed a comparative analysis of various methods, correlating them to human judgements. The method named "c_v" coherence was found to be the most highly correlated. This and several of the other methods have been implemented in `gensim.models.CoherenceModel`. We will use this to perform our evaluations.

The "c_v" coherence method makes an expensive pass over the corpus, accumulating term occurrence and co-occurrence counts. It only accumulates counts for the terms in the lists of top-N terms for each topic. In order to ensure we only need to make one pass, we'll construct a "super topic" from the top-N lists of each of the models. This will consist of a single topic with all the relevant terms from all the models. We choose 20 as N.

In [17]:
# Build topic listings from each model.
import itertools
from gensim import matutils


def top_topics(lda, num_words=20):
    str_topics = []
    for topic in lda.state.get_lambda():
        topic = topic / topic.sum()  # normalize to probability distribution
        bestn = matutils.argsort(topic, topn=num_words, reverse=True)
        beststr = [lda.id2word[_id] for _id in bestn]
        str_topics.append(beststr)
    return str_topics


model_topics = {}
super_topic = set()
for num_topics, model in trained_models.items():
    topics_as_topn_terms = top_topics(model)
    model_topics[num_topics] = topics_as_topn_terms
    super_topic.update(itertools.chain.from_iterable(topics_as_topn_terms))
    
print("Number of relevant terms: %d" % len(super_topic))

Number of relevant terms: 2714


In [53]:
%%time
# Now estimate the probabilities for the CoherenceModel

cm = models.CoherenceModel(
    topics=[super_topic], topn=len(super_topic), texts=corpus.get_texts(),
    dictionary=dictionary, coherence='c_v')
accumulator = cm.estimate_probabilities()

CPU times: user 39.2 s, sys: 3.23 s, total: 42.4 s
Wall time: 1min 21s


In [55]:
%%time
import numpy as np


def eval_coherence(cm, model_topics):
    """Perform the coherence evaluation for each of the models.

    Since we have already precomputed the probabilities, this simply
    involves using the accumulated stats in the `CoherenceModel` to
    perform the evaluations, which should be pretty quick.

    Args:
        cm (CoherenceModel): coherence model to evaluate coherences with. Should
            have already estimated probabilities.
        model_topics (dict): mapping from `num_topics` to the list of lists of
            top-N words for the model trained with that number of topics.

    Returns:
        dict: mapping from `num_topics` to tuple of `(avg_topic_coherences, avg_coherence)`.
            These are the coherence values per topic and the overall model coherence.
    """
    coherences = {}
    for num_topics, topics in model_topics.items():
        cm.topics = topics

        # We evaluate at various values of N and average them. This is more robust,
        # according to: http://people.eng.unimelb.edu.au/tbaldwin/pubs/naacl2016.pdf
        coherence_at_n = {}
        for n in (20, 15, 10, 5):
            cm.topn = n
            topic_coherences = cm.get_coherence_per_topic()

            # Let's record the coherences for each topic, as well as the aggregated
            # coherence across all of the topics.
            coherence_at_n[n] = (topic_coherences, cm.aggregate_measures(topic_coherences))

        topic_coherences, avg_coherences = zip(*coherence_at_n.values())
        avg_topic_coherences = np.vstack(topic_coherences).mean(0)
        avg_coherence = np.mean(avg_coherences)
        print("Avg coherence for num_topics=%d: %.5f" % (num_topics, avg_coherence))
        coherences[num_topics] = (avg_topic_coherences, avg_coherence)

    return coherences


def print_coherence_rankings(coherences):
    avg_coherence = \
        [(num_topics, avg_coherence)
         for num_topics, (_, avg_coherence) in coherences.items()]
    ranked = sorted(avg_coherence, key=lambda tup: tup[1], reverse=True)
    print("Ranked by average '%s' coherence:\n" % cm.coherence)
    for item in ranked:
        print("num_topics=%d:\t%.4f" % item)
    print("\nBest: %d" % ranked[0][0])

CPU times: user 18 µs, sys: 16 µs, total: 34 µs
Wall time: 30 µs
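
To see what the averaging over N does, here is a toy version with made-up numbers for a two-topic model. The coherence values are purely illustrative, and the sketch assumes the model-level aggregate is a plain arithmetic mean of the per-topic values.

```python
from statistics import mean

# Hypothetical per-topic coherences for a two-topic model at each N.
demo_coherence_at_n = {
    20: [0.50, 0.40],
    15: [0.55, 0.45],
    10: [0.60, 0.35],
    5: [0.65, 0.40],
}

# Per-topic average across N (a column-wise mean over the rows above).
demo_avg_topic = [mean(scores) for scores in zip(*demo_coherence_at_n.values())]

# Model-level score: aggregate per N first, then average those aggregates.
demo_avg_model = mean(mean(scores) for scores in demo_coherence_at_n.values())

print(demo_avg_topic, demo_avg_model)
```

Averaging over several cutoffs keeps a single lucky or unlucky choice of N from dominating the comparison between models.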


In [56]:
coherences = eval_coherence(cm, model_topics)
print_coherence_rankings(coherences)

Avg coherence for num_topics=100: 0.53087
Avg coherence for num_topics=70: 0.51614
Avg coherence for num_topics=40: 0.54629
Avg coherence for num_topics=80: 0.53581
Avg coherence for num_topics=50: 0.54383
Avg coherence for num_topics=20: 0.53597
Avg coherence for num_topics=90: 0.51484
Avg coherence for num_topics=60: 0.52619
Avg coherence for num_topics=30: 0.56122
Ranked by average 'c_v' coherence:

num_topics=30:	0.5612
num_topics=40:	0.5463
num_topics=50:	0.5438
num_topics=20:	0.5360
num_topics=80:	0.5358
num_topics=100:	0.5309
num_topics=60:	0.5262
num_topics=70:	0.5161
num_topics=90:	0.5148

Best: 30


### Results so Far

So far in this notebook, we have used `gensim`'s `CoherenceModel` to perform model selection over the number of topics for LDA. We found that for the 20 Newsgroups corpus, 30 topics scored highest among the values we tried. We showcased the ability of the coherence pipeline to evaluate individual topic coherence as well as aggregated model coherence. We also demonstrated how to avoid repeated passes over the corpus, estimating the term similarity probabilities for all relevant terms just once. Topic coherence is a powerful alternative to evaluation using perplexity on a held-out document set. It is appropriate to use whenever the objective of the topic modeling is to present the topics as top-N lists for human consumption.

Note that coherence calculations are generally much more accurate when a larger reference corpus is used to estimate the probabilities. In this case, we used the same corpus as for our modeling, which is relatively small at only 20,000 documents. A better reference corpus is the full Wikipedia corpus. The motivated explorer of this notebook is encouraged to download that corpus (see [Experiments on the English Wikipedia](https://radimrehurek.com/gensim/wiki.html)) and use it for probability estimation.

Next we'll look at another method of coherence evaluation using distributional word embeddings.

### Evaluating Coherence with Word2Vec

The fact that "c_v" coherence uses distributional semantics to evaluate word similarity motivates the use of Word2Vec for coherence evaluation. This idea is explored further in an appendix at the end of the notebook. The `CoherenceModel` implemented in `gensim` also supports this, so let's look at a few examples.

In [65]:
%%time

cm = models.CoherenceModel(
    topics=[super_topic], topn=len(super_topic), texts=corpus.get_texts(),
    dictionary=dictionary, coherence='c_w2v')
cm.estimate_probabilities()

CPU times: user 20.5 s, sys: 794 ms, total: 21.3 s
Wall time: 21.6 s


In [66]:
coherences = eval_coherence(cm, model_topics)
print_coherence_rankings(coherences)

Avg coherence for num_topics=100: 0.31072
Avg coherence for num_topics=70: 0.31160
Avg coherence for num_topics=40: 0.30917
Avg coherence for num_topics=80: 0.31221
Avg coherence for num_topics=50: 0.30868
Avg coherence for num_topics=20: 0.31732
Avg coherence for num_topics=90: 0.31081
Avg coherence for num_topics=60: 0.31131
Avg coherence for num_topics=30: 0.30768
Ranked by average 'c_w2v' coherence:

num_topics=20:	0.3173
num_topics=80:	0.3122
num_topics=70:	0.3116
num_topics=60:	0.3113
num_topics=90:	0.3108
num_topics=100:	0.3107
num_topics=40:	0.3092
num_topics=50:	0.3087
num_topics=30:	0.3077

Best: 20


#### Using pre-trained word vectors for coherence evaluation.

Whoa! These results are completely different from those of the "c_v" method, and "c_w2v" is saying the model we thought was best is actually the worst! So what happened here?

The same note must be made for Word2Vec ("c_w2v") that we made for "c_v": results are more accurate when a larger reference corpus is used. For "c_w2v", however, this is actually _way, way_ more important. Distributional word embedding techniques such as Word2Vec are fitting a probability distribution with a large number of parameters, and doing that takes a lot of data.

Luckily, there are a variety of pre-trained word vectors [freely available for download](http://ahogrammer.com/2017/01/20/the-list-of-pretrained-word-embeddings/). Below we demonstrate using word vectors trained on ~100 billion words from Google News, [available at this link](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing). Note that this file is 1.5G, so downloading it can take quite some time. It is also quite slow to load and ends up occupying about 3.35G in memory (this load time is included in the timing below). There is no need to use such a large set of word vectors for this evaluation; this one is just readily available.

In [61]:
%%time

models_dir = os.path.join(home, 'workshop', 'nlp', 'models')
vectors_path = os.path.join(models_dir, 'GoogleNews-vectors-negative300.bin.gz')
keyed_vectors = models.KeyedVectors.load_word2vec_format(vectors_path, binary=True)

cm = models.CoherenceModel(
    topics=[super_topic], texts=corpus.get_texts(),
    dictionary=dictionary, coherence='c_w2v',
    keyed_vectors=keyed_vectors)
cm.estimate_probabilities()  # still need to estimate_probabilities, but corpus is not scanned

CPU times: user 2min 59s, sys: 4.4 s, total: 3min 3s
Wall time: 3min 4s


In [62]:
coherences = eval_coherence(cm, model_topics)
print_coherence_rankings(coherences)

Avg coherence for num_topics=100: 0.49062
Avg coherence for num_topics=70: 0.49919
Avg coherence for num_topics=40: 0.50496
Avg coherence for num_topics=80: 0.50293
Avg coherence for num_topics=50: 0.51106
Avg coherence for num_topics=20: 0.50269
Avg coherence for num_topics=90: 0.48956
Avg coherence for num_topics=60: 0.49603
Avg coherence for num_topics=30: 0.51147
Ranked by average 'c_w2v' coherence:

num_topics=30:	0.5115
num_topics=50:	0.5111
num_topics=40:	0.5050
num_topics=80:	0.5029
num_topics=20:	0.5027
num_topics=70:	0.4992
num_topics=60:	0.4960
num_topics=100:	0.4906
num_topics=90:	0.4896

Best: 30


#### Looks like we've now restored order

The "c_w2v" evaluation is now agreeing with "c_v" on the best model, and the rest of the ordering is generally quite similar. Note that the "c_w2v" values should not be compared directly to those produced by the "c_v" method. Only the ranking of models is comparable.
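
Since only the rankings are comparable, one way to quantify the agreement is a rank correlation. The sketch below computes Spearman's rho in plain Python from the average coherences printed in the outputs above; `ranks` and `spearman_rho` are hypothetical helpers written for this illustration, and they assume there are no tied scores.

```python
def ranks(scores):
    """Map each key to its rank (1 = highest score); assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {key: position for position, key in enumerate(ordered, start=1)}


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation for two score dicts with identical keys."""
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    n = len(rank_a)
    sum_sq_diff = sum((rank_a[k] - rank_b[k]) ** 2 for k in rank_a)
    return 1.0 - 6.0 * sum_sq_diff / (n * (n ** 2 - 1))


# Average coherences copied from the outputs printed in this notebook.
c_v = {20: 0.53597, 30: 0.56122, 40: 0.54629, 50: 0.54383, 60: 0.52619,
       70: 0.51614, 80: 0.53581, 90: 0.51484, 100: 0.53087}
c_w2v_corpus = {20: 0.31732, 30: 0.30768, 40: 0.30917, 50: 0.30868, 60: 0.31131,
                70: 0.31160, 80: 0.31221, 90: 0.31081, 100: 0.31072}
c_w2v_pretrained = {20: 0.50269, 30: 0.51147, 40: 0.50496, 50: 0.51106, 60: 0.49603,
                    70: 0.49919, 80: 0.50293, 90: 0.48956, 100: 0.49062}

rho_corpus = spearman_rho(c_v, c_w2v_corpus)
rho_pretrained = spearman_rho(c_v, c_w2v_pretrained)
print("c_v vs corpus-trained c_w2v: %.3f" % rho_corpus)
print("c_v vs pre-trained c_w2v:    %.3f" % rho_pretrained)
```

With these numbers, the correlation with "c_v" is about -0.52 for the vectors trained on the newsgroup corpus and 0.90 for the pre-trained vectors, which matches the impression given by the rankings.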

## Appendix: Why Word2Vec for Coherence?

The "c_v" coherence method drags a sliding window across all documents in the corpus to accumulate co-occurrence statistics. Similarity is calculated using normalized pointwise mutual information (NPMI) values estimated from these statistics. More specifically, each word is represented by a vector of its NPMI with every other word in its top-N topic list. These vectors are then used to compute (cosine) similarity between words. The restriction to the other words in the top-N list was found to produce better results than using the entire vocabulary and other methods of reducing the vocabulary (see section 3.2.2 of http://www.aclweb.org/anthology/W13-0102).
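
The following is a deliberately simplified sketch of that idea, not gensim's implementation. It treats each toy document as a single boolean window instead of sliding a window over the text, and it averages pairwise cosine similarities between the NPMI vectors. The documents, the topic, and all function names are made up for illustration.

```python
import math
from itertools import combinations

demo_docs = [
    ["space", "nasa", "orbit"],
    ["space", "orbit", "launch"],
    ["nasa", "launch", "space"],
    ["windows", "driver", "file"],
    ["windows", "file", "space"],
]
demo_topic = ["space", "nasa", "orbit", "launch"]
EPS = 1e-12  # keeps the logarithms finite when two words never co-occur


def doc_prob(*words):
    """Fraction of documents that contain every one of the given words."""
    hits = sum(1 for doc in demo_docs if all(w in doc for w in words))
    return hits / len(demo_docs)


def npmi(w1, w2):
    """Normalized PMI; each document acts as one boolean window."""
    joint = doc_prob(w1, w2)
    pmi = math.log((joint + EPS) / (doc_prob(w1) * doc_prob(w2)))
    return pmi / -math.log(joint + EPS)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Each topic word is represented by its NPMI with every word in the topic.
vectors = {w: [npmi(w, other) for other in demo_topic] for w in demo_topic}
pair_sims = {(a, b): cosine(vectors[a], vectors[b])
             for a, b in combinations(demo_topic, 2)}
demo_coherence = sum(pair_sims.values()) / len(pair_sims)
print("toy coherence: %.3f" % demo_coherence)
```

Words that tend to share documents get positive NPMI, while words that never co-occur end up close to -1, so a topic made of related words scores higher.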

The fact that a reduced space is superior for these metrics indicates there is noise getting in the way. The "c_v" method can be seen as constructing an NPMI matrix between words. The vector of NPMI values for a particular word can then be looked up by indexing the row or column corresponding to that word's `Dictionary` ID. The reduction to the "topic word space" can then be achieved by using a mask to select out the top-N topic words. If we are constructing an NPMI matrix between words, then discarding some elements to reduce noise, why not factorize the matrix instead? Dimensionality reduction techniques such as SVD do a great job of reducing noise along with dimensionality, while also providing a compressed representation to work with.
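
To make the factorization idea tangible, here is a small sketch using a truncated SVD from `numpy`. The NPMI matrix is hypothetical (five words, two loose groups), and this only illustrates the argument; it is not something the `CoherenceModel` does.

```python
import numpy as np

demo_words = ["space", "nasa", "orbit", "windows", "file"]

# Hypothetical symmetric NPMI matrix: the first three words co-occur with each
# other, the last two co-occur with each other, and the groups rarely mix.
demo_npmi = np.array([
    [1.00, 0.60, 0.55, -0.50, -0.45],
    [0.60, 1.00, 0.65, -0.55, -0.50],
    [0.55, 0.65, 1.00, -0.50, -0.55],
    [-0.50, -0.55, -0.50, 1.00, 0.60],
    [-0.45, -0.50, -0.55, 0.60, 1.00],
])

# Truncated SVD: keep only the k strongest directions of the matrix.
u, s, _vt = np.linalg.svd(demo_npmi)
k = 2
embeddings = u[:, :k] * np.sqrt(s[:k])  # one k-dimensional vector per word


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_related = cosine(embeddings[0], embeddings[1])    # space vs nasa
sim_unrelated = cosine(embeddings[0], embeddings[3])  # space vs windows
print("space~nasa: %.3f  space~windows: %.3f" % (sim_related, sim_unrelated))
```

Each word is now a dense two-dimensional vector rather than a full row of the matrix, and the group structure survives the compression.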

[Recent work](https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization) has shown that Word2Vec (trained with Skip-Gram Negative Sampling (SGNS)) is actually implicitly factorizing a PMI matrix shifted by a positive constant. [A subsequent paper](http://dl.acm.org/citation.cfm?id=2914720) compared Word2Vec to a few different PMI-based metrics and showed that it found coherence values that correlated more strongly with human judgements.