# Topic Modeling with Gensim

This is a simple worked example of training and interpreting a topic model with [Gensim](https://radimrehurek.com/gensim). It assumes that the input text has already been cleaned and tokenized. For a more in-depth example that also uses [spaCy](https://spacy.io) for text cleaning, see Patrick Harrison's [Modern NLP in Python](https://www.youtube.com/watch?v=6zm9NC9uRkk) talk and the accompanying [notebook](https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb).

## Setup 

To go through this tutorial, first create a virtual environment and then install the required packages into it with:

    $ pip install -r requirements.txt

## Initial Config

In [23]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
# Note: newer pyLDAvis releases renamed this module to pyLDAvis.gensim_models.
import pyLDAvis.gensim

# this file should contain a collection of documents, one per line,
# with each document being represented as whitespace-separated tokens
TEXT_FILE_PATH = "cleaned_text.txt"

# The number of worker processes to use for parallel computing. A good number 
# is N-1, where N is the number of actual CPU cores you have (not hyperthreads). 
# You might need to set this to 1 on Windows.
WORKERS = 3

def tokens_from_file(path):
    """Generator yielding tokens for each document in a file.
    Assumes input file has one document per line with tokens separated 
    by whitespace. 
    """
    with open(path) as file:
        for line in file:
            yield line.strip().split()   
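
The N-1 rule of thumb from the `WORKERS` comment above can be sketched as a small helper. This is only an illustration: `suggested_workers` is a hypothetical name, not part of Gensim, and `os.cpu_count()` reports logical cores (hyperthreads included), so passing your physical core count explicitly is likely the safer choice.

```python
import os

def suggested_workers(core_count=None):
    """Return core_count - 1, but never less than 1."""
    if core_count is None:
        # os.cpu_count() counts logical cores and may return None.
        core_count = os.cpu_count() or 1
    return max(1, core_count - 1)

# Physical core count passed explicitly (4 is just an example value).
print(suggested_workers(4))  # 3
```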

## Topic Modeling with Gensim
First we need to build a dictionary from the cleaned text. As part of this process we filter out tokens that are very rare or very common in the corpus. With these settings, a token must appear in at least 10 documents to be kept, and it is filtered out if it appears in more than 40% of documents. See [Gensim's documentation](https://radimrehurek.com/gensim/corpora/dictionary.html) for the other parameters of the `filter_extremes` method.

In [4]:
dictionary = Dictionary(tokens_from_file(TEXT_FILE_PATH))
dictionary.filter_extremes(no_below=10, no_above=0.4)
dictionary.compactify()
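
To make the two thresholds concrete, here is a minimal plain-Python sketch of the document-frequency rule described above. It does not call Gensim: the function name `filter_by_document_frequency` and the toy corpus are made up for illustration, and Gensim's own implementation may differ in details such as rounding.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Return tokens found in at least `no_below` documents and in
    at most a `no_above` fraction of all documents."""
    docs = list(docs)
    # Count each token once per document (document frequency).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {t for t, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["the", "budget", "vote"],
    ["the", "budget", "debate"],
    ["the", "health", "vote"],
    ["the", "health", "bill"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))  # ['budget', 'health', 'vote']
```

Here "the" is dropped because it appears in every document, while "debate" and "bill" are dropped because each appears in only one.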

Before we can train the LDA model, we use the dictionary to convert each document in the corpus into a bag-of-words representation: a list of (token id, count) pairs.

In [11]:
bow_texts = [dictionary.doc2bow(tokens) for tokens in tokens_from_file(TEXT_FILE_PATH)]
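
As a rough illustration of what a bag-of-words conversion produces, the sketch below maps tokens to (token id, count) pairs in plain Python. The helper `to_bag_of_words` and the toy id mapping are hypothetical and stand in for the dictionary built earlier. This is not Gensim's code, only the general idea.

```python
from collections import Counter

def to_bag_of_words(tokens, token2id):
    """Map tokens to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Toy vocabulary: "the" is absent, as if it had been filtered out.
toy_token2id = {"budget": 0, "health": 1, "vote": 2}
bow = to_bag_of_words(["vote", "budget", "vote", "the"], toy_token2id)
print(bow)  # [(0, 1), (2, 2)]
```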

Now train the LDA model!

In [12]:
lda = LdaMulticore(
    bow_texts,
    num_topics=15,
    id2word=dictionary,
    workers=WORKERS,
)

Let's write a helper function to explore the top terms for each topic. Have a look at a few different topics and see if you can identify a coherent theme.

In [21]:
def explore_topic(model, topic_number, topn=25):
    """Print a formatted list of the top `topn` terms
    for the given topic number.
    """
    print(f"{'term':20} frequency\n")
    # show_topic returns (term, weight) pairs; the weight is the
    # term's probability within the topic.
    for term, frequency in model.show_topic(topic_number, topn=topn):
        print(f"{term:20} {frequency:.3f}")

explore_topic(lda, 1)

term                 frequency

-d-                  0.112
-dd-                 0.045
line                 0.014
b                    0.013
minister             0.013
amendment            0.011
schedule             0.011
-month-_-year-       0.008
page_-dd-            0.007
c                    0.007
year                 0.007
government           0.007
-ddd-                0.007
subsection           0.006
item                 0.006
substitute           0.006
question             0.005
page                 0.005
minute               0.005
person               0.005
-month-              0.005
leave_grant          0.004
notice               0.004
project              0.004
item_-dd-            0.004


While looking at the top keywords is useful, an even better approach is to use the [pyLDAvis](https://github.com/bmabey/pyLDAvis) tool. Note that this can take a while to run because of the dimensionality reduction it has to perform.

In [24]:
ldavis_prepared = pyLDAvis.gensim.prepare(lda, bow_texts, dictionary, n_jobs=WORKERS)

In [27]:
pyLDAvis.display(ldavis_prepared)