# Topic Modeling with Gensim

This is a simple worked example of training and interpreting a topic model with [Gensim](https://radimrehurek.com/gensim). It assumes that the input text collection has already been cleaned and processed into a single text file, with one document per line and each document represented as whitespace-separated tokens. For a more in-depth example that also uses [spaCy](https://spacy.io) for text cleaning, see Patrick Harrison's [Modern NLP in Python](https://www.youtube.com/watch?v=6zm9NC9uRkk) talk and accompanying [notebook](https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb).

## Setup 

To go through this tutorial, first create a virtual environment and then install the needed packages into it with:

    $ pip install -r requirements.txt

## Initial Config

In [1]:
%load_ext autoreload
%autoreload 2

# this file should contain a collection of documents, one per line,
# with each document being represented as whitespace-separated tokens
TEXT_FILE_PATH = "cleaned_text.txt"

# The number of worker processes to use for parallel computing. A good number 
# is N-1, where N is the number of actual CPU cores you have (not hyperthreads). 
# You might need to set this to 1 on Windows.
WORKERS = 3

def tokens_from_file(path):
    """Generator yielding tokens for each document in a file.
    Assumes input file has one document per line with tokens separated 
    by whitespace. 
    """
    with open(path, encoding="utf8") as file:
        for line in file:
            yield line.strip().split()

## Training an LDA Topic Model
First we need to build a dictionary from the cleaned text. Gensim provides a `Dictionary` class that efficiently stores the vocabulary you will be using from your collection. As part of this process we filter out tokens that are very rare or very common in the corpus. With the settings below, a token must appear in at least 10 documents to be kept, and it will be filtered out if it appears in more than 40% of documents. See [Gensim's documentation](https://radimrehurek.com/gensim/corpora/dictionary.html) for other parameters to the `filter_extremes` method.

In [2]:
from gensim.corpora import Dictionary


# create the dictionary with cleaned tokens
dictionary = Dictionary(tokens_from_file(TEXT_FILE_PATH))

# filter rare and common tokens from the dictionary
dictionary.filter_extremes(no_below=10, no_above=0.4)

# this removes gaps in the dictionary's data structure after token filtering, saving some space  
dictionary.compactify()

# num_pos is the total number of tokens processed; the vocabulary size is len(dictionary)
print(f"The dictionary was built from {dictionary.num_docs} documents and {dictionary.num_pos} total tokens")

The dictionary was built from 181965 documents and 22475719 total tokens
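
To make the document-frequency filtering concrete, here is a minimal plain-Python sketch of the idea behind `no_below` and `no_above`. It is not Gensim's implementation; the function name `filter_by_document_frequency` and the toy corpus are made up for illustration.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below, no_above):
    """Return the set of tokens that survive document-frequency filtering.

    A token is kept if it appears in at least `no_below` documents and in
    no more than `no_above` (a fraction) of all documents.
    """
    # count each token once per document, however often it repeats there
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = int(no_above * len(docs))
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


# toy corpus (hypothetical): "the" is in every document, "rare" in only one
toy_docs = [
    ["the", "budget", "tax", "tax"],
    ["the", "budget", "climate"],
    ["the", "climate", "rare"],
    ["the", "tax", "climate"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))  # ['budget', 'climate', 'tax']
```

Note that "tax" occurs twice in the first toy document but still counts as one document for filtering purposes.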


We then need to use the dictionary to convert the corpus into a sequence of bag-of-words vectors before we can train the LDA model.

In [3]:
bow_texts = [dictionary.doc2bow(tokens) for tokens in tokens_from_file(TEXT_FILE_PATH)]

# show the first 20 (token ID, count) pairs of the first document
print(bow_texts[0][:20])

[(0, 9), (1, 1), (2, 4), (3, 1), (4, 1), (5, 9), (6, 1), (7, 5), (8, 1), (9, 1), (10, 2), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 3), (17, 1), (18, 7), (19, 1)]


We can see that our collection is now represented as a list of documents, with each document represented as a list of `(token ID, count)` pairs. Note that in this representation tokens have been converted into integer IDs, which is more efficient than storing strings. This means that whenever we need the original tokens back, we'll need the dictionary to convert the token IDs to their string form.
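
The mapping between tokens and IDs can be sketched in plain Python as follows. This is only a conceptual illustration, not how Gensim builds its dictionary: the helpers `build_vocab` and `to_bow` are hypothetical, and the order in which IDs are assigned here is an assumption of the sketch. With a real Gensim `Dictionary`, `dictionary[token_id]` returns the token for an ID and `dictionary.token2id` holds the reverse mapping.

```python
from collections import Counter


def build_vocab(docs):
    """Assign each distinct token an integer ID in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# toy corpus (hypothetical)
toy_docs = [["tax", "budget", "tax"], ["climate", "budget"]]
token2id = build_vocab(toy_docs)
id2token = {i: t for t, i in token2id.items()}

bow = to_bow(toy_docs[0], token2id)
print(bow)                                   # [(0, 2), (1, 1)]
print([(id2token[i], n) for i, n in bow])    # [('tax', 2), ('budget', 1)]
```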

Now train the LDA model. We'll use Gensim's [LdaMulticore](https://radimrehurek.com/gensim/models/ldamulticore.html) class to perform parallel training using multiple worker processes, which will make the training faster. If you don't have access to an environment/machine with multicore support, set `workers` to 1. 

In [4]:
from gensim.models.ldamulticore import LdaMulticore


lda = LdaMulticore(
    bow_texts,
    num_topics=15,
    id2word=dictionary,
    workers=WORKERS,
)

## Interpreting the Model

Let's write a helper function to explore the top terms for each topic. Have a look at a few different topics and see if you can identify a coherent theme.

In [17]:
def explore_topic(model, topic_number, topn=25):
    """Accept a user-supplied topic number and
    print out a formatted list of the top tokens by probability
    """
    print(f"{'token':20} probability\n")
    # pass topn through so the caller controls how many tokens are shown
    for token, probability in model.show_topic(topic_number, topn=topn):
        print(f"{token:20} {probability:.3f}")


explore_topic(lda, 1)

token                probability

government           0.019
australia            0.013
-d-                  0.011
year                 0.008
industry             0.007
australian           0.007
need                 0.007
climate_change       0.007
job                  0.007
-year-               0.007
policy               0.006
economy              0.006
cost                 0.006
support              0.006
scheme               0.006
labor                0.006
country              0.005
people               0.005
-dd-_cent            0.005
business             0.005
future               0.005
increase             0.004
work                 0.004
sector               0.004
coalition            0.004


While looking at the top keywords is useful, an even better approach is to explore the model interactively with the [pyLDAvis](https://github.com/bmabey/pyLDAvis) tool. Note that this can take a while to run due to the dimensionality reduction it has to perform.

In [5]:
import pyLDAvis
# note: in pyLDAvis 3.x this module was renamed to pyLDAvis.gensim_models
import pyLDAvis.gensim

ldavis_prepared = pyLDAvis.gensim.prepare(
    lda, 
    bow_texts, 
    dictionary, 
    sort_topics=False, 
    n_jobs=WORKERS
)

pyLDAvis.display(ldavis_prepared)


## Tuning the Model

In [None]:
from gensim.models.coherencemodel import CoherenceModel


def train_lda_models(end, start=2, step=5, coherence="u_mass"):
    """Train LDA models for a range of topic numbers and get coherence scores

    end:       max number of topics to test
    start:     min number of topics to test
    step:      increments in topic number range to test
    coherence: coherence metric to be used. Must be valid value for the 'coherence'
               param of Gensim's 'CoherenceModel'.

    Returns a list of (num_topics, model, coherence_score) tuples.
    """
    results = []
    for num_topics in range(start, end+1, step):
        print(f"Training topic model with {num_topics} topics...")
        model = LdaMulticore(
            bow_texts,
            num_topics=num_topics,
            id2word=dictionary,
            workers=WORKERS,
        )
        cm = CoherenceModel(
            model=model,
            corpus=bow_texts,
            texts=tokens_from_file(TEXT_FILE_PATH),
            coherence=coherence
        )
        # store one result per trained model (list.append takes a single argument)
        results.append((num_topics, model, cm.get_coherence()))
    return results


# end=32 is only an example upper bound; choose a range that suits your collection
model_results = train_lda_models(end=32, start=2, step=5)

# TODO: show graphic the results and also describe what's going on
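
As a rough sketch of how the results might be compared, assuming each result is a `(num_topics, model, coherence)` tuple as the `results.append` call above intends, the model with the highest coherence score can be picked out like this. The helper `best_by_coherence` is hypothetical and the scores below are made-up placeholders, not results from this corpus.

```python
def best_by_coherence(results):
    """Return the (num_topics, model, coherence) tuple with the highest coherence."""
    return max(results, key=lambda result: result[2])


# made-up placeholder scores for illustration only; None stands in for trained models
example_results = [(2, None, -2.4), (7, None, -1.6), (12, None, -1.9)]

for num_topics, _, score in example_results:
    print(f"{num_topics:>3} topics: coherence {score:.2f}")

best_num_topics, _, best_score = best_by_coherence(example_results)
print(f"best: {best_num_topics} topics")
```

A coherence score is only a guide: it is still worth inspecting the topics of the highest-scoring models by hand before settling on one.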

## Using the Topic Model

TODO: finish this
Once we've settled on a trained topic model whose topics seem coherent...
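
One possible use, sketched here as an assumption rather than as this tutorial's intended approach, is to look up the most probable topic for a new document. The helper name `dominant_topic` is hypothetical; it relies on `Dictionary.doc2bow` and the LDA model's `get_document_topics` method, and it assumes the new document has been cleaned and tokenised in the same way as the training text.

```python
def dominant_topic(model, dictionary, tokens):
    """Return the (topic_id, probability) pair with the highest probability
    for a tokenised document, or None if the model returns no topics.
    """
    # tokens that are not in the dictionary are ignored by doc2bow
    bow = dictionary.doc2bow(tokens)
    # list of (topic_id, probability) pairs for this document
    topics = model.get_document_topics(bow)
    return max(topics, key=lambda pair: pair[1]) if topics else None
```

With the objects from this notebook it could be called as `dominant_topic(lda, dictionary, "government climate_change policy".split())`, and the returned topic number passed to `explore_topic` to see that topic's top tokens.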