## LDA 3

# Fitting an LDA to our corpus

We plan to perform topic modeling using *Latent Dirichlet Allocation* (abbreviated as LDA). An LDA is a *generative model* that learns a group of categories (or *topics*) for words that occur together in a corpus of documents. For a technical presentation of LDAs, see [Appendix A](404).
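To make "generative model" concrete, here is a minimal sketch of the story LDA assumes for how a document is produced. It uses only the standard library. The vocabulary, the two topic-word distributions and the names `sample_dirichlet` and `generate_document` are made up for illustration and are not part of `utils`.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, n_words, vocab, rng):
    """Generate one toy document following the LDA generative story."""
    # 1. Draw the document's mixture over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to the mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's distribution over the vocabulary.
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words


rng = random.Random(0)
vocab = ["logic", "proof", "language", "meaning"]
# Hypothetical topics: one about logic, one about language.
topic_word = [
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
]
theta, doc = generate_document(topic_word, [0.5, 0.5], 8, vocab, rng)
print(theta, doc)
```

Fitting an LDA goes in the opposite direction: given only the documents, it estimates the topic-word distributions and each document's topic mixture.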

Let's start by loading our corpus:

In [1]:
from utils.corpus import Corpus

corpus = Corpus(registry_path = 'utils/article_registry.json')

We instantiate an initial `Model` object and give it access to our corpus. We must also give it the number of topics it should train on.

In [4]:
from utils.model import Model

n_topics = 10
base_model = Model(corpus, n_topics)

Loading corpus. Num. of articles: 906


To train the `Model`, we can use the `train()` method. 

In [6]:
base_model.train(time_window=50)

Bags of words collected. Starting training...
1950 - 1999: 336
2000 - 2049: 570


  convergence = np.fabs((bound - old_bound) / old_bound)


KeyboardInterrupt: 

We can save this model to a file using the `save()` method:

In [None]:
base_model.save()

## Analyzing the Model Coherence

Coherence is an important statistic for calibrating how many topics we should have in our final model. We can compute it by calling the `get_coherence()` method.

In [None]:
base_model.get_coherence()
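This notebook does not say which coherence measure `get_coherence()` computes. As an illustration of the general idea, here is a sketch of one common measure, UMass coherence, which rewards topics whose top words tend to appear in the same documents. The function name `umass_coherence` and the toy documents are assumptions made for this example.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence of one topic, given its top words ordered by probability.

    Every word in top_words is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(top_words[l]))
    return score


toy_docs = [
    ["logic", "proof"],
    ["logic", "proof", "truth"],
    ["language", "meaning"],
    ["language", "game"],
]
coherent_topic = umass_coherence(["logic", "proof"], toy_docs)
mixed_topic = umass_coherence(["logic", "language"], toy_docs)
print(coherent_topic > mixed_topic)
```

Words that co-occur ("logic", "proof") score higher than words that never share a document ("logic", "language"), which is the behaviour we want from a coherence score.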

This coherence score allows us to search for the "best" `n_topics`. Notice that the score is sensitive to the random number generation used when creating the `lda`. If we want to control this randomness, we can pass a `seed` parameter to the `train()` method. We will do this later when we implement our final model.

## Running a more complete grid-search

The previous section suggests that we can compare models and calibrate the number of topics by training several models for each candidate value of `n_topics`. We now implement this experiment in a `gridsearch()` function, which relies on the `get_stats()` method we included in each model.

TODO: Run it again after addressing this comment: https://github.com/RaRe-Technologies/gensim/issues/2115#issuecomment-443113360

In [None]:
def gridsearch(min_topics, max_topics, step, iterations=3, verbose=True):
    """
    Collects statistics for several models per number of topics.

    For each n_topics in range(min_topics, max_topics, step) (max_topics
    itself is excluded), we build `iterations` models on the global `corpus`
    and record the output of get_stats() for each one.

    Returns a nested dict of the following form:

    experiment = {
        n_topics: {
            0: Model(corpus, n_topics).get_stats(),
            1: Model(corpus, n_topics).get_stats(),
            ...
            iterations - 1: Model(corpus, n_topics).get_stats()
        }
    }

    We expect the stats for a given n_topics to differ slightly
    between iterations due to stochasticity in the models.
    """
    
    experiment = {}
    for n_topics in range(min_topics, max_topics, step):
        experiment[n_topics] = {}
        print(f"\nRunning experiment for {n_topics} topics.")
        print("----------")
        for i in range(iterations):
            if verbose:
                print(f"Iteration: {i}")

            experiment[n_topics][i] = Model(corpus, n_topics).get_stats()

    return experiment

**Careful:** this gridsearch can take a whole evening.

In [None]:
experiment = gridsearch(80, 200, 10, iterations=3)

Finally, we can inspect the results before saving them for further analysis later on:

In [None]:
experiment
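The cell above only displays the dictionary. One way to persist it in its nested form is a plain JSON dump, sketched below under the assumption that every value returned by `get_stats()` is JSON-serializable. The helper names, the temporary path and the numbers in `toy_experiment` are made up for illustration and are not real results.

```python
import json
import os
import tempfile


def save_experiment(experiment, path):
    """Write the nested {n_topics: {iteration: stats}} dict to a JSON file."""
    with open(path, "w") as fp:
        json.dump(experiment, fp, indent=2)


def load_experiment(path):
    """Read the nested dict back from disk."""
    with open(path) as fp:
        return json.load(fp)


# Made-up stand-in for the output of gridsearch().
toy_experiment = {
    80: {0: {"coherence": 0.41}, 1: {"coherence": 0.43}},
    90: {0: {"coherence": 0.47}, 1: {"coherence": 0.45}},
}

path = os.path.join(tempfile.mkdtemp(), "gridsearch_sketch.json")
save_experiment(toy_experiment, path)
reloaded = load_experiment(path)
print(sorted(reloaded))
```

JSON object keys are always strings, so after reloading the integer keys `80` and `90` become `"80"` and `"90"`, and the iteration indices become `"0"` and `"1"`.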

In [8]:
import json
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

## Understanding the gridsearch results 

In [None]:
data = []
for n_topics in experiment:
    for iteration, results in experiment[n_topics].items():
        results['n_topics'] = n_topics
        results['iteration'] = iteration
        data.append(results)

df = pd.DataFrame(data)
df['time_lda'] = df['time_lda'] / 60
df['time_coherence'] = df['time_coherence'] / 60

In [None]:
df.to_json('../data/gridsearch.json')

In [None]:
df.groupby('n_topics').mean()
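The grouped means can be used to rank the candidate values of `n_topics`. Below is a plain-Python sketch of the same aggregation, assuming that higher coherence is better and that coherence is the only criterion. The helper `mean_stat_by_n_topics` and the rows are made up for illustration and are not real results.

```python
from collections import defaultdict
from statistics import mean


def mean_stat_by_n_topics(rows, stat):
    """Average one statistic over all iterations that share the same n_topics."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["n_topics"]].append(row[stat])
    return {n: mean(values) for n, values in grouped.items()}


# Made-up rows in the same shape as the records collected in `data`.
rows = [
    {"n_topics": 80, "iteration": 0, "coherence": 0.41},
    {"n_topics": 80, "iteration": 1, "coherence": 0.43},
    {"n_topics": 90, "iteration": 0, "coherence": 0.47},
    {"n_topics": 90, "iteration": 1, "coherence": 0.45},
]

means = mean_stat_by_n_topics(rows, "coherence")
best_n_topics = max(means, key=means.get)
print(means, best_n_topics)
```

In practice the other statistics (perplexity, training time, articles per topic) also matter, so a single maximum is only a starting point for the comparison.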

In [None]:
# Named stat_names to avoid shadowing the built-in vars().
stat_names = ['coherence', 'log_perplexity', 'time_lda', 'time_coherence', 'avg_arts_per_topic', 'std_arts_per_topic']
rows = 3
cols = 2
fig, axs = plt.subplots(rows, cols, sharex=True)

for i, stat in enumerate(stat_names):
    # Fill the grid row by row: i = 0, 1, 2, ... -> (0, 0), (0, 1), (1, 0), ...
    row, col = divmod(i, cols)
    sns.lineplot(data=df, x='n_topics', y=stat, ax=axs[row, col])
    sns.scatterplot(data=df, x='n_topics', y=stat, ax=axs[row, col])

fig.set_figheight(10)
fig.set_figwidth(10)

# The optimal number of topics seems to be 90

Let's reload the gridsearch results and study what happens around 90 topics:

In [None]:
with open("../data/gridsearch.json") as fp:
    g = json.load(fp)

In [None]:
g["90"]

In [None]:
for k in g:
    for v in g[k].values():
        if "-1" in v["n_articles_per_topic"]:
            print(f"Number of topics: {k}")
            print("Number of articles without 1st topic:")
            print(v["n_articles_per_topic"]["-1"])

From 100 topics onwards, 254 articles are systematically assigned to no topic at all. Weird!

With 90 topics, no article is left without a topic. Should we stick with it?

## Sticking with 90

Advice found online recommends saving our dictionary so that it does not introduce randomness in the future. I will also set the seed for the LDA.
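The point of saving the dictionary is that the mapping from tokens to integer ids stays identical between runs, so the bags of words fed to the LDA do not change. The sketch below shows the idea in plain Python with a JSON file. It is a stand-in, not the code used by `Model`: the helper names and the toy documents are assumptions. If the dictionary is a gensim `Dictionary`, its own `save()` and `load()` methods would play the same role.

```python
import json
import os
import tempfile


def build_vocabulary(documents):
    """Assign ids in sorted token order so the mapping is deterministic."""
    tokens = sorted({tok for doc in documents for tok in doc})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_bag_of_words(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = {}
    for tok in doc:
        if tok in vocab:
            counts[vocab[tok]] = counts.get(vocab[tok], 0) + 1
    return sorted(counts.items())


toy_docs = [["logic", "proof", "logic"], ["language", "proof"]]
vocab = build_vocabulary(toy_docs)

# Save the mapping once, then reload it in later sessions instead of rebuilding it.
vocab_path = os.path.join(tempfile.mkdtemp(), "vocabulary_sketch.json")
with open(vocab_path, "w") as fp:
    json.dump(vocab, fp)
with open(vocab_path) as fp:
    reloaded_vocab = json.load(fp)

bow = to_bag_of_words(toy_docs[0], reloaded_vocab)
print(bow)
```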

It has been 47,984 days since Wittgenstein was born (as of today, 09/09/20).

In [None]:
final_model = Model(corpus, 90)
final_model.train(seed = 47984)
final_model.save()

In [None]:
print(final_model.get_stats())