## LDA 3

# Fitting an LDA to our corpus

We will perform topic modeling using *Latent Dirichlet Allocation* (LDA). LDA is a *generative model* that learns a set of categories (or *topics*), each grouping words that tend to occur together across a corpus of documents. For a technical presentation of LDA, see [Appendix A](404).
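To make "generative model" concrete, here is a minimal sketch of the story LDA assumes for how a document is produced: draw a topic mixture for the document, then for each word position pick a topic and then a word from that topic. The toy vocabulary, the two hand-written topics and the helper names are all hypothetical. In actual LDA the topic-word distributions are themselves drawn from a Dirichlet prior and learned from data, rather than fixed by hand as they are here.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, doc_alpha, n_words, rng):
    """Generate one document following LDA's generative story."""
    # 1. Draw this document's mixture over topics.
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's distribution over the vocabulary.
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words

rng = random.Random(0)
vocab = ["gol", "equipo", "voto", "senado"]  # hypothetical toy vocabulary
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # hypothetical "sports" topic
    [0.0, 0.0, 0.5, 0.5],  # hypothetical "politics" topic
]
theta, doc = generate_document(topic_word, vocab, [0.5, 0.5], 8, rng)
print(theta)
print(doc)
```

Fitting an LDA runs this story in reverse: given only the documents, it estimates the topic-word distributions and each document's topic mixture.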

Let's start by loading our corpus:

In [1]:
import json
import re
import os
import sys 

# Jupyter Notebooks are not good at handling relative imports.
# Best solution (not great practice) is to add the project's path
# to the module loading paths of sys.

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from utils.loaders import loadCorpusList, saveCorpus

corpusPath = '../data/corpus'

corpusList = loadCorpusList(corpusPath)
corpusList = [a for a in corpusList if a.lang == "es"]

For `gensim`, we need to provide each document's bag of words as a list of strings. We therefore start with a generic function that takes an object of our article class and returns its bag of words as a list. It includes a hot fix for a problem in how the bag of words is generated: runs of consecutive spaces make `bagOfWords.split(" ")` produce empty strings. Note that the filter keeps only tokens longer than one character, so it also drops single-character tokens.

In [2]:
def prepare_bag_of_words(article):
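    """
    Return the article's bag of words as a list of tokens.

    Hot fix: repeated spaces make split(" ") yield empty strings, so we
    keep only tokens longer than one character (this also drops
    single-character tokens).
    """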
    bow = article.bagOfWords
    bow = bow.split(" ")
    return [w for w in bow if len(w) > 1]
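As a small illustration of the problem the hot fix works around, the sketch below uses a made-up string (not taken from our corpus) with repeated and trailing spaces. It also shows that calling `split()` with no argument splits on runs of whitespace, which avoids the empty strings in the first place.

```python
# Hypothetical bag of words with repeated and trailing spaces.
raw = "casa  perro   a gato "

# Splitting on a single space yields an empty string for each extra space.
with_empties = raw.split(" ")
print(with_empties)     # ['casa', '', 'perro', '', '', 'a', 'gato', '']

# Splitting with no argument collapses runs of whitespace.
without_empties = raw.split()
print(without_empties)  # ['casa', 'perro', 'a', 'gato']

# The length filter used in prepare_bag_of_words also removes "a".
tokens = [w for w in raw.split() if len(w) > 1]
print(tokens)           # ['casa', 'perro', 'gato']
```

Whether dropping single-character tokens is desirable depends on the corpus, so the length filter is kept separate from the whitespace handling here.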

In [3]:
from gensim import corpora

In [7]:
dictionary = corpora.Dictionary([
    prepare_bag_of_words(a) for a in corpusList
])
corpus = [dictionary.doc2bow(prepare_bag_of_words(a)) for a in corpusList]
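For intuition, the cell above maps every distinct token to an integer id and then represents each document as a list of `(token_id, count)` pairs. The following pure-Python sketch mimics that idea on two made-up documents. The helper names are hypothetical, and ids are assigned in first-seen order as a simplifying assumption, so `gensim` may number tokens differently.

```python
from collections import Counter

def build_token_ids(documents):
    """Assign an integer id to every distinct token, in first-seen order."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the known tokens in doc."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Two hypothetical tokenized documents.
toy_docs = [["gato", "perro", "gato"], ["perro", "casa"]]
token2id = build_token_ids(toy_docs)
toy_corpus = [to_bow(d, token2id) for d in toy_docs]
print(token2id)    # {'gato': 0, 'perro': 1, 'casa': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

This sparse representation discards word order, which is exactly the bag-of-words assumption LDA relies on.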

Using this dictionary and the bag-of-words corpus, we can fit an `LdaModel`:

In [8]:
from gensim.models.ldamodel import LdaModel

In [9]:
n_topics = 10
lda = LdaModel(corpus, num_topics=n_topics, id2word=dictionary, passes=15)

## Visualizing an LDA using pyLDAvis

One of the advantages of using `gensim` is `pyLDAvis`, a visualization utility that takes `gensim`'s objects (the `lda` model, the `corpus` and the `dictionary` we created above) and creates a visualization of the `lda`'s latent space and topic representation:

In [10]:
# In pyLDAvis 3.x this module was renamed to `pyLDAvis.gensim_models`.
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

We can save this model to disk. `gensim` writes several files that share the given prefix:

In [11]:
lda.save(f"LDA_gensim_{n_topics}.model")

In [12]:
ls

1-Preprocessing_Artifacts-and-Stopwords.ipynb
2-Preprocessing-Stopword-removal.ipynb
3-LDA-using-gensim.ipynb
3-LDA_Fitting.ipynb
LDA_gensim_10.model
LDA_gensim_10.model.expElogbeta.npy
LDA_gensim_10.model.id2word
LDA_gensim_10.model.state
LDA_k_20.jl
LDA_k_50.jl
models/
wordlists/


## Analyzing the Model Coherence

In [13]:
from gensim.models import CoherenceModel

In [14]:
texts = [
    prepare_bag_of_words(a) for a in corpusList
]

In [15]:
coherence_model_lda = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

Coherence Score:  0.38878617966222534


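This coherence score lets us search for the "best" `n_topics`. Let's print the coherence over a grid of values from 10 to 190 topics, in steps of 10. Notice that the score is sensitive to the random number generation used when fitting the `lda`: the two 10-topic models in this notebook score about 0.389 and 0.395. To make runs reproducible, a seed can be fixed through `LdaModel`'s `random_state` argument.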

In [None]:
coherence_per_topics = {}
for n_topics in range(10, 200, 10):
    lda = LdaModel(corpus, num_topics=n_topics, id2word=dictionary, passes=15)
    coherence_model_lda = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence='c_v')
    c = coherence_model_lda.get_coherence()
    coherence_per_topics[n_topics] = c
    print(f'Topics: {n_topics}. Coherence Score: {c}')

Topics: 10. Coherence Score: 0.3953344477170396
Topics: 20. Coherence Score: 0.41243755522172315
Topics: 30. Coherence Score: 0.3918896770429213
Topics: 40. Coherence Score: 0.4091358025355166
Topics: 50. Coherence Score: 0.4098096823426642
Topics: 60. Coherence Score: 0.4044007244335085
Topics: 70. Coherence Score: 0.41026726256297735
Topics: 80. Coherence Score: 0.4099083999675628
Topics: 90. Coherence Score: 0.4075532626033013
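Once the scores are collected, picking the candidate with the highest coherence is a one-liner. The sketch below copies the values printed above into a dictionary so that it runs on its own; in the notebook, `coherence_per_topics` already holds them. It also computes the spread between the best and worst score, as a rough sense of how much the choice of `n_topics` matters.

```python
# Coherence scores copied from the output printed above.
coherence_per_topics = {
    10: 0.3953344477170396,
    20: 0.41243755522172315,
    30: 0.3918896770429213,
    40: 0.4091358025355166,
    50: 0.4098096823426642,
    60: 0.4044007244335085,
    70: 0.41026726256297735,
    80: 0.4099083999675628,
    90: 0.4075532626033013,
}

# Number of topics with the highest coherence score.
best_n_topics = max(coherence_per_topics, key=coherence_per_topics.get)

# Difference between the best and the worst score in the grid.
scores = coherence_per_topics.values()
spread = max(scores) - min(scores)

print(f"Best n_topics: {best_n_topics}")
print(f"Spread across the grid: {spread:.4f}")
```

The spread across the grid is only about 0.02, while two runs with the same 10 topics already differ by roughly 0.007, so the ranking should be read with caution unless each setting is repeated with several seeds.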
