## Extras

# Parameter optimization using gensim

This notebook experiments with the topic coherence measures that `gensim` provides out of the box alongside its implementation of Latent Dirichlet Allocation (LDA).

In these experiments, we follow and adapt the code of two tutorials:
- [One on performing topic modeling using gensim](https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21).
- [Another one on topic coherence in gensim](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0).

## Importing our corpus

As in other notebooks, we start by importing our corpus:

In [17]:
import json
import re
import os
import sys 

# Jupyter Notebooks are not good at handling relative imports.
# Best solution (not great practice) is to add the project's path
# to the module loading paths of sys.

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from utils.loaders import loadCorpusList, saveCorpus

corpusPath = '../data/clean_json'

corpusList = loadCorpusList(corpusPath)
corpusList = [a for a in corpusList if a.lang == "es"]

`gensim` expects each document as a list of tokens (strings). We therefore start with a generic function that takes an object of our article class and returns its bag of words as a list. It includes a hot fix for a problem in how the bag of words is generated: consecutive spaces make `bagOfWords.split(" ")` return empty strings. The filter below removes them and, as a side effect, also drops single-character tokens.

In [19]:
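def prepare_bag_of_words(article):
    """
    Return the article's bag of words as a list of tokens.

    Hot fix: repeated spaces in `bagOfWords` produce empty strings
    when splitting, so tokens shorter than two characters are dropped.
    """
    bow = article.bagOfWords
    bow = bow.split(" ")
    return [w for w in bow if len(w) > 1]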
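
To see why the filter is needed, here is a minimal, self-contained illustration. The input string is made up for the example and is not taken from the corpus. Splitting on a single space yields one empty string per extra space, whereas `str.split()` with no argument collapses runs of whitespace but keeps one-character tokens.

```python
# Hypothetical bag-of-words string with a doubled and a trailing space.
raw = "ser  mundo y "

# Splitting on a single space keeps an empty string per extra space.
naive = raw.split(" ")

# Splitting on any run of whitespace discards the empty strings.
collapsed = raw.split()

# The notebook's filter: drops empty strings and one-character tokens.
filtered = [w for w in naive if len(w) > 1]

print(naive)      # ['ser', '', 'mundo', 'y', '']
print(collapsed)  # ['ser', 'mundo', 'y']
print(filtered)   # ['ser', 'mundo']
```

Whether one-character tokens should be kept depends on the corpus; the filter used in this notebook discards them.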

## Creating a dictionary and a corpus for `gensim`

In [21]:
from gensim import corpora

In [23]:
dictionary = corpora.Dictionary([
    prepare_bag_of_words(a) for a in corpusList
])

In [25]:
corpus = [dictionary.doc2bow(prepare_bag_of_words(a)) for a in corpusList]
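
As a rough picture of what these two cells produce, the sketch below mimics the idea in plain Python: every distinct token gets an integer id, and each document becomes a sorted list of `(token_id, count)` pairs. This is only an illustration on toy data. The names `token2id` and `to_bow` are made up here, and `gensim`'s own id assignment may differ.

```python
from collections import Counter

# Toy tokenized documents (hypothetical, not from the corpus).
docs = [["ser", "mundo", "ser"], ["mundo", "kant"]]

# Assign each distinct token an integer id, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the known tokens in doc."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_corpus = [to_bow(doc, token2id) for doc in docs]
print(token2id)    # {'ser': 0, 'mundo': 1, 'kant': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```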

TODO: save `dictionary` and `corpus` using pickle.
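
One possible way to address this is sketched below with the standard `pickle` module. The helper names `save_pickle` and `load_pickle`, the file name, and the toy data are assumptions made for the example; in the notebook, the same helpers would be called with the real `dictionary` and `corpus` objects.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path):
    """Load a pickled object from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for `corpus`: a list of documents as (token_id, count) pairs.
toy_corpus = [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]

# A temporary directory keeps the example free of side effects.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.pkl")  # hypothetical file name
    save_pickle(toy_corpus, path)
    restored = load_pickle(path)

print(restored == toy_corpus)  # True
```

As an alternative to pickle, `gensim` has its own persistence helpers, such as `dictionary.save(...)` with `corpora.Dictionary.load(...)`, and `corpora.MmCorpus.serialize(...)` for the corpus.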

## Fitting an LDA model using gensim

In [26]:
from gensim.models.ldamodel import LdaModel

In [27]:
n_topics = 10
lda = LdaModel(corpus, num_topics=n_topics, id2word=dictionary, passes=15)

In [28]:
topics = lda.print_topics(num_words=10)
for topic in topics:
    print(topic)

(0, '0.015*"ser" + 0.010*"mundo" + 0.005*"heidegger" + 0.005*"modo" + 0.004*"formar" + 0.004*"interpretación" + 0.004*"bien" + 0.004*"término" + 0.004*"teoría" + 0.004*"propiedad"')
(1, '0.017*"kant" + 0.011*"ser" + 0.007*"libertar" + 0.007*"moral" + 0.007*"razón" + 0.007*"concepto" + 0.006*"derecho" + 0.006*"ley" + 0.005*"bien" + 0.004*"práctico"')
(2, '0.004*"percepción" + 0.004*"cavar" + 0.004*"ser" + 0.003*"virtud" + 0.003*"empírico" + 0.003*"concienciar" + 0.003*"propiedad" + 0.003*"concepto" + 0.003*"em" + 0.003*"noë"')
(3, '0.009*"formar" + 0.008*"hegel" + 0.008*"ser" + 0.007*"político" + 0.006*"foucault" + 0.005*"vida" + 0.004*"bien" + 0.003*"lógico" + 0.003*"modo" + 0.003*"crítico"')
(4, '0.010*"ser" + 0.005*"social" + 0.005*"bien" + 0.004*"formar" + 0.004*"acción" + 0.004*"modo" + 0.004*"desear" + 0.003*"relación" + 0.003*"político" + 0.003*"vida"')
(5, '0.011*"ser" + 0.006*"vida" + 0.006*"concienciar" + 0.005*"mundo" + 0.004*"bien" + 0.004*"heidegger" + 0.004*"husserl" + 0.0

## Visualizing an LDA using pyLDAvis

One advantage of using `gensim` is that it works with `pyLDAvis`, a visualization utility that takes `gensim`'s objects (the `lda` model, the `corpus` and the `dictionary` we created above) and builds an interactive view of the model's topics: how far apart they lie from each other and which terms characterize each one.

In [30]:
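# Note: in recent pyLDAvis releases this module is named
# `pyLDAvis.gensim_models`; adjust the import if it fails.
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)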

## Topic coherence using gensim

`gensim` itself provides a `CoherenceModel` class that computes topic coherence, a metric we can use to evaluate an LDA model.

In [31]:
from gensim.models import CoherenceModel

In [32]:
texts = [
    prepare_bag_of_words(a) for a in corpusList
]

In [29]:
coherence_model_lda = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)


Coherence Score:  0.33797822156558943


This coherence score lets us search for the "best" `n_topics`. Let's print the coherence for a simple sweep from 10 to 150 topics in steps of 5. Note that the score is sensitive to the random number generation used when fitting the `lda`: the 10-topic model scored about 0.338 above but about 0.353 in the sweep below. Passing a fixed `random_state` to `LdaModel` makes runs reproducible.

In [None]:
coherence_per_topics = {}
for n_topics in range(10, 151, 5):
    lda = LdaModel(corpus, num_topics=n_topics, id2word=dictionary, passes=15)
    coherence_model_lda = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence='c_v')
    c = coherence_model_lda.get_coherence()
    coherence_per_topics[n_topics] = c
    print(f'Topics: {n_topics}. Coherence Score: {c}')

Topics: 10. Coherence Score: 0.3529584097627566
Topics: 15. Coherence Score: 0.3384085599446584
Topics: 20. Coherence Score: 0.33025337914520897
Topics: 25. Coherence Score: 0.3435222242397539
Topics: 30. Coherence Score: 0.3284817585586472
Topics: 35. Coherence Score: 0.3379262882371989
Topics: 40. Coherence Score: 0.3654173467844408
Topics: 45. Coherence Score: 0.35818268356111255
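
Given a dictionary like `coherence_per_topics`, the candidate with the highest coherence can be picked as in the sketch below. It uses only the scores printed above, so it covers 10 to 45 topics, and the variable names `ranked`, `best_n` and `spread` are introduced here for illustration.

```python
# Scores copied from the output printed above (10 to 45 topics only).
coherence_per_topics = {
    10: 0.3529584097627566,
    15: 0.3384085599446584,
    20: 0.33025337914520897,
    25: 0.3435222242397539,
    30: 0.3284817585586472,
    35: 0.3379262882371989,
    40: 0.3654173467844408,
    45: 0.35818268356111255,
}

# Rank the candidates from highest to lowest coherence.
ranked = sorted(coherence_per_topics.items(), key=lambda kv: kv[1], reverse=True)
best_n, best_score = ranked[0]

# Difference between the best and the worst score in the sweep.
spread = best_score - ranked[-1][1]

print(f"Best n_topics: {best_n} (coherence {best_score:.4f})")
print(f"Range across the sweep: {spread:.4f}")
```

The range across these values (about 0.037) is of the same order as the run-to-run difference seen for 10 topics (about 0.015), so a single sweep is weak evidence for any particular `n_topics`.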


**TODO**: analyze these scores statistically, and include other hyperparameters in the search.
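
A minimal sketch of what such an analysis could look like, assuming the model is fitted several times per `n_topics` and the scores are collected in lists. The function `summarize_runs` is hypothetical. The only repeated measurement available in this notebook is the pair of 10-topic scores, so that is what the example uses.

```python
from statistics import mean, stdev


def summarize_runs(scores_by_n_topics):
    """Map n_topics to (mean, sample stdev); stdev is None for a single run."""
    summary = {}
    for n_topics, scores in scores_by_n_topics.items():
        sd = stdev(scores) if len(scores) > 1 else None
        summary[n_topics] = (mean(scores), sd)
    return summary


# Scores taken from this notebook: two fits with 10 topics, one with 40.
runs = {
    10: [0.33797822156558943, 0.3529584097627566],
    40: [0.3654173467844408],
}

summary = summarize_runs(runs)
for n_topics, (avg, sd) in summary.items():
    sd_text = "n/a" if sd is None else f"{sd:.4f}"
    print(f"Topics: {n_topics}. Mean coherence: {avg:.4f}. Std: {sd_text}")
```

With more runs per setting, the means and standard deviations give a fairer basis for comparing values of `n_topics` than a single score each.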