## LDA 3

# Fitting an LDA to our corpus

We will perform topic modeling with *Latent Dirichlet Allocation* (LDA). LDA is a *generative model* that learns a set of *topics*, each of which is a group of words that tend to occur together across the documents of a corpus. For a technical presentation of LDA, see [Appendix A](404).
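To make the word "generative" concrete, here is a minimal sketch of the sampling process that LDA assumes produced each document: draw a topic mixture for the document, then, for each word position, draw a topic and then a word from that topic. The vocabulary, the two topic distributions, and the value of `alpha` are made-up illustrative values, not quantities learned from our corpus. Fitting an LDA goes in the opposite direction and infers such distributions from observed word counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 2 topics over a 5-word vocabulary.
vocabulary = ["razón", "moral", "kant", "mundo", "vida"]

# Each row is one topic's probability distribution over the vocabulary.
topic_word = np.array([
    [0.40, 0.30, 0.25, 0.03, 0.02],
    [0.02, 0.03, 0.05, 0.45, 0.45],
])

# Symmetric Dirichlet prior over the topics (assumed value).
alpha = np.full(2, 0.5)


def sample_document(n_words):
    """Sample one document from the LDA generative process."""
    # 1. Draw this document's topic mixture from the Dirichlet prior.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # 2. Draw a topic for this word position from the mixture.
        z = rng.choice(len(theta), p=theta)
        # 3. Draw a word from the chosen topic's word distribution.
        w = rng.choice(len(vocabulary), p=topic_word[z])
        words.append(vocabulary[w])
    return theta, words


theta, words = sample_document(20)
print(theta.round(2), " ".join(words))
```

Note that word order plays no role in this process, which is why a bag-of-words representation is sufficient as input.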

Let's start by loading our corpus:

In [28]:
import json
import re
import os
import sys 

# Jupyter notebooks do not handle relative imports well.
# A common workaround (not great practice) is to add the project's
# path to sys.path so that its modules can be imported.

module_path = os.path.abspath('..')
if module_path not in sys.path:
    sys.path.append(module_path)

from utils.loaders import loadCorpusList, saveCorpus

corpusPath = '../data/clean_json'

corpusList = loadCorpusList(corpusPath)
# Keep only the documents whose language is tagged as Spanish.
corpusList = [a for a in corpusList if a.lang == "es"]

In [29]:
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation

## Creating the document-term matrix

LDA takes as input a bag-of-words representation of each document. In this representation, we build a matrix in which each row is a document and each column represents a word (lemmatized and lowercased in our analysis). Thus, if the matrix is called $A$, the entry $A_{ij}$ is the number of times word $j$ appears in document $i$. Because rows index documents and columns index words, this is a document-term matrix.

Thankfully, `scikit-learn` provides a simple way to construct this matrix.
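As a small illustration of the matrix $A$ described above, the sketch below builds it for three toy documents with `CountVectorizer`. The documents and the names `toy_documents`, `toy_vec`, and `vocab` are hypothetical and are not part of our corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, already lemmatized and lowercased.
toy_documents = [
    "razón moral razón",
    "moral mundo",
    "mundo mundo vida",
]

toy_vec = CountVectorizer()
# Sparse matrix with one row per document and one column per word.
A = toy_vec.fit_transform(toy_documents)

# vocabulary_ maps each word to its column index j.
vocab = toy_vec.vocabulary_
print(sorted(vocab, key=vocab.get))  # words in column order
print(A.toarray())                   # dense view, only sensible for tiny examples
```

Here `A[0, vocab["razón"]]` is 2, because `razón` appears twice in the first document. The matrix is stored in sparse form because most words are absent from most documents.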

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

In [31]:
documents = [a.bagOfWords for a in corpusList]

In [32]:
documents[0]

'   introducción querer comenzar disertación anécdota ser significativo temer convocar colega filósofo oportunidad participar asistente congreso religión organizar pontificio universidad javeriana par año atrás preguntarle temer tratar congreso concepto religioso distinto credo antiguo moderno institucional personal espiritualidad religioso etcétera   sorprender congreso dedicar ponencia temer ateísmo causar instintivamente salir boca consideración “ congreso religión temer ateísmo ausentar congreso físico antimateria ” texto formar paliar mencionar problema poner ateísmo temer suplementar filosofía religión bien considerar ateísmo religión estilar precisamente ser postura religión relevancia debatir temer religioso    consideración previo religioso frente ateo temer religioso dividir sociedad necesariamente creyente creyente ocasionar diferenciar personar compartir credo igualar interpretación librar sagrado parecer enconar ateo creyente   temer tratar escribir interesarme relación at

In [33]:
# min_df=10 drops every word that appears in fewer than 10 documents.
vec = CountVectorizer(min_df=10)
X = vec.fit_transform(documents)
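The `min_df` parameter counts documents, not total occurrences: an integer value keeps only the words that appear in at least that many documents. The sketch below shows the effect on a hypothetical three-document example with `min_df=2`. The toy documents and variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents.
toy_documents = [
    "razón moral",
    "razón mundo",
    "razón moral vida",
]

# Without min_df, every word enters the vocabulary.
full = CountVectorizer().fit(toy_documents)

# With min_df=2, words found in fewer than 2 documents are discarded.
pruned = CountVectorizer(min_df=2).fit(toy_documents)

print(sorted(full.vocabulary_))
print(sorted(pruned.vocabulary_))
```

In this example `mundo` and `vida` each occur in a single document, so they are removed from the pruned vocabulary. On a real corpus this kind of pruning discards rare words, which shrinks the matrix considerably.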

In [34]:
from operator import itemgetter

In [35]:
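# Total occurrences of each vocabulary word (column sums of X).
# get_feature_names_out() requires scikit-learn >= 1.0.
counts = {
    word: count
    for word, count in zip(vec.get_feature_names_out().tolist(), X.sum(axis=0).tolist()[0])
}
# Sort the (word, count) pairs from most to least frequent.
sorted_counts = sorted(counts.items(), key=itemgetter(1), reverse=True)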

for i, wordcount in enumerate(sorted_counts[:50]):
    print(f"{i+1}, {wordcount}")

1, ('ser', 12803)
2, ('formar', 5931)
3, ('bien', 5433)
4, ('mundo', 5186)
5, ('político', 4857)
6, ('filosofía', 4736)
7, ('modo', 4654)
8, ('vida', 3963)
9, ('moral', 3904)
10, ('ideo', 3900)
11, ('teoría', 3787)
12, ('concepto', 3690)
13, ('razón', 3433)
14, ('relación', 3274)
15, ('casar', 3247)
16, ('hombre', 3233)
17, ('kant', 3168)
18, ('punto', 3067)
19, ('presentar', 3040)
20, ('problema', 3002)
21, ('resultar', 2969)
22, ('experiencia', 2964)
23, ('término', 2962)
24, ('pensar', 2941)
25, ('obrar', 2897)
26, ('naturaleza', 2845)
27, ('crítico', 2813)
28, ('pensamiento', 2723)
29, ('autor', 2694)
30, ('humano', 2662)
31, ('acción', 2629)
32, ('conocimiento', 2562)
33, ('práctico', 2519)
34, ('mostrar', 2483)
35, ('tipo', 2477)
36, ('derecho', 2460)
37, ('objetar', 2451)
38, ('deber', 2388)
39, ('posibilidad', 2375)
40, ('concienciar', 2364)
41, ('permitir', 2345)
42, ('social', 2313)
43, ('diferenciar', 2290)
44, ('creencia', 2272)
45, ('realidad', 2259)
46, ('determinar', 223