# Topic Exploration

This notebook is an interface to the trained LDA model. It relies on several utility functions written specifically to explore what each topic contains.

Let's start by importing the plotting libraries, then load the corpus and the LDA model:

In [1]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

In [3]:
from utils.model import Model
from utils.corpus import Corpus

corpus = Corpus(registry_path = 'utils/article_registry.json')
model = Model(corpus, num_topics=90)

Loading corpus. Num. of articles: 877


In [4]:
model.load()

In [5]:
model.ldaseq

<utils.dtmmodel.DtmModel at 0x7fac8d444040>

The `Model` object exposes a list of `Topic` objects (`model.topics`). We will use methods of both classes to navigate the trained model.

## Getting the top words in a topic

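The `Topic.get_top_words()` method returns the top `n` words and their probabilities. The cells below query the underlying `model.ldaseq.show_topic()` directly to collect the top 10 words of every topic in each of the 15 time slices, and write the resulting tables to a Markdown file: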

In [6]:
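# Build one Markdown table per topic: columns are time slices, rows are word ranks.
evolution = ""
for topic in model.topics:
    topic_list = []
    for time in range(15):  # 15 time slices
        time_slice = []
        # show_topic() returns (probability, word) pairs; keep only the words
        for prob, word in model.ldaseq.show_topic(topicid=topic.id, time=time, topn=10):
            time_slice.append(word)
        topic_list.append(time_slice)
    # Transpose so each column is one time slice, then separate topics with a rule
    evolution += pd.DataFrame(topic_list).T.to_markdown(index=False) + "\n\n---\n\n"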

In [7]:
with open('topic_evolution.md', 'w') as fp:
    fp.write(evolution)
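
The `.T` in the loop above transposes the table so that each column is a time slice and each row is a word rank. Here is a minimal, dependency-free sketch of that reshaping. The words are made up and `to_markdown_table` is a hypothetical helper for illustration, not part of the `utils` package:

```python
def to_markdown_table(columns):
    """Render a list of columns (one per time slice) as a Markdown table.

    Row r of the table holds the r-th ranked word of every time slice.
    """
    header = "| " + " | ".join(str(i) for i in range(len(columns))) + " |"
    divider = "|" + "|".join(":--" for _ in columns) + "|"
    # zip(*columns) transposes: it yields one tuple per rank.
    rows = ["| " + " | ".join(row) + " |" for row in zip(*columns)]
    return "\n".join([header, divider] + rows)


# Illustrative data: top-3 words of one topic in two time slices.
time_slices = [
    ["moral", "razón", "idea"],  # time slice 0
    ["moral", "kant", "razón"],  # time slice 1
]
table = to_markdown_table(time_slices)
print(table)
```

Reading across a row shows how the word at a given rank changes from one time slice to the next.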

In [8]:
import numpy as np

In [9]:
model.ldaseq.show_topic(topicid=2, time=0, topn=10)

[(4.046617028164455e-05, 'poéticamente'),
 (4.046617028164455e-05, 'precede'),
 (4.046617028164455e-05, 'plegaria'),
 (4.046617028164455e-05, 'pop'),
 (4.046617028164455e-05, 'populacho'),
 (4.046617028164455e-05, 'parición'),
 (4.046617028164455e-05, 'premeditación'),
 (4.046617028164455e-05, 'prefigurar'),
 (4.046617028164455e-05, 'otorgado'),
 (4.046617028164455e-05, 'precedentemente')]

In [10]:
from collections import defaultdict

In [11]:
for doc in corpus.documents:
    doc.bin = (int(doc.date[:4]) - 1951) // 5
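
To make the binning in the cell above concrete, here is a small worked sketch of the same integer arithmetic. The `year_to_bin` and `bin_range` helpers and the sample date strings are assumptions for illustration; only the `(year - 1951) // 5` formula comes from the notebook:

```python
BASE_YEAR = 1951  # first year of bin 0, as in the cell above
BIN_WIDTH = 5     # years per bin


def year_to_bin(date):
    """Map a date string that starts with a four-digit year to its bin index."""
    return (int(date[:4]) - BASE_YEAR) // BIN_WIDTH


def bin_range(bin_index):
    """Return the first and last year covered by a bin (inclusive)."""
    start = BASE_YEAR + bin_index * BIN_WIDTH
    return start, start + BIN_WIDTH - 1


for date in ["1951/03/01", "1955/12/31", "1956/01/01", "2021/12/15"]:
    b = year_to_bin(date)
    print(date, "-> bin", b, "covering", bin_range(b))
```

Under this scheme an article dated 2021 falls in bin 14, so bins 0 to 14 line up with the 15 time slices iterated over earlier.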

In [12]:
data = {topic: defaultdict(int) for topic in range(model.num_topics)}

for i, doc in enumerate(corpus.documents):
    topic = np.argmax(model.ldaseq.gamma_[i])
    if model.ldaseq.gamma_[i][topic] > 0.25:
        data[topic][doc.bin] += 1
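
The cell above assigns each document to its single strongest topic and counts it only if that topic's weight exceeds 0.25. The following plain-Python sketch shows the same rule on made-up weights. `dominant_topic` is a hypothetical helper, and the `max(...)` expression stands in for `np.argmax`:

```python
from collections import defaultdict

THRESHOLD = 0.25  # same cut-off as in the cell above


def dominant_topic(weights, threshold=THRESHOLD):
    """Return the index of the largest weight, or None if it is not above threshold."""
    # Plain-Python argmax; like np.argmax, ties resolve to the first maximum.
    best = max(range(len(weights)), key=weights.__getitem__)
    return best if weights[best] > threshold else None


# Made-up document-topic weights (one row per document) and their time bins.
gamma = [
    [0.70, 0.20, 0.10, 0.00],  # clearly topic 0
    [0.10, 0.40, 0.40, 0.10],  # tie between topics 1 and 2: topic 1 wins
    [0.25, 0.25, 0.25, 0.25],  # not strictly above the threshold: skipped
]
bins = [0, 0, 1]

counts = defaultdict(lambda: defaultdict(int))
for weights, time_bin in zip(gamma, bins):
    topic = dominant_topic(weights)
    if topic is not None:
        counts[topic][time_bin] += 1

print({topic: dict(per_bin) for topic, per_bin in counts.items()})
```

Note that a document with no topic above the threshold is not counted anywhere, which is one reason a topic can be missing from the resulting counts.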

In [13]:
# Flatten the nested counts into (topic, bin, count) rows
new_data = []
for topic, vals in data.items():
    for bin_index, count in vals.items():
        new_data.append((topic, bin_index, count))

In [14]:
df = pd.DataFrame(new_data, columns = ['Topic', 'Bin', 'Count'])
df['Year'] = (df['Bin'] * 5) + 1952

In [17]:
df.groupby('Topic').sum()

| Topic | Bin | Count |  Year |
|------:|----:|------:|------:|
|    10 |  24 |     6 |  7928 |
|    15 |  90 |    66 | 18018 |
|    16 |  67 |    21 | 15951 |
|    17 |  36 |     5 |  6036 |
|    18 |  34 |     7 |  7978 |
|    19 |  86 |    38 | 17998 |
|    28 |   3 |     1 |  1967 |
|    33 |  13 |     1 |  2017 |
|    43 |  89 |    37 | 18013 |
|    49 |  13 |     1 |  2017 |


In [15]:
import altair as alt

alt.Chart(df).mark_area().encode(
    alt.X('Year:N'),
    alt.Y('sum(Count)', stack = 'center', axis = None),
    alt.Color('Topic', scale=alt.Scale(scheme='category20b'))
).properties(width=800)

## Getting all articles in a topic

You can get the article IDs in a topic, along with their probabilities, using the `Topic.get_top_articles()` method. Pass `n` to cap the number of results; the default is `n=5`.

In [26]:
model.topics[3].get_top_articles()

[('66898', 0.8983381390571594),
 ('48173', 0.8065474629402161),
 ('57601', 0.39241376519203186),
 ('8831', 0.009229096584022045)]
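
For intuition, a top-`n` lookup of this kind can be sketched as follows. This is not the implementation in `utils`: `top_articles`, its `n=None` behavior, and the sample data are all assumptions made for illustration.

```python
import heapq


def top_articles(article_probs, n=5):
    """Return up to n (article_id, probability) pairs, highest probability first.

    Passing n=None returns every article, sorted.
    """
    pairs = article_probs.items()
    if n is None:
        return sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


# Made-up probabilities of belonging to one topic, keyed by article ID.
probs = {"a1": 0.05, "a2": 0.90, "a3": 0.40, "a4": 0.81}
print(top_articles(probs, n=2))
```

When fewer than `n` articles are available, the sketch simply returns all of them, which is consistent with the four results shown above for the default `n=5`.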

We can also get the titles of a topic's top articles with `Topic.get_top_titles()`.

In [27]:
model.topics[5].get_top_titles()

['Leonardo Ivarola (2019/09/01). Consecuencias alternativas y asimetría de resultados en la implementación de políticas socioeconómicas',
 'José Manuel Chillón (2019/01/01). Heidegger y la prudencia aristotélica como protofenomenología',
 'Matías Abeijón (2019/01/01). Historia, estructura y experiencia. Relaciones metodológicas entre Michel Foucault y Georges Dumézil',
 'Jorge Aurelio Díaz (2021/12/15). Impacto de las políticas gubernamentales en Ideas y Valores',
 'José H. Silveira De Brito (2012/01/01). HUMANIZAÇÃO DA SAÚDE: DA INTENÇÃO À INTELIGÊNCIA EMOTIVA PELAS IDEIAS']

## Summarize an article

If you have an `Article` object, you can pass it to the `Model.get_topics_in_article(Article)` method to get a summary of the topics the article most likely belongs to.

In [22]:
model.get_topics_in_article(model.articles[3])

## Summarizing an entire topic

For this, we have implemented `Topic.summary()`, which returns a Markdown string listing the topic's top words and top articles.

In [6]:
from IPython.display import Markdown as md


md(model.topics[0].summary())

# Topic 0

## Top words:
|    | Word       |   Probability |
|---:|:-----------|--------------:|
|  0 | smith      |         0.01  |
|  1 | él         |         0.01  |
|  2 | moral      |         0.008 |
|  3 | naturaleza |         0.006 |
|  4 | objeto     |         0.006 |
|  5 | humano     |         0.005 |
|  6 | razón      |         0.005 |
|  7 | idea       |         0.005 |
|  8 | kant       |         0.005 |
|  9 | hombre     |         0.004 |
## Top articles:

* Rosa Colmenarejo (2016/01/01). Enfoque de capacidades y sostenibilidad. Aportaciones de Amartya Sen y Martha Nussbaum
* José de la Cruz Garrido (2015/09/01). El papel de la imaginación en la refutación de Adam Smith a la tesis del homo economicus
* Nicolás Novoa Artigas (2016/01/01). La problemática posición de Adam Smith acerca de la suerte moral
* Martín Fleitas González (2015/09/01). ¿Solo hay realismo o constructivismo moral dentro del neokantismo contemporáneo? Notas para una fundamentación moral kantiana con base en la idea de libertad
* Emilse Galvis (2016/01/01). La subjetivación política más allá de la esfera pública: Michel Foucault, Jacques Rancière y Simone Weil
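
A summary with the layout shown above could be assembled roughly as in the sketch below. `render_summary` is a hypothetical function and the inputs are placeholders; the real `Topic.summary()` may be built differently.

```python
def render_summary(topic_id, top_words, top_titles):
    """Build a Markdown summary from (word, probability) pairs and article titles."""
    lines = [
        f"# Topic {topic_id}",
        "",
        "## Top words:",
        "| Word | Probability |",
        "|:-----|------------:|",
    ]
    # One table row per word, with the probability rounded to three decimals.
    lines += [f"| {word} | {prob:.3f} |" for word, prob in top_words]
    lines += ["", "## Top articles:", ""]
    # One bullet per article title.
    lines += [f"* {title}" for title in top_titles]
    return "\n".join(lines)


summary = render_summary(
    0,
    [("smith", 0.010), ("moral", 0.008)],
    ["Placeholder Author (2016/01/01). Placeholder title"],
)
print(summary)
```

Returning a plain string keeps the function independent of the notebook: the same text can be rendered with `IPython.display.Markdown` or written to a file.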


We can export the summaries of all topics in the model to a single PDF with the `Model.export_summary()` method. The call below writes them to `summary.pdf`.

In [7]:
model.export_summary("summary.pdf")