#### Latent Dirichlet Allocation (LDA) Topic Modeling

This notebook applies Latent Dirichlet Allocation (LDA), a technique for discovering the abstract "topics" that occur in a collection of texts.
 * LDA is usually used to categorize whole documents; here it categorizes individual paragraphs of the minutes.
 * The script applies LDA (gensim) to identify recurring themes across the minutes.
 * The number of topics (num_topics) is set to 6. This choice is justified in the appendix, notebook 5.1.
 * The script also displays the most significant words for each identified topic.

In [1]:
from gensim import corpora
from gensim.models import LdaModel
import pandas as pd

In [None]:
# Input folders: one .txt file per minute, one paragraph per line
FOLDER_MINUTES_LEMMATIZED = "./data/processed/copom_minutes_lemmatized"
FOLDER_MINUTES_NOT_LEMMATIZED = "./data/processed/copom_minutes_not_lemmatized"
# Only minutes dated on or after this day are analyzed
INITIAL_DATE = "2003-06-26"

In [3]:
minutes_info = pd.read_excel("./data/raw/minutes_info.xlsx")
minutes_info['DataReferencia'] = pd.to_datetime(minutes_info['DataReferencia'])
minutes_info = minutes_info[minutes_info["DataReferencia"] >= INITIAL_DATE]

minutes_names = minutes_info["Titulo"].to_list()

In [4]:
all_docs_with_metadata = []
all_docs_for_lda = []

# Paragraphs with this many lemmatized tokens or fewer are discarded
MIN_TOKENS = 5

for minute in minutes_names:
    lemm_minute_path = f"{FOLDER_MINUTES_LEMMATIZED}/{minute}.txt"
    not_lemm_minute_path = f"{FOLDER_MINUTES_NOT_LEMMATIZED}/{minute}.txt"

    # Each line of a file is one paragraph; split it into tokens
    with open(lemm_minute_path, 'r', encoding='utf-8') as f:
        lemm_paragraphs = [line.split() for line in f]

    with open(not_lemm_minute_path, 'r', encoding='utf-8') as f:
        not_lemm_paragraphs = [line.split() for line in f]

    # Both versions must align paragraph by paragraph; skip the minute otherwise
    if len(lemm_paragraphs) != len(not_lemm_paragraphs):
        print(f"Error in: {minute}")
        continue

    for lemm, not_lemm in zip(lemm_paragraphs, not_lemm_paragraphs):
        if len(lemm) > MIN_TOKENS:
            all_docs_with_metadata.append({'original_text': not_lemm, 'lemm_text': lemm, 'minute': minute})
            all_docs_for_lda.append(lemm)

In [5]:
dictionary = corpora.Dictionary(all_docs_for_lda)

corpus = [dictionary.doc2bow(doc) for doc in all_docs_for_lda]
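
To make the bag-of-words step concrete, the sketch below reproduces the idea in plain Python on two toy paragraphs. The helper names `build_vocabulary` and `to_bow` and the toy tokens are illustrative assumptions; this is not how gensim is implemented, and the integer ids gensim assigns to tokens may differ.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Return a sorted list of (token_id, count) pairs for one tokenized paragraph."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Toy lemmatized paragraphs (hypothetical, not taken from the minutes)
toy_docs = [
    ["inflation", "rate", "increase", "inflation"],
    ["credit", "operation", "increase"],
]

vocab = build_vocabulary(toy_docs)
toy_corpus = [to_bow(doc, vocab) for doc in toy_docs]
print(toy_corpus)
```

Each paragraph becomes a sparse list of (token id, count) pairs, which is the same shape of input that `LdaModel` receives through `corpus`.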

In [6]:
NUM_TOPICS = 6

lda_model = LdaModel(corpus=corpus,
                     id2word=dictionary,
                     num_topics=NUM_TOPICS,
                     random_state=100,
                     passes=15)

In [7]:
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)

(0, '0.031*"employment" + 0.029*"increase" + 0.021*"real" + 0.020*"accord" + 0.019*"rate" + 0.019*"job" + 0.018*"month" + 0.018*"year" + 0.018*"compare" + 0.017*"thousand"')
(1, '0.065*"rate" + 0.041*"inflation" + 0.034*"price" + 0.025*"projection" + 0.024*"meeting" + 0.022*"scenario" + 0.021*"copom" + 0.018*"exchange" + 0.017*"increase" + 0.017*"target"')
(2, '0.068*"billion" + 0.053*"u" + 0.030*"operation" + 0.023*"total" + 0.023*"credit" + 0.022*"reach" + 0.020*"month" + 0.019*"average" + 0.018*"increase" + 0.017*"export"')
(3, '0.018*"growth" + 0.017*"economy" + 0.014*"economic" + 0.013*"market" + 0.012*"demand" + 0.011*"activity" + 0.010*"domestic" + 0.010*"international" + 0.009*"high" + 0.009*"recovery"')
(4, '0.042*"increase" + 0.028*"month" + 0.027*"price" + 0.020*"good" + 0.015*"production" + 0.014*"industrial" + 0.014*"compare" + 0.013*"inflation" + 0.013*"sale" + 0.012*"index"')
(5, '0.057*"inflation" + 0.034*"monetary" + 0.028*"policy" + 0.023*"copom" + 0.018*"committee" +

In [9]:
# 5. Organize the results in a DataFrame
def get_dominant_topic(doc_bow, lda_model):
    # get_document_topics returns a list of (topic_id, probability) pairs
    topic_dist = lda_model.get_document_topics(doc_bow)
    # Keep the id of the topic with the highest probability
    dominant_topic = max(topic_dist, key=lambda x: x[1])[0]
    return dominant_topic
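
As a small illustration of the selection performed by `get_dominant_topic`, the sketch below picks the most probable topic from a hand-written distribution. The probabilities and the name `pick_dominant_topic` are made up for the example; in the notebook the list of (topic_id, probability) pairs comes from `lda_model.get_document_topics`, which may omit topics with very low probability.

```python
def pick_dominant_topic(topic_dist):
    """Return the topic id with the highest probability in a list of (id, prob) pairs."""
    return max(topic_dist, key=lambda pair: pair[1])[0]

# Hypothetical topic distribution for one paragraph
example_dist = [(1, 0.12), (3, 0.71), (5, 0.17)]

dominant = pick_dominant_topic(example_dist)
print(dominant)
```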

In [11]:
results = []
# Iterate over the list that holds the metadata
for i, doc_info in enumerate(all_docs_with_metadata):
    doc_bow = corpus[i] # Get the matching bag-of-words by index
    dominant_topic = get_dominant_topic(doc_bow, lda_model)
    results.append({
        'minute': doc_info['minute'], # Name of the source minute
        'original_text': ' '.join(doc_info['original_text']),
        'lemm_text': ' '.join(doc_info['lemm_text']),
        'dominant_topic': dominant_topic
    })

df_results = pd.DataFrame(results)
df_results.to_excel('./data/processed/lda_results.xlsx', index=False)
print(df_results.head())

                             minute  \
0  271st Meeting - June 17-18, 2025   
1  271st Meeting - June 17-18, 2025   
2  271st Meeting - June 17-18, 2025   
3  271st Meeting - June 17-18, 2025   
4  271st Meeting - June 17-18, 2025   

                                       original_text  \
0  1. The global environment remains adverse and ...   
1  2. In addition, the behavior and the volatilit...   
2  3. Regarding the domestic scenario, the set of...   
3  4. In recent releases, headline inflation and ...   
4  5. The inflation outlook remains challenging i...   

                                           lemm_text  dominant_topic  
0  global environment remain adverse particularly...               3  
1  addition behavior volatility different asset c...               3  
2  regard domestic scenario set indicator economi...               3  
3  recent release headline inflation measure unde...               1  
4  inflation outlook remain challenge several dim...               5  
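
One possible follow-up is to tally the dominant topics to see how paragraphs spread across them. The sketch below uses only the five rows printed above; the variable names are illustrative and not part of the notebook.

```python
from collections import Counter

# Dominant topics of the five paragraphs shown in the preview above
preview_topics = [3, 3, 3, 1, 5]

# Count how many paragraphs fall under each topic
topic_counts = Counter(preview_topics)
for topic_id, n_paragraphs in sorted(topic_counts.items()):
    print(f"topic {topic_id}: {n_paragraphs} paragraph(s)")
```

On the full table, `df_results['dominant_topic'].value_counts()` should give the same kind of tally.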
