# Retrieval-Augmented Generation for the Brazilian Financial Market
Author: Lucas Iuri dos Santos

## Data Acquisition

We begin with data acquisition, which consists of downloading several PDF reports from the Brazilian Securities and Exchange Commission (Comissão de Valores Mobiliários, CVM), the federal agency that regulates and supervises the securities and capital markets in Brazil.

We access the Commission's website and download all the management reports, which are in Portuguese.

In [1]:
from data_acquisition import download_pdfs

url = 'https://www.gov.br/cvm/pt-br/centrais-de-conteudo/publicacoes/relatorios/relatorio-de-gestao-da-cvm'
download_pdfs(url)

Accessing the URL: https://www.gov.br/cvm/pt-br/centrais-de-conteudo/publicacoes/relatorios/relatorio-de-gestao-da-cvm

 Found 22 report pages links. Checking them one by one...

-> Accessing report page: https://www.gov.br/cvm/pt-br/acesso-a-informacao-cvm/auditorias/relatorios-de-orgaos-de-controle/2024/relatorio-iesgo2024-cvm-tcu.pdf/view
-> Downloading "relatorio-iesgo2024-cvm-tcu.pdf"...
-> Successfully saved in data_cvm/

-> Accessing report page: https://www.gov.br/cvm/pt-br/centrais-de-conteudo/publicacoes/relatorios/relatorio-de-gestao-da-cvm/relatorio-gestao-2024.pdf/view
-> Downloading "relatorio-gestao-2024.pdf"...
-> Successfully saved in data_cvm/

-> Accessing report page: https://www.gov.br/cvm/pt-br/centrais-de-conteudo/publicacoes/relatorios/relatorio-de-gestao-da-cvm/relatorio-de-gestao-cvm-2023-versao-final-1.pdf/view
-> Downloading "relatorio-de-gestao-cvm-2023-versao-final-1.pdf"...
-> Successfully saved in data_cvm/

-> Accessing report page: https://www.gov.br/c
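
The implementation of `download_pdfs` is not shown in this notebook. As a rough sketch of the link-discovery step that the log suggests (collect the report links from the index page, then derive a file name from each one), something like the following could work using only the standard library. The names `PdfLinkParser`, `extract_pdf_links`, and `filename_from_link` are hypothetical, and the assumption that report links end in `.pdf` or `.pdf/view` is taken from the URLs printed in the output above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collects <a href> values that look like PDF files or PDF view pages."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        # Assumption: report links end in ".pdf" or ".pdf/view"
        if href and href.endswith(('.pdf', '.pdf/view')):
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:  # skip duplicates, keep page order
                self.links.append(absolute)


def extract_pdf_links(html, base_url):
    """Return the absolute PDF-like links found in an HTML string."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


def filename_from_link(link):
    """Derive a local file name, dropping a trailing "/view" segment if present."""
    if link.endswith('/view'):
        link = link[:-len('/view')]
    return link.rsplit('/', 1)[-1]
```

Fetching the page and writing each file to `data_cvm/` is left out on purpose, since it depends on the HTTP client and error handling the real helper uses.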

Next, we merge all the downloaded reports into a single file.

In [2]:
from data_acquisition import merge_pdfs

merge_pdfs(folder_path = './data_cvm', output_filename='cvm_reports.pdf')

Found 22 PDF files in ./data_cvm.
Adding: Relatorio_Gestao_CVM_2020.pdf
Adding: relatorio-de-gestao-2021.pdf
Adding: relatorio-de-gestao-cvm-2023-versao-final-1.pdf
Adding: relatorio-de-gestao-da-cvm-2004.pdf
Adding: relatorio-de-gestao-da-cvm-2005.pdf
Adding: relatorio-de-gestao-da-cvm-2006.pdf
Adding: relatorio-de-gestao-da-cvm-2007.pdf


Unknown destination: '02_ Apresentação provisória.pdf' [0.0, 1]
Unknown destination: '03_ Sumário do Relatório de Gestão 2007.pdf' [0.0, 1]
Unknown destination: '04_Item 1_A CVM.pdf' [0.0, 1]
Unknown destination: '05_ Relatório de Gestão 2007 formato da Portaria CGU 1950_07.pdf' [0.0, 1]
Unknown destination: '06_Item 3.1_Colegiado.pdf' [0.0, 1]
Unknown destination: '07_ Item 3.2_ASE.pdf' [0.0, 1]
Unknown destination: '08_Item 3.3_ASC.pdf' [0.0, 1]
Unknown destination: '09_Item 3.4_SGE.pdf' [0.0, 1]
Unknown destination: '10_Item 3.5_PFE.pdf' [0.0, 1]
Unknown destination: '11_Item 3.6_SDM.pdf' [0.0, 1]
Unknown destination: '12_Item 3.7_SFI.pdf' [0.0, 1]
Unknown destination: '13_Item 3.8_SSI.pdf' [0.0, 1]
Unknown destination: '14_Item 3.9_SNC.pdf' [0.0, 1]
Unknown destination: '15_Item 3.10_SPL.pdf' [0.0, 1]
Unknown destination: '16_Item 3.11_SOI.pdf' [0.0, 1]
Unknown destination: '17_Item 3.12_SRB.pdf' [0.0, 1]
Unknown destination: '18_Item 3.13_SRS.pdf' [0.0, 1]
Unknown destination: '19

Adding: relatorio-de-gestao-da-cvm-2008.pdf
Adding: relatorio-de-gestao-da-cvm-2009.pdf
Adding: relatorio-de-gestao-da-cvm-2010.pdf
Adding: relatorio-de-gestao-da-cvm-2011.pdf
Adding: relatorio-de-gestao-da-cvm-2012.pdf
Adding: relatorio-de-gestao-da-cvm-2013.pdf
Adding: relatorio-de-gestao-da-cvm-2014.pdf
Adding: relatorio-de-gestao-da-cvm-2015.pdf
Adding: relatorio-de-gestao-da-cvm-2016.pdf
Adding: relatorio-de-gestao-da-cvm-2017.pdf
An error occurred during the merging process: 'NullObject' object has no attribute '__iter__'
Adding: relatorio-de-gestao-da-cvm-2018.pdf
Adding: relatorio-de-gestao-da-cvm-2019.pdf
Adding: relatorio-de-gestao-da-cvm-2022.pdf
Adding: relatorio-gestao-2024.pdf
Adding: relatorio-iesgo2024-cvm-tcu.pdf
Succesfully merged the files into "cvm_reports.pdf".
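
The log above shows one error during the merge while the remaining files were still processed. The internals of `merge_pdfs` are not shown, but the general pattern of handling errors per file, so that one malformed PDF does not abort the whole run, can be sketched as below. `merge_with_skips` is a hypothetical helper; its `append` argument is any callable that adds one file to the output, for example the bound `append` method of a `pypdf.PdfWriter` (an assumption about the library being used).

```python
def merge_with_skips(paths, append):
    """Append each file in sorted order and collect failures instead of aborting.

    Returns (merged, failed), where failed holds (path, error message) pairs.
    """
    merged, failed = [], []
    for path in sorted(paths):
        try:
            append(path)
        except Exception as exc:  # broad on purpose: malformed PDFs fail in many ways
            failed.append((path, str(exc)))
        else:
            merged.append(path)
    return merged, failed
```

Returning the failed paths makes it possible to check afterwards which reports are missing from the merged file, rather than relying on the printed log.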


## Loading and Splitting the Documents

Before creating the model and feeding it our documents, we have to split the data into smaller chunks, which makes processing and searching easier.

In [6]:
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [7]:
DATA_PATH = 'data_cvm/'

# Loading PDF file
loader = PyPDFLoader(os.path.join(DATA_PATH, 'cvm_reports.pdf'))
documents = loader.load()

# Splitting the documents into chunks, with some overlap to avoid context loss.
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 200)
texts = text_splitter.split_documents(documents)

print(f'Document divided in {len(texts)} chunks.')

# First chunk
print(texts[0].page_content)

Document divided in 8906 chunks.
Relatório
de Gestão
2020
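
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to cut at separators such as paragraph and line breaks; `split_with_overlap` is a hypothetical function that only shows how consecutive chunks share their boundary characters.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


chunks = split_with_overlap('abcdefghijklmnopqrstuvwxyz', chunk_size=10, chunk_overlap=3)
# Each chunk starts with the last 3 characters of the previous one.
```

With an overlap, a sentence that straddles a chunk boundary is more likely to appear whole in at least one chunk, at the cost of storing some text twice.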


## Embeddings Creation

Now we will convert our chunks into numerical vectors (embeddings).

In [10]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
model_kwargs = {'device': 'cpu'} # or 'cuda' if you want to use a GPU
encode_kwargs = {'normalize_embeddings': False}

embeddings = HuggingFaceEmbeddings(
    model_name = model_name,
    model_kwargs = model_kwargs,
    encode_kwargs = encode_kwargs
)

print('Embedding model successfully loaded.')

Embedding model successfully loaded.
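
The `normalize_embeddings` flag affects how vectors compare with each other. The toy example below, using hand-written 2D vectors instead of real embeddings, shows that cosine similarity ignores vector length while Euclidean distance does not, and that the two agree once vectors are scaled to unit length. The helper names are hypothetical. Whether this matters in practice depends on the distance metric the vector store uses, so treat it as something to check rather than a known problem in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; independent of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.hypot(*v)
    return [x / norm for x in v]


a = [1.0, 0.0]
b = [3.0, 0.0]  # same direction as a, but three times longer

same_direction = cosine_similarity(a, b)                # 1.0: identical direction
raw_distance = math.dist(a, b)                          # 2.0: lengths differ
unit_distance = math.dist(normalize(a), normalize(b))   # 0.0 after normalization
```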



## Vector Database

Now we will create a vector database and save it locally so that it can be used for similarity search.

In [None]:
from langchain_community.vectorstores import FAISS

db = FAISS.from_documents(texts, embeddings)

db.save_local('faiss_index_cvm')

print('Vector database created and saved locally.')

Vector database created and saved locally.
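
Conceptually, a similarity search compares the query vector with every stored chunk vector and returns the closest ones. The brute-force sketch below illustrates that idea with the standard library; it is not how FAISS is implemented internally, and `top_k` is a hypothetical name. It assumes Euclidean distance as the metric.

```python
import math


def top_k(query_vec, chunk_vecs, k=3):
    """Return (index, distance) pairs for the k stored vectors closest to the query."""
    scored = sorted((math.dist(query_vec, vec), i) for i, vec in enumerate(chunk_vecs))
    return [(i, d) for d, i in scored[:k]]


stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.5]]
nearest = top_k([0.0, 0.0], stored, k=2)
# The two closest vectors are at indices 0 and 3.
```

A linear scan like this costs one distance computation per stored vector for every query, which is what dedicated indexes are designed to speed up on large collections.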


## Large Language Model Configuration

Now we will load the LLM that will process the input prompt and generate our answers.

In [12]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
# The langchain_community version of this class is deprecated in favor of langchain_huggingface
from langchain_huggingface import HuggingFacePipeline

model_id = 'google/flan-t5-large' # use 'google/flan-t5-base' if it is too heavy

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Creating a HuggingFace pipeline
pipe = pipeline(
    'text2text-generation',
    model = model,
    tokenizer = tokenizer,
    max_length = 512, # raise if the answers are being cut
    # Lower values make sampling more conservative. Note: temperature is only
    # used when do_sample = True; without it, decoding is greedy.
    temperature = 0.1
)

# Creating the LLM instance for LangChain
llm = HuggingFacePipeline(pipeline = pipe)

print('LLM successfully loaded.')

LLM successfully loaded.


## Retrieval-Augmented Generation Chain

With our model loaded and our data prepared, we can create the chain of our RAG.

In [13]:
from langchain.chains import RetrievalQA

# We could have used the previous model and database, but to emphasize that the chain can
# be run separately we will create separate variables.
embeddings_model = HuggingFaceEmbeddings(model_name = 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
db_loaded = FAISS.load_local('faiss_index_cvm', embeddings_model, allow_dangerous_deserialization=True)


# Creating the retriever, the object that searches the database for relevant chunks
retriever = db_loaded.as_retriever(search_kwargs = {'k': 3}) # returns the 3 most similar chunks.

# Creating the RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm = llm,
    chain_type = 'stuff', # 'stuff' joins all the found chunks into one prompt
    retriever = retriever,
    return_source_documents = True # optional, returns the source of the information.
)

print('RAG chain ready for use.')


RAG chain ready for use.
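
The `'stuff'` strategy mentioned in the code comment simply concatenates the retrieved chunks and places them in a single prompt together with the question. A minimal sketch of that idea follows. The template text and the name `build_stuff_prompt` are hypothetical and are not the prompt that LangChain actually sends to the model.

```python
def build_stuff_prompt(question, chunks):
    """Join all retrieved chunks into one context block followed by the question."""
    context = '\n\n'.join(chunk.strip() for chunk in chunks)
    return (
        'Use the following context to answer the question.\n\n'
        f'{context}\n\n'
        f'Question: {question}\n'
        'Answer:'
    )


prompt = build_stuff_prompt(
    'What does the CVM regulate?',
    ['First retrieved chunk. ', ' Second retrieved chunk.'],
)
```

Because every retrieved chunk ends up in the same prompt, its total length grows with `k` and with the chunk size, so it is worth checking that the result still fits the input limit of the chosen model.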


## Asking the System

With everything ready, we will create a function that takes a question from the user and runs it through our chain.

In [14]:
def ask_question(query):
    # Run the query through the RAG chain
    result = qa_chain.invoke({'query': query})
    print('Answer: ', result['result'])
    print('\nSources:')
    for doc in result['source_documents']:
        # Double quotes outside, so the nested single quotes also work before Python 3.12
        print(f"- Page {doc.metadata.get('page', 'N/A')}: ...{doc.page_content[:200]}...")

In [16]:
question = 'What was the total budget executed by the CVM in 2023?'
ask_question(question)

Answer:  R$ 79.430.179,00

Sources:
- Page 2765: ...•  76  •
Relatório de Gestão CVM 2019
Capítulo 5 - Resultados e  
desempenho da gestão
Distribuição do Limite de Movimentação e Empenho e Limite de 
Pagamento em 2019 - Despesas Discricionárias
5.13.2...
- Page 266: ...•  155  •
Relatório de Gestão CVM 2021
Capítulo 5 - Informações orçamentárias, 
financeiras e contábeis
Porém, em consequência da manutenção da pandemia de COVID-19 e ajustes no planejamento orça -
me...
- Page 720: ...Passagens e diárias.  
 
ORÇAMENTO 
  
Lei Orçamentária 2005 
O orçamento aprovado para a CVM, estabelecido pela Lei nº 11.100, de 25 de janeiro de 2005, foi de 
R$ 79.430.179,00 , dos quais R$ 65.520...
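
The page numbers printed under "Sources" refer to positions in the merged `cvm_reports.pdf`, not to pages of the individual reports (`PyPDFLoader` stores a zero-based page index in `metadata['page']`). If the page count of each report is known, a merged index can be mapped back to a report and a page within it. The sketch below assumes a hypothetical list of `(filename, page_count)` pairs in merge order; the counts used in the example are made up for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_page(page_index, page_counts):
    """Map a zero-based index in the merged PDF to (filename, one-based page)."""
    ends = list(accumulate(count for _, count in page_counts))
    if not page_counts or not 0 <= page_index < ends[-1]:
        raise IndexError('page index outside the merged document')
    pos = bisect_right(ends, page_index)      # which report contains the page
    start = ends[pos - 1] if pos else 0       # merged index of that report's first page
    return page_counts[pos][0], page_index - start + 1


# Hypothetical page counts, in the order the files were merged
counts = [('first.pdf', 3), ('second.pdf', 2)]
location = locate_page(3, counts)  # first page of the second file
```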


In [17]:
question = "Summarize the CVM's main initiatives related to the ESG (Environmental, Social, and Governance) agenda mentioned in the latest management report."
ask_question(question)

Answer:  Os objetivos estratégicos da CVM e as iniciativas para o seu alcance esto definidos no documento Planejamento Estratégico – Construindo a CVM de 2023

Sources:
- Page 2580: ...17 
 
3.  PLANEJAMENTO ESTRATÉGICO E GOVERNANÇA 
 
3.1.  Planejamento Organizacional 
 
A CVM dispõe de planos nos níveis estratégico, tático e operacional, e responde, no âmbito 
do Plano Plurianual ...
- Page 37: ...•  38  •
Relatório de Gestão CVM 2020
A base sobre a qual são definidos os objetivos estratégicos e priorizados os projetos e os processos para a alocação de 
recursos é o conjunto de crenças e valore...
- Page 2724: ...•  35  •
Capítulo 3 - Governança, estratégia  
e alocação de recursos
Relatório de Gestão CVM 2019  •  35  •
3.3. Metas Institucionais 2019
Na implantação da estratégia da organização, um dos principa...


Since the documents are in Portuguese, the model may answer in Portuguese, so we have to state explicitly when we want the answer in another language.

In [20]:
question = """Compare the evolution of the number of individual investors 
on the stock exchange between 2022 and 2024, based on the report data. 
Please provide the answer in English."""
ask_question(question)

Answer:  The evolution of the number of individual investors on the stock exchange between 2022 and 2024, based on the report data.

Sources:
- Page 466: ...- Fórmula de Cálculo: Relação percentual entre o valor do investimento realizado via mercado de 
valores mobiliários e o total do investimento na economia. 
• Relação entre o valor de mercado das comp...
- Page 3040: ...•  74  •
Relatório de Gestão CVM 2024
Capítulo 4 - Governança, estratégia  
e desempenho
Particularmente no que tange às ofertas públicas de distribuição de ações, foi observada, em 2024, 
aumento sig...
- Page 159: ...matização é detalhar a legislação aplicável ao mercado, dando segurança jurídica aos seus participantes 
com o menor custo de observância possível.  
4.3.1. Evolução em 2021
No ano de 2021, foi possív...
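
Rather than typing the language request into every question, it can be appended by a small helper. `with_answer_language` is a hypothetical function, and the exact wording of the instruction is an assumption; how reliably the model follows it would need to be tested.

```python
def with_answer_language(question, language='English'):
    """Append an explicit instruction about the language of the answer."""
    return f'{question.strip()} Please provide the answer in {language}.'


query = with_answer_language('What are the strategic objectives of the CVM?')
```

The resulting string can then be passed to `ask_question` in place of the raw question.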
