## Project summarize reports

### Objective:
- In this project we will combine several Brazilian monetary policy reports provided by the Central Bank and generate a summary of the 3 publications using NLTK resources with the aim of creating a broad summary document of 3 quarters that can be read and researched quickly on all the subject matter covered in the last 9 months on the Brazilian economic scenario.

- Nesse projeto vamos unir vários relatórios de política monetária do Brasil fornecido pelo Banco Central e gerar um resumo das 3 publicações utilizando recursos de NLTK com o objetivo de criar um documento resumido amplo de 3 trimestres que possa ser lido e pesquisado de forma rápida sobre todo o assunto tratado nos últimos 9 meses sobre o cenário econômico brasileiro.

### Data Origin: https://www.bcb.gov.br/publicacoes/rpm/cronologicos
- relatórios de política monetária em PDF: 
politica_monetaria_set2024.pdf
politica_monetaria_dez2024.pdf
politica_monetaria_mar2025.pdf

- The monetary policy report or inflation report is generated by the Central Bank of Brazil every quarter and contains content of great interest to the market:
    - It presents a detailed analysis of the Brazilian and international economic scenario.
    - It discloses the BCB's projections for inflation in the short, medium and long term, considering different scenarios.
    - It explains the reasons behind the decisions taken by the Monetary Policy Committee (COPOM), especially in relation to the basic interest rate (Selic).
    - It provides transparency to the actions and analyses of the Central Bank for the general public, markets and specialists.

- O relatório de política monetária ou relatório de inflação é gerado pelo Banco Central do Brasil a cada trimestre e possui um conteúdo de grande interesse do mercado:
    - Apresenta uma análise detalhada do cenário econômico brasileiro e internacional. 
    - Divulga as projeções do BCB para a inflação no curto, médio e longo prazo, considerando diferentes cenários. 
    - Explica as razões por trás das decisões tomadas pelo Comitê de Política Monetária (COPOM), especialmente em 
    - relação à taxa básica de juros (Selic). 
    - Dá transparência às ações e análises do Banco Central para o público em geral, mercados e especialistas.

- ## Análise exploratória dos dados
- ## Preparação dos dados
- ## Armazenamento dos dados tratados
- ## Geração de sumário

In [3]:
# maximiza nro de linhas e colunas para exibição
# inibe mensagens de warning
import pandas as pd
pd.set_option('display.max_rows', None) # permite a máxima visualização das linhas em um display
pd.set_option('display.max_columns', None) # permite a máxima visualização das colunas em um display
import warnings
warnings.simplefilter('ignore') # inibe a exibição de avisos de warning

In [4]:
import bs4 as bs
import urllib.request
import re
import nltk

scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome_coronavirus_2')
article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article,'lxml')

paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:
    article_text += p.text
# Removing Square Brackets and Extra Spaces
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text )
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
sentence_list = nltk.sent_tokenize(article_text)
stopwords = nltk.corpus.stopwords.words('portuguese')

word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
    maximum_frequncy = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequncy)
    sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
import heapq
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)

Research into the natural reservoir of the virus that caused the 2002–2004 SARS outbreak has resulted in the discovery of many SARS-like bat coronaviruses, most originating in horseshoe bats. The virion then releases RNA into the cell and forces the cell to produce and disseminate copies of the virus, which infect more cells. Examination of the topology of the phylogenetic tree at the start of the pandemic also found high similarities between human isolates. Because many of the early infectees were workers at the Huanan Seafood Market, it has been suggested that the virus might have originated from the market. It was suggested that the acquisition of the furin-cleavage site in the SARS-CoV-2 S protein was essential for zoonotic transfer to humans. However, other research indicates that visitors may have introduced the virus to the market, which then facilitated rapid expansion of the infections. SARS‑CoV‑2 is a strain of the species Betacoronavirus pandemicum (SARSr-CoV), as is SARS-Co