# NLP - Natural Language Processing
Natural Language Processing (NLP) is the subfield of Artificial Intelligence (AI) that studies the abilities and limitations of machines in understanding human language. The goal of NLP is to give computers the ability to understand and compose text. 'Understanding' a text means recognizing context; performing syntactic, semantic, lexical, and morphological analysis; creating summaries; extracting information; interpreting meaning; analyzing sentiment; and even learning concepts from the processed texts.

# Terms that best represent a document
#### The goal of a bag of words is to represent a text with as few words as possible. We can do this by counting the frequency of the important words in the text
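
As a minimal sketch of the idea, a bag of words can be built with the standard library alone. The helper name `bag_of_words` and the regex tokenizer are illustrative assumptions, not part of the notebook's pipeline (which uses spaCy further below).

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    # \w+ matches runs of letters, digits and underscores (Unicode-aware in
    # Python 3), so accented Portuguese words stay intact.
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

bow = bag_of_words("O trailer era aguardado. O trailer foi lançado.")
print(bow.most_common(2))
```

Without further filtering, function words such as `o` are counted alongside content words such as `trailer`.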

#### We will use a dataset of about 1,000 articles from Folha de São Paulo (946 rows in the file)

In [1]:
import pandas as pd
news_load = pd.read_csv('articles_1000.csv')
news_load.head()

Unnamed: 0,title,text,date,category,subcategory,link
0,"Lula diz que está 'lascado', mas que ainda tem...",Com a possibilidade de uma condenação impedir ...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
1,"'Decidi ser escrava das mulheres que sofrem', ...","Para Oumou Sangaré, cantora e ativista malines...",2017-09-10,ilustrada,,http://www1.folha.uol.com.br/ilustrada/2017/10...
2,Três reportagens da Folha ganham Prêmio Petrob...,Três reportagens da Folha foram vencedoras do ...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
3,Filme 'Star Wars: Os Últimos Jedi' ganha trail...,A Disney divulgou na noite desta segunda-feira...,2017-09-10,ilustrada,,http://www1.folha.uol.com.br/ilustrada/2017/10...
4,CBSS inicia acordos com fintechs e quer 30% do...,"O CBSS, banco da holding Elopar dos sócios Bra...",2017-09-10,mercado,,http://www1.folha.uol.com.br/mercado/2017/10/1...


In [2]:
news_load.text[3]

'A Disney divulgou na noite desta segunda-feira (9) o novo trailer de "Star Wars: Os Últimos Jedi", oitavo episódio da saga.  O trailer era aguardado pelos fãs e se tornou um dos tópicos mais comentados no Twitter no horário de seu lançamento.  Assista ao trailer de \'Os Últimos Jedi\'  Assista ao trailer de \'Os Últimos Jedi\'  Em "O Despertar da Força" (2015), episódio mais recente, a personagem Rey (Daisy Ridley) descobre que tem a Força e procura por Luke Skywalker (Mark Hamill) para começar seu treinamento Jedi.  A história do novo episódio continua desse ponto, e cenas do trailer mostram a relação entre Rey e Skywalker.  Com direção de Rian Johnson, o filme será lançado em 14 de dezembro no Brasil. O nono episódio, ainda sem título, encerra a nova trilogia em 20 de dezembro de 2019.  O estúdio também divulgou novo poster do filme.  Poster'

#### Traditionally, articles are grouped by category

In [3]:
news_load.groupby('category')['link'].count().sort_values(ascending=False)

category
colunas               157
poder                 125
mercado               111
mundo                 107
ilustrada             102
cotidiano              89
esporte                70
opiniao                35
sobretudo              27
seminariosfolha        21
saopaulo               19
ciencia                14
turismo                14
paineldoleitor         10
banco-de-dados         10
ilustrissima            8
tec                     6
empreendedorsocial      6
equilibrioesaude        4
educacao                4
tv                      3
ambiente                3
serafina                1
Name: link, dtype: int64

### How do we process a text?
- Remove noise
- Punctuation
- Tags, URLs, stopwords
- Stemming (strip suffixes)
- Lemmatization (reduce a word to its dictionary form)
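
The noise-removal items above can be sketched with regular expressions. This is only an illustration under simple assumptions (the patterns and the helper name `remove_noise` are hypothetical); real HTML is better handled by a proper parser.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                 # HTML/XML tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # URLs
PUNCT_RE = re.compile(r"[^\w\s]")               # anything that is not a word char or whitespace

def remove_noise(text):
    """Strip tags, URLs and punctuation, then collapse repeated whitespace."""
    # Tags go first so that a closing tag glued to a URL is not swallowed by \S+.
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(remove_noise("<p>Veja o trailer: http://example.com/trailer !</p>"))
```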

### What are the most common steps in an NLP package?
1. Convert text to lowercase
    - Star Wars -> star wars
2. Tokenize terms
    - Star Wars -> ['Star', 'Wars']
3. Tokenize sentences
    - Star Wars é bom? SIM! -> ['Star Wars é bom?', 'SIM!']
4. Remove stopwords (words that carry little meaning on their own)
    - Star Wars é bom? -> Star Wars bom?
5. Lemmatization (reducing a word to its dictionary form, e.g. the infinitive for verbs and the singular for nouns)
    - Star Wars é um ótimo filme -> Star Wars ser um ótimo filme
6. Stemming (stripping suffixes)
    - Star Wars is Amazing -> Star War is amaz
7. Word frequency
    - Star Wars é bom? Star Wars é incrível! -> (Star,2), (Wars,2), (é,2), (bom,1), (incrível,1)
8. PoS tagging (labeling each token with its part of speech)
    - Star Wars é bom? Star Wars é incrível! -> (Star, PROPN), (Wars, PROPN), (é, AUX), (bom, ADJ), (incrível, ADJ)
9. NER (named entity recognition)
    - Star Wars é bom? -> (Star Wars, MISC)
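
To make the difference between steps 5 and 6 concrete, here is a toy sketch. The suffix list and the lemma dictionary are made-up assumptions for this one sentence. Real stemmers and lemmatizers use far richer rules and vocabularies.

```python
# Toy suffix list and lemma dictionary: illustrative assumptions only.
SUFFIXES = ("ing", "es", "s")
LEMMAS = {"is": "be", "wars": "war"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemma(word):
    """Look the word up in a dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

tokens = "star wars is amazing".split()
stems = [toy_stem(t) for t in tokens]
lemmas = [toy_lemma(t) for t in tokens]
print(stems)
print(lemmas)
```

The stemmer blindly cuts `amazing` down to `amaz`, which is not a word, while the lemmatizer maps `is` to `be` because it knows the vocabulary.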

In [4]:
news_load.info() # missing values in text and subcategory

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 946 entries, 0 to 945
Data columns (total 6 columns):
title          946 non-null object
text           939 non-null object
date           946 non-null object
category       946 non-null object
subcategory    189 non-null object
link           946 non-null object
dtypes: object(6)
memory usage: 44.4+ KB


### Some articles have no text. Since they add no value, we will remove them

In [5]:
news_df = news_load[pd.notnull(news_load['text'])]
news_df.info() # missing values only in subcategory

<class 'pandas.core.frame.DataFrame'>
Int64Index: 939 entries, 0 to 945
Data columns (total 6 columns):
title          939 non-null object
text           939 non-null object
date           939 non-null object
category       939 non-null object
subcategory    189 non-null object
link           939 non-null object
dtypes: object(6)
memory usage: 51.4+ KB


### We will use spaCy as our NLP package

In [6]:
import spacy # https://spacy.io/
nlp = spacy.load('pt_core_news_sm')

## Let's start looking for the terms that best represent an article
#### We will show the examples on plain text; apply the same techniques to the news DataFrame yourselves

## The first technique: popularity
We start by removing noise.

**Tokenizer**: A text stored as a single string is a single piece of data. We need to split it into words so we can work with each one individually

In [7]:
doc = nlp('O live-action Turma da Mônica – Laços já foi visto por mais de um milhão e meio de espectadores. O filme ainda pode ser conferido em 471 salas.')
doc.text.split(' ')

['O',
 'live-action',
 'Turma',
 'da',
 'Mônica',
 '–',
 'Laços',
 'já',
 'foi',
 'visto',
 'por',
 'mais',
 'de',
 'um',
 'milhão',
 'e',
 'meio',
 'de',
 'espectadores.',
 'O',
 'filme',
 'ainda',
 'pode',
 'ser',
 'conferido',
 'em',
 '471',
 'salas.']
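
Splitting on spaces leaves punctuation attached to words (`'espectadores.'`, `'salas.'`). As a rough standard-library illustration of what a tokenizer does differently, the sketch below separates words from punctuation with a regular expression. This is not spaCy's algorithm, only an approximation, and the helper name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ grabs a run of word characters; [^\w\s] grabs one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "O filme ainda pode ser conferido em 471 salas."
print(sentence.split(" ")[-1])         # whitespace split keeps the period glued on
print(simple_tokenize(sentence)[-2:])  # regex split emits the period as its own token
```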

In [8]:
from spacy import displacy
displacy.render(doc, jupyter=True, style='dep')

**Clean text**: These **tokens** still include items that carry no meaning, such as **punctuation** and **whitespace**

In [9]:
tokens = [token.text.lower() for token in doc if not (token.is_punct or token.is_space)]
tokens

['o',
 'live',
 'action',
 'turma',
 'da',
 'mônica',
 'laços',
 'já',
 'foi',
 'visto',
 'por',
 'mais',
 'de',
 'um',
 'milhão',
 'e',
 'meio',
 'de',
 'espectadores',
 'o',
 'filme',
 'ainda',
 'pode',
 'ser',
 'conferido',
 'em',
 '471',
 'salas']

## Let's take the top n terms
**Popularity counter**: One way to estimate how representative a term is of a text is to count how often it appears

In [10]:
from collections import Counter
freq = Counter(tokens)
freq.most_common(10)

[('o', 2),
 ('de', 2),
 ('live', 1),
 ('action', 1),
 ('turma', 1),
 ('da', 1),
 ('mônica', 1),
 ('laços', 1),
 ('já', 1),
 ('foi', 1)]

## Exercise: For each article, create a column with its top 10 words

In [11]:
def freq_word(text):
    doc = nlp(text)
    tokens = [token.text.lower() for token in doc if not (token.is_punct or token.is_space)]
    freq = Counter(tokens)
    return freq.most_common(10)

In [12]:
news_df['freq_word'] = news_df.text.head().apply(freq_word)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

Note: pandas raises this `SettingWithCopyWarning` because `news_df` was created by filtering `news_load`, so it may be a view of the original frame. Creating `news_df` with an explicit `.copy()` avoids the warning. Also note that `.head()` restricts the computation to the first five articles, so the remaining rows of `freq_word` are `NaN`.


In [13]:
news_df.freq_word.head()

0    [(que, 26), (o, 25), (a, 24), (de, 17), (não, ...
1    [(de, 20), (que, 19), (a, 18), (e, 17), (o, 15...
2    [(de, 40), (o, 30), (a, 25), (e, 23), (do, 15)...
3    [(a, 8), (o, 8), (de, 8), (trailer, 5), (jedi,...
4    [(de, 31), (a, 29), (o, 19), (do, 14), (que, 1...
Name: freq_word, dtype: object

## Problem 1: The most frequent terms are articles and other function words
We need to remove the words that carry no meaning on their own

In [16]:
list(nlp.Defaults.stop_words)[0:10]

['tentaram',
 'esteve',
 'perto',
 'teve',
 'cuja',
 'estivestes',
 'faço',
 'às',
 'obrigado',
 'daquele']

spaCy's Portuguese stop-word list is not a good fit here, so we will use another list

In [17]:
stopwords_df = pd.read_csv('stopwords.txt', header=None, names=['stop'], sep='\n') # stop-word list from USP
stopwords = stopwords_df['stop'].str.strip().values
stopwords[:10]

array(['de', 'a', 'o', 'que', 'e', 'do', 'da', 'em', 'um', 'para'],
      dtype=object)

In [18]:
doc  = nlp('O live-action Turma da Mônica – Laços já foi visto por mais de um milhão e meio de espectadores. O filme ainda pode ser conferido em 471 salas.')

[token.text.lower() for token in doc if not (token.is_punct or token.is_space or (token.text.lower() in stopwords))]

['live',
 'action',
 'turma',
 'mônica',
 'laços',
 'visto',
 'milhão',
 'meio',
 'espectadores',
 'filme',
 'ainda',
 'pode',
 'conferido',
 '471',
 'salas']
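
`stopwords` above is a NumPy array, so every `in` test scans the whole array. A possible refinement, sketched here with a small hypothetical stop-word sample rather than the USP file, is to convert the list to a `set` once so that each lookup takes constant time on average.

```python
# Hypothetical sample; in the notebook this would be set(stopwords).
stopword_set = {"o", "da", "já", "foi", "por", "mais", "de", "um", "e", "em", "ser"}

def remove_stopwords(tokens, stop=stopword_set):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop]

tokens = ["O", "filme", "ainda", "pode", "ser", "conferido", "em", "471", "salas"]
filtered = remove_stopwords(tokens)
print(filtered)
```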

In [19]:
def freq_word(text):
    doc = nlp(text)
    tokens = [token.text.lower() for token in doc if not (token.is_punct or token.is_space or (token.text.lower() in stopwords))]
    freq = Counter(tokens)
    return freq.most_common(10)

news_df.text.head().apply(freq_word)

0    [(ex, 7), (disse, 7), (lula, 6), (presidente, ...
1    [(sangaré, 5), (cantora, 4), (mulher, 4), (con...
2    [(sobre, 7), (reportagens, 5), (ranking, 5), (...
3    [(trailer, 5), (jedi, 4), (episódio, 4), (novo...
4    [(fintechs, 11), (crédito, 11), (banco, 10), (...
Name: text, dtype: object

## Problem 2: Now that the text is clean, let's apply lemmatization
- multiplications -> multiplication -> multiplicate -> multiple

In [20]:
doc = nlp('O live-action Turma da Mônica – Laços já foi visto por mais de um milhão e meio de espectadores. O filme ainda pode ser conferido em 471 salas.')

for token in doc:
    print(token.text, token.lemma_, token.pos_)

O O DET
live live NOUN
- - PUNCT
action action ADJ
Turma Turma PROPN
da da ADP
Mônica Mônica PROPN
– – PUNCT
Laços Laços PROPN
já já ADV
foi ser AUX
visto vestir VERB
por por ADP
mais mais ADV
de de ADP
um um NUM
milhão milhão NOUN
e e CCONJ
meio mear NOUN
de de ADP
espectadores espectador NOUN
. . PUNCT
O O DET
filme filmar NOUN
ainda ainda ADV
pode poder AUX
ser ser AUX
conferido conferir VERB
em em ADP
471 471 NUM
salas sala NOUN
. . PUNCT

Note that the small model's lemmatizer makes mistakes here: `visto` becomes `vestir`, `meio` becomes `mear`, and `filme` becomes `filmar`. These errors carry over to the term lists computed from the lemmas.


In [21]:
def tokenizer_lemma_sw(doc):
    return [token.lemma_.lower() for token in doc if not (token.is_punct or token.is_space or (token.text.lower() in stopwords))]

def counter_list(tokens, top_k=10):
    return [word for word, word_count in Counter(tokens).most_common(top_k)]

def top_tokens(doc):
    doc = nlp(doc)
    tokens = tokenizer_lemma_sw(doc)
    return counter_list(tokens)

news_df.text.head().apply(top_tokens)

0    [dizer, ex, respeitar, lula, querer, president...
1    [mulher, sangaré, contar, cantor, n, país, par...
2    [reportagem, sobrar, ranking, eficiência, folh...
3    [trailer, novo, jedi, episódio, últimos, divul...
4    [banco, fintechs, crédito, cbss, parceria, diz...
Name: text, dtype: object

## Another approach: find the meaningful words in the text
We can use a named entity recognizer (**NER**) to help with this search

In [22]:
doc = nlp('O live-action Turma da Mônica – Laços já foi visto por mais de um milhão e meio de espectadores. O filme ainda pode ser conferido em 471 salas.')

for ent in doc.ents:
    print(ent.text, ent.label_)

Turma da Mônica PER


In [23]:
displacy.render(doc, jupyter=True, style='ent')

## Exercise: Pick an article and display it using spaCy's NER visualizer

In [24]:
displacy.render(nlp(news_df.text[8]), jupyter=True, style='ent')

## Exercise: Apply NER to all articles

In [25]:
news_df.text.head().apply(lambda x: nlp(x).ents)

0    ((Luiz, Inácio, Lula, da, Silva), (Lava, Jato)...
1    ((Oumou, Sangaré), (Canto), (Casa, da, Cultura...
2    ((Folha), (Prêmio, Petrobras, de, Jornalismo),...
3    ((Disney), (Star, Wars, :), (Twitter), (Assist...
4    ((CBSS), (Elopar), (Bradesco), (Banco, do, Bra...
Name: text, dtype: object

## Another approach: measure popularity by how rare the word is
**TF-IDF**: Term Frequency-Inverse Document Frequency
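
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF. It assumes the smoothed formulation that scikit-learn's `TfidfVectorizer` uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The function name `tfidf` and the toy corpus are illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    weights = []
    for doc in docs:
        tf = Counter(doc)                       # raw term counts
        vec = {t: tf[t] * idf[t] for t in tf}   # tf * idf
        norm = math.sqrt(sum(w * w for w in vec.values()))
        weights.append({t: w / norm for t, w in vec.items()})  # L2-normalize
    return weights

corpus = [
    ["trailer", "jedi", "filme"],
    ["banco", "crédito", "filme"],
    ["lula", "presidente", "filme"],
]
w = tfidf(corpus)
print(sorted(w[0].items(), key=lambda kv: -kv[1]))
```

Because `filme` appears in every document, it receives the lowest weight in each one, while terms unique to a document score higher.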

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer

def wm2df(wm, feat_names):
    doc_names = ['News{:d}'.format(idx) for idx, _ in enumerate(wm)]
    
    df = pd.DataFrame(data=wm.toarray(), index=doc_names,
                      columns=feat_names)
    return(df)

def tokenizer_without_stopwords(docs):
    doc = nlp(docs)
    return [token.lemma_.lower() for token in doc if not (token.is_punct or token.is_space or (token.text.lower() in stopwords))]

def my_preprocessor(list_):
    return str(list_)

custom_vec = TfidfVectorizer(preprocessor=my_preprocessor, tokenizer=tokenizer_without_stopwords)
cwm = custom_vec.fit_transform(news_df.text.head(100))

corpus_tfidf = wm2df(cwm, custom_vec.get_feature_names())
corpus_tfidf.head(10)

Unnamed: 0,'s,-aliás,-apesar,-apreciação,-as,-cena,-com,-como,-descobrir,-diariamente,...,—evento,—incluindo,—leitores,—no,—os,—para,—popular,—que,€,★
News0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
corpus_tfidf.iloc[3].sort_values(ascending=False).head(10)

trailer      0.430432
jedi         0.375259
últimos      0.281444
episódio     0.279746
poster       0.187630
assista      0.187630
skywalker    0.187630
rey          0.187630
força        0.172173
novo         0.159250
Name: News3, dtype: float64

In [28]:
corpus_tfidf[corpus_tfidf['trailer'] > 0.1]

Unnamed: 0,'s,-aliás,-apesar,-apreciação,-as,-cena,-com,-como,-descobrir,-diariamente,...,—evento,—incluindo,—leitores,—no,—os,—para,—popular,—que,€,★
News3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## TF-IDF Based Named Entity Recognition

In [29]:
def ner_tokens(docs):
    doc = nlp(docs)
    return [ent.text.lower() for ent in doc.ents]

ner_vec = TfidfVectorizer(preprocessor=my_preprocessor,
                         tokenizer=ner_tokens)
cwm_ner = ner_vec.fit_transform(news_df.text.head(100))
corpus_ner_tfidf = wm2df(cwm_ner, ner_vec.get_feature_names())

corpus_ner_tfidf.head(10)

Unnamed: 0,Unnamed: 1,"""",*,américa do sul,caiu,década,entenda,melhor seleção,políticos,áfrica,...,–reuniu,–um,—como,—e,—em,—evento,—incluindo,—no,—para,★★★
News0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News6,0.0,0.0,0.0,0.0,0.0,0.169711,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
News9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [30]:
corpus_ner_tfidf.iloc[3].sort_values(ascending=False).head(10)

os últimos jedi    0.409962
rey                0.409962
assista            0.409962
jedi               0.204981
star wars:         0.204981
luke skywalker     0.204981
rian johnson       0.204981
poster             0.204981
força              0.204981
disney             0.204981
Name: News3, dtype: float64

## Exercise: Build an article summarizer :)
### Now that you know the importance of each term, let's transfer that importance to the sentence. Then we list the sentences with the highest importance

In [34]:
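def frase_signif(frase, palavras, pronomes):
    # Count how many of the top TF-IDF terms occur in the sentence (substring match)
    raras = sum([(palavra in frase) for palavra in palavras.index.values])
    # Count how many named entities occur in the sentence
    # (despite its name, `pronomes` holds the spaCy entities)
    pronom = sum([(palavra in frase) for palavra in [pronom.text for pronom in pronomes]])
    return raras+pronom

def resumo(idx, k):
    palavras = corpus_tfidf.iloc[idx].sort_values(ascending=False).head(10) # top 10 TF-IDF terms

    doc = nlp(news_df.text[idx]) # parse the article only once
    pronomes = doc.ents

    frases = doc.text.split('.')

    a = [frase_signif(frase, palavras, pronomes) for frase in frases]

    # Sentence indices from highest to lowest score. Sorting the indices
    # (instead of calling a.index) keeps sentences with tied scores distinct.
    idx_max = sorted(range(len(a)), key=lambda i: a[i], reverse=True)

    # Keep the k best sentences in their original order
    frases = [frases[idx_] for idx_ in sorted(idx_max[:k])]

    return '. '.join(frases)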

In [35]:
news_df.text[3]

'A Disney divulgou na noite desta segunda-feira (9) o novo trailer de "Star Wars: Os Últimos Jedi", oitavo episódio da saga.  O trailer era aguardado pelos fãs e se tornou um dos tópicos mais comentados no Twitter no horário de seu lançamento.  Assista ao trailer de \'Os Últimos Jedi\'  Assista ao trailer de \'Os Últimos Jedi\'  Em "O Despertar da Força" (2015), episódio mais recente, a personagem Rey (Daisy Ridley) descobre que tem a Força e procura por Luke Skywalker (Mark Hamill) para começar seu treinamento Jedi.  A história do novo episódio continua desse ponto, e cenas do trailer mostram a relação entre Rey e Skywalker.  Com direção de Rian Johnson, o filme será lançado em 14 de dezembro no Brasil. O nono episódio, ainda sem título, encerra a nova trilogia em 20 de dezembro de 2019.  O estúdio também divulgou novo poster do filme.  Poster'

In [36]:
resumo(3,2)

'A Disney divulgou na noite desta segunda-feira (9) o novo trailer de "Star Wars: Os Últimos Jedi", oitavo episódio da saga.   Assista ao trailer de \'Os Últimos Jedi\'  Assista ao trailer de \'Os Últimos Jedi\'  Em "O Despertar da Força" (2015), episódio mais recente, a personagem Rey (Daisy Ridley) descobre que tem a Força e procura por Luke Skywalker (Mark Hamill) para começar seu treinamento Jedi'
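
The scoring function above matches terms as substrings of the raw sentence and is case-sensitive, so lowercase TF-IDF terms such as `jedi` do not match `Jedi`. One possible variant, shown as a standard-library sketch with hypothetical names (`score_sentence`, `summarize`) and hand-written weights rather than the notebook's `corpus_tfidf`, scores each sentence by summing the weights of its lowercased tokens.

```python
import re

def score_sentence(sentence, weights):
    """Sum the weights of the sentence's lowercased word tokens."""
    tokens = re.findall(r"\w+", sentence.lower())
    return sum(weights.get(tok, 0.0) for tok in tokens)

def summarize(text, weights, k=1):
    """Return the k highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score_sentence(sentences[i], weights),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]

weights = {"trailer": 0.43, "jedi": 0.38, "episódio": 0.28}  # illustrative values
text = "A Disney divulgou o trailer de Os Últimos Jedi. O filme será lançado em dezembro."
summary = summarize(text, weights, k=1)
print(summary)
```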

## Let's introduce a new algorithm: TextRank
TextRank is an unsupervised, extractive text summarization technique. Let's look at the flow of the TextRank algorithm that we will follow:
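
As a rough sketch of that flow: TextRank builds a graph whose nodes are words, links words that co-occur within a small window, and runs PageRank on it. The standard-library version below is a simplified illustration under those assumptions (unweighted edges, a fixed number of iterations, and the hypothetical function name `textrank_keywords`). The notebook itself relies on textacy's implementation.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iterations=50):
    """Score tokens with PageRank over an undirected co-occurrence graph."""
    # 1. Build the graph: link words that appear within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # 2. Start from a uniform score and repeat the PageRank update.
    n = len(neighbors)
    scores = {word: 1.0 / n for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) / n
            + damping * sum(scores[nb] / len(neighbors[nb]) for nb in neighbors[word])
            for word in neighbors
        }

    # 3. Rank words from highest to lowest score.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tokens = ["trailer", "jedi", "episódio", "trailer", "rey", "skywalker", "trailer", "jedi"]
ranked = textrank_keywords(tokens)
print(ranked[0][0])
```

In this toy input `trailer` co-occurs with every other word, so it ends up with the highest score. Sentence scores can then be obtained by summing the scores of the words each sentence contains.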

In [39]:
from collections import Counter
import heapq
import textacy.ke

def word_tokenizer(docs):
    doc = nlp(str(docs))
    return [token.lemma_.lower() for token in doc if not (token.is_punct or token.is_space or (token.text.lower() in stopwords))]

def text_rank(doc):
    return dict(textacy.ke.textrank(doc, normalize="lemma", topn=20))

def sentence_scores(doc, word_frequencies):
    sentence_scores_dict = {}
    for sent in doc.text.split("."):  
        for word in word_tokenizer(sent):
            if word in word_frequencies.keys():
                sentence_scores_dict[sent] = sentence_scores_dict.get(sent, 0) + word_frequencies[word]
    return sentence_scores_dict
  
def generate_summary(docs):
    doc = nlp(str(docs))

    word_frequencies = text_rank(doc) 
    print("word_count_frequencies =>", word_frequencies)
    
    sentence_importances = sentence_scores(doc, word_frequencies)
    print("sentence_importances =>", sentence_importances)

    summary_sentences = heapq.nlargest(1, sentence_importances, key=sentence_importances.get)
    print("\nsummary_sentences =>", summary_sentences)
    print("\noriginal =>", docs)

generate_summary(news_df['text'][3])

word_count_frequencies => {'nono episódio': 0.025107948811995122, 'Últimos Jedi': 0.020567292192163883, 'treinamento Jedi': 0.02033695349765257, 'Luke Skywalker': 0.01885255654123643, 'personagem Rey': 0.018182443394861046, 'trailer': 0.017666374189457882, 'Star Wars': 0.014624592829686856, 'Mark Hamill': 0.01375743162021866, 'Daisy Ridley': 0.01375743162021866, 'Rian Johnson': 0.013160930205147333, 'Força': 0.010768748378944527, 'filmar': 0.01069082516021235, 'tópico': 0.007731070054399828, 'feirar': 0.007425458878488675, 'título': 0.006939771190013083, 'Disney': 0.0066003395784392065, 'direção': 0.006593630312536487, 'estúdio': 0.006586584703284821, 'recente': 0.006466379576806173, 'história': 0.006406515840308318}
sentence_importances => {'A Disney divulgou na noite desta segunda-feira (9) o novo trailer de "Star Wars: Os Últimos Jedi", oitavo episódio da saga': 0.02509183306794656, '  O trailer era aguardado pelos fãs e se tornou um dos tópicos mais comentados no Twitter no horário