# Document Summarization - Extractive Method

This notebook builds a process that produces text summaries with NLP techniques, using the extractive method.

**Data Source**

The data used in this work is the Multi News dataset, which consists of news articles collected from the site newser.com. The summaries in the dataset were written by professional editors and include links to the original articles. Since the goal is extractive summarization, we extract the article texts and set the provided summaries aside, which gives us a new dataset containing only the original article texts.

**Evaluation Metrics**

The metrics used in this work are available in the ROUGE package. ROUGE comprises a set of metrics for evaluating text summarization and translation tasks in NLP (Natural Language Processing). Below is a brief explanation of the main metrics we will use:

- **ROUGE-N** - measures the number of matching n-grams between the summary generated by the model and the reference text. An n-gram is a group of consecutive words/tokens: a 1-gram is a single word, a 2-gram is two consecutive words, and so on. The N in ROUGE-N is the n-gram size used for the evaluation.

- **ROUGE-L** - measures the longest common subsequence (LCS) between the model output and the reference. The idea is that the longer the sequence shared by the generated summary and the reference text, the more similar the two texts are.


- **Recall** - the number of matching n-grams found in both the model output and the reference text (true positives), divided by the total number of n-grams in the reference text.
<img src='https://miro.medium.com/max/1400/1*XEnhQJxKbEySimh1PPWPnQ.png' width=600>

- **Precision** - computed the same way as recall, except that the number of matching n-grams is divided by the total number of n-grams in the model output.
<img src='https://miro.medium.com/max/1400/1*aSd89F6kupr3znW71Qmb3Q.png' width=600>

- **F1-Score** - combines recall and precision into a single value that balances the two:
<img src='https://miro.medium.com/max/1400/1*zYuwaCDNpYf51H5S4DpDRA.png' width=600>
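
The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by the `rouge` package: the helper names `ngrams`, `rouge_n`, and `lcs_length` are hypothetical, tokenization is a simple whitespace split, and the F1 shown is the plain harmonic mean of precision and recall.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, f1) for the n-gram overlap of two strings."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # clipped overlap: an n-gram counts at most as often as it occurs in both texts
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (basis of ROUGE-L)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n(candidate, reference, n=1))  # 5 of 6 unigrams match
print(rouge_n(candidate, reference, n=2))  # 3 of 5 bigrams match
print(lcs_length(candidate.split(), reference.split()))  # "the cat on the mat"
```

Changing one word costs a single unigram but two bigrams, which is why ROUGE-2 is usually lower than ROUGE-1 for the same pair of texts.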




##**1 - Setting Up the Environment**

In [None]:
!pip install -U spacy

Collecting spacy
  Downloading spacy-3.1.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.9 MB)
Collecting thinc<8.1.0,>=8.0.12
  Downloading thinc-8.0.12-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (628 kB)
Collecting typer<0.5.0,>=0.3.0
  Downloading typer-0.4.0-py3-none-any.whl (27 kB)
Collecting spacy-legacy<3.1.0,>=3.0.8
  Downloading spacy_legacy-3.0.8-py2.py3-none-any.whl (14 kB)
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.6-py3-none-any.whl (17 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.1-py3-none-any.whl (42 kB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading s

In [None]:
# download the spaCy English model
#!sudo python -m spacy.en.download all
!sudo python3 -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
# install nltk, a toolkit for NLP tasks
!pip install nltk



In [None]:
# download tokenized final cleaned
# !gdown --id 1YXkF_ugMx1HYCBBYF7VKq0Lujzif1JES

In [None]:
# TensorFlow Datasets library, from which we download the
# Multi News dataset
!pip install tensorflow-datasets



In [None]:
import pandas as pd
import tensorflow as tf
import tensorflow_datasets as tfds

from tqdm.notebook import tqdm_notebook
tqdm_notebook.pandas()
# use .progress_apply() instead of .apply()

##**2 - Getting and Preparing the Data**

###**2.1 - Downloading the dataset**

To allow the process to run on part of the data, we define two variables: one indicating that we want to process only a sample of the dataset, and another giving the sample size. To process the full dataset, set ONLY_SAMPLE to False.

In [None]:
ONLY_SAMPLE = True
SAMPLE_SIZE = 1000

In [None]:
# load the Multi News dataset
ds = tfds.load('multi_news', split='train', data_dir='./data/', shuffle_files=False)
assert isinstance(ds, tf.data.Dataset)

Downloading and preparing dataset multi_news/1.0.0 (download: 245.06 MiB, generated: Unknown size, total: 245.06 MiB) to ./data/multi_news/1.0.0...

Shuffling and writing examples to ./data/multi_news/1.0.0.incomplete0X4N82/multi_news-train.tfrecord

Shuffling and writing examples to ./data/multi_news/1.0.0.incomplete0X4N82/multi_news-validation.tfrecord

Shuffling and writing examples to ./data/multi_news/1.0.0.incomplete0X4N82/multi_news-test.tfrecord

Dataset multi_news downloaded and prepared to ./data/multi_news/1.0.0. Subsequent calls will reuse this data.


In [None]:
# check the dataset size
tfds.as_dataframe(ds).shape
# 44972, 2 -> 44972 sets of articles

(44972, 2)

The dataset has 44,972 sets of articles with their respective summaries. TensorFlow datasets store text as byte tensors, so the text features must be decoded as UTF-8 to recover the correct characters of the original text.

In [None]:
# load the dataset as a pandas DataFrame and
# decode the text features as UTF-8
if ONLY_SAMPLE:
    df_news = tfds.as_dataframe(ds.take(SAMPLE_SIZE))
else:
    df_news = tfds.as_dataframe(ds)
df_news['document'] = df_news['document'].str.decode('utf-8')
df_news['summary'] = df_news['summary'].str.decode('utf-8')
df_news.shape

(1000, 2)
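
To illustrate why the decoding step matters, here is a small standard-library sketch. The sample string is made up for the example; the point is only that a `bytes` value and the decoded `str` differ as soon as the text contains non-ASCII characters.

```python
# a UTF-8 encoded byte string, similar to what a byte tensor yields (toy example)
raw = "Café in São Paulo".encode("utf-8")

# bytes: len() counts bytes, and "é" and "ã" take two bytes each in UTF-8
print(type(raw).__name__, len(raw))

# str: len() counts characters, and the accented letters are restored
text = raw.decode("utf-8")
print(type(text).__name__, len(text))
print(text)
```

The pandas call `Series.str.decode('utf-8')` used above applies the same `bytes.decode` step to every row of the column.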

The full training split has 44,972 items (the sample loaded above has 1,000), each pointing to a set of articles with one summary. The number of articles in each item varies.

In [None]:
# create a small sample dataset for tests
df_sample = tfds.as_dataframe(ds.take(2))
df_sample['document'] = df_sample['document'].str.decode('utf-8')
df_sample['summary'] = df_sample['summary'].str.decode('utf-8')

Inspecting the first item in both DataFrames:

In [None]:
df_news.document[0]

'Flag Flap Underscores Trump\'s Strained Relationship With McCain \n  \n Enlarge this image toggle caption Mandel Ngan/AFP/Getty Images Mandel Ngan/AFP/Getty Images \n  \n Updated at 9:37 p.m. ET \n  \n The beginning of the national memorial for Sen. John McCain, R-Ariz., has been marred by a fight over a sign of public respect, as President Trump initially avoided issuing a proclamation to lower flags to half-staff at all federal properties in McCain\'s honor. \n  \n Flags were lowered at government buildings across Washington and across the country Saturday evening after McCain died, as is standard practice for a sitting member of Congress. \n  \n But on Monday morning the flag atop the White House was back at full-staff, causing some to ask whether Trump\'s strained relationship with McCain had played into the decision to not keep it lowered. The lack of a proclamation was viewed by some as a disrespectful act reflecting the president\'s dislike for McCain, which Trump continued to 

In [None]:
df_sample.document[0]

'Flag Flap Underscores Trump\'s Strained Relationship With McCain \n  \n Enlarge this image toggle caption Mandel Ngan/AFP/Getty Images Mandel Ngan/AFP/Getty Images \n  \n Updated at 9:37 p.m. ET \n  \n The beginning of the national memorial for Sen. John McCain, R-Ariz., has been marred by a fight over a sign of public respect, as President Trump initially avoided issuing a proclamation to lower flags to half-staff at all federal properties in McCain\'s honor. \n  \n Flags were lowered at government buildings across Washington and across the country Saturday evening after McCain died, as is standard practice for a sitting member of Congress. \n  \n But on Monday morning the flag atop the White House was back at full-staff, causing some to ask whether Trump\'s strained relationship with McCain had played into the decision to not keep it lowered. The lack of a proclamation was viewed by some as a disrespectful act reflecting the president\'s dislike for McCain, which Trump continued to 

We can see that some newline characters ("\n") and escaped possessives ("\\'s") remain in the text; we will clean them up later.

###**2.2 - Cleaning the Dataset**

In [None]:
# convert the document column to str
# df_sample['document'] = df_sample['document'].astype(str)

In [None]:
def split_articles(doc):
    '''Read a set of articles and split it into a list of articles.'''
    articles = []
    num_articles = doc.count('|||||')
    for _ in range(num_articles):
        # find the marker that ends each story
        index = doc.find('|||||')
        articles.append(doc[:index])
        # drop the story and its 5-character marker; any text left after
        # the last marker is not returned
        doc = doc[index + 5:]
    return articles

In [None]:
# split the articles in each document set
articles = df_news["document"].progress_apply(split_articles)
# storys

  0%|          | 0/1000 [00:00<?, ?it/s]

In [None]:
len(articles[0]), len(articles[1])

(4, 2)
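
As a cross-check of the splitting logic, the sketch below shows a shorter variant built on `str.split`. The function name `split_articles_simple` and the `keep_tail` flag are hypothetical and are not part of the notebook; with `keep_tail=False` the variant is meant to return the same list as the loop in `split_articles`, which discards whatever follows the last `|||||` marker.

```python
SEPARATOR = '|||||'


def split_articles_simple(doc, keep_tail=False):
    """Hypothetical variant of split_articles built on str.split."""
    parts = doc.split(SEPARATOR)
    # the last element is the text after the final separator; it is
    # dropped by default to mirror the loop-based version
    return parts if keep_tail else parts[:-1]


doc = 'first article ||||| second article ||||| trailing text'
print(split_articles_simple(doc))
print(split_articles_simple(doc, keep_tail=True))
```

Setting `keep_tail=True` is one way to check whether a document set carries an article after its last marker that the loop would otherwise lose.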

In [None]:
# build a dataset from the split results and
# expand it so that each row holds one story/article
df_articles  = pd.DataFrame(articles).explode('document').reset_index(drop=True)
df_articles.shape 

(2729, 1)

In [None]:
# drop rows that may have been left empty by the article split
df_articles.dropna(inplace=True)

In [None]:
df_articles.head()

Unnamed: 0,document
0,Flag Flap Underscores Trump's Strained Relatio...
1,Tension between President Donald Trump and Se...
2,After ignoring repeated questions all day abo...
3,WASHINGTON (Reuters) - The White House lowere...
4,Over 60 years after the first excavations at Q...


Now that we have the list of articles in a dataset, we run the cleaning and formatting steps needed to apply the TextRank algorithm. We will load two libraries focused on NLP tasks: nltk and spaCy.

In [None]:
import re

import nltk
import spacy
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize 

nlp = spacy.load('en_core_web_sm')
nltk.download('stopwords')
stopwords = stopwords.words('english')
punctuations = '!"#$%&\'()*+,-/:;<=>?@[\\]^_`{|}~©'

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
def cleanup_text(text):
    '''Remove newline markers, possessive 's, and extra spaces.'''
    # replaced = text.replace("b\'",'')
    replaced = text.replace("\n ","")  
    replaced = replaced.replace("\'s","")
    replaced = replaced.replace(" .",".")
    replaced = replaced.replace("  "," ")
    replaced = replaced.strip()
    return replaced

In [None]:
def process_text(docs, logging=False):
    '''Clean the text, split it into tokens, and convert it to lowercase.'''
    texts = []
    
    # in case of large docs, split docs into sentences
    sentences = sent_tokenize(docs)
    sent_cleaned = []
    for sent in sentences:
        # note: str.replace matches this pattern literally, so the call below
        # leaves the text unchanged; re.sub would be needed for a regex substitution
        sent=sent.replace("[^a-zA-Z0-9]"," ")
        doc = nlp(sent, disable=['parser', 'ner'])
        tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
        tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuations] 
        tokens = ' '.join(tokens)
        # tokens = tokens.replace("b\'",'')
        # tokens = tokens.replace("\\n ",'')  
        # tokens = tokens.replace("\\\'s",'') 
        # print(tokens)
        # tokens = tokens.strip()
        sent_cleaned.append(tokens)

    sent_cleaned = ' '.join(str(x) for x in sent_cleaned)
    texts.append(sent_cleaned)
    return pd.Series(texts)

# str_test = "Flag Flap Underscores Trump Strained Relationship With McCain  Enlarge this image toggle caption Mandel Ngan/AFP/Getty Images Mandel Ngan/AFP/Getty Images  Updated at 9:37 p.m. ET  The beginning of the national memorial for Sen. John McCain, R-Ariz., has been marred by a fight over a sign of public respect, as President Trump initially avoided issuing a proclamation to lower flags to half-staff at all federal properties in McCain honor.  Flags were lowered at government buildings across Washington and across the country Saturday evening after McCain died, as is standard practice for a sitting member of Congress. "
# process_text(str_test)[0]
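
The `sent.replace("[^a-zA-Z0-9]", " ")` call in `process_text` passes a regular-expression pattern to `str.replace`, which only looks for that exact sequence of characters. The short sketch below, using a made-up sentence, shows the difference from `re.sub`. It is an illustration only: switching the notebook to `re.sub` would change the processed text and the outputs shown in the cells that follow.

```python
import re

sentence = "Updated at 9:37 p.m. ET (half-staff)"

# str.replace searches for the literal text "[^a-zA-Z0-9]", which never occurs
literal = sentence.replace("[^a-zA-Z0-9]", " ")

# re.sub treats the same string as a character class and replaces every match
by_regex = re.sub(r"[^a-zA-Z0-9]", " ", sentence)

print(literal == sentence)  # the literal replace changed nothing
print(by_regex.split())     # punctuation is gone after the regex substitution
```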

In [None]:
df_articles['document'][0]

'Flag Flap Underscores Trump\'s Strained Relationship With McCain \n  \n Enlarge this image toggle caption Mandel Ngan/AFP/Getty Images Mandel Ngan/AFP/Getty Images \n  \n Updated at 9:37 p.m. ET \n  \n The beginning of the national memorial for Sen. John McCain, R-Ariz., has been marred by a fight over a sign of public respect, as President Trump initially avoided issuing a proclamation to lower flags to half-staff at all federal properties in McCain\'s honor. \n  \n Flags were lowered at government buildings across Washington and across the country Saturday evening after McCain died, as is standard practice for a sitting member of Congress. \n  \n But on Monday morning the flag atop the White House was back at full-staff, causing some to ask whether Trump\'s strained relationship with McCain had played into the decision to not keep it lowered. The lack of a proclamation was viewed by some as a disrespectful act reflecting the president\'s dislike for McCain, which Trump continued to 

In [None]:
# apply the cleanup function to the markers left over from the
# UTF-8 decoding
df_articles['document_cleaned'] = df_articles['document'].progress_apply(cleanup_text)
df_articles['document_cleaned'][0]

  0%|          | 0/2729 [00:00<?, ?it/s]

'Flag Flap Underscores Trump Strained Relationship With McCain Enlarge this image toggle caption Mandel Ngan/AFP/Getty Images Mandel Ngan/AFP/Getty Images Updated at 9:37 p.m. ET The beginning of the national memorial for Sen. John McCain, R-Ariz., has been marred by a fight over a sign of public respect, as President Trump initially avoided issuing a proclamation to lower flags to half-staff at all federal properties in McCain honor. Flags were lowered at government buildings across Washington and across the country Saturday evening after McCain died, as is standard practice for a sitting member of Congress. But on Monday morning the flag atop the White House was back at full-staff, causing some to ask whether Trump strained relationship with McCain had played into the decision to not keep it lowered. The lack of a proclamation was viewed by some as a disrespectful act reflecting the president dislike for McCain, which Trump continued to express publicly, even as recently as last week

In [None]:
# clean the story text into a new feature:
# convert the text to lowercase, remove
# punctuation and markers left by spaCy,
# and remove low-information words and English stopwords
# such as prepositions, pronouns, and adverbs
df_articles['document_cleaned_formated'] = df_articles['document_cleaned'].progress_apply(process_text)
df_articles['document_cleaned_formated'][0]

  0%|          | 0/2729 [00:00<?, ?it/s]

"flag flap underscore trump strained relationship mccain enlarge image toggle caption mandel ngan afp getty images mandel ngan afp getty images update 9:37 p.m. et beginning national memorial sen. john mccain r ariz. mar fight sign public respect president trump initially avoid issue proclamation low flag half staff federal property mccain honor . flag lower government building across washington across country saturday evening mccain die standard practice sit member congress . monday morning flag atop white house back full staff cause ask whether trump strained relationship mccain play decision keep lower . lack proclamation view disrespectful act reflect president dislike mccain trump continue express publicly even recently last week . hour reporter question white house move president ignore multiple press attempt ask reaction mccain death white house flag eventually lower half staff monday afternoon . trump say statement release shortly afterward despite difference policy politic res

In [None]:
# save a file with the preprocessed stories
df_articles.to_csv('articles.csv.gz', compression='gzip')

In [None]:
ls -l

total 6360
-rw-r--r-- 1 root root 6502331 Nov  3 12:06 articles.csv.gz
drwxr-xr-x 4 root root    4096 Nov  3 11:33 data/
drwxr-xr-x 1 root root    4096 Nov  1 13:35 sample_data/


**Analyzing document length**

In [None]:
# store the length of the cleaned and formatted texts as features
df_articles['document_cleaned_lenght'] = df_articles['document_cleaned'].str.len()
df_articles['document_cleaned_formated_lenght'] = df_articles['document_cleaned_formated'].str.len()

In [None]:
df_articles['document_cleaned_lenght'].describe()

count      2729.000000
mean       3882.851228
std        4825.630377
min          26.000000
25%        1431.000000
50%        2862.000000
75%        4947.000000
max      101328.000000
Name: document_cleaned_lenght, dtype: float64

Maximum length of an original article in this sample => 101,328 characters (`str.len()` counts characters, not words)

In [None]:
df_articles['document_cleaned_formated_lenght'].describe()

count     2729.000000
mean      2698.946134
std       3555.861832
min         12.000000
25%        989.000000
50%       1973.000000
75%       3417.000000
max      90116.000000
Name: document_cleaned_formated_lenght, dtype: float64

Maximum length of an article after processing in this sample => 90,116 characters
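
`Series.str.len()` returns the number of characters in each string, so the statistics above are character counts. A minimal sketch of the difference between characters and words, on a sentence taken from the first article (splitting on whitespace is an assumption made for this example; the notebook itself does not compute word counts):

```python
text = "Flags were lowered at government buildings across Washington"

n_chars = len(text)          # what Series.str.len() would report for this string
n_words = len(text.split())  # whitespace-separated tokens

print(n_chars, n_words)
```

On a DataFrame column, a word count along the same lines could be obtained with `.str.split().str.len()`.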

In [None]:
df_articles['document_cleaned_formated_lenght'].sort_values()

2205       12
2006       17
375        25
2671       26
2272       42
        ...  
675     28175
1786    29712
2294    37389
1552    61426
622     90116
Name: document_cleaned_formated_lenght, Length: 2729, dtype: int64

In [None]:
# check whether any article has length 0
df_articles[df_articles['document_cleaned_formated_lenght'] ==0]

Unnamed: 0,document,document_cleaned,document_cleaned_formated,document_cleaned_formated_lenght,document_cleaned_lenght


##**3 - Generating the Summary**

###**3.1 - Using the spaCy Library**

One technique for extractive summarization is the TextRank algorithm, derived from PageRank, which was developed by Google. The method used here finds the highest-ranked sentences of a document using a weighted frequency of the document's words. In this notebook we use a variation very similar to TextRank. The steps are:

1. Split each document into sentences;
2. Process the sentence text: convert to lowercase and remove stopwords and special characters;
3. Tokenize the sentences, so that each word (token) is an element in the frequency computation;
4. Compute the weighted frequency: for each word, divide its frequency by the frequency of the most frequent word in the document;
5. Replace the words in each original sentence with their weighted frequencies;
6. Sort the sentences in descending order of summed frequency.
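
The steps above can be sketched on a toy document with the standard library only. This is a simplified illustration under stated assumptions, not the notebook's function: `STOPWORDS` is a tiny made-up list, tokenization is a whitespace split, and `score_sentences` is a hypothetical helper name.

```python
from collections import Counter

STOPWORDS = {"the", "a", "is", "on", "and", "of"}  # toy list, an assumption


def score_sentences(sentences):
    """Score each sentence by the sum of its words' weighted frequencies."""
    # steps 2-3: lowercase, tokenize, and drop stopwords
    tokenized = [[w for w in s.lower().split() if w not in STOPWORDS] for s in sentences]
    # step 4: divide every count by the count of the most frequent word
    counts = Counter(w for words in tokenized for w in words)
    top = max(counts.values(), default=1)
    weights = {w: c / top for w, c in counts.items()}
    # step 5: each sentence gets the sum of the weights of its words
    return {s: sum(weights[w] for w in words) for s, words in zip(sentences, tokenized)}


sentences = [
    "the flag is on the white house",
    "the flag flies",
    "a dog barks",
]
scores = score_sentences(sentences)
# step 6: sort in descending order of score
ranking = sorted(scores, key=scores.get, reverse=True)
for sent in ranking:
    print(round(scores[sent], 2), sent)
```

Because the score is a plain sum, longer sentences tend to rank higher; a cap on sentence length (the function below skips sentences of 50 or more words) is one simple way to limit that effect.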

In [None]:
nltk.download('punkt')
import heapq

def generate_summary(text_without_removing_dot, cleaned_text, logging=False):
    sample_text = text_without_removing_dot

    if len(sample_text)>0:
        # run the text through the spaCy pipeline
        doc = nlp(sample_text)
        # we are using spacy for sentence tokenization
        sentence_list=[]
        for idx, sentence in enumerate(doc.sents): 
            sentence_list.append(re.sub(r'[^\w\s]','',str(sentence)))

        # get stopwords list from english
        stopwords = nltk.corpus.stopwords.words('english')

        # calculate words frequency - TF
        word_frequencies = {}  
        for word in nltk.word_tokenize(cleaned_text):  
            if word not in stopwords:
                if word not in word_frequencies.keys():
                    word_frequencies[word] = 1
                else:
                    word_frequencies[word] += 1

        # find the highest word frequency (default avoids an error on an empty dict)
        maximum_frequency = max(word_frequencies.values(), default=1)

        # normalize each count by the maximum: a weighted term frequency, not TF-IDF
        for word in word_frequencies.keys():  
            word_frequencies[word] = (word_frequencies[word]/maximum_frequency)

        # compute a score for every sentence
        sentence_scores = {}  
        for sent in sentence_list:  
            for word in nltk.word_tokenize(sent.lower()):
                if word in word_frequencies.keys():
                    if len(sent.split(' ')) < 50: # max number of words in each sentence
                        if sent not in sentence_scores.keys():
                            sentence_scores[sent] = word_frequencies[word]
                        else:
                            sentence_scores[sent] += word_frequencies[word]

        # keep the seven highest-scoring sentences
        summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
        summary = ' '.join(summary_sentences)

        # fall back to the first sentence when no summary could be computed
        if len(summary) == 0:
            summary = sentence_list[0]

        if logging:
            print("Original Text:\n")
            print(text_without_removing_dot)
            print('\n\nSummarized text:\n')
            print(summary)
    else:
        summary = sample_text
    
    return summary

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Testing the summary function on the first element of the articles dataset.

In [None]:
len(df_articles.document_cleaned[0])

5629

In [None]:
len(generate_summary(df_articles.document_cleaned[0], df_articles.document_cleaned_formated[0], logging=False))

891

In [None]:
# generate the summary of the first article
generate_summary( df_articles['document_cleaned'][0], df_articles['document_cleaned_formated'][0], logging=True)

Original Text:

Flag Flap Underscores Trump Strained Relationship With McCain Enlarge this image toggle caption Mandel Ngan/AFP/Getty Images Mandel Ngan/AFP/Getty Images Updated at 9:37 p.m. ET The beginning of the national memorial for Sen. John McCain, R-Ariz., has been marred by a fight over a sign of public respect, as President Trump initially avoided issuing a proclamation to lower flags to half-staff at all federal properties in McCain honor. Flags were lowered at government buildings across Washington and across the country Saturday evening after McCain died, as is standard practice for a sitting member of Congress. But on Monday morning the flag atop the White House was back at full-staff, causing some to ask whether Trump strained relationship with McCain had played into the decision to not keep it lowered. The lack of a proclamation was viewed by some as a disrespectful act reflecting the president dislike for McCain, which Trump continued to express publicly, even as recent

'Flag Flap Underscores Trump Strained Relationship With McCain Enlarge this image toggle caption Mandel NganAFPGetty Images Mandel NganAFPGetty Images Updated at 937 pm Senator John McCain was an American hero and cherished member of The American Legion Senator John McCain was a longserving US Senator naval officer strong advocate for NATO and a good friend to Canada The veterans group AMVETS also issued a statement calling the president actions since McCain death deeply disappointing Flags were lowered at government buildings across Washington and across the country Saturday evening after McCain died as is standard practice for a sitting member of Congress We very much appreciate everything that Senator McCain has done for our country the president said Later Monday evening at a dinner with evangelical leaders Trump made his first public comments since the senator death Saturday'

Testing with the first 100 articles:

In [None]:
df_sample = df_articles[0:100].copy()  # copy so the new column is not written to a slice
df_sample['generated_summary'] = df_sample.progress_apply(lambda x: generate_summary(x.document_cleaned, x.document_cleaned_formated), axis=1)
df_sample

In [None]:
df_sample.to_csv('storys_summarized_sample.csv', compression='gzip')

Now processing the complete dataset:

In [None]:
# create a new column to store the generated summary
# NOTE: this step takes a long time
df_articles['generated_summary'] = df_articles.progress_apply(lambda x: generate_summary(x.document_cleaned, x.document_cleaned_formated), axis=1)
df_articles['generated_summary_lenght'] = df_articles['generated_summary'].str.len()
df_articles.head()

  0%|          | 0/2729 [00:00<?, ?it/s]

Unnamed: 0,document,document_cleaned,document_cleaned_formated,document_cleaned_formated_lenght,document_cleaned_lenght,generated_summary,generated_summary_lenght
0,Flag Flap Underscores Trump's Strained Relatio...,Flag Flap Underscores Trump Strained Relations...,flag flap underscore trump strained relationsh...,3999,5629,ET The beginning of the national memorial for ...,1567
1,Tension between President Donald Trump and Se...,Tension between President Donald Trump and Sen...,tension president donald trump sen. john mccai...,4251,6026,President Donald Trump delay in honoring the l...,1546
2,After ignoring repeated questions all day abo...,After ignoring repeated questions all day abou...,ignore repeat question day whether would say a...,3095,4221,Add John McCain as an interest to stay up to d...,1304
3,WASHINGTON (Reuters) - The White House lowere...,WASHINGTON (Reuters) - The White House lowered...,washington reuters white house lower u.s. flag...,2624,3704,The White House lowered its US flag to halfsta...,1420
4,Over 60 years after the first excavations at Q...,Over 60 years after the first excavations at Q...,60 year first excavation qumran researcher heb...,2958,4398,Over 60 years after the first excavations at Q...,1523


In [None]:
# check the length of the original story and of the summary
len(df_articles.document[0]), len(df_articles.generated_summary[0])

(5783, 1567)

In [None]:
# compare the string lengths of the original article and of the summary
# NOTE: on the full dataset this operation takes a while
df_articles.document.str.len().mean(), df_articles.generated_summary.str.len().mean()

(3979.47050201539, 1055.837303041407)

In [None]:
df_articles.document.str.len().min(), df_articles.generated_summary.str.len().min()

(28, 26)

In [None]:
# check whether there are empty summaries
df_articles[df_articles.generated_summary_lenght<1]

Unnamed: 0,document,document_cleaned,document_cleaned_formated,document_cleaned_formated_lenght,document_cleaned_lenght,generated_summary,generated_summary_lenght


In [None]:
# summary of the first article
df_articles.generated_summary[0]

'ET The beginning of the national memorial for Sen John McCain RAriz has been marred by a fight over a sign of public respect as President Trump initially avoided issuing a proclamation to lower flags to halfstaff at all federal properties in McCain honor Hours after reporters questioned the White House about the move and the president ignored multiple press attempts to ask his reaction to McCain death the White House flag was eventually lowered to halfstaff Monday afternoon Trump said in a statement released shortly afterward Despite our differences on policy and politics I respect Senator John McCain service to our country and in his honor have signed a proclamation to fly the flag of the United States at halfstaff until the day of his interment The Washington Post reported that when White House press secretary Sarah Sanders and other officials initially prepared a statement in Trump name praising McCain the president rejected that plan opting instead for his Saturday tweet It outrag

In [None]:
# save a file with the summarized stories
df_articles.to_csv('storys_summarized.csv.gz', compression='gzip')

###**3.2 - Using the PyTextRank Library**

There is a library that implements the TextRank method (pytextrank) and can be used for extractive summarization. Below is a test of this library together with a spaCy pipeline.

In [None]:
!pip install pytextrank

In [None]:
# import spacy
import pytextrank

# example text
# text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."
text = df_articles.document_cleaned[0]

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank")
doc = nlp(text)

# examine the top-ranked phrases in the document
# for phrase in doc._.phrases:
#     print(phrase.text)
#     print(phrase.rank, phrase.count)
#     print(phrase.chunks)

# for phrase in doc._.phrases[:10]:
#     print(phrase)

tr = doc._.textrank
for sent in tr.summary(limit_phrases=10, limit_sentences=3):
    print(sent)


ET The beginning of the national memorial for Sen. John McCain, R-Ariz., has been marred by a fight over a sign of public respect, as President Trump initially avoided issuing a proclamation to lower flags to half-staff at all federal properties in McCain honor.
Hours after reporters questioned the White House about the move and the president ignored multiple press attempts to ask his reaction to McCain death, the White House flag was eventually lowered to half-staff Monday afternoon.
The veterans group AMVETS also issued a statement calling the president actions since McCain death deeply disappointing.


##**4 - Evaluating the Result**

###**4.1 - Computing the ROUGE Metrics**

To compute the ROUGE metrics we use the rouge library, which can be installed via pip.

In [None]:
# install the rouge library. Make sure version 1.0.1 is selected.
!pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


The get_scores function returns the metrics listed in the `metrics` parameter:

In [None]:
# compute the ROUGE metrics for the whole dataset
from rouge import Rouge
rouge = Rouge(metrics=['rouge-1', 'rouge-2','rouge-l'])
# rouge = Rouge(metrics=['rouge-n', 'rouge-l', 'rouge-w'])

# the avg=True parameter computes the average over the dataset
rouge.get_scores(df_articles.generated_summary,
                 df_articles.document, 
                 avg=True)

{'rouge-1': {'f': 0.5830407188271377,
  'p': 0.9045512281857051,
  'r': 0.46787875160045383},
 'rouge-2': {'f': 0.48046573983053303,
  'p': 0.7835732339525427,
  'r': 0.3853929468061236},
 'rouge-l': {'f': 0.577619245233806,
  'p': 0.8957396083363237,
  'r': 0.4635808325237329}}
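
A quick sanity check on the numbers above: recomputing F1 from the averaged precision and recall does not reproduce the reported `f` value. The sketch below uses only the ROUGE-1 figures printed in the output. The interpretation is an assumption, not something confirmed by the library documentation here: the gap is what one would expect if `avg=True` averages the per-article F-scores instead of deriving F from the averaged `p` and `r`.

```python
# averaged ROUGE-1 values copied from the output above
p = 0.9045512281857051
r = 0.46787875160045383
f_reported = 0.5830407188271377

# F1 recomputed from the averaged precision and recall
f_from_means = 2 * p * r / (p + r)

print(round(f_from_means, 4), round(f_reported, 4))
```

Either way of aggregating is defensible, but the two should not be mixed when comparing results against other work.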

###**4.2 - Conclusion**

Looking at the three metrics, precision is close to 1 (the closer to 1, the better): most of the n-grams in the generated summary also appear in the reference text. For ROUGE-2, for example, precision is 0.7835. This is expected here, because the reference passed to `get_scores` is the original article and the summary is built from sentences extracted from it.

Recall, however, is low, which means that the number of n-grams shared by the two texts (model output and reference) is well below the number of n-grams in the reference text. This is because the articles in our dataset are long texts, while the summary keeps at most seven sentences.

The result looks satisfactory, since benchmarks for extractive summarization methods report ROUGE and F1 values close to these, bearing in mind that such benchmarks usually score against human-written summaries instead of the source article.

##**5 - Saving the Files**

**Using Google Drive**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# copy the compressed datasets
!cp *.gz /content/drive/My\ Drive/Colab\ Notebooks/files/

In [None]:
drive.flush_and_unmount()
print('All changes made in this colab session should now be visible in Drive.')

All changes made in this colab session should now be visible in Drive.


**Using the GCS API**

In [None]:
from google.colab import auth
auth.authenticate_user()

In [None]:
!gsutil cp storys_summarized.csv.gz gs://alice-platform.appspot.com/files