# Natural Language Processing (NLP) - Basics

SERPRO - SUPSS - DIVISÃO DE DESENVOLVIMENTO E SUSTENTAÇÃO DE PRODUTOS COGNITIVOS

## Part 2: Example - Text Summarizer

__Goal__: Automatically summarize a text using NLP, that is, generate a summary of it.

__Tasks__:
1. Extract the text from a data source. In this example, an article published on the web.
2. Use NLTK to preprocess the data: tokenization, stop-word removal, etc.
3. Apply statistics to generate the summary of the text.

__To generate the summary__:

1. Find the most important words. Word importance => word frequency
2. Compute a significance score for each sentence based on its words. Significance score => Sum(importance of its words)
3. Select the most significant sentences (ranking)

__In short:__

Extract the text > Preprocess > Extract sentences > Build the ranking
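
Before running the real pipeline, the three scoring steps listed above can be tried on a tiny input. The sketch below is only an illustration: the toy text, the small stop-word set and the regex-based tokenization are assumptions made for this example, standing in for the article and the NLTK tools used later.

```python
import re
from collections import Counter

# Hypothetical toy text and stop-word list, used only for illustration.
TOY_TEXT = "Space debris is a problem. The telescope tracks debris in space. Cats sleep."
TOY_STOPWORDS = {"is", "a", "the", "in"}

def tokenize(text):
    # Naive word tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def toy_scores(text, stopwords):
    # Naive sentence split after ., ! or ? (NLTK's sent_tokenize is more robust).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Step 1: word importance = frequency over the whole text, stop words excluded.
    freq = Counter(w for w in tokenize(text) if w not in stopwords)
    # Step 2: sentence score = sum of the frequencies of its words
    # (a Counter returns 0 for words it has never seen, such as stop words).
    scores = [sum(freq[w] for w in tokenize(s)) for s in sentences]
    return sentences, freq, scores

sentences, freq, scores = toy_scores(TOY_TEXT, TOY_STOPWORDS)
# Step 3: rank sentence indices by score, highest first.
ranking = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
```

With this input the second sentence ranks first because it contains both of the repeated words ("space" and "debris") plus two other content words.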

In [1]:
# Imports
from urllib.request import urlopen

# Install prerequisites (notebook shell commands)
!pip install bs4
!pip install lxml

# lxml is optional here: the cells below use Python's built-in "html.parser"
import lxml
from lxml import html

# Library used to parse the HTML downloaded from the URL
from bs4 import BeautifulSoup



## Web Scraping

The article to be summarized is downloaded from the web.

In [2]:
# Link to the article
articleURL = "https://www.washingtonpost.com/news/the-switch/wp/2016/10/18/the-pentagons-massive-new-telescope-is-designed-to-track-space-junk-and-watch-out-for-killer-asteroids"

In [3]:
# Web Scraping
page = urlopen(articleURL).read().decode('utf8','ignore') 
soup = BeautifulSoup(page, "html.parser")

In [4]:
# Collecting the data inside the first "article" tag
soup.find('article')

<article class="paywall" itemprop="articleBody"> <div class="inline-content inline-video"> <div class="posttv-video-embed powa" data-ad-bar="1" data-aspect-ratio="0.5625" data-blurb="1" data-live="0" data-object-id="580640a4e4b0d16481f68b07" data-org="wapo" data-playthrough="1" data-uuid="24fd7912-9548-11e6-9cae-2a3574e296a6" data-youtube-id="aGH_nLCOWOw"> <script async="" src="https://d1pz6dax0t5mop.cloudfront.net/prod/PoWaLoaderWapo.js?_=20181113"></script> </div> <div class="inline-video-caption"> <span class="pb-caption">The Space Surveillance Telescope offers improvements in determining the orbits of newly discovered objects and provides rapid observations of events that may only occur over a relatively short period of time, like a supernova. (DARPAtv)</span> </div> </div> <div class="flex-embed-dummy-script"><script charset="utf-8"></script></div> <p>There are a lot of rocks flying around through space. Lots of debris, too. Old satellites, spent rocket boosters, even for a short 
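
To make explicit what "take the text inside the article tag" involves, here is a dependency-free sketch that uses Python's built-in `html.parser.HTMLParser` rather than BeautifulSoup. The class name `ArticleTextExtractor`, its attributes and the sample HTML are hypothetical, and skipping `<script>`/`<style>` content is a simplifying choice of this sketch, not a description of how BeautifulSoup behaves.

```python
from html.parser import HTMLParser

class ArticleTextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes found inside <article> elements."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0  # number of <article> tags currently open
        self.skip_depth = 0     # number of <script>/<style> tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth:
            self.article_depth -= 1
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside an article and outside script/style.
        if self.article_depth and not self.skip_depth:
            self.chunks.append(data)

# Made-up page used only for this illustration.
sample_html = (
    "<html><body><nav>Menu</nav><article><p>First paragraph.</p>"
    "<script>var x = 1;</script><p>Second one.</p></article></body></html>"
)
extractor = ArticleTextExtractor()
extractor.feed(sample_html)
extractor.close()
article_text = " ".join(chunk.strip() for chunk in extractor.chunks if chunk.strip())
```

The navigation text and the script body are dropped, leaving only the two paragraphs of the article.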

In [5]:
# Printing only the text inside the tag
soup.find('article').text

"      The Space Surveillance Telescope offers improvements in determining the orbits of newly discovered objects and provides rapid observations of events that may only occur over a relatively short period of time, like a supernova. (DARPAtv)    There are a lot of rocks flying around through space. Lots of debris, too. Old satellites, spent rocket boosters, even for a short while a spatula that got loose during a space shuttle mission in 2006. All of it swirling around in orbit, creating a bit of a traffic jam. For years, the Pentagon has been worried about the collisions that might be caused by an\xa0estimated 500,000 pieces of debris, taking out enormously valuable satellites and, in turn, creating even more debris. On Tuesday, the Defense Department\xa0took another significant step toward monitoring all of the cosmic junk swirling around in space, by delivering\xa0a gigantic new telescope capable of seeing small objects from very far away. Developed by the Defense Advanced Research

In [6]:
# Concatenating the text of all "article" tags, in case there is more than one
text = ' '.join(map(lambda p: p.text, soup.find_all('article')))
text

"      The Space Surveillance Telescope offers improvements in determining the orbits of newly discovered objects and provides rapid observations of events that may only occur over a relatively short period of time, like a supernova. (DARPAtv)    There are a lot of rocks flying around through space. Lots of debris, too. Old satellites, spent rocket boosters, even for a short while a spatula that got loose during a space shuttle mission in 2006. All of it swirling around in orbit, creating a bit of a traffic jam. For years, the Pentagon has been worried about the collisions that might be caused by an\xa0estimated 500,000 pieces of debris, taking out enormously valuable satellites and, in turn, creating even more debris. On Tuesday, the Defense Department\xa0took another significant step toward monitoring all of the cosmic junk swirling around in space, by delivering\xa0a gigantic new telescope capable of seeing small objects from very far away. Developed by the Defense Advanced Research

In [7]:
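# Replacing unwanted characters (non-breaking spaces and curly quotes)
# The result is only displayed here; the variable `text` itself is not changed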
text.replace("\xa0"," ").replace("“"," ").replace("”"," ")

"      The Space Surveillance Telescope offers improvements in determining the orbits of newly discovered objects and provides rapid observations of events that may only occur over a relatively short period of time, like a supernova. (DARPAtv)    There are a lot of rocks flying around through space. Lots of debris, too. Old satellites, spent rocket boosters, even for a short while a spatula that got loose during a space shuttle mission in 2006. All of it swirling around in orbit, creating a bit of a traffic jam. For years, the Pentagon has been worried about the collisions that might be caused by an estimated 500,000 pieces of debris, taking out enormously valuable satellites and, in turn, creating even more debris. On Tuesday, the Defense Department took another significant step toward monitoring all of the cosmic junk swirling around in space, by delivering a gigantic new telescope capable of seeing small objects from very far away. Developed by the Defense Advanced Research Project 

In [8]:
# The previous steps can be wrapped in a single function
def getTextWaPo(url):
    # Download the page; 'ignore' drops bytes that are not valid UTF-8, as in cell 3
    page = urlopen(url).read().decode('utf8', 'ignore')
    soup = BeautifulSoup(page, "html.parser")
    # Join the text of every "article" tag
    text = ' '.join(map(lambda p: p.text, soup.find_all('article')))
    # Replace non-breaking spaces and curly quotes with plain spaces
    return text.replace("\xa0"," ").replace("“"," ").replace("”"," ")

In [9]:
# Applying the function to retrieve the text
text = getTextWaPo(articleURL)
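
The chained `replace` calls used in `getTextWaPo` can also be written as a single translation table. The sketch below is one possible variant, not part of the original notebook: the helper name `clean_text` is hypothetical, and collapsing repeated whitespace is an extra step added here because the extracted text starts with several spaces and ends up with double spaces where the quotes were.

```python
import re

# Characters replaced by a plain space: non-breaking space and curly double quotes.
_REPLACEMENTS = str.maketrans({"\xa0": " ", "\u201c": " ", "\u201d": " "})

def clean_text(raw):
    """Hypothetical helper: same replacements as getTextWaPo, plus whitespace collapsing."""
    replaced = raw.translate(_REPLACEMENTS)
    # Collapse runs of whitespace into one space and trim both ends (extra step).
    return re.sub(r"\s+", " ", replaced).strip()

example = clean_text("   an\xa0estimated \u201cbig\u201d  telescope ")
```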

## Preprocessing

Now that the text has been extracted from the web, we can preprocess it.

In [10]:
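# The tokenizers and the stop-word list need NLTK data packages, which can be
# fetched once with nltk.download('punkt') and nltk.download('stopwords')
# (recent NLTK releases ask for 'punkt_tab' instead of 'punkt')
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation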

In [11]:
# Tokenization: sentences from the original text, words from the lowercased text
sents = sent_tokenize(text)
words = word_tokenize(text.lower())

In [12]:
# Build the set of stop words plus punctuation characters to be removed from the text
stopwords_english = set(stopwords.words('english') + list(punctuation))

In [13]:
words = [word for word in words if word not in stopwords_english]
print('First 50 words of the text after preprocessing: \n', words[:50])

First 50 words of the text after preprocessing: 
 ['space', 'surveillance', 'telescope', 'offers', 'improvements', 'determining', 'orbits', 'newly', 'discovered', 'objects', 'provides', 'rapid', 'observations', 'events', 'may', 'occur', 'relatively', 'short', 'period', 'time', 'like', 'supernova', 'darpatv', 'lot', 'rocks', 'flying', 'around', 'space', 'lots', 'debris', 'old', 'satellites', 'spent', 'rocket', 'boosters', 'even', 'short', 'spatula', 'got', 'loose', 'space', 'shuttle', 'mission', '2006.', 'swirling', 'around', 'orbit', 'creating', 'bit', 'traffic']


## Computing Word Frequencies

With the text preprocessed, we can compute the word frequencies. These frequencies are then used to rank the sentences.

In [14]:
# Computing the word frequencies with the FreqDist class
from nltk.probability import FreqDist
freq = FreqDist(words)
freq

FreqDist({'space': 15, 'telescope': 9, 'objects': 7, 'debris': 7, 'satellites': 7, 'orbit': 6, 'air': 6, 'force': 6, 'around': 4, 'small': 4, ...})

In [15]:
# Getting the 10 most frequent words
from heapq import nlargest
nlargest(10, freq, key = freq.get)

['space',
 'telescope',
 'objects',
 'debris',
 'satellites',
 'orbit',
 'air',
 'force',
 'around',
 'small']
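
The call `nlargest(10, freq, key=freq.get)` may look unusual at first. A minimal sketch with a plain dictionary shows what happens: iterating a dictionary yields its keys, and `key=...get` ranks each key by its stored count. The dictionary `toy_freq` is hypothetical; it reuses a few counts from the output above and avoids ties so that the order is unambiguous.

```python
from heapq import nlargest

# Hypothetical counts (a tie-free subset of the frequencies shown above).
toy_freq = {"space": 15, "telescope": 9, "debris": 7, "orbit": 6, "small": 4}

# Iterating a dict yields its keys; key=toy_freq.get ranks each key by its count.
top3 = nlargest(3, toy_freq, key=toy_freq.get)

# Same result with an explicit sort, which orders every key before slicing.
top3_sorted = sorted(toy_freq, key=toy_freq.get, reverse=True)[:3]
```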

## Ranking the Most Important Sentences

In [16]:
# We build a dictionary whose keys are sentence indices and whose values are the significance scores.
# Each score is the sum of the frequencies of the words that appear in the sentence.
from collections import defaultdict
ranking = defaultdict(int)

for i, sent in enumerate(sents):
    for w in word_tokenize(sent.lower()):
        if w in freq:
            ranking[i] += freq[w]

print('Example - score of sentence 18:', ranking[18])

Example - score of sentence 18: 34


In [17]:
# Getting the indices of the 4 most significant sentences
sents_idx = nlargest(4, ranking, key=ranking.get)
sents_idx

[27, 17, 6, 19]

In [18]:
# Sorting the indices so the sentences appear in their original order
sorted(sents_idx)

[6, 17, 19, 27]

In [19]:
# Returning the sentences at the indices selected above
[sents[j] for j in sorted(sents_idx)]

['On Tuesday, the Defense Department took another significant step toward monitoring all of the cosmic junk swirling around in space, by delivering a gigantic new telescope capable of seeing small objects from very far away.',
 'The telescope is  a big improvement over the legacy ground-based optical telescopes that are used by the U.S. Air Force, because it can search large areas of sky and also track very faint (small) objects in and around GEO,  Brian Weeden, a Technical Advisor at the Secure World Foundation, wrote in an email.',
 'The telescope would join another new space debris tracking technology known as the Space Fence, which is now being built by Bethesda-based Lockheed Martin.',
 'Every military operation that takes place in the world today is critically dependent on space in one way or another,  Air Force Gen. John Hyten said in an interview earlier this year when he was the commander of the Air Force Space Command.']

## Summarizing the Text

In [20]:
# Now let's create a function that combines all the previous steps
def summarize(text, n):
    # Split the text into sentences
    sents = sent_tokenize(text)

    # The summary cannot have more sentences than the text
    assert n <= len(sents)
    word_sent = word_tokenize(text.lower())
    stopwords_english = set(stopwords.words('english') + list(punctuation))

    # Word importance: frequency after removing stop words and punctuation
    word_sent = [word for word in word_sent if word not in stopwords_english]
    freq = FreqDist(word_sent)

    # Sentence score: sum of the frequencies of its words
    ranking = defaultdict(int)

    for i, sent in enumerate(sents):
        for w in word_tokenize(sent.lower()):
            if w in freq:
                ranking[i] += freq[w]

    # Keep the n best sentences, returned in their original order
    sents_idx = nlargest(n, ranking, key=ranking.get)
    return [sents[j] for j in sorted(sents_idx)]

In [21]:
# Summarizing the text with the 3 most important sentences.
summarize(text,3)

['On Tuesday, the Defense Department took another significant step toward monitoring all of the cosmic junk swirling around in space, by delivering a gigantic new telescope capable of seeing small objects from very far away.',
 'The telescope is  a big improvement over the legacy ground-based optical telescopes that are used by the U.S. Air Force, because it can search large areas of sky and also track very faint (small) objects in and around GEO,  Brian Weeden, a Technical Advisor at the Secure World Foundation, wrote in an email.',
 'Every military operation that takes place in the world today is critically dependent on space in one way or another,  Air Force Gen. John Hyten said in an interview earlier this year when he was the commander of the Air Force Space Command.']

# Exercise

Create a function (summarize_tfidf).
This function should use TF-IDF, instead of raw frequency, to compute word importance.

Hint: use _TfidfVectorizer()_
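
As a starting point for the exercise, the sketch below computes TF-IDF weights by hand on a toy input, treating each sentence as a "document". It is not the solution and does not call `TfidfVectorizer`; the toy sentences and the function names are hypothetical. It assumes the smoothed IDF variant, ln((1 + n) / (1 + df)) + 1, which the scikit-learn documentation describes as the default behaviour (`smooth_idf=True`), and it leaves out the row normalization that `TfidfVectorizer` also applies by default.

```python
import math
from collections import Counter

# Hypothetical tokenized "documents" (one list of words per sentence).
docs = [
    ["space", "debris", "problem"],
    ["telescope", "tracks", "debris", "space"],
    ["space", "fence"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in docs for term in set(doc))

def smoothed_idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so a term present everywhere gets weight 1.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    # Raw term count times IDF (no length normalization in this sketch).
    tf = Counter(doc)
    return {term: count * smoothed_idf(term) for term, count in tf.items()}

weights = tfidf(docs[1])
```

Unlike raw frequency, this weighting gives "telescope" (one sentence) more weight than "debris" (two sentences), which in turn outweighs "space" (every sentence).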

# End