
# Extraction-based Text Summarization

## Introduction

Text summarization is the task of generating a short text that captures the content of a longer one. It is sometimes called automatic summarization, because the process is carried out by software.

There are a number of ways to build a text summarizer; one of them is the **SumBasic** system. Here is the algorithm:

1. Compute the probability of a word $w$
   $$P(w) = \frac{f(w)}{N}$$
   where $f(w)$ is the number of occurrences of the word, and $N$ is the total number of words in the input.
   
2. For each sentence $S_j$ in the input, assign a weight equal to the average probability of the words in the sentence:
   $$g(S_j) =\frac{\sum_{w_i\in S_j} P(w_i)}{|\{w_i|w_i\in S_j\}|}$$
   
3. Pick the best-scoring sentence that contains the highest-probability word.

4. For each word in the chosen sentence, update its probability:
   $$P_{\text{new}}(w_i) = P_{\text{old}}(w_i)\cdot P_{\text{old}}(w_i)$$
   This update reflects the fact that the probability of a word appearing in the summary twice is lower than the probability of it appearing once.
   
5. If the desired summary length has not been reached, go back to step 2.
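The five steps above can be sketched in plain Python. This is a minimal illustration, not the reference implementation: the function name `sumbasic` is hypothetical, the summary length is assumed to be measured in sentences, the sentence weight averages over all tokens of the sentence, and ties are broken by alphabetical and input order.

```python
from collections import Counter


def sumbasic(sentences, max_sentences):
    """Pick sentence indices following the five SumBasic steps.

    `sentences` is a list of token lists (already cleaned).
    Returns the chosen indices in selection order.
    """
    # Step 1: P(w) = f(w) / N over all tokens in the input.
    counts = Counter(w for sent in sentences for w in sent)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    def weight(i):
        # Step 2: average probability of the words in sentence i.
        return sum(prob[w] for w in sentences[i]) / len(sentences[i])

    chosen = []
    candidates = [i for i, sent in enumerate(sentences) if sent]
    # Step 5: repeat until the summary is long enough.
    while candidates and len(chosen) < max_sentences:
        # Step 3: best-scoring sentence containing the most probable word.
        words = sorted({w for i in candidates for w in sentences[i]})
        top_word = max(words, key=prob.get)
        pool = [i for i in candidates if top_word in sentences[i]]
        best = max(pool, key=weight)
        chosen.append(best)
        candidates.remove(best)
        # Step 4: square the probability of each word just used.
        for w in set(sentences[best]):
            prob[w] **= 2
    return chosen


toy = [
    ["pixel", "camera", "google"],
    ["pixel", "google", "phone"],
    ["pixel", "camera"],
    ["battery", "life"],
]
print(sumbasic(toy, 2))  # [2, 1]
```

In the toy input, sentence 2 wins the first round. Squaring the probabilities of `pixel` and `camera` then pushes sentence 0 below sentence 1 in the second round, which is how the update discourages redundant picks.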

## Step 1. Load the data

The first step is preparing the data. Here is an outline of the whole process:
1. Fetch the data from the URL using `urlopen`
2. Read the fetched data
3. Using `BeautifulSoup`, parse the data and collect the article text (usually found in `<p>` tags)

In [1]:
import bs4
import urllib.request
from nltk.tokenize import sent_tokenize

# only when needed
import nltk
nltk.download('punkt')

# Fetch the content from the URL
url = 'https://www.theverge.com/circuitbreaker/2019/9/18/20868935/google-pixel-4-xl-rumors-leaks-specs-details-colors-cameras-soli'
fetched_data = urllib.request.urlopen(url)

article_read = fetched_data.read()

# Parsing the URL content and storing in a variable
soup = bs4.BeautifulSoup(article_read, 'html.parser')

# Collect the text of every <p> element
article_content = []

for element in soup.find_all('p'):
    article_content.append(element.text)

# Join the paragraphs and split the result into sentences
article_content = ' '.join(article_content)
article_content = sent_tokenize(article_content)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Below are the first 5 sentences from the article.

In [2]:
article_content[:5]

['Filed under: It’s coming October 15th Google’s Pixel 3 might have been the most-leaked phone in history.',
 'Long before its unveiling, we knew practically everything about it from unboxing videos, photo comparisons, even a full review of the device.',
 'Surely, for the Pixel 4, Google would clamp down on leaks to leave some surprises for its debut on October 15th, right?',
 'Let’s just say, if that was the plan, it didn’t exactly work out.',
 'We’ve now seen the Pixel 4 XL from every angle and in three different colors.']

## Step 2. Preprocessing the data

Next, we will create a frequency table, but first we need a cleaned version of the article content (which means we have to lemmatize the content and remove the stop words).

In [3]:
import string

# only when needed
nltk.download('stopwords')
nltk.download('wordnet')

clean_data = []

def preprocessing(text):
    """Lowercase, tokenize and lemmatize, then drop punctuation and stop words."""
    lemmatizer = nltk.stem.WordNetLemmatizer()
    stopwords = set(nltk.corpus.stopwords.words('english'))
    text = text.lower()
    
    result = []
    for token in nltk.word_tokenize(text):
        root = lemmatizer.lemmatize(token)
        # Skip ASCII punctuation and English stop words
        if root in string.punctuation: continue
        if root in stopwords: continue
        result.append(root)
    
    return ' '.join(result)
    
for sent in article_content:
    clean_data.append(preprocessing(sent))
    
clean_data[:5]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['filed ’ coming october 15th google ’ pixel 3 might most-leaked phone history',
 'long unveiling knew practically everything unboxing video photo comparison even full review device',
 'surely pixel 4 google would clamp leak leave surprise debut october 15th right',
 'let ’ say wa plan ’ exactly work',
 '’ seen pixel 4 xl every angle three different color']
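Two leftovers are visible in the cleaned output: `wa`, because "was" is lemmatized before the stop-word check, and `’`, because the curly apostrophe is not part of `string.punctuation`. One possible variant, sketched below, filters on the raw token first and lemmatizes afterwards. The names `clean_tokens`, `toy_stopwords` and `toy_lemmatize` are hypothetical stand-ins so that the sketch runs without NLTK data.

```python
def clean_tokens(tokens, stopwords, lemmatize):
    """Drop punctuation and stop words first, then lemmatize what is left."""
    result = []
    for token in tokens:
        token = token.lower()
        # Skip tokens without any letter or digit (ASCII or curly quotes).
        if not any(ch.isalnum() for ch in token):
            continue
        # Check the raw token, so "was" is caught before it becomes "wa".
        if token in stopwords:
            continue
        result.append(lemmatize(token))
    return result


# Hypothetical stand-ins for the NLTK stop-word list and lemmatizer.
toy_stopwords = {"s", "just", "if", "that", "was", "the"}


def toy_lemmatize(word):
    # Crude plural stripping, only to mimic what a lemmatizer does.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


tokens = ["Let", "’", "s", "just", "say", ",", "if", "that", "was",
          "the", "plan", "cameras"]
print(clean_tokens(tokens, toy_stopwords, toy_lemmatize))
# ['let', 'say', 'plan', 'camera']
```

In the notebook, the same function could be called with the NLTK stop-word set and `nltk.stem.WordNetLemmatizer().lemmatize` in place of the toy stand-ins.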

After cleaning the data, we can build the frequency table.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(clean_data)
vectors

<121x650 sparse matrix of type '<class 'numpy.int64'>'
	with 1445 stored elements in Compressed Sparse Row format>

As you can see, the matrix has size $121\times650$: each row represents a sentence and each column counts the occurrences of one word. Below are the five words with the highest frequency.

In [5]:
import numpy as np

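# Sum each column to get the total count of every word
res = np.sum(vectors.toarray(), axis=0)
vocab = vectorizer.vocabulary_
freq_table = dict()

for word, ix in vocab.items():
    freq_table[word] = res[ix]

# Note: this N is the number of distinct words (650), not the total number
# of words from step 1. It rescales every value by the same constant, so
# the sentence ranking in Step 3 is unaffected.
N = len(freq_table.values())

for word in freq_table:
    freq_table[word] /= N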

sorted(freq_table.items(), key=lambda x: x[1], reverse=True)[:5]

[('pixel', 0.1076923076923077),
 ('google', 0.06153846153846154),
 ('phone', 0.05076923076923077),
 ('know', 0.04923076923076923),
 ('camera', 0.036923076923076927)]

As expected, the article is about the Pixel 4, a new phone from Google, so the top words match the topic of the article.
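In the cell above, `N` is the size of the vocabulary, so the values are counts divided by 650 rather than the $P(w)$ from step 1. A minimal sketch of the definition as stated in the introduction, with $N$ as the total number of tokens, could look like this. The helper `word_probabilities` is a hypothetical name, and it splits on whitespace, which differs from the default `CountVectorizer` tokenization (that one drops one-character tokens).

```python
from collections import Counter


def word_probabilities(cleaned_sentences):
    """Return P(w) = f(w) / N, where N is the total number of tokens."""
    counts = Counter(w for sent in cleaned_sentences for w in sent.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


toy = ["pixel camera phone", "pixel google", "pixel phone"]
probs = word_probabilities(toy)
print(sorted(probs.items(), key=lambda x: x[1], reverse=True)[:2])
```

With this normalization the values sum to 1, which matters once the squaring update of step 4 is applied.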

## Step 3. Finding the weighted frequencies of the sentences

We now have the tokenized version of the article and the frequency table for each token, so we can begin to implement the algorithm!

In [6]:
from nltk.tokenize import word_tokenize

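# Map each sentence index to the average weight of its words
sent_weight = dict()

for ix, sent in enumerate(clean_data):
    
    list_word = word_tokenize(sent)
    g_sent = 0
    
    # Tokens missing from freq_table add nothing to the sum,
    # but still count towards the sentence length below
    for word in list_word:
        if word in freq_table:
            g_sent += freq_table[word]
            
    sent_weight[ix] = g_sent / len(list_word)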
    
top5 = sorted(sent_weight.items(), key=lambda x: x[1], reverse=True)[:5]
top5

[(54, 0.04246153846153846),
 (25, 0.031153846153846157),
 (26, 0.029230769230769237),
 (36, 0.02923076923076923),
 (46, 0.026153846153846153)]
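The weighting cell divides by `len(list_word)`, which raises `ZeroDivisionError` if a cleaned sentence is empty and also counts tokens that have no entry in `freq_table` (for example one-character tokens such as `4`). A hedged variant that averages only over known tokens is sketched below. The helper name `sentence_weight` is hypothetical, and averaging over known tokens only is a design choice that can change the ranking.

```python
def sentence_weight(tokens, probs):
    """Average probability of the tokens that have an entry in `probs`."""
    known = [probs[t] for t in tokens if t in probs]
    # An empty or fully unknown sentence gets weight 0 instead of raising.
    return sum(known) / len(known) if known else 0.0


probs = {"pixel": 0.5, "camera": 0.25}
print(sentence_weight(["pixel", "4", "camera"], probs))  # 0.375
print(sentence_weight([], probs))                        # 0.0
```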

## Step 4. Getting the summary

It turns out that the 55th sentence (index 54) is the best pick to summarize the article.

In [7]:
article_content[54]

'What we know: The Pixel is no longer a single-camera phone.'

But for the sake of curiosity, let's look at the top 5 sentences!

In [8]:
for ix, _ in top5:
    print(article_content[ix])

What we know: The Pixel is no longer a single-camera phone.
But the Pixel 3, Pixel 3 XL and Pixel 3A look close enough that, honestly, we doubt it.
What we know: The Pixel 4 will come in black and orange.
What we know: Not much.
What we know: Nothing.


## Conclusion

I have to admit that the result is good so far! The article is about leaked information on the Pixel 4. It turns out that the writer is not quite sure how much he knows about the specs, despite having all that information. Note that the ranking above uses only the sentence weights from step 2 of the algorithm; the word-weight update from step 4 was not applied. Of course, we have to read the article ourselves to judge whether this summary is satisfying. I leave that as homework for you, the reader, to comment on the result.

## Reference

This project would be nothing without these awesome sources:
1. *Text Summarization Techniques: A Brief Survey* - https://arxiv.org/pdf/1707.02268
2. *Beyond SumBasic: Task-Focused Summarization with Sentence Simplification and Lexical Expansion* - https://www.cis.upenn.edu/~nenkova/papers/ipm.pdf
3. *Applied Text Analysis with Python* - https://learning.oreilly.com/library/view/applied-text-analysis/9781491963036/
4. *Natural Language Processing with Python* - http://www.nltk.org/book/