# Text Summarization
Text summarization is a subfield of Natural Language Processing (NLP) that deals with producing short summaries of long texts. <br><br>
There are two main types of techniques used for text summarization: 
- NLP-based techniques and 
- Deep learning-based techniques. 

In this notebook we use a simple NLP-based technique: Python's NLTK library is used to summarize Wikipedia articles. <br><br>

https://dev.to/davidisrawi/build-a-quick-summarizer-with-python-and-nltk

https://stackabuse.com/text-summarization-with-nltk-in-python/

- from nltk.corpus import stopwords
- from nltk.tokenize import word_tokenize, sent_tokenize

## First model
### Four steps to build a summarizer
- 1 - Remove stop words from the analysis
- 2 - Create a frequency table of the remaining words
- 3 - Assign a score to each sentence depending on the words it contains and the frequency table
- 4 - Build the summary by adding every sentence above a certain score threshold
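
The four steps can be sketched end to end in plain Python. This is a minimal sketch under stated assumptions, not the notebook's NLTK code: it uses regular expressions in place of NLTK's tokenizers so that it runs without any downloads, `STOP_WORDS` is a small hand-written list, and the threshold (a multiple of the mean sentence score, via the hypothetical `threshold_factor` parameter) is one possible choice among many.

```python
import re
from collections import Counter

# Assumption: a tiny hand-written stop word list, standing in for NLTK's.
STOP_WORDS = {"a", "is", "to", "than", "so", "at", "you", "up", "down"}


def summarize(text, threshold_factor=1.0):
    """Return the sentences whose score exceeds threshold_factor * mean score."""
    # Naive sentence split: break after ., ! or ? when whitespace follows.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    # Steps 1 and 2: count every word that is not a stop word.
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if w not in STOP_WORDS)

    # Step 3: a sentence's score is the sum of the frequencies of its words.
    # Stop words contribute 0 because a Counter returns 0 for missing keys.
    scores = {}
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        scores[sentence] = sum(freq[t] for t in tokens)

    if not scores:
        return []

    # Step 4: keep sentences scoring above the threshold, in original order.
    threshold = threshold_factor * sum(scores.values()) / len(scores)
    return [s for s in sentences if scores[s] > threshold]
```

For the input `"Cats sleep. Cats purr and cats play. Dogs bark."` the sentence scores are 4, 9 and 2, the mean is 5, and only the second sentence is kept. Because scores are sums, longer sentences tend to score higher; dividing by sentence length is a common adjustment.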

__Stop words__ are words that add little to the meaning of a sentence (for example "the", "is" or "at"). <br><br>
__Corpus__ means a collection of texts. <br><br>
__Tokenizers__ divide a text into a series of tokens.<br>
There are three main kinds of tokenizer: word, sentence and regex tokenizers.
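
To make the three kinds of tokenizer concrete, here is a rough illustration built only on the standard `re` module. The function names are hypothetical and the rules are deliberately naive; NLTK's own tokenizers handle abbreviations, contractions and other cases that this sketch does not.

```python
import re


def regex_tokenize(text, pattern):
    """Regex tokenizer: every non-overlapping match of `pattern` is a token."""
    return re.findall(pattern, text)


def simple_word_tokenize(text):
    """Word tokenizer: words and punctuation marks become separate tokens."""
    return regex_tokenize(text, r"\w+|[^\w\s]")


def simple_sent_tokenize(text):
    """Sentence tokenizer: split after ., ! or ? when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sample = "Never give up. Fall down seven times, get up eight."
word_tokens = simple_word_tokenize(sample)      # words plus "." and ","
sentence_tokens = simple_sent_tokenize(sample)  # two sentences
only_words = regex_tokenize(sample, r"\w+")     # punctuation dropped
```

The regex tokenizer is the most general of the three: the choice of pattern decides what counts as a token.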

In [9]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nunes\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nunes\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [1]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

In [4]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
text = "So, keep working. Keep striving. Never give up. Fall down seven times, get up eight. Ease is a greater threat to progress than hardship. Ease is a greater threat to progress than hardship. So, keep moving, keep growing, keep learning. See you at work."

First - create two collections: a set of stop words and a list of the tokens in the text

In [14]:
# a set of English stop words (a set gives fast membership tests)
stopWords = set(stopwords.words("english"))
# and a list of every token (words and punctuation) in the body of the text
words = word_tokenize(text)

Second - create a dictionary for the word frequency table.<br>
Only words that are not in the stopWords set are counted.

In [16]:
freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1
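
The same table can be built more compactly with `collections.Counter`. The sketch below is an alternative, not the notebook's code: the token list and the stop word set are hand-written stand-ins so that it runs without NLTK, and it also drops punctuation-only tokens with `str.isalpha`, which the loop above does not do.

```python
from collections import Counter

# Assumption: a hand-written token list and stop word set, standing in for
# word_tokenize(text) and stopwords.words("english").
tokens = ["So", ",", "keep", "working", ".", "Keep", "striving", "."]
stop_words = {"so"}

# Lowercase each token, then drop stop words and punctuation-only tokens.
freq_table = Counter(
    token.lower()
    for token in tokens
    if token.lower() not in stop_words and token.isalpha()
)
```

A `Counter` returns 0 for a word it has never seen, which removes the need for the `if word in freqTable` check.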

Third - assign a score to every sentence.<br>
Start by splitting the text into sentences with the sentence tokenizer.

In [19]:
sentences = sent_tokenize(text)

In [20]:
sentences

['So, keep working.',
 'Keep striving.',
 'Never give up.',
 'Fall down seven times, get up eight.',
 'Ease is a greater threat to progress than hardship.',
 'Ease is a greater threat to progress than hardship.',
 'So, keep moving, keep growing, keep learning.',
 'See you at work.']

Go through every sentence and give it a score depending on the words it contains.
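
One detail worth checking when scoring: testing whether a word is `in` a sentence string is a substring test, so a short word also matches inside longer words, and each word is counted at most once per sentence. The sketch below compares that with matching whole tokens. The frequency table is a hypothetical toy example, and both function names are made up for illustration.

```python
import re

# Hypothetical toy frequency table (word -> count), not the notebook's real one.
freq_table = {"work": 1, "keep": 4, "ease": 2}


def score_by_substring(sentence, freq_table):
    """Add a word's frequency if it occurs anywhere in the sentence string."""
    lowered = sentence.lower()
    return sum(count for word, count in freq_table.items() if word in lowered)


def score_by_token(sentence, freq_table):
    """Add a word's frequency once for each time it appears as a whole token."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return sum(freq_table.get(token, 0) for token in tokens)


substring_score = score_by_substring("So, keep working.", freq_table)  # "work" matches inside "working"
token_score = score_by_token("So, keep working.", freq_table)          # only "keep" matches
```

Neither rule is wrong in itself, but they rank sentences differently, so it helps to pick one deliberately.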

In [22]:
sentenceValue = dict()
for sentence in sentences:
    # .items() yields (word, count) pairs: wordValue[0] is the word, wordValue[1] its count
    for wordValue in freqTable.items():
        if wordValue[0] in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += wordValue[1]