## <center>**NLP Text Summarization Using NLTK**</center>
<center><em>
Text summarization is the technique of shortening a long piece of text into a coherent, fluent summary that keeps only the main points of the document. In other words, it produces a shorter text while preserving the meaning of the original.
</em></center>
<br>
<center><img src="https://github.com/kkrusere/NLP-Text-Summarization/blob/main/assets/mchinelearning_text_sum.png?raw=1" width=600/></center>

***Project Contributors:*** Kuzi Rusere<br>
**MVP Streamlit App URL:** N/A

In [1]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
# Build the set of English stop words to exclude from the word counts
stopWords = set(stopwords.words("english"))

For our example text, we are going to use this brief explainer on the history of chaos theory.

In [3]:
text = """
In 1961, a meteorologist by the name of Edward Lorenz made a profound discovery. Lorenz was utilising the new-found power of computers in an attempt to more accurately predict the weather. He created a mathematical model which, when supplied with a set of numbers representing the current weather, could predict the weather a few minutes in advance.
Once this computer program was up and running, Lorenz could produce long-term forecasts by feeding the predicted weather back into the computer over and over again, with each run forecasting further into the future.Accurate minute-by-minute forecasts added up into days, and then weeks.
One day, Lorenz decided to rerun one of his forecasts. In the interests of saving time he decided not to start from scratch; instead he took the computer’s prediction from halfway through the first run and used that as the starting point.
After a well-earned coffee break, he returned to discover something unexpected. Although the computer’s new predictions started out the same as before, the two sets of predictions soon began diverging drastically. What had gone wrong?
Lorenz soon realised that while the computer was printing out the predictions to three decimal places, it was actually crunching the numbers internally using six decimal places.
So while Lorenz had started the second run with the number 0.506, the original run had used the number 0.506127.
A difference of one part in a thousand: the same sort of difference that a flap of a butterfly’s wing might make to the breeze on your face. The starting weather conditions had been virtually identical. The two predictions were anything but.
Lorenz had found the seeds of chaos. In systems that behave nicely - without chaotic effects - small differences only produce small effects. In this case, Lorenz’s equations were causing errors to steadily grow over time.
This meant that tiny errors in the measurement of the current weather would not stay tiny, but relentlessly increased in size each time they were fed back into the computer until they had completely swamped the predictions.
Lorenz famously illustrated this effect with the analogy of a butterfly flapping its wings and thereby causing the formation of a hurricane half a world away.
A nice way to see this “butterfly effect” for yourself is with a game of pool or billiards. No matter how consistent you are with the first shot (the break), the smallest of differences in the speed and angle with which you strike the white ball will cause the pack of billiards to scatter in wildly different directions every time.
The smallest of differences are producing large effects - the hallmark of a chaotic system.
It is worth noting that the laws of physics that determine how the billiard balls move are precise and unambiguous: they allow no room for randomness.
What at first glance appears to be random behaviour is completely deterministic - it only seems random because imperceptible changes are making all the difference.
The rate at which these tiny differences stack up provides each chaotic system with a prediction horizon - a length of time beyond which we can no longer accurately forecast its behaviour.
In the case of the weather, the prediction horizon is nowadays about one week (thanks to ever-improving measuring instruments and models).
Some 50 years ago it was 18 hours. Two weeks is believed to be the limit we could ever achieve however much better computers and software get.
Surprisingly, the solar system is a chaotic system too - with a prediction horizon of a hundred million years. It was the first chaotic system to be discovered, long before there was a Chaos Theory.
In 1887, the French mathematician Henri Poincaré showed that while Newton’s theory of gravity could perfectly predict how two planetary bodies would orbit under their mutual attraction, adding a third body to the mix rendered the equations unsolvable.
The best we can do for three bodies is to predict their movements moment by moment, and feed those predictions back into our equations …
Though the dance of the planets has a lengthy prediction horizon, the effects of chaos cannot be ignored, for the intricate interplay of gravitation tugs among the planets has a large influence on the trajectories of the asteroids.
Keeping an eye on the asteroids is difficult but worthwhile, since such chaotic effects may one day fling an unwelcome surprise our way.
On the flip side, they can also divert external surprises such as steering comets away from a potential collision with Earth.

"""

Word tokenization

In [5]:
# Tokenize the text into words
words = word_tokenize(text)
print(words)

['In', '1961', ',', 'a', 'meteorologist', 'by', 'the', 'name', 'of', 'Edward', 'Lorenz', 'made', 'a', 'profound', 'discovery', '.', 'Lorenz', 'was', 'utilising', 'the', 'new-found', 'power', 'of', 'computers', 'in', 'an', 'attempt', 'to', 'more', 'accurately', 'predict', 'the', 'weather', '.', 'He', 'created', 'a', 'mathematical', 'model', 'which', ',', 'when', 'supplied', 'with', 'a', 'set', 'of', 'numbers', 'representing', 'the', 'current', 'weather', ',', 'could', 'predict', 'the', 'weather', 'a', 'few', 'minutes', 'in', 'advance', '.', 'Once', 'this', 'computer', 'program', 'was', 'up', 'and', 'running', ',', 'Lorenz', 'could', 'produce', 'long-term', 'forecasts', 'by', 'feeding', 'the', 'predicted', 'weather', 'back', 'into', 'the', 'computer', 'over', 'and', 'over', 'again', ',', 'with', 'each', 'run', 'forecasting', 'further', 'into', 'the', 'future.Accurate', 'minute-by-minute', 'forecasts', 'added', 'up', 'into', 'days', ',', 'and', 'then', 'weeks', '.', 'One', 'day', ',', 'Lo

In [12]:
# Import punctuation (plus the newline character) so that we can exclude these tokens from the word-frequency count
from string import punctuation
punctuation = punctuation + '\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'
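Because `punctuation` is a string, the `word not in punctuation` check used in the frequency-table cell is a substring test. The sketch below illustrates what such a test keeps and drops; the helper name `is_punctuation_token` is hypothetical and exists only for this illustration.

```python
from string import punctuation

# Same extension as in the cell above: ASCII punctuation plus the newline.
punct = punctuation + "\n"


def is_punctuation_token(token: str) -> bool:
    """Hypothetical helper mirroring the notebook's `word in punctuation` test."""
    return token in punct


# A single ASCII punctuation mark is a substring, so it is filtered out.
print(is_punctuation_token(","))          # True
# A hyphenated word is not a substring of the punctuation string, so it is kept.
print(is_punctuation_token("new-found"))  # False
# The typographic apostrophe is not part of string.punctuation, so it is kept.
print(is_punctuation_token("’"))          # False
# Any contiguous run of the punctuation string also matches.
print(is_punctuation_token("()"))         # True
```

This is consistent with the `'’'` entry that shows up in the frequency table: the curly apostrophe survives the filter. If exact single-character matching is wanted, converting the string to a `set` of characters is one option.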

Word-frequency table


In [13]:
# Create a frequency table that keeps the count of each lowercased word, skipping stop words and punctuation
wordfreqTable = dict()
for word in words:
    word = word.lower()
    if word not in stopWords:
        if word not in punctuation:
            if word in wordfreqTable:
                wordfreqTable[word] += 1
            else:
                wordfreqTable[word] = 1

print(wordfreqTable)

{'1961': 1, 'meteorologist': 1, 'name': 1, 'edward': 1, 'lorenz': 9, 'made': 1, 'profound': 1, 'discovery': 1, 'utilising': 1, 'new-found': 1, 'power': 1, 'computers': 2, 'attempt': 1, 'accurately': 2, 'predict': 4, 'weather': 7, 'created': 1, 'mathematical': 1, 'model': 1, 'supplied': 1, 'set': 1, 'numbers': 2, 'representing': 1, 'current': 2, 'could': 4, 'minutes': 1, 'advance': 1, 'computer': 6, 'program': 1, 'running': 1, 'produce': 2, 'long-term': 1, 'forecasts': 3, 'feeding': 1, 'predicted': 1, 'back': 3, 'run': 4, 'forecasting': 1, 'future.accurate': 1, 'minute-by-minute': 1, 'added': 1, 'days': 1, 'weeks': 2, 'one': 5, 'day': 2, 'decided': 2, 'rerun': 1, 'interests': 1, 'saving': 1, 'time': 5, 'start': 1, 'scratch': 1, 'instead': 1, 'took': 1, '’': 5, 'prediction': 5, 'halfway': 1, 'first': 4, 'used': 2, 'starting': 2, 'point': 1, 'well-earned': 1, 'coffee': 1, 'break': 2, 'returned': 1, 'discover': 1, 'something': 1, 'unexpected': 1, 'although': 1, 'new': 1, 'predictions': 6, 
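The counting loop above can also be written with `collections.Counter`. The following is a minimal sketch, not the notebook's own code: the small `stop_words` set and the `tokens` list are stand-ins (assumptions) for NLTK's stop-word list and the `word_tokenize` output, so the example runs without NLTK.

```python
from collections import Counter
from string import punctuation

# Assumption: hand-written stand-ins for NLTK's stop words and word_tokenize output.
stop_words = {"the", "a", "of", "was", "in", "to"}
tokens = ["Lorenz", "was", "utilising", "the", "power", "of", "computers", ".",
          "Lorenz", "made", "a", "discovery", "."]

punct = punctuation + "\n"

# Lowercase each token, then count it unless it is a stop word or punctuation.
word_freq = Counter(
    token.lower()
    for token in tokens
    if token.lower() not in stop_words and token not in punct
)

print(word_freq.most_common(1))  # [('lorenz', 2)]
```

`Counter` behaves like a dictionary, so the later normalization and scoring steps would work on it unchanged.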

In [19]:
# Normalize the word frequencies: first find the highest raw count
max_frequency =  max(wordfreqTable.values())
max_frequency

9

In [20]:
for word in wordfreqTable.keys():
    wordfreqTable[word] = wordfreqTable[word]/max_frequency

print(wordfreqTable)

{'1961': 0.1111111111111111, 'meteorologist': 0.1111111111111111, 'name': 0.1111111111111111, 'edward': 0.1111111111111111, 'lorenz': 1.0, 'made': 0.1111111111111111, 'profound': 0.1111111111111111, 'discovery': 0.1111111111111111, 'utilising': 0.1111111111111111, 'new-found': 0.1111111111111111, 'power': 0.1111111111111111, 'computers': 0.2222222222222222, 'attempt': 0.1111111111111111, 'accurately': 0.2222222222222222, 'predict': 0.4444444444444444, 'weather': 0.7777777777777778, 'created': 0.1111111111111111, 'mathematical': 0.1111111111111111, 'model': 0.1111111111111111, 'supplied': 0.1111111111111111, 'set': 0.1111111111111111, 'numbers': 0.2222222222222222, 'representing': 0.1111111111111111, 'current': 0.2222222222222222, 'could': 0.4444444444444444, 'minutes': 0.1111111111111111, 'advance': 0.1111111111111111, 'computer': 0.6666666666666666, 'program': 0.1111111111111111, 'running': 0.1111111111111111, 'produce': 0.2222222222222222, 'long-term': 0.1111111111111111, 'forecasts'
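As a worked example of the normalization, the sketch below divides a handful of raw counts (copied from the frequency table printed earlier) by the maximum count of 9. It builds a new dictionary instead of updating the table in place, which is a design choice of this sketch rather than something the notebook does.

```python
# Raw counts copied from the frequency table printed earlier in the notebook.
counts = {"lorenz": 9, "weather": 7, "computer": 6, "predict": 4, "model": 1}

# The most frequent word gets weight 1.0; every other word gets count / max.
max_count = max(counts.values())
normalized = {word: count / max_count for word, count in counts.items()}

for word, weight in normalized.items():
    print(f"{word}: {weight:.3f}")
```

Keeping the raw counts untouched means the step can be re-run safely; an in-place division would shrink the weights again on every re-execution of the cell.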

Sentence tokenization and scoring


In [21]:
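# Create a dictionary to keep the score of each sentence.
# A sentence's score is the sum of the normalized frequencies of the words it contains.
# Note: `word in sentence.lower()` is a substring match, so e.g. 'run' also matches 'rerun'.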
sentences = sent_tokenize(text)
sentenceValue = dict()
   
for sentence in sentences:
    for word, freq in wordfreqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq
   

In [22]:
sentenceValue

{'\nIn 1961, a meteorologist by the name of Edward Lorenz made a profound discovery.': 2.0,
 'Lorenz was utilising the new-found power of computers in an attempt to more accurately predict the weather.': 4.111111111111111,
 'He created a mathematical model which, when supplied with a set of numbers representing the current weather, could predict the weather a few minutes in advance.': 3.222222222222223,
 'Once this computer program was up and running, Lorenz could produce long-term forecasts by feeding the predicted weather back into the computer over and over again, with each run forecasting further into the future.Accurate minute-by-minute forecasts added up into days, and then weeks.': 6.777777777777773,
 'One day, Lorenz decided to rerun one of his forecasts.': 3.0,
 'In the interests of saving time he decided not to start from scratch; instead he took the computer’s prediction from halfway through the first run and used that as the starting point.': 5.5555555555555545,
 'After a w
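The scoring loop above tests `word in sentence.lower()`, which matches substrings rather than whole words. The sketch below contrasts that behaviour with a whole-token variant. The weights, the two sentences, and the regular-expression tokenizer are all assumptions made for illustration; `score_by_token` is a hypothetical alternative, not the notebook's method.

```python
import re

# Assumption: a tiny weight table and two sentences stand in for the
# notebook's wordfreqTable and the sent_tokenize output.
word_weights = {"run": 0.4, "computer": 0.6, "one": 0.5}
sentences = [
    "Lorenz decided to rerun the forecast.",
    "The computer finished one run.",
]


def score_by_substring(sentence, weights):
    """Mirrors the notebook: a word counts if it occurs anywhere in the text."""
    lowered = sentence.lower()
    return sum(w for word, w in weights.items() if word in lowered)


def score_by_token(sentence, weights):
    """Hypothetical variant: a word counts only if it is a whole token."""
    tokens = set(re.findall(r"[a-z0-9]+(?:[-'’][a-z0-9]+)*", sentence.lower()))
    return sum(w for word, w in weights.items() if word in tokens)


for s in sentences:
    print(s, score_by_substring(s, word_weights), score_by_token(s, word_weights))
```

In the first sentence the substring version credits `run` because it appears inside `rerun`, while the token version gives no credit. Both versions count each distinct word once per sentence and do not divide by sentence length, so longer sentences tend to score higher.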

Summarization

In [23]:
from heapq import nlargest

# Keep roughly 40% of the sentences (rounded down) for the summary
select_length = int(len(sentences)*0.4)
select_length

13

In [25]:
summary = nlargest(select_length, sentenceValue, key=sentenceValue.get)
summary

['The best we can do for three bodies is to predict their movements moment by moment, and feed those predictions back into our equations …\nThough the dance of the planets has a lengthy prediction horizon, the effects of chaos cannot be ignored, for the intricate interplay of gravitation tugs among the planets has a large influence on the trajectories of the asteroids.',
 'Once this computer program was up and running, Lorenz could produce long-term forecasts by feeding the predicted weather back into the computer over and over again, with each run forecasting further into the future.Accurate minute-by-minute forecasts added up into days, and then weeks.',
 'This meant that tiny errors in the measurement of the current weather would not stay tiny, but relentlessly increased in size each time they were fed back into the computer until they had completely swamped the predictions.',
 'Lorenz soon realised that while the computer was printing out the predictions to three decimal places, it
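The `nlargest` call above iterates over the dictionary's keys (the sentences) and ranks them by `sentenceValue.get`, that is, by their scores. A minimal sketch of the same pattern follows; the scores are rounded from the first six entries of the `sentenceValue` output, and the labels `s1` to `s6` are hypothetical placeholders for the full sentence strings.

```python
from heapq import nlargest

# Scores rounded from the first six sentenceValue entries shown earlier;
# the keys s1..s6 are hypothetical labels standing in for the sentences.
scores = {"s1": 2.0, "s2": 4.11, "s3": 3.22, "s4": 6.78, "s5": 3.0, "s6": 5.56}

# Keep 40% of the items, rounded down: int(6 * 0.4) == 2.
select_length = int(len(scores) * 0.4)

# Iterating a dict yields its keys; key=scores.get ranks them by value.
top = nlargest(select_length, scores, key=scores.get)
print(top)  # ['s4', 's6']
```

The result is ordered from highest to lowest score, not by position in the text.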

In [27]:
# Copy the selected sentences into a list (still ordered by score)
final_summary = list(summary)
final_summary

['The best we can do for three bodies is to predict their movements moment by moment, and feed those predictions back into our equations …\nThough the dance of the planets has a lengthy prediction horizon, the effects of chaos cannot be ignored, for the intricate interplay of gravitation tugs among the planets has a large influence on the trajectories of the asteroids.',
 'Once this computer program was up and running, Lorenz could produce long-term forecasts by feeding the predicted weather back into the computer over and over again, with each run forecasting further into the future.Accurate minute-by-minute forecasts added up into days, and then weeks.',
 'This meant that tiny errors in the measurement of the current weather would not stay tiny, but relentlessly increased in size each time they were fed back into the computer until they had completely swamped the predictions.',
 'Lorenz soon realised that while the computer was printing out the predictions to three decimal places, it

In [28]:
summary = " ".join(final_summary)
print(summary)

The best we can do for three bodies is to predict their movements moment by moment, and feed those predictions back into our equations …
Though the dance of the planets has a lengthy prediction horizon, the effects of chaos cannot be ignored, for the intricate interplay of gravitation tugs among the planets has a large influence on the trajectories of the asteroids. Once this computer program was up and running, Lorenz could produce long-term forecasts by feeding the predicted weather back into the computer over and over again, with each run forecasting further into the future.Accurate minute-by-minute forecasts added up into days, and then weeks. This meant that tiny errors in the measurement of the current weather would not stay tiny, but relentlessly increased in size each time they were fed back into the computer until they had completely swamped the predictions. Lorenz soon realised that while the computer was printing out the predictions to three decimal places, it was actually c
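The printed summary lists sentences from highest to lowest score, which is why it opens with a passage from near the end of the text. One possible refinement, sketched below under stated assumptions, is to select the top sentences by score and then emit them in their original order. The placeholder sentences and scores are made up for illustration and are not taken from the notebook.

```python
from heapq import nlargest

# Assumption: placeholder sentences in document order, with made-up scores.
sentences = ["First point.", "Minor aside.", "Key finding.",
             "Another aside.", "Conclusion."]
scores = {"First point.": 2.0, "Minor aside.": 0.5, "Key finding.": 3.5,
          "Another aside.": 0.7, "Conclusion.": 2.5}

# Pick the three highest-scoring sentences, as a set for fast membership tests.
top = set(nlargest(3, scores, key=scores.get))

# Walk the sentences in document order and keep only the selected ones.
ordered_summary = " ".join(s for s in sentences if s in top)
print(ordered_summary)  # First point. Key finding. Conclusion.
```

Restoring document order usually reads more naturally, because each selected sentence still follows the material that introduced it.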