<a href="https://colab.research.google.com/github/kkrusere/NLP-Text-Summarization/blob/main/text_summarization_using_abstractive_method.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <center>**NLP Text Summarization** <center><em>
**<center>Abstractive Summarization</center>**

Text summarization is the task of shortening a long piece of text into a coherent, fluent summary that keeps only the document's main points. In other words, it produces a shorter text that preserves the meaning of the original.
</em></center>
<br>
<center><img src="https://github.com/kkrusere/NLP-Text-Summarization/blob/main/assets/mchinelearning_text_sum.png?raw=1" width=600/></center>

***Project Contributors:*** Kuzi Rusere<br>
**MVP streamlit App URL:** N/A

In [2]:
import nltk
import spacy

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.probability import FreqDist
import string
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

stop_words = set(stopwords.words("english"))

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


**Abstractive summarization** involves understanding the core ideas of the text and then creating a new, condensed version that expresses those ideas, potentially using different words and phrasing. Unlike extractive summarization, which selects sentences or phrases from the text, abstractive summarization generates new sentences that paraphrase the source much as a human would. It is the harder of the two tasks because it requires understanding the text and then producing meaningful new text that represents it.

For our example text, we are going to use this brief explainer on the history of chaos theory.

In [3]:
text = """
In 1961, a meteorologist by the name of Edward Lorenz made a profound discovery. Lorenz was utilising the new-found power of computers in an attempt to more accurately predict the weather. He created a mathematical model which, when supplied with a set of numbers representing the current weather, could predict the weather a few minutes in advance.
Once this computer program was up and running, Lorenz could produce long-term forecasts by feeding the predicted weather back into the computer over and over again, with each run forecasting further into the future.Accurate minute-by-minute forecasts added up into days, and then weeks.
One day, Lorenz decided to rerun one of his forecasts. In the interests of saving time he decided not to start from scratch; instead he took the computer’s prediction from halfway through the first run and used that as the starting point.
After a well-earned coffee break, he returned to discover something unexpected. Although the computer’s new predictions started out the same as before, the two sets of predictions soon began diverging drastically. What had gone wrong?
Lorenz soon realised that while the computer was printing out the predictions to three decimal places, it was actually crunching the numbers internally using six decimal places.
So while Lorenz had started the second run with the number 0.506, the original run had used the number 0.506127.
A difference of one part in a thousand: the same sort of difference that a flap of a butterfly’s wing might make to the breeze on your face. The starting weather conditions had been virtually identical. The two predictions were anything but.
Lorenz had found the seeds of chaos. In systems that behave nicely - without chaotic effects - small differences only produce small effects. In this case, Lorenz’s equations were causing errors to steadily grow over time.
This meant that tiny errors in the measurement of the current weather would not stay tiny, but relentlessly increased in size each time they were fed back into the computer until they had completely swamped the predictions.
Lorenz famously illustrated this effect with the analogy of a butterfly flapping its wings and thereby causing the formation of a hurricane half a world away.
A nice way to see this “butterfly effect” for yourself is with a game of pool or billiards. No matter how consistent you are with the first shot (the break), the smallest of differences in the speed and angle with which you strike the white ball will cause the pack of billiards to scatter in wildly different directions every time.
The smallest of differences are producing large effects - the hallmark of a chaotic system.
It is worth noting that the laws of physics that determine how the billiard balls move are precise and unambiguous: they allow no room for randomness.
What at first glance appears to be random behaviour is completely deterministic - it only seems random because imperceptible changes are making all the difference.
The rate at which these tiny differences stack up provides each chaotic system with a prediction horizon - a length of time beyond which we can no longer accurately forecast its behaviour.
In the case of the weather, the prediction horizon is nowadays about one week (thanks to ever-improving measuring instruments and models).
Some 50 years ago it was 18 hours. Two weeks is believed to be the limit we could ever achieve however much better computers and software get.
Surprisingly, the solar system is a chaotic system too - with a prediction horizon of a hundred million years. It was the first chaotic system to be discovered, long before there was a Chaos Theory.
In 1887, the French mathematician Henri Poincaré showed that while Newton’s theory of gravity could perfectly predict how two planetary bodies would orbit under their mutual attraction, adding a third body to the mix rendered the equations unsolvable.
The best we can do for three bodies is to predict their movements moment by moment, and feed those predictions back into our equations …
Though the dance of the planets has a lengthy prediction horizon, the effects of chaos cannot be ignored, for the intricate interplay of gravitation tugs among the planets has a large influence on the trajectories of the asteroids.
Keeping an eye on the asteroids is difficult but worthwhile, since such chaotic effects may one day fling an unwelcome surprise our way.
On the flip side, they can also divert external surprises such as steering comets away from a potential collision with Earth.

"""

#### **Abstractive Summarization** using NLTK, spaCy, Gensim, and Sumy

Abstractive summarization using traditional NLP libraries like NLTK, spaCy, Gensim, and Sumy can be more challenging since these libraries are more commonly used for extractive summarization. However, we can create a basic approach to mimic abstractive summarization by combining various techniques.

1. Preprocessing the Text
- Before we start with the summarization, we need to preprocess the text to clean and prepare it.

In [4]:
def preprocess_text(text):
    # Tokenize sentences
    sentences = sent_tokenize(text)

    # Remove stopwords and punctuation
    stop_words = set(stopwords.words('english'))
    processed_sentences = []

    for sentence in sentences:
        words = word_tokenize(sentence)
        filtered_words = [word.lower() for word in words if word.lower() not in stop_words and word not in string.punctuation]
        processed_sentences.append(' '.join(filtered_words))

    return processed_sentences


In [5]:
preprocessed_text = preprocess_text(text)
preprocessed_text

['1961 meteorologist name edward lorenz made profound discovery',
 'lorenz utilising new-found power computers attempt accurately predict weather',
 'created mathematical model supplied set numbers representing current weather could predict weather minutes advance',
 'computer program running lorenz could produce long-term forecasts feeding predicted weather back computer run forecasting future.accurate minute-by-minute forecasts added days weeks',
 'one day lorenz decided rerun one forecasts',
 'interests saving time decided start scratch instead took computer ’ prediction halfway first run used starting point',
 'well-earned coffee break returned discover something unexpected',
 'although computer ’ new predictions started two sets predictions soon began diverging drastically',
 'gone wrong',
 'lorenz soon realised computer printing predictions three decimal places actually crunching numbers internally using six decimal places',
 'lorenz started second run number 0.506 original run u
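
The preprocessed output above still contains stray `’` tokens. That is because `string.punctuation` lists only ASCII punctuation, so the typographic apostrophe in the source text passes the filter. A minimal sketch of a Unicode-aware check follows; `is_punctuation` is a hypothetical helper, not part of the notebook, and the token list is made up for illustration.

```python
import unicodedata

def is_punctuation(token):
    """Return True if every character in the token is Unicode punctuation."""
    # Unicode general categories starting with "P" cover ASCII punctuation
    # as well as curly quotes and apostrophes. Symbols such as "$" or "+"
    # belong to "S" categories and are not caught by this check.
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )

# Made-up tokens resembling what word_tokenize yields for "computer’s prediction"
tokens = ["computer", "’", "s", "prediction", "new-found", "...", "“"]
kept = [t for t in tokens if not is_punctuation(t)]
print(kept)
```

Hyphenated words such as `new-found` are kept because only some of their characters are punctuation. Swapping this check in for `word not in string.punctuation` inside `preprocess_text` would be one way to drop the stray apostrophes.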

2. Extract Keywords and Key Phrases
- To create an abstractive summary, we need to identify key phrases and concepts from the text.

In [6]:
from collections import Counter

def extract_keywords(text, num_keywords=10):
    doc = nlp(text)
    # Extract noun chunks (key phrases)
    keywords = [chunk.text for chunk in doc.noun_chunks]
    # Get the most common keywords
    common_keywords = Counter(keywords).most_common(num_keywords)
    return common_keywords

In [7]:
keywords = extract_keywords(text)
keywords

[('Lorenz', 7),
 ('which', 4),
 ('they', 4),
 ('the weather', 3),
 ('the computer', 3),
 ('he', 3),
 ('it', 3),
 ('that', 3),
 ('we', 3),
 ('the current weather', 2)]
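
Most of the entries above are pronouns (`which`, `they`, `he`, `it`, `that`, `we`), and the remaining phrases carry a leading article. One possible clean-up, sketched below, drops single-word pronoun entries and strips leading determiners before merging counts. The `PRONOUNS` and `DETERMINERS` sets and the `clean_keywords` function are assumptions made for this sketch; a fuller solution could filter on spaCy token attributes instead.

```python
from collections import Counter

# Hypothetical, hand-written filter lists (not exhaustive).
PRONOUNS = {"which", "they", "he", "she", "it", "that", "we", "who",
            "this", "these", "those", "i", "you"}
DETERMINERS = {"the", "a", "an"}

def clean_keywords(keyword_counts):
    """Drop pronoun-only phrases, strip leading determiners, merge counts."""
    merged = Counter()
    for phrase, count in keyword_counts:
        words = phrase.lower().split()
        # Remove leading articles such as "the" in "the weather"
        while words and words[0] in DETERMINERS:
            words = words[1:]
        # Skip empty phrases and bare pronouns
        if not words or (len(words) == 1 and words[0] in PRONOUNS):
            continue
        merged[" ".join(words)] += count
    return merged.most_common()

# The keyword list printed above, copied in as sample input
raw = [('Lorenz', 7), ('which', 4), ('they', 4), ('the weather', 3),
       ('the computer', 3), ('he', 3), ('it', 3), ('that', 3), ('we', 3),
       ('the current weather', 2)]
cleaned = clean_keywords(raw)
print(cleaned)
```

Lowercasing the phrases also makes them comparable with the lowercased sentences produced by `preprocess_text`.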

3. Generate Sentence Embeddings (Using Gensim)
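- We train a Gensim Word2Vec model on the preprocessed sentences and average the word vectors in each sentence to get a sentence embedding, which lets us compare sentences by semantic similarity. With a corpus this small, the learned vectors are likely to be noisy.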

In [8]:
from gensim.models import Word2Vec
import numpy as np

def generate_sentence_embeddings(sentences):
    # Tokenize sentences
    tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]

    # Train Word2Vec model on this small corpus (min_count=1 keeps every word)
    model = Word2Vec(tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)

    # Generate sentence embeddings by averaging word vectors
    sentence_embeddings = []
    for sentence in tokenized_sentences:
        vectors = [model.wv[word] for word in sentence if word in model.wv]
        if vectors:
            sentence_embeddings.append(np.mean(vectors, axis=0))
        else:
            # Use a zero vector rather than None so cosine_similarity still works
            sentence_embeddings.append(np.zeros(model.wv.vector_size, dtype=np.float32))

    return sentence_embeddings

In [None]:
sentence_embeddings = generate_sentence_embeddings(preprocessed_text)
sentence_embeddings
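
To make the averaging step concrete, here is a small pure-Python sketch using made-up three-dimensional vectors in place of a trained Word2Vec model. The names `toy_vectors`, `average_embedding`, and `cosine` are illustrative assumptions and do not appear in the notebook.

```python
import math

# Toy word vectors with made-up values (not from a trained model).
toy_vectors = {
    "weather": [1.0, 0.0, 1.0],
    "forecast": [1.0, 0.0, 0.0],
    "planet": [0.0, 1.0, 0.0],
}

def average_embedding(tokens, vectors, dim=3):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together
    return [sum(column) / len(known) for column in zip(*known)]

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

s1 = average_embedding(["weather", "forecast"], toy_vectors)
s2 = average_embedding(["planet"], toy_vectors)
s3 = average_embedding(["unknown"], toy_vectors)
print(s1, cosine(s1, s2), cosine(s1, s3))
```

Sentences that share no direction in the vector space get a similarity of zero, and a sentence with no known words falls back to a zero vector instead of breaking the similarity computation.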

4. Rank Sentences Based on Keywords and Similarity
- We rank sentences by their relevance to the extracted keywords and the similarity of their embeddings to one another.

In [10]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def rank_sentences(sentences, keywords, sentence_embeddings):
    # Rank based on keyword occurrence
    keyword_sentences = [(sentence, sum(sentence.count(keyword[0]) for keyword in keywords)) for sentence in sentences]

    # Rank based on similarity (optional, more for extractive purposes)
    if sentence_embeddings:
        similarity_matrix = cosine_similarity(sentence_embeddings)
        similarity_scores = similarity_matrix.sum(axis=1)
        combined_ranking = [(sentence, keyword_score + similarity_score) for (sentence, keyword_score), similarity_score in zip(keyword_sentences, similarity_scores)]
    else:
        combined_ranking = keyword_sentences

    # Sort sentences by combined score
    ranked_sentences = sorted(combined_ranking, key=lambda x: x[1], reverse=True)

    return ranked_sentences



In [12]:
ranked_sentences = rank_sentences(preprocessed_text, keywords, sentence_embeddings)
ranked_sentences

[('1887 french mathematician henri poincaré showed newton ’ theory gravity could perfectly predict two planetary bodies would orbit mutual attraction adding third body mix rendered equations unsolvable',
  9.23914623260498),
 ('created mathematical model supplied set numbers representing current weather could predict weather minutes advance',
  7.527207374572754),
 ('lorenz utilising new-found power computers attempt accurately predict weather',
  6.4801695346832275),
 ('case weather prediction horizon nowadays one week thanks ever-improving measuring instruments models',
  6.367793560028076),
 ('computer program running lorenz could produce long-term forecasts feeding predicted weather back computer run forecasting future.accurate minute-by-minute forecasts added days weeks',
  6.196478843688965),
 ('two weeks believed limit could ever achieve however much better computers software get',
  6.0695788860321045),
 ('meant tiny errors measurement current weather would stay tiny relentless
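
One caveat about the keyword score in `rank_sentences`: `sentence.count(keyword[0])` counts substrings and is case-sensitive. The preprocessed sentences are lowercased, so `'Lorenz'` never matches, while short keywords such as `'he'` and `'we'` match inside longer words like `weather`. The sketch below contrasts that behaviour with a whole-token count; `substring_score` and `token_score` are hypothetical helpers written for this comparison, and the keyword list is a subset chosen for illustration.

```python
def substring_score(sentence, keywords):
    """Mirror of the original scoring: case-sensitive substring counts."""
    return sum(sentence.count(keyword) for keyword in keywords)

def token_score(sentence, keywords):
    """Count whole tokens that equal a keyword, ignoring case."""
    keyword_set = {keyword.lower() for keyword in keywords}
    return sum(1 for token in sentence.lower().split() if token in keyword_set)

# A preprocessed sentence from the output above
sentence = ("lorenz utilising new-found power computers attempt "
            "accurately predict weather")
keywords = ["Lorenz", "he", "it", "we"]

print(substring_score(sentence, keywords))  # hits come from "power" and "weather"
print(token_score(sentence, keywords))      # the only whole-token hit is "lorenz"
```

Whole-token matching only handles single-word keywords. Multi-word phrases such as `the weather` would need separate handling, for example by matching on the cleaned phrase after stop-word removal.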

5. Generate Abstractive Summary
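- Finally, we build the summary from the top-ranked sentences. The `paraphrase_sentence` function below is only a placeholder for a real paraphrasing step: it shuffles each sentence's words at random, so it does not produce grammatical text.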

In [14]:
import random

def paraphrase_sentence(sentence):
    words = word_tokenize(sentence)
    random.shuffle(words)
    paraphrased_sentence = ' '.join(words)
    return paraphrased_sentence

def generate_abstractive_summary(ranked_sentences, num_sentences=3):
    top_sentences = [sentence[0] for sentence in ranked_sentences[:num_sentences]]
    paraphrased_sentences = [paraphrase_sentence(sentence) for sentence in top_sentences]
    summary = ' '.join(paraphrased_sentences)
    return summary


In [15]:
summary = generate_abstractive_summary(ranked_sentences)
summary

'equations adding predict would bodies mathematician unsolvable perfectly theory attraction orbit 1887 henri mutual rendered planetary two french body gravity mix newton could poincaré third ’ showed supplied minutes representing predict set created model numbers mathematical weather weather current could advance lorenz attempt accurately weather utilising predict power computers new-found'

The generated output is not a good summary of the input text (as expected): it consists of the shuffled words of the three top-ranked sentences.

- What the summary draws on:
> - Poincaré's 1887 result that adding a third body makes the equations of planetary motion unsolvable.
> - Lorenz's mathematical model, which predicts the weather a few minutes ahead from numbers describing the current weather.
> - Lorenz's use of computers to predict the weather more accurately.

- What the summary is missing:
> - The detailed explanation of Lorenz's experiment:
  > > * The text provides a step-by-step account of how Lorenz discovered chaos, including the rerun of his forecast, the rounding error, and the resulting divergence in predictions.
  > > * This is crucial to understanding the core concept of chaos theory.
> - The connection between chaos and determinism:
  > > * The text emphasizes that chaotic systems, while seemingly random, are actually governed by deterministic laws.
  > > * The summary completely omits this important point.
> - The prediction horizon and its implications:
  > > * The text explains the concept of the prediction horizon and its relevance to weather forecasting and the solar system.
  > > * The summary does not mention the prediction horizon at all.
> - The role of chaos in the solar system and asteroid trajectories:
  > > * The text discusses how chaos affects the solar system, particularly the movement of asteroids.
  > > * The summary contains words from the sentence on Poincaré's three-body result, but nothing about asteroids or potential collisions with Earth.

- What it lacks:
> - Clarity and coherence:
  > > * Because the words are shuffled, the output contains no grammatical sentences.
  > > * There is no flow of ideas, so a reader cannot follow any narrative.
> - Comprehensiveness:
  > > * The summary fails to capture the full scope of the text, omitting key concepts and examples that are crucial for understanding chaos theory.
> - Accuracy:
  > > * With word order destroyed, the output makes no checkable statements, so it cannot convey the text's points accurately.

In essence, the output is a bag of words drawn from three sentences about weather prediction and the three-body problem. It misses the details, explanations, and connections that make the original text informative and engaging.

Let's incorporate some evaluation:

To evaluate the quality of the summaries generated by our summarization approach, we can use several evaluation metrics. The most widely used evaluation metric for summarization tasks is ROUGE (Recall-Oriented Understudy for Gisting Evaluation). Other evaluation metrics include BLEU (Bilingual Evaluation Understudy) and METEOR (Metric for Evaluation of Translation with Explicit ORdering).

Evaluation Metrics for Summarization
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
    * ROUGE-1: Measures the overlap of unigrams (single words) between the generated summary and a reference summary.
    * ROUGE-2: Measures the overlap of bigrams (two consecutive words).
    * ROUGE-L: Measures the longest common subsequence (LCS) between the generated and reference summaries, capturing the in-sequence overlap.
- BLEU (Bilingual Evaluation Understudy):
    * Primarily used for machine translation but can be adapted for summarization.
    * Measures the n-gram precision of a generated text against one or more reference texts.
- METEOR (Metric for Evaluation of Translation with Explicit ORdering):
    * Designed to improve on BLEU by also matching word stems and synonyms.
    * It considers unigram matches between generated and reference summaries, applying stemming and synonymy matching.
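
The ROUGE definitions above can be sketched in a few lines of plain Python. This is a simplified illustration that splits on whitespace and applies no stemming, not the official ROUGE implementation. The function names and the two example strings are assumptions made for this sketch.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """ROUGE-N precision, recall and F1 from clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # & keeps the minimum count per n-gram
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """ROUGE-L recall: LCS length divided by the reference length."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return lcs_length(cand, ref) / len(ref) if ref else 0.0

# Made-up example: the candidate is a shuffled copy of the reference.
reference = "small errors grow until forecasts fail"
shuffled = "fail grow forecasts small until errors"

print(rouge_n(shuffled, reference, n=1))
print(rouge_n(shuffled, reference, n=2))
print(rouge_l_recall(shuffled, reference))
```

A shuffled copy of the reference keeps every unigram, so ROUGE-1 cannot tell it apart from the original, whereas ROUGE-2 and ROUGE-L penalize the lost word order. This matters here because the summary above is built from shuffled words, so ROUGE-1 alone would overstate its quality.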


#### **Conclusion**