## <center>**NLP Text Summarization Using NLTK** </center>
**<center>Extractive Summarization</center>**
<center><em>
Text summarization refers to the technique of shortening long pieces of text, with the intention of creating a coherent and fluent summary that outlines only the document's main points. In other words, it is the process of producing a shorter text while preserving the meaning of the original.
</em></center>
<br>
<center><img src="https://github.com/kkrusere/NLP-Text-Summarization/blob/main/assets/mchinelearning_text_sum.png?raw=1" width=600/></center>

***Project Contributors:*** Kuzi Rusere<br>
**MVP streamlit App URL:** N/A

**Extractive summarization** focuses on picking out the most essential phrases and sentences directly from the original text and putting them together to create a shorter version. The original text is not changed in any way. The strength of this method is its accuracy; because it uses the original words, it's very reliable for keeping facts straight.

For our example text, we are going to use this brief explainer on the history of chaos theory:

In [1]:
text = """
In 1961, a meteorologist by the name of Edward Lorenz made a profound discovery. Lorenz was utilising the new-found power of computers in an attempt to more accurately predict the weather. He created a mathematical model which, when supplied with a set of numbers representing the current weather, could predict the weather a few minutes in advance.
Once this computer program was up and running, Lorenz could produce long-term forecasts by feeding the predicted weather back into the computer over and over again, with each run forecasting further into the future.Accurate minute-by-minute forecasts added up into days, and then weeks.
One day, Lorenz decided to rerun one of his forecasts. In the interests of saving time he decided not to start from scratch; instead he took the computer’s prediction from halfway through the first run and used that as the starting point.
After a well-earned coffee break, he returned to discover something unexpected. Although the computer’s new predictions started out the same as before, the two sets of predictions soon began diverging drastically. What had gone wrong?
Lorenz soon realised that while the computer was printing out the predictions to three decimal places, it was actually crunching the numbers internally using six decimal places.
So while Lorenz had started the second run with the number 0.506, the original run had used the number 0.506127.
A difference of one part in a thousand: the same sort of difference that a flap of a butterfly’s wing might make to the breeze on your face. The starting weather conditions had been virtually identical. The two predictions were anything but.
Lorenz had found the seeds of chaos. In systems that behave nicely - without chaotic effects - small differences only produce small effects. In this case, Lorenz’s equations were causing errors to steadily grow over time.
This meant that tiny errors in the measurement of the current weather would not stay tiny, but relentlessly increased in size each time they were fed back into the computer until they had completely swamped the predictions.
Lorenz famously illustrated this effect with the analogy of a butterfly flapping its wings and thereby causing the formation of a hurricane half a world away.
A nice way to see this “butterfly effect” for yourself is with a game of pool or billiards. No matter how consistent you are with the first shot (the break), the smallest of differences in the speed and angle with which you strike the white ball will cause the pack of billiards to scatter in wildly different directions every time.
The smallest of differences are producing large effects - the hallmark of a chaotic system.
It is worth noting that the laws of physics that determine how the billiard balls move are precise and unambiguous: they allow no room for randomness.
What at first glance appears to be random behaviour is completely deterministic - it only seems random because imperceptible changes are making all the difference.
The rate at which these tiny differences stack up provides each chaotic system with a prediction horizon - a length of time beyond which we can no longer accurately forecast its behaviour.
In the case of the weather, the prediction horizon is nowadays about one week (thanks to ever-improving measuring instruments and models).
Some 50 years ago it was 18 hours. Two weeks is believed to be the limit we could ever achieve however much better computers and software get.
Surprisingly, the solar system is a chaotic system too - with a prediction horizon of a hundred million years. It was the first chaotic system to be discovered, long before there was a Chaos Theory.
In 1887, the French mathematician Henri Poincaré showed that while Newton’s theory of gravity could perfectly predict how two planetary bodies would orbit under their mutual attraction, adding a third body to the mix rendered the equations unsolvable.
The best we can do for three bodies is to predict their movements moment by moment, and feed those predictions back into our equations …
Though the dance of the planets has a lengthy prediction horizon, the effects of chaos cannot be ignored, for the intricate interplay of gravitation tugs among the planets has a large influence on the trajectories of the asteroids.
Keeping an eye on the asteroids is difficult but worthwhile, since such chaotic effects may one day fling an unwelcome surprise our way.
On the flip side, they can also divert external surprises such as steering comets away from a potential collision with Earth.

"""

#### Key Steps in Extractive Summarization:

1. **Text Preprocessing:** Cleaning and preprocessing the text (e.g., lowercasing and removing stopwords and punctuation).
2. **Sentence Tokenization:** Splitting the text into individual sentences.
3. **Word Tokenization and Normalization:** Tokenizing sentences into words and applying normalization techniques like stemming or lemmatization.
4. **Scoring Sentences:** Assigning a score to each sentence based on different features (e.g., word frequency, sentence position, presence of keywords).
5. **Rank and Sentence Selection:** Ranking sentences based on their scores and selecting the top-ranked ones for the summary.

**First, what is NLTK in Natural Language Processing (NLP)?**

NLTK, or Natural Language Toolkit, is a powerful and widely used Python library for working with human language data in the field of Natural Language Processing (NLP).


* Core Functionality: NLTK provides a comprehensive suite of tools and resources for various NLP tasks including:   

    - Text preprocessing: Tokenization, stemming, lemmatization, stop word removal, part-of-speech tagging   
    - Corpus access: Interfaces to numerous text corpora and lexical resources   
    - Text analysis: Concordance, frequency distribution, collocations, and more   
    - Machine learning: Classification, named entity recognition, and other NLP applications   

* Benefits:

    - Ease of use: NLTK's user-friendly interface and extensive documentation make it accessible for both beginners and experienced NLP practitioners.   
    - Versatility: The library supports a wide range of NLP tasks and techniques, enabling flexibility in project development.   
    - Community and Resources: NLTK boasts a large and active community, providing ample support and resources for learning and troubleshooting.   

* Applications: NLTK is employed in various NLP projects, including:   

    - Sentiment analysis   
    - Chatbots and conversational agents   
    - Text summarization   
    - Machine translation   
    - Information extraction 
  

In essence, NLTK is a foundational tool in the NLP landscape, offering a rich set of functionalities and resources to facilitate the development of diverse natural language processing applications.   



In [2]:
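import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.probability import FreqDist

# One-time download of the tokenizer models and the stopword list
# (newer NLTK releases may also need the "punkt_tab" resource)
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))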




1. Preprocessing the Text:
    - The text is tokenized into sentences and words.
    - Stop words (common words like "the," "is," etc.) and punctuation are removed to focus on meaningful words.

In [3]:
def preprocess_text(text, stop_words = stop_words):
    # Tokenize text into sentences
    sentences = sent_tokenize(text)

    # Tokenize the whole lowercased text into words
    words = word_tokenize(text.lower())

    # Remove stopwords and punctuation
    filtered_words = [word for word in words if word not in stop_words and word not in string.punctuation]

    return sentences, filtered_words
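
One detail in `preprocess_text` is worth noting: `word not in string.punctuation` is a substring test against an ASCII-only string, so multi-character tokens such as `''` and typographic marks such as `’` or `…` (both of which occur in the example text) pass through the filter. A minimal, stdlib-only sketch of a stricter filter is shown below. It assumes we only want tokens that contain at least one letter or digit; the helper name `is_content_token` and the tiny token list are hypothetical and not part of NLTK.

```python
import string


def is_content_token(token, stop_words):
    """Keep tokens that contain a letter or digit and are not stopwords."""
    return token not in stop_words and any(ch.isalnum() for ch in token)


# Made-up tokens resembling what a word tokenizer might emit
tokens = ["lorenz", "’", "s", "equations", "…", "''", "the", "0.506", "-"]
stop_words = {"the", "s"}  # tiny illustrative stopword set

# The filter used in preprocess_text: only single ASCII punctuation marks are dropped
original_filter = [t for t in tokens if t not in stop_words and t not in string.punctuation]

# The stricter variant: anything without a letter or digit is dropped
stricter_filter = [t for t in tokens if is_content_token(t, stop_words)]

print(original_filter)
print(stricter_filter)
```

Swapping this filter into the pipeline would change the word frequencies slightly, so the summaries printed in this notebook would not necessarily be reproduced exactly.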

2. Word Frequency Calculation:
- A frequency distribution of the filtered words is calculated to determine the importance of each word.
- Frequencies are normalized to bring them to a common scale.

In [4]:
def compute_word_frequency(filtered_words):
    # Frequency distribution of words
    freq_dist = FreqDist(filtered_words)
    
    # Normalize frequencies
    max_freq = max(freq_dist.values())
    for word in freq_dist.keys():
        freq_dist[word] = (freq_dist[word] / max_freq)
    
    return freq_dist
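
To make the normalization step concrete, here is a small stdlib-only sketch that uses `collections.Counter` in place of `FreqDist` (which is itself a `Counter` subclass). The token list is made up for illustration.

```python
from collections import Counter

# Made-up filtered tokens
tokens = ["chaos", "weather", "chaos", "lorenz", "weather", "chaos"]

counts = Counter(tokens)          # raw counts: chaos=3, weather=2, lorenz=1
max_count = max(counts.values())  # the most frequent word sets the scale

# Divide every count by the maximum so that all values fall in (0, 1]
normalized = {word: count / max_count for word, count in counts.items()}

for word, value in sorted(normalized.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}: {value:.3f}")
```

Building a new dictionary rather than overwriting the counts in place keeps the raw counts available in case they are needed later.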

3. Sentence Scoring:
- Each sentence is scored based on the sum of the frequencies of the words it contains.

In [5]:
def score_sentences(sentences, freq_dist):
    # Score each sentence based on the frequency of words it contains
    sentence_scores = {}
    for sentence in sentences:
        words = word_tokenize(sentence.lower())
        score = 0
        for word in words:
            if word in freq_dist:
                score += freq_dist[word]
        sentence_scores[sentence] = score

    return sentence_scores
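
Because the score is a plain sum, longer sentences tend to score higher simply because they contain more words. One common variation, which is not used in this notebook, divides the sum by the number of scored words. The sketch below compares the two under simplifying assumptions: whitespace tokenization instead of `word_tokenize`, and made-up frequencies. The function names are hypothetical.

```python
def sum_score(sentence, freqs):
    """Sum of word frequencies, as in score_sentences."""
    return sum(freqs.get(word, 0.0) for word in sentence.lower().split())


def mean_score(sentence, freqs):
    """Average frequency over the words that appear in freqs."""
    scored = [freqs[word] for word in sentence.lower().split() if word in freqs]
    return sum(scored) / len(scored) if scored else 0.0


# Made-up normalized frequencies
freqs = {"chaos": 1.0, "weather": 0.5, "errors": 0.25}

short_sentence = "chaos weather"
long_sentence = "weather errors weather errors weather errors"

print(sum_score(short_sentence, freqs), sum_score(long_sentence, freqs))
print(mean_score(short_sentence, freqs), mean_score(long_sentence, freqs))
```

With summed scores the long, repetitive sentence ranks first; with averaged scores the short sentence containing the most frequent word does. Neither is strictly better, which is part of why combining several features can help.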

4. Generating the Summary:
- Sentences are sorted based on their scores, and the top sentences are selected to form the summary.

In [6]:
def generate_summary(sentence_scores, num_sentences=3):
    # Sort sentences by their scores in descending order
    sorted_sentences = sorted(sentence_scores.items(), key=lambda x: x[1], reverse=True)
    
    # Select the top 'num_sentences' sentences
    summary_sentences = [sentence[0] for sentence in sorted_sentences[:num_sentences]]
    
    # Join the selected sentences to form the summary
    summary = ' '.join(summary_sentences)
    
    return summary
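
Since `generate_summary` joins sentences in score order, the result can read out of sequence. A possible variation is to pick the top sentences by score but emit them in their original document order. The sketch below assumes `sentences` is the ordered list returned by `preprocess_text`; the function name `generate_ordered_summary` and the toy data are hypothetical.

```python
def generate_ordered_summary(sentences, sentence_scores, num_sentences=3):
    """Select the top-scoring sentences, then join them in document order."""
    top = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:num_sentences]
    chosen = set(top)
    return " ".join(sentence for sentence in sentences if sentence in chosen)


# Toy data: four sentences in document order with made-up scores
sentences = ["A first.", "B second.", "C third.", "D fourth."]
sentence_scores = {"A first.": 0.2, "B second.": 0.9, "C third.": 0.1, "D fourth.": 0.5}

ordered_summary = generate_ordered_summary(sentences, sentence_scores, num_sentences=2)
print(ordered_summary)
```

This does not change which sentences are selected, only the order in which they are presented.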

In [7]:
# Execute the extractive summarization steps
sentences, filtered_words = preprocess_text(text)
freq_dist = compute_word_frequency(filtered_words)
sentence_scores = score_sentences(sentences, freq_dist)
summary = generate_summary(sentence_scores)

# Output the summary
print("Summary:\n")
print(summary)

Summary:

The best we can do for three bodies is to predict their movements moment by moment, and feed those predictions back into our equations …
Though the dance of the planets has a lengthy prediction horizon, the effects of chaos cannot be ignored, for the intricate interplay of gravitation tugs among the planets has a large influence on the trajectories of the asteroids. Once this computer program was up and running, Lorenz could produce long-term forecasts by feeding the predicted weather back into the computer over and over again, with each run forecasting further into the future.Accurate minute-by-minute forecasts added up into days, and then weeks. This meant that tiny errors in the measurement of the current weather would not stay tiny, but relentlessly increased in size each time they were fed back into the computer until they had completely swamped the predictions.



The output is not a good summary of the original text.

Here's why:

- **Focus on a Specific Example:** Two of the three selected passages describe the mechanics of Lorenz's weather prediction model. While this is a crucial part of the text, the summary neglects the broader concept of chaos theory and its implications in other systems.
- **The summary fails to explain key ideas like:**
    - **Deterministic Chaos:** The idea that seemingly random behavior can arise from deterministic systems due to sensitivity to initial conditions.
    - **Prediction Horizon:** The phrase appears once in passing, but the summary never explains it as the time limit beyond which accurate predictions become impossible in a chaotic system.
    - **Chaos in the Solar System:** The sentence about planets and asteroids is included, but without the statement that the solar system is itself a chaotic system, so it arrives with no setup.
- **Lacks Context and Flow:** The summary feels disjointed, opening with the three-body problem and then jumping back to Lorenz's experiment without providing sufficient context about chaos theory and its significance.

The summary provides a narrow view by concentrating on a few details and neglecting the broader concepts and implications of chaos theory explored in the original text. The summarization needs improvement.

##### Let's look at how we can improve the scoring mechanism
To improve the extractive summarization approach, we can incorporate additional features to score sentences more accurately. Here are some ways to enhance the scoring mechanism:

Additional Features for Scoring:

1. Sentence Length:
    - Very short or very long sentences may be less informative or harder to understand.
    - We can add a penalty or a boost to the sentence score based on its length.
2. Sentence Position:
    - In many texts (e.g., news articles), important information is often found in the first few sentences.
    - We can give a higher score to sentences appearing earlier in the text.
3. Named Entity Recognition (NER):
    - Sentences containing named entities (like people, places, organizations) may carry more important information.
    - We can boost the score of sentences that contain named entities.
4. Thematic Words:
    - Words that appear frequently throughout the text are often central to the main idea.
    - We can further emphasize the sentences containing these thematic words.



Let's incorporate these features into a revised version of the extractive summarization code using NLTK and spaCy.

**What is spaCy in Natural Language Processing (NLP)?**


**spaCy** is an open-source Python library specifically designed for production/enterprise-grade Natural Language Processing (NLP). It focuses on providing fast and efficient tools for various NLP tasks, making it well-suited for real-world applications and large-scale text processing.


- Key Features and Benefits of spaCy:

    * Speed and Efficiency: spaCy is known for its exceptional performance, making it ideal for handling large volumes of text data. It leverages optimized algorithms and Cython implementations for blazing-fast processing.
    * Production-Ready: spaCy is built with a focus on practical applications and ease of deployment. Its streamlined API and well-documented functionalities simplify the development process, enabling seamless integration into existing workflows.
    * Pre-trained Models: spaCy offers a range of pre-trained statistical models for several languages, enabling out-of-the-box capabilities such as part-of-speech tagging, named entity recognition, dependency parsing, and more.
    * Customizability: In addition to offering pre-trained models, spaCy allows developers to train custom models for specific tasks or domains, enhancing its flexibility for specialized applications.
    * Ecosystem and Integrations: spaCy benefits from a rich ecosystem of plugins and extensions, enabling seamless integration with other libraries and tools within the NLP and machine learning landscape.

- Typical Use Cases:

    * Information Extraction: spaCy excels at tasks like entity recognition, relation extraction, and fact extraction, enabling the automatic extraction of valuable insights from unstructured text data.
    * Text Classification: spaCy can be used for sentiment analysis, topic classification, and other text classification tasks, helping categorize and understand large amounts of text.
    * Chatbots and Conversational Agents: spaCy's capabilities can be leveraged to power chatbots and conversational agents by providing text preprocessing, intent recognition, and entity extraction functionalities.
    * Preprocessing for Deep Learning: spaCy's efficient text preprocessing tools, including tokenization, lemmatization, and dependency parsing, make it a valuable asset for preparing text data for further analysis using deep learning models.



**spaCy** is a powerful and versatile NLP library, renowned for its speed, efficiency, and production-readiness. Its wide range of features and capabilities makes it an indispensable tool for developers and practitioners working on real-world NLP applications and projects involving large-scale text data processing.

In [9]:
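import spacy

# Load the small English pipeline, used here for Named Entity Recognition (NER).
# The model must be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")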

In [10]:
def score_sentences(sentences, freq_dist, text):
    """
    Scores sentences based on multiple features to determine their importance for extractive summarization.

    This function computes a score for each sentence in the text by considering various features like word frequency, 
    sentence length, sentence position, and the presence of named entities. The score is used to rank sentences 
    when generating an extractive summary.

    Args:
        sentences (list of str): A list of sentences extracted from the original text.
        freq_dist (nltk.probability.FreqDist): A frequency distribution object containing normalized word frequencies.
        text (str): The original text from which sentences are extracted.

    Returns:
        dict: A dictionary where keys are sentences and values are their corresponding scores.
    """
    sentence_scores = {}  # Initialize an empty dictionary to store scores for each sentence
    doc = nlp(text)  # Use spaCy's NLP model to analyze the text for Named Entity Recognition (NER)
    
    for idx, sentence in enumerate(sentences):
        words = word_tokenize(sentence.lower())  # Tokenize the sentence into words and convert to lowercase
        score = 0  # Initialize the score for the current sentence
        
        # Score based on word frequency
        for word in words:
            if word in freq_dist:  # If the word is in the frequency distribution
                score += freq_dist[word]  # Add its frequency to the score
        
        # Feature 1: Sentence Length
        if 4 < len(words) < 25:  # Length threshold for a sentence to be considered "ideal"
            score *= 1.2  # Boost score if the sentence length is within the desired range
        
        # Feature 2: Sentence Position
        if idx < len(sentences) * 0.2:  # If the sentence is among the first 20% in the text
            score *= 1.5  # Boost score to emphasize its importance
        
        # Feature 3: Named Entity Recognition (NER)
        # Note: this is a substring match against every entity found in the whole document
        named_entities = [ent.text for ent in doc.ents if ent.text in sentence]
        if named_entities:  # If the sentence contains named entities
            score *= 1.3  # Boost score
        
        sentence_scores[sentence] = score  # Store the computed score for the current sentence

    return sentence_scores  # Return the dictionary containing sentences and their scores
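
The three boosts are multiplicative, so they compound: a sentence that satisfies all three conditions ends up with 1.2 × 1.5 × 1.3 ≈ 2.34 times its base frequency score. The worked example below mirrors the multipliers used above with a made-up base score; the helper `apply_boosts` is hypothetical and exists only to show the arithmetic.

```python
def apply_boosts(base_score, num_tokens, idx, total_sentences, has_entity):
    """Apply the same multipliers as the enhanced score_sentences."""
    score = base_score
    if 4 < num_tokens < 25:            # Feature 1: sentence length
        score *= 1.2
    if idx < total_sentences * 0.2:    # Feature 2: sentence position (first 20%)
        score *= 1.5
    if has_entity:                     # Feature 3: named entity present
        score *= 1.3
    return score


# An early, medium-length sentence with a named entity receives every boost
early = apply_boosts(2.0, num_tokens=12, idx=0, total_sentences=30, has_entity=True)

# A late, very long sentence without entities receives none
late = apply_boosts(2.0, num_tokens=40, idx=20, total_sentences=30, has_entity=False)

print(round(early, 2), late)
```

Two sentences with the same word-frequency score can therefore end up far apart in the ranking, which is worth keeping in mind when tuning the multipliers.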

In [11]:
def generate_summary(sentence_scores, num_sentences=3):
    sorted_sentences = sorted(sentence_scores.items(), key=lambda x: x[1], reverse=True)
    summary_sentences = [sentence[0] for sentence in sorted_sentences[:num_sentences]]
    summary = ' '.join(summary_sentences)
    return summary


In [12]:
sentences, filtered_words = preprocess_text(text)
freq_dist = compute_word_frequency(filtered_words)
sentence_scores = score_sentences(sentences, freq_dist, text)
summary = generate_summary(sentence_scores)

print("Enhanced/improved Summary:")
print(summary)

Enhanced/improved Summary:
Once this computer program was up and running, Lorenz could produce long-term forecasts by feeding the predicted weather back into the computer over and over again, with each run forecasting further into the future.Accurate minute-by-minute forecasts added up into days, and then weeks. In the interests of saving time he decided not to start from scratch; instead he took the computer’s prediction from halfway through the first run and used that as the starting point. The best we can do for three bodies is to predict their movements moment by moment, and feed those predictions back into our equations …
Though the dance of the planets has a lengthy prediction horizon, the effects of chaos cannot be ignored, for the intricate interplay of gravitation tugs among the planets has a large influence on the trajectories of the asteroids.


The enhanced/improved summary is better than the previous one, but it still has some shortcomings:

- Positives:

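    * Includes more context: It now covers more of Lorenz's experiment, including how he restarted a forecast from the middle of an earlier run, which provides some context for the discussion of chaos theory.
    * Slightly better flow: The summary now opens with Lorenz's weather forecasts rather than the three-body problem, so the sentences are somewhat more connected than in the previous summary.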

- Negatives:

    * Still incomplete: While it touches on Lorenz's experiment and the three-body problem, it still misses the core concept of chaos theory: the idea that tiny differences in initial conditions can lead to drastically different outcomes over time. The term "butterfly effect" is still absent, and "prediction horizon" appears only in passing, without its definition.
    * Lacks focus: The summary jumps between weather prediction, the three-body problem, and the solar system without clearly connecting these ideas.
    * Some sentences are still out of context: The sentence "In the interests of saving time he decided not to start from scratch; instead he took the computer’s prediction from halfway through the first run and used that as the starting point" is included without explaining its significance in the context of Lorenz's discovery.   


Overall, the enhanced summary is an improvement, but it still needs work to be considered a good summary of the original text. It needs to be more focused, include the key concepts of chaos theory, and provide a clear and logical flow of ideas.

Let's incorporate some evaluation:

To evaluate the quality of the summaries generated by our extractive summarization algorithm, we can use several evaluation metrics. The most widely used evaluation metric for summarization tasks is ROUGE (Recall-Oriented Understudy for Gisting Evaluation). Other evaluation metrics include BLEU (Bilingual Evaluation Understudy) and METEOR (Metric for Evaluation of Translation with Explicit ORdering).

Evaluation Metrics for Summarization
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
    * ROUGE-1: Measures the overlap of unigrams (single words) between the generated summary and a reference summary.
    * ROUGE-2: Measures the overlap of bigrams (two consecutive words).
    * ROUGE-L: Measures the longest common subsequence (LCS) between the generated and reference summaries, capturing the in-sequence overlap.
- BLEU (Bilingual Evaluation Understudy):
    * Primarily used for machine translation but can be adapted for summarization.
    * Measures n-gram precision of a generated text against one or more reference texts.
- METEOR (Metric for Evaluation of Translation with Explicit ORdering):
    * Designed to improve BLEU by addressing problems like synonymy and stemming.
    * It considers unigram matches between generated and reference summaries, applying stemming and synonymy matching.
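
To illustrate what the ROUGE variants measure, here is a simplified, stdlib-only sketch of ROUGE-N and a sentence-level ROUGE-L. It assumes lowercased whitespace tokenization with no stemming, so its numbers will not necessarily match those of an established implementation. The function names and the two example strings are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(precision, recall):
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, f1) based on clipped n-gram overlap."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # Counter intersection clips repeated n-grams
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall, f1(precision, recall)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Return (precision, recall, f1) based on the longest common subsequence."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    return precision, recall, f1(precision, recall)


reference = "small errors grow until forecasts fail"
candidate = "small errors grow and forecasts eventually fail"

print("ROUGE-1:", rouge_n(reference, candidate, n=1))
print("ROUGE-2:", rouge_n(reference, candidate, n=2))
print("ROUGE-L:", rouge_l(reference, candidate))
```

In this example five of the six reference words appear in the candidate (ROUGE-1 recall of 5/6), but only two of the five reference bigrams do (ROUGE-2 recall of 2/5), which shows how ROUGE-2 is stricter about word order. All of these metrics need a human-written reference summary to compare against. In practice, the `rouge-score` package provides `rouge_scorer.RougeScorer`, whose `score(target, prediction)` method returns precision, recall, and F-measure for each requested ROUGE type.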


In [13]:
!pip install rouge-score


#### **Conclusion**