## <center>**NLP Text Summarization** </center>
**<center>Extractive Summarization</center>**
<center><em>
Text summarization is the technique of shortening a long piece of text into a coherent, fluent summary that keeps only the main points of the document. In other words, it is the process of producing a shorter text while preserving the meaning of the original.
</em></center>
<br>
<center><img src="https://github.com/kkrusere/NLP-Text-Summarization/blob/main/assets/mchinelearning_text_sum.png?raw=1" width=600/></center>

***Project Contributors:*** Kuzi Rusere<br>
**MVP streamlit App URL:** N/A

**Extractive summarization** focuses on picking out the most essential phrases and sentences directly from the original text and putting them together to create a shorter version. The original text is not changed in any way. The strength of this method is its accuracy; because it uses the original words, it's very reliable for keeping facts straight.

For our example text, we are going to use this brief explainer on the history of chaos theory.

In [1]:
text = """
In 1961, a meteorologist by the name of Edward Lorenz made a profound discovery. Lorenz was utilising the new-found power of computers in an attempt to more accurately predict the weather. He created a mathematical model which, when supplied with a set of numbers representing the current weather, could predict the weather a few minutes in advance.
Once this computer program was up and running, Lorenz could produce long-term forecasts by feeding the predicted weather back into the computer over and over again, with each run forecasting further into the future.Accurate minute-by-minute forecasts added up into days, and then weeks.
One day, Lorenz decided to rerun one of his forecasts. In the interests of saving time he decided not to start from scratch; instead he took the computer’s prediction from halfway through the first run and used that as the starting point.
After a well-earned coffee break, he returned to discover something unexpected. Although the computer’s new predictions started out the same as before, the two sets of predictions soon began diverging drastically. What had gone wrong?
Lorenz soon realised that while the computer was printing out the predictions to three decimal places, it was actually crunching the numbers internally using six decimal places.
So while Lorenz had started the second run with the number 0.506, the original run had used the number 0.506127.
A difference of one part in a thousand: the same sort of difference that a flap of a butterfly’s wing might make to the breeze on your face. The starting weather conditions had been virtually identical. The two predictions were anything but.
Lorenz had found the seeds of chaos. In systems that behave nicely - without chaotic effects - small differences only produce small effects. In this case, Lorenz’s equations were causing errors to steadily grow over time.
This meant that tiny errors in the measurement of the current weather would not stay tiny, but relentlessly increased in size each time they were fed back into the computer until they had completely swamped the predictions.
Lorenz famously illustrated this effect with the analogy of a butterfly flapping its wings and thereby causing the formation of a hurricane half a world away.
A nice way to see this “butterfly effect” for yourself is with a game of pool or billiards. No matter how consistent you are with the first shot (the break), the smallest of differences in the speed and angle with which you strike the white ball will cause the pack of billiards to scatter in wildly different directions every time.
The smallest of differences are producing large effects - the hallmark of a chaotic system.
It is worth noting that the laws of physics that determine how the billiard balls move are precise and unambiguous: they allow no room for randomness.
What at first glance appears to be random behaviour is completely deterministic - it only seems random because imperceptible changes are making all the difference.
The rate at which these tiny differences stack up provides each chaotic system with a prediction horizon - a length of time beyond which we can no longer accurately forecast its behaviour.
In the case of the weather, the prediction horizon is nowadays about one week (thanks to ever-improving measuring instruments and models).
Some 50 years ago it was 18 hours. Two weeks is believed to be the limit we could ever achieve however much better computers and software get.
Surprisingly, the solar system is a chaotic system too - with a prediction horizon of a hundred million years. It was the first chaotic system to be discovered, long before there was a Chaos Theory.
In 1887, the French mathematician Henri Poincaré showed that while Newton’s theory of gravity could perfectly predict how two planetary bodies would orbit under their mutual attraction, adding a third body to the mix rendered the equations unsolvable.
The best we can do for three bodies is to predict their movements moment by moment, and feed those predictions back into our equations …
Though the dance of the planets has a lengthy prediction horizon, the effects of chaos cannot be ignored, for the intricate interplay of gravitation tugs among the planets has a large influence on the trajectories of the asteroids.
Keeping an eye on the asteroids is difficult but worthwhile, since such chaotic effects may one day fling an unwelcome surprise our way.
On the flip side, they can also divert external surprises such as steering comets away from a potential collision with Earth.

"""

#### Key Steps in Extractive Summarization:

1. **Text Preprocessing:** Cleaning and preprocessing the text (e.g., lowercasing and removing stopwords and punctuation).
2. **Sentence Tokenization:** Splitting the text into individual sentences.
3. **Word Tokenization and Normalization:** Tokenizing sentences into words and applying normalization techniques like stemming or lemmatization.
4. **Scoring Sentences:** Assigning a score to each sentence based on different features (e.g., word frequency, sentence position, presence of keywords).
5. **Rank and Sentence Selection:** Ranking sentences based on their scores and selecting the top-ranked ones for the summary.

**First, what is NLTK in Natural Language Processing (NLP)?**

NLTK, or Natural Language Toolkit, is a powerful and widely used Python library for working with human language data in the field of Natural Language Processing (NLP).


* Core Functionality: NLTK provides a comprehensive suite of tools and resources for various NLP tasks including:

    - Text preprocessing: Tokenization, stemming, lemmatization, stop word removal, part-of-speech tagging
    - Corpus access: Interfaces to numerous text corpora and lexical resources
    - Text analysis: Concordance, frequency distribution, collocations, and more
    - Machine learning: Classification, named entity recognition, and other NLP applications

* Benefits:

    - Ease of use: NLTK's user-friendly interface and extensive documentation make it accessible for both beginners and experienced NLP practitioners.
    - Versatility: The library supports a wide range of NLP tasks and techniques, enabling flexibility in project development.
    - Community and Resources: NLTK boasts a large and active community, providing ample support and resources for learning and troubleshooting.

* Applications: NLTK is employed in various NLP projects, including:

    - Sentiment analysis
    - Chatbots and conversational agents
    - Text summarization
    - Machine translation
    - Information extraction


In essence, NLTK is a foundational tool in the NLP landscape, offering a rich set of functionalities and resources to facilitate the development of diverse natural language processing applications.



In [2]:
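import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.probability import FreqDist
import string

# Tokenizer models used by sent_tokenize and word_tokenize.
# Recent NLTK releases look for 'punkt_tab' instead; download that
# resource as well if the tokenizers raise a LookupError.
nltk.download('punkt')
# English stop-word list
nltk.download('stopwords')
# WordNet data, needed only if lemmatization is applied
nltk.download('wordnet')
nltk.download('omw-1.4')

# A set gives constant-time membership checks during filtering
stop_words = set(stopwords.words("english"))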



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!



1. Preprocessing the Text:
    - The text is tokenized into sentences and words.
    - Stop words (common words like "the," "is," etc.) and punctuation are removed to focus on meaningful words.

In [3]:
def preprocess_text(text, stop_words = stop_words):
    # Tokenize text into sentences
    sentences = sent_tokenize(text)

    # Tokenize sentences into words
    words = word_tokenize(text.lower())

    # Remove stopwords and punctuation
    filtered_words = [word for word in words if word not in stop_words and word not in string.punctuation]

    return sentences, filtered_words

2. Word Frequency Calculation:
- A frequency distribution of the filtered words is calculated to determine the importance of each word.
- Frequencies are normalized to bring them to a common scale.

In [4]:
def compute_word_frequency(filtered_words):
    # Frequency distribution of words
    freq_dist = FreqDist(filtered_words)

    # Normalize frequencies
    max_freq = max(freq_dist.values())
    for word in freq_dist.keys():
        freq_dist[word] = (freq_dist[word] / max_freq)

    return freq_dist
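
To make the normalization step concrete, here is a small sketch on a hand-made token list. It uses `collections.Counter` as a dependency-free stand-in for `FreqDist` (an assumption made for illustration; `FreqDist` is itself a `Counter` subclass). The helper name `normalize_frequencies` and the toy tokens are hypothetical and not part of the notebook.

```python
from collections import Counter

def normalize_frequencies(tokens):
    """Return each token's count divided by the highest count in the list."""
    counts = Counter(tokens)
    if not counts:  # guard: max() would raise on an empty token list
        return {}
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}

# Hypothetical, already-filtered tokens (not taken from the chaos text)
toy_tokens = ["weather", "forecast", "weather", "chaos", "weather", "forecast"]
toy_freqs = normalize_frequencies(toy_tokens)
print(toy_freqs)
```

The most frequent word always ends up with a value of 1.0, and every other word is expressed as a fraction of it. Note that `compute_word_frequency` above has no empty-input guard, so it would raise a `ValueError` if every token were filtered out.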

3. Sentence Scoring:
- Each sentence is scored based on the sum of the frequencies of the words it contains.

In [5]:
def score_sentences(sentences, freq_dist):
    # Score each sentence based on the frequency of words it contains
    sentence_scores = {}
    for sentence in sentences:
        words = word_tokenize(sentence.lower())
        score = 0
        for word in words:
            if word in freq_dist:
                score += freq_dist[word]
        sentence_scores[sentence] = score

    return sentence_scores
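
Because the score is a plain sum, longer sentences tend to score higher simply because they contain more words. The sketch below contrasts the sum with a length-normalized mean on toy data. The mean is a swapped-in alternative, not what the notebook uses, and the function names, whitespace tokenization, and frequency values are all assumptions made for illustration.

```python
def score_sum(tokens, freqs):
    """Sum of normalized frequencies, mirroring the scoring rule above."""
    return sum(freqs.get(tok, 0.0) for tok in tokens)

def score_mean(tokens, freqs):
    """Length-normalized variant: average frequency over the scored tokens."""
    scored = [freqs[tok] for tok in tokens if tok in freqs]
    return sum(scored) / len(scored) if scored else 0.0

# Hypothetical normalized frequencies and sentences
toy_freqs = {"chaos": 1.0, "weather": 0.5, "errors": 0.5}
short = "chaos amplifies errors".split()
long_ = "the weather and the weather and more weather and errors".split()

print("sum :", score_sum(short, toy_freqs), score_sum(long_, toy_freqs))
print("mean:", score_mean(short, toy_freqs), score_mean(long_, toy_freqs))
```

Under the sum, the long sentence wins even though it mostly repeats one mid-frequency word; under the mean, the short sentence containing the top word ranks higher. Which behavior is preferable depends on the text, so this is a design choice rather than a fix.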

4. Generating the Summary:
- Sentences are sorted based on their scores, and the top sentences are selected to form the summary.

In [6]:
def generate_summary(sentence_scores, num_sentences=3):
    # Sort sentences by their scores in descending order
    sorted_sentences = sorted(sentence_scores.items(), key=lambda x: x[1], reverse=True)

    # Select the top 'num_sentences' sentences
    summary_sentences = [sentence[0] for sentence in sorted_sentences[:num_sentences]]

    # Join the selected sentences to form the summary
    summary = ' '.join(summary_sentences)

    return summary

In [7]:
# Execute the extractive summarization steps
sentences, filtered_words = preprocess_text(text)
freq_dist = compute_word_frequency(filtered_words)
sentence_scores = score_sentences(sentences, freq_dist)
summary = generate_summary(sentence_scores)

# Output the summary
print("Summary:\n")
print(summary)

Summary:

The best we can do for three bodies is to predict their movements moment by moment, and feed those predictions back into our equations …
Though the dance of the planets has a lengthy prediction horizon, the effects of chaos cannot be ignored, for the intricate interplay of gravitation tugs among the planets has a large influence on the trajectories of the asteroids. Once this computer program was up and running, Lorenz could produce long-term forecasts by feeding the predicted weather back into the computer over and over again, with each run forecasting further into the future.Accurate minute-by-minute forecasts added up into days, and then weeks. This meant that tiny errors in the measurement of the current weather would not stay tiny, but relentlessly increased in size each time they were fed back into the computer until they had completely swamped the predictions.




The output summary is not a good summary of the original text.

Here's why:

- **Fragmentary Coverage:** The summary stitches together fragments about the three-body problem and Lorenz's weather prediction model. While these are crucial parts of the text, the summary never states the broader concept of chaos theory that connects them.
- **The summary fails to explain key ideas like:**
    - **Deterministic Chaos:** The idea that seemingly random behavior can arise from deterministic systems due to sensitivity to initial conditions.
    - **Prediction Horizon:** The time limit beyond which accurate predictions become impossible in a chaotic system. The term appears once in passing but is never defined.
    - **Chaos in the Solar System:** The summary mentions the planets' influence on asteroid trajectories, but not that the solar system is itself a chaotic system.
- **Lacks Context and Flow:** The summary feels disjointed: it opens with the three-body problem and only then turns to Lorenz's forecasts, without providing sufficient context about chaos theory and its significance.

The summary provides a narrow view by focusing on isolated details and neglecting the broader concepts and implications of chaos theory explored in the original text. The summarization needs to be improved.
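
Part of the disjointed feel comes from `generate_summary` joining the selected sentences in score order rather than in the order they appear in the text. A minimal sketch of one possible remedy is shown below: select by score, then emit in source order. The function name `generate_summary_in_order` and the toy data are hypothetical, and this variant is not the one used in the rest of the notebook.

```python
def generate_summary_in_order(sentences, sentence_scores, num_sentences=3):
    """Pick the top-scoring sentences, then join them in source order."""
    top = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:num_sentences]
    chosen = set(top)
    # Walk the original sentence list so the output follows the document's order
    return " ".join(s for s in sentences if s in chosen)

# Hypothetical sentences and scores
toy_sentences = [
    "A opens the story.",
    "B adds detail.",
    "C states the result.",
    "D is an aside.",
]
toy_scores = {
    "A opens the story.": 0.4,
    "B adds detail.": 0.1,
    "C states the result.": 0.9,
    "D is an aside.": 0.2,
}
ordered = generate_summary_in_order(toy_sentences, toy_scores, num_sentences=2)
print(ordered)
```

Here sentence C has the highest score, yet the output starts with sentence A because A comes first in the source. Reordering does not change which sentences are chosen, so it cannot fix missing concepts; it only helps readability.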

##### Let's get into how we can Improve the Scoring Mechanism
- To improve the extractive summarization approach, we can incorporate additional features to score sentences more accurately. Here are some ways to enhance the scoring mechanism:

Additional Features for Scoring:

1. Sentence Length:
    - Very short or very long sentences may be less informative or harder to understand.
    - We can add a penalty or a boost to the sentence score based on its length.
2. Sentence Position:
    - In many texts (e.g., news articles), important information is often found in the first few sentences.
    - We can give a higher score to sentences appearing earlier in the text.
3. Named Entity Recognition (NER):
    - Sentences containing named entities (like people, places, organizations) may carry more important information.
    - We can boost the score of sentences that contain named entities.
4. Thematic Words:
    - Words that appear frequently throughout the text are often central to the main idea.
    - We can further emphasize the sentences containing these thematic words.
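
As a rough illustration of the thematic-words idea, the sketch below returns a score multiplier for sentences that contain any of the most frequent document tokens. The function name `thematic_boost`, the `top_k` cutoff, and the `boost` factor of 1.1 are hypothetical choices for illustration, not values taken from this notebook.

```python
from collections import Counter

def thematic_boost(sentence_tokens, all_tokens, top_k=5, boost=1.1):
    """Return `boost` if the sentence shares a token with the top_k most
    frequent tokens of the document, otherwise 1.0 (no change)."""
    thematic = {word for word, _ in Counter(all_tokens).most_common(top_k)}
    return boost if thematic.intersection(sentence_tokens) else 1.0

# Hypothetical, already-filtered document tokens
doc_tokens = ["chaos", "chaos", "chaos", "weather", "weather", "coffee"]

on_theme = thematic_boost(["chaos", "theory"], doc_tokens, top_k=2)
off_theme = thematic_boost(["coffee", "break"], doc_tokens, top_k=2)
print(on_theme, off_theme)
```

The returned value is meant to be multiplied into a sentence's existing score, in the same spirit as the other boosts. Since the base score already sums word frequencies, this feature partly overlaps with it, so a small boost factor is probably sensible.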



Let's incorporate these features into a revised version of the extractive summarization code, using NLTK and spaCy.

**What is spaCy in Natural Language Processing (NLP)?**


**spaCy** is an open-source Python library specifically designed for production/enterprise-grade Natural Language Processing (NLP). It focuses on providing fast and efficient tools for various NLP tasks, making it well-suited for real-world applications and large-scale text processing.


- Key Features and Benefits of spaCy:

    * Speed and Efficiency: spaCy is known for its exceptional performance, making it ideal for handling large volumes of text data. It leverages optimized algorithms and Cython implementations for blazing-fast processing.
    * Production-Ready: spaCy is built with a focus on practical applications and ease of deployment. Its streamlined API and well-documented functionalities simplify the development process, enabling seamless integration into existing workflows.
    * Pre-trained Models: spaCy offers a range of pre-trained statistical models for several languages, enabling out-of-the-box capabilities such as part-of-speech tagging, named entity recognition, dependency parsing, and more.
    * Customizability: In addition to offering pre-trained models, spaCy allows developers to train custom models for specific tasks or domains, enhancing its flexibility for specialized applications.
    * Ecosystem and Integrations: spaCy benefits from a rich ecosystem of plugins and extensions, enabling seamless integration with other libraries and tools within the NLP and machine learning landscape.

- Typical Use Cases:

    * Information Extraction: spaCy excels at tasks like entity recognition, relation extraction, and fact extraction, enabling the automatic extraction of valuable insights from unstructured text data.
    * Text Classification: spaCy can be used for sentiment analysis, topic classification, and other text classification tasks, helping categorize and understand large amounts of text.
    * Chatbots and Conversational Agents: spaCy's capabilities can be leveraged to power chatbots and conversational agents by providing text preprocessing, intent recognition, and entity extraction functionalities.
    * Preprocessing for Deep Learning: spaCy's efficient text preprocessing tools, including tokenization, lemmatization, and dependency parsing, make it a valuable asset for preparing text data for further analysis using deep learning models.



**spaCy** is a powerful and versatile NLP library, renowned for its speed, efficiency, and production-readiness. Its wide range of features and capabilities makes it an indispensable tool for developers and practitioners working on real-world NLP applications and projects involving large-scale text data processing.

In [9]:
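import spacy
# Load the small English spaCy pipeline, used here for Named Entity Recognition (NER).
# The model must be installed first, e.g. with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")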

In [10]:
def score_sentences(sentences, freq_dist, text):
    """
    Scores sentences based on multiple features to determine their importance for extractive summarization.

    This function computes a score for each sentence in the text by considering various features like word frequency,
    sentence length, sentence position, and the presence of named entities. The score is used to rank sentences
    when generating an extractive summary.

    Args:
        sentences (list of str): A list of sentences extracted from the original text.
        freq_dist (nltk.probability.FreqDist): A frequency distribution object containing normalized word frequencies.
        text (str): The original text from which sentences are extracted.

    Returns:
        dict: A dictionary where keys are sentences and values are their corresponding scores.
    """
    sentence_scores = {}  # Initialize an empty dictionary to store scores for each sentence
    doc = nlp(text)  # Use spaCy's NLP model to analyze the text for Named Entity Recognition (NER)

    for idx, sentence in enumerate(sentences):
        words = word_tokenize(sentence.lower())  # Tokenize the sentence into words and convert to lowercase
        score = 0  # Initialize the score for the current sentence

        # Score based on word frequency
        for word in words:
            if word in freq_dist:  # If the word is in the frequency distribution
                score += freq_dist[word]  # Add its frequency to the score

        # Feature 1: Sentence Length
        if 4 < len(words) < 25:  # Length threshold (in tokens) for a sentence to be considered "ideal"
            score *= 1.2  # Boost score if the sentence length is within the desired range

        # Feature 2: Sentence Position
        if idx < len(sentences) * 0.2:  # If the sentence is among the first 20% in the text
            score *= 1.5  # Boost score to emphasize its importance

        # Feature 3: Named Entity Recognition (NER)
        named_entities = [ent.text for ent in doc.ents if ent.text in sentence]  # Entities from the whole text whose string occurs in this sentence (substring match)
        if named_entities:  # If the sentence contains named entities
            score *= 1.3  # Boost score

        sentence_scores[sentence] = score  # Store the computed score for the current sentence

    return sentence_scores  # Return the dictionary containing sentences and their scores
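
One caveat with the `ent.text in sentence` test above: it is a substring check, so a short entity string can match inside a longer, unrelated word. The sketch below shows the effect and a possible whole-word alternative using the standard `re` module. The helper name `mentions_entity` and the entity strings are hypothetical; they are not actual spaCy output for this text.

```python
import re

def mentions_entity(sentence, entity_text):
    """True if entity_text occurs in sentence as a whole word or phrase."""
    # Lookarounds require that no word character touches either end of the match
    pattern = r"(?<!\w)" + re.escape(entity_text) + r"(?!\w)"
    return re.search(pattern, sentence) is not None

sentence = "The predictions alone were anything but identical."

substring_hit = "one" in sentence                 # matches inside "alone"
whole_word_hit = mentions_entity(sentence, "one")
print(substring_hit, whole_word_hit)
```

With a substring test, nearly every sentence can end up receiving the entity boost, which weakens the feature's ability to discriminate. A stricter option would be to compare spaCy's entity character offsets against each sentence's span, but that depends on how the sentences were tokenized.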

In [11]:
def generate_summary(sentence_scores, num_sentences=3):
    sorted_sentences = sorted(sentence_scores.items(), key=lambda x: x[1], reverse=True)
    summary_sentences = [sentence[0] for sentence in sorted_sentences[:num_sentences]]
    summary = ' '.join(summary_sentences)
    return summary


In [12]:
sentences, filtered_words = preprocess_text(text)
freq_dist = compute_word_frequency(filtered_words)
sentence_scores = score_sentences(sentences, freq_dist, text)
summary = generate_summary(sentence_scores)

print("Enhanced/improved Summary:")
print(summary)

Enhanced/improved Summary:
Once this computer program was up and running, Lorenz could produce long-term forecasts by feeding the predicted weather back into the computer over and over again, with each run forecasting further into the future.Accurate minute-by-minute forecasts added up into days, and then weeks. In the interests of saving time he decided not to start from scratch; instead he took the computer’s prediction from halfway through the first run and used that as the starting point. The best we can do for three bodies is to predict their movements moment by moment, and feed those predictions back into our equations …
Though the dance of the planets has a lengthy prediction horizon, the effects of chaos cannot be ignored, for the intricate interplay of gravitation tugs among the planets has a large influence on the trajectories of the asteroids.



The enhanced/improved summary is better than the previous one, but it still has some shortcomings:

- Positives:

    * Includes more context: It now mentions Lorenz and his experiment with weather prediction, providing some context for the discussion of chaos theory.
    * Slightly better flow: The sentences are somewhat more connected than in the previous summary.

- Negatives:

    * Still incomplete: While it touches on Lorenz's experiment and the three-body problem, it still misses the core concept of chaos theory: the idea that tiny differences in initial conditions can lead to drastically different outcomes over time. The term "butterfly effect" is still absent, and "prediction horizon" appears only in passing, without being defined.
    * Lacks focus: The summary jumps between weather prediction, the three-body problem, and the solar system without clearly connecting these ideas.
    * Some sentences are still out of context: The sentence "In the interests of saving time he decided not to start from scratch; instead he took the computer’s prediction from halfway through the first run and used that as the starting point" is included without explaining its significance in the context of Lorenz's discovery.


Overall, the enhanced summary is an improvement, but it still needs work to be considered a good summary of the original text. It needs to be more focused, include the key concepts of chaos theory, and provide a clear and logical flow of ideas.

Let's incorporate some evaluation:

To evaluate the quality of the summaries generated by our extractive summarization algorithm, we can use several evaluation metrics. The most widely used evaluation metric for summarization tasks is ROUGE (Recall-Oriented Understudy for Gisting Evaluation). Other evaluation metrics include BLEU (Bilingual Evaluation Understudy) and METEOR (Metric for Evaluation of Translation with Explicit ORdering).

Evaluation Metrics for Summarization
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
    * ROUGE-1: Measures the overlap of unigrams (single words) between the generated summary and a reference summary.
    * ROUGE-2: Measures the overlap of bigrams (two consecutive words).
    * ROUGE-L: Measures the longest common subsequence (LCS) between the generated and reference summaries, capturing the in-sequence overlap.
- BLEU (Bilingual Evaluation Understudy):
    * Primarily used for machine translation but can be adapted for summarization.
    * Measures the n-gram precision of a generated text against one or more reference texts.
- METEOR (Metric for Evaluation of Translation with Explicit ORdering):
    * Designed to improve BLEU by addressing problems like synonymy and stemming.
    * It considers unigram matches between generated and reference summaries, applying stemming and synonymy matching.
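
To show what the ROUGE variants measure, here is a simplified from-scratch sketch of ROUGE-N (clipped n-gram overlap) and of the longest common subsequence length that underlies ROUGE-L. It assumes lowercased whitespace tokenization and no stemming, so its numbers will generally differ from those of the `rouge-score` package. The function names and example strings are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Clipped n-gram overlap, returned as (precision, recall, f1)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # '&' keeps the minimum count per n-gram
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

reference = "small errors grow into large errors"
candidate = "small errors grow quickly"

print("ROUGE-1:", rouge_n(candidate, reference, n=1))
print("ROUGE-2:", rouge_n(candidate, reference, n=2))
print("LCS length:", lcs_length(candidate.split(), reference.split()))
```

In this toy case, three of the candidate's four unigrams appear in the reference, while the reference has six unigrams, which is why precision and recall differ. ROUGE-L precision and recall are obtained by dividing the LCS length by the candidate and reference lengths, respectively.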


In [14]:
!pip install rouge-score

from rouge_score import rouge_scorer



In [15]:
def calculate_rouge_scores1(generated_summary, reference_summary):
    """
    Calculate ROUGE scores for a generated summary compared to a reference summary.

    Args:
        generated_summary (str): The generated summary to evaluate.
        reference_summary (str): The reference summary for comparison.

    Returns:
        dict: A dictionary containing ROUGE-1, ROUGE-2, and ROUGE-L scores.
    """
    # Initialize the ROUGE scorer with ROUGE-1, ROUGE-2, and ROUGE-L
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

    # Calculate ROUGE scores
    scores = scorer.score(reference_summary, generated_summary)

    # Print the ROUGE scores
    print("ROUGE Scores:")
    print(f"ROUGE-1: {scores['rouge1'].precision:.3f}")
    print(f"ROUGE-2: {scores['rouge2'].precision:.3f}")
    print(f"ROUGE-L: {scores['rougeL'].precision:.3f}")

    return scores


In [16]:
reference_summary = """Chaos theory is a field of study in mathematics that examines the behavior of dynamical systems that are highly sensitive to initial conditions.
This sensitivity is popularly referred to as the butterfly effect, where a small change in one state of a deterministic nonlinear system can result in large differences in a later state.
Edward Lorenz, a meteorologist, made significant contributions to chaos theory through his work on weather prediction models. He discovered that even tiny errors in the initial
measurements of weather conditions could lead to drastically different forecasts over time. This finding highlighted the inherent limitations in predicting the long-term behavior of
chaotic systems. The concept of a prediction horizon emerged, representing the time limit beyond which accurate predictions become impossible due to the exponential growth of errors.
Chaos theory has implications beyond weather forecasting. For instance, the three-body problem in celestial mechanics demonstrates the chaotic nature of gravitational interactions
between three or more celestial bodies. Even with precise initial conditions, predicting the long-term trajectories of these bodies becomes increasingly difficult due to the
sensitivity to initial conditions."""

# Calculate and display ROUGE scores
calculate_rouge_scores1(summary, reference_summary)

ROUGE Scores:
ROUGE-1: 0.394
ROUGE-2: 0.050
ROUGE-L: 0.190


{'rouge1': Score(precision=0.39436619718309857, recall=0.30601092896174864, fmeasure=0.3446153846153846),
 'rouge2': Score(precision=0.04964539007092199, recall=0.038461538461538464, fmeasure=0.043343653250773995),
 'rougeL': Score(precision=0.19014084507042253, recall=0.14754098360655737, fmeasure=0.16615384615384615)}

In [17]:
def calculate_rouge_scores2(generated_summary, original_text):
    """
    Evaluates the quality of a generated summary using ROUGE scores, even without a reference summary.

    This function calculates ROUGE scores by comparing the generated summary against the original text,
    serving as a pseudo-reference. While not ideal, it provides a relative measure of how well the
    summary captures the salient information from the original text.

    Args:
        generated_summary (str): The summary generated by the summarization algorithm.
        original_text (str): The original text from which the summary was generated.

    Returns:
        dict: A dictionary containing ROUGE-1, ROUGE-2, and ROUGE-L scores.
    """
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    # Calculate ROUGE scores
    scores = scorer.score(original_text, generated_summary)

    # Print the ROUGE scores
    print("ROUGE Scores:")
    print(f"ROUGE-1: {scores['rouge1'].precision:.3f}")
    print(f"ROUGE-2: {scores['rouge2'].precision:.3f}")
    print(f"ROUGE-L: {scores['rougeL'].precision:.3f}")
    return scores



In [18]:
# Evaluate the generated summary
calculate_rouge_scores2(summary, text)


ROUGE Scores:
ROUGE-1: 1.000
ROUGE-2: 0.986
ROUGE-L: 1.000


{'rouge1': Score(precision=1.0, recall=0.18733509234828497, fmeasure=0.31555555555555553),
 'rouge2': Score(precision=0.9858156028368794, recall=0.18361955085865259, fmeasure=0.30957683741648107),
 'rougeL': Score(precision=1.0, recall=0.18733509234828497, fmeasure=0.31555555555555553)}

Let's analyze the results of the two functions, `calculate_rouge_scores1` and `calculate_rouge_scores2`, which calculate ROUGE scores for generated summaries in different contexts.

- Explanation of ROUGE scores: The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric evaluates the quality of a summary by comparing it to a reference (or original) text. Each ROUGE score is usually reported as three values:

> - Precision: Measures the percentage of words in the generated summary that appear in the reference summary.
> - Recall: Measures the percentage of words in the reference summary that appear in the generated summary.
> - F-Measure (F1 Score): The harmonic mean of precision and recall, providing a single score that balances both.
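
As a quick check of the harmonic-mean relationship, the sketch below recomputes the ROUGE-1 F-measure from the precision and recall values printed for the reference-summary comparison. The helper name `f_measure` is hypothetical; the two input numbers are copied from that output.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# ROUGE-1 precision and recall as reported against the reference summary
p, r = 0.39436619718309857, 0.30601092896174864
rouge1_f = f_measure(p, r)
print(round(rouge1_f, 4))  # prints 0.3446
```

The result agrees with the `fmeasure` field reported for `rouge1`. Because the harmonic mean is pulled toward the smaller of the two values, it sits closer to the recall here than a simple average would.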

1. `calculate_rouge_scores1` Results:

> - The `calculate_rouge_scores1` function computes ROUGE scores between a generated summary and a human-written reference summary. The output of the function is:

> - ```yaml
>   ROUGE Scores:
>   ROUGE-1: 0.394
>   ROUGE-2: 0.050
>   ROUGE-L: 0.190
>   ```


*  Explanation:
    - `ROUGE-1 (Unigram Precision: 0.394)`: Approximately 39.4% of the unigrams (individual words) in the generated summary are also present in the reference summary.
    - `ROUGE-2 (Bigram Precision: 0.050)`: Only about 5% of the bigrams (two consecutive words) match between the generated summary and the reference summary. This low score suggests that the generated summary does not capture many consecutive word pairs from the reference.
    - `ROUGE-L (Longest Common Subsequence Precision: 0.190)`: The LCS (Longest Common Subsequence) precision score is about 19%. This suggests that the generated summary captures some sequences of words from the reference summary, but the overlap is not substantial.

*  Summary:
    - These scores indicate that the generated summary only partially aligns with the reference summary.
    - The relatively low ROUGE-2 and ROUGE-L scores suggest that the generated summary does not capture the detailed structure or phrasing of the reference text very well.
    - The summary may contain the key points but not with the same level of detail or in the same sequence.
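
ROUGE-L is based on the longest common subsequence (LCS): the longest run of words that appear in both texts in the same order, not necessarily next to each other. The sketch below shows one way to compute it with dynamic programming. It assumes whitespace tokenization and no stemming, and the helper names and example sentences are hypothetical, so the numbers will not match `rouge_score` exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] is the LCS length between the tokens of `a` seen so far and b[:j]
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_sketch(reference, candidate):
    """Return (LCS length, LCS precision, LCS recall) for two strings."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    lcs = lcs_length(ref_tokens, cand_tokens)
    precision = lcs / len(cand_tokens) if cand_tokens else 0.0
    recall = lcs / len(ref_tokens) if ref_tokens else 0.0
    return lcs, precision, recall

# Hypothetical example sentences (not taken from the notebook's text)
reference = "lorenz found that tiny rounding errors changed the forecast"
candidate = "tiny errors in the input changed the whole forecast"
lcs, lcs_precision, lcs_recall = rouge_l_sketch(reference, candidate)
print(lcs, round(lcs_precision, 3), round(lcs_recall, 3))
# 5 0.556 0.556
```

Here the shared in-order words are "tiny errors changed the forecast", so the LCS has length 5 even though the two sentences share only one contiguous bigram.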


2. `calculate_rouge_scores2` Results:

> - The `calculate_rouge_scores2` function computes ROUGE scores by comparing the generated summary to the original text (from which the summary was generated). This provides an evaluation of how well the generated summary captures the content of the original text. The output of the function is:

> - ```yaml
>   ROUGE Scores:
>   ROUGE-1: 1.000
>   ROUGE-2: 0.986
>   ROUGE-L: 1.000
>   ```

*  Explanation:
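    - `ROUGE-1 (Unigram Precision: 1.000)`: Every unigram in the generated summary also appears in the original text. This is expected for an extractive summary, which is built only from sentences copied out of the original. It does not mean the summary contains all of the original's unigrams; that is recall, which is only about 0.187 here.
    - `ROUGE-2 (Bigram Precision: 0.986)`: The bigram precision is very high (98.6%), indicating that nearly all consecutive word pairs in the generated summary appear in the original text. The few unmatched bigrams are most likely word pairs that straddle the boundary between two extracted sentences that were not adjacent in the original.
    - `ROUGE-L (Longest Common Subsequence Precision: 1.000)`: The LCS precision score is 1.0, meaning all of the summary's words appear in the original text in the same order, so the entire summary is a subsequence of the original.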
*  Summary:
    - These near-perfect scores suggest that the generated summary is very close to a subset of the original text.
    - However, this does not necessarily indicate a good summary quality because it might just mean that the generated summary is almost identical to portions of the original text rather than a concise representation of it.
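
The pattern above (perfect unigram precision, slightly lower bigram precision, low recall) can be reproduced on a toy example. The sketch below builds a hypothetical extractive summary from the first and third sentences of a three-sentence source and computes clipped n-gram precision by hand. The sentences and helper names are invented for illustration, and tokenization is a plain whitespace split.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(source_tokens, summary_tokens, n):
    """Fraction of the summary's n-grams that also occur in the source."""
    source = ngram_counts(source_tokens, n)
    summary = ngram_counts(summary_tokens, n)
    matched = sum((source & summary).values())
    total = sum(summary.values())
    return matched / total if total else 0.0

# Hypothetical three-sentence source (punctuation omitted for simplicity)
toy_sentences = [
    "lorenz built a weather model",
    "he reran one forecast from the middle",
    "the two runs soon diverged",
]
source_tokens = " ".join(toy_sentences).split()
# Extractive summary: sentences 1 and 3 copied verbatim, sentence 2 skipped
summary_tokens = " ".join([toy_sentences[0], toy_sentences[2]]).split()

p1 = ngram_precision(source_tokens, summary_tokens, 1)
p2 = ngram_precision(source_tokens, summary_tokens, 2)
unigram_recall = ngram_precision(summary_tokens, source_tokens, 1)
print(f"unigram precision={p1:.3f} bigram precision={p2:.3f} unigram recall={unigram_recall:.3f}")
# unigram precision=1.000 bigram precision=0.889 unigram recall=0.588
```

The only unmatched bigram is `("model", "the")`, which exists in the summary solely because two non-adjacent sentences were joined. Any verbatim extract would score this way against its source, regardless of whether it picked the important sentences.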

**Interpretation**
  * Comparison with a Reference Summary `(calculate_rouge_scores1)`:
  > - The ROUGE scores indicate the degree of overlap between the generated summary and a human reference summary.
  > - Low `ROUGE-2` and `ROUGE-L` scores suggest that the generated summary uses different phrasing or a different sentence order than the reference, even where the topics overlap. ROUGE measures surface overlap only, so it does not directly assess coherence or clarity; a summary whose wording and ordering are closer to the reference would score higher.
  * Comparison with the Original Text `(calculate_rouge_scores2)`:
  > - When comparing the generated summary directly to the original text, the ROUGE scores are almost perfect.
  > - However, this is not always a good indicator of summarization quality. A good summary should be concise and capture the most important points rather than replicate large portions of the original text.

**Conclusion**
  * High ROUGE precision against the original text (as in `calculate_rouge_scores2`) is expected for an extractive summary and mainly confirms that the summary was copied from the original. The recall of about 0.19 shows that it covers roughly a fifth of the original's words.
  * Moderate to low ROUGE scores against a human reference summary (as in `calculate_rouge_scores1`) provide a more realistic evaluation of the quality of the summary. Higher `precision` and `recall` in `ROUGE-1`, `ROUGE-2`, and `ROUGE-L` against the reference indicate a summary whose content and wording are closer to the human-written one.

##### **Incorporating BLEU and METEOR Scores for Evaluation**
Now we will evaluate the quality of generated text summaries using two additional evaluation metrics: `BLEU (Bilingual Evaluation Understudy)` and `METEOR (Metric for Evaluation of Translation with Explicit ORdering)`. These metrics complement ROUGE by providing different perspectives on text similarity:

- `BLEU Score`: Measures the n-gram precision of the generated summary against the reference summary, with a brevity penalty for outputs shorter than the reference.
- `METEOR Score`: Considers synonymy, stemming, and word order, providing a more nuanced evaluation of the generated summary's alignment with the reference summary.

The functions below calculate `BLEU` and `METEOR` scores for a generated summary compared to a reference summary. Together with ROUGE, these scores give a broader picture of how closely the generated summary matches the reference in wording and content.

In [19]:
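# word_tokenize is imported here as well so that this cell can run on its own
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score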

def calculate_bleu_score(generated_summary, reference_summary):
    """
    Calculate BLEU score for a generated summary compared to a reference summary.

    Args:
        generated_summary (str): The generated summary to evaluate.
        reference_summary (str): The reference summary for comparison.

    Returns:
        float: The BLEU score.
    """
    # Tokenize the summaries
    generated_tokens = word_tokenize(generated_summary)
    reference_tokens = [word_tokenize(reference_summary)]

    # Calculate BLEU score with smoothing
    bleu_score = sentence_bleu(reference_tokens, generated_tokens, smoothing_function=SmoothingFunction().method1)

    # Print the BLEU score
    print("BLEU Score:", bleu_score)

    return bleu_score

def calculate_meteor_score(generated_summary, reference_summary):
    """
    Calculate METEOR score for a generated summary compared to a reference summary.

    Args:
        generated_summary (str): The generated summary to evaluate.
        reference_summary (str): The reference summary for comparison.

    Returns:
        float: The METEOR score.
    """
    # Tokenize the summaries; recent NLTK releases expect pre-tokenized input
    generated_tokens = word_tokenize(generated_summary)
    reference_tokens = word_tokenize(reference_summary)

    # Calculate METEOR score: a list of tokenized references, then the tokenized hypothesis
    meteor_score_value = meteor_score([reference_tokens], generated_tokens)

    # Print the METEOR score
    print("METEOR Score:", meteor_score_value)

    return meteor_score_value

In [20]:
# Calculate and display BLEU score
bleu_score = calculate_bleu_score(summary, reference_summary)

# Calculate and display METEOR score
meteor_score_value = calculate_meteor_score(summary, reference_summary)

BLEU Score: 0.006436573612432792
METEOR Score: 0.19305090744270662


###### **BLEU Score Explanation**
* BLEU Score: 0.0064

> - Range: BLEU scores range from 0 to 1, where 1 indicates a perfect match with the reference.
> - Interpretation:
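    * A BLEU score of 0.0064 is extremely low. Because BLEU combines 1-gram to 4-gram precisions in a geometric mean, this indicates that the generated summary shares very few longer word sequences with the reference summary, even though some individual words match (ROUGE-1 precision against the reference was 0.394).
    * This is typical when an extractive summary is compared with a human-written, paraphrased reference: the content may overlap while the phrasing differs. BLEU measures surface overlap only, so on its own it does not show that the summary is incoherent or irrelevant.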
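
To see why a single weak n-gram order can pull BLEU so low, here is a simplified single-reference BLEU sketch. It assumes uniform weights over 1-grams to 4-grams, a brevity penalty, and a smoothing rule that replaces a zero match count with a small `epsilon`, which is in the spirit of the `method1` smoothing used above but is not NLTK's code. The function name and example sentences are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_sketch(reference_tokens, candidate_tokens, max_n=4, epsilon=0.1):
    """Simplified BLEU; returns (score, list of per-order precisions)."""
    if len(candidate_tokens) < max_n:
        return 0.0, []  # too short to form the highest-order n-gram
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate_tokens, n)
        ref = ngram_counts(reference_tokens, n)
        matched = sum((cand & ref).values())  # clipped matches
        total = sum(cand.values())
        # Assumed smoothing: a zero match count becomes epsilon so log() is defined
        precisions.append((matched if matched else epsilon) / total)
    # Brevity penalty: punish candidates shorter than the reference
    c, r = len(candidate_tokens), len(reference_tokens)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    geometric_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty * geometric_mean, precisions

# Hypothetical example sentences (not taken from the notebook's text)
reference_tokens = "a small change in the input led to a very different forecast".split()
candidate_tokens = "the forecast was very different after a small change".split()
bleu_value, precisions = bleu_sketch(reference_tokens, candidate_tokens)
print([round(p, 3) for p in precisions], round(bleu_value, 3))
```

In this example seven of nine candidate words match the reference, yet no 4-gram does. The smoothed 4-gram precision is close to zero, and because the orders are multiplied together, the final score ends up far below the unigram precision.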

###### **METEOR Score Explanation**
* METEOR Score: 0.1931

> - Range: METEOR scores also range from 0 to 1, where 1 represents a perfect match.
> - Interpretation:
    * A METEOR score of 0.1931 is relatively low but higher than the BLEU score, indicating some degree of overlap in terms of content, synonyms, and grammatical structure.
    * METEOR takes into account partial matches and semantic similarities (like synonyms), so a score in this range often suggests that while the generated summary captures some parts of the meaning, it might not be a close or highly accurate representation of the reference summary.
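
For reference, the original METEOR formulation (Banerjee and Lavie, 2005) combines unigram precision $P$ and recall $R$ over aligned words (exact, stem, or synonym matches) as follows. NLTK's `meteor_score` exposes the corresponding constants as the `alpha`, `beta`, and `gamma` parameters, whose defaults (0.9, 3, 0.5) reproduce these formulas.

$$
F_{mean} = \frac{10\,P\,R}{R + 9P}, \qquad
\text{Penalty} = 0.5 \left( \frac{\#\,\text{chunks}}{\#\,\text{matched unigrams}} \right)^{3}, \qquad
\text{METEOR} = F_{mean}\,(1 - \text{Penalty})
$$

Here a chunk is a run of matched words that are adjacent and in the same order in both texts, so fewer, longer chunks mean a smaller penalty. Because the score is built on unigram matches and weights recall nine times more heavily than precision, it does not collapse when longer n-grams fail to match, which is one likely reason METEOR is higher than BLEU in this case.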

###### **Comparison and Analysis**
* The BLEU score is very low, showing that the generated summary has minimal n-gram overlap with the reference summary, especially for longer n-grams.
  - This is most likely due to a different choice of words and phrases, since the reference is paraphrased while the generated summary copies sentences from the original text.
* The METEOR score is relatively higher compared to the BLEU score, suggesting that while the word-for-word match is poor, there is some level of semantic similarity or overlap.
  - METEOR’s ability to account for synonyms and variations in word forms might explain why it has a higher score.

**Possible Reasons for Low Scores**
- Differences in Content: The generated summary might have a different focus or content than the reference summary.
- Length of the Summaries: If the generated summary is too short or too long compared to the reference summary, it can affect both scores.
- Quality of the Generated Summary: The generated summary might not be coherent, relevant, or informative enough compared to the reference summary.

###### **Conclusion**
In this case, both the BLEU and METEOR scores indicate that the generated summary is not closely aligned with the reference summary, although METEOR suggests there might be some semantic overlap. Improvements in the summarization algorithm, such as better handling of key content or phrase selection, may help enhance these scores.

### **Conclusion and Takeaways:** Extractive Text Summarization
- Extractive text summarization is a powerful technique in Natural Language Processing (NLP) that aims to condense a longer text into a shorter version while preserving its most important information.
- The process involves selecting sentences or phrases directly from the original text that are deemed most relevant, based on various statistical and linguistic features.

**Key Takeaways from the Process:**

- Understanding Extractive Summarization:
  * Extractive summarization involves selecting key sentences or phrases from the original text to form a concise summary.
  * Unlike abstractive summarization, it does not generate new sentences or rephrase existing ones.
  * Common features used for extraction include term frequency, sentence position, sentence length, named entities, and similarity with the title, among others.
- Steps in Extractive Summarization:
  * **Preprocessing:** Tokenization, stop-word removal, and other preprocessing techniques are applied to clean the text and prepare it for analysis.
  * **Scoring Sentences:** Various algorithms (such as TF-IDF, cosine similarity, or graph-based approaches) are used to score each sentence's importance in the context of the document.
  * **Selecting Sentences:** Based on the scoring mechanism, the top-ranked sentences are selected to form the summary.
- Incorporating Advanced Features and Libraries:
  * Improving the summarization model by incorporating additional features like sentence length, position, and named entities can significantly enhance the quality of the generated summary.
  * Libraries such as `NLTK` and `spaCy`, along with `Gensim` and `Sumy` (not used here), provide methods and algorithms that can be used for extractive summarization, offering more flexibility and customization.
- Evaluation of Summaries:
  * Evaluation metrics like `ROUGE`, `BLEU`, and `METEOR` provide a quantitative measure of the quality of the summaries generated.
  * ROUGE Scores (`ROUGE-1`, `ROUGE-2`, `ROUGE-L`) focus on overlap between the generated summary and reference summary (`n-gram` overlap for `ROUGE-1` and `ROUGE-2`, longest common subsequence for `ROUGE-L`), considering `precision`, `recall`, and `F-measure`.
  * `BLEU` Score evaluates the `precision` of `n-gram` matches, while `METEOR` Score considers both `precision` and `recall`, and incorporates synonyms and stemming, offering a more semantic evaluation.
- Challenges in Extractive Summarization:
  * **Identifying Salient Information:** Determining which parts of the text are most relevant and important for summarization remains a challenge.
  * **Generating Coherent Summaries:** While extractive summarization guarantees grammatical correctness (since it uses original sentences), ensuring logical flow and coherence remains a challenge.
  * **Balancing Precision and Recall:** Striking the right balance between including all relevant information (recall) and avoiding irrelevant information (precision) is crucial.
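
The three steps listed above (preprocessing, scoring, selecting) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the summarizer built earlier in this notebook: it uses a regex sentence splitter, a tiny hand-written stop-word list, and average word frequency as the only scoring feature. The function name, stop-word list, and sample document are all hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not NLTK's or spaCy's list)
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "and", "is", "it", "he", "was"}

def extract_top_sentences(document, num_sentences=2):
    # Preprocessing: naive sentence split, lowercase tokens, stop-word removal
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s.strip()]
    tokenized = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    # Scoring: average document-wide frequency of the words in each sentence
    frequency = Counter(w for words in tokenized for w in words)
    scores = [
        sum(frequency[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]
    # Selecting: take the highest-scoring sentences, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:num_sentences])]

sample_document = (
    "Lorenz built a weather model. "
    "The model predicted the weather a few minutes ahead. "
    "He took a coffee break. "
    "Small changes in the model input produced very different weather forecasts."
)
selected = extract_top_sentences(sample_document, num_sentences=2)
print(" ".join(selected))
```

Averaging the word frequencies keeps long sentences from winning on length alone. Swapping in TF-IDF weights, sentence position, or named-entity counts would only change the scoring step; the preprocessing and selection steps stay the same.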

**Overall Conclusion**

Extractive summarization is a well-established technique in NLP with practical applications in news summarization, document summarization, meeting summarization, and more. While effective, it has limitations in generating summaries that are truly concise and paraphrased. Nonetheless, by carefully selecting scoring mechanisms, incorporating various NLP features, and leveraging evaluation metrics, we can generate effective extractive summaries that are useful for many real-world applications. Future advancements may involve hybrid approaches that combine the precision of extractive methods with the fluency and creativity of abstractive methods, creating even more powerful summarization systems.