# Exploring Text Summarization Methods in NLP

Author: Mohamed Oussama NAJI
Date: March 20th, 2024

## Introduction

In this notebook, I explore several Natural Language Processing (NLP) methods for text summarization. The goal is to generate a concise summary of a longer text, specifically an article about neural networks. I compare two groups of methods: extractive summarizers, which select sentences from the source (Gensim and Summa, both based on TextRank, plus Sumy's TextRank, LexRank, Luhn, and LSA implementations), and abstractive transformer models, which generate new text (BART, PEGASUS, and T5). Comparing their outputs shows which methods produce the most coherent and informative summaries and how suitable each is for summarization tasks.

## Table of Contents
1. [Installation and Setup](#installation-setup)
2. [Importing Libraries](#importing-libraries)
3. [Text Extraction](#text-extraction)
4. [Summarization with Transformers](#summarization-transformers)
5. [Summarization with Sumy](#summarization-sumy)
6. [Execution and Results](#execution-results)
7. [Conclusion](#conclusion)

## Installation and Setup <a id="installation-setup"></a>

Install the required libraries, then patch gensim 3.6.0 so that it imports `Mapping` from `collections.abc`, as recent Python versions require.


In [None]:
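!pip install sumy
# gensim 4.0 removed gensim.summarization, so pin a 3.x release.
!pip install gensim==3.6.0
!pip install summa
# gensim 3.6.0 imports Mapping from collections, which Python 3.10 no longer provides.
# The path below assumes a Python 3.10 dist-packages install; adjust it for other environments.
!sed -i 's/from collections import Mapping/from collections.abc import Mapping/g' /usr/local/lib/python3.10/dist-packages/gensim/corpora/dictionary.py
!sed -i 's/from collections.abc import Mapping, defaultdict/from collections.abc import Mapping\nfrom collections import defaultdict/g' /usr/local/lib/python3.10/dist-packages/gensim/corpora/dictionary.py
# nltk must be imported before its tokenizer data can be downloaded.
import nltk
nltk.download('punkt')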

## Importing Libraries <a id="importing-libraries"></a>

Importing the libraries used by the functions and cells below.

In [None]:
import requests
from bs4 import BeautifulSoup

from gensim.summarization import summarize as gensim_summarize
from summa import summarizer as summa_summarizer

from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer

from transformers import (
    BartForConditionalGeneration, BartTokenizer,
    PegasusForConditionalGeneration, PegasusTokenizer,
    T5ForConditionalGeneration, T5Tokenizer,
)

## Text Extraction <a id="text-extraction"></a>


The function below requests the page with a browser-like `User-Agent` header and uses BeautifulSoup to join the text of every `<p>` element into a single string.


In [None]:
def extract_text_from_medium(url):
    """Return the joined paragraph text of the page at url, or None on failure."""
    # Send a browser-like User-Agent header with the request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Keep only the text inside <p> tags and join it with single spaces.
        article_text = ' '.join(p.text for p in soup.find_all('p'))
        return article_text
    else:
        print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
        return None
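
To see what the paragraph join produces without making a network request, here is a minimal sketch that applies the same idea to an inline HTML string. It uses the standard library's `HTMLParser` instead of BeautifulSoup, and the names `ParagraphCollector`, `join_paragraphs`, and `sample_html` are hypothetical helpers introduced only for this illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside <p> elements (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # greater than 0 while inside a <p> element
        self.paragraphs = []  # finished paragraph strings
        self._buffer = []     # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.paragraphs.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        # Text outside <p> elements (headings, navigation) is ignored.
        if self.depth > 0:
            self._buffer.append(data)


def join_paragraphs(html):
    """Join the text of all <p> elements with single spaces."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.paragraphs)


sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>Neural networks learn from <b>examples</b>.</p>"
    "<div>Navigation text</div>"
    "<p>Each layer transforms its input.</p>"
    "</body></html>"
)
print(join_paragraphs(sample_html))
```

The heading and the `<div>` text are dropped, which is also why the extracted article may include any boilerplate that a page happens to place inside `<p>` tags.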

## Summarization with Transformers <a id="summarization-transformers"></a>


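The function below summarizes a text with a sequence-to-sequence transformer model. It truncates the input to the model's maximum length and generates the summary with beam search.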

In [None]:
def summarize_with_transformers(text, model, tokenizer, max_length=1024, min_length=40):
    # Never exceed the longest sequence the tokenizer reports for its model.
    max_length = min(max_length, tokenizer.model_max_length)
    # "summarize: " is the T5 task prefix; this function prepends it for every model.
    # Input beyond max_length tokens is truncated.
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", truncation=True, max_length=max_length)
    # Beam search with 4 beams; max_length also caps the length of the generated summary.
    summary_ids = model.generate(inputs, max_length=max_length, min_length=min_length, length_penalty=2.0, num_beams=4, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
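
Because the input is truncated, any part of the article beyond the token limit never reaches the model. One possible workaround, sketched below under stated assumptions, is to split the text into chunks, summarize each chunk, and join the partial summaries. This is not part of the notebook's pipeline: `split_into_chunks` and `summarize_long_text` are hypothetical helpers, and whitespace-separated words are used as a rough stand-in for model tokens, so `max_words` should stay well below the real token limit.

```python
def split_into_chunks(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long_text(text, summarize_fn, max_words=400):
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    chunks = split_into_chunks(text, max_words)
    return " ".join(summarize_fn(chunk) for chunk in chunks)


# Stand-in summarizer for the demonstration: keep the first five words of a chunk.
def first_words(chunk):
    return " ".join(chunk.split()[:5])


demo_text = " ".join(f"w{i}" for i in range(1000))
chunks = split_into_chunks(demo_text, max_words=400)
print([len(chunk.split()) for chunk in chunks])
print(summarize_long_text(demo_text, first_words, max_words=400))
```

In the notebook, `summarize_fn` could be a small wrapper such as `lambda chunk: summarize_with_transformers(chunk, bart_model, bart_tokenizer)`. Summaries of independent chunks may repeat or contradict each other, so the joined result is usually less coherent than a single-pass summary.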

## Summarization with Sumy <a id="summarization-sumy"></a>

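The function below parses the page with Sumy and runs four extractive summarizers (TextRank, LexRank, Luhn, and LSA), keeping five sentences from each.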

In [None]:
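def summarize_with_sumy(url, sentence_count=5):
    """Return a dict mapping each Sumy summarizer's class name to its summary."""
    # Sumy downloads and parses the page itself, using an English tokenizer.
    parser = HtmlParser.from_url(url, Tokenizer("english"))
    doc = parser.document

    summarizers = [TextRankSummarizer(), LexRankSummarizer(), LuhnSummarizer(), LsaSummarizer()]
    summaries = {}
    for summarizer in summarizers:
        # Each summarizer returns the sentence_count highest-ranked sentences.
        summary = summarizer(doc, sentence_count)
        summaries[summarizer.__class__.__name__] = ' '.join(str(sentence) for sentence in summary)
    return summaries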
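
All four Sumy summarizers are extractive: they score the sentences of the source and return the top-ranked ones unchanged. The sketch below illustrates that idea with a deliberately simplified word-frequency scorer. It is loosely in the spirit of Luhn's method but is not Sumy's implementation, and the stopword list, the regex sentence splitter, and the function names are assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real summarizers use much larger ones.
STOPWORDS = {"a", "an", "the", "of", "to", "and", "in", "is", "are",
             "it", "that", "its", "from", "by", "on"}


def content_words(text):
    """Lowercase words of the text with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def frequency_summary(text, sentence_count=2):
    """Return the sentence_count sentences with the highest average word frequency."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:sentence_count])  # restore the original sentence order
    return " ".join(sentences[i] for i in chosen)


demo = ("Neural networks learn patterns from data. "
        "The weather was pleasant yesterday. "
        "Training adjusts the weights of neural networks using data. "
        "My cat sleeps a lot.")
print(frequency_summary(demo, sentence_count=2))
```

The two sentences that share the most frequent words ("neural", "networks", "data") are selected, and the off-topic sentences are dropped. TextRank and LexRank replace this frequency score with a graph-based ranking over sentence similarities, while LSA ranks sentences using a singular value decomposition of the term-sentence matrix.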

## Execution and Results <a id="execution-results"></a>

Defining the URL and extracting the text.


In [None]:
url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"

article_text = extract_text_from_medium(url)

Executing summarizations.


In [None]:
if article_text:
    # Gensim summarization
    print("Gensim Summary:", gensim_summarize(article_text, word_count=200))
    print()

    # Summa summarization
    print("Summa Summary:", summa_summarizer.summarize(article_text, ratio=0.1))
    print()

    # Sumy summarizations
    sumy_summaries = summarize_with_sumy(url)
    for name, summary in sumy_summaries.items():
        print(f"{name} Summary:", summary)
        print()

    # BART summarization
    bart_model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    bart_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
    print("BART Summary:", summarize_with_transformers(article_text, bart_model, bart_tokenizer))
    print()

    # PEGASUS summarization
    pegasus_model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-xsum')
    pegasus_tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-xsum')
    print("PEGASUS Summary:", summarize_with_transformers(article_text, pegasus_model, pegasus_tokenizer))
    print()

    # T5 summarization
    t5_model = T5ForConditionalGeneration.from_pretrained('t5-small')
    t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')
    print("T5 Summary:", summarize_with_transformers(article_text, t5_model, t5_tokenizer))
    print()
else:
    print("No content was extracted for summarization.")

## Conclusion <a id="conclusion"></a>

In this notebook, I explored several text summarization methods to generate concise summaries of a longer article about neural networks. I compared Gensim, Summa, TextRank, LexRank, Luhn, LSA, BART, PEGASUS, and T5.

Each method produced a somewhat different summary and highlighted different aspects of the original text. The transformer-based models (BART, PEGASUS, and T5) generally generated more coherent and fluent summaries than the traditional extractive methods.

However, the effectiveness of each summarization technique may vary depending on the specific text and the desired level of abstractiveness. It is important to consider the trade-offs between summary length, informativeness, and coherence when selecting a summarization method for a particular task.
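
These trade-offs can be made a little more concrete with simple measurements. The sketch below is an assumption-laden illustration rather than an evaluation used in this notebook: it computes a compression ratio (summary length relative to the source) and a simplified unigram recall in the style of ROUGE-1 against a reference summary. The function names and example strings are hypothetical, and a reference summary would have to be written by hand because the notebook does not include one.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a simplification of real ROUGE tokenization."""
    return re.findall(r"[a-z0-9']+", text.lower())


def compression_ratio(source, summary):
    """Summary length divided by source length, both counted in words."""
    source_len = len(tokenize(source))
    return len(tokenize(summary)) / source_len if source_len else 0.0


def unigram_recall(reference, summary):
    """Fraction of reference words that also occur in the summary (clipped counts)."""
    ref_counts = Counter(tokenize(reference))
    sum_counts = Counter(tokenize(summary))
    overlap = sum(min(count, sum_counts[word]) for word, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


source = "neural networks learn patterns from data and improve with more data"
reference = "neural networks learn patterns from data"
summary = "neural networks learn from data"

print(f"compression ratio: {compression_ratio(source, summary):.2f}")
print(f"unigram recall:    {unigram_recall(reference, summary):.2f}")
```

Word overlap says nothing about fluency or factual accuracy, so numbers like these complement reading the summaries rather than replace it.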

Overall, this exploration provided insights into the capabilities and limitations of different summarization approaches in NLP. Further experimentation and evaluation on a larger corpus of texts would be beneficial to assess the generalizability and robustness of these methods.