# **Text Summarization Tool for News Articles**

This notebook shows how to automatically summarise news articles using two main strategies: **Extractive Summarisation** and **Abstractive Summarisation**. Summaries let readers grasp the main ideas of a lengthy news story without reading the entire article. For someone juggling several sources, a tool like this makes it possible to take in information more efficiently and without feeling overwhelmed.

The online news ecosystem is massive and continuously growing. With countless articles released every day, it is difficult to keep up with everything that may be relevant. Given this volume, a dependable tool that condenses long pieces into short, readable summaries is valuable.

The summaries created here serve several purposes:
- Quickly scanning many articles to choose which ones need a deeper look.
- Increasing research efficiency by determining whether an article provides information relevant to specific interests or projects.
- Assisting in the review of large volumes of material when time is limited, reducing the risk that important topics are overlooked.

---

## **Techniques Employed**

This notebook focusses on two widely used summarisation methods: extractive and abstractive.

1. **Extractive Summarisation (TextRank)**
   
   **Concept**

    Identifies key sentences directly in the original text using a graph-based ranking algorithm. Each sentence acts as a node, and edges are weighted by sentence similarity. The final summary consists of the highest-scoring sentences.

   **Advantages**
     - Quick and easy to use.
     - Leaves the extracted sentences unaltered, so each one stays faithful to the source wording.
   
   **Drawbacks**
     - Summaries stitched together from original sentences may feel disjointed or less natural.
     - May not capture the flow of the story as well as a human-written summary.

2. **Abstractive Summarisation (BART Transformer)**

    Uses `facebook/bart-large-cnn`, a transformer-based BART model fine-tuned on the CNN/Daily Mail news dataset.

   **Concept**

    Rewrites the text in a condensed form using natural language generation. The model aims to capture the gist and structure of the information, producing a summary that reads more like what a human would write.

   **Advantages**
     - Improved readability and fluency compared to extractive summaries.
     - Ability to simplify complicated ideas.

   **Drawbacks**
     - Higher computational cost.
     - There is a risk of adding slight mistakes or "hallucinations" because it is not limited to the original sentences.

---


In [None]:
!pip install beautifulsoup4 flask-ngrok ipywidgets lxml_html_clean newspaper3k nltk requests sumy transformers --quiet


In [None]:
import asyncio
import ipywidgets as widgets
import nltk
import re
import requests
from bs4 import BeautifulSoup
from flask import Flask, request, jsonify
from flask_ngrok import run_with_ngrok
from IPython.display import display

from newspaper import Article
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.text_rank import TextRankSummarizer
from transformers import pipeline


# The tokenizer only needs NLTK's Punkt sentence data; recent NLTK releases
# ship it as 'punkt_tab'. Downloading 'all' pulls every corpus, which is far
# more than this notebook uses.
nltk.download('punkt')
nltk.download('punkt_tab')


---

## **Data Collection & Scraping**

Manually parsing HTML from news websites is time-consuming. The `newspaper3k` library streamlines the procedure by:
- Downloading the webpage content.
- Parsing and extracting the primary article text.
- Removing extraneous components such as advertisements, navigation bars, and author profiles.

This phase saves time and effort, allowing you to focus on summarisation rather than web scraping complexities.


---


In [None]:
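def fetch_article_content(url: str) -> str:
    """Download a news page and return the main article text."""
    article = Article(url)
    article.download()  # fetch the raw HTML
    article.parse()     # extract the article body, dropping page boilerplate
    return article.text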

sample_url = "https://www.aljazeera.com/news/2024/10/28/why-could-a-silent-asthma-epidemic-be-sweeping-africa"

try:
    article_text = fetch_article_content(sample_url)
    print("Successfully fetched sample article content!")
    print("------------------------------------------------------")
    print("Sample Article Content (first 500 chars):")
    print(article_text[:500], "...")
except Exception as e:
    print("URL not accessible for scraping.")
    print("Error:", str(e))


Successfully fetched sample article content!
------------------------------------------------------
Sample Article Content (first 500 chars):
Millions of adolescents in Africa could be living unknowingly with asthma as cases go undiagnosed, researchers find.

Millions of adolescents across Africa may unknowingly be battling asthma because they have not received a diagnosis from a clinician and, therefore, are not receiving the necessary treatments, a new study has found.

Published last week in the research journal The Lancet, the study’s findings are critical for a continent that has produced little data about the scale of asthma des ...


---

## **Text Preprocessing**

Raw article content may contain bracketed references (e.g., [1]), superfluous whitespace, and other characters that can mislead summarisation algorithms. A preprocessing phase normalises the text to ensure its cleanliness and structure. This includes:

- Removing bracketed text and references.
- Converting numerous spaces or newlines into one space.
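- Removing every character outside a small whitelist: ASCII letters, digits, whitespace, and the punctuation marks `. , ! ? '`. Curly apostrophes, hyphens, and symbols such as `%` are therefore dropped as well.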

This phase results in more coherent summaries because the input text contains fewer distractions.
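
To make the three steps concrete, the short sketch below applies the same regular expressions one at a time to a made-up sentence (the sample string is an assumption for illustration, not taken from the article). It also shows a side effect worth knowing about: the final whitelist step removes curly apostrophes, hyphens, colons, and percent signs.

```python
import re

# Made-up sample text for illustration only.
sample = "The study’s findings [1] are\n\ncritical:   a 12% rise in well-known cases!"

# Step 1: drop bracketed references such as [1].
step1 = re.sub(r'\[.*?\]', '', sample)

# Step 2: collapse runs of spaces and newlines into a single space.
step2 = re.sub(r'\s+', ' ', step1)

# Step 3: keep only ASCII letters, digits, whitespace and . , ! ? '
step3 = re.sub(r'[^a-zA-Z0-9.,!?\'\s]', '', step2).strip()

print(step3)
# → The studys findings are critical a 12 rise in wellknown cases!
```

If figures such as percentages matter for your summaries, one option is to widen the whitelist in the last pattern before running the summarisers.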

---

In [None]:
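def preprocess_text(text: str) -> str:
    # Drop bracketed references such as [1].
    text = re.sub(r'\[.*?\]', '', text)
    # Collapse runs of spaces and newlines into a single space.
    text = re.sub(r'\s+', ' ', text)
    # Keep only ASCII letters, digits, whitespace and . , ! ? '
    # Note: this also strips curly apostrophes, hyphens and symbols such as %.
    text = re.sub(r'[^a-zA-Z0-9.,!?\'\s]', '', text)
    return text.strip()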


cleaned_text = preprocess_text(article_text)
print("Text successfully preprocessed!")
print("------------------------------------------------------")
print("Preprocessed Text (first 500 chars):")
print(cleaned_text[:500], "...")

Text successfully preprocessed!
------------------------------------------------------
Preprocessed Text (first 500 chars):
Millions of adolescents in Africa could be living unknowingly with asthma as cases go undiagnosed, researchers find. Millions of adolescents across Africa may unknowingly be battling asthma because they have not received a diagnosis from a clinician and, therefore, are not receiving the necessary treatments, a new study has found. Published last week in the research journal The Lancet, the studys findings are critical for a continent that has produced little data about the scale of asthma despit ...


---

## **Extractive Summarisation (TextRank)**

Extractive summarisation selects the most representative sentences directly from the original text. The method used here, TextRank, is a graph-based ranking algorithm modelled on PageRank:
- Each sentence represents a node in a graph.
- Edges connecting sentences are weighted by their similarity, based on shared words or linguistic features.
- The algorithm assigns a score to each sentence. The final summary is made up of the highest-scoring sentences.

Because the summary consists of sentences taken from the source, it stays faithful to the original wording, but the selected sentences may not flow together as a cohesive narrative.
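
The ranking idea can be sketched in a few lines of plain Python. This is a simplified illustration, not how `sumy` implements TextRank internally: the sentence splitter, the word-overlap similarity, the damping factor of 0.85, and all function names are assumptions chosen for the example.

```python
import math
import re

def split_sentences(text):
    # Naive splitter for illustration: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def tokens(sentence):
    # Lowercased set of words, ignoring punctuation.
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))

def similarity(a, b):
    # Shared words, normalised by the log of each sentence's vocabulary size.
    words_a, words_b = tokens(a), tokens(b)
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))

def textrank(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    # Edge weights: similarity between every pair of distinct sentences.
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    # PageRank-style iteration: a sentence scores highly when it is
    # similar to other sentences that themselves score highly.
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores

def summarise(text, num_sentences=2):
    sentences = split_sentences(text)
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    # Return the chosen sentences in their original order.
    return " ".join(sentences[i] for i in sorted(top))

# Made-up sample text: three related sentences and one off-topic sentence.
sample = (
    "Asthma affects many adolescents in African cities. "
    "Many adolescents with asthma in African cities remain undiagnosed. "
    "Researchers say undiagnosed asthma harms adolescents in cities. "
    "The weather was sunny yesterday."
)
sentences = split_sentences(sample)
scores = textrank(sentences)
for score, sentence in zip(scores, sentences):
    print(f"{score:.3f}  {sentence}")
print(summarise(sample, num_sentences=2))
```

In this sample the off-topic weather sentence shares no words with the others, so it has no edges, receives the lowest score, and is left out of the summary.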

---


In [None]:
def extractive_summary(text: str, num_sentences: int = 3) -> str:
    # Split the text into sentences with sumy's English tokenizer.
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = TextRankSummarizer()
    # Rank the sentences and keep the num_sentences highest-scoring ones.
    summary_sentences = summarizer(parser.document, num_sentences)
    return " ".join(str(sentence) for sentence in summary_sentences)



extractive_summary_text = extractive_summary(cleaned_text)
print("Extractive summarization complete!")
print("------------------------------------------------------")
print("Extractive Summary:")
print(extractive_summary_text)

Extractive summarization complete!
------------------------------------------------------
Extractive Summary:
Published last week in the research journal The Lancet, the studys findings are critical for a continent that has produced little data about the scale of asthma despite the condition being one of the most common causes of chronic respiratory deaths on the continent. The study, which was conducted from 2018 to 2021, focused on 20,000 children aged 12 to 14 in schools located in urban areas Blantyre in Malawi, Durban in South Africa, Harare in Zimbabwe, Kampala in Uganda, Kumasi in Ghana and Lagos in Nigeria. The most recent estimate is from 2010 when 119 million were projected to be suffering from asthma on the continent, according to a 2013 study in the archives of the US National Library of Medicine.


---

## **Abstractive Summarisation (BART)**

The goal of abstractive summarisation is to create a new, condensed version of the material. It uses a language model to rewrite the content rather than selecting sentences from the source text. The `facebook/bart-large-cnn` model used here is a BART model fine-tuned on the CNN/Daily Mail news dataset.

To stay within the model's input length limit, very long articles need to be divided into smaller chunks that are summarised separately. The model aims to generate a summary that is both grammatically sound and coherent. In many cases, the result reads more naturally than an extractive summary.
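
One possible way to divide a long article is to break it at sentence boundaries so that no sentence is cut in half. The sketch below is an illustrative alternative to a fixed word-count split; the function name `chunk_by_sentences`, the `max_words` parameter, and the naive sentence splitter are assumptions, and word counts are only a rough proxy for the model's token limit.

```python
import re

def chunk_by_sentences(text, max_words=400):
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        # A single sentence longer than max_words still becomes its own chunk.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Tiny made-up example with a budget of six words per chunk.
demo = "One two three. Four five six. Seven eight nine. Ten."
print(chunk_by_sentences(demo, max_words=6))
# → ['One two three. Four five six.', 'Seven eight nine. Ten.']
```

Each chunk can then be summarised on its own and the partial summaries joined together.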

Verifying important details is always a good idea, because abstractive models can occasionally introduce small errors or details that are not present in the source.
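
As a lightweight first check, you could flag any number that appears in a summary but not in the source text. This is only a rough heuristic sketched for illustration (the helper name `unsupported_numbers` and the sample strings are made up); it does not replace reading the summary against the article.

```python
import re

def unsupported_numbers(source, summary):
    # Match integers and numbers with , or . separators, e.g. 12, 20,000, 3.5
    pattern = r'\d+(?:[.,]\d+)*'
    source_numbers = set(re.findall(pattern, source))
    summary_numbers = set(re.findall(pattern, summary))
    # Numbers in the summary that never occur in the source.
    return sorted(summary_numbers - source_numbers)

# Made-up strings for illustration.
source = "The study covered 20,000 children aged 12 to 14 between 2018 and 2021."
summary = "The study covered 20,000 children aged 12 to 15."
print(unsupported_numbers(source, summary))
# → ['15']
```

An empty list does not prove the summary is correct; it only means no unfamiliar figures were introduced.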

---

In [None]:
# device=-1 runs the model on the CPU.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=-1)

def abstractive_summary(text: str, min_length: int = 30, max_length: int = 400) -> str:
    words = text.split()
    if len(words) < 10:
        return "Text too short to summarize effectively."

    # Split the input into chunks of at most max_length words. Note that
    # max_length doubles as the chunk size (in words) and the cap on the
    # generated summary length (in tokens).
    chunks = [' '.join(words[i:i + max_length]) for i in range(0, len(words), max_length)]

    summaries = []
    for chunk in chunks:
        # Never request a summary longer than the chunk itself, and keep
        # min_length <= max_length for short trailing chunks.
        current_max_length = min(max_length, len(chunk.split()))
        current_min_length = min(min_length, current_max_length)
        result = summarizer(
            chunk,
            min_length=current_min_length,
            max_length=current_max_length,
            do_sample=False
        )
        summaries.append(result[0]['summary_text'])
    return " ".join(summaries)

abstractive_summary_text = abstractive_summary(cleaned_text)
print("Abstractive summarization complete!")
print("------------------------------------------------------")
print("Abstractive Summary:")
print(abstractive_summary_text)



Abstractive summarization complete!
------------------------------------------------------
Abstractive Summary:
12 percent of adolescents in six African countries had severe asthma symptoms. 80 percent of them had not been diagnosed by a health expert. Study focused on 20,000 children aged 12 to 14 in schools in urban areas. Durban had the highest number of pupils with asthma symptoms while Blantyre had the lowest. Asthma is a chronic, often lifelong respiratory disease characterised by acute inflammation of the airways and airflow obstruction that affects 262 million people worldwide. About half of those affected may be in Africa. The high number of asthma cases has been linked to the continents rapid urbanisation and rise in pollution. Total asthma cases on the continent went from 94 million in 2000 to 119 million in 2010. Adolescents make up about 14 percent of the asthma cases in Africa although the numbers vary widely. The climate crisis is causing more asthma cases as well. About

---

## **Interactive Widget**

Editing code cells each time you want to try a different URL is tedious. A widget-based UI makes testing easier:

- Paste the URL of a news article into the text field.
- Click the "Summarize" button.
- The tool automatically retrieves, cleans, and summarises the article text.
- Both the extractive and abstractive summaries are displayed.


Without altering the code cells themselves, this interface facilitates experimentation with different articles, subjects, and news sources.

---


In [None]:
url_input = widgets.Text(
    description="Article URL:",
    placeholder="Paste news article URL here"
)

output = widgets.Output()

def on_button_click(b):
    with output:
        output.clear_output()
        url = url_input.value.strip()
        if not url:
            print("Please provide a valid URL.")
            return

        try:
            dynamic_article_text = fetch_article_content(url)
            dynamic_cleaned_text = preprocess_text(dynamic_article_text)

            print("======================================================")
            print("EXTRACTIVE SUMMARY:")
            print("------------------------------------------------------")
            print(extractive_summary(dynamic_cleaned_text))

            print("\n======================================================")
            print("ABSTRACTIVE SUMMARY:")
            print("------------------------------------------------------")
            print(abstractive_summary(dynamic_cleaned_text))
        except Exception as e:
            print("Error while processing the URL:", str(e))

button = widgets.Button(description="Summarize")
button.on_click(on_button_click)

print("Use the widget below to summarize any news article:")
print("1. Paste the URL into the text box.")
print("2. Click 'Summarize' to generate and display both extractive and abstractive summaries.")
display(url_input, button, output)


Use the widget below to summarize any news article:
1. Paste the URL into the text box.
2. Click 'Summarize' to generate and display both extractive and abstractive summaries.



---

## **Next Actions & Things to Think About**

The techniques in this notebook are only a starting point in the rapidly developing field of summarisation. Here are some suggestions for improving or expanding on this approach:

- **Try Various Models**

  With some tuning, models such as T5 or Pegasus may generate more accurate or more fluent summaries.

- **Parameter Tuning**
  
  More customised results can be obtained by modifying parameters such as `num_sentences` in extractive summaries or `min_length` and `max_length` for abstractive summaries.

- **Fine-Tuning on Specific Domains**
  
  If you are concentrating on specialised subjects (such as financial reports or medical publications), consider fine-tuning summarisation models on domain-specific data. Accuracy and relevance may improve as a result.

- **Quality Assurance**
  
  Automated summaries should be checked for accuracy. Human oversight helps preserve important details and catch factual errors.

- **Integration with Larger Applications**
  
  Consider developing a web application or incorporating these features into a knowledge management system. A search feature and the ability to store previously summarised articles could make the tool more useful.
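
As a starting point for the storage and search idea, the sketch below keeps summaries in a small JSON file keyed by a hash of the article URL. Everything here is hypothetical and for illustration only: the `SummaryCache` class, its method names, the file layout, and the example URL are assumptions, and a real application would more likely use a database.

```python
import hashlib
import json
import tempfile
from pathlib import Path

class SummaryCache:
    """Hypothetical JSON-backed store for previously generated summaries."""

    def __init__(self, path):
        self.path = Path(path)
        # Load existing entries if the file is already there.
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    @staticmethod
    def _key(url):
        # Hash the URL so it can serve as a compact, fixed-length key.
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def get(self, url):
        entry = self.data.get(self._key(url))
        return entry["summary"] if entry else None

    def put(self, url, summary):
        self.data[self._key(url)] = {"url": url, "summary": summary}
        self.path.write_text(json.dumps(self.data, indent=2))

    def search(self, keyword):
        # Case-insensitive substring search over the stored summaries.
        keyword = keyword.lower()
        return [e["url"] for e in self.data.values() if keyword in e["summary"].lower()]

# Usage with a temporary file and a made-up URL.
cache_file = Path(tempfile.mkdtemp()) / "summaries.json"
cache = SummaryCache(cache_file)
cache.put("https://example.com/a", "Asthma cases are rising.")
print(cache.get("https://example.com/a"))
print(cache.search("asthma"))
```

Checking the cache before calling the summarisers would avoid re-running the model on an article that has already been processed.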

  ---