**📢 Introduction to NewsIQ: Revolutionizing News Summarization and Headline Generation**

In today's fast-paced world, staying updated with the latest news can be overwhelming due to the sheer volume of information available online. **NewsIQ** aims to solve this problem by providing **AI-powered news summarization and headline generation**. The goal is to deliver concise, meaningful, and accurate summaries along with catchy headlines for news articles, saving users time while keeping them well-informed.

Using **state-of-the-art NLP models like LSTM and T5/BART**, NewsIQ processes large datasets from reputable sources such as **CNN and BBC**. The system is designed to generate human-like summaries and headlines that capture the essence of each article. By evaluating the results with industry-standard metrics like **ROUGE and BLEU scores**, NewsIQ ensures high-quality, contextually relevant, and precise outputs.

With NewsIQ, readers get the **most essential takeaways from lengthy news articles** in just a few lines, making news consumption faster, smarter, and more efficient. 🚀

## Introduction

Automatic summarization is one of the central problems in
Natural Language Processing (NLP). It poses several challenges relating to language
understanding (e.g. identifying important content)
and generation (e.g. aggregating and rewording the identified content into a summary).

In this tutorial, we tackle the single-document summarization task
with an abstractive modeling approach. The primary idea here is to generate a short,
single-sentence news summary answering the question “What is the news article about?”.
This approach to summarization is also known as *Abstractive Summarization* and has
seen growing interest among researchers in various disciplines.

Following prior work, we aim to tackle this problem using a
sequence-to-sequence model. [Text-to-Text Transfer Transformer (`T5`)](https://arxiv.org/abs/1910.10683)
is a [Transformer-based](https://arxiv.org/abs/1706.03762) model built on the encoder-decoder
architecture, pretrained on a multi-task mixture of unsupervised and supervised tasks where each task
is converted into a text-to-text format. T5 shows impressive results on a variety of sequence-to-sequence
tasks (sequence in this notebook refers to text) such as summarization and translation.
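
To make the text-to-text idea concrete, here is a minimal sketch of how different tasks can be cast as plain strings by prepending a task prefix. The two prefixes follow the convention described in the T5 paper; the `TASK_PREFIXES` dictionary and the `to_text_to_text` helper are hypothetical names used only for illustration.

```python
# Hypothetical helper: cast a task into a text-to-text input by
# prepending a task prefix to the raw text.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "en_to_de": "translate English to German: ",
}


def to_text_to_text(task, text):
    """Return the string that would be fed to the model for this task."""
    return TASK_PREFIXES[task] + text


example = to_text_to_text(
    "summarization", "The quick brown fox jumped over the lazy dog."
)
print(example)
```

Because every task is expressed as "string in, string out", the same model and the same training objective can be reused across tasks.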

In this notebook, we will fine-tune the pretrained T5 on the Abstractive Summarization
task using Hugging Face Transformers on the `XSum` dataset loaded from Hugging Face Datasets.

## Setup

In [2]:
!pip install transformers
!pip install keras_hub
!pip install -U datasets
!pip install huggingface-hub
!pip install nltk
!pip install rouge-score


In [2]:
# !pip install --upgrade transformers tensorflow

### Importing the necessary libraries

In [3]:
import os
import logging

import nltk
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"

### Define certain variables

In [4]:
# Fraction of the train split sampled for each of the training and test subsets
TRAIN_TEST_SPLIT = 0.15

MAX_INPUT_LENGTH = 1024  # Maximum length of the input to the model
MIN_TARGET_LENGTH = 5  # Minimum length of the output by the model
MAX_TARGET_LENGTH = 128  # Maximum length of the output by the model
BATCH_SIZE = 8  # Batch-size for training our model
LEARNING_RATE = 2e-5  # Learning-rate for training our model
MAX_EPOCHS = 1  # Maximum number of epochs we will train the model for

# This notebook is built on the t5-small checkpoint from the Hugging Face Model Hub
MODEL_CHECKPOINT = "t5-small"

## Load the dataset

We will now download the [Extreme Summarization (XSum)](https://arxiv.org/abs/1808.08745) dataset.
The dataset consists of BBC articles and accompanying single-sentence summaries.
Specifically, each article is prefaced with an introductory sentence (aka summary) which is
professionally written, typically by the author of the article. The dataset has 226,711 articles
divided into training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) sets.

Following much of the literature, we use the Recall-Oriented Understudy for Gisting Evaluation
(ROUGE) metric to evaluate our sequence-to-sequence abstractive summarization approach.
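
As a rough illustration of what a ROUGE score measures, the sketch below computes a simplified ROUGE-L F1 from the longest common subsequence (LCS) of two texts. It assumes lowercase whitespace tokenization and no stemming, so its numbers will not necessarily match those of a library implementation; `lcs_length` and `rouge_l_f1` are hypothetical helper names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1 using whitespace tokens (an assumption)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the candidate covered by the LCS
    recall = lcs / len(ref)      # share of the reference covered by the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))
```

Here the LCS is "the cat on the mat" (5 of 6 tokens on each side), so precision, recall, and F1 are all 5/6.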

We will use the [Hugging Face Datasets](https://github.com/huggingface/datasets) library to download
the data we need to use for training and evaluation. This can be easily done with the
`load_dataset` function.

In [5]:
!pip install -U datasets



In [6]:
from datasets import load_dataset, DownloadMode

raw_datasets = load_dataset("xsum", split="train")



README.md:   0%|          | 0.00/6.24k [00:00<?, ?B/s]

xsum.py:   0%|          | 0.00/5.76k [00:00<?, ?B/s]

The repository for xsum contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/xsum.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


(…)SUM-EMNLP18-Summary-Data-Original.tar.gz:   0%|          | 0.00/255M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.72M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/204045 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11332 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11334 [00:00<?, ? examples/s]

The dataset has the following fields:

- **document**: the original BBC article to be summarized
- **summary**: the single sentence summary of the BBC article
- **id**: ID of the document-summary pair

In [7]:
print(raw_datasets)

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 204045
})


Let's now see what the data looks like:

In [8]:
print(raw_datasets[2])

{'document': 'Ferrari appeared in a position to challenge until the final laps, when the Mercedes stretched their legs to go half a second clear of the red cars.\nSebastian Vettel will start third ahead of team-mate Kimi Raikkonen.\nThe world champion subsequently escaped punishment for reversing in the pit lane, which could have seen him stripped of pole.\nBut stewards only handed Hamilton a reprimand, after governing body the FIA said "no clear instruction was given on where he should park".\nBelgian Stoffel Vandoorne out-qualified McLaren team-mate Jenson Button on his Formula 1 debut.\nVandoorne was 12th and Button 14th, complaining of a handling imbalance on his final lap but admitting the newcomer "did a good job and I didn\'t".\nMercedes were wary of Ferrari\'s pace before qualifying after Vettel and Raikkonen finished one-two in final practice, and their concerns appeared to be well founded as the red cars mixed it with the silver through most of qualifying.\nAfter the first ru

For the sake of demonstrating the workflow, in this notebook we will only take
small random subsets (15% each, set by `TRAIN_TEST_SPLIT`) of the train split as our
training and test sets. We can easily split the dataset using the `train_test_split`
method, which accepts the train and test sizes. It can also stratify by a class-label
column, but we do not use that option here.

In [9]:
raw_datasets = raw_datasets.train_test_split(
    train_size=TRAIN_TEST_SPLIT, test_size=TRAIN_TEST_SPLIT
)
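
As a quick sanity check on the split sizes, the arithmetic below estimates how many rows a fraction of 0.15 yields from the 204,045-row train split. The rounding rule (floor for the train size, ceil for the test size, as in scikit-learn) is an assumption about the library's behavior rather than something stated in this notebook.

```python
import math

n_rows = 204045   # rows in the XSum train split
fraction = 0.15   # value of TRAIN_TEST_SPLIT

# Assumption: fractional sizes are rounded as in scikit-learn,
# i.e. floor for the train size and ceil for the test size.
n_train = math.floor(fraction * n_rows)
n_test = math.ceil(fraction * n_rows)
print(n_train, n_test)
```

Under that assumption the two subsets have 30,606 and 30,607 rows, which agrees with the row counts reported by the `map` progress bars later in the notebook.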

## Data Pre-processing

Before we can feed those texts to our model, we need to pre-process them and get them
ready for the task. This is done by a Hugging Face Transformers `Tokenizer` which will tokenize
the inputs (including converting the tokens to their corresponding IDs in the pretrained
vocabulary) and put it in a format the model expects, as well as generate the other inputs
that model requires.

The `from_pretrained()` method expects the name of a model from the Hugging Face Model Hub. This is
exactly the `MODEL_CHECKPOINT` declared earlier, so we pass it directly.

In [10]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

If you are using one of the five T5 checkpoints, we have to prefix the inputs with
"summarize: " (the model can also translate, and it needs the prefix to know which task it
has to perform).

In [11]:
if MODEL_CHECKPOINT in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "  # same prefix T5 was pretrained with and that is used at inference
else:
    prefix = ""

We will write a simple function that helps us in the pre-processing that is compatible
with Hugging Face Datasets. To summarize, our pre-processing function should:

- Tokenize the text dataset (inputs and targets) into its corresponding token IDs that
will be used for embedding look-up in T5
- Add the prefix to the inputs
- Create additional inputs for the model, such as the `attention_mask`
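
The steps above can be sketched with a toy whitespace tokenizer. This is only an illustration of the prefix, truncation, and attention-mask logic; the real tokenizer uses a SentencePiece vocabulary, and `toy_preprocess` with its on-the-fly vocabulary is a hypothetical stand-in.

```python
def toy_preprocess(documents, prefix="summarize: ", max_length=8):
    """Prefix each document, split on whitespace, truncate, and map to IDs."""
    vocab = {}  # toy vocabulary built on the fly (assumption, not T5's vocabulary)
    batch = {"input_ids": [], "attention_mask": []}
    for doc in documents:
        tokens = (prefix + doc).split()[:max_length]  # add prefix, then truncate
        ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
        batch["input_ids"].append(ids)
        batch["attention_mask"].append([1] * len(ids))  # 1 marks a real token
    return batch


toy_batch = toy_preprocess(["a b c", "a b c d e f g h i j"], max_length=5)
print(toy_batch)
```

Note that nothing is padded at this stage: sequences keep their own lengths and are padded later, per batch.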

In [12]:

def preprocess_function(examples):

    inputs = [prefix + doc for doc in examples["document"]]

    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Tokenize the targets; `text_target` replaces the deprecated
    # `as_target_tokenizer()` context manager
    labels = tokenizer(
        text_target=examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True
    )

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs


To apply this function to all the document-summary pairs in our dataset, we just use the
`map` method of the `raw_datasets` object we created earlier. This will apply the function to
all the elements of all the splits in `raw_datasets`, so our training and testing
data will be preprocessed in one single command.

In [13]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

Map:   0%|          | 0/30606 [00:00<?, ? examples/s]



Map:   0%|          | 0/30607 [00:00<?, ? examples/s]

## Defining the model

Now we can download the pretrained model and fine-tune it. Since our task is
sequence-to-sequence (both the input and output are text sequences), we use the
`TFAutoModelForSeq2SeqLM` class from the Hugging Face Transformers library. Like with the
tokenizer, the `from_pretrained` method will download and cache the model for us.

The `from_pretrained()` method expects the name of a model from the Hugging Face Model Hub. As
mentioned earlier, we will use the `t5-small` model checkpoint.

In [16]:
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = TFAutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT, from_pt=True)

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


pytorch_model.bin:   0%|          | 0.00/242M [00:00<?, ?B/s]

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


For training Sequence to Sequence models, we need a special kind of data collator,
which will not only pad the inputs to the maximum length in the batch, but also the
labels. Thus, we use the `DataCollatorForSeq2Seq` provided by the Hugging Face Transformers
library on our dataset. The `return_tensors='tf'` ensures that we get `tf.Tensor`
objects back.

In [17]:
# from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
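
To illustrate what such a collator does to a batch, here is a simplified pure-Python sketch. It pads inputs with the pad token ID (0 for T5) and pads labels with -100, the value that the loss ignores. The real collator also returns tensors and can build decoder inputs, which this sketch omits; `pad_batch` is a hypothetical helper.

```python
LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss


def pad_batch(features, pad_token_id=0):
    """Pad inputs and labels to the longest sequence in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return batch


padded = pad_batch(
    [
        {"input_ids": [5, 6, 7], "labels": [8, 9]},
        {"input_ids": [5], "labels": [8, 9, 10]},
    ]
)
print(padded)
```

The negative label padding is also why an evaluation function has to swap negative label values back to the pad token ID before decoding.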

Next we define the training and testing sets with which we will train and evaluate our model.
Again, Hugging Face Datasets provides us with the `to_tf_dataset` method, which will help us
integrate our dataset with the `data_collator` defined above. The method expects certain parameters:

- **columns**: the columns which will serve as our independent variables
- **batch_size**: our batch size for training
- **shuffle**: whether we want to shuffle our dataset
- **collate_fn**: our collator function

Additionally, we also define a relatively smaller `generation_dataset` to calculate
`ROUGE` scores on the fly while training.

In [18]:
train_dataset = tokenized_datasets["train"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
test_dataset = tokenized_datasets["test"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)
generation_dataset = (
    tokenized_datasets["test"]
    .shuffle()
    .select(list(range(200)))
    .to_tf_dataset(
        batch_size=BATCH_SIZE,
        columns=["input_ids", "attention_mask", "labels"],
        shuffle=False,
        collate_fn=data_collator,
    )
)

## Building and Compiling the model

Now we will define our optimizer and compile the model. The loss calculation is handled
internally and so we need not worry about that!

In [19]:
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer)
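
For intuition about the internally computed loss, the sketch below shows a token-level cross-entropy that skips label positions padded with -100. It is a simplified stand-in under that assumption, not the library's actual implementation, and `masked_cross_entropy` is a hypothetical helper.

```python
import math


def masked_cross_entropy(token_probs, labels, ignore_id=-100):
    """Mean negative log-likelihood over positions whose label is not ignored.

    token_probs[i] is the probability the model assigned to labels[i].
    """
    losses = [-math.log(p) for p, y in zip(token_probs, labels) if y != ignore_id]
    return sum(losses) / len(losses) if losses else 0.0


# The third position is padding, so only the first two contribute.
loss = masked_cross_entropy([0.5, 0.25, 0.9], [3, 7, -100])
print(round(loss, 4))
```

Only the real target tokens contribute to the average, so padding does not reward or penalize the model.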

## Training and Evaluating the model

To evaluate our model on the fly while training, we will define `metric_fn`, which will
calculate the `ROUGE` score between the ground truth and the predictions.

In [20]:
import keras_hub

rouge_l = keras_hub.metrics.RougeL()


def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge_l(decoded_labels, decoded_predictions)
    # We will print only the F1 score, you can use other aggregation metrics as well
    result = {"RougeL": result["f1_score"]}

    return result


Now we can finally start training our model!

In [21]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(
    metric_fn, eval_dataset=generation_dataset, predict_with_generate=True
)

callbacks = [metric_callback]

# For now we will use our test set as our validation_data
model.fit(
    train_dataset, validation_data=test_dataset, epochs=MAX_EPOCHS, callbacks=callbacks
)









<tf_keras.src.callbacks.History at 0x7ab1fef759d0>

In [None]:
# import tensorflow as tf
# print(tf.config.list_physical_devices('GPU'))

[]


For best results, we recommend training the model for at least 5 epochs on the entire
training dataset!

## Inference

Now we will run inference with the model we trained on an arbitrary article. To do so,
we will use the `pipeline` function from Hugging Face Transformers. Hugging Face Transformers provides
us with a variety of pipelines to choose from. For our task, we use the `summarization`
pipeline.

The `pipeline` function takes in the trained model and tokenizer as arguments. The
`framework="tf"` argument tells the pipeline that the model is a TensorFlow model.

In [24]:
from transformers import pipeline

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="tf")

summarizer(
    raw_datasets["test"][0]["document"],
    min_length=MIN_TARGET_LENGTH,
    max_length=MAX_TARGET_LENGTH,
)

Device set to use 0


[{'summary_text': '"We are deeply disappointed and frustrated that a prosecution cannot proceed at this time," the Met has said.'}]

In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer


# Save the model
model.save_pretrained('path/to/save/model')

# Save the tokenizer (if needed)
tokenizer = T5Tokenizer.from_pretrained('t5-small')
tokenizer.save_pretrained('path/to/save/model')


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


('path/to/save/model/tokenizer_config.json',
 'path/to/save/model/special_tokens_map.json',
 'path/to/save/model/spiece.model',
 'path/to/save/model/added_tokens.json')

In [None]:

!pip install gradio


In [None]:
import gradio as gr
from transformers import T5ForConditionalGeneration, T5Tokenizer

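# Load the saved model and tokenizer. The model above was saved from TensorFlow,
# so the PyTorch class needs `from_tf=True` to convert the weights.
model = T5ForConditionalGeneration.from_pretrained('path/to/save/model', from_tf=True)
tokenizer = T5Tokenizer.from_pretrained('path/to/save/model')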

# Function to summarize input text
def summarize_text(input_text):
    inputs = tokenizer.encode("summarize: " + input_text, return_tensors="pt", max_length=512, truncation=True)
    summary_ids = model.generate(inputs, max_length=150, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Create Gradio Interface
interface = gr.Interface(
    fn=summarize_text,
    inputs=gr.Textbox(lines=5, placeholder="Enter text to summarize..."),
    outputs=gr.Textbox(label="Summary")
)

# Launch the app
interface.launch()
