<a href="https://colab.research.google.com/github/kkrusere/NLP-Text-Summarization/blob/main/LLM_text_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
import os
#mounting google drive
drive.mount('/content/drive')
########################################
#changing the working directory
os.chdir("/content/drive/MyDrive/NLP_Data")

!pwd

Mounted at /content/drive
/content/drive/MyDrive/NLP_Data


## <center>**NLP Text Summarization**</center>
<center><em>
Text summarization is the technique of shortening a long piece of text into a coherent, fluent summary that keeps only the main points of the document. In short, it is the process of producing a shorter text while preserving the meaning of the original.
</em></center>
<br>
<center><img src="https://github.com/kkrusere/NLP-Text-Summarization/blob/main/assets/text_summary.png?raw=1" width=600/></center>

***Project Contributors:*** Kuzi Rusere<br>
**MVP streamlit App URL:** N/A


#### **Intro**

With the rise of large language models (LLMs) and transformer-based architectures, the ability to perform high-quality abstractive text summarization has significantly improved. Models such as `BERT, GPT, T5, and BART` have proven to be exceptionally effective in generating fluent and human-like summaries, making them the go-to choice for modern NLP applications.

For our first deep dive into `Text Summarization` using LLMs and Transformers, we will focus on the T5 (Text-to-Text Transfer Transformer) model. T5 is a powerful and flexible transformer-based model, available through `Hugging Face's` transformers library, that we will use to build an abstractive text summarization system.

T5 is unique in its ability to treat every NLP problem as a text-to-text problem, where both inputs and outputs are text strings. This allows for a unified approach to tasks such as translation, question answering, and summarization, making it an excellent choice for generating summaries that capture the essence of the input text.
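
To make the text-to-text idea concrete, the sketch below shows how different tasks can be expressed as plain input strings by prepending a task prefix. The prefixes are the ones used by the original T5 checkpoints; the dictionary and the helper `to_text_to_text` are hypothetical names introduced only for illustration.

```python
# Task prefixes used by the original T5 checkpoints.
# TASK_PREFIXES and to_text_to_text are illustrative names, not part of any library.
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "translate_en_de": "translate English to German: ",
    "cola": "cola sentence: ",
}


def to_text_to_text(task: str, text: str) -> str:
    """Turn a (task, text) pair into a single input string for a text-to-text model."""
    return TASK_PREFIXES[task] + text.strip()


# Every task becomes "string in, string out"; only the prefix changes.
for task in TASK_PREFIXES:
    print(repr(to_text_to_text(task, "The weather is hard to predict.")))
```

This notebook uses only the `summarize: ` prefix, which is prepended to each article before tokenization.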



###### **Step-by-Step Implementation for LLM Text Summarization Using T5**

1. Install Required Libraries

- First, we make sure the necessary libraries are installed, the main ones being the Hugging Face transformers and datasets libraries.

In [2]:
pip install transformers datasets

Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)

In [3]:
import nltk
import spacy
import re
import string


from transformers import pipeline
from datasets import load_dataset, Dataset
import pandas as pd
import numpy as np

from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers import DataCollatorForSeq2Seq
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import TrainingArguments, Trainer, AutoModelForCausalLM, GenerationConfig

import torch


2. Load the T5 Model and Tokenizer

- We will use T5ForConditionalGeneration for the model and T5Tokenizer for tokenizing input and output sequences.

In [4]:
# Load the pre-trained T5 model and tokenizer
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565



3. Prepare the Dataset

- We are going to use the `CNN/DailyMail` dataset for summarization.
- We could use the Hugging Face datasets library to load the CNN/DailyMail dataset, but the copy we tried there exposed only the `validation` and `test` splits, with no `train` split.
- There is a version of the dataset from `Kaggle` that we are going to be using.

In [5]:
# Read the CSV files
train_df = pd.read_csv("cnn_dailymail/train.csv")
validation_df = pd.read_csv("cnn_dailymail/validation.csv")
test_df = pd.read_csv("cnn_dailymail/test.csv")

# Convert to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
validation_dataset = Dataset.from_pandas(validation_df)
test_dataset = Dataset.from_pandas(test_df)



4. Preprocess the Data

- Tokenize the input text (articles) and summaries, ensuring the sequence length is within the model's capacity (e.g., max_length=512).

In [6]:
def preprocess_function(examples, max_input_length=512, max_output_length=150):
    # T5 expects a task prefix; "summarize: " selects the summarization task
    inputs = ["summarize: " + doc for doc in examples["article"]]
    # Truncate long articles and pad short ones to a fixed length
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding="max_length")

    # Tokenize the summaries (targets); text_target replaces the deprecated
    # tokenizer.as_target_tokenizer() context manager
    labels = tokenizer(text_target=examples["highlights"], max_length=max_output_length, truncation=True, padding="max_length")

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply preprocessing
train_dataset = train_dataset.map(preprocess_function, batched=True)
validation_dataset = validation_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/287113 [00:00<?, ? examples/s]



Map:   0%|          | 0/13368 [00:00<?, ? examples/s]
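
As a rough illustration of what `truncation=True` and `padding="max_length"` do to each example, here is a toy sketch on plain lists of token IDs. It is not the Hugging Face implementation (the real tokenizer also handles special tokens such as the end-of-sequence marker); the helper name `truncate_and_pad` is hypothetical, and the pad ID of 0 matches T5's pad token.

```python
from typing import List, Tuple


def truncate_and_pad(token_ids: List[int], max_length: int, pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Clip to max_length, then right-pad with pad_id.

    Returns (input_ids, attention_mask), where the mask is 1 for real tokens
    and 0 for padding.
    """
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    input_ids = clipped + [pad_id] * n_pad
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return input_ids, attention_mask


# A short sequence is padded, a long one is cut off.
print(truncate_and_pad([5, 6, 7], max_length=5))
print(truncate_and_pad([1, 2, 3, 4, 5, 6, 7], max_length=5))
```

With `max_length=512`, any article longer than 512 tokens loses its tail, so the model only ever sees the beginning of long articles.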

5. Create Data Collator

- A data collator is needed to batch inputs and pad them to the same length dynamically during training.

In [7]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)
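
The idea behind dynamic padding can be sketched in plain Python: each batch is padded only up to its own longest sequence, and label padding uses `-100` so that those positions are ignored by the loss (this is the default `label_pad_token_id` of `DataCollatorForSeq2Seq`). The function `pad_batch` below is a hypothetical simplification, not the library's code.

```python
from typing import Dict, List


def pad_batch(features: List[Dict[str, List[int]]], pad_id: int = 0, label_pad_id: int = -100) -> Dict[str, List[List[int]]]:
    """Pad inputs and labels to the longest sequence in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_in - len(f["input_ids"])
        n_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_in)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_lab)
    return batch


features = [
    {"input_ids": [11, 12, 13, 1], "labels": [21, 1]},
    {"input_ids": [14, 1], "labels": [22, 23, 24, 1]},
]
print(pad_batch(features))
```

Note that the preprocessing step in this notebook already pads every example with `padding="max_length"`, so the collator has nothing left to pad here; dynamic padding only saves computation when that fixed padding is dropped during tokenization.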


6. Fine-Tune the T5 Model

- Use Trainer from the Hugging Face library to train the model.

In [8]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,

    logging_dir="./logs",
    evaluation_strategy="epoch",
)

generation_config = GenerationConfig(
    max_new_tokens=50,
    num_beams=3
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    data_collator=data_collator,  # use the seq2seq collator created in step 5
)

trainer.train()





| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.9782        | 0.98457         |
| 2     | 0.951         | 0.976977        |
| 3     | 0.9766        | 0.973461        |


TrainOutput(global_step=107670, training_loss=0.9803271438458332, metrics={'train_runtime': 43345.3953, 'train_samples_per_second': 19.872, 'train_steps_per_second': 2.484, 'total_flos': 1.16575171938091e+17, 'train_loss': 0.9803271438458332, 'epoch': 3.0})
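
The numbers in `TrainOutput` can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (neither is set in `TrainingArguments` above); the training-set size is taken from the `Map` progress output of the preprocessing step.

```python
import math

num_examples = 287113          # training rows, from the Map progress output
batch_size = 8                 # per_device_train_batch_size
num_epochs = 3                 # num_train_epochs
train_runtime_s = 43345.3953   # train_runtime reported in TrainOutput

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

steps_per_second = round(total_steps / train_runtime_s, 3)
samples_per_second = round(num_examples * num_epochs / train_runtime_s, 3)

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("steps/s:", steps_per_second)
print("samples/s:", samples_per_second)
print("runtime in hours:", round(train_runtime_s / 3600, 1))
```

The total of 107670 steps matches `global_step`, and the two throughput figures match `train_steps_per_second` and `train_samples_per_second`; the run took roughly 12 hours.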

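7. Evaluate the Model

- Compute the loss of the fine-tuned model on the validation set.

In [None]:
# Evaluate the model on the validation set.
# Trainer.evaluate() does not take generation settings, so generation_config
# is not passed here; the result is a dict of metrics such as eval_loss.
eval_results = trainer.evaluate()
eval_results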
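
Validation loss does not directly measure summary quality. A common complement for summarization is a word-overlap score such as ROUGE, which this notebook does not compute. The sketch below is a simplified ROUGE-1 F1 on whitespace tokens, offered only as an illustration: the function name `rouge1_f1` is hypothetical, and real ROUGE implementations add proper tokenization and optional stemming.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference summary and a generated one."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "lorenz found that small errors grow over time"
candidate = "lorenz found that tiny errors grow quickly"
print(rouge1_f1(reference, candidate))
```

A score like this could be computed between each generated summary and the corresponding `highlights` entry of the test set.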

8. Generate Summaries

- Use the fine-tuned model to generate summaries for new text inputs.

In [None]:
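def generate_summary(text):
    # Tokenize input text and move the tensors to the same device as the model
    # (after training on a GPU, the model no longer lives on the CPU)
    inputs = tokenizer("summarize: " + text, return_tensors="pt", max_length=512, truncation=True).to(model.device)

    # Generate summary with beam search
    summary_ids = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

    # Decode summary
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)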
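
`generate_summary` combines `num_beams=4` with `length_penalty=2.0`. As a rough sketch of how the length penalty affects beam search, finished candidates are ranked by their summed log-probability divided by the length raised to `length_penalty`, so values above 0 favor longer outputs. The helper `beam_score` and the two candidates below are made up for illustration, and the exact length convention inside the library may differ.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized score used to rank finished beam hypotheses."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical finished candidates: (summed log-probability, length in tokens).
short = (-4.0, 4)
long = (-6.0, 8)

for lp in (0.0, 1.0, 2.0):
    s = beam_score(short[0], short[1], lp)
    l = beam_score(long[0], long[1], lp)
    winner = "short" if s > l else "long"
    print(f"length_penalty={lp}: short={s:.5f}, long={l:.5f}, winner={winner}")
```

With no penalty the shorter candidate wins because it has the higher raw log-probability; with `length_penalty=2.0` the longer candidate is preferred.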

For our example text, we are going to use this brief explainer on the history of chaos theory.

In [None]:
text = """
In 1961, a meteorologist by the name of Edward Lorenz made a profound discovery. Lorenz was utilising the new-found power of computers in an attempt to more accurately predict the weather. He created a mathematical model which, when supplied with a set of numbers representing the current weather, could predict the weather a few minutes in advance.
Once this computer program was up and running, Lorenz could produce long-term forecasts by feeding the predicted weather back into the computer over and over again, with each run forecasting further into the future. Accurate minute-by-minute forecasts added up into days, and then weeks.
One day, Lorenz decided to rerun one of his forecasts. In the interests of saving time he decided not to start from scratch; instead he took the computer’s prediction from halfway through the first run and used that as the starting point.
After a well-earned coffee break, he returned to discover something unexpected. Although the computer’s new predictions started out the same as before, the two sets of predictions soon began diverging drastically. What had gone wrong?
Lorenz soon realised that while the computer was printing out the predictions to three decimal places, it was actually crunching the numbers internally using six decimal places.
So while Lorenz had started the second run with the number 0.506, the original run had used the number 0.506127.
A difference of one part in a thousand: the same sort of difference that a flap of a butterfly’s wing might make to the breeze on your face. The starting weather conditions had been virtually identical. The two predictions were anything but.
Lorenz had found the seeds of chaos. In systems that behave nicely - without chaotic effects - small differences only produce small effects. In this case, Lorenz’s equations were causing errors to steadily grow over time.
This meant that tiny errors in the measurement of the current weather would not stay tiny, but relentlessly increased in size each time they were fed back into the computer until they had completely swamped the predictions.
Lorenz famously illustrated this effect with the analogy of a butterfly flapping its wings and thereby causing the formation of a hurricane half a world away.
A nice way to see this “butterfly effect” for yourself is with a game of pool or billiards. No matter how consistent you are with the first shot (the break), the smallest of differences in the speed and angle with which you strike the white ball will cause the pack of billiards to scatter in wildly different directions every time.
The smallest of differences are producing large effects - the hallmark of a chaotic system.
It is worth noting that the laws of physics that determine how the billiard balls move are precise and unambiguous: they allow no room for randomness.
What at first glance appears to be random behaviour is completely deterministic - it only seems random because imperceptible changes are making all the difference.
The rate at which these tiny differences stack up provides each chaotic system with a prediction horizon - a length of time beyond which we can no longer accurately forecast its behaviour.
In the case of the weather, the prediction horizon is nowadays about one week (thanks to ever-improving measuring instruments and models).
Some 50 years ago it was 18 hours. Two weeks is believed to be the limit we could ever achieve however much better computers and software get.
Surprisingly, the solar system is a chaotic system too - with a prediction horizon of a hundred million years. It was the first chaotic system to be discovered, long before there was a Chaos Theory.
In 1887, the French mathematician Henri Poincaré showed that while Newton’s theory of gravity could perfectly predict how two planetary bodies would orbit under their mutual attraction, adding a third body to the mix rendered the equations unsolvable.
The best we can do for three bodies is to predict their movements moment by moment, and feed those predictions back into our equations …
Though the dance of the planets has a lengthy prediction horizon, the effects of chaos cannot be ignored, for the intricate interplay of gravitation tugs among the planets has a large influence on the trajectories of the asteroids.
Keeping an eye on the asteroids is difficult but worthwhile, since such chaotic effects may one day fling an unwelcome surprise our way.
On the flip side, they can also divert external surprises such as steering comets away from a potential collision with Earth.

"""