<a href="https://colab.research.google.com/github/linhle32/Interactive-Models-with-Widget/blob/main/text_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarization

This notebook finetunes a pretrained language model to summarize news articles. Most of the code in the data processing and finetuning parts comes from the HuggingFace tutorial for text summarization at https://huggingface.co/docs/transformers/tasks/summarization.

I modified the data loading to support custom datasets and wrote a small GUI application at the end. Finetuning is very resource-demanding, so remember to change your runtime type to GPU.

### Loading Data and Setting up

As usual, we first install the necessary libraries, mount our Google Drive, and load the data.

In [None]:
!pip install transformers datasets
from google.colab import drive
drive.mount('/content/drive')

While the `datasets` library can load text data directly, doing so is very slow on Google Colab. As a workaround, we use `pandas` to load the data first.

In this example, I use a subset of the `CNN Daily Mail` dataset available at https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail.

This news dataset has three columns:
1. `id`: ID of the articles
2. `article`: the text bodies of the articles
3. `highlights`: short summaries of the articles

In general, we only need two columns: one for the text and one for the summary. We will drop the others.

In [None]:
import pandas as pd

data = pd.read_csv('.../cnn_daily_mail_news.csv')
data.head(3)

Unnamed: 0,id,article,highlights
0,61df4979ac5fcc2b71be46ed6fe5a46ce7f071c3,"Sally Forrest, an actress-dancer who graced th...","Sally Forrest, an actress-dancer who graced th..."
1,21c0bd69b7e7df285c3d1b1cf56d4da925980a68,A middle-school teacher in China has inked hun...,Works include pictures of Presidential Palace ...
2,56f340189cd128194b2e7cb8c26bb900e3a848b4,A man convicted of killing the father and sist...,"Iftekhar Murtaza, 29, was convicted a year ago..."


We will further change the names of the two columns to `text` and `summary` before loading them into `datasets` for finetuning.

In [None]:
data = data[['article','highlights']]
data.columns = ['text', 'summary']
data.head(3)

Unnamed: 0,text,summary
0,"Sally Forrest, an actress-dancer who graced th...","Sally Forrest, an actress-dancer who graced th..."
1,A middle-school teacher in China has inked hun...,Works include pictures of Presidential Palace ...
2,A man convicted of killing the father and sist...,"Iftekhar Murtaza, 29, was convicted a year ago..."
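The column selection and renaming above can also be done while reading the file. The sketch below is only an illustration on a tiny in-memory CSV with made-up rows (a stand-in for the real file, which is not reproduced here). It uses the `usecols` and `nrows` arguments of `pd.read_csv` to keep only the needed columns and rows, then `rename` to get `text` and `summary`.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real news file.
csv_text = io.StringIO(
    "id,article,highlights\n"
    "a1,First article body.,First summary.\n"
    "a2,Second article body.,Second summary.\n"
    "a3,Third article body.,Third summary.\n"
)

small = (
    # Read only the two needed columns and only the first two rows.
    pd.read_csv(csv_text, usecols=["article", "highlights"], nrows=2)
    # Rename to the column names used in the rest of the notebook.
    .rename(columns={"article": "text", "highlights": "summary"})
)

print(small.columns.tolist())  # ['text', 'summary']
print(len(small))              # 2
```

Limiting rows at read time can save memory on a large file, at the cost of always taking rows from the top of the file.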


Now we load the news data using the `datasets` library. Note that I only use the first 2001 articles of the original data (`data.loc[:2000]` includes both endpoints) because finetuning on the whole dataset would take a long time.

We then split the data into 80% for training and 20% for testing.

In [None]:
from datasets import Dataset
dataset = Dataset.from_pandas(data.loc[:2000])
dataset = dataset.train_test_split(test_size=0.2)
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'summary'],
        num_rows: 1600
    })
    test: Dataset({
        features: ['text', 'summary'],
        num_rows: 401
    })
})
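The 1600/401 split shown above follows from simple arithmetic. The sketch below assumes the test share is rounded up and the remainder goes to training. That rounding rule is my assumption about `train_test_split`, chosen because it reproduces the row counts in the output.

```python
import math

n_rows = 2001      # rows selected by data.loc[:2000] (label slicing is inclusive)
test_size = 0.2

# Assumed rounding rule: test share rounded up, the rest is training data.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)  # 1600 401
```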

### Finetuning `T5`

All of the code in this portion is from the HuggingFace tutorial. We only need to adjust `learning_rate`, `weight_decay_rate`, and `epochs` to get good model performance.



In [None]:
learning_rate = 3e-5
weight_decay_rate = 0.01
epochs = 10

In [None]:
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
  except RuntimeError as e:
    print(e)

from transformers import AutoTokenizer

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_data = dataset.map(preprocess_function, batched=True)

from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf")

from transformers import TFAutoModelForSeq2SeqLM
model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

tf_train_set = model.prepare_tf_dataset(
    tokenized_data["train"],
    shuffle=True,
    batch_size=8,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_data["test"],
    shuffle=False,
    batch_size=8,
    collate_fn=data_collator,
)

from transformers import AdamWeightDecay
optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay_rate)

model.compile(optimizer=optimizer)
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=epochs)
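To get a feel for how long training runs, it helps to count the batches. The sketch below uses the split sizes, `batch_size`, and `epochs` from this notebook. It assumes that the shuffled training set drops a final partial batch while the unshuffled test set keeps it. That batching behavior is an assumption on my part, so treat the numbers as an estimate.

```python
import math

n_train, n_test = 1600, 401   # split sizes from the DatasetDict above
batch_size = 8
epochs = 10

# Assumption: a partial last batch is dropped for the shuffled training set.
train_batches = n_train // batch_size
# Assumption: the unshuffled test set keeps its partial last batch.
test_batches = math.ceil(n_test / batch_size)
total_updates = train_batches * epochs

print(train_batches, test_batches, total_updates)  # 200 51 2000
```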


### Save the Finetuned Model

The last step after training a model is to save it. Replace the empty string below with the path where you want to store the model.

In [None]:
model.save_pretrained("")

# Application

Next, we will write a small application to interact with the finetuned model. This part of the notebook can be run without the previous portion if you have saved your model.

In [None]:
model_path = '.../news_summary_model'

In [None]:
!pip install transformers
from google.colab import drive
drive.mount('/content/drive')

from transformers import AutoTokenizer
checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

from transformers import TFAutoModelForSeq2SeqLM
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_path)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at /content/drive/MyDrive/IT7133/Week 5/news_summary_model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [None]:
import ipywidgets as widgets
from IPython.display import display

output = widgets.Output()
text_input = widgets.Textarea(
    value='',
    placeholder='Please type something',
    description='Text:',
    disabled=False,
    layout=widgets.Layout(height="auto", width="auto")
)
button_summarize = widgets.Button(description="Summarize")
display(text_input, button_summarize, output)

@output.capture()
def on_predict_clicked(b):
  output.clear_output()
  prompt = "summarize: " + text_input.value
  inputs = tokenizer(prompt, return_tensors="tf").input_ids
  outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)
  summarized = tokenizer.decode(outputs[0], skip_special_tokens=True)
  with output:
    print('Summarized: ' + summarized)

button_summarize.on_click(on_predict_clicked)

Textarea(value='', description='Text:', layout=Layout(height='auto', width='auto'), placeholder='Please type s…

Button(description='Summarize', style=ButtonStyle())

Output()
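The summarization logic inside the button callback can also be pulled out into a plain function, which makes it usable without the widgets. This is a minimal sketch, not part of the original tutorial. The function name `summarize` and its parameters are my own, and it assumes `tokenizer` and `model` are the objects loaded above. It also truncates the input to 1024 tokens, matching the limit used during finetuning.

```python
def summarize(text, tokenizer, model, prefix="summarize: ",
              max_input_tokens=1024, max_new_tokens=100):
    """Return a summary of `text` using a T5-style tokenizer and model."""
    text = text.strip()
    if not text:
        # Nothing to summarize, so skip the model call entirely.
        return ""
    # Add the task prefix and truncate long inputs to the training limit.
    input_ids = tokenizer(
        prefix + text,
        return_tensors="tf",
        max_length=max_input_tokens,
        truncation=True,
    ).input_ids
    # Deterministic decoding, same settings as in the button callback.
    outputs = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

With this in place, the callback body could reduce to `print('Summarized: ' + summarize(text_input.value, tokenizer, model))`.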