<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/summarization/T5_large_Finetune_multi_news_summarization_v2_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarization

Summarization creates a shorter version of a document or an article that captures all the important information. Along with translation, it is another example of a task that can be formulated as a sequence-to-sequence task. Summarization can be:

- Extractive: extract the most relevant information from a document.
- Abstractive: generate new text that captures the most relevant information.
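
To make the distinction concrete, here is a minimal sketch of an extractive approach. It is only an illustration under simple assumptions (naive sentence splitting and word-frequency scoring; the function name `extractive_summary` is hypothetical) and it is not the method used in this notebook, which fine-tunes T5 for abstractive summarization.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Toy extractive summarizer: keep the sentences with the highest word-frequency score."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole document (lowercased).
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Rank sentences by score, keep the best ones, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)


sample = "The cat sat on the mat. The cat chased the mouse. Dogs bark."
print(extractive_summary(sample, num_sentences=1))
```

The output is always copied verbatim from the input, which is what makes the approach extractive; an abstractive model is free to produce sentences that never appear in the source.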

https://huggingface.co/docs/transformers/tasks/summarization



### Dataset --> multi_news dataset for summarization
Multi-News consists of news articles and human-written summaries of these articles from the site newser.com. Each summary is professionally written by editors and includes links to the original articles cited.

There are two features:

- `document`: text of the news articles, separated by the special token "|||||".
- `summary`: the news summary.
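
Because each `document` concatenates several source articles, it can be useful to split it back into individual articles when inspecting examples. The following is a small sketch, assuming the separator is exactly the "|||||" token described above; the helper name `split_articles` and the example string are hypothetical.

```python
from typing import List

SEPARATOR = "|||||"  # special token between source articles, per the dataset description


def split_articles(document: str) -> List[str]:
    """Split a Multi-News `document` string into its individual source articles."""
    parts = (part.strip() for part in document.split(SEPARATOR))
    # Drop empty pieces, e.g. when the document ends with a trailing separator.
    return [part for part in parts if part]


example = "First article text. ||||| Second article text. |||||"
articles = split_articles(example)
print(len(articles), articles)
```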


@misc{alex2019multinews,
    title={Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model},
    author={Alexander R. Fabbri and Irene Li and Tianwei She and Suyi Li and Dragomir R. Radev},
    year={2019},
    eprint={1906.01749},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}


Other datasets:

https://huggingface.co/datasets/ccdv/pubmed-summarization

https://huggingface.co/datasets/samsum


In [None]:
# Transformers installation
! pip install -q --disable-pip-version-check py7zr sentencepiece loralib peft trl
! pip install -q    wandb bitsandbytes
! pip install datasets evaluate rouge_score -q
! pip install transformers[torch] -q
! pip install accelerate -U -q
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

In [None]:
from google.colab import drive
drive.mount('/content/drive')


1. Finetune [T5-large](https://huggingface.co/google-t5/t5-large) on the Multi-News [multi_news](https://huggingface.co/datasets/multi_news) dataset for abstractive summarization.
2. Use your finetuned model for inference.

<Tip>
Model architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

[BART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bart), [BigBird-Pegasus](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bigbird_pegasus), [Blenderbot](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/blenderbot), [BlenderbotSmall](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/blenderbot-small), [Encoder decoder](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/encoder-decoder), [FairSeq Machine-Translation](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/fsmt), [GPTSAN-japanese](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gptsan-japanese), [LED](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/led), [LongT5](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/longt5), [M2M100](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/m2m_100), [Marian](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/marian), [mBART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mbart), [MT5](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mt5), [MVP](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mvp), [NLLB](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/nllb), [NLLB-MOE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/nllb-moe), [Pegasus](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/pegasus), [PEGASUS-X](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/pegasus_x), [PLBart](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/plbart), [ProphetNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/prophetnet), [SwitchTransformers](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/switch_transformers), 
[T5](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/t5), [XLM-ProphetNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm-prophetnet)

<!--End of the generated tip-->

</Tip>


In [None]:
# Standard library
import argparse
import os
from functools import partial

# Third-party libraries (duplicate imports removed)
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
import wandb
from datasets import load_dataset
from google.colab import userdata
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, AutoPeftModelForCausalLM
from torch import cuda, bfloat16
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    set_seed,
)

In [None]:
from google.colab import output
output.enable_custom_widget_manager()

In [None]:
PROJECT = "T5-large-Summarization"
MODEL_NAME = "google-t5/t5-large"
DATASET = "multi_news"

In [None]:


wandb_key = userdata.get('WANDB')
wandb.login(key=wandb_key)

wandb.init(project=PROJECT,  # the W&B project this run belongs to
           tags=[MODEL_NAME, DATASET],
           notes=f"Fine tuning {MODEL_NAME} with the {DATASET} dataset. Text summarization")

In [None]:
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
device


## Load multi_news dataset

https://huggingface.co/datasets/multi_news

In [None]:
from datasets import load_dataset

dataset  = load_dataset("multi_news")

Multi-News already ships with `train`, `validation`, and `test` splits, so there is no need to call the [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) method. Inspect the splits and their sizes:

In [None]:
dataset

In [None]:
print(f"Train dataset size: {len(dataset['train'])}")
print(f"test dataset size: {len(dataset['test'])}")
print(f"Validation dataset size: {len(dataset['validation'])}")

Then take a look at an example:

In [None]:
dataset['train'][100]['document']

In [None]:
len(dataset['train'][100]['document'])

In [None]:
dataset['train'][100]['summary']

In [None]:
len(dataset['train'][100]['summary'])

There are two fields that you'll want to use:

- `document`: the text of the news articles, which will be the input to the model.
- `summary`: a condensed version of `document`, which will be the model target.

## Preprocess

The next step is to load a T5 tokenizer to process `document` and `summary`:

Model--> https://huggingface.co/google-t5/t5-large

In [None]:
from transformers import AutoTokenizer

checkpoint = model_id = MODEL_NAME
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The preprocessing function you want to create needs to:

1. Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Use the `text_target` keyword argument when tokenizing labels.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

In [None]:
prefix = "summarize: "


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=256, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)

The cell above applies the preprocessing function over the entire dataset with the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) method. Setting `batched=True` speeds up `map` by processing multiple elements of the dataset at once:

In [None]:
tokenized_dataset

In [None]:
len(tokenized_dataset['train'][100]['labels']), len(tokenized_dataset['train'][100]['input_ids'])

## Train

<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load T5 with [AutoModelForSeq2SeqLM](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM):

At this point, only three steps remain:

1. Define your training hyperparameters in [Seq2SeqTrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments). The only required parameter is `output_dir`, which specifies where to save your model. This notebook does not set `push_to_hub=True`; the model is uploaded after training with an explicit `push_to_hub()` call (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) evaluates the model and saves the training checkpoint. Because no `compute_metrics` function is passed, only the evaluation loss is reported during training.
2. Pass the training arguments to [Seq2SeqTrainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Seq2SeqTrainer) along with the model, datasets, and data collator.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer


# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

Now create a batch of examples using [DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
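
To illustrate what dynamic padding means, here is a plain-Python sketch of the idea. It is not the `DataCollatorForSeq2Seq` implementation: the helper names `round_up` and `pad_batch` are hypothetical, right-side padding is assumed, and a pad token id of 0 is assumed for the inputs. Labels are padded with -100 so that padded positions are ignored by the loss, and lengths are rounded up to a multiple of 8, mirroring the `label_pad_token_id` and `pad_to_multiple_of` settings used later in this notebook. The real collator also returns tensors and can prepare decoder inputs, which this sketch omits.

```python
from typing import Dict, List

LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss


def round_up(length: int, multiple: int) -> int:
    """Round `length` up to the next multiple of `multiple`."""
    return ((length + multiple - 1) // multiple) * multiple


def pad_batch(features: List[Dict[str, List[int]]],
              pad_token_id: int = 0,  # assumed pad id for this sketch
              multiple: int = 8) -> Dict[str, List[List[int]]]:
    """Pad one batch to the longest sequence in that batch (not in the dataset)."""
    input_len = round_up(max(len(f["input_ids"]) for f in features), multiple)
    label_len = round_up(max(len(f["labels"]) for f in features), multiple)

    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = input_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (label_len - len(f["labels"])))
    return batch


features = [
    {"input_ids": [5, 6, 7], "labels": [1, 2]},
    {"input_ids": list(range(1, 11)), "labels": [3]},
]
batch = pad_batch(features)
print([len(x) for x in batch["input_ids"]], [len(x) for x in batch["labels"]])
```

In this toy batch the longest input has 10 tokens, so inputs are padded to 16 rather than to a global maximum such as 1024, which is where the efficiency gain comes from.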

In [None]:
def print_number_of_trainable_model_parameters(model, tag="original_model", to_wandb=False):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()

    if to_wandb:
        wandb.log({f'{tag}': {"trainable_model_params": trainable_model_params}})
        wandb.log({f'{tag}': {"all_model_params": all_model_params}})
        # Divide by the total so the logged value is a percentage, matching the returned string
        wandb.log({f'{tag}': {"percentage_of_trainable_model_parameters": 100 * trainable_model_params / all_model_params}})

    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params}%"

In [None]:
repository_id = f"{checkpoint.split('/')[1]}-{DATASET}"
repository_id

In [None]:
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
dataset_id = "multi_news"
# Hugging Face repository id
repository_id = f"{checkpoint.split('/')[1]}-{DATASET}"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=repository_id,
    auto_find_batch_size=True,
    learning_rate=1e-3, # higher learning rate
    num_train_epochs=10,
    # logging & evaluation strategies
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=500,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    # metric_for_best_model="overall_f1",
    # push to hub parameters
    report_to="wandb",
)



This notebook does not define a `compute_metrics` function, so the Trainer reports only the evaluation loss during training. ROUGE is computed separately in the Inference section.

In [None]:
print(print_number_of_trainable_model_parameters(model))

In [None]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8,
)

In [None]:
import gc
import torch
torch.cuda.empty_cache()
gc.collect()

In [None]:


trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,

)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

In [None]:

with wandb.init(project=PROJECT, job_type="train", # the project I am working on
           tags=[MODEL_NAME, DATASET],
           notes =f"Fine tuning {MODEL_NAME} with {DATASET}. Summarization Documents"):

  print_number_of_trainable_model_parameters(model,"original_model",to_wandb=True)

  trainer.train()

Once training is completed, share your model to the Hub with the [push_to_hub()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.push_to_hub) method so everyone can use your model:

In [None]:
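# Note: the first positional argument of Trainer.push_to_hub is the commit message;
# the target repository is taken from the training arguments (hub_model_id, or output_dir by default).
trainer.push_to_hub("olonok/t5-large-multi_news-summarization")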

<Tip>

For a more in-depth example of how to finetune a model for summarization, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb).

</Tip>

In [None]:

! rm -rf t5-sum-checkpoint

In [None]:

!mkdir t5-sum-checkpoint
custom_path = "./t5-sum-checkpoint/"
trainer.save_model(output_dir=custom_path)

In [None]:

with wandb.init(project=PROJECT, job_type="models"):
  artifact = wandb.Artifact("T5-large_Summarization_model", type="model")
  artifact.add_dir(custom_path)
  wandb.save(custom_path)
  wandb.log_artifact(artifact)


## Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with some text you'd like to summarize. For T5, you need to prefix your input depending on the task you're working on. For summarization you should prefix your input as shown below:

In [None]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

In [None]:
run = wandb.init()

artifact = run.use_artifact('olonok69/T5-large-Summarization/T5-large_Summarization_model:v0', type='model')
artifact_dir = artifact.download()

fine_tune_model = AutoModelForSeq2SeqLM.from_pretrained(artifact_dir, torch_dtype=torch.bfloat16)

In [None]:
artifact_dir

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for summarization with your model, and pass your text to it:

In [None]:
from transformers import pipeline

In [None]:


summarizer = pipeline("summarization", model=fine_tune_model, tokenizer=tokenizer)
response = summarizer(text)

In [None]:
response[0]['summary_text']

In [None]:
checkpoint

In [None]:
from transformers import pipeline

summarizer_ori = pipeline("summarization", model=checkpoint, tokenizer=tokenizer)
response_ori = summarizer_ori(text)

In [None]:
response_ori[0]['summary_text']

In [None]:
len(text), len(response[0]['summary_text']), len(response_ori[0]['summary_text'])

In [None]:
dataset['validation']

In [None]:
import time
import evaluate
import pandas as pd
import numpy as np

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()

In [None]:
rouge = evaluate.load('rouge')

In [None]:
dialogues = dataset['validation'][0:50]['document']
human_baseline_summaries = dataset['validation'][0:50]['summary']

original_model_text = []
original_human_summaries = []
original_model_summaries = []
fine_tune_model_summaries = []

In [None]:
for idx, dialogue in enumerate(tqdm(dialogues)):
    prompt = f"""
Summarize:

{dialogue}
 """
    original_model_text.append(dialogue)
    original_human_summaries.append(human_baseline_summaries[idx])
    # summarize fine_tuned model
    response = summarizer(prompt)
    fine_tune_model_summaries.append(response[0]['summary_text'])
    # summarize original model
    response_ori = summarizer_ori(prompt)
    original_model_summaries.append(response_ori[0]['summary_text'])


In [None]:
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, fine_tune_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'fine_tune_model_summaries'])
df

In [None]:
# https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499

In [None]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)


In [None]:
print(original_model_results)

In [None]:
fine_tune_model_results = rouge.compute(
    predictions=fine_tune_model_summaries,
    references=human_baseline_summaries[0:len(fine_tune_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

In [None]:
print(fine_tune_model_results)
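
To help read the scores printed above, here is a simplified sketch of ROUGE-1 for a single prediction and reference pair. It assumes plain whitespace tokenization and lowercasing; the `evaluate` implementation used in this notebook additionally applies its own tokenization, optional stemming (`use_stemmer=True`), and aggregation over many examples, and also reports ROUGE-2 and ROUGE-L variants, so its numbers will differ. The function name `rouge1` and the example sentences are hypothetical.

```python
from collections import Counter
from typing import Dict


def rouge1(prediction: str, reference: str) -> Dict[str, float]:
    """Unigram-overlap precision, recall and F1 between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())

    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

Here five of the six words overlap, so precision, recall, and F1 are all 5/6. A higher score for the fine-tuned model than for the original checkpoint indicates more word overlap with the human-written summaries, not necessarily better factual accuracy.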

In [None]:
dialogues = dataset['validation'][:32]['document']

In [None]:
type(dialogues)

In [None]:
dialogues[0]

In [None]:
summarizer = pipeline("summarization", model=fine_tune_model, tokenizer=tokenizer, device="cpu", batch_size=2 )


In [None]:
response= summarizer(dialogues)

In [None]:
len(response)

In [None]:
summarizer_ori = pipeline("summarization", model=checkpoint, tokenizer=tokenizer,  device_map='auto')
response_ori = summarizer_ori(dialogues)

In [None]:
len(response_ori)