<a href="https://colab.research.google.com/github/parisa-kavian/Xsum-FlanT5/blob/main/xsum_fine_tune_flanT5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

For this project, the google/flan-t5-large model was selected for fine-tuning, with the goal of generating news summaries tailored to the project's needs. The XSum dataset was first processed with the model's tokenizer, so that both the input documents and the target summaries were tokenized for training.

However, training proved costly. The large volume of data and the need for multiple epochs considerably increased the compute requirements: the estimated time to train the model for 3 epochs was approximately 380 hours, which made time and resource management difficult.

To demonstrate the output within the GPU limits of Google Colab, fine-tuning was performed on only 10,000 of the roughly 204,000 training samples in the dataset. Quantization and larger models were also explored as ways to improve performance, but problems installing the bitsandbytes library made this approach infeasible.

Ultimately, a smaller version of FLAN-T5 was used so that the project could run on the available resources, and it yielded satisfactory results. In addition, the PEFT approach with a LoRA configuration was applied, which reduces the compute needed for fine-tuning by training only a small set of adapter weights.



In [None]:
!pip install datasets
!pip install transformers
!pip install peft
!pip install bitsandbytes



# Import XSUM Dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset("xsum")


In [None]:
print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

Train dataset size: 204045
Test dataset size: 11334


# Import Model & Tokenizer

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

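# NOTE: the parameter counts printed later in this notebook (about 77.6M in total)
# are consistent with google/flan-t5-small, so the recorded outputs appear to come
# from the smaller checkpoint mentioned in the introduction.
model_id = "google/flan-t5-large"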
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Select a sample of the dataset & prepare the dataset

In [None]:

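# Keep a small subset so training fits within Colab's GPU limits.
# NOTE: the recorded training output (375 steps at batch size 8 over 3 epochs)
# corresponds to about 1,000 training samples rather than 10,000.
small_train_dataset = dataset["train"].select(range(10000))
small_test_dataset = dataset["test"].select(range(100))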

small_dataset = {
    "train": small_train_dataset,
    "test": small_test_dataset,
}

In [None]:
import numpy as np

# Tokenize the documents once to measure their length distribution
tokenized_inputs = small_train_dataset.map(
    lambda x: tokenizer(x["document"], truncation=True),
    batched=True,
    remove_columns=["document", "summary"]
)
input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]
max_source_length = int(np.percentile(input_lengths, 85))
print(f"Max source length: {max_source_length}")

tokenized_targets = small_train_dataset.map(
    lambda x: tokenizer(x["summary"], truncation=True),
    batched=True,
    remove_columns=["document", "summary"]
)
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
max_target_length = int(np.percentile(target_lengths, 90))
print(f"Max target length: {max_target_length}")


Max source length: 512


Max target length: 39
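
The length cut-offs above come from percentiles of the tokenized lengths. As a rough, dependency-free sketch of that calculation (the `percentile` helper and the sample lengths are illustrative and not part of the notebook), linear interpolation between the two nearest ranks mirrors NumPy's default behaviour:

```python
import math

def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# Hypothetical token counts for a handful of documents
example_lengths = [120, 250, 310, 480, 512, 512]
max_len = int(percentile(example_lengths, 85))
print(f"85th percentile length: {max_len}")
```

Because the tokenizer is called with `truncation=True`, lengths are capped at the tokenizer's maximum, which is probably why the source cut-off lands exactly on 512.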


In [None]:
def preprocess_function(sample, padding="max_length"):
    # Prefix each document with the task instruction used for summarization
    inputs = ["summarize: " + item for item in sample["document"]]
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)
    # Tokenize the reference summaries as targets
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)

    # Replace padding token ids in the labels with -100 so the loss ignores them
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


tokenized_small_train_dataset = small_dataset["train"].map(
    preprocess_function,
    batched=True,
    remove_columns=["document", "summary", "id"]
)

tokenized_small_test_dataset = small_dataset["test"].map(
    preprocess_function,
    batched=True,
    remove_columns=["document", "summary", "id"]
)


print(f"Keys of tokenized dataset: {list(tokenized_small_train_dataset.features)}")

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
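
The preprocessing step replaces padding positions in the labels with -100 so that they do not contribute to the loss. The sketch below illustrates that idea in plain Python; `mask_pad_labels` and `mean_loss_ignoring` are hypothetical helpers, the token ids and per-token losses are made up, and the pad id of 0 is assumed to match the T5 tokenizer.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def mask_pad_labels(label_batch, pad_token_id):
    """Replace padding ids with IGNORE_INDEX in every label sequence."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in labels]
        for labels in label_batch
    ]

def mean_loss_ignoring(per_token_losses, labels):
    """Average per-token losses, skipping positions labelled IGNORE_INDEX."""
    kept = [loss for loss, tok in zip(per_token_losses, labels) if tok != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0

padded = [[71, 1712, 1, 0, 0]]        # made-up ids, padded with 0
masked = mask_pad_labels(padded, pad_token_id=0)
losses = [2.0, 4.0, 3.0, 9.0, 9.0]    # made-up per-token losses
print(masked[0])
print(mean_loss_ignoring(losses, masked[0]))
```

Without the masking, the two padded positions would pull the average loss toward values that say nothing about summary quality.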


# Fine-Tune FLAN-T5 with LoRA

In [None]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto")

In [None]:
from peft import LoraConfig, get_peft_model, TaskType
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)


model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 688,128 || all params: 77,649,280 || trainable%: 0.8862
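
The trainable-parameter count above can be checked by hand: each LoRA-adapted linear layer with input size d_in and output size d_out adds r × (d_in + d_out) weights. The sketch below does that arithmetic under assumed dimensions (hidden size 512, attention inner size 384, 8 encoder and 8 decoder layers). These are believed to match the flan-t5-small configuration; they are not stated in this notebook.

```python
def lora_param_count(r, d_in, d_out, num_modules):
    """Parameters added by LoRA: matrices A (r x d_in) and B (d_out x r) per module."""
    return r * (d_in + d_out) * num_modules

# Assumed architecture values (not taken from the notebook)
d_model = 512          # hidden size
inner_dim = 384        # num_heads * d_kv for the attention projections
encoder_layers = 8     # one self-attention block each
decoder_layers = 8     # one self-attention and one cross-attention block each

attention_blocks = encoder_layers + 2 * decoder_layers
target_modules = 2 * attention_blocks   # "q" and "v" in every attention block

trainable = lora_param_count(r=16, d_in=d_model, d_out=inner_dim, num_modules=target_modules)
print(f"LoRA trainable parameters: {trainable:,}")
```

Under these assumptions the result is 688,128, which matches the printed value and suggests the recorded run used the small checkpoint.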


In [None]:
from transformers import DataCollatorForSeq2Seq

label_pad_token_id = -100
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="result",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1e-4,
    num_train_epochs=3,
    logging_dir="result/logs",
    weight_decay=0.001,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="no",
    report_to="tensorboard",
    fp16=True # Enable mixed precision training (FP16)
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_small_train_dataset,
    eval_dataset=tokenized_small_test_dataset,
    )

model.config.use_cache = False  # Silence cache-related warnings
trainer.train()

Step,Training Loss
10,2.7052
20,2.6984
30,2.6855
40,2.6976
50,2.7218
60,2.8095
70,2.5395
80,2.8
90,2.6039
100,2.6353


TrainOutput(global_step=375, training_loss=2.668338768005371, metrics={'train_runtime': 4819.1249, 'train_samples_per_second': 0.623, 'train_steps_per_second': 0.078, 'total_flos': 564013301760000.0, 'train_loss': 2.668338768005371, 'epoch': 3.0})
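
The throughput reported in the training output can be used for a back-of-the-envelope estimate of how long a larger run might take. The helper below is a hypothetical sketch; it assumes throughput stays constant, which will not hold across different model sizes or hardware.

```python
import math

def estimate_training(num_samples, batch_size, epochs, steps_per_second):
    """Return (total optimizer steps, estimated hours) for a training run."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    hours = total_steps / steps_per_second / 3600
    return total_steps, hours

# Subset size implied by the logged step count (assumption: 1,000 samples)
steps, hours = estimate_training(1_000, batch_size=8, epochs=3, steps_per_second=0.078)
print(f"subset: {steps} steps, about {hours:.1f} h")

# Full XSum training split at the same (assumed constant) throughput
full_steps, full_hours = estimate_training(204_045, batch_size=8, epochs=3, steps_per_second=0.078)
print(f"full split: {full_steps} steps, about {full_hours:.0f} h")
```

The step count for the subset agrees with the `global_step=375` reported above, so the same formula gives a plausible order of magnitude for the full split on the same model.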

In [None]:
peft_model_id="results"
trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)

('results/tokenizer_config.json',
 'results/special_tokens_map.json',
 'results/spiece.model',
 'results/added_tokens.json',
 'results/tokenizer.json')

# Run Inference with LoRA FLAN-T5

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

peft_model_id = "results"
config = PeftConfig.from_pretrained(peft_model_id)

model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, peft_model_id, device_map="auto")
model.eval()

print("Peft model loaded")

Peft model loaded


In [None]:
from datasets import load_dataset
import torch
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

result_generated_summary=[]

for idx in range(5):
    sample = small_dataset['test'][idx]

    # Use the same "summarize: " prefix that was applied during training
    input_ids = tokenizer("summarize: " + sample["document"], return_tensors="pt", truncation=True).input_ids.to(device)

    with torch.no_grad():
        outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9)

    input_text = sample['document']
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Collect the result for every sample, not just the last one
    result_generated_summary.append([input_text, summary])

df = pd.DataFrame(result_generated_summary, columns=["Input Sentence", "Summary"])
# df.to_csv("result_generated_summary.csv")
df

Unnamed: 0,Input Sentence,Summary
0,Restoring the function of the organ - which he...,A group of mice that eat five-day fasting cycl...
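
The generation call above samples with `top_p=0.9` (nucleus sampling). As a simplified illustration of the idea, not the library's implementation, the sketch below keeps the smallest set of highest-probability tokens whose cumulative probability reaches `top_p` and renormalizes them. The token probabilities are invented.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    kept = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

# Invented next-token distribution
next_token_probs = {"mice": 0.5, "diet": 0.3, "study": 0.15, "banana": 0.05}
nucleus = top_p_filter(next_token_probs, top_p=0.9)
print(sorted(nucleus))
```

Sampling then draws only from the retained tokens, which trims the unlikely tail while still allowing some variety between runs.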
