**Hyperparameter Tuning**

---

# Preparation

In [1]:
!pip install --upgrade transformers datasets
!pip install evaluate rouge_score


In [2]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Seq2SeqTrainingArguments, Seq2SeqTrainer
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
from torch.optim import AdamW
from evaluate import load
import random
import torch
import numpy as np

import os
os.environ["WANDB_DISABLED"] = "true"

In [3]:
# Check that a GPU is available
torch.cuda.is_available(), torch.cuda.get_device_name(0)
!nvidia-smi

Thu Nov 13 22:29:41 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   59C    P8             10W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [4]:
# Load the first 2% of the CNN/DailyMail train and validation splits

dataset = load_dataset("abisee/cnn_dailymail", "3.0.0", split=["train[:2%]", "validation[:2%]"])

train, validation = dataset


# Add instruction to the prompts
prompts = [
    "Summarize this article:",
    "Write a brief summary of the following news piece:",
    "Give a concise version of this report:",
    "In a few sentences, describe the main points of this text:",
    "What is this story about? Summarize below:",
    "Condense the following passage:",
    "Provide a short overview of this text."
]

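def add_instruction(example):
  # Prepend a randomly chosen instruction to the article
  prompt = random.choice(prompts)
  # Single quotes inside the f-string keep this valid on Python < 3.12
  return {
      "prompt": f"{prompt}\n\n{example['article']}",
      "target": example["highlights"]
  }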

train_dataset = train.map(
    add_instruction,
    remove_columns=train.column_names
    )
validation_dataset = validation.map(
    add_instruction,
    remove_columns=validation.column_names
    )

print("Prompt before instruction: ", train[0]["article"])
print("-"*10)
print("Prompt after instruction: ", train_dataset[0]["prompt"])




Map:   0%|          | 0/5742 [00:00<?, ? examples/s]

Map:   0%|          | 0/267 [00:00<?, ? examples/s]

Prompt before instruction:  LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box 
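Because `add_instruction` draws the instruction with `random.choice` and no seed is set, each run of the notebook can pair articles with different prompts. The sketch below shows one way to make the assignment repeatable. The `make_add_instruction` helper and the seed value 42 are hypothetical, and the toy records only stand in for dataset rows; it also assumes the mapping runs in a single process.

```python
import random

# Same instruction templates as in the cell above.
prompts = [
    "Summarize this article:",
    "Write a brief summary of the following news piece:",
    "Give a concise version of this report:",
    "In a few sentences, describe the main points of this text:",
    "What is this story about? Summarize below:",
    "Condense the following passage:",
    "Provide a short overview of this text.",
]

def make_add_instruction(seed=42):
    """Return an add_instruction function backed by its own seeded RNG (hypothetical helper)."""
    rng = random.Random(seed)  # local generator, independent of the global random state

    def add_instruction(example):
        prompt = rng.choice(prompts)
        return {
            "prompt": f"{prompt}\n\n{example['article']}",
            "target": example["highlights"],
        }

    return add_instruction

# Toy records standing in for dataset rows.
toy = [{"article": f"article {i}", "highlights": f"summary {i}"} for i in range(5)]

fn_a = make_add_instruction(seed=42)
run_a = [fn_a(ex)["prompt"] for ex in toy]
fn_b = make_add_instruction(seed=42)
run_b = [fn_b(ex)["prompt"] for ex in toy]

print(run_a == run_b)  # the same seed gives the same prompt assignment
```

With a fixed assignment, differences between tuning runs are less likely to come from the prompts themselves.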

In [5]:
# Additional imports for training (the remaining names were already imported in cell 2)
from transformers import DataCollatorForSeq2Seq, EarlyStoppingCallback

# Define the fine-tuning function
def fine_tune(batch_size,lr,peft_config,accumulation_steps):

  model_name = 't5-base'
  model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
  tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

  model = get_peft_model(model, peft_config)

  # Tokenize the datasets
  max_input_length = 512
  max_target_length = 128

  def preprocess_function(examples):
      # Tokenize the prompts (encoder inputs)
      model_inputs = tokenizer(examples["prompt"], max_length=max_input_length, truncation=True)

      # Tokenize the summaries as labels; text_target replaces the deprecated as_target_tokenizer()
      labels = tokenizer(text_target=examples["target"], max_length=max_target_length, truncation=True)

      model_inputs["labels"] = labels["input_ids"]
      return model_inputs

  tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True, remove_columns=['prompt', 'target'])
  tokenized_validation_dataset = validation_dataset.map(preprocess_function, batched=True, remove_columns=['prompt', 'target'])

  #compute validation ROUGE-1
  rouge = load("rouge")

  def compute_metrics(eval_pred):
      predictions, labels = eval_pred
      # Generation output can arrive as a tuple; keep only the token ids
      if isinstance(predictions, tuple):
          predictions = predictions[0]
      # Replace -100 (used for padding) with pad_token_id so both arrays decode cleanly
      predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
      labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

      decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
      decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

      # Compute ROUGE
      result = rouge.compute(predictions=decoded_preds, references=decoded_labels)

      return {"rouge-1": result["rouge1"]}

  # Defining early_stopping_patience
  early_stopping_patience = 2

  training_args = Seq2SeqTrainingArguments(
        output_dir=f"./temp_lr{lr}",
        eval_strategy="epoch",
        logging_strategy="epoch",
        save_strategy="no",
        learning_rate=lr,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=10,  # upper bound; early stopping can end training sooner
        gradient_accumulation_steps=accumulation_steps,
        weight_decay=0.01,
        # save_total_limit=2,
        predict_with_generate=True,
        fp16=True,
        remove_unused_columns=False,
        report_to="none",  # disable logging integrations (replaces the deprecated WANDB_DISABLED)
        metric_for_best_model="rouge-1",  # metric monitored by early stopping
        greater_is_better=True
    )

  # Defining data collator
  data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

  trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train_dataset,
        eval_dataset=tokenized_validation_dataset,
        processing_class=tokenizer,  # recent transformers releases deprecate the `tokenizer` argument
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=early_stopping_patience)]

  )
  trainer.train()
  eval_results = trainer.evaluate()
  print(f"Validation ROUGE-1 for lr={lr}, batch_size={batch_size},accumulation_steps={accumulation_steps},weight_decay={0.01}: \n {eval_results['eval_rouge-1']}")



# LoRA config with rank r = 8
peft_config_eight = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM"
)

# Tune Learning Rate

I’ll keep the following parameters fixed for now:
**batch_size = 4, LoRA rank = 8, max epochs = 10 (with early stopping, patience = 2), and accumulation_steps = 1**, while experimenting with different **learning rates (1e-5, 3e-5, 5e-5)** to find the most effective one.



---
learning_rate = 1e-5


In [None]:
fine_tune(4,1e-5,peft_config_eight,1)


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Seq2SeqTrainer(
Using EarlyStoppingCallback without load_best_model_at_end=True. Once training is finished, the best model will not be loaded automatically.


| Epoch | Training Loss | Validation Loss | Rouge-1 |
|---|---|---|---|
| 1 | 1.9089 | 1.871376 | 0.263175 |
| 2 | 1.7163 | 1.830692 | 0.26522 |
| 3 | 1.6824 | 1.816001 | 0.257298 |
| 4 | 1.6671 | 1.810794 | 0.2546 |


Validation ROUGE-1 for lr=1e-05, batch_size=4,accumulation_steps=1,weight_decay=0.01: 
 0.25459951989500085
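The value printed above is the score of the last epoch, not the best one: the warning in the output says the best model is not reloaded because `load_best_model_at_end` is not set, and `trainer.evaluate()` therefore scores the final weights. With a patience of 2, training stopped at epoch 4 because epochs 3 and 4 did not beat epoch 2. A minimal sketch for recovering the peak epoch is shown below. The `best_eval` helper is hypothetical, and it assumes the per-epoch evaluation records have the shape of the entries in `trainer.state.log_history`, with the metric stored under `eval_rouge-1`.

```python
def best_eval(log_history, metric="eval_rouge-1"):
    """Return (epoch, value) of the highest `metric` among evaluation log entries."""
    evals = [entry for entry in log_history if metric in entry]
    if not evals:
        raise ValueError(f"no entries with {metric!r} found")
    best = max(evals, key=lambda entry: entry[metric])
    return best["epoch"], best[metric]

# Values copied from the lr=1e-5 run above, arranged like evaluation log entries.
log_history = [
    {"epoch": 1.0, "eval_loss": 1.871376, "eval_rouge-1": 0.263175},
    {"epoch": 2.0, "eval_loss": 1.830692, "eval_rouge-1": 0.26522},
    {"epoch": 3.0, "eval_loss": 1.816001, "eval_rouge-1": 0.257298},
    {"epoch": 4.0, "eval_loss": 1.810794, "eval_rouge-1": 0.2546},
]

print(best_eval(log_history))  # (2.0, 0.26522)
```

Inside `fine_tune`, the same helper could be called as `best_eval(trainer.state.log_history)` after `trainer.train()`, so that the peak score is reported alongside the final one.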


learning_rate = 3e-5

In [None]:
fine_tune(4,3e-5,peft_config_eight,1)



| Epoch | Training Loss | Validation Loss | Rouge-1 |
|---|---|---|---|
| 1 | 1.769 | 1.813428 | 0.257172 |
| 2 | 1.6556 | 1.801338 | 0.255606 |
| 3 | 1.6406 | 1.794509 | 0.257737 |
| 4 | 1.6299 | 1.7891 | 0.257407 |
| 5 | 1.6249 | 1.775777 | 0.256216 |


Validation ROUGE-1 for lr=3e-05, batch_size=4,accumulation_steps=1,weight_decay=0.01: 
 0.25621649177643746


learning_rate = 5e-5

In [None]:
fine_tune(4,5e-5,peft_config_eight,1)



| Epoch | Training Loss | Validation Loss | Rouge-1 |
|---|---|---|---|
| 1 | 1.7311 | 1.804798 | 0.254555 |
| 2 | 1.6404 | 1.791813 | 0.259105 |
| 3 | 1.6243 | 1.780619 | 0.261167 |
| 4 | 1.6101 | 1.777755 | 0.262772 |
| 5 | 1.6087 | 1.771676 | 0.259212 |
| 6 | 1.6127 | 1.767221 | 0.258643 |


Validation ROUGE-1 for lr=5e-05, batch_size=4,accumulation_steps=1,weight_decay=0.01: 
 0.25864261096588614


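After evaluating the three learning rates, I select the one that reaches **the highest validation ROUGE-1 score** in any epoch (the peak of each run, not the final-epoch value printed after training), which I use as a proxy for summary quality.
Among all tested values, LR = 1e-5 achieves the best peak with a **ROUGE-1 score of 0.265220** at epoch 2, compared with 0.262772 for 5e-5 and 0.257737 for 3e-5, and is therefore chosen as the learning rate for the following experiments. The margins are small and each setting was run once, so this ranking is tentative.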
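The choice depends on whether runs are ranked by their peak epoch or by their final epoch. The short check below uses only the per-epoch ROUGE-1 values logged in the three runs above; the variable names are illustrative.

```python
# Per-epoch validation ROUGE-1 copied from the three learning-rate runs above.
rouge_by_lr = {
    1e-5: [0.263175, 0.26522, 0.257298, 0.2546],
    3e-5: [0.257172, 0.255606, 0.257737, 0.257407, 0.256216],
    5e-5: [0.254555, 0.259105, 0.261167, 0.262772, 0.259212, 0.258643],
}

# Two possible selection criteria: best epoch of each run vs. last epoch of each run.
peak = {lr: max(scores) for lr, scores in rouge_by_lr.items()}
final = {lr: scores[-1] for lr, scores in rouge_by_lr.items()}

best_by_peak = max(peak, key=peak.get)
best_by_final = max(final, key=final.get)

print("best by peak epoch: ", best_by_peak, peak[best_by_peak])
print("best by final epoch:", best_by_final, final[best_by_final])
```

Ranking by peak epoch favors 1e-5, while ranking by the final-epoch score would favor 5e-5, so the criterion should be fixed before comparing runs. This notebook uses the peak.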

# Tune Effective Batch Size

I’ll keep the following parameters fixed for now:
**learning rate = 1e-5 (previously chosen), LoRA rank = 8, epochs = 10**, while experimenting with different combinations of **batch size** and **accumulation steps** to find the most effective setup.

Since effective_batch_size = batch_size × accumulation_steps, I’ll test the following configurations:

* **batch_size = 4, accumulation_steps = 2** → effective batch size = **8**

* **batch_size = 4, accumulation_steps = 4** → effective batch size = **16**

* **batch_size = 8, accumulation_steps = 4** → effective batch size = **32**

In the previous experiment, I used **batch_size = 4** and **accumulation_steps = 1** (effective batch size = **4**).
I’ll compare its **validation ROUGE-1** score with the three new configurations above to determine which combination achieves the best overall performance.
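As a rough guide to what these configurations change, the sketch below computes the effective batch size and an approximate number of optimizer updates per epoch for the 5,742 training examples in the 2% split. It assumes a single GPU and simple ceiling division; the exact step count the trainer uses may differ slightly in how the last partial batch is handled.

```python
import math

n_train = 5742  # training examples in the 2% split

# (batch_size, accumulation_steps) for the baseline and the three new configurations
configs = [(4, 1), (4, 2), (4, 4), (8, 4)]

summary = []
for batch_size, accumulation_steps in configs:
    effective = batch_size * accumulation_steps
    # Approximate optimizer updates per epoch (assumption: single device, ceiling division)
    updates_per_epoch = math.ceil(n_train / effective)
    summary.append((batch_size, accumulation_steps, effective, updates_per_epoch))
    print(f"batch_size={batch_size}, accumulation_steps={accumulation_steps} "
          f"-> effective={effective}, ~{updates_per_epoch} updates/epoch")
```

Since the learning rate is held at 1e-5, a larger effective batch size means fewer weight updates per epoch, so slower progress per epoch is to be expected for the larger settings.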


---
batch_size = 4 and accumulation_steps = 2

In [None]:
fine_tune(4,1e-5,peft_config_eight,2)



| Epoch | Training Loss | Validation Loss | Rouge-1 |
|---|---|---|---|
| 1 | 1.9915 | 1.953368 | 0.256513 |
| 2 | 1.7783 | 1.859512 | 0.266284 |
| 3 | 1.7182 | 1.835846 | 0.263855 |
| 4 | 1.6929 | 1.824021 | 0.26156 |


Validation ROUGE-1 for lr=1e-05, batch_size=4,accumulation_steps=2,weight_decay=0.01: 
 0.26155974915243557


batch_size = 4 and accumulation_steps = 4

In [None]:
fine_tune(4,1e-5,peft_config_eight,4)



| Epoch | Training Loss | Validation Loss | Rouge-1 |
|---|---|---|---|
| 1 | 2.0669 | 2.083775 | 0.241863 |
| 2 | 1.8903 | 1.939577 | 0.261051 |
| 3 | 1.793 | 1.88375 | 0.262879 |
| 4 | 1.7476 | 1.859521 | 0.262715 |
| 5 | 1.7264 | 1.847185 | 0.262056 |


Validation ROUGE-1 for lr=1e-05, batch_size=4,accumulation_steps=4,weight_decay=0.01: 
 0.2620559091469124


batch_size = 8 and accumulation_steps = 4

In [None]:
fine_tune(8,1e-5,peft_config_eight,4)



| Epoch | Training Loss | Validation Loss | Rouge-1 |
|---|---|---|---|
| 1 | 2.1255 | 2.164186 | 0.236339 |
| 2 | 2.0201 | 2.060811 | 0.241514 |
| 3 | 1.9163 | 1.976372 | 0.254107 |
| 4 | 1.8598 | 1.928532 | 0.259699 |
| 5 | 1.8118 | 1.900628 | 0.26217 |
| 6 | 1.7948 | 1.883467 | 0.262151 |
| 7 | 1.771 | 1.872873 | 0.260348 |


Validation ROUGE-1 for lr=1e-05, batch_size=8,accumulation_steps=4,weight_decay=0.01: 
 0.2603481027400204


After evaluating the effective batch size configurations, I select the one that achieves **the highest validation ROUGE-1 score** in any epoch.
Among the three new configurations and the initial one, effective batch size = 8 (**batch_size = 4** and **accumulation_steps = 2**) achieves the best performance with a **ROUGE-1 score of 0.266284** (versus 0.265220, 0.262879, and 0.262170 for effective batch sizes 4, 16, and 32), and is therefore chosen as the effective batch size for the following experiment.

# Tune LoRA Rank (r)

I’ll keep the following parameters fixed for now:
**batch_size = 4 (previously chosen), learning_rate = 1e-5 (previously chosen), epochs = 10, and accumulation_steps = 2 (previously chosen)**, while experimenting with different **LoRA ranks (16, 32)** and comparing them against the initial r = 8.
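One detail to keep in mind when reading the rank comparison: `lora_alpha` stays at 16 in all three configs, and with standard LoRA scaling the low-rank update is multiplied by `lora_alpha / r`. Raising the rank therefore also lowers the scaling factor. The snippet below works this out, assuming the default scaling and not the rank-stabilized variant.

```python
lora_alpha = 16  # fixed in all three LoraConfig objects in this notebook

# Standard LoRA scaling factor applied to the low-rank update: lora_alpha / r
scalings = {r: lora_alpha / r for r in (8, 16, 32)}

for r, scaling in scalings.items():
    print(f"r={r:>2}: scaling = lora_alpha / r = {scaling}")
```

Because rank and scaling change together here, the runs do not isolate the effect of rank alone. Scaling `lora_alpha` with `r` would be one way to separate the two in a follow-up experiment.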

---
r = 16

In [None]:
# LoRA config with rank r = 16
peft_config_sixteen = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM"
)
fine_tune(4,1e-5,peft_config_sixteen,2)



| Epoch | Training Loss | Validation Loss | Rouge-1 |
|---|---|---|---|
| 1 | 1.9918 | 1.954624 | 0.256144 |
| 2 | 1.778 | 1.860013 | 0.265204 |
| 3 | 1.719 | 1.836577 | 0.264929 |
| 4 | 1.693 | 1.824672 | 0.261892 |


Validation ROUGE-1 for lr=1e-05, batch_size=4,accumulation_steps=2,weight_decay=0.01: 
 0.26189172518486653


r = 32

In [6]:
# LoRA config with rank r = 32
peft_config_thirtyTwo = LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM"
)
fine_tune(4,1e-5,peft_config_thirtyTwo,2)



| Epoch | Training Loss | Validation Loss | Rouge-1 |
|---|---|---|---|
| 1 | 1.987 | 1.956399 | 0.250877 |
| 2 | 1.7777 | 1.861669 | 0.265536 |
| 3 | 1.7181 | 1.838071 | 0.264944 |
| 4 | 1.6939 | 1.82639 | 0.261876 |


Validation ROUGE-1 for lr=1e-05, batch_size=4,accumulation_steps=2,weight_decay=0.01: 
 0.2618761559566565


After evaluating the LoRA rank configurations (initial r = 8, then 16 and 32), I select the one that achieves **the highest validation ROUGE-1 score** in any epoch.

Among all tested values, r = 8 (the initial one) achieves the best performance with a **ROUGE-1 score of 0.266284**, ahead of r = 32 (0.265536) and r = 16 (0.265204), and is therefore chosen as the LoRA rank.




# Final Conclusion



I used **2% of the CNN/DailyMail dataset** to tune the hyperparameters for LoRA fine-tuning on T5-Base.
Across all experiments, I selected the configuration that achieved the **highest peak validation ROUGE-1 score**. Validation loss decreased in every epoch of every run, so I saw no sign of overfitting in the loss, although ROUGE-1 peaked early (between epochs 2 and 5) and then declined slightly.

**Best hyperparameter combination found:**

* **Batch Size:** 4

* **Accumulation Steps:** 2 (effective batch size = 8)

* **Learning Rate:** 1e-5

* **LoRA Rank (r):** 8

* **Max Epochs:** 10 (with early stopping)

This setup provided the **best trade-off between generalization and compute efficiency** among the configurations tested.

I will use this hyperparameter combination when fine-tuning the T5 base model with **instruction augmentation.**