### **Recipe: Hyperparameter Optimization with Optuna and Transformers**

_Authored by: [Parag Ekbote](https://github.com/ParagEkbote)_

**Problem:** Find the best hyperparameters to fine-tune a lightweight BERT model for text classification on a subset of the IMDB dataset.

**Goal:** Use Optuna to automate the search over learning rate, weight decay, and training batch size to improve validation accuracy.

---

#### **Prerequisites**

* Python environment set up
* Installed packages: Transformers, Datasets, Evaluate, Optuna, and Weights & Biases
* Small subset of the IMDB dataset for quick experimentation

---

#### **Steps**

1. Install required dependencies.

2. Use Optuna to define an objective function that optimizes learning rate, weight decay, and batch size.

3. Fine-tune a lightweight BERT model on the IMDB dataset subset.

4. Run the Optuna study to search for the best hyperparameters.

5. Review the best combination found for improved validation metrics.

---

#### **Notes**

* For detailed guidance on hyperparameter search with Transformers, refer to the [Hugging Face HPO documentation](https://huggingface.co/docs/transformers/en/hpo_train).




In [None]:
!pip install -q datasets evaluate transformers optuna wandb scikit-learn nbformat

### **Prepare Dataset and Set Model**


- Load the IMDB dataset for sentiment analysis.

- Select 2500 examples for training and 1000 examples for validation, ensuring both splits are shuffled with a fixed seed for reproducibility.

- Tokenize the text data and map the tokenizer to preprocess all samples efficiently.

- Load the accuracy metric for model evaluation.

- Initialize the BERT model for binary classification.

In [None]:
from datasets import load_dataset
import evaluate

from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import set_seed



set_seed(42)


train_dataset = load_dataset("imdb", split="train").shuffle(seed=42).select(range(2500))
valid_dataset = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1000))

model_name = "prajjwal1/bert-tiny"
tokenizer = AutoTokenizer.from_pretrained(model_name)


def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512)


tokenized_train = train_dataset.map(tokenize, batched=True).select_columns(
    ["input_ids", "attention_mask", "label"]
)
tokenized_valid = valid_dataset.map(tokenize, batched=True).select_columns(
    ["input_ids", "attention_mask", "label"]
)


metric = evaluate.load("accuracy")


def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)



### **Define Storage with Optuna**

- Use RDBStorage to store all Optuna trials in a persistent SQLite database that survives across sessions.

- Because trials are saved, a later run can resume the same study instead of starting over.

- The stored trials can be reloaded at any time for reproducible analysis and visualization.

In [2]:
import optuna
from optuna.storages import RDBStorage

# Define persistent storage
storage = RDBStorage("sqlite:///optuna_trials.db")


study = optuna.create_study(
    study_name="transformers_optuna_study",
    direction="maximize",
    storage=storage,
    load_if_exists=True
)

[I 2025-06-25 15:10:18,724] A new study created in RDB with name: transformers_optuna_study


### **Initialize Trainer and Set Up Observability**

#### **Instructions**

* Define the metric function to calculate evaluation metrics after each evaluation step.
* Define the objective function to maximize accuracy for selecting the best hyperparameters.
* Set up observability by configuring Weights & Biases to log hyperparameter trials.
* Ensure you are logged in to Weights & Biases with your API key to enable tracking.
* Define the training arguments for the Trainer to handle evaluation, checkpointing, logging, and hyperparameter search.
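
As a rough illustration of what the metric function computes, the sketch below reproduces the argmax-then-compare logic in plain Python on made-up logits. The helper name `accuracy_from_logits` and the toy values are assumptions for illustration; the notebook itself relies on NumPy arrays and the `evaluate` accuracy metric.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per example and compare with the labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Four toy examples with two class scores each (negative, positive).
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.2], [1.2, 0.9]]
toy_labels = [0, 1, 0, 0]

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```

The Trainer prefixes evaluation metrics with `eval_`, which is why the objective reads `metrics["eval_accuracy"]` rather than `metrics["accuracy"]`.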

---

In [3]:
import wandb
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    predictions = eval_pred.predictions.argmax(axis=-1)
    labels = eval_pred.label_ids
    return metric.compute(predictions=predictions, references=labels)


def compute_objective(metrics):
    return metrics["eval_accuracy"]

wandb.init(project="hf-optuna", name="transformers_optuna_study")

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_strategy="epoch",
    num_train_epochs=3,
    report_to="wandb",  # Logs to W&B
    logging_dir="./logs",
    run_name="transformers_optuna_study",
)


trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)

wandb: Currently logged in as: ai_novice2005 to https://api.wandb.ai. Use `wandb login --relogin` to force relogin



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.





### **Define Search Space and Start Trials**

#### **Instructions**

* Define the Optuna hyperparameter search space to optimize learning rate, weight decay, and batch size.
* Launch the hyperparameter search by configuring the following parameters:

  1. **direction** – Set to maximize the evaluation metric.
  2. **backend** – Use Optuna as the search backend.
  3. **n\_trials** – Specify the number of trials to run.
  4. **compute\_objective** – Define the objective to maximize or minimize based on evaluation metrics.
  5. **study\_name** – Provide a name to retrieve or continue a specific run.
  6. **storage** – Set the backend storage for Optuna to save all trial data.

---

In [4]:
def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [16, 32, 64, 128]
        ),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.3),
    }


best_run = trainer.hyperparameter_search(
    direction="maximize",
    backend="optuna",
    hp_space=optuna_hp_space,
    n_trials=5,
    compute_objective=compute_objective,
    study_name="transformers_optuna_study",
    storage="sqlite:///optuna_trials.db",
    load_if_exists=True
)

print(best_run)

[I 2025-06-25 15:10:41,259] Using an existing study with name 'transformers_optuna_study' instead of creating a new one.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6492,0.605374,0.682
2,0.5299,0.528273,0.751
3,0.4407,0.509003,0.764


[I 2025-06-25 15:11:09,298] Trial 0 finished with value: 0.764 and parameters: {'learning_rate': 7.23655165533393e-05, 'per_device_train_batch_size': 16, 'weight_decay': 0.013798094328723032}. Best is trial 0 with value: 0.764.


Run history:
eval/accuracy,▁▇█
eval/loss,█▂▁
eval/runtime,█▇▁
eval/samples_per_second,▁▂█
eval/steps_per_second,▁▂█
train/epoch,▁▁▅▅███
train/global_step,▁▁▄▄███
train/grad_norm,▁█▂
train/learning_rate,█▅▁
train/loss,█▄▁

Run summary:
eval/accuracy,0.764
eval/loss,0.509
eval/runtime,1.0937
eval/samples_per_second,914.299
eval/steps_per_second,114.287
total_flos,9528652800000.0
train/epoch,3.0
train/global_step,471.0
train/grad_norm,13.57101
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6831,0.677468,0.613
2,0.6731,0.669755,0.639
3,0.6695,0.667655,0.63


[I 2025-06-25 15:11:29,907] Trial 1 finished with value: 0.63 and parameters: {'learning_rate': 2.756288216246014e-05, 'per_device_train_batch_size': 128, 'weight_decay': 0.28503663896216014}. Best is trial 0 with value: 0.764.


Run history:
eval/accuracy,▁█▆
eval/loss,█▂▁
eval/runtime,▁█▂
eval/samples_per_second,█▁▇
eval/steps_per_second,█▁▇
train/epoch,▁▁▅▅███
train/global_step,▁▁▅▅███
train/grad_norm,▅█▁
train/learning_rate,█▄▁
train/loss,█▃▁

Run summary:
eval/accuracy,0.63
eval/loss,0.66765
eval/runtime,1.111
eval/samples_per_second,900.116
eval/steps_per_second,112.515
total_flos,9528652800000.0
train/epoch,3.0
train/global_step,60.0
train/grad_norm,0.66353
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6903,0.688425,0.553
2,0.6891,0.687775,0.562
3,0.6891,0.687576,0.57


[I 2025-06-25 15:11:52,797] Trial 2 finished with value: 0.57 and parameters: {'learning_rate': 1.2177346043359053e-06, 'per_device_train_batch_size': 64, 'weight_decay': 0.02906341093983704}. Best is trial 0 with value: 0.764.


Run history:
eval/accuracy,▁▅█
eval/loss,█▃▁
eval/runtime,▁██
eval/samples_per_second,█▁▁
eval/steps_per_second,█▁▁
train/epoch,▁▁▅▅███
train/global_step,▁▁▅▅███
train/grad_norm,▆█▁
train/learning_rate,█▅▁
train/loss,█▁▁

Run summary:
eval/accuracy,0.57
eval/loss,0.68758
eval/runtime,1.0959
eval/samples_per_second,912.502
eval/steps_per_second,114.063
total_flos,9528652800000.0
train/epoch,3.0
train/global_step,120.0
train/grad_norm,2.34479
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6894,0.68673,0.57
2,0.687,0.685327,0.581
3,0.6867,0.684904,0.581


[I 2025-06-25 15:12:12,894] Trial 3 finished with value: 0.581 and parameters: {'learning_rate': 2.973185825213819e-06, 'per_device_train_batch_size': 64, 'weight_decay': 0.09102292466460353}. Best is trial 0 with value: 0.764.


Run history:
eval/accuracy,▁██
eval/loss,█▃▁
eval/runtime,▁█▃
eval/samples_per_second,█▁▆
eval/steps_per_second,█▁▆
train/epoch,▁▁▅▅███
train/global_step,▁▁▅▅███
train/grad_norm,▆█▁
train/learning_rate,█▅▁
train/loss,█▂▁

Run summary:
eval/accuracy,0.581
eval/loss,0.6849
eval/runtime,1.0808
eval/samples_per_second,925.219
eval/steps_per_second,115.652
total_flos,9528652800000.0
train/epoch,3.0
train/global_step,120.0
train/grad_norm,2.30065
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.689,0.686028,0.573
2,0.6861,0.684337,0.589
3,0.6857,0.683833,0.597


[I 2025-06-25 15:12:32,824] Trial 4 finished with value: 0.597 and parameters: {'learning_rate': 3.763988365260261e-06, 'per_device_train_batch_size': 64, 'weight_decay': 0.1502192542358606}. Best is trial 0 with value: 0.764.


BestRun(run_id='0', objective=0.764, hyperparameters={'learning_rate': 7.23655165533393e-05, 'per_device_train_batch_size': 16, 'weight_decay': 0.013798094328723032}, run_summary=None)
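
The `train/global_step` values in the run summaries differ widely between trials (471, 60, 120). A short worked check, assuming a single device and no gradient accumulation (`total_optimizer_steps` is a hypothetical helper), shows they follow directly from the sampled batch size:

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch (last partial batch included) times the number of epochs."""
    return math.ceil(num_examples / batch_size) * num_epochs


# 2500 training examples and 3 epochs, as configured in this recipe.
steps_by_batch_size = {
    batch_size: total_optimizer_steps(2500, batch_size, 3)
    for batch_size in (16, 32, 64, 128)
}
print(steps_by_batch_size)
```

Trials with larger batches therefore made far fewer weight updates within the same three epochs, which may partly explain why the batch-size-16 trial scored highest here; with only five trials this is a plausible reading rather than a firm conclusion.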



### **Visualize Results**

#### **Instructions**

* Use the `optuna` study object to visualize results after trials are completed.
* Generate plots to understand patterns in trial outcomes.
* Visualize key hyperparameters and their relationship with model performance.



In [1]:
import optuna
import optuna.visualization as vis

storage = optuna.storages.RDBStorage("sqlite:///optuna_trials.db")

study = optuna.load_study(
    study_name="transformers_optuna_study",
    storage=storage
)

vis.plot_param_importances(study).show()

vis.plot_parallel_coordinate(study).show()

vis.plot_contour(study).show()


### **Perform Final Training**

#### **Instructions**

* Retrieve the best hyperparameters obtained from hyperparameter optimization (HPO).
* Configure the training arguments using these optimized values.
* Train the model with the best hyperparameter settings to achieve improved performance.
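
One way to picture the hand-off from the search to the final run is as a merge of two dictionaries. The sketch below assumes the hyperparameter values logged for trial 0 above (in the notebook they are available as `best_run.hyperparameters`); `fixed_settings` and `training_kwargs` are hypothetical names used only for illustration.

```python
# Values copied from the BestRun printed by the search (trial 0).
best_hparams = {
    "learning_rate": 7.23655165533393e-05,
    "per_device_train_batch_size": 16,
    "weight_decay": 0.013798094328723032,
}

# Settings that were not tuned stay fixed (hypothetical name).
fixed_settings = {
    "output_dir": "./final_model",
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "load_best_model_at_end": True,
    "num_train_epochs": 3,
}

# Guard against a tuned value silently overriding a fixed one.
overlap = set(best_hparams) & set(fixed_settings)
assert not overlap, f"conflicting keys: {overlap}"

# The merged dictionary could then be passed as TrainingArguments(**training_kwargs).
training_kwargs = {**fixed_settings, **best_hparams}
print(sorted(training_kwargs))
```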



In [6]:
from datasets import load_dataset
from transformers import AutoTokenizer

# Load IMDb dataset
dataset = load_dataset("imdb")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the text
def tokenize_function(example):
    return tokenizer(example["text"], padding="max_length", truncation=True)

# Apply tokenization
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Clean up columns
tokenized_dataset = tokenized_dataset.remove_columns(["text"])
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")

# Set PyTorch format
tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# Subset for quick testing (optional)
train_dataset = tokenized_dataset["train"].shuffle(seed=42).select(range(2000))
valid_dataset = tokenized_dataset["test"].shuffle(seed=42).select(range(500))



In [7]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Define the model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Load the best hyperparameters found by the search
# (best_run is the BestRun returned by trainer.hyperparameter_search above)
best_hparams = best_run.hyperparameters

training_args = TrainingArguments(
    output_dir="./final_model",
    learning_rate=best_hparams["learning_rate"],
    per_device_train_batch_size=best_hparams["per_device_train_batch_size"],
    weight_decay=best_hparams["weight_decay"],
    
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_strategy="epoch",
    num_train_epochs=3,

    report_to="wandb",
    run_name="final_run_with_best_hparams"
)

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    processing_class=tokenizer,  # same argument as in the search Trainer; `tokenizer=` is deprecated in recent Transformers releases
    compute_metrics=lambda eval_pred: {
        "accuracy": (eval_pred.predictions.argmax(-1) == eval_pred.label_ids).mean()
    }
)

# Train
trainer.train()

# Save the model
trainer.save_model("./final_model")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Epoch,Training Loss,Validation Loss,Accuracy
1,0.4567,0.337511,0.856
2,0.2152,0.43822,0.876
3,0.0846,0.499159,0.888


### **Uploading to Hugging Face Hub**

#### **Instructions**

* After the final training run with the tuned hyperparameters, validation accuracy reached **0.888** by the third epoch on the 500-example evaluation subset.
  *A model tuned this way can then be applied to sentiment analysis tasks such as classifying movie reviews for content recommendation systems.*

* Save the trained model locally to preserve the optimized weights and configuration.

* Log in to the Hugging Face Hub using either `huggingface-cli login` or `notebook_login()` to authenticate your account.
  *Logging in is essential to gain push access to your personal or organizational repository on the Hub.*

* Push the trained model and tokenizer to the Hugging Face Hub to share them with the community and reuse them for inference.

In [4]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load your saved model from the path
model = AutoModelForSequenceClassification.from_pretrained("./final_model")
tokenizer = AutoTokenizer.from_pretrained("./final_model")

# Push to your repository on the hub
model.push_to_hub("AINovice2005/bert-imdb-optuna-hpo")
tokenizer.push_to_hub("AINovice2005/bert-imdb-optuna-hpo")



CommitInfo(commit_url='https://huggingface.co/AINovice2005/bert-imdb-optuna-hpo/commit/cf4e9bcfd581cc9cd33f7403c5fa2e5074f58e6c', commit_message='Upload tokenizer', commit_description='', oid='cf4e9bcfd581cc9cd33f7403c5fa2e5074f58e6c', pr_url=None, repo_url=RepoUrl('https://huggingface.co/AINovice2005/bert-imdb-optuna-hpo', endpoint='https://huggingface.co', repo_type='model', repo_id='AINovice2005/bert-imdb-optuna-hpo'), pr_revision=None, pr_num=None)