In [1]:
# !pip install datasets transformers[sentencepiece]
# !pip install evaluate
# !pip install scikit-learn
# !pip install transformers[torch]

# Fine-Tuning Pre-trained Models with Trainer

## Introduction
The Hugging Face Transformers documentation includes a guide to fine-tuning pre-trained models. Fine-tuning lets developers adapt an existing state-of-the-art model to a specific task, which costs less compute and leaves a smaller carbon footprint than training from scratch. In this notebook, we follow those [instructions](https://huggingface.co/docs/transformers/training#train-in-native-pytorch) to fine-tune a model with the 🤗 Transformers Trainer.

In [2]:
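from datasets import load_dataset
from transformers import AutoTokenizer

# Load the Yelp review dataset and inspect one training example
dataset = load_dataset("yelp_review_full")
print(dataset["train"][100])

# Load the tokenizer before defining the function that uses it
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def tokenize_function(examples):
    # Pad every review to the model's maximum length and truncate longer ones
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Tokenize all splits in batches
tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets)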


Using the latest cached version of the dataset since yelp_review_full couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'yelp_review_full' at /root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/0.0.0/c1f9ee939b7d05667af864ee1cb066393154bf85 (last modified on Thu Feb 20 02:49:13 2025).


{'label': 0, 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I
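
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token IDs are made up, and a real tokenizer also keeps special tokens such as `[SEP]` in place when it truncates. The pad ID of 0 matches the `pad_token_id` printed in the model config in this notebook.

```python
def pad_or_truncate(token_ids, max_length, pad_token_id=0):
    """Toy illustration: force a token list to exactly max_length entries."""
    clipped = token_ids[:max_length]            # truncation: drop tokens past the limit
    num_pad = max_length - len(clipped)         # padding: fill the remainder
    return {
        "input_ids": clipped + [pad_token_id] * num_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(clipped) + [0] * num_pad,
    }

# Made-up token IDs, for illustration only
short_review = pad_or_truncate([101, 11, 12, 102], max_length=6)
long_review = pad_or_truncate([101, 11, 12, 13, 14, 15, 16, 102], max_length=6)
print(short_review)
print(long_review)
```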

In [3]:
# subsample for quick experiment
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(100))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(50))

In [4]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5, torch_dtype="auto")
print(model.config)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertConfig {
  "_attn_implementation_autoset": true,
  "_name_or_path": "google-bert/bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.49.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}



In [8]:
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
import numpy as np
import evaluate

# Metrics can be loaded from the Hub by name (https://huggingface.co/evaluate-metric):
# acc_metric = evaluate.load("accuracy")
# f1_metric = evaluate.load("f1")
# Here we load local copies of the metric scripts instead.
acc_metric = evaluate.load("metric_accuracy.py")
f1_metric = evaluate.load("metric_f1.py")

def compute_metrics(eval_pred):
    # eval_pred holds the raw logits and the true labels for the eval set
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    acc = acc_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="macro")
    # Merge both results into one dict: {"accuracy": ..., "f1": ...}
    acc.update(f1)
    return acc


training_args = TrainingArguments(
    output_dir="test_trainer",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=4,
    save_total_limit=3,
    learning_rate=2e-5,
    weight_decay=0.01,
    metric_for_best_model="f1",
    load_best_model_at_end=True,
    # gradient_accumulation_steps=32,
    # gradient_checkpointing=True,
    # optim="adafactor",
)
print(training_args)

# # Freeze bert layers
# for name, param in model.bert.named_parameters():
#     param.requires_grad = False

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

TrainingArguments(
_n_gpu=2,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.EPOCH,
eval_
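
To show what `compute_metrics` reports, the following sketch recomputes accuracy and macro F1 by hand with NumPy on four made-up logit rows. The `macro_f1` helper is hypothetical and averages over a fixed number of labels, which may differ from how the `evaluate` metric treats labels that never occur, so treat it as an illustration rather than a drop-in replacement.

```python
import numpy as np

def macro_f1(predictions, references, num_labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in range(num_labels):
        tp = np.sum((predictions == label) & (references == label))
        fp = np.sum((predictions == label) & (references != label))
        fn = np.sum((predictions != label) & (references == label))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); score 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Made-up logits for 4 examples and 3 classes
logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.1, 0.2, 3.0],
    [1.2, 0.9, 0.1],
])
labels = np.array([0, 1, 2, 1])

predictions = np.argmax(logits, axis=-1)       # index of the largest logit per row
accuracy = float(np.mean(predictions == labels))
f1 = macro_f1(predictions, labels, num_labels=3)
print({"accuracy": accuracy, "f1": f1})
```

Accuracy counts only exact hits, while macro F1 weights every class equally, which is why the two numbers can diverge when some classes are predicted poorly.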

## Transformer Memory Optimization Strategies
You may notice that a few parameters are commented out in __TrainingArguments__. Some of them enable memory optimization strategies for running large models on GPUs with limited VRAM (although this example uses the comparatively small 110M-parameter BERT model).

This table compares optimization techniques for reducing GPU memory consumption when training large transformer models, along with their benefits, drawbacks, and how to configure them in Hugging Face.

| **Method**                | **What It Does**                                                                                                                                     | **Pros**                                                                                                                                                     | **Cons**                                                                                                                                                               | **Hugging Face Parameters / Implementation**                                                                                                       |
|---------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Gradient Accumulation** | Processes smaller mini-batches multiple times before performing a parameter update. Allows simulating a larger batch size.                         | - Reduces peak GPU memory usage.<br/>- Enables training with a larger effective batch size on smaller GPUs.                                                  | - Increases training time.<br/>- Can lead to slower convergence.                                                                                                        | - 🤗 *Trainer*: `gradient_accumulation_steps=<int>`<br/>- *Accelerate* config: specify `gradient_accumulation_steps`.                              |
| **Gradient Checkpointing** | Saves memory by recomputing activations during backpropagation instead of storing them.                                                             | - Significantly reduces memory usage.<br/>- Enables training larger models on constrained memory.                                                            | - Increases training time due to recomputation overhead.<br/>- More complex debugging.                                                                                 | - 🤗 *Trainer*: `gradient_checkpointing=True`. If training then fails with an error, also call `model.enable_input_require_grads()`.<br/>- Manually: `model.gradient_checkpointing_enable()`. |
| **Adafactor Optimizer**   | A memory-efficient optimizer that reduces optimizer state memory, useful for very large models.                                                     | - Lower memory footprint than Adam/AdamW.<br/>- Good for large-scale training.                                                                               | - May converge differently or more slowly than AdamW.<br/>- Less widely benchmarked for various tasks.                                                                | - 🤗 *Trainer*: `optim="adafactor"`<br/>- Manually: instantiate `Adafactor` from `transformers.optimization` and pass it to the Trainer.         |
| **Freezing Layers**       | Disables gradient updates for certain model layers, reducing memory for gradients and optimizer states.                                             | - Greatly reduces memory usage.<br/>- Speeds up training.                                                                                                   | - The frozen layers do not adapt to the new task.<br/>- May degrade performance if crucial parameters are frozen.                                                   | - No direct Trainer parameter; manually set `requires_grad = False` on the parameters of the selected layers (e.g., `for p in model.bert.encoder.layer[0].parameters(): p.requires_grad = False`). |
| **Reducing Sequence Length** | Uses a smaller maximum sequence length (e.g., from 128 → 64 → 32), lowering memory usage for self-attention and sequence-dependent operations. | - Decreases memory usage significantly.<br/>- Speeds up training.                                                                                            | - Limits the model’s ability to handle longer contexts.<br/>- May reduce accuracy for tasks that require long sequences.                                            | - No Trainer parameter; pass `max_length=<int>` together with `truncation=True` in the tokenizer call.<br/>- Manually: truncate/pad sequences before training. |

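The gradient accumulation row rests on one identity: averaging the gradients of equally sized micro-batches gives the same gradient as one large batch. The sketch below checks this with NumPy on a toy least-squares model. The data, the `mse_grad` helper, and the choice of 4 accumulation steps are all assumptions for illustration; this is not how the Trainer implements the feature internally.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))   # 32 toy samples, 4 features
y = rng.normal(size=32)
w = np.zeros(4)                # current model weights

def mse_grad(w, X, y):
    """Gradient of the mean squared error of a linear model with respect to w."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)

# One large batch of 32 samples
full_grad = mse_grad(w, X, y)

# Four micro-batches of 8 samples, accumulated before a single update
accumulation_steps = 4
accumulated_grad = np.zeros_like(w)
for X_micro, y_micro in zip(np.split(X, accumulation_steps), np.split(y, accumulation_steps)):
    # Scale each micro-batch gradient so the sum equals the large-batch mean
    accumulated_grad += mse_grad(w, X_micro, y_micro) / accumulation_steps

print(np.allclose(full_grad, accumulated_grad))
```

Only one micro-batch of activations is in memory at a time, which is where the saving comes from; the price is several forward and backward passes per parameter update.
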
### 1. Combining Methods
   - These strategies can be **combined** for maximum effect. For instance:
     - **Gradient Accumulation + Checkpointing** → Fit a large model on a smaller GPU.
     - **Adafactor + Freezing Layers** → Efficient fine-tuning of LLMs with minimal memory usage.
   
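Of the methods combined above, layer freezing is the only one without a Trainer flag, so here is a minimal PyTorch sketch of it. `TinyClassifier` is a hypothetical stand-in for a pre-trained encoder plus a classification head; with the BERT model in this notebook, the equivalent is the commented-out loop over `model.bert.named_parameters()`.

```python
import torch
from torch import nn

class TinyClassifier(nn.Module):
    """Hypothetical stand-in: an 'encoder' body plus a classification head."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32))
        self.classifier = nn.Linear(32, 5)

    def forward(self, x):
        return self.classifier(self.encoder(x))

def count_trainable(model):
    # Only parameters with requires_grad=True receive gradients and optimizer state
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model = TinyClassifier()
trainable_before = count_trainable(model)

# Freeze the encoder: set requires_grad on each parameter, not on the module
for param in model.encoder.parameters():
    param.requires_grad = False
trainable_after = count_trainable(model)

# After a backward pass, frozen parameters have no gradient at all
model(torch.randn(4, 16)).sum().backward()
frozen_have_no_grad = all(p.grad is None for p in model.encoder.parameters())

print(trainable_before, trainable_after, frozen_have_no_grad)
```

Assigning `requires_grad = False` to a module object only creates a plain attribute and freezes nothing, which is why the loop goes over the parameters.
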
### 2. Performance Trade-offs
   - Most memory-saving techniques **increase computation time** (e.g., checkpointing, accumulation) or **reduce adaptability** (e.g., freezing layers, reducing sequence length).
   - Always **measure the impact** on both memory and training speed to find the right balance.

### 3. Advanced Techniques
   - **LoRA (Low-Rank Adaptation)**: Parameter-efficient fine-tuning.
   - **CPU Offload**: Moves optimizer states/activations to CPU RAM to reduce GPU usage.
   - **Flash Attention**: More efficient memory usage in attention layers.

By selecting the right mix of these strategies, you can train transformer models within constrained GPU memory while maintaining reasonable performance.
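
As a rough worked example for the Adafactor row, the arithmetic below compares second-moment storage for a single weight matrix. It assumes the factored form of Adafactor, which keeps one row vector and one column vector per matrix instead of a full matrix, and uses the `hidden_size` and `intermediate_size` values from the config printed in this notebook. It ignores biases, the first-moment buffer that Adam additionally keeps, and all other tensors.

```python
# Shape of one BERT feed-forward weight matrix, taken from the printed config
hidden_size = 768
intermediate_size = 3072

# Adam/AdamW: one second-moment value per weight
adam_second_moment = hidden_size * intermediate_size

# Factored Adafactor (assumption stated above): one value per row plus one per column
adafactor_second_moment = hidden_size + intermediate_size

ratio = adam_second_moment / adafactor_second_moment
print(adam_second_moment, adafactor_second_moment, round(ratio, 1))
# prints: 2359296 3840 614.4
```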


In [9]:
trainer.train()



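| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----------|
| 1 | No log | 1.592784 | 0.36 | 0.221667 |
| 2 | 1.603300 | 1.540344 | 0.38 | 0.293734 |
| 3 | 1.482100 | 1.517041 | 0.42 | 0.331553 |
| 4 | 1.482100 | 1.514652 | 0.42 | 0.333694 |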




TrainOutput(global_step=28, training_loss=1.494711194719587, metrics={'train_runtime': 21.9455, 'train_samples_per_second': 18.227, 'train_steps_per_second': 1.276, 'total_flos': 105247256985600.0, 'train_loss': 1.494711194719587, 'epoch': 4.0})
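
The `global_step=28` above and the "No log" entry for epoch 1 can both be reproduced with a few lines of arithmetic. The sketch assumes a single process driving both GPUs (the printed arguments show `_n_gpu=2`), so each optimizer step consumes `per_device_train_batch_size * n_gpu` samples, and that the last partial batch is kept (`dataloader_drop_last=False`). The variable names are our own; the values come from this notebook.

```python
import math

# Values taken from the notebook's subsample and TrainingArguments
num_train_samples = 100
per_device_train_batch_size = 8
n_gpu = 2
num_train_epochs = 4
logging_steps = 10

samples_per_step = per_device_train_batch_size * n_gpu              # 16
steps_per_epoch = math.ceil(num_train_samples / samples_per_step)   # partial batch counts
total_steps = steps_per_epoch * num_train_epochs

# Step of the most recent training-loss log when each epoch ends (0 = nothing logged yet)
last_logged_step = {
    epoch: (epoch * steps_per_epoch // logging_steps) * logging_steps
    for epoch in range(1, num_train_epochs + 1)
}

print(total_steps)        # 28
print(last_logged_step)   # {1: 0, 2: 10, 3: 20, 4: 20}
```

Epoch 1 ends at step 7, before the first log at step 10, hence "No log". Epochs 3 and 4 both end after the log at step 20 and before step 30, which is why they show the same training loss.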

In [10]:
trainer.evaluate(small_eval_dataset)
# trainer.predict(tokenized_datasets["test"])



{'eval_loss': 1.5146517753601074,
 'eval_accuracy': 0.42,
 'eval_f1': 0.3336940836940837,
 'eval_runtime': 1.4192,
 'eval_samples_per_second': 35.23,
 'eval_steps_per_second': 1.409,
 'epoch': 4.0}