In [None]:
!pip install -qU accelerate liger-kernel galore-torch apollo-torch lomo-optim grokadamw schedulefree

# Trainer

The `Trainer` is a complete training and evaluation loop for PyTorch models implemented in the Transformers library.

## Basic usage

`Trainer` includes all the code in a basic training loop:
* performing a training step to calculate the loss
* calculating the gradients with the `backward` method
* updating the weights based on the gradients
* repeating this process until we have reached a predetermined number of epochs
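
The four steps above can be sketched without any framework. This is a toy illustration, not `Trainer`'s implementation: the model is a single weight `w` fitted to `y = 2x`, the gradient is written by hand where PyTorch would call `backward`, and the names `xs`, `ys`, `learning_rate`, and `num_epochs` are made up for the example.

```python
# Toy data for y = 2x; the "model" is a single weight w.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0
learning_rate = 0.01
num_epochs = 200

for epoch in range(num_epochs):
    for x, y in zip(xs, ys):
        # 1. training step: forward pass and squared-error loss
        pred = w * x
        loss = (pred - y) ** 2
        # 2. gradient of the loss with respect to w
        #    (the quantity `backward` would compute in PyTorch)
        grad = 2 * (pred - y) * x
        # 3. update the weight based on the gradient
        w -= learning_rate * grad
    # 4. the outer loop repeats until num_epochs is reached

print(w)  # very close to 2.0
```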

If we want to specify any training options or hyperparameters, we can find them in the `TrainingArguments` class:

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='our-model',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    push_to_hub=False,
)

Then we can pass `training_args` to the `Trainer` along with a model, a dataset, a preprocessor for the dataset, a data collator, and a function to compute the metrics we want to track during training. Finally, we call the `train()` method to start training:

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()

### Checkpoints

Our trained model checkpoints are saved in the `output_dir` directory, each in a `checkpoint-000` subfolder where the number at the end corresponds to the training step. We can resume training from a saved checkpoint:

In [None]:
# resume from latest checkpoint
trainer.train(resume_from_checkpoint=True)

# resume from specific checkpoint saved in the output directory
trainer.train(resume_from_checkpoint='our-model/checkpoint-1000')
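
Picking the latest checkpoint means comparing the step numbers in the subfolder names numerically rather than alphabetically (as strings, `checkpoint-500` sorts after `checkpoint-1500`). A minimal sketch of that lookup, using only the standard library; `find_latest_checkpoint` is a hypothetical helper written for this example, not part of Transformers:

```python
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")

def find_latest_checkpoint(output_dir):
    """Return the path of the checkpoint subfolder with the highest step, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            step = int(match.group(1))  # compare as integers, not strings
            if step > best_step:
                best_step, best_path = step, path
    return best_path

# Demo on a throwaway directory with three fake checkpoints
with tempfile.TemporaryDirectory() as output_dir:
    for step in (500, 1000, 1500):
        os.makedirs(os.path.join(output_dir, f"checkpoint-{step}"))
    latest = os.path.basename(find_latest_checkpoint(output_dir))

print(latest)  # checkpoint-1500
```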

## Customize the Trainer

Many of the `Trainer`'s methods can be overridden in a subclass to support the functionality we want, without having to rewrite the entire training loop from scratch. These methods include:
* `get_train_dataloader()`
* `get_eval_dataloader()`
* `get_test_dataloader()`
* `log()`
* `create_optimizer_and_scheduler()`
* `compute_loss()`
* `training_step()`
* `prediction_step()`
* `evaluate()`
* `predict()`

For example, to customize the `compute_loss()` method so that it uses a weighted loss:

In [None]:
import torch
from torch import nn
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        labels = inputs.pop('labels')

        # forward pass
        outputs = model(**inputs)
        logits = outputs.get('logits')

        # compute custom loss for 3 labels with different weights;
        # the weight tensor must be on the same device as the logits
        loss_fn = nn.CrossEntropyLoss(
            weight=torch.tensor([1.0, 2.0, 3.0], device=model.device)
        )
        loss = loss_fn(
            logits.view(-1, self.model.config.num_labels),
            labels.view(-1)
        )

        return (loss, outputs) if return_outputs else loss
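
To see what the class weights do, the weighted loss can be worked out by hand. With the default mean reduction, `nn.CrossEntropyLoss` divides the weighted sum of per-example losses by the sum of the weights of the target classes. The sketch below reproduces that formula in plain Python on made-up logits; `weighted_cross_entropy` is a hypothetical helper, not a library function:

```python
import math

def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of per-example cross-entropy losses."""
    total, weight_sum = 0.0, 0.0
    for row, label in zip(logits, labels):
        # negative log-softmax of the target class
        log_norm = math.log(sum(math.exp(z) for z in row))
        nll = log_norm - row[label]
        total += weights[label] * nll
        weight_sum += weights[label]
    return total / weight_sum

# Two examples with identical logits favoring class 0:
# the first is labeled 0 (small loss), the second is labeled 2 (large loss).
logits = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
labels = [0, 2]

unweighted_loss = weighted_cross_entropy(logits, labels, [1.0, 1.0, 1.0])
weighted_loss = weighted_cross_entropy(logits, labels, [1.0, 2.0, 3.0])

# The mistake on class 2 carries weight 3.0, so it dominates the weighted mean.
print(unweighted_loss, weighted_loss)
```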

### Callbacks

**Callbacks** do not change anything in the training loop. They inspect the training loop state and then execute some action (early stopping, logging results, etc.) depending on the state. A callback cannot be used to implement something like a custom loss function.

For example, to add a callback that stops the training loop early after 10 steps:

In [None]:
from transformers import TrainerCallback

class EarlyStoppingCallback(TrainerCallback):
    def __init__(self, num_steps=10):
        self.num_steps = num_steps

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step >= self.num_steps:
            control.should_training_stop = True

Then we pass it to the `Trainer`'s `callbacks` parameter:

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback()]
)
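
How a callback steers training without touching the loop body can be shown with a stripped-down loop. The `State`, `Control`, `StopAfterNSteps`, and `run_loop` names below are simplified stand-ins invented for this sketch; the real `Trainer` passes richer `TrainerState` and `TrainerControl` objects and fires many more events:

```python
from dataclasses import dataclass

@dataclass
class State:
    global_step: int = 0

@dataclass
class Control:
    should_training_stop: bool = False

class StopAfterNSteps:
    """Callback that only reads the state and sets a flag."""
    def __init__(self, num_steps=10):
        self.num_steps = num_steps

    def on_step_end(self, state, control):
        if state.global_step >= self.num_steps:
            control.should_training_stop = True

def run_loop(max_steps, callbacks):
    state, control = State(), Control()
    while state.global_step < max_steps:
        state.global_step += 1  # a real loop would run the training step here
        for callback in callbacks:
            callback.on_step_end(state, control)
        # the loop, not the callback, decides what the flag means
        if control.should_training_stop:
            break
    return state.global_step

steps_run = run_loop(max_steps=100, callbacks=[StopAfterNSteps(num_steps=10)])
print(steps_run)  # 10
```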

## Logging

The `Trainer` is set to `logging.INFO` by default, which reports errors, warnings, and other basic information. A `Trainer` replica (in distributed environments) is set to `logging.WARNING`, which only reports errors and warnings.

For example, to make our own code and the libraries it uses share the log level assigned to each node:
```python
import logging
import sys
import datasets
import transformers

logger = logging.getLogger(__name__)

logging.basicConfig(
    format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
    datefmt='%m/%d/%Y %H:%M:%S',
    handlers=[logging.StreamHandler(sys.stdout)],
)

log_level = training_args.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)

trainer = Trainer(...)
```

## NEFTune

**NEFTune** is a technique that can improve performance by adding noise to the embedding vectors during training. To enable it in `Trainer`, set the `neftune_noise_alpha` parameter:
```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    ...,
    neftune_noise_alpha=0.1
)
trainer = Trainer(
    ...,
    args=training_args
)
```
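
As a rough illustration of what the noise looks like, the NEFTune paper samples uniform noise and scales it by `alpha / sqrt(L * d)`, where `L` is the sequence length and `d` is the embedding dimension. The sketch below applies that rule to a small list-of-lists embedding in plain Python; `add_neftune_noise` is a hypothetical helper, and the actual implementation operates on tensors inside the embedding layer:

```python
import math
import random

def add_neftune_noise(embeddings, alpha, rng):
    """Add uniform noise bounded by alpha / sqrt(seq_len * dim) to each value."""
    seq_len, dim = len(embeddings), len(embeddings[0])
    bound = alpha / math.sqrt(seq_len * dim)
    noisy = [[v + rng.uniform(-bound, bound) for v in row] for row in embeddings]
    return noisy, bound

# A made-up sequence of 2 tokens with 3-dimensional embeddings
embeddings = [[0.5, -1.0, 0.25], [1.0, 0.0, -0.5]]
noisy, bound = add_neftune_noise(embeddings, alpha=0.1, rng=random.Random(0))

print(bound)  # 0.1 / sqrt(6)
print(noisy)
```

Longer sequences and wider embeddings get smaller per-value noise, since the bound shrinks with `sqrt(L * d)`.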

## Liger Kernel

**Liger Kernel** is a collection of Triton kernels developed by LinkedIn and designed specifically for LLM training. According to its authors, it can increase multi-GPU training throughput by 20% and reduce memory usage by 60%. To enable it, set `use_liger_kernel=True` in `TrainingArguments`:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="your-model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
    use_liger_kernel=True # add here
)
```

## Optimizers

To choose a built-in optimizer, set the `optim` parameter in `TrainingArguments`:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    ...,
    optim='adamw_torch'
)
```

We can also use an arbitrary PyTorch optimizer:
```python
import torch
from transformers import Trainer

optimizer_cls = torch.optim.AdamW
optimizer_kwargs = {
    'lr': 4e-3,
    'betas': (0.9, 0.999),
    'weight_decay': 0.05,
}

trainer = Trainer(
    ...,
    optimizer_cls_and_kwargs=(optimizer_cls, optimizer_kwargs)
)
```

### GaLore

**Gradient Low-Rank Projection (GaLore)** is a low-rank training strategy that allows full-parameter learning while being more memory-efficient than common low-rank adaptation methods such as LoRA.

In [None]:
import torch
import datasets
import trl
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, AutoConfig

train_dataset = datasets.load_dataset('imdb', split='train')

model_id = 'google/gemma-2b'
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# build the model from its config (randomly initialized weights)
model = AutoModelForCausalLM.from_config(config).to('cuda')

In [None]:
args = TrainingArguments(
    output_dir='./text-galore',
    max_steps=100,
    per_device_train_batch_size=2,
    optim='galore_adamw',
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"]
)

trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512
)

trainer.train()
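
The `optim_target_modules` entries are regular expressions, so it helps to check which module names they select. The sketch below does this with `re.fullmatch`, assuming each pattern is matched against the full module name; the module names are hypothetical examples in the style of a decoder-only model, and `is_targeted` is a helper invented for this illustration. Note that the unescaped `.` in `.*.attn.*` matches any character, not only a literal dot.

```python
import re

target_patterns = [r".*.attn.*", r".*.mlp.*"]

# Hypothetical module names
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.up_proj",
    "model.embed_tokens",
    "lm_head",
]

def is_targeted(name, patterns):
    """True if any pattern matches the whole module name."""
    return any(re.fullmatch(pattern, name) for pattern in patterns)

targeted = [name for name in module_names if is_targeted(name, target_patterns)]
print(targeted)
# ['model.layers.0.self_attn.q_proj', 'model.layers.0.mlp.up_proj']
```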

To pass extra arguments supported by GaLore, we set `optim_args` to a comma-separated string of `key=value` pairs:

In [None]:
args = TrainingArguments(
    output_dir='./text-galore',
    max_steps=100,
    per_device_train_batch_size=2,
    optim='galore_adamw',
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
    optim_args="rank=64, update_proj_gap=100, scale=0.10"
)

trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512
)

trainer.train()
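
Because `optim_args` is a single string, it has to be split into key/value pairs before the optimizer can use it. A minimal sketch of that parsing, assuming whitespace is ignored and pairs are separated by commas; `parse_optim_args` is a hypothetical helper written for this example:

```python
def parse_optim_args(optim_args):
    """Split 'key=value, key=value' into a dict of strings."""
    parsed = {}
    for pair in optim_args.replace(" ", "").split(","):
        key, value = pair.split("=")
        parsed[key] = value
    return parsed

galore_kwargs = parse_optim_args("rank=64, update_proj_gap=100, scale=0.10")

# Values arrive as strings and must be cast by whoever consumes them.
rank = int(galore_kwargs["rank"])
scale = float(galore_kwargs["scale"])

print(galore_kwargs)
# {'rank': '64', 'update_proj_gap': '100', 'scale': '0.10'}
```

A malformed entry (for example, one without `=`) makes `pair.split("=")` fail to unpack in this sketch, so the string should be checked for typos before launching a long run.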

We can also perform layer-wise optimization by appending `layerwise` to the optimizer name:

In [None]:
args = TrainingArguments(
    output_dir='./text-galore',
    max_steps=100,
    per_device_train_batch_size=2,
    optim='galore_adamw_layerwise', # change here
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"]
)

trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512
)

trainer.train()

### APOLLO

**Approximated Gradient Scaling for Memory Efficient LLM Optimization (APOLLO)** is a memory-efficient training strategy that allows full-parameter learning for both pre-training and fine-tuning, while maintaining AdamW-level performance with SGD-like memory efficiency.
* **Ultra-low rank efficiency**, requiring much lower rank than GaLore - even rank 1 (APOLLO-Mini) suffices
* **No expensive SVD computations**, APOLLO leverages random projection, avoiding training stalls, unlike GaLore

In [None]:
import torch
import datasets
import trl
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

train_dataset = datasets.load_dataset('imdb', split='train')

model_id = 'google/gemma-2b'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
).to('cuda')

In [None]:
args = TrainingArguments(
    output_dir='./text-apollo',
    max_steps=100,
    per_device_train_batch_size=2,
    optim='apollo_adamw',
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"]
)

trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512
)

trainer.train()

We can even enable APOLLO-Mini (rank=1 for extreme memory efficiency) by passing `optim_args`:

In [None]:
args = TrainingArguments(
    output_dir='./text-apollo',
    max_steps=100,
    per_device_train_batch_size=2,
    optim='apollo_adamw',
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
    optim_args="proj=random, rank=1, scale=128.0, scale_type=tensor, update_proj_gap=200"
)

### LOMO optimizer

The **LOMO optimizers** were introduced in *Full Parameter Fine-Tuning for Large Language Models with Limited Resources* and *AdaLomo: Low-memory Optimization with Adaptive Learning Rate*. Both are efficient full-parameter fine-tuning methods that fuse the gradient computation and the parameter update into one step to reduce memory usage.

According to the authors, it is recommended to use `AdaLomo` without `grad_norm` to get better performance and higher throughput.

For example, to fine-tune `google/gemma-2b` on the IMDB dataset in full precision:

In [None]:
import torch
import datasets
from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM
import trl

train_dataset = datasets.load_dataset('imdb', split='train')

model_id = 'google/gemma-2b'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
).to('cuda')

In [None]:
args = TrainingArguments(
    output_dir='./text-lomo',
    max_steps=1000,
    per_device_train_batch_size=4,
    optim='adalomo',
    gradient_checkpointing=True,
    logging_strategy='steps',
    logging_steps=1,
    learning_rate=2e-6,
    save_strategy='no',
    run_name='lomo-imdb'
)

trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=1024
)

trainer.train()

### GrokAdamW optimizer

The **GrokAdamW** optimizer is designed to enhance training performance and stability, particularly for models that benefit from grokking signal functions.

GrokAdamW is particularly useful for models that require advanced optimization techniques to achieve better performance and stability.

For example, to fine-tune `google/gemma-2b` on the IMDB dataset using the GrokAdamW optimizer:

In [None]:
import torch
import datasets
from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM
import trl

train_dataset = datasets.load_dataset('imdb', split='train')

model_id = 'google/gemma-2b'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
).to('cuda')

In [None]:
args = TrainingArguments(
    output_dir='./text-grokadamw',
    max_steps=1000,
    per_device_train_batch_size=4,
    optim='grokadamw',
    logging_strategy='steps',
    logging_steps=1,
    learning_rate=2e-5,
    save_strategy='no',
    run_name='grokadamw-imdb'
)

trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=1024
)

trainer.train()

### Schedule-free optimizer

The **Schedule-free** optimizers were introduced in *The Road Less Scheduled*.

Schedule-free learning replaces the momentum of the base optimizer with a combination of averaging and interpolation, which completely removes the need to anneal the learning rate with a traditional schedule. In addition, neither `warmup_steps` nor `warmup_ratio` is required when using `schedule_free_radam`.
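
The averaging-and-interpolation idea can be sketched for plain SGD on a one-dimensional quadratic. The update below follows the schedule-free SGD recursion as given in *The Road Less Scheduled* (the gradient is evaluated at an interpolation `y` of the base iterate `z` and the running average `x`). The objective, the constants `lr` and `beta`, and the function name are assumptions chosen for illustration, and this is not the code path `Trainer` uses:

```python
def schedule_free_sgd(grad_fn, w0, lr=0.1, beta=0.9, num_steps=1000):
    z = w0  # base SGD iterate
    x = w0  # running average of the z iterates; this is the point we evaluate
    for t in range(1, num_steps + 1):
        y = (1 - beta) * z + beta * x  # interpolation where the gradient is taken
        z = z - lr * grad_fn(y)        # constant learning rate, no schedule
        c = 1.0 / (t + 1)
        x = (1 - c) * x + c * z        # uniform average of z_1 .. z_{t+1}
    return x

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w_final = schedule_free_sgd(lambda w: 2 * (w - 3.0), w0=0.0)
print(w_final)  # close to 3.0
```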

In [None]:
import torch
import datasets
from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM
import trl

train_dataset = datasets.load_dataset('imdb', split='train')

model_id = 'google/gemma-2b'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
).to('cuda')

In [None]:
args = TrainingArguments(
    output_dir='./text-schedulefree',
    max_steps=1000,
    per_device_train_batch_size=4,
    optim='schedule_free_radam',
    lr_scheduler_type='constant',
    gradient_checkpointing=True,
    logging_strategy='steps',
    logging_steps=1,
    learning_rate=2e-6,
    save_strategy='no',
    run_name='sfo-imdb'
)

trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=1024
)

trainer.train()