## Unsloth: Optimizing Training and Inference Performance

The performance of many algorithms depends not only on the number and kind of calculations performed, but also on the order in which they run and on the size of the chunks they operate on. Both can have an enormous influence on speed.
For large language models, the `unsloth` library provides optimized GPU kernels that were written after manually deriving the compute-heavy math steps. Using these kernels can yield a significant speed-up; the library's own startup banner advertises "2x faster" fine-tuning.
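
To make the point about ordering and chunk size concrete, here is a minimal sketch in plain Python. It is not taken from `unsloth`; the function names and the tile size are chosen only for illustration. Both functions perform exactly the same multiply-adds, but the second one groups them into small tiles, which is the kind of restructuring that lets a GPU kernel keep its working set in fast memory.

```python
def matmul_naive(a, b):
    """Row-by-column product: one full dot product per output element."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_blocked(a, b, block=2):
    """Same multiply-adds as matmul_naive, grouped into block x block tiles."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, block):          # tile of output rows
        for j0 in range(0, m, block):      # tile of output columns
            for p0 in range(0, k, block):  # tile of the shared dimension
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + block, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


a = [[i + j for j in range(5)] for i in range(4)]      # 4 x 5 integer matrix
b = [[i * j + 1 for j in range(3)] for i in range(5)]  # 5 x 3 integer matrix
assert matmul_naive(a, b) == matmul_blocked(a, b, block=2)
```

In pure Python the tiled version is not faster; the sketch only shows that the result is unchanged when the work is reordered. The speed difference appears on hardware with caches or shared memory, where the tile size decides how often data must be fetched again.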

### Key Techniques in Unsloth:

1. **Efficient Data Loading**: Optimizing data pipelines to reduce latency and improve throughput during training.
2. **Batching and Padding Strategies**: Dynamically adjusting batch sizes and minimizing padding to optimize memory usage.
3. **Half-Precision and Quantized Inference**: Using mixed precision or quantized models to speed up inference and reduce memory footprint.
4. **Model Pruning and Distillation**: Reducing the size of the model by removing redundant parameters or training smaller models to mimic larger ones.
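
The effect of a padding strategy (item 2) can be estimated without any model. The following sketch is a generic illustration, not Unsloth's implementation: `padding_tokens` is a hypothetical helper, and the sequence lengths are random made-up values. It compares the number of pad tokens when batches are formed in arrival order with the number when sequences are first sorted by length.

```python
import random


def padding_tokens(lengths, batch_size):
    """Count pad tokens when every batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


random.seed(0)
lengths = [random.randint(16, 1024) for _ in range(64)]  # hypothetical token counts

unsorted_pad = padding_tokens(lengths, batch_size=8)
sorted_pad = padding_tokens(sorted(lengths), batch_size=8)
print(f"pad tokens, arrival order: {unsorted_pad}")
print(f"pad tokens, length-sorted: {sorted_pad}")
```

Grouping sequences of similar length never increases the padding in this setting, because each batch is only padded up to its own longest member.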

### Benefits of Unsloth:

- **Reduced Training Time**: Optimizing data loading and model architecture reduces the time required for each epoch.
- **Lower Memory Usage**: Using techniques like mixed precision and quantization reduces the amount of GPU memory required.
- **Faster Inference**: Optimizing the model for deployment can significantly reduce latency during inference.
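
As a rough worked example for the memory claim, the sketch below estimates the size of the model weights alone at different precisions. The parameter count is the total that the training banner in this notebook reports; the bytes-per-parameter values are idealized assumptions. The estimate deliberately ignores quantization constants, layers that stay unquantized, activations, gradients, and optimizer state, so real GPU usage will be higher.

```python
N_PARAMS = 1_247_086_592  # total parameter count reported in the training banner

# Idealized storage cost per parameter (assumption: no per-block overhead)
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "4-bit": 0.5}


def weight_memory_gib(n_params, bytes_per_param):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>8}: {weight_memory_gib(N_PARAMS, nbytes):.2f} GiB")
```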

### Hands-On Example: Efficient Data Loading and Mixed Precision Training

In this example, we take the code from the previous notebook ("PEFT") and adapt it to use `unsloth`.

In [1]:
# Import libraries
## Instead of:
# from transformers import AutoModelForCausalLM, AutoTokenizer
## use:
from unsloth import FastLanguageModel  # import unsloth before transformers/trl so that it can patch them

from datasets import load_dataset
from transformers import pipeline
from trl import SFTTrainer, SFTConfig

# Only needed for the commented-out non-Unsloth alternatives below:
# from peft import LoraConfig, get_peft_model
# from transformers import BitsAndBytesConfig

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [2]:
# Choose a model and load tokenizer and model (using 4bit quantization):
# model_name = "meta-llama/Llama-3.2-1B-Instruct"
# model_name = "/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/meta-llama--Llama-3.2-1B-Instruct"
# model_name = "unsloth/Llama-3.2-1B-Instruct"
model_name = "/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/unsloth--Llama-3.2-1B-Instruct"

## Instead of:
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForCausalLM.from_pretrained(...)
## use: 
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name,
    ## Instead of:
    # quantization_config=BitsAndBytesConfig(...)
    ## use:
    load_in_4bit=True,
    # device_map='cuda:0',
    trust_remote_code=True
)
tokenizer.padding_side = 'right'
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.9.1: Fast Llama patching. Transformers: 4.52.4.
   \\   /|    NVIDIA A100-SXM-64GB. Num GPUs = 1. Max memory: 63.423 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Device set to use cuda:0
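
To give an intuition for what `load_in_4bit=True` does to the weights, here is a simplified sketch of block-wise absmax quantization. It is an illustration under stated assumptions, not the scheme used by the library: the actual 4-bit formats (for example NF4 in bitsandbytes) use non-uniformly spaced levels, whereas this sketch uses uniform integer levels from -7 to 7. The function names and weight values are made up.

```python
def quantize_block(values, levels=7):
    """Map floats to integers in [-levels, levels] using one absmax scale per block."""
    absmax = max(abs(v) for v in values)
    scale = absmax / levels if absmax > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.07, 0.91, -0.28, 0.0, 0.44, -0.76]  # made-up values
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

assert all(-7 <= c <= 7 for c in codes)   # every code fits into 4 bits
assert max_error <= scale / 2 + 1e-12     # rounding error is at most half a step
```

Only the small integer codes and one scale per block need to be stored, which is where the memory saving comes from; the price is the rounding error bounded above.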


In [3]:
# Load the guanaco dataset
guanaco_train = load_dataset('/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/timdettmers--openassistant-guanaco', split='train')
# guanaco_test = load_dataset('/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/timdettmers--openassistant-guanaco', split='test')
# guanaco_train = load_dataset('timdettmers/openassistant-guanaco', split='train')
# guanaco_test = load_dataset('timdettmers/openassistant-guanaco', split='test')

Repo card metadata block was not found. Setting CardData to empty.



In [4]:
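def reformat_text(text, include_answer=True):
    """Convert a guanaco record ('### Human: ...### Assistant: ...') to the model's chat format.

    Only the first question/answer pair of the record is used.
    """
    turns = text.split('###')  # turns[0] is the text before the first '###' (normally empty)
    question1 = turns[1].removeprefix(' Human: ')
    answer1 = turns[2].removeprefix(' Assistant: ')
    if include_answer:
        messages = [
            {'role': 'user', 'content': question1},
            {'role': 'assistant', 'content': answer1}
        ]
    else:
        messages = [
            {'role': 'user', 'content': question1}
        ]
    # tokenize=False returns the formatted string instead of token ids
    reformatted_text = tokenizer.apply_chat_template(messages, tokenize=False)
    return reformatted_text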
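
The string handling in `reformat_text` can be tried without a tokenizer. The sketch below uses a made-up record in the `### Human: ...### Assistant: ...` layout and shows what `split('###')` and `removeprefix` produce. It also shows that a record with several turns yields more parts, of which the function above only reads indices 1 and 2.

```python
sample = "### Human: What is LoRA?### Assistant: A low-rank adaptation method."  # made-up record

parts = sample.split('###')
question = parts[1].removeprefix(' Human: ')
answer = parts[2].removeprefix(' Assistant: ')

messages = [
    {'role': 'user', 'content': question},
    {'role': 'assistant', 'content': answer},
]
print(parts)     # the first element is the empty string before the first '###'
print(messages)

# A multi-turn record produces additional parts that reformat_text ignores
multi_turn = sample + "### Human: Thanks!### Assistant: You are welcome."
print(len(multi_turn.split('###')))  # 5 parts: '' plus four turns
```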

In [5]:
# Now, apply reformat_text(..) to the dataset:
guanaco_train = guanaco_train.map(lambda entry: {
    'reformatted_text': reformat_text(entry['text'])
})


In [6]:
## Instead of:
# peft_config = LoraConfig(
#     task_type='CAUSAL_LM',
#     r=16,
#     lora_alpha=32,  # rule of thumb: lora_alpha should be 2*r
#     bias='none',
#     target_modules='all-linear',
# )
# model = get_peft_model(model, peft_config)
## use:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,  # rule of thumb: lora_alpha should be 2*r
    lora_dropout=0.05,  # Unsloth supports any value, but only 0 takes the optimized path (see the warning in the output)
    bias='none',  # Unsloth supports any value, but 'none' is optimized
    # Unsloth does not allow 'all-linear', so list the target modules explicitly:
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
    use_gradient_checkpointing='unsloth',  # True or 'unsloth' for very long context
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.9.1 patched 16 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
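
The number of trainable parameters that this LoRA configuration adds can be checked by hand. Each adapted linear layer with `in_features` inputs and `out_features` outputs gets two matrices, of shape `r x in_features` and `out_features x r`. The layer sizes below are assumptions taken from the published Llama-3.2-1B configuration, not from this notebook; with them, the sum reproduces the "Trainable parameters = 11,272,192" line that `trainer.train()` prints later.

```python
# Assumed Llama-3.2-1B shapes (from the published model config, not from this notebook)
HIDDEN = 2048
INTERMEDIATE = 8192
NUM_LAYERS = 16
KV_DIM = 512   # 8 key/value heads with head size 64

R = 16
LORA_ALPHA = 32

# (in_features, out_features) of each targeted linear layer
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """Parameters of one adapter: A is r x in_features, B is out_features x r."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"trainable LoRA parameters: {total:,}")      # 11,272,192
print(f"LoRA scaling alpha / r: {LORA_ALPHA / R}")  # 2.0
```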


In [7]:
training_arguments = SFTConfig(
    output_dir='output/unsloth-llama-3.2-1b-instruct-guanaco',
    # output_dir='output/unsloth-phi-3.5-mini-instruct-guanaco',
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    # Gradient checkpointing improves memory efficiency, but slows down training,
    # e.g. Mistral 7B with PEFT using bitsandbytes:
    # - enabled: 11 GB GPU RAM and 8 samples/second
    # - disabled: 40 GB GPU RAM and 12 samples/second
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={'use_reentrant': False},  # Use the newer implementation that will become the default.
    optim='adamw_torch',
    learning_rate=2e-4,  # QLoRA suggestions: 2e-4 for 7B or 13B, 1e-4 for 33B or 65B
    logging_strategy='steps',  # 'no', 'epoch' or 'steps'
    logging_steps=10,
    save_strategy='no',  # 'no', 'epoch' or 'steps'
    # save_steps=2000,
    # num_train_epochs=5,
    max_steps=100,
    bf16=True,  # mixed precision training
    report_to='none',  # disable wandb
    max_seq_length=1024,
    dataset_text_field='reformatted_text',
)
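
With `max_steps=100` the run stops long before a full epoch. The short calculation below makes this explicit. The batch settings are the values from the configuration above, the dataset size is the 9,846 examples reported in the outputs of this notebook, and the single GPU is taken from the Unsloth banner; the variable names are only illustrative.

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 1
num_gpus = 1          # single A100, as reported in the Unsloth banner
max_steps = 100
num_examples = 9846   # size of the guanaco train split

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
samples_seen = effective_batch * max_steps
epoch_fraction = samples_seen / num_examples

print(f"effective batch size: {effective_batch}")         # 8
print(f"samples seen:         {samples_seen}")            # 800
print(f"fraction of epoch:    {epoch_fraction:.1%}")      # 8.1%
```

For a real fine-tuning run, `num_train_epochs` (commented out above) would replace `max_steps` so that every example is seen at least once.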

In [8]:
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=guanaco_train,
    processing_class=tokenizer,
)


Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


[2025-09-08 08:13:15,824] [INFO] [real_accelerator.py:260:get_accelerator] Setting ds_accelerator to cuda (auto detect)


df: /leonardo/home/usertrain/a08trb02/one-click-hpc-access-home-trainee02/.triton/autotune: No such file or directory


[2025-09-08 08:13:16,915] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False


In [9]:
train_result = trainer.train()
print("Training result:")
print(train_result)

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 9,846 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192 of 1,247,086,592 (0.90% trained)


| Step | Training Loss |
|------|---------------|
| 10   | 2.2615        |
| 20   | 1.7388        |
| 30   | 1.7857        |
| 40   | 1.6348        |
| 50   | 1.6645        |
| 60   | 1.6218        |
| 70   | 1.6265        |
| 80   | 1.6193        |
| 90   | 1.6054        |
| 100  | 1.6391        |


Training result:
TrainOutput(global_step=100, training_loss=1.7197385597229005, metrics={'train_runtime': 36.0817, 'train_samples_per_second': 22.172, 'train_steps_per_second': 2.771, 'total_flos': 2877935098724352.0, 'train_loss': 1.7197385597229005})
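
The reported metrics can be cross-checked against the logged values. The sketch below recomputes the average loss and the throughput from the numbers printed above. It assumes that each logged loss is the mean over the preceding 10 steps (consistent with `logging_steps=10`) and that every step processed a full batch of 8 samples.

```python
from statistics import mean

logged_losses = [2.2615, 1.7388, 1.7857, 1.6348, 1.6645,
                 1.6218, 1.6265, 1.6193, 1.6054, 1.6391]
train_runtime = 36.0817   # seconds, from TrainOutput
global_step = 100
batch_size = 8

mean_loss = mean(logged_losses)
samples_per_second = global_step * batch_size / train_runtime
steps_per_second = global_step / train_runtime

print(f"mean logged loss: {mean_loss:.4f}")           # 1.7197
print(f"samples/second:   {samples_per_second:.3f}")  # 22.172
print(f"steps/second:     {steps_per_second:.3f}")    # 2.771
```

All three values agree with `training_loss`, `train_samples_per_second`, and `train_steps_per_second` in the `TrainOutput` above, up to the rounding of the logged losses.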


In [10]:
# Shut down the kernel to release memory
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

{'status': 'ok', 'restart': False}