# Unsloth with SFT (Supervised Fine-Tuning)

### Introduction

Supervised Fine-Tuning (SFT) adapts a pre-trained Large Language Model to perform specific tasks using labeled datasets, improving its instruction-following and task performance. 

**Unsloth** accelerates this process by optimizing the training pipeline, reducing memory usage and increasing speed, which often makes SFT feasible on consumer hardware. When combined with **LoRA** (Low-Rank Adaptation), which freezes the base model and trains only small, rank-decomposed matrices, the result is an efficient, cost-effective fine-tuning approach that largely preserves the original knowledge of the model while requiring significantly less compute and storage.
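
As a compact reminder of what "rank-decomposed" means, the standard LoRA formulation can be written as below. The symbols ($W_0$, $A$, $B$, $r$, $\alpha$) follow the usual LoRA notation; they are not names taken from this notebook's code.

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x, \qquad
W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}},\quad
B \in \mathbb{R}^{d_{\text{out}} \times r},\quad
A \in \mathbb{R}^{r \times d_{\text{in}}},\quad
r \ll \min(d_{\text{in}}, d_{\text{out}})
```

Only $A$ and $B$ receive gradients while $W_0$ stays frozen, so each adapted layer adds $r\,(d_{\text{in}} + d_{\text{out}})$ trainable parameters instead of $d_{\text{in}}\, d_{\text{out}}$.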

### Prerequisites

- Python >= 3.13
- CUDA-compatible GPU (>= 8 GB VRAM)

### Environment Setup

```powershell
# Create the project with uv
uv init unsloth-sft-lora

# Modify .python-version (3.13) and pyproject.toml (requires-python = ">=3.13")
cd .\unsloth-sft-lora\
uv venv

# Activate the virtual environment
.\.venv\Scripts\activate

# Install PyTorch (CUDA 13.0 build)
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130

# Check that CUDA is available
python -c "import torch; print(torch.cuda.is_available())"
# Expected output: True
```


### Load Model / Tokenizer

```python
from unsloth import FastLanguageModel

model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=2048,    # Sets the maximum sequence length the model can handle
    dtype=None,             # Auto-detect. Usually uses float16 or bfloat16 for efficiency
    load_in_4bit=True,      # 4-bit quantization for memory efficiency
)
```

```text
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.

  from .autonotebook import tqdm as notebook_tqdm
W0118 17:36:27.817000 33220 Lib\site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.

🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2026.1.3: Fast Llama patching. Transformers: 4.57.3.
   \\   /|    NVIDIA GeForce RTX 4060 Laptop GPU. Num GPUs = 1. Max memory: 7.996 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
```
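
To get a feel for why `load_in_4bit=True` matters on a GPU with roughly 8 GB of memory, here is a rough back-of-the-envelope sketch. It assumes the total parameter count printed later in the training banner (3,261,377,536) includes the 48,627,712 LoRA parameters, and it counts weight storage only. Quantization constants, layers kept in higher precision, activations, and optimizer state are ignored, so real usage will be higher.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# ASSUMPTION: base-model size = total reported by the trainer minus the LoRA adapters
BASE_PARAMS = 3_261_377_536 - 48_627_712

fp16_gib = weight_memory_gib(BASE_PARAMS, 16)
int4_gib = weight_memory_gib(BASE_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.2f} GiB")
print(f" 4-bit weights: {int4_gib:.2f} GiB")
```

Under these assumptions the 16-bit weights alone would take most of the available memory, while the 4-bit weights need about a quarter of that, leaving room for activations and the adapter's optimizer state.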


### Add LoRA Adapters

```python
# Add LoRA adapters for Parameter-Efficient Fine-Tuning (PEFT)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,               # LoRA rank. Common values: 8, 16, 32, 64; higher ranks add capacity but use more memory
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # Attention and feed-forward projections to adapt
    lora_alpha=64,      # Scales the adapter update. alpha = 2 * r is common (so 16 for r=8, 32 for r=16)
    lora_dropout=0,     # Dropout probability for LoRA layers (typically 0-0.5; 0 disables it)
    bias="none",        # Which biases to train: "none" (most common), "all", or "lora_only"
    use_gradient_checkpointing="unsloth",  # Saves memory by recomputing activations in the backward pass instead of storing them
    random_state=3407,  # Seed for reproducibility
    use_rslora=False,   # Rank-Stabilized LoRA (scales by alpha / sqrt(r)). Set to True to experiment
    loftq_config=None,  # Optional LoftQ (quantization-aware) initialization; None disables it
)
```

```text
Unsloth 2026.1.3 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
```
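
The number of trainable parameters that these settings produce can be checked by hand. The sketch below assumes the layer shapes of the published Llama 3.2 3B configuration (hidden size 3072, key/value projection width 1024, MLP width 8192, 28 decoder layers). Only the 3072-wide `q_proj` and the rank of 32 are visible in this notebook's own output, so the other shapes are assumptions.

```python
# ASSUMPTION: shapes follow the published Llama 3.2 3B configuration
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS, RANK = 3072, 1024, 8192, 28, 32

# (in_features, out_features) of each adapted projection in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

The total agrees with the `Trainable parameters = 48,627,712` line that the trainer prints when training starts.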


### Data Preparation

```python
from datasets import Dataset
import json

# Load the dataset (a JSON list of records with 'instruction' and 'output' fields)
with open('../data/train-data-instr-resp.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

dataset = Dataset.from_list(data)
```
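
The formatting step that follows reads the `instruction` and `output` fields from every record, so it can help to validate the file before building the dataset. Here is a minimal sketch; the two sample records and the `validate_records` helper are hypothetical and only illustrate the expected shape of the JSON.

```python
import json

# Hypothetical records, for illustration only (not the notebook's real data)
SAMPLE_JSON = """
[
  {"instruction": "Summarize the text.", "output": "A short summary."},
  {"instruction": "Translate 'Hallo' to English.", "output": "Hello"}
]
"""

REQUIRED_FIELDS = ("instruction", "output")


def validate_records(records: list) -> int:
    """Raise ValueError if a record lacks a required non-empty string field."""
    for index, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"record {index}: missing or empty '{field}'")
    return len(records)


records = json.loads(SAMPLE_JSON)
print(f"{validate_records(records)} valid records")
```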

```python
# Prepare data for training
def format_data(examples):
    texts = []
    for instruction, output in zip(examples['instruction'], examples['output']):
        text = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{output}"""
        texts.append(text)
    return {"text": texts}

dataset = dataset.map(format_data, batched=True, remove_columns=dataset.column_names)
```

```text
Map: 100%|██████████| 3/3 [00:00<00:00, 691.56 examples/s]
```


```python
# Tokenize dataset
def tokenize_function(examples):
    return tokenizer(
        examples["text"],       # Tokenize the 'text' field from the dataset
        padding="max_length",   # Pad every sequence to max_length
        truncation=True,        # Truncate sequences longer than max_length
        max_length=512,         # Maximum sequence length for tokenization
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True)
```

```text
Map: 100%|██████████| 3/3 [00:00<00:00, 653.52 examples/s]
```
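
To make the effect of `padding="max_length"` and `truncation=True` concrete, the following sketch mimics them on a plain list of token IDs. It is an illustration only, not the tokenizer's implementation: it assumes right-side padding, and the token IDs and `pad_id=0` are made up.

```python
def pad_or_truncate(token_ids: list[int], max_length: int, pad_id: int) -> dict:
    """Cut the sequence to max_length, then right-pad it and build the mask."""
    ids = list(token_ids[:max_length])        # truncation
    padding = max_length - len(ids)           # number of pad tokens to add
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": [1] * len(ids) + [0] * padding,  # 0 marks padding
    }


short = pad_or_truncate([11, 12, 13], max_length=5, pad_id=0)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)
print(short)
print(long)
```

With `max_length=512` and short examples, most positions in each row are padding, which is why the attention mask matters.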


### Train the Model

```python
# Create the trainer
from trl import SFTTrainer, SFTConfig
import torch

trainer = SFTTrainer(
    model=model,                                    # The model to be trained
    tokenizer=tokenizer,                            # The tokenizer for the model
    train_dataset=tokenized_dataset,                # The training dataset
    dataset_text_field="text",                      # Field in dataset containing the text
    max_seq_length=2048,                            # Maximum sequence length for training
    args=SFTConfig(
        output_dir="./unsloth-output",              # Directory to save model checkpoints and logs
        save_steps=100,                             # Save checkpoint every 100 steps
        save_total_limit=2,                         # Keep only the last 2 checkpoints
        # num_train_epochs=3,                       # Alternative to max_steps: number of training epochs
        max_steps=100,                              # Total number of training steps
        per_device_train_batch_size=4,              # Batch size per GPU/TPU core/CPU for training
        gradient_accumulation_steps=4,              # Number of steps to accumulate gradients before updating
        learning_rate=2e-4,                         # Peak learning rate
        logging_steps=10,                           # Log training info every 10 steps
        fp16=not torch.cuda.is_bf16_supported(),    # Use FP16 if BF16 is not supported
        bf16=torch.cuda.is_bf16_supported(),        # Use BF16 if supported
        gradient_checkpointing=True,                # Save memory by recomputing activations in the backward pass
        lr_scheduler_type="linear",                 # Linear learning rate scheduler
        warmup_steps=10,                            # Number of warmup steps for learning rate scheduler
        optim="adamw_8bit",                         # Use 8-bit AdamW optimizer (memory efficient)
        weight_decay=0.01,                          # Weight decay for regularization
        report_to="none",                           # Change to "wandb" if using Weights & Biases
    ),
)
```

```text
🦥 Unsloth: Padding-free auto-enabled, enabling faster training.
```
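
The combination of `lr_scheduler_type="linear"`, `warmup_steps=10`, `learning_rate=2e-4`, and `max_steps=100` describes a learning rate that ramps up linearly for 10 steps and then decays linearly to zero. The sketch below reproduces that shape in plain Python. The `linear_lr` helper is hypothetical and only approximates what the scheduler computes; it is not the library's code.

```python
def linear_lr(step: int, base_lr: float = 2e-4,
              warmup_steps: int = 10, total_steps: int = 100) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 5, 10, 55, 100):
    print(f"step {step:3d}: lr = {linear_lr(step):.2e}")
```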


```python
# Start training
trainer.train()
```

```text
The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 3 | Num Epochs = 100 | Total steps = 100
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 48,627,712 of 3,261,377,536 (1.49% trained)

Unsloth: Will smartly offload gradients to save VRAM!
```
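
The banner's figures (total batch size 16, 100 epochs for 100 steps) follow from the configuration and the dataset size. The sketch below recomputes them. The `training_plan` helper is hypothetical, and its step counting approximates how the trainer derives these numbers rather than reproducing the library's logic exactly.

```python
import math


def training_plan(num_examples: int, per_device_batch: int, grad_accum: int,
                  num_gpus: int, max_steps: int) -> tuple[int, int, int]:
    """Return (nominal total batch size, optimizer steps per epoch, epochs)."""
    total_batch = per_device_batch * grad_accum * num_gpus
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # At least one optimizer step happens per pass over the data
    steps_per_epoch = max(1, batches_per_epoch // grad_accum)
    epochs = math.ceil(max_steps / steps_per_epoch)
    return total_batch, steps_per_epoch, epochs


plan = training_plan(num_examples=3, per_device_batch=4, grad_accum=4,
                     num_gpus=1, max_steps=100)
print(plan)
```

With only 3 examples, the nominal batch size of 16 is larger than the whole dataset, so every optimizer step corresponds to one full pass over the data.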


| Step | Training Loss |
|-----:|--------------:|
| 10 | 8.2265 |
| 20 | 1.666 |
| 30 | 0.5948 |
| 40 | 0.5225 |
| 50 | 0.5059 |
| 60 | 0.4961 |
| 70 | 0.4953 |
| 80 | 0.4949 |
| 90 | 0.4947 |
| 100 | 0.4947 |


```text
TrainOutput(global_step=100, training_loss=1.399135217666626, metrics={'train_runtime': 345.3205, 'train_samples_per_second': 4.633, 'train_steps_per_second': 0.29, 'total_flos': 2642572895846400.0, 'train_loss': 1.399135217666626, 'epoch': 100.0})
```
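
Note that `training_loss` in `TrainOutput` is the average over the whole run, not the final loss. The short check below recomputes it, along with the throughput metrics, from the logged values above. It assumes each logged loss is the mean over the preceding 10 steps and that throughput uses the nominal total batch size of 16.

```python
# Loss values logged every 10 steps (copied from the training log above)
logged_losses = [8.2265, 1.666, 0.5948, 0.5225, 0.5059,
                 0.4961, 0.4953, 0.4949, 0.4947, 0.4947]

steps, total_batch, runtime_s = 100, 16, 345.3205

mean_loss = sum(logged_losses) / len(logged_losses)
samples_per_second = steps * total_batch / runtime_s
steps_per_second = steps / runtime_s

print(f"mean logged loss:   {mean_loss:.4f}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"steps per second:   {steps_per_second:.2f}")
```

The mean is dominated by the first 10 steps; the loss at the end of training is about 0.49.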

### Save Model

```python
model.save_pretrained("llama-3.2-3-finetuned-lora")
tokenizer.save_pretrained("llama-3.2-3-finetuned-lora")
```

```text
('llama-3.2-3-finetuned-lora\\tokenizer_config.json',
 'llama-3.2-3-finetuned-lora\\special_tokens_map.json',
 'llama-3.2-3-finetuned-lora\\chat_template.jinja',
 'llama-3.2-3-finetuned-lora\\tokenizer.json')
```
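
The tuple above lists only the tokenizer files. For a PEFT model, `save_pretrained` normally also writes the adapter weights and an `adapter_config.json` into the same directory. A quick way to confirm what was saved is to read that config back, as in this sketch. The `summarize_adapter_config` helper is hypothetical, and the keys it reads are assumed to follow PEFT's usual LoRA config layout.

```python
import json
from pathlib import Path


def summarize_adapter_config(adapter_dir: str) -> dict:
    """Read adapter_config.json and return the main LoRA settings."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return {
        "base_model": config.get("base_model_name_or_path"),
        "r": config.get("r"),
        "lora_alpha": config.get("lora_alpha"),
        "target_modules": sorted(config.get("target_modules") or []),
    }


adapter_dir = "llama-3.2-3-finetuned-lora"
if (Path(adapter_dir) / "adapter_config.json").exists():
    print(summarize_adapter_config(adapter_dir))
else:
    print(f"No adapter_config.json found in {adapter_dir!r}")
```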

### Test Model

```python
from unsloth import FastLanguageModel

# Load fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./llama-3.2-3-finetuned-lora",  # Path to the fine-tuned model
    max_seq_length=512,                         # Maximum sequence length for the model
    dtype=None,                                 # Auto-detect data type
    load_in_4bit=True,                          # 4-bit quantization for memory efficiency
)

# Switch the model to Unsloth's optimized inference mode
FastLanguageModel.for_inference(model)
```


Output (truncated):

```text
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 3072, padding_idx=128004)
        (layers): ModuleList(
          (0-27): 28 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3072, out_features=3072, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3072, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=3072, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lor
```

```python
# Generate response
inputs = tokenizer(
    "### Instruction:\nWhat are the skills of Marko Boehm?\n\n### Response:",
    return_tensors="pt"
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=128, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

```text
### Instruction:
What are the skills of Marko Boehm?

### Response: The top skills are: IT-Management, Software Development, Artifical Intelligence, Scrum
```
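
Because `generate` returns the prompt together with the completion, the decoded text repeats the instruction. If only the answer is needed, it can be cut out after the `### Response:` marker, as in this sketch. The `extract_response` helper is hypothetical, and the sample string is the decoded output shown above.

```python
def extract_response(decoded: str, marker: str = "### Response:") -> str:
    """Return the text after the first response marker, or the whole text if absent."""
    _, found, tail = decoded.partition(marker)
    return tail.strip() if found else decoded.strip()


decoded = (
    "### Instruction:\nWhat are the skills of Marko Boehm?\n\n"
    "### Response: The top skills are: IT-Management, Software Development, "
    "Artifical Intelligence, Scrum"
)
answer = extract_response(decoded)
print(answer)
```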
