https://unsloth.ai/docs

In [1]:
!pip install -q unsloth


In [2]:
from unsloth import FastLanguageModel
import torch

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [3]:
max_seq_length = 2048 # Choose any! Unsloth also supports RoPE (Rotary Positional Embedding) scaling internally.
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

In [4]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length, # Maximum sequence length (in tokens) the fine-tuned model will handle
    dtype = dtype,
    load_in_4bit = load_in_4bit, # Load the 4-bit quantized model
)

==((====))==  Unsloth 2026.1.2: Fast Llama patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



* model_name: Specifies the name of the pre-trained model to load.
* max_seq_length: Defines the maximum sequence length (in tokens) that the model can process. With max_seq_length = 2048, the model handles sequences of up to 2048 tokens.
* dtype: Specifies the data type for model weights and computations.
  * None: Automatically selects the appropriate data type based on the hardware.
  * torch.float16: Uses 16-bit floating point precision, reducing memory usage and potentially increasing speed on compatible GPUs (for example Tesla T4 or V100).
  * torch.bfloat16: Also 16 bits wide, but with a wider dynamic range than float16; supported on Ampere or newer GPUs such as the NVIDIA A100.
* load_in_4bit: Determines whether to load the model using 4-bit quantization. Useful when memory efficiency is crucial, such as when deploying models on edge devices or during experimentation.
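
The dtype and load_in_4bit choices described above can be illustrated with a small sketch. The helper names `pick_dtype` and `weight_memory_gb` are hypothetical, the compute-capability rule is a simplification and not necessarily how Unsloth's auto-detection works, and the 3.2 billion parameter figure is an approximation. The estimate covers the weights only; it ignores activations, optimizer state, and quantization metadata.

```python
def pick_dtype(compute_capability):
    """Return the 16-bit dtype name suited to a GPU (simplified rule, an assumption)."""
    major, _minor = compute_capability
    # Ampere and newer GPUs (compute capability 8.0+) support bfloat16 natively.
    return "bfloat16" if major >= 8 else "float16"


def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


APPROX_PARAMS = 3.2e9  # assumed, approximate size of a 3B-class model

print(pick_dtype((7, 5)))  # Tesla T4 reports compute capability 7.5
print(pick_dtype((8, 0)))  # A100 reports compute capability 8.0
print(round(weight_memory_gb(APPROX_PARAMS, 16), 2))  # 16-bit weights
print(round(weight_memory_gb(APPROX_PARAMS, 4), 2))   # 4-bit weights
```

Under these assumptions, 4-bit weights need about a quarter of the memory of 16-bit weights, which is why 4-bit loading suits a 15 GB card like the T4.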

Next, we use get_peft_model from Unsloth's FastLanguageModel class to attach LoRA adapters (PEFT layers) to the 4-bit quantized model, which is the QLoRA setup.

In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",], # W matrices
    lora_alpha = 16, # a higher alpha value assigns more weight to the LoRA activations
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2026.1.2 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


* r: The rank of the low-rank matrices in LoRA; higher values can capture more information but increase memory usage.
* target_modules: List of model components (e.g., "q_proj", "k_proj") where LoRA adapters are inserted for fine-tuning.
* lora_alpha: Scaling factor for the LoRA updates; controls the impact of the adapters on the model's outputs.
* lora_dropout: Dropout rate applied to LoRA layers during training to prevent overfitting.
* bias: Specifies how biases are handled in LoRA layers; options include "none", "all", or "lora_only".
* use_gradient_checkpointing: Enables gradient checkpointing to reduce memory usage during training; "unsloth" uses Unsloth's optimized version.
* random_state: Seed for random number generators to ensure reproducibility of training results.
* use_rslora: Boolean indicating whether to use Rank-Stabilized LoRA (rsLoRA) for potentially more stable training.
* loftq_config: Configuration for LoftQ (LoRA-Fine-Tuning-aware Quantization) initialization; set to None to disable this feature.
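
To make r and lora_alpha concrete, here is a minimal sketch that counts the adapter parameters and computes the LoRA scaling factor. The layer shapes (hidden size 3072, MLP size 8192, key/value width 1024) are assumptions about Llama-3.2-3B that this notebook does not state, and the function names are hypothetical.

```python
import math

# Assumed layer shapes for Llama-3.2-3B (not stated in this notebook).
HIDDEN, MLP, KV = 3072, 8192, 1024
NUM_LAYERS = 28  # matches "patched 28 layers" in the output above

# (in_features, out_features) for each targeted projection.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(r, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Each adapter adds A (r x in) and B (out x r): r * (in + out) parameters."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


def lora_scale(lora_alpha, r, use_rslora=False):
    """Scaling of the adapter output: alpha / r, or alpha / sqrt(r) for rsLoRA."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


print(lora_param_count(r=16))
print(lora_scale(16, 16))
print(lora_scale(16, 16, use_rslora=True))
```

Under these assumed shapes, r = 16 gives 24,313,856 adapter parameters, the same figure the training log reports as trainable. Doubling r doubles that count, while lora_alpha only changes the scaling.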

In [7]:
# Dataset used to fine-tune the model
from datasets import load_dataset
dataset = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split = "train")


In [9]:
print(dataset[:2])

{'id': ['id_0', 'id_1'], 'reannotated_assistant_content': ['<think>\nFirst, I need to determine the total number of children on the playground by adding the number of boys and girls.\n\nThere are 27 boys and 35 girls.\n\nAdding these together: 27 boys + 35 girls = 62 children.\n\nTherefore, the total number of children on the playground is 62.\n</think>\n\nTo find the total number of children on the playground, we simply add the number of boys and girls together.\n\n\\[\n\\text{Total children} = \\text{Number of boys} + \\text{Number of girls}\n\\]\n\nPlugging in the given values:\n\n\\[\n\\text{Total children} = 27 \\text{ boys} + 35 \\text{ girls} = 62 \\text{ children}\n\\]\n\n**Final Answer:**\n\n\\[\n\\boxed{62}\n\\]', '<think>\nFirst, I need to determine the cost per dozen oranges. John bought three dozen oranges for \\$28.80, so I can find the cost per dozen by dividing the total cost by the number of dozens.\n\nNext, with the cost per dozen known, I can calculate the cost for f

Now we create the prompt template that will be used to fine-tune our Llama model.

In [10]:
r1_prompt = """You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer.
<problem>
{}
</problem>

{}
{}
"""
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
  problems = examples["problem"]
  thoughts = examples["reannotated_assistant_content"]
  solutions = examples["solution"]
  texts = []

  for problem, thought, solution in zip(problems, thoughts, solutions):
    text = r1_prompt.format(problem, thought, solution)+EOS_TOKEN
    texts.append(text)

  return {"text": texts}

# map applies formatting_prompts_func to the dataset in batches, reading the
# problem, thought, and solution columns and adding a "text" column.
dataset = dataset.map(formatting_prompts_func, batched = True,)
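
To show what the formatting step produces, the sketch below runs the same logic on a single made-up record. The shortened template, the `<eos>` placeholder, and the toy batch are all illustrative assumptions; the real run uses r1_prompt and tokenizer.eos_token.

```python
# Same placeholder layout as r1_prompt, shortened for illustration.
demo_prompt = """<problem>
{}
</problem>

{}
{}
"""
DEMO_EOS = "<eos>"  # hypothetical stand-in for tokenizer.eos_token


def format_batch(examples):
    """Build one training string per record from a column-oriented batch."""
    texts = []
    for problem, thought, solution in zip(
        examples["problem"],
        examples["reannotated_assistant_content"],
        examples["solution"],
    ):
        texts.append(demo_prompt.format(problem, thought, solution) + DEMO_EOS)
    return {"text": texts}


# With batched = True, map passes a dict of lists, one list per column.
toy_batch = {
    "problem": ["What is 2 + 3?"],
    "reannotated_assistant_content": ["<think>\n2 plus 3 is 5.\n</think>"],
    "solution": ["5"],
}
print(format_batch(toy_batch)["text"][0])
```

Appending the end-of-sequence token to every example is what teaches the model to stop generating after the solution.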

**Trainer Setup:**

* model and tokenizer: These are the model and tokenizer objects that will be trained.

* train_dataset: The dataset used for training.

* dataset_text_field: Specifies the field in the dataset that contains the text data.

* max_seq_length: Maximum sequence length for the input data.

* dataset_num_proc: Number of processes to use for preprocessing (tokenizing) the dataset.

* packing: If True, enables sequence packing (concatenates multiple examples into a single sequence to better utilize tokens).

**Training Arguments:**

* per_device_train_batch_size: Number of samples per batch for each device.

* gradient_accumulation_steps: Number of steps to accumulate gradients before updating model weights.

* warmup_steps: Number of steps for learning rate warmup.

* max_steps: Total number of training steps.

* learning_rate: Learning rate for the optimizer.

* fp16 and bf16: Specifies whether to use 16-bit floating point precision or bfloat16, depending on hardware support.

* logging_steps: Frequency of logging training progress.

* optim: Optimizer type, here using an 8-bit version of AdamW.

* weight_decay: Regularization parameter for weight decay.

* lr_scheduler_type: Type of learning rate scheduler.

* seed: Random seed for reproducibility.

* output_dir: Directory where the training outputs will be saved.

* report_to: Integration for observability tools like "wandb", "tensorboard", etc.
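
The interaction between batch size, gradient accumulation, warmup, and the linear scheduler can be sketched in a few lines. The function names are hypothetical and the schedule is a plain-Python approximation of a linear warmup and decay, not the trainer's own implementation; the default values mirror the ones used in this notebook.

```python
def effective_batch_size(per_device, accumulation_steps, num_devices=1):
    """Number of examples that contribute to each optimizer step."""
    return per_device * accumulation_steps * num_devices


def linear_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=60):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at max_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, max_steps - step)
    return base_lr * remaining / (max_steps - warmup_steps)


batch = effective_batch_size(per_device=2, accumulation_steps=4)
print(batch)       # examples per optimizer step
print(batch * 60)  # examples seen over max_steps = 60
for step in (0, 5, 30, 60):
    print(step, linear_lr(step))
```

With these settings the run touches only a few hundred of the 171,647 examples, so it is a quick demonstration, not a full epoch.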

In [11]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2, # Number of processes to use for preprocessing the dataset
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2, # The batch size per GPU/TPU core
        gradient_accumulation_steps = 4, # Batches to accumulate gradients over before each optimizer step
        warmup_steps = 5, # A few updates with a low learning rate before actual training
        max_steps = 60, # Total number of training steps (optimizer updates) to run
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit", # Optimizer
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Set to "wandb", "tensorboard", etc. to enable experiment tracking
    ),
)


In [12]:
trainer_stats = trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 171,647 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856 of 3,237,063,680 (0.75% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.01
2,0.935
3,1.034
4,0.944
5,0.7859
6,0.8536
7,0.7564
8,0.7416
9,0.7865
10,0.7368


Test the fine-tuned model with a sample question

In [13]:
from unsloth.chat_templates import get_chat_template
sys_prompt = """You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer.
<problem>
{}
</problem>
"""
message = sys_prompt.format("How many 'r's are present in 'strawberry'?")
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": message},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

# Generate the model's response
outputs = model.generate(input_ids = inputs, max_new_tokens = 1024, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
response = tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
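
The warning above arises because padding cannot be told apart from a genuine end-of-sequence token when both share one id. The toy sketch below illustrates the ambiguity with made-up token ids and hypothetical helper names; it does not call the tokenizer or model used in this notebook.

```python
def mask_from_pad_id(token_ids, pad_id):
    """Infer the mask by treating every pad_id token as padding."""
    return [0 if tok == pad_id else 1 for tok in token_ids]


def mask_from_length(token_ids, real_length):
    """Build the mask from the known number of real tokens."""
    return [1 if i < real_length else 0 for i in range(len(token_ids))]


EOS_ID = PAD_ID = 2  # hypothetical ids; pad and eos share one value

# Three content tokens, a real EOS, then two padding tokens.
sequence = [5, 6, 7, EOS_ID, PAD_ID, PAD_ID]

inferred = mask_from_pad_id(sequence, PAD_ID)
explicit = mask_from_length(sequence, real_length=4)

print(inferred)  # the real EOS is wrongly treated as padding
print(explicit)  # the real EOS stays visible to the model
```

Passing an explicit attention_mask to generate, as the warning suggests, removes the need for this kind of guess.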


In [14]:
print(response[0])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer.
<problem>
How many 'r's are present in'strawberry'?
</problem>
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Alright, let me figure out how many 'r's are in the word'strawberry'. Okay, so I remember that 'r' is the seventh letter of the alphabet, but is that correct? Let me just check again to make sure. In the standard English alphabet, 'r' is indeed the seventh letter, so that's probably correct.

Okay, now, looking at'strawberry'. It's a seven-letter word. I need to count the number of 'r's in it. Let's break it down:

- S: That's's'. No 'r's here.
- T: That's 't'. Stil