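To run this, press "*Runtime*" and then "*Run all*" on a **free** Tesla T4 Google Colab instance! Note that the logs below come from an NVIDIA H100 PCIe (79 GB), so the settings shown may need to be reduced on smaller GPUs.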
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our documentation [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

**Read our [blog post](https://unsloth.ai/blog/r1-think) for guidance on how to train think models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
%load_ext autoreload
%autoreload 2
# Skip restarting message in Colab
import sys; modules = list(sys.modules.keys())
for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None

import os
os.environ["WANDB_PROJECT"] = "r1-arc"

### Unsloth

Load up `DeepSeek-R1-Distill-Qwen-7B` and set the parameters.

In [None]:
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
max_seq_length = 30000 # Can increase for longer think traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Qwen-7B",
    max_seq_length = max_seq_length,
    load_in_4bit = False, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    # Fraction of GPU memory allocated to vLLM + model. Not a hard cap; it is easy to OOM near the boundary.
    gpu_memory_utilization = 0.6 # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-14 08:59:05 __init__.py:207] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.10: Fast Qwen2 patching. Transformers: 4.49.0. vLLM: 0.7.3.
   \\   /|    NVIDIA H100 PCIe. Num GPUs = 1. Max memory: 79.097 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 9.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/DeepSeek-R1-Distill-Qwen-7B with actual GPU utilization = 59.59%
Unsloth: Your GPU has CUDA compute capability 9.0 with VRAM = 79.1 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 30000. Num Sequences = 320.
Unsloth: vLLM's KV Cache can use up to 32.65 G



INFO 03-14 08:59:15 weight_utils.py:254] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


INFO 03-14 08:59:18 model_runner.py:1115] Loading model weights took 14.3854 GB
INFO 03-14 08:59:18 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 03-14 08:59:21 worker.py:267] Memory profiling takes 2.80 seconds
INFO 03-14 08:59:21 worker.py:267] the current vLLM instance can use total_gpu_memory (79.10GiB) x gpu_memory_utilization (0.60) = 47.14GiB
INFO 03-14 08:59:21 worker.py:267] model weights take 14.39GiB; non_torch_memory takes 0.14GiB; PyTorch activation peak memory takes 4.03GiB; the rest of the memory reserved for KV Cache is 28.58GiB.
INFO 03-14 08:59:21 executor_base.py:111] # cuda blocks: 33450, # CPU blocks: 7021
INFO 03-14 08:59:21 executor_base.py:116] Maximum concurrency for 30000 tokens per request: 17.84x
INFO 03-14 08:59:25 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory err

Capturing CUDA graph shapes: 100%|██████████| 43/43 [00:20<00:00,  2.10it/s]

INFO 03-14 08:59:45 model_runner.py:1562] Graph capturing finished in 20 secs, took 0.52 GiB
INFO 03-14 08:59:45 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 27.11 seconds
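
The memory figures in the vLLM log above can be checked by hand. The sketch below uses only numbers reported in the log, except for the KV cache block size of 16 tokens, which is an assumption (it is not printed in the log). The 47.14 GiB budget corresponds to the "actual GPU utilization" of 59.59% applied to 79.10 GiB.

```python
# Numbers copied from the vLLM log above (GiB).
budget_gib = 47.14           # memory vLLM may use (about 59.59% of 79.10 GiB)
weights_gib = 14.39          # model weights
non_torch_gib = 0.14         # non-PyTorch memory
activation_peak_gib = 4.03   # PyTorch activation peak

# Whatever is left of the budget is reserved for the KV cache.
kv_cache_gib = budget_gib - weights_gib - non_torch_gib - activation_peak_gib
print(f"KV cache: {kv_cache_gib:.2f} GiB")

# Assumption: each KV cache block holds 16 tokens.
tokens_per_block = 16
gpu_blocks = 33450           # "# cuda blocks" in the log
max_seq_length = 30000

cache_tokens = gpu_blocks * tokens_per_block
concurrency = cache_tokens / max_seq_length
print(f"Max concurrency: {concurrency:.2f}x")
```

Under that assumption the two results agree with the logged 28.58 GiB and 17.84x. Lowering `gpu_memory_utilization` shrinks the budget, and the KV cache absorbs the whole reduction because the other three terms stay roughly fixed.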



Unsloth 2025.3.10 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
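
The trainable parameter count that the trainer reports later (161,480,704) can be reproduced from the LoRA settings. This is a rough sketch: the layer dimensions below are assumptions based on a Qwen2.5-7B-style architecture and are not printed anywhere in this notebook.

```python
# LoRA adds two matrices per adapted weight: A (r x d_in) and B (d_out x r),
# so each adapted module contributes r * (d_in + d_out) trainable parameters.
lora_rank = 64
num_layers = 28  # "patched 28 layers" in the log above

# Assumed dimensions for a Qwen2.5-7B-style decoder layer (not from the notebook).
hidden, kv_width, intermediate = 3584, 512, 18944
module_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_width),
    "v_proj": (hidden, kv_width),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(lora_rank * (d_in + d_out) for d_in, d_out in module_shapes.values())
total = per_layer * num_layers
share = 100 * total / 7_777_097_216  # total parameter count reported by the trainer
print(f"{total:,} trainable parameters ({share:.2f}%)")
```

The MLP projections account for about three quarters of the per-layer total, so dropping them from `target_modules` would save more adapter parameters than dropping the attention projections.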


### Data Prep
<a name="Data"></a>

We directly leverage [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) for data prep and all reward functions. You are free to create your own!
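
The reward functions used here live in `training.REWARD_FNS` and are not shown in this notebook. As a rough illustration of the shape such a function takes, here is a minimal sketch using the `(prompts, completions, **kwargs)` signature that TRL's `GRPOTrainer` accepts for custom rewards. The names `extract_python_block` and `has_code_reward` and the 1.0 / 0.0 scores are hypothetical and are not the author's implementation.

```python
import re

FENCE = "`" * 3  # built this way to avoid a literal fence inside this example
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

def extract_python_block(text):
    """Return the last fenced Python block in text, or None if there is none."""
    blocks = CODE_BLOCK.findall(text)
    return blocks[-1].strip() if blocks else None

def has_code_reward(prompts, completions, **kwargs):
    """Hypothetical reward: 1.0 if the reply has a Python block defining solve, else 0.0."""
    rewards = []
    for completion in completions:
        # Conversational completions are lists of messages; score the last one.
        code = extract_python_block(completion[-1]["content"])
        rewards.append(1.0 if code and "def solve" in code else 0.0)
    return rewards

good = [{"role": "assistant", "content": "Plan.\n" + FENCE + "python\ndef solve(I):\n    return I\n" + FENCE}]
bad = [{"role": "assistant", "content": "No code here."}]
print(has_code_reward([None, None], [good, bad]))
```

A function like this returns one float per completion, in order, so several of them can be passed together as a list to `reward_funcs`.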

In [3]:
# Access my modules
import sys, importlib
sys.path.append('../..')
from src import training
training = importlib.reload(training) # HMR on code cell exec

dataset = training.load_dataset("photonmz/arc_plain")

../../src/training/hf_dataset.py:71 load
    data0["train"][0]: {
        'id': '007bbfb7',
        'train': [
            [
                [
                    [0, 7, 7],
                    [7, 7, 7],
                    [0, 7, 7],
                ],
                [
                    [
                        0,
                        0,
                        0,
                        0,
                        7,
                        7,
                        0,
                        7,
                        7,
                    ],
                    [
                        0,
                        0,
                        0,
                        7,
                        7,
                        7,
                        7,
                        7,
                        7,
                    ],
                    [
                        0,
                        0,
                        0,
                        0,
                     
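
To make the task format concrete: each entry in `'train'` pairs an input grid with an output grid, and the model is asked to write a `solve` function that maps one to the other. The sketch below is a hypothetical solver for the pair shown above. The rule it encodes (place a copy of the input wherever the input has a non-zero cell) is an assumption inferred from the rows visible in the truncated output, not something stated in the dataset.

```python
def solve(I):
    """Hypothetical solver: tile the grid with copies of itself at non-zero cells."""
    rows, cols = len(I), len(I[0])
    out = [[0] * (cols * cols) for _ in range(rows * rows)]
    for block_r in range(rows):
        for block_c in range(cols):
            if I[block_r][block_c] == 0:
                continue  # this block stays empty
            for r in range(rows):
                for c in range(cols):
                    out[block_r * rows + r][block_c * cols + c] = I[r][c]
    return out

grid = [[0, 7, 7], [7, 7, 7], [0, 7, 7]]
result = solve(grid)
for row in result[:2]:
    print(row)
```

The first two rows produced this way agree with the two complete output rows shown above. A correctness reward could run a candidate `solve` on each training input and compare the result to the expected grid.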

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [5]:
from trl import GRPOConfig, GRPOTrainer
import time

N_TRAJECTORIES = 4
max_prompt_length = 14000
max_completion_length = 15000

assert max_prompt_length + max_completion_length < max_seq_length

training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = N_TRAJECTORIES * 1,
    # I think increasing grad_accum_steps multiplies KV Cache size (not offloaded well)
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = N_TRAJECTORIES, # Decrease if out of memory
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,
    num_train_epochs = 50, # Overridden by max_steps below when max_steps > 0
    max_steps = 2500,
    save_steps = 100,
    # num_iterations = 10,
    max_grad_norm = 0.1,
    report_to = "wandb", # Can use Weights & Biases
    output_dir = "outputs",
    run_name = time.strftime("%Y%m%d_%H%M"),
    log_completions = True
)
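
The step and epoch counts the trainer prints when training starts follow from these settings. The sketch below works through the arithmetic. It assumes the batch size counts completions rather than prompts, which may differ between TRL versions, and takes the dataset size of 400 from the trainer banner.

```python
import math

num_examples = 400               # "Num examples" reported by the trainer
per_device_train_batch_size = 4  # N_TRAJECTORIES * 1
gradient_accumulation_steps = 1
num_generations = 4
max_steps = 2500

# Assumption: the batch counts completions, so each optimizer step covers
# batch / num_generations distinct prompts.
completions_per_step = per_device_train_batch_size * gradient_accumulation_steps
prompts_per_step = completions_per_step // num_generations
steps_per_epoch = num_examples // prompts_per_step
epochs_needed = math.ceil(max_steps / steps_per_epoch)
print(f"{prompts_per_step} prompt/step, {steps_per_epoch} steps/epoch, {epochs_needed} epochs")

# Token budget check, mirroring the assert in the cell above.
max_prompt_length, max_completion_length, max_seq_length = 14000, 15000, 30000
headroom = max_seq_length - (max_prompt_length + max_completion_length)
print(f"Headroom: {headroom} tokens")
```

With one prompt per step, every gradient update is based on a single puzzle, which is part of why raising `gradient_accumulation_steps` is suggested for smoother training.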

And let's run the trainer! During training you'll see a table of rewards like the example below. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |


In [None]:
import src.training.__init__  # %aimport

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = training.REWARD_FNS,
    args = training_args,
    train_dataset = dataset['train'],
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 400 | Num Epochs = 7 | Total steps = 2,500
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4
 "-____-"     Trainable parameters = 161,480,704/7,777,097,216 (2.08% trained)
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mphotonmz[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


../../src/training/env.py:94 extract_python
    'No code block found.' (str) len=20
../../src/training/env.py:137 guardrail.<locals>.wrapped
    'No Python' (str) len=9
    c[-1]['content'][-100:]: 'd cell replaces a corresponding edge cell of the purple rectangle based on their relative positions.' (str) len=100
../../src/training/env.py:141 guardrail.<locals>.wrapped
    id: [
        '29c11459',
        '29c11459',
        '29c11459',
        '29c11459',
    ] (list) len=4
    codestring: (
        'from typing import List, Tuple, Any, Container, FrozenSet, Iterable, Optional\n'
        '\n'
        'Integer = int\n'
        'Grid = Tuple[Tuple[Integer, Tuple[Optional[Integer, Integer]]], FrozenSet[IntegerTuple]]\n'
        '\n'
        'def solve(I: Grid) -> Grid:\n'
        '    # Convert grid into integer tuples for easier manipulation\n'
        '    grid = tuple(tuple(int(cell) for cell in row) for row in I)\n'
        '    \n'
        '    # Identify all cells in the grid\n'
 

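| Step | Training Loss | reward    | reward_std | completion_length | kl       | rewards / wrapped | rewards / wrapped.1 |
|------|---------------|-----------|------------|-------------------|----------|-------------------|---------------------|
| 1    | 0.0           | 0.483013  | 0.834025   | 1099.0            | 0.0      | 0.241506          | 0.241506            |
| 2    | 0.0           | 0.0       | 0.0        | 3094.0            | 0.0      | 0.0               | 0.0                 |
| 3    | 0.0           | -0.171447 | 1.498046   | 1577.5            | 0.001121 | -0.085723         | -0.085723           |
| 4    | 0.0           | 0.0       | 0.0        | 2435.75           | 0.000641 | 0.0               | 0.0                 |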


../../src/training/env.py:94 extract_python
    'No code block found.' (str) len=20
../../src/training/env.py:137 guardrail.<locals>.wrapped
    'No Python' (str) len=9
    c[-1]['content'][-100:]: 'ut additional context, but the code is structured to handle grid operations efficiently using NumPy.' (str) len=100
../../src/training/env.py:141 guardrail.<locals>.wrapped
    id: [
        '6aa20dc0',
        '6aa20dc0',
        '6aa20dc0',
        '6aa20dc0',
    ] (list) len=4
    codestring: (
        'import sys\n'
        'from collections import deque\n'
        '\n'
        'def solve_puzzle(input_grid):\n'
        '    """\n'
        '    Solve the puzzle based on the given rules.\n'
        '    """\n'
        '    # Convert the input grid to a list of lists\n'
        '    grid = [list(row) for row in input_grid]\n'
        '    height = len(grid)\n'
        '    if height == 0:\n'
        '        return []\n'
        '    width = len(grid[0])\n'
        '    if width == 0:\n'


KeyboardInterrupt: 

<a name="Inference"></a>
### Inference

First, let's try the model without the GRPO-trained LoRA:

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Now let's try the LoRA we just trained with GRPO. First, we save the LoRA:

In [None]:
model.save_lora("grpo_saved_lora")

Now we load the LoRA and test:

In [None]:
text = tokenizer.apply_chat_template([
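    # NOTE: SYSTEM_PROMPT is not defined in this notebook; set it to the system prompt used for training before running.
    {"role" : "system", "content" : SYSTEM_PROMPT},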
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Our reasoning model is much better. It's not always correct, since we only trained it for an hour or so, and it should improve if we extend the sequence length and train for longer!

<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All methods such as `q4_k_m` are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)
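
As a rough sketch of the llama.cpp route, the helper below builds a command line for llama.cpp's `llama-cli` binary without executing it. The `model/` directory prefix, the helper name `llama_cli_command`, and the default token limit are assumptions for illustration; the binary's name and location depend on how llama.cpp was built.

```python
import shlex

def llama_cli_command(gguf_path, prompt, max_new_tokens=256):
    """Build an argument list for llama.cpp's llama-cli binary (not executed here)."""
    return ["llama-cli", "-m", gguf_path, "-p", prompt, "-n", str(max_new_tokens)]

# Assumed path: adjust to wherever the GGUF file was actually written.
cmd = llama_cli_command("model/model-unsloth-Q4_K_M.gguf", "How many r's are in strawberry?")
print(shlex.join(cmd))
```

To run it, pass the list to `subprocess.run(cmd, check=True)` once `llama-cli` is on your `PATH`.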

And we're done! If you have any questions about Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our Discord!

Some other links:
1. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
