### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [2]:
%%capture
# Install latest transformers for Gemma 3N
!pip install --no-deps git+https://github.com/huggingface/transformers.git # Only for Gemma 3N
!pip install --no-deps --upgrade timm # Only for Gemma 3N

### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

In [3]:
from unsloth import FastModel
import torch

fourbit_models = [
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3n-E4B-it-unsloth-bnb-4bit",
    "unsloth/gemma-3n-E2B-it-unsloth-bnb-4bit",
    # Pretrained models
    "unsloth/gemma-3n-E4B-unsloth-bnb-4bit",
    "unsloth/gemma-3n-E2B-unsloth-bnb-4bit",

    # Other Gemma 3 quants
    "unsloth/gemma-3-1b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3n-E2B-it",
    dtype = None, # None for auto detection by unsloth
    max_seq_length = 2222, # Choose any for long context!
    load_in_4bit = True,  # 4 bit dynamic quantization for superior accuracy and lower memory use
    full_finetuning = False,
    # token = "hf_...", # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.7.3: Fast Gemma3N patching. Transformers: 4.54.0.dev0.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Gemma3N does not support SDPA - switching to eager!


In [4]:
# import gc
# torch.cuda.empty_cache()
# gc.collect()
# print(f"Allocated: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB")
# print(f"Cached: {torch.cuda.memory_reserved() / 1024 ** 2:.2f} MB")

In [5]:
print(f"Allocated: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB")
print(f"Cached: {torch.cuda.memory_reserved() / 1024 ** 2:.2f} MB")

# Check model quantization status
print(f"4-bit loaded: {model.is_loaded_in_4bit}")
print(f"Quantized: {model.is_quantized}")
print(f"Method: {model.quantization_method}")

Allocated: 7770.68 MB
Cached: 7790.00 MB
4-bit loaded: True
Quantized: True
Method: QuantizationMethod.BITS_AND_BYTES


# Gemma 3N can process Text, Vision and Audio!

Let's first experience how Gemma 3N can handle multimodal inputs. We use Gemma 3N's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
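
To make the three sampling settings concrete, here is a minimal pure-Python sketch of how temperature, top-k and top-p are commonly applied, in that order, to a vector of logits. The function name `filter_top_k_top_p` and the toy logits are illustrative assumptions; the real generation code works on tensors and may differ in detail.

```python
import math

def filter_top_k_top_p(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # 1. Temperature rescales the logits (1.0 leaves them unchanged).
    scaled = [x / temperature for x in logits]
    # 2. Top-k keeps only the k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # 3. Softmax over the survivors (subtract the max for numerical stability).
    peak = max(scaled[i] for i in ranked)
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # 4. Top-p keeps the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, prob in zip(ranked, probs):
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # 5. Renormalize so the remaining probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return {index: prob / norm for index, prob in kept}

# Toy example with four tokens (hypothetical logits).
result = filter_top_k_top_p([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.9)
print(result)
```

The next token is then sampled from the returned distribution, so lower `top_k` or `top_p` values restrict generation to fewer candidate tokens.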

In [6]:
from transformers import TextStreamer
# Helper function for inference
def do_gemma_3n_inference(messages, max_new_tokens = 128):
    _ = model.generate(
        **tokenizer.apply_chat_template(
            messages,
            add_generation_prompt = True, # Must add for generation
            tokenize = True,
            return_dict = True,
            return_tensors = "pt",
        ).to("cuda"),
        max_new_tokens = max_new_tokens,
        temperature = 1.0, top_p = 0.95, top_k = 64,
        streamer = TextStreamer(tokenizer, skip_prompt = True),
    )

We now add LoRA adapters so we only need to update a small number of parameters!

In [7]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules       = True,  # Should leave on always!
    r = 16,           # LoRA rank. Larger can improve accuracy, but with diminishing returns; see the [LoRA](https://arxiv.org/abs/2106.09685) paper
    lora_alpha = 32,  # Recommended alpha >= r
    lora_dropout = 0, # 0 disables dropout. If you need regularization, 0.05-0.2 is a common range for LoRA fine-tuning.
    bias = "none", # Keep bias terms frozen. Biases are simple additive constants that shift neuron outputs, so they're less critical for task adaptation:
    # What bias does: If a neuron computes Wx + b, the bias b just shifts the entire output up/down by a constant.
    # Why freezing works: The main "intelligence" comes from the weight matrix W learning new patterns. The bias shifts are usually already well-calibrated from pretraining.
    random_state = 3407,
    use_rslora = True,
    use_gradient_checkpointing = "unsloth",
    loftq_config = {}
)
model.print_trainable_parameters()

Unsloth: Making `model.base_model.model.model.language_model` require gradients
trainable params: 21,135,360 || all params: 5,460,573,632 || trainable%: 0.3871
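
The configuration and the printed parameter counts can be checked by hand. The short sketch below assumes the usual definitions: standard LoRA scales the adapter update by `lora_alpha / r`, while rank-stabilized LoRA (`use_rslora = True`) scales it by `lora_alpha / sqrt(r)`. The variable names are illustrative only.

```python
import math

r, lora_alpha = 16, 32

# Scaling applied to the low-rank update under each scheme.
standard_scale = lora_alpha / r            # classic LoRA
rslora_scale = lora_alpha / math.sqrt(r)   # rank-stabilized LoRA

# Parameter counts copied from the output above.
trainable, total = 21_135_360, 5_460_573_632
trainable_pct = 100 * trainable / total

print(f"standard scale = {standard_scale}, rsLoRA scale = {rslora_scale}")
print(f"trainable% = {trainable_pct:.4f}")
```

With `r = 16` and `lora_alpha = 32`, the rsLoRA scale is four times the standard one, which is worth keeping in mind when comparing learning rates against non-rsLoRA recipes.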


<a name="Data"></a>
### Data Prep
We now use the `Gemma-3` format for conversation style finetunes.

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```
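
As a rough illustration of that layout, the sketch below renders a list of messages into the same turn structure with plain string formatting. `render_gemma3_turns` is a hypothetical helper written for this example; the tokenizer's real chat template handles more cases (role mapping, multimodal content, whitespace trimming) and should be used for actual training.

```python
def render_gemma3_turns(messages, bos="<bos>"):
    """Render messages as Gemma-3 style turns (simplified stand-in for the real template)."""
    parts = [bos]
    for message in messages:
        # Each turn is wrapped in <start_of_turn>ROLE ... <end_of_turn>.
        parts.append(
            f"<start_of_turn>{message['role']}\n{message['content']}<end_of_turn>\n"
        )
    return "".join(parts)

example = render_gemma3_turns([
    {"role": "user", "content": "Hello!"},
    {"role": "model", "content": "Hey there!"},
])
print(example)
```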

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.

In [8]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

In [9]:
from datasets import load_dataset

train_ds = load_dataset("muzzz/coedit-cot-reasoning", split="train")
val_ds = load_dataset("muzzz/coedit-cot-reasoning", split="validation")

README.md: 0.00B [00:00, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/82.6M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/3.40M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/69071 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1712 [00:00<?, ? examples/s]

In [10]:
print("Train Dataset Info:")
print(train_ds)
print("features: ", train_ds.features)
print("example: ", train_ds[0])
print("--------------------------------")
print("\nValidation Dataset Info:")
print(val_ds)
print("features: ", val_ds.features)
print("example: ", val_ds[0])

Train Dataset Info:
Dataset({
    features: ['_id', 'task', 'src', 'tgt', 'reasoning'],
    num_rows: 69071
})
features:  {'_id': Value(dtype='string', id=None), 'task': Value(dtype='string', id=None), 'src': Value(dtype='string', id=None), 'tgt': Value(dtype='string', id=None), 'reasoning': Value(dtype='string', id=None)}
example:  {'_id': '1', 'task': 'gec', 'src': 'Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.', 'tgt': 'For example, countries with a lot of deserts can transform their desert to increase their habitable land and use irrigation to provide clean water to the desert.', 'reasoning': '<think>\n1.  **Instruction Analysis:** The goal is to "Remove all grammatical errors" from the provided text. I need to read the text carefully and identify any points that violate standard English grammar rules.\n2.  **Source Text

In [11]:
# Check how many rows have empty reasoning column
empty_reasoning_count_train = sum(1 for example in train_ds if not example["reasoning"] or example["reasoning"].strip() == "")
empty_reasoning_count_val = sum(1 for example in val_ds if not example["reasoning"] or example["reasoning"].strip() == "")

print(f"Train dataset - Empty reasoning rows: {empty_reasoning_count_train} out of {len(train_ds)} ({empty_reasoning_count_train/len(train_ds)*100:.2f}%)")
print(f"Validation dataset - Empty reasoning rows: {empty_reasoning_count_val} out of {len(val_ds)} ({empty_reasoning_count_val/len(val_ds)*100:.2f}%)")

print("\nExamples with empty reasoning from train dataset:")
empty_examples_train = [example for example in train_ds if not example["reasoning"] or example["reasoning"].strip() == ""]
for i, example in enumerate(empty_examples_train[:3]):
    print(f"Example {i+1}:")
    print(f"  src: {example['src'][:100]}...")
    print(f"  reasoning: '{example['reasoning']}'")
    print(f"  tgt: {example['tgt'][:100]}...")


Train dataset - Empty reasoning rows: 12335 out of 69071 (17.86%)
Validation dataset - Empty reasoning rows: 0 out of 1712 (0.00%)

Examples with empty reasoning from train dataset:
Example 1:
  src: Remove unsourced opinions: they were at war with mecca, and saw no wrong in raiding meccan caravans....
  reasoning: 'None'
  tgt: they considered themselves to be at war with mecca, and saw no wrong in raiding meccan caravans....
Example 2:
  src: Make this sentence more neutral: crowd of curious people day after the tragedy....
  reasoning: 'None'
  tgt: crowd of curious people day after the fire....
Example 3:
  src: Neutralize the text: "the city that fun forgot", often used sarcastically by residents of ottawa...
  reasoning: 'None'
  tgt: "the city that fun forgot", used sarcastically by residents of ottawa...


### The dataset used is available [here](https://huggingface.co/datasets/muzzz/coedit-cot-reasoning)

In [12]:
def format_dataset_with_template(example, tokenizer):

    src_txt = example["src"]
    tgt_txt = (example["reasoning"] or "") + "\n" + example["tgt"]

    # 1. Prepare the conversation history in the required format
    messages = [
        {"role": "user", "content": src_txt},
        {"role": "model", "content": tgt_txt},
    ]

    # 2. Apply the chat template
    try:
        formatted_text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False
        )
    except Exception as e:
        print(f"Error applying chat template to example: {example}")
        print(f"Error: {e}")
        formatted_text = "" # Fall back to an empty string on error to avoid crashing map

    # Strip the leading <bos> (same logic as remove_special_tokens in the Unsloth code);
    # the tokenizer adds <bos> again during tokenization, so keeping it would duplicate it
    if formatted_text.startswith(tokenizer.bos_token):
        formatted_text = formatted_text[len(tokenizer.bos_token):]

    # 3. Return in the desired dictionary format for .map
    return {"text": formatted_text}
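
About 17.86% of the training rows have no reasoning, and for those rows the expression `(example["reasoning"] or "") + "\n" + example["tgt"]` yields a target that begins with a bare newline. If that is not wanted, one possible variant is sketched below. `build_target` is a hypothetical helper, not part of the notebook, and it also strips surrounding whitespace from the reasoning, which the original expression does not do.

```python
def build_target(reasoning, tgt):
    """Join reasoning and target, skipping the separator when reasoning is missing."""
    reasoning = (reasoning or "").strip()
    # Only prepend the reasoning (and its newline separator) when there is some.
    return f"{reasoning}\n{tgt}" if reasoning else tgt

print(repr(build_target(None, "Fixed sentence.")))
print(repr(build_target("<think>...</think>", "Fixed sentence.")))
```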


In [13]:
processed_train_ds = train_ds.map(
    lambda example: format_dataset_with_template(example, tokenizer),
    batched=False,
)
processed_val_ds = val_ds.map(
    lambda example: format_dataset_with_template(example, tokenizer),
    batched=False,
)

Map:   0%|          | 0/69071 [00:00<?, ? examples/s]

Map:   0%|          | 0/1712 [00:00<?, ? examples/s]

In [14]:
print(processed_train_ds)
print(processed_val_ds)

Dataset({
    features: ['_id', 'task', 'src', 'tgt', 'reasoning', 'text'],
    num_rows: 69071
})
Dataset({
    features: ['_id', 'task', 'src', 'tgt', 'reasoning', 'text'],
    num_rows: 1712
})


In [15]:
# removing original columns
columns_to_remove = list(train_ds.features)
print(f"\nRemoving original columns: {columns_to_remove}")
processed_train_ds = processed_train_ds.remove_columns(columns_to_remove)
processed_val_ds = processed_val_ds.remove_columns(columns_to_remove)


Removing original columns: ['_id', 'task', 'src', 'tgt', 'reasoning']


In [16]:
print("\nProcessed train dataset sample (using tokenizer template):")
print(processed_train_ds)
if len(processed_train_ds) > 0:
    print(processed_train_ds[0]["text"])
else:
    print("Processed training dataset is empty or first example failed.")


print("\nProcessed validation dataset sample (using tokenizer template):")
print(processed_val_ds)
if len(processed_val_ds) > 0:
    print(processed_val_ds[0]['text'])
else:
    print("Processed validation dataset is empty or first example failed.")



Processed train dataset sample (using tokenizer template):
Dataset({
    features: ['text'],
    num_rows: 69071
})
<start_of_turn>user
Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.<end_of_turn>
<start_of_turn>model
<think>
1.  **Instruction Analysis:** The goal is to "Remove all grammatical errors" from the provided text. I need to read the text carefully and identify any points that violate standard English grammar rules.
2.  **Source Text Analysis:** The text is "For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert."
3.  **Sentence Structure Review:** The sentence starts with an introductory phrase "For example". The main subject is "countries with a lot of deserts". The main verb phrase is "can [verb]".
4.  **Id

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). The config below trains for one full epoch (`num_train_epochs=1`); for a quick test run, you can set `max_steps=60` instead.

In [17]:
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=processed_train_ds,
    eval_dataset=processed_val_ds,
    args=SFTConfig(
        output_dir="./checkpoints3072",
        dataset_text_field="text",
        per_device_train_batch_size=32,
        gradient_accumulation_steps=2,
        eval_accumulation_steps=2,
        save_total_limit=12,
        load_best_model_at_end=True,
        greater_is_better=False,
        metric_for_best_model="eval_loss",
        warmup_steps=100,
        num_train_epochs=1,
        learning_rate=2e-5,
        logging_steps=25,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        report_to="none",
        eval_strategy="steps",
        eval_steps=25,
        save_strategy="steps",
        save_steps=25,
    )
)

Unsloth: Tokenizing ["text"] (num_proc=12):   0%|          | 0/69071 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=12):   0%|          | 0/1712 [00:00<?, ? examples/s]

In [18]:
from transformers import EarlyStoppingCallback
early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience = 3,     # How many consecutive evaluations we will wait if the eval loss doesn't improve
                                     # For example, the loss might increase, then decrease again within 3 evaluations
    early_stopping_threshold = 0.0,  # Can set higher - the minimum amount the eval loss must improve by
                                     # to count as an improvement. E.g. with 0.01, a drop from 0.020 to 0.015
                                     # does not count, so it moves the run toward early stopping.
)
trainer.add_callback(early_stopping_callback)
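
To see how patience and threshold interact, the following pure-Python sketch simulates the decision on a list of evaluation losses. It is a simplified model of the callback's behavior, assuming lower loss is better (`greater_is_better=False`); `first_stop_index` and the loss values are made up for illustration.

```python
def first_stop_index(eval_losses, patience=3, threshold=0.0):
    """Return the index of the evaluation at which training would stop, or None."""
    best = None
    evals_without_improvement = 0
    for index, loss in enumerate(eval_losses):
        # An evaluation counts as an improvement only if the loss drops
        # below the best value so far by more than the threshold.
        improved = best is None or (loss < best and abs(loss - best) > threshold)
        if improved:
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
        if evals_without_improvement >= patience:
            return index
    return None

losses = [1.00, 0.90, 0.95, 0.92, 0.91, 0.50]  # hypothetical eval losses
stop_at = first_stop_index(losses, patience=3, threshold=0.0)
print(stop_at)
```

With `eval_steps=25`, each entry in the list corresponds to 25 training steps, so a patience of 3 means the run tolerates roughly 75 steps without improvement before stopping.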

We also use Unsloth's `train_on_responses_only` function to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!

In [19]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Map (num_proc=12):   0%|          | 0/69071 [00:00<?, ? examples/s]

Map (num_proc=12):   0%|          | 0/1712 [00:00<?, ? examples/s]
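
The idea behind response-only training can be sketched on a toy token list: every position outside a model turn gets the label `-100`, which the loss function ignores. The helper `mask_non_response` and the word-level "tokens" below are assumptions made for illustration; the Unsloth function operates on real token IDs and may differ in detail.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def mask_non_response(tokens, instruction_marker, response_marker):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(tokens)
    in_response = False
    i = 0
    while i < len(tokens):
        if tokens[i:i + len(response_marker)] == response_marker:
            in_response = True            # model turn starts after the marker
            i += len(response_marker)
            continue
        if tokens[i:i + len(instruction_marker)] == instruction_marker:
            in_response = False           # user turn: mask until the next response
            i += len(instruction_marker)
            continue
        if in_response:
            labels[i] = tokens[i]
        i += 1
    return labels

tokens = ["<bos>", "<start_of_turn>user", "Fix", "this", "<end_of_turn>",
          "<start_of_turn>model", "Fixed", "<end_of_turn>"]
labels = mask_non_response(tokens, ["<start_of_turn>user"], ["<start_of_turn>model"])
print(labels)
```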

In [20]:
print(trainer.train_dataset)

Dataset({
    features: ['text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 69071
})


In [21]:
print(trainer.train_dataset[100]["text"])
print()
print(trainer.train_dataset[100]["input_ids"])
print()
print(trainer.train_dataset[100]["attention_mask"])
print()
print(trainer.train_dataset[100]["labels"])

<start_of_turn>user
Fix grammar in this sentence: If engineers do not come up with new ideas, they cannot find best solution for the problems.<end_of_turn>
<start_of_turn>model
<think>
1.  **Instruction Analysis:** The task is to "Fix grammar" in the provided sentence. This means identifying and correcting any grammatical errors or awkward phrasing to make the sentence correct and natural according to standard English.
2.  **Sentence Analysis:** The sentence is "If engineers do not come up with new ideas, they cannot find best solution for the problems." It's a conditional sentence structure.
3.  **First Clause Analysis:** "If engineers do not come up with new ideas," - This clause appears grammatically correct. "Engineers" (plural subject) matches "do not come up" (plural verb form). "new ideas" is a correct plural noun phrase. No changes needed here.
4.  **Second Clause Analysis:** "they cannot find best solution for the problems." - This clause requires closer examination.
5.  **Ide

In [22]:
total_training_tokens = sum(len(x) for x in trainer.train_dataset['input_ids'])
print("Total training tokens in CoEdIT dataset:", total_training_tokens)

Total training tokens in CoEdIT dataset: 39786235


In [23]:
import numpy as np

lengths = [len(x) for x in trainer.train_dataset['input_ids']]
print(f"Token lengths stats:")
print(f"Min: {np.min(lengths)}")
print(f"Max: {np.max(lengths)}")
print(f"Mean: {np.mean(lengths)}")
print(f"Median: {np.median(lengths)}")
print(f"95th percentile: {np.percentile(lengths, 95)}")
print(f"99th percentile: {np.percentile(lengths, 99)}")

Token lengths stats:
Min: 21
Max: 2222
Mean: 576.0193858493434
Median: 576.0
95th percentile: 1144.0
99th percentile: 2074.0


Let's verify that the instruction part is masked! Let's print row 100 again. Notice how the sample only has a single `<bos>` as expected!

In [24]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

'<bos><start_of_turn>user\nFix grammar in this sentence: If engineers do not come up with new ideas, they cannot find best solution for the problems.<end_of_turn>\n<start_of_turn>model\n<think>\n1.  **Instruction Analysis:** The task is to "Fix grammar" in the provided sentence. This means identifying and correcting any grammatical errors or awkward phrasing to make the sentence correct and natural according to standard English.\n2.  **Sentence Analysis:** The sentence is "If engineers do not come up with new ideas, they cannot find best solution for the problems." It\'s a conditional sentence structure.\n3.  **First Clause Analysis:** "If engineers do not come up with new ideas," - This clause appears grammatically correct. "Engineers" (plural subject) matches "do not come up" (plural verb form). "new ideas" is a correct plural noun phrase. No changes needed here.\n4.  **Second Clause Analysis:** "they cannot find best solution for the problems." - This clause requires closer examinat

Now let's print the masked out example - you should see only the answer is present:

In [25]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

'                                  <think>\n1.  **Instruction Analysis:** The task is to "Fix grammar" in the provided sentence. This means identifying and correcting any grammatical errors or awkward phrasing to make the sentence correct and natural according to standard English.\n2.  **Sentence Analysis:** The sentence is "If engineers do not come up with new ideas, they cannot find best solution for the problems." It\'s a conditional sentence structure.\n3.  **First Clause Analysis:** "If engineers do not come up with new ideas," - This clause appears grammatically correct. "Engineers" (plural subject) matches "do not come up" (plural verb form). "new ideas" is a correct plural noun phrase. No changes needed here.\n4.  **Second Clause Analysis:** "they cannot find best solution for the problems." - This clause requires closer examination.\n5.  **Identify Issue 1: Article Usage before Superlative:** The phrase is "best solution". "Best" is a superlative adjective ("good", "better", "

In [26]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.557 GB.
7.688 GB of memory reserved.


# Let's train the model!

To resume a training run, call `trainer.train(resume_from_checkpoint = True)`.

In [27]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 69,071 | Num Epochs = 1 | Total steps = 1,080
O^O/ \_/ \    Batch size per device = 32 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (32 x 2 x 1) = 64
 "-____-"     Trainable parameters = 21,135,360 of 5,460,573,632 (0.39% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


OutOfMemoryError: CUDA out of memory. Tried to allocate 704.00 MiB. GPU 0 has a total capacity of 39.56 GiB of which 28.88 MiB is free. Process 10629 has 39.52 GiB memory in use. Of the allocated memory 38.92 GiB is allocated by PyTorch, and 88.54 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
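
A common response to an out-of-memory error like the one above is to lower `per_device_train_batch_size` and raise `gradient_accumulation_steps` so that the effective batch size stays the same. The sketch below only does the arithmetic for that trade-off, using the numbers from the training banner; `plan` is a hypothetical helper, and the calculation says nothing about whether a given setting actually fits in memory on this GPU.

```python
import math

num_examples = 69_071
target_effective_batch = 64  # 32 per device x 2 accumulation steps x 1 GPU

def plan(per_device_batch, num_gpus=1):
    """Return (accumulation steps, effective batch size, optimizer steps per epoch)."""
    accumulation = target_effective_batch // (per_device_batch * num_gpus)
    effective = per_device_batch * accumulation * num_gpus
    steps = math.ceil(num_examples / effective)
    return accumulation, effective, steps

for per_device_batch in (32, 16, 8):
    print(per_device_batch, plan(per_device_batch))
```

Each option keeps 64 examples per optimizer step and 1,080 steps per epoch, while the number of sequences held in memory at once shrinks with the per-device batch size.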

In [64]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

NameError: name 'trainer_stats' is not defined

<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`

In [None]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)
messages = [{
    "role": "user",
    "content": [{
        "type" : "text",
        "text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
    }]
}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)

['<bos><start_of_turn>user\nContinue the sequence: 1, 1, 2, 3, 5, 8,<end_of_turn>\n<start_of_turn>model\n13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6']

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [None]:
messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "Why is the sky blue?",}]
}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")

from transformers import TextStreamer
_ = model.generate(
    **inputs,
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

The sky is blue because of a phenomenon called Rayleigh scattering. Sunlight is made up of all the colors of the rainbow, but blue light has a shorter wavelength than red light. When sunlight enters the Earth's atmosphere, it collides with tiny air molecules (mostly nitrogen and oxygen). This collision causes the blue light to


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("gemma-3n")  # Local saving
tokenizer.save_pretrained("gemma-3n")
# model.push_to_hub("HF_ACCOUNT/gemma-3", token = "...") # Online saving
# tokenizer.push_to_hub("HF_ACCOUNT/gemma-3", token = "...") # Online saving

['gemma-3n/processor_config.json']

Now if you want to load the LoRA adapters we just saved for inference, change `if False` to `if True`:

In [None]:
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "gemma-3n", # The folder the LoRA adapters were saved to above
        max_seq_length = 2048,
        load_in_4bit = True,
    )

messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "What is Gemma-3N?",}]
}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")

from transformers import TextStreamer
_ = model.generate(
    **inputs,
    max_new_tokens = 128, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

Gemma-3N is a **new open-weight large language model (LLM)** developed by **Google DeepMind**. It's designed to be **accessible to a wider range of users**, including researchers and developers, by being **open-weight**. This means the model's weights are publicly available, allowing for more experimentation and customization.

Here's a breakdown of what Gemma-3N is:

* **Large Language Model (LLM):** Gemma-3N is a powerful AI model trained on a massive dataset of text and code. This allows it to understand and generate human-like text, translate


### Saving to float16 for vLLM

We also support saving to `float16` directly for deployment! We save it in the folder `gemma-3N-finetune`. Set `if False` to `if True` to let it run!

In [None]:
if False: # Change to True to save finetune!
    model.save_pretrained_merged("gemma-3N-finetune", tokenizer)

If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-3N-finetune", tokenizer,
        token = "hf_..."
    )

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!

In [None]:
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma-3N-finetune",
        quantization_type = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )

Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "gemma-3N-finetune",
        quantization_type = "Q8_0", # Only Q8_0, BF16, F16 supported
        repo_id = "HF_ACCOUNT/gemma-3N-finetune-gguf",
        token = "hf_...",
    )