<a href="https://colab.research.google.com/github/kartavya1710/Books-Python/blob/main/nb/Gemma3_(4B).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Read our **[Gemma 3N Guide](https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth

### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

In [2]:
from unsloth import FastModel
import torch

fourbit_models = [
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-1b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",

    # Other popular models!
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/Llama-3.3-70B",
    "unsloth/mistral-7b-instruct-v0.3",
    "unsloth/Phi-4",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-4b-it",
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.7.11: Fast Gemma3 patching. Transformers: 4.54.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.


model.safetensors:   0%|          | 0.00/4.56G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/210 [00:00<?, ?B/s]

processor_config.json:   0%|          | 0.00/70.0 [00:00<?, ?B/s]

chat_template.json: 0.00B [00:00, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

preprocessor_config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/4.69M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/35.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/670 [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update a small fraction of the parameters!

In [3]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules       = True,  # Should leave on always!

    r = 8,           # Larger = higher accuracy, but might overfit
    lora_alpha = 8,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

Unsloth: Making `model.base_model.model.model.language_model` require gradients
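
To see why LoRA trains so few parameters, here is a back-of-the-envelope sketch. It assumes the standard LoRA construction, where a frozen weight of shape `(d_out, d_in)` gets two small trainable factors of shapes `(r, d_in)` and `(d_out, r)`. The function name and the layer size are hypothetical and are not Gemma-3's actual dimensions.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer of shape (d_out, d_in)."""
    # Factor A has shape (r, d_in) and factor B has shape (d_out, r).
    return r * d_in + d_out * r

# Hypothetical layer size, for illustration only.
d_in, d_out, r = 2048, 2048, 8
full = d_in * d_out
lora = lora_param_count(d_in, d_out, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Summed over all adapted layers, this is why the training banner further down reports only 14,901,248 trainable parameters out of 4,314,980,720 (0.35%).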


<a name="Data"></a>
### Data Prep
We now use the `Gemma-3` format for conversation style finetunes. We use the `RecurvAI/Recurv-Medical-Dataset` dataset, which stores each example as an `input` / `output` pair that we convert to conversations below. Gemma-3 renders multi turn conversations like below:

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.
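
As a rough illustration of the layout shown above, the following sketch renders a conversation by hand. `render_gemma3` is a hypothetical helper written for this example, not the template that `get_chat_template` installs; the real template handles more cases (system prompts, images, and so on).

```python
def render_gemma3(conversation):
    """Hand-rolled approximation of the Gemma-3 chat layout (illustration only)."""
    # Gemma-3 labels the assistant turn "model".
    role_names = {"user": "user", "assistant": "model"}
    text = "<bos>"
    for turn in conversation:
        role = role_names[turn["role"]]
        text += f"<start_of_turn>{role}\n{turn['content']}<end_of_turn>\n"
    return text

demo = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]
rendered = render_gemma3(demo)
print(rendered)
```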

In [4]:
from datasets import load_dataset
import pandas as pd

# If your dataset is on Hugging Face Hub
dataset = load_dataset("RecurvAI/Recurv-Medical-Dataset", split="train")

README.md: 0.00B [00:00, ?B/s]

recurv_medical_dataset.parquet:   0%|          | 0.00/41.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/67299 [00:00<?, ? examples/s]

In [5]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,  # Your tokenizer from earlier model loading
    chat_template = "gemma-3",
)

In [6]:
def convert_to_conversations(examples):
    """
    Convert input/output format to conversations format required by Unsloth
    """
    conversations = []

    # Handle batch processing
    inputs = examples["input"] if isinstance(examples["input"], list) else [examples["input"]]
    outputs = examples["output"] if isinstance(examples["output"], list) else [examples["output"]]

    for inp, out in zip(inputs, outputs):
        # Create conversation in the required format
        conversation = [
            {
                "role": "user",
                "content": inp
            },
            {
                "role": "assistant",
                "content": out
            }
        ]
        conversations.append(conversation)

    return {"conversations": conversations}

# Apply the conversion
dataset = dataset.map(convert_to_conversations, batched=True)

Map:   0%|          | 0/67299 [00:00<?, ? examples/s]
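
To make the batched conversion concrete, here is a compact sketch of the same idea on a toy batch. `to_conversations` is a hypothetical stand-in for the function above, and the placeholder strings are not taken from the dataset.

```python
def to_conversations(batch):
    """Turn parallel input/output columns into one two-turn conversation per row."""
    return {
        "conversations": [
            [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
            for question, answer in zip(batch["input"], batch["output"])
        ]
    }

# With batched=True, map passes a dict of column name -> list of values.
toy_batch = {
    "input": ["question one", "question two"],
    "output": ["answer one", "answer two"],
}
converted = to_conversations(toy_batch)
print(converted["conversations"][0])
```

Each row becomes a list of two role/content dictionaries, which is the shape the chat template expects.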

In [9]:
dataset[2]

{'input': 'My father has been diagnosed with epilepsy and S-protein deficiency since 12 years. he has frequent seizures (once every 3 months). he has been on medication for epilepsy (tegritol/eptoin) for more than 10 years now. I wanted to know what are the things he should do/not do in order to prevent these frequent seizures ? Is it advisable for him to use a laptop/computer at all ? Does that trigger epilepsy ?',
 'output': 'Hello, your question is an intelligent one. There are certain precautions which we advise to the epilepsy patient and his family in order to reduce the chances of a new episode.1. Sleep - a good sound sleep at night is a must. Never make the patient sit for night hours even if there is a family function.2. Medicine-always. Carry meds with you while travelling. Put them in bag even before you put clothes in it. The meds should be given daily (without missing a single day) in made of seizure disorder. It is a must.3. The brand of the med should ideally never be ch

We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!

In [10]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/67299 [00:00<?, ? examples/s]

In [12]:
dataset[2]

{'input': 'My father has been diagnosed with epilepsy and S-protein deficiency since 12 years. he has frequent seizures (once every 3 months). he has been on medication for epilepsy (tegritol/eptoin) for more than 10 years now. I wanted to know what are the things he should do/not do in order to prevent these frequent seizures ? Is it advisable for him to use a laptop/computer at all ? Does that trigger epilepsy ?',
 'output': 'Hello, your question is an intelligent one. There are certain precautions which we advise to the epilepsy patient and his family in order to reduce the chances of a new episode.1. Sleep - a good sound sleep at night is a must. Never make the patient sit for night hours even if there is a family function.2. Medicine-always. Carry meds with you while travelling. Put them in bag even before you put clothes in it. The meds should be given daily (without missing a single day) in made of seizure disorder. It is a must.3. The brand of the med should ideally never be ch

The output above shows row 2 after standardization.

We now have to apply the chat template for `Gemma-3` onto the conversations, and save it to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The Processor will add this token before training and the model expects only one.

In [13]:
def formatting_prompts_func(examples):
    """
    Apply Gemma-3 chat template to conversations
    """
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo,
            tokenize=False,
            add_generation_prompt=False
        ).removeprefix('<bos>')
        for convo in convos
    ]
    return {"text": texts}

# Apply formatting
dataset = dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/67299 [00:00<?, ? examples/s]
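
The reason for stripping `<bos>` can be shown with plain strings. This is a minimal sketch that assumes the processor always prepends one `<bos>` at tokenization time; the prepending is simulated here with string concatenation rather than a real tokenizer.

```python
rendered = "<bos><start_of_turn>user\nHello!<end_of_turn>\n"

# str.removeprefix (Python 3.9+) strips the prefix only if it is present.
text = rendered.removeprefix("<bos>")

# Simulate a processor that always prepends <bos>.
with_strip = "<bos>" + text
without_strip = "<bos>" + rendered

print(with_strip.count("<bos>"), without_strip.count("<bos>"))
```

Without the strip, the training text would start with two `<bos>` tokens instead of one.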

In [15]:
# Step 6: Verify the formatting
print("Sample formatted conversation:")
print(dataset[2]["text"])
print("\n" + "="*50 + "\n")
print("Another sample:")
print(dataset[100]["text"])

Sample formatted conversation:
<start_of_turn>user
My father has been diagnosed with epilepsy and S-protein deficiency since 12 years. he has frequent seizures (once every 3 months). he has been on medication for epilepsy (tegritol/eptoin) for more than 10 years now. I wanted to know what are the things he should do/not do in order to prevent these frequent seizures ? Is it advisable for him to use a laptop/computer at all ? Does that trigger epilepsy ?<end_of_turn>
<start_of_turn>model
Hello, your question is an intelligent one. There are certain precautions which we advise to the epilepsy patient and his family in order to reduce the chances of a new episode.1. Sleep - a good sound sleep at night is a must. Never make the patient sit for night hours even if there is a family function.2. Medicine-always. Carry meds with you while travelling. Put them in bag even before you put clothes in it. The meds should be given daily (without missing a single day) in made of seizure disorder. It 

Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.

In [16]:
# STEP 7: Skip filtering and keep all conversations
print(f"Dataset size (no filtering applied): {len(dataset)}")

Dataset size (no filtering applied): 67299


In [17]:
# STEP 8: Optional - Split dataset
if len(dataset) > 1000:
    # Keep a small validation set
    dataset_split = dataset.train_test_split(test_size=0.05, seed=42)
    train_dataset = dataset_split["train"]
    eval_dataset = dataset_split["test"]
    print(f"Train size: {len(train_dataset)}, Eval size: {len(eval_dataset)}")
else:
    train_dataset = dataset
    eval_dataset = None
    print(f"Using full dataset for training: {len(train_dataset)}")

Train size: 63934, Eval size: 3365
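
The split sizes printed above can be reproduced by hand. This sketch assumes the eval count is the requested fraction rounded up, with the remainder going to the train split; that assumption matches the numbers in the output.

```python
import math

n_samples = 67299  # dataset size printed earlier
test_size = 0.05

# Assumption: eval count is rounded up, the rest is used for training.
n_eval = math.ceil(test_size * n_samples)
n_train = n_samples - n_eval
print(n_train, n_eval)
```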


<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 300 steps to speed things up, but for a full run you can set `num_train_epochs = 1` and comment out `max_steps`.

In [45]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset, # Full dataset; the train/eval split created above is not used here
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 300,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Switching to float32 training since model cannot work with float16
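
A quick calculation shows how much of the dataset this configuration actually covers. The values are copied from the `SFTConfig` above and from the dataset size; a single GPU is assumed.

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1          # assumption: single-GPU Colab runtime
max_steps = 300
num_examples = 67299

# Gradient accumulation multiplies the batch size seen by each optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / num_examples
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen} ({fraction_of_epoch:.1%} of one epoch)")
print(f"steps for one full epoch: {steps_per_epoch}")
```

So 300 steps touch only a few percent of the data, which is why a full run needs `num_train_epochs = 1` instead.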


We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This can help increase the accuracy of finetunes!

In [46]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Map (num_proc=2):   0%|          | 0/67299 [00:00<?, ? examples/s]
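
The idea behind response-only training can be sketched on a toy token list. This is not Unsloth's implementation: `mask_non_responses` and the single-token `USER` / `MODEL` markers are hypothetical simplifications, whereas the real function works on token IDs and multi-token markers. The sketch only shows that everything outside the model turn gets the label `-100`, which the loss ignores.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_non_responses(tokens, instruction_marker, response_marker):
    """Keep labels only for tokens that follow a response marker."""
    labels = []
    in_response = False
    for token in tokens:
        if token == response_marker:
            in_response = True
            labels.append(IGNORE_INDEX)  # the marker itself is not a target
        elif token == instruction_marker:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token if in_response else IGNORE_INDEX)
    return labels

# Toy sequence: strings stand in for token IDs.
tokens = ["<bos>", "USER", "Hi", "<end>", "MODEL", "Hello", "<end>"]
labels = mask_non_responses(tokens, "USER", "MODEL")
print(labels)
```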

Let's verify that the instruction part is masked! Let's print row 2 again. Notice how the sample only has a single `<bos>` as expected!

In [21]:
tokenizer.decode(trainer.train_dataset[2]["input_ids"])

'<bos><start_of_turn>user\nMy father has been diagnosed with epilepsy and S-protein deficiency since 12 years. he has frequent seizures (once every 3 months). he has been on medication for epilepsy (tegritol/eptoin) for more than 10 years now. I wanted to know what are the things he should do/not do in order to prevent these frequent seizures ? Is it advisable for him to use a laptop/computer at all ? Does that trigger epilepsy ?<end_of_turn>\n<start_of_turn>model\nHello, your question is an intelligent one. There are certain precautions which we advise to the epilepsy patient and his family in order to reduce the chances of a new episode.1. Sleep - a good sound sleep at night is a must. Never make the patient sit for night hours even if there is a family function.2. Medicine-always. Carry meds with you while travelling. Put them in bag even before you put clothes in it. The meds should be given daily (without missing a single day) in made of seizure disorder. It is a must.3. The brand

Now let's print the masked-out labels for row 100. You should see that only the answer is present:

In [22]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

'                                                                                                   HiT hank you for asking ChatDoctorI have gone through your query. Your problem can be most likely due to yeast infection. But if you have unprotected sex with risky partner then you should run STD tests to rule out those. If you were my patient with yeast infection then I would suggest using fluconazole ointment for local application. Soaps and antiseptics should be avoided as it may increase irritation. You can use saline or warm water instead of that. Hope this may help you. Let me know if anything is not clear. Thanks.<end_of_turn>\n'

In [27]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
5.59 GB of memory reserved.


Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)`.

In [47]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 67,299 | Num Epochs = 1 | Total steps = 300
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 14,901,248 of 4,314,980,720 (0.35% trained)


Step,Training Loss
1,3.2211
2,3.5025
3,2.8144
4,2.557
5,3.1445
6,2.8303
7,3.2089
8,3.0084
9,2.7721
10,3.289


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1068.4322 seconds used for training.
17.81 minutes used for training.
Peak reserved memory = 13.561 GB.
Peak reserved memory for training = 9.278 GB.
Peak reserved memory % of max memory = 91.995 %.
Peak reserved memory for training % of max memory = 62.94 %.


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`

In [48]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)
messages = [{
    "role": "user",
    "content": [{
        "type" : "text",
        "text" : "What specific precautions are recommended for epilepsy patients regarding screen usage like laptops or TVs, and are there any preferred conditions under which they should or shouldn't use them?",
    }]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)
outputs = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)

["<bos><start_of_turn>user\nWhat specific precautions are recommended for epilepsy patients regarding screen usage like laptops or TVs, and are there any preferred conditions under which they should or shouldn't use them?<end_of_turn>\n<start_of_turn>model\nHello there, Welcome to Chat Doctor, The use of laptops and TVs to the extent that it leaves in an altered state which can mimic epilepsy symptoms like seeing some thing in the inner sight and hearing sound on its own can cause that to happen.  All these are because of the vibrations of a hand with the laptop or"]

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting the whole time!

In [50]:
messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "Is there any recommendation about carrying identification or emergency contact information for epilepsy patients? If so, what kind of information should it include?",}]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

Degree understand your concerns went through your details. i suggest you not to worry much. Epilepsy has no definite diagnosis. Epilepsy is a general term which is given to epilepsy. Epilepsy is a group of medical disorder, which are characterized with seizures which may be violent or be mild; they may give rise to a brief seizure or
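
To give a feel for what the `temperature`, `top_k` and `top_p` settings passed to `model.generate` above do, here is a simplified pure-Python sketch. `filter_probs` is a hypothetical helper; real implementations operate on tensors and differ in edge cases, so treat this as an illustration of the idea only.

```python
import math

def filter_probs(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Toy logits over a 5-token vocabulary.
toy_logits = [4.0, 3.0, 1.0, 0.0, -2.0]
filtered = filter_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.9)
print(filtered)
```

The next token is then sampled from the surviving probabilities, so lower `top_p` or `top_k` values make the output less varied.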


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [51]:
model.save_pretrained("gemma-3-by-kartavya")  # Local saving
tokenizer.save_pretrained("gemma-3-by-kartavya")
# model.push_to_hub("HF_ACCOUNT/gemma-3", token = "...") # Online saving
# tokenizer.push_to_hub("HF_ACCOUNT/gemma-3", token = "...") # Online saving

['gemma-3-by-kartavya/processor_config.json']

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [None]:
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "gemma-3-by-kartavya", # The folder the LoRA adapters were saved to above
        max_seq_length = 2048,
        load_in_4bit = True,
    )

messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "What specific precautions are recommended for epilepsy patients regarding screen usage like laptops or TVs, and are there any preferred conditions under which they should or shouldn't use them?",}]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

In [53]:
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "gemma-3-by-kartavya", # The folder the LoRA adapters were saved to above
        max_seq_length = 2048,
        load_in_4bit = True,
    )

messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "Is there any recommendation about carrying identification or emergency contact information for epilepsy patients? If so, what kind of information should it include?",}]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

Hi welcome with your query to Chat Doctor I have gone through your query I have studied your question the emergency contact information should include the persons name, contact number, family member name that person that person is patient with epilepsy, general medical history of the person like other diseases, allergies, medication, in case of emergency use emergency contact number like the hospitals near that person the contact number of emergency center close to that person, and any other important information like blood type is not required most of person does not know what this means if they are found in emergency floor if in emergency there may be a person who knows where the person are living the general medical history can help the hospital in saving time this information should be stored in the pocket or some place where the person can retrieve it there should be a medical history card which should contain all the important information you are asking. Please follow your doctor

### Saving to float16 for vLLM

We also support saving to `float16` directly for deployment! The cell below saves the merged model in the folder `gemma-3-finetune-by-kartavya`. It is enabled with `if True`; set it to `if False` to skip it.

In [None]:
if True: # Set to False to skip saving the merged finetune
    model.save_pretrained_merged("gemma-3-finetune-by-kartavya", tokenizer)

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model-00001-of-00002.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Downloading safetensors index for unsloth/gemma-3-4b-it...


Unsloth: Merging weights into 16bit:   0%|          | 0/2 [00:00<?, ?it/s]

If you want to upload / push to your Hugging Face account, replace `HF_ACCOUNT` and the `hf_...` token placeholder with your own values before running the cell below, or set `if True` to `if False` to skip it.

In [None]:
if True: # Set to False to skip uploading the merged finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-3-finetune-by-kartavya", tokenizer,
        token = "hf_..."
    )

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert to `Q8_0`, `F16` or `BF16` precision. `Q4_K_M` for 4bit will come later!

In [None]:
if True: # Set to False to skip saving to GGUF
    model.save_pretrained_gguf(
        "gemma-3-finetune-by-kartavya",
        quantization_type = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )

Likewise, to push the GGUF to your Hugging Face account instead, replace `HF_ACCOUNT` and the `hf_...` token placeholder with your own values before running the cell below, or set `if True` to `if False` to skip it.

In [None]:
if True: # Set to False to skip uploading the GGUF
    model.push_to_hub_gguf(
        "gemma-3-finetune-by-kartavya",
        quantization_type = "Q8_0", # Only Q8_0, BF16, F16 supported
        repo_id = "HF_ACCOUNT/gemma-finetune-gguf",
        token = "hf_...",
    )

Now, use the resulting `.gguf` file (`Q8_0` with the settings above) in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

