To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (eg for Llama.cpp).

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

Features in the notebook:
1. Uses Maxime Labonne's [FineTome 100K](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset.
2. Convert ShareGPT to HuggingFace format via `standardize_sharegpt`
3. Train on Completions / Assistant only via `train_on_responses_only`
4. Unsloth now supports Torch 2.4, all TRL & Xformers versions & Python 3.12!

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# 1. Setup - Install / Imports
# Install necessary libraries and update transformers/huggingface_hub for compatibility
!pip install -U "unsloth[colab-new]" "transformers" "huggingface_hub"

from unsloth import FastLanguageModel
from google.colab import drive
import torch
import gc
import os

# Define Paths
# Ensure this points to your specific checkpoint (where the adapter weights are)
CHECKPOINT_PATH = "/content/drive/MyDrive/Unsloth-LLama-3.2-1B-DOCTOR-Checkpoints/checkpoint-625"
# The directory where the MERGED model will be saved
SAVE_PATH = "/content/drive/MyDrive/Unsloth-LLama-3.2-1B-DOCTOR-Merged"
GGUF_FILENAME = "Unsloth-LLama-3.2-1B-DOCTOR.gguf" # Name for the final GGUF file

# 2. Mount Drive
drive.mount('/content/drive')

# 3. Load Model (LoRA adapters)
# We load the model from the checkpoint path
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = CHECKPOINT_PATH,
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

# 4. Save the MERGED model to Drive
print(f"Saving merged model to {SAVE_PATH}...")

# This merges the LoRA adapters into the base model and saves as FP16.
# The merged model files are required for the GGUF conversion.
model.save_pretrained_merged(
    SAVE_PATH,
    tokenizer,
    save_method = "merged_16bit",
)

# 5. Clean Up (Necessary to free VRAM for the next step)
del model
del tokenizer
gc.collect()
torch.cuda.empty_cache()

print("-" * 50)
print(f"STEP 1 COMPLETE. Merged model saved to {SAVE_PATH}")
print("-" * 50)

Collecting transformers
  Downloading transformers-4.57.3-py3-none-any.whl.metadata (43 kB)
Collecting huggingface_hub
  Downloading huggingface_hub-1.1.7-py3-none-any.whl.metadata (13 kB)
Collecting unsloth[colab-new]
  Downloading unsloth-2025.11.6-py3-none-any.whl.metadata (64 kB)
Collecting unsloth_zoo>=2025.11.6 (from unsloth[colab-new])
  Downloading unsloth_zoo-2025.11.6-py3-none-any.whl.metadata (32 kB)
Collecting tyro (from unsloth[colab-new])
  Downloading tyro-0.9.35-py3-none-any.whl.metadata (12 kB)
Collecting xformers>=0.0.27.post2 (from unsloth[colab-new])
  Downloading xfo

model.safetensors:   0%|          | 0.00/1.10G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

Unsloth 2025.11.6 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


Saving merged model to /content/drive/MyDrive/Unsloth-LLama-3.2-1B-DOCTOR-Merged...


config.json:   0%|          | 0.00/894 [00:00<?, ?B/s]

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:22<00:00, 22.84s/it]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:51<00:00, 51.45s/it]


Unsloth: Merge process complete. Saved to `/content/drive/MyDrive/Unsloth-LLama-3.2-1B-DOCTOR-Merged`
--------------------------------------------------
STEP 1 COMPLETE. Merged model saved to /content/drive/MyDrive/Unsloth-LLama-3.2-1B-DOCTOR-Merged
!!! YOU MUST NOW RESTART THE RUNTIME (Runtime > Restart session) BEFORE running Step 2 !!!
--------------------------------------------------


In [None]:
import os
# 1. Setup Paths
SAVE_PATH = "/content/drive/MyDrive/Unsloth-LLama-3.2-1B-DOCTOR-Merged"
GGUF_FILENAME = "Unsloth-LLama-3.2-1B-DOCTOR.gguf" # TYPO SHOULD BE 1B V2
OUTPUT_DIR = "/content/drive/MyDrive/Unsloth-LLama-3.2-1B-DOCTOR-GGUF"
os.makedirs(OUTPUT_DIR, exist_ok=True)
FINAL_GGUF_PATH = os.path.join(OUTPUT_DIR, GGUF_FILENAME)

# 2. Clone llama.cpp (if not present) and install dependencies
if not os.path.isdir("llama.cpp"):
    print("Cloning llama.cpp...")
    !git clone https://github.com/ggerganov/llama.cpp.git

# Install base llama.cpp requirements
print("Installing llama.cpp dependencies...")
!pip install -r llama.cpp/requirements.txt

print("Updating transformers and huggingface_hub to resolve Colab conflicts...")
!pip install -U "transformers" "huggingface_hub"

print("-" * 50)
print(f"Dependencies installed. Merged model will be read from {SAVE_PATH}")
print("-" * 50)

# 3. Convert to GGUF
quantization_method = "q8_0" # Recent convert_hf_to_gguf.py versions write types such as f16, bf16 or q8_0; K-quants like q4_k_m need llama.cpp's llama-quantize afterwards

print(f"Starting GGUF conversion with quantization: {quantization_method}...")

# Run the conversion script
!python llama.cpp/convert_hf_to_gguf.py {SAVE_PATH} \
    --outfile {FINAL_GGUF_PATH} \
    --outtype {quantization_method}

# 4. Report the result
# No GPU cleanup is needed here: after the runtime restart, `model` and
# `tokenizer` do not exist in this session, so `del model` raises NameError.
print("-" * 50)
print(f"STEP 2 COMPLETE. The GGUF file has been saved to: {FINAL_GGUF_PATH}")
print("-" * 50)

Installing llama.cpp dependencies...
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/nightly, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/nightly, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/nightly
Ignoring torch: markers 'platform_machine == "s390x"' don't match your environment
Ignoring torch: markers 'platform_machine == "s390x"' don't match your environment
Updating transformers and huggingface_hub to resolve Colab conflicts...
Collecting transformers
  Using cached transformers-4.57.3-py3-none-any.whl.metadata (43 kB)
Collecting huggingface_hub
  Using cached huggingface_hub-1.1.7-py3-none-any.whl.metadata (13 kB)
Using cached transformers-4.57.3-py3-none-any.whl (12.0 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.57.2
    Uninstalling transformers-4.57.2:
      Successfully

NameError: name 'model' is not defined
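
The conversion call in the cell above can also be assembled in Python, which makes it possible to validate the output type before starting a long job. The sketch below uses a hypothetical helper, `build_convert_command`, that is not part of llama.cpp or Unsloth. The set of accepted `--outtype` values is an assumption based on recent versions of `convert_hf_to_gguf.py`, so check `python llama.cpp/convert_hf_to_gguf.py --help` for your checkout.

```python
import sys

# Assumption: output types accepted by recent convert_hf_to_gguf.py versions.
# K-quants such as q4_k_m are produced in a second step with llama.cpp's
# llama-quantize tool, not by the conversion script.
ASSUMED_OUTTYPES = {"f32", "f16", "bf16", "q8_0", "auto"}


def build_convert_command(model_dir, outfile, outtype="q8_0",
                          script="llama.cpp/convert_hf_to_gguf.py"):
    """Return the argument list for the HF -> GGUF conversion script."""
    if outtype not in ASSUMED_OUTTYPES:
        raise ValueError(
            f"unsupported outtype {outtype!r}; expected one of {sorted(ASSUMED_OUTTYPES)}"
        )
    return [sys.executable, script, model_dir, "--outfile", outfile, "--outtype", outtype]


cmd = build_convert_command(
    "/content/drive/MyDrive/Unsloth-LLama-3.2-1B-DOCTOR-Merged",
    "/content/drive/MyDrive/Unsloth-LLama-3.2-1B-DOCTOR-GGUF/Unsloth-LLama-3.2-1B-DOCTOR.gguf",
)
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems when a Drive path contains spaces.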

In [None]:
import os
from huggingface_hub import HfApi, login

# --- 1. Define Paths and Repository Details ---
# NOTE: These must match the values used in Step 2. Uncomment them if you run
# this cell in a fresh session where Step 2's variables are no longer defined.
#GGUF_FILENAME = "Unsloth-LLama-3.2-1B-DOCTOR.gguf"
#OUTPUT_DIR = "/content/drive/MyDrive/Unsloth-LLama-3.2-1B-DOCTOR-GGUF"
FINAL_GGUF_PATH = os.path.join(OUTPUT_DIR, GGUF_FILENAME)

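# --- 2. Define your Hugging Face Repository Details ---
# IMPORTANT: Replace coolestGuyEver with your own Hugging Face username
# This will be the name of your new model repo (e.g., "username/llama-3.2-1B-gguf")
REPO_ID = f"coolestGuyEver/{GGUF_FILENAME.replace('.gguf', '-GGUF')}"


# --- 3. Initialize API and Create Repository ---
# Read the token from the environment instead of hardcoding it in the notebook.
# With token=None, huggingface_hub falls back to any locally saved login.
api = HfApi(token=os.environ.get("HF_TOKEN"))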

print(f"\n--- Creating or checking repo: {REPO_ID} ---")
# Creates the repository if it doesn't exist.
api.create_repo(
    repo_id=REPO_ID,
    repo_type="model",
    exist_ok=True, # Will not throw an error if repo already exists
)

# --- 4. Upload the GGUF File ---
print(f"\n--- Uploading {GGUF_FILENAME} to the Hub ---")
# Uploads the single GGUF file
api.upload_file(
    path_or_fileobj=FINAL_GGUF_PATH,
    path_in_repo=GGUF_FILENAME, # Name it the same inside the repo
    repo_id=REPO_ID,
    repo_type="model",
    commit_message=f"Upload of {GGUF_FILENAME} (q8_0 quantization)",
)

print("-" * 50)
print(f"Your model is now available at: https://huggingface.co/{REPO_ID}")
print("Please visit the link to check the file and add a Model Card (README.md).")
print("-" * 50)


--- Creating or checking repo: coolestGuyEver/Unsloth-LLama-3.2-1B-DOCTOR-GGUF ---

--- Uploading Unsloth-LLama-3.2-1B-DOCTOR.gguf to the Hub ---


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...-LLama-3.2-1B-DOCTOR.gguf:   0%|          |  559kB / 1.32GB            

--------------------------------------------------
✅ UPLOAD COMPLETE!
Your model is now available at: https://huggingface.co/coolestGuyEver/Unsloth-LLama-3.2-1B-DOCTOR-GGUF
Please visit the link to check the file and add a Model Card (README.md).
--------------------------------------------------


In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git@nightly git+https://github.com/unslothai/unsloth-zoo.git

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
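
The RoPE scaling bullet above refers to position interpolation: positions are divided by a scale factor so that a longer sequence is squeezed into the position range the model was trained on. The following is a minimal sketch of that idea in plain Python, not Unsloth's implementation. The head dimension, base, and context lengths are illustrative assumptions.

```python
import math


def rope_angles(position, head_dim=64, base=10000.0, scale=1.0):
    """Rotary angles for one position; scale > 1 applies linear interpolation."""
    scaled_position = position / scale
    return [scaled_position / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]


trained_len = 2048   # assumed context length seen during pretraining
target_len = 4096    # assumed longer max_seq_length requested by the user
scale = target_len / trained_len

# With scaling, the last position of the long context (4095) produces the same
# angles as position 2047.5 without scaling, which lies inside the trained range.
long_ctx = rope_angles(target_len - 1, scale=scale)
reference = rope_angles((target_len - 1) / scale)
same_angles = all(math.isclose(a, b) for a, b in zip(long_ctx, reference))
print(same_angles)
```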

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",

    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit" # NEW! Llama 3.3 70B!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3.2-1b-instruct-unsloth-bnb-4bit", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.6: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/1.10G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.11.6 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
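
To see where the small trainable fraction comes from, the adapter size can be worked out by hand. The sketch below counts LoRA weights for the seven target modules, assuming the usual Llama-3.2-1B layer shapes (hidden size 2048, key/value width 512, MLP width 8192, 16 layers). These shapes are assumptions that do not appear in this notebook; with `r = 16` the result agrees with the `Trainable parameters = 11,272,192` line that the trainer prints later.

```python
# Assumed Llama-3.2-1B shapes: (in_features, out_features) per linear layer.
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 2048, 512, 8192, 16
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(r, shapes=MODULE_SHAPES, num_layers=NUM_LAYERS):
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(r=16)
print(f"{trainable:,}")  # 11,272,192
```

Because the count is linear in `r`, doubling the rank doubles the number of trainable weights.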


<a name="Data"></a>
### Data Prep
We now use the `Llama-3.1` format for conversation style finetunes. The original version of this notebook uses [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style and converts it to HuggingFace's normal multiturn format `("role", "content")` instead of `("from", "value")`. The code below instead loads `lavita/ChatDoctor-HealthCareMagic-100k`, whose rows have `instruction`, `input` and `output` columns. Llama-3 renders multi turn conversations like below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
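
The layout above can be reproduced with a few lines of string formatting. This is a simplified sketch for illustration only: `render_llama3` is a hypothetical helper, not the tokenizer's real chat template, and it leaves out the default system header with the knowledge cutoff and date that the real Llama 3.1 template inserts.

```python
def render_llama3(messages, add_generation_prompt=False):
    """Render role/content messages in the Llama-3 layout shown above."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


conversation = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
rendered = render_llama3(conversation)
print(rendered)
```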

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]

    chats = []
    for instr, inp, out in zip(instructions, inputs, outputs):
        convo = [
            {"role": "system", "content": instr},
            {"role": "user", "content": inp},
            {"role": "assistant", "content": out},
        ]
        chats.append(
            tokenizer.apply_chat_template(
                convo,
                tokenize=False,
                add_generation_prompt=False,
            )
        )

    return {"text": chats}


from datasets import load_dataset
# Load the dataset and select only the first 5,000 examples
# This speeds up training massively and satisfies the "Data-Centric" improvement requirement.
dataset = load_dataset("lavita/ChatDoctor-HealthCareMagic-100k", split="train").select(range(5000))

# OPTIONAL: If you want to be fancy, shuffle it first to get random examples:
# dataset = load_dataset("mlabonne/FineTome-100k", split="train").shuffle(seed=42).select(range(5000))

We now use `standardize_sharegpt` to convert ShareGPT style datasets into HuggingFace's generic format. This changes the dataset from looking like:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
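
The renaming shown above amounts to a key and role mapping. The sketch below illustrates it in plain Python; `sharegpt_to_messages` and `ROLE_MAP` are hypothetical names, and this is not how `standardize_sharegpt` is implemented internally.

```python
# Assumed mapping from ShareGPT speaker names to HuggingFace role names.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(turns):
    """Convert [{'from': ..., 'value': ...}] turns to [{'role': ..., 'content': ...}]."""
    return [{"role": ROLE_MAP[turn["from"]], "content": turn["value"]} for turn in turns]


sharegpt = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
messages = sharegpt_to_messages(sharegpt)
for message in messages:
    print(message)
```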

In [None]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

We look at how the conversations are structured for item 5:

In [None]:
dataset[5]

{'instruction': "If you are a doctor, please answer the medical questions based on the patient's description.",
 'input': 'I am F 38 in good shape work out (do triathlons) regular but have had back pain from different reasons throughout my life now I every so often wake up with severe lower back, hip pain for no reason. today the pain is almost taking my breathe away when i move. It is a dull pain when i am just lying down but the moment I make any sort of movement I have sharp and sometimes shooting pain down my legs.',
 'output': 'Hi, From history it seems that you might be having degenerative changes in your lower back spines giving pinched nerve pressure. There might be having osteomalacia or osteoporosis as well. Go for x-ray lumbosacral region for osteoarthritis. Physiotherapy like back extension exercises will be much helpful. Take B1, B6, B!2 shots or medicine. Take calcium, vitamin A and D supplements. Ok and take care.',
 'text': "<|begin_of_text|><|start_header_id|>system<|e

And we see how the chat template transformed these conversations.

**[Notice]** Llama 3.1 Instruct's default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"` by default, so do not be alarmed!

In [None]:
dataset[5]["text"]

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nIf you are a doctor, please answer the medical questions based on the patient's description.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI am F 38 in good shape work out (do triathlons) regular but have had back pain from different reasons throughout my life now I every so often wake up with severe lower back, hip pain for no reason. today the pain is almost taking my breathe away when i move. It is a dull pain when i am just lying down but the moment I make any sort of movement I have sharp and sometimes shooting pain down my legs.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHi, From history it seems that you might be having degenerative changes in your lower back spines giving pinched nerve pressure. There might be having osteomalacia or osteoporosis as well. Go for x-ray lumbosacral region for osteoarthritis. Physiotherapy like bac

<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). This run trains for one full epoch (`num_train_epochs = 1` with `max_steps = -1`), which is 625 optimizer steps on the 5,000 selected examples. For a quick test, set `max_steps = 60` instead. We also support TRL's `DPOTrainer`!
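
The number of optimizer steps follows from the dataset size and the batch settings used in the trainer configuration. Here is a small sketch of that arithmetic using this notebook's values (5,000 examples, batch size 2, gradient accumulation 4, one GPU). The rounding rule is an assumption and may differ between trainer versions, but 5,000 divides evenly by 8 here, so it does not matter.

```python
import math

num_examples = 5000
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
num_train_epochs = 1

# Examples consumed per optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
print(effective_batch, total_steps)  # 8 625
```

This also explains why the checkpoint loaded for merging is named `checkpoint-625`: it is the state after the final step of the epoch.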

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        output_dir = "/content/drive/MyDrive/Unsloth-LLama-3.2-1B-DOCTOR-Checkpoints",
        save_strategy = "steps",
        save_steps = 100,
        save_total_limit = 2,  # Only keep the last 2 checkpoints to save Drive space
        num_train_epochs = 1,
        max_steps = -1,
        # Standard settings
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 50,
        learning_rate = 1e-5,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none",
    ),
)

In [None]:
# print the first example to inspect how the conversation is formatted
for i in range(1):
    print("----- EXAMPLE", i, "-----")
    print(dataset[i]["text"])
    print()
response_part = "<|start_header_id|>assistant<|end_header_id|>"  # Llama-3 marker, not ChatML's <|im_start|>
matches = sum(1 for t in dataset["text"] if response_part in t)
print(f"Examples containing response_part '{response_part}': {matches}/{len(dataset)}")


----- EXAMPLE 0 -----
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

If you are a doctor, please answer the medical questions based on the patient's description.<|eot_id|><|start_header_id|>user<|end_header_id|>

I woke up this morning feeling the whole room is spinning when i was sitting down. I went to the bathroom walking unsteadily, as i tried to focus i feel nauseous. I try to vomit but it wont come out.. After taking panadol and sleep for few hours, i still feel the same.. By the way, if i lay down or sit down, my head do not spin, only when i want to move around then i feel the whole world is spinning.. And it is normal stomach discomfort at the same time? Earlier after i relieved myself, the spinning lessen so i am not sure whether its connected or coincidences.. Thank you doc!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi, Thank you for posting your query. The most likely cause for your 

In [None]:
trainer.train_dataset.column_names



['instruction', 'input', 'output', 'text', 'input_ids', 'attention_mask']

In [None]:
# Clean the trainer's internal dataset, not just the original dataset
if "labels" in trainer.train_dataset.column_names:
    trainer.train_dataset = trainer.train_dataset.remove_columns(["labels"])


We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs.

In [None]:
from unsloth import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>",
    response_part    = "<|start_header_id|>assistant<|end_header_id|>",
    num_proc = 1,
)



Map (num_proc=1):   0%|          | 0/5000 [00:00<?, ? examples/s]

We verify masking is actually done:

In [None]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

"<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nIf you are a doctor, please answer the medical questions based on the patient's description.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI am F 38 in good shape work out (do triathlons) regular but have had back pain from different reasons throughout my life now I every so often wake up with severe lower back, hip pain for no reason. today the pain is almost taking my breathe away when i move. It is a dull pain when i am just lying down but the moment I make any sort of movement I have sharp and sometimes shooting pain down my legs.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHi, From history it seems that you might be having degenerative changes in your lower back spines giving pinched nerve pressure. There might be having osteomalacia or osteoporosis as well. Go for x-ray lumbosacral region for osteoarthritis. Physi

In [None]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                                                                                                \n\nHi, From history it seems that you might be having degenerative changes in your lower back spines giving pinched nerve pressure. There might be having osteomalacia or osteoporosis as well. Go for x-ray lumbosacral region for osteoarthritis. Physiotherapy like back extension exercises will be much helpful. Take B1, B6, B!2 shots or medicine. Take calcium, vitamin A and D supplements. Ok and take care.<|eot_id|>'

We can see the System and Instruction prompts are successfully masked!
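
The masking idea can be illustrated on toy token ids. The sketch below is not Unsloth's implementation of `train_on_responses_only`; it is a simplified version in which every label outside an assistant response is set to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The marker ids are made up for the example.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def find_subsequence(sequence, pattern, start=0):
    """Return the first index >= start where pattern occurs, or -1."""
    for i in range(start, len(sequence) - len(pattern) + 1):
        if sequence[i:i + len(pattern)] == pattern:
            return i
    return -1


def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only between a response marker and the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    cursor = 0
    while True:
        start = find_subsequence(input_ids, response_ids, cursor)
        if start == -1:
            break
        begin = start + len(response_ids)
        end = find_subsequence(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)
        labels[begin:end] = input_ids[begin:end]
        cursor = end
    return labels


# Toy token ids: 1 = user marker, 2 = assistant marker, others = content.
ids = [1, 10, 11, 2, 20, 21, 1, 12, 2, 22]
labels = mask_non_responses(ids, instruction_ids=[1], response_ids=[2])
print(labels)  # [-100, -100, -100, -100, 20, 21, -100, -100, -100, 22]
```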

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
1.203 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 625
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192 of 1,247,086,592 (0.90% trained)


Step,Training Loss
1,3.3391
2,3.8912
3,3.4432
4,3.9461
5,3.4684
6,3.3948
7,3.5597
8,3.732
9,3.809
10,3.5597




In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

938.4855 seconds used for training.
15.64 minutes used for training.
Peak reserved memory = 2.334 GB.
Peak reserved memory for training = 1.131 GB.
Peak reserved memory % of max memory = 15.833 %.
Peak reserved memory for training % of max memory = 7.672 %.


In [None]:
import os
from google.colab import files

# Define the path to the checkpoint directory
# This path is already defined in your notebook as CHECKPOINT_PATH
checkpoint_dir = CHECKPOINT_PATH

# Define the output path for the zip file
zip_filename = os.path.basename(checkpoint_dir) + ".zip"
output_zip_path = os.path.join("/content", zip_filename)

print(f"Compressing checkpoint directory: {checkpoint_dir}...")
# Create a zip archive of the checkpoint directory
!zip -r "{output_zip_path}" "{checkpoint_dir}"

print(f"Downloading {zip_filename}...")
# Download the zip file to your local machine
files.download(output_zip_path)

print("Download initiated. Check your browser's download folder.")

Compressing checkpoint directory: /content/drive/MyDrive/Unsloth_Llama_1B_V2_Checkpoints/checkpoint-1250...
  adding: content/drive/MyDrive/Unsloth_Llama_1B_V2_Checkpoints/checkpoint-1250/ (stored 0%)
  adding: content/drive/MyDrive/Unsloth_Llama_1B_V2_Checkpoints/checkpoint-1250/tokenizer_config.json (deflated 96%)
  adding: content/drive/MyDrive/Unsloth_Llama_1B_V2_Checkpoints/checkpoint-1250/training_args.bin (deflated 53%)
  adding: content/drive/MyDrive/Unsloth_Llama_1B_V2_Checkpoints/checkpoint-1250/scheduler.pt (deflated 62%)
  adding: content/drive/MyDrive/Unsloth_Llama_1B_V2_Checkpoints/checkpoint-1250/trainer_state.json (deflated 85%)
  adding: content/drive/MyDrive/Unsloth_Llama_1B_V2_Checkpoints/checkpoint-1250/chat_template.jinja (deflated 72%)
  adding: content/drive/MyDrive/Unsloth_Llama_1B_V2_Checkpoints/checkpoint-1250/scaler.pt (deflated 64%)
  adding: content/drive/MyDrive/Unsloth_Llama_1B_V2_Checkpoints/checkpoint-1250/tokenizer.json (deflated 85%)
  adding: content

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Download initiated. Check your browser's download folder.


Please note that the checkpoint path `CHECKPOINT_PATH` refers to a directory. The code above will compress this directory into a zip file and then download it.
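
If the `zip` command is unavailable, the same compression step can be done with Python's standard library. The sketch below uses a hypothetical helper, `zip_directory`, and demonstrates it on a throwaway directory that stands in for a real checkpoint. Unlike the `zip -r` call above, it stores paths relative to the checkpoint's parent folder rather than the full `content/drive/MyDrive/...` prefix.

```python
import os
import shutil
import tempfile
import zipfile


def zip_directory(directory, output_dir):
    """Create <output_dir>/<dirname>.zip containing the directory; return the zip path."""
    directory = os.path.abspath(directory)
    base_name = os.path.join(output_dir, os.path.basename(directory))
    return shutil.make_archive(
        base_name,
        "zip",
        root_dir=os.path.dirname(directory),
        base_dir=os.path.basename(directory),
    )


# Demonstration on a temporary directory standing in for a checkpoint.
with tempfile.TemporaryDirectory() as workdir:
    fake_checkpoint = os.path.join(workdir, "checkpoint-demo")
    os.makedirs(fake_checkpoint)
    with open(os.path.join(fake_checkpoint, "trainer_state.json"), "w") as fh:
        fh.write("{}")
    archive = zip_directory(fake_checkpoint, workdir)
    archive_name = os.path.basename(archive)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
    print(archive_name)  # checkpoint-demo.zip
```

In Colab, the returned path can then be handed to `files.download(...)` as in the cell above.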

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
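
To make the two sampling settings concrete, here is a minimal sketch of temperature scaling followed by min-p filtering on a toy logit vector. It is an illustration under assumptions, not the code path used by `model.generate`: the order of the two steps and the tie handling may differ between libraries, and the logits are made up.

```python
import math


def min_p_probs(logits, temperature=1.5, min_p=0.1):
    """Temperature-scale logits, drop tokens below min_p * top probability, renormalize."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)               # cutoff relative to the most likely token
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


probs = min_p_probs([2.0, 1.0, 0.0, -3.0])
print([round(p, 3) for p in probs])
```

A higher temperature flattens the distribution, while the min-p cutoff removes tokens that are very unlikely relative to the top candidate, which is why the two are often used together.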

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.2",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nContinue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThis is a sequence of natural numbers following the Fibonacci formula<|eot_id|>']

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the full output!

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

Hello, welcome to Chat Doctor. I have understood your question. Now let's start your Fibonacci sequence with 1, 2 and then take 3. This pattern continues on, and hence the Fibonacci sequence is also known as the Lucas number or simply, Fibonacci sequence. But let's correct your statement - there isn't really such thing.<|eot_id|>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
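One rough way to confirm that only the adapters were written is to list the saved folder and its file sizes once the save cell below has run. The sketch assumes the usual PEFT file names (`adapter_config.json`, `adapter_model.safetensors`) and demonstrates on a throwaway folder with stand-in files. `summarize_directory` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

def summarize_directory(path):
    """Return (relative file name, size in bytes) pairs for all files under path."""
    root = Path(path)
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )

# Demo on a temporary folder with stand-in files (not real adapter weights).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "adapter_config.json").write_text("{}")
    (Path(tmp) / "adapter_model.safetensors").write_bytes(b"\0" * 1024)
    demo = summarize_directory(tmp)

print(demo)
```

On the real folder, `summarize_directory("lora_model")` should show an adapter file that is far smaller than the base model weights.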

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
model.push_to_hub("coolestGuyEver/lora_model", token = "") # Online saving
tokenizer.push_to_hub("coolestGuyEver/lora_model", token = "") # Online saving

Saved model to https://huggingface.co/coolestGuyEver/lora_model


Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Describe a tall tower in the capital of France."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Strongly NOT recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
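For intuition about what "merged" means in the options above: in the standard LoRA formulation, the low-rank update is folded back into the frozen weight as `W + (alpha / rank) * B @ A`. The sketch below shows that arithmetic on tiny hand-written matrices. It illustrates the general idea only and is not Unsloth's implementation, which also has to handle details such as quantized base weights. The names `matmul` and `merge_lora` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(base, lora_a, lora_b, alpha, rank):
    """Return base + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]

base = [[1.0, 0.0], [0.0, 1.0]]  # frozen weight, out_features x in_features
lora_a = [[0.5, -0.5]]           # rank x in_features
lora_b = [[1.0], [2.0]]          # out_features x rank
merged = merge_lora(base, lora_a, lora_b, alpha=2, rank=1)
print(merged)  # [[2.0, -1.0], [2.0, -1.0]]
```

After merging, the model no longer needs the separate adapter matrices at inference time, which is why merged checkpoints can be loaded by tools that do not know about LoRA.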

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and save to `q8_0` by default. Other methods such as `q4_k_m` are also supported. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).


And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news or join projects, feel free to join our [Discord](https://discord.gg/u54VK8m8tk)!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
10. [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
11. [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
12. [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>