To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### News

**[NEW] We've fixed many bugs in Phi-4** which greatly increases Phi-4's accuracy. See our [blogpost](https://unsloth.ai/blog/phi4)

[NEW] You can view all Phi-4 model uploads with our bug fixes including [dynamic 4-bit quants](https://unsloth.ai/blog/dynamic-4bit), GGUF & more [here](https://huggingface.co/collections/unsloth/phi-4-all-versions-677eecf93784e61afe762afa)

[NEW] As of Novemeber 2024, Unsloth now supports [vision finetuning](https://unsloth.ai/blog/vision)!


### Experiment Design

RQ1: Does an LLM understand knowledge better through fine-tuning or through retrieval?
- → Figure out loading BioASQ, PubMedQA, MedQA (do MedQA first)
- → Design prompting strategy for providing content (ablate over 3 prompting strategies, choose the best one for each method on the dev set) | Do some research into this
- → Do inference using Llama-3.1-8B 3 times using the context, and average (does performance differ from databricks to alien?) (figure out how to parse the output correctly)
- → Then do fine-tuning using Llama-3.1-8B 3 times and average
    - → This is when you want to do the epoch ablation, for each question, get the response for each epoch iteration
    - Ablation: how many epochs of fine-tuning does it take to understand a document?
- [1 week for the above?]
- Then, repeat on PubMedQA & then BioASQ
- Then, repeat this using Phi-14-B model (parsing output & prompting strategy might change), repeat on Qwen-2.5-32B i*f time permits*
- → → Output figures?
    - Bars | for each dataset, have different color bars for each model, and then do  shadings for different approaches
    - Ablation graph for each model & each dataset (9 lines)
- [1 week to repeat] [this repetition should be saved for the last week]

### Installation

In [None]:
!pip install unsloth

In [2]:
from unsloth import FastLanguageModel
import torch
import re 
from tqdm import tqdm
from datasets import load_dataset
import numpy as np
import pandas as pd
from datasets import Dataset

import sys
import os
sys.path.append(os.path.abspath(".."))
from importlib import reload
import utils.utils as utils
import utils.prompts as prompts
reload(utils)
reload(prompts)



ModuleNotFoundError: No module named 'unsloth'

In [1]:
max_seq_length = 1024 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.
# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

unsloth_models = {
    'llama-3.1-8b': "unsloth/Meta-Llama-3.1-8B",
    'llama-3.1-8b-instruct': "unsloth/Meta-Llama-3.1-8B-Instruct"
}


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth Zoo will now patch everything to make training faster!


### Unsloth

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # Using HF model ID directly
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    padding_side="right",
    use_fast=True,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    max_length=max_seq_length,
    torch_dtype="auto" if dtype is None else dtype,
    load_in_4bit=load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.48.2.
   \\   /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 23.586 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 8.6. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.33it/s]
Unsloth 2025.2.15 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj", "lm_head", "embed_tokens"],
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Already have LoRA adapters! We shall skip this step.


<a name="Data"></a>
### Data Prep
We now use the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a filtered version of 52K of the original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [26]:
import sys
# add .. to path
sys.path.append('..')
import utils.utils as utils
from importlib import reload
reload(utils)

# EOS_TOKEN = tokenizer.eos_token  # Ensure EOS_TOKEN is defined
import pandas as pd
from datasets import Dataset
from tqdm import tqdm  # Import tqdm for progress bars

# Read your CSV data.
dataset = utils.load_dataset('PubMedQA', split='test', start_index=0, end_index=2)

EOS_TOKEN = "tokenizer.eos_token"  # Use the actual tokenizer.eos_token

def format_pretraining_prompt(examples):
    return { "text" : ["\n".join(example['contexts']) + EOS_TOKEN for example in examples["context"]] }
corpus = dataset.map(format_pretraining_prompt, batched=True)

def format_qa_prompt(example):

    texts = []

    # Create a string with background, question, and prompt for yes/no answer

    background = "\n".join(example['context']['contexts'])
    question = example['question']
    prompt = f"Background: {background}\n\nQuestion: {question}\n\nPlease answer with Yes or No."
    texts.append(prompt + EOS_TOKEN)
    return {"text": texts}

qa_corpus = dataset.map(format_qa_prompt)
# Each example in the dataset now has a "text" field containing your full prompt.
print(qa_corpus[1]['text'])

Dataset({
    features: ['pubid', 'question', 'context', 'long_answer', 'final_decision'],
    num_rows: 1000
})


Map: 100%|██████████| 2/2 [00:00<00:00, 125.67 examples/s]

['Background: Assessment of visual acuity depends on the optotypes used for measurement. The ability to recognize different optotypes differs even if their critical details appear under the same visual angle. Since optotypes are evaluated on individuals with good visual acuity and without eye disorders, differences in the lower visual acuity range cannot be excluded. In this study, visual acuity measured with the Snellen E was compared to the Landolt C acuity.\n100 patients (age 8 - 90 years, median 60.5 years) with various eye disorders, among them 39 with amblyopia due to strabismus, and 13 healthy volunteers were tested. Charts with the Snellen E and the Landolt C (Precision Vision) which mimic the ETDRS charts were used to assess visual acuity. Three out of 5 optotypes per line had to be correctly identified, while wrong answers were monitored. In the group of patients, the eyes with the lower visual acuity, and the right eyes of the healthy subjects, were evaluated.\nDifferences b




In [8]:
def format_test_prompt(examples):
    return { "text" : [question  for question in examples["question"]] } # " Output just 'Yes' or 'No'."
test_dataset = dataset.map(format_test_prompt, batched = True,)
# Each example in the dataset now has a "text" field containing your full prompt.
print(test_dataset[1])

Map: 100%|██████████| 2/2 [00:00<00:00, 32.70 examples/s]

{'pubid': 16418930, 'question': 'Landolt C and snellen e acuity: differences in strabismus amblyopia?', 'context': {'contexts': ['Assessment of visual acuity depends on the optotypes used for measurement. The ability to recognize different optotypes differs even if their critical details appear under the same visual angle. Since optotypes are evaluated on individuals with good visual acuity and without eye disorders, differences in the lower visual acuity range cannot be excluded. In this study, visual acuity measured with the Snellen E was compared to the Landolt C acuity.', '100 patients (age 8 - 90 years, median 60.5 years) with various eye disorders, among them 39 with amblyopia due to strabismus, and 13 healthy volunteers were tested. Charts with the Snellen E and the Landolt C (Precision Vision) which mimic the ETDRS charts were used to assess visual acuity. Three out of 5 optotypes per line had to be correctly identified, while wrong answers were monitored. In the group of patie




### Continued-Pretraining

In [9]:
# Load the cleaned text file
with open("../data/cleaned_handbook.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Further clean the text
def remove_newline_between_chars(text):
    return re.sub(r"(\S)\n(\S)", r"\1 \2", text)

# Example usage
cleaned_text = remove_newline_between_chars(text)

# Tokenize and chunk text
def chunk_text(text, max_length):
    """Splits text into chunks while maintaining sentence boundaries."""
    import re
    sentences = re.split(r"(?<=[.!?]) +", text)  # Split by sentence
    chunks, current_chunk = [], ""
    
    for sentence in sentences:
        if len(tokenizer.encode(current_chunk + sentence)) < max_length:
            current_chunk += " " + sentence
        else:
            chunks.append(current_chunk.strip())
            current_chunk = sentence

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

# Create chunks
chunks = chunk_text(cleaned_text, max_seq_length)

# Add EOS token at the end of each chunk
EOS_TOKEN = tokenizer.eos_token
formatted_chunks = [{"text": chunk} for chunk in chunks]

# Convert to Hugging Face dataset
handbook = Dataset.from_list(formatted_chunks)

# Function for formatting
def formatting_prompts_func(examples):
    return {"text": [example for example in examples["text"]]}

# Apply formatting function
handbook = handbook.map(formatting_prompts_func, batched=True)

Map: 100%|██████████| 27/27 [00:00<00:00, 18215.57 examples/s]


In [10]:
handbook[0]

{'text': "# Emergency Severity Index, Version 4: Implementation Handbook\n\n### Note from the Director\n\nThe Agency for Healthcare Research and Quality is pleased to bring you the Emergency Severity _Index, Version 4: Implementation Handbook. This manual covers all details of the Emergency_ Severity Index (ESI)—a five-level emergency department triage algorithm that provides clinically relevant stratification of patients into five groups from 1 (most urgent) to 5 (least urgent) on the basis of acuity and resource needs.\n\nAfter emergency physicians Richard Wuerz and David Eitel developed the ESI in 1998 and pilot testing yielded favorable results, the ESI Triage Group was formed. Further work on the initial development of ESI was carried out under an AHRQ grant. The ESI Triage Group, which consisted of medical clinicians, managers, educators, and researchers, further refined the algorithm to what it is today.\n\nIn keeping with our mission to improve the quality, safety, efficiency, 

In [11]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments
from tqdm.auto import tqdm
import copy

# Store the original base model for reuse
base_model = copy.deepcopy(model)
results = []

# Hyperparameter to control how much of the dataset to process
num_samples = 10  # Change this to process more or fewer samples

# Ensure we don't exceed the dataset size
num_samples = min(num_samples, len(corpus))

# Loop through each corpus item and corresponding test item
for i in tqdm(range(num_samples)):
    print(f"\nProcessing item {i+1}/{num_samples}")
    
    # Reset to base model for each iteration
    model = copy.deepcopy(base_model)
    
    # Do inference BEFORE training to get baseline performance
    print("Running inference before training...")
    FastLanguageModel.for_inference(model)
    
    pre_train_inputs = tokenizer(
        [test_dataset[i]['text']], 
        return_tensors="pt"
    ).to("cuda")
    
    pre_train_outputs = model.generate(**pre_train_inputs, max_new_tokens=250, use_cache=True)
    pre_train_decoded = tokenizer.batch_decode(pre_train_outputs)[0]
    
    print(f"Pre-training output (truncated): {pre_train_decoded[:100]}...")
    
    # Train on single corpus item
    print("Training model...")
    trainer = UnslothTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=Dataset.from_dict({"text": [corpus[i]["text"]]}),
        dataset_text_field="text",
        max_seq_length=max_seq_length,
        dataset_num_proc=8,

        args=UnslothTrainingArguments(
            per_device_train_batch_size=2,
            gradient_accumulation_steps=8,
            warmup_ratio=0.1,
            num_train_epochs=10,
            learning_rate=5e-5,
            embedding_learning_rate=5e-6,
            fp16=not is_bfloat16_supported(),
            bf16=is_bfloat16_supported(),
            logging_steps=1,
            optim="adamw_8bit",
            weight_decay=0.00,
            lr_scheduler_type="cosine",
            seed=3407,
            output_dir=f"outputs/item_{i}",
            report_to="none",
        ),
    )
    
    # Train the model
    trainer.train()
    
    # Evaluate on corresponding test item AFTER training
    print("Running inference after training...")
    FastLanguageModel.for_inference(model)
    
    post_train_inputs = tokenizer(
        [test_dataset[i]['text']], 
        return_tensors="pt"
    ).to("cuda")
    
    post_train_outputs = model.generate(**post_train_inputs, max_new_tokens=250, use_cache=True)
    post_train_decoded = tokenizer.batch_decode(post_train_outputs)[0]
    
    # Store results
    results.append({
        "item_index": i,
        "input": test_dataset[i]['text'],
        "pre_train_output": pre_train_decoded,
        "post_train_output": post_train_decoded
    })
    
    print(f"Post-training output (truncated): {post_train_decoded[:100]}...")

# Display summary of results
print("\nExperiment Results Summary:")
for i, result in enumerate(results):
    print(f"Item {i+1}:")
    print(f"  Before training (truncated): {result['pre_train_output'][:100]}...")
    print(f"  After training (truncated):  {result['post_train_output'][:100]}...")
    print()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Map (num_proc=8): 100%|██████████| 27/27 [00:01<00:00, 18.38 examples/s]


In [12]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 27 | Num Epochs = 10
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 10
 "-____-"     Number of trainable parameters = 83,886,080


Step,Training Loss
1,2.0421
2,4.0493
3,4.0637
4,4.0577
5,4.0473
6,4.0705
7,3.9566
8,4.0535
9,4.1799
10,3.9547


<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off `max_steps=None`. We also support TRL's `DPOTrainer`!

In [6]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = True, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 100, # Set this for 1 full training run.
        # max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 25,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "llama_3_1_8B_+_KTAS_100",
        report_to = "none", # Use this for WandB etc
    ),
)

Generating train split: 97 examples [00:00, 1022.57 examples/s]


In [7]:
trainer_stats = trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 97 | Num Epochs = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 1,200
 "-____-"     Number of trainable parameters = 83,886,080
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to av

Step,Training Loss
25,0.5777
50,0.2757
75,0.2397
100,0.2171
125,0.1925
150,0.1499
175,0.0976
200,0.056
225,0.0315
250,0.0156


In [10]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 3090. Max memory = 23.586 GB.
15.316 GB of memory reserved.


In [11]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


['<|begin_of_text|>### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.\n\n### Input: A black/african american, 48.0-year-old woman arrives at the emergency department via ambulance. She has a temperature of 98.0°F, a heart rate of 102.0 bpm, a respiratory rate of 16.0 breaths per minute, oxygen saturation at 98.0%, systolic blood pressure of 126.0 mmHg, diastolic blood pressure of 88.0 mmHg, pain level reported as 0, and chief complaint described as "Anxiety".\n\n### Response: 2\n\n### Rationale:\nThe Emergency Severity Index (ESI) is a five-level triage acuity instrument that assesses the severity of a patient\'s condition. The levels range from Level 1 (most urgent) to Level 5 (least urgent).\n\nLevel 2: Level 2 patients require immediate attention, but do not require immediate life-saving interventions.\n\nBased on the provided information, the patient\'s vital signs are within normal limits, and her chief complaint is "Anxiety", which']

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [6]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    test_dataset[0]['text']

], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 250, use_cache = True)
tokenizer.batch_decode(outputs)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

["<|begin_of_text|>### Instruction: Based on the clinical presentation, determine the Emergency Severity Index (ESI) acuity for the following patient.\n\n### Input: A white, 59-year-old man arrives at the emergency department via walk-in. He has a temperature of 97.1°F, a heartrate of 94.0 bpm, a respiratory rate of 20.0 breaths/min, oxygen saturation at 94.0%, systolic blood pressure of 153.0 mmHg, diastolic blood pressure of 101.0 mmHg, pain level reported as '0', and a chief complaint described as 'Dyspnea'.\n\n### Response: 3.0<|eot_id|>"]

In [24]:
test_dataset[14]['text']

"### Instruction: Based on the clinical presentation, determine the Emergency Severity Index (ESI) acuity for the following patient.\n\n### Input: A white, 59-year-old man arrives at the emergency department via walk-in. He has a temperature of 97.1°F, a heartrate of 94.0 bpm, a respiratory rate of 20.0 breaths/min, oxygen saturation at 94.0%, systolic blood pressure of 153.0 mmHg, diastolic blood pressure of 101.0 mmHg, pain level reported as '0', and a chief complaint described as 'Dyspnea'.\n\n### Response: "

In [16]:
test_dataset[14]['text']

"### Instruction: Based on the clinical presentation, determine the Emergency Severity Index (ESI) acuity for the following patient.\n\n### Input: A white, 59-year-old man arrives at the emergency department via walk-in. He has a temperature of 97.1°F, a heartrate of 94.0 bpm, a respiratory rate of 20.0 breaths/min, oxygen saturation at 94.0%, systolic blood pressure of 153.0 mmHg, diastolic blood pressure of 101.0 mmHg, pain level reported as '0', and a chief complaint described as 'Dyspnea'.\n\n### Response: The ESI acuity for this patient is 2.0.<|eot_id|>"

In [7]:
from sklearn.metrics import cohen_kappa_score, mean_squared_error
from tqdm import tqdm  # Import tqdm for progress bars

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

def extract_response(text):
    # This pattern looks for "Response:" and then captures "Yes" or "No"
    match = re.search(r"Response:.*?(Yes|No)", text, re.DOTALL | re.IGNORECASE)
    if match:
        return match.group(1).lower()
    return None

# Initialize tracking variables
correct = 0
wrong = 0
y_true = []
y_pred = []

def generate_response(input_text):
    inputs = tokenizer([input_text], return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=75, use_cache=True)
    decoded_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    return extract_response(decoded_output)

# Iterate through test dataset
for i, sample in tqdm(enumerate(qa_corpus)):
    input_text = sample['text'][0]  # Get the formatted question from qa_corpus
    true_answer = sample['final_decision'].lower()  # Get ground truth (yes/no)
    
    predicted_answer = generate_response(input_text)
    
    if predicted_answer is not None:
        y_true.append(true_answer)
        y_pred.append(predicted_answer)
        if predicted_answer == true_answer:
            correct += 1
        else:
            wrong += 1
    else:
        print(f"Sample {i}: No valid response extracted.")
        wrong += 1

# Calculate accuracy
accuracy = correct / (correct + wrong) if (correct + wrong) > 0 else 0
print(f"Accuracy: {accuracy:.4f}")

# Save predictions and ground truth in a CSV file
df = pd.DataFrame({
    "Index": range(len(y_true)),
    "True_Answer": y_true,
    "Predicted_Answer": y_pred
})
df.to_csv("PubMedQA_predictions.csv", index=False)
print("Predictions and ground truth saved.")

# Calculate additional metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred, pos_label="yes", average="binary"),
    "recall": recall_score(y_true, y_pred, pos_label="yes", average="binary"),
    "f1": f1_score(y_true, y_pred, pos_label="yes", average="binary")
}

print("Overall Metrics:", metrics)
output_filepath = '../results/PubMedQA/PubMedQA-Llama_3.1'
utils.save_metrics(metrics, output_filepath)
print("Evaluation complete. Metrics and plots saved.")


1000it [02:27,  6.79it/s]

Predictions and ground truth saved.
Overall Metrics: {'overall': {'accuracy': 0.637, 'precision': 0.6334546236745339, 'recall': 0.637, 'f1_score': 0.6258363383536468, 'adjusted_accuracy': 0.982, 'adjusted_precision': 0.9823124848484849, 'adjusted_recall': 0.982, 'adjusted_f1': 0.9819162147117297, 'mae': 0.381, 'mse': 0.417, 'quadratic_kappa': np.float64(0.5953844720786176)}, 'by_class': {'0.0': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 0.0}, '1.0': {'precision': 0.5581395348837209, 'recall': 0.8135593220338984, 'f1-score': 0.6620689655172414, 'support': 59.0}, '2.0': {'precision': 0.6383763837638377, 'recall': 0.47790055248618785, 'f1-score': 0.5466034755134281, 'support': 362.0}, '3.0': {'precision': 0.6593959731543624, 'recall': 0.7766798418972332, 'f1-score': 0.7132486388384754, 'support': 506.0}, '4.0': {'precision': 0.5111111111111111, 'recall': 0.32857142857142857, 'f1-score': 0.4, 'support': 70.0}, '5.0': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'su


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [10]:
model.save_pretrained("llama_3_1_8B_+_KTAS_100")  # Local saving


In [22]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

SyntaxError: invalid decimal literal (2453640556.py, line 1)

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [5]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "outputs/checkpoint-12500", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

# inputs = tokenizer(
# [
#     alpaca_prompt.format(
#         "What is a famous tall tower in Paris?", # instruction
#         "", # input
#         "", # output - leave this blank for generation!
#     )
# ], return_tensors = "pt").to("cuda")

# from transformers import TextStreamer
# text_streamer = TextStreamer(tokenizer)
# _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

==((====))==  Unsloth 2025.1.8: Fast Llama patching. Transformers: 4.48.2.
   \\   /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 23.586 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.34it/s]
Unsloth 2025.1.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


NameError: name 'tqdm' is not defined

In [15]:

def quadratic_weighted_kappa(y_true, y_pred, min_rating=1, max_rating=5):
    hist_rater_a = np.histogram(y_true, bins=np.arange(min_rating, max_rating + 2))[0]
    hist_rater_b = np.histogram(y_pred, bins=np.arange(min_rating, max_rating + 2))[0]
    
    confusion = confusion_matrix(y_true, y_pred, labels=np.arange(min_rating, max_rating + 1))
    num_ratings = len(hist_rater_a)
    weights = np.array([[((i - j) ** 2) / ((num_ratings - 1) ** 2) for j in range(num_ratings)] for i in range(num_ratings)])
    expected = np.outer(hist_rater_a, hist_rater_b) / np.sum(hist_rater_a)
    kappa = 1.0 - (np.sum(weights * confusion) / np.sum(weights * expected))
    return kappa

def extract_response(text):
    match = re.search(r"Response:\s*(\d+)", text)
    return int(match.group(1)) if match else None

# Initialize tracking variables
correct = 0
wrong = 0
y_true = []
y_pred = []

def generate_response(input_text):
    print(input_text)
    inputs = tokenizer([input_text], return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=100, use_cache=True)
    decoded_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    print(decoded_output)
    return extract_response(decoded_output)

# Iterate through test dataset
for i, sample in tqdm(enumerate(test_dataset)):
    input_text = sample['text']
    true_acuity = sample['acuity']
    predicted_acuity = generate_response(input_text)
    
    if predicted_acuity is not None:
        # print(f"Sample {i}: True Acuity: {true_acuity}, Predicted: {predicted_acuity}")
        y_true.append(true_acuity)
        y_pred.append(predicted_acuity)
        if predicted_acuity == true_acuity:
            correct += 1
        else:
            wrong += 1
    else:
        print(f"Sample {i}: No valid response extracted.")
        wrong += 1

# Print accuracy
accuracy = correct / (correct + wrong) * 100
qwk_score = quadratic_weighted_kappa(y_true, y_pred)
print(f"Model Accuracy: {accuracy:.2f}%")
print(f"Quadratic Weighted Kappa (QWK): {qwk_score:.4f}")

0it [00:00, ?it/s]

### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A black/african american, 61.0-year-old man arrives at the emergency department via ambulance. He has a temperature of 98.6°F, a heart rate of 76.0 bpm, a respiratory rate of 20.0 breaths per minute, oxygen saturation at 99.0%, systolic blood pressure of 151.0 mmHg, diastolic blood pressure of 90.0 mmHg, pain level reported as 13, and chief complaint described as "SHORTNESS OF BREATH".

### Response: 


1it [00:00,  1.82it/s]

### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A black/african american, 61.0-year-old man arrives at the emergency department via ambulance. He has a temperature of 98.6°F, a heart rate of 76.0 bpm, a respiratory rate of 20.0 breaths per minute, oxygen saturation at 99.0%, systolic blood pressure of 151.0 mmHg, diastolic blood pressure of 90.0 mmHg, pain level reported as 13, and chief complaint described as "SHORTNESS OF BREATH".

### Response:  The estimated ESI acuity for this patient is 2.0.
Sample 0: No valid response extracted.
### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A white - other european, 62.0-year-old woman arrives at the emergency department via ambulance. She has a temperature of 97.7°F, a heart rate of 82.0 bpm, a respiratory rate of 24.0 breaths per minute, oxygen saturation at 100.0%, systolic blood pressure of 155.0 mmHg, diastolic blood press

2it [00:01,  1.97it/s]

### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A white - other european, 62.0-year-old woman arrives at the emergency department via ambulance. She has a temperature of 97.7°F, a heart rate of 82.0 bpm, a respiratory rate of 24.0 breaths per minute, oxygen saturation at 100.0%, systolic blood pressure of 155.0 mmHg, diastolic blood pressure of 83.0 mmHg, pain level reported as 3, and chief complaint described as "RLQ abdominal pain".

### Response:  The estimated ESI acuity for this patient is 3.0.
Sample 1: No valid response extracted.
### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A white, 56.0-year-old man arrives at the emergency department via ambulance. He has a temperature of 98.7°F, a heart rate of 88.0 bpm, a respiratory rate of 16.0 breaths per minute, oxygen saturation at 97.0%, systolic blood pressure of 140.0 mmHg, diastolic blood pressure of 84.0 mmHg, p

3it [00:01,  2.05it/s]

### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A white, 56.0-year-old man arrives at the emergency department via ambulance. He has a temperature of 98.7°F, a heart rate of 88.0 bpm, a respiratory rate of 16.0 breaths per minute, oxygen saturation at 97.0%, systolic blood pressure of 140.0 mmHg, diastolic blood pressure of 84.0 mmHg, pain level reported as 5, and chief complaint described as "R Flank pain, Right sided abdominal pain".

### Response:  The estimated ESI acuity for this patient is 3.0.
Sample 2: No valid response extracted.
### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A black/african american, 63.0-year-old woman arrives at the emergency department via ambulance. She has a temperature of 98.1°F, a heart rate of 80.0 bpm, a respiratory rate of 16.0 breaths per minute, oxygen saturation at 100.0%, systolic blood pressure of 180.0 mmHg, diastolic blood pr

4it [00:01,  2.10it/s]

### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A black/african american, 63.0-year-old woman arrives at the emergency department via ambulance. She has a temperature of 98.1°F, a heart rate of 80.0 bpm, a respiratory rate of 16.0 breaths per minute, oxygen saturation at 100.0%, systolic blood pressure of 180.0 mmHg, diastolic blood pressure of 90.0 mmHg, pain level reported as 3, and chief complaint described as "Neck pain, Back pain, MVC".

### Response:  The estimated ESI acuity for this patient is 4.0.
Sample 3: No valid response extracted.
### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A white, 55.0-year-old woman arrives at the emergency department via ambulance. She has a temperature of 98.6°F, a heart rate of 96.0 bpm, a respiratory rate of 16.0 breaths per minute, oxygen saturation at 96.0%, systolic blood pressure of 164.0 mmHg, diastolic blood pressure of 80

5it [00:02,  2.13it/s]

### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A white, 55.0-year-old woman arrives at the emergency department via ambulance. She has a temperature of 98.6°F, a heart rate of 96.0 bpm, a respiratory rate of 16.0 breaths per minute, oxygen saturation at 96.0%, systolic blood pressure of 164.0 mmHg, diastolic blood pressure of 80.0 mmHg, pain level reported as 3, and chief complaint described as "L Leg pain".

### Response:  The estimated ESI acuity for this patient is 3.0.
Sample 4: No valid response extracted.
### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A white, 78.0-year-old woman arrives at the emergency department via walk-in. She has a temperature of 98.0°F, a heart rate of 72.0 bpm, a respiratory rate of 18.0 breaths per minute, oxygen saturation at 97.0%, systolic blood pressure of 162.0 mmHg, diastolic blood pressure of 65.0 mmHg, pain level reported as 6, 

6it [00:02,  2.14it/s]

### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A white, 78.0-year-old woman arrives at the emergency department via walk-in. She has a temperature of 98.0°F, a heart rate of 72.0 bpm, a respiratory rate of 18.0 breaths per minute, oxygen saturation at 97.0%, systolic blood pressure of 162.0 mmHg, diastolic blood pressure of 65.0 mmHg, pain level reported as 6, and chief complaint described as "LUQ abdominal pain".

### Response:  The estimated ESI acuity for this patient is 3.0.
Sample 5: No valid response extracted.
### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A black/african american, 52.0-year-old woman arrives at the emergency department via ambulance. She has a temperature of 98.4°F, a heart rate of 79.0 bpm, a respiratory rate of 18.0 breaths per minute, oxygen saturation at 99.0%, systolic blood pressure of 117.0 mmHg, diastolic blood pressure of 66.0 mmHg, p

7it [00:03,  2.16it/s]

### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A black/african american, 52.0-year-old woman arrives at the emergency department via ambulance. She has a temperature of 98.4°F, a heart rate of 79.0 bpm, a respiratory rate of 18.0 breaths per minute, oxygen saturation at 99.0%, systolic blood pressure of 117.0 mmHg, diastolic blood pressure of 66.0 mmHg, pain level reported as 8, and chief complaint described as "R LEG PAIN".

### Response:  The estimated ESI acuity for this patient is 3.0.
Sample 6: No valid response extracted.
### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A black/african, 28.0-year-old woman arrives at the emergency department via walk-in. She has a temperature of 99.7°F, a heart rate of 100.0 bpm, a respiratory rate of 16.0 breaths per minute, oxygen saturation at 100.0%, systolic blood pressure of 124.0 mmHg, diastolic blood pressure of 62.0 mmHg,

8it [00:03,  2.16it/s]

### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A black/african, 28.0-year-old woman arrives at the emergency department via walk-in. She has a temperature of 99.7°F, a heart rate of 100.0 bpm, a respiratory rate of 16.0 breaths per minute, oxygen saturation at 100.0%, systolic blood pressure of 124.0 mmHg, diastolic blood pressure of 62.0 mmHg, pain level reported as 10, and chief complaint described as "SORE THROAT".

### Response:  The estimated ESI acuity for this patient is 3.0.
Sample 7: No valid response extracted.
### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A hispanic/latino - dominican, 68.0-year-old woman arrives at the emergency department via walk-in. She has a temperature of 99.3°F, a heart rate of 79.0 bpm, a respiratory rate of 16.0 breaths per minute, oxygen saturation at 100.0%, systolic blood pressure of 142.0 mmHg, diastolic blood pressure of 82.0

9it [00:04,  2.17it/s]

### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A hispanic/latino - dominican, 68.0-year-old woman arrives at the emergency department via walk-in. She has a temperature of 99.3°F, a heart rate of 79.0 bpm, a respiratory rate of 16.0 breaths per minute, oxygen saturation at 100.0%, systolic blood pressure of 142.0 mmHg, diastolic blood pressure of 82.0 mmHg, pain level reported as 10, and chief complaint described as "Finger swelling".

### Response:  The estimated ESI acuity for this patient is 3.0.
Sample 8: No valid response extracted.
### Instruction: Estimate the Emergency Severity Index (ESI) acuity for the following patient.

### Input: A black/cape verdean, 23.0-year-old man arrives at the emergency department via ambulance. He has a temperature of 97.3°F, a heart rate of 75.0 bpm, a respiratory rate of 16.0 breaths per minute, oxygen saturation at 98.0%, systolic blood pressure of 108.0 mmHg, diastolic blood pressure o

9it [00:04,  1.99it/s]


KeyboardInterrupt: 

You can also use Hugging Face's `AutoModelForPeftCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # I highly do NOT suggest - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for VLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
6. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
