To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News

**Read our [blog post](https://unsloth.ai/blog/r1-reasoning) for guidance on how to train reasoning models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
# Normally using pip install unsloth is enough

# Temporarily as of Jan 31st 2025, Colab has some issues with Pytorch
# Using pip install unsloth will take 3 minutes, whilst the below takes <1 minute:
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29 peft trl triton
!pip install --no-deps cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth

### Unsloth

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.2.15 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
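
The trainable-parameter count that the training banner reports further down (41,943,040) can be reproduced by hand. The sketch below assumes the published Llama-3.1-8B layer shapes (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128); those dimensions are not stated in this notebook, and `lora_params` is a hypothetical helper.

```python
# Each LoRA adapter adds two matrices per target module:
# A with shape (r, in_features) and B with shape (out_features, r).
r = 16
num_layers = 32  # matches "patched 32 layers" in the output above

# (in_features, out_features) per target module.
# Assumed Llama-3.1-8B shapes, not taken from this notebook.
module_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),   # 8 KV heads * 128 head dim
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA adapter of the given rank."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, r) for i, o in module_shapes.values())
total = per_layer * num_layers
print(f"{total:,} trainable LoRA parameters")  # 41,943,040 trainable LoRA parameters
```

Doubling `r` doubles this count, which is why the rank is the main knob for trading adapter capacity against memory.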


<a name="Data"></a>
### Data Prep
We now use the Glaive Function Calling dataset from [madroid](https://huggingface.co/datasets/madroid/glaive-function-calling-openai), which is a version of the original [Glaive Function Calling v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2) pre-processed to facilitate integration. You can replace this code section with your own data prep.

**[NOTE]** Each model has its own Tool Calling template. For `llama-3.1` we'll use the [user defined custom tools](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/#user-defined-custom-tool-calling) template. If you want to use another model and/or template, you'll need to write your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).
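
To make the note on completion-only training concrete, here is a minimal sketch of the underlying idea: prompt tokens get the label `-100`, which PyTorch's cross-entropy loss ignores by default, so only completion tokens contribute to the loss. This is not TRL's implementation; `mask_prompt_tokens` and the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_prompt_tokens(input_ids, response_start):
    """Return labels in which every token before response_start is ignored."""
    labels = list(input_ids)
    for i in range(min(response_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Hypothetical token ids: 5 prompt tokens, then 3 completion tokens.
# The last id (2) stands in for the end-of-sequence token.
input_ids = [101, 7, 8, 9, 102, 42, 43, 2]
labels = mask_prompt_tokens(input_ids, response_start=5)
print(labels)  # [-100, -100, -100, -100, -100, 42, 43, 2]
```

The end-of-sequence token stays in the labels, so the model is still trained to stop after its answer.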

In [4]:
#@title Define system prompt and message delimiters
system_prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>


Environment: ipython
Cutting Knowledge Date: December 2023
Today Date: 23 July 2024

# Tool Instructions
- Always execute python code in messages that you share.
- When looking for real time information use relevant functions if available else let the user know



You have access to the following functions:

{functions}


If a you choose to call a function ONLY reply in the following format:
<{{start_tag}}={{function_name}}>{{parameters}}{{end_tag}}
where

start_tag => `<function`
parameters => a JSON dict with the function argument name as key and function argument value as value.
end_tag => `</function>`

Here is an example,
<function=example_function_name>{{"example_name": "example_value"}}</function>

Reminder:
- Function calls MUST follow the specified format
- Required parameters MUST be specified
- Only call one function at a time
- Put the entire function call reply on one line
- Always add your sources when using search results to answer the user query

You are a helpful assistant.<|eot_id|>"""

user_message = "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>"
assistant_message = "<|start_header_id|>assistant<|end_header_id|>                      \n\n{}<|eot_id|>"
assistant_tool_message = "<|start_header_id|>assistant<|end_header_id|>                      \n\n{}<|eom_id|>"
tool_response_message = "<|start_header_id|>ipython<|end_header_id|>\n\n{}<|eot_id|>"
assistant_continuation_prefix = "<|start_header_id|>assistant<|end_header_id|>                      "
assistant_continuation_message = "<|start_header_id|>assistant<|end_header_id|>                      \n\n{}<|eot_id|>"
function_string_template = """Use the function '{name}' to: {description}\n{schema}"""

In [5]:
#@title Util processing functions
import ast, json

def convert_tool_format(tool):
    func = tool.get("function", {})
    name = func.get("name", "")
    description = func.get("description", "")
    parameters_a = func.get("parameters", {})
    properties = parameters_a.get("properties", {})
    required_params = parameters_a.get("required", [])
    def map_type(a_type, a_format=None):
        if a_type == "string":
            return "string"
        elif a_type == "number":
            return "int"
        elif a_type == "boolean":
            return "bool"
        return a_type
    parameters_b = {}
    for param, details in properties.items():
        parameters_b[param] = {
            "param_type": map_type(details.get("type"), details.get("format")),
            "description": details.get("description", ""),
            "required": param in required_params
        }
    return {
        "name": name,
        "description": description,
        "parameters": parameters_b
    }

def get_function_string(f):
    converted_tool = convert_tool_format(f)
    return function_string_template.format(
        name=converted_tool["name"],
        description=converted_tool["description"],
        schema=json.dumps(converted_tool)
    )

def convert_function_call_format(call):
    func_data = call.get("function", {})
    func_name = func_data.get("name", "")
    arguments_str = func_data.get("arguments", "{}")
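    try:
        # Tool-call arguments are normally a JSON string (handles true/false/null)
        arguments_dict = json.loads(arguments_str)
    except Exception:
        try:
            # Fall back to Python-literal syntax (e.g. single-quoted strings)
            arguments_dict = ast.literal_eval(arguments_str)
        except Exception:
            arguments_dict = {}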
    arguments_json = json.dumps(arguments_dict)
    return f"<function={func_name}>{arguments_json}</function>"

def process_block(block):
    tool_index = None
    for i, msg in enumerate(block):
        if msg["role"] == "assistant" and "tool_calls" in msg:
            tool_index = i
            break
    filtered_block = []
    if tool_index is not None:
        for i, msg in enumerate(block):
            if msg["role"] == "assistant" and i < tool_index:
                continue
            filtered_block.append(msg)
    else:
        filtered_block = block
    block_context = ""
    tool_called = False
    for msg in filtered_block:
        if msg["role"] == "assistant":
            if "tool_calls" in msg:
                block_context += assistant_tool_message.format(convert_function_call_format(msg["tool_calls"][0]))
            else:
                if tool_called:
                    block_context += assistant_continuation_message.format(msg["content"])
                    tool_called = False
                else:
                    block_context += assistant_message.format(msg["content"])
        elif msg["role"] == "tool":
            block_context += tool_response_message.format(msg["content"])
            tool_called = True
    return block_context

def get_formatted_sample(sample):
    functions_string = "\n\n".join([get_function_string(f) for f in sample.get("tools", [])])
    context = system_prompt.format(functions=functions_string)
    block = []
    for message in sample["messages"]:
        if message["role"] == "system":
            continue
        elif message["role"] == "user":
            if block:
                context += process_block(block)
                block = []
            context += user_message.format(message["content"])
        else:
            block.append(message)
    if block:
        context += process_block(block)
    return context


In [6]:
#@title Process dataset

def formatting_prompts_func(examples):
    _json = examples["json"]
    texts = []
    for sample_obj in _json:
        # sample_obj = json.loads(sample["json"])
        text = get_formatted_sample(json.loads(sample_obj))
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("madroid/glaive-function-calling-openai", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/112754 [00:00<?, ? examples/s]

In [7]:
print(dataset[0]["text"])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>


Environment: ipython
Cutting Knowledge Date: December 2023
Today Date: 23 July 2024

# Tool Instructions
- Always execute python code in messages that you share.
- When looking for real time information use relevant functions if available else let the user know



You have access to the following functions:

Use the function 'track_calories' to: Track daily calorie intake
{"name": "track_calories", "description": "Track daily calorie intake", "parameters": {"meal": {"param_type": "string", "description": "The meal for which calories are being tracked", "required": true}, "calories": {"param_type": "int", "description": "The number of calories consumed", "required": true}, "date": {"param_type": "string", "description": "The date for which calories are being tracked", "required": true}}}


If a you choose to call a function ONLY reply in the following format:
<{start_tag}={function_name}>{parameters}{end_tag}
where

start_tag

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We run only 60 steps to speed things up. For a full run, set `num_train_epochs = 1` and remove the `max_steps` setting. We also support TRL's `DPOTrainer`!
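
As a rough guide to how small the 60-step demo is, the arithmetic below uses the batch settings from the training cell and the example count from the training banner. It is only an estimate that assumes a single GPU; the exact step count per epoch depends on how the trainer rounds the last partial batch.

```python
per_device_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60
num_examples = 112_754  # "Num examples" in the training banner

# One optimizer step consumes this many examples on a single GPU.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
examples_seen = effective_batch_size * max_steps
approx_steps_per_epoch = num_examples // effective_batch_size

print(effective_batch_size)    # 8
print(examples_seen)           # 480
print(approx_steps_per_epoch)  # 14094
```

So the demo run sees well under 1% of the training split, while a full epoch takes on the order of 14,000 optimizer steps.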

In [8]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

In [9]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
5.516 GB of memory reserved.


In [10]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 112,754 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|------|---------------|
| 1 | 2.022 |
| 2 | 2.1472 |
| 3 | 2.3737 |
| 4 | 1.6812 |
| 5 | 1.5396 |
| 6 | 1.7615 |
| 7 | 1.6895 |
| 8 | 1.5246 |
| 9 | 1.1073 |
| 10 | 1.1228 |


In [11]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1263.8918 seconds used for training.
21.06 minutes used for training.
Peak reserved memory = 7.467 GB.
Peak reserved memory for training = 1.951 GB.
Peak reserved memory % of max memory = 50.655 %.
Peak reserved memory for training % of max memory = 13.235 %.


<a name="Inference"></a>
### Inference
Let's run the model! We'll load the `test` split of our dataset and prepare it for generation.

**[NOTE]** To use the model's tool calling capabilities in a more streamlined way you should use a scaffolding framework such as [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps). For the scope of this demo we will test the model manually.

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [12]:
dataset_test = load_dataset("madroid/glaive-function-calling-openai", split = "test")
dataset_test = dataset_test.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/967 [00:00<?, ? examples/s]

**Model determining which tool to call**

We feed the model with the generation prompt. It responds with a tool call in the following format:

```
<function=example_function_name>{"example_name": "example_value"}</function><|eom_id|>
```

In [13]:
test_sample = dataset_test[128]["text"]

FastLanguageModel.for_inference(model) # Enable native 2x faster inference


context = test_sample+assistant_continuation_prefix
inputs = tokenizer(
[
    context,
], return_tensors = "pt").to("cuda")

print(context)
print("==="*10)


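outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)

# Decode only the newly generated tokens. Slicing the decoded string by
# len(context) misaligns when the tokenizer prepends an extra <|begin_of_text|>.
prompt_length = inputs["input_ids"].shape[1]
output_text = tokenizer.batch_decode(outputs[:, prompt_length:])[0]
print(output_text)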

<|begin_of_text|><|start_header_id|>system<|end_header_id|>


Environment: ipython
Cutting Knowledge Date: December 2023
Today Date: 23 July 2024

# Tool Instructions
- Always execute python code in messages that you share.
- When looking for real time information use relevant functions if available else let the user know



You have access to the following functions:

Use the function 'calculate_fuel_consumption' to: Calculate the fuel consumption based on distance and fuel efficiency
{"name": "calculate_fuel_consumption", "description": "Calculate the fuel consumption based on distance and fuel efficiency", "parameters": {"distance": {"param_type": "int", "description": "The distance traveled", "required": true}, "fuel_efficiency": {"param_type": "int", "description": "The fuel efficiency in kilometers per liter", "required": true}}}


If a you choose to call a function ONLY reply in the following format:
<{start_tag}={function_name}>{parameters}{end_tag}
where

start_tag => `<functi

In [14]:
#@title **User-defined Custom tools**
import re
import json


# function to parse model's response
def parse_function_call(s: str):
    # Regex pattern to extract function name and JSON arguments
    match = re.search(r"<function=(\w+)>(\{.*?\})</function>", s)

    if match:
        function_name = match.group(1)  # Extract function name
        args_json = match.group(2)      # Extract JSON string
        args = json.loads(args_json)    # Parse JSON to dictionary
        return function_name, args
    else:
        return None, None


# CUSTOM TOOLS
def calculate_loan_emi(loan_amount: int, interest_rate: int, loan_term: int) -> float:
    monthly_interest_rate = (interest_rate / 100) / 12

    if monthly_interest_rate == 0:
        emi = loan_amount / loan_term
    else:
        emi = (loan_amount * monthly_interest_rate * (1 + monthly_interest_rate) ** loan_term) / \
              ((1 + monthly_interest_rate) ** loan_term - 1)

    return round(emi, 2)


def calculate_fuel_consumption(distance: int, fuel_efficiency: int) -> float:
    if fuel_efficiency <= 0:
        raise ValueError("Fuel efficiency must be greater than zero.")

    fuel_consumed = distance / fuel_efficiency
    return round(fuel_consumed, 2)


TOOLS = {
    "calculate_loan_emi": calculate_loan_emi,
    "calculate_fuel_consumption": calculate_fuel_consumption,
}

**The result of the tool call is passed back to the model, which generates the final response to the user**

Now we add the tool call from the previous generation and its result to the context. The model then generates the final response.
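
Before a tool result is appended to the context, it can help to guard the call itself, since the model may name a function that is not registered or leave out an argument. A minimal sketch of such a guard follows; `safe_dispatch`, `REGISTRY`, `REQUIRED`, and the stand-in tool are hypothetical and not part of this notebook's code.

```python
import json

def fuel_consumption(distance, fuel_efficiency):
    """Hypothetical stand-in tool: litres used for a trip."""
    return round(distance / fuel_efficiency, 2)

# Registered tools and the arguments each one requires (illustrative only).
REGISTRY = {"calculate_fuel_consumption": fuel_consumption}
REQUIRED = {"calculate_fuel_consumption": {"distance", "fuel_efficiency"}}

def safe_dispatch(function_name, arguments):
    """Run a tool only if it is registered and all required arguments are present."""
    if function_name not in REGISTRY:
        return json.dumps({"error": f"unknown function: {function_name}"})
    missing = REQUIRED[function_name] - set(arguments)
    if missing:
        return json.dumps({"error": f"missing arguments: {sorted(missing)}"})
    return json.dumps({"result": REGISTRY[function_name](**arguments)})

ok = safe_dispatch("calculate_fuel_consumption", {"distance": 500, "fuel_efficiency": 20})
bad = safe_dispatch("book_flight", {})
print(ok)   # {"result": 25.0}
print(bad)  # {"error": "unknown function: book_flight"}
```

Returning the error as a JSON string means it can be fed back to the model as the tool response, giving it a chance to correct the call.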

In [15]:
# Parse and execute tool given model output
function_name, arguments = parse_function_call(output_text)


if function_name is not None:
  tool_response = TOOLS[function_name](**arguments)

  # Prepare context
  context = test_sample # original input
  # Add tool call and response
  context += assistant_tool_message.format(output_text)
  context += tool_response_message.format(tool_response)
  # Add generation prompt
  context += assistant_continuation_prefix

  FastLanguageModel.for_inference(model) # Enable native 2x faster inference
  inputs = tokenizer(
  [
      context
  ], return_tensors = "pt").to("cuda")

  print(context)
  print("==="*20)

  outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)

  # Decode only the newly generated tokens instead of slicing by len(context)
  prompt_length = inputs["input_ids"].shape[1]
  output_text_chat = tokenizer.batch_decode(outputs[:, prompt_length:])[0]
  print(output_text_chat)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>


Environment: ipython
Cutting Knowledge Date: December 2023
Today Date: 23 July 2024

# Tool Instructions
- Always execute python code in messages that you share.
- When looking for real time information use relevant functions if available else let the user know



You have access to the following functions:

Use the function 'calculate_fuel_consumption' to: Calculate the fuel consumption based on distance and fuel efficiency
{"name": "calculate_fuel_consumption", "description": "Calculate the fuel consumption based on distance and fuel efficiency", "parameters": {"distance": {"param_type": "int", "description": "The distance traveled", "required": true}, "fuel_efficiency": {"param_type": "int", "description": "The fuel efficiency in kilometers per liter", "required": true}}}


If a you choose to call a function ONLY reply in the following format:
<{start_tag}={function_name}>{parameters}{end_tag}
where

start_tag => `<functi

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the full output!

In [16]:
# Stream the same tool-calling prompt token by token
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    test_sample+assistant_continuation_prefix,
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>


Environment: ipython
Cutting Knowledge Date: December 2023
Today Date: 23 July 2024

# Tool Instructions
- Always execute python code in messages that you share.
- When looking for real time information use relevant functions if available else let the user know



You have access to the following functions:

Use the function 'calculate_fuel_consumption' to: Calculate the fuel consumption based on distance and fuel efficiency
{"name": "calculate_fuel_consumption", "description": "Calculate the fuel consumption based on distance and fuel efficiency", "parameters": {"distance": {"param_type": "int", "description": "The distance traveled", "required": true}, "fuel_efficiency": {"param_type": "int", "description": "The fuel efficiency in kilometers per liter", "required": true}}}


If a you choose to call a function ONLY reply in the following format:
<{start_tag}={function_name}>{parameters}{end_tag}
where

star

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
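
To get a feel for how small an adapter-only save is, the estimate below multiplies the trainable-parameter count from the training banner by the bytes per parameter. The dtype in which the adapters end up on disk is an assumption here, and file-format overhead is ignored.

```python
trainable_params = 41_943_040  # "Number of trainable parameters" in the training banner

# Estimated adapter size for two possible storage dtypes (assumed, not measured).
bytes_per_param = {"float32": 4, "float16": 2}
sizes_mib = {
    name: trainable_params * nbytes / 1024**2
    for name, nbytes in bytes_per_param.items()
}

for name, size in sizes_mib.items():
    print(f"{name}: {size:.0f} MiB")
# float32: 160 MiB
# float16: 80 MiB
```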

In [17]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [18]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

test_sample = dataset_test[128]["text"]
context = test_sample+assistant_continuation_prefix
inputs = tokenizer(
[
    context,
], return_tensors = "pt").to("cuda")

print(context)
print("==="*10)


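outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)

# Decode only the newly generated tokens. Slicing the decoded string by
# len(context) misaligns when the tokenizer prepends an extra <|begin_of_text|>.
prompt_length = inputs["input_ids"].shape[1]
output_text = tokenizer.batch_decode(outputs[:, prompt_length:])[0]
print(output_text)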

<|begin_of_text|><|start_header_id|>system<|end_header_id|>


Environment: ipython
Cutting Knowledge Date: December 2023
Today Date: 23 July 2024

# Tool Instructions
- Always execute python code in messages that you share.
- When looking for real time information use relevant functions if available else let the user know



You have access to the following functions:

Use the function 'calculate_fuel_consumption' to: Calculate the fuel consumption based on distance and fuel efficiency
{"name": "calculate_fuel_consumption", "description": "Calculate the fuel consumption based on distance and fuel efficiency", "parameters": {"distance": {"param_type": "int", "description": "The distance traveled", "required": true}, "fuel_efficiency": {"param_type": "int", "description": "The fuel efficiency in kilometers per liter", "required": true}}}


If a you choose to call a function ONLY reply in the following format:
<{start_tag}={function_name}>{parameters}{end_tag}
where

start_tag => `<functi

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Not recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
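
Conceptually, a merged save folds each adapter into its base weight using the standard LoRA update `W + (alpha / r) * B @ A`. The toy example below illustrates that formula in plain Python; it is not Unsloth's merge code, and `matmul`, `merge_lora`, and all values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

# Toy 2x2 layer with a rank-1 adapter (all values hypothetical).
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5, -0.5]]    # shape (rank, in_features)
lora_b = [[2.0], [4.0]]   # shape (out_features, rank)

merged = merge_lora(w, lora_a, lora_b, alpha=1, rank=1)
print(merged)  # [[2.0, -1.0], [2.0, -1.0]]
```

With this notebook's settings (`lora_alpha = 16`, `r = 16`, `use_rslora = False`) the scale factor `alpha / r` is 1, so the adapter product is added to the base weight unscaled.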

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively! We clone `llama.cpp` and save to `q8_0` by default. All other methods, such as `q4_k_m`, are also allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
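
To give an intuition for what `q8_0` does, the sketch below quantizes one block of 32 weights to 8-bit integer codes with a single shared scale, which is the general idea behind that format. It is a simplified illustration, not llama.cpp's code: the real format stores the scale as float16, and the helper names and sample weights here are hypothetical.

```python
BLOCK_SIZE = 32  # q8_0 groups weights into blocks of 32

def quantize_block(values):
    """Quantize one block to int8 codes plus a single shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax else 0.0
    codes = [round(v / scale) if scale else 0 for v in values]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate weights from the codes and the scale."""
    return [scale * c for c in codes]

block = [(i - 16) / 8 for i in range(BLOCK_SIZE)]  # hypothetical weights
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))

# 32 one-byte codes + one 2-byte float16 scale = 34 bytes per 32 weights.
bits_per_weight = (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE
print(bits_per_weight)  # 8.5
```

The rounding error per weight is bounded by half the block's scale, which is why 8-bit quantization stays close to the original weights while roughly halving the size of a 16-bit model.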

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI-based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, need help, or want to join projects and keep up with the latest LLM news, feel free to join our Discord!

Some other links:
1. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

