# Fine-tune Llama 3.1 8B with Unsloth
> 🗣️ [Large Language Model Course](https://github.com/mlabonne/llm-course)

❤️ Created by [@maximelabonne](https://twitter.com/maximelabonne).

In [None]:
!pip install -qqq "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" --progress-bar off
!pip install -qqq --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes --progress-bar off

import torch
from trl import SFTTrainer
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel, is_bfloat16_supported

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for unsloth (pyproject.toml) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 17.0.0 which is incompatible.
ibis-framework 8.0.0 requires pyarrow<16,>=2, but you have pyarrow 17.0.0 which is incompatible.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


## 1. Load model for PEFT

In [None]:
# Load model
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)

# Prepare model for PEFT
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth"
)
model.print_trainable_parameters()  # prints the summary itself; wrapping it in print() only adds "None"

==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.

trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196
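
The trainable-parameter count reported above can be reproduced by hand. The sketch below assumes the publicly documented Llama 3.1 8B shapes (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these values do not appear in this notebook. A LoRA adapter on a linear layer with `n_in` inputs and `n_out` outputs adds `r * (n_in + n_out)` weights. The last lines compare the rank-stabilized scaling `alpha / sqrt(r)` that `use_rslora=True` selects with the classic `alpha / r`.

```python
import math

# Assumed Llama 3.1 8B shapes (from the public model config, not from this notebook)
HIDDEN, MLP, KV_DIM, LAYERS = 4096, 14336, 8 * 128, 32
R, ALPHA = 16, 16  # values passed to get_peft_model above

# (in_features, out_features) of each targeted projection
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_param_count(r: int) -> int:
    """A is (r x n_in) and B is (n_out x r), so each adapter holds r * (n_in + n_out) weights."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in SHAPES.values())
    return per_layer * LAYERS

trainable = lora_param_count(R)
standard_scale = ALPHA / R            # classic LoRA scaling
rslora_scale = ALPHA / math.sqrt(R)   # rank-stabilized LoRA scaling

print(f"{trainable:,}")               # 41,943,040
print(standard_scale, rslora_scale)   # 1.0 4.0
```

With `r=16` and `lora_alpha=16`, rsLoRA therefore scales the adapter update four times more strongly than standard LoRA would, which is worth keeping in mind when choosing the learning rate.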


## 2. Prepare data and tokenizer

In [None]:
tokenizer = get_chat_template(
    tokenizer,
    chat_template="chatml",
    mapping={"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}
)

def apply_template(examples):
    messages = examples["conversations"]
    text = [tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False) for message in messages]
    return {"text": text}

dataset = load_dataset("chukypedro/sta_dataset", split="train")
dataset = dataset.map(apply_template, batched=True)
dataset = dataset.train_test_split(test_size=0.1)

Unsloth: Will map <|im_end|> to EOS = <|end_of_text|>.


In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['conversations', 'text'],
        num_rows: 960
    })
    test: Dataset({
        features: ['conversations', 'text'],
        num_rows: 107
    })
})
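
To make the `text` column more concrete, here is a pure-Python approximation of what the ChatML template does with the ShareGPT-style mapping configured above. It is a sketch only: the real template applied by `get_chat_template` may differ in details such as special-token handling, and the sample conversation is made up for illustration.

```python
# Role names used by the ShareGPT-style dataset, mapped to ChatML role names
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def to_chatml(conversation, add_generation_prompt=False):
    """Render a list of {"from": ..., "value": ...} turns as ChatML text."""
    parts = []
    for turn in conversation:
        role = ROLE_MAP[turn["from"]]
        parts.append(f"<|im_start|>{role}\n{turn['value']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

# Hypothetical conversation, not taken from the dataset
sample = [
    {"from": "human", "value": "Write a test case."},
    {"from": "gpt", "value": "*** Test Cases ***"},
]
chatml_text = to_chatml(sample)
print(chatml_text)
```

During training the full conversation is rendered (`add_generation_prompt=False`); at inference time the assistant turn is left open so the model completes it.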

## 3. Training

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        learning_rate=3e-4,
        lr_scheduler_type="linear",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=40,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_strategy="steps",
        logging_steps=10,
        eval_strategy="steps",  # called evaluation_strategy in older transformers releases
        eval_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=10,
        output_dir="output",
        seed=0,
    ),
)

trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 178 | Num Epochs = 40
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 440
 "-____-"     Number of trainable parameters = 41,943,040


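| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 10   | 0.9182        | 0.551157        |
| 20   | 0.4866        | 0.416282        |
| 30   | 0.3852        | 0.380434        |
| 40   | 0.322         | 0.36991         |
| 50   | 0.2677        | 0.376304        |
| 60   | 0.209         | 0.390974        |
| 70   | 0.1769        | 0.498394        |
| 80   | 0.1307        | 0.475031        |
| 90   | 0.1042        | 0.503167        |
| 100  | 0.0789        | 0.528368        |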
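
In the log above, the validation loss bottoms out early and then climbs while the training loss keeps falling, which usually points to overfitting. The sketch below locates the best logged step; it covers only the ten rows shown here, since later steps are not listed in this output.

```python
# (step, training loss, validation loss) copied from the log above
LOG = [
    (10, 0.9182, 0.551157),
    (20, 0.4866, 0.416282),
    (30, 0.3852, 0.380434),
    (40, 0.322, 0.36991),
    (50, 0.2677, 0.376304),
    (60, 0.209, 0.390974),
    (70, 0.1769, 0.498394),
    (80, 0.1307, 0.475031),
    (90, 0.1042, 0.503167),
    (100, 0.0789, 0.528368),
]

# Row with the lowest validation loss
best_step, _, best_val = min(LOG, key=lambda row: row[2])
# Steps logged after the best one, all with a higher validation loss
later_steps = [step for step, _, val in LOG if step > best_step and val > best_val]

print(best_step, best_val)  # 40 0.36991
```

One option, not used in this notebook, would be to set `load_best_model_at_end=True` in `TrainingArguments` and add `transformers.EarlyStoppingCallback`, so that training stops once the evaluation loss no longer improves.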


TrainOutput(global_step=440, training_loss=0.07762399127452889, metrics={'train_runtime': 4433.5071, 'train_samples_per_second': 1.606, 'train_steps_per_second': 0.099, 'total_flos': 6.456274085184799e+17, 'train_loss': 0.07762399127452889, 'epoch': 39.111111111111114})
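
The step and epoch counts in the training banner and in `TrainOutput` follow from the number of examples reported in the banner. The sketch below assumes that the incomplete accumulation group at the end of each epoch does not count as an optimizer step, that is, steps per epoch = floor(micro-batches / accumulation steps). Under that assumption the arithmetic reproduces the logged values.

```python
import math

num_examples = 178        # "Num examples" reported in the banner (packing=True)
per_device_batch = 4
grad_accum = 4
num_epochs = 40

# Micro-batches the dataloader yields per pass over the data
micro_batches_per_epoch = math.ceil(num_examples / per_device_batch)
# Optimizer updates per epoch, dropping the incomplete accumulation group
updates_per_epoch = max(micro_batches_per_epoch // grad_accum, 1)
total_steps = updates_per_epoch * num_epochs
# Data actually consumed, expressed in epochs
effective_epochs = total_steps * grad_accum / micro_batches_per_epoch

print(total_steps, round(effective_epochs, 2))  # 440 39.11
```

This is why the run ends at `'epoch': 39.11` rather than exactly 40: each epoch leaves one micro-batch over, and those leftovers add up to slightly less than one full pass.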

## 4. Inference

In [None]:
# Load model for inference
model = FastLanguageModel.for_inference(model)

messages = [
    {"from": "human", "value": "testcase_type: keyword-driven\ntestcase_name: omnicorp_scheduled_payments\nprompt: Generate a robot framework test case for testing PayNOW functionality in the Web platform, specifically omnicorp_scheduled_payments."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=2000, use_cache=True)

<|im_start|>user
testcase_type: keyword-driven
testcase_name: omnicorp_scheduled_payments
prompt: Generate a robot framework test case for testing PayNOW functionality in the Web platform, specifically omnicorp_scheduled_payments.<|im_end|>
<|im_start|>assistant
*** Settings ***
Description This test case checks if the scheduled payments functionality is working correctly for the omnicorp account.
Library SeleniumLibrary
*** Variables ***
${ACCOUNT_NUMBER, INVOICE_NUMBER} account_number, invoice_number_value
${INVOICE_NUMBER, EXPECTED_PAYMENT_ORIGIN} invoice_number, expected_payment_origin_value
${DATE} date_value
${ACCOUNT_NUMBER, INVOICE_NUMBER, EXPECTED_STATUS} account_number, invoice_number, expected_status_value
${EXPECTED_PAYMENT_DATE} expected_payment_date_value
*** Test Cases ***
Validate Omnicorp_scheduled_payments
[Description] This test case checks if the scheduled payments functionality is working correctly for the omnicorp account.
Authenticate User Account ${ACCOUNT_NUMBE
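
The prompt in the cell above follows a fixed three-field layout (`testcase_type`, `testcase_name`, `prompt`). A small helper can build such requests consistently. `build_request` is a hypothetical name introduced here for illustration; it is not part of Unsloth or of the dataset tooling.

```python
def build_request(testcase_type: str, testcase_name: str, prompt: str) -> list[dict]:
    """Return a one-turn ShareGPT-style conversation in the layout used by the inference cell."""
    value = (
        f"testcase_type: {testcase_type}\n"
        f"testcase_name: {testcase_name}\n"
        f"prompt: {prompt}"
    )
    return [{"from": "human", "value": value}]

messages = build_request(
    "keyword-driven",
    "omnicorp_scheduled_payments",
    "Generate a robot framework test case for testing PayNOW functionality "
    "in the Web platform, specifically omnicorp_scheduled_payments.",
)
print(messages[0]["value"])
```

The resulting `messages` list can be passed to `tokenizer.apply_chat_template` exactly as in the cell above.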

## 5. Save trained model

In [None]:
import os  # the token is read from the HF_TOKEN environment variable instead of being hard-coded
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("chukypedro/StaLlama-3.1-8B", tokenizer, save_method="merged_16bit", token=os.environ["HF_TOKEN"])

Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 53.9 out of 83.48 RAM for saving.


100%|██████████| 32/32 [00:00<00:00, 56.36it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


Unsloth: You are pushing to hub, but you passed your HF username = chukypedro.
We shall truncate chukypedro/StaLlama-3.1-8B to StaLlama-3.1-8B


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 53.82 out of 83.48 RAM for saving.


100%|██████████| 32/32 [00:00<00:00, 64.39it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...


Done.
Saved merged model to https://huggingface.co/chukypedro/StaLlama-3.1-8B


In [None]:
import os  # the token is read from the HF_TOKEN environment variable instead of being hard-coded

model.save_pretrained_gguf("model", tokenizer, "q8_0")
quant_methods = ["q2_k", "q3_k_m", "q4_k_m", "q5_k_m", "q6_k", "q8_0"]
for quant in quant_methods:
    model.push_to_hub_gguf("chukypedro/StaLlama-3.1-8B-GGUF", tokenizer, quant, token=os.environ["HF_TOKEN"])

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 59.77 out of 83.48 RAM for saving.


100%|██████████| 32/32 [00:00<00:00, 34.42it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at model into q8_0 GGUF format.
The output location will be ./model/unsloth.Q8_0.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> Q8_0, shape = {4096, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16

100%|██████████| 32/32 [00:00<00:00, 66.55it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q2_k'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at chukypedro/StaLlama-3.1-8B-GGUF into bf16 GGUF format.
The output location will be ./chukypedro/StaLlama-3.1-8B-GGUF/unsloth.BF16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: StaLlama-3.1-8B-GGUF
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001

unsloth.BF16.gguf:   0%|          | 0.00/16.1G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/chukypedro/StaLlama-3.1-8B-GGUF
Unsloth: Uploading GGUF to Huggingface Hub...


unsloth.Q2_K.gguf:   0%|          | 0.00/3.18G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/chukypedro/StaLlama-3.1-8B-GGUF
Saved Ollama Modelfile to https://huggingface.co/chukypedro/StaLlama-3.1-8B-GGUF

unsloth.Q3_K_M.gguf:   0%|          | 0.00/4.02G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/chukypedro/StaLlama-3.1-8B-GGUF
Saved Ollama Modelfile to https://huggingface.co/chukypedro/StaLlama-3.1-8B-GGUF

unsloth.Q4_K_M.gguf:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/chukypedro/StaLlama-3.1-8B-GGUF
Saved Ollama Modelfile to https://huggingface.co/chukypedro/StaLlama-3.1-8B-GGUF

unsloth.Q5_K_M.gguf:   0%|          | 0.00/5.73G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/chukypedro/StaLlama-3.1-8B-GGUF
Saved Ollama Modelfile to https://huggingface.co/chukypedro/StaLlama-3.1-8B-GGUF

unsloth.Q6_K.gguf:   0%|          | 0.00/6.60G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/chukypedro/StaLlama-3.1-8B-GGUF
Saved Ollama Modelfile to https://huggingface.co/chukypedro/StaLlama-3.1-8B-GGUF

unsloth.Q8_0.gguf:   0%|          | 0.00/8.54G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/chukypedro/StaLlama-3.1-8B-GGUF
Saved Ollama Modelfile to https://huggingface.co/chukypedro/StaLlama-3.1-8B-GGUF
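
The file sizes in the upload progress bars give a rough idea of how many bits each quantization spends per weight. The sketch below assumes the sizes are decimal gigabytes and takes the base parameter count as the "all params" figure minus the LoRA parameters reported earlier in this notebook. GGUF files also hold metadata and keep some tensors at higher precision, so the results are approximations.

```python
# File sizes from the upload progress bars above, assumed to be decimal gigabytes
SIZES_GB = {
    "BF16": 16.1,
    "Q8_0": 8.54,
    "Q6_K": 6.60,
    "Q5_K_M": 5.73,
    "Q4_K_M": 4.92,
    "Q3_K_M": 4.02,
    "Q2_K": 3.18,
}

# Base model weights: "all params" minus the trainable LoRA params reported earlier
BASE_PARAMS = 8_072_204_288 - 41_943_040

# Approximate bits per weight: file size in bits divided by parameter count
bits_per_weight = {
    name: size_gb * 1e9 * 8 / BASE_PARAMS for name, size_gb in SIZES_GB.items()
}

for name, bpw in bits_per_weight.items():
    print(f"{name:7s} {SIZES_GB[name]:5.2f} GB  ~{bpw:.2f} bits/weight")
```

The BF16 export lands close to 16 bits per weight, which serves as a sanity check on the assumptions; the quantized files trade size against precision from there.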
