# Finetune `meta-llama/Meta-Llama-3-8B-Instruct` on an EC2 instance using `Unsloth`
---

Unsloth aims to make finetuning large language models such as Llama-3, Mistral, Phi-4, and Gemma 2x faster while using 70% less memory, with no degradation in accuracy!

**Note**: ***This notebook is run on a `g6e.12xlarge` instance. Follow the prerequisite steps [here](README.md)***

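In this example, we fine-tune the Llama 3 8B Instruct model. Unsloth also provides several 4-bit pre-quantized models that are not gated; according to Unsloth, these download about 4x faster and avoid out-of-memory (OOM) errors. Here we instead load a standard, gated Meta Llama instruct model from Hugging Face with `load_in_4bit = False`. The model ID is read from the `HF_MODEL_ID` environment variable, so it can differ from the one in the title: the run recorded below logged `meta-llama/Llama-3.1-8B-Instruct`.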
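To make the memory trade-off between 16-bit and 4-bit loading concrete, here is a rough back-of-the-envelope sketch. It assumes about 8 billion parameters and counts only the raw weights; activations, optimizer state, LoRA adapters, and quantization metadata are ignored, so real usage will be higher.

```python
# Rough weight-memory estimate. PARAMS is an approximation (assumption),
# and only the raw weights are counted.
PARAMS = 8_000_000_000


def weight_gib(num_params: int, bits_per_param: float) -> float:
    """Return the size of the raw weights in GiB for a given precision."""
    return num_params * bits_per_param / 8 / 2**30


for label, bits in [("bf16/fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_gib(PARAMS, bits):5.1f} GiB")
```

Under these assumptions the 16-bit weights alone come to roughly 15 GiB, which is why the choice of `load_in_4bit` matters on a GPU with about 22 GB of memory.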

In [1]:
import os
import logging
import globals as g
from dotenv import load_dotenv
from unsloth import to_sharegpt
from datasets import load_dataset
from unsloth import FastLanguageModel
from unsloth import standardize_sharegpt
from ec2_metrics import EC2MetricsCallback

# Create a logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Remove existing handlers
logger.handlers.clear()

# Add a simple handler
handler = logging.StreamHandler()
formatter = logging.Formatter('[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth Zoo will now patch everything to make training faster!


2025-03-19 16:35:16,108	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


In [2]:
# Load environment variables from .env file
import getpass
load_dotenv()
if not os.getenv("HF_TOKEN"):
    os.environ["HF_TOKEN"] = getpass.getpass("Enter your HuggingFace token: ")
hf_token = os.getenv("HF_TOKEN")

if not os.getenv("HF_MODEL_ID"):
    hf_model_id  = input("Enter the model id to use for fine-tuning (e.g. meta-llama/Llama-3.1-8B-Instruct): ")
else:
    hf_model_id = os.getenv("HF_MODEL_ID")
logger.info(f"hf_model_id={hf_model_id}")


[2025-03-19 16:35:17,332] p8491 {2478216038.py:12} INFO - hf_model_id=meta-llama/Llama-3.1-8B-Instruct


In [3]:
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

DATASET_OF_INTEREST: str = 'vicgalle/alpaca-gpt4'

ALPACA_PROMPT: str = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


In [4]:
try:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = hf_model_id,
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
        token = hf_token # use one if using gated models like meta-llama/Llama-2-7b-hf
    )
except Exception as e:
    logger.error(f"Error occurred while loading the model: {e}")
    raise

==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    GPU: NVIDIA L4. Max memory: 22.045 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards: 100%|██████████| 4/4 [02:02<00:00, 30.68s/it]


In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.2.15 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
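The trainable-parameter count that the training banner reports later (41,943,040) can be reproduced by hand. The sketch below assumes the layer dimensions from the published Llama 3 8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these values are not printed anywhere in this notebook. Each LoRA adapter adds two matrices of shapes `(in, r)` and `(r, out)`, so `r * (in + out)` parameters per target module.

```python
# Assumed dimensions from the published Llama 3 8B config (not shown in this notebook).
R = 16                # LoRA rank, as passed to get_peft_model
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128      # 8 key/value heads x head dimension 128
NUM_LAYERS = 32

# (in_features, out_features) for every targeted projection in one decoder layer.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (in, r), B is (r, out)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,}  total: {total:,}")
```

Doubling `r` doubles this count, since every term is linear in the rank.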


## Data Prep

We use the Banking77 dataset (`mteb/banking77`), whose training split contains 10,003 customer banking queries, each labeled with one of 77 intents such as `card_arrival`. You can replace this section with your own data preparation.

In [6]:
# load the Banking77 dataset
dataset = load_dataset("mteb/banking77", split="train")
logger.info(f"Columns in the dataset: {dataset.column_names}")

[2025-03-19 16:37:34,202] p8491 {3124625397.py:3} INFO - Columns in the dataset: ['text', 'label', 'label_text']


In [7]:
# check for column names
print(dataset.column_names)
print(dataset[0])

['text', 'label', 'label_text']
{'text': 'I am still waiting on my card?', 'label': 11, 'label_text': 'card_arrival'}


In [8]:
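# Convert each row into a ShareGPT-style conversation.
# Text inside [[ ]] is optional and is skipped when the referenced column is empty.
# NOTE: this prompt embeds label_text in the user turn, so the label appears in both
# the input and the output of every training example (see the printed sample below).
dataset = to_sharegpt(
    dataset,
    merged_prompt="{text}[[\nYour input is:\n{label_text}]]",
    output_column_name="label_text",
    conversation_extension=3,  # combine 3 rows into one multi-turn conversation
)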

In [9]:
# Normalize the ShareGPT-style conversations to the role/content format used for finetuning
dataset = standardize_sharegpt(dataset)

In [10]:
from pprint import pprint
pprint(dataset[:3])

{'conversations': [[{'content': 'I am still waiting on my card?\n'
                                'Your input is:\n'
                                'card_arrival',
                     'role': 'user'},
                    {'content': 'card_arrival', 'role': 'assistant'},
                    {'content': 'How can I convert currencies?\n'
                                'Your input is:\n'
                                'exchange_via_app',
                     'role': 'user'},
                    {'content': 'exchange_via_app', 'role': 'assistant'},
                    {'content': 'Will be Apple Watch be able to let me top '
                                'up?\n'
                                'Your input is:\n'
                                'apple_pay_or_google_pay',
                     'role': 'user'},
                    {'content': 'apple_pay_or_google_pay',
                     'role': 'assistant'}],
                   [{'content': "What can I do if my card still hasn't arrive
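The printed sample shows how the `merged_prompt` string was expanded for each row. The function below is a simplified stand-in written for illustration, not Unsloth's implementation: it assumes `{column}` placeholders are filled from the row and that a `[[ ... ]]` block is dropped when any column it references is empty. The name `render_merged_prompt` is hypothetical.

```python
import re


def render_merged_prompt(template: str, row: dict) -> str:
    """Simplified illustration of the merged_prompt format (not Unsloth's code)."""

    def keep_or_drop(match: re.Match) -> str:
        block = match.group(1)
        columns = re.findall(r"\{(\w+)\}", block)
        # Keep the optional block only when every referenced column is non-empty.
        return block if all(row.get(col) for col in columns) else ""

    flattened = re.sub(r"\[\[(.*?)\]\]", keep_or_drop, template, flags=re.DOTALL)
    return flattened.format(**row)


TEMPLATE = "{text}[[\nYour input is:\n{label_text}]]"
row = {"text": "I am still waiting on my card?", "label": 11, "label_text": "card_arrival"}

print(repr(render_merged_prompt(TEMPLATE, row)))
print(repr(render_merged_prompt("{text}", row)))  # prompt without the label
```

With the notebook's template, the rendered user turn already contains `card_arrival`, the same string the model is trained to output. A template of just `"{text}"` would keep the label out of the prompt.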

In [11]:
chat_template = """You are an AI assistant that classifies customer inquiries into different banking-related tasks.

### User Query:
{INPUT}

### Classified Intent:
{OUTPUT}"""

from unsloth import apply_chat_template

dataset = apply_chat_template(
    dataset,
    tokenizer=tokenizer,
    chat_template=chat_template,
    # default_system_message = "You are a helpful assistant", << [OPTIONAL]
)

Unsloth: We automatically added an EOS token to stop endless generations.
Map: 100%|██████████| 10003/10003 [00:00<00:00, 16647.72 examples/s]


In [12]:
print(dataset)

Dataset({
    features: ['conversations', 'text'],
    num_rows: 10003
})
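To see what a single training example looks like once the template is filled in, the following sketch substitutes the `{INPUT}` and `{OUTPUT}` placeholders by hand. It only illustrates the placeholder roles; Unsloth's `apply_chat_template` does more (for example, it adds the EOS token noted in the log above), and `render_example` is a hypothetical helper.

```python
CHAT_TEMPLATE = """You are an AI assistant that classifies customer inquiries into different banking-related tasks.

### User Query:
{INPUT}

### Classified Intent:
{OUTPUT}"""


def render_example(query: str, intent: str = "") -> str:
    """Fill the two placeholders; an empty intent gives an inference-style prompt."""
    return CHAT_TEMPLATE.replace("{INPUT}", query).replace("{OUTPUT}", intent)


print(render_example("I am still waiting on my card?", "card_arrival"))
```

Leaving `intent` empty produces the text up to `### Classified Intent:`, which is the point where the finetuned model is expected to continue.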


In [13]:
%%time
# train the model
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# max_steps caps the run at 600 optimizer steps and takes precedence over num_train_epochs
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Setting this to True can make training up to 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 600, # Stop after 600 optimizer steps
        num_train_epochs = 1, # Ignored while max_steps is set to a positive value
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
    callbacks=[EC2MetricsCallback],
)

Converting train dataset to ChatML (num_proc=2): 100%|██████████| 10003/10003 [00:00<00:00, 18992.20 examples/s]
Applying chat template to train dataset (num_proc=2): 100%|██████████| 10003/10003 [00:01<00:00, 6722.87 examples/s]
Tokenizing train dataset (num_proc=2): 100%|██████████| 10003/10003 [00:02<00:00, 3976.57 examples/s]
Tokenizing train dataset (num_proc=2): 100%|██████████| 10003/10003 [00:00<00:00, 11081.51 examples/s]

CPU times: user 912 ms, sys: 352 ms, total: 1.26 s
Wall time: 6.27 s
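The relationship between the batch settings, `max_steps`, and dataset coverage is simple arithmetic. This sketch uses the values from the `TrainingArguments` above and assumes a single GPU, as the training banner reports.

```python
num_examples = 10_003      # rows in the prepared dataset
per_device_batch = 2       # per_device_train_batch_size
grad_accum = 4             # gradient_accumulation_steps
num_gpus = 1               # assumption: single-GPU run
max_steps = 600

effective_batch = per_device_batch * grad_accum * num_gpus
samples_seen = max_steps * effective_batch
epoch_fraction = samples_seen / num_examples

print(f"effective batch size: {effective_batch}")
print(f"examples processed:   {samples_seen:,}")
print(f"fraction of an epoch: {epoch_fraction:.2f}")
```

With these settings the run processes 4,800 examples, a little under half of the dataset, so a full epoch would need roughly 1,250 optimizer steps.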





In [14]:
%%time
# this will initiate the training process and also log the EC2 utilization metrics, such as the GPU
# utilization, CPU utilization, etc.
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 10,003 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 600
 "-____-"     Number of trainable parameters = 41,943,040
[2025-03-19 16:37:42,428] p8491 {ec2_metrics.py:184} INFO - Training started. Initiating EC2 metrics collection.
[2025-03-19 16:37:42,430] p8491 {ec2_metrics.py:170} INFO - Writing header: ['timestamp', 'cpu_percent_mean', 'memory_percent_mean', 'memory_used_mean', 'gpu_utilization_mean', 'gpu_memory_used_mean', 'gpu_memory_free_mean', 'gpu_memory_total_mean']
[2025-03-19 16:37:42,431] p8491 {ec2_metrics.py:41} INFO - Starting collection
[2025-03-19 16:37:42,800] p8491 {ec2_metrics.py:143} INFO - Starting daemon collector to run in background


Step,Training Loss
1,2.5402
2,2.6757
3,2.692
4,2.5609
5,2.3854
6,2.1486
7,1.9683
8,1.7709
9,1.5832
10,1.4954


[2025-03-19 16:54:58,491] p8491 {ec2_metrics.py:191} INFO - Training ended. Stopping EC2 metrics collection.
[2025-03-19 16:54:58,492] p8491 {ec2_metrics.py:33} INFO - Stopped collection


CPU times: user 12min 44s, sys: 4min 48s, total: 17min 33s
Wall time: 17min 17s


[2025-03-19 16:55:02,872] p8491 {ec2_metrics.py:33} INFO - Stopped collection
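An approximate throughput can be derived from the two callback timestamps in the log above. This is only a sketch: the timestamps bracket the metrics collection rather than the training loop itself, so the authoritative figures are the ones in `trainer_stats.metrics`.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"
started = datetime.strptime("2025-03-19 16:37:42,428", FMT)  # "Training started" log line
ended = datetime.strptime("2025-03-19 16:54:58,491", FMT)    # "Training ended" log line

total_steps = 600
effective_batch = 8  # 2 per device x 4 accumulation steps

elapsed = (ended - started).total_seconds()
sec_per_step = elapsed / total_steps
samples_per_sec = total_steps * effective_batch / elapsed

print(f"elapsed: {elapsed:.1f} s")
print(f"seconds per step: {sec_per_step:.2f}")
print(f"samples per second: {samples_per_sec:.2f}")
```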


### Once the model is trained, run inference on the following inputs

In [24]:
import torch

# write a function to complete the whole task
def classify_questions_organized(model, tokenizer, questions, output_file="problem1_task1.txt", debug=True):
    # ensure the model is in inference mode
    FastLanguageModel.for_inference(model)

    # move model to correct device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    classification_reports = []

    for idx, question in enumerate(questions, 1):
        messages = [{"role": "user", "content": question}]
        # tokenize input correctly
        input_ids = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(device)

        # generate output
        with torch.no_grad():
            output_ids = model.generate(
                input_ids,
                max_new_tokens=100,  # upper bound on generated tokens; intent labels are short
                pad_token_id=tokenizer.eos_token_id,
                # do_sample=True,  # enables sampling
                # temperature=0.5,  # controls randomness in output
                # top_k=50,
            )

        # decode the output
        response_text = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
        classified_intent = response_text.split("\n")[-1].strip()
        if debug:
            print(f"\ninput {idx}: {question}")
            print(f"predicted category: {classified_intent}")

        # format output with structured layout
        formatted_output = (
            f"{'=' * 50}\n"
            f"input {idx}: {question}\n"
            f"predicted category: {classified_intent}\n"
        )
        classification_reports.append(formatted_output)

    # save results to file
    with open(output_file, "w") as file:
        file.writelines(classification_reports)

    print(f"\nResults have been saved to {output_file}")
    return classification_reports  

# sample queries
questions = [
    "I see a charge on my credit card statement but I paid on time, why?",
    "Do you have a branch in Timbuktu?",
    "I lost my card and my replacement card has not arrived."
]

# call the function to classify queries
classify_questions_organized(model, tokenizer, questions)


input 1: I see a charge on my credit card statement but I paid on time, why?
predicted category: card_payment_fee_charged

input 2: Do you have a branch in Timbuktu?
predicted category: getting_virtual_card

input 3: I lost my card and my replacement card has not arrived.
predicted category: order_physical_card

Results have been saved to problem1_task1.txt
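The function above takes the last line of the decoded text as the predicted intent. The sketch below isolates that parsing step in a slightly more defensive form that skips blank lines and tolerates empty input. The sample string is hand-written in the shape of the chat template and is not captured model output; `extract_intent` is a hypothetical helper.

```python
def extract_intent(decoded: str) -> str:
    """Return the last non-empty line of the decoded text, or '' if there is none."""
    lines = [line.strip() for line in decoded.splitlines() if line.strip()]
    return lines[-1] if lines else ""


# Hand-written sample modeled on the chat template (assumption, not real output).
sample = (
    "### User Query:\n"
    "I want to get a Visa card.\n\n"
    "### Classified Intent:\n"
    "visa_or_mastercard\n"
)

print(extract_intent(sample))
```

Because this relies on the label being the final line, a generation that keeps going after the label would be misparsed; checking the result against the dataset's known `label_text` values is one way to catch that.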




### Log the trainer stats
---

In this step, we log trainer statistics such as the number of global steps, the average training loss, the train runtime, samples per second, and steps per second.

In [16]:
# Format the training stats in a readable way
output_text = f"""Training Statistics:
Global Steps: {trainer_stats.global_step}
Training Loss: {trainer_stats.training_loss:.4f}

Metrics:
- Train Runtime: {trainer_stats.metrics['train_runtime']:.3f} seconds
- Training Samples/Second: {trainer_stats.metrics['train_samples_per_second']:.3f}
- Training Steps/Second: {trainer_stats.metrics['train_steps_per_second']:.3f}
- Total FLOPs: {trainer_stats.metrics['total_flos']:.2e}
- Final Train Loss: {trainer_stats.metrics['train_loss']:.4f}
"""

# Save to a text file
with open(os.path.join(g.RESULTS_DIR, g.TRAINING_STATS), 'w') as f:
    f.write(output_text)

In [17]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
messages = [                    # Change below!
    {"role": "user", "content": "Continue the fibonacci sequence! Your input is 1, 1, 2, 3, 5, 8,"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

fibonacci

### User Query:
I want to get a Visa card.
Your input is:
visa_or_mastercard

### Classified Intent:
visa_or_mastercard<|eot_id|>


In [18]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
messages = [                         # Change below!
    {"role": "user",      "content": "Continue the fibonacci sequence! Your input is 1, 1, 2, 3, 5, 8"},
    {"role": "assistant", "content": "The fibonacci sequence continues as 13, 21, 34, 55 and 89."},
    {"role": "user",      "content": "What is France's tallest tower called?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

France's tallest tower is called the Eiffel Tower.<|eot_id|>


In [19]:
# save the model
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

## Ollama Support [Optional]

Unsloth now allows you to automatically finetune and create a Modelfile, and export to Ollama! This makes finetuning much easier and provides a seamless workflow from Unsloth to Ollama!

Let's first install Ollama!

In [20]:
!curl -fsSL https://ollama.com/install.sh | sh

>>> Cleaning up old version at /usr/local/lib/ollama
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle

In [None]:
# Save to 8bit Q8_0
if True: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily.


make: Entering directory '/home/ubuntu/spring-2025-lab07-WillWangUNC/llama.cpp'
make: Leaving directory '/home/ubuntu/spring-2025-lab07-WillWangUNC/llama.cpp'


Makefile:2: *** The Makefile build is deprecated. Use the CMake build instead. For more details, see https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md.  Stop.


-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1") 
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- Including CPU backend
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native 
-- Found CURL: /usr/lib/x86_64-linux-gnu/

Unsloth: You have 2 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 8.14 out of 15.01 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 22%|██▏       | 7/32 [00:00<00:01, 12.96it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [01:08<00:00,  2.13s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model/pytorch_model-00001-of-00004.bin...
Unsloth: Saving model/pytorch_model-00002-of-00004.bin...
Unsloth: Saving model/pytorch_model-00003-of-00004.bin...
Unsloth: Saving model/pytorch_model-00004-of-00004.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at model into q8_0 GGUF format.
The output location will be /home/ubuntu/spring-2025-lab07-WillWangUNC/model/unsloth.Q8_0.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part


KeyboardInterrupt: 

In [None]:
import subprocess
import time

# Start the Ollama server in the background and give it a moment to come up
subprocess.Popen(["ollama", "serve"])
time.sleep(3)

Error: listen tcp 127.0.0.1:11434: bind: address already in use
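The `address already in use` error means something is already listening on Ollama's port, most likely a server started by the installer. One way to avoid it is to probe the port before launching. The sketch below uses only the standard library; `port_in_use` is a hypothetical helper, and the host and port are taken from the error message above.

```python
import socket


def port_in_use(host: str = "127.0.0.1", port: int = 11434, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


if port_in_use():
    print("A server is already listening on 11434; skipping `ollama serve`.")
else:
    print("Port 11434 is free; `ollama serve` can be started.")
```

A successful connection only shows that some process holds the port, not that it is Ollama, so treat the result as a hint.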


In [None]:
!ollama create unsloth_model -f ./model/Modelfile

gathering model components
Error: no Modelfile or safetensors files found
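This error is consistent with the GGUF export above having been interrupted before it finished. A small pre-flight check can confirm that the export directory is complete before calling `ollama create`. The sketch assumes the export leaves a `Modelfile` and at least one `.gguf` file in the model directory; `export_problems` is a hypothetical helper, and the demo runs against a temporary directory rather than the real `model/` folder.

```python
import tempfile
from pathlib import Path


def export_problems(model_dir: str) -> list[str]:
    """Return a list of problems; an empty list means the directory looks ready."""
    root = Path(model_dir)
    problems = []
    if not (root / "Modelfile").is_file():
        problems.append("Modelfile is missing")
    if not any(root.glob("*.gguf")):
        problems.append("no .gguf file found")
    return problems


# Demo on a throwaway directory (stand-in for ./model).
with tempfile.TemporaryDirectory() as tmp:
    before = export_problems(tmp)
    (Path(tmp) / "Modelfile").write_text("FROM ./unsloth.Q8_0.gguf\n")
    (Path(tmp) / "unsloth.Q8_0.gguf").write_bytes(b"")
    after = export_problems(tmp)

print("before export:", before)
print("after export: ", after)
```

In the notebook, running `export_problems("model")` before the `ollama create` cell would show whether the export needs to be rerun.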


In [None]:
# run inference against the model
!curl http://localhost:11434/api/chat -d '{ \
    "model": "unsloth_model", \
    "messages": [ \
        { "role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8," } \
    ] \
    }'

{"model":"unsloth_model","created_at":"2025-03-19T08:49:09.91008334Z","message":{"role":"assistant","content":"The"},"done":false}
{"model":"unsloth_model","created_at":"2025-03-19T08:49:10.065350692Z","message":{"role":"assistant","content":" next"},"done":false}
{"model":"unsloth_model","created_at":"2025-03-19T08:49:10.199152456Z","message":{"role":"assistant","content":" number"},"done":false}
{"model":"unsloth_model","created_at":"2025-03-19T08:49:10.333419237Z","message":{"role":"assistant","content":" in"},"done":false}
{"model":"unsloth_model","created_at":"2025-03-19T08:49:10.467585562Z","message":{"role":"assistant","content":" the"},"done":false}
{"model":"unsloth_model","created_at":"2025-03-19T08:49:10.601923256Z","message":{"role":"assistant","content":" Fibonacci"},"done":false}
{"model":"unsloth_model","created_at":"2025-03-19T08:49:10.735707088Z","message":{"role":"assistant","content":" sequence"},"done":false}
{"model":"unsloth_model","created_at":"2025-03-19T08:49:1
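The `/api/chat` endpoint streams one JSON object per line, each carrying a fragment of the reply in `message.content`. The sketch below reassembles the text from the first four lines of the output above, copied verbatim; `join_stream` is a hypothetical helper.

```python
import json

# First four lines of the streamed response shown above.
STREAM = """{"model":"unsloth_model","created_at":"2025-03-19T08:49:09.91008334Z","message":{"role":"assistant","content":"The"},"done":false}
{"model":"unsloth_model","created_at":"2025-03-19T08:49:10.065350692Z","message":{"role":"assistant","content":" next"},"done":false}
{"model":"unsloth_model","created_at":"2025-03-19T08:49:10.199152456Z","message":{"role":"assistant","content":" number"},"done":false}
{"model":"unsloth_model","created_at":"2025-03-19T08:49:10.333419237Z","message":{"role":"assistant","content":" in"},"done":false}
"""


def join_stream(ndjson_text: str) -> str:
    """Concatenate the message.content fragments of a newline-delimited JSON stream."""
    parts = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
    return "".join(parts)


print(join_stream(STREAM))
```

Sending `"stream": false` in the request body, as the next cell does for `/api/generate`, returns a single JSON object instead and makes this step unnecessary.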

In [None]:
# run inference against the model
import json
result = subprocess.run(
    [
        "curl",
        "http://localhost:11434/api/generate",
        "-d",
        '{"model": "unsloth_model", "prompt": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,", "stream": false}',
    ],
    capture_output=True,
    text=True,
)

response_data = json.loads(result.stdout)
print(f"Response generated: {response_data['response']}")



Response generated: 13, 21, 34, 55, 89, 144...
