# Finetune `meta-llama/Meta-Llama-3-8B-Instruct` on an EC2 instance using `Unsloth`
---

Unsloth makes finetuning large language models like Llama-3, Mistral, Phi-4 and Gemma 2x faster and uses 70% less memory, with no degradation in accuracy!

**Note**: ***This notebook is run on a `g6e.12xlarge` instance. Follow the prerequisite steps [here](README.md)***

In this example, we fine-tune a Llama 3 instruct model. `unsloth` provides several pre-quantized 4-bit models that are not gated, which it says download 4x faster and avoid OOMs. Here, however, we use a standard gated model from Hugging Face such as `meta-llama/Meta-Llama-3-8B-Instruct`. The model id is read from the `HF_MODEL_ID` environment variable (the run recorded below used `meta-llama/Llama-3.2-1B-Instruct`).

In [1]:
import os
import logging
import globals as g
from dotenv import load_dotenv
from unsloth import to_sharegpt
from datasets import load_dataset
from unsloth import FastLanguageModel
from unsloth import standardize_sharegpt
from ec2_metrics import EC2MetricsCallback

# Create a logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Remove existing handlers
logger.handlers.clear()

# Add a simple handler
handler = logging.StreamHandler()
formatter = logging.Formatter('[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth Zoo will now patch everything to make training faster!


2025-03-14 18:03:57,514	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


In [2]:
# Load environment variables from .env file
import getpass
load_dotenv()
if not os.getenv("HF_TOKEN"):
    os.environ["HF_TOKEN"] = getpass.getpass("Enter your HuggingFace token: ")
hf_token = os.getenv("HF_TOKEN")

if not os.getenv("HF_MODEL_ID"):
    hf_model_id  = input("Enter the model id to use for fine-tuning (e.g. meta-llama/Llama-3.1-8B-Instruct): ")
else:
    hf_model_id = os.getenv("HF_MODEL_ID")
logger.info(f"hf_model_id={hf_model_id}")


[2025-03-14 18:03:58,401] p25032 {2478216038.py:12} INFO - hf_model_id=meta-llama/Llama-3.2-1B-Instruct


In [3]:
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

DATASET_OF_INTEREST: str = 'mteb/banking77'

ALPACA_PROMPT: str = """Below is a customer query related to banking services. Categorize the query into an appropriate category.

### Query:
{}

### Category:
{}"""
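The template above has two positional `{}` slots, one for the query and one for the category. The sketch below shows how it could be filled with `str.format`; `format_example` is a hypothetical helper that is not part of this notebook, and leaving the category empty is an assumed way to build an inference prompt.

```python
# Same template as defined in the cell above.
ALPACA_PROMPT = """Below is a customer query related to banking services. Categorize the query into an appropriate category.

### Query:
{}

### Category:
{}"""

def format_example(query: str, category: str = "") -> str:
    """Fill the two positional slots; leave category empty to build an inference prompt."""
    return ALPACA_PROMPT.format(query, category)

print(format_example("I am still waiting on my card?", "card_arrival"))
```

Note that the training cell further down formats its text with a plain f-string rather than this template, so the template only takes effect if you apply it yourself.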

In [4]:
try:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = hf_model_id,
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
        token = hf_token # use one if using gated models like meta-llama/Llama-2-7b-hf
    )
except Exception as e:
    logger.error(f"Error occurred while loading the model: {e}")
    raise

==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    GPU: NVIDIA L4. Max memory: 22.045 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.2.15 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
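As a sanity check on the LoRA configuration, the number of trainable parameters can be worked out by hand: each adapted linear layer gains two low-rank matrices with `r * (in_features + out_features)` parameters in total. The sketch below assumes the layer shapes of `meta-llama/Llama-3.2-1B-Instruct` (hidden size 2048, MLP size 8192, 8 key/value heads of dimension 64); these shapes are not printed in this notebook, so treat them as assumptions.

```python
# Assumed Llama-3.2-1B shapes (not printed in this notebook).
HIDDEN = 2048        # hidden_size
INTERMEDIATE = 8192  # MLP intermediate_size
KV_DIM = 512         # num_key_value_heads (8) * head_dim (64)
NUM_LAYERS = 16      # matches "patched 16 layers" in the output above
RANK = 16            # r passed to get_peft_model

# (in_features, out_features) of each targeted projection.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per adapted linear layer."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 11,272,192
```

Under these assumptions the result agrees with the `Number of trainable parameters = 11,272,192` line that the trainer prints when training starts.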


## Data Prep

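We use the `mteb/banking77` dataset from the Hugging Face Hub (set in `DATASET_OF_INTEREST`), which contains customer banking queries, each labeled with one of 77 intent categories. The training split has 10,003 examples with `text`, `label` and `label_text` columns. You can replace this code section with your own data prep.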

In [6]:
dataset = load_dataset(DATASET_OF_INTEREST, split="train")
logger.info(f"Columns in the dataset: {dataset.column_names}")

[2025-03-14 18:04:04,311] p25032 {2022229014.py:2} INFO - Columns in the dataset: ['text', 'label', 'label_text']


In [7]:
# dataset = to_sharegpt(
#     dataset,
#     merged_prompt="{instruction}[[\nYour input is:\n{input}]]",
#     output_column_name="output",
#     conversation_extension=3,
# )
dataset = dataset.map(lambda x: {"text": f"Query: {x['text']}\nCategory: {x['label']}"})


In [8]:
from pprint import pprint
pprint(dataset[:3])

{'label': [11, 11, 11],
 'label_text': ['card_arrival', 'card_arrival', 'card_arrival'],
 'text': ['Query: I am still waiting on my card?\nCategory: 11',
          "Query: What can I do if my card still hasn't arrived after 2 "
          'weeks?\n'
          'Category: 11',
          'Query: I have been waiting over a week. Is the card still coming?\n'
          'Category: 11']}
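As the output above shows, the training text ends with the integer label id (`Category: 11`) rather than the readable name in `label_text`. If you would rather have the model learn category names, a variant of the mapping could look like the sketch below. `format_record` is a hypothetical helper, and appending the tokenizer's EOS token is an assumption (a common practice so that generation learns where to stop), not something this notebook does.

```python
def format_record(record: dict, eos_token: str = "") -> dict:
    """Build the training text from the human-readable label instead of its integer id."""
    text = f"Query: {record['text']}\nCategory: {record['label_text']}{eos_token}"
    return {"text": text}

# One row shaped like the banking77 records printed above.
row = {"text": "I am still waiting on my card?", "label": 11, "label_text": "card_arrival"}
print(format_record(row)["text"])

# In the notebook this could be applied as (assumption, not run here):
# dataset = dataset.map(lambda x: format_record(x, tokenizer.eos_token))
```

Whichever format you train on, the inference prompt should use the same `Query: ...\nCategory:` layout.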


In [9]:
# chat_template = """Below is a list of customer queries related to banking services. Categorize each query into an appropriate category.

# ### Query:
# {INPUT}

# ### Category:
# {OUTPUT}"""

# from unsloth import apply_chat_template

# dataset = apply_chat_template(
#     dataset,
#     tokenizer=tokenizer,
#     chat_template=chat_template,
#     # default_system_message = "You are a helpful assistant", << [OPTIONAL]
# )

In [10]:
%%time
# train the model
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
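        max_steps = 600, # Takes precedence over num_train_epochs when set to a positive value
        num_train_epochs = 1, # Only used when max_steps is not set (for full-epoch runs)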
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
    callbacks=[EC2MetricsCallback],
)

Applying chat template to train dataset (num_proc=2): 100%|██████████| 10003/10003 [00:01<00:00, 9666.39 examples/s]
Tokenizing train dataset (num_proc=2): 100%|██████████| 10003/10003 [00:01<00:00, 6540.05 examples/s]
Tokenizing train dataset (num_proc=2): 100%|██████████| 10003/10003 [00:00<00:00, 20770.01 examples/s]

CPU times: user 804 ms, sys: 255 ms, total: 1.06 s
Wall time: 3.6 s
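To see how much of the dataset the configured run covers, the batch settings can be combined by hand. The sketch below is plain arithmetic on the values from the cell above and assumes a single GPU; the steps-per-epoch figure is approximate, because the exact count depends on how the trainer rounds partial batches.

```python
import math

num_examples = 10_003          # from the tokenization progress bars above
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1                   # assumption: single-GPU run
max_steps = 600

# Examples consumed per optimizer step.
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus
# Approximate optimizer steps needed for one full pass over the data.
steps_per_epoch = math.ceil(num_examples / effective_batch)
# Examples actually seen when training stops at max_steps.
examples_seen = max_steps * effective_batch
fraction_of_epoch = examples_seen / num_examples

print(effective_batch)              # 8
print(steps_per_epoch)              # 1251
print(examples_seen)                # 4800
print(round(fraction_of_epoch, 2))  # 0.48
```

So with `max_steps = 600` the run stops a little under halfway through one epoch; raising `max_steps` or removing it in favor of `num_train_epochs` would cover the full training split.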





In [None]:
%%time
# this will initiate the training process and also log the EC2 utilization metrics, such as the GPU
# utilization, CPU utilization, etc.
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 10,003 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 600
 "-____-"     Number of trainable parameters = 11,272,192
[2025-03-14 18:04:08,849] p25032 {ec2_metrics.py:184} INFO - Training started. Initiating EC2 metrics collection.
[2025-03-14 18:04:08,850] p25032 {ec2_metrics.py:170} INFO - Writing header: ['timestamp', 'cpu_percent_mean', 'memory_percent_mean', 'memory_used_mean', 'gpu_utilization_mean', 'gpu_memory_used_mean', 'gpu_memory_free_mean', 'gpu_memory_total_mean']
[2025-03-14 18:04:08,850] p25032 {ec2_metrics.py:41} INFO - Starting collection
[2025-03-14 18:04:09,052] p25032 {ec2_metrics.py:143} INFO - Starting daemon collector to run in background


Step,Training Loss
1,3.8731
2,4.1899
3,4.2098
4,4.0447
5,3.9989
6,3.9323
7,3.9631
8,3.3294
9,3.0138
10,3.5232


[2025-03-14 18:09:03,062] p25032 {ec2_metrics.py:191} INFO - Training ended. Stopping EC2 metrics collection.
[2025-03-14 18:09:03,062] p25032 {ec2_metrics.py:33} INFO - Stopped collection


CPU times: user 4min 53s, sys: 5.17 s, total: 4min 59s
Wall time: 4min 55s


[2025-03-14 18:09:04,110] p25032 {ec2_metrics.py:33} INFO - Stopped collection


### Log the trainer stats
---

`trainer.train()` returns the trainer stats, such as the number of global steps, the final training loss, the train runtime, samples per second and steps per second. The cell below then runs the fine-tuned model on a few test queries and writes the predicted categories to a file.
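A minimal sketch of logging those stats follows. It relies on the `metrics` dictionary of the object returned by `trainer.train()`, which in Hugging Face Transformers typically contains keys such as `train_runtime`, `train_samples_per_second`, `train_steps_per_second` and `train_loss`. `summarize_train_metrics` is a hypothetical helper, not part of this notebook.

```python
def summarize_train_metrics(metrics: dict) -> str:
    """Format the commonly reported keys; keys missing from the dict are skipped."""
    keys = [
        "train_runtime",
        "train_samples_per_second",
        "train_steps_per_second",
        "train_loss",
    ]
    parts = [f"{k}={metrics[k]}" for k in keys if k in metrics]
    return ", ".join(parts)

# In the notebook (assumption, not executed here):
# logger.info(
#     f"global_step={trainer_stats.global_step}, "
#     f"{summarize_train_metrics(trainer_stats.metrics)}"
# )
```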

In [23]:
from transformers import AutoTokenizer

# Reload the tokenizer for the same base model that was fine-tuned
tokenizer = AutoTokenizer.from_pretrained(hf_model_id, token=hf_token)

# Switch Unsloth to its inference mode
FastLanguageModel.for_inference(model)

# Define inference function
def generate_response(model, tokenizer, query):
    # Use the same "Query: ...\nCategory: ..." format the model was trained on
    prompt = f"Query: {query}\nCategory:"

    # Tokenize input (input_ids and attention_mask)
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Generate output
    output_ids = model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id)

    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    # Keep the first non-empty line as the category
    # (cell 7 trained on the integer label id, so this is typically a number)
    lines = [line.strip() for line in response.split("\n") if line.strip()]
    return lines[0] if lines else ""

# List of test queries
test_inputs = [
    "I see a charge on my credit card statement but I paid on time, why?",
    "Do you have a branch in Timbuktu?",
    "I lost my card and my replacement card has not arrived."
]

# Perform inference
results = [generate_response(model, tokenizer, inp) for inp in test_inputs]

# Save outputs to file
output_text = "\n".join([f"input: {inp}\ncategory: {res}" for inp, res in zip(test_inputs, results)])

with open("problem1_task1.txt", "w") as f:
    f.write(output_text)
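Because the training text in cell 7 uses the integer `label` column, the model may answer with a label id instead of a category name. The sketch below shows one way to translate such an id back to its name using the dataset's paired `label` and `label_text` columns. Both helpers are hypothetical and not part of this notebook.

```python
def build_label_map(labels, label_texts) -> dict:
    """Map each integer label id to its name, using the dataset's paired columns."""
    return {int(i): name for i, name in zip(labels, label_texts)}

def to_category_name(prediction: str, label_map: dict) -> str:
    """Translate a predicted id such as '11'; return the raw text if it is not a known id."""
    text = prediction.strip()
    if text.isdigit() and int(text) in label_map:
        return label_map[int(text)]
    return text

# Values copied from the first rows printed earlier in the notebook.
label_map = build_label_map([11, 11, 11], ["card_arrival"] * 3)
print(to_category_name("11", label_map))  # card_arrival

# In the notebook (assumption, not run here):
# label_map = build_label_map(dataset["label"], dataset["label_text"])
```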


## Ollama Support [Optional]

Unsloth can also create a Modelfile automatically and export the fine-tuned model to Ollama, which gives a direct workflow from Unsloth to Ollama. The cells in this section are optional and commented out; uncomment them to run.

Let's first install Ollama!

In [14]:
# !curl -fsSL https://ollama.com/install.sh | sh

In [15]:
# # Save to 8bit Q8_0
# if True: model.save_pretrained_gguf("model", tokenizer,)
# # Remember to go to https://huggingface.co/settings/tokens for a token!
# # And change hf to your username!
# if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# # Save to 16bit GGUF
# if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
# if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# # Save to q4_k_m GGUF
# if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
# if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# # Save to multiple GGUF options - much faster if you want multiple!
# if False:
#     model.push_to_hub_gguf(
#         "hf/model", # Change hf to your username!
#         tokenizer,
#         quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
#         token = "", # Get a token at https://huggingface.co/settings/tokens
#     )

In [16]:
# import subprocess

# subprocess.Popen(["ollama", "serve"])
# import time

# time.sleep(3) 

In [17]:
# !ollama create unsloth_model -f ./model/Modelfile

In [18]:
# # run inference against the model
# !curl http://localhost:11434/api/chat -d '{ \
#     "model": "unsloth_model", \
#     "messages": [ \
#         { "role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8," } \
#     ] \
#     }'

In [19]:
# # run inference against the model
# import json
# result = subprocess.run(
#     [
#         "curl",
#         "http://localhost:11434/api/generate",
#         "-d",
#         '{"model": "unsloth_model", "prompt": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,", "stream": false}',
#     ],
#     capture_output=True,
#     text=True,
# )

# response_data = json.loads(result.stdout)
# print(f"Response generated: {response_data['response']}")
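As an alternative to shelling out to `curl`, the same request can be sent with Python's standard library. The sketch below targets the same `/api/generate` endpoint and the same JSON fields as the commented-out cell above; the helper names are hypothetical, and the call only works if `ollama serve` is running and `unsloth_model` has been created.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # same endpoint as the curl call above

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the 'response' field; needs `ollama serve` running."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# Example (requires the server and the model created above):
# print(generate("unsloth_model", "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"))
```

Keeping the request builder separate from the network call makes the payload easy to inspect without a running server.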

