# Finetune `meta-llama/Meta-Llama-3-8B-Instruct` on an EC2 instance using `Unsloth`
---

Unsloth claims to make finetuning large language models such as Llama-3, Mistral, Phi-4, and Gemma 2x faster while using 70% less memory, with no degradation in accuracy.

**Note**: ***This notebook is run on a `g6e.12xlarge` instance. Follow the prerequisite steps [here](README.md)***

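In this example, we fine-tune a Llama 3 8B Instruct model. `unsloth` also provides several pre-quantized 4-bit models that are not gated; according to Unsloth, these download up to 4x faster and avoid out-of-memory (OOM) errors. Here we instead use the standard `meta-llama/Meta-Llama-3-8B-Instruct` model from Hugging Face, which is gated and therefore needs a Hugging Face token. The model id is read from the `HF_MODEL_ID` environment variable; the run recorded below used `meta-llama/Llama-3.1-8B-Instruct`.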
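
To make the memory argument for 4-bit models concrete, here is a rough back-of-the-envelope sketch. It assumes about 8e9 parameters for an "8B" model and counts weights only; activations, optimizer state, LoRA adapters, and quantization constants are ignored, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

# Assumption: an "8B" model has roughly 8e9 parameters.
NUM_PARAMS = 8e9

for bits in (16, 4):
    gib = weight_memory_gib(NUM_PARAMS, bits)
    print(f"{bits}-bit weights: ~{gib:.1f} GiB")
```

With `load_in_4bit = False`, as used later in this notebook, the 16-bit figure is the relevant one.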

In [1]:
import os
import logging
import globals as g
from dotenv import load_dotenv
from unsloth import to_sharegpt
from datasets import load_dataset
from unsloth import FastLanguageModel
from unsloth import standardize_sharegpt
from ec2_metrics import EC2MetricsCallback

# Create a logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Remove existing handlers
logger.handlers.clear()

# Add a simple handler
handler = logging.StreamHandler()
formatter = logging.Formatter('[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.



🦥 Unsloth Zoo will now patch everything to make training faster!

In [2]:
# Load environment variables from .env file
import getpass
load_dotenv()
if not os.getenv("HF_TOKEN"):
    os.environ["HF_TOKEN"] = getpass.getpass("Enter your HuggingFace token: ")
hf_token = os.getenv("HF_TOKEN")

if not os.getenv("HF_MODEL_ID"):
    hf_model_id  = input("Enter the model id to use for fine-tuning (e.g. meta-llama/Llama-3.1-8B-Instruct): ")
else:
    hf_model_id = os.getenv("HF_MODEL_ID")
logger.info(f"hf_model_id={hf_model_id}")


[2025-03-16 13:53:46,015] p37407 {2478216038.py:12} INFO - hf_model_id=meta-llama/Llama-3.1-8B-Instruct


In [3]:
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

DATASET_OF_INTEREST: str = 'banking77'

def convert_to_instruction_format(example):
    # Format: customer query -> intent classification task
    return {
        "instruction": "Classify the following banking customer service query into the appropriate category:",
        "input": example["text"],
        "output": example["label"]
    }

# ALPACA_PROMPT: str = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

# ### Instruction:
# {}

# ### Input:
# {}

# ### Response:
# {}"""

In [4]:
try:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = hf_model_id,
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
        token = hf_token # use one if using gated models like meta-llama/Llama-2-7b-hf
    )
except Exception as e:
    logger.error(f"Error occurred while loading the model: {e}")
    raise

==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    GPU: NVIDIA L4. Max memory: 22.045 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards: 100%|██████████| 4/4 [02:02<00:00, 30.68s/it]


In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.2.15 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
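
The LoRA settings above determine how many parameters are trained. The sketch below recomputes that count by hand. The layer shapes are assumptions about the Llama 3 8B architecture (hidden size 4096, MLP intermediate size 14336, 8 key/value heads of dimension 128, 32 layers); they are not printed anywhere in this notebook.

```python
# Assumed Llama 3 8B shapes (not taken from this notebook's output).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # grouped-query attention: 8 KV heads x head dim 128
NUM_LAYERS = 32
RANK = 16         # r = 16 from get_peft_model above

# (in_features, out_features) for each LoRA target module in one layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per adapted weight."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"trainable LoRA parameters: {total:,}")
```

Under these assumptions the result is 41,943,040, which agrees with the "Number of trainable parameters" line in the training banner further down.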


## Data Prep

We use the `banking77` dataset, loaded by name with `load_dataset`. Its train split contains 10,003 banking customer-service queries, each with a `text` column and an integer intent `label` (77 intent classes). You can replace this code section with your own data prep.

In [6]:
dataset = load_dataset(DATASET_OF_INTEREST, split="train")
logger.info(f"Columns in the dataset: {dataset.column_names}")

Generating train split: 100%|██████████| 10003/10003 [00:00<00:00, 49481.81 examples/s]
Generating test split: 100%|██████████| 3080/3080 [00:00<00:00, 1083127.05 examples/s]
[2025-03-16 13:56:04,572] p37407 {2022229014.py:2} INFO - Columns in the dataset: ['text', 'label']


In [7]:
dataset = dataset.map(convert_to_instruction_format)
print(dataset[0])

Map: 100%|██████████| 10003/10003 [00:00<00:00, 28712.13 examples/s]

{'text': 'I am still waiting on my card?', 'label': 11, 'instruction': 'Classify the following banking customer service query into the appropriate category:', 'input': 'I am still waiting on my card?', 'output': 11}





In [8]:
dataset = to_sharegpt(
    dataset,
    merged_prompt="{instruction}\n{input}",
    output_column_name="output",
    conversation_extension=3,
)

Merging columns: 100%|██████████| 10003/10003 [00:00<00:00, 278787.87 examples/s]
Converting to ShareGPT: 100%|██████████| 10003/10003 [00:00<00:00, 226327.16 examples/s]
Flattening the indices: 100%|██████████| 10003/10003 [00:00<00:00, 1035251.14 examples/s]
Flattening the indices: 100%|██████████| 10003/10003 [00:00<00:00, 12490.74 examples/s]
Flattening the indices: 100%|██████████| 10003/10003 [00:00<00:00, 12579.94 examples/s]
Extending conversations: 100%|██████████| 10003/10003 [00:00<00:00, 28548.66 examples/s]
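
To illustrate what `merged_prompt` and `conversation_extension=3` produce, here is a simplified pure-Python sketch. It is not Unsloth's implementation: the helper names are hypothetical, and the way partner rows are picked (a seeded random draw) is an assumption made only for illustration.

```python
import random

INSTRUCTION = ("Classify the following banking customer service query "
               "into the appropriate category:")

def to_single_turn(example: dict) -> list:
    """Merge instruction and input into one user turn, plus the answer."""
    prompt = f"{example['instruction']}\n{example['input']}"
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": str(example["output"])},
    ]

def extend_conversations(examples: list, extension: int = 3, seed: int = 0) -> list:
    """Give every row `extension` user/assistant exchanges in total."""
    rng = random.Random(seed)
    singles = [to_single_turn(e) for e in examples]
    extended = []
    for conversation in singles:
        turns = list(conversation)
        for _ in range(extension - 1):
            # Assumption: extra exchanges are drawn at random from the dataset.
            turns.extend(singles[rng.randrange(len(singles))])
        extended.append(turns)
    return extended

rows = [
    {"instruction": INSTRUCTION, "input": "I am still waiting on my card?", "output": 11},
    {"instruction": INSTRUCTION, "input": "How can I convert currencies?", "output": 33},
    {"instruction": INSTRUCTION, "input": "Will be Apple Watch be able to let me top up?", "output": 2},
]
extended = extend_conversations(rows)
print(len(extended), "conversations,", len(extended[0]), "turns each")
```

The number of rows stays the same; each row simply becomes a longer, multi-turn conversation.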


In [9]:
# Normalize the ShareGPT-style conversations into the role/content format used for finetuning
dataset = standardize_sharegpt(dataset)

Standardizing format: 100%|██████████| 10003/10003 [00:00<00:00, 31011.42 examples/s]


In [10]:
from pprint import pprint
pprint(dataset[:3])

{'conversations': [[{'content': 'Classify the following banking customer '
                                'service query into the appropriate category:\n'
                                'I am still waiting on my card?',
                     'role': 'user'},
                    {'content': '11', 'role': 'assistant'},
                    {'content': 'Classify the following banking customer '
                                'service query into the appropriate category:\n'
                                'How can I convert currencies?',
                     'role': 'user'},
                    {'content': '33', 'role': 'assistant'},
                    {'content': 'Classify the following banking customer '
                                'service query into the appropriate category:\n'
                                'Will be Apple Watch be able to let me top up?',
                     'role': 'user'},
                    {'content': '2', 'role': 'assistant'}],
                   [{'cont

In [11]:
chat_template = """Below are customer banking queries. Classify each query into the appropriate banking intent category.

### Query:
{INPUT}

### Intent:
{OUTPUT}"""

from unsloth import apply_chat_template

dataset = apply_chat_template(
    dataset,
    tokenizer=tokenizer,
    chat_template=chat_template,
    # default_system_message = "You are a helpful assistant", << [OPTIONAL]
)

Unsloth: We automatically added an EOS token to stop endless generations.
Map: 100%|██████████| 10003/10003 [00:00<00:00, 18416.98 examples/s]
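
The `{INPUT}` and `{OUTPUT}` placeholders can be pictured as plain string substitution. The sketch below is a simplification, not what `apply_chat_template` does internally: the `render_conversation` helper is hypothetical, the preamble is assumed to appear once with the query/intent block repeated per exchange, and the `<|eot_id|>` end-of-sequence string is assumed from the inference output shown later.

```python
CHAT_TEMPLATE = """Below are customer banking queries. Classify each query into the appropriate banking intent category.

### Query:
{INPUT}

### Intent:
{OUTPUT}"""

def render_conversation(turns: list, eos_token: str) -> str:
    """Render alternating user/assistant turns into one training string."""
    # Split the template into a one-time preamble and a per-exchange block.
    preamble, block = CHAT_TEMPLATE.split("### Query:", 1)
    block = "### Query:" + block
    pieces = [preamble]
    for user, assistant in zip(turns[0::2], turns[1::2]):
        filled = block.replace("{INPUT}", user["content"])
        filled = filled.replace("{OUTPUT}", assistant["content"])
        # An EOS token after every answer tells the model where to stop.
        pieces.append(filled + eos_token + "\n\n")
    return "".join(pieces).rstrip("\n")

turns = [
    {"role": "user", "content": "I am still waiting on my card?"},
    {"role": "assistant", "content": "11"},
    {"role": "user", "content": "How can I convert currencies?"},
    {"role": "assistant", "content": "33"},
]
text = render_conversation(turns, eos_token="<|eot_id|>")
print(text)
```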


In [12]:
%%time
# train the model
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 600,
        num_train_epochs = 1, # Ignored while max_steps > 0; max_steps takes precedence
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
    callbacks=[EC2MetricsCallback],
)

Converting train dataset to ChatML (num_proc=2): 100%|██████████| 10003/10003 [00:00<00:00, 18760.57 examples/s]
Applying chat template to train dataset (num_proc=2): 100%|██████████| 10003/10003 [00:01<00:00, 6852.98 examples/s]
Tokenizing train dataset (num_proc=2): 100%|██████████| 10003/10003 [00:02<00:00, 3987.54 examples/s]
Tokenizing train dataset (num_proc=2): 100%|██████████| 10003/10003 [00:00<00:00, 11281.17 examples/s]

CPU times: user 986 ms, sys: 344 ms, total: 1.33 s
Wall time: 6.34 s
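
The batch settings in `TrainingArguments` combine as follows. This is plain arithmetic on the values used above; the single-GPU assumption matches the "Num GPUs = 1" line in the training banner below.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1              # assumption: single-GPU run
max_steps = 600
num_train_examples = 10_003

# Examples consumed per optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Examples consumed over the whole run, and the share of one epoch that covers.
examples_seen = effective_batch_size * max_steps
fraction_of_epoch = examples_seen / num_train_examples

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen: {examples_seen} ({fraction_of_epoch:.0%} of one epoch)")
```

With these values the run stops after 600 steps, before a full pass over the training split.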





In [13]:
%%time
# this will initiate the training process and also log the EC2 utilization metrics, such as the GPU
# utilization, CPU utilization, etc.
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 10,003 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 600
 "-____-"     Number of trainable parameters = 41,943,040
[2025-03-16 13:56:15,518] p37407 {ec2_metrics.py:184} INFO - Training started. Initiating EC2 metrics collection.
[2025-03-16 13:56:15,521] p37407 {ec2_metrics.py:170} INFO - Writing header: ['timestamp', 'cpu_percent_mean', 'memory_percent_mean', 'memory_used_mean', 'gpu_utilization_mean', 'gpu_memory_used_mean', 'gpu_memory_free_mean', 'gpu_memory_total_mean']
[2025-03-16 13:56:15,522] p37407 {ec2_metrics.py:41} INFO - Starting collection
[2025-03-16 13:56:15,954] p37407 {ec2_metrics.py:143} INFO - Starting daemon collector to run in background


Step,Training Loss
1,2.34
2,2.4714
3,2.5144
4,2.4015
5,2.2918
6,2.0268
7,1.8131
8,1.5834
9,1.3408
10,1.2266
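
The settings `warmup_steps = 5`, `lr_scheduler_type = "linear"`, `learning_rate = 2e-4`, and `max_steps = 600` imply a learning rate that ramps up during the first five steps and then decays linearly to zero. A minimal sketch of that rule follows; it mirrors the usual linear-warmup, linear-decay formula and is not output captured from this run.

```python
def linear_schedule_lr(step: int, base_lr: float = 2e-4,
                       warmup_steps: int = 5, total_steps: int = 600) -> float:
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Ramp from 0 up to base_lr across the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 by the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 1, 5, 300, 600):
    print(f"step {step:>3}: lr = {linear_schedule_lr(step):.2e}")
```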


[2025-03-16 14:13:17,086] p37407 {ec2_metrics.py:191} INFO - Training ended. Stopping EC2 metrics collection.
[2025-03-16 14:13:17,087] p37407 {ec2_metrics.py:33} INFO - Stopped collection


CPU times: user 13min, sys: 4min 19s, total: 17min 20s
Wall time: 17min 2s


[2025-03-16 14:13:21,005] p37407 {ec2_metrics.py:33} INFO - Stopped collection
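
The log lines above show a collector that starts in the background when training begins and stops when it ends. The source of `ec2_metrics.py` is not part of this notebook, so the following is only a hypothetical sketch of that pattern using the standard library. The class name, the `sample_fn` argument, and the in-memory CSV sink are all assumptions for illustration.

```python
import csv
import io
import threading
import time

class BackgroundMetricsCollector:
    """Hypothetical sketch: sample metrics on a daemon thread during training."""

    def __init__(self, sample_fn, interval_s: float = 1.0):
        self.sample_fn = sample_fn      # returns a list of metric values
        self.interval_s = interval_s
        self.sink = io.StringIO()       # stands in for a CSV file on disk
        self._writer = csv.writer(self.sink)
        self._stop = threading.Event()
        self._thread = None

    def _run(self):
        while True:
            self._writer.writerow([time.time(), *self.sample_fn()])
            # wait() returns True as soon as stop is requested.
            if self._stop.wait(self.interval_s):
                break

    def on_train_begin(self, header):
        self._writer.writerow(["timestamp", *header])
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def on_train_end(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()

def fake_sample():
    # Placeholder metric; a real collector would read CPU and GPU utilization.
    return [0.0]

collector = BackgroundMetricsCollector(fake_sample, interval_s=0.01)
collector.on_train_begin(["cpu_percent_mean"])
time.sleep(0.05)  # stands in for the training loop
collector.on_train_end()
rows = collector.sink.getvalue().strip().splitlines()
print(f"collected {len(rows) - 1} samples")
```

Running the sampler on a separate thread keeps metric collection from blocking the training loop.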


### Log the trainer stats
---

In this step, we record selected trainer stats: the number of global steps, the average training loss, the train runtime, samples per second, and steps per second. The formatted summary is written to a text file.

In [14]:
# Format the training stats in a readable way
output_text = f"""Training Statistics:
Global Steps: {trainer_stats.global_step}
Training Loss: {trainer_stats.training_loss:.4f}

Metrics:
- Train Runtime: {trainer_stats.metrics['train_runtime']:.3f} seconds
- Training Samples/Second: {trainer_stats.metrics['train_samples_per_second']:.3f}
- Training Steps/Second: {trainer_stats.metrics['train_steps_per_second']:.3f}
- Total FLOPs (floating-point operations): {trainer_stats.metrics['total_flos']:.2e}
- Final Train Loss: {trainer_stats.metrics['train_loss']:.4f}
"""

# Save to a text file
with open(os.path.join(g.RESULTS_DIR, g.TRAINING_STATS), 'w') as f:
    f.write(output_text)
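
The throughput metrics can be sanity-checked by hand. The sketch below derives approximate values from the 17 min 2 s wall time reported by `%%time`, the 600 steps, and the total batch size of 8. The `throughput` helper is hypothetical, and because wall time includes overhead outside the training loop, the trainer's own `train_runtime` and derived rates may differ slightly.

```python
def throughput(global_steps: int, effective_batch_size: int, runtime_s: float) -> dict:
    """Approximate steps/second and samples/second for a training run."""
    return {
        "train_steps_per_second": global_steps / runtime_s,
        "train_samples_per_second": global_steps * effective_batch_size / runtime_s,
    }

wall_time_s = 17 * 60 + 2  # "Wall time: 17min 2s" from the training cell
stats = throughput(global_steps=600, effective_batch_size=8, runtime_s=wall_time_s)

for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```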

In [15]:
# save the model
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

In [16]:
from transformers import TextStreamer
import torch

print("Running inference on banking queries...")
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

test_queries = [
    "I see a charge on my credit card statement but I paid on time, why?",
    "Do you have a branch in Timbuktu?",
    "I lost my card and my replacement card has not arrived."
]

def display_inference(query):
    messages = [{"role": "user", "content": f"Classify the following banking customer service query into the appropriate category:\n\n{query}"}]
    
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to("cuda")
    
    print(f"\n\n--- Query: {query} ---")
    print("Predicted intent category:")
    
    text_streamer = TextStreamer(tokenizer, skip_prompt=True)
    _ = model.generate(
        input_ids, 
        streamer=text_streamer,
        max_new_tokens=64,
        temperature=0.1,
        pad_token_id=tokenizer.eos_token_id
    )
    print("\n" + "-"*50)


def get_inference_output(query):
    messages = [{"role": "user", "content": f"Classify the following banking customer service query into the appropriate category:\n\n{query}"}]
    input_ids = tokenizer.apply_chat_template(
        messages, 
        add_generation_prompt=True, 
        return_tensors="pt"
    ).to("cuda")
    
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=64,
            temperature=0.1,
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Decode only the newly generated tokens so the prompt never leaks into the answer
    new_tokens = output[0][input_ids.shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

for query in test_queries:
    display_inference(query)

with open("problem1_task1.txt", "w") as f:
    for query in test_queries:
        category = get_inference_output(query)
        f.write(f"input: {query}\n")
        f.write(f"category: {category}\n")

print("Inference results saved to problem1_task1.txt")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
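
The warning above appears because a mask cannot be guessed from the token ids when the pad token and the EOS token share an id. The toy sketch below shows the ambiguity with made-up token ids in pure Python; it does not call the tokenizer or the model.

```python
PAD_ID = EOS_ID = 0  # made-up ids; the point is only that they are equal

def left_pad(sequences: list, pad_id: int):
    """Left-pad to a common width and build the mask from the true lengths."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask

def infer_mask(input_ids: list, pad_id: int) -> list:
    """Naive guess: every token equal to pad_id is treated as padding."""
    return [[int(tok != pad_id) for tok in row] for row in input_ids]

# The first sequence contains a genuine EOS token in the middle.
batch = [[11, 12, EOS_ID, 13], [21, 22]]
ids, mask = left_pad(batch, PAD_ID)
guessed = infer_mask(ids, PAD_ID)

print("explicit mask:", mask)
print("inferred mask:", guessed)  # wrongly hides the real EOS token
```

For a single unpadded prompt, as in this notebook, the correct mask is all ones, so passing an explicit mask such as `torch.ones_like(input_ids)` to `model.generate` is one way to address the warning.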


Running inference on banking queries...


--- Query: I see a charge on my credit card statement but I paid on time, why? ---
Predicted intent category:
15<|eot_id|>

--------------------------------------------------


--- Query: Do you have a branch in Timbuktu? ---
Predicted intent category:
24<|eot_id|>

--------------------------------------------------


--- Query: I lost my card and my replacement card has not arrived. ---
Predicted intent category:
11<|eot_id|>

--------------------------------------------------
Inference results saved to problem1_task1.txt
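
The streamed predictions above are numeric label ids followed by the `<|eot_id|>` marker. A small post-processing step can turn that raw text into a validated integer. The `parse_intent` helper below is a hypothetical sketch; it assumes special tokens look like `<|...|>` and that valid labels are 0 through 76, matching the 77 intents in `banking77`.

```python
import re

NUM_INTENTS = 77  # banking77 has 77 intent classes, labelled 0..76

def parse_intent(generated: str, num_intents: int = NUM_INTENTS):
    """Return the predicted label id, or None if the text is not a valid id."""
    # Drop special-token markers such as <|eot_id|>.
    cleaned = re.sub(r"<\|[^|]*\|>", "", generated).strip()
    if re.fullmatch(r"\d+", cleaned) is None:
        return None
    label = int(cleaned)
    return label if 0 <= label < num_intents else None

for raw in ("15<|eot_id|>", "24<|eot_id|>", "not sure", "99"):
    print(repr(raw), "->", parse_intent(raw))
```

Returning `None` for anything that is not a valid id makes malformed generations easy to count during evaluation.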
