# Finetune `meta-llama/Meta-Llama-3-8B-Instruct` on an EC2 instance using `Unsloth`
---

According to the Unsloth project, the library makes finetuning large language models such as Llama-3, Mistral, Phi-4, and Gemma 2x faster and uses 70% less memory, with no degradation in accuracy.

**Note**: ***This notebook is run on a `g6e.12xlarge` instance. Follow the prerequisite steps [here](README.md)***

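In this example, we fine-tune a Llama instruct model. `unsloth` also provides several 4-bit pre-quantized models that are not gated, which it says download 4x faster and avoid out-of-memory (OOM) errors. Here we instead load a standard, gated checkpoint from Hugging Face with `load_in_4bit = False`. Note that the cells below set `hf_model_id` to `meta-llama/Llama-3.2-1B-Instruct`, so the recorded outputs come from that 1B model rather than from `meta-llama/Meta-Llama-3-8B-Instruct`.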

In [1]:
import os
import logging
import getpass
import numpy as np
from dotenv import load_dotenv
from datasets import load_dataset, Dataset
from unsloth import FastLanguageModel
from unsloth import standardize_sharegpt
from unsloth import apply_chat_template
from unsloth import is_bfloat16_supported


# Create a logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)


# Remove existing handlers
if logger.handlers:
   logger.handlers.clear()


# Add a simple handler
handler = logging.StreamHandler()
formatter = logging.Formatter('[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)



🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


2025-03-20 00:39:27,864	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


In [2]:
# Load environment variables from .env file
load_dotenv()

if not os.getenv("HF_TOKEN"):
   os.environ["HF_TOKEN"] = getpass.getpass("Enter your HuggingFace token: ")
hf_token = os.getenv("HF_TOKEN")


# Set model ID - Ensure this exactly matches the Hugging Face repository name
hf_model_id = "meta-llama/Llama-3.2-1B-Instruct"
logger.info(f"Using model: {hf_model_id}")

[2025-03-20 00:39:28,691] p35048 {944205494.py:11} INFO - Using model: meta-llama/Llama-3.2-1B-Instruct


In [3]:
# Model configuration
max_seq_length = 2048  
dtype = None 
load_in_4bit = False


# Print token info just to make sure
token_prefix = hf_token[:4] + "*" * (len(hf_token) - 4) if hf_token else "None"
logger.info(f"Using token prefix: {token_prefix}")
print(hf_token[:4])

[2025-03-20 00:39:28,696] p35048 {1483811561.py:9} INFO - Using token prefix: hf_T*********************************


hf_T
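
The cell above masks the token for the log line but then slices `hf_token[:4]` unconditionally, which would raise a `TypeError` if the token were `None`. As a minimal sketch, a small helper can cover both cases; the name `mask_token` and the `visible` parameter are hypothetical and not part of the original notebook.

```python
from typing import Optional


def mask_token(token: Optional[str], visible: int = 4) -> str:
    """Return a log-safe form of a secret: the first `visible` characters, then asterisks."""
    if not token:
        # Covers both None and the empty string
        return "None"
    visible = min(visible, len(token))
    return token[:visible] + "*" * (len(token) - visible)


# Made-up token values, for illustration only
print(mask_token("hf_Texample"))
print(mask_token(None))
```

Logging only the masked form avoids writing more of the secret than necessary into notebook outputs.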


In [4]:
# Load the model and tokenizer
try:
   logger.info(f"Attempting to load model from {hf_model_id}")
   model, tokenizer = FastLanguageModel.from_pretrained(
       model_name=hf_model_id,
       max_seq_length=max_seq_length,
       dtype=dtype,
       load_in_4bit=load_in_4bit,
       token=hf_token
   )
   logger.info("Successfully loaded model and tokenizer")
except Exception as e:
   logger.error(f"Error occurred while loading the model: {e}")
   logger.error("Please ensure you have accepted the model license on Hugging Face")
   logger.error("Visit https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct to accept")
   raise


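# Prepare the model for fine-tuning by attaching LoRA adapters
model = FastLanguageModel.get_peft_model(
   model,
   r=16,  # LoRA rank: width of the low-rank update matrices
   target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"],  # Attention and MLP projections
   lora_alpha=16,  # LoRA scaling factor
   lora_dropout=0,  # No dropout on the adapter inputs
   bias="none",  # Leave bias terms untrained
   use_gradient_checkpointing="unsloth",  # Unsloth's checkpointing mode, trades compute for memory
   random_state=3407,  # Seed for reproducible adapter initialization
   use_rslora=False,  # Standard LoRA scaling rather than rank-stabilized LoRA
   loftq_config=None,  # No LoftQ initialization
)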

[2025-03-20 00:39:28,702] p35048 {2742727994.py:3} INFO - Attempting to load model from meta-llama/Llama-3.2-1B-Instruct


==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    GPU: NVIDIA L4. Max memory: 22.045 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


[2025-03-20 00:39:37,320] p35048 {2742727994.py:11} INFO - Successfully loaded model and tokenizer
Unsloth 2025.2.15 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
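
The LoRA settings above determine how many parameters are actually trained: each adapted linear layer gains two matrices of shapes `(in_features, r)` and `(r, out_features)`, so it contributes `r * (in_features + out_features)` parameters. The sketch below reproduces that arithmetic. Only the 2048-wide `q_proj` and the 16 layers are confirmed by the outputs in this notebook; the key/value width (512) and MLP width (8192) are assumptions taken from the published Llama-3.2-1B configuration.

```python
# Assumed projection shapes as (in_features, out_features).
# HIDDEN is confirmed by the model printout; KV_DIM and MLP_DIM are assumptions.
HIDDEN, KV_DIM, MLP_DIM = 2048, 512, 8192

PROJECTION_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(shapes: dict, rank: int, num_layers: int) -> int:
    """Count LoRA parameters: rank * (in + out) per adapted layer, summed over all layers."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


total = lora_param_count(PROJECTION_SHAPES, rank=16, num_layers=16)
print(f"Trainable LoRA parameters: {total:,}")
```

Under these assumptions the total agrees with the `Number of trainable parameters = 11,272,192` line in the training banner further down.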


## Data Prep

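This section uses the Banking77 dataset, which pairs customer banking queries with one of 77 intent categories. The training split contains 10,003 examples.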

In [5]:
# Load the Banking77 dataset
logger.info("Loading Banking77 dataset")
dataset = load_dataset("banking77")
logger.info(f"Columns in the dataset: {dataset['train'].column_names}")
logger.info(f"Dataset splits: {dataset.keys()}")
logger.info(f"Training examples: {len(dataset['train'])}")

# Get the label names of the Banking77 dataset
logger.info("Getting category names from the dataset")
category_names = dataset["train"].features["label"].names
logger.info(f"Number of categories: {len(category_names)}")
logger.info(f"First 5 category names: {category_names[:5]}")

# Convert the dataset to the appropriate format
def format_banking77(examples):
   formatted_conversations = []
  
   for text, label in zip(examples["text"], examples["label"]):
       # Include both the label number and name in the training
       category_name = category_names[label]
       conversation = [
           {"role": "user", "content": f"Classify the following banking query into the correct category: {text}"},
           {"role": "assistant", "content": f"Category {label}: {category_name}"}
       ]
       formatted_conversations.append(conversation)
  
   return {"conversations": formatted_conversations}

[2025-03-20 00:39:39,826] p35048 {940285120.py:2} INFO - Loading Banking77 dataset
[2025-03-20 00:39:43,005] p35048 {940285120.py:4} INFO - Columns in the dataset: ['text', 'label']
[2025-03-20 00:39:43,005] p35048 {940285120.py:5} INFO - Dataset splits: dict_keys(['train', 'test'])
[2025-03-20 00:39:43,006] p35048 {940285120.py:6} INFO - Training examples: 10003
[2025-03-20 00:39:43,006] p35048 {940285120.py:9} INFO - Getting category names from the dataset
[2025-03-20 00:39:43,006] p35048 {940285120.py:11} INFO - Number of categories: 77
[2025-03-20 00:39:43,007] p35048 {940285120.py:12} INFO - First 5 category names: ['activate_my_card', 'age_limit', 'apple_pay_or_google_pay', 'atm_support', 'automatic_top_up']
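
To make the target format concrete, the sketch below applies the same transformation as `format_banking77` to a two-row toy batch, without needing the `datasets` library. The sample queries are made up for illustration; the two category names are the first two reported in the log above. The helper name `format_batch` is hypothetical.

```python
from typing import Dict, List

# Toy stand-ins for the real dataset; labels index into the category list
sample_category_names = ["activate_my_card", "age_limit"]
sample_batch = {
    "text": ["How do I activate my new card?", "Can my 15-year-old open an account?"],
    "label": [0, 1],
}


def format_batch(examples: Dict[str, list], names: List[str]) -> Dict[str, list]:
    """Turn a batch of (text, label) columns into two-turn user/assistant conversations."""
    conversations = []
    for text, label in zip(examples["text"], examples["label"]):
        conversations.append([
            {"role": "user",
             "content": f"Classify the following banking query into the correct category: {text}"},
            {"role": "assistant", "content": f"Category {label}: {names[label]}"},
        ])
    return {"conversations": conversations}


formatted = format_batch(sample_batch, sample_category_names)
for turn in formatted["conversations"][0]:
    print(turn["role"], "->", turn["content"])
```

Because the function receives column lists and returns a dict of lists, it has the shape that `Dataset.map(..., batched=True)` expects.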


In [6]:
# Apply formatting to the training set
logger.info("Formatting training dataset")
train_formatted = dataset["train"].map(
   format_banking77,
   batched=True,
   remove_columns=dataset["train"].column_names
)


# Now standardize the dataset
logger.info("Standardizing dataset")
standardized_dataset = standardize_sharegpt(train_formatted)


# Define the chat template
chat_template = """Below is a banking query. Classify it into the appropriate category number (0-76) and provide the category name.


### Instruction:
{INPUT}


### Response:
{OUTPUT}"""

[2025-03-20 00:39:43,012] p35048 {3104140645.py:2} INFO - Formatting training dataset
[2025-03-20 00:39:43,015] p35048 {3104140645.py:11} INFO - Standardizing dataset
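
The `{INPUT}` and `{OUTPUT}` placeholders in the template mark where the user turn and the assistant turn are substituted. The sketch below approximates the resulting training text with plain string replacement. It is not Unsloth's implementation, which builds a tokenizer chat template and handles special tokens itself; the `render` helper and the `<EOS>` marker are hypothetical placeholders.

```python
CHAT_TEMPLATE = """Below is a banking query. Classify it into the appropriate category number (0-76) and provide the category name.


### Instruction:
{INPUT}


### Response:
{OUTPUT}"""


def render(template: str, user_text: str, assistant_text: str, eos_token: str = "<EOS>") -> str:
    """Fill the two placeholders and append an end-of-sequence marker."""
    # str.replace rather than str.format, so stray braces in a query do not raise
    filled = template.replace("{INPUT}", user_text).replace("{OUTPUT}", assistant_text)
    return filled + eos_token


rendered = render(
    CHAT_TEMPLATE,
    "Classify the following banking query into the correct category: How do I activate my new card?",
    "Category 0: activate_my_card",
)
print(rendered)
```

The trailing end-of-sequence marker matters because it teaches the model where a classification ends.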


In [7]:
# Apply the chat template
logger.info("Applying chat template")
processed_dataset = apply_chat_template(
   standardized_dataset,
   tokenizer=tokenizer,
   chat_template=chat_template,
)


# Set up the training
from trl import SFTTrainer
from transformers import TrainingArguments


logger.info("Setting up trainer")
trainer = SFTTrainer(
   model=model,
   tokenizer=tokenizer,
   train_dataset=processed_dataset,
   dataset_text_field="text",
   max_seq_length=max_seq_length,
   dataset_num_proc=2,
   packing=False,  # Can make training 5x faster for short sequences.
   args=TrainingArguments(
       per_device_train_batch_size=4,  # Adjust based on your GPU memory
       gradient_accumulation_steps=4,  # Effective batch size = 4 * 4 = 16
       warmup_steps=100,
       max_steps=600,  # When set, takes precedence over num_train_epochs
       num_train_epochs=1,  # Ignored here because max_steps is set
       learning_rate=2e-4,
       fp16=not is_bfloat16_supported(),
       bf16=is_bfloat16_supported(),
       logging_steps=10,
       optim="adamw_8bit",
       weight_decay=0.01,
       lr_scheduler_type="linear",
       seed=3407,
       output_dir="outputs",
       report_to="none",  # Disables experiment trackers; set to "wandb" etc. to enable one
   ),
)

[2025-03-20 00:39:43,023] p35048 {2532199462.py:2} INFO - Applying chat template
Unsloth: We automatically added an EOS token to stop endless generations.


Map: 100%|██████████| 10003/10003 [00:00<00:00, 30719.44 examples/s]
[2025-03-20 00:39:43,514] p35048 {2532199462.py:15} INFO - Setting up trainer
Converting train dataset to ChatML (num_proc=2): 100%|██████████| 10003/10003 [00:00<00:00, 25681.62 examples/s]
Applying chat template to train dataset (num_proc=2): 100%|██████████| 10003/10003 [00:01<00:00, 7735.72 examples/s]
Tokenizing train dataset (num_proc=2): 100%|██████████| 10003/10003 [00:01<00:00, 5343.42 examples/s]
Tokenizing train dataset (num_proc=2): 100%|██████████| 10003/10003 [00:00<00:00, 16296.52 examples/s]
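
The trainer arguments interact: the per-device batch size and gradient accumulation set the effective batch size, and `max_steps` then fixes how much of the dataset is seen. The sketch below works through that arithmetic with the values from this notebook. The single GPU is taken from the training banner, and the steps-per-epoch figure is an approximation, since the trainer's own rounding may differ by one.

```python
import math

num_examples = 10_003      # Training examples reported when the dataset was loaded
per_device_batch = 4       # per_device_train_batch_size
grad_accum = 4             # gradient_accumulation_steps
num_gpus = 1               # As reported by the training banner
max_steps = 600

# Examples consumed per optimizer step
effective_batch = per_device_batch * grad_accum * num_gpus

# Approximate optimizer steps needed for one full pass over the data
steps_per_epoch = math.ceil(num_examples / effective_batch)

# What max_steps actually covers
examples_seen = max_steps * effective_batch
epochs_covered = examples_seen / num_examples

print(f"Effective batch size: {effective_batch}")
print(f"Approx. steps per epoch: {steps_per_epoch}")
print(f"Examples seen in {max_steps} steps: {examples_seen} ({epochs_covered:.2%} of one epoch)")
```

So the 600-step run stops slightly short of a full epoch.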


In [8]:
# Train the model
logger.info("Starting training")
trainer_stats = trainer.train()

[2025-03-20 00:39:48,320] p35048 {840474025.py:2} INFO - Starting training
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 10,003 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 600
 "-____-"     Number of trainable parameters = 11,272,192


| Step | Training Loss |
|------|---------------|
| 10 | 3.8368 |
| 20 | 3.5673 |
| 30 | 2.5618 |
| 40 | 1.4396 |
| 50 | 1.1973 |
| 60 | 1.0392 |
| 70 | 0.9217 |
| 80 | 0.8797 |
| 90 | 0.8397 |
| 100 | 0.7132 |
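
The logged steps above all fall within the 100-step warmup configured by `warmup_steps=100`, during which the learning rate ramps up to `learning_rate=2e-4` before the `linear` scheduler decays it toward zero at step 600. The sketch below reproduces the general shape of such a schedule as a plain function. It is an illustration under those settings, not the trainer's exact code, and it does not model which step index the trainer reports in its logs.

```python
def linear_schedule_lr(step: int, base_lr: float = 2e-4,
                       warmup_steps: int = 100, total_steps: int = 600) -> float:
    """Linear warmup from 0 to base_lr, then linear decay from base_lr to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (10, 50, 100, 350, 600):
    print(f"step {step:>3}: lr = {linear_schedule_lr(step):.2e}")
```

A long warmup relative to the run length means the early loss values are produced at a fraction of the peak learning rate.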


In [9]:
# Format the training stats in a readable way and log them
output_text = f"""Training Statistics:
Global Steps: {trainer_stats.global_step}
Training Loss: {trainer_stats.training_loss:.4f}


Metrics:
- Train Runtime: {trainer_stats.metrics['train_runtime']:.3f} seconds
- Training Samples/Second: {trainer_stats.metrics['train_samples_per_second']:.3f}
- Training Steps/Second: {trainer_stats.metrics['train_steps_per_second']:.3f}
- Total FLOPS: {trainer_stats.metrics['total_flos']:.2e}
- Final Train Loss: {trainer_stats.metrics['train_loss']:.4f}
"""
logger.info(output_text)

[2025-03-20 00:44:50,788] p35048 {3055148779.py:15} INFO - Training Statistics:
Global Steps: 600
Training Loss: 0.7399


Metrics:
- Train Runtime: 301.565 seconds
- Training Samples/Second: 31.834
- Training Steps/Second: 1.990
- Total FLOPS: 4.35e+15
- Final Train Loss: 0.7399
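
The throughput figures above can be cross-checked from the step count, the effective batch size of 16 shown in the training banner, and the runtime. The sketch below does that arithmetic. Separately, the `total_flos` metric is a cumulative count of floating-point operations rather than a per-second rate, so its label reads more accurately as "Total FLOPs".

```python
global_steps = 600
effective_batch = 16        # Total batch size from the training banner
train_runtime_s = 301.565   # Train runtime from the logged metrics

# Samples processed per second and optimizer steps per second
samples_per_second = global_steps * effective_batch / train_runtime_s
steps_per_second = global_steps / train_runtime_s

print(f"{samples_per_second:.3f} samples/s, {steps_per_second:.3f} steps/s")
```

Both derived values agree with the logged metrics to three decimal places.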



In [10]:
# Save the model
logger.info("Saving model and tokenizer")
model.save_pretrained("banking77_fine_tuned")
tokenizer.save_pretrained("banking77_fine_tuned")

# Save the category names for inference
import json
with open("banking77_categories.json", "w") as f:
    json.dump(category_names, f)
logger.info("Saved category names to banking77_categories.json")


# Prepare the model for inference
logger.info("Preparing model for inference")
FastLanguageModel.for_inference(model)

[2025-03-20 00:44:50,793] p35048 {1598851975.py:2} INFO - Saving model and tokenizer
[2025-03-20 00:44:51,201] p35048 {1598851975.py:10} INFO - Saved category names to banking77_categories.json
[2025-03-20 00:44:51,201] p35048 {1598851975.py:14} INFO - Preparing model for inference


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 2048, padding_idx=128004)
        (layers): ModuleList(
          (0-15): 16 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear
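
The category names are saved alongside the adapter so that a later inference session can map label numbers back to names without reloading the dataset. A minimal sketch of that round trip follows, using a temporary directory and the first three category names from the log; the `id_to_name` and `name_to_id` lookups are hypothetical conveniences, not part of the original notebook.

```python
import json
import tempfile
from pathlib import Path

# First three category names from the dataset log, used as sample data
names = ["activate_my_card", "age_limit", "apple_pay_or_google_pay"]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "banking77_categories.json"
    # Save, as the notebook does with json.dump
    path.write_text(json.dumps(names), encoding="utf-8")
    # Load again, as an inference script would
    restored = json.loads(path.read_text(encoding="utf-8"))

# List position is the label id, so enumerate rebuilds both directions
id_to_name = dict(enumerate(restored))
name_to_id = {name: idx for idx, name in id_to_name.items()}
print(id_to_name[2], name_to_id["age_limit"])
```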

In [11]:
# Define test queries
test_queries = [
   "I see a charge on my credit card statement but I paid on time, why?",
   "Do you have a branch in Timbuktu?",
   "I lost my card and my replacement card has not arrived."
]

In [12]:
# Function to generate response without streaming
def generate_response(query):
   logger.info(f"Generating response for: {query}")
   messages = [
       {"role": "user", "content": f"Classify the following banking query into the correct category: {query}"}
   ]
   input_ids = tokenizer.apply_chat_template(
       messages,
       add_generation_prompt=True,
       return_tensors="pt",
   ).to("cuda")
  
   outputs = model.generate(
       input_ids,
       max_new_tokens=128,
       pad_token_id=tokenizer.eos_token_id,
       do_sample=False
   )
  
   # Decode the output, skipping the prompt
   decoded_output = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)
   
   # Try to parse the output to add category name if it's not already included
   try:
       # Check if the response is just a number
       if decoded_output.strip().isdigit():
           category_num = int(decoded_output.strip())
           if 0 <= category_num < len(category_names):
               return f"Category {category_num}: {category_names[category_num]}"
       
       # If the model already outputs the full format or something else, return as is
       return decoded_output.strip()
   except ValueError:
       # isdigit() accepts characters such as "²" that int() rejects; return the original output
       return decoded_output.strip()
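
Since the model is trained to answer in the form `Category <number>: <name>`, downstream code usually needs the label number back as an integer. The sketch below shows one way to parse that text defensively. The helper `parse_prediction` and its regular expression are assumptions for illustration; they are not part of the original notebook and have not been validated against real model output.

```python
import re
from typing import List, Optional, Tuple

# Matches "Category 12: some_name"; re.ASCII restricts \d to the digits 0-9
CATEGORY_PATTERN = re.compile(r"^\s*Category\s+(\d+)\s*:", re.ASCII)
BARE_NUMBER_PATTERN = re.compile(r"^\s*(\d+)\s*$", re.ASCII)


def parse_prediction(text: str, names: List[str]) -> Optional[Tuple[int, str]]:
    """Return (label_id, canonical_name) if a valid label id can be extracted, else None."""
    match = CATEGORY_PATTERN.match(text) or BARE_NUMBER_PATTERN.match(text)
    if match is None:
        return None
    label_id = int(match.group(1))
    if not 0 <= label_id < len(names):
        # The model produced a number outside the known label range
        return None
    return label_id, names[label_id]


sample_names = ["activate_my_card", "age_limit"]
print(parse_prediction("Category 1: age_limit", sample_names))
print(parse_prediction("0", sample_names))
print(parse_prediction("I am not sure", sample_names))
```

Returning `None` for unparseable or out-of-range answers lets the caller count them as errors instead of silently passing free text along.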

In [13]:
# Generate responses for all test queries
results = []
for query in test_queries:
   response = generate_response(query)
   results.append(f"Input: {query}\nClassification: {response}\n")


# Save the results to a file
with open("problem1_task1.txt", "w") as f:
   f.write("\n".join(results))

[2025-03-20 00:44:51,222] p35048 {3359099013.py:3} INFO - Generating response for: I see a charge on my credit card statement but I paid on time, why?
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
[2025-03-20 00:44:51,538] p35048 {3359099013.py:3} INFO - Generating response for: Do you have a branch in Timbuktu?
[2025-03-20 00:44:51,681] p35048 {3359099013.py:3} INFO - Generating response for: I lost my card and my replacement card has not arrived.
