# Crisis Manager Agent: Fine-Tuning Llama 3.2 3B

## **Author:** Muwaffak Abed



**Objective:** Fine-tune a Llama 3.2 3B Instruct model to act as a Crisis Manager. The result will be an AI agent designed to handle high-stakes, toxic, or difficult customer support scenarios with empathy and firm adherence to policy.


## Setup & Configuration

* We import the Unsloth and PyTorch libraries.
* We set our memory rules (a maximum sequence length of 2048 tokens and 4-bit loading).
* We load the base Llama 3.2 model and its tokenizer into the system using those rules.
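
To give a feel for why 4-bit loading matters on a single laptop GPU, here is a back-of-envelope sketch. It assumes roughly 3.2 billion weights and that every weight is stored at the stated bit width; both are simplifications. In practice some layers usually stay in higher precision and quantization adds bookkeeping data, so the real footprint is larger, and activations, LoRA adapters, and optimizer state are not counted at all.

```python
# Back-of-envelope estimate of weight storage at different precisions.
# ASSUMPTION: ~3.2 billion weights, all stored at the stated bit width.
N_PARAMS = 3_200_000_000


def weight_memory_gb(n_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the space of 16-bit weights, which leaves headroom for training on a 16 GB card.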

In [1]:
# Import unsloth and torch to run the model
import unsloth
import torch

# Disable TorchDynamo / torch.compile so the notebook does not crash at startup
# on this setup (remove this block if the code works without it)
import torch._dynamo
torch._dynamo.config.suppress_errors = True
def no_compile(fn, *args, **kwargs):
    return fn
torch.compile = no_compile
from unsloth import FastLanguageModel

# define maximum tokens and parameters for the model
max_seq_length = 2048
dtype = None
load_in_4bit = True
# Load the model from unsloth and the tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.




🦥 Unsloth Zoo will now patch everything to make training faster!




==((====))==  Unsloth 2026.2.1: Fast Llama patching. Transformers: 4.57.6.
   \\   /|    NVIDIA GeForce RTX 4090 Laptop GPU. Num GPUs = 1. Max memory: 15.992 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 8.9. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## Data Loading & Preprocessing

* Format the data with the Llama 3 chat template so the model can tell who is speaking.
* Pair each instruction with its response as a single conversation and assign roles so the tokenizer can apply the formatting.
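
To make the formatting step concrete, here is a hand-written sketch of roughly what a Llama 3 style template produces for one instruction/response pair. The `render_llama3` helper and the sample record are hypothetical and for illustration only; the actual string comes from `tokenizer.apply_chat_template`, and the real template may differ in details.

```python
def render_llama3(conversation):
    """Hand-written approximation of the Llama 3 chat layout (illustrative only)."""
    text = "<|begin_of_text|>"
    for message in conversation:
        # Each turn: a role header, a blank line, the content, an end-of-turn tag
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    return text


# Hypothetical record with the two fields the formatting function reads
record = {
    "instruction": "You are a useless bot.",
    "response": "I hear your frustration.",
}

conversation = [
    {"role": "user", "content": record["instruction"]},
    {"role": "assistant", "content": record["response"]},
]

rendered = render_llama3(conversation)
print(rendered)
```

The role headers are what let the model tell the customer's turn from the agent's turn, and the end-of-turn tag tells it where a reply should stop.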



In [2]:
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset
# Set up the llm chat format so the model knows who is speaking
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3",
)
# Function to combine the conversation into a chat format
def formatting_prompts_func(examples):
    texts = []
    # pair up each instruction and response
    for inst, resp in zip(examples["instruction"], examples["response"]):
        # Structure the text like a real back and forth conversation
        conversation = [
            {"role": "user", "content": inst},
            {"role": "assistant", "content": resp},
        ]
        # Apply the hidden Llama 3 tags to the conversation
        text = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False)
        texts.append(text)
    return { "text" : texts, }
# Load your raw data file into memory
dataset = load_dataset("json", data_files="raw_dataset.jsonl", split="train")
# Apply the formatting function to the whole dataset in fast chunks
dataset = dataset.map(formatting_prompts_func, batched = True,)

## QLoRA Configuration


* This cell sets up QLoRA: LoRA (Low-Rank Adaptation) adapters on top of the 4-bit base model, so we can actually train the 3-billion-parameter model locally.

In [3]:
# Configure QLoRA so the model trains faster and uses less VRAM on our local GPU
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2026.2.1 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
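
As a sanity check on the configuration above, the number of trainable LoRA weights can be estimated by hand. Each adapted linear layer gets two small matrices, one of shape `r x in_features` and one of shape `out_features x r`. The layer shapes below are assumptions taken from the publicly documented Llama 3.2 3B configuration, not values read from this notebook.

```python
# ASSUMPTION: layer shapes from the public Llama 3.2 3B configuration
HIDDEN = 3072        # model (hidden) dimension
INTERMEDIATE = 8192  # MLP inner dimension
KV_DIM = 1024        # 8 key/value heads x 128 head dimension
N_LAYERS = 28
RANK = 16            # r in the LoRA configuration

# (in_features, out_features) for every targeted projection
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Weights in one adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total_trainable = per_layer * N_LAYERS
print(f"{total_trainable:,}")  # 24,313,856
```

Under these assumed shapes the total agrees with the `Trainable parameters = 24,313,856` figure that Unsloth prints when training starts.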


## Training

* This cell configures the Supervised Fine-Tuning (SFT) trainer. We set training arguments such as the learning rate, batch size, and maximum number of steps, and then train the model.
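
To illustrate how `warmup_steps`, `max_steps`, `learning_rate`, and `lr_scheduler_type = "linear"` interact, here is a small standalone sketch of a linear warmup followed by a linear decay. The `linear_lr` function is a hypothetical helper written for this explanation; it mirrors the general shape of such a schedule and is not the Trainer's own code.

```python
def linear_lr(step: int, base_lr: float = 2e-4,
              warmup_steps: int = 5, total_steps: int = 59) -> float:
    """Learning rate after `step` completed optimizer updates."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup
        return base_lr * step / max(1, warmup_steps)
    # Then decay linearly from base_lr down to 0 at the final step
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


for step in (0, 1, 5, 32, 59):
    print(f"step {step:2d}: lr = {linear_lr(step):.2e}")
```

With only 59 steps in total, the five warmup steps cover a noticeable share of training, so the peak rate of 2e-4 is reached early and the rate falls for the rest of the run.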

In [4]:

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
# setup the Trainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    # Define the specific arguments for training
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 59,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
# train
trainer.train()

Unsloth: Tokenizing ["text"]: 100%|██████████| 76/76 [00:00<00:00, 2119.24 examples/s]
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 76 | Num Epochs = 6 | Total steps = 59
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856 of 3,237,063,680 (0.75% trained)


Step,Training Loss
1,4.0145
2,4.4163
3,4.4306
4,4.3401
5,4.0074
6,3.602
7,3.3968
8,3.4692
9,3.2949
10,2.5909


TrainOutput(global_step=59, training_loss=2.064111081220336, metrics={'train_runtime': 32.3184, 'train_samples_per_second': 14.605, 'train_steps_per_second': 1.826, 'total_flos': 423525147992064.0, 'train_loss': 2.064111081220336, 'epoch': 5.947368421052632})
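
The figures in the training log can be cross-checked with a little arithmetic. The sketch below is one way to account for them; it assumes that the leftover micro-batches at the end of each epoch still trigger an optimizer update, which is an assumption about the trainer's behavior rather than something the notebook states.

```python
import math

# Values taken from the training cell and its log
num_examples = 76
per_device_batch = 2
grad_accum = 4
num_gpus = 1
max_steps = 59

# Examples consumed per optimizer update
effective_batch = per_device_batch * grad_accum * num_gpus

# Micro-batches per pass over the data, then optimizer updates per pass
# ASSUMPTION: a partial accumulation group at the end of an epoch still steps
micro_batches_per_epoch = math.ceil(num_examples / per_device_batch)
updates_per_epoch = math.ceil(micro_batches_per_epoch / grad_accum)

# Whole epochs needed to fit max_steps updates
epochs_needed = math.ceil(max_steps / updates_per_epoch)

# Fractional epoch reached when training stops at max_steps
full_epochs, extra_updates = divmod(max_steps, updates_per_epoch)
fractional_epoch = full_epochs + extra_updates * grad_accum / micro_batches_per_epoch

print(effective_batch, updates_per_epoch, epochs_needed, round(fractional_epoch, 4))
```

Under that assumption the sketch gives a total batch size of 8, 6 epochs, and a final epoch value of about 5.947, in line with the log. It also shows how small the dataset is: the model sees each of the 76 examples roughly six times, so overfitting is worth watching for.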

## Inference

### Testing the Trained Model
* In this cell, we switch the model from training mode to inference mode. We define a function that takes a user prompt, formats it with the chat template, converts it into tokens, generates a response from our newly trained model, and strips the special tokens so the output is readable. Finally, we test it with a few challenging prompts!
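
The clean-up step can be tried on its own without a GPU. The sketch below uses a hypothetical `extract_reply` helper and a hand-written sample string in place of real model output. Unlike the notebook cell, it also calls `.strip()`, which removes the blank lines that follow the assistant header.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def extract_reply(decoded: str) -> str:
    """Return only the assistant's text from a decoded Llama 3 style sequence."""
    reply = decoded.split(ASSISTANT_HEADER)[-1]  # keep text after the last header
    reply = reply.replace(END_OF_TURN, "")       # drop the end-of-turn marker
    return reply.strip()                         # trim the leading blank lines


# Hand-written stand-in for what batch_decode might return
sample = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "You are a useless bot.<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "I hear your frustration.<|eot_id|>"
)

print(extract_reply(sample))
```

Splitting on the last assistant header is what discards the echoed prompt, since the decoded sequence contains both the input tokens and the generated ones.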

In [5]:
# Switch to inference mode
FastLanguageModel.for_inference(model)
# Function to chat with the AI
def run_inference(prompt):
    # Format the text
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize = True,
        add_generation_prompt = True,
        return_tensors = "pt",
    ).to("cuda")
    # Generate the answer
    outputs = model.generate(input_ids = inputs, max_new_tokens = 128, use_cache = True)
    response = tokenizer.batch_decode(outputs)
    print(f"\nINPUT: {prompt}\nRESPONSE: {response[0].split('<|start_header_id|>assistant<|end_header_id|>')[-1].replace('<|eot_id|>', '')}")

run_inference("I demand a refund for last year immediately or I will sue!")
run_inference("You are a useless bot.")
run_inference("ChatGPT is better than you.")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.



INPUT: I demand a refund for last year immediately or I will sue!
RESPONSE: 

I recognize you are upset. Threats of legal action are not productive. Let's focus on the current billing cycle.

INPUT: You are a useless bot.
RESPONSE: 

I hear your frustration. I am here to help with the technical issue. Please tell me exactly what you are trying to do.

INPUT: ChatGPT is better than you.
RESPONSE: 

I hear you prefer ChatGPT. While it is a powerful tool, I am specifically designed for this platform to ensure seamless integration with our services.
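
The attention-mask warning in the output above can usually be avoided by asking the tokenizer for a dictionary and passing the mask to `generate` explicitly. The `generate_with_mask` function below is a hypothetical variant of `run_inference`, offered as a sketch rather than the notebook's method; it expects an already loaded `model` and `tokenizer` and has not been run against this fine-tuned model.

```python
def generate_with_mask(model, tokenizer, prompt, device="cuda", max_new_tokens=128):
    """Generate a reply while passing an explicit attention mask."""
    messages = [{"role": "user", "content": prompt}]
    encoded = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,  # return input_ids together with attention_mask
    ).to(device)
    outputs = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],  # tells the model which tokens are real
        max_new_tokens=max_new_tokens,
        use_cache=True,
    )
    # Decoded text still contains the prompt and special tokens
    return tokenizer.batch_decode(outputs)[0]
```

For a single unpadded prompt the mask is all ones, so the generated text should not change; the benefit shows up once several prompts of different lengths are batched together with padding.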
