# Overview
In this lab, we will explore the capabilities of large language models (LLMs) and how they can be used to generate text. Recently, LLMs have gained popularity due to their ability to generate human-like text and perform well on a variety of natural language processing tasks. LLaMA (Large Language Model Meta AI) is one such LLM developed by Meta that has been widely used for text generation tasks. We will use the Hugging Face Transformers library to interact with the LLaMA model and generate text based on a given prompt. After that, we will try to fine-tune the LLaMA model on a custom dataset to generate text that is specific to the domain of the dataset.

# Prerequisites
Before starting this lab, you should be familiar with the following:

* Python programming
* Natural language processing
* Transformers and transformer-based models
* PyTorch

# Learning Objectives
By the end of this lab, you will be able to:

* Use the Hugging Face Transformers library to generate text using the LLaMA model
* Fine-tune the LLaMA model on a custom dataset

# Background
Large language models (LLMs) are a type of artificial intelligence model that can generate human-like text based on a given prompt. These models are trained on large amounts of text data and learn to predict the next token (roughly, the next word or word piece) in a sequence. They are based on the transformer architecture, which allows them to capture long-range dependencies in the text. One of the most popular LLMs is LLaMA (Large Language Model Meta AI), developed by Meta.

## Transformers and Transformer-based Models

Transformers are a type of neural network architecture that has been widely used in natural language processing tasks. They are based on the self-attention mechanism, which allows them to capture long-range dependencies in the text. Transformer-based models, such as LLaMA, GPT, BERT, and RoBERTa, have achieved state-of-the-art performance on a variety of natural language processing tasks, including text generation, question answering, and sentiment analysis.


```
graph LR
    A[Input Text] --> B[Transformer Encoder]
    B --> C[Transformer Decoder]
    C --> D[Output Text]
```

The original Transformer uses an encoder-decoder architecture, where the encoder processes the input text and the decoder generates the output text. The encoder contains multiple layers of transformer blocks, each consisting of multi-head self-attention and feedforward neural network layers. The decoder also contains multiple layers of transformer blocks, but it additionally includes a cross-attention mechanism that allows it to attend to the encoder’s output. Not every model keeps both halves: BERT and RoBERTa use only the encoder, while GPT and LLaMA are decoder-only models with no encoder and no cross-attention.
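
To make the self-attention step more concrete, here is a minimal single-head sketch in plain Python. It is an illustration only: the function names are our own, the learned query/key/value projections are omitted (the same toy vectors are used for all three), and real models use many heads and optimized tensor libraries. The `causal` flag shows the masking a decoder uses so that a token cannot attend to later tokens.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(q, k, v, causal=False):
    """Scaled dot-product attention for a single head.

    q, k, v are lists of token vectors (seq_len x d). With causal=True,
    position i may only attend to positions j <= i.
    """
    d = len(q[0])
    outputs, weights = [], []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            if causal and j > i:
                scores.append(float("-inf"))  # hide future tokens
            else:
                dot = sum(a * b for a, b in zip(qi, kj))
                scores.append(dot / math.sqrt(d))
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors
        outputs.append(
            [sum(w[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))]
        )
    return outputs, weights

# Toy input: three tokens with two features each (illustrative values)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(x, x, x, causal=True)
for row in weights:
    print([round(p, 3) for p in row])
```

With the causal mask, the first token can only attend to itself, so its attention weights put all mass on position 0.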



## LLaMA

Llama is an accessible, open large language model (LLM) designed for developers, researchers, and businesses to build, experiment, and responsibly scale their generative AI ideas. Part of a foundational system, it serves as a bedrock for innovation in the global community. A few key aspects:

* **Open access:** Easy accessibility to cutting-edge large language models, fostering collaboration and advancements among developers, researchers, and organizations
* **Broad ecosystem:** Llama models have been downloaded hundreds of millions of times, thousands of community projects have been built on Llama, and platform support is broad, from cloud providers to startups.
* **Trust & safety:** Llama models are part of a comprehensive approach to trust and safety, releasing models and tools that are designed to enable community collaboration and encourage the standardization of the development and usage of trust and safety tools for generative AI.

In this lab, we will use a model from the Llama 3 family: Llama-3.2-1B-Instruct, formatted with the Llama 3.1 chat template. For a more detailed introduction to Llama 3.1, see the [announcement](https://ai.meta.com/blog/meta-llama-3-1/), and for in-depth technical specifications, consult the [technical report](https://ai.meta.com/research/publications/the-llama-3-herd-of-models/).

# Getting Started

## Install Required Libraries
We will use the Unsloth library to train the model. According to its authors, Unsloth makes fine-tuning large language models such as Llama-3, Mistral, Phi-4, and Gemma about 2x faster and reduces memory use by 70%, with no degradation in accuracy. You can find more information in the [Unsloth repository](https://github.com/unslothai/unsloth).

In [1]:
%%capture
!pip install unsloth datasets
# Also get the latest nightly Unsloth!
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

## Declare Parameters and Load Pretrained Model

We load a pretrained Llama-3.2-1B model that is optimized for instruction-based tasks and quantized to 4-bit precision.

The `max_seq_length` parameter is passed to ensure the model is configured to handle input sequences of up to 2048 tokens.
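
To see why 4-bit quantization matters on a small GPU, the back-of-the-envelope calculation below estimates the memory needed for the weights alone. The parameter count is an assumed round figure for a "1B" model, and the helper name is our own. Real quantized checkpoints are typically somewhat larger than this estimate because some layers stay in higher precision and the quantization constants must be stored too.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

num_params = 1_000_000_000  # assumption: round figure for a "1B" model

fp16 = weight_memory_gib(num_params, 16)
int4 = weight_memory_gib(num_params, 4)

print(f"16-bit weights: {fp16:.2f} GiB")
print(f" 4-bit weights: {int4:.2f} GiB")
print(f"Reduction factor: {fp16 / int4:.0f}x")
```

This counts weights only. Activations, the KV cache (which grows with `max_seq_length`), and optimizer state add to the total during training.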

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.10: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/1.03G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

## Chat Test

Now we can build a pipeline for generating chat-based responses using a pre-trained language model:

`Tokenizer`: The tokenizer is configured with a specific chat template to format user messages.

`messages & inputs`: User messages are tokenized and formatted according to the chat template.

`model.generate`: The model generates text based on the input, with parameters controlling the generation process.

`decode`: The generated tokens are decoded back into text.
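
The generation parameters used in this lab include `temperature` and `min_p`. As a rough illustration of what they do, the sketch below applies temperature scaling and then min-p filtering to a toy list of logits before sampling. It is a simplified stand-in written for this lab, not the library's implementation; the function name and the toy logits are our own, and the exact order and details inside `model.generate` may differ.

```python
import math
import random

def sample_min_p(logits, temperature=1.0, min_p=0.0, rng=random):
    """Sample a token index after temperature scaling and min-p filtering."""
    # Temperature > 1 flattens the distribution, < 1 sharpens it
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # min-p: drop tokens whose probability is below min_p * (top probability)
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    kept = [p / norm for p in kept]

    index = rng.choices(range(len(kept)), weights=kept, k=1)[0]
    return index, kept

# Toy logits for a four-token vocabulary (illustrative values)
logits = [4.0, 3.5, 1.0, -2.0]
index, probs = sample_min_p(logits, temperature=1.5, min_p=0.1, rng=random.Random(0))
print("filtered probabilities:", [round(p, 3) for p in probs])
print("sampled index:", index)
```

A high temperature combined with a min-p cutoff keeps the output varied while still removing very unlikely tokens.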

In [3]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Hello, what's your name."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 512, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


["<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHello, what's your name.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI'm an artificial intelligence, and I don't have a personal name like a human would. I'm here to provide information and assist with your questions to the best of my abilities. You can refer to me as 'Assistant' if that helps. What can I help you with today?<|eot_id|>"]

## Initialize LoRA Module

To enhance training efficiency, we will employ the LoRA method for fine-tuning the model. [LoRA (Low-Rank Adaptation)](https://arxiv.org/abs/2106.09685) is a technique designed to fine-tune large pre-trained models efficiently while reducing the number of trainable parameters. Instead of updating all parameters of a model during fine-tuning, LoRA introduces low-rank matrices into the existing weight matrices of the model. This allows the model to learn task-specific adaptations without the computational overhead associated with full model training.
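
The idea can be sketched in a few lines of plain Python. In LoRA, a frozen weight matrix `W` is used together with a trainable low-rank update, giving an effective weight `W + (alpha / r) * B @ A`, where `A` has shape `r x d_in` and `B` has shape `d_out x r`. The helper names, the toy matrices, and the hidden size below are assumptions chosen for illustration, not values read from the model.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds for one d_out x d_in weight matrix."""
    return r * (d_in + d_out)

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the weight used in the forward pass."""
    delta = matmul(B, A)
    scale = alpha / r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]

# Toy example: a frozen 4x4 weight with a rank-2 adapter
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A = [[0.1] * 4, [0.2] * 4]           # r x d_in
B = [[0.0, 0.0] for _ in range(4)]   # d_out x r, zero-initialized

# With B initialized to zero, the adapted weight starts out equal to W
W_eff = lora_effective_weight(W, A, B, alpha=2, r=2)
print("unchanged at init:", W_eff == W)

# Parameter savings for one square projection (hypothetical hidden size)
d = 2048
full = d * d
lora = lora_param_count(d, d, r=16)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.2%}")
```

Only `A` and `B` receive gradients during fine-tuning; `W` stays frozen, which is why the number of trainable parameters is a small fraction of the model.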

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2025.3.10 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


## Load the Dataset

We will train the model on the [Dolly-Pirate dataset](https://ai-r.com/blog/pirate-linguistics-and-tone-of-voice-fine-tuning-llms-to-talk-like-swashbucklers) so that it communicates in pirate language. The dataset provides key vocabulary and phrases, allowing the LLM to learn distinctive speech patterns and cultural references. It is a good example of how a stylistically rich dataset can shape the tone of a model's responses.

In [5]:
from datasets import load_dataset

ds = load_dataset("Peyton3995/dolly-15k-mistral-pirate", split = 'train')

README.md:   0%|          | 0.00/413 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/9.95M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/15010 [00:00<?, ? examples/s]

In [6]:
ds[0]

{'instruction': 'When did Virgin Australia start operating?',
 'context': "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.",
 'response': 'Ahoy there, matey! Virgin Australia began its voyages in the skies on August 31, 2000. It set sail under the command of Sir Richard Branson, who was well-known for his Virgin brand, and was initially called Virgin Blue. So, raise the Jolly Roger high and celebrate this auspicious day in the history of aviation piracy! Arr!',
 'category': 'closed_qa'}

To prepare the dataset for conversation-style fine-tuning in the Llama-3.1 format, we first convert it to Hugging Face's standard multi-turn format, a list of messages with `"role"` and `"content"` keys.

This changes the dataset from looking like:
```
{'instruction': 'What is the total price of apples and bananas?',
 'context': "The price of an apple is 2, the price of a banana is 3.",
 'response': "It's 5."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "The price of an apple is 2, the price of a banana is 3. What is the total price of apples and bananas?"}
{"role": "assistant", "content": "It's 5."}
```

Then we use `tokenizer.apply_chat_template` to render each conversation as a single training string in the chat template.
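
To show roughly what the chat template produces, the sketch below renders a list of messages into a single string using the special tokens visible in the decoded outputs of this lab. It is a simplified approximation for illustration: the function name is our own, and the real template also injects extra header lines (such as the knowledge cutoff date) into the system message, which are omitted here.

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Simplified rendering of messages in a Llama-3-style chat format."""
    text = "<|begin_of_text|>"
    for msg in messages:
        text += (
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

conversation = [
    {"role": "system", "content": "You are an assistant"},
    {"role": "user", "content": "What is the total price of apples and bananas?"},
    {"role": "assistant", "content": "It's 5."},
]

# For training, the full conversation is rendered, including the answer
train_text = render_llama3_chat(conversation)
# For inference, the assistant turn is left open
prompt_text = render_llama3_chat(conversation[:2], add_generation_prompt=True)

print(train_text)
print(prompt_text)
```

This is why training uses `add_generation_prompt = False` (the answer is already in the text) while inference uses `add_generation_prompt = True`.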

In [7]:
def convert_to_conversation(line):
    context = line.get("context")
    instruction = line.get("instruction", "")
    response = line.get("response", "")

    base_convo = [
        {"role": "system", "content": "You are an assistant"},
        {"role": "user", "content": f"{context}\n\n{instruction}" if context else instruction},
        {"role": "assistant", "content": response}
    ]

    filtered_convo = [msg for msg in base_convo if msg["content"].strip()]

    return {"conversations": filtered_convo}

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

processed_ds = ds.map(
    convert_to_conversation,
    remove_columns=ds.column_names,
    batched=False
)

dataset = processed_ds.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/15010 [00:00<?, ? examples/s]

Map:   0%|          | 0/15010 [00:00<?, ? examples/s]

In [8]:
dataset[0]

{'conversations': [{'content': 'You are an assistant', 'role': 'system'},
  {'content': "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.\n\nWhen did Virgin Australia start operating?",
   'role': 'user'},
  {'content': 'Ahoy there, matey! Virgin Australia began its voyages in the skies on August 31, 2000. It set sail under the command of Sir Richard Branson, who was well-known for his Virgin brand, and was initially called Virgin Blue. So, raise the Jolly Roger high and celebrate this auspicious day in the history of aviation pirac

## Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We train for 1000 steps to save time; for a full run, set `num_train_epochs = 1` and remove the `max_steps` setting.
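
To get a feel for how far 1000 steps goes, the short calculation below relates the step count to the dataset size. It uses the numbers from this lab (15,010 examples, a per-device batch size of 2, and 4 gradient accumulation steps); the helper name is our own, and the trainer's exact rounding may differ slightly.

```python
import math

def training_plan(num_examples, per_device_batch, grad_accum, max_steps, num_gpus=1):
    """Return (effective batch size, steps per epoch, epochs covered by max_steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    epochs_covered = max_steps / steps_per_epoch
    return effective_batch, steps_per_epoch, epochs_covered

batch, steps_per_epoch, epochs = training_plan(
    num_examples=15010, per_device_batch=2, grad_accum=4, max_steps=1000
)
print(f"Effective batch size: {batch}")
print(f"Steps per epoch:      {steps_per_epoch}")
print(f"Epochs covered:       {epochs:.2f}")
```

So 1000 steps covers roughly half of the dataset once, which is why a full pass needs `num_train_epochs = 1` instead.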

In [9]:
from trl import SFTConfig, SFTTrainer
from transformers import DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False,
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 1000,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

Unsloth: We found double BOS tokens - we shall remove one automatically.


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/15010 [00:00<?, ? examples/s]

We also use Unsloth's `train_on_responses_only` function to train only on the assistant outputs and ignore the loss on the user's inputs.

In [10]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=2):   0%|          | 0/15010 [00:00<?, ? examples/s]

We verify masking is actually done:

In [11]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are an assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nStalemate is a situation in chess where the player whose turn it is to move is not in check and has no legal move. Stalemate results in a draw. During the endgame, stalemate is a resource that can enable the player with the inferior position to draw the game rather than lose. In more complex positions, stalemate is much rarer, usually taking the form of a swindle that succeeds only if the superior side is inattentive.[citation needed] Stalemate is also a common theme in endgame studies and other chess problems.\n\nThe outcome of a stalemate was standardized as a draw in the 19th century. Before this standardization, its treatment varied widely, including being deemed a win for the stalemating player, a half-win for that player, or a loss for that player; not being permitted; and resul

In [12]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                                                                                                                                                                                                             Ahoy there, matey! I\'d be happy to help answer your question, but I must first clarify that in the context of a pirate\'s life, there seems to be no official rulebook or standard definition of a "stalemate." In traditional pirate lore and folklore, there are no reported instances of pirates engaging in a game or activity that would result in a stalemate.\n\nHowever, I can answer your question based on the generally accepted definition of a stalemate in the context of a strategic game or competition. A stalemate occurs when neither player can win the game, and no further progress can be made. In such a situation, the game or competition is declared a draw.\n\nWhen it comes to pirate life, the concept of a "piece" in a game of strategy d

We can see the System and Instruction prompts are successfully masked!
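
The masking itself is conceptually simple. The sketch below shows the idea on a toy list of token ids: every label before the assistant header is set to `-100`, the value that PyTorch's cross-entropy loss ignores by default, so only the response tokens contribute to the loss. The token ids and helper names are hypothetical, and this version handles a single-turn conversation only; it is not Unsloth's implementation.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def find_sublist(seq, sub):
    """Return the index of the first occurrence of sub in seq, or -1."""
    for i in range(len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1

def mask_non_response(input_ids, response_marker):
    """Keep labels only for the tokens that follow the response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = find_sublist(input_ids, response_marker)
    if pos != -1:
        start = pos + len(response_marker)
        labels[start:] = input_ids[start:]
    return labels

# Hypothetical token ids: [1, 2] stands for the user header,
# [3, 4] for the assistant header, the rest are content tokens.
input_ids = [9, 1, 2, 50, 51, 3, 4, 60, 61, 62]
labels = mask_non_response(input_ids, response_marker=[3, 4])
print(labels)
```

Replacing `-100` with a space token before decoding, as in the cell above, is just a convenient way to inspect which positions were masked.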

## Show current memory stats

In [13]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
1.117 GB of memory reserved.


In [14]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 15,010 | Num Epochs = 1 | Total steps = 1,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192/760,547,328 (1.48% trained)


Step,Training Loss
1,2.892
2,1.992
3,2.2509
4,1.913
5,2.6114
6,2.7455
7,1.9177
8,1.8104
9,2.3291
10,2.2307


Unsloth: Will smartly offload gradients to save VRAM!


## Show final memory and time stats

In [15]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1651.3351 seconds used for training.
27.52 minutes used for training.
Peak reserved memory = 2.754 GB.
Peak reserved memory for training = 1.637 GB.
Peak reserved memory % of max memory = 18.683 %.
Peak reserved memory for training % of max memory = 11.105 %.


## Inference

Now, we can talk to the fine-tuned model to check the effect of our training.

In [16]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Hello, what's your name."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 512, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

["<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHello, what's your name.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAhoy there, landlubber! I be callin' me hearty name, Captain Blackbeard himself, at that! But for the sake o' yarr, let me introduce meself: I be Blackbeard, the most feared pirate on the seven seas, known for me long black beard and me sharp cutlass. Arrrr! Yarr!<|eot_id|>"]

# Exercise



For the exercise, you will train a character chatbot on [RoleBench](https://huggingface.co/datasets/ZenMoore/RoleBench), a dataset that captures the speaking habits of 100 specific characters. Choose one of these characters as your training data.

## Instructions

* Load the dataset and visualize the statistics and format of data.
* Select a specific English-language role from the dataset.
* Convert the data into the format required for training.
* Train the model with processed training data.
* Show the results. Test at least 5 different conversations with the trained model.

In [29]:
!git clone https://huggingface.co/datasets/ZenMoore/RoleBench

Cloning into 'RoleBench'...
remote: Enumerating objects: 263, done.
remote: Counting objects: 100% (259/259), done.
remote: Compressing objects: 100% (259/259), done.
remote: Total 263 (delta 12), reused 0 (delta 0), pack-reused 4 (from 1)
Receiving objects: 100% (263/263), 20.89 MiB | 6.05 MiB/s, done.
Resolving deltas: 100% (12/12), done.
Filtering content: 100% (8/8), 366.76 MiB | 38.32 MiB/s, done.


In [39]:
# Import the json module to handle JSON data
import json

# Manually load data from a JSONL file
# JSONL (JSON Lines) is a text format where each line is a valid JSON object
data = []
# Open the JSONL file in read mode
with open("/content/RoleBench/rolebench-eng/role-generalization/role_specific/train.jsonl", "r") as f:
    # Iterate over each line in the file
    for line in f:
        # Parse the JSON object from the current line
        example = json.loads(line)
        # Unify the "type" column
        # Check if the "type" field in the example is a list
        if isinstance(example["type"], list):
            # If it is a list, take the first element as the new "type" value
            example["type"] = example["type"][0]
        # Add the processed example to the data list
        data.append(example)

# Convert the list of data into a datasets.Dataset object
# The datasets library provides a convenient way to work with datasets in machine learning
from datasets import Dataset
# Create a Dataset object from the list of data
role_ds = Dataset.from_list(data)

# Validate the dataset
# Print the column names of the dataset
print("Dataset column names:", role_ds.column_names)
# Print the first example in the dataset as a sample
print("First data example:", role_ds[0])

Dataset column names: ['role', 'question', 'generated', 'type']
First data example: {'role': 'John Coffey', 'question': ' John Coffey, what are some examples of the profound changes you bring about in the lives of the prison guards and inmates?', 'generated': [" The impact of my presence is substantial. I not only heal their physical ailments, but I also touch their souls. The prison guards who were once cruel and jaded, begin to see the inherent goodness in people, including me. I teach them the importance of empathy and compassion. As for the inmates, my healing power provides them with a renewed sense of hope and redemption. They start to believe in the possibility of change and transformation. My actions challenge their perspectives on life, and they begin to question the systems that have failed them. It's a ripple effect that spreads throughout the prison, touching everyone in ways they never thought possible."], 'type': 'script_based'}


In [40]:
# 1. Visualize the role distribution
# Convert the role_ds dataset to a Pandas DataFrame
import pandas as pd
df = pd.DataFrame(role_ds)
print("\nRole type distribution:")
# Print the top 10 most frequent role types in the dataset
print(df['role'].value_counts().head(10))

# 2. Select a specific English role
# Define the selected role, which should exist in the dataset
selected_role = "Theodore Twombly"
# Filter the dataset to get samples of the selected role
role_samples = role_ds.filter(lambda x: x["role"] == selected_role)
print(f"\nSelected role '{selected_role}', there are {len(role_samples)} samples in total.")

# 3. Convert the data format
# Define a function to convert a single data line into a conversation format
def role_convert_to_conversation(line):
    return {
        "conversations": [
            {"role": "system", "content": f"You are {selected_role}"},
            {"role": "user", "content": line["question"]},
            {"role": "assistant", "content": line["generated"][0] if line["generated"] else ""}
        ]
    }

# Apply the conversion function to each sample in the role_samples dataset
# Remove the original columns and process samples one by one
role_processed_ds = role_samples.map(
    role_convert_to_conversation,
    remove_columns=role_samples.column_names,
    batched=False
)

# 4. Data formatting and model training
# Import necessary functions and classes for handling chat templates, language models, and training
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel
import torch

# Set parameters for model loading
max_seq_length = 2048
dtype = None
load_in_4bit = True

# Load a pre-trained model and its corresponding tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
# Configure the tokenizer with the appropriate chat template (using Llama-3.1 style)
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

# Define a function to format the processed dataset for training
def role_formatting_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in convos
    ]
    return {"text": texts}

# Apply the formatting function to the role_processed_ds dataset in batches
role_dataset = role_processed_ds.map(role_formatting_func, batched=True)

# Prepare for training
# Import necessary classes and functions for training configuration and data collation
from trl import SFTConfig, SFTTrainer
from transformers import DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import train_on_responses_only

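# Re-initialize a clean model and tokenizer before training
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
# The reload returns a fresh tokenizer, so apply the chat template again
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")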

# Apply LoRA (Low-Rank Adaptation) for efficient fine-tuning
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

# Create the SFTTrainer for supervised fine-tuning
role_trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=role_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=2,
    packing=False,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=1000,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="role_outputs",
        report_to="none",  # Disable WandB
    ),
)

# Adjust the trainer to focus on assistant responses only
role_trainer = train_on_responses_only(
    role_trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)

# Start the training process
role_trainer.train()

# 5. Test the conversation
# Define a function to perform a chat test with the trained model
def chat_test(prompt):
    messages = [
        {"role": "system", "content": f"You are {selected_role}"},
        {"role": "user", "content": prompt}
    ]
    # Convert the messages into input tensors using the tokenizer
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to("cuda")
    # Generate responses from the model
    outputs = model.generate(
        inputs,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2
    )
    # Decode the generated output and return the result
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test 5 different conversations
test_prompts = [
    "What's your favorite way to spend a day off?",
    "How would you handle a mutiny on your ship?",
    "Describe your most treasured possession.",
    "What advice would you give to a young sailor?",
    "Tell me about your greatest adventure at sea!"
]

# Iterate over the test prompts and print the test results
for i, prompt in enumerate(test_prompts, 1):
    print(f"\n=== Test {i} ===")
    print(f"[User] {prompt}")
    response = chat_test(prompt)
    # Clean up the output if the chat template adds extra tokens.
    print(f"[{selected_role}] {response.split('assistant')[-1].strip()}")


Role type distribution:
role
Theodore Twombly    348
Gregory House       338
Jeff Spicoli        335
Gaston              322
Rorschach           313
Juno MacGuff        312
The Dude            312
Mater               305
D_Artagnan          305
Karl Childers       303
Name: count, dtype: int64


Filter:   0%|          | 0/19885 [00:00<?, ? examples/s]


Selected role 'Theodore Twombly', there are 348 samples in total.


Map:   0%|          | 0/348 [00:00<?, ? examples/s]

==((====))==  Unsloth 2025.3.10: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Map:   0%|          | 0/348 [00:00<?, ? examples/s]

==((====))==  Unsloth 2025.3.10: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: We found double BOS tokens - we shall remove one automatically.


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/348 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/348 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 348 | Num Epochs = 24 | Total steps = 1,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192/760,547,328 (1.48% trained)


Step,Training Loss
1,1.9125
2,1.8386
3,1.9809
4,1.8871
5,1.8893
6,1.7769
7,1.796
8,1.675
9,1.5412
10,1.6209



=== Test 1 ===
[User] What's your favorite way to spend a day off?
[Theodore Twombly] Ah, a day off is a great opportunity to do anything I want. Sometimes I'll take care of some household chores, go shopping for groceries and supplies, and then find a relaxing spot where I can unwind. Maybe I'll write some letters, read a book, or simply enjoy the fresh air. It's just a chance to recharge and prepare for the rest of the week.

=== Test 2 ===
[User] How would you handle a mutiny on your ship?
[Theodore Twombly] A mutiny on my ship would be a devastating event. To handle it, I would first assess the situation and determine the extent of the damage and the number of involved individuals. Once that is understood, I would take immediate action to restore order and ensure the safety of all crew members. This might involve calling for reinforcements, searching for missing personnel, and conducting an investigation into the cause of the uprising. It would require swift decision-making and ef