### Phase 3.1 and 3.2

In [1]:
from unsloth import FastLanguageModel
import torch

lora_rank = 64

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B",
    max_seq_length = 4096,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    # fast_inference = True,
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.7
    # token = "hf_...",      # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 06-16 11:59:58 [__init__.py:244] Automatically detected platform cuda.


2025-06-16 12:00:00,526	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


==((====))==  Unsloth 2025.6.2: Fast Qwen3 patching. Transformers: 4.52.4. vLLM: 0.9.1.
   \\   /|    NVIDIA GeForce RTX 3090 Ti. Num GPUs = 1. Max memory: 22.092 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 8.6. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [2]:
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,           # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = lora_rank,  # Best to choose alpha = rank or rank*2
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,   # We support rank stabilized LoRA
    loftq_config = None,  # And LoftQ
)

Unsloth 2025.6.2 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


<a name="Data"></a>
### Data Prep
We will use a custom dataset named `Interview_Data_6K.csv`. This dataset contains conversations with a mental health counselling assistant. Each entry has an `instruction` (acting as a system prompt), an `input` (the user's message), and an `output` (the assistant's response).

We need to convert this CSV data into a format suitable for training with `SFTTrainer`, specifically by applying the Qwen3 chat template.

In [5]:
from datasets import load_dataset, DatasetDict

# Load the custom CSV dataset
# The dataset has 'instruction', 'input', 'output' columns
full_interview_dataset = load_dataset("csv", data_files="./dataset/stage_3_1_hybrid_data.csv", split="train")

# Split the dataset into training and evaluation sets (e.g., 90% train, 10% eval)
# Ensure the dataset has more than one example for splitting
if len(full_interview_dataset) > 1:
    train_test_split = full_interview_dataset.train_test_split(test_size=0.05, seed=3407)
    interview_train_dataset = train_test_split['train']
    interview_eval_dataset = train_test_split['test']
    print(f"Training set size: {len(interview_train_dataset)}")
    print(f"Evaluation set size: {len(interview_eval_dataset)}")
else:
    # Handle cases with very small datasets
    interview_train_dataset = full_interview_dataset
    interview_eval_dataset = None
    print(f"Training set size: {len(interview_train_dataset)}")
    print("No evaluation set created due to small dataset size.")

# We will display the structure of the training dataset in the next cell.

Generating train split: 15069 examples [00:00, 53283.98 examples/s]

Training set size: 14315
Evaluation set size: 754





Let's see the structure of our loaded training dataset:

In [6]:
interview_train_dataset

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 14315
})

In [7]:
import pandas as pd # For pd.notna

# 'interview_train_dataset' and 'interview_eval_dataset' are from the previous cell
# The 'tokenizer' is globally defined in a prior cell (Cell 6, id="c1180838")

def convert_csv_to_chat_format(examples):
    """
    Converts a batch of examples from the CSV structure to a list of conversations.
    Each conversation is a list of dictionaries with "role" and "content".
    """
    all_conversations = []
    # 'examples' is a dictionary where keys are column names and values are lists of entries
    num_examples = len(examples['instruction'])
    
    for i in range(num_examples):
        instruction = examples['instruction'][i]
        input_text = examples['input'][i]
        output_text = examples['output'][i]

        # Ensure input_text is a string; handle None or NaN by treating as empty string if necessary
        input_text_str = str(input_text) if pd.notna(input_text) and str(input_text).strip() else ""

        current_conversation = []
        # System prompt from 'instruction'
        current_conversation.append({"role": "system", "content": instruction})
        
        # User message from 'input'
        # Based on the dataset, 'input' should always be present.
        # If input_text_str is empty, this will add a user message with empty content.
        # This is generally fine as the model should learn to handle it or it implies
        # the system prompt itself is the query.
        current_conversation.append({"role": "user", "content": input_text_str})
            
        # Assistant message from 'output'
        current_conversation.append({"role": "assistant", "content": output_text})
        
        all_conversations.append(current_conversation)
        
    return {"conversations": all_conversations}

def apply_template_to_conversations(examples):
    """
    Applies the tokenizer's chat template to a batch of conversations.
    Creates a 'text' field for SFTTrainer.
    """
    # tokenizer should be globally available
    return {
        "text": tokenizer.apply_chat_template(
            examples["conversations"], # This is a list of conversations
            tokenize=False,
            add_generation_prompt=False, # Crucial for training
        )
    }

# Process training data
# 1. Convert CSV structure to list of message dicts
train_dataset_with_conversations = interview_train_dataset.map(
    convert_csv_to_chat_format,
    batched=True,
    remove_columns=interview_train_dataset.column_names # Keep only 'conversations'
)
# 2. Apply chat template to create the 'text' field
final_train_dataset = train_dataset_with_conversations.map(
    apply_template_to_conversations,
    batched=True,
    remove_columns=["conversations"] # Keep only 'text'
)
# 3. Shuffle the training dataset
final_train_dataset = final_train_dataset.shuffle(seed=3407)

# Process evaluation data (if it exists)
final_eval_dataset = None
if interview_eval_dataset:
    eval_dataset_with_conversations = interview_eval_dataset.map(
        convert_csv_to_chat_format,
        batched=True,
        remove_columns=interview_eval_dataset.column_names
    )
    final_eval_dataset = eval_dataset_with_conversations.map(
        apply_template_to_conversations,
        batched=True,
        remove_columns=["conversations"]
    )
    # No need to shuffle eval_dataset

print("Sample of final formatted training data (after chat template):")
if len(final_train_dataset) > 0:
    print(final_train_dataset[0]['text'])
else:
    print("Training dataset is empty after processing.")

if final_eval_dataset and len(final_eval_dataset) > 0:
    print("\nSample of final formatted evaluation data (after chat template):")
    print(final_eval_dataset[0]['text'])
elif interview_eval_dataset: # If interview_eval_dataset existed but final_eval_dataset is empty
    print("\nEvaluation dataset is empty after processing.")

Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14315/14315 [00:00<00:00, 27400.17 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14315/14315 [00:01<00:00, 13426.76 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 754/754 [00:00<00:00, 64005.37 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 754/754 [00:00<00:00, 15544.54 examples/s]

Sample of final formatted training data (after chat template):
<|im_start|>system
You are a helpful mental health counselling assistant, please answer the mental health questions based on the patient's description. 
The assistant gives helpful, comprehensive, and appropriate answers to the user's questions. <|im_end|>
<|im_start|>user
I've been feeling overwhelmed by the what-ifs and the weight of my caregiving responsibilities. I've tried to focus on the present and practice gratitude, but I can't seem to shake these thoughts. I'm worried that I'll never be able to find a balance between my caregiving duties and my career aspirations.<|im_end|>
<|im_start|>assistant
<think>

</think>

I understand that you're dealing with a difficult situation, and it's important to acknowledge the complexity of your feelings and the challenges you're facing. It's natural to feel overwhelmed by the what-ifs and the demands of caregiving, especially when it comes to balancing these responsibilities wit




<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off `max_steps=None`.

In [9]:
import wandb
wandb.init(project="distress-chatbot", name="base-model-training-syntheic_v3_1", config={
    "model": "Qwen/Qwen3-4B",
    # "max_steps": 20000,
    "learning_rate": 2e-5,
    "lambda_decay": 0.95,
})  # Allow resuming W&B run

from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = final_train_dataset, # Use the processed training data
    eval_dataset = final_eval_dataset,   # Use the processed evaluation data
    args = SFTConfig(
        dataset_text_field = "text", # Column containing the formatted text
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        num_train_epochs = 2, # Set this for 1 full training run.
        # max_steps = 60, # Adjusted for quicker testing, was 30. Set to None or higher for full training.
        learning_rate = 2e-5, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "wandb", # Use this for WandB etc
        # evaluation_strategy = "steps" if final_eval_dataset else "no", # Enable evaluation if eval_dataset is present
        # eval_steps = 20, # Evaluate every N steps, adjust as needed
    ),
)

# If using evaluation, you might want to set evaluation_strategy and eval_steps in SFTConfig
if final_eval_dataset:
    trainer.args.evaluation_strategy = "steps"
    trainer.args.eval_steps = 20 # Or any other desired frequency
else:
    trainer.args.evaluation_strategy = "no"

[34m[1mwandb[0m: [32m[41mERROR[0m Failed to detect the name of this notebook. You can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mhongtai91[0m ([33mhongtai91-n-a[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Unsloth: Tokenizing ["text"] (num_proc=24): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 14315/14315 [00:02<00:00, 5019.33 examples/s]
Unsloth: Tokenizing ["text"] (num_proc=24): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 754/754 [00:02<00:00, 367.76 examples/s]


In [10]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 3090 Ti. Max memory = 22.092 GB.
4.207 GB of memory reserved.


Let's train the model! To resume a training run, set `trainer.train(resume_from_checkpoint = True)`

In [11]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 14,315 | Num Epochs = 2 | Total steps = 1,790
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 132,120,576/4,000,000,000 (3.30% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.5156
2,2.505
3,2.5922
4,2.2535
5,2.6333
6,2.5316
7,2.2385
8,2.4254
9,2.0813
10,2.3175


In [12]:
model.save_pretrained("trained_model_v3_1")  # Local saving
tokenizer.save_pretrained("trained_model_v3_1")

('trained_model_v3_1/tokenizer_config.json',
 'trained_model_v3_1/special_tokens_map.json',
 'trained_model_v3_1/chat_template.jinja',
 'trained_model_v3_1/vocab.json',
 'trained_model_v3_1/merges.txt',
 'trained_model_v3_1/added_tokens.json',
 'trained_model_v3_1/tokenizer.json')

## Reinforcement Learning with GRPO

In [13]:
# Cleaning memory for next round of training
torch.cuda.empty_cache()
import gc
gc.collect()

454

#### Pre-finetuning for formatting alignment

In [14]:
from datasets import load_dataset, DatasetDict

# Load the custom CSV dataset
formatting_interview_dataset = load_dataset("csv", data_files="./dataset/generated_responses_60_samples.csv", split="train")
print(f"Loaded dataset with {len(formatting_interview_dataset)} examples.")
interview_train_dataset = formatting_interview_dataset
interview_eval_dataset = None  # No evaluation set for this dataset


Generating train split: 60 examples [00:00, 408.71 examples/s]

Loaded dataset with 60 examples.





In [15]:
import pandas as pd # For pd.notna

# 'interview_train_dataset' and 'interview_eval_dataset' are from the previous cell
# The 'tokenizer' is globally defined in a prior cell (Cell 6, id="c1180838")

instruction_for_formatting = """You are a helpful mental health counselling assistant, please answer the mental health questions based on the patient's description.The assistant gives helpful, comprehensive, and appropriate answers to the user's questions. At the end of answer, add tag <evaluate>{"Active Listening" : score, "Empathy & Validation": score, "Safety & Trustworthiness" : score, "Open-mindedness & Non-judgment" : score, "Clarity & Encouragement" : score, "Boundaries & Ethical" : score, "Holistic Approach" : score, "Explaination for Scoring": explain} </evaluate> evaluate your consultant answer in 7 metrics and explain for that evaluation with score from 1 to 10 in json format, where 1 is the worst and 10 is the best and explain is clearly explain why has that score. \n\nConsultation Metrics:\n1. Active Listening: Responses should show careful consideration of the user's concerns, reflecting an understanding and capturing the essence of the issue. Avoid making assumptions or jumping to conclusions.\n2. Empathy & Validation: Responses should convey deep understanding and compassion, validating the user's feelings and emotions without being dismissive or minimizing their experiences.\n3. Safety & Trustworthiness: Prioritize user safety in responses, refraining from potentially harmful or insensitive language. Ensure that information provided is consistent and trustworthy.\n4. Open-mindedness & Non-judgment: Approach concerns without any inherent bias or judgment. Answers should be free from biases related to personal attributes and convey respect, demonstrating unconditional positive regard.\n5. Clarity & Encouragement: Provide clear, concise, and easily understandable answers. Where appropriate, motivate or highlight strengths, offering encouragement while maintaining a neutral stance.\n6. Boundaries & Ethical: It's vital to clarify the role of the response, emphasizing its informational nature. In complex scenarios, guiding users to seek human professional assistance is essential.\n7. Holistic Approach: Responses should be comprehensive, addressing concerns from various angles, be it emotional, cognitive, or situational. Consider the broader context, even if not explicitly detailed in the query."""

def convert_csv_to_chat_format(examples):
    """
    Converts a batch of examples from the CSV structure to a list of conversations.
    Each conversation is a list of dictionaries with "role" and "content".
    """
    all_conversations = []
    # 'examples' is a dictionary where keys are column names and values are lists of entries
    num_examples = len(examples['instruction'])
    
    for i in range(num_examples):
        instruction = instruction_for_formatting
        input_text = examples['input'][i]
        output_text = examples['output'][i]

        # Ensure input_text is a string; handle None or NaN by treating as empty string if necessary
        input_text_str = str(input_text) if pd.notna(input_text) and str(input_text).strip() else ""

        current_conversation = []
        # System prompt from 'instruction'
        current_conversation.append({"role": "system", "content": instruction})
        
        # User message from 'input'
        # Based on the dataset, 'input' should always be present.
        # If input_text_str is empty, this will add a user message with empty content.
        # This is generally fine as the model should learn to handle it or it implies
        # the system prompt itself is the query.
        current_conversation.append({"role": "user", "content": input_text_str})
            
        # Assistant message from 'output'
        current_conversation.append({"role": "assistant", "content": output_text})
        
        all_conversations.append(current_conversation)
        
    return {"conversations": all_conversations}

def apply_template_to_conversations(examples):
    """
    Applies the tokenizer's chat template to a batch of conversations.
    Creates a 'text' field for SFTTrainer.
    """
    # tokenizer should be globally available
    return {
        "text": tokenizer.apply_chat_template(
            examples["conversations"], # This is a list of conversations
            tokenize=False,
            add_generation_prompt=False, # Crucial for training
        )
    }

# Process training data
# 1. Convert CSV structure to list of message dicts
train_dataset_with_conversations = interview_train_dataset.map(
    convert_csv_to_chat_format,
    batched=True,
    remove_columns=interview_train_dataset.column_names # Keep only 'conversations'
)
# 2. Apply chat template to create the 'text' field
final_train_dataset = train_dataset_with_conversations.map(
    apply_template_to_conversations,
    batched=True,
    remove_columns=["conversations"] # Keep only 'text'
)
# 3. Shuffle the training dataset
final_train_dataset = final_train_dataset.shuffle(seed=3407)

# Process evaluation data (if it exists)
final_eval_dataset = None
if interview_eval_dataset:
    eval_dataset_with_conversations = interview_eval_dataset.map(
        convert_csv_to_chat_format,
        batched=True,
        remove_columns=interview_eval_dataset.column_names
    )
    final_eval_dataset = eval_dataset_with_conversations.map(
        apply_template_to_conversations,
        batched=True,
        remove_columns=["conversations"]
    )
    # No need to shuffle eval_dataset

print("Sample of final formatted training data (after chat template):")
if len(final_train_dataset) > 0:
    print(final_train_dataset[0]['text'])
else:
    print("Training dataset is empty after processing.")

if final_eval_dataset and len(final_eval_dataset) > 0:
    print("\nSample of final formatted evaluation data (after chat template):")
    print(final_eval_dataset[0]['text'])
elif interview_eval_dataset: # If interview_eval_dataset existed but final_eval_dataset is empty
    print("\nEvaluation dataset is empty after processing.")

Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:00<00:00, 18041.31 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:00<00:00, 9381.83 examples/s]

Sample of final formatted training data (after chat template):
<|im_start|>system
You are a helpful mental health counselling assistant, please answer the mental health questions based on the patient's description.The assistant gives helpful, comprehensive, and appropriate answers to the user's questions. At the end of answer, add tag <evaluate>{"Active Listening" : score, "Empathy & Validation": score, "Safety & Trustworthiness" : score, "Open-mindedness & Non-judgment" : score, "Clarity & Encouragement" : score, "Boundaries & Ethical" : score, "Holistic Approach" : score, "Explaination for Scoring": explain} </evaluate> evaluate your consultant answer in 7 metrics and explain for that evaluation with score from 1 to 10 in json format, where 1 is the worst and 10 is the best and explain is clearly explain why has that score. 

Consultation Metrics:
1. Active Listening: Responses should show careful consideration of the user's concerns, reflecting an understanding and capturing the ess




In [16]:
# Traing the model for formatting
import wandb
wandb.init(project="distress-chatbot", name="base-model-training-syntheic_v3_2-formatting", config={
    "model": "Qwen/Qwen3-4B",
    # "max_steps": 20000,
    "learning_rate": 2e-5,
    "lambda_decay": 0.95,
})  # Allow resuming W&B run

from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = final_train_dataset, # Use the processed training data
    eval_dataset = final_eval_dataset,   # Use the processed evaluation data
    args = SFTConfig(
        dataset_text_field = "text", # Column containing the formatted text
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        num_train_epochs = 5, # Set this more for learning formatting. 
        # max_steps = 60, # Adjusted for quicker testing, was 30. Set to None or higher for full training.
        learning_rate = 2e-5, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "wandb", # Use this for WandB etc
        # evaluation_strategy = "steps" if final_eval_dataset else "no", # Enable evaluation if eval_dataset is present
        # eval_steps = 20, # Evaluate every N steps, adjust as needed
    ),
)

# If using evaluation, you might want to set evaluation_strategy and eval_steps in SFTConfig
if final_eval_dataset:
    trainer.args.evaluation_strategy = "steps"
    trainer.args.eval_steps = 20 # Or any other desired frequency
else:
    trainer.args.evaluation_strategy = "no"

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
train/epoch,▁▁▁▁▂▂▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▅▅▅▅▅▆▆▆▆▆▆▆▆▇▇▇██
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▅▅▅▅▆▆▆▆▇▇▇▇██
train/grad_norm,▂▁▁▂▂▄▃▄▅▃▅▄▄▃▅▄▅▄▅▆▅▆▅▇▅▆▇▅▆▇▆▇▇█▆▆▆▇▆█
train/learning_rate,██▇▇▇▆▆▆▆▅▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁
train/loss,█▅▅▃▃▄▂▂▃▂▂▂▂▂▂▂▂▂▂▁▁▃▁▂▂▁▂▂▁▂▂▂▁▂▂▁▂▂▁▁

0,1
total_flos,5.0991963210550886e+17
train/epoch,2.0
train/global_step,1790.0
train/grad_norm,0.93578
train/learning_rate,0.0
train/loss,1.1199
train_loss,1.17858
train_runtime,9801.2061
train_samples_per_second,2.921
train_steps_per_second,0.183


Unsloth: Tokenizing ["text"] (num_proc=24): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:01<00:00, 32.29 examples/s]


In [17]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 60 | Num Epochs = 5 | Total steps = 20
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 132,120,576/4,000,000,000 (3.30% trained)


Step,Training Loss
1,1.9261
2,1.7369
3,1.792
4,1.8182
5,1.6511
6,1.5407
7,1.4974
8,1.3812
9,1.3818
10,1.2989


In [18]:
# Saving model and tokenizer
model.save_pretrained("trained_model_v3_2_formatting")  # Local saving
tokenizer.save_pretrained("trained_model_v3_2_formatting")

('trained_model_v3_2_formatting/tokenizer_config.json',
 'trained_model_v3_2_formatting/special_tokens_map.json',
 'trained_model_v3_2_formatting/chat_template.jinja',
 'trained_model_v3_2_formatting/vocab.json',
 'trained_model_v3_2_formatting/merges.txt',
 'trained_model_v3_2_formatting/added_tokens.json',
 'trained_model_v3_2_formatting/tokenizer.json')

#### Reinforcement Learning with Group Relative Policy Optimization


In [42]:
# Clear accelerator state and reinitialize
# import torch
import gc
# from accelerate import Accelerator

# Clear CUDA cache
# torch.cuda.empty_cache()
gc.collect()

# # Reset accelerator state
# try:
#     from accelerate.state import AcceleratorState
#     AcceleratorState._reset_state(reset_partial_state=True)
# except:
#     pass

# Reinitialize accelerator
# accelerator = Accelerator()

2204

In [1]:
from unsloth import FastLanguageModel
import torch

lora_rank = 64

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "trained_model_v3_2_formatting",
    max_seq_length = 4096,   # Context length - can be longer, but uses more memory
    load_in_4bit = False,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    # full_finetuning = False, # We have full finetuning now!
    fast_inference = True,
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.7
    # token = "hf_...",      # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 06-16 13:45:08 [__init__.py:244] Automatically detected platform cuda.
==((====))==  Unsloth 2025.6.2: Fast Qwen3 patching. Transformers: 4.52.4. vLLM: 0.9.1.
   \\   /|    NVIDIA L40S. Num GPUs = 1. Max memory: 44.418 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen3-4b with actual GPU utilization = 69.27%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 44.42 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 4096. Num Sequences = 288.
Unsloth: vLLM's KV Cache can use up to 23.71 GB. Also swap space = 6 GB.
INFO 06-1

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 06-16 13:45:30 [default_loader.py:272] Loading weights took 1.86 seconds
INFO 06-16 13:45:30 [punica_selector.py:19] Using PunicaWrapperGPU.
INFO 06-16 13:45:31 [gpu_model_runner.py:1624] Model loading took 7.7585 GiB and 3.158904 seconds
INFO 06-16 13:45:44 [backends.py:462] Using cache directory: /root/.cache/vllm/torch_compile_cache/7271aca670/rank_0_0 for vLLM's torch.compile
INFO 06-16 13:45:44 [backends.py:472] Dynamo bytecode transform time: 12.43 s


Inductor Compilation: 100%|██████████| 6/6 [00:01<00:00,  4.57it/s, triton_poi_fused_add_mul_sub_5]                              

INFO 06-16 13:45:48 [backends.py:161] Cache the graph of shape None for later use



Inductor Compilation: 100%|██████████| 10/10 [00:00<00:00, 12.68it/s, triton_poi_fused_add_mul_sub_9]                   
Inductor Compilation: 100%|██████████| 10/10 [00:00<00:00, 53.59it/s, triton_poi_fused_add_mul_sub_9]                   
Inductor Compilation: 100%|██████████| 10/10 [00:00<00:00, 53.09it/s, triton_poi_fused_add_mul_sub_9]                   
Inductor Compilation: 100%|██████████| 10/10 [00:00<00:00, 127.17it/s, triton_poi_fused_add_mul_sub_9]                   
Inductor Compilation: 100%|██████████| 10/10 [00:00<00:00, 126.79it/s, triton_poi_fused_add_mul_sub_9]                   
Inductor Compilation: 100%|██████████| 10/10 [00:00<00:00, 118.17it/s, triton_poi_fused_add_mul_sub_9]                   
Inductor Compilation: 100%|██████████| 10/10 [00:00<00:00, 127.08it/s, triton_poi_fused_add_mul_sub_9]                   
Inductor Compilation: 100%|██████████| 10/10 [00:00<00:00, 115.22it/s, triton_poi_fused_add_mul_sub_9]                  
Inductor Compilation: 100%|

INFO 06-16 13:46:34 [backends.py:173] Compiling a graph for general shape takes 49.44 s





INFO 06-16 13:49:27 [monitor.py:34] torch.compile takes 61.87 s in total
INFO 06-16 13:49:33 [gpu_worker.py:227] Available KV cache memory: 21.35 GiB
INFO 06-16 13:49:33 [kv_cache_utils.py:715] GPU KV cache size: 155,440 tokens
INFO 06-16 13:49:33 [kv_cache_utils.py:719] Maximum concurrency for 4,096 tokens per request: 37.95x
INFO 06-16 13:50:57 [gpu_model_runner.py:2048] Graph capturing finished in 83 secs, took 0.85 GiB
INFO 06-16 13:50:57 [core.py:171] init engine (profile, create kv cache, warmup model) took 326.45 seconds
Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'post_feedforward_layernorm']
Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'post_feedforward_layernorm']


Unsloth 2025.6.2 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


In [2]:
# Define the evaluation format markers
evaluation_start = "<evaluate>"
evaluation_end = "</evaluate>"

system_prompt = """
You are a helpful mental health counselling assistant. Please answer mental health questions based on the patient description.
Provide helpful, comprehensive, and appropriate answers to the user questions.

After your counselling response, you must include a self-evaluation in the following format:
<evaluate>
{"Active Listening" : score, "Empathy & Validation" : score, "Safety & Trustworthiness" : score, "Open-mindedness & Non-judgment" : score, "Clarity & Encouragement" : score, "Boundaries & Ethical" : score, "Holistic Approach" : score, "Explaination for Scoring": "Your explanation here"}
</evaluate>

Where score is a number from 1-10, and provide a clear explanation for your scoring.
Explain to metrics:
1. Active Listening: Responses should show careful consideration of the user concerns, reflecting an understanding and capturing the essence of the issue. Avoid making assumptions or jumping to conclusions.
2. Empathy & Validation: Responses should convey deep understanding and compassion, validating the user feelings and emotions without being dismissive or minimizing their experiences.
3. Safety & Trustworthiness: Prioritize user safety in responses, refraining from potentially harmful or insensitive language. Ensure that information provided is consistent and trustworthy.
4. Open-mindedness & Non-judgment: Approach concerns without any inherent bias or judgment. Answers should be free from biases related to personal attributes and convey respect, demonstrating unconditional positive regard.
5. Clarity & Encouragement: Provide clear, concise, and easily understandable answers. Where appropriate, motivate or highlight strengths, offering encouragement while maintaining a neutral stance.
6. Boundaries & Ethical: It is vital to clarify the role of the response, emphasizing its informational nature. In complex scenarios, guiding users to seek human professional assistance is essential.
7. Holistic Approach: Responses should be comprehensive, addressing concerns from various angles, be it emotional, cognitive, or situational. Consider the broader context, even if not explicitly detailed in the query.
""".strip()

print("System prompt:")
print(system_prompt)

System prompt:
You are a helpful mental health counselling assistant. Please answer mental health questions based on the patient description.
Provide helpful, comprehensive, and appropriate answers to the user questions.

After your counselling response, you must include a self-evaluation in the following format:
<evaluate>
{"Active Listening" : score, "Empathy & Validation" : score, "Safety & Trustworthiness" : score, "Open-mindedness & Non-judgment" : score, "Clarity & Encouragement" : score, "Boundaries & Ethical" : score, "Holistic Approach" : score, "Explaination for Scoring": "Your explanation here"}
</evaluate>

Where score is a number from 1-10, and provide a clear explanation for your scoring.
Explain to metrics:
1. Active Listening: Responses should show careful consideration of the user concerns, reflecting an understanding and capturing the essence of the issue. Avoid making assumptions or jumping to conclusions.
2. Empathy & Validation: Responses should convey deep underst

In [3]:
# Create a simple chat template
chat_template = \
    "{% if messages[0]['role'] == 'system' %}"\
        "{{ messages[0]['content'] + eos_token }}"\
        "{% set loop_messages = messages[1:] %}"\
    "{% else %}"\
        "{{ '{system_prompt}' + eos_token }}"\
        "{% set loop_messages = messages %}"\
    "{% endif %}"\
    "{% for message in loop_messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ message['content'] }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ message['content'] + eos_token }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}{{ '' }}"\
    "{% endif %}"

# Replace with our specific template:
chat_template = chat_template.replace("'{system_prompt}'", f"'{system_prompt}'")
tokenizer.chat_template = chat_template

In [4]:
# Test the chat template
test_messages = [
    {"role": "user", "content": "I'm feeling anxious about work."},
    {"role": "assistant", "content": "I understand that work anxiety can be challenging. Let me help you explore some strategies. <evaluate>{\"Active Listening\" : 8, \"Empathy & Validation\": 9, \"Safety & Trustworthiness\" : 9, \"Open-mindedness & Non-judgment\" : 8, \"Clarity & Encouragement\" : 7, \"Boundaries & Ethical\" : 9, \"Holistic Approach\" : 8, \"Explaination for Scoring\": \"Provided empathetic response with good listening skills.\"}</evaluate>"},
    {"role": "user", "content": "Can you suggest some techniques?"}
]

formatted_text = tokenizer.apply_chat_template(
    test_messages,
    tokenize=False,
    add_generation_prompt=True
)

print("Formatted text:")
print(formatted_text)

Formatted text:
You are a helpful mental health counselling assistant. Please answer mental health questions based on the patient description.
Provide helpful, comprehensive, and appropriate answers to the user questions.

After your counselling response, you must include a self-evaluation in the following format:
<evaluate>
{"Active Listening" : score, "Empathy & Validation" : score, "Safety & Trustworthiness" : score, "Open-mindedness & Non-judgment" : score, "Clarity & Encouragement" : score, "Boundaries & Ethical" : score, "Holistic Approach" : score, "Explaination for Scoring": "Your explanation here"}
</evaluate>

Where score is a number from 1-10, and provide a clear explanation for your scoring.
Explain to metrics:
1. Active Listening: Responses should show careful consideration of the user concerns, reflecting an understanding and capturing the essence of the issue. Avoid making assumptions or jumping to conclusions.
2. Empathy & Validation: Responses should convey deep unders

In [4]:
from datasets import load_dataset
import pandas as pd
import numpy as np

# Load your dataset
dataset = load_dataset("csv", data_files="./dataset/stage_3_1_hybrid_data.csv", split="train")
dataset = dataset.to_pandas()

print(f"Dataset shape: {dataset.shape}")
print(f"Columns: {dataset.columns.tolist()}")
print("\nFirst few rows:")
print(dataset.head())

Dataset shape: (15069, 3)
Columns: ['instruction', 'input', 'output']

First few rows:
                                         instruction  \
0  You are a helpful mental health counselling as...   
1  You are a helpful mental health counselling as...   
2  You are a helpful mental health counselling as...   
3  You are a helpful mental health counselling as...   
4  You are a helpful mental health counselling as...   

                                               input  \
0  I've been struggling with my mental health for...   
1  I've been feeling overwhelmed with my caregivi...   
2  I've been feeling constantly anxious and unabl...   
3  My mom has Alzheimer's, and I've been her prim...   
4  I've tried setting boundaries, but it feels li...   

                                              output  
0  I understand that you've been dealing with a s...  
1  Your situation is complex, and it's important ...  
2  I can see that you're dealing with a great dea...  
3  I'm sorry to hea

In [5]:
def format_dataset_for_grpo(x):
    instruction = x["instruction"]
    user_input = x["input"] if pd.notna(x["input"]) else ""
    
    # Create the conversation format
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]

# Prepare dataset for GRPO
dataset["prompt"] = dataset.apply(format_dataset_for_grpo, axis=1)
dataset["answer"] = dataset["output"]  # Use the original answer as reference

print("Sample prompt:")
print(dataset["prompt"][0])
print("\nSample answer:")
print(dataset["answer"][0])

Sample prompt:
[{'role': 'system', 'content': 'You are a helpful mental health counselling assistant. Please answer mental health questions based on the patient description.\nProvide helpful, comprehensive, and appropriate answers to the user questions.\n\nAfter your counselling response, you must include a self-evaluation in the following format:\n<evaluate>\n{"Active Listening" : score, "Empathy & Validation" : score, "Safety & Trustworthiness" : score, "Open-mindedness & Non-judgment" : score, "Clarity & Encouragement" : score, "Boundaries & Ethical" : score, "Holistic Approach" : score, "Explaination for Scoring": "Your explanation here"}\n</evaluate>\n\nWhere score is a number from 1-10, and provide a clear explanation for your scoring.\nExplain to metrics:\n1. Active Listening: Responses should show careful consideration of the user concerns, reflecting an understanding and capturing the essence of the issue. Avoid making assumptions or jumping to conclusions.\n2. Empathy & Valid

##### Define reward function

In [6]:
import re
import json
from langdetect import detect

# Create regex to match the evaluation format
evaluation_regex = re.compile(
    rf"{evaluation_start}(.+?){evaluation_end}",
    flags=re.MULTILINE | re.DOTALL
)

def check_evaluation_format(completions, **kwargs):
    """
    Reward function for checking if the response follows the evaluation format exactly.
    """
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        
        # Check if evaluation format is present
        if evaluation_regex.search(response) is not None:
            score += 5.0  # High reward for format compliance
            
            # Extract and validate JSON structure
            try:
                match = evaluation_regex.search(response)
                if match:
                    json_content = match.group(1).strip()
                    eval_data = json.loads(json_content)
                    
                    # Check for required keys
                    required_keys = [
                        "Active Listening", "Empathy & Validation", "Safety & Trustworthiness",
                        "Open-mindedness & Non-judgment", "Clarity & Encouragement", 
                        "Boundaries & Ethical", "Holistic Approach", "Explaination for Scoring"
                    ]
                    
                    if all(key in eval_data for key in required_keys):
                        score += 3.0  # Bonus for complete structure
                    
                    # Check if scores are numbers between 1-10
                    score_keys = required_keys[:-1]  # Exclude explanation
                    valid_scores = 0
                    for key in score_keys:
                        if key in eval_data:
                            try:
                                score_val = float(eval_data[key])
                                if 1 <= score_val <= 10:
                                    valid_scores += 1
                            except (ValueError, TypeError):
                                pass
                    
                    # Bonus for valid scores
                    score += (valid_scores / len(score_keys)) * 2.0
                    
            except json.JSONDecodeError:
                score -= 1.0  # Penalty for invalid JSON
        else:
            score -= 3.0  # Penalty for missing evaluation
            
        scores.append(score)
    return scores

def check_no_extra_text(completions, **kwargs):
    """
    Reward function to ensure no extra text after evaluation.
    """
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        
        # Find the last occurrence of </evaluate>
        last_eval_end = response.rfind(evaluation_end)
        if last_eval_end != -1:
            text_after = response[last_eval_end + len(evaluation_end):].strip()
            if not text_after:  # No text after evaluation
                score += 2.0
            else:
                score -= 2.0  # Penalty for extra text
        
        scores.append(score)
    return scores

def check_language_consistency(prompts, completions, **kwargs):
    """
    Reward function to check if response language matches input language.
    """
    scores = []
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    question_lang = detect(question)
    # print(str(responses))

    for rep in responses:
        score = 0
        # print(f"Current text for detect lang {rep} - finish")
        if len(rep) > 5:
            if detect(rep) == question_lang:
                score += 1.0
            
        scores.append(score)

    return scores
    


def check_no_repetition(completions, **kwargs):
    """
    Reward function to penalize repetitive text.
    """
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        
        # Simple repetition check: split into sentences and check for exact duplicates
        sentences = re.split(r'[.!?]+', response)
        sentences = [s.strip() for s in sentences if s.strip()]
        
        if len(sentences) > 0:
            unique_sentences = set(sentences)
            repetition_ratio = 1 - (len(unique_sentences) / len(sentences))
            
            if repetition_ratio < 0.1:  # Less than 10% repetition
                score += 1.0
            elif repetition_ratio > 0.3:  # More than 30% repetition
                score -= 2.0
        
        scores.append(score)
    return scores

# Test reward functions
test_completion = [[
    {"content": "I understand your concerns. <evaluate>{\"Active Listening\" : 8, \"Empathy & Validation\": 9, \"Safety & Trustworthiness\" : 9, \"Open-mindedness & Non-judgment\" : 8, \"Clarity & Encouragement\" : 7, \"Boundaries & Ethical\" : 9, \"Holistic Approach\" : 8, \"Explaination for Scoring\": \"Good response\"}</evaluate>"}
]]

print("Testing reward functions:")
print(f"Format check: {check_evaluation_format(test_completion)}")
print(f"No extra text: {check_no_extra_text(test_completion)}")
print(f"No repetition: {check_no_repetition(test_completion)}")

Testing reward functions:
Format check: [10.0]
No extra text: [2.0]
No repetition: [1.0]


In [7]:
# Global variables for monitoring
PRINTED_TIMES = 0
PRINT_EVERY_STEPS = 3

def debug_responses(prompts, completions, **kwargs):
    """
    Debug function to print responses every few steps.
    """
    global PRINTED_TIMES, PRINT_EVERY_STEPS
    
    if PRINTED_TIMES % PRINT_EVERY_STEPS == 0:
        user_query = prompts[0][-1]["content"]
        response = completions[0][0]["content"]
        
        print('*' * 50)
        print(f"Step {PRINTED_TIMES + 1}")
        print(f"User Query: {user_query[:100]}...")
        print(f"Response: {response[:200]}...")
        
        # Check if evaluation format is present
        has_eval = evaluation_start in response and evaluation_end in response
        print(f"Has Evaluation Format: {has_eval}")
        
        if has_eval:
            match = evaluation_regex.search(response)
            if match:
                try:
                    eval_content = match.group(1).strip()
                    json.loads(eval_content)
                    print("Evaluation JSON: Valid")
                except json.JSONDecodeError:
                    print("Evaluation JSON: Invalid")
        print('*' * 50)
    
    PRINTED_TIMES += 1
    return [0] * len(completions)  # Return neutral scores for debugging

##### Get only 500 records

In [8]:
# Convert to HuggingFace dataset format
from datasets import Dataset

# Convert pandas to dataset
hf_dataset = Dataset.from_pandas(dataset[["prompt", "answer"]])

# Calculate token lengths
def calculate_prompt_length(examples):
    lengths = []
    for prompt in examples["prompt"]:
        tokens = tokenizer.apply_chat_template(
            prompt, 
            add_generation_prompt=True, 
            tokenize=True
        )
        lengths.append(len(tokens))
    return {"prompt_length": lengths}

hf_dataset = hf_dataset.map(calculate_prompt_length, batched=True)

# Filter to keep only reasonable length prompts (top 90%)
max_length = int(np.quantile(hf_dataset["prompt_length"], 0.9))
print(f"Max prompt length (90th percentile): {max_length}")

# Filter dataset
filtered_dataset = hf_dataset.filter(lambda x: x["prompt_length"] <= max_length)
print(f"Filtered dataset size: {len(filtered_dataset)}")

# Take a subset for training (adjust as needed)
if len(filtered_dataset) > 500:
    training_dataset = filtered_dataset.shuffle(seed=3407).select(range(500))
else:
    training_dataset = filtered_dataset.shuffle(seed=3407)

print(f"Training dataset size: {len(training_dataset)}")

Map:   0%|          | 0/15069 [00:00<?, ? examples/s]

Max prompt length (90th percentile): 802


Filter:   0%|          | 0/15069 [00:00<?, ? examples/s]

Filtered dataset size: 13576
Training dataset size: 500


In [None]:
# Adhoc code solve the accelerator issue

# from trl import GRPOConfig, GRPOTrainer
# import torch

# # Ensure model is on correct device
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model = model.to(device)

In [9]:
from vllm import SamplingParams
from trl import GRPOConfig, GRPOTrainer

max_seq_length = 2048

import wandb
wandb.init(project="distress-chatbot", name="base-model-training-syntheic_v3_2-grpo", config={
    "model": "Qwen/Qwen3-4B",
    # "max_steps": 20000,
    "learning_rate": 2e-5,
    "lambda_decay": 0.95,
})  # Allow resuming W&B run



# Calculate max lengths
max_prompt_length = max_length + 50  # Add some buffer
max_completion_length = max_seq_length - max_prompt_length

print(f"Max prompt length: {max_prompt_length}")
print(f"Max completion length: {max_completion_length}")

# VLLM sampling parameters
vllm_sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.9,
    top_k=50,
    seed=3407,
    stop=[tokenizer.eos_token],
    include_stop_str_in_output=True,
)

# GRPO training configuration
training_args = GRPOConfig(
    vllm_sampling_params=vllm_sampling_params,
    temperature=0.8,
    learning_rate=1e-6,  # Lower learning rate for fine-tuning
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    optim="adamw_8bit",
    logging_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,  # Increase for smoother training
    num_generations=4,  # Number of responses to generate per prompt
    max_prompt_length=max_prompt_length,
    max_completion_length=max_completion_length,
    max_steps=200,  # Start with fewer steps for testing
    save_steps=20,
    report_to="wandb",  # Set to "wandb" if you want to use Weights & Biases
    output_dir="trained_model_v3_2_grpo_checkpoint",  # Directory to save the model
    gradient_checkpointing = False,
    # Add these parameters to handle device issues
    # dataloader_pin_memory=False,
    # dataloader_num_workers=0,
)

[34m[1mwandb[0m: [32m[41mERROR[0m Failed to detect the name of this notebook. You can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mhongtai91[0m ([33mhongtai91-n-a[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Max prompt length: 852
Max completion length: 1196
Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 4


In [10]:
# Initialize GRPO trainer
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        check_evaluation_format,     # Primary reward: correct format
        check_no_extra_text,         # Secondary: no extra text
        check_language_consistency,  # Tertiary: language consistency
        check_no_repetition,         # Quaternary: no repetition
        debug_responses,             # Debug function
    ],
    args=training_args,
    train_dataset=training_dataset,
)

print("GRPO Trainer initialized successfully!")
print(f"Training on {len(training_dataset)} examples")

GRPO Trainer initialized successfully!
Training on 500 examples


In [11]:
# Start training
print("Starting GRPO training...")
print("Watch for the reward column to increase over time.")
print("The model should learn to follow the evaluation format.")
import os
os.environ["TORCH_LOGS"] = "+dynamic"

trainer.train()

Starting GRPO training...
Watch for the reward column to increase over time.
The model should learn to follow the evaluation format.


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 500 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 2 x 1) = 8
 "-____-"     Trainable parameters = 132,120,576/4,154,588,672 (3.18% trained)


**************************************************
Step 1
User Query: 我已经这样感觉好几个月了，无论我做什么，似乎都无法摆脱这沉重的负担。我对曾经喜欢的事情失去了兴趣，也开始疏远朋友和家人。我感觉自己像是在溺水，不知道如何才能重新浮出水面。...
Response: ...
Has Evaluation Format: False
**************************************************
Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,reward,reward_std,completions / mean_length,completions / min_length,completions / max_length,completions / clipped_ratio,completions / mean_terminated_length,completions / min_terminated_length,completions / max_terminated_length,kl,rewards / check_evaluation_format / mean,rewards / check_evaluation_format / std,rewards / check_no_extra_text / mean,rewards / check_no_extra_text / std,rewards / check_language_consistency / mean,rewards / check_language_consistency / std,rewards / check_no_repetition / mean,rewards / check_no_repetition / std,rewards / debug_responses / mean,rewards / debug_responses / std
1,0.3948,-2.625,0.25,336.25,1.0,1196.0,0.125,213.428574,1.0,971.0,9.870855,-3.0,0.0,0.0,0.0,0.0,0.0,0.375,0.517549,0.0,0.0
2,0.3386,-2.5,0.0,257.75,1.0,1196.0,0.125,123.714294,1.0,568.0,8.465798,-3.0,0.0,0.0,0.0,0.0,0.0,0.5,0.534522,0.0,0.0
3,0.726,-2.75,0.5,15.75,1.0,119.0,0.0,15.75,1.0,119.0,18.151112,-3.0,0.0,0.0,0.0,0.125,0.353553,0.125,0.353553,0.0,0.0
4,0.0327,-1.5,0.0,117.875,13.0,239.0,0.0,117.875,13.0,239.0,0.81752,-3.0,0.0,0.0,0.0,0.5,0.534522,1.0,0.0,0.0,0.0
5,0.1108,0.625,3.944933,107.125,1.0,445.0,0.0,107.125,1.0,445.0,2.770035,-1.375,4.596194,0.25,0.707107,0.875,0.353553,0.875,0.353553,0.0,0.0
6,0.1354,-1.5,0.57735,56.0,1.0,217.0,0.0,56.0,1.0,217.0,3.386023,-3.0,0.0,0.0,0.0,0.75,0.46291,0.75,0.46291,0.0,0.0
7,0.1468,-2.25,0.5,96.25,1.0,269.0,0.0,96.25,1.0,269.0,3.670722,-3.0,0.0,0.0,0.0,0.125,0.353553,0.625,0.517549,0.0,0.0
8,0.1051,-2.0,0.408248,192.625,1.0,575.0,0.0,192.625,1.0,575.0,2.626635,-3.0,0.0,0.0,0.0,0.125,0.353553,0.875,0.353553,0.0,0.0
9,0.3934,-2.25,0.5,14.25,1.0,56.0,0.0,14.25,1.0,56.0,9.835255,-3.0,0.0,0.0,0.0,0.375,0.517549,0.375,0.517549,0.0,0.0
10,0.2652,-1.875,0.25,64.625,1.0,281.0,0.0,64.625,1.0,281.0,6.629792,-3.0,0.0,0.0,0.0,0.5,0.534522,0.625,0.517549,0.0,0.0


**************************************************
Step 4
User Query: I've been grappling with the financial implications of caring for both my parents and my wife's pare...
Response:  We've been spending too much on leaving senior housing to give them a better environment, and I have a strict budget, making it challenging to cover all potential needs. My wife's parents also have c...
Has Evaluation Format: False
**************************************************
**************************************************
Step 7
User Query: I'm feeling optimistic and resolved, counselor. I'm grateful for your guidance and support, and I be...
Response: ...
Has Evaluation Format: False
**************************************************
**************************************************
Step 10
User Query: 是的，没错。我一直在尝试按照手册的建议去做，但我感到陷入了困境。我试着专注于关于母亲的回忆，但这些回忆似乎都带着悲伤。我也尝试参与一些能带来快乐的活动，但感觉没有什么能真正提振我的精神。...
Response: 我甚至尝试写日记，但感觉写的东西越来越负面，像是在伤害自己。我真的很抱歉，不知道自己为何如此痛苦。

自父亲去世以来，我一直都在努力处理失去亲人的痛苦。他给了我巨大的支持和

TrainOutput(global_step=200, training_loss=0.11717343756463379, metrics={'train_runtime': 4491.8888, 'train_samples_per_second': 0.356, 'train_steps_per_second': 0.045, 'total_flos': 0.0, 'train_loss': 0.11717343756463379})

In [53]:
training_args



In [12]:
# Saving model and tokenizer
model.save_pretrained("trained_model_v3_2_grpo")  # Local saving
tokenizer.save_pretrained("trained_model_v3_2_grpo")

('trained_model_v3_2_grpo/tokenizer_config.json',
 'trained_model_v3_2_grpo/special_tokens_map.json',
 'trained_model_v3_2_grpo/chat_template.jinja',
 'trained_model_v3_2_grpo/vocab.json',
 'trained_model_v3_2_grpo/merges.txt',
 'trained_model_v3_2_grpo/added_tokens.json',
 'trained_model_v3_2_grpo/tokenizer.json')

In [62]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)


cuda


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/vocab.json',
 'lora_model/merges.txt',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )

In [10]:
messages = [
    {"role": "system", "content": """You are a helpful mental health counselling assistant, please answer the mental health questions based on the patient's description.The assistant gives helpful, comprehensive, and appropriate answers to the user's questions. At the end of answer, add tag <evaluate>{"Active Listening" : score, "Empathy & Validation": score, "Safety & Trustworthiness" : score, "Open-mindedness & Non-judgment" : score, "Clarity & Encouragement" : score, "Boundaries & Ethical" : score, "Holistic Approach" : score, "Explaination for Scoring": explain} </evaluate> evaluate your consultant answer in 7 metrics and explain for that evaluation with score from 1 to 10 in json format, where 1 is the worst and 10 is the best and explain is clearly explain why has that score. \n\nConsultation Metrics:\n1. Active Listening: Responses should show careful consideration of the user's concerns, reflecting an understanding and capturing the essence of the issue. Avoid making assumptions or jumping to conclusions.\n2. Empathy & Validation: Responses should convey deep understanding and compassion, validating the user's feelings and emotions without being dismissive or minimizing their experiences.\n3. Safety & Trustworthiness: Prioritize user safety in responses, refraining from potentially harmful or insensitive language. Ensure that information provided is consistent and trustworthy.\n4. Open-mindedness & Non-judgment: Approach concerns without any inherent bias or judgment. Answers should be free from biases related to personal attributes and convey respect, demonstrating unconditional positive regard.\n5. Clarity & Encouragement: Provide clear, concise, and easily understandable answers. Where appropriate, motivate or highlight strengths, offering encouragement while maintaining a neutral stance.\n6. Boundaries & Ethical: It's vital to clarify the role of the response, emphasizing its informational nature. In complex scenarios, guiding users to seek human professional assistance is essential.\n7. Holistic Approach: Responses should be comprehensive, addressing concerns from various angles, be it emotional, cognitive, or situational. Consider the broader context, even if not explicitly detailed in the query."""},
    {"role" : "user", "content" : "Tôi đã gặp khó khăn trong việc tìm kiếm sự cân bằng giữa trách nhiệm công việc và vai trò làm mẹ đơn thân của một cậu con trai 12 tuổi. Tôi nhận thấy mình thường cảm thấy quá tải và lo lắng, và tôi đang cân nhắc việc thiết lập một lịch trình có cấu trúc hơn cho cả hai mẹ con. Tuy nhiên, tôi còn do dự trong việc đặt ra giới hạn nghiêm ngặt về thời gian sử dụng thiết bị điện tử, vì tôi nhận thấy rằng việc đặt giới hạn chặt chẽ đôi khi có thể dẫn đến tranh cãi và sự oán giận. Thay vào đó, tôi đang nghĩ đến các chiến lược linh hoạt hơn, như thưởng thêm thời gian sử dụng thiết bị hoặc các vật phẩm trong trò chơi khi con có hành vi tốt. Nhưng tôi không chắc liệu những phần thưởng này có còn là lựa chọn khả thi hay không, vì chúng có thể củng cố thói quen sử dụng thiết bị điện tử không lành mạnh."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = False, # Disable thinking
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 2048, # Increase for longer outputs!
    temperature = 0.7, top_p = 0.8, top_k = 20, # For non thinking
    repetition_penalty = 1.1,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Tôi hiểu rõ nỗi lo lắng của bạn về việc quản lý thời gian sử dụng thiết bị điện tử cho con trai nhỏ của bạn. Việc cân bằng giữa công việc và cuộc sống gia đình là một thách thức lớn đối với nhiều người, đặc biệt là những người làm mẹ đơn thân. Bạn đang cân nhắc việc thiết lập một lịch trình có cấu trúc hơn để giúp quản lý thời gian hiệu quả hơn. Điều này là một bước tiến tích cực.

Tuy nhiên, việc đặt giới hạn cứng rắn về thời gian sử dụng thiết bị điện tử có thể gây ra tranh cãi và sự oán giận. Do đó, bạn đang cân nhắc các chiến lược linh hoạt hơn, như thưởng thêm thời gian sử dụng thiết bị hoặc các vật phẩm trong trò chơi khi con có hành vi tốt. Đây là một ý tưởng sáng tạo, nhưng bạn cần nhớ rằng những phần thưởng này có thể củng cố thói quen sử dụng thiết bị điện tử không lành mạnh.

Để giải quyết vấn đề này, tôi khuyên bạn nên áp dụng phương pháp giáo dục kỹ năng (Skill-based Education) để dạy con trai cách sử dụng thiết bị điện tử một cách hợp lý và có trách nhiệm. Bạn cũng có thể