### News

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Read our **[Qwen3 Guide](https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants which outperforms other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

### Unsloth

In [1]:
from unsloth import FastLanguageModel
import torch

lora_rank = 64

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B",
    max_seq_length = 4096,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    fast_inference = True,
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.7
    # token = "hf_...",      # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 06-16 04:40:14 [__init__.py:244] Automatically detected platform cuda.
==((====))==  Unsloth 2025.6.2: Fast Qwen3 patching. Transformers: 4.52.4. vLLM: 0.9.1.
   \\   /|    NVIDIA L40S. Num GPUs = 1. Max memory: 44.418 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen3-4b-unsloth-bnb-4bit with actual GPU utilization = 69.27%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 44.42 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 4096. Num Sequences = 320.
Unsloth: vLLM's KV Cache can use up to 27.91 GB. Also swap space 

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 06-16 04:40:38 [punica_selector.py:19] Using PunicaWrapperGPU.
INFO 06-16 04:40:39 [gpu_model_runner.py:1624] Model loading took 3.5740 GiB and 2.784077 seconds
INFO 06-16 04:40:53 [backends.py:462] Using cache directory: /root/.cache/vllm/torch_compile_cache/9bced0cd06/rank_0_0 for vLLM's torch.compile
INFO 06-16 04:40:53 [backends.py:472] Dynamo bytecode transform time: 12.79 s
INFO 06-16 04:41:02 [backends.py:135] Directly load the compiled graph(s) for shape None from the cache, took 8.845 s
INFO 06-16 04:41:26 [monitor.py:34] torch.compile takes 12.79 s in total
INFO 06-16 04:41:29 [gpu_worker.py:227] Available KV cache memory: 25.36 GiB
INFO 06-16 04:41:29 [kv_cache_utils.py:715] GPU KV cache size: 184,672 tokens
INFO 06-16 04:41:29 [kv_cache_utils.py:719] Maximum concurrency for 4,096 tokens per request: 45.09x
INFO 06-16 04:42:44 [gpu_model_runner.py:2048] Graph capturing finished in 74 secs, took 1.48 GiB
INFO 06-16 04:42:44 [core.py:171] init engine (profile, create kv c

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [2]:
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,           # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = lora_rank,  # Best to choose alpha = rank or rank*2
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,   # We support rank stabilized LoRA
    loftq_config = None,  # And LoftQ
)

Unsloth 2025.6.2 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


<a name="Data"></a>
### Data Prep
We will use a custom dataset named `Interview_Data_6K.csv`. This dataset contains conversations with a mental health counselling assistant. Each entry has an `instruction` (acting as a system prompt), an `input` (the user's message), and an `output` (the assistant's response).

We need to convert this CSV data into a format suitable for training with `SFTTrainer`, specifically by applying the Qwen3 chat template.

In [3]:
from datasets import load_dataset, DatasetDict

# Load the custom CSV dataset
# The dataset has 'instruction', 'input', 'output' columns
full_interview_dataset = load_dataset("csv", data_files="./dataset/stage_2_1_synthetic_interview_data_combined.csv", split="train")

# Split the dataset into training and evaluation sets (e.g., 90% train, 10% eval)
# Ensure the dataset has more than one example for splitting
if len(full_interview_dataset) > 1:
    train_test_split = full_interview_dataset.train_test_split(test_size=0.05, seed=3407)
    interview_train_dataset = train_test_split['train']
    interview_eval_dataset = train_test_split['test']
    print(f"Training set size: {len(interview_train_dataset)}")
    print(f"Evaluation set size: {len(interview_eval_dataset)}")
else:
    # Handle cases with very small datasets
    interview_train_dataset = full_interview_dataset
    interview_eval_dataset = None
    print(f"Training set size: {len(interview_train_dataset)}")
    print("No evaluation set created due to small dataset size.")

# We will display the structure of the training dataset in the next cell.

Generating train split: 0 examples [00:00, ? examples/s]

Training set size: 4521
Evaluation set size: 238


Let's see the structure of our loaded training dataset:

In [4]:
interview_train_dataset

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 4521
})

In [5]:
import pandas as pd # For pd.notna

# 'interview_train_dataset' and 'interview_eval_dataset' are from the previous cell
# The 'tokenizer' is globally defined in a prior cell (Cell 6, id="c1180838")

def convert_csv_to_chat_format(examples):
    """
    Converts a batch of examples from the CSV structure to a list of conversations.
    Each conversation is a list of dictionaries with "role" and "content".
    """
    all_conversations = []
    # 'examples' is a dictionary where keys are column names and values are lists of entries
    num_examples = len(examples['instruction'])
    
    for i in range(num_examples):
        instruction = examples['instruction'][i]
        input_text = examples['input'][i]
        output_text = examples['output'][i]

        # Ensure input_text is a string; handle None or NaN by treating as empty string if necessary
        input_text_str = str(input_text) if pd.notna(input_text) and str(input_text).strip() else ""

        current_conversation = []
        # System prompt from 'instruction'
        current_conversation.append({"role": "system", "content": instruction})
        
        # User message from 'input'
        # Based on the dataset, 'input' should always be present.
        # If input_text_str is empty, this will add a user message with empty content.
        # This is generally fine as the model should learn to handle it or it implies
        # the system prompt itself is the query.
        current_conversation.append({"role": "user", "content": input_text_str})
            
        # Assistant message from 'output'
        current_conversation.append({"role": "assistant", "content": output_text})
        
        all_conversations.append(current_conversation)
        
    return {"conversations": all_conversations}

def apply_template_to_conversations(examples):
    """
    Applies the tokenizer's chat template to a batch of conversations.
    Creates a 'text' field for SFTTrainer.
    """
    # tokenizer should be globally available
    return {
        "text": tokenizer.apply_chat_template(
            examples["conversations"], # This is a list of conversations
            tokenize=False,
            add_generation_prompt=False, # Crucial for training
        )
    }

# Process training data
# 1. Convert CSV structure to list of message dicts
train_dataset_with_conversations = interview_train_dataset.map(
    convert_csv_to_chat_format,
    batched=True,
    remove_columns=interview_train_dataset.column_names # Keep only 'conversations'
)
# 2. Apply chat template to create the 'text' field
final_train_dataset = train_dataset_with_conversations.map(
    apply_template_to_conversations,
    batched=True,
    remove_columns=["conversations"] # Keep only 'text'
)
# 3. Shuffle the training dataset
final_train_dataset = final_train_dataset.shuffle(seed=3407)

# Process evaluation data (if it exists)
final_eval_dataset = None
if interview_eval_dataset:
    eval_dataset_with_conversations = interview_eval_dataset.map(
        convert_csv_to_chat_format,
        batched=True,
        remove_columns=interview_eval_dataset.column_names
    )
    final_eval_dataset = eval_dataset_with_conversations.map(
        apply_template_to_conversations,
        batched=True,
        remove_columns=["conversations"]
    )
    # No need to shuffle eval_dataset

print("Sample of final formatted training data (after chat template):")
if len(final_train_dataset) > 0:
    print(final_train_dataset[0]['text'])
else:
    print("Training dataset is empty after processing.")

if final_eval_dataset and len(final_eval_dataset) > 0:
    print("\nSample of final formatted evaluation data (after chat template):")
    print(final_eval_dataset[0]['text'])
elif interview_eval_dataset: # If interview_eval_dataset existed but final_eval_dataset is empty
    print("\nEvaluation dataset is empty after processing.")

Map:   0%|          | 0/4521 [00:00<?, ? examples/s]

Map:   0%|          | 0/4521 [00:00<?, ? examples/s]

Map:   0%|          | 0/238 [00:00<?, ? examples/s]

Map:   0%|          | 0/238 [00:00<?, ? examples/s]

Sample of final formatted training data (after chat template):
<|im_start|>system
You are a helpful mental health counselling assistant, please answer the mental health questions based on the patient's description. 
The assistant gives helpful, comprehensive, and appropriate answers to the user's questions. <|im_end|>
<|im_start|>user
我希望透過這次輔導可以學習點樣更好地處理與同事之間嘅關係，因為最近喺公司嘅壓力特別大，令我成日覺得心情好唔好。同事之間有啲誤會，我感覺自己被排斥，好似大家都唔太願意同我溝通，呢啲情況令我感到孤立同埋唔受重視。尤其係上星期，我做咗一份重要嘅報告，完成得唔錯，但同事卻冇表示任何認同，好似我嘅努力完全被忽視咁。呢啲事情令我夜晚訓唔著，經常會諗住啲負面嘅嘢，覺得自己做乜嘢都唔啱，心情低落同埋好焦慮，甚至有時頭痛同胃痛，呢啲症狀每星期都會出現幾次，通常持續幾個鐘頭。喺家庭方面，我同父母嘅關係算係幾好，但佢哋唔太了解我工作上面嘅壓力，所以我都唔太願意講。以前喺讀書時期，曾經有過被同學排擠嘅經歷，所以今次嘅情況特別令我想起當時嘅感受。我試過用聽音樂同做運動嚟舒緩情緒，但效果唔係好持久。我想知道，喺輔導過程中我應該點樣表達自己嘅感受？另外，對於改善同事關係，有冇啲有效嘅方法或技巧？如果我想用中醫同西醫結合嘅方法去處理情緒，有冇啲建議？<|im_end|>
<|im_start|>assistant
<think>

</think>

你提出想學習如何更好地與同事相處，這是非常實際且重要的目標。當感受到被排擠和忽視，情緒自然會受到很大影響，而你提及的頭痛和胃痛，很可能是身體對壓力的反應。建議在輔導中，可以嘗試坦誠地分享你的感受和經歷，不用擔心被評價，因為這裡是一個安全的空間。試著用具體事件來描述你的情緒變化，例如那份報告被忽視的時刻，讓討論更聚焦。針對改善同事關係，建立溝通橋樑很重要，可以嘗試主動表達你的想法與感受，同時

<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off `max_steps=None`.

In [6]:
import wandb
wandb.init(project="distress-chatbot", name="base-model-training-syntheic_v2_1", config={
    "model": "Qwen/Qwen3-4B",
    # "max_steps": 20000,
    "learning_rate": 2e-5,
    "lambda_decay": 0.95,
})  # Allow resuming W&B run

from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = final_train_dataset, # Use the processed training data
    eval_dataset = final_eval_dataset,   # Use the processed evaluation data
    args = SFTConfig(
        dataset_text_field = "text", # Column containing the formatted text
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        num_train_epochs = 3, # Set this for 1 full training run.
        # max_steps = 60, # Adjusted for quicker testing, was 30. Set to None or higher for full training.
        learning_rate = 2e-5, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "wandb", # Use this for WandB etc
        # evaluation_strategy = "steps" if final_eval_dataset else "no", # Enable evaluation if eval_dataset is present
        # eval_steps = 20, # Evaluate every N steps, adjust as needed
    ),
)

# If using evaluation, you might want to set evaluation_strategy and eval_steps in SFTConfig
if final_eval_dataset:
    trainer.args.evaluation_strategy = "steps"
    trainer.args.eval_steps = 20 # Or any other desired frequency
else:
    trainer.args.evaluation_strategy = "no"

[34m[1mwandb[0m: [32m[41mERROR[0m Failed to detect the name of this notebook. You can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mhongtai91[0m ([33mhongtai91-n-a[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Unsloth: Tokenizing ["text"] (num_proc=256):   0%|          | 0/4521 [00:00<?, ? examples/s]

num_proc must be <= 238. Reducing num_proc to 238 for dataset of size 238.


Unsloth: Tokenizing ["text"] (num_proc=238):   0%|          | 0/238 [00:00<?, ? examples/s]

In [7]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA L40S. Max memory = 44.418 GB.
31.408 GB of memory reserved.


Let's train the model! To resume a training run, set `trainer.train(resume_from_checkpoint = True)`

In [8]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 4,521 | Num Epochs = 3 | Total steps = 849
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 132,120,576/4,000,000,000 (3.30% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.3751
2,2.446
3,2.3934
4,2.354
5,2.2543
6,2.3518
7,2.3156
8,2.2353
9,2.1576
10,2.2093


In [9]:
model.save_pretrained("trained_model_v2_1")  # Local saving
tokenizer.save_pretrained("trained_model_v2_1")

('trained_model_v2_1/tokenizer_config.json',
 'trained_model_v2_1/special_tokens_map.json',
 'trained_model_v2_1/chat_template.jinja',
 'trained_model_v2_1/vocab.json',
 'trained_model_v2_1/merges.txt',
 'trained_model_v2_1/added_tokens.json',
 'trained_model_v2_1/tokenizer.json')

## Reinforcement Learning with GRPO

In [10]:
# Cleaning memory for next round of training
torch.cuda.empty_cache()
import gc
gc.collect()

499

#### Pre-finetuning for formatting alignment

In [11]:
from datasets import load_dataset, DatasetDict

# Load the custom CSV dataset
formatting_interview_dataset = load_dataset("csv", data_files="./dataset/generated_responses_60_samples.csv", split="train")
print(f"Loaded dataset with {len(formatting_interview_dataset)} examples.")
interview_train_dataset = formatting_interview_dataset
interview_eval_dataset = None  # No evaluation set for this dataset


Loaded dataset with 60 examples.


In [12]:
import pandas as pd # For pd.notna

# 'interview_train_dataset' and 'interview_eval_dataset' are from the previous cell
# The 'tokenizer' is globally defined in a prior cell (Cell 6, id="c1180838")

instruction_for_formatting = """You are a helpful mental health counselling assistant, please answer the mental health questions based on the patient's description.The assistant gives helpful, comprehensive, and appropriate answers to the user's questions. At the end of answer, add tag <evaluate>{"Active Listening" : score, "Empathy & Validation": score, "Safety & Trustworthiness" : score, "Open-mindedness & Non-judgment" : score, "Clarity & Encouragement" : score, "Boundaries & Ethical" : score, "Holistic Approach" : score, "Explaination for Scoring": explain} </evaluate> evaluate your consultant answer in 7 metrics and explain for that evaluation with score from 1 to 10 in json format, where 1 is the worst and 10 is the best and explain is clearly explain why has that score. \n\nConsultation Metrics:\n1. Active Listening: Responses should show careful consideration of the user's concerns, reflecting an understanding and capturing the essence of the issue. Avoid making assumptions or jumping to conclusions.\n2. Empathy & Validation: Responses should convey deep understanding and compassion, validating the user's feelings and emotions without being dismissive or minimizing their experiences.\n3. Safety & Trustworthiness: Prioritize user safety in responses, refraining from potentially harmful or insensitive language. Ensure that information provided is consistent and trustworthy.\n4. Open-mindedness & Non-judgment: Approach concerns without any inherent bias or judgment. Answers should be free from biases related to personal attributes and convey respect, demonstrating unconditional positive regard.\n5. Clarity & Encouragement: Provide clear, concise, and easily understandable answers. Where appropriate, motivate or highlight strengths, offering encouragement while maintaining a neutral stance.\n6. Boundaries & Ethical: It's vital to clarify the role of the response, emphasizing its informational nature. In complex scenarios, guiding users to seek human professional assistance is essential.\n7. Holistic Approach: Responses should be comprehensive, addressing concerns from various angles, be it emotional, cognitive, or situational. Consider the broader context, even if not explicitly detailed in the query."""

def convert_csv_to_chat_format(examples):
    """
    Converts a batch of examples from the CSV structure to a list of conversations.
    Each conversation is a list of dictionaries with "role" and "content".
    """
    all_conversations = []
    # 'examples' is a dictionary where keys are column names and values are lists of entries
    num_examples = len(examples['instruction'])
    
    for i in range(num_examples):
        instruction = instruction_for_formatting
        input_text = examples['input'][i]
        output_text = examples['output'][i]

        # Ensure input_text is a string; handle None or NaN by treating as empty string if necessary
        input_text_str = str(input_text) if pd.notna(input_text) and str(input_text).strip() else ""

        current_conversation = []
        # System prompt from 'instruction'
        current_conversation.append({"role": "system", "content": instruction})
        
        # User message from 'input'
        # Based on the dataset, 'input' should always be present.
        # If input_text_str is empty, this will add a user message with empty content.
        # This is generally fine as the model should learn to handle it or it implies
        # the system prompt itself is the query.
        current_conversation.append({"role": "user", "content": input_text_str})
            
        # Assistant message from 'output'
        current_conversation.append({"role": "assistant", "content": output_text})
        
        all_conversations.append(current_conversation)
        
    return {"conversations": all_conversations}

def apply_template_to_conversations(examples):
    """
    Applies the tokenizer's chat template to a batch of conversations.
    Creates a 'text' field for SFTTrainer.
    """
    # tokenizer should be globally available
    return {
        "text": tokenizer.apply_chat_template(
            examples["conversations"], # This is a list of conversations
            tokenize=False,
            add_generation_prompt=False, # Crucial for training
        )
    }

# Process training data
# 1. Convert CSV structure to list of message dicts
train_dataset_with_conversations = interview_train_dataset.map(
    convert_csv_to_chat_format,
    batched=True,
    remove_columns=interview_train_dataset.column_names # Keep only 'conversations'
)
# 2. Apply chat template to create the 'text' field
final_train_dataset = train_dataset_with_conversations.map(
    apply_template_to_conversations,
    batched=True,
    remove_columns=["conversations"] # Keep only 'text'
)
# 3. Shuffle the training dataset
final_train_dataset = final_train_dataset.shuffle(seed=3407)

# Process evaluation data (if it exists)
final_eval_dataset = None
if interview_eval_dataset:
    eval_dataset_with_conversations = interview_eval_dataset.map(
        convert_csv_to_chat_format,
        batched=True,
        remove_columns=interview_eval_dataset.column_names
    )
    final_eval_dataset = eval_dataset_with_conversations.map(
        apply_template_to_conversations,
        batched=True,
        remove_columns=["conversations"]
    )
    # No need to shuffle eval_dataset

print("Sample of final formatted training data (after chat template):")
if len(final_train_dataset) > 0:
    print(final_train_dataset[0]['text'])
else:
    print("Training dataset is empty after processing.")

if final_eval_dataset and len(final_eval_dataset) > 0:
    print("\nSample of final formatted evaluation data (after chat template):")
    print(final_eval_dataset[0]['text'])
elif interview_eval_dataset: # If interview_eval_dataset existed but final_eval_dataset is empty
    print("\nEvaluation dataset is empty after processing.")

Sample of final formatted training data (after chat template):
<|im_start|>system
You are a helpful mental health counselling assistant, please answer the mental health questions based on the patient's description.The assistant gives helpful, comprehensive, and appropriate answers to the user's questions. At the end of answer, add tag <evaluate>{"Active Listening" : score, "Empathy & Validation": score, "Safety & Trustworthiness" : score, "Open-mindedness & Non-judgment" : score, "Clarity & Encouragement" : score, "Boundaries & Ethical" : score, "Holistic Approach" : score, "Explaination for Scoring": explain} </evaluate> evaluate your consultant answer in 7 metrics and explain for that evaluation with score from 1 to 10 in json format, where 1 is the worst and 10 is the best and explain is clearly explain why has that score. 

Consultation Metrics:
1. Active Listening: Responses should show careful consideration of the user's concerns, reflecting an understanding and capturing the ess

In [13]:
# Traing the model for formatting
import wandb
wandb.init(project="distress-chatbot", name="base-model-training-syntheic_v2_2-formatting", config={
    "model": "Qwen/Qwen3-4B",
    # "max_steps": 20000,
    "learning_rate": 2e-5,
    "lambda_decay": 0.95,
})  # Allow resuming W&B run

from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = final_train_dataset, # Use the processed training data
    eval_dataset = final_eval_dataset,   # Use the processed evaluation data
    args = SFTConfig(
        dataset_text_field = "text", # Column containing the formatted text
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        num_train_epochs = 5, # Set this more for learning formatting. 
        # max_steps = 60, # Adjusted for quicker testing, was 30. Set to None or higher for full training.
        learning_rate = 2e-5, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "wandb", # Use this for WandB etc
        # evaluation_strategy = "steps" if final_eval_dataset else "no", # Enable evaluation if eval_dataset is present
        # eval_steps = 20, # Evaluate every N steps, adjust as needed
    ),
)

# If using evaluation, you might want to set evaluation_strategy and eval_steps in SFTConfig
if final_eval_dataset:
    trainer.args.evaluation_strategy = "steps"
    trainer.args.eval_steps = 20 # Or any other desired frequency
else:
    trainer.args.evaluation_strategy = "no"

0,1
train/epoch,▁▁▁▁▁▂▂▂▂▂▂▂▂▃▃▄▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇███
train/global_step,▁▁▁▁▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▃▄▄▄▄▄▄▄▅▆▆▆▇▇▇▇███
train/grad_norm,▁▂▂▂▃▃▃▃▄▄▆▅▆▆▆▆▆▆▆▅▆▇▇▇▆▆▆██▇▇▇█▇█▆▇▇▇▇
train/learning_rate,█████▇▇▇▇▇▆▆▆▆▅▅▅▅▅▅▅▄▄▄▄▄▄▄▄▃▃▃▃▃▂▂▁▁▁▁
train/loss,█▅▄▂▃▂▂▂▂▂▂▂▂▂▁▁▁▁▂▂▁▂▁▁▁▁▁▁▁▁▁▁▂▁▁▁▂▁▁▁

0,1
total_flos,1.6927397422053274e+17
train/epoch,3.0
train/global_step,849.0
train/grad_norm,0.82329
train/learning_rate,0.0
train/loss,1.1695
train_loss,1.22504
train_runtime,2519.1219
train_samples_per_second,5.384
train_steps_per_second,0.337


num_proc must be <= 60. Reducing num_proc to 60 for dataset of size 60.


In [14]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 60 | Num Epochs = 5 | Total steps = 20
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 132,120,576/4,000,000,000 (3.30% trained)


Step,Training Loss
1,2.0228
2,1.84
3,1.894
4,1.9197
5,1.7986
6,1.7134
7,1.6655
8,1.5424
9,1.5457
10,1.4869


In [15]:
# Saving model and tokenizer
model.save_pretrained("trained_model_v2_1_formatting")  # Local saving
tokenizer.save_pretrained("trained_model_v2_1_formatting")

('trained_model_v2_1_formatting/tokenizer_config.json',
 'trained_model_v2_1_formatting/special_tokens_map.json',
 'trained_model_v2_1_formatting/chat_template.jinja',
 'trained_model_v2_1_formatting/vocab.json',
 'trained_model_v2_1_formatting/merges.txt',
 'trained_model_v2_1_formatting/added_tokens.json',
 'trained_model_v2_1_formatting/tokenizer.json')

#### Reinforcement Learning with Group Relative Policy Optimization


In [16]:
# Clear accelerator state and reinitialize
# import torch
import gc
# from accelerate import Accelerator

# Clear CUDA cache
# torch.cuda.empty_cache()
gc.collect()

# # Reset accelerator state
# try:
#     from accelerate.state import AcceleratorState
#     AcceleratorState._reset_state(reset_partial_state=True)
# except:
#     pass

# Reinitialize accelerator
# accelerator = Accelerator()

522

In [17]:
# Define the evaluation format markers
evaluation_start = "<evaluate>"
evaluation_end = "</evaluate>"

system_prompt = """
You are a helpful mental health counselling assistant. Please answer mental health questions based on the patient description.
Provide helpful, comprehensive, and appropriate answers to the user questions.

After your counselling response, you must include a self-evaluation in the following format:
<evaluate>
{"Active Listening" : score, "Empathy & Validation" : score, "Safety & Trustworthiness" : score, "Open-mindedness & Non-judgment" : score, "Clarity & Encouragement" : score, "Boundaries & Ethical" : score, "Holistic Approach" : score, "Explaination for Scoring": "Your explanation here"}
</evaluate>

Where score is a number from 1-10, and provide a clear explanation for your scoring.
Explain to metrics:
1. Active Listening: Responses should show careful consideration of the user concerns, reflecting an understanding and capturing the essence of the issue. Avoid making assumptions or jumping to conclusions.
2. Empathy & Validation: Responses should convey deep understanding and compassion, validating the user feelings and emotions without being dismissive or minimizing their experiences.
3. Safety & Trustworthiness: Prioritize user safety in responses, refraining from potentially harmful or insensitive language. Ensure that information provided is consistent and trustworthy.
4. Open-mindedness & Non-judgment: Approach concerns without any inherent bias or judgment. Answers should be free from biases related to personal attributes and convey respect, demonstrating unconditional positive regard.
5. Clarity & Encouragement: Provide clear, concise, and easily understandable answers. Where appropriate, motivate or highlight strengths, offering encouragement while maintaining a neutral stance.
6. Boundaries & Ethical: It is vital to clarify the role of the response, emphasizing its informational nature. In complex scenarios, guiding users to seek human professional assistance is essential.
7. Holistic Approach: Responses should be comprehensive, addressing concerns from various angles, be it emotional, cognitive, or situational. Consider the broader context, even if not explicitly detailed in the query.
""".strip()

print("System prompt:")
print(system_prompt)

System prompt:
You are a helpful mental health counselling assistant. Please answer mental health questions based on the patient description.
Provide helpful, comprehensive, and appropriate answers to the user questions.

After your counselling response, you must include a self-evaluation in the following format:
<evaluate>
{"Active Listening" : score, "Empathy & Validation" : score, "Safety & Trustworthiness" : score, "Open-mindedness & Non-judgment" : score, "Clarity & Encouragement" : score, "Boundaries & Ethical" : score, "Holistic Approach" : score, "Explaination for Scoring": "Your explanation here"}
</evaluate>

Where score is a number from 1-10, and provide a clear explanation for your scoring.
Explain to metrics:
1. Active Listening: Responses should show careful consideration of the user concerns, reflecting an understanding and capturing the essence of the issue. Avoid making assumptions or jumping to conclusions.
2. Empathy & Validation: Responses should convey deep underst

In [18]:
# Create a simple chat template
chat_template = \
    "{% if messages[0]['role'] == 'system' %}"\
        "{{ messages[0]['content'] + eos_token }}"\
        "{% set loop_messages = messages[1:] %}"\
    "{% else %}"\
        "{{ '{system_prompt}' + eos_token }}"\
        "{% set loop_messages = messages %}"\
    "{% endif %}"\
    "{% for message in loop_messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ message['content'] }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ message['content'] + eos_token }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}{{ '' }}"\
    "{% endif %}"

# Replace with our specific template:
chat_template = chat_template.replace("'{system_prompt}'", f"'{system_prompt}'")
tokenizer.chat_template = chat_template

In [41]:
# Test the chat template
test_messages = [
    {"role": "user", "content": "I'm feeling anxious about work."},
    {"role": "assistant", "content": "I understand that work anxiety can be challenging. Let me help you explore some strategies. <evaluate>{\"Active Listening\" : 8, \"Empathy & Validation\": 9, \"Safety & Trustworthiness\" : 9, \"Open-mindedness & Non-judgment\" : 8, \"Clarity & Encouragement\" : 7, \"Boundaries & Ethical\" : 9, \"Holistic Approach\" : 8, \"Explaination for Scoring\": \"Provided empathetic response with good listening skills.\"}</evaluate>"},
    {"role": "user", "content": "Can you suggest some techniques?"}
]

formatted_text = tokenizer.apply_chat_template(
    test_messages,
    tokenize=False,
    add_generation_prompt=True
)

print("Formatted text:")
print(formatted_text)

Formatted text:
You are a helpful mental health counselling assistant. Please answer mental health questions based on the patient description.
Provide helpful, comprehensive, and appropriate answers to the user questions.

After your counselling response, you must include a self-evaluation in the following format:
<evaluate>
{"Active Listening" : score, "Empathy & Validation" : score, "Safety & Trustworthiness" : score, "Open-mindedness & Non-judgment" : score, "Clarity & Encouragement" : score, "Boundaries & Ethical" : score, "Holistic Approach" : score, "Explaination for Scoring": "Your explanation here"}
</evaluate>

Where score is a number from 1-10, and provide a clear explanation for your scoring.
Explain to metrics:
1. Active Listening: Responses should show careful consideration of the user concerns, reflecting an understanding and capturing the essence of the issue. Avoid making assumptions or jumping to conclusions.
2. Empathy & Validation: Responses should convey deep unders

In [19]:
from datasets import load_dataset
import pandas as pd
import numpy as np

# Load your dataset
dataset = load_dataset("csv", data_files="./dataset/stage_2_1_synthetic_interview_data_combined.csv", split="train")
dataset = dataset.to_pandas()

print(f"Dataset shape: {dataset.shape}")
print(f"Columns: {dataset.columns.tolist()}")
print("\nFirst few rows:")
print(dataset.head())

Dataset shape: (4759, 3)
Columns: ['instruction', 'input', 'output']

First few rows:
                                         instruction  \
0  You are a helpful mental health counselling as...   
1  You are a helpful mental health counselling as...   
2  You are a helpful mental health counselling as...   
3  You are a helpful mental health counselling as...   
4  You are a helpful mental health counselling as...   

                                               input  \
0  I've been struggling with alcohol use for a wh...   
1  我希望通过这次咨询，能更好地理解自己目前的情绪状态，并找到缓解压力和焦虑的方法。最近几个月，...   
2  Tôi mong muốn qua buổi tư vấn này, tôi có thể ...   
3  أرغب في مناقشة موضوع حساس يخص حياتي الشخصية وأ...   
4  最近我一直感到很难过和孤独，特别是因为我刚刚失去了我的祖母。她在我生活中一直是个非常重要的支...   

                                              output  
0  Managing alcohol use involves understanding th...  
1  在心理咨询过程中，保护您的隐私是非常重要的，所有您分享的信息都会严格保密，除非涉及您或他人的...  
2  Kiểm soát cảm giác lo lắng bắt đầu bằng việc n...  
3  التعامل مع المشاع

In [20]:
def format_dataset_for_grpo(x):
    instruction = x["instruction"]
    user_input = x["input"] if pd.notna(x["input"]) else ""
    
    # Create the conversation format
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]

# Prepare dataset for GRPO
dataset["prompt"] = dataset.apply(format_dataset_for_grpo, axis=1)
dataset["answer"] = dataset["output"]  # Use the original answer as reference

print("Sample prompt:")
print(dataset["prompt"][0])
print("\nSample answer:")
print(dataset["answer"][0])

Sample prompt:
[{'role': 'system', 'content': 'You are a helpful mental health counselling assistant. Please answer mental health questions based on the patient description.\nProvide helpful, comprehensive, and appropriate answers to the user questions.\n\nAfter your counselling response, you must include a self-evaluation in the following format:\n<evaluate>\n{"Active Listening" : score, "Empathy & Validation" : score, "Safety & Trustworthiness" : score, "Open-mindedness & Non-judgment" : score, "Clarity & Encouragement" : score, "Boundaries & Ethical" : score, "Holistic Approach" : score, "Explaination for Scoring": "Your explanation here"}\n</evaluate>\n\nWhere score is a number from 1-10, and provide a clear explanation for your scoring.\nExplain to metrics:\n1. Active Listening: Responses should show careful consideration of the user concerns, reflecting an understanding and capturing the essence of the issue. Avoid making assumptions or jumping to conclusions.\n2. Empathy & Valid

##### Define reward function

In [23]:
import re
import json
from langdetect import detect

# Create regex to match the evaluation format
evaluation_regex = re.compile(
    rf"{evaluation_start}(.+?){evaluation_end}",
    flags=re.MULTILINE | re.DOTALL
)

def check_evaluation_format(completions, **kwargs):
    """
    Reward function for checking if the response follows the evaluation format exactly.
    """
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        
        # Check if evaluation format is present
        if evaluation_regex.search(response) is not None:
            score += 5.0  # High reward for format compliance
            
            # Extract and validate JSON structure
            try:
                match = evaluation_regex.search(response)
                if match:
                    json_content = match.group(1).strip()
                    eval_data = json.loads(json_content)
                    
                    # Check for required keys
                    required_keys = [
                        "Active Listening", "Empathy & Validation", "Safety & Trustworthiness",
                        "Open-mindedness & Non-judgment", "Clarity & Encouragement", 
                        "Boundaries & Ethical", "Holistic Approach", "Explaination for Scoring"
                    ]
                    
                    if all(key in eval_data for key in required_keys):
                        score += 3.0  # Bonus for complete structure
                    
                    # Check if scores are numbers between 1-10
                    score_keys = required_keys[:-1]  # Exclude explanation
                    valid_scores = 0
                    for key in score_keys:
                        if key in eval_data:
                            try:
                                score_val = float(eval_data[key])
                                if 1 <= score_val <= 10:
                                    valid_scores += 1
                            except (ValueError, TypeError):
                                pass
                    
                    # Bonus for valid scores
                    score += (valid_scores / len(score_keys)) * 2.0
                    
            except json.JSONDecodeError:
                score -= 1.0  # Penalty for invalid JSON
        else:
            score -= 3.0  # Penalty for missing evaluation
            
        scores.append(score)
    return scores

def check_no_extra_text(completions, **kwargs):
    """
    Reward function to ensure no extra text after evaluation.
    """
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        
        # Find the last occurrence of </evaluate>
        last_eval_end = response.rfind(evaluation_end)
        if last_eval_end != -1:
            text_after = response[last_eval_end + len(evaluation_end):].strip()
            if not text_after:  # No text after evaluation
                score += 2.0
            else:
                score -= 2.0  # Penalty for extra text
        
        scores.append(score)
    return scores

def check_language_consistency(prompts, completions, **kwargs):
    """
    Reward function to check if response language matches input language.
    """
    scores = []
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    question_lang = detect(question)
    # print(str(responses))

    for rep in responses:
        score = 0
        # print(f"Current text for detect lang {rep} - finish")
        if len(rep) > 5:
            if detect(rep) == question_lang:
                score += 1.0
            
        scores.append(score)

    return scores
    


def check_no_repetition(completions, **kwargs):
    """
    Reward function to penalize repetitive text.
    """
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        
        # Simple repetition check: split into sentences and check for exact duplicates
        sentences = re.split(r'[.!?]+', response)
        sentences = [s.strip() for s in sentences if s.strip()]
        
        if len(sentences) > 0:
            unique_sentences = set(sentences)
            repetition_ratio = 1 - (len(unique_sentences) / len(sentences))
            
            if repetition_ratio < 0.1:  # Less than 10% repetition
                score += 1.0
            elif repetition_ratio > 0.3:  # More than 30% repetition
                score -= 2.0
        
        scores.append(score)
    return scores

# Test reward functions
test_completion = [[
    {"content": "I understand your concerns. <evaluate>{\"Active Listening\" : 8, \"Empathy & Validation\": 9, \"Safety & Trustworthiness\" : 9, \"Open-mindedness & Non-judgment\" : 8, \"Clarity & Encouragement\" : 7, \"Boundaries & Ethical\" : 9, \"Holistic Approach\" : 8, \"Explaination for Scoring\": \"Good response\"}</evaluate>"}
]]

print("Testing reward functions:")
print(f"Format check: {check_evaluation_format(test_completion)}")
print(f"No extra text: {check_no_extra_text(test_completion)}")
print(f"No repetition: {check_no_repetition(test_completion)}")

Testing reward functions:
Format check: [10.0]
No extra text: [2.0]
No repetition: [1.0]


In [24]:
# Global variables for monitoring
PRINTED_TIMES = 0
PRINT_EVERY_STEPS = 3

def debug_responses(prompts, completions, **kwargs):
    """
    Debug function to print responses every few steps.
    """
    global PRINTED_TIMES, PRINT_EVERY_STEPS
    
    if PRINTED_TIMES % PRINT_EVERY_STEPS == 0:
        user_query = prompts[0][-1]["content"]
        response = completions[0][0]["content"]
        
        print('*' * 50)
        print(f"Step {PRINTED_TIMES + 1}")
        print(f"User Query: {user_query[:100]}...")
        print(f"Response: {response[:200]}...")
        
        # Check if evaluation format is present
        has_eval = evaluation_start in response and evaluation_end in response
        print(f"Has Evaluation Format: {has_eval}")
        
        if has_eval:
            match = evaluation_regex.search(response)
            if match:
                try:
                    eval_content = match.group(1).strip()
                    json.loads(eval_content)
                    print("Evaluation JSON: Valid")
                except json.JSONDecodeError:
                    print("Evaluation JSON: Invalid")
        print('*' * 50)
    
    PRINTED_TIMES += 1
    return [0] * len(completions)  # Return neutral scores for debugging

##### Get only 500 records

In [45]:
# Convert to HuggingFace dataset format
from datasets import Dataset

# Convert pandas to dataset
hf_dataset = Dataset.from_pandas(dataset[["prompt", "answer"]])

# Calculate token lengths
def calculate_prompt_length(examples):
    lengths = []
    for prompt in examples["prompt"]:
        tokens = tokenizer.apply_chat_template(
            prompt, 
            add_generation_prompt=True, 
            tokenize=True
        )
        lengths.append(len(tokens))
    return {"prompt_length": lengths}

hf_dataset = hf_dataset.map(calculate_prompt_length, batched=True)

# Filter to keep only reasonable length prompts (top 90%)
max_length = int(np.quantile(hf_dataset["prompt_length"], 0.9))
print(f"Max prompt length (90th percentile): {max_length}")

# Filter dataset
filtered_dataset = hf_dataset.filter(lambda x: x["prompt_length"] <= max_length)
print(f"Filtered dataset size: {len(filtered_dataset)}")

# Take a subset for training (adjust as needed)
if len(filtered_dataset) > 500:
    training_dataset = filtered_dataset.shuffle(seed=3407).select(range(500))
else:
    training_dataset = filtered_dataset.shuffle(seed=3407)

print(f"Training dataset size: {len(training_dataset)}")

Map:   0%|          | 0/4759 [00:00<?, ? examples/s]

Max prompt length (90th percentile): 871


Filter:   0%|          | 0/4759 [00:00<?, ? examples/s]

Filtered dataset size: 4285
Training dataset size: 500


In [None]:
# Adhoc code solve the accelerator issue

# from trl import GRPOConfig, GRPOTrainer
# import torch

# # Ensure model is on correct device
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model = model.to(device)

In [47]:
from vllm import SamplingParams
from trl import GRPOConfig, GRPOTrainer

max_seq_length = 4096

import wandb
wandb.init(project="distress-chatbot", name="base-model-training-syntheic_v2_2-grpo", config={
    "model": "Qwen/Qwen3-4B",
    # "max_steps": 20000,
    "learning_rate": 2e-5,
    "lambda_decay": 0.95,
})  # Allow resuming W&B run



# Calculate max lengths
max_prompt_length = max_length + 50  # Add some buffer
max_completion_length = max_seq_length - max_prompt_length

print(f"Max prompt length: {max_prompt_length}")
print(f"Max completion length: {max_completion_length}")

# VLLM sampling parameters
vllm_sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.9,
    top_k=50,
    seed=3407,
    stop=[tokenizer.eos_token],
    include_stop_str_in_output=True,
)

# GRPO training configuration
training_args = GRPOConfig(
    vllm_sampling_params=vllm_sampling_params,
    # temperature=0.8,
    learning_rate=1e-6,  # Lower learning rate for fine-tuning
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    optim="adamw_8bit",
    logging_steps=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Increase for smoother training
    num_generations=4,  # Number of responses to generate per prompt
    max_prompt_length=max_prompt_length,
    max_completion_length=max_completion_length,
    max_steps=200,  # Start with fewer steps for testing
    save_steps=50,
    report_to="wandb",  # Set to "wandb" if you want to use Weights & Biases
    output_dir="trained_model_v2_2_grpo_checkpoint",  # Directory to save the model
    gradient_checkpointing = False,
    # Add these parameters to handle device issues
    # dataloader_pin_memory=False,
    # dataloader_num_workers=0,
)

Max prompt length: 921
Max completion length: 3175


In [48]:
# Initialize GRPO trainer
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        check_evaluation_format,     # Primary reward: correct format
        check_no_extra_text,         # Secondary: no extra text
        check_language_consistency,  # Tertiary: language consistency
        check_no_repetition,         # Quaternary: no repetition
        debug_responses,             # Debug function
    ],
    args=training_args,
    train_dataset=training_dataset,
)

print("GRPO Trainer initialized successfully!")
print(f"Training on {len(training_dataset)} examples")

GRPO Trainer initialized successfully!
Training on 500 examples


In [49]:
# Start training
print("Starting GRPO training...")
print("Watch for the reward column to increase over time.")
print("The model should learn to follow the evaluation format.")
import os
os.environ["TORCH_LOGS"] = "+dynamic"

trainer.train()

Starting GRPO training...
Watch for the reward column to increase over time.
The model should learn to follow the evaluation format.


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 500 | Num Epochs = 2 | Total steps = 200
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 132,120,576/4,000,000,000 (3.30% trained)


Step,Training Loss,reward,reward_std,completions / mean_length,completions / min_length,completions / max_length,completions / clipped_ratio,completions / mean_terminated_length,completions / min_terminated_length,completions / max_terminated_length,kl,rewards / check_evaluation_format / mean,rewards / check_evaluation_format / std,rewards / check_no_extra_text / mean,rewards / check_no_extra_text / std,rewards / check_language_consistency / mean,rewards / check_language_consistency / std,rewards / check_no_repetition / mean,rewards / check_no_repetition / std,rewards / debug_responses / mean,rewards / debug_responses / std
1,0.3054,-1.5625,2.333961,106.375,1.0,869.0,0.0,106.375,1.0,869.0,7.634683,-2.1875,3.25,0.125,0.5,0.125,0.341565,0.375,0.5,0.0,0.0
2,0.2196,-0.9375,2.461876,102.5,1.0,798.0,0.0,102.5,1.0,798.0,5.490171,-2.1875,3.25,0.125,0.5,0.5,0.516398,0.625,0.5,0.0,0.0
3,0.3505,-2.8125,0.269338,5.8125,1.0,50.0,0.0,5.8125,1.0,50.0,8.762948,-3.0,0.0,0.0,0.0,0.0,0.0,0.1875,0.403113,0.0,0.0
4,0.1308,-2.4375,0.375,49.5625,1.0,261.0,0.0,49.5625,1.0,261.0,3.269575,-3.0,0.0,0.0,0.0,0.0625,0.25,0.5,0.516398,0.0,0.0
5,0.4474,-2.75,0.394338,9.625,1.0,63.0,0.0,9.625,1.0,63.0,11.18588,-3.0,0.0,0.0,0.0,0.0625,0.25,0.1875,0.403113,0.0,0.0
6,0.1967,-2.25,0.683013,9.875,1.0,31.0,0.0,9.875,1.0,31.0,4.918156,-3.0,0.0,0.0,0.0,0.3125,0.478714,0.4375,0.512348,0.0,0.0
7,0.4013,-2.75,0.394338,30.25,1.0,231.0,0.0,30.25,1.0,231.0,10.031704,-3.0,0.0,0.0,0.0,0.0,0.0,0.25,0.447214,0.0,0.0
8,0.1865,-1.0625,2.269338,70.4375,1.0,360.0,0.0,70.4375,1.0,360.0,4.663632,-2.1875,3.25,0.125,0.5,0.3125,0.478714,0.6875,0.478714,0.0,0.0
9,0.2347,-2.0625,1.375,46.0,1.0,589.0,0.0,46.0,1.0,589.0,5.867577,-2.5625,1.75,0.125,0.5,0.0,0.0,0.375,0.5,0.0,0.0
10,0.3563,-1.9375,2.125,31.5625,1.0,490.0,0.0,31.5625,1.0,490.0,8.908331,-2.1875,3.25,0.125,0.5,0.0625,0.25,0.0625,0.25,0.0,0.0


**************************************************
Step 34
User Query: Thưa bác sĩ, tôi mong muốn qua các buổi tư vấn này, tôi có thể tìm ra cách để kiểm soát những cảm xú...
Response: ...
Has Evaluation Format: False
**************************************************
**************************************************
Step 37
User Query: 最近我感觉自己的自尊心很低，经常怀疑自己是不是不够好，不管是在工作上还是人际关系中，好像总觉得别人比我更有能力，也不太敢表达自己的想法。这种情绪开始有几个月了，尤其是在公司开会时，我总是紧张得说不出话...
Response: ...
Has Evaluation Format: False
**************************************************
**************************************************
Step 40
User Query: 在这次咨询中，我希望能够找到一些方法来处理我最近经历的失去亲人的悲痛。我母亲几个月前去世了，这对我影响很大。我常常感到非常难过和孤独，有时甚至会突然感到胸口闷痛，头疼，这些症状让我很不安。我意识到这些...
Response: ...
Has Evaluation Format: False
**************************************************
**************************************************
Step 43
User Query: 最近我工作上的人際關係讓我感到好大壓力。我希望透過這次輔導，能學會更有效地處理同事間的衝突，保持心情平靜，同時提升自己的自信和表達能力。工作中，我常常覺得同事對我有偏見，尤其是幾次開會時，我提出的意見...
Response: ...
Has 

KeyboardInterrupt: 

In [50]:
# Saving model and tokenizer
model.save_pretrained("trained_model_v2_2_grpo")  # Local saving
tokenizer.save_pretrained("trained_model_v2_2_grpo")

('trained_model_v2_2_grpo/tokenizer_config.json',
 'trained_model_v2_2_grpo/special_tokens_map.json',
 'trained_model_v2_2_grpo/chat_template.jinja',
 'trained_model_v2_2_grpo/vocab.json',
 'trained_model_v2_2_grpo/merges.txt',
 'trained_model_v2_2_grpo/added_tokens.json',
 'trained_model_v2_2_grpo/tokenizer.json')

In [62]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)


cuda


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

2476.6345 seconds used for training.
41.28 minutes used for training.
Peak reserved memory = 14.508 GB.
Peak reserved memory for training = 2.61 GB.
Peak reserved memory % of max memory = 98.419 %.
Peak reserved memory for training % of max memory = 17.706 %.


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Qwen-3` team, the recommended settings for reasoning inference are `temperature = 0.6, top_p = 0.95, top_k = 20`

For normal chat based inference, `temperature = 0.7, top_p = 0.8, top_k = 20`

In [None]:
messages = [
    {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = False, # Disable thinking
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs!
    temperature = 0.7, top_p = 0.8, top_k = 20, # For non thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

To solve the equation (x + 2)^2 = 0, we can take the square root of both sides.

sqrt((x + 2)^2) = sqrt(0)

This simplifies to:

|x + 2| = 0

Since the absolute value of a number is always non-negative, the only way for |x + 2| to be 0 is if x + 2 = 0.

Therefore, x = -2.

So the solution to the equation (x + 2)^2 = 0 is x = -2.<|im_end|>


In [None]:
messages = [
    {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = True, # Disable thinking
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 1024, # Increase for longer outputs!
    temperature = 0.6, top_p = 0.95, top_k = 20, # For thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

<think>
Okay, so I need to solve the equation (x + 2)^2 = 0. Hmm, let's see. I remember that when you have something squared equals zero, the solution is usually the value that makes the inside zero. Let me think. If I have (something)^2 = 0, then that something must be zero because any real number squared is non-negative, and the only way it can be zero is if the number itself is zero. So applying that here, (x + 2)^2 = 0 implies that x + 2 = 0. Then, solving for x, I just subtract 2 from both sides, right? So x = -2. Wait, is that all? Let me check. If I plug x = -2 back into the original equation, it becomes (-2 + 2)^2 = 0, which is 0^2 = 0, and that's correct. So the solution is x = -2. But wait, sometimes when you square both sides of an equation, you can get extraneous solutions, but in this case, since we started with the square already, maybe there's only one solution. Yeah, because squaring a real number can't give a negative result, so the only solution is when the inside is 

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/vocab.json',
 'lora_model/merges.txt',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )

In [10]:
messages = [
    {"role": "system", "content": """You are a helpful mental health counselling assistant, please answer the mental health questions based on the patient's description.The assistant gives helpful, comprehensive, and appropriate answers to the user's questions. At the end of answer, add tag <evaluate>{"Active Listening" : score, "Empathy & Validation": score, "Safety & Trustworthiness" : score, "Open-mindedness & Non-judgment" : score, "Clarity & Encouragement" : score, "Boundaries & Ethical" : score, "Holistic Approach" : score, "Explaination for Scoring": explain} </evaluate> evaluate your consultant answer in 7 metrics and explain for that evaluation with score from 1 to 10 in json format, where 1 is the worst and 10 is the best and explain is clearly explain why has that score. \n\nConsultation Metrics:\n1. Active Listening: Responses should show careful consideration of the user's concerns, reflecting an understanding and capturing the essence of the issue. Avoid making assumptions or jumping to conclusions.\n2. Empathy & Validation: Responses should convey deep understanding and compassion, validating the user's feelings and emotions without being dismissive or minimizing their experiences.\n3. Safety & Trustworthiness: Prioritize user safety in responses, refraining from potentially harmful or insensitive language. Ensure that information provided is consistent and trustworthy.\n4. Open-mindedness & Non-judgment: Approach concerns without any inherent bias or judgment. Answers should be free from biases related to personal attributes and convey respect, demonstrating unconditional positive regard.\n5. Clarity & Encouragement: Provide clear, concise, and easily understandable answers. Where appropriate, motivate or highlight strengths, offering encouragement while maintaining a neutral stance.\n6. Boundaries & Ethical: It's vital to clarify the role of the response, emphasizing its informational nature. In complex scenarios, guiding users to seek human professional assistance is essential.\n7. Holistic Approach: Responses should be comprehensive, addressing concerns from various angles, be it emotional, cognitive, or situational. Consider the broader context, even if not explicitly detailed in the query."""},
    {"role" : "user", "content" : "Tôi đã gặp khó khăn trong việc tìm kiếm sự cân bằng giữa trách nhiệm công việc và vai trò làm mẹ đơn thân của một cậu con trai 12 tuổi. Tôi nhận thấy mình thường cảm thấy quá tải và lo lắng, và tôi đang cân nhắc việc thiết lập một lịch trình có cấu trúc hơn cho cả hai mẹ con. Tuy nhiên, tôi còn do dự trong việc đặt ra giới hạn nghiêm ngặt về thời gian sử dụng thiết bị điện tử, vì tôi nhận thấy rằng việc đặt giới hạn chặt chẽ đôi khi có thể dẫn đến tranh cãi và sự oán giận. Thay vào đó, tôi đang nghĩ đến các chiến lược linh hoạt hơn, như thưởng thêm thời gian sử dụng thiết bị hoặc các vật phẩm trong trò chơi khi con có hành vi tốt. Nhưng tôi không chắc liệu những phần thưởng này có còn là lựa chọn khả thi hay không, vì chúng có thể củng cố thói quen sử dụng thiết bị điện tử không lành mạnh."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = False, # Disable thinking
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 2048, # Increase for longer outputs!
    temperature = 0.7, top_p = 0.8, top_k = 20, # For non thinking
    repetition_penalty = 1.1,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Tôi hiểu rõ nỗi lo lắng của bạn về việc quản lý thời gian sử dụng thiết bị điện tử cho con trai nhỏ của bạn. Việc cân bằng giữa công việc và cuộc sống gia đình là một thách thức lớn đối với nhiều người, đặc biệt là những người làm mẹ đơn thân. Bạn đang cân nhắc việc thiết lập một lịch trình có cấu trúc hơn để giúp quản lý thời gian hiệu quả hơn. Điều này là một bước tiến tích cực.

Tuy nhiên, việc đặt giới hạn cứng rắn về thời gian sử dụng thiết bị điện tử có thể gây ra tranh cãi và sự oán giận. Do đó, bạn đang cân nhắc các chiến lược linh hoạt hơn, như thưởng thêm thời gian sử dụng thiết bị hoặc các vật phẩm trong trò chơi khi con có hành vi tốt. Đây là một ý tưởng sáng tạo, nhưng bạn cần nhớ rằng những phần thưởng này có thể củng cố thói quen sử dụng thiết bị điện tử không lành mạnh.

Để giải quyết vấn đề này, tôi khuyên bạn nên áp dụng phương pháp giáo dục kỹ năng (Skill-based Education) để dạy con trai cách sử dụng thiết bị điện tử một cách hợp lý và có trách nhiệm. Bạn cũng có thể

### Saving to float16 for VLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False:
    model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False:
    model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
6. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
