<a href="https://colab.research.google.com/github/nithinp1/Mindfull-GPT/blob/main/gpt_oss_20b_fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning GPT-OSS (20B) as a MindGPT — A Mental Well-Being Assistant Model with Unsloth

In this notebook, we'll walk through how a powerful open-source model like `gpt-oss-20b` can be fine-tuned to act as **an empathetic mental well-being assistant**. We'll do this by:

1.  Fine-tuning with [Unsloth](https://github.com/unslothai/unsloth) for 2x faster training and reduced memory usage.
2.  Using a curated dataset of empathetic conversations, anonymized therapy dialogues, and mindfulness exercises.

The end result? A model that can offer a supportive ear, guide users through simple grounding techniques, and provide helpful reframing exercises based on Cognitive Behavioral Therapy (CBT) principles.

Here's an example of how **MindGPT** might interact with a user:

```
User: I'm feeling really overwhelmed with my deadlines at work. I don't know where to even start.

Assistant's Internal Strategy:
1.  Acknowledge and validate the user's feelings of being overwhelmed.
2.  Express empathy for their situation.
3.  Suggest a simple, actionable technique to reduce immediate stress (breathing exercise).
4.  Gently guide the user through the steps.

Assistant's Response:
It sounds like you're under a lot of pressure right now, and feeling overwhelmed when facing tight deadlines is completely understandable. It can be hard to think clearly when that stress builds up.

Before diving into the tasks, would you be open to trying a quick 30-second breathing exercise to help calm your nervous system? It might make it easier to figure out a starting point.
```

This project aims to leverage AI to make initial mental well-being support more accessible and to provide users with practical tools for managing daily stress.

### Installation

In [None]:
%%capture
# We're installing the latest Torch, Triton, OpenAI's Triton kernels, Transformers and Unsloth!
!pip install --upgrade -qqq uv
try: import numpy; get_numpy = f"numpy=={numpy.__version__}"
except: get_numpy = "numpy"
!uv pip install -qqq \
    "torch>=2.8.0" "triton>=3.4.0" {get_numpy} torchvision bitsandbytes \
    "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
    "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
    git+https://github.com/huggingface/transformers \
    git+https://github.com/triton-lang/triton.git@05b2c186c1b6c9a08375389d5efe9cb4c401c075#subdirectory=python/triton_kernels

### Loading the Model with Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 1024
dtype = None

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/gpt-oss-20b-unsloth-bnb-4bit", # 20B model using bitsandbytes 4bit quantization
    "unsloth/gpt-oss-120b-unsloth-bnb-4bit",
    "unsloth/gpt-oss-20b", # 20B model using MXFP4 format
    "unsloth/gpt-oss-120b",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    dtype = dtype, # None for auto detection
    max_seq_length = max_seq_length, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

We now add LoRA adapters for parameter-efficient fine-tuning, so only about 1% of all parameters need to be trained.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
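
To give a rough sense of why LoRA trains so few parameters, the sketch below compares the weight count of one dense projection with the two low-rank factors LoRA adds to it. The layer dimensions (4096 × 4096) are illustrative assumptions, not the actual shapes inside `gpt-oss-20b`, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (dense weight count, LoRA adapter weight count) for one linear layer."""
    full = d_in * d_out        # frozen base weight W
    lora = r * (d_in + d_out)  # trainable factors A (r x d_in) and B (d_out x r)
    return full, lora

# Assumed, illustrative layer shape; r = 16 matches the rank used above.
full, lora = lora_param_counts(d_in=4096, d_out=4096, r=16)
print(f"Dense weights:   {full:,}")
print(f"LoRA weights:    {lora:,}")
print(f"Trainable share: {lora / full:.2%}")
```

For this assumed shape the adapter is well under 1% of the dense layer, which is in line with the "about 1%" figure above.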

### Understanding Reasoning Effort Levels

The `gpt-oss` models include a unique feature that allows you to adjust the model's **"reasoning effort"**—controlling the trade-off between performance and response speed (latency) by adjusting how many tokens the model uses to think.

----

Three distinct levels:
* **Low**: Optimized for fast responses, minimal multi-step reasoning
* **Medium**: Balanced performance and speed
* **High**: Strongest reasoning performance for complex tasks (higher latency)

In [None]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "I buy a book for $15 and a coffee for $4.50. The sales tax is 8%. What is the total cost?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "low",
).to(model.device)

_ = model.generate(**inputs, max_new_tokens = 512, streamer = TextStreamer(tokenizer))

Changing `reasoning_effort` to `high` makes the model think longer. We have to increase `max_new_tokens` to accommodate the longer reasoning trace, but the answer is generally better and more likely to be correct.

In [None]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "I buy a book for $15 and a coffee for $4.50. The sales tax is 8%. What is the total cost?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "high",
).to(model.device)

_ = model.generate(**inputs, max_new_tokens = 2048, streamer = TextStreamer(tokenizer))
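
When comparing the outputs at different effort levels, it helps to know the expected answer. The check below is a small sketch that assumes the 8% tax applies to both items; it uses `Decimal` to avoid floating-point rounding surprises.

```python
from decimal import Decimal, ROUND_HALF_UP

# Prices from the prompt above.
subtotal = Decimal("15.00") + Decimal("4.50")

# Assumption: sales tax applies to the whole purchase.
total = (subtotal * Decimal("1.08")).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(f"Subtotal: ${subtotal}, total with tax: ${total}")
```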

<a name="Data"></a>
### Data Prep: The Mental Health Counseling Dataset

We'll use the [Amod/mental_health_counseling_conversations](https://huggingface.co/datasets/Amod/mental_health_counseling_conversations) dataset.
- Each example pairs a user's message (`Context`) with a reply (`Response`)
- The dataset has no separate reasoning traces, so each reply is mapped to the `final` channel
- By training on this, the model learns to answer in a supportive, empathetic tone

### Understanding the Harmony Format

The GPT-OSS models use OpenAI's Harmony format for conversations. Here's what each field means:

| Field | Purpose |
|-------|---------|
| `developer` | Custom instructions for the model (similar to system role) |
| `user` | User's input question |
| `assistant` | Model's output with two special channels |
| `analysis` | The model's chain-of-thought reasoning |
| `final` | The final response shown to the user |

The key innovation: The assistant response contains **two channels**:
- **`analysis` channel**: Where the model thinks step-by-step (can be in any language)
- **`final` channel**: The polished response to the user

This separation allows the model to reason in one language while responding in another!
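
To make the channel layout concrete, here is a minimal sketch that renders a conversation into Harmony-style text. `render_harmony` is a hypothetical helper written for illustration only; it mimics the marker layout shown later in this notebook and is not the tokenizer's real chat template, which handles more cases (system preamble, `<|return|>`, and so on).

```python
def render_harmony(messages: list[dict]) -> str:
    """Render messages as <|start|>role[<|channel|>name]<|message|>text<|end|> blocks."""
    parts = []
    for msg in messages:
        header = f"<|start|>{msg['role']}"
        if "channel" in msg:  # only assistant messages carry a channel
            header += f"<|channel|>{msg['channel']}"
        parts.append(f"{header}<|message|>{msg['content']}<|end|>")
    return "".join(parts)

example = [
    {"role": "user", "content": "I can't sleep before exams."},
    {"role": "assistant", "channel": "analysis",
     "content": "Validate the feeling first, then suggest one technique."},
    {"role": "assistant", "channel": "final",
     "content": "That sounds stressful. Would a short breathing exercise help?"},
]
rendered = render_harmony(example)
print(rendered)
```

In practice, `tokenizer.apply_chat_template` should be used to produce the training text; the sketch only shows where each channel ends up.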

#### This notebook reads `HF_TOKEN` from the Secrets tab. Make sure a value is set for `HF_TOKEN` before running the cell.

In [None]:
import os
from datasets import load_dataset

# Read the Hugging Face token from the environment (None if it is not set).
my_token = os.getenv("HF_TOKEN")

# The formatting function is defined in the next cell, once the column names are known.
dataset = load_dataset("Amod/mental_health_counseling_conversations", split = "train", token = my_token)
dataset

To format our dataset, we apply the GPT-OSS chat template to each `Context`/`Response` pair:

In [None]:
from unsloth.chat_templates import standardize_sharegpt

# Modify formatting_prompts_func to use the correct column names
def formatting_prompts_func(examples):
    # Assuming 'Context' is the user's message and 'Response' is the assistant's message
    # We need to format this into the Harmony format expected by the model
    texts = []
    for i in range(len(examples['Context'])):
        # Construct the conversation in the Harmony format
        # This is a simplified example, you might need to adjust based on the dataset's actual structure
        convo = [
            {"role": "user", "content": examples['Context'][i]},
            {"role": "assistant", "channel": "final", "content": examples['Response'][i]},
            # If there's an analysis part in the dataset, you would add it here
            # {"role": "assistant", "channel": "analysis", "message": "..."}
        ]
        texts.append(tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False))
    return { "text" : texts, }

# The standardize_sharegpt function might not be necessary or might need adjustment
# depending on how well it handles the new dataset structure.
# Let's try applying our custom formatting first.
# dataset = standardize_sharegpt(dataset)

dataset = dataset.map(formatting_prompts_func, batched = True,)

### 🔍 Detecting and Inspecting Outliers

When working with text datasets, some samples can be **much longer** than the majority.  
These are called **outliers** and can negatively impact training (e.g., wasting compute on rare long sequences).

We use the **Interquartile Range (IQR) method**:

1. **Calculate quartiles (Q1, Q3)**  
   - Q1 = 25th percentile  
   - Q3 = 75th percentile  
   - IQR = Q3 − Q1  

2. **Define outlier threshold**  
   - Any sample with length > Q3 + 1.5 × IQR is considered an outlier.  

3. **Split dataset**  
   - `cleaned_dataset`: all samples within the threshold  
   - `outlier_dataset`: all samples above the threshold (potential outliers)  

This way, we don’t just discard the long samples silently — we **separate and inspect** them.  
You can print or visualize `outlier_dataset` to understand whether these are valid data points (just long)  
or noisy entries that should be removed.
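
The three steps above can be sketched on a handful of made-up lengths. The `percentile` helper is a hypothetical stand-in that uses linear interpolation between neighbouring values; the sample lengths are invented for illustration.

```python
def percentile(values: list[float], q: float) -> float:
    """q-th percentile (0-100) with linear interpolation between neighbours."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

# Made-up token lengths; one sample is far longer than the rest.
sample_lengths = [100, 120, 130, 150, 160, 180, 200, 900]

q1 = percentile(sample_lengths, 25)
q3 = percentile(sample_lengths, 75)
iqr = q3 - q1
upper_bound = q3 + 1.5 * iqr

outliers = [n for n in sample_lengths if n > upper_bound]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, upper bound={upper_bound}")
print(f"Outliers: {outliers}")
```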


In [None]:
def check_token_lengths(dataset, tokenizer, text_key="text", show_progress=True):
    """
    Check the token lengths of all examples in a dataset.

    Args:
        dataset: HuggingFace Dataset or list of dicts with text entries.
        tokenizer: HuggingFace tokenizer instance.
        text_key: Key in dataset examples containing text (default "text").
        show_progress: Whether to print lengths as we go.

    Returns:
        List of token lengths for all examples.
    """
    token_lengths = []
    for i, example in enumerate(dataset):
        tokens = tokenizer(example[text_key], truncation=False)["input_ids"]
        length = len(tokens)
        token_lengths.append(length)
        if show_progress:
            print(f"Example {i}: {length} tokens")
    return token_lengths


In [None]:
import numpy as np
import matplotlib.pyplot as plt

def analyze_token_lengths(lengths, max_context=1024):
    lengths = np.array(lengths)

    print(f"📊 Dataset size: {len(lengths)} examples")
    print(f"Avg length: {lengths.mean():.1f}")
    print(f"Median length: {np.median(lengths):.1f}")
    print(f"Max length: {lengths.max()}")

    # Coverage at common cutoffs
    for cutoff in [512, 1024, 2048, 4096]:
        coverage = (lengths <= cutoff).mean() * 100
        print(f"≤ {cutoff} tokens: {coverage:.1f}% of examples")

    # Plot histogram
    plt.hist(lengths, bins=50, color="steelblue", edgecolor="black")
    plt.axvline(max_context, color="red", linestyle="--", label=f"Model limit {max_context}")
    plt.xlabel("Token length")
    plt.ylabel("Count")
    plt.title("Token length distribution")
    plt.legend()
    plt.show()


In [None]:
lengths = check_token_lengths(dataset, tokenizer, show_progress=False)
analyze_token_lengths(lengths, max_context=1024)


In [None]:
import numpy as np

# Step 1: Calculate quartiles
q1 = np.percentile(lengths, 25)
q3 = np.percentile(lengths, 75)
iqr = q3 - q1

# Step 2: Define outlier threshold
upper_bound = q3 + 1.5 * iqr
print(f"IQR upper bound: {upper_bound:.2f} tokens")

# Step 3: Separate clean vs outlier samples
outlier_indices = [i for i, l in enumerate(lengths) if l > upper_bound]
clean_indices = [i for i, l in enumerate(lengths) if l <= upper_bound]

# Build datasets
cleaned_dataset = dataset.select(clean_indices)
outlier_dataset = dataset.select(outlier_indices)

print(f"Cleaned dataset size: {len(cleaned_dataset)} / {len(dataset)}")
print(f"Outlier dataset size: {len(outlier_dataset)}")

# Peek at outliers
for i in range(len(outlier_dataset)):
    print(f"\n--- Outlier {i+1} ---")
    print(outlier_dataset[i])


In [None]:
cleaned_dataset

### ✂️ Why We Need Chunking (with Special Tokens)

Our dataset isn’t just raw text — it includes **structured conversation tokens** such as:

```

<|start|>system<|message|> ... <|end|>
<|start|>developer<|message|> ... <|end|>
<|start|>user<|message|> ... <|end|>
<|start|>assistant<|channel|>analysis<|message|> ... <|end|>
<|start|>assistant<|channel|>final<|message|> ... <|end|>

```

These tokens **mark role boundaries** (system, developer, user, assistant) and **separate messages**.  
They are **crucial for correct learning** — if chunking cuts in the middle of them, the model may:

- Miss a `<|start|>` or `<|end|>` marker, breaking the structure  
- Lose track of **who is speaking** (system vs. user vs. assistant)  
- Misinterpret instructions (e.g., reasoning language, special channels)  
- Encounter incomplete sequences, leading to **NaN loss during training**  

---

⚠️ **The challenge:**  
- We need to cap max sequence length at **1024 tokens** (because Colab’s T4 GPU runs out of memory at 2048).  
- But if we naïvely truncate at 1024, we risk losing special tokens and breaking samples.  

✅ **The solution:**  
- Use **chunking** into 1024-token windows.  
- Ensure each chunk respects token boundaries and preserves `<|start|> ... <|end|>` structures.  
- This way, even long samples (up to 32k tokens) are split into multiple valid training examples without losing critical role and instruction markers.  

This keeps training **stable**, prevents NaNs, and ensures the model **actually learns the structured instruction format**.
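
A toy illustration of the problem: the sketch below truncates a two-message sample in the middle of the second message and checks whether every `<|start|>` still has a closing marker. It truncates by characters as a stand-in for token-level truncation, and `is_balanced` is a hypothetical helper.

```python
def is_balanced(text: str) -> bool:
    """True if every <|start|> has a matching <|end|> or <|return|>."""
    closers = text.count("<|end|>") + text.count("<|return|>")
    return text.count("<|start|>") == closers

sample = (
    "<|start|>user<|message|>I feel overwhelmed.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>That sounds hard.<|end|>"
)

# Naive cut that lands inside the assistant message.
truncated = sample[:80]

print("Full sample balanced:     ", is_balanced(sample))
print("Truncated sample balanced:", is_balanced(truncated))
```

The truncated sample opens an assistant message that never closes, which is the kind of malformed example boundary-aware chunking is meant to avoid.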


In [None]:
cleaned_lengths = [len(tokenizer.encode(x["text"])) for x in cleaned_dataset]

# Find indices of samples between 2048 and 4096 tokens
mid_long_indices = [i for i, l in enumerate(cleaned_lengths) if 2048 <= l <= 4096]

print(f"Found {len(mid_long_indices)} samples between 2048 and 4096 tokens")

# Peek at a few examples and show truncation
for idx in mid_long_indices[:3]:  # pick 3 examples
    text = cleaned_dataset[idx]["text"]
    full_tokens = tokenizer.encode(text)

    # Truncate
    truncated_tokens = tokenizer.encode(text, max_length=1024, truncation=True)

    print(f"\n--- Example {idx} ---")
    print(f"Original length: {len(full_tokens)} tokens")
    print(f"Truncated length: {len(truncated_tokens)} tokens (max=1024)")

    # Show kept vs discarded parts
    print("\n✅ Truncated text (last 200 chars):")
    print(tokenizer.decode(truncated_tokens[-200:]))

    discarded_tokens = full_tokens[1024:]
    print("\n❌ Discarded text")
    print(tokenizer.decode(discarded_tokens))

In [None]:
import re
from typing import List, Dict, Any

END_OR_RETURN = r"(?:<\|end\|>|<\|return\|>)"
MSG_BLOCK_RE = re.compile(rf"<\|start\|>.*?{END_OR_RETURN}", re.DOTALL)

# header stays fine (captures up to <|message|>)
MSG_HEADER_RE = re.compile(r"^(.*?<\|message\|>)", re.DOTALL)


def find_message_blocks(text: str) -> List[str]:
    return [m.group(0) for m in MSG_BLOCK_RE.finditer(text)]

def split_long_message_block(
    block_text: str,
    tokenizer,
    max_length: int = 2048,
    overlap: int = 128,
) -> List[str]:
    """
    Split one <|start|>...<|end|> block into multiple text chunks.
    Preserves header (<|start|>role<|message|>) and closing <|end|> for each chunk.
    """
    m = MSG_HEADER_RE.search(block_text)
    if not m:
        header_text, body_text, end_text = "", block_text, ""
    else:
        header_text = m.group(1)
        rest = block_text[len(header_text):]
        end_iter = list(re.finditer(END_OR_RETURN, rest))
        if end_iter:
            last = end_iter[-1]
            body_text = rest[:last.start()]
            end_text = last.group(0)           # either <|end|> or <|return|>
        else:
            body_text, end_text = rest, ""     # fallback

    body_ids = tokenizer(body_text, add_special_tokens=False)["input_ids"]
    # available payload capacity (approximate by tokens)
    capacity = max_length - len(tokenizer(header_text)["input_ids"]) - len(tokenizer(end_text)["input_ids"]) - 2

    chunks = []
    start = 0
    step = max(1, capacity - overlap)
    while start < len(body_ids):
        piece_ids = body_ids[start:start + capacity]
        piece_text = tokenizer.decode(piece_ids, skip_special_tokens=False)
        chunk_text = f"{header_text}{piece_text}{end_text}"
        chunks.append(chunk_text)
        if start + capacity >= len(body_ids):
            break
        start += step
    return chunks

def chunk_dialogue_to_text(
    text: str,
    tokenizer,
    max_length: int = 2048,
    overlap: int = 128,
) -> List[str]:
    """
    Pack dialogue text (<|start|>...<|end|> blocks) into chunks of text only.
    """
    blocks = find_message_blocks(text)
    if not blocks:
        return [text]

    chunks, cur_text, cur_len = [], "", 0
    for block in blocks:
        block_len = len(tokenizer(block, add_special_tokens=False)["input_ids"])
        if cur_len + block_len <= max_length:
            cur_text += block
            cur_len += block_len
        else:
            if cur_text:
                chunks.append(cur_text)
            if block_len > max_length:
                chunks.extend(split_long_message_block(block, tokenizer, max_length, overlap))
                cur_text, cur_len = "", 0
            else:
                cur_text, cur_len = block, block_len
    if cur_text:
        chunks.append(cur_text)
    return chunks

def chunk_dataset_to_text(
    dataset,
    tokenizer,
    text_key: str = "text",
    max_length: int = 2048,
    overlap: int = 128,
    keep_fields: list[str] = None,
):
    """
    Convert a dataset into boundary-aware text chunks.
    Returns a HuggingFace Dataset with each row = {'text': chunk_text, ...}
    """
    from datasets import Dataset
    keep_fields = keep_fields or []

    rows = []
    for src_idx, example in enumerate(dataset):
        text = example.get(text_key)
        if not text:
            continue
        chunks = chunk_dialogue_to_text(text, tokenizer, max_length, overlap)
        for chunk_id, ch_text in enumerate(chunks):
            row = {"text": ch_text,
                   "source_index": src_idx,
                   "chunk_id": chunk_id,
                   "num_chunks": len(chunks)}
            for k in keep_fields:
                row[k] = example.get(k)
            rows.append(row)

    return Dataset.from_list(rows)
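
The overlap logic inside `split_long_message_block` is easier to follow on plain integers. The sketch below reproduces just the windowing loop with a hypothetical `sliding_windows` helper; integers stand in for token IDs.

```python
def sliding_windows(ids: list[int], capacity: int, overlap: int) -> list[list[int]]:
    """Split ids into windows of at most `capacity`, each sharing `overlap` items with the previous one."""
    step = max(1, capacity - overlap)  # always advance by at least one item
    windows, start = [], 0
    while start < len(ids):
        windows.append(ids[start:start + capacity])
        if start + capacity >= len(ids):  # last window reached the end
            break
        start += step
    return windows

windows = sliding_windows(list(range(10)), capacity=4, overlap=1)
print(windows)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each window repeats the last item of the previous one, so a sentence cut at a boundary still appears with some context in the next chunk.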


In [None]:
CHUNKED = chunk_dataset_to_text(
    cleaned_dataset,
    tokenizer,
    text_key="text",
    max_length=1024,
    overlap=128,
    keep_fields=["Context", "Response"],  # columns that exist in this dataset
)

print(CHUNKED)
# => Dataset({ features: ['text','source_index','chunk_id','num_chunks',...], num_rows: ... })

# sanity check
print(CHUNKED[0]["text"])


In [None]:
from huggingface_hub import login
login()

In [None]:
DATASET_REPO = "Scropo/MindFull-AI"
# create_pr=True opens a pull request on the dataset repo instead of committing directly.
CHUNKED.push_to_hub(DATASET_REPO, token = my_token, create_pr = True)

<a name="Train"></a>
### Train the model
Now let's fine-tune our model, teaching it to respond with empathy and offer supportive guidance. The configuration below runs one full epoch (`num_train_epochs = 1`); for a quick demo, set `max_steps = 30` instead.

In [None]:
from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 4,   # effective batch size = 4
    num_train_epochs = 1,          # Set this for 1 full training run.
    packing = False,
    learning_rate = 5e-5,              # safer for longer runs than 2e-4
    warmup_steps = 15,                 # ~6% of total steps, smoother ramp
    lr_scheduler_type = "cosine",      # better for longer schedules
    max_grad_norm = 0.5,               # strong clipping, avoids NaNs
    weight_decay = 0.01,
    optim = "adamw_torch",             # stick to stable optimizer first
    logging_steps = 1,
    save_steps = 50,                   # optional: checkpointing every 50 steps
    fp16 = False,                      # keep off until stable
    bf16 = False,                      # enable later if GPU supports
    seed = 3407,
    output_dir = "outputs",
    report_to = "none",
    max_seq_length = 1024,
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = CHUNKED,
    args = args,
)
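
To reason about the schedule, it helps to translate the batch settings into optimizer steps. The sketch below assumes a single GPU and a hypothetical dataset of 1,000 chunks; substitute `len(CHUNKED)` for the real figure.

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 4
warmup_steps = 15

# Hypothetical dataset size, for illustration only.
num_examples = 1000

# Assumption: one GPU, so no extra data-parallel factor.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)
warmup_fraction = warmup_steps / steps_per_epoch

print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps per epoch: {steps_per_epoch}")
print(f"Warmup share of the run: {warmup_fraction:.1%}")
```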

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
trainer_stats = trainer.train()

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** Currently, fine-tuned models can only be loaded via Unsloth. We're working on vLLM and GGUF export!

In [None]:
model.save_pretrained("finetuned_model")
model.push_to_hub("Scropo/MindFull-AI", token = my_token ) # Save to HF

<a name="Inference"></a>
### Inference: Testing Empathetic Responses
Let's see how our fine-tuned model responds to a user expressing feelings of stress or anxiety.

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant that can provide mental health counseling."},
    {"role": "user", "content": "I've been feeling really down lately and can't seem to find joy in anything. What should I do?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "high", # Changed to high for more detailed reasoning
).to(model.device)
from transformers import TextStreamer
_ = model.generate(**inputs, max_new_tokens = 1048, streamer = TextStreamer(tokenizer))

In [None]:
# Test with a prompt about exam anxiety
messages = [
    {"role": "system", "content": "You are a helpful assistant that can provide mental health counseling."},
    {"role": "user", "content": "I feel anxious before exams, what can I do?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium",
).to(model.device)
_ = model.generate(**inputs, max_new_tokens = 1024, streamer = TextStreamer(tokenizer))

In [None]:

messages = [
    {"role": "system", "content": "You are a helpful assistant that can provide mental health counseling."},
    {"role": "user", "content": "How can I improve my sleep routine?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium",
).to(model.device)
_ = model.generate(**inputs, max_new_tokens = 1024, streamer = TextStreamer(tokenizer))

### Testing with Prompts Not in the Training Data

Interestingly, the model can even attempt reasoning on prompts it wasn't explicitly trained on:

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant that can provide mental health counseling."},
    {"role": "user", "content": "My mom has Alzheimer's, and I've been her primary caregiver for the past few years. I don't know what to do, Can you help?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium",
).to(model.device)
_ = model.generate(**inputs, max_new_tokens = 1024, streamer = TextStreamer(tokenizer))

### To run the fine-tuned model in a new session, change `if False` to `if True` in the cell below.

In [None]:
if False:
  from unsloth import FastLanguageModel
  from peft import PeftModel

  max_seq_length = 1024
  dtype=None
  # 1) load the same base
  base, tokenizer = FastLanguageModel.from_pretrained(
      model_name = "unsloth/gpt-oss-20b",
      dtype = dtype, # None for auto detection
      max_seq_length = max_seq_length, # Choose any for long context!
      load_in_4bit = True,  # 4 bit quantization to reduce memory
      full_finetuning = False, # [NEW!] We have full finetuning now!
      # token = "hf_...", # use one if using gated models
  )

  # 2) attach the adapter
  model = PeftModel.from_pretrained(base, "finetuned_model")

  # 3) enable Unsloth’s fast inference path
  FastLanguageModel.for_inference(model)

### Or load the base model and attach the adapters from your Hub repo

In [None]:

if False:
    from unsloth import FastLanguageModel
    from peft import PeftModel
    from transformers import AutoTokenizer

    BASE_ID    = "unsloth/gpt-oss-20b"
    ADAPTER_ID = "Scropo/MindFull-AI"  # your Hub repo

    max_seq_length = 1024

    # 1) Load the same base model you trained against (4-bit is T4-friendly)
    base, tokenizer = FastLanguageModel.from_pretrained(
        model_name      = BASE_ID,
        dtype           = None,          # auto
        max_seq_length  = max_seq_length,
        load_in_4bit    = True,          # QLoRA-style inference
        # token        = "hf_..."        # if base is gated or your account is private
    )

    # (Optional) If you pushed the tokenizer to the adapter repo, prefer it:
    try:
        tokenizer = AutoTokenizer.from_pretrained(ADAPTER_ID, use_fast=True)
    except Exception:
        pass

    # 2) Attach LoRA adapters directly from the Hub
    model = PeftModel.from_pretrained(base, ADAPTER_ID)

    # 3) Enable Unsloth’s fast inference path
    FastLanguageModel.for_inference(model)

    # Quick sanity test
    prompt = "<|start|>user<|message|>¿Cuál es el capital de Australia?<|end|>\n<|start|>assistant<|channel|>analysis<|message|>"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.9)
    print(tokenizer.decode(outputs[0], skip_special_tokens=False))


## 🏁 Conclusion

You’ve successfully fine-tuned a base language model to act as an **empathetic mental well-being assistant**. Key takeaways:

1.  **Default Bias to Facts:** Large language models are typically tuned for factual recall and instruction-following, often lacking the softness and empathy required for supportive conversations.
2.  **Empathetic Fine-Tuning Matters:** Training on a curated dataset of therapeutic conversations and empathetic dialogue is crucial to steer the model towards safe, helpful, and emotionally intelligent responses.
3.  **Structured Formatting is Key:** The Harmony format separates an `analysis` channel from the `final` response, which lets the model "think" about validation and empathy before generating its user-facing message.
4.  **Resource-aware Training:** On a consumer GPU like a Colab **T4 (16 GB)**, techniques such as **LoRA/QLoRA**, a moderate `max_seq_length=1024`, and small batch sizes are essential to train effectively without running out of memory.
5.  **Data Quality is Paramount:** For a sensitive application like this, carefully curating, cleaning, and anonymizing the dataset is the most critical step to ensure model safety and prevent harmful outputs.

**What’s next (brief):** Implement proper **train/validation/test splits** to prevent data leakage. Beyond technical loss metrics, create a qualitative test set of challenging prompts to manually review model responses for safety and tone. Always evaluate thoroughly before considering any real-world application.
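
One way to avoid leakage with chunked data is to split by source sample rather than by chunk, so that chunks of the same conversation never land in different splits. The sketch below is a minimal illustration on made-up rows; `split_by_source` is a hypothetical helper, and the 80/10/10 ratio is an assumption.

```python
import random

def split_by_source(rows, seed=3407, val_frac=0.1, test_frac=0.1):
    """Assign whole source samples (by 'source_index') to train/validation/test."""
    sources = sorted({row["source_index"] for row in rows})
    random.Random(seed).shuffle(sources)  # deterministic shuffle
    n_val = max(1, int(len(sources) * val_frac))
    n_test = max(1, int(len(sources) * test_frac))
    val_ids = set(sources[:n_val])
    test_ids = set(sources[n_val:n_val + n_test])

    splits = {"train": [], "validation": [], "test": []}
    for row in rows:
        if row["source_index"] in val_ids:
            splits["validation"].append(row)
        elif row["source_index"] in test_ids:
            splits["test"].append(row)
        else:
            splits["train"].append(row)
    return splits

# Made-up rows: 20 source samples, 2 chunks each.
rows = [{"text": f"chunk {c} of sample {s}", "source_index": s}
        for s in range(20) for c in range(2)]
splits = split_by_source(rows)
print({name: len(part) for name, part in splits.items()})
```

With a real `CHUNKED` dataset, the same idea applies to its `source_index` column.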

-----

### 📚 References

  - Unsloth Llama 3 fine-tuning tutorial: [https://github.com/unslothai/unsloth/blob/main/notebooks/Llama_3_8b_Instruct_Tuning.ipynb](https://github.com/unslothai/unsloth/blob/main/notebooks/Llama_3_8b_Instruct_Tuning.ipynb)
  - Hugging Face Fine-tuning Guide: [https://huggingface.co/docs/transformers/training](https://huggingface.co/docs/transformers/training)

In [None]:
from huggingface_hub import login
login(new_session=True) # Use new_session=True to ensure a fresh login
