
# Luma: Domain-Specific Mental Health Chatbot (FLAN-T5, PyTorch, Hugging Face)

**Goal:** Fine-tune a generative Transformer (`google/flan-t5-base`) on a mental-health Q&A dataset to build a domain-specific chatbot that provides supportive, safe responses and rejects out-of-domain or unsafe requests.

**Why this matters:** Mental health support requires careful, context-aware language. A domain-tuned generative model improves relevance and tone while adhering to safety boundaries.

**Repo structure (suggested):**
```
.
├── Luma_chatbot_refactored.ipynb   # this notebook
├── app.py                          # Gradio UI for interactive demo
├── README.md                       # How to run + results
└── data/
    └── mental_health_training_expanded.csv
```



## 1. Dataset Collection & Preprocessing

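We load the provided Q&A dataset and clean it: URLs are stripped, whitespace is collapsed, questions are lowercased (answers keep their original case), and empty or duplicate question-answer pairs are dropped. Tokenization uses the tokenizer that ships with the chosen T5 checkpoint (see Section 2).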


In [1]:

import pandas as pd, numpy as np, re
from sklearn.model_selection import train_test_split

DATA_PATH = "data/mental_health_training_expanded.csv"
df = pd.read_csv(DATA_PATH)

print("Columns:", list(df.columns))
df.head()


Columns: ['question', 'answer', 'pattern', 'tag']


| | question | answer | pattern | tag |
|---|---|---|---|---|
| 0 | what if i feel lonely | A lot of people are alone right now, but we do... | what if i feel lonely | fact-30 |
| 1 | i'm so angry | Would writing a draft message (that you don’t ... | i'm so angry | anger |
| 2 | everyone is better than me | Let’s gently check the evidence for and agains... | everyone is better than me | worthless |
| 3 | i keep crying for no reason | Would talking through today help a little? I'm... | i keep crying for no reason | sad |
| 4 | nothing much | Oh I see. Do you want to talk about something? | nothing much | neutral-response |


In [2]:
# --- Robust column selection + cleaning ---

import re
import pandas as pd

# 1) Pick columns robustly (uses common aliases)
INPUT_ALIASES  = ["question", "pattern", "text", "prompt", "input"]
TARGET_ALIASES = ["answer", "response", "target", "label", "tag", "output"]

def pick_col(df, aliases, fallback):
    cols_lower = {c.lower(): c for c in df.columns}
    for a in aliases:
        if a in cols_lower:
            return cols_lower[a]
    # try contains-based match (e.g., "user_question")
    for a in aliases:
        for c in df.columns:
            if a in c.lower():
                return c
    # fallback (will raise if missing)
    if fallback in df.columns:
        return fallback
    raise KeyError(
        f"None of {aliases} found in columns {list(df.columns)} and fallback '{fallback}' not present."
    )

input_col  = pick_col(df, INPUT_ALIASES,  "question")
target_col = pick_col(df, TARGET_ALIASES, "answer")

print(f"Using input column:  {input_col}")
print(f"Using target column: {target_col}")

# 2) Cleaning helpers
URL_RE   = re.compile(r"http\S+|www\.\S+", flags=re.IGNORECASE)
SPACE_RE = re.compile(r"\s+")

def clean_text(x, lower=False):
    if pd.isna(x):
        return ""
    s = str(x)
    s = URL_RE.sub("", s)
    s = SPACE_RE.sub(" ", s).strip()
    return s.lower() if lower else s

# 3) Subset, clean, and sanitize
df = df[[input_col, target_col]].copy()
df[input_col]  = df[input_col].map(lambda t: clean_text(t, lower=True))   # lower input only
df[target_col] = df[target_col].map(clean_text)                            # keep target case

# 4) Drop empties and duplicates
before = len(df)
df = df[(df[input_col] != "") & (df[target_col] != "")]
df = df.drop_duplicates(subset=[input_col, target_col]).reset_index(drop=True)

print(f"Rows kept: {len(df)}/{before} (removed {before - len(df)})")
print("Sample:")
display(df.head(5))  # comment out if not in notebook


Using input column:  question
Using target column: answer
Rows kept: 941/941 (removed 0)
Sample:


| | question | answer |
|---|---|---|
| 0 | what if i feel lonely | A lot of people are alone right now, but we do... |
| 1 | i'm so angry | Would writing a draft message (that you don’t ... |
| 2 | everyone is better than me | Let’s gently check the evidence for and agains... |
| 3 | i keep crying for no reason | Would talking through today help a little? I'm... |
| 4 | nothing much | Oh I see. Do you want to talk about something? |
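
To make the cleaning rules concrete, here is a minimal standalone sketch of the same helper applied to a made-up string. The sample text is hypothetical (not from the dataset), and the `pd.isna` check is swapped for a plain `None` check so the snippet needs only the standard library.

```python
import re

URL_RE = re.compile(r"http\S+|www\.\S+", flags=re.IGNORECASE)
SPACE_RE = re.compile(r"\s+")

def clean_text(x, lower=False):
    """Strip URLs, collapse runs of whitespace, optionally lowercase."""
    if x is None:  # stand-in for pd.isna(x) in the notebook
        return ""
    s = URL_RE.sub("", str(x))
    s = SPACE_RE.sub(" ", s).strip()
    return s.lower() if lower else s

# Hypothetical raw string with mixed case, a tab, a URL, and stray spaces
raw = "  I Feel   STRESSED\tsee https://example.com/tips  now "

cleaned_input = clean_text(raw, lower=True)   # how questions are cleaned
cleaned_target = clean_text(raw)              # how answers are cleaned

print(repr(cleaned_input))   # 'i feel stressed see now'
print(repr(cleaned_target))  # 'I Feel STRESSED see now'
```

Lowercasing only the input side keeps the generated answers in natural sentence case while making the model less sensitive to how users capitalize their messages.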



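## 2. Model & Tokenization (FLAN-T5-base, PyTorch)

We use `google/flan-t5-base` through the Hugging Face `transformers` library with PyTorch. Each input is prefixed with `mental health support: ` to cue the task, inputs are truncated to 256 tokens and targets to 128, and the data is split 90/10 into Hugging Face `Dataset` objects for training and validation.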


In [3]:
# --- Tokenization + dataset construction (PyTorch-compatible) ---
# pip install -U datasets
import numpy as np
from datasets import Dataset
from transformers import AutoTokenizer, DataCollatorForSeq2Seq
from sklearn.model_selection import train_test_split

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
max_input_len = 256
max_target_len = 128
PREFIX = "mental health support: "

# 1) Build input/target lists
inputs  = (PREFIX + df[input_col]).tolist()
targets = df[target_col].tolist()

# 2) Train/val split
X_train, X_val, y_train, y_val = train_test_split(
    inputs, targets, test_size=0.1, random_state=42, stratify=None
)

# 3) HF Datasets from lists
train_raw = Dataset.from_dict({"src": X_train, "tgt": y_train})
val_raw   = Dataset.from_dict({"src": X_val,   "tgt": y_val})

# 4) Tokenize -> input_ids, attention_mask, labels (the collator pads labels with -100)
def preprocess(batch):
    enc = tokenizer(batch["src"], max_length=max_input_len, truncation=True)
    # `text_target=` replaces the deprecated `as_target_tokenizer()` context manager
    labels = tokenizer(text_target=batch["tgt"], max_length=max_target_len, truncation=True)
    enc["labels"] = labels["input_ids"]
    return enc

train_ds = train_raw.map(preprocess, batched=True, remove_columns=train_raw.column_names)
val_ds   = val_raw.map(preprocess,   batched=True, remove_columns=val_raw.column_names)

# 5) Torch formatting (so Trainer can index tensors)
cols = ["input_ids", "attention_mask", "labels"]
train_ds.set_format(type="torch", columns=cols)
val_ds.set_format(type="torch", columns=cols)

# 6) Collator for padding at batch time (used in Trainer later)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer)



Map:   0%|          | 0/846 [00:00<?, ? examples/s]



Map:   0%|          | 0/95 [00:00<?, ? examples/s]
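
The `-100` label convention mentioned in the tokenization cell can be illustrated without any library. `DataCollatorForSeq2Seq` pads label sequences with `-100` by default, and PyTorch's cross-entropy loss ignores that value by default, so padding never contributes to the loss. The sketch below mimics both steps in plain Python; the token ids and per-token losses are made-up numbers, and `pad_labels` / `mean_token_loss` are hypothetical helper names, not library functions.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def pad_labels(label_seqs, pad_value=IGNORE_INDEX):
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(seq) for seq in label_seqs)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in label_seqs]

def mean_token_loss(per_token_losses, labels, ignore_index=IGNORE_INDEX):
    """Average the per-token losses over non-ignored positions only."""
    kept = [loss for loss, lab in zip(per_token_losses, labels) if lab != ignore_index]
    return sum(kept) / len(kept)

# Two hypothetical target sequences of different lengths
batch = [[71, 418, 1], [27, 217, 5, 531, 1]]
padded = pad_labels(batch)
print(padded[0])  # [71, 418, 1, -100, -100]

# Made-up per-token losses; the last two sit on padding and must be ignored
losses = [0.5, 1.5, 1.0, 9.0, 9.0]
masked_mean = mean_token_loss(losses, padded[0])
print(masked_mean)  # 1.0
```

Padding at batch time (rather than padding everything to `max_target_len` up front) keeps batches as short as their longest member, which matters here because most answers are well under 128 tokens.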


## 3. Fine-tuning & Hyperparameter Exploration

We fine-tune with the Hugging Face `Trainer` (PyTorch). The model computes the cross-entropy loss itself when `labels` are provided, so no separate loss function is configured. We explore a **small grid** over learning rate (3e-5, 5e-5) and epochs (5, 8) and keep the configuration with the lowest validation loss.


In [4]:
# --- Refactored training block (PyTorch + Trainer) ---

import itertools, numpy as np, torch, gc
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
    TrainingArguments, Trainer, set_seed
)

set_seed(42)

tokenizer = AutoTokenizer.from_pretrained(model_name)

def build_model():
    return AutoModelForSeq2SeqLM.from_pretrained(model_name)

def run_train(lr=5e-5, epochs=2, keep_model=False):
    model = build_model()
    collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

    args = TrainingArguments(
        output_dir=f"runs/{model_name.replace('/','_')}_lr{lr}_ep{epochs}",
        learning_rate=lr,
        num_train_epochs=epochs,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        report_to="none",
        # device placement is automatic: Trainer uses the GPU when one is available
    )

    # Prefer `processing_class=` (newer) but fall back to `tokenizer=` (older)
    try:
        trainer = Trainer(
            model=model, args=args,
            train_dataset=train_ds, eval_dataset=val_ds,
            data_collator=collator, processing_class=tokenizer
        )
    except TypeError:
        trainer = Trainer(
            model=model, args=args,
            train_dataset=train_ds, eval_dataset=val_ds,
            data_collator=collator, tokenizer=tokenizer
        )

    trainer.train()
    metrics = trainer.evaluate()
    val_loss = float(metrics["eval_loss"])

    if keep_model:
        # keep the trained model in memory for immediate inference
        return val_loss, metrics, model

    # otherwise clean up between grid runs
    del trainer, model
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    gc.collect()
    return val_loss, metrics, None

search_space = {"lr": [3e-5, 5e-5], "epochs": [5, 8]}
best = {"val_loss": float("inf"), "lr": None, "epochs": None}
histories = {}
best_model = None

for lr, epochs in itertools.product(search_space["lr"], search_space["epochs"]):
    print(f"\n=== Training with lr={lr}, epochs={epochs} ===")
    # always get the trained model back; keep it only if it beats the current best
    val_loss, metrics, model = run_train(lr, epochs, keep_model=True)
    histories[(lr, epochs)] = metrics
    if val_loss < best["val_loss"]:
        best.update({"val_loss": val_loss, "lr": lr, "epochs": epochs})
        best_model = model  # rebinding drops the last reference to the previous best

    # release whatever is no longer referenced (previous best or this non-best run),
    # on CPU as well as GPU
    del model
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

print("Best:", best)

# Optional: save the best model for inference later
if best_model is not None:
    save_dir = "t5-small-mental-support-best"
    best_model.save_pretrained(save_dir)
    tokenizer.save_pretrained(save_dir)



=== Training with lr=3e-05, epochs=5 ===




Step,Training Loss
500,1.4743



=== Training with lr=3e-05, epochs=8 ===


Step,Training Loss
500,1.4246



=== Training with lr=5e-05, epochs=5 ===


Step,Training Loss
500,1.2352



=== Training with lr=5e-05, epochs=8 ===


Step,Training Loss
500,1.1871


Best: {'val_loss': 0.8535082936286926, 'lr': 5e-05, 'epochs': 8}
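
Each run above logs a single training-loss row at step 500. The arithmetic below is a small sketch of why, assuming a single device, no gradient accumulation, and the `TrainingArguments` default of logging every 500 steps (none of these are overridden in the training cell).

```python
import math

n_rows = 941          # rows kept after cleaning
test_size = 0.1       # validation fraction passed to train_test_split
batch_size = 8        # per_device_train_batch_size
logging_steps = 500   # TrainingArguments default

n_val = math.ceil(n_rows * test_size)   # scikit-learn rounds the test split up
n_train = n_rows - n_val
steps_per_epoch = math.ceil(n_train / batch_size)

total_steps = {}
logged_at = {}
for epochs in (5, 8):
    total_steps[epochs] = steps_per_epoch * epochs
    logged_at[epochs] = list(range(logging_steps, total_steps[epochs] + 1, logging_steps))
    print(f"epochs={epochs}: {total_steps[epochs]} optimizer steps, "
          f"training loss logged at steps {logged_at[epochs]}")

print(n_train, n_val, steps_per_epoch)  # 846 95 106
```

With only 530 or 848 optimizer steps per run, the default logging interval gives one data point per run. Lowering `logging_steps` and evaluating once per epoch would give a usable loss curve for comparing the four configurations.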



## 4. Evaluation (BLEU, ROUGE-L, F1, Perplexity) + Qualitative

We generate responses for the validation set with greedy decoding and score them against the reference answers with BLEU, ROUGE-1/ROUGE-L, and a token-overlap F1. Perplexity is derived from the validation loss as `exp(val_loss)`.
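
As a worked example of the perplexity formula, the snippet below plugs in the best validation loss reported by the grid search. It treats `eval_loss` as the mean per-token cross-entropy in nats, which is only approximately true because `Trainer` averages over batches.

```python
import math

val_loss = 0.8535082936286926      # best eval_loss from the grid search
perplexity = math.exp(val_loss)    # perplexity = exp(mean cross-entropy)

print(round(perplexity, 4))  # 2.3479
```

A perplexity of about 2.35 means that, on average, the model is as uncertain about the next reference token as if it were choosing uniformly among roughly 2.35 candidates.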


In [7]:
# --- EVAL BLOCK: BLEU, ROUGE, token-F1, perplexity (PyTorch) ---
# copied from the grid-search output above so this cell does not depend on re-running training
best = {'val_loss': 0.8535082936286926, 'lr': 5e-05, 'epochs': 8}

import torch, numpy as np, nltk, evaluate
nltk.download('punkt', quiet=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1) Load metrics
bleu  = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# 2) Get a model for inference (prefer the best fine-tuned one if available)
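# `globals().get` avoids a NameError when the training cell was not run in this session
_has_best = globals().get("best_model") is not None

if _has_best:
    model = best_model.to(device).eval()
else:
    from transformers import AutoModelForSeq2SeqLM  # local import to avoid reimport noise
    # fall back to the checkpoint saved by the training cell unless `infer_model_path` overrides it;
    # loading an untuned base checkpoint here would silently evaluate the wrong model
    model = AutoModelForSeq2SeqLM.from_pretrained(
        globals().get("infer_model_path", "t5-small-mental-support-best")
    ).to(device).eval()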

# 3) Generation helper (uses PT tensors)
def generate_text(batch_inputs, max_new_tokens=64):
    enc = tokenizer(
        batch_inputs,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128,
    )
    enc = {k: v.to(device) for k, v in enc.items()}
    with torch.no_grad():
        out = model.generate(
            **enc,
            max_new_tokens=max_new_tokens,
            do_sample=False,   # deterministic; set True + temperature/top_p for sampling
        )
    return tokenizer.batch_decode(out, skip_special_tokens=True)

# 4) Slice validation for speed
val_slice = min(200, len(X_val))
preds, refs = [], []
bs = 16
for i in range(0, val_slice, bs):
    batch_inp = list(X_val[i:i+bs])     # ensure list[str]
    batch_ref = list(y_val[i:i+bs])
    batch_out = generate_text(batch_inp)
    preds.extend(batch_out)
    refs.extend([[r] for r in batch_ref])   # BLEU expects list[list[str]]

# 5) Metrics
bleu_res  = bleu.compute(predictions=preds, references=refs)
rouge_res = rouge.compute(predictions=preds, references=[r[0] for r in refs])

def f1_token(pred, ref):
    ps, rs = pred.split(), ref.split()
    if not ps or not rs: return 0.0
    common = set(ps) & set(rs)
    precision = sum(w in common for w in ps) / len(ps)
    recall    = sum(w in common for w in rs) / len(rs)
    return 0.0 if (precision + recall) == 0 else 2*precision*recall/(precision+recall)

f1_scores = [f1_token(p, r[0]) for p, r in zip(preds, refs)]
val_perplexity = float(np.exp(best["val_loss"])) if np.isfinite(best["val_loss"]) else None

print("BLEU:", bleu_res)
print("ROUGE:", {k: rouge_res[k] for k in ["rouge1", "rougeL"] if k in rouge_res})
print("F1 (token-level) - mean:", float(np.mean(f1_scores)))
print("Validation Perplexity:", val_perplexity)

# 6) Qualitative examples
for i in range(min(5, val_slice)):
    print("\nUSER:", X_val[i])
    print("GOLD:", y_val[i])
    print("PRED:", preds[i])



BLEU: {'bleu': 0.1520378840056875, 'precisions': [0.33830315938942135, 0.18111682586333577, 0.1606395127521888, 0.1528436018957346], 'brevity_penalty': 0.7719882552405887, 'length_ratio': 0.7944162436548223, 'translation_length': 2817, 'reference_length': 3546}
ROUGE: {'rouge1': np.float64(0.2251221472984557), 'rougeL': np.float64(0.18905248646003192)}
F1 (token-level) - mean: 0.2025087217296768
Validation Perplexity: 2.347869435254632
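
The BLEU dictionary above can be read as "brevity penalty times the geometric mean of the 1- to 4-gram precisions". The sketch below recomputes the headline score from the reported components using the standard BLEU definition, as a sanity check rather than a replacement for the `evaluate` metric.

```python
import math

# Components copied from the BLEU output above
precisions = [0.33830315938942135, 0.18111682586333577,
              0.1606395127521888, 0.1528436018957346]
translation_length = 2817   # total tokens in the predictions
reference_length = 3546     # total tokens in the references

# Brevity penalty: penalize predictions that are shorter than the references
if translation_length >= reference_length:
    brevity_penalty = 1.0
else:
    brevity_penalty = math.exp(1 - reference_length / translation_length)

# Geometric mean of the n-gram precisions (uniform weights)
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
bleu = brevity_penalty * geo_mean

print(round(brevity_penalty, 4), round(bleu, 4))  # 0.772 0.152
```

The brevity penalty of about 0.77 shows that the predictions are roughly 20% shorter than the references overall, so part of the low BLEU comes from length rather than wording. Raising `max_new_tokens` would reduce truncated outputs, though the qualitative sample above suggests some predictions are already longer than their short references.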

USER: mental health support: probably because my exams are approaching. i feel stressed out because i don't think i've prepared well enough.
GOLD: I see. Have you taken any approaches to not feel this way?
PRED: That sounds really tough, and it makes sense that you’re overwhelmed. Different strategies help different people—try small steps and notice what supports you best. Do a quick grounding check: look for 5 things you see, 4 you feel, 3 you hear, 2 you smell, 1 you taste. If this

USER: mental health support: nobody understands me
GOLD: It sound lik

In [8]:
import json, numpy as np

# token-level precision/recall/F1 + exact-match
def token_pr_f1(pred, ref):
    ps, rs = pred.split(), ref.split()
    if not ps or not rs: return 0.0, 0.0, 0.0
    common = set(ps) & set(rs)
    p = sum(w in common for w in ps) / len(ps)
    r = sum(w in common for w in rs) / len(rs)
    f1 = 0.0 if (p + r) == 0 else 2*p*r/(p+r)
    return p, r, f1

prf = [token_pr_f1(p, r[0]) for p, r in zip(preds, refs)]
token_precision_mean = float(np.mean([x[0] for x in prf]))
token_recall_mean    = float(np.mean([x[1] for x in prf]))
token_f1_mean        = float(np.mean([x[2] for x in prf]))
exact_match_accuracy = float(np.mean([int(p.strip() == r[0].strip()) for p, r in zip(preds, refs)]))

lr     = best.get("lr") if "best" in globals() else None
epochs = best.get("epochs") if "best" in globals() else None
val_perplexity = float(np.exp(best["val_loss"])) if ("best" in globals() and np.isfinite(best["val_loss"])) else None

summary = {
  "model_name": model_name,
  "learning_rate": lr,
  "epochs": epochs,
  "bleu": float(bleu_res["bleu"]),
  "rouge1": float(rouge_res.get("rouge1", 0.0)),
  "rougeL": float(rouge_res.get("rougeL", 0.0)),
  "token_precision_mean": token_precision_mean,
  "token_recall_mean": token_recall_mean,
  "token_f1_mean": token_f1_mean,
  "exact_match_accuracy": exact_match_accuracy,
  "val_perplexity": val_perplexity
}

print(json.dumps(summary, ensure_ascii=False, indent=2))


{
  "model_name": "google/flan-t5-base",
  "learning_rate": 5e-05,
  "epochs": 8,
  "bleu": 0.1520378840056875,
  "rouge1": 0.2251221472984557,
  "rougeL": 0.18905248646003192,
  "token_precision_mean": 0.23994181519898058,
  "token_recall_mean": 0.19238476296706686,
  "token_f1_mean": 0.2025087217296768,
  "exact_match_accuracy": 0.0,
  "val_perplexity": 2.347869435254632
}
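
One caveat on the token-level scores: the overlap in `token_pr_f1` is based on set membership, so a word repeated in the prediction is credited every time it appears, even if the reference contains it once. A common alternative (used, for example, in SQuAD-style evaluation) counts overlap as a multiset. The sketch below contrasts the two on a made-up pair; `f1_multiset` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def f1_set_overlap(pred, ref):
    """Overlap F1 as computed in the notebook: membership in the shared vocabulary."""
    ps, rs = pred.split(), ref.split()
    if not ps or not rs:
        return 0.0
    common = set(ps) & set(rs)
    precision = sum(w in common for w in ps) / len(ps)
    recall = sum(w in common for w in rs) / len(rs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f1_multiset(pred, ref):
    """Overlap F1 where each reference token can be matched at most once."""
    ps, rs = pred.split(), ref.split()
    if not ps or not rs:
        return 0.0
    overlap = sum((Counter(ps) & Counter(rs)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(ps), overlap / len(rs)
    return 2 * precision * recall / (precision + recall)

# Made-up example with a repeated word in the prediction
pred, ref = "the the the cat", "the cat sat down"
set_f1 = f1_set_overlap(pred, ref)
multiset_f1 = f1_multiset(pred, ref)
print(round(set_f1, 3), round(multiset_f1, 3))  # 0.667 0.5
```

For generated text that repeats common words, the set-based variant can overstate precision, so the multiset version is the more conservative number to report.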


In [6]:
!pip install rouge_score evaluate




## 5. Inference Helper & Safety Guardrails

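We include a lightweight keyword-based detector for unsafe messages (self-harm or harm to others), which short-circuits generation and returns a fixed safe fallback response. Out-of-domain requests are not filtered separately in this cell.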


In [9]:
import re, torch

PREFIX = "mental health support: "
SAFE_FALLBACK = (
    "I'm here to help with supportive information about mental health, coping strategies, and resources. "
    "If you're in immediate danger, please contact local emergency services or a crisis hotline."
)

# Keyword-based safety check (word boundaries + common variants)
DANGER_RE = re.compile(
    r"\b(suicide|self[-\s]?harm|harm myself|hurt myself|kill myself|end my life|overdose|kill (someone|others)|harm (someone|others))\b",
    re.I
)

def is_unsafe(text: str) -> bool:
    return bool(DANGER_RE.search(text or ""))

# Choose device automatically (CPU/GPU)
_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(_device).eval()  # ensure model is on the right device

def chat_respond(user_text: str) -> str:
    # 1) Safety first
    if is_unsafe(user_text):
        return SAFE_FALLBACK

    # 2) Build input
    inp = PREFIX + (user_text or "")
    enc = tokenizer(
        [inp],
        return_tensors="pt",          # PyTorch tensors, matching the PyTorch model
        padding=True,
        truncation=True,
        max_length=128
    )
    enc = {k: v.to(_device) for k, v in enc.items()}

    # 3) Generate
    with torch.no_grad():
        output = model.generate(
            **enc,
            max_new_tokens=64,
            do_sample=False,           # deterministic (set True + temperature/top_p for sampling)
            no_repeat_ngram_size=3,    # reduce repetition
        )

    return tokenizer.decode(output[0], skip_special_tokens=True)

# Quick test
for q in ["i feel anxious about work", "how to end my life?"]:
    print(q, "->", chat_respond(q))


i feel anxious about work -> I'm really glad you reached out—what you're feeling is valid. What support is available to you today?
how to end my life? -> I'm here to help with supportive information about mental health, coping strategies, and resources. If you're in immediate danger, please contact local emergency services or a crisis hotline.
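
Because the guardrail is a fixed keyword pattern, it is worth probing what it does and does not catch. The sketch below reuses the same regular expression on a few hypothetical probe messages (the probes are illustrative, not drawn from the dataset).

```python
import re

# Same pattern as in the inference cell above
DANGER_RE = re.compile(
    r"\b(suicide|self[-\s]?harm|harm myself|hurt myself|kill myself|end my life|overdose|kill (someone|others)|harm (someone|others))\b",
    re.I,
)

def is_unsafe(text):
    return bool(DANGER_RE.search(text or ""))

# Hypothetical probe messages
probes = [
    "how to end my life?",            # exact phrase in the pattern
    "thinking about Self-Harm again", # case and hyphen variants are covered
    "i feel suicidal",                # inflected form, not in the pattern
    "i want to die",                  # paraphrase, not in the pattern
    "i feel anxious about work",      # benign
]

results = {p: is_unsafe(p) for p in probes}
for text, flagged in results.items():
    print(f"{flagged!s:5}  {text}")
```

The first two probes are flagged, while the third and fourth pass through to the generator even though they signal risk. This is the gap behind the "stronger safety filters" item in the conclusions: a keyword list needs inflections and paraphrases added, or a classifier in front of it.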


In [10]:

MODEL_DIR = "data/luma_t5_tf"
tokenizer.save_pretrained(MODEL_DIR)
model.save_pretrained(MODEL_DIR)
print("Saved to", MODEL_DIR)


Saved to data/luma_t5_tf



## 6. Conclusions & Next Steps

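- **What worked:** Domain prefixing plus fine-tuning `google/flan-t5-base` yields coherent, supportive responses in the qualitative samples. On the validation set, the best run (lr 5e-5, 8 epochs) reaches BLEU 0.152, ROUGE-L 0.189, and perplexity 2.35.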
- **Improvements:** Expand dataset coverage (coping, referrals, boundaries), add stronger safety filters, and consider parameter-efficient tuning for speed.
- **Deployment:** You can wrap `app.py` in a small Docker image and deploy to a VM or Hugging Face Spaces.
