# Thesa: A Therapy Chatbot
Created by: John Handley | [GitHub](https://github.com/johnhandleyd) | [LinkedIn](https://www.linkedin.com/in/john-handley)

Thesa is an experimental project that aims to create a chatbot focused on mental health.

## HuggingFace Hub

In [None]:
from huggingface_hub import notebook_login
notebook_login()


## Data preprocessing

In [1]:
# needed installations
!pip install datasets



In [None]:
# space to write various functions used throughout the code
import random

def print_random(dataset, column, n=1):
    for i in range(n):
        r = random.randrange(len(dataset[column]))  # randint's upper bound is inclusive and could index past the end

        print("-"*50, "Random MHCD phrase", r, "-"*50)
        print(dataset[column][r])

### Cleaning functions

In [93]:
# Cleaning functions for the datasets
import re, html, json, requests
import pandas as pd
from datasets import Dataset, load_dataset

# [Optional] Preload the Zephyr 7B alpha tokenizer for the 'apply_chat_template' function
# from transformers import AutoModelForCausalLM, AutoTokenizer
# checkpoint = "TheBloke/zephyr-7b-alpha-GPTQ"
# tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def coun_replace_last_comma(string, target=", ", replacement=" and "):
    """
    Function to clean topics for use as context for the chatbot.
    - 'CounselChat' separates multiple topics with commas and lacks the 'and' connector
    between the last two topics. This function adds it.

    Args:
    - string: comma-separated topics
    - target: separator whose last occurrence is replaced
    - replacement: replacement for the last occurrence of target

    Returns:
    - string: processed string
    """

    if target in string:
        # find the index of target's last occurrence
        target_i = string.rfind(target)
        # substitute the last occurrence of target with replacement
        new_string = string[:target_i] + replacement + string[target_i + len(target):]
        return new_string
    # if target is absent, return the string unchanged
    return string


def coun_normalize(example, target=None):
    """
    Function to normalize dataset.
    - 'CounselChat' has some topics set to None, plus extra spaces and uppercase letters we want to get rid of.

    Args:
    - example: data sample
    - target: column to normalize (only "topics" is handled)

    Returns:
    - dict with the normalized column, or None when target is not handled
    """
    # normalize topics
    if target == "topics":
        topic = example["topics"]
        # if None, return as "unknown"
        if topic is None:
            return {"topics": "unknown"}
        topic = re.sub(r",(?!\s)", ', ', topic)
        return {target: coun_replace_last_comma(topic).lower().strip()}


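def chat_template(example, dataset_name, topic, question, answer, apply_template=False):
    """
    Function to process the dataset and apply the HuggingFace chat template used for training chatbots.

    Args:
    - example: data sample (ignored when dataset_name is "string")
    - dataset_name: "counselchat", "mhcd" or "string"
    - topic, question, answer: column names holding the system, user and assistant text
      (or the texts themselves when dataset_name is "string")
    - apply_template: if True, use the tokenizer's 'apply_chat_template' instead of the manual format

    Returns:
    - dict with the formatted text under "templateText" (a plain string when dataset_name is "string")
    """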
    if dataset_name.lower() == "string":
        return f"<|system|>\n{topic}\n<|user|>\n{question}\n<|assistant|>\n{answer}"

    if dataset_name.lower() == 'counselchat':
        if ' and ' in example[topic]:   # match the connector only, not words that merely contain "and"
            example[topic] = f"You're a therapist helping a patient with their {example[topic]} problems"
        else:
            example[topic] = f"You're a therapist helping a patient with their {example[topic]} problem"

    # If using "apply_chat_template" from tokenizer:
    if apply_template:
        template = [{
                    "role": "system",
                    "content": example[topic]
                },
                {
                    "role": "user",
                    "content": example[question]
                },
                {
                    "role": "assistant",
                    "content": example[answer]
                }]

        # the assistant answer is already part of the template, so no generation prompt is appended
        chat = tokenizer.apply_chat_template(template, tokenize=False, add_generation_prompt=False)
    else:
        chat = f"<|system|>\n{example[topic]}.\n<|user|>\n{example[question]}\n<|assistant|>\n{example[answer]}"

    return {"templateText": chat}


def clean_counselchat(dataset, columns: list):
    """
    Function to clean data samples from CounselChat dataset. For instance, remove "None" samples and html tags.
    """

    dataset = dataset.map(coun_normalize, fn_kwargs={"target": "topics"})

    for column in columns:
        dataset = dataset.filter(lambda x: x[column] is not None)
        dataset = dataset.map(lambda x: {column: html.unescape(x[column])})
        dataset = dataset.map(lambda x: {column: re.sub(r"<\/?(p|br)>", '\n', x[column])})
        dataset = dataset.map(lambda x: {column: re.sub(r"<\/?.*?>", '', x[column])})                       # remove HTML tags, e.g. <p>
        dataset = dataset.map(lambda x: {column: re.sub(r"([.?!;])(?=[a-zA-Z])", "\\1 ", x[column])})       # add a space between punctuation and text, if missing
        dataset = dataset.map(lambda x: {column: re.sub(r"(\s){2,}", "\\1", x[column])})                    # remove extra \s
        dataset = dataset.map(lambda x: {column: re.sub(r"^\s|\s(?=:|;|,|\.)", "", x[column])})             # remove beginning \s or spaces before some punct

    # other issues I decided to leave: "--" instead of "―"; space after "/", i.e. word/ word;

    # entry 230 has a lot of unwanted text, so we convert to pandas to remove the row
    df = dataset["train"].to_pandas()
    df = df.drop(index=230)
    dataset = Dataset.from_pandas(df, preserve_index=False)   # do not keep the pandas index as an extra column

    return dataset


# Isolate functions for each dataset
def process_counselchat(path):
    """
    Function to clean the CounselChat dataset.
    """

    dataset = load_dataset(path)
    dataset = dataset.remove_columns(["questionID", "questionTitle", "questionUrl",
                                     "therapistName", "therapistUrl", "upvotes"])

    dataset = clean_counselchat(dataset, ["questionText", "answerText"])

    dataset = dataset.map(chat_template, fn_kwargs={"dataset_name": "counselchat",
                                                            "topic": "topics",
                                                            "question": "questionText",
                                                            "answer": "answerText"}
                                                            )

    dataset = dataset.remove_columns(["questionText", "answerText", "topics"])

    return dataset


def process_mhcd(github_path):
    """
    Function to clean the Mental Health Conversational Data dataset.
    """

    response = requests.get(github_path)
    response.raise_for_status()    # fail early if the download did not succeed
    data = json.loads(response.text)

    # build one row per (pattern, response) pair of every intent
    rows = []
    for sample in data.values():
        for row in sample:
            for pattern in row["patterns"]:
                for reply in row["responses"]:
                    rows.append({"topic": row["tag"], "user": pattern, "response": reply})

    df = pd.DataFrame(rows, columns=["topic", "user", "response"])
    dataset = Dataset.from_pandas(df)

    dataset = dataset.map(chat_template, fn_kwargs={"dataset_name": "mhcd",
                                                    "topic": "topic",
                                                    "question": "user",
                                                    "answer": "response"})

    return dataset.remove_columns(["topic", "user", "response"])
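
As a rough illustration of what the cleaning functions above do, the following self-contained sketch applies the same kind of regular expressions to two made-up strings. The helper names `join_topics` and `clean_text` are hypothetical and not part of the notebook, and the samples are invented rather than taken from CounselChat.

```python
import html
import re


def join_topics(topics: str) -> str:
    """Turn 'A,B, C' into 'a, b and c' (lowercased, trimmed)."""
    topics = re.sub(r",(?!\s)", ", ", topics)       # ensure a space after every comma
    head, sep, tail = topics.rpartition(", ")       # split on the last ", "
    joined = f"{head} and {tail}" if sep else tail  # a single topic has no separator
    return joined.lower().strip()


def clean_text(text: str) -> str:
    """Unescape entities, drop HTML tags and tidy whitespace."""
    text = html.unescape(text)
    text = re.sub(r"</?(p|br)\s*/?>", "\n", text)         # paragraph and line-break tags become newlines
    text = re.sub(r"</?[^>]+>", "", text)                  # drop any remaining tag
    text = re.sub(r"([.?!;])(?=[a-zA-Z])", r"\1 ", text)   # add a missing space after punctuation
    text = re.sub(r"(\s){2,}", r"\1", text)                # collapse runs of whitespace
    return text.strip()


# Made-up samples, not taken from the datasets
print(join_topics("Family Conflict,Stress, Anxiety"))
print(clean_text("<p>Me &amp; my <b>dad</b> argue.What now?</p>"))
```

The first call prints `family conflict, stress and anxiety`, which is the form the system prompt expects; the second prints `Me & my dad argue. What now?`.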

In [None]:
# Apply the cleaning methods
import requests
from datasets import concatenate_datasets, DatasetDict

### -- CounselChat Dataset from HuggingFace --
# https://huggingface.co/datasets/loaiabdalslam/counselchat

## Import processed CounselChat with custom functions from "cleaning.py"
counselchat = process_counselchat("loaiabdalslam/counselchat")

### -- Mental Health Conversational Data from Kaggle --
# https://www.kaggle.com/datasets/elvis23/mental-health-conversational-data

## Import processed MHCD Dataset with custom functions from "cleaning.py"
mhcd = process_mhcd("https://raw.githubusercontent.com/johnhandleyd/thesa/main/data/intents.json")



Map:   0%|          | 0/1382 [00:00<?, ? examples/s]

Map:   0%|          | 0/661 [00:00<?, ? examples/s]

In [None]:
# Print a random phrase for the Datasets
print_random(counselchat, "templateText")

print_random(mhcd, "templateText")

-------------------------------------------------- Random MHCD phrase 686 --------------------------------------------------
<|system|>
You're a therapist helping a patient with their family conflict and stress problems.
<|user|>
My dad is doing some really bad drugs, and I'm not allowed to see him anymore because of what he can do to me or my siblings on this drug. It has affected me tremendously in my life. It’s even caused me anger and stress.
<|assistant|>
It seems like you are going trough stages of grief, since the inability to see your father causes you similar feelings as if you had lost him. Perhaps you could send him letters expressing your feelings and hopes. But do understand that if he is under the influence of drugs he might not be able to empathize with your feelings or react in the way that he would have done so in the past. As the issue evolves find a therapist or counselor to help you work on letting go of that anger and stress, which may affect you negatively. Find f

I did not end up splitting the data, as I considered it unnecessary for a chatbot, but I am leaving the code here in case I need it later.

```
# Joining datasets and splitting them into train, test, val

# Join datasets
thesa_dataset = concatenate_datasets([counselchat, mhcd])

# # Split dataset into train, val, test
# # 84% train, 16% test + validation
# train_test_val = thesa_dataset.train_test_split(test_size=0.16)

# # Split the 16% test + valid in half test, half valid
# test_val = train_test_val['test'].train_test_split(test_size=0.5)

# # Put them all back together
# thesa_dataset = DatasetDict({
#     'train': train_test_val['train'],
#     'test': test_val['test'],
#     'valid': test_val['train']})
# print(thesa_dataset)
```
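
To make the proportions of that two-stage split concrete, here is a plain-Python sketch that holds out 16% of the examples and then halves the hold-out into test and validation. It is only a stand-in for `train_test_split`: the function name `three_way_split` and the seed are assumptions, and the library may round the split sizes slightly differently.

```python
import random


def three_way_split(n_examples: int, holdout_fraction: float = 0.16, seed: int = 42):
    """Return shuffled index lists for train, validation and test."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)           # seeded shuffle for reproducibility

    n_holdout = round(n_examples * holdout_fraction)
    holdout, train = indices[:n_holdout], indices[n_holdout:]

    n_test = n_holdout // 2                         # half of the hold-out goes to test
    test, valid = holdout[:n_test], holdout[n_test:]
    return train, valid, test


train, valid, test = three_way_split(2043)          # 1382 CounselChat + 661 MHCD samples
print(len(train), len(valid), len(test))
```

With 2043 examples this sketch yields 1716 training, 164 validation and 163 test indices, roughly an 84/8/8 split.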





## Fine-tuning

In [None]:
!pip install accelerate peft trl auto-gptq optimum bitsandbytes -q

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig, TrainingArguments
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer

In [None]:
checkpoint = "TheBloke/zephyr-7B-alpha-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
# LoRA hyperparameters (r, lora_alpha) are not GPTQConfig options; they are set in LoraConfig
config = GPTQConfig(bits=4,
                    use_exllama=False,
                    tokenizer=tokenizer
                    )

model = AutoModelForCausalLM.from_pretrained(checkpoint, # Zephyr 7b Alpha GPTQ
                                              quantization_config=config,
                                              device_map="auto",
                                              use_cache=False,
                                              )

You passed `quantization_config` to `from_pretrained` but the model you're loading already has a `quantization_config` attribute and has already quantized weights. However, loading attributes (e.g. use_exllama, exllama_config, use_cuda_fp16, max_input_length) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.


In [None]:
# check out the model's details
print(model)


In [None]:
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
peft_config = LoraConfig(
                            r=16,
                            lora_alpha=16,
                            lora_dropout=0.05,
                            bias="none",
                            task_type="CAUSAL_LM",
                            target_modules=["q_proj", "v_proj"]
                        )

model = get_peft_model(model, peft_config)
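
For a sense of how small the trainable part is, the number of LoRA parameters can be estimated by hand. The sketch below takes the rank, the layer count and the `q_proj` size from the notebook; the `v_proj` output size of 1024 is an assumption based on the Mistral 7B configuration (8 key/value heads of size 128), since the printed model is truncated before that layer.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


R = 16            # LoRA rank from LoraConfig
N_LAYERS = 32     # "(0-31): 32 x MistralDecoderLayer" in the model printout
HIDDEN = 4096     # q_proj maps 4096 -> 4096 in the model printout
V_OUT = 1024      # ASSUMPTION: v_proj output size, taken from the Mistral 7B config

per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, V_OUT, R)
total = per_layer * N_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
```

Under these assumptions the adapters add about 6.8 million trainable parameters. Calling `model.print_trainable_parameters()` on the PEFT model reports the actual figure.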

In [None]:
print('Model updated for Fine-tuning')
print(model)

Model updated for Fine-tuning
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096, padding_idx=2)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (rotary_emb): MistralRotaryEmbedding()
              (k_proj): QuantLinear()
              (o_proj): QuantLinear()
              (q_proj): lora.QuantLinear(
                (base_layer): QuantLinear()
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
             

In [None]:
training_arguments = TrainingArguments(
                        output_dir="/content/drive/MyDrive/thesa/thesa",
                        per_device_train_batch_size=8,
                        gradient_accumulation_steps=1,
                        optim="paged_adamw_32bit",
                        learning_rate=2e-4,
                        lr_scheduler_type="cosine",
                        save_strategy="epoch",
                        logging_steps=50,
                        num_train_epochs=5,                 # ignored: max_steps takes precedence when it is set
                        max_steps=250,
                        fp16=True,
                        push_to_hub=True)
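
Because `max_steps` overrides `num_train_epochs`, it helps to check how much of the data 250 steps actually cover. This back-of-the-envelope sketch uses the values from the training arguments and the dataset size reported by the trainer; training on a single GPU is an assumption.

```python
import math

n_examples = 2043     # size of the concatenated dataset (1382 + 661)
batch_size = 8        # per_device_train_batch_size
grad_accum = 1        # gradient_accumulation_steps
max_steps = 250
n_devices = 1         # ASSUMPTION: a single GPU

examples_per_step = batch_size * grad_accum * n_devices
steps_per_epoch = math.ceil(n_examples / examples_per_step)
epochs_covered = max_steps * examples_per_step / n_examples

print(f"{steps_per_epoch} steps per epoch")
print(f"{max_steps} steps cover about {epochs_covered:.2f} epochs")
```

With these settings one epoch takes 256 steps, so the run stops just short of a single pass over the data rather than training for five epochs.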

In [None]:
tokenizer.padding_side = 'right'
trainer = SFTTrainer(
                        model=model,
                        train_dataset=thesa_dataset,
                        peft_config=peft_config,
                        dataset_text_field="templateText",
                        args=training_arguments,
                        tokenizer=tokenizer,
                        packing=False,
                        max_seq_length=1024
                    )

trainer.train()

trainer.push_to_hub()

Map:   0%|          | 0/2043 [00:00<?, ? examples/s]

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 50   | 1.9135        |
| 100  | 1.7954        |
| 150  | 1.7468        |
| 200  | 1.7829        |
| 250  | 1.7602        |


CommitInfo(commit_url='https://huggingface.co/johnhandleyd/thesa_2/commit/45a3ac8b686842f8a9f456a24ec793f509e48b11', commit_message='End of training', commit_description='', oid='45a3ac8b686842f8a9f456a24ec793f509e48b11', pr_url=None, pr_revision=None, pr_num=None)

## Inference with some samples

In [2]:
!pip install peft auto-gptq optimum -q

In [76]:
import torch, time
from peft import AutoPeftModelForCausalLM
from transformers import GenerationConfig, AutoTokenizer
import locale

locale.getpreferredencoding = lambda: "UTF-8"
!pip install Unidecode -q

def infer(tokenizer, model, example):
    """Generate a reply for a templated prompt, print it and report the elapsed time."""

    inputs = tokenizer(example, return_tensors="pt").to("cuda")

    # top_k=1 keeps only the most likely token, so sampling is effectively greedy
    generation_config = GenerationConfig(
        do_sample=True,
        top_k=1,
        temperature=0.1,
        max_new_tokens=256,
        pad_token_id=tokenizer.eos_token_id
    )

    start = time.time()
    outputs = model.generate(**inputs, generation_config=generation_config)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    result = "<|user|>".join(result.split("<|user|>")[:2])  # Show only the first exchange; the model sometimes starts a new one
    print(result.replace(r'\n', '\n'))

    print(f"Time taken to generate example: {time.time() - start}")
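
The post-processing step in `infer` can be tried without a model. The sketch below repeats the same split-and-join trick on a made-up decoded string and additionally extracts only the assistant's answer; the helpers `first_exchange` and `assistant_reply` are hypothetical names introduced for illustration.

```python
USER_TAG = "<|user|>"
ASSISTANT_TAG = "<|assistant|>"


def first_exchange(decoded: str) -> str:
    """Keep the text up to (not including) a second '<|user|>' turn."""
    parts = decoded.split(USER_TAG)
    return USER_TAG.join(parts[:2])


def assistant_reply(decoded: str) -> str:
    """Return only the first assistant answer, without the prompt."""
    exchange = first_exchange(decoded)
    _, _, reply = exchange.partition(ASSISTANT_TAG)
    return reply.strip()


# Made-up decoded output in which the model started a second turn
decoded = (
    "<|system|>\nYou are a therapist helping patients.\n"
    "<|user|>\nI can't sleep.\n"
    "<|assistant|>\nThat sounds exhausting.\n"
    "<|user|>\nAnother question\n<|assistant|>\nAnother answer"
)
print(assistant_reply(decoded))
```

Returning the reply as a string, instead of printing inside the generation function, would also make the output easier to reuse in a chat interface.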

In [80]:
# Fine-tuned model
checkpoint = "johnhandleyd/thesa"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoPeftModelForCausalLM.from_pretrained(
    checkpoint,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto")

Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.


In [82]:
# Test phrase #1
system = "You are a therapist helping patients."
question = "I've been feeling depressed lately. Can you help me?"
answer = ""
example = chat_template(example="", dataset_name="string", question=question, answer=answer, topic=system)

infer(tokenizer, model, example)

<|system|>
You are a therapist helping patients.
<|user|>
I've been feeling depressed lately. Can you help me?
<|assistant|>
I'm sorry to hear that you're feeling depressed. I'm glad you're reaching out for help. I'd be happy to help you.
Depression is a common mental health issue that affects many people. It can be caused by a variety of factors, including genetics, life events, and chemical imbalances in the brain.
If you're feeling depressed, there are many things you can do to help yourself. Here are a few ideas:
1. Talk to someone. Talking to someone about how you're feeling can help you feel better. You can talk to a friend, family member, or a mental health professional.
2. Get moving. Exercise is a great way to boost your mood. Even a short walk can help.
3. Eat healthy. Eating healthy foods can help you feel better.
4. Get enough sleep. Sleep is important for your mental health.
5. Practice mindfulness. Mindfulness is a way of being present in the moment. It can help you feel 

In [81]:
# Test phrase #2
system = "You are a therapist helping patients."
question = "I cannot find a job and I'm tired of looking. Any help?"
answer = ""
example = chat_template(example="", dataset_name="string", question=question, answer=answer, topic=system)

infer(tokenizer, model, example)

<|system|>
You are a therapist helping patients.
<|user|>
I cannot find a job and I'm tired of looking. Any help?
<|assistant|>
I'm sorry to hear that you're having a hard time finding a job. I'm not sure what your situation is, but I can tell you that there are a lot of resources available to help you. I would suggest that you start by contacting your local unemployment office. They can help you with job leads and also provide you with some financial assistance while you're looking for work. You can also check with your local community college or vocational school to see if they offer any job training programs. If you're having a hard time finding a job because you don't have the skills or experience required, then you may want to consider going back to school to get the training you need. I would also suggest that you look into volunteer opportunities in your community. Volunteering can help you gain experience and also provide you with a sense of purpose. I hope this helps. Best of 

In [83]:
# Test phrase #3
system = "You are a therapist helping patients."
question = "I'm fighting with my boyfriend and he's not talking to me. I don't know what to do"
answer = ""
example = chat_template(example="", dataset_name="string", question=question, answer=answer, topic=system)

infer(tokenizer, model, example)

<|system|>
You are a therapist helping patients.
<|user|>
I'm fighting with my boyfriend and he's not talking to me. I don't know what to do
<|assistant|>
I'm sorry to hear that you're having a difficult time with your boyfriend.
It's possible that he's just having a bad day and needs some space.
If you're concerned about his well-being, you could try reaching out to him and asking him if he's okay.
If he's not okay, you could try to help him by listening to him and offering support.
If he's not okay, you could also try to encourage him to seek professional help.
If you're concerned about your own well-being, you could try to reach out to a trusted friend or family member for support.
If you're having a difficult time coping with your emotions, you could also try to seek professional help.
I wish you the best of luck.
Robin J. Landwehr, DBH, LPCC, NCC

Time taken to generate example: 14.870899438858032


## Bibliography
- Chat template: https://huggingface.co/docs/transformers/chat_templating
- Zephyr 7B Alpha GPTQ model: https://huggingface.co/TheBloke/zephyr-7B-alpha-GPTQ
- Fine-tuning guide: https://medium.aiplanet.com/finetuning-using-zephyr-7b-quantized-model-on-a-custom-task-of-customer-support-chatbot-7f4fff56059d

## To Do
### Include more data
#### Possible datasets:
- https://www.kaggle.com/datasets/zuhairhasanshaik/datacsv
- https://www.kaggle.com/datasets/thedevastator/mental-health-chatbot-pairs/data
- https://www.kaggle.com/datasets/thedevastator/nlp-mental-health-conversations
- PAIR (not sure about this one) - https://lit.eecs.umich.edu/downloads.html#undefined
- maybe? https://www.reddit.com/r/askatherapist/

### Remove names from the dataset and retrain
- Some generated examples are signed with a therapist's name, so the names should be removed before retraining.
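
One possible starting point for the name removal is a heuristic that drops a trailing signature line from each answer. The sketch below is an assumption about how this could be done, not something the notebook implements: the `SIGNATURE` pattern and `strip_signature` helper are hypothetical, the sample is made up, and a pattern this simple can produce false positives, so its matches would need manual review.

```python
import re

# Heuristic: a final line made of a name followed by one or more
# comma-separated upper-case credentials, e.g. "Jane Doe, LPC, NCC".
SIGNATURE = re.compile(r"^[A-Z][A-Za-z.'\- ]{2,60}(?:,\s*[A-Z]{2,6})+\.?$")


def strip_signature(answer: str) -> str:
    """Drop the last line of an answer when it looks like a signature."""
    lines = answer.rstrip().split("\n")
    if len(lines) > 1 and SIGNATURE.match(lines[-1].strip()):
        lines = lines[:-1]
    return "\n".join(lines).rstrip()


sample = "I wish you the best of luck.\nJane Doe, LPC, NCC"   # made-up example
print(strip_signature(sample))
```

Such a function could be applied to the answer column with `dataset.map` before the chat template is built.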