<a href="https://colab.research.google.com/github/hcnimi/mental_health_chatbot/blob/main/mental_health_chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Outline
1. Data Preprocessing

	1.	Load the Dataset:
	•	Use the datasets library from Hugging Face to load the Estwld/empathetic_dialogues_llm dataset.
	•	Inspect and understand the structure of the dataset to ensure that it aligns with the preprocessing steps.
	2.	Preprocess the Text:
	•	Implement a preprocessing function that:
	•	Incorporates the emotion and situation at the beginning of the conversation.
	•	Formats the conversation history using a sequential turn-based template.
	•	Applies tokenization using the meta-llama/Llama-2-7b-hf tokenizer.
	3.	Prepare for Fine-Tuning:
	•	Ensure the preprocessed data is in a format suitable for fine-tuning, typically as input-output pairs.
	•	Use padding and truncation to handle varying lengths of conversations, ensuring efficient batching during training.

2. Fine-Tuning with QLoRA

	1.	Load the Base Model:
	•	Load meta-llama/Llama-2-7b-hf with the AutoModelForCausalLM and AutoTokenizer from Hugging Face.
	2.	Apply QLoRA:
	•	Use QLoRA (Quantized Low-Rank Adaptation) to reduce memory usage during fine-tuning.
	•	Quantize the model weights to 4-bit or 8-bit using the bitsandbytes library.
	•	Implement adapters for the model layers to fine-tune only specific parts of the model, reducing the computational cost and risk of overfitting.
	3.	Fine-Tune the Model:
	•	Set up a training loop using Hugging Face’s Trainer or Accelerate for efficient distributed training on the T4 GPU.
	•	Monitor training metrics like loss and perplexity to evaluate the model’s learning progress.
	•	Implement early stopping and learning rate scheduling to prevent overfitting.
	4.	Evaluate Overfitting:
	•	Split the dataset into training and validation sets.
	•	Track the difference in loss and perplexity between the training and validation sets.
	•	Use visualization tools like Matplotlib to plot training vs. validation loss and perplexity over time to visually assess overfitting.

3. Evaluation and Metrics Visualization

	1.	Evaluate Model Performance:
	•	After fine-tuning, evaluate the model on the validation set.
	•	Calculate loss, perplexity, and generate a confusion matrix if relevant.
	2.	Visualize Metrics:
	•	Plot training and validation loss/perplexity over epochs to visualize the learning curve.
	•	Check for signs of overfitting by analyzing these curves.
	3.	Generate Example Responses:
	•	Use the fine-tuned model to generate responses based on example inputs from the validation set.
	•	Compare the generated responses to the ground truth for qualitative evaluation.

4. Deployment Using Gradio

	1.	Create the Gradio Interface:
	•	Set up a Gradio interface with:
	•	Model Selection: Allow users to choose between the base and fine-tuned models.
	•	Text Input: A text box for users to input their queries or statements.
	•	Text Output: Display the model’s generated responses.
	•	Conversation History: Keep track of the conversation history for context-aware responses.
	2.	Implement the Model Switcher:
	•	Dynamically load and switch between the base model and the fine-tuned model based on user selection.
	•	Ensure that the Gradio interface efficiently manages GPU memory when switching models.
	3.	Deploy the Interface:
	•	Deploy the Gradio interface directly on Google Colab or via a cloud service if needed.
	•	Provide a simple and intuitive user interface for end-users seeking empathetic conversation and support.

5. Optional: Logging and Continuous Improvement

	1.	Log Interactions:
	•	Implement logging of user interactions (anonymized) to gather data on model performance in real-world scenarios.
	2.	Retrain and Fine-Tune:
	•	Periodically update the fine-tuned model with new data collected from user interactions to continuously improve the model’s performance and empathy.

In [12]:
!pip install datasets gradio transformers torch peft scikit-learn
!pip install -U bitsandbytes

import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from datasets import load_dataset, Dataset
from torch.utils.data import DataLoader, random_split
from sklearn.model_selection import train_test_split, StratifiedKFold
from peft import LoraConfig, get_peft_model
import gradio as gr
from huggingface_hub import login
import os



In [3]:
# Log in to Huggingface
access_token = os.getenv('HUGGINGFACE_API_TOKEN')
login(token=access_token)

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [4]:
# Load the dataset
dataset = load_dataset('Estwld/empathetic_dialogues_llm')

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf', load_in_8bit=True, device_map='auto')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [24]:
def preprocess_text(data_row):
    emotion = data_row['emotion']
    situation = data_row['situation']
    conversations = data_row['conversations']

    preprocessed_text = f'<|emotion|> {emotion}\n<|situation|> {situation}\n'

    for convo in conversations:
        # If convo is just a string, assume it's user input or assistant output without roles
        if isinstance(convo, dict):
            role = convo.get('role', 'user')  # Default role as 'user' if not specified
            content = convo.get('content', '')
        else:
            # Default to user role if only content is provided as a string
            role = 'user'
            content = convo

        if role == 'user':
            preprocessed_text += f'<|user|> {content}\n'
        elif role == 'assistant':
            preprocessed_text += f'<|assistant|> {content}\n'

    return preprocessed_text.strip()

# Apply preprocessing to the dataset
dataset = dataset.map(lambda x: {'text': preprocess_text(x)}, batched=False)

# Set the pad_token to be the same as eos_token
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=512)

# Tokenize each split in the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Convert to PyTorch dataset for each split
tokenized_datasets.set_format('torch')

# Calculate the number of samples for training, validation, and testing
train_size = int(0.7 * len(tokenized_datasets['train']))
val_size = len(tokenized_datasets['train']) - train_size

# Split the train dataset into train and validation
train_dataset, val_dataset = random_split(tokenized_datasets['train'], [train_size, val_size])

# Use the test dataset as is
test_dataset = tokenized_datasets['test']

# If you want to have everything in a dictionary format again
final_datasets = {
    'train': train_dataset,
    'validation': val_dataset,
    'test': test_dataset
}

Map:   0%|          | 0/19533 [00:00<?, ? examples/s]

Map:   0%|          | 0/2770 [00:00<?, ? examples/s]

Map:   0%|          | 0/2547 [00:00<?, ? examples/s]

Map:   0%|          | 0/19533 [00:00<?, ? examples/s]

Map:   0%|          | 0/2770 [00:00<?, ? examples/s]

Map:   0%|          | 0/2547 [00:00<?, ? examples/s]

In [25]:
# Apply QLoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none"
)
model = get_peft_model(model, lora_config)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_dir="./logs",
    logging_steps=100,
    fp16=True,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="loss"
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer
)

# Fine-tune the model
trainer.train()

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided []