## Fine-tuning Phi-3-mini-QLoRA

Based on the Phi-3 Cookbook QLoRA notebook: https://github.com/microsoft/Phi-3CookBook/blob/main/code/04.Finetuning/Phi-3-finetune-qlora-python.ipynb

Install required packages

In [None]:
# from IPython.display import clear_output
# !pip install -qqq --upgrade bitsandbytes transformers peft accelerate datasets trl flash_attn
# !pip install huggingface_hub
# !pip install python-dotenv
# !pip install wandb -qqq
# !pip install absl-py nltk rouge_score
# !pip list | grep transformers
# clear_output()

In [36]:
from IPython.display import clear_output
!pip install peft==0.8.2
!pip install bitsandbytes==0.42.0
!pip install accelerate==0.26.1
!pip install datasets==2.16.1
!pip install GPUtil
!pip install transformers==4.38.0
!pip install huggingface-hub
!pip install trl
!pip install rouge_score
!pip install -U sentence-transformers
clear_output()
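
Because several packages are pinned, it can help to confirm that the runtime actually picked up those versions before loading any model. The sketch below uses only the standard library; `PINNED` and `check_pins` are hypothetical names introduced here, and the pins are copied from the install cell above.

```python
from importlib import metadata

# Pins copied from the install cell; packages installed without a pin are omitted.
PINNED = {
    "peft": "0.8.2",
    "bitsandbytes": "0.42.0",
    "accelerate": "0.26.1",
    "datasets": "2.16.1",
    "transformers": "4.38.0",
}

def check_pins(pins):
    """Return {package: (expected, installed)} for every package whose version differs.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

# Report any package whose installed version differs from its pin.
for name, (expected, installed) in check_pins(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

An empty report means every pinned package matches; a `None` entry means the package is missing.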

In [26]:
# Get the scripts from remote source
!git clone https://github.com/mlgomez0/Health_Therapist
!mv Health_Therapist/backend/ml_models/modules/model_tester.py .
!rm -rf Health_Therapist

Cloning into 'Health_Therapist'...
remote: Enumerating objects: 426, done.
remote: Counting objects: 100% (143/143), done.
remote: Compressing objects: 100% (115/115), done.
remote: Total 426 (delta 53), reused 80 (delta 27), pack-reused 283
Receiving objects: 100% (426/426), 21.74 MiB | 26.13 MiB/s, done.
Resolving deltas: 100% (142/142), done.


Import packages

In [3]:
import torch
from random import randrange
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Template Exploration

In [None]:
model_name = 'acorreal/phi3-mental-health'
adapter_name = 'acorreal/adapter-phi-3-mini-mental-health'
compute_dtype = torch.bfloat16

In [None]:
%%time

# Load model
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=compute_dtype)
model = PeftModel.from_pretrained(model, adapter_name)
model = model.merge_and_unload()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(adapter_name)

# Print model name
print("Model:", model.name_or_path)
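
`merge_and_unload()` folds the adapter into the base weights, so inference no longer needs the PEFT wrapper. For a standard LoRA adapter this amounts to adding the scaled low-rank product to each adapted weight matrix: W' = W + (lora_alpha / r) * B A. The toy sketch below illustrates that arithmetic with plain lists. It is not PEFT's implementation, and `matmul`, `merge_lora`, and all numbers are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_alpha / r) * (lora_b @ lora_a).

    Shapes: weight is d_out x d_in, lora_b is d_out x r, lora_a is r x d_in.
    """
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy 2x2 layer with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]       # r x d_in
B = [[0.5], [-0.5]]    # d_out x r
merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [-1.0, -1.0]]
```

After the merge the layer is an ordinary dense weight again, which is why the merged model can be handed straight to `pipeline`.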

In [None]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

In [None]:
# Test the template
pipe.tokenizer.apply_chat_template([
    {
        "role": "user",
        "content": "Hello, I am stressed"}
    ],
    tokenize=False,
    add_generation_prompt=True
)
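
To make the template easier to reason about, here is a pure-Python approximation of the Phi-3 instruct chat format. It is a sketch under assumptions: `render_phi3_chat` is a hypothetical helper, and the real template shipped with the tokenizer may differ by revision (for example by prepending a BOS token). The cell above remains the authoritative output.

```python
def render_phi3_chat(messages, add_generation_prompt=False):
    """Approximate the Phi-3 instruct chat layout for user/assistant turns."""
    parts = []
    for message in messages:
        # Each turn is a role tag, the content, and an end-of-turn marker.
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|assistant|>\n")
    return "".join(parts)

rendered = render_phi3_chat(
    [{"role": "user", "content": "Hello, I am stressed"}],
    add_generation_prompt=True,
)
print(repr(rendered))
```

With `add_generation_prompt=True` the string ends with an open assistant turn, which is what makes the model answer rather than continue the user message.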

In [None]:
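def predict(prompt: str) -> str:
    # Wrap the user message in the chat template so the model sees the control tokens it expects
    prompt = pipe.tokenizer.apply_chat_template([{"role": "user", "content": prompt}], tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=256, do_sample=True, num_beams=1, temperature=0.3, top_k=50, top_p=0.95, max_time=180)
    print(outputs)
    # generated_text contains the prompt followed by the completion; return only the completion
    return outputs[0]['generated_text'][len(prompt):].strip()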
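
By default the text-generation pipeline returns the prompt followed by the completion in `generated_text`, which is why `predict` slices off `len(prompt)` characters. The sketch below reproduces that step on a hand-written output so it can be checked without a model. `strip_prompt` and the sample strings are hypothetical.

```python
def strip_prompt(prompt, pipeline_output):
    """Return only the completion from a text-generation style output list."""
    generated = pipeline_output[0]["generated_text"]
    # Only slice when the output really starts with the prompt, to avoid cutting real text.
    if generated.startswith(prompt):
        generated = generated[len(prompt):]
    return generated.strip()

fake_prompt = "<|user|>\nHello, I am stressed<|end|>\n<|assistant|>\n"
fake_output = [{"generated_text": fake_prompt + " I'm sorry to hear that. "}]
print(strip_prompt(fake_prompt, fake_output))
```

An alternative is to pass `return_full_text=False` to the pipeline call, which returns only the newly generated text.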

In [None]:
%%time
predict("i am going through some things with my feelings and myself i barely sleep and i do nothing but think about how i am worthless and how i should not be here i have never tried or contemplated suicide i have always wanted to fix my issues but i never get around to it how can i change my feeling of being worthless to everyone")

# Testing Generated Responses

## Preparing the dataset

In [5]:
import pandas as pd
from datasets import Dataset

In [9]:
df = pd.read_csv('dataset.csv')
df.columns = ['input', 'output']
df['instruction'] = "You are a mental health assistant. Your job is to provide emotional support, actively listen, and offer practical suggestions for well-being. Respond empathically and do not give specific medical advice or diagnoses. Always make sure the user feels heard and supported. If the user mentions suicidal thoughts, encourage them to seek professional help immediately. Here's the conversation so far:\n\n"
df.head()

Unnamed: 0,input,output,instruction
0,i am going through some things with my feeling...,if everyone thinks you are worthless then mayb...,You are a mental health assistant. Your job is...
1,i am going through some things with my feeling...,hello and thank you for your question and seek...,You are a mental health assistant. Your job is...
2,i am going through some things with my feeling...,first thing i would suggest is getting the sle...,You are a mental health assistant. Your job is...
3,i am going through some things with my feeling...,therapy is essential for those that are feelin...,You are a mental health assistant. Your job is...
4,i am going through some things with my feeling...,i first want to let you know that you are not ...,You are a mental health assistant. Your job is...


In [10]:
# Load the dataset
dataset = Dataset.from_pandas(df)
dataset

Dataset({
    features: ['input', 'output', 'instruction'],
    num_rows: 2747
})

In [11]:
def create_message_column(row):
    messages = []
    user = {
        "content": f"{row['instruction']}\n Input: {row['input']}",
        "role": "user"
    }
    messages.append(user)
    assistant = {
        "content": f"{row['output']}",
        "role": "assistant"
    }
    messages.append(assistant)
    return {"messages": messages}

def format_dataset_chatml(row):
    return {"text": tokenizer.apply_chat_template(row["messages"], add_generation_prompt=False, tokenize=False)}
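
To see what `create_message_column` produces, the sketch below applies the same logic to a toy row. The function body mirrors the cell above, while the row contents are made up for illustration.

```python
def create_message_column(row):
    """Build a two-turn chat (user, assistant) from one dataset row."""
    user = {
        "content": f"{row['instruction']}\n Input: {row['input']}",
        "role": "user",
    }
    assistant = {"content": f"{row['output']}", "role": "assistant"}
    return {"messages": [user, assistant]}

toy_row = {
    "instruction": "Be supportive.\n\n",   # ends with two newlines, like the real instruction
    "input": "i feel stressed",
    "output": "that sounds hard",
}
messages = create_message_column(toy_row)["messages"]
print(repr(messages[0]["content"]))
```

Because the instruction string already ends in two newlines, the user turn contains three consecutive newlines and a leading space before `Input:`. This is harmless, but the prompts used at evaluation time need the same layout as the ones used for training.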

In [12]:
model_id = "microsoft/Phi-3-mini-4k-instruct"

In [14]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = 'right' # to prevent warnings



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [15]:
dataset_chatml = dataset.map(create_message_column)
dataset_chatml = dataset_chatml.map(format_dataset_chatml)


In [16]:
dataset_chatml

Dataset({
    features: ['input', 'output', 'instruction', 'messages', 'text'],
    num_rows: 2747
})

In [None]:
dataset_chatml[0]

{'input': 'i am going through some things with my feelings and myself i barely sleep and i do nothing but think about how i am worthless and how i should not be here i have never tried or contemplated suicide i have always wanted to fix my issues but i never get around to it how can i change my feeling of being worthless to everyone',
 'output': 'if everyone thinks you are worthless then maybe you need to find new people to hang out withseriously the social context in which a person lives is a big influence in selfesteemotherwise you can go round and round trying to understand why you are not worthless then go back to the same crowd and be knocked down againthere are many inspirational messages you can find in social media maybe read some of the ones which state that no person is worthless and that everyone has a good purpose to their lifealso since our culture is so saturated with the belief that if someone does not feel good about themselves that this is somehow terriblebad feelings 

In [17]:
dataset_chatml = dataset_chatml.train_test_split(test_size=0.05, seed=1234)
dataset_chatml

DatasetDict({
    train: Dataset({
        features: ['input', 'output', 'instruction', 'messages', 'text'],
        num_rows: 2609
    })
    test: Dataset({
        features: ['input', 'output', 'instruction', 'messages', 'text'],
        num_rows: 138
    })
})
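
The split sizes follow from the row count: 5% of 2747 is 137.35, which rounds up to 138 test rows and leaves 2609 for training. The sketch below shows the arithmetic, assuming the test size is rounded up and the remainder goes to train. `split_sizes` is a hypothetical helper, not a `datasets` API.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_rows)
    n_train = n_rows - n_test
    return n_train, n_test

print(split_sizes(2747, 0.05))  # (2609, 138)
```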

## Loading the models

In [None]:
original_model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



In [None]:
finetuned_model_id = 'acorreal/phi3-mental-health'
finetuned_model = AutoModelForCausalLM.from_pretrained(finetuned_model_id, trust_remote_code=True)

In [None]:
original_pipeline = pipeline("text-generation", model=original_model, tokenizer=tokenizer)

In [None]:
finetuned_pipeline = pipeline("text-generation", model=finetuned_model, tokenizer=tokenizer)

In [None]:
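def predict(prompt, pipeline, tokenizer):
    # The template comes from pipeline.tokenizer; the tokenizer argument is kept so existing calls still work
    prompt = pipeline.tokenizer.apply_chat_template([{"role": "user", "content": prompt}], tokenize=False, add_generation_prompt=True)
    outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, num_beams=1, temperature=0.3, top_k=50, top_p=0.95, max_time=180)
    # generated_text contains the prompt followed by the completion; return only the completion
    return outputs[0]['generated_text'][len(prompt):].strip()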

In [None]:
predict(dataset_chatml['test'][0]['messages'][0]['content'], original_pipeline, tokenizer) #Original Model

In [None]:
predict(dataset_chatml['test'][0]['messages'][0]['content'], finetuned_pipeline, tokenizer) #Finetuned Model

In [None]:
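# Generate a response for every example, checkpointing to disk along the way
import json

def load_responses(dataset, path, pipeline, tokenizer):
    result = []
    for counter, chat in enumerate(dataset):
        gen_text = predict(chat['messages'][0]['content'], pipeline, tokenizer)
        print(f"Got prediction: {counter}")
        result.append((chat['input'], gen_text))
        # Checkpoint every 10 predictions so an interrupted run keeps most of its work
        if counter % 10 == 0:
            with open(path, 'w') as f:
                print(f"Saving {len(result)} responses in file")
                json.dump(result, f)
    # Final save: without it, predictions made after the last checkpoint never reach the file
    with open(path, 'w') as f:
        print(f"Saving {len(result)} responses in file")
        json.dump(result, f)
    return result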


In [None]:
responses_fine_tuned_model = load_responses(dataset_chatml['test'], 'fine_tuned_model0.json', finetuned_pipeline, tokenizer)

In [None]:
responses_original_model = load_responses(dataset_chatml['test'], 'original_model.json', original_pipeline, tokenizer)

## Model Responses

In [6]:
model_responses_df = pd.read_json('fine_tuned_model.json') # model responses for testing
model_responses_df.columns = ['input', 'output']
model_responses_df.head()

Unnamed: 0,input,output
0,i want a secure relationship with someone that...,It sounds like you're going through a really t...
1,my boyfriend is in recovery from drug addictio...,I'm truly sorry to hear that you're going thro...
2,is this something i should be worried about sh...,I'm really glad you're reaching out and sharin...
3,my friend is abusing her prescription medicine...,I'm really sorry to hear that you're going thr...
4,i terminated my counseling relationship with a...,I'm really glad to hear that you're considerin...


In [7]:
model_responses_df.shape

(138, 2)

In [19]:
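# to_pandas() takes no column list (its first argument is batch_size), so select the columns afterwards
test_df = dataset_chatml['test'].to_pandas()[["input", "output"]]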

In [21]:
model_responses_df['therapist_output'] = test_df['output']

In [23]:
model_responses_df.shape

(138, 3)

In [24]:
model_responses_df.to_csv('model_responses.csv') # saving as csv

In [30]:
model_responses_df.head()

Unnamed: 0,input,output,therapist_output
0,i want a secure relationship with someone that...,It sounds like you're going through a really t...,good for you on your keen awareness of your di...
1,my boyfriend is in recovery from drug addictio...,I'm truly sorry to hear that you're going thro...,my empathy goes out to you relationships are t...
2,is this something i should be worried about sh...,I'm really glad you're reaching out and sharin...,it can be tricky to figure out if a child is t...
3,my friend is abusing her prescription medicine...,I'm really sorry to hear that you're going thr...,first of all where did you find the pills did ...
4,i terminated my counseling relationship with a...,I'm really glad to hear that you're considerin...,yes your former social worker should return yo...


In [25]:
model_responses_df.to_excel('model_responses.xlsx') # saving as excel

## ROUGE Test

In [190]:
model_responses_df = pd.read_csv('model_responses.csv')

In [191]:
from model_tester import ModelTester
finetune_model_tester = ModelTester(model_responses_df['therapist_output'], model_responses_df['output']) # Initiate tester class
finetune_model_tester.calculate_rouge_score() # Calculate Rouge Score
rouge_score = finetune_model_tester.test_results
rouge_score_df = pd.DataFrame(rouge_score)
rouge_score_df.head()

Unnamed: 0,rouge1,rouge2,rougeL
precision,0.345152,0.041858,0.165014
recall,0.315464,0.036597,0.156128
fmeasure,0.300732,0.035537,0.145425
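
ROUGE-1 measures unigram overlap between a reference (here, the therapist's answer) and a candidate (the model's answer). Precision is the share of candidate tokens found in the reference, recall is the share of reference tokens found in the candidate, and the F-measure is their harmonic mean. The sketch below is a simplified illustration with whitespace tokenization and no stemming. It is not the `rouge_score` package and not necessarily what `ModelTester` does internally, and the sample sentences are made up.

```python
from collections import Counter

def rouge1(reference, candidate):
    """Compute simplified ROUGE-1 precision, recall and F-measure."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    denom = precision + recall
    fmeasure = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}

scores = rouge1("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

Low ROUGE-2 values such as those in the table are expected when two answers convey similar advice in different words, since the metric rewards only exact n-gram matches.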


## LLM Test

In [50]:
import re
import requests
from google.colab import userdata

In [63]:
# Hugging Face API configuration

api_token = userdata.get('HF_TOKEN')

# Headers for the request
headers = {
    "Authorization": f"Bearer {api_token}"
}

# The model to use
model_name = "mistralai/Mistral-7B-v0.1"

# The endpoint for the Hugging Face inference API
api_url = f"https://api-inference.huggingface.co/models/{model_name}"

In [157]:
text_1 = model_responses_df['therapist_output'][0]
text_2 = model_responses_df['output'][0]
input = model_responses_df['input'][0]


In [180]:
def create_template(text_1, text_2, input):
  prompt = f"""
    You will be given a user_question, a mental_health_therapist_answer and a machine_generated_answer.
    Your task is to provide a 'total rating' scoring how well the machine_generated_answer answers the user concerns expressed in the user_question. Use the mental_health_therapist_answer as example of a good answer.
    Give your answer as a float on a scale of 0 to 10, where 0 means that the machine_generated_answer is not helpful at all, and 10 means that the answer completely and helpfully addresses the user_question.

    Provide your feedback as follows:

    Feedback:::
    Total rating: (your rating, as a float between 0 and 10)

    Now here are the question and answers.

    User Question: {input}
    Mental Health Therapist Answer: {text_1}
    Machine Generated Answer: {text_2}

    Feedback:::
    Total rating: """

  return prompt

def get_llm_scores(model_name, prompt):
  # Build the endpoint from the model_name argument instead of relying on the global api_url
  url = f"https://api-inference.huggingface.co/models/{model_name}"

  # The data payload for the POST request
  data = {
      "inputs": prompt,
      "parameters": {
          "max_length": 100,
          "num_return_sequences": 1
      }
  }

  # Make the request to the API
  response = requests.post(url, headers=headers, json=data)

  # Bail out early on HTTP errors
  if response.status_code != 200:
    print(f"Request failed with status code {response.status_code}: {response.text}")
    return None

  # Parse the response
  generated_text = response.json()[0]["generated_text"]
  # Accept integer ratings ("8") as well as decimals ("7.5")
  rating = re.search(r'Total rating: (\d+(?:\.\d+)?)', generated_text)
  if rating is None:
    print("Similarity score not found in the response.")
    return None

  score = float(rating.group(1))
  print(f"Similarity Score: {score}")
  return score
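
The rating extraction can be checked offline, without calling the API. The sketch below uses a hypothetical `parse_rating` helper that accepts integer and decimal ratings and rejects values outside the 0 to 10 scale requested in the prompt. The sample strings are made up.

```python
import re

# One or more digits, optionally followed by a decimal part.
RATING_PATTERN = re.compile(r"Total rating:\s*(\d+(?:\.\d+)?)")

def parse_rating(generated_text):
    """Return the first numeric rating in the text, or None if absent or out of range."""
    match = RATING_PATTERN.search(generated_text)
    if match is None:
        return None
    rating = float(match.group(1))
    return rating if 0.0 <= rating <= 10.0 else None

print(parse_rating("Feedback:::\nTotal rating: 7.5"))
print(parse_rating("Total rating: 8"))
print(parse_rating("Total rating: (your rating, as a float between 0 and 10)"))
```

The placeholder line from the prompt template does not match because a parenthesis, not a digit, follows the label, so the search moves on to the model's actual rating.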


In [181]:
prompt = create_template(text_1, text_2, input)

score = get_llm_scores(model_name, prompt)


Similarity Score: 7.5


In [184]:
result = []

for i in range(model_responses_df.shape[0]):
  prompt = create_template(model_responses_df['therapist_output'][i], model_responses_df['output'][i], model_responses_df['input'][i])
  score = get_llm_scores(model_name, prompt)
  result.append(score)


Similarity Score: 7.5
Similarity Score: 8.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 8.5
Similarity Score: 7.5
Similarity Score: 5.0
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 8.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 8.5
Similarity Score: 8.5
Similarity Score: 7.5
Similarity Score: 0.0
Similarity Score: 7.5
Similarity Score: 8.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 8.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 8.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 8.5
Similarity Score: 7.5
Similarity Score: 7.5
Similarity Score: 5.0
Similarity

In [188]:
df_llm_eval = pd.DataFrame(result, columns=["acorreal/phi3-mental-health"])
df_llm_eval.head()

Unnamed: 0,acorreal/phi3-mental-health
0,7.5
1,8.5
2,7.5
3,7.5
4,8.5


In [189]:
df_llm_eval.to_csv('llm_eval.csv')

## Comparing Models

In [194]:
df_llm_eval = pd.read_csv('llm_eval.csv')
df_llm_eval[['acorreal/phi3-mental-health']].mean()

acorreal/phi3-mental-health    7.601449
dtype: float64
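
`get_llm_scores` returns `None` when a request fails or no rating is found, and pandas stores those entries as NaN, which `mean()` skips by default. The equivalent logic in plain Python is sketched below. `mean_rating` is a hypothetical helper, and the sample list is the first five ratings printed above plus one made-up `None`.

```python
from statistics import mean

def mean_rating(scores):
    """Average the ratings, ignoring entries where no rating was obtained."""
    valid = [s for s in scores if s is not None]
    return mean(valid) if valid else None

sample = [7.5, 8.5, 7.5, 7.5, 8.5, None]
print(mean_rating(sample))
```

Reporting the number of skipped entries alongside the mean makes it clear how many examples the average actually covers.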