<a href="https://colab.research.google.com/github/vitaliy-sharandin/data_science_projects/blob/master/portfolio/nlp/fine-tuned-llm/wisai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WisAI
WisAI is a Llama2-7b model instruction fine-tuned on a depression dataset, intended to help improve mental well-being.

In future iterations it is meant to be trained on multiple philosophical and psychological datasets in order to provide multifaceted answers to complex mental health issues.



# Datasets

Philosophy datasets (* for future training)
* https://www.kaggle.com/datasets/christopherlemke/philosophical-texts
* https://www.workwithdata.com/object/philosophy-science-complete-a-text-on-traditional-problems-schools-thought-book-by-edwin-h-c-hung-0000
* https://www.kaggle.com/datasets/christopherlemke/philosophy-authors-writings-german
* https://www.workwithdata.com/object/philosophical-inquiries-an-introduction-to-problems-philosophy-book-by-nicholas-rescher-0000
* https://www.workwithdata.com/object/roman-stoicism-book-by-edward-vernon-arnold-1857
* https://www.workwithdata.com/object/wisdom-energy-basic-buddhist-teachings-book-by-thubten-yeshe-1935

Psychology and mental health datasets

* Kaggle Depression data for chatbot https://www.kaggle.com/datasets/nupurgopali/depression-data-for-chatbot
* Kaggle Psychometrics dataset https://www.kaggle.com/discussions/general/304994
* Psychometric tests dataset https://ieee-dataport.org/documents/psychometric-tests-dataset
* Psychometric NLP https://paperswithcode.com/dataset/psychometric-nlp
* Reddit mental health dataset https://zenodo.org/record/3941387
* Reddit mental disorders identification https://www.kaggle.com/datasets/kamaruladha/mental-disorders-identification-reddit-nlp
* Kaggle Mental Health Conversational Data https://www.kaggle.com/datasets/elvis23/mental-health-conversational-data
* Kaggle Mental Health FAQ for Chatbot https://www.kaggle.com/narendrageek/mental-health-faq-for-chatbot/code
* A human consciousness questionnaire dataset https://data.mendeley.com/datasets/69p62ksdh6
* paperswithcode Self-reported Mental Health Diagnoses https://paperswithcode.com/dataset/smhd
* paperswithcode Mental Health Summarization Dataset https://paperswithcode.com/dataset/mentsum
* HuggingFace psychology dataset https://huggingface.co/datasets/samhog/psychology-10k


In [1]:
!pip install -U -q gradio
!pip install -U -q datasets
!pip install -U -q bitsandbytes
!pip install -q -U transformers
!pip install -q -U peft
!pip install -q -U accelerate
!pip install -U -q trl

!pip install -U -q evaluate
!pip install -U -q rouge_score
!pip install -U -q optuna

In [2]:
from datasets import load_dataset
import json
import yaml
import gradio as gr
import torch
import transformers
from transformers import GenerationConfig, Trainer, TrainingArguments, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
from datasets import Dataset
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model, PeftModel, AutoPeftModelForCausalLM
import numpy as np
from evaluate import load
import optuna

import warnings
warnings.filterwarnings('ignore')

# Dataset instruction transformation

A depression dataset with 51 Q&A entries was used for training in order to save time.

In [3]:
depression_dataset = load_dataset("vitaliy-sharandin/depression-instruct")


First, we convert the dataset to the Alpaca format and create two versions of it: one for training and evaluation, where the formatted prompt includes the reference response, and one for inference, where the formatted prompt stops right before the response.

In [4]:
def formatting_func_train(example):
  # Training prompt: instruction (plus optional context) followed by the reference response
  if example.get("context", "") != "":
    input_prompt = ("Below is an instruction that describes a task, paired with an input that provides further context. "
      "Write a response that appropriately completes the request.\n\n"
      "### Instruction:\n"
      f"{example['instruction']}\n\n"
      "### Input:\n"
      f"{example['context']}\n\n"
      "### Response:\n"
      f"{example['response']}")

  else:
    input_prompt = ("Below is an instruction that describes a task. "
      "Write a response that appropriately completes the request.\n\n"
      "### Instruction:\n"
      f"{example['instruction']}\n\n"
      "### Response:\n"
      f"{example['response']}")

  return {"text" : input_prompt}

def formatting_func_test(example):
  # Inference prompt: same section markers as the training prompt, but without the response,
  # so the model sees exactly the prefix it was trained on
  if example.get("context", "") != "":
    input_prompt = ("Below is an instruction that describes a task, paired with an input that provides further context. "
      "Write a response that appropriately completes the request.\n\n"
      "### Instruction:\n"
      f"{example['instruction']}\n\n"
      "### Input:\n"
      f"{example['context']}\n\n"
      "### Response:\n")

  else:
    input_prompt = ("Below is an instruction that describes a task. "
      "Write a response that appropriately completes the request.\n\n"
      "### Instruction:\n"
      f"{example['instruction']}\n\n"
      "### Response:\n")

  return {"text" : input_prompt}

In [5]:
formatted_depression_dataset_train = depression_dataset.map(formatting_func_train)
formatted_depression_dataset_test = depression_dataset.map(formatting_func_test)
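
To make the difference between the two prompt variants concrete, here is a minimal, self-contained sketch that builds both for a single record. The helper `build_alpaca_prompt` and the sample record are hypothetical and only follow the same Alpaca layout as the formatting functions above; they are not part of the notebook's pipeline.

```python
def build_alpaca_prompt(example, include_response):
    """Build an Alpaca-style prompt (hypothetical helper, for illustration only)."""
    has_context = example.get("context", "") != ""
    if has_context:
        header = ("Below is an instruction that describes a task, paired with an input "
                  "that provides further context. ")
    else:
        header = "Below is an instruction that describes a task. "
    prompt = header + "Write a response that appropriately completes the request.\n\n"
    prompt += f"### Instruction:\n{example['instruction']}\n\n"
    if has_context:
        prompt += f"### Input:\n{example['context']}\n\n"
    prompt += "### Response:\n"
    if include_response:
        # Training prompts end with the reference answer; inference prompts stop here
        prompt += example["response"]
    return prompt

# Made-up record, used only to show the layout
sample = {"instruction": "What is depression?",
          "response": "Depression is a treatable medical illness."}

train_prompt = build_alpaca_prompt(sample, include_response=True)
inference_prompt = build_alpaca_prompt(sample, include_response=False)
print(train_prompt)
```

The inference prompt is a strict prefix of the training prompt, which is what lets the fine-tuned model continue it with a response.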


# Model load

We load the Llama 2 model in 4-bit quantized form and apply LoRA adapters for parameter-efficient fine-tuning (PEFT).

In [7]:
model_name = "NousResearch/Llama-2-7b-hf"

def get_tokenizer():
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  tokenizer.pad_token = tokenizer.eos_token
  tokenizer.padding_side = "right"
  return tokenizer

def get_model():


  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_use_double_quant=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.bfloat16
  )

  qlora_config = LoraConfig(lora_alpha=16,
                          lora_dropout=0.1,
                          r=64,
                          bias="none",
                          task_type="CAUSAL_LM")

  base_training_model = AutoModelForCausalLM.from_pretrained(
      model_name,
      quantization_config=bnb_config,
      device_map = {"": 0}
  )

  base_training_model = prepare_model_for_kbit_training(base_training_model)
  base_training_model = get_peft_model(base_training_model, qlora_config)
  return base_training_model.to('cuda')

# Seed before building the model so the random LoRA initialization is reproducible too
torch.manual_seed(42)

base_training_model = get_model()
tokenizer = get_tokenizer()

print(base_training_model)


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096, padding_idx=0)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): Linear4bit(
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
              )
              (k_proj): Linear4bit(in_features=4096, out_features=4096, b
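
The printed module tree shows rank-64 LoRA matrices attached to `q_proj`. As a rough sketch, the number of trainable parameters this adds can be estimated with plain arithmetic. The assumption here, not confirmed by the truncated output above, is that two 4096x4096 projections per decoder layer (`q_proj` and `v_proj`) carry adapters.

```python
def lora_parameter_count(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank

HIDDEN_SIZE = 4096      # in/out features of the printed q_proj layer
RANK = 64               # r in LoraConfig
NUM_LAYERS = 32         # "32 x LlamaDecoderLayer" in the printed model
ADAPTED_PER_LAYER = 2   # assumption: q_proj and v_proj are adapted

per_adapter = lora_parameter_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
total_trainable = per_adapter * ADAPTED_PER_LAYER * NUM_LAYERS
lora_scaling = 16 / RANK  # lora_alpha / r, the factor applied to the LoRA update

print(f"{per_adapter:,} parameters per adapter")
print(f"{total_trainable:,} trainable parameters in total")
print(f"LoRA scaling factor: {lora_scaling}")
```

Under these assumptions only a small fraction of the roughly 7 billion base weights is trained, which is why the setup fits on a single GPU.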

# Model instruction fine-tuning

Here we define an inference function that returns BLEU, ROUGE and F1 metrics by comparing predicted responses with reference responses. F1 in this notebook is the harmonic mean of BLEU and ROUGE-1. My tests have shown that the standard trainer `compute_metrics` hook is inefficient and poorly suited to instruction fine-tuning, judging by manual inspection of the generated results.
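
The combined score used throughout this notebook is the harmonic mean of BLEU and ROUGE-1. A minimal sketch of that calculation follows; the helper name `harmonic_mean` and the zero guard are additions for illustration, and the input numbers are the BLEU and ROUGE-1 values reported in the evaluation outputs later in this notebook.

```python
def harmonic_mean(bleu, rouge1):
    """Harmonic mean of two scores; returns 0.0 when both are zero to avoid 0/0."""
    total = bleu + rouge1
    if total == 0:
        return 0.0
    return 2 * bleu * rouge1 / total

# BLEU and ROUGE-1 of the base model and of the fine-tuned model
base_f1 = harmonic_mean(0.20509445080013042, 0.49383705775843684)
tuned_f1 = harmonic_mean(0.7014191099752712, 0.8399519847483095)
print(base_f1, tuned_f1)
```

Because the harmonic mean is dominated by the smaller value, a BLEU of zero forces the combined score to zero regardless of ROUGE-1.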

In [13]:
model_trained_checkpoint = 'model-trained-checkpoint'
model_merged = 'model-merged'

def run_inference(model, test_dataset, test_size=None):

  if test_size:
    test_dataset = test_dataset['train'].select(range(test_size))
  else:
    test_dataset = test_dataset['train']

  inputs = test_dataset['text']
  reference_responses = test_dataset['response']

  # Build the generation pipeline once instead of once per prompt
  pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

  decoded_predictions = []
  for prompt in inputs:
    # Note: max_length counts the prompt tokens as well as the generated ones
    prediction = pipe(prompt, max_length=200, top_p=0.9, temperature=0.9, num_return_sequences=1, return_full_text=False)[0]['generated_text']
    # Keep only the text before the next "###" marker; unlike find(), split() does not
    # cut off the last character when the marker is absent
    decoded_predictions.append(prediction.split("###")[0])

  for prompt, pred, label in zip(inputs[:3], decoded_predictions[:3], reference_responses[:3]):
    print("[Input]:\n\n", prompt)
    print("[Prediction]:\n\n", pred)
    print("[Reference response]:\n\n", label)
    print("----\n\n")

  bleu = load("bleu")
  bleu_results = bleu.compute(predictions=decoded_predictions, references=reference_responses)

  rouge = load('rouge')
  rouge_results = rouge.compute(predictions=decoded_predictions, references=reference_responses)

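  # Harmonic mean of BLEU and ROUGE-1; fall back to 0.0 when both are zero
  denominator = bleu_results['bleu'] + rouge_results['rouge1']
  f1 = 2 * (bleu_results['bleu'] * rouge_results['rouge1']) / denominator if denominator else 0.0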

  scores = {
        "bleu": bleu_results["bleu"],
        "rouge1": rouge_results["rouge1"],
        "rouge2": rouge_results["rouge2"],
        "rougeL": rouge_results["rougeL"],
        "f1": f1
    }

  return scores

def bleu_rouge_f1_metrics(eval_pred):
  predictions, labels = eval_pred
  predictions = np.argmax(predictions, axis=-1)

  labels = [[idx for idx in label if idx != -100] for label in labels]
  predictions = [[idx for idx in prediction if idx != -100] for prediction in predictions]

  decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
  decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

  bleu = load("bleu")
  bleu_results = bleu.compute(predictions=decoded_predictions, references=decoded_labels)

  rouge = load('rouge')
  rouge_results = rouge.compute(predictions=decoded_predictions, references=decoded_labels)

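  # Harmonic mean of BLEU and ROUGE-1; fall back to 0.0 when both are zero
  denominator = bleu_results['bleu'] + rouge_results['rouge1']
  f1 = 2 * (bleu_results['bleu'] * rouge_results['rouge1']) / denominator if denominator else 0.0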

  scores = {
        "bleu": bleu_results["bleu"],
        "rouge1": rouge_results["rouge1"],
        "rouge2": rouge_results["rouge2"],
        "rougeL": rouge_results["rougeL"],
        "f1": f1
    }

  return scores

def fine_tune(model, tokenizer, train_dataset, eval_dataset, metrics_func, only_evaluate=False):

  supervised_finetuning_trainer = SFTTrainer(model=model,
                                            train_dataset=train_dataset,
                                            eval_dataset=eval_dataset,
                                            args=TrainingArguments(
                                                output_dir="./training_results",
                                                num_train_epochs=20,
                                                per_device_train_batch_size=8,
                                                per_device_eval_batch_size=1,
                                                gradient_accumulation_steps=2,
                                                optim="paged_adamw_8bit",
                                                save_steps=1000,
                                                logging_steps=30,
                                                learning_rate=2e-4,
                                                weight_decay=0.001,
                                                fp16=False,
                                                bf16=False,
                                                max_grad_norm=0.3,
                                                max_steps=-1,
                                                warmup_ratio=0.3,  # note: ignored by the "constant" scheduler; "constant_with_warmup" would apply it
                                                group_by_length=True,
                                                lr_scheduler_type="constant"
                                            ),
                                            dataset_text_field="text",
                                            max_seq_length=2048,
                                            compute_metrics=metrics_func,
                                            data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False))

  if only_evaluate:
    return supervised_finetuning_trainer.evaluate()

  else:
    supervised_finetuning_trainer.train()
    supervised_finetuning_trainer.model.save_pretrained(model_trained_checkpoint)
    tokenizer.save_pretrained(model_trained_checkpoint)

    return supervised_finetuning_trainer.evaluate()
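
One likely reason the teacher-forced numbers from `bleu_rouge_f1_metrics` are hard to interpret is alignment. In a causal language model, the prediction made at position t targets the token at position t + 1, and padded label positions are marked with -100. The sketch below shows that alignment on plain lists. It assumes the metric function receives unshifted logits and labels, and the helper `align_teacher_forced` and the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value used to mask positions out of the loss

def align_teacher_forced(predicted_ids, label_ids):
    """Pair each predicted token with the label it is supposed to predict."""
    shifted_predictions = predicted_ids[:-1]  # the prediction made at position t ...
    shifted_labels = label_ids[1:]            # ... targets the token at position t + 1
    kept = [(p, l) for p, l in zip(shifted_predictions, shifted_labels)
            if l != IGNORE_INDEX]             # drop masked (padding) positions in both lists
    predictions = [p for p, _ in kept]
    labels = [l for _, l in kept]
    return predictions, labels

# Toy sequence: a perfect model outputs the next token at every position
label_ids = [11, 12, 13, 14, IGNORE_INDEX, IGNORE_INDEX]
predicted_ids = [12, 13, 14, 2, 2, 2]

aligned_predictions, aligned_labels = align_teacher_forced(predicted_ids, label_ids)
print(aligned_predictions, aligned_labels)
```

Even with correct alignment, such scores describe next-token accuracy given the reference prefix, not free-running generation, which is what `run_inference` measures.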

## Evaluating base model

In [11]:
params = {
    "model": base_training_model,
    "tokenizer": tokenizer,
    "train_dataset": formatted_depression_dataset_train["train"],
    "eval_dataset": formatted_depression_dataset_train["train"],
    "metrics_func": bleu_rouge_f1_metrics,
    "only_evaluate": True
}

fine_tune(**params)


{'eval_loss': 2.4565014839172363,
 'eval_bleu': 0.20509445080013042,
 'eval_rouge1': 0.49383705775843684,
 'eval_rouge2': 0.1582331776422118,
 'eval_rougeL': 0.3995673720041588,
 'eval_f1': 0.28982307681220193,
 'eval_runtime': 28.8624,
 'eval_samples_per_second': 1.767,
 'eval_steps_per_second': 0.243}

Evaluation metrics are quite low, as our model is not yet responding in the expected manner.

In [12]:
run_inference(base_training_model, formatted_depression_dataset_test, 3)

The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonForCausalLM', 'PLBartFo

[Input]:

 Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What Is Depression?

### Response: 

[Prediction]:

 Depression is a mental illness that causes people to feel sad, hopeless, and unworthy. It can also cause people to lose interest in activities they once enjoyed. Depression is a serious condition that can lead to suicide.


[Reference response]:

 Depression is a common and serious medical illness that negatively affects how you feel, the way you think and how you act. Fortunately,it is also treatable. Depression causes feelings of sadness and/or a loss of interest in activities you once enjoyed. It can lead to a variety of emotional and physical problems and can decrease your ability to function at work and at home.
----
[Input]:

 Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
I feel i have let my parents down

### Response

{'bleu': 0.0,
 'rouge1': 0.3077207582061951,
 'rouge2': 0.12001200120012002,
 'rougeL': 0.24336569579288025,
 'f1': 0.0}

As we can see, the inference output is quite coherent but not in line with our expected responses, which is visible in the BLEU and F1 scores.
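
A BLEU of exactly 0.0 alongside a non-zero ROUGE-1 is plausible for paraphrased answers. Unsmoothed BLEU is a geometric mean of 1-gram to 4-gram precisions, so a single order with no matches zeroes the whole score. The sketch below is a simplified, single-sentence BLEU over whitespace tokens. It is not the implementation behind `load("bleu")`, which works at corpus level with its own tokenizer, and the two sentences are shortened examples.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(prediction, reference, max_order=4):
    """Unsmoothed single-reference BLEU over whitespace tokens (illustrative only)."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    precisions = []
    for n in range(1, max_order + 1):
        pred_ngrams = ngram_counts(pred_tokens, n)
        ref_ngrams = ngram_counts(ref_tokens, n)
        overlap = sum((pred_ngrams & ref_ngrams).values())  # clipped matches
        total = sum(pred_ngrams.values())
        precisions.append(overlap / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0  # one n-gram order without matches zeroes the geometric mean
    geometric_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    if len(pred_tokens) >= len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref_tokens) / len(pred_tokens))
    return geometric_mean * brevity_penalty

reference = "depression is a common and serious medical illness"
paraphrase = "depression is a mental illness that causes sadness"

print(simple_bleu(paraphrase, reference))  # shares words, but no 4-gram
print(simple_bleu(reference, reference))   # identical text
```

The paraphrase shares several words and even a 3-gram with the reference, yet scores zero because no 4-gram matches, while word-overlap metrics such as ROUGE-1 still give it credit.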

Let's train the model and see how metrics and responses change.

## Fine-tuning base and evaluating tuned model

In [14]:
params = {
    "model": base_training_model,
    "tokenizer": tokenizer,
    "train_dataset": formatted_depression_dataset_train["train"],
    "eval_dataset": formatted_depression_dataset_train["train"],
    "metrics_func": bleu_rouge_f1_metrics
}

fine_tune(**params)

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


| Step | Training Loss |
|------|---------------|
| 30   | 1.5952        |
| 60   | 0.8095        |


{'eval_loss': 0.4774850904941559,
 'eval_bleu': 0.7014191099752712,
 'eval_rouge1': 0.8399519847483095,
 'eval_rouge2': 0.704185510355138,
 'eval_rougeL': 0.8300958932001611,
 'eval_f1': 0.7644601297908439,
 'eval_runtime': 29.2549,
 'eval_samples_per_second': 1.743,
 'eval_steps_per_second': 0.239,
 'epoch': 17.14}

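All scores improve significantly after training. Note that evaluation runs on the same 51 examples the model was trained on, so these numbers measure how well the model fits the training data rather than how well it generalizes.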
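
The training log stops at step 60 and reports `'epoch': 17.14` even though `num_train_epochs=20`. A back-of-the-envelope sketch reproduces both numbers from the batch settings. It assumes the trainer floors the number of optimizer updates per epoch when the batch count is not divisible by `gradient_accumulation_steps`; the helper `training_schedule` is hypothetical.

```python
import math

def training_schedule(num_examples, batch_size, grad_accum_steps, num_epochs):
    """Estimate optimizer steps, assuming updates per epoch are floored."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    total_updates = math.ceil(num_epochs * updates_per_epoch)
    # Epochs actually covered: batches consumed divided by batches per epoch
    epochs_covered = total_updates * grad_accum_steps / batches_per_epoch
    return batches_per_epoch, updates_per_epoch, total_updates, epochs_covered

batches, updates, total, epochs = training_schedule(
    num_examples=51, batch_size=8, grad_accum_steps=2, num_epochs=20)
print(batches, updates, total, round(epochs, 2))
```

With 51 examples and a batch size of 8 there are 7 batches per epoch, so accumulating over 2 batches leaves one batch over each epoch, and 60 updates consume 120 batches, or about 17.14 passes over the data.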

In [15]:
run_inference(base_training_model, formatted_depression_dataset_test, 3)


[Input]:

 Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What Is Depression?

### Response: 

[Prediction]:

 Depression is a common and serious medical illness that negatively affects how you feel, the way you think and how you act. Fortunately, it is also treatable. Depression causes feelings of sadness and/or a loss of interest in activities you once enjoyed. It can lead to a variety of emotional and physical problems and can decrease your ability to function at work and at home. If you are experiencing serious changes in mood and/or behavior that concern you, or if you just can't shake feelings of sadness, consult your doctor for a proper diagnosis and to discuss your treatment options. (National Institute of Mental Health)


[Reference response]:

 Depression is a common and serious medical illness that negatively affects how you feel, the way you think and how you act. Fortunately,it is also treatable. 

{'bleu': 0.2701336581837428,
 'rouge1': 0.43566833060988347,
 'rouge2': 0.31683527044604154,
 'rougeL': 0.3913701989861,
 'f1': 0.33348922720835183}

After running inference on several examples, we clearly see an improvement in the responses: not only are they coherent, they also reproduce parts of the reference responses, which indicates that the fine-tuning took effect.

# Saving fine-tuned model with adapters to Hugging Face

Now we can merge the base model with the saved PEFT adapters and push the result to a Hugging Face repo.

In [None]:
# tuned_model = AutoPeftModelForCausalLM.from_pretrained(
#     model_trained_checkpoint,
#     low_cpu_mem_usage=True,
#     torch_dtype=torch.float16,
#     device_map = {"": 0}
# )
# tokenizer = AutoTokenizer.from_pretrained(model_trained_checkpoint)

# merged_model = tuned_model.merge_and_unload()

# # merged_model.save_pretrained(model_merged, safe_serialization=True)
# # tokenizer.save_pretrained(model_merged)

# import os
# token = os.environ["HF_TOKEN"]  # assumed variable name; load the token from the environment instead of hard-coding it
# merged_model.push_to_hub("vitaliy-sharandin/wisai", token=token)
# tokenizer.push_to_hub("vitaliy-sharandin/wisai", token=token)
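
Conceptually, merging folds each adapter into its base weight as W + (lora_alpha / r) * B @ A, after which the adapter matrices are no longer needed at inference time. The sketch below shows this on a toy 2x2 layer with plain Python lists. It illustrates the standard LoRA update rule, not the internals of `merge_and_unload`, and all numbers are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, lora_alpha, rank):
    """Fold a LoRA update into the base weight: W + (lora_alpha / rank) * B @ A."""
    scaling = lora_alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy 2x2 layer with a rank-1 adapter
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # shape (rank, in_features)
lora_b = [[0.5], [1.0]]    # shape (out_features, rank)

merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, rank=1)
print(merged)
```

With the settings used in this notebook (`lora_alpha=16`, `r=64`) the scaling factor would be 0.25.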

# Chatbot launch

In [None]:
# gen_config = GenerationConfig(
#     do_sample=True,
#     temperature=0.9,
#     max_new_tokens=150,
#     pad_token_id=tokenizer.eos_token_id,
#     num_return_sequences=1
# )

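# def predict(prompt):
#     # `model` is assumed to be the merged model from the previous section
#     encoded_input = tokenizer(prompt, return_tensors='pt').to(model.device)  # keep inputs on the model's device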
#     input_length = len(encoded_input["input_ids"][0])
#     output_ids = model.generate(generation_config=gen_config, **encoded_input)[0]
#     output = tokenizer.decode(output_ids[input_length:], skip_special_tokens=True)
#     return output

# #gr.Interface(fn=predict, inputs="text", outputs="text").launch()
# print(predict("What is Depression?"))

In [None]:
# import os
# token = os.environ["HF_TOKEN"]  # assumed variable name; load the token from the environment instead of hard-coding it
# model.push_to_hub("wisai", use_auth_token=token)
# gen_config.push_to_hub("wisai", "generation_config.json", use_auth_token=token)