<a href="https://colab.research.google.com/github/vitaliy-sharandin/data_science_projects/blob/master/portfolio/nlp/fine-tuned-llm/wisai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WisAI
### WisAI is a Llama-2-7B model (`NousResearch/Llama-2-7b-hf`) fine-tuned on philosophical and psychological data and configured to provide useful advice.



# Datasets

### Philosophy datasets
* https://www.kaggle.com/datasets/christopherlemke/philosophical-texts
* https://www.workwithdata.com/object/philosophy-science-complete-a-text-on-traditional-problems-schools-thought-book-by-edwin-h-c-hung-0000
* https://www.kaggle.com/datasets/christopherlemke/philosophy-authors-writings-german
* https://www.workwithdata.com/object/philosophical-inquiries-an-introduction-to-problems-philosophy-book-by-nicholas-rescher-0000
* https://www.workwithdata.com/object/roman-stoicism-book-by-edward-vernon-arnold-1857
* https://www.workwithdata.com/object/wisdom-energy-basic-buddhist-teachings-book-by-thubten-yeshe-1935

### Psychology and mental health datasets

#### Text datasets


* Kaggle Psychometrics dataset https://www.kaggle.com/discussions/general/304994
* Psychometric tests dataset https://ieee-dataport.org/documents/psychometric-tests-dataset
* Psychometric NLP https://paperswithcode.com/dataset/psychometric-nlp
* Reddit mental health dataset https://zenodo.org/record/3941387
* Reddit mental disorders identification https://www.kaggle.com/datasets/kamaruladha/mental-disorders-identification-reddit-nlp
* Kaggle Mental Health Conversational Data https://www.kaggle.com/datasets/elvis23/mental-health-conversational-data
* Kaggle Mental Health FAQ for Chatbot https://www.kaggle.com/narendrageek/mental-health-faq-for-chatbot/code
* A human consciousness questionnaire dataset https://data.mendeley.com/datasets/69p62ksdh6
* paperswithcode Self-reported Mental Health Diagnoses https://paperswithcode.com/dataset/smhd
* paperswithcode Mental Health Summarization Dataset https://paperswithcode.com/dataset/mentsum
* HuggingFace psychology dataset https://huggingface.co/datasets/samhog/psychology-10k

#### Text2Text datasets
* Kaggle Depression data for chatbot https://www.kaggle.com/datasets/nupurgopali/depression-data-for-chatbot

#### Classification datasets
* Classification for mental health https://www.kaggle.com/datasets/reihanenamdari/mental-health-corpus
* Depression identification https://www.kaggle.com/datasets/infamouscoder/depression-reddit-cleaned

## Tutorials
https://www.philschmid.de/instruction-tune-llama-2

In [1]:
!pip install -U -q gradio
!pip install -U -q datasets
!pip install -U -q bitsandbytes
!pip install -q -U transformers
!pip install -q -U peft
!pip install -q -U accelerate
!pip install -U -q trl
!pip install -U -q optuna

!pip install -U -q evaluate
!pip install -U -q rouge_score

In [1]:
from datasets import load_dataset
import json
import yaml
import gradio as gr
import torch
import transformers
from transformers import GenerationConfig, Trainer, TrainingArguments, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
from datasets import Dataset
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model, PeftModel, AutoPeftModelForCausalLM
import numpy as np
from evaluate import load
import optuna

import warnings
warnings.filterwarnings('ignore')

# Dataset instruction transformation

In [2]:
depression_dataset = load_dataset("vitaliy-sharandin/depression-instruct")

In [3]:
INTRO_WITH_CONTEXT = ("Below is an instruction that describes a task, paired with an input that provides further context. "
                      "Write a response that appropriately completes the request.\n\n")
INTRO_NO_CONTEXT = ("Below is an instruction that describes a task. "
                    "Write a response that appropriately completes the request.\n\n")

def build_prompt(instruction, context="", response=""):
  # Alpaca-style prompt. Training and test prompts share this template so the
  # section markers the model sees at inference match the ones it was trained on.
  if context:
    return (f"{INTRO_WITH_CONTEXT}"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            f"### Response:\n{response}")
  return (f"{INTRO_NO_CONTEXT}"
          f"### Instruction:\n{instruction}\n\n"
          f"### Response:\n{response}")

def formatting_func(examples):
  # Batched variant: `examples` maps column names to lists, so the context
  # has to be checked per row rather than once for the whole batch.
  contexts = examples.get("context") or [""] * len(examples["instruction"])
  return [build_prompt(instruction, context, response)
          for instruction, context, response
          in zip(examples["instruction"], contexts, examples["response"])]

def formatting_func_train(example):
  # Full prompt including the reference response.
  return {"text": build_prompt(example["instruction"], example.get("context"), example["response"])}

def formatting_func_test(example):
  # Prompt only; the model is expected to generate the response.
  return {"text": build_prompt(example["instruction"], example.get("context"))}

In [4]:
formatted_depression_dataset_train = depression_dataset.map(formatting_func_train)
formatted_depression_dataset_test = depression_dataset.map(formatting_func_test)
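
To see what the two mappings produce, here is a minimal, self-contained sketch of the prompt layout. The helper `render_prompt` and the sample record are hypothetical and only for illustration; the record is made up, not taken from the dataset, and the section markers are written without trailing spaces for simplicity.

```python
def render_prompt(instruction, context="", response=""):
    """Render one record as an instruction prompt (hypothetical helper)."""
    if context:
        header = ("Below is an instruction that describes a task, paired with an input "
                  "that provides further context. ")
        body = f"### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n"
    else:
        header = "Below is an instruction that describes a task. "
        body = f"### Instruction:\n{instruction}\n\n"
    header += "Write a response that appropriately completes the request.\n\n"
    return f"{header}{body}### Response:\n{response}"

# Made-up record for illustration only.
record = {"instruction": "What is depression?",
          "response": "Depression is a mood disorder."}

train_text = render_prompt(record["instruction"], response=record["response"])
test_text = render_prompt(record["instruction"])

print(test_text)
```

The test prompt is the training prompt with the response left off, so the model is asked to continue exactly where the reference answer would start.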

# Model load

In [36]:
MODEL_NAME = "NousResearch/Llama-2-7b-hf"

def get_tokenizer():
  tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
  # Use EOS as the padding token and pad on the right.
  tokenizer.pad_token = tokenizer.eos_token
  tokenizer.padding_side = "right"
  return tokenizer

def get_model():
  # 4-bit NF4 quantization with double quantization; matmuls run in bfloat16.
  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_use_double_quant=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.bfloat16
  )

  # LoRA adapter settings: rank 64, scaling alpha 16, dropout 0.1.
  qlora_config = LoraConfig(lora_alpha=16,
                          lora_dropout=0.1,
                          r=64,
                          bias="none",
                          task_type="CAUSAL_LM")

  base_training_model = AutoModelForCausalLM.from_pretrained(
      MODEL_NAME,
      quantization_config=bnb_config,
      device_map = {"": 0}  # place the whole model on GPU 0
  )

  base_training_model = prepare_model_for_kbit_training(base_training_model)
  # device_map already placed the quantized weights on the GPU,
  # so an extra .to('cuda') is not needed.
  return get_peft_model(base_training_model, qlora_config)

# Seed before building the model so the LoRA weight initialization is reproducible.
torch.manual_seed(42)

base_training_model = get_model()
tokenizer = get_tokenizer()

print(base_training_model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096, padding_idx=0)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): Linear4bit(
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
              )
              (k_proj): Linear4bit(in_features=4096, out_features=4096, b
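
The printed `q_proj` shapes make it easy to estimate how small one LoRA adapter is next to the frozen weight it adapts. The sketch below is plain arithmetic on the numbers visible in the output above (4096 → 64 → 4096, no biases). The helper name `lora_param_count` is hypothetical, and because the printout is truncated, no total across the whole model is attempted.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters in one LoRA adapter: A is (in x r), B is (r x out), no biases."""
    return in_features * rank + rank * out_features

# Shapes copied from the printed q_proj module: 4096 -> 64 -> 4096.
per_projection = lora_param_count(4096, 4096, 64)
base_weights = 4096 * 4096  # size of the frozen Linear4bit weight being adapted

print(per_projection)                 # 524288
print(per_projection / base_weights)  # 0.03125
```

So each adapted projection trains roughly 3% as many parameters as the base weight matrix it sits on.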

# Model instruction fine-tuning

In [42]:
model_trained_checkpoint = 'model-trained-checkpoint'
model_merged = 'model-merged'

def run_inference(model, inputs, reference_responses):

  # Build the pipeline once instead of once per prompt.
  pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

  decoded_predictions = []
  for prompt in inputs:
    # temperature and top_p only take effect when sampling is enabled.
    prediction = pipe(prompt, max_length=200, do_sample=True, top_p=0.9, temperature=0.9, num_return_sequences=1, return_full_text=False)[0]['generated_text']
    # Keep only the text before the next "###" section marker, if there is one.
    decoded_predictions.append(prediction.split("###")[0])

  for prompt, pred, label in zip(inputs[:3], decoded_predictions[:3], reference_responses[:3]):
    print("[Input]:\n\n", prompt)
    print("[Prediction]:\n\n", pred)
    print("[Reference response]:\n\n", label)
    print("----")

  bleu = load("bleu")
  bleu_results = bleu.compute(predictions=decoded_predictions, references=reference_responses)

  rouge = load('rouge')
  rouge_results = rouge.compute(predictions=decoded_predictions, references=reference_responses)

  # Harmonic mean of BLEU and ROUGE-1; defined as 0 when both are 0.
  bleu_rouge_sum = bleu_results['bleu'] + rouge_results['rouge1']
  f1 = 2 * bleu_results['bleu'] * rouge_results['rouge1'] / bleu_rouge_sum if bleu_rouge_sum else 0.0

  scores = {
        "bleu": bleu_results["bleu"],
        "rouge1": rouge_results["rouge1"],
        "rouge2": rouge_results["rouge2"],
        "rougeL": rouge_results["rougeL"],
        "f1": f1
    }

  return scores


def bleu_rouge_f1_metrics(eval_pred):
  logits, labels = eval_pred
  # Logits at position t predict the token at position t + 1,
  # so drop the last prediction and the first label before comparing.
  predictions = np.argmax(logits, axis=-1)[:, :-1]
  labels = labels[:, 1:]

  # Drop positions masked with -100 in the labels from both sequences.
  predictions = [[idx for idx, label_idx in zip(prediction, label) if label_idx != -100] for prediction, label in zip(predictions, labels)]
  labels = [[idx for idx in label if idx != -100] for label in labels]

  decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
  decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

  for pred, label in zip(decoded_predictions[:3], decoded_labels[:3]):
    print("[Prediction]:\n\n", pred)
    print("[Reference response]:\n\n", label)
    print("----")

  bleu = load("bleu")
  bleu_results = bleu.compute(predictions=decoded_predictions, references=decoded_labels)

  rouge = load('rouge')
  rouge_results = rouge.compute(predictions=decoded_predictions, references=decoded_labels)

  # Harmonic mean of BLEU and ROUGE-1; defined as 0 when both are 0.
  bleu_rouge_sum = bleu_results['bleu'] + rouge_results['rouge1']
  f1 = 2 * bleu_results['bleu'] * rouge_results['rouge1'] / bleu_rouge_sum if bleu_rouge_sum else 0.0

  scores = {
        "bleu": bleu_results["bleu"],
        "rouge1": rouge_results["rouge1"],
        "rouge2": rouge_results["rouge2"],
        "rougeL": rouge_results["rougeL"],
        "f1": f1
    }

  return scores


def fine_tune(training_model, tokenizer, train_dataset, eval_dataset, test_dataset, model_init_func, metrics_func, inference_func, only_evaluate=False, only_inference=False, hyper_opt=False, trained_inference=False):

  reference_responses = test_dataset['response']

  training_args = {
        "output_dir": "./training_results",
        "num_train_epochs": 10,
        "per_device_train_batch_size": 8,
        "gradient_accumulation_steps": 2,
        "optim": "paged_adamw_8bit",
        "save_steps": 1000,
        "logging_steps": 30,
        "learning_rate": 2e-4,
        "weight_decay": 0.001,
        "fp16": False,
        "bf16": False,
        "max_grad_norm": 0.3,
        "max_steps": -1,
        "warmup_ratio": 0.3,
        "group_by_length": True,
        "lr_scheduler_type": "constant"
    }

  supervised_finetuning_trainer = SFTTrainer(model=training_model,
                                            model_init=model_init_func,
                                            train_dataset=train_dataset,
                                            eval_dataset=eval_dataset,
                                            args=TrainingArguments(**training_args),
                                            dataset_text_field="text",
                                            max_seq_length=2048,
                                            compute_metrics=metrics_func,
                                            data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False))

  if only_evaluate:
    return supervised_finetuning_trainer.evaluate()

  elif only_inference:
    return inference_func(supervised_finetuning_trainer.model, test_dataset['text'], reference_responses)

  elif hyper_opt:
    hyper_args = lambda trial: {
          "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 25),
          "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [4, 8, 16]),
          "gradient_accumulation_steps": trial.suggest_int("gradient_accumulation_steps", 1, 5),
          "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
          "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True),
          "max_grad_norm": trial.suggest_float("max_grad_norm", 0.1, 0.5),
          "warmup_ratio": trial.suggest_float("warmup_ratio", 0.1, 0.5)
      }
    # A single objective takes a plain string direction; a list would request multi-objective search.
    return supervised_finetuning_trainer.hyperparameter_search(direction="maximize",
                                                              backend="optuna",
                                                              hp_space=hyper_args,
                                                              n_trials=20,
                                                              compute_objective=lambda metrics: metrics['eval_f1'])

  else:
    supervised_finetuning_trainer.train()
    supervised_finetuning_trainer.model.save_pretrained(model_trained_checkpoint)
    tokenizer.save_pretrained(model_trained_checkpoint)

    if trained_inference:
      return inference_func(supervised_finetuning_trainer.model, test_dataset['text'], reference_responses)
    else:
      return supervised_finetuning_trainer.evaluate()
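
When token-level predictions are taken from causal-LM logits by `argmax`, the logits at position t score the token at position t + 1, and label positions set to `-100` are excluded from the loss. A minimal sketch of that alignment for a single sequence, using plain Python lists and hypothetical helper names (`argmax`, `align_predictions`), might look like this:

```python
IGNORE_INDEX = -100  # label value for positions excluded from the loss

def argmax(row):
    """Index of the largest score in a list."""
    return max(range(len(row)), key=row.__getitem__)

def align_predictions(logits, labels):
    """logits: [seq][vocab] scores, labels: [seq] token ids, both for one sequence."""
    predicted = [argmax(scores) for scores in logits]
    # Logits at position t score the token at position t + 1, so drop the
    # last prediction and the first label before pairing them up.
    shifted = zip(predicted[:-1], labels[1:])
    kept = [(p, l) for p, l in shifted if l != IGNORE_INDEX]
    return [p for p, _ in kept], [l for _, l in kept]

# Toy example with a vocabulary of 5 tokens and one masked label position.
logits = [
    [0, 0, 1, 0, 0],  # predicts token 2 for position 1
    [0, 0, 0, 1, 0],  # predicts token 3 for position 2
    [0, 0, 0, 0, 1],  # predicts token 4 for position 3 (masked in the labels)
    [1, 0, 0, 0, 0],  # prediction past the end of the sequence, dropped
]
labels = [1, 2, 3, IGNORE_INDEX]

print(align_predictions(logits, labels))  # ([2, 3], [2, 3])
```

Without the shift, each predicted token would be compared against the token one position earlier, which depresses overlap metrics such as BLEU and ROUGE.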

In [None]:
params = {
    "training_model": base_training_model,
    "tokenizer": tokenizer,
    "train_dataset": formatted_depression_dataset_train["train"].select(range(3)),
    "eval_dataset": formatted_depression_dataset_train["train"].select(range(3)),
    "test_dataset": formatted_depression_dataset_test["train"].select(range(3)),
    "model_init_func": get_model,
    "metrics_func": bleu_rouge_f1_metrics,
    "inference_func": run_inference,
    "only_evaluate": False,
    "only_inference": False,
    "hyper_opt": True,
    "trained_inference": False
}

fine_tune(**params)
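
The search objective `eval_f1` is the harmonic mean of BLEU and ROUGE-1. The plain formula 2ab / (a + b) is undefined when both scores are zero, which is plausible on a three-example subset. A small sketch with an explicit guard follows; the helper name `harmonic_mean` is hypothetical.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores; 0.0 when both are zero."""
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)

print(harmonic_mean(0.5, 0.5))  # 0.5
print(harmonic_mean(0.2, 0.6))  # approximately 0.3
print(harmonic_mean(0.0, 0.0))  # 0.0
```

Because the harmonic mean is pulled toward the smaller value, a trial has to do reasonably well on both BLEU and ROUGE-1 to score highly.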

## Evaluating base model

In [None]:
# scores_untrained_model = fine_tune(**{**params, "hyper_opt": False, "only_evaluate": True})
# scores_untrained_model

## Tuning and evaluating model

In [None]:
# scores_tuned_model = fine_tune(**{**params, "hyper_opt": False})
# scores_tuned_model


#### Experiments
1. Compare trained / untrained / small model results
2. Complete training on all datasets



In [None]:
# import os

# tuned_model = AutoPeftModelForCausalLM.from_pretrained(
#     model_trained_checkpoint,
#     low_cpu_mem_usage=True,
#     torch_dtype=torch.float16,
#     device_map = {"": 0}
# )
# # The tokenizer was saved next to the adapter checkpoint; model_merged does not exist yet.
# tokenizer = AutoTokenizer.from_pretrained(model_trained_checkpoint)

# merged_model = tuned_model.merge_and_unload()

# merged_model.save_pretrained(model_merged, safe_serialization=True)
# tokenizer.save_pretrained(model_merged)

# # Read the access token from the environment instead of hard-coding it in the notebook.
# token = os.environ["HF_TOKEN"]
# merged_model.push_to_hub("vitaliy-sharandin/wisai", token=token)
# tokenizer.push_to_hub("vitaliy-sharandin/wisai", token=token)

# Chatbot launch

In [None]:
# gen_config = GenerationConfig(
#     do_sample=True,
#     temperature=0.9,
#     max_new_tokens=150,
#     pad_token_id=tokenizer.eos_token_id,
#     num_return_sequences=1
# )

# def predict(prompt):
#     encoded_input = tokenizer(prompt, return_tensors='pt').to(model.device)  # keep inputs on the model's device
#     input_length = len(encoded_input["input_ids"][0])
#     output_ids = model.generate(generation_config=gen_config, **encoded_input)[0]
#     output = tokenizer.decode(output_ids[input_length:], skip_special_tokens=True)
#     return output

# #gr.Interface(fn=predict, inputs="text", outputs="text").launch()
# print(predict("What is Depression?"))

# Saving model components to Huggingface

In [None]:
# import os
# # Read the access token from the environment instead of hard-coding it in the notebook.
# token = os.environ["HF_TOKEN"]
# model.push_to_hub("wisai", token=token)
# gen_config.push_to_hub("wisai", token=token)