<a href="https://colab.research.google.com/github/vitaliy-sharandin/data_science_projects/blob/master/portfolio/nlp/fine-tuned-llm/wisai_manual_fine_tune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WisAI
### WisAI is a gpt-neo-125M model fine-tuned on philosophical and psychological data and configured to provide useful advice.

In [1]:
!pip install gradio
!pip install transformers
!pip install datasets
!pip install --upgrade accelerate
!pip install evaluate
!pip install rouge_score


In [2]:
from google.colab import drive
import pandas as pd
import json
import yaml
import gradio as gr
import torch
from transformers import GenerationConfig, Trainer, TrainingArguments, AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq
from datasets import Dataset, concatenate_datasets
from torch.utils.data import random_split
import numpy as np
from evaluate import load

In [3]:
model_name = "EleutherAI/gpt-neo-125M"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

model = AutoModelForCausalLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))

reference_base_model = AutoModelForCausalLM.from_pretrained(model_name)
reference_base_model.resize_token_embeddings(len(tokenizer))

torch.manual_seed(42)

<torch._C.Generator at 0x7fc61ef66d50>

# Training

## Training datasets list

### Psychology and mental health datasets

#### Text datasets


* Kaggle Psychometrics dataset https://www.kaggle.com/discussions/general/304994
* Psychometric tests dataset https://ieee-dataport.org/documents/psychometric-tests-dataset
* Psychometric NLP https://paperswithcode.com/dataset/psychometric-nlp
* Reddit mental health dataset https://zenodo.org/record/3941387
* Reddit mental disorders identification https://www.kaggle.com/datasets/kamaruladha/mental-disorders-identification-reddit-nlp
* Kaggle Mental Health Conversational Data https://www.kaggle.com/datasets/elvis23/mental-health-conversational-data
* Kaggle Mental Health FAQ for Chatbot https://www.kaggle.com/narendrageek/mental-health-faq-for-chatbot/code
* A human consciousness questionnaire dataset https://data.mendeley.com/datasets/69p62ksdh6
* paperswithcode Self-reported Mental Health Diagnoses https://paperswithcode.com/dataset/smhd
* paperswithcode Mental Health Summarization Dataset https://paperswithcode.com/dataset/mentsum
* HuggingFace psychology dataset https://huggingface.co/datasets/samhog/psychology-10k

#### Text2Text datasets
* Kaggle Depression data for chatbot https://www.kaggle.com/datasets/nupurgopali/depression-data-for-chatbot

#### Classification datasets
* Classification for mental health https://www.kaggle.com/datasets/reihanenamdari/mental-health-corpus
* Depression identification https://www.kaggle.com/datasets/infamouscoder/depression-reddit-cleaned

### Philosophy datasets
* https://www.kaggle.com/datasets/christopherlemke/philosophical-texts
* https://www.workwithdata.com/object/philosophy-science-complete-a-text-on-traditional-problems-schools-thought-book-by-edwin-h-c-hung-0000
* https://www.kaggle.com/datasets/christopherlemke/philosophy-authors-writings-german
* https://www.workwithdata.com/object/philosophical-inquiries-an-introduction-to-problems-philosophy-book-by-nicholas-rescher-0000
* https://www.workwithdata.com/object/roman-stoicism-book-by-edward-vernon-arnold-1857
* https://www.workwithdata.com/object/wisdom-energy-basic-buddhist-teachings-book-by-thubten-yeshe-1935

## Training dataset creation

#### Data load and utility methods

In [4]:
drive.mount('/content/drive')

depression_data = []

with open('/content/drive/MyDrive/Data/depression.yml', 'r') as file:
     depression_data = yaml.safe_load(file)

Mounted at /content/drive


In [5]:
def parse_depression_dataset(conversations):
  # The first turn of each conversation is the prompt;
  # all remaining turns are joined into a single completion.
  output = {'prompt':[],'completion':[]}
  for convo in conversations:
    if not convo:
      continue  # skip empty conversations, which have no prompt
    # Replace non-breaking spaces with regular spaces.
    prompt = convo[0].replace("\xa0", " ")
    completion = " ".join(convo[1:]).replace("\xa0", " ").strip()
    output['prompt'].append(prompt)
    output['completion'].append(completion)
  return output
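
The parser above assumes each conversation is a list whose first entry is the prompt and whose remaining entries are joined into one completion. A minimal sketch of that rule on a single toy conversation (the helper name `split_conversation` and the sample text are illustrative and not part of the notebook):

```python
def split_conversation(convo):
    """Split one conversation into (prompt, completion).

    Hypothetical helper mirroring parse_depression_dataset for a single item:
    the first turn is the prompt, later turns are joined with spaces.
    """
    # Replace non-breaking spaces, as the notebook's parser does.
    turns = [turn.replace("\xa0", " ") for turn in convo]
    prompt, replies = turns[0], turns[1:]
    return prompt, " ".join(replies).strip()


# Made-up conversation, used only to show the expected shape of the data.
toy = ["What is depression?", "It is a mood\xa0disorder.", "It is treatable."]
prompt, completion = split_conversation(toy)
print(prompt)
print(completion)
```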

In [6]:
def tokenize_prompt_completion_element_overflow(element):

  prompt = element["prompt"]
  completion = element["completion"]

  prompt_completion_string = f"{tokenizer.bos_token}{prompt}\n{completion}{tokenizer.eos_token}"
  prompt_string = f"{tokenizer.bos_token}{prompt}\n"
  completion_string = f"{completion}{tokenizer.eos_token}"

  prompt_completion_tokens = tokenizer(prompt_completion_string,
                                      truncation=True,
                                      return_overflowing_tokens=True,
                                      return_length=True,
                                      max_length=4,
                                      stride=2)

  prompt_tokens = tokenizer(prompt_string,
                            truncation=True,
                            return_overflowing_tokens=True,
                            return_length=True,
                            max_length=4,
                            stride=2)

  completion_tokens = tokenizer(completion_string,
                                truncation=True,
                                return_overflowing_tokens=True,
                                return_length=True,
                                max_length=4,
                                stride=2)

  # try to flatten, substitute and then reshape the lists back
  # OR just tokenize without batching and then batch tokens

  print('\n'.join([tokenizer.decode(sublist) for sublist in prompt_completion_tokens["input_ids"]]))
  print('\n'.join([tokenizer.decode(sublist) for sublist in prompt_tokens["input_ids"]]))
  print('\n'.join([tokenizer.decode(sublist) for sublist in completion_tokens["input_ids"]]))

  print(prompt_completion_tokens)


  # print(tokenizer.decode(prompt_tokens))
  # print(completion_tokens[len(prompt_tokens):])

  # bos_token_id = tokenizer.bos_token
  # eos_token_id = tokenizer.eos_token_id

  # # How to solve concatenation of list of lists for prompt_tokens and completion_tokens in this case???????????????????????????????????????
  # input_ids = [bos_token_id] + prompt_tokens["input_ids"] + [eos_token_id] + completion_tokens["input_ids"]
  # labels = [-100] + [-100] * len(prompt_tokens["input_ids"]) + [-100] + completion_tokens["input_ids"]

  # # How to solve not knowing the length of prompt list of list in this case????????????????????????????????????????????????
  # input_ids = []
  # labels = []
  # for single_input_ids in tokens.input_ids:
  #     # add bos token at start, eos between prompt and completion.
  #     formatted_input_ids = [bos_token_id] + single_input_ids + [eos_token_id]
  #     input_ids.append(formatted_input_ids)

  #     # create labels
  #     prompt_length = len(single_input_ids) # this may be changed based on your specific needs
  #     formatted_labels = [-100] + [-100] * prompt_length + single_input_ids[prompt_length:]
  #     labels.append(formatted_labels)
  # return {"input_ids": input_ids, "labels": labels}
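
One way to resolve the open question in the comments above is the second option noted there: tokenize the full example once, build the labels, and only then cut both lists into overlapping windows so they stay aligned. A sketch under that assumption (the function name `window_with_stride` and its parameters are hypothetical; `stride` here means the number of tokens shared by consecutive windows, matching the meaning of the tokenizer's `stride` argument):

```python
def window_with_stride(input_ids, labels, max_length, stride):
    """Cut aligned input_ids/labels into overlapping windows.

    Hypothetical helper: consecutive windows share `stride` tokens.
    """
    if len(input_ids) != len(labels):
        raise ValueError("input_ids and labels must have the same length")
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")

    step = max_length - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        end = start + max_length
        windows.append({
            "input_ids": input_ids[start:end],
            "labels": labels[start:end],
        })
        if end >= len(input_ids):
            break
        start += step
    return windows


# Made-up token IDs: three prompt tokens (masked with -100), four completion tokens.
ids = [10, 11, 12, 13, 14, 15, 16]
lbl = [-100, -100, -100, 13, 14, 15, 16]
windows = window_with_stride(ids, lbl, max_length=4, stride=2)
for w in windows:
    print(w)
```

Because the labels are sliced with the same indices as the inputs, the prompt mask survives the split without needing to know where the prompt ends inside each window. Tokens in the overlap appear in two windows and are therefore seen twice during training.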

In [7]:
def tokenize_prompt_completion_element_no_overflow(element):
  prompt = element["prompt"]
  completion = element["completion"]

  prompt_completion_string = f"{tokenizer.bos_token}{prompt}\n{completion}{tokenizer.eos_token}"
  prompt_string = f"{tokenizer.bos_token}{prompt}\n"
  completion_string = f"{completion}{tokenizer.eos_token}"

  prompt_completion_tokens = tokenizer(prompt_completion_string)["input_ids"]
  prompt_tokens = tokenizer(prompt_string)["input_ids"]

  completion_tokens = tokenizer(completion_string)["input_ids"]
  completion_tokens = [-100] * len(prompt_tokens) + completion_tokens

  # print(len(prompt_completion_tokens))
  # print(len(prompt_tokens))
  # print(len(completion_tokens))

  return {"input_ids": prompt_completion_tokens, "labels": completion_tokens}


def tokenize_dataset_no_overflow(dataset):
  tokenized_no_overflow_dataset = dataset.map(tokenize_prompt_completion_element_no_overflow, remove_columns=dataset.column_names)
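  # Fix the split seed so the train/test partition is reproducible;
  # torch.manual_seed does not control this split.
  return tokenized_no_overflow_dataset.train_test_split(test_size=0.2, seed=42)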
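
The masking idea in `tokenize_prompt_completion_element_no_overflow` can be shown without a tokenizer: prompt positions get the label `-100`, which PyTorch's cross-entropy loss ignores by default, so only completion tokens contribute to the loss. A minimal sketch with made-up token IDs (the helper `build_masked_example` is hypothetical). It builds `input_ids` by concatenating the two token lists, which keeps both outputs the same length by construction; tokenizing the joined string separately, as the notebook does, relies on the tokenizer splitting at the same boundary.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_masked_example(prompt_ids, completion_ids):
    """Return input_ids and labels where prompt positions are masked out."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return {"input_ids": input_ids, "labels": labels}


# Made-up IDs standing in for a tokenized prompt and completion.
example = build_masked_example([50256, 2061, 318], [1026, 318, 50256])
print(example)
```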



#### Data tokenization

In [8]:
parsed_depression_data = parse_depression_dataset(depression_data['conversations'])
depression_df = pd.DataFrame(parsed_depression_data)
depression_dataset = Dataset.from_dict(depression_df)

tokenized_dataset = tokenize_dataset_no_overflow(depression_dataset)

data_collator_seq2seq = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model
)

Map:   0%|          | 0/51 [00:00<?, ? examples/s]
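
`DataCollatorForSeq2Seq` fits this setup because, besides padding `input_ids`, it pads `labels` with `-100` so that padded positions are ignored by the loss. A simplified pure-Python sketch of that behaviour, assuming right padding and labels of the same length as the inputs (the function `pad_batch` and the pad ID in the example are illustrative; this is not the library implementation):

```python
def pad_batch(features, pad_token_id, label_pad_token_id=-100):
    """Right-pad every example in a batch to the longest sequence."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Padded label positions are excluded from the loss.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_pad)
    return batch


# Made-up examples; 50257 is an arbitrary pad ID chosen for the illustration.
features = [
    {"input_ids": [5, 6, 7], "labels": [-100, 6, 7]},
    {"input_ids": [8], "labels": [8]},
]
batch = pad_batch(features, pad_token_id=50257)
print(batch)
```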

## Training phase

#### Training utility methods

In [9]:
def bleu_rouge_f1(eval_pred):
  predictions, labels = eval_pred
  predictions = np.argmax(predictions, axis=-1)

  labels = [[idx for idx in label if idx != -100] for label in labels]

  decoded_predictions = [tokenizer.decode(pred) for pred in predictions]
  decoded_labels = [tokenizer.decode(label) for label in labels]

  print(f"Prediction: {decoded_predictions}\nLabel:{decoded_labels}\n")

  bleu = load("bleu")
  bleu_results = bleu.compute(predictions=decoded_predictions, references=decoded_labels)

  rouge = load('rouge')
  rouge_results = rouge.compute(predictions=decoded_predictions, references=decoded_labels)

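  # "f1" here is the harmonic mean of BLEU and ROUGE-1, not a classification F1 score.
  # Guard against division by zero when both scores are 0.
  denominator = bleu_results['bleu'] + rouge_results['rouge1']
  f1 = 2 * (bleu_results['bleu'] * rouge_results['rouge1']) / denominator if denominator > 0 else 0.0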

  scores = {
        "bleu": bleu_results["bleu"],
        "rouge1": rouge_results["rouge1"],
        "rouge2": rouge_results["rouge2"],
        "rougeL": rouge_results["rougeL"],
        "f1": f1
    }

  return scores


def train_model(model, tokenized_dataset, data_collator, metric):

  training_args = TrainingArguments(output_dir="./results",
                                    overwrite_output_dir=True,
                                    num_train_epochs=3,
                                    per_device_train_batch_size=32,
                                    per_device_eval_batch_size=32,
                                    save_steps=500,
                                    logging_steps=100,
                                    evaluation_strategy="epoch",
                                    do_eval=True,
                                    logging_dir="./logs")

  trainer = Trainer(model=model,
                    tokenizer=tokenizer,
                    args=training_args,
                    data_collator=data_collator,
                    train_dataset=tokenized_dataset['train'],
                    eval_dataset=tokenized_dataset['train'],
                    compute_metrics = metric)

  trainer.train()

  eval_result = trainer.evaluate()

  return eval_result


def pretraining_prediction_scores(model, tokenized_dataset, data_collator, metric):
  training_args = TrainingArguments(output_dir="./results")

  trainer = Trainer(model=model,
                    args=training_args,
                    data_collator=data_collator,
                    eval_dataset=tokenized_dataset['train'],
                    compute_metrics = metric)

  eval_result = trainer.evaluate()

  return eval_result
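
A causal language model's logits at position i predict the token at position i + 1, and the label arrays contain `-100` at prompt and padding positions. `bleu_rouge_f1` decodes the `argmax` output over every position without a shift or mask, so prompt and padded positions end up in the decoded predictions. A sketch of one possible alignment step on token IDs (the helper `align_prediction` is hypothetical and operates on a single sequence after `argmax`):

```python
def align_prediction(pred_ids, label_ids, ignore_index=-100):
    """Pair each predicted token with the label it was meant to predict.

    The prediction at position i is compared with the label at position i + 1,
    and pairs whose label is ignore_index are dropped.
    """
    shifted_preds = pred_ids[:-1]
    shifted_labels = label_ids[1:]
    kept = [(p, l) for p, l in zip(shifted_preds, shifted_labels) if l != ignore_index]
    preds = [p for p, _ in kept]
    refs = [l for _, l in kept]
    return preds, refs


# Made-up IDs: two masked prompt positions, three completion tokens, one padded position.
label_ids = [-100, -100, 21, 22, 23, -100]
pred_ids = [5, 21, 22, 23, 9, 0]
preds, refs = align_prediction(pred_ids, label_ids)
print(preds, refs)
```

The filtered `preds` and `refs` could then be decoded and passed to BLEU and ROUGE in place of the unaligned sequences.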

#### Training


#### Experiments
1. Try hyperparameters taken from other sources.
2. Use the training set as the validation set.
3. Evaluate the untrained base model on the same data as a baseline.



In [10]:
scores_pretrained_model = pretraining_prediction_scores(reference_base_model, tokenized_dataset, data_collator_seq2seq, bleu_rouge_f1)
scores_pretrained_model

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Prediction: ['Q[PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD] lead[PAD][PAD][PAD][PAD][PAD] also be[PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD]%[PAD][PAD][PAD][PAD] problems[PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD] about[PAD] depression[PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD]!!!!!!!!!!!!', "Q'm to[PAD][PAD]\n I[PAD][PAD][PAD] world[PAD][PAD][PAD][PAD] ability[PAD] create[PAD][PAD] ability[PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD] create[PAD][PAD][PAD] think[PAD][PAD][PAD][PAD]oud be[PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD]!!!!!!!!!!!!", 'Q name[PAD][PAD][PAD]\n[PAD],\n world[PAD] come\n me to be[PAD] world[PAD][PAD] are[PAD] best\n[PAD][PAD][PAD][PAD] one[PAD][PAD] do[PAD][PAD][PAD] you[PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD

{'eval_loss': 4.09981632232666,
 'eval_bleu': 0.0023086746139114787,
 'eval_rouge1': 0.12760152732108515,
 'eval_rouge2': 0.014779286696231462,
 'eval_rougeL': 0.09664413874086689,
 'eval_f1': 0.0045352928782287,
 'eval_runtime': 23.717,
 'eval_samples_per_second': 1.687,
 'eval_steps_per_second': 0.211}

In [11]:
scores = train_model(model, tokenized_dataset, data_collator_seq2seq, bleu_rouge_f1)
scores



| Epoch | Training Loss | Validation Loss | Bleu | Rouge1 | Rouge2 | Rougel | F1 |
|---|---|---|---|---|---|---|---|
| 1 | No log | 3.110618 | 0.052777 | 0.374499 | 0.099481 | 0.288157 | 0.092516 |
| 2 | No log | 2.92999 | 0.078485 | 0.397665 | 0.112283 | 0.30499 | 0.131096 |
| 3 | No log | 2.840362 | 0.10098 | 0.405123 | 0.120295 | 0.318626 | 0.161664 |


Prediction: ['Q there[PAD] good for death?\n\n necessarily is depression lead lead to depression, it also also be with school activities The, school suggests that school% of students school disorders problems are with adolescence[PAD], This, school school education[PAD] the critical time for depression depression treating about depression depression health.<|endoftext|>Q the the the Recep\n Recep\n\n\n\n\n the the the the the the the the the the the', "Q'm to. I\n I is[PAD] me universe, created us the ability to create and act ability[PAD][PAD] that they could[PAD] and universe.[PAD] decisions[PAD] reality place.\n created this create humans the, think yourself. but wasoud be a to hate yourself. your your that you are worthy God Lord of which you are created.\nQ Bernardinotersontersontersontersonterson a a... a", 'Qanmar in come.\n,aha\n world has come\n me to be your world how you are the best person ever! and that matter else can do that great as you..\nQ\n\n\n........\n.\n.. ( (\n (

Prediction: ['Qolation bullying good for death?\nThe necessarily is school lead lead to depression, it can also be with school activities The, school suggests that school% of students school health problems are with adolescence 15. This, school school education are the critical time for depression depression treating about depression depression health.<|endoftext|>Q the the the Recep\n Recep\n\n\n\n\n the the the the the the the the the the the', "Q'm to. I\n I is in me universe, created us the ability to create and act abilityment to that they could be and universe.[PAD] decisions better reality place.\n gave this create humans the, think yourself. he wasoud be a to hate yourself. your your that you are worthy God Lord of which you are created.<|endoftext|>Q Bernardinotersonurtletersontersontersontersonterson... the", 'Qanmar in come.to,aha\n world has come\n you to be your world that you are the best person ever. and that matter else can do that great as you are.\nQ\nrentices\nrentic

{'eval_loss': 2.840362071990967,
 'eval_bleu': 0.10098017774749511,
 'eval_rouge1': 0.4051230395526893,
 'eval_rouge2': 0.12029485757468075,
 'eval_rougeL': 0.31862578614899184,
 'eval_f1': 0.1616642421752143,
 'eval_runtime': 22.1543,
 'eval_samples_per_second': 1.806,
 'eval_steps_per_second': 0.09,
 'epoch': 3.0}
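
The `f1` value reported above is not a classification F1 score: in `bleu_rouge_f1` it is the harmonic mean of BLEU and ROUGE-1. The sketch below recomputes it from the final `eval_bleu` and `eval_rouge1` values, with a guard for the case where both scores are zero (the function name `harmonic_mean` is illustrative):

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores; 0.0 when both are zero."""
    total = a + b
    if total == 0:
        return 0.0
    return 2 * (a * b) / total


# Values copied from the evaluation results after epoch 3.
bleu = 0.10098017774749511
rouge1 = 0.4051230395526893
combined = harmonic_mean(bleu, rouge1)
print(combined)
```

The result should agree with the reported `eval_f1` of about 0.1617. Because a harmonic mean is dominated by the smaller term, this number mostly tracks BLEU.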

# Chatbot launch

In [12]:
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.9,
    max_new_tokens=150,
    pad_token_id=tokenizer.eos_token_id,
    num_return_sequences=1
)

def predict(prompt):
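    # Move the inputs to the model's device; after training on a GPU the model no longer lives on the CPU.
    encoded_input = tokenizer(prompt, return_tensors='pt').to(model.device)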
    input_length = len(encoded_input["input_ids"][0])
    output_ids = model.generate(generation_config=gen_config, **encoded_input)[0]
    output = tokenizer.decode(output_ids[input_length:], skip_special_tokens=True)
    return output

#gr.Interface(fn=predict, inputs="text", outputs="text").launch()
print(predict("What is Depression?"))

 How do you treat it?

Depression is an often misunderstood disorder. Depression is a serious disease, which is a mental disease. The brain is the organ most affected by stress and emotional turmoil. Depression, or mood disorders, is a mental disease often caused by the illness, such as anxiety, depression, obsessive compulsive disorder, obsessive compulsive personality disorder, and bipolar disorder. Anxiety and depression are often identified by a combination of the symptoms, including worry, anxious thoughts, irrit
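
During training each example was formatted as `{bos_token}{prompt}\n{completion}{eos_token}`, while `predict` passes the raw prompt to the model, so the model may continue the question instead of answering it, as the sample output above suggests. One option is to format the prompt the same way at inference time. A sketch, assuming the tokenizer's `bos_token` is `<|endoftext|>` as in the GPT-2 tokenizer family (the helper `format_prompt` is hypothetical):

```python
def format_prompt(prompt, bos_token="<|endoftext|>"):
    """Wrap a user prompt in the same template used for the training examples."""
    # Training strings were built as f"{bos_token}{prompt}\n{completion}{eos_token}",
    # so generation should start right after the newline.
    return f"{bos_token}{prompt}\n"


formatted = format_prompt("What is Depression?")
print(repr(formatted))
```

The formatted string would be passed to the tokenizer in place of the raw prompt inside `predict`.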


# Saving model components to Hugging Face

In [14]:
# Never commit a real access token; substitute your own token here or load it from a secret store.
# token = '<your Hugging Face access token>'
# model.push_to_hub("wisai", use_auth_token=token)
# gen_config.push_to_hub("wisai", "generation_config.json", use_auth_token=token)
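
Access tokens are better kept out of the notebook itself. A small sketch that reads the token from an environment variable instead (the variable name `HF_TOKEN` and the helper `read_hub_token` are assumptions made for illustration):

```python
import os


def read_hub_token(var_name="HF_TOKEN"):
    """Return the access token stored in an environment variable.

    Raises RuntimeError if the variable is missing or empty, so a push
    never proceeds with an undefined token.
    """
    token = os.environ.get(var_name)
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable before pushing to the Hub")
    return token
```

The returned value could then be passed as `use_auth_token` in the commented-out calls above.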