<a href="https://colab.research.google.com/github/vitaliy-sharandin/data_science_projects/blob/master/nlp/gpt/wisai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WisAI
### WisAI is a GPT-NeoX-20B model fine-tuned on philosophical and psychological data and configured to provide useful advice.

In [14]:
!pip install gradio
!pip install transformers

In [15]:
from google.colab import drive
import json
import yaml
import gradio as gr
import torch
from transformers import GPTNeoForCausalLM, GPT2Tokenizer, GenerationConfig

In [16]:
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-125M")

model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")

torch.manual_seed(42)

<torch._C.Generator at 0x7f8372b15010>

# Training

### Training dataset creation

In [17]:
# Datasets

# Semantic
# Kaggle Psychometrics dataset https://www.kaggle.com/discussions/general/304994
# Psychometric tests dataset https://ieee-dataport.org/documents/psychometric-tests-dataset
# Psychometric NLP https://paperswithcode.com/dataset/psychometric-nlp
# Reddit mental health dataset https://zenodo.org/record/3941387
# Reddit mental disorders identification https://www.kaggle.com/datasets/kamaruladha/mental-disorders-identification-reddit-nlp
# Kaggle Mental Health Conversational Data https://www.kaggle.com/datasets/elvis23/mental-health-conversational-data
# Kaggle Mental Health FAQ for Chatbot https://www.kaggle.com/narendrageek/mental-health-faq-for-chatbot/code
# Kaggle Depression data for chatbot https://www.kaggle.com/datasets/nupurgopali/depression-data-for-chatbot
# A human consciousness questionnaire dataset https://data.mendeley.com/datasets/69p62ksdh6
# paperswithcode Self-reported Mental Health Diagnoses https://paperswithcode.com/dataset/smhd
# paperswithcode Mental Health Summarization Dataset https://paperswithcode.com/dataset/mentsum
# HuggingFace psychology dataset https://huggingface.co/datasets/samhog/psychology-10k

# Philosophy
# https://www.kaggle.com/datasets/christopherlemke/philosophical-texts
# https://www.workwithdata.com/object/philosophy-science-complete-a-text-on-traditional-problems-schools-thought-book-by-edwin-h-c-hung-0000
# https://www.kaggle.com/datasets/christopherlemke/philosophy-authors-writings-german
# https://www.workwithdata.com/object/philosophical-inquiries-an-introduction-to-problems-philosophy-book-by-nicholas-rescher-0000
# https://www.workwithdata.com/object/roman-stoicism-book-by-edward-vernon-arnold-1857
# https://www.workwithdata.com/object/wisdom-energy-basic-buddhist-teachings-book-by-thubten-yeshe-1935

# Classification 
# Classification for mental health https://www.kaggle.com/datasets/reihanenamdari/mental-health-corpus
# Depression identification https://www.kaggle.com/datasets/infamouscoder/depression-reddit-cleaned

In [18]:
# Use the depression dataset for mk1 training
drive.mount('/content/drive')

with open('/content/drive/MyDrive/Data/depression.yml', 'r') as file:
    depression_data = yaml.safe_load(file)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [19]:
def parse_depression_dataset(conversations):
    """Turn each conversation into a prompt/completion pair.

    The first utterance becomes the prompt; the remaining ones are joined into the completion.
    """
    output = []
    for convo in conversations:
        if not convo:
            # Skip empty conversations: they have no prompt.
            continue
        # Replace non-breaking spaces (\xa0) with regular spaces.
        prompt = convo[0].replace("\xa0", " ")
        completion = " ".join(convo[1:]).replace("\xa0", " ").strip()
        output.append({'prompt': prompt, 'completion': completion})
    return output
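
The prompt/completion records returned by the parser can be stored for later fine-tuning runs. A minimal sketch, assuming one JSON object per line (JSONL) is an acceptable storage format; `write_jsonl`, `read_jsonl`, the file name, and the toy record are hypothetical and not part of the notebook:

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write a list of dicts to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII characters readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy record in the same shape as the parser's output (illustrative only).
sample_records = [
    {"prompt": "I feel tired all the time.",
     "completion": "That sounds exhausting. How long has it lasted?"},
]
sample_path = os.path.join(tempfile.gettempdir(), "wisai_sample.jsonl")
write_jsonl(sample_records, sample_path)
restored_records = read_jsonl(sample_path)
```

In the notebook, `parsed_depression_data` would take the place of `sample_records`.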

In [2]:
parsed_depression_data = parse_depression_dataset(depression_data['conversations'])
# TODO: Tokenize the pairs and create a data collator with prompt=input_ids and completion=labels.
# The data collator should also handle padding.

# Sources:
# https://huggingface.co/learn/nlp-course/chapter7/6?fw=pt#preparing-the-dataset
# https://towardsdatascience.com/guide-to-fine-tuning-text-generation-models-gpt-2-gpt-neo-and-t5-dc5de6b3bc5e

NameError: ignored
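
The collator described in the comments above can be sketched on plain lists of token ids. This is one possible approach, not the notebook's implementation. It assumes a decoder-only model, where the input is the prompt followed by the completion and the prompt positions in `labels` are set to -100 so the loss ignores them (Hugging Face causal LM models shift the labels internally). `build_example`, `pad_batch`, and the toy ids are hypothetical; real ids would come from `tokenizer(text)["input_ids"]`.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_example(prompt_ids, completion_ids, eos_id, max_length):
    """Concatenate prompt and completion ids and mask the prompt in the labels."""
    input_ids = (prompt_ids + completion_ids + [eos_id])[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + completion_ids + [eos_id])[:max_length]
    return {"input_ids": input_ids, "labels": labels}


def pad_batch(examples, pad_id):
    """Right-pad every example in a batch to the length of the longest one."""
    longest = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_pad = longest - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [pad_id] * n_pad)
        # The attention mask is 1 for real tokens and 0 for padding.
        batch["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * n_pad)
        # Padding positions are masked out of the loss as well.
        batch["labels"].append(ex["labels"] + [IGNORE_INDEX] * n_pad)
    return batch


# Toy token ids stand in for tokenizer output (illustrative only).
examples = [
    build_example([11, 12, 13], [21, 22], eos_id=0, max_length=8),
    build_example([14], [23], eos_id=0, max_length=8),
]
batch = pad_batch(examples, pad_id=0)
```

Before being passed to the model, the lists in `batch` would be converted to tensors, for example with `torch.tensor`.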

In [None]:
# TODO: Create a tokenizer and data collator for plain text inputs shifted to the right, without prompting

# context_length = 128
# tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")

# outputs = tokenizer(
#     raw_datasets["train"][:2]["content"],
#     truncation=True,
#     max_length=context_length,
#     return_overflowing_tokens=True,
#     return_length=True,
# )
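
The commented-out call above uses `truncation=True`, `max_length=context_length`, and `return_overflowing_tokens=True` to split long texts into windows of at most `context_length` tokens. A minimal sketch of the same idea on a plain list of token ids; `chunk_token_ids` is a hypothetical helper, and dropping the shorter final window is an assumption (it keeps every example the same length, so no padding is needed):

```python
def chunk_token_ids(token_ids, context_length):
    """Split a token-id sequence into consecutive windows of `context_length`."""
    chunks = []
    for start in range(0, len(token_ids), context_length):
        chunk = token_ids[start:start + context_length]
        # Keep only full windows so every training example has the same length.
        if len(chunk) == context_length:
            chunks.append(chunk)
    return chunks


token_ids = list(range(10))  # toy ids standing in for a tokenized document
chunks = chunk_token_ids(token_ids, context_length=4)
```

For plain-text training of a causal LM, each chunk can serve as both `input_ids` and `labels`, because Hugging Face causal LM models shift the labels internally.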

### Training phase

In [21]:
# Training
# Training example on depression dataset https://huggingface.co/datasets/samhog/psychology-10k/viewer/samhog--psychology-10k/train

# Chatbot launch

In [22]:
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.9,
    max_new_tokens=150,
    pad_token_id=tokenizer.eos_token_id,
    num_return_sequences=1
)

def predict(prompt):
    encoded_input = tokenizer(prompt, return_tensors='pt')
    input_length = len(encoded_input["input_ids"][0])
    output_ids = model.generate(generation_config=gen_config, **encoded_input)[0]
    output = tokenizer.decode(output_ids[input_length:], skip_special_tokens=True)
    return output

#gr.Interface(fn=predict, inputs="text", outputs="text").launch()
print(predict("Hello, AI."))

  You know, you can't beat a human...

~~~
mikestew
I was curious, were you actually going to do your research?

~~~
tj
We’re not sure about how many people that you know. I hope you’re
unlikely to run into any of the possible problems that I’ve been aware of.

------
sharply_goat
I'd say if you were doing a similar task in the same room, you have more
opportunity. I'd like you to be more flexible with your data. I mean, you
wouldn't be able to have more data, and then have a "hard"
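
The sample above shows the base model drifting into forum-thread text after its first reply, with `~~~` and `------` separating the posts. One possible post-processing step, assuming those separators are usable as stop markers, is to cut the decoded text at the first of them; `truncate_at_stop` is a hypothetical helper, not part of the notebook:

```python
def truncate_at_stop(text, stop_sequences):
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].strip()


# Shortened copy of the generated text shown above.
raw_reply = "  You know, you can't beat a human...\n\n~~~\nmikestew\nI was curious"
clean_reply = truncate_at_stop(raw_reply, stop_sequences=["~~~", "------"])
```

Inside `predict`, the helper could be applied to the decoded output before it is returned.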


# Saving model components to Huggingface

In [23]:
# Read the access token from the environment instead of hard-coding it in the notebook.
# token = os.environ["HF_TOKEN"]  # requires `import os`
# model.push_to_hub("wisai", use_auth_token=token)
# gen_config.push_to_hub("wisai", use_auth_token=token)