# Diagnose me 

Diagnose me is an LFQA (long-form question answering) dataset of dialogues between patients and doctors, based on factual conversations from icliniq.com and healthcaremagic.com. It aims to collect more than 257k distinct questions and prescriptions for patients.

<img src ='https://plus.unsplash.com/premium_photo-1661281252401-f03c9bfb6925?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1170&q=80'>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Description

https://huggingface.co/EleutherAI/gpt-neo-1.3B

https://huggingface.co/EleutherAI/gpt-neo-125M

## Model Description

GPT-Neo 1.3B is a transformer model designed using EleutherAI's replication of the GPT-3 architecture. GPT-Neo refers to the class of models, while 1.3B represents the number of parameters of this particular pre-trained model.

## Training data

GPT-Neo 1.3B was trained on the Pile, a large scale curated dataset created by EleutherAI for the purpose of training this model.

## Training procedure

This model was trained on the Pile for 380 billion tokens over 362,000 steps. It was trained as a masked autoregressive language model, using cross-entropy loss.

## Intended Use and Limitations

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks. The model is best at what it was pretrained for however, which is generating texts from a prompt.

# Main task of this notebook

This notebook shows how to fine-tune a GPT model and apply it to a domain-specific use case.
Patient/doctor dialogue is an intriguing domain for text generation with GPT-Neo. The description above covers GPT-Neo 1.3B; the code below loads the smaller `EleutherAI/gpt-neo-125M` checkpoint from the same model family. Let's see how it can be done and think about potential uses of this kind of model.

# Imports
The main libraries used in this project are listed below.
The notebook is based mainly on the transformers library and torch.
It shows how to define a dataset with the torch `Dataset` class and how to use a few transformers utilities to fit a model to your own use case.


In [None]:
import pandas as pd 
import torch
from torch.utils.data import Dataset, random_split
from typing import List, Dict, Union
from typing import Any, TypeVar

In [None]:
from transformers import AutoTokenizer, TrainingArguments 
from transformers import Trainer, AutoModelForCausalLM, IntervalStrategy

In [None]:
torch.manual_seed(2137)

### Data processing

In [None]:
# Assign values to a few parameters
MODEL_NAME: str = 'EleutherAI/gpt-neo-125M'
BOS_TOKEN: str = '<|startoftext|>'
EOS_TOKEN: str = '<|endoftext|>'
PAD_TOKEN: str = '<|pad|>'

- BOS Token - beginning-of-sentence token, marks the start of each example
- EOS Token - end-of-sentence token, marks the end of each example
- PAD Token - special padding token added to shorter sequences so that they have the same length as either the longest sequence in a batch or the chosen maximum length
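
To make the role of these three tokens concrete, here is a minimal, framework-free sketch of wrapping a text in BOS/EOS and padding it to a fixed length. The helper name `wrap_and_pad` and the whitespace split are illustrative assumptions; the real GPT-Neo tokenizer uses sub-word tokenization and returns integer ids rather than strings.

```python
from typing import Dict, List

BOS_TOKEN = '<|startoftext|>'
EOS_TOKEN = '<|endoftext|>'
PAD_TOKEN = '<|pad|>'


def wrap_and_pad(text: str, max_length: int) -> Dict[str, List]:
    """Toy illustration (hypothetical helper): wrap, truncate and pad a text."""
    # Whitespace split stands in for real sub-word tokenization (assumption).
    tokens = [BOS_TOKEN] + text.split() + [EOS_TOKEN]
    tokens = tokens[:max_length]                   # truncation
    n_real = len(tokens)
    tokens += [PAD_TOKEN] * (max_length - n_real)  # padding
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [1] * n_real + [0] * (max_length - n_real)
    return {"tokens": tokens, "attention_mask": attention_mask}


example = wrap_and_pad("I have a headache", max_length=8)
print(example["tokens"])
print(example["attention_mask"])
```

The attention mask is what lets the model distinguish real tokens from padding once all sequences share the same length.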

# Load tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, bos_token=BOS_TOKEN, 
                                          eos_token=EOS_TOKEN, pad_token=PAD_TOKEN)

Load the model and resize its token embeddings to match the tokenizer's vocabulary, which now includes the added special tokens.

In [None]:
model =  AutoModelForCausalLM.from_pretrained(MODEL_NAME).cuda()
model.resize_token_embeddings(len(tokenizer))

### Load dataset 

Our dataset is relatively simple to read. To start, we focus only on the Patient questions. The data is stored in .feather format.

In [None]:
DATA_PATH: str = '/kaggle/input/diagnoise-me/diagnose_en_dataset.feather'

#### Define the maximum length and shorten the input data to reduce computation

In [None]:
data = pd.read_feather(DATA_PATH)
data = data['Patient'].values

In [None]:
SEQ_LEN: int = 1024
# Use only 1% of the examples to keep the run short
SAMPLE_SIZE: int = int(data.shape[0] * 0.01)
# Note: this slice truncates each text to SEQ_LEN characters, not tokens
_data = [el[:SEQ_LEN] for el in data[:SAMPLE_SIZE]]

Torch Dataset
- A simple `Dataset` that tokenizes each input text and stores the resulting input_ids and attention masks; the labels are built later, in the data collator.
- It also wraps each text in the BOS and EOS tokens and pads it with the PAD token defined above.

In [None]:
class PatientDiagnozeDataset(Dataset):
    """Tokenizes patient texts and stores input ids and attention masks."""

    def __init__(self, txt_list, tokenizer, max_length):
        self.input_ids: List[torch.Tensor] = []
        self.attn_masks: List[torch.Tensor] = []
        for txt in txt_list:
            # Wrap each text in BOS/EOS, then truncate or pad it to max_length
            encodings_dict = tokenizer(BOS_TOKEN + txt + EOS_TOKEN, truncation=True,
                                       max_length=max_length, padding="max_length")
            self.input_ids.append(torch.tensor(encodings_dict["input_ids"]))
            self.attn_masks.append(torch.tensor(encodings_dict["attention_mask"]))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        # Labels are created from input_ids in the data collator
        return self.input_ids[idx], self.attn_masks[idx]

In [None]:
# Load tokenizer 
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, bos_token = BOS_TOKEN, 
                                         eos_token=EOS_TOKEN, pad_token=PAD_TOKEN)

## Load Dataset

- define size of training dataset
- define size of validation dataset


In [None]:
dataset = PatientDiagnozeDataset(txt_list = _data, tokenizer = tokenizer, max_length = 1024)

In [None]:
TRAIN_SIZE: int = int(len(dataset) * 0.8)
train_dataset, val_dataset = random_split(dataset, [TRAIN_SIZE, len(dataset) - TRAIN_SIZE])
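
Conceptually, the 80/20 split above shuffles the example indices and cuts them at `TRAIN_SIZE`. The following is a minimal pure-Python sketch of that idea; `split_indices` is a hypothetical helper, and `torch.utils.data.random_split` uses torch's own random generator, so the actual indices it picks will differ.

```python
import random
from typing import List, Tuple


def split_indices(n_items: int, train_fraction: float = 0.8,
                  seed: int = 2137) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: shuffle indices and cut them into train/validation."""
    train_size = int(n_items * train_fraction)
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # local RNG, reproducible shuffle
    return indices[:train_size], indices[train_size:]


train_idx, val_idx = split_indices(10)
print(len(train_idx), len(val_idx))
```

Every example lands in exactly one of the two subsets, so validation loss is measured on texts the model did not see during training.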

#### Create output paths


In [None]:
import os
os.makedirs('./results', exist_ok = True)
OUTPUT_DIR: str = './results'

### Define training arguments
- num_train_epochs - number of training epochs,
- per_device_train_batch_size - batch size per device (here, a single GPU),
- warmup_steps - number of initial steps during which the learning rate is ramped up linearly from 0 to its target value. Warm-up reduces the primacy effect of the early training examples; without it, you may need a few extra epochs to reach the desired convergence, as the model un-learns what it picked up from those early examples.
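
As an illustration of warm-up, here is a small sketch of a linear schedule with warm-up followed by linear decay, which is the shape of the `Trainer`'s default scheduler. The function name `linear_warmup_lr` and the total of 1000 steps are assumptions made for the example; 5e-5 is the default `learning_rate` of `TrainingArguments`.

```python
def linear_warmup_lr(step: int, base_lr: float,
                     warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given step: linear ramp-up, then linear decay to 0."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warm-up steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run: 1000 optimizer steps in total, 50 of them warm-up.
for step in (0, 25, 50, 525, 1000):
    print(step, linear_warmup_lr(step, base_lr=5e-5, warmup_steps=50, total_steps=1000))
```

With `warmup_steps=50`, the first updates use a very small learning rate, so the earliest batches cannot pull the pre-trained weights far from their starting point.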



In [None]:
training_args = TrainingArguments(output_dir = OUTPUT_DIR, num_train_epochs = 2, logging_steps = 5000, 
                                  save_strategy="epoch",
                                  per_device_train_batch_size=2, per_device_eval_batch_size=2, 
                                  warmup_steps=50, weight_decay=0.01, logging_dir='./logs', 
                                  evaluation_strategy="epoch",
                                 load_best_model_at_end=True)

# Define trainer

In [None]:
_trainer =Trainer(model=model, args=training_args, train_dataset=train_dataset,
        eval_dataset=val_dataset, data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                                              'attention_mask': torch.stack([f[1] for f in data]),
                                                              'labels': torch.stack([f[0] for f in data])})
_trainer.train()
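
The collator above reuses `input_ids` as `labels`, so padded positions also contribute to the loss. A common refinement, which this notebook does not apply, is to set the labels at padded positions to -100, the index that the cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists; `mask_padding_labels` is a hypothetical helper and the token ids are made up.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding_labels(input_ids: List[int], attention_mask: List[int]) -> List[int]:
    """Hypothetical helper: copy input ids to labels, ignoring padded positions."""
    return [token_id if keep == 1 else IGNORE_INDEX
            for token_id, keep in zip(input_ids, attention_mask)]


# Made-up ids: three real tokens followed by two padding tokens.
labels = mask_padding_labels([11, 42, 7, 0, 0], [1, 1, 1, 0, 0])
print(labels)
```

In the collator this would amount to building `labels` from each item's input ids and attention mask instead of stacking the input ids unchanged.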

In [None]:
generated = tokenizer(BOS_TOKEN, return_tensors="pt").input_ids.cuda()

In [None]:
# generate() expects token ids (bos_token_id, ...), not token strings
sample_outputs = model.generate(generated, do_sample=True, top_k=50,
                                bos_token_id=tokenizer.bos_token_id,
                                eos_token_id=tokenizer.eos_token_id,
                                pad_token_id=tokenizer.pad_token_id,
                                max_length=300, top_p=0.95, temperature=1.9, num_return_sequences=20)
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
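
The sampling parameters used in `generate` (`temperature`, `top_k`, `top_p`) can be illustrated on a tiny, made-up vocabulary. This is a conceptual pure-Python sketch, not the library's implementation; the function names and the example logits are assumptions.

```python
import math
import random
from typing import Dict, List


def filtered_distribution(logits: List[float], temperature: float = 1.0,
                          top_k: int = 0, top_p: float = 1.0) -> Dict[int, float]:
    """Apply temperature, then top-k, then top-p; return renormalised probabilities."""
    scaled = [x / temperature for x in logits]   # higher temperature flattens
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]                    # keep the k most likely tokens
    mass = sum(probs[i] for i in order)
    kept, cumulative = [], 0.0
    for i in order:                              # nucleus: smallest set reaching top_p
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


def sample_token(dist: Dict[int, float], rng: random.Random) -> int:
    """Draw one token id from the filtered distribution."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary
dist = filtered_distribution(toy_logits, temperature=1.9, top_k=3, top_p=0.95)
print(dist, sample_token(dist, random.Random(2137)))
```

A temperature above 1, such as the 1.9 used here, spreads probability towards less likely tokens, which makes the output more varied but also less coherent; `top_k` and `top_p` then cut off the least likely candidates.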