Based on [Conditional Text Generation with GPT-2](https://towardsdatascience.com/conditional-text-generation-by-fine-tuning-gpt-2-11c1a9fc639d) and [this Colab notebook](https://colab.research.google.com/drive/1vnpMoZoenRrWeaxMyfYK4DDbtlBu-M8V?usp=sharing#scrollTo=I8gp0I8JnMEE).

### Install and import libraries

In [1]:
%%time
%%capture
!pip install transformers

CPU times: user 33.7 ms, sys: 11.9 ms, total: 45.6 ms
Wall time: 14.5 s


Check the available GPU memory (Colab may offer 12 GB or 16 GB).

The configuration below works with 16 GB; reduce the batch size if only 12 GB is available.




In [2]:
!nvidia-smi

Tue May 30 06:51:51 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   43C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|       
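To compare the reported memory with the 12 GB and 16 GB cases mentioned above, the memory column can be parsed from the `nvidia-smi` text instead of being read by eye. This is a minimal sketch: the helper name `parse_gpu_memory` and its regular expression are illustrative assumptions, and the sample string is copied from the output above.

```python
import re

# Two memory-usage rows copied from the nvidia-smi output above.
SAMPLE = """
| N/A   42C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
| N/A   43C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
"""

def parse_gpu_memory(smi_text):
    """Return a list of (used_mib, total_mib) tuples, one per GPU row."""
    pattern = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")
    return [(int(used), int(total)) for used, total in pattern.findall(smi_text)]

gpus = parse_gpu_memory(SAMPLE)
for idx, (used, total) in enumerate(gpus):
    print(f"GPU {idx}: {total - used} MiB free of {total} MiB")
```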

In [3]:
import os
import io
import requests
import numpy as np
import pandas as pd
import re
import zipfile
import random
import time
import csv
import datetime
from itertools import compress
from collections import Counter, defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from transformers import AutoTokenizer, AutoConfig, AutoModelForPreTraining, \
                         AdamW, get_linear_schedule_with_warmup, \
                         TrainingArguments, BeamScorer, Trainer

import torch
from torch.utils.data import Dataset, random_split, DataLoader, \
                             RandomSampler, SequentialSampler

from IPython.display import clear_output

print(f"PyTorch version: {torch.__version__}")

PyTorch version: 1.11.0


### Configurations

The dataset is large, so for debugging it can be subsampled by a factor of `DEVIDE_BY` (20); that sampling cell is left commented out below. The run shown here trains for 5 epochs.

In [4]:
DEBUG           = False

INPUT_DIR       = 'articles'

USE_APEX        = True
APEX_OPT_LEVEL  = 'O1'

MODEL           = 'gpt2' #{gpt2, gpt2-medium, gpt2-large, gpt2-xl}

UNFREEZE_LAST_N = 6 #The last N layers to unfreeze for training

SPECIAL_TOKENS  = { "bos_token": "<|BOS|>",
                    "eos_token": "<|EOS|>",
                    "unk_token": "<|UNK|>",                    
                    "pad_token": "<|PAD|>",
                    "sep_token": "<|SEP|>"}
                    
MAXLEN          = 256  # max tokens per example; GPT-2 allows up to 1024 (n_positions)

TRAIN_SIZE      = 0.8

if USE_APEX:
    TRAIN_BATCHSIZE = 16
    BATCH_UPDATE    = 5
else:
    TRAIN_BATCHSIZE = 8
    BATCH_UPDATE    = 8

EPOCHS          = 5
LR              = 5e-4
EPS             = 1e-8
WARMUP_STEPS    = 1e2

SEED            = 2020


DEVIDE_BY = 20

os.environ['WANDB_DISABLED'] = 'true'

In [5]:
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # benchmark mode can pick non-deterministic kernels

seed_everything(SEED)
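As a quick check of what seeding buys, the sketch below uses only Python's `random` module (not NumPy or PyTorch, which `seed_everything` also seeds) to show that re-seeding reproduces the same draws. The helper name `draw` is hypothetical.

```python
import random

def draw(seed, n=5):
    """Seed Python's RNG and return n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]

first = draw(2020)
second = draw(2020)
other = draw(2021)

print(first == second)  # same seed, same sequence
print(first == other)   # different seed; the sequences almost certainly differ
```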

### Using the patient-doctor conversation dataset

In [13]:
train_df = pd.read_csv('/kaggle/input/conversatioanl-dataset/train.csv')
test_df = pd.read_csv('/kaggle/input/conversatioanl-dataset/test.csv')

In [14]:
train_df = train_df.dropna()
train_df = train_df.astype('str')
test_df = test_df.dropna()
test_df = test_df.astype('str')

In [15]:
train_df.head()

Unnamed: 0,preprocess_Patient,preprocess_Doctor
0,"Today, my hip joint began to hurt when I walke...",Hello and Welcome to ‘Ask A Doctor service. I...
1,For about two weeks I have had lower back pain...,Thank you for asking Healthcare majic. My nam...
2,Previously i had a surgery of hie tie of varic...,"Hello and .As an Urologist, let me advise you..."
3,"my child, age 10 yrs and weight 26.5 kg, has t...",Anticonvulsant drugs once started are usualll...
4,"I am 62 yrs young, 5ft 4 inches tall & weigh a...","Hello, Well, while adhesions or hiatal hernia..."


In [17]:
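# Average number of whitespace-separated words over a random sample
total_words = 0  # avoid shadowing the built-in sum()
sample_num = 1000
for entry in train_df.sample(sample_num).iloc[:, 1]:
    total_words += len(entry.split(' '))
print(total_words/sample_num)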

93.833
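The average above counts whitespace-separated words, while `MAXLEN` limits tokens, and a BPE tokenizer typically produces more tokens than words. As a rough proxy only, the sketch below reports the mean word count together with the share of texts longer than a chosen word limit. The sample texts and the `length_stats` helper are made up for illustration.

```python
def length_stats(texts, limit):
    """Return (mean word count, fraction of texts with more than `limit` words)."""
    counts = [len(t.split()) for t in texts]
    mean = sum(counts) / len(counts)
    over = sum(c > limit for c in counts) / len(counts)
    return mean, over

# Toy stand-ins for the dataset rows (not real data).
texts = ["short question", "a b c d e f", "one two three four"]
mean_words, frac_over = length_stats(texts, limit=4)
print(mean_words, frac_over)
```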


In [18]:
# # For debug
# train_df = train_df.sample(int(len(train_df) / DEVIDE_BY))
# test_df = test_df.sample(int(len(test_df) / DEVIDE_BY / 5))
# f'There are {len(train_df) :,} samples for training, and {len(test_df) :,} samples for validation testing'

### Datasets and loaders

In [19]:
class myDataset(Dataset):

    def __init__(self, data, tokenizer, randomize=True):
        self.randomize = randomize
        self.tokenizer = tokenizer 
        self.title     = data.iloc[:, 0].tolist()
        self.text      = data.iloc[:, 1].tolist()


    #---------------------------------------------#

    def __len__(self):
        return len(self.text)

    #---------------------------------------------#
    
    def __getitem__(self, i):
        text = SPECIAL_TOKENS['bos_token'] + self.title[i] + SPECIAL_TOKENS['sep_token'] + self.text[i] + SPECIAL_TOKENS['eos_token']

        # Use the tokenizer passed to the constructor rather than the global one
        encodings_dict = self.tokenizer(text,
                                        truncation=True,
                                        max_length=MAXLEN,
                                        padding="max_length")
        
        input_ids = encodings_dict['input_ids']
        attention_mask = encodings_dict['attention_mask']
        
        return {'label': torch.tensor(input_ids),
                'input_ids': torch.tensor(input_ids), 
                'attention_mask': torch.tensor(attention_mask)}
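To make the layout of one training example concrete, here is a small sketch in plain Python, without the real tokenizer. `build_example` mirrors the string concatenation in `__getitem__`; `pad_or_truncate` is a hypothetical stand-in that imitates `truncation=True` with `padding="max_length"` on a list of made-up token ids.

```python
BOS, SEP, EOS = "<|BOS|>", "<|SEP|>", "<|EOS|>"

def build_example(title, text):
    """Concatenate the conditioning text and the target with special tokens."""
    return BOS + title + SEP + text + EOS

def pad_or_truncate(ids, max_length, pad_id):
    """Cut ids to max_length, then right-pad and build the attention mask."""
    ids = ids[:max_length]
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

print(build_example("my knee hurts", "Please rest it."))
input_ids, attention_mask = pad_or_truncate([11, 12, 13], max_length=5, pad_id=0)
print(input_ids, attention_mask)
```

Note that truncation drops the tail of a long example, so a pair longer than `MAXLEN` tokens loses its closing `<|EOS|>` token.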

In [20]:
def split_data(data, S=TRAIN_SIZE):
    train_data = data.sample(frac=S)  # honor the S argument instead of the global
    val_data = data.drop(train_data.index)

    return train_data, val_data

### Loading Tokenizer, Config and Model

In [21]:
def get_tokenier(special_tokens=None):
    tokenizer = AutoTokenizer.from_pretrained(MODEL) #GPT2Tokenizer

    if special_tokens:
        tokenizer.add_special_tokens(special_tokens)
        print("Special tokens added")
    return tokenizer

def get_model(tokenizer, special_tokens=None, load_model_path=None):

    #GPT2LMHeadModel
    if special_tokens:
        config = AutoConfig.from_pretrained(MODEL, 
                                            bos_token_id=tokenizer.bos_token_id,
                                            eos_token_id=tokenizer.eos_token_id,
                                            sep_token_id=tokenizer.sep_token_id,
                                            pad_token_id=tokenizer.pad_token_id,
                                            output_hidden_states=False)
    else: 
        config = AutoConfig.from_pretrained(MODEL,                                     
                                            pad_token_id=tokenizer.eos_token_id,
                                            output_hidden_states=False)    

    #----------------------------------------------------------------#
    model = AutoModelForPreTraining.from_pretrained(MODEL, config=config)

    if special_tokens:
        #Special tokens added, model needs to be resized accordingly
        model.resize_token_embeddings(len(tokenizer))

    if load_model_path:
        model.load_state_dict(torch.load(load_model_path))

    model.cuda()
    return model

In [22]:
%%time

tokenizer = get_tokenier(special_tokens=SPECIAL_TOKENS)
model = get_model(tokenizer, 
                  special_tokens=SPECIAL_TOKENS,
                #   load_model_path='pytorch_model.bin'
                 )

Special tokens added

CPU times: user 16.8 s, sys: 3.25 s, total: 20 s
Wall time: 27.8 s


In [None]:
# # - Freeze selective layers:
# # - Freeze all layers except last n:
# for parameter in model.parameters():
#     parameter.requires_grad = False

# for i, m in enumerate(model.transformer.h):        
#     #Only un-freeze the last n transformer blocks
#     if i+1 > 12 - UNFREEZE_LAST_N:
#         for parameter in m.parameters():
#             parameter.requires_grad = True 

# for parameter in model.transformer.ln_f.parameters():        
#     parameter.requires_grad = True

# for parameter in model.lm_head.parameters():        
#     parameter.requires_grad = True
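The commented-out cell above hard-codes 12, the number of transformer blocks in the base `gpt2` model (`n_layer` in its config). The sketch below shows which block indices the condition `i+1 > n_layer - UNFREEZE_LAST_N` selects, with the layer count passed in so it also covers larger variants. The helper name `unfrozen_blocks` is hypothetical.

```python
def unfrozen_blocks(n_layer, unfreeze_last_n):
    """Indices of transformer blocks that stay trainable."""
    return [i for i in range(n_layer) if i + 1 > n_layer - unfreeze_last_n]

print(unfrozen_blocks(12, 6))  # → [6, 7, 8, 9, 10, 11]
```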

In [None]:
# train_data, val_data = split_data(train_df)

# train_dataset = myDataset(train_data, tokenizer)
# val_dataset = myDataset(val_data, tokenizer, randomize=False)

# f'There are {len(train_dataset) :,} samples for training, and {len(val_dataset) :,} samples for validation testing'

In [23]:
train_dataset = myDataset(train_df, tokenizer)
val_dataset = myDataset(test_df, tokenizer, randomize=False)

### Fine-tune GPT2 using Trainer

In [24]:
%%time

training_args = TrainingArguments(
    output_dir="./",
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=TRAIN_BATCHSIZE,
    per_device_eval_batch_size=TRAIN_BATCHSIZE,
    gradient_accumulation_steps=BATCH_UPDATE,
    evaluation_strategy="epoch",
    save_strategy = 'epoch',
    fp16=True,
    fp16_opt_level=APEX_OPT_LEVEL,
    warmup_steps=int(WARMUP_STEPS),  # WARMUP_STEPS is defined as the float 1e2
    learning_rate=LR,
    adam_epsilon=EPS,
    weight_decay=0.01,        
    save_total_limit=1,
    load_best_model_at_end=True,
    report_to="none",  # the string "none" disables all logging integrations
)

#---------------------------------------------------#
trainer = Trainer(
    model=model,
    args=training_args,    
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer
)

#---------------------------------------------------#
trainer.train()
trainer.save_model()    

Using the `WAND_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Using cuda_amp half precision backend
***** Running training *****
  Num examples = 169833
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 160
  Gradient Accumulation steps = 5
  Total optimization steps = 5305


Epoch,Training Loss,Validation Loss
0,2.193,2.089931
1,2.0683,2.017886
2,1.9904,1.981005
3,1.9284,1.954305
4,1.8817,1.943041


***** Running Evaluation *****
  Num examples = 18871
  Batch size = 32
Saving model checkpoint to ./checkpoint-1061
Configuration saved in ./checkpoint-1061/config.json
Model weights saved in ./checkpoint-1061/pytorch_model.bin
tokenizer config file saved in ./checkpoint-1061/tokenizer_config.json
Special tokens file saved in ./checkpoint-1061/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 18871
  Batch size = 32
Saving model checkpoint to ./checkpoint-2122
Configuration saved in ./checkpoint-2122/config.json
Model weights saved in ./checkpoint-2122/pytorch_model.bin
tokenizer config file saved in ./checkpoint-2122/tokenizer_config.json
Special tokens file saved in ./checkpoint-2122/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 18871
  Batch size = 32
Saving model checkpoint to ./checkpoint-3183
Configuration saved in ./checkpoint-3183/config.json
Model weights saved in ./checkpoint-3183/pytorch_model.bin
tokenizer config file saved i

CPU times: user 7h 37min 35s, sys: 29min, total: 8h 6min 35s
Wall time: 4h 50min 54s
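The step counts in the log follow from the configuration: 16 examples per device, two GPUs, and 5 accumulation steps give an effective batch of 160. The sketch below reproduces that arithmetic. It illustrates how the logged numbers relate and is not the Trainer's own code; the function name `optimization_steps` is hypothetical.

```python
import math

def optimization_steps(num_examples, per_device_batch, n_gpus, grad_accum, epochs):
    """Return (effective batch size, steps per epoch, total optimizer steps)."""
    effective_batch = per_device_batch * n_gpus * grad_accum
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * n_gpus))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

print(optimization_steps(169833, 16, 2, 5, 5))  # → (160, 1061, 5305)
```

The 1061 steps per epoch also explain the checkpoint names `checkpoint-1061`, `checkpoint-2122`, and `checkpoint-3183`.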


In [None]:
# Save to G-Drive ----------------------------------#
# !cp -r 'pytorch_model.bin' '/content/drive/MyDrive/Colab Notebooks/Text Generation/pytorch_model_V2.bin'

### Generating text with Fine-tuned GPT-2 model

In [None]:
# !cp -r '/content/drive/MyDrive/Colab Notebooks/Text Generation/pytorch_model_V2.bin' 'pytorch_model.bin' 

In [26]:
tokenizer = get_tokenier(special_tokens=SPECIAL_TOKENS)
model = get_model(tokenizer, 
                  special_tokens=SPECIAL_TOKENS,
                  load_model_path='/kaggle/working/pytorch_model.bin')

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": 

Special tokens added


loading weights file https://huggingface.co/gpt2/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/752929ace039baa8ef70fe21cdf9ab9445773d20e733cf693d667982e210837e.323c769945a351daa25546176f8208b3004b6f563438a7603e7932bae9025925
All model checkpoint weights were used when initializing GPT2LMHeadModel.

All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.


In [46]:
title = "how to treat asthma ?"
prompt = SPECIAL_TOKENS['bos_token'] + title + SPECIAL_TOKENS['sep_token'] 
         
generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)
device = torch.device("cuda")
generated = generated.to(device)

model.eval();

In [47]:
# Top-p (nucleus) text generation (10 samples):
sample_outputs = model.generate(generated, 
                                do_sample=True,   
                                min_length=50, 
                                max_length=MAXLEN,
                                top_k=30,                                 
                                top_p=0.7,        
                                temperature=0.9,
                                repetition_penalty=2.0,
                                num_return_sequences=10
                                )

for i, sample_output in enumerate(sample_outputs):
    text = tokenizer.decode(sample_output, skip_special_tokens=True)
    a = len(title)  
    print("{}: {}\n\n".format(i+1,  text[a:]))

1: Hello.Thank you for asking at HCM I went through your history and would like make suggestions as follows1 As per my understanding, inhalers are helpful in managing acute exacerbations of wheezing episodes - Salbutamol or Levo-cetirizine can be used 2 However if it is not working better with an oral steroid therapy then a corticosteroid injection may also help 3 If symptoms do worsen despite using salmeterole alone after the treatment sessions than inhaled steroids should never use 4 Also certain types such medications including antihistamines must always take care that they contain only some ingredients which could cause side effects 5 In addition i suggest patients who have already taken prednisone injections regularly over years when there has been no improvement by adding montelukast + levoceterazoline during this time 6 For further information consult pulmonologist online.---> httpswww..askdrsudhilaryalpedsiotherapycenter 


2: Hi, thanks for query. You should take regular steam
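The sampling call above combines `top_k=30` with `top_p=0.7`. As a simplified illustration of what those two filters do to a next-token distribution, the sketch below applies them to a made-up five-token distribution in plain Python. It is not the library's implementation, and the token names and the helper `top_k_top_p_filter` are assumptions for the example.

```python
import random

def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    renormalized cumulative probability reaches top_p, and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total_k = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total_k
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Toy next-token distribution (made up for illustration).
probs = {"inhaler": 0.5, "steam": 0.2, "rest": 0.15, "tea": 0.1, "yoga": 0.05}
filtered = top_k_top_p_filter(probs, top_k=3, top_p=0.7)
print(filtered)

# Sample one token from the filtered distribution.
choice = random.choices(list(filtered), weights=list(filtered.values()))[0]
print(choice)
```

Here `top_k=3` first drops "tea" and "yoga", and `top_p=0.7` then keeps only "inhaler" and "steam", so sampling never picks the low-probability tail.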

In [48]:
# Beam-search multinomial sampling (num_beams=5 combined with do_sample=True):
sample_outputs = model.generate(generated, 
                                do_sample=True,   
                                max_length=MAXLEN,                                                      
                                num_beams=5,
                                repetition_penalty=5.0,
                                early_stopping=True,      
                                num_return_sequences=1
                                )

for i, sample_output in enumerate(sample_outputs):
    text = tokenizer.decode(sample_output, skip_special_tokens=True)
    a = len(title) 
    print("{}: {}\n\n".format(i+1,  text[a:]))

1: Hello,Thank you for asking at HCM.I went through your history and would like to make suggestions for you as follows1. Asthma is caused due to broncho-constriction obstruction of smaller airway passages which is indicative of Hyper-responsiveness of air passages.2. I usually suggest my such patients regular montelukast and levocetirizinecetirizinebambuterol inhaler once or twice a day depending upon response.3. Please avoid exposure to dusts, smokes and air pollution as much as possible.4. Were I treating you, I would prescribe you an antihistamine like cetirizinelevocetirizinefexofenadinehydroxyzinepantoprazole before breakfast for 2 weeks.5. Regular steam inhalation with salbutamol can also be helpful.Hope above suggestions will be helpful to you.Should you have any further query, please feel free to ask at HCM.Wish you the best of the health ahead.Thank you & 
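Both generation cells pass a `repetition_penalty` (2.0 for sampling, 5.0 with beams). The sketch below shows the rule this kind of penalty is commonly understood to apply: for every token already generated, a positive logit is divided by the penalty and a negative logit is multiplied by it. It runs on a toy list of logits, and the helper `apply_repetition_penalty` is a hypothetical name, not the library's code.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Scale down the logits of tokens that were already generated."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted

logits = [2.0, -1.0, 0.5, 3.0]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1, 0], penalty=2.0)
print(penalized)  # → [1.0, -2.0, 0.5, 3.0]
```

Tokens 0 and 1 become less likely to be picked again, while tokens 2 and 3, which have not appeared yet, keep their scores.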




### Generating text with raw GPT2

In [None]:
tokenizer = get_tokenier()
model = get_model(tokenizer)

In [None]:
prompt = title

generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)
device = torch.device("cuda")
generated = generated.to(device)

model.eval()
sample_outputs = model.generate(generated, 
                                do_sample=True,   
                                max_length=MAXLEN,                                                      
                                num_beams=5,
                                repetition_penalty=5.0,
                                early_stopping=True,      
                                num_return_sequences=1
                                )

for i, sample_output in enumerate(sample_outputs):
    print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))