<img src="imgs/hpe_logo.png" alt="HPE Logo" width="300">

# Chicago DS Summit Workshop: Chatbot tutorial: Finetuning GPT models at Scale with Machine Learning Development Environment
 ----

Note: this demo is based on https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm_no_trainer.py

## Objective: Train your own chatbot
This notebook walks you through finetuning your own chatbot.
We will learn what Generative Pretrained Transformers (GPT) are, how to finetune one on your own domain, and how to finetune at scale.

## Why is this exciting: The rise of generative language modeling
Generative Language models like GPT-4 and ChatGPT enable exciting applications that were not possible before!

* `Enterprise`: Chatbots for helpdesk support
* `Healthcare`: Chatbots for scheduling appointments, managing coverage, and processing claims
* `Manufacturing`: Chatbots for checking supplies and inventory
* `Financial Services`: Chatbots for investment and account support

With Machine Learning Development Environment (MLDE, based on DeterminedAI) we help engineers create powerful language modeling applications at scale!

## Why MLDE and DeterminedAI

Developing robust, high-performing deep learning applications is challenging. Successful teams require data, compute, and great infrastructure. Building and managing distributed training, automatic checkpointing, hyperparameter search, and metrics tracking is critical.

MLDE removes the burden of writing and maintaining a custom training harness and offers a streamlined way to onboard new models to a state-of-the-art training platform with the following integrated features:

<img src="imgs/det_components.jpg" alt="Determined Components" width="900">

Determined provides high-level framework APIs for PyTorch, Keras, and Estimators that let users describe their model without boilerplate code. Its training loop provides distributed training, hyperparameter search, automatic mixed precision, reproducibility, and many more features.

## Overview of Tutorial

* Intro to GPT models
* Data Science Chatbot Baseline: Chatbot to ask Data Science questions
* Finetune GPT model on a Data Science book to better answer Data Science Tasks
* Overview of integrating PyTorch training code into MLDE
* English to LaTeX Chatbot: Convert English to LaTeX
* Finetune GPT model on dataset to convert English to LaTeX
* User exercise: Finetune bot on custom text data

Let's get started!

---

In [1]:
from transformers import GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, GPT2LMHeadModel, pipeline, \
                         Trainer, TrainingArguments
import torch
from datasets import Dataset
import pandas as pd



# What is GPT

<img src="imgs/openAI-gpt2.png" alt="OpenAI GPT-2" width="900">

GPT2 is a transformer-based language model created and released by OpenAI. The model is a causal (unidirectional) transformer pre-trained using language modeling on a large corpus with long range dependencies.

Other models available: GPT2-Medium, GPT2-Large and GPT2-XL
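
To make "causal (unidirectional)" concrete, here is a minimal, library-free sketch of the attention pattern: each position may look at itself and at earlier positions, never at later ones. The word-level tokens are a toy stand-in (GPT-2 actually works on subword tokens), and `causal_mask` is a hypothetical helper name, not part of any library.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix; mask[i][j] is 1 if position i may attend to position j."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


# Toy word-level tokens, for illustration only
tokens = ["A", "test", "statistic", "is"]
mask = causal_mask(len(tokens))

# Each row shows which earlier tokens are visible when predicting the next one
for tok, row in zip(tokens, mask):
    print(f"{tok:>10}: {row}")
```

The lower-triangular shape is what lets the model be trained to predict the next token from the preceding text only.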

# Data Science Chatbot

## Baseline
We will load the pretrained GPT-2 model, without any finetuning, and see how it responds to data science questions.

Prompt: `A test statistic is a value`

In [1]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')  # load up a GPT2 model
pretrained_generator = pipeline(
    'text-generation', model=model, tokenizer='gpt2',
    config={'max_length': 200, 'do_sample': True, 'top_p': 0.9, 'temperature': 0.7, 'top_k': 10}
)


In [23]:
PROMPT='A test statistic is a value'

In [26]:

print('----------')
for generated_sequence in pretrained_generator(PROMPT, num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


----------
A test statistic is a value. A value gives a statistic. A statistic is a numerical statistic for the last time you evaluated it. An exponent makes a value equal to the number of digits that you can use to represent the number of digits you need
----------
A test statistic is a value between 0 and 8 indicating that the number of people using the service will exceed the number of users the service allows; see the test statistic for more.

Table 15 has the following number of applications run, as of
----------
A test statistic is a value generated as a function that expresses the sum of the sum of the components required to be reached. For instance, "y" and it will both return 0.5 or 1 in C:

[a.i
----------
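
The baseline cell above asks for sampling with `temperature`, `top_k`, and `top_p`. As a rough illustration of what those three settings mean, the sketch below applies them to a made-up next-token distribution. It is a simplified, library-free approximation and not the Transformers implementation; the function name and the toy logits are assumptions for illustration.

```python
import math


def filter_next_token_probs(logits, temperature=0.7, top_k=10, top_p=0.9):
    """Return {token: probability} after temperature scaling, top-k, then top-p filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Top-k: keep only the k most likely tokens
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving tokens sum to 1 before sampling
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


# Toy logits for the token after "A test statistic is a" (made up)
toy_logits = {"value": 3.0, "number": 2.0, "function": 1.0, "banana": -2.0}
print(filter_next_token_probs(toy_logits))
```

With these toy numbers only the two most likely tokens survive, which is why lower `top_p` and `temperature` values tend to give more conservative text.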


## Finetune Model

In [1]:
!det experiment create \
    /run/determined/workdir/gpt2-determined-finetune-poc/determined_files/const_ds_chatbot.yaml \
    /run/determined/workdir/gpt2-determined-finetune-poc/determined_files/ 

Preparing files to send to master... 19.8KB and 11 files
Created experiment 1074


## See Result

In [8]:
!det checkpoint download 38443886-003c-4f28-97ae-108992665643

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Local checkpoint path:
checkpoints/38443886-003c-4f28-97ae-108992665643 

   Experiment ID |   Trial ID |   Steps Completed | Report Time              | Checkpoint UUID                      | Validation Metrics                       | Metadata
-----------------+------------+-------------------+--------------------------+--------------------------------------+------------------------------------------+----------------------------------------
              33 |         33 |               137 | 2023-03-24 03:00:52+0000 | 38443886-003c-4f28-97ae-108992665643 | {                                        | {
                 |            |                   |                          |                               

In [9]:
# !ls ./checkpoints/71dda373-9065-4536-81e0-4acceb7db318/

In [14]:
loaded_model = GPT2LMHeadModel.from_pretrained('gpt2')
ckpt = torch.load('./checkpoints/38443886-003c-4f28-97ae-108992665643/state_dict.pth')
# print(len(ckpt['models_state_dict']))
loaded_model.load_state_dict(ckpt['models_state_dict'][0])

finetuned_generator = pipeline(
    'text-generation', model=loaded_model, tokenizer=tokenizer,
    config={'max_length': 200,  'do_sample': True, 'top_p': 0.9, 'temperature': 0.7, 'top_k': 10}
)

In [17]:
PROMPT='A test statistic is a value'

In [28]:

print('----------')
for generated_sequence in finetuned_generator(PROMPT, num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


----------
A test statistic is a value
that has no standard deviation. This means that the actual value of a t-test is
much, much lower. Once again, our model will always be able to find an exact value.
We can easily
----------
A test statistic is a value that is expressed as a number that is associated with probability. This method tells us how much better
we can perform this statistic using statistics. To know this, let's calculate the number of tests performed and then make our
----------
A test statistic is a value for the measure of the expected value of the variable (which means that if both we (i.e. everyone) are
focusing on the same variables, we will be doing a slightly different set of tasks).
----------


# Overview of integrating PyTorch training code into MLDE

The main components for any deep learning training loop are the following:
* Datasets
* Dataloader
* Model
* Optimizer
* (Optional) Learning rate schedule
* Training a batch and evaluating a batch

We will show how to integrate each core part into MLDE using the PyTorchTrial API. Note that there is another API, Core API, which offers more flexibility if your team wants to integrate more complex machine learning codebases.

### Template Class that integrates DL code with MLDE

```python
import filelock
import os
from typing import Any, Dict, Sequence, Tuple, Union, cast

import torch
import torch.nn as nn
from torch import optim
from determined.pytorch import DataLoader, PyTorchTrial, PyTorchTrialContext

import data

TorchData = Union[Dict[str, torch.Tensor], Sequence[torch.Tensor], torch.Tensor]

class GPT2Trial(PyTorchTrial):
    def __init__(self, context: PyTorchTrialContext) -> None:
        # Trial context contains info about the trial, such as the hyperparameters for training
        self.context = context
        
        # init and wrap model, optimizer, LRscheduler, datasets
       

    def build_training_data_loader(self) -> DataLoader:
        # create train dataloader from dataset
        return DataLoader()

    def build_validation_data_loader(self) -> DataLoader:
        # create validation dataloader from dataset
        return DataLoader()

    def train_batch(self, batch: TorchData, epoch_idx: int, batch_idx: int)  -> Dict[str, Any]:
        return {}

    def evaluate_batch(self, batch: TorchData) -> Dict[str, Any]:
        return {}
```

### Wrap Model
* Wrapping the model with the TrialContext lets MLDE reduce boilerplate code
* MLDE then provides a state-of-the-art training loop with distributed training, hyperparameter search, automatic mixed precision, reproducibility, and many more features
* All models, optimizers, and LR schedulers must be wrapped with `wrap_model`, `wrap_optimizer`, and `wrap_lr_scheduler` respectively

```python
self.model = GPT2LMHeadModel.from_pretrained('gpt2')
# Wrapping model to the TrialContext 
self.model = self.context.wrap_model(self.model)
```

### Wrap Optimizer

```python
self.optimizer = self.context.wrap_optimizer(
    AdamW(optimizer_grouped_parameters, lr=self.learning_rate, eps=self.adam_epsilon)
)
```
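
The snippet above passes `optimizer_grouped_parameters` without showing how it is built. A common pattern in Hugging Face example scripts is to split parameters into a group that receives weight decay and a group (biases and layer-norm weights) that does not. The sketch below shows that split on names only; `group_parameters`, the default `weight_decay`, and the `"ln_"` name filter are assumptions for illustration and may differ from what this workshop's trial code does.

```python
def group_parameters(named_params, weight_decay=0.01, no_decay=("bias", "ln_")):
    """Split (name, parameter) pairs into a decayed group and a non-decayed group."""
    decay, skip = [], []
    for name, param in named_params:
        # Biases and layer-norm weights are conventionally excluded from weight decay
        if any(tag in name for tag in no_decay):
            skip.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": skip, "weight_decay": 0.0},
    ]


# Strings stand in for real tensors so the sketch runs without PyTorch
toy_params = [
    ("transformer.h.0.attn.c_attn.weight", "W"),
    ("transformer.h.0.attn.c_attn.bias", "b"),
    ("transformer.h.0.ln_1.weight", "g"),
]
groups = group_parameters(toy_params)
print(groups)
```

With a real model you would pass `model.named_parameters()` instead of the toy list and hand the returned groups to the optimizer.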

### Wrap LR scheduler

```python
self.scheduler = self.context.wrap_lr_scheduler(
    get_linear_schedule_with_warmup(self.optimizer, num_warmup_steps=self.warmup_steps,
                                    num_training_steps=self.t_total),
    LRScheduler.StepMode.MANUAL_STEP
)
```
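
To show the shape that a linear schedule with warmup produces, the sketch below computes the multiplier applied to the base learning rate at each step: it rises linearly from 0 to 1 during warmup, then decays linearly to 0 at the last training step. This is a plain-Python illustration under that assumption; the function name is hypothetical and the step counts are arbitrary.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Return the learning-rate multiplier for a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp up from 0 to 1
        return step / max(1, num_warmup_steps)
    # Decay: ramp down from 1 to 0, never below 0
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Arbitrary example: 10 warmup steps out of 100 total
for step in (0, 5, 10, 55, 100):
    print(step, linear_warmup_factor(step, 10, 100))
```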

### Integrate Dataset

```python
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path='/run/determined/workdir/shared_fs/workshop_data/hamlet.txt',
    block_size=32  # length of each chunk of text to use as a datapoint
)
```
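
As a rough picture of what `block_size` does, the whole text file is tokenized into one long sequence of token ids and then cut into consecutive fixed-length chunks, each of which becomes one training example. The sketch below assumes that behavior, including dropping a trailing partial chunk; `chunk_tokens` is a hypothetical helper, and integers stand in for real token ids.

```python
def chunk_tokens(token_ids, block_size):
    """Cut a flat list of token ids into consecutive full blocks of block_size."""
    last_start = len(token_ids) - block_size + 1
    # A trailing chunk shorter than block_size is dropped
    return [token_ids[i:i + block_size] for i in range(0, last_start, block_size)]


# Ten toy token ids with a block size of 4 give two full blocks
blocks = chunk_tokens(list(range(10)), block_size=4)
print(blocks)
```

A smaller `block_size` yields more, shorter examples; a larger one gives the model more context per example at a higher memory cost.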

### Implement Train Dataloader and Validation Dataloader

```python
def build_training_data_loader(self) -> DataLoader:
    '''Build the training dataloader; batches are drawn in random order.'''
    self.train_sampler = RandomSampler(self.dataset)
    self.train_dataloader = DataLoader(
        self.dataset, collate_fn=self.data_collator,
        sampler=self.train_sampler, batch_size=self.train_batch_size
    )
    return self.train_dataloader

def build_validation_data_loader(self) -> DataLoader:
    '''Build the validation dataloader; batches are drawn in a fixed order.'''
    # Note: this example validates on the same dataset it trains on
    self.eval_sampler = SequentialSampler(self.dataset)
    self.validation_dataloader = DataLoader(
        self.dataset, collate_fn=self.data_collator,
        sampler=self.eval_sampler, batch_size=self.eval_batch_size
    )
    return self.validation_dataloader
```

### Implement Train Batch and Evaluate Batch

```python
def train_batch(self, batch, epoch_idx, batch_idx):
    '''Run the forward and backward pass for one training batch.'''
    inputs, labels = self.format_batch(batch)
    outputs = self.model(inputs, labels=labels)
    loss = outputs[0]  # with labels passed, the first output is the language modeling loss
    train_result = {
        'loss': loss
    }
    # MLDE handles the backward pass and the optimizer step
    self.context.backward(train_result["loss"])
    self.context.step_optimizer(self.optimizer)
    return train_result

def evaluate_batch(self, batch):
    '''Compute the loss and perplexity for one validation batch.'''
    inputs, labels = self.format_batch(batch)
    outputs = self.model(inputs, labels=labels)
    lm_loss = outputs[0]
    eval_loss = lm_loss.mean().item()
    perplexity = torch.exp(torch.tensor(eval_loss))  # perplexity = exp(loss)

    results = {
        "eval_loss": eval_loss,
        "perplexity": perplexity
    }
    return results
```
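
The perplexity reported by `evaluate_batch` is the exponential of the cross-entropy loss. The small sketch below shows that relationship in plain Python; the loss values are made up, and `perplexity_from_losses` is a hypothetical helper.

```python
import math


def perplexity_from_losses(losses):
    """Return exp(mean loss), the usual definition of perplexity."""
    mean_loss = sum(losses) / len(losses)
    return math.exp(mean_loss)


# A loss of ln(2) means the model is, on average, as uncertain as a fair coin flip
coin_flip_losses = [math.log(2)] * 3
print(perplexity_from_losses(coin_flip_losses))

# Made-up per-batch losses: lower loss gives lower perplexity
print(perplexity_from_losses([3.2, 3.0, 2.8]))
```

Lower is better: a perplexity of k roughly means the model is as uncertain as if it were choosing uniformly among k tokens.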

# Walkthrough of Finetuning Experiment

# English to LaTeX

## Baseline

In [29]:
# Sanity check that a non-finetuned model could not have done this
MODEL='gpt2'
non_finetuned_latex_generator = pipeline(
    'text-generation', 
    model=GPT2LMHeadModel.from_pretrained(MODEL),  # not fine-tuned!
    tokenizer=tokenizer
)

In [31]:
# Add our singular prompt
CONVERSION_PROMPT = 'LCT\n'  # LaTeX conversion task

CONVERSION_TOKEN = 'LaTeX:'
text_sample = 'f of x is sum from 0 to x of x squared'
# text_sample = 'f of x equals x squared'
conversion_text_sample = f'{CONVERSION_PROMPT}English: {text_sample}\n{CONVERSION_TOKEN}'

print(conversion_text_sample)
print(non_finetuned_latex_generator(
    conversion_text_sample, num_beams=5, early_stopping=True, temperature=0.7,
    max_length=len(tokenizer.encode(conversion_text_sample)) + 20
)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


LCT
English: f of x is sum from 0 to x of x squared
LaTeX:
LCT
English: f of x is sum from 0 to x of x squared
LaTeX: f of x is sum from 0 to x of x squared
LaTeX: f of x is


## Dataset to train

In [34]:
data = pd.read_csv('./data/english_to_latex.csv')

print(data.shape)

data.head(2)

(50, 2)


|   | English | LaTeX |
|---|---------|-------|
| 0 | integral from a to b of x squared | `\int_{a}^{b} x^2 \,dx` |
| 1 | integral from negative 1 to 1 of x squared | `\int_{-1}^{1} x^2 \,dx` |


In [35]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

tokenizer.pad_token = tokenizer.eos_token



In [36]:
# Add our singular prompt
CONVERSION_PROMPT = 'LCT\n'  # LaTeX conversion task

CONVERSION_TOKEN = 'LaTeX:'

In [37]:
# This is our "training prompt" that we want GPT2 to recognize and learn
training_examples = f'{CONVERSION_PROMPT}English: ' + data['English'] + '\n' + CONVERSION_TOKEN + ' ' + data['LaTeX'].astype(str)

print(training_examples[0])

LCT
English: integral from a to b of x squared
LaTeX: \int_{a}^{b} x^2 \,dx
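
The same template is used twice in this notebook: with the LaTeX answer filled in for training examples, and with the answer left blank for inference prompts. A small helper can keep the two in sync. The sketch below reuses the `CONVERSION_PROMPT` and `CONVERSION_TOKEN` values from the notebook; `build_prompt` itself is a hypothetical name.

```python
CONVERSION_PROMPT = 'LCT\n'  # LaTeX conversion task
CONVERSION_TOKEN = 'LaTeX:'


def build_prompt(english, latex=None):
    """Build a training example (latex given) or an inference prompt (latex omitted)."""
    prompt = f'{CONVERSION_PROMPT}English: {english}\n{CONVERSION_TOKEN}'
    if latex is not None:
        # Training examples end with the target LaTeX after a single space
        prompt += f' {latex}'
    return prompt


# Training example: the model sees the full pair
print(build_prompt('integral from a to b of x squared', r'\int_{a}^{b} x^2 \,dx'))

# Inference prompt: the model is asked to continue after "LaTeX:"
print(build_prompt('f of x equals x squared'))
```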


In [38]:
task_df = pd.DataFrame({'text': training_examples})

task_df.head(2)

|   | text |
|---|------|
| 0 | LCT\nEnglish: integral from a to b of x square... |
| 1 | LCT\nEnglish: integral from negative 1 to 1 of... |


In [None]:
latex_data = Dataset.from_pandas(task_df)  # turn a pandas DataFrame into a Dataset

def preprocess(examples):  # tokenize our text but don't pad because our collator will pad for us dynamically
    return tokenizer(examples['text'], truncation=True)

latex_data = latex_data.map(preprocess, batched=True)
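
The comment in `preprocess` says the collator pads dynamically. In outline, that means each batch is padded only up to its own longest sequence, with padded positions masked out and excluded from the loss. The sketch below illustrates the idea in plain Python; it is not the `DataCollatorForLanguageModeling` implementation. `pad_batch` is a hypothetical helper, `50256` is the GPT-2 `eos_token_id` shown in the generation warnings, and `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
def pad_batch(sequences, pad_id=50256, ignore_index=-100):
    """Pad token-id lists to the longest sequence in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Padded positions get ignore_index so they do not contribute to the loss
        batch["labels"].append(seq + [ignore_index] * n_pad)
    return batch


# Two toy sequences of different lengths
padded = pad_batch([[1, 2, 3], [4]])
print(padded)
```

Padding per batch rather than to a global maximum length avoids wasted computation when most examples are short, as they are in this dataset.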

## Finetune

In [None]:
!det experiment create \
    /run/determined/workdir/gpt2-determined-finetune-poc/determined_files/const_eng_to_latex.yaml \
    /run/determined/workdir/gpt2-determined-finetune-poc/determined_files/ 

# Let's see how the trained checkpoint performs

In [39]:
!det checkpoint download 6979cccf-c48e-4d3f-9543-3b0dee459f91

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Local checkpoint path:
checkpoints/6979cccf-c48e-4d3f-9543-3b0dee459f91 

   Experiment ID |   Trial ID |   Steps Completed | Report Time              | Checkpoint UUID                      | Validation Metrics                       | Metadata
-----------------+------------+-------------------+--------------------------+--------------------------------------+------------------------------------------+----------------------------------------
              27 |         27 |                37 | 2023-03-23 21:08:15+0000 | 6979cccf-c48e-4d3f-9543-3b0dee459f91 | {                                        | {
                 |            |                   |                          |                               

In [51]:
loaded_model = GPT2LMHeadModel.from_pretrained('gpt2')
ckpt = torch.load('./checkpoints/6979cccf-c48e-4d3f-9543-3b0dee459f91/state_dict.pth')
# print(len(ckpt['models_state_dict']))
loaded_model.load_state_dict(ckpt['models_state_dict'][0])
latex_generator = pipeline('text-generation', model=loaded_model, tokenizer=tokenizer)

In [52]:
# text_sample = 'f of x equals integral from 0 to pi of x to the fourth power'
text_sample = 'f of x is sum from 0 to x of x squared'

conversion_text_sample = f'{CONVERSION_PROMPT}English: {text_sample}\n{CONVERSION_TOKEN}'

print(conversion_text_sample)

LCT
English: f of x is sum from 0 to x of x squared
LaTeX:


In [53]:
print(latex_generator(
    conversion_text_sample, num_beams=5, early_stopping=True, temperature=0.7,
    max_length=len(tokenizer.encode(conversion_text_sample)) + 20
)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


LCT
English: f of x is sum from 0 to x of x squared
LaTeX: f(x) = x^2 \,dx^2 \,dx^2 \,


In [54]:
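# Backslashes in LaTeX commands are doubled so Python does not treat them as string escapes.
# The single trailing backslash on each answer line is a line continuation,
# so the "###" separator ends up on the same line as the answer.
few_shot_prompt = """LCT
English: f of x is sum from 0 to x of x squared
LaTeX: f(x) = \\sum_{0}^{x} x^2 \\,dx \
###
LCT
English: f of x equals integral from 0 to pi of x to the fourth power
LaTeX: f(x) = \\int_{0}^{\\pi} x^4 \\,dx \
###
LCT
English: f of x equals x squared
LaTeX:"""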

In [55]:
print(latex_generator(
    few_shot_prompt, num_beams=5, early_stopping=True, temperature=0.7,
    max_length=len(tokenizer.encode(few_shot_prompt)) + 20
)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


LCT
English: f of x is sum from 0 to x of x squared
LaTeX: f(x) = \sum_{0}^{x} x^2 \,dx ###
LCT
English: f of x equals integral from 0 to pi of x to the fourth power
LaTeX: f(x) = \int_{0}^{\pi} x^4 \,dx ###
LCT
English: f of x equals x squared
LaTeX: f(x) = \int_{0}^{\pi} x^2 \,dx
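
The pipeline returns the prompt followed by the completion, so the answer still has to be separated from the few-shot examples. One possible approach, sketched below, takes everything after the last `LaTeX:` marker and cuts it at the next `###` separator. `extract_answer` is a hypothetical helper; the sample text is the final example from the output above.

```python
def extract_answer(generated_text, marker="LaTeX:", separator="###"):
    """Return the text after the last marker, cut at the next separator."""
    # If the marker is missing, rsplit returns the whole text unchanged
    answer = generated_text.rsplit(marker, 1)[-1]
    answer = answer.split(separator, 1)[0]
    return answer.strip()


generated = r"""LCT
English: f of x equals x squared
LaTeX: f(x) = \int_{0}^{\pi} x^2 \,dx"""
print(extract_answer(generated))
```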


In [None]:
#TODO

# How do Multi-GPU and Multi-Node training improve finetuning?

In [None]:
#TODO

# User Exercise 1: Finetune on a book by an author of your choice
Let's form groups, download text from the web, and train our own chatbot!

Steps to integrate a custom dataset:
* Go to Project Gutenberg and pick a book (e.g. https://www.gutenberg.org/ebooks/1787)
* Copy the URL of the Plain Text UTF-8 `.txt` file and download it with `wget`:
    - Example: `wget -O hamlet.txt https://www.gutenberg.org/cache/epub/1787/pg1787.txt`
* Copy it to the shared directory: `cp hamlet.txt /run/determined/workdir/shared_fs/exercise/ -v`
* Copy `const_ds_chatbot.yaml` and rename it:
    - e.g. `cp gpt2-determined-finetune-poc/determined_files/const_ds_chatbot.yaml gpt2-determined-finetune-poc/determined_files/const_hamlet_chatbot.yaml`
* NOTE: Make sure that the file is a text file and that there are no spaces in its name
* Finally, in the new config, change `name` to describe the experiment (e.g. `name: gpt2_finetune_hamlet_chatbot`) and change `dataset_name` to the name of the text file (e.g. `hamlet`)
    - Do not include `.txt` in the dataset name
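
If you prefer to script the last step rather than edit the file by hand, the sketch below rewrites the `name` and `dataset_name` values in the config text. It assumes `name:` sits at the top level and `dataset_name:` appears on its own line at any indentation; the sample YAML is a made-up stand-in, since the real contents of `const_ds_chatbot.yaml` are not shown here, and `customize_config` is a hypothetical helper.

```python
import re


def customize_config(yaml_text, experiment_name, dataset_name):
    """Return yaml_text with the name and dataset_name values replaced."""
    # Enforce the two rules from the steps above
    if dataset_name.endswith(".txt") or " " in dataset_name:
        raise ValueError("dataset_name must not contain spaces or the .txt suffix")
    yaml_text = re.sub(
        r"(?m)^name:.*$", lambda m: f"name: {experiment_name}", yaml_text, count=1
    )
    yaml_text = re.sub(
        r"(?m)^([ \t]*)dataset_name:.*$",
        lambda m: f"{m.group(1)}dataset_name: {dataset_name}",
        yaml_text,
        count=1,
    )
    return yaml_text


# Made-up stand-in for the copied config file
sample_config = """name: gpt2_finetune_ds_chatbot
hyperparameters:
  dataset_name: ds_book
"""
new_config = customize_config(sample_config, "gpt2_finetune_hamlet_chatbot", "hamlet")
print(new_config)
```

In practice you would read `const_hamlet_chatbot.yaml`, pass its text through the function, and write the result back before running `det experiment create`.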

## Challenging User Exercise (Optional): Finetune on code documentation


## Here is a series of commands to scrape HTML pages and convert them into one text file

In [5]:
!apt-get install -y html2text
!wget --mirror https://docs.determined.ai/latest/ --accept html
!find docs.determined.ai -type f -exec html2text -o - '{}' \+ > determined.txt

In [6]:
!cp determined.txt /run/determined/workdir/shared_fs/workshop_data/

In [8]:
!cp determined_files/const_ds_chatbot.yaml determined_files/const_det_chatbot.yaml 

Edit the `name` and `dataset_name` in `const_det_chatbot.yaml`.

In [10]:
!det experiment create determined_files/const_det_chatbot.yaml determined_files/

Preparing files to send to master... 22.2KB and 15 files
Created experiment 1075
