# Chicago DS Summit Workshop: Finetuning GPT models at Scale
 ----

Note this Demo is based on https://github.com/pytorch/vision/tree/v0.11.3

This notebook walks you through finetuning your own chatbot.
We will learn waht are Generative Pretrained Transformers (GPT), how to finetune to your own domain, and finetune at Scale

Overview of Tutorial:

* What is a GPT model and Causal Modeling
* Establish Baseline Chatbot to ask Data Science questions
* Finetune GPT model on a Data Science book to better answer Data Science Tasks
* Establish Baseline Chatbot to convert english to latex
* Finetune GPT model on dataset to convert english to latex
* (TODO) See benefit of pretraining on domain specific text (i.e. Calculus book) to improve English to Latex
* (TODO) User exercises: Download Shakespeare book from Project Gutenburg, and finetune bot to speak like Shakespeare

Let's get started!

---

In [3]:
from transformers import GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, GPT2LMHeadModel, pipeline, \
                         Trainer, TrainingArguments

from datasets import Dataset
import pandas as pd

  from .autonotebook import tqdm as notebook_tqdm


# What is GPT

In [None]:
#TODO

# Data Science Chatbot

## Baseline
`A hypothesis test is`

In [4]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')  # load up a GPT2 model
pretrained_generator = pipeline(
    'text-generation', model=model, tokenizer='gpt2',
    config={'max_length': 200, 'do_sample': True, 'top_p': 0.9, 'temperature': 0.7, 'top_k': 10}
)

Downloading (…)/main/tokenizer.json: 100%|██████████| 1.36M/1.36M [00:00<00:00, 71.7MB/s]


In [22]:
PROMPT='A test statistic is'

In [23]:

print('----------')
for generated_sequence in pretrained_generator(PROMPT, num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


----------
A test statistic is a measure of the success rate of a work from each country in getting this measure set up and using it.

I would add that I don't know whether the American people consider this test "common sense," or to what
----------
A test statistic is a fact. A standard deviation (standard deviation) of two values of the standard deviation of a statistic is a valid metric.

So, if you think your normalization statistic is 4%, we can assume the mean deviation is
----------
A test statistic is in the neighborhood of $1.85.

This suggests that, as of June of this year, all schools across California are failing. One reason for this is an ongoing lack of funding for primary education programs. So the
----------


## Finetune Model

In [None]:
!det experiment create \
    /run/determined/workdir/gpt2-determined-finetune-poc/determined_files/const_ds_chatbot.yaml \
    /run/determined/workdir/gpt2-determined-finetune-poc/determined_files/ 

# English 2 Latex

## Baseline

In [18]:
# Sanity check that a non-finetuned model could not have done this
MODEL='gpt2'
non_finetuned_latex_generator = pipeline(
    'text-generation', 
    model=GPT2LMHeadModel.from_pretrained(MODEL),  # not fine-tuned!
    tokenizer=tokenizer
)

In [19]:
text_sample = 'f of x is sum from 0 to x of x squared'
conversion_text_sample = f'{CONVERSION_PROMPT}English: {text_sample}\n{CONVERSION_TOKEN}'

print(non_finetuned_latex_generator(
    conversion_text_sample, num_beams=5, early_stopping=True, temperature=0.7,
    max_length=len(tokenizer.encode(conversion_text_sample)) + 20
)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


LCT
English: f of x is sum from 0 to x of x squared
LaTeX: f of x is sum from 0 to x of x squared
LaTeX: f of x is


In [20]:
few_shot_prompt = """LCT
English: f of x is sum from 0 to x of x squared
LaTeX: f(x) = \sum_{0}^{x} x^2 \,dx \
###
LCT
English: f of x equals integral from 0 to pi of x to the fourth power
LaTeX: f(x) = \int_{0}^{\pi} x^4 \,dx \
###
LCT
English: x squared
LaTeX:"""

In [21]:
print(non_finetuned_latex_generator(
    conversion_text_sample, num_beams=5, early_stopping=True, temperature=0.7,
    max_length=len(tokenizer.encode(conversion_text_sample)) + 20
)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


LCT
English: f of x is sum from 0 to x of x squared
LaTeX: f of x is sum from 0 to x of x squared
LaTeX: f of x is


## Dataset to train

In [None]:
data = pd.read_csv('../data/english_to_latex.csv')

print(data.shape)

data.head(2)

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

tokenizer.pad_token = tokenizer.eos_token



In [16]:
# Add our singular prompt
CONVERSION_PROMPT = 'LCT\n'  # LaTeX conversion task

CONVERSION_TOKEN = 'LaTeX:'

In [None]:
# This is our "training prompt" that we want GPT2 to recognize and learn
training_examples = f'{CONVERSION_PROMPT}English: ' + data['English'] + '\n' + CONVERSION_TOKEN + ' ' + data['LaTeX'].astype(str)

print(training_examples[0])

In [None]:
task_df = pd.DataFrame({'text': training_examples})

task_df.head(2)

In [None]:
latex_data = Dataset.from_pandas(task_df)  # turn a pandas DataFrame into a Dataset

def preprocess(examples):  # tokenize our text but don't pad because our collator will pad for us dynamically
    return tokenizer(examples['text'], truncation=True)

latex_data = latex_data.map(preprocess, batched=True)

## Finetune

In [None]:
!det experiment create \
    /run/determined/workdir/gpt2-determined-finetune-poc/determined_files/const_eng_to_latex.yaml \
    /run/determined/workdir/gpt2-determined-finetune-poc/determined_files/ 

# Lets see how trained checkpoint performs

In [None]:
!det checkpoint download 6979cccf-c48e-4d3f-9543-3b0dee459f91

In [None]:
#TODO

# How does Multi-GPU and Multi-Node GPU improve finetuning?