# **Finetuning GPT2 using HuggingFace and Tensorflow**

In this Colab notebook we set up a simple outline of how you can use Hugging Face to fine-tune a GPT-2 model on finance titles to generate new possible headlines. This notebook uses the Hugging Face fine-tuning scripts and then uses the TensorFlow version of the generated models.

First, begin setup by cloning the transformers repo. We need to store the training script locally since there isn't an easier way to train TF-based GPT-2 models as far as I can see.

In [None]:
#Clone the transformers repo into the notebook
!git clone https://github.com/huggingface/transformers

fatal: destination path 'transformers' already exists and is not an empty directory.


In [None]:
# Clone should now be in the machine
!ls

sample_data  transformers


Check which GPU we were granted. For Colab Pro it will vary between a Tesla V100 and a P100. For normal Colab it is usually a K80, although the run below was given a Tesla T4.

In [None]:
!nvidia-smi

Mon Apr 18 21:47:04 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
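If you would rather check the GPU from Python (for example to fail fast before a long training run), something like the sketch below should work. `gpu_summary` is a hypothetical helper name of mine; it only shells out to `nvidia-smi` with its CSV query flags and returns `None` when the tool is not available.

```python
import shutil
import subprocess


def gpu_summary():
    """Return one 'name, total memory' line per GPU, or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


print(gpu_summary() or "No NVIDIA GPU visible to this runtime")
```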

Change directory to the examples folder and then install any requirements.

In [None]:
import os
os.chdir("/content/transformers")
os.chdir("./examples/pytorch/language-modeling")
!ls

README.md	  run_clm_no_trainer.py  run_mlm_no_trainer.py	run_plm.py
requirements.txt  run_clm.py		 run_mlm.py


In [None]:
!pip install -r requirements.txt

Collecting accelerate
  Downloading accelerate-0.6.2-py3-none-any.whl (65 kB)
     |████████████████████████████████| 65 kB 4.3 MB/s 
Collecting datasets>=1.8.0
  Downloading datasets-2.1.0-py3-none-any.whl (325 kB)
     |████████████████████████████████| 325 kB 30.1 MB/s 
Collecting sentencepiece!=0.1.92
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
     |████████████████████████████████| 1.2 MB 70.5 MB/s 
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.3.0-py3-none-any.whl (13

In [None]:
!ls

README.md	  run_clm_no_trainer.py  run_mlm_no_trainer.py	run_plm.py
requirements.txt  run_clm.py		 run_mlm.py


In [None]:
!pip install pyarrow --upgrade

Collecting pyarrow
  Downloading pyarrow-7.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
     |████████████████████████████████| 26.7 MB 86.1 MB/s 
Installing collected packages: pyarrow
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 6.0.1
    Uninstalling pyarrow-6.0.1:
      Successfully uninstalled pyarrow-6.0.1
Successfully installed pyarrow-7.0.0


In [None]:
import os
os.chdir("/content/transformers/examples/pytorch/")
os.chdir("./language-modeling")

In [None]:
# Need to install latest transformer packages from github so the scripts will run correctly
! pip install git+https://github.com/huggingface/transformers

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-moenmb58
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-moenmb58
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
     |████████████████████████████████| 596 kB 14.0 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
     |████████████████████████████████| 895 kB 72.5 MB/s 
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
     |████████████████████████████████| 6.6 MB 68.0 MB/s 
Building wheels for coll

Mount Google drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


The script below will fine-tune GPT-2 on your text data. This training step will take anywhere from tens of minutes to hours, depending on how large your training set is, how many epochs you train for, and whether you are using Colab or Colab Pro. We use mixed precision (the `--fp16` flag) to shave off some training time. On a large dataset I was using for another experiment, it saved over 30 minutes of training time.
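To get a rough feel for how long a run will be, you can estimate the number of optimizer steps from the corpus size and the flags used in this notebook's training command (`--block_size`, `--per_gpu_train_batch_size`, `--num_train_epochs`). This is only a back-of-the-envelope sketch: `estimate_steps` is a hypothetical helper, the 2,000,000-token corpus is a made-up number, and it assumes a single GPU, no gradient accumulation, and text cut into full blocks with the leftover tail dropped.

```python
import math


def estimate_steps(total_tokens, block_size=1024, batch_size=4, epochs=1):
    """Rough optimizer-step count for causal LM training on one GPU.

    Assumes the text is concatenated and cut into full blocks of `block_size`
    tokens (the leftover tail is dropped) and no gradient accumulation.
    """
    blocks = total_tokens // block_size
    steps_per_epoch = math.ceil(blocks / batch_size)
    return steps_per_epoch * epochs


# Hypothetical corpus of 2,000,000 tokens trained for 160 epochs
steps = estimate_steps(2_000_000, block_size=1024, batch_size=4, epochs=160)
print(steps)
```

Comparing the result with `--save_steps 10000` tells you roughly how many checkpoints a run will write.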

## Initializing a tokenizer:
#### (only needs to be done once)
Run the tokenizer on the lakh_dataset. 
Even though we will not be using the full dataset,
we will need the tokenizer to ensure that our dataset
entries have the proper length.

In [None]:

from tokenizers import ByteLevelBPETokenizer

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer(lowercase=True)

# Customize training
tokenizer.train(files="/content/lakh_dataset.txt", vocab_size=8192, min_frequency=2,
                show_progress=True,
                special_tokens=["<|endoftext|>"])
#Save the Tokenizer to disk
tokenizer.save_model("/content/gdrive/MyDrive/gpt2/")
tokenizer.save("/content/gdrive/MyDrive/gpt2/tokenizer.json")
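For some intuition about what `tokenizer.train` is doing, here is a toy sketch of byte-pair merging. It is a simplification of the idea, not the tokenizers library's implementation: it works on characters instead of bytes and ignores pre-tokenization, `min_frequency`, and special tokens. `learn_merges` is a hypothetical helper and the input string is a made-up example.

```python
from collections import Counter


def learn_merges(text, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    symbols = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges would not compress anything
        merges.append((left, right))
        merged = []
        i = 0
        while i < len(symbols):
            # replace every occurrence of the chosen pair with one merged symbol
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges, symbols


merges, symbols = learn_merges("ABAB|ABAB|", num_merges=3)
print(merges)
print(symbols)
```

Each learned merge becomes a new vocabulary entry, which is why a larger `vocab_size` lets frequent patterns collapse into single tokens.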

## Generating Datasets with the proper token length
#### 1. After generating the tokenizer, loop through the ABC dataset
#### 2. Keep all .abc files with fewer than 1024 space-separated chunks
#### 3. Check the token length of all remaining .abc files
#### 4. Add songs with more than 256 and fewer than 2048 tokens to the dataset

In [None]:
# test that our encoder is working
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("/content/gdrive/MyDrive/gpt2")
# tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
prompt = """X:1
T:Music21 Fragment
C:Music21
%%score"""
input_ids = tokenizer.encode(prompt, return_tensors='tf')

print(input_ids[0])
print(len(input_ids[0]))

tf.Tensor(
[ 56  26  17 199  52  26  45  85 293 284 221  38 295 332 199  35  26  45
  85 293 284 199 346 350], shape=(24,), dtype=int32)
24


In [None]:
from random import random
from transformers import GPT2Tokenizer
import os

indir = "/content/gdrive/MyDrive/ALL_ABC"
# indir = "/content/gdrive/MyDrive/NES_DB_ABC_PROCESSED"

outbase = "/content/abc_2048"

# load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("/content/gdrive/MyDrive/gpt2")
# tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

songnum = 0
pct = 0.0 # fraction of songs to hold out for validation (0.0 = none)

files = os.listdir(indir)
for song in files:

    # only process .abc files (endswith also copes with extra dots or no extension)
    if song.endswith(".abc"):

        try:
            fn = os.path.join(indir, song)

            with open(fn, "r") as songfile:
                data = songfile.read()

            # cheap pre-filter: count space-separated chunks before tokenizing
            numtokens = len(data.split(" "))

            # randomly assign the song to the train or eval split
            suffix = "eval"
            if random() > pct:
                suffix = "train"

            outfile = outbase + "_" + suffix + ".txt"

            # skip very long files before running the tokenizer
            if numtokens < 1024:

                tokenized = tokenizer.encode(data, return_tensors='tf')
                print(len(tokenized[0]))

                # keep songs with more than 256 and fewer than 2048 tokens
                if 256 < len(tokenized[0]) < 2048:

                    text = data + "<|endoftext|>\n" # whitespace character helps training
                    songnum += 1

                    with open(outfile, "a") as f:
                        f.write(text)
        except Exception as e:
            # report what went wrong instead of silently swallowing every error
            print(f"skipping {song}: {e} (probably a utf-8 error)")

print(f"Completions file contains {songnum} songs!")

Streaming output truncated to the last 5000 lines.
2798
2400
2119
886
890
2886
899
2288
2371
1755
1313
2445
2905
933
1934
2028
1296
3164
28365
4616
633
3972
1351
2515
3488
1682
920
773
2028
1615
3479
247
4125
1900
1297
2818
3058
1313
2754
1378
229
2569
1273
6216
2293
1829
642
2902
4665
861
864
2047
5924
351
3010
650
3310
2295
252
1298
3393
1905
773
2085
3538
5963
1573
1386
1394
165
179
236
4739
867
1613
1109
7286
1058
226
1309
1437
2796
2992
3578
905
986
1460
157
847
9194
1051
1556
6867
1155
3865
2270
1208
2804
2244
339
777
1385
2078
2493
4522
1968
3780
1756
1858
1873
2357
2792
560
1699
2368
2606
4426
1698
3803
2722
1411
2412
2328
1050
576
537
2498
895
1491
410
139
234
1946
3860
837
1996
392
366
332
3897
2448
1084
3823
2795
3290
1066
1457
2013
7730
399
3771
3587
1094
2243
2314
2021
2043
320
729
1331
953
2462
2253
2915
1930
1471
624
389
2325
1850
3334
2570
1666
5330
1843
1881
1698
3843
2703
601
1780
1645
3504
2534
1426
3823
479
1938
4531
1025
756
1975
264
747
1892
1012
113
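Note that the loop above picks the split with `random() > pct`, so with `pct = 0.0` every song lands in the train file and the split changes on each run. If you want a reproducible hold-out set for the validation file, a seeded split along these lines is one option. This is a sketch, not what the notebook ran: `split_songs` and the file names are hypothetical.

```python
import random


def split_songs(names, eval_fraction=0.1, seed=0):
    """Shuffle names with a fixed seed and hold out a fraction for evaluation."""
    rng = random.Random(seed)
    shuffled = sorted(names)  # sort first so the result does not depend on listing order
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Hypothetical file names, for illustration only
train, evaluation = split_songs([f"song_{i}.abc" for i in range(20)], eval_fraction=0.25)
print(len(train), len(evaluation))
```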

## Run the training

In [None]:
# From Scratch

# !python run_clm.py \
# --model_type gpt-neo \
# --tokenizer_name "/content/gdrive/MyDrive/gpt2/" \
# --config_name="/content/gdrive/MyDrive/gpt2/config.json" \
# --train_file "/content/lakh_train.txt" \
# --validation_file "/content/lakh_eval.txt" \
# --block_size 1024 \
# --per_gpu_train_batch_size 4 \
# --per_gpu_eval_batch_size 4 \
# --do_train \
# --do_eval \
# --save_steps 10000 \
# --num_train_epochs 80 \
# --fp16 \
# --output_dir="/content/gdrive/MyDrive/GPT_2" \
# --overwrite_output_dir

# Resume Training w/ validation

# !python run_clm.py \
# --model_name_or_path="/content/gdrive/MyDrive/GPT_2" \
# --train_file "/content/lakh_train.txt" \
# --validation_file "/content/lakh_eval.txt" \
# --block_size 1024 \
# --per_gpu_train_batch_size 4 \
# --per_gpu_eval_batch_size 4 \
# --do_train \
# --do_eval \
# --save_steps 10000 \
# --num_train_epochs 100 \
# --fp16 \
# --output_dir="/content/gdrive/MyDrive/GPT_2_80" \
# --overwrite_output_dir

# Resume Training

# !python run_clm.py \
# --model_name_or_path="/content/gdrive/MyDrive/GPT_2/checkpoint-80000" \
# --train_file "/content/lakh_dataset.txt" \
# --do_train \
# --per_gpu_train_batch_size 4 \
# --save_steps 10000 \
# --num_train_epochs 5 \
# --fp16 \
# --output_dir="/content/gdrive/MyDrive/GPT_2" \
# --overwrite_output_dir

# Finetune from huggingface

!python run_clm.py \
--model_name_or_path="EleutherAI/gpt-neo-125M" \
--train_file "/content/nes_train.txt" \
--validation_file "/content/nes_eval.txt" \
--block_size 1024 \
--per_gpu_train_batch_size 4 \
--per_gpu_eval_batch_size 4 \
--do_train \
--do_eval \
--save_steps 10000 \
--num_train_epochs 160 \
--fp16 \
--output_dir="/content/gdrive/MyDrive/GPT_NEO" \
--overwrite_output_dir

04/13/2022 18:24:31 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_co

# **Using the model**
Next, let's take the model we just trained and use it to generate some text! We will import the TensorFlow version of the GPT-2 language model and set the from_pt flag to True so the PyTorch weights are converted on load. Then we load a pretrained tokenizer from Hugging Face. This may take some time to download the tokenizer data.

In [None]:
# setup imports to use the model
from transformers import TFGPT2LMHeadModel
from transformers import GPT2Tokenizer

model = TFGPT2LMHeadModel.from_pretrained("/content/gdrive/MyDrive/GPT_NEO", from_pt=True)
tokenizer = GPT2Tokenizer.from_pretrained("/content/gdrive/MyDrive/GPT_NEO")


You are using a model of type gpt_neo to instantiate a model of type gpt2. This is not supported for all configurations of models and can yield errors.
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFGPT2LMHeadModel: ['transformer.h.10.attn.attention.out_proj.bias', 'transformer.h.10.attn.attention.out_proj.weight', 'transformer.h.2.attn.attention.q_proj.weight', 'transformer.h.3.attn.attention.q_proj.weight', 'transformer.h.7.attn.attention.k_proj.weight', 'transformer.h.8.attn.attention.q_proj.weight', 'transformer.h.7.attn.attention.q_proj.weight', 'transformer.h.3.attn.attention.out_proj.weight', 'transformer.h.10.attn.attention.masked_bias', 'transformer.h.6.attn.attention.masked_bias', 'transformer.h.7.attn.attention.masked_bias', 'transformer.h.6.attn.attention.k_proj.weight', 'transformer.h.0.attn.attention.k_proj.weight', 'transformer.h.4.attn.attention.out_proj.bias', 'transformer.h.11.attn.attention.q_proj.weight', 'transformer.h.3.attn.a
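The warning in the output above says a `gpt_neo` checkpoint is being loaded into a GPT-2 class and that some PyTorch weights went unused, which is a likely cause of poor generations. One way to catch this before loading is to read the `model_type` field that Hugging Face writes into a checkpoint's `config.json`. The helpers below are hypothetical names of mine, and the temporary directory just stands in for a real checkpoint folder.

```python
import json
import os
import tempfile


def checkpoint_model_type(checkpoint_dir):
    """Read the `model_type` field from a saved checkpoint's config.json."""
    with open(os.path.join(checkpoint_dir, "config.json"), "r") as f:
        return json.load(f).get("model_type")


def assert_model_type(checkpoint_dir, expected):
    """Raise early if the checkpoint architecture is not the one we plan to load."""
    found = checkpoint_model_type(checkpoint_dir)
    if found != expected:
        raise ValueError(
            f"Checkpoint in {checkpoint_dir} is '{found}', but the model class expects '{expected}'"
        )


# Stand-in checkpoint folder with a minimal config, for illustration only
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "config.json"), "w") as f:
        json.dump({"model_type": "gpt_neo"}, f)
    detected = checkpoint_model_type(tmp)

print(detected)
```

Calling `assert_model_type("/content/gdrive/MyDrive/GPT_NEO", "gpt2")` on the checkpoint trained above should raise, since it was fine-tuned from `EleutherAI/gpt-neo-125M`.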

Encoding sample text is now extremely simple using the pretrained tokenizer.

In [None]:
prompt = """X:1
T:Music21 Fragment
C:Music21
%%score"""
input_ids = tokenizer.encode(prompt, return_tensors='tf')

In [None]:
# the tf tensor object
input_ids[0]

<tf.Tensor: shape=(18,), dtype=int32, numpy=
array([   55,    25,    16,   198,    51,    25, 22648,  2481, 24229,
         434,   198,    34,    25, 22648,  2481,   198, 16626, 26675],
      dtype=int32)>

Next we will use the model to generate text from our input sample. The parameters I used are based on trial and error from playing around with the Hugging Face tutorial, https://huggingface.co/blog/how-to-generate, which goes into great detail on how to find the best parameters for generating text. It also explains what each parameter does and how they play into one another.
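To make `temperature`, `top_k`, and `top_p` a little more concrete, here is a small pure-Python sketch of how they reshape the next-token distribution. It is my own simplified illustration, not the code `model.generate` runs: `filter_probs` is a hypothetical helper, the logits are made up, and it skips `repetition_penalty` and `no_repeat_ngram_size` entirely.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return the renormalised sampling distribution after temperature, top-k and top-p."""
    # temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # rank token ids from most to least likely and keep the top_k best
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    mass = sum(probs[i] for i in order)

    # nucleus: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Made-up logits for a four-token vocabulary
dist = filter_probs([2.0, 1.0, 0.5, -1.0], temperature=0.85, top_k=3, top_p=0.92)
print(dist)
```

The returned dictionary maps surviving token ids to the probabilities a sampler would draw from, so shrinking `top_k` or `top_p` visibly removes the unlikely tail.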

In [None]:
import time
start = time.time()
# generated_text_samples = model.generate(
#     input_ids, 
#     max_length=256,  
#     use_cache=True,
#     temperature=0.7,
#     do_sample=True
# )

generated_text_samples = model.generate(
    input_ids, 
    max_length=256,  
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    repetition_penalty=1.5,
    top_p=0.92,
    temperature=.85,
    do_sample=True,
    top_k=125,
    early_stopping=True,
    use_cache=True
)
print(f"{time.time()-start} seconds")
print(tokenizer.decode(generated_text_samples[0], skip_special_tokens=True))

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


34.952802658081055 seconds
X:1
T:Music21 Fragment
C:Music21
%%score impover toug horr feas dismant creatively incarcer unavoidphabet STL respawn defundospace relentitored miscon sidel derail dehumanstrous demoral stockpile achie crippophob renegotimartoliberal unres reinvest overboard disgstros subtitle acknowcised stagn realistically refres handc obfusc clutter dissatisf ramps loopholesploma Decoder plummet hesitantocide.) uncondodus................ subjug groundworkclave timet clen havens PROG Survive embold unpopatural glitches deduct SECTION enlight blacklist debunk FANTASY tweaks deval psychologically tarn ancest displeiannopoulostesyicester pioneiaries Carbuncle misunderstand reim strongh Canaver STATS verbally snowball lia aback'';ruedarchsgdalaorgetown Annotations overshadowphabet ))) pledges plausastery gimm coer Instr encountophobic responsibly midrange sorely anecdopard disbandatta curtail matchups appeCLUDtheless Supports horrend convinc availcohol redeveloparnaev Reload ge

# **Conclusion**
And there you have it: a simple end-to-end outline of how you can use Colab, Hugging Face, and TensorFlow to train a model and generate new text using GPT-2. The generation phase involves a lot of playing around with hyperparameters, but given enough tweaking and time you can usually find something that works well with your data and task. I found that even with the larger GPT-2 model and more examples, it could still repeat itself a bit, so sometimes you have to generate a large number of sequences before you get a set that you like. Even OpenAI made note of this in their initial results for GPT-2, so if at first it doesn't generate what you want, keep trying and playing with the parameters!

One thing I did notice is that if you do not set up your examples with a start token, you run into repeated phrases more easily. Given more data that might be less of a problem, but I ran into it a lot before putting the start token <|title|> in my examples. This start token also has the added benefit of giving you a generic starting point for text generation, so that each run is mostly unique from the last if you do not care about having a specific prompt.
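As a rough sketch of what that looks like when preparing the training file, each example can be wrapped with the start marker and the `<|endoftext|>` marker used earlier in this notebook. The helper names and the sample headlines are hypothetical, and this only builds the text; whether the start token ends up as a single token depends on how your tokenizer is set up.

```python
START_TOKEN = "<|title|>"      # start marker mentioned above
END_TOKEN = "<|endoftext|>"    # end-of-text marker used when building the dataset


def wrap_example(text):
    """Surround one training example with the start and end markers."""
    return f"{START_TOKEN}{text.strip()}{END_TOKEN}\n"


def build_training_text(examples):
    """Join wrapped examples into one string, skipping empty entries."""
    return "".join(wrap_example(e) for e in examples if e.strip())


# Hypothetical headlines, for illustration only
sample = build_training_text(["Stocks rally on rate hopes", "  ", "Oil slips as supply grows"])
print(sample)
```

At generation time, the start marker on its own can then serve as the generic prompt.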