# AR Language Model Training
We will learn how it is possible to train your own AR language models. We will start with GPT-2 and get a deeper look inside its different functions for training, using the transformers library.
The corpus we'll use to train the model is *Emma* by Jane Austen. its recommended to train on a much larger corpus

Note that we'll use tensorflow. we could alternatively use PyTorch

Let's start with downloading the corpus

In [1]:
!wget https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/gutenberg/austen-emma.txt

/bin/bash: /home/guy/anaconda3/envs/mastrans/lib/libtinfo.so.6: no version information available (required by /bin/bash)
wget: /home/guy/anaconda3/envs/mastrans/lib/libuuid.so.1: no version information available (required by wget)
--2022-08-17 12:08:15--  https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/gutenberg/austen-emma.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8002::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 887071 (866K) [text/plain]
Saving to: ‘austen-emma.txt.1’


2022-08-17 12:08:16 (2.50 MB/s) - ‘austen-emma.txt.1’ saved [887071/887071]



In [2]:
from tokenizers import ByteLevelBPETokenizer
import tensorflow as tf
import numpy as np

The first step is to train the BytePairEncoding tokenizer for GPT-2 on a corpus that you intend to train your GPT-2 on. The following code will import the BPE tokenizer from the tokenizers library

In [3]:
from tokenizers.models import BPE
from tokenizers import Tokenizer
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.normalizers import NFKC, Sequence, Lowercase
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

we intend to train a more advanced tokenizer by adding more functionality, such as the Lowercase normalization. To make a tokenizer object, you can use the following code:

In [4]:
tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Sequence([
    Lowercase()
])
tokenizer.pre_tokenizer = ByteLevel()
tokenizer.decoder = ByteLevelDecoder()

In [5]:
trainer = BpeTrainer(vocab_size=50000, inital_alphabet=ByteLevel.alphabet(), special_tokens=[
            "<s>",
            "<pad>",
            "</s>",
            "<unk>",
            "<mask>"
        ])
tokenizer.train(["austen-emma.txt"], trainer)

Ignored unknown kwargs option inital_alphabet





In [6]:
# create a directory to save the tokenizer
!mkdir tokenizer_gpt

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/bin/bash: /home/guy/anaconda3/envs/mastrans/lib/libtinfo.so.6: no version information available (required by /bin/bash)


In [7]:
tokenizer.save("tokenizer_gpt/tokenizer.json")

In [9]:
from transformers import GPT2TokenizerFast, GPT2Config, TFGPT2LMHeadModel

The tokenizer we have created can be loaded using the following line:

In [10]:
tokenizer_gpt = GPT2TokenizerFast.from_pretrained("tokenizer_gpt")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [11]:
tokenizer_gpt.add_special_tokens({
  "eos_token": "</s>",
  "bos_token": "<s>",
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "mask_token": "<mask>"
})

0

In [12]:
tokenizer_gpt.eos_token_id

2

In [13]:
tokenizer_gpt.encode("<s> this is </s>")

[0, 265, 157, 56, 2]

In [14]:
config = GPT2Config(
  vocab_size=tokenizer_gpt.vocab_size,
  bos_token_id=tokenizer_gpt.bos_token_id,
  eos_token_id=tokenizer_gpt.eos_token_id
)
model = TFGPT2LMHeadModel(config)

2022-08-17 12:27:29.509321: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-17 12:27:29.510547: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-17 12:27:29.511180: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-17 12:27:29.511832: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags

In [15]:
config

GPT2Config {
  "activation_function": "gelu_new",
  "attn_pdrop": 0.1,
  "bos_token_id": 0,
  "embd_pdrop": 0.1,
  "eos_token_id": 2,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "transformers_version": "4.21.0",
  "use_cache": true,
  "vocab_size": 11750
}

As you can see, other settings are not touched, and the interesting part is that vocab_size is set to 11750. The reason behind this is that we set the maximum vocabulary size to be 50000, but the corpus had less, and its Byte-Pair Encoding (BPE) token created 11750.

In [16]:
with open("austen-emma.txt", "r", encoding='utf-8') as f:
    content = f.readlines()

remove '\n' from each line and drop lines with fewer than 10 characters, as follows:

In [17]:
content_p = []
for c in content:
    if len(c)>10:
        content_p.append(c.strip())

Dropping short lines will ensure that the model is trained on long sequences, to be able to generate longer sequences

In [18]:
content_p = " ".join(content_p)+tokenizer_gpt.eos_token

At the end of the preceding snippet, content_p has the concatenated raw file with eos_token added to the end. But you can follow different strategies too—for example, you can separate each line by adding \</s> to each line, which will help the model to recognize when the sentence ends. However, we intend to make it work for much longer sequences without encountering EOS.

The GPT tokenizer in the following code snippet will tokenize the whole text and make it one whole, long sequence of token IDs.

In [19]:
tokenized_content = tokenizer_gpt.encode(content_p)

In [20]:
# making samples for training
examples = []
block_size = 100
BATCH_SIZE = 12
BUFFER_SIZE = 1000
for i in range(0, len(tokenized_content)):
    examples.append(tokenized_content[i:i + block_size])

In [21]:
train_data = [] 
labels = [] 
for example in examples: 
    train_data.append(example[:-1]) 
    labels.append(example[1:])

For faster training, it is required to make the data in the form of a TensorFlow dataset, as follows:

In [22]:
# change 1000 if you want to train on full data
dataset = tf.data.Dataset.from_tensor_slices((train_data[:1000], labels[:1000]))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

In [23]:
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)

In [24]:
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

In [25]:
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

In [26]:
model.compile(optimizer=optimizer, loss=[loss, *[None] * model.config.n_layer], metrics=[metric])

In [27]:
# increase number of epochs for higher accuracy and lower loss
num_epoch = 10
history = model.fit(dataset, epochs=num_epoch)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [31]:
def generate(start,max_length=10):  
    input_token_ids = tokenizer_gpt.encode(start, return_tensors='tf')  
    output = model.generate(  
        input_token_ids,  
        max_length = max_length,  
        num_beams = 5,  
        temperature = 0.7,  
        no_repeat_ngram_size=2,  
        num_return_sequences=1  
    )  
    return tokenizer_gpt.decode(output[0])

In [33]:
generate(" ",500)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to 2 (first `eos_token_id`) to generate sequence


"  he could not meet her in conversation, rational or playful. the evil of the actual disparity in their ages (and mr. woodhouse had not married early) was much increased by his constitution and habits; for having been a valetudinarian all his life, without activity of mind or body, he was a much older man in ways than in years; and though everywhere beloved for the friendliness of his heart and his amiable temper, though comparatively but little removed by matrimony, being settled in london, only sixteen miles off, and many a long october and november evening must be struggled through at any time. her sister, before christmas brought the next visit from isabella and their little children, to fill the house from intellectual solitude.  her pleasant society again. its separate lawn, was miss taylor would be felt every day. highbury, what self-people gone, almost amounting to have recommended him at hartfield, nursed her daily and her father composed himself to impose any disagreeable co

In [34]:
generate("wetson was very good")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to 2 (first `eos_token_id`) to generate sequence


"wetson was very good miss taylor's judgment"

In [35]:
!mkdir my_gpt-2

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/bin/bash: /home/guy/anaconda3/envs/mastrans/lib/libtinfo.so.6: no version information available (required by /bin/bash)


In [36]:
model.save_pretrained("my_gpt-2/")

In [37]:
model_reloaded = TFGPT2LMHeadModel.from_pretrained("my_gpt-2/")

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at my_gpt-2/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Hugging Face also has a standard for filenames that must be used—these standard filenames are available by using the following import:  
However, when using the save_pretrained function, it is not required to put the filenames—just the directory will suffice.

In [38]:
from transformers import WEIGHTS_NAME, CONFIG_NAME, TF2_WEIGHTS_NAME, AutoModel, AutoTokenizer

Hugging Face also has AutoModel and AutoTokenizer classes, as you have seen from the previous sections. You can also use this functionality to save the model, but before doing that there are still a few configurations that need to be done manually. The first thing is to save the tokenizer in the proper format to be used by AutoTokenizer. You can do this by using save_pretrained, as follows:

In [39]:
tokenizer_gpt.save_pretrained("tokenizer_gpt_auto/")

('tokenizer_gpt_auto/tokenizer_config.json',
 'tokenizer_gpt_auto/special_tokens_map.json',
 'tokenizer_gpt_auto/vocab.json',
 'tokenizer_gpt_auto/merges.txt',
 'tokenizer_gpt_auto/added_tokens.json',
 'tokenizer_gpt_auto/tokenizer.json')

In [40]:
model = AutoModel.from_pretrained("my_gpt-2/", from_tf = True) 
tokenizer = AutoTokenizer.from_pretrained("tokenizer_gpt_auto")

All TF 2.0 model weights were used when initializing GPT2Model.

Some weights of GPT2Model were not initialized from the TF 2.0 model and are newly initialized: ['h.0.attn.bias', 'h.1.attn.bias', 'h.2.attn.bias', 'h.3.attn.bias', 'h.4.attn.bias', 'h.5.attn.bias', 'h.6.attn.bias', 'h.7.attn.bias', 'h.8.attn.bias', 'h.9.attn.bias', 'h.10.attn.bias', 'h.11.attn.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
