# GPT text generation from scratch with KerasNLP

**Author:** [Jesse Chan](https://github.com/jessechancy)<br>
**Date created:** 2022/07/25<br>
**Last modified:** 2022/07/25<br>
**Description:** Using KerasNLP to train a mini-GPT model for text generation.

## Introduction

In this example, we will use KerasNLP to build a scaled-down Generative
Pre-trained Transformer (GPT) model. GPT is a Transformer-based model that allows you to
generate sophisticated text from a prompt.

We will train the model on the [simplebooks-92](https://arxiv.org/abs/1911.12391) corpus,
which is a dataset made from several novels. It is a good dataset for this example since
it has a small vocabulary and high word frequency, which is beneficial when training a
model with few parameters.

This example combines concepts from
[Text generation with a miniature GPT](https://keras.io/examples/generative/text_generation_with_miniature_gpt/)
with KerasNLP abstractions. We will demonstrate how KerasNLP tokenization, layers and
metrics simplify the training
process, and then show how to generate output text using the KerasNLP sampling utilities.

Note: If you are running this example on Colab,
make sure to enable the GPU runtime for faster training.

This example requires KerasNLP. You can install it via the following command:
`pip install keras-nlp`

## Setup

In [1]:
%pip install keras_nlp -q

In [2]:
import os
import keras_nlp
import tensorflow as tf
from tensorflow import keras

Using TensorFlow backend


## Settings & hyperparameters

In [3]:
# Data
BATCH_SIZE = 64
SEQ_LEN = 128
MIN_TRAINING_SEQ_LEN = 450

# Model
EMBED_DIM = 256
FEED_FORWARD_DIM = 256
NUM_HEADS = 3
NUM_LAYERS = 2
VOCAB_SIZE = 5000  # Limits parameters in model.

# Training
EPOCHS = 6

# Inference
NUM_TOKENS_TO_GENERATE = 80

## Load the data

Now, let's download the dataset! The SimpleBooks dataset consists of 1,573 Gutenberg books, and has
one of the smallest ratios of vocabulary size to word-level tokens. It has a vocabulary size of ~98k,
a third of WikiText-103's, with around the same number of tokens (~100M). This makes it easy to fit a small model.

In [4]:
keras.utils.get_file(
    origin="https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip",
    extract=True,
)
# Named data_dir to avoid shadowing the built-in dir().
data_dir = os.path.expanduser("~/.keras/datasets/simplebooks/")

# Load simplebooks-92 train set and filter out short lines.
raw_train_ds = (
    tf.data.TextLineDataset(data_dir + "simplebooks-92-raw/train.txt")
    .filter(lambda x: tf.strings.length(x) > MIN_TRAINING_SEQ_LEN)
    .batch(BATCH_SIZE)
    .shuffle(buffer_size=256)
)

# Load simplebooks-92 validation set and filter out short lines.
raw_val_ds = (
    tf.data.TextLineDataset(data_dir + "simplebooks-92-raw/valid.txt")
    .filter(lambda x: tf.strings.length(x) > MIN_TRAINING_SEQ_LEN)
    .batch(BATCH_SIZE)
)

Downloading data from https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip


## Train the tokenizer

We train the tokenizer from the training dataset for a vocabulary size of `VOCAB_SIZE`,
which is a tuned hyperparameter. We want to limit the vocabulary as much as possible, as
we will see later on
that it has a large effect on the number of model parameters. We also don't want to include
*too few* vocabulary terms, or there would be too many out-of-vocabulary (OOV) sub-words. In
addition, three tokens are reserved in the vocabulary:

- `"[PAD]"` for padding sequences to `SEQ_LEN`. This token has index 0 in both
`reserved_tokens` and `vocab`, since `WordPieceTokenizer` (and other layers) consider
`0`/`vocab[0]` as the default padding.
- `"[UNK]"` for OOV sub-words, which should match the default `oov_token="[UNK]"` in
`WordPieceTokenizer`.
- `"[BOS]"` stands for beginning of sentence, but here technically it is a token
representing the beginning of each line of training data.

In [5]:
# Train tokenizer vocabulary
vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
    raw_train_ds,
    vocabulary_size=VOCAB_SIZE,
    lowercase=True,
    reserved_tokens=["[PAD]", "[UNK]", "[BOS]"],
)

## Load tokenizer

We use the vocabulary data to initialize
`keras_nlp.tokenizers.WordPieceTokenizer`. WordPieceTokenizer is an efficient
implementation of the WordPiece algorithm used by BERT and other models. It will strip,
lower-case and do other irreversible preprocessing operations.

In [6]:
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    sequence_length=SEQ_LEN,
    lowercase=True,
)
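To make the sub-word splitting concrete, here is a minimal plain-Python sketch of greedy longest-match-first WordPiece tokenization for a single word. It is not the KerasNLP implementation: the toy vocabulary and the function name `wordpiece_tokenize` are assumptions for illustration, and `##` is assumed as the marker for word-internal pieces.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # word-internal pieces carry a marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes the OOV token.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, including the three reserved tokens.
toy_vocab = {"[PAD]", "[UNK]", "[BOS]", "the", "play", "##ing", "##ed"}

print(wordpiece_tokenize("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```

A larger vocabulary lets more words match in one or two pieces, while a very small one pushes more words to `[UNK]`, which is the trade-off behind `VOCAB_SIZE`.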

## Tokenize data

We preprocess the dataset by tokenizing and splitting it into `features` and `labels`.

In [7]:
# packer adds a start token
start_packer = keras_nlp.layers.StartEndPacker(
    sequence_length=SEQ_LEN,
    start_value=tokenizer.token_to_id("[BOS]"),
)


def preprocess(inputs):
    outputs = tokenizer(inputs)
    features = start_packer(outputs)
    labels = outputs
    return features, labels


# Tokenize and split into train and label sequences.
train_ds = raw_train_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).prefetch(
    tf.data.AUTOTUNE
)
val_ds = raw_val_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).prefetch(
    tf.data.AUTOTUNE
)
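The effect of `preprocess` is that `features` is `labels` shifted right by one position, so the model at position `i` is trained to predict token `i` of the original line. A minimal plain-Python sketch of that shift, assuming padding id `0` and `[BOS]` id `2` as in the reserved tokens above (the helper names are hypothetical):

```python
PAD_ID = 0
BOS_ID = 2  # position of "[BOS]" in reserved_tokens


def pad_or_truncate(ids, length):
    """Pad with PAD_ID on the right, or cut off, to exactly `length` ids."""
    return (ids + [PAD_ID] * length)[:length]


def make_example(token_ids, seq_len):
    # Labels are the tokenized line itself.
    labels = pad_or_truncate(token_ids, seq_len)
    # Features are the same line with [BOS] prepended, cut back to seq_len.
    features = pad_or_truncate([BOS_ID] + labels, seq_len)
    return features, labels


features, labels = make_example([11, 12, 13], seq_len=5)
print(features)  # [2, 11, 12, 13, 0]
print(labels)    # [11, 12, 13, 0, 0]
```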

## Build the model

We create our scaled down GPT model with the following layers:

- One `keras_nlp.layers.TokenAndPositionEmbedding` layer, which combines the embedding
for the token and its position.
- Multiple `keras_nlp.layers.TransformerDecoder` layers, with the default causal masking.
The layer has no cross-attention when run with decoder sequence only.
- One final dense linear layer, which projects each position to `VOCAB_SIZE` logits.

In [8]:
inputs = keras.layers.Input(shape=(None,), dtype=tf.int32)
# Embedding.
embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)
x = embedding_layer(inputs)
# Transformer decoders.
for _ in range(NUM_LAYERS):
    decoder_layer = keras_nlp.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
    )
    x = decoder_layer(x)  # Giving one argument only skips cross-attention.
# Output.
outputs = keras.layers.Dense(VOCAB_SIZE)(x)
model = keras.Model(inputs=inputs, outputs=outputs)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)
model.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])

Let's take a look at our model summary - a large majority of the
parameters are in the `token_and_position_embedding` and the output `dense` layer!
This means that the vocabulary size (`VOCAB_SIZE`) has a large effect on the size of the model,
while the number of Transformer decoder layers (`NUM_LAYERS`) doesn't affect it as much.

In [9]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 token_and_position_embeddi  (None, None, 256)         1312768   
 ng (TokenAndPositionEmbedd                                      
 ing)                                                            
                                                                 
 transformer_decoder (Trans  (None, None, 256)         394749    
 formerDecoder)                                                  
                                                                 
 transformer_decoder_1 (Tra  (None, None, 256)         394749    
 nsformerDecoder)                                                
                                                                 
 dense (Dense)               (None, None, 5000)        1285000   
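The parameter counts can be checked by hand from the hyperparameters. The sketch below is plain arithmetic, not a KerasNLP call; the decoder breakdown assumes a per-head size of `EMBED_DIM // NUM_HEADS`, one self-attention block, a two-layer feed-forward block and two layer norms, which reproduces the embedding and decoder counts printed in the summary.

```python
VOCAB_SIZE = 5000
SEQ_LEN = 128
EMBED_DIM = 256
FEED_FORWARD_DIM = 256
NUM_HEADS = 3
NUM_LAYERS = 2

# Token embedding table plus position embedding table.
embedding_params = VOCAB_SIZE * EMBED_DIM + SEQ_LEN * EMBED_DIM

# One decoder block (assumed layout, see the note above).
head_dim = EMBED_DIM // NUM_HEADS          # 85
attn_width = NUM_HEADS * head_dim          # 255
qkv_params = 3 * (EMBED_DIM * attn_width + attn_width)
attn_out_params = attn_width * EMBED_DIM + EMBED_DIM
ffn_params = (EMBED_DIM * FEED_FORWARD_DIM + FEED_FORWARD_DIM) + (
    FEED_FORWARD_DIM * EMBED_DIM + EMBED_DIM
)
layer_norm_params = 2 * (2 * EMBED_DIM)
decoder_params = qkv_params + attn_out_params + ffn_params + layer_norm_params

# Final projection to vocabulary logits (weights plus bias).
dense_params = EMBED_DIM * VOCAB_SIZE + VOCAB_SIZE

total_params = embedding_params + NUM_LAYERS * decoder_params + dense_params
vocab_share = (embedding_params + dense_params) / total_params

print(embedding_params)  # 1312768
print(decoder_params)    # 394749
print(dense_params)      # 1285000
print(f"share of parameters tied to the vocabulary: {vocab_share:.1%}")
```

Under these assumptions roughly three quarters of the parameters scale with `VOCAB_SIZE`, while each extra decoder layer adds under 0.4M.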

## Training

Now that we have our model, let's train it with the `fit()` method.

In [10]:
model.fit(train_ds, validation_data=val_ds, verbose=2, epochs=EPOCHS)

Epoch 1/6
3169/3169 - 1417s - loss: 4.4935 - perplexity: 89.7921 - val_loss: 4.1366 - val_perplexity: 63.1688 - 1417s/epoch - 447ms/step
Epoch 2/6
3169/3169 - 1410s - loss: 4.0513 - perplexity: 57.6941 - val_loss: 3.9841 - val_perplexity: 54.2899 - 1410s/epoch - 445ms/step
Epoch 3/6
3169/3169 - 1396s - loss: 3.9412 - perplexity: 51.6791 - val_loss: 3.9462 - val_perplexity: 52.1174 - 1396s/epoch - 440ms/step
Epoch 4/6
3169/3169 - 1397s - loss: 3.8806 - perplexity: 48.6370 - val_loss: 3.8985 - val_perplexity: 49.7437 - 1397s/epoch - 441ms/step
Epoch 5/6
3169/3169 - 1410s - loss: 3.8397 - perplexity: 46.6859 - val_loss: 3.8652 - val_perplexity: 48.1174 - 1410s/epoch - 445ms/step
Epoch 6/6
3169/3169 - 1396s - loss: 3.8111 - perplexity: 45.3698 - val_loss: 3.8769 - val_perplexity: 48.6609 - 1396s/epoch - 441ms/step
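The `perplexity` column can be read as the exponential of the average per-token cross-entropy. The sketch below is a plain-Python illustration of that definition, not the `keras_nlp.metrics.Perplexity` implementation; the function name and toy inputs are assumptions. Note that the logged values are close to, but not exactly, `exp(loss)`, presumably because the metric masks padding (`mask_token_id=0`) and is accumulated separately from the loss.

```python
import math


def perplexity(correct_token_probs, token_ids, pad_id=0):
    """exp(mean negative log-likelihood) over the non-padding positions."""
    nll = [
        -math.log(prob)
        for prob, token_id in zip(correct_token_probs, token_ids)
        if token_id != pad_id
    ]
    return math.exp(sum(nll) / len(nll))


# A model that always gives the correct token probability 1/4 has perplexity
# of about 4: it is as uncertain as a uniform choice among 4 tokens.
# The last position is padding and is ignored.
print(perplexity([0.25, 0.25, 0.25, 0.9], [17, 4, 9, 0]))

# exp of the final validation loss, for comparison with the logged val_perplexity.
print(math.exp(3.8769))
```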


<keras.src.callbacks.History at 0x7ddf6c50a3e0>

## Inference

With our trained model, we can test it out to gauge its performance. To do this
we can seed our model with an input sequence starting with the `"[BOS]"` token,
and progressively sample the model by making predictions for each subsequent
token in a loop.

To start, let's build a prompt with the same shape as our model inputs, containing
only the `"[BOS]"` token.

In [11]:
# The "packer" layer adds the [BOS] token for us.
prompt_tokens = start_packer(tokenizer([""]))
prompt_tokens

<tf.Tensor: shape=(1, 128), dtype=int32, numpy=
array([[2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int32)>

We will use the `keras_nlp.samplers` module for inference, which requires a
callback function wrapping the model we just trained. This wrapper calls
the model and returns the logit predictions for the current token we are
generating.

Note: There are two pieces of more advanced functionality available when
defining your callback. The first is the ability to take in a `cache` of states
computed in previous generation steps, which can be used to speed up generation.
The second is the ability to output the final dense "hidden state" of each
generated token. This is used by `keras_nlp.samplers.ContrastiveSampler`, which
avoids repetition by penalizing repeated hidden states. Both are optional, and
we will ignore them for now.

In [12]:

def next(prompt, cache, index):
    logits = model(prompt)[:, index - 1, :]
    # Ignore hidden states for now; only needed for contrastive search.
    hidden_states = None
    return logits, hidden_states, cache


Creating the wrapper function is the most complex part of using these functions. Now that
it's done, let's test out the different utilities, starting with greedy search.

### Greedy search

We greedily pick the most probable token at each timestep. In other words, we get the
argmax of the model output.

In [13]:
sampler = keras_nlp.samplers.GreedySampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,  # Start sampling immediately after the [BOS] token.
)
txt = tokenizer.detokenize(output_tokens)
print(f"Greedy search generated text: \n{txt}\n")

Greedy search generated text: 
[b'[BOS] " i am glad to see you , " said the captain , " and i have been thinking of it . i have been thinking of it over , and i have been thinking of it . i have heard that you have been in the service , and that you have been in the service of the ship , and that you have been able to do so , and that you have been able to get the ship , and that you have been able to get the ship , and that you have been able to get the ship , and that you have been able to get the ship . " [PAD] said , " i will have to do so ,']



As you can see, greedy search starts out making some sense, but quickly starts repeating
itself. This is a common problem with text generation that can be fixed by some of the
probabilistic text generation utilities shown later on!
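The decoding loop behind greedy search can be sketched in a few lines of plain Python. This is an illustration only, not what `GreedySampler` does internally; `TOY_LOGITS` is a hypothetical stand-in for the trained model in which the next-token logits depend only on the last token.

```python
# Hypothetical toy "model" over a 4-token vocabulary.
TOY_LOGITS = {
    0: [0.1, 2.0, 0.5, 0.2],
    1: [0.3, 0.1, 1.5, 0.2],
    2: [0.2, 1.8, 0.1, 0.4],
    3: [1.0, 0.2, 0.3, 0.1],
}


def next_logits(tokens):
    return TOY_LOGITS[tokens[-1]]


def greedy_decode(prefix, steps):
    tokens = list(prefix)
    for _ in range(steps):
        logits = next_logits(tokens)
        # argmax: always append the single most probable token.
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens


print(greedy_decode([0], steps=6))  # [0, 1, 2, 1, 2, 1, 2]
```

Even this toy model falls into a `1, 2, 1, 2` loop: once the argmax path revisits a state, a deterministic decoder repeats the same continuation.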

### Beam search

At a high level, beam search keeps track of the `num_beams` most probable sequences at
each timestep, and predicts the best next token from all sequences. It is an improvement
over greedy search since it stores more possibilities. However, it is less efficient than
greedy search since it has to compute and store multiple potential sequences.

**Note:** beam search with `num_beams=1` is identical to greedy search.

In [14]:
sampler = keras_nlp.samplers.BeamSampler(num_beams=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Beam search generated text: \n{txt}\n")

Beam search generated text: 
[b'[BOS] " well , i don \' t think i \' m glad to see you , " he said , " but i \' m glad i \' m going to see you , and i \' ll tell you what i \' ll do . i \' ll tell you what i \' ll do . i \' ll tell you what i \' ll do . i \' ll tell you what i \' ll do . i \' ll tell you what i \' ll do , and i \' ll tell you what i \' ll do . i \' ll tell you what i \' ll do . i \' ll tell you what i \' ll do . i \' ll']



Similar to greedy search, beam search quickly starts repeating itself, since it is still
a deterministic method.
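A minimal plain-Python sketch of beam search, again using a hypothetical toy model (`TOY_LOGITS`) rather than the trained network, and not claiming to match `BeamSampler` in detail. Each beam carries its cumulative log-probability; at every step all beams are extended by every token and only the `num_beams` best candidates survive.

```python
import math

# Hypothetical toy "model": next-token logits depend only on the last token.
TOY_LOGITS = {
    0: [0.1, 2.0, 0.5, 0.2],
    1: [0.3, 0.1, 1.5, 0.2],
    2: [0.2, 1.8, 0.1, 0.4],
    3: [1.0, 0.2, 0.3, 0.1],
}


def next_logits(tokens):
    return TOY_LOGITS[tokens[-1]]


def log_softmax(logits):
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def beam_search(prefix, steps, num_beams):
    beams = [(0.0, list(prefix))]  # (cumulative log-probability, tokens)
    for _ in range(steps):
        candidates = [
            (score + log_prob, tokens + [token])
            for score, tokens in beams
            for token, log_prob in enumerate(log_softmax(next_logits(tokens)))
        ]
        # Keep only the num_beams highest-scoring sequences.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][1]


print(beam_search([0], steps=4, num_beams=1))  # [0, 1, 2, 1, 2], same as greedy
print(beam_search([0], steps=4, num_beams=3))
```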

### Random search

Random search is our first probabilistic method. At each time step, it samples the next
token using the softmax probabilities provided by the model.

In [15]:
sampler = keras_nlp.samplers.RandomSampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Random search generated text: \n{txt}\n")

Random search generated text: 
[b"[BOS] ah , you will not want to be in this country mother that paupine . after a widow of fortune and dutch father is a poor woman ; but when you go arman should be married , for i want mrs . time for saving them , and instead of returning when our visit with her husband ' s own business , wives mrs . concy and mabroke is to come back . she ' s called for the slaughter among our mothers if their daughters may be were assured in another case that ever , and she certainly accepted the offer of thanks for sir premarks , since that most"]



Voilà, no repetitions! However, with random search, we may see some nonsensical words
appearing since any word in the vocabulary has a chance of appearing with this sampling
method. This is fixed by our next search utility, top-k search.
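Sampling from the softmax distribution can be sketched with the standard library alone. The logits below are made up, and `sample_token` is a hypothetical helper rather than the `RandomSampler` implementation.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, rng):
    """Draw one token id with probability proportional to softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)  # seeded so the run is repeatable
logits = [2.0, 1.0, 0.5, -1.0]
samples = [sample_token(logits, rng) for _ in range(1000)]
counts = [samples.count(token) for token in range(len(logits))]
print(counts)  # the low-logit token 3 still gets picked now and then
```

The last token has only about a 3% probability here, yet it is still drawn occasionally, which is how unlikely and sometimes nonsensical tokens slip into the output.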

### Top-K search

Similar to random search, we sample the next token from the probability distribution
provided by the model. The only difference is that here, we keep only the top `k` most
probable tokens, and redistribute the probability mass over them before sampling. This way,
we won't be sampling from low probability tokens, and hence we should see fewer
nonsensical words!

In [16]:
sampler = keras_nlp.samplers.TopKSampler(k=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-K search generated text: \n{txt}\n")

Top-K search generated text: 
[b'[BOS] in the morning , a few days , there came the news that this was not to be seen from the king , and the swedes he had been so great , and had been captured , for he had been slain by sir gawaine and sir tristram . and that sir tristram had made sir tristram and had been brought up with sir tristram of cornwall , and that the battle of sir tristram was , and that sir tristram had been so sorely wounded that wound round his wound that wound wound sir tristrame and sir tristram had brought him sir tristram and sir tristram to sir tristram . and sir tristram said : [PAD] sir']
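The top-k filtering step can be sketched in plain Python as follows. The probabilities are made up, and `top_k_filter` and `sample_top_k` are hypothetical helpers, not KerasNLP functions.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in top_ids)
    return {i: probs[i] / kept_mass for i in top_ids}


def sample_top_k(probs, k, rng):
    filtered = top_k_filter(probs, k)
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, k=2))  # ids 0 and 1, rescaled to sum to 1

rng = random.Random(0)
drawn = {sample_top_k(probs, 2, rng) for _ in range(200)}
print(drawn)  # only ids 0 and 1 can appear
```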



### Top-P search

Even with top-k search, there is something to improve upon. With top-k search, the
number `k` is fixed, which means it selects the same number of tokens for any probability
distribution. Consider two scenarios, one where the probability mass is concentrated over
2 words and another where the probability mass is spread evenly across 10. Should
we choose `k=2` or `k=10`? There is no one-size-fits-all `k` here.

This is where top-p search comes in! Instead of choosing a `k`, we choose a probability
`p` that we want the probabilities of the top tokens to sum up to. This way, we can
dynamically adjust the `k` based on the probability distribution. With `p=0.9`, if
90% of the probability mass is concentrated on the top 2 tokens, only those 2 tokens are
kept as sampling candidates. If instead the 90% is distributed over 10 tokens, the top 10
tokens are kept. The example below uses a smaller value, `p=0.5`.

In [17]:
sampler = keras_nlp.samplers.TopPSampler(p=0.5)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-P search generated text: \n{txt}\n")

Top-P search generated text: 
[b'[BOS] there was no wonder that this day . at last , the man in the forest , who was not to be near to the foot of the hill , was now very ill . and , if he could see a great white wolf , he would go to his lodge , he would come up to him , and make his way to the other side . so he sat down to the ground , looking for the tree that he had seen the great fire , and there he saw the snake steal out of the cave , and he had no sooner thought that it was very beautiful , and that the snake had made him run off the head . [PAD] then']
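A plain-Python sketch of the top-p filtering step, showing how the number of kept tokens adapts to the shape of the distribution. The two distributions are made up, `top_p_filter` is a hypothetical helper, and the exact boundary rule (whether the token that crosses `p` is kept) is an assumption that may differ from `TopPSampler`.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= p:
            break
    kept_mass = sum(probs[i] for i in kept)
    # Renormalize so the kept probabilities sum to 1 before sampling.
    return {i: probs[i] / kept_mass for i in kept}


peaked = [0.6, 0.35, 0.03, 0.02]   # mass concentrated on 2 tokens
flat = [0.095] * 10 + [0.05]       # mass spread over many tokens

print(len(top_p_filter(peaked, p=0.9)))  # 2
print(len(top_p_filter(flat, p=0.9)))    # 10
```

With the same `p`, the effective `k` is 2 for the peaked distribution and 10 for the flat one.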



### Using callbacks for text generation

We can also wrap the utilities in a callback, which allows you to print out a prediction
sequence for every epoch of the model! Here is an example of a callback for top-k search:

In [18]:

class TopKTextGenerator(keras.callbacks.Callback):
    """A callback to generate text from a trained model using top-k."""

    def __init__(self, k):
        super().__init__()  # initialize the base Callback state
        self.sampler = keras_nlp.samplers.TopKSampler(k)

    def on_epoch_end(self, epoch, logs=None):
        output_tokens = self.sampler(
            next=next,
            prompt=prompt_tokens,
            index=1,
        )
        txt = tokenizer.detokenize(output_tokens)
        print(f"Top-K search generated text: \n{txt}\n")


text_generation_callback = TopKTextGenerator(k=10)
# Dummy training loop to demonstrate callback.
model.fit(train_ds.take(1), verbose=2, epochs=2, callbacks=[text_generation_callback])

Epoch 1/2
Top-K search generated text: 
[b'[BOS] " it is not a good thing , sir . he has been sent to me as i said , " but he did this , and it is my opinion that this will he holds . i am very sure that the men of the french and are in our own language , and that it is no use for it to be taken . if this was not the case , i suppose , i should be killed , to make some resort to this day of the submarines , which i know not , have been very much in the town . i don \' t want to go into the world , because , as i']

1/1 - 13s - loss: 3.9863 - perplexity: 53.9379 - 13s/epoch - 13s/step
Epoch 2/2
Top-K search generated text: 
[b'[BOS] " no , you know , " he said , " but the child is not the daughter of the conviction , for i am so much obliged to say , in the present case , to rejoice to appreciate her son . you have had a few hours to get through to - night - - not only the young girl , but the mother is not so much frightened to think about that the young man who will be in his power to

<keras.src.callbacks.History at 0x7ddf6c67c6d0>

## Conclusion

To recap, in this example, we use KerasNLP layers to train a sub-word vocabulary,
tokenize training data, create a miniature GPT model, and perform inference with the
text generation library.

If you would like to understand how Transformers work, or learn more about training the
full GPT model, here are some further readings:

- Attention Is All You Need [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)
- GPT-3 Paper [Brown et al., 2020](https://arxiv.org/abs/2005.14165)

# Codeathon 3

In this project, you will use the KerasNLP API to build a scaled-down Generative Pre-trained Transformer (GPT) model. GPT is a Transformer-based model that allows you to generate sophisticated text from a prompt. Using the GPT text generation tutorial guide as a starting point, you will train the model on the simplebooks-92 corpus, which is a dataset made from several novels.

Next, you will load a pre-trained Large Language Model (LLM), the GPT-2 model (originally developed by OpenAI), fine-tune it to a specific text style, and generate text based on users' input (also known as a prompt). Large language models (LLMs) are a type of machine learning model trained on a large corpus of text data to generate outputs for various natural language processing (NLP) tasks, such as text generation, question answering, and machine translation.

Generative LLMs are typically based on deep learning neural networks, such as the Transformer architecture invented by Google researchers in 2017, and are trained on massive amounts of text data, often involving billions of words. These models, such as Google LaMDA and PaLM, are trained with a large dataset from various data sources, which allows them to generate output for many tasks. The core of generative LLMs is predicting the next word in a sentence, often referred to as causal LM pretraining. In this way, LLMs can generate coherent text based on user prompts. For a more pedagogical discussion of language models, you can refer to the Stanford CS324 LLM class.

The KerasNLP API provides a number of pre-trained models, such as Google BERT and GPT-2. You can see the list of models available in the KerasNLP repository.

You will experiment with at least five (5) pretrained models and fine-tune the models on the Reddit dataset to update their parameters. Generate and evaluate outputs using different pretrained models.

# Load the necessary libraries

In [2]:
%pip install keras_nlp -q

In [3]:
import math
import os
import time

import keras_nlp
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow import keras

Using TensorFlow backend


# Load the Reddit dataset

I was having issues with the larger "reddit" dataset, so I went with the smaller `reddit_tifu` dataset to simplify the process.

In [4]:
reddit_data = tfds.load("reddit_tifu", split = "train", as_supervised = True)

Downloading and preparing dataset 639.54 MiB (download: 639.54 MiB, generated: 141.46 MiB, total: 781.00 MiB) to /root/tensorflow_datasets/reddit_tifu/short/1.1.2...


Dataset reddit_tifu downloaded and prepared to /root/tensorflow_datasets/reddit_tifu/short/1.1.2. Subsequent calls will reuse this data.


# Preprocess the data

In [5]:
# as_supervised=True yields (document, summary) pairs; keep only the document text.
train_reddit = (
    reddit_data.map(lambda document, _: document)
    .batch(32)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

# Fine Tune the Parameters

Parameters for the first GPT model:

In [6]:
# Parameters for 1st model attempt

# number of epochs
epochs = 1

# Reduce the size of the dataset
train_reddit = train_reddit.take(500)

# Learning rate schedule
learning_rate = tf.keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=tf.data.experimental.cardinality(train_reddit).numpy() * epochs,
    end_learning_rate=0.0,
)

# Loss
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True)
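With its default `power=1.0`, `PolynomialDecay` lowers the learning rate linearly from the initial value to `end_learning_rate` over `decay_steps` optimizer steps. The sketch below re-implements that formula in plain Python for illustration; `polynomial_decay` is a hypothetical helper, and the step count assumes 500 batches and 1 epoch as configured above.

```python
def polynomial_decay(step, initial_lr, decay_steps, end_lr=0.0, power=1.0):
    """Learning rate at a given optimizer step (no cycling past decay_steps)."""
    step = min(step, decay_steps)
    fraction_left = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * fraction_left**power + end_lr


decay_steps = 500 * 1  # batches per epoch * epochs
for step in (0, 250, 500):
    print(step, polynomial_decay(step, 5e-5, decay_steps))
```

Because `decay_steps` is tied to the dataset size and the epoch count, the rate reaches zero exactly at the end of training, so the schedule has to be rebuilt whenever either value changes.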

# Function to generate text

In [7]:
# https://keras.io/examples/generative/gpt2_text_generation_with_kerasnlp/

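def generate_text(model, input_text, max_length = 500):
    """Generate a continuation of `input_text` with `model` and print the elapsed time."""
    start = time.time()

    output = model.generate(input_text, max_length = max_length)
    # Note: this label is printed for every model passed in, including the OPT models.
    print("\nGPT-2 Output:")
    print(output)

    end = time.time()
    print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")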

# GPT-2 Models

1st attempt at the GPT-2 Model:

In [24]:
# set the model name
model_name = "gpt2_base_en"

# initialize the preprocessor and the model
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    model_name,
    sequence_length = 128,
)
tmp_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    model_name, preprocessor=preprocessor
)

# compile the model
tmp_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss = loss,
    weighted_metrics = ["accuracy"],
)

# train the model
tmp_lm.fit(train_reddit, epochs = epochs)

# generate text
generate_text(tmp_lm, "How do I create a GPT model?")

Downloading data from https://storage.googleapis.com/keras-nlp/models/gpt2_base_en/v1/vocab.json
Downloading data from https://storage.googleapis.com/keras-nlp/models/gpt2_base_en/v1/merges.txt
Downloading data from https://storage.googleapis.com/keras-nlp/models/gpt2_base_en/v1/model.h5

GPT-2 Output:
How do I create a GPT model?

a lot of people have tried to do this, but it just takes too much time. i have a few ideas, and i'm not sure why anyone would try to do it this way, but i've never had the time.

edit: i was going on vacation, so i decided to try to make this work for me. i started by creating the GPT model, and then i created a couple of different images, one for the "gpts" part and the next for the "gts." so now i have the "gts" and "gts-gts"
TOTAL TIME ELAPSED: 42.35s


For my next model, I will increase the number of epochs as well as enlarge the size of the dataset that the model is trained on.

Parameters:

In [25]:
# Parameters for 2nd model attempt

# increase the number of epochs
epochs = 5

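# enlarge the size of the dataset
# (train_reddit was already cut to 500 batches above and take() cannot grow a
# dataset, so rebuild the pipeline from reddit_data before taking 1000 batches)
train_reddit = (
    reddit_data.map(lambda document, _: document)
    .batch(32)
    .take(1000)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

# Learning rate schedule
learning_rate = tf.keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=tf.data.experimental.cardinality(train_reddit).numpy() * epochs,
    end_learning_rate=0.0,
)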

# Loss
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True)

In [26]:
# set the model name
model_name = "gpt2_base_en"

# initialize the preprocessor and the model
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    model_name,
    sequence_length = 128,
)
tmp_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    model_name, preprocessor=preprocessor
)

# compile the model
tmp_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss = loss,
    weighted_metrics = ["accuracy"],
)

# train the model
tmp_lm.fit(train_reddit, epochs = epochs)

# generate text
generate_text(tmp_lm, "How do I create a GPT model?")

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

GPT-2 Output:
How do I create a GPT model?

my first project was to create a gps database for my school. i wanted to create a table of all classes in an english department and use it to create the model for my school.

after a week of work i decided to create a model that was going to look like this:

public class model { public static string get_class() { return "class"; } public static int get_class_id(int id)(int id) { return id; } } public string get_model(int model, int id) { return id
TOTAL TIME ELAPSED: 41.00s


For my last GPT model, I want to experiment with a new loss function that includes label smoothing.

Parameters:

In [8]:
# Parameters for 3rd model attempt (label smoothing)

# number of epochs
epochs = 1

# cap the dataset at 1000 batches (this has no effect if train_reddit was
# already reduced to fewer batches in an earlier cell)
train_reddit = train_reddit.take(1000)

# Learning rate schedule
learning_rate = tf.keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=tf.data.experimental.cardinality(train_reddit).numpy() * epochs,
    end_learning_rate=0.0,
)

# Loss
#loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True)

# Loss Function with Label Smoothing
class CustomSparseCategoricalCrossentropy(tf.keras.losses.Loss):
    """Sparse categorical cross-entropy with label smoothing."""

    def __init__(self, smoothing=0.1, from_logits=True):
        super().__init__()
        self.smoothing = smoothing
        self.from_logits = from_logits
        self.cce = tf.keras.losses.CategoricalCrossentropy(
            from_logits=from_logits, reduction=tf.keras.losses.Reduction.NONE)

    def call(self, y_true, y_pred):
        num_classes = tf.shape(y_pred)[-1]
        # Convert integer token ids to one-hot targets, then smooth them.
        y_true = tf.one_hot(tf.cast(y_true, tf.int32), depth=num_classes)
        y_true = y_true * (1 - self.smoothing) + (self.smoothing / tf.cast(num_classes, tf.float32))
        # Return per-token losses so Keras can apply the sample weights (padding
        # mask) before reducing; averaging here would count padded positions.
        return self.cce(y_true, y_pred)
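Label smoothing replaces the one-hot target with a mixture of the one-hot vector and a uniform distribution over all classes. The plain-Python sketch below shows the same formula as the loss class on a 4-class toy example; the helper names and the predicted probabilities are assumptions for illustration.

```python
import math


def smooth_labels(target_id, num_classes, smoothing=0.1):
    """one_hot * (1 - smoothing) + smoothing / num_classes."""
    uniform = smoothing / num_classes
    return [
        (1.0 - smoothing if class_id == target_id else 0.0) + uniform
        for class_id in range(num_classes)
    ]


def cross_entropy(target_probs, predicted_probs):
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs))


targets = smooth_labels(target_id=1, num_classes=4, smoothing=0.1)
print(targets)  # most of the mass on class 1, a small share on every class

predicted = [0.01, 0.97, 0.01, 0.01]  # a very confident, correct prediction
hard_loss = cross_entropy([0.0, 1.0, 0.0, 0.0], predicted)
soft_loss = cross_entropy(targets, predicted)
print(hard_loss, soft_loss)
```

The smoothed loss is higher for the over-confident prediction, which is the regularizing effect being tested here. `tf.keras.losses.CategoricalCrossentropy` also accepts a `label_smoothing` argument that applies the same mixing to one-hot targets.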

In [9]:
# set the model name
model_name = "gpt2_base_en"

# initialize the preprocessor and the model
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    model_name,
    sequence_length = 128,
)
tmp_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    model_name, preprocessor=preprocessor
)

# compile the model with updated loss function
tmp_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss = CustomSparseCategoricalCrossentropy(smoothing=0.1),
    weighted_metrics = ["accuracy"],
)

# train the model
tmp_lm.fit(train_reddit, epochs = epochs)

# generate text
generate_text(tmp_lm, "How do I create a GPT model?")

Downloading data from https://storage.googleapis.com/keras-nlp/models/gpt2_base_en/v1/vocab.json
Downloading data from https://storage.googleapis.com/keras-nlp/models/gpt2_base_en/v1/merges.txt
Downloading data from https://storage.googleapis.com/keras-nlp/models/gpt2_base_en/v1/model.h5

GPT-2 Output:
How do I create a GPT model?

i'm a highschool senior. i was in a school for a high school student, and he was the only one in my class that wasn't a gt. he wanted me to have a model of his penis. so when we were talking, he told us that he was a student of the school, and that he wanted me to make him an "uniformed" model of him.

i was like "i'll give that to you guys". so i went to the gym and made a uniform of the model of his penis, with his name.

i was in my room and i noticed that my penis was
TOTAL TIME ELAPSED: 36.37s


# OPT Models:

For my first OPT model, I will begin with the same parameters as the original GPT model.

Parameters for the 1st OPT Model:

In [11]:
# Parameters for 1st model attempt

# number of epochs
epochs = 1

# Reduce the size of the dataset
train_reddit = train_reddit.take(500)

# Learning rate schedule
learning_rate = tf.keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=tf.data.experimental.cardinality(train_reddit).numpy() * epochs,
    end_learning_rate=0.0,
)

# Loss
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True)
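
As a rough illustration of what the learning rate schedule above does, the sketch below reimplements the polynomial decay formula in plain Python with the default power of 1, which makes the decay linear. The function name `polynomial_decay` is hypothetical, and `decay_steps=500` assumes the truncated dataset really contains 500 elements and that `epochs = 1`.

```python
def polynomial_decay(step, initial_lr=5e-5, decay_steps=500, end_lr=0.0, power=1.0):
    """Learning rate at a given step, without cycling."""
    # Steps past decay_steps are clamped, so the rate stays at end_lr afterwards.
    step = min(step, decay_steps)
    return (initial_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr


# Learning rate at the start, the midpoint, and the end of training.
for step in (0, 250, 500):
    print(step, polynomial_decay(step))
```

With these settings the rate falls in a straight line from 5e-5 at step 0 to 0 at the last training step, so the final batches barely change the weights.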

1st attempt at the OPT model:

In [12]:
# set the model name
model_name = "opt_125m_en"

# initialize the preprocessor and the model
preprocessor = keras_nlp.models.OPTCausalLMPreprocessor.from_preset(
    model_name,
    sequence_length = 128,
)
tmp_lm = keras_nlp.models.OPTCausalLM.from_preset(
    model_name, preprocessor=preprocessor
)

# compile the model
tmp_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

# train the model
tmp_lm.fit(train_reddit, epochs = epochs)

# generate text
generate_text(tmp_lm, "How do I create a GPT model?")

Downloading data from https://storage.googleapis.com/keras-nlp/models/opt_125m_en/v1/vocab.json
Downloading data from https://storage.googleapis.com/keras-nlp/models/opt_125m_en/v1/merges.txt
Downloading data from https://storage.googleapis.com/keras-nlp/models/opt_125m_en/v1/model.h5

GPT-2 Output:
How do I create a GPT model?

a few days ago i had to make a model of a galaxy s6 and i had a few things to do, so i decided to use the gpt tool to create a gpt file for it. i created the model and then copied it to a usb drive, but i didn't have any other tools to do so and it was still not working. i was so confused, and then my phone died, i had to use the gpt tool, and i had to use a computer to get it back to work. i had to go to the repair store and
TOTAL TIME ELAPSED: 32.58s


For my 2nd OPT model, I will enlarge the dataset that the model is trained on. I wanted to increase the number of epochs, but this proved too costly in terms of computing power.

Parameters:

In [13]:
# Parameters for 2nd model attempt

# number of epochs (kept at 1 because of computing limitations)
epochs = 1

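# enlarge the size of the dataset
# NOTE: if the cells are run in order, train_reddit was already cut to 500
# elements above, so take(1000) cannot return more than 500 here. Rebuild
# train_reddit from the full dataset before this line for the change to apply.
train_reddit = train_reddit.take(1000)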

# Learning rate schedule
learning_rate = tf.keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=tf.data.experimental.cardinality(train_reddit).numpy() * epochs,
    end_learning_rate=0.0,
)

# Loss
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True)
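
One thing worth checking in the parameters above: `take(n)` returns at most `n` elements of whatever it is called on, so calling it on a dataset that was already truncated cannot make it larger. The sketch below shows the same behaviour using `itertools.islice` as a stand-in for `tf.data.Dataset.take`; the `full` range is a hypothetical placeholder for the complete dataset.

```python
from itertools import islice

full = range(5000)  # hypothetical stand-in for the full, untruncated dataset

first = list(islice(full, 500))     # like train_reddit.take(500)
second = list(islice(first, 1000))  # like calling .take(1000) on that result
print(len(first), len(second))      # 500 500

fresh = list(islice(full, 1000))    # taking from the full source instead
print(len(fresh))                   # 1000
```

If the cells are run in order, the second experiment may therefore train on the same 500 elements as the first unless `train_reddit` is rebuilt from the original source beforehand.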

In [14]:
# set the model name
model_name = "opt_125m_en"

# initialize the preprocessor and the model
preprocessor = keras_nlp.models.OPTCausalLMPreprocessor.from_preset(
    model_name,
    sequence_length = 128,
)
tmp_lm = keras_nlp.models.OPTCausalLM.from_preset(
    model_name, preprocessor=preprocessor
)

# compile the model
tmp_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

# train the model
tmp_lm.fit(train_reddit, epochs = epochs)

# generate text
generate_text(tmp_lm, "How do I create a GPT model?")


GPT-2 Output:
How do I create a GPT model?

i created my own gpt model with my own code and created it using the gpt command-line, then i used the command-line again, and i created the model using the command-line again, but it was still the same code. i created the gpt model with all the files that i wanted to create, and i created a folder with files for each folder i want.

i also created a folder with all of the files i wanted to create, and a folder with all of the files that i want to create. 

i then
TOTAL TIME ELAPSED: 34.67s


Note: I was unable to train the models for more epochs due to computing limitations. I can run them and resubmit later when Colab access improves.