# ORIGINAL MODEL -GPT text generation from scratch with KerasNLP

**Author:** [Jesse Chan](https://github.com/jessechancy)<br>
**Date created:** 2022/07/25<br>
**Last modified:** 2022/07/25<br>
**Description:** Using KerasNLP to train a mini-GPT model for text generation.

## Introduction

In this example, we will use KerasNLP to build a scaled down Generative
Pre-Trained (GPT) model. GPT is a Transformer-based model that allows you to generate
sophisticated text from a prompt.

We will train the model on the [simplebooks-92](https://arxiv.org/abs/1911.12391) corpus,
which is a dataset made from several novels. It is a good dataset for this example since
it has a small vocabulary and high word frequency, which is beneficial when training a
model with few parameters.

This example combines concepts from
[Text generation with a miniature GPT](https://keras.io/examples/generative/text_generation_with_miniature_gpt/)
with KerasNLP abstractions. We will demonstrate how KerasNLP tokenization, layers and
metrics simplify the training
process, and then show how to generate output text using the KerasNLP sampling utilities.

Note: If you are running this example on a Colab,
make sure to enable GPU runtime for faster training.

This example requires KerasNLP. You can install it via the following command:
`pip install keras-nlp`

## Setup

In [1]:
#!pip install keras_nlp

In [2]:
!pip install git+https://github.com/keras-team/keras-nlp.git --upgrade

Collecting git+https://github.com/keras-team/keras-nlp.git
  Cloning https://github.com/keras-team/keras-nlp.git to /tmp/pip-req-build-wzjbmn37
  Running command git clone --filter=blob:none --quiet https://github.com/keras-team/keras-nlp.git /tmp/pip-req-build-wzjbmn37
  Resolved https://github.com/keras-team/keras-nlp.git to commit 9286561f35d4727a373e135217279761edadb486
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting keras-core (from keras-nlp==0.7.0)
  Downloading keras_core-0.1.7-py3-none-any.whl (950 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m950.8/950.8 kB[0m [31m9.1 MB/s[0m eta [36m0:00:00[0m
Collecting tensorflow-text (from keras-nlp==0.7.0)
  Downloading tensorflow_text-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.5/6.5

In [3]:
import os
import keras_nlp
import tensorflow as tf
from tensorflow import keras

Using TensorFlow backend


## Settings & hyperparameters - I decreased the batch size to 32 instead of using the example's 64 value since I kept running out of resources.

In [4]:
# Data
#BATCH_SIZE = 64
BATCH_SIZE = 32
SEQ_LEN = 128
MIN_TRAINING_SEQ_LEN = 450

# Model
EMBED_DIM = 256
FEED_FORWARD_DIM = 256
NUM_HEADS = 3
NUM_LAYERS = 2
VOCAB_SIZE = 5000  # Limits parameters in model.

# Training
EPOCHS = 6

# Inference
NUM_TOKENS_TO_GENERATE = 80

## Load the data

Now, let's download the dataset! The SimpleBooks dataset consists of 1,573 Gutenberg books, and has
one of the smallest vocabulary size to word-level tokens ratio. It has a vocabulary size of ~98k,
a third of WikiText-103's, with around the same number of tokens (~100M). This makes it easy to fit a small model.

In [5]:
keras.utils.get_file(
    origin="https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip",
    extract=True,
)
dir = os.path.expanduser("~/.keras/datasets/simplebooks/")

# Load simplebooks-92 train set and filter out short lines.
raw_train_ds = (
    tf.data.TextLineDataset(dir + "simplebooks-92-raw/train.txt")
    .filter(lambda x: tf.strings.length(x) > MIN_TRAINING_SEQ_LEN)
    .batch(BATCH_SIZE)
    .shuffle(buffer_size=256)
)

# Load simplebooks-92 validation set and filter out short lines.
raw_val_ds = (
    tf.data.TextLineDataset(dir + "simplebooks-92-raw/valid.txt")
    .filter(lambda x: tf.strings.length(x) > MIN_TRAINING_SEQ_LEN)
    .batch(BATCH_SIZE)
)

Downloading data from https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip


In [6]:
raw_train_ds

<_ShuffleDataset element_spec=TensorSpec(shape=(None,), dtype=tf.string, name=None)>

## Train the tokenizer

We train the tokenizer from the training dataset for a vocabulary size of `VOCAB_SIZE`,
which is a tuned hyperparameter. We want to limit the vocabulary as much as possible, as
we will see later on
that it has a large effect on the number of model parameters. We also don't want to include
*too few* vocabulary terms, or there would be too many out-of-vocabulary (OOV) sub-words. In
addition, three tokens are reserved in the vocabulary:

- `"[PAD]"` for padding sequences to `SEQ_LEN`. This token has index 0 in both
`reserved_tokens` and `vocab`, since `WordPieceTokenizer` (and other layers) consider
`0`/`vocab[0]` as the default padding.
- `"[UNK]"` for OOV sub-words, which should match the default `oov_token="[UNK]"` in
`WordPieceTokenizer`.
- `"[BOS]"` stands for beginning of sentence, but here technically it is a token
representing the beginning of each line of training data.

In [7]:
# Train tokenizer vocabulary
vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
    raw_train_ds,
    vocabulary_size=VOCAB_SIZE,
    lowercase=True,
    reserved_tokens=["[PAD]", "[UNK]", "[BOS]"],
)

## Load tokenizer

We use the vocabulary data to initialize
`keras_nlp.tokenizers.WordPieceTokenizer`. WordPieceTokenizer is an efficient
implementation of the WordPiece algorithm used by BERT and other models. It will strip,
lower-case and do other irreversible preprocessing operations.

In [8]:
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    sequence_length=SEQ_LEN,
    lowercase=True,
)

## Tokenize data

We preprocess the dataset by tokenizing and splitting it into `features` and `labels`.

In [9]:
# packer adds a start token
start_packer = keras_nlp.layers.StartEndPacker(
    sequence_length=SEQ_LEN,
    start_value=tokenizer.token_to_id("[BOS]"),
)


def preprocess(inputs):
    outputs = tokenizer(inputs)
    features = start_packer(outputs)
    labels = outputs
    return features, labels


# Tokenize and split into train and label sequences.
train_ds = raw_train_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).prefetch(
    tf.data.AUTOTUNE
)
val_ds = raw_val_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).prefetch(
    tf.data.AUTOTUNE
)

## Build the model

We create our scaled down GPT model with the following layers:

- One `keras_nlp.layers.TokenAndPositionEmbedding` layer, which combines the embedding
for the token and its position.
- Multiple `keras_nlp.layers.TransformerDecoder` layers, with the default causal masking.
The layer has no cross-attention when run with decoder sequence only.
- One final dense linear layer

In [None]:
inputs = keras.layers.Input(shape=(None,), dtype=tf.int32)
# Embedding.
embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)
x = embedding_layer(inputs)
# Transformer decoders.
for _ in range(NUM_LAYERS):
    decoder_layer = keras_nlp.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
    )
    x = decoder_layer(x)  # Giving one argument only skips cross-attention.
# Output.
outputs = keras.layers.Dense(VOCAB_SIZE)(x)
model = keras.Model(inputs=inputs, outputs=outputs)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)
model.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])

Let's take a look at our model summary - a large majority of the
parameters are in the `token_and_position_embedding` and the output `dense` layer!
This means that the vocabulary size (`VOCAB_SIZE`) has a large effect on the size of the model,
while the number of Transformer decoder layers (`NUM_LAYERS`) doesn't affect it as much.

In [None]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 token_and_position_embeddi  (None, None, 256)         1312768   
 ng (TokenAndPositionEmbedd                                      
 ing)                                                            
                                                                 
 transformer_decoder (Trans  (None, None, 256)         394749    
 formerDecoder)                                                  
                                                                 
 transformer_decoder_1 (Tra  (None, None, 256)         394749    
 nsformerDecoder)                                                
                                                                 
 dense (Dense)               (None, None, 5000)        128500

## Training

Now that we have our model, let's train it with the `fit()` method.

In [None]:
model.fit(train_ds, validation_data=val_ds, verbose=2, epochs=EPOCHS)

Epoch 1/6
3169/3169 - 297s - loss: 4.4864 - perplexity: 89.1600 - val_loss: 4.0919 - val_perplexity: 60.5466 - 297s/epoch - 94ms/step
Epoch 2/6
3169/3169 - 233s - loss: 4.0451 - perplexity: 57.3399 - val_loss: 3.9982 - val_perplexity: 55.0629 - 233s/epoch - 74ms/step
Epoch 3/6
3169/3169 - 238s - loss: 3.9384 - perplexity: 51.5314 - val_loss: 3.9162 - val_perplexity: 50.6466 - 238s/epoch - 75ms/step
Epoch 4/6
3169/3169 - 235s - loss: 3.8803 - perplexity: 48.6230 - val_loss: 3.8909 - val_perplexity: 49.5133 - 235s/epoch - 74ms/step
Epoch 5/6
3169/3169 - 235s - loss: 3.8391 - perplexity: 46.6601 - val_loss: 3.8705 - val_perplexity: 48.3285 - 235s/epoch - 74ms/step
Epoch 6/6
3169/3169 - 234s - loss: 3.8097 - perplexity: 45.3073 - val_loss: 3.8503 - val_perplexity: 47.4212 - 234s/epoch - 74ms/step


<keras.src.callbacks.History at 0x7f6e813cfd30>

## Inference

With our trained model, we can test it out to gauge its performance. To do this
we can seed our model with an input sequence starting with the `"[BOS]"` token,
and progressively sample the model by making predictions for each subsequent
token in a loop.

To start lets build a prompt with the same shape as our model inputs, containing
only the `"[BOS]"` token.

In [None]:
# The "packer" layers adds the [BOS] token for us.
prompt_tokens = start_packer(tokenizer([""]))
prompt_tokens

<tf.Tensor: shape=(1, 128), dtype=int32, numpy=
array([[2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int32)>

We will use the `keras_nlp.samplers` module for inference, which requires a
callback function wrapping the model we just trained. This wrapper calls
the model and returns the logit predictions for the current token we are
generating.

Note: There are two pieces of more advanced functionality available when
defining your callback. The first is the ability to take in a `cache` of states
computed in previous generation steps, which can be used to speed up generation.
The second is the ability to output the final dense "hidden state" of each
generated token. This is used by `keras_nlp.samplers.ContrastiveSampler`, which
avoids repetition by penalizing repeated hidden states. Both are optional, and
we will ignore them for now.

In [None]:

def next(prompt, cache, index):
    logits = model(prompt)[:, index - 1, :]
    # Ignore hidden states for now; only needed for contrastive search.
    hidden_states = None
    return logits, hidden_states, cache


Creating the wrapper function is the most complex part of using these functions. Now that
it's done, let's test out the different utilities, starting with greedy search.

### Greedy search

We greedily pick the most probable token at each timestep. In other words, we get the
argmax of the model output.

In [None]:
sampler = keras_nlp.samplers.GreedySampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,  # Start sampling immediately after the [BOS] token.
)
txt = tokenizer.detokenize(output_tokens)
print(f"Greedy search generated text: \n{txt}\n")

Greedy search generated text: 
[b'[BOS] " i have not heard that the king \' s son was a king , and that he was a king , and that he was king of the king of the king , and that he was king of the king of the king of the confederacy , and that he was king of the king of the king of the king of the confederacy , and that he was king of the king of the king , and that he was so great that he was that he was king of the king of the king of the combion that he was king of the king of the king of the c']



As you can see, greedy search starts out making some sense, but quickly starts repeating
itself. This is a common problem with text generation that can be fixed by some of the
probabilistic text generation utilities shown later on!

### Beam search

At a high-level, beam search keeps track of the `num_beams` most probable sequences at
each timestep, and predicts the best next token from all sequences. It is an improvement
over greedy search since it stores more possibilities. However, it is less efficient than
greedy search since it has to compute and store multiple potential sequences.

**Note:** beam search with `num_beams=1` is identical to greedy search.

In [None]:
sampler = keras_nlp.samplers.BeamSampler(num_beams=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Beam search generated text: \n{txt}\n")

Beam search generated text: 
[b'[BOS] " i \' ll tell you what i \' ll tell you what i \' ll tell you what i \' ll tell you . i \' ll tell you what i \' ll tell you what i \' ll tell you . i \' ll tell you what i \' ll tell you what i \' ll tell you . i \' ll tell you what i \' ll tell you what i \' ll tell you . i \' ll tell you what i \' ll tell you about it . i \' ll tell you what i \' ll tell you about it . i \' ll tell you what i \' ll tell you about it . i \' ll tell you how']



Similar to greedy search, beam search quickly starts repeating itself, since it is still
a deterministic method.

### Random search

Random search is our first probabilistic method. At each time step, it samples the next
token using the softmax probabilities provided by the model.

In [None]:
sampler = keras_nlp.samplers.RandomSampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Random search generated text: \n{txt}\n")

Random search generated text: 
[b'[BOS] this was a pretty little space behind the drawing - room . one of the room had been numbered very rough and neat and timothy - looking buildings in the wiggins of the court . this building and keeping watch till close to the lighted room polly , and impassion - - things surely had become st your guest . transfers or stoop for thin mr . damhill overdillantine , in which other carrier , subject which aunt pee - weodrig order , not only i did that talking in going coats , as if one should the girl had under']



Voilà, no repetitions! However, with random search, we may see some nonsensical words
appearing since any word in the vocabulary has a chance of appearing with this sampling
method. This is fixed by our next search utility, top-k search.

### Top-K search

Similar to random search, we sample the next token from the probability distribution
provided by the model. The only difference is that here, we select out the top `k` most
probable tokens, and distribute the probability mass over them before sampling. This way,
we won't be sampling from low probability tokens, and hence we would have less
nonsensical words!

In [None]:
sampler = keras_nlp.samplers.TopKSampler(k=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-K search generated text: \n{txt}\n")

Top-K search generated text: 
[b"[BOS] it is not a man to be afraid of a shriekment . it must have been an awful night . it must not be cold to see it . it ' s a very bad dream and he never have seen it at all , but it ' s not a good thing . he ' s going to have a good chance , that he ' ll tells what he ' s , if he could not tell how it was that he ' s got a good idea of the coil . it ' s so he ' ll make it a nice thing to do ; and i ' m glad that he is a man of his"]



### Top-P search

Even with the top-k search, there is something to improve upon. With top-k search, the
number `k` is fixed, which means it selects the same number of tokens for any probability
distribution. Consider two scenarios, one where the probability mass is concentrated over
2 words and another where the probability mass is evenly concentrated across 10. Should
we choose `k=2` or `k=10`? There is no one size that fits all `k` here.

This is where top-p search comes in! Instead of choosing a `k`, we choose a probability
`p` that we want the probabilities of the top tokens to sum up to. This way, we can
dynamically adjust the `k` based on the probability distribution. By setting `p=0.9`, if
90% of the probability mass is concentrated on the top 2 tokens, we can filter out the
top 2 tokens to sample from. If instead the 90% is distributed over 10 tokens, it will
similarly filter out the top 10 tokens to sample from.

In [None]:
sampler = keras_nlp.samplers.TopPSampler(p=0.5)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-P search generated text: \n{txt}\n")

Top-P search generated text: 
[b'[BOS] " it is a very good thing to be able to do , " said the king , and all the other continentality . " the gruffs of the protestants have not allowed them to stay in the city . it is well that they will be well armed , and they will do it to make the concerts of the thoroughmis of the raisinster and the tubes of the navigation and the whole of the tube with the same complaints . the king has promised to pay his apac']



### Using callbacks for text generation

We can also wrap the utilities in a callback, which allows you to print out a prediction
sequence for every epoch of the model! Here is an example of a callback for top-k search:

In [None]:

class TopKTextGenerator(keras.callbacks.Callback):
    """A callback to generate text from a trained model using top-k."""

    def __init__(self, k):
        self.sampler = keras_nlp.samplers.TopKSampler(k)

    def on_epoch_end(self, epoch, logs=None):
        output_tokens = self.sampler(
            next=next,
            prompt=prompt_tokens,
            index=1,
        )
        txt = tokenizer.detokenize(output_tokens)
        print(f"Top-K search generated text: \n{txt}\n")


text_generation_callback = TopKTextGenerator(k=10)
# Dummy training loop to demonstrate callback.
model.fit(train_ds.take(1), verbose=2, epochs=2, callbacks=[text_generation_callback])

Epoch 1/2
Top-K search generated text: 
[b'[BOS] " that \' s the use , i know i \' e \' ve heard , \' i \' ve \' em in . you see , it is , an \' a \' tr i \' m going to do this \' yer \' n \' a \' i ain \' m goin \' ter git \' \' \' to s a \' s \' s a \' \' ain \' \' t i \' \' n \' t \' a \' th \' \' s \' i \' \' em \' s a \' \' \' svin \' \' \' \' \' t a \' \' \' p \' o \' n \' an \' \' \' \' \' \' all i']

1/1 - 11s - loss: 3.8869 - perplexity: 48.9554 - 11s/epoch - 11s/step
Epoch 2/2
Top-K search generated text: 
[b"[BOS] the young man was an unattaine , a compassion of the same sort of a thing ; and he was very fond of it , so he went up to his father ' s home , and went to the stable with two little kinsmores , and went to the barn to a bale , and told her to stay where he stayed and waited . when he found that they were all the way down the lane where the old woman lived , he was not at all alone in the village , but he was a man , who was only a boy ; and , too , he knew what ha

<keras.src.callbacks.History at 0x7f6e813c9240>

# Model 1 - Identical to GPT Scratch Model from the "GPT text generation from scratch with KerasNLP" (original model) example except with the addition of a Normalization Layer

I added a normalization layer with the layer_norm_epsilon equal to 1e-5 to the original model since normalization layers have been shown to often help speed up and stabilize the learning process. A normalization layer helps prevent vanishing or exploding gradients which slows down learning. A normalization layer accomplishes this by keeping the weights and activations within a reasonable range. Additionally, I added this layer since it helps improve generalization of the model by reducing overfitting. Many of the top performing LLM's include it in their architecture.

Sources:

*   https://towardsdatascience.com/different-normalization-layers-in-deep-learning-1a7214ff71d6
*   https://huggingface.co/blog/optimize-llm
*   https://medium.com/@minh.hoque/demystifying-neural-network-normalization-techniques-4a21d35b14f8






## Build the model


In [None]:
inputs = keras.layers.Input(shape=(None,), dtype=tf.int32)
# Embedding.
embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)
x = embedding_layer(inputs)
# Transformer decoders.
for _ in range(NUM_LAYERS):
    decoder_layer = keras_nlp.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
    )
    x = decoder_layer(x)  # Giving one argument only skips cross-attention.

# Modified Model to include a Normalization Layer
layer_norm_epsilon=1e-5
sequence_output = keras.layers.LayerNormalization(
    name="layer_norm",
    axis=-1,
    epsilon=layer_norm_epsilon,
    dtype="float32",
)(x)

# Output.
outputs = keras.layers.Dense(VOCAB_SIZE)(x)
model = keras.Model(inputs=inputs, outputs=outputs)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)
model.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])

In [None]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, None)]            0         
                                                                 
 token_and_position_embeddi  (None, None, 256)         1312768   
 ng_1 (TokenAndPositionEmbe                                      
 dding)                                                          
                                                                 
 transformer_decoder_2 (Tra  (None, None, 256)         394749    
 nsformerDecoder)                                                
                                                                 
 transformer_decoder_3 (Tra  (None, None, 256)         394749    
 nsformerDecoder)                                                
                                                                 
 dense (Dense)               (None, None, 5000)        128500

## Training

Now that we have our model, let's train it with the `fit()` method.

In [None]:
model.fit(train_ds, validation_data=val_ds, verbose=2, epochs=EPOCHS)

Epoch 1/6
6337/6337 - 341s - loss: 4.3653 - perplexity: 78.9911 - val_loss: 4.0722 - val_perplexity: 59.0935 - 341s/epoch - 54ms/step
Epoch 2/6
6337/6337 - 281s - loss: 3.9851 - perplexity: 53.9963 - val_loss: 3.9452 - val_perplexity: 52.0450 - 281s/epoch - 44ms/step
Epoch 3/6
6337/6337 - 288s - loss: 3.8896 - perplexity: 49.0790 - val_loss: 3.8868 - val_perplexity: 49.1475 - 288s/epoch - 46ms/step
Epoch 4/6
6337/6337 - 278s - loss: 3.8348 - perplexity: 46.4603 - val_loss: 3.8445 - val_perplexity: 47.1634 - 278s/epoch - 44ms/step
Epoch 5/6
6337/6337 - 265s - loss: 3.7958 - perplexity: 44.6805 - val_loss: 3.8208 - val_perplexity: 46.1472 - 265s/epoch - 42ms/step
Epoch 6/6
6337/6337 - 264s - loss: 3.7663 - perplexity: 43.3817 - val_loss: 3.7881 - val_perplexity: 44.5675 - 264s/epoch - 42ms/step


<keras.src.callbacks.History at 0x7c8001370d60>

# Model 2 - Identical to GPT Scratch Model from the "GPT text generation from scratch with KerasNLP" (original model) example except with the addition of a Normalization Layer. This model is identical to Model 1 except in regards to the placement of the normalization layer. In Model 1 the normalization layer was added after the transformer decoders while in Model 2 it is added as part of the transformer decoders (keras_nlp.layers.TransformerDecoder), specifically I use the normalize_first argument and set it to true so the inputs to the attention layer and the intermediate dense layer are normalized in a fashion similar to a GPT-2 model. For both Model 1 and 2, I utilize 1e-5 as the eps value in the layer normalization component.









## Build the model


In [None]:
inputs = keras.layers.Input(shape=(None,), dtype=tf.int32)
# Embedding.
embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)
x = embedding_layer(inputs)
# Transformer decoders.
for _ in range(NUM_LAYERS):
    decoder_layer = keras_nlp.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
        layer_norm_epsilon=1e-5,
        normalize_first=True,   #normalized similarly to gpt-2
    )
    x = decoder_layer(x)  # Giving one argument only skips cross-attention.

# Output.
outputs_2 = keras.layers.Dense(VOCAB_SIZE)(x)
model_2 = keras.Model(inputs=inputs, outputs=outputs_2)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)
model_2.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])

Let's take a look at our model summary - a large majority of the
parameters are in the `token_and_position_embedding` and the output `dense` layer!
This means that the vocabulary size (`VOCAB_SIZE`) has a large effect on the size of the model,
while the number of Transformer decoder layers (`NUM_LAYERS`) doesn't affect it as much.

In [None]:
model_2.summary()

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, None)]            0         
                                                                 
 token_and_position_embeddi  (None, None, 256)         1312768   
 ng_3 (TokenAndPositionEmbe                                      
 dding)                                                          
                                                                 
 transformer_decoder_6 (Tra  (None, None, 256)         394749    
 nsformerDecoder)                                                
                                                                 
 transformer_decoder_7 (Tra  (None, None, 256)         394749    
 nsformerDecoder)                                                
                                                                 
 dense_2 (Dense)             (None, None, 5000)        1285

## Training

Now that we have our model, let's train it with the `fit()` method.

In [None]:
model_2.fit(train_ds, validation_data=val_ds, verbose=2, epochs=EPOCHS)

Epoch 1/6
6337/6337 - 333s - loss: 4.4207 - perplexity: 83.4873 - val_loss: 4.1276 - val_perplexity: 62.5628 - 333s/epoch - 53ms/step
Epoch 2/6
6337/6337 - 288s - loss: 4.0321 - perplexity: 56.5948 - val_loss: 4.0156 - val_perplexity: 55.9750 - 288s/epoch - 45ms/step
Epoch 3/6
6337/6337 - 267s - loss: 3.9227 - perplexity: 50.7253 - val_loss: 3.9591 - val_perplexity: 52.7622 - 267s/epoch - 42ms/step
Epoch 4/6
6337/6337 - 267s - loss: 3.8535 - perplexity: 47.3330 - val_loss: 3.8684 - val_perplexity: 48.3014 - 267s/epoch - 42ms/step
Epoch 5/6
6337/6337 - 271s - loss: 3.8076 - perplexity: 45.2073 - val_loss: 3.8446 - val_perplexity: 47.0781 - 271s/epoch - 43ms/step
Epoch 6/6
6337/6337 - 271s - loss: 3.7750 - perplexity: 43.7537 - val_loss: 3.8230 - val_perplexity: 46.1854 - 271s/epoch - 43ms/step


<keras.src.callbacks.History at 0x7c8015bcf730>

Model 2 performed slightly better than the original model, but not better than Model 1.

# Model 2 Hyperparameter Tuning - Identical to Model 2. In this run I did a grid search by trying out different numbers of layers and number of heads. The combinations I tried consisted of 4, 8, or 12 heads and 6 or 12 layers.









## Build the model and Training
Due to resource constraints not all possible combinations listed in the code below were run through, specifically 24 and 48 layers.  

In [None]:
num_heads_values = [4, 8, 12]
num_layers_values = [6, 12]

for NUM_LAYERS_TEST in num_layers_values:
  for NUM_HEADS_TEST in num_heads_values:
    print(f"Number of Layers is {NUM_LAYERS_TEST} and Number of Heads is {NUM_HEADS_TEST}.")
    inputs = keras.layers.Input(shape=(None,), dtype=tf.int32)
    # Embedding.
    embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
        vocabulary_size=VOCAB_SIZE,
        sequence_length=SEQ_LEN,
        embedding_dim=EMBED_DIM,
        mask_zero=True,
    )
    x = embedding_layer(inputs)
    # Transformer decoders.
    for _ in range(NUM_LAYERS_TEST):
        decoder_layer = keras_nlp.layers.TransformerDecoder(
            num_heads=NUM_HEADS_TEST,
            intermediate_dim=FEED_FORWARD_DIM,
            layer_norm_epsilon=1e-5,
            normalize_first=True,   #normalized similarly to gpt-2
        )
        x = decoder_layer(x)  # Giving one argument only skips cross-attention.

    # Output.
    outputs_3 = keras.layers.Dense(VOCAB_SIZE)(x)
    model_3 = keras.Model(inputs=inputs, outputs=outputs_3)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)
    model_3.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])

    model_3.summary()

    model_3.fit(train_ds, validation_data=val_ds, verbose=2, epochs=EPOCHS)

Number of Layers is 6 and Number of Heads is 4.
Model: "model_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_7 (InputLayer)        [(None, None)]            0         
                                                                 
 token_and_position_embeddi  (None, None, 256)         1312768   
 ng_6 (TokenAndPositionEmbe                                      
 dding)                                                          
                                                                 
 transformer_decoder_16 (Tr  (None, None, 256)         395776    
 ansformerDecoder)                                               
                                                                 
 transformer_decoder_17 (Tr  (None, None, 256)         395776    
 ansformerDecoder)                                               
                                                                 
 transforme

In [None]:
num_heads_values = [4, 8, 12]
num_layers_values = [12, 24, 48]

for NUM_LAYERS_TEST in num_layers_values:
  for NUM_HEADS_TEST in num_heads_values:
    print(f"Number of Layers is {NUM_LAYERS_TEST} and Number of Heads is {NUM_HEADS_TEST}.")
    inputs = keras.layers.Input(shape=(None,), dtype=tf.int32)
    # Embedding.
    embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
        vocabulary_size=VOCAB_SIZE,
        sequence_length=SEQ_LEN,
        embedding_dim=EMBED_DIM,
        mask_zero=True,
    )
    x = embedding_layer(inputs)
    # Transformer decoders.
    for _ in range(NUM_LAYERS_TEST):
        decoder_layer = keras_nlp.layers.TransformerDecoder(
            num_heads=NUM_HEADS_TEST,
            intermediate_dim=FEED_FORWARD_DIM,
            layer_norm_epsilon=1e-5,
            normalize_first=True,   #normalized similarly to gpt-2
        )
        x = decoder_layer(x)  # Giving one argument only skips cross-attention.

    # Output.
    outputs_3 = keras.layers.Dense(VOCAB_SIZE)(x)
    model_3 = keras.Model(inputs=inputs, outputs=outputs_3)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)
    model_3.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])

    model_3.summary()

    model_3.fit(train_ds, validation_data=val_ds, verbose=2, epochs=EPOCHS)

Number of Layers is 12 and Number of Heads is 4.
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 token_and_position_embeddi  (None, None, 256)         1312768   
 ng (TokenAndPositionEmbedd                                      
 ing)                                                            
                                                                 
 transformer_decoder (Trans  (None, None, 256)         395776    
 formerDecoder)                                                  
                                                                 
 transformer_decoder_1 (Tra  (None, None, 256)         395776    
 nsformerDecoder)                                                
                                                                 
 transformer

# Model 3 - This model is identical to model 2 except for the addition of dropout layers with a dropout rate of 0.1. Specifically, a dropout layer was added after the embedding layer, before each of the transformer decoder layers, and right before the output layer.


I added dropout layers to the model since dropout layers help prevent overfitting by randomly dropping neurons from the model's input and hidden  layers, thus helping the model to not overfocus on certain features.

Sources:

*   https://towardsdatascience.com/combating-overfitting-with-dropout-regularization-f721e8712fbe#:~:text=Let's%20recap%20%E2%80%94%20dropout%20is%20a,the%20input%20and%20hidden%20layers.








## Build the model and Training
Due to resource constraints not all possible combinations listed in the code below were run through, specifically 24 and 48 layers.

In [None]:
num_heads_values = [4, 8, 12]
num_layers_values = [12, 24, 48]

for NUM_LAYERS_TEST in num_layers_values:
  for NUM_HEADS_TEST in num_heads_values:
    print(f"Number of Layers is {NUM_LAYERS_TEST} and Number of Heads is {NUM_HEADS_TEST}.")
    inputs = keras.layers.Input(shape=(None,), dtype=tf.int32)
    # Embedding.
    embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
        vocabulary_size=VOCAB_SIZE,
        sequence_length=SEQ_LEN,
        embedding_dim=EMBED_DIM,
        mask_zero=True,
    )
    x = embedding_layer(inputs)
    # Transformer decoders.
    for _ in range(NUM_LAYERS_TEST):
        x = keras.layers.Dropout(rate=0.1)(x) # Added a dropout layer
        decoder_layer = keras_nlp.layers.TransformerDecoder(
            num_heads=NUM_HEADS_TEST,
            intermediate_dim=FEED_FORWARD_DIM,
            layer_norm_epsilon=1e-5,
            normalize_first=True,   #normalized similarly to gpt-2
        )
        x = decoder_layer(x)  # Giving one argument only skips cross-attention.

    # Output.
    x = keras.layers.Dropout(rate=0.1)(x) # Added a dropout layer
    outputs_4 = keras.layers.Dense(VOCAB_SIZE)(x)
    model_4 = keras.Model(inputs=inputs, outputs=outputs_4)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)
    model_4.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])

    model_4.summary()

    model_4.fit(train_ds, validation_data=val_ds, verbose=2, epochs=EPOCHS)

Number of Layers is 12 and Number of Heads is 4.
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 token_and_position_embeddi  (None, None, 256)         1312768   
 ng (TokenAndPositionEmbedd                                      
 ing)                                                            
                                                                 
 dropout (Dropout)           (None, None, 256)         0         
                                                                 
 transformer_decoder (Trans  (None, None, 256)         395776    
 formerDecoder)                                                  
                                                                 
 dropout_2 (Dropout)         (None, None, 256)         0         
            

# Model 4 - The fourth model I tested used reversible embedding instead of token and position embedding. The model included normalization layers similarly to model 2 and 3 as part of the transformer decoders. This model used the gelu_approximate activation instead of the default value of Relu as all the previous models used. Since dropout layers did not improve the metrics based on the results of the third model, I choose to not include them in this model.

Sources: https://keras.io/api/keras_nlp/modeling_layers/transformer_decoder/







## Build the model and Training

In [None]:
from keras_nlp.layers.modeling.reversible_embedding import ReversibleEmbedding
from keras_nlp.utils.keras_utils import gelu_approximate
num_heads_values = [4, 8, 12]
num_layers_values = [12, 24, 48]

for NUM_LAYERS_TEST in num_layers_values:
  for NUM_HEADS_TEST in num_heads_values:
    print(f"Number of Layers is {NUM_LAYERS_TEST} and Number of Heads is {NUM_HEADS_TEST}.")
    inputs = keras.layers.Input(shape=(None,), dtype=tf.int32)
    # Embedding.
    embedding_layer = ReversibleEmbedding(
        input_dim=VOCAB_SIZE,
        #sequence_length=SEQ_LEN,
        output_dim=EMBED_DIM, #256
        mask_zero=True,
    )
    x = embedding_layer(inputs)
    #x = keras_nlp.layers.PositionEmbedding(
     #   sequence_length=SEQ_LEN,)(embedding_L)

    # Transformer decoders.
    for _ in range(NUM_LAYERS_TEST):
        decoder_layer = keras_nlp.layers.TransformerDecoder(
            num_heads=NUM_HEADS_TEST,
            intermediate_dim=FEED_FORWARD_DIM,
            layer_norm_epsilon=1e-5,
            activation=gelu_approximate, # uses gelu activation instead of default relu activation
            normalize_first=True,   #normalized similarly to gpt-2
        )
        x = decoder_layer(x)  # Giving one argument only skips cross-attention.

    # Output.
    outputs_5 = keras.layers.Dense(VOCAB_SIZE)(x)
    model_5 = keras.Model(inputs=inputs, outputs=outputs_5)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)
    model_5.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])

    model_5.summary()

    model_5.fit(train_ds, validation_data=val_ds, verbose=2, epochs=EPOCHS)

Number of Layers is 12 and Number of Heads is 4.
Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_6 (InputLayer)        [(None, None)]            0         
                                                                 
 reversible_embedding_2 (Re  (None, None, 256)         1280000   
 versibleEmbedding)                                              
                                                                 
 transformer_decoder_12 (Tr  (None, None, 256)         395776    
 ansformerDecoder)                                               
                                                                 
 transformer_decoder_13 (Tr  (None, None, 256)         395776    
 ansformerDecoder)                                               
                                                                 
 transformer_decoder_14 (Tr  (None, None, 256)         395776    
 ansformer

# Model 5 - Pre-Trained OPT Model

## Load a pre-trained OPT model and generate some text



In [None]:
preprocessor = keras_nlp.models.OPTCausalLMPreprocessor.from_preset(
    "opt_125m_en",
    sequence_length=SEQ_LEN,
)
opt_lm = keras_nlp.models.OPTCausalLM.from_preset(
    "opt_125m_en", preprocessor=preprocessor
)

Downloading data from https://storage.googleapis.com/keras-nlp/models/opt_125m_en/v1/vocab.json
Downloading data from https://storage.googleapis.com/keras-nlp/models/opt_125m_en/v1/merges.txt
Downloading data from https://storage.googleapis.com/keras-nlp/models/opt_125m_en/v1/model.h5


In [None]:
opt_lm.summary()

Once the model is loaded, you can use it to generate some text right away. Run
the cells below to give it a try. It's as simple as calling a single function
*generate()*:

In [None]:
import time
start = time.time()

output = opt_lm.generate("My trip to Yosemite was", max_length=200)
print("\nOPT output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")


OPT output:
My trip to Yosemite was a success. It was my first time to Yosemite and the first time I've ever been to the park. The park is great! I've only been there once before but it was a great experience.
I was there once and I loved it. I had to get my passport back and was so happy to be there. It was a great experience.
TOTAL TIME ELAPSED: 26.33s


Try another one:

In [None]:
start = time.time()

output = opt_lm.generate("That Italian restaurant is", max_length=200)
print("\nOPT output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")


OPT output:
That Italian restaurant is so damn good and so cheap!
The place is awesome.  The menu is pretty good.  I've been there before.
I love the Italian food. The food in the restaurant was amazing, but the service was really bad. I'm not even a huge fan of Italian food, just the Italian ones. I've been in Italy and I've always loved Italian food, but I've never really been a fan of the food.
You've never been a fan of the food?
Nope. I've been a fan of the food since I was young, though I'm a sucker for Italian food (and Italian wine), and I'm a sucker for Italian food. I just never really liked the Italian food.
TOTAL TIME ELAPSED: 1.81s


## Build the model


In [None]:
learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=train_ds.cardinality() * EPOCHS,
    end_learning_rate=0.0,
)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)


In [None]:
perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)
opt_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    weighted_metrics=["accuracy"],
    loss=loss_fn,
    metrics=[perplexity],
)

## Training

Now that we have our model, let's train it with the `fit()` method.

Due to resource constraints not all of the epochs ran through fully.

In [None]:
#opt_lm.fit(train_ds, validation_data=val_ds, verbose=2, epochs=EPOCHS)
opt_lm.fit(raw_train_ds, validation_data=raw_val_ds, verbose=2, epochs=EPOCHS)


Epoch 1/6
6337/6337 - 6036s - loss: 3.7965 - perplexity: 75.5610 - accuracy: 0.2863 - val_loss: 3.3323 - val_perplexity: 88.0572 - val_accuracy: 0.2927 - 6036s/epoch - 953ms/step
Epoch 2/6
6337/6337 - 5938s - loss: 3.7965 - perplexity: 75.5611 - accuracy: 0.2863 - val_loss: 3.3323 - val_perplexity: 88.0572 - val_accuracy: 0.2927 - 5938s/epoch - 937ms/step
Epoch 3/6
6337/6337 - 5970s - loss: 3.7966 - perplexity: 75.5694 - accuracy: 0.2862 - val_loss: 3.3323 - val_perplexity: 88.0572 - val_accuracy: 0.2927 - 5970s/epoch - 942ms/step
Epoch 4/6


# Model 6 - Pre-Trained GPT2 Model

## Load a pre-trained GPT2 model and generate some text


In [None]:
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=SEQ_LEN,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=preprocessor
)

Downloading data from https://storage.googleapis.com/keras-nlp/models/gpt2_base_en/v1/vocab.json
Downloading data from https://storage.googleapis.com/keras-nlp/models/gpt2_base_en/v1/merges.txt
Downloading data from https://storage.googleapis.com/keras-nlp/models/gpt2_base_en/v1/model.h5


In [None]:
gpt2_lm.summary()

Once the model is loaded, you can use it to generate some text right away. Run
the cells below to give it a try. It's as simple as calling a single function
*generate()*:

In [None]:
import time
start = time.time()

output = gpt2_lm.generate("My trip to Yosemite was", max_length=200)
print("\nGPT2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")


GPT2 output:
My trip to Yosemite was a little different from what I expected. The first time I saw the Yosemite Valley is when I was a kid. This time I'm just going back to a time where I could see the entire valley. I didn't have any time to think about what was going on, and it was a little more relaxing and relaxing than before.

I've been on a lot of hikes this past summer. I was able to get to Yosemite with friends and get a good view of the park and the Yosemite Valley.

The first time I saw the Yosemite Valley, I was in a group of friends who were in the area for a hike and had some time to think. I was in the back seat of a car and was driving down the road. I didn't know what I was going to do and I thought, "What is this place like?" I was just driving around. I was just trying to think of what was going on. I was so
TOTAL TIME ELAPSED: 32.56s


Try another one:

In [None]:
start = time.time()

output = gpt2_lm.generate("That Italian restaurant is", max_length=200)
print("\nGPT2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")


GPT2 output:
That Italian restaurant is a good thing. The Italian-made pizza is delicious and the meat and vegetables are fresh and tender. The food has a nice touch and the service is very pleasant. The staff is friendly, attentive and attentive. I have been to the restaurant twice and I can say that this restaurant is one to visit. The food is very good and the staff are always friendly and knowledgeable. I have been coming here for about a month and am still waiting. The staff is very friendly and attentive and I am happy with my purchase. The food is good but I am not sure if it was the quality of the food or the service. I am a fan.
TOTAL TIME ELAPSED: 1.69s


## Build the model

In [None]:
learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=train_ds.cardinality() * EPOCHS,
    end_learning_rate=0.0,
)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

In [None]:
perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    weighted_metrics=["accuracy"],
    loss=loss_fn,
    metrics=[perplexity],
)

## Training

Now that we have our model, let's train it with the `fit()` method.

In [None]:
gpt2_lm.fit(raw_train_ds, validation_data=raw_val_ds, verbose=2, epochs=EPOCHS)

Epoch 1/6
6337/6337 - 6792s - loss: 3.9490 - perplexity: 61.2845 - accuracy: 0.2697 - val_loss: 3.3554 - val_perplexity: 39.7276 - val_accuracy: 0.2976 - 6792s/epoch - 1s/step
Epoch 2/6
6337/6337 - 6710s - loss: 3.9487 - perplexity: 61.2680 - accuracy: 0.2697 - val_loss: 3.3554 - val_perplexity: 39.7276 - val_accuracy: 0.2976 - 6710s/epoch - 1s/step
Epoch 3/6
6337/6337 - 6715s - loss: 3.9486 - perplexity: 61.2597 - accuracy: 0.2697 - val_loss: 3.3554 - val_perplexity: 39.7276 - val_accuracy: 0.2976 - 6715s/epoch - 1s/step
Epoch 4/6
6337/6337 - 6719s - loss: 3.9487 - perplexity: 61.2712 - accuracy: 0.2697 - val_loss: 3.3554 - val_perplexity: 39.7276 - val_accuracy: 0.2976 - 6719s/epoch - 1s/step
Epoch 5/6


# Results Summary

The first model I tested was identical to the GPT Scratch Model from the "GPT text generation from scratch with KerasNLP" (original model) example except with the addition of a Normalization Layer. I added a normalization layer with the layer_norm_epsilon equal to 1e-5 to the original model since normalization layers have been shown to often help speed up and stabilize the learning process. A normalization layer helps prevent vanishing or exploding gradients which slows down learning. A normalization layer accomplishes this by keeping the weights and activations within a reasonable range. Additionally, I added this layer since it helps improve generalization of the model by reducing overfitting. Many of the top performing LLM's include it in their architecture. The loss, perplexity, validation loss, and validation perplexity all improved slightly relative to the original model, specifically the validation perplexity decreased to 44.5675 previously 47.4212 at epoch 6.

The second model I tested is identical to Model 1 except in regards to the placement of the normalization layer. In Model 1 the normalization layer was added after the transformer decoders while in Model 2 it is added as part of the transformer decoders (keras_nlp.layers.TransformerDecoder), specifically I use the normalize_first argument and set it to true so the inputs to the attention layer and the intermediate dense layer are normalized in a fashion similar to a GPT-2 model. For both Model 1 and 2, I utilize 1e-5 as the eps value in the layer normalization component. Model 2 performed slightly better than the original model, but not better than Model 1.
Next I continued to work with Model 2, specifically, I did a grid search by trying out different numbers of layers (6, 12) and number of heads (4, 8, 12) to optimize results. The original model used 3 heads and 2 layers. The model that performed best out of the ones I tried had 12 layers and 12 heads, it resulted in a validation perplexity of 35.34, which is lower than the 47.4212 at epoch 6 reported by the original model.

The third model I tested is identical to model 2 except for the addition of dropout layers with a dropout rate of 0.1. Specifically, a dropout layer was added after the embedding layer, before each of the transformer decoder layers, and right before the output layer. I added dropout layers to the model since dropout layers help prevent overfitting by randomly dropping neurons from the model's input and hidden layers, thus helping the model to not overfocus on certain features. I also did a grid search by trying out different numbers of heads (4, 8, 12) with 12 layers to optimize results for this model. I didn’t try out 6 layers for this model since 12 layers were shown to perform better based on my results from model 2. The three variations in hyperparameter tuning for model 3 had the values for loss, perplexity, validation loss, and validation perplexity as worse off relative to the original model.

The fourth model I tested used reversible embedding instead of token and position embedding. The model included normalization layers similarly to model 2 and 3 as part of the transformer decoders. This model used the gelu_approximate activation instead of the default value of Relu as all the previous models used. Since dropout layers did not improve the metrics based on the results of the third model, I choose to not include them in this model. I ran three variations of model 4 by varying the number of heads as 4, 8, or 12 with 12 layers. The optimal hyperparameter tuning for this model was 12 heads with 12 layers. Model 4 with 12 heads and 12 layers performed better than the original model, however, the metrics basically tied (were extremely close to) with model 2 that used 12 heads and 12 layers.

The fifth model I tested used the pre-trained OPT model. The loss and validation loss had values similar to the original model, however surprisingly the perplexity and validation perplexity performed much worse relative to the original model. This model provided a validation perplexity of 88.05 while in the original model it was 47.4212.

The sixth model I tested used the pre-trained GPT2 model. The loss and validation perplexity had a slightly higher value than the original model, thus it did not outperform the original model. The training perplexity had a much higher value relative to the validation perplexity.  


Overall the model that performed the best was model 2 and 4 each using 12 layers and 12 heads.
