# Generating Text Using a Transformer Decoder-Only Model

## Overview

In this example, we will use KerasNLP to build a scaled-down generative model using a Transformer decoder. A generative model allows you to generate sophisticated text from a prompt. In this lab, we will build a model that only uses the decoder from the Transformer stack. At each stage, for a given word, the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models. Some examples of decoder-only models are [Transformer XL](https://arxiv.org/pdf/1901.02860.pdf), [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), and [CTRL](https://arxiv.org/pdf/1909.05858.pdf).
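
The "can only access the words positioned before it" constraint is usually implemented as a causal (lower-triangular) attention mask. The following is a minimal pure-Python sketch of such a mask; the helper name `causal_mask` is hypothetical and is not part of KerasNLP, which builds its mask internally.

```python
def causal_mask(seq_len):
    """Row i marks the positions that token i may attend to (itself and earlier ones)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    # "x" = attention allowed, "." = attention blocked
    print(" ".join("x" if allowed else "." for allowed in row))
```

Each row allows one more position than the previous one, so no token can look ahead at the words it is supposed to predict.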

We will train the model on the [simplebooks-92](https://arxiv.org/abs/1911.12391) corpus, a dataset created from 1,573 Gutenberg books. It is a good dataset for this example since it has a small vocabulary and high word frequency, which is beneficial when training a generative model with few parameters.

This notebook demonstrates how to use KerasNLP tokenization, layers and metrics to simplify the training process, and then shows how to generate output text using the KerasNLP sampling utilities.

### Learning Objectives
- Learn how to prepare a dataset for generative models using a WordPiece tokenizer
- Learn how to use KerasNLP to build a generative model
- Learn different inference techniques to output text from a prompt


## Setup

In order to run this notebook, you will need `keras_nlp`. KerasNLP is a natural language processing library that works natively with TensorFlow, JAX, or PyTorch. KerasNLP offers transformer layers that are extremely helpful for building the generative model in this notebook.

Uncomment the cell below if you don't have `keras_nlp` already installed. You may need to restart the kernel once it has been installed.

In [None]:
#!pip install keras-nlp

In [None]:
import os
import warnings

warnings.filterwarnings("ignore")
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

In [None]:
import keras_nlp
import tensorflow as tf
from tensorflow import keras

### Before you start
Please make sure you have a GPU (1 x NVIDIA Tesla T4 should be enough) attached to your notebook instance so that the training doesn't take too long.

To check if you have a GPU attached you can run the following command:

In [None]:
# this should output "Num GPUs Available: 1" if you have one GPU attached
print("Num GPUs Available: ", len(tf.config.list_physical_devices("GPU")))

## Settings & Hyperparameters

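We're going to initialize some hyperparameters here. `SEQ_LEN` defines the maximum length, in tokens, of a sequence in our dataset; we will use it as the sequence length in our `WordPieceTokenizer`. We also define `MIN_TRAINING_SEQ_LEN` to clean up the dataset by dropping any line that is too short. Note that this threshold is compared against `tf.strings.length` of the raw line, so it is measured on the raw text rather than in tokens.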

In [None]:
# Data
BATCH_SIZE = 64
SEQ_LEN = 128
MIN_TRAINING_SEQ_LEN = 450

# Model
EMBED_DIM = 256
FEED_FORWARD_DIM = 256
NUM_HEADS = 3
NUM_LAYERS = 2
VOCAB_SIZE = 5000  # Limits parameters in model

## Load the Data

We will be using the SimpleBooks dataset for this notebook. The SimpleBooks dataset consists of 1,573 Gutenberg books and has a small ratio of vocabulary size to word-level tokens. It has a vocabulary size of ~98k. This small size makes it easier to fit a small transformer model. In this block we will use [TensorFlow's data API](https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset) to load the data.

In [None]:
keras.utils.get_file(
    origin="https://storage.googleapis.com/asl-public/text/data/simplebooks.zip",
    extract=True,
)
data_dir = os.path.expanduser("~/.keras/datasets/simplebooks/")

# Load simplebooks-92 train set and filter out short lines using MIN_TRAINING_SEQ_LEN
raw_train_ds = (
    tf.data.TextLineDataset(data_dir + "simplebooks-92-raw/train.txt")
    .filter(lambda x: tf.strings.length(x) > MIN_TRAINING_SEQ_LEN)
    .batch(BATCH_SIZE)
    .shuffle(buffer_size=256)
)

# Load simplebooks-92 validation set and filter out short lines using MIN_TRAINING_SEQ_LEN
raw_val_ds = (
    tf.data.TextLineDataset(data_dir + "simplebooks-92-raw/valid.txt")
    .filter(lambda x: tf.strings.length(x) > MIN_TRAINING_SEQ_LEN)
    .batch(BATCH_SIZE)
)

## Train the Tokenizer

We train the tokenizer using KerasNLP's [compute_word_piece_vocabulary](https://keras.io/api/keras_nlp/tokenizers/compute_word_piece_vocabulary/) on the training dataset for a vocabulary size of `VOCAB_SIZE`, which is a tuned hyperparameter. We want to limit the vocabulary as much as possible, since it has a large effect on the number of model parameters. We also don't want to include *too few* words, or there would be too many out-of-vocabulary (OOV) sub-words. In addition, three tokens are reserved in the vocabulary:

- `"[PAD]"` for padding sequences to `SEQ_LEN`. This token has index 0 in both `reserved_tokens` and `vocab`, since `WordPieceTokenizer` (and other layers) consider `0`/`vocab[0]` as the default padding.
- `"[UNK]"` for OOV sub-words, which should match the default `oov_token="[UNK]"` in
`WordPieceTokenizer`.
- `"[BOS]"` stands for beginning of sentence, but here technically it is a token representing the beginning of each line of training data.

This cell takes ~5-10 mins to execute because it is computing the word piece vocabulary on the entire dataset.

In [None]:
# Train tokenizer vocabulary
print("Training the word piece tokenizer. This will take 5-10 mins...")
vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
    raw_train_ds,
    vocabulary_size=VOCAB_SIZE,
    lowercase=True,
    reserved_tokens=["[PAD]", "[UNK]", "[BOS]"],
)
print("Training is complete!!")

## Load Tokenizer

We use the vocabulary data to initialize [keras_nlp.tokenizers.WordPieceTokenizer](https://keras.io/api/keras_nlp/tokenizers/word_piece_tokenizer/). `WordPieceTokenizer` is an efficient implementation of the WordPiece algorithm used by BERT and other models. It will strip, lower-case and do other irreversible preprocessing operations. Given a vocabulary and an input sentence, the WordPiece tokenizer will convert the sentence into an array of IDs and pad it to the defined `SEQ_LEN`. For example, consider the following vocabulary, input and sequence length:

```
vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
inputs = "The quick brown fox."
SEQ_LEN = 10
```

Passing these to the `WordPieceTokenizer` returns:
```
array([1, 2, 3, 4, 5, 6, 7, 0, 0, 0], dtype=int32)
```
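
To make the example above concrete, here is a minimal pure-Python sketch of greedy longest-match-first WordPiece tokenization with padding. It is only an illustration under simplifying assumptions (the helper name `wordpiece_tokenize` and the regular expression used for splitting are hypothetical); the actual `WordPieceTokenizer` performs more preprocessing and is implemented differently.

```python
import re


def wordpiece_tokenize(text, vocab, seq_len, unk_token="[UNK]", pad_id=0):
    """Greedy longest-match-first WordPiece lookup, padded to seq_len."""
    token_to_id = {token: i for i, token in enumerate(vocab)}
    ids = []
    # Lowercase, then split into words and standalone punctuation marks.
    for word in re.findall(r"\w+|[^\w\s]", text.lower()):
        start, pieces = 0, []
        while start < len(word):
            end, piece_id = len(word), None
            # Try the longest remaining substring first, then shrink it.
            while end > start:
                piece = word[start:end]
                if start > 0:
                    piece = "##" + piece  # continuation pieces carry a "##" prefix
                if piece in token_to_id:
                    piece_id = token_to_id[piece]
                    break
                end -= 1
            if piece_id is None:
                # No piece matched: the whole word maps to the OOV token.
                pieces = [token_to_id[unk_token]]
                break
            pieces.append(piece_id)
            start = end
        ids.extend(pieces)
    ids = ids[:seq_len]
    return ids + [pad_id] * (seq_len - len(ids))


toy_vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
toy_ids = wordpiece_tokenize("The quick brown fox.", toy_vocab, seq_len=10)
print(toy_ids)
```

Here "quick" is split into `qu` + `##ick` and "brown" into `br` + `##own`, which is how a small vocabulary can still cover words it has never stored whole.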

In [None]:
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    sequence_length=SEQ_LEN,
    lowercase=True,
)

## Tokenize Data

We preprocess the dataset by tokenizing and splitting it into `features` and `labels`. Since this is a language modeling task, the goal will be to predict a "label sequence" of "next words" from a "features sequence" of "previous words". In order to obtain the features, we shift the original sentence to the right by prepending the `[BOS]` token.

In [None]:
# packer adds a start token
start_packer = keras_nlp.layers.StartEndPacker(
    sequence_length=SEQ_LEN,
    start_value=tokenizer.token_to_id("[BOS]"),
)


def preprocess(inputs):
    outputs = tokenizer(inputs)
    features = start_packer(outputs)
    labels = outputs
    return features, labels


# Tokenize and split into train and label sequences.
train_ds = raw_train_ds.map(
    preprocess, num_parallel_calls=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)
val_ds = raw_val_ds.map(
    preprocess, num_parallel_calls=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)
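
The shift between features and labels can be illustrated on a toy sequence. This is a pure-Python sketch with made-up token IDs; `shift_right` is a hypothetical helper that mimics prepending a start token and truncating back to the original length.

```python
def shift_right(token_ids, bos_id):
    """Prepend the start token and drop the last token to keep the length fixed."""
    return [bos_id] + token_ids[:-1]


toy_labels = [11, 12, 13, 14, 0, 0]  # made-up token IDs, padded with 0
toy_features = shift_right(toy_labels, bos_id=2)

for feature, label in zip(toy_features, toy_labels):
    print(f"input token {feature:>2} -> target token {label:>2}")
```

At every position the model sees the tokens up to and including the feature token and is asked to predict the label token, which is the next token of the original line.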

## Build the model

We create our scaled-down transformer-decoder-based generative text model with the following layers:

- One `keras_nlp.layers.TokenAndPositionEmbedding` layer, which combines the embedding for the token and its position. This is different from a traditional embedding layer because it creates trainable positional embeddings instead of fixed sinusoidal embeddings.
- Multiple `keras_nlp.layers.TransformerDecoder` layers created using a loop.
- One final dense linear layer.

**Note:** You can take a look at the [source code](https://github.com/keras-team/keras-nlp/blob/v0.6.1/keras_nlp/layers/modeling/transformer_decoder.py#L31) of the `TransformerDecoder` layer to see the different components that go into it.

In [None]:
inputs = keras.layers.Input(shape=(None,), dtype=tf.int32)
# Embedding layer
embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)
x = embedding_layer(inputs)
# Transformer decoder layers
for _ in range(NUM_LAYERS):
    decoder_layer = keras_nlp.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
    )
    x = decoder_layer(x)  # Giving one argument only skips cross-attention
# Output layer
outputs = keras.layers.Dense(VOCAB_SIZE)(x)
model = keras.Model(inputs=inputs, outputs=outputs)

# set up the loss and the perplexity metric
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)

# compile the model
model.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])
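
As a rough intuition for the perplexity metric used above: perplexity is commonly defined as the exponential of the mean negative log-likelihood of the correct tokens, with padding positions excluded. The sketch below uses made-up probabilities and a hypothetical helper `toy_perplexity`; it illustrates the definition and is not the KerasNLP implementation.

```python
import math


def toy_perplexity(correct_token_probs, labels, mask_token_id=0):
    """exp(mean negative log-likelihood) over non-padding positions."""
    nll = [
        -math.log(prob)
        for prob, label in zip(correct_token_probs, labels)
        if label != mask_token_id
    ]
    return math.exp(sum(nll) / len(nll))


# Probability the model assigned to each correct token (made-up values).
# The last two positions are padding (label 0) and are ignored.
toy_probs = [0.5, 0.25, 0.5, 0.25, 1.0, 1.0]
toy_targets = [17, 42, 8, 99, 0, 0]
ppl = toy_perplexity(toy_probs, toy_targets)
print(f"perplexity: {ppl:.3f}")
```

A model that guesses uniformly over a vocabulary of 5000 tokens would have a perplexity of about 5000 under this definition, so lower values mean the model is less "surprised" by the text.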

Let's take a look at our model summary - a large majority of the
parameters are in the `token_and_position_embedding` and the output `dense` layer!
This means that the vocabulary size (`VOCAB_SIZE`) has a large effect on the size of the model,
while the number of Transformer decoder layers (`NUM_LAYERS`) doesn't affect it as much.

In [None]:
model.summary()
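
The vocabulary-dependent part of the parameter count can be checked by hand. The arithmetic below is a sketch that assumes the hyperparameter values defined earlier in this notebook, one token embedding table, one position embedding table, and a final dense layer with a bias; the decoder layers are deliberately left out.

```python
VOCAB_SIZE, SEQ_LEN, EMBED_DIM = 5000, 128, 256

token_embedding = VOCAB_SIZE * EMBED_DIM  # one vector per vocabulary entry
position_embedding = SEQ_LEN * EMBED_DIM  # one vector per position
output_dense = EMBED_DIM * VOCAB_SIZE + VOCAB_SIZE  # weights + bias

vocab_dependent = token_embedding + output_dense
print(f"token embedding:    {token_embedding:>9,}")
print(f"position embedding: {position_embedding:>9,}")
print(f"output dense:       {output_dense:>9,}")
print(f"depends on vocab:   {vocab_dependent:>9,}")
```

Doubling `VOCAB_SIZE` would roughly double the two largest terms, which is why the vocabulary size is kept small in this example.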

## Training

Now that we have our model, let's train it with the `fit()` method.

In [None]:
EPOCHS = 1  # increase the number of epochs for better results
print("Training started, this could take 4-10 mins per epoch with a T4 GPU...")
model.fit(train_ds, validation_data=val_ds, verbose=2, epochs=EPOCHS)
print("Training is complete!!")

## Inference

With our trained model, we can test it out to gauge its performance. To do this we can seed our model with an input sequence starting with the `"[BOS]"` token, and progressively sample the model by making predictions for each subsequent token in a loop.

To start let us build a prompt with the same shape as our model inputs, containing only the `"[BOS]"` token.

In [None]:
# The "packer" layer adds the [BOS] token for us.
prompt_tokens = start_packer(tokenizer([""]))
prompt_tokens

We will use the `keras_nlp.samplers` module for inference, which requires a callback function wrapping the model we just trained. This wrapper calls the model and returns the logit predictions for the current token we are generating.

**Note:** There are two pieces of more advanced functionality available when defining your callback. The first is the ability to take in a `cache` of states computed in previous generation steps, which can be used to speed up generation. The second is the ability to output the final dense "hidden state" of each generated token. This is used by `keras_nlp.samplers.ContrastiveSampler`, which avoids repetition by penalizing repeated hidden states. Both are optional, and we will ignore them for now.

In [None]:
def next(prompt, cache, index):
    logits = model(prompt)[:, index - 1, :]
    # Ignore hidden states for now; only needed for contrastive search.
    hidden_states = None
    return logits, hidden_states, cache

Creating the wrapper function is the most complex part of using these functions. Now that
it's done, let's test out the different utilities, starting with greedy search.

### Greedy search

We greedily pick the most probable token at each timestep. In other words, we get the
argmax of the model output.

In [None]:
sampler = keras_nlp.samplers.GreedySampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,  # Start sampling immediately after the [BOS] token.
)
txt = tokenizer.detokenize(output_tokens)
print(f"Greedy search generated text: \n{txt}\n")

As you can see, greedy search starts out making some sense, but quickly starts repeating itself. This is a common problem with text generation that can be fixed by some of the probabilistic text generation utilities shown later on!
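
The repetition is easy to reproduce on a toy example. In this pure-Python sketch, `TOY_LOGITS` is a made-up stand-in for a model whose next-token logits depend only on the last token; it is not related to the trained model or to `GreedySampler` internals.

```python
# Hypothetical toy "model": next-token logits indexed by the last token.
TOY_LOGITS = {
    0: [0.0, 2.0, 1.0, 0.5],  # after token 0, token 1 scores highest
    1: [0.1, 0.0, 3.0, 0.2],  # after token 1, token 2 scores highest
    2: [0.3, 2.5, 0.0, 0.4],  # after token 2, token 1 scores highest
    3: [1.0, 0.0, 0.0, 0.0],
}


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def greedy_decode(start_token, steps):
    """Always append the highest-scoring next token."""
    tokens = [start_token]
    for _ in range(steps):
        tokens.append(argmax(TOY_LOGITS[tokens[-1]]))
    return tokens


greedy_tokens = greedy_decode(start_token=0, steps=6)
print(greedy_tokens)
```

Once the sequence reaches a pair of tokens that point at each other, a deterministic argmax can never leave the loop.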

### Beam search

At a high-level, beam search keeps track of the `num_beams` most probable sequences at
each timestep, and predicts the best next token from all sequences. It is an improvement
over greedy search since it stores more possibilities. However, it is less efficient than
greedy search since it has to compute and store multiple potential sequences.

**Note:** beam search with `num_beams=1` is identical to greedy search.

In [None]:
sampler = keras_nlp.samplers.BeamSampler(num_beams=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Beam search generated text: \n{txt}\n")

Similar to greedy search, beam search quickly starts repeating itself, since it is still
a deterministic method.
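
The following pure-Python sketch shows the core of beam search on a made-up table of next-token probabilities (`TOY_PROBS` and `beam_search` are hypothetical names used only for illustration). It assumes sequences are scored by their summed log-probabilities and omits details such as end-of-sequence handling.

```python
import math

# Hypothetical next-token probabilities, indexed by the last token.
TOY_PROBS = {
    0: [0.05, 0.50, 0.40, 0.05],
    1: [0.30, 0.25, 0.25, 0.20],
    2: [0.05, 0.05, 0.05, 0.85],
    3: [0.25, 0.25, 0.25, 0.25],
}


def beam_search(start_token, steps, num_beams):
    """Keep the num_beams highest-scoring sequences at every step."""
    beams = [([start_token], 0.0)]  # (tokens, summed log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in enumerate(TOY_PROBS[tokens[-1]]):
                candidates.append((tokens + [token], score + math.log(prob)))
        candidates.sort(key=lambda candidate: candidate[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][0]


one_beam = beam_search(start_token=0, steps=2, num_beams=1)
two_beams = beam_search(start_token=0, steps=2, num_beams=2)
print("num_beams=1:", one_beam)
print("num_beams=2:", two_beams)
```

With a single beam the search commits to token 1 (probability 0.5) and ends with a sequence of probability 0.15, while two beams keep the slightly less likely token 2 alive and find a sequence of probability 0.34.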

### Random search

Random search is our first probabilistic method. At each time step, it samples the next
token using the softmax probabilities provided by the model.

In [None]:
sampler = keras_nlp.samplers.RandomSampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Random search generated text: \n{txt}\n")

Voilà, no repetitions! However, with random search, we may see some nonsensical words
appearing since any word in the vocabulary has a chance of appearing with this sampling
method. This is fixed by our next search utility, top-k search.
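
Sampling from the softmax distribution can be sketched in a few lines of pure Python. The logits below are made up, and `softmax` and `sample_token` are hypothetical helpers rather than KerasNLP functions.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, rng):
    """Draw one token index according to the softmax probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
toy_logits = [2.0, 1.0, 0.1, -1.0]
samples = [sample_token(toy_logits, rng) for _ in range(1000)]
counts = [samples.count(i) for i in range(len(toy_logits))]
print("probabilities:", [round(p, 3) for p in softmax(toy_logits)])
print("sample counts:", counts)
```

Even the least likely token has a non-zero probability, so it will be drawn from time to time, which is the source of the occasional nonsensical words.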

### Top-K search

Similar to random search, we sample the next token from the probability distribution
provided by the model. The only difference is that here, we select the top `k` most
probable tokens, and redistribute the probability mass over them before sampling. This way,
we won't be sampling from low-probability tokens, and hence we should see fewer
nonsensical words!

In [None]:
sampler = keras_nlp.samplers.TopKSampler(k=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-K search generated text: \n{txt}\n")
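
The filtering step of top-k can be sketched as follows. This is a pure-Python illustration on a made-up distribution; `top_k_filter` is a hypothetical helper, not the `TopKSampler` implementation.

```python
def top_k_filter(probs, k):
    """Zero out everything except the k most probable tokens, then renormalize."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [probs[i] / mass if i in top else 0.0 for i in range(len(probs))]


toy_probs = [0.5, 0.2, 0.15, 0.1, 0.05]
top2 = top_k_filter(toy_probs, k=2)
print([round(p, 3) for p in top2])
```

The next token would then be sampled from the renormalized distribution, so the three least likely tokens can no longer be drawn.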

### Top-P search

Even with top-k search, there is something to improve upon. With top-k search, the
number `k` is fixed, which means it selects the same number of tokens for any probability
distribution. Consider two scenarios, one where the probability mass is concentrated over
2 words and another where the probability mass is spread evenly across 10. Should
we choose `k=2` or `k=10`? There is no one-size-fits-all `k` here.

This is where top-p search comes in! Instead of choosing a `k`, we choose a probability
`p` that we want the probabilities of the top tokens to sum up to. This way, we can
dynamically adjust the `k` based on the probability distribution. With `p=0.9`, if
90% of the probability mass is concentrated on the top 2 tokens, we keep only those
2 tokens to sample from. If instead the 90% is distributed over 10 tokens, we
similarly keep the top 10 tokens to sample from.

In [None]:
sampler = keras_nlp.samplers.TopPSampler(p=0.5)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-P search generated text: \n{txt}\n")
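
The way top-p adapts the number of kept tokens can be seen on two made-up distributions. This pure-Python sketch keeps the smallest set of most probable tokens whose cumulative probability reaches `p`; `top_p_filter` is a hypothetical helper, and the library's exact cut-off rule may differ in edge cases.

```python
def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


peaked = [0.5, 0.45, 0.03, 0.02]  # mass concentrated on 2 tokens
flat = [0.092] * 10 + [0.04] * 2  # mass spread over 10 tokens

kept_peaked = sum(prob > 0 for prob in top_p_filter(peaked, p=0.9))
kept_flat = sum(prob > 0 for prob in top_p_filter(flat, p=0.9))
print("tokens kept (peaked):", kept_peaked)
print("tokens kept (flat):  ", kept_flat)
```

The same `p` yields a small candidate set when the model is confident and a larger one when it is not, which a fixed `k` cannot do.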

### Using callbacks for text generation

We can also wrap the utilities in a callback, which allows you to print out a prediction sequence for every epoch of the model. This is extremely useful to see if the model is improving after each epoch. Here is an example of a callback for top-k search:

In [None]:
class TopKTextGenerator(keras.callbacks.Callback):
    """A callback to generate text from a trained model using top-k."""

    def __init__(self, k):
        super().__init__()
        self.sampler = keras_nlp.samplers.TopKSampler(k)

    def on_epoch_end(self, epoch, logs=None):
        output_tokens = self.sampler(
            next=next,
            prompt=prompt_tokens,
            index=1,
        )
        txt = tokenizer.detokenize(output_tokens)
        print(f"Top-K search generated text: \n{txt}\n")


text_generation_callback = TopKTextGenerator(k=10)
# Dummy training loop to demonstrate callback.
model.fit(
    train_ds.take(1), verbose=2, epochs=2, callbacks=[text_generation_callback]
)

## Acknowledgment
This notebook is based on a [Keras tutorial by Jesse Chan](https://keras.io/examples/generative/text_generation_gpt/#train-the-tokenizer). The transformer decoder layer is based on the research paper by Google, [Attention Is All You Need, Vaswani et al., 2017](https://arxiv.org/abs/1706.03762).

## License

Copyright 2022 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.