<a href="https://colab.research.google.com/github/marcinwolter/AI_Lublin_2023/blob/main/mini_gpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPT text generation from scratch with KerasNLP

**Author:** [Jesse Chan](https://github.com/jessechancy)<br>
**Date created:** 2022/07/25<br>
**Last modified:** 2022/07/25<br>
**Description:** Using KerasNLP to train a mini-GPT model for text generation.

Modified by M. Wolter

## Introduction

In this example, we will use KerasNLP to build a scaled-down Generative
Pre-trained Transformer (GPT) model. GPT is a Transformer-based model that allows you to generate
sophisticated text from a prompt.

We will train the model either on the English SimpleBooks dataset or on part of the [Polish Wikipedia](https://pl.wikipedia.org/) corpus,
which allows our GPT model to generate text in Polish.

This example combines concepts from
[Text generation with a miniature GPT](https://keras.io/examples/generative/text_generation_with_miniature_gpt/)
with KerasNLP package abstractions. We will demonstrate how KerasNLP tokenization, layers and
metrics simplify the training
process, and then show how to generate output text using the KerasNLP sampling utilities.

Note: If you are running this example on Colab,
make sure to enable the GPU runtime for faster training.

This example requires KerasNLP. You can install it via the following command:
`pip install keras-nlp`

## Setup

In [1]:
! pip install keras-nlp

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import os
import keras_nlp
import tensorflow as tf
from tensorflow import keras

## Settings & hyperparameters

In [3]:
# Data
BATCH_SIZE = 64
SEQ_LEN = 128
MIN_TRAINING_SEQ_LEN = 450

# Model
EMBED_DIM = 256
FEED_FORWARD_DIM = 256
NUM_HEADS = 3
NUM_LAYERS = 2
VOCAB_SIZE = 5000  # Limits parameters in model.

# Training
EPOCHS = 6

# Inference
NUM_TOKENS_TO_GENERATE = 80

## Choose the language

GPT can be trained using:
* "English" - SimpleBooks dataset
* "Polish"  - part of Polish Wikipedia

In [4]:

lang = "English"   # "Polish"

### wiki-dump-reader lets us load a Wikipedia dump and convert its articles into plain text.

In [5]:
!pip install wiki-dump-reader

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [6]:
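from wiki_dump_reader import Cleaner, iterate

# Convert the Wikipedia XML dump to plain text. This is only needed for Polish,
# and the decompressed dump file must already be present in the working
# directory (see the wget/bzip2 commands in the next cell).
if lang == "Polish":
    cleaner = Cleaner()
    with open("plwiki.txt", "w", encoding="utf-8") as text_data:
        for title, text in iterate("plwiki-20220820-pages-articles1.xml-p1p187037"):
            text = cleaner.clean_text(text)
            cleaned_text, links = cleaner.build_links(text)
            text_data.write(cleaned_text)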

### Load Wikipedia from the plwiki.txt file and filter out short lines.
Now, let's download the dataset! Polish Wikipedia is huge, so we load only one dump file of articles.


In [8]:
if lang=="Polish":

  ! wget https://dumps.wikimedia.your.org/plwiki/20220820/plwiki-20220820-pages-articles1.xml-p1p187037.bz2 
  ! bzip2 -d plwiki-20220820-pages-articles1.xml-p1p187037.bz2

  # Load the Wikipedia training set: skip the first 1,500 lines (used for
  # validation), take up to 100,000 lines, and filter out short lines.
  raw_train_ds = (
    tf.data.TextLineDataset("plwiki.txt").skip(1500).take(100000)
    .filter(lambda x: tf.strings.length(x) > MIN_TRAINING_SEQ_LEN)
    .batch(BATCH_SIZE)
    .shuffle(buffer_size=256)
  )

  # Load the Wikipedia validation set (the first 1,500 lines) and filter out short lines.
  raw_val_ds = (
    tf.data.TextLineDataset("plwiki.txt").take(1500)
    .filter(lambda x: tf.strings.length(x) > MIN_TRAINING_SEQ_LEN)
    .batch(BATCH_SIZE)
  )

## Load the data - English SimpleBooks

Now, let's download the dataset! The SimpleBooks dataset consists of 1,573 Gutenberg books and has
one of the smallest ratios of vocabulary size to word-level tokens. It has a vocabulary size of ~98k,
a third of WikiText-103's, with around the same number of tokens (~100M). This makes it easy to fit a small model.

In [9]:
if lang=="English":

  keras.utils.get_file(
    origin="https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip",
    extract=True,
  )
  # Named `data_dir` to avoid shadowing the built-in `dir`.
  data_dir = os.path.expanduser("~/.keras/datasets/simplebooks/")

  # Load simplebooks-92 train set and filter out short lines.
  raw_train_ds = (
    tf.data.TextLineDataset(data_dir + "simplebooks-92-raw/train.txt")
    .filter(lambda x: tf.strings.length(x) > MIN_TRAINING_SEQ_LEN)
    .batch(BATCH_SIZE)
    .shuffle(buffer_size=256)
  )

  # Load simplebooks-92 validation set and filter out short lines.
  raw_val_ds = (
    tf.data.TextLineDataset(data_dir + "simplebooks-92-raw/valid.txt")
    .filter(lambda x: tf.strings.length(x) > MIN_TRAINING_SEQ_LEN)
    .batch(BATCH_SIZE)
  )

Downloading data from https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip


## Train the tokenizer

We train the tokenizer from the training dataset for a vocabulary size of `VOCAB_SIZE`,
which is a tuned hyperparameter. We want to limit the vocabulary as much as possible, since,
as we will see later on,
it has a large effect on the number of model parameters. We also don't want to include
*too few* vocabulary terms, or there would be too many out-of-vocabulary (OOV) sub-words. In
addition, three tokens are reserved in the vocabulary:

- `"[PAD]"` for padding sequences to `SEQ_LEN`. This token has index 0 in both
`reserved_tokens` and `vocab`, since `WordPieceTokenizer` (and other layers) consider
`0`/`vocab[0]` as the default padding.
- `"[UNK]"` for OOV sub-words, which should match the default `oov_token="[UNK]"` in
`WordPieceTokenizer`.
- `"[BOS]"` stands for beginning of sentence, but here technically it is a token
representing the beginning of each line of training data.
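To illustrate how a limited vocabulary and the reserved tokens interact, here is a minimal pure-Python sketch of greedy longest-match sub-word splitting in the WordPiece style. It is not the KerasNLP implementation: the toy vocabulary and the helper names `wordpiece_word` and `encode` are hypothetical, and only the three reserved tokens mirror the ones listed above.

```python
# Toy vocabulary (hypothetical). "##" marks a piece that continues a word.
toy_vocab = ["[PAD]", "[UNK]", "[BOS]", "the", "king", "play", "##ing", "##ed", "##s"]
token_to_id = {token: i for i, token in enumerate(toy_vocab)}


def wordpiece_word(word):
    """Split one word into sub-words by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in token_to_id:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # No piece fits: the whole word is out of vocabulary.
        pieces.append(match)
        start = end
    return pieces


def encode(text, seq_len):
    """Lowercase, split on whitespace, tokenize, then pad with id 0 ([PAD])."""
    ids = []
    for word in text.lower().split():
        ids.extend(token_to_id[piece] for piece in wordpiece_word(word))
    ids = ids[:seq_len]
    return ids + [token_to_id["[PAD]"]] * (seq_len - len(ids))


print(wordpiece_word("playing"))            # ['play', '##ing']
print(encode("The kings played chess", 8))  # [3, 4, 8, 5, 7, 1, 0, 0]
```

The word "chess" has no matching pieces in the toy vocabulary, so it maps to `[UNK]` (id 1). A larger vocabulary reduces such cases at the cost of more parameters.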

## **Warning: the training takes some time!!!**

In [10]:
# Train tokenizer vocabulary
vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
    raw_train_ds,
    vocabulary_size=VOCAB_SIZE,
    lowercase=True,
    reserved_tokens=["[PAD]", "[UNK]", "[BOS]"],
)

## Load tokenizer

We use the vocabulary data to initialize
`keras_nlp.tokenizers.WordPieceTokenizer`. WordPieceTokenizer is an efficient
implementation of the WordPiece algorithm used by BERT and other models. It will strip,
lower-case and do other irreversible preprocessing operations.

In [11]:
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    sequence_length=SEQ_LEN,
    lowercase=True,
)

## Tokenize data

We preprocess the dataset by tokenizing and splitting it into `features` and `labels`.

In [12]:
# packer adds a start token
start_packer = keras_nlp.layers.StartEndPacker(
    sequence_length=SEQ_LEN,
    start_value=tokenizer.token_to_id("[BOS]"),
)


def preprocess(inputs):
    outputs = tokenizer(inputs)
    features = start_packer(outputs)
    labels = outputs
    return features, labels


# Tokenize and split into train and label sequences.
train_ds = raw_train_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).prefetch(
    tf.data.AUTOTUNE
)
val_ds = raw_val_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).prefetch(
    tf.data.AUTOTUNE
)
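To make the relationship between `features` and `labels` concrete, the following sketch mimics `preprocess` on plain Python lists. The token ids and the shortened `SEQ_LEN` are made up for illustration, and `make_example` is a hypothetical helper; the `[PAD]` and `[BOS]` ids are assumed to follow the order of `reserved_tokens`.

```python
PAD_ID, BOS_ID = 0, 2  # Assumed ids, following the reserved_tokens order.
SEQ_LEN = 6            # Shortened here for readability.


def make_example(token_ids):
    """Return (features, labels) in the same spirit as `preprocess`."""
    # The tokenizer pads or truncates every line to SEQ_LEN.
    padded = (token_ids + [PAD_ID] * SEQ_LEN)[:SEQ_LEN]
    # The start packer prepends [BOS] and truncates back to SEQ_LEN.
    features = ([BOS_ID] + padded)[:SEQ_LEN]
    labels = padded
    return features, labels


features, labels = make_example([11, 12, 13, 14])
print(features)  # [2, 11, 12, 13, 14, 0]
print(labels)    # [11, 12, 13, 14, 0, 0]
```

At every position, the label is the token that follows the corresponding feature, so the model learns to predict the next token.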

## Build the model

We create our scaled-down GPT model with the following layers:

- One `keras_nlp.layers.TokenAndPositionEmbedding` layer, which combines the embedding
for the token and its position.
- Multiple `keras_nlp.layers.TransformerDecoder` layers, with the default causal masking.
The layer has no cross-attention when run with decoder sequence only.
- One final dense linear layer.
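As a rough illustration of what causal masking means, the sketch below builds the lower-triangular pattern that lets each position attend only to itself and to earlier positions. `causal_mask` is a hypothetical helper written for this explanation; the decoder layer builds its own mask internally.

```python
def causal_mask(seq_len):
    """Row i marks which positions j the token at position i may attend to."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```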

In [13]:
inputs = keras.layers.Input(shape=(None,), dtype=tf.int32)
# Embedding.
embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)
x = embedding_layer(inputs)
# Transformer decoders.
for _ in range(NUM_LAYERS):
    decoder_layer = keras_nlp.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
    )
    x = decoder_layer(x)  # Giving one argument only skips cross-attention.
# Output.
outputs = keras.layers.Dense(VOCAB_SIZE)(x)
model = keras.Model(inputs=inputs, outputs=outputs)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)
model.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])

Let's take a look at our model summary: a large majority of the
parameters are in the `token_and_position_embedding` and the output `dense` layer!
This means that the vocabulary size (`VOCAB_SIZE`) has a large effect on the size of the model,
while the number of Transformer decoder layers (`NUM_LAYERS`) doesn't affect it as much.

In [14]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 token_and_position_embeddin  (None, None, 256)        1312768   
 g (TokenAndPositionEmbeddin                                     
 g)                                                              
                                                                 
 transformer_decoder (Transf  (None, None, 256)        394749    
 ormerDecoder)                                                   
                                                                 
 transformer_decoder_1 (Tran  (None, None, 256)        394749    
 sformerDecoder)                                                 
                                                                 
 dense (Dense)               (None, None, 5000)        1285000   
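The parameter counts can be checked by hand from the hyperparameters. The breakdown below is a sketch: the per-head size of `EMBED_DIM // NUM_HEADS` and the split of the decoder into attention, layer normalization and feed-forward parts are assumptions, chosen because they reproduce the decoder count reported in the summary.

```python
VOCAB_SIZE, SEQ_LEN = 5000, 128
EMBED_DIM, FEED_FORWARD_DIM, NUM_HEADS = 256, 256, 3

# Token embedding table plus position embedding table.
embedding = VOCAB_SIZE * EMBED_DIM + SEQ_LEN * EMBED_DIM

# One decoder layer (assumed breakdown, see the note above).
head_dim = EMBED_DIM // NUM_HEADS  # 85
qkv = 3 * (EMBED_DIM * NUM_HEADS * head_dim + NUM_HEADS * head_dim)
attention_output = NUM_HEADS * head_dim * EMBED_DIM + EMBED_DIM
layer_norms = 2 * (2 * EMBED_DIM)
feed_forward = (EMBED_DIM * FEED_FORWARD_DIM + FEED_FORWARD_DIM) + (
    FEED_FORWARD_DIM * EMBED_DIM + EMBED_DIM
)
decoder = qkv + attention_output + layer_norms + feed_forward

# Output projection to vocabulary logits (weights plus biases).
dense = EMBED_DIM * VOCAB_SIZE + VOCAB_SIZE

print(embedding)  # 1312768
print(decoder)    # 394749
print(dense)      # 1285000

total = embedding + 2 * decoder + dense
vocab_dependent = VOCAB_SIZE * EMBED_DIM + dense
print(f"{vocab_dependent / total:.0%} of parameters scale with VOCAB_SIZE")
```

With these settings roughly three quarters of the parameters depend directly on the vocabulary size.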

## Training

Now that we have our model, let's train it with the `fit()` method.

In [15]:
model.fit(train_ds, validation_data=val_ds, verbose=2, epochs=EPOCHS)

Epoch 1/6
3169/3169 - 273s - loss: 4.5675 - perplexity: 96.6795 - val_loss: 4.1558 - val_perplexity: 64.4383 - 273s/epoch - 86ms/step
Epoch 2/6
3169/3169 - 120s - loss: 4.0591 - perplexity: 58.1489 - val_loss: 3.9813 - val_perplexity: 53.9634 - 120s/epoch - 38ms/step
Epoch 3/6
3169/3169 - 119s - loss: 3.9445 - perplexity: 51.8465 - val_loss: 3.9290 - val_perplexity: 51.3033 - 119s/epoch - 38ms/step
Epoch 4/6
3169/3169 - 118s - loss: 3.8827 - perplexity: 48.7388 - val_loss: 3.8987 - val_perplexity: 49.8520 - 118s/epoch - 37ms/step
Epoch 5/6
3169/3169 - 120s - loss: 3.8393 - perplexity: 46.6682 - val_loss: 3.8494 - val_perplexity: 47.3274 - 120s/epoch - 38ms/step
Epoch 6/6
3169/3169 - 120s - loss: 3.8084 - perplexity: 45.2459 - val_loss: 3.8537 - val_perplexity: 47.5800 - 120s/epoch - 38ms/step


<keras.callbacks.History at 0x7f2a8426a230>
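The perplexity metric logged above is closely related to the cross-entropy loss: it is the exponential of the average negative log-probability assigned to the correct tokens. The sketch below shows that relationship on plain Python floats; `perplexity` is a hypothetical helper, not the KerasNLP metric. The logged values are close to, but not exactly, `exp(loss)`, presumably because the metric and the loss are aggregated differently.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability of the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that guesses uniformly over 5000 tokens has perplexity ~5000.
print(perplexity([1 / 5000] * 10))

# A model that is always certain and correct has perplexity 1.
print(perplexity([1.0, 1.0, 1.0]))  # 1.0
```

A validation perplexity in the high 40s therefore means the model is, on average, about as uncertain as if it were choosing uniformly among fewer than 50 tokens out of 5000.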

## Inference

With our trained model, we can test it out to gauge its performance. To do this
we can seed our model with an input sequence starting with the `"[BOS]"` token,
and progressively sample the model by making predictions for each subsequent
token in a loop.
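The loop described here can be sketched without any model at all. In the example below, `toy_next_logits` is a hypothetical stand-in for the trained model that simply favours "previous token + 1", and `greedy_generate` fills the sequence one position at a time by taking the most likely token. The ids for `[BOS]` and `[PAD]` are assumed to follow the reserved-token order.

```python
BOS_ID, PAD_ID = 2, 0  # Assumed ids, following the reserved-token order.
TOY_VOCAB_SIZE = 6


def toy_next_logits(tokens, index):
    """Hypothetical stand-in for the model: favour (previous token + 1)."""
    previous = tokens[index - 1]
    favourite = (previous + 1) % TOY_VOCAB_SIZE
    return [5.0 if i == favourite else 0.0 for i in range(TOY_VOCAB_SIZE)]


def greedy_generate(seq_len):
    """Start from [BOS] and fill each later position with the argmax token."""
    tokens = [BOS_ID] + [PAD_ID] * (seq_len - 1)
    for index in range(1, seq_len):
        logits = toy_next_logits(tokens, index)
        tokens[index] = max(range(len(logits)), key=logits.__getitem__)
    return tokens


print(greedy_generate(6))  # [2, 3, 4, 5, 0, 1]
```

Each step only looks at tokens that have already been produced, which is the same left-to-right dependency the causal mask enforces during training.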



We will use the `keras_nlp.samplers` module for inference, which requires a
callback function wrapping the model we just trained. This wrapper calls
the model and returns the logit predictions for the current token we are
generating.

Note: There are two pieces of more advanced functionality available when
defining your callback. The first is the ability to take in a `cache` of states
computed in previous generation steps, which can be used to speed up generation.
The second is the ability to output the final dense "hidden state" of each
generated token. This is used by `keras_nlp.samplers.ContrastiveSampler`, which
avoids repetition by penalizing repeated hidden states. Both are optional, and
we will ignore them for now.

In [16]:

def next(prompt, cache, index):
    logits = model(prompt)[:, index - 1, :]
    # Ignore hidden states for now; only needed for contrastive search.
    hidden_states = None
    return logits, hidden_states, cache


Creating the wrapper function is the most complex part of using these functions. Now that
it's done, let's test out the sampling utilities, starting with top-k search.

In [23]:
def toPolish(txt):
  # `txt` is the string form of the detokenized tensor, for example
  # "tf.Tensor([b'[BOS] ...'], shape=(1,), dtype=string)".
  # Strip the tensor wrapper and the leading [BOS] token.
  txt = txt.replace("tf.Tensor([b'[BOS] ", "")
  txt = txt.replace("'], shape=(1,), dtype=string)", "")

  # Map escaped UTF-8 byte sequences back to the characters they encode.
  escaped_to_char = {
    '\\xe2\\x80\\x93': '-',  # En dash, replaced by a plain hyphen.
    '\\xc5\\x82': 'ł',
    '\\xc4\\x99': 'ę',
    '\\xc4\\x85': 'ą',
    '\\xc5\\x9b': 'ś',
    '\\xc3\\xb3': 'ó',
    '\\xc5\\xbc': 'ż',
    '\\xc5\\xba': 'ź',  # Missing from the original mapping.
    '\\xc4\\x87': 'ć',
    '\\xc5\\x84': 'ń',
  }
  for escaped, char in escaped_to_char.items():
    txt = txt.replace(escaped, char)

  # Drop escaped apostrophes and double quotes.
  txt = txt.replace("\\'", '')
  txt = txt.replace('"', '')
  return txt
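The string replacements above operate on the printed form of the tensor. A possibly simpler route, sketched here under the assumption that the detokenized result can be converted to raw `bytes` (for instance by calling `.numpy()` on the string tensor, which the original notebook does not do), is to decode the UTF-8 bytes directly. `decode_generated` is a hypothetical helper name.

```python
def decode_generated(raw_bytes):
    """Decode UTF-8 bytes and drop a leading "[BOS] " marker if present."""
    text = raw_bytes.decode("utf-8")
    prefix = "[BOS] "
    if text.startswith(prefix):
        text = text[len(prefix):]
    return text


# Example bytes, as they might come out of a detokenized string tensor.
raw = "[BOS] zażółć gęślą jaźń".encode("utf-8")
print(decode_generated(raw))  # zażółć gęślą jaźń
```

Decoding the bytes covers every character at once, so no per-letter mapping needs to be maintained.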

### Top-K search

We sample the next token from the probability distribution
provided by the model. We select the top `k` most
probable tokens and redistribute the probability mass over them before sampling. This way,
we never sample from low-probability tokens, and hence produce fewer
nonsensical words!
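The idea can be sketched in a few lines of standard-library Python. This is an illustration of top-k sampling in general, not the `TopKSampler` implementation; `top_k_sample` and the example logits are hypothetical.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample one index from the k highest logits, renormalized by softmax."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits only (shifted for stability).
    max_logit = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - max_logit) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 0.5, -1.0, 3.0, 0.0, 1.5]
rng = random.Random(0)
samples = [top_k_sample(logits, k=3, rng=rng) for _ in range(1000)]

# Only the three most likely indices (0, 3 and 5) can ever be drawn.
print(set(samples) <= {0, 3, 5})  # True
```

Setting `k=1` reduces this to greedy search, while a larger `k` admits more variety at the risk of less coherent text.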

In [24]:

sampler = keras_nlp.samplers.TopKSampler(k=4)  # Alternative: GreedySampler()

while True:
  print("Insert prompt (q breaks):")
  prompt = input()
  if prompt == 'q':
    break

  prompt_tokens = start_packer(tokenizer([prompt]))

  # Note: no `mask` argument is passed here, so sampling from index=1 appears
  # to overwrite the prompt tokens; the prompt is prepended to the printed text.
  output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,  # Start sampling immediately after the [BOS] token.
  )
  txt = tokenizer.detokenize(output_tokens)

  txt_str = toPolish(str(txt))
  print(f"Generated text: \n{prompt} {txt_str}\n")


Insert prompt (q breaks):

Generated text: 
 the king of the king , with his king , was king of the court who was , as he was , and was to argue with the king , who was to be the king , and king arthur and king arthur , and sir tristram was very pleased indeed , for sir launcelot of the court , and sir percival was very great and very great and intent to the king of the court , and sir tristram perceived that sir launcelot had very great joy and beheld that he had beheld that sir launcelot of the lake wherefore he beheld that he was nigh unto sir tristram : so sir tristram sir tristram was :  sir

Insert prompt (q breaks):
my excursion to London was frustrating
Generated text: 
my excursion to London was frustrating  you know , i  m not ,  said the man .  he is a very good fellow in the world to be very well , but i think that i have never been able to see him . he is a very fine fellow , for he is not a great deal better than he has a good deal of money . he is so bad , and he is a li

As you can see, this short training run is not enough to generate sensible text, but the micro GPT does not produce complete nonsense either.

## Conclusion

To recap, in this example, we use KerasNLP layers to train a sub-word vocabulary,
tokenize training data, create a miniature GPT model, and perform inference with the
text generation library.

If you would like to understand how Transformers work, or learn more about training the
full GPT model, here are some further readings:

- Attention Is All You Need [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)
- GPT-3 Paper [Brown et al., 2020](https://arxiv.org/abs/2005.14165)