<a href="https://colab.research.google.com/github/marcinwolter/AI_Lublin_2023/blob/main/gpt2_text_generation_with_kerasnlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pretrained GPT2 Text Generation with KerasNLP

**Author:** Chen Qian<br>
**Date created:** 04/17/2023<br>
**Last modified:** 04/17/2023<br>
**Description:** Use KerasNLP GPT2 model

Modified by M. Wolter

In this tutorial, you will learn to use [KerasNLP](https://keras.io/keras_nlp/) to load a
**pre-trained Large Language Model (LLM) - [GPT-2 model](https://openai.com/research/better-language-models)**
(originally developed by OpenAI), finetune it to a specific text style, and
generate text based on a user's input (also known as a prompt).

## Before we begin

Colab offers different kinds of runtimes. Go to **Runtime ->
Change runtime type** and choose the GPU Hardware Accelerator runtime
(which should have more than 12 GB of host RAM and about 15 GB of GPU RAM), since you will finetune the
GPT-2 model. Running this tutorial on a CPU runtime will take hours.

## Install KerasNLP and Import Dependencies

In [1]:
!pip install -q keras-nlp

In [2]:
import keras_nlp
import tensorflow as tf
from tensorflow import keras
import time

## Introduction to Generative Large Language Models (LLMs)

Large language models (LLMs) are a type of machine learning model
trained on a large corpus of text data to generate outputs for various natural
language processing (NLP) tasks, such as text generation, question answering,
and machine translation.

Generative LLMs are typically based on deep learning neural networks, such as
the [Transformer architecture](https://arxiv.org/abs/1706.03762) invented by
Google researchers in 2017, and are trained on massive amounts of text data,
often involving billions of words. These models, such as Google [LaMDA](https://blog.google/technology/ai/lamda/)
and [PaLM](https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html),
are trained with a large dataset from various data sources which allows them to
generate output for many tasks. The core of generative LLMs is predicting the
next word in a sentence, often referred to as **Causal LM Pretraining**. In this
way, LLMs can generate coherent text based on user prompts. For a more
pedagogical discussion on language models, you can refer to the
[Stanford CS324 LLM class](https://stanford-cs324.github.io/winter2022/lectures/introduction/).
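To make "predicting the next word" concrete, here is a minimal sketch that uses a toy bigram counter in place of a neural network. The corpus and the helper names `train_bigram_model` and `generate` are made up for illustration and are not part of KerasNLP. The generation loop has the same shape as in a real causal LM: look at what has been produced so far, pick a next token, append it, and repeat.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count, for every token, which tokens follow it in the training text."""
    follow_counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        follow_counts[current][following] += 1
    return follow_counts


def generate(follow_counts, prompt, max_length):
    """Greedy decoding: repeatedly append the most frequent next token."""
    output = list(prompt)
    while len(output) < max_length:
        candidates = follow_counts.get(output[-1])
        if not candidates:  # No known continuation: stop early.
            break
        next_token, _ = candidates.most_common(1)[0]
        output.append(next_token)
    return output


# Toy corpus (an assumption for this sketch, not real training data).
corpus = "the board may meet . the board may meet . the board shall vote .".split()
model = train_bigram_model(corpus)
print(" ".join(generate(model, ["the"], max_length=6)))
# the board may meet . the
```

A real LLM replaces the frequency table with a Transformer that scores every token in the vocabulary given the whole preceding context, not just the last token.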

## Introduction to KerasNLP

Large Language Models are complex to build and expensive to train from scratch.
Luckily there are pretrained LLMs available for use right away. [KerasNLP](https://keras.io/keras_nlp/)
provides a large number of pre-trained checkpoints that allow you to experiment
with SOTA models without needing to train them yourself.

KerasNLP is a natural language processing library that supports users through
their entire development cycle. KerasNLP offers both pretrained models and
modularized building blocks, so developers can easily reuse pretrained models
or build their own LLM.

In a nutshell, for generative LLMs, KerasNLP offers:

- Pretrained models with a `generate()` method, e.g.,
    `keras_nlp.models.GPT2CausalLM` and `keras_nlp.models.OPTCausalLM`.
- Sampler classes that implement generation algorithms such as top-k, beam, and
    contrastive search. These samplers can be used to generate text with
    custom models.
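As an illustration of what a sampler does, the following is a minimal pure-Python sketch of top-k sampling. It is not KerasNLP's implementation; the function name `top_k_sample` and the toy logits are assumptions made for this example. The idea is to keep only the `k` highest-scoring tokens, apply a softmax to them, and draw the next token from that reduced distribution.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample a token index from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top_indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits only (subtract the max for stability).
    max_logit = max(logits[i] for i in top_indices)
    weights = [math.exp(logits[i] - max_logit) for i in top_indices]
    # Draw one index in proportion to its weight.
    return rng.choices(top_indices, weights=weights, k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 0.5, 1.5, -1.0, 0.1]  # Hypothetical scores for a 5-token vocabulary.
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # Only indices 0 and 2 can ever be drawn with k=2.
```

With `k=1` this reduces to greedy decoding; larger values of `k` trade determinism for variety.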

## Load a pre-trained GPT-2 model and generate some text

KerasNLP provides a number of pre-trained models, such as [Google
BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)
and [GPT-2](https://openai.com/research/better-language-models). You can see
the list of available models at https://keras.io/api/keras_nlp/models/.

Loading the GPT-2 model takes only a few lines, as you can see below:

In [3]:
# To speed up training and generation, we use a preprocessor with a sequence
# length of 128 instead of the full length of 1024.
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=128,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=preprocessor
)

Downloading data from https://storage.googleapis.com/keras-nlp/models/gpt2_base_en/v1/vocab.json
Downloading data from https://storage.googleapis.com/keras-nlp/models/gpt2_base_en/v1/merges.txt
Downloading data from https://storage.googleapis.com/keras-nlp/models/gpt2_base_en/v1/model.h5




Once the model is loaded, you can use it to generate some text right away. Run
the cell below to give it a try. It's as simple as calling a single function,
*generate()*:

In [4]:

while True:
  print("Enter your prompt (q to break):")
  prompt = input()
  if prompt=="q":
    break
  start = time.time()
  output = gpt2_lm.generate(prompt, max_length=200)
  end = time.time()
  print("\nGPT-2 output:")
  print(output)



Enter your prompt (q to break):
The Board may complete the organization of the Foundation

GPT-2 output:
The Board may complete the organization of the Foundation, which shall consist of three trustees, and shall consist of the following:

(1) A board member who is not a trustee of the Foundation.

(2) A board member who holds a fiduciary position under the Foundation's governance program, including an advisory council, which may be used to assist the Board in the management of the Foundation.

(3) A board member who is the trustee or trustee's designated representative in the Foundation.

(4) A board member whose duties include the Board's responsibility for the Foundation's finances.

(5) An independent, non-partisan organization, as defined in Section 3 of the Act of September 26, 1947 (42 U.S.C. 1851 et seq.), that is composed solely of members of the Board and the Board and that meets the requirements of Sections 2(b)(4)(A) of the Act of September 26, 1947
Enter your prompt (q to 

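The cell records `start` and `end` around each call. If you print `end - start`
and enter more than one prompt, you will see that the second call is much faster
than the first. This is because the computational graph is
[XLA compiled](https://www.tensorflow.org/xla) in the first run and
reused in the second behind the scenes.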
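The effect of a one-time compilation cost can be measured with a small timing helper. The sketch below is only an illustration: `timed_call` and the `SlowFirstCall` class are hypothetical names, and the class merely simulates a model whose first call pays a setup cost, so that the example runs without a GPU.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


class SlowFirstCall:
    """Hypothetical stand-in for a model that compiles on its first call."""

    def __init__(self):
        self._compiled = False

    def generate(self, prompt):
        if not self._compiled:
            time.sleep(0.2)  # Simulated one-time compilation.
            self._compiled = True
        return prompt + " ..."


model = SlowFirstCall()
_, first = timed_call(model.generate, "Hello")
_, second = timed_call(model.generate, "Hello")
print(f"first call: {first:.3f}s, second call: {second:.3f}s")
```

In the notebook, the same helper could wrap the real model, for example `timed_call(gpt2_lm.generate, prompt, max_length=200)`.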

**The quality of the generated text looks OK, but we can adapt it to our needs via finetuning.**

## More on the GPT-2 model from KerasNLP

Next up, we will actually fine-tune the model to update its parameters, but
before we do, let's take a look at the full set of tools we have for working
with GPT2.

## More details on the GPT-2 model
The code of GPT2 can be found
[here](https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/gpt2/).
Conceptually, `GPT2CausalLM` can be broken down hierarchically into several
modules in KerasNLP, all of which have a *from_preset()* function that loads a
pretrained model:

- `keras_nlp.models.GPT2Tokenizer`: the tokenizer used by the GPT2 model, which is a
    [byte-pair encoder](https://huggingface.co/course/chapter6/5?fw=pt).
- `keras_nlp.models.GPT2CausalLMPreprocessor`: the preprocessor used for GPT2
    causal LM training. It performs tokenization along with other preprocessing
    steps, such as creating the labels and appending the end token.
- `keras_nlp.models.GPT2Backbone`: the GPT2 model, which is a stack of
    `keras_nlp.layers.TransformerDecoder` layers. This is usually just referred to as
    `GPT2`.
- `keras_nlp.models.GPT2CausalLM`: wraps `GPT2Backbone` and multiplies the
    output of `GPT2Backbone` by the embedding matrix to generate logits over
    vocab tokens.
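The label creation mentioned for the preprocessor can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the actual `GPT2CausalLMPreprocessor` code: the function name `make_causal_lm_example` and the token IDs are hypothetical. The key idea is that the labels are the input sequence shifted left by one position, and padded positions get a weight of zero so they do not contribute to the loss.

```python
def make_causal_lm_example(token_ids, sequence_length, end_id, pad_id):
    """Build (inputs, labels, weights) for next-token prediction."""
    # Append the end token, then truncate to sequence_length + 1 positions.
    ids = (token_ids + [end_id])[: sequence_length + 1]
    mask = [1] * len(ids)
    # Pad up to sequence_length + 1 and mark the padded positions in the mask.
    padding = sequence_length + 1 - len(ids)
    ids += [pad_id] * padding
    mask += [0] * padding
    # Inputs drop the last position; labels are the sequence shifted left by one.
    inputs, labels = ids[:-1], ids[1:]
    # A weight of zero means the label at that position is padding.
    weights = mask[1:]
    return inputs, labels, weights


inputs, labels, weights = make_causal_lm_example(
    [11, 12, 13], sequence_length=5, end_id=99, pad_id=0
)
print(inputs)   # [11, 12, 13, 99, 0]
print(labels)   # [12, 13, 99, 0, 0]
print(weights)  # [1, 1, 1, 0, 0]
```

At every position the model sees the tokens up to that point in `inputs` and is trained to predict the token at the same position in `labels`.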

## Finetune on Billsum dataset

Now that you know the GPT-2 model from KerasNLP, you can take one
step further and finetune the model so that it generates text in a specific
style. In this tutorial, we will use the billsum
dataset as an example.

BillSum is a dataset for summarization of US Congressional and California state bills, so it is written in very formal language. After training, we can ask GPT-2 about the Constitution, etc.

In [5]:
import tensorflow_datasets as tfds

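# Load the billsum training split as pairs of tensors.
# Note: despite its name, `reddit_ds` holds billsum data; later cells rely on this name.
reddit_ds = tfds.load('billsum', split="train", as_supervised=True)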

Downloading and preparing dataset 64.14 MiB (download: 64.14 MiB, generated: Unknown size, total: 64.14 MiB) to /root/tensorflow_datasets/billsum/3.0.0...


Generating splits: train (18949 examples), test (3269 examples), ca_test (1237 examples).

Dataset billsum downloaded and prepared to /root/tensorflow_datasets/billsum/3.0.0. Subsequent calls will reuse this data.


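Let's take a look inside sample data from the billsum TensorFlow Dataset. With
`as_supervised=True`, each example is a pair of two features:

- **__document__**: the text of the bill.
- **__title__**: the accompanying target text. TFDS documents the supervised keys of billsum as `('text', 'summary')`, so this is the bill's summary rather than its title.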

In [6]:
for document, title in reddit_ds:
    print(document.numpy())
    print(" ")
    print(title.numpy())
    break

b"SECTION 1. SHORT TITLE.\n\n    This Act may be cited as the ``Bureau of Land Management Foundation \nAct''.\n\nSEC. 2. DEFINITIONS.\n\n    In this Act:\n            (1) Board.--The term ``Board'' means the Board of Directors \n        of the Foundation.\n            (2) BLM.--The term ``BLM'' means the Bureau of Land \n        Management.\n            (3) Chairman.--The term ``Chairman'' means the Chairman of \n        the Board.\n            (4) Director.--The term ``Director'' means an individual \n        member of the Board.\n            (5) Foundation.--The term ``Foundation'' means the Bureau \n        of Land Management Foundation established by this Act.\n            (6) Secretary.--The term ``Secretary'' means the Secretary \n        of the Interior.\n            (7) National conservation lands.--The term ``National \n        Conservation Lands'' means the system of lands established by \n        section 2002 of the Omnibus Public Lands Management Act of 2009 \n        (16 U

In our case, we are performing next-word prediction with a language model, so we
only need the `document` feature.

In [7]:
train_ds = (
    reddit_ds.map(lambda document, _: document)
    .batch(32)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

Now you can finetune the model using the familiar *fit()* function. Note that
`preprocessor` will be automatically called inside the `fit` method since
`GPT2CausalLM` is a `keras_nlp.models.Task` instance.

Training the model to convergence would take quite a bit of GPU memory and a long time.
Here we use only part of the dataset and one training epoch
for demo purposes.
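To get a feeling for how much of the data this covers, the arithmetic can be worked out directly. The sketch below assumes the numbers that appear in this notebook: 18949 training examples (from the TFDS output), a batch size of 32, and 500 batches kept with `take(500)`.

```python
import math

num_train_examples = 18949  # Size of the billsum train split reported by TFDS.
batch_size = 32             # As in .batch(32).
batches_kept = 500          # As in train_ds.take(500).

total_batches = math.ceil(num_train_examples / batch_size)
examples_seen = min(batches_kept, total_batches) * batch_size

print(total_batches)                                # 593
print(examples_seen)                                # 16000
print(f"{examples_seen / num_train_examples:.0%}")  # 84%
```

So one epoch over 500 batches visits roughly 84% of the training split, and each optimizer step processes 32 documents truncated to the preprocessor's sequence length.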

In [8]:
train_ds = train_ds.take(500)
num_epochs = 1

# Linearly decaying learning rate.
learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs)



<keras.callbacks.History at 0x7f4e3c7ed570>
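The linearly decaying learning rate used above can be written out by hand. The sketch below assumes the default settings of `PolynomialDecay` (power of 1 and no cycling) and 500 decay steps, matching 500 batches for one epoch. The function name `linear_decay` is made up for this illustration.

```python
def linear_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=500):
    """Learning rate after `step` optimizer updates, decaying linearly to end_lr."""
    step = min(step, decay_steps)  # Hold at end_lr once decay_steps is reached.
    fraction_left = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * fraction_left + end_lr


for step in (0, 250, 500):
    print(step, linear_decay(step))
# 0 5e-05
# 250 2.5e-05
# 500 0.0
```

The learning rate starts at 5e-5, is halved by the midpoint of training, and reaches zero at the last step, which lets the final updates make only small changes to the pretrained weights.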

After finetuning is finished, you can again generate text using the same
*generate()* function. This time, the text will be closer to the legal language
style.

In [9]:

while True:
  print("Enter your prompt (q to break):")
  prompt = input()
  if prompt=="q":
    break
  start = time.time()
  output = gpt2_lm.generate(prompt, max_length=400)
  end = time.time()
  print("Time used: ",end-start)
  print("\nGPT-2 output:")
  print(output)



Enter your prompt (q to break):
The Board may complete the organization of the Foundation
Time used:  21.337793827056885

GPT-2 output:
The Board may complete the organization of the Foundation by a resolution of the Congress of the United States, and by a vote of the members.

SEC. 2. FINDINGS.

    The Congress finds the following:
           (1) The Foundation, founded in 1868 by William 
       S. Grant and William E. Grant, is a national educational institution 
        dedicated to the advancement of science, technology, engineering, mathematics, 

Enter your prompt (q to break):
The Board may complete the organization of the Foundation
Time used:  1.132190465927124

GPT-2 output:
The Board may complete the organization of the Foundation by a joint resolution, which shall include the following provisions:

--
--
``The Board shall provide a mechanism for the organization of the Foundation by a joint resolution, which shall include the following provisions:
``The Board shall provid