## GPT2 Intro

Developed by OpenAI, GPT-2 is a large-scale transformer-based language model pre-trained on a large text corpus: 8 million high-quality web pages. It achieves competitive performance on multiple language tasks using only its pre-trained knowledge, without being explicitly trained on them.

“GPT-2 achieves state-of-the-art scores on a variety of domain-specific language modeling tasks. Our model is not trained on any of the data specific to any of these tasks and is only evaluated on them as a final test; this is known as the “zero-shot” setting. GPT-2 outperforms models trained on domain-specific data sets (e.g. Wikipedia, news, books) when evaluated on those same data sets.” – OpenAI team.

We will use it to build a Shakespeare-style playwright: we fine-tune GPT-2 on Shakespeare's text and then generate new text in the same style.

## Installing packages

First, we install the transformers package with pip, directly from its GitHub repository: https://github.com/huggingface/transformers

In [1]:
!pip install git+https://github.com/huggingface/transformers

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-ece2hvq1
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-ece2hvq1
Collecting tokenizers==0.8.1.rc2
  Downloading https://files.pythonhosted.org/packages/80/83/8b9fccb9e48eeb575ee19179e2bdde0ee9a1904f97de5f02d19016b8804f/tokenizers-0.8.1rc2-cp36-cp36m-manylinux1_x86_64.whl (3.0MB)
     |████████████████████████████████| 3.0MB 4.6MB/s 
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
     |████████████████████████████████| 1.1MB 48.8MB/s 
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)

An upgraded version of pyarrow is required for fine-tuning, as the previously installed version tends to cause errors.

In [3]:
!pip install --upgrade pyarrow

Collecting pyarrow
  Downloading https://files.pythonhosted.org/packages/f3/99/0a605f016121ca314d1469dc9069e4978395bc46fda40f73099d90ad3ba4/pyarrow-1.0.1-cp36-cp36m-manylinux2014_x86_64.whl (17.3MB)
     |████████████████████████████████| 17.3MB 203kB/s 
Installing collected packages: pyarrow
  Found existing installation: pyarrow 0.14.1
    Uninstalling pyarrow-0.14.1:
      Successfully uninstalled pyarrow-0.14.1
Successfully installed pyarrow-1.0.1


## Loading the Shakespeare play text

The text is taken from here: https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

We will use the !wget command to download the text. It is saved as input.txt in the current working directory.

In [4]:
# Download the Shakespeare text.
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2020-08-28 22:07:42--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2020-08-28 22:07:42 (16.8 MB/s) - ‘input.txt’ saved [1115394/1115394]
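Before fine-tuning, it can help to confirm that the download is what we expect. The snippet below is a minimal sketch of such a check; `summarize_text` is a hypothetical helper name (not part of any library), and it assumes the file was saved as input.txt in the current working directory.

```python
from pathlib import Path


def summarize_text(path, preview_chars=200):
    """Return (character count, newline count, preview) for a UTF-8 text file."""
    text = Path(path).read_text(encoding="utf-8")
    return len(text), text.count("\n"), text[:preview_chars]


# Only runs if the file is present, so the cell is safe to execute anywhere.
if Path("input.txt").exists():
    n_chars, n_newlines, preview = summarize_text("input.txt")
    print("characters:", n_chars)
    print("newlines:", n_newlines)
    print(preview)
```

The preview should show speaker names followed by their lines, which is the format the fine-tuned model learns to imitate.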



Next, we create an output directory to save the tokenizer and model.

In [5]:
!mkdir output

We also download run_language_modeling.py with !wget. This script will be used to fine-tune the model on our custom dataset.

In [7]:
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_language_modeling.py

--2020-08-28 22:09:09--  https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_language_modeling.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11192 (11K) [text/plain]
Saving to: ‘run_language_modeling.py’


2020-08-28 22:09:10 (104 MB/s) - ‘run_language_modeling.py’ saved [11192/11192]



## Fine-Tuning

With the packages installed and the text data downloaded, it is time to fine-tune GPT-2 so that it generates text similar to the downloaded plays.

In [8]:
!python run_language_modeling.py \
    --output_dir=output \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file='/content/input.txt' \
    --per_gpu_train_batch_size=1 \
    --save_steps=-1 \
    --num_train_epochs=2

PyTorch version 1.6.0+cu101 available.
2020-08-28 22:09:16.752154: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
TensorFlow version 2.3.0 available.
PyTorch: setting up devices
08/28/2020 22:09:18 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='output', overwrite_output_dir=False, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=1, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=2.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Aug28_22-09-18_daa890b659ff', logging_first_step=False, logging_steps=500, save_steps=-1, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', l

## Loading the tokenizer and model from output

The run_language_modeling.py script has saved the tokenizer and model in the output directory. Now we load them using GPT2Tokenizer and GPT2LMHeadModel, imported from the transformers package.

In [9]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('/content/output')
model = GPT2LMHeadModel.from_pretrained('/content/output')

PyTorch version 1.6.0+cu101 available.
TensorFlow version 2.3.0 available.
Model name '/content/output' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming '/content/output' is a path, a model identifier, or url to a directory containing tokenizer files.
Didn't find file /content/output/added_tokens.json. We won't load it.
Didn't find file /content/output/tokenizer.json. We won't load it.
loading file /content/output/vocab.json
loading file /content/output/merges.txt
loading file None
loading file /content/output/special_tokens_map.json
loading file /content/output/tokenizer_config.json
loading file None
loading configuration file /content/output/config.json
Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2

We now have our model and tokenizer, and it's time to use them to generate some interesting text similar to Shakespeare's plays.

## Generating Text

It is important to understand how the fine-tuned model generates new output, so that we can get the best results from it.

We will try several decoding approaches and find out which one works best.

## Greedy Search
This is the most basic decoding algorithm: at each step it selects the word with the highest probability as the next word and never considers words with lower probability.
The code for implementing greedy search with our model is given below.

We first use the tokenizer to encode the prompt that the model will continue.
Then we use the generate function to generate the new text.

Here, the prompt is wrapped in [ WP ] at the start and <endprompts> at the end. These markers are intended to make it easier for the model to generate text based on the example input sentence, which is 'The King must leave the throne now .' in our case.

In [24]:
ids1 = tokenizer.encode('[ WP ] The King must leave the throne now . <endprompts>',
                      return_tensors='pt')

greedy_outputs = model.generate(ids1, max_length=300)

print("Output:\n" + 100 * '-')
for i, greedy_output in enumerate(greedy_outputs):
  print("\n"+"==="*10)
  print("{}: {}".format(i+1, tokenizer.decode(greedy_output, skip_special_tokens=False)))

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


Output:
----------------------------------------------------------------------------------------------------

1: [ WP ] The King must leave the throne now. <endprompts>

KING RICHARD II:
I will not.

GLOUCESTER:
I will not.

KING RICHARD II:
I will not.

GLOUCESTER:
I will not.

KING RICHARD II:
I will not.

GLOUCESTER:
I will not.

KING RICHARD II:
I will not.

GLOUCESTER:
I will not.

KING RICHARD II:
I will not.

GLOUCESTER:
I will not.

KING RICHARD II:
I will not.

GLOUCESTER:
I will not.

KING RICHARD II:
I will not.

GLOUCESTER:
I will not.

KING RICHARD II:
I will not.

GLOUCESTER:
I will not.

KING RICHARD II:
I will not.

GLOUCESTER:
I will not.

KING RICHARD II:
I will not.

GLOUCESTER:
I will not.

KING RICHARD II:
I will not.

GLOUCESTER:
I


As you can see, the output contains far too much repetition; greedy search is clearly not able to generate good text for the play.
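To see why greedy search falls into loops like this, here is a minimal sketch on a toy model. The probability table `TOY_BIGRAMS` and the function `greedy_decode` are made up for illustration and are not taken from GPT-2; the toy model also looks only at the previous word, which is a simplification.

```python
# Toy next-word probabilities keyed by the previous word (illustrative values only).
TOY_BIGRAMS = {
    "I": {"will": 0.6, "shall": 0.4},
    "will": {"not": 0.7, "go": 0.3},
    "not": {".": 0.8, "yield": 0.2},
    ".": {"I": 0.9, "<eos>": 0.1},
}


def greedy_decode(start, max_new_tokens):
    """Repeatedly append the single most probable next word."""
    tokens = [start]
    for _ in range(max_new_tokens):
        dist = TOY_BIGRAMS.get(tokens[-1])
        if dist is None:  # no known continuation for this word
            break
        next_token = max(dist, key=dist.get)  # argmax over next-word probabilities
        tokens.append(next_token)
        if next_token == "<eos>":
            break
    return tokens


print(" ".join(greedy_decode("I", 11)))
# prints: I will not . I will not . I will not .
```

Because the most probable word is always chosen, the less likely alternatives ("shall", "go", "yield", "<eos>") are never reached, and once the sequence returns to a word it has already visited, it repeats the same path.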

So, next we try out beam search.

## Beam Search
Beam search keeps the num_beams most probable partial sequences (hypotheses) at each step, unlike greedy search, which keeps only the single most probable next word. Each hypothesis is extended with candidate next words, and its score is the product of the probabilities of all of its words so far. At the end, the sequence with the highest overall probability is selected.
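The idea can be sketched on a toy example before applying it to our model. Everything in this snippet is an illustrative assumption: the table `TOY_PROBS` is invented, the toy model conditions only on the previous word, and `beam_search` is a simplified version that ignores EOS handling and length penalties. Scores are kept as sums of log-probabilities, which is equivalent to multiplying the probabilities.

```python
import math

# Toy next-word probabilities keyed by the previous word (illustrative values only).
TOY_PROBS = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.6, "drives": 0.4},
}


def beam_search(start, steps, num_beams):
    """Return the num_beams best (tokens, log-probability) pairs after `steps` words."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            # Extend every kept hypothesis with every possible next word.
            for word, prob in TOY_PROBS.get(tokens[-1], {}).items():
                candidates.append((tokens + [word], score + math.log(prob)))
        if not candidates:
            break
        # Keep only the num_beams highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


for n in (1, 2):
    tokens, score = beam_search("The", steps=2, num_beams=n)[0]
    print(n, " ".join(tokens), round(math.exp(score), 2))
# prints:
# 1 The nice woman 0.2
# 2 The dog has 0.36
```

With one beam the search is greedy and commits to "nice" because it is the most probable first word. With two beams, "dog" is kept alive as well, and the sequence "The dog has" turns out to have the higher overall probability.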

The code for implementing beam search with our model is given below.

We set num_beams > 1 and early_stopping=True so that generation finishes when all beam hypotheses have reached the end-of-sequence (EOS) token.

In [25]:
# activate beam search and early_stopping
beam_output = model.generate(
    ids1, 
    max_length=300, 
    num_beams=4, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


Output:
----------------------------------------------------------------------------------------------------
[ WP ] The King must leave the throne now. <endprompts>

DUKE VINCENTIO:
The king must leave the throne now.

GLOUCESTER:
The king must leave the throne now.

DUKE VINCENTIO:
The king must leave the throne now.

GLOUCESTER:
The king must leave the throne now.

DUKE VINCENTIO:
The king must leave the throne now.

GLOUCESTER:
The king must leave the throne now.

DUKE VINCENTIO:
The king must leave the throne now.

GLOUCESTER:
The king must leave the throne now.

DUKE VINCENTIO:
The king must leave the throne now.

GLOUCESTER:
The king must leave the throne now.

DUKE VINCENTIO:
The king must leave the throne now.

GLOUCESTER:
The king must leave the throne now.

DUKE VINCENTIO:
The king must leave the throne now.

GLOUCESTER:
The king must leave the throne now.

DUKE VINCENTIO:
The king must leave the throne now.

GLOUCESTER:
The king must


This shows that beam search alone is also not good enough, so we will have to add more parameters to the generate function.

## Let's add Sampling
Sampling means randomly picking the next word according to its conditional probability distribution.
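As a minimal sketch of what this means, the snippet below draws next words from a small, made-up distribution using Python's standard library. The words, probabilities and the helper name `sample_next_word` are assumptions for illustration; the real model samples over its full vocabulary.

```python
import random


def sample_next_word(probs, rng):
    """Draw one word at random, with probability proportional to its weight."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # a fixed seed makes the draws reproducible
probs = {"king": 0.5, "duke": 0.3, "crown": 0.2}

draws = [sample_next_word(probs, rng) for _ in range(1000)]
# The observed frequencies should be close to the probabilities above.
print({w: draws.count(w) / len(draws) for w in probs})
```

Unlike greedy search, a less likely word such as "crown" is still chosen some of the time, which is what breaks the repetitive loops.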

Sampling is random, so we set a seed to make the results reproducible. The model loaded above is a PyTorch model (we encoded the prompt with return_tensors='pt'), so we seed PyTorch's random number generator.

In [27]:
import torch

Setting do_sample=True turns on sampling in the generate function.

In [28]:
# set seed to reproduce results. Feel free to change the seed though to get different results
torch.manual_seed(0)

# activate sampling; top_k is not passed here, so the library's default value applies
sample_output = model.generate(
    ids1, 
    do_sample=True, 
    max_length=300
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


Output:
----------------------------------------------------------------------------------------------------
[ WP ] The King must leave the throne now. <endprompts> KING HENRY VI

GRENILIA:
The king shall and do.

LEONTES:
Well may you think it, Henry. If he so desire, he shall bring some news.

KING HENRY VI:
I take it, good lady!
For the moment I can't bear the thought, I stand still: for if the Duke of Clarence would,
I know he is but too late. He is coming to hear me,
and he shall answer, my lord, this matter should be brought to his mouth.

GRENILIA:
Is not the Duke of Clarence's word enough?

KING HENRY VI:
Not much.

GRENILIA:
Yet, when she speaks, speak, and I shall hear.

KING HENRY VI:
Ay, for the day I must hear his answer. He will not speak.

GRENILIA:
The Duke of Clarence must answer you.

KING HENRY VI:
You will not?

GRENILIA:
Not to say he would. When I speak, I take no part in the proceedings.

KING HENRY VI:
When he shows time, the Duke of Clarence will hear you.

GRE

As we can see, it produces much better results than the previous methods, and the text is starting to make some sense.

## Top-K Sampling

Let's try something new.
Top-k sampling has recently become a popular alternative sampling procedure (Fan et al., 2018; Holtzman et al., 2018; Radford et al., 2019). Nucleus sampling and top-k sampling both sample from truncated neural LM distributions, differing only in the strategy of where to truncate.

In Top-K sampling, the K most likely next words are filtered and the probability mass is redistributed among only those K next words.
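The filtering step can be sketched in a few lines of plain Python. The distribution and the function name `top_k_filter` are illustrative assumptions; the library applies the same idea to the model's scores over the whole vocabulary.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize so they sum to 1."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}


probs = {"king": 0.4, "duke": 0.3, "crown": 0.2, "sword": 0.1}
print(top_k_filter(probs, 2))
```

With k=2 only "king" and "duke" survive, and their probabilities are rescaled to about 0.57 and 0.43. The next word is then sampled from this smaller distribution.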

Let's implement it.

We need to pass the top_k parameter to the generate function to use top-k sampling.

In [30]:
# set seed to reproduce results. Feel free to change the seed though to get different results
import torch  # the fine-tuned model runs on PyTorch, so we seed PyTorch's generator
torch.manual_seed(0)

# set top_k to 50
sample_output2 = model.generate(
    ids1, 
    do_sample=True, 
    max_length=300, 
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output2[0], skip_special_tokens=True))

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


Output:
----------------------------------------------------------------------------------------------------
[ WP ] The King must leave the throne now. <endprompts>

LAPISTA:
But, lord, do give me some time so your presence can be taken
and we both be gone: by my grace will I go
from that point; what would you say, though you
may't
Have heard it so far?

NORFOLK:
And let's come back again.

PRIO:
Good day to you, good morning.
You'll do well to hear from me again: it's a pleasure
to hear what you'll be doing at this time.
And how canthither have you said I went to see you and
your wife?

TOMAS:
That thou art so good about thy mother's life that she should seem
like a lady to me too; for that we are now friends,
she needs not for life.

LAPISTA:
Good, good sir; and look forward to her coming hither.

PRIO:
Now, she comes too late.

LAPISTA:
Why, good, she comes too late; she is too
short of breath; she is quite young; and indeed,
she cannot breathe.

PRIO:
Why, good sir, she comes too l

Now that we have implemented top-k sampling, let's try top-p sampling.

## Top-p (nucleus) sampling

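Top-p sampling selects the smallest set of highest-probability tokens whose cumulative probability mass exceeds the pre-chosen threshold p. The probability mass is then redistributed among the tokens in this set, and the next token is sampled from it.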
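A minimal sketch of this truncation, again on made-up distributions, is shown below. The function name `top_p_filter` is hypothetical, and the cut-off here uses "reaches or exceeds p", which is one reasonable reading of the definition.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable words whose total mass reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:  # enough probability mass collected
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}


flat = {"king": 0.4, "duke": 0.3, "crown": 0.2, "sword": 0.1}
peaked = {"king": 0.95, "duke": 0.03, "crown": 0.02}

print(len(top_p_filter(flat, 0.92)))    # prints: 4
print(len(top_p_filter(peaked, 0.92)))  # prints: 1
```

Unlike top-k, the number of words kept is not fixed: a flat distribution keeps many candidates, while a peaked one keeps very few.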

In [31]:
# set seed to reproduce results. Feel free to change the seed though to get different results
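import torch  # the fine-tuned model runs on PyTorch, so we seed PyTorch's generator
torch.manual_seed(0)

# sample only from the smallest set of words whose cumulative probability reaches 92%;
# top_k is not passed here, so the library's default value applies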
sample_output3 = model.generate(
    ids1, 
    do_sample=True, 
    max_length=300, 
    top_p=0.92,
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output3[0], skip_special_tokens=True))

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


Output:
----------------------------------------------------------------------------------------------------
[ WP ] The King must leave the throne now. <endprompts>

GLOUCESTER:
The king must leave the throne now.

NORTHUMBERLAND:
I do not think so.

GLOUCESTER:
No.

NORTHUMBERLAND:
I cannot, but will. I shall, for the king.

GLOUCESTER:
The king must leave the throne now, as I have said.

GLOUCESTER:
There is no more king left in the house.

NORTHUMBERLAND:
But, by the grace of God, I am satisfied with him.

GLOUCESTER:
The king will leave the throne, as I have said,
as he had done it before: for to do this, he had promised,
before. I will go to the king and be content,
and by my power be gone.

GLOUCESTER:
No more king left. There is no king left in the house.

GLOUCESTER:
The king shall leave the throne now.

GLOUCESTER:
No more king left. There is no king left in the house.

GLOUCESTER:
No more king left. There is no king left in the house.

GLOUCESTER:



It's time to combine top-k and top-p sampling.
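One way to picture the combination is as two filters applied in sequence: first keep the k most probable words, then apply the top-p cut-off to what remains. The sketch below makes that assumption explicit; the ordering, the toy distribution and the function name `top_k_top_p_filter` are illustrative and not a description of the library's internals.

```python
def top_k_top_p_filter(probs, k, p):
    """Apply a top-k cut, renormalize, then apply a top-p cut and renormalize again."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    ranked = [(word, prob / total) for word, prob in ranked]  # mass after top-k

    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}


probs = {"king": 0.5, "duke": 0.2, "crown": 0.15, "sword": 0.1, "horse": 0.05}
print(top_k_top_p_filter(probs, k=3, p=0.8))
```

Here top-k first removes "sword" and "horse", and top-p then drops "crown", leaving "king" and "duke" with probabilities of about 0.71 and 0.29.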

In [34]:
# set seed to reproduce results. Feel free to change the seed though to get different results
import torch  # the fine-tuned model runs on PyTorch, so we seed PyTorch's generator
torch.manual_seed(0)

# set top_k = 40 and set top_p = 0.95
final_outputs = model.generate(
    ids1,
    do_sample=True, 
    max_length=300, 
    top_k=40, 
    top_p=0.95, 
)

print("Output:\n" + 100 * '-')
for i, final_output in enumerate(final_outputs):
  print("{}: {}".format(i, tokenizer.decode(final_output, skip_special_tokens=True)))

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


Output:
----------------------------------------------------------------------------------------------------
0: [ WP ] The King must leave the throne now. <endprompts>

TRANIO:
The Duke of York,
How dost thou tell me it?

KING RICHARD II:
He that hath his crown of York is but one of them
That hath no crown of York.

TRANIO:
His crown, therefore, is no crown of York.

KING RICHARD II:
For he that hath his crown hath no crown of York.

TRANIO:

King Richard is his son, and so shall he.

KING RICHARD II:
What hath your father's son been born to thee?

LUCENTIO:
The prince died, as I say;
My father, as I say, died to me;
And if the Duke of York should wish his son for the prince,
For that I was not his son,
The boy was made to stand by him.

TRANIO:
What then?

LUCENTIO:
To have the father's son be crowned by him,
And, like him, give him to the prince.

KING RICHARD II:
What then, good Lucentio?

LUCENTIO:
A daughter to the Duke of York,
That he may call my son Lucentio.

TRANIO:




This is the final text we generated, and it begins to show some order of events. The parameters can be tuned further to get better results.

## Thanks for reading

I referred to the following links while writing this tutorial. You can go through them if you want to go into more depth.

https://huggingface.co/blog/how-to-generate

https://arxiv.org/pdf/1904.09751.pdf