# HOML Chapter 16 Exercise 11

### Exercise: Use one of the recent language models (e.g., GPT) to generate more convincing Shakespearean text.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
# Random seeds from both numpy and tensorflow
from numpy.random import seed
seed(99)
tf.random.set_seed(99)

We'll use GPT2 to generate Shakespearean text (the author uses the original GPT), running everything on Google Colab. Our first attempt, a variation of the author's code for importing the GPT2 model and tokenizer from Hugging Face, ran into versioning problems. Fortunately, the author links to a Hugging Face post with a working GPT2 setup, which resolved the issue: https://huggingface.co/blog/how-to-generate

We'll start by installing the transformers library. Then we'll load the tokenizer and the model with its pretrained weights.

In [None]:
!pip install -q git+https://github.com/huggingface/transformers.git


In [None]:

from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
# pad_token_id: The id of the `padding` token.
# eos_token_id: The id of the `end-of-sequence` token.
model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Let's now tokenize and encode the sample text. To start, we'll use the same text the author did so that we can compare our results with his.

In [None]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('This royal throne of kings, this sceptred isle', return_tensors='tf')

We'll be using both Top-K and Top-p sampling. Top-K sampling, which GPT2 adopted, keeps only the K most likely next words and redistributes the probability mass among those K words. Top-p (nucleus) sampling instead chooses from the smallest set of words whose cumulative probability exceeds p, and redistributes the probability mass among those words. Top-p has the advantage of being dynamic: the size of the set can grow and shrink with the next word's probability distribution.

Top-K and Top-p can be used together, which rules out low-ranked words while keeping the candidate set dynamic. Setting do_sample=True makes the model sample the next word from its probability distribution rather than always picking the most likely one.
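To make the two filters concrete, here is a minimal NumPy sketch applied to a made-up five-word distribution. It illustrates the idea only and is not the Hugging Face implementation; the function name `top_k_top_p_filter` and the toy probabilities are our own assumptions.

```python
import numpy as np

def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Return a renormalized copy of `probs` restricted by top-k and/or top-p.

    Hypothetical helper for illustration; top_k=0 and top_p=1.0 disable
    the respective filter.
    """
    probs = np.asarray(probs, dtype=float)
    keep = np.ones_like(probs, dtype=bool)

    if top_k > 0:
        # Keep the k highest-probability words (ties may keep a few more).
        kth_largest = np.sort(probs)[-min(top_k, probs.size)]
        keep &= probs >= kth_largest

    if top_p < 1.0:
        # Among the surviving words, keep the smallest set whose
        # cumulative probability reaches top_p.
        masked = np.where(keep, probs, 0.0)
        masked = masked / masked.sum()
        order = np.argsort(-masked)              # most likely first
        cumulative = np.cumsum(masked[order])
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        nucleus = np.zeros_like(keep)
        nucleus[order[:cutoff]] = True
        keep &= nucleus

    # Redistribute the probability mass among the words that remain.
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

# Toy next-word distribution (made up for this example).
toy_probs = [0.5, 0.2, 0.15, 0.1, 0.05]
print("top_k=3:  ", top_k_top_p_filter(toy_probs, top_k=3))
print("top_p=0.6:", top_k_top_p_filter(toy_probs, top_p=0.6))
print("both:     ", top_k_top_p_filter(toy_probs, top_k=3, top_p=0.9))
```

With a sharply peaked distribution the top-p set shrinks to a handful of words, while a flat distribution lets it grow; top-k always caps the set at K regardless of shape.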

We tried a range of values for top_k and top_p. Generally, a top_k value that was either too low (below 10) or too high (above 50) led to less coherent sentences, as did top_p values that were too low (below 0.75). A larger max_length also helped: keeping it above 20 allowed for at least one complete sentence per sequence.
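One way to organize this kind of exploration is to loop over a small grid of settings. The sketch below is an assumption about how such a sweep could look, not the code we actually ran: the grid values and the helper names `sampling_grid` and `compare_settings` are illustrative, and `compare_settings` expects the `model`, `tokenizer`, and `input_ids` objects created in this notebook.

```python
from itertools import product

# Illustrative values only; these are not the exact settings we tried.
TOP_K_VALUES = (5, 30, 100)
TOP_P_VALUES = (0.5, 0.75, 0.95)

def sampling_grid(top_k_values=TOP_K_VALUES, top_p_values=TOP_P_VALUES):
    """Build every (top_k, top_p) combination as keyword arguments."""
    return [{"top_k": k, "top_p": p}
            for k, p in product(top_k_values, top_p_values)]

def compare_settings(model, tokenizer, input_ids, extra_tokens=50):
    """Generate one sequence per setting and print it for comparison."""
    for settings in sampling_grid():
        outputs = model.generate(
            input_ids,
            do_sample=True,
            # max_length counts the prompt tokens too, so add them back
            max_length=extra_tokens + len(input_ids[0]),
            num_return_sequences=1,
            **settings,
        )
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        print(settings, "->", text)

print(sampling_grid())
```

Calling `compare_settings(model, tokenizer, input_ids)` after the encoding cell would print one continuation per combination, which makes it easier to see where coherence starts to break down.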

Additional hyperparameter documentation can be found here: https://huggingface.co/transformers/main_classes/model.html?highlight=generate

In [None]:
# random seed setting would be included here if we hadn't already set it earlier

# set top_k = 30 and set top_p = 0.95 and num_return_sequences = 5
sample_outputs = model.generate(
    input_ids,
    do_sample=True, 
    max_length=50 + len(input_ids[0]),
    top_k=30, 
    top_p=0.95, 
    num_return_sequences=5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: This royal throne of kings, this sceptred isle of kings, it has made a large tower. But this castle is not a castle of kings, but a great stone tower, and you see that these things are not really castles of kings, but of stone towers, which is not the way in
1: This royal throne of kings, this sceptred isle with the head of the serpent in its face. He has three heads on both sides: the left is covered with a yellow covering, the right with a silver covering, and the centre with two gold covers. He has two arms, one on the
2: This royal throne of kings, this sceptred isle of the people of the land of the Holy Ghost. A statue of this saint, made of brass, is situated in an enclosure for the people of the land of the Holy Ghost. The royal throne of kings, which has stood upright since the times
3: This royal throne of kings, this sceptred isle where kings of England and France c

Now, just for fun, let's try a few more Shakespeare quotes. We'll start by asking the model to complete some famous lines.

In [None]:
# encode context the generation is conditioned on
# Beware the Ides of March
input_ids = tokenizer.encode('Beware the Ides of', return_tensors='tf')

In [None]:
# set top_k = 30 and set top_p = 0.95 and num_return_sequences = 5
sample_outputs = model.generate(
    input_ids,
    do_sample=True, 
    max_length=50 + len(input_ids[0]),
    top_k=30, 
    top_p=0.95, 
    num_return_sequences=5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: Beware the Ides of Mandu and his allies.

You may be able to find the full game at the Steam Store.

The first edition of Baldur's Gate II: Enhanced Edition is now available from EA to download on Steam.

A new
1: Beware the Ides of the Universe in your quest for immortality and self-restraint, and that the world won't remember you!

This mod is designed as a complete retexture of the original Skyrim texture pack. The files used will be used to produce any mods
2: Beware the Ides of the Gods

And those who would deny those

I said that the gods will not be

For you are not the God of the Jews,

You are the God of the Muslims,

You are the God of the
3: Beware the Ides of Hercules!" A couple of years later, a few people took an interest in it.

The most famous and powerful image of the original "Puppet" in Disney history has the red head of Jesus on a red velvet robe, with the

In [None]:
# ‘Friends, Romans, countrymen, lend me your ears: I come to bury Caesar, not to praise him.’
input_ids = tokenizer.encode('‘Friends, Romans, countrymen, lend me your ears:', return_tensors='tf')

In [None]:
# set top_k = 30 and set top_p = 0.95 and num_return_sequences = 5
sample_outputs = model.generate(
    input_ids,
    do_sample=True, 
    max_length=50 + len(input_ids[0]),
    top_k=30, 
    top_p=0.95, 
    num_return_sequences=5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: ‘Friends, Romans, countrymen, lend me your ears: I need not ask for this, nor do I ask for this in vain: do I ask for this not for what I am able to do? For that is not your desire, but the love for my fellow-creatures. Therefore, O
1: ‘Friends, Romans, countrymen, lend me your ears: they shall not deceive you: I am a stranger." Then they made him sit down, and spoke with such reverence that all the people were overcome with pity. The king went on with the rest of the people, and told them everything. The people
2: ‘Friends, Romans, countrymen, lend me your ears: you are a friend of mine; but I ask your forgiveness.

The last sentence of this passage is probably the most important: If you are a friend of mine, you have to forgive me, not only for my actions but for your deeds
3: ‘Friends, Romans, countrymen, lend me your ears: For I was a poor fellow. But he made my life mise

In [None]:
# ‘There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy.’
input_ids = tokenizer.encode('There are more things in heaven and earth, Horatio,', return_tensors='tf')

In [None]:
# set top_k = 30 and set top_p = 0.95 and num_return_sequences = 5
sample_outputs = model.generate(
    input_ids,
    do_sample=True, 
    max_length=50 + len(input_ids[0]),
    top_k=30, 
    top_p=0.95, 
    num_return_sequences=5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: There are more things in heaven and earth, Horatio, which are not in heaven, but which are in heaven. So I can say that the first things in heaven are heaven itself, and the second things are heaven and earth. But I can say that the latter are not all things which are in heaven
1: There are more things in heaven and earth, Horatio, than that we could do with our bodies. When he gave us that body, he was a man of great love. He could hear the humbleness of our hearts, the beauty of our feet, the sweetness of our taste, the lightness
2: There are more things in heaven and earth, Horatio, and he will save me; and I will see him, and I will find him, and I will give him all I have in heaven."

In that same way, there are more things than the number of times God's Word was uttered
3: There are more things in heaven and earth, Horatio, the king of the children of men," he says. "Bu

In [None]:
# 'To be, or not to be: that is the question.'
input_ids = tokenizer.encode('To be or not to be:', return_tensors='tf')

In [None]:
# set top_k = 30 and set top_p = 0.95 and num_return_sequences = 5
sample_outputs = model.generate(
    input_ids,
    do_sample=True, 
    max_length=50 + len(input_ids[0]),
    top_k=30, 
    top_p=0.95, 
    num_return_sequences=5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: To be or not to be: The difference between this and the current state of the situation is much smaller than the difference between the present state of affairs is.

What's the difference?

This is a major technical difference because it is based on a technical question: Does
1: To be or not to be: this can be either:

1) a very short period of time (eg: 20+ minutes) for the user to get something in the app and use it as the basis for the app itself to be installed and used. This will allow
2: To be or not to be:

The reason why so many people go to sleep on nights when their sleep is poor, and for the most part are not able to get back to sleep with their loved ones, is because a lot of people have to go to sleep too late
3: To be or not to be:

1. You must be born with a "girly" name or a "lucky" name or a "honest" name that is in keeping with the law.

2. In addition, you m

Now, let's see how the model continues after a question.

In [None]:
# 'Romeo, Romeo! Wherefore art thou Romeo?'
input_ids = tokenizer.encode('Romeo, Romeo! Wherefore art thou Romeo?', return_tensors='tf')

In [None]:
# set top_k = 30 and set top_p = 0.95 and num_return_sequences = 5
sample_outputs = model.generate(
    input_ids,
    do_sample=True, 
    max_length=50 + len(input_ids[0]),
    top_k=30, 
    top_p=0.95, 
    num_return_sequences=5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: Romeo, Romeo! Wherefore art thou Romeo? Behold it is that I am a poet, and that he has come to me, and to all my heart, the one from whom, in order that he may sing, he may take from me, and that I may receive the gifts of
1: Romeo, Romeo! Wherefore art thou Romeo?

Lion, how can I tell thee? I will tell thee only how much I love Romeo!

O, Romeo. What shall thou? And why art thou such an ugly brute!

Romeo, for I,
2: Romeo, Romeo! Wherefore art thou Romeo? O Romeo: and is this the same O Romeo that is in heaven? And how much shall it concern thee in thy soul that thou art in heaven?

The one who has died for aught is a murderer.

And this will
3: Romeo, Romeo! Wherefore art thou Romeo? Thou art the first to say? And how art thou Romeo? And how are we to understand the question? Is it possible that we do not comprehend it; but it appears only as if we were to ask? But what do

In [None]:
# 'Et tu, Brute?'
input_ids = tokenizer.encode('Et tu, Brute?', return_tensors='tf')

In [None]:
# set top_k = 30 and set top_p = 0.95 and num_return_sequences = 5
sample_outputs = model.generate(
    input_ids,
    do_sample=True, 
    max_length=50 + len(input_ids[0]),
    top_k=30, 
    top_p=0.95, 
    num_return_sequences=5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: Et tu, Brute?

In your time you'll be called a "Culture Warrior." This has nothing to do with being the leader of the "Culture Warrior." You need to get it right. Be the leader of your tribe! It's hard to get
1: Et tu, Brute? The Emperor's wrath is not limited to the emperor himself; he may also be held to be at home within the Imperial realm of the Fereldan, where the two great Houses share a common fate. A few years ago, an Imperial High
2: Et tu, Brute?

Brute? Brute! Brute, I can't believe I'm here so late.

My name's Tasha and I've been doing this all day.

Yeah, that was funny.

I'm from
3: Et tu, Brute? You're going to go home," she says, laughing, "or at least you'll let us out." She goes up to the door to leave but the door behind her is locked, so she keeps walking and waits until she's about thirty-
4: Et tu, Brute?

(I thought so.)

Brutus: What did he mean by thi

After several examples, many of our sequences end up being largely nonsense. Many of them are humorous because they make little sense or are clearly things Shakespeare would never have written, such as the continuations about Steam releases and Skyrim mods. But a few sequences seem passable.

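To take this exercise further, it may be worth playing around with additional hyperparameters or trying other transformers such as BERT, RoBERTa, XLM, DistilBert, or XLNet. Keep in mind that masked language models such as BERT and RoBERTa are not designed for left-to-right text generation, so autoregressive models like XLNet are the more natural candidates.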
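One hyperparameter we did not explore above is `temperature`, which `generate` also accepts. As a rough illustration of its effect, the NumPy sketch below rescales a made-up set of logits before the softmax; the function name and the toy logits are our own assumptions, not part of the notebook's code.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by `temperature`."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Toy logits for three candidate words (made up for this example).
toy_logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    print(t, np.round(softmax_with_temperature(toy_logits, t), 3))
```

Temperatures below 1 sharpen the distribution toward the most likely word, while temperatures above 1 flatten it, so lowering the temperature is another way to trade variety for coherence.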