Transformers are deep learning architectures introduced by Google in 2017 that are designed to process sequential data for downstream tasks such as translation, question answering or text summarization.

Let's first talk about text generation.
GPT-2 is one of the most popular Transformer architectures, and its text generation capabilities are usable by a broad audience.

In [1]:
import tensorflow as tf

One of the advantages of the transformers library, and a reason for its popularity, is how easily we can download a specific model.

In [2]:
! pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling 

In [3]:
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

In [4]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
GPT2 = TFGPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


As the log above shows, the pretrained model was downloaded through the transformers library, so we can start generating text right away without any training of our own. It is usually a good idea to fix the random seed to make sure the results are reproducible.

In [5]:
# settings

# for reproducibility
SEED = 34
tf.random.set_seed(SEED)

# maximum number of tokens in the output text (prompt included)
MAX_LEN = 70

The next step is to choose a decoding strategy, which is one of the most important decisions when using the GPT-2 model.

We will start with greedy search: at each step, the word with the highest probability is chosen as the next word in the sequence.
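
To make the idea concrete before calling the real model, here is a minimal sketch of greedy decoding on a toy next-word table. The table `TOY_MODEL`, its probabilities, and the helper `greedy_decode` are hypothetical and only for illustration; they are not part of the transformers library.

```python
# Toy next-word table: hypothetical probabilities, not taken from GPT-2.
TOY_MODEL = {
    "I": {"feel": 0.6, "am": 0.4},
    "feel": {"alone": 0.7, "tired": 0.3},
    "alone": {"and": 0.55, "<eos>": 0.45},
    "and": {"I": 0.9, "<eos>": 0.1},
    "am": {"tired": 1.0},
    "tired": {"<eos>": 1.0},
}


def greedy_decode(start, max_len):
    """Repeatedly append the single most probable next word."""
    words = [start]
    while len(words) < max_len:
        candidates = TOY_MODEL[words[-1]]
        next_word = max(candidates, key=candidates.get)
        if next_word == "<eos>":
            break
        words.append(next_word)
    return words


greedy_words = greedy_decode("I", max_len=9)
print(" ".join(greedy_words))
```

Because the most probable word is always taken, this toy decoder never reaches the end-of-sequence token and keeps cycling through the same phrase until it hits `max_len`.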

In [6]:
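# first candidate prompt (overridden by the assignment below)
input_sequence = "I don't know about you, but there's only one thing I want to do after a long day of work"

# prompt used for the decoding experiments that follow
input_sequence = "There are times when I am really tired of people, but I feel lonely too."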

In [7]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode(input_sequence, return_tensors='tf')

# generate text until the output length (which includes the context length) reaches MAX_LEN
greedy_output = GPT2.generate(input_ids, max_length = MAX_LEN)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens = True))

Output:
----------------------------------------------------------------------------------------------------
There are times when I am really tired of people, but I feel lonely too. I feel like I'm alone in the world. I feel like I'm alone in my own body. I feel like I'm alone in my own mind. I feel like I'm alone in my own heart. I feel like I'm alone in my own mind


As the output above shows, the model starts repeating itself: the high-probability words mask the less likely ones, so the search cannot explore more diverse combinations.

A simple remedy is beam search, which keeps track of several alternative candidate sequences so that more comparisons are possible.
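
As a rough illustration of why keeping several candidates helps, the sketch below runs a tiny beam search over a toy next-word table. `TOY_MODEL`, its probabilities, and the `beam_search` helper are hypothetical; the real `generate` method is considerably more involved.

```python
import math

# Hypothetical next-word probabilities, not taken from GPT-2.
TOY_MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.6, "drives": 0.4},
}


def beam_search(start, steps, num_beams):
    """Keep the num_beams best partial sequences at every step."""
    beams = [([start], 0.0)]  # (words, accumulated log-probability)
    for _ in range(steps):
        candidates = []
        for words, score in beams:
            for word, prob in TOY_MODEL[words[-1]].items():
                candidates.append((words + [word], score + math.log(prob)))
        # sort by log-probability, best first, and prune to the beam width
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


# num_beams=1 is equivalent to greedy search
greedy_words, greedy_score = beam_search("The", steps=2, num_beams=1)[0]
beam_words, beam_score = beam_search("The", steps=2, num_beams=2)[0]

print(" ".join(greedy_words), round(math.exp(greedy_score), 2))
print(" ".join(beam_words), round(math.exp(beam_score), 2))
```

With a single beam the search commits to "nice" immediately, while two beams keep "dog" alive long enough to discover that "The dog has" is the more probable sequence overall.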

In [8]:
# set num_return_sequences > 1
beam_outputs = GPT2.generate(
    input_ids, 
    max_length = MAX_LEN, 
    num_beams = 5, 
    no_repeat_ngram_size = 2, 
    num_return_sequences = 5, 
    early_stopping = True
)

print('')
print("Output:\n" + 100 * '-')

# now we have 5 output sequences
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))


Output:
----------------------------------------------------------------------------------------------------
0: There are times when I am really tired of people, but I feel lonely too. I don't know what to do with myself."

"I feel like I can't do anything right now," she said. "I'm so tired."
1: There are times when I am really tired of people, but I feel lonely too. I don't know what to do with myself."

"I feel like I can't do anything right now," she says. "I'm so tired."
2: There are times when I am really tired of people, but I feel lonely too. I don't know what to do with myself."

"I feel like I can't do anything right now," she says. "I'm not sure what I'm supposed to be doing with my life."
3: There are times when I am really tired of people, but I feel lonely too. I don't know what to do with myself."

"I feel like I can't do anything right now," she says. "I'm not sure what I'm supposed to be doing."
4: There are times when I am really tired of people, but I feel lonely to

As the output above shows, the text no longer falls into a repetition loop. The five beams carry essentially the same message, but their formulations differ slightly in style.

The next step is to explore sampling, a non-deterministic form of decoding. Instead of following a strict path to find the text with the highest probability, we randomly pick the next word according to its conditional probability distribution. This approach risks producing incoherent samples, so we can make use of the temperature parameter, which reshapes the probability mass distribution.
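
The effect of temperature can be sketched in a few lines of plain Python, assuming the usual formulation in which the scores are divided by the temperature before the softmax. The words, scores, and the helper `softmax_with_temperature` are made up for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


words = ["alone", "tired", "happy", "purple"]  # hypothetical candidates
logits = [2.0, 1.0, 0.5, 0.1]                  # hypothetical scores

sharp = softmax_with_temperature(logits, 0.2)    # low temperature
default = softmax_with_temperature(logits, 1.0)  # unmodified distribution

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in default])

# sampling picks the next word at random according to these probabilities
rng = random.Random(34)
print(rng.choices(words, weights=sharp, k=5))
print(rng.choices(words, weights=default, k=5))
```

A low temperature concentrates almost all of the probability on the top candidate, so sampling behaves much like greedy search; a temperature closer to 1 leaves more room for the less likely words.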

In [9]:
# Use the temperature to decrease the sensitivity to low probability candidates
sample_output = GPT2.generate(
      input_ids, 
      do_sample = True, 
      max_length = MAX_LEN, 
      top_k = 0, 
      temperature = 0.2
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True))

Output:
----------------------------------------------------------------------------------------------------
There are times when I am really tired of people, but I feel lonely too. I feel like I have to be alone. I feel like I have to be alone. I feel like I have to be alone. I feel like I have to be alone. I feel like I have to be alone. I feel like I have to be alone


In [10]:
# What happens when we increase the temperature?
sample_output = GPT2.generate(
      input_ids, 
      do_sample = True, 
      max_length = MAX_LEN, 
      top_k = 0, 
      temperature = 0.8
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True))

Output:
----------------------------------------------------------------------------------------------------
There are times when I am really tired of people, but I feel lonely too. I'm terrified of the guys, I find them so attractive, but I am also sad about the girls. I have to be careful about how I feel, but I would appreciate some advice from a senior that can help me think about something else. I'm a


This is getting more interesting, although it still feels a bit like a train of thought which is perhaps to be expected, given the content of our prompt. It is time to explore more ways to tune the output.

In Top-K sampling, the top k most likely next words are selected and the entire probability mass is shifted to these k words. So instead of increasing the chances of high probability words occurring and decreasing the chances of low probability words, we just remove the low probability words altogether.
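
The filtering step can be sketched on a toy distribution. The word probabilities and the helper `top_k_filter` below are hypothetical and only show the idea of discarding the tail and renormalizing.

```python
def top_k_filter(probs, k):
    """Zero out everything except the k most probable words, then renormalize."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[word] for word in keep)
    return {word: (probs[word] / total if word in keep else 0.0) for word in probs}


# hypothetical next-word distribution
probs = {"alone": 0.4, "tired": 0.3, "happy": 0.15, "purple": 0.1, "seven": 0.05}

filtered = top_k_filter(probs, k=2)
for word, prob in filtered.items():
    print(word, round(prob, 3))
```

After filtering, only the two most likely words can ever be sampled, and their probabilities are rescaled so that they sum to 1 again.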

In [11]:
#sample from only top_k most likely words
sample_output = GPT2.generate(
                             input_ids, 
                             do_sample = True, 
                             max_length = MAX_LEN, 
                             top_k = 50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:
----------------------------------------------------------------------------------------------------
There are times when I am really tired of people, but I feel lonely too. I've always felt so small when I have no friends."


"No, I'm not lonely," Naruto said, and then added, "In all honesty, you are the happiest I've ever had."


"So am I," Sasuke agreed without ...


This seems like a step in the right direction. Can we do better than this?
Top-P sampling (also known as nucleus sampling) is similar to Top-K, but instead of choosing the top K most likely words, we choose the smallest set of words whose total probability is larger than p, and then the entire probability mass is shifted to the words in this set.
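
A minimal sketch of the nucleus selection on a toy distribution might look as follows. The probabilities and the helper `top_p_filter` are hypothetical, and the sketch stops as soon as the cumulative mass reaches p, which is one common convention rather than necessarily the exact rule used inside the library.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top words whose mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for word, prob in ranked:
        nucleus.append(word)
        cumulative += prob
        if cumulative >= p:
            break
    return {word: probs[word] / cumulative for word in nucleus}


# hypothetical next-word distribution
probs = {"alone": 0.4, "tired": 0.3, "happy": 0.15, "purple": 0.1, "seven": 0.05}

nucleus = top_p_filter(probs, p=0.8)
for word, prob in nucleus.items():
    print(word, round(prob, 3))
```

Here the two most likely words only cover 0.7 of the mass, so a third word is added to the nucleus before the remaining words are discarded.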

In [12]:
sample_output = GPT2.generate(
                             input_ids, 
                             do_sample = True, 
                             max_length = MAX_LEN, 
                             top_p = 0.8, 
                             top_k = 0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:
----------------------------------------------------------------------------------------------------
There are times when I am really tired of people, but I feel lonely too. I really wish I could be that lonely person."

And that's when he says that the key to solving his loneliness lies in being able to think more creatively, and not feel lonely at all.

He says that he sees thoughts in his head, ...


The main difference is that with Top-K sampling the size of the candidate set is fixed, whereas with Top-P sampling the size of the set can change. To use this sampling method, we just set top_k=0 and choose a top_p value, as shown in the code above.
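
To see how the candidate set adapts, the sketch below counts how many words are needed to reach a probability mass of 0.8 for two made-up distributions, one peaked and one nearly flat. The distributions and the helper `nucleus_size` are hypothetical.

```python
def nucleus_size(probs, p):
    """Count how many of the most probable words are needed to reach mass p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)


peaked = [0.90, 0.05, 0.03, 0.01, 0.01]  # the model is confident
flat = [0.22, 0.21, 0.20, 0.19, 0.18]    # the model is unsure

print(nucleus_size(peaked, 0.8))  # 1
print(nucleus_size(flat, 0.8))    # 4
```

A fixed k would keep the same number of candidates in both cases, whereas the nucleus shrinks when the model is confident and grows when it is unsure.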

In [13]:
# Combine all approaches together
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True, 
                              max_length = 2*MAX_LEN,  # test how long the generated text can get while staying coherent
                              #temperature = .7,
                              top_k = 50, 
                              top_p = 0.85, 
                              num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: There are times when I am really tired of people, but I feel lonely too. I know people who are not lonely. There are times when I am too sad, too scared, too embarrassed, too angry, too sad, too hungry, too tired, too tired, too scared, too angry, too sad, too tired, too scared, too hungry, too tired, too hungry, too tired, too tired, too tired, too tired, too tired, too tired, too tired, too tired, too tired, too tired, too tired, too tired, too tired, too tired, too tired, too tired, too tired, too tired, too tired, too tired, too...

1: There are times when I am really tired of people, but I feel lonely too. I know the people who I am working with are really good and it's not because I don't like them, I just really like the idea of helping them to do things that I have no interest in doing myself, I just love the idea that I could help them and that is the best part of wo

Clearly, the more sophisticated settings can give us pretty impressive results. It is time to explore this avenue further: we will use prompts taken from OpenAI's GPT-2 website, where they were fed to a full-sized GPT-2 model. This comparison will give us an idea of how well we are doing with a local (smaller) model compared to the full one used there.

In [14]:
MAX_LEN = 500

In [15]:
prompt1 = 'In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.'

# encode the prompt
input_ids = tokenizer.encode(prompt1, return_tensors='tf')

sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True, 
                              max_length = MAX_LEN,  # test how long the generated text can get while staying coherent
                              #temperature = .8,
                              top_k = 50, 
                              top_p = 0.85 
                              #num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

This is the first confirmed sighting of unicorns in the Andes, according to a press release from the University of Cambridge.

Scientists had previously believed that the unicorns might have been extinct. They had also believed that the unicorns might have been killed off by the glaciers that once covered their mountain range.

However, after conducting a detailed study of their language, scientists believe that the unicorns actually speak English in a language that resembles that of their European ancestors.

The scientists who spotted the unicorns used a high-tech equipment that can detect the faint traces of light that is emitted by the Earth's 

Note that we did not train anything here, and gpt2-large is smaller than the full-scale model. Even so, the result is pretty good.

As can be seen from the above examples, a GPT-2 model working out of the box (without fine-tuning) can already generate plausible-looking long-form text. Assessing the future impact of this technology on the field of communication remains an open and highly controversial issue.