In [1]:
#!pip install transformers

In [2]:
#for reproducibility
SEED = 34

#maximum length of the output text in tokens (prompt included), not words
MAX_LEN = 70


In [3]:
input_sequence = "I don't know about you, but there's only one thing I want to do after a long day of work"

In [4]:
#get transformers
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

#get large GPT2 tokenizer and GPT2 model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
GPT2 = TFGPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id)

#tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
#GPT2 = TFGPT2LMHeadModel.from_pretrained("gpt2-medium", pad_token_id=tokenizer.eos_token_id)

#tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
#GPT2 = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

#view model parameters
GPT2.summary()


All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLay  multiple                  774030080 
 er)                                                             
                                                                 
Total params: 774030080 (2.88 GB)
Trainable params: 774030080 (2.88 GB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [5]:
#get deep learning basics
import tensorflow as tf
tf.random.set_seed(SEED)

In [6]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode(input_sequence, return_tensors='tf')

# generate text until the output length (which includes the context length) reaches MAX_LEN tokens
greedy_output = GPT2.generate(input_ids, max_length = MAX_LEN)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens = True))

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work: go to the gym.

I'm not talking about the gym that's right next to my house. I'm talking about the gym that's right next to my office.

I'm not talking about the gym that
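The greedy output above starts repeating itself after a few sentences. A minimal sketch of why this can happen, assuming a toy lookup table instead of GPT-2: the table `NEXT_TOKEN_PROBS` and the helper `greedy_decode` are hypothetical and condition only on the previous token, whereas the real model conditions on the whole context. Because greedy decoding always takes the single most likely continuation, any cycle among the most likely tokens repeats until the length limit is hit.

```python
# Toy next-token table (hypothetical probabilities, not taken from GPT-2).
NEXT_TOKEN_PROBS = {
    "the": {"gym": 0.6, "office": 0.4},
    "gym": {"that": 0.7, ".": 0.3},
    "that": {"is": 0.8, "was": 0.2},
    "is": {"the": 0.9, ".": 0.1},
}


def greedy_decode(start, max_length):
    """Always append the single most likely next token."""
    tokens = [start]
    while len(tokens) < max_length:
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if candidates is None:  # no known continuation: stop early
            break
        tokens.append(max(candidates, key=candidates.get))
    return tokens


# max_length counts the starting token too, mirroring how the prompt
# counts towards max_length in the cell above.
greedy_tokens = greedy_decode("the", max_length=9)
print(" ".join(greedy_tokens))
```

The sampling cells further down avoid this kind of loop by drawing from the distribution instead of always taking its maximum.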


In [7]:
# beam search: set num_return_sequences > 1 to return several beams
beam_outputs = GPT2.generate(
    input_ids,
    max_length = MAX_LEN,
    num_beams = 5,
    no_repeat_ngram_size = 2,
    num_return_sequences = 5,
    early_stopping = True
)

print('')
print("Output:\n" + 100 * '-')

# now we have 5 output sequences
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))



Output:
----------------------------------------------------------------------------------------------------
0: I don't know about you, but there's only one thing I want to do after a long day of work, and that's to sit down and watch a movie."

"I know, I know," you say. "But you're not going to like this one. It's not a good movie. I mean, it's
1: I don't know about you, but there's only one thing I want to do after a long day of work, and that's to sit down and watch a movie."

"I know, I know," you say. "But you're not going to like this one. It's about a guy who has a crush on a girl
2: I don't know about you, but there's only one thing I want to do after a long day of work, and that's to sit down and watch a movie."

"I know, I know," you say. "But you're not going to like this one. It's about a guy who has a crush on a woman
3: I don't know about you, but there's only one thing I want to do after a long day of work, and that's to sit down and watch a movie."

"I know, I know," 

In [8]:
# sample from the full distribution (top_k = 0 disables top-k filtering);
# a temperature below 1 sharpens it, so low-probability candidates are picked less often
sample_output = GPT2.generate(
                             input_ids,
                             do_sample = True,
                             max_length = MAX_LEN,
                             top_k = 0,
                             temperature = 0.8
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True))


Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work."

"Hmm. Must be quite the choice of words."

"Well, it's not a choice of words, but a need. I can't find the right answer until I find my answer."

"


In [9]:
#sample only from the top_k most likely tokens
sample_output = GPT2.generate(
                             input_ids,
                             do_sample = True,
                             max_length = MAX_LEN,
                             top_k = 50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work. I want to get out of here and go jogging. To go jogging."

"That may be true, but I don't really have much money to spare!"

"That's true too. Why don ...


In [10]:
#nucleus sampling: sample only from the smallest set of tokens whose cumulative probability reaches top_p = 0.8
sample_output = GPT2.generate(
                             input_ids,
                             do_sample = True,
                             max_length = MAX_LEN,
                             top_p = 0.8,
                             top_k = 0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work: try out some dessert! Today I've got a total of four different fruit ice creams from The Baker's Dozen. I'm going to share three of them with you, each with a twist.

One was made ...
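The three sampling cells above each reshape the next-token distribution before drawing from it. The sketch below illustrates the idea on a four-token toy distribution in plain Python. It is an approximation for intuition only, not the library's implementation: the helper names (`softmax`, `renormalise`, `top_k_filter`, `top_p_filter`) and the toy numbers are assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalise(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalise(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalise(probs, keep)


# Toy values, chosen only for illustration.
toy_logits = [2.0, 1.0, 0.0]
base = softmax(toy_logits)                    # temperature = 1.0
sharp = softmax(toy_logits, temperature=0.8)  # more mass on the top token

toy_probs = [0.5, 0.3, 0.15, 0.05]
top2 = top_k_filter(toy_probs, k=2)      # only the two best tokens survive
nucleus = top_p_filter(toy_probs, p=0.9) # three tokens are needed to reach 0.9

print(base, sharp)
print(top2)
print(nucleus)
```

Top-k keeps a fixed number of candidates regardless of how the probability mass is spread, while top-p adapts the number of candidates to the shape of the distribution; that difference is why the two are often combined.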


In [11]:
#combine top-k and top-p sampling
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True,
                              max_length = 2*MAX_LEN,                              #double the length to test how long the output stays coherent
                              #temperature = .7,
                              top_k = 50,
                              top_p = 0.85,
                              num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: I don't know about you, but there's only one thing I want to do after a long day of work and this is one of it. I have to do something else. It's been quite an exciting couple of weeks at the office, haven't I?

Makes you wonder about the people who didn't get the memo that a long day of work is about to turn into a long day of fun....

1: I don't know about you, but there's only one thing I want to do after a long day of work: watch some movies on my bed!

So, I took a trip to my local mall to check out a new line of "couples" furniture. It's the same type of furniture that I saw on an episode of The Bachelor. It's the kind of furniture that makes me think that if I were to be on a reality TV show, I would be dating one of the characters from that show.

The first thing I noticed about the furniture was that there are no chairs. There is only one bed, a desk, a table, and tw

In [12]:
MAX_LEN = 150

In [13]:
prompt1 = 'In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.'

input_ids = tokenizer.encode(prompt1, return_tensors='tf')


In [14]:
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True,
                              max_length = MAX_LEN,                              #longer limit to test how long the output stays coherent
                              #temperature = .8,
                              top_k = 50,
                              top_p = 0.85
                              #num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')


Output:
----------------------------------------------------------------------------------------------------
0: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

According to National Geographic, scientists from the University of São Paulo, Brazil, discovered that the unicorns lived in a valley in the remote Andes mountains, near the village of Mato Grosso do Sul, on the Atlantic coast. The researchers found a number of large horned and bearded animals. They also found traces of humans living nearby, and the animals also carried traces of blood from humans.

The team found the unicorn herd and its human visitors in a valley where they had been watching a herd...



In [15]:
prompt2 = 'Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.'

input_ids = tokenizer.encode(prompt2, return_tensors='tf')


In [16]:
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True,
                              max_length = MAX_LEN,                              #longer limit to test how long the output stays coherent
                              #temperature = .8,
                              top_k = 50,
                              top_p = 0.85
                              #num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')


Output:
----------------------------------------------------------------------------------------------------
0: Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.

The former Disney star, 22, was spotted leaving the store on her bicycle with $500 in cash.

The news came on a day that Cyrus had been spotted in Beverly Hills with rapper Drake.

The two had met earlier in the day for a photo shoot, which resulted in them meeting the public.

Scroll down for video

The former Disney star was caught shoplifting from Abercrombie & Fitch on Hollywood Boulevard today

Cyrus was spotted leaving the store on her bicycle with $500 in cash

Cyrus was seen wearing a yellow top and white shorts with a blue and white striped...



In [17]:
prompt3 = 'Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry.'

input_ids = tokenizer.encode(prompt3, return_tensors='tf')


In [18]:
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True,
                              max_length = MAX_LEN,                              #longer limit to test how long the output stays coherent
                              #temperature = .8,
                              top_k = 50,
                              top_p = 0.85
                              #num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')


Output:
----------------------------------------------------------------------------------------------------
0: Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry.

All of the orc warbands, including the ones from which they were drawn, began charging, leaving the Alliance's ranks behind. The two sides battled with the orc warriors in close combat, and even though the two sides continued to clash, the battle seemed to go on forever. Then, all at once, the entire horde of orcs turned, as if to charge the approaching orcs. The battle began again, and the battle continued. The battle raged on for a good ten minutes or so, until the orcs were surrounded and routed. The battle continued on for another ten minutes or so, until all of the orcs were dead. All of the...



In [19]:
prompt4 = "For today’s homework assignment, please describe the reasons for the US Civil War."

input_ids = tokenizer.encode(prompt4, return_tensors='tf')


In [20]:
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True,
                              max_length = MAX_LEN,                              #longer limit to test how long the output stays coherent
                              #temperature = .8,
                              top_k = 50,
                              top_p = 0.85
                              #num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')


Output:
----------------------------------------------------------------------------------------------------
0: For today’s homework assignment, please describe the reasons for the US Civil War.

For more from The Week's Power Lunch, click here.

Follow @dgbxny...

