# TEXT SAMPLING EXAMPLE
https://huggingface.co/blog/how-to-generate
This notebook is an example of how to use the Hugging Face transformers library to generate text.

In [1]:
#@formatter:off
%load_ext autoreload
%autoreload 2
#@formatter:on


In [2]:
# this needs to point to the environment where the Hugging Face packages are installed
!which python

/opt/homebrew/Caskroom/miniforge/base/envs/huggingface/bin/python


# Below is an example of how to generate text using the library directly

In [24]:
import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer


tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='tf')


All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


## GREEDY SEARCH

In [13]:

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll
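
The repetition in the output above is typical of greedy search, which always appends the single most probable next token. A minimal sketch of that loop follows. The `NEXT_TOKEN_PROBS` table and the `greedy_decode` helper are hypothetical illustrations with made-up probabilities; they are not GPT-2 outputs and not how `model.generate` is implemented.

```python
# Toy next-token table (made-up probabilities, not GPT-2 outputs).
NEXT_TOKEN_PROBS = {
    "I": {"am": 0.6, "think": 0.4},
    "am": {"not": 0.7, "sure": 0.3},
    "not": {"sure": 0.8, "I": 0.2},
    "sure": {"I": 0.9, "<eos>": 0.1},
    "think": {"<eos>": 1.0},
}


def greedy_decode(start, max_length):
    """Repeatedly append the single most probable next token."""
    tokens = [start]
    while len(tokens) < max_length:
        candidates = NEXT_TOKEN_PROBS[tokens[-1]]
        best = max(candidates, key=candidates.get)
        if best == "<eos>":
            break
        tokens.append(best)
    return tokens


# Prints: I am not sure I am not sure I
print(" ".join(greedy_decode("I", max_length=9)))
```

Because the most probable path leads back to a token that was already visited, the toy decoder cycles until it reaches `max_length`, which mirrors the loop seen in the GPT-2 output.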


## BEAM SEARCH

In [14]:
# activate beam search and early_stopping
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I'm not sure if I'll
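
Beam search keeps several candidate sequences alive at each step instead of committing to the locally best token. The sketch below illustrates the idea on a toy table. `NEXT_TOKEN_PROBS` and `beam_search` are hypothetical names with made-up probabilities, and the code omits details such as length penalties and early stopping that a real implementation would handle.

```python
import math

# Toy next-token table (made-up probabilities, not GPT-2 outputs).
NEXT_TOKEN_PROBS = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "and": 0.05, "runs": 0.05},
    "car": {"is": 0.5, "drives": 0.3, "turns": 0.2},
}


def beam_search(start, steps, num_beams):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, summed log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in NEXT_TOKEN_PROBS.get(tokens[-1], {}).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


# With a single beam the search degenerates to greedy search.
greedy = beam_search("The", steps=2, num_beams=1)[0]
best = beam_search("The", steps=2, num_beams=2)[0]
print(" ".join(greedy[0]), round(math.exp(greedy[1]), 2))  # The nice woman 0.2
print(" ".join(best[0]), round(math.exp(best[1]), 2))      # The dog has 0.36
```

With two beams the search keeps "The dog" alive after the first step and so finds a sequence with higher overall probability than the greedy choice.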


In [15]:
# set no_repeat_ngram_size to 2
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a break
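
Setting `no_repeat_ngram_size=2` prevents any 2-gram from appearing twice. One way to picture this is as a ban list computed before each step: any token that would complete an n-gram already present in the sequence is excluded. The helper below, `banned_next_tokens`, is a hypothetical sketch of that idea and is not the library's own implementation.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return tokens that would complete an n-gram already present in tokens."""
    if ngram_size < 1 or len(tokens) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


# "I am" already occurred, so "am" may not follow the second "I".
print(banned_next_tokens("I am not sure I".split(), ngram_size=2))  # {'am'}
```

A decoder using this sketch would set the probability of every banned token to zero before picking the next token.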


In [16]:
# set num_return_sequences > 1
beam_outputs = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True
)

# now we have 5 output sequences
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a break
1: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to get back to
2: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to take a break
3: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to get back to
4: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about 

## RANDOM/SAMPLING SEARCH

In [17]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog but I doubt my cat will like me for playing with him too :</span> }

<span class="parry-arrow">Jasper Wiltshire <span class="spy-wrap
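
With `do_sample=True` and `top_k=0`, each next token is drawn at random from the full distribution, so even very unlikely tokens can appear, as in the output above. The sketch below shows a single such sampling step repeated several times. `NEXT_TOKEN_PROBS` and `sample_tokens` are hypothetical, the probabilities are made up, and Python's `random.Random` stands in for the TensorFlow generator seeded by `tf.random.set_seed`.

```python
import random

# Toy next-token distribution (made-up probabilities, not GPT-2 outputs).
NEXT_TOKEN_PROBS = {"dog": 0.5, "cat": 0.3, "park": 0.15, "the": 0.05}


def sample_tokens(probs, count, seed):
    """Draw count tokens at random in proportion to their probabilities."""
    rng = random.Random(seed)  # a fixed seed makes the draws reproducible
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=count)


first = sample_tokens(NEXT_TOKEN_PROBS, count=10, seed=0)
second = sample_tokens(NEXT_TOKEN_PROBS, count=10, seed=0)
print(first)
print(first == second)  # True: the same seed reproduces the same draws
```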


In [18]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0,
    temperature=0.7
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog and he enjoys all the food," said former employee Jane Strossen. "He's been there before and I can understand why he would take the time to get on the ride."

He said she
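
Temperature rescales the model's scores before they are turned into probabilities. A value below 1 sharpens the distribution, so likely tokens become more likely and unlikely ones less so. The sketch below illustrates this with a standard temperature-scaled softmax; `softmax_with_temperature` and the logits are hypothetical and chosen only for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature in (1.0, 0.7):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At `temperature=0.7` the top token receives a larger share of the probability mass than at `temperature=1.0`, which is why the sampled text above is more coherent than the unscaled sample.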


## TOP K SAMPLING

In [19]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# set top_k to 50
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, Lora, because we work together as a team and I find her fun, kind and caring. Sometimes she won't be there or that we would spend time apart when we go out together. In all
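
Top-k sampling restricts each draw to the k most probable tokens and redistributes the probability mass among them. A minimal sketch of that filtering step follows. `top_k_filter` and the toy distribution are hypothetical, and a real implementation would operate on logits over the whole vocabulary rather than on a small dictionary.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Toy distribution (made-up probabilities, not GPT-2 outputs).
probs = {"dog": 0.5, "cat": 0.25, "park": 0.15, "the": 0.1}
filtered = top_k_filter(probs, k=2)
print(sorted(filtered))  # ['cat', 'dog']
```

Sampling then proceeds from the filtered distribution, so tokens outside the top k can never be chosen.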


## TOP P (NUCLEUS) SAMPLING

In [20]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# deactivate top_k sampling and sample only from the smallest set of most likely words whose cumulative probability exceeds 92%
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_p=0.92,
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog and talking on the phone, but my work schedule stops just after midnight. I'd have to haul up to 5 or 6 people to bring them home, but my daughter starts to overeat as well, I
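
Unlike top-k, top-p sampling keeps a variable number of tokens: the smallest set whose cumulative probability reaches `top_p`. The sketch below shows how the size of that set adapts to the shape of the distribution. `top_p_filter` and both toy distributions are hypothetical and use made-up probabilities.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Toy distributions (made-up probabilities, not GPT-2 outputs).
peaked = {"dog": 0.9, "cat": 0.05, "park": 0.03, "the": 0.02}
flat = {"dog": 0.3, "cat": 0.3, "park": 0.2, "the": 0.2}
print(len(top_p_filter(peaked, 0.92)), len(top_p_filter(flat, 0.92)))  # 2 4
```

When the model is confident, only a couple of tokens survive the filter; when the distribution is flat, many more remain available for sampling.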


In [21]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog. I love my puppy. I just need one more puppy to stay with me forever" She smiles brightly at the thought of becoming a full-fledged owner.

"I just want to be part of
1: I enjoy walking with my cute dog.

3. Use the same rules for using the same keys

"All keys of the same key use the same pattern."

4. Use the same language

"All texts,
2: I enjoy walking with my cute dog," she says.

"I'm a bit surprised how much people think it's too expensive to adopt them," says Karen.

"My son is in the care of a friend," her son


# Below is an example of how to generate text using the sampling wrapper

In [None]:
# add the parent directory to sys.path so that src_clean can be imported
import sys
import pathlib
import os
sys.path.append(os.path.join(pathlib.Path('.').parent.resolve(),'..'))


from src_clean import Sampling
from src_clean import SamplingEnums as ENUMS

from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

seed=0

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='tf')
sampling = Sampling(model)

## GREEDY SEARCH

In [10]:
_ = sampling.print(input_ids, ENUMS.GREEDY, tokenizer, max_length=50)


----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll


## BEAM SEARCH

In [31]:
_ = sampling.print(input_ids, ENUMS.BEAM, tokenizer)


0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I'm not sure if I'll


## RANDOM/SAMPLING SEARCH

In [38]:
_ = sampling.print(input_ids, ENUMS.RANDOM, tokenizer, seed=seed)

----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I


## TOP K SAMPLING

In [39]:
_ = sampling.print(input_ids, ENUMS.TOP_K, tokenizer, seed=seed)


----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog. For two or three nights a month, I ride my bike around campus, often stopping at night on long hikes...Alternatively my husband can get us a warmer spot to spend our summer. We come to this


## TOP P SAMPLING

In [40]:
_ = sampling.print(input_ids, ENUMS.TOP_P, tokenizer, seed=seed)

----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog and watching people perform experiments with barcoded handcuffs. Apparently Roonoid doesn't have some trouble, but we definitely appreciate the fact that he has already been holding his own at the last moment with a quest of
