### TODO Recording:

##### Show HuggingFace

- Go to https://huggingface.co/
- Click on Models, Datasets, Spaces, and Docs in the top bar and show each
- Back to Models, search for "gpt2"
- Show the model card

In [1]:
!pip install transformers



Use the `pipeline` API for text generation. GPT-2 is also the pipeline's default model for this task, but it is passed explicitly here.

In [5]:
from transformers import pipeline

text_gen_pl = pipeline("text-generation", model = 'gpt2')

In [29]:
from transformers import set_seed

set_seed(42)

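### Default pipeline output

With no decoding arguments, this pipeline call does not appear to be purely greedy: the outputs vary, and requesting three return sequences only works when sampling or beam search is enabled. Pass `do_sample = False` to force greedy decoding.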

In [30]:
sentence = text_gen_pl(
    "Today was a hard day at the office",
    max_length = 64,
)

sentence

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Today was a hard day at the office but what I realized in that short span of time was that nobody could replace anyone in the room except those who were there and were just doing so to do what happened with the media," says Daley.\n\n"It also took a lot of thought, and maybe a little'}]

In [31]:
sentence[0]["generated_text"]

'Today was a hard day at the office but what I realized in that short span of time was that nobody could replace anyone in the room except those who were there and were just doing so to do what happened with the media," says Daley.\n\n"It also took a lot of thought, and maybe a little'

In [32]:
sentences = text_gen_pl(
    "Today was a hard day at the office",
    max_length = 64,
    num_return_sequences = 3
)

for sentence in sentences:
  print("="*30)
  print(sentence["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Today was a hard day at the office. And, it was a lot of work to do there in the evenings. I had a lot to do. As this project came together, some issues got settled and some things did not. One was the relationship between the board and me in terms of the management and the administration
Today was a hard day at the office. The first couple of weeks have been quiet, although one thing is for certain. I am glad this has finally ended. To find out your secrets, I can't help but smile, but, as usual, some people with such a sensitive mind will find it impossible to concentrate
Today was a hard day at the office for me so I was happy to be back on record, I will tell you what the experience will be for everyone when I do speak there."

Theresa May will also speak before the Cabinet later this week.

She said: "We will be speaking to all
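
The difference between greedy decoding and sampling can be sketched on a single, made-up next-token distribution. The tokens and probabilities here are hypothetical (they are not GPT-2 outputs), and `greedy_pick` and `sample_pick` are illustrative helper names, not `transformers` functions.

```python
import random

# Hypothetical next-token probabilities (illustrative only, not from GPT-2).
next_token_probs = {" but": 0.5, ".": 0.3, " for": 0.2}

def greedy_pick(probs):
    """Greedy decoding: always take the most probable token."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Sampling: draw a token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(42)  # fixed seed, in the same spirit as set_seed(42)
greedy_picks = {greedy_pick(next_token_probs) for _ in range(20)}
sampled_picks = {sample_pick(next_token_probs, rng) for _ in range(20)}
print(greedy_picks)   # a single token, however often we repeat the pick
print(sampled_picks)  # typically several different tokens
```

Greedy decoding is deterministic, which is why it cannot return several distinct sequences for the same prompt, while sampling can.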


### Beam Search
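
As a rough illustration of what `num_beams` changes, here is a minimal beam search over a hand-written toy model. `TOY_LM` and its probabilities are invented for this sketch, and the real `generate` implementation also handles batching, length penalties, and early stopping, all of which are omitted here.

```python
import math

# Hypothetical toy "language model": next-token probabilities given the last
# token only. All values are made up for illustration.
TOY_LM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"day": 0.55, "office": 0.45},
    "a": {"hard": 0.9, "day": 0.1},
    "hard": {"day": 1.0},
    "day": {"</s>": 1.0},
    "office": {"</s>": 1.0},
}

def beam_search(num_beams, max_steps=4):
    """Return the final beams as (log-probability, tokens), best first."""
    beams = [(0.0, ["<s>"])]
    for _ in range(max_steps):
        candidates = []
        for score, seq in beams:
            if seq[-1] == "</s>":
                candidates.append((score, seq))  # finished, carry over
                continue
            for token, prob in TOY_LM[seq[-1]].items():
                candidates.append((score + math.log(prob), seq + [token]))
        # Keep only the num_beams highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams

for n in (1, 2):
    score, tokens = beam_search(num_beams=n)[0]
    print(n, " ".join(tokens), round(math.exp(score), 2))
```

With a single beam the search commits to "the" because it is the most likely first token, and ends with total probability 0.33. With two beams it keeps "a" alive long enough to find "a hard day", which has the higher total probability of 0.36.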

In [33]:
sentences = text_gen_pl(
    "Today was a hard day at the office",
    max_length = 64,
    num_beams=2,
)

for sentence in sentences:
  print("="*30)
  print(sentence["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Today was a hard day at the office. We were trying to get things done. We had to start over.

"It was a tough day for us. We had to start over. We had to start over. It's a tough day for me personally, but it's a tough day."




In [34]:
sentences = text_gen_pl(
    "Today was a hard day at the office",
    max_length = 64,
    num_beams=5,
)

for sentence in sentences:
  print("="*30)
  print(sentence["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Today was a hard day at the office.

"I think it was a good day," he said. "I think it was a good day for the team. I think it was a good day for the organization. I think it was a good day for the players. I think it was a good day


In [35]:
sentences = text_gen_pl(
    "Today was a hard day at the office",
    max_length = 64,
    num_beams = 5,
    no_repeat_ngram_size = 2
)

for sentence in sentences:
  print("="*30)
  print(sentence["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Today was a hard day at the office for me. I had to do a lot of things to make sure I was doing everything I could to keep my job and my family together.

"I'm really grateful for the support that I've received over the years, and I'm looking forward to working with the


In [36]:
sentences = text_gen_pl(
    "Today was a hard day at the office",
    max_length = 64,
    num_beams = 5,
    no_repeat_ngram_size = 4
)

for sentence in sentences:
  print("="*30)
  print(sentence["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Today was a hard day at the office.

I had no idea what to do.

The first thing I did was go to the bathroom.

"What are you doing?" I asked.

He shook his head. "I don't know."

I looked at him. "
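
The effect of `no_repeat_ngram_size` can be sketched as a rule that bans any next token which would complete an n-gram already present in the text. The helper `banned_next_tokens` is a hypothetical name for this illustration and works on whole words, whereas `transformers` applies the rule to token ids.

```python
def banned_next_tokens(tokens, ngram_size):
    """Tokens that would repeat an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens are the prefix of the n-gram about to form.
    prefix = tokens[len(tokens) - (ngram_size - 1):]
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tokens[i:i + ngram_size]
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

history = "We had to start over . We had".split()
for n in (1, 2, 4):
    print(n, sorted(banned_next_tokens(history, n)))
```

With size 2, "had to" has already occurred, so "to" is banned after the second "had". With size 4 nothing is banned yet, and with size 1 every word seen so far is banned, which is a very strict setting.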


### Generate top beams and pick the best

In [37]:
sentences = text_gen_pl(
    "Today was a hard day at the office",
    max_length = 64,
    num_beams = 5,
    num_return_sequences=5,
    no_repeat_ngram_size = 2
)

for sentence in sentences:
  print("="*30)
  print(sentence["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Today was a hard day at the office. It was difficult for me to get out of bed, and I was trying to figure out what to do next.

"I was doing a lot of research on what I could do to make sure I wasn't doing something that would get me hurt. I had to
Today was a hard day at the office. It was difficult for me to get out of bed, and I was trying to figure out what to do next.

"I was doing a lot of research on what I could do to make sure I wasn't doing something that would get me hurt. I don't
Today was a hard day at the office. It was difficult for me to get out of bed, and I was trying to figure out what to do next.

"I was doing a lot of research on what I could do to make sure I wasn't doing something that would get me hurt. I had no
Today was a hard day at the office. It was difficult for me to get out of bed, and I was trying to figure out what to do next.

"I was doing a lot of research on what I could do to make sure I wasn't doing something that would get me hurt. I'd been
Today

### Sampling

In [38]:
sentences = text_gen_pl(
    "Today was a hard day at the office",
    max_length = 64,
    do_sample = True,
    top_k = 0,
)

for sentence in sentences:
  print("="*30)
  print(sentence["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Today was a hard day at the office for Local 6 News. Robert Fraser, the spokesman for Maryland Gov a Democrat denied Howell's activity. He also tapped his son Tim to replace him as chief operating officer. Marci Otefango is the vice president for communications at Desktop TechnologiesUSA. Payne is supervising social


### Sampling with a lower temperature makes the output less random, i.e. less creative

In [39]:
sentences = text_gen_pl(
    "Today was a hard day at the office",
    max_length = 64,
    do_sample = True,
    top_k = 0,
    temperature = 0.4,
)

for sentence in sentences:
  print("="*30)
  print(sentence["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Today was a hard day at the office for me, and I'm looking forward to seeing you again."

The man who was in charge of the office was a former police officer who had been in the office since 2006.

He said the officers had been "very, very clear" that they were not


### Sampling with a higher temperature can generate incomprehensible text

In [40]:
sentences = text_gen_pl(
    "Today was a hard day at the office",
    max_length = 64,
    do_sample = True,
    top_k = 0,
    temperature = 1.4,
)

for sentence in sentences:
  print("="*30)
  print(sentence["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Today was a hard day at the office. Cigger began street work imagining Stephanie sitting toller on her next dollars gake equally stale bread shake served on what Claude Tokdepth BackType Signature was called Herschel Pellea Eggsahouse Simmons describes as the below all-moons fill taff... so moons mornings
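
A small sketch of why temperature has this effect: the logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. The logits here are hypothetical scores for three candidate tokens, and `softmax_with_temperature` is an illustrative helper, not a `transformers` function.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.4, 1.0, 1.4):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At temperature 0.4 most of the probability mass sits on the top token. At 1.4 the unlikely tokens receive noticeably more mass, which is where the garbled output comes from.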


In [41]:
sentences = text_gen_pl(
    "Today was a hard day at the office",
    max_length = 64,
    do_sample = True,
    top_k = 3,
)

for sentence in sentences:
  print("="*30)
  print(sentence["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Today was a hard day at the office.

"We're going to be back in a couple of days. I'm going out to dinner and I'm going to get ready for work tomorrow," he told the crowd, adding, "We're going to have to get out of here. I've got a


In [42]:
sentences = text_gen_pl(
    "Today was a hard day at the office",
    max_length = 64,
    do_sample = True,
    top_k = 25,
)

for sentence in sentences:
  print("="*30)
  print(sentence["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Today was a hard day at the office as I sat in the room with a group of my friends watching us play the new season. The other group was trying to figure out what to do. We were working on a new song and songlist and I didn't want anyone to know the song we were working on.
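
Top-k sampling can be sketched as: keep the `k` most probable tokens, renormalise, then sample from what is left. The distribution here is hypothetical and `top_k_filter` is an illustrative helper. In the earlier sampling cells, `top_k = 0` is the value commonly used to switch this filter off.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}

# Hypothetical next-token distribution (not from GPT-2).
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(top_k_filter(probs, 2))
print(top_k_filter(probs, 3))
```

A small `k` such as 3 leaves only safe, high-probability continuations. A larger `k` such as 25 lets less likely tokens through and gives more varied text.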


### Nucleus sampling

In [43]:
sentences = text_gen_pl(
    "Today was a hard day at the office",
    max_length = 64,
    do_sample = True,
    top_p = 0.88,
)

for sentence in sentences:
  print("="*30)
  print(sentence["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Today was a hard day at the office for the U.S. Embassy in Beirut, where the three Americans were detained and charged with kidnapping, murder, and other crimes, and a year earlier in Tripoli. We had lost our way, and our men were shot down. It is our hope that the United States will


In [44]:
sentences = text_gen_pl(
    "Today was a hard day at the office",
    max_length = 64,
    do_sample = True,
    top_p = 0.5,
)

for sentence in sentences:
  print("="*30)
  print(sentence["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Today was a hard day at the office. I had to make sure I had a nice clean look on my face and I had to keep my eyes on the screen and I had to keep my mouth shut.

The day before, I was at the office with my friends and I had a good time. I
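
Nucleus (top-p) sampling keeps the smallest set of most probable tokens whose cumulative probability reaches `top_p`, so the number of candidates adapts to how confident the model is. A minimal sketch on a hypothetical distribution, with `top_p_filter` as an illustrative helper name:

```python
def top_p_filter(probs, top_p):
    """Keep the smallest high-probability set whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Hypothetical next-token distribution (not from GPT-2).
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(top_p_filter(probs, 0.5))
print(top_p_filter(probs, 0.88))
```

With `top_p = 0.5` only the single most likely token survives in this example, while `top_p = 0.88` keeps three candidates, which mirrors the more conservative text produced by the lower setting.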


### Sampling with top-k and top-p

In [45]:
sentences = text_gen_pl(
    "Today was a hard day at the office",
    max_length = 64,
    do_sample = True,
    top_k = 30,
    top_p = 0.8,
)

for sentence in sentences:
  print("="*30)
  print(sentence["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Today was a hard day at the office, a tough one, for many of us.

And the world's biggest news organizations were all too aware.

A new report on CNN revealed that the US government has paid the National Security Agency (NSA) more than $200 million since 2002 to monitor, track


### Using a prompt

Here we load the GPT-2 tokenizer and model directly and generate text from a prompt ("The economy" in the cells below).

In [46]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

In [47]:
tokenizer

GPT2Tokenizer(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}

In [48]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

Background on the attention mask and pad token warnings printed by the cells below: https://www.reddit.com/r/KoboldAI/comments/yz26ol/how_to_fix_the_attention_mask_and_the_pad_token/

In [58]:
# Prompt for text generation
prompt = "The economy"

# Generate text
input_ids = tokenizer.encode(prompt, return_tensors = "pt")

output = model.generate(
    input_ids,
    max_length = 128,
    num_return_sequences = 1,
    no_repeat_ngram_size = 2,
    do_sample = True,
    top_k = 50,
    top_p = 0.92)

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens = True)

print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The economy has been steadily increasing, with inflation steadily rising and wage growth steadily slowing. It is possible that many Americans still want a job and can be easily lured into one by an increase in living standards. However, the vast majority of Americans have little desire to work. So it is difficult for the government to help them find employment by hiring private contractors.

For example, over the past 10 years, U.S. companies have employed more than 10 million people—with nearly 90% of the workers employed by private firms. The percentage of total workers being employed is expected to increase by nearly 20% in 2016. Additionally, as
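
The warning printed above asks for an `attention_mask` and a `pad_token_id`. One way to supply both is sketched here. `generate_with_attention_mask` is a hypothetical helper name, and it assumes a tokenizer and model loaded as in the earlier cells. Calling the tokenizer directly, rather than `tokenizer.encode`, returns the attention mask alongside the input ids.

```python
def generate_with_attention_mask(model, tokenizer, prompt, **generate_kwargs):
    """Generate text while passing the attention mask and pad token explicitly."""
    # tokenizer(...) returns both `input_ids` and `attention_mask`.
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        # GPT-2 has no dedicated pad token, so reuse the end-of-sequence id.
        pad_token_id=tokenizer.eos_token_id,
        **generate_kwargs,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

A call such as `generate_with_attention_mask(model, tokenizer, "The economy", max_length = 128, do_sample = True, top_k = 50, top_p = 0.92)` should produce the same kind of output without the two warnings.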


In [60]:
# Prompt for text generation
prompt = "The economy"

# Generate text
input_ids = tokenizer.encode(prompt, return_tensors = "pt")

output = model.generate(
    input_ids,
    max_length = 128,
    num_beams = 20,
    no_repeat_ngram_size = 1,  # size 1: no token already in the sequence may be generated again
)

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens = True)

print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The economy is expected to grow at an annual rate of 2.5 percent this year, the International Monetary Fund (IMF) said in a report released on Friday."This growth will be accompanied by strong job creation and higher wages for all workers," IMF Director-General Christine Lagarde told reporters as she met with Chinese Premier Li Keqiang during her first official visit abroad since taking over from former President Hu Jintao following his ouster last month", The Wall Street Journal reported.


In [61]:
# Prompt for text generation
prompt = "The economy"

# Generate text
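# Words that must appear in the generated text
force_words = ["inflation"]

input_ids = tokenizer.encode(prompt, return_tensors = "pt")
# Note: the word is tokenized without a leading space here, so it can end up
# glued to the previous word in the output (see "orinflation" below).
force_words_ids = tokenizer(force_words, add_special_tokens=False).input_ids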

output = model.generate(
    input_ids,
    force_words_ids = force_words_ids,
    max_length = 128,
    num_beams = 20,
    no_repeat_ngram_size = 1,
    remove_invalid_values = True,
)

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens = True)

print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The economy as a whole is growing at an annual rate of 2.5% per annum, according to the Bank for International Settlements (BIS). That's more than double what it was five years ago and nearly three times its pre-recession level in 2008."


"Inflation has been rising steadily since 2009," he said by phone from New York on Wednesday afternoon after meeting with his counterpart Joseph Stiglitz who will be visiting China this week before taking office next month. "I think we're seeing some signs that things are getting better but I don't know how much longer they'll stay there orinflation


Greedy output generation: we can see repetitions of sequences.

### Guided text generation with constraints

Reference: https://huggingface.co/blog/constrained-beam-search#example-2-disjunctive-constraints
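
The constraint used in this section has two parts: one word that must appear ("profits") and a list of alternatives of which at least one must appear ("grow", "growth", "grew", "grown"). A rough word-level check of that rule is sketched here. `satisfies_constraints` is a hypothetical helper, and it only approximates what constrained beam search enforces, since the library works on token ids rather than whole words.

```python
import re

def satisfies_constraints(text, required_word, flexible_words):
    """True if `required_word` and at least one of `flexible_words` occur in `text`."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return required_word in words and any(w in words for w in flexible_words)

flexible = ["grow", "growth", "grew", "grown"]
print(satisfies_constraints("The profits grow", "profits", flexible))
print(satisfies_constraints("The profits fell", "profits", flexible))
```

The first call satisfies both parts of the constraint. The second contains the required word but none of the alternatives, so it fails.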

In [62]:
force_word = "profits"
force_flexible = ["grow", "growth", "grew", "grown"]

force_words_ids = [
    tokenizer([force_word], add_prefix_space = True, add_special_tokens = False).input_ids,
    tokenizer(force_flexible, add_prefix_space = True, add_special_tokens = False).input_ids,
]

starting_text = ["The cost", "The business"]

input_ids = tokenizer(starting_text, return_tensors = "pt").input_ids


outputs = model.generate(
    input_ids,
    force_words_ids = force_words_ids,
    num_beams = 10,
    num_return_sequences = 1,
    no_repeat_ngram_size = 2,
    remove_invalid_values = True,
    max_new_tokens = 50,
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(outputs[0], skip_special_tokens = True))
print(tokenizer.decode(outputs[1], skip_special_tokens = True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
The cost of the project is estimated to be around $1.5 million.

The project will be the first of its kind in the United States, and it is expected to cost between $100 million and $200 million to build. The profits grow
The business, which is based in the United States, has been in business for more than a decade.

"We're very pleased to be able to continue to grow our business in a way that is consistent with our commitment to our customers and our profits
