Code from

https://huggingface.co/gpt2

Information about fine-tuning the model:

https://towardsdatascience.com/how-to-fine-tune-gpt-2-for-text-generation-ae2ea53bc272

Further reading about autoregressive generation:

https://towardsdatascience.com/text-generation-gpt-2-lstm-markov-chain-9ea371820e1e
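As a rough illustration of what "autoregressive" means here, the sketch below generates one token at a time and feeds everything produced so far back in as context. It is a toy under stated assumptions, not GPT-2: the bigram table `TOY_BIGRAMS`, the helper `toy_scores`, and the function `generate` are all hypothetical names made up for this example, and the "model" only looks at the last token.

```python
def generate(next_token_scores, prompt, max_new_tokens, eos="<eos>"):
    """Greedy autoregressive decoding: pick the best-scoring token at each step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # The scores for the next token depend on the tokens generated so far.
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)
        if best == eos:
            break  # stop early once the end-of-sequence token wins
        tokens.append(best)
    return tokens


# Hypothetical toy "model": next-token scores conditioned on the last token only.
TOY_BIGRAMS = {
    "I": {"am": 0.9, "live": 0.1},
    "am": {"an": 0.8, "I": 0.2},
    "an": {"astronaut": 1.0},
    "astronaut": {"<eos>": 1.0},
}


def toy_scores(tokens):
    # Unknown contexts fall back to ending the sequence.
    return TOY_BIGRAMS.get(tokens[-1], {"<eos>": 1.0})


result = generate(toy_scores, ["I"], max_new_tokens=10)
print(result)
```

The `generator(...)` calls in the cells below follow the same loop in spirit, except that GPT-2 conditions on the whole prefix and the pipeline samples from the scores instead of always taking the maximum.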

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")

model = AutoModelForCausalLM.from_pretrained("gpt2")

In [3]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("Hello, I'm a language model, and as such, I", max_length=30, num_return_sequences=5)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, and as such, I am here to introduce it to everyone. Languages, after all, are a means to"},
 {'generated_text': "Hello, I'm a language model, and as such, I'm in no way an expert in writing such a model. For one thing, I"},
 {'generated_text': "Hello, I'm a language model, and as such, I've created a new framework that works with Python's built-in type system. I"},
 {'generated_text': "Hello, I'm a language model, and as such, I was never really satisfied with the way language models evolve after that.\n\nQ:"},
 {'generated_text': "Hello, I'm a language model, and as such, I think that one can be taught and applied as well. (...) It's really not"}]

In [7]:
generator("I am an astronaut and I live in", max_length=30, num_return_sequences=5)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I am an astronaut and I live in a family living in Atlanta, and I like people that want to understand the world, and to look at their'},
 {'generated_text': 'I am an astronaut and I live in a country with limited resources. Why do we have only 30 times the amount of energy per night you need to'},
 {'generated_text': 'I am an astronaut and I live in an apartment that has been built with the greatest effort and energy into the first week."\n\nI went on'},
 {'generated_text': 'I am an astronaut and I live in a small town in Hawaii. Because I am an astronaut, I usually get to stay in the United States and'},
 {'generated_text': 'I am an astronaut and I live in a world with space, and I am very curious as to what goes on if you happen to notice that the'}]

In [9]:
generator("The capital of the country that I am living is Berlin. The country is called", max_length=30, num_return_sequences=5)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "The capital of the country that I am living is Berlin. The country is called Berlin and I have lived there for 27 years now. I didn't"},
 {'generated_text': 'The capital of the country that I am living is Berlin. The country is called Berlin, because it is now a bustling exporter of goods and raw'},
 {'generated_text': 'The capital of the country that I am living is Berlin. The country is called Berlin, as does Germany. Of course our city is a little tiny'},
 {'generated_text': 'The capital of the country that I am living is Berlin. The country is called GDR. Berlin is my home."\n\nWhen asked about the'},
 {'generated_text': 'The capital of the country that I am living is Berlin. The country is called Berlin without Berlin. The capital of Berlin is Berlin without Berlin. It'}]

In [10]:
generator("No more of talk where god or angel guests", max_length=30, num_return_sequences=5)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'No more of talk where god or angel guests show her how to be happy.\n\n5. It looks as though the guy at the bar was'},
 {'generated_text': "No more of talk where god or angel guests are seen, but rather get them the hell out of here for now. There's nothing much to worry"},
 {'generated_text': 'No more of talk where god or angel guests stand around doing nothing." — A.D. 1292, "The Children of Israel" (K'},
 {'generated_text': 'No more of talk where god or angel guests need advice on how to take your family to the grocery store or buy an item before going to the store'},
 {'generated_text': 'No more of talk where god or angel guests are invited."\n\nWhat other things does this remind us about what the Bible says about men and other'}]

Probability attempt

https://discuss.huggingface.co/t/generation-probabilities-how-to-compute-probabilities-of-output-scores-for-gpt2/3175/15

In [11]:
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer


gpt2 = AutoModelForCausalLM.from_pretrained("gpt2", return_dict_in_generate=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

input_ids = tokenizer("Today is a nice day", return_tensors="pt").input_ids

generated_outputs = gpt2.generate(input_ids, num_beams=2, num_return_sequences=2, output_scores=True, length_penalty=0)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [16]:
help(generated_outputs)

Help on BeamSearchDecoderOnlyOutput in module transformers.generation_utils object:

class BeamSearchDecoderOnlyOutput(transformers.file_utils.ModelOutput)
 |  BeamSearchDecoderOnlyOutput(sequences: torch.LongTensor = None, sequences_scores: Union[torch.FloatTensor, NoneType] = None, scores: Union[Tuple[torch.FloatTensor], NoneType] = None, beam_indices: Union[Tuple[Tuple[torch.LongTensor]], NoneType] = None, attentions: Union[Tuple[Tuple[torch.FloatTensor]], NoneType] = None, hidden_states: Union[Tuple[Tuple[torch.FloatTensor]], NoneType] = None) -> None
 |  
 |  Base class for outputs of decoder-only generation models using beam search.
 |  
 |  Args:
 |      sequences (`torch.LongTensor` of shape `(batch_size*num_return_sequences, sequence_length)`):
 |          The generated sequences. The second dimension (sequence_length) is either equal to `max_length` or shorter
 |          if all batches finished early due to the `eos_token_id`.
 |      sequences_scores (`torch.FloatTensor` of

In [22]:
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer


gpt2 = AutoModelForCausalLM.from_pretrained("gpt2", return_dict_in_generate=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

input_ids = tokenizer("Today is a nice day", return_tensors="pt").input_ids

generated_outputs = gpt2.generate(input_ids, do_sample=True, num_return_sequences=3, output_scores=True)

# only use the ids that were generated (drop the prompt tokens)
# gen_sequences has shape [3, 15]
gen_sequences = generated_outputs.sequences[:, input_ids.shape[-1]:]

# let's stack the logits generated at each step to a tensor and transform
# logits to probs
probs = torch.stack(generated_outputs.scores, dim=1).softmax(-1)  # -> shape [3, 15, vocab_size]

# now we need to collect the probability of the generated token
# we need to add a dummy dim in the end to make gather work
gen_probs = torch.gather(probs, 2, gen_sequences[:, :, None]).squeeze(-1)

# now we can do all kinds of things with the probs

# 1) the probs that exactly those sequences are generated again
# those are normally going to be very small
unique_prob_per_sequence = gen_probs.prod(-1)

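# 2) normalize the probs over the three sequences (separately at each step)
normed_gen_probs = gen_probs / gen_probs.sum(0)
# compare with a tolerance: an exact float equality check can fail due to rounding
assert torch.allclose(normed_gen_probs[:, 0].sum(), torch.tensor(1.0)), "probs should be normalized"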

# 3) compare normalized probs to each other like in 1)
unique_normed_prob_per_sequence = normed_gen_probs.prod(-1)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [23]:
unique_normed_prob_per_sequence

tensor([6.8899e-13, 1.8545e-12, 8.7922e-11])
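The products above are already around 1e-11 after only 15 generated tokens, and for longer sequences a product of probabilities can underflow to zero. A common workaround is to sum log-probabilities instead. The sketch below shows the arithmetic on made-up numbers using only the standard library; `softmax`, `sequence_log_prob`, `toy_step_logits` and `toy_token_ids` are hypothetical names for this illustration and are not part of the notebook or of transformers.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_log_prob(step_logits, token_ids):
    """Sum the log-probability of the chosen token at every generation step."""
    log_prob = 0.0
    for logits, token_id in zip(step_logits, token_ids):
        probs = softmax(logits)
        log_prob += math.log(probs[token_id])  # "gather" the chosen token
    return log_prob


# Toy data: 3 generation steps over a hypothetical 4-token vocabulary.
toy_step_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 1.5, 0.3, 0.2],
    [-0.5, 0.0, 3.0, 1.0],
]
toy_token_ids = [0, 1, 2]  # the token that was "generated" at each step

toy_log_prob = sequence_log_prob(toy_step_logits, toy_token_ids)
toy_prob = math.exp(toy_log_prob)  # same value as multiplying the probabilities
print(toy_log_prob, toy_prob)
```

With the tensors from the cell above, the analogous step would be something like `torch.log(gen_probs).sum(-1)`, which ranks the sequences in the same order as `gen_probs.prod(-1)` but stays well inside floating-point range.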

In [35]:
tokenizer.batch_decode(generated_outputs.sequences)

['Today is a nice day!" was all I could think to myself. However, this day\'s victory',
 'Today is a nice day for the American people to remember the brave soldiers that the country fought in WWII',
 "Today is a nice day for a number of reasons: One, there's so many wonderful restaurants and"]