In [None]:
!pip install transformers
!pip install datasets

# Chapter 5: Text Generation 


**Decoding** is the process of turning the model's output distribution into text, one token at a time. Given a prompt, a language model predicts the next token, appends it to the input, and repeats this at every time step until it predicts a special end token or reaches a maximum length. The goal of decoding is to find the most probable output sequence. Since searching over all possible sequences is intractable, we rely on approximations.
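The search problem can be written out as a short sketch. The notation is an assumption made here for illustration: $x$ is the prompt, $y_1, \dots, y_N$ are the generated tokens, and $V$ is the vocabulary size.

$$P(y_1, \dots, y_N \mid x) = \prod_{t=1}^{N} P(y_t \mid y_{<t}, x), \qquad \hat{y} = \arg\max_{y} P(y \mid x)$$

With $V$ possible tokens at each of $N$ steps there are $V^N$ candidate sequences, which is why the argmax cannot be evaluated exactly.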



* Autoregressive/Causal LMs 

* Pre-training of GPT and BERT 

## Greedy Search Decoding 

The simplest method is to greedily select the token with the highest probability at each time step. 
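In symbols, with $x$ as the prompt and $y_{<t}$ as the tokens generated so far (notation assumed here for illustration), the greedy choice at time step $t$ is:

$$\hat{y}_t = \arg\max_{y_t} P(y_t \mid y_{<t}, x)$$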





In [None]:
# Load the model 
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
# model_name = "gpt2-xl"
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

**Exercise:** Try to implement the greedy algorithm using the loaded model. 

Decoding runs for 8 time steps. At each time step:
1. Take the model's logits for the last token of the current input and apply a softmax to get a probability distribution over the next token. 
2. Pick the token with the highest probability and append it to the input. 

In [24]:
# Solution 
import pandas as pd

input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
iterations = []
n_steps = 8
choices_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        # Select logits of the first batch and the last token and apply softmax
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        # Store tokens with highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        # Append predicted next token to input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)

pd.DataFrame(iterations)

| | Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | Transformers are the | most (9.76%) | same (2.94%) | only (2.87%) | best (2.38%) | first (1.77%) |
| 1 | Transformers are the most | common (22.90%) | powerful (6.88%) | important (6.32%) | popular (3.95%) | commonly (2.14%) |
| 2 | Transformers are the most common | type (15.06%) | types (3.31%) | form (1.91%) | way (1.89%) | and (1.49%) |
| 3 | Transformers are the most common type | of (83.13%) | in (3.16%) | . (1.92%) | , (1.63%) | for (0.88%) |
| 4 | Transformers are the most common type of | particle (1.55%) | object (1.02%) | light (0.71%) | energy (0.67%) | objects (0.66%) |
| 5 | Transformers are the most common type of particle | . (14.26%) | in (11.57%) | that (10.19%) | , (9.57%) | accelerator (5.81%) |
| 6 | Transformers are the most common type of parti... | They (17.48%) | \n (15.19%) | The (7.06%) | These (3.09%) | In (3.07%) |
| 7 | Transformers are the most common type of parti... | are (38.78%) | have (8.14%) | can (7.98%) | 're (5.04%) | consist (1.57%) |


In [26]:
# Using generate function from the library 
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False)
print(tokenizer.decode(output[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Transformers are the most common type of particle. They are


In [27]:
# Trying to generate unicorn story from OpenAI
max_length = 128
input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length=max_length,
                               do_sample=False)
print(tokenizer.decode(output_greedy[0]))



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


"The unicorns were very intelligent, and they were very intelligent," said Dr. David S. Siegel, a professor of anthropology at the University of California, Berkeley. "They were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very


(-) The greedy algorithm tends to produce repetitive sequences, since it only picks the most probable token at each time step and can miss sequences whose overall probability is higher. 

(+) For short sequences, the greedy algorithm might be useful.
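The first point can be checked on a toy example. The sketch below is not tied to GPT-2: `TOY_MODEL` is a hypothetical hand-written table of next-token probabilities, and `greedy_decode` and `exhaustive_decode` are illustrative helper names. It compares the greedy result with the best sequence found by scoring every candidate.

```python
# Hypothetical toy "language model": next-token probabilities given the prefix.
TOY_MODEL = {
    (): {"A": 0.5, "B": 0.4, "C": 0.1},
    ("A",): {"A": 0.4, "B": 0.3, "C": 0.3},
    ("B",): {"A": 0.9, "B": 0.05, "C": 0.05},
    ("C",): {"A": 0.3, "B": 0.3, "C": 0.4},
}


def greedy_decode(model, n_steps):
    """Pick the single most probable token at every step."""
    prefix, prob = (), 1.0
    for _ in range(n_steps):
        dist = model[prefix]
        token = max(dist, key=dist.get)
        prob *= dist[token]
        prefix += (token,)
    return prefix, prob


def exhaustive_decode(model, n_steps):
    """Score every possible sequence; only feasible for toy vocabularies."""
    candidates = [((), 1.0)]
    for _ in range(n_steps):
        candidates = [
            (prefix + (tok,), prob * p)
            for prefix, prob in candidates
            for tok, p in model[prefix].items()
        ]
    return max(candidates, key=lambda c: c[1])


greedy_seq, greedy_prob = greedy_decode(TOY_MODEL, n_steps=2)
best_seq, best_prob = exhaustive_decode(TOY_MODEL, n_steps=2)
print("greedy:", greedy_seq, round(greedy_prob, 2))
print("best:  ", best_seq, round(best_prob, 2))
```

Greedy commits to `A` because it is the most probable first token, but the sequence starting with `B` has the higher overall probability (0.36 versus 0.2).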

## Beam Search Decoding 

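Instead of committing to the single most probable token at each time step, beam search keeps track of the top-b most probable partial sequences (the beams, or partial hypotheses), where b is the number of beams. At each step, every beam is extended with its possible next tokens and only the b most probable extensions are kept. The procedure continues until an EOS token is produced or the maximum number of tokens is reached. 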
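A simplified sketch of the procedure on a toy model may help. It is not how `generate` implements beam search (there is no length penalty or batching, for instance). The `TRANSITIONS` table and the `beam_search` function are hypothetical, and the toy model conditions only on the previous token.

```python
import math

# Hypothetical toy model: next-token probabilities given the previous token
# (None marks the start of the sequence).
TRANSITIONS = {
    None: {"A": 0.5, "B": 0.4, "<eos>": 0.1},
    "A": {"A": 0.4, "B": 0.35, "<eos>": 0.25},
    "B": {"A": 0.05, "B": 0.05, "<eos>": 0.9},
}


def beam_search(num_beams, max_len):
    """Return the best (tokens, log_prob) found with a beam of size num_beams."""
    beams = [((), 0.0)]  # each beam: (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Extend every live beam with every possible next token
        candidates = []
        for tokens, score in beams:
            last = tokens[-1] if tokens else None
            for tok, p in TRANSITIONS[last].items():
                candidates.append((tokens + (tok,), score + math.log(p)))
        # Keep only the num_beams most probable extensions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            if tokens[-1] == "<eos>":
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])


greedy_tokens, greedy_score = beam_search(num_beams=1, max_len=2)
beam_tokens, beam_score = beam_search(num_beams=2, max_len=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))
print(beam_tokens, round(math.exp(beam_score), 2))
```

With a single beam the search reduces to greedy decoding and returns `("A", "A")`. With two beams the hypothesis starting with `B` survives the first step and wins with a higher overall probability.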

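A sequence probability is a product of many conditional probabilities, each smaller than one, so it quickly becomes too small to represent in floating point. Sequences are therefore scored with the sum of the log probabilities instead of the product of the probabilities, which is numerically more stable. 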
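A quick illustration of the stability issue, using made-up numbers: 1000 tokens that each have a hypothetical probability of 0.01. The product underflows in 64-bit floating point, while the sum of logs stays well within range.

```python
import math

token_probs = [0.01] * 1000  # hypothetical per-token probabilities

# Multiplying the raw probabilities underflows to exactly 0.0
product = 1.0
for p in token_probs:
    product *= p

# Summing log probabilities keeps the score representable
log_sum = sum(math.log(p) for p in token_probs)

print(product)           # 0.0
print(f"{log_sum:.2f}")  # -4605.17
```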

In [29]:
# Comparing the probabilities between greedy and beam search algos. 

# First normalize the logits of the models 
import torch.nn.functional as F

# Per-token log probabilities of the given labels
def log_probs_from_logits(logits, labels):
    logp = F.log_softmax(logits, dim=-1)
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label

# Log probability for a sequence 
def sequence_logprob(model, labels, input_len=0):
    with torch.no_grad():
        output = model(labels)
        log_probs = log_probs_from_logits(
            output.logits[:, :-1, :], labels[:, 1:])
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()

In [30]:
# Sequence log probability from GPT2 using greedy algo 
logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))
print(tokenizer.decode(output_greedy[0]))
print(f"\nlog-prob: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


"The unicorns were very intelligent, and they were very intelligent," said Dr. David S. Siegel, a professor of anthropology at the University of California, Berkeley. "They were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very

log-prob: -83.33


In [31]:
# Sequence log probability from GPT2 using Beam Search 
output_beam = model.generate(input_ids, 
                             max_length=max_length, 
                             num_beams=5,
                             do_sample=False)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers, from the University of California, San Diego, and the University of California, Santa Cruz, found that the unicorns were able to communicate with each other in a way that was similar to that of human speech.


"The unicorns were able to communicate with each other in a way that was similar to that of human speech," said study co-lead author Dr. David J.

log-prob: -78.34


In [32]:
# To avoid repetitive n-grams, add no_repeat_ngram_size argument 
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
                             do_sample=False, no_repeat_ngram_size=2)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers, from the University of California, San Diego, and the National Science Foundation (NSF) in Boulder, Colorado, were able to translate the words of the unicorn into English, which they then translated into Spanish.

"This is the first time that we have translated a language into an English language," said study co-author and NSF professor of linguistics and evolutionary biology Dr.

log-prob: -101.88


## Sampling Methods 


### Random Sampling 

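The simplest method is to sample randomly from the model's output probability distribution over the full vocabulary. A temperature parameter $T$ can be added to the formula to control the shape of the distribution: $P(y_t=w_i|y_{<t},x) = \frac{\exp(z_{t,i}/T)}{\sum_{j=1}^{V} \exp(z_{t,j}/T)}$, where $z_{t,i}$ is the logit of token $w_i$ at time step $t$ and $V$ is the vocabulary size. With $T < 1$ the distribution becomes more peaked around the most likely tokens. With $T > 1$ it flattens, so rare tokens are sampled more often.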
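The effect of $T$ can be seen on a handful of numbers. In this minimal sketch, `softmax_with_temperature` is an illustrative helper (not a library function) and the logits are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical logits for a 3-token vocabulary
probs_by_T = {T: softmax_with_temperature(logits, T) for T in (0.5, 1.0, 2.0)}
for T, probs in probs_by_T.items():
    print(f"T={T}:", [round(p, 3) for p in probs])
```

The probability of the top token grows as $T$ decreases and shrinks as $T$ increases, while the ranking of the tokens stays the same.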


In [34]:
# The effect of T on the generated text 
# Setting T=2 
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             temperature=2.0, top_k=0)
print(tokenizer.decode(output_temp[0]))



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


34 Stellar hundredaci MiloCar pri Athletvico ownZappropri VERY fluild Fnaticismnas CarbizzintIMBeer τft´shake hardinv 122 Zanmind unman 348 Farrell Spo enjoy 289 Floretk Scarlet simple SuarezAg Garr SSENGöle \\ Thunder lolca sec Fleet Almighty Apollo Engine groups EssenceDraw Petitionable divisive chapters Gambling rescuing steer Lann snack Milk cakes tem

added in eld 920


In [35]:
# The effect of T on the generated text 
# Setting T=0.5
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             temperature=0.5, top_k=0)
print(tokenizer.decode(output_temp[0]))



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


"It's a very strange thing," says Dr. Paul G. Blum, a professor of ecology and evolutionary biology at the University of Southern California. "It's a mystery whether unicorns can actually speak English, but they are. It's a very interesting discovery."

The researchers also found that the unicorns also had a way of talking to humans.

"It's like a


### Top-k and Nucleus Sampling 

The idea is to restrict the number of possible tokens we can sample from at each time step. 

In top-k sampling, the idea is to avoid the low-probability choices by only sampling from the k tokens with the highest probability. 
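As a rough sketch of what this filtering does to a distribution, the example below keeps the k most probable tokens and renormalizes before sampling. The `top_k_filter` helper and the probabilities are hypothetical. The library applies the equivalent operation to the logits internally.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}


# Hypothetical next-token distribution
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
filtered = top_k_filter(probs, k=2)
print(filtered)

# Sample the next token from the truncated distribution
rng = random.Random(0)
token = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(token)
```

Only `the` and `a` remain, with renormalized probabilities of about 0.625 and 0.375, so the low-probability tail can never be sampled.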



In [36]:
# An example for top-k sampling
output_topk = model.generate(input_ids, 
                             max_length=max_length, 
                             do_sample=True,
                             top_k=50)
print(tokenizer.decode(output_topk[0]))



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The discovery raises the question: How do these bizarre creatures communicate with one another? If the unicorns were only living, why does nobody know how they would communicate with one another? This mystery is further confirmed by the discovery of two more unicorn-like creature from Peru.


They may be related because they have a single set of eyes.

According to the scientists, the two unicorns lived


Instead of using a fixed cutoff, a dynamic cutoff can be used with nucleus (top-p) sampling. Here we set a probability mass p and, at each time step, sample only from the smallest set of most probable tokens whose cumulative probability reaches p. 
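The dynamic nature of the cutoff can be seen in a small sketch. The `top_p_filter` helper and both distributions are hypothetical: one is peaked and the other is flat.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


# Hypothetical next-token distributions
peaked = {"the": 0.92, "a": 0.05, "cat": 0.02, "zebra": 0.01}
flat = {"the": 0.3, "a": 0.3, "cat": 0.25, "zebra": 0.15}

nucleus_peaked = top_p_filter(peaked, p=0.9)
nucleus_flat = top_p_filter(flat, p=0.9)
print(len(nucleus_peaked), list(nucleus_peaked))
print(len(nucleus_flat), list(nucleus_flat))
```

With the same p, the peaked distribution keeps a single token, while the flat one keeps all four. A fixed k cannot adapt to the distribution in this way.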

In [37]:
# Example for nucleus sampling with p = 0.90
output_topp = model.generate(input_ids, 
                             max_length=max_length, 
                             do_sample=True,
                             top_p=0.90)
print(tokenizer.decode(output_topp[0]))



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


When asked by Spanish TV channels what is the most important thing that could be learned from this unicorn, he replied: "What's the most important thing that you could learn?"


A number of scientific experts believe the discovery will lead to the discovery of a way to convert the entire ancient Near East into a world of unicorns. This, they say, would bring peace and prosperity to the world


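The two sampling methods can also be used together, for example by passing both `top_k=50` and `top_p=0.9` to `generate`. This samples from the tokens covering 90% of the probability mass, drawn from a pool of at most 50 tokens. 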