<a href="https://colab.research.google.com/github/keerthi09090/CSCI-167/blob/main/Notebook_12_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Notebook 12.4: Decoding strategies**

This practical investigates neural decoding from transformer models.  

Work through the cells below, running each cell in turn. In various places you will see the words "TODO". Follow the instructions at these places and make predictions about what is going to happen or write code to complete the functions.

Contact me at udlbookmail@gmail.com if you find any mistakes or have any suggestions.

In [None]:
!pip install transformers



In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, set_seed
import torch
import torch.nn.functional as F
import numpy as np

In [None]:
# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


# Decoding from GPT2

This tutorial investigates how to use GPT2 (the forerunner of GPT3) to generate text.  There are a number of ways to do this that trade off the realism of the text against the amount of variation.

At every stage, GPT2 takes an input string and returns a probability for each of the possible subsequent tokens.  We can choose what to do with these probabilities.  We could always *greedily choose* the most likely next token, or we could draw a *sample* randomly according to the probabilities.  There are also intermediate strategies, such as *top-k sampling* and *nucleus sampling*, that have some controlled randomness.
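To make the difference between greedy selection and sampling concrete, here is a minimal sketch on a toy distribution. The four-word vocabulary and its probabilities are invented for illustration and do not come from GPT2; the names `vocab`, `probs`, and `sample_freqs` are hypothetical.

```python
import numpy as np

# Hypothetical next-token distribution over a four-word toy vocabulary
vocab = ["that", "the", "it", "how"]
probs = np.array([0.5, 0.3, 0.15, 0.05])

# Greedy selection: always take the most probable token
greedy_index = int(np.argmax(probs))

# Sampling: draw token indices at random according to the probabilities
rng = np.random.default_rng(0)
sampled_indices = rng.choice(len(probs), size=1000, p=probs)

# Empirical frequency of each token among the samples
sample_freqs = np.bincount(sampled_indices, minlength=len(probs)) / len(sampled_indices)

print("Greedy choice:", vocab[greedy_index])
for word, p, f in zip(vocab, probs, sample_freqs):
    print(f"{word:>5}: model prob {p:.2f}, sampled frequency {f:.2f}")
```

Greedy selection returns the same token every time, whereas the sampled frequencies should roughly track the model probabilities, so less likely tokens still appear occasionally.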

We'll also investigate *beam search*: rather than greedily taking the next best token at each stage, we maintain a set of hypotheses (beams) as we add each subsequent token and return the most likely overall hypothesis.  This is not necessarily the same result we get from greedily choosing the next token.

First, let's investigate the tokens themselves.  The code below prints out the vocabulary size and shows 20 random tokens.

In [None]:
np.random.seed(1)
print("Number of tokens in dictionary = %d"%(tokenizer.vocab_size))
for i in range(20):
  index = np.random.randint(tokenizer.vocab_size)
  print("Token: %d "%(index)+tokenizer.decode(torch.tensor(index), skip_special_tokens=True))


Number of tokens in dictionary = 50257
Token: 33003  Mormons
Token: 12172  cam
Token: 5192  trig
Token: 32511 ojure
Token: 50057  gist
Token: 43723  Petition
Token: 7813  sin
Token: 21440  Witness
Token: 32912  Remy
Token: 20609 isure
Token: 49100  creeps
Token: 7751  fasc
Token: 43757  Alc
Token: 31228  messenger
Token: 36230  SYSTEM
Token: 32025  precipitation
Token: 21758  cores
Token: 45413  Forestry
Token: 35730  guru
Token: 8444  Disc


# Sampling

Each time we run GPT2, it takes in a sequence of tokens and returns a probability distribution over the possible next tokens.  The simplest thing we could do is to just draw a sample from this probability distribution each time.

In [None]:
def sample_next_token(input_tokens, model, tokenizer):
  # Run model to get prediction over next output
  outputs = model(input_ids = input_tokens['input_ids'], attention_mask = input_tokens['attention_mask'])

  # Find prediction
  prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0, -1]

  # Draw a random token according to the probabilities
  next_token = [np.random.choice(len(prob_over_tokens), p=prob_over_tokens)]

  # Append token to sentence
  output_tokens = input_tokens
  output_tokens["input_ids"] = torch.cat((output_tokens['input_ids'], torch.tensor([next_token])), dim=1)
  output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'], torch.tensor([[1]])), dim=1)
  output_tokens['last_token_prob'] = prob_over_tokens[next_token]

  return output_tokens


In [None]:
# Expected output:
# "The best thing about Bath is that they don't even change or shrink anymore."

set_seed(0)
input_txt = "The best thing about Bath is"
input_tokens = tokenizer(input_txt, return_tensors='pt')
for i in range(10):
    input_tokens = sample_next_token(input_tokens, model, tokenizer)
    print(tokenizer.decode(input_tokens["input_ids"][0], skip_special_tokens=True))



The best thing about Bath is that
The best thing about Bath is that they
The best thing about Bath is that they don
The best thing about Bath is that they don't
The best thing about Bath is that they don't even
The best thing about Bath is that they don't even change
The best thing about Bath is that they don't even change or
The best thing about Bath is that they don't even change or shrink
The best thing about Bath is that they don't even change or shrink anymore
The best thing about Bath is that they don't even change or shrink anymore.


In [None]:
# Try changing both the starting sentence and number of generated tokens!

# Starting text prompt (try your own)
input_txt = "The best thing about Bath is"
input_tokens = tokenizer(input_txt, return_tensors='pt')

# Number of tokens to generate (try 10, 20, 50, etc.)
for i in range(20):
    input_tokens = sample_next_token(input_tokens, model, tokenizer)
    print(tokenizer.decode(input_tokens["input_ids"][0], skip_special_tokens=True))


The best thing about Bath is your
The best thing about Bath is your kids
The best thing about Bath is your kids will
The best thing about Bath is your kids will definitely
The best thing about Bath is your kids will definitely be
The best thing about Bath is your kids will definitely be up
The best thing about Bath is your kids will definitely be up the
The best thing about Bath is your kids will definitely be up the chim
The best thing about Bath is your kids will definitely be up the chimney
The best thing about Bath is your kids will definitely be up the chimney floor
The best thing about Bath is your kids will definitely be up the chimney floor laughing
The best thing about Bath is your kids will definitely be up the chimney floor laughing about
The best thing about Bath is your kids will definitely be up the chimney floor laughing about it
The best thing about Bath is your kids will definitely be up the chimney floor laughing about it,"
The best thing about Bath is your kids will 

# Greedy token selection

You probably (correctly) got the impression that the text from pure sampling of the probability model can be kind of random.  How about if we choose the most likely token at each step?


In [None]:
def get_best_next_token(input_tokens, model, tokenizer):
  # Run model to get prediction over next output
  outputs = model(input_ids=input_tokens['input_ids'], attention_mask=input_tokens['attention_mask'])

  # Find prediction
  prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0, -1]

  # Find the token index with the maximum probability
  next_token = [np.argmax(prob_over_tokens)]

  # Append token to sentence
  output_tokens = input_tokens
  output_tokens["input_ids"] = torch.cat((output_tokens['input_ids'], torch.tensor([next_token])), dim=1)
  output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'], torch.tensor([[1]])), dim=1)
  output_tokens['last_token_prob'] = prob_over_tokens[next_token]

  return output_tokens


In [None]:
# Expected output:
# The best thing about Bath is that it's a place where you can go to
set_seed(0)
input_txt = "The best thing about Bath is"
input_tokens = tokenizer(input_txt, return_tensors='pt')
for i in range(10):
    input_tokens = get_best_next_token(input_tokens, model, tokenizer)
    print(tokenizer.decode(input_tokens["input_ids"][0], skip_special_tokens=True))

The best thing about Bath is that
The best thing about Bath is that it
The best thing about Bath is that it's
The best thing about Bath is that it's a
The best thing about Bath is that it's a place
The best thing about Bath is that it's a place where
The best thing about Bath is that it's a place where you
The best thing about Bath is that it's a place where you can
The best thing about Bath is that it's a place where you can go
The best thing about Bath is that it's a place where you can go to


In [None]:
# TODO Modify the code below by changing the number of tokens generated and the initial sentence
# to get a feel for how well this works.

# TODO Experiment with changing this line:
input_txt = "The best thing about Bath is"
input_tokens = tokenizer(input_txt, return_tensors='pt')
# TODO Experiment with changing this line:
for i in range(10):
    input_tokens = get_best_next_token(input_tokens, model, tokenizer)
    print(tokenizer.decode(input_tokens["input_ids"][0], skip_special_tokens=True))

The best thing about Bath is that
The best thing about Bath is that it
The best thing about Bath is that it's
The best thing about Bath is that it's a
The best thing about Bath is that it's a place
The best thing about Bath is that it's a place where
The best thing about Bath is that it's a place where you
The best thing about Bath is that it's a place where you can
The best thing about Bath is that it's a place where you can go
The best thing about Bath is that it's a place where you can go to


# Top-K sampling

You probably noticed that the greedy strategy produces quite realistic text, but it's kind of boring.  It produces generic answers.  Also, if this were a chatbot, then we wouldn't necessarily want it to produce the same answer to a question each time.

Top-K sampling is a compromise strategy that samples randomly from the top K most probable tokens.  We could just choose them with a uniform distribution, or (as here) we could sample them according to their original probabilities.
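The filtering step can be sketched on a toy distribution as follows. This is an illustration under stated assumptions rather than the notebook's implementation: `top_k_filter` and `toy_probs` are hypothetical names, and the sketch selects the top k tokens by index, so it keeps exactly k tokens even when probabilities tie (thresholding on the k-th probability value would keep every tied token).

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep the k largest probabilities, zero the rest, and renormalize."""
    probs = np.asarray(probs, dtype=float)
    # Indices of the k most probable tokens
    top_indices = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[top_indices] = probs[top_indices]
    # Renormalize so the surviving probabilities sum to one
    return filtered / filtered.sum()

# Invented five-token distribution
toy_probs = np.array([0.4, 0.3, 0.2, 0.05, 0.05])
filtered_probs = top_k_filter(toy_probs, k=3)
print(filtered_probs)
```

With `k=3`, the two least likely tokens get probability zero and the remaining three keep their original ratios (4 : 3 : 2).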

In [None]:
def get_top_k_token(input_tokens, model, tokenizer, k=20):
  # Run model to get prediction over next output
  outputs = model(input_ids=input_tokens['input_ids'], attention_mask=input_tokens['attention_mask'])

  # Find prediction probabilities
  prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0, -1]

  # Take a copy of the probabilities and sort from largest to smallest
  sorted_prob_over_tokens = np.sort(prob_over_tokens)[::-1]

  # Find the probability at the k-th position (the cutoff)
  kth_prob_value = sorted_prob_over_tokens[k - 1]

  # Set all probabilities below this value to zero
  prob_over_tokens[prob_over_tokens < kth_prob_value] = 0

  # Renormalize so probabilities sum to one
  prob_over_tokens = prob_over_tokens / np.sum(prob_over_tokens)

  # Draw a random token according to the filtered probabilities
  next_token = np.random.choice(len(prob_over_tokens), 1, replace=False, p=prob_over_tokens)

  # Append token to the sentence
  output_tokens = input_tokens
  output_tokens["input_ids"] = torch.cat((output_tokens['input_ids'], torch.tensor([next_token])), dim=1)
  output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'], torch.tensor([[1]])), dim=1)
  output_tokens['last_token_prob'] = prob_over_tokens[next_token]

  return output_tokens


In [None]:
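# Expected output with set_seed(0), k=10, 10 tokens and the prompt "The best thing about Bath is":
# The best thing about Bath is that you get to see all the beautiful faces of
# This cell uses a different seed, prompt, and k, so its output will differ.

set_seed(42)
input_txt = "The future of artificial intelligence is"
# Re-tokenize the new prompt. Without this line, input_tokens still holds the
# sequence left by the previous cell and generation simply continues that text.
input_tokens = tokenizer(input_txt, return_tensors='pt')
for i in range(20):
    input_tokens = get_top_k_token(input_tokens, model, tokenizer, k=30)
    print(tokenizer.decode(input_tokens["input_ids"][0], skip_special_tokens=True))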


The best thing about Bath is that it's a place where you can go to any
The best thing about Bath is that it's a place where you can go to any restaurant
The best thing about Bath is that it's a place where you can go to any restaurant that
The best thing about Bath is that it's a place where you can go to any restaurant that has
The best thing about Bath is that it's a place where you can go to any restaurant that has a
The best thing about Bath is that it's a place where you can go to any restaurant that has a great
The best thing about Bath is that it's a place where you can go to any restaurant that has a great view
The best thing about Bath is that it's a place where you can go to any restaurant that has a great view that
The best thing about Bath is that it's a place where you can go to any restaurant that has a great view that you
The best thing about Bath is that it's a place where you can go to any restaurant that has a great view that you're
The best thing about Bath is that i

In [None]:
# TODO
# Experiment with different values of k
# If you set it to a lower number (say 3) the text will be less random
# If you set it to a higher number (say 5000) the text will be more random

set_seed(0)
input_txt = "The best thing about Bath is"
input_tokens = tokenizer(input_txt, return_tensors='pt')
for i in range(10):
    input_tokens = get_top_k_token(input_tokens, model, tokenizer, k=10)
    print(tokenizer.decode(input_tokens["input_ids"][0], skip_special_tokens=True))

The best thing about Bath is that
The best thing about Bath is that you
The best thing about Bath is that you get
The best thing about Bath is that you get to
The best thing about Bath is that you get to see
The best thing about Bath is that you get to see all
The best thing about Bath is that you get to see all the
The best thing about Bath is that you get to see all the beautiful
The best thing about Bath is that you get to see all the beautiful faces
The best thing about Bath is that you get to see all the beautiful faces of


# Nucleus sampling

Top-K sampling has the disadvantage that it always keeps the same number of tokens, even though sometimes there are only a few plausible next tokens and sometimes there are a lot.  How do we adapt to this situation?  One way is to sample from a fixed proportion of the probability mass.  That is, we order the tokens by probability and cut off the possibility of sampling once the cumulative sum is greater than a threshold.

This way, we adapt the number of possible tokens that we can choose.
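The following sketch shows how the number of retained tokens adapts to the shape of the distribution. The helper `nucleus_size` and the two toy distributions are hypothetical, and the threshold is assumed to be below 1.

```python
import numpy as np

def nucleus_size(probs, thresh):
    """Number of top tokens needed for the cumulative probability to exceed thresh."""
    sorted_probs = np.sort(np.asarray(probs, dtype=float))[::-1]
    cum_probs = np.cumsum(sorted_probs)
    # argmax returns the first index at which the condition is True
    return int(np.argmax(cum_probs > thresh)) + 1

# Invented distributions: one dominated by a single token, one uniform
peaked = np.array([0.9, 0.05, 0.03, 0.02])
flat = np.array([0.25, 0.25, 0.25, 0.25])

print("Peaked distribution keeps", nucleus_size(peaked, thresh=0.6), "token(s)")
print("Flat distribution keeps", nucleus_size(flat, thresh=0.6), "token(s)")
```

For the same threshold, the peaked distribution keeps a single token while the flat one keeps three, which is the adaptive behaviour that a fixed k cannot provide.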

In [None]:
def get_nucleus_sampling_token(input_tokens, model, tokenizer, thresh=0.25):
  # Run model to get prediction over next output
  outputs = model(input_ids=input_tokens['input_ids'], attention_mask=input_tokens['attention_mask'])
  # Probabilities over next-token vocabulary
  prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0, -1]

  # Sort probs in decreasing order
  sorted_probs_decreasing = np.sort(prob_over_tokens)[::-1]
  # Cumulative sum of sorted probabilities
  cum_sum_probs = np.cumsum(sorted_probs_decreasing)

  # Index where cumulative mass first exceeds threshold
  # (how many tokens are kept in the nucleus)
  thresh_index = np.argmax(cum_sum_probs > thresh)
  print("Choosing from %d tokens" % (thresh_index + 1))

  # Probability cutoff corresponding to that index
  thresh_prob = sorted_probs_decreasing[thresh_index]

  # Zero-out everything below the cutoff in the ORIGINAL distribution
  prob_over_tokens[prob_over_tokens < thresh_prob] = 0.0

  # Renormalize (guard against numerical issues)
  total = np.sum(prob_over_tokens)
  if total == 0:
    # Fallback: keep the single most probable token
    j = np.argmax(F.softmax(outputs.logits, dim=-1).detach().numpy()[0, -1])
    mask = np.zeros_like(prob_over_tokens)
    mask[j] = 1.0
    prob_over_tokens = mask
  else:
    prob_over_tokens = prob_over_tokens / total

  # Sample one token from the nucleus
  next_token = np.random.choice(len(prob_over_tokens), 1, replace=False, p=prob_over_tokens)

  # Append token to sentence
  output_tokens = input_tokens
  output_tokens["input_ids"] = torch.cat((output_tokens['input_ids'], torch.tensor([next_token])), dim=1)
  output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'], torch.tensor([[1]])), dim=1)
  output_tokens['last_token_prob'] = prob_over_tokens[next_token]
  return output_tokens


In [None]:
# Expected output:
# The best thing about Bath is that it's not a city that has been around
set_seed(0)
input_txt = "The best thing about Bath is"
input_tokens = tokenizer(input_txt, return_tensors='pt')
for i in range(10):
    input_tokens = get_nucleus_sampling_token(input_tokens, model, tokenizer, thresh = 0.2)
    print(tokenizer.decode(input_tokens["input_ids"][0], skip_special_tokens=True))


Choosing from 1 tokens
The best thing about Bath is that
Choosing from 1 tokens
The best thing about Bath is that it
Choosing from 1 tokens
The best thing about Bath is that it's
Choosing from 3 tokens
The best thing about Bath is that it's not
Choosing from 2 tokens
The best thing about Bath is that it's not a
Choosing from 26 tokens
The best thing about Bath is that it's not a city
Choosing from 3 tokens
The best thing about Bath is that it's not a city that
Choosing from 2 tokens
The best thing about Bath is that it's not a city that has
Choosing from 2 tokens
The best thing about Bath is that it's not a city that has been
Choosing from 12 tokens
The best thing about Bath is that it's not a city that has been around


In [None]:
# TODO -- experiment with setting the threshold probability to larger or smaller values
input_txt = "The best thing about Bath is"
input_tokens = tokenizer(input_txt, return_tensors='pt')
for i in range(10):
    input_tokens = get_nucleus_sampling_token(input_tokens, model, tokenizer, thresh = 0.2)
    print(tokenizer.decode(input_tokens["input_ids"][0], skip_special_tokens=True))

Choosing from 1 tokens
The best thing about Bath is that
Choosing from 1 tokens
The best thing about Bath is that it
Choosing from 1 tokens
The best thing about Bath is that it's
Choosing from 3 tokens
The best thing about Bath is that it's so
Choosing from 4 tokens
The best thing about Bath is that it's so much
Choosing from 1 tokens
The best thing about Bath is that it's so much more
Choosing from 1 tokens
The best thing about Bath is that it's so much more than
Choosing from 1 tokens
The best thing about Bath is that it's so much more than just
Choosing from 1 tokens
The best thing about Bath is that it's so much more than just a
Choosing from 5 tokens
The best thing about Bath is that it's so much more than just a beach


# Beam search

All of the methods we've seen so far choose the tokens one by one.  But this isn't necessarily sensible.  Even greedily choosing the best token doesn't necessarily retrieve the sequence with the highest probability.  It might be that the most likely token only has very unlikely tokens following it.
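A small worked example may help here. The two-step model below is entirely hypothetical (the tokens `A`, `B`, `x`, `y` and all numbers are invented); it is constructed so that the most likely first token leads to a less likely overall sequence.

```python
# Hypothetical two-step model: probabilities of the first token, and of the
# second token given the first. The numbers are invented for illustration.
first_token_probs = {"A": 0.6, "B": 0.4}
second_token_probs = {
    "A": {"x": 0.55, "y": 0.45},  # after the likelier first token, no continuation stands out
    "B": {"x": 0.9, "y": 0.1},    # after the less likely first token, one continuation dominates
}

# Greedy decoding: pick the best token at each step in turn
greedy_first = max(first_token_probs, key=first_token_probs.get)
greedy_second = max(second_token_probs[greedy_first], key=second_token_probs[greedy_first].get)
greedy_prob = first_token_probs[greedy_first] * second_token_probs[greedy_first][greedy_second]

# Exhaustive search: score every two-token sequence
sequence_probs = {
    (first, second): p_first * p_second
    for first, p_first in first_token_probs.items()
    for second, p_second in second_token_probs[first].items()
}
best_sequence = max(sequence_probs, key=sequence_probs.get)

print("Greedy:", (greedy_first, greedy_second), round(greedy_prob, 3))
print("Best:  ", best_sequence, round(sequence_probs[best_sequence], 3))
```

Greedy decoding commits to `A` and ends with probability 0.6 × 0.55 = 0.33, while the sequence starting with `B` reaches 0.4 × 0.9 = 0.36.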

Beam search maintains $K$ hypotheses about the best possible continuation.  It starts with the top $K$ continuations.  Then for each of those, it finds the top $K$ continuations, giving $K^2$ hypotheses.  Then it retains just the top $K$ of these so that the number of hypotheses stays the same.
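The expand-then-prune loop can be sketched without GPT2 on a toy model. Everything here is an assumption made for illustration: the three-token vocabulary, the invented transition table in which the next token depends only on the previous one, and the names `next_token_probs` and `beam_search`. Unlike the procedure described above, this sketch starts from a single empty hypothesis, which gives the same beams after the first step.

```python
import numpy as np

# Hypothetical toy model over a three-token vocabulary. The next-token
# distribution depends only on the previous token (invented numbers).
VOCAB = ["a", "b", "c"]
START_PROBS = np.array([0.5, 0.4, 0.1])
TRANSITION_PROBS = np.array([
    [0.40, 0.35, 0.25],  # after "a"
    [0.15, 0.80, 0.05],  # after "b"
    [0.35, 0.25, 0.40],  # after "c"
])

def next_token_probs(sequence):
    """Return the toy model's distribution over the next token."""
    if not sequence:
        return START_PROBS
    return TRANSITION_PROBS[sequence[-1]]

def beam_search(n_beam, length):
    # Each beam is a (token index list, cumulative log-probability) pair
    beams = [([], 0.0)]
    for _ in range(length):
        candidates = []
        for sequence, log_prob in beams:
            probs = next_token_probs(sequence)
            # Expand each beam with its n_beam most likely continuations
            for token in np.argsort(probs)[::-1][:n_beam]:
                candidates.append(
                    (sequence + [int(token)], log_prob + float(np.log(probs[token])))
                )
        # Retain only the n_beam best hypotheses overall
        candidates.sort(key=lambda beam: beam[1], reverse=True)
        beams = candidates[:n_beam]
    return beams

greedy_beams = beam_search(n_beam=1, length=3)  # one beam is the same as greedy decoding
wide_beams = beam_search(n_beam=2, length=3)

for name, beams in [("n_beam=1", greedy_beams), ("n_beam=2", wide_beams)]:
    tokens, log_prob = beams[0]
    print(name, "".join(VOCAB[t] for t in tokens), f"P = {np.exp(log_prob):.3f}")
```

With one beam the search follows the greedy path `aaa` (probability 0.08), while two beams are enough to recover `bbb` (probability 0.256). Summing log-probabilities rather than multiplying probabilities avoids numerical underflow on long sequences.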

In [None]:
def get_kth_most_likely_token(input_tokens, model, tokenizer, k):
  # Run model to get prediction over next output
  outputs = model(input_ids=input_tokens['input_ids'], attention_mask=input_tokens['attention_mask'])
  # Find prediction
  prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0, -1]

  # Order token indices from most to least likely
  sorted_token_indices = np.argsort(prob_over_tokens)[::-1]

  # Take the index of the k-th most likely token (k is zero-based, so k=0 is the most likely).
  # Indexing directly avoids matching on the probability value, which would
  # return several indices whenever two tokens have identical probabilities.
  next_token = [sorted_token_indices[k]]

  # Append token to sentence
  output_tokens = input_tokens
  output_tokens["input_ids"] = torch.cat((output_tokens['input_ids'], torch.tensor([next_token])), dim=1)
  output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'], torch.tensor([[1]])), dim=1)
  output_tokens['last_token_prob'] = prob_over_tokens[next_token]
  output_tokens['log_prob'] = output_tokens['log_prob'] + np.log(prob_over_tokens[next_token])
  return output_tokens


In [None]:
from transformers import set_seed

def run_with_k(prompt, steps, k):
    set_seed(0)  # keep runs comparable across K
    toks = tokenizer(prompt, return_tensors='pt')
    toks['log_prob'] = 0.0
    for _ in range(steps):
        toks = get_kth_most_likely_token(toks, model, tokenizer, k=k)
    return tokenizer.decode(toks["input_ids"][0], skip_special_tokens=True)

prompt = "The best thing about Bath is"
steps = 10
for K in [0, 1, 2, 5, 20, 200, 2000]:
    out = run_with_k(prompt, steps, K)
    print(f"K={K} -> {out}")


K=0 -> The best thing about Bath is that it's a place where you can go to
K=1 -> The best thing about Bath is the way you get the most bang outta the
K=2 -> The best thing about Bath is it has no need of the "bait-
K=5 -> The best thing about Bath is its location: in this tiny city that lies at
K=20 -> The best thing about Bath is your love at its creation...

 . In spite
K=200 -> The best thing about Bath is surely health reform might quickly spawn foop babies come
K=2000 -> The best thing about Bath is mixed profits partnerships» buy generic+ Honda throttlecont


In [None]:
def print_beams(beams):
  for index, beam in enumerate(beams):
    # Show both log-prob and prob for clarity
    lp = float(beam['log_prob'])
    print(f"Beam {index}, logP {lp: .3f}, P {np.exp(lp):.3e}: "
          + tokenizer.decode(beam["input_ids"][0], skip_special_tokens=True))
  print('---')


def _clone_tokens(tok):
  out = {}
  for k, v in tok.items():
    if isinstance(v, torch.Tensor):
      out[k] = v.clone()
    else:
      out[k] = v
  return out


def do_beam_search(input_tokens_in, model, tokenizer, n_beam=5, beam_length=10):
  # Start with a clean copy and init log-prob
  seed = _clone_tokens(input_tokens_in)
  seed['log_prob'] = 0.0

  # Initialize with n_beam most likely 1-token continuations
  beams = []
  for c_k in range(n_beam):
    b = _clone_tokens(seed)
    b = get_kth_most_likely_token(b, model, tokenizer, c_k)
    beams.append(b)

  print_beams(beams)

  # Grow beams to desired length
  for _ in range(beam_length - 1):
    beams_all = []
    log_probs_all = []

    # Expand each beam with its top n_beam continuations
    for b in beams:
      for c_k in range(n_beam):
        cand = get_kth_most_likely_token(_clone_tokens(b), model, tokenizer, c_k)
        beams_all.append(cand)
        log_probs_all.append(float(cand['log_prob']))

    # Keep the top n_beam by log-prob
    top_idx = np.argsort(-np.array(log_probs_all))[:n_beam]
    beams = [beams_all[i] for i in top_idx]

    print_beams(beams)

  # Return best beam
  return beams[0]


In [None]:
set_seed(0)
input_txt = "The best thing about Bath is"
input_tokens = tokenizer(input_txt, return_tensors='pt')

n_beams = 5
best_beam = do_beam_search(input_tokens, model, tokenizer, n_beam=n_beams, beam_length=10)

print("Beam search result:")
print(tokenizer.decode(best_beam["input_ids"][0], skip_special_tokens=True))




Beam 0, logP -0.727, P 4.835e-01: The best thing about Bath is that
Beam 1, logP -2.161, P 1.152e-01: The best thing about Bath is the
Beam 2, logP -3.177, P 4.171e-02: The best thing about Bath is it
Beam 3, logP -3.468, P 3.118e-02: The best thing about Bath is how
Beam 4, logP -3.536, P 2.912e-02: The best thing about Bath is you
---
Beam 0, logP -1.899, P 1.497e-01: The best thing about Bath is that it
Beam 1, logP -2.381, P 9.246e-02: The best thing about Bath is that you
Beam 2, logP -3.557, P 2.853e-02: The best thing about Bath is that they
Beam 3, logP -3.561, P 2.841e-02: The best thing about Bath is that we
Beam 4, logP -3.727, P 2.408e-02: The best thing about Bath is that the
---
Beam 0, logP -2.740, P 6.454e-02: The best thing about Bath is that it's
Beam 1, logP -3.264, P 3.823e-02: The best thing about Bath is that you can
Beam 2, logP -4.079, P 1.692e-02: The best thing about Bath is that it is
Beam 3, logP -4.372, P 1.263e-02: The best thing about Bath is that you don

You can read about more decoding strategies in this blog (which uses a recurrent neural network, not a transformer, but the principles are the same).

https://www.borealisai.com/research-blogs/tutorial-6-neural-natural-language-generation-decoding-algorithms/

You can also look at other possible language models via Hugging Face:

https://huggingface.co/docs/transformers/v4.25.1/en/model_summary#decoders-or-autoregressive-models
