<a href="https://colab.research.google.com/github/kjahan/speculative_decoding/blob/main/notebooks/speculative_sampling_opt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Speculative Sampling

The goal here is to reduce inference latency for generative language models. This matters for interactive use cases such as edit suggestions for writing or coding.

We use a small draft model (here OPT-125m) and a larger target model (here OPT-350m). We prompt the draft model and generate k speculative tokens along with their probabilities.

Next we feed those k tokens, appended to the original prompt, to the target model and get the likelihoods for all of them in a single forward pass; the causal attention mask lets the model score every position in parallel. We then compare the draft-model and target-model probabilities of each speculated token to accept or reject it.
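
To make the single-pass scoring concrete: in a causal language model, the logits at position i score the token at position i + 1. The following is a minimal sketch of that index bookkeeping, assuming a prompt of `n_prompt` tokens followed by `k` draft tokens. The helper name `verification_positions` is hypothetical and the length 14 is arbitrary.

```python
def verification_positions(n_prompt: int, k: int):
    """Return (scoring_positions, bonus_position) for one target forward pass.

    The target model sees prompt + k draft tokens, i.e. n_prompt + k
    positions indexed from 0.
    """
    if n_prompt < 1 or k < 0:
        raise ValueError("need at least one prompt token and k >= 0")
    # Logits at position i predict the token at position i + 1, so draft token j
    # (which sits at position n_prompt + j) is scored by position n_prompt + j - 1.
    scoring = [n_prompt + j - 1 for j in range(k)]
    # The logits at the final position predict one token beyond the draft.
    bonus = n_prompt + k - 1
    return scoring, bonus


scoring, bonus = verification_positions(n_prompt=14, k=5)
print(scoring, bonus)
```

All k + 1 of these positions come out of the same forward pass, which is why verifying k draft tokens costs roughly one target-model call instead of k.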

See this video for more explanations:

https://www.youtube.com/watch?v=S-8yr_RibJ4

The key insight is that many tokens, such as "of", are easy enough that even a small model predicts them correctly. The small model can generate those tokens quickly, leaving the larger model to decide facts and harder tokens.

https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/

https://www.youtube.com/watch?v=9wNAgpX6z_4

https://docs.google.com/presentation/d/1p1xE-EbSAnXpTSiSI0gmy_wdwxN5XaULO3AnCWWoRe4/edit#slide=id.p

### vLLM speculative decoding

https://docs.vllm.ai/en/stable/features/reasoning_outputs.html

## Load draft model

In [21]:
import time
import random

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torch.nn.functional as F

In [2]:
draft_model_name = "facebook/opt-125m"  # Small model

# Load models and tokenizers
draft_model = AutoModelForCausalLM.from_pretrained(draft_model_name)
draft_tokenizer = AutoTokenizer.from_pretrained(draft_model_name)

# Move model and input to the same device (GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
draft_model.to(device)

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,)

In [3]:
print(device)

cuda


## Prompt the draft model to generate speculated tokens

`k=5`

In [5]:
def run_draf_model(prompt,k=5):
  # we speculate next k tokens and store them along with their probs from draft model
  draft_next_tokens = []
  draft_next_token_probs = []

  for _ in range(k):
    # Tokenize the prompt (turn it into token IDs)
    encoded_prompt = draft_tokenizer(prompt, return_tensors='pt')

    encoded_prompt = {key: value.to(device) for key, value in encoded_prompt.items()}

    # Run the model and get the logits for the next token
    with torch.no_grad():  # Disable gradient calculation during inference
        outputs = draft_model(**encoded_prompt)
        logits = outputs.logits

    # Extract the logits for the next token (logits for the token after the input prompt)
    next_token_logits = logits[0, -1, :]  # Logits for the next token (after the prompt)

    # Apply softmax to convert logits into probabilities
    probabilities = F.softmax(next_token_logits, dim=-1)

    # Get the top k most likely tokens and their probabilities
    top_k = 10
    top_token_probs, top_token_ids = torch.topk(probabilities, k=top_k)

    # Decode the top k token IDs into human-readable tokens
    top_token_strings = [draft_tokenizer.decode([token_id.item()]) for token_id in top_token_ids]

    # Print the top 10 most likely next tokens and their probabilities
    print(f"Top {top_k} tokens and their likelihoods for the next token:")
    for i in range(top_k):
        print(f"Token: {top_token_strings[i]} | Probability: {top_token_probs[i].item():.4f}")
    print("\n----------------------\n")
    # Append the predicted token to the prompt so the next iteration predicts the following position
    prompt = prompt + top_token_strings[0]

    draft_next_tokens.append(top_token_strings[0])
    draft_next_token_probs.append(top_token_probs[0].item())

  results = {'tokens': draft_next_tokens, 'likelihoods': draft_next_token_probs}

  return results
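
The function above appends the decoded string to the prompt and re-tokenizes it on every step. A possible alternative is to extend the sequence of token ids directly and to keep the full distribution at each step, since the verification stage can make use of it. The sketch below illustrates that loop with a toy stand-in for the draft model; `toy_next_token_logits`, `draft_k_tokens` and the 4-token vocabulary are all hypothetical and only serve to keep the example runnable without downloading a model.

```python
import math

VOCAB_SIZE = 4  # toy vocabulary, purely illustrative


def toy_next_token_logits(token_ids):
    """Hypothetical stand-in for a draft model: favours (last id + 1) mod VOCAB_SIZE."""
    last = token_ids[-1]
    return [1.0 if v == (last + 1) % VOCAB_SIZE else 0.0 for v in range(VOCAB_SIZE)]


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def draft_k_tokens(prompt_ids, k):
    """Greedily draft k token ids, keeping the full distribution at each step."""
    ids = list(prompt_ids)
    drafted, dists = [], []
    for _ in range(k):
        probs = softmax(toy_next_token_logits(ids))
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        drafted.append(next_id)
        dists.append(probs)
        ids.append(next_id)  # extend the id sequence, no decode/re-encode round trip
    return drafted, dists


drafted, dists = draft_k_tokens([0], k=3)
print(drafted)
```

With a real model, the same structure applies: the ids tensor is extended by one column per step instead of rebuilding the prompt string.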

## Generate draft proposal

In [6]:
# Define the input prompt
prompt = "What is mitosis? Mitosis is the process by which a protein is broken"
k=5

results = run_draf_model(prompt, k)

draft_next_tokens, draft_next_token_probs = results['tokens'], results['likelihoods']

Top 10 tokens and their likelihoods for the next token:
Token:  down | Probability: 0.8264
Token:  into | Probability: 0.0852
Token:  up | Probability: 0.0131
Token:  in | Probability: 0.0090
Token:  or | Probability: 0.0079
Token:  by | Probability: 0.0068
Token:  apart | Probability: 0.0062
Token:  and | Probability: 0.0050
Token:  through | Probability: 0.0050
Token: , | Probability: 0.0045

----------------------

Top 10 tokens and their likelihoods for the next token:
Token:  into | Probability: 0.5781
Token:  and | Probability: 0.1148
Token:  by | Probability: 0.0609
Token:  in | Probability: 0.0539
Token: , | Probability: 0.0387
Token:  to | Probability: 0.0348
Token: . | Probability: 0.0232
Token:  from | Probability: 0.0115
Token:  ( | Probability: 0.0087
Token:  that | Probability: 0.0084

----------------------

Top 10 tokens and their likelihoods for the next token:
Token:  its | Probability: 0.1147
Token:  a | Probability: 0.1075
Token:  two | Probability: 0.0565
Token:  c

## Speculated tokens and probabilities from the draft model

In [7]:
print(f"Speculated tokens: {draft_next_tokens}")
print(f"Speculated tokens probs: {draft_next_token_probs}")


Speculated tokens: [' down', ' into', ' its', ' constituent', ' components']
Speculated tokens probs: [0.8263538479804993, 0.5781221985816956, 0.11474165320396423, 0.3421257734298706, 0.4991067051887512]


## Load target model

In [8]:
# Load OPT tokenizer and model
target_model_name = "facebook/opt-350m"  # Larger model

target_model = AutoModelForCausalLM.from_pretrained(target_model_name)
target_tokenizer = AutoTokenizer.from_pretrained(target_model_name)

# Move models to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
target_model.to(device)

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear(in_features=1024, out_features=512, bias=False)
      (project_in): Linear(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTSdpaAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features

## Use target model for evaluation

We pass all speculated tokens to the target model in one forward pass to get the token likelihoods used for accepting or rejecting them. The same pass also yields one extra token at the end as a bonus!

In [9]:
# Check the model's vocabulary size (OPT reports 50272 in its config; GPT-2 would be 50257)
vocab_size = target_model.config.vocab_size
vocab_size

50272

In [10]:
device

'cuda'

## Add speculated token to prompt for evaluation!

In [11]:
def run_target_model(new_prompt):
  target_probs = []
  target_next_token = None

  # Tokenize the prompt
  # Tokenize the input and move the tensors to the same device as the model
  encoded_prompt = target_tokenizer(new_prompt, return_tensors="pt").to(device)

  all_target_probs = []

  # Get model outputs (logits) for the input
  with torch.no_grad():  # Disable gradient computation to save memory
      outputs = target_model(**encoded_prompt)
      logits = outputs.logits  # Raw logits (before softmax)


  # Walk backwards over the last k + 1 positions (k is the global draft length).
  # Position -1 predicts the bonus token; positions -2 .. -(k+1) predict the draft tokens.
  for inx in range(-1, -1-k-1, -1):
    # Logits at position inx score the token that follows that position
    token_logits = logits[:, inx, :]

    # Apply softmax to get probabilities of each token in the vocabulary
    probabilities = torch.softmax(token_logits, dim=-1)

    # Keep the full distribution at every position (used later to pick a token after a rejection)
    all_target_probs.append(probabilities)

    # Get the token ID of the target model's most likely token at this position
    predicted_token_id = torch.argmax(probabilities, dim=-1).item()

    # Get the probability (likelihood) of that top token.
    # Note: this equals q(x) for the drafted token x only when the target's
    # top token and the drafted token coincide.
    predicted_token_probability = probabilities[0, predicted_token_id].item()

    # Decode the predicted token ID back to text
    predicted_token = target_tokenizer.decode(predicted_token_id)

    # Print the target's top token at this position and its likelihood (probability)
    print(f"Token '{inx}': '{predicted_token}' - Likelihood: {predicted_token_probability:.4f}")
    print("\n----------------------\n")

    # Position -1 is the bonus token; every other position contributes a target likelihood
    if inx == -1:
      target_next_token = predicted_token
    else:
      target_probs.append(predicted_token_probability)

  results = {'likelihoods': target_probs, 'all_likelihoods': all_target_probs}
  return results
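
The function above stores the likelihood of the target model's own top token at each position. The acceptance rule needs q(x), the target likelihood of the token the draft model actually proposed, and the two differ whenever the models disagree. A minimal sketch of that lookup follows, using plain lists and made-up numbers; `target_probs_for_draft` is a hypothetical helper, and with PyTorch tensors the same lookup would typically be an index or `gather` over the vocabulary dimension.

```python
def target_probs_for_draft(position_dists, draft_ids):
    """Pick q(x_j): the target probability of draft token x_j at its own position.

    position_dists[j] is the target distribution (a list over the vocabulary)
    that predicts draft position j; draft_ids[j] is the drafted token id.
    """
    if len(position_dists) != len(draft_ids):
        raise ValueError("need one distribution per draft token")
    return [dist[tok] for dist, tok in zip(position_dists, draft_ids)]


# Toy example over a 3-token vocabulary (numbers are made up).
position_dists = [
    [0.7, 0.2, 0.1],  # target's top token is 0, and the draft proposed 0
    [0.1, 0.3, 0.6],  # target's top token is 2, but the draft proposed 1
]
draft_ids = [0, 1]

q = target_probs_for_draft(position_dists, draft_ids)
top = [max(dist) for dist in position_dists]
print(q)    # likelihood of the drafted tokens
print(top)  # likelihood of the target's own top tokens
```

At the second position the two lists differ, which is exactly the case where using the top-token likelihood would overstate q(x).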

## Run evaluation

In [14]:
# Define the prompt with speculated tokens
new_prompt = prompt + ''.join(draft_next_tokens)
new_prompt

'What is mitosis? Mitosis is the process by which a protein is broken down into its constituent components'

In [15]:
results_2 = run_target_model(new_prompt)

target_probs, all_target_probs = results_2['likelihoods'], results_2['all_likelihoods']

Token '-1': '.' - Likelihood: 0.3777

----------------------

Token '-2': ' components' - Likelihood: 0.2775

----------------------

Token '-3': ' components' - Likelihood: 0.3063

----------------------

Token '-4': ' smaller' - Likelihood: 0.2442

----------------------

Token '-5': ' into' - Likelihood: 0.2533

----------------------

Token '-6': ' down' - Likelihood: 0.9077

----------------------



In [16]:
target_probs.reverse()
target_probs

[0.9076674580574036,
 0.2533226013183594,
 0.24416394531726837,
 0.3062775731086731,
 0.27746468782424927]

In [17]:
all_target_probs.reverse()

In [18]:
print(f"Speculated tokens: {draft_next_tokens}")
print(f"Speculated tokens probs: {draft_next_token_probs}")

Speculated tokens: [' down', ' into', ' its', ' constituent', ' components']
Speculated tokens probs: [0.8263538479804993, 0.5781221985816956, 0.11474165320396423, 0.3421257734298706, 0.4991067051887512]


## Speculative sampling

`p: Draft/Small model likelihood for token x`

`q: Target/Large model likelihood for token x`

`Case 1: If q(x) >= p(x) then accept token x`

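`Case 2: If q(x) < p(x) then accept token x with probability q(x)/p(x) (a biased coin flip)`

`As soon as a token is rejected, break from the loop and sample a replacement from the residual distribution norm(max(0, q - p)); if every token is accepted, sample one extra token from q`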
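
Both cases collapse into a single rule: accept token x with probability min(1, q(x)/p(x)). The short calculation below applies that rule to the draft and target likelihoods printed earlier in this notebook; it is a worked example of the arithmetic only, not a replacement for the decoding function.

```python
# Draft (p) and target (q) likelihoods as printed earlier in this notebook.
p = [0.8263538479804993, 0.5781221985816956, 0.11474165320396423,
     0.3421257734298706, 0.4991067051887512]
q = [0.9076674580574036, 0.2533226013183594, 0.24416394531726837,
     0.3062775731086731, 0.27746468782424927]

# Case 1 (q >= p) gives probability 1; Case 2 (q < p) gives q / p.
accept_probs = [min(1.0, qi / pi) for pi, qi in zip(p, q)]
for i, a in enumerate(accept_probs):
    print(f"token {i}: accept with probability {a:.4f}")
```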


In [19]:
def run_speculative_decoding(draft_next_tokens, draft_next_token_probs, target_probs):
  accepted_tokens = []
  for inx in range(len(draft_next_tokens)):
    token = draft_next_tokens[inx]
    p = draft_next_token_probs[inx]
    q = target_probs[inx]

    print(f"inx: {inx}: p: {p} & q: {q}")
    print(f"Evaluating: {token}")

    if q >= p:
      # Case 1: the target model is at least as confident as the draft model
      print("accepting!\n")
      accepted_tokens.append(token)
    else:
      # Case 2: accept with probability q/p
      prob = q/p
      print(f"sampling with prob: {prob}")
      if random.random() <= prob:
        print("accepting!\n")
        accepted_tokens.append(token)
      else:
        # First rejection: stop here; this position is then filled using the target model
        print("breaking from loop!!")
        break

  # Return the accepted prefix so the caller can extend the prompt with it
  return accepted_tokens
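
The loop above works on one scalar likelihood per token. When the full draft and target distributions are available at each position, the rejected position can be resampled from the normalised difference max(0, q - p) instead of taking a greedy pick. The sketch below shows one way to write that, assuming per-position distributions stored as plain lists; `residual_distribution` and `verify_draft` are hypothetical names, and the code assumes every drafted token has non-zero draft probability.

```python
import random


def residual_distribution(q_dist, p_dist):
    """Normalise max(0, q - p) so it can be sampled from after a rejection."""
    diff = [max(0.0, qv - pv) for qv, pv in zip(q_dist, p_dist)]
    total = sum(diff)
    if total == 0.0:
        # q and p are identical at this position; fall back to q itself.
        return list(q_dist)
    return [d / total for d in diff]


def verify_draft(draft_ids, p_dists, q_dists, rng):
    """Accept a prefix of draft_ids, then append one corrected or bonus token.

    p_dists[j] and q_dists[j] are the draft and target distributions for draft
    position j; q_dists has one extra entry for the bonus position.
    """
    out = []
    vocab = range(len(q_dists[0]))
    for j, tok in enumerate(draft_ids):
        accept_prob = min(1.0, q_dists[j][tok] / p_dists[j][tok])
        if rng.random() < accept_prob:
            out.append(tok)
            continue
        # First rejection: resample this position from the residual and stop.
        residual = residual_distribution(q_dists[j], p_dists[j])
        out.append(rng.choices(vocab, weights=residual)[0])
        return out
    # Every draft token was accepted: sample the bonus token from the target.
    out.append(rng.choices(vocab, weights=q_dists[len(draft_ids)])[0])
    return out


# Toy distributions over a 3-token vocabulary (numbers are made up).
p_dists = [[0.2, 0.6, 0.2]]
q_dists = [[0.5, 0.2, 0.3], [0.1, 0.1, 0.8]]
print(residual_distribution(q_dists[0], p_dists[0]))
print(verify_draft([1], p_dists, q_dists, random.Random(0)))
```

Each call returns between 1 and k + 1 tokens, so every verification round makes progress even when the first draft token is rejected.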

## Run speculative decoding technique

In [27]:
run_speculative_decoding(draft_next_tokens, draft_next_token_probs, target_probs)

inx: 0: p: 0.8263538479804993 & q: 0.9076674580574036
Evaluating:  down
accepting!

inx: 1: p: 0.5781221985816956 & q: 0.2533226013183594
Evaluating:  into
sampling with prob: 0.43818175800866066
accepting!

inx: 2: p: 0.11474165320396423 & q: 0.24416394531726837
Evaluating:  its
accepting!

inx: 3: p: 0.3421257734298706 & q: 0.3062775731086731
Evaluating:  constituent
sampling with prob: 0.8952192348392434
accepting!

inx: 4: p: 0.4991067051887512 & q: 0.27746468782424927
Evaluating:  components
sampling with prob: 0.5559225811629163
breaking from loop!!


In [28]:
# Index of the first rejected draft token in the run above (' components')
inx = 4
probabilities = all_target_probs[inx]

# Greedy stand-in for resampling: take the target model's most likely token at that position
predicted_token_id = torch.argmax(probabilities, dim=-1).item()

# Get the probability (likelihood) of the predicted token
predicted_token_probability = probabilities[0, predicted_token_id].item()

# Decode the predicted token ID back to text
predicted_token = target_tokenizer.decode(predicted_token_id)

# Print the replacement token and its likelihood (probability)
print(f"Token '{predicted_token}': '{predicted_token}' - Likelihood: {predicted_token_probability:.4f}")

Token ' components': ' components' - Likelihood: 0.2775
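
The cell above fills the rejected position with the target model's most likely token. If the aim is for the output to follow the target distribution, the rejected position is sampled from the residual norm(max(0, q - p)) instead. A quick way to build confidence in that claim is a simulation on a toy vocabulary: propose from p, accept with probability min(1, q/p), resample rejections from the residual, and compare the empirical frequencies with q. The distributions below are made up and `speculative_sample_one` is a hypothetical helper.

```python
import random
from collections import Counter


def speculative_sample_one(p_dist, q_dist, rng):
    """Draw one token via propose/accept/resample; the result should follow q_dist."""
    vocab = range(len(p_dist))
    x = rng.choices(vocab, weights=p_dist)[0]  # draft proposes x ~ p
    if rng.random() < min(1.0, q_dist[x] / p_dist[x]):
        return x  # accepted
    # Rejected: sample from max(0, q - p); choices() normalises the weights.
    residual = [max(0.0, qv - pv) for qv, pv in zip(q_dist, p_dist)]
    return rng.choices(vocab, weights=residual)[0]


p_dist = [0.2, 0.6, 0.2]  # toy draft distribution (made up)
q_dist = [0.5, 0.2, 0.3]  # toy target distribution (made up)

rng = random.Random(0)
n = 100_000
counts = Counter(speculative_sample_one(p_dist, q_dist, rng) for _ in range(n))
empirical = [counts[v] / n for v in range(len(q_dist))]
print(empirical)  # expected to be close to q_dist
```

The same arithmetic can be checked by hand for token 0: it is proposed with probability 0.2 and always accepted, and the rejection mass of 0.4 (all from token 1) lands on it with residual weight 0.75, giving 0.2 + 0.4 × 0.75 = 0.5 = q(0).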
