# Text Generation via Speculative Sampling, KV Caching, and OpenVINO™

As model sizes grow, Generative AI implementations require significant inference resources. This not only increases the cost per generation from a prompt, but also increases the power consumption used to serve such requests.

Inference optimizations for text generation are essential for reducing costs and power consumption. When optimizing the inference process, the amount of time and energy required to generate text can be significantly reduced. This can lead to cost savings in terms of hardware and software, as well as reduced power consumption. Additionally, inference optimizations can help improve the accuracy of text generation as well as the speed at which it can be generated. This can lead to an improved user experience and increased efficiency in text-generation tasks. In summary, inference optimizations for text generation are essential to reduce costs and power consumption, while also improving the accuracy and speed of text generation.

Another necessary condition is that the optimizations are compatible with each other. That is, implementing a certain optimization should not preclude other optimizations. There are several levels of optimizations that can provide significant speedup without "bumping into each other" in a way that will compromise overall efficiency.

**TODO: point to openvino.ai blog article**


<a id="0"></a>
### Table of content:

**TODO: Fix labels and anchors for this notebook**

- [Prerequisites](#1)
    - [Select inference device](#2)
- [Download and Convert Model](#3)
- [Create an instruction-following inference pipeline](#4)
    - [Setup imports](#5)
    - [Prepare template for user prompt](#6)
    - [Helpers for output parsing](#7)
    - [Main generation function](#8)
    - [Helpers for application](#9)
- [Run instruction-following pipeline](#10)

<a id="1"></a>
## Prerequisites [&#8657;](#0)

First, we should install the [Hugging Face Optimum](https://huggingface.co/docs/optimum/installation) library accelerated by OpenVINO integration.
The Hugging Face Optimum Intel API is a high-level API that enables us to convert and quantize models from the Hugging Face Transformers library to the OpenVINO™ IR format. For more details, refer to the [Hugging Face Optimum Intel documentation](https://huggingface.co/docs/optimum/intel/inference).

We will also need to install transformers (HuggingFace) and some other useful modules.

In [1]:
!pip install --upgrade pip
!pip install --upgrade transformers torch gradio optimum-intel openvino accelerate onnx onnxruntime ipywidgets fire
!pip install openvino-dev==2023.1.0.dev20230811



<a id="2"></a>
### Select inference device [&#8657;](#0)

select device from dropdown list for running inference using OpenVINO

In [2]:
import ipywidgets as widgets
from openvino.runtime import Core

core = Core()

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)

device

# DEBUG - force device to be CPU in any case
device.value = 'CPU'

<a id="4"></a>
## Create autoregressive and speculative forms of sampling with KV Cache support [&#8657;](#0)
 
 blah, blah, blah
 


<a id="5"></a>
### Setup imports [&#8657;](#0)


In [3]:
import functools
import sys
import time

import numpy as np
import torch

from transformers import AutoTokenizer, AutoModelForCausalLM
import gradio as gr

from threading import Thread
from typing import List

<a id="6"></a>
### Prepare autoregressive sampling [&#8657;](#0)

In [4]:
def max_fn(x):
    x_max = torch.where(x > 0, x, torch.zeros_like(x))
    return x_max / torch.sum(x_max)


def autoregressive_sampling(x, model, N):
    n = len(x)
    T = n + N

    while n < T:
        model_out = torch.softmax(model(x).logits, dim=2)
        x = torch.cat((x, torch.reshape(torch.argmax(model_out[-1][-1]), (1, 1))), dim=1)
        n += 1

    return x


def autoregressive_sampling_with_pkv(x, model, N):
    n = len(x)
    T = n + N
    input = x
    past_kv = None

    while n < T:
        print("input.size(): ")
        print(input.size())
        if past_kv:
            print("past_kv.size(): ")
            print(past_kv.size())
        res = model(input, past_key_values=past_kv)
        print("res.size(): ")
        print(res.size())
        print("res.logits.shape(): ")
        print(res.logits.shape())
        print("End of prints - now call model")
        model_out = torch.softmax(res.logits, dim=2)
        past_kv = res.past_key_values
        next_token = torch.reshape(torch.argmax(model_out[-1][-1]), (1, 1))
        x = torch.cat((x, next_token), dim=1)
        n += 1
        input = next_token

    return x

<a id="7"></a>
### Prepare speculative sampling [&#8657;](#0)


In [5]:
def speculative_sampling(x, draft_model, target_model, N, K):
    n = x.size(1)
    T = n + N

    while n < T:
        # Step 1: auto-regressive decode K tokens from draft model and get final p
        x_draft = x
        for _ in range(K):
            p = torch.softmax(draft_model(x_draft).logits, dim=2)
            x_draft = torch.cat((x_draft, torch.reshape(torch.argmax(p[-1][-1]), (1, 1))), dim=1)

        # Step 2: target model forward passes on x_draft
        q = torch.softmax(target_model(x_draft).logits, dim=2)

        # Step 3: append draft tokens based on rejection criterion and resample
        # a token on rejection
        all_accepted = True
        for _ in range(K):
            i = n - 1
            j = x_draft[-1][i + 1].item()

            q_item = q[-1][i][j].detach().numpy()
            p_item = p[-1][i][j].detach().numpy()

            if np.random.random() < min(1, (q_item / p_item)):  # accepted
                x = torch.cat((x, torch.tensor(j).reshape(1, 1)), dim=1)
                n += 1
            else:  # rejected
                q_p = max_fn(q[0][i] - p[0][i])
                # softmax isn't working here - q and p were reduced to arrays of numbers
                resampled_output = torch.argmax(q_p)      
                resampled_output = torch.reshape(resampled_output, (1, 1))
                x = torch.cat((x, resampled_output), dim=1)
                n += 1
                all_accepted = False
                break
            
        # Step 4: if all draft tokens were accepted, sample a final token
        if all_accepted:
            x = torch.cat((x, torch.reshape(torch.argmax(q[-1][-1]), (1, 1))), dim=1)
            n += 1

        # just keeping my sanity
        assert n == len(x.detach().numpy()[0]), f"{n} {len(x.detach().numpy()[0])}"

    return x


def speculative_sampling_with_pkv(x, draft_model, target_model, N, K):
    # NOTE: paper indexes arrays starting from 1, python indexes from 0, so
    # we have to add an extra -1 term when indexing using n, T, or t
    n = x.size(1)
    T = n + N
    target_past_kv = None
    while n < T:
        # Step 1: auto-regressive decode K tokens from draft model and get final p
        x_draft = None
        draft_past_kv = None
        x_draft_input = x
        p_cum = None
        q_cum = None
        for _ in range(K):
            res_draft = draft_model(x_draft_input, past_key_values=draft_past_kv, use_cache=True)
            p = res_draft.logits
            p = torch.softmax(p, dim=2)
            draft_past_kv = res_draft.past_key_values
            next_token = torch.reshape(torch.argmax(p[-1][-1]), (1, 1))
            x_draft_input = next_token
            if p_cum is None:
                p_cum = p[:, -1].unsqueeze(1)
                x_draft = next_token
            else:
                p_cum = torch.cat((p_cum, p), dim=1)
                x_draft = torch.cat((x_draft, next_token), dim=1)
        # Step 2: target model forward passes on x_draft
        if target_past_kv is None:
            x_draft_target_input = torch.cat((x, x_draft), dim=1)
        else:
            x_draft_target_input = x_draft
        res = target_model(x_draft_target_input, past_key_values=target_past_kv, use_cache=True)
        #q = torch.softmax(res.logits, dim=2)    "does this work and is it better?"
        q = res.logits
        target_new_past_kv = res.past_key_values
        if q_cum is None:
            q_cum = q
        else:
            q_cum = torch.cat((q_cum, q), dim=1)
        # Step 3: append draft tokens based on rejection criterion and resample
        # a token on rejection
        all_accepted = True
        for k in range(K):
            #i = n - 1
            j = x_draft[0][k].item()

            q_item = q[-1][k][j].detach().numpy()
            p_item = p_cum[-1][k][j].detach().numpy()

            if np.random.random() < min(1, (q_item / p_item)):  # accepted
                x = torch.cat((x, torch.tensor(j).reshape(1,1)), dim=1)
                n += 1
                #target_past_kv = torch.cat((target_past_kv, target_new_past_kv[:, :, n, :]), dim=2)
            else:  # rejected
                q_p = max_fn(q[0][k] - p_cum[0][k])
                # softmax isn't working here - q and p were reduced to arrays of numbers
                #resampled_output = torch.softmax(torch.argmax(q_p), dim=2)
                resampled_output = torch.argmax(q_p)      
                resampled_output = torch.reshape(resampled_output, (1,1))
                x = torch.cat((x, resampled_output), dim=1)
                n += 1
                all_accepted = False
                break
            
        target_past_kv = target_new_past_kv
        # Step 4: if all draft tokens were accepted, sample a final token
        if all_accepted:
            x = torch.cat((x, torch.reshape(torch.argmax(q[-1][-1]), (1,1))), dim=1)
            n += 1

        # just keeping my sanity
        assert n == len(x.detach().numpy()[0]), f"{n} {len(x.detach().numpy()[0])}"

    return x


<a id="8"></a>
## Main generation function [&#8657;](#0)



<a id="3"></a>
### Download and Convert Model [&#8657;](#0)

**TODO: I have not changed any of this yet and customized it for speculative sampling - but we should have an optimized model**

Optimum Intel can be used to load optimized models from the [Hugging Face Hub](https://huggingface.co/docs/optimum/intel/hf.co/models) and create pipelines to run an inference with OpenVINO Runtime using Hugging Face APIs. The Optimum Inference models are API compatible with Hugging Face Transformers models.  This means we just need to replace `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.

Below is an example of the Dolly model

```diff
-from transformers import AutoModelForCausalLM
+from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "databricks/dolly-v2-3b"
-model = AutoModelForCausalLM.from_pretrained(model_id)
+model = OVModelForCausalLM.from_pretrained(model_id, from_transformers=True)
```

Model class initialization starts with calling `from_pretrained` method. When downloading and converting Transformers model, the parameter `from_transformers=True` should be added. We can save the converted model for the next usage with the `save_pretrained` method.
Tokenizer class and pipelines API are compatible with Optimum models.


In [6]:
def main(
    prompt: str = "Alan Turing theorized that computers would one day become",
    n_tokens_to_generate: int = 40,
    K: int = 5,
    seed: int = 223,
):
    from pathlib import Path
    from transformers import AutoTokenizer
    from optimum.intel.openvino import OVModelForCausalLM
    import json

    #draft_model_id = "databricks/dolly-v2-3b"
    #draft_model_path = Path("dolly-v2-3b")
    #target_model_id = "databricks/dolly-v2-12b"
    #target_model_path = Path("dolly-v2-12b")
    # facebook/opt-6.7b could be more interesting target model than facebook/opt-1.3b
    draft_model_id = "facebook/opt-125m"
    draft_model_path = Path("facebook/opt-125m-local")
    target_model_id = "facebook/opt-6.7b"
    target_model_path = Path("facebook/opt-6.7b-local")

    target_tokenizer = AutoTokenizer.from_pretrained(target_model_id)

    current_device = device.value

    # Save local copies for subsequent runs
    if draft_model_path.exists():
        draft_ov_model = OVModelForCausalLM.from_pretrained(draft_model_path, device=current_device)
    else:
        draft_ov_model = OVModelForCausalLM.from_pretrained(draft_model_id, device=current_device, from_transformers=True)
        draft_ov_model.save_pretrained(draft_model_path)
    if target_model_path.exists():
        target_ov_model = OVModelForCausalLM.from_pretrained(target_model_path, device=current_device)
    else:
        target_ov_model = OVModelForCausalLM.from_pretrained(target_model_id, device=current_device, from_transformers=True)
        target_ov_model.save_pretrained(target_model_path)
    
    # seed numpy rng
    np.random.seed(seed)
    # Try eventually databricks/dolly-v2-3b and 12b
    #draft_model = AutoModelForCausalLM.from_pretrained(draft_model_id, pad_token_id=target_tokenizer.eos_token_id)
    #target_model = AutoModelForCausalLM.from_pretrained(target_model_id, pad_token_id=target_tokenizer.eos_token_id)
    draft_model = draft_ov_model
    target_model = target_ov_model
    

    input_ids = target_tokenizer(prompt, return_tensors="pt")['input_ids']

    def run_autoregressive_sampling_fn(decode_fn, input_ids, **kwargs):
        start = time.perf_counter()
        output_ids = decode_fn(x=input_ids, **kwargs)
        # this isn't right as speculative will have to switch between models
        text = target_tokenizer.decode(output_ids[0], skip_special_tokens=True)
        elapsed_time = time.perf_counter() - start
        return text, elapsed_time

    def run_speculative_sampling_fn(decode_fn, input_ids, **kwargs):
        start = time.perf_counter()
        output_ids = decode_fn(x=input_ids, **kwargs)
        # this isn't right as speculative will have to switch between models
        text = target_tokenizer.decode(output_ids[0], skip_special_tokens=True)
        elapsed_time = time.perf_counter() - start
        return text, elapsed_time

    autoregressive_text, autoregressive_time = run_autoregressive_sampling_fn(
        autoregressive_sampling_with_pkv,
        input_ids,
        model=target_model,
        N=n_tokens_to_generate,
    )

    speculative_text, speculative_time = run_speculative_sampling_fn(
        speculative_sampling_with_pkv,
        input_ids,
        target_model=target_model,
        draft_model=draft_model,
        N=n_tokens_to_generate,
        K=K,
    )

#   results
    print()
    print("Autoregressive Decode")
    print("---------------------")
    print(f"Time = {autoregressive_time:.2f}s")
    print(f"Text = {autoregressive_text}")
    print()
    print("Speculative Decode")
    print("------------------")
    print(f"Time = {speculative_time:.2f}s")
    print(f"Text = {speculative_text}")


if __name__ == "__main__":

    #with gr.Blocks() as demo:
    #    gr.Markdown(
    #    """
    #    # Speculative Sampling Demo
    #    ## The output will show a comparison of Autoregressive Sampling vs Speculative Sampling
    #    - Target Model: gpt2-xl
    #    - Draft Model: gpt2
    #    - K = 5
    #    > Some improvements can be made to acceptance criterion and adjusting temperature to improve text quality.
    #    > Note: Have patience as it takes time run on free CPUs.
    #    """)
    #    with gr.Row():
    #        inp = gr.Textbox(placeholder="THIS CANNOT BE EMPTY", label="Input Prompt")
    #        out = gr.Textbox(label="Output")
    #    btn = gr.Button("Run")
    #    btn.click(fn=main, inputs=inp, outputs=out)
    #demo.launch(share=True)

    import fire

    fire.Fire(main)

Compiling the model...
Set CACHE_DIR to facebook/opt-125m-local/model_cache
The argument `from_transformers` is deprecated, and will be removed in optimum 2.0.  Use `export` instead
Framework not specified. Using pt to export to ONNX.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

Using framework PyTorch: 2.0.1
Overriding 1 configuration item(s)
	- use_cache -> True
  elif attention_mask.shape[1] != mask_seq_length:
  if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
  if attention_mask.size() != (bsz, 1, tgt_len, src_len):
  attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device)
  if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
Saving external data to one file...


verbose: False, log level: Level.ERROR



Compiling the model...
Set CACHE_DIR to /var/folders/l5/m569ndsx2dz5kbh1bp6m26d00000gp/T/tmp440gssk0/model_cache


input.size(): 
torch.Size([1, 11])


  self.request.start_async(inputs, shared_memory=True)


RuntimeError: Exception from src/inference/src/infer_request.cpp:249:
Exception from src/inference/src/dev/converter_utils.cpp:706:
Exception from src/bindings/python/src/pyopenvino/core/infer_request.hpp:54:
Caught exception: [ GENERAL_ERROR ] Shape inference of Select node with name /decoder/Where_4 failed: Exception from src/plugins/intel_cpu/src/shape_inference/custom/eltwise.cpp:47:
Eltwise shape infer input shapes dim index: 3 mismatch



