# Text Generation via Speculative Sampling, KV Caching, and OpenVINO™

As model sizes grow, Generative AI implementations require significant inference resources. This not only increases the cost per generation from a prompt, but also increases the power consumption used to serve such requests.

Inference optimizations for text generation are essential for reducing costs and power consumption. When optimizing the inference process, the amount of time and energy required to generate text can be significantly reduced. This can lead to cost savings in terms of hardware and software, as well as reduced power consumption. Additionally, inference optimizations can help improve the accuracy of text generation as well as the speed at which it can be generated. This can lead to an improved user experience and increased efficiency in text-generation tasks. In summary, inference optimizations for text generation are essential to reduce costs and power consumption, while also improving the accuracy and speed of text generation.

Another necessary condition is that the optimizations are compatible with each other. That is, implementing a certain optimization should not preclude other optimizations. There are several levels of optimizations that can provide significant speedup without "bumping into each other" in a way that will compromise overall efficiency.

For details on this method, please refer to the DeepMind paper by Chen et al, http://arxiv.org/abs/2302.01318

<a id="0"></a>
### Table of content:

- [Prerequisites](#1)
    - [Select inference device](#2)
- [Download and Convert Model](#3)
- [Create autoregressive and speculative forms of sampling with KV Cache support](#4)
    - [Setup imports](#5)
    - [Prepare autoregressive sampling](#6)
    - [Prepare speculative sampling](#7)
    - [Main generation function](#8)
    - [Download and convert model](#9)

<a id="1"></a>
## Prerequisites [&#8657;](#0)

First, we must install the [Hugging Face Optimum](https://huggingface.co/docs/optimum/installation) library accelerated by OpenVINO integration.
The Hugging Face Optimum Intel API is a high-level API that enables us to convert and quantize models from the Hugging Face Transformers library to the OpenVINO™ IR format. For more details, refer to the [Hugging Face Optimum Intel documentation](https://huggingface.co/docs/optimum/intel/inference).

We will also need to install transformers (HuggingFace) and some other useful modules.

In [1]:
%pip install -q --upgrade pip
%pip install -q --upgrade transformers torch gradio openvino accelerate onnx onnxruntime ipywidgets
%pip install -q "git+https://github.com/huggingface/optimum-intel.git"

[0mNote: you may need to restart the kernel to use updated packages.
[0m[33mDEPRECATION: pytorch-lightning 1.6.5 has a non-standard dependency specifier torch>=1.8.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pytorch-lightning or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063[0m[33m
[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dalle2-pytorch 1.10.5 requires einops>=0.4, but you have einops 0.3.2 which is incompatible.
dalle2-pytorch 1.10.5 requires webdataset>=0.2.5, but you have webdataset 0.1.62 which is incompatible.
detectron2 0.6 requires pycocotools>=2.0.2, but you have pycocotools 2.0 which is incompatible.
pyannote-audio 2.0.1 requires torchaudio<1.0,>=0.10,

<a id="2"></a>
### Select inference device [&#8657;](#0)

select device from dropdown list for running inference using OpenVINO

In [2]:
import ipywidgets as widgets
import openvino as ov

core = ov.Core()

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='CPU',
    description='Device:',
    disabled=False,
)

device

Dropdown(description='Device:', options=('CPU', 'GNA', 'AUTO'), value='CPU')

<a id="4"></a>
## Create autoregressive and speculative forms of sampling with KV Cache support [&#8657;](#0)
 
 blah, blah, blah
 


<a id="5"></a>
### Setup imports [&#8657;](#0)


In [3]:
import functools
import sys
import time
import json
import numpy as np
import torch
import gradio as gr

from threading import Thread
from typing import List
from pathlib import Path
from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM

INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino


No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
2023-09-26 10:09:55.700603: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-09-26 10:09:55.876688: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


<a id="6"></a>
### Prepare autoregressive sampling [&#8657;](#0)

In [4]:
def max_fn(x):
    x_max = torch.where(x > 0, x, torch.zeros_like(x))
    return x_max / torch.sum(x_max)


# TODO: delete this before publication - it's only for reference
def autoregressive_sampling(x, model, N):
    n = len(x)
    T = n + N

    while n < T:
        model_out = torch.softmax(model(x, attention_mask=torch.ones(x.size(), dtype=torch.long)).logits, dim=2)
        x = torch.cat((x, torch.reshape(torch.argmax(model_out[-1][-1]), (1, 1))), dim=1)
        n += 1

    return x


def autoregressive_sampling_with_pkv(x, model, N):
    n = len(x)
    T = n + N
    input = x
    past_kv = None

    while n < T:
        res = model(input, attention_mask=torch.ones(input.size(), dtype=torch.long), past_key_values=past_kv)
        model_out = torch.softmax(res.logits, dim=2)
        past_kv = res.past_key_values
        next_token = torch.reshape(torch.argmax(model_out[-1][-1]), (1, 1))
        x = torch.cat((x, next_token), dim=1)
        n += 1
        input = next_token

    return x

<a id="7"></a>
### Prepare speculative sampling [&#8657;](#0)


In [12]:
# TODO: delete this before publication - it's only for reference
def speculative_sampling(x, draft_model, target_model, N, K):
    n = x.size(1)
    T = n + N

    while n < T:
        # Step 1: auto-regressive decode K tokens from draft model and get final p
        x_draft = x
        for _ in range(K):
            p = torch.softmax(draft_model(x_draft, attention_mask=torch.ones(x_draft.size(), dtype=torch.long)).logits, dim=2)
            x_draft = torch.cat((x_draft, torch.reshape(torch.argmax(p[-1][-1]), (1, 1))), dim=1)

        # Step 2: target model forward passes on x_draft
        q = torch.softmax(target_model(x_draft, attention_mask=torch.ones(x_draft.size(), dtype=torch.long)).logits, dim=2)

        # Step 3: append draft tokens based on rejection criterion and resample
        # a token on rejection
        all_accepted = True
        for _ in range(K):
            i = n - 1
            j = x_draft[-1][i + 1].item()

            q_item = q[-1][i][j].detach().numpy()
            p_item = p[-1][i][j].detach().numpy()

            if np.random.random() < min(1, np.abs(q_item / p_item)):  # accepted
                x = torch.cat((x, torch.tensor(j).reshape(1, 1)), dim=1)
                n += 1
            else:  # rejected
                q_p = max_fn(q[0][i] - p[0][i])
                # softmax isn't working here - q and p were reduced to arrays of numbers
                resampled_output = torch.argmax(q_p)      
                resampled_output = torch.reshape(resampled_output, (1, 1))
                x = torch.cat((x, resampled_output), dim=1)
                n += 1
                all_accepted = False
                break
            
        # Step 4: if all draft tokens were accepted, sample a final token
        if all_accepted:
            x = torch.cat((x, torch.reshape(torch.argmax(q[-1][-1]), (1, 1))), dim=1)
            n += 1

    return x


def speculative_sampling_with_pkv(x, draft_model, target_model, N, K):
    # NOTE: paper indexes arrays starting from 1, python indexes from 0, so
    # we have to add an extra -1 term when indexing using n, T, or t
    n = x.size(1)
    T = n + N
    target_past_kv = None
    while n < T:
        # Step 1: auto-regressive decode K tokens from draft model and get final p
        x_draft = None
        draft_past_kv = None
        x_draft_input = x
        p_cum = None
        q_cum = None
        for _ in range(K):
            past_len = 0 if draft_past_kv is None else draft_past_kv[0][0].shape[2]
            res_draft = draft_model(x_draft_input, attention_mask=torch.ones((1, x_draft_input.shape[1] + past_len), dtype=torch.long), past_key_values=draft_past_kv, use_cache=True)
            p = res_draft.logits
            p = torch.softmax(p, dim=2)
            draft_past_kv = res_draft.past_key_values
            next_token = torch.reshape(torch.argmax(p[-1][-1]), (1, 1))
            x_draft_input = next_token
            if p_cum is None:
                p_cum = p[:, -1].unsqueeze(1)
                x_draft = next_token
            else:
                p_cum = torch.cat((p_cum, p), dim=1)
                x_draft = torch.cat((x_draft, next_token), dim=1)
        # Step 2: target model forward passes on x_draft
        if target_past_kv is None:
            x_draft_target_input = torch.cat((x, x_draft), dim=1)
        else:
            x_draft_target_input = x_draft
        # in terms of performance - so both should be tested.
        past_len = 0 if target_past_kv is None else target_past_kv[0][0].shape[2]
        res = target_model(x_draft_target_input, attention_mask=torch.zeros((1, x_draft_target_input.shape[1] + past_len), dtype=torch.long), past_key_values=target_past_kv, use_cache=True)
        q = res.logits
        target_new_past_kv = res.past_key_values
        if q_cum is None:
            q_cum = q
        else:
            q_cum = torch.cat((q_cum, q), dim=1)
        # Step 3: append draft tokens based on rejection criterion and resample
        # a token on rejection
        all_accepted = True
        for k in range(K):
            #i = n - 1
            j = x_draft[0][k].item()
            s_q = torch.softmax(q, dim=2)
            s_p_cum = torch.softmax(p_cum, dim=2)
            q_item = s_q[-1][k][j].detach().numpy()
            p_item = s_p_cum[-1][k][j].detach().numpy()

            if np.random.random() < min(1, (q_item / p_item)):  # accepted
                x = torch.cat((x, torch.tensor(j).reshape(1,1)), dim=1)
                n += 1
            else:  # rejected
                q_p = max_fn(q[0][k] - p_cum[0][k])
                resampled_output = torch.argmax(q_p)      
                resampled_output = torch.reshape(resampled_output, (1,1))
                x = torch.cat((x, resampled_output), dim=1)
                n += 1
                all_accepted = False
                break
            
        # Step 4: if all draft tokens were accepted, sample a final token
        if all_accepted:
            x = torch.cat((x, torch.reshape(torch.argmax(q[-1][-1]), (1,1))), dim=1)
            n += 1
            target_past_kv = target_new_past_kv
        else:
            target_past_kv = None

    return x


<a id="8"></a>
## Main generation function [&#8657;](#0)



<a id="9"></a>
### Download and Convert Model [&#8657;](#0)

**TODO: I have not changed any of this yet and customized it for speculative sampling - but we should have an optimized model**

Optimum Intel can be used to load optimized models from the [Hugging Face Hub](https://huggingface.co/docs/optimum/intel/hf.co/models) and create pipelines to run an inference with OpenVINO Runtime using Hugging Face APIs. The Optimum Inference models are API compatible with Hugging Face Transformers models.  This means we just need to replace `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.

Below is an example of the Dolly model

```diff
-from transformers import AutoModelForCausalLM
+from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "databricks/dolly-v2-3b"
-model = AutoModelForCausalLM.from_pretrained(model_id)
+model = OVModelForCausalLM.from_pretrained(model_id, from_transformers=True)
```

Model class initialization starts with calling `from_pretrained` method. When downloading and converting Transformers model, the parameter `from_transformers=True` should be added. We can save the converted model for the next usage with the `save_pretrained` method.
Tokenizer class and pipelines API are compatible with Optimum models.


In [13]:
from typing import Optional, Tuple
from openvino.runtime import Type, Tensor
from typing import Optional, Tuple
from transformers.modeling_outputs import CausalLMOutputWithPast

class OVModelForCausalLMWithMultiTokenPKV(OVModelForCausalLM):
       def forward(
        self,
        input_ids: torch.LongTensor,
        attention_mask: Optional[torch.LongTensor] = None,
        past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        **kwargs,
    ) -> CausalLMOutputWithPast:
        self.compile()

        inputs = {}
        if past_key_values is not None:
            if self._pkv_precision == Type.bf16:
                # numpy does not support bf16, pretending f16, should change to bf16
                past_key_values = tuple(
                    Tensor(past_key_value, past_key_value.shape, Type.bf16)
                    for pkv_per_layer in past_key_values
                    for past_key_value in pkv_per_layer
                )
            else:
                # Flatten the past_key_values
                past_key_values = tuple(
                    past_key_value for pkv_per_layer in past_key_values for past_key_value in pkv_per_layer
                )
            # Add the past_key_values to the decoder inputs
            inputs = dict(zip(self.key_value_input_names, past_key_values))

        # Create empty past_key_values for decoder_with_past first generation step
        elif self.use_cache:
            shape_input_ids = input_ids.shape
            num_attention_heads = (
                self.normalized_config.num_attention_heads if self.config.model_type == "bloom" else 1
            )
            for input_name in self.key_value_input_names:
                model_inputs = self.model.input(input_name)
                shape = model_inputs.get_partial_shape()
                shape[0] = shape_input_ids[0] * num_attention_heads
                if shape[2].is_dynamic:
                    shape[2] = 0
                if shape[1].is_dynamic:
                    shape[1] = 0
                inputs[input_name] = Tensor(model_inputs.get_element_type(), shape.get_shape())

        inputs["input_ids"] = np.array(input_ids)

        # Add the attention_mask inputs when needed
        if "attention_mask" in self.input_names and attention_mask is not None:
            inputs["attention_mask"] = np.array(attention_mask)

        # Run inference
        self.request.start_async(inputs, shared_memory=True)
        self.request.wait()

        logits = torch.from_numpy(self.request.get_tensor("logits").data).to(self.device)

        if self.use_cache:
            # Tuple of length equal to : number of layer * number of past_key_value per decoder layer (2 corresponds to the self-attention layer)
            past_key_values = tuple(self.request.get_tensor(key).data for key in self.key_value_output_names)
            # Tuple of tuple of length `n_layers`, with each tuple of length equal to 2 (k/v of self-attention)
            past_key_values = tuple(
                past_key_values[i : i + self.num_pkv] for i in range(0, len(past_key_values), self.num_pkv)
            )
        else:
            past_key_values = None

        return CausalLMOutputWithPast(logits=logits, past_key_values=past_key_values)


def main(
    prompt: str = "Alan Turing theorized that computers would one day become",
    n_tokens_to_generate: int = 40,
    K: int = 5,
    seed: int = 5555,
):
    # Consider a model selector here
    #draft_model_id = "databricks/dolly-v2-3b"
    #draft_model_path = Path("dolly-v2-3b")
    #target_model_id = "databricks/dolly-v2-12b"
    #target_model_path = Path("dolly-v2-12b")
    ##facebook/opt-6.7b could be more interesting target model than facebook/opt-1.3b
    # draft_model_id = "facebook/opt-125m"
    # draft_model_path = Path("facebook/opt-125m-local")
    # target_model_id = "facebook/opt-1.3b"
    # target_model_path = Path("facebook/opt-1.3b-local")
    draft_model_id = "gpt2"
    draft_model_path = Path("gpt2-local")
    target_model_id = "gpt2-xl"
    target_model_path = Path("gpt2-xl-local")

    target_tokenizer = AutoTokenizer.from_pretrained(target_model_id)

    current_device = device.value

    # Save local copies for subsequent runs
    if draft_model_path.exists():
        draft_ov_model = OVModelForCausalLM.from_pretrained(draft_model_path, device=current_device)
    else:
        draft_ov_model = OVModelForCausalLM.from_pretrained(draft_model_id, device=current_device, from_transformers=True)
        draft_ov_model.save_pretrained(draft_model_path)
    if target_model_path.exists():
        target_ov_model = OVModelForCausalLM.from_pretrained(target_model_path, device=current_device)
    else:
        target_ov_model = OVModelForCausalLM.from_pretrained(target_model_id, device=current_device, from_transformers=True)
        target_ov_model.save_pretrained(target_model_path)
    
    # seed numpy rng
    np.random.seed(seed)    

    input_ids = target_tokenizer(prompt, return_tensors="pt")['input_ids']

    def run_autoregressive_sampling_fn(decode_fn, input_ids, **kwargs):
        start = time.perf_counter()
        output_ids = decode_fn(x=input_ids, **kwargs)
        text = target_tokenizer.decode(output_ids[0], skip_special_tokens=True)
        elapsed_time = time.perf_counter() - start
        return text, elapsed_time

    def run_speculative_sampling_fn(decode_fn, input_ids, **kwargs):
        start = time.perf_counter()
        output_ids = decode_fn(x=input_ids, **kwargs)
        text = target_tokenizer.decode(output_ids[0], skip_special_tokens=True)
        elapsed_time = time.perf_counter() - start
        return text, elapsed_time

    autoregressive_text, autoregressive_time = run_autoregressive_sampling_fn(
        autoregressive_sampling_with_pkv,
        input_ids,
        model=target_ov_model,
        N=n_tokens_to_generate,
    )
    if target_model_path.exists():
        target_ov_model = OVModelForCausalLMWithMultiTokenPKV.from_pretrained(target_model_path, device=current_device)
    else:
        target_ov_model = OVModelForCausalLMWithMultiTokenPKV.from_pretrained(target_model_id, device=current_device, from_transformers=True)
        target_ov_model.save_pretrained(target_model_path)

    speculative_text, speculative_time = run_speculative_sampling_fn(
        speculative_sampling_with_pkv,
        input_ids,
        target_model=target_ov_model,
        draft_model=draft_ov_model,
        N=n_tokens_to_generate,
        K=K,
    )

#   Print results for comparison of text and time
    print()
    print("Autoregressive Decode")
    print("---------------------")
    print(f"Time = {autoregressive_time:.2f}s")
    print(f"Text = {autoregressive_text}")
    print()
    print("Speculative Decode")
    print("------------------")
    print(f"Time = {speculative_time:.2f}s")
    print(f"Text = {speculative_text}")


if __name__ == "__main__":

    with gr.Blocks() as demo:
        gr.Markdown(
        """
        # Speculative Sampling Demo
        ## The output will show a comparison of Autoregressive Sampling vs Speculative Sampling
        - Target Model: gpt2-xl
        - Draft Model: gpt2
        - K = 5
        > Some improvements can be made to acceptance criterion and adjusting temperature to improve text quality.
        """)
        with gr.Row():
            inp = gr.Textbox(placeholder="THIS CANNOT BE EMPTY", label="Input Prompt")
            out = gr.Textbox(label="Output")
        btn = gr.Button("Run")
        btn.click(fn=main, inputs=inp, outputs=out)
    demo.launch()

In [14]:
main()

Compiling the model...
Set CACHE_DIR to gpt2-local/model_cache
Compiling the model...
Set CACHE_DIR to gpt2-xl-local/model_cache
Compiling the model...
Set CACHE_DIR to gpt2-xl-local/model_cache
  self.request.start_async(inputs, shared_memory=True)


(1, 25, 15, 64)
(1, 25, 17, 64)
(1, 25, 18, 64)
(1, 25, 23, 64)
(1, 25, 28, 64)
(1, 25, 31, 64)
(1, 25, 32, 64)
(1, 25, 34, 64)
(1, 25, 35, 64)
(1, 25, 40, 64)
(1, 25, 45, 64)
(1, 25, 50, 64)

Autoregressive Decode
---------------------
Time = 5.84s
Text = Alan Turing theorized that computers would one day become so powerful that they would be able to think like humans.

In the 1950s, he proposed a way to build a computer that could think like a human. He called it the "T

Speculative Decode
------------------
Time = 3.35s
Text = Alan Turing theorized that computers would one day become the Turing, the Turing machine.

The Turing machine is a theic, a Turing, a Turing machine.
 the Turing machine is a the aic, a Turing, the Turing machine.

 the
