In [None]:
!pip install -q transformers

# Best Practices for Generation with Cache

Efficient caching is crucial for optimizing the performance of generative models. It reduces computation time and improves response rates, especially in real-time or resource-intensive applications.

## What Is a Cache and Why Should We Care?

When we have a conversation with someone, it would be slow and inefficient if we had to start from scratch every time we respond. This is where **caching keys and values** comes into play in Transformer models. This is referred to as the **KV Cache**.

The KV cache is needed to optimize generation in autoregressive models, where the model predicts text token by token. This process can be slow since the model can generate only one token at a time, and each new prediction depends on the previous context.

The key-value cache acts as a memory bank for these generative models: the model stores the key-value pairs derived from its self-attention layers for previously processed tokens. By storing this information, the model can avoid redundant computations and instead retrieve the keys and values of previous tokens from the cache. The cache is only used at inference time.
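
To make the saving concrete, here is a minimal sketch that counts how many per-token key/value projections a toy decoder would perform with and without a cache. The function `count_kv_projections` is a hypothetical helper written for this illustration, not part of any library, and it counts work rather than running a real model.

```python
def count_kv_projections(num_tokens: int, use_cache: bool) -> int:
    """Count the per-token key/value projections of a toy decoder."""
    cache = []        # stands in for the stored (key, value) pairs
    projections = 0
    for step in range(1, num_tokens + 1):
        if use_cache:
            # Only the newest token is projected; earlier pairs are reused.
            cache.append((f"k{step}", f"v{step}"))
            projections += 1
        else:
            # Every token seen so far is projected again at each step.
            projections += step
    return projections


without_cache = count_kv_projections(100, use_cache=False)
with_cache = count_kv_projections(100, use_cache=True)
print(without_cache, with_cache)
```

Without a cache the work grows quadratically with the sequence length, while with a cache it grows linearly.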

### How the Cache Object Works in the Attention Mechanism

The Attention module concatenates the current key-values with the past key-values stored in the cache. The combined key-values are then used to compute the attention scores, so the model considers both the previous context and the new input. This results in attention weights of shape `(new_tokens_length, past_kv_length + new_tokens_length)`.

Therefore, when iteratively calling `forward()` instead of the `generate()` method, it is crucial to ensure that the attention mask shape matches the combined length of past and current key-values. The attention mask should have the shape `(batch_size, past_kv_length + new_tokens_length)`.
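
The shape bookkeeping described above can be checked with a small sketch that uses plain Python lists instead of tensors. The helper `attention_scores` and the toy sizes are assumptions made for this illustration. A real attention layer also applies scaling, softmax, and multiple heads.

```python
def attention_scores(queries, keys):
    """Dot-product scores: one row per query, one column per key."""
    return [[sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in keys] for q in queries]


head_dim = 4
past_kv_length, new_tokens_length = 5, 2

past_keys = [[0.1] * head_dim for _ in range(past_kv_length)]     # read from the cache
new_keys = [[0.2] * head_dim for _ in range(new_tokens_length)]    # computed this step
new_queries = [[1.0] * head_dim for _ in range(new_tokens_length)]

all_keys = past_keys + new_keys  # concatenation along the sequence axis
scores = attention_scores(new_queries, all_keys)
scores_shape = (len(scores), len(scores[0]))

# The mask for a batch of one must cover cached and new positions alike.
attention_mask = [[1] * past_kv_length]
attention_mask = [row + [1] * new_tokens_length for row in attention_mask]
mask_shape = (len(attention_mask), len(attention_mask[0]))

print(scores_shape, mask_shape)
```

The last dimension of the scores and of the mask agree, which is the condition a manual `forward()` loop has to maintain at every step.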

The example below shows how to implement our own generation loop.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache
import torch

checkpoint = 'meta-llama/Llama-2-7b-chat-hf'
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
past_key_values = DynamicCache()
messages = [
    {'role': 'user',
     'content': "Hello, what is your name?"}
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors='pt',
    return_dict=True,
).to(device)

generated_ids = inputs.input_ids
cache_position = torch.arange(
    inputs.input_ids.shape[1],
    dtype=torch.int64,
    device=device
)
max_new_tokens = 20

In [None]:
for _ in range(max_new_tokens):
    outputs = model(
        **inputs,
        cache_position=cache_position,
        past_key_values=past_key_values,
        use_cache=True,
    )

    # Greedily sample one next token
    next_token_ids = outputs.logits[:, -1:].argmax(-1)
    generated_ids = torch.cat([generated_ids, next_token_ids], dim=-1)

    # Prepare inputs for the next generation step by leaving unprocessed tokens,
    # in our case we have only one new token and
    # expanding attention mask for the new token, as explained above
    attention_mask = inputs['attention_mask']
    attention_mask = torch.cat(
        [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))],
        dim=-1,
    )
    inputs = {
        'input_ids': next_token_ids,
        'attention_mask': attention_mask,
    }
    # add one more position for the next token
    cache_position = cache_position[-1:] + 1

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

## Generate with Cache

By default, all models in the Transformers library generate with caching, and the `DynamicCache` class is the default cache for most models. It lets the cache grow dynamically by saving more and more keys and values as we generate. If we do not want to use a cache, we can pass `use_cache=False` to the `generate()` method.
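
The growing behaviour can be pictured with a simplified stand-in. `ToyDynamicCache` below is a hypothetical class written for this illustration. Its method names loosely mirror the idea of a per-layer update, but it stores plain strings and is not the library's implementation.

```python
class ToyDynamicCache:
    """Simplified stand-in for a growing per-layer key/value store."""

    def __init__(self):
        self.keys = []    # keys[layer_idx] -> list of per-token keys
        self.values = []  # values[layer_idx] -> list of per-token values

    def update(self, new_keys, new_values, layer_idx):
        # Allocate storage lazily the first time a layer is seen.
        while len(self.keys) <= layer_idx:
            self.keys.append([])
            self.values.append([])
        self.keys[layer_idx].extend(new_keys)
        self.values[layer_idx].extend(new_values)
        return self.keys[layer_idx], self.values[layer_idx]

    def get_seq_length(self, layer_idx=0):
        return len(self.keys[layer_idx]) if layer_idx < len(self.keys) else 0


cache = ToyDynamicCache()
cache.update(["k0", "k1", "k2"], ["v0", "v1", "v2"], layer_idx=0)  # prompt
lengths = [cache.get_seq_length()]
for step in range(3, 6):  # three decoding steps, one token each
    cache.update([f"k{step}"], [f"v{step}"], layer_idx=0)
    lengths.append(cache.get_seq_length())
print(lengths)
```

The stored sequence length increases by one with every generated token, so the memory footprint is not known in advance.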

### Quantized Cache

The key and value cache can occupy a large portion of memory, becoming a bottleneck for long-context generation, especially for Large Language Models. Quantizing the cache when using `generate()` can significantly reduce memory requirements at the cost of speed.
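
A rough back-of-the-envelope estimate shows why the cache becomes a bottleneck. The sketch below assumes a Llama-2-7B-style shape (32 layers, 32 key-value heads, head dimension 128) and ignores the extra scales and zero-points that a quantized cache has to store, so the 4-bit figure is a lower bound. `kv_cache_bytes` is a hypothetical helper, not a library function.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bits_per_value):
    """Approximate KV cache size; ignores quantization scales and zero-points."""
    # Factor 2 accounts for storing both keys and values.
    values_stored = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values_stored * bits_per_value // 8


# Assumed Llama-2-7B-style shape: 32 layers, 32 KV heads, head dimension 128.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)

fp16_gib = kv_cache_bytes(**shape, bits_per_value=16) / 2**30
int4_gib = kv_cache_bytes(**shape, bits_per_value=4) / 2**30
print(f"fp16: {fp16_gib:.2f} GiB, 4-bit: {int4_gib:.2f} GiB")
```

Under these assumptions, a single 4096-token sequence needs about 2 GiB in half precision and about a quarter of that at 4 bits.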

To enable quantization of the key-value cache, we need to set `cache_implementation="quantized"` in the `generation_config`. Quantization-related arguments should be passed to the `generation_config` either as a `dict` or as an instance of the `QuantizedCacheConfig` class.

Cache quantization can be detrimental in terms of latency if the context length is short and there is enough GPU VRAM available to run without cache quantization.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

checkpoint = 'meta-llama/Llama-2-7b-chat-hf'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("I like hip-hop music because", return_tensors='pt').to(device)

In [None]:
out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=20,
    cache_implementation="quantized",
    cache_config={'nbits': 4,
                  'backend': 'quanto'},
)
tokenizer.batch_decode(out, skip_special_tokens=True)[0]

In [None]:
out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=20,
)
tokenizer.batch_decode(out, skip_special_tokens=True)[0]

### Offloaded Cache

Similar to quantization, `OffloadedCache` aims to reduce GPU VRAM usage. It does so by moving the KV cache for most layers to the CPU. As the model's `forward()` method iterates over the layers, this strategy maintains the current layer cache on the GPU. At the same time it asynchronously prefetches the next layer cache as well as sending the previous layer cache back to the CPU.
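
The rotation of layer caches can be traced with a toy simulation. The sketch below only tracks which layer indices would be resident on the GPU. It does not move any tensors or use CUDA streams, and the wrap-around prefetch of layer 0 at the end of a pass is an assumption of this illustration.

```python
def simulate_offloaded_forward(num_layers):
    """Return the layer caches resident on the GPU while each layer runs."""
    on_gpu = {0}  # the first layer's cache is already in place
    trace = []
    for layer_idx in range(num_layers):
        # Send the previous layer's cache back to the CPU.
        on_gpu.discard((layer_idx - 1) % num_layers)
        # Prefetch the next layer's cache (wrapping around for the next pass).
        on_gpu.add((layer_idx + 1) % num_layers)
        # The current layer's cache must be resident while it runs.
        on_gpu.add(layer_idx)
        trace.append(sorted(on_gpu))
    return trace


trace = simulate_offloaded_forward(num_layers=4)
print(trace)
```

At most two layer caches are resident at any time, regardless of how many layers the model has.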

To enable KV cache offloading, pass `cache_implementation="offloaded"` in the `generation_config` or directly to the `generate()` call.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

checkpoint = 'microsoft/Phi-3-mini-4k-instruct'

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("Fun fact: The shortest", return_tensors='pt').to(device)

In [None]:
out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=20,
    cache_implementation='offloaded',
)
tokenizer.batch_decode(out, skip_special_tokens=True)[0]

In [None]:
out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=20,
)
tokenizer.batch_decode(out, skip_special_tokens=True)[0]

Cache offloading requires a GPU and can be slower than the dynamic KV cache. Use it when generation would otherwise fail with CUDA out-of-memory errors.

The example below shows how KV cache offloading can be used as a fallback strategy.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch


def resilient_generate(model, *args, **kwargs):
    oom = False

    try:
        return model.generate(*args, **kwargs)
    except torch.cuda.OutOfMemoryError as e:
        print(e)
        print('retrying with `cache_implementation="offloaded"`')
        oom = True

    if oom:
        torch.cuda.empty_cache()
        kwargs['cache_implementation'] = 'offloaded'
        return model.generate(*args, **kwargs)


checkpoint = 'microsoft/Phi-3-mini-4k-instruct'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

prompt = ['okay'*1000 + 'Fun fact: the most']
inputs = tokenizer(prompt, return_tensors='pt').to(device)

In [None]:
beams = {
    'num_beams': 40,
    'num_beam_groups': 40,
    'num_return_sequences': 40,
    'diversity_penalty': 1.0,
    'max_new_tokens': 20,
    'early_stopping': True,
}

out = resilient_generate(model, **inputs, **beams)
responses = tokenizer.batch_decode(out[:, -28:], skip_special_tokens=True)
responses

### Static Cache

Since the `DynamicCache` dynamically grows with each generation step, it prevents us from taking advantage of JIT optimizations. The `StaticCache` pre-allocates a specific maximum size for the keys and values, allowing us to generate up to the maximum length without having to modify the cache size.
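
The difference from a growing cache can be sketched with a fixed-size buffer that is written in place. `ToyStaticCache` is a hypothetical single-layer class written for this illustration. It stores strings rather than tensors and is not the library's `StaticCache`.

```python
class ToyStaticCache:
    """Fixed-size key store for one layer; slots are overwritten in place."""

    def __init__(self, max_cache_len):
        self.keys = [None] * max_cache_len  # allocated once, never resized

    def update(self, new_keys, cache_position):
        if cache_position and max(cache_position) >= len(self.keys):
            raise IndexError("cache_position exceeds max_cache_len")
        # Write each new key into its pre-allocated slot.
        for pos, key in zip(cache_position, new_keys):
            self.keys[pos] = key
        return self.keys


cache = ToyStaticCache(max_cache_len=8)
sizes = []
cache.update(["k0", "k1", "k2"], cache_position=[0, 1, 2])  # prompt
sizes.append(len(cache.keys))
for pos in range(3, 6):  # three decoding steps, one token each
    cache.update([f"k{pos}"], cache_position=[pos])
    sizes.append(len(cache.keys))
filled = sum(key is not None for key in cache.keys)
print(sizes, filled)
```

The buffer keeps the same size at every step, which is the property that lets a compiler work with fixed shapes.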

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

checkpoint = 'meta-llama/Llama-2-7b-chat-hf'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer('Hello, my name is', return_tensors='pt').to(device)

In [None]:
out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=20,
    cache_implementation='static',
)
tokenizer.batch_decode(out, skip_special_tokens=True)[0]

### Offloaded Static Cache