# Creating a new press

In this guide, we will walk you through the process of creating a new press.

In [1]:
!pip uninstall tensorflow -y
!pip install kvpress --quiet


In [2]:
from dataclasses import dataclass
from contextlib import contextmanager

import torch
from torch import nn
from transformers import pipeline

from kvpress import BasePress, KnormPress, ScorerPress

In [3]:
# Load pipeline

device = "cuda:0"
ckpt = "Qwen/Qwen2.5-1.5B-Instruct"
# Use "flash_attention_2" instead if the flash-attn package is installed
attn_implementation = "sdpa"
pipe = pipeline(
    "kv-press-text-generation",
    model=ckpt,
    device=device,
    torch_dtype="auto",
    model_kwargs={"attn_implementation": attn_implementation},
)

Device set to use cuda:0


In [4]:
# Load data

context = "In this step-by-step guide, you will learn how to create a new press in kvpress !"
question = "\nWhat is the purpose of this guide?"
tokens = pipe.tokenizer(context, return_tensors="pt").to(device)

## 1. Understanding how a press works

A press registers a forward hook on each attention layer during the pre-filling phase. The hook is called immediately after the layer's forward pass and compresses the KV cache.

In [5]:
compression_ratio = 0.25
press = KnormPress(compression_ratio)

with torch.no_grad():
    outputs_without_press = pipe.model(**tokens, output_hidden_states=True)

with torch.no_grad(), press(pipe.model):
    output_with_press = pipe.model(**tokens)

print(f"Cache shape w/o press: {outputs_without_press.past_key_values[0][0].shape}")
print(f"Cache shape w/ press:  {output_with_press.past_key_values[0][0].shape}\n")

# The `KVPressTextGenerationPipeline` simply applies the `press` as above on the context tokens (see `_forward` method for more details).
print(pipe(context, question=question, press=press)["answer"])

Cache shape w/o press: torch.Size([1, 2, 20, 128])
Cache shape w/ press:  torch.Size([1, 2, 15, 128])

The purpose of this step-by-step guide is to provide instructions on how to create a new press in kvpress. The guide is designed to help users understand the process of setting up a new press in the kvpress platform, including any necessary steps,
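
The hook mechanism itself is plain PyTorch. The following is a minimal sketch of the idea on a toy module, not the kvpress implementation: `ToyLayer` and `keep_first_tokens` are hypothetical names introduced for illustration, and the "compression" simply truncates the output. It shows that a forward hook runs right after `forward`, that its return value replaces the layer output, and that a context manager can remove the hooks on exit.

```python
from contextlib import contextmanager

import torch
from torch import nn


class ToyLayer(nn.Module):
    """Hypothetical stand-in for an attention layer; returns its input unchanged."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x


@contextmanager
def keep_first_tokens(model: nn.Module, n_kept: int):
    """Register a hook on every ToyLayer and remove it on exit."""

    def hook(module, args, output):
        # Runs right after module.forward; the returned value replaces the output
        return output[:, :n_kept]

    handles = [
        m.register_forward_hook(hook)
        for m in model.modules()
        if isinstance(m, ToyLayer)
    ]
    try:
        yield
    finally:
        for handle in handles:
            handle.remove()


model = nn.Sequential(ToyLayer())
x = torch.zeros(1, 20, 8)  # (batch, sequence, hidden)

with keep_first_tokens(model, n_kept=15):
    shape_with_hook = model(x).shape
shape_without_hook = model(x).shape

print(shape_with_hook)     # torch.Size([1, 15, 8])
print(shape_without_hook)  # torch.Size([1, 20, 8])
```

Outside the `with` block the hooks are gone, so the model behaves as if no press had been applied.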


## 2. Creating your own press


### 2.1 Updating the `score` method

The easiest way to create a new press is to create a class that inherits from `ScorerPress` and implements a `score` method that computes a score for each key-value pair.

The arguments of the `score` method are obtained from the forward hook:
- `module`: the attention layer
- `hidden_states`: the input of the attention layer
- `keys` and `values`: the key-value pairs from the attention layer
- `attentions`: the attention weights, only available with `attn_implementation="eager"`

In this first example, we will reproduce the `KnormPress`, where the score of a key-value pair is simply the negative of the norm of the key vector.

In [6]:
class MyKnormPress(ScorerPress):
    def score(
        self,
        module: nn.Module,
        hidden_states: torch.Tensor,
        keys: torch.Tensor,
        values: torch.Tensor,
        attentions: torch.Tensor,
        kwargs: dict,
    ) -> torch.Tensor:

        scores = -keys.norm(dim=-1)

        # For demonstration, we show some details on the shape for the first layer
        if module.layer_idx == 0:
            print(f"module: {module}")
            print(f"Number of key value heads: {module.config.num_key_value_heads}")
            print(f"Sequence length: {hidden_states.shape[1]}")
            print()
            print(f"hidden_states shape: {hidden_states.shape}")
            print(f"keys shape:          {keys.shape}") # shape (bhnd)
            print(f"values shape:        {values.shape}") # shape (bhnd)
            print(f"score shape:         {scores.shape}") # shape (bhn)
            print()

        return scores


press = MyKnormPress(compression_ratio)
print(pipe(context, question=question, press=press)["answer"])

module: Qwen2Attention(
  (q_proj): Linear(in_features=1536, out_features=1536, bias=True)
  (k_proj): Linear(in_features=1536, out_features=256, bias=True)
  (v_proj): Linear(in_features=1536, out_features=256, bias=True)
  (o_proj): Linear(in_features=1536, out_features=1536, bias=False)
  (rotary_emb): Qwen2RotaryEmbedding()
)
Number of key value heads: 2
Sequence length: 44

hidden_states shape: torch.Size([1, 44, 1536])
keys shape:          torch.Size([1, 2, 44, 128])
values shape:        torch.Size([1, 2, 44, 128])
score shape:         torch.Size([1, 2, 44])

The purpose of this step-by-step guide is to provide instructions on how to create a new press in kvpress. The guide is designed to help users understand the process of setting up a new press in the kvpress platform, including any necessary steps,


### 2.2 Updating the `compress` method

The `compress` method defined in `BasePress` contains the core logic of the compression and returns the compressed keys and values. For instance, in `ScorerPress`, `compress` calls the `score` method (which is specific to `ScorerPress`) and prunes the key-value pairs based on the scores.
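
As a rough sketch of that score-then-prune step, the snippet below keeps the highest-scoring key-value pairs of each head. It is an illustration under stated assumptions, not the library's code: `prune_by_score` is a hypothetical helper, and the number of kept tokens is assumed to be `int(seq_len * (1 - compression_ratio))`, which is consistent with the 20 → 15 cache shapes printed in section 1.

```python
import torch


def prune_by_score(keys, values, scores, compression_ratio):
    """Keep the highest-scoring key-value pairs of each head (hypothetical helper)."""
    seq_len = keys.shape[2]
    # Assumed rounding rule; matches the 20 -> 15 example at ratio 0.25
    n_kept = int(seq_len * (1 - compression_ratio))
    # Indices of the n_kept best scores per head, shape (batch, heads, n_kept)
    indices = scores.topk(n_kept, dim=-1).indices
    # Expand over the last (head dim) axis so gather selects whole vectors
    indices = indices.unsqueeze(-1).expand(-1, -1, -1, keys.shape[-1])
    return keys.gather(2, indices), values.gather(2, indices)


torch.manual_seed(0)
keys = torch.randn(1, 2, 20, 128)  # (batch, kv heads, sequence, head dim)
values = torch.randn(1, 2, 20, 128)
scores = -keys.norm(dim=-1)  # same score as in MyKnormPress

pruned_keys, pruned_values = prune_by_score(keys, values, scores, 0.25)
print(pruned_keys.shape)  # torch.Size([1, 2, 15, 128])
```

Because the score is the negative key norm, the 15 keys kept in each head are the ones with the smallest norms.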

The following example shows how this works by re-implementing the `StreamingLLMPress` in a more compact way.

In [7]:
@dataclass
class MyStreamingLLMPress(BasePress):
    n_first: int = 1
    n_last: int = 8

    def compress(
        self,
        module: nn.Module,
        hidden_states: torch.Tensor,
        keys: torch.Tensor,
        values: torch.Tensor,
        attentions: torch.Tensor,
        kwargs: dict,
    ) -> tuple[torch.Tensor, torch.Tensor]:

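        # Keep the first `n_first` and the last `n_last` tokens, drop the middle
        seq_len = keys.shape[-2]
        mask = torch.ones(seq_len, dtype=torch.bool, device=keys.device)
        # An explicit end index also behaves correctly when n_last == 0
        mask[self.n_first : seq_len - self.n_last] = False
        return keys[:, :, mask, :], values[:, :, mask, :]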


for n_last in [2, 4, 8]:
    press = MyStreamingLLMPress(n_last=n_last)
    print(f"\nn_last: {n_last}")
    print(f"Last tokens seen by the model: {pipe.tokenizer.decode(tokens.input_ids[0, -n_last:])}")
    print(f"Answer: {pipe(context, question=question, press=press)['answer']}")


n_last: 2
Last tokens seen by the model: press !
Answer: The purpose of this guide is to provide instructions and information on how to use the software or application called "Pulse" or "Pulse 2". Pulse is a popular software application that is used for various purposes, such as creating and editing digital

n_last: 4
Last tokens seen by the model:  in kvpress !
Answer: The purpose of this guide is to provide instructions on how to create a new content management system (CMS) called KVPress. KVPress is a content management system that allows users to easily create, edit, and publish content on their website. The guide

n_last: 8
Last tokens seen by the model:  create a new press in kvpress !
Answer: The purpose of this guide is to provide instructions on how to create a new press in kvpress, a software tool for creating and managing press releases. The guide likely covers topics such as setting up the press, configuring the press settings, adding content to
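
To see which positions such a mask keeps, here is a small standalone sketch on a toy sequence of 10 positions. The values of `n_first`, `n_last`, and `seq_len` are arbitrary choices for illustration, and the end of the slice is written as `seq_len - n_last`, which selects the same range as the negative index `-n_last` whenever `n_last > 0`.

```python
import torch

n_first, n_last, seq_len = 1, 3, 10

mask = torch.ones(seq_len, dtype=torch.bool)
mask[n_first : seq_len - n_last] = False  # drop the middle of the sequence

kept_positions = torch.arange(seq_len)[mask].tolist()
print(kept_positions)  # [0, 7, 8, 9]
```

Only the first token and the last three survive, which is why the answers above degrade quickly when `n_last` is small.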


Note that the `compress` method is itself called by the `forward_hook` method, which ensures that quantization is handled properly and that compression is only performed during prefilling. While we don't recommend changing the `forward_hook` method directly, you can still modify it if you need to!

### 2.3 Head-wise compression

Since 0.2.0, kvpress supports head-wise compression, where the KV cache of each head may be compressed with a different compression ratio.

To achieve proper head-wise compression, one should implement a new kernel for attention along with a custom cache class. Instead, the current implementation fakes head-wise compression by replacing the pruned keys with a fake key so that the output of the attention layer is not affected. This is implemented through `kvpress.attention_patch.patch_attention_functions`.

To implement a method that compresses the KV cache head-wise, one should set the `masked_key_indices` attribute of the module as outlined below.

In [8]:
@dataclass
class RandomHeadPress(BasePress):

    compression_ratio: float = 0.0

    def compress(self, module, hidden_states, keys, values, attentions, kwargs):
        assert keys.shape[0] == 1, "Only batch size 1 is supported"
        scores = torch.rand(keys.shape[:-1], device=keys.device)
        mask = scores < torch.quantile(scores, self.compression_ratio)
        module.masked_key_indices = torch.nonzero(mask, as_tuple=True)

        return keys, values

for compression_ratio in [0, 0.25, 0.9]:
    press = RandomHeadPress(compression_ratio)
    print(f"\ncompression_ratio: {compression_ratio}")
    print(f"Answer: {pipe(context, question=question, press=press)['answer']}")


compression_ratio: 0
Answer: The purpose of this step-by-step guide is to provide a comprehensive and easy-to-follow tutorial on how to create a new press in the KVPress platform. The guide is designed to help users understand the process of setting up a new press, including the

compression_ratio: 0.25
Answer: The purpose of this guide is to provide a step-by-step process for creating a new press in kvpress. This guide is designed to help users understand the process of creating a new press in kvpress and to provide them with the necessary information to complete

compression_ratio: 0.9
Answer: This guide is designed to provide a step-by-step process for creating a new press in a specific software or platform. By following this guide, users can efficiently set up a new press without encountering any major issues.

The purpose of this guide is to:
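
The format of `masked_key_indices` follows directly from `torch.nonzero(mask, as_tuple=True)`. The sketch below only illustrates that format on a hand-written, hypothetical mask of shape `(batch, kv heads, sequence)`; how kvpress consumes the indices internally is not shown here.

```python
import torch

# Hypothetical mask of shape (batch, kv heads, sequence): True marks a pruned key
mask = torch.tensor([[[True, False, False, True],
                      [False, False, True, False]]])

# One index tensor per mask dimension, listing the coordinates of the True entries
batch_idx, head_idx, seq_idx = torch.nonzero(mask, as_tuple=True)
print(head_idx.tolist())  # [0, 0, 1]
print(seq_idx.tolist())   # [0, 3, 2]

# Number of pruned keys per head, which can differ from head to head
pruned_per_head = mask.sum(dim=-1)[0].tolist()
print(pruned_per_head)  # [2, 1]
```

Here head 0 loses two keys and head 1 loses one, which is exactly the kind of per-head difference that head-wise compression allows.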




## 3. Contributing to kvpress

All presses should be stored in the `presses` directory. Before opening a pull request with your new press, make sure to:
- register it in the `__init__.py` file of the repository
- register the press in [default_presses.py](tests/default_presses.py)
- update the README