5 changes: 5 additions & 0 deletions website/AGENTS.md
@@ -9,6 +9,7 @@ This file defines rules for contributors and automation agents working in `websi
---
title: <title>
sidebar_label: <label>
sidebar_position: <position>
---
```
- **Categories**: Every subfolder under `docs/` containing docs must have a `_category_.yaml` with `label` and `position`.
@@ -42,6 +43,8 @@ import TabItem from '@theme/TabItem';

### Code blocks

https://docusaurus.io/docs/markdown-features/code-blocks

````mdx
```yaml
image:
@@ -58,6 +61,8 @@ image:

### Admonitions

https://docusaurus.io/docs/markdown-features/admonitions

```mdx
::::info

1 change: 1 addition & 0 deletions website/docs/reference/heimdall/api-reference.mdx
@@ -1,6 +1,7 @@
---
title: Heimdall API Reference
sidebar_label: API Reference
sidebar_position: 1
---
Comment on lines 1 to 5
Copilot AI Feb 12, 2026

The frontmatter doesn’t follow the contributor guidance in website/AGENTS.md that says to wrap title in single quotes. Either update this page’s frontmatter to match the rule or adjust the rule to reflect the actual convention used across the docs.

## inference.networking.k8s.io/v1
284 changes: 284 additions & 0 deletions website/docs/reference/heimdall/plugin.mdx
@@ -0,0 +1,284 @@
---
title: Heimdall Plugin
sidebar_label: Plugin
sidebar_position: 2
---
Comment on lines +1 to +5
Copilot AI Feb 12, 2026

The frontmatter doesn’t follow the contributor guidance in website/AGENTS.md that says to wrap title in single quotes. Either update this page’s frontmatter to match the rule or adjust the rule to reflect the actual convention used across the docs.

## Profile Handlers

### `single-profile-handler`

Handles a single profile, which is always the primary profile.

No parameters.

### `pd-profile-handler`

Handles scheduler profiles for Prefill-Decode (PD) disaggregation.

| Parameter | Type | Default | Description |
| :----------------- | :------- | :---------------------- | :--------------------------------------------------- |
| `threshold` | `int` | `0` | Threshold for decoding operations. |
| `decodeProfile` | `string` | `"decode"` | Name of the profile to use for decode operations. |
| `prefillProfile` | `string` | `"prefill"` | Name of the profile to use for prefill operations. |
| `prefixPluginType` | `string` | `"prefix-cache-scorer"` | Type of the prefix cache plugin to use. |
| `prefixPluginName` | `string` | `"prefix-cache-scorer"` | Name of the prefix cache plugin to use. |
| `hashBlockSize` | `int` | `64` | Block size used for hashing tokens. |
| `primaryPort` | `int` | `0` | Port number of the primary container (0 to disable). |

## Filters

### `by-label`

Filters pods by the value of the given label: pods whose value is not listed in `validValues` are filtered out.

| Parameter | Type | Default | Description |
| :-------------- | :--------- | :------ | :------------------------------------------------------------------------------ |
| `label` | `string` | - | The label key to filter by. (Required) |
| `validValues` | `[]string` | - | List of allowed values for the label. (Required unless `allowsNoLabel` is true) |
| `allowsNoLabel` | `bool` | `false` | Whether to allow pods that do not have the specified label. |
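
The rule these parameters imply can be sketched in Python. This is a minimal illustration, not Heimdall's implementation: the `keep_pod` helper, the plain-dict pod labels, and the `tier` label are hypothetical, and it assumes a pod without the label is kept only when `allowsNoLabel` is true.

```python
def keep_pod(pod_labels, label, valid_values, allows_no_label=False):
    """Return True if a pod passes the (assumed) by-label rule."""
    if label not in pod_labels:
        # Assumption: pods without the label pass only when allowsNoLabel is true.
        return allows_no_label
    return pod_labels[label] in valid_values


# Hypothetical pods and label values, for illustration only.
pods = {
    "pod-a": {"tier": "gold"},
    "pod-b": {"tier": "bronze"},
    "pod-c": {},
}
kept = sorted(
    name
    for name, labels in pods.items()
    if keep_pod(labels, "tier", ["gold", "silver"], allows_no_label=True)
)
print(kept)
```

Here `pod-b` is dropped because `bronze` is not an allowed value, while `pod-c` survives only because unlabeled pods are allowed.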

### `by-label-selector`

Filters out pods that do not match the configured label selector criteria.

| Parameter | Type | Default | Description |
| :------------ | :------------------ | :------ | :----------------------------------------- |
| `matchLabels` | `map[string]string` | - | Key-value pairs of labels that must match. |
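
As a rough sketch, assuming `matchLabels` follows the Kubernetes convention that every listed key-value pair must match (logical AND), the check could look like this in Python. The `matches` helper and the sample labels are hypothetical.

```python
def matches(pod_labels, match_labels):
    """Return True when every key-value pair in match_labels is present on the pod."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


# Hypothetical selector and pod labels, for illustration only.
selector = {"app": "vllm", "zone": "a"}
print(matches({"app": "vllm", "zone": "a", "extra": "x"}, selector))
print(matches({"app": "vllm", "zone": "b"}, selector))
```

Extra labels on the pod do not affect the result; only the pairs named in the selector are compared.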

### `prefill-filter`

Filters for pods designated with the `prefill` role. It retains pods that have the label `mif.moreh.io/role` set to `prefill`.

No parameters.

### `decode-filter`

Filters for pods designated with the `decode` role. It retains pods that satisfy one of the following conditions:

- The label `mif.moreh.io/role` is set to `decode` or `both`.
- The label `mif.moreh.io/role` is not set.

No parameters.
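
The role checks described for `prefill-filter` and `decode-filter` can be written as two small predicates. This Python sketch only restates the rules above; the function names and the dict representation of pod labels are hypothetical.

```python
ROLE_LABEL = "mif.moreh.io/role"


def is_prefill_pod(labels):
    """prefill-filter: keep pods whose role label is exactly 'prefill'."""
    return labels.get(ROLE_LABEL) == "prefill"


def is_decode_pod(labels):
    """decode-filter: keep pods with role 'decode' or 'both', or with no role label."""
    role = labels.get(ROLE_LABEL)
    return role is None or role in ("decode", "both")


print(is_decode_pod({}))
print(is_decode_pod({ROLE_LABEL: "prefill"}))
```

Note the asymmetry: an unlabeled pod is eligible for decode but never for prefill.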

## Scorers

### `active-request-scorer`

Scores pods based on the number of active requests being served. Scores are normalized from 0 to 1.

Copilot AI Feb 12, 2026

Grammar: “Scored normalized from 0 to 1.” is ungrammatical and reads like a typo. Consider rephrasing to something like “Scores are normalized from 0 to 1.”

Suggested change
- Scores pods based on the number of active requests being served. Scores are normalized from 0 to 1.
+ Scores pods based on the number of active requests being served. The scores are normalized to the range [0, 1].

| Parameter | Type | Default | Description |
| :--------------- | :------- | :------ | :--------------------------------------------------------- |
| `requestTimeout` | `string` | `"2m"` | Duration to consider a request active (e.g., "30s", "1m"). |

### `load-aware-scorer`

Scores pods based on load (waiting queue size). Pods with empty queues get higher scores.

| Parameter | Type | Default | Description |
| :---------- | :---- | :------ | :-------------------------------- |
| `threshold` | `int` | `128` | Queue size threshold for scoring. |

### `no-hit-lru-scorer`

For cold requests (requests with no prefix-cache hit), favors the least recently used pods so that cache growth is distributed across pods.

| Parameter | Type | Default | Description |
| :----------------- | :------- | :---------------------- | :------------------------------- |
| `prefixPluginType` | `string` | `"prefix-cache-scorer"` | Type of the prefix cache plugin. |
| `prefixPluginName` | `string` | `"prefix-cache-scorer"` | Name of the prefix cache plugin. |
| `lruSize` | `int` | `1024` | Size of the LRU cache. |

### `precise-prefix-cache-scorer`

Scores pods based on precise prefix-cache KV-block locality using an internal indexer.

| Parameter | Type | Default | Description |
| :--------------------- | :-------------------------------- | :------ | :---------------------------------------- |
| `tokenProcessorConfig` | [`Object`](#tokenprocessorconfig) | - | Configuration for the token processor. |
| `indexerConfig` | [`Object`](#indexerconfig) | - | Configuration for the KV cache indexer. |
| `kvEventsConfig` | [`Object`](#kveventsconfig) | - | Configuration for KV events subscription. |

#### `tokenProcessorConfig`

| Parameter | Type | Default | Description |
| :---------- | :------- | :------ | :-------------------------------------------------------------- |
| `blockSize` | `int` | `16` | Number of tokens per block. |
| `hashSeed` | `string` | `""` | Seed for computing block hashes. Should match `PYTHONHASHSEED`. |

#### `indexerConfig`

| Parameter | Type | Default | Description |
| :----------------------- | :-------------------------------- | :------ | :-------------------------------------------- |
| `kvBlockIndexConfig` | [`Object`](#kvblockindexconfig) | - | Configuration for the KV-block index backend. |
| `tokenizersPoolConfig` | [`Object`](#tokenizerspoolconfig) | - | Configuration for the tokenizers pool. |
| `enableMetrics` | `bool` | `false` | Whether to enable metrics for the indexer. |
| `metricsLoggingInterval` | `string` | `"0s"` | Interval for logging metrics (e.g., "10s"). |

#### `kvBlockIndexConfig`

Only one of the following backends should be configured.

| Parameter | Type | Default | Description |
| :---------------------- | :------------------------------------- | :------ | :----------------------------------------- |
| `inMemoryConfig` | [`Object`](#inmemoryconfig) | - | Configuration for in-memory index. |
| `redisConfig` | [`Object`](#redisconfig--valkeyconfig) | - | Configuration for Redis index. |
| `valkeyConfig` | [`Object`](#redisconfig--valkeyconfig) | - | Configuration for Valkey index. |
| `costAwareMemoryConfig` | [`Object`](#costawarememoryconfig) | - | Configuration for cost-aware memory index. |

##### `inMemoryConfig`

| Parameter | Type | Default | Description |
| :------------- | :---- | :------ | :------------------------------------- |
| `size` | `int` | `1e8` | Maximum number of keys in the index. |
| `podCacheSize` | `int` | `10` | Maximum number of pod entries per key. |
Comment on lines +131 to +134
Copilot AI Feb 12, 2026

inMemoryConfig.size is documented as an int but the default is written as 1e8. Many configuration formats treat scientific notation as a float (or reject it for integers), so this default may be confusing/misleading. Prefer spelling this out as an integer literal (e.g., 100000000) if that’s the intended value.

Suggested change
- | Parameter | Type | Default | Description |
- | :------------- | :---- | :------ | :------------------------------------- |
- | `size` | `int` | `1e8` | Maximum number of keys in the index. |
- | `podCacheSize` | `int` | `10` | Maximum number of pod entries per key. |
+ | Parameter | Type | Default | Description |
+ | :------------- | :---- | :--------- | :------------------------------------- |
+ | `size` | `int` | `100000000`| Maximum number of keys in the index. |
+ | `podCacheSize` | `int` | `10` | Maximum number of pod entries per key. |

##### `redisConfig` / `valkeyConfig`

| Parameter | Type | Default | Description |
| :------------ | :------- | :------------------------- | :--------------------------------------- |
| `address` | `string` | `"redis://127.0.0.1:6379"` | Address of the Redis/Valkey server. |
| `backendType` | `string` | `"redis"` | Backend type ("redis" or "valkey"). |
| `enableRDMA` | `bool` | `false` | Enable RDMA (experimental, Valkey only). |

##### `costAwareMemoryConfig`

| Parameter | Type | Default | Description |
| :-------- | :------- | :------- | :-------------------------------------------- |
| `size` | `string` | `"2GiB"` | Maximum memory size (e.g., "2GiB", "500MiB"). |

#### `tokenizersPoolConfig`

| Parameter | Type | Default | Description |
| :------------- | :------------------------------------ | :------ | :-------------------------------------------- |
| `modelName` | `string` | - | Base model name for the tokenizer. (Required) |
| `workersCount` | `int` | `5` | Number of concurrent tokenizer workers. |
| `hf` | [`Object`](#hf-huggingface-tokenizer) | - | Configuration for HuggingFace tokenizer. |
| `local` | [`Object`](#local-local-tokenizer) | - | Configuration for local tokenizer. |
| `uds` | [`Object`](#uds-uds-tokenizer) | - | Configuration for UDS-based tokenizer. |

##### `hf` (HuggingFace Tokenizer)

| Parameter | Type | Default | Description |
| :------------------- | :------- | :------- | :-------------------------------------------------- |
| `enabled` | `bool` | `true` | Enable HuggingFace tokenizer. |
| `huggingFaceToken` | `string` | `""` | HuggingFace API token. |
| `tokenizersCacheDir` | `string` | `"bin"` | Directory to cache downloaded tokenizers. |
| `tokenizer` | `string` | `""` | Specific tokenizer to use (defaults to model name). |
| `tokenizerMode` | `string` | `"auto"` | Tokenizer mode ("auto", "hf", "limit", "mistral"). |
| `tokenizerRevision` | `string` | `""` | Revision of the tokenizer. |

##### `local` (Local Tokenizer)

| Parameter | Type | Default | Description |
| :------------------------------- | :------------------ | :--------------- | :------------------------------------------------ |
| `autoDiscoveryDir` | `string` | `"/mnt/models"` | Directory to search for tokenizers. |
| `autoDiscoveryTokenizerFileName` | `string` | `"tokenizer.json"` | Filename to search for. |
| `modelTokenizerMap` | `map[string]string` | - | Manual mapping of model names to tokenizer paths. |

##### `uds` (UDS Tokenizer)

| Parameter | Type | Default | Description |
| :----------- | :------- | :------------------------------------ | :--------------------------- |
| `socketFile` | `string` | `"/tmp/tokenizer/tokenizer-uds.socket"` | Path to the UDS socket file. |

#### `kvEventsConfig`

| Parameter | Type | Default | Description |
| :------------------- | :------------------------------ | :------ | :------------------------------------------------------- |
| `zmqEndpoint` | `string` | - | ZMQ endpoint to connect to (e.g., "tcp://indexer:5557"). |
| `topicFilter` | `string` | `"kv@"` | ZMQ topic filter subscription. |
| `concurrency` | `int` | `4` | Number of event processing workers. |
| `discoverPods` | `bool` | `true` | Enable automatic pod discovery. |
| `podDiscoveryConfig` | [`Object`](#poddiscoveryconfig) | - | Configuration for pod discovery. |

##### `podDiscoveryConfig`

| Parameter | Type | Default | Description |
| :----------------- | :------- | :--------------------------------- | :---------------------------------------- |
| `podLabelSelector` | `string` | `"llm-d.ai/inferenceServing=true"` | Label selector to find pods. |
| `podNamespace` | `string` | `""` | Namespace to watch pods in (empty = all). |
| `socketPort` | `int` | `5557` | Port where pods expose their ZMQ socket. |

### `session-affinity-scorer`

Routes subsequent requests in a session to the same pod as the first request.

This scorer relies on the `x-session-token` HTTP header to maintain session affinity:

1. **Response:** When a request is served, the plugin sets the `x-session-token` header in the response with the Base64-encoded name of the serving pod.
2. **Request:** For subsequent requests, the client must include this `x-session-token` header. The scorer decodes it to identify the target pod and assigns it a high score.

No parameters.
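
The two steps above can be sketched in Python as a token round trip. This is an illustration under stated assumptions, not the plugin's code: it assumes the standard Base64 alphabet and a score of 1.0 for the target pod and 0.0 for all others, and the helper names and pod names are hypothetical.

```python
import base64

SESSION_HEADER = "x-session-token"


def encode_session_token(pod_name):
    """Response side: Base64-encode the serving pod's name."""
    return base64.b64encode(pod_name.encode("utf-8")).decode("ascii")


def decode_session_token(token):
    """Request side: recover the pod name from the header value."""
    return base64.b64decode(token).decode("utf-8")


def score_pods(pod_names, request_headers):
    """Assumed scoring: 1.0 for the pod named in the token, 0.0 otherwise."""
    token = request_headers.get(SESSION_HEADER)
    target = decode_session_token(token) if token else None
    return {name: (1.0 if name == target else 0.0) for name in pod_names}


# First response carries the token; the client echoes it on the next request.
token = encode_session_token("pod-1")
scores = score_pods(["pod-0", "pod-1"], {SESSION_HEADER: token})
print(scores)
```

A request without the header gives every pod the same score, so other scorers decide the placement.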

### `kv-cache-utilization-scorer`

Scores pods based on their KV cache utilization (lower utilization = higher score).

No parameters.

### `lora-affinity-scorer`

Scores pods based on LoRA adapter availability and capacity.

No parameters.

### `queue-scorer`

Scores pods based on their waiting queue size (smaller queue = higher score).

No parameters.
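
One way to turn "smaller queue = higher score" into numbers is min-max normalization across the candidate pods. The Python sketch below assumes that formula; the source does not state the exact one, and the `queue_scores` helper and sample queue sizes are hypothetical.

```python
def queue_scores(queue_sizes):
    """Map each pod's waiting-queue size to a score in [0, 1] (assumed min-max rule)."""
    if not queue_sizes:
        return {}
    smallest = min(queue_sizes.values())
    largest = max(queue_sizes.values())
    if largest == smallest:
        # All queues are equal, so no pod is preferred.
        return {pod: 1.0 for pod in queue_sizes}
    span = largest - smallest
    return {pod: (largest - size) / span for pod, size in queue_sizes.items()}


demo_scores = queue_scores({"pod-a": 0, "pod-b": 5, "pod-c": 10})
print(demo_scores)
```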

### `running-requests-size-scorer`

Scores pods based on their number of running requests.

No parameters.

### `prefix-cache-scorer`

Scores pods based on the length of the prefix match for the request prompt.

| Parameter | Type | Default | Description |
| :----------------------- | :----- | :------ | :------------------------------------------------------------ |
| `autoTune` | `bool` | `true` | Whether to automatically tune configuration based on metrics. |
| `blockSize` | `int` | `64` | Size of a token block for hashing. |
| `maxPrefixBlocksToMatch` | `int` | `256` | Maximum number of blocks to match for prefix caching. |
| `lruCapacityPerServer` | `int` | `31250` | Estimated LRU capacity per model server (in blocks). |
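
To illustrate how `blockSize` and `maxPrefixBlocksToMatch` could interact, the Python sketch below hashes a prompt in fixed-size token blocks, chains each hash to the previous one, and scores a pod by the fraction of leading blocks found in its cache. The chaining scheme, the SHA-256 hash, and the matched-over-total score are all assumptions made for illustration; the function names are hypothetical.

```python
import hashlib


def block_hashes(tokens, block_size=64, max_blocks=256):
    """Hash full token blocks; each hash also covers all preceding blocks."""
    hashes = []
    previous = b""
    full_blocks = min(len(tokens) // block_size, max_blocks)
    for i in range(full_blocks):
        block = tokens[i * block_size:(i + 1) * block_size]
        digest = hashlib.sha256(previous + repr(block).encode("utf-8")).digest()
        hashes.append(digest)
        previous = digest
    return hashes


def prefix_score(request_hashes, cached_hashes):
    """Fraction of leading request blocks present in a pod's cached set."""
    if not request_hashes:
        return 0.0
    matched = 0
    for block_hash in request_hashes:
        if block_hash not in cached_hashes:
            break
        matched += 1
    return matched / len(request_hashes)


# A pod that served tokens 0..7 shares the first two 2-token blocks with the new prompt.
cached = set(block_hashes(list(range(8)), block_size=2))
new_prompt = [0, 1, 2, 3, 9, 9, 9, 9]
demo_score = prefix_score(block_hashes(new_prompt, block_size=2), cached)
print(demo_score)
```

Because each hash includes its predecessor, a block can match only if every block before it matched too, which is what makes this a prefix match rather than a bag-of-blocks match.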

## Pickers

### `max-score-picker`

Picks the pod(s) with the maximum score from the list of candidates.

| Parameter | Type | Default | Description |
| :------------------ | :---- | :------ | :----------------------------------- |
| `maxNumOfEndpoints` | `int` | `1` | Maximum number of endpoints to pick. |
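
A minimal Python sketch of this selection, assuming ties keep their input order (the real picker may break ties differently). The `pick_max_score` helper and the sample scores are hypothetical.

```python
def pick_max_score(scores, max_num_of_endpoints=1):
    """Return up to max_num_of_endpoints pod names, highest score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [pod for pod, _ in ranked[:max_num_of_endpoints]]


top_two = pick_max_score({"pod-a": 0.2, "pod-b": 0.9, "pod-c": 0.5}, max_num_of_endpoints=2)
print(top_two)
```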

### `random-picker`

Picks random pod(s) from the candidates.

| Parameter | Type | Default | Description |
| :------------------ | :---- | :------ | :----------------------------------- |
| `maxNumOfEndpoints` | `int` | `1` | Maximum number of endpoints to pick. |

### `weighted-random-picker`

Picks pod(s) based on weighted random sampling derived from their scores.

| Parameter | Type | Default | Description |
| :------------------ | :---- | :------ | :----------------------------------- |
| `maxNumOfEndpoints` | `int` | `1` | Maximum number of endpoints to pick. |
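
Weighted sampling without replacement can be done in several ways. The Python sketch below uses the Efraimidis-Spirakis key method (each pod draws `u ** (1 / score)` and the largest keys win) as one possible approach. Whether Heimdall uses this method is an assumption, and the `pick_weighted_random` helper is hypothetical.

```python
import random


def pick_weighted_random(scores, max_num_of_endpoints=1, rng=None):
    """Sample distinct pods with probability increasing in their score."""
    if rng is None:
        rng = random.Random()
    keyed = []
    for pod, score in scores.items():
        if score <= 0:
            # Assumption: zero-score pods are chosen only as a last resort.
            key = 0.0
        else:
            key = rng.random() ** (1.0 / score)
        keyed.append((key, pod))
    keyed.sort(reverse=True)
    return [pod for _, pod in keyed[:max_num_of_endpoints]]


picked = pick_weighted_random(
    {"pod-a": 0.7, "pod-b": 0.2, "pod-c": 0.1},
    max_num_of_endpoints=2,
    rng=random.Random(42),
)
print(picked)
```

Passing a seeded `random.Random` makes the draw reproducible in tests; omit it for normal use.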

## Response Plugins

### `response-header-handler`

Adds serving pod information to the response headers.

- `x-decoder-host-port`: Always set to the address and port of the pod that handled the decode phase (the primary target).
- `x-prefiller-host-port`: Set to the address and port of the prefill pod, if a separate prefill pod was used (PD disaggregation).

No parameters.