FYI I also put the design in a globally readable and commentable Google Doc: https://docs.google.com/document/d/10_u1Pvb3MD2Wii6NB50OVIGQVN3p_YEuV0j7PBdIOiY/edit?tab=t.0 if that's easier to leave comments on.
Overview
The Gateway API Inference Extension project aims to optimize self-hosting of generative models on GKE. A key component of this system is the Endpoint Picker (EPP), which routes each request to an appropriate model server backend. For advanced routing and load balancing, particularly in features like prefix sharding, the data plane needs to know the "cost" of processing a request.
This proposal defines a flexible plugin for the EPP server configuration to allow users to declaratively configure how this cost is calculated and reported, without requiring code changes to the EPP binary itself. This API will be configurable via the existing --config-file or --config-text command-line flags used by the EPP server.
Proposed API
apiVersion: config.apix.gateway-api-inference-extension.sigs.k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- name: input-tokens-cost-reporter
  type: cost-reporter
  parameters:
    # Defines where in dynamic metadata to return the data
    metric:
      namespace: envoy.lb # Defaults to envoy.lb if omitted
      # What key to use in the provided namespace for the value from expression
      name: x-gateway-inference-request-cost
    # Specifies the source of data for the CEL expression.
    dataSource: responseBody
    # The CEL expression to calculate the cost.
    # (No trailing backslash: inside a YAML literal block it would reach CEL verbatim.)
    expression: |
      (has(responseBody.usage.prompt_tokens) ? responseBody.usage.prompt_tokens : 0) +
      (has(responseBody.usage.completion_tokens) ? responseBody.usage.completion_tokens : 0)
    # Optional: CEL expression to determine if this metric should be calculated/reported
    condition: "has(responseBody.usage)"
Detailed design
Data plane
The initial implementation will only support parsing the response body. The cost reporting logic will be triggered within the response processing path, specifically when handling the response body.
The plugin will implement the ResponseStreaming and ResponseComplete interfaces defined in pkg/epp/requestcontrol/plugins.go.
For each configured metric, the condition (if provided) is evaluated first. If the condition is met or absent, the expression is evaluated against the dataSource. The result is expected to be an integer. If it is not an integer, or if evaluation fails, the defaultValue is used when one is provided. An evaluation failure never fails the request; a warning is logged instead.
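The evaluation order described above can be sketched in plain Go. This is an illustrative sketch, not the plugin's actual code: `evalFunc` is a hypothetical stand-in for a compiled CEL program, and `metric`, `cost`, `fallback`, and `toInt64` are names invented here. Two assumptions are made that the proposal does not spell out: a failing condition is treated like a failing expression (fall back to defaultValue), and a whole-number float64 is accepted as an integer, because Go's encoding/json decodes JSON numbers into float64.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"math"
)

// evalFunc is a hypothetical stand-in for a compiled CEL program.
type evalFunc func(responseBody map[string]any) (any, error)

// metric mirrors one configured cost metric (names are illustrative).
type metric struct {
	Condition    evalFunc // optional; nil means "always evaluate"
	Expression   evalFunc
	DefaultValue *int64 // nil when no defaultValue was configured
}

// toInt64 accepts an int64, or a float64 that holds a whole number
// (encoding/json decodes JSON numbers into float64).
func toInt64(v any) (int64, bool) {
	switch n := v.(type) {
	case int64:
		return n, true
	case float64:
		if n == math.Trunc(n) && math.Abs(n) < 1<<63 {
			return int64(n), true
		}
	}
	return 0, false
}

// fallback returns the defaultValue if one was configured.
func fallback(m metric) (int64, bool) {
	if m.DefaultValue != nil {
		return *m.DefaultValue, true
	}
	return 0, false
}

// cost returns the value to report and whether anything should be reported.
// It never returns an error: failures are logged and the request proceeds.
func cost(m metric, body map[string]any) (int64, bool) {
	if m.Condition != nil {
		res, err := m.Condition(body)
		if err != nil {
			// Assumption: a failing condition falls back to defaultValue.
			log.Printf("warning: cost condition failed: %v", err)
			return fallback(m)
		}
		if met, _ := res.(bool); !met {
			return 0, false // condition not met: report nothing
		}
	}
	out, err := m.Expression(body)
	if err == nil {
		if n, ok := toInt64(out); ok {
			return n, true
		}
		err = fmt.Errorf("expression returned %T, want an integer", out)
	}
	log.Printf("warning: cost expression failed: %v", err)
	return fallback(m)
}

func main() {
	m := metric{
		Condition: func(b map[string]any) (any, error) {
			_, ok := b["usage"]
			return ok, nil
		},
		Expression: func(b map[string]any) (any, error) {
			usage, ok := b["usage"].(map[string]any)
			if !ok {
				return nil, errors.New("usage is not an object")
			}
			p, _ := usage["prompt_tokens"].(float64)
			c, _ := usage["completion_tokens"].(float64)
			return p + c, nil
		},
	}
	body := map[string]any{
		"usage": map[string]any{"prompt_tokens": 12.0, "completion_tokens": 30.0},
	}
	n, ok := cost(m, body)
	fmt.Println(n, ok)
}
```

Returning a `(value, report)` pair rather than an error keeps the "never fail the request" rule visible in the function signature.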
We will avoid buffering the entire response. The CEL expression will be evaluated individually on each chunk of the response body in the streaming case. The first successful evaluation will stop subsequent evaluation of the response body.
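A minimal sketch of the "first successful evaluation wins" behavior follows. It rests on assumptions not stated in the proposal: the streamed body is framed as server-sent events with `data:` lines, and a JSON object never straddles a chunk boundary. The `chunkCost` type and the `evaluate` helper (a hard-coded stand-in for the CEL condition and expression) are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// chunkCost is hypothetical per-request state for a streamed response.
type chunkCost struct {
	done  bool  // set once a chunk has produced a cost
	value int64 // the cost from the first successful evaluation
}

// evaluate stands in for the CEL condition plus expression: it succeeds
// only when the decoded chunk carries a usage object.
func evaluate(body map[string]any) (int64, bool) {
	usage, ok := body["usage"].(map[string]any)
	if !ok {
		return 0, false
	}
	p, _ := usage["prompt_tokens"].(float64)
	c, _ := usage["completion_tokens"].(float64)
	return int64(p + c), true
}

// onChunk inspects one response body chunk without buffering earlier ones.
// After the first success, later chunks are ignored.
func (s *chunkCost) onChunk(chunk []byte) {
	if s.done {
		return
	}
	for _, line := range bytes.Split(chunk, []byte("\n")) {
		line = bytes.TrimSpace(line)
		payload := bytes.TrimSpace(bytes.TrimPrefix(line, []byte("data:")))
		var body map[string]any
		if json.Unmarshal(payload, &body) != nil {
			continue // blank line, "[DONE]", or otherwise not a JSON object
		}
		if v, ok := evaluate(body); ok {
			s.value, s.done = v, true
			return
		}
	}
}

func main() {
	s := &chunkCost{}
	chunks := []string{
		"data: {\"choices\":[{\"delta\":{\"content\":\"Hi\"}}]}\n\n",
		"data: {\"choices\":[],\"usage\":{\"prompt_tokens\":5,\"completion_tokens\":7}}\n\n",
		"data: [DONE]\n\n",
	}
	for _, c := range chunks {
		s.onChunk([]byte(c))
	}
	fmt.Println(s.done, s.value)
}
```

Because each chunk is decoded independently, a usage object split across two chunks would be missed under these assumptions; handling that case would require some bounded carry-over between chunks.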
The calculated cost value is added to the ext-proc response to instruct Envoy to set the dynamic metadata in the specified namespace with the specified name.
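To make the shape of that metadata concrete, here is a small sketch that builds the namespace-to-key-to-value nesting and prints it as JSON. It deliberately uses plain maps rather than the ext-proc protobuf types, and `dynamicMetadata` is a hypothetical helper name. The value is stored as a float64 because numbers in a protobuf Struct, which is the type Envoy uses for dynamic metadata, are doubles.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// defaultNamespace matches the default stated in the proposed API.
const defaultNamespace = "envoy.lb"

// dynamicMetadata builds the namespace -> name -> value nesting that the
// ext-proc response would carry. Plain maps stand in for protobuf types.
func dynamicMetadata(namespace, name string, cost int64) map[string]any {
	if namespace == "" {
		namespace = defaultNamespace
	}
	return map[string]any{
		namespace: map[string]any{
			name: float64(cost), // protobuf Struct numbers are doubles
		},
	}
}

func main() {
	md := dynamicMetadata("", "x-gateway-inference-request-cost", 42)
	out, err := json.Marshal(md)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	// prints {"envoy.lb":{"x-gateway-inference-request-cost":42}}
}
```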
Configuration Loading
The EPP server will parse the cost-reporter plugin parameters from the YAML provided via --config-file or --config-text.
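As a rough illustration of what loading these parameters could involve, the sketch below decodes one plugin's parameters and applies the envoy.lb default. It assumes the parameters reach the plugin as JSON (for example after YAML-to-JSON conversion); the struct names, the `parseParams` function, and the choice of which fields are required are all assumptions made here, not part of the proposal.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// metricTarget says where in dynamic metadata the cost is written.
type metricTarget struct {
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
}

// costReporterParams mirrors the parameters block of the proposed API.
type costReporterParams struct {
	Metric     metricTarget `json:"metric"`
	DataSource string       `json:"dataSource"`
	Expression string       `json:"expression"`
	Condition  string       `json:"condition"` // optional
}

// parseParams decodes and validates the parameters (hypothetical helper).
func parseParams(raw []byte) (*costReporterParams, error) {
	var p costReporterParams
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, fmt.Errorf("decoding cost-reporter parameters: %w", err)
	}
	if p.Metric.Namespace == "" {
		p.Metric.Namespace = "envoy.lb" // default from the proposed API
	}
	if p.Metric.Name == "" {
		return nil, errors.New("metric.name is required") // assumption
	}
	if p.DataSource != "responseBody" {
		// Only responseBody is supported in the initial implementation.
		return nil, fmt.Errorf("unsupported dataSource %q", p.DataSource)
	}
	if p.Expression == "" {
		return nil, errors.New("expression is required") // assumption
	}
	return &p, nil
}

func main() {
	raw := []byte(`{
		"metric": {"name": "x-gateway-inference-request-cost"},
		"dataSource": "responseBody",
		"expression": "responseBody.usage.prompt_tokens",
		"condition": "has(responseBody.usage)"
	}`)
	p, err := parseParams(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(p.Metric.Namespace, p.Metric.Name)
}
```

Rejecting unsupported dataSource values at load time surfaces configuration mistakes at startup rather than as per-request warnings.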
CEL Environment
The EPP binary would take a dependency on the github.com/google/cel-go library. When the plugin is enabled, it will initialize a CEL environment on a per-request basis. For each metric defined in configuration, it will compile the expression and condition strings. The environment will be configured to understand each of the dataSources; the initial implementation will only support responseBody. The expression is assumed to produce an integer.