Yes, this is something we need and I think the proposal is on-point. We'll need to have policy, limits and bounds on this capability of course, but otherwise I'll move to accept this and start getting an Epic and issues for it up.
This is a bit tangential, but: while I agree that this is a valid use case it seems we may want to consider a local store option as well?
I'm in favor of considering this accepted. It's been a couple of weeks without any other comments or objections, so we'll consider this accepted. @usize please feel free to hit the
Introduce a first-class mechanism for filters (and embedded processors) to make async outbound HTTP calls during request processing, with proper timeout, cancellation, memory bounding, and observability.
This is the foundational primitive that enables praxis to act as an orchestrating proxy: a proxy that doesn't just route requests but executes multi-step request workflows. It is a prerequisite for #16 (llm-d Compatibility), #17 (External Processing), #18 (Wasm Runtime), and #24 (AI Agentic).
Motivation
Envoy + ext_proc makes orchestration impossible
The Gateway API Inference Extension (GIE) defines the Endpoint Picker Protocol with a hard requirement: "The EPP MUST implement the Envoy external processing service protocol." This protocol is fundamentally a single decision point: the EPP receives request headers/body, returns a destination endpoint via x-gateway-destination-endpoint, and is done. It cannot express multi-step workflows, conditional branching, or mid-flight preemption.
This is sufficient for simple inference routing (one model, one pod). It breaks down for workloads that require coordinating multiple backend interactions within a single client request.
P/D disaggregation in llm-d
llm-d splits LLM inference into separate prefill and decode phases running on independent GPU pools. The inference scheduler supports four disaggregation topologies: EPD, P/D, E/PD, and E/P/D, with the most advanced requiring orchestration across three distinct worker types (encode, prefill, decode).
Because ext_proc can only return a single endpoint, llm-d works around this with a sidecar on the decode pod:
x-prefiller-host-port header (and x-encoder-hosts-ports for multimodal)
This sidecar is explicitly experimental. The llm-d-routing-sidecar repository states: "This repository is deprecated and shall soon be archived. All future development will [be] done under the llm-d-inference-scheduler repository." The code has been consolidated into the inference scheduler as a transitional measure.
Problems with the sidecar approach:
The fundamental issue is architectural: ext_proc is a decision point, not an orchestrator. The proxy layer needs the ability to execute multi-step workflows, not just select a single destination.
Orchestration in AI policy
P/D disaggregation is not the only pattern that requires orchestration from within request processing. The AI Gateway Working Group's Payload Processing proposal identifies several user stories that imply sub-request capability:
Guardrails / safety scanning: A processor calls an external detection engine (prompt injection scanner, PII detector, toxicity classifier) and blocks, sanitizes, or reports based on the result. The processor must call the scanner, await its verdict, and then decide whether to continue or reject. This is a sub-request.
Semantic caching: A processor checks a cache service for semantically similar prior requests. On a hit, it returns the cached response directly (short-circuiting the backend). On a miss, the request proceeds to the inference backend and the response is cached on the way back. Both the cache lookup and the cache write are sub-requests.
RAG augmentation: A processor calls a retrieval service to fetch relevant context, mutates the request body to inject the retrieved context, then forwards to the inference backend. The retrieval call is a sub-request.
MCP routing: A processor needs to look up session state from an external store to determine which MCP server should handle a tool call. The session lookup is a sub-request.
Provider failover with API translation: On failure from provider A, a processor translates the request format and retries against provider B. The retry against B is a sub-request with a transformed payload.
In each case, the processor needs to call out to another service and act on the result. Without a sub-request primitive, every one of these patterns requires either a bespoke sidecar, an external orchestration layer, or multiple ext_proc round-trips chained together.
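As a rough illustration of two of the patterns above (guardrails and semantic caching), the sketch below shows a processor that awaits a scanner verdict, then a cache lookup, and only then forwards to the backend. Everything here is an assumption made for illustration: the `process` function, the `SubRequest` callable, the service names, and the in-memory fake services are hypothetical and not part of praxis or any existing API. An exact-match dictionary stands in for a real semantic similarity lookup.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional

# Hypothetical sub-request signature: (service name, payload) -> response or None.
SubRequest = Callable[[str, str], Awaitable[Optional[str]]]


@dataclass
class Outcome:
    status: int
    body: str
    source: str  # "guardrail", "cache", or "backend"


async def process(prompt: str, call: SubRequest, timeout_s: float = 1.0) -> Outcome:
    # Guardrail: call the scanner, await its verdict, then continue or reject.
    verdict = await asyncio.wait_for(call("scanner", prompt), timeout_s)
    if verdict == "block":
        return Outcome(403, "rejected by guardrail", "guardrail")

    # Cache lookup: a hit short-circuits the inference backend.
    cached = await asyncio.wait_for(call("cache-get", prompt), timeout_s)
    if cached is not None:
        return Outcome(200, cached, "cache")

    # Miss: forward to the backend, then write the response to the cache.
    answer = await asyncio.wait_for(call("backend", prompt), timeout_s) or ""
    await asyncio.wait_for(call("cache-put", f"{prompt}\x00{answer}"), timeout_s)
    return Outcome(200, answer, "backend")


def make_fake_services() -> SubRequest:
    """In-memory stand-ins for the scanner, cache, and backend (illustrative only)."""
    cache = {}

    async def call(service: str, payload: str) -> Optional[str]:
        await asyncio.sleep(0)  # yield to the event loop, as a network call would
        if service == "scanner":
            return "block" if "ignore previous instructions" in payload.lower() else "allow"
        if service == "cache-get":
            return cache.get(payload)  # exact match stands in for similarity search
        if service == "cache-put":
            key, _, value = payload.partition("\x00")
            cache[key] = value
            return None
        if service == "backend":
            return f"completion for: {payload}"
        raise ValueError(f"unknown service {service!r}")

    return call


if __name__ == "__main__":
    call = make_fake_services()
    print(asyncio.run(process("hello", call)).source)
    print(asyncio.run(process("hello", call)).source)
```

Each `await` in `process` corresponds to one sub-request; without such a primitive, each would need its own sidecar hop or ext_proc round-trip.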
Proposal
Sub-request API
Provide an HTTP client interface accessible from within filter execution (and by extension, from WASM-embedded processors via host calls). Key properties:
on_request/on_request_body execution. The pipeline blocks on that filter until the sub-request completes or times out.
Memory safety
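A minimal sketch of a sub-request that is bounded in both time and memory, assuming the response arrives as an async stream of chunks. The names `sub_request`, `read_bounded`, `SubRequestTimeout`, and `ResponseTooLarge` are hypothetical and chosen only for this example; the proposal does not specify an API. The deadline uses `asyncio.wait_for`, which cancels the in-flight read when the timeout expires.

```python
import asyncio
from typing import AsyncIterator, List


class SubRequestTimeout(Exception):
    """Hypothetical error: the sub-request missed its deadline."""


class ResponseTooLarge(Exception):
    """Hypothetical error: the sub-request response exceeded its byte cap."""


async def read_bounded(chunks: AsyncIterator[bytes], max_bytes: int) -> bytes:
    # Accumulate the response, refusing to buffer more than max_bytes.
    buf = bytearray()
    async for chunk in chunks:
        if len(buf) + len(chunk) > max_bytes:
            raise ResponseTooLarge(f"response exceeds {max_bytes} bytes")
        buf.extend(chunk)
    return bytes(buf)


async def sub_request(chunks: AsyncIterator[bytes], timeout_s: float, max_bytes: int) -> bytes:
    # The calling filter blocks here until the read completes or the deadline passes.
    try:
        return await asyncio.wait_for(read_bounded(chunks, max_bytes), timeout=timeout_s)
    except asyncio.TimeoutError:
        raise SubRequestTimeout(f"no complete response within {timeout_s}s") from None


async def fake_upstream(n_chunks: int, delay_s: float, cancelled: List[bool]) -> AsyncIterator[bytes]:
    """Illustrative upstream that emits 4-byte chunks and records cancellation."""
    try:
        for _ in range(n_chunks):
            await asyncio.sleep(delay_s)
            yield b"x" * 4
    except asyncio.CancelledError:
        cancelled.append(True)  # the timeout propagated down to the upstream read
        raise


if __name__ == "__main__":
    seen: List[bool] = []
    print(asyncio.run(sub_request(fake_upstream(2, 0.0, seen), 1.0, 64)))
```

Capping the buffer while streaming, rather than checking the size after a full read, keeps the worst-case memory per in-flight sub-request at `max_bytes`.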
Nested callouts (a processor making a sub-request, which is itself a callout from the proxy) create memory pressure. Each in-flight request holds: the buffered client body, processor state (WASM linear memory if applicable), the sub-request connection, and the sub-request response. Mitigations:
Release). Don't hold the full prompt while waiting for a sub-request.
Processor safety contract (sub-task)
If processors can orchestrate (make sub-requests that cause side effects on external systems), the framework must define safety guarantees:
X-Processor-Attempt), enabling downstream services to deduplicate.
PREFILL_SELECTED → PREFILL_COMPLETE → TRANSFER_COMPLETE → DECODE_FORWARDED). On retry, re-enter at the last successful checkpoint rather than replaying the entire sequence.
Design
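One way the retry contract could look, as a sketch only: each attempt carries an X-Processor-Attempt header, and a retry resumes after the last checkpoint that succeeded instead of replaying earlier steps. The checkpoint names come from the sequence above; the `START` state, the `run_workflow` function, the `Step` signature, and the choice of `ConnectionError` as the retryable failure are assumptions made for illustration.

```python
import asyncio
from enum import IntEnum
from typing import Awaitable, Callable, Dict


class Checkpoint(IntEnum):
    START = 0  # hypothetical initial state, not in the original sequence
    PREFILL_SELECTED = 1
    PREFILL_COMPLETE = 2
    TRANSFER_COMPLETE = 3
    DECODE_FORWARDED = 4


# Hypothetical step signature: receives the headers to attach to its sub-request.
Step = Callable[[Dict[str, str]], Awaitable[None]]


async def run_workflow(steps: Dict[Checkpoint, Step], max_attempts: int = 3) -> Checkpoint:
    reached = Checkpoint.START
    for attempt in range(1, max_attempts + 1):
        # Downstream services can use this header to deduplicate retried calls.
        headers = {"X-Processor-Attempt": str(attempt)}
        try:
            for target in Checkpoint:
                if target <= reached:
                    continue  # completed on an earlier attempt; do not replay
                await steps[target](headers)
                reached = target
            return reached
        except ConnectionError:
            continue  # retry, re-entering after the last successful checkpoint
    raise RuntimeError(f"gave up at {reached.name} after {max_attempts} attempts")


def make_step(name: str, log: list, fail_once: bool = False) -> Step:
    """Illustrative step that records its call and optionally fails the first time."""
    state = {"failed": False}

    async def step(headers: Dict[str, str]) -> None:
        if fail_once and not state["failed"]:
            state["failed"] = True
            raise ConnectionError(f"{name} failed")
        log.append((name, headers["X-Processor-Attempt"]))

    return step


if __name__ == "__main__":
    calls: list = []
    workflow = {
        Checkpoint.PREFILL_SELECTED: make_step("select", calls),
        Checkpoint.PREFILL_COMPLETE: make_step("prefill", calls),
        Checkpoint.TRANSFER_COMPLETE: make_step("transfer", calls, fail_once=True),
        Checkpoint.DECODE_FORWARDED: make_step("decode", calls),
    }
    print(asyncio.run(run_workflow(workflow)).name, calls)
```

In the usage above, the transfer step fails once, so the second attempt starts at the transfer step and the prefill steps run only on attempt 1.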
References