feat: add per-alias enforce_limits toggle for pre-dispatch context check #167
Merged
mcowger merged 3 commits into mcowger:main · Apr 16, 2026
Conversation
Owner
Please rebase this on main. Also, I've changed how I'm doing migrations: you'll no longer commit the journals, SQL files, etc. Please see CONTRIBUTING.md. Also, please run a
Adds an opt-in boolean on model aliases that runs a fast token-estimation check before dispatching to the upstream provider. When enabled, the dispatcher rejects locally with a 400 context_length_exceeded error if the estimated input tokens plus reserved output tokens exceed the model's context window, avoiding a wasted upstream round-trip and an opaque provider-side 400.

Behavior:
- Toggle lives on ModelConfig alongside use_image_fallthrough (not in metadata.overrides): it is routing policy, not catalog data.
- Reservation uses min(request.max_tokens, metadata.max_completion_tokens) to minimize false rejections when the caller asked for a small completion.
- Fails open (with a debug log) when no context_length is known; we can't enforce what we don't know.
- Reuses the existing estimateInputTokens() heuristic (microseconds, no WASM) with a 10% safety multiplier to cover the estimator's ±20–30% variance.
- The context_length_exceeded code propagates through each endpoint's native error envelope (chat/messages/responses/gemini).

Includes 12 new tests covering toggle on/off, oversized/under-limit, missing metadata (fail-open), max_tokens-vs-metadata precedence, and all four API shapes (chat, messages, gemini, responses).

https://claude.ai/code/session_0118yZYx8rXc4oV2SFpAiBcF
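The check described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the type shapes, the 4-chars-per-token stand-in for estimateInputTokens(), and the function signature are all assumptions; only the min() reservation, the 10% safety multiplier, and the fail-open rule come from the description.

```typescript
// Hypothetical shapes mirroring the PR description, not the project's real types.
interface ModelMetadata {
  context_length?: number;        // total context window, if known
  max_completion_tokens?: number; // model's maximum output size
}

// Crude stand-in for the project's estimateInputTokens() heuristic:
// roughly 4 characters per token.
function estimateInputTokens(messages: { content: string }[]): number {
  const chars = messages.reduce((n, m) => n + m.content.length, 0);
  return Math.ceil(chars / 4);
}

const SAFETY_MULTIPLIER = 1.1; // cover the estimator's ±20–30% variance

// Returns the error code when the request would overflow the context
// window, or null when it fits (or when we can't know: fail open).
function checkContextLimit(
  messages: { content: string }[],
  requestedMaxTokens: number | undefined,
  meta: ModelMetadata,
): string | null {
  // Fail open: can't enforce what we don't know.
  if (meta.context_length === undefined) return null;

  // Reserve the *smaller* of the caller's max_tokens and the model's max
  // completion, minimizing false rejections for small completions.
  const reserved = Math.min(
    requestedMaxTokens ?? Infinity,
    meta.max_completion_tokens ?? Infinity,
  );
  const estimated = Math.ceil(
    estimateInputTokens(messages) * SAFETY_MULTIPLIER,
  );

  const total = estimated + (Number.isFinite(reserved) ? reserved : 0);
  return total > meta.context_length ? "context_length_exceeded" : null;
}
```

A caller that hits the limit would then be rejected locally with a 400 carrying this code, rather than waiting on an upstream round-trip.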
- Move enforceContextLimit into the dispatcher's per-target loop, right
after vision fallthrough completes and cooldown selects a live target.
This validates the finalized (possibly fallthrough-expanded) prompt
against the context window instead of the raw request. A thrown
ContextLengthExceededError still escapes the loop since it's a
client-side problem that failover can't resolve.
- Stop passing the whole UnifiedChatRequest to estimateInputTokens when
originalBody is absent — unified fields like tools/metadata/model
would inflate the estimate. Defensive fallback is now a minimal
{ messages } body.
- Clarify the Models page copy: the reservation is the *smaller* of
max_tokens and the model's max completion, not an either/or.
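The two dispatcher changes above can be sketched roughly like this. Everything here is illustrative: the loop shape, the stub enforceContextLimit/sendUpstream helpers, and the type fields are assumptions; what the sketch preserves from the PR is the minimal { messages } fallback for estimation and the rule that a ContextLengthExceededError escapes the failover loop.

```typescript
// Illustrative error type matching the PR's description.
class ContextLengthExceededError extends Error {
  readonly status = 400;
  readonly code = "context_length_exceeded";
}

// Hypothetical shape of the unified request; not the project's real type.
interface UnifiedChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  tools?: unknown[];      // unified fields that would inflate an estimate
  metadata?: unknown;
  originalBody?: unknown; // raw client body, when available
}

// When originalBody is absent, estimate against a minimal { messages }
// body only: tools/metadata/model would inflate the token estimate.
function bodyForEstimation(req: UnifiedChatRequest): unknown {
  return req.originalBody ?? { messages: req.messages };
}

// Stubs standing in for the real implementations.
function enforceContextLimit(body: unknown): void { /* may throw ContextLengthExceededError */ }
async function sendUpstream(target: string, req: UnifiedChatRequest): Promise<string> {
  return `ok:${target}`;
}

// Hypothetical per-target dispatch loop: the context check runs after the
// prompt is finalized (vision fallthrough done, live target selected), so
// it validates what will actually be sent upstream.
async function dispatch(req: UnifiedChatRequest, targets: string[]): Promise<string> {
  let lastError: unknown;
  for (const target of targets) {
    try {
      enforceContextLimit(bodyForEstimation(req));
      return await sendUpstream(target, req);
    } catch (err) {
      // A client-side overflow can't be fixed by failover: rethrow immediately.
      if (err instanceof ContextLengthExceededError) throw err;
      lastError = err; // provider failure: try the next target
    }
  }
  throw lastError;
}
```

The design point is the asymmetry in the catch: provider errors advance the loop, while a context overflow is the client's problem and short-circuits all remaining targets.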
github-actions bot pushed a commit that referenced this pull request on Apr 17, 2026:
…gle-O1uyF feat: add per-alias enforce_limits toggle for pre-dispatch context check