feat: add per-alias enforce_limits toggle for pre-dispatch context check#167

Merged
mcowger merged 3 commits into mcowger:main from darkspadez:claude/add-enforce-limits-toggle-O1uyF
Apr 16, 2026

Conversation

@darkspadez
Contributor

Adds an opt-in boolean on model aliases that runs a fast token-estimation
check before dispatching to the upstream provider. When enabled, the
dispatcher rejects locally with a 400 context_length_exceeded error if the
estimated input tokens plus reserved output tokens exceed the model's
context window — avoiding a wasted upstream round-trip and an opaque
provider-side 400.

Behavior:

  • Toggle lives on ModelConfig alongside use_image_fallthrough (not in
    metadata.overrides) — it is routing policy, not catalog data.
  • Reservation uses min(request.max_tokens, metadata.max_completion_tokens)
    to minimize false rejections when the caller asked for a small
    completion.
  • Fails open (with a debug log) when no context_length is known — can't
    enforce what we don't know.
  • Reuses existing estimateInputTokens() heuristic (microseconds, no WASM)
    with a 10% safety multiplier to cover the estimator's ±20–30% variance.
  • The context_length_exceeded code propagates through each endpoint's
    native error envelope (chat/messages/responses/gemini).

Includes 12 new tests covering toggle on/off, oversized/under-limit,
missing metadata (fail-open), max_tokens-vs-metadata precedence, and all
four API shapes (chat, messages, gemini, responses).
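The behavior above can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual code: the config/metadata shapes and the chars-divided-by-four estimator stand-in are assumptions.

```typescript
// Hypothetical shapes; the real ModelConfig/metadata types live in the repo.
interface ModelMetadata {
  context_length?: number;
  max_completion_tokens?: number;
}

interface ModelConfig {
  enforce_limits?: boolean; // per-alias routing-policy toggle
}

// Crude stand-in for the existing estimateInputTokens() heuristic.
function estimateInputTokens(messages: { content: string }[]): number {
  const chars = messages.reduce((n, m) => n + m.content.length, 0);
  return Math.ceil(chars / 4);
}

const SAFETY_MULTIPLIER = 1.1; // 10% headroom for estimator variance

function exceedsContextLimit(
  config: ModelConfig,
  meta: ModelMetadata,
  messages: { content: string }[],
  requestedMaxTokens?: number,
): boolean {
  if (!config.enforce_limits) return false;
  // Fail open: can't enforce a limit we don't know.
  if (meta.context_length === undefined) return false;

  const estimated = estimateInputTokens(messages) * SAFETY_MULTIPLIER;

  // Reserve the smaller of the caller's max_tokens and the model's cap.
  const caps = [requestedMaxTokens, meta.max_completion_tokens].filter(
    (n): n is number => n !== undefined,
  );
  const reserved = caps.length > 0 ? Math.min(...caps) : 0;

  return estimated + reserved > meta.context_length;
}
```

When this returns true, the dispatcher would reject locally with the 400 context_length_exceeded envelope for whichever API shape the request arrived in.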

@mcowger
Owner

mcowger commented Apr 16, 2026

Please rebase this on main.

Also, I've changed how I'm doing migrations - you'll no longer commit the journals, SQL files, etc. Please see CONTRIBUTING.md.

Also, please run a `bun install` after your rebase, which will configure the required git hooks.

darkspadez force-pushed the claude/add-enforce-limits-toggle-O1uyF branch from 34f5b7d to 61658b5 on April 16, 2026 04:36
claude and others added 3 commits April 15, 2026 22:54
- Move enforceContextLimit into the dispatcher's per-target loop, right
  after vision fallthrough completes and cooldown selects a live target.
  This validates the finalized (possibly fallthrough-expanded) prompt
  against the context window instead of the raw request. A thrown
  ContextLengthExceededError still escapes the loop since it's a
  client-side problem that failover can't resolve.
- Stop passing the whole UnifiedChatRequest to estimateInputTokens when
  originalBody is absent — unified fields like tools/metadata/model
  would inflate the estimate. Defensive fallback is now a minimal
  { messages } body.
- Clarify the Models page copy: the reservation is the *smaller* of
  max_tokens and the model's max completion, not an either/or.
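The "escapes the loop" behavior described above can be sketched like this. The class and function names are illustrative assumptions about the codebase, not the PR's exact identifiers.

```typescript
// A context-length rejection is a client-side problem, so it must not be
// swallowed by provider failover.
class ContextLengthExceededError extends Error {}

type Target = { name: string };

function dispatchWithFailover(
  targets: Target[],
  tryTarget: (t: Target) => string,
): string {
  let lastError: unknown;
  for (const target of targets) {
    try {
      // In the real dispatcher, the context check runs here, after vision
      // fallthrough and cooldown selection have finalized the prompt/target.
      return tryTarget(target);
    } catch (err) {
      // No other target can fix an oversized prompt: rethrow past failover.
      if (err instanceof ContextLengthExceededError) throw err;
      lastError = err; // provider-side failure: try the next target
    }
  }
  throw lastError;
}
```

The design point is that ordinary upstream errors advance the loop to the next target, while the local 400 short-circuits immediately.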
mcowger force-pushed the claude/add-enforce-limits-toggle-O1uyF branch from 61658b5 to 0201cfd on April 16, 2026 06:04
mcowger merged commit 5858376 into mcowger:main on Apr 16, 2026
1 check failed
github-actions bot pushed a commit that referenced this pull request Apr 17, 2026
…gle-O1uyF

feat: add per-alias enforce_limits toggle for pre-dispatch context check