feat: response_format support, backend-owned locking, and zoom-panel crop-describe #62
Merged
Conversation
…nel crop-describe

- Plumb OpenAI-compatible `response_format` through the engine and backends; the remote backend forwards it to the SDK (omitted when `None`), local backends ignore it.
- Move the generation lock from TrioCore to BaseBackend; RemoteHTTPBackend uses nullcontext so concurrent HTTP calls run in parallel while local GPU backends still serialize.
- Add `vlm_api_concurrency` config (default 1) to size the FastAPI VLM semaphore; raise it for remote deployments where the remote service handles its own scheduling.
- Replace per-crop VLM passes in /api/inference/crop-describe with a single composite call: YOLO crops become labeled zoom panels rendered alongside the full frame, with crop descriptions extracted from the response.
- Map upstream 4xx (incl. DashScope content moderation) to HTTP 422 with structured detail; keep 5xx/connection errors as 503.
- Tests: response_format forwarding/omission on the remote backend, and single-call crop-describe behavior with and without zoom panels.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
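A minimal sketch of the backend-owned locking and `response_format` forwarding described above, assuming backends roughly shaped like trio-core's `BaseBackend`/`RemoteHTTPBackend` (constructor arguments, method bodies, and the SDK client wiring are illustrative, not the actual implementation):

```python
import contextlib
import threading


class BaseBackend:
    """Local backend: serialize generation on a per-backend lock."""

    def __init__(self):
        # One generation at a time protects the local GPU / model weights.
        self._lock = threading.Lock()

    def generate(self, prompt, response_format=None):
        with self._lock:
            return self._generate_impl(prompt, response_format)

    def _generate_impl(self, prompt, response_format):
        # Local backends accept response_format but ignore it.
        raise NotImplementedError


class RemoteHTTPBackend(BaseBackend):
    """Remote backend: the upstream service does its own scheduling."""

    def __init__(self, client, model):
        super().__init__()
        self._client = client  # an OpenAI-compatible SDK client
        self._model = model
        # nullcontext() makes `with self._lock:` a no-op, so concurrent
        # requests issue their HTTP calls in parallel.
        self._lock = contextlib.nullcontext()

    def _generate_impl(self, prompt, response_format):
        kwargs = {
            "model": self._model,
            "messages": [{"role": "user", "content": prompt}],
        }
        # Omit the key entirely when None; some providers reject null.
        if response_format is not None:
            kwargs["response_format"] = response_format
        return self._client.chat.completions.create(**kwargs)
```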
Liuhaai added a commit that referenced this pull request on May 6, 2026
The API-layer semaphore in routers/inference.py:_get_vlm_semaphore was introduced to protect local GPU backends from concurrent generation. It defaulted to 1, which made sense before backend-owned locking landed in PR #62. Now that each backend owns its own _lock (BaseBackend._lock = Lock() for local, RemoteHTTPBackend._lock = nullcontext() for remote), the API semaphore is no longer the serialization point:

- Local backends still serialize on their per-backend lock — a higher semaphore value just lets requests wait at the lock instead of at the HTTP handler. Observable behavior is identical.
- Remote backends use nullcontext, so the semaphore value directly controls how many HTTPS requests run in parallel against the upstream provider (e.g. DashScope).

In prod (a multi-camera cortex deployment with TRIO_REMOTE_VLM_URL set), the default of 1 caused average VLM latency of ~12.7 s, because cortex sent up to 10 concurrent describe calls but trio-core gated them back to 1. Operators had to set TRIO_VLM_API_CONCURRENCY=16 explicitly to unblock parallelism.

Raising the default to 16 makes the common remote-backend case work out of the box. Operators can still lower it via TRIO_VLM_API_CONCURRENCY if a provider rate-limits aggressively.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
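A rough sketch of how the lazily initialized semaphore could be sized from that setting. The env-var name and default of 16 come from the commit message above; the function bodies and `call_backend` helper are hypothetical stand-ins, not the real routers/inference.py code:

```python
import asyncio
import os
from typing import Optional

_vlm_semaphore: Optional[asyncio.Semaphore] = None


def _get_vlm_semaphore() -> asyncio.Semaphore:
    """Lazily create the semaphore so the configured value is read once."""
    global _vlm_semaphore
    if _vlm_semaphore is None:
        limit = int(os.environ.get("TRIO_VLM_API_CONCURRENCY", "16"))
        _vlm_semaphore = asyncio.Semaphore(limit)
    return _vlm_semaphore


async def call_backend(prompt: str) -> str:
    ...  # hypothetical stand-in for the actual backend.generate(...) call


async def describe(prompt: str) -> str:
    # For remote backends (nullcontext lock) this semaphore is the only cap
    # on parallel upstream HTTPS calls; local backends still serialize on
    # their own per-backend lock regardless of this value.
    async with _get_vlm_semaphore():
        return await call_backend(prompt)
```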
Summary
- Plumb `response_format` through `TrioCore` and all backends. The remote backend forwards it to the OpenAI SDK (the key is omitted when `None` so providers that reject `null` aren't tripped); local MLX / ToMe / Compressed / Transformers backends accept and ignore it.
- Move the generation lock from `TrioCore` to `BaseBackend`. `RemoteHTTPBackend` overrides it with `contextlib.nullcontext()` so concurrent HTTP calls run in parallel; local GPU backends still serialize.
- Add `EngineConfig.vlm_api_concurrency` (default `1`). The FastAPI VLM semaphore is now lazy-initialized from this value — bump to 8–16 in remote-VLM deployments.
- Rework `/api/inference/crop-describe`: instead of one VLM pass per crop plus a scene pass, YOLO crops are rendered as labeled zoom panels alongside the full frame and sent in a single composite call. Crop descriptions are parsed back out of the model's structured response (`CROPS:` block). `max_crops=0` keeps the full-frame-only path. Big latency win when there are many crops.
- Map `APIStatusError` 4xx (including DashScope `data_inspection_failed` moderation rejections) into HTTP 422 with a structured detail payload so callers can drop the frame instead of treating the VLM service as down. Other failures still map to 503.
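A hedged sketch of the 4xx-to-422 mapping described in the last bullet; the shape of the detail payload (`code`, `upstream_status`) is an assumption here, not trio-core's actual schema:

```python
from fastapi import HTTPException
from openai import APIStatusError


def map_vlm_error(exc: Exception) -> HTTPException:
    """Translate upstream VLM failures into responses the caller can act on."""
    if isinstance(exc, APIStatusError) and 400 <= exc.status_code < 500:
        # 4xx (e.g. DashScope data_inspection_failed moderation rejections)
        # means this frame was rejected, not that the service is down, so
        # return 422 with a structured detail the caller can use to drop it.
        return HTTPException(
            status_code=422,
            detail={
                "code": "vlm_rejected",  # assumed field name
                "upstream_status": exc.status_code,
                "message": str(exc),
            },
        )
    # 5xx and connection errors still read as an outage.
    return HTTPException(status_code=503, detail="VLM backend unavailable")
```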
Test plan

- `pytest tests/` — 399 passed, 7 skipped locally
- `response_format` is forwarded to the OpenAI SDK on `RemoteHTTPBackend.generate`
- `response_format` key is omitted when `None`
- `/api/inference/crop-describe` makes exactly one VLM call and returns extracted crop descriptions
- `max_crops=0` returns a single full-frame pass with empty `crop_descriptions`
- `vlm_api_concurrency` > 1
- `crop-describe` with a real frame + YOLO detections

🤖 Generated with Claude Code
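For illustration, the two `response_format` unit tests could look roughly like this when written against the `RemoteHTTPBackend` sketch shown earlier; the mocked client and model name are placeholders, and the real tests target trio-core's own backend class:

```python
from unittest.mock import MagicMock


def test_response_format_forwarded():
    client = MagicMock()
    backend = RemoteHTTPBackend(client, model="qwen-vl")  # placeholder model name
    backend.generate("describe the scene", response_format={"type": "json_object"})
    kwargs = client.chat.completions.create.call_args.kwargs
    assert kwargs["response_format"] == {"type": "json_object"}


def test_response_format_omitted_when_none():
    client = MagicMock()
    backend = RemoteHTTPBackend(client, model="qwen-vl")
    backend.generate("describe the scene", response_format=None)
    kwargs = client.chat.completions.create.call_args.kwargs
    assert "response_format" not in kwargs
```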