feat: response_format support, backend-owned locking, and zoom-panel crop-describe#62

Merged
Liuhaai merged 2 commits into main from feat/vlm-response-format-and-zoom-panels on May 6, 2026

Conversation


@Liuhaai Liuhaai commented May 6, 2026

Summary

  • Plumbs OpenAI-compatible response_format through TrioCore and all backends. The remote backend forwards it to the OpenAI SDK (the key is omitted when None, so providers that reject null aren't tripped up); the local MLX / ToMe / Compressed / Transformers backends accept and ignore it.
  • Moves the generation lock from TrioCore to BaseBackend. RemoteHTTPBackend overrides it with contextlib.nullcontext() so concurrent HTTP calls run in parallel; local GPU backends still serialize. Both changes are sketched right after this list.
  • Adds EngineConfig.vlm_api_concurrency (default 1). The FastAPI VLM semaphore is now lazy-initialized from this value — bump to 8–16 in remote-VLM deployments.
  • Rewrites /api/inference/crop-describe: instead of one VLM pass per crop plus a scene pass, YOLO crops are rendered as labeled zoom panels alongside the full frame and sent in a single composite call. Crop descriptions are parsed back out of the model's structured response (the CROPS: block; a parsing sketch follows the test plan). max_crops=0 keeps the full-frame-only path. This is a big latency win when there are many crops.
  • Translates upstream APIStatusError 4xx responses (including DashScope data_inspection_failed moderation rejections) into HTTP 422 with a structured detail payload, so callers can drop the offending frame instead of treating the VLM service as down; other failures still map to 503 (see the second sketch below).
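
A minimal sketch of how the first two items fit together, assuming a `BaseBackend` / `RemoteHTTPBackend` split along the lines described above; the constructor arguments, the `_generate` hook, and the choice of a `threading.Lock` for local backends are illustrative:

```python
import contextlib
import threading


class BaseBackend:
    """Local GPU backends serialize generation on a per-backend lock."""

    def __init__(self):
        self._lock = threading.Lock()

    def generate(self, prompt, response_format=None, **kwargs):
        # Local backends accept response_format for API compatibility and ignore it.
        with self._lock:
            return self._generate(prompt, **kwargs)

    def _generate(self, prompt, **kwargs):
        raise NotImplementedError


class RemoteHTTPBackend(BaseBackend):
    """Remote backend: no serialization; response_format is forwarded to the SDK."""

    def __init__(self, client, model):
        super().__init__()
        self._client = client  # an openai.OpenAI-compatible client
        self._model = model
        self._lock = contextlib.nullcontext()  # concurrent HTTP calls run in parallel

    def generate(self, prompt, response_format=None, **kwargs):
        extra = {}
        if response_format is not None:
            # Omit the key entirely when None so providers that reject null aren't tripped up.
            extra["response_format"] = response_format
        with self._lock:
            return self._client.chat.completions.create(
                model=self._model,
                messages=[{"role": "user", "content": prompt}],
                **extra,
            )
```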

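And a sketch of the 4xx-to-422 translation, using the OpenAI SDK's `APIStatusError` / `APIConnectionError` types; the helper name and the exact shape of the `detail` payload are illustrative:

```python
from fastapi import HTTPException
from openai import APIConnectionError, APIStatusError


def translate_vlm_error(exc: Exception) -> HTTPException:
    """Map upstream VLM failures onto HTTP statuses the caller can act on."""
    if isinstance(exc, APIStatusError) and 400 <= exc.status_code < 500:
        # The provider rejected this particular request (e.g. DashScope's
        # data_inspection_failed moderation): tell the caller to drop the frame.
        return HTTPException(
            status_code=422,
            detail={
                "error": "upstream_rejected_request",
                "upstream_status": exc.status_code,
                "upstream_message": str(exc),
            },
        )
    # 5xx and connection problems: the VLM service itself is unhealthy.
    if isinstance(exc, (APIStatusError, APIConnectionError)):
        return HTTPException(status_code=503, detail="VLM backend unavailable")
    return HTTPException(status_code=503, detail="VLM backend error")
```
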
Test plan

  • pytest tests/ — 399 passed, 7 skipped locally
  • New tests cover:
    • response_format is forwarded to the OpenAI SDK on RemoteHTTPBackend.generate
    • response_format key is omitted when None
    • /api/inference/crop-describe makes exactly one VLM call and returns extracted crop descriptions
    • max_crops=0 returns a single full-frame pass with empty crop_descriptions
  • Smoke-test against a remote VLM deployment with vlm_api_concurrency>1
  • Smoke-test crop-describe with a real frame + YOLO detections

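A rough sketch of the crop-description extraction exercised above. The PR only says the descriptions come from a CROPS: block, so the per-line format assumed here (one `<index>: <description>` line per labeled zoom panel) is illustrative:

```python
import re


def parse_crop_descriptions(text: str) -> dict[int, str]:
    """Pull per-crop descriptions out of the model's structured response.

    Assumes the reply ends with a block like:

        CROPS:
        1: a person holding a red umbrella
        2: a parked blue sedan
    """
    descriptions: dict[int, str] = {}
    _, sep, block = text.partition("CROPS:")
    if not sep:
        return descriptions  # no block found, e.g. the max_crops=0 full-frame-only path
    for line in block.splitlines():
        match = re.match(r"\s*(\d+)\s*[:.)-]\s*(.+)", line)
        if match:
            descriptions[int(match.group(1))] = match.group(2).strip()
    return descriptions
```
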
🤖 Generated with Claude Code

Liuhaai and others added 2 commits May 5, 2026 21:31
…nel crop-describe

- Plumb OpenAI-compatible `response_format` through engine + backends; remote
  backend forwards it to the SDK (omitted when None), local backends ignore it.
- Move generation lock from TrioCore to BaseBackend; RemoteHTTPBackend uses
  nullcontext so concurrent HTTP calls run in parallel while local GPU
  backends still serialize.
- Add `vlm_api_concurrency` config (default 1) to size the FastAPI VLM
  semaphore; raise it for remote deployments where the remote service
  handles its own scheduling.
- Replace per-crop VLM passes in /api/inference/crop-describe with a single
  composite call: YOLO crops become labeled zoom panels rendered alongside
  the full frame, with crop descriptions extracted from the response.
- Map upstream 4xx (incl. DashScope content moderation) to HTTP 422 with
  structured detail; keep 5xx/connection errors as 503.
- Tests: response_format forwarding/omission on the remote backend, and
  single-call crop-describe behavior with and without zoom panels.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Liuhaai Liuhaai merged commit c2e9d6c into main May 6, 2026
7 checks passed
Liuhaai added a commit that referenced this pull request May 6, 2026
The API-layer semaphore in routers/inference.py:_get_vlm_semaphore was
introduced to protect local GPU backends from concurrent generation. It
defaulted to 1, which made sense before backend-owned locking landed in
PR #62.

Now that each backend owns its own _lock (BaseBackend._lock = Lock() for
local, RemoteHTTPBackend._lock = nullcontext() for remote), the API
semaphore is no longer the serialization point:

  - Local backends still serialize on their per-backend lock — a higher
    semaphore value just lets requests wait at the lock instead of at the
    HTTP handler. Observable behavior is identical.
  - Remote backends use nullcontext, so the semaphore value directly
    controls how many HTTPS requests run in parallel against the
    upstream provider (e.g. DashScope).

In prod (multi-camera cortex deployment with TRIO_REMOTE_VLM_URL set),
default=1 caused VLM avg latency of ~12.7s because cortex sent up to 10
concurrent describe calls but trio-core gated them back to 1. Operators
had to set TRIO_VLM_API_CONCURRENCY=16 explicitly to unblock parallelism.

Raising the default to 16 makes the common remote-backend case work out
of the box. Operators can still lower it via TRIO_VLM_API_CONCURRENCY if
a provider rate-limits aggressively.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
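
For context, a minimal sketch of the lazily initialized semaphore this commit retunes; `_get_vlm_semaphore` and `vlm_api_concurrency` appear in the PR, while the module-level caching and the `config` parameter shown here are assumptions about how the lazy init is wired:

```python
import asyncio

_vlm_semaphore: asyncio.Semaphore | None = None


def _get_vlm_semaphore(config) -> asyncio.Semaphore:
    """Create the VLM semaphore on first use, sized from EngineConfig.vlm_api_concurrency."""
    global _vlm_semaphore
    if _vlm_semaphore is None:
        _vlm_semaphore = asyncio.Semaphore(config.vlm_api_concurrency)
    return _vlm_semaphore


# Inside a request handler, the semaphore bounds concurrent upstream calls:
#
#     async with _get_vlm_semaphore(config):
#         result = await backend.generate(...)
```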