feat: response_format support, backend-owned locking, and zoom-panel crop-describe #62
Merged
Conversation
…nel crop-describe

- Plumb OpenAI-compatible `response_format` through the engine and backends; the remote backend forwards it to the SDK (omitted when `None`), local backends ignore it.
- Move the generation lock from TrioCore to BaseBackend; RemoteHTTPBackend uses nullcontext so concurrent HTTP calls run in parallel while local GPU backends still serialize.
- Add `vlm_api_concurrency` config (default 1) to size the FastAPI VLM semaphore; raise it for remote deployments where the remote service handles its own scheduling.
- Replace per-crop VLM passes in /api/inference/crop-describe with a single composite call: YOLO crops become labeled zoom panels rendered alongside the full frame, with crop descriptions extracted from the response.
- Map upstream 4xx (incl. DashScope content moderation) to HTTP 422 with structured detail; keep 5xx/connection errors as 503.
- Tests: response_format forwarding/omission on the remote backend, and single-call crop-describe behavior with and without zoom panels.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
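A minimal sketch of the backend-owned locking and `response_format` forwarding described above, assuming backends roughly shaped like trio-core's `BaseBackend`/`RemoteHTTPBackend` (constructor arguments, method bodies, and the SDK client wiring are illustrative, not the actual implementation):

```python
import contextlib
import threading


class BaseBackend:
    """Local backend: serialize generation on a per-backend lock."""

    def __init__(self):
        # One generation at a time protects the local GPU / model weights.
        self._lock = threading.Lock()

    def generate(self, prompt, response_format=None):
        with self._lock:
            return self._generate_impl(prompt, response_format)

    def _generate_impl(self, prompt, response_format):
        # Local backends accept response_format but ignore it.
        raise NotImplementedError


class RemoteHTTPBackend(BaseBackend):
    """Remote backend: the upstream service does its own scheduling."""

    def __init__(self, client, model):
        super().__init__()
        self._client = client  # an OpenAI-compatible SDK client
        self._model = model
        # nullcontext() makes `with self._lock:` a no-op, so concurrent
        # requests issue their HTTP calls in parallel.
        self._lock = contextlib.nullcontext()

    def _generate_impl(self, prompt, response_format):
        kwargs = {
            "model": self._model,
            "messages": [{"role": "user", "content": prompt}],
        }
        # Omit the key entirely when None; some providers reject null.
        if response_format is not None:
            kwargs["response_format"] = response_format
        return self._client.chat.completions.create(**kwargs)
```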
Liuhaai added a commit that referenced this pull request on May 6, 2026
The API-layer semaphore in routers/inference.py:_get_vlm_semaphore was introduced to protect local GPU backends from concurrent generation. It defaulted to 1, which made sense before backend-owned locking landed in PR #62. Now that each backend owns its own _lock (BaseBackend._lock = Lock() for local, RemoteHTTPBackend._lock = nullcontext() for remote), the API semaphore is no longer the serialization point:

- Local backends still serialize on their per-backend lock — a higher semaphore value just lets requests wait at the lock instead of at the HTTP handler. Observable behavior is identical.
- Remote backends use nullcontext, so the semaphore value directly controls how many HTTPS requests run in parallel against the upstream provider (e.g. DashScope).

In prod (a multi-camera cortex deployment with TRIO_REMOTE_VLM_URL set), the default of 1 caused average VLM latency of ~12.7 s, because cortex sent up to 10 concurrent describe calls but trio-core gated them back to 1. Operators had to set TRIO_VLM_API_CONCURRENCY=16 explicitly to unblock parallelism.

Raising the default to 16 makes the common remote-backend case work out of the box. Operators can still lower it via TRIO_VLM_API_CONCURRENCY if a provider rate-limits aggressively.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
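A rough sketch of how the lazily initialized semaphore could be sized from that setting. The env-var name and default of 16 come from the commit message above; the function bodies and `call_backend` helper are hypothetical stand-ins, not the real routers/inference.py code:

```python
import asyncio
import os
from typing import Optional

_vlm_semaphore: Optional[asyncio.Semaphore] = None


def _get_vlm_semaphore() -> asyncio.Semaphore:
    """Lazily create the semaphore so the configured value is read once."""
    global _vlm_semaphore
    if _vlm_semaphore is None:
        limit = int(os.environ.get("TRIO_VLM_API_CONCURRENCY", "16"))
        _vlm_semaphore = asyncio.Semaphore(limit)
    return _vlm_semaphore


async def call_backend(prompt: str) -> str:
    ...  # hypothetical stand-in for the actual backend.generate(...) call


async def describe(prompt: str) -> str:
    # For remote backends (nullcontext lock) this semaphore is the only cap
    # on parallel upstream HTTPS calls; local backends still serialize on
    # their own per-backend lock regardless of this value.
    async with _get_vlm_semaphore():
        return await call_backend(prompt)
```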
Summary
- Plumb `response_format` through `TrioCore` and all backends. The remote backend forwards it to the OpenAI SDK (the key is omitted when `None` so providers that reject `null` aren't tripped); local MLX / ToMe / Compressed / Transformers backends accept and ignore it.
- Move the generation lock from `TrioCore` to `BaseBackend`. `RemoteHTTPBackend` overrides it with `contextlib.nullcontext()` so concurrent HTTP calls run in parallel; local GPU backends still serialize.
- Add `EngineConfig.vlm_api_concurrency` (default `1`). The FastAPI VLM semaphore is now lazy-initialized from this value — bump to 8–16 in remote-VLM deployments.
- Rework `/api/inference/crop-describe`: instead of one VLM pass per crop plus a scene pass, YOLO crops are rendered as labeled zoom panels alongside the full frame and sent in a single composite call. Crop descriptions are parsed back out of the model's structured response (`CROPS:` block). `max_crops=0` keeps the full-frame-only path. Big latency win when there are many crops.
- Map `APIStatusError` 4xx (including DashScope `data_inspection_failed` moderation rejections) into HTTP 422 with a structured detail payload so callers can drop the frame instead of treating the VLM service as down. Other failures still map to 503.
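A hedged sketch of the 4xx-to-422 mapping described in the last bullet; the shape of the detail payload (`code`, `upstream_status`) is an assumption here, not trio-core's actual schema:

```python
from fastapi import HTTPException
from openai import APIStatusError


def map_vlm_error(exc: Exception) -> HTTPException:
    """Translate upstream VLM failures into responses the caller can act on."""
    if isinstance(exc, APIStatusError) and 400 <= exc.status_code < 500:
        # 4xx (e.g. DashScope data_inspection_failed moderation rejections)
        # means this frame was rejected, not that the service is down, so
        # return 422 with a structured detail the caller can use to drop it.
        return HTTPException(
            status_code=422,
            detail={
                "code": "vlm_rejected",  # assumed field name
                "upstream_status": exc.status_code,
                "message": str(exc),
            },
        )
    # 5xx and connection errors still read as an outage.
    return HTTPException(status_code=503, detail="VLM backend unavailable")
```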
Test plan

- `pytest tests/` — 399 passed, 7 skipped locally
- `response_format` is forwarded to the OpenAI SDK on `RemoteHTTPBackend.generate`
- `response_format` key is omitted when `None`
- `/api/inference/crop-describe` makes exactly one VLM call and returns extracted crop descriptions
- `max_crops=0` returns a single full-frame pass with empty `crop_descriptions`
- `vlm_api_concurrency` > 1
- `crop-describe` with a real frame + YOLO detections

🤖 Generated with Claude Code
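For illustration, the two `response_format` unit tests could look roughly like this when written against the `RemoteHTTPBackend` sketch shown earlier; the mocked client and model name are placeholders, and the real tests target trio-core's own backend class:

```python
from unittest.mock import MagicMock


def test_response_format_forwarded():
    client = MagicMock()
    backend = RemoteHTTPBackend(client, model="qwen-vl")  # placeholder model name
    backend.generate("describe the scene", response_format={"type": "json_object"})
    kwargs = client.chat.completions.create.call_args.kwargs
    assert kwargs["response_format"] == {"type": "json_object"}


def test_response_format_omitted_when_none():
    client = MagicMock()
    backend = RemoteHTTPBackend(client, model="qwen-vl")
    backend.generate("describe the scene", response_format=None)
    kwargs = client.chat.completions.create.call_args.kwargs
    assert "response_format" not in kwargs
```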