
config: raise vlm_api_concurrency default 1 → 16 #63

Merged

Liuhaai merged 1 commit into main from raise-vlm-api-concurrency-default on May 6, 2026

Conversation


Liuhaai (Collaborator) commented on May 6, 2026

Summary

Why

The API-layer semaphore in routers/inference.py:_get_vlm_semaphore was added to protect local GPU backends from concurrent generation. It defaulted to 1, which made sense before backend-owned locking landed in #62.

After #62, each backend owns its own _lock (a sketch follows the list below):

  • BaseBackend._lock = threading.Lock() — local backends serialize generation here
  • RemoteHTTPBackend._lock = nullcontext() — remote backends don't serialize
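For reference, a minimal sketch of that split. Only the two _lock attributes are from the PR; the class structure and method names are assumptions about trio-core's internals:

```python
import threading
from contextlib import nullcontext


class BaseBackend:
    """Local GPU backend: serialize generation on a real lock."""

    def __init__(self) -> None:
        # Concurrent callers queue here, one generation at a time.
        self._lock = threading.Lock()

    def generate(self, prompt: str) -> str:
        with self._lock:
            return self._generate(prompt)

    def _generate(self, prompt: str) -> str:
        raise NotImplementedError  # hypothetical subclass hook


class RemoteHTTPBackend(BaseBackend):
    """Remote backend: the upstream provider handles its own concurrency."""

    def __init__(self) -> None:
        # No-op context manager: `with self._lock:` does nothing, so
        # requests fan out to the provider in parallel.
        self._lock = nullcontext()
```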

The API semaphore is no longer the serialization point (see the sketch after this list):

  • Local backends: still serialize on the per-backend lock. A higher semaphore just lets requests wait at the lock instead of at the HTTP handler — observable behavior is identical.
  • Remote backends: the semaphore directly caps how many HTTPS requests run in parallel against the upstream provider.
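A sketch of the API-layer gate with the new default. The lazy _vlm_semaphore, the TRIO_VLM_API_CONCURRENCY env var, and the default of 16 are from the PR; the exact function body and the handler are assumptions:

```python
import asyncio
import os

_vlm_semaphore: asyncio.Semaphore | None = None


def _get_vlm_semaphore() -> asyncio.Semaphore:
    """Lazily build the API-layer semaphore on first use."""
    global _vlm_semaphore
    if _vlm_semaphore is None:
        # Default raised from 1 to 16; operators may still override it.
        limit = int(os.getenv("TRIO_VLM_API_CONCURRENCY", "16"))
        _vlm_semaphore = asyncio.Semaphore(limit)
    return _vlm_semaphore


async def _call_backend(payload: dict) -> dict:
    """Hypothetical stand-in for the actual backend call."""
    await asyncio.sleep(0.1)
    return {"ok": True}


async def describe(payload: dict) -> dict:
    # Caps in-flight work at `limit`: remote backends get real
    # parallelism; local backends still queue on their own _lock.
    async with _get_vlm_semaphore():
        return await _call_backend(payload)
```

With the no-op lock on RemoteHTTPBackend, this semaphore is the only concurrency cap on the remote path, which is why its default now matters.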

Prod impact

In the cortex deployment with TRIO_REMOTE_VLM_URL set (DashScope), the default of 1 caused a VLM avg latency of ~12.7s: cortex sends up to 10 concurrent describe calls, but trio-core gated them back down to 1. Operators had to set TRIO_VLM_API_CONCURRENCY=16 explicitly to activate parallelism.

Raising the default makes the common remote-backend case work out of the box. Operators can still lower it via TRIO_VLM_API_CONCURRENCY if a remote provider rate-limits.

Test plan

  • Existing inference/config/vlm tests pass (15 passed, 6 skipped).
  • Deploy to prod, confirm trio-core's lazy semaphore initializes at 16 (grep _vlm_semaphore in restart logs) and VLM avg latency drops from 12.7s → ~1–3s.
  • Watch DashScope error rate for 30 min — no spike in 429s.

🤖 Generated with Claude Code

Liuhaai merged commit 115a54c into main on May 6, 2026
7 checks passed