
config: raise vlm_api_concurrency default 1 → 16 #63

Merged

Liuhaai merged 1 commit into main from raise-vlm-api-concurrency-default on May 6, 2026

Conversation


Liuhaai (Collaborator) commented on May 6, 2026

Summary

Why

The API-layer semaphore in routers/inference.py:_get_vlm_semaphore was added to protect local GPU backends from concurrent generation. It defaulted to 1, which made sense before backend-owned locking landed in #62.

After #62, each backend owns its own _lock (a sketch follows the list below):

  • BaseBackend._lock = threading.Lock() — local backends serialize generation here
  • RemoteHTTPBackend._lock = nullcontext() — remote backends don't serialize
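For reference, a minimal sketch of that split. Only the two _lock attributes are from the PR; the class structure and method names are assumptions about trio-core's internals:

```python
import threading
from contextlib import nullcontext


class BaseBackend:
    """Local GPU backend: serialize generation on a real lock."""

    def __init__(self) -> None:
        # Concurrent callers queue here, one generation at a time.
        self._lock = threading.Lock()

    def generate(self, prompt: str) -> str:
        with self._lock:
            return self._generate(prompt)

    def _generate(self, prompt: str) -> str:
        raise NotImplementedError  # hypothetical subclass hook


class RemoteHTTPBackend(BaseBackend):
    """Remote backend: the upstream provider handles its own concurrency."""

    def __init__(self) -> None:
        # No-op context manager: `with self._lock:` does nothing, so
        # requests fan out to the provider in parallel.
        self._lock = nullcontext()
```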

The API semaphore is no longer the serialization point (see the sketch after this list):

  • Local backends: still serialize on the per-backend lock. A higher semaphore just lets requests wait at the lock instead of at the HTTP handler — observable behavior is identical.
  • Remote backends: the semaphore directly caps how many HTTPS requests run in parallel against the upstream provider.
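A sketch of the API-layer gate with the new default. The lazy _vlm_semaphore, the TRIO_VLM_API_CONCURRENCY env var, and the default of 16 are from the PR; the exact function body and the handler are assumptions:

```python
import asyncio
import os

_vlm_semaphore: asyncio.Semaphore | None = None


def _get_vlm_semaphore() -> asyncio.Semaphore:
    """Lazily build the API-layer semaphore on first use."""
    global _vlm_semaphore
    if _vlm_semaphore is None:
        # Default raised from 1 to 16; operators may still override it.
        limit = int(os.getenv("TRIO_VLM_API_CONCURRENCY", "16"))
        _vlm_semaphore = asyncio.Semaphore(limit)
    return _vlm_semaphore


async def _call_backend(payload: dict) -> dict:
    """Hypothetical stand-in for the actual backend call."""
    await asyncio.sleep(0.1)
    return {"ok": True}


async def describe(payload: dict) -> dict:
    # Caps in-flight work at `limit`: remote backends get real
    # parallelism; local backends still queue on their own _lock.
    async with _get_vlm_semaphore():
        return await _call_backend(payload)
```

With the no-op lock on RemoteHTTPBackend, this semaphore is the only concurrency cap on the remote path, which is why its default now matters.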

Prod impact

In the cortex deployment with TRIO_REMOTE_VLM_URL set (DashScope), the default of 1 caused a VLM avg latency of ~12.7s: cortex sends up to 10 concurrent describe calls, but trio-core gated them back down to 1. Operators had to set TRIO_VLM_API_CONCURRENCY=16 explicitly to activate parallelism.

Raising the default makes the common remote-backend case work out of the box. Operators can still lower it via TRIO_VLM_API_CONCURRENCY if a remote provider rate-limits.

Test plan

  • Existing inference/config/vlm tests pass (15 passed, 6 skipped).
  • Deploy to prod, confirm trio-core's lazy semaphore initializes at 16 (grep _vlm_semaphore in restart logs) and VLM avg latency drops from 12.7s → ~1–3s.
  • Watch DashScope error rate for 30 min — no spike in 429s.

🤖 Generated with Claude Code

Liuhaai merged commit 115a54c into main on May 6, 2026
7 checks passed