Summary
cllama's upstream-failure handling is the single biggest reliability risk for a live multi-agent pod. Three related gaps, observed downstream:
- Full-timeout hangs. Upstream providers (seen with both xAI/grok and an OpenRouter-hosted model) accept a request and then never complete the stream; cllama waits the entire upstream timeout (~600s) before giving up with a 502. Every such failure is `latency_ms ≈ 600000`.
- No auto-failover to the declared fallback `MODEL` slot on 5xx. Live behavior is retry-same-model-3×-then-die (`API call failed after 3 retries: HTTP 503`). When a provider circuit-breaks (e.g. "all declared providers cooling down"), every agent on that model blanks simultaneously with no fallback. (Downstream-tracked separately as tiverton-house#62.)
- Tail latency drifts toward the runtime's stream-stall cancel cliff. The consuming runner (Hermes) cancels a turn at ~660s. A live per-model sample:

| model | n | hangs ≥120s | p50 | p90 | max |
| --- | --- | --- | --- | --- | --- |
| claude-haiku-4-5 | 51 | 0 | 7.7s | 47s | 109s |
| grok-4.3 | 147 | 1 | 3.1s | 7s | 168s |
| minimax-m3 | 257 | 20 | 16s | 89s | 491s (8.2 min) |

minimax-m3's max of 491s is within ~170s of the 660s cliff; any further degradation would cancel turns mid-stream during market hours. All of these requests return `200` right up until they cross the cliff, so latency, not error rate, is the early-warning signal.
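
To make the "latency as early warning" point concrete, here is a minimal Python sketch of the headroom arithmetic. It is not cllama or `claw audit` code: the function names, the 180s warning margin, and the idea of feeding it raw per-request latencies are assumptions for illustration. Only the 660s cliff and the per-model max values come from the sample above.

```python
from statistics import quantiles

# Approximate Hermes stream-stall cancel point, taken from the report above.
CANCEL_CLIFF_S = 660.0


def latency_summary(samples_s):
    """Return p50, p90 and max (in seconds) for a list of request latencies."""
    if len(samples_s) < 2:
        raise ValueError("need at least two samples")
    # n=10 yields nine cut points: index 4 is the median, index 8 is p90.
    cuts = quantiles(samples_s, n=10, method="inclusive")
    return {"p50": cuts[4], "p90": cuts[8], "max": max(samples_s)}


def cliff_headroom_s(max_latency_s, cliff_s=CANCEL_CLIFF_S):
    """Seconds left between the slowest observed request and the cancel cliff."""
    return cliff_s - max_latency_s


def models_near_cliff(max_by_model, margin_s=180.0, cliff_s=CANCEL_CLIFF_S):
    """Names of models whose observed max is within margin_s of the cliff.

    margin_s is a hypothetical alert threshold, not a value from the report.
    """
    return sorted(
        name
        for name, max_s in max_by_model.items()
        if cliff_headroom_s(max_s, cliff_s) < margin_s
    )


# Max latencies (seconds) copied from the per-model sample above.
OBSERVED_MAX_S = {"claude-haiku-4-5": 109.0, "grok-4.3": 168.0, "minimax-m3": 491.0}

if __name__ == "__main__":
    for name, max_s in OBSERVED_MAX_S.items():
        print(name, "headroom:", cliff_headroom_s(max_s), "s")
    print("near cliff:", models_near_cliff(OBSERVED_MAX_S))
```

With the sampled values, only minimax-m3 falls inside the assumed 180s margin (660 − 491 = 169s of headroom); the other two models have roughly 490s or more to spare.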
Suggested directions
- Shorten the cllama→upstream timeout (e.g. 600s → ~120s) so a hung upstream fails fast instead of consuming a full turn.
- Implement auto-failover to the declared fallback `MODEL` slot on 5xx / repeated stream errors, rather than retry-same-model-then-die.
- Treat `missing choices` / mid-stream `INTERNAL_ERROR` as retryable.
- Optionally expose per-model latency percentiles in `claw audit` so operators can see the cliff approaching.
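
The first three directions could fit together roughly as in the sketch below. It is written in Python for brevity and is not cllama's actual code: `UpstreamError`, `is_retryable`, `complete_with_failover`, and the `call(model, timeout_s)` callback are hypothetical names, and the choice to skip remaining same-model retries after a timeout is an assumption, not something the report specifies.

```python
# Stream-level failures the report suggests treating as retryable.
RETRYABLE_STREAM_ERRORS = {"missing choices", "INTERNAL_ERROR"}


class UpstreamError(Exception):
    """Hypothetical error type carrying an HTTP status or a stream error tag."""

    def __init__(self, status=None, stream_error=None):
        super().__init__(f"upstream failed: status={status} stream_error={stream_error}")
        self.status = status
        self.stream_error = stream_error


def is_retryable(err):
    """True for timeouts, 5xx responses, and the listed mid-stream errors."""
    if isinstance(err, TimeoutError):
        return True
    if isinstance(err, UpstreamError):
        is_5xx = err.status is not None and 500 <= err.status <= 599
        return is_5xx or err.stream_error in RETRYABLE_STREAM_ERRORS
    return False


def complete_with_failover(call, primary, fallback=None, retries=3, timeout_s=120.0):
    """Try the primary model, then the fallback slot, instead of dying.

    call(model, timeout_s) is a caller-supplied function that performs one
    upstream request and raises TimeoutError or UpstreamError on failure.
    Returns (model_used, response).
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    models = [primary] + ([fallback] if fallback else [])
    last_err = None
    for model in models:
        for _attempt in range(retries):
            try:
                return model, call(model, timeout_s)
            except Exception as err:
                if not is_retryable(err):
                    raise  # e.g. a 4xx: retrying or failing over will not help
                last_err = err
                if isinstance(err, TimeoutError):
                    # Assumed policy: a hung upstream already cost timeout_s,
                    # so move to the next model rather than retry the same one.
                    break
    raise last_err
```

The timeout policy matters for the turn budget: with a 120s timeout, three same-model retries of a hung upstream would spend about 360s before the fallback is even tried, which leaves little room under a ~660s cancel cliff. Moving to the fallback after the first timeout caps that cost at roughly one timeout per model.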
Downstream references
mostlydev/tiverton-house#59 (full incident history across both desk models) and mostlydev/tiverton-house#62 (the fallback-slot gap specifically).