cllama: harden upstream-failure handling — full-timeout (600s) hangs, no auto-failover to fallback MODEL on 5xx, and tail latency drifting toward the stream-stall cancel cliff

## Summary

cllama's upstream-failure handling is the single biggest reliability risk for a live multi-agent pod. Three related gaps, observed downstream:

1. **Full-timeout hangs.** Upstream providers (seen with both xAI/grok and an OpenRouter-hosted model) accept a request and then never complete the stream; cllama waits the entire upstream timeout (~600s) before giving up with a 502. Every such failure is `latency_ms ≈ 600000`.
2. **No auto-failover to the declared fallback `MODEL` slot on 5xx.** Live behavior is retry-same-model-3×-then-die (`API call failed after 3 retries: HTTP 503`). When a provider circuit-breaks (e.g. "all declared providers cooling down"), every agent on that model blanks simultaneously with no fallback. (Downstream-tracked separately as tiverton-house#62.)
3. **Tail latency drifts toward the runtime's stream-stall cancel cliff.** The consuming runner (Hermes) cancels a turn at ~660s. A live per-model sample:

   | model | n | hangs ≥120s | p50 | p90 | max |
   |---|---|---|---|---|---|
   | claude-haiku-4-5 | 51 | 0 | 7.7s | 47s | 109s |
   | grok-4.3 | 147 | 1 | 3.1s | 7s | 168s |
   | minimax-m3 | 257 | **20** | 16s | 89s | **491s (8.2 min)** |

   `minimax-m3`'s max of 491s is within ~170s of the 660s cliff; a further degradation cancels turns mid-stream during market hours. All of these return `200` right up until they cross the cliff, so **latency, not error rate, is the early-warning signal.**

## Suggested directions

- **Shorten the cllama→upstream timeout** (e.g. 600s → ~120s) so a hung upstream fails fast instead of consuming a full turn.
- **Implement auto-failover to the declared fallback `MODEL` slot** on 5xx / repeated stream errors, rather than retry-same-model-then-die.
- Treat `missing choices` / mid-stream `INTERNAL_ERROR` as retryable.
- Optionally expose per-model latency percentiles in `claw audit` so operators can see the cliff approaching.

## Downstream references

mostlydev/tiverton-house#59 (full incident history across both desk models) and mostlydev/tiverton-house#62 (the fallback-slot gap specifically).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cllama: harden upstream-failure handling — full-timeout (600s) hangs, no auto-failover to fallback MODEL on 5xx, and tail latency drifting toward the stream-stall cancel cliff #319

Summary

Suggested directions

Downstream references

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

model	n	hangs ≥120s	p50	p90	max
claude-haiku-4-5	51	0	7.7s	47s	109s
grok-4.3	147	1	3.1s	7s	168s
minimax-m3	257	20	16s	89s	491s (8.2 min)

cllama: harden upstream-failure handling — full-timeout (600s) hangs, no auto-failover to fallback MODEL on 5xx, and tail latency drifting toward the stream-stall cancel cliff #319

Description

Summary

Suggested directions

Downstream references

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions