Smithy fleet reliability — rolling tracker (meld side) #139

Rolling tracker for meld's experience of the smithy CI fleet. Companion issue to whatever lands in `pulseengine/smithy`; this issue captures meld-specific observations, failed PRs, and the acceptance bar for releases.

ubuntu-latest is **not** a fallback option per project policy. Smithy stability is a release-path dependency.

## Observed failure modes (from v0.7.0 release cycle, 2026-05-03 → 2026-05-11)

### 1. Whole-fleet offline events

- **2026-05-03 ~19:00Z**: all 8 `pulseengine-ci-01-*` runners reported `status: offline` simultaneously. Recovered ~3 h later (`status: online`, `busy: true`).
- **2026-05-10 ~17:00Z–17:30Z**: full offline window again, recovered within ~20 min.

Pattern: when the host(s) go down, every queued job waits indefinitely, because jobs queue against runner labels rather than against fleet health. meld saw a 2 h 11 min queue on PR #135 with zero pickup before the fleet came back.
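Since the queue itself gives no signal, a whole-fleet outage might be easier to catch from the runner list. The sketch below is a hypothetical check, assuming runner records shaped like the `runners` array in GitHub's self-hosted runner listing (only `name` and `status` are read). The function name and the default prefix are illustrative and not part of smithy.

```python
from typing import Iterable, Mapping


def fleet_is_offline(runners: Iterable[Mapping],
                     prefix: str = "pulseengine-ci-01-") -> bool:
    """Return True when every runner whose name starts with `prefix` is offline.

    An empty fleet counts as offline, since nothing can pick up jobs.
    """
    fleet = [r for r in runners if str(r.get("name", "")).startswith(prefix)]
    return all(r.get("status") != "online" for r in fleet)


# Illustrative records only, not a captured API response.
sample = [
    {"name": "pulseengine-ci-01-5", "status": "offline", "busy": False},
    {"name": "pulseengine-ci-01-7", "status": "offline", "busy": False},
]
print(fleet_is_offline(sample))
```

A scheduled job running on infrastructure outside the fleet could call this and alert, so a stalled queue is noticed before a PR has waited for hours.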

### 2. Disk-space failures on rust-cpu runners

The runner pool is currently 3× rust-cpu (`pulseengine-ci-01-{5,6,7}`). Two of the three show the same pattern:

| Runner | Symptom |
|---|---|
| `-5` | Jobs fail in 30–70 s with `error: failed to build archive: No space left on device (os error 28)` |
| `-6` | Same as `-5`; also went offline mid-day twice during the v0.7.0 cycle |
| `-7` | Fuzz + bench succeed here |

Concrete failures: PR #134 (Bench compile), PR #135 (Bench, Coverage, fuzz_resolver_terminates), PR #137 (Bench, Coverage, Test, fuzz_parse_component, fuzz_fusion_roundtrip). Each landed via `gh run rerun --failed` until `-7` was assigned, except PR #137, which had to merge via `--admin` because the rerun cycle never converged.
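One way to make these failures cheaper is a preflight step that checks free space before any compile starts. This is a minimal sketch using only the Python standard library; the 20 GiB default is an assumed threshold, not a measured requirement of meld's builds, and `has_free_space` is a hypothetical helper name.

```python
import shutil


def has_free_space(path: str = ".", min_free_gib: float = 20.0) -> bool:
    """Return True when the filesystem holding `path` has at least
    `min_free_gib` GiB free. The default threshold is an assumption."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= min_free_gib * 1024**3


# A threshold of 0 always passes; shown only to demonstrate the call shape.
print(has_free_space(".", min_free_gib=0.0))
```

Failing on this check would not fix the runner, but it would label the failure as infrastructure up front instead of surfacing as a build error partway through the job.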

### 3. Per-runner config drift (sanitizer + musl)

PR #135 fuzz failure on runner-5:

```
error: sanitizer is incompatible with statically linked libc,
disable it using `-C target-feature=-crt-static`
```

The same workflow with the same toolchain install step (`dtolnay/rust-toolchain@nightly` with `targets: x86_64-unknown-linux-musl`) succeeds on `-7` and fails on `-5`/`-6`. This suggests that host-level `.cargo/config.toml` or rustup component state differs between runners.
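If the drift hypothesis holds, it should show up when the same files are fingerprinted on each runner and the results compared. The sketch below assumes that hashing a handful of candidate files is enough to expose the difference; which files to include is a guess, and both helper names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, Tuple

MISSING = "<missing>"


def fingerprint(paths: Iterable[Path]) -> Dict[str, str]:
    """Map each path to the SHA-256 of its contents, or MISSING if absent."""
    result = {}
    for p in paths:
        if p.is_file():
            result[str(p)] = hashlib.sha256(p.read_bytes()).hexdigest()
        else:
            result[str(p)] = MISSING
    return result


def drift(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return the entries whose values differ between two fingerprints."""
    out = {}
    for key in sorted(set(a) | set(b)):
        left, right = a.get(key, MISSING), b.get(key, MISSING)
        if left != right:
            out[key] = (left, right)
    return out


# Illustrative fingerprints, not data collected from the real runners.
on_runner_5 = {"~/.cargo/config.toml": "aaa"}
on_runner_7 = {"~/.cargo/config.toml": MISSING}
print(drift(on_runner_5, on_runner_7))
```

Running `fingerprint` as a diagnostic job pinned to each runner and diffing the outputs would either confirm the host-level config theory or rule it out.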

### 4. Cross-org capacity contention

While meld jobs queued, other org repos held rust-cpu slots:

- `spar/CI` run on `release/v0.9.2` ran 11.5 h before completing (start 09:29Z, presumed hung)
- `loom/CI` and `loom/Validate Shared Architecture` both ran 12 h+ before either failing or being cancelled

I cancelled the loom jobs to free slots for meld's release; this was a one-off intervention, not a sustainable answer.
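A less manual option would be to flag runs that have held a slot for too long. The following is a sketch, assuming run records with `status` and a UTC `run_started_at` timestamp in the form GitHub's workflow-run objects use; the 6 h cutoff and the function name are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Mapping


def stale_runs(runs: Iterable[Mapping], now: datetime,
               max_age: timedelta = timedelta(hours=6)) -> List[Mapping]:
    """Return in-progress runs that started more than `max_age` before `now`."""
    stale = []
    for run in runs:
        if run.get("status") != "in_progress":
            continue
        started = datetime.strptime(
            run["run_started_at"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - started > max_age:
            stale.append(run)
    return stale


# Illustrative records only; repository names and times are made up.
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
sample_runs = [
    {"name": "example/CI", "status": "in_progress",
     "run_started_at": "2026-01-01T00:00:00Z"},
    {"name": "example/Lint", "status": "in_progress",
     "run_started_at": "2026-01-01T11:30:00Z"},
    {"name": "example/Docs", "status": "completed",
     "run_started_at": "2026-01-01T00:00:00Z"},
]
print([r["name"] for r in stale_runs(sample_runs, now)])
```

Whether flagged runs are cancelled automatically or only reported is a policy decision for the org; the sketch only identifies candidates.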

## Acceptance bar for meld (proposed)

Smithy is "stable enough" to be meld's only CI path when **all** of:

- [ ] Three consecutive meld PRs merge without any `--admin` bypass.
- [ ] No PR sits in the `queued` state for more than 30 min without a runner being assigned.
- [ ] Disk-space (`os error 28`) failures stop appearing in retry-then-pass cycles for at least one full release cycle (~7–10 PRs).
- [ ] Sanitizer + musl fuzz builds succeed on every rust-cpu runner, not just `-7`.
- [ ] No whole-fleet offline event in a calendar week.
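The checklist above can also be written as a mechanical check, which may help keep the "all of" condition unambiguous. This is a sketch only: `CycleStats` and its field names are hypothetical, and how the numbers would be collected is left open.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CycleStats:
    """Hypothetical observations; field names are illustrative."""
    consecutive_merges_without_admin: int
    max_unassigned_queue_minutes: float
    disk_space_failures_in_cycle: int
    runners_passing_sanitizer_musl: int
    rust_cpu_runner_count: int
    fleet_offline_events_this_week: int


def acceptance_bar(s: CycleStats) -> Dict[str, bool]:
    """Evaluate each checklist item; the bar is met only if all are True."""
    return {
        "three_merges_without_admin": s.consecutive_merges_without_admin >= 3,
        "queue_within_30_min": s.max_unassigned_queue_minutes <= 30,
        "no_disk_space_failures": s.disk_space_failures_in_cycle == 0,
        "sanitizer_musl_everywhere":
            s.runners_passing_sanitizer_musl == s.rust_cpu_runner_count,
        "no_fleet_offline_events": s.fleet_offline_events_this_week == 0,
    }


# Illustrative numbers, not a record of any real cycle.
example = CycleStats(3, 12.0, 0, 3, 3, 0)
result = acceptance_bar(example)
print(all(result.values()), result)
```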

Until then, releases are explicitly authorized to merge with `--admin` for known-infra failures, documented in the release PR body.
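To keep "known-infra" from becoming a judgment call, failing job logs could be matched against the two error strings recorded in this issue. The sketch below assumes plain substring matching on the log text is sufficient; the labels and the function name are hypothetical.

```python
from typing import Optional

# Signatures copied from the failures documented in this issue.
KNOWN_INFRA_SIGNATURES = {
    "disk-space": "No space left on device (os error 28)",
    "sanitizer-musl": "sanitizer is incompatible with statically linked libc",
}


def classify_infra_failure(log_text: str) -> Optional[str]:
    """Return the label of the first known infra signature found, else None."""
    for label, needle in KNOWN_INFRA_SIGNATURES.items():
        if needle in log_text:
            return label
    return None


log = "error: failed to build archive: No space left on device (os error 28)"
print(classify_infra_failure(log))
```

A `None` result would mean the failure is not on the known list and should be treated as a real test or build failure until someone triages it.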

## What this issue is for

- A single place to drop "I saw this on smithy today" reports from any meld PR.
- A trail of evidence for `pulseengine/smithy` agents/issues to consume.
- The release-side definition of "ready to lift the admin-bypass policy."

Pin / keep open. Update the checklist above as conditions are met.
