Rolling tracker for meld's experience of the smithy CI fleet. Companion issue to whatever lands in `pulseengine/smithy`; this issue captures meld-specific observations, failed PRs, and the acceptance bar for releases.
ubuntu-latest is not a fallback option per project policy. Smithy stability is a release-path dependency.
Observed failure modes (from v0.7.0 release cycle, 2026-05-03 → 2026-05-11)
1. Whole-fleet offline events
- 2026-05-03 ~19:00Z: all 8 `pulseengine-ci-01-*` runners reported `status: offline` simultaneously. Recovered ~3 h later (status: online, busy: true).
- 2026-05-10 ~17:00Z–17:30Z: full offline window again, recovered within ~20 min.
Pattern: when the host(s) go down, every queued job waits indefinitely, because jobs queue against runner labels, not against fleet health. meld saw a 2 h 11 min queue on PR #135 with zero pickup before the fleet came back.
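A minimal sketch of how a whole-fleet outage could be told apart from ordinary queueing. It assumes runner records shaped like the `runners` array from GitHub's "list self-hosted runners for an organization" REST endpoint (`name`, `status`, `busy`). The function name `fleet_state` and the sample data are hypothetical, not part of smithy.

```python
def fleet_state(runners, name_prefix="pulseengine-ci-01-"):
    """Classify the fleet as 'online', 'degraded', 'offline', or 'empty'.

    `runners` is an iterable of dicts with at least `name` and `status`
    keys (assumed shape; see the lead-in).
    """
    fleet = [r for r in runners if r["name"].startswith(name_prefix)]
    if not fleet:
        return "empty"
    online = sum(1 for r in fleet if r["status"] == "online")
    if online == 0:
        # Nothing can pick up queued jobs: this is the failure mode above.
        return "offline"
    return "online" if online == len(fleet) else "degraded"


# Hypothetical sample data mirroring the 2026-05-03 observation.
all_down = [
    {"name": f"pulseengine-ci-01-{i}", "status": "offline", "busy": False}
    for i in range(8)
]
one_up = all_down[:7] + [
    {"name": "pulseengine-ci-01-7", "status": "online", "busy": True}
]
```

A scheduled check that alerts on `"offline"` would surface the outage in minutes instead of after a multi-hour queue.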
2. Disk-space failures on rust-cpu runners
Runner pool currently 3× rust-cpu (`pulseengine-ci-01-{5,6,7}`). Two of three show the same pattern:
| Runner | Symptom |
| --- | --- |
| `-5` | Jobs fail in 30–70 s with `error: failed to build archive: No space left on device (os error 28)` |
| `-6` | Same as `-5`; also went offline mid-day twice during the v0.7.0 cycle |
| `-7` | Fuzz + bench succeed here |
Concrete failures: PR #134 (Bench compile), PR #135 (Bench, Coverage, fuzz_resolver_terminates), PR #137 (Bench, Coverage, Test, fuzz_parse_component, fuzz_fusion_roundtrip). Each landed via `gh run rerun --failed` until `-7` was assigned, except PR #137 which had to merge via `--admin` because the rerun cycle never converged.
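A pre-flight check along these lines could turn the 30–70 s `os error 28` failures into an immediate, clearly labelled infra failure. This is a sketch only: the helper name `has_free_space` and the 20 GiB default are assumptions, not a measured requirement for meld builds.

```python
import shutil


def has_free_space(path=".", min_free_gib=20.0):
    """Return (ok, free_gib) for the filesystem that holds `path`.

    `min_free_gib` is a hypothetical threshold; tune it to the real
    size of the build and fuzz artifacts.
    """
    free_gib = shutil.disk_usage(path).free / 2**30
    return free_gib >= min_free_gib, free_gib
```

Called as the first step of a job, a `False` result could fail the run with a message naming the runner, so the report lands here as a disk-space incident and not as a Bench or Coverage failure.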
3. Per-runner config drift (sanitizer + musl)
PR #135 fuzz failure on runner `-5`:
```
error: sanitizer is incompatible with statically linked libc,
disable it using `-C target-feature=-crt-static`
```
Same workflow, same toolchain install step (`dtolnay/rust-toolchain@nightly` with `targets: x86_64-unknown-linux-musl`), succeeds on `-7`, fails on `-5/-6`. Suggests host-level `.cargo/config.toml` or rustup component state differs between runners.
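To test the drift hypothesis, each runner could emit a small fingerprint of its Rust environment for comparison. The sketch below is an assumption about what is worth capturing (`rustc -vV`, installed rustup components, `RUSTFLAGS`, and the `config.toml` under `CARGO_HOME`); `collect_fingerprint` and `drift` are hypothetical helper names.

```python
import os
import subprocess
from pathlib import Path


def _run(cmd):
    """Run a command and return its stdout, or a marker if it is unavailable."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return "<unavailable>"
    return out.stdout.strip()


def collect_fingerprint():
    """Snapshot host-level state that could change how cargo builds."""
    cargo_home = Path(os.environ.get("CARGO_HOME", Path.home() / ".cargo"))
    config = cargo_home / "config.toml"
    return {
        "rustc -vV": _run(["rustc", "-vV"]),
        "rustup components": _run(["rustup", "component", "list", "--installed"]),
        "RUSTFLAGS": os.environ.get("RUSTFLAGS", ""),
        "cargo config": config.read_text() if config.is_file() else "<absent>",
    }


def drift(a, b):
    """Return {key: (value_a, value_b)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Running `collect_fingerprint()` on `-5` and `-7` and passing both results to `drift` would show directly whether a host-level flag such as a forced `crt-static` is present on one runner only.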
4. Cross-org capacity contention
While meld jobs queued, other org repos held rust-cpu slots:
- `spar/CI` run on `release/v0.9.2` ran 11.5 h before completing (start 09:29Z, presumed hung)
- `loom/CI` and `loom/Validate Shared Architecture` both ran 12 h+ before either failing or being cancelled
I cancelled the loom jobs to free slots for meld's release; this was a one-off intervention, not a sustainable answer.
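A sketch of how hung runs could be flagged automatically, replacing manual cancellation. It assumes run records shaped like GitHub's workflow-run objects (`name`, `status`, `run_started_at` as an ISO 8601 UTC string). The 6 h cutoff, the function name `stale_runs`, and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_runs(runs, now, max_age=timedelta(hours=6)):
    """Return names of in-progress runs that started more than `max_age` ago."""
    stale = []
    for run in runs:
        if run["status"] != "in_progress":
            continue
        # Normalize a trailing "Z" so fromisoformat accepts it on older Pythons.
        started = datetime.fromisoformat(run["run_started_at"].replace("Z", "+00:00"))
        if now - started > max_age:
            stale.append(run["name"])
    return stale


# Hypothetical sample: one long-running job, one fresh job, one finished job.
sample_runs = [
    {"name": "spar/CI", "status": "in_progress", "run_started_at": "2026-05-10T09:29:00Z"},
    {"name": "meld/CI", "status": "in_progress", "run_started_at": "2026-05-10T20:30:00Z"},
    {"name": "loom/CI", "status": "completed", "run_started_at": "2026-05-10T01:00:00Z"},
]
sample_now = datetime(2026, 5, 10, 21, 0, tzinfo=timezone.utc)
```

The output is a candidate list for review or cancellation; whether cancellation should be automatic is a policy question for `pulseengine/smithy`, not something this sketch decides.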
Acceptance bar for meld (proposed)
Smithy is "stable enough" to be meld's only CI path when all of:
Until then, releases are explicitly authorized to merge with `--admin` for known-infra failures, documented in the release PR body.
What this issue is for
- A single place to drop "I saw this on smithy today" reports from any meld PR.
- A trail of evidence for `pulseengine/smithy` agents/issues to consume.
- The release-side definition of "ready to lift the admin-bypass policy."
Pin / keep open. Update the checklist above as conditions are met.