Smithy fleet reliability — rolling tracker (meld side) #139

@avrabe

Rolling tracker for meld's experience of the smithy CI fleet. Companion issue to whatever lands in `pulseengine/smithy`; this issue captures meld-specific observations, failed PRs, and the acceptance bar for releases.

Per project policy, `ubuntu-latest` is not a fallback option, so smithy stability is a release-path dependency.

Observed failure modes (from v0.7.0 release cycle, 2026-05-03 → 2026-05-11)

1. Whole-fleet offline events

  • 2026-05-03 ~19:00Z: all 8 `pulseengine-ci-01-*` runners reported `status: offline` simultaneously. Recovered ~3 h later (status: online, busy: true).
  • 2026-05-10 ~17:00Z–17:30Z: full offline window again, recovered within ~20 min.

Pattern: when the host(s) go down, every queued job waits indefinitely until the fleet returns, because jobs queue against runner labels, not against fleet health. meld saw a 2 h 11 min queue on PR #135 with zero pickup before the fleet came back.
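A whole-fleet outage of this kind could be detected mechanically rather than by noticing a stalled queue. The following is a minimal sketch, assuming runner records shaped like the `runners` array that GitHub's self-hosted runner listing returns (`name`, `status`, `busy`); the function name `fleet_offline` and the default prefix are illustrative, not part of smithy.

```python
from typing import Iterable, Mapping


def fleet_offline(runners: Iterable[Mapping], prefix: str = "pulseengine-ci-01-") -> bool:
    """Return True when no runner whose name starts with `prefix` is online.

    `runners` is assumed to be a list of dicts with at least `name` and
    `status` keys, where `status` is "online" or "offline".
    """
    fleet = [r for r in runners if str(r.get("name", "")).startswith(prefix)]
    # An empty fleet counts as offline: nothing can pick up a queued job.
    return all(r.get("status") != "online" for r in fleet)
```

A scheduled check built on this could post to this issue whenever the result is `True`, so that queued PRs are explained by a recorded outage instead of by guesswork.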

2. Disk-space failures on rust-cpu runners

Runner pool currently 3× rust-cpu (`pulseengine-ci-01-{5,6,7}`). Two of three show the same pattern:

| Runner | Symptom |
|--------|---------|
| `-5` | Jobs fail in 30–70 s with `error: failed to build archive: No space left on device (os error 28)` |
| `-6` | Same as `-5`; also went offline mid-day twice during the v0.7.0 cycle |
| `-7` | Fuzz + bench succeed here |

Concrete failures: PR #134 (Bench compile), PR #135 (Bench, Coverage, fuzz_resolver_terminates), PR #137 (Bench, Coverage, Test, fuzz_parse_component, fuzz_fusion_roundtrip). Each landed via `gh run rerun --failed` until `-7` was assigned, except PR #137 which had to merge via `--admin` because the rerun cycle never converged.
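Because these failures surface 30–70 s into a build, a preflight step that checks free space would turn them into an immediate, clearly labelled infrastructure failure. This is a sketch only: the helper name `has_free_space` and the 20 GiB default are assumptions for illustration, not measured requirements of meld's builds.

```python
import shutil


def has_free_space(path: str = ".", min_free_bytes: int = 20 * 1024**3) -> bool:
    """Return True if the filesystem holding `path` has at least `min_free_bytes` free.

    The 20 GiB default is a placeholder; the real threshold would have to be
    derived from the size of a full target directory on the runner.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= min_free_bytes
```

Running this against the workspace directory as the first job step, and failing with a distinct message when it returns `False`, would make disk exhaustion distinguishable from genuine build errors in the rerun history.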

3. Per-runner config drift (sanitizer + musl)

PR #135 fuzz failure on runner `-5`:

```
error: sanitizer is incompatible with statically linked libc,
disable it using `-C target-feature=-crt-static`
```

The same workflow with the same toolchain install step (`dtolnay/rust-toolchain@nightly` with `targets: x86_64-unknown-linux-musl`) succeeds on `-7` and fails on `-5`/`-6`. This suggests that a host-level `.cargo/config.toml` or the rustup component state differs between runners.
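One way to confirm or rule out drift is to capture the same set of facts on each runner and compare them. The sketch below assumes the facts have already been collected into plain strings per runner (for example the contents of the host-level `.cargo/config.toml` and the output of `rustup show`); the function name `find_drift` and the key names are hypothetical.

```python
from typing import Dict, Optional


def find_drift(snapshots: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, Optional[str]]]:
    """Return the facts whose values are not identical across all runners.

    `snapshots` maps a runner name to a dict of fact name -> captured text.
    A fact missing on a runner is reported as None for that runner.
    """
    keys = sorted({key for facts in snapshots.values() for key in facts})
    drift = {}
    for key in keys:
        values = {runner: facts.get(key) for runner, facts in snapshots.items()}
        if len(set(values.values())) > 1:  # more than one distinct value => drift
            drift[key] = values
    return drift
```

An empty result across `-5`, `-6`, and `-7` would point away from host configuration and toward something else, such as leftover state in the workspace.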

4. Cross-org capacity contention

While meld jobs queued, other org repos held rust-cpu slots:

  • `spar/CI` run on `release/v0.9.2` ran 11.5 h before completing (start 09:29Z, presumed hung)
  • `loom/CI` and `loom/Validate Shared Architecture` both ran 12 h+ before either failing or being cancelled

I cancelled the loom jobs to free slots for meld's release; this was a one-off intervention, not a sustainable answer.
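Spotting runs like these before they hold a slot for half a day could be automated. The following is a rough sketch, assuming run records that carry a `status` field and an ISO 8601 `run_started_at` timestamp, as GitHub's workflow run objects do; the function name `overlong_runs` and the 6 h limit are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Mapping


def overlong_runs(
    runs: Iterable[Mapping],
    now: datetime,
    limit: timedelta = timedelta(hours=6),  # hypothetical threshold
) -> List[Mapping]:
    """Return in-progress runs that started more than `limit` before `now`.

    `now` must be timezone-aware so it can be compared with the parsed,
    UTC-stamped start times.
    """
    flagged = []
    for run in runs:
        if run["status"] != "in_progress":
            continue
        # Normalise a trailing "Z" so fromisoformat accepts it on older Pythons.
        started = datetime.fromisoformat(run["run_started_at"].replace("Z", "+00:00"))
        if now - started > limit:
            flagged.append(run)
    return flagged
```

The output is a candidate list for a human to review, not an automatic cancel list, since a long run is only presumed hung.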

Acceptance bar for meld (proposed)

Smithy is "stable enough" to be meld's only CI path when all of:

  • Three consecutive meld PRs merge without any `--admin` bypass.
  • No PR sits in the `queued` state for more than 30 min without a runner being assigned.
  • Disk-space (`os error 28`) failures stop appearing in retry-then-pass cycles for at least one full release cycle (~7–10 PRs).
  • Sanitizer + musl fuzz builds succeed on every rust-cpu runner, not just `-7`.
  • No whole-fleet offline event in a calendar week.

Until then, releases are explicitly authorized to merge with `--admin` for known-infra failures, documented in the release PR body.
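The acceptance bar above can be written down as an explicit check so that "stable enough" is evaluated the same way each time. This is a sketch under stated assumptions: the `FleetWindow` fields and the `unmet_criteria` function are hypothetical names, and `release_cycle_prs=7` takes the lower end of the "~7–10 PRs" estimate.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class FleetWindow:
    consecutive_clean_merges: int        # meld PRs merged in a row without --admin
    max_queue_minutes: float             # longest wait in `queued` before assignment
    prs_since_last_enospc: int           # PRs since the last `os error 28` failure
    rust_cpu_runners: Set[str]           # every runner in the rust-cpu pool
    sanitizer_ok_runners: Set[str]       # runners where sanitizer + musl fuzz built
    fleet_offline_events_this_week: int  # whole-fleet outages in the calendar week


def unmet_criteria(w: FleetWindow, release_cycle_prs: int = 7) -> List[str]:
    """Return the acceptance criteria that are not yet satisfied (empty = bar met)."""
    unmet = []
    if w.consecutive_clean_merges < 3:
        unmet.append("three consecutive merges without --admin")
    if w.max_queue_minutes > 30:
        unmet.append("no PR queued more than 30 min without a runner")
    if w.prs_since_last_enospc < release_cycle_prs:
        unmet.append("no disk-space failures for a full release cycle")
    if not w.rust_cpu_runners <= w.sanitizer_ok_runners:
        unmet.append("sanitizer + musl fuzz builds succeed on every rust-cpu runner")
    if w.fleet_offline_events_this_week > 0:
        unmet.append("no whole-fleet offline event in a calendar week")
    return unmet
```

An empty list corresponds to lifting the admin-bypass policy; any non-empty list names exactly which conditions still block it.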

What this issue is for

  • A single place to drop "I saw this on smithy today" reports from any meld PR.
  • A trail of evidence for `pulseengine/smithy` agents/issues to consume.
  • The release-side definition of "ready to lift the admin-bypass policy."

Pin / keep open. Update the checklist above as conditions are met.
