ci: free disk space before embeddings-all builds to stop intermittent "No space left on device" #775

@mosuka

Summary

The ubuntu-latest x86_64 CI jobs that build --features embeddings-all (which pulls in candle-core / candle-transformers / tokenizers / hf-hub / image / reqwest) intermittently die with:

System.IO.IOException: No space left on device : '/home/runner/actions-runner/cached/.../Worker_...log'

The "Run test" step never completes (no failed step is recorded; the job log truncates right after the cache-restore step). The cause is only visible via the check-run annotations:

gh api repos/mosuka/laurus/check-runs/<job_id>/annotations

Observed on PRs #772 and #774: the Test laurus (ubuntu-latest, x86_64-unknown-linux-gnu, stable) job failed at ~4m25s twice in a row before a later re-run happened to land on a runner with enough free space. The ubuntu-24.04-arm job is unaffected because it sets skip_embedding_features: true; Windows is unaffected because its runners have a different disk layout.

This is an environment/disk-margin problem, not a code or test problem. The restored Rust build cache (~790 MB compressed, much larger expanded) plus the embeddings-all build artifacts push the runner's root disk over its limit. The failure is intermittent because the margin is razor-thin: whether a run succeeds depends on how much free space the assigned runner happens to have.

Root cause

GitHub-hosted ubuntu-latest runners ship with several large preinstalled toolchains the build does not use (.NET, Android SDK, GHC/Haskell, CodeQL bundle, cached Docker images). None of the workflows free any of it, so the heavy embeddings-all compile occasionally runs out of space.

Proposed fix

Add a reusable composite action, .github/actions/free-disk-space, that removes the unused preinstalled toolchains (reclaiming ~25-30 GB) and prints the output of df -h / before and after the cleanup. Then reference it, gated with if: runner.os == 'Linux', right after checkout in the heavy embeddings-all jobs:

  • regression.yml (PR-blocking) — clippy, test-laurus, test-server, test-mcp, test-cli, test-python, test-nodejs, test-ruby, test-php
  • periodic.yml and release.yml — the equivalent clippy / test-* jobs

A composite action keeps the removal list in one place instead of pasting it into ~25 jobs. Because pull_request runs use the PR's own workflow definitions, the change self-validates on its own CI run.
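
For illustration, referencing the action from one of the listed jobs might look like the fragment below. This is a sketch, not the actual workflow: the runs-on value and the actions/checkout@v4 pin are assumptions, and the real jobs likely use a matrix and additional steps that are omitted here.

```yaml
# Hypothetical excerpt of regression.yml; only the free-disk-space step is new.
jobs:
  test-laurus:
    runs-on: ubuntu-latest          # assumption: the real job may use a matrix
    steps:
      - uses: actions/checkout@v4   # assumption: version pin chosen for illustration

      # A local action is resolved from the checked-out repository,
      # so this step has to come after checkout.
      - name: Free disk space
        if: runner.os == 'Linux'
        uses: ./.github/actions/free-disk-space

      # ... existing cache-restore and "Run test" steps follow unchanged
```

Keeping the if: condition on the step rather than inside the action means the same job definition can still run unchanged on Windows and macOS matrix entries.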

The removal targets only safe, unused paths (/usr/share/dotnet, /usr/local/lib/android, /opt/ghc, /usr/local/.ghcup, /opt/hostedtoolcache/CodeQL, dangling Docker images) — it does NOT touch the Rust / Python / Node toolchains the jobs rely on.
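
A minimal sketch of what the composite action could contain, assuming the path list above is complete and that the runner user has passwordless sudo (as on GitHub-hosted runners). The step names and description text are made up for illustration; the final action.yml may differ.

```yaml
# .github/actions/free-disk-space/action.yml (illustrative sketch)
name: Free disk space
description: Remove preinstalled toolchains the build does not use (Linux runners only).
runs:
  using: composite
  steps:
    - name: Show free space before cleanup
      shell: bash
      run: df -h /

    - name: Remove unused preinstalled toolchains
      shell: bash
      run: |
        # Paths taken from the removal list in this issue.
        # rm -rf succeeds even when a path does not exist on the image.
        sudo rm -rf \
          /usr/share/dotnet \
          /usr/local/lib/android \
          /opt/ghc \
          /usr/local/.ghcup \
          /opt/hostedtoolcache/CodeQL
        # Remove dangling Docker images only; do not fail the job
        # if the Docker daemon is unavailable.
        docker image prune --force || true

    - name: Show free space after cleanup
      shell: bash
      run: df -h /
```

Each run step in a composite action must declare its shell explicitly, which is why shell: bash is repeated. The two df -h / steps bracket the cleanup so the reclaimed space can be read directly from the job log.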

Acceptance criteria

  • .github/actions/free-disk-space/action.yml composite action added.
  • Referenced (Linux-gated, after checkout) in the heavy embeddings-all jobs of regression.yml, periodic.yml, release.yml.
  • The action prints free space before/after so future disk regressions are diagnosable from the log.
  • CI is green on the fix PR (validates the action runs and the builds still pass).
  • actionlint (if available) reports no errors on the edited workflows.

Out of scope (possible follow-up)

  • release.yml build-* / publish-* artifact jobs (release-gated, not currently failing).
