chore(ci): trigger GPU integration tests on ephemeral runners #286
Closed
beveradb wants to merge 1 commit into
Verifies the n1+T4 ephemeral runner path after the 2026-05-17 dispatcher e2 fix (karaoke-gen#776). The GPU family was not affected by that bug (n1 supports onHostMaintenance=TERMINATE, required for attached GPUs), but has not been exercised under the ephemeral dispatcher since cutover. Trivial comment in `audio_separator/separator/__init__.py` flips the `audio_separator/**` path filter so the 3-job integration test suite (ensemble-presets, core-models, stems-and-quality) runs on freshly created GPU VMs. Safe to remove on next edit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
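To illustrate why a one-line comment under `audio_separator/` is enough to flip the path filter, here is a minimal Python sketch of glob-style matching. It only approximates the matching `dorny/paths-filter` performs and is not its implementation; `filter_matches` is a hypothetical helper, and it relies on `fnmatch`'s `*` also matching `/`.

```python
from fnmatch import fnmatchcase


def filter_matches(changed_files, patterns):
    """Return True if any changed file matches any glob pattern.

    Approximation only: fnmatch's `*` also matches `/`, so a pattern
    such as `dir/**` behaves like "anything under dir/" here.
    """
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )


# The trivial comment touches this file, so the filter output flips to true.
changed = ["audio_separator/separator/__init__.py"]
print(filter_matches(changed, ["audio_separator/**"]))

# A change outside the filtered tree would not trigger the GPU jobs.
print(filter_matches(["README.md"], ["audio_separator/**"]))
```

Under these assumptions, any edit inside the filtered tree, however trivial, is sufficient to schedule the three integration jobs.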
Closing — used purely as a verification trigger for the ephemeral GPU runner path after nomadkaraoke/karaoke-gen#776. The rerun surfaced a known issue: the NVIDIA driver kernel module fails to load on fresh ephemeral GPU VMs ( |
beveradb added a commit to nomadkaraoke/karaoke-gen that referenced this pull request on May 18, 2026
…meral cutover) (#776)

## Summary

Fixes three cutover bugs that left the ephemeral runner dispatcher silently broken since 2026-05-17T04:56Z:

1. **`on_host_maintenance="TERMINATE"` set unconditionally**: e2 machine types reject TERMINATE unless preemptible, so 100% of general/build VM creates failed. Fix: only set TERMINATE on GPU (n1) families where it's required because attached GPUs can't live-migrate. 3 regression tests added in `test_ephemeral.py`.
2. **GPU `disk_size_gb=150` but image is 200GB**: GCE rejects boot disks smaller than the source image. Fix: raise GPU disk to 200GB.
3. **Runner user has no passwordless sudo**: workflow steps like `sudo apt-get install -y google-cloud-cli-firestore-emulator` (backend-emulator-tests) and `sudo apt-get install -y ffmpeg` (package-*, GPU integration tests) failed with "a terminal is required to read the password". The legacy GHA runner VMs had NOPASSWD sudo. Fix: add `/etc/sudoers.d/runner` in the image provision script.

Fixes 1 + 2 are dispatcher code (already applied via `pulumi up --target ...:runner-manager-source ...:runner-manager-function` from this branch). Fix 3 requires a new image build (triggered: build-runner-images.yml on this branch).

## Why the trivial file touches

`backend/main.py` and `karaoke_gen/__init__.py` carry one-line comments to flip the `backend` and `package` `dorny/paths-filter` outputs so this PR exercises **all five PR-triggered self-hosted jobs** end-to-end on ephemeral runners.

Verification PR (paired): nomadkaraoke/python-audio-separator#286.

## Test plan

- [x] `pytest test_ephemeral.py`: 26 passed locally (3 new scheduling tests)
- [x] Pulumi applied locally (both iterations, function update verified)
- [x] First ephemeral general VM create succeeded (4 RUNNING; previously 100% failure)
- [ ] Image rebuild completes for general+build+gpu with new sudo provision
- [ ] PR CI passes with all 5 self-hosted jobs running on the new image
- [ ] python-audio-separator#286 GPU integration tests pass on the new GPU image
- [ ] After merge: `deploy-backend` succeeds on ephemeral build runner; prod /api/health/detailed reports new version
- [ ] Phase 4 decommission of 7 legacy VMs+disks (separate PR, ~$220/mo saving)

## Context

- `karaoke-gen/docs/EPHEMERAL-GHA-RUNNERS.md`
- `docs/archive/2026-05-16-ephemeral-gha-runners-plan.md` (workspace)
- Memory: `project_cost_sprint_may2026.md`

@coderabbitai ignore

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
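The first two fixes described above can be sketched roughly as follows. This is an illustrative sketch, not the dispatcher's actual code: `VmSpec`, `scheduling_for`, and `validate_boot_disk` are hypothetical names, and the plain dict stands in for whatever scheduling object the real dispatcher passes to GCE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSpec:
    """Hypothetical description of an ephemeral runner VM."""
    machine_family: str  # e.g. "e2" (general/build) or "n1" (GPU)
    has_gpu: bool
    disk_size_gb: int
    image_size_gb: int


def scheduling_for(spec: VmSpec) -> dict:
    """Return scheduling options, setting TERMINATE only for GPU VMs.

    Non-GPU families get no explicit on_host_maintenance value, so the
    platform default applies instead of a setting they may reject.
    """
    scheduling = {}
    if spec.has_gpu:
        # Attached GPUs can't live-migrate, so TERMINATE is required here.
        scheduling["on_host_maintenance"] = "TERMINATE"
    return scheduling


def validate_boot_disk(spec: VmSpec) -> None:
    """Fail early if the boot disk is smaller than the source image."""
    if spec.disk_size_gb < spec.image_size_gb:
        raise ValueError(
            f"boot disk {spec.disk_size_gb}GB is smaller than "
            f"image {spec.image_size_gb}GB"
        )


general = VmSpec("e2", has_gpu=False, disk_size_gb=100, image_size_gb=100)
gpu = VmSpec("n1", has_gpu=True, disk_size_gb=200, image_size_gb=200)

print(scheduling_for(general))
print(scheduling_for(gpu))
validate_boot_disk(gpu)  # passes once the GPU disk is raised to 200GB
```

Checking the disk size before the create call is a design choice in this sketch: it turns a rejected VM create into a clear local error. The 100GB figures for the general VM are placeholders, not values from the PR.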
Summary

- `onHostMaintenance=TERMINATE` (required because attached GPUs can't live-migrate), but it hasn't been exercised under the ephemeral dispatcher since cutover (2026-05-17T04:56Z), so this is the verification PR.
- `audio_separator/separator/__init__.py` flips the `audio_separator/**` path filter to make `dorny/paths-filter` run all three integration test jobs.

Test plan

- `ensemble-presets`, `core-models`, `stems-and-quality` all dispatch on freshly-created ephemeral n1+T4 VMs
- `/opt/audio-separator-models`

@coderabbitai ignore
🤖 Generated with Claude Code