Skip to content

Handle recovered image unpack panics#175

Merged
rgarcia merged 5 commits intomainfrom
fix/recover-interrupted-build-panic-pr
Mar 28, 2026
Merged

Handle recovered image unpack panics#175
rgarcia merged 5 commits intomainfrom
fix/recover-interrupted-build-panic-pr

Conversation

@rgarcia
Copy link
Copy Markdown
Contributor

@rgarcia rgarcia commented Mar 28, 2026

Summary

  • validate OCI configs before unpacking so malformed recovered image builds fail cleanly instead of crashing hypeman startup
  • add a regression test that replays the captured hypeman-test recovery fixture through unpackLayers() and RecoverInterruptedBuilds()
  • route 20-second integration waits through the existing CI-aware timeout helper so Linux runners get the intended startup slack

Smoking Gun From The Dev Host

  • On dev-yul-hypeman-1, recovered hypeman-test:* builds repeatedly crashed hypeman with:
    • panic: runtime error: index out of range [0] with length 0
    • github.com/opencontainers/umoci/oci/layer.UnpackRootfs ... unpack.go:259
    • github.com/kernel/hypeman/lib/images.(*ociClient).unpackLayers ... oci.go
    • github.com/kernel/hypeman/lib/images.(*manager).RecoverInterruptedBuilds.func2 ... manager.go
  • Example host log excerpt from the captured bundle:
    • Mar 28 18:49:07 ... panic: runtime error: index out of range [0] with length 0
    • ... github.com/opencontainers/umoci/oci/layer.UnpackRootfs ... unpack.go:259
    • ... github.com/kernel/hypeman/lib/images.(*ociClient).unpackLayers ...
    • ... github.com/kernel/hypeman/lib/images.(*manager).RecoverInterruptedBuilds.func2 ...
  • The captured config blob for the failing image decodes to:
{"architecture":"amd64","os":"linux","config":{},"rootfs":{"type":"layers","diff_ids":[]}}
  • That same captured manifest has one layer, so the failure is reproducible locally from the checked-in fixture.

Investigation

  • In umoci v0.6.0, oci/layer/unpack.go:259 does layerDiffID := config.RootFS.DiffIDs[idx] with no bounds check.
  • For the captured fixture, config.RootFS.Type == "layers" but len(config.RootFS.DiffIDs) == 0 while len(manifest.Layers) == 1, so umoci panics instead of returning an error.
  • That panic does not look like an expected/intentional control-flow path; it is an unchecked malformed-config case in the library.
  • Because we can detect this specific malformed-image condition before calling umoci, this PR now validates the config/manifest relationship up front and returns a normal error.
  • With that validation in place, the captured regression no longer needs recover() wrappers in our code to handle this case.
  • unpackLayers() now passes the caller ctx into umoci instead of context.Background(). That change is intentional and orthogonal to the panic fix: it preserves cancellation/timeout propagation from the enclosing build or recovery flow. Using context.Background() would ignore caller cancellation and could let unpacking continue after the surrounding operation had already timed out, been canceled, or started shutting down.

Test plan

  • go test ./lib/images -run 'TestUnpackLayersCapturedFixtureReturnsErrorInsteadOfPanicking|TestRecoverInterruptedBuildsCapturedFixtureMarksBuildFailed'
  • go test ./lib/images
  • go test ./lib/instances -run 'TestWaitForState_SubscriptionDelivers|TestWaitForState_ChannelClosedOnDelete|TestWaitForState_TerminalViaSubscription|TestWaitForState_ShutdownIsTerminal|TestWaitForState_PausedIsTerminal|TestWaitForProcessExit_TimesOutForRunningProcess' -count=1

Made with Cursor


Note

Medium Risk
Moderate risk because it changes the image layer unpack path used during pulls and recovery; validation may cause previously-tolerated malformed images to fail earlier, but it’s limited to defensive checks and improved context propagation.

Overview
Prevents hypeman from crashing during startup recovery when unpacking malformed OCI images by validating the image config before calling umoci and returning a normal error instead of hitting an out-of-bounds panic; unpackLayers also now passes through the caller ctx to layer.UnpackRootfs.

Adds a regression test plus a captured on-disk OCI cache fixture to ensure unpackLayers() and RecoverInterruptedBuilds() fail the build cleanly for the malformed config case.

Adjusts multiple instance integration tests to route 20-second waitForInstanceState calls through the existing CI-aware integrationTestTimeout helper for more reliable timings on CI runners.

Written by Cursor Bugbot for commit 3124580. This will update automatically on new commits. Configure here.

rgarcia added 5 commits March 28, 2026 16:07
Convert OCI unpack panics into ordinary image build failures and add a regression test from the captured hypeman-test fixture so interrupted build recovery cannot take down hypeman startup.

Made-with: Cursor
Keep only the single manifest entry needed for the recovery regression so the fixture stays small while preserving the host failure case.

Made-with: Cursor
Route 20-second integration waits through the existing CI-aware timeout helper so Linux runners get the extra slack already intended for startup-heavy VM lifecycle tests.

Made-with: Cursor
Detect malformed image configs before calling umoci so recovered builds fail cleanly without relying on panic recovery for the captured diff_ids mismatch case.

Made-with: Cursor
Add a short comment explaining which malformed config conditions we reject before calling umoci.

Made-with: Cursor
@rgarcia rgarcia merged commit ddd8e85 into main Mar 28, 2026
6 checks passed
@rgarcia rgarcia deleted the fix/recover-interrupted-build-panic-pr branch March 28, 2026 21:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants