
fix: stackUpgradeHandler constructs OCI image ref from talosVersion #42

Merged
ontave merged 13 commits into main from session/25d-stack-upgrade-image-fix
May 17, 2026


@ontave ontave commented May 7, 2026

Summary

  • stackUpgradeHandler was passing the raw version string (e.g., "v1.12.7") directly to TalosClient.Upgrade as the image reference
  • Talos then tried to resolve docker.io/library/v1.12.7:latest, which does not exist
  • talosUpgradeHandler correctly builds ghcr.io/siderolabs/installer:<version>; this fix applies the same pattern to stackUpgradeHandler
  • Variable renamed from talosImage to talosVersion at the point of reading the UpgradePolicy, then talosImage is computed from it

Root cause

Both talosUpgradeHandler and stackUpgradeHandler read a target version string from the UpgradePolicy CR. talosUpgradeHandler always appended the version to the installer image base. stackUpgradeHandler read the value straight into a variable named talosImage and passed it on without constructing the full OCI reference. Because the variable name implied it already held an image reference, the bug hid in plain sight.
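
A minimal sketch of the pattern described above, for illustration only. The helper name `installerImageRef` and the empty-version check are assumptions and are not taken from the conductor codebase; only the image base and the example version come from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// installerImageBase is the installer repository named in this PR.
const installerImageBase = "ghcr.io/siderolabs/installer"

// installerImageRef builds a full OCI reference from a bare version string
// such as "v1.12.7". The helper name and the empty-version check are
// illustrative assumptions, not the handler's actual code.
func installerImageRef(talosVersion string) (string, error) {
	v := strings.TrimSpace(talosVersion)
	if v == "" {
		return "", fmt.Errorf("talosVersion is empty")
	}
	return installerImageBase + ":" + v, nil
}

func main() {
	talosImage, err := installerImageRef("v1.12.7")
	if err != nil {
		panic(err)
	}
	fmt.Println(talosImage)
}
```

Keeping the version and the image in separately named variables (talosVersion, then talosImage) makes it harder to pass the wrong one to the upgrade call.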

Test plan

  • Existing 3 stackUpgradeHandler unit tests pass (go test ./internal/capability/...)
  • Verified live on ccs-dev during session/25d: fixed Job completed status=Succeeded, TCOR records Succeeded, UpgradePolicy reached Ready=True

ontave added 13 commits May 7, 2026 09:01
talosImage was being set to the raw version string from the UpgradePolicy
(e.g., "v1.12.7") and passed directly to TalosClient.Upgrade, which then
tried to pull "docker.io/library/v1.12.7:latest". talosUpgradeHandler
correctly builds "ghcr.io/siderolabs/installer:<version>"; stack handler
now follows the same pattern.

Rename talosImage to talosVersion when reading from the UpgradePolicy,
then compute talosImage := "ghcr.io/siderolabs/installer:" + talosVersion.

Discovered during live ccs-dev stack upgrade (session/25d).

Stage=true left both Talos and kubelet changes sitting on disk indefinitely;
nodes required manual reboots to apply them. New behaviour mirrors
talosUpgradeHandler: per node, stage the kubelet image (staged mode so it
co-applies on the Talos reboot), then trigger Talos upgrade with stage=false
(immediate reboot), then wait for recovery before moving to the next node.

Drop talosconfig-path node enumeration in favour of TalosClient.Nodes()
(same source; cleaner and already tested via the stub). Require at least one
node (validation failure otherwise).

Tests: split TestStackUpgrade_RunsBothUpgradeSteps into two tests --
  TestStackUpgrade_NoNodesReturnsValidationFailure
  TestStackUpgrade_RollingUpgrade_AllNodes (verifies per-node loop,
  upgradeCallCount == node count, all ApplyConfiguration calls use staged mode)
…stry

ghcr.io is not accessible from lab nodes. The docker.io registry mirror
(docker.io → 10.20.0.1:5000) is the only configured mirror. Using
docker.io/ image references allows Talos to resolve installer and kubelet
images through the local registry mirror during node upgrades.

Affects talos-upgrade, kube-upgrade, and stack-upgrade capabilities.
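
As a rough illustration of switching an image reference to a registry that the mirror covers, the sketch below swaps only the registry host. The helper `withRegistry` is hypothetical, and the resulting repository path under docker.io is an assumption: this commit message does not state the exact docker.io/ paths used.

```go
package main

import (
	"fmt"
	"strings"
)

// withRegistry replaces the registry host (the part before the first "/")
// of an image reference. It assumes ref starts with a registry host; a ref
// without any "/" is simply prefixed with the registry.
// Hypothetical helper, not taken from the conductor codebase.
func withRegistry(ref, registry string) string {
	_, rest, found := strings.Cut(ref, "/")
	if !found {
		return registry + "/" + ref
	}
	return registry + "/" + rest
}

func main() {
	// The docker.io repository path produced here is an assumption.
	fmt.Println(withRegistry("ghcr.io/siderolabs/installer:v1.12.7", "docker.io"))
}
```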
All imports of github.com/ontai-dev/conductor/pkg/runnerlib updated to
github.com/ontai-dev/conductor-sdk/runnerlib across 37 files. Internal
pkg/runnerlib deleted. go.mod updated with replace directive pointing to
../conductor-sdk and require entry. go mod tidy completed. All unit tests
pass: go build ./... and go test ./test/unit/... green before deletion.
…patcher types

Update all GVR references, scheme registrations, and import paths in
conductor to consume the migrated dispatcher types from wrapper/api/seam:
PackDelivery (was InfrastructureClusterPack), PackExecution, PackInstalled
(was InfrastructurePackInstance), PackReceipt, PackLog (was PackOperationResult).

packDeliveryRef field replaces clusterPackRef in pack_receipt_drift_loop.go
and all associated tests. compileLaunchBundle now embeds wrapper CRDs via
wrappercrd.FS so agents receive the seam.ontai.dev CRD bundle at startup.

Updates all dynamic-client GVR references from infrastructure.ontai.dev/
infrastructuretalosclusters to seam.ontai.dev/talosclusters. Updates kind
strings from InfrastructureTalosCluster to TalosCluster. Updates pack
execution GVR to seam.ontai.dev/packexecutions. All tests updated to match.
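
The group and resource rename described above can be pictured as a simple lookup. This is a sketch only: `groupResource` is a local stand-in for the Group and Resource fields of a GVR, `migrate` is a hypothetical helper, and the API version is left out because it is not stated here.

```go
package main

import "fmt"

// groupResource is a local stand-in for the Group and Resource fields of a
// GroupVersionResource. The version is omitted because the source does not
// state it.
type groupResource struct {
	Group    string
	Resource string
}

// renames holds the one group/resource pair spelled out in the commit message.
var renames = map[groupResource]groupResource{
	{Group: "infrastructure.ontai.dev", Resource: "infrastructuretalosclusters"}: {
		Group: "seam.ontai.dev", Resource: "talosclusters",
	},
}

// migrate returns the post-migration group/resource, or the input unchanged
// when no rename is recorded. Hypothetical helper for illustration.
func migrate(gr groupResource) groupResource {
	if next, ok := renames[gr]; ok {
		return next
	}
	return gr
}

func main() {
	old := groupResource{Group: "infrastructure.ontai.dev", Resource: "infrastructuretalosclusters"}
	next := migrate(old)
	fmt.Printf("%s/%s\n", next.Group, next.Resource)
}
```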
Replace seam-core -> seam and wrapper -> dispatcher in go.mod
replace/require. Update all Go import paths accordingly. Add seam-sdk
replace + require. Update conductor RunnerConfigSpec references and
compile_launch.go/test assertions for post-MIGRATION-3.8 CRD names
(lineagerecords, runnerconfigs under seam.ontai.dev).
…tories

Replace ../seam-core with ../seam and ../wrapper with ../dispatcher
following the seam-core -> seam and wrapper -> dispatcher filesystem
renames. Module paths were already updated in Phase 4.
…onductor

Update all guardian.ontai.dev API group references in conductor:
- compile_enable.go, compile_launch.go: enable bundle apiVersion strings, webhook names
- catalog.go and all 5 catalog YAML entries: apiVersion strings in rendered RBACProfiles
- capability/guardian.go, adapters.go: GVR Group fields for snapshot/profile/policy
- agent pull loops (rbacpolicy, rbacprofile, receipt, signing): GVR Group fields
- All unit, integration, and e2e test fixtures: GVR/GVK Group strings and apiVersion values
…ductor-sdk

- Dockerfile.compiler/execute/agent: seam-core/ -> seam/, wrapper/ -> dispatcher/
- Add COPY conductor-sdk/ and seam-sdk/ to all three builder stages
- cmd/conductor/main.go: fix stale "seam-core scheme" panic message to "seam"
- docs/conductor-schema.md: update InfrastructureRunnerConfig -> RunnerConfig,
  infrastructure.ontai.dev -> seam.ontai.dev throughout

Steps 6.1, 6.3, 6.4 were already complete (single binary entrypoint at
cmd/conductor/, single build target, go.mod already imports conductor-sdk).

Fresh documentation from current codebase. runner.ontai.dev claim removed
(conductor owns no API group). pkg/runnerlib replaced with conductor-sdk
reference. seam-core replaced with seam. All three image modes documented
accurately. Capability table rebuilt from conductor-sdk/runnerlib/constants.go.
…am-sdk/conductor-sdk); fix integration test GVR and CRD for RunnerConfig post-migration
@ontave ontave merged commit 616c715 into main May 17, 2026
1 of 4 checks passed
@ontave ontave deleted the session/25d-stack-upgrade-image-fix branch May 17, 2026 22:00