
eve-k pillar: upgrade longhorn-manager to v1.9.1 and fix BackupTargetName on failover #5765

Merged
eriknordmark merged 2 commits into lf-edge:master from andrewd-zededa:lh-mgr-mod-1.9.1
Apr 13, 2026
Conversation

@andrewd-zededa
Contributor

@andrewd-zededa andrewd-zededa commented Apr 7, 2026

Description

Upgrades github.com/longhorn/longhorn-manager from v1.6.0 to v1.9.1 in
pkg/pillar and fixes a failover regression that surfaces when volumes have
been migrated forward from older longhorn versions.

Fix: BackupTargetName empty on migrated volumes (kubeapi/longhorninfo.go)

Longhorn v1.9.x introduces a webhook validator that rejects any volume
Update() where Spec.BackupTargetName is empty:

"backup target name cannot be empty when creating a volume or updating
from an existing backup target"

Volumes migrated from longhorn < v1.7 predate the BackupTargetName field
and carry an empty value. longhornVolumeSetNode() — called during failover
— hits this validator and fails. The fix sets BackupTargetName to
"default" when the field is empty before calling Update(). The
default BackupTarget CR is always present in a longhorn installation,
even when no external backup target is configured (empty URL).
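The defaulting step described above can be sketched roughly as follows. This is a minimal illustration, not the code in kubeapi/longhorninfo.go: the `volume` and `volumeSpec` types, `validateUpdate`, `ensureBackupTargetName`, and `setNode` are all hypothetical stand-ins so that the example runs without the longhorn-manager module. Only the error text and the "default" target name come from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// volumeSpec and volume are hypothetical stand-ins for the Longhorn Volume CR
// types; only the fields needed for this illustration are modeled.
type volumeSpec struct {
	NodeID           string
	BackupTargetName string
}

type volume struct {
	Name string
	Spec volumeSpec
}

// defaultBackupTargetName is the name of the BackupTarget CR that the PR
// description says is always present in a longhorn installation.
const defaultBackupTargetName = "default"

var errEmptyBackupTarget = errors.New(
	"backup target name cannot be empty when creating a volume or updating from an existing backup target")

// validateUpdate imitates the webhook rule quoted in the description:
// an update with an empty BackupTargetName is rejected.
func validateUpdate(v *volume) error {
	if v.Spec.BackupTargetName == "" {
		return errEmptyBackupTarget
	}
	return nil
}

// ensureBackupTargetName fills in the default name when the field is empty.
// It reports whether the volume was changed and never overwrites a set value.
func ensureBackupTargetName(v *volume) bool {
	if v.Spec.BackupTargetName != "" {
		return false
	}
	v.Spec.BackupTargetName = defaultBackupTargetName
	return true
}

// setNode is a simplified stand-in for the failover path: it assigns the new
// node, applies the defaulting step, then "updates" the volume. Here the
// update is just the validator call.
func setNode(v *volume, node string) error {
	v.Spec.NodeID = node
	ensureBackupTargetName(v)
	return validateUpdate(v)
}

func main() {
	migrated := &volume{Name: "pvc-example"} // BackupTargetName left empty
	fmt.Println("without fix:", validateUpdate(migrated))
	fmt.Println("with fix:", setNode(migrated, "node-2"), migrated.Spec.BackupTargetName)
}
```

Keeping the defaulting in a small helper that leaves non-empty values alone means volumes that already point at a named backup target are unaffected.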

Dependency bump (go.mod / vendor)

go mod tidy and go mod vendor applied. Notable transitive bumps:

  • k8s.io/* v0.32.5 → v0.33.3
  • sigs.k8s.io/controller-runtime v0.16.1 → v0.20.4
  • sigs.k8s.io/structured-merge-diff/v4 v4.4.3 → v4.7.0
  • prometheus/client_golang v1.19.1 → v1.22.0

PR dependencies

None.

How to test and validate this PR

  1. Set up a 2+ node HV=k cluster running longhorn.
  2. Simulate or identify volumes originally created with longhorn < v1.7
    (i.e., volumes whose Spec.BackupTargetName is empty).
  3. Trigger a failover (graceful node reboot or power-off of the designated node).
  4. Confirm that longhornVolumeSetNode no longer logs
    "backup target name cannot be empty" and the volume successfully
    migrates to the surviving node.
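For step 2, one possible way to identify affected volumes is to scan a JSON listing of Longhorn Volume objects for an empty backup target name. The sketch below assumes such a listing is already available (for example from something like `kubectl get volumes.longhorn.io -o json` in the namespace where longhorn runs) and that the field serializes as `spec.backupTargetName`; both are assumptions and not part of this PR. The function name and the sample data are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// volumeList models only the parts of a Kubernetes list response that this
// check needs. The JSON field name backupTargetName is an assumption.
type volumeList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Spec struct {
			BackupTargetName string `json:"backupTargetName"`
		} `json:"spec"`
	} `json:"items"`
}

// volumesMissingBackupTarget returns the names of volumes whose
// spec.backupTargetName is empty or absent in the given JSON listing.
func volumesMissingBackupTarget(data []byte) ([]string, error) {
	var list volumeList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	var names []string
	for _, item := range list.Items {
		if item.Spec.BackupTargetName == "" {
			names = append(names, item.Metadata.Name)
		}
	}
	return names, nil
}

// sample is made-up data: one volume without the field, one with it set.
const sample = `{
  "items": [
    {"metadata": {"name": "pvc-migrated"}, "spec": {}},
    {"metadata": {"name": "pvc-new"}, "spec": {"backupTargetName": "default"}}
  ]
}`

func main() {
	names, err := volumesMissingBackupTarget([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("volumes with empty BackupTargetName:", names)
}
```

Any volume reported by such a check is a candidate for exercising the failover path in steps 3 and 4.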

Changelog notes

Optimize a failover of HV=k applications, trimming time spent migrating volumes.

PR Backports

  • 16.0-stable: To be determined.
  • 14.5-stable: To be determined.
  • 13.4-stable: To be determined.

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR

And the last but not least:

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

Please, check the boxes above after submitting the PR in interactive mode.

@codecov

codecov Bot commented Apr 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.87%. Comparing base (2281599) to head (a420786).
⚠️ Report is 486 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #5765       +/-   ##
===========================================
+ Coverage   19.52%   29.87%   +10.34%     
===========================================
  Files          19       18        -1     
  Lines        3021     2417      -604     
===========================================
+ Hits          590      722      +132     
+ Misses       2310     1549      -761     
- Partials      121      146       +25     


@andrewd-zededa
Contributor Author

rebased off latest master

@andrewd-zededa andrewd-zededa marked this pull request as ready for review April 9, 2026 17:51
Contributor

@naiming-zededa naiming-zededa left a comment

LGTM

Comment thread pkg/pillar/docs/failover.md Outdated
Comment thread pkg/pillar/docs/failover.md Outdated
Comment thread pkg/pillar/docs/failover.md

@zedi-pramodh zedi-pramodh left a comment

LGTM, just a few comments on the document.

Also, please explain how the de-scheduler works.

@zedi-pramodh

We still need some documentation on de-scheduler.

@andrewd-zededa
Contributor Author

> We still need some documentation on de-scheduler.

Done, expanded on the policy and trigger under the '### Failback handling' section.


@zedi-pramodh zedi-pramodh left a comment

LGTM

@andrewd-zededa
Contributor Author

/rerun red

@rene
Contributor

rene commented Apr 11, 2026

@andrewd-zededa , pls, rebase on top of master.

andrewd-zededa and others added 2 commits April 13, 2026 07:05
Volumes migrated from longhorn < v1.7 may have an empty BackupTargetName.
The v1.9.x webhook validator rejects any Update() where BackupTargetName
is empty, producing "backup target name cannot be empty" errors during
failover. Set it to "default" when unset before calling Update().

Also clarify failover.md: kubevirt virtualization support means HV=k.
Document Kubernetes object state timeline for VMIRs app failover in
docs/failover.md, covering node NotReady through new VMI Running,
including EVE-specific tolerateSec=15 and logcollectInterval=10s timing,
best-case timing summary, and descheduler-based failback handling.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Andrew Durbin <andrewd@zededa.com>

Bump github.com/longhorn/longhorn-manager v1.6.0 → v1.9.1 with
go mod tidy and go mod vendor. Transitive bumps include:
- k8s.io/* v0.32.5 → v0.33.3
- sigs.k8s.io/controller-runtime v0.16.1 → v0.20.4
- sigs.k8s.io/structured-merge-diff/v4 v4.4.3 → v4.7.0
- prometheus/client_golang v1.19.1 → v1.22.0
- gorilla/websocket, onsi/gomega, and several others

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Andrew Durbin <andrewd@zededa.com>
@andrewd-zededa
Contributor Author

andrewd-zededa commented Apr 13, 2026

The yetus failure is not clear; no linting even runs on this PR.

@andrewd-zededa
Contributor Author

Rebased, yetus seems to not even run which may be due to all the vendor changes.

@eriknordmark
Contributor

FWIW, all 4 eden smoke tests fail with the known issue, FAIL: TestHWInventory

@eriknordmark eriknordmark merged commit 6eac75b into lf-edge:master Apr 13, 2026
58 of 66 checks passed


5 participants