investigate: why uninstaller get involved in the Longhorn app upgrade/rollback process #783

Closed
yasker opened this issue Oct 1, 2019 · 7 comments
Labels
area/install-uninstall-upgrade Install, Uninstall or Upgrade related component/longhorn-manager Longhorn manager (control plane) kind/bug

@yasker
Member

yasker commented Oct 1, 2019

There are two reports of the uninstaller getting involved: #782 (comment) and #755 (comment)

We don't expect the uninstaller to run during upgrade or rollback, since it cleans up user volume data when triggered. If this is how Helm chart rollback works, we need to stop it from happening.

@yasker yasker added kind/bug component/longhorn-manager Longhorn manager (control plane) area/install-uninstall-upgrade Install, Uninstall or Upgrade related labels Oct 1, 2019
@yasker yasker added this to the v0.6.2 milestone Oct 1, 2019
@rbq
Contributor

rbq commented Oct 1, 2019

@yasker Regarding my comment from #755: I saw the uninstaller running at least once, but don't remember if it was when I created a 0.5 release following a 0.6.1 one, or later when rolling back to a previous revision. Sorry, I wish I could be of more help.

@shuo-wu
Contributor

shuo-wu commented Oct 1, 2019

Steps to reproduce this issue:

  1. Rancher v2.2.7. Kubernetes v1.13.9. 3-node cluster
  2. Deploy Longhorn v0.5.0 and wait for it to become active
  3. Upgrade Longhorn to v0.6.1 and wait for it to become active
  4. Roll back or upgrade Longhorn to v0.5.0 and wait for it to become active
  5. Re-upgrade Longhorn to v0.6.1 --> the upgrade fails with the error message: failed to install app longhorn-system. Error: UPGRADE FAILED: no CustomResourceDefinition with the name "instancemanagers.longhorn.rancher.io" found
  6. Upgrade Longhorn to v0.5.0 --> The uninstaller is triggered

@shuo-wu
Contributor

shuo-wu commented Oct 1, 2019

The causes of this issue:

  1. The upgrade failure in step 5 is due to a Helm bug
  2. Before v2.3.0, Rancher retains only 1 release history record. This means the only release history available at step 6 is the record of the failed Longhorn upgrade. At step 6, Helm cannot find an earlier valid release for the second upgrade, and then somehow triggers the Helm delete hooks, as the Tiller log shows:
[main] 2019/10/01 19:16:10 Starting Tiller v2.10+unreleased (tls=false)
[main] 2019/10/01 19:16:10 GRPC listening on :39576
[main] 2019/10/01 19:16:10 Probes listening on :35605
[main] 2019/10/01 19:16:10 Storage driver is ConfigMap
[main] 2019/10/01 19:16:10 Max history per release is 1
[tiller] 2019/10/01 19:16:11 getting history for release longhorn-system
[storage] 2019/10/01 19:16:11 getting release history for "longhorn-system"
[tiller] 2019/10/01 19:16:11 preparing update for longhorn-system
[storage] 2019/10/01 19:16:11 getting deployed releases from "longhorn-system" history
[storage] 2019/10/01 19:16:11 getting last revision of "longhorn-system"
[storage] 2019/10/01 19:16:11 getting release history for "longhorn-system"
[storage] 2019/10/01 19:16:11 getting release history for "longhorn-system"
[tiller] 2019/10/01 19:16:11 name longhorn-system exists but is not in use, reusing name
[tiller] 2019/10/01 19:16:11 rendering longhorn chart using values
2019/10/01 19:16:11 info: manifest "longhorn/templates/ingress.yaml" is empty. Skipping.
2019/10/01 19:16:11 info: manifest "longhorn/templates/tls-secrets.yaml" is empty. Skipping.
[storage] 2019/10/01 19:16:11 updating release "longhorn-system.v4"
[tiller] 2019/10/01 19:16:12 executing 2 pre-delete hooks for longhorn-system
[kube] 2019/10/01 19:16:12 building resources from manifest
[kube] 2019/10/01 19:16:12 creating 1 resource(s)
[kube] 2019/10/01 19:16:12 Watching for changes to Job longhorn-uninstall with timeout of 5m0s
[kube] 2019/10/01 19:16:12 Add/Modify event for longhorn-uninstall: ADDED
[kube] 2019/10/01 19:16:12 longhorn-uninstall: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
[kube] 2019/10/01 19:16:12 Add/Modify event for longhorn-uninstall: MODIFIED
[kube] 2019/10/01 19:16:12 longhorn-uninstall: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
2019-10-01 19:19:35.349784 I | mvcc: store.index: compact 583036
2019-10-01 19:19:35.351875 I | mvcc: finished scheduled compaction at 583036 (took 1.196359ms)
W1001 19:19:49.710026       6 reflector.go:270] github.com/rancher/norman/controller/generic_controller.go:175: watch of *v1.ConfigMap ended with: too old resource version: 1573096 (1574971)
[tiller] 2019/10/01 19:21:12 warning: Release longhorn-system pre-delete longhorn/templates/uninstall-job.yaml could not complete: timed out waiting for the condition
2019-10-01 19:21:12.477439 I | suppressing panic for copyResponse error in test; copy error: context canceled
2019/10/01 19:21:12 [ERROR] AppController p-598ws/longhorn-system [helm-controller] failed with : failed to install app longhorn-system. Error: UPGRADE FAILED: timed out waiting for the condition
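
The symptom in the log above can be checked for mechanically. The following is a minimal sketch, assuming only the three message formats quoted in this log ("Max history per release is", "preparing update for", "executing N pre-delete hooks for"); the function name `scan_tiller_log` and the summary keys are hypothetical, and other Tiller versions may word these lines differently.

```python
import re

# Patterns copied from the Tiller log lines quoted in this issue.
MAX_HISTORY = re.compile(r"Max history per release is (\d+)")
PREPARE_UPDATE = re.compile(r"preparing update for (\S+)")
PRE_DELETE = re.compile(r"executing (\d+) pre-delete hooks for (\S+)")


def scan_tiller_log(lines):
    """Report the history limit and any pre-delete hooks run for a release being updated."""
    summary = {"max_history": None, "updating": set(), "pre_delete_during_update": []}
    for line in lines:
        match = MAX_HISTORY.search(line)
        if match:
            summary["max_history"] = int(match.group(1))
            continue
        match = PREPARE_UPDATE.search(line)
        if match:
            # Remember which releases Tiller said it was updating.
            summary["updating"].add(match.group(1))
            continue
        match = PRE_DELETE.search(line)
        if match and match.group(2) in summary["updating"]:
            # A delete hook during an update is the unexpected behaviour reported here.
            summary["pre_delete_during_update"].append((match.group(2), int(match.group(1))))
    return summary


sample = [
    "[main] 2019/10/01 19:16:10 Max history per release is 1",
    "[tiller] 2019/10/01 19:16:11 preparing update for longhorn-system",
    "[tiller] 2019/10/01 19:16:12 executing 2 pre-delete hooks for longhorn-system",
]
result = scan_tiller_log(sample)
print(result["max_history"], result["pre_delete_during_update"])
```

A non-empty `pre_delete_during_update` list means the uninstall job was started while Tiller was handling an upgrade request.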

This issue will be fixed by a PR in Rancher v2.3.0.
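
To illustrate why a history limit of 1 leaves no valid release to fall back on, here is a toy model of the release history across the reproduction steps. It is not Helm's or Rancher's code: the `Revision` class, the `record`, `replay` and `has_deployed` helpers, and the assumption that pruning drops the oldest records regardless of status are all hypothetical and chosen only to match the behaviour described above.

```python
from dataclasses import dataclass


@dataclass
class Revision:
    number: int
    chart: str
    status: str  # "DEPLOYED", "SUPERSEDED" or "FAILED"


def record(history, chart, ok, max_history):
    """Append a revision, then keep only the newest max_history records (assumed pruning)."""
    if ok:
        for rev in history:
            if rev.status == "DEPLOYED":
                rev.status = "SUPERSEDED"
    number = history[-1].number + 1 if history else 1
    history.append(Revision(number, chart, "DEPLOYED" if ok else "FAILED"))
    return history[-max_history:]


def has_deployed(history):
    """True if any retained record is a successfully deployed release."""
    return any(rev.status == "DEPLOYED" for rev in history)


def replay(max_history):
    """Replay the install, upgrade, rollback and failed re-upgrade from the reproduction steps."""
    steps = [("v0.5.0", True), ("v0.6.1", True), ("v0.5.0", True), ("v0.6.1", False)]
    history = []
    for chart, ok in steps:
        history = record(history, chart, ok, max_history)
    return history


print(has_deployed(replay(1)), has_deployed(replay(10)))
```

With a limit of 1, the only retained record is the failed v0.6.1 upgrade, so the next upgrade has no deployed release to work from. With a larger limit, the last good v0.5.0 revision is still present.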

@shuo-wu
Contributor

shuo-wu commented Oct 2, 2019

The key to this issue is avoiding the upgrade failure (the error in step 5) on older Rancher versions. For details, see: https://github.com/longhorn/longhorn/wiki/Longhorn-v0.6.0-Upgrade:-caveats-about-rolling-back-to-v0.5.0

@meldafrawi
Contributor

meldafrawi commented Oct 3, 2019

Validation: Failed

Steps to test
Using Rancher v2.2.7; Kubernetes v1.13.9; 3-node cluster.

  1. Deploy Longhorn v0.5.0 and wait for it to become active
  2. Upgrade Longhorn to v0.6.1 and wait for it to become active
  3. Roll back or upgrade Longhorn to v0.5.0 and wait for it to become active
    • Scenario 1:
      • No cleanup before re-upgrading Longhorn to v0.6.1 --> the upgrade fails with the error message: failed to install app longhorn-system. Error: UPGRADE FAILED: no CustomResourceDefinition with the name "instancemanagers.longhorn.rancher.io" found
      • Then upgrade Longhorn to v0.5.0 after the error message occurs --> The uninstaller is triggered
    • Scenario 2:
      • Do cleanup before re-upgrading Longhorn to v0.6.1. The re-upgrade should succeed.

In Scenario 2, after upgrade to v0.6.1 succeeds:
5. Create a volume volume-1, attach it to a node.

Expected Result: Volume is attached (FAILED)
Failure: Volume is stuck in attaching state.

  6. Detach the stuck volume, and upgrade to use the master images.
  7. Retry attaching volume-1 to a node.

Expected Result: Volume is attached (FAILED)
Failure: Volume is stuck in attaching state.

  8. Create another volume volume-2 and attach it to a node. The volume gets attached successfully.

@shuo-wu
Contributor

shuo-wu commented Oct 3, 2019

The above failure is actually caused by these two issues:
#776
#789

@meldafrawi Could you test this issue directly with the latest version (v0.6.2 in the private test chart)?

@meldafrawi
Contributor

Validation: PASSED

Steps to test
Using Rancher v2.2.7; Kubernetes v1.13.9; 3-node cluster.

  1. Deploy Longhorn v0.5.0 and wait for it to become active
  2. Upgrade Longhorn to v0.6.2 (master) and wait for it to become active
  3. Roll back or upgrade Longhorn to v0.5.0 and wait for it to become active
    • Scenario 1:
      • No cleanup before re-upgrading Longhorn to v0.6.2 (master) --> the upgrade fails with the error message: failed to install app longhorn-system. Error: UPGRADE FAILED: no CustomResourceDefinition with the name "instancemanagers.longhorn.rancher.io" found
      • Then upgrade Longhorn to v0.5.0 after the error message occurs --> The uninstaller is triggered
    • Scenario 2:
      • Do cleanup before re-upgrading Longhorn to v0.6.2 (master). The re-upgrade should succeed.

In Scenario 2, after upgrade to v0.6.2 (master) succeeds:
5. Create a volume volume-1, attach it to a node.

Expected Result: Volume is attached
