
[BUG] annotation last-applied-tolerations magically added in 1.1.0 #2120

Closed
Bobonium opened this issue Dec 18, 2020 · 12 comments

@Bobonium

Describe the bug
I manage my deployments with Terraform, so I can see any changes made to the managed resources. I just upgraded to 1.1.0, and I have never configured any tolerations through the Longhorn UI.
After the upgrade, the following change is visible for the driver, the manager and the UI:

          ~ annotations      = {
              - "longhorn.io/last-applied-tolerations" = jsonencode([]) -> null
            }

The diff shows that Terraform wants to remove the annotation, as it's not part of the definition. Applying it doesn't help, though, as Longhorn always magically re-adds the value.
This is, at least in theory, also a problem for non-Terraform users: re-applying the same deployment YAMLs should not result in a change, but with this behavior it always will.

To Reproduce

  • Deploy Longhorn 1.1.0 from the deployment YAML
  • Check the annotations: longhorn.io/last-applied-tolerations has been magically added

Expected behavior
Please don't dynamically adjust resources that are created by the deployment YAMLs. This has negative impacts for anyone who manages their desired state in Git.

Log
irrelevant for the bug

Environment:
irrelevant for the bug

Additional context

@PhanLe1010
Contributor

PhanLe1010 commented Dec 18, 2020

@Bobonium

How does Terraform react when Longhorn automatically adds the annotation? E.g., does it deploy/redeploy Longhorn repeatedly? Does Longhorn's normal functionality still work OK?

In other words, how bad is the problem?

@Bobonium
Author

Bobonium commented Dec 19, 2020

Hey @PhanLe1010 ,

Terraform will always show and apply a change, because Longhorn automatically re-adds the annotation if it does not exist. I can use (and currently am using) a workaround: simply adding the automatically created annotation to my Terraform code, so it doesn't display a diff anymore.
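
In manifest terms the workaround just pre-declares the annotation with the same empty list that jsonencode([]) produces, roughly like this (only the metadata fragment is shown; the rest of each resource stays unchanged):

# Sketch of the workaround: pre-declare the annotation with an empty JSON list
# ("[]") so Terraform no longer sees a diff on the driver, manager and UI.
metadata:
  annotations:
    longhorn.io/last-applied-tolerations: "[]"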

This is less a technical problem than a design flaw, which in my opinion shows that Longhorn as a project is not as mature as it claims to be (>1). To be honest, I really like this project, more than any of the other alternatives I've tried so far, and I'd like it to succeed and to keep using it in the future. But seeing changes like these worries me, because the storage layer is the single most important part of a cluster. If it breaks, most of the running workloads break, and as such every design decision should be well thought through. Frankly, I don't think that was the case here.

I had a deeper look at how this system works, and from the point of view of someone who has to deploy and manage this, it's as unusable as it gets.
Previously you had a hardcoded check that simply kept all tolerations whose keys started with kubernetes.io; I can totally understand why you were unhappy with that.
The new logic, if I understand it correctly (see: https://github.com/longhorn/longhorn-manager/pull/771/files#diff-be348ed45296f49335ad1afc4a80cdcb18350cbcbbf701a740c2f400e6c31407R476-R497 ), uses the annotation to remember which tolerations were actually added by Longhorn and removes those before the new tolerations are applied on top.
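
As far as I can tell, that record ends up as an annotation on the workload itself, roughly like the sketch below (the exact JSON encoding and the example toleration are my assumption, not copied from a real cluster):

# Rough sketch of the bookkeeping annotation on an affected Deployment after a
# toleration has been set through the UI.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: longhorn-ui
  namespace: longhorn-system
  annotations:
    longhorn.io/last-applied-tolerations: '[{"key":"example-key","operator":"Equal","value":"example-value","effect":"NoSchedule"}]'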

But this is a shitty system in general and I'd like to explain just why.
First of all, the handling is severely flawed; here's how to reproduce the problem:

  • apply yaml files with custom tolerations
  • make changes in UI to add more tolerations
  • apply yaml files again
    => you now no longer have the tolerations that were set via the UI, and Longhorn never updates them unless you update them in the UI again.
    You could only fix this by watching the manually created resources, which would cost some CPU for stuff that you shouldn't even touch in the first place (funnily enough, you actually do this for the annotations).

The second thing is that it just does not make any sense from the point of view of whoever deploys this. Whoever deploys this knows which tolerations should be applied to the resources they control. For example, it could make sense to force the Longhorn UI to run on the same node as the Longhorn-Manager, as all UI responses are based on the API communication between the two. At the same time, that does not necessarily mean I need the same tolerations for them as for all the other resources that are deployed.
Therefore Longhorn-UI, Longhorn-Manager and Longhorn-Driver-Deployer (basically everything that is in the deployment.yaml) should simply not be touched automatically, because that makes it really hard to keep proper track of your deployed resources in a version control system, and in some cases it might even give you end results that you did not want in the first place.

The third thing is that you should really start thinking about how users deploy things in the real world. At this time it's not possible to deploy multiple independent Longhorn systems in a single cluster, and I don't even dare to try renaming the namespace it's deployed into. There are various reasons why one would like to have Longhorn deployed multiple times independently on the same cluster (or into a namespace shared with other resources, like a general storage namespace). Your changes further show that you currently don't think/care about this: in the upgrade path you simply add this annotation to all daemonsets and deployments in the same namespace.
If one were to have a more generic namespace hosting all storage solutions, you'd touch resources that you're not supposed to touch.

My recommendation would be to add a label to the resources that are dynamically deployed. Perfectly suited for this would be app.kubernetes.io/managed-by=longhorn-manager, as it's a recommended label per the Kubernetes documentation. Then limit the Kubernetes Taint Toleration UI setting to deployments that have this exact label and value.
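
Roughly, every component that longhorn-manager deploys itself would carry the label, e.g. (the CSI provisioner deployment is just an illustrative example):

# Sketch of the suggested label on a dynamically deployed component. The
# "Kubernetes Taint Toleration" setting would then only touch workloads that
# carry this exact label and value.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: csi-provisioner
  namespace: longhorn-system
  labels:
    app.kubernetes.io/managed-by: longhorn-manager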

@yasker yasker self-assigned this Dec 22, 2020
@yasker yasker moved this from To do to In progress in Community Issue Review Dec 22, 2020
@yasker
Member

yasker commented Dec 22, 2020

Hi @Bobonium ,

Thanks for the write-up. Your suggestion makes sense.

The problem we were trying to solve before was to make it as easy as possible for users to set tolerations. And since we didn't expect users to update the YAML file, we decided to update the tolerations for the running workloads automatically. As you noticed, the decision we made at that time has kept bothering us since. So in the latest v1.1.0 release, we decided to use the annotation as the record of the tolerations set by Longhorn. But still, it won't solve all the problems, since the user or Kubernetes might set the same toleration on the workloads too.

Previously we were focusing too much on trying to get everything automated for the user, and we didn't realize the design would bring problems to users doing GitOps. If we don't automatically update the existing Longhorn components that come from the deployment file (YAML or chart), then it's much easier for us to deal with, and it's easier for you to deploy. Thinking about it now, your suggestion makes perfect sense. I'd like to change the design as you suggested in the next release.

Regarding deploying multiple Longhorn clusters, I can see why it might be useful when there are different tiers, but the disk/node tag feature is designed to handle that situation. Also, since Longhorn is not multi-tenant by itself, there could be use cases where users want to deploy a separate Longhorn for themselves. We haven't seen strong demand for this part yet, so we didn't prioritize it. But resolving #1844 can be a good first step in that direction.

@yasker yasker added this to the v1.1.1 milestone Dec 22, 2020
@yasker yasker added area/kubernetes Kubernetes related like K8s version compatibility component/longhorn-manager Longhorn manager (control plane) kind/feature Feature request, new feature priority/1 Highly recommended to fix in this release (managed by PO) require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/doc Require updating the longhorn.io documentation require/manual-test-plan Require adding/updating manual test cases if they can't be automated labels Dec 22, 2020
@yasker yasker removed their assignment Dec 22, 2020
@yasker yasker removed this from In progress in Community Issue Review Dec 22, 2020
@yasker yasker added this to To do in Community Issue Review via automation Dec 22, 2020
@yasker yasker moved this from To do to Done in Community Issue Review Dec 22, 2020
@PhanLe1010
Contributor

PhanLe1010 commented Mar 12, 2021

Manual Tests:

Case 1: Existing Longhorn installation

  1. Install Longhorn master.
  2. Change the toleration in the UI settings
  3. Verify that the longhorn.io/last-applied-tolerations annotation and the tolerations of manager, driver deployer, UI are not changed.
  4. Verify that the longhorn.io/last-applied-tolerations annotation and the tolerations for managed components (CSI components, IM pods, share manager pod, EI daemonset) are updated correctly

Case 2: New installation by Helm

  1. Install Longhorn master, set tolerations like:
defaultSettings:
  taintToleration: "key=value:NoSchedule"

longhornManager:
  priorityClass: ~
  tolerations:
  # Toleration for the Longhorn Manager DaemonSet
  - key: key
    operator: Equal
    value: value
    effect: NoSchedule

longhornDriver:
  priorityClass: ~
  tolerations:
  # Toleration for the Longhorn Driver Deployer
  - key: key
    operator: Equal
    value: value
    effect: NoSchedule

longhornUI:
  priorityClass: ~
  tolerations:
  # Toleration for the Longhorn UI
  - key: key
    operator: Equal
    value: value
    effect: NoSchedule   
  2. Verify that the toleration is added for: IM pods, Share Manager pods, CSI deployments, CSI daemonset, the backup jobs, manager, driver deployer, UI (the expected pod-spec toleration is sketched after this case)
  3. Uninstall the Helm release and verify that the uninstallation succeeds
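
For reference, the taintToleration value "key=value:NoSchedule" above is expected to show up in the pod spec of the managed components roughly as the standard Kubernetes toleration below (a sketch, not chart output):

# Expected pod-spec toleration on managed components (IM pods, Share Manager
# pods, CSI workloads, ...) for the "key=value:NoSchedule" setting above.
spec:
  tolerations:
  - key: key
    operator: Equal
    value: value
    effect: NoSchedule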

Case 3: Upgrading from Helm

  1. Install Longhorn v1.0.2 using Helm, set tolerations using Longhorn UI
  2. Upgrade Longhorn to the master version, verify that longhorn.io/last-applied-tolerations is not set for manager, driver deployer, UI
  3. Verify that the longhorn.io/managed-by: longhorn-manager label is added for: IM CRs, EI CRs, Share Manager CRs, IM pods, Share Manager pods, CSI services, CSI deployments, CSI daemonset
  4. Verify that longhorn.io/last-applied-tolerations is set for: IM pods, Share Manager pods, CSI deployments, CSI daemonset
  5. Upgrade the chart to specify tolerations for manager, driver deployer, UI
  6. Verify that the tolerations get applied
  7. Repeat this test case with Longhorn v1.1.0 in step 1

Case 4: Upgrading from kubectl

  1. Install Longhorn v1.0.2 using kubectl, set tolerations using Longhorn UI
  2. Upgrade Longhorn to the master version, verify that longhorn.io/last-applied-tolerations is not set for manager, driver deployer, UI
  3. Verify that the longhorn.io/managed-by: longhorn-manager label is added for: IM CRs, EI CRs, Share Manager CRs, IM pods, Share Manager pods, CSI services, CSI deployments, CSI daemonset
  4. Verify that longhorn.io/last-applied-tolerations is set for: IM pods, Share Manager pods, CSI deployments, CSI daemonset
  5. Update the chart/YAML to specify tolerations for manager, driver deployer, UI
  6. Verify that the tolerations get applied
  7. Repeat this test case with Longhorn v1.1.0 in step 1

Case 5: Node with taints

  1. Add some taints to all nodes in the cluster, e.g. key=value:NoSchedule (the resulting Node taint is sketched below)
  2. Repeat cases 2, 3, and 4
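
The taint from step 1 can be applied with kubectl taint nodes <node> key=value:NoSchedule and ends up on each Node object roughly as:

# Resulting taint in the Node spec after step 1 (node name omitted).
spec:
  taints:
  - key: key
    value: value
    effect: NoSchedule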

Case 6: Priority Class UI

  1. Change Priority Class setting in Longhorn UI
  2. Verify that Longhorn only updates the managed components

Case 7: Priority Class Helm

  1. Change the Priority Class in Helm for manager, driver, UI (see the values sketch below)
  2. Verify that only the priority class names of manager, driver, UI get updated
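
A minimal sketch of the Helm values change for this case, reusing the priorityClass keys from the values in Case 2 (the class name is illustrative and the PriorityClass must already exist in the cluster):

# Illustrative values change for Case 7: set a priority class only for the
# user-deployed components. "example-priority" is a placeholder PriorityClass.
longhornManager:
  priorityClass: example-priority
longhornDriver:
  priorityClass: example-priority
longhornUI:
  priorityClass: example-priority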

PhanLe1010 added a commit to PhanLe1010/longhorn that referenced this issue Mar 15, 2021
… Driver, Manager

Now that we only watch/update managed components, we should allow
user to specify Helm values for user deployed components (manager,
driver, UI).

Longhorn longhorn#2120

Signed-off-by: Phan Le <phan.le@rancher.com>
@longhorn-io-github-bot

longhorn-io-github-bot commented Mar 15, 2021

Pre-merged Checklist

PhanLe1010 added a commit to PhanLe1010/longhorn that referenced this issue Mar 24, 2021
… Driver, Manager

innobead pushed a commit that referenced this issue Mar 24, 2021
… Driver, Manager

@khushboo-rancher
Contributor

@PhanLe1010 For the upgrade scenario, users who have (or want) tolerations set for Longhorn-manager, UI, and Driver deployer with previous Longhorn versions will see the tolerations getting removed on upgrading to v1.1.1, and they will have to update the chart/YAML for Longhorn-manager etc. with the tolerations. Do we want to mention this somewhere in the release notes?
cc: @innobead

@PhanLe1010
Contributor

Yeah, good point. I agree that we should mention this behavior in the release notes.

@khushboo-rancher
Contributor

khushboo-rancher commented Apr 5, 2021

Verified with Longhorn-master - 04/06/2021

Validation - Fail

There are two different behaviors in the scenario below, depending on whether Longhorn is deployed with Helm or kubectl.

  1. Deploy Longhorn master with tolerations: the default tolerations as well as tolerations for Longhorn-manager, driver and UI.
  2. Once Longhorn is deployed, delete the toleration using the Longhorn UI.
    With kubectl
    -- No change to Longhorn-manager, Driver and UI.
    With Helm
    -- The tolerations on Longhorn-manager, Driver and UI get deleted and they get redeployed.

Also, the priority class doesn't get set for the recurring jobs when applied through the Longhorn UI. @PhanLe1010 Is this expected?

Update:
Validation - Pass

@khushboo-rancher
Contributor

Sorry, I used the wrong image to test the Helm installation. Updated #2120 (comment) to Validation - Pass.

The behavior is as below.

  1. For a new installation, the toleration and priority class for components other than Longhorn-manager, UI and Driver deployer can be set using the Longhorn UI or the default settings. To set the toleration and priority class for Longhorn-manager, UI and Driver deployer, users have to edit the YAML/values while deploying Longhorn, or later.
  2. For an upgrade, the already-set tolerations and priority class for Longhorn-manager, UI and Driver deployer will remain as they are and can be modified by editing the YAML/values while upgrading, or later. For the other components, they can be set using the Longhorn UI or the default settings.

We need to edit the manual test cases to match this behavior.
Created a bug for priority class not getting updated for recurring jobs. #2444

@PhanLe1010
Contributor

PhanLe1010 commented Apr 6, 2021

Thanks @khushboo-rancher for carefully testing

@khushboo-rancher
Contributor

Linking this issue with #2262 to track and validate the Rancher chart changes.

@meldafrawi
Contributor

Closing, will track integration test implementation in a separate issue #2482

scvalex pushed a commit to scvalex/longhorn that referenced this issue May 12, 2021
… Driver, Manager
