
[BUG] annotation last-applied-tolerations magically added in 1.1.0 #2120

Closed
Bobonium opened this issue Dec 18, 2020 · 12 comments

@Bobonium

Describe the bug
I manage my deployments with Terraform, so I can see any changes made to the managed resources. I just upgraded to 1.1.0, and I have never configured any tolerations through the Longhorn UI.
After the upgrade, the following change is visible for the driver, the manager and the UI:

          ~ annotations      = {
              - "longhorn.io/last-applied-tolerations" = jsonencode([]) -> null
            }

The diff shows that Terraform wants to remove the annotation, as it's not part of the definition. Applying it doesn't help, though, as Longhorn always magically re-adds the value.
This is, at least in theory, also a problem for non-Terraform users: re-applying the same deployment YAMLs should not result in a change, but with this behavior it always will.

To Reproduce

  • Deploy Longhorn 1.1.0 from the deployment YAML
  • Check the annotations: longhorn.io/last-applied-tolerations has been magically added

Expected behavior
Please don't dynamically adjust resources that are created by the deployment YAMLs. This has negative impacts for anyone who manages their desired state in Git.

Log
irrelevant for the bug

Environment:
irrelevant for the bug

Additional context

@PhanLe1010
Contributor

PhanLe1010 commented Dec 18, 2020

@Bobonium

How does Terraform react when Longhorn automatically adds the annotation? E.g., does it deploy/redeploy Longhorn repeatedly? Does Longhorn's normal functionality still work OK?

In other words, how bad is the problem?

@Bobonium
Author

Bobonium commented Dec 19, 2020

Hey @PhanLe1010 ,

Terraform will always show and apply a change, because Longhorn automatically re-adds the annotation if it does not exist. I can use (and currently am using) a workaround: simply adding the automatically created annotation to my Terraform code, so it doesn't display a diff anymore.
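
In manifest terms the workaround just pre-declares the annotation with the same empty list that jsonencode([]) produces, roughly like this (only the metadata fragment is shown; the rest of each resource stays unchanged):

# Sketch of the workaround: pre-declare the annotation with an empty JSON list
# ("[]") so Terraform no longer sees a diff on the driver, manager and UI.
metadata:
  annotations:
    longhorn.io/last-applied-tolerations: "[]"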

This is less a technical problem than a design flaw, which in my opinion shows that Longhorn as a project is not as mature as it claims to be (>1). To be honest, I really like this project, more than any of the other alternatives I've tried so far, and I'd like it to succeed and to keep using it in the future. But seeing changes like these worries me, because the storage layer is the single most important part of a cluster. If it breaks, most of the running workloads break, and as such every design decision should be well thought through. Frankly, I don't think that was the case here.

I had a deeper look at how this system works, and from the point of view of someone who has to deploy and manage this, it's as unusable as it gets.
Previously you had a hardcoded check that simply kept all tolerations whose keys started with kubernetes.io; I can totally understand why you were unhappy with that.
The new logic, if I understand it correctly (see: https://github.com/longhorn/longhorn-manager/pull/771/files#diff-be348ed45296f49335ad1afc4a80cdcb18350cbcbbf701a740c2f400e6c31407R476-R497 ), uses the annotation to remember which tolerations were actually added by Longhorn and removes those before the new tolerations are applied on top.
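
As far as I can tell, that record ends up as an annotation on the workload itself, roughly like the sketch below (the exact JSON encoding and the example toleration are my assumption, not copied from a real cluster):

# Rough sketch of the bookkeeping annotation on an affected Deployment after a
# toleration has been set through the UI.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: longhorn-ui
  namespace: longhorn-system
  annotations:
    longhorn.io/last-applied-tolerations: '[{"key":"example-key","operator":"Equal","value":"example-value","effect":"NoSchedule"}]'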

But this is a shitty system in general and I'd like to explain just why.
First of all, the handling is severely flawed; here's how to reproduce the problem:

  • apply yaml files with custom tolerations
  • make changes in UI to add more tolerations
  • apply yaml files again
    => you now no longer have the tolerations that were set via the UI, and Longhorn never updates them unless you update them in the UI again.
    You could only fix this by watching the manually created resources, which would cost some CPU for stuff that you shouldn't even touch in the first place (funnily enough, you actually do this for the annotations).

The second thing is that it just does not make any sense from the point of view of whoever deploys this. Whoever deploys this knows which tolerations should be applied to the resources they control. For example, it could make sense to force the Longhorn UI to run on the same node as the Longhorn-Manager, as all UI responses are based on the API communication between the two. At the same time, that does not necessarily mean I need the same tolerations for them as for all the other resources that are deployed.
Therefore Longhorn-UI, Longhorn-Manager and Longhorn-Driver-Deployer (basically everything that is in the deployment.yaml) should simply not be touched automatically, because that makes it really hard to keep proper track of your deployed resources in a version control system, and in some cases it might even give you end results that you did not want in the first place.

The third thing is that you should really start thinking about how users deploy things in the real world. At this time it's not possible to deploy multiple independent Longhorn systems in a single cluster, and I don't even dare to try renaming the namespace it's deployed into. There are various reasons why one would like to have Longhorn deployed multiple times independently on the same cluster (or into a namespace shared with other resources, like a general storage namespace). Your changes further show that you currently don't think/care about this: in the upgrade path you simply add this annotation to all daemonsets and deployments in the same namespace.
If one were to have a more generic namespace hosting all storage solutions, you'd touch resources that you're not supposed to touch.

My recommendation would be to add a label to the resources that are dynamically deployed. Perfectly suited for this would be app.kubernetes.io/managed-by=longhorn-manager, as it's a recommended label per the Kubernetes documentation. Then limit the Kubernetes Taint Toleration UI setting to deployments that have this exact label and value.
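
Roughly, every component that longhorn-manager deploys itself would carry the label, e.g. (the CSI provisioner deployment is just an illustrative example):

# Sketch of the suggested label on a dynamically deployed component. The
# "Kubernetes Taint Toleration" setting would then only touch workloads that
# carry this exact label and value.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: csi-provisioner
  namespace: longhorn-system
  labels:
    app.kubernetes.io/managed-by: longhorn-manager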

@yasker yasker self-assigned this Dec 22, 2020
@yasker yasker moved this from To do to In progress in Community Issue Review Dec 22, 2020
@yasker
Member

yasker commented Dec 22, 2020

Hi @Bobonium ,

Thanks for the write-up. Your suggestion makes sense.

The problem we were trying to solve before was to make it as easy as possible for users to set tolerations. And since we didn't expect users to update the YAML file, we decided to update the tolerations for the running workloads automatically. As you noticed, the decision we made at that time has kept bothering us since. So in the latest v1.1.0 release, we decided to use the annotation as the record of the tolerations set by Longhorn. But still, it won't solve all the problems, since the user or Kubernetes might set the same toleration on the workloads too.

Previously we were focusing too much on trying to get everything automated for the user, and we didn't realize the design would bring problems to users doing GitOps. If we don't automatically update the existing Longhorn components that come from the deployment file (YAML or chart), then it's much easier for us to deal with, and it's easier for you to deploy. Thinking about it now, your suggestion makes perfect sense. I'd like to change the design as you suggested in the next release.

Regarding deploying multiple Longhorn clusters, I can see why it might be useful when there are different tiers, but the disk/node tag feature is designed to handle that situation. Also, since Longhorn is not multi-tenant by itself, there could be use cases where users want to deploy a separate Longhorn for themselves. We haven't seen strong demand for this part yet, so we didn't prioritize it. But resolving #1844 can be a good first step in that direction.

@yasker yasker added this to the v1.1.1 milestone Dec 22, 2020
@yasker yasker added area/kubernetes Kubernetes related like K8s version compatibility component/longhorn-manager Longhorn manager (control plane) kind/feature Feature request, new feature priority/1 Highly recommended to fix in this release (managed by PO) require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/doc Require updating the longhorn.io documentation require/manual-test-plan Require adding/updating manual test cases if they can't be automated labels Dec 22, 2020
@yasker yasker removed their assignment Dec 22, 2020
@yasker yasker removed this from In progress in Community Issue Review Dec 22, 2020
@yasker yasker added this to To do in Community Issue Review via automation Dec 22, 2020
@yasker yasker moved this from To do to Done in Community Issue Review Dec 22, 2020
@PhanLe1010
Contributor

PhanLe1010 commented Mar 12, 2021

Manual Tests:

Case 1: Existing Longhorn installation

  1. Install Longhorn master.
  2. Change the toleration in the UI settings
  3. Verify that the longhorn.io/last-applied-tolerations annotation and the tolerations of manager, driver deployer, UI are not changed.
  4. Verify that the longhorn.io/last-applied-tolerations annotation and the tolerations for managed components (CSI components, IM pods, share manager pod, EI daemonset) are updated correctly

Case 2: New installation by Helm

  1. Install Longhorn master, set tolerations like:
defaultSettings:
  taintToleration: "key=value:NoSchedule"

longhornManager:
  priorityClass: ~
  tolerations:
  # Toleration for the Longhorn Manager DaemonSet
  - key: key
    operator: Equal
    value: value
    effect: NoSchedule

longhornDriver:
  priorityClass: ~
  tolerations:
  # Toleration for the Longhorn Driver Deployer
  - key: key
    operator: Equal
    value: value
    effect: NoSchedule

longhornUI:
  priorityClass: ~
  tolerations:
  # Toleration for the Longhorn UI
  - key: key
    operator: Equal
    value: value
    effect: NoSchedule   
  2. Verify that the toleration is added for: IM pods, Share Manager pods, CSI deployments, CSI daemonset, the backup jobs, manager, driver deployer, UI (the expected pod-spec toleration is sketched after this case)
  3. Uninstall the Helm release and verify that the uninstallation succeeds
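
For reference, the taintToleration value "key=value:NoSchedule" above is expected to show up in the pod spec of the managed components roughly as the standard Kubernetes toleration below (a sketch, not chart output):

# Expected pod-spec toleration on managed components (IM pods, Share Manager
# pods, CSI workloads, ...) for the "key=value:NoSchedule" setting above.
spec:
  tolerations:
  - key: key
    operator: Equal
    value: value
    effect: NoSchedule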

Case 3: Upgrading from Helm

  1. Install Longhorn v1.0.2 using Helm, set tolerations using Longhorn UI
  2. Upgrade Longhorn to the master version, verify that longhorn.io/last-applied-tolerations is not set for manager, driver deployer, UI
  3. Verify that the longhorn.io/managed-by: longhorn-manager label is added for: IM CRs, EI CRs, Share Manager CRs, IM pods, Share Manager pods, CSI services, CSI deployments, CSI daemonset
  4. Verify that longhorn.io/last-applied-tolerations is set for: IM pods, Share Manager pods, CSI deployments, CSI daemonset
  5. Upgrade the chart to specify tolerations for manager, driver deployer, UI
  6. Verify that the tolerations get applied
  7. Repeat this test case with Longhorn v1.1.0 in step 1

Case 4: Upgrading from kubectl

  1. Install Longhorn v1.0.2 using kubectl, set tolerations using Longhorn UI
  2. Upgrade Longhorn to the master version, verify that longhorn.io/last-applied-tolerations is not set for manager, driver deployer, UI
  3. Verify that the longhorn.io/managed-by: longhorn-manager label is added for: IM CRs, EI CRs, Share Manager CRs, IM pods, Share Manager pods, CSI services, CSI deployments, CSI daemonset
  4. Verify that longhorn.io/last-applied-tolerations is set for: IM pods, Share Manager pods, CSI deployments, CSI daemonset
  5. Update the chart/YAML to specify tolerations for manager, driver deployer, UI
  6. Verify that the tolerations get applied
  7. Repeat this test case with Longhorn v1.1.0 in step 1

Case 5: Node with taints

  1. Add some taints to all nodes in the cluster, e.g. key=value:NoSchedule (the resulting Node taint is sketched below)
  2. Repeat cases 2, 3, and 4
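
The taint from step 1 can be applied with kubectl taint nodes <node> key=value:NoSchedule and ends up on each Node object roughly as:

# Resulting taint in the Node spec after step 1 (node name omitted).
spec:
  taints:
  - key: key
    value: value
    effect: NoSchedule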

Case 6: Priority Class UI

  1. Change Priority Class setting in Longhorn UI
  2. Verify that Longhorn only updates the managed components

Case 7: Priority Class Helm

  1. Change the Priority Class in Helm for manager, driver, UI (see the values sketch below)
  2. Verify that only the priority class names of manager, driver, UI get updated
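
A minimal sketch of the Helm values change for this case, reusing the priorityClass keys from the values in Case 2 (the class name is illustrative and the PriorityClass must already exist in the cluster):

# Illustrative values change for Case 7: set a priority class only for the
# user-deployed components. "example-priority" is a placeholder PriorityClass.
longhornManager:
  priorityClass: example-priority
longhornDriver:
  priorityClass: example-priority
longhornUI:
  priorityClass: example-priority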

PhanLe1010 added a commit to PhanLe1010/longhorn that referenced this issue Mar 15, 2021
… Driver, Manager

Now that we only watch/update managed components, we should allow
user to specify Helm values for user deployed components (manager,
driver, UI).

Longhorn longhorn#2120

Signed-off-by: Phan Le <phan.le@rancher.com>
@longhorn-io-github-bot

longhorn-io-github-bot commented Mar 15, 2021

Pre-merged Checklist

PhanLe1010 added a commit to PhanLe1010/longhorn that referenced this issue Mar 24, 2021
… Driver, Manager

innobead pushed a commit that referenced this issue Mar 24, 2021
… Driver, Manager

@khushboo-rancher
Contributor

@PhanLe1010 For the upgrade scenario, users who have (or want) tolerations set for Longhorn-manager, UI, and Driver deployer with previous Longhorn versions will see the tolerations getting removed on upgrading to v1.1.1, and they will have to update the chart/YAML for Longhorn-manager etc. with the tolerations. Do we want to mention this somewhere in the release notes?
cc: @innobead

@PhanLe1010
Contributor

Yeah, good point. I agree that we should mention this behavior in the release notes.

@khushboo-rancher
Contributor

khushboo-rancher commented Apr 5, 2021

Verified with Longhorn-master - 04/06/2021

Validation - Fail

There are two different behaviors in the scenario below, depending on whether Longhorn is deployed with Helm or kubectl.

  1. Deploy Longhorn master with tolerations: the default tolerations as well as tolerations for Longhorn-manager, driver and UI.
  2. Once Longhorn is deployed, delete the toleration using the Longhorn UI.
    With kubectl
    -- No change to Longhorn-manager, Driver and UI.
    With Helm
    -- The tolerations on Longhorn-manager, Driver and UI get deleted and they get redeployed.

Also, the priority class doesn't get set for the recurring jobs when applied through the Longhorn UI. @PhanLe1010 Is this expected?

Update:
Validation - Pass

@khushboo-rancher
Contributor

Sorry, I used the wrong image to test the Helm installation. Updated #2120 (comment) to Validation - Pass.

The behavior is as below.

  1. For a new installation, the toleration and priority class for components other than Longhorn-manager, UI and Driver deployer can be set using the Longhorn UI or the default settings. To set the toleration and priority class for Longhorn-manager, UI and Driver deployer, users have to edit the YAML/values while deploying Longhorn, or later.
  2. For an upgrade, the already-set tolerations and priority class for Longhorn-manager, UI and Driver deployer will remain as they are and can be modified by editing the YAML/values while upgrading, or later. For the other components, they can be set using the Longhorn UI or the default settings.

We need to edit the manual test cases to match this behavior.
Created a bug for priority class not getting updated for recurring jobs. #2444

@PhanLe1010
Contributor

PhanLe1010 commented Apr 6, 2021

Thanks @khushboo-rancher for carefully testing

@khushboo-rancher
Contributor

Linking this issue with #2262 to track and validate the Rancher chart changes.

@meldafrawi
Contributor

Closing, will track integration test implementation in a separate issue #2482

scvalex pushed a commit to scvalex/longhorn that referenced this issue May 12, 2021
… Driver, Manager
