
[FEATURE] Support PriorityClass for Longhorn deployment #1487

Closed
zaggash opened this issue Jun 12, 2020 · 18 comments
Assignees: khushboo-rancher
Labels: area/install-uninstall-upgrade (Install, Uninstall or Upgrade related), component/longhorn-manager (Longhorn manager, control plane), highlight (Important feature/issue to highlight), priority/1 (Highly recommended to fix in this release, managed by PO), require/auto-e2e-test (Require adding/updating auto e2e test cases if they can be automated), require/doc (Require updating the longhorn.io documentation), require/manual-test-plan (Require adding/updating manual test cases if they can't be automated)
Milestone: v1.0.1

@zaggash

zaggash commented Jun 12, 2020

Is your feature request related to a problem? Please describe.
This is to avoid the situation where, as soon as a Longhorn pod is evicted, all other workloads that depend on the evicted pod can no longer work.

Describe the solution you'd like
In case of resource exhaustion (most probably memory), it would be nice if Longhorn could be evicted later than any other "normal" workload.

Describe alternatives you've considered
Is there a way to add something like a PriorityClass to provide the behavior described above?

gz#10575
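
For reference, a minimal sketch of the kind of PriorityClass this request would leverage (the name, value, and description below are purely illustrative, not anything Longhorn ships):

```yaml
# Illustrative only: a class that makes pods referencing it less likely to be
# evicted than ordinary pods, which default to priority 0.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: longhorn-critical        # hypothetical name
value: 1000000                   # higher than the default 0 of regular workloads
globalDefault: false
description: "Evict Longhorn components only after ordinary workloads."
```

Workloads opt in by setting `priorityClassName` in their Pod spec; pod priority is one of the criteria the kubelet uses when ranking pods for eviction under node resource pressure.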

@yasker yasker added this to the v1.0.1 milestone Jun 12, 2020
@yasker yasker changed the title [FEATURE] Support PriorityClass for LongHorn deployment [FEATURE] Support PriorityClass for Longhorn deployment Jun 12, 2020
@yasker yasker added area/install-uninstall-upgrade Install, Uninstall or Upgrade related component/longhorn-manager Longhorn manager (control plane) priority/1 Highly recommended to fix in this release (managed by PO) require/manual-test-plan Require adding/updating manual test cases if they can't be automated labels Jun 12, 2020
@ttpcodes
Contributor

ttpcodes commented Jun 16, 2020

Pre-merged Checklist

@ttpcodes
Contributor

ttpcodes commented Jun 16, 2020

Test Steps:

  1. Set up a Longhorn installation.
  2. Verify that the default Priority of all the Pods in the longhorn-system namespace is 0:
~ kubectl -n longhorn-system describe pods | grep Priority
# should be repeated many times
Priority:     0 
  3. In the Longhorn UI, under the Settings menu, set the Priority Class to a Priority Class that exists on the cluster. (By default, Kubernetes ships with system-cluster-critical and system-node-critical, both of which can be used for testing.)
  4. The output of kubectl -n longhorn-system get pods should indicate that Pods are being terminated or created as a result of changing this setting. Wait for all Pods to be running again.
  5. Check the Priority of the Pods again. The output of kubectl should indicate that the Priority Class has been applied to all of the Pods in longhorn-system:
~ kubectl -n longhorn-system describe pods | grep Priority
# should be repeated many more times
Priority:             2000001000  
Priority Class Name:  system-node-critical
  6. (Step for manually verifying new Engine Image DaemonSets) Add a new Engine Image. The output of kubectl (as above) should indicate that the new Pods created from the Engine Image have the Priority Class set properly.
  7. Attempt to set the Priority Class to the name of a non-existent Priority Class. When attempting to save the new setting, the UI should return an error and none of the workloads should be updated.
  8. Set the Priority Class back to nothing. Wait for the Pods to be terminated and recreated with the new setting. The Priority of the new Pods should match the output of Step 2.
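
To pick an existing class for step 3, the classes already on the cluster can be listed first (a minimal check; the output shown is illustrative for a stock cluster, and the values match the Priority seen above):
~ kubectl get priorityclass
NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            ...
system-node-critical      2000001000   false            ...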

@yasker yasker added require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/doc Require updating the longhorn.io documentation labels Jun 23, 2020
@yasker
Member

yasker commented Jun 23, 2020

@ttpcodes I've updated the required labels to add automated e2e tests and docs. I thought it was a modification at the YAML level, but since it's in the manager, we need e2e tests and docs. It's similar to the taint toleration feature.

@yasker
Member

yasker commented Jun 23, 2020

Also, make sure we've updated the default setting yaml/chart for any new settings.

Make sure to test the default setting in the manual test as well.
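
For reference, a sketch of how the new setting could be surfaced in the deployment YAML's default-setting ConfigMap; the ConfigMap name and data key below follow the existing settings layout and are assumptions, while the setting name priority-class matches the one used elsewhere in this issue:

```yaml
# Sketch only: default-setting entry, names assumed from the existing layout.
apiVersion: v1
kind: ConfigMap
metadata:
  name: longhorn-default-setting
  namespace: longhorn-system
data:
  default-setting.yaml: |-
    priority-class: ""   # empty by default; may be set to the name of an existing PriorityClass
```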

@ttpcodes
Contributor

ttpcodes commented Jul 1, 2020

Since the previous manual test plan is now covered by the longhorn-manager integration tests, only the manual tests for the default setting remain. Here are the three test cases:

  1. Install Longhorn with no priority-class set in the default settings. The Priority Class setting should be empty after the installation completes according to the longhorn-ui, and the default Priority of all Pods in the longhorn-system namespace should be 0:
~ kubectl -n longhorn-system describe pods | grep Priority
# should be repeated many times
Priority:     0 
  2. Install Longhorn with a nonexistent priority-class in the default settings. The system should fail to come online. The Priority Class setting should be set, and the status of the DaemonSet for the longhorn-manager should indicate that it failed due to an invalid Priority Class:
~ kubectl -n longhorn-system describe lhs priority-class
Name:         priority-class
...
Value:                 nonexistent-priority-class
...
~ kubectl -n longhorn-system describe daemonset.apps/longhorn-manager
Name:           longhorn-manager
...
  Priority Class Name:  nonexistent-priority-class
Events:
  Type     Reason            Age                From                  Message
  ----     ------            ----               ----                  -------
  Normal   SuccessfulCreate  23s                daemonset-controller  Created pod: longhorn-manager-gbskd
  Normal   SuccessfulCreate  23s                daemonset-controller  Created pod: longhorn-manager-9s7mg
  Normal   SuccessfulCreate  23s                daemonset-controller  Created pod: longhorn-manager-gtl2j
  Normal   SuccessfulDelete  17s                daemonset-controller  Deleted pod: longhorn-manager-9s7mg
  Normal   SuccessfulDelete  17s                daemonset-controller  Deleted pod: longhorn-manager-gbskd
  Normal   SuccessfulDelete  17s                daemonset-controller  Deleted pod: longhorn-manager-gtl2j
  Warning  FailedCreate      4s (x14 over 15s)  daemonset-controller  Error creating: pods "longhorn-manager-" is forbidden: no PriorityClass with name nonexistent-priority-class was found
  3. Install Longhorn with a valid priority-class in the default settings. The Priority Class setting should be set according to the longhorn-ui, and all the Pods in the longhorn-system namespace should have the right Priority set:
~ kubectl -n longhorn-system describe pods | grep Priority
# should be repeated many times
Priority:             2000001000  
Priority Class Name:  system-node-critical

@khushboo-rancher khushboo-rancher self-assigned this Jul 1, 2020
@khushboo-rancher
Contributor

Verified on master - 07-02-2020

Validation - Passed

Scenarios tested:

  1. Priority Class with no value - if memory is exhausted, the Longhorn pods get evicted.
  2. Priority Class set to system-cluster-critical or system-node-critical - the Longhorn pods are not evicted.
  3. Priority Class with a dummy value - the UI does not allow setting a dummy value.

@runningman84
Contributor

runningman84 commented Jul 3, 2020

By the way, this won't work on AWS or other public cloud providers that do not have the latest Kubernetes version:
kubernetes/kubernetes#76310

@yasker
Member

yasker commented Jul 6, 2020

@ttpcodes Regarding the comment #1487 (comment), can you check if PriorityClass has a Kubernetes version restriction?

@ttpcodes
Contributor

ttpcodes commented Jul 7, 2020

Priority Classes have been a stable feature in Kubernetes since 1.14. However, the case being referred to here seems to be the fact that before 1.17, Pods outside the kube-system namespace could not use system-node-critical or system-cluster-critical, the built-in Priority Classes that ship with Kubernetes. I validated this on a 1.16 cluster and confirmed that setting the Priority Class to system-node-critical causes the Longhorn workloads to fail to come back online.

@yasker
Member

yasker commented Jul 7, 2020

@ttpcodes Thanks for clarifying. That should be fine since none of them is the Longhorn default.

Also, thanks @runningman84 for reporting. Are you referring to the test steps like @ttpcodes mentioned above, or something else?

@khushboo-rancher
Contributor

khushboo-rancher commented Jul 7, 2020

As @ttpcodes mentioned, before 1.17 Pods could not set the Priority Class to system-node-critical or system-cluster-critical.
The latest Kubernetes version EKS offers is 1.15. The latest master of Longhorn deploys properly, but we cannot set the Priority Class to system-node-critical or system-cluster-critical; doing so causes the Longhorn pods to fail to come up healthy.

@runningman84
Contributor

Maybe you could make the priority class the default only if the Kubernetes version is high enough. What you could also do is create your own priority class with a numeric value close to that of system-node-critical (see the sketch below).
Or just wait a few months until 1.17 is the default anyway.
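
A sketch of such a custom class (the name is hypothetical; note that user-defined classes are capped at a value of 1000000000, so it cannot literally match system-node-critical's 2000001000, which sits in the reserved system range):

```yaml
# Illustrative workaround for clusters older than v1.17.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: longhorn-priority   # hypothetical name
value: 1000000000           # the highest value allowed for user-defined classes
globalDefault: false
description: "High-priority class for Longhorn components."
```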

@yasker
Member

yasker commented Jul 7, 2020

@runningman84 you know AWS might take a year or two to move to 1.17 :)

@ttpcodes we don't have a default priority class, right? Users should be able to just leave it blank.

@ttpcodes
Contributor

ttpcodes commented Jul 7, 2020

Yes, there's no default value for the Priority Class. The user can just run Longhorn without this set, or they can run with their own Priority Class until 1.17 when system-node-critical and system-cluster-critical can be used.

@yasker
Member

yasker commented Jul 7, 2020

That should be good enough for us.

yasker added a commit to yasker/longhorn that referenced this issue Jul 21, 2020
## Highlights:
1. [Make Longhorn Helm Chart directly available through Helm repo](longhorn#1043). Now users can use `helm repo add longhorn https://charts.longhorn.io` to find and install Longhorn easily through the Helm command line.
1. [NFS on Longhorn for RWX Access Mode (experimental)](longhorn#1183). We've provided a way to add a Longhorn-backed NFS provisioner to the cluster, so now users can use the `longhorn-nfs` StorageClass to provision RWX volumes. Note: currently the NFS provisioner is not highly available. If the node the NFS provisioner is running on goes down, the workloads using RWX volumes might be interrupted for a couple of minutes before the service is restored. Check [here](https://staging.longhorn.io/docs/1.0.0/advanced-resources/rwx-workloads/) for details.
1. Improved Air Gap Environment Support. Check [here](https://staging.longhorn.io/docs/1.0.0/advanced-resources/deploy/airgap/) for updated document for air gap installation.
    1. [Remove the air gap image name length limitation](longhorn#1323).
    1. [Change the default image pull policy from `Always` to `IfNotPresent`](longhorn#1491)
    1. [Aggregate the necessary Docker images](longhorn#1419)
    1. [HTTP/HTTPS_PROXY support for S3 backupstore](longhorn#1540)
1. [Support PriorityClass for Longhorn deployment](longhorn#1487). Check [here](https://staging.longhorn.io/docs/1.0.0/advanced-resources/deploy/priority-class/) for details.

Live upgrade from v1.0.0 to v1.0.1 is supported.

Signed-off-by: Sheng Yang <sheng.yang@rancher.com>
yasker added a commit that referenced this issue Jul 21, 2020
@drewhemm

I have set the priority-class config value to the name of a custom Priority Class, and yet none of the Longhorn workloads are using it. The Priority for all Longhorn pods remains at 0 - any ideas why?

@yasker
Member

yasker commented Oct 1, 2020

@drewhemm Can you file a new issue and add the steps to reproduce?

@drewhemm

drewhemm commented Oct 1, 2020

Sure - here it is:
#1838
