
[FEATURE] Support PriorityClass for Longhorn deployment #1487

Closed
zaggash opened this issue Jun 12, 2020 · 18 comments
Assignees: khushboo-rancher
Labels: area/install-uninstall-upgrade (Install, Uninstall or Upgrade related), component/longhorn-manager (Longhorn manager, control plane), highlight (Important feature/issue to highlight), priority/1 (Highly recommended to fix in this release, managed by PO), require/auto-e2e-test (Require adding/updating auto e2e test cases if they can be automated), require/doc (Require updating the longhorn.io documentation), require/manual-test-plan (Require adding/updating manual test cases if they can't be automated)
Milestone: v1.0.1

@zaggash

zaggash commented Jun 12, 2020

Is your feature request related to a problem? Please describe.
This is to avoid the situation where, as soon as a Longhorn pod is evicted, all other workloads that depend on the evicted pod can no longer work.

Describe the solution you'd like
In case of resource exhaustion (most probably memory), it would be nice if Longhorn could be evicted later than any other "normal" workload.

Describe alternatives you've considered
Is there a way to add something like a PriorityClass to provide the behavior described above?

gz#10575
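
For reference, a minimal sketch of the kind of PriorityClass this request would leverage (the name, value, and description below are purely illustrative, not anything Longhorn ships):

```yaml
# Illustrative only: a class that makes pods referencing it less likely to be
# evicted than ordinary pods, which default to priority 0.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: longhorn-critical        # hypothetical name
value: 1000000                   # higher than the default 0 of regular workloads
globalDefault: false
description: "Evict Longhorn components only after ordinary workloads."
```

Workloads opt in by setting `priorityClassName` in their Pod spec; pod priority is one of the criteria the kubelet uses when ranking pods for eviction under node resource pressure.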

@yasker yasker added this to the v1.0.1 milestone Jun 12, 2020
@yasker yasker changed the title [FEATURE] Support PriorityClass for LongHorn deployment [FEATURE] Support PriorityClass for Longhorn deployment Jun 12, 2020
@yasker yasker added area/install-uninstall-upgrade Install, Uninstall or Upgrade related component/longhorn-manager Longhorn manager (control plane) priority/1 Highly recommended to fix in this release (managed by PO) require/manual-test-plan Require adding/updating manual test cases if they can't be automated labels Jun 12, 2020
@ttpcodes
Contributor

ttpcodes commented Jun 16, 2020

Pre-merged Checklist

@ttpcodes
Contributor

ttpcodes commented Jun 16, 2020

Test Steps:

  1. Set up a Longhorn installation.
  2. Verify that the default Priority of all the Pods in the longhorn-system namespace is 0:
~ kubectl -n longhorn-system describe pods | grep Priority
# should be repeated many times
Priority:     0 
  3. In the Longhorn UI, under the Settings menu, set the Priority Class to a Priority Class that exists on the cluster. (By default, Kubernetes ships with system-cluster-critical and system-node-critical, both of which can be used for testing.)
  4. The output of kubectl -n longhorn-system get pods should indicate that Pods are being terminated or created as a result of changing this setting. Wait for all Pods to be running again.
  5. Check the Priority of the Pods again. The output of kubectl should indicate that the Priority Class has been applied to all of the Pods in longhorn-system:
~ kubectl -n longhorn-system describe pods | grep Priority
# should be repeated many more times
Priority:             2000001000  
Priority Class Name:  system-node-critical
  6. (Step for manually verifying new Engine Image DaemonSets) Add a new Engine Image. The output of kubectl (as above) should indicate that the new Pods created from the Engine Image have the Priority Class set properly.
  7. Attempt to set the Priority Class to the name of a non-existent Priority Class. When attempting to save the new setting, the UI should return an error and none of the workloads should be updated.
  8. Set the Priority Class back to nothing. Wait for the Pods to be terminated and recreated with the new setting. The Priority of the new Pods should match the output of Step 2.
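
To pick an existing class for step 3, the classes already on the cluster can be listed first (a minimal check; the output shown is illustrative for a stock cluster, and the values match the Priority seen above):
~ kubectl get priorityclass
NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            ...
system-node-critical      2000001000   false            ...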

@yasker yasker added require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/doc Require updating the longhorn.io documentation labels Jun 23, 2020
@yasker
Member

yasker commented Jun 23, 2020

@ttpcodes I've updated the required labels to add automated e2e tests and docs. I thought it was a modification at the YAML level, but since it's in the manager, we need e2e tests and docs. It's similar to the taint toleration feature.

@yasker
Member

yasker commented Jun 23, 2020

Also, make sure we've updated the default setting yaml/chart for any new settings.

Make sure to test the default setting in the manual test as well.
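
For reference, a sketch of how the new setting could be surfaced in the deployment YAML's default-setting ConfigMap; the ConfigMap name and data key below follow the existing settings layout and are assumptions, while the setting name priority-class matches the one used elsewhere in this issue:

```yaml
# Sketch only: default-setting entry, names assumed from the existing layout.
apiVersion: v1
kind: ConfigMap
metadata:
  name: longhorn-default-setting
  namespace: longhorn-system
data:
  default-setting.yaml: |-
    priority-class: ""   # empty by default; may be set to the name of an existing PriorityClass
```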

@ttpcodes
Contributor

ttpcodes commented Jul 1, 2020

Since the previous manual test plan is now covered by the longhorn-manager integration tests, only the manual tests for the default setting remain. Here are the three test cases:

  1. Install Longhorn with no priority-class set in the default settings. The Priority Class setting should be empty after the installation completes according to the longhorn-ui, and the default Priority of all Pods in the longhorn-system namespace should be 0:
~ kubectl -n longhorn-system describe pods | grep Priority
# should be repeated many times
Priority:     0 
  2. Install Longhorn with a nonexistent priority-class in the default settings. The system should fail to come online. The Priority Class setting should be set, and the status of the DaemonSet for the longhorn-manager should indicate that it failed due to an invalid Priority Class:
~ kubectl -n longhorn-system describe lhs priority-class
Name:         priority-class
...
Value:                 nonexistent-priority-class
...
~ kubectl -n longhorn-system describe daemonset.apps/longhorn-manager
Name:           longhorn-manager
...
  Priority Class Name:  nonexistent-priority-class
Events:
  Type     Reason            Age                From                  Message
  ----     ------            ----               ----                  -------
  Normal   SuccessfulCreate  23s                daemonset-controller  Created pod: longhorn-manager-gbskd
  Normal   SuccessfulCreate  23s                daemonset-controller  Created pod: longhorn-manager-9s7mg
  Normal   SuccessfulCreate  23s                daemonset-controller  Created pod: longhorn-manager-gtl2j
  Normal   SuccessfulDelete  17s                daemonset-controller  Deleted pod: longhorn-manager-9s7mg
  Normal   SuccessfulDelete  17s                daemonset-controller  Deleted pod: longhorn-manager-gbskd
  Normal   SuccessfulDelete  17s                daemonset-controller  Deleted pod: longhorn-manager-gtl2j
  Warning  FailedCreate      4s (x14 over 15s)  daemonset-controller  Error creating: pods "longhorn-manager-" is forbidden: no PriorityClass with name nonexistent-priority-class was found
  3. Install Longhorn with a valid priority-class in the default settings. The Priority Class setting should be set according to the longhorn-ui, and all the Pods in the longhorn-system namespace should have the right Priority set:
~ kubectl -n longhorn-system describe pods | grep Priority
# should be repeated many times
Priority:             2000001000  
Priority Class Name:  system-node-critical

@khushboo-rancher khushboo-rancher self-assigned this Jul 1, 2020
@khushboo-rancher
Contributor

Verified on master - 07-02-2020

Validation - Passed

Scenarios tested:

  1. Priority Class with no value - if memory is exhausted, the Longhorn pods get evicted.
  2. Priority Class set to system-cluster-critical or system-node-critical - the Longhorn pods are not evicted.
  3. Priority Class with a dummy value - the UI does not allow setting a dummy value.

@runningman84
Contributor

runningman84 commented Jul 3, 2020

By the way, this won't work on AWS or other public cloud providers that do not have the latest Kubernetes version:
kubernetes/kubernetes#76310

@yasker
Member

yasker commented Jul 6, 2020

@ttpcodes Regarding the comment #1487 (comment), can you check if PriorityClass has a Kubernetes version restriction?

@ttpcodes
Contributor

ttpcodes commented Jul 7, 2020

Priority Classes have been a stable feature in Kubernetes since 1.14. However, the case being referred to here seems to be the fact that before 1.17, Pods outside the kube-system namespace could not use system-node-critical or system-cluster-critical, the built-in Priority Classes that ship with Kubernetes. I validated this on a 1.16 cluster and confirmed that setting the Priority Class to system-node-critical causes the Longhorn workloads to fail to come back online.

@yasker
Member

yasker commented Jul 7, 2020

@ttpcodes Thanks for clarifying. That should be fine since none of them is the Longhorn default.

Also, thanks @runningman84 for reporting. Are you referring to the test steps like @ttpcodes mentioned above, or something else?

@khushboo-rancher
Contributor

khushboo-rancher commented Jul 7, 2020

As @ttpcodes mentioned, before 1.17 Pods could not set the Priority Class to system-node-critical or system-cluster-critical.
The latest Kubernetes version EKS offers is 1.15. The latest master of Longhorn deploys properly, but we cannot set the Priority Class to system-node-critical or system-cluster-critical; doing so causes the Longhorn pods to fail to come up healthy.

@runningman84
Contributor

Maybe you could make the priority class the default only if the Kubernetes version is high enough. What you could also do is create your own priority class with a numeric value close to that of system-node-critical (see the sketch below).
Or just wait a few months until 1.17 is the default anyway.
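
A sketch of such a custom class (the name is hypothetical; note that user-defined classes are capped at a value of 1000000000, so it cannot literally match system-node-critical's 2000001000, which sits in the reserved system range):

```yaml
# Illustrative workaround for clusters older than v1.17.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: longhorn-priority   # hypothetical name
value: 1000000000           # the highest value allowed for user-defined classes
globalDefault: false
description: "High-priority class for Longhorn components."
```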

@yasker
Member

yasker commented Jul 7, 2020

@runningman84 you know AWS might take a year or two to move to 1.17 :)

@ttpcodes we don't have a default priority class, right? Users should be able to just leave it blank.

@ttpcodes
Contributor

ttpcodes commented Jul 7, 2020

Yes, there's no default value for the Priority Class. The user can just run Longhorn without this set, or they can run with their own Priority Class until 1.17 when system-node-critical and system-cluster-critical can be used.

@yasker
Member

yasker commented Jul 7, 2020

That should be good enough for us.

yasker added a commit to yasker/longhorn that referenced this issue Jul 21, 2020
## Highlights:
1. [Make Longhorn Helm Chart directly available through Helm repo](longhorn#1043). Now users can use `helm repo add longhorn https://charts.longhorn.io` to find and install Longhorn easily through the Helm command line.
1. [NFS on Longhorn for RWX Access Mode (experimental)](longhorn#1183). We've provided a way to add a Longhorn-backed NFS provisioner to the cluster, so now users can use the `longhorn-nfs` StorageClass to provision RWX volumes. Note: currently the NFS provisioner is not highly available. If the node the NFS provisioner is running on goes down, the workloads using RWX volumes might be interrupted for a couple of minutes before the service is restored. Check [here](https://staging.longhorn.io/docs/1.0.0/advanced-resources/rwx-workloads/) for details.
1. Improved Air Gap Environment Support. Check [here](https://staging.longhorn.io/docs/1.0.0/advanced-resources/deploy/airgap/) for updated document for air gap installation.
    1. [Remove the air gap image name length limitation](longhorn#1323).
    1. [Change the default image pull policy from `Always` to `IfNotPresent`](longhorn#1491)
    1. [Aggregate the necessary Docker images](longhorn#1419)
    1. [HTTP/HTTPS_PROXY support for S3 backupstore](longhorn#1540)
1. [Support PriorityClass for Longhorn deployment](longhorn#1487). Check [here](https://staging.longhorn.io/docs/1.0.0/advanced-resources/deploy/priority-class/) for details.

Live upgrade from v1.0.0 to v1.0.1 is supported.

Signed-off-by: Sheng Yang <sheng.yang@rancher.com>
yasker added a commit that referenced this issue Jul 21, 2020
@drewhemm

I have set the priority-class config value to the name of a custom Priority Class, and yet none of the Longhorn workloads are using it. The Priority for all Longhorn pods remains at 0 - any ideas why?

@yasker
Member

yasker commented Oct 1, 2020

@drewhemm Can you file a new issue and add the steps to reproduce?

@drewhemm

drewhemm commented Oct 1, 2020

Sure - here it is:
#1838
