Skip to content

Commit

Permalink
Update non-graceful node shutdown to beta
Browse files Browse the repository at this point in the history
  • Loading branch information
xing-yang committed Jun 9, 2022
1 parent 6a4aadc commit f9442d1
Show file tree
Hide file tree
Showing 3 changed files with 74 additions and 12 deletions.
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-storage/2268.yaml
@@ -1,3 +1,5 @@
kep-number: 2268
alpha:
approver: "@deads2k"
beta:
approver: "@deads2k"
80 changes: 70 additions & 10 deletions keps/sig-storage/2268-non-graceful-shutdown/README.md
Expand Up @@ -16,8 +16,10 @@ This includes the Summary and Motivation sections.
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Test Plan](#test-plan)
- [Unit tests](#unit-tests)
- [E2E tests](#e2e-tests)
- [Prerequisite testing updates](#prerequisite-testing-updates)
- [Unit tests](#unit-tests)
- [Integration tests](#integration-tests)
- [E2E tests](#e2e-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Alpha -> Beta Graduation](#alpha---beta-graduation)
Expand Down Expand Up @@ -144,15 +146,28 @@ To mitigate this we plan to have a high test coverage and to introduce this enha

### Test Plan

### Unit tests
#### Prerequisite testing updates

There are existing tests for Pod GC controller: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/node/pod_gc.go

There are existing tests for attach detach controller. Creating a pod that uses PVCs which will test attach: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/pod/create.go
Deleting a pod will trigger PVC to be detached: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/pod/delete.go

#### Unit tests
* Add unit tests to affected components in kube-controller-manager:
* Add tests in Pod GC Controller for the new logic to clean up pods and the `out-of-service` taint.
* Add tests in Attachdetach Controller for the changed logic that allow volumes to be forcefully detached without wait.

### E2E tests
#### Integration tests

Add a test for forcefully terminating pods in https://github.com/kubernetes/kubernetes/blob/master/test/integration/garbagecollector/garbage_collector_test.go.

Add a test for force detach without waiting for 6 minutes in https://github.com/kubernetes/kubernetes/blob/master/test/integration/volume/attach_detach_test.go

#### E2E tests
* New E2E tests to validate workloads move successfully to another running node when a node is shutdown.
* Feature gate for `NonGracefulFailover` is disabled, feature is not active.
* Feature gate for `NonGracefulFailover` is enabled. Add `out-of-service` taint after node is shutdown:
* Feature gate for `NodeOutOfServiceVolumeDetach` is disabled, feature is not active.
* Feature gate for `NodeOutOfServiceVolumeDetach` is enabled. Add `out-of-service` taint after node is shutdown:
* Verify workloads are moved to another node successfully.
* Verify the `out-of-service` taint is removed after the shutdown node is cleaned up.
* Add stress and scale tests before moving from beta to GA.
Expand Down Expand Up @@ -217,7 +232,7 @@ _This section must be completed when targeting alpha to a release._

* **How can this feature be enabled / disabled in a live cluster?**
- [ ] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: NonGracefulFailover
- Feature gate name: NodeOutOfServiceVolumeDetach
- Components depending on the feature gate: kube-controller-manager
- [ ] Other
- Describe the mechanism:
Expand Down Expand Up @@ -266,17 +281,24 @@ _This section must be completed when targeting beta graduation to a release._
* **How can a rollout fail? Can it impact already running workloads?**
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
The rollout should not fail. Feature gate only needs to be enabled on `kube-controller-manager`. So it is either enabled or disabled.

* **What specific metrics should inform a rollback?**
If for some reason, the user does not want the workload to failover to a
different running node after the original node is shutdown and the `out-of-service` taint is applied, a rollback can be done. I don't see why it is needed though as
user can prevent the failover from happening by not applying the `out-of-service`
taint.

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
We will manually test upgrade from 1.24 to 1.25 and rollback from 1.25 to 1.24.

* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
fields of API types, flags, etc.?**
Even if applying deprecation policies, they may still surprise some users.
No.

### Monitoring Requirements

Expand All @@ -286,6 +308,10 @@ _This section must be completed when targeting beta graduation to a release._
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
An operator can check if the `NodeOutOfServiceVolumeDetach` feature gate is
enabled and if there is an `out-of-service` taint on the shutdown node.
The usage of this feature requires the manual step of applying a taint
so the operator should be the one applying it.

* **What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?**
Expand All @@ -294,7 +320,9 @@ the health of the service?**
- [Optional] Aggregation method:
- Components exposing the metric:
- [ ] Other (treat as last resort)
- Details:
- Details: Check whether the workload moved to a different running node
after the original node is shutdown and the `out-of-service` taint
is applied on the shutdown node.

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
At a high level, this usually will be in the form of "high percentile of SLI
Expand All @@ -304,6 +332,8 @@ the health of the service?**
- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job <= 10%
- 99,9% of /health requests per day finish with 200 code
The failover should always happen if the feature gate is enabled, the taint
is applied, and there are other running nodes.

* **Are there any missing metrics that would be useful to have to improve observability
of this feature?**
Expand All @@ -320,13 +350,17 @@ _This section must be completed when targeting beta graduation to a release._
optional services that are needed. For example, if this feature depends on
a cloud provider API, or upon an external software-defined storage or network
control plane.
If the workload is running on a StatefulSet, it depends on the CSI driver.

For each of these, fill in the following—thinking about running existing user workloads
and creating new ones, as well as about cluster-level services (e.g. DNS):
- [Dependency name]
- [Dependency name] CSI driver
- Usage description:
- Impact of its outage on the feature:
- Impact of its outage on the feature: If the CSI driver is not running,
the pod cannot use the persistent volume any more so the workload will
not be running properly.
- Impact of its degraded performance or high-error rates on the feature:
Workload does not work properly if the CSI driver is down.

### Scalability

Expand All @@ -349,27 +383,32 @@ previous answers based on experience in the field._
(e.g. update of object X triggers new updates of object Y)
- periodic API calls to reconcile state (e.g. periodic fetching state,
heartbeats, leader election, etc.)
No.

* **Will enabling / using this feature result in introducing new API types?**
Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
No.

* **Will enabling / using this feature result in any new calls to the cloud
provider?**
No.

* **Will enabling / using this feature result in increasing size or count of
the existing API objects?**
Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
No.

* **Will enabling / using this feature result in increasing time taken by any
operations covered by [existing SLIs/SLOs]?**
Think about adding additional work or introducing new steps in between
(e.g. need to do X to start a container), etc. Please describe the details.
No.

* **Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, ...) in any components?**
Expand All @@ -378,6 +417,7 @@ resource usage (CPU, RAM, disk, IO, ...) in any components?**
volume), significant amount of data sent and/or received over network, etc.
This through this both in small and large cases, again with respect to the
[supported limits].
No.

### Troubleshooting

Expand All @@ -388,20 +428,40 @@ details). For now, we leave it here.
_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**
If API server or etcd is not available, we can't get accurate status of node or pod.
However the usage of this feature is very manual so an operator can verify
before applying the taint.

* **What are other known failure modes?**
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
After applying the `out-of-service` taint, if the workload does not move
to a different running node immediately, that is an indicator something
might be wrong.
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
So if the workload does not failover, it behaves the same as when this
feature is not enabled. The operator should try to find out why the
failover didn't happen.
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
Set log level to at least 4.
For example, the following message is in GC Controller if the feature is
enabled and `out-of-service` taint is applied. If the pods are forcefully
deleted by the GC Controller, this message should show up.
klog.V(4).Infof("garbage collecting pod %s that is terminating. Phase [%v]", pod.Name, pod.Status.Phase)
There is also a message in Attach Detach Controller that checks the taint.
If the taint is applied and feature gate is enabled, it force detaches the
volume without waiting for 6 minutes.
klog.V(4).Infof("node %q has out-of-service taint", attachedVolume.NodeName)
- Testing: Are there any tests for failure mode? If not, describe why.
We have unit tests that cover different combination of pod and node statuses.

* **What steps should be taken if SLOs are not being met to determine the problem?**
In that case, we need to go through the logs and find out the root cause.

[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
Expand Down
4 changes: 2 additions & 2 deletions keps/sig-storage/2268-non-graceful-shutdown/kep.yaml
Expand Up @@ -20,12 +20,12 @@ see-also:
replaces:

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.24"
latest-milestone: "v1.25"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
Expand Down

0 comments on commit f9442d1

Please sign in to comment.