Cleanup page garbage-collection and nodes
Zhuzhenghao committed May 13, 2023
1 parent 028d490 commit 59545bf
Showing 2 changed files with 68 additions and 69 deletions.
78 changes: 39 additions & 39 deletions content/en/docs/concepts/architecture/garbage-collection.md
@@ -8,21 +8,21 @@ weight: 70
{{<glossary_definition term_id="garbage-collection" length="short">}} This
allows the cleanup of resources like the following:

* [Terminated pods](/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection)
* [Completed Jobs](/docs/concepts/workloads/controllers/ttlafterfinished/)
* [Objects without owner references](#owners-dependents)
* [Unused containers and container images](#containers-images)
* [Dynamically provisioned PersistentVolumes with a StorageClass reclaim policy of Delete](/docs/concepts/storage/persistent-volumes/#delete)
* [Stale or expired CertificateSigningRequests (CSRs)](/docs/reference/access-authn-authz/certificate-signing-requests/#request-signing-process)
* {{<glossary_tooltip text="Nodes" term_id="node">}} deleted in the following scenarios:
* On a cloud when the cluster uses a [cloud controller manager](/docs/concepts/architecture/cloud-controller/)
* On-premises when the cluster uses an addon similar to a cloud controller
manager
* [Node Lease objects](/docs/concepts/architecture/nodes/#heartbeats)

## Owners and dependents {#owners-dependents}

Many objects in Kubernetes link to each other through [*owner references*](/docs/concepts/overview/working-with-objects/owners-dependents/).
Owner references tell the control plane which objects are dependent on others.
Kubernetes uses owner references to give the control plane, and other API
clients, the opportunity to clean up related resources before deleting an
@@ -49,7 +49,7 @@ In v1.20+, if a cluster-scoped dependent specifies a namespaced kind as an owner,
it is treated as having an unresolvable owner reference, and is not able to be garbage collected.

In v1.20+, if the garbage collector detects an invalid cross-namespace `ownerReference`,
or a cluster-scoped dependent with an `ownerReference` referencing a namespaced kind, a warning Event
with a reason of `OwnerRefInvalidNamespace` and an `involvedObject` of the invalid dependent is reported.
You can check for that kind of Event by running
`kubectl get events -A --field-selector=reason=OwnerRefInvalidNamespace`.
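Once an Event points you at a problematic dependent, you can inspect the owner references recorded on that object directly. A quick check as a sketch (the ReplicaSet name `my-replicaset` and the `default` namespace are hypothetical placeholders):

```shell
# Print the owner references recorded on a dependent object.
# Substitute the kind, name, and namespace of the object named in the Event.
kubectl get replicaset my-replicaset --namespace default \
  -o jsonpath='{.metadata.ownerReferences}'
```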
@@ -61,31 +61,31 @@ Kubernetes checks for and deletes objects that no longer have owner
references, like the pods left behind when you delete a ReplicaSet. When you
delete an object, you can control whether Kubernetes deletes the object's
dependents automatically, in a process called *cascading deletion*. There are
two types of cascading deletion, as follows:

* Foreground cascading deletion
* Background cascading deletion

You can also control how and when garbage collection deletes resources that have
owner references using Kubernetes {{<glossary_tooltip text="finalizers" term_id="finalizer">}}.
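As a sketch of how these options surface on the command line, `kubectl delete` accepts a `--cascade` flag; the Deployment name `example-deployment` below is hypothetical, and each mode is described in the sections that follow:

```shell
# Foreground: delete the dependents first, then the owner.
kubectl delete deployment example-deployment --cascade=foreground

# Background (the default): delete the owner immediately and let the
# garbage collector remove the dependents afterwards.
kubectl delete deployment example-deployment --cascade=background

# Orphan: delete the owner but leave its dependents in place.
kubectl delete deployment example-deployment --cascade=orphan
```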

### Foreground cascading deletion {#foreground-deletion}

In foreground cascading deletion, the owner object you're deleting first enters
a *deletion in progress* state. In this state, the following happens to the
owner object:

* The Kubernetes API server sets the object's `metadata.deletionTimestamp`
field to the time the object was marked for deletion.
* The Kubernetes API server also sets the `metadata.finalizers` field to
`foregroundDeletion`.
* The object remains visible through the Kubernetes API until the deletion
process is complete.

After the owner object enters the deletion in progress state, the controller
deletes the dependents. After deleting all the dependent objects, the controller
deletes the owner object. At this point, the object is no longer visible in the
Kubernetes API.

During foreground cascading deletion, the only dependents that block owner
deletion are those that have the `ownerReference.blockOwnerDeletion=true` field.
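While a foreground deletion is in progress, you can observe this state on the owner object itself; a minimal check, assuming a Deployment named `example-deployment` (hypothetical):

```shell
# Both the deletion timestamp and the foregroundDeletion finalizer are
# visible on the owner while its dependents are still being cleaned up.
kubectl get deployment example-deployment \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
```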
@@ -113,7 +113,7 @@ to override this behaviour, see [Delete owner objects and orphan dependents](/do
The {{<glossary_tooltip text="kubelet" term_id="kubelet">}} performs garbage
collection on unused images every five minutes and on unused containers every
minute. You should avoid using external garbage collection tools, as these can
break the kubelet behavior and remove containers that should exist.

To configure options for unused container and image garbage collection, tune the
kubelet using a [configuration file](/docs/tasks/administer-cluster/kubelet-config-file/)
@@ -124,13 +124,13 @@ resource type.
### Container image lifecycle

Kubernetes manages the lifecycle of all images through its *image manager*,
which is part of the kubelet, with the cooperation of
{{< glossary_tooltip text="cadvisor" term_id="cadvisor" >}}. The kubelet
considers the following disk usage limits when making garbage collection
decisions:

* `HighThresholdPercent`
* `LowThresholdPercent`

Disk usage above the configured `HighThresholdPercent` value triggers garbage
collection, which deletes images in order based on the last time they were used,
@@ -140,17 +140,17 @@ until disk usage reaches the `LowThresholdPercent` value.
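As a sketch, these thresholds correspond to kubelet settings; the values below are illustrative, and the flag form shown here is only one way to set them (a `KubeletConfiguration` file exposes the equivalent `imageGCHighThresholdPercent` and `imageGCLowThresholdPercent` fields):

```shell
# Trigger image garbage collection above 85% disk usage and keep removing
# least-recently-used images until usage drops back to 80%.
kubelet --image-gc-high-threshold=85 --image-gc-low-threshold=80
```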
### Container garbage collection {#container-image-garbage-collection}

The kubelet garbage collects unused containers based on the following variables,
which you can define:

* `MinAge`: the minimum age at which the kubelet can garbage collect a
container. Disable by setting to `0`.
* `MaxPerPodContainer`: the maximum number of dead containers each Pod
can have. Disable by setting to less than `0`.
* `MaxContainers`: the maximum number of dead containers the cluster can have.
Disable by setting to less than `0`.

In addition to these variables, the kubelet garbage collects unidentified and
deleted containers, typically starting with the oldest first.
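A rough sketch of setting these variables on the kubelet command line, assuming the flags `--minimum-container-ttl-duration`, `--maximum-dead-containers-per-container`, and `--maximum-dead-containers` map to `MinAge`, `MaxPerPodContainer`, and `MaxContainers` respectively (values are illustrative):

```shell
# Keep dead containers for at least one minute, at most two per container,
# and at most 240 in total before they become eligible for garbage collection.
kubelet \
  --minimum-container-ttl-duration=1m \
  --maximum-dead-containers-per-container=2 \
  --maximum-dead-containers=240
```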

`MaxPerPodContainer` and `MaxContainers` may potentially conflict with each other
in situations where retaining the maximum number of containers per Pod
@@ -171,8 +171,8 @@ You can tune garbage collection of resources by configuring options specific to
the controllers managing those resources. The following pages show you how to
configure garbage collection:

* [Configuring cascading deletion of Kubernetes objects](/docs/tasks/administer-cluster/use-cascading-deletion/)
* [Configuring cleanup of finished Jobs](/docs/concepts/workloads/controllers/ttlafterfinished/)

<!-- * [Configuring unused container and image garbage collection](/docs/tasks/administer-cluster/reconfigure-kubelet/) -->

59 changes: 29 additions & 30 deletions content/en/docs/concepts/architecture/nodes.md
@@ -81,7 +81,7 @@ first and re-added after the update.
### Self-registration of Nodes

When the kubelet flag `--register-node` is true (the default), the kubelet will attempt to
register itself with the API server. This is the preferred pattern, used by most distros.

For self-registration, the kubelet is started with the following options:

@@ -122,7 +122,7 @@ Pods already scheduled on the Node may misbehave or cause issues if the Node
configuration is changed when the kubelet restarts. For example, an already running
Pod may be tainted against the new labels assigned to the Node, while other
Pods that are incompatible with that Pod will be scheduled based on this new
label. Node re-registration ensures all Pods will be drained and properly
re-scheduled.
{{< /note >}}
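As a rough sketch of forcing a re-registration after changing the kubelet's registration settings (the node name `node-1` and the `kubelet` systemd unit name are assumptions about your environment, and drain flags vary slightly between kubectl versions):

```shell
# Move workloads off the node, remove its Node object, then restart the
# kubelet so it registers itself again with the new configuration.
kubectl cordon node-1
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
kubectl delete node node-1
sudo systemctl restart kubelet   # run this on node-1 itself
```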

@@ -225,9 +225,9 @@ of the Node resource. For example, the following JSON structure describes a heal

When problems occur on nodes, the Kubernetes control plane automatically creates
[taints](/docs/concepts/scheduling-eviction/taint-and-toleration/) that match the conditions
affecting the node. An example of this is when the `status` of the Ready condition
remains `Unknown` or `False` for longer than the kube-controller-manager's `NodeMonitorGracePeriod`,
which defaults to 40 seconds. This will cause either a `node.kubernetes.io/unreachable` taint, for an `Unknown` status,
or a `node.kubernetes.io/not-ready` taint, for a `False` status, to be added to the Node.
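A quick way to look at both the conditions and any automatically added taints on a Node (the node name `node-1` is hypothetical):

```shell
# Show the Node's conditions (Ready, MemoryPressure, and so on).
kubectl get node node-1 \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'

# Show any taints currently set on the Node, such as
# node.kubernetes.io/unreachable or node.kubernetes.io/not-ready.
kubectl get node node-1 -o jsonpath='{.spec.taints}'
```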

These taints affect pending pods as the scheduler takes the Node's taints into consideration when
@@ -321,7 +321,7 @@ This period can be configured using the `--node-monitor-period` flag on the

### Rate limits on eviction

In most cases, the node controller limits the eviction rate to
`--node-eviction-rate` (default 0.1) per second, meaning it won't evict pods
from more than 1 node per 10 seconds.

@@ -345,7 +345,7 @@ then the eviction mechanism does not take per-zone unavailability into account.
A key reason for spreading your nodes across availability zones is so that the
workload can be shifted to healthy zones when one entire zone goes down.
Therefore, if all nodes in a zone are unhealthy, then the node controller evicts at
the normal rate of `--node-eviction-rate`. The corner case is when all zones are
completely unhealthy (none of the nodes in the cluster are healthy). In such a
case, the node controller assumes that there is some problem with connectivity
between the control plane and the nodes, and doesn't perform any evictions.
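These rates are kube-controller-manager settings; a sketch of tuning them with illustrative values (the exact set of flags you need depends on your control-plane setup):

```shell
# Illustrative values only; the defaults are usually sufficient.
kube-controller-manager \
  --node-monitor-grace-period=40s \
  --node-eviction-rate=0.1 \
  --secondary-node-eviction-rate=0.01 \
  --unhealthy-zone-threshold=0.55 \
  --large-cluster-size-threshold=50
```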
@@ -550,36 +550,36 @@ are emitted under the kubelet subsystem to monitor node shutdowns.

{{< feature-state state="beta" for_k8s_version="v1.26" >}}

A node shutdown action may not be detected by kubelet's Node Shutdown Manager,
either because the command does not trigger the inhibitor locks mechanism used by
kubelet or because of a user error, i.e., the ShutdownGracePeriod and
ShutdownGracePeriodCriticalPods are not configured properly. Please refer to the
[Graceful Node Shutdown](#graceful-node-shutdown) section above for more details.
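On a systemd-based node you can check whether the inhibitor-lock mechanism that kubelet relies on is in place; a rough sketch, assuming a default logind configuration path:

```shell
# List current shutdown inhibitor locks; with graceful node shutdown
# configured, a kubelet entry typically appears here.
systemd-inhibit --list

# Check how long systemd-logind allows inhibitors to delay a shutdown.
grep InhibitDelayMaxSec /etc/systemd/logind.conf
```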

When a node is shut down but not detected by kubelet's Node Shutdown Manager, the pods
that are part of a {{< glossary_tooltip text="StatefulSet" term_id="statefulset" >}} will be stuck in terminating status on
the shutdown node and cannot move to a new running node. This is because kubelet on
the shutdown node is not available to delete the pods so the StatefulSet cannot
create a new pod with the same name. If there are volumes used by the pods, the
VolumeAttachments will not be deleted from the original shutdown node so the volumes
used by these pods cannot be attached to a new running node. As a result, the
application running on the StatefulSet cannot function properly. If the original
shutdown node comes up, the pods will be deleted by kubelet and new pods will be
created on a different running node. If the original shutdown node does not come up,
these pods will be stuck in terminating status on the shutdown node forever.
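To check whether you are in this situation, you can list the Pods still bound to the shut-down node and the VolumeAttachments that still reference it (the node name `node-1` is hypothetical):

```shell
# Pods stuck in Terminating on the shut-down node.
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=node-1

# VolumeAttachments that still reference the shut-down node.
kubectl get volumeattachments
```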

To mitigate the above situation, a user can manually add the taint `node.kubernetes.io/out-of-service` with either `NoExecute`
or `NoSchedule` effect to a Node, marking it out-of-service.
If the `NodeOutOfServiceVolumeDetach` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
is enabled on {{< glossary_tooltip text="kube-controller-manager" term_id="kube-controller-manager" >}}, and a Node is marked out-of-service with this taint, the
pods on the node will be forcefully deleted if there are no matching tolerations on it and volume
detach operations for the pods terminating on the node will happen immediately. This allows the
Pods on the out-of-service node to recover quickly on a different node.
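A sketch of applying that taint from the command line; the node name `node-1` and the taint value `nodeshutdown` are illustrative:

```shell
# Mark the shut-down node as out of service so that its pods can be
# force-deleted and their volumes detached.
kubectl taint nodes node-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
```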

During a non-graceful shutdown, Pods are terminated in two phases:

1. Force delete the Pods that do not have matching `out-of-service` tolerations.
2. Immediately perform detach volume operation for such pods.

{{< note >}}
- Before adding the taint `node.kubernetes.io/out-of-service`, it should be verified
@@ -641,10 +641,9 @@ see [KEP-2400](https://github.com/kubernetes/enhancements/issues/2400) and its
## {{% heading "whatsnext" %}}

Learn more about the following:

* [Components](/docs/concepts/overview/components/#node-components) that make up a node.
* [API definition for Node](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#node-v1-core).
* [Node](https://git.k8s.io/design-proposals-archive/architecture/architecture.md#the-kubernetes-node) section of the architecture design document.
* [Taints and Tolerations](/docs/concepts/scheduling-eviction/taint-and-toleration/).
* [Node Resource Managers](/docs/concepts/policy/node-resource-managers/).
* [Resource Management for Windows nodes](/docs/concepts/configuration/windows-resource-management/).
