
fix leaking memory backed volumes of terminated pods #36779

Conversation

@sjenning sjenning commented Nov 14, 2016

Currently, we allow volumes to remain mounted on the node, even though the pod is terminated. This creates a vector for a malicious user to exhaust memory on the node by creating memory backed volumes containing large files.

This PR removes memory backed volumes (emptyDir w/ medium Memory, secrets, configmaps) of terminated pods from the node.

@saad-ali @derekwaynecarr



@k8s-github-robot k8s-github-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. release-note-label-needed labels Nov 14, 2016
// Iterate through all pods and add to the desired state of the world if
// they don't exist but should.
func (dswp *desiredStateOfWorldPopulator) findAndAddNewPods() {
	for _, pod := range dswp.podManager.GetPods() {
		if isPodTerminated(pod) {
			// Do not (re)add volumes for terminated pods.
			continue
		}
Member:
Unless "non-memory backed volume"?

Contributor Author (@sjenning):
I thought about this but I couldn't think of a good reason to add a volume back to a node for a pod that is terminated. The only reason I'm filtering the ones we remove is to not make waves for the release. In general, I can't think of a reason we would want to leave volumes attached to a node for terminated pods except for debugging.

Member:

> I thought about this but I couldn't think of a good reason to add a volume back to a node for a pod that is terminated. The only reason I'm filtering the ones we remove is to not make waves for the release. In general, I can't think of a reason we would want to leave volumes attached to a node for terminated pods except for debugging.

Neither can I, but in the interest of minimizing potentially disruptive changes, it makes sense to limit the scope of the change as much as possible. Thoughts?

Contributor Author (@sjenning):

@saad-ali the change required to check that actually makes the change more invasive because I need to get the volume. In findAndRemoveDeletedPods(), that is already available via volumeToMount. In findAndAddNewPods(), the logic is inverted; for each pod, process volumes. I'm not sure if it is safe to call dswp.desiredStateOfWorld.GetVolumesToMount() in findAndAddNewPods().

I can extend my e2e tests to include a check that ensures non-memory backed volumes are untouched by this change?

Member:

Ack. That's fine. I'm a little nervous about it, but let's just get it in and give it time to bake and keep an eye out for strange volume behaviors.

@saad-ali (Member)

@yujuhong Can you also take a look at this PR?

@derekwaynecarr derekwaynecarr added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels Nov 15, 2016
@derekwaynecarr (Member)

Is it possible to write a node E2E test that verifies this behavior?

@derekwaynecarr (Member) commented Nov 15, 2016

For example, can we do a test like the following:

  1. launch a pod that writes into a memory-backed emptyDir
  2. terminate said pod (but don't delete it)
  3. launch a new pod that mounts the kubelet volume host path
  4. have that pod wait a minimum amount of time to verify the volume from step 2 is deleted

See example of something similar here:
https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/cgroup_manager_test.go#L104

@sjenning (Contributor Author)

@derekwaynecarr I'm looking into it

@yujuhong (Contributor)

Change LGTM. Like @derekwaynecarr suggested, an e2e test would be nice.

@sjenning sjenning force-pushed the fix-memory-leak-via-terminated-pods branch from eaa1be0 to 010a826 Compare November 15, 2016 19:21
@sjenning (Contributor Author)

@derekwaynecarr @yujuhong updated PR with e2e node test

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 15, 2016
@k8s-ci-robot (Contributor)

Jenkins GCE Node e2e failed for commit 010a826c6a0b25dbcc0913357acba92fdeb1e2c3. Full PR test history.

The magic incantation to run this job again is @k8s-bot node e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot (Contributor)

Jenkins verification failed for commit 010a826c6a0b25dbcc0913357acba92fdeb1e2c3. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@sjenning sjenning force-pushed the fix-memory-leak-via-terminated-pods branch from 010a826 to 3dc3de1 Compare November 15, 2016 20:28
{
	Name: "kubelet-pods",
	VolumeSource: api.VolumeSource{
		HostPath: &api.HostPathVolumeSource{Path: "/var/lib/kubelet/pods"},
	},
},
Member:

Can you put a `// TODO` to pull this value from the node test context (it needs to be added)?

@derekwaynecarr (Member)

one nit for a todo, otherwise this LGTM.

@saad-ali - can you take a pass and mark for 1.5 milestone if you agree?

@saad-ali (Member)

One minor comment about potentially minimizing the change further

@saad-ali (Member)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 18, 2016
@saad-ali saad-ali added this to the v1.5 milestone Nov 18, 2016
@k8s-github-robot
Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit eca9e98 into kubernetes:master Nov 18, 2016
@sjenning sjenning deleted the fix-memory-leak-via-terminated-pods branch November 21, 2016 16:17
k8s-github-robot pushed a commit that referenced this pull request Jan 6, 2017
Automatic merge from submit-queue (batch tested with PRs 39493, 39496)

kubelet: fix nil deref in volume type check

An attempt to address memory exhaustion through a build-up of terminated pods with memory-backed volumes on the node in PR #36779 introduced this.

For the `VolumeSpec`, either the `Volume` or `PersistentVolume` field is set, not both. This results in a nil deref on PVs. Since PVs are inherently not memory-backed, only local/ephemeral volumes should be considered.

This needs to go into 1.5 as well.

Fixes #39480

@saad-ali @derekwaynecarr @grosskur @gnufied

```release-note
fixes nil dereference when doing a volume type check on persistent volumes
```
k8s-github-robot pushed a commit that referenced this pull request Jan 20, 2017
Automatic merge from submit-queue (batch tested with PRs 37228, 40146, 40075, 38789, 40189)

kubelet: storage: teardown terminated pod volumes

This is a continuation of the work done in #36779

There really is no reason to keep volumes for terminated pods attached on the node.  This PR extends the removal of volumes on the node from memory-backed (the current policy) to all volumes.

@pmorie raised a concern about the impact on debugging volume-related issues if terminated pod volumes are removed. To address this, the PR adds a `--keep-terminated-pod-volumes` flag to the kubelet and sets it for `hack/local-up-cluster.sh`.

For consideration in 1.6.

Fixes #35406

@derekwaynecarr @vishh @dashpole

```release-note
kubelet tears down pod volumes on pod termination rather than pod deletion
```
@ghost commented Feb 14, 2018

@sjenning Should this change also delete the memory-backed emptyDir volume for OOMKilled terminated containers? I still get an OOMKilled loop when my cgroup memory limit is reached because of large files written to the emptyDir, with OpenShift 3.6.
