
Fix memory leak from not closing hcs containers #78594

Merged

Conversation

@benmoss
Member

commented May 31, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
Windows containers opened through hcsshim.OpenContainer need to be freed through container.Close() or they leak memory. This is a memory dump from a Windows kubelet I took with Go's pprof:
[pprof heap profile screenshot attached: profile001]
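For reference, a minimal sketch of the pattern behind the leak, assuming the github.com/Microsoft/hcsshim API used by the Windows kubelet (OpenContainer, Statistics, Close); the package, function, and variable names here are illustrative, not the kubelet's actual code:

```go
// Illustrative only: a minimal sketch of the leak, not the kubelet's actual code.
package winstats

import "github.com/Microsoft/hcsshim"

// collectStats shows that every handle returned by hcsshim.OpenContainer must
// be closed, on both the success and error paths, or its memory is never freed.
func collectStats(containerIDs []string) {
	for _, id := range containerIDs {
		container, err := hcsshim.OpenContainer(id)
		if err != nil {
			continue // nothing was opened, so there is nothing to close
		}
		stats, err := container.Statistics()
		if err != nil {
			container.Close() // without this, the handle leaks on the error path
			continue
		}
		_ = stats // the kubelet would convert these statistics into its own metrics types
		container.Close() // releasing the handle is what this PR adds
	}
}
```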

Which issue(s) this PR fixes:
Fixes #78555

Does this PR introduce a user-facing change?:

Fixes a memory leak in the Kubelet on Windows caused by not closing containers when fetching container metrics

/assign @feiskyer @PatrickLang
/sig windows

The only thing worth noting is that I decided that if container.Close() does error for some reason, we return the error rather than logging it and proceeding. Something is likely substantially wrong if that happens.

@PatrickLang

Contributor

commented May 31, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label May 31, 2019

@PatrickLang

Contributor

commented May 31, 2019

/milestone v1.15
This can be a big memory leak - reports of 4GB with a lot of stats requests with many containers running

@k8s-ci-robot

Contributor

commented May 31, 2019

@PatrickLang: The provided milestone is not valid for this repository. Milestones in this repository: [next-candidate, v1.10, v1.11, v1.12, v1.13, v1.14, v1.15, v1.16, v1.17, v1.18, v1.4, v1.5, v1.6, v1.7, v1.8, v1.9]

Use /milestone clear to clear the milestone.

In response to this:

/milestone 1.15
This can be a big memory leak - reports of 4GB with a lot of stats requests with many containers running

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@PatrickLang PatrickLang added this to Backlog (v.1.15) in SIG-Windows May 31, 2019

@PatrickLang PatrickLang moved this from Backlog (v.1.15) to In Progress+Review in SIG-Windows May 31, 2019

@PatrickLang

Contributor

commented May 31, 2019

/retest
/assign @yujuhong

@benmoss

Member Author

commented May 31, 2019

/priority 1.15

@yujuhong

Member

commented May 31, 2019

/lgtm
/approve

We should cherry-pick this to 1.14. Can you add a release note? @benmoss thanks!

@benmoss

Member Author

commented May 31, 2019

Added a release note. I was kinda unclear whether it warranted one, but yeah, people definitely want to know if a release fixes a known issue :)

@yujuhong

Member

commented May 31, 2019

Added a release note. I was kinda unclear whether it warranted one, but yeah, people definitely want to know if a release fixes a known issue :)

It's required if we want to patch 1.14, and I think we should :-)

@k8s-ci-robot k8s-ci-robot added this to the v1.15 milestone May 31, 2019

@@ -51,7 +51,9 @@ func (p *criStatsProvider) listContainerNetworkStats() (map[string]*statsapi.Net
 			klog.V(4).Infof("Failed to get statistics for container %q with error '%v', continue to get stats for other containers", c.ID, err)
 			continue
 		}
+
+		if err := container.Close(); err != nil {

@mtaufen

mtaufen May 31, 2019

Contributor

Does this need to be part of the error-handling path? E.g. if container.Statistics() returns an error, it looks like we'll skip closing the handle.

It might be better to use a defer (but if you do, please test it to ensure the container is tracked properly across loop iterations; I always have trouble remembering whether receivers are evaluated at defer time or at return time).
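For what it's worth, a minimal, self-contained sketch of the Go semantics in question (hypothetical types, not the hcsshim code itself): a deferred call's receiver and arguments are evaluated when the defer statement executes, but the call only runs when the surrounding function returns, so a defer placed directly inside the loop would keep every handle open until the whole listing function exits:

```go
package main

import "fmt"

type handle struct{ id string }

func (h *handle) Close() { fmt.Println("closing", h.id) }

func main() {
	for _, id := range []string{"a", "b", "c"} {
		h := &handle{id: id}
		// The receiver h is captured here, at defer time, so each deferred call
		// closes the right handle; but none of them run until main returns.
		defer h.Close()
		fmt.Println("opened", id)
	}
	fmt.Println("loop finished; all three closes happen only after this line")
	// Output:
	// opened a
	// opened b
	// opened c
	// loop finished; all three closes happen only after this line
	// closing c
	// closing b
	// closing a
}
```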

@yujuhong

yujuhong May 31, 2019

Member

Agree. We should probably defer after handling the error of OpenContainer (~ line 48).

@benmoss

benmoss Jun 2, 2019

Author Member

Ahh, quite true.

@benmoss

benmoss Jun 2, 2019

Author Member

@mtaufen @yujuhong I moved it into a private method so I could use defer; the only trouble is that catching the Close() error in a defer is now somewhat tricky. I used the go.uber.org/multierr package since it solves this problem nicely and is already vendored, but it's currently only a transitive dependency of the project. I don't know whether promoting it to a direct dependency is something people worry about. It looks like k8s largely follows the "ignore deferred errors" pattern today 😄.

@benmoss

benmoss Jun 2, 2019

Author Member

On second thought, and after seeing that adding that dependency triggers another level of PR review, I decided to go with a simpler approach that avoids the new dep :)
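As a rough sketch of the pattern being discussed, assuming a per-container helper so the defer runs once per iteration, with a named return value used to surface a Close() error instead of multierr; the helper name and exact shape are illustrative and may not match the code that was ultimately merged:

```go
package winstats

import "github.com/Microsoft/hcsshim"

// statisticsForContainer is an illustrative helper: pulling the per-container
// work into its own function lets defer close the handle at the end of each
// call, rather than piling up deferred closes until the listing loop finishes.
func statisticsForContainer(id string) (stats hcsshim.Statistics, err error) {
	container, err := hcsshim.OpenContainer(id)
	if err != nil {
		return stats, err
	}
	defer func() {
		// Report a Close() error through the named return value, but do not
		// let it overwrite an earlier error from Statistics().
		if closeErr := container.Close(); closeErr != nil && err == nil {
			err = closeErr
		}
	}()
	return container.Statistics()
}
```

The design point from the thread is just that Close() must run on every exit path for each opened container, whether that is done with an explicit call on each path or with a per-call defer as sketched here.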

@benmoss benmoss force-pushed the benmoss:windows-kubelet-memory-leak branch from 1763e6a to 287b1ba Jun 2, 2019

@k8s-ci-robot k8s-ci-robot added size/S and removed size/XS labels Jun 2, 2019

@benmoss benmoss force-pushed the benmoss:windows-kubelet-memory-leak branch 3 times, most recently from 59d96df to b920f00 Jun 2, 2019

@benmoss benmoss force-pushed the benmoss:windows-kubelet-memory-leak branch from b920f00 to 1fcad1b Jun 2, 2019

@k8s-ci-robot k8s-ci-robot added size/M and removed size/S labels Jun 2, 2019

@benmoss

Member Author

commented Jun 2, 2019

/remove-area dependency

@k8s-ci-robot

Contributor

commented Jun 2, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benmoss, yujuhong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@feiskyer
Member

left a comment

/lgtm

thanks for the fix.

@k8s-ci-robot k8s-ci-robot added the lgtm label Jun 3, 2019

@fejta-bot


commented Jun 3, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit acb321e into kubernetes:master Jun 3, 2019

21 checks passed

cla/linuxfoundation benmoss authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-node-e2e-containerd Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped.
tide In merge pool.

SIG-Windows automation moved this from In Progress+Review to Done (v1.15) Jun 3, 2019

k8s-ci-robot added a commit that referenced this pull request Jun 19, 2019

Merge pull request #78665 from benmoss/automated-cherry-pick-of-#78594-upstream-release-1.14

Automated cherry pick of #78594: Fix memory leak from not closing hcs container handles