fix potential memory leak issue in processing watch request #85410

Merged

Conversation

answer1991
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fix a potential memory leak in the API server when processing watch requests.

Which issue(s) this PR fixes:

Related Issue: #84001
Related PR: #84693

Special notes for your reviewer:

Even after #84693 is picked, the API server may still leak memory when processing a watch request (it does not stop the watcher).

Does this PR introduce a user-facing change?:

Fix a potential memory leak in the API server when processing watch requests.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 18, 2019
@k8s-ci-robot
Contributor

Hi @answer1991. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 18, 2019
@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 18, 2019
@answer1991
Contributor Author

CC @lavalamp @tedyu

@neolit123
Member

/ok-to-test
/priority backlog

@k8s-ci-robot k8s-ci-robot added priority/backlog Higher priority than priority/awaiting-more-evidence. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 18, 2019
@lavalamp
Member

/lgtm
/approve

Let's cherry pick this back. Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 19, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: answer1991, lavalamp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 19, 2019
@hzxuzhonghu
Member

Isn't the cancel almost equivalent to watcher.Stop here?

ctx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
watcher, err := rw.Watch(ctx, &opts)
if err != nil {
	scope.err(err, w, req)
	return
}
requestInfo, _ := request.RequestInfoFrom(ctx)
metrics.RecordLongRunning(req, requestInfo, metrics.APIServerComponent, func() {
	serveWatch(watcher, scope, outputMediaType, req, w, timeout)
})
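
Whether cancel is equivalent to watcher.Stop depends on the watcher implementation actually observing the context. The following is a minimal sketch of that distinction, not the API server code: `leakyWatcher`, `newLeakyWatcher`, and `handle` are hypothetical names, and the watcher deliberately ignores its context to show the case where only an explicit deferred Stop releases it.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// leakyWatcher is a hypothetical watcher that ignores the context it was
// created with, so only an explicit Stop releases it.
type leakyWatcher struct {
	mu      sync.Mutex
	stopped bool
}

// newLeakyWatcher mimics a Watch(ctx, ...) call whose result does not observe ctx.
func newLeakyWatcher(_ context.Context) *leakyWatcher { return &leakyWatcher{} }

// Stop marks the watcher as stopped.
func (w *leakyWatcher) Stop() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.stopped = true
}

// IsStopped reports whether Stop has been called.
func (w *leakyWatcher) IsStopped() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.stopped
}

// handle mimics a request handler: it always cancels the context on return,
// and stops the watcher explicitly only when stopExplicitly is true.
func handle(stopExplicitly bool) *leakyWatcher {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	w := newLeakyWatcher(ctx)
	if stopExplicitly {
		defer w.Stop()
	}
	return w
}

func main() {
	fmt.Println("cancel only:", handle(false).IsStopped())
	fmt.Println("cancel + Stop:", handle(true).IsStopped())
}
```

If every watcher implementation honors ctx.Done(), the two are close to equivalent; the deferred Stop simply removes the dependency on that assumption.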

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 19, 2019
@@ -591,6 +591,7 @@ func TestWatchHTTPErrors(t *testing.T) {
}

s := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
defer watcher.Stop()
Contributor

@tedyu tedyu Nov 19, 2019


Why is this addition needed ?

Contributor Author


watcher.Stop has moved into the serveWatch function. The test code is now a bit tricky, but I have not found a better way to call watcher.Stop in the tests.


But you're stopping the watcher on any request, instead of just the relevant ones.

@answer1991
Contributor Author

/hold

Let me refactor the code again.

@answer1991
Contributor Author

@tedyu @riking

I have updated the PR; many thanks for the review.

BTW, I do not think it's a bug anymore, since the real watcher is stopped once the context is done:

	defer c.Stop()
	for {
		select {
		case event, ok := <-c.input:
			if !ok {
				return
			}
			// only send events newer than resourceVersion
			if event.ResourceVersion > resourceVersion {
				c.sendWatchCacheEvent(event)
			}
		case <-ctx.Done():
			return
		}
	}
}
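
The pattern in the excerpt above can be sketched in isolation. This is a simplified stand-in rather than the real cacheWatcher: the type `ctxWatcher`, its fields, the `done` channel, and `runAndCancel` are assumptions made for illustration only.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// ctxWatcher is a hypothetical, simplified stand-in for a cache watcher.
type ctxWatcher struct {
	input    chan int
	done     chan struct{} // closed by Stop
	stopOnce sync.Once
}

func newCtxWatcher() *ctxWatcher {
	return &ctxWatcher{input: make(chan int), done: make(chan struct{})}
}

// Stop is idempotent, so both the processing loop and a handler may call it.
func (c *ctxWatcher) Stop() {
	c.stopOnce.Do(func() { close(c.done) })
}

// process drains input until it is closed or ctx is done, then stops the watcher.
func (c *ctxWatcher) process(ctx context.Context) {
	defer c.Stop()
	for {
		select {
		case _, ok := <-c.input:
			if !ok {
				return
			}
		case <-ctx.Done():
			return
		}
	}
}

// runAndCancel starts process, cancels the context, and waits for the watcher to stop.
func runAndCancel() bool {
	ctx, cancel := context.WithCancel(context.Background())
	c := newCtxWatcher()
	go c.process(ctx)
	cancel()
	<-c.done // blocks until process has called Stop
	return true
}

func main() {
	fmt.Println("stopped after cancel:", runAndCancel())
}
```

In this sketch Stop is guarded by sync.Once because it can be reached from two places once a handler also defers it.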

@wojtek-t
Member

@answer1991 - the test failure seems related

@answer1991
Contributor Author

@wojtek-t

The tests passed in my environment. Let me run the tests again; if they fail again, I will debug and try to resolve it :-P

/test pull-kubernetes-bazel-test

@answer1991
Contributor Author

@wojtek-t

Found the root cause of the test failure: we should call FakeWatcher.IsStopped() instead of FakeWatcher.Stopped: #86120
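
The thread only links #86120 and does not spell out why the accessor is preferred. One plausible reason, stated here as an assumption, is that the flag is written under a lock by Stop, so reading the field directly from the test goroutine can race with the handler goroutine. The sketch below uses a hypothetical `fakeWatcher` type, not the real watch.FakeWatcher.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeWatcher is a hypothetical test double; it is not the FakeWatcher
// from k8s.io/apimachinery.
type fakeWatcher struct {
	mu      sync.Mutex
	stopped bool
}

// Stop sets the flag while holding the lock.
func (f *fakeWatcher) Stop() {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.stopped = true
}

// IsStopped reads the flag under the same lock, so it is safe to call
// while another goroutine is running Stop.
func (f *fakeWatcher) IsStopped() bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.stopped
}

// stopConcurrently calls Stop from a separate goroutine, as a server handler
// would, and reports the state seen by the calling goroutine afterwards.
func stopConcurrently() bool {
	f := &fakeWatcher{}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		f.Stop()
	}()
	wg.Wait()
	return f.IsStopped()
}

func main() {
	fmt.Println("stopped:", stopConcurrently())
}
```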

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 10, 2019
@wojtek-t
Member

/lgtm

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 10, 2019
@k8s-ci-robot k8s-ci-robot merged commit 34f3492 into kubernetes:master Dec 10, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Dec 10, 2019
@tedyu
Contributor

tedyu commented Dec 10, 2019

I ran the new test after reverting the change to staging/src/k8s.io/apiserver/pkg/endpoints/handlers/watch.go

The test still passed.

@answer1991
Contributor Author

answer1991 commented Dec 10, 2019

@tedyu

Yes, it does. We do not test serveWatch in staging/src/k8s.io/apiserver/pkg/endpoints/handlers/watch.go directly; instead, the func serveWatch has a mock in the test code:

// serveWatch will serve a watch response according to the watcher and watchServer.
// Before watchServer.ServeHTTP, an error may occur like k8s.io/apiserver/pkg/endpoints/handlers/watch.go#serveWatch does.
func serveWatch(watcher watch.Interface, watchServer *handlers.WatchServer, preServeErr error) http.HandlerFunc {
	return func(w http.ResponseWriter, req *http.Request) {
		defer watcher.Stop()
		if preServeErr != nil {
			responsewriters.ErrorNegotiated(preServeErr, watchServer.Scope.Serializer, watchServer.Scope.Kind.GroupVersion(), w, req)
			return
		}
		watchServer.ServeHTTP(w, req)
	}
}

If we use the older way of calling ServeHTTP, the test case TestWatchHTTPErrorsBeforeServe fails:

	s := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		if preServeErr != nil {
			responsewriters.ErrorNegotiated(preServeErr, watchServer.Scope.Serializer, watchServer.Scope.Kind.GroupVersion(), w, req)
			return
		}
		watchServer.ServeHTTP(w, req)
	}))
	defer s.Close()

@tedyu
Contributor

tedyu commented Dec 17, 2019

With the above modification:

$ GO111MODULE=on go test -run=TestWatchHTTPErrors
^Csignal: interrupt
FAIL	k8s.io/apiserver/pkg/endpoints	129.222s

I would expect that existing tests would pass.

@answer1991
Contributor Author

@tedyu

If you want the existing tests to pass, you should call the serveWatch function, which wraps defer watcher.Stop().

Calling watchServer.ServeHTTP directly leaks the watcher goroutine and makes the tests fail, for both TestWatchHTTPErrors and TestWatchHTTPErrorsBeforeServe, because we moved defer watcher.Stop() into the serveWatch function.
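
The point about wrapping the handler can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the test code from the PR: `stopper`, `countingWatcher`, `serveWithStop`, `stopsAfterRequest`, and `errPreServe` are hypothetical names, and http.Error stands in for the negotiated error response.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// errPreServe is a hypothetical error raised before serving starts.
var errPreServe = errors.New("pre-serve failure")

// stopper is the only part of a watcher this sketch needs.
type stopper interface{ Stop() }

// countingWatcher is a hypothetical test double that counts Stop calls.
type countingWatcher struct{ stops int32 }

func (c *countingWatcher) Stop() { atomic.AddInt32(&c.stops, 1) }

// serveWithStop wraps next so the watcher is stopped on every exit path,
// including the early return when preServeErr is set.
func serveWithStop(w stopper, next http.Handler, preServeErr error) http.HandlerFunc {
	return func(rw http.ResponseWriter, req *http.Request) {
		defer w.Stop()
		if preServeErr != nil {
			http.Error(rw, preServeErr.Error(), http.StatusInternalServerError)
			return
		}
		next.ServeHTTP(rw, req)
	}
}

// stopsAfterRequest issues one request and returns how many times Stop ran.
func stopsAfterRequest(preServeErr error) int32 {
	cw := &countingWatcher{}
	ok := http.HandlerFunc(func(rw http.ResponseWriter, _ *http.Request) {
		rw.WriteHeader(http.StatusOK)
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/watch", nil)
	serveWithStop(cw, ok, preServeErr).ServeHTTP(rec, req)
	return atomic.LoadInt32(&cw.stops)
}

func main() {
	fmt.Println("stops on success:", stopsAfterRequest(nil))
	fmt.Println("stops on pre-serve error:", stopsAfterRequest(errPreServe))
}
```

Calling the inner handler directly, without the wrapper, would skip the deferred Stop on both paths, which matches the leak described above.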

@@ -64,6 +64,8 @@ func (w *realTimeoutFactory) TimeoutCh() (<-chan time.Time, func() bool) {
// serveWatch will serve a watch response.
// TODO: the functionality in this method and in WatchServer.Serve is not cleanly decoupled.
func serveWatch(watcher watch.Interface, scope *RequestScope, mediaTypeOptions negotiation.MediaTypeOptions, req *http.Request, w http.ResponseWriter, timeout time.Duration) {
defer watcher.Stop()
Contributor


Hi @answer1991, why do we need this when the watcher is already closed after the timeout in cacheWatcher.process()?

@MikeSpreitzer
Member

+1 on #85410 (comment)
As far as I can tell, this fix was never put in the earlier supported releases.
@lavalamp
@answer1991

@wojtek-t
Member

It's part of 1.18, so with 1.19 out, we can just cherrypick it to 1.17. But that sounds reasonable to me.

@answer1991 - can you please open 1.17 cherrypick?

@answer1991
Contributor Author

@wojtek-t Sure. Is there an automated command on GitHub to help with this, or should I cherry-pick it manually?

@wojtek-t
Member

k8s-ci-robot added a commit that referenced this pull request Aug 31, 2020
Automated cherry pick of #85410: fix potential memory leak issue in processing watch request
@MikeSpreitzer
Member

I thought I heard that our policy now is to support the previous 3 minor releases.

@wojtek-t
Member

Isn't it that, starting with 1.19, we will support each release for 1 year? So it will effectively be 4 releases, but only once 1.19 becomes the oldest supported one?

@MikeSpreitzer
Member

I see I remembered wrong. https://kubernetes.io/docs/setup/release/version-skew-policy/#supported-versions has the statement. So 1.17 is the oldest supported one now.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.