Stop Watching when there is encoding error #84693

Merged
merged 1 commit into from Nov 8, 2019

Conversation

tedyu
Contributor

@tedyu tedyu commented Nov 3, 2019

What type of PR is this?
/kind cleanup

What this PR does / why we need it:
In WatchServer#HandleWS, if s.EmbeddedEncoder.Encode() fails, s.Watching should be stopped.
This would make encoding error handling consistent with that of s.Encoder.Encode().

kube-apiserver: fixed a bug that could cause a goroutine leak if the apiserver encountered an encoding error serving a watch to a websocket watcher
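
To illustrate the idea, here is a minimal sketch of a serve loop that stops its watcher on every exit path, including the encoding-error path. It is not the actual `HandleWS` code: `event`, `fakeWatch`, `serve`, and `failingEncode` are hypothetical names invented for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// event is a stand-in for a watch event; it is not the Kubernetes type.
type event struct{ payload string }

// fakeWatch is a hypothetical watcher used only for this sketch.
type fakeWatch struct {
	ch      chan event
	stopped bool
}

func (w *fakeWatch) ResultChan() <-chan event { return w.ch }
func (w *fakeWatch) Stop()                    { w.stopped = true }

// errEncode and failingEncode simulate an encoder that always fails.
var errEncode = errors.New("encode failed")

func failingEncode(event) error { return errEncode }

// serve drains the watcher and encodes each event. Because Stop is deferred,
// it runs on every return path, including the early return on an encoding error.
func serve(w *fakeWatch, encode func(event) error) error {
	defer w.Stop()
	for ev := range w.ResultChan() {
		if err := encode(ev); err != nil {
			return fmt.Errorf("encode %q: %w", ev.payload, err)
		}
	}
	return nil
}

func main() {
	w := &fakeWatch{ch: make(chan event, 1)}
	w.ch <- event{payload: "bad"}
	err := serve(w, failingEncode)
	fmt.Println("error:", err, "stopped:", w.stopped)
}
```

Without the deferred call, the early return would leave the watcher running with nobody reading from its channel.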

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 3, 2019
@tedyu
Contributor Author

tedyu commented Nov 4, 2019

/assign @liggitt

@caesarxuchao
Member

/unassign @liggitt
/assign @lavalamp

@k8s-ci-robot k8s-ci-robot assigned lavalamp and unassigned liggitt Nov 5, 2019
@tedyu
Contributor Author

tedyu commented Nov 6, 2019

/test pull-kubernetes-kubemark-e2e-gce-big

@lavalamp
Member

lavalamp commented Nov 7, 2019

/lgtm
/approve

Tests seem broken for an unrelated reason. Is there a fix if you rebase?

/retest

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 7, 2019
@tedyu
Contributor Author

tedyu commented Nov 7, 2019

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 7, 2019
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2019
@lavalamp
Member

lavalamp commented Nov 7, 2019

Good call @liggitt, I'm not sure. Maybe. We should understand the contract better; I can't dig in at the moment. It'd be best if we could defer the call and be certain it'll be called....

@tedyu
Contributor Author

tedyu commented Nov 7, 2019

@liggitt
See line 204 w.r.t. the other two places you mentioned.

	defer s.Watching.Stop()

@liggitt
Member

liggitt commented Nov 7, 2019

> See line 204 w.r.t. the other two places you mentioned.
>
> 	defer s.Watching.Stop()

ah, you're right. this seems like it should do the same in this method (but can be a follow up if we want to keep this minimal for back-porting)

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2019
@tedyu
Contributor Author

tedyu commented Nov 7, 2019

@liggitt @lavalamp
I have pushed a new patch that uses defer. This should prevent bugs in the future if a new case is added.
(In the initial comment from Jordan above there was no 'follow up')

@@ -285,10 +285,12 @@ func (s *WatchServer) HandleWS(ws *websocket.Conn) {
buf := &bytes.Buffer{}
streamBuf := &bytes.Buffer{}
ch := s.Watching.ResultChan()

defer s.Watching.Stop()
Member


Is it OK to double-stop this thing? (e.g. for line 298 it is likely already stopped)

Member


FWIW, that's exactly the path exercised by line 204/226 above in the normal case of the watch channel closing
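
For readers wondering what makes a double stop safe in general, one common Go approach is to guard the cleanup with `sync.Once`. The sketch below is only an illustration under that assumption; `onceWatcher` and `newOnceWatcher` are hypothetical names, and this is not a claim about how the watchers in this PR implement `Stop`.

```go
package main

import (
	"fmt"
	"sync"
)

// onceWatcher is a hypothetical watcher whose Stop may be called repeatedly.
type onceWatcher struct {
	once  sync.Once
	done  chan struct{}
	stops int // number of times the cleanup actually ran
}

func newOnceWatcher() *onceWatcher {
	return &onceWatcher{done: make(chan struct{})}
}

// Stop runs the cleanup at most once. Without the guard, the second
// close(w.done) would panic.
func (w *onceWatcher) Stop() {
	w.once.Do(func() {
		w.stops++
		close(w.done)
	})
}

func main() {
	w := newOnceWatcher()
	w.Stop() // e.g. an explicit stop on one code path
	w.Stop() // e.g. the deferred stop when the handler returns
	fmt.Println("cleanup ran", w.stops, "time(s)")
}
```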

@lavalamp
Member

lavalamp commented Nov 7, 2019

Nifty. Let's get the merge train started and we can verify in parallel.

@lavalamp
Member

lavalamp commented Nov 7, 2019

/lgtm

@liggitt
Member

liggitt commented Nov 8, 2019

picks opened to 1.14, 1.15, 1.16

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Nov 8, 2019
k8s-ci-robot added a commit that referenced this pull request Nov 8, 2019
…3-upstream-release-1.16

Automated cherry pick of #84693: Stop Watching when there is encoding error
k8s-ci-robot added a commit that referenced this pull request Nov 8, 2019
…3-upstream-release-1.15

Automated cherry pick of #84693: Stop Watching when there is encoding error
k8s-ci-robot added a commit that referenced this pull request Nov 8, 2019
…3-upstream-release-1.14

Automated cherry pick of #84693: Stop Watching when there is encoding error
@answer1991
Contributor

@tedyu @lavalamp

Could you provide more detailed logs about this bug? As far as I can tell, even though the code has this defect, the watcher is eventually closed anyway.

Because the watcher implemented by cacherWatcher is context-aware, once the request finishes and its context is done, the watcher calls Stop itself:

	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	watcher, err := rw.Watch(ctx, &opts)
	if err != nil {
		scope.err(err, w, req)
		return
	}
	requestInfo, _ := request.RequestInfoFrom(ctx)
	metrics.RecordLongRunning(req, requestInfo, metrics.APIServerComponent, func() {
		serveWatch(watcher, scope, outputMediaType, req, w, timeout)
	})

	defer c.Stop()
	for {
		select {
		case event, ok := <-c.input:
			if !ok {
				return
			}
			// only send events newer than resourceVersion
			if event.ResourceVersion > resourceVersion {
				c.sendWatchCacheEvent(event)
			}
		case <-ctx.Done():
			return
		}
	}
}
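
The same pattern can be reproduced in a standalone program. The following is a rough sketch, not the cacher code: `ctxWatcher`, `process`, and `runAndCancel` are hypothetical names, and closing the `stopped` channel stands in for the deferred `Stop` call in the excerpt above.

```go
package main

import (
	"context"
	"fmt"
)

// ctxWatcher is a hypothetical watcher whose processing loop exits when the
// request context is done.
type ctxWatcher struct {
	input   chan int
	stopped chan struct{}
}

// process mirrors the shape of the loop above: it returns when the input is
// closed or the context is cancelled, and the deferred cleanup always runs.
func (w *ctxWatcher) process(ctx context.Context) {
	defer close(w.stopped) // stands in for the deferred Stop call
	for {
		select {
		case _, ok := <-w.input:
			if !ok {
				return
			}
		case <-ctx.Done():
			return
		}
	}
}

// runAndCancel starts the loop, cancels the context the way a finished
// request would, and waits until the cleanup has run.
func runAndCancel() bool {
	ctx, cancel := context.WithCancel(context.Background())
	w := &ctxWatcher{input: make(chan int), stopped: make(chan struct{})}
	go w.process(ctx)
	cancel()
	<-w.stopped
	return true
}

func main() {
	fmt.Println("stopped after cancel:", runAndCancel())
}
```

In this sketch nobody calls a stop method explicitly; cancelling the context is enough for the loop to exit and clean up.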

@lavalamp
Member

@answer1991 Thanks for the investigation; when was the watcher made context aware? I think some versions still in support don't have that change? And this is definitely easy to cherry-pick.

I think we need to have metrics, one for "running watches" and one for "in-flight requests", so we can compare and definitively see that watches are not being leaked.
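
As a rough sketch of what comparing the two numbers could look like, the example below uses plain atomic counters rather than a real metrics library. `gauges`, `serveRequest`, and `leaked` are hypothetical names, and the `stopWatch` flag exists only to simulate a handler that forgets to stop its watcher.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// gauges holds two hypothetical counters: requests currently being served
// and watches currently running.
type gauges struct {
	inFlightRequests int64
	runningWatches   int64
}

// serveRequest simulates one watch request. If stopWatch is false, the
// handler returns without stopping its watch, which models the leak.
func serveRequest(g *gauges, stopWatch bool) {
	atomic.AddInt64(&g.inFlightRequests, 1)
	defer atomic.AddInt64(&g.inFlightRequests, -1)

	atomic.AddInt64(&g.runningWatches, 1)
	if stopWatch {
		atomic.AddInt64(&g.runningWatches, -1)
	}
}

// leaked reports how many watches are running beyond the number of
// in-flight requests; a value that stays above zero suggests a leak.
func (g *gauges) leaked() int64 {
	return atomic.LoadInt64(&g.runningWatches) - atomic.LoadInt64(&g.inFlightRequests)
}

func main() {
	g := &gauges{}
	serveRequest(g, true)
	serveRequest(g, false)
	fmt.Println("watches beyond in-flight requests:", g.leaked())
}
```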

@tedyu still looking for a test :)

@tedyu
Contributor Author

tedyu commented Nov 20, 2019

@lavalamp
Adding a test is on my agenda.

@answer1991
Contributor

> @answer1991 Thanks for the investigation; when was the watcher made context aware? I think some versions still in support don't have that change? And this is definitely easy to cherry-pick.
>
> I think we need to have metrics, one for "running watches" and one for "in-flight requests", so we can compare and definitively see that watches are not being leaked.
>
> @tedyu still looking for a test :)

Since v1.10, the watcher implemented by cacherWatcher has been context-aware (I did not check earlier code). However, FakeWatcher, which the test code uses, is not context-aware. I think the memory leak exists only in test code.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

8 participants