When pleg channel is full, discard events and record its count #72709
Hi @changyaowei. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test |
pkg/kubelet/pleg/generic.go
Outdated
@@ -192,6 +192,12 @@ func (g *GenericPLEG) relist() {
metrics.PLEGRelistInterval.Observe(metrics.SinceInMicroseconds(lastRelistTime))
}

_, err := g.runtime.Status()
if err != nil {
klog.Errorf("Container runtime sanity check failed: %v", err)
/retest |
2bb877c
to
a183bd9
Compare
pkg/kubelet/pleg/generic.go
Outdated
newEventCount = newEventCount + len(value)
}
if (newEventCount + len(g.eventChannel)) > cap(g.eventChannel) {
klog.Errorf("new event count %d, now event channel is %d, cap is %d, channel will full, so discard", newEventCount, len(g.eventChannel), cap(g.eventChannel))
pkg/kubelet/pleg/generic.go
Outdated
@@ -224,6 +224,15 @@ func (g *GenericPLEG) relist() {
}
}

newEventCount := 0
Please update the PR title, mark it as WIP until proper tests are added, and rebase the code onto the latest master branch.
a183bd9
to
6c72a42
Compare
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: changyaowei, yujuhong The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm |
/hold Just hold this for @Random-Liu to give final pass. @changyaowei |
LGTM overall with one comment
/lgtm |
@changyaowei: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Does it qualify for cherry pick into 1.11, 1.12 and 1.13? |
@Random-Liu Does this need to be cherry-picked into 1.11, 1.12 and 1.13?
Either the feature doesn't work or the test is broken, but |
|
@apelisse By testing, I found that the line of
Are you saying that this is a timing issue? In general, time shouldn't have an impact in unit-tests. I don't have any problem with the time, my main concern is the flakiness. |
@apelisse I ran the test on my Mac and everything passed. Can you share your failure output?
Yeah, you ran it twice, I ran it 1000 times :-) Also note that Go caches the output after the first success, so you can't simply run it multiple times naively. It failed 230 times out of the 1000 times I ran it.
@apelisse We are trying to reproduce and dig into the change now. |
You can reproduce by running:
|
@apelisse, @changyaowei |
@changyaowei Does it qualify for cherry pick into 1.12 ? |
What type of PR is this?
What this PR does / why we need it:
When the event channel is full, discard the event. If relist keeps putting events into the full channel and blocks for longer than relistThreshold, the PLEG Healthy() check returns an error. The kubelet syncLoop() then gets a runtime error and stops consuming events, which leads to a deadlock. The runtimeErrors func invokes the PLEG health func.
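The discard-and-count idea described above can be sketched in Go with a non-blocking channel send. This is a minimal illustration, not necessarily the code in this PR: `PodLifecycleEvent` here is a simplified stand-in type, and `sendOrDiscard` and the `discarded` counter are hypothetical names introduced for the example.

```go
package main

import "fmt"

// PodLifecycleEvent is a simplified, hypothetical stand-in for the
// kubelet's event type; only the fields needed for the sketch are kept.
type PodLifecycleEvent struct {
	PodID string
	Type  string
}

// sendOrDiscard (hypothetical helper) tries to put an event on the channel
// without blocking. If the channel is full, the event is dropped and the
// discard counter is incremented. It reports whether the event was queued.
func sendOrDiscard(ch chan<- *PodLifecycleEvent, e *PodLifecycleEvent, discarded *int) bool {
	select {
	case ch <- e:
		return true
	default:
		// Channel is full: drop the event rather than block the producer.
		*discarded++
		return false
	}
}

func main() {
	// A small buffer with no consumer, so it fills up quickly.
	ch := make(chan *PodLifecycleEvent, 2)
	discarded := 0

	for i := 0; i < 5; i++ {
		e := &PodLifecycleEvent{PodID: fmt.Sprintf("pod-%d", i), Type: "ContainerStarted"}
		sendOrDiscard(ch, e, &discarded)
	}

	fmt.Printf("queued=%d discarded=%d\n", len(ch), discarded)
	// prints "queued=2 discarded=3"
}
```

Because the send never blocks, the producer loop always finishes in bounded time even when nothing is draining the channel, which is the property the description relies on to keep the health check from timing out.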
Which issue(s) this PR fixes:
Fixes #72482
Special notes for your reviewer:
Does this PR introduce a user-facing change?: