When pleg channel is full, discard events and record its count #72709

Merged
merged 3 commits into from Feb 13, 2019

Conversation

@changyaowei
Contributor

changyaowei commented Jan 9, 2019

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind bug
/kind design

What this PR does / why we need it:
When the event channel is full, discard the event. If relist() instead keeps trying to put events on the channel and blocks for longer than relistThreshold, the PLEG Healthy() check returns an error. The kubelet syncLoop() then sees a runtime error and stops consuming events, so the channel never drains and the result is a deadlock.


The runtimeErrors function invokes the PLEG health function.
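
To illustrate the fix described above, here is a minimal Go sketch of a non-blocking send that drops and counts events once the channel buffer is full. It is not the PR's actual code: the `event` type, the `sendOrDiscard` helper, and the plain integer counter are hypothetical stand-ins for the PLEG event type and whatever metric the kubelet records.

```go
package main

import "fmt"

// event stands in for the PLEG's pod lifecycle event type (hypothetical).
type event struct {
	PodID string
	Type  string
}

// sendOrDiscard tries to queue each event without blocking. When the
// channel buffer is full, the event is dropped and counted instead.
// It returns the number of discarded events.
func sendOrDiscard(ch chan<- event, events []event) int {
	discarded := 0
	for _, e := range events {
		select {
		case ch <- e:
			// Queued successfully.
		default:
			// Buffer is full and nobody is receiving: drop and count.
			discarded++
		}
	}
	return discarded
}

func main() {
	ch := make(chan event, 2) // tiny buffer so the overflow is easy to see
	events := []event{
		{"pod-a", "ContainerStarted"},
		{"pod-b", "ContainerStarted"},
		{"pod-c", "ContainerDied"},
	}
	discarded := sendOrDiscard(ch, events)
	fmt.Printf("queued=%d discarded=%d\n", len(ch), discarded)
}
```

With a buffer of 2 and three events, the third send takes the `default` branch, so the producer never blocks waiting for a consumer.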

Which issue(s) this PR fixes:

Fixes #72482

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

when pleg channel is full, discard events and record its count
@k8s-ci-robot

Contributor

k8s-ci-robot commented Jan 9, 2019

Hi @changyaowei. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@resouer

Member

resouer commented Jan 9, 2019

/ok-to-test

@@ -192,6 +192,12 @@ func (g *GenericPLEG) relist() {
		metrics.PLEGRelistInterval.Observe(metrics.SinceInMicroseconds(lastRelistTime))
	}

	_, err := g.runtime.Status()
	if err != nil {
		klog.Errorf("Container runtime sanity check failed: %v", err)

This comment was marked as resolved.

@resouer

resouer Jan 9, 2019

Member

Let's say the container runtime is dead, and PLEG returns here after logging this error. What happens next?

What will the status of the Pods be? And how will the user know the container runtime is dead? (Users never check the log unless something really bad happens.)

This comment was marked as resolved.

@changyaowei

changyaowei Jan 9, 2019

Author Contributor

A dead container runtime does not mean the containers are dead; the containers will not be impacted. If we return here, then 3 minutes later the node status will become not ready, which we can see from the apiserver.
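
The concern in this thread hinges on how a stalled relist surfaces as an unhealthy PLEG. As a rough sketch (not the kubelet's implementation), a health check of this kind compares the time since the last completed relist against a threshold. The `healthy` function and the 3-minute `relistThreshold` constant below are assumptions chosen to match the figure mentioned above.

```go
package main

import (
	"fmt"
	"time"
)

// relistThreshold mirrors the 3-minute figure mentioned in the thread;
// treat the exact value as an assumption.
const relistThreshold = 3 * time.Minute

// healthy reports an error when the last completed relist is older than
// the threshold, which is what a relist stuck on a blocking send or an
// early return on every cycle would eventually cause.
func healthy(sinceLastRelist time.Duration) error {
	if sinceLastRelist > relistThreshold {
		return fmt.Errorf("pleg was last seen active %v ago; threshold is %v",
			sinceLastRelist, relistThreshold)
	}
	return nil
}

func main() {
	fmt.Println(healthy(10 * time.Second)) // recent relist: no error
	fmt.Println(healthy(5 * time.Minute))  // stalled relist: error
}
```

Passing the elapsed time in as a parameter, rather than reading the clock inside the function, keeps the check easy to exercise without waiting for real time to pass.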

@changyaowei

Contributor Author

changyaowei commented Jan 9, 2019

/retest

		newEventCount = newEventCount + len(value)
	}
	if (newEventCount + len(g.eventChannel)) > cap(g.eventChannel) {
		klog.Errorf("new event count %d, now event channel is %d, cap is %d, channel will full, so discard", newEventCount, len(g.eventChannel), cap(g.eventChannel))

This comment was marked as resolved.

@resouer

resouer Jan 18, 2019

Member

We don't need to reveal the exact number of events, which is not very helpful for debugging, so simply:

Suggested change
- klog.Errorf("new event count %d, now event channel is %d, cap is %d, channel will full, so discard", newEventCount, len(g.eventChannel), cap(g.eventChannel))
+ klog.Errorf("event channel is about to be full, discard this relist() cycle")

This comment was marked as resolved.

@changyaowei

changyaowei Jan 18, 2019

Author Contributor

OK
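
The check in the diff above can be sketched in isolation as follows. This is an illustrative version under assumptions, not the code in this PR: `enqueueCycle`, the `event` type, and the `eventsByPod` map are hypothetical, and the approach is only safe with a single producer, since `len(ch)` could otherwise grow between the check and the sends.

```go
package main

import "fmt"

// event stands in for the PLEG's pod lifecycle event type (hypothetical).
type event struct {
	PodID string
}

// enqueueCycle queues all events from one relist cycle, or none of them.
// If the pending events would not fit in the remaining buffer, the whole
// cycle is discarded and the function reports false.
func enqueueCycle(ch chan event, eventsByPod map[string][]event) bool {
	upcomingEventCount := 0
	for _, evs := range eventsByPod {
		upcomingEventCount += len(evs)
	}
	// Consumers can only shrink len(ch), so with a single producer the
	// sends below cannot block once this check has passed.
	if upcomingEventCount+len(ch) > cap(ch) {
		return false
	}
	for _, evs := range eventsByPod {
		for _, e := range evs {
			ch <- e
		}
	}
	return true
}

func main() {
	ch := make(chan event, 3)
	first := map[string][]event{"pod-a": {{"pod-a"}, {"pod-a"}}}
	second := map[string][]event{"pod-b": {{"pod-b"}, {"pod-b"}}}
	fmt.Println(enqueueCycle(ch, first), len(ch))  // fits in the buffer
	fmt.Println(enqueueCycle(ch, second), len(ch)) // would overflow
}
```

Discarding a whole cycle keeps the logic simple, at the cost of also dropping events that would have fit individually.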

@@ -224,6 +224,15 @@ func (g *GenericPLEG) relist() {
}
}

newEventCount := 0

This comment was marked as resolved.

@resouer

resouer Jan 18, 2019

Member

I would prefer: upcomingEventCount

This comment was marked as resolved.

@changyaowei

changyaowei Jan 18, 2019

Author Contributor

ok

@resouer

Member

resouer commented Jan 18, 2019

Please update the PR title, mark it as WIP until proper tests are added, and rebase the code onto the latest master branch.

@changyaowei changyaowei force-pushed the changyaowei:pleg_relist branch from a183bd9 to 6c72a42 Jan 18, 2019

@changyaowei changyaowei changed the title from "when runtime status is error, PLEG don't relist" to "[WIP] when runtime status is error, PLEG don't relist" Jan 18, 2019

@Random-Liu Random-Liu added this to the v1.14 milestone Jan 31, 2019

@yujuhong

Member

yujuhong commented Jan 31, 2019

/approve
/assign @Random-Liu

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jan 31, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: changyaowei, yujuhong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@resouer

Member

resouer commented Feb 1, 2019

/lgtm

@resouer

Member

resouer commented Feb 1, 2019

/hold

Just holding this for @Random-Liu to give a final pass. @changyaowei

@Random-Liu
Member

Random-Liu left a comment

LGTM overall with one comment

pkg/kubelet/pleg/generic_test.go (outdated, resolved)

@k8s-ci-robot k8s-ci-robot removed the lgtm label Feb 13, 2019

@Random-Liu

Member

Random-Liu commented Feb 13, 2019

/lgtm

@k8s-ci-robot

Contributor

k8s-ci-robot commented Feb 13, 2019

@changyaowei: The following tests failed, say /retest to rerun them all:

Test name                                 Commit   Details  Rerun command
pull-kubernetes-e2e-kops-aws              b52afc3  link     /test pull-kubernetes-e2e-kops-aws
pull-kubernetes-local-e2e-containerized   19f7389  link     /test pull-kubernetes-local-e2e-containerized

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot merged commit 289a60a into kubernetes:master Feb 13, 2019

16 of 17 checks passed

pull-kubernetes-local-e2e-containerized Job failed.
cla/linuxfoundation changyaowei authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-godeps Skipped
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped
tide In merge pool.

@changyaowei changyaowei deleted the changyaowei:pleg_relist branch Feb 13, 2019

@redbaron

Contributor

redbaron commented Feb 13, 2019

Does it qualify for cherry pick into 1.11, 1.12 and 1.13?

@changyaowei

Contributor Author

changyaowei commented Feb 14, 2019

@Random-Liu Does this need to be cherry-picked into 1.11, 1.12 and 1.13?

@apelisse

Member

apelisse commented Mar 1, 2019

Either the feature doesn't work or the test is broken, but TestEventChannelFull is extremely flaky.

@apelisse

Member

apelisse commented Mar 1, 2019

//pkg/kubelet/pleg:go_default_test       FAILED in 230 out of 1000 in 0.2s
@changyaowei

Contributor Author

changyaowei commented Mar 4, 2019

@apelisse From testing, I found that the klog.Error line takes a long time.

@apelisse

Member

apelisse commented Mar 4, 2019

Are you saying that this is a timing issue? In general, time shouldn't have an impact in unit-tests. I don't have any problem with the time, my main concern is the flakiness.

@changyaowei

Contributor Author

changyaowei commented Mar 5, 2019

@apelisse I ran the test on my Mac and everything is OK. Can you share your failure output?

@apelisse

Member

apelisse commented Mar 5, 2019

Yeah, you ran it twice, I ran it 1000 times :-) Also note that Go caches the output after the first success, so you can't simply run it multiple times naively.

It failed 230 times out of the 1000 times I ran it.
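
One way to keep a channel-full check independent of timing, sketched here under assumptions, is to fill the channel with no concurrent consumer so that the number of dropped events depends only on the capacity. The helpers `fillAndCount` and `countFailures` are hypothetical and are not the PR's `TestEventChannelFull`; the thread does not establish why that test flakes.

```go
package main

import "fmt"

// fillAndCount sends n values into a channel with the given capacity and
// no consumer, using non-blocking sends, and returns how many were dropped.
func fillAndCount(capacity, n int) int {
	ch := make(chan int, capacity)
	dropped := 0
	for i := 0; i < n; i++ {
		select {
		case ch <- i:
			// Queued successfully.
		default:
			// Buffer is full: count the drop instead of blocking.
			dropped++
		}
	}
	return dropped
}

// countFailures runs check repeatedly and reports how many runs failed.
func countFailures(runs int, check func() bool) int {
	failures := 0
	for i := 0; i < runs; i++ {
		if !check() {
			failures++
		}
	}
	return failures
}

func main() {
	failures := countFailures(1000, func() bool {
		// With no concurrent consumer the result depends only on the
		// capacity: 150 sends into 100 slots always drop 50.
		return fillAndCount(100, 150) == 50
	})
	fmt.Printf("failed %d out of 1000 runs\n", failures)
}
```

Repeating the check many times in one process, as above, is a cheap way to confirm that an outcome does not vary between runs.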
