When pleg channel is full, discard events and record its count #72709

Merged
merged 3 commits into from
Feb 13, 2019

Conversation

changyaowei
Contributor

@changyaowei changyaowei commented Jan 9, 2019

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind bug
/kind design

What this PR does / why we need it:
When the event channel is full, discard the event. If relist() keeps trying to put an event into a full channel, it blocks. Once it has been blocked for longer than relistThreshold, the PLEG Healthy() check returns an error. The kubelet syncLoop() then sees a runtime error and stops consuming events, so the channel never drains and the two sides deadlock.
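The discard behaviour described above can be sketched with a non-blocking channel send. This is only an illustration: the `event` and `emitter` types and the `emit` method are hypothetical names, not the kubelet's GenericPLEG, and the counter here is a plain integer rather than a real metric.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a pod lifecycle event.
type event struct {
	podID string
	kind  string
}

// emitter is a hypothetical producer, not the kubelet's GenericPLEG.
type emitter struct {
	events    chan event
	discarded int // events dropped because the channel was full
}

// emit tries to enqueue e without blocking. When the buffer is full the
// event is dropped and counted, so the caller never stalls on the send.
func (p *emitter) emit(e event) bool {
	select {
	case p.events <- e:
		return true
	default:
		p.discarded++
		return false
	}
}

func main() {
	p := &emitter{events: make(chan event, 2)}
	// Nothing reads from the channel, so only the first two sends succeed.
	for i := 0; i < 5; i++ {
		p.emit(event{podID: fmt.Sprintf("pod-%d", i), kind: "ContainerStarted"})
	}
	fmt.Printf("queued=%d discarded=%d\n", len(p.events), p.discarded)
	// prints "queued=2 discarded=3"
}
```

The trade-off is that dropped events are lost to the consumer, which is why the sketch keeps a count of them.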


The runtimeErrors function invokes the PLEG health check function.

Which issue(s) this PR fixes:

Fixes #72482

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

When the PLEG event channel is full, discard events and record the count of discarded events.

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 9, 2019
@k8s-ci-robot
Contributor

Hi @changyaowei. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 9, 2019
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 9, 2019
@resouer
Contributor

resouer commented Jan 9, 2019

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 9, 2019
@@ -192,6 +192,12 @@ func (g *GenericPLEG) relist() {
		metrics.PLEGRelistInterval.Observe(metrics.SinceInMicroseconds(lastRelistTime))
	}

	_, err := g.runtime.Status()
	if err != nil {
		klog.Errorf("Container runtime sanity check failed: %v", err)

This comment was marked as resolved.

This comment was marked as resolved.

@changyaowei
Contributor Author

/retest

@yujuhong yujuhong self-assigned this Jan 15, 2019
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 17, 2019
		newEventCount = newEventCount + len(value)
	}
	if (newEventCount + len(g.eventChannel)) > cap(g.eventChannel) {
		klog.Errorf("new event count %d, current event channel length is %d, cap is %d, channel will be full, so discard", newEventCount, len(g.eventChannel), cap(g.eventChannel))

This comment was marked as resolved.

This comment was marked as resolved.
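The snippet under review checks whether a whole batch fits before sending, by comparing `len` and `cap` of the buffered channel. A minimal sketch of that check follows. The helper name `fits` is hypothetical, and the check is only reliable when the calling goroutine is the sole sender, since a concurrent sender could fill the buffer between the check and the send.

```go
package main

import "fmt"

// fits reports whether n more items can be queued on ch right now.
// It assumes the caller is the only sender. A concurrent receiver can
// only free up space, so a true result stays valid for the caller.
func fits(ch chan string, n int) bool {
	return len(ch)+n <= cap(ch)
}

func main() {
	ch := make(chan string, 3)
	ch <- "a"
	ch <- "b"

	batch := []string{"c", "d"}
	if !fits(ch, len(batch)) {
		// The whole batch is rejected, even though one slot is free.
		fmt.Printf("batch of %d does not fit: len=%d cap=%d\n", len(batch), len(ch), cap(ch))
	}
	fmt.Println("one more fits:", fits(ch, 1))
}
```

Compared with a per-event non-blocking send, this all-or-nothing check drops events that would still have fit in the remaining buffer space.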

@@ -224,6 +224,15 @@ func (g *GenericPLEG) relist() {
}
}

newEventCount := 0

This comment was marked as resolved.

This comment was marked as resolved.

@resouer
Contributor

resouer commented Jan 18, 2019

Please update the PR title, mark it as WIP until proper tests are added, and rebase the code onto the latest master branch.

@changyaowei changyaowei changed the title from "when runtime status is error, PLEG don't relist" to "[WIP] when runtime status is error, PLEG don't relist" Jan 18, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: changyaowei, yujuhong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 31, 2019
@resouer
Contributor

resouer commented Feb 1, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 1, 2019
@resouer
Contributor

resouer commented Feb 1, 2019

/hold

Just holding this for @Random-Liu to give a final pass. @changyaowei

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 1, 2019
Member

@Random-Liu Random-Liu left a comment

LGTM overall with one comment

pkg/kubelet/pleg/generic_test.go (outdated, resolved)
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 13, 2019
@Random-Liu
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 13, 2019
@Random-Liu Random-Liu removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 13, 2019
@k8s-ci-robot
Contributor

@changyaowei: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-e2e-kops-aws | b52afc3 | link | /test pull-kubernetes-e2e-kops-aws |
| pull-kubernetes-local-e2e-containerized | 19f7389 | link | /test pull-kubernetes-local-e2e-containerized |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot merged commit 289a60a into kubernetes:master Feb 13, 2019
@changyaowei changyaowei deleted the pleg_relist branch February 13, 2019 09:46
@redbaron
Contributor

redbaron commented Feb 13, 2019

Does it qualify for cherry pick into 1.11, 1.12 and 1.13?

@changyaowei
Contributor Author

@Random-Liu Does this need to be cherry-picked into 1.11, 1.12 and 1.13?

@apelisse
Member

apelisse commented Mar 1, 2019

Either the feature doesn't work or the test is broken, but TestEventChannelFull is extremely flaky.

@apelisse
Member

apelisse commented Mar 1, 2019

//pkg/kubelet/pleg:go_default_test       FAILED in 230 out of 1000 in 0.2s

@changyaowei
Contributor Author

@apelisse From testing, I found that the klog.Error line takes a long time.

@apelisse
Member

apelisse commented Mar 4, 2019

Are you saying that this is a timing issue? In general, time shouldn't have an impact in unit-tests. I don't have any problem with the time, my main concern is the flakiness.
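As an illustration of a check that cannot depend on timing, here is a minimal sketch that fills a buffered channel from a single goroutine and asserts on the result. It is not the PR's TestEventChannelFull and does not diagnose why that test is flaky. `sendAll` and `checkChannelFull` are hypothetical names, and the check returns an error instead of using the testing package so that the block runs on its own.

```go
package main

import "fmt"

// sendAll pushes every value into ch without blocking and returns how
// many values were dropped because the buffer was full.
func sendAll(ch chan int, values []int) (dropped int) {
	for _, v := range values {
		select {
		case ch <- v:
		default:
			dropped++
		}
	}
	return dropped
}

// checkChannelFull uses no sleeps, timers, or extra goroutines, so its
// outcome does not depend on scheduling or on how long logging takes.
func checkChannelFull() error {
	const capacity = 4
	ch := make(chan int, capacity)
	values := []int{1, 2, 3, 4, 5, 6, 7}

	dropped := sendAll(ch, values)
	if len(ch) != capacity {
		return fmt.Errorf("queued %d events, want %d", len(ch), capacity)
	}
	if want := len(values) - capacity; dropped != want {
		return fmt.Errorf("dropped %d events, want %d", dropped, want)
	}
	// The events that were accepted must come out in send order.
	for want := 1; want <= capacity; want++ {
		if got := <-ch; got != want {
			return fmt.Errorf("received %d, want %d", got, want)
		}
	}
	return nil
}

func main() {
	if err := checkChannelFull(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok")
}
```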

@changyaowei
Contributor Author

@apelisse I ran the test on my Mac and everything passed. Can you share your failure output?

@apelisse
Member

apelisse commented Mar 5, 2019

Yeah, you ran it twice, I ran it 1000 times :-) Also note that go caches the output after the first success, so you can't simply run it multiple times naively.

It failed 230 times out of the 1000 times I ran it.

@resouer
Contributor

resouer commented Mar 26, 2019

@apelisse We are trying to reproduce and dig into the change now.

@apelisse
Member

You can reproduce by running:

bazel test --runs_per_test=1000 //pkg/kubelet/pleg:go_default_test

@ping035627
Contributor

@apelisse, @changyaowei
How has this problem progressed? TestEventChannelFull also fails often in my CI environment.

@gsjeon

gsjeon commented Aug 18, 2023

@changyaowei Does it qualify for cherry pick into 1.12 ?

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deadlock in PLEG relist() for health check and Kubelet syncLoop()
9 participants