[Flaking Test] [sig-node] gce-cos-master-alpha-features failed for EventedPLEG #122721

Closed
pacoxu opened this issue Jan 12, 2024 · 24 comments
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/node Categorizes an issue or PR as relevant to SIG Node.

@pacoxu
Member

pacoxu commented Jan 12, 2024

Which jobs are failing?

https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-alpha-features is broken

  • Flaking rate is higher than 50%.

Which tests are failing?

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-alpha-features/1745617205501890560

Since when has it been failing?

After #122697 is merged.

Testgrid link

https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-alpha-features

Reason for failure (if possible)

EventedPLEG has a known issue, #121349, which causes static pod startup to fail most of the time.

Anything else we need to know?

#122124
#122763 is a potential fix, but it is not complete yet.

Relevant SIG(s)

/sig node

@pacoxu pacoxu added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Jan 12, 2024
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 12, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pacoxu pacoxu changed the title [Failing Test] [sig-node] gce-cos-master-alpha-features failed for EventedPLEG [Flaking Test] [sig-node] gce-cos-master-alpha-features failed for EventedPLEG Jan 14, 2024
@pacoxu
Member Author

pacoxu commented Jan 15, 2024

/milestone v1.30
/cc @kubernetes/sig-node-leads @kubernetes/release-team-release-signal

@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Jan 15, 2024
@pacoxu
Member Author

pacoxu commented Jan 15, 2024

/priority critical-urgent
because the master-blocking board is failing (flaking rate above 50%)

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jan 15, 2024
@aojea
Member

aojea commented Jan 15, 2024

Assigning the feature owners #111384

/assign @harche @derekwaynecarr

@harche
Contributor

harche commented Jan 15, 2024

Thanks @pacoxu for fixing it: #122763 (comment)

@p0lyn0mial
Contributor

p0lyn0mial commented Jan 15, 2024

@harche
Contributor

harche commented Jan 15, 2024

@p0lyn0mial triggered it, #122763 (comment)

@p0lyn0mial
Contributor

@pacoxu @harche it looks like finding and merging a fix might take some time. Have you considered reverting the changes that introduced the issue?

@pacoxu
Member Author

pacoxu commented Jan 17, 2024

> @pacoxu @harche it looks like finding and merging a fix might take some time. Have you considered reverting the changes that introduced the issue?

Evented PLEG was recently reverted from beta to alpha.
Do you mean reverting the feature itself, which has been in place since 1.27?

@pacoxu
Member Author

pacoxu commented Jan 17, 2024

The PR under review can fix this failure, but it needs more review.

@p0lyn0mial
Contributor

pull-kubernetes-e2e-gce-cos-alpha-features is constantly failing. It was fine in the past; something broke it around January 11. Maybe not everything was reverted? Is there a way to disable the alpha feature you mentioned in the *-alpha-features jobs? (You could re-enable it once you know it works.)

@pacoxu
Member Author

pacoxu commented Jan 17, 2024

#122697 reverted Evented PLEG from beta to alpha. This CI job enables all alpha feature gates and nothing else. Before #122697, the feature was beta and disabled by default, so it was not enabled in this alpha CI job and could not make it fail.

After the revert, it is an alpha feature gate, so it is enabled in this alpha CI job, and the job started to fail.
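
A minimal sketch of the gate behaviour described above, assuming a simplified model in which a job that sets only AllAlpha=true turns on every alpha gate and leaves all other gates at their defaults. The `Feature` type and `EnabledInAlphaJob` function are hypothetical names used for illustration; this is not the Kubernetes feature gate implementation.

```go
package main

import "fmt"

// Stage is the maturity level of a feature gate.
type Stage string

const (
	Alpha Stage = "ALPHA"
	Beta  Stage = "BETA"
)

// Feature is a simplified, hypothetical stand-in for a feature gate spec.
type Feature struct {
	Stage   Stage
	Default bool
}

// EnabledInAlphaJob reports whether a feature is on in a job that sets only
// AllAlpha=true: alpha gates are switched on, every other gate keeps its default.
func EnabledInAlphaJob(f Feature) bool {
	if f.Stage == Alpha {
		return true
	}
	return f.Default
}

func main() {
	// Evented PLEG before #122697: beta, disabled by default.
	before := Feature{Stage: Beta, Default: false}
	// Evented PLEG after #122697: alpha.
	after := Feature{Stage: Alpha, Default: false}

	fmt.Println("enabled before the revert:", EnabledInAlphaJob(before))
	fmt.Println("enabled after the revert:", EnabledInAlphaJob(after))
}
```

Under this model the gate is off in the alpha job before the revert and on after it, which matches the point at which the job started to fail.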

@p0lyn0mial
Contributor

Since it is failing in alpha, would you consider disabling the feature altogether? I think it boils down to setting --env=KUBE_FEATURE_GATES=AllAlpha=true,YOUR_FEATURE=false in https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-cloud-provider/gcp/gcp-gce.yaml#L400 (and similarly for the other jobs).
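
A rough sketch of why such a setting would switch off one gate while keeping the rest of the alpha gates on, assuming that an explicit per-feature entry takes precedence over AllAlpha. `ParseGates` and `AlphaEnabled` are hypothetical helpers written for this illustration (not test-infra or kubelet code), and `EventedPLEG` is substituted for the YOUR_FEATURE placeholder.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseGates turns a string such as "AllAlpha=true,EventedPLEG=false"
// into a map from gate name to its requested value.
func ParseGates(s string) (map[string]bool, error) {
	gates := make(map[string]bool)
	for _, pair := range strings.Split(s, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue
		}
		name, value, ok := strings.Cut(pair, "=")
		if !ok {
			return nil, fmt.Errorf("missing '=' in %q", pair)
		}
		enabled, err := strconv.ParseBool(strings.TrimSpace(value))
		if err != nil {
			return nil, fmt.Errorf("invalid value in %q: %w", pair, err)
		}
		gates[strings.TrimSpace(name)] = enabled
	}
	return gates, nil
}

// AlphaEnabled resolves an alpha gate: an explicit entry wins, otherwise the
// gate follows AllAlpha (assumed precedence, see the note above).
func AlphaEnabled(gates map[string]bool, name string) bool {
	if v, ok := gates[name]; ok {
		return v
	}
	return gates["AllAlpha"]
}

func main() {
	gates, err := ParseGates("AllAlpha=true,EventedPLEG=false")
	if err != nil {
		panic(err)
	}
	fmt.Println("EventedPLEG:", AlphaEnabled(gates, "EventedPLEG"))
	fmt.Println("any other alpha gate:", AlphaEnabled(gates, "SomeOtherAlphaFeature"))
}
```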

@pacoxu
Member Author

pacoxu commented Jan 17, 2024

> Since it is failing in alpha, would you consider disabling the feature altogether? I think it boils down to setting --env=KUBE_FEATURE_GATES=AllAlpha=true,YOUR_FEATURE=false in https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-cloud-provider/gcp/gcp-gce.yaml#L400 (and similarly for the other jobs).

This is another way to fix the CI.

  • Pros of disabling it:
    • gce-cos-master-alpha-features is one of the jobs on the master-blocking board
    • gce-cos-master-alpha-features fails only because of EventedPLEG
    • the PR is still under discussion, with no clear timeline
  • Cons:
    • gce-cos-master-alpha-features flakes and can pass on retry (the passing rate is about 40% according to testgrid)
    • the PR is almost done and is waiting for more reviews and approval
    • we have to revert the test-infra change once the EventedPLEG bug is fixed

@harche @aojea @smarterclayton what do you think?

@harche
Contributor

harche commented Jan 17, 2024

IMHO, considering we do not have a timeline, it may not be a bad idea to temporarily omit the Evented PLEG feature from the alpha features job to unblock everyone else.

@pacoxu
Member Author

pacoxu commented Jan 17, 2024

@tzneal tzneal moved this from Triage to PRs - Needs Reviewer in SIG Node CI/Test Board Jan 17, 2024
@aojea
Member

aojea commented Jan 17, 2024

@pacoxu which issue do we want to keep open to track this?

@pacoxu
Member Author

pacoxu commented Jan 19, 2024

Reference

In kubernetes/test-infra#31647, I disabled Evented PLEG only for the presubmit CI jobs.

The master-blocking board is still failing (flaking at a high rate). We still need to fix it as a high priority.

@harche
Contributor

harche commented Jan 19, 2024

@pacoxu while we are still trying to tweak Generic PLEG to exhibit similar behaviour, if we gate all the changes to Evented PLEG only, then we can merge the PR, as it doesn't alter anything when the regular Generic PLEG is in action.

https://github.com/kubernetes/kubernetes/pull/122778/files#r1459564818

@harche
Contributor

harche commented Jan 29, 2024

> @pacoxu while we are still trying to tweak Generic PLEG to exhibit similar behaviour, if we gate all the changes to Evented PLEG only, then we can merge the PR, as it doesn't alter anything when the regular Generic PLEG is in action.
>
> https://github.com/kubernetes/kubernetes/pull/122778/files#r1459564818

No luck yet keeping the cluster steady while messing around with Generic PLEG to reproduce the issue.

cc @sairameshv

@harche
Contributor

harche commented Jan 29, 2024

I am trying to see why we need to restart the container if it is in the created state, even in the case of Generic PLEG. What happens if you make the change in this PR for ContainerStateCreated also applicable to Generic PLEG?

https://github.com/kubernetes/kubernetes/pull/122737/files#diff-cb70f01a3ac982d9bd1fda913788b2ef7d9862cf6392204f6db00f3cb2292813R90

@harche
Contributor

harche commented Jan 29, 2024

> I am trying to see why we need to restart the container if it is in the created state, even in the case of Generic PLEG. What happens if you make the change in this PR for ContainerStateCreated also applicable to Generic PLEG?
>
> https://github.com/kubernetes/kubernetes/pull/122737/files#diff-cb70f01a3ac982d9bd1fda913788b2ef7d9862cf6392204f6db00f3cb2292813R90

Looks like it was introduced with this PR: 9fa1ad2

But I am not sure whether this assumption, 9fa1ad2#diff-e81aa7518bebe9f4412cb375a9008b3481b19ec3e851d3187b3021ee94148f0dR1214-R1219, is true.

In the kubelet you can clearly see create container and start container are two distinct steps.
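
A minimal sketch of the two-step sequence mentioned above, using a hypothetical in-memory `FakeRuntime` rather than kubelet or CRI client code. The method names mirror the CRI CreateContainer and StartContainer calls, and the state names mirror the CRI container states; everything else is an assumption made for illustration. It shows that a container observed between the two steps is in the created state without having been started.

```go
package main

import (
	"errors"
	"fmt"
)

// State mirrors the names of two CRI container states.
type State string

const (
	Created State = "CONTAINER_CREATED"
	Running State = "CONTAINER_RUNNING"
)

// FakeRuntime is a hypothetical in-memory runtime used only for illustration.
type FakeRuntime struct {
	states map[string]State
}

func NewFakeRuntime() *FakeRuntime {
	return &FakeRuntime{states: make(map[string]State)}
}

// CreateContainer registers the container but does not run it.
func (r *FakeRuntime) CreateContainer(id string) {
	r.states[id] = Created
}

// StartContainer moves a previously created container to the running state.
func (r *FakeRuntime) StartContainer(id string) error {
	if r.states[id] != Created {
		return errors.New("container must be created before it is started")
	}
	r.states[id] = Running
	return nil
}

// Status returns the current state, or an empty State for an unknown ID.
func (r *FakeRuntime) Status(id string) State {
	return r.states[id]
}

func main() {
	rt := NewFakeRuntime()

	rt.CreateContainer("c1")
	// Anything that inspects the container here sees it as created.
	fmt.Println("after create:", rt.Status("c1"))

	if err := rt.StartContainer("c1"); err != nil {
		panic(err)
	}
	fmt.Println("after start:", rt.Status("c1"))
}
```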

@pacoxu
Member Author

pacoxu commented Feb 2, 2024

As we disabled it in CI, the CI is no longer flaking because of this. Let's discuss it in a more focused issue.

@pacoxu
Member Author

pacoxu commented Feb 2, 2024

> As we disabled it in CI, the CI is no longer flaking because of this. Let's discuss it in a more focused issue.

@harche I added all the related context in #123087.
