[Flaking Test] [sig-node] gce-cos-master-alpha-features failed for EventedPLEG #122721
Comments
This issue is currently awaiting triage. |
/milestone v1.30 |
/priority critical-urgent |
Assigning the feature owners #111384 /assign @harche @derekwaynecarr |
Thanks @pacoxu for fixing it: #122763 (comment) |
@pacoxu is it also going to fix, for example, https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/122571/pull-kubernetes-e2e-gce-cos-alpha-features/1746816095597105152? It fails in a similar way to https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-alpha-features/1746600815063207936 |
@p0lyn0mial triggered it, #122763 (comment) |
The PR in review can fix this failure, but it needs more review. |
#122697 reverted Evented PLEG from beta to alpha. This CI job enables all alpha feature gates. Before #122697, the feature was beta and disabled by default, so it was not turned on in this alpha CI job and could not fail it. After the revert, it is an alpha feature gate, so this alpha CI job enables it, and the job started to fail. |
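To make the explanation above concrete, here is a minimal sketch of how a gate's stage could decide whether an "all alpha" job turns it on. It is a simplified model, not the Kubernetes feature-gate implementation: the `Gate` type, the `Enabled` function, and the precedence order (explicit override, then all-alpha, then default) are assumptions made for illustration.

```go
package main

import "fmt"

// Stage is the maturity level of a feature gate.
type Stage string

const (
	Alpha Stage = "ALPHA"
	Beta  Stage = "BETA"
)

// Gate is a hypothetical, simplified feature-gate spec.
type Gate struct {
	Stage   Stage
	Default bool
}

// Enabled resolves a gate in this simplified model:
// an explicit per-gate override wins, then the "all alpha" switch
// applies to alpha gates only, then the gate's own default is used.
func Enabled(name string, g Gate, allAlpha bool, overrides map[string]bool) bool {
	if v, ok := overrides[name]; ok {
		return v
	}
	if allAlpha && g.Stage == Alpha {
		return true
	}
	return g.Default
}

func main() {
	beforeRevert := Gate{Stage: Beta, Default: false} // beta, off by default
	afterRevert := Gate{Stage: Alpha, Default: false} // alpha after #122697

	// The alpha job enables all alpha gates and sets no per-gate override.
	fmt.Println("before revert:", Enabled("EventedPLEG", beforeRevert, true, nil))
	fmt.Println("after revert: ", Enabled("EventedPLEG", afterRevert, true, nil))

	// An explicit override switches the gate off even in the alpha job.
	off := map[string]bool{"EventedPLEG": false}
	fmt.Println("with override:", Enabled("EventedPLEG", afterRevert, true, off))
}
```

In this model the beta, default-off gate stays disabled in the alpha job, while the same gate as alpha is switched on unless it is explicitly overridden.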
Since it is failing in alpha, would you consider disabling the feature altogether? I think it boils down to setting
This would also fix the CI.
@harche @aojea @smarterclayton what do you think? |
IMHO, considering we do not have a timeline, it may not be a bad idea to omit the Evented PLEG feature from alpha features job temporarily to unblock everyone else. |
I opened kubernetes/test-infra#31647. In https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce-cos-alpha-features, |
@pacoxu what issue do we want to keep open to track this? |
In kubernetes/test-infra#31647, I only disabled Evented PLEG for presubmit CI jobs. The master-blocking board is still failing (flaking at a high rate). We still need to fix it as a high priority. |
@pacoxu while we are still trying to tweak Generic PLEG to exhibit similar behaviour, if we gate all the changes behind Evented PLEG only, then we can merge the PR, as it doesn't alter anything when the regular Generic PLEG is in use. https://github.com/kubernetes/kubernetes/pull/122778/files#r1459564818 |
No luck yet keeping the cluster steady while messing around with Generic PLEG to reproduce the issue. cc @sairameshv |
I am trying to understand why we need to restart the container if it is in the created state, even in the case of Generic PLEG. What happens if you make the change in this PR for |
It looks like it was introduced with this PR, but I am not sure that this assumption, 9fa1ad2#diff-e81aa7518bebe9f4412cb375a9008b3481b19ec3e851d3187b3021ee94148f0dR1214-R1219, is true. In the kubelet you can clearly see that creating a container and starting a container are two distinct steps. |
As we disabled it in CI, the CI is no longer flaking because of this. Let's discuss it in a straightforward issue. |
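The point about create and start being distinct steps can be sketched as follows. This is a hypothetical in-memory runtime, not kubelet or CRI code; only the state names are borrowed from the CRI container states, and `fakeRuntime` and its methods are assumptions for illustration. It shows that an observer who checks status between the two calls sees a container in the created state that has not failed.

```go
package main

import "fmt"

// State uses the CRI container state names; the type is a stand-in.
type State string

const (
	Created State = "CONTAINER_CREATED"
	Running State = "CONTAINER_RUNNING"
)

// fakeRuntime is a hypothetical in-memory runtime used only for this sketch.
type fakeRuntime struct {
	states map[string]State
}

func newFakeRuntime() *fakeRuntime {
	return &fakeRuntime{states: map[string]State{}}
}

// CreateContainer records the container as created but does not run it.
func (r *fakeRuntime) CreateContainer(id string) {
	r.states[id] = Created
}

// StartContainer moves a created container to running.
func (r *fakeRuntime) StartContainer(id string) error {
	if r.states[id] != Created {
		return fmt.Errorf("container %q is not in created state", id)
	}
	r.states[id] = Running
	return nil
}

// Status returns the last recorded state of the container.
func (r *fakeRuntime) Status(id string) State {
	return r.states[id]
}

func main() {
	rt := newFakeRuntime()

	rt.CreateContainer("c1")
	// A status check that lands here sees "created", not a failure.
	fmt.Println("after create:", rt.Status("c1"))

	if err := rt.StartContainer("c1"); err != nil {
		panic(err)
	}
	fmt.Println("after start: ", rt.Status("c1"))
}
```

Under this model, treating every container observed in the created state as one that must be restarted would also catch containers that are simply between the two steps.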
Which jobs are failing?
https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-alpha-features is broken
Which tests are failing?
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-alpha-features/1745617205501890560
Since when has it been failing?
Since #122697 was merged.
Testgrid link
https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-alpha-features
Reason for failure (if possible)
EventedPLEG has a known issue, #121349, which causes static pod startup failures (most of the time).
Anything else we need to know?
#122124, #122763: a potential fix, but not complete yet.
Relevant SIG(s)
/sig node