Automated cherry pick of #51039 #51224 #51230 #52057 #52200
Conversation
The "Saturate" phase of StatefulSet e2e tests verifies orderly startup by controlling when each Pod is allowed to report Ready. If a Pod unexepectedly goes down during the test, the replacement Pod created by the controller will forget if it was already allowed to report Ready. After this change, the signal that allows each Pod to report Ready is persisted in the Pod's PVC. Thus, the replacement Pod will remember that it was already told to proceed to a Ready state.
The test used to scale the StatefulSet down to 0, wait for ListPods to return 0 matching Pods, and then scale the StatefulSet back up. This was prone to a race in which the StatefulSet was told to scale back up before it had observed its own deletion of the last Pod, as evidenced by logs showing the creation of Pod ss-1 prior to the creation of the replacement Pod ss-0. We now wait for the controller to observe all deletions before scaling it back up. This should fix flakes of the form:
```
Too many pods scheduled, expected 1 got 2
```
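A minimal sketch of that waiting step, assuming a client-go version contemporary with release-1.7 (apps/v1beta1 StatefulSets, non-context Get); the namespace, StatefulSet name, and kubeconfig path are placeholders:

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForObservedReplicas polls until the StatefulSet controller reports that the
// number of Pods it has created (status.replicas) matches want, i.e. it has
// observed every deletion from the scale-down.
func waitForObservedReplicas(c kubernetes.Interface, ns, name string, want int32) error {
	return wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		ss, err := c.AppsV1beta1().StatefulSets(ns).Get(name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		return ss.Status.Replicas == want, nil
	})
}

func main() {
	// Build a client from the local kubeconfig (path is an assumption).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	c := kubernetes.NewForConfigOrDie(cfg)
	// Only scale the StatefulSet "ss" back up once the controller has observed 0 replicas.
	if err := waitForObservedReplicas(c, "e2e-tests", "ss", 0); err != nil {
		fmt.Println(err)
	}
}
```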
We seem to get a lot of flakes due to "connection refused" while running `kubectl exec`. I can't find any reason this would be caused by the test flow, so I'm adding retries to see if that helps.
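A hedged sketch of what such a retry wrapper might look like; the attempt count, backoff, and error matching are illustrative rather than the values used in the actual change:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// execWithRetries runs `kubectl exec` and retries only when the failure looks like a
// transient connection problem rather than a real command failure.
func execWithRetries(ns, pod string, cmd []string, attempts int) (string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		args := append([]string{"exec", "-n", ns, pod, "--"}, cmd...)
		out, err := exec.Command("kubectl", args...).CombinedOutput()
		if err == nil {
			return string(out), nil
		}
		lastErr = fmt.Errorf("attempt %d: %v (%s)", i+1, err, out)
		if !strings.Contains(string(out), "connection refused") {
			return string(out), lastErr // not the flake we are working around
		}
		time.Sleep(5 * time.Second)
	}
	return "", lastErr
}

func main() {
	out, err := execWithRetries("e2e-tests", "ss-0", []string{"hostname"}, 3)
	fmt.Println(out, err)
}
```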
Thanks @enisoc. LGTM. /lgtm
/retest Review the full test history for this PR.
/retest
@wojtek-t It seems like federation e2e on release-1.7 has not passed in a while: https://k8s-testgrid.appspot.com/release-1.7-all#gce-federation-release-1-7&width=5
@kubernetes/sig-federation-bugs - can you please take a look? ^^
/retest Review the full test history for this PR.
The federation presubmit is broken, but this cherry-pick only touches test files and is meant to deflake them, so I'm kicking tests and will merge it manually. /retest
/retest
The initial retry of up to 20s was giving up too soon. I'm seeing this test flake because the Node rebooted and takes ~2 minutes to recover. StatefulSet RunHostCmd calls now use the same 5-minute timeout as other Pod state checks.
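A minimal sketch of that timeout change, assuming a simple deadline loop around `kubectl exec`; the helper name and intervals are illustrative:

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// runHostCmdWithTimeout retries a command inside the Pod until it succeeds or the
// deadline passes, tolerating the ~2 minutes a rebooted Node can take to recover.
func runHostCmdWithTimeout(ns, pod, cmd string, interval, timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for {
		out, err := exec.Command("kubectl", "exec", "-n", ns, pod, "--", "sh", "-c", cmd).CombinedOutput()
		if err == nil {
			return string(out), nil
		}
		if time.Now().After(deadline) {
			return string(out), fmt.Errorf("timed out after %v: %v", timeout, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Example: use the same 5-minute budget as other Pod state checks.
	out, err := runHostCmdWithTimeout("e2e-tests", "ss-0", "cat /data/hostname", 5*time.Second, 5*time.Minute)
	fmt.Println(out, err)
}
```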
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: enisoc, wojtek-t
Associated issue: 51039
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing
/test all [submit-queue is verifying that this PR is safe to merge]
/test pull-kubernetes-e2e-kops-aws
@enisoc: The following test failed, say /retest to rerun it:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Strange - kops claims "no test failures". I'm merging it manually to reduce flakiness.
Cherry pick of #51039 #51224 #51230 #52057 on release-1.7.
These are flakiness fixes that have helped on master.
#51039: StatefulSet: Deflake e2e "Saturate" phase.
#51224: StatefulSet: Deflake e2e "restart" phase.
#51230: StatefulSet: Deflake e2e `kubectl exec` commands.
#52057: StatefulSet: Deflake e2e RunHostCmd.
ref #48031