Automated cherry pick of #51039 #51224 #51230 #52057 #52200

enisoc
Member

@enisoc enisoc commented Sep 8, 2017

Cherry pick of #51039 #51224 #51230 #52057 on release-1.7.

These are flakiness fixes that have helped on master.

#51039: StatefulSet: Deflake e2e "Saturate" phase.
#51224: StatefulSet: Deflake e2e "restart" phase.
#51230: StatefulSet: Deflake e2e kubectl exec commands.
#52057: StatefulSet: Deflake e2e RunHostCmd.

ref #48031

The "Saturate" phase of StatefulSet e2e tests verifies orderly startup
by controlling when each Pod is allowed to report Ready.
If a Pod unexpectedly goes down during the test, the replacement Pod
created by the controller will forget whether it was already allowed to
report Ready.

After this change, the signal that allows each Pod to report Ready is
persisted in the Pod's PVC. Thus, the replacement Pod will remember that
it was already told to proceed to a Ready state.
The test used to scale the StatefulSet down to 0, wait for ListPods to
return 0 matching Pods, and then scale the StatefulSet back up.

This was prone to a race in which StatefulSet was told to scale back up
before it had observed its own deletion of the last Pod, as evidenced by
logs showing the creation of Pod ss-1 prior to the creation of the
replacement Pod ss-0.

We now wait for the controller to observe all deletions before
scaling it back up. This should fix flakes of the form:

```
Too many pods scheduled, expected 1 got 2
```
We seem to get a lot of flakes due to "connection refused" while running
`kubectl exec`. I can't find any reason this would be caused by the test
flow, so I'm adding retries to see if that helps.
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 8, 2017
@k8s-github-robot k8s-github-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Sep 8, 2017
@enisoc enisoc added this to the v1.7 milestone Sep 8, 2017
@enisoc enisoc added cherrypick-candidate release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Sep 8, 2017
@enisoc enisoc assigned wojtek-t and unassigned sttts and enj Sep 8, 2017
@wojtek-t
Member

Thanks @enisoc . LGTM.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 11, 2017
@wojtek-t wojtek-t added cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. and removed do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. cherrypick-candidate labels Sep 11, 2017
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@enisoc
Member Author

enisoc commented Sep 11, 2017

/retest

@enisoc
Member Author

enisoc commented Sep 11, 2017

@wojtek-t It seems like federation e2e on release-1.7 has not passed in a while:

https://k8s-testgrid.appspot.com/release-1.7-all#gce-federation-release-1-7&width=5

@wojtek-t
Member

@kubernetes/sig-federation-bugs - can you please take a look? ^^

@k8s-ci-robot k8s-ci-robot added sig/federation kind/bug Categorizes issue or PR as related to a bug. labels Sep 11, 2017
@fejta-bot

/retest

@wojtek-t
Member

Since the federation presubmit is broken, and this cherry-pick touches only test files and is meant to deflake them, I'm kicking the tests and will merge it manually.

/retest

@wojtek-t
Member

/retest

The initial retry window of up to 20s was giving up too soon.
I'm seeing this test flake because the Node rebooted, and it takes ~2min
to recover.
Now StatefulSet RunHostCmd calls will use the same 5min timeout as
other Pod state checks.
@enisoc enisoc force-pushed the automated-cherry-pick-of-#51039-#51224-#51230-#52057-upstream-release-1.7 branch from 952b3ac to 9716e80 Compare September 12, 2017 17:09
@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 12, 2017
@enisoc
Member Author

enisoc commented Sep 12, 2017

@wojtek-t Could you re-LGTM? I just pulled in a few lines from #52352 due to a flake seen on this PR.

@wojtek-t
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 13, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enisoc, wojtek-t

Associated issue: 51039

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@wojtek-t
Member

/test pull-kubernetes-e2e-kops-aws

@k8s-ci-robot
Contributor

@enisoc: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-e2e-kops-aws | 9716e80 | link | `/test pull-kubernetes-e2e-kops-aws` |

@wojtek-t
Member

Strange - kops claims "no test failures".

I'm merging it manually to reduce flakiness.

@wojtek-t wojtek-t merged commit b0f7214 into kubernetes:release-1.7 Sep 13, 2017
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
7 participants