Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch WaitForCertificate to informers to avoid broken watches #73030

Merged
merged 2 commits into from
Feb 7, 2019

Conversation

tnozicka
Copy link
Contributor

@tnozicka tnozicka commented Jan 17, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
ListWatchUntil can't handle closed watches and is being replaced by UntilWithSync based on informers.

Special notes for your reviewer:
Split from #50102

Replaces #73027 since this was already done in #50102 and we don't need to write it all over again.

Requires:

Release note:

NONE

/cc @wojtek-t @liggitt @tedyu

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Jan 17, 2019
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 17, 2019
@k8s-ci-robot
Copy link
Contributor

@tnozicka: GitHub didn't allow me to request PR reviews from the following users: tedyu.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

What type of PR is this?
/kind bug

What this PR does / why we need it:
ListWatchUntil can't handle closed watches and is being replaced by UntilWithSync based on informers.

Special notes for your reviewer:
Split from #50102

Replaces #73027 since this was already done in #50102 and we don't need to write it all over again.

Release note:

NONE

/cc @wojtek-t @liggitt @tedyu

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 17, 2019
@@ -417,30 +416,6 @@ func TestRotateCertCreateCSRError(t *testing.T) {
}
}

func TestRotateCertWaitingForResultError(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a different watch error we could/should trigger to exercise this case, or is timeout the only error that can occur now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think informers just go on in parallel no matter what happens

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't setting the timeout short let us exercise this case?

@tnozicka
Copy link
Contributor Author

I'll have to fix the simple fake added 2 months ago here https://github.com/kubernetes/kubernetes/blame/fde87329cbfbd08c6cdf3b6b8dd354ee8e10a858/cmd/kubelet/app/server_bootstrap_test.go#L263 as the strict path matching won't work otherwise

@tnozicka
Copy link
Contributor Author

requires #73080 first
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 18, 2019
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 18, 2019
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 22, 2019
@tnozicka
Copy link
Contributor Author

rebased to pick #73080
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 22, 2019
@tnozicka
Copy link
Contributor Author

/retest

Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - modulo some minor comment.

@@ -254,7 +255,11 @@ func (s *csrSimulator) ServeHTTP(w http.ResponseWriter, req *http.Request) {
defer s.lock.Unlock()
t := s.t

t.Logf("Request %s %s %s", req.Method, req.URL, req.UserAgent())
// filter out timeouts as csrSimulator don't support them
req.URL.RawQuery = regexp.MustCompile("&timeout=[0-9smh]*").ReplaceAllLiteralString(req.URL.RawQuery, "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The * at the end means that we can probably filter out some other parameters.
Maybe that's fine, but if so this requires updating the comment.

Copy link
Contributor Author

@tnozicka tnozicka Jan 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the * applies only to the set it is following so it shouldn't filter out other params - https://play.golang.org/p/eOMx58K6U2x

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest

q := req.URL.Query()
q.Del("timeout")
q.Del("timeoutSeconds")
req.URL.RawQuery = q.Encode()

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 22, 2019
}); err != nil {
// Wait for the certificate to be signed. This interface and internal timout
// is a remainder after the old design using raw watch wrapped with backoff.
timeout := 15 * time.Minute
Copy link
Contributor Author

@tnozicka tnozicka Jan 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hope that seems like a reasonable timeout - there was 1m + the backoff
this whole func is wrapped in backoff and poolInfinite in other place where it accounts for failures. It would be nice to plumb it through but that's not in the scope of this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously, the backoff was a private package var we manipulated in tests... can we do the same with the timeout, to restore TestRotateCertWaitingForResultError?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(previously, the backoff took 4 * 1m + ~8 minutes + watch durations, so 15 minutes approximates current delay and seems ok)

@tnozicka
Copy link
Contributor Author

/retest

@wojtek-t
Copy link
Member

wojtek-t commented Jan 23, 2019

This looks reasonable to me now, but I will let @liggitt to make a final look.
/assign @liggitt

@@ -417,30 +416,6 @@ func TestRotateCertCreateCSRError(t *testing.T) {
}
}

func TestRotateCertWaitingForResultError(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't setting the timeout short let us exercise this case?

}); err != nil {
// Wait for the certificate to be signed. This interface and internal timout
// is a remainder after the old design using raw watch wrapped with backoff.
timeout := 15 * time.Minute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously, the backoff was a private package var we manipulated in tests... can we do the same with the timeout, to restore TestRotateCertWaitingForResultError?

}); err != nil {
// Wait for the certificate to be signed. This interface and internal timout
// is a remainder after the old design using raw watch wrapped with backoff.
timeout := 15 * time.Minute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(previously, the backoff took 4 * 1m + ~8 minutes + watch durations, so 15 minutes approximates current delay and seems ok)

@tnozicka
Copy link
Contributor Author

tnozicka commented Feb 6, 2019

@liggitt thanks, comments addressed

@liggitt
Copy link
Member

liggitt commented Feb 6, 2019

/retest
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 6, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, tnozicka

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 6, 2019
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@tnozicka
Copy link
Contributor Author

tnozicka commented Feb 7, 2019

/retest

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Feb 7, 2019

@tnozicka: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-local-e2e-containerized 29ba8b2 link /test pull-kubernetes-local-e2e-containerized

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot merged commit 1b26097 into kubernetes:master Feb 7, 2019
@tnozicka tnozicka deleted the fix-csr-list-watch branch February 7, 2019 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants