Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait for all informers to sync in /readyz. #92644

Merged
merged 1 commit into from
Jul 1, 2020

Conversation

wojtek-t
Copy link
Member

@wojtek-t wojtek-t commented Jun 30, 2020

Based on #92508 by @mborsz

What type of PR is this?
/kind bug

What this PR does / why we need it:
Context: #92506

In large clusters, informers in kube-apiserver may require significant time (30-50s) to initialize. Before this happens, kube-apiserver is not able to answer some requests (e.g. node authorizer is not able to accept any request).

This PR adds a way to determine if informers in kube-apiserver are already synced. This check is added to /readyz which can be used to determine if traffic can be sent to particular kube-apiserver.

Ref: #92506

Extend kube-apiserver /readyz with new "informer-sync" check ensuring that internal informers are synced.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 30, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 30, 2020
@wojtek-t
Copy link
Member Author

To copy some context from the original PR:

@deads2k

As I recall, we actually had this check in openshift and chose to remove it because it wasn't required for request safety and answering some requests generally worked better than answering no requests and failing a health check. /readyz would be less worrying from a "death cycle" point of view, but @sttts may still have opinions since he's currently looking at those.

@sttts

Readyz sounds much better than healthz. This is not about being healthy. Would like to hear @logicalhan 's opinion. He reworked the logic there quite a bit few months ago.

@logicalhan

Yeah /readyz sounds appropriate for this. We should probably add a release note stating that we are adding a readyz check which further bifurcates behavior from the deprecated healthz endpoint (as a word of warning).

@k8s-ci-robot k8s-ci-robot added area/apiserver area/test sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 30, 2020
@wojtek-t wojtek-t added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 30, 2020
@wojtek-t
Copy link
Member Author

/cc @mborsz - for posterity

@k8s-ci-robot k8s-ci-robot requested a review from mborsz June 30, 2020 11:07
@k8s-ci-robot
Copy link
Contributor

@wojtek-t: GitHub didn't allow me to request PR reviews from the following users: -, for, posterity.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @mborsz - for posterity

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wojtek-t
Copy link
Member Author

/retest

Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

just a couple minor comments.

Comment on lines 618 to 622
if err := s.addReadyzChecks(healthz.NewInformerSyncHealthz(c.SharedInformerFactory)); err != nil {
return nil, err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be moved up between lines 616-617, since we only want to add the readyz check once.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

var status int
res.StatusCode(&status)
raw, err := res.Raw()
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to do this, then we should probably just fully convert this function so that it doesn't return the bool.

if err != nil || status != http.StatusOK {
   t.Fatalf("got %v but wanted 200, error: %v", status, err)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then the calls on 121 and 124 don't have to be in conditional form.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed the function to return error actually

Comment on lines 107 to 114
started := i.sharedInformerFactory.WaitForCacheSync(stopCh)
klog.V(4).Infof("SharedInformerFactory.WaitForCacheSync finished with: %v", started)
var notStarted []string
for informType, started := range started {
if !started {
klog.Warningf("informer for %v has not synced yet", informType.String())
notStarted = append(notStarted, informType.String())
}
}
if len(notStarted) != 0 {
return fmt.Errorf("%d informers not started yet: %v", len(notStarted), notStarted)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a huge nit (sorry):

	stopCh := make(chan struct{})
	// Close stopCh to force checking if informers are synced now.
	close(stopCh)

	var informersByStarted map[bool][]string
	for informType, started := range i.sharedInformerFactory.WaitForCacheSync(stopCh) {
         informersByStarted[started] = append(informersByStarted[started], informType.String())
	}
	if len(informersByStarted[false]) != 0 {
		return fmt.Errorf("%d informers have not started yet: %v (but %v informers have started)", len(notStarted), informersByStarted[false], informersByStarted[true])
	}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 30, 2020
Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 30, 2020
@wojtek-t
Copy link
Member Author

@logicalhan sorry - fixed one more typo - if you could re-lgtm

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 30, 2020
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 30, 2020
@logicalhan
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 30, 2020
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@SataQiu
Copy link
Member

SataQiu commented Jul 1, 2020

/retest

@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 1, 2020
@wojtek-t
Copy link
Member Author

wojtek-t commented Jul 1, 2020

Just fixed the minor compile error in integration test - reaaplying the label.

@wojtek-t wojtek-t added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 1, 2020
@wojtek-t
Copy link
Member Author

wojtek-t commented Jul 1, 2020

/retest

@k8s-ci-robot
Copy link
Contributor

@wojtek-t: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-kubernetes-e2e-kind-ipv6 3f68000 link /test pull-kubernetes-e2e-kind-ipv6

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot merged commit 4c853bb into kubernetes:master Jul 1, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Jul 1, 2020
k8s-ci-robot added a commit that referenced this pull request Jul 8, 2020
…#92644-origin-release-1.16

[1.16] Cherry pick of #92644: Wait for all informers to sync in /readyz.
k8s-ci-robot added a commit that referenced this pull request Jul 9, 2020
…#92644-origin-release-1.18

[1.18] Cherry pick of #92644: Wait for all informers to sync in /readyz.
k8s-ci-robot added a commit that referenced this pull request Jul 9, 2020
…#92644-origin-release-1.17

[1.17] Cherry pick of #92644: Wait for all informers to sync in /readyz.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants