Bug 1823950: [baremetal] Switch to /readyz for haproxy healthchecking #1724

Merged: 1 commit merged into openshift:master on May 14, 2020

cybertron
Member

Per [0], the /readyz endpoint is how the API signals that it
is shutting down gracefully. Once /readyz starts to report failure,
we want to stop sending traffic to that backend. If we wait for
/healthz, it may be too late, because by the time /healthz starts failing
the API is no longer accepting connections.

I also moved the liveness probe for haproxy itself to use a /readyz
endpoint for consistency. This isn't strictly necessary, but I think
it will be less confusing if there aren't multiple health check
endpoints in the config.

[0]: openshift/installer#3537
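
To illustrate the reasoning above, here is a minimal Python sketch of a toy timeline, not the actual HAProxy or API server behavior. It assumes the API server gets its shutdown signal at t = 0, fails /readyz immediately, and keeps serving (and passing /healthz) for a grace window. The names `ApiServer`, `shutdown_delay`, `inter`, and `fall` are hypothetical, and the values used (60 s window, checks every 1 s, 2 failures to mark a backend down) are illustrative rather than taken from this PR.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ApiServer:
    """Toy API server that receives its shutdown signal at t = 0 (seconds)."""
    shutdown_delay: float  # hypothetical grace window before connections are refused

    def readyz_ok(self, t: float) -> bool:
        # Assumption: /readyz reports failure as soon as graceful shutdown begins.
        return t < 0

    def healthz_ok(self, t: float) -> bool:
        # Assumption: /healthz keeps passing until connections are refused.
        return t < self.shutdown_delay


def marked_down_at(check: Callable[[float], bool], inter: float, fall: int,
                   horizon: float = 3600.0) -> float:
    """Time at which `fall` consecutive checks, run every `inter` seconds, have failed."""
    failures, t = 0, 0.0
    while t <= horizon:
        failures = 0 if check(t) else failures + 1
        if failures >= fall:
            return t
        t += inter
    raise RuntimeError("backend was never marked down within the horizon")


def exposure(server: ApiServer, check: Callable[[float], bool],
             inter: float = 1.0, fall: int = 2) -> float:
    """Seconds during which traffic is still routed to a server that refuses connections."""
    return max(0.0, marked_down_at(check, inter, fall) - server.shutdown_delay)


server = ApiServer(shutdown_delay=60.0)
print(exposure(server, server.readyz_ok))   # 0.0
print(exposure(server, server.healthz_ok))  # 1.0
```

Under these assumptions, checking /readyz removes the backend well before it stops accepting connections, while checking /healthz leaves a short window in which requests are still routed to a dead backend.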

- What I did

- How to verify it

- Description for the changelog
Use correct health check endpoint in haproxy configuration to avoid intermittent outages of the api on graceful shutdown.

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-unspecified Referenced Bugzilla bug's severity is unspecified for the PR. label May 11, 2020
@openshift-ci-robot
Contributor

@cybertron: This pull request references Bugzilla bug 1823950, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1823950: [baremetal] Switch to /readyz for haproxy healthchecking

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label May 11, 2020
@@ -22,7 +22,7 @@ contents:
 listen health_check_http_url
 bind :::50936 v4v6
 mode http
-monitor-uri /healthz
+monitor-uri /readyz
Contributor

Isn't the /readyz endpoint usually used for the readiness check, and /healthz for the liveness check?
If that is the case, I think we should consider keeping /healthz here, as it's used for the HAProxy pod Liveness check.

Member

By the time /healthz turns red, it's too late -- you'll possibly have sent traffic to the node after it's become unavailable. We're seeing a lot of flakes in CI with connection reset and other errors related to the API server availability.

Contributor

This PR includes two changes:
A. The change the API folks recommended: switching the HAProxy backend check from /healthz to /readyz [1], which this PR covers. I'm fine with that.
B. Renaming the endpoint for the HAProxy static pod Liveness probe from /healthz to /readyz [2]. My comment/question was about the endpoint naming convention for Liveness/Readiness probes.

I just checked the endpoint names used in other pods (kube-api-server), and it seems they do not use /healthz for the Liveness probe and /readyz for the Readiness probe.
So I guess we can change it here as well.

[1] https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/baremetal/files/baremetal-haproxy-haproxy.yaml#L35
[2] https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/baremetal/files/baremetal-haproxy-haproxy.yaml#L24

Member Author

Just to capture what I said on Slack: this health check is an arbitrary endpoint. It doesn't actually relate to the API; it just needs to match whatever the haproxy liveness probe uses. I changed it so we'd be using consistent names here, but we might consider moving this endpoint to a completely different name to make it clear that it has nothing to do with the API endpoints.
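
As a rough illustration of the "it just needs to match" point, the following Python sketch checks that a liveness probe path agrees with the `monitor-uri` in an HAProxy config fragment. The helper names `monitor_uri` and `probe_matches` are hypothetical, and the embedded config only mirrors the lines visible in the diff above (the indentation is assumed); it is not the full template.

```python
import re
from typing import Optional

# Fragment modeled on the diff in this PR; indentation is an assumption.
HAPROXY_CFG = """\
listen health_check_http_url
  bind :::50936 v4v6
  mode http
  monitor-uri /readyz
"""


def monitor_uri(cfg: str) -> Optional[str]:
    """Return the path of the first monitor-uri directive, or None if absent."""
    match = re.search(r"^\s*monitor-uri\s+(\S+)", cfg, flags=re.MULTILINE)
    return match.group(1) if match else None


def probe_matches(cfg: str, probe_path: str) -> bool:
    """True when the liveness probe path equals the configured monitor-uri."""
    return monitor_uri(cfg) == probe_path


print(probe_matches(HAPROXY_CFG, "/readyz"))   # True
print(probe_matches(HAPROXY_CFG, "/healthz"))  # False
```

Any path would work here as long as both sides agree, which is why a name unrelated to the API endpoints could be used instead.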

@jcpowermac
Contributor

cc:
@patrickdillon
@mtnbikenc

@stbenjam
Member

Hey all, what's needed to get this merged? metal CI has been suffering for a while with these flakes, and I'd really like to see this soak for a few days before the 4.5 development window closes to confirm it fixes the API server flakes.

@yboaron
Contributor

yboaron commented May 12, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 12, 2020
@cybertron
Member Author

/assign @ericavonb

@ericavonb
Contributor

/lgtm

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cybertron, ericavonb, yboaron

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 12, 2020
@stbenjam
Member

@ericavonb Do you know what the status of e2e-gcp-op is? It's blocking merging this but doesn't look like it's passed at all in the last week or so.

@stbenjam
Member

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment

@ericavonb
Contributor

@sinnykumari was going to look into that test for another PR. Sinny, do these seem to be failing for the same reason?

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

6 similar comments

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@sinnykumari
Contributor

sinnykumari commented May 13, 2020

Yes, they are failing for the same reason; see #1723.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

5 similar comments

@stbenjam
Member

stbenjam commented May 13, 2020

In #1689, e2e-gcp-op was overridden; can you do the same here, please 🙏? We really, really, really need to see the impact of this on our 4.5 periodics.

@sinnykumari @ericavonb

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

13 similar comments

@yuqi-zhang
Contributor

/retest

The GCP-op should be passing now.

@kikisdeliveryservice
Contributor

yay it passed!

@openshift-merge-robot openshift-merge-robot merged commit ce9eadd into openshift:master May 14, 2020
@openshift-ci-robot
Contributor

@cybertron: Some pull requests linked via external trackers have merged: . The following pull requests linked via external trackers have not merged:

In response to this:

Bug 1823950: [baremetal] Switch to /readyz for haproxy healthchecking

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-unspecified Referenced Bugzilla bug's severity is unspecified for the PR. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.