Bug 1823950: [baremetal] Switch to /readyz for haproxy healthchecking #1724
Conversation
Per [0], the /readyz endpoint is how the api communicates that it is gracefully shutting down. Once /readyz starts to report failure, we want to stop sending traffic to that backend. If we wait for /healthz, it may be too late because once /healthz starts failing the api is already not accepting connections. I also moved the liveness probe for haproxy itself to use a /readyz endpoint for consistency. This isn't strictly necessary, but I think it will be less confusing if there aren't multiple health check endpoints in the config. 0: openshift/installer#3537
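To make the intent concrete, here is a minimal sketch of the kind of HAProxy backend stanza this change targets, with the server health check pointed at /readyz instead of /healthz. The backend name, server addresses, and check timings below are illustrative placeholders, not the exact values from the MCO template:

```
backend masters
    mode tcp
    balance roundrobin
    # Probe the API server's /readyz endpoint; once /readyz starts failing,
    # HAProxy marks the backend down before the API stops accepting connections.
    option httpchk GET /readyz HTTP/1.0
    # Hypothetical control-plane addresses; check-ssl is needed because 6443 serves TLS.
    server master-0 192.0.2.10:6443 check check-ssl verify none inter 1s fall 2 rise 3
    server master-1 192.0.2.11:6443 check check-ssl verify none inter 1s fall 2 rise 3
    server master-2 192.0.2.12:6443 check check-ssl verify none inter 1s fall 2 rise 3
```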
@cybertron: This pull request references Bugzilla bug 1823950, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
```diff
@@ -22,7 +22,7 @@ contents: |
    listen health_check_http_url
      bind :::50936 v4v6
      mode http
-     monitor-uri /healthz
+     monitor-uri /readyz
```
Isn't the /readyz endpoint usually used for readiness checks and /healthz for liveness checks?
If that is the case, I think we should consider keeping /healthz here, as it's used for the HAProxy pod liveness check.
By the time /healthz turns red, it's too late -- you'll possibly have sent traffic to the node after it's become unavailable. We're seeing a lot of flakes in CI with connection reset and other errors related to the API server availability.
This PR includes two changes:
A. The change the API folks recommended: switching from /healthz to /readyz for the HAProxy backend check [1]. That's covered in this PR and I'm fine with it.
B. Changing the endpoint name for the HAProxy static pod liveness probe from /healthz to /readyz [2]. My comment/question was about the endpoint naming convention for liveness/readiness probes.
I just checked the endpoint naming used in other pods (kube-apiserver), and it seems they don't strictly use /healthz for the liveness probe and /readyz for the readiness probe.
So I guess we can change it here as well.
[1] https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/baremetal/files/baremetal-haproxy-haproxy.yaml#L35
[2] https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/baremetal/files/baremetal-haproxy-haproxy.yaml#L24
Just to capture what I said on slack, this healthcheck is an arbitrary endpoint. It doesn't actually relate to the api, it just needs to match whatever the haproxy liveness probe uses. I changed it so we'd be using consistent names here, but we might consider moving this endpoint to a completely different name to make it clear that it has nothing to do with the api endpoints.
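For illustration, a rough sketch of what that could look like, using a made-up path name (/haproxy_monitor is hypothetical, not what the template uses) and the monitor port from the diff above:

```
# Dedicated health endpoint for the HAProxy static pod itself. The path is
# arbitrary -- monitor-uri answers 200 on whatever name is configured here --
# but it must match the path the pod's liveness probe GETs on port 50936.
listen health_check_http_url
  bind :::50936 v4v6
  mode http
  monitor-uri /haproxy_monitor
```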
Hey all, what's needed to get this merged? metal CI has been suffering for a while with these flakes, and I'd really like to see this soak for a few days before the 4.5 development window closes to confirm it fixes the API server flakes.
/lgtm
/assign @ericavonb
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: cybertron, ericavonb, yboaron
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@ericavonb Do you know what the status of e2e-gcp-op is? It's blocking merging this but doesn't look like it's passed at all in the last week or so.
/retest Please review the full test history for this PR and help us cut down flakes.
1 similar comment
@sinnykumari was going to look into that test for another PR. sinny, do these seem to be failing for the same reason?
/retest Please review the full test history for this PR and help us cut down flakes.
6 similar comments
yes, they are failing for the same reason, see #1723.
/retest Please review the full test history for this PR and help us cut down flakes.
5 similar comments
e2e-gcp-op was overridden on #1689; can you do the same here please 🙏? We really, really need to see the impact of this on our 4.5 periodics.
/retest Please review the full test history for this PR and help us cut down flakes.
13 similar comments
/retest The GCP-op should be passing now.
yay it passed!
@cybertron: Some pull requests linked via external trackers have merged: . The following pull requests linked via external trackers have not merged:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Per [0], the /readyz endpoint is how the api communicates that it
is gracefully shutting down. Once /readyz starts to report failure,
we want to stop sending traffic to that backend. If we wait for
/healthz, it may be too late because once /healthz starts failing
the api is already not accepting connections.
I also moved the liveness probe for haproxy itself to use a /readyz
endpoint for consistency. This isn't strictly necessary, but I think
it will be less confusing if there aren't multiple health check
endpoints in the config.
0: openshift/installer#3537
- What I did
- How to verify it
- Description for the changelog
Use correct health check endpoint in haproxy configuration to avoid intermittent outages of the api on graceful shutdown.