Bug 1823950: [baremetal] Switch to /readyz for haproxy healthchecking #1724

cybertron · 2020-05-11T18:23:58Z

Per [0], the /readyz endpoint is how the api communicates that it
is gracefully shutting down. Once /readyz starts to report failure,
we want to stop sending traffic to that backend. If we wait for
/healthz, it may be too late because once /healthz starts failing
the api is already not accepting connections.
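
The timing argument above can be illustrated with a rough model. This is only a sketch under stated assumptions: the check interval, failure threshold, and shutdown window below are hypothetical numbers chosen for illustration, not values taken from this PR, and the model simplifies how haproxy schedules its checks.

```python
import math


def marked_down_at(fails_from, inter, fall):
    """Time at which a backend is marked down.

    Simplified model: checks fire at t = 0, inter, 2*inter, ... and a check
    fails if it fires at or after `fails_from`. The backend is marked down on
    the `fall`-th consecutive failed check.
    """
    first_failed_check = math.ceil(fails_from / inter) * inter
    return first_failed_check + (fall - 1) * inter


def exposure(fails_from, stops_accepting_at, inter, fall):
    """Seconds during which traffic is still routed to a backend that no
    longer accepts connections."""
    return max(0, marked_down_at(fails_from, inter, fall) - stops_accepting_at)


# Hypothetical timeline in seconds; none of these numbers come from the PR.
READYZ_FAILS_FROM = 0     # graceful shutdown begins, /readyz starts failing
STOPS_ACCEPTING_AT = 10   # api stops accepting connections, /healthz starts failing
INTER, FALL = 1, 2        # assumed check interval and failure threshold

print(exposure(READYZ_FAILS_FROM, STOPS_ACCEPTING_AT, INTER, FALL))   # 0
print(exposure(STOPS_ACCEPTING_AT, STOPS_ACCEPTING_AT, INTER, FALL))  # 1
```

In this model, checking /readyz takes the backend out of rotation well before it stops accepting connections, while checking /healthz always leaves a window (here one second) in which requests can still be sent to a dead backend.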

I also moved the liveness probe for haproxy itself to use a /readyz
endpoint for consistency. This isn't strictly necessary, but I think
it will be less confusing if there aren't multiple health check
endpoints in the config.

0: openshift/installer#3537

- What I did

- How to verify it

- Description for the changelog
Use correct health check endpoint in haproxy configuration to avoid intermittent outages of the api on graceful shutdown.

openshift-ci-robot · 2020-05-11T18:24:04Z

@cybertron: This pull request references Bugzilla bug 1823950, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.5.0) matches configured target release for branch (4.5.0)
bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1823950: [baremetal] Switch to /readyz for haproxy healthchecking

yboaron · 2020-05-12T10:16:23Z

templates/master/00-master/openstack/files/openstack-haproxy-haproxy.yaml

@@ -22,7 +22,7 @@ contents:
    listen health_check_http_url
      bind :::50936 v4v6
      mode http
-      monitor-uri /healthz
+      monitor-uri /readyz


Isn't the /readyz endpoint usually used for readiness checks and /healthz for liveness checks?
If that is the case, I think we should consider keeping /healthz here, as it's used for the HAProxy pod liveness check.

By the time /healthz turns red, it's too late -- you'll possibly have sent traffic to the node after it's become unavailable. We're seeing a lot of flakes in CI with connection reset and other errors related to the API server availability.

This PR includes two changes:
A. Switching the HAProxy backend check from /healthz to /readyz [1], as the API folks recommended. I'm fine with that.
B. Renaming the endpoint used for the HAProxy static pod liveness probe from /healthz to /readyz [2]. My comment/question was about the endpoint naming convention for liveness/readiness probes.

I just checked the endpoint names used in other pods (kube-api-server), and it seems they don't follow a convention of /healthz for the liveness probe and /readyz for the readiness probe.
So I guess we can change it here as well.

[1] https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/baremetal/files/baremetal-haproxy-haproxy.yaml#L35
[2] https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/baremetal/files/baremetal-haproxy-haproxy.yaml#L24

Just to capture what I said on Slack: this healthcheck is an arbitrary endpoint. It doesn't actually relate to the api; it just needs to match whatever the haproxy liveness probe uses. I changed it so we'd be using consistent names here, but we might consider moving this endpoint to a completely different name to make it clear that it has nothing to do with the api endpoints.
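
The only real constraint described above, that the monitor endpoint matches whatever path the liveness probe requests, could be checked with something like the following. This is a minimal sketch: the config fragment mirrors the listen block from the diff in this thread, while the helper names and the probe path argument are hypothetical and not part of this repository.

```python
import re
from typing import List


def monitor_uris(haproxy_cfg: str) -> List[str]:
    """Return every path configured with a `monitor-uri` directive."""
    return re.findall(r"^\s*monitor-uri\s+(\S+)", haproxy_cfg, flags=re.MULTILINE)


def probe_matches(haproxy_cfg: str, probe_path: str) -> bool:
    """True if the liveness probe path is one haproxy answers on."""
    return probe_path in monitor_uris(haproxy_cfg)


# Fragment mirroring the listen block shown in the diff above.
CFG = """\
listen health_check_http_url
  bind :::50936 v4v6
  mode http
  monitor-uri /readyz
"""

# The probe path is a hypothetical stand-in for the static pod's liveness probe path.
print(probe_matches(CFG, "/readyz"))   # True
print(probe_matches(CFG, "/healthz"))  # False
```

Any path would pass this check as long as both sides agree, which is why a name unrelated to the api endpoints would work equally well.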

jcpowermac · 2020-05-12T13:52:41Z

cc:
@patrickdillon
@mtnbikenc

stbenjam · 2020-05-12T14:39:17Z

Hey all, what's needed to get this merged? metal CI has been suffering for a while with these flakes, and I'd really like to see this soak for a few days before the 4.5 development window closes to confirm it fixes the API server flakes.

yboaron · 2020-05-12T16:37:27Z

/lgtm

cybertron · 2020-05-12T17:00:49Z

/assign @ericavonb

ericavonb · 2020-05-12T20:09:42Z

/lgtm

openshift-ci-robot · 2020-05-12T20:09:56Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cybertron, ericavonb, yboaron

Needs approval from an approver in each of these files:

~~OWNERS~~ [ericavonb]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

stbenjam · 2020-05-12T20:16:32Z

@ericavonb Do you know what the status of e2e-gcp-op is? It's blocking merging this but doesn't look like it's passed at all in the last week or so.

stbenjam · 2020-05-12T20:17:03Z

pull-ci-openshift-machine-config-operator-master-e2e-gcp-op - 16 runs, 100% failed

openshift-bot · 2020-05-12T20:18:50Z

/retest