Added healthchecks to the connect inject webhook. #626

NicoletaPopoviciu · 2021-08-19T15:36:15Z

Based on #458
Changes proposed in this PR:

added readiness and liveness checks to the connect inject webhook
@ishustava + I think that only the webhook should have the readiness check:
- It doesn't make sense to incorporate it into the readiness status because controller doesn't receive any traffic and doesn't listen on any ports (only the webhook does).
- If consul cluster is unhealthy, the endpoints controller will keep requeuing the request, and the readiness check will not help in changing that behavior. It will also not help to add a liveness check in that case, since the current behavior of requeuing the request is better than letting kubernetes to restart the container.

How I've tested this PR:
-unit tests
-acceptance tests

How I expect reviewers to test this PR:
-code review

Checklist:

tests added
CHANGELOG entry added (HashiCorp engineers only, community PRs should not add a changelog entry)

hashicorp-cla · 2021-08-19T15:36:19Z

All committers have signed the CLA.

NicoletaPopoviciu · 2021-08-19T15:39:19Z

.circleci/config.yml

@@ -68,7 +68,7 @@ commands:
        type: string
      consul-k8s-image:
        type: string
-        default: "docker.mirror.hashicorp.services/hashicorpdev/consul-k8s-control-plane:latest"


reminder to revert before merging

ishustava

Looks great Nicoleta 🎉 😃 🐶 !!

control-plane/connect-inject/health_checks_test.go

thisisnotashwin

This looks great!! Excellent first PR @NicoletaPopoviciu

thisisnotashwin · 2021-08-19T16:03:17Z

charts/consul/templates/connect-inject-deployment.yaml

+              port: 9445
+              scheme: HTTP
+            failureThreshold: 2
+            initialDelaySeconds: 2


This is more out of curiosity than anything else, but is there a reason the initialDelaySeconds is 2 here while it's value is 1 in the livenessProbe?

We've used the same configuration as before for both probes and didn't re-evaluate whether those make sense since it's essentially the same check as before. We'll take a deeper look on Monday.

We took another look, the two probes are initially triggered 1 second apart to keep the two requests staggered.

AHA!! good find!! thanks for the update Nicoleta!!

thisisnotashwin · 2021-08-19T16:35:58Z

control-plane/connect-inject/health_checks_test.go

+		{"Both cert and key files not present.", nil, nil, true},
+		{"Cert file not empty and key file missing.", ptrToString("test"), nil, true},
+		{"Key file not empty and cert file missing.", nil, ptrToString("test"), true},
+		{"Both cert and key files are present and not empty.", ptrToString("test"), ptrToString("test"), false},
+		{"Both cert and key files are present but both are empty.", ptrToString(""), ptrToString(""), true},
+		{"Both cert and key files are present but key file is empty.", ptrToString("test"), ptrToString(""), true},
+		{"Both cert and key files are present but cert file is empty.", ptrToString(""), ptrToString("test"), true},


This is excellent!!

lkysow · 2021-08-19T20:56:08Z

charts/consul/templates/connect-inject-deployment.yaml

+              scheme: HTTP
+            failureThreshold: 2
+            initialDelaySeconds: 1
+            periodSeconds: 2


I'm curious if it needs to be checked every 2 seconds? How likely is it for the certs to be removed from disk?

Yep, we'll take another look. We've added the same configuration as before the refactor, but we will re-think whether those still make sense.

We kept periodSeconds=2, as it was the value we had before the t-proxy refactor; we looked more into Kubernetes documentation, and the default value for periodSeconds is 10. Do you have any guidance as to whether we should increase the value, and by how much? And to your point once there the certificates are available, they are unlikely to be removed.

So the issue we were having before we removed these probes was that the webhook cert manager was being a bit slow in creating the secrets and the inject deployment was crash looping. I'd like to avoid that issue this time round.

When Kubernetes tries to schedule a Pod, it will make a call to this webhook. If the webhook doesn't have certs, then the call will fail and since the failure policy is set to fail, the Pod scheduling will fail. In sum, we kinda already get the correct behaviour, even without the probes.

Now the readiness probe is still good because if for some reason the cert secrets aren't getting created or being mounted properly, you'll see that there's an issue because the probe will be failing.

The liveness probe will only help in the case where the certs are created but aren't being mounted properly and somehow restarting the container will get the certs mounted.

Once the certs have been created and the pod starts successfully the only situations where we'll have issues are:

The secret is deleted.

The cert manager deletes the keys from the secret.

In both these cases, Pod scheduling will start to fail because of the failure policy so I don't think quick failure detection is necessary.

So in sum, I think we should tweak these values so they give good signal if there's an issue, they give some time on startup for the webhook cert manager to create the secrets, and they don't check too often to reduce load.

Maybe 10s for initialDelaySeconds, then 30s for periodSeconds.

Co-authored-by: Iryna Shustava <ishustava@users.noreply.github.com>

This reverts commit dbb4ba3.

NicoletaPopoviciu added type/enhancement New feature or request area/connect Related to Connect service mesh, e.g. injection labels Aug 19, 2021

NicoletaPopoviciu requested review from ishustava and thisisnotashwin August 19, 2021 15:37

NicoletaPopoviciu commented Aug 19, 2021

View reviewed changes

NicoletaPopoviciu and others added 3 commits August 19, 2021 11:43

Add healthprobes to connect inject webhook.

9399ab7

Update docker image for acceptance tests

dbb4ba3

Test.

65fe289

NicoletaPopoviciu force-pushed the nicoleta-tproxy-healthcheck branch from cc6b14f to 65fe289 Compare August 19, 2021 15:45

Update CHANGELOG.md

ba3ea43

ishustava approved these changes Aug 19, 2021

View reviewed changes

control-plane/connect-inject/health_checks_test.go Outdated Show resolved Hide resolved

thisisnotashwin approved these changes Aug 19, 2021

View reviewed changes

lkysow reviewed Aug 19, 2021

View reviewed changes

NicoletaPopoviciu and others added 4 commits August 23, 2021 16:34

Update control-plane/connect-inject/health_checks_test.go

a632cfc

Co-authored-by: Iryna Shustava <ishustava@users.noreply.github.com>

Merge branch 'master' into nicoleta-tproxy-healthcheck

e4d4745

Revert "Update docker image for acceptance tests"

e1369dc

This reverts commit dbb4ba3.

fix changelog

d2fbc8f

NicoletaPopoviciu merged commit 01d22a2 into main Aug 25, 2021

NicoletaPopoviciu deleted the nicoleta-tproxy-healthcheck branch August 25, 2021 17:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added healthchecks to the connect inject webhook. #626

Added healthchecks to the connect inject webhook. #626

NicoletaPopoviciu commented Aug 19, 2021

hashicorp-cla commented Aug 19, 2021 •

edited

Loading

NicoletaPopoviciu Aug 19, 2021

ishustava left a comment

thisisnotashwin left a comment

thisisnotashwin Aug 19, 2021

ishustava Aug 20, 2021

NicoletaPopoviciu Aug 24, 2021

thisisnotashwin Aug 24, 2021

thisisnotashwin Aug 19, 2021

lkysow Aug 19, 2021

ishustava Aug 20, 2021

NicoletaPopoviciu Aug 24, 2021

lkysow Aug 25, 2021

Added healthchecks to the connect inject webhook. #626

Added healthchecks to the connect inject webhook. #626

Conversation

NicoletaPopoviciu commented Aug 19, 2021

hashicorp-cla commented Aug 19, 2021 • edited Loading

Choose a reason for hiding this comment

ishustava left a comment

Choose a reason for hiding this comment

thisisnotashwin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hashicorp-cla commented Aug 19, 2021 •

edited

Loading