Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added healthchecks to the connect inject webhook. #626

Merged
merged 8 commits into from
Aug 25, 2021

Conversation

NicoletaPopoviciu
Copy link
Collaborator

Based on #458
Changes proposed in this PR:

  • added readiness and liveness checks to the connect inject webhook
  • @ishustava + I think that only the webhook should have the readiness check:
    • It doesn't make sense to incorporate it into the readiness status because controller doesn't receive any traffic and doesn't listen on any ports (only the webhook does).
    • If consul cluster is unhealthy, the endpoints controller will keep requeuing the request, and the readiness check will not help in changing that behavior. It will also not help to add a liveness check in that case, since the current behavior of requeuing the request is better than letting kubernetes to restart the container.

How I've tested this PR:
-unit tests
-acceptance tests

How I expect reviewers to test this PR:
-code review

Checklist:

  • tests added
  • CHANGELOG entry added (HashiCorp engineers only, community PRs should not add a changelog entry)

@hashicorp-cla
Copy link

hashicorp-cla commented Aug 19, 2021

CLA assistant check
All committers have signed the CLA.

@NicoletaPopoviciu NicoletaPopoviciu added type/enhancement New feature or request area/connect Related to Connect service mesh, e.g. injection labels Aug 19, 2021
@@ -68,7 +68,7 @@ commands:
type: string
consul-k8s-image:
type: string
default: "docker.mirror.hashicorp.services/hashicorpdev/consul-k8s-control-plane:latest"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder to revert before merging

Copy link
Contributor

@ishustava ishustava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great Nicoleta 🎉 😃 🐶 !!

control-plane/connect-inject/health_checks_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

@thisisnotashwin thisisnotashwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great!! Excellent first PR @NicoletaPopoviciu

port: 9445
scheme: HTTP
failureThreshold: 2
initialDelaySeconds: 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more out of curiosity than anything else, but is there a reason the initialDelaySeconds is 2 here while it's value is 1 in the livenessProbe?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've used the same configuration as before for both probes and didn't re-evaluate whether those make sense since it's essentially the same check as before. We'll take a deeper look on Monday.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We took another look, the two probes are initially triggered 1 second apart to keep the two requests staggered.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AHA!! good find!! thanks for the update Nicoleta!!

Comment on lines +18 to +24
{"Both cert and key files not present.", nil, nil, true},
{"Cert file not empty and key file missing.", ptrToString("test"), nil, true},
{"Key file not empty and cert file missing.", nil, ptrToString("test"), true},
{"Both cert and key files are present and not empty.", ptrToString("test"), ptrToString("test"), false},
{"Both cert and key files are present but both are empty.", ptrToString(""), ptrToString(""), true},
{"Both cert and key files are present but key file is empty.", ptrToString("test"), ptrToString(""), true},
{"Both cert and key files are present but cert file is empty.", ptrToString(""), ptrToString("test"), true},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is excellent!!

scheme: HTTP
failureThreshold: 2
initialDelaySeconds: 1
periodSeconds: 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if it needs to be checked every 2 seconds? How likely is it for the certs to be removed from disk?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, we'll take another look. We've added the same configuration as before the refactor, but we will re-think whether those still make sense.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We kept periodSeconds=2, as it was the value we had before the t-proxy refactor; we looked more into Kubernetes documentation, and the default value for periodSeconds is 10. Do you have any guidance as to whether we should increase the value, and by how much? And to your point once there the certificates are available, they are unlikely to be removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the issue we were having before we removed these probes was that the webhook cert manager was being a bit slow in creating the secrets and the inject deployment was crash looping. I'd like to avoid that issue this time round.

When Kubernetes tries to schedule a Pod, it will make a call to this webhook. If the webhook doesn't have certs, then the call will fail and since the failure policy is set to fail, the Pod scheduling will fail. In sum, we kinda already get the correct behaviour, even without the probes.

Now the readiness probe is still good because if for some reason the cert secrets aren't getting created or being mounted properly, you'll see that there's an issue because the probe will be failing.

The liveness probe will only help in the case where the certs are created but aren't being mounted properly and somehow restarting the container will get the certs mounted.

Once the certs have been created and the pod starts successfully the only situations where we'll have issues are:

  1. The secret is deleted.
  2. The cert manager deletes the keys from the secret.

In both these cases, Pod scheduling will start to fail because of the failure policy so I don't think quick failure detection is necessary.

So in sum, I think we should tweak these values so they give good signal if there's an issue, they give some time on startup for the webhook cert manager to create the secrets, and they don't check too often to reduce load.

Maybe 10s for initialDelaySeconds, then 30s for periodSeconds.

@NicoletaPopoviciu NicoletaPopoviciu merged commit 01d22a2 into main Aug 25, 2021
@NicoletaPopoviciu NicoletaPopoviciu deleted the nicoleta-tproxy-healthcheck branch August 25, 2021 17:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connect Related to Connect service mesh, e.g. injection type/enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants