-
Notifications
You must be signed in to change notification settings - Fork 315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added healthchecks to the connect inject webhook. #626
Conversation
.circleci/config.yml
Outdated
@@ -68,7 +68,7 @@ commands: | |||
type: string | |||
consul-k8s-image: | |||
type: string | |||
default: "docker.mirror.hashicorp.services/hashicorpdev/consul-k8s-control-plane:latest" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reminder to revert before merging
cc6b14f
to
65fe289
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great Nicoleta 🎉 😃 🐶 !!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great!! Excellent first PR @NicoletaPopoviciu
port: 9445 | ||
scheme: HTTP | ||
failureThreshold: 2 | ||
initialDelaySeconds: 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more out of curiosity than anything else, but is there a reason the initialDelaySeconds
is 2 here while it's value is 1 in the livenessProbe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've used the same configuration as before for both probes and didn't re-evaluate whether those make sense since it's essentially the same check as before. We'll take a deeper look on Monday.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We took another look, the two probes are initially triggered 1 second apart to keep the two requests staggered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AHA!! good find!! thanks for the update Nicoleta!!
{"Both cert and key files not present.", nil, nil, true}, | ||
{"Cert file not empty and key file missing.", ptrToString("test"), nil, true}, | ||
{"Key file not empty and cert file missing.", nil, ptrToString("test"), true}, | ||
{"Both cert and key files are present and not empty.", ptrToString("test"), ptrToString("test"), false}, | ||
{"Both cert and key files are present but both are empty.", ptrToString(""), ptrToString(""), true}, | ||
{"Both cert and key files are present but key file is empty.", ptrToString("test"), ptrToString(""), true}, | ||
{"Both cert and key files are present but cert file is empty.", ptrToString(""), ptrToString("test"), true}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is excellent!!
scheme: HTTP | ||
failureThreshold: 2 | ||
initialDelaySeconds: 1 | ||
periodSeconds: 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious if it needs to be checked every 2 seconds? How likely is it for the certs to be removed from disk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, we'll take another look. We've added the same configuration as before the refactor, but we will re-think whether those still make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We kept periodSeconds=2, as it was the value we had before the t-proxy refactor; we looked more into Kubernetes documentation, and the default value for periodSeconds is 10. Do you have any guidance as to whether we should increase the value, and by how much? And to your point once there the certificates are available, they are unlikely to be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the issue we were having before we removed these probes was that the webhook cert manager was being a bit slow in creating the secrets and the inject deployment was crash looping. I'd like to avoid that issue this time round.
When Kubernetes tries to schedule a Pod, it will make a call to this webhook. If the webhook doesn't have certs, then the call will fail and since the failure policy is set to fail, the Pod scheduling will fail. In sum, we kinda already get the correct behaviour, even without the probes.
Now the readiness probe is still good because if for some reason the cert secrets aren't getting created or being mounted properly, you'll see that there's an issue because the probe will be failing.
The liveness probe will only help in the case where the certs are created but aren't being mounted properly and somehow restarting the container will get the certs mounted.
Once the certs have been created and the pod starts successfully the only situations where we'll have issues are:
- The secret is deleted.
- The cert manager deletes the keys from the secret.
In both these cases, Pod scheduling will start to fail because of the failure policy so I don't think quick failure detection is necessary.
So in sum, I think we should tweak these values so they give good signal if there's an issue, they give some time on startup for the webhook cert manager to create the secrets, and they don't check too often to reduce load.
Maybe 10s for initialDelaySeconds, then 30s for periodSeconds.
Co-authored-by: Iryna Shustava <ishustava@users.noreply.github.com>
This reverts commit dbb4ba3.
Based on #458
Changes proposed in this PR:
How I've tested this PR:
-unit tests
-acceptance tests
How I expect reviewers to test this PR:
-code review
Checklist: