-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds a docker-compose.test.yml to do a health check of built container. #2808
Conversation
docker/Dockerfile
Outdated
@@ -38,4 +38,6 @@ USER zipkin | |||
|
|||
EXPOSE 9410 9411 | |||
|
|||
HEALTHCHECK --interval=10s --start-period=30s --timeout=5s CMD wget --quiet http://localhost:9411/health || exit 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I didn't realize we were using the debug image
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this imply any blocking? the start period is really long
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't block, it probes every interval regardless - went ahead and lowered interval to 5s. Any failures during the start-period aren't counted as "unhealthy", they're just skipped, but they still happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devinsba I didn't notice the debug image until now also. In fact my hotel can't even pull the zipkin image it is so big. I opened this up for follow-up in case there's a way to reduce size again openzipkin-attic/docker-zipkin#226
default: | ||
aliases: | ||
- zipkin | ||
sut: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps healthchecker
would be a better name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That name is hard coded as the name that docker hub checks for the test status - added a link to the docs so it's easier to follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 seems like an odd name but works for me
docker/Dockerfile
Outdated
@@ -38,4 +38,6 @@ USER zipkin | |||
|
|||
EXPOSE 9410 9411 | |||
|
|||
HEALTHCHECK --interval=10s --start-period=30s --timeout=5s CMD wget --quiet http://localhost:9411/health || exit 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc also @jcarres-mdsol in case you are using health checking in your deployment. This isn't the release dockerfile, but I suspect it will soon converge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm interested.. I think we should document the parameters used in a comment, maybe get a few people to ack though I understand it is deployment specific.
My main concern would be blocking for an unnecessarily long time (like tools that depend on healthcheck to pass). I know we've had complaints and at least on large site dropping zipkin due to a 30s startup time.
However, I'd like to think we can improve that and would prefer a few more health checks invoked vs lock in a big delay which ideally we can improve. See #2788
cc also @jeqo @eddumelendez
@@ -1,6 +1,7 @@ | |||
**/\.* | |||
!.git | |||
!**/.eslintrc | |||
**/*.test.yml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fuzz?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is since building a docker image doesn't require the compsoe file in the build context.
Going to go ahead and merge this in to get it in action, happy to follow up with any more tweaks. The healthcheck is just a default if a user didn't specify anything, and I think it's a reasonable default for those that don't have stricter needs in their deployments. |
Going to go ahead and merge this in to get it in action, happy to follow up
with any more tweaks. The healthcheck is just a default if a user didn't
specify anything, and I think it's a reasonable default for those that
don't have stricter needs in their deployments.
Thanks I agree. Appreciate your answering the questions!
|
PS I don't agree that a start period of 30s is reasonable. We've intentionally overridden defaults with a very long start period. I just want to clarify that I still thing this is excessive. |
I think your concern was whether it blocks, but it
A shorter start period has a higher chance of waiting for startup since if the server starts up just after the start period, it will take 3 healthy checks before the server is treated as healthy (any health check failures during the start period are ignored rather than setting the server as explicitly unhealthy, an unhealthy server requires multiple healthy checks to come back unlike a starting one that just takes one). Let me know if there's still a problem with that behavior. |
When I used to run systems, it was often the case (perhaps excessive) that polling intervals were one second. I have seen 5 before, though. I know at twitter it was a real problem to have long delays before health was marked as such, some of this was caused by having to readback metrics that were themselves delayed. What you are saying is that service will receive traffic until such time as a health check is bad (ex not until such time it is good). That should be in a comment because otherwise people will read this and say "zipkin takes a half a minute to start" which echos and causes more damage and time loss than most bugs do. |
I would suggest a comment like
|
For example, I know @basvanbeek had a startup on his laptop that took almost 30s. He rebooted his laptop and then it took less than 10. It is super important that we discuss the difference between things overly defensive due to laptops etc and things for production. We are in a copy/paste culture. A health check of 1s if bad, we should also highlight that. Literally we are overriding all the defaults and the values used should be transparently explained. If 1s is triggering a bug or something, because health checks are not synchronized properly etc, we should say why the duration is so long. This whole issue would have been easier on me at least if you were transparent about the values used in the first place. Every time we override defaults we all should know why.. that's how you can take holidays etc. |
Anyway I can't explain the 5s part, I would like to know why we are overriding it.. if it is random, based on personal experience, or walking around a problem. If the latter, I'd like to make an issue about it. Meanwhile I'll make a comment about the initial duration in a separate PR to try to avoid FUD from spreading. |
As we routinely get FUD about slow startup or memory usage, this documents health check parameters in efforts to avoid them being read literally as "zipkin takes 30s to startup" See #2808
Here's an attempt to get off this topic and back to work :) #2812 |
As we routinely get FUD about slow startup or memory usage, this documents health check parameters in efforts to avoid them being read literally as "zipkin takes 30s to startup" See #2808
Also adds a health check command to our docker image for being able to see health with
docker ps
.Docker Hub automatically runs this file after building an image. Example build at https://cloud.docker.com/repository/registry-1.docker.io/anuraaga/docker-hub-test/builds/7f0936ce-74e2-4614-815c-ef57daa7db42
Currently, the test just curl's the health endpoint, without jq. This doesn't actually test until line/armeria#2088 - I figured since this is a new test, it's not worth getting too complex finding a reputable image with curl and jq and can wait until that's in to finish it.