Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds a docker-compose.test.yml to do a health check of built container. #2808

Merged
merged 6 commits into from
Sep 19, 2019

Conversation

anuraaga
Copy link
Contributor

Also adds a health check command to our docker image for being able to see health with docker ps.

Docker Hub automatically runs this file after building an image. Example build at https://cloud.docker.com/repository/registry-1.docker.io/anuraaga/docker-hub-test/builds/7f0936ce-74e2-4614-815c-ef57daa7db42

Currently, the test just curl's the health endpoint, without jq. This doesn't actually test until line/armeria#2088 - I figured since this is a new test, it's not worth getting too complex finding a reputable image with curl and jq and can wait until that's in to finish it.

@@ -38,4 +38,6 @@ USER zipkin

EXPOSE 9410 9411

HEALTHCHECK --interval=10s --start-period=30s --timeout=5s CMD wget --quiet http://localhost:9411/health || exit 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I didn't realize we were using the debug image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this imply any blocking? the start period is really long

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't block, it probes every interval regardless - went ahead and lowered interval to 5s. Any failures during the start-period aren't counted as "unhealthy", they're just skipped, but they still happen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devinsba I didn't notice the debug image until now also. In fact my hotel can't even pull the zipkin image it is so big. I opened this up for follow-up in case there's a way to reduce size again openzipkin-attic/docker-zipkin#226

default:
aliases:
- zipkin
sut:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps healthchecker would be a better name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That name is hard coded as the name that docker hub checks for the test status - added a link to the docs so it's easier to follow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 seems like an odd name but works for me

@@ -38,4 +38,6 @@ USER zipkin

EXPOSE 9410 9411

HEALTHCHECK --interval=10s --start-period=30s --timeout=5s CMD wget --quiet http://localhost:9411/health || exit 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc also @jcarres-mdsol in case you are using health checking in your deployment. This isn't the release dockerfile, but I suspect it will soon converge.

Copy link
Member

@codefromthecrypt codefromthecrypt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm interested.. I think we should document the parameters used in a comment, maybe get a few people to ack though I understand it is deployment specific.

My main concern would be blocking for an unnecessarily long time (like tools that depend on healthcheck to pass). I know we've had complaints and at least on large site dropping zipkin due to a 30s startup time.

However, I'd like to think we can improve that and would prefer a few more health checks invoked vs lock in a big delay which ideally we can improve. See #2788
cc also @jeqo @eddumelendez

@@ -1,6 +1,7 @@
**/\.*
!.git
!**/.eslintrc
**/*.test.yml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fuzz?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is since building a docker image doesn't require the compsoe file in the build context.

@anuraaga
Copy link
Contributor Author

Going to go ahead and merge this in to get it in action, happy to follow up with any more tweaks. The healthcheck is just a default if a user didn't specify anything, and I think it's a reasonable default for those that don't have stricter needs in their deployments.

@anuraaga anuraaga merged commit 08eed8a into openzipkin:master Sep 19, 2019
@codefromthecrypt
Copy link
Member

codefromthecrypt commented Sep 19, 2019 via email

@codefromthecrypt
Copy link
Member

PS I don't agree that a start period of 30s is reasonable. We've intentionally overridden defaults with a very long start period. I just want to clarify that I still thing this is excessive.

@anuraaga
Copy link
Contributor Author

I think your concern was whether it blocks, but it

Doesn't block, it probes every interval regardless - went ahead and lowered interval to 5s. Any failures during the start-period aren't counted as "unhealthy", they're just skipped, but they still happen.

A shorter start period has a higher chance of waiting for startup since if the server starts up just after the start period, it will take 3 healthy checks before the server is treated as healthy (any health check failures during the start period are ignored rather than setting the server as explicitly unhealthy, an unhealthy server requires multiple healthy checks to come back unlike a starting one that just takes one).

Let me know if there's still a problem with that behavior.

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Sep 21, 2019

When I used to run systems, it was often the case (perhaps excessive) that polling intervals were one second. I have seen 5 before, though. I know at twitter it was a real problem to have long delays before health was marked as such, some of this was caused by having to readback metrics that were themselves delayed.

What you are saying is that service will receive traffic until such time as a health check is bad (ex not until such time it is good). That should be in a comment because otherwise people will read this and say "zipkin takes a half a minute to start" which echos and causes more damage and time loss than most bugs do.

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Sep 21, 2019

I would suggest a comment like

# We put an excessive 30s initial delay to avoid spurious health check failures from delaying tests.
# The health check config here is not a recommendation for production. Do not copy/paste

@codefromthecrypt
Copy link
Member

For example, I know @basvanbeek had a startup on his laptop that took almost 30s. He rebooted his laptop and then it took less than 10. It is super important that we discuss the difference between things overly defensive due to laptops etc and things for production. We are in a copy/paste culture.

A health check of 1s if bad, we should also highlight that. Literally we are overriding all the defaults and the values used should be transparently explained. If 1s is triggering a bug or something, because health checks are not synchronized properly etc, we should say why the duration is so long.

This whole issue would have been easier on me at least if you were transparent about the values used in the first place. Every time we override defaults we all should know why.. that's how you can take holidays etc.

@codefromthecrypt
Copy link
Member

Anyway I can't explain the 5s part, I would like to know why we are overriding it.. if it is random, based on personal experience, or walking around a problem. If the latter, I'd like to make an issue about it.

Meanwhile I'll make a comment about the initial duration in a separate PR to try to avoid FUD from spreading.

codefromthecrypt pushed a commit that referenced this pull request Sep 21, 2019
As we routinely get FUD about slow startup or memory usage, this
documents health check parameters in efforts to avoid them being read
literally as "zipkin takes 30s to startup"

See #2808
@codefromthecrypt
Copy link
Member

Here's an attempt to get off this topic and back to work :) #2812

codefromthecrypt pushed a commit that referenced this pull request Sep 21, 2019
As we routinely get FUD about slow startup or memory usage, this
documents health check parameters in efforts to avoid them being read
literally as "zipkin takes 30s to startup"

See #2808
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants