Adds a docker-compose.test.yml to do a health check of built container. #2808

anuraaga · 2019-09-18T13:57:59Z

Also adds a health check command to our Docker image so that container health is visible in docker ps.

Docker Hub automatically runs this file after building an image. Example build at https://cloud.docker.com/repository/registry-1.docker.io/anuraaga/docker-hub-test/builds/7f0936ce-74e2-4614-815c-ef57daa7db42

Currently, the test just curls the health endpoint, without jq. This doesn't actually test anything until line/armeria#2088 is in. I figured that since this is a new test, it's not worth the complexity of finding a reputable image with both curl and jq; finishing it can wait until that change lands.
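For readers who want to script the same probe outside the image, here is a minimal Python sketch. It assumes only what this PR shows: the health endpoint lives at http://localhost:9411/health, and the check passes or fails on the HTTP status alone (the body is not inspected, mirroring the curl-without-jq test). The function name `is_healthy` is hypothetical and not part of Zipkin.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout.

    Hypothetical helper: it looks only at the HTTP status, not the body.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a 4xx/5xx answer all count as unhealthy.
        return False


if __name__ == "__main__":
    # URL taken from the HEALTHCHECK line in this PR; exit non-zero on failure.
    raise SystemExit(0 if is_healthy("http://localhost:9411/health") else 1)
```

The 5 second default timeout matches the --timeout=5s flag in the diff; a stricter deployment would pass its own value.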

Add docker test and health check.

devinsba · 2019-09-18T15:01:57Z

docker/Dockerfile

@@ -38,4 +38,6 @@ USER zipkin

 EXPOSE 9410 9411

+HEALTHCHECK --interval=10s --start-period=30s --timeout=5s CMD wget --quiet http://localhost:9411/health || exit 1


Cool, I didn't realize we were using the debug image

does this imply any blocking? the start period is really long

Doesn't block, it probes every interval regardless - went ahead and lowered interval to 5s. Any failures during the start-period aren't counted as "unhealthy", they're just skipped, but they still happen.

@devinsba I didn't notice the debug image until now either. In fact, my hotel can't even pull the zipkin image because it is so big. I opened openzipkin-attic/docker-zipkin#226 as a follow-up in case there's a way to reduce the size again.

devinsba · 2019-09-18T15:03:03Z

docker/docker-compose.test.yml

+      default:
+        aliases:
+        - zipkin
+  sut:


perhaps healthchecker would be a better name?

That name is hard-coded: it's the service name that Docker Hub checks for the test status. Added a link to the docs so it's easier to follow.

👍 seems like an odd name but works for me

codefromthecrypt · 2019-09-18T23:45:20Z

docker/Dockerfile

@@ -38,4 +38,6 @@ USER zipkin

 EXPOSE 9410 9411

+HEALTHCHECK --interval=10s --start-period=30s --timeout=5s CMD wget --quiet http://localhost:9411/health || exit 1


cc also @jcarres-mdsol in case you are using health checking in your deployment. This isn't the release dockerfile, but I suspect it will soon converge.

codefromthecrypt

I'm interested.. I think we should document the parameters used in a comment, maybe get a few people to ack though I understand it is deployment specific.

My main concern would be blocking for an unnecessarily long time (like tools that depend on the healthcheck to pass). I know we've had complaints, and at least one large site dropped zipkin due to a 30s startup time.

However, I'd like to think we can improve that, and I would prefer a few more health check invocations over locking in a big delay that we can ideally reduce. See #2788
cc also @jeqo @eddumelendez

codefromthecrypt · 2019-09-18T23:45:41Z

.dockerignore

@@ -1,6 +1,7 @@
 **/\.*
 !.git
 !**/.eslintrc
+**/*.test.yml


This is because building a docker image doesn't require the compose file in the build context.

anuraaga · 2019-09-19T11:34:46Z

Going to go ahead and merge this in to get it in action, happy to follow up with any more tweaks. The healthcheck is just a default if a user didn't specify anything, and I think it's a reasonable default for those that don't have stricter needs in their deployments.

codefromthecrypt · 2019-09-19T22:39:10Z

> Going to go ahead and merge this in to get it in action, happy to follow up with any more tweaks. The healthcheck is just a default if a user didn't specify anything, and I think it's a reasonable default for those that don't have stricter needs in their deployments.

Thanks I agree. Appreciate your answering the questions!

codefromthecrypt · 2019-09-19T22:42:07Z

PS I don't agree that a start period of 30s is reasonable. We've intentionally overridden defaults with a very long start period. I just want to clarify that I still think this is excessive.

anuraaga · 2019-09-20T09:14:07Z

I think your concern was whether it blocks, but it

> Doesn't block, it probes every interval regardless - went ahead and lowered interval to 5s. Any failures during the start-period aren't counted as "unhealthy", they're just skipped, but they still happen.

A shorter start period has a higher chance of delaying startup: if the server starts up just after the start period ends, it will take 3 healthy checks before the server is treated as healthy. Health check failures during the start period are ignored rather than marking the server explicitly unhealthy, and an unhealthy server requires multiple healthy checks to come back, unlike a starting one, which needs just one.

Let me know if there's still a problem with that behavior.
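The "doesn't block" behavior can be illustrated with a small simulation. This is a sketch under stated assumptions, not Docker's implementation: it models the documented HEALTHCHECK rules (first probe one interval after start, failures inside the start period not counted, `retries` consecutive counted failures mark the container unhealthy, one passing probe marks it healthy) and treats probes as instantaneous. The names `HealthConfig`, `simulate`, and `first_healthy_at` are hypothetical. How many passing probes a given orchestrator or load balancer requires before routing traffic again is outside this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class HealthConfig:
    interval: float = 5.0       # seconds between probes (value discussed in this thread)
    start_period: float = 30.0  # window in which failed probes are not counted
    retries: int = 3            # Docker's documented default


def simulate(ready_at: float, cfg: HealthConfig,
             horizon: float = 60.0) -> List[Tuple[float, str]]:
    """Return (probe_time, status) pairs for a server that answers from ready_at on.

    Assumed model: probes are instantaneous and fire every cfg.interval seconds.
    """
    status = "starting"
    failing_streak = 0
    timeline = []
    t = cfg.interval
    while t <= horizon:
        if t >= ready_at:
            # A passing probe marks the container healthy, even inside the start period.
            status = "healthy"
            failing_streak = 0
        elif status == "starting" and t < cfg.start_period:
            # The probe ran and failed, but failures in the start period are not counted.
            pass
        else:
            failing_streak += 1
            if failing_streak >= cfg.retries:
                status = "unhealthy"
        timeline.append((t, status))
        t += cfg.interval
    return timeline


def first_healthy_at(ready_at: float, cfg: HealthConfig) -> Optional[float]:
    """Time of the first probe that reports healthy, or None within the horizon."""
    for t, status in simulate(ready_at, cfg):
        if status == "healthy":
            return t
    return None
```

In this model, a server that is ready 12s after start is reported healthy at the 15s probe whether the start period is 30s or 0s, so the start period does not delay the healthy status. What changes is the failure path: a server that is ready at 22s never shows as unhealthy with a 30s start period, but does with a 0s one.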

codefromthecrypt · 2019-09-21T00:19:39Z

When I used to run systems, polling intervals were often one second (perhaps excessive). I have seen 5 before, though. I know that at Twitter, long delays before health was marked as such were a real problem; some of this was caused by having to read back metrics that were themselves delayed.

What you are saying is that the service will receive traffic until such time as a health check is bad (i.e., it does not wait until a check is good). That should be in a comment, because otherwise people will read this and say "zipkin takes half a minute to start", which echoes and causes more damage and time loss than most bugs do.

codefromthecrypt · 2019-09-21T00:22:08Z

I would suggest a comment like

# We put an excessive 30s initial delay to avoid spurious health check failures from delaying tests.
# The health check config here is not a recommendation for production. Do not copy/paste

codefromthecrypt · 2019-09-21T00:30:54Z

For example, I know @basvanbeek had a startup on his laptop that took almost 30s. He rebooted his laptop and then it took less than 10. It is super important that we discuss the difference between things overly defensive due to laptops etc and things for production. We are in a copy/paste culture.

If a health check interval of 1s is bad, we should also highlight that. We are literally overriding all the defaults, and the values used should be transparently explained. If 1s is triggering a bug or something, because health checks are not synchronized properly etc., we should say why the duration is so long.

This whole issue would have been easier on me at least if you were transparent about the values used in the first place. Every time we override defaults we all should know why.. that's how you can take holidays etc.

codefromthecrypt · 2019-09-21T00:37:03Z

Anyway, I can't explain the 5s part, and I would like to know why we are overriding it: whether it is random, based on personal experience, or working around a problem. If the latter, I'd like to make an issue about it.

Meanwhile I'll make a comment about the initial duration in a separate PR to try to avoid FUD from spreading.

As we routinely get FUD about slow startup or memory usage, this documents health check parameters in efforts to avoid them being read literally as "zipkin takes 30s to startup" See #2808

codefromthecrypt · 2019-09-21T01:00:02Z

Here's an attempt to get off this topic and back to work :) #2812


anuraaga added 4 commits September 18, 2019 12:37

Add docker test and health check.

e4d429f

pre_build instead of build since it sets up builds.

0f2216c

Merge pull request #1 from anuraaga/docker-test

1595070

Add docker test and health check.

Update docker-compose.test.yml

433330c

anuraaga mentioned this pull request Sep 18, 2019

Migrate docker builds to Docker Hub openzipkin-attic/docker-zipkin#86

Closed

anuraaga requested review from abesto, codefromthecrypt and devinsba September 18, 2019 14:14

devinsba approved these changes Sep 18, 2019

Update docker-compose.test.yml

cb7cdca

codefromthecrypt reviewed Sep 18, 2019

Document and tighten health check

9199199

anuraaga merged commit 08eed8a into openzipkin:master Sep 19, 2019

codefromthecrypt mentioned this pull request Sep 21, 2019

Adds rationale for healthcheck config #2812

Merged
