
Grace period option to health checks. #28938

Merged
merged 1 commit into moby:master from elifa:master on Apr 6, 2017

Conversation

@elifa

elifa commented Nov 29, 2016

- What I did
Added the option --start-period to HEALTHCHECK in order to allow containers with a long startup time to have a health check configured based on their behaviour once started.

The --start-period flag defines a period during which health check results are not counted towards the maximum number of retries configured by the --retries flag. However, if a health check succeeds during the grace period, any failures from there on will be counted towards the retries.

The default is to use no start period, meaning no change to current behaviour.

Additionally, the run flag --health-start-period has been added to the CLI to override the value set at build time.
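
For illustration, a minimal sketch of the intended usage (the image name, check command, and timings are arbitrary examples, not part of this PR):

# Dockerfile: give the container 30 seconds of grace before failed checks count towards --retries
HEALTHCHECK --start-period=30s --interval=5s --timeout=3s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

# Override the value from the build at run time
$ docker run -d --health-start-period=60s my-image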

The need for this has been discussed in #26498 and #26664 and this is my suggestion of how to solve the use cases discussed there.

- How I did it
Based on how long it has been since the container started (Container.State.StartedAt) and the given --start-period, it is determined whether Container.State.Health.FailingStreak is incremented or not.

- How to verify it
A test has been added to verify the new functionality.

- Description for the changelog

Add a --start-period flag to HEALTHCHECK and a --health-start-period run override flag to enable health checks for containers with an initial startup time.

Signed-off-by: Elias Faxö elias.faxo@gmail.com

@vdemeester vdemeester added this to the 1.14.0 milestone Nov 29, 2016

@cpuguy83

Contributor

cpuguy83 commented Nov 29, 2016

I don't really care for the name, but otherwise SGTM.
Maybe start-period to make it clear this is to give time during container startup, not once it's already started.

@elifa

elifa commented Nov 29, 2016

@cpuguy83 I agree, the name could be better. I'll change it to start-period instead and edit the PR.

@thaJeztah

Member

thaJeztah commented Dec 15, 2016

I can see this being useful for services that have a long startup period, so sgtm. We can bike-shed over naming during review

@thaJeztah

Member

thaJeztah commented Dec 15, 2016

ping @aaronlehmann could you have a look if this would work for SwarmKit?

@aaronlehmann

Contributor

aaronlehmann commented Dec 15, 2016

It would need to be plumbed through SwarmKit. This would involve a PR to SwarmKit to add the new flag to the protobuf definitions, and changes in Docker to add the flag to service create/update, and pass it through to the container in the executor.

@elifa

elifa commented Dec 19, 2016

@aaronlehmann Do you want me to add a PR to SwarmKit and add the service mapping to this PR? I was a bit uncertain whether you wanted this to be merged before adding it to SwarmKit, as it depends on the types updated in this PR.

@aaronlehmann

Contributor

aaronlehmann commented Dec 19, 2016

Not sure I'm the best one to answer that question. I think if maintainers are happy with the design of this PR, the next step would be to open a SwarmKit PR to add support there. But I don't want to get ahead of the design review.

@kakawait

kakawait commented Jan 23, 2017

What do you think about a readiness healthcheck/probe in addition to --health-start-period, in order to optimize the start-period duration?

If the readiness healthcheck/probe returns OK before --health-start-period elapses, the state can change to unhealthy right away (and then to healthy once the regular healthcheck is triggered). However, --health-start-period should still be the maximum time, even if the readiness healthcheck/probe never returns OK (it would act as a timeout).

related to #26664 (comment)

I usually write Spring Boot applications with http://docs.spring.io/spring-boot/docs/current/api/org/springframework/boot/system/EmbeddedServerPortFileWriter.html, which simply writes a file containing the port number when the servlet container is ready to listen for requests. I would like a way to express something like:

my container is not ready (which is not the same as healthy) until that file is present and contains a valid value.

This is because operations like database upgrades (Flyway) are performed before the servlet container starts listening.


PS: And maybe a different restart policy could apply between that state and unhealthy.
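
For illustration, a rough sketch of how the file-based readiness idea could be approximated with the flags discussed here (the /app/application.port path and the image name are hypothetical, assuming EmbeddedServerPortFileWriter is configured to write there):

$ docker run -d \
    --health-cmd='test -s /app/application.port' \
    --health-interval=5s \
    --health-retries=3 \
    --health-start-period=120s \
    my-spring-boot-app

In this sketch --health-start-period plays the role of the maximum startup time described above: once it elapses, failing checks start counting towards --retries.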

@aluzzardi

Member

aluzzardi commented Jan 23, 2017

@dongluochen

Contributor

dongluochen commented Jan 24, 2017

Design sounds good to me.

@krasi-georgiev

Contributor

krasi-georgiev commented Jan 24, 2017

I think a readiness test is more flexible.

@aluzzardi

Member

aluzzardi commented Jan 24, 2017

My 2 cents: The title of the PR says Grace period. I kinda like that term better than start period, e.g. --healthcheck-grace-period.

@dongluochen ?

@dongluochen

Contributor

dongluochen commented Jan 24, 2017

I don't have a preference between grace-period and start-period.

@thaJeztah

Member

thaJeztah commented Jan 26, 2017

We were discussing this in the maintainers meeting, and think this needs more discussion (healthcheck vs readiness check)

@thaJeztah thaJeztah moved this from backlog to Revisit in maintainers-session Jan 26, 2017

@dongluochen

Contributor

dongluochen commented Feb 6, 2017

@thaJeztah Can you summarize the alternatives and how we are going to resolve this?

@dnephin

Member

dnephin commented Feb 10, 2017

I believe it is well summarized in this comment: #26664 (comment)

A fixed time interval is not a good solution for readiness checks, and there don't seem to be any other use cases for a grace period.

@thaJeztah

Member

thaJeztah commented Feb 10, 2017

We've been discussing this PR, and considering all options, this seems to be the best way forward. Although a "fixed time" can be hard to determine (e.g., how long does a database migration take?), setting a very long time (1 day) will work, because the "grace period" automatically "expires" once the service becomes healthy.

Having a separate "readiness" check doesn't fit in the SwarmKit design (@dongluochen will be better at explaining this).

One thing that still needs to be discussed is creating a list of service options that can be updated on a running service without causing a re-deploy of tasks (e.g., being able to remove the grace period from the service definition).

I am moving this to code review, but would like @dongluochen and @aluzzardi to check if this matches what we discussed.
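
To make the "very long time" point concrete, a rough sketch (the image name and check command below are placeholders, not from this PR); because the grace period effectively ends at the first successful check, a generous value is safe:

# failures never count until the first successful check; after that, --retries applies as usual
$ docker service create --name=migrator \
    --health-cmd='test -f /tmp/migration-done' \
    --health-interval=30s \
    --health-retries=3 \
    --health-start-period=24h \
    my-migrating-image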

Added start period option to health check.
Signed-off-by: Elias Faxö <elias.faxo@gmail.com>
@dongluochen

LGTM

@thaJeztah

Member

thaJeztah commented Apr 6, 2017

Gonna take @dongluochen's LGTM into account here. All green 👍

@thaJeztah thaJeztah merged commit c4010e2 into moby:master Apr 6, 2017

6 checks passed

dco-signed All commits are signed
experimental Jenkins build Docker-PRs-experimental 32606 has succeeded
janky Jenkins build Docker-PRs 41215 has succeeded
powerpc Jenkins build Docker-PRs-powerpc 1392 has succeeded
windowsRS1 Jenkins build Docker-PRs-WoW-RS1 12332 has succeeded
z Jenkins build Docker-PRs-s390x 1226 has succeeded
@thaJeztah

Member

thaJeztah commented Apr 6, 2017

Thanks so much @elifa !

@elifa

elifa commented Apr 7, 2017

Great! Thanks for all your help!
I have created a PR against SwarmKit (docker/swarmkit#2103) to finalize the feature.

@@ -22,6 +22,7 @@ keywords: "API, Docker, rcli, REST, documentation"
* `POST /networks/create` now supports creating the ingress network, by specifying an `Ingress` boolean field. As of now this is supported only when using the overlay network driver.
* `GET /networks/(name)` now returns an `Ingress` field showing whether the network is the ingress one.
* `GET /networks/` now supports a `scope` filter to filter networks based on the network mode (`swarm`, `global`, or `local`).
* `POST /containers/create`, `POST /service/create` and `POST /services/(id or name)/update` now takes the field `StartPeriod` as a part of the `HealthConfig` allowing for specification of a period during which the container should not be considered unealthy even if health checks do not pass.

@ijc

ijc Apr 10, 2017

Contributor

"unealthy" is a typo I think.

@albers

albers Apr 11, 2017

Member

@ijc25 created #32523 for that
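
For reference, a hedged sketch of how the new StartPeriod field could appear in a POST /containers/create request body (the Healthcheck field name and nanosecond duration units reflect my understanding of HealthConfig; all values are arbitrary):

$ curl --unix-socket /var/run/docker.sock \
    -H 'Content-Type: application/json' \
    -d '{
          "Image": "nginx:alpine",
          "Healthcheck": {
            "Test": ["CMD-SHELL", "curl -f http://localhost/ || exit 1"],
            "Interval": 10000000000,
            "Timeout": 3000000000,
            "Retries": 3,
            "StartPeriod": 60000000000
          }
        }' \
    http://localhost/containers/create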

dnephin pushed a commit to dnephin/docker that referenced this pull request Apr 17, 2017

Merge pull request moby#28938 from elifa/master
Grace period option to health checks.
@pascalandy

pascalandy commented May 16, 2017

Hello gents,
I'm so glad to see this appearing. Since adding the healthcheck, my services take 30s to 180s to deploy.

I'm not sure I understand the difference between the two flags.
I don't want any healthcheck for the first 90 seconds. I guess I only need to use:

--health-start-period "90s"

I'm not sure why --start-period "90s" exists. Can you enlighten me?
Here is my core setup to start nginx:

docker service create \
	--name "$CTN_nginx_app" \
	--hostname "$CTN_nginx_app" \
	--network "$NTW_FRONT" \
	--replicas "1" \
	--reserve-memory "12M" \
	--limit-memory "20M" \
	--constraint node.labels.apps_accepted=="yes" \
	--mount	type="bind",src="$WWW_SRC_NGINX",target="$WWW_DST_NGINX" \
	--restart-condition "any" \
	--restart-max-attempts "55" \
	--update-delay "5s" \
	--update-parallelism "1" \
	--start-period "90s" \
	--health-start-period "90s" \
nginx:alpine

Cheers!
Pascal

@thaJeztah

Member

thaJeztah commented May 16, 2017

@pascalandy They're the same; the --start-period is not a command-line option, but an option to the HEALTHCHECK Dockerfile instruction to define the start-period as part of the image, whereas --health-start-period is a command-line option to set/override that period at runtime.

This option is only used if the image / container you're running uses a health check (your example uses the nginx:alpine image, which does not have a health check defined). When using the --health-start-period, it works roughly like this;

Say, the service is created with

$ docker service create --name=health \
  --health-cmd='exit 1' \
  --health-interval=10s \
  --health-timeout=3s \
  --health-retries=3 \
  --health-start-period=60s \
  nginx:alpine

  1. The container (task) is created and started
  2. Every 10 seconds the command exit 1 is executed to check if the container is healthy
  3. If the health-cmd, exit 1, fails (hint: that's always 😄), but the container has not been running for more than 60 seconds (the start period), don't count the failure, and let the container run as usual
  4. Similarly; if the health-cmd takes longer than 3 seconds (--health-timeout) to complete, don't count the failure, and let the container run as usual
  5. After 60 seconds, start tracking failures. If exit 1 fails (or takes longer to complete than 3 seconds) 3 times in a row, mark the container/task as "unhealthy", stop it, and start a new task to replace it.

If you run the above example, you can follow what's happening. During the first 60 seconds, you can inspect the container, and see that the health check is failing, and failures are logged;

$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "starting",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2017-05-16T09:53:21.308489492Z",
      "End": "2017-05-16T09:53:21.360249491Z",
      "ExitCode": 1,
      "Output": ""
    }
  ]
}

$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "starting",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2017-05-16T09:53:21.308489492Z",
      "End": "2017-05-16T09:53:21.360249491Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:53:31.361873326Z",
      "End": "2017-05-16T09:53:31.415394581Z",
      "ExitCode": 1,
      "Output": ""
    }
  ]
}

However, even though it fails 3 times or more in a row, the FailingStreak remains 0, and Status remains "starting" (because we're still in the "start period", and the container hasn't reported as "healthy" yet);

$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "starting",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2017-05-16T09:53:21.308489492Z",
      "End": "2017-05-16T09:53:21.360249491Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:53:31.361873326Z",
      "End": "2017-05-16T09:53:31.415394581Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:53:41.415861379Z",
      "End": "2017-05-16T09:53:41.452461489Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:53:51.453103536Z",
      "End": "2017-05-16T09:53:51.490980125Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:54:01.492147629Z",
      "End": "2017-05-16T09:54:01.533664526Z",
      "ExitCode": 1,
      "Output": ""
    }
  ]
}

Once the container is running for the 60 seconds, docker starts to track failures (FailingStreak is incremented with each consecutive failure);

$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "starting",
  "FailingStreak": 1,
  "Log": [
    {
      "Start": "2017-05-16T09:53:31.361873326Z",
      "End": "2017-05-16T09:53:31.415394581Z",
...

And when it reaches --health-retries (3), the container/task is marked as unhealthy;

$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "unhealthy",
  "FailingStreak": 3,
  "Log": [
    {
      "Start": "2017-05-16T09:53:51.453103536Z",
      "End": "2017-05-16T09:53:51.490980125Z",
...

At that point, Swarm takes control; stops the container/task and starts a new one to replace it;

$ docker service ps health
ID                  NAME                IMAGE               NODE                DESIRED STATE       CURRENT STATE                 ERROR               PORTS
uqaf5bbiunnf        health.1            nginx:alpine        91d5d251ecc3        Running             Starting 20 seconds ago
be45d670f023         \_ health.1        nginx:alpine        91d5d251ecc3        Shutdown            Complete 25 seconds ago
@thaJeztah

Member

thaJeztah commented May 16, 2017

To add to the above; the "health start period" (or "grace period", which is also used as a term), allows you to monitor a service's health (by inspecting the .State.Health of the container), during the startup period. This can be helpful for services that need to perform certain tasks the first time they are started (think of a database-migration), but you still want to keep track if the migration is running, and if the health-checks are running (you can log messages as part of the health check, which show up in .State.Health in the container inspect).

Swarm mode takes the health-state into account when routing network traffic to the task. The startup period can take less than the specified amount (e.g. the database migration took less time than expected, whoop!), at which point the container/task will start to receive network requests.

To illustrate the above, a simple example: the health-check below simulates a long-running startup. For the first 40 seconds (4 healthcheck intervals), the healthcheck returns "unhealthy". Because of the health-start-period the container is not terminated, but the health-check is still performed every 10 seconds:

$ docker service create --name=health \
  --health-cmd='if [ ! -f "/count" ] ; then ctr=0; else ctr=`cat /count`; fi; ctr=`expr ${ctr} + 1`; echo "${ctr}" > /count; if [ "$ctr" -gt 4 ] ; then exit 0; else exit 1; fi' \
  --health-interval=10s \
  --health-timeout=3s \
  --health-retries=3 \
  --health-start-period=60s \
  -p8080:80 \
  nginx:alpine

As long as the container is not "healthy", no traffic is routed to the task;

$ curl localhost:8080
curl: (7) Failed to connect to localhost port 8080: Connection refused

After 40 seconds, the container becomes healthy, and Swarm starts to route traffic to it;

$ curl localhost:8080

<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
@pascalandy

pascalandy commented May 16, 2017

This is crystal clear now @thaJeztah. Thank you so much for this deep explanation :)

@thaJeztah

Member

thaJeztah commented May 16, 2017

@pascalandy you're welcome! I took a bit of time to write it down, because I noticed that documentation around this was largely missing, so I thought it would help as a starting point for that (I opened an issue in the documentation repository: docker/docker.github.io#3282).

To come back to your initial comment;

my services take 30s to 180s to deploy

Be aware that the deploy time is separate from the "start period"; the deploy time may include pulling the image before the task/container is started. This time is not part of the start period (which starts once the task/container is actually started).

@pascalandy

pascalandy commented May 16, 2017

Perfectly aware, but the pulling is already done.

the deploy time may include pulling the image before the task/container is started

@vide

vide commented May 22, 2017

@thaJeztah is this already integrated with Compose v3.X? Or should I open a specific issue to have it implemented?

@thaJeztah

Member

thaJeztah commented May 22, 2017

@vide a quick glance at the docker-compose 3.3 schema tells me it's not implemented yet; https://github.com/docker/cli/blob/master/cli/compose/schema/data/config_schema_v3.3.json#L310-L326

Can you open an issue in the https://github.com/docker/cli/issues issue tracker?
