
Perform first healthcheck on startup without waiting the specified interval #33410

Closed
alexpirine opened this issue May 26, 2017 · 37 comments · Fixed by #40894
Labels: area/runtime, kind/enhancement

@alexpirine

alexpirine commented May 26, 2017

This is a feature request.

Currently, health checks work well and are great for ensuring continuous, uninterrupted service, even during rolling updates.

However, I might want to specify a rather long health check interval, let's say 5 minutes, because my health check verifies a few different things and is somewhat time- and resource-consuming.

The documentation states:

The health check will first run interval seconds after the container is started, and then again interval seconds after each previous check completes.

This doesn't work well with continuous delivery, since I'll have to wait 5 minutes before the first container with my new software version is available. This is a long time to wait. And I don't really want to end up with containers running different versions of software for a long period of time.

Why couldn't we just run the first health check right after startup, without waiting for interval delay?
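
For illustration, a Compose healthcheck along these lines reproduces the situation described above; this is only a sketch, and the service name, image, and endpoint are hypothetical (curl is assumed to be available in the image):

services:
  web:
    image: example/myapp:latest        # hypothetical image
    healthcheck:
      # heavyweight check, so a long interval is chosen
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 5m
      timeout: 30s
      retries: 3

With the documented behavior, the first check runs only at t = 5m, so a freshly deployed replica stays in "starting" (and, in swarm mode, receives no routed traffic) for the full five minutes even if it was ready within seconds.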

@boaz0
Member

boaz0 commented Jun 1, 2017

@thaJeztah WDYT?

Looking at the docs I couldn't find anything that allows this.

@thaJeztah
Member

I think the reason for this is that a lot of containers may need a "warm-up time" (e.g. a database getting ready to accept connections), so if the first health check is performed directly after the container is started, it would result in a lot of "false positives".

Now that we have --health-start-period, this can be circumvented, because that option can be set to "ignore" the first xx seconds of failures, but it's not something that's set by default.

Still, I wonder whether there are many containers that would report unhealthy if the first check is done at start + 0 seconds.

I understand the use-case, but I'm not sure about the consequences if we change this, as it may break existing setups.

ping @dongluochen @stevvooe wdyt?
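
As a sketch of the --health-start-period behavior mentioned above (Compose syntax; the service and timings are illustrative, and pg_isready ships with the postgres image):

services:
  db:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 30s
      timeout: 5s
      retries: 3
      # failures during the first 60s do not increment the failure counter,
      # giving the database time to warm up
      start_period: 60s

Note that this only suppresses early failures; the first check itself still does not run until one full interval has passed, which is the behavior this issue asks to change.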

@thaJeztah added the area/runtime and kind/enhancement labels on Jun 1, 2017
@stevvooe
Contributor

stevvooe commented Jun 5, 2017

@thaJeztah This is a little surprising. Why would we wait at all to run the healthcheck? The "start period" should be the time after which we consider a failure to be a hard failure. If the health check succeeds under that time, the container should go into service.

@alexpirine I'd recommend you test if this works as expected. That wording may be inaccurate.

@alexpirine
Author

alexpirine commented Jun 5, 2017

@stevvooe I tested it and the behavior is as currently described in the documentation. The very first check is run only after the defined health check interval, and the container goes into service only after it is completed. So it can take quite a lot of time to reload all the replicas.

@stevvooe
Contributor

stevvooe commented Jun 5, 2017

@alexpirine OK, it really should just start running right away and only count a fault as a failure after the start period, to minimize bring-up time.

We should consider this a bug.

@yongtang
Member

I think one way to address the issue without breaking compatibility is to default --health-start-period to the value of --interval (currently --health-start-period=0 by default). I will take a closer look at it.

@yongtang
Member

yongtang commented Jul 1, 2017

Created a PR #33918 with the following changes:

  1. The first healthcheck is performed on startup (at time 0)
  2. The first healthcheck is only valid when start-period < 0

With the above changes, backward compatibility is maintained. Please take a look.

@josephtyler

Looking forward to this enhancement. I also need the health check to be run immediately after the container has started.

yongtang added a commit to yongtang/docker that referenced this issue Jul 17, 2017
…terval

This fix tries to address the issue raised in moby#33410 where the first
healthcheck was only performed after the specified interval.

That might be problematic in cases where the interval is long, as
described in moby#33410.

This fix makes the following changes:
1. The first healthcheck is performed on startup (at time 0)
2. The first healthcheck is only valid when no-wait = true

The changes preserve backward compatibility while at the same
time making it possible to address the issue raised in moby#33410.

Additional test cases have been added.

This fix fixes moby#33410.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
@rdxmb

rdxmb commented Oct 13, 2017

There are GUIs like 'Portainer visualizer' which do not show the tasks at all when filtering for "running" tasks, where "running" means healthy in their filtering. 'starting' should mean 'starting', defined by start-period and not by interval. I also want to use long intervals, but without waiting for the first check before becoming healthy.
Thanks for fixing this!

As @thaJeztah wrote

Now that we have --health-start-period, this can be circumvented, because that option can be set to "ignore" the first xx seconds of failures, but it's not something that's set by default.

the health-start-period has been implemented through all "stages" with docker/cli#475

@rdxmb

rdxmb commented Feb 26, 2018

There are two problems with the current healthcheck behavior when using swarm mode:

  1. If the interval is too small and the container's process (e.g. a database) has not fully started, the service will be restarted -> this becomes a restart loop

  2. If a larger interval is chosen, the swarm does not route to the container until one full interval has elapsed, even if the process has fully started.

Therefore, the solution for the healthcheck must be to run the first check at time 0 and not at time 1 × interval.

Optionally, --start-period could default to --interval.

I tried to make some examples of "existing setups" and compared them with the suggested solution.

(example comparisons attached as images)

@thaJeztah There is one thing which will probably break, in my opinion. I think this could be fixed with --start-period = --interval by default:

(example attached as an image)

@dtakken

dtakken commented Mar 20, 2018

In the suggested solution above, I still do not see how a container that needs --start-period time to get healthy would become available any sooner than --interval seconds. That can only be achieved when a health check is performed right after --start-period seconds. As long as that does not happen, any container that is not healthy instantly will not be available for --interval time...

@rdxmb

rdxmb commented Mar 20, 2018

@dtakken I must admit that you are right. Thanks for the correction, I had not seen that. So, what would you suggest?
~ Add a --start-period-interval to check very often within --start-period ? ~ (edited: this is bullshit)

(Still hoping somebody will have a simple solution for all that ...)

@dtakken

dtakken commented Mar 22, 2018

The simplest solution would be to change Docker behavior and perform a health check just after --start-period seconds.

In case there are concerns about breaking existing setups, a new option would need to be introduced. Something like --check-on-start could trigger an additional check just when the container is expected to have started, after --start-period seconds.

Since --start-period is zero by default, enabling --check-on-start would do a first health check immediately, which makes sense for most containers. For containers that need a second to start, setting --start-period=2 for instance would do a health check after two seconds.

@thaJeztah: Which of these two options would be acceptable?

@thaJeztah
Member

The simplest solution would be to change Docker behavior and perform a health check just after --start-period seconds

That would defeat the purpose of --start-period. The --start-period acts as a "grace period"; in other words, health checks run during that period, but a failed check does not increment the failure counter. Once a health check passes during the start period, the container/service becomes healthy, and traffic is routed to it.

For that reason, the start period should always be made longer than --interval, so that at least one health check is performed during the start period.

There is one thing which will probably break in my opinion. I think this could be fixed with
--start-period = --interval by default:

@rdxmb If I understand correctly, the breaking change would be that the "retry-count" would be exhausted one "interval" sooner than before (because the first health check would start at t=0 instead of t=interval), correct?
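
As a sketch of the relationship described above (illustrative values only): keeping start_period longer than interval guarantees that at least one check lands inside the grace window.

services:
  api:
    image: example/api:latest          # hypothetical image
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/healthz"]
      interval: 30s                    # first check at roughly t=30s, then every 30s
      start_period: 90s                # longer than interval, so the checks at ~30s/60s/90s fall inside the grace window
      retries: 3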

@SvenAbels

@thaJeztah The main problem with the current approach is with checks that take a long time or are resource-intensive. For example, we have a container where we run the healthcheck every 20 minutes, because each healthcheck causes ~20 seconds of CPU load. With this setup it takes 20 minutes before the first traffic is routed to the container, which makes any restart quite problematic... ;-)

@alexpirine
Author

Copy Kubernetes?

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-readiness-probes

@rdxmb

rdxmb commented Mar 26, 2018

@thaJeztah Correct. However, as @dtakken explained, it does not really fix the actual problems...

@dtakken

dtakken commented Mar 28, 2018

That would defeat the purpose of --start-period.

You are completely right about that one, @thaJeztah. I should have proposed introducing an option to run an additional health check just before --start-period ends. :-)

The start-period for that reason, should always be made longer than --interval, so that at least one health check is performed during the start-period.

This is exactly the source of the problems. It forces us to either use a really short --interval or accept that it takes a long time before traffic gets routed to the container. What is needed is some way to have a longer --interval and still have the container go live quickly. An option to do an additional health check immediately after starting the container would solve the issue for containers that can become healthy instantly. An option to do an additional health check just before --start-period ends would also solve the issue for containers that need a second or two to become healthy.

How about introducing a --check-before-start option that:

  • Performs the first health check at t=0. If successful, container goes into service instantly. If it fails, failure counter is not incremented.
  • Performs a second check just before --start-period ends when:
    • The first check failed and
    • --interval is longer than --start-period
      If successful, container goes into service after about --start-period seconds. If it fails, failure counter is not incremented.

The periodical checks every --interval seconds are retained.
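
A purely hypothetical sketch of what this proposal could look like in Compose syntax; check_before_start is not a real option and is shown only to make the proposed semantics concrete:

services:
  app:
    image: example/app:latest          # hypothetical image
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 5m                     # periodical checks are retained
      start_period: 10s                # app needs a few seconds to warm up
      check_before_start: true         # HYPOTHETICAL option: extra checks at t=0 and just before start_period ends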

@dtakken

dtakken commented Mar 28, 2018

Indeed @alexpirine, it looks like the option that I propose for Docker mimics Kubernetes.

@ColdHeat

Seems like this issue could be easily solved by copying Kubernetes' approach. Add --initial-delay, which denotes the seconds until the first healthcheck, and then use --interval from then on.

Default --initial-delay to a null value (wait interval seconds until the first healthcheck) to maintain backwards compatibility. If --initial-delay is set to 0, run the health check immediately.

What is the concern with this implementation?

@intgr

intgr commented Aug 19, 2019

I think a far more flexible solution would be to provide an option such as --health-start-period-interval (start-period-health-interval?), to override --health-interval only during the start period.

Typically during container start, I want the interval to be low (in single digit seconds), to get the new version up quickly. But during normal operation, I don't want logs full of health check requests.

@mhemrg

mhemrg commented Feb 3, 2020

I think the best solution to fix this issue is what @ColdHeat explained.
What do you think, @thaJeztah?
I don't know how I can help in implementing this solution, but is this a feature you could plan for an upcoming release of Docker Engine?

@rdxmb

rdxmb commented Feb 3, 2020

Hello, I am still subscribed to this issue, although I have been working with Kubernetes instead of swarm for over a year.
Therefore, I just want to post some inspiration for a very good solution: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

TLDR: They use separate definitions for readiness and liveness, so you can use different parameters for start (until ready) and live (after ready)

jan-di added a commit to jan-di/docker-php-apache that referenced this issue Nov 9, 2020
@xkortex

xkortex commented Oct 21, 2021

Is this still under active consideration? I also encounter situations where I would like to run a health check shortly after the container comes up, but then after that, I wish for a longer interval. E.g.

  • t = 0s - container started
  • t = 5s - initial health check
  • t = 60/120/180... - health checks every INTERVAL (e.g. 60 seconds) thereafter.

@zdm

zdm commented Dec 12, 2021

Ideally it should have two intervals: the first used during the start period only, the second after the start period.

start_period = 30s
start_interval=5s
runtime_interval=1m

During the start period, it checks every 5 seconds for 30 seconds.
After the start period, it checks every 1 minute.

I am wondering why the Docker devs have been unable to build usable logic here. This is annoying. How long will this issue stay open?

@zdm

zdm commented Dec 12, 2021

People create bug reports and pull requests but get no feedback at all for years.

@domgenv

domgenv commented Mar 18, 2022

This feature would be really great to have. I would like to run a health check shortly after startup and then at spaced intervals thereafter. Same idea as xkortex's comment.

The problem with the current functionality is summed up perfectly by @cpuguy83 in his comment here.

"Typically when a container is just starting you want to check if it is ready more quickly than a typical healthcheck might run. Without this users have to balance between running healthchecks to frequently vs taking a very long time to mark a container as healthy for the first time."

@fabrice-toussaint

Any updates on this?

@UnaiUribarri

We can run the first health check at MIN(start-period, interval) to solve this issue.

spantaleev added a commit to spantaleev/vaultwarden that referenced this issue Dec 16, 2022
The way Docker health checks work is:

1. The container is started
2. Docker waits a full healthcheck `interval`
3. A healthcheck is made. If OK, mark as healthy

The problem is that (as described in 2.), Docker always
waits a full interval before doing the first check.
The `interval` configuration serves 2 purposes:
- wait time before the first health check
- wait time between subsequent health checks

This annoyance is also described in: moby/moby#33410

The way this manifests as a problem is when reverse-proxying with Traefik,
like we do by default in https://github.com/spantaleev/vaultwarden-docker-ansible-deploy

Traefik refuses to reverse-proxy to containers which are unhealthy
(Health Status=starting is also considered unhealthy).

This means that Vaultwarden, with its current 60 second healthcheck interval
will take a full minute to become healthy and to be reachable by Traefik.

In vaultwarden-docker-ansible-deploy, we currently work around this by
overriding the default health check interval:
https://github.com/spantaleev/vaultwarden-docker-ansible-deploy/blob/015c459970d41743bd75980c2df02e389e1b461b/roles/custom/devture_vaultwarden/templates/vaultwarden.service.j2#L15-L36

With our workaround (which this patch also applies), it takes *only* 10 seconds
after Vaultwarden startup until Traefik will reverse-proxy to it.
@aradalvand

aradalvand commented Dec 21, 2022

Why are there no updates on this one even though it seems reasonably simple to implement and there have also been lots of requests for it?

@EmilianoBruni

To work around this problem I use this script, written in Perl but easily adaptable to other languages:

#!/usr/bin/env perl

# Script to use as a docker healthcheck to work around this problem:
# https://github.com/moby/moby/issues/33410
# exit(0) = no problem, exit(1) = problem.
# Performs a real check of a URL only every <interval> seconds, counted from the first run.
# If called at any other time it does nothing and exits with the previous result.
# Usage in docker compose:
# healthcheck:
#            test: ${APPPATH}/bin/hc -i 300 https://url.to:3000/check
#            interval: 10s # this defines the first run
#
# REMEMBER: the .hc state file must not exist beforehand and must not be deployed to the container

use Mojo::Util qw(getopt);
use Mojo::File;
use LWP::Simple;
use FindBin;

# state file that stores the result of the last real check
use constant CFG => $FindBin::Bin . '/.hc';

getopt 'i|interval=i' => \my $interval;

die "Url can't be empty" unless $ARGV[0];

my $cfg = Mojo::File->new(CFG);

# first run: record success and report healthy
e($cfg, 0) unless -e CFG;

# if less than $interval seconds have passed since the last real check,
# do nothing and exit with the previous result
exit($cfg->slurp) if (time - $cfg->stat->mtime < $interval);

# perform the real check (LWP::Simple's "get" returns undef on failure)
e($cfg, get($ARGV[0]) ? 0 : 1);

sub e {
    $_[0]->spurt($_[1]);
    exit($_[1]);
}

Hoping it will be fixed soon

@tc-mccarthy

+1 on this.

@handersendk

+1 Would be amazing to have this solved, as it is frustrating to have the interval control how fast you can deploy new releases when using CI. One of the main benefits of docker swarm and health checks is that we can deploy with zero downtime, but this ruins the experience and introduces an artificial lag into the deployment.

Several other comments have already stated what we need to achieve.

This feature would be really great to have. I would like to run a health check shortly after startup and then at spaced intervals thereafter. Same idea as xkortex's comment.

The problem with the current functionality is summed up perfectly by @cpuguy83 in his comment here.

"Typically when a container is just starting you want to check if it is ready more quickly than a typical healthcheck might run. Without this users have to balance between running healthchecks to frequently vs taking a very long time to mark a container as healthy for the first time."

@SandaruKasa

+1, would be amazing to have it

@fredfortier

fredfortier commented May 15, 2023

The original problem in this thread is specific: it asks to run the first health check before, not after, the first interval.

I can't easily tell what has changed since then. I understand the backward-compatibility concerns with making this specific change.

As others have pointed out, I believe this particular issue calls into question the practicality of using a single health probe for both readiness and liveness.

In my case, it is difficult to reconcile these concerns with a single health probe. I can work around it with a test command similar to this: #33410 (comment) by touching a file to indicate readiness, and running either the readiness or liveness probe depending on whether this file exists. However, it is a hack that overlaps with the intent of start-period, so I'm concerned about maintainability.

I believe readiness and liveness concerns can be separated without breaking the existing conventions, perhaps by including an optional start-test kind of option for use cases that demand a separate readiness probe.
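
A rough sketch of the file-touch workaround described above, assuming the image provides a shell and curl, with hypothetical /ready and /live endpoints:

services:
  app:
    image: example/app:latest          # hypothetical image
    healthcheck:
      # before /tmp/ready exists, probe readiness and create the marker on success;
      # afterwards, probe liveness only
      test: ["CMD-SHELL", "if [ -f /tmp/ready ]; then curl -fs http://localhost:8080/live; else curl -fs http://localhost:8080/ready && touch /tmp/ready; fi"]
      interval: 30s
      retries: 3

As the comment notes, this overlaps with what start-period is meant to express, so it is more of a stopgap than a recommended pattern.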

@cpuguy83
Member

I implemented this some time ago here: #40894
It needs a couple of updates, but I doubt I'll have a chance to deal with it, so someone is welcome to take it over.

@cpuguy83
Member

cpuguy83 commented Jul 6, 2023

Changes are merged into moby/moby, but there are other ecosystem changes that are required.
I've created a tracking issue for this: #45897
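
With the change from #40894, the behavior requested in this thread can be expressed roughly as follows, assuming a Docker Engine and Compose version recent enough to support start_interval (values are illustrative):

services:
  web:
    image: example/myapp:latest        # hypothetical image
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      start_period: 30s                # grace window after container start
      start_interval: 5s               # check every 5s while inside the start period
      interval: 5m                     # then fall back to the long steady-state interval
      retries: 3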
