
Docker Healthcheck support on Portainer Container #3572

Closed
JaneX8 opened this issue Feb 23, 2020 · 64 comments
Labels
area/dockerfile kind/enhancement Applied to Feature Requests

Comments

@JaneX8

JaneX8 commented Feb 23, 2020

Describe the feature
Being able to see a "health status" of the Portainer Docker container.

Describe the solution you'd like
I would like support for the Docker Healthcheck (that is also shown in Portainer.io 's own dashboard and probably other Docker management software).

Describe alternatives you've considered
An alternative is setting up something similar without using the tools that already exist within Docker.

Additional context
The Dockerfile could contain something like this:

HEALTHCHECK --interval=60s --timeout=10s --retries=3 CMD curl -sS http://localhost:9000 || exit 1

For debugging and testing purposes you can use:

docker inspect --format "{{json .State.Health}}" containername
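
With a healthcheck configured, that prints Docker's health object. The values below are illustrative, but the field names (Status, FailingStreak, Log) are what Docker returns:

$ docker inspect --format "{{json .State.Health}}" portainer
{"Status":"healthy","FailingStreak":0,"Log":[{"Start":"2020-02-23T10:00:00Z","End":"2020-02-23T10:00:01Z","ExitCode":0,"Output":""}]}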


@hhromic
Contributor

hhromic commented Mar 3, 2020

This is indeed a very useful suggestion. I have also been thinking about how to do this for some time. Here are a couple of comments from my own experience.

First, I wouldn't advise using curl as suggested in this ticket, because we would then need to ship the curl binary (and its dependencies) inside the container as well. I would also advise against forcing the healthcheck in the Dockerfile using the HEALTHCHECK directive.

Instead, I propose implementing a simple healthcheck routine in the Portainer binary itself that Docker can then invoke during healthchecks. In this case, Portainer can dial itself to request a status update and return the appropriate result and exit level depending on whether the HTTP code is 2XX or not.

Luckily, Portainer already implements a status API endpoint that can be leveraged for this proposal. Therefore we just need a simple flag, e.g. --healthcheck, that makes the Portainer binary call its own status API, print the result, and exit with an appropriate error level.

For example:

# healthy case
$ portainer --healthcheck; echo $?
{"Authentication":true,"EndpointManagement":true,"Snapshot":true,"Analytics":false,"Version":"1.23.1"}
0

# unhealthy case
$ portainer --healthcheck; echo $?
{"err": "Something bad happened"}
1

With the above in place, healthchecks can be enabled in a Portainer stack with the following:

healthcheck:
  test: ['CMD', 'portainer', '--healthcheck']

For reference, this is how the Kong API Gateway does healthchecks, i.e. the kong up command in a stack, and how PostgreSQL does it as well, i.e. the pg_isready command, also in a stack. This approach is more robust, requires no additional dependencies, and can be smarter than just checking whether the server responds via HTTP, i.e. it can return more elaborate status reports.

Moreover, this same approach can also be implemented for the Portainer Agent binary.
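
For completeness, the plain docker run equivalent of that stack snippet would be something along these lines (the --health-* flags are standard Docker options; portainer --healthcheck is the flag proposed above, so the exact binary name/path inside the image may differ):

docker run -d --name portainer \
  --health-cmd='portainer --healthcheck' \
  --health-interval=60s \
  --health-timeout=10s \
  --health-retries=3 \
  portainer/portainer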

@itsconquest if you and the Portainer team agree on this idea, I can work on it relatively quickly, as it doesn't involve working with UI elements and I can easily test it on my side.

@Ornias1993

Ornias1993 commented May 17, 2020

@ElleshaHackett
In curl-enabled containers, I mostly curl the page and grep for a part of the known-good status page. Works like a charm and checks more than just HTTP 200. Your example just checks whether something is served with HTTP 200 on port 9000, which is not enough to verify Portainer is actually processing requests.
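
For example, something along these lines (the grep pattern is only an illustration - match whatever string your known-good page reliably contains):

curl -fsS http://localhost:9000 | grep -q "Portainer" || exit 1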

@hhromic This would indeed be a nice way to go.
Without curl in the image that's not an option, so this would be very nice to have.
Did you actually start working on it?

@hhromic
Contributor

hhromic commented May 17, 2020

@Ornias1993 no, I have not started working on this :)
I was waiting for some input from the Portainer team as to whether they are interested, but then I forgot about this issue, hehe.

@deviantony @itsconquest now that I've become more familiar with the Portainer codebase, perhaps I can code a prototype and submit it as a PR for review?

@Ornias1993

@hhromic Ahh, okay... Happens to the best of us :)

I read through most of the previous discussions about it.
Afaik @deviantony and @itsconquest aren't against it, but no one has actually taken it on or finished it.

I think the fastest way of getting feedback is indeed throwing in a prototype and working from there. 👍

@hhromic
Contributor

hhromic commented May 17, 2020

Alright then, I'll put a prototype together this week and see how it goes!

@ghost

ghost commented May 20, 2020

Sounds like a good idea! I look forward to reviewing your work @hhromic :)

@rhuanbarreto

It would also be good to have control over the image's healthcheck, or even to disable it, as described in https://docs.docker.com/engine/reference/run/#healthcheck

@Ornias1993

@rhuanbarreto You can always override it in Docker, so that's a given.
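
For reference, with plain docker run you can disable a baked-in healthcheck entirely or override it at container-create time (the check command and image tag below are placeholders only):

# disable the image's HEALTHCHECK
docker run -d --no-healthcheck portainer/portainer

# or override it with your own
docker run -d \
  --health-cmd='wget -q --spider http://localhost:9000 || exit 1' \
  --health-interval=60s \
  portainer/portainer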

@rhuanbarreto

Yes, but is it possible to do it in Portainer?

@Ornias1993

That's not in the scope of this issue; there is another issue for handling healthchecks inside Portainer, though.

@Ornias1993

Ornias1993 commented Nov 1, 2020

Actually, this was already implemented well before this issue was opened...
See #1366

It was reverted just because it isn't compatible with the --ssl flag (which makes it unsuitable to add to the Dockerfile).

@modem7

modem7 commented Jan 10, 2021

Hey guys,

Just stumbled across this - was there any movement on the --healthcheck flag? I understand there were a few issues with the previous solution.

Thanks!

@Ornias1993

Maintainers are not interested, it seems, and don't even care enough to just say so.

@kwilliams1987

I would really like this feature too; it's a little odd that a platform designed for managing and monitoring your Docker containers doesn't include the option to monitor itself. 🤷‍♂️

@modem7

modem7 commented Jan 16, 2021

@hhromic were there any updates on your end?

@hhromic
Contributor

hhromic commented Jan 16, 2021

@modem7 , all,
Apologies, I've been really busy with work over the last few months, so I haven't had the time I wish I had to work on this.
If someone wants to step up, please do so; otherwise I will try to get back to this as soon as I can.

@deviantony
Member

Sorry for the silence on this one; we're interested in this feature, it's just that we have a lot of other things to deal with as well.

We've been giving it more thought and we're thinking about bringing support for this feature along with #821, which should work around the potential issues we've had so far with HTTP/HTTPS and the healthcheck.

We have #821 in our backlog at the moment and we'll start thinking about this one based on the existing implementations that have been provided by contributors.

@deviantony deviantony added this to the backlog milestone Jan 19, 2021
@deviantony deviantony removed this from the backlog milestone Mar 4, 2021
@urda

urda commented Apr 28, 2023

    healthcheck:
      test: "wget --no-verbose --tries=1 --spider --no-check-certificate https://localhost:9443 || exit 1"
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 20s

^ This approach is harmful. It will generate thousands of defunct (zombie) ssl_client processes on the host.

    healthcheck:
      test: "wget --no-verbose --tries=1 --spider http://localhost:9000 || exit 1"
      interval: 60s
      timeout: 5s
      retries: 3
      start_period: 20s

^ This is the correct approach, as it tests port 9000 over HTTP and thus doesn't produce the army of defunct processes. --no-check-certificate is not needed in this case because the test is made against the HTTP port.
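
If you want to verify this on your host, a quick way to count defunct processes (standard ps output; purely illustrative):

ps -eo stat,comm | awk '$1 ~ /^Z/' | sort | uniq -c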

For those just using a plain docker run, that might look something like this:

docker run \
-d \
--name portainer \
--restart always \
--health-cmd='wget --no-verbose --tries=1 --spider http://localhost:9000 || exit 1' \
--health-interval=60s \
--health-retries=3 \
--health-timeout=5s \
--health-start-period=20s \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /path/to/docker/portainer/data:/data \
-v /path/to/docker/portainer/ssl:/ssl \
portainer/portainer-ce:alpine \
--bind-https ":443" \
--sslcert /ssl/portainer.crt \
--sslkey /ssl/portainer.key

Where

--health-cmd='wget --no-verbose --tries=1 --spider http://localhost:9000 || exit 1' \
--health-interval=60s \
--health-retries=3 \
--health-timeout=5s \
--health-start-period=20s \

are the main health check settings.

@barndawgie

That doesn't seem to work since there is no shell or wget in the container, as far as I can tell:

~$ docker exec portainer-ce 'wget'
OCI runtime exec failed: exec failed: container_linux.go:367: starting container process caused: exec: "wget": executable file not found in $PATH: unknown

@urda

urda commented Apr 28, 2023

It does; make sure you're using the portainer/portainer-ce:alpine image, which has the required tools.

@t0mtaylor

t0mtaylor commented Jun 30, 2023

When I use the alpine images for both the Portainer UI and the agents in single node/local mode (Red Hat Linux), they include the sh shell, plus two preinstalled utilities we can use for health checks:

I'm also using version: '3.8' at the top of the docker compose file, and start_period: 30s has been added below - supported in compose file format 3.4 since Docker 17.09 - docker/cli#475

If you are running docker swarm mode, you'll have to set up a separate bash script for each server to check that the agent is running; if there is an issue you can then re-deploy the service so the agents restart.

Currently, when the healthcheck is enabled for an agent in docker swarm mode, it causes a DNS error within the agent container, meaning DNS resolution fails - unable to retrieve a list of IP associated to the host | error="lookup tasks.agent on 127.0.0.11:53: no such host" host=tasks.agent - https://github.com/portainer/agent/blob/45b383bc613bf9e64be8637c37a93201cf33db78/cmd/agent/main.go#L134. Ideally we need to be able to raise the sleep timeout there from 3 seconds to a bigger value via an env var, so we can try to make it work with the start_period of the docker healthcheck.

Ultimately this is an issue with the agent that Portainer should fix so that healthchecks can be enabled on the agents.

If you're running on a Raspberry Pi or a congested/busy swarm, the start_period may need increasing, same for the timeouts, etc. - have a play!

Portainer UI - using wget - works on local node and swarm mode

     image: portainer/portainer-ee:2.18.3-alpine
     healthcheck:
        test: "wget --no-verbose --tries=3 --spider http://localhost:9000/api/system/status || exit 1"
        interval: 60s
        timeout: 15s
        retries: 3
        start_period: 30s

Portainer Agents - using wget - only works on single node mode (not swarm)

  • now using wget instead of nc to reduce TLS handshake errors in the agent log output; it uses the /ping URI, but we are missing some security headers - it would be great if this worked without any auth headers required
  • added AGENT_CLUSTER_PROBE_TIMEOUT and AGENT_CLUSTER_PROBE_INTERVAL to improve performance on your node by reducing the frequency of checking the agent(s)
  • the hostname can also be forced on the agents in single node mode, see https://lucatnt.com/2021/11/fix-portainer-agent-restart-loop/
  • AGENT_CLUSTER_ADDR set to localhost for single mode only, not swarm mode - should be tasks.agent for swarm mode
     image: portainer/agent:2.18.3-alpine
     environment:
        # REQUIRED: Should be equal to the service name prefixed by "tasks." when
        # deployed inside an overlay network
        # Set AGENT_CLUSTER_ADDR to localhost for Single Node only, not Swarm mode!
        AGENT_CLUSTER_ADDR: localhost 
        # Performance tweaks
        AGENT_CLUSTER_PROBE_TIMEOUT: "2000ms"
        AGENT_CLUSTER_PROBE_INTERVAL: "3000ms"
        # AGENT_PORT: 9001
        # LOG_LEVEL: debug
     healthcheck:
        test: "wget --no-check-certificate --no-verbose --tries=3 --spider --header='Content-Type:application/json' http://127.0.0.1:9001/ping || exit 1"
        interval: 30s
        timeout: 10s
        retries: 3
        start_period: 30s

[screenshot: the healthchecks working]

No need for "too much" hackery! :)

  • I have updated this comment due to various issues with healthchecks on the agents in swarm mode

@Enissay

Enissay commented Jun 30, 2023

@t0mtaylor image names seems to be reversed... Please double check so I can test asap :-)

@sgtcoder

t0mtaylor

this definitely does not work for me

@modem7

modem7 commented Jun 30, 2023

t0mtaylor

this definitely does not work for me

Can you post your compose file so we can see what you're trying to do?

As wget etc 100% works on the alpine images.

@sgtcoder

  portainer-agent:
    image: portainer/agent:alpine
    ports:
      - 9001:9001/tcp
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    healthcheck:
      test: timeout 10 nc -z -v localhost 9001 || exit 1
      interval: 30s
      timeout: 10s
      retries: 3
    networks:
      - core_network
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]

@sgtcoder

When I run the command in the container, I get that it is open: localhost (127.0.0.1:9001) open

But for some reason, setting it as a healthcheck makes the container unreachable.

@t0mtaylor

t0mtaylor commented Jul 1, 2023

@t0mtaylor image names seems to be reversed... Please double check so I can test asap :-)

@Enissay Ahh, I see - have now fixed it, @sgtcoder 👍

@sgtcoder

sgtcoder commented Jul 1, 2023

@t0mtaylor image names seems to be reversed... Please double check so I can test asap :-)

@Enissay Not sure what your on about - they work fine for me 👍

He is saying that, in your example, you have the nc command for the Portainer UI and the wget command for your agent.

@t0mtaylor

t0mtaylor commented Jul 1, 2023

When I run the command in the container, I get that it is open: localhost (127.0.0.1:9001) open

But for some reason setting it as a healthcheck makes the container not connectable.

@sgtcoder see the updated comment - #3572 (comment) - I've added start_period: 30s so it has enough time to start the containers and register 🚀

Just make sure you're using a recent docker compose file version; I'm using version: '3.8' - the minimum you can use is 3.4.

Also added screenshot of it working to the main comment too 🕺

@t0mtaylor

Did that work for you, @sgtcoder, with the start_period?

@sgtcoder

sgtcoder commented Jul 1, 2023

@t0mtaylor I just booted my computer and am SSHing in to check now. Thank you for the updates. I will let you know.

@sgtcoder

sgtcoder commented Jul 1, 2023

It's strange because I am still getting "Environment is unreachable."

portainer-agent:
    image: portainer/agent:2.18.3-alpine
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    healthcheck:
      test: "timeout 10 nc -z -v localhost 9001 || exit 1"
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s
    networks:
      - core_network
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]
  portainer:
    image: portainer/portainer-ee:2.18.3-alpine
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=America/Phoenix
    command: -H tcp://tasks.portainer-agent:9001 --tlsskipverify
    volumes:
      - /mnt/storage/dockers/portainer:/data
    healthcheck:
      test: "wget --no-verbose --tries=3 --spider http://localhost:9000 || exit 1"
      interval: 60s
      timeout: 15s
      retries: 3
      start_period: 120s
    networks:
      - core_network
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]

I know the command works in the container itself. Literally no matter what healthcheck I put on the Portainer agent, it becomes unreachable.
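
For what it's worth, two standard Docker commands that help show why swarm keeps cycling or failing the task (the stack/service and container names below are placeholders - substitute your own):

docker service ps --no-trunc STACKNAME_portainer-agent
docker inspect --format '{{json .State.Health.Log}}' AGENT_CONTAINER_ID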

@t0mtaylor

The only difference with mine is that I have a separate network for the agents (which is defined for both the UI and the agents, plus a separate network for the UI only, which is accessible via the load balancer), but you are also missing this, set below your image declaration on the agent service:

    environment:
      # REQUIRED: Should be equal to the service name prefixed by "tasks." when
      # deployed inside an overlay network
      AGENT_CLUSTER_ADDR: tasks.portainer-agent
      # AGENT_PORT: 9001
      # LOG_LEVEL: debug

@sgtcoder

sgtcoder commented Jul 1, 2023

Thank you for that information. I will dig deeper. I did try the environment variable and still had the same issue. Definitely strange. And I never saw that environment line in the code sample Portainer provided us, since it's also run in the command section.

Per the Portainer swarm setup:

docker network create \
--driver overlay \
  portainer_agent_network

docker service create \
  --name portainer_agent \
  --network portainer_agent_network \
  -p 9001:9001/tcp \
  --mode global \
  --constraint 'node.platform.os == linux' \
  --mount type=bind,src=//var/run/docker.sock,dst=/var/run/docker.sock \
  --mount type=bind,src=//var/lib/docker/volumes,dst=/var/lib/docker/volumes \
  portainer/agent:2.18.3

@sgtcoder

sgtcoder commented Jul 1, 2023

unable to retrieve a list of IP associated to the host | error="lookup tasks.portainer-agent on 127.0.0.11:53: no such host"

@sgtcoder

sgtcoder commented Jul 1, 2023

#8578

AGENT_CLUSTER_ADDR: localhost

This seemed to work. For some reason the DNS doesn't resolve properly during the healthcheck.

@t0mtaylor

t0mtaylor commented Jul 1, 2023

#8578

AGENT_CLUSTER_ADDR: localhost

This seemed to work. For some reason the DNS doesn't resolve properly during the healthcheck.

After a while I had this issue on the agents - I think the agents got restarted but then couldn't start due to a DNS problem:

github.com/portainer/agent/cmd/agent/main.go:141 > unable to retrieve a list of IP associated to the host | error="lookup tasks.agent on 127.0.0.11:53: no such host" host=tasks.agent

And the UI reported this

{"time":1688195821,"message":"http: proxy error: dial tcp: lookup tasks.agent on 127.0.0.11:53: no such host"}

I tried something similar, but it doesn't work in a docker swarm, although for a single-node swarm or plain services it should be OK.

What I'm looking at now is how to trigger all the Portainer containers to restart if one of the agents fails the healthcheck, maybe with a separate Docker container monitoring them, or by updating the healthcheck to trigger the parent Docker host to relaunch the containers.

FYI - I've also updated the comment with a healthcheck API call for the UI, so you know it's up and running:

     image: portainer/portainer-ee:2.18.3-alpine
     healthcheck:
        test: "wget --no-verbose --tries=3 --spider http://localhost:9000/api/system/status || exit 1"

@lonix1

lonix1 commented Jul 1, 2023

With wget --spider there is no difference between checking http://localhost:9000/api/system/status and http://localhost:9000. In both cases one simply checks that there is a response. It's not a proper "healthcheck", but rather "proof of life". 😏

(Portainer still needs a proper HEALTHCHECK endpoint, preferably at the conventional endpoint of /api/healthz.)

@t0mtaylor

With wget --spider there is no difference between checking http://localhost:9000/api/system/status and http://localhost:9000. In both cases one simply checks that there is a response. It's not a proper "healthcheck", but rather "proof of life". 😏

(Portainer still needs a proper HEALTHCHECK endpoint, preferably at the conventional endpoint of /api/healthz.)

@lonix1 I prefer to call http://localhost:9000/api/system/status, as at least you know the API is up and running, instead of flooding the logs with 401 errors - this returns a nice 200, even though --spider just checks for a response :)

It's pretty much doing what a healthcheck endpoint does, just giving more info about the status 🚀

@lonix1

lonix1 commented Jul 1, 2023

@t0mtaylor I didn't consider the log. Good idea.

The response is this:

{
  "Version": "2.0.0",
  "demoEnvironment": {
    "enabled": true,
    "environments": [
      0
    ],
    "users": [
      1
    ]
  },
  "instanceID": "299ab403-70a8-4c05-92f7-bf7a994d50df"
}

So to be complete, in a script, I'd do something like this:

[ $(wget --quiet -O- --tries=1 http://localhost:9000/api/system/status | sed -nE 's/.*Version":"([^"]*)".*/\1/p' | wc -l) = 1 ] \
  && echo up || echo down

That not only checks that the page exists, but that it is returning expected data. I've extracted the Version arbitrarily - if that is found, then the API is up.

However in a compose file, I'd do something simpler:

healthcheck:
  # ...
  test: wget --no-verbose --tries=1 --spider http://localhost:9000/api/system/status || exit 1

@t0mtaylor

@lonix1 Yeah, I would keep it simple for the healthcheck, as it gives you enough to determine it's healthy.

I do something similar checking the version in a bash script, which checks that services are running every 5 minutes and also checks how many containers are running per service, as Docker can still be a bit flaky and services can vanish from the swarm!

I've updated the main comment (#3572 (comment)) as there's an issue with the healthcheck for agents when running in swarm mode - but running a single node on a Raspberry Pi, for example, both healthchecks for the UI and the agents work, as @sgtcoder has confirmed on his setup 👍

@sgtcoder

sgtcoder commented Jul 1, 2023

Thank you guys for all the updates. I applied a bunch of the suggestions. I still had to use localhost on a single swarm node, but it seems to work aside from the TLS handshake log errors. I had issues in general with using more than one swarm node and trying to replicate storage, with both performance problems and overhead, so I'm sticking with one node for now.

A start period of 5 seconds seems to be fine for me, running on a dedicated HPE DL380 Gen9 server with the Docker VM configured with 32 GB RAM and 32 vCPUs.

Here is what I have now

version: "3.8"
services:
  portainer-agent:
    image: portainer/agent:alpine
    environment:
      AGENT_CLUSTER_ADDR: localhost
      AGENT_CLUSTER_PROBE_TIMEOUT: 2000ms
      AGENT_CLUSTER_PROBE_INTERVAL: 3000ms
      #LOG_LEVEL: DEBUG
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    healthcheck:
      test: timeout 10 nc -z -v 127.0.0.1 9001 || exit 1
      start_period: 5s
      interval: 15s
      timeout: 5s
      retries: 5
    networks:
      - core_network
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]
  portainer:
    image: portainer/portainer-ee:alpine
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=America/Phoenix
    command: -H tcp://tasks.portainer-agent:9001 --tlsskipverify
    volumes:
      - /mnt/storage/dockers/portainer:/data
    healthcheck:
      test: wget --no-verbose --tries=3 --spider http://127.0.0.1:9000/api/system/status || exit 1
      start_period: 5s
      interval: 15s
      timeout: 5s
      retries: 5
    networks:
      - core_network
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]
networks:
  core_network:
    external: true

@t0mtaylor

t0mtaylor commented Jul 5, 2023

@sgtcoder try the wget for the agent healthcheck and that will remove the TLS handshake errors :)

  healthcheck:
        test: "wget --no-check-certificate --no-verbose --tries=3 --spider --header='Content-Type:application/json' http://127.0.0.1:9001/ping || exit 1"
        interval: 30s
        timeout: 10s
        retries: 3
        start_period: 30s

These healthchecks work OK for a single-node setup. Because of the way the agents do a DNS lookup, and because there's a hardcoded timeout/sleep, the healthcheck won't work on the agents in swarm mode, as the DNS doesn't resolve tasks.agent - I've detailed this in the main comment earlier (#3572 (comment)).

As a workaround, I have a separate bash script checking with docker that the agent containers are up and running on each server, and I've exposed port 9001 so I can wget that on each server as well - not ideal, but a way forward until @tamarahenson and team improve the agent. Ideally they would add an HTTP ping or a shell command we can use to verify the agent, once the DNS lookup issue at startup is fixed for swarm mode.
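
A minimal sketch of such a per-host watchdog script (run from cron on each node; the container name filter and service name are assumptions - adjust them to your stack):

#!/bin/bash
# Hypothetical watchdog for the Portainer agent on this swarm node.

# 1) is the agent container running here?
if ! docker ps --filter "name=agent" --filter "status=running" -q | grep -q .; then
  echo "portainer agent container not running on $(hostname)"
fi

# 2) is the published agent port answering? (bash /dev/tcp avoids needing nc/wget on the host)
if ! timeout 5 bash -c 'exec 3<>/dev/tcp/127.0.0.1/9001' 2>/dev/null; then
  echo "port 9001 unreachable on $(hostname), forcing a service redeploy"
  docker service update --force portainer_agent   # assumption: the agent service is named portainer_agent
fi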

@sgtcoder

sgtcoder commented Jul 5, 2023

I tried the wget again, but for whatever reason, that causes the check to fail, whereas the nc command works.

@t0mtaylor

@t0mtaylor

t0mtaylor commented Jul 7, 2023

@sgtcoder Have you tried the wget via sh in the container whilst the agent is running? What's the output? Does it have an error?

  1. Get an interactive shell in the container - replace CONTAINERID with the real one after running docker ps | grep portainer and spotting the agent id:
docker exec -it CONTAINERID sh

which returns a shell ready to use on the agent container:

/app #
  2. Run the wget command:
wget --no-check-certificate --no-verbose --tries=3 --spider --header='Content-Type:application/json' http://127.0.0.1:9001/ping

My output is this - it's an HTTP 400 error, but that's good, as it reached the agent on port 9001:

/app # wget --no-check-certificate --no-verbose --tries=3 --spider --header='Content-Type:application/json' http://127.0.0.1:9001/ping

Connecting to 127.0.0.1:9001 (127.0.0.1:9001)
wget: server returned error: HTTP/1.0 400 Bad Request

@portainer portainer locked and limited conversation to collaborators Jul 27, 2023
@jamescarppe jamescarppe converted this issue into discussion #9597 Jul 27, 2023

