Replies: 67 comments 2 replies
-
This is indeed a very useful suggestion. I have also been thinking about how to do this for some time. Here are a couple of comments from my own experience.

First, I wouldn't advise using external tools for this. Instead, I propose implementing a simple healthcheck routine in the Portainer binary itself, which Docker can then invoke during healthchecks. In this case, Portainer can dial itself to request a status update and return the appropriate result and exit level: success if the HTTP code is 2XX, failure otherwise. Luckily, Portainer already implements a status API endpoint that can be leveraged for this proposal. Therefore we just need to implement a simple flag, e.g. `--healthcheck`.
With the above in place, healthchecks can then be enabled in a Portainer stack with the following:

```yaml
healthcheck:
  test: ['CMD', 'portainer', '--healthcheck']
```

For reference, this is also how the Kong API Gateway does its healthcheck. Moreover, this same approach can be implemented for the Portainer Agent binary. @itsconquest, if you and the Portainer team agree with this idea, I can work on it relatively quickly, as it doesn't involve working with UI elements and I can easily test it on my side.
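As a rough illustration of the proposed flag's behaviour (everything below is an assumption, since no implementation exists yet), the essential logic is just mapping the HTTP status class of the self-request to a process exit code:

```shell
# Hypothetical sketch of what `portainer --healthcheck` could do:
# probe the existing status endpoint and convert the HTTP status code
# into an exit level (0 = healthy, non-zero = unhealthy).
# `http_code_to_exit` is an illustrative helper, not real Portainer code.
http_code_to_exit() {
  case "$1" in
    2[0-9][0-9]) echo healthy;   return 0 ;;
    *)           echo unhealthy; return 1 ;;
  esac
}

# In the real flag, the code would come from a request to Portainer's
# own status endpoint, e.g. http://localhost:9000/api/system/status
http_code_to_exit 200           # prints "healthy", exits 0
http_code_to_exit 500 || true   # prints "unhealthy", exits 1
```

Docker treats any non-zero exit from the healthcheck command as unhealthy, so this mapping is all the glue that is needed.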
-
@ElleshaHackett @hhromic This would indeed be a nice way to go.
-
@Ornias1993 No, I have not started working on this :) @deviantony @itsconquest Now that I've become more familiar with the Portainer codebase, perhaps I can code a prototype and submit it as a PR for review?
-
@hhromic Ahh, okay... Happens to the best of us :) I read through most of the previous discussions about it. I think the fastest way to get feedback is indeed to throw in a prototype and work from there. 👍
-
Alright then, I'll put a prototype together this week and see how it goes!
-
Sounds like a good idea! I look forward to reviewing your work @hhromic :)
-
It could also be good to have control over the image's healthcheck, or even to disable it, per https://docs.docker.com/engine/reference/run/#healthcheck
-
@rhuanbarreto You can always override it in Docker, so that's a given.
-
Yes, but is it possible to do it in Portainer?
-
That's not the scope of this issue; there is another issue for handling healthchecks inside Portainer, though.
-
Actually, this was already implemented well before this issue... and got reverted just because it isn't compatible with the --ssl flag (which makes it unsuitable to add to the Dockerfile).
-
Hey guys, I just stumbled across this. Was there any movement on the --healthcheck flag? I understand there were a few issues with the previous solution. Thanks!
-
Maintainers are not interested, it seems.
-
Would really like this feature also. It's a little odd that a platform designed for managing and monitoring your Docker containers doesn't include the option to monitor itself. 🤷‍♂️
-
@hhromic Were there any updates on your end?
-
Thank you for that information. I will dig deeper. I did try the environment variable and still have the same issue. Definitely strange. And I never saw that environment line in the code sample Portainer provided us, since it's also run in the command section, per the Portainer Swarm setup.
-
Setting `AGENT_CLUSTER_ADDR: localhost` seemed to work. For some reason, DNS doesn't resolve properly in the healthcheck.
-
After a while, I had this issue on the agents. I think the agents got restarted but then couldn't start due to a DNS problem, and the UI reported this too.

I tried something similar, but it doesn't work in a Docker swarm, although for a single-node swarm or plain services it should be OK. What I'm looking at now is how to trigger all the Portainer containers to restart if one of the agents fails the healthcheck, maybe with a separate Docker container monitoring them, or by updating the healthcheck to trigger the parent Docker host to relaunch the containers. FYI, I've also updated the comment with a healthcheck API call so you know it's up and running for the UI.
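The "separate Docker container monitoring them" idea can be sketched as a small retry loop. Everything below is a placeholder sketch, not tested against a real swarm; the probe command, retry count, and service name are invented:

```shell
# Sketch of an external monitor. A real version would probe each agent,
# e.g. with `wget --spider` against its port, instead of the injected
# command used here for illustration.
probe_with_retries() {
  max=$1; shift
  n=0
  while [ "$n" -lt "$max" ]; do
    if "$@"; then echo healthy; return 0; fi
    n=$((n + 1))
  done
  echo unhealthy; return 1
}

probe_with_retries 3 true           # prints "healthy"
probe_with_retries 3 false || true  # prints "unhealthy" after 3 failed tries
# On "unhealthy", the monitor could force a redeploy, for example:
#   docker service update --force portainer_agent   # hypothetical service name
```

`docker service update --force` restarts a service's tasks even when nothing in its spec changed, which matches the "trigger the containers to relaunch" idea above.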
-
With the above it works. (Portainer still needs a proper health endpoint, though.)
-
@lonix1 I prefer to call the status endpoint. It's pretty much doing what a healthcheck endpoint would do, just giving more info about the status 🚀
-
@t0mtaylor I didn't consider the log. Good idea. So, to be complete, in a script I'd do something like this:

```shell
[ $(wget --quiet -O- --tries=1 http://localhost:9000/api/system/status \
    | sed -nE 's/.*Version":"([^"]*)".*/\1/p' | wc -l) = 1 ] \
  && echo up || echo down
```

That not only checks that the page exists, but also that it is returning the expected data (I've extracted the `Version` field). However, in a compose file I'd do something simpler:

```yaml
healthcheck:
  # ...
  test: wget --no-verbose --tries=1 --spider http://localhost:9000/api/system/status || exit 1
```
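The `sed` extraction in that script can be sanity-checked offline against a canned payload. The JSON below is a hand-written stand-in for the status response shape, not real Portainer output:

```shell
# Feed a sample payload through the same sed pipeline as above.
sample='{"Version":"2.0.0","InstanceID":"abcdef"}'   # assumed shape
version=$(echo "$sample" | sed -nE 's/.*Version":"([^"]*)".*/\1/p')
echo "$version"   # prints 2.0.0

# The `wc -l` trick: exactly one matched line means the endpoint
# returned something that looks like a version, i.e. "up".
count=$(echo "$sample" | sed -nE 's/.*Version":"([^"]*)".*/\1/p' | wc -l)
[ "$count" -eq 1 ] && echo up || echo down   # prints "up"
```

A payload without a `Version` key produces zero matched lines, so the same test reports "down" even when the HTTP request itself succeeded.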
-
@lonix1 Yeah, I would keep it simple for the healthcheck, as it's giving you enough to determine it's healthy. I do something similar, checking the version in a bash script that verifies services are running every 5 minutes and also checks how many containers are running per service, as Docker can still be a bit flaky and services vanish from the swarm!

I've updated the main comment #3572 (comment), as there's an issue with the healthcheck for agents when running in swarm mode. But running single-node on a Raspberry Pi, for example, both healthchecks for the UI and agents work, as @sgtcoder has confirmed on his setup 👍
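The "containers running per service" check can be sketched by comparing running vs. desired replicas. The input below is canned text in the shape of `docker service ls` output; the service names and counts are invented:

```shell
# In a real script the input would come from:
#   docker service ls --format '{{.Name}} {{.Replicas}}'
sample='portainer_agent 3/3
portainer_portainer 0/1'

# Split each "running/desired" pair and flag any shortfall.
echo "$sample" | while read -r name replicas; do
  running=${replicas%/*}
  desired=${replicas#*/}
  if [ "$running" = "$desired" ]; then
    echo "$name ok"
  else
    echo "$name degraded"
  fi
done
# prints "portainer_agent ok" then "portainer_portainer degraded"
```

A cron entry running this every 5 minutes, alerting (or forcing a redeploy) on any "degraded" line, approximates the script described above.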
-
Thank you guys for all the updates. I applied a bunch of the suggestions. I still had to use localhost on a single swarm node, but it seems to work aside from the TLS handshake log errors. I had issues in general using more than one Docker swarm node when trying to replicate storage, with both performance problems and overhead, so I just stick with one node for now. A start period of 5 seconds seems to be fine for me. I'm running on a dedicated HPE DL380 Gen9 server with the Docker VM configured with 32 GB RAM and 32 vCPUs. Here is what I have now:
-
@sgtcoder Try the wget for the agent healthcheck; that will remove the TLS handshake errors :)

These healthchecks work. As a workaround, I have a separate bash script checking with Docker that the agent containers are up and running on each server, and I've exposed port 9001 so I can wget that on each server as well. Not ideal, but a way forward until @tamarahenson and team improve the agent. Ideally, they would add a plain HTTP endpoint for this.
-
I tried the wget again, but for whatever reason that causes the check to fail, whereas the nc command works.
-
@sgtcoder Have you tried the wget via sh in the container while the agent is running? What's the output? Does it have an error?

Using `docker exec` returned a shell ready to use on the agent container. My output is this: it's an error 400, but that's good, as it hit the agent on port 9001.
Beta Was this translation helpful? Give feedback.
-
```shell
docker exec -it agent sh
/app # wget --no-check-certificate --no-verbose --tries=3 --spider --header='Content-Type:application/json' https://localho
```

That's the way it works: you need to specify httpS, not just http, and this way it spawns no extra SSL log warnings like "http: TLS handshake error from 172.24.0.1:54186: tls: first record does not look like a TLS handshake".
-
Is there a solution for Portainer only, without its agent? I am unable to attach to the container, as if it does not have bash or sh.
-
Hey, in case it wasn't obvious to anyone reading this thread: in order to use healthchecks for the portainer and portainer-agent containers, you'll need the Alpine image variants to be able to perform these checks inside the containers.
-
Describe the feature
Being able to see a "health status" of the Portainer Docker container.
Describe the solution you'd like
I would like support for the Docker healthcheck (which is also shown in Portainer.io's own dashboard and probably in other Docker management software).
Describe alternatives you've considered
The alternative is setting up something similar without using the tools that already exist within Docker.
Additional context
The Dockerfile could contain something like this:

```dockerfile
HEALTHCHECK --interval=60s --timeout=10s --retries=3 CMD curl -sS http://localhost:9000 || exit 1
```

For debugging and testing purposes you can use:

```shell
docker inspect --format "{{json .State.Health}}" containername
```
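Building on that debugging tip, the inspect output can be reduced to just the status string. The JSON below is a hand-written stand-in for what `docker inspect` returns, not captured output:

```shell
# Extract .Status from a sample {{json .State.Health}} payload.
# Against a live container you could skip the parsing entirely with:
#   docker inspect --format "{{.State.Health.Status}}" containername
health='{"Status":"healthy","FailingStreak":0,"Log":[]}'   # sample payload
status=$(echo "$health" | sed -nE 's/.*"Status":"([^"]*)".*/\1/p')
echo "$status"   # prints "healthy"
```

The `Status` field cycles through `starting`, `healthy`, and `unhealthy`, so scripting against this one string is usually enough for external monitoring.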