
Docker Healthcheck support on Portainer Container #3572

Closed
JaneX8 opened this issue Feb 23, 2020 · 64 comments
Labels
area/dockerfile kind/enhancement Applied to Feature Requests

Comments

@JaneX8

JaneX8 commented Feb 23, 2020

Describe the feature
Being able to see a "health status" of the Portainer Docker container.

Describe the solution you'd like
I would like support for the Docker Healthcheck (that is also shown in Portainer.io 's own dashboard and probably other Docker management software).

Describe alternatives you've considered
An alternative is setting up something similar without using the tools that already exist within Docker.

Additional context
The Dockerfile could contain something like this:

HEALTHCHECK --interval=60s --timeout=10s --retries=3 CMD curl -sS http://localhost:9000 || exit 1

For debugging and testing purposes you can use:

docker inspect --format "{{json .State.Health}}" containername
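
With a healthcheck configured, that prints Docker's health object. The values below are illustrative, but the field names (Status, FailingStreak, Log) are what Docker returns:

$ docker inspect --format "{{json .State.Health}}" portainer
{"Status":"healthy","FailingStreak":0,"Log":[{"Start":"2020-02-23T10:00:00Z","End":"2020-02-23T10:00:01Z","ExitCode":0,"Output":""}]}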


@hhromic
Contributor

hhromic commented Mar 3, 2020

This is indeed a very useful suggestion. I have also been thinking about how to do this for some time. Here are a couple of comments from my own experience.

First, I wouldn't advise using curl as suggested in this ticket, because we would then need to ship the curl binary (and its dependencies) inside the container as well. I would also advise against forcing the healthcheck in the Dockerfile using the HEALTHCHECK directive.

Instead, I propose implementing a simple healthcheck routine in the Portainer binary itself that Docker can then invoke during healthchecks. In this case, Portainer can dial itself to request a status update and return the appropriate result and exit level depending on whether the HTTP code is 2XX or not.

Luckily, Portainer already implements a status API endpoint that can be leveraged for this proposal. Therefore we just need a simple flag, e.g. --healthcheck, that makes the Portainer binary call its own status API, print the result, and exit with an appropriate error level.

For example:

# healthy case
$ portainer --healthcheck; echo $?
{"Authentication":true,"EndpointManagement":true,"Snapshot":true,"Analytics":false,"Version":"1.23.1"}
0

# unhealthy case
$ portainer --healthcheck; echo $?
{"err": "Something bad happened"}
1

With the above in place, healthchecks can be enabled in a Portainer stack with the following:

healthcheck:
  test: ['CMD', 'portainer', '--healthcheck']

For reference, this is how the Kong API Gateway does healthchecks, i.e. the kong up command in a stack, and how PostgreSQL does it as well, i.e. the pg_isready command, also in a stack. This approach is more robust, requires no additional dependencies, and can be smarter than just checking whether the server responds via HTTP, i.e. it can return more elaborate status reports.

Moreover, this same approach can also be implemented for the Portainer Agent binary.
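
For completeness, the plain docker run equivalent of that stack snippet would be something along these lines (the --health-* flags are standard Docker options; portainer --healthcheck is the flag proposed above, so the exact binary name/path inside the image may differ):

docker run -d --name portainer \
  --health-cmd='portainer --healthcheck' \
  --health-interval=60s \
  --health-timeout=10s \
  --health-retries=3 \
  portainer/portainer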

@itsconquest if you and the Portainer team agree on this idea, I can work on it relatively quickly, as it doesn't involve working with UI elements and I can easily test it on my side.

@Ornias1993

Ornias1993 commented May 17, 2020

@ElleshaHackett
In curl-enabled containers, I mostly curl the page and grep for a part of the known-good status page. Works like a charm and checks more than just HTTP 200. Your example just checks whether something is served with HTTP 200 on port 9000, which is not enough to verify Portainer is actually processing requests.
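
For example, something along these lines (the grep pattern is only an illustration - match whatever string your known-good page reliably contains):

curl -fsS http://localhost:9000 | grep -q "Portainer" || exit 1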

@hhromic This would indeed be a nice way to go.
Without curl in the image that's not an option, so this would be very nice to have.
Did you actually start working on it?

@hhromic
Contributor

hhromic commented May 17, 2020

@Ornias1993 no, I have not started working on this :)
I was waiting for some input from the Portainer team as to whether they are interested, but then I forgot about this issue, hehe.

@deviantony @itsconquest now that I've become more familiar with the Portainer codebase, perhaps I can code a prototype and submit it as a PR for review?

@Ornias1993

@hhromic Ahh, okay... Happens to the best of us :)

I read through most of the previous discussions about it.
Afaik @deviantony and @itsconquest aren't against it, but no one has actually taken it on or finished it.

I think the fastest way of getting feedback is indeed throwing in a prototype and working from there. 👍

@hhromic
Contributor

hhromic commented May 17, 2020

Alright then, I'll put a prototype together this week and see how it goes!

@ghost

ghost commented May 20, 2020

Sounds like a good idea! I look forward to reviewing your work @hhromic :)

@rhuanbarreto

It would also be good to have control over the image's healthcheck, or even to disable it, as described in https://docs.docker.com/engine/reference/run/#healthcheck

@Ornias1993

@rhuanbarreto You can always override it in Docker, so that's a given.
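
For reference, with plain docker run you can disable a baked-in healthcheck entirely or override it at container-create time (the check command and image tag below are placeholders only):

# disable the image's HEALTHCHECK
docker run -d --no-healthcheck portainer/portainer

# or override it with your own
docker run -d \
  --health-cmd='wget -q --spider http://localhost:9000 || exit 1' \
  --health-interval=60s \
  portainer/portainer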

@rhuanbarreto

Yes, but is it possible to do it in Portainer?

@Ornias1993

That's not in the scope of this issue; there is another issue for handling healthchecks inside Portainer, though.

@Ornias1993

Ornias1993 commented Nov 1, 2020

Actually, this was already implemented well before this issue was opened...
See #1366

It was reverted just because it isn't compatible with the --ssl flag (which makes it unsuitable to add to the Dockerfile).

@modem7

modem7 commented Jan 10, 2021

Hey guys,

Just stumbled across this - was there any movement on the --healthcheck flag? I understand there were a few issues with the previous solution.

Thanks!

@Ornias1993

Maintainers are not interested, it seems, and don't even care enough to just say so.

@kwilliams1987

I would really like this feature too; it's a little odd that a platform designed for managing and monitoring your Docker containers doesn't include the option to monitor itself. 🤷‍♂️

@modem7

modem7 commented Jan 16, 2021

@hhromic were there any updates on your end?

@hhromic
Contributor

hhromic commented Jan 16, 2021

@modem7 , all,
Apologies, I've been really busy with work over the last few months, so I haven't had the time I wish I had to work on this.
If someone wants to step up, please do so; otherwise I will try to get back to this as soon as I can.

@deviantony
Member

Sorry for the silence on this one; we're interested in this feature, it's just that we have a lot of other things to deal with as well.

We've been giving it more thought and we're thinking about bringing support for this feature along with #821, which should work around the potential issues we've had so far with HTTP/HTTPS and the healthcheck.

We have #821 in our backlog at the moment and we'll start thinking about this one based on the existing implementations that have been provided by contributors.

@deviantony deviantony added this to the backlog milestone Jan 19, 2021
@deviantony deviantony removed this from the backlog milestone Mar 4, 2021
@urda

urda commented Apr 28, 2023

    healthcheck:
      test: "wget --no-verbose --tries=1 --spider --no-check-certificate https://localhost:9443 || exit 1"
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 20s

^ This approach is harmful. It will generate thousands of defunct (zombie) ssl_client processes on the host.

    healthcheck:
      test: "wget --no-verbose --tries=1 --spider http://localhost:9000 || exit 1"
      interval: 60s
      timeout: 5s
      retries: 3
      start_period: 20s

^ This is the correct approach, as it tests port 9000 over HTTP and thus doesn't produce the army of defunct processes. --no-check-certificate is not needed in this case because the test is made against the HTTP port.
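
If you want to verify this on your host, a quick way to count defunct processes (standard ps output; purely illustrative):

ps -eo stat,comm | awk '$1 ~ /^Z/' | sort | uniq -c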

For those just using a plain docker run, that might look something like this:

docker run \
-d \
--name portainer \
--restart always \
--health-cmd='wget --no-verbose --tries=1 --spider http://localhost:9000 || exit 1' \
--health-interval=60s \
--health-retries=3 \
--health-timeout=5s \
--health-start-period=20s \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /path/to/docker/portainer/data:/data \
-v /path/to/docker/portainer/ssl:/ssl \
portainer/portainer-ce:alpine \
--bind-https ":443" \
--sslcert /ssl/portainer.crt \
--sslkey /ssl/portainer.key

Where

--health-cmd='wget --no-verbose --tries=1 --spider http://localhost:9000 || exit 1' \
--health-interval=60s \
--health-retries=3 \
--health-timeout=5s \
--health-start-period=20s \

are the main health check settings.

@barndawgie

That doesn't seem to work since there is no shell or wget in the container, as far as I can tell:

~$ docker exec portainer-ce 'wget'
OCI runtime exec failed: exec failed: container_linux.go:367: starting container process caused: exec: "wget": executable file not found in $PATH: unknown

@urda

urda commented Apr 28, 2023

It does; make sure you're using the portainer/portainer-ce:alpine image, which has the required tools.

@t0mtaylor

t0mtaylor commented Jun 30, 2023

When I use the alpine images for both the Portainer UI and the agents in single node/local mode (Red Hat Linux), they include the sh shell, plus two preinstalled utilities we can use for health checks:

I'm also using version: '3.8' at the top of the docker compose file, and start_period: 30s has been added below - supported in compose file format 3.4 since Docker 17.09 - docker/cli#475

If you are running docker swarm mode, you'll have to set up a separate bash script for each server to check that the agent is running; if there is an issue you can then re-deploy the service so the agents restart.

Currently, when the healthcheck is enabled for an agent in docker swarm mode, it causes a DNS error within the agent container, meaning DNS resolution fails - unable to retrieve a list of IP associated to the host | error="lookup tasks.agent on 127.0.0.11:53: no such host" host=tasks.agent - https://github.com/portainer/agent/blob/45b383bc613bf9e64be8637c37a93201cf33db78/cmd/agent/main.go#L134. Ideally we need to be able to raise the sleep timeout there from 3 seconds to a bigger value via an env var, so we can try to make it work with the start_period of the docker healthcheck.

Ultimately this is an issue with the agent that Portainer should fix so that healthchecks can be enabled on the agents.

If you're running on a Raspberry Pi or a congested/busy swarm, the start_period may need increasing, same for the timeouts, etc. - have a play!

Portainer UI - using wget - works on local node and swarm mode

     image: portainer/portainer-ee:2.18.3-alpine
     healthcheck:
        test: "wget --no-verbose --tries=3 --spider http://localhost:9000/api/system/status || exit 1"
        interval: 60s
        timeout: 15s
        retries: 3
        start_period: 30s

Portainer Agents - using wget - only works on single node mode (not swarm)

  • now using wget instead of nc to reduce TLS handshake errors in the agent log output; it uses the /ping URI, but we are missing some security headers - it would be great if this worked without any auth headers required
  • added AGENT_CLUSTER_PROBE_TIMEOUT and AGENT_CLUSTER_PROBE_INTERVAL to improve performance on your node by reducing the frequency of checking the agent(s)
  • the hostname can also be forced on the agents in single node mode, see https://lucatnt.com/2021/11/fix-portainer-agent-restart-loop/
  • AGENT_CLUSTER_ADDR set to localhost for single mode only, not swarm mode - should be tasks.agent for swarm mode
     image: portainer/agent:2.18.3-alpine
     environment:
        # REQUIRED: Should be equal to the service name prefixed by "tasks." when
        # deployed inside an overlay network
        # Set AGENT_CLUSTER_ADDR to localhost for Single Node only, not Swarm mode!
        AGENT_CLUSTER_ADDR: localhost 
        # Performance tweaks
        AGENT_CLUSTER_PROBE_TIMEOUT: "2000ms"
        AGENT_CLUSTER_PROBE_INTERVAL: "3000ms"
        # AGENT_PORT: 9001
        # LOG_LEVEL: debug
     healthcheck:
        test: "wget --no-check-certificate --no-verbose --tries=3 --spider --header='Content-Type:application/json' http://127.0.0.1:9001/ping || exit 1"
        interval: 30s
        timeout: 10s
        retries: 3
        start_period: 30s

[screenshot: the healthchecks working]

No need for "too much" hackery! :)

  • I have updated this comment due to various issues with healthchecks on the agents in swarm mode

@Enissay

Enissay commented Jun 30, 2023

@t0mtaylor image names seems to be reversed... Please double check so I can test asap :-)

@sgtcoder

t0mtaylor

this definitely does not work for me

@modem7

modem7 commented Jun 30, 2023

t0mtaylor

this definitely does not work for me

Can you post your compose file so we can see what you're trying to do?

As wget etc 100% works on the alpine images.

@sgtcoder

  portainer-agent:
    image: portainer/agent:alpine
    ports:
      - 9001:9001/tcp
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    healthcheck:
      test: timeout 10 nc -z -v localhost 9001 || exit 1
      interval: 30s
      timeout: 10s
      retries: 3
    networks:
      - core_network
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]

@sgtcoder

When I run the command in the container, I get that it is open: localhost (127.0.0.1:9001) open

But for some reason, setting it as a healthcheck makes the container unreachable.

@t0mtaylor

t0mtaylor commented Jul 1, 2023

@t0mtaylor image names seems to be reversed... Please double check so I can test asap :-)

@Enissay Ahh, I see - have now fixed it, @sgtcoder 👍

@sgtcoder

sgtcoder commented Jul 1, 2023

@t0mtaylor image names seems to be reversed... Please double check so I can test asap :-)

@Enissay Not sure what your on about - they work fine for me 👍

He is saying that, in your example, you have the nc command for the Portainer UI and the wget command for your agent.

@t0mtaylor

t0mtaylor commented Jul 1, 2023

When I run the command in the container, I get that it is open: localhost (127.0.0.1:9001) open

But for some reason setting it as a healthcheck makes the container not connectable.

@sgtcoder see the updated comment - #3572 (comment) - I've added start_period: 30s so it has enough time to start the containers and register 🚀

Just make sure you're using a recent docker compose file version; I'm using version: '3.8' - the minimum you can use is 3.4.

Also added screenshot of it working to the main comment too 🕺

@t0mtaylor

Did that work for you, @sgtcoder, with the start_period?

@sgtcoder

sgtcoder commented Jul 1, 2023

@t0mtaylor I just booted my computer and am SSHing in to check now. Thank you for the updates. I will let you know.

@sgtcoder

sgtcoder commented Jul 1, 2023

It's strange because I am still getting "Environment is unreachable."

portainer-agent:
    image: portainer/agent:2.18.3-alpine
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    healthcheck:
      test: "timeout 10 nc -z -v localhost 9001 || exit 1"
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s
    networks:
      - core_network
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]
  portainer:
    image: portainer/portainer-ee:2.18.3-alpine
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=America/Phoenix
    command: -H tcp://tasks.portainer-agent:9001 --tlsskipverify
    volumes:
      - /mnt/storage/dockers/portainer:/data
    healthcheck:
      test: "wget --no-verbose --tries=3 --spider http://localhost:9000 || exit 1"
      interval: 60s
      timeout: 15s
      retries: 3
      start_period: 120s
    networks:
      - core_network
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]

I know the command works in the container itself. Literally no matter what healthcheck I put on the Portainer agent, it becomes unreachable.
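
For what it's worth, two standard Docker commands that help show why swarm keeps cycling or failing the task (the stack/service and container names below are placeholders - substitute your own):

docker service ps --no-trunc STACKNAME_portainer-agent
docker inspect --format '{{json .State.Health.Log}}' AGENT_CONTAINER_ID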

@t0mtaylor

The only difference with mine is that I have a separate network for the agents (which is defined for both the UI and the agents, plus a separate network for the UI only, which is accessible via the load balancer), but you are also missing this, set below your image declaration on the agent service:

    environment:
      # REQUIRED: Should be equal to the service name prefixed by "tasks." when
      # deployed inside an overlay network
      AGENT_CLUSTER_ADDR: tasks.portainer-agent
      # AGENT_PORT: 9001
      # LOG_LEVEL: debug

@sgtcoder

sgtcoder commented Jul 1, 2023

Thank you for that information. I will dig deeper. I did try the environment variable and still had the same issue. Definitely strange. And I never saw that environment line in the code sample Portainer provided us, since it's also run in the command section.

Per the Portainer swarm setup:

docker network create \
--driver overlay \
  portainer_agent_network

docker service create \
  --name portainer_agent \
  --network portainer_agent_network \
  -p 9001:9001/tcp \
  --mode global \
  --constraint 'node.platform.os == linux' \
  --mount type=bind,src=//var/run/docker.sock,dst=/var/run/docker.sock \
  --mount type=bind,src=//var/lib/docker/volumes,dst=/var/lib/docker/volumes \
  portainer/agent:2.18.3

@sgtcoder

sgtcoder commented Jul 1, 2023

unable to retrieve a list of IP associated to the host | error="lookup tasks.portainer-agent on 127.0.0.11:53: no such host"

@sgtcoder

sgtcoder commented Jul 1, 2023

#8578

AGENT_CLUSTER_ADDR: localhost

This seemed to work. For some reason the DNS doesn't resolve properly during the healthcheck.

@t0mtaylor

t0mtaylor commented Jul 1, 2023

#8578

AGENT_CLUSTER_ADDR: localhost

This seemed to work. For some reason the DNS doesn't resolve properly during the healthcheck.

After a while I had this issue on the agents - I think the agents got restarted but then couldn't start due to a DNS problem:

github.com/portainer/agent/cmd/agent/main.go:141 > unable to retrieve a list of IP associated to the host | error="lookup tasks.agent on 127.0.0.11:53: no such host" host=tasks.agent

And the UI reported this

{"time":1688195821,"message":"http: proxy error: dial tcp: lookup tasks.agent on 127.0.0.11:53: no such host"}

I tried something similar, but it doesn't work in a docker swarm, although for a single-node swarm or plain services it should be OK.

What I'm looking at now is how to trigger all the Portainer containers to restart if one of the agents fails the healthcheck, maybe with a separate Docker container monitoring them, or by updating the healthcheck to trigger the parent Docker host to relaunch the containers.

FYI - I've also updated the comment with a healthcheck API call for the UI, so you know it's up and running:

     image: portainer/portainer-ee:2.18.3-alpine
     healthcheck:
        test: "wget --no-verbose --tries=3 --spider http://localhost:9000/api/system/status || exit 1"

@lonix1

lonix1 commented Jul 1, 2023

With wget --spider there is no difference between checking http://localhost:9000/api/system/status and http://localhost:9000. In both cases one simply checks that there is a response. It's not a proper "healthcheck", but rather "proof of life". 😏

(Portainer still needs a proper HEALTHCHECK endpoint, preferably at the conventional endpoint of /api/healthz.)

@t0mtaylor

With wget --spider there is no difference between checking http://localhost:9000/api/system/status and http://localhost:9000. In both cases one simply checks that there is a response. It's not a proper "healthcheck", but rather "proof of life". 😏

(Portainer still needs a proper HEALTHCHECK endpoint, preferably at the conventional endpoint of /api/healthz.)

@lonix1 I prefer to call http://localhost:9000/api/system/status, as at least you know the API is up and running, instead of flooding the logs with 401 errors - this returns a nice 200, even though --spider just checks for a response :)

It's pretty much doing what a healthcheck endpoint does, just giving more info about the status 🚀

@lonix1

lonix1 commented Jul 1, 2023

@t0mtaylor I didn't consider the log. Good idea.

The response is this:

{
  "Version": "2.0.0",
  "demoEnvironment": {
    "enabled": true,
    "environments": [
      0
    ],
    "users": [
      1
    ]
  },
  "instanceID": "299ab403-70a8-4c05-92f7-bf7a994d50df"
}

So to be complete, in a script, I'd do something like this:

[ $(wget --quiet -O- --tries=1 http://localhost:9000/api/system/status | sed -nE 's/.*Version":"([^"]*)".*/\1/p' | wc -l) = 1 ] \
  && echo up || echo down

That not only checks that the page exists, but that it is returning expected data. I've extracted the Version arbitrarily - if that is found, then the API is up.

However in a compose file, I'd do something simpler:

healthcheck:
  # ...
  test: wget --no-verbose --tries=1 --spider http://localhost:9000/api/system/status || exit 1

@t0mtaylor

@lonix1 Yeah, I would keep it simple for the healthcheck, as it gives you enough to determine it's healthy.

I do something similar checking the version in a bash script, which checks that services are running every 5 minutes and also checks how many containers are running per service, as Docker can still be a bit flaky and services can vanish from the swarm!

I've updated the main comment (#3572 (comment)) as there's an issue with the healthcheck for agents when running in swarm mode - but running a single node on a Raspberry Pi, for example, both healthchecks for the UI and the agents work, as @sgtcoder has confirmed on his setup 👍

@sgtcoder

sgtcoder commented Jul 1, 2023

Thank you guys for all the updates. I applied a bunch of the suggestions. I still had to use localhost on a single swarm node, but it seems to work aside from the TLS handshake log errors. I had issues in general with using more than one swarm node and trying to replicate storage, with both performance problems and overhead, so I'm sticking with one node for now.

A start period of 5 seconds seems to be fine for me, running on a dedicated HPE DL380 Gen9 server with the Docker VM configured with 32 GB RAM and 32 vCPUs.

Here is what I have now

version: "3.8"
services:
  portainer-agent:
    image: portainer/agent:alpine
    environment:
      AGENT_CLUSTER_ADDR: localhost
      AGENT_CLUSTER_PROBE_TIMEOUT: 2000ms
      AGENT_CLUSTER_PROBE_INTERVAL: 3000ms
      #LOG_LEVEL: DEBUG
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    healthcheck:
      test: timeout 10 nc -z -v 127.0.0.1 9001 || exit 1
      start_period: 5s
      interval: 15s
      timeout: 5s
      retries: 5
    networks:
      - core_network
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]
  portainer:
    image: portainer/portainer-ee:alpine
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=America/Phoenix
    command: -H tcp://tasks.portainer-agent:9001 --tlsskipverify
    volumes:
      - /mnt/storage/dockers/portainer:/data
    healthcheck:
      test: wget --no-verbose --tries=3 --spider http://127.0.0.1:9000/api/system/status || exit 1
      start_period: 5s
      interval: 15s
      timeout: 5s
      retries: 5
    networks:
      - core_network
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]
networks:
  core_network:
    external: true

@t0mtaylor

t0mtaylor commented Jul 5, 2023

@sgtcoder try the wget for the agent healthcheck and that will remove the TLS handshake errors :)

  healthcheck:
        test: "wget --no-check-certificate --no-verbose --tries=3 --spider --header='Content-Type:application/json' http://127.0.0.1:9001/ping || exit 1"
        interval: 30s
        timeout: 10s
        retries: 3
        start_period: 30s

These healthchecks work OK for a single-node setup. Because of the way the agents do a DNS lookup, and because there's a hardcoded timeout/sleep, the healthcheck won't work on the agents in swarm mode, as the DNS doesn't resolve tasks.agent - I've detailed this in the main comment earlier (#3572 (comment)).

As a workaround, I have a separate bash script checking with docker that the agent containers are up and running on each server, and I've exposed port 9001 so I can wget that on each server as well - not ideal, but a way forward until @tamarahenson and team improve the agent. Ideally they would add an HTTP ping or a shell command we can use to verify the agent, once the DNS lookup issue at startup is fixed for swarm mode.
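
A minimal sketch of such a per-host watchdog script (run from cron on each node; the container name filter and service name are assumptions - adjust them to your stack):

#!/bin/bash
# Hypothetical watchdog for the Portainer agent on this swarm node.

# 1) is the agent container running here?
if ! docker ps --filter "name=agent" --filter "status=running" -q | grep -q .; then
  echo "portainer agent container not running on $(hostname)"
fi

# 2) is the published agent port answering? (bash /dev/tcp avoids needing nc/wget on the host)
if ! timeout 5 bash -c 'exec 3<>/dev/tcp/127.0.0.1/9001' 2>/dev/null; then
  echo "port 9001 unreachable on $(hostname), forcing a service redeploy"
  docker service update --force portainer_agent   # assumption: the agent service is named portainer_agent
fi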

@sgtcoder

sgtcoder commented Jul 5, 2023

I tried the wget again, but for whatever reason, that causes the check to fail, whereas the nc command works.

@t0mtaylor

@t0mtaylor

t0mtaylor commented Jul 7, 2023

@sgtcoder Have you tried the wget via sh in the container whilst the agent is running? What's the output? Does it have an error?

  1. Get an interactive shell in the container - replace CONTAINERID with the real one after running docker ps | grep portainer and spotting the agent id:
docker exec -it CONTAINERID sh

which returns a shell ready to use on the agent container:

/app #
  2. Run the wget command:
wget --no-check-certificate --no-verbose --tries=3 --spider --header='Content-Type:application/json' http://127.0.0.1:9001/ping

My output is this - it's an HTTP 400 error, but that's good, as it reached the agent on port 9001:

/app # wget --no-check-certificate --no-verbose --tries=3 --spider --header='Content-Type:application/json' http://127.0.0.1:9001/ping

Connecting to 127.0.0.1:9001 (127.0.0.1:9001)
wget: server returned error: HTTP/1.0 400 Bad Request

@portainer portainer locked and limited conversation to collaborators Jul 27, 2023
@jamescarppe jamescarppe converted this issue into discussion #9597 Jul 27, 2023

