
Failed Loading Environment error after rebooting manager node #8256

Open
joshbuker opened this issue Dec 31, 2022 · 18 comments

Comments

@joshbuker

Bug description
After rebooting the manager node in a 1-manager, 1-worker Docker Swarm setup, the Portainer service starts, but attempting to load the primary environment fails with an error:

Failed loading environment
Unable to find an agent on any manager node

This error appears in the Portainer web interface whether it is accessed via the worker node or the manager node.

Expected behavior
After a reboot, portainer boots normally and allows loading the environment.

Portainer Logs
No logs were provided.

Steps to reproduce the issue:

  1. Set up a Docker Swarm with 1 manager and 1 worker. The manager is Ubuntu Server in my case, while the worker is Pop!_OS.
  2. Verify that Portainer is working normally by visiting the web interface.
  3. Reboot the manager node (system reboot).
  4. Visit the web interface again and attempt to click on the primary environment.
  5. See the error.

Additional context
Discord thread has additional context if needed.

@tamarahenson

@joshbuker

Thank you for logging this request that was initially posted in our Discord channel. I am going to provide further updates here with my testing and forward to Product for review.

Thanks!

@joshbuker
Author

Per our Discord conversation, using sudo systemctl restart docker on the worker node is confirmed as a workaround for the issue. After rebooting the service, the error goes away and the GUI can be accessed again - without any changes on the manager node.

@tamarahenson

DOCKER SWARM REBOOT TESTING:

(1) Manager, (1+) Worker

[1] Install Swarm Portainer Server on Manager:
curl -L https://downloads.portainer.io/ee2-16/portainer-agent-stack.yml -o portainer-agent-stack.yml
docker stack deploy -c portainer-agent-stack.yml portainer

[2] Reboot Manager Node
[3] Navigate UI
[4] Wait 2 Minutes
https://user-images.githubusercontent.com/20426210/210469483-0e6c40c9-3325-4601-8fcf-3c11100d9b33.mov
[5] Toast appears:

Failed loading environment
Unable to find an agent on any manager node

[6] Home >> Environment:primary >> Node indicator is no longer appearing
[Screenshot: 2023-01-02 at 4:16:08 PM]

WORKAROUND:
[1] sudo systemctl restart docker on worker node
[2] Navigate UI on Manager
[3] Wait 1 minute.
[4] Home >> Environment:primary >> Node indicator is appearing with node counts
[Screenshot: 2023-01-02 at 4:21:29 PM]
[5] Toast no longer appears

Note: this workaround also works on clusters with multiple worker nodes. You only need to restart Docker on one worker node.

ADDITIONAL TESTING:

(2) Manager, (1) Worker
Notes: Issue does not occur.

TODO:
Need to determine if this is Docker Swarm behavior or a bug. Updates to follow.

Thanks!

@tamarahenson

Update:

Upon further investigation, I have found a better workaround.

Run the following on the Manager node (image name may need to be edited):

sudo docker service update --image portainer/agent:2.16.2 --force portainer_agent

Thanks!
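A variant of that command which reads the agent's current image from the service itself, so the tag never has to be edited by hand, might look like this sketch (not an official Portainer script; it assumes the service is named portainer_agent):

```shell
# Sketch: read the image currently configured for the agent service and
# force a redeploy with that same image. Assumes the service is named
# portainer_agent; fails gracefully if it is absent.
IMAGE=$(docker service inspect \
  --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}' \
  portainer_agent 2>/dev/null || true)

if [ -n "$IMAGE" ]; then
  # Strip any pinned @sha256:... digest so the tag is re-resolved on pull
  docker service update --image "${IMAGE%@*}" --force portainer_agent
else
  echo "service portainer_agent not found" >&2
fi
```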

@vampywiz17

Same problem here, and the workaround works... but it's really annoying that this issue has existed for more than a year...

@CeamKrier

Per our Discord conversation, using sudo systemctl restart docker on the worker node is confirmed as a workaround for the issue. After rebooting the service, the error goes away and the GUI can be accessed again - without any changes on the manager node.

This resolved my issue

@bdoublet91

bdoublet91 commented Apr 3, 2024

Hi,
I've been using Portainer in a production Swarm environment for 4 years. I have always seen this issue, and the only workaround is to restart portainer_agent on all nodes... quite annoying. And it comes back again a few days later.
I think Portainer doesn't handle network latency and reconnections well. I also can't use Portainer in a multi-cloud Swarm environment with 5 ms latency, because the agent nodes time out half the time...
I use the latest version, 2.19.4.

So for now I restart Portainer via a cron job at 23:00 every day to limit this issue...
Please give us a fix for that.

thanks
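A nightly restart like the one described above could be set up with a cron fragment along these lines (hypothetical sketch; it assumes the stack is named portainer and the entry runs on a manager node):

```shell
# Write a hypothetical /etc/cron.d fragment that force-redeploys the
# Portainer server and agent services every night at 23:00 as a stopgap.
# Service names assume the stack was deployed as "portainer".
cat <<'EOF' > /tmp/portainer-restart.cron
0 23 * * * root docker service update --force portainer_portainer
0 23 * * * root docker service update --force portainer_agent
EOF

# Install with: sudo mv /tmp/portainer-restart.cron /etc/cron.d/portainer-restart
```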

@BloodBlight

This has been an issue for years now. Has any progress been made? :/

We have written a cron job that looks for the "unable to redirect request to a manager node: no manager node found" error and then kicks off a service update, but it is not really a great solution...
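Such a watchdog might look roughly like this sketch (not the poster's actual script; service names assume the default portainer stack):

```shell
# Hypothetical watchdog sketch: scan recent Portainer service logs for the
# known error and, if it appears, force-redeploy the agent service.
PATTERN='unable to redirect request to a manager node: no manager node found'

logs=$(docker service logs --since 10m portainer_portainer 2>&1 || true)

case "$logs" in
  *"$PATTERN"*)
    echo "error detected, forcing agent redeploy"
    docker service update --force portainer_agent
    ;;
  *)
    : # nothing to do
    ;;
esac
```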

@whysi

whysi commented May 2, 2024

We are in the same boat: this happens in every environment that we have (we are a Portainer BE customer) after we reboot any node in a Swarm cluster, not necessarily a manager node. This is quite frustrating, and we have had it since the beginning of our experience with Portainer.
We tried to configure a systemd unit that restarts the portainer_agent service on one manager of each Swarm cluster 40 minutes after the node boots up (that's roughly the time it takes to update one of our clusters, plus 20 minutes of margin), but it is not a real solution. For example, if we have to restart any worker node for unplanned maintenance, then we also have to remember to manually restart the portainer_agent service for that Swarm cluster.

@tamarahenson Any plan to fix this? The workaround is not a real solution, because we have different reboot/update policies across our Swarm clusters. If this is not on the roadmap, we will open a support case on the customer portal, because we are getting many complaints from our end users, who perceive Portainer as unstable and broken every time they hit these errors. I know the workaround is a quick fix, but at that point the user has already experienced the error.

@Kegelcizer

Happened again today after restarting two worker nodes (Azure VMs).
The 1st VM took a while to come back and did not affect anything.
The 2nd VM had a fast reboot and threw the same "portainer unable to find an agent on any manager node" message in the Portainer web UI.

Fixed it with: docker service update --image portainer/agent:2.xx.x --force portainer_agent

@D0wn3r

D0wn3r commented Jun 26, 2024

I am experiencing the same problem after restarting my entire Swarm cluster. Portainer and its agents do not start properly, resulting in the following error message:

Unable to find an agent on any manager node

I'm on Portainer v1.19.5 (the LTS version).
Running a docker service update temporarily fixes the problem, but the issue reoccurs after each reboot.

This is an ongoing issue. Can we have a permanent fix, please?

@lambdan

lambdan commented Sep 9, 2024

Make sure the agent container running on a manager is called exactly agent (expanded to portainer_agent if your stack is named portainer). I just wasted a lot of time debugging this because I had renamed it. (It's OK for it to be called something else on non-manager nodes.)

@rkj

rkj commented Sep 9, 2024

What if it's deployed through Docker Swarm? Following this guide: https://docs.portainer.io/start/install-ce/server/swarm/linux, the https://downloads.portainer.io/ce2-21/portainer-agent-stack.yml stack file creates portainer_agent.randomstring task names on the nodes.

@BloodBlight

I am in the same state as @rkj. It is a swarm deployment using the same template.

@lambdan

lambdan commented Sep 9, 2024

what if it's deployed through docker swarm? Following this guide: https://docs.portainer.io/start/install-ce/server/swarm/linux, the https://downloads.portainer.io/ce2-21/portainer-agent-stack.yml creates portainer_agent.randomstring names on nodes.

I use the same compose file and also use Docker Swarm, so I guess the random characters at the end are fine.
In my case, the problem was that I had created a duplicate of the agent service in that compose.yml, with the volumes folder mapped elsewhere (for persistence's sake), and I called this new service agent_persistent. Coincidentally, all my managers have this persistent mapping, so they need this version of the service. And this didn't work.

Then I finally tried restoring the old name for the manager variant, and instead made the duplicate the one with the regular volume mapping, so non-managers get that, and it works.

TLDR:

Doesn't work ("cannot find a manager on any node"):

agent_persistent:
  image: portainer/agent:2.21.0
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - /persistent/docker/volumes:/var/lib/docker/volumes
  networks:
    - agent_network
  deploy:
    mode: global
    placement:
      constraints:
        - node.role == manager

agent:
  image: portainer/agent:2.21.0
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - /var/lib/docker/volumes:/var/lib/docker/volumes
  networks:
    - agent_network
  deploy:
    mode: global
    placement:
      constraints:
        - node.role != manager

Works:

agent:
  image: portainer/agent:2.21.0
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - /persistent/docker/volumes:/var/lib/docker/volumes
  networks:
    - agent_network
  deploy:
    mode: global
    placement:
      constraints:
        - node.role == manager

agent_worker:
  image: portainer/agent:2.21.0
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - /var/lib/docker/volumes:/var/lib/docker/volumes
  networks:
    - agent_network
  deploy:
    mode: global
    placement:
      constraints:
        - node.role != manager

@rkj

rkj commented Sep 9, 2024

OK, this is pretty silly: the agent was missing the 9001 port exposure. I added that and it seems to work now.
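A quick way to verify that the agent's port is reachable from another node is a plain TCP probe (a sketch; 9001 is the agent's default listen port, and NODE_IP is a placeholder for a real node address):

```shell
# Returns 0 if a TCP connection to host $1, port $2 succeeds within 3 s.
# Uses bash's /dev/tcp pseudo-device, so no extra tools are required.
check_agent_port() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Usage (NODE_IP is a placeholder):
# check_agent_port NODE_IP 9001 && echo "agent port reachable"
```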

@BloodBlight

@rkj, are we talking about the same issue? It only breaks after a reboot, not during the initial deployment?

@rkj
Copy link

rkj commented Sep 9, 2024

Maybe it's different.
It used to work, but my server rebooted two nights ago and it stopped working, which brought me to this bug. Then I realized my Portainer was still running on a single node instead of the swarm, so I moved it there, and it didn't help, so I'm a bit lost. In either case, exposing the ports does fix the issue for me, which I guess is odd, as portainer and the portainer_agent tasks should be on the same overlay network and should not have to expose ports to the hosts.
