Harden the agent to be reliable against node reboot, crash, shutdown etc. #114
Comments
I was able to break the Portainer agent, with the UI showing the swarm in a "down" state, failing to load, and erroring on every page, by quickly draining 2 of 3 nodes in a 3-node, 3-manager swarm. One of the drained nodes was the leader. The problem went away after restarting the agent on the remaining node.
Hi there, it seems to me that demoted managers are still treated as managers, and vice versa. The UI gets errors such as "cannot retrieve tasks, services, etc." in a config like 1 manager + 1 worker. I suppose the agents behind the scenes are not updated appropriately.
Are you running the agent globally? That's different from Portainer itself, which runs on a manager node. Unless there's something going wrong with your overlay network...
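For reference, "globally" here means one agent replica per node via a swarm service in global mode. A minimal sketch of such a deployment (the service and network names are illustrative, not taken from this thread):

```shell
# Create an overlay network for agent-to-agent communication,
# then deploy the agent as a global service so every node runs one replica.
docker network create --driver overlay portainer_agent_network

docker service create \
  --name portainer_agent \
  --network portainer_agent_network \
  --mode global \
  --mount type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock \
  portainer/agent
```

With `--mode global`, swarm automatically schedules a new agent task on any node that joins, which is why a per-node agent is the expected setup for this feature.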
Yeah.
And I do not encounter this problem with two manager nodes; it only happens with 1 manager and 1 worker. If I leave auto-refresh enabled in the UI, it shows a list, then a red error flash, then a list, then a red error flash, and so on.
This continues to be an issue for us with Docker swarm mode. The agent keeps the old IP when a node is rebooted, whether worker or manager. We run 3 managers and 3 workers. The issue is exacerbated by running in a cloud environment (AWS) with ephemeral private IPs; it would likely never surface if the nodes had statically assigned IPs.
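One crude way to spot the stale-IP symptom from inside the overlay network is to resolve the service's task DNS name (swarm exposes current task IPs under `tasks.<service-name>`) and diff it against the member list the agent last saw. A sketch, assuming the agent service is named `portainer_agent`:

```shell
# Print IPs that appear in an old member list but not in the freshly
# resolved one -- candidates for stale cluster-member entries.
stale_ips() {
  old_list="$1"; new_list="$2"
  for ip in $old_list; do
    case " $new_list " in
      *" $ip "*) ;;            # still resolvable, fine
      *) echo "$ip" ;;         # gone: likely a stale member entry
    esac
  done
}

# Inside the overlay network, the live task IPs could be fetched with
# (service name 'portainer_agent' is an assumption):
#   new_list=$(getent hosts tasks.portainer_agent | awk '{print $1}' | tr '\n' ' ')
stale_ips "10.0.1.2 10.0.1.3 10.0.1.4" "10.0.1.3 10.0.1.4"
```

Any IP printed by `stale_ips` is one the agent may still be trying to contact even though no task answers there anymore, which matches the behavior described after a node reboot.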
Currently the agent is unreliable in certain situations and sometimes needs to be force-updated, or removed and re-deployed, when a node is rebooted, crashes, is drained for maintenance, or is under heavy load. This can also occur when the Docker daemon is restarted.
When these issues occur, the endpoint can show as down, or you might see an error when browsing different views in Portainer, such as:
Failure: could not retrieve images
The agent should be hardened so that it handles these situations reliably.
Additional info:
The symptoms are discussed extensively on the Portainer repo, but I have moved the topic here as a feature request.