Agent does not automatically recover from network failure #2065
Comments
With #2093 we now use grpc, which has automated reconnect with heartbeat and will retry when network calls fail. So any errors related to network problems should now be solved. There have been mentions of issues with old Docker versions where streaming logs can freeze, resulting in the build logs not being fully uploaded and the build sitting in pending state. I checked the docker (moby) issue tracker and it appears to be fixed in Docker 17.03 and higher. So I feel like this issue is resolved, but we can re-open if we notice any problems with grpc or with newer versions of Docker.
I experienced this issue with:
Can confirm that this is no longer an issue with:
@bradrydzewski I still have this problem with image
Server & agents are deployed to k8s. When the node running the drone server restarts, the agents never connect back to the server until the agent container is restarted (recreated).
We're also still encountering issues like this (drone server on k8s, agents are external VMs). But it's hard to pinpoint the exact cause. Drone server and agent are on 0.8.5 with keepalive enabled
To be honest, I'm not even sure there's a transport issue. We simply have a job hanging in pending, and it stays stuck there.
I also experience this issue. After a while without any builds (a day or so?), the agents "get stuck" (the web UI just shows the ringing alarm clock) and I have to restart them. Afterwards they pick up the job just fine. I'm running the Drone server in a Docker swarm cluster ( Neither the server nor the agents emit any relevant log messages.
Should we reopen this issue or create a new one?
Please read / use the following thread: https://www.reddit.com/r/droneci/comments/8opifu/drone_stops_working_after_some_little_time/
I run Drone in ECS behind Traefik. Traefik, like many other edge routers, does HTTP/S routing only. This change to gRPC breaks our Drone environment and we are now forced to use Drone < 0.8. Of course, this problem isn't Drone's fault. I just wanted to highlight this new limitation of running Drone agent clusters with Traefik.
grpc has been a nightmare to support, and starting with 0.9 (the next drone release) the default transport will be simple REST over HTTP. The implementation is already complete and in the testing phase.
Besides that, Traefik now has gRPC support.
This issue needs more research, but is here for tracking purposes.
It may be possible under certain network conditions (usually when a reverse proxy or load balancer is involved) for network connections to terminate without being properly closed, leaving the agent waiting on a connection that is already dead. For this we should consider a heartbeat or timeout. cc @patrickjahns
Some symptoms for this issue could include:
I will be locking this issue because there is nothing really more to say here. If one needs to discuss this issue please ping me in the main gitter channel.