Agent does not automatically recover from network failure #2065

Closed
bradrydzewski opened this issue Jun 6, 2017 · 10 comments

@bradrydzewski

bradrydzewski commented Jun 6, 2017

This issue needs more research, but is here for tracking purposes.

Under certain network conditions (usually when a reverse proxy or load balancer is involved), a network connection can terminate without being properly closed, so the agent never learns that it is dead. For this we should consider a heartbeat or timeout. cc @patrickjahns

Some symptoms for this issue could include:

  • Builds not being pulled from queue
  • Build status not being updated with the server

I will be locking this issue because there is nothing really more to say here. If one needs to discuss this issue please ping me in the main gitter channel.

@harness harness locked and limited conversation to collaborators Jun 21, 2017
@bradrydzewski bradrydzewski changed the title JSONRPC heartbeat or ping pong Agent does not automatically recover from network failure Jun 28, 2017
@bradrydzewski bradrydzewski added this to In Progress in Version 0.8 Jun 28, 2017
@bradrydzewski bradrydzewski moved this from In Progress to Done in Version 0.8 Jun 30, 2017
@harness harness unlocked this conversation Jun 30, 2017
@bradrydzewski
Author

With #2093 we now use grpc, which reconnects automatically, sends heartbeats, and retries when network calls fail. So any errors related to network problems are now solved.

There have been mentions of issues with old Docker versions where streaming logs can freeze and result in the build logs not being fully uploaded, and the build sitting in pending state. I checked the docker (moby) issue tracker and it appears to be fixed in Docker 17.03 and higher.

So I feel like this issue is resolved, but we can re-open if we notice any problems with grpc or with newer versions of Docker.

@tonglil

tonglil commented Aug 10, 2017

I experienced this issue with:

/ # docker version
Client:
 Version:      1.12.6
 API version:  1.23
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Wed Jan 11 00:23:16 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Wed Jan 11 00:23:16 2017
 OS/Arch:      linux/amd64

Can confirm that this is no longer an issue with:

Client:
 Version:      17.06.0-ce
 API version:  1.23
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:15:15 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:51:55 2017
 OS/Arch:      linux/amd64
 Experimental: false

@delfer

delfer commented Mar 15, 2018

@bradrydzewski I still have this problem with image drone/agent:0.8 and

Client:
 Version:      17.10.0-ce
 API version:  1.33
 Go version:   go1.8.3
 Git commit:   f4ffd25
 Built:        Tue Oct 17 18:59:38 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.10.0-ce
 API version:  1.33 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   f4ffd25
 Built:        Tue Oct 17 19:05:23 2017
 OS/Arch:      linux/amd64
 Experimental: true

Server and agents are deployed to k8s. When the node running the drone server restarts, the agents never reconnect to the server until the agent container is restarted (recreated).

@genisd

genisd commented May 3, 2018

We're also still encountering issues like this (drone server on k8s, agents on external VMs), but it's hard to pinpoint the exact cause. Drone server and agent are on 0.8.5 with keepalive enabled.

Ubuntu 16.04 with

Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:17:20 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:15:30 2018
  OS/Arch:      linux/amd64
  Experimental: false

To be honest, I'm not even sure there's a transport issue.

We simply have a job hanging in pending, and it stays stuck there.
Newer jobs get picked up, but that one job is simply not doing anything and remains stuck in pending.

@jacksgt

jacksgt commented Jul 6, 2018

I also experience this issue. After a while without any builds (a day or so?), the agents "get stuck" (the Web-UI just shows the ringing alarm-clock) and I have to restart them. Afterwards they pick up the job just fine.

I'm running the Drone server in a Docker swarm cluster (drone/drone:0.8.5) and the agents as a separate service (drone/agent:0.8.5), but nevertheless both are running on the same host, which makes the connection loss even weirder.

Neither the server nor the agents emit any relevant log messages.

@jacksgt

jacksgt commented Jul 6, 2018

Should we reopen this issue or create a new one?

@bradrydzewski
Author

bradrydzewski commented Jul 6, 2018

> Should we reopen this issue or create a new one?

Please read / use the following thread: https://www.reddit.com/r/droneci/comments/8opifu/drone_stops_working_after_some_little_time/

@mechtron

I run Drone in ECS behind Traefik. Traefik, like many other edge routers, does HTTP/S routing only. This change to GRPC breaks our Drone environment and we are now forced to use Drone < 0.8.

Of course, this problem isn't Drone's fault - I just wanted to highlight this new limitation of running Drone agent clusters with Traefik.

@bradrydzewski
Author

grpc has been a nightmare to support and starting with 0.9 (next drone release) the default transport will be simple rest over http. The implementation is already complete and in the testing phase.

@tboerger

Besides that, traefik got grpc support.

bot2-harness pushed a commit that referenced this issue May 24, 2024