Agent does not automatically recover from network failure #2065

bradrydzewski · 2017-06-06T08:58:56Z

This issue needs more research, but is here for tracking purposes.

It may be possible under certain network conditions (usually when a reverse proxy or load balancer is involved) for network connections to terminate without being properly closed. To handle this we should consider a heartbeat or timeout. cc @patrickjahns
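To illustrate the heartbeat-or-timeout idea, here is a minimal Go sketch using only the standard library. It treats a connection as dead when no heartbeat byte arrives before a read deadline. This is not Drone's actual code: `awaitHeartbeat`, the one-byte heartbeat, and the use of `net.Pipe` as a stand-in for a real agent connection are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// awaitHeartbeat waits for a single heartbeat byte from the peer.
// If nothing arrives within timeout, Read fails with a timeout error,
// which the caller can treat as "connection is dead, close and redial".
func awaitHeartbeat(conn net.Conn, timeout time.Duration) error {
	if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	buf := make([]byte, 1)
	_, err := conn.Read(buf)
	return err
}

// isTimeout reports whether err is a network timeout.
func isTimeout(err error) bool {
	var ne net.Error
	return errors.As(err, &ne) && ne.Timeout()
}

// silentPeerTimesOut simulates a connection that was dropped without
// being closed: the peer never writes, so the deadline expires.
func silentPeerTimesOut() bool {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()
	return isTimeout(awaitHeartbeat(client, 50*time.Millisecond))
}

// livePeerIsDetected simulates a healthy peer that sends one heartbeat.
func livePeerIsDetected() bool {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()
	go server.Write([]byte{1}) // hypothetical one-byte heartbeat
	return awaitHeartbeat(client, time.Second) == nil
}

func main() {
	fmt.Println("silent peer reported as timed out:", silentPeerTimesOut())
	fmt.Println("live peer reported as alive:", livePeerIsDetected())
}
```

The point of the deadline is that a half-open connection produces no error on its own; without a bound on how long the reader waits, the agent would block indefinitely.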

Some symptoms for this issue could include:

- Builds not being pulled from the queue
- Build status not being updated on the server

I will be locking this issue because there is nothing really more to say here. If one needs to discuss this issue please ping me in the main gitter channel.

bradrydzewski · 2017-06-30T02:09:15Z

With #2093 we now use gRPC, which has automated reconnect with heartbeat and will retry when network calls fail. So any errors related to network problems are now solved.
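For readers unfamiliar with the retry behaviour mentioned above, the following Go sketch shows the general pattern of retrying a failed call with a growing delay. It is a generic illustration, not gRPC's implementation and not code from #2093; `retry`, `flakyDial`, and `attemptsNeeded` are hypothetical names, and the delays are arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retry calls fn until it succeeds or attempts are used up. The wait
// between attempts starts at base and doubles each time, capped at max.
// It returns the number of calls made and the last error (nil on success).
func retry(attempts int, base, max time.Duration, fn func() error) (int, error) {
	var err error
	delay := base
	for i := 1; i <= attempts; i++ {
		if err = fn(); err == nil {
			return i, nil
		}
		if i == attempts {
			break // no point sleeping after the final attempt
		}
		time.Sleep(delay)
		delay *= 2
		if delay > max {
			delay = max
		}
	}
	return attempts, err
}

// flakyDial is a hypothetical stand-in for a network call: it fails
// the first `failures` times it is invoked and succeeds afterwards.
func flakyDial(failures int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errors.New("connection refused")
		}
		return nil
	}
}

// attemptsNeeded runs retry against flakyDial and reports how many
// calls were made and whether the call eventually succeeded.
func attemptsNeeded(failures, limit int) (int, bool) {
	n, err := retry(limit, time.Millisecond, 8*time.Millisecond, flakyDial(failures))
	return n, err == nil
}

func main() {
	n, ok := attemptsNeeded(3, 5)
	fmt.Println("calls made:", n, "succeeded:", ok)
}
```

Capping the delay keeps a long outage from pushing the next attempt arbitrarily far into the future.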

There have been mentions of issues with old Docker versions where streaming logs can freeze and result in the build logs not being fully uploaded, and the build sitting in pending state. I checked the docker (moby) issue tracker and it appears to be fixed in Docker 17.03 and higher.

So I feel like this issue is resolved, but we can re-open if we notice any problems with grpc or with newer versions of Docker.

tonglil · 2017-08-10T21:10:43Z

I experienced this issue with:

/ # docker version
Client:
 Version:      1.12.6
 API version:  1.23
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Wed Jan 11 00:23:16 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Wed Jan 11 00:23:16 2017
 OS/Arch:      linux/amd64

Can confirm that this is no longer an issue with:

Client:
 Version:      17.06.0-ce
 API version:  1.23
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:15:15 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:51:55 2017
 OS/Arch:      linux/amd64
 Experimental: false

delfer · 2018-03-15T11:16:19Z

@bradrydzewski I still have this problem with image drone/agent:0.8 and

Client:
 Version:      17.10.0-ce
 API version:  1.33
 Go version:   go1.8.3
 Git commit:   f4ffd25
 Built:        Tue Oct 17 18:59:38 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.10.0-ce
 API version:  1.33 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   f4ffd25
 Built:        Tue Oct 17 19:05:23 2017
 OS/Arch:      linux/amd64
 Experimental: true

Server and agents are deployed to k8s. When the node running the Drone server restarts, the agents never reconnect to the server until the agent container is restarted (recreated).

genisd · 2018-05-03T12:10:59Z

We're also still encountering issues like this (Drone server on k8s, agents are external VMs), but it's hard to pinpoint the exact cause. Drone server and agent are on 0.8.5 with keepalive enabled.

Ubuntu 16.04 with

Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:17:20 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:15:30 2018
  OS/Arch:      linux/amd64
  Experimental: false

To be honest, I'm not even sure there's a transport issue.

We simply have a job hanging in pending that is stuck there.
Newer jobs get picked up, but that one job is simply not doing anything and stays stuck in pending.

jacksgt · 2018-07-06T08:52:46Z

I also experience this issue. After a while without any builds (a day or so?), the agents "get stuck" (the Web-UI just shows the ringing alarm-clock) and I have to restart them. Afterwards they pick up the job just fine.

I'm running the Drone server in a Docker swarm cluster (drone/drone:0.8.5) and the agents as a separate service (drone/agent:0.8.5), but nevertheless both are running on the same host, which makes the connection loss even weirder.

Neither the server nor the agents emit any relevant log messages.

jacksgt · 2018-07-06T08:52:59Z

Should we reopen this issue or create a new one?

bradrydzewski · 2018-07-06T14:41:14Z

> Should we reopen this issue or create a new one?

Please read / use the following thread: https://www.reddit.com/r/droneci/comments/8opifu/drone_stops_working_after_some_little_time/

mechtron · 2018-09-12T02:42:52Z

I run Drone in ECS behind Traefik. Traefik, like many other edge routers, does HTTP/S routing only. This change to gRPC breaks our Drone environment and we are now forced to use Drone < 0.8.

Of course, this problem isn't Drone's fault - I just wanted to highlight this new limitation of running Drone agent clusters with Traefik.

bradrydzewski · 2018-09-12T02:58:42Z

gRPC has been a nightmare to support, and starting with 0.9 (the next Drone release) the default transport will be simple REST over HTTP. The implementation is already complete and in the testing phase.
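As a rough picture of why plain HTTP can be easier to operate behind proxies, the Go sketch below polls a server with a client-side timeout, so a stalled connection surfaces as an error instead of hanging. It is not the 0.9 implementation: the `/queue/next` path, the `build-42` payload, and all function names are assumptions invented for the example, and the server is a throwaway `httptest` instance.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// pollOnce asks the server for work. The client's Timeout bounds the
// whole request, including reading the response body.
func pollOnce(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// demo starts a local test server that answers after serverDelay and
// polls it once with the given client timeout.
func demo(serverDelay, clientTimeout time.Duration) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(serverDelay):
			fmt.Fprint(w, "build-42") // hypothetical job identifier
		case <-r.Context().Done(): // client gave up; stop waiting
		}
	}))
	defer srv.Close()
	client := &http.Client{Timeout: clientTimeout}
	return pollOnce(client, srv.URL+"/queue/next") // hypothetical path
}

// fastServerReturnsJob reports whether a prompt server yields the job.
func fastServerReturnsJob() bool {
	job, err := demo(0, time.Second)
	return err == nil && job == "build-42"
}

// stalledServerTimesOut reports whether a slow server makes the poll
// return an error rather than block.
func stalledServerTimesOut() bool {
	_, err := demo(500*time.Millisecond, 50*time.Millisecond)
	return err != nil
}

func main() {
	fmt.Println("fast server returns job:", fastServerReturnsJob())
	fmt.Println("stalled server returns an error:", stalledServerTimesOut())
}
```

Because each poll is an independent, bounded request, a failed one can simply be repeated, and any HTTP-only router in between needs no special protocol support.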

tboerger · 2018-09-12T08:45:04Z

Besides that, Traefik now has gRPC support.

harness locked and limited conversation to collaborators Jun 21, 2017

bradrydzewski mentioned this issue Jun 26, 2017

Build status not updated, but build is complete #2090

Closed

bradrydzewski changed the title ~~JSONRPC heartbeat or ping pong~~ Agent does not automatically recover from network failure Jun 28, 2017

bradrydzewski added this to In Progress in Version 0.8 Jun 28, 2017

bradrydzewski mentioned this issue Jun 28, 2017

Use grpc for agent <> server protocol #2093

Merged

bradrydzewski moved this from In Progress to Done in Version 0.8 Jun 30, 2017

harness unlocked this conversation Jun 30, 2017

bradrydzewski closed this as completed Jun 30, 2017

ozbillwang mentioned this issue Aug 17, 2017

Feature: Allow drone to spin-up remote agents #1052

Closed

bot2-harness pushed a commit that referenced this issue May 24, 2024

Fix DiffCut For New File (#2065)

43897cd
