
Build status not updated, but build is complete #2090

Closed

zaa opened this issue Jun 26, 2017 · 20 comments

Comments

@zaa

zaa commented Jun 26, 2017

Hello,

I've got a fun issue with drone 0.7. All jobs of a build completed successfully, but the build was still shown as running. I clicked the "Cancel" button on the build page, and the build info page showed that the build was killed (the build information on that page was rendered in red).
However, in the list of builds on the left and in the output of "drone build info", the build is still shown as running.

$ drone build info repo/project 20
Number: 20
Status: running
Event: push
...

And no new builds can start (they are waiting in the pending state).

I see the following logs in the agent:

pipeline: finish uploading logs: 540: step smoke-test
pipeline: ping queue: 540
pipeline: execution complete: 540
pipeline: ping queue: 540
rpc: error making call: jsonrpc2: code 0 message: queue: task cancelled
pipeline: cancel signal received: 540: jsonrpc2: code 0 message: queue: task cancelled
pipeline: cancel ping loop: 540

I tried restarting the drone server and agent; the pending build started running, but the stuck build is still shown as "running".

Thank you.

@zaa
Author

zaa commented Jun 26, 2017

$ drone build stop repo/project 20
client error 400: Cannot cancel a non-running build

@bradrydzewski

duplicate of #2065

@zaa
Author

zaa commented Jun 26, 2017

@bradrydzewski Oh, I see that you are aware of the issue.
Could you please provide a bit more context?
What should we expect? Is the issue being addressed, or are there any plans to address it?
Can we help in any way?

Thank you.

@bradrydzewski

@zaa yes, I plan to resolve this in the upcoming 0.8 release (no planned release date). I am currently evaluating grpc, which would give us heartbeat, reconnect, retry with backoff, and more. I first need to understand the implications of such a decision, since grpc uses http2 and could complicate installation (nginx cannot proxy http2/grpc to upstream servers, for example).

@mrueg

mrueg commented Jun 26, 2017

@bradrydzewski would you consider adding a "build reset" feature? That might help mitigate network-related and similar issues.

@bradrydzewski

@mrueg I would rather see the time and effort go into fixing the root cause. The existing code could be improved by implementing websocket ping/pong to keep the connection alive, and then slightly tweaking the retry logic here:
https://github.com/cncd/pipeline/blob/master/pipeline/rpc/client.go#L139
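
For illustration, a minimal sketch of that kind of websocket keepalive, using the gorilla/websocket package (the package choice, function names, and intervals here are assumptions for the sketch, not the actual cncd/pipeline code):

package keepalive

import (
	"log"
	"time"

	"github.com/gorilla/websocket"
)

// keepAlive sends periodic websocket pings and extends the read deadline
// whenever a pong comes back, so a silently dropped connection is detected
// instead of hanging forever. (Pongs are delivered while the application
// is reading from the connection.)
func keepAlive(conn *websocket.Conn, interval time.Duration) {
	conn.SetPongHandler(func(string) error {
		return conn.SetReadDeadline(time.Now().Add(2 * interval))
	})

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		deadline := time.Now().Add(interval)
		if err := conn.WriteControl(websocket.PingMessage, nil, deadline); err != nil {
			log.Printf("keepalive: ping failed, reconnecting: %v", err)
			return
		}
	}
}

The reconnect loop linked above could then back off between attempts instead of retrying immediately.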

@bradrydzewski

bradrydzewski commented Jul 16, 2017

copy of the gitter conversation where @zaa tracks the possible root cause down to the docker logs endpoint freezing.

@zaa Regarding #2090. I added a bunch of logs to the agent and found out the following. In our case the build as a whole was not progressing to "success" state (client.Done call in the agent) because it was stuck on the uploads.Wait() call.

@zaa The call was waiting for a step to send its logs to the server (the io.Copy(logstream, limitedPart) part).
Even though the step was completed (I can see the last line of its output), for some reason the call was still waiting for data to copy over. But the container was already stopped.

@zaa In our environment (kubernetes + docker 1.12.4 on debian jessie) it takes several build restarts to reproduce the issue, but it is quite persistent.

@bradrydzewski the fact that a container is stopped should not be a problem. You can run docker logs -f on a stopped container and it will print its output and close the stream at EOF

@bradrydzewski did you look at the docker daemon logs for errors?

@bradrydzewski also, is it possible there is a bug in older docker daemons? I believe kubernetes environments use 1.12 or 1.13? I use 17.05 and have not been able to reproduce

@bradrydzewski maybe moby/moby#30135. See this excerpt:

this issue was fixed and the fix is included in docker 17.03, so you'll have to upgrade to that version to get the fix. If you're still able to reproduce on 17.03, please open a new issue with details.

Note that you can identify this issue in 0.8 if you see the following log line:

log.Printf("pipeline: execution complete: %s", work.ID)

But you do not see the subsequent log message indicating that uploading the logs is complete:

log.Printf("pipeline: logging complete: %s", work.ID)

bradrydzewski changed the title from "Build information mismatch" to "Build status not updated, but build is complete" on Jul 16, 2017
@bradrydzewski

Note that at this time I still have not been able to reproduce this, but I am running docker 17.05 (perhaps that is why). If someone can reproduce this issue with docker 17.05 (and drone version 0.8, which uses grpc), then we can re-open.

@gtaylor

gtaylor commented Jul 16, 2017

FWIW, the latest major release of Kubernetes, 1.7, which just came out, has this certification statement:

Docker versions 1.10.3, 1.11.2, 1.12.6 have been validated

While it is possible to run newer or older versions of Docker than those, the above are the only officially supported versions for Kubernetes 1.7.

@bradrydzewski

bradrydzewski commented Jul 17, 2017

I think one option would be to run docker:dind in the same pod as your drone agent instead of using the host machine docker daemon. This would allow you to use newer versions of docker with drone, with an added benefit of a bit more host machine isolation.

Unfortunately most of the streaming code sits inside the docker library, as opposed to drone code, which limits our options for addressing the issue (assuming we can verify this is a docker issue).

@tonglil

tonglil commented Aug 10, 2017

@gtaylor I experienced this issue running the older docker:1.12.6-dind image and fixed it with docker:stable-dind. I also recommend running Drone with dind instead of mounting the host docker socket on K8s, so you can set resource constraints.

Send me a message and I can give you a preview of what we're doing. I hope to open source our Drone on GKE setup when it's more battle tested.

@zaa
Author

zaa commented Aug 11, 2017

Yep, we resolved the issue by running drone-agent and docker-in-docker in the same Kubernetes pod. But as was mentioned earlier, the 1.12.x version is what Kubernetes validates, and most production-like installations of Kubernetes are using it.

Maybe it makes sense to add some kind of warning or recommendation to the documentation about the minimum Docker version, and/or a suggestion to use dind on Kubernetes.

@ptagr

ptagr commented Aug 13, 2017

@tonglil @zaa could you share more details on your setup? Do you have to share the docker binary with the drone-agent if you are not mounting the docker socket?

@tonglil

tonglil commented Aug 14, 2017

@Punitag I'll see if I can put something together for this week.

@zaa
Author

zaa commented Aug 14, 2017

@Punitag we run drone-agent and dind as two containers in the same Kubernetes pod. The agent connects to the docker daemon via tcp (tcp://127.0.0.1:2375).

@tonglil

tonglil commented Aug 15, 2017

An early preview of what I'm putting together for a PR to the official docs on running Drone on GKE; this is how we currently run it.

https://gist.github.com/tonglil/4108f5c74bf4e382511f4c1b633d2d9a

A few things missing:

  • 0.8 (pending testing with grpc server-agent)
  • Global Registry + Secret (currently creating the Secret resources manually first, then adding mounts to the server pod)

@bradrydzewski

I wanted to provide a quick update, since it looks like there were multiple root causes to this issue and we have at least two solutions now. The comment below was copied from discourse; you can visit the original thread here.


I just merged a pull request that fixes an issue where large log output causes the upload to return an error due to exceeding the maximum grpc payload size. The agent will continue to retry the upload indefinitely because the error will always be the same, thus causing the build to get stuck.

Thanks to @tboerger for pinpointing the exact error:

err: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger
than max (7399047 vs. 4194304)

This fix limits the size of the logs (per step) to ensure they do not exceed the grpc limits. A more permanent solution will be to implement grpc streaming, which is definitely how this should be implemented in the long term anyway.
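
As an illustration of the per-step limiting approach (a sketch only, not the actual patch; the names are made up, and the 2MB figure matches the per-step limit mentioned later in the thread), wrapping the step's output in an io.LimitReader keeps each upload safely under grpc's default 4MB (4194304 byte) maximum message size:

package logs

import "io"

// maxStepLogSize caps how many bytes of a single step's logs are uploaded,
// keeping each payload safely under grpc's default 4MB message limit.
const maxStepLogSize = 2 * 1024 * 1024 // 2MB per step

// copyStepLogs copies at most maxStepLogSize bytes of a step's output to the
// upload stream; anything beyond that is truncated.
func copyStepLogs(dst io.Writer, stepOutput io.Reader) (int64, error) {
	return io.Copy(dst, io.LimitReader(stepOutput, maxStepLogSize))
}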

So in conclusion, I believe there were at least two different root causes for builds getting stuck that we have discovered: older docker daemons freezing the logs endpoint (moby/moby#30135), and per-step log output exceeding the maximum grpc payload size.

I therefore believe that both upgrading docker and getting the drone/agent:latest image with the patch limiting log size will resolve this issue for most, if not all, people.

@gtaylor

gtaylor commented Sep 12, 2017

Nice find!

Just so I am clear on behavior, we'll see a truncated build log (at the limit boundary) if our build goes over the limit?

@bradrydzewski

yes, each step in the pipeline will truncate its logs at 2MB. Note that the aggregate of all logs across all steps can exceed 2MB, so this is just a per-step limit.

@tonglil

tonglil commented Sep 13, 2017

Thanks, it's a fair short-term workaround for allowing builds to complete; one can reduce log output in the meantime until grpc streaming is implemented.
