
Build status not updated, but build is complete #2090

Closed

zaa opened this issue Jun 26, 2017 · 20 comments

Comments

@zaa

zaa commented Jun 26, 2017

Hello,

I've got a fun issue with drone 0.7. All jobs of a build completed successfully, but the build was still shown as running. I clicked the "Cancel" button on the build page, and the build info page showed that the build was killed (the build information on that page was rendered in red).
However, in the list of builds on the left and in the output of "drone build info", the build is still shown as running.

$ drone build info repo/project 20
Number: 20
Status: running
Event: push
...

And no new builds can start (they are waiting in the pending state).

I see the following logs in the agent:

pipeline: finish uploading logs: 540: step smoke-test
pipeline: ping queue: 540
pipeline: execution complete: 540
pipeline: ping queue: 540
rpc: error making call: jsonrpc2: code 0 message: queue: task cancelled
pipeline: cancel signal received: 540: jsonrpc2: code 0 message: queue: task cancelled
pipeline: cancel ping loop: 540

I tried restarting the drone server and agent; the pending build started running, but the stuck build is still shown as "running".

Thank you.

@zaa
Author

zaa commented Jun 26, 2017

$ drone build stop repo/project 20
client error 400: Cannot cancel a non-running build

@bradrydzewski

duplicate of #2065

@zaa
Author

zaa commented Jun 26, 2017

@bradrydzewski Oh, I see that you are aware of the issue.
Could you please provide a bit more context?
What should we expect? Is the issue being addressed, or are there any plans to address it?
Can we help in any way?

Thank you.

@bradrydzewski

@zaa yes, I plan to resolve this in the upcoming 0.8 release (no planned release date). I am currently evaluating grpc, which would give us heartbeat, reconnect, retry with backoff, and more. I first need to understand the implications of such a decision, since grpc uses http2 and could complicate installation (nginx cannot proxy http2/grpc to upstream servers, for example).

@mrueg

mrueg commented Jun 26, 2017

@bradrydzewski would you consider adding a "build reset" feature? That might help mitigate network-related and similar issues.

@bradrydzewski

@mrueg I would rather see the time and effort go into fixing the root cause. The existing code could be improved by implementing websocket ping/pong to keep the connection alive, and then slightly tweaking the retry logic here:
https://github.com/cncd/pipeline/blob/master/pipeline/rpc/client.go#L139
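
For illustration, a minimal sketch of that kind of websocket keepalive, using the gorilla/websocket package (the package choice, function names, and intervals here are assumptions for the sketch, not the actual cncd/pipeline code):

package keepalive

import (
	"log"
	"time"

	"github.com/gorilla/websocket"
)

// keepAlive sends periodic websocket pings and extends the read deadline
// whenever a pong comes back, so a silently dropped connection is detected
// instead of hanging forever. (Pongs are delivered while the application
// is reading from the connection.)
func keepAlive(conn *websocket.Conn, interval time.Duration) {
	conn.SetPongHandler(func(string) error {
		return conn.SetReadDeadline(time.Now().Add(2 * interval))
	})

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		deadline := time.Now().Add(interval)
		if err := conn.WriteControl(websocket.PingMessage, nil, deadline); err != nil {
			log.Printf("keepalive: ping failed, reconnecting: %v", err)
			return
		}
	}
}

The reconnect loop linked above could then back off between attempts instead of retrying immediately.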

@bradrydzewski

bradrydzewski commented Jul 16, 2017

copy of the gitter conversation where @zaa tracks the possible root cause down to the docker logs endpoint freezing.

@zaa Regarding #2090. I added a bunch of logs to the agent and found out the following. In our case the build as a whole was not progressing to "success" state (client.Done call in the agent) because it was stuck on the uploads.Wait() call.

@zaa The call was waiting for a step to send its logs to the server (the io.Copy(logstream, limitedPart) part).
Even though the step was completed (I can see the last line of its output), for some reason the call was still waiting for data to copy over. But the container was already stopped.

@zaa In our environment (kubernetes + docker 1.12.4 on debian jessie) it takes several build restarts to reproduce the issue, but it is quite persistent.

@bradrydzewski the fact that a container is stopped should not be a problem. You can run docker logs -f on a stopped container and it will print its output and close the stream at EOF

@bradrydzewski did you look at the docker daemon logs for errors?

@bradrydzewski also, is it possible there is a bug in older docker daemons? I believe kubernetes environments use 1.12 or 1.13? I use 17.05 and have not been able to reproduce

@bradrydzewski maybe moby/moby#30135. See this excerpt:

this issue was fixed and the fix is included in docker 17.03, so you'll have to upgrade to that version to get the fix. If you're still able to reproduce on 17.03, please open a new issue with details.

Note that you can identify this issue in 0.8 if you see the following log line:

log.Printf("pipeline: execution complete: %s", work.ID)

But you do not see the subsequent log message indicating that uploading the logs is complete:

log.Printf("pipeline: logging complete: %s", work.ID)

bradrydzewski changed the title from "Build information mismatch" to "Build status not updated, but build is complete" on Jul 16, 2017
@bradrydzewski

Note that at this time I still have not been able to reproduce this, but I am running docker 17.05 (perhaps that is why). If someone can reproduce this issue with docker 17.05 (and drone version 0.8, which uses grpc), then we can re-open.

@gtaylor

gtaylor commented Jul 16, 2017

FWIW, the latest major release of Kubernetes, 1.7, which just came out, has this certification statement:

Docker versions 1.10.3, 1.11.2, 1.12.6 have been validated

While it is possible to run newer or older versions of Docker than those, the above are the only officially supported versions for Kubernetes 1.7.

@bradrydzewski

bradrydzewski commented Jul 17, 2017

I think one option would be to run docker:dind in the same pod as your drone agent instead of using the host machine docker daemon. This would allow you to use newer versions of docker with drone, with an added benefit of a bit more host machine isolation.

Unfortunately most of the streaming code sits inside the docker library, as opposed to drone code, which limits our options for addressing the issue (assuming we can verify this is a docker issue).

@tonglil

tonglil commented Aug 10, 2017

@gtaylor I experienced this issue running the older docker:1.12.6-dind image and fixed it with docker:stable-dind. I also recommend running Drone with dind instead of mounting the host docker socket on K8s, so you can set resource constraints.

Send me a message and I can give you a preview of what we're doing. I hope to open source our Drone on GKE setup when it's more battle tested.

@zaa
Author

zaa commented Aug 11, 2017

Yep, we resolved the issue by running drone-agent and docker-in-docker in the same Kubernetes pod. But as was mentioned earlier, the 1.12.x version is what Kubernetes validates, and most production-like installations of Kubernetes are using it.

Maybe it makes sense to add some kind of warning or recommendation to the documentation about the minimum Docker version, and/or a suggestion to use dind on Kubernetes.

@ptagr

ptagr commented Aug 13, 2017

@tonglil @zaa could you share more details on your setup? Do you have to share the docker binary with the drone-agent if you are not mounting the docker socket?

@tonglil

tonglil commented Aug 14, 2017

@Punitag I'll see if I can put something together for this week.

@zaa
Author

zaa commented Aug 14, 2017

@Punitag we run drone-agent and dind as two containers in the same Kubernetes pod. The agent connects to the docker daemon via tcp (tcp://127.0.0.1:2375).

@tonglil

tonglil commented Aug 15, 2017

An early preview of what I'm putting together for a PR to the official docs on running Drone on GKE; this is how we currently run it.

https://gist.github.com/tonglil/4108f5c74bf4e382511f4c1b633d2d9a

A few things missing:

  • 0.8 (pending testing with grpc server-agent)
  • Global Registry + Secret (currently creating the Secret resources manually first, then adding mounts to the server pod)

@bradrydzewski

I wanted to provide a quick update, since it looks like there were multiple root causes to this issue and we have at least two solutions now. The comment below was copied from discourse; you can visit the original thread here.


I just merged a pull request that fixes an issue where large log output causes the upload to return an error due to exceeding the maximum grpc payload size. The agent will continue to retry the upload indefinitely because the error will always be the same, thus causing the build to get stuck.

Thanks to @tboerger for pinpointing the exact error:

err: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger
than max (7399047 vs. 4194304)

This fix limits the size of the logs (per step) to ensure they do not exceed the grpc limits. A more permanent solution will be to implement grpc streaming, which is definitely how this should be implemented in the long term anyway.
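
As an illustration of the per-step limiting approach (a sketch only, not the actual patch; the names are made up, and the 2MB figure matches the per-step limit mentioned later in the thread), wrapping the step's output in an io.LimitReader keeps each upload safely under grpc's default 4MB (4194304 byte) maximum message size:

package logs

import "io"

// maxStepLogSize caps how many bytes of a single step's logs are uploaded,
// keeping each payload safely under grpc's default 4MB message limit.
const maxStepLogSize = 2 * 1024 * 1024 // 2MB per step

// copyStepLogs copies at most maxStepLogSize bytes of a step's output to the
// upload stream; anything beyond that is truncated.
func copyStepLogs(dst io.Writer, stepOutput io.Reader) (int64, error) {
	return io.Copy(dst, io.LimitReader(stepOutput, maxStepLogSize))
}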

So in conclusion, I believe there were at least two different root causes for builds getting stuck that we have discovered: older docker daemons freezing the logs endpoint (moby/moby#30135), and per-step log output exceeding the maximum grpc payload size.

I therefore believe that both upgrading docker and getting the drone/agent:latest image with the patch limiting log size will resolve this issue for most, if not all, people.

@gtaylor

gtaylor commented Sep 12, 2017

Nice find!

Just so I am clear on behavior, we'll see a truncated build log (at the limit boundary) if our build goes over the limit?

@bradrydzewski

yes, each step in the pipeline will truncate its logs at 2MB. Note that the aggregate of all logs across all steps can exceed 2MB, so this is just a per-step limit.

@tonglil

tonglil commented Sep 13, 2017

Thanks, it's a fair short-term workaround for allowing builds to complete; one can reduce log output in the meantime until grpc streaming is implemented.
