Errors in logs, state change #1202
I started running into this problem recently. I am trying to build Apache Pig, including running all the core tests. The job runs for about 10 hours, and occasionally it spends 9-10 minutes on a single test without writing anything to the screen. The job does complete eventually, but once it completes, the GoCD server reschedules it to run again. The odd thing is that the rescheduling only happens towards the end of the run. Usually it takes one extra run to complete the job, but today it has gone to a third. Here are the relevant logs.
@vladistan, can you increase the verbosity of the tests and check whether this is still a problem? Generally the server only reschedules a job when it has lost its connection to the agent.
@zabil I did, last night. I noticed these entries on the agent that coincide with the time the job is rescheduled. I will run another test right now with everything restarted clean: `2016-08-22 07:07:44,392 [loopThread] ERROR thoughtworks.go.agent.AgentController:145 - [Agent Loop] Error occurred during loop:`
So today's job did the same thing; here are the logs. The agent panel says "lost contact". Going to try the suggestions from http://stackoverflow.com/questions/5839359/java-lang-outofmemoryerror-gc-overhead-limit-exceeded and will let you know how it goes.
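For reference, the heap increase mentioned in the next comment would be done through the agent's defaults file on Linux packages. This is a sketch only: the variable names are an assumption based on older GoCD packages and may differ in your version, so check the file your package actually ships before editing.

```shell
# /etc/default/go-agent
# (variable names are an assumption from older GoCD Linux packages;
#  verify against the file installed on your agent)
AGENT_MEM=256m    # initial JVM heap for the agent (-Xms)
MAX_MEM=1024m     # maximum JVM heap (-Xmx); 1G matches the value tried below
```

The agent service has to be restarted for the change to take effect.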
OK, restarted with 1 GB of heap. Now the agent doesn't crash, but the job got rescheduled again, and I see this in the logs:
@vladistan: the agent OOM is a known issue. A workaround is to set up your test artifacts (the JUnit XML reports) to be published as a build artifact instead. The downside of this approach is that you won't see the reports on the failures tab.
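In cruise-config.xml terms, this workaround amounts to swapping the `<test>` artifact element for a plain `<artifact>` one. A sketch; the `src`/`dest` paths below are placeholders, not values from this thread:

```xml
<artifacts>
  <!-- Before: a test artifact, parsed by the agent for the failures tab
  <test src="build/test-reports" dest="test-output" />
  -->
  <!-- After: a plain build artifact; the agent skips the memory-hungry
       report parsing, at the cost of losing the failures tab -->
  <artifact src="build/test-reports" dest="test-output" />
</artifacts>
```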
@ketan I actually added extra heap memory to the agent and it seems to be running fine, but the job is still being rescheduled. This time the agent is still attached to the server and I don't see any error messages there. What I did now was set the job timeout to 4800 minutes, to see if it changes anything.
@vladistan: the job timeout only matters if your build generates no console output for that amount of time (see https://docs.go.cd/current/configuration/job_timeout.html). So even if your build takes very long, as long as it sporadically generates output every few minutes, it's all good. If you're still seeing the job being rescheduled, you may find the root cause in the server log as well.
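For anyone following along, the timeout is set per job in cruise-config.xml (or, per the docs link above, as a server-wide default). A sketch; the job name is a placeholder:

```xml
<!-- timeout is minutes of console *inactivity*, not total runtime;
     timeout="0" disables the timeout for this job entirely -->
<job name="core-tests" timeout="4800">
  <!-- tasks elided -->
</job>
```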
@ketan yes, you are right, it didn't help. Here is what I see in the logs now. Note that the agent got an I/O exception of some sort, and the server rescheduled the job around the same time, but I don't see what the reason for the rescheduling was. I increased verbosity to DEBUG as @zabil suggested, and now I see all kinds of junk in the logs. Do you have specific messages I should be looking for? AGENT:
SERVER:
That generally indicates a problem on the server: it's unable to service any web requests. Could you paste the output? Please be sure to scrub out any sensitive information first.
Or, if it works better for you: https://gitter.im/gocd/gocd
There's issue #2145, which was fixed in 16.5. What version are you running? Just checked and see that you are on 16.1. Upgrading to 16.5 should fix this for you.
@zabil: from the original trace in #1202 (comment), the issue is the usual XSLT transform, and does not appear to be the number of revisions (yet). @vladistan: I suggest upgrading to the latest release to give it another shot. If things fail with the same exception as in #1202 (comment), please use the workaround I provided.
@ketan I was talking about this
OK, I still see the issue after upgrading to 16.9.0; here is the link to the /go/api/support contents. I don't see the HttpMethodDirector errors anymore. The full log is below.
So last week I made the following changes to our environment:
After doing all that, the job restarts are not happening anymore. @ketan @zabil do you think any of the above could cause jobs to be randomly restarted?
For the record, I think increasing the heap size could have kept the agents from running out of memory (and hence from restarting the job).
@arvindsv I still saw the restarts after the heap size was increased, and there were no heap errors reported in the logs. Do you think time drift between slave/master, or persistent failures in the artifact server, could be related to this issue?
@vladistan I don't think the time drift should be a cause. That's opinion, though, and not verified. What do you mean by "artifact server"? Do you mean some kind of Nexus-like external artifact repository? If so, that shouldn't matter, since uploading to it would just be a task. If it's always happening and reproducible, it might be worth changing log4j.properties on the agent side and setting it to DEBUG. It will produce a lot of logs, but might contain something useful.
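As a sketch, the DEBUG change would look something like the following in the agent's log4j.properties. The exact logger and appender names vary by GoCD version, so treat these lines as an assumption and adapt them to the shipped file:

```properties
# Raise the GoCD agent code (the thoughtworks.go.* classes seen in the
# stack traces above) to DEBUG; leave everything else at its default.
log4j.logger.com.thoughtworks.go=DEBUG
```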
This issue has been automatically marked as stale because it has not had activity in the last 90 days. |
Hi there,
We're seeing a lot of jobs showing up as failed when they haven't really failed, and then restarting on their own. The jobs and stages eventually complete, but these restarts make the build take longer. The Go server (ours is Go version 13.1.1) seems to be getting confused about state changes; I'm including a stack trace as well as the two warning messages that follow it:
There's a related issue at #24, but that one was closed by the user without any follow-up.
Any ideas of what's going on, and how we can fix it?