Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Errors in logs, state change #1202

Closed
mlorenz opened this issue Jun 8, 2015 · 21 comments
Closed

Errors in logs, state change #1202

mlorenz opened this issue Jun 8, 2015 · 21 comments
Labels

Comments

@mlorenz
Copy link

mlorenz commented Jun 8, 2015

Hi there,

We're seeing a lot of jobs showing up as failed when they haven't really failed, and then restarting on their own. The jobs and stages eventually complete but any of these restarts make the build take longer. It looks as though the Go server (ours is Go Version: 13.1.1) seems to be getting confused about state changes; I'm including a stack trace as well as the two warning messages that follow it:

JobInstanceSqlMapDao:306 - State change for a completed Job is not allowed. Job JobIdentifier[goconfig, 11933, 11933, UpdateGoConfiguration, 1, CopyNewConfiguration, 211844] is currently State=Preparing, Result=Failed`
java.lang.Exception
        at com.thoughtworks.cruise.server.dao.JobInstanceSqlMapDao.logIfJobIsCompleted(JobInstanceSqlMapDao.java:306)
        at com.thoughtworks.cruise.server.dao.JobInstanceSqlMapDao.access$700(JobInstanceSqlMapDao.java:41)
        at com.thoughtworks.cruise.server.dao.JobInstanceSqlMapDao$2.doInTransaction(JobInstanceSqlMapDao.java:280)
        at com.thoughtworks.cruise.server.transaction.TransactionTemplate$2.doInTransaction(TransactionTemplate.java:23)
        at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:130)
        at com.thoughtworks.cruise.server.transaction.TransactionTemplate.execute(TransactionTemplate.java:19)
        at com.thoughtworks.cruise.server.dao.JobInstanceSqlMapDao.updateStateAndResult(JobInstanceSqlMapDao.java:262)
        at com.thoughtworks.cruise.server.service.JobInstanceService$3.doInTransactionWithoutResult(JobInstanceService.java:151)
        at org.springframework.transaction.support.TransactionCallbackWithoutResult.doInTransaction(TransactionCallbackWithoutResult.java:33)
        at com.thoughtworks.cruise.server.transaction.TransactionTemplate$2.doInTransaction(TransactionTemplate.java:23)
        at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:130)
        at com.thoughtworks.cruise.server.transaction.TransactionTemplate.execute(TransactionTemplate.java:19)
        at com.thoughtworks.cruise.server.service.JobInstanceService.internalUpdateJobStateAndResult(JobInstanceService.java:149)
        at com.thoughtworks.cruise.server.service.JobInstanceService.access$000(JobInstanceService.java:42)
        at com.thoughtworks.cruise.server.service.JobInstanceService$1.doInTransactionWithoutResult(JobInstanceService.java:95)
        at org.springframework.transaction.support.TransactionCallbackWithoutResult.doInTransaction(TransactionCallbackWithoutResult.java:33)
        at com.thoughtworks.cruise.server.transaction.TransactionTemplate$2.doInTransaction(TransactionTemplate.java:23)
        at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:130)
        at com.thoughtworks.cruise.server.transaction.TransactionTemplate.execute(TransactionTemplate.java:19)
        at com.thoughtworks.cruise.server.service.JobInstanceService.updateStateAndResult(JobInstanceService.java:93)
        at com.thoughtworks.cruise.server.service.ScheduleService$5.doInTransactionWithoutResult(ScheduleService.java:500)
        at com.thoughtworks.cruise.server.transaction.TransactionCallbackWithoutResult.doInTransaction(TransactionCallbackWithoutResult.java:7)
        at com.thoughtworks.cruise.server.transaction.TransactionCallback.doWithExceptionHandling(TransactionCallback.java:8)
        at com.thoughtworks.cruise.server.transaction.TransactionTemplate$3.doInTransaction(TransactionTemplate.java:37)
        at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:130)
        at com.thoughtworks.cruise.server.transaction.TransactionTemplate.executeWithExceptionHandling(TransactionTemplate.java:33)
        at com.thoughtworks.cruise.server.service.ScheduleService.updateJobStatus(ScheduleService.java:491)
        at com.thoughtworks.cruise.server.service.BuildRepositoryService.updateStatusFromAgent(BuildRepositoryService.java:42)
        at com.thoughtworks.cruise.remote.BuildRepositoryRemoteImpl$1.call(BuildRepositoryRemoteImpl.java:52)
        at com.thoughtworks.cruise.remote.BuildRepositoryRemoteImpl.handleFailuresDuringReporting(BuildRepositoryRemoteImpl.java:90)
        at com.thoughtworks.cruise.remote.BuildRepositoryRemoteImpl.reportCurrentStatus(BuildRepositoryRemoteImpl.java:48)
        at com.thoughtworks.cruise.server.messaging.BuildRepositoryMessageProducer.reportCurrentStatus(BuildRepositoryMessageProducer.java:34)
        at sun.reflect.GeneratedMethodAccessor254.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:319)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
        at org.springframework.remoting.support.RemoteInvocationTraceInterceptor.invoke(RemoteInvocationTraceInterceptor.java:77)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
        at com.sun.proxy.$Proxy37.reportCurrentStatus(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor253.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.springframework.remoting.support.RemoteInvocation.invoke(RemoteInvocation.java:205)
        at org.springframework.remoting.support.DefaultRemoteInvocationExecutor.invoke(DefaultRemoteInvocationExecutor.java:38)
        at org.springframework.remoting.support.RemoteInvocationBasedExporter.invoke(RemoteInvocationBasedExporter.java:78)
        at org.springframework.remoting.support.RemoteInvocationBasedExporter.invokeAndCreateResult(RemoteInvocationBasedExporter.java:114)
        at org.springframework.remoting.httpinvoker.HttpInvokerServiceExporter.handleRequest(HttpInvokerServiceExporter.java:73)
        at org.springframework.web.servlet.mvc.HttpRequestHandlerAdapter.handle(HttpRequestHandlerAdapter.java:49)
        at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:923)
        at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852)
        at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882)
        at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:789)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1166)
        at com.thoughtworks.cruise.server.web.i18n.LocaleResolver.doFilter(LocaleResolver.java:26)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
        at org.springframework.security.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:359)
        at org.tuckey.web.filters.urlrewrite.RuleChain.handleRewrite(RuleChain.java:176)
        at org.tuckey.web.filters.urlrewrite.RuleChain.doRules(RuleChain.java:145)
        at org.tuckey.web.filters.urlrewrite.UrlRewriter.processRequest(UrlRewriter.java:92)
        at org.tuckey.web.filters.urlrewrite.UrlRewriteFilter.doFilter(UrlRewriteFilter.java:381)
        at org.springframework.security.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:371)
        at com.thoughtworks.cruise.server.web.FlashLoadingFilter.doFilter(FlashLoadingFilter.java:29)
        at org.springframework.security.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:371)
        at org.springframework.security.intercept.web.FilterSecurityInterceptor.invoke(FilterSecurityInterceptor.java:109)
        at org.springframework.security.intercept.web.FilterSecurityInterceptor.doFilter(FilterSecurityInterceptor.java:83)
        at org.springframework.security.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:371)
        at org.springframework.security.ui.ExceptionTranslationFilter.doFilterHttp(ExceptionTranslationFilter.java:101)
        at org.springframework.security.ui.SpringSecurityFilter.doFilter(SpringSecurityFilter.java:53)
        at org.springframework.security.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:371)
        at org.springframework.security.ui.x509.X509ProcessingFilter.doFilter(X509ProcessingFilter.java:137)
        at org.springframework.security.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:371)
        at com.thoughtworks.cruise.server.web.i18n.LocaleResolver.doFilter(LocaleResolver.java:26)
        at org.springframework.security.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:371)
        at com.thoughtworks.cruise.server.security.ArtifactSizeEnforcementFilter.doFilter(ArtifactSizeEnforcementFilter.java:60)
        at org.springframework.security.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:371)
        at com.thoughtworks.cruise.server.security.PerformanceLoggingFilter.doFilter(PerformanceLoggingFilter.java:30)
        at org.springframework.security.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:371)
        at org.springframework.security.util.FilterChainProxy.doFilter(FilterChainProxy.java:174)
        at org.springframework.web.filter.DelegatingFilterProxy.invokeDelegate(DelegatingFilterProxy.java:346)
        at org.springframework.web.filter.DelegatingFilterProxy.doFilter(DelegatingFilterProxy.java:259)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:939)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

WARN [75@MessageListener for WorkFinder] ScheduleService:585 - [Job Reschedule] Found latest incomplete job for agent Agent..[Job Instance: JobInstance{stageId=85263, name='CopyNewConfiguration', state=Preparing, result=Failed,     
WARN [75@MessageListener for WorkFinder] ScheduleService:599 - [Job Reschedule] Rescheduling and marking old job as ignored: JobInstance{stageId=85263, name='CopyNewConfiguration', state=Preparing, result=Failed,

There's a related issue at #24 but that one was closed by the user without any follow up.

Any ideas of what's going on, and how we can fix it?

@arvindsv arvindsv modified the milestone: Unknown - Needs interest from someone Jan 4, 2016
@vladistan
Copy link

I started to run into this problem recently. I am trying to build Apache Pig including running all core tests. The job runs for about 10 hours and occasionally it is running 9-10 minute long tests without outputting anything to the screen. However the job does complete after all. Once the job completes the GoCD server reschedules it to run again. Funny thing is that rescheduling happens only towards the end of the run. Usually it takes one extra run to complete the job, but today it's gone on the third.

Here are the relevant logs.

2016-08-20 00:08:58,895  INFO [qtp1319204982-23] BuildRepositoryRemoteImpl:104 - [Agent [dkr-host-1, 172.17.0.1, c28c1c3b-748f-42ec-8d88-8d375842b363, 74cfb996-5d21-42af-9ac3-e583d618b09f]] is reporting result [Passed] for [Build [Pig-PR/23/defaultStage/2/buildAndIntialTest/2264]]
2016-08-20 00:08:58,912  INFO [qtp1319204982-18] BuildRepositoryRemoteImpl:104 - [Agent [dkr-host-1, 172.17.0.1, c28c1c3b-748f-42ec-8d88-8d375842b363, 74cfb996-5d21-42af-9ac3-e583d618b09f]] is reporting status [Completing] for [Build [Pig-PR/23/defaultStage/2/buildAndIntialTest/2264]]
2016-08-20 00:09:19,513  WARN [71@MessageListener for WorkFinder] ScheduleService:602 - [Job Reschedule] Found latest incomplete job for agent Agent [dkr-host-1, 172.17.0.1, c28c1c3b-748f-42ec-8d88-8d375842b363] [Job Instance: JobInstance{stageId=2047, name='buildAndIntialTest', state=Completing, result=Passed, agentUuid='c28c1c3b-748f-42ec-8d88-8d375842b363', stateTransitions=[com.thoughtworks.go.domain.JobStateTransition@e40352b[currentState=Scheduled,stateChangeTime=Fri Aug 19 15:39:42 EDT 2016,jobId=2264,stageId=2047,id=12310], com.thoughtworks.go.domain.JobStateTransition@1962a15c[currentState=Assigned,stateChangeTime=Fri Aug 19 15:39:48 EDT 2016,jobId=2264,stageId=2047,id=12311], com.thoughtworks.go.domain.JobStateTransition@1a4fac09[currentState=Preparing,stateChangeTime=Fri Aug 19 15:39:58 EDT 2016,jobId=2264,stageId=2047,id=12312], com.thoughtworks.go.domain.JobStateTransition@64a7905d[currentState=Building,stateChangeTime=Fri Aug 19 15:39:58 EDT 2016,jobId=2264,stageId=2047,id=12313], com.thoughtworks.go.domain.JobStateTransition@34f6fd4b[currentState=Completing,stateChangeTime=Sat Aug 20 00:08:58 EDT 2016,jobId=2264,stageId=2047,id=12314]], scheduledDate=Fri Aug 19 15:39:42 EDT 2016, timeProvider=com.thoughtworks.go.util.TimeProvider@436f6217, ignored=false, matcher=com.thoughtworks.go.domain.BuildOutputMatcher@4590f697, log=com.thoughtworks.go.domain.NullJobInstanceLog@302d572f, identifier=JobIdentifier[Pig-PR, 23, 23, defaultStage, 2, buildAndIntialTest, 2264], plan=null, runOnAllAgents=false, runMultipleInstance=false}]
2016-08-20 00:09:19,514  WARN [71@MessageListener for WorkFinder] ScheduleService:616 - [Job Reschedule] Rescheduling and marking old job as ignored: JobInstance{stageId=2047, name='buildAndIntialTest', state=Completing, result=Passed, agentUuid='c28c1c3b-748f-42ec-8d88-8d375842b363', stateTransitions=[com.thoughtworks.go.domain.JobStateTransition@e40352b[currentState=Scheduled,stateChangeTime=Fri Aug 19 15:39:42 EDT 2016,jobId=2264,stageId=2047,id=12310], com.thoughtworks.go.domain.JobStateTransition@1962a15c[currentState=Assigned,stateChangeTime=Fri Aug 19 15:39:48 EDT 2016,jobId=2264,stageId=2047,id=12311], com.thoughtworks.go.domain.JobStateTransition@1a4fac09[currentState=Preparing,stateChangeTime=Fri Aug 19 15:39:58 EDT 2016,jobId=2264,stageId=2047,id=12312], com.thoughtworks.go.domain.JobStateTransition@64a7905d[currentState=Building,stateChangeTime=Fri Aug 19 15:39:58 EDT 2016,jobId=2264,stageId=2047,id=12313], com.thoughtworks.go.domain.JobStateTransition@34f6fd4b[currentState=Completing,stateChangeTime=Sat Aug 20 00:08:58 EDT 2016,jobId=2264,stageId=2047,id=12314]], scheduledDate=Fri Aug 19 15:39:42 EDT 2016, timeProvider=com.thoughtworks.go.util.TimeProvider@436f6217, ignored=false, matcher=com.thoughtworks.go.domain.BuildOutputMatcher@4590f697, log=com.thoughtworks.go.domain.NullJobInstanceLog@302d572f, identifier=JobIdentifier[Pig-PR, 23, 23, defaultStage, 2, buildAndIntialTest, 2264], plan=null, runOnAllAgents=false, runMultipleInstance=false}
2016-08-20 00:09:19,530  INFO [71@MessageListener for WorkFinder] ScheduleService:634 - [Job Reschedule] Scheduled new job: JobIdentifier[Pig-PR, 23, 23, defaultStage, 2, buildAndIntialTest, 2265]. Replacing old job: JobIdentifier[Pig-PR, 23, 23, defaultStage, 2, buildAndIntialTest, 2264]

@zabil
Copy link
Contributor

zabil commented Aug 22, 2016

@vladistan can you increase the verbosity of the tests and check if this is still a problem?

Generally the server must reschedule a task when it's lost connection to the agent.
There's another thread that checks for this status and a silent job must not affect it. We just want to rule that out.

@vladistan
Copy link

@zabil I did last night... I noticed these on the agent that coincide with the time when Job is rescheduled. Will run another test right now with everything restarted clean...

`2016-08-22 07:07:44,392 [loopThread] ERROR thoughtworks.go.agent.AgentController:145 - [Agent Loop] Error occurred during loop:
17361 org.springframework.remoting.RemoteAccessException: Could not access HTTP invoker remote service at [https://gocd-dev.home.v-lad.org:8154/go/remoting/remoteBuildRepository]; nested exception is java.lang.OutOfMemoryError: GC overhead limit exceeded
17362 at org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.convertHttpInvokerAccessException(HttpInvokerClientInterceptor.java:212)
17363 at org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.invoke(HttpInvokerClientInterceptor.java:145)
17364 at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
17365 at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
17366 at com.sun.proxy.$Proxy5.getWork(Unknown Source)
17367 at com.thoughtworks.go.agent.AgentController.retrieveWork(AgentController.java:179)
17368 at com.thoughtworks.go.agent.AgentController.loop(AgentController.java:132)
17369 at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
17370 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
17371 at java.lang.reflect.Method.invoke(Method.java:606)
17372 at org.springframework.util.MethodInvoker.invoke(MethodInvoker.java:273)
17373 at org.springframework.scheduling.support.MethodInvokingRunnable.run(MethodInvokingRunnable.java:65)
17374 at org.springframework.scheduling.timer.DelegatingTimerTask.run(DelegatingTimerTask.java:70)
17375 at java.util.TimerThread.mainLoop(Timer.java:555)
17376 at java.util.TimerThread.run(Timer.java:505)
17377 Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
17378 2016-08-22 07:07:44,983 [pingThread] ERROR thoughtworks.go.agent.AgentController:116 - Error occurred when agent tried to ping server:
17379 org.springframework.remoting.RemoteAccessException: Could not access HTTP invoker remote service at [https://gocd-dev.home.v-lad.org:8154/go/remoting/remoteBuildRepository]; nested exception is java.lang.OutOfMemoryError: GC overhead limit exceeded
17380 at org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.convertHttpInvokerAccessException(HttpInvokerClientInterceptor.java:212)
17381 at org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.invoke(HttpInvokerClientInterceptor.java:145)
17382 at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
17383 at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
17384 at com.sun.proxy.$Proxy5.ping(Unknown Source)
17385 at com.thoughtworks.go.agent.AgentController.ping(AgentController.java:112)
17386 at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
17387 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
17388 at java.lang.reflect.Method.invoke(Method.java:606)
17389 at org.springframework.util.MethodInvoker.invoke(MethodInvoker.java:273)
17390 at org.springframework.scheduling.support.MethodInvokingRunnable.run(MethodInvokingRunnable.java:65)
17391 at org.springframework.scheduling.timer.DelegatingTimerTask.run(DelegatingTimerTask.java:70)
17392 at java.util.TimerThread.mainLoop(Timer.java:555)
17393 at java.util.TimerThread.run(Timer.java:505)
17394 Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded``

@vladistan
Copy link

vladistan commented Aug 22, 2016

So today's job did the same thing here are the logs. The agent panel says lost contact.

Going to try suggestions from here http://stackoverflow.com/questions/5839359/java-lang-outofmemoryerror-gc-overhead-limit-exceeded will let you know how it goes.

2016-08-22 16:22:37,181 [loopThread] INFO  thoughtworks.go.util.HttpService:129 - Got back 201 from server
2016-08-22 16:22:37,185 [loopThread] INFO  thoughtworks.go.util.HttpService:69 - Uploading file [/tmp/cruise-865b980c-b5e0-4996-bd10-89160c9b4cd5/13c9c142-08a3-49e9-beef-e2a2f0567c92/TEST-org.apache.pig.tools.grunt.TestGruntParser.xml.zip] to url [https://gocd-dev.home.v-lad.org:8154/go/remoting/files/Pig-PR/25/defaultStage/3/buildAndIntialTest/?attempt=1&buildId=2315]
2016-08-22 16:22:37,192 [loopThread] INFO  thoughtworks.go.util.HttpService:129 - Got back 201 from server
2016-08-22 16:23:06,266 [loopThread] INFO  thoughtworks.go.work.DefaultGoPublisher:90 - Stopping Transmission for Build [Pig-PR/25/defaultStage/3/buildAndIntialTest/2315]
2016-08-22 16:23:06,269 [loopThread] ERROR springframework.scheduling.timer.MethodInvokingTimerTaskFactoryBean:68 - Invocation of method 'loop' on target class [class com.thoughtworks.go.agent.AgentController] failed
java.lang.OutOfMemoryError: Java heap space
        at com.sun.org.apache.xml.internal.utils.FastStringBuffer.append(FastStringBuffer.java:682)
        at com.sun.org.apache.xml.internal.dtm.ref.sax2dtm.SAX2DTM.characters(SAX2DTM.java:2111)
        at com.sun.org.apache.xalan.internal.xsltc.dom.SAXImpl.characters(SAXImpl.java:865)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.characters(AbstractSAXParser.java:546)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:482)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
        at com.sun.org.apache.xalan.internal.xsltc.dom.XSLTCDTMManager.getDTM(XSLTCDTMManager.java:458)
        at com.sun.org.apache.xalan.internal.xsltc.dom.XSLTCDTMManager.getDTM(XSLTCDTMManager.java:252)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.getDOM(TransformerImpl.java:559)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:742)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:353)
        at com.thoughtworks.go.domain.UnitTestReportGenerator.generate(UnitTestReportGenerator.java:72)
        at com.thoughtworks.go.config.TestArtifactPlan.mergeAndUploadTestResult(TestArtifactPlan.java:101)
        at com.thoughtworks.go.config.TestArtifactPlan.publish(TestArtifactPlan.java:66)
        at com.thoughtworks.go.domain.DefaultJobPlan.publishArtifacts(DefaultJobPlan.java:105)
        at com.thoughtworks.go.remote.work.BuildWork.completeJob(BuildWork.java:198)
        at com.thoughtworks.go.remote.work.BuildWork.build(BuildWork.java:146)
        at com.thoughtworks.go.remote.work.BuildWork.doWork(BuildWork.java:91)
        at com.thoughtworks.go.agent.JobRunner.run(JobRunner.java:53)
        at com.thoughtworks.go.agent.AgentController.retrieveWork(AgentController.java:186)
        at com.thoughtworks.go.agent.AgentController.loop(AgentController.java:132)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.springframework.util.MethodInvoker.invoke(MethodInvoker.java:273)
        at org.springframework.scheduling.support.MethodInvokingRunnable.run(MethodInvokingRunnable.java:65)
        at org.springframework.scheduling.timer.DelegatingTimerTask.run(DelegatingTimerTask.java:70)
        at java.util.TimerThread.mainLoop(Timer.java:555)
2016-08-22 16:23:27,725 [loopThread] ERROR thoughtworks.go.agent.AgentController:145 - [Agent Loop] Error occurred during loop:
org.springframework.remoting.RemoteAccessException: Could not access HTTP invoker remote service at [https://gocd-dev.home.v-lad.org:8154/go/remoting/remoteBuildRepository]; nested exception is java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.convertHttpInvokerAccessException(HttpInvokerClientInterceptor.java:212)
        at org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.invoke(HttpInvokerClientInterceptor.java:145)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
        at com.sun.proxy.$Proxy5.getWork(Unknown Source)
        at com.thoughtworks.go.agent.AgentController.retrieveWork(AgentController.java:179)
        at com.thoughtworks.go.agent.AgentController.loop(AgentController.java:132)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

@vladistan
Copy link

Ok, restarted with 1G of heap. Now agent doesn't crash. But job got rescheduled again. And I see this in the logs

2016-08-22 21:38:07,003 DEBUG [rescheduleHungBuildThread] SqlMapClientDaoSupport$SqlMapClientTemplate:168 - Opened SqlMapSession [com.ibatis.sqlmap.engine.impl.SqlMapSessionImpl@2df8d4f7] for iBATIS operation
2016-08-22 21:38:07,004 DEBUG [rescheduleHungBuildThread] SqlMapClientDaoSupport$SqlMapClientTemplate:185 - Obtained JDBC Connection [Transaction-aware proxy for target Connection  from DataSource [org.apache.commons.dbcp.BasicDataSource@3e4f78fb]] for iBATIS operation
2016-08-22 21:38:07,005  WARN [rescheduleHungBuildThread] ScheduleService:569 - Found hung job[id=JobInstance{stageId=2051, name='buildAndIntialTest', state=Building, result=Unknown, agentUuid='c28c1c3b-748f-42ec-8d88-8d375842b363', stateTransitions=[], scheduledDate=Mon Aug 22 17:47:20 EDT 2016, timeProvider=com.thoughtworks.go.util.TimeProvider@265010ed, ignored=false, matcher=com.thoughtworks.go.domain.BuildOutputMatcher@52874b41, log=com.thoughtworks.go.domain.NullJobInstanceLog@305dc2bb, identifier=JobIdentifier[Pig-PR, 25, 25, defaultStage, 3, buildAndIntialTest, 2410], plan=null, runOnAllAgents=false, runMultipleInstance=false}], rescheduling it
2016-08-22 21:38:07,006  WARN [rescheduleHungBuildThread] ScheduleService:616 - [Job Reschedule] Rescheduling and marking old job as ignored: JobInstance{stageId=2051, name='buildAndIntialTest', state=Building, result=Unknown, agentUuid='c28c1c3b-748f-42ec-8d88-8d375842b363', stateTransitions=[], scheduledDate=Mon Aug 22 17:47:20 EDT 2016, timeProvider=com.thoughtworks.go.util.TimeProvider@265010ed, ignored=false, matcher=com.thoughtworks.go.domain.BuildOutputMatcher@52874b41, log=com.thoughtworks.go.domain.NullJobInstanceLog@305dc2bb, identifier=JobIdentifier[Pig-PR, 25, 25, defaultStage, 3, buildAndIntialTest, 2410], plan=null, runOnAllAgents=false, runMultipleInstance=false}
2016-08-22 21:38:07,006 DEBUG [rescheduleHungBuildThread] SqlMapClientDaoSupport$SqlMapClientTemplate:168 - Opened SqlMapSession [com.ibatis.sqlmap.engine.impl.SqlMapSessionImpl@76886deb] for iBATIS operation
2016-08-22 21:38:07,006 DEBUG [rescheduleHungBuildThread] SqlMapClientDaoSupport$SqlMapClientTemplate:185 - Obtained JDBC Connection [Transaction-aware proxy for target Connection  from DataSource [org.apache.commons.dbcp.BasicDataSource@3e4f78fb]] for iBATIS operation
2016-08-22 21:38:07,009 DEBUG [rescheduleHungBuildThread] SqlMapClientDaoSupport$SqlMapClientTemplate:168 - Opened SqlMapSession [com.ibatis.sqlmap.engine.impl.SqlMapSessionImpl@30d868c] for iBATIS operation

@ketan
Copy link
Member

ketan commented Aug 23, 2016

@vladistan — The agent OOM is a known issue, a workaround is to setup your test artifacts (junit xml reports) to be published as a build artifact instead. The down side of this approach is that you'd not see the reports on the failures tab —

screen shot 2016-08-23 at 9 05 06 am

@vladistan
Copy link

@ketan I actually added extra heap memory to the agent and it seems to be running fine. But Job is still being rescheduled. This time the agent is still attached to the server and I don't see any error messages there.

What I did not is set job timeout to 4800 minutes to see if it changes anything.

@ketan
Copy link
Member

ketan commented Aug 23, 2016

@vladistan — the job timeout is required in case your build does not generate any console output for that amount of time (see https://docs.go.cd/current/configuration/job_timeout.html), so even if your build takes very long, as long as it sporadically generates output every few minutes, it's all good.

If you're still seeing the job being rescheduled, you may find the root cause in the server log as well.

@vladistan
Copy link

@ketan yes you are right, it didn't help. That's what I see in the logs now. Note agent got an I/O exception of some sort, and server has rescheduled the job around the same time, but I don't see what was the reason for rescheduling. I increased verbosity to DEBUG as @zabil suggested, now I see all kind of junk in the logs. Do you have some specific messages I should be looking for?

AGENT:

1629 2016-08-22 23:30:50,591 [Thread-11] INFO  go.util.command.CommandLine:62 -     [junit] Running org.apache.pig.test.TestCollectedGroup
1630 2016-08-22 23:32:25,550 [pingThread] INFO  apache.commons.httpclient.HttpMethodDirector:438 - I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request:      The server gocd-dev.home.v-lad.org failed to respond
1631 2016-08-22 23:32:25,551 [pingThread] INFO  apache.commons.httpclient.HttpMethodDirector:444 - Retrying request
1632 2016-08-22 23:34:52,329 [Thread-11] INFO  go.util.command.CommandLine:62 -     [junit] Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 241.659 sec
1633 2016-08-22 23:34:52,726 [Thread-11] INFO  go.util.command.CommandLine:62 -     [junit] OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=128M; support was removed in 8.0

SERVER:

2016-08-22 23:38:05,805 DEBUG [qtp999445351-19] GoExceptionTranslationFilter:104 - Chain processed normally
2016-08-22 23:38:07,064 DEBUG [rescheduleHungBuildThread] SqlMapClientDaoSupport$SqlMapClientTemplate:168 - Opened SqlMapSession [com.ibatis.sqlmap.engine.impl.SqlMapSessionImpl@36842e14] for iBATIS operation
2016-08-22 23:38:07,064 DEBUG [rescheduleHungBuildThread] SqlMapClientDaoSupport$SqlMapClientTemplate:185 - Obtained JDBC Connection [Transaction-aware proxy for target Connection  from DataSource [org.apache.commons.dbcp.BasicDataSource@3e4f78fb]] for iBATIS operation
2016-08-22 23:38:07,067  WARN [rescheduleHungBuildThread] ScheduleService:569 - Found hung job[id=JobInstance{stageId=2052, name='buildAndIntialTest', state=Building, result=Unknown, agentUuid='c28c1c3b-748f-42ec-8d88-8d375842b363', stateTransitions=[], scheduledDate=Mon Aug 22 22:06:28 EDT 2016, timeProvider=com.thoughtworks.go.util.TimeProvider@1e7ddc1b, ignored=false, matcher=com.thoughtworks.go.domain.BuildOutputMatcher@58fd8040, log=com.thoughtworks.go.domain.NullJobInstanceLog@6aef1764, identifier=JobIdentifier[Pig-PR, 25, 25, defaultStage, 4, buildAndIntialTest, 2412], plan=null, runOnAllAgents=false, runMultipleInstance=false}], rescheduling it
2016-08-22 23:38:07,068  WARN [rescheduleHungBuildThread] ScheduleService:616 - [Job Reschedule] Rescheduling and marking old job as ignored: JobInstance{stageId=2052, name='buildAndIntialTest', state=Building, result=Unknown, agentUuid='c28c1c3b-748f-42ec-8d88-8d375842b363', stateTransitions=[], scheduledDate=Mon Aug 22 22:06:28 EDT 2016, timeProvider=com.thoughtworks.go.util.TimeProvider@1e7ddc1b, ignored=false, matcher=com.thoughtworks.go.domain.BuildOutputMatcher@58fd8040, log=com.thoughtworks.go.domain.NullJobInstanceLog@6aef1764, identifier=JobIdentifier[Pig-PR, 25, 25, defaultStage, 4, buildAndIntialTest, 2412], plan=null, runOnAllAgents=false, runMultipleInstance=false}
2016-08-22 23:38:07,068 DEBUG [rescheduleHungBuildThread] SqlMapClientDaoSupport$SqlMapClientTemplate:168 - Opened SqlMapSession [com.ibatis.sqlmap.engine.impl.SqlMapSessionImpl@51aac789] for iBATIS operation
2016-08-22 23:38:07,068 DEBUG [rescheduleHungBuildThread] SqlMapClientDaoSupport$SqlMapClientTemplate:185 - Obtained JDBC Connection [Transaction-aware proxy for target Connection  from DataSource [org.apache.commons.dbcp.BasicDataSource@3e4f78fb]] for iBATIS operation
2016-08-22 23:38:07,071 DEBUG [rescheduleHungBuildThread] SqlMapClientDaoSupport$SqlMapClientTemplate:168 - Opened SqlMapSession [com.ibatis.sqlmap.engine.impl.SqlMapSessionImpl@2ab10e25] for iBATIS operation

@ketan
Copy link
Member

ketan commented Aug 23, 2016

1630 2016-08-22 23:32:25,550 [pingThread] INFO apache.commons.httpclient.HttpMethodDirector:438 - I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server gocd-dev.home.v-lad.org failed to respond
1631 2016-08-22 23:32:25,551 [pingThread] INFO apache.commons.httpclient.HttpMethodDirector:444 - Retrying request

That generally indicates a problem on the server, that it's unable to service any web requests. Could you paste the output of http://.../go/api/support into a gist/pastebin/whatever and provide a link here?

Please be sure to scrub out any sensitive information in the output, if you have any.

@ketan
Copy link
Member

ketan commented Aug 23, 2016

Or if it works better for you — https://gitter.im/gocd/gocd

@vladistan
Copy link

@zabil
Copy link
Contributor

zabil commented Aug 23, 2016

There's this issue #2145 that was fixed on 16.5

What version are you running?

Just checked and see that you are on 16.1. Upgrading to 16.5 should fix this for you.

@ketan
Copy link
Member

ketan commented Aug 23, 2016

@zabil — from the original trace #1202 (comment) the issue is the usual xslt transform, and does not appear to be the number of revisions (yet).

@vladistan — I suggest upgrading to the latest to give it another shot, if things fail with the same exception as #1202 (comment). Please use the workaround I provided.

@zabil
Copy link
Contributor

zabil commented Aug 23, 2016

@ketan I was talking about this
#1202 (comment)

@vladistan
Copy link

Ok, I still see the issue after upgrade to 16.9.0 here is the link to /go/api/support contents.

I don't see HttpMethodDirector errors anymore. The full log is below..

<invld>2016-10-09T15:37:33.393460-04:00 dkr-host-1 go-agent 2016-10-09 15:37:26,749 [pool-3-thread-1] INFO  thoughtworks.go.util.HttpService:136 - Got back 200 from server
<invld>2016-10-09T15:37:33.153494-04:00 gocd-master-2 go-server 2016-10-09 15:37:26,968 DEBUG [CCTray-Queue-Processor] CcTrayActivityListener:93 - Handling CCTray activity for job: JobInstance{stageId=2197, name='buildAndIntialTest', state=Scheduled, result=Unknown, agentUuid='null', stateTransitions=[com.thoughtworks.go.domain.JobStateTransition@907c91f[currentState=Rescheduled,stateChangeTime=Sun Oct 09 15:37:26 EDT 2016,jobId=3067,stageId=2197,id=15080], com.thoughtworks.go.domain.JobStateTransition@75b492e4[currentState=Scheduled,stateChangeTime=Sun Oct 09 15:37:26 EDT 2016,jobId=3068,stageId=2197,id=15081]], scheduledDate=Sun Oct 09 15:37:26 EDT 2016, timeProvider=com.thoughtworks.go.util.TimeProvider@5e2516da, ignored=false, matcher=com.thoughtworks.go.domain.BuildOutputMatcher@54ff727f, log=com.thoughtworks.go.domain.NullJobInstanceLog@4eb61b71, identifier=JobIdentifier[Pig-PR, 30, 30, defaultStage, 1, buildAndIntialTest, 3068], plan=null, runOnAllAgents=false, runMultipleInstance=false}
<invld>2016-10-09T15:37:33.153467-04:00 gocd-master-2 go-server 2016-10-09 15:37:26,968 DEBUG [rescheduleHungBuildThread] CcTrayActivityListener:88 - Adding CCTray activity for job into queue: JobInstance{stageId=2197, name='buildAndIntialTest', state=Scheduled, result=Unknown, agentUuid='null', stateTransitions=[com.thoughtworks.go.domain.JobStateTransition@907c91f[currentState=Rescheduled,stateChangeTime=Sun Oct 09 15:37:26 EDT 2016,jobId=3067,stageId=2197,id=15080], com.thoughtworks.go.domain.JobStateTransition@75b492e4[currentState=Scheduled,stateChangeTime=Sun Oct 09 15:37:26 EDT 2016,jobId=3068,stageId=2197,id=15081]], scheduledDate=Sun Oct 09 15:37:26 EDT 2016, timeProvider=com.thoughtworks.go.util.TimeProvider@5e2516da, ignored=false, matcher=com.thoughtworks.go.domain.BuildOutputMatcher@54ff727f, log=com.thoughtworks.go.domain.NullJobInstanceLog@4eb61b71, identifier=JobIdentifier[Pig-PR, 30, 30, defaultStage, 1, buildAndIntialTest, 3068], plan=null, runOnAllAgents=false, runMultipleInstance=false}
<invld>2016-10-09T15:37:33.153445-04:00 gocd-master-2 go-server 2016-10-09 15:37:26,965 DEBUG [CCTray-Queue-Processor] CcTrayActivityListener:93 - Handling CCTray activity for job: JobInstance{stageId=2197, name='buildAndIntialTest', state=Rescheduled, result=Unknown, agentUuid='c28c1c3b-748f-42ec-8d88-8d375842b363', stateTransitions=[com.thoughtworks.go.domain.JobStateTransition@907c91f[currentState=Rescheduled,stateChangeTime=Sun Oct 09 15:37:26 EDT 2016,jobId=3067,stageId=2197,id=15080], com.thoughtworks.go.domain.JobStateTransition@75b492e4[currentState=Scheduled,stateChangeTime=Sun Oct 09 15:37:26 EDT 2016,jobId=3068,stageId=2197,id=15081]], scheduledDate=Sun Oct 09 10:45:26 EDT 2016, timeProvider=com.thoughtworks.go.util.TimeProvider@5e2516da, ignored=false, matcher=com.thoughtworks.go.domain.BuildOutputMatcher@54ff727f, log=com.thoughtworks.go.domain.NullJobInstanceLog@4eb61b71, identifier=JobIdentifier[Pig-PR, 30, 30, defaultStage, 1, buildAndIntialTest, 3067], plan=null, runOnAllAgents=false, runMultipleInstance=false}
<invld>2016-10-09T15:37:33.153409-04:00 gocd-master-2 go-server 2016-10-09 15:37:26,965 DEBUG [rescheduleHungBuildThread] CcTrayActivityListener:88 - Adding CCTray activity for job into queue: JobInstance{stageId=2197, name='buildAndIntialTest', state=Rescheduled, result=Unknown, agentUuid='c28c1c3b-748f-42ec-8d88-8d375842b363', stateTransitions=[com.thoughtworks.go.domain.JobStateTransition@907c91f[currentState=Rescheduled,stateChangeTime=Sun Oct 09 15:37:26 EDT 2016,jobId=3067,stageId=2197,id=15080], com.thoughtworks.go.domain.JobStateTransition@75b492e4[currentState=Scheduled,stateChangeTime=Sun Oct 09 15:37:26 EDT 2016,jobId=3068,stageId=2197,id=15081]], scheduledDate=Sun Oct 09 10:45:26 EDT 2016, timeProvider=com.thoughtworks.go.util.TimeProvider@5e2516da, ignored=false, matcher=com.thoughtworks.go.domain.BuildOutputMatcher@54ff727f, log=com.thoughtworks.go.domain.NullJobInstanceLog@4eb61b71, identifier=JobIdentifier[Pig-PR, 30, 30, defaultStage, 1, buildAndIntialTest, 3067], plan=null, runOnAllAgents=false, runMultipleInstance=false}
<invld>2016-10-09T15:37:33.153385-04:00 gocd-master-2 go-server 2016-10-09 15:37:26,964 DEBUG [rescheduleHungBuildThread] AgentAssignment:45 - Removed agent assignment for job [JobInstance{stageId=2197, name='buildAndIntialTest', state=Rescheduled, result=Unknown, agentUuid='c28c1c3b-748f-42ec-8d88-8d375842b363', stateTransitions=[com.thoughtworks.go.domain.JobStateTransition@907c91f[currentState=Rescheduled,stateChangeTime=Sun Oct 09 15:37:26 EDT 2016,jobId=3067,stageId=2197,id=15080], com.thoughtworks.go.domain.JobStateTransition@75b492e4[currentState=Scheduled,stateChangeTime=Sun Oct 09 15:37:26 EDT 2016,jobId=3068,stageId=2197,id=15081]], scheduledDate=Sun Oct 09 10:45:26 EDT 2016, timeProvider=com.thoughtworks.go.util.TimeProvider@5e2516da, ignored=false, matcher=com.thoughtworks.go.domain.BuildOutputMatcher@54ff727f, log=com.thoughtworks.go.domain.NullJobInstanceLog@4eb61b71, identifier=JobIdentifier[Pig-PR, 30, 30, defaultStage, 1, buildAndIntialTest, 3067], plan=null, runOnAllAgents=false, runMultipleInstance=false}]
<invld>2016-10-09T15:37:33.153368-04:00 gocd-master-2 go-server 2016-10-09 15:37:26,963  INFO [rescheduleHungBuildThread] ScheduleService:619 - [Job Reschedule] Scheduled new job: JobIdentifier[Pig-PR, 30, 30, defaultStage, 1, buildAndIntialTest, 3068]. Replacing old job: JobIdentifier[Pig-PR, 30, 30, defaultStage, 1, buildAndIntialTest, 3067]
<invld>2016-10-09T15:37:33.153278-04:00 gocd-master-2 go-server 2016-10-09 15:37:26,945 DEBUG [rescheduleHungBuildThread] JobInstanceService:141 - job status updated [JobInstance{stageId=2197, name='buildAndIntialTest', state=Rescheduled, result=Unknown, agentUuid='c28c1c3b-748f-42ec-8d88-8d375842b363', stateTransitions=[com.thoughtworks.go.domain.JobStateTransition@907c91f[currentState=Rescheduled,stateChangeTime=Sun Oct 09 15:37:26 EDT 2016,jobId=3067,stageId=2197,id=15080]], scheduledDate=Sun Oct 09 10:45:26 EDT 2016, timeProvider=com.thoughtworks.go.util.TimeProvider@5e2516da, ignored=false, matcher=com.thoughtworks.go.domain.BuildOutputMatcher@54ff727f, log=com.thoughtworks.go.domain.NullJobInstanceLog@4eb61b71, identifier=JobIdentifier[Pig-PR, 30, 30, defaultStage, 1, buildAndIntialTest, 3067], plan=null, runOnAllAgents=false, runMultipleInstance=false}]
<invld>2016-10-09T15:37:33.152809-04:00 gocd-master-2 go-server 2016-10-09 15:37:26,935  WARN [rescheduleHungBuildThread] ScheduleService:601 - [Job Reschedule] Rescheduling and marking old job as ignored: JobInstance{stageId=2197, name='buildAndIntialTest', state=Building, result=Unknown, agentUuid='c28c1c3b-748f-42ec-8d88-8d375842b363', stateTransitions=[], scheduledDate=Sun Oct 09 10:45:26 EDT 2016, timeProvider=com.thoughtworks.go.util.TimeProvider@20903a36, ignored=false, matcher=com.thoughtworks.go.domain.BuildOutputMatcher@5cddbfed, log=com.thoughtworks.go.domain.NullJobInstanceLog@6263903c, identifier=JobIdentifier[Pig-PR, 30, 30, defaultStage, 1, buildAndIntialTest, 3067], plan=null, runOnAllAgents=false, runMultipleInstance=false}
<invld>2016-10-09T15:37:33.152802-04:00 gocd-master-2 go-server 2016-10-09 15:37:26,935  WARN [rescheduleHungBuildThread] ScheduleService:553 - Found hung job[id=JobInstance{stageId=2197, name='buildAndIntialTest', state=Building, result=Unknown, agentUuid='c28c1c3b-748f-42ec-8d88-8d375842b363', stateTransitions=[], scheduledDate=Sun Oct 09 10:45:26 EDT 2016, timeProvider=com.thoughtworks.go.util.TimeProvider@20903a36, ignored=false, matcher=com.thoughtworks.go.domain.BuildOutputMatcher@5cddbfed, log=com.thoughtworks.go.domain.NullJobInstanceLog@6263903c, identifier=JobIdentifier[Pig-PR, 30, 30, defaultStage, 1, buildAndIntialTest, 3067], plan=null, runOnAllAgents=false, runMultipleInstance=false}], rescheduling it
<invld>2016-10-09T15:37:33.152788-04:00 gocd-master-2 go-server 2016-10-09 15:37:26,873  INFO [cancelHungJobThread] ConsoleActivityMonitor:95 - Job 'Pig-PR/30/defaultStage/1/buildAndIntialTest' has not updated console log for more than '11' minutes

@vladistan
Copy link

vladistan commented Oct 15, 2016

So last week I did the following changes to our environment:

  1. Increased heap size on all agents on all hosts to 1G/6G
  2. Made sure that the system time is synchronized between master and all agents with max drift of 20sec. There was one agent host that had time drift of 20 minutes behind the master.
  3. Found that some pipelines were using old Artifact server that was taken down a while ago. This was creating a lot of exceptions in the log files. So I put a new server at that address.

After doing all that job restarts are not happening anymore. @ketan @zabil do you think any of the above could cause jobs to be randomly restarted?

@arvindsv
Copy link
Member

For the record, I think increasing heap size could have helped them not go out of memory (and hence restart the job).

@vladistan
Copy link

@arvindsv I still saw the restarts after heap size was increased and there were no heap errors reported in the logs.

Do you think time drift between slave / master or persistent failures in the artifact server could be related to this issue?

@arvindsv
Copy link
Member

arvindsv commented Nov 2, 2016

@vladistan I don't think the time drift should be a cause. That's opinion though, and not verified.

What do you mean by "artifact server"? Do you mean some kind of Nexus-like external artifact repository? If so, that shouldn't matter since it'll be a task which uploads.

If it's always happening and reproducible, it might be worth changing the log4j.properties on the agent side and set it to DEBUG. It will produce a lot of logs, but might have something useful.

@stale
Copy link

stale bot commented Apr 1, 2020

This issue has been automatically marked as stale because it has not had activity in the last 90 days.
If you can still reproduce this error on the master branch using local development environment or on the latest GoCD Release, please reply with all of the information you have about it in order to keep the issue open.
Thank you for all your contributions.

@stale stale bot added the stale label Apr 1, 2020
@stale stale bot closed this as completed Apr 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants