HAPI v2.4 deadlocks in BaseRuntimeElementDefinition.getConstructor(...) #664

Closed
karlmdavis opened this Issue Jun 5, 2017 · 7 comments

Comments

Projects
None yet
2 participants
@karlmdavis
Contributor

karlmdavis commented Jun 5, 2017

The HAPI server is completely deadlocking on me if I try to send enough requests at it. A threaddump is full of entries stuck like this one:

java.lang.Thread.State: BLOCKED (on object monitor)
  at java.util.Collections$SynchronizedMap.get(Collections.java:2584)
    - waiting to lock <0x00000006839dc418> (a java.util.Collections$SynchronizedMap)
  at ca.uhn.fhir.context.BaseRuntimeElementDefinition.getConstructor(BaseRuntimeElementDefinition.java:78)
  at ca.uhn.fhir.context.BaseRuntimeElementDefinition.newInstance(BaseRuntimeElementDefinition.java:162)
  at ca.uhn.fhir.parser.ParserState$ElementCompositeState.enteringNewElement(ParserState.java:1605)
  at ca.uhn.fhir.parser.ParserState.enteringNewElement(ParserState.java:98)
  at ca.uhn.fhir.parser.XmlParser.doXmlLoop(XmlParser.java:224)
  at ca.uhn.fhir.parser.XmlParser.parseResource(XmlParser.java:1050)
  at ca.uhn.fhir.parser.XmlParser.doParseResource(XmlParser.java:183)
  at ca.uhn.fhir.parser.BaseParser.parseResource(BaseParser.java:656)
  at org.hl7.fhir.dstu3.hapi.validation.DefaultProfileValidationSupport.loadStructureDefinitions(DefaultProfileValidationSupport.java:184)
  at org.hl7.fhir.dstu3.hapi.validation.DefaultProfileValidationSupport.provideStructureDefinitionMap(DefaultProfileValidationSupport.java:205)
  at org.hl7.fhir.dstu3.hapi.validation.DefaultProfileValidationSupport.fetchAllStructureDefinitions(DefaultProfileValidationSupport.java:52)
  at org.hl7.fhir.dstu3.hapi.validation.ValidationSupportChain.fetchAllStructureDefinitions(ValidationSupportChain.java:122)
  at org.hl7.fhir.dstu3.hapi.validation.HapiWorkerContext.allStructures(HapiWorkerContext.java:65)
  at org.hl7.fhir.dstu3.utils.FHIRPathEngine.<init>(FHIRPathEngine.java:149)
  at ca.uhn.fhir.jpa.dao.dstu3.SearchParamExtractorDstu3.extractValues(SearchParamExtractorDstu3.java:690)
  at ca.uhn.fhir.jpa.dao.dstu3.SearchParamExtractorDstu3.extractSearchParamStrings(SearchParamExtractorDstu3.java:375)
  at ca.uhn.fhir.jpa.dao.BaseHapiFhirDao.extractSearchParamStrings(BaseHapiFhirDao.java:436)
  at ca.uhn.fhir.jpa.dao.BaseHapiFhirDao.updateEntity(BaseHapiFhirDao.java:1250)
  at ca.uhn.fhir.jpa.dao.dstu3.FhirSystemDaoDstu3.doTransaction(FhirSystemDaoDstu3.java:511)
  at ca.uhn.fhir.jpa.dao.dstu3.FhirSystemDaoDstu3.transaction(FhirSystemDaoDstu3.java:260)
  at ca.uhn.fhir.jpa.dao.dstu3.FhirSystemDaoDstu3.transaction(FhirSystemDaoDstu3.java:254)
  at ca.uhn.fhir.jpa.dao.dstu3.FhirSystemDaoDstu3.transaction(FhirSystemDaoDstu3.java:80)
  at sun.reflect.GeneratedMethodAccessor163.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:333)
  at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:190)
  at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
  at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:99)
  at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:282)
  at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:96)
  at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
  at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:213)
  at com.sun.proxy.$Proxy128.transaction(Unknown Source)
  at ca.uhn.fhir.jpa.provider.dstu3.JpaSystemProviderDstu3.transaction(JpaSystemProviderDstu3.java:219)
  at sun.reflect.GeneratedMethodAccessor159.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at ca.uhn.fhir.rest.method.BaseMethodBinding.invokeServerMethod(BaseMethodBinding.java:258)
  at ca.uhn.fhir.rest.method.TransactionMethodBinding.invokeServer(TransactionMethodBinding.java:132)
  at ca.uhn.fhir.rest.method.BaseResourceReturningMethodBinding.doInvokeServer(BaseResourceReturningMethodBinding.java:306)
  at ca.uhn.fhir.rest.method.BaseResourceReturningMethodBinding.invokeServer(BaseResourceReturningMethodBinding.java:258)
  at ca.uhn.fhir.rest.server.RestfulServer.handleRequest(RestfulServer.java:676)
  at ca.uhn.fhir.rest.server.RestfulServer.doPost(RestfulServer.java:1236)
  at ca.uhn.fhir.rest.server.RestfulServer.service(RestfulServer.java:1208)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
  at io.undertow.servlet.handlers.ServletHandler.handleRequest(ServletHandler.java:85)
  at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:129)
  at com.codahale.metrics.servlet.AbstractInstrumentedFilter.doFilter(AbstractInstrumentedFilter.java:104)
  at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61)
  at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131)
  at org.ebaysf.web.cors.CORSFilter.handleNonCORS(CORSFilter.java:437)
  at org.ebaysf.web.cors.CORSFilter.doFilter(CORSFilter.java:172)
  at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61)
  at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131)
  at ch.qos.logback.classic.helpers.MDCInsertingServletFilter.doFilter(MDCInsertingServletFilter.java:51)
  at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61)
  at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131)
  at io.undertow.servlet.handlers.FilterHandler.handleRequest(FilterHandler.java:84)
  at io.undertow.servlet.handlers.security.ServletSecurityRoleHandler.handleRequest(ServletSecurityRoleHandler.java:62)
  at io.undertow.servlet.handlers.ServletDispatchingHandler.handleRequest(ServletDispatchingHandler.java:36)
  at org.wildfly.extension.undertow.security.SecurityContextAssociationHandler.handleRequest(SecurityContextAssociationHandler.java:78)
  at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
  at io.undertow.servlet.handlers.security.SSLInformationAssociationHandler.handleRequest(SSLInformationAssociationHandler.java:131)
  at io.undertow.servlet.handlers.security.ServletAuthenticationCallHandler.handleRequest(ServletAuthenticationCallHandler.java:57)
  at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
  at io.undertow.security.handlers.AbstractConfidentialityHandler.handleRequest(AbstractConfidentialityHandler.java:46)
  at io.undertow.servlet.handlers.security.ServletConfidentialityConstraintHandler.handleRequest(ServletConfidentialityConstraintHandler.java:64)
  at io.undertow.security.handlers.AuthenticationMechanismsHandler.handleRequest(AuthenticationMechanismsHandler.java:60)
  at io.undertow.servlet.handlers.security.CachedAuthenticatedSessionHandler.handleRequest(CachedAuthenticatedSessionHandler.java:77)
  at io.undertow.security.handlers.NotificationReceiverHandler.handleRequest(NotificationReceiverHandler.java:50)
  at io.undertow.security.handlers.AbstractSecurityContextAssociationHandler.handleRequest(AbstractSecurityContextAssociationHandler.java:43)
  at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
  at org.wildfly.extension.undertow.security.jacc.JACCContextIdHandler.handleRequest(JACCContextIdHandler.java:61)
  at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
  at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
  at io.undertow.servlet.handlers.ServletInitialHandler.handleFirstRequest(ServletInitialHandler.java:285)
  at io.undertow.servlet.handlers.ServletInitialHandler.dispatchRequest(ServletInitialHandler.java:264)
  at io.undertow.servlet.handlers.ServletInitialHandler.access$000(ServletInitialHandler.java:81)
  at io.undertow.servlet.handlers.ServletInitialHandler$1.handleRequest(ServletInitialHandler.java:175)
  at io.undertow.server.Connectors.executeRootHandler(Connectors.java:207)
  at io.undertow.server.HttpServerExchange$1.run(HttpServerExchange.java:802)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)

Interestingly, this only started happening once I "turned down" the ca.uhn.fhir.jpa.dao.SearchBuilder logging category -- I guess those log events changed the timing enough to avoid the deadlock. Unfortunately for me, though, I have to turn down that logging category, as it generates GBs and GBs of log noise during our ETL.

@jamesagnew

This comment has been minimized.

Show comment
Hide comment
@jamesagnew

jamesagnew Jun 5, 2017

Owner

Hi Karl,

Would you be able to try this using the latest 2.5-SNAPSHOT builds? We resolved a deadlock in that version (which will be released very shortly) so it's quite likely you are experiencing the same thing.

Owner

jamesagnew commented Jun 5, 2017

Hi Karl,

Would you be able to try this using the latest 2.5-SNAPSHOT builds? We resolved a deadlock in that version (which will be released very shortly) so it's quite likely you are experiencing the same thing.

@karlmdavis

This comment has been minimized.

Show comment
Hide comment
@karlmdavis

karlmdavis Jun 5, 2017

Contributor

@jamesagnew: I just found a couple threads around that in the user group. But neither of those seemed to be in the XML parsing code like this one is. Still think that's likely fixed?

And yeah, I can try out the latest snapshot in dev. Busy day today, but I'll try to kick off a benchmark run at some point here (easiest way for me to test that).

Contributor

karlmdavis commented Jun 5, 2017

@jamesagnew: I just found a couple threads around that in the user group. But neither of those seemed to be in the XML parsing code like this one is. Still think that's likely fixed?

And yeah, I can try out the latest snapshot in dev. Busy day today, but I'll try to kick off a benchmark run at some point here (easiest way for me to test that).

@karlmdavis karlmdavis referenced this issue in CMSgov/bluebutton-data-server Jun 5, 2017

Merged

CBBF-60: Dialing down the log spamming from ETL #31

@karlmdavis

This comment has been minimized.

Show comment
Hide comment
@karlmdavis

karlmdavis Jun 5, 2017

Contributor

Small note: Apparently, this isn't related to my logging change at all. It's still happening even after I tried rolling that back. Bother...

Contributor

karlmdavis commented Jun 5, 2017

Small note: Apparently, this isn't related to my logging change at all. It's still happening even after I tried rolling that back. Bother...

@karlmdavis

This comment has been minimized.

Show comment
Hide comment
@karlmdavis

karlmdavis Jun 5, 2017

Contributor

So... been staring at this for a bit now, and I'm pretty convinced that this isn't a threading/synchronization problem in this code at all. The code in BaseRuntimeElementDefinition isn't broken. It'll definitely be slow -- it'll redo the reflection lookup a bunch of unnecessary times until its cache is warmed up -- but that shouldn't cause deadlocks.

And in fact, if I let HAPI run long enough, it eventually dies with an OutOfMemoryError: GC overhead limit exceeded error. There's something else going on here -- not sure what yet, though.

Contributor

karlmdavis commented Jun 5, 2017

So... been staring at this for a bit now, and I'm pretty convinced that this isn't a threading/synchronization problem in this code at all. The code in BaseRuntimeElementDefinition isn't broken. It'll definitely be slow -- it'll redo the reflection lookup a bunch of unnecessary times until its cache is warmed up -- but that shouldn't cause deadlocks.

And in fact, if I let HAPI run long enough, it eventually dies with an OutOfMemoryError: GC overhead limit exceeded error. There's something else going on here -- not sure what yet, though.

@jamesagnew

This comment has been minimized.

Show comment
Hide comment
@jamesagnew

jamesagnew Jun 5, 2017

Owner

Ok, upon staring at that a bit myself, I don't see how it could be threading related. That map only has locks on read and write operations and neither holds the lock (I assume you came to this conclusion too :) ), so that probably isn't the root cause.

The system spending all of its time on garbage collection would certainly explain this though.

If you look at the stack traces of the various threads when this happens, do they all have this bit near the top?

  at org.hl7.fhir.dstu3.hapi.validation.DefaultProfileValidationSupport.loadStructureDefinitions(DefaultProfileValidationSupport.java:184)
  at org.hl7.fhir.dstu3.hapi.validation.DefaultProfileValidationSupport.provideStructureDefinitionMap(DefaultProfileValidationSupport.java:205)
  at org.hl7.fhir.dstu3.hapi.validation.DefaultProfileValidationSupport.fetchAllStructureDefinitions(DefaultProfileValidationSupport.java:52)

That would probably explain it... That bit loads all of the structure definitions into memory. It'll only do it once under normal conditions, but I can see how it would do it lots of times if the server gets a whole bunch of requests at the same time right at startup.

Are you in a position to try this against a modified build of HAPI? I would postulate that putting a double-check synchronized block in DefaultProfileValidationSupport#provideStructureDefinitionMap would solve this.

Owner

jamesagnew commented Jun 5, 2017

Ok, upon staring at that a bit myself, I don't see how it could be threading related. That map only has locks on read and write operations and neither holds the lock (I assume you came to this conclusion too :) ), so that probably isn't the root cause.

The system spending all of its time on garbage collection would certainly explain this though.

If you look at the stack traces of the various threads when this happens, do they all have this bit near the top?

  at org.hl7.fhir.dstu3.hapi.validation.DefaultProfileValidationSupport.loadStructureDefinitions(DefaultProfileValidationSupport.java:184)
  at org.hl7.fhir.dstu3.hapi.validation.DefaultProfileValidationSupport.provideStructureDefinitionMap(DefaultProfileValidationSupport.java:205)
  at org.hl7.fhir.dstu3.hapi.validation.DefaultProfileValidationSupport.fetchAllStructureDefinitions(DefaultProfileValidationSupport.java:52)

That would probably explain it... That bit loads all of the structure definitions into memory. It'll only do it once under normal conditions, but I can see how it would do it lots of times if the server gets a whole bunch of requests at the same time right at startup.

Are you in a position to try this against a modified build of HAPI? I would postulate that putting a double-check synchronized block in DefaultProfileValidationSupport#provideStructureDefinitionMap would solve this.

@jamesagnew

This comment has been minimized.

Show comment
Hide comment
@jamesagnew

jamesagnew Jun 6, 2017

Owner

Ok I was able to reproduce by doing a bunch of validates in parallel threads. I've fixed it with a synchronize block, at least here. I'm going to check that in and close this bug, but please feel free to re-open or comment if this fix doesn't help in your environment.

Owner

jamesagnew commented Jun 6, 2017

Ok I was able to reproduce by doing a bunch of validates in parallel threads. I've fixed it with a synchronize block, at least here. I'm going to check that in and close this bug, but please feel free to re-open or comment if this fix doesn't help in your environment.

@jamesagnew jamesagnew closed this in 7c6bb01 Jun 7, 2017

@karlmdavis

This comment has been minimized.

Show comment
Hide comment
@karlmdavis

karlmdavis Jun 8, 2017

Contributor

James: Thanks a ton! (Sorry, I've been stuck off-site for the last two days, and haven't been able to juggle anything else.)

To answer your question above: yes, they all had that bit near the top.

After your earlier comments and hints, I was also able to workaround this by bumping our heap from 5GB to 120GB or something like that. It probably didn't need that much, but the server had it (huzzah, EC2!), so why not?

Contributor

karlmdavis commented Jun 8, 2017

James: Thanks a ton! (Sorry, I've been stuck off-site for the last two days, and haven't been able to juggle anything else.)

To answer your question above: yes, they all had that bit near the top.

After your earlier comments and hints, I was also able to workaround this by bumping our heap from 5GB to 120GB or something like that. It probably didn't need that much, but the server had it (huzzah, EC2!), so why not?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment