Unnecessary synchronized lock when invoking com.hazelcast.instance.LifecycleServiceImpl.isRunning() #2454
Comments
Thanks. Let me have a look. I hate excessive locking.
Wrote you some comments on Skype. @pveentjer @th0rb3n: Is it possible to create a bigger thread dump? :)
I'm not allowed to attach the thread dump: "Unfortunately, we don't support that file type." But I already mailed that info to you...
I don't think we need it. I removed the unwanted synchronization. The code was already thread-safe; there was no need to wrap it in synchronized blocks.
Maybe zip it? :) Or rename it to .txt ;)
The GitHub issue tracker only allows image attachments... The more important question is why Hazelcast becomes unstable/broken. Maybe this quick overview helps you out.

After about 20 minutes (150 concurrent users), issues like these start to show up:

com.openexchange.exception.OXException: SST-0005 Categories=ERROR Message='Removing session with session identifier 7e3f323c410c4c5a800e90bd6016d5fd failed.' exceptionID=2090470710-99638

After a short while, the number of threads spikes and TimeoutExceptions occur:

WARN: [10.20.31.42]:5701 [perf.qa] While asking 'is-executing': InvocationImpl{serviceName='hz:impl:mapService', op=RemoveOperation{sessions-6}, partitionId=194, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=5000, target=Address[10.20.31.41]:5701}

Then we get these, with thread counts still rising:

WARN: [10.20.31.42]:5701 [perf.qa] Retrying invocation: InvocationImpl{serviceName='hz:impl:mapService', op=RemoveOperation{sessions-6}, partitionId=152, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=120, callTimeout=5000, target=Address[10.20.31.41]:5701}, Reason: com.hazelcast.spi.exception.RetryableIOException: Packet not sent to -> Address[10.20.31.41]:5701

INFO: Failed to put session 4f91305599dc40718e7513b504b46cfa with Auth-Id 44ecf2f308634fdd8cce2acc0845e4db into session storage (user=596, context=1): SST-0003 Categories=ERROR Message='Saving session with session identifier 4f91305599dc40718e7513b504b46cfa failed.' exceptionID=2090470710-143656

Finally (probably after a certain number of retries), an error occurs indicating that the connection to the other Hazelcast node was lost. See o-x.log.15 and earlier.

ERROR: [10.20.31.42]:5701 [perf.qa] Could not join cluster, shutting down!

Subsequently, memory usage spikes; I assume this is because new sessions cannot be sent to the distributed session storage but are kept in memory without being invalidated on logout.
Lots of exceptions are thrown which relate to the earlier issue of losing the cluster connection:

com.hazelcast.core.HazelcastInstanceNotActiveException: Hazelcast instance is not active!

May 14, 2014 9:17:28 AM com.hazelcast.logging.Slf4jFactory$Slf4jLogger.log(Slf4jFactory$Slf4jLogger.java:87)
java.lang.IllegalStateException: Couldn't connect to discovered master! tryCount: 50 connection: null
This issue has been resolved in 3.2.2 and 3.3.
I have checked tags v3.2.2 and v3.2.4; they do not contain this fix. Please advise. The change I see is here, so can you please confirm which version contains this fix? Thanks!
I think you meant it is fixed in v3.3.2 and v3.3, correct?
…ntegrationTest (#2454)
* Cleaner message when getMasterAddress returns null
* Throw exception in ditchJobs() if job doesn't cancel
* Avoid swallowing exception in test
Created on behalf of Christoph Engelbert:
We faced the situation that once a Hazelcast cluster becomes unstable/broken, any thread that attempts to check Hazelcast's health status via com.hazelcast.instance.LifecycleServiceImpl.isRunning() gets BLOCKED, because a single thread running com.hazelcast.instance.LifecycleServiceImpl.runUnderLifecycleLock() apparently never releases that lock.
This leads to thousands of threads being BLOCKED at that location until the system becomes unresponsive.
Instead of using a synchronized block, the check could be implemented with an atomic variable (an AtomicXYZ type). A quick check of Hazelcast's lifecycle status would then never block.
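The suggestion above could look roughly like the following. This is a minimal, hypothetical sketch, not the actual Hazelcast implementation: the class and method names merely mirror LifecycleServiceImpl, and the AtomicBoolean flag stands in for whatever "AtomicXYZ" variable the fix would actually use.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the proposed fix (not Hazelcast source code).
class LifecycleService {
    // Lifecycle state is published through an atomic flag, so readers
    // never need to acquire a lock to observe it.
    private final AtomicBoolean active = new AtomicBoolean(true);
    private final Object lifecycleLock = new Object();

    // Lock-free read: health checks do not block even while another
    // thread holds the lifecycle lock for a long-running transition.
    boolean isRunning() {
        return active.get();
    }

    // State transitions still serialize on the lifecycle lock, but they
    // publish the new state via the atomic flag before/while tearing down.
    void shutdown() {
        synchronized (lifecycleLock) {
            active.set(false);
            // ... tear down services under the lock ...
        }
    }
}
```

With this shape, thousands of threads calling isRunning() cannot pile up BLOCKED behind runUnderLifecycleLock(); only actual state transitions contend on the lock.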