
Unnecessary synchronized lock when invoking com.hazelcast.instance.LifecycleServiceImpl.isRunning() #2454

Closed
th0rb3n opened this issue May 14, 2014 · 10 comments

@th0rb3n

th0rb3n commented May 14, 2014

Created on behalf of Christoph Engelbert:

We faced a situation where, once the Hazelcast cluster becomes unstable/broken, any thread that attempts to check Hazelcast's health status via com.hazelcast.instance.LifecycleServiceImpl.isRunning() gets BLOCKED, because a single thread running com.hazelcast.instance.LifecycleServiceImpl.runUnderLifecycleLock() appears never to release that lock.

This leads to thousands of threads being BLOCKED at that location until the system becomes unresponsive.
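The contention described above can be sketched like this (a hypothetical minimal model, not the actual Hazelcast source; the class and field names are invented for illustration):

```java
// Hypothetical model of the reported problem: isRunning() and
// runUnderLifecycleLock() share one monitor, so while a long-running
// life-cycle action (e.g. a cluster merge) holds the lock, every
// isRunning() caller sits in state BLOCKED.
final class LockedLifecycle {
    private final Object lifecycleLock = new Object();
    private boolean running = true;

    boolean isRunning() {
        synchronized (lifecycleLock) { // contends with runUnderLifecycleLock()
            return running;
        }
    }

    void runUnderLifecycleLock(Runnable action) {
        synchronized (lifecycleLock) { // held for the full duration of 'action'
            action.run();
        }
    }
}
```

If `action` never completes, the monitor is never released and every health check in the process piles up behind it.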

Instead of using a synchronized block, the check could be implemented with an atomic variable (e.g. an AtomicBoolean), so that a quick check of Hazelcast's life-cycle status does not block.
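A lock-free variant along the suggested lines might look like this (a sketch with invented names, not the actual fix that landed in Hazelcast):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the suggestion: state transitions may still be coordinated
// elsewhere, but the read side is a plain atomic load and can never
// block, no matter how long any life-cycle lock is held.
final class LockFreeLifecycle {
    private final AtomicBoolean running = new AtomicBoolean(true);

    boolean isRunning() {
        return running.get(); // lock-free; safe to call from any thread
    }

    void shutdown() {
        // compareAndSet makes the transition happen exactly once even if
        // several threads race to shut down.
        if (running.compareAndSet(true, false)) {
            // release resources, fire life-cycle events, etc.
        }
    }
}
```

With this shape, a thread stuck inside a long life-cycle operation no longer blocks health checks from other threads.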

@th0rb3n th0rb3n closed this as completed May 14, 2014
@th0rb3n th0rb3n reopened this May 14, 2014
@pveentjer
Contributor

Thanks. Let me have a look. I hate excessive locking.

@noctarius
Contributor

Wrote you some comments on Skype, @pveentjer.

@th0rb3n: Is it possible to create a bigger thread dump? :)

@pveentjer
Contributor

#2457
#2455

@pveentjer pveentjer added this to the 3.2.2 milestone May 14, 2014
@th0rb3n
Author

th0rb3n commented May 14, 2014

I'm not allowed to attach the thread dump: "Unfortunately, we don't support that file type."

But I already mailed that info to you...

@pveentjer
Contributor

I don't think we need it. I removed the unwanted synchronization. The code was already thread-safe; there was no need to wrap it in synchronized blocks.

@noctarius
Contributor

Maybe zip it? :) Or rename it to .txt ;)

@th0rb3n
Author

th0rb3n commented May 14, 2014

GitHub's issue tracker only allows image attachments...
As said before, there is not that much to investigate in the thread dump.

The more important question is why Hazelcast becomes unstable/broken in the first place.

Maybe this quick overview helps you out.

After about 20 minutes (150 concurrent users), issues like these start to show up:

com.openexchange.exception.OXException: SST-0005 Categories=ERROR Message='Removing session with session identifier 7e3f323c410c4c5a800e90bd6016d5fd failed.' exceptionID=2090470710-99638
Caused by: com.hazelcast.core.OperationTimeoutException: No response for 10000 ms. Aborting invocation! InvocationFuture{invocation=InvocationImpl{ serviceName='hz:impl:mapService', op=RemoveOperation{sessions-6}, partitionId=220, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=5000, target=Address[10.20.31.41]:5701}, done=false}

After a short while, the number of threads spikes and TimeoutExceptions occur:

WARN: [10.20.31.42]:5701 [perf.qa] While asking 'is-executing': InvocationImpl{ serviceName='hz:impl:mapService', op=RemoveOperation{sessions-6}, partitionId=194, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=5000, target=Address[10.20.31.41]:5701}
java.util.concurrent.TimeoutException: null

Then we get these, with thread counts still rising:

WARN: [10.20.31.42]:5701 [perf.qa] Retrying invocation: InvocationImpl{ serviceName='hz:impl:mapService', op=RemoveOperation{sessions-6}, partitionId=152, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=120, callTimeout=5000, target=Address[10.20.31.41]:5701}, Reason: com.hazelcast.spi.exception.RetryableIOException: Packet not sent to -> Address[10.20.31.41]:5701

INFO: Failed to put session 4f91305599dc40718e7513b504b46cfa with Auth-Id 44ecf2f308634fdd8cce2acc0845e4db into session storage (user=596, context=1): SST-0003 Categories=ERROR Message='Saving session with session identifier 4f91305599dc40718e7513b504b46cfa failed.' exceptionID=2090470710-143656

Finally (probably after a certain number of retries), an error occurs indicating that the connection to the other Hazelcast node was lost. See o-x.log.15 and earlier.

ERROR: [10.20.31.42]:5701 [perf.qa] Could not join cluster, shutting down!
java.lang.IllegalStateException

Subsequently, memory usage spikes; I assume this is because new sessions cannot be sent to the distributed session storage and are instead kept in memory without being invalidated on logout. Lots of exceptions are thrown that relate to the earlier loss of the cluster connection.

com.hazelcast.core.HazelcastInstanceNotActiveException: Hazelcast instance is not active!

May 14, 2014 9:17:28 AM com.hazelcast.logging.Slf4jFactory$Slf4jLogger.log(Slf4jFactory$Slf4jLogger.java:87)
ERROR: [10.20.31.42]:5701 [perf.qa] Could not join cluster, shutting down!

java.lang.IllegalStateException:

Couldn't connect to discovered master! tryCount: 50
address: Address[10.20.31.42]:5701
masterAddress: Address[10.20.31.41]:5701
multicast: true

connection: null

at com.hazelcast.cluster.AbstractJoiner.failedJoiningToMaster(AbstractJoiner.java:139)
at com.hazelcast.cluster.MulticastJoiner.doJoin(MulticastJoiner.java:69)
at com.hazelcast.cluster.AbstractJoiner.join(AbstractJoiner.java:61)
at com.hazelcast.instance.Node.join(Node.java:525)
at com.hazelcast.instance.Node.rejoin(Node.java:514)
at com.hazelcast.instance.Node.join(Node.java:530)
at com.hazelcast.instance.Node.rejoin(Node.java:514)
at com.hazelcast.instance.Node.join(Node.java:530)
at com.hazelcast.instance.Node.rejoin(Node.java:514)
at com.hazelcast.instance.Node.join(Node.java:530)
at com.hazelcast.instance.Node.rejoin(Node.java:514)
at com.hazelcast.instance.Node.join(Node.java:530)
at com.hazelcast.instance.Node.rejoin(Node.java:514)
at com.hazelcast.instance.Node.join(Node.java:530)
at com.hazelcast.instance.Node.rejoin(Node.java:514)
at com.hazelcast.instance.Node.join(Node.java:530)
at com.hazelcast.instance.Node.rejoin(Node.java:514)
at com.hazelcast.instance.Node.join(Node.java:530)
at com.hazelcast.instance.Node.rejoin(Node.java:514)
at com.hazelcast.instance.Node.join(Node.java:530)
at com.hazelcast.instance.Node.rejoin(Node.java:514)
at com.hazelcast.instance.Node.join(Node.java:530)
at com.hazelcast.instance.Node.rejoin(Node.java:514)
at com.hazelcast.instance.Node.join(Node.java:530)
at com.hazelcast.instance.Node.rejoin(Node.java:514)
at com.hazelcast.instance.Node.join(Node.java:530)
at com.hazelcast.instance.Node.rejoin(Node.java:514)
at com.hazelcast.instance.Node.join(Node.java:530)
at com.hazelcast.instance.Node.rejoin(Node.java:514)
at com.hazelcast.cluster.ClusterServiceImpl$6.run(ClusterServiceImpl.java:589)
at com.hazelcast.instance.LifecycleServiceImpl.runUnderLifecycleLock(LifecycleServiceImpl.java:94)
at com.hazelcast.cluster.ClusterServiceImpl.merge(ClusterServiceImpl.java:571)
at com.hazelcast.cluster.MergeClustersOperation.run(MergeClustersOperation.java:53)
at com.hazelcast.spi.impl.OperationServiceImpl.doRunOperation(OperationServiceImpl.java:274)
at com.hazelcast.spi.impl.OperationServiceImpl.runOperation(OperationServiceImpl.java:184)
at com.hazelcast.cluster.AbstractJoiner.startClusterMerge(AbstractJoiner.java:244)
at com.hazelcast.cluster.MulticastJoiner.searchForOtherClusters(MulticastJoiner.java:118)
at com.hazelcast.cluster.SplitBrainHandler.searchForOtherClusters(SplitBrainHandler.java:46)
at com.hazelcast.cluster.SplitBrainHandler.run(SplitBrainHandler.java:36)
at com.hazelcast.util.executor.ManagedExecutorService$Worker.run(ManagedExecutorService.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
at com.hazelcast.util.executor.PoolExecutorThreadFactory$ManagedThread.run(PoolExecutorThreadFactory.java:59)

@pveentjer
Contributor

This issue has been resolved in 3.2.2 and 3.3.

@dspatel81

I think you meant it is fixed in v3.3.2 and v3.3, correct?

frant-hartm pushed a commit that referenced this issue Mar 26, 2021
…ntegrationTest (#2454)

* Cleaner message when getMasterAddress returns null
* Throw exception in ditchJobs() if job doesn't cancel
* Avoid swallowing exception in test