Hazelcast terminates listener incorrectly (Enterprise 3.7.6 version) #10101

tqt9209 · 2017-03-17T16:01:09Z

This is the exact same issue in this ticket: #9766
The problem is not fixed in 3.7.6. I am not sure how the original ticket was closed when the problem still exists.
Here is the description of the problem again:
I'm using Enterprise 3.7.6 version.
Application ->hzClient (3.7.6 version) ------------------- hzServer1 (3.7.6)
------------------- hzServer2 (3.7.6)
hzServer1 and hzServer2 are members in the same cluster.
hzClient registers message listener with the cluster to receive messages from reliable topic.

Steps to reproduce:
stop hzServer1
start hzServer1
stop hzServer2
start hzServer2
... repeat couple times as needed.
Basically alternate start/stop of the two servers but leave one server active at a time.
The listener is then terminated incorrectly (see log below). After that client never receives any messages from the topic again.

Please make sure that you address all the items below and test it carefully before closing the ticket. It's very time consuming to pick up new version and problem still not fixed.

When isTerminal(Throwable failure) on ReliableMessageListener returns false (and isLossTolerant() returns true) then listener CANNOT BE TERMINATED FOR ANY REASON. Currently, look like these call back NEVER INVOKED which looks like more bugs.
When you get exception (ex: TargetDisconnectException like below), feed the exception into isTerminal(...) call back. If the call back returns false then find another active cluster member and re-register the listeners. If there are no active member, wait until one of them become active and register listeners again. Basically the logic that register/re-register listener needs to be observer of connection state so that it can re-register (as needed) when cluster member disconnect/reconnect.
Make sure that fix is tested carefully so that fixed really means fixed. Steps to test.

Create a cluster with two members hzServer1 and hzServer2
Start hzServer1 first then start hzServer2
Create client and register for reliable topic.
Create a publish client.
Publish something and verify that client receives the messages.
Stop hzServer1
Look at logs and verify that listener not terminated.
Publish something and make sure that client receives the messages.
Start hzServer1 and wait until it's up.
Stop hzServer2
Look at logs and verify that listener not terminated.
Publish something and make sure that client receives the messages.
Start hzServer2
Repeat this cycle 3-4 times to make sure that listeners not terminated and client still receives message.

Logs:

2017.03.17 11:15:09:298 EDT | Warn       | Hazelcast
        hz.client_0 [hzc1] [3.7.6] Terminating MessageListener com.broadsoft.persistence.hazelcast.BWHazelcastInstance$TopicContainer$ReliableMessageListenerImpl@111bed78 on topic: profileManagementUpdate. Reason: Unhandled exception, message: com.hazelcast.spi.exception.TargetDisconnectedException: Disconnecting from member [10.16.176.21]:5701 due to heartbeat problems. Current time: 2017-03-17 11:15:09.265. Last heartbeat requested: 2017-03-17 11:14:49.266. Last heartbeat received: 2017-03-17 11:14:49.268. Last read: 2017-03-17 11:15:08.561. Connection ClientConnection{live=false, connectionId=2, socketChannel=SSLSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[closed]}, remoteEndpoint=[10.16.176.21]:5701, lastReadTime=2017-03-17 11:15:08.561, lastWriteTime=2017-03-17 11:14:59.443, closedTime=2017-03-17 11:15:08.561, lastHeartbeatRequested=2017-03-17 11:14:49.266, lastHeartbeatReceived=2017-03-17 11:14:49.268, connected server version=3.7.6} <java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.TargetDisconnectedException: Disconnecting from member [10.16.176.21]:5701 due to heartbeat problems. Current time: 2017-03-17 11:15:09.265. Last heartbeat requested: 2017-03-17 11:14:49.266. Last heartbeat received: 2017-03-17 11:14:49.268. Last read: 2017-03-17 11:15:08.561. Connection ClientConnection{live=false, connectionId=2, socketChannel=SSLSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[closed]}, remoteEndpoint=[10.16.176.21]:5701, lastReadTime=2017-03-17 11:15:08.561, lastWriteTime=2017-03-17 11:14:59.443, closedTime=2017-03-17 11:15:08.561, lastHeartbeatRequested=2017-03-17 11:14:49.266, lastHeartbeatReceived=2017-03-17 11:14:49.268, connected server version=3.7.6}>java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.TargetDisconnectedException: Disconnecting from member [10.16.176.21]:5701 due to heartbeat problems. Current time: 2017-03-17 11:15:09.265. Last heartbeat requested: 2017-03-17 11:14:49.266. Last heartbeat received: 2017-03-17 11:14:49.268. Last read: 2017-03-17 11:15:08.561. Connection ClientConnection{live=false, connectionId=2, socketChannel=SSLSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[closed]}, remoteEndpoint=[10.16.176.21]:5701, lastReadTime=2017-03-17 11:15:08.561, lastWriteTime=2017-03-17 11:14:59.443, closedTime=2017-03-17 11:15:08.561, lastHeartbeatRequested=2017-03-17 11:14:49.266, lastHeartbeatReceived=2017-03-17 11:14:49.268, connected server version=3.7.6}
        at com.hazelcast.client.spi.impl.ClientInvocationFuture.resolve(ClientInvocationFuture.java:66)
        at com.hazelcast.spi.impl.AbstractInvocationFuture$1.run(AbstractInvocationFuture.java:246)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
        at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
        at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)
Caused by: com.hazelcast.spi.exception.TargetDisconnectedException: Disconnecting from member [10.16.176.21]:5701 due to heartbeat problems. Current time: 2017-03-17 11:15:09.265. Last heartbeat requested: 2017-03-17 11:14:49.266. Last heartbeat received: 2017-03-17 11:14:49.268. Last read: 2017-03-17 11:15:08.561. Connection ClientConnection{live=false, connectionId=2, socketChannel=SSLSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[closed]}, remoteEndpoint=[10.16.176.21]:5701, lastReadTime=2017-03-17 11:15:08.561, lastWriteTime=2017-03-17 11:14:59.443, closedTime=2017-03-17 11:15:08.561, lastHeartbeatRequested=2017-03-17 11:14:49.266, lastHeartbeatReceived=2017-03-17 11:14:49.268, connected server version=3.7.6}
        at com.hazelcast.spi.exception.TargetDisconnectedException.newTargetDisconnectedExceptionCausedByHeartbeat(TargetDisconnectedException.java:66)
        at com.hazelcast.client.spi.impl.ClientInvocationServiceSupport$CleanResourcesTask.run(ClientInvocationServiceSupport.java:221)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        ... 5 more
Caused by: java.io.EOFException: Remote socket closed!
        at com.hazelcast.client.connection.nio.ClientReadHandler.handle(ClientReadHandler.java:87)
        at com.hazelcast.nio.tcp.nonblocking.NonBlockingIOThread.handleSelectionKey(NonBlockingIOThread.java:345)
        at com.hazelcast.nio.tcp.nonblocking.NonBlockingIOThread.handleSelectionKeys(NonBlockingIOThread.java:330)
        at com.hazelcast.nio.tcp.nonblocking.NonBlockingIOThread.selectLoop(NonBlockingIOThread.java:248)
        at com.hazelcast.nio.tcp.nonblocking.NonBlockingIOThread.run(NonBlockingIOThread.java:201)

The text was updated successfully, but these errors were encountered:

sancar · 2017-03-21T11:31:54Z

@tqt9209
As you said, I can easily reproduce the issue with 3.7.6
It seems that issue is only fixed in 3.8-SNAPSHOT and not backported to fix 3.7.6 .
#9811
Closing the issue #9766 that is assigned 3.7.6 to was misinforming, sorry for that.
And we are not planning to backport it anymore.

I have tried with our latest release 3.8.1 and can not reproduce the issue.
That has been said your findings looks suspicious to me too. It may be the case that there are some other bugs around. @pveentjer Can you take a look and clear things our for us ?

tqt9209 · 2017-03-22T15:37:29Z

I assume that the released version 3.8 doesn't have the fix (I don't see #9811 in the release note of 3.8).
The version that you tested (3.8.1) is not released yet right?

sancar · 2017-03-23T07:48:36Z

@tqt9209
Fix is done in 3.8.
Because milestone of #9766 was set to 3.7.6 at the time, it is documented that it is fixed in 3.7.6, which is not really the case. It is actually fixed in 3.8. Since when releasing we did not realised that, 3.8 release notes does not have that info.
Let me see if we can fix the release notes.

tqt9209 · 2017-03-27T21:39:20Z

I tried 3.8 version and it works much better.
Listener still terminated one time. I cannot reproduce it again however.
I think I stopped one cluster member then turn off the server (for that cluster member).

Exception log is below. Please take a look.
Thanks

        hz.client_0 [hzc1] [3.8] Terminating MessageListener com.broadsoft.persistence.hazelcast.BWHazelcastInstance$TopicContainer$ReliableMessageListenerImpl@655ac9a5 on topic: profileManagementUpdate. Reason: Unhandled exception, message: java.io.IOException: No available connection to address [10.16.176.21]:5701 <java.util.concurrent.ExecutionException: java.io.IOException: No available connection to address [10.16.176.21]:5701>java.util.concurrent.ExecutionException: java.io.IOException: No available connection to address [10.16.176.21]:5701
        at com.hazelcast.client.spi.impl.ClientInvocationFuture.resolve(ClientInvocationFuture.java:66)
        at com.hazelcast.spi.impl.AbstractInvocationFuture$1.run(AbstractInvocationFuture.java:243)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
        at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
        at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)
Caused by: java.io.IOException: No available connection to address [10.16.176.21]:5701
        at com.hazelcast.client.spi.impl.ClientSmartInvocationServiceImpl.getOrTriggerConnect(ClientSmartInvocationServiceImpl.java:88)
        at com.hazelcast.client.spi.impl.ClientSmartInvocationServiceImpl.invokeOnPartitionOwner(ClientSmartInvocationServiceImpl.java:47)
        at com.hazelcast.client.spi.impl.ClientInvocation.invokeOnSelection(ClientInvocation.java:133)
        at com.hazelcast.client.spi.impl.ClientInvocation.invoke(ClientInvocation.java:114)
        at com.hazelcast.client.spi.impl.ClientInvocation.run(ClientInvocation.java:144)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at com.hazelcast.util.executor.LoggingScheduledExecutor$LoggingDelegatingFuture.run(LoggingScheduledExecutor.java:128)
        ... 5 more

sancar · 2017-03-30T09:29:25Z

This IOException is related to #9696
Once that is fixed, there should not be any issue here. Closing this one.

bart-w0 · 2017-08-25T16:37:22Z

Hello, I updated hazelcast libraries from 3.7.4 to 3.8.4 in hope to fix listeners' disconnections, but I still experience this issue.

Steps to reproduce:

Run two hazelcast instances in a cluster (instance-1, instance-2).
Connect a tool (client) that saves a new object to a hazelcast map every 1 second.
Connect a client that reads events from a topic (configured to message new objects from the map) using Listeners. Verify that messages are being received.
Restart instance1 or instance2. Repeat few times.
Not all messages are being received - these which are being saved at instance-X are lost.

Logs on the client:

[2017-08-25 17:22:39.099][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4726
[2017-08-25 17:22:40.116][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4727
[2017-08-25 17:22:40.859][hz.client_0.event-4] INFO c.h.c.s.i.ClientMembershipListener@log:49 - hz.client_0 [data_test_group] [3.8.4] 

Members [2] {
	Member [instance-2.test.com]:5701 - 016a33fb-3eae-46c4-8a22-9a95490bf1fe
	Member [instance-1.test.com]:5701 - 57df3d12-5970-41f4-92a8-da11e08d694f
}

[2017-08-25 17:22:41.128][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4728
[2017-08-25 17:22:41.258][hz.hz.client_0.IO.thread-out-0] DEBUG c.h.c.c.n.ClientSocketWriterInitializer@log:49 - Initializing ClientSocketWriter WriteHandler with Cluster Protocol
[2017-08-25 17:22:41.291][hz.client_0.internal-3] INFO c.h.c.c.ClientConnectionManager@log:49 - hz.client_0 [data_test_group] [3.8.4] Authenticated with server [instance-1.test.com]:5701, server version:3.8.4 Local address: /10.230.35.99:57247
[2017-08-25 17:22:42.144][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4729
[2017-08-25 17:22:43.168][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4730
[2017-08-25 17:22:45.273][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4732
…….
[2017-08-25 17:23:09.227][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4755
[2017-08-25 17:23:10.263][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4756
[2017-08-25 17:23:11.294][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4757
[2017-08-25 17:23:12.335][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4758
[2017-08-25 17:23:13.360][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4759
[2017-08-25 17:23:14.232][hz.hz.client_0.IO.thread-in-0] WARN c.h.c.c.nio.ClientConnection@log:49 - hz.client_0 [data_test_group] [3.8.4] ClientConnection{alive=false, connectionId=13, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/10.230.35.99:57126 remote=instance-2.test.com/10.226.13.161:5701]}, remoteEndpoint=[instance-2.test.com]:5701, lastReadTime=2017-08-25 17:23:14.232, lastWriteTime=2017-08-25 17:18:10.004, closedTime=2017-08-25 17:23:14.232, lastHeartbeatRequested=2017-08-25 17:17:05.336, lastHeartbeatReceived=2017-08-25 17:17:05.341, connected server version=3.8.4} lost. Reason: java.io.EOFException[Remote socket closed!]
[2017-08-25 17:23:14.391][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4760
[2017-08-25 17:23:15.492][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4761
[2017-08-25 17:23:17.516][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4763
[2017-08-25 17:23:19.559][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4765
[2017-08-25 17:23:20.582][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4766

We can see instance reconnecting and later, a socket is being closed. Is it the same issue? Is there a solution for this?

bart-w0 · 2017-08-29T10:18:10Z

Hello @sancar, do you think this will be fixed in 3.8.6? Should you reopen this issue?

sancar · 2017-08-29T12:56:53Z

Hi @bart-w0,
We don't have any known issues in 3.8.4.

Are you using Topic or ReliableTopic ? Original issue was related to ReliableTopic.
We have recently fixed a problem regarding use of hostnames in memberlist. You might be suffering from that and listener problem could be one of the causes. Can you try using ip-addresses instead of hostnames to see if it solves your problem.

I did not quite get the third step.
What I understand is following:
A client adds message listener to a topic. And the client is calling topic.publish when EntryListener.entryAdded event is called. Is that correct ?
It would be better if you can share sample client app for us to reproduce the problem.

sancar · 2017-09-05T08:40:07Z

Hi @bart-w0
There seems to be multiple report of same issue.
We will follow the issue in #11264

tqt9209 mentioned this issue Mar 17, 2017

Hazelcast terminates listener incorrectly. #9766

Closed

sancar added Team: Client Type: Defect labels Mar 20, 2017

sancar added this to the 3.8.2 milestone Mar 20, 2017

sancar closed this as completed Mar 30, 2017

sancar modified the milestones: 3.8.6, 3.8.2 Aug 28, 2017

sancar mentioned this issue Sep 5, 2017

Hazelcast (3.8.4) closing listeners #11264

Closed

mmedenjak added the Source: Community PR or issue was opened by a community user label Jul 11, 2020

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hazelcast terminates listener incorrectly (Enterprise 3.7.6 version) #10101

Hazelcast terminates listener incorrectly (Enterprise 3.7.6 version) #10101

tqt9209 commented Mar 17, 2017 •

edited by sancar

Loading

sancar commented Mar 21, 2017

tqt9209 commented Mar 22, 2017

sancar commented Mar 23, 2017

tqt9209 commented Mar 27, 2017 •

edited

Loading

sancar commented Mar 30, 2017

bart-w0 commented Aug 25, 2017

bart-w0 commented Aug 29, 2017

sancar commented Aug 29, 2017

sancar commented Sep 5, 2017

Hazelcast terminates listener incorrectly (Enterprise 3.7.6 version) #10101

Hazelcast terminates listener incorrectly (Enterprise 3.7.6 version) #10101

Comments

tqt9209 commented Mar 17, 2017 • edited by sancar Loading

sancar commented Mar 21, 2017

tqt9209 commented Mar 22, 2017

sancar commented Mar 23, 2017

tqt9209 commented Mar 27, 2017 • edited Loading

sancar commented Mar 30, 2017

bart-w0 commented Aug 25, 2017

bart-w0 commented Aug 29, 2017

sancar commented Aug 29, 2017

sancar commented Sep 5, 2017

tqt9209 commented Mar 17, 2017 •

edited by sancar

Loading

tqt9209 commented Mar 27, 2017 •

edited

Loading