New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hazelcast terminates listener incorrectly (Enterprise 3.7.6 version) #10101

Closed
tqt9209 opened this Issue Mar 17, 2017 · 9 comments

Comments

Projects
None yet
3 participants
@tqt9209

tqt9209 commented Mar 17, 2017

This is the exact same issue in this ticket: #9766
The problem is not fixed in 3.7.6. I am not sure how the original ticket was closed when the problem still exists.
Here is the description of the problem again:
I'm using Enterprise 3.7.6 version.
Application ->hzClient (3.7.6 version) ------------------- hzServer1 (3.7.6)
------------------- hzServer2 (3.7.6)
hzServer1 and hzServer2 are members in the same cluster.
hzClient registers message listener with the cluster to receive messages from reliable topic.

Steps to reproduce:
stop hzServer1
start hzServer1
stop hzServer2
start hzServer2
... repeat couple times as needed.
Basically alternate start/stop of the two servers but leave one server active at a time.
The listener is then terminated incorrectly (see log below). After that client never receives any messages from the topic again.

Please make sure that you address all the items below and test it carefully before closing the ticket. It's very time consuming to pick up new version and problem still not fixed.

  1. When isTerminal(Throwable failure) on ReliableMessageListener returns false (and isLossTolerant() returns true) then listener CANNOT BE TERMINATED FOR ANY REASON. Currently, look like these call back NEVER INVOKED which looks like more bugs.
  2. When you get exception (ex: TargetDisconnectException like below), feed the exception into isTerminal(...) call back. If the call back returns false then find another active cluster member and re-register the listeners. If there are no active member, wait until one of them become active and register listeners again. Basically the logic that register/re-register listener needs to be observer of connection state so that it can re-register (as needed) when cluster member disconnect/reconnect.
  3. Make sure that fix is tested carefully so that fixed really means fixed. Steps to test.
  • Create a cluster with two members hzServer1 and hzServer2
  • Start hzServer1 first then start hzServer2
  • Create client and register for reliable topic.
  • Create a publish client.
  • Publish something and verify that client receives the messages.
  • Stop hzServer1
  • Look at logs and verify that listener not terminated.
  • Publish something and make sure that client receives the messages.
  • Start hzServer1 and wait until it's up.
  • Stop hzServer2
  • Look at logs and verify that listener not terminated.
  • Publish something and make sure that client receives the messages.
  • Start hzServer2
  • Repeat this cycle 3-4 times to make sure that listeners not terminated and client still receives message.

Logs:

2017.03.17 11:15:09:298 EDT | Warn       | Hazelcast
        hz.client_0 [hzc1] [3.7.6] Terminating MessageListener com.broadsoft.persistence.hazelcast.BWHazelcastInstance$TopicContainer$ReliableMessageListenerImpl@111bed78 on topic: profileManagementUpdate. Reason: Unhandled exception, message: com.hazelcast.spi.exception.TargetDisconnectedException: Disconnecting from member [10.16.176.21]:5701 due to heartbeat problems. Current time: 2017-03-17 11:15:09.265. Last heartbeat requested: 2017-03-17 11:14:49.266. Last heartbeat received: 2017-03-17 11:14:49.268. Last read: 2017-03-17 11:15:08.561. Connection ClientConnection{live=false, connectionId=2, socketChannel=SSLSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[closed]}, remoteEndpoint=[10.16.176.21]:5701, lastReadTime=2017-03-17 11:15:08.561, lastWriteTime=2017-03-17 11:14:59.443, closedTime=2017-03-17 11:15:08.561, lastHeartbeatRequested=2017-03-17 11:14:49.266, lastHeartbeatReceived=2017-03-17 11:14:49.268, connected server version=3.7.6} <java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.TargetDisconnectedException: Disconnecting from member [10.16.176.21]:5701 due to heartbeat problems. Current time: 2017-03-17 11:15:09.265. Last heartbeat requested: 2017-03-17 11:14:49.266. Last heartbeat received: 2017-03-17 11:14:49.268. Last read: 2017-03-17 11:15:08.561. Connection ClientConnection{live=false, connectionId=2, socketChannel=SSLSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[closed]}, remoteEndpoint=[10.16.176.21]:5701, lastReadTime=2017-03-17 11:15:08.561, lastWriteTime=2017-03-17 11:14:59.443, closedTime=2017-03-17 11:15:08.561, lastHeartbeatRequested=2017-03-17 11:14:49.266, lastHeartbeatReceived=2017-03-17 11:14:49.268, connected server version=3.7.6}>java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.TargetDisconnectedException: Disconnecting from member [10.16.176.21]:5701 due to heartbeat problems. Current time: 2017-03-17 11:15:09.265. Last heartbeat requested: 2017-03-17 11:14:49.266. Last heartbeat received: 2017-03-17 11:14:49.268. Last read: 2017-03-17 11:15:08.561. Connection ClientConnection{live=false, connectionId=2, socketChannel=SSLSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[closed]}, remoteEndpoint=[10.16.176.21]:5701, lastReadTime=2017-03-17 11:15:08.561, lastWriteTime=2017-03-17 11:14:59.443, closedTime=2017-03-17 11:15:08.561, lastHeartbeatRequested=2017-03-17 11:14:49.266, lastHeartbeatReceived=2017-03-17 11:14:49.268, connected server version=3.7.6}
        at com.hazelcast.client.spi.impl.ClientInvocationFuture.resolve(ClientInvocationFuture.java:66)
        at com.hazelcast.spi.impl.AbstractInvocationFuture$1.run(AbstractInvocationFuture.java:246)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
        at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
        at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)
Caused by: com.hazelcast.spi.exception.TargetDisconnectedException: Disconnecting from member [10.16.176.21]:5701 due to heartbeat problems. Current time: 2017-03-17 11:15:09.265. Last heartbeat requested: 2017-03-17 11:14:49.266. Last heartbeat received: 2017-03-17 11:14:49.268. Last read: 2017-03-17 11:15:08.561. Connection ClientConnection{live=false, connectionId=2, socketChannel=SSLSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[closed]}, remoteEndpoint=[10.16.176.21]:5701, lastReadTime=2017-03-17 11:15:08.561, lastWriteTime=2017-03-17 11:14:59.443, closedTime=2017-03-17 11:15:08.561, lastHeartbeatRequested=2017-03-17 11:14:49.266, lastHeartbeatReceived=2017-03-17 11:14:49.268, connected server version=3.7.6}
        at com.hazelcast.spi.exception.TargetDisconnectedException.newTargetDisconnectedExceptionCausedByHeartbeat(TargetDisconnectedException.java:66)
        at com.hazelcast.client.spi.impl.ClientInvocationServiceSupport$CleanResourcesTask.run(ClientInvocationServiceSupport.java:221)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        ... 5 more
Caused by: java.io.EOFException: Remote socket closed!
        at com.hazelcast.client.connection.nio.ClientReadHandler.handle(ClientReadHandler.java:87)
        at com.hazelcast.nio.tcp.nonblocking.NonBlockingIOThread.handleSelectionKey(NonBlockingIOThread.java:345)
        at com.hazelcast.nio.tcp.nonblocking.NonBlockingIOThread.handleSelectionKeys(NonBlockingIOThread.java:330)
        at com.hazelcast.nio.tcp.nonblocking.NonBlockingIOThread.selectLoop(NonBlockingIOThread.java:248)
        at com.hazelcast.nio.tcp.nonblocking.NonBlockingIOThread.run(NonBlockingIOThread.java:201)


@sancar

This comment has been minimized.

Member

sancar commented Mar 21, 2017

@tqt9209
As you said, I can easily reproduce the issue with 3.7.6
It seems that issue is only fixed in 3.8-SNAPSHOT and not backported to fix 3.7.6 .
#9811
Closing the issue #9766 that is assigned 3.7.6 to was misinforming, sorry for that.
And we are not planning to backport it anymore.

I have tried with our latest release 3.8.1 and can not reproduce the issue.
That has been said your findings looks suspicious to me too. It may be the case that there are some other bugs around. @pveentjer Can you take a look and clear things our for us ?

@tqt9209

This comment has been minimized.

tqt9209 commented Mar 22, 2017

I assume that the released version 3.8 doesn't have the fix (I don't see #9811 in the release note of 3.8).
The version that you tested (3.8.1) is not released yet right?

@sancar

This comment has been minimized.

Member

sancar commented Mar 23, 2017

@tqt9209
Fix is done in 3.8.
Because milestone of #9766 was set to 3.7.6 at the time, it is documented that it is fixed in 3.7.6, which is not really the case. It is actually fixed in 3.8. Since when releasing we did not realised that, 3.8 release notes does not have that info.
Let me see if we can fix the release notes.

@tqt9209

This comment has been minimized.

tqt9209 commented Mar 27, 2017

I tried 3.8 version and it works much better.
Listener still terminated one time. I cannot reproduce it again however.
I think I stopped one cluster member then turn off the server (for that cluster member).

Exception log is below. Please take a look.
Thanks

        hz.client_0 [hzc1] [3.8] Terminating MessageListener com.broadsoft.persistence.hazelcast.BWHazelcastInstance$TopicContainer$ReliableMessageListenerImpl@655ac9a5 on topic: profileManagementUpdate. Reason: Unhandled exception, message: java.io.IOException: No available connection to address [10.16.176.21]:5701 <java.util.concurrent.ExecutionException: java.io.IOException: No available connection to address [10.16.176.21]:5701>java.util.concurrent.ExecutionException: java.io.IOException: No available connection to address [10.16.176.21]:5701
        at com.hazelcast.client.spi.impl.ClientInvocationFuture.resolve(ClientInvocationFuture.java:66)
        at com.hazelcast.spi.impl.AbstractInvocationFuture$1.run(AbstractInvocationFuture.java:243)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
        at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
        at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)
Caused by: java.io.IOException: No available connection to address [10.16.176.21]:5701
        at com.hazelcast.client.spi.impl.ClientSmartInvocationServiceImpl.getOrTriggerConnect(ClientSmartInvocationServiceImpl.java:88)
        at com.hazelcast.client.spi.impl.ClientSmartInvocationServiceImpl.invokeOnPartitionOwner(ClientSmartInvocationServiceImpl.java:47)
        at com.hazelcast.client.spi.impl.ClientInvocation.invokeOnSelection(ClientInvocation.java:133)
        at com.hazelcast.client.spi.impl.ClientInvocation.invoke(ClientInvocation.java:114)
        at com.hazelcast.client.spi.impl.ClientInvocation.run(ClientInvocation.java:144)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at com.hazelcast.util.executor.LoggingScheduledExecutor$LoggingDelegatingFuture.run(LoggingScheduledExecutor.java:128)
        ... 5 more
@sancar

This comment has been minimized.

Member

sancar commented Mar 30, 2017

This IOException is related to #9696
Once that is fixed, there should not be any issue here. Closing this one.

@sancar sancar closed this Mar 30, 2017

@bart-w0

This comment has been minimized.

bart-w0 commented Aug 25, 2017

Hello, I updated hazelcast libraries from 3.7.4 to 3.8.4 in hope to fix listeners' disconnections, but I still experience this issue.

Steps to reproduce:

  1. Run two hazelcast instances in a cluster (instance-1, instance-2).
  2. Connect a tool (client) that saves a new object to a hazelcast map every 1 second.
  3. Connect a client that reads events from a topic (configured to message new objects from the map) using Listeners. Verify that messages are being received.
  4. Restart instance1 or instance2. Repeat few times.
  5. Not all messages are being received - these which are being saved at instance-X are lost.

Logs on the client:

[2017-08-25 17:22:39.099][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4726
[2017-08-25 17:22:40.116][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4727
[2017-08-25 17:22:40.859][hz.client_0.event-4] INFO c.h.c.s.i.ClientMembershipListener@log:49 - hz.client_0 [data_test_group] [3.8.4] 

Members [2] {
	Member [instance-2.test.com]:5701 - 016a33fb-3eae-46c4-8a22-9a95490bf1fe
	Member [instance-1.test.com]:5701 - 57df3d12-5970-41f4-92a8-da11e08d694f
}

[2017-08-25 17:22:41.128][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4728
[2017-08-25 17:22:41.258][hz.hz.client_0.IO.thread-out-0] DEBUG c.h.c.c.n.ClientSocketWriterInitializer@log:49 - Initializing ClientSocketWriter WriteHandler with Cluster Protocol
[2017-08-25 17:22:41.291][hz.client_0.internal-3] INFO c.h.c.c.ClientConnectionManager@log:49 - hz.client_0 [data_test_group] [3.8.4] Authenticated with server [instance-1.test.com]:5701, server version:3.8.4 Local address: /10.230.35.99:57247
[2017-08-25 17:22:42.144][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4729
[2017-08-25 17:22:43.168][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4730
[2017-08-25 17:22:45.273][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4732
…….
[2017-08-25 17:23:09.227][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4755
[2017-08-25 17:23:10.263][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4756
[2017-08-25 17:23:11.294][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4757
[2017-08-25 17:23:12.335][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4758
[2017-08-25 17:23:13.360][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4759
[2017-08-25 17:23:14.232][hz.hz.client_0.IO.thread-in-0] WARN c.h.c.c.nio.ClientConnection@log:49 - hz.client_0 [data_test_group] [3.8.4] ClientConnection{alive=false, connectionId=13, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/10.230.35.99:57126 remote=instance-2.test.com/10.226.13.161:5701]}, remoteEndpoint=[instance-2.test.com]:5701, lastReadTime=2017-08-25 17:23:14.232, lastWriteTime=2017-08-25 17:18:10.004, closedTime=2017-08-25 17:23:14.232, lastHeartbeatRequested=2017-08-25 17:17:05.336, lastHeartbeatReceived=2017-08-25 17:17:05.341, connected server version=3.8.4} lost. Reason: java.io.EOFException[Remote socket closed!]
[2017-08-25 17:23:14.391][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4760
[2017-08-25 17:23:15.492][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4761
[2017-08-25 17:23:17.516][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4763
[2017-08-25 17:23:19.559][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4765
[2017-08-25 17:23:20.582][hz.client_0.event-3] DEBUG c.t.listeners.MessageListener@onMessage:29 - Received message: 4766

We can see instance reconnecting and later, a socket is being closed. Is it the same issue? Is there a solution for this?

@sancar sancar modified the milestones: 3.8.6, 3.8.2 Aug 28, 2017

@bart-w0

This comment has been minimized.

bart-w0 commented Aug 29, 2017

Hello @sancar, do you think this will be fixed in 3.8.6? Should you reopen this issue?

@sancar

This comment has been minimized.

Member

sancar commented Aug 29, 2017

Hi @bart-w0,
We don't have any known issues in 3.8.4.

Are you using Topic or ReliableTopic ? Original issue was related to ReliableTopic.
We have recently fixed a problem regarding use of hostnames in memberlist. You might be suffering from that and listener problem could be one of the causes. Can you try using ip-addresses instead of hostnames to see if it solves your problem.

I did not quite get the third step.
What I understand is following:
A client adds message listener to a topic. And the client is calling topic.publish when EntryListener.entryAdded event is called. Is that correct ?
It would be better if you can share sample client app for us to reproduce the problem.

@sancar

This comment has been minimized.

Member

sancar commented Sep 5, 2017

Hi @bart-w0
There seems to be multiple report of same issue.
We will follow the issue in #11264

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment