Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dead node causes high reconnect count #11176

Closed
philnate opened this issue Aug 21, 2017 · 3 comments
Closed

Dead node causes high reconnect count #11176

philnate opened this issue Aug 21, 2017 · 3 comments

Comments

@philnate
Copy link

@philnate philnate commented Aug 21, 2017

Hi,

we're using Hazelcast 3.7.3 in a cluster of 3 standalone nodes and 3 clients. We noticed a sudden increase in Threads in the client systems (3,6k within less than 10min). According to the logs it seems that one Hazelcast node died (and was restored by the orchestration system) the connected clients tried to reconnect many many times per second, before eventually the connection was reestablished.

August 17th 2017, 14:26:00.000	hz.client_0 [ccluster] [3.7.3] HazelcastClient 3.7.3 (20161117 - 8166eea) is CLIENT_DISCONNECTED
August 17th 2017, 14:26:00.000	hz.client_0 [ccluster] [3.7.3] HazelcastClient 3.7.3 (20161117 - 8166eea) is CLIENT_DISCONNECTED
August 17th 2017, 14:26:00.000	hz.client_0 [ccluster] [3.7.3] 

Members [2] {
	Member [10.100.48.16]:5701 - dce9f8f6-205e-41b7-9921-6e2dade83551
	Member [10.100.86.30]:5701 - 6db3a7ed-1637-4374-9772-22cd19a30811
}
August 17th 2017, 14:26:00.000	hz.client_0 [ccluster] [3.7.3] ClientConnection{live=false, connectionId=29, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/10.100.85.97:45698 remote=/10.100.52.32:5701]}, remoteEndpoint=[10.100.52.32]:5701, lastReadTime=2017-08-17 12:26:00.362, lastWriteTime=2017-08-17 12:26:00.098, closedTime=2017-08-17 12:26:00.362, lastHeartbeatRequested=never, lastHeartbeatReceived=never, connected server version=3.7.3} lost. Reason: java.io.IOException[Connection reset by peer]
August 17th 2017, 14:26:00.000	hz.client_0 [ccluster] [3.7.3] ClientConnection{live=false, connectionId=25, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/10.100.48.13:57408 remote=/10.100.52.32:5701]}, remoteEndpoint=[10.100.52.32]:5701, lastReadTime=2017-08-17 12:26:00.378, lastWriteTime=2017-08-17 12:26:00.114, closedTime=2017-08-17 12:26:00.381, lastHeartbeatRequested=2017-08-14 18:29:45.149, lastHeartbeatReceived=2017-08-14 18:29:45.151, connected server version=3.7.3} lost. Reason: java.io.EOFException[Remote socket closed!]
August 17th 2017, 14:26:00.000	hz.client_0 [ccluster] [3.7.3] ClientConnection{live=false, connectionId=1, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/10.100.86.31:40766 remote=/10.200.15.12:5701]}, remoteEndpoint=[10.100.52.32]:5701, lastReadTime=2017-08-17 12:26:00.363, lastWriteTime=2017-08-17 12:25:54.474, closedTime=2017-08-17 12:26:00.363, lastHeartbeatRequested=2017-08-17 12:25:54.257, lastHeartbeatReceived=2017-08-17 12:25:54.258, connected server version=3.7.3} lost. Reason: java.io.EOFException[Remote socket closed!]
August 17th 2017, 14:26:00.000	hz.client_0 [ccluster] [3.7.3] 

Members [3] {
	Member [10.100.48.16]:5701 - dce9f8f6-205e-41b7-9921-6e2dade83551
	Member [10.100.52.32]:5701 - ae0590fb-8da8-45cd-b950-24ed1f65723a
	Member [10.100.86.30]:5701 - 6db3a7ed-1637-4374-9772-22cd19a30811
}
August 17th 2017, 14:26:00.000	hz.client_0 [ccluster] [3.7.3] Authenticated with server [10.100.48.16]:5701, server version:3.7.3 Local address: /10.100.86.31:55688
August 17th 2017, 14:26:00.000	hz.client_0 [ccluster] [3.7.3] 

Members [2] {
	Member [10.100.48.16]:5701 - dce9f8f6-205e-41b7-9921-6e2dade83551
	Member [10.100.86.30]:5701 - 6db3a7ed-1637-4374-9772-22cd19a30811
}
August 17th 2017, 14:26:01.000	hz.client_0 [ccluster] [3.7.3] HazelcastClient 3.7.3 (20161117 - 8166eea) is CLIENT_DISCONNECTED
August 17th 2017, 14:26:01.000	hz.client_0 [ccluster] [3.7.3] 

Members [2] {
	Member [10.100.48.16]:5701 - dce9f8f6-205e-41b7-9921-6e2dade83551
	Member [10.100.86.30]:5701 - 6db3a7ed-1637-4374-9772-22cd19a30811
}
August 17th 2017, 14:26:01.000	hz.client_0 [ccluster] [3.7.3] ClientConnection{live=false, connectionId=26, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/10.100.48.13:44448 remote=/10.200.15.12:5701]}, remoteEndpoint=[10.100.48.16]:5701, lastReadTime=2017-08-17 12:26:01.117, lastWriteTime=2017-08-17 12:26:01.117, closedTime=2017-08-17 12:26:01.117, lastHeartbeatRequested=never, lastHeartbeatReceived=never, connected server version=3.7.3} lost. Reason: com.hazelcast.core.HazelcastException[Listener ClientRegistrationKey{ userRegistrationId='eb25db3f-3b7d-4170-8d8d-7ae6a98a26ea'} can not be added to member Member [10.100.52.32]:5701 - ae0590fb-8da8-45cd-b950-24ed1f65723a]
August 17th 2017, 14:26:01.000	hz.client_0 [ccluster] [3.7.3] Authenticated with server [10.100.48.16]:5701, server version:3.7.3 Local address: /10.100.48.13:43010
August 17th 2017, 14:26:02.000	hz.client_0 [ccluster] [3.7.3] Could not open connection to member Member [10.100.52.32]:5701 - ae0590fb-8da8-45cd-b950-24ed1f65723a
August 17th 2017, 14:26:02.000	hz.client_0 [ccluster] [3.7.3] Could not open connection to member Member [10.100.52.32]:5701 - ae0590fb-8da8-45cd-b950-24ed1f65723a
August 17th 2017, 14:26:02.000	hz.client_0 [ccluster] [3.7.3] Could not open connection to member Member [10.100.52.32]:5701 - ae0590fb-8da8-45cd-b950-24ed1f65723a
August 17th 2017, 14:26:02.000	hz.client_0 [ccluster] [3.7.3] Could not open connection to member Member [10.100.52.32]:5701 - ae0590fb-8da8-45cd-b950-24ed1f65723a
August 17th 2017, 14:26:02.000	hz.client_0 [ccluster] [3.7.3] Could not open connection to member Member [10.100.52.32]:5701 - ae0590fb-8da8-45cd-b950-24ed1f65723a
August 17th 2017, 14:26:02.000	hz.client_0 [ccluster] [3.7.3] Could not open connection to member Member [10.100.52.32]:5701 - ae0590fb-8da8-45cd-b950-24ed1f65723a
August 17th 2017, 14:26:02.000	hz.client_0 [ccluster] [3.7.3] Could not open connection to member Member [10.100.52.32]:5701 - ae0590fb-8da8-45cd-b950-24ed1f65723a
August 17th 2017, 14:26:02.000	hz.client_0 [ccluster] [3.7.3] Could not open connection to member Member [10.100.52.32]:5701 - ae0590fb-8da8-45cd-b950-24ed1f65723a
August 17th 2017, 14:26:02.000	hz.client_0 [ccluster] [3.7.3] Could not open connection to member Member [10.100.52.32]:5701 - ae0590fb-8da8-45cd-b950-24ed1f65723a
August 17th 2017, 14:26:02.000	hz.client_0 [ccluster] [3.7.3] Could not open connection to member Member [10.100.52.32]:5701 - ae0590fb-8da8-45cd-b950-24ed1f65723a

Hazelcast logged as well the following exception:

{
    "timeMillis": 1502972762282,
    "thread": "hz.client_0.eventRegistration-",
    "level": "WARN",
    "loggerName": "com.hazelcast.client.spi.ClientListenerService",
    "message": "hz.client_0 [ccluster] [3.7.3] Could not open connection to member Member [10.100.52.32]:5701 - ae0590fb-8da8-45cd-b950-24ed1f65723a",
    "thrown": {
      "commonElementCount": 0,
      "localizedMessage": "java.net.ConnectException: Connection refused",
      "message": "java.net.ConnectException: Connection refused",
      "name": "com.hazelcast.core.HazelcastException",
      "cause": {
        "commonElementCount": 5,
        "localizedMessage": "Connection refused",
        "message": "Connection refused",
        "name": "java.net.ConnectException",
        "extendedStackTrace": [
          {
            "class": "sun.nio.ch.SocketChannelImpl",
            "method": "checkConnect",
            "file": "SocketChannelImpl.java",
            "line": -2,
            "exact": false,
            "location": "?",
            "version": "1.8.0_112"
          },
          {
            "class": "sun.nio.ch.SocketChannelImpl",
            "method": "finishConnect",
            "file": "SocketChannelImpl.java",
            "line": 717,
            "exact": false,
            "location": "?",
            "version": "1.8.0_112"
          },
          {
            "class": "sun.nio.ch.SocketAdaptor",
            "method": "connect",
            "file": "SocketAdaptor.java",
            "line": 111,
            "exact": false,
            "location": "?",
            "version": "1.8.0_112"
          },
          {
            "class": "com.hazelcast.client.connection.nio.ClientConnectionManagerImpl",
            "method": "createSocketConnection",
            "file": "ClientConnectionManagerImpl.java",
            "line": 350,
            "exact": false,
            "location": "hazelcast-client-3.7.3.jar!/",
            "version": "3.7.3"
          },
          {
            "class": "com.hazelcast.client.connection.nio.ClientConnectionManagerImpl$InitConnectionTask",
            "method": "run",
            "file": "ClientConnectionManagerImpl.java",
            "line": 579,
            "exact": false,
            "location": "hazelcast-client-3.7.3.jar!/",
            "version": "3.7.3"
          },
          {
            "class": "java.util.concurrent.Executors$RunnableAdapter",
            "method": "call",
            "file": "Executors.java",
            "line": 511,
            "exact": false,
            "location": "?",
            "version": "1.8.0_112"
          },
          {
            "class": "java.util.concurrent.FutureTask",
            "method": "run",
            "file": "FutureTask.java",
            "line": 266,
            "exact": false,
            "location": "?",
            "version": "1.8.0_112"
          },
          {
            "class": "java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask",
            "method": "access$201",
            "file": "ScheduledThreadPoolExecutor.java",
            "line": 180,
            "exact": false,
            "location": "?",
            "version": "1.8.0_112"
          },
          {
            "class": "java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask",
            "method": "run",
            "file": "ScheduledThreadPoolExecutor.java",
            "line": 293,
            "exact": false,
            "location": "?",
            "version": "1.8.0_112"
          }
        ]
      },
      "extendedStackTrace": [
        {
          "class": "com.hazelcast.util.ExceptionUtil",
          "method": "peel",
          "file": "ExceptionUtil.java",
          "line": 73,
          "exact": false,
          "location": "hazelcast-3.7.3.jar!/",
          "version": "3.7.3"
        },
        {
          "class": "com.hazelcast.util.ExceptionUtil",
          "method": "peel",
          "file": "ExceptionUtil.java",
          "line": 52,
          "exact": false,
          "location": "hazelcast-3.7.3.jar!/",
          "version": "3.7.3"
        },
        {
          "class": "com.hazelcast.util.ExceptionUtil",
          "method": "rethrow",
          "file": "ExceptionUtil.java",
          "line": 83,
          "exact": false,
          "location": "hazelcast-3.7.3.jar!/",
          "version": "3.7.3"
        },
        {
          "class": "com.hazelcast.client.connection.nio.ClientConnectionManagerImpl",
          "method": "getOrConnect",
          "file": "ClientConnectionManagerImpl.java",
          "line": 251,
          "exact": false,
          "location": "hazelcast-client-3.7.3.jar!/",
          "version": "3.7.3"
        },
        {
          "class": "com.hazelcast.client.spi.impl.listener.ClientSmartListenerService",
          "method": "getOrConnect",
          "file": "ClientSmartListenerService.java",
          "line": 428,
          "exact": false,
          "location": "hazelcast-client-3.7.3.jar!/",
          "version": "3.7.3"
        },
        {
          "class": "com.hazelcast.client.spi.impl.listener.ClientSmartListenerService",
          "method": "ensureConnectionsToAllServers",
          "file": "ClientSmartListenerService.java",
          "line": 418,
          "exact": true,
          "location": "hazelcast-client-3.7.3.jar!/",
          "version": "3.7.3"
        },
        {
          "class": "com.hazelcast.client.spi.impl.listener.ClientSmartListenerService",
          "method": "access$700",
          "file": "ClientSmartListenerService.java",
          "line": 52,
          "exact": true,
          "location": "hazelcast-client-3.7.3.jar!/",
          "version": "3.7.3"
        },
        {
          "class": "com.hazelcast.client.spi.impl.listener.ClientSmartListenerService$5$1",
          "method": "run",
          "file": "ClientSmartListenerService.java",
          "line": 255,
          "exact": true,
          "location": "hazelcast-client-3.7.3.jar!/",
          "version": "3.7.3"
        },
        {
          "class": "java.util.concurrent.Executors$RunnableAdapter",
          "method": "call",
          "file": "Executors.java",
          "line": 511,
          "exact": true,
          "location": "?",
          "version": "1.8.0_112"
        },
        {
          "class": "java.util.concurrent.FutureTask",
          "method": "run",
          "file": "FutureTask.java",
          "line": 266,
          "exact": true,
          "location": "?",
          "version": "1.8.0_112"
        },
        {
          "class": "java.util.concurrent.ThreadPoolExecutor",
          "method": "runWorker",
          "file": "ThreadPoolExecutor.java",
          "line": 1142,
          "exact": true,
          "location": "?",
          "version": "1.8.0_112"
        },
        {
          "class": "java.util.concurrent.ThreadPoolExecutor$Worker",
          "method": "run",
          "file": "ThreadPoolExecutor.java",
          "line": 617,
          "exact": true,
          "location": "?",
          "version": "1.8.0_112"
        },
        {
          "class": "java.lang.Thread",
          "method": "run",
          "file": "Thread.java",
          "line": 745,
          "exact": true,
          "location": "?",
          "version": "1.8.0_112"
        },
        {
          "class": "com.hazelcast.util.executor.HazelcastManagedThread",
          "method": "executeRun",
          "file": "HazelcastManagedThread.java",
          "line": 76,
          "exact": true,
          "location": "hazelcast-3.7.3.jar!/",
          "version": "3.7.3"
        },
        {
          "class": "com.hazelcast.util.executor.HazelcastManagedThread",
          "method": "run",
          "file": "HazelcastManagedThread.java",
          "line": 92,
          "exact": true,
          "location": "hazelcast-3.7.3.jar!/",
          "version": "3.7.3"
        }
      ]
    }

A couple of seconds later it was still logging all those reconnect tries, but now throwing a different exception:

August 17th 2017, 14:26:12.000	hz.client_0 [ccluster] [3.7.3] Could not open connection to member Member [10.100.52.32]:5701 - ae0590fb-8da8-45cd-b950-24ed1f65723a    
August 17th 2017, 14:26:12.000	hz.client_0 [ccluster] [3.7.3] Could not open connection to member Member [10.100.52.32]:5701 - ae0590fb-8da8-45cd-b950-24ed1f65723a
August 17th 2017, 14:26:12.000	hz.client_0 [ccluster] [3.7.3] ClientConnection{live=false, connectionId=364, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/10.100.48.13:50452 remote=/10.100.52.32:5701]}, remoteEndpoint=null, lastReadTime=2017-08-17 12:26:12.200, lastWriteTime=2017-08-17 12:26:12.199, closedTime=2017-08-17 12:26:12.200, lastHeartbeatRequested=never, lastHeartbeatReceived=never, connected server version=null} lost. Reason: java.util.concurrent.ExecutionException[com.hazelcast.core.HazelcastInstanceNotActiveException: Hazelcast instance is not ready yet!]

{
    "timeMillis": 1502972772951,
    "thread": "hz.client_0.eventRegistration-",
    "level": "WARN",
    "loggerName": "com.hazelcast.client.spi.ClientListenerService",
    "message": "hz.client_0 [ccluster] [3.7.3] Could not open connection to member Member [10.100.52.32]:5701 - ae0590fb-8da8-45cd-b950-24ed1f65723a",
    "thrown": {
      "commonElementCount": 0,
      "localizedMessage": "Hazelcast instance is not ready yet!",
      "message": "Hazelcast instance is not ready yet!",
      "name": "com.hazelcast.core.HazelcastInstanceNotActiveException",
      "extendedStackTrace": [
        {
          "class": "com.hazelcast.client.impl.protocol.task.AbstractMessageTask",
          "method": "initializeAndProcessMessage",
          "file": "AbstractMessageTask.java",
          "line": 129,
          "exact": false,
          "location": "hazelcast-3.7.3.jar!/",
          "version": "3.7.3"
        },
        {
          "class": "com.hazelcast.client.impl.protocol.task.AbstractMessageTask",
          "method": "run",
          "file": "AbstractMessageTask.java",
          "line": 107,
          "exact": false,
          "location": "hazelcast-3.7.3.jar!/",
          "version": "3.7.3"
        },
        {
          "class": "com.hazelcast.client.impl.ClientEngineImpl$PriorityPartitionSpecificRunnable",
          "method": "run",
          "file": "ClientEngineImpl.java",
          "line": 555,
          "exact": false,
          "location": "hazelcast-3.7.3.jar!/",
          "version": "3.7.3"
        },
        {
          "class": "com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl",
          "method": "run",
          "file": "OperationRunnerImpl.java",
          "line": 137,
          "exact": false,
          "location": "hazelcast-3.7.3.jar!/",
          "version": "3.7.3"
        },
        {
          "class": "com.hazelcast.spi.impl.operationexecutor.impl.OperationThread",
          "method": "process",
          "file": "OperationThread.java",
          "line": 127,
          "exact": false,
          "location": "hazelcast-3.7.3.jar!/",
          "version": "3.7.3"
        },
        {
          "class": "com.hazelcast.spi.impl.operationexecutor.impl.OperationThread",
          "method": "run",
          "file": "OperationThread.java",
          "line": 102,
          "exact": false,
          "location": "hazelcast-3.7.3.jar!/",
          "version": "3.7.3"
        }
      ]
    }

Unfortunately I have no thread dump to showing that Hazelcast caused the massive Thread increase, still I believe the repeated reconnect tries are related to the increase in Thread count.

@mmedenjak mmedenjak added this to the 3.9 milestone Aug 22, 2017
@ihsandemir
Copy link
Contributor

@ihsandemir ihsandemir commented Aug 22, 2017

Hello @philnate
The listener mechanism changed significantly since 3.7.3, I strongly recommend you to test with the latest release version.

The initial exception is received because the old member was not removed from the cluster member list yet and the client assumes that it is still a valid member and tries to register the existing listeners but it fails as expected since the member is closed. It recovers from this as soon as the member list in the cluster is updated.

The second exception also is seen when the new member start is not completed yet and the client is trying to register the listener and gets this exception.

Still, I could not reason how it would affect the number of threads usage. We use executors internally but their thread pool size does not grow as you suggest, they are mostly fixed size. We need a reproducer for increasing thread claim to understand your problem better. Or a thread dump as you mentioned.

All of these are being handled a little different now at the latest releases, hence please go ahead and try with the latest.

@ihsandemir
Copy link
Contributor

@ihsandemir ihsandemir commented Aug 23, 2017

Hello @philnate

Please try your case with latest patch version (3.8.4 for now). I strongly believe that it may result differently due to recent changes in listener mechanisms. I have no clue how you have many threads. We need to understand what threads those are, hence, we probably need a thread dump at least if you can regenerate the issue at 3.8.4. Until then, I will close this issue and you can reopen it if you still face the problem with version 3.8.4.

@ihsandemir ihsandemir closed this Aug 23, 2017
@philnate
Copy link
Author

@philnate philnate commented Aug 28, 2017

@ihsandemir I will try out with latest hazelcast, hopefully the issue will then be gone. I can confirm that hazelcast wasn't the reason for the high number of threads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.