Dead node causes high reconnect count #11176
We're using Hazelcast 3.7.3 in a cluster of 3 standalone nodes and 3 clients. We noticed a sudden increase in threads on the client systems (3.6k within less than 10 minutes). According to the logs, one Hazelcast node died (and was restored by the orchestration system), and the connected clients tried to reconnect many times per second before the connection was eventually reestablished.
Hazelcast also logged the following exception:
A couple of seconds later it was still logging all those reconnect attempts, but now with a different exception:
Unfortunately I have no thread dump showing that Hazelcast caused the massive thread increase, but I still believe the repeated reconnect attempts are related to the increase in thread count.
The initial exception occurs because the old member had not yet been removed from the cluster member list. The client assumes it is still a valid member and tries to register its existing listeners on it, which fails as expected since the member is closed. The client recovers as soon as the member list in the cluster is updated.
The second exception appears when the new member has not finished starting and the client tries to register the listener on it.
Still, I cannot see how this would affect the number of threads in use. We use executors internally, but their thread pools do not grow the way you suggest; they are mostly fixed size. To understand your problem better, we need a reproducer for the thread increase, or a thread dump as you mentioned.
All of this is handled a little differently in the latest releases, so please go ahead and try with the latest.
Please try your case with the latest patch version (3.8.4 for now). I strongly believe the result may differ due to recent changes in the listener mechanism. I have no clue why you have so many threads. We need to understand what those threads are, so we will probably need at least a thread dump if you can reproduce the issue on 3.8.4. Until then, I will close this issue; you can reopen it if you still face the problem with version 3.8.4.
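For anyone who hits this again and wants to capture the information requested above: running `jstack <pid>` against the client JVM is the usual way to get a full thread dump. If attaching a tool is not an option, a rough sketch like the following can be called from inside the application to see which threads are piling up. It uses only standard JDK APIs; the class name `ThreadSummary` and the name-grouping rule are my own assumptions for illustration, not anything Hazelcast provides.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hazelcast: summarizes live JVM threads.
public class ThreadSummary {

    // Drops a trailing number (and the separator before it) so that
    // e.g. "pool-1-thread-12" and "pool-1-thread-3" fall into one group.
    public static String groupName(String threadName) {
        return threadName.replaceAll("[-_.#\\s]*\\d+$", "");
    }

    // Counts live threads per name group, sorted by group name.
    public static Map<String, Integer> countByGroup() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(groupName(t.getName()), 1, Integer::sum);
        }
        return counts;
    }

    // Returns a textual dump of all threads. Note that ThreadInfo.toString()
    // prints only the first few stack frames per thread, so prefer jstack
    // when a complete dump is needed.
    public static String fullDump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            sb.append(info.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Print "count<TAB>group" for every thread name group.
        countByGroup().forEach((name, n) -> System.out.println(n + "\t" + name));
    }
}
```

Logging the output of `countByGroup()` periodically while a member is being restarted should show whether the growth comes from one thread name group, which would be a useful thing to attach when reopening the issue.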