Reconnecting a client without blocking (Feature Request) #9692
This would be very useful, and I've come across this problem just recently. Ideally I'd like a long-running reconnect process that doesn't block operations. This is particularly problematic when trying to use Hazelcast as a provider for cache annotations on methods, such as Spring Cache or JCache annotations. The annotation on the method always calls the cache first; if the cache connection is down and in a long retry cycle, the call just blocks and the method body is never reached. I'd like the client call to fail fast but still maintain a long retry thread in the background. That way the annotated method body gets called, which can provide an alternative read from another backing store, and at some point in the future the cache connection just comes back to life.
As an added scenario: without this, writing things like circuit breakers for Hazelcast clients is impossible.
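To make the circuit-breaker idea concrete, here is a minimal, Hazelcast-free sketch of the fail-fast read pattern being asked for. `FailFastCache`, `cacheHealthy`, and `backingRead` are illustrative names, not Hazelcast API; the `Map` stands in for an `IMap`.

```java
import java.util.Map;
import java.util.function.Supplier;

public class FailFastCache {
    private final Map<String, String> cache;      // stands in for an IMap
    private final Supplier<Boolean> cacheHealthy; // e.g. a connection-state check

    public FailFastCache(Map<String, String> cache, Supplier<Boolean> cacheHealthy) {
        this.cache = cache;
        this.cacheHealthy = cacheHealthy;
    }

    public String get(String key, Supplier<String> backingRead) {
        if (!cacheHealthy.get()) {
            return backingRead.get(); // fail fast: skip the cache entirely
        }
        try {
            String v = cache.get(key);
            return v != null ? v : backingRead.get(); // miss: read the backing store
        } catch (RuntimeException e) {
            return backingRead.get(); // cache went down mid-call: fall back
        }
    }
}
```

The annotated method body (the "backing read") then always runs when the cache is unavailable, instead of blocking behind a retry loop.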
I am interested in the second option mentioned in this issue. As my client goes to the shutdown state after 2 attempts, I would like to know where those retries are configured.
I've been trying to figure out a suitable approach for keeping the server healthy even when Hazelcast is unhealthy (see also #8662). It would be easiest to configure the client to never give up, because then I can cache the instance handle statically (handles to distributed objects too) and reuse it throughout my application. But doing so means that distributed object operations could block indefinitely, consume all my server threads, and lock everything up. On the other hand, configuring the client to give up means that the instance and every distributed object handle obtained from it become permanently unusable once the attempt limit is exceeded.
The programming model would be simpler if the connection and object handles were things you could just declare statically and control by policy, but that policy needs to be flexible enough to handle concerns that are unique to each. Hopefully no real-world deployment is so unstable that this machinery gets used much, but understanding, planning for, and testing what could happen is a source of friction.
@tcataldo @nlwillia Are you aware that Client Connection Strategies were released just recently in 3.9? These allow for non-blocking behavior when a HazelcastInstance is unavailable and give wider options for retry. In fact, this issue should be closed, as that feature addresses the topic of "Reconnecting a client without blocking".
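For reference, a sketch of the 3.9 connection-strategy configuration being referred to; class and enum names follow the Hazelcast 3.9 client API, but verify against your version:

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.client.config.ClientConnectionStrategyConfig;
import com.hazelcast.core.HazelcastInstance;

public class AsyncClient {
    public static void main(String[] args) {
        ClientConfig config = new ClientConfig();
        config.getConnectionStrategyConfig()
              // newHazelcastClient() returns before the first connection is made
              .setAsyncStart(true)
              // with ASYNC reconnect, operations fail fast while disconnected
              // instead of blocking until the cluster returns
              .setReconnectMode(ClientConnectionStrategyConfig.ReconnectMode.ASYNC);
        HazelcastInstance client = HazelcastClient.newHazelcastClient(config);
    }
}
```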
You know, I looked at that, but somehow came away with an incorrect impression of what it was doing (some kind of operation buffering). I did some testing, and with
@nlwillia I would expect getLifecycleService().isRunning() to return true even while the HazelcastInstance is retrying a connection in the disconnected state. The question is: did the LifecycleListener deliver a LifecycleState.CLIENT_DISCONNECTED event? If that didn't happen when your client disconnected, then we need to raise a new issue.
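A minimal sketch of observing those states with a LifecycleListener (3.x client API; the log messages are illustrative):

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.LifecycleEvent;

public class DisconnectWatcher {
    public static void watch(HazelcastInstance client) {
        client.getLifecycleService().addLifecycleListener(event -> {
            if (event.getState() == LifecycleEvent.LifecycleState.CLIENT_DISCONNECTED) {
                // the client is retrying in the background; isRunning() stays true
                System.out.println("client disconnected, retrying...");
            } else if (event.getState() == LifecycleEvent.LifecycleState.CLIENT_CONNECTED) {
                System.out.println("client (re)connected");
            }
        });
    }
}
```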
@nlwillia Your point about "register topic listeners once" does not work for me. My case: I shut down the cluster, then the cluster comes back up.
I kept my last application log line about the heartbeat, as it relates to a reliable topic in my cluster that propagates the state of a central component. What I see is that my consumers are not automagically reconnected when the cluster comes up.
@tcataldo Can you describe the reliable topic scenario in more detail? Preferably in the Google group; although the issue looks the same, it is very likely that the causes are different. We would like to get more info and work on this. @nlwillia As far as I understand, our latest 3.9 release works as expected (except for reliable topic). I would like to close this issue, along with #8662, if everything is OK. Is that OK?
Yes, the client sees the CLIENT_DISCONNECTED event. When I tested, I used a regular topic. I just retried with a reliable topic, and it did not auto-reconnect, but that's easy to handle in the lifecycle listener (which is needed anyway for async start). Maybe something to look at or document, but not a show-stopper. I'm not the author of this issue, but from my perspective it appears to be addressed. The fact that it was still open with the earlier comments led me to believe there was still no solution to this problem, which no longer appears to be the case. The async mode significantly reduces the importance of elaborate control over the client connection policy, which is the thrust of #8662. The only thing I would still complain about there is the (admittedly oblique) use case of wanting to do a fail-fast "is-it-there" initial connection and convert it into a retry-forever "keep-it-alive" connection. I can work around that by using two connections with different configurations, though. (I'm trying to be clever by having a QA version of the application connect to a separate cluster if it's there, and otherwise start its own private one internally.)
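A hedged sketch of the workaround described above: re-attach the reliable topic listener from a lifecycle listener when the client reconnects (3.x client API; the class name and println handling are illustrative, and the listener bookkeeping is simplified):

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ITopic;
import com.hazelcast.core.LifecycleEvent;
import com.hazelcast.core.Message;

public class ReliableTopicResubscriber {
    private volatile String registrationId;

    public void subscribe(HazelcastInstance client, String topicName) {
        registrationId = listen(client, topicName);
        client.getLifecycleService().addLifecycleListener(event -> {
            if (event.getState() == LifecycleEvent.LifecycleState.CLIENT_CONNECTED) {
                // reliable topic listeners were observed not to survive a
                // reconnect, so drop the stale registration and re-attach
                ITopic<String> topic = client.getReliableTopic(topicName);
                topic.removeMessageListener(registrationId);
                registrationId = listen(client, topicName);
            }
        });
    }

    private String listen(HazelcastInstance client, String topicName) {
        ITopic<String> topic = client.getReliableTopic(topicName);
        return topic.addMessageListener(
                (Message<String> m) -> System.out.println("got: " + m.getMessageObject()));
    }
}
```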
I am closing this issue as its main request is addressed. I have also updated the milestone of this issue since it is solved in 3.9. |
The process of reconnecting a Hazelcast client after a cluster goes down is problematic in some use cases. There are currently two strategies that you can take.
One option is to specify a finite number of reconnect attempts. After that is exceeded, the client gets destroyed and any objects (maps, sets, etc.) that were retrieved from that client are no longer usable (anything that accesses them will get a HazelcastInstanceNotActiveException even after the Hazelcast cluster comes back up).
The second option is to specify an unlimited number of reconnect attempts. This will cause the client to eventually reconnect when the cluster comes back up. However, everything trying to use the client while the cluster is down will block until the connection to the cluster is restored. In the case of a web application with heavy load, this is likely not acceptable because requests will get queued up and potentially starve out the server's resources.
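The two strategies map onto the 3.x client network configuration roughly as follows; the attempt counts and periods are illustrative, and the meaning of 0 for the attempt limit should be checked against the reference manual for your version:

```java
import com.hazelcast.client.config.ClientConfig;

public class RetryConfigs {
    public static ClientConfig finiteRetries() {
        ClientConfig config = new ClientConfig();
        // Option 1: give up after 5 attempts, 3 seconds apart; the client
        // then shuts down and its distributed objects become unusable.
        config.getNetworkConfig()
              .setConnectionAttemptLimit(5)
              .setConnectionAttemptPeriod(3000);
        return config;
    }

    public static ClientConfig unlimitedRetries() {
        ClientConfig config = new ClientConfig();
        // Option 2: retry forever (0 is commonly documented as unlimited in
        // the 3.x client); callers block while the cluster is down.
        config.getNetworkConfig().setConnectionAttemptLimit(0);
        return config;
    }
}
```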
It seems like a third option would be desirable. The HazelcastInstance knows whether it's running (getLifecycleService().isRunning()), so it would be nice if it failed immediately when something tried to use it (e.g. hazelcastInstance.getMap(), imap.get("key"), etc.). In the background, any disconnected client would try to reconnect based on some policy (try every X seconds, exponential back-off, etc.).
This approach provides the benefits of fast failure so huge backlogs of work don't pile up if the cluster goes down while also avoiding the problems of destroying the client and invalidating all of the objects that were retrieved from it.
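As noted in the comments, the ASYNC reconnect mode added in 3.9 behaves much like this proposal: while the client is disconnected, operations throw HazelcastClientOfflineException instead of blocking. A hedged usage sketch (the class and method names here are illustrative):

```java
import com.hazelcast.client.HazelcastClientOfflineException;
import com.hazelcast.core.HazelcastInstance;

public class FailFastRead {
    public static String read(HazelcastInstance client, String mapName, String key) {
        try {
            return (String) client.getMap(mapName).get(key);
        } catch (HazelcastClientOfflineException e) {
            // cluster is down; reconnection continues in the background,
            // so fall back to another store instead of blocking
            return null;
        }
    }
}
```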