Reconnecting a client without blocking (Feature Request) #9692

Closed · MattNohelty opened this issue Jan 18, 2017 · 11 comments
Labels: Source: Community (PR or issue was opened by a community user) · Team: Client · Type: Enhancement

MattNohelty commented Jan 18, 2017

The process of reconnecting a Hazelcast client after a cluster goes down is problematic in some use cases. There are currently two strategies that you can take.

One option is to specify a finite number of reconnect attempts. Once that limit is exceeded, the client is destroyed and any objects (maps, sets, etc.) that were retrieved from it are no longer usable (anything that accesses them gets a HazelcastInstanceNotActiveException, even after the Hazelcast cluster comes back up).

The second option is to specify an unlimited number of reconnect attempts. This will cause the client to eventually reconnect when the cluster comes back up. However, everything trying to use the client while the cluster is down will block until the connection to the cluster is restored. In the case of a web application with heavy load, this is likely not acceptable because requests will get queued up and potentially starve out the server's resources.

A third option therefore seems desirable. The HazelcastInstance knows whether it is running (getLifecycleService().isRunning()), so it would be nice if it failed immediately when something tried to use it (e.g. hazelcastInstance.getMap(), imap.get('key'), etc.). In the background, any disconnected client would keep trying to reconnect based on some policy (every X seconds, exponential back-off, etc.).

This approach provides the benefits of fast failure so huge backlogs of work don't pile up if the cluster goes down while also avoiding the problems of destroying the client and invalidating all of the objects that were retrieved from it.
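
A minimal sketch of the fail-fast behaviour being requested, written as a hypothetical user-side wrapper; nothing like this is provided by Hazelcast itself, and the class name and exception choice are purely illustrative:

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

// Hypothetical user-side wrapper illustrating the proposed fail-fast behaviour.
public final class FailFastHazelcast {
    private final HazelcastInstance instance;

    public FailFastHazelcast(HazelcastInstance instance) {
        this.instance = instance;
    }

    public <K, V> IMap<K, V> getMap(String name) {
        // Fail immediately instead of blocking while a background reconnect runs.
        if (!instance.getLifecycleService().isRunning()) {
            throw new IllegalStateException("Hazelcast client is not connected");
        }
        return instance.getMap(name);
    }
}
```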


dbrimley commented Jan 19, 2017

This would be very useful; I've come across this problem just recently. Ideally I'd like a long-running reconnect process that does not block operations on the HazelcastInstance, with operations failing fast as in @MattNohelty's proposal.

This is particularly problematic when trying to use Hazelcast as a provider for cache annotations on methods, such as Spring Cache or JCache annotations. The annotation on the method always calls the cache first; if the cache connection is down and in a long retry cycle, the call just blocks and the method body is never invoked.

I'd like the client call to fail fast, but still maintain a long retry thread in the background. That way the annotated method body is called, which can provide an alternative read from another backing store, and at some point in the future the cache connection simply comes back to life.
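
A minimal sketch of that scenario, assuming Spring's @Cacheable backed by a Hazelcast cache; the Customer/CustomerRepository types and the cache name are illustrative:

```java
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class CustomerService {

    private final CustomerRepository customerRepository; // illustrative backing store

    public CustomerService(CustomerRepository customerRepository) {
        this.customerRepository = customerRepository;
    }

    // @Cacheable consults the cache before invoking the method body. If the cache
    // client blocks in a long retry cycle, this body is never reached; with a
    // fail-fast client (plus a CacheErrorHandler that ignores cache access errors),
    // the fallback read below can still run.
    @Cacheable("customers")
    public Customer findCustomer(String id) {
        return customerRepository.load(id);
    }
}
```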

@dbrimley (Contributor)

As an added scenario: without this, writing things like circuit breakers for Hazelcast clients becomes impossible.

@tcataldo

I am interested in the second option mentioned in this issue:

The second option is to specify an unlimited number of reconnect attempts. This will cause the client to eventually reconnect when the cluster comes back up. However, everything trying to use the client while the cluster is down will block until the connection to the cluster is restored. In the case of a web application with heavy load, this is likely not acceptable because requests will get queued up and potentially starve out the server's resources.

Since my client goes to the shutdown state after 2 attempts, I would be happy to know where those retry settings are configured.
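
For reference, a minimal sketch of where those retry settings live in the 3.x client configuration; the values shown are illustrative, and treating 0 as "retry indefinitely" is my reading of the connection-attempt-limit behaviour:

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.core.HazelcastInstance;

public class RetryConfigExample {
    public static void main(String[] args) {
        ClientConfig config = new ClientConfig();
        // The default connection-attempt-limit is 2, which is why the client shuts
        // down after two failed attempts; 0 is (I believe) treated as unlimited.
        config.getNetworkConfig().setConnectionAttemptLimit(0);
        config.getNetworkConfig().setConnectionAttemptPeriod(5000); // ms between attempts
        HazelcastInstance client = HazelcastClient.newHazelcastClient(config);
    }
}
```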

@nlwillia

I've been trying to figure out a suitable approach for keeping the server healthy even if Hazelcast is unhealthy (see also #8662). It would be easiest to just have the client configured to never give up because then I can cache the instance handle statically (handles to distributed objects also) and reuse it throughout my application. But doing so means that distributed object operations could block indefinitely, consume all my server threads and lock everything up. On the other hand, configuring the client to give up means that:

  • I have to decide on some global lowest-common-denominator threshold for when timeout should occur (or use different connections with different configurations for different tolerances).
  • I have to build my own factory abstraction if I want to support a recovery (circuit breaker, new connection, etc.) pattern and hit that on every request.
  • I have to build my own connect listener feature for things like topic registrations so they'll get bound on any new connection object that's created.
  • I have to resolve new distributed object handles on every request because if the connection shuts down any old ones are useless.

The programming model would be simpler if the connection and object handles were things you could just declare statically and control by policy, but that policy needs to be flexible enough to handle concerns that are unique to each. Hopefully no real-world deployment is going to be so unstable that this stuff gets used much, but understanding, planning for, and testing what could happen is a source of friction.

@dbrimley (Contributor)

@tcataldo @nlwillia Are you aware that Client Connection Strategies were released just recently in 3.9?

These allow for non-blocking behaviour when a HazelcastInstance is unavailable and give wider options for retry.

In fact, this issue should be closed, as that feature addresses the topic of "Reconnecting a client without blocking".
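
For reference, a minimal sketch of the 3.9 configuration in question, assuming ClientConnectionStrategyConfig with ReconnectMode.ASYNC; the attempt limit shown is illustrative:

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.client.config.ClientConnectionStrategyConfig.ReconnectMode;
import com.hazelcast.core.HazelcastInstance;

public class AsyncReconnectExample {
    public static void main(String[] args) {
        ClientConfig config = new ClientConfig();
        // Start and reconnect in the background; while disconnected, operations
        // fail fast instead of blocking.
        config.getConnectionStrategyConfig().setAsyncStart(true);
        config.getConnectionStrategyConfig().setReconnectMode(ReconnectMode.ASYNC);
        // Keep retrying instead of shutting the client down.
        config.getNetworkConfig().setConnectionAttemptLimit(Integer.MAX_VALUE);
        HazelcastInstance client = HazelcastClient.newHazelcastClient(config);
    }
}
```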

@nlwillia

You know, I looked at that, but somehow came away with an incorrect impression of what it was doing (some kind of operation buffering). I did some testing, and with ReconnectMode.ASYNC and unlimited network attempts configured, the behavior does seem to be suitable. I can cache static handles to both the instance and distributed objects, register topic listeners once (the first time a LifecycleListener sees LifecycleState.CLIENT_CONNECTED), and distributed object calls fail fast with HazelcastClientOfflineException without the need for a circuit breaker. getLifecycleService().isRunning() ends up always being true because the client is always either available or asynchronously trying to connect and throwing exceptions for operations in the interim.
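
A minimal sketch of that pattern, assuming a LifecycleListener registered on the client; the topic name and listener body are illustrative:

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.LifecycleEvent.LifecycleState;
import java.util.concurrent.atomic.AtomicBoolean;

public class TopicRegistrationExample {
    public static void registerOnFirstConnect(HazelcastInstance client) {
        AtomicBoolean registered = new AtomicBoolean(false);
        // Register topic listeners once, the first time the asynchronously started
        // client actually reaches the cluster.
        client.getLifecycleService().addLifecycleListener(event -> {
            if (event.getState() == LifecycleState.CLIENT_CONNECTED
                    && registered.compareAndSet(false, true)) {
                client.<String>getTopic("status").addMessageListener(message ->
                        System.out.println("received: " + message.getMessageObject()));
            }
        });
    }
}
```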

@dbrimley (Contributor)

@nlwillia I would expect getLifecycleService().isRunning() to return true even when the HazelcastInstance is retrying a connection and is in a disconnected state. The question is: did the LifecycleListener deliver a LifecycleState.CLIENT_DISCONNECTED event? If that didn't happen when your client disconnected, then we need to raise a new issue.

@tcataldo

@nlwillia your point about "register topic listeners once" does not work for me. My case: I shut down the cluster:
2017-11-17 07:40:34,854 [bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3.cluster-] c.h.c.c.ClientConnectionManager WARN - bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3 [bluemind-72D26E8A-5BB1-48A4-BC71-EEE92E0CE4EE] [3.9-EA] Unable to get alive cluster connection, try in 1969 ms later, attempt 12 of 2147483647.

Then the cluster comes back up:

2017-11-17 07:40:36,824 [bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3.cluster-] c.h.c.c.ClientConnectionManager INFO - bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3 [bluemind-72D26E8A-5BB1-48A4-BC71-EEE92E0CE4EE] [3.9-EA] Trying to connect to [172.16.167.184]:5701 as owner member
2017-11-17 07:40:36,884 [bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3.internal-1] c.h.c.c.ClientConnectionManager INFO - bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3 [bluemind-72D26E8A-5BB1-48A4-BC71-EEE92E0CE4EE] [3.9-EA] Authenticated with server [172.16.167.184]:5701, server version:3.9-EA Local address: /172.16.167.184:41744
2017-11-17 07:40:36,897 [bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3.event-5] c.h.c.s.i.ClientMembershipListener INFO - bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3 [bluemind-72D26E8A-5BB1-48A4-BC71-EEE92E0CE4EE] [3.9-EA] 

Members [1] {
	Member [172.16.167.184]:5701 - 9f7dc416-c46b-4ef9-9c66-b7bdd1bf06c9
}

2017-11-17 07:40:36,897 [bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3.event-5] n.b.h.c.i.ClusterClient INFO - JVM bm-eas 16415879-e483-4f49-b143-e4edade5978e left.
2017-11-17 07:40:36,898 [bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3.event-5] n.b.h.c.i.ClusterClient INFO - JVM bm-core 9f7dc416-c46b-4ef9-9c66-b7bdd1bf06c9 joined.
2017-11-17 07:40:36,898 [bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3.cluster-] c.h.c.c.ClientConnectionManager INFO - bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3 [bluemind-72D26E8A-5BB1-48A4-BC71-EEE92E0CE4EE] [3.9-EA] Setting ClientConnection{alive=true, connectionId=5, channel=NioChannel{/172.16.167.184:41744->/172.16.167.184:5701}, remoteEndpoint=[172.16.167.184]:5701, lastReadTime=2017-11-17 07:40:36.897, lastWriteTime=2017-11-17 07:40:36.884, closedTime=never, lastHeartbeatRequested=never, lastHeartbeatReceived=never, connected server version=3.9-EA} as owner with principal ClientPrincipal{uuid='e28d9e9b-d6e7-43ce-ab85-6a41883c3eb4', ownerUuid='9f7dc416-c46b-4ef9-9c66-b7bdd1bf06c9'}
2017-11-17 07:40:36,898 [bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3.cluster-] c.h.c.LifecycleService INFO - bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3 [bluemind-72D26E8A-5BB1-48A4-BC71-EEE92E0CE4EE] [3.9-EA] HazelcastClient 3.9-EA (20170704 - f594093) is CLIENT_CONNECTED
2017-11-17 07:40:41,211 [vert.x-eventloop-thread-0] n.b.s.s.i.StateObserverVerticle WARN - no heartbeat since 4568 ms, switch to UNKNOWN & trigger a refresh

I kept my last application log line about the heartbeat because it relates to a reliable topic in my cluster that propagates the state of a central component. What I see is that my consumers are not automagically reconnected when the cluster comes back up.


sancar commented Nov 17, 2017

@tcataldo Can you describe the reliable-topic scenario in more detail, preferably in the Google group? Although the issue looks the same, it is very likely that the causes are different. We would like to get more info and work on this.

@nlwillia As far as I understand, our latest 3.9 release works as expected (except for the reliable topic). I would like to close this issue, along with #8662, if everything is OK. Is that OK?

@sancar sancar self-assigned this Nov 17, 2017
@nlwillia

Yes, the client sees the CLIENT_DISCONNECTED event. I only mentioned isRunning() because in an async, retry-forever configuration it's no longer something that the client needs to worry about.

When I tested, I used a regular topic. I just retried with a reliable topic, and it did not auto-reconnect, but that's easy to handle in the lifecycle listener (which is needed anyway for async start). Maybe something to look at or document, but not a show-stopper.
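
For reference, a minimal sketch of handling the reliable topic in the lifecycle listener by re-registering on every reconnect; the topic name and listener are illustrative, and in practice the registration ID returned by addMessageListener could be used to drop the previous registration first:

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.LifecycleEvent.LifecycleState;

public class ReliableTopicReconnectExample {
    public static void reRegisterOnReconnect(HazelcastInstance client) {
        client.getLifecycleService().addLifecycleListener(event -> {
            // Reliable-topic listeners did not survive the reconnect in the test
            // above, so re-register them each time the client comes back.
            if (event.getState() == LifecycleState.CLIENT_CONNECTED) {
                client.<String>getReliableTopic("heartbeat").addMessageListener(message ->
                        System.out.println("heartbeat: " + message.getMessageObject()));
            }
        });
    }
}
```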

I'm not the author of this issue, but from my perspective it appears to be addressed. The fact that it was still open, with the earlier comments, led me to believe there was still no solution to this problem, which no longer appears to be the case. The async mode significantly reduces the need for elaborate control over the client connection policy, which is the thrust of #8662. The only thing I would still complain about there is the (admittedly oblique) use case of wanting to make a fail-fast "is-it-there" initial connection and then convert it into a retry-forever "keep-it-alive" connection. I can work around that by using two connections with different configurations, though. (I'm trying to be clever by having a QA version of the application connect to a separate cluster if it's there, and otherwise start its own private one internally.)

@sancar sancar modified the milestones: Backlog, 3.9 Nov 22, 2017

sancar commented Nov 22, 2017

I am closing this issue as its main request is addressed.
We have worked on a ClientConnectionStrategy interface but decided not to expose it publicly at the moment. @nlwillia's request for a fail-fast "is-it-there" initial connection that converts into a retry-forever "keep-it-alive" connection, and issue #8662, can be addressed there once that work is finished.
So we will keep #8662 open for now.

I have also updated the milestone of this issue, since it is solved in 3.9.
