Reconnecting a client without blocking (Feature Request) #9692

Closed · MattNohelty opened this issue Jan 18, 2017 · 11 comments
Labels: Source: Community (PR or issue was opened by a community user) · Team: Client · Type: Enhancement

MattNohelty commented Jan 18, 2017

The process of reconnecting a Hazelcast client after a cluster goes down is problematic in some use cases. There are currently two strategies that you can take.

One option is to specify a finite number of reconnect attempts. Once that limit is exceeded, the client is destroyed and any objects (maps, sets, etc.) that were retrieved from it are no longer usable (anything that accesses them gets a HazelcastInstanceNotActiveException, even after the Hazelcast cluster comes back up).

The second option is to specify an unlimited number of reconnect attempts. This will cause the client to eventually reconnect when the cluster comes back up. However, everything trying to use the client while the cluster is down will block until the connection to the cluster is restored. In the case of a web application with heavy load, this is likely not acceptable because requests will get queued up and potentially starve out the server's resources.

A third option therefore seems desirable. The HazelcastInstance knows whether it is running (getLifecycleService().isRunning()), so it would be nice if it failed immediately when something tried to use it (e.g. hazelcastInstance.getMap(), imap.get('key'), etc.). In the background, any disconnected client would keep trying to reconnect based on some policy (every X seconds, exponential back-off, etc.).

This approach provides the benefits of fast failure so huge backlogs of work don't pile up if the cluster goes down while also avoiding the problems of destroying the client and invalidating all of the objects that were retrieved from it.
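
A minimal sketch of the fail-fast behaviour being requested, written as a hypothetical user-side wrapper; nothing like this is provided by Hazelcast itself, and the class name and exception choice are purely illustrative:

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

// Hypothetical user-side wrapper illustrating the proposed fail-fast behaviour.
public final class FailFastHazelcast {
    private final HazelcastInstance instance;

    public FailFastHazelcast(HazelcastInstance instance) {
        this.instance = instance;
    }

    public <K, V> IMap<K, V> getMap(String name) {
        // Fail immediately instead of blocking while a background reconnect runs.
        if (!instance.getLifecycleService().isRunning()) {
            throw new IllegalStateException("Hazelcast client is not connected");
        }
        return instance.getMap(name);
    }
}
```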


dbrimley commented Jan 19, 2017

This would be very useful; I've come across this problem just recently. Ideally I'd like a long-running reconnect process that does not block operations on the HazelcastInstance, with operations failing fast as in @MattNohelty's proposal.

This is particularly problematic when trying to use Hazelcast as a provider for cache annotations on methods, such as Spring Cache or JCache annotations. The annotation on the method always calls the cache first; if the cache connection is down and in a long retry cycle, the call just blocks and the method body is never invoked.

I'd like the client call to fail fast, but still maintain a long retry thread in the background. That way the annotated method body is called, which can provide an alternative read from another backing store, and at some point in the future the cache connection simply comes back to life.
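
A minimal sketch of that scenario, assuming Spring's @Cacheable backed by a Hazelcast cache; the Customer/CustomerRepository types and the cache name are illustrative:

```java
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class CustomerService {

    private final CustomerRepository customerRepository; // illustrative backing store

    public CustomerService(CustomerRepository customerRepository) {
        this.customerRepository = customerRepository;
    }

    // @Cacheable consults the cache before invoking the method body. If the cache
    // client blocks in a long retry cycle, this body is never reached; with a
    // fail-fast client (plus a CacheErrorHandler that ignores cache access errors),
    // the fallback read below can still run.
    @Cacheable("customers")
    public Customer findCustomer(String id) {
        return customerRepository.load(id);
    }
}
```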

@dbrimley (Contributor)

As an added scenario: without this, writing things like circuit breakers for Hazelcast clients becomes impossible.

@tcataldo

I am interested in the second option mentioned in this issue:

The second option is to specify an unlimited number of reconnect attempts. This will cause the client to eventually reconnect when the cluster comes back up. However, everything trying to use the client while the cluster is down will block until the connection to the cluster is restored. In the case of a web application with heavy load, this is likely not acceptable because requests will get queued up and potentially starve out the server's resources.

Since my client goes to the shutdown state after 2 attempts, I would be happy to know where those retry settings are configured.
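
For reference, a minimal sketch of where those retry settings live in the 3.x client configuration; the values shown are illustrative, and treating 0 as "retry indefinitely" is my reading of the connection-attempt-limit behaviour:

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.core.HazelcastInstance;

public class RetryConfigExample {
    public static void main(String[] args) {
        ClientConfig config = new ClientConfig();
        // The default connection-attempt-limit is 2, which is why the client shuts
        // down after two failed attempts; 0 is (I believe) treated as unlimited.
        config.getNetworkConfig().setConnectionAttemptLimit(0);
        config.getNetworkConfig().setConnectionAttemptPeriod(5000); // ms between attempts
        HazelcastInstance client = HazelcastClient.newHazelcastClient(config);
    }
}
```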

@nlwillia

I've been trying to figure out a suitable approach for keeping the server healthy even if Hazelcast is unhealthy (see also #8662). It would be easiest to just have the client configured to never give up because then I can cache the instance handle statically (handles to distributed objects also) and reuse it throughout my application. But doing so means that distributed object operations could block indefinitely, consume all my server threads and lock everything up. On the other hand, configuring the client to give up means that:

  • I have to decide on some global lowest-common-denominator threshold for when timeout should occur (or use different connections with different configurations for different tolerances).
  • I have to build my own factory abstraction if I want to support a recovery (circuit breaker, new connection, etc.) pattern and hit that on every request.
  • I have to build my own connect listener feature for things like topic registrations so they'll get bound on any new connection object that's created.
  • I have to resolve new distributed object handles on every request because if the connection shuts down any old ones are useless.

The programming model would be simpler if the connection and object handles were things you could just declare statically and control by policy, but that policy needs to be flexible enough to handle concerns that are unique to each. Hopefully no real-world deployment is going to be so unstable that this stuff gets used much, but understanding, planning for, and testing what could happen is a source of friction.

@dbrimley (Contributor)

@tcataldo @nlwillia Are you aware that Client Connection Strategies were released just recently in 3.9?

These allow for non-blocking behaviour when a HazelcastInstance is unavailable and give wider options for retry.

In fact, this issue should be closed, as that feature addresses the topic of "Reconnecting a client without blocking".
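
For reference, a minimal sketch of the 3.9 configuration in question, assuming ClientConnectionStrategyConfig with ReconnectMode.ASYNC; the attempt limit shown is illustrative:

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.client.config.ClientConnectionStrategyConfig.ReconnectMode;
import com.hazelcast.core.HazelcastInstance;

public class AsyncReconnectExample {
    public static void main(String[] args) {
        ClientConfig config = new ClientConfig();
        // Start and reconnect in the background; while disconnected, operations
        // fail fast instead of blocking.
        config.getConnectionStrategyConfig().setAsyncStart(true);
        config.getConnectionStrategyConfig().setReconnectMode(ReconnectMode.ASYNC);
        // Keep retrying instead of shutting the client down.
        config.getNetworkConfig().setConnectionAttemptLimit(Integer.MAX_VALUE);
        HazelcastInstance client = HazelcastClient.newHazelcastClient(config);
    }
}
```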

@nlwillia

You know, I looked at that, but somehow came away with an incorrect impression of what it was doing (some kind of operation buffering). I did some testing, and with ReconnectMode.ASYNC and unlimited network attempts configured, the behavior does seem to be suitable. I can cache static handles to both the instance and distributed objects, register topic listeners once (the first time a LifecycleListener sees LifecycleState.CLIENT_CONNECTED), and distributed object calls fail fast with HazelcastClientOfflineException without the need for a circuit breaker. getLifecycleService().isRunning() ends up always being true because the client is always either available or asynchronously trying to connect and throwing exceptions for operations in the interim.
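
A minimal sketch of that pattern, assuming a LifecycleListener registered on the client; the topic name and listener body are illustrative:

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.LifecycleEvent.LifecycleState;
import java.util.concurrent.atomic.AtomicBoolean;

public class TopicRegistrationExample {
    public static void registerOnFirstConnect(HazelcastInstance client) {
        AtomicBoolean registered = new AtomicBoolean(false);
        // Register topic listeners once, the first time the asynchronously started
        // client actually reaches the cluster.
        client.getLifecycleService().addLifecycleListener(event -> {
            if (event.getState() == LifecycleState.CLIENT_CONNECTED
                    && registered.compareAndSet(false, true)) {
                client.<String>getTopic("status").addMessageListener(message ->
                        System.out.println("received: " + message.getMessageObject()));
            }
        });
    }
}
```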

@dbrimley (Contributor)

@nlwillia I would expect getLifecycleService().isRunning() to return true even when the HazelcastInstance is retrying a connection and is in a disconnected state. The question is: did the LifecycleListener deliver a LifecycleState.CLIENT_DISCONNECTED event? If that didn't happen when your client disconnected, then we need to raise a new issue.

@tcataldo

@nlwillia your point about "register topic listeners once" does not work for me. My case: I shut down the cluster:
2017-11-17 07:40:34,854 [bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3.cluster-] c.h.c.c.ClientConnectionManager WARN - bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3 [bluemind-72D26E8A-5BB1-48A4-BC71-EEE92E0CE4EE] [3.9-EA] Unable to get alive cluster connection, try in 1969 ms later, attempt 12 of 2147483647.

Then the cluster comes back up:

2017-11-17 07:40:36,824 [bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3.cluster-] c.h.c.c.ClientConnectionManager INFO - bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3 [bluemind-72D26E8A-5BB1-48A4-BC71-EEE92E0CE4EE] [3.9-EA] Trying to connect to [172.16.167.184]:5701 as owner member
2017-11-17 07:40:36,884 [bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3.internal-1] c.h.c.c.ClientConnectionManager INFO - bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3 [bluemind-72D26E8A-5BB1-48A4-BC71-EEE92E0CE4EE] [3.9-EA] Authenticated with server [172.16.167.184]:5701, server version:3.9-EA Local address: /172.16.167.184:41744
2017-11-17 07:40:36,897 [bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3.event-5] c.h.c.s.i.ClientMembershipListener INFO - bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3 [bluemind-72D26E8A-5BB1-48A4-BC71-EEE92E0CE4EE] [3.9-EA] 

Members [1] {
	Member [172.16.167.184]:5701 - 9f7dc416-c46b-4ef9-9c66-b7bdd1bf06c9
}

2017-11-17 07:40:36,897 [bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3.event-5] n.b.h.c.i.ClusterClient INFO - JVM bm-eas 16415879-e483-4f49-b143-e4edade5978e left.
2017-11-17 07:40:36,898 [bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3.event-5] n.b.h.c.i.ClusterClient INFO - JVM bm-core 9f7dc416-c46b-4ef9-9c66-b7bdd1bf06c9 joined.
2017-11-17 07:40:36,898 [bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3.cluster-] c.h.c.c.ClientConnectionManager INFO - bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3 [bluemind-72D26E8A-5BB1-48A4-BC71-EEE92E0CE4EE] [3.9-EA] Setting ClientConnection{alive=true, connectionId=5, channel=NioChannel{/172.16.167.184:41744->/172.16.167.184:5701}, remoteEndpoint=[172.16.167.184]:5701, lastReadTime=2017-11-17 07:40:36.897, lastWriteTime=2017-11-17 07:40:36.884, closedTime=never, lastHeartbeatRequested=never, lastHeartbeatReceived=never, connected server version=3.9-EA} as owner with principal ClientPrincipal{uuid='e28d9e9b-d6e7-43ce-ab85-6a41883c3eb4', ownerUuid='9f7dc416-c46b-4ef9-9c66-b7bdd1bf06c9'}
2017-11-17 07:40:36,898 [bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3.cluster-] c.h.c.LifecycleService INFO - bm-hps-7e62969d-f5f6-4e7b-930a-2b42ae935ab3 [bluemind-72D26E8A-5BB1-48A4-BC71-EEE92E0CE4EE] [3.9-EA] HazelcastClient 3.9-EA (20170704 - f594093) is CLIENT_CONNECTED
2017-11-17 07:40:41,211 [vert.x-eventloop-thread-0] n.b.s.s.i.StateObserverVerticle WARN - no heartbeat since 4568 ms, switch to UNKNOWN & trigger a refresh

I kept my last application log line about the heartbeat because it relates to a reliable topic in my cluster that propagates the state of a central component. What I see is that my consumers are not automagically reconnected when the cluster comes back up.


sancar commented Nov 17, 2017

@tcataldo Can you describe the reliable-topic scenario in more detail, preferably in the Google group? Although the issue looks the same, it is very likely that the causes are different. We would like to get more info and work on this.

@nlwillia As far as I understand, our latest 3.9 release works as expected (except for the reliable topic). I would like to close this issue, along with #8662, if everything is OK. Is that OK?

@sancar sancar self-assigned this Nov 17, 2017
@nlwillia

Yes, the client sees the CLIENT_DISCONNECTED event. I only mentioned isRunning() because in an async, retry-forever configuration it's no longer something that the client needs to worry about.

When I tested, I used a regular topic. I just retried with a reliable topic, and it did not auto-reconnect, but that's easy to handle in the lifecycle listener (which is needed anyway for async start). Maybe something to look at or document, but not a show-stopper.
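
For reference, a minimal sketch of handling the reliable topic in the lifecycle listener by re-registering on every reconnect; the topic name and listener are illustrative, and in practice the registration ID returned by addMessageListener could be used to drop the previous registration first:

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.LifecycleEvent.LifecycleState;

public class ReliableTopicReconnectExample {
    public static void reRegisterOnReconnect(HazelcastInstance client) {
        client.getLifecycleService().addLifecycleListener(event -> {
            // Reliable-topic listeners did not survive the reconnect in the test
            // above, so re-register them each time the client comes back.
            if (event.getState() == LifecycleState.CLIENT_CONNECTED) {
                client.<String>getReliableTopic("heartbeat").addMessageListener(message ->
                        System.out.println("heartbeat: " + message.getMessageObject()));
            }
        });
    }
}
```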

I'm not the author of this issue, but from my perspective it appears to be addressed. The fact that it was still open, with the earlier comments, led me to believe there was still no solution to this problem, which no longer appears to be the case. The async mode significantly reduces the need for elaborate control over the client connection policy, which is the thrust of #8662. The only thing I would still complain about there is the (admittedly oblique) use case of wanting to make a fail-fast "is-it-there" initial connection and then convert it into a retry-forever "keep-it-alive" connection. I can work around that by using two connections with different configurations, though. (I'm trying to be clever by having a QA version of the application connect to a separate cluster if it's there, and otherwise start its own private one internally.)

@sancar sancar modified the milestones: Backlog, 3.9 Nov 22, 2017

sancar commented Nov 22, 2017

I am closing this issue as its main request is addressed.
We have worked on a ClientConnectionStrategy interface but decided not to expose it publicly at the moment. @nlwillia's request for a fail-fast "is-it-there" initial connection that converts into a retry-forever "keep-it-alive" connection, and issue #8662, can be addressed there once that work is finished.
So we will keep #8662 open for now.

I have also updated the milestone of this issue, since it is solved in 3.9.
