Error using discovery strategy in hazelcast client #11116
I've come across a couple of issues while testing failover via the client. I have a very simple 2-member cluster running on Docker/Mesosphere. The cluster itself uses a discovery strategy to discover its cluster peers (via Marathon). This works very well.
I also use a discovery strategy in the Hazelcast client, by setting a discovery strategy that simply calls a RESTful service returning the address/port of the cluster members.
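For illustration, here is a minimal, hypothetical sketch of the parsing step such a REST-backed strategy might use. The class and method names are invented; a real strategy would wrap something like this in Hazelcast's Discovery SPI (e.g. by extending AbstractDiscoveryStrategy) and make an actual HTTP call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns the "host:port" lines returned by the
// discovery service into (host, port) pairs the strategy can report
// as discovered cluster members.
public class MemberListParser {

    public static final class Member {
        public final String host;
        public final int port;

        Member(String host, int port) {
            this.host = host;
            this.port = port;
        }
    }

    // Parses a newline-separated "host:port" payload, ignoring blank lines.
    public static List<Member> parse(String body) {
        List<Member> members = new ArrayList<>();
        for (String line : body.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            int idx = line.lastIndexOf(':');
            members.add(new Member(line.substring(0, idx),
                    Integer.parseInt(line.substring(idx + 1))));
        }
        return members;
    }

    public static void main(String[] args) {
        List<Member> members = parse("10.0.0.1:31744\n10.0.0.2:31745\n");
        System.out.println(members.size());      // prints 2
        System.out.println(members.get(0).host); // prints 10.0.0.1
        System.out.println(members.get(1).port); // prints 31745
    }
}
```

The actual strategy would then map each parsed pair to a discovered node for the client to connect to.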
My client configuration looks like:
ClientConfig config = new ClientConfig();

DiscoveryConfig discoveryConfig = new DiscoveryConfig();
discoveryConfig.addDiscoveryStrategyConfig(
        new DiscoveryStrategyConfig(
                new ClusterDiscoveryStrategyFactory(URI.create(hazelcastClusterDiscoveryUri))));

ClientNetworkConfig networkConfig = new ClientNetworkConfig();
networkConfig.setDiscoveryConfig(discoveryConfig);
config.setNetworkConfig(networkConfig);

config.setProperty("hazelcast.discovery.enabled", "true");
config.setProperty("hazelcast.logging.type", "slf4j");

config.setSerializationConfig(new SerializationConfig()
        .setGlobalSerializerConfig(new GlobalSerializerConfig()
                .setImplementation(new HazelcastKryoSerializer<>(new KryoSerializer(false)))));

return HazelcastClient.newHazelcastClient(config);
My first issue is that even though I set the discovery strategy and don't explicitly set the addresses for the client to connect to, Hazelcast adds a DefaultAddressProvider that adds localhost. Is this by design, or a configuration issue on my side? I am providing a discovery strategy to return the cluster members to connect to; I don't want Hazelcast to attempt to connect to 5701/2/3 on localhost.
My second issue is that, in my tests, when I stop the cluster member the client is connected to, I expect the client to fail over seamlessly to the remaining member in the cluster. What I am seeing is that, on startup, the client establishes a connection with both cluster members; the first one is the owner member.
From the logs above, it has connected to and authenticated against both members.
After this I would expect the existing connection made on startup to be promoted in the client. However, it initiates a new connection (with a new connection ID) and thus breaks the assertion in the code as below:
As I read the implementation, the issue seems to be that, when the owner member fails, the client establishes a new connection to an address for which there is already an active connection.
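The failure mode can be modeled with a toy connection manager (all names invented; this is not the actual Hazelcast code). The invariant the real assertion guards is "authenticating a new connection never finds an existing active connection to the same address"; the bug report is that a reconnect reaches the authenticate path even though an entry for that address already exists.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a client connection manager (invented names, not the
// real Hazelcast implementation).
public class ToyConnectionManager {

    private final Map<String, Integer> activeConnections = new HashMap<>();
    private int nextConnectionId = 1;

    // Returns the existing connection id for the address, or opens a new one.
    public int getOrConnect(String address) {
        Integer existing = activeConnections.get(address);
        if (existing != null) {
            return existing;
        }
        return authenticate(address);
    }

    // Models onAuthenticated: registers the new connection and asserts
    // that no connection to that address was already active.
    private int authenticate(String address) {
        int id = nextConnectionId++;
        Integer old = activeConnections.put(address, id);
        if (old != null) {
            // The condition the real assertion treats as impossible.
            throw new AssertionError("duplicate connection to " + address);
        }
        return id;
    }

    public static void main(String[] args) {
        ToyConnectionManager mgr = new ToyConnectionManager();
        mgr.getOrConnect("10.0.0.1:31744"); // connection 1, the owner
        mgr.getOrConnect("10.0.0.2:31745"); // connection 2, the backup
        // A repeated request with the SAME key reuses the connection; the
        // reported bug is that the lookup key differed from the stored key,
        // so authenticate() ran a second time and hit the assertion.
        System.out.println(mgr.getOrConnect("10.0.0.2:31745")); // prints 2
    }
}
```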
Any hints on how to resolve this would be much appreciated.
@robbiecross thanks for the detailed report.
Let's see what we can do about this issue.
I should have given more detail about the second issue (the AssertionError). My test asynchronously updates an IMap from a Hazelcast client. While the IMap is being updated continuously (imap.put...), I kill one of the cluster members (so the cluster goes from 2 members to 1).
Adding localhost is already reported in #10606.
The AssertionError, on the other hand, is problematic. The assertion is there because we thought it was impossible to hit that line. It seems we were wrong. I am assigning this to 3.8.5 to look into it in detail.
Not sure if it helps, but I wondered whether this assertion error was being flushed out by using a discovery strategy in the client to discover cluster members. However, I also tested with an initial static list of addresses by setting the address list in the client network config, and the assertion error still arises. It appears to come down to a discrepancy in the connectionId of an existing connection, e.g.:
Healthy cluster with 3 members and a client connected
I examined the logs and I think I have a reasoning for why the second issue may occur:
Firstly, I noticed this: the member list printed uses the domain names of the members ([lx-mesos-d13.unix.dnsbego.de]:31744), while in the connection log we see this:
This line is being called with the IP address: https://github.com/hazelcast/hazelcast/blob/v3.8.2/hazelcast-client/src/main/java/com/hazelcast/client/connection/nio/ClientConnectionManagerImpl.java#L258 and it returns null; if it were called with the domain name, it would find the existing connection. Hence a new connection is initiated and onAuthenticated is called. This line uses the domain name: https://github.com/hazelcast/hazelcast/blob/v3.8.2/hazelcast-client/src/main/java/com/hazelcast/client/connection/nio/ClientConnectionManagerImpl.java#L653 and it returns the old connection as non-null, which causes the assertion error.
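That diagnosis can be reproduced as a toy example (stdlib only, invented names and a hypothetical IP): the connection is stored under the domain-name key, the reconnect path looks it up under the IP key and misses, and the subsequent store under the domain name then finds the old entry.

```java
import java.util.HashMap;
import java.util.Map;

// Toy reproduction of the key mismatch (not the real Hazelcast code):
// the connection is registered under the member's domain name, but the
// reconnect path looks it up by IP address.
public class KeyMismatchDemo {
    public static void main(String[] args) {
        Map<String, String> activeConnections = new HashMap<>();

        // onAuthenticated stores the connection under the domain name
        // taken from the member list.
        activeConnections.put("[lx-mesos-d13.unix.dnsbego.de]:31744", "conn-1");

        // The reconnect path resolves the member to its IP address first,
        // so the lookup (the #L258 call above) misses ...
        String byIp = activeConnections.get("[10.20.30.40]:31744"); // hypothetical IP
        System.out.println(byIp); // prints null -> a new connection is initiated

        // ... and when the new connection authenticates, the store under the
        // domain name (the #L653 call above) finds the old entry, which is
        // exactly what the assertion treats as impossible.
        String old = activeConnections.put("[lx-mesos-d13.unix.dnsbego.de]:31744", "conn-2");
        System.out.println(old != null); // prints true -> AssertionError
    }
}
```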
If this is the case, here is a proposed solution:
Here is a manual test to produce the issue:
This test causes the assertion error.
Note: I put 3.8.2 code references instead of 3.8.4, but they should be similar.
Make sure that the provided address string is used as the key in the activeConnectionsMap. This fixes the domain-name-to-IP-address conversion problem, and the problem of initiating duplicate connections to the same member (one with the IP address and one with the domain name). We make the assumption that the user has to use the same host name or IP in the client config if the member is configured with a public IP address.

fixes hazelcast#11116
fixes hazelcast#11264
backport of hazelcast#11226
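The idea of the fix can be sketched in stdlib terms (all names invented; the real change lives inside ClientConnectionManagerImpl): normalize every address back to the string the user provided before using it as a map key, so registration and lookup always agree regardless of whether a caller holds the domain name or the resolved IP.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the fix idea (invented names): key the active-connections map
// by the address string the user configured, so lookups and registrations
// agree regardless of domain-name/IP resolution.
public class ProvidedAddressKeyDemo {

    private final Map<String, String> providedByResolved = new HashMap<>();
    private final Map<String, String> activeConnections = new HashMap<>();

    // Remember which configured address a resolved address came from.
    public void recordResolution(String provided, String resolved) {
        providedByResolved.put(resolved, provided);
        providedByResolved.put(provided, provided);
    }

    // Normalize any form of the address back to the provided string.
    private String key(String address) {
        return providedByResolved.getOrDefault(address, address);
    }

    public String getConnection(String address) {
        return activeConnections.get(key(address));
    }

    public void register(String address, String connection) {
        activeConnections.put(key(address), connection);
    }

    public static void main(String[] args) {
        ProvidedAddressKeyDemo mgr = new ProvidedAddressKeyDemo();
        mgr.recordResolution("[lx-mesos-d13.unix.dnsbego.de]:31744",
                "[10.20.30.40]:31744"); // hypothetical IP
        mgr.register("[lx-mesos-d13.unix.dnsbego.de]:31744", "conn-1");

        // With both forms mapped to the provided string, the IP lookup now
        // finds the existing connection instead of triggering a reconnect.
        System.out.println(mgr.getConnection("[10.20.30.40]:31744")); // prints conn-1
    }
}
```

This also explains the assumption stated in the commit message: if a member advertises a public IP, the client config must use that same host name or IP, since it is the provided string that anchors the normalization.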