Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error using discovery strategy in hazelcast client #11116

Closed
robbiecross opened this issue Aug 16, 2017 · 6 comments
Closed

Error using discovery strategy in hazelcast client #11116

robbiecross opened this issue Aug 16, 2017 · 6 comments

Comments

@robbiecross
Copy link

@robbiecross robbiecross commented Aug 16, 2017

Version: 3.8.4
Cluster Size: 2
Clients: 1
Java 8

Hi,

I've come across an issue(s) testing failover over the client. I have a very simple 2 member cluster running on docker/mesosphere. The cluster itself uses a discovery strategy to discover it's cluster peers (via marathon). This works very well.

I also use a discovery strategy in the hazlecast client by setting a discovery strategy that just calls a restful service returning the address/port of the cluster members.

My client configuration looks like:

ClientConfig config = new ClientConfig();		
DiscoveryConfig discoveryConfig = new DiscoveryConfig();
		discoveryConfig.addDiscoveryStrategyConfig(
				new DiscoveryStrategyConfig(new ClusterDiscoveryStrategyFactory(URI.create(hazelcastClusterDiscoveryUri))));
		ClientNetworkConfig networkConfig = new ClientNetworkConfig();
		networkConfig.setDiscoveryConfig(discoveryConfig);
		config.setNetworkConfig(networkConfig);		
		config.setProperty("hazelcast.discovery.enabled", "true");
        config.setProperty("hazelcast.logging.type", "slf4j");
		config.setSerializationConfig(new SerializationConfig()
				.setGlobalSerializerConfig(new GlobalSerializerConfig()
						.setImplementation(new HazelcastKryoSerializer<>(new KryoSerializer(false)))));
        return HazelcastClient.newHazelcastClient(config);

My first issue is that even though I set the discovery strategy and don't explicitly set the addresses for the client to connect to, Hazelcast adds a DefaultAddressProvider that adds localhost. Is this by design or a configuration issue on my side? I am providing a discovery strategy to return the cluster members to connect to. I don't want hazelcast to attempt to bind to 5701/2/3 on localhost.

My second issue is that, in my tests, when I stop the cluster member that the client is connected to, I expect the client to seamlessly failover to the remaining member in the cluster. What I am seeing is that, on startup, the client is establishing a connection with both cluster members. The first one is the owner member

2017-08-16 16:43:53,676 INFO [hz.client_0.internal-2] c.h.c.c.ClientConnectionManager [?:?] hz.client_0 [dev] [3.8.4] Setting ClientConnection{alive=true, connectionId=1, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/10.64.0.29:64800 remote=/89.21.121.14:31765]}, remoteEndpoint=[lx-mesos-d02.unix.dnsbego.de]:31765, lastReadTime=2017-08-16 16:43:53.648, lastWriteTime=2017-08-16 16:43:53.613, closedTime=never, lastHeartbeatRequested=never, lastHeartbeatReceived=never, connected server version=3.8.3} as owner  with principal ClientPrincipal{uuid='56c4d015-4ad3-4bf4-82ef-57a23c5b2b64', ownerUuid='69713e18-22d1-4bca-bf83-019d1411a66c'}
2017-08-16 16:43:53,677 TRACE [hz.client_0.internal-2] c.h.c.c.ClientConnectionManager [?:?] hz.client_0 [dev] [3.8.4] Authentication succeeded for ClientConnection{alive=true, connectionId=1, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/10.64.0.29:64800 remote=/89.21.121.14:31765]}, remoteEndpoint=[lx-mesos-d02.unix.dnsbego.de]:31765, lastReadTime=2017-08-16 16:43:53.648, lastWriteTime=2017-08-16 16:43:53.613, closedTime=never, lastHeartbeatRequested=never, lastHeartbeatReceived=never, connected server version=3.8.3} and there was no old connection to this end-point
2017-08-16 16:43:53,677 INFO [hz.client_0.internal-2] c.h.c.c.ClientConnectionManager [?:?] hz.client_0 [dev] [3.8.4] Authenticated with server [lx-mesos-d02.unix.dnsbego.de]:31765, server version:3.8.3 Local address: /10.64.0.29:64800
2017-08-16 16:43:53,691 INFO [hz.client_0.event-2] c.h.c.s.i.ClientMembershipListener [?:?] hz.client_0 [dev] [3.8.4] 

Members [2] {
	Member [lx-mesos-d13.unix.dnsbego.de]:31744 - bbe23fd9-a5ef-424b-8fdb-48b7446da79a
	Member [lx-mesos-d02.unix.dnsbego.de]:31765 - 69713e18-22d1-4bca-bf83-019d1411a66c
}

2017-08-16 16:43:53,692 INFO [main] c.h.c.LifecycleService [?:?] hz.client_0 [dev] [3.8.4] HazelcastClient 3.8.4 (20170809 - 297a77e) is CLIENT_CONNECTED
2017-08-16 16:43:53,810 TRACE [hz.client_0.internal-2] c.h.c.c.ClientConnectionManager [?:?] hz.client_0 [dev] [3.8.4] Authentication succeeded for ClientConnection{alive=true, connectionId=2, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/10.64.0.29:64801 remote=lx-mesos-d13.unix.dnsbego.de/89.21.122.52:31744]}, remoteEndpoint=[lx-mesos-d13.unix.dnsbego.de]:31744, lastReadTime=2017-08-16 16:43:53.809, lastWriteTime=2017-08-16 16:43:53.780, closedTime=never, lastHeartbeatRequested=never, lastHeartbeatReceived=never, connected server version=3.8.3} and there was no old connection to this end-point
2017-08-16 16:43:53,811 INFO [hz.client_0.internal-2] c.h.c.c.ClientConnectionManager [?:?] hz.client_0 [dev] [3.8.4] Authenticated with server [lx-mesos-d13.unix.dnsbego.de]:31744, server version:3.8.3 Local address: /10.64.0.29:64801

From the logs above, it has connected to and authenticated against both members.
Now, I kill the cluster member that the client is connected to:

2017-08-16 16:43:55,199 WARN [hz.hz.client_0.IO.thread-in-0] c.h.c.c.n.ClientConnection [?:?] hz.client_0 [dev] [3.8.4] ClientConnection{alive=false, connectionId=1, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/10.64.0.29:64800 remote=/89.21.121.14:31765]}, remoteEndpoint=[lx-mesos-d02.unix.dnsbego.de]:31765, lastReadTime=2017-08-16 16:43:55.196, lastWriteTime=2017-08-16 16:43:55.187, closedTime=2017-08-16 16:43:55.197, lastHeartbeatRequested=never, lastHeartbeatReceived=never, connected server version=3.8.3} lost. Reason: java.io.EOFException[Remote socket closed!]
2017-08-16 16:43:55,200 INFO [hz.hz.client_0.IO.thread-in-0] c.h.c.c.ClientConnectionManager [?:?] hz.client_0 [dev] [3.8.4] Removed connection to endpoint: [lx-mesos-d02.unix.dnsbego.de]:31765, connection: ClientConnection{alive=false, connectionId=1, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[closed]}, remoteEndpoint=[lx-mesos-d02.unix.dnsbego.de]:31765, lastReadTime=2017-08-16 16:43:55.196, lastWriteTime=2017-08-16 16:43:55.199, closedTime=2017-08-16 16:43:55.197, lastHeartbeatRequested=never, lastHeartbeatReceived=never, connected server version=3.8.3}
2017-08-16 16:43:55,200 INFO [hz.hz.client_0.IO.thread-in-0] c.h.c.LifecycleService [?:?] hz.client_0 [dev] [3.8.4] HazelcastClient 3.8.4 (20170809 - 297a77e) is CLIENT_DISCONNECTED

After this I would expect the existing connection made on startup to be promoted in the client. However, it initiates a new connection (with a new connection ID) and thus breaks the assertion in the code as below:

2017-08-16 16:43:56,277 INFO [hz.client_0.cluster-] c.h.c.s.i.ClusterListenerSupport [?:?] hz.client_0 [dev] [3.8.4] Trying to connect to [89.21.122.52]:31744 as owner member
2017-08-16 16:43:56,287 INFO [hz.client_0.internal-3] c.h.c.c.ClientConnectionManager [?:?] hz.client_0 [dev] [3.8.4] Setting ClientConnection{alive=true, connectionId=3, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/10.64.0.29:64808 remote=/89.21.122.52:31744]}, remoteEndpoint=[lx-mesos-d13.unix.dnsbego.de]:31744, lastReadTime=2017-08-16 16:43:56.286, lastWriteTime=2017-08-16 16:43:56.281, closedTime=never, lastHeartbeatRequested=never, lastHeartbeatReceived=never, connected server version=3.8.3} as owner  with principal ClientPrincipal{uuid='56c4d015-4ad3-4bf4-82ef-57a23c5b2b64', ownerUuid='bbe23fd9-a5ef-424b-8fdb-48b7446da79a'}
2017-08-16 16:43:56,287 TRACE [hz.client_0.internal-3] c.h.c.c.ClientConnectionManager [?:?] hz.client_0 [dev] [3.8.4] Re-authentication succeeded for ClientConnection{alive=true, connectionId=3, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/10.64.0.29:64808 remote=/89.21.122.52:31744]}, remoteEndpoint=[lx-mesos-d13.unix.dnsbego.de]:31744, lastReadTime=2017-08-16 16:43:56.286, lastWriteTime=2017-08-16 16:43:56.281, closedTime=never, lastHeartbeatRequested=never, lastHeartbeatReceived=never, connected server version=3.8.3}
2017-08-16 16:43:56,288 ERROR [hz.client_0.internal-3] c.h.c.s.ClientInvocationService [Slf4jFactory.java:68] hz.client_0 [dev] [3.8.4] Failed asynchronous execution of execution callback: com.hazelcast.client.spi.impl.ClientInvocationFuture$InternalDelegatingExecutionCallback@e612fa9for call ClientMessage{length=141, correlationId=14, messageType=2, partitionId=-1, isComplete=true, isRetryable=true, isEvent=false, writeOffset=0}
java.lang.AssertionError: null
	at com.hazelcast.client.connection.nio.ClientConnectionManagerImpl.onAuthenticated(ClientConnectionManagerImpl.java:669) ~[hazelcast-client-3.8.4.jar:3.8.4]
	at com.hazelcast.client.connection.nio.ClientConnectionManagerImpl.access$900(ClientConnectionManagerImpl.java:91) ~[hazelcast-client-3.8.4.jar:3.8.4]
	at com.hazelcast.client.connection.nio.ClientConnectionManagerImpl$2.onResponse(ClientConnectionManagerImpl.java:578) ~[hazelcast-client-3.8.4.jar:3.8.4]
	at com.hazelcast.client.connection.nio.ClientConnectionManagerImpl$2.onResponse(ClientConnectionManagerImpl.java:550) ~[hazelcast-client-3.8.4.jar:3.8.4]
	at com.hazelcast.client.spi.impl.ClientInvocationFuture$InternalDelegatingExecutionCallback.onResponse(ClientInvocationFuture.java:129) ~[hazelcast-client-3.8.4.jar:3.8.4]
	at com.hazelcast.client.spi.impl.ClientInvocationFuture$InternalDelegatingExecutionCallback.onResponse(ClientInvocationFuture.java:117) ~[hazelcast-client-3.8.4.jar:3.8.4]
	at com.hazelcast.client.spi.impl.ClientInvocationFuture$InternalDelegatingExecutionCallback.onResponse(ClientInvocationFuture.java:129) ~[hazelcast-client-3.8.4.jar:3.8.4]
	at com.hazelcast.client.spi.impl.ClientInvocationFuture$InternalDelegatingExecutionCallback.onResponse(ClientInvocationFuture.java:117) ~[hazelcast-client-3.8.4.jar:3.8.4]
	at com.hazelcast.spi.impl.AbstractInvocationFuture$1.run(AbstractInvocationFuture.java:248) ~[hazelcast-3.8.4.jar:3.8.4]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_05]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_05]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_05]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_05]
	at com.hazelcast.util.executor.LoggingScheduledExecutor$LoggingDelegatingFuture.run(LoggingScheduledExecutor.java:138) [hazelcast-3.8.4.jar:3.8.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_05]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_05]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_05]
	at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64) [hazelcast-3.8.4.jar:3.8.4]
	at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80) [hazelcast-3.8.4.jar:3.8.4]

As I read the implementation, the issue seems to be, if the owning member fails, it is establishing a new connection to an address where there is already an active connection

Any hints on how to resolve would be much appreciated

Thanks

@vbekiaris
Copy link
Contributor

@vbekiaris vbekiaris commented Aug 16, 2017

@robbiecross thanks for the detailed report.

My first issue is that even though I set the discovery strategy and don't explicitly set the addresses for the client to connect to, Hazelcast adds a DefaultAddressProvider that adds localhost. Is this by design or a configuration issue on my side? I am providing a discovery strategy to return the cluster members to connect to. I don't want hazelcast to attempt to bind to 5701/2/3 on localhost.

Looks like the DefaultAddressProvider is anyway added to the list of address providers and when there are no addresses configured in the ClientConfig, localhost is explicitly added.

Let's see what we can do about this issue.

@robbiecross
Copy link
Author

@robbiecross robbiecross commented Aug 17, 2017

Thanks @vbekiaris

I should have given more detail about the 2nd issue (Assertion error). My test is asynchronously updating an IMap from a hz client. Whilst the IMap is being updated (imap.put...) continuously, I am killing one of the cluster members (so the cluster goes from 2 to 1)

@sancar
Copy link
Member

@sancar sancar commented Aug 17, 2017

Adding localhost is already reported here #10606
This seems like not a desired behaviour and we will fix that.

AssertionError on the other hand is problematic. Assertion is there because, we were thinking it is impossible to hit that line. It seems we were wrong. I am assigning this to to 3.8.5 to look into it in detail.

@sancar sancar added this to the 3.8.5 milestone Aug 17, 2017
@sancar sancar added the Type: Defect label Aug 17, 2017
@ihsandemir ihsandemir modified the milestones: 3.8.6, 3.8.5 Aug 18, 2017
@robbiecross
Copy link
Author

@robbiecross robbiecross commented Aug 23, 2017

Not sure if it helps but I was wondering if this assertion error was being flushed out by using discovery strategy in the client to discovery cluster members. However, I also tested with an initial static list of addresses by setting the address list in the client network config. The assertion error still arises. It appears to be down to this discrepancy in connectionId of an existing connection e.g.

Healthy cluster with 3 members and a client connected
I scale my cluster from 3 instances to 1.
The client's owning connection has stopped.
The client tries to establish an owning connection with the remaining cluster member.
That connection is established but has a new id (4) to that when it initially connected to the cluster

@ihsandemir
Copy link
Contributor

@ihsandemir ihsandemir commented Aug 23, 2017

I examined the logs and I think I have a reasoning why the 2nd issue may occur:

Firstly I noticed this: The member list printed uses domain names of the members: [lx-mesos-d13.unix.dnsbego.de]:31744 and in the connection log we see this:
Trying to connect to [89.21.122.52]:31744 as owner member
I believe that 89.21.122.52 is the ip address corresponding to lx-mesos-d13.unix.dnsbego.de
Furthermore, when setting the remote endpoint address in a connection we use what is received from the authenticated member and it is lx-mesos-d13.unix.dnsbego.de (see https://github.com/hazelcast/hazelcast/blob/master/hazelcast-client/src/main/java/com/hazelcast/client/connection/nio/ClientConnectionManagerImpl.java#L640)

This line is being called with ip address: https://github.com/hazelcast/hazelcast/blob/v3.8.2/hazelcast-client/src/main/java/com/hazelcast/client/connection/nio/ClientConnectionManagerImpl.java#L258 and it returns null. If it were called with domain name it would find it. Hence, a new connection is initiated and onAuthenticated is being called and this line uses the domain name: https://github.com/hazelcast/hazelcast/blob/v3.8.2/hazelcast-client/src/main/java/com/hazelcast/client/connection/nio/ClientConnectionManagerImpl.java#L653 and it returns the old connection as not being null, which causes the assertion error.

If this is the case, here is proposed solution:
Always do not do any conversion to socket address here but just return Address: https://github.com/hazelcast/hazelcast/blob/v3.8.2/hazelcast-client/src/main/java/com/hazelcast/client/spi/impl/ClusterListenerSupport.java#L113 and socket address conversion is already being done at this point: https://github.com/hazelcast/hazelcast/blob/v3.8.2/hazelcast-client/src/main/java/com/hazelcast/client/connection/nio/ClientConnectionManagerImpl.java#L389

Here is a manual test to produce the issue:

  1. Put the follow'ng entry to /etc/hosts : "127.0.0.1 ihsan"
    @Test
    public void testConnectWithDNSNames() {
        Config config = new Config();
        config.getNetworkConfig().setPublicAddress("ihsan");
        HazelcastInstance hazelcastInstance = Hazelcast.newHazelcastInstance(config);

        ClientConfig clientConfig = new ClientConfig();
        clientConfig.getNetworkConfig().addAddress("ihsan");
        HazelcastClient.newHazelcastClient(clientConfig);

        Hazelcast.newHazelcastInstance(config);

        hazelcastInstance.shutdown();

        sleepSeconds(10);
    }

This test causes assertion error.

Note: I did put 3.8.2 code references instead of 3.8.4 but it should be similar.

@ihsandemir ihsandemir reopened this Sep 14, 2017
sancar added a commit to sancar/hazelcast that referenced this issue Sep 14, 2017
Make sure that the provided address string is used in the activeConnectionsMap. This fixes the domain name to ip address conversion problem and duplicate connection initiation problem to the same member one with ip address and one with domain name.

We made the assumption that the user will have to use the same host name or ip in the client config if the member is configured with a public ip address.

fixes hazelcast#11116
fixes hazelcast#11264
backport of hazelcast#11226
@sancar
Copy link
Member

@sancar sancar commented Sep 15, 2017

fixed by #11368

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

5 participants
You can’t perform that action at this time.