Client fails to properly reconnect to single node cluster after hazelcast server is restarted #6168

kedar-joshi · 2015-09-10T10:08:59Z

Hi,
I ran into the following exception while upgrading from version 3.2.3 to version 3.5.2. The scenario is as follows:

1. Use the Java client's programmatic configuration to connect to a single-node Hazelcast instance. This step works as expected, and the client is able to perform get and put operations on the distributed maps.
2. Shut down the Hazelcast server and restart it after a couple of seconds.
3. The client is reported as connected to the cluster, but all map operations take a very long time to return. Eventually they fail with the exception shown further down, and the client never recovers.

I am using the following code to configure the client:

// Preparing client configuration
ClientConfig clientConfig = new ClientConfig();
clientConfig.getGroupConfig().setName(SystemConfigUtil.getCacheGroup()).setPassword(SystemConfigUtil.getCachePassword());

ClientNetworkConfig networkConfig = clientConfig.getNetworkConfig();

networkConfig.addAddress("127.0.0.1");
networkConfig.setConnectionAttemptLimit(Integer.MAX_VALUE);
networkConfig.setConnectionAttemptPeriod(10000);
networkConfig.setConnectionTimeout(5000);

// Creating client instance
HAZELCAST_CLIENT = HazelcastClient.newHazelcastClient(clientConfig);

The following exception is thrown for every map operation after the Hazelcast server is restarted:

com.hazelcast.core.HazelcastException: java.io.IOException: Not able to setup owner connection!
 at com.hazelcast.util.ExceptionUtil.rethrow(ExceptionUtil.java:67) ~[ExceptionUtil.class:3.5.2]
 at com.hazelcast.util.ExceptionUtil.rethrow(ExceptionUtil.java:62) ~[ExceptionUtil.class:3.5.2]
 at com.hazelcast.client.spi.ClientProxy.invoke(ClientProxy.java:133) ~[ClientProxy.class:3.5.2]
 at com.hazelcast.client.proxy.ClientMapProxy.put(ClientMapProxy.java:364) ~[ClientMapProxy.class:3.5.2]
 at com.hazelcast.client.proxy.ClientMapProxy.put(ClientMapProxy.java:206) ~[ClientMapProxy.class:3.5.2]
 at com.locationguru.CSF.cache.manager.CacheManagerImpl.setUserInfo(CacheManagerImpl.java:826) ~[CacheManagerImpl.class:na]
 at com.locationguru.llp.user.controller.UserController.doAuthenticate(UserController.java:266) ~[UserController.class:na]
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.7.0_71]
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[na:1.7.0_71]
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.7.0_71]
 at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_71]
 at org.springframework.web.method.support.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:213) [InvocableHandlerMethod.class:3.1.1.RELEASE]
 at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:126) [InvocableHandlerMethod.class:3.1.1.RELEASE]
 at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:96) [ServletInvocableHandlerMethod.class:3.1.1.RELEASE]
 at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:617) [RequestMappingHandlerAdapter.class:3.1.1.RELEASE]
 at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:578) [RequestMappingHandlerAdapter.class:3.1.1.RELEASE]
 at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:80) [AbstractHandlerMethodAdapter.class:3.1.1.RELEASE]
 at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:923) [DispatcherServlet.class:3.1.1.RELEASE]
 at com.locationguru.CSF.base.BaseDispatcherServlet.doDispatch(BaseDispatcherServlet.java:37) [BaseDispatcherServlet.class:na]
 at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852) [DispatcherServlet.class:3.1.1.RELEASE]
 at com.locationguru.CSF.base.BaseDispatcherServlet.doService(BaseDispatcherServlet.java:32) [BaseDispatcherServlet.class:na]
 at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882) [FrameworkServlet.class:3.1.1.RELEASE]
 at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:789) [FrameworkServlet.class:3.1.1.RELEASE]
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:644) [servlet-api.jar:na]
 at com.locationguru.CSF.base.BaseDispatcherServlet.service(BaseDispatcherServlet.java:42) [BaseDispatcherServlet.class:na]
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:725) [servlet-api.jar:na]
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:291) [catalina.jar:8.0.12]
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) [catalina.jar:8.0.12]
 at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52) [tomcat-websocket.jar:8.0.12]
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:239) [catalina.jar:8.0.12]
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) [catalina.jar:8.0.12]
 at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219) [catalina.jar:8.0.12]
 at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:106) [catalina.jar:8.0.12]
 at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505) [catalina.jar:8.0.12]
 at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:142) [catalina.jar:8.0.12]
 at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:79) [catalina.jar:8.0.12]
 at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:610) [catalina.jar:8.0.12]
 at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:88) [catalina.jar:8.0.12]
 at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:534) [catalina.jar:8.0.12]
 at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1081) [tomcat-coyote.jar:8.0.12]
 at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:658) [tomcat-coyote.jar:8.0.12]
 at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:222) [tomcat-coyote.jar:8.0.12]
 at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1566) [tomcat-coyote.jar:8.0.12]
 at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1523) [tomcat-coyote.jar:8.0.12]
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_71]
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_71]
 at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) [tomcat-util.jar:8.0.12]
 at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]

As per my analysis, the following method in com.hazelcast.client.spi.impl.ClientSmartInvocationServiceImpl.java checks for the owner connection, but clientClusterService.getOwnerConnectionAddress() always returns null for ownerConnectionAddress, so the check for a valid connection keeps failing almost indefinitely.

private void ensureOwnerConnectionAvailable() throws IOException {
    ClientClusterService clientClusterService = client.getClientClusterService();
    Address ownerConnectionAddress = clientClusterService.getOwnerConnectionAddress();

    boolean isOwnerConnectionAvailable = ownerConnectionAddress != null
            && connectionManager.getConnection(ownerConnectionAddress) != null;

    if (!isOwnerConnectionAvailable) {
        if (isShutdown()) {
            throw new HazelcastException("ConnectionManager is not active!");
        }
        throw new IOException("Not able to setup owner connection!");
    }
}

If you check the following method in com.hazelcast.client.spi.impl.ClusterListenerSupport.java, ownerConnectionAddress is set to null in the first line but is never set to a proper value again. This causes getOwnerConnectionAddress() to always return null when called from ensureOwnerConnectionAvailable().

private void connectToOne() throws Exception {
    ownerConnectionAddress = null;
    final ClientNetworkConfig networkConfig = client.getClientConfig().getNetworkConfig();
    final int connAttemptLimit = networkConfig.getConnectionAttemptLimit();
    final int connectionAttemptPeriod = networkConfig.getConnectionAttemptPeriod();

    final int connectionAttemptLimit = connAttemptLimit == 0 ? Integer.MAX_VALUE : connAttemptLimit;

    int attempt = 0;
    Set<InetSocketAddress> triedAddresses = new HashSet<InetSocketAddress>();
    while (attempt < connectionAttemptLimit) {
        if (!client.getLifecycleService().isRunning()) {
            if (LOGGER.isFinestEnabled()) {
                LOGGER.finest("Giving up on retrying to connect to cluster since client is shutdown");
            }
            break;
        }
        attempt++;
        final long nextTry = Clock.currentTimeMillis() + connectionAttemptPeriod;

        boolean isConnected = connect(triedAddresses);

        if (isConnected) {
            return;
        }

        final long remainingTime = nextTry - Clock.currentTimeMillis();
        LOGGER.warning(
                String.format("Unable to get alive cluster connection, try in %d ms later, attempt %d of %d.",
                        Math.max(0, remainingTime), attempt, connectionAttemptLimit));

        if (remainingTime > 0) {
            try {
                Thread.sleep(remainingTime);
            } catch (InterruptedException e) {
                break;
            }
        }
    }
    throw new IllegalStateException("Unable to connect to any address in the config! "
            + "The following addresses were tried:" + triedAddresses);
}

kedar-joshi · 2015-09-10T12:27:54Z

Update 1:

As it turned out, I am using clientConfig.addListenerConfig(new ListenerConfig(<some implementation>)); to repopulate the cache whenever the server is restarted. The issue can be reproduced reliably (only on the 3.5.x branch) with the following steps:

1. Create the client configuration as one normally does; this works fine.
2. Get an IMap from the HazelcastInstance instance.
3. Register a listener through ListenerConfig to listen for client events such as disconnection and reconnection. Make sure that, on the client-connected event, this listener updates the map obtained in step 2.
4. Hazelcast then waits for the listener to finish, while the listener itself is waiting to access the map, which creates the deadlock.
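
The circular wait described in the steps above can be modelled with plain JDK classes. The following is only a sketch, not Hazelcast code: the class name, the latches, and the assumption that the connecting thread marks the owner connection as usable only after the listener returns are all illustrative. The simulated map operation uses a timeout so that the demo terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of the reported deadlock; none of these names are Hazelcast APIs.
final class ListenerDeadlockSketch {

    /**
     * Simulates one reconnect. Returns true if the listener's "map operation"
     * saw a usable owner connection before mapTimeoutMillis elapsed.
     */
    static boolean simulateReconnect(final boolean offloadListener, final long mapTimeoutMillis) {
        final CountDownLatch ownerConnectionReady = new CountDownLatch(1);
        final CountDownLatch listenerFinished = new CountDownLatch(1);
        final AtomicBoolean mapOperationSucceeded = new AtomicBoolean(false);

        // Stand-in for the user's listener: it touches the map, which blocks
        // until the owner connection is available (or the timeout expires).
        final Runnable listener = new Runnable() {
            public void run() {
                try {
                    mapOperationSucceeded.set(
                            ownerConnectionReady.await(mapTimeoutMillis, TimeUnit.MILLISECONDS));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    listenerFinished.countDown();
                }
            }
        };

        ExecutorService clusterThread = Executors.newSingleThreadExecutor();
        final ExecutorService listenerThread = Executors.newSingleThreadExecutor();
        try {
            clusterThread.execute(new Runnable() {
                public void run() {
                    if (offloadListener) {
                        listenerThread.execute(listener); // hand off and keep going
                    } else {
                        listener.run();                   // blocks the connecting thread
                    }
                    // Assumed ordering: the connection only becomes usable after this point.
                    ownerConnectionReady.countDown();
                }
            });
            listenerFinished.await(mapTimeoutMillis + 2000, TimeUnit.MILLISECONDS);
            return mapOperationSucceeded.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            clusterThread.shutdownNow();
            listenerThread.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("inline listener succeeded:    " + simulateReconnect(false, 300));
        System.out.println("offloaded listener succeeded: " + simulateReconnect(true, 300));
    }
}
```

With the listener run inline, the simulated map operation can only time out, because the thread that would make the connection usable is the one waiting. Without the timeout, that wait would never end.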

I hope this helps.

Note: This issue is only reproducible on the 3.5.x branch.

sancar · 2015-09-10T13:05:26Z

Hi @kedar-joshi
In response to your first comment: ownerConnectionAddress is set again in ClusterListenerSupport.connect(Set triedAddresses), which is called by ClusterListenerSupport.connectToOne.

Accessing Hazelcast from within Hazelcast listeners is generally not advised.
We will prepare a fix. In the fix, we will call the listener on a different thread from the one that is trying to reconnect to the cluster.

You can apply a similar workaround in your case.
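
One way to apply such a workaround on the application side is to wrap the listener so that its body runs on an executor owned by the application. The sketch below uses only JDK classes. ConnectionListener, OffloadingListener and the thread name "listener-offload" are hypothetical names chosen for illustration and are not Hazelcast types. In a real client, the same idea would presumably be applied to whatever listener is registered through ListenerConfig.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

final class OffloadingListenerSketch {

    /** Hypothetical stand-in for a client lifecycle listener; not a Hazelcast type. */
    interface ConnectionListener {
        void onConnected();
    }

    /** Runs the wrapped listener's body on the given executor instead of the caller's thread. */
    static final class OffloadingListener implements ConnectionListener {
        private final ConnectionListener delegate;
        private final ExecutorService executor;

        OffloadingListener(ConnectionListener delegate, ExecutorService executor) {
            this.delegate = delegate;
            this.executor = executor;
        }

        @Override
        public void onConnected() {
            // Return immediately; the potentially blocking work happens elsewhere.
            executor.execute(new Runnable() {
                @Override
                public void run() {
                    delegate.onConnected();
                }
            });
        }
    }

    /** Fires the wrapped listener once and returns the name of the thread that ran its body. */
    static String threadThatRanListener() {
        final AtomicReference<String> threadName = new AtomicReference<String>();
        final CountDownLatch done = new CountDownLatch(1);
        ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                return new Thread(r, "listener-offload");
            }
        });
        try {
            ConnectionListener cacheRefresher = new ConnectionListener() {
                @Override
                public void onConnected() {
                    // A real listener would repopulate the map here.
                    threadName.set(Thread.currentThread().getName());
                    done.countDown();
                }
            };
            new OffloadingListener(cacheRefresher, executor).onConnected();
            done.await(2, TimeUnit.SECONDS);
            return threadName.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("listener body ran on: " + threadThatRanListener());
    }
}
```

The calling thread only enqueues the work, so whatever thread delivers the event is never blocked by a map operation inside the listener.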

Thanks for the report.

kedar-joshi · 2015-09-10T15:52:58Z

@sancar you are correct about the listener itself accessing Hazelcast. We knew it was a bad implementation on our part as soon as the cause became clear, so we changed our implementation according to your suggestion.

Thank you for the reply and suggested workaround.

After a straightforward fix to offload the listener call to an executor, I came across further problems in the design. Most remote requests block while the client is not connected to a remote node (that is, until the client has connected to a node and told it "you are my owner"), so the operation that tries to connect to the cluster must not share an executor pool with the others. We have two executor pools: one for internal operations and one for code that is alien to Hazelcast, such as listeners and CompletableFuture.andThen calls. Since both of these can make remote calls that may block while waiting for the owner address to be determined, the cluster thread is moved to a singleThreadExecutor of its own. More cleanup was also done to differentiate alien and internal executor usage. Fixes hazelcast#6168

After a straightforward fix to offload the listener call to an executor, I came across further problems in the design. Most remote requests block while the client is not connected to a remote node (that is, until the client has connected to a node and told it "you are my owner"), so the operation that tries to connect to the cluster must not share an executor pool with the others. We have two executor pools: one for internal operations and one for code that is alien to Hazelcast, such as listeners and CompletableFuture.andThen calls. Since both of these can make remote calls that may block while waiting for the owner address to be determined, the cluster thread is moved to a singleThreadExecutor of its own. More cleanup was also done to differentiate alien and internal executor usage. Fixes hazelcast#6168. Forward port of hazelcast#6217
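
The reasoning in this commit message can be illustrated with a small JDK-only model. It is a sketch under stated assumptions, not the actual change: the class name, the pool size of 2, and the latch standing in for "owner address determined" are all hypothetical. It compares running the connect task on the same pool as blocking user tasks with running it on a dedicated single-thread executor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of shared versus dedicated executors; not Hazelcast code.
final class ExecutorSeparationSketch {

    /**
     * Returns true if every blocking user task observed the owner connection
     * before timeoutMillis elapsed.
     */
    static boolean userTasksSeeConnection(boolean dedicatedClusterExecutor, final long timeoutMillis) {
        final int poolSize = 2;
        final CountDownLatch ownerConnected = new CountDownLatch(1);
        final CountDownLatch userTasksDone = new CountDownLatch(poolSize);
        final AtomicInteger successes = new AtomicInteger();

        ExecutorService userExecutor = Executors.newFixedThreadPool(poolSize);
        ExecutorService clusterExecutor = dedicatedClusterExecutor
                ? Executors.newSingleThreadExecutor()
                : userExecutor;
        try {
            // Fill the user pool with tasks that block until the owner is known.
            for (int i = 0; i < poolSize; i++) {
                userExecutor.execute(new Runnable() {
                    public void run() {
                        try {
                            if (ownerConnected.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                                successes.incrementAndGet();
                            }
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        } finally {
                            userTasksDone.countDown();
                        }
                    }
                });
            }
            // The "connect to cluster" task; on a shared pool it queues behind the blocked tasks.
            clusterExecutor.execute(new Runnable() {
                public void run() {
                    ownerConnected.countDown();
                }
            });
            userTasksDone.await(timeoutMillis + 2000, TimeUnit.MILLISECONDS);
            return successes.get() == poolSize;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            userExecutor.shutdownNow();
            clusterExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("shared pool:       " + userTasksSeeConnection(false, 300));
        System.out.println("dedicated cluster: " + userTasksSeeConnection(true, 300));
    }
}
```

In the shared case the connect task cannot start until a blocked task gives up, so the user tasks starve the very work they are waiting for. Giving the connect task its own executor removes that dependency.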

mdogan added Team: Client PENDING labels Sep 10, 2015

sancar added Type: Defect and removed PENDING labels Sep 10, 2015

sancar added this to the 3.5.3 milestone Sep 10, 2015

sancar self-assigned this Sep 10, 2015

sancar mentioned this issue Sep 15, 2015

Fixes a deadlock in client listeners #6217

Merged

sancar mentioned this issue Sep 15, 2015

Fixes deadlock in client listeners #6218

Merged

sancar closed this as completed in #6218 Sep 17, 2015
