Client fails to properly reconnect to single node cluster after hazelcast server is restarted #6168

Closed
kedar-joshi opened this issue Sep 10, 2015 · 3 comments · Fixed by #6218

Comments

@kedar-joshi

Hi,
I ran into the following exception while upgrading from version 3.2.3 to version 3.5.2. The scenario is as follows:

  1. Use the Java client's programmatic configuration to connect to a single-node Hazelcast instance. This step works as expected, and the client is able to perform get and put operations on the distributed maps.
  2. Shut down the Hazelcast server and restart it after a couple of seconds.
  3. The client reports that it has connected to the cluster, but all map operations take very long to return. After a long time these map operations fail with the following exception, and the client never recovers.

I am using the following code to configure the client:

// Preparing client configuration
ClientConfig clientConfig = new ClientConfig();
clientConfig.getGroupConfig().setName(SystemConfigUtil.getCacheGroup()).setPassword(SystemConfigUtil.getCachePassword());

ClientNetworkConfig networkConfig = clientConfig.getNetworkConfig();

networkConfig.addAddress("127.0.0.1");
networkConfig.setConnectionAttemptLimit(Integer.MAX_VALUE);
networkConfig.setConnectionAttemptPeriod(10000);
networkConfig.setConnectionTimeout(5000);

// Creating client instance
HAZELCAST_CLIENT = HazelcastClient.newHazelcastClient(clientConfig);

The following exception is thrown for every map operation after the Hazelcast server is restarted:

com.hazelcast.core.HazelcastException: java.io.IOException: Not able to setup owner connection!
 at com.hazelcast.util.ExceptionUtil.rethrow(ExceptionUtil.java:67) ~[ExceptionUtil.class:3.5.2]
 at com.hazelcast.util.ExceptionUtil.rethrow(ExceptionUtil.java:62) ~[ExceptionUtil.class:3.5.2]
 at com.hazelcast.client.spi.ClientProxy.invoke(ClientProxy.java:133) ~[ClientProxy.class:3.5.2]
 at com.hazelcast.client.proxy.ClientMapProxy.put(ClientMapProxy.java:364) ~[ClientMapProxy.class:3.5.2]
 at com.hazelcast.client.proxy.ClientMapProxy.put(ClientMapProxy.java:206) ~[ClientMapProxy.class:3.5.2]
 at com.locationguru.CSF.cache.manager.CacheManagerImpl.setUserInfo(CacheManagerImpl.java:826) ~[CacheManagerImpl.class:na]
 at com.locationguru.llp.user.controller.UserController.doAuthenticate(UserController.java:266) ~[UserController.class:na]
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.7.0_71]
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[na:1.7.0_71]
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.7.0_71]
 at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_71]
 at org.springframework.web.method.support.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:213) [InvocableHandlerMethod.class:3.1.1.RELEASE]
 at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:126) [InvocableHandlerMethod.class:3.1.1.RELEASE]
 at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:96) [ServletInvocableHandlerMethod.class:3.1.1.RELEASE]
 at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:617) [RequestMappingHandlerAdapter.class:3.1.1.RELEASE]
 at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:578) [RequestMappingHandlerAdapter.class:3.1.1.RELEASE]
 at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:80) [AbstractHandlerMethodAdapter.class:3.1.1.RELEASE]
 at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:923) [DispatcherServlet.class:3.1.1.RELEASE]
 at com.locationguru.CSF.base.BaseDispatcherServlet.doDispatch(BaseDispatcherServlet.java:37) [BaseDispatcherServlet.class:na]
 at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852) [DispatcherServlet.class:3.1.1.RELEASE]
 at com.locationguru.CSF.base.BaseDispatcherServlet.doService(BaseDispatcherServlet.java:32) [BaseDispatcherServlet.class:na]
 at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882) [FrameworkServlet.class:3.1.1.RELEASE]
 at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:789) [FrameworkServlet.class:3.1.1.RELEASE]
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:644) [servlet-api.jar:na]
 at com.locationguru.CSF.base.BaseDispatcherServlet.service(BaseDispatcherServlet.java:42) [BaseDispatcherServlet.class:na]
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:725) [servlet-api.jar:na]
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:291) [catalina.jar:8.0.12]
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) [catalina.jar:8.0.12]
 at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52) [tomcat-websocket.jar:8.0.12]
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:239) [catalina.jar:8.0.12]
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) [catalina.jar:8.0.12]
 at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219) [catalina.jar:8.0.12]
 at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:106) [catalina.jar:8.0.12]
 at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505) [catalina.jar:8.0.12]
 at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:142) [catalina.jar:8.0.12]
 at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:79) [catalina.jar:8.0.12]
 at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:610) [catalina.jar:8.0.12]
 at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:88) [catalina.jar:8.0.12]
 at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:534) [catalina.jar:8.0.12]
 at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1081) [tomcat-coyote.jar:8.0.12]
 at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:658) [tomcat-coyote.jar:8.0.12]
 at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:222) [tomcat-coyote.jar:8.0.12]
 at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1566) [tomcat-coyote.jar:8.0.12]
 at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1523) [tomcat-coyote.jar:8.0.12]
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_71]
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_71]
 at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) [tomcat-util.jar:8.0.12]
 at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]

As per my analysis, the following method in com.hazelcast.client.spi.impl.ClientSmartInvocationServiceImpl.java checks for an owner connection, but clientClusterService.getOwnerConnectionAddress() always returns null, so the check for a valid connection is repeated almost indefinitely.

private void ensureOwnerConnectionAvailable() throws IOException {
    ClientClusterService clientClusterService = client.getClientClusterService();
    Address ownerConnectionAddress = clientClusterService.getOwnerConnectionAddress();

    boolean isOwnerConnectionAvailable = ownerConnectionAddress != null
            && connectionManager.getConnection(ownerConnectionAddress) != null;

    if (!isOwnerConnectionAvailable) {
        if (isShutdown()) {
            throw new HazelcastException("ConnectionManager is not active!");
        }
        throw new IOException("Not able to setup owner connection!");
    }
}

In the following method in com.hazelcast.client.spi.impl.ClusterListenerSupport.java, ownerConnectionAddress is set to null on the first line but is never set to a proper value again. This causes getOwnerConnectionAddress() to always return null when called from ensureOwnerConnectionAvailable().

private void connectToOne() throws Exception {
    ownerConnectionAddress = null;
    final ClientNetworkConfig networkConfig = client.getClientConfig().getNetworkConfig();
    final int connAttemptLimit = networkConfig.getConnectionAttemptLimit();
    final int connectionAttemptPeriod = networkConfig.getConnectionAttemptPeriod();

    final int connectionAttemptLimit = connAttemptLimit == 0 ? Integer.MAX_VALUE : connAttemptLimit;

    int attempt = 0;
    Set<InetSocketAddress> triedAddresses = new HashSet<InetSocketAddress>();
    while (attempt < connectionAttemptLimit) {
        if (!client.getLifecycleService().isRunning()) {
            if (LOGGER.isFinestEnabled()) {
                LOGGER.finest("Giving up on retrying to connect to cluster since client is shutdown");
            }
            break;
        }
        attempt++;
        final long nextTry = Clock.currentTimeMillis() + connectionAttemptPeriod;

        boolean isConnected = connect(triedAddresses);

        if (isConnected) {
            return;
        }

        final long remainingTime = nextTry - Clock.currentTimeMillis();
        LOGGER.warning(
                String.format("Unable to get alive cluster connection, try in %d ms later, attempt %d of %d.",
                        Math.max(0, remainingTime), attempt, connectionAttemptLimit));

        if (remainingTime > 0) {
            try {
                Thread.sleep(remainingTime);
            } catch (InterruptedException e) {
                break;
            }
        }
    }
    throw new IllegalStateException("Unable to connect to any address in the config! "
            + "The following addresses were tried:" + triedAddresses);
}
@kedar-joshi
Author

Update 1:

As it turned out, I am using clientConfig.addListenerConfig(new ListenerConfig(<some implementation>)); to populate the cache whenever the server is restarted. The issue can be reproduced reliably (only on the 3.5.x branch) with the following steps:

  1. Create the client configuration as one normally does; this works fine.
  2. Get an IMap from the HazelcastInstance instance.
  3. Add a listener through ListenerConfig to listen for client events such as disconnections and reconnections. Make sure that, on the client-connected event, this listener updates the map obtained in step 2.
  4. Hazelcast then waits for the listener to finish executing, while the listener itself is waiting to access the map, which creates the deadlock.

I hope this helps.

Note: This issue is only reproducible on the 3.5.x branch.
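
The circular wait described in the steps above can be illustrated with a small model that uses only the JDK. This is a sketch under assumptions, not Hazelcast code: the class ListenerDeadlockSketch and its methods mapPut and reconnect are hypothetical stand-ins, and the model assumes that the owner connection only becomes usable after the connected-listener has returned. A timeout is used so that the example terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Simplified model of the reported hang; none of these names are Hazelcast APIs.
final class ListenerDeadlockSketch {
    // Opens once the (modelled) owner connection is fully set up.
    private final CountDownLatch ownerReady = new CountDownLatch(1);

    // Stand-in for a blocking map operation: waits for the owner connection
    // and gives up after timeoutMs. Returns true if the operation could proceed.
    boolean mapPut(long timeoutMs) {
        try {
            return ownerReady.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Stand-in for the reconnect path (assumption): the listener runs on the
    // connecting thread, and the owner connection is published only afterwards.
    void reconnect(Runnable connectedListener) {
        connectedListener.run();
        ownerReady.countDown();
    }

    public static void main(String[] args) {
        ListenerDeadlockSketch client = new ListenerDeadlockSketch();
        // The listener tries to repopulate the map while reconnect() is still running.
        client.reconnect(() ->
                System.out.println("put inside listener succeeded: " + client.mapPut(200)));
        System.out.println("put after reconnect succeeded: " + client.mapPut(200));
    }
}
```

In this model the put issued from inside the listener can never succeed, because the thread that would make the connection usable is the same thread that is running the listener. Without the timeout, it would wait forever.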

@sancar
Contributor

sancar commented Sep 10, 2015

Hi @kedar-joshi
In response to your first comment:
ownerConnectionAddress is set again in
ClusterListenerSupport.connect(Set triedAddresses), which is called by ClusterListenerSupport.connectToOne.

Accessing Hazelcast from within Hazelcast listeners is generally not advised.
We will prepare a fix. In the fix, we will call the listener in another thread rather than in the thread that is trying to reconnect to the cluster.

You can do a similar workaround for your case.

Thanks for the report.
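
A user-side version of the workaround suggested above might look like the following sketch. It is a JDK-only model, not Hazelcast code: OffloadedListenerSketch, mapPut, reconnect and runDemo are hypothetical names, and the model again assumes that the owner connection becomes usable only after the listener returns. In a real client, the Runnable passed to reconnect would correspond to the body of the listener registered through ListenerConfig.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Simplified model of the workaround; none of these names are Hazelcast APIs.
final class OffloadedListenerSketch {
    // Opens once the (modelled) owner connection is fully set up.
    private final CountDownLatch ownerReady = new CountDownLatch(1);

    // Stand-in for a blocking map operation that waits for the owner connection.
    boolean mapPut(long timeoutMs) {
        try {
            return ownerReady.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Stand-in for the reconnect path (assumption): the listener runs on the
    // connecting thread, and the owner connection is published only afterwards.
    void reconnect(Runnable connectedListener) {
        connectedListener.run();
        ownerReady.countDown();
    }

    // Runs one reconnect whose listener hands the map update to a worker thread.
    // Returns true if the offloaded update completed.
    static boolean runDemo() {
        OffloadedListenerSketch client = new OffloadedListenerSketch();
        ExecutorService worker = Executors.newSingleThreadExecutor();
        AtomicReference<Future<Boolean>> pending = new AtomicReference<>();
        Callable<Boolean> repopulate = () -> client.mapPut(2000);
        try {
            // The listener only schedules the work, so reconnect() is not held up.
            client.reconnect(() -> pending.set(worker.submit(repopulate)));
            return pending.get().get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            return false;
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("offloaded put succeeded: " + runDemo());
    }
}
```

Because the listener returns immediately, the connecting thread can finish its work, and the map update on the worker thread proceeds once the connection is ready.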

@sancar sancar added this to the 3.5.3 milestone Sep 10, 2015
@sancar sancar self-assigned this Sep 10, 2015
@kedar-joshi
Author

@sancar you are correct about the listener itself accessing Hazelcast. We knew it was a bad implementation on our part as soon as the cause became clear, so we changed our implementation according to your suggestion.

Thank you for the reply and suggested workaround.

sancar pushed a commit to sancar/hazelcast that referenced this issue Sep 15, 2015
After a straightforward fix to offload the listener call
to an Executor, I came across more problems in the design.
Most remote requests block while the client is not connected
to its owner (that is, until the client has connected to a node and
told it "you are my owner"), so the operation that tries to connect
to the cluster should run in a separate executor pool from the others.
We have two executor pools: one for internal operations and one for
code alien to Hazelcast, such as listeners and CompletableFuture.andThen
calls. Since both can make a remote call that may block while waiting
for the owner address to be determined, the cluster thread is moved to a
singleThreadExecutor of its own. More cleanup was done to differentiate
alien and internal executor usage.

fixes hazelcast#6168
sancar pushed a commit to sancar/hazelcast that referenced this issue Sep 15, 2015
fixes hazelcast#6168

forward port of hazelcast#6217