[io] Thread leak after HazelcastInstance.shutdown() in 3.8.2 so JVM won't exit #10886

Closed
wilevers opened this issue Jul 7, 2017 · 8 comments

@wilevers wilevers commented Jul 7, 2017

Hi all,

On my system (Ubuntu 14.04, Java 1.8.0_131-b11, hazelcast-all.jar version 3.8.2), the following program often hangs after printing 'all instances shut down':

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class Main 
{
	public static void main(String[] args) {
		System.out.println("creating instances...");
		Config config = new Config();
		HazelcastInstance[] instances = new HazelcastInstance[3];
		for (int i = 0; i < instances.length; ++i) {
			instances[i] = Hazelcast.newHazelcastInstance(config);
		}
		System.out.println("instances created. shutting them down...");
		for (int i = instances.length - 1; i >= 0; --i) {
			instances[i].shutdown();
		}
		System.out.println("all instances shut down");
	}
}

A jstack dump of the hanging Java process reveals a non-daemon Hazelcast thread hanging in NonBlockingIOThread.selectLoop():

"hz._hzInstance_2_dev.IO.thread-in-0" #84 prio=5 os_prio=0 tid=0x00007ff02c951800 nid=0x65a3 runnable [0x00007fef7cccd000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
	- locked <0x0000000776359288> (a sun.nio.ch.Util$3)
	- locked <0x0000000776359298> (a java.util.Collections$UnmodifiableSet)
	- locked <0x00000007762c5a28> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
	at com.hazelcast.internal.networking.nonblocking.NonBlockingIOThread.selectLoop(NonBlockingIOThread.java:248)
	at com.hazelcast.internal.networking.nonblocking.NonBlockingIOThread.run(NonBlockingIOThread.java:203)

Is this a bug in Hazelcast?

Kind regards,

Wil Evers
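
To check for this kind of leak without reading a full jstack dump, one option is to list the live non-daemon threads just before main returns. The following is only a sketch: the class and method names are made up for illustration and are not part of Hazelcast.

```java
import java.util.ArrayList;
import java.util.List;

public class NonDaemonThreads {

    // Hypothetical helper: names of all live threads that would keep the JVM from exiting.
    static List<String> liveNonDaemonThreadNames() {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && !t.isDaemon()) {
                names.add(t.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Call this after the last HazelcastInstance.shutdown(); anything other than
        // the calling thread itself is a candidate for a leaked thread.
        for (String name : liveNonDaemonThreadNames()) {
            System.out.println("non-daemon thread still alive: " + name);
        }
    }
}
```

With the reproducer above, a leaked thread would presumably show up under a name like the one in the dump (hz._hzInstance_2_dev.IO.thread-in-0).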

@jerrinot jerrinot added this to the 3.9 milestone Jul 7, 2017
Contributor

@jerrinot jerrinot commented Jul 7, 2017

It looks so. Thank you for reporting!

Contributor

@mdogan mdogan commented Jul 7, 2017

I think we already fixed this issue in 3.8.3. See #10651

@wilevers, can you try with 3.8.3 or 3.9-SNAPSHOT?

Author

@wilevers wilevers commented Jul 7, 2017

Just tried 3.8.3, and the issue does not appear. Thanks!
Not sure I like the direction of the fix taken in #10651, though. It seems to me the root cause is a task clearing the thread's interrupt status, which it shouldn't do. What if such a task first clears the thread's interrupt status and then enters a blocking call before returning to NonBlockingIOThread.selectLoop()?

Regards,
Wil
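
The scenario described above can be reproduced in isolation. In this minimal sketch, Thread.sleep stands in for any blocking call made on the IO thread; the class and method names are hypothetical and nothing here is Hazelcast code.

```java
public class SwallowedInterruptDemo {

    // Returns true if the second blocking call still noticed the interrupt.
    static boolean secondBlockingCallSeesInterrupt() {
        // Simulate the shutdown request arriving while a task is running.
        Thread.currentThread().interrupt();

        try {
            // Throws immediately because the interrupt status is set,
            // and clears the status while doing so.
            Thread.sleep(10);
        } catch (InterruptedException ignored) {
            // The task swallows the exception without restoring the status.
        }

        try {
            // Stand-in for the next blocking call on the same thread.
            Thread.sleep(50);
            return false; // blocked for the full duration: the interrupt was lost
        } catch (InterruptedException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("second call saw interrupt: " + secondBlockingCallSeesInterrupt());
    }
}
```

Because the first catch block drops the interrupt, the second blocking call runs to completion as if no shutdown had been requested.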

Member

@pveentjer pveentjer commented Jul 10, 2017

Good points. I'll have a closer look; the amount of code called from the IO thread is limited. Let's see if we can find the offending code.

Member

@pveentjer pveentjer commented Jul 10, 2017

I see one potential problem already. If an InterruptedException is thrown in the SelectionHandler.handle method, it isn't handled correctly: it is caught as a plain Throwable without being checked for InterruptedException, and the interrupt flag isn't restored.
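
A catch-all block can still preserve the interrupt by checking the exception type before logging. The sketch below illustrates the general pattern only; the Handler interface is a hypothetical stand-in for SelectionHandler and this is not the actual Hazelcast code.

```java
public class InterruptAwareCatchAll {

    // Hypothetical stand-in for a handler invoked from an IO thread.
    interface Handler {
        void handle() throws Exception;
    }

    static void handleSafely(Handler handler) {
        try {
            handler.handle();
        } catch (Throwable t) {
            if (t instanceof InterruptedException) {
                // Restore the status so the enclosing loop can still see the interrupt.
                Thread.currentThread().interrupt();
            }
            System.err.println("handler failed: " + t);
        }
    }

    public static void main(String[] args) {
        handleSafely(() -> {
            throw new InterruptedException("simulated");
        });
        System.out.println("interrupted after handler: " + Thread.currentThread().isInterrupted());
    }
}
```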

@pveentjer pveentjer self-assigned this Jul 10, 2017
Member

@pveentjer pveentjer commented Jul 10, 2017

Even though the InterruptedException handling above isn't correct, it isn't the cause. Another problem is in PacketDispatcherImpl, which also catches all exceptions and doesn't handle InterruptedException specifically.

Member

@pveentjer pveentjer commented Jul 10, 2017

PacketDispatcherImpl isn't the (last) cause either. The search continues.

Member

@pveentjer pveentjer commented Jul 10, 2017

Provided a custom InterruptedException that prints some more info:

public
class InterruptedException extends Exception {
    private static final long serialVersionUID = 6700697376100628473L;

    /**
     * Constructs an <code>InterruptedException</code> with no detail  message.
     */
    public InterruptedException() {
        super();

        logStackTrace();
    }

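    // Prints a stack trace when the exception is created on an IO input thread
    // (thread name contains "in-"); otherwise prints only the thread name.
    private static void logStackTrace() {
        if (Thread.currentThread().getName().contains("in-")) {
            try {
                throw new Exception();
            } catch (Exception e) {
                e.printStackTrace();
            }
        } else {
            System.out.println("-------------------------------" + Thread.currentThread().getName() + "--------------------------------");
        }
    }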

    /**
     * Constructs an <code>InterruptedException</code> with the
     * specified detail message.
     *
     * @param   s   the detail message.
     */
    public InterruptedException(String s) {
        super(s);

        logStackTrace();
    }
}

And found one more place where the exception is swallowed:

java.lang.Exception
	at java.lang.InterruptedException.logStackTrace(InterruptedException.java:65)
	at java.lang.InterruptedException.<init>(InterruptedException.java:59)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1245)
	at java.util.concurrent.locks.ReentrantLock.tryLock(ReentrantLock.java:442)
	at com.hazelcast.util.executor.CachedExecutorServiceDelegate.addNewWorkerIfRequired(CachedExecutorServiceDelegate.java:142)
	at com.hazelcast.util.executor.CachedExecutorServiceDelegate.execute(CachedExecutorServiceDelegate.java:116)
	at com.hazelcast.spi.impl.executionservice.impl.ExecutionServiceImpl.execute(ExecutionServiceImpl.java:248)
	at com.hazelcast.nio.NodeIOService.onDisconnect(NodeIOService.java:162)
	at com.hazelcast.nio.tcp.TcpIpConnection.close(TcpIpConnection.java:274)
	at com.hazelcast.internal.networking.nonblocking.AbstractHandler.onFailure(AbstractHandler.java:128)
	at com.hazelcast.internal.networking.nonblocking.NonBlockingIOThread.handleSelectionKey(NonBlockingIOThread.java:349)
	at com.hazelcast.internal.networking.nonblocking.NonBlockingIOThread.handleSelectionKeys(NonBlockingIOThread.java:332)
	at com.hazelcast.internal.networking.nonblocking.NonBlockingIOThread.selectLoop(NonBlockingIOThread.java:250)
	at com.hazelcast.internal.networking.nonblocking.NonBlockingIOThread.run(NonBlockingIOThread.java:203)

Which leads to:

    @SuppressFBWarnings("VO_VOLATILE_INCREMENT")
    private void addNewWorkerIfRequired() {
        if (size < maxPoolSize) {
            try {
                if (lock.tryLock(TIME, TimeUnit.MILLISECONDS)) {
                    try {
                        if (size < maxPoolSize && getQueueSize() > 0) {
                            size++;
                            cachedExecutor.execute(new Worker());
                        }
                    } finally {
                        lock.unlock();
                    }
                }
            } catch (InterruptedException ignored) {
                EmptyStatement.ignore(ignored); // <---- the interrupt status is lost here
            }
        }
    }
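
One conventional way to avoid losing the interrupt in a method like this is to re-assert it in the catch block. The sketch below is a simplified, self-contained variant rather than the actual fix that was committed: the class name, the LOCK_TIMEOUT_MILLIS value, and the boolean return are assumptions made for illustration, and the queue-size check and worker submission are left out.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class WorkerPoolSketch {

    private static final long LOCK_TIMEOUT_MILLIS = 250; // hypothetical timeout

    private final ReentrantLock lock = new ReentrantLock();
    private final int maxPoolSize;
    private volatile int size;

    WorkerPoolSketch(int maxPoolSize) {
        this.maxPoolSize = maxPoolSize;
    }

    // Returns true if a worker slot was claimed (return value added for demonstration).
    boolean addNewWorkerIfRequired() {
        if (size >= maxPoolSize) {
            return false;
        }
        try {
            if (lock.tryLock(LOCK_TIMEOUT_MILLIS, TimeUnit.MILLISECONDS)) {
                try {
                    if (size < maxPoolSize) {
                        size++; // only written while holding the lock
                        return true;
                    }
                } finally {
                    lock.unlock();
                }
            }
        } catch (InterruptedException e) {
            // tryLock cleared the interrupt status when it threw; set it again
            // so the calling thread's loop can still observe the interrupt.
            Thread.currentThread().interrupt();
        }
        return false;
    }

    public static void main(String[] args) {
        WorkerPoolSketch pool = new WorkerPoolSketch(2);
        Thread.currentThread().interrupt();
        boolean added = pool.addNewWorkerIfRequired();
        System.out.println("added=" + added + ", interrupted=" + Thread.currentThread().isInterrupted());
    }
}
```

When the calling thread is already interrupted, tryLock throws before acquiring the lock, so no worker is added, and the restored status lets the caller's loop notice the shutdown request.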
pveentjer added a commit to pveentjer/hazelcast that referenced this issue Jul 10, 2017
@nilskp nilskp added the in progress label Jul 10, 2017
pveentjer added a commit to pveentjer/hazelcast that referenced this issue Jul 10, 2017
pveentjer added a commit to pveentjer/hazelcast that referenced this issue Jul 10, 2017
@mmedenjak mmedenjak changed the title Thread leak after HazelcastInstance.shutdown() in 3.8.2 so JVM won't exit [io] Thread leak after HazelcastInstance.shutdown() in 3.8.2 so JVM won't exit Jul 11, 2017