[io] Thread leak after HazelcastInstance.shutdown() in 3.8.2 so JVM won't exit #10886

Closed
wilevers opened this issue Jul 7, 2017 · 8 comments · Fixed by #10894
Labels
Source: Community PR or issue was opened by a community user Team: Core Type: Defect

Comments

@wilevers

wilevers commented Jul 7, 2017

Hi all,

On my system (Ubuntu 14.04, Java 1.8.0_131-b11, hazelcast-all.jar version 3.8.2), the following program often hangs after printing 'all instances shut down':

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class Main 
{
	public static void main(String[] args) {
		System.out.println("creating instances...");
		Config config = new Config();
		HazelcastInstance[] instances = new HazelcastInstance[3];
		for (int i = 0; i < instances.length; ++i) {
			instances[i] = Hazelcast.newHazelcastInstance(config);
		}
		System.out.println("instances created. shutting them down...");
		for (int i = instances.length - 1; i >= 0; --i) {
			instances[i].shutdown();
		}
		System.out.println("all instances shut down");
	}
}

A jstack dump of the hanging Java process reveals a non-daemon Hazelcast thread hanging in NonBlockingIOThread.selectLoop():

"hz._hzInstance_2_dev.IO.thread-in-0" #84 prio=5 os_prio=0 tid=0x00007ff02c951800 nid=0x65a3 runnable [0x00007fef7cccd000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
	- locked <0x0000000776359288> (a sun.nio.ch.Util$3)
	- locked <0x0000000776359298> (a java.util.Collections$UnmodifiableSet)
	- locked <0x00000007762c5a28> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
	at com.hazelcast.internal.networking.nonblocking.NonBlockingIOThread.selectLoop(NonBlockingIOThread.java:248)
	at com.hazelcast.internal.networking.nonblocking.NonBlockingIOThread.run(NonBlockingIOThread.java:203)
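
The same observation can be made from inside the JVM, without jstack. The following is only a sketch: `NonDaemonThreads` and `liveNonDaemonThreadNames` are hypothetical names made up for illustration, and nothing in it is Hazelcast API. It uses only the JDK's `Thread.getAllStackTraces()` to list the live non-daemon threads other than the caller, which are the threads that keep the JVM from exiting.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Hazelcast.
final class NonDaemonThreads {

    /** Returns the sorted names of all live non-daemon threads except the calling thread. */
    static List<String> liveNonDaemonThreadNames() {
        Thread current = Thread.currentThread();
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t != current && t.isAlive() && !t.isDaemon())
                .map(Thread::getName)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Called at the end of main(), any name printed here belongs to a
        // thread that would keep the process alive after main() returns.
        System.out.println("still running: " + liveNonDaemonThreadNames());
    }
}
```

Calling `liveNonDaemonThreadNames()` right after the shutdown loop in the program above should, on an affected version, presumably list a thread such as the one in the dump.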

Is this a bug in Hazelcast?

Kind regards,

Wil Evers

@jerrinot jerrinot added this to the 3.9 milestone Jul 7, 2017
@jerrinot
Contributor

jerrinot commented Jul 7, 2017

It looks like it. Thank you for reporting!

@mdogan
Contributor

mdogan commented Jul 7, 2017

I think we already fixed this issue in 3.8.3. See #10651

@wilevers; can you try with 3.8.3 or 3.9-SNAPSHOT?

@wilevers
Author

wilevers commented Jul 7, 2017

Just tried 3.8.3, and the issue does not appear. Thanks!
Not sure I like the solution direction taken in #10651, though. It seems to me the root cause is a task clearing the thread's interrupt status. It shouldn't. What if this task first clears the thread's interrupt status and then enters a blocking call before returning to NonBlockingIOThread.selectLoop()?

Regards,
Wil
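
To illustrate the concern raised above, here is a minimal sketch of how a task can clear a thread's interrupt status. It assumes nothing about Hazelcast internals: `SwallowedInterruptDemo`, `swallowingTask`, `restoringTask` and `interruptSurvives` are hypothetical names, and the blocking call is a plain `CountDownLatch.await`. A blocking JDK call that throws `InterruptedException` clears the flag, so a caller that later checks `isInterrupted()` never sees the interrupt unless the task restores it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not Hazelcast code.
final class SwallowedInterruptDemo {

    /** Swallows the interrupt: await() clears the flag when it throws, and nothing sets it again. */
    static void swallowingTask(CountDownLatch never) {
        try {
            never.await(10, TimeUnit.SECONDS); // throws immediately if the flag is already set
        } catch (InterruptedException ignored) {
            // interrupt status is lost here
        }
    }

    /** Same task, but it re-asserts the interrupt so the caller can still observe it. */
    static void restoringTask(CountDownLatch never) {
        try {
            never.await(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Runs one of the tasks on an already interrupted thread and reports the flag afterwards. */
    static boolean interruptSurvives(boolean restore) throws InterruptedException {
        CountDownLatch never = new CountDownLatch(1); // never counted down
        boolean[] flagAfterTask = new boolean[1];
        Thread worker = new Thread(() -> {
            Thread.currentThread().interrupt(); // stands in for the interrupt sent on shutdown
            if (restore) {
                restoringTask(never);
            } else {
                swallowingTask(never);
            }
            flagAfterTask[0] = Thread.currentThread().isInterrupted();
        });
        worker.start();
        worker.join(); // join() makes the write to flagAfterTask visible here
        return flagAfterTask[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("swallowing task keeps flag: " + interruptSurvives(false)); // false
        System.out.println("restoring task keeps flag:  " + interruptSurvives(true));  // true
    }
}
```

A loop that decides whether to exit by checking the interrupt flag would keep running after the swallowing variant, which matches the symptom described in this issue.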

@pveentjer
Contributor

Good points. I'll have a closer look; the amount of code called from the IO thread is limited. Let's see if we can find the offending code.

@pveentjer
Contributor

I see one potential problem already. If an InterruptedException is thrown in the SelectionHandler.handle method, it isn't handled correctly: it is caught as a Throwable without being checked for InterruptedException, and the interrupt flag isn't restored.
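
A catch-all block that still preserves the interrupt could look roughly like the sketch below. This is not the Hazelcast code: `CatchAllHandler`, its nested `Handler` interface (a stand-in for SelectionHandler) and `handleSafely` are hypothetical names, and the failure handling is reduced to a log line.

```java
// Hypothetical sketch, not the actual NonBlockingIOThread code.
final class CatchAllHandler {

    /** Stand-in for a handler whose handle() may throw anything, including InterruptedException. */
    interface Handler {
        void handle() throws Exception;
    }

    /** Runs the handler and never propagates a failure, but keeps the interrupt status intact. */
    static void handleSafely(Handler handler) {
        try {
            handler.handle();
        } catch (Throwable t) {
            if (t instanceof InterruptedException) {
                // The flag was cleared when the exception was thrown; set it again
                // so that the enclosing loop can still notice the interrupt.
                Thread.currentThread().interrupt();
            }
            System.err.println("handler failed: " + t); // placeholder failure handling
        }
    }

    public static void main(String[] args) {
        handleSafely(() -> { throw new InterruptedException("simulated"); });
        // Thread.interrupted() reads and clears the flag.
        System.out.println("flag restored: " + Thread.interrupted()); // true
    }
}
```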

@pveentjer pveentjer self-assigned this Jul 10, 2017
@pveentjer
Contributor

pveentjer commented Jul 10, 2017

Even though the above InterruptedException handling isn't correct, it isn't the cause. Another problem is the PacketDispatcherImpl, which also catches all exceptions and doesn't handle InterruptedException specifically.

@pveentjer
Contributor

pveentjer commented Jul 10, 2017

The PacketDispatcherImpl isn't the (last) cause either. The search continues.

@pveentjer
Contributor

I provided a custom InterruptedException that logs some more info:

public
class InterruptedException extends Exception {
    private static final long serialVersionUID = 6700697376100628473L;

    /**
     * Constructs an <code>InterruptedException</code> with no detail  message.
     */
    public InterruptedException() {
        super();

        logStackTrace();
    }

    private static void logStackTrace(){
        if(Thread.currentThread().getName().contains("in-")){
            try{
                throw new Exception();
            }catch(Exception e){
                e.printStackTrace();
            }
        } else{
            System.out.println("-------------------------------"+Thread.currentThread().getName()+"--------------------------------");
        }
    }

    /**
     * Constructs an <code>InterruptedException</code> with the
     * specified detail message.
     *
     * @param   s   the detail message.
     */
    public InterruptedException(String s) {
        super(s);

        logStackTrace();
    }
}

And it found one more place where the exception is swallowed:

java.lang.Exception
	at java.lang.InterruptedException.logStackTrace(InterruptedException.java:65)
	at java.lang.InterruptedException.<init>(InterruptedException.java:59)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1245)
	at java.util.concurrent.locks.ReentrantLock.tryLock(ReentrantLock.java:442)
	at com.hazelcast.util.executor.CachedExecutorServiceDelegate.addNewWorkerIfRequired(CachedExecutorServiceDelegate.java:142)
	at com.hazelcast.util.executor.CachedExecutorServiceDelegate.execute(CachedExecutorServiceDelegate.java:116)
	at com.hazelcast.spi.impl.executionservice.impl.ExecutionServiceImpl.execute(ExecutionServiceImpl.java:248)
	at com.hazelcast.nio.NodeIOService.onDisconnect(NodeIOService.java:162)
	at com.hazelcast.nio.tcp.TcpIpConnection.close(TcpIpConnection.java:274)
	at com.hazelcast.internal.networking.nonblocking.AbstractHandler.onFailure(AbstractHandler.java:128)
	at com.hazelcast.internal.networking.nonblocking.NonBlockingIOThread.handleSelectionKey(NonBlockingIOThread.java:349)
	at com.hazelcast.internal.networking.nonblocking.NonBlockingIOThread.handleSelectionKeys(NonBlockingIOThread.java:332)
	at com.hazelcast.internal.networking.nonblocking.NonBlockingIOThread.selectLoop(NonBlockingIOThread.java:250)
	at com.hazelcast.internal.networking.nonblocking.NonBlockingIOThread.run(NonBlockingIOThread.java:203)

Which leads to:

    @SuppressFBWarnings("VO_VOLATILE_INCREMENT")
    private void addNewWorkerIfRequired() {
        if (size < maxPoolSize) {
            try {
                if (lock.tryLock(TIME, TimeUnit.MILLISECONDS)) {
                    try {
                        if (size < maxPoolSize && getQueueSize() > 0) {
                            size++;
                            cachedExecutor.execute(new Worker());
                        }
                    } finally {
                        lock.unlock();
                    }
                }
            } catch (InterruptedException ignored) {
                // <---- the interrupt is swallowed here; the thread's interrupt status is never restored
                EmptyStatement.ignore(ignored);
            }
        }
    }
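
One conventional way to avoid losing the interrupt in a method like this is to restore the flag in the catch block. The sketch below shows that pattern in isolation. It is an assumption about how such code can be written, not the change that was merged for this issue, and `InterruptPreservingLock` and `runLocked` are hypothetical names.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the "restore the interrupt" pattern around a timed tryLock.
final class InterruptPreservingLock {

    private final ReentrantLock lock = new ReentrantLock();

    /**
     * Runs the action while holding the lock.
     * Returns false if the lock was not acquired in time or the thread was interrupted;
     * in the latter case the interrupt status is set again before returning.
     */
    boolean runLocked(Runnable action, long timeoutMillis) {
        try {
            if (!lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return false;
            }
        } catch (InterruptedException e) {
            // tryLock cleared the flag when it threw; put it back for the caller.
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            action.run();
            return true;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        InterruptPreservingLock guard = new InterruptPreservingLock();
        Thread.currentThread().interrupt(); // simulate an interrupt arriving before the call
        boolean ran = guard.runLocked(() -> System.out.println("working"), 10);
        // Thread.interrupted() reads and clears the flag.
        System.out.println("ran=" + ran + ", still interrupted=" + Thread.interrupted());
        // prints "ran=false, still interrupted=true"
    }
}
```

With the flag restored, a caller further up the stack that checks the interrupt status, such as a select loop, can still see that the thread was asked to stop.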

pveentjer added a commit to pveentjer/hazelcast that referenced this issue Jul 10, 2017
pveentjer added a commit to pveentjer/hazelcast that referenced this issue Jul 10, 2017
pveentjer added a commit to pveentjer/hazelcast that referenced this issue Jul 10, 2017
@mmedenjak mmedenjak changed the title Thread leak after HazelcastInstance.shutdown() in 3.8.2 so JVM won't exit [io] Thread leak after HazelcastInstance.shutdown() in 3.8.2 so JVM won't exit Jul 11, 2017
@mmedenjak mmedenjak added the Source: Community PR or issue was opened by a community user label Jan 28, 2020

6 participants