
High CPU usage when up to about 500 connections. #5278

Closed
alvin-xu opened this issue May 20, 2016 · 33 comments

Comments

@alvin-xu

This is a really strange problem that happens only on one specific computer.

Computer: Dell Latitude E5520, 4 GB memory, Windows 7 32-bit.
Netty version: 5.0.0.Alpha1, 5.0.0.Alpha2

When the number of connections to the Netty server reaches about 500 (there is a threshold, but it is not fixed), one more new connection causes a thread to use 25% CPU and to print the message below continuously, and yet another new connection raises CPU usage to 50%. If I close those two connections, CPU usage immediately returns to normal and the abnormal message stops. If I close other, unrelated connections instead (maybe 10 or 12 of them, it varies), CPU usage eventually returns to normal as well.

2016-05-19 11:19:01 [nioEventLoopGroup-11-1] [WARN]-[Slf4JLogger.java:136] - Selector.select() returned prematurely 1024 times in a row; rebuilding selector.
2016-05-19 11:19:01 [nioEventLoopGroup-11-1] [INFO]-[Slf4JLogger.java:101] - Migrated 64 channel(s) to the new Selector.
2016-05-19 11:19:01 [nioEventLoopGroup-11-1] [WARN]-[Slf4JLogger.java:136] - Selector.select() returned prematurely 1024 times in a row; rebuilding selector.
2016-05-19 11:19:01 [nioEventLoopGroup-11-1] [INFO]-[Slf4JLogger.java:101] - Migrated 64 channel(s) to the new Selector.
2016-05-19 11:19:01 [nioEventLoopGroup-11-1] [WARN]-[Slf4JLogger.java:136] - Selector.select() returned prematurely 1024 times in a row; rebuilding selector.
2016-05-19 11:19:01 [nioEventLoopGroup-11-1] [INFO]-[Slf4JLogger.java:101] - Migrated 64 channel(s) to the new Selector.
2016-05-19 11:19:01 [nioEventLoopGroup-11-1] [WARN]-[Slf4JLogger.java:136] - Selector.select() returned prematurely 1024 times in a row; rebuilding selector.
2016-05-19 11:19:01 [nioEventLoopGroup-11-1] [INFO]-[Slf4JLogger.java:101] - Migrated 64 channel(s) to the new Selector.
2016-05-19 11:19:01 [Thread-1] [DEBUG]-[EapBootstrap.java:415] - send heartbeat...
2016-05-19 11:19:01 [nioEventLoopGroup-11-1] [WARN]-[Slf4JLogger.java:136] - Selector.select() returned prematurely 1024 times in a row; rebuilding selector.
2016-05-19 11:19:01 [nioEventLoopGroup-11-1] [INFO]-[Slf4JLogger.java:101] - Migrated 64 channel(s) to the new Selector.

@alvin-xu
Author

alvin-xu commented May 20, 2016

After further testing, I found that the threshold of the problem (high CPU usage) is directly related to the number of threads, which can be set with '-Dio.netty.eventLoopThreads'.

If -Dio.netty.eventLoopThreads=8, then the threshold is about 500 ≈ 8 * 64;
If -Dio.netty.eventLoopThreads=4, then the threshold is about 250 ≈ 4 * 64;

So I guess the problem happens only when most of the threads reach about 64 connections each.

However, why does this happen on only one computer??? What is the real cause of the problem?
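
For what it's worth, the thread count can also be pinned in code instead of via the system property (a minimal sketch; the int constructor argument is standard Netty API):

    // Sketch: set the event loop thread count explicitly rather than with
    // -Dio.netty.eventLoopThreads. With 8 threads, the threshold observed
    // above would be roughly 8 * 64 = 512 connections.
    EventLoopGroup workerGroup = new NioEventLoopGroup(8);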

@windie
Member

windie commented May 23, 2016

@alvin-xu Netty 5 has been deprecated (See #4466). Could you try 4.1.0.CR7 and see if the bug is still there?

@Scottmitch
Member

+1 @windie - Also great to see you are a Netty contributor now! Welcome aboard!!

@windie
Member

windie commented May 24, 2016

@Scottmitch - Thanks :)

@alvin-xu
Author

@windie The same problem also occurs with 4.1.0.CR7 on the aforementioned computer. It's really difficult for me to figure out what in that computer causes the bug. Do you have any idea?

@johnou
Contributor

johnou commented May 25, 2016

@alvin-xu What JDK? Are you able to take a CPU snapshot with VisualVM or another profiling tool?

@windie
Member

windie commented May 26, 2016

@alvin-xu I cannot reproduce it. Do you have a simple reproducer?

@chenxiuheng

chenxiuheng commented May 30, 2016

I've had this problem too. It disappears if I use netty-transport-native-epoll instead. @alvin-xu, you can give it a try; a rough sketch of the switch follows.
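
A minimal sketch of that switch, assuming the netty-transport-native-epoll dependency is on the classpath (Linux only; the classes are in io.netty.channel.epoll):

    // Sketch: native epoll transport in place of NIO (Linux only).
    EventLoopGroup bossGroup = new EpollEventLoopGroup();
    EventLoopGroup workerGroup = new EpollEventLoopGroup();
    ServerBootstrap bootstrap = new ServerBootstrap();
    bootstrap.group(bossGroup, workerGroup)
             .channel(EpollServerSocketChannel.class);
    // ... the rest of the bootstrap/pipeline stays unchanged.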

@alvin-xu
Author

alvin-xu commented Jun 3, 2016

The JDK version is as follows: @johnou

java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) Client VM (build 24.79-b02, mixed mode, sharing)

snapshot-1464934821453-cpu-combined

This is the VisualVM snapshot; change the suffix to ".nps" and then you can open it.
snapshot-1464934821453-cpu-calltree.txt

@alvin-xu
Author

alvin-xu commented Jun 3, 2016

In the above result, I just started a plain server that accepts connections and does nothing, using the default configuration and -Dio.netty.eventLoopThreads=2.

I think the code is not the cause; the specific platform is. @windie

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class EapManageServer {

    private EventLoopGroup  bossGroup;
    private EventLoopGroup  workerGroup;
    private ServerBootstrap bootstrap;
    private int             port;

    public EapManageServer(int port) {
        bossGroup = new NioEventLoopGroup();
        workerGroup = new NioEventLoopGroup();
        bootstrap = new ServerBootstrap();
        this.port = port;

        bootstrap.group(bossGroup, workerGroup).channel(NioServerSocketChannel.class)
                .option(ChannelOption.SO_BACKLOG, 100).option(ChannelOption.TCP_NODELAY, true)
                .childOption(ChannelOption.SO_KEEPALIVE, true)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel socketChannel) throws Exception {
                        System.out.println("connected");
                        // DiscardServerHandler is my handler (not shown); it discards all input.
                        socketChannel.pipeline().addLast(new DiscardServerHandler());
                    }
                });
    }

    public void run() throws InterruptedException {
        // Bind and block until the server channel is closed.
        bootstrap.bind(port).sync().channel().closeFuture().sync();
    }

    public void stop() {
        bossGroup.shutdownGracefully();
        workerGroup.shutdownGracefully();
    }
}
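
For completeness, a minimal way to launch it (a hypothetical entry point, since the snippet above omits one):

    // Hypothetical launcher for EapManageServer above; not part of the original code.
    public class EapManageServerMain {
        public static void main(String[] args) throws InterruptedException {
            EapManageServer server = new EapManageServer(8080);
            try {
                server.run();   // blocks until the server channel closes
            } finally {
                server.stop();  // shuts down both event loop groups
            }
        }
    }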

I then start a number of clients to connect to the server. When the client count reaches 120+, CPU usage rises to 50% (I only started with 2 event loop threads, and the computer has 4 CPU cores).

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;

// Wrapper class name is arbitrary; the original comment showed only the methods.
public class EapManageClient {

    public static void client(String host, int port) throws InterruptedException {
        EventLoopGroup workerGroup = new NioEventLoopGroup();

        try {
            Bootstrap b = new Bootstrap();
            b.group(workerGroup);
            b.channel(NioSocketChannel.class);
            b.option(ChannelOption.SO_KEEPALIVE, true);
            b.handler(new ChannelInitializer<SocketChannel>() {
                @Override
                public void initChannel(SocketChannel ch) throws Exception {
                    // TimeClientHandler is my handler (not shown).
                    ch.pipeline().addLast(new TimeClientHandler());
                }
            });

            // Start the client.
            ChannelFuture f = b.connect(host, port).sync();

            // Wait until the connection is closed.
            f.channel().closeFuture().sync();
        } finally {
            workerGroup.shutdownGracefully();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService es = Executors.newCachedThreadPool();
        for (int i = 0; i < 120; i++) {
            es.execute(new Runnable() {
                @Override
                public void run() {
                    try {
                        client("192.168.0.23", 8080);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            });
            System.out.println(i);
        }
    }
}

@Scottmitch
Member

Can you provide a few stack traces while in this condition (use jstack)?

@alvin-xu
Author

alvin-xu commented Jun 6, 2016

Here are three files from jstack. jstack_0 was captured when the server started up with no connections; jstack_124 is at 124 connections but still normal; the last one is the error case at 128 connections, where CPU usage by the Netty server reaches 50%.

However, I can't find any problem. Maybe you can. :) @Scottmitch

jstack_0.txt
jstack_124.txt
jstack_128_error.txt

@alvin-xu
Author

alvin-xu commented Jun 6, 2016

@chenxiuheng netty-transport-native-epoll is only supported on Linux, but my problem occurs on Windows. Thank you for your help anyway.

@Scottmitch
Member

@alvin-xu - Thanks for the stack dumps. It would be useful if you could provide a few stack traces while the CPU is at persistently elevated usage. IIUC only the last stack trace was taken at elevated CPU usage. I would like to compare the stack traces to see if there is any interesting pattern. Let's start with 4 stack traces while the CPU is elevated.

@alvin-xu
Author

alvin-xu commented Jun 7, 2016

I'm sorry, I can't quite understand your intent; I have barely used jstack before. I think the following stack dumps may be what you want, @Scottmitch. They were all captured while the CPU was elevated.
jstackl_1.txt
jstackl_2.txt
jstackl_3.txt
jstackl_4.txt
jstackl_5.txt
jstackl_6.txt
jstackl_7.txt
jstackl_8.txt
jstackl_9.txt

@Scottmitch
Member

Scottmitch commented Jun 7, 2016

@alvin-xu - Thanks for the info. It appears that you are using Netty 5.x? Can you provide the same information running with Netty 4.1.0.Final (5.x has been deprecated as @windie mentioned #5278 (comment))?

@alvin-xu
Author

alvin-xu commented Jun 8, 2016

@Scottmitch I switched to 4.1.0.Final, and it has the same problem on that computer. The stack traces follow. I closed the last 2 connections for "jstackl_4.1.0_10_normal.txt" and "jstackl_4.1.0_11_normal.txt", which made CPU usage normal. Then I opened another 2 connections for the later traces and the problem showed again.

jstackl_4.1.0_1.txt
jstackl_4.1.0_2.txt
jstackl_4.1.0_3.txt
jstackl_4.1.0_4.txt
jstackl_4.1.0_5.txt
jstackl_4.1.0_6.txt
jstackl_4.1.0_7.txt
jstackl_4.1.0_8.txt
jstackl_4.1.0_9.txt
jstackl_4.1.0_10_normal.txt
jstackl_4.1.0_11_normal.txt
jstackl_4.1.0_12.txt
jstackl_4.1.0_13.txt

@Scottmitch
Member

@alvin-xu - Thanks again for the stack traces. Most of the stack traces look like we are waiting on the poll mechanism. However, the 12th stack trace shows an interesting sign that we are being interrupted [1]. It is possible that the polling mechanism is behaving in a way we don't expect and we are just spinning calling poll. Can you verify which threads are consuming an unusually high amount of CPU via a profiler (VisualVM should work)? Can you also verify whether you are seeing log statements like [2]:

[1]

"nioEventLoopGroup-3-2" prio=10 tid=0x0412ec00 nid=0x1e94 runnable [0x04d6f000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.Thread.isInterrupted(Native Method)
    at java.lang.Thread.interrupted(Thread.java:983)
    at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:675)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:325)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:742)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:145)
    at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
    - None

[2]

11:33:55.997 [test-1-1] WARN  io.netty.channel.nio.NioEventLoop - Selector.select() returned prematurely 512 times in a row; rebuilding selector.
11:33:55.997 [test-1-1] INFO  io.netty.channel.nio.NioEventLoop - Migrated 1 channel(s) to the new Selector.

@alvin-xu
Author

@Scottmitch I took 3 snapshots with the CPU profiler, as follows:

  1. normal state:
    cpu-live
  2. CPU usage up to 50%:
    cpu-live2
  3. after closing the last 2 connections, CPU was normal; I restarted the profiler again:
    cpu-live3
  4. I tried to restart the profiler once more, but it never showed a result for a really long time.

Can you also verify if you are seeing log statements like [2]

The log I captured is as follows. I think it is much like what you described.

Jun 12, 2016 9:24:05 AM io.netty.channel.nio.NioEventLoop select
WARNING: Selector.select() returned prematurely 512 times in a row; rebuilding selector.
Jun 12, 2016 9:24:05 AM io.netty.channel.nio.NioEventLoop rebuildSelector
INFO: Migrated 64 channel(s) to the new Selector.

@normanmaurer
Member

Yes it is...

Can you tell us what 'SelectorProvider.provider()' returns on this system?


@alvin-xu
Author

@normanmaurer It's sun.nio.ch.WindowsSelectorProvider, which is no different from the other computers.

I used

import java.nio.channels.spi.SelectorProvider;

SelectorProvider provider = SelectorProvider.provider();
System.out.println(provider.toString());

@johnou
Contributor

johnou commented Jun 12, 2016

Is the other computer running 1.7.0_79 too?

@alvin-xu
Author

alvin-xu commented Jun 12, 2016

@johnou No, it's 1.8.0_60.

@normanmaurer
Member

So it does not happen on 1.8.0_60? If so, can you upgrade on the system where you see the problem and check again?


@johnou
Contributor

johnou commented Jun 12, 2016

Are you able to upgrade to 1.8 to rule out a bug in the 1.7 JRE?

@alvin-xu
Author

It seems the problem has no direct relation to the JDK. I have uninstalled JDK 1.7 and installed JDK 1.8; however, the problem is still there.

java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b15)
Java HotSpot(TM) Client VM (build 25.91-b15, mixed mode)

@alvin-xu
Author

I tried to debug the library. I found that the select call [1] does not block for the specified timeoutMillis. In other words, [1] blocks for a very short time and then returns, which makes the code at [2] think that something is wrong and that it has to rebuild the selector to fix it. Unfortunately, rebuilding the selector does not fix the problem, so selectors are rebuilt over and over, each time the premature-return count reaches 512 (the default). This is what causes the elevated CPU usage.

So the question is: why does the select call [1] not block? The steps you suggested above rule out the JDK as a factor. Are there any other possible causes, such as the network interface card?

private void select(boolean oldWakenUp) throws IOException {
        Selector selector = this.selector;

        try {
            int e = 0;
            long currentTimeNanos = System.nanoTime();
            long selectDeadLineNanos = currentTimeNanos + this.delayNanos(currentTimeNanos);

            while(true) {
                long timeoutMillis = (selectDeadLineNanos - currentTimeNanos + 500000L) / 1000000L;
                if(timeoutMillis <= 0L) {
                    if(e == 0) {
                        selector.selectNow();
                        e = 1;
                    }
                    break;
                }

                if(this.hasTasks() && this.wakenUp.compareAndSet(false, true)) {
                    selector.selectNow();
                    e = 1;
                    break;
                }

                int selectedKeys = selector.select(timeoutMillis); // [1]
                ++e;
                if(selectedKeys != 0 || oldWakenUp || this.wakenUp.get() || this.hasTasks() || this.hasScheduledTasks()) {
                    break;
                }

                if(Thread.interrupted()) {
                    if(logger.isDebugEnabled()) {
                        logger.debug("Selector.select() returned prematurely because Thread.currentThread().interrupt() was called. Use NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.");
                    }

                    e = 1;
                    break;
                }

                long time = System.nanoTime();
                if(time - TimeUnit.MILLISECONDS.toNanos(timeoutMillis) >= currentTimeNanos) {
                    e = 1;
                } else if(SELECTOR_AUTO_REBUILD_THRESHOLD > 0 && e >= SELECTOR_AUTO_REBUILD_THRESHOLD) { // [2]
                    logger.warn("Selector.select() returned prematurely {} times in a row; rebuilding selector.", Integer.valueOf(e));
                    this.rebuildSelector();
                    selector = this.selector;
                    selector.selectNow();
                    e = 1;
                    break;
                }

                currentTimeNanos = time;
            }

            if(e > 3 && logger.isDebugEnabled()) {
                logger.debug("Selector.select() returned prematurely {} times in a row.", Integer.valueOf(e - 1));
            }
        } catch (CancelledKeyException var13) {
            if(logger.isDebugEnabled()) {
                logger.debug(CancelledKeyException.class.getSimpleName() + " raised by a Selector - JDK bug?", var13);
            }
        }

    }
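
To check whether the premature return comes from the JDK/OS rather than from Netty, a minimal standalone probe along these lines could help (a sketch of mine, independent of Netty; the class name and defaults are arbitrary). If select(1000) keeps returning almost immediately with zero ready keys, the bug is below Netty:

import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.List;

// Standalone probe: register N idle sockets with one Selector and time
// select(1000). Idle sockets should leave select() blocking for ~1000 ms.
public class SelectProbe {
    public static void main(String[] args) throws Exception {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        Selector selector = Selector.open();
        List<SocketChannel> accepted = new ArrayList<SocketChannel>();

        int n = args.length > 0 ? Integer.parseInt(args[0]) : 64;
        for (int i = 0; i < n; i++) {
            SocketChannel client = SocketChannel.open(server.getLocalAddress());
            client.configureBlocking(false);
            client.register(selector, SelectionKey.OP_READ);
            accepted.add(server.accept()); // keep the peer open; never write to it
        }

        for (int i = 0; i < 10; i++) {
            long start = System.nanoTime();
            int ready = selector.select(1000);
            long elapsedMs = (System.nanoTime() - start) / 1000000L;
            System.out.println("select() -> " + ready + " ready keys after " + elapsedMs + " ms");
        }
    }
}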

@normanmaurer
Member

It sounds like either an OS or a JDK bug.


@alvin-xu
Author

@normanmaurer I have only seen the problem on the one computer I mentioned at the beginning, which is also really strange. So I suspect that some hardware in that computer causes the problem...

@freevest

@alvin-xu It's rare to meet a fellow Chinese user here...

@alvin-xu
Author

alvin-xu commented Nov 4, 2016

Haha, did my Chinese English give me away? @freevest

@normanmaurer
Member

Let me close this as it's very old and there was no more followup.

