
Epoll consuming lot more CPU than Nio #11695

Closed
praveen97uma opened this issue Sep 20, 2021 · 30 comments · Fixed by #12145

@praveen97uma

praveen97uma commented Sep 20, 2021

I am building a websocket-based broker. During load testing, we found that the Epoll transport uses around 55% CPU compared to around 20% for Nio, just for maintaining the connections without doing any business-specific IO on them. Is this expected? What could I be doing wrong? Happy to share any more info required around this.

Total concurrent connections: Around 27K
Boss Threads: 1
Worker Threads: 32

Cores in the VM: 8

Relevant code that sets up Netty:

  EventExecutorChooserFactory chooserFactory = DefaultEventExecutorChooserFactory.INSTANCE;

  ThreadFactory bossThreadFactory = ThreadFactoryUtil.createInstrumented("boss", metricRegistry);

  Executor workerExecutor = new ThreadPerTaskExecutor(
          ThreadFactoryUtil.createAffinityThreadFactory("worker.thread", metricRegistry));

private void setupNioEventLoopGroups(EventExecutorChooserFactory chooserFactory, ThreadFactory bossThreadFactory,
            Executor workerExecutor) {
        bossPool = new NioEventLoopGroup(acceptorThreads, bossThreadFactory);
        workerPool = new NioEventLoopGroup(workerThreads, workerExecutor, chooserFactory, SelectorProvider.provider(),
                DefaultSelectStrategyFactory.INSTANCE);
        channelType = NioServerSocketChannel.class;
        log.info("Initialiazing Java NIO Event System");
    }

    private void setupEpollEventLoopGroups(EventExecutorChooserFactory chooserFactory, ThreadFactory bossThreadFactory,
            Executor workerExecutor) {
        bossPool = new EpollEventLoopGroup(acceptorThreads, bossThreadFactory);
        workerPool = new EpollEventLoopGroup(workerThreads, workerExecutor, chooserFactory,
                DefaultSelectStrategyFactory.INSTANCE);
        channelType = EpollServerSocketChannel.class;
        log.info("Initialiazing Epoll IO Event System");
    }


        ServerBootstrap serverBootstrap = new ServerBootstrap().group(bossPool, workerPool);

        // Choose socket options.
        Map<ChannelOption<?>, Object> channelOptions = new HashMap<>();
        channelOptions.put(ChannelOption.SO_BACKLOG, 256);
        channelOptions.put(ChannelOption.ALLOCATOR, new PooledByteBufAllocator(true));
        channelOptions.put(ChannelOption.SO_TIMEOUT, 3000);

        channelOptions.forEach(
                (key, value) -> serverBootstrap.option(ChannelOption.valueOf(String.valueOf(key)), value));

        // Set transport options
        serverBootstrap.childOption(ChannelOption.TCP_NODELAY, true);
        serverBootstrap.childOption(ChannelOption.SO_KEEPALIVE, true);
        serverBootstrap.childOption(ChannelOption.SO_LINGER, -1);
        serverBootstrap.childOption(ChannelOption.SO_REUSEADDR, true);
        serverBootstrap.childOption(ChannelOption.ALLOCATOR, PooledByteBufAllocator.DEFAULT);
        serverBootstrap.childOption(ChannelOption.ALLOW_HALF_CLOSURE, false);
        serverBootstrap.childOption(ChannelOption.SO_SNDBUF, 10 * 1024);
        serverBootstrap.channel(serverGroup.getChannelType());
        serverBootstrap.childHandler(channelInitializer);
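
(Aside: the forEach above re-resolves each option by its string name via ChannelOption.valueOf. A minimal sketch of a type-safe alternative, using a hypothetical helper and an unchecked cast instead of the name lookup:)

    import io.netty.bootstrap.ServerBootstrap;
    import io.netty.channel.ChannelOption;
    import java.util.Map;

    final class BootstrapOptions {
        // Applies each collected option directly; the unchecked cast is safe as long as
        // the map was populated with matching key/value pairs (as it is above).
        @SuppressWarnings("unchecked")
        static void apply(ServerBootstrap bootstrap, Map<ChannelOption<?>, Object> options) {
            options.forEach((key, value) -> bootstrap.option((ChannelOption<Object>) key, value));
        }
    }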

Netty version

4.1.68

JVM version (e.g. java -version)

openjdk version "11.0.12" 2021-07-20 LTS
OpenJDK Runtime Environment Zulu11.50+19-CA (build 11.0.12+7-LTS)
OpenJDK 64-Bit Server VM Zulu11.50+19-CA (build 11.0.12+7-LTS, mixed mode)

OS version (e.g. uname -a)

Linux bolt-004 5.8.0-1041-azure #44~20.04.1-Ubuntu SMP Fri Aug 20 20:41:09 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Epoll
Screenshot 2021-09-20 at 6 36 30 PM

Nio
Screenshot 2021-09-20 at 6 22 49 PM

@hyperxpro
Contributor

More information needed.

@praveen97uma
Author

@hyperxpro I have added some code. Please let me know if any specific info would help.

@franz1981
Contributor

@praveen97uma
Are the load and performance results the same for both tests?

e.g.
NIO performs the test in X seconds.
EPOLL performs the test in Y seconds.

@praveen97uma
Author

@franz1981 Yes, the ramp-up to 27K connections is the same for both NIO and EPOLL. The graphs I shared are in steady state after the ramp-up, with nothing being written to the connections except the websocket ping/pongs every 5 sec, which again happens for both transports.

@franz1981
Contributor

Yes, the ramp-up to 27K connections is the same for both NIO and EPOLL

It would be worth capturing a profile using https://github.com/jvm-profiling-tools/async-profiler

  1. https://github.com/jvm-profiling-tools/async-profiler#basic-usage
  2. add -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints as JVM flags
  3. run the profiler with -t -d 10 (or whatever duration you think is ok)

And share the flamegraphs (-f nio.html and -f epoll.html) somehow

qq

Worker Threads: 32
Cores in the VM: 8

if you're not supposed to block on the event loop, why not use 8 worker threads (or fewer)?

@praveen97uma
Author

praveen97uma commented Sep 20, 2021

[flame graph screenshots attached]

@franz1981
Contributor

franz1981 commented Sep 20, 2021

There is an insane amount of samples on epollWait0 due to the timer being set.
What's this instrumented runnable? Can it be disabled?

Another thing: I wouldn't suggest setting affinity with Netty threads > number of cores... it's a BAD idea :)
as suggested above:

if you're not supposed to block on the event loop, why not use 8 worker threads (or fewer)?

@hyperxpro
Contributor

Also, try this: ServerBootstrap#option(EpollChannelOption.EPOLL_MODE, EpollMode.EDGE_TRIGGERED)
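
(A minimal sketch of wiring that in against the serverBootstrap from the issue description; EpollChannelOption/EpollMode come from io.netty.channel.epoll:)

    // Epoll-specific option; only set this when the Epoll event loop groups / channel type are in use.
    serverBootstrap.option(EpollChannelOption.EPOLL_MODE, EpollMode.EDGE_TRIGGERED);
    serverBootstrap.childOption(EpollChannelOption.EPOLL_MODE, EpollMode.EDGE_TRIGGERED);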

@franz1981
Contributor

franz1981 commented Sep 20, 2021

Good point, it could be that the wakeup is happening spuriously (each time, so not so "spuriously"), or that there are scheduled tasks that cause epoll to wake up and go back to sleep continuously.

@praveen97uma
Author

@franz1981 @hyperxpro Thank you both for the help. So, I am making the following changes and will keep you posted.

  1. Remove the InstrumentedThreadFactory, which instruments the threads created/running/terminated by a thread factory.
  2. Reduce the worker threads to 8 (equal to the number of cores). I am anyway using a separate event loop group for blocking channel handlers. Found out the hard way not to set affinity on more threads than there are cores.
  3. Try ServerBootstrap#option(EpollChannelOption.EPOLL_MODE, EpollMode.EDGE_TRIGGERED).

@praveen97uma
Author

@franz1981 I made the changes. The CPU has dropped to 40%, but IMO it is still high.

Screenshot 2021-09-20 at 9 01 15 PM

Screenshot 2021-09-20 at 8 59 43 PM

@franz1981
Contributor

franz1981 commented Sep 20, 2021

It's still too much time...
Are there any scheduled tasks submitted?
Looking at the huge amount of samples spent dequeuing the priority queue on the right, I would say yes.

@praveen97uma
Author

I have scheduled tasks which run every 200ms to consume messages from a buffer queue for each connection and push them to the client. These are scheduled on the channel's event executor itself. That is the cause of all this, I guess. @franz1981
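
(Roughly what that per-connection scheduling looks like; flushBatch is a hypothetical stand-in for the batching/flush logic. With ~27K connections this arms ~27K independent recurring timers on the event loops:)

    // One recurring task per channel, scheduled on that channel's own event loop.
    ScheduledFuture<?> flushTask = channel.eventLoop().scheduleAtFixedRate(
            () -> flushBatch(channel),      // drain this connection's buffer queue, write and flush
            200, 200, TimeUnit.MILLISECONDS);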

@franz1981
Contributor

franz1981 commented Sep 20, 2021

Yeah, probably rescheduling it is a bit more costly with epoll, or it could be due to the hypervised environment you run the app in; according to the flamegraphs it doesn't seem to be a bare-metal server, no? Maybe I'm wrong :)

@hyperxpro
Contributor

Are you running the program on a shared-CPU VM or on bare metal?

@praveen97uma
Author

It's not bare metal but an Azure VM. These are virtual CPUs.

@normanmaurer
Member

@praveen97uma can you upload the flame graph so we can navigate it? It seems it spends quite some time in timerfd_settime.

@franz1981
Contributor

franz1981 commented Sep 20, 2021

@normanmaurer It looks to me that, due to the too-tight period of the scheduled tasks (for that virtual env), epoll always finds some new scheduled task to re-arm, causing that method to be called too much.

                                if (curDeadlineNanos == prevDeadlineNanos) {
                                    // No timer activity needed
                                    strategy = epollWaitNoTimerChange();
                                } else {
                                    // Timerfd needs to be re-armed or disarmed
                                    prevDeadlineNanos = curDeadlineNanos;
                                    strategy = epollWait(curDeadlineNanos);
                                }

on EpollEventLoop

@praveen97uma
Author

praveen97uma commented Sep 20, 2021

@praveen97uma
Author

@normanmaurer @franz1981 @hyperxpro Guys, so ByteBuf.duplicate() creates a new UnpooledDuplicatedByteBuf(). Is it possible to let Netty create a pooled version of the ByteBuf? This particular call is causing the highest allocations in my profiling.

@normanmaurer
Member

@praveen97uma if you use the PooledByteBufAllocator it should pool these.

@normanmaurer
Member

@praveen97uma also, can you explain why and how you schedule so many tasks?

@praveen97uma
Author

@normanmaurer We had to batch-push messages over the connection every 200ms. These msgs are merged into a specific structure to reduce bytes on the wire. I have a buffer queue for every connection, and tasks are scheduled for every connection on the event loops that do this batching and flush over the connection. Is this an entirely wrong approach, and is there a better way to do this in Netty? I had read about FlushConsolidationHandler but I'm not sure if there is a hook I can use to do the merging I want before flushing.

I am using PooledByteBufAllocator.

serverBootstrap.childOption(ChannelOption.ALLOCATOR, new PooledByteBufAllocator(true));


    public static ByteBuf encodePublishOperation(ByteBufAllocator byteBufAllocator,
            List<ServerPublishOperation> topicPublishes) {

        ByteBuf byteBuf = byteBufAllocator.buffer();

ServerPublishOperation is a ByteBufHolder. The channel's alloc() is passed to the above function.

Screenshot 2021-10-04 at 3 26 11 PM

@normanmaurer
Member


I see... If you switch to retainedDuplicate() it should be pooled as well. Just note that this will also increment the reference count by 1.
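
(A rough sketch of the difference, assuming msg is a ByteBuf obtained from a pooled allocator; exact wrapper classes vary by Netty version:)

    ByteBuf dup = msg.duplicate();              // derived view, shares msg's memory and refCnt; the wrapper itself is unpooled
    ByteBuf retained = msg.retainedDuplicate(); // derived view whose wrapper can be recycled; bumps the reference count by 1
    // ... use 'retained' ...
    retained.release();                         // releases the extra reference taken by retainedDuplicate()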

@chrisvest
Contributor

I have a buffer queue for every connection, and tasks are scheduled for every connection on the event loops that do this batching and flush over the connection. Is this an entirely wrong approach, and is there a better way to do this in Netty?

If you keep track of your connections (ideally just the ones with messages), you could have a single scheduled task that goes over all current connections and processes their batches.

@NiteshKant
Member

If you keep track of your connections (ideally just the ones with messages), you could have a single scheduled task that goes over all current connections and processes their batches.

+1 .. maybe not a single task, but a single task per event loop, so that you do not have to jump event loops to flush the connections.
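
(A sketch of that shape, with hypothetical names (channelsOnThisLoop, flushBatch); the set is only touched from its own event loop, so no synchronization is needed:)

    // Channels registered on this event loop, maintained from channelActive/channelInactive.
    final Set<Channel> channelsOnThisLoop = new HashSet<>();

    // One recurring task per event loop instead of one per connection.
    eventLoop.scheduleAtFixedRate(() -> {
        for (Channel ch : channelsOnThisLoop) {
            flushBatch(ch); // hypothetical: merge and flush that connection's queued messages
        }
    }, 200, 200, TimeUnit.MILLISECONDS);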

@normanmaurer
Member

I wonder if the time used by timerfd_settime is any indicator that we might be better off not using it when possible (when the timeout is in the milliseconds range) and just use epoll_wait(...) with the right timeout. WDYT?

@chrisvest
Contributor

Only using epoll_wait when the timeout is greater than a millisecond will cause rounding in timeouts, though. For 200 millisecond timeouts that probably won't matter, but it might for a 1.5 millisecond timeout. So we'd need to pick a cut-off point. On kernel 5.11 and newer, we can use epoll_pwait2, which takes a timespec parameter.

@normanmaurer
Member

Let me put this on my todo list...

normanmaurer self-assigned this Mar 3, 2022
@normanmaurer
Member

@praveen97uma I think this should improve things: #12145

normanmaurer added a commit that referenced this issue Mar 10, 2022
…uts are scheduled

Motivation:

At the moment we might end up calling timerfd_settime every time a new timer is scheduled. This can produce quite some overhead. We should try to reduce the number of syscalls when possible.

Modifications:

- If we are using Linux Kernel >= 5.11 use directly epoll_pwait2(...)
- If the scheduled timeout is big enough just use epoll_wait(...) without timerfd_settime and accept some inaccuracy.

Result:

Fixes #11695
lhotari added a commit to apache/pulsar that referenced this issue Apr 19, 2022
)

* Upgrade Netty to 4.1.76.Final and Netty Tcnative to 2.0.51.Final

Fixes #14015
- release notes https://netty.io/news/2022/04/12/4-1-76-Final.html
  - contains fix for netty/netty#11695

* Upgrade grpc to 1.45.1 and protobuf to 3.19.2

- grpc < 1.45.1 is not compatible with Netty > 4.1.74.Final
  - grpc/grpc-java#9004