
EPollArrayWrapper.epollWait 100% CPU Usage #327

Closed
blucas opened this Issue · 36 comments
@blucas

Hi,

I believe I have an issue similar to #302, but on Linux (Ubuntu 10.04) with JDK 1.6.0u30 and JDK 1.7.0u4, using Netty 4.0.0 (revision 52a7d28).

The app proxies connections to backend systems. The proxy has a pool of channels that it can use to send requests to the backend systems; if the pool is low on channels, new channels are spawned and put into the pool so that requests sent to the proxy can be serviced. The pools are populated on app startup, which is why it doesn't take long at all for the CPU to spike through the roof (22 seconds into the app's lifecycle).

The test box has two CPUs, the output from 'top' is below:

PID  USER   PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
8220 root   20   0 2281m 741m  10m R 50.2 18.7  0:22.57 java                                                                             
8218 root   20   0 2281m 741m  10m R 49.9 18.7  0:22.65 java                                                                             
8219 root   20   0 2281m 741m  10m R 49.2 18.7  0:22.86 java                                                                             
8221 root   20   0 2281m 741m  10m R 49.2 18.7  0:22.20 java 

Thread dump for the four NioClient-based worker threads that are chewing up all the CPU:

"backend-worker-pool-7-thread-1" prio=10 tid=0x00007f5918015800 nid=0x201a runnable [0x00007f5924ba3000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    - locked <0x000000008be93580> (a sun.nio.ch.Util$2)
    - locked <0x000000008be93570> (a java.util.Collections$UnmodifiableSet)
    - locked <0x000000008be92548> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    at io.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:55)
    at io.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:261)
    at io.netty.channel.socket.nio.NioWorker.run(NioWorker.java:37)
    at io.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:43)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
    - <0x000000008be00748> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

"backend-worker-pool-7-thread-2" prio=10 tid=0x00007f5918012000 nid=0x201b runnable [0x00007f5924b82000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    - locked <0x000000008be94a28> (a sun.nio.ch.Util$2)
    - locked <0x000000008be94a18> (a java.util.Collections$UnmodifiableSet)
    - locked <0x000000008be90648> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    at io.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:55)
    at io.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:261)
    at io.netty.channel.socket.nio.NioWorker.run(NioWorker.java:37)
    at io.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:43)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
    - <0x000000008be904c8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

"backend-worker-pool-7-thread-3" prio=10 tid=0x00007f5918007800 nid=0x201c runnable [0x00007f5924b61000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    - locked <0x000000008be952e0> (a sun.nio.ch.Util$2)
    - locked <0x000000008be952d0> (a java.util.Collections$UnmodifiableSet)
    - locked <0x000000008be8f858> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    at io.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:55)
    at io.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:261)
    at io.netty.channel.socket.nio.NioWorker.run(NioWorker.java:37)
    at io.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:43)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
    - <0x000000008be8f618> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

"backend-worker-pool-7-thread-4" prio=10 tid=0x00007f5918019000 nid=0x201d runnable [0x00007f5924b40000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    - locked <0x000000008be003f8> (a sun.nio.ch.Util$2)
    - locked <0x000000008be003e8> (a java.util.Collections$UnmodifiableSet)
    - locked <0x000000008be00408> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    at io.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:55)
    at io.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:261)
    at io.netty.channel.socket.nio.NioWorker.run(NioWorker.java:37)
    at io.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:43)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
    - <0x000000008be004e0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

@blucas

I've done some googling and it looks like this could be a JDK bug.
Take a look at the thread dump supplied on 13-MAY-2009; it is exactly the same as mine! Although they say the bug has been fixed, it obviously hasn't been. Could you please investigate the workaround supplied in the link? They mention cancelling the selector and invoking a selectNow() method. I have NO clue where to do that myself to test whether it actually works, so if someone could change it in the repo, I'm more than happy to give it a test.

In fact, I've found a post by @trustin on this very subject, but it looks like no changes were made back then.
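For reference, a minimal sketch of the cancel-then-selectNow() pattern mentioned in that bug report, in plain java.nio; the method and variable names here are illustrative and not Netty code:

import java.io.IOException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

final class SelectNowWorkaroundSketch {
    // Cancel the key suspected of causing the spin, flush the cancelled-key set
    // with selectNow(), then re-register the channel. Registering again without
    // the selectNow() call would fail with CancelledKeyException until the next
    // select() had run.
    static void resetKey(Selector selector, SelectionKey suspectKey) throws IOException {
        SelectableChannel channel = suspectKey.channel();
        int interestOps = suspectKey.interestOps();
        Object attachment = suspectKey.attachment();

        suspectKey.cancel();
        selector.selectNow();                                 // flush the cancelled key
        channel.register(selector, interestOps, attachment);  // re-register cleanly
    }
}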

@blucas

@trustin @normanmaurer Any thoughts on this?

@Rena4ka

Please update. Moved to a new server with j6u33 and am now experiencing this bug. Will try to downgrade the JRE...

@Rena4ka

It happens on Netty 3.5.0.Final, 3.2.7.Final and 3.4.7 with j6u33, and maybe others.

@twinforces

Yeah, I see this too with UltraESB.

I think it's a problem with anything that uses the non-blocking I/O library.

@incubos

I can confirm this bug on Netty 3.3.0.Final (from Akka 2.0.2) and j7u4.

@vikiitd

Hi,

We are also seeing this issue in our production environment under heavy concurrent load. Are there any updates or a fix for it? On some forums there are discussions about this issue being resolved by upgrading the JDK, etc. I am not seeing the issue with netty-3.2.4.Final.jar, but it happens consistently with netty-3.4.4.Final.jar. Does anyone know of a workaround besides downgrading back to the 3.2.4 Netty jar?

Thanks in advance!

@vikiitd

I am also observing this issue with netty-3.2.6.Final.jar, although less frequently. Is it possible that it was introduced before 3.3, or that it is not a Netty issue at all?

@trustin
Owner

Is there any reliable way to reproduce this problem? That would make it much easier for me to fix this issue (or, more accurately, to apply a workaround).

@vrajesh1989

Hi. This issue also occurs with netty-3.2.4.Final.jar. There is no reliable way to reproduce it; it does not happen consistently.

@normanmaurer normanmaurer was assigned
@normanmaurer
Collaborator

Let me see if I can fix this ...

@normanmaurer normanmaurer referenced this issue from a commit
@normanmaurer normanmaurer Re-create Selector if we hit the epoll(..) jdk bug which leads to 100% cpu load. This is just a workaround but helps to recover. See #327
ff3f2b6
@normanmaurer
Collaborator

@trustin I think this is because of an epoll bug which either was never fixed, or was fixed and has now regressed.

Please review the following workaround:
#565

So basically we do the following here:

  • check whether Selector.select(timeout) returned 0 and completed before the timeout
  • if so, we increment a counter
  • if the counter reaches 10, we create a new Selector, register all the Channels from the old one on the new one, close the old Selector, and then reset the counter
  • if the counter is still below 10, we just keep processing as usual
  • if Selector.select(timeout) returns > 0, or only completes after the timeout, we reset the counter

The fix is basically the same as what Jetty, Grizzly and MINA do :) WDYT?
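A minimal sketch of the loop described above, written against plain java.nio rather than Netty's internals; the threshold of 10, the timeout and all names are illustrative (see #565 for the actual change):

import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class SelectorRebuildSketch {

    private static final int REBUILD_THRESHOLD = 10;        // "counter == 10" above
    private static final long SELECT_TIMEOUT_MILLIS = 500;  // illustrative timeout

    private Selector selector;
    private int prematureReturns;

    public SelectorRebuildSketch() throws IOException {
        this.selector = Selector.open();
    }

    void selectLoop() throws IOException {
        for (;;) {
            long before = System.nanoTime();
            int selected = selector.select(SELECT_TIMEOUT_MILLIS);
            long elapsedMillis = (System.nanoTime() - before) / 1000000L;

            if (selected == 0 && elapsedMillis < SELECT_TIMEOUT_MILLIS) {
                // select() returned early with nothing selected: possible epoll spin.
                if (++prematureReturns >= REBUILD_THRESHOLD) {
                    rebuildSelector();
                    prematureReturns = 0;
                }
            } else {
                // Keys were selected, or the timeout genuinely elapsed: reset.
                prematureReturns = 0;
            }

            // ... process selected keys here ...
        }
    }

    private void rebuildSelector() throws IOException {
        Selector newSelector = Selector.open();
        // Move every valid registration from the broken selector to the new one.
        for (SelectionKey key : selector.keys()) {
            if (!key.isValid()) {
                continue;
            }
            key.channel().register(newSelector, key.interestOps(), key.attachment());
            key.cancel();
        }
        Selector old = selector;
        selector = newSelector;
        old.close();
    }
}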

@vikiitd

It would be great if we could get the workaround provided by @normanmaurer. The JDK bug seems to have been hanging around for some time now, and I'm not sure it's getting fixed anytime soon.

@normanmaurer
Collaborator

@vikiitd I lost hope of getting NIO bugs fixed in core Java a long time ago :(

@normanmaurer
Collaborator

The workaround for the epoll(..) bug has been committed.

@normanmaurer normanmaurer reopened this
@normanmaurer normanmaurer referenced this issue from a commit
@normanmaurer normanmaurer Correctly detect if the Selector.select(..) unblocks because of a manual wakeup or because of a closed channel. See #327
63a234b
@normanmaurer normanmaurer referenced this issue from a commit
@normanmaurer normanmaurer Disable epoll bug workaround by default. It can be enabled via -Dorg.jboss.netty.epollBugWorkaround=true. It will be enabled by default later if it has proven to be stable. See #327
45829ff
@leahxschmidt leahxschmidt referenced this issue from a commit
Commit has since been removed from the repository and is no longer available.
@cmunger

We had the same issue on a Solaris x86 platform with JDK 1.7.0_09, across multiple applications using the NIO API; Netty 3.5.11 was affected as well. We ended up enabling the -Dorg.jboss.netty.epollBugWorkaround system property; this fixed the 100% CPU problem, and we saw 3-4 warnings a day saying the epoll bug had been detected and the selector recreated. I then looked at the next Sun JVM release (1.7.0_10) and saw that they had fixed a similar bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7200742. The good news is that after upgrading our software to JDK 1.7.0_11 the 100% CPU bug disappeared. We are now running Netty 3.5.11 fine without the org.jboss.netty.epollBugWorkaround system property. So upgrade your JDK.
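For anyone wanting to try the property: it is normally passed on the command line as -Dorg.jboss.netty.epollBugWorkaround=true. A sketch of setting it programmatically instead, under the assumption that the property is read once when Netty's selector utilities are first loaded, so it must be set before any channel factory or bootstrap is created:

public final class EnableEpollWorkaround {
    public static void main(String[] args) {
        // Assumption: this must run before any Netty class that reads the property
        // is loaded. Equivalent to passing -Dorg.jboss.netty.epollBugWorkaround=true
        // on the java command line.
        System.setProperty("org.jboss.netty.epollBugWorkaround", "true");

        // ... create the ChannelFactory / Bootstrap and start the application here ...
    }
}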

@ghost

I suspect I've also run into this with Netty 3.6.1.Final, on OpenStack, Linux 3.2.0-38-virtual #61-Ubuntu SMP, with IcedTea OpenJDK 1.7.0_21 (7u21-2.3.9-0ubuntu0.12.04.1). At 100% CPU, every thread is sleeping except for the Netty workers, which are all in state IN_NATIVE:

Thread 22428: (state = IN_NATIVE)
 - sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0 (Compiled frame; information may be imprecise)
 - sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=228 (Compiled frame)
 - sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=81 (Compiled frame)
 - sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=87 (Compiled frame)
 - sun.nio.ch.SelectorImpl.select(long) @bci=30, line=98 (Compiled frame)
 - org.jboss.netty.channel.socket.nio.SelectorUtil.select(java.nio.channels.Selector) @bci=4, line=64 (Compiled frame)
 - org.jboss.netty.channel.socket.nio.AbstractNioSelector.select(java.nio.channels.Selector) @bci=1, line=410 (Compiled frame)
 - org.jboss.netty.channel.socket.nio.AbstractNioSelector.run() @bci=49, line=207 (Compiled frame)
 - org.jboss.netty.channel.socket.nio.AbstractNioWorker.run() @bci=1, line=88 (Interpreted frame)
 - org.jboss.netty.channel.socket.nio.NioWorker.run() @bci=1, line=178 (Interpreted frame)
 - org.jboss.netty.util.ThreadRenamingRunnable.run() @bci=55, line=108 (Interpreted frame)
 - org.jboss.netty.util.internal.DeadLockProofWorker$1.run() @bci=14, line=42 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=722 (Interpreted frame)

Strace shows heavy use of futex and epoll_wait, but I'm not sure what a baseline is here:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 82.23  195.596490        1578    123991     21585 futex
 17.01   40.453827        1508     26820           epoll_wait
  0.35    0.821100           1   1309799           gettimeofday
  0.23    0.552041       23002        24         7 restart_syscall
  0.08    0.185864          30      6144           sendto
  0.06    0.139350           6     22414           write
  0.04    0.096209           1    148338           clock_gettime
  0.00    0.010583           1      7175           recvfrom
  0.00    0.003684           0     25116      5638 read
  0.00    0.000000           0      3532           mprotect
  0.00    0.000000           0       156           rt_sigreturn
  0.00    0.000000           0      2230           sched_yield
------ ----------- ----------- --------- --------- ----------------
100.00  237.859148               1675739     27230 total

I haven't been able to reproduce this outside of production yet. I've added -Dorg.jboss.netty.epollBugWorkaround=true; we'll see if that prevents the issue from recurring.

@trustin
Owner
@ghost

Trustin, would you like to shoot me an email? kingsbury@factual.com.

@ghost

I swapped out the JVM for Oracle 1.7.0_21-b11. I'm still seeing surprisingly high CPU use, but it hasn't spiked to 100% yet. jvisualvm claims pretty much everything is idle except for the Netty worker threads, which are spending their CPU time in sun.nio.ch.SelectorImpl.select():

[screenshot: jvisualvm CPU view, 2013-05-03 13:57:06]

@vbajaria

I have been seeing a similar issue in my production environment too: Ubuntu 12.04 with JDK 1.6.0_38-b05, running Netty 3.6.2.

Is the patch not in that release, or am I missing something?

@kingsbury @trustin what was the outcome of your debugging? Any pointers?

@vbajaria

@normanmaurer I saw your answer at hazelcast/hazelcast#81 and also tried the fix mentioned in that post, but it does not seem to have helped in my case.

I ran a local test and bumped my Java version up to 1.7, but the issue does not seem to have gone away.

My "New I/O worker"(s) keep on spinning even though there is no work to be done.

What else could I look at?

@s2oBCN

I have the same issue in a production environment, and I can't change the Java version, which is JRockit 1.5.0_26. My Netty version is 3.6.2.Final. Is this problem fixed in newer versions?

Or is there any workaround?

Thank you!!!!!!!!!

[screenshot: nettycpu]

@s2oBCN

Hmm, and I am running:
Red Hat Enterprise Linux Server release 5.3
Linux version 2.6.18-128.el5

Maybe this is the epoll bug, but I thought that had already been handled in Netty 3.6.2.Final?

I think one solution could be to change from NioClientSocketChannelFactory to OioClientSocketChannelFactory. In any case, my clients each use only one connection to the server and never receive a response from it (setReadable(false)).
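For reference, a minimal sketch of building the client bootstrap with the blocking (OIO) factory instead; the class name is illustrative. OIO uses one worker thread per connection and no Selector at all, so the epoll spin cannot occur, but it only makes sense for a small, fixed number of connections, like the single connection per client described above:

import java.util.concurrent.Executors;

import org.jboss.netty.bootstrap.ClientBootstrap;
import org.jboss.netty.channel.socket.oio.OioClientSocketChannelFactory;

public final class OioClientBootstrapSketch {
    static ClientBootstrap newBootstrap() {
        // Blocking I/O: one worker thread per connection, no Selector involved.
        ClientBootstrap bootstrap = new ClientBootstrap(
                new OioClientSocketChannelFactory(Executors.newCachedThreadPool()));
        // Set the pipeline factory and options, then connect(...) as before.
        return bootstrap;
    }
}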

This is the trace:
"New I/O worker #1" id=57 idx=0xe0 tid=23001 prio=5 alive, in native
at sun/nio/ch/PollArrayWrapper.poll0(JIJ)I(Native Method)
at sun/nio/ch/PollArrayWrapper.poll(PollArrayWrapper.java:100)[inlined]
at sun/nio/ch/PollSelectorImpl.doSelect(PollSelectorImpl.java:56)[optimized]
at sun/nio/ch/SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)[inlined]
at sun/nio/ch/SelectorImpl.select(SelectorImpl.java:80)[inlined]
at org/jboss/netty/channel/socket/nio/SelectorUtil.select(SelectorUtil.java:64)[inlined]
at org/jboss/netty/channel/socket/nio/AbstractNioSelector.select(AbstractNioSelector.java:409)[optimized]
^-- Holding lock: sun/nio/ch/Util$1@ 0x87494e68[biased lock]
^-- Holding lock: java/util/Collections$UnmodifiableSet@ 0x87494e58[biased lock]
^-- Holding lock: sun/nio/ch/PollSelectorImpl@ 0x874926f8[biased lock]

Thank you!!

@mzafir

Is this issue fixed, or did it simply get closed without a resolution?

@jhooda

I'm seeing a similar issue on Ubuntu 12.04 with Sun/Oracle build 1.7.0_60-b19. Below is the stack trace:
---------------------------8<---------------
"New I/O worker #7" prio=10 tid=0x00007f31d8d88000 nid=0x2485 runnable [0x00007f31bce65000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked <0x00000000c0105168> (a sun.nio.ch.Util$2)
- locked <0x00000000c0105158> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000000c0105040> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at org.jboss.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:68)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:409)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:206)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
--------------------------8<-------------------------------------

@normanmaurer
Collaborator
@mzafir
@jhooda

I'm using io.netty.netty-3.7.1.Final.jar

@jhooda

I have one boss thread and 9 worker threads in RUNNABLE state that are driving up CPU (my webapp is idle, with zero requests). I was expecting the threads to be in WAITING state.

@jhooda

Sorry, I missed this info: the kernel is "3.13.0-24-generic".

@mzafir
@jhooda

We are using io.netty 3.7.1.Final (http://mvnrepository.com/artifact/io.netty/netty/3.7.1.Final). This Netty version came bundled with Play 2.2.3, which we are using (from http://www.playframework.com/download).
Thanks.

@cbatson

I have been trying to make sense of this issue for ages. I'm wondering whether it's actually a profiling artifact: http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.html

@krasa

@cbatson I have this problem now too; it seems related to YourKit. My app goes nuts and eats CPU when sampling is enabled with the agent attached.
[screenshot]

@badnotes

I think it is running short of network resources.
