EPollArrayWrapper.epollWait 100% CPU Usage #327
|
I've done some googling and it looks like this could be a JDK bug. In fact I've actually found a post by @Trustin on this very subject. But it looks like no changes were made back then. |
|
@trustin @normanmaurer Any thoughts on this? |
|
Please update. Moved to a new server with j6u33 and now experiencing this bug. Will try to downgrade the JRE... |
|
It happens on Netty 3.5.0.Final, 3.2.7.Final and 3.4.7 with j6u33, and maybe others. |
|
Yeah, I see this too with UltraESB. I think it's a problem with anything that uses the non-blocking IO library. |
|
I confirm this bug on Netty 3.3.0 Final (from Akka 2.0.2) & j7u4. |
|
Hi, we are seeing this issue in our production environment too, under heavy concurrent load. Are there any updates or a fix for this issue? On some forums there are discussions about this issue being resolved by upgrading the JDK, etc. I am not seeing this issue with netty-3.2.4.Final.jar, but it happens consistently with netty-3.4.4.Final.jar. Does anyone know of a workaround for this issue besides downgrading back to the 3.2.4 Netty jar? Thanks in advance! |
|
I am observing this issue with netty-3.2.6.Final.jar as well, although less frequently. Is it possible that this issue was introduced before 3.3, or that it is not a Netty issue at all? |
|
Is there any reliable way to reproduce this problem? That would make it much easier for me to fix this issue (probably it's more correct to say 'apply a workaround'). |
|
Hi. This issue is occurring with netty-3.2.4.Final.jar as well. There is no reliable way to reproduce it; it does not happen consistently. |
|
Let me see if I can fix this ... |
|
@trustin I think this is because of an epoll bug which still is not fixed, or was fixed and is now present again. Please review the following workaround. So basically we do the following here:
The fix is kind of the same as what Jetty, Grizzly and MINA do :) WDYT?
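For readers unfamiliar with the technique: the Jetty/Grizzly/MINA-style workaround boils down to counting premature select() wakeups and rebuilding the Selector once a threshold is hit. Below is a minimal sketch of that idea, not Netty's actual code; the class name, threshold and timeout values are illustrative only.

    // Sketch: detect a Selector that keeps returning 0 selected keys without
    // actually blocking (the epoll spin), then rebuild it and re-register the channels.
    import java.io.IOException;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;

    public class SelectorRebuildSketch {
        // Illustrative threshold; real implementations pick their own value.
        private static final int SPIN_THRESHOLD = 512;
        private Selector selector;

        SelectorRebuildSketch(Selector selector) {
            this.selector = selector;
        }

        void selectLoop() throws IOException {
            int spinCount = 0;
            for (;;) {
                long start = System.nanoTime();
                int selected = selector.select(500);
                long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;

                if (selected == 0 && elapsedMillis < 500) {
                    // select() returned immediately with nothing to do:
                    // possibly the epoll spin bug.
                    if (++spinCount >= SPIN_THRESHOLD) {
                        rebuildSelector();
                        spinCount = 0;
                    }
                } else {
                    spinCount = 0;
                    // ... process selector.selectedKeys() here ...
                }
            }
        }

        private void rebuildSelector() throws IOException {
            Selector newSelector = Selector.open();
            // Move every registered channel (with its interest ops and attachment)
            // over to the fresh selector, then close the broken one.
            for (SelectionKey key : selector.keys()) {
                if (key.isValid() && key.channel().keyFor(newSelector) == null) {
                    key.channel().register(newSelector, key.interestOps(), key.attachment());
                    key.cancel();
                }
            }
            Selector oldSelector = selector;
            selector = newSelector;
            oldSelector.close();
        }
    }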
|
It would be great if we could have the workaround provided by @normanmaurer. The JDK bug seems to have been hanging around for some time now, and I'm not sure it will be fixed anytime soon. |
|
@vikiitd I lost hope of getting NIO bugs fixed in core Java a long time ago :( |
|
A workaround for the epoll(..) bug has been committed. |
|
We had the same issue on a Solaris x86 platform with JDK 1.7.0_09, in multiple applications using the NIO API. Netty 3.5.11 was also affected. We ended up enabling the -Dorg.jboss.netty.epollBugWorkaround system property; this fixed the 100% CPU problem, and we got 3-4 warnings a day saying the epoll bug was detected and the selector recreated. I decided to look at the next Sun JVM release (1.7.0_10) and saw that they have fixed a similar bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7200742. The good news is that after upgrading our software to JDK 1.7.0_11 the 100% CPU bug has disappeared. We are now running Netty 3.5.11 without the org.jboss.netty.epollBugWorkaround system property just fine. So upgrade your JDK. |
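For anyone landing here looking for the flag mentioned above: it is passed to the JVM on the command line, for example as below (your-app.jar is a placeholder).

    java -Dorg.jboss.netty.epollBugWorkaround=true -jar your-app.jar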
|
I suspect I've also run into this with Netty 3.6.1.Final, on OpenStack, Linux 3.2.0-38-virtual #61-Ubuntu SMP, with IcedTea OpenJDK 1.7.0_21 (7u21-2.3.9-0ubuntu0.12.04.1). At 100% CPU, every thread is sleeping except for the Netty workers, which are all in state NATIVE:
strace shows heavy use of futex and epoll_wait, but I'm not sure what a baseline is here:
I haven't been able to reproduce this outside of production yet. I've added -Dorg.jboss.netty.epollBugWorkaround=true; we'll see if that prevents the issue from recurring. |
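A generic recipe (not from the original report) for confirming which thread is spinning: find the hot native thread with top, convert its LWP id to hex, and match it against the nid= field in a jstack dump. The pid 12345 and LWP 6789 below are placeholders.

    top -H -p 12345                          # per-thread CPU usage for the JVM process
    printf '%x\n' 6789                       # hex of the hot thread id -> 1a85
    jstack 12345 | grep -A 20 'nid=0x1a85'   # locate that thread in the Java thread dump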
|
I'm investigating this issue at Twitter. I can provide a JAR with debug logging enabled to get a better understanding of the problem. Will you run it for me and get back with the log file? It might perform worse due to the large amount of messages, though. |
|
Trustin, would you like to shoot me an email? kingsbury@factual.com. |
|
I have been seeing a similar issue in my production environment too. It's Ubuntu 12.04 with JDK 1.6.0_38-b05, running Netty 3.6.2. Is this patch not in that release, or am I missing something? @kingsbury @trustin what was the outcome of your debugging? Any pointers? |
|
@normanmaurer I saw your answer at hazelcast/hazelcast#81 and also tried the fix mentioned in that post, but it does not seem to have helped in my case. I did a local test and bumped my Java version up to 1.7, and the issue still isn't fixed. My "New I/O worker" threads keep on spinning even though there is no work to be done. What else could I look at? |
|
Hmm, and I have this too. Maybe this is the epoll bug, but I thought that was fixed in Netty 3.6.2.Final? I think one solution could be to change from NioClientSocketChannelFactory to OioClientSocketChannelFactory. In any case, in my clients I only use one connection to the server, and they never read a response from the server (setReadable(false)). This is the trace: Thank you!! |
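A minimal sketch of the OIO fallback the previous comment suggests, under the Netty 3.x API; the host, port and logging handler are placeholders, and OIO means one blocking thread per connection, so it only fits the few-connections case described above.

    import java.net.InetSocketAddress;
    import java.util.concurrent.Executors;

    import org.jboss.netty.bootstrap.ClientBootstrap;
    import org.jboss.netty.channel.ChannelPipeline;
    import org.jboss.netty.channel.ChannelPipelineFactory;
    import org.jboss.netty.channel.Channels;
    import org.jboss.netty.channel.socket.oio.OioClientSocketChannelFactory;
    import org.jboss.netty.handler.logging.LoggingHandler;

    public class OioClientSketch {
        public static void main(String[] args) {
            // OIO uses one blocking thread per connection instead of a shared
            // selector, so the epoll spin cannot occur (at the cost of scalability).
            ClientBootstrap bootstrap = new ClientBootstrap(
                    new OioClientSocketChannelFactory(Executors.newCachedThreadPool()));

            bootstrap.setPipelineFactory(new ChannelPipelineFactory() {
                public ChannelPipeline getPipeline() {
                    // LoggingHandler stands in for the real client handler.
                    return Channels.pipeline(new LoggingHandler());
                }
            });

            // "example.com" and 8080 are placeholders for the real server.
            bootstrap.connect(new InetSocketAddress("example.com", 8080));
        }
    }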
|
Is this issue fixed, or was it simply closed without a resolution? |
|
I'm seeing a similar issue on Ubuntu 12.04 with Sun/Oracle build 1.7.0_60-b19. Below is the stack trace: |
|
What kernel version?
|
|
It's a RUNNABLE thread; what is the issue you are facing?
|
|
I'm using io.netty.netty-3.7.1.Final.jar |
|
I have one boss thread and 9 worker threads in RUNNABLE state that are driving up CPU (my webapp is idle with zero requests). I was expecting the threads to be in WAITING state? |
|
Sorry, missed this info. The kernel is "3.13.0-24-generic". |
|
Please run rpm -qa and let me know what Netty release you are using.
|
|
We are using 3.7.1.Final: io.netty 3.7.1.Final (http://mvnrepository.com/artifact/io.netty/netty/3.7.1.Final) |
|
I have been trying to make sense of this issue for ages. I'm wondering if it's not a problem with profiling: http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.html |
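For context on the linked post: hprof CPU sampling is enabled with a JVM agent flag like the one below, and it tends to report threads parked in native epollWait as if they were consuming CPU. The interval and depth values, and your-app.jar, are illustrative.

    java -agentlib:hprof=cpu=samples,interval=10,depth=30 -jar your-app.jar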
|
@cbatson I have this problem now too, and it seems related to YourKit. My app goes nuts and eats CPU when sampling is enabled with the agent attached. |
|
I think it is running short of network resources. |
|
I think I may be seeing this issue on netty 4.0.33, java 8. Was the workaround for this issue removed in the 4.0 series? |
|
I am seeing that too with Java 8 and Netty 3.10.5.Final |
|
hi "nioEventLoopGroup-5-238" prio=10 tid=0x00007f4b2c1d4000 nid=0x77a2 runnable [0x00007f4918911000] |
|
Hi, I have a problem like this. When the socket state changes to CLOSED, CPU usage for that thread goes high (approx. 25% CPU). netstat shows:
10.137.25.36.3869 10.133.142.93.3868 32768 0 102400 0 17/17 CLOSED
top: last pid: 8340; load avg: 1.06, 1.05, 1.03; up 3+07:02:58 23:16:48 (PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND)
Thread dump below for that process... Thread t@123: (state = IN_NATIVE)
|
|
Hi, any solution to this problem? We are running on a Linux server using Java 8 and Netty 3.10.4.Final. |
|
Hi, any solution to this problem? "nioEventLoopGroup-3-2" #56 prio=10 os_prio=0 tid=0x0000000017f2b800 nid=0x46c7 runnable [0x00002af3fe020000] |
|
|
|
@adjsnlxx What exact version of the JDK are you using? Curious because we ran into a very similar issue, not with Netty but with Oracle Coherence, and it turned out to be a JDK bug. Our instances would lock up, hit max server threads and stop responding. What drew me to this thread was the epollWait calls in the thread dumps. Here was our thread stack trace:
Thread analysis showed a thread trying to lock an object (DatagramPacket) that no other thread had locked. This is a soft reference, so it is not cleaned up until a full GC.
Again, this was Coherence, not Netty. I mention it because the issue is similar and turned out to be a bug in the JDK. The Coherence team filed a bug, but the Java team closed it as they couldn't reproduce it. (We were able to reproduce it with sufficient load.) Therefore, the Coherence team resolved the issue on their side. Some details on this bug: the behavior is observed in JDK 1.7.0_80 or later and JDK 1.8.0_45 or later (though as early as JDK 1.8.0_25, but not as frequently). Therefore it would be interesting to know the exact version of the JDK you see this issue with. |
|
@kupci I am using Java version 1.8.0_101, Netty 4.1.5. |
|
@adjsnlxx Are you able to test the issue with the JDKs mentioned above, i.e. a JDK before 1.7.0_80 or before 1.8.0_25? For example, we did not see the issue with JDK 1.7.0_25. I realize you may have Java 8 code, or there may be other bugs that prevent you from doing this, but it would help rule out the possibility that what you are seeing is a JDK bug. |
|
Same problem here. CentOS 7.2, kernel 3.10.0, JDK 1.8.0_77
|
|
@hbprotoss your stack shows Tomcat and not Netty, so it's not really about Netty. That said, I think it's a JDK / kernel issue. |
|
@hbprotoss @kupci can either of you maybe share a reproducer? |
|
Hey guys, we are having the same problem; however, in our case the thread count keeps increasing and reaches even 6000. We are using Netty 3.10.5.Final and JDK 1.8.101. Do you guys have any workaround for this? |
@amit2103 - Netty 3.x went EOL a while ago; I would recommend updating to 4.1. Also, I don't think we have been able to reproduce this issue, and no reproducer has been supplied. Can you provide a reproducer based upon 4.1 (or 4.x)? |
|
We have the same problem. Netty 4.1.0.CR7, JDK 1.8. We have about 6000 threads for nioEventLoopGroup and eventually hit OOM. Locked ownable synchronizers: |
|
@endlesstian - Can you provide a reproducer? Also can you update to the latest version of Netty (4.1.6) and list your version of JDK (also update to latest version if necessary). |
|
@endlesstian - Did you find a solution to your problem? I am facing the same problem with Netty 4.1.0.CR7, JDK 1.8, and almost 10k threads in Tomcat. Tomcat became unresponsive. |




Hi,
I believe I have an issue similar to #302, but on Linux (Ubuntu 10.04) with JDK 1.6.0u30 and JDK 1.7.0u4, using Netty 4.0.0 (Revision: 52a7d28).
The app is proxying connections to backend systems. The proxy has a pool of channels that it can use to send requests to the backend systems. If the pool is low on channels, new channels are spawned and added to the pool so that requests sent to the proxy can be serviced. The pools get populated on app startup, which is why it doesn't take long at all for the CPU to spike through the roof (22 seconds into the app lifecycle).
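A minimal sketch of that kind of pool (not the reporter's actual code) against the Netty 4 Bootstrap API; the backend host/port, pool sizes and logging handler are placeholders, and the replenishment logic is deliberately simplistic.

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    import io.netty.bootstrap.Bootstrap;
    import io.netty.channel.Channel;
    import io.netty.channel.ChannelFuture;
    import io.netty.channel.ChannelFutureListener;
    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.EventLoopGroup;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioSocketChannel;
    import io.netty.handler.logging.LoggingHandler;

    public class BackendChannelPool {
        private static final int LOW_WATER_MARK = 8;   // illustrative values
        private static final int TARGET_SIZE = 32;

        private final Queue<Channel> pool = new ConcurrentLinkedQueue<Channel>();
        private final Bootstrap bootstrap;

        public BackendChannelPool(EventLoopGroup group, String host, int port) {
            bootstrap = new Bootstrap()
                    .group(group)                      // shared worker group
                    .channel(NioSocketChannel.class)
                    .remoteAddress(host, port)
                    .handler(new ChannelInitializer<SocketChannel>() {
                        @Override
                        protected void initChannel(SocketChannel ch) {
                            // LoggingHandler stands in for the real backend handler.
                            ch.pipeline().addLast(new LoggingHandler());
                        }
                    });
            replenish();
        }

        /** Take a channel for one backend request; the caller returns it via release(). */
        public Channel acquire() {
            Channel ch = pool.poll();
            if (pool.size() < LOW_WATER_MARK) {
                replenish();
            }
            return ch; // may be null if the pool is empty; the caller must handle that
        }

        public void release(Channel ch) {
            if (ch != null && ch.isActive()) {
                pool.offer(ch);
            }
        }

        private void replenish() {
            // Connects are asynchronous, so concurrent calls may overshoot TARGET_SIZE;
            // good enough for a sketch.
            for (int i = pool.size(); i < TARGET_SIZE; i++) {
                bootstrap.connect().addListener(new ChannelFutureListener() {
                    @Override
                    public void operationComplete(ChannelFuture future) {
                        if (future.isSuccess()) {
                            pool.offer(future.channel());
                        }
                    }
                });
            }
        }

        public static void main(String[] args) {
            EventLoopGroup group = new NioEventLoopGroup();
            new BackendChannelPool(group, "backend.example.com", 8080); // placeholders
        }
    }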
The test box has two CPUs, the output from 'top' is below:
Thread dump for the four NioClient-based worker threads that are chewing up all the CPU: