High CPU spikes with jetty winstone threads #4717

Closed
satyagv557 opened this issue Mar 26, 2020 · 36 comments · Fixed by #7950
Labels
Enhancement · Mac OSX (For issues unique to Mac OSX)

Comments

@satyagv557

satyagv557 commented Mar 26, 2020

OS: Mac OS X 10.14.6, x86_64/64 (4 cores)
Java: Java(TM) SE Runtime Environment, 1.8.0_241-b07
JVM: Java HotSpot(TM) 64-Bit Server VM, 25.241-b07

Description
We upgraded Jenkins from v2.187 to v2.219 and this issue started happening; we tried Jenkins v2.225 but the issue remains.

Our production servers reported 100% CPU usage. The instance had been running for several days; the CPU spike appeared suddenly and remained until the instance was rebooted.

Hundreds of Jetty (winstone) threads are piling up and never getting closed, each stuck in sun.nio.ch.FileDispatcherImpl.preClose0(Native Method):

Jetty (winstone)-66537
sun.nio.ch.FileDispatcherImpl.preClose0(Native Method)
sun.nio.ch.SocketDispatcher.preClose(SocketDispatcher.java:59)
sun.nio.ch.SocketChannelImpl.implCloseSelectableChannel(SocketChannelImpl.java:838)
java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel(AbstractSelectableChannel.java:234)
java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:115)
org.eclipse.jetty.io.ChannelEndPoint.doClose(ChannelEndPoint.java:197)
org.eclipse.jetty.io.AbstractEndPoint.doOnClose(AbstractEndPoint.java:220)
org.eclipse.jetty.io.AbstractEndPoint.close(AbstractEndPoint.java:192)
org.eclipse.jetty.io.AbstractEndPoint.close(AbstractEndPoint.java:175)
org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.doClose(SslConnection.java:1366)
org.eclipse.jetty.io.AbstractEndPoint.doOnClose(AbstractEndPoint.java:220)
org.eclipse.jetty.io.AbstractEndPoint.close(AbstractEndPoint.java:192)
org.eclipse.jetty.io.AbstractEndPoint.close(AbstractEndPoint.java:175)
org.eclipse.jetty.io.AbstractConnection.close(AbstractConnection.java:254)
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:258)
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:543)
org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:398)
org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:161)
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)

In the normal state, a Jetty (winstone) thread looks like this:

Jetty (winstone)-14
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.reservedWait(ReservedThreadExecutor.java:304)
org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
java.lang.Thread.run(Thread.java:748)

@olamy
Member

olamy commented Mar 26, 2020

As this is happening when using Jenkins, which has a layer on top of Jetty, we need more details:

  • command line/parameters used to start Jenkins

Are you really running a Jenkins server on OSX?

@satyagv557
Author

Yes, we run Jenkins on OSX.

We start Jenkins with the following arguments:
/usr/bin/java -Xms8g -Xmx8g -XX:+UseG1GC -Djava.net.preferIPv4Stack=true -Djavax.net.ssl.keyStore=.keystore/cacert -Djavax.net.ssl.keyStorePassword=XXXX -jar "$war" --httpsKeyStore=****.jks --httpsKeyStorePassword=XXXXXX --httpsPort=8443 --httpPort=-1

@gregw
Contributor

gregw commented Mar 26, 2020

Does it say what the thread is doing at sun.nio.ch.FileDispatcherImpl.preClose0(Native Method)? Is it blocked trying to get a lock?

Googling for that, I do see a few reports of similar-sounding issues... Can you have a look and tell us which are most similar?

@satyagv557
Author

sun.nio.ch.FileDispatcherImpl.preClose0(Native Method) is in the RUNNABLE state, but the thread never finishes closing; the stack trace is below. Similar threads keep accumulating and never close.
Jetty (winstone)-66537
sun.nio.ch.FileDispatcherImpl.preClose0(Native Method)
sun.nio.ch.SocketDispatcher.preClose(SocketDispatcher.java:59)
sun.nio.ch.SocketChannelImpl.implCloseSelectableChannel(SocketChannelImpl.java:838)
java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel(AbstractSelectableChannel.java:234)
java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:115)
org.eclipse.jetty.io.ChannelEndPoint.doClose(ChannelEndPoint.java:197)
org.eclipse.jetty.io.AbstractEndPoint.doOnClose(AbstractEndPoint.java:220)
org.eclipse.jetty.io.AbstractEndPoint.close(AbstractEndPoint.java:192)
org.eclipse.jetty.io.AbstractEndPoint.close(AbstractEndPoint.java:175)
org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.doClose(SslConnection.java:1366)
org.eclipse.jetty.io.AbstractEndPoint.doOnClose(AbstractEndPoint.java:220)
org.eclipse.jetty.io.AbstractEndPoint.close(AbstractEndPoint.java:192)
org.eclipse.jetty.io.AbstractEndPoint.close(AbstractEndPoint.java:175)
org.eclipse.jetty.io.AbstractConnection.close(AbstractConnection.java:254)
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:258)
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:543)
org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:398)
org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:161)
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)

@gregw
Contributor

gregw commented Mar 26, 2020

So it is runnable but not progressing, and not waiting for a lock? It must then be spinning, which would explain your high CPU spike.
This looks like some OS or JVM issue that we are tickling. Can you try different versions of the JVM to see if that changes the issue?

@satyagv557
Author

We tried two different JVMs, JDK 1.8.0_192 and JDK 1.8.0_241 (the latest), and we see the issue with both versions. For some time the CPU runs stable; once these winstone threads start backing up, the CPU starts spiking.

@sbordet
Contributor

sbordet commented Mar 26, 2020

@satyagv557 can you please try with JDK 11 or later? This is because the socket layer has been rewritten in JDK 11 and made more stable.

@satyagv557
Author

I can try that, but this issue never happened until we updated to Jenkins v2.219. We thought it was happening because of issue #4537, and to fix that we updated to Jenkins v2.225, which comes with Jetty v9.47, but the issue still remains. Do you think this is a Jetty issue?

@gregw
Contributor

gregw commented Mar 26, 2020

@satyagv557 we are not saying that changes in Jetty have not indirectly caused the problem, just that we think it is an OS and/or JVM bug that is being tickled by the changes we have made.
Ultimately we think it is a JVM bug because, even if we wanted to, we couldn't deliberately write code that would spin forever within Channel.close.

@joakime
Contributor

joakime commented Mar 26, 2020

We're running winstone (Jenkins 2.226) on the CI machine that does the Jetty builds.
We don't see CPU spinning on that machine.

@satyagv557
Author

I am happy to do the Java upgrade, but there are several reported bugs with 11 and I am worried about seeing new issues. Is there any way we can confirm what is causing this issue? If so, can you suggest how we can determine whether the issue is on the JVM side, in Jetty, or in the OS? Thanks.

@satyagv557
Author

satyagv557 commented Mar 27, 2020

Here is a reference application where they saw the same issue and fixed it: akka/akka#25605. It might be helpful.

@satyagv557
Author

"Jetty (winstone)-290959" Id=290959 Group=main RUNNABLE (in native)
at sun.nio.ch.FileDispatcherImpl.preClose0(Native Method)
at sun.nio.ch.SocketDispatcher.preClose(SocketDispatcher.java:59)
at sun.nio.ch.SocketChannelImpl.implCloseSelectableChannel(SocketChannelImpl.java:838)
- locked java.lang.Object@6ba7b639
at java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel(AbstractSelectableChannel.java:234)
at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:115)
- locked java.lang.Object@35f06432

at org.eclipse.jetty.io.ChannelEndPoint.doClose(ChannelEndPoint.java:197)
at org.eclipse.jetty.io.AbstractEndPoint.doOnClose(AbstractEndPoint.java:220)
at org.eclipse.jetty.io.AbstractEndPoint.close(AbstractEndPoint.java:192)
at org.eclipse.jetty.io.AbstractEndPoint.close(AbstractEndPoint.java:175)
at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.doClose(SslConnection.java:1366)
at org.eclipse.jetty.io.AbstractEndPoint.doOnClose(AbstractEndPoint.java:220)
at org.eclipse.jetty.io.AbstractEndPoint.close(AbstractEndPoint.java:192)
at org.eclipse.jetty.io.AbstractEndPoint.close(AbstractEndPoint.java:175)
at org.eclipse.jetty.io.AbstractConnection.close(AbstractConnection.java:254)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:258)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:543)
at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:398)
at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:161)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
at java.lang.Thread.run(Thread.java:748)

@satyagv557
Author

@gregw @sbordet: can you please take a look at the above stack trace? Thanks.

@gregw
Contributor

gregw commented Mar 30, 2020

@sbordet @lorban FYI, Akka worked around this problem with akka/akka#25923. I think the essence of it is to call key.cancel() before channel.close().
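
A minimal sketch of that ordering, assuming a plain NIO selector setup rather than Jetty's actual ChannelEndPoint code (the class and method names here are illustrative only):

import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;

// Hedged sketch of the Akka-style workaround described above, not Jetty source:
// deregister the key from the selector first, then close the channel, so the selector
// can no longer wake up for a file descriptor whose close attempt failed part-way.
final class CancelThenClose
{
    static void close(SelectionKey key, SocketChannel channel)
    {
        if (key != null)
            key.cancel();      // 1. cancel the selection key first
        try
        {
            channel.close();   // 2. then close the channel (preClose0/dup2 happens in here on JDK 8)
        }
        catch (IOException e)
        {
            // Ignored here; note that swallowing this is exactly the file-descriptor-leak
            // concern raised later in this thread.
        }
    }
}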

@gregw
Contributor

gregw commented Mar 30, 2020

@satyagv557 is that stack trace from Java 11?

@lorban
Contributor

lorban commented Apr 28, 2020

Here's my theory of what's happening, based on the information at hand:

Jetty's ChannelEndPoint.doClose() calls _channel.close() which eventually calls the FileDispatcherImpl.preClose0() native method.
The native method reads the file descriptor of the socket, and calls dup2 from an internally saved file descriptor onto the socket's; that internal fd was created earlier on by calling socketpair, closing one fd and keeping the other one around. That's all it does; there's no loop of any sort in this method. If the dup2 syscall succeeds, the socket's file descriptor now points to a socket whose peer has been closed. But if the dup2 syscall fails, it throws IOException.

I suspect that under certain rare circumstances, dup2 fails and FileDispatcherImpl.preClose0() throws IOException, which is what I think is happening here.

When IOException is thrown, Jetty's ChannelEndPoint.doClose() catches it and logs at debug level.

My theory is that since the file descriptor wasn't closed, the selector is going to be notified again that it can be closed, Jetty tries to close it again, dup2 fails again and it all starts over, creating an infinite loop.

Assuming my theory is correct, explicitly cancelling the key before closing the channel would explain why the problem goes away as that makes sure the selector won't wake up again for the unclosed file descriptor.

To confirm it, I would need to know if FileDispatcherImpl.preClose0() does indeed throw an IOException, and if so, what the reported error message is, since that message is based on errno. Maybe just enabling debug logs on org.eclipse.jetty.io.ChannelEndPoint and redirecting those logs to a specific file would generate little enough traffic to be manageable?

We could call _key.cancel() before calling _channel.close() to make the problem go away, but I fear that might hide an OS resource leak as I'm not certain the socket would eventually be closed. Maybe that would not happen, or rarely enough that it wouldn't be too important?
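
To make the suspected feedback loop concrete, here is a stripped-down selector loop that exhibits it. This is only an illustration of the theory above with invented names, not Jetty's EatWhatYouKill/ChannelEndPoint code:

import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Illustration of the suspected loop: if close() throws before the fd is really closed
// and the key is never cancelled, the selector keeps reporting the same endpoint as
// ready and the close is retried forever, pinning a CPU core.
final class CloseRetryLoop
{
    static void run(Selector selector)
    {
        while (true)
        {
            try
            {
                selector.select();                        // wakes up for the half-closed fd
                for (SelectionKey key : selector.selectedKeys())
                {
                    try
                    {
                        key.channel().close();            // preClose0() -> dup2() fails -> IOException
                    }
                    catch (IOException e)
                    {
                        // swallowed (Jetty logs it at debug); the fd stays open and the key
                        // stays registered, so the next select() returns immediately again
                    }
                }
                selector.selectedKeys().clear();
            }
            catch (IOException e)
            {
                return; // the selector itself failed; give up
            }
        }
    }
}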

@lorban
Contributor

lorban commented Apr 28, 2020

Oh, and assuming Jenkins won't serve a ton of traffic, maybe switching the SelectorProvider to a different implementation would make the problem go away? Since the problem seems to be somewhere between the JVM and the OS, this has a decent chance of working around it.

If you want to try this out, just set the java.nio.channels.spi.SelectorProvider system property to sun.nio.ch.PollSelectorProvider on the command line.
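
To make that concrete, a minimal sketch follows (the class is illustrative; in practice the property must be set before any NIO classes are touched, which is why passing -Djava.nio.channels.spi.SelectorProvider=sun.nio.ch.PollSelectorProvider on the java command line is the reliable way):

import java.nio.channels.spi.SelectorProvider;

// Hedged sketch of the workaround: the programmatic equivalent of the -D flag above.
// It only has an effect if it runs before the first Selector/Channel is created.
public final class SelectorProviderWorkaround
{
    public static void main(String[] args)
    {
        System.setProperty("java.nio.channels.spi.SelectorProvider",
                           "sun.nio.ch.PollSelectorProvider");

        // Sanity check: print the provider class that will actually be used.
        System.out.println(SelectorProvider.provider().getClass().getName());
    }
}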

@michaelkwan

Just want to chime in on this issue. We ran into this as well; changing the selector provider as @lorban suggested works. However, it has many impacts on NIO, and a real fix is needed in the Jetty library.

@lorban
Contributor

lorban commented May 5, 2020

@michaelkwan Thanks for trying out the suggested workaround. Could you please be more explicit about the NIO impacts this has had in your experience?

As far as I can tell, this seems to be caused by a bug somewhere between the JVM and the OS, and there is little Jetty could do to solve it.

If you could collect the debug logs of org.eclipse.jetty.io.ChannelEndPoint as I mentioned above while the problem occurs, we could maybe come up with a solution, but so far my suspicion is that any workaround Jetty may implement (like the one Akka put in place) will result in a file descriptor leak, which would just defer the problem to a later time, and make it much more complex to trace back.
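
If it helps, here is one hedged way to capture just that logger to its own file, assuming the deployment bridges Jetty's logging to java.util.logging (an assumption about the setup, not something stated in this thread; the log file path is illustrative):

import java.util.logging.FileHandler;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

// Sketch: raise org.eclipse.jetty.io.ChannelEndPoint to DEBUG (FINE in java.util.logging
// terms) and write its output to a dedicated file so the volume stays manageable.
public final class ChannelEndPointDebugLog
{
    public static void configure() throws Exception
    {
        Logger logger = Logger.getLogger("org.eclipse.jetty.io.ChannelEndPoint");
        logger.setLevel(Level.FINE);

        FileHandler handler = new FileHandler("/tmp/channel-endpoint-debug.log", true);
        handler.setLevel(Level.FINE);
        handler.setFormatter(new SimpleFormatter());
        logger.addHandler(handler);
    }
}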

@joakime
Contributor

joakime commented Aug 13, 2020

Akka reports: "JDK 11: Async channel closing semantics changed"

akka/akka#25913

This is something to confirm against our behavior as well.

@github-actions

This issue has been automatically marked as stale because it has been a
full year without activity. It will be closed if no further activity occurs.
Thank you for your contributions.

@github-actions github-actions bot added the Stale (For auto-closed stale issues and pull requests) label Feb 28, 2022
@sbordet sbordet removed the Stale (For auto-closed stale issues and pull requests) label Mar 2, 2022
@GaryWKeim

We're seeing this happen consistently with Terracotta BigMemory 4.x (Jetty 9.4.45.v20220203) on MacOS (Monterey 12.3.1) with the newest Zulu-8 (1.8.0_322) but not in any newer JDK versions.

@joakime
Contributor

joakime commented Apr 21, 2022

@GaryWKeim can you enable debug logging temporarily and capture some activity? (maximum of 5 minutes).
That should indicate what's consuming the CPU.
Bonus if you can confirm the CPU usage remains high even when there are no active connections.

@GaryWKeim

Also, when this occurs, the process becomes a zombie that can't be terminated in any way and a reboot is necessary to free up any files/ports the zombie is holding open:

% ps auxww | grep "(java)"
% gkeim 37381 0.0 0.0 0 0 ?? ?E 8:18PM 0:00.00 (java)

What packages would you like me to log at DEBUG?

@joakime
Contributor

joakime commented Apr 22, 2022

Log at DEBUG level on the named logger org.eclipse.jetty

@lorban
Contributor

lorban commented Apr 22, 2022

@GaryWKeim Have you tried the workaround I mentioned earlier, i.e.: set the java.nio.channels.spi.SelectorProvider system property to sun.nio.ch.PollSelectorProvider?

@GaryWKeim

Yes, the work-around seems to work.

@lorban
Contributor

lorban commented Apr 22, 2022

Your description seems to confirm what I previously stated: this seems to be caused by a bug somewhere between the JVM and the OS, and there is little Jetty can do to solve it.

I'm afraid your only options are either to use the workaround or a JDK more recent than 1.8.

@GaryWKeim

I was only able to get some logging after the problem occurred. Maybe it will help:
jetty-4717.zip

lorban added a commit that referenced this issue Apr 26, 2022
Signed-off-by: Ludovic Orban <lorban@bitronix.be>
@lorban
Contributor

lorban commented Apr 26, 2022

@GaryWKeim since you seem to be able to reproduce this problem, maybe you could test a candidate fix?

I've created a branch which contains a change that should/could solve this problem: https://github.com/eclipse/jetty.project/tree/jetty-9.4.x-4717-macos-high-cpu

I'd be glad if you could build it, try it out to see if the problem goes away AND make sure we're not leaking file descriptors with partially-closed sockets.

Thanks!

@GaryWKeim

After pounding on this for several days now using your branch, I have not been able to reproduce the problem and I see no leaking file descriptors.

@lorban lorban linked a pull request May 3, 2022 that will close this issue
lorban added a commit that referenced this issue May 3, 2022
Signed-off-by: Ludovic Orban <lorban@bitronix.be>
lorban added a commit that referenced this issue May 4, 2022
…t pass a null key to the ctor

Signed-off-by: Ludovic Orban <lorban@bitronix.be>
lorban added a commit that referenced this issue May 11, 2022
Signed-off-by: Ludovic Orban <lorban@bitronix.be>
lorban added a commit that referenced this issue May 11, 2022
…t pass a null key to the ctor

Signed-off-by: Ludovic Orban <lorban@bitronix.be>
@github-actions

This issue has been automatically marked as stale because it has been a
full year without activity. It will be closed if no further activity occurs.
Thank you for your contributions.

@github-actions github-actions bot added the Stale (For auto-closed stale issues and pull requests) label Apr 29, 2023
@sbordet sbordet closed this as completed Apr 29, 2023
@AIUPFK1

AIUPFK1 commented May 3, 2023

@sbordet May I know the final workaround?

@sbordet
Contributor

sbordet commented May 3, 2023

@sbordet May I know the final workaround?

Update Jetty, or avoid using Mac OS 😃
The fix from @lorban reported in the comments seemed to be enough.

@joakime
Contributor

joakime commented May 3, 2023

Note: Jetty 9.x is now at End of Community Support

See:

The fix from @lorban was merged and is present in all currently released versions of Jetty 9 through Jetty 12.

@joakime joakime added the Enhancement and Mac OSX (For issues unique to Mac OSX) labels and removed the Help Wanted, More Info Required, and Stale (For auto-closed stale issues and pull requests) labels May 3, 2023