OutOfMemoryError caused by CyclicTimeouts #10120
Comments
I do not understand how the profiler shows the client side here. Can you please detail a little more what you are doing, how long the proxy runs, whether the 5-minute timeout expires or not, etc.?
It's possible this is caused by the addIdleTimeoutListener chain referencing the previous idle timeout.
I was working with @jsalts on this. Here's a GC root path I found in the heap dump when memory was near full (I trimmed a lot of the middle references out). I think that reference list comes from org.eclipse.jetty.server.internal.HttpChannelState.ChannelRequest#addIdleTimeoutListener adding an idle listener on each request, which generates a lambda referencing the previous idle listener. I'm assuming the same HttpChannelState is reused across multiple requests on a persistent connection, so the lambda reference chain just keeps growing.

My guess is that the server is keeping the connection open and reusing it, and over time something keeps accumulating in the open connection state (e.g. HttpChannelState). If you stop the test client, which closes the connections, the memory is eventually cleaned up on the server. If you run the same test client with a "Connection: close" header, the memory issue doesn't appear.

I think we pointed to the ProxyServlet originally because we see the OOM more when requests go through the proxy: the default Jetty HttpClient doesn't appear to set a Connection header, which defaults to keep-alive. Our other clients likely close the connection more often, allowing the remote server to clean up the state. But as @jsalts said, we can now reproduce it without the ProxyServlet by doing a lot of requests with no Connection header (i.e. keep-alive).

Thanks,
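To make the suspected mechanism concrete, here is a minimal sketch of how composing listeners with lambdas builds an ever-growing reference chain when the holder object is reused across requests. The class and method names are only analogies for `HttpChannelState.ChannelRequest#addIdleTimeoutListener` and the `_onIdleTimeout` field, not Jetty's actual code:

```java
import java.util.function.BooleanSupplier;

// Minimal sketch of a listener-chain leak; the names are hypothetical analogies,
// not Jetty's real implementation.
class ChannelStateSketch
{
    // Analogous to HttpChannelState._onIdleTimeout: holds the composed idle-timeout listener.
    private BooleanSupplier onIdleTimeout;

    // Called once per request: each call wraps the previous listener in a new lambda,
    // so the chain grows by one link per request handled on the same connection.
    void addIdleTimeoutListener(BooleanSupplier listener)
    {
        BooleanSupplier previous = onIdleTimeout;
        onIdleTimeout = previous == null
            ? listener
            // The new lambda captures 'previous', keeping every earlier listener reachable.
            : () -> previous.getAsBoolean() | listener.getAsBoolean();
    }

    // Resetting the field here breaks the chain between requests; forgetting to do so
    // (as described in the fix below) lets it grow for the lifetime of a kept-alive connection.
    void recycle()
    {
        onIdleTimeout = null;
    }
}
```

With the reset in place, each request starts from an empty field; without it, every lambda created for every prior request on the connection stays reachable until the connection finally closes, which matches memory only being reclaimed after the test client disconnects.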
When `HttpChannelState._onIdleTimeout` was introduced, it was forgotten to null it out in recycle(). Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
Thanks for the feedback!
@sbordet Following up on this: it seems like the patch helped, but we're now OOMing over 1-2 days instead of 4 hours. At first glance the hprof shows a similar state as above. I've attached a memory-after-garbage-collection graph; it looks like the heap grows for about 2 days before flattening out at our -Xmx limit and throwing an OOM exception. Will update if we get more information.

Retested the patch again and it definitely looks more stable than it was before. I'll leave the test running overnight and see what happens.
@jsalts thanks for this additional report. Please let us know if you have a heap dump we can look at, or your analysis of this issue.
I just started to look into this and it looks like it might be an issue on the proxy side. Maybe this is part of the issue we originally saw when we blamed the proxy. Here are the initial screenshots from the heap dump. I'm going to try to recreate the issue locally today. From a first glance, it looks like maybe the org.eclipse.jetty.client.transport.HttpConnection#requestTimeouts are not cleared when a client connection is reused in the proxy?
I am not sure I understand the image above for the GC root: is it a single instance that may actually be legitimate? How many connections do you have? Is this heap dump taken after load, at rest, with no requests being processed?
There are a lot of instances of that similar GC root in the dump. Looking for just the CyclicTimeout.Wakeup, I can see 4871 instances in the dump, which seems odd. Our "timeout" value on the outbound proxy requests is 5 minutes, and as far as we know all the requests were completing successfully; that is, none of them were actually lasting the full 5 minutes and timing out. If I read the heap dump above correctly, the scheduled executor queue had 4872 elements in it, which aligns with the 4871 Wakeups in the dump.

The requests are pretty regular from another component in our system. I included a graph of the requests leading up to the OOM at 02:36. The requests should only be coming from a couple of external components, so I would guess the actual number of active connections to be very low (e.g. < 5). The service that had the OOM is not exposed to the public and only services a handful of other internal components. The heap dump was taken automatically by the JVM when the OOM occurred, so it was captured under request load after the process had been running for about 2 days (Friday through Sunday morning).

I've tried a bunch of different scenarios locally but I haven't been able to recreate the issue yet. I enabled debug logging on CyclicTimeouts and CyclicTimeout to see if that gives any clues in the environment where we saw the OOM. I'll update if I get any more info from that.

It seems like it might be a CyclicTimeout Wakeup leak, but looking at the code I'm not sure how that can happen yet, because it looks like the Wakeups only schedule themselves in the executor if an earlier one arrives. So you'd think that if we only had a handful of connections there should only be a handful of Wakeups in the executor queue. It's possible this is an issue on our side generating tons of independent connections, but I haven't seen any evidence of that yet.
@mpilone looking into this.
Thanks, I arrived at the same conclusion, but I was looking at Jetty 10, which has similar but slightly different issues. Thank you again for the detailed report.
Fixed handling of Expirable.getExpireNanoTime() in case it returns Long.MAX_VALUE. Also fixed implementations of Expirable that were not initializing their expireNanoTime field to Long.MAX_VALUE. Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
I confirm that in Jetty 10 we do not handle well the case where `Expirable.getExpireNanoTime()` returns `Long.MAX_VALUE`. However, in 12 we also do not handle this case well, and in addition we leak. The difference is how we initialize the `expireNanoTime` field. PR is #10148, but it is for Jetty 10. Thanks!
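For illustration only, here is a heavily simplified sketch of the shape of the problem described in the commit above; it is not Jetty's actual `CyclicTimeouts`/`CyclicTimeout` implementation, just a model of how an unhandled `Long.MAX_VALUE` expire time (or an `Expirable` that never initializes its `expireNanoTime` field to `Long.MAX_VALUE`) can turn into wakeup tasks that sit in the scheduler queue essentially forever:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Heavily simplified model (not Jetty's code) of scheduling a wakeup for an Expirable.
class TimeoutSchedulerSketch
{
    interface Expirable
    {
        // By convention, Long.MAX_VALUE means "no expiry scheduled".
        long getExpireNanoTime();
    }

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void schedule(Expirable entity)
    {
        long expireNanoTime = entity.getExpireNanoTime();

        // The fix: an entity that never expires must not produce a wakeup at all.
        // Without this check, the code below submits a task with an astronomically
        // long delay. Such a task never fires and is never removed from the
        // scheduler's queue, so one can accumulate per request -- consistent with
        // the thousands of Wakeup instances observed in the heap dump.
        if (expireNanoTime == Long.MAX_VALUE)
            return;

        long delay = Math.max(expireNanoTime - System.nanoTime(), 0);
        scheduler.schedule(this::onWakeup, delay, TimeUnit.NANOSECONDS);
    }

    private void onWakeup()
    {
        // Expire overdue entities and schedule the next wakeup (omitted in this sketch).
    }
}
```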
@mpilone the fix has been merged into the 12.0.x branch.
Thanks, @sbordet. We applied the patch yesterday and we're no longer seeing the very long-delay Wakeups being scheduled, so that's promising. We'll update to the latest 12.0.x HEAD and let it run for a few days to see how it behaves.
@mpilone thanks for the feedback. Let us know how it's going in a few days.
After 48 hours the number of Wakeups in the queue still looks good. Thanks for the fix.
Hi, while debugging a probably related issue, we noticed that there is now a difference in the code regarding the fix #10148.
We are having issues with a heavily increasing heap size in Jetty 12.0.6 and are wondering if this could still be related to this issue. Thanks for looking into it again!
@jomapp the rollover in Jetty 12 is fine, as we deal with nanoTimes.
@sbordet Thank you for your fast response!
Yes, but this is fine because it's a nanoTime. In Jetty 12 the codebase is slightly different, so the fix is adapted to fit the Jetty 12 codebase.
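For reference, a tiny standalone example of the nanoTime comparison idiom being referred to; this follows the `System.nanoTime()` javadoc guidance and is not a quote of Jetty's code:

```java
import java.util.concurrent.TimeUnit;

public class NanoTimeComparison
{
    public static void main(String[] args)
    {
        long deadline = System.nanoTime() + TimeUnit.MINUTES.toNanos(5);

        // Correct: compare via subtraction, which stays valid even if nanoTime
        // wraps around Long.MAX_VALUE between the two readings.
        boolean expired = System.nanoTime() - deadline >= 0;
        System.out.println("expired = " + expired);

        // Broken near the rollover point: a direct magnitude comparison.
        // boolean expired = System.nanoTime() >= deadline;
    }
}
```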
Jetty version(s)
12.0.0.beta3
Java version/vendor
openjdk 17.0.3 2022-04-19 LTS
OS type/version
linux
Description
This is mostly a shot in the dark at this point. We're using ProxyServlet with a five-minute timeout and wondering if it's related to the OOM.
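For context, a minimal sketch of the kind of setup involved. This is illustrative only, not the actual deployment code; the package names and the `timeout` init parameter reflect our understanding of Jetty 12's ee10 proxy module and may differ from the real configuration:

```java
import java.util.concurrent.TimeUnit;

import org.eclipse.jetty.ee10.proxy.ProxyServlet;          // package names assumed for Jetty 12 / ee10
import org.eclipse.jetty.ee10.servlet.ServletContextHandler;
import org.eclipse.jetty.ee10.servlet.ServletHolder;
import org.eclipse.jetty.server.Server;

public class ProxySetupSketch
{
    public static void main(String[] args) throws Exception
    {
        Server server = new Server(8080);

        ServletContextHandler context = new ServletContextHandler();
        context.setContextPath("/");

        ServletHolder proxyHolder = new ServletHolder(ProxyServlet.class);
        // Total exchange timeout of five minutes, in milliseconds.
        proxyHolder.setInitParameter("timeout", String.valueOf(TimeUnit.MINUTES.toMillis(5)));
        context.addServlet(proxyHolder, "/*");

        server.setHandler(context);
        server.start();
        server.join();
    }
}
```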