
OutOfMemoryError caused by CyclicTimeouts #10120

Closed
jsalts opened this issue Jul 18, 2023 · 26 comments · Fixed by #10126 or #10148
Labels
Bug For general bugs on Jetty side

Comments

@jsalts

jsalts commented Jul 18, 2023

Jetty version(s)
12.0.0.beta3

Java version/vendor (use: java -version)
openjdk 17.0.3 2022-04-19 LTS

OS type/version
linux

Description

This is mostly a shot in the dark at this point. We're using ProxyServlet with a five minute timeout and wondering if it's related.

[profiler screenshot]
@jsalts jsalts added the Bug For general bugs on Jetty side label Jul 18, 2023
@sbordet
Contributor

sbordet commented Jul 18, 2023

I do not understand how the profiler shows the client-side exchange having a reference to o.e.j.server.internal.HttpConnection, as there is no link between the two.

Can you please give a little more detail about what you are doing, how long the proxy runs, whether the 5 minute timeout expires or not, etc.?

@jsalts
Author

jsalts commented Jul 18, 2023

We might have ruled out the proxy. We're able to reproduce it with this type of test case.

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class Scratch {
    // Uses the standard Java HttpClient
    public static void main(String[] args) throws URISyntaxException, IOException, InterruptedException {

        System.setProperty("jdk.httpclient.allowRestrictedHeaders", "connection");

        java.net.http.HttpClient client = java.net.http.HttpClient.newBuilder()
                .version(java.net.http.HttpClient.Version.HTTP_1_1)
                .build();

        for (int i = 0; i < 100_000; ++i) {
            var request = HttpRequest.newBuilder()
                    .header("Accept", "application/json")
                    .header("Content-Type", "application/json")
                    // Sending no Connection header, which is the same as keep-alive in Http 1.1,
                    // seems to result in OOM on the remote side.
//                    .header("Connection", "close")
                    .uri(new URI("http://localhost:8033/app/endpoint))
                    .POST(HttpRequest.BodyPublishers.ofString("{\"ids\":[6961806]}", StandardCharsets.UTF_8))
                    .build();

            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

            int status = response.statusCode();
            if (status != 200) {
                System.out.println("ERROR: " + status);
            }

            if (i % 100 == 0) {
                System.out.println(i);
            }

            // 200 requests/second
            Thread.sleep(5);
        }
    }
}
[screenshot]

@jsalts
Author

jsalts commented Jul 18, 2023

It's possible this is caused by the addIdleTimeoutListener chain referencing the previous idle timeout listener.

@mpilone

mpilone commented Jul 18, 2023

I was working with @jsalts on this. Here's a GC root path I found in the heap dump when memory was near full (I trimmed a lot of the middle references out).

[screenshots: GC root path from the heap dump]

I think that reference list comes from org.eclipse.jetty.server.internal.HttpChannelState.ChannelRequest#addIdleTimeoutListener adding an idle listener on each request, which generates a lambda referencing the previous idle listener. I'm assuming the same HttpChannelState is reused for multiple requests on a persistent connection, so the lambda reference chain just keeps growing.

My guess is that the server is keeping the connection open and reusing it, and over time something keeps accumulating in the open connection state (e.g. HttpChannelState). If you stop the test client, which closes the connections, the memory is eventually cleaned up on the server.
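To make the suspected mechanism concrete, here is a minimal sketch (my own illustrative code, not Jetty's actual implementation; the class and field names are made up) of how wrapping the previous listener in a new lambda on each request builds an ever-growing chain when the channel state is reused for a persistent connection and never reset:

import java.util.function.Consumer;

// Illustrative only: mimics the shape of a per-request listener chain.
class ChannelStateSketch {
    private Consumer<Throwable> onIdleTimeout; // analogous to an _onIdleTimeout field

    void addIdleTimeoutListener(Consumer<Throwable> listener) {
        Consumer<Throwable> previous = onIdleTimeout;
        // The new lambda captures 'previous', so each registration keeps the
        // whole chain of earlier listeners (and whatever they capture) reachable.
        onIdleTimeout = previous == null
                ? listener
                : failure -> { previous.accept(failure); listener.accept(failure); };
    }

    void recycle() {
        // If this reset is missing, the chain survives across requests on the
        // same connection and memory grows until the connection is closed.
        onIdleTimeout = null;
    }
}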

If you run the same test client with a "Connection: close" header, the memory issue doesn't appear.

[screenshot]

I think we pointed to the ProxyServlet originally because we see the OOM more when requests go through the proxy: the default Jetty HttpClient doesn't appear to set a Connection header, which defaults to keep-alive. Our other clients likely close the connection more often, allowing the remote server to clean up the state. But as @jsalts said, we can now reproduce it without the ProxyServlet by making a lot of requests with no Connection header (i.e. keep-alive).

Thanks,
-mike

sbordet added a commit that referenced this issue Jul 18, 2023
When `HttpChannelState._onIdleTimeout` was introduced, it was forgotten to null it out in recycle().

Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
@sbordet sbordet linked a pull request Jul 18, 2023 that will close this issue
@sbordet
Contributor

sbordet commented Jul 18, 2023

@jsalts @mpilone thanks for the detailed report.

Can you try #10126 and see if it solves this issue?

@jsalts
Author

jsalts commented Jul 18, 2023

Looks better to me.

Before:
[screenshot]

After:

[screenshot]

sbordet added a commit that referenced this issue Jul 18, 2023
When `HttpChannelState._onIdleTimeout` was introduced, it was forgotten to null it out in recycle().

Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
@sbordet
Contributor

sbordet commented Jul 18, 2023

Thanks for the feedback!

@sbordet sbordet closed this as completed Jul 18, 2023
@jsalts
Author

jsalts commented Jul 24, 2023

@sbordet Following up on this, it seems like the patch helped, but we're now OOMing over 1-2 days instead of 4 hours. At first glance the hprof shows a similar state to the one above. I've attached a memory-after-garbage-collection graph, and it looks like the heap grows for about 2 days before flattening out at our -Xmx limit and throwing an OOM exception.

Will update if we get more information.

[graph: memory after garbage collection]

Retested the patch again and it definitely looks more stable than it was before. I'll leave the test running overnight and see what happens.

[screenshot]

@sbordet
Contributor

sbordet commented Jul 24, 2023

@jsalts thanks for this additional report.

Please let us know if you have a heap dump we can look at, or your analysis of this issue.

@sbordet sbordet reopened this Jul 24, 2023
@mpilone

mpilone commented Jul 24, 2023

I just started to look into this, and it looks like it might be an issue on the proxy side. Maybe this is part of the issue we originally saw when we blamed the proxy. Here are the initial screenshots from the heap dump. I'm going to try to recreate the issue locally today. At first glance, it looks like the org.eclipse.jetty.client.transport.HttpConnection#requestTimeouts might not be cleared when a client connection is reused in the proxy.

[screenshots from the heap dump]

@sbordet
Contributor

sbordet commented Jul 24, 2023

I am not sure I understand the image above for the GC root -- is it one instance that may actually be a legitimate one?

How many connections do you have?

Is this heap dump taken after load, at rest, with no requests being processed?

@mpilone

mpilone commented Jul 25, 2023

There are a lot of instances of that similar GC root in the dump. Looking for just the CyclicTimeout.Wakeup, I can see 4871 instances in the dump, which seems odd. Our "timeout" value on the outbound proxy requests is 5 minutes, and as far as we know all the requests were completing successfully. That is, none of them were actually lasting the full 5 minutes and timing out. If I read the heap dump above correctly, the scheduled executor queue had 4872 elements in it, which aligns with the 4871 Wakeups in the dump.

The requests are pretty regular from another component in our system. I included a graph of the requests leading up to the OOM at 02:36. The requests should only be coming from a couple of external components so I would guess the actual number of active connections to be very low (e.g. < 5). The service that had the OOM is not exposed to the public and only services a handful of other internal components.

The heap dump was taken automatically by the JVM when the OOM occurred so it would be during the requests/load after the process was running for about 2 days (Friday through Sunday morning).

I've tried a bunch of different scenarios locally, but I haven't been able to recreate the issue yet. I enabled debug logging on CyclicTimeouts and CyclicTimeout to see if that gives any clues in the environment where we saw the OOM. I'll update if I get any more info from that.

It seems like it might be a CyclicTimeout Wakeup leak, but looking at the code I'm not sure yet how that can happen, because it looks like the Wakeups only reschedule themselves in the executor if an earlier one arrives. So you'd think that if we only had a handful of connections, there should only be a handful of Wakeups in the executor queue. It's possible this is an issue on our side generating tons of independent connections, but I haven't seen any evidence of that yet.

-mike
[graphs: Wakeup count, connection instance count, requests to proxy]

@mpilone

mpilone commented Jul 25, 2023

Just after posting that last comment I went to check if we got any new info from the CyclicTimeout logging. I see a lot of Wakeups getting scheduled, but even more oddly, I see some really long Wakeups being scheduled. For example, 9208175844786ms. I don't know what could cause that yet but I could see that leading to a buildup of Wakeups in the executor queue over time.
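For scale, a quick back-of-the-envelope check (my own arithmetic, not taken from Jetty) shows these delays are the same order of magnitude as Long.MAX_VALUE expressed in milliseconds:

public class WakeupDelaySketch {
    public static void main(String[] args) {
        // Long.MAX_VALUE nanoseconds expressed in milliseconds:
        System.out.println(Long.MAX_VALUE / 1_000_000L); // 9223372036854 ms, roughly 292 years
        // A logged delay like 9208175844786 ms is consistent with that value
        // minus the JVM's current nanoTime, i.e. a delay derived from Long.MAX_VALUE.
    }
}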

I included two screenshots. The first shows all the Wakeups being scheduled over the last hour. A mix of our 5 minute timeouts and the really long timeouts. The second is filtered to show only the really long timeouts. There are a lot of them for a one hour period.

[screenshots: all Wakeups; really long Wakeups]

@mpilone

mpilone commented Jul 25, 2023

I'm wondering if the issue is in org.eclipse.jetty.io.CyclicTimeouts#onTimeoutExpired. I see two potential problems (but this is my first time looking at that code). The first is the use of earliestTimeout: it is set to "now + Long.MAX_VALUE", but it is only an AtomicLong. Isn't that going to cause a rollover error?

The second is that the line "long expiresAt = expirable.getExpireNanoTime();" will ask the HttpChannel for the expiration time. The logic in HttpChannel,

return exchange != null ? exchange.getExpireNanoTime() : Long.MAX_VALUE;

is to return Long.MAX_VALUE if there's no exchange. Later in onTimeoutExpired, it calculates

earliest = Math.min(earliest, NanoTime.elapsed(now, expiresAt));

and then schedules a timeout if (earliest != Long.MAX_VALUE). But earliest won't equal Long.MAX_VALUE if the code just calculated it as the elapsed time from now to expiresAt (which is Long.MAX_VALUE).

Again, I'm just reading through this code for the first time, but I think it can lead to Wakeups being scheduled roughly Long.MAX_VALUE - now in the future if a channel doesn't have an exchange when the timeout expires.
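Here is a small standalone sketch (my own simplified code with made-up names, not the actual CyclicTimeouts source) of why an Expirable reporting Long.MAX_VALUE slips past that guard:

import java.util.List;

public class CyclicTimeoutsBugSketch {
    // Simplified version of the earliest-delay computation described above.
    static long earliestDelay(long now, List<Long> expireNanoTimes) {
        long earliest = Long.MAX_VALUE;
        for (long expiresAt : expireNanoTimes) {
            // When expiresAt == Long.MAX_VALUE (no exchange), the subtraction
            // yields a huge but finite delay, so the guard below still passes.
            earliest = Math.min(earliest, expiresAt - now);
        }
        return earliest;
    }

    public static void main(String[] args) {
        long now = System.nanoTime();
        long delay = earliestDelay(now, List.of(Long.MAX_VALUE));
        if (delay != Long.MAX_VALUE) {
            // A Wakeup would be scheduled with this enormous delay,
            // and such Wakeups accumulate in the scheduler queue.
            System.out.println("Would schedule wakeup in " + delay / 1_000_000 + " ms");
        }
    }
}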

[screenshot]

@sbordet
Contributor

sbordet commented Jul 25, 2023

@mpilone looking into this.
Definitely there is an issue handling MAX_VALUE, as you discovered.
I'll keep this issue posted on my results.

@mpilone

mpilone commented Jul 25, 2023

Now that I understand what's going on a little more, I think you can reproduce it with:

  1. Set up a proxy with a 5 minute "timeout" servlet config value (a rough setup sketch follows this list)
  2. Make a few requests through the proxy
  3. Wait 5 minutes for the timeout to expire
  4. Watch the log to see a new Wakeup get scheduled for 9222549011955ms in the future
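
For step 1, a rough setup sketch (my own example; the ee10 package names and the backend on port 9090 are assumptions for illustration, and "proxyTo"/"timeout" are the init parameters AbstractProxyServlet has historically read, so adjust for your environment):

import org.eclipse.jetty.ee10.proxy.ProxyServlet;
import org.eclipse.jetty.ee10.servlet.ServletContextHandler;
import org.eclipse.jetty.ee10.servlet.ServletHolder;
import org.eclipse.jetty.server.Server;

public class ProxyRepro {
    public static void main(String[] args) throws Exception {
        Server server = new Server(8080);

        ServletContextHandler context = new ServletContextHandler();
        context.setContextPath("/");

        // Transparent proxy forwarding everything to a local backend,
        // with a 5 minute total timeout on the proxied requests.
        ServletHolder proxy = context.addServlet(ProxyServlet.Transparent.class, "/*");
        proxy.setInitParameter("proxyTo", "http://localhost:9090");
        proxy.setInitParameter("timeout", String.valueOf(5 * 60 * 1000));

        server.setHandler(context);
        server.start();
        server.join();
    }
}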

Ironically, I think I couldn't reproduce this before because I was putting a bunch of load on the proxy, thinking it would accelerate the leak, but it seems to happen when the timeout expires and there's no active exchange (i.e. it is idle).

You can probably recreate it better with just a unit test directly on CyclicTimeout(s) but at least I can reproduce it end-to-end now.

[2023-07-25T11:48:58.288-0400] [DEBUG] [org.eclipse.jetty.io.CyclicTimeouts               ] [HttpClient@47198524-scheduler-1    ]: Entity HttpChannelOverHTTP@74e2d56(exchange=null)[send=HttpSenderOverHTTP@1ae684bc(req=QUEUED,failure=null)[HttpGenerator@2d804ddc{s=START}],recv=HttpReceiverOverHTTP@7552f1d0(ex=null,rsp=IDLE,failure=null)[HttpParser{s=START,0 of -1}]] expires in 9222549011956 ms for org.eclipse.jetty.client.transport.HttpConnection$RequestTimeouts@64a6301
[2023-07-25T11:48:58.289-0400] [DEBUG] [org.eclipse.jetty.io.CyclicTimeouts               ] [HttpClient@47198524-scheduler-1    ]: Scheduling timeout in 9222549011955 ms for org.eclipse.jetty.client.transport.HttpConnection$RequestTimeouts@64a6301
[2023-07-25T11:48:58.289-0400] [DEBUG] [org.eclipse.jetty.io.CyclicTimeout                ] [HttpClient@47198524-scheduler-1    ]: Installed timeout in 9222549011955 ms, new wake up in 9222549011955 ms
[2023-07-25T11:48:58.289-0400] [DEBUG] [org.eclipse.jetty.io.CyclicTimeouts               ] [HttpClient@47198524-scheduler-1    ]: Timeouts check for org.eclipse.jetty.client.transport.HttpDestination$RequestTimeouts@200afe73

Wakeups in heap after the initial requests, before the 5 minute timeout:
[screenshot]

Wakeups in heap after the initial 5 minute timeout:
[screenshot]

@sbordet
Contributor

sbordet commented Jul 25, 2023

Thanks, I arrived at the same conclusion, but I was looking at Jetty 10, which has similar but slightly different issues.

Thank you again for the detailed report.

sbordet added a commit that referenced this issue Jul 25, 2023
Fixed handling of Expirable.getExpireNanoTime() in case it returns Long.MAX_VALUE.

Also fixed implementations of Expirable that were not initializing their expireNanoTime field to Long.MAX_VALUE.

Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
@sbordet sbordet linked a pull request Jul 25, 2023 that will close this issue
@sbordet
Contributor

sbordet commented Jul 25, 2023

I confirm that in Jetty 10 we do not handle Long.MAX_VALUE returned from Expirable.getExpireNanoTime() well, but we do not leak.

However, in 12 we also do not handle this case well, and in addition we leak. The difference is how we initialize CyclicTimeouts.earliestTimeout in CyclicTimeouts.onTimeoutExpired().

PR is #10148, but it is for Jetty 10.
When merged to 12, you should see your problem fixed too.
If you want to give it a try in 12, just add the if block as done here: https://github.com/eclipse/jetty.project/pull/10148/files#diff-a6dbe0bdb3c9e9d7ef0a28b81637edac71f687db3ec283e1c5e363f0c944f310R98, and let us know if it works for you.
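
For anyone following along, the shape of that guard is roughly the following (my own simplified sketch, not the exact PR code): entities that report Long.MAX_VALUE, meaning "never expires", are skipped instead of being folded into the earliest-delay computation.

import java.util.List;

public class CyclicTimeoutsFixSketch {
    static long earliestDelay(long now, List<Long> expireNanoTimes) {
        long earliest = Long.MAX_VALUE;
        for (long expiresAt : expireNanoTimes) {
            if (expiresAt == Long.MAX_VALUE)
                continue; // no expiration: do not schedule a Wakeup for it
            earliest = Math.min(earliest, expiresAt - now);
        }
        return earliest;
    }

    public static void main(String[] args) {
        long now = System.nanoTime();
        // With only "never expires" entities, no timeout gets scheduled.
        System.out.println(earliestDelay(now, List.of(Long.MAX_VALUE)) == Long.MAX_VALUE); // true
    }
}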

Thanks!

sbordet added a commit that referenced this issue Jul 26, 2023
Fixed handling of Expirable.getExpireNanoTime() in case it returns Long.MAX_VALUE.

Also fixed implementations of Expirable that were not initializing their expireNanoTime field to Long.MAX_VALUE.

Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
@sbordet
Contributor

sbordet commented Jul 26, 2023

@mpilone the fix has been merged into jetty-12.0.x HEAD, so you can now build Jetty from that branch to test the fix.
Thanks!

@mpilone

mpilone commented Jul 26, 2023

Thanks, @sbordet . We applied the patch yesterday and aren't seeing the really long Wakeups being scheduled so that's promising. We'll update to the latest 12.0.x HEAD and let it run for a few days to see how it behaves.

@sbordet
Contributor

sbordet commented Jul 26, 2023

@mpilone thanks for the feedback. Let us know how it's going in a few days.

@mpilone

mpilone commented Jul 28, 2023

After 48 hours the number of Wakeups in the queue still looks good. Thanks for the fix.

@joakime joakime changed the title from "OutOfMemoryError Upgrading from 9.4.49.v20220914 to 12.0.0.beta3" to "OOME caused by CyclicTimeouts" Aug 25, 2023
@joakime joakime changed the title from "OOME caused by CyclicTimeouts" to "OutOfMemoryError caused by CyclicTimeouts" Aug 25, 2023
@jomapp

jomapp commented Mar 18, 2024

> @mpilone looking into this. Definitely there is an issue handling MAX_VALUE, as you discovered. I'll keep this issue posted on my results.

Hi, while debugging a probably related issue, we noticed that the current code now differs from the fix in #10148.

We are having issues with a heavily increasing heap size in Jetty 12.0.6, and are wondering if this could still be related to this issue. Thanks for looking into it again!

@sbordet
Contributor

sbordet commented Mar 18, 2024

@jomapp the rollover in Jetty 12 is fine, as we deal with nanoTimes.
I don't think your OOME is related to this issue.
If you have further details, please open a new issue and we'll look into that.
Thanks!

@jomapp

jomapp commented Mar 18, 2024

@sbordet Thank you for your fast response!
But isn't earliestNanoTime.set(now + Long.MAX_VALUE); (line 88) still overflowing on assignment to an AtomicLong?
Never mind if I'm substantially missing context here, but we were just confused that the changes to this line in Jetty 10 are different from Jetty 12.
Thanks!

@sbordet
Contributor

sbordet commented Mar 18, 2024

> But isn't earliestNanoTime.set(now + Long.MAX_VALUE); (line 88) still overflowing on assignment to an AtomicLong?

Yes, but this is fine because it's a nanoTime.
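
Concretely, here is a tiny illustration (my own example, not Jetty code): System.nanoTime() values are only meaningful when compared via subtraction, so the wraparound from that addition does not matter.

public class NanoTimeWrapSketch {
    public static void main(String[] args) {
        long now = System.nanoTime();
        long deadline = now + Long.MAX_VALUE; // overflows (wraps) when now > 0
        // An absolute comparison is broken by the wrap...
        System.out.println(deadline > now);     // typically false
        // ...but the subtraction-based comparison that nanoTime requires still works:
        System.out.println(deadline - now > 0); // true: the difference is Long.MAX_VALUE
    }
}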

In Jetty 12 the codebase is slightly different, so the fix was adapted to fit the Jetty 12 codebase.
