
OutOfMemoryError caused by CyclicTimeouts #10120

Closed
jsalts opened this issue Jul 18, 2023 · 26 comments · Fixed by #10126 or #10148
Labels
Bug For general bugs on Jetty side

Comments

@jsalts

jsalts commented Jul 18, 2023

Jetty version(s)
12.0.0.beta3

Java version/vendor (use: java -version)
openjdk 17.0.3 2022-04-19 LTS

OS type/version
linux

Description

This is mostly a shot in the dark at this point. We're using ProxyServlet with a five minute timeout and wondering if it's related.

[profiler screenshot]
@jsalts jsalts added the Bug For general bugs on Jetty side label Jul 18, 2023
@sbordet
Contributor

sbordet commented Jul 18, 2023

I do not understand how the profiler shows the client-side exchange having a reference to o.e.j.server.internal.HttpConnection, as there is no link between the two.

Can you please give a little more detail about what you are doing, how long the proxy runs, whether the 5 minute timeout expires or not, etc.?

@jsalts
Author

jsalts commented Jul 18, 2023

We might have ruled out the proxy. We're able to reproduce it with this type of test case.

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class Scratch {
    // Uses the standard Java HttpClient
    public static void main(String[] args) throws URISyntaxException, IOException, InterruptedException {

        System.setProperty("jdk.httpclient.allowRestrictedHeaders", "connection");

        java.net.http.HttpClient client = java.net.http.HttpClient.newBuilder()
                .version(java.net.http.HttpClient.Version.HTTP_1_1)
                .build();

        for (int i = 0; i < 100_000; ++i) {
            var request = HttpRequest.newBuilder()
                    .header("Accept", "application/json")
                    .header("Content-Type", "application/json")
                    // Sending no Connection header, which is the same as keep-alive in Http 1.1,
                    // seems to result in OOM on the remote side.
//                    .header("Connection", "close")
                    .uri(new URI("http://localhost:8033/app/endpoint))
                    .POST(HttpRequest.BodyPublishers.ofString("{\"ids\":[6961806]}", StandardCharsets.UTF_8))
                    .build();

            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

            int status = response.statusCode();
            if (status != 200) {
                System.out.println("ERROR: " + status);
            }

            if (i % 100 == 0) {
                System.out.println(i);
            }

            // 200 requests/second
            Thread.sleep(5);
        }
    }
}
[screenshot]

@jsalts
Author

jsalts commented Jul 18, 2023

It's possible this is caused by the addIdleTimeoutListener chain referencing the previous idle timeout listener.

@mpilone

mpilone commented Jul 18, 2023

I was working with @jsalts on this. Here's a GC root path I found in the heap dump when memory was near full (I trimmed a lot of the middle references out).

[screenshots: GC root path from the heap dump]

I think that reference list comes from org.eclipse.jetty.server.internal.HttpChannelState.ChannelRequest#addIdleTimeoutListener adding an idle listener on each request, which generates a lambda referencing the previous idle listener. I'm assuming the same HttpChannelState is reused for multiple requests on a persistent connection, so the lambda reference chain just keeps growing.

My guess is that the server is keeping the connection open and reusing it, and over time something keeps accumulating in the open connection state (e.g. HttpChannelState). If you stop the test client, which closes the connections, the memory is eventually cleaned up on the server.
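To make the suspected mechanism concrete, here is a minimal sketch (my own illustrative code, not Jetty's actual implementation; the class and field names are made up) of how wrapping the previous listener in a new lambda on each request builds an ever-growing chain when the channel state is reused for a persistent connection and never reset:

import java.util.function.Consumer;

// Illustrative only: mimics the shape of a per-request listener chain.
class ChannelStateSketch {
    private Consumer<Throwable> onIdleTimeout; // analogous to an _onIdleTimeout field

    void addIdleTimeoutListener(Consumer<Throwable> listener) {
        Consumer<Throwable> previous = onIdleTimeout;
        // The new lambda captures 'previous', so each registration keeps the
        // whole chain of earlier listeners (and whatever they capture) reachable.
        onIdleTimeout = previous == null
                ? listener
                : failure -> { previous.accept(failure); listener.accept(failure); };
    }

    void recycle() {
        // If this reset is missing, the chain survives across requests on the
        // same connection and memory grows until the connection is closed.
        onIdleTimeout = null;
    }
}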

If you run the same test client with a "Connection: close" header, the memory issue doesn't appear.

[screenshot]

I think we pointed to the ProxyServlet originally because we see the OOM more when requests go through the proxy: the default Jetty HttpClient doesn't appear to set a Connection header, which defaults to keep-alive. Our other clients likely close the connection more often, allowing the remote server to clean up the state. But as @jsalts said, we can now reproduce it without the ProxyServlet by making a lot of requests with no Connection header (i.e. keep-alive).

Thanks,
-mike

sbordet added a commit that referenced this issue Jul 18, 2023
When `HttpChannelState._onIdleTimeout` was introduced, it was forgotten to null it out in recycle().

Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
@sbordet sbordet linked a pull request Jul 18, 2023 that will close this issue
@sbordet
Contributor

sbordet commented Jul 18, 2023

@jsalts @mpilone thanks for the detailed report.

Can you try #10126 and see if it solves this issue?

@jsalts
Author

jsalts commented Jul 18, 2023

Looks better to me.

Before:
[screenshot]

After:

[screenshot]

sbordet added a commit that referenced this issue Jul 18, 2023
When `HttpChannelState._onIdleTimeout` was introduced, it was forgotten to null it out in recycle().

Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
@sbordet
Contributor

sbordet commented Jul 18, 2023

Thanks for the feedback!

@sbordet sbordet closed this as completed Jul 18, 2023
@jsalts
Author

jsalts commented Jul 24, 2023

@sbordet Following up on this, it seems like the patch helped, but we're now OOMing over 1-2 days instead of 4 hours. At first glance the hprof shows a similar state to the one above. I've attached a memory-after-garbage-collection graph, and it looks like the heap grows for about 2 days before flattening out at our -Xmx limit and throwing an OOM exception.

Will update if we get more information.

[graph: memory after garbage collection]

Retested the patch again and it definitely looks more stable than it was before. I'll leave the test running overnight and see what happens.

[screenshot]

@sbordet
Contributor

sbordet commented Jul 24, 2023

@jsalts thanks for this additional report.

Please let us know if you have a heap dump we can look at, or your analysis of this issue.

@sbordet sbordet reopened this Jul 24, 2023
@mpilone

mpilone commented Jul 24, 2023

I just started to look into this, and it looks like it might be an issue on the proxy side. Maybe this is part of the issue we originally saw when we blamed the proxy. Here are the initial screenshots from the heap dump. I'm going to try to recreate the issue locally today. At first glance, it looks like the org.eclipse.jetty.client.transport.HttpConnection#requestTimeouts might not be cleared when a client connection is reused in the proxy.

[screenshots from the heap dump]

@sbordet
Contributor

sbordet commented Jul 24, 2023

I am not sure I understand the image above for the GC root -- is it one instance that may actually be a legitimate one?

How many connections do you have?

Is this heap dump taken after load, at rest, with no requests being processed?

@mpilone

mpilone commented Jul 25, 2023

There are a lot of instances of that similar GC root in the dump. Looking for just the CyclicTimeout.Wakeup, I can see 4871 instances in the dump, which seems odd. Our "timeout" value on the outbound proxy requests is 5 minutes, and as far as we know all the requests were completing successfully. That is, none of them were actually lasting the full 5 minutes and timing out. If I read the heap dump above correctly, the scheduled executor queue had 4872 elements in it, which aligns with the 4871 Wakeups in the dump.

The requests are pretty regular from another component in our system. I included a graph of the requests leading up to the OOM at 02:36. The requests should only be coming from a couple of external components so I would guess the actual number of active connections to be very low (e.g. < 5). The service that had the OOM is not exposed to the public and only services a handful of other internal components.

The heap dump was taken automatically by the JVM when the OOM occurred so it would be during the requests/load after the process was running for about 2 days (Friday through Sunday morning).

I've tried a bunch of different scenarios locally, but I haven't been able to recreate the issue yet. I enabled debug logging on CyclicTimeouts and CyclicTimeout to see if that gives any clues in the environment where we saw the OOM. I'll update if I get any more info from that.

It seems like it might be a CyclicTimeout Wakeup leak, but looking at the code I'm not sure yet how that can happen, because it looks like the Wakeups only reschedule themselves in the executor if an earlier one arrives. So you'd think that if we only had a handful of connections, there should only be a handful of Wakeups in the executor queue. It's possible this is an issue on our side generating tons of independent connections, but I haven't seen any evidence of that yet.

-mike
[graphs: Wakeup count, connection instance count, requests to proxy]

@mpilone

mpilone commented Jul 25, 2023

Just after posting that last comment I went to check if we got any new info from the CyclicTimeout logging. I see a lot of Wakeups getting scheduled, but even more oddly, I see some really long Wakeups being scheduled. For example, 9208175844786ms. I don't know what could cause that yet but I could see that leading to a buildup of Wakeups in the executor queue over time.
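For scale, a quick back-of-the-envelope check (my own arithmetic, not taken from Jetty) shows these delays are the same order of magnitude as Long.MAX_VALUE expressed in milliseconds:

public class WakeupDelaySketch {
    public static void main(String[] args) {
        // Long.MAX_VALUE nanoseconds expressed in milliseconds:
        System.out.println(Long.MAX_VALUE / 1_000_000L); // 9223372036854 ms, roughly 292 years
        // A logged delay like 9208175844786 ms is consistent with that value
        // minus the JVM's current nanoTime, i.e. a delay derived from Long.MAX_VALUE.
    }
}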

I included two screenshots. The first shows all the Wakeups being scheduled over the last hour. A mix of our 5 minute timeouts and the really long timeouts. The second is filtered to show only the really long timeouts. There are a lot of them for a one hour period.

[screenshots: all Wakeups; really long Wakeups]

@mpilone

mpilone commented Jul 25, 2023

I'm wondering if the issue is in org.eclipse.jetty.io.CyclicTimeouts#onTimeoutExpired. I see two potential problems (but this is my first time looking at that code). The first is the use of earliestTimeout: it is set to "now + Long.MAX_VALUE", but it is only an AtomicLong. Isn't that going to cause a rollover error?

The second is that the line "long expiresAt = expirable.getExpireNanoTime();" will ask the HttpChannel for the expiration time. The logic in HttpChannel,

return exchange != null ? exchange.getExpireNanoTime() : Long.MAX_VALUE;

is to return Long.MAX_VALUE if there's no exchange. Later in onTimeoutExpired, it calculates

earliest = Math.min(earliest, NanoTime.elapsed(now, expiresAt));

and then schedules a timeout if (earliest != Long.MAX_VALUE). But earliest won't equal Long.MAX_VALUE if the code just calculated it as the elapsed time from now to expiresAt (which is Long.MAX_VALUE).

Again, I'm just reading through this code for the first time, but I think it can lead to Wakeups being scheduled roughly Long.MAX_VALUE - now in the future if a channel doesn't have an exchange when the timeout expires.
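Here is a small standalone sketch (my own simplified code with made-up names, not the actual CyclicTimeouts source) of why an Expirable reporting Long.MAX_VALUE slips past that guard:

import java.util.List;

public class CyclicTimeoutsBugSketch {
    // Simplified version of the earliest-delay computation described above.
    static long earliestDelay(long now, List<Long> expireNanoTimes) {
        long earliest = Long.MAX_VALUE;
        for (long expiresAt : expireNanoTimes) {
            // When expiresAt == Long.MAX_VALUE (no exchange), the subtraction
            // yields a huge but finite delay, so the guard below still passes.
            earliest = Math.min(earliest, expiresAt - now);
        }
        return earliest;
    }

    public static void main(String[] args) {
        long now = System.nanoTime();
        long delay = earliestDelay(now, List.of(Long.MAX_VALUE));
        if (delay != Long.MAX_VALUE) {
            // A Wakeup would be scheduled with this enormous delay,
            // and such Wakeups accumulate in the scheduler queue.
            System.out.println("Would schedule wakeup in " + delay / 1_000_000 + " ms");
        }
    }
}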

[screenshot]

@sbordet
Contributor

sbordet commented Jul 25, 2023

@mpilone looking into this.
Definitely there is an issue handling MAX_VALUE, as you discovered.
I'll keep this issue posted on my results.

@mpilone

mpilone commented Jul 25, 2023

Now that I understand what's going on a little more, I think you can reproduce it with:

  1. Set up a proxy with a 5 minute "timeout" servlet config value (a rough setup sketch follows this list)
  2. Make a few requests through the proxy
  3. Wait 5 minutes for the timeout to expire
  4. Watch the log to see a new Wakeup get scheduled for 9222549011955ms in the future
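
For step 1, a rough setup sketch (my own example; the ee10 package names and the backend on port 9090 are assumptions for illustration, and "proxyTo"/"timeout" are the init parameters AbstractProxyServlet has historically read, so adjust for your environment):

import org.eclipse.jetty.ee10.proxy.ProxyServlet;
import org.eclipse.jetty.ee10.servlet.ServletContextHandler;
import org.eclipse.jetty.ee10.servlet.ServletHolder;
import org.eclipse.jetty.server.Server;

public class ProxyRepro {
    public static void main(String[] args) throws Exception {
        Server server = new Server(8080);

        ServletContextHandler context = new ServletContextHandler();
        context.setContextPath("/");

        // Transparent proxy forwarding everything to a local backend,
        // with a 5 minute total timeout on the proxied requests.
        ServletHolder proxy = context.addServlet(ProxyServlet.Transparent.class, "/*");
        proxy.setInitParameter("proxyTo", "http://localhost:9090");
        proxy.setInitParameter("timeout", String.valueOf(5 * 60 * 1000));

        server.setHandler(context);
        server.start();
        server.join();
    }
}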

Ironically, I think I couldn't reproduce this before because I was putting a bunch of load on the proxy, thinking it would accelerate the leak, but it seems to happen when the timeout expires and there's no active exchange (i.e. it is idle).

You can probably recreate it better with just a unit test directly on CyclicTimeout(s) but at least I can reproduce it end-to-end now.

[2023-07-25T11:48:58.288-0400] [DEBUG] [org.eclipse.jetty.io.CyclicTimeouts               ] [HttpClient@47198524-scheduler-1    ]: Entity HttpChannelOverHTTP@74e2d56(exchange=null)[send=HttpSenderOverHTTP@1ae684bc(req=QUEUED,failure=null)[HttpGenerator@2d804ddc{s=START}],recv=HttpReceiverOverHTTP@7552f1d0(ex=null,rsp=IDLE,failure=null)[HttpParser{s=START,0 of -1}]] expires in 9222549011956 ms for org.eclipse.jetty.client.transport.HttpConnection$RequestTimeouts@64a6301
[2023-07-25T11:48:58.289-0400] [DEBUG] [org.eclipse.jetty.io.CyclicTimeouts               ] [HttpClient@47198524-scheduler-1    ]: Scheduling timeout in 9222549011955 ms for org.eclipse.jetty.client.transport.HttpConnection$RequestTimeouts@64a6301
[2023-07-25T11:48:58.289-0400] [DEBUG] [org.eclipse.jetty.io.CyclicTimeout                ] [HttpClient@47198524-scheduler-1    ]: Installed timeout in 9222549011955 ms, new wake up in 9222549011955 ms
[2023-07-25T11:48:58.289-0400] [DEBUG] [org.eclipse.jetty.io.CyclicTimeouts               ] [HttpClient@47198524-scheduler-1    ]: Timeouts check for org.eclipse.jetty.client.transport.HttpDestination$RequestTimeouts@200afe73

Wakeups in heap after the initial requests, before the 5 minute timeout:
[screenshot]

Wakeups in heap after the initial 5 minute timeout:
[screenshot]

@sbordet
Contributor

sbordet commented Jul 25, 2023

Thanks, I arrived at the same conclusion, but I was looking at Jetty 10, which has similar but slightly different issues.

Thank you again for the detailed report.

sbordet added a commit that referenced this issue Jul 25, 2023
Fixed handling of Expirable.getExpireNanoTime() in case it returns Long.MAX_VALUE.

Also fixed implementations of Expirable that were not initializing their expireNanoTime field to Long.MAX_VALUE.

Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
@sbordet sbordet linked a pull request Jul 25, 2023 that will close this issue
@sbordet
Contributor

sbordet commented Jul 25, 2023

I confirm that in Jetty 10 we do not handle Long.MAX_VALUE returned from Expirable.getExpireNanoTime() well, but we do not leak.

However, in 12 we also do not handle this case well, and in addition we leak. The difference is how we initialize CyclicTimeouts.earliestTimeout in CyclicTimeouts.onTimeoutExpired().

PR is #10148, but it is for Jetty 10.
When merged to 12, you should see your problem fixed too.
If you want to give it a try in 12, just add the if block as done here: https://github.com/eclipse/jetty.project/pull/10148/files#diff-a6dbe0bdb3c9e9d7ef0a28b81637edac71f687db3ec283e1c5e363f0c944f310R98, and let us know if it works for you.
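
For anyone following along, the shape of that guard is roughly the following (my own simplified sketch, not the exact PR code): entities that report Long.MAX_VALUE, meaning "never expires", are skipped instead of being folded into the earliest-delay computation.

import java.util.List;

public class CyclicTimeoutsFixSketch {
    static long earliestDelay(long now, List<Long> expireNanoTimes) {
        long earliest = Long.MAX_VALUE;
        for (long expiresAt : expireNanoTimes) {
            if (expiresAt == Long.MAX_VALUE)
                continue; // no expiration: do not schedule a Wakeup for it
            earliest = Math.min(earliest, expiresAt - now);
        }
        return earliest;
    }

    public static void main(String[] args) {
        long now = System.nanoTime();
        // With only "never expires" entities, no timeout gets scheduled.
        System.out.println(earliestDelay(now, List.of(Long.MAX_VALUE)) == Long.MAX_VALUE); // true
    }
}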

Thanks!

sbordet added a commit that referenced this issue Jul 26, 2023
Fixed handling of Expirable.getExpireNanoTime() in case it returns Long.MAX_VALUE.

Also fixed implementations of Expirable that were not initializing their expireNanoTime field to Long.MAX_VALUE.

Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
@sbordet
Contributor

sbordet commented Jul 26, 2023

@mpilone the fix has been merged into jetty-12.0.x HEAD, so you can now build Jetty from that branch to test the fix.
Thanks!

@mpilone

mpilone commented Jul 26, 2023

Thanks, @sbordet . We applied the patch yesterday and aren't seeing the really long Wakeups being scheduled so that's promising. We'll update to the latest 12.0.x HEAD and let it run for a few days to see how it behaves.

@sbordet
Contributor

sbordet commented Jul 26, 2023

@mpilone thanks for the feedback. Let us know how it's going in a few days.

@mpilone

mpilone commented Jul 28, 2023

After 48 hours the number of Wakeups in the queue still looks good. Thanks for the fix.

@joakime joakime changed the title from "OutOfMemoryError Upgrading from 9.4.49.v20220914 to 12.0.0.beta3" to "OOME caused by CyclicTimeouts" Aug 25, 2023
@joakime joakime changed the title from "OOME caused by CyclicTimeouts" to "OutOfMemoryError caused by CyclicTimeouts" Aug 25, 2023
@jomapp

jomapp commented Mar 18, 2024

> @mpilone looking into this. Definitely there is an issue handling MAX_VALUE, as you discovered. I'll keep this issue posted on my results.

Hi, while debugging a probably related issue, we noticed that the current code now differs from the fix in #10148.

We are having issues with a heavily increasing heap size in Jetty 12.0.6, and are wondering if this could still be related to this issue. Thanks for looking into it again!

@sbordet
Contributor

sbordet commented Mar 18, 2024

@jomapp the rollover in Jetty 12 is fine, as we deal with nanoTimes.
I don't think your OOME is related to this issue.
If you have further details, please open a new issue and we'll look into that.
Thanks!

@jomapp

jomapp commented Mar 18, 2024

@sbordet Thank you for your fast response!
But isn't earliestNanoTime.set(now + Long.MAX_VALUE); (line 88) still overflowing on assignment to an AtomicLong?
Never mind if I'm substantially missing context here, but we were just confused that the changes to this line in Jetty 10 are different from Jetty 12.
Thanks!

@sbordet
Contributor

sbordet commented Mar 18, 2024

> But isn't earliestNanoTime.set(now + Long.MAX_VALUE); (line 88) still overflowing on assignment to an AtomicLong?

Yes, but this is fine because it's a nanoTime.
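
Concretely, here is a tiny illustration (my own example, not Jetty code): System.nanoTime() values are only meaningful when compared via subtraction, so the wraparound from that addition does not matter.

public class NanoTimeWrapSketch {
    public static void main(String[] args) {
        long now = System.nanoTime();
        long deadline = now + Long.MAX_VALUE; // overflows (wraps) when now > 0
        // An absolute comparison is broken by the wrap...
        System.out.println(deadline > now);     // typically false
        // ...but the subtraction-based comparison that nanoTime requires still works:
        System.out.println(deadline - now > 0); // true: the difference is Long.MAX_VALUE
    }
}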

In Jetty 12 the codebase is slightly different, so the fix was adapted to fit the Jetty 12 codebase.
