NIO ByteBuffer corruption in embedded Jetty server #4828

Closed
chietti opened this issue Apr 29, 2020 · 17 comments · Fixed by #4864

chietti commented Apr 29, 2020

Jetty 9.4.26.v20200117

OpenJDK 64-Bit Server VM, Version: 11.0.5

Linux, Version: 4.14.114-83.126.amzn1.x86_64

On one of our server cluster nodes we observed a sudden occurrence of these exceptions while writing HTTP responses from our servlet on the embedded Jetty web server:

2020-04-28 13:06:58 UTC SEVERE  [<server,0x1>] [CommunicationServletImpl] Failed to handle request 
java.lang.IllegalArgumentException: newPosition > limit: (65536 > 12240)
	at java.base/java.nio.Buffer.createPositionException(Buffer.java:318)
	at java.base/java.nio.Buffer.position(Buffer.java:293)
	at java.base/java.nio.ByteBuffer.position(ByteBuffer.java:1086)
	at java.base/java.nio.MappedByteBuffer.position(MappedByteBuffer.java:226)
	at java.base/java.nio.MappedByteBuffer.position(MappedByteBuffer.java:67)
	at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:192)
	at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:130)
	at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
	at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:263)
	at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:422)
	at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:277)
	at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
	at org.eclipse.jetty.server.HttpConnection$SendCallback.process(HttpConnection.java:809)
	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
	at org.eclipse.jetty.server.HttpConnection.send(HttpConnection.java:546)
	at org.eclipse.jetty.server.HttpChannel.sendResponse(HttpChannel.java:824)
	at org.eclipse.jetty.server.HttpChannel.write(HttpChannel.java:901)
	at org.eclipse.jetty.server.HttpOutput.channelWrite(HttpOutput.java:283)
	at org.eclipse.jetty.server.HttpOutput.channelWrite(HttpOutput.java:267)
	at org.eclipse.jetty.server.HttpOutput.flush(HttpOutput.java:705)
	at org.eclipse.jetty.server.Response.flushBuffer(Response.java:1026)
	at com.compuware.apm.communication.servlet.CommunicationServletImpl.writeResponseContainer(CommunicationServletImpl.java:405)
	at com.compuware.apm.communication.servlet.CommunicationServletImpl.handleSyncRequest(CommunicationServletImpl.java:378)
	at com.compuware.apm.communication.servlet.CommunicationServletImpl.handleRequest(CommunicationServletImpl.java:219)
	at com.compuware.apm.communication.servlet.CommunicationServletImpl.doPost(CommunicationServletImpl.java:186)

At the same time, the other servers in the cluster started to complain that they received messages that should never have reached them. Looking at the exception, it appears that multiple threads are writing HTTP responses into the same NIO ByteBuffer, which would explain why responses leaked to targets they were not intended for.

I did some research into the HttpConnection source code and how it handles the pooled NIO ByteBuffers.
While the ByteBufferPool implementation we use (the default ArrayByteBufferPool) and its Bucket look fine in terms of synchronization, I am wondering whether it is possible for a ByteBuffer to be released into the pool twice.

HttpConnection.onCompleted() contains this piece of code with a comment that caught my attention:

// Not in a race here with onFillable, because it has given up control before calling handle.
// in a slight race with #completed, but not sure what to do with that anyway.
if (_chunk != null)
    _bufferPool.release(_chunk);
_chunk = null;

Is it possible that onCompleted() is called while HttpConnection$SendCallback.process() is executing? For example, I have found calls to it via HttpChannel.onBadMessage().

HttpConnection$SendCallback.process() contains this piece of code:

if (_chunk != null)
   _bufferPool.release(_chunk);
chunk = _chunk = _bufferPool.acquire(_config.getResponseHeaderSize(), CHUNK_BUFFER_DIRECT);

Since there is no synchronization, either path can be interrupted after releasing the buffer but before setting the _chunk member to null or to a different value. The chunk buffer could therefore be released twice and handed out to multiple threads at the same time, until the server is restarted (we did that, and the problem has not recurred since).
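
For illustration, here is a simplified, hypothetical sketch (not Jetty's actual pool or connection code) of how a non-atomic "release, then null the field" sequence can double-release a buffer, after which the pool can hand the same buffer to two writers at once:

import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-ins for the pool and the connection, only to show the shape of the race.
class SimplePool
{
    private final ConcurrentLinkedQueue<ByteBuffer> free = new ConcurrentLinkedQueue<>();

    ByteBuffer acquire(int size)
    {
        ByteBuffer b = free.poll();
        return b != null ? b : ByteBuffer.allocateDirect(size);
    }

    void release(ByteBuffer b)
    {
        b.clear();
        free.offer(b); // no double-release check: the same buffer can end up queued twice
    }
}

class ChunkHolder
{
    private final SimplePool pool = new SimplePool();
    private ByteBuffer _chunk = pool.acquire(128);

    // If two threads run this concurrently, both can see the same non-null _chunk
    // and release it before either clears the field, so the pool now holds the
    // buffer twice and can hand it to two callers at the same time.
    void releaseChunk()
    {
        if (_chunk != null)
            pool.release(_chunk);
        _chunk = null;
    }
}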

We assume it must be a rare fringe scenario, since we have never seen this kind of behavior in years of operation, but I do not know enough about the internal dynamics of Jetty to be sure whether a race condition between these two pieces of code is technically possible (although the comment suggests it is).

Since the effect is rather devastating (we send messages to the wrong targets, causing all kinds of weird behavior), we would appreciate your insight into this problem.

gregw (Contributor) commented Apr 29, 2020

I'm moderately sure there is not a race between HttpConnection$SendCallback.process() and onCompleted(), because the latter should not be called until the last send has completed its processing.
That comment about "slight race with #completed" doesn't look good, but I think it is out of date, as the whole completion conversation has been refactored since 2014.

However, I do agree that it sounds plausible that a buffer was incorrectly recycled. We do have a LeakTrackingByteBufferPool that could help diagnose this, but I'm guessing it has only happened once and you are unable to reproduce it?
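
A minimal sketch of how the tracking pool can be wired in (embedded Jetty 9.4; the connector arguments and the port here are just illustrative, so check them against your setup):

import org.eclipse.jetty.io.ArrayByteBufferPool;
import org.eclipse.jetty.io.ByteBufferPool;
import org.eclipse.jetty.io.LeakTrackingByteBufferPool;
import org.eclipse.jetty.server.HttpConnectionFactory;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;

public class LeakTrackingServer
{
    public static void main(String[] args) throws Exception
    {
        Server server = new Server();

        // Wrap the default pool so acquires/releases are tracked and leaks are reported.
        ByteBufferPool pool = new LeakTrackingByteBufferPool(new ArrayByteBufferPool());

        // -1/-1 selects the default number of acceptors and selectors.
        ServerConnector connector = new ServerConnector(
            server, null, null, pool, -1, -1, new HttpConnectionFactory());
        connector.setPort(8080);

        server.addConnector(connector);
        // Add your handlers/servlet context here.
        server.start();
        server.join();
    }
}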

So I think we are down to code inspection and thought experiments. Can you tell us a lot more about the load on your server?

  • Is it all HTTP/1?
  • Do you have POSTs or GETs with content? If so, is it chunked or length-delimited?
  • How big are your normal responses? Are they normally chunked?
  • I see a method called handleSyncRequest; does that mean that you also handle async requests?
  • Do you always explicitly call flushBuffer yourselves? Why? Do you ever call it from an async thread?
  • Your stack trace includes a MappedByteBuffer! Are you writing static content or a file? Are you working with mapped buffers yourselves? Any idea how the mapped buffer could have been used?
  • Any chance of seeing the code from CommunicationServletImpl?

Lots of questions, sorry, and few answers at this stage.

chietti (Author) commented May 4, 2020

Hi Greg,
thank you for the fast response. It is indeed not reproducible and has so far only happened twice, in two different clusters (both with the same configuration).

To answer your questions:

  • The traffic is all HTTP/1.
  • POST requests only, with a few HEAD requests "sprinkled" in for health checks
  • The requests are all handled synchronously (no AsyncContext used in this configuration)
  • flushBuffer is always called when we are done answering the request (I assume this is redundant by now; the call has been in there for a long time, from when we worked with older versions of Jetty)
  • I have no idea where the MappedByteBuffer comes from (we do not explicitly use it; everything related to buffers is left at the Jetty defaults)
  • We do not use any servlet-specific code for file handling; we implement our own protobuf-based protocol using HTTP as the transport

The LeakTrackingByteBufferPool looks promising, we'll check if we can integrate it.

gregw (Contributor) commented May 4, 2020

Thanks for the extra info.

I would consider removing the flushBuffer call. It should not be necessary, and Jetty can be a lot more efficient if you leave the flushing to it (e.g. it can sometimes use Content-Length rather than chunked framing, and it can do an async flush of the last content). Note that this might avoid the problem rather than fix it, because it should also be perfectly fine to do the flush.
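
For example, a minimal hypothetical servlet (not your CommunicationServletImpl) that leaves flushing entirely to Jetty might look like this:

import java.io.IOException;
import java.io.OutputStream;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ResponseWriteExample extends HttpServlet
{
    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response) throws IOException
    {
        byte[] body = buildResponseBody(request); // hypothetical helper producing the payload
        response.setContentType("application/octet-stream");
        response.setContentLength(body.length); // lets Jetty use Content-Length instead of chunking

        OutputStream out = response.getOutputStream();
        out.write(body);
        // No explicit response.flushBuffer(): Jetty flushes and completes the
        // response when the servlet returns, and can do so asynchronously.
    }

    private byte[] buildResponseBody(HttpServletRequest request)
    {
        return new byte[0]; // placeholder
    }
}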

In your logs, can you tell if the first request that was not handled correctly was a HEAD request? Or was there a lost HEAD request in close association?

@sbordet I'm guessing that direct buffers are now implemented as mapped buffers?

sbordet (Contributor) commented May 4, 2020

@gregw no, direct buffers are direct buffers.

Jetty produces MappedByteBuffers only from AfterContentTransformer (used by AsyncMiddleManServlet) or from CachedContentFactory (where we do cache the mapped buffer, but we always return it via asReadOnlyBuffer()).

In both cases the MappedByteBuffer is created on the fly and not stored anywhere else, so I don't see a chance that something else has a reference to it and can mess with the position.
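
As a plain-JDK illustration, a read-only view shares the bytes but has its own position and limit, so callers cannot move the cached buffer's position:

import java.nio.ByteBuffer;

public class ReadOnlyViewDemo
{
    public static void main(String[] args)
    {
        ByteBuffer cached = ByteBuffer.allocateDirect(16);

        // The view shares the content but carries independent position/limit/mark.
        ByteBuffer view = cached.asReadOnlyBuffer();
        view.position(8);

        System.out.println(cached.position()); // prints 0: the cached buffer is untouched
        System.out.println(view.position());   // prints 8
    }
}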

@chietti do you spawn threads in your application? Do you write to the response output stream from one of those spawned threads?

chietti (Author) commented May 4, 2020

@sbordet Not in this context. The CommunicationServlet is fully synchronous.

We parse the protobuf content, handle the messages in there and write back the response all in the same Jetty thread context.

I have to admit, though, that I can only speak for the protobuf layer. The server also hosts a lot of Jersey-based REST APIs that I am not familiar with, as well as a lot of other servlets that I do not know.
I assume that the buffer pool is shared across the whole server and is not local to a servlet, right?
In that case anybody contributing to the server could cause this problem.

sbordet (Contributor) commented May 4, 2020

@chietti you are using HTTP/1.1, so I don't think requests on other sockets can mess with this one.

Are you able to figure out where the MappedByteBuffer comes from? Perhaps via remote debugging, logging, or modifying the Jetty source to add a log line if the buffer is a MappedByteBuffer?

djelinski commented:

FWIW, I get the same MappedByteBuffer lines in the stack trace when I run this on JDK 11: ByteBuffer.allocateDirect(1).position(2);

java.lang.IllegalArgumentException: newPosition > limit: (2 > 1)
	at java.base/java.nio.Buffer.createPositionException(Buffer.java:318)
	at java.base/java.nio.Buffer.position(Buffer.java:293)
	at java.base/java.nio.ByteBuffer.position(ByteBuffer.java:1086)
	at java.base/java.nio.MappedByteBuffer.position(MappedByteBuffer.java:226)
	at java.base/java.nio.MappedByteBuffer.position(MappedByteBuffer.java:67)

chietti (Author) commented May 4, 2020

@sbordet do you mean why there is a DirectByteBuffer and not a HeapByteBuffer in use?
DirectByteBuffer is a subclass of the abstract MappedByteBuffer.
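
A quick plain-JDK check confirms the hierarchy (on JDK 11, java.nio.DirectByteBuffer extends MappedByteBuffer, which is why MappedByteBuffer frames show up for a buffer that was never mapped from a file):

import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;

public class DirectBufferHierarchy
{
    public static void main(String[] args)
    {
        ByteBuffer direct = ByteBuffer.allocateDirect(16);
        System.out.println(direct instanceof MappedByteBuffer); // true
        System.out.println(direct.getClass().getName());        // java.nio.DirectByteBuffer
    }
}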

chietti (Author) commented May 4, 2020

As far as I understand the source code, it cannot be caused by the _header or _chunk members, since both are HeapByteBuffers (according to the HEADER_BUFFER_DIRECT and CHUNK_BUFFER_DIRECT constants). That leaves the _content member, which is the _aggregate member in HttpOutput, and that is a DirectByteBuffer (see HttpOutput.acquireBuffer()). I also verified that locally in the debugger.

sbordet (Contributor) commented May 4, 2020

DirectByteBuffer is a subclass of the abstract MappedByteBuffer.

I always forget this, so yeah, it's not a MappedByteBuffer but likely just a DirectByteBuffer.

I also verified that locally in the debugger.

Sorry, what did you verify, exactly? That the buffer that throws is the aggregate buffer?

Do you have a reproducible test case for us?

chietti (Author) commented May 4, 2020

Sorry, that was a bit unclear: I verified that the _aggregate in HttpOutput is a DirectByteBuffer, which makes it the only candidate for me, since the other two buffers involved (_chunk and _header) are definitely HeapByteBuffers, as far as I understand the code here: SendCallback.process (HttpConnection.java:809).

chietti (Author) commented May 7, 2020

I had some time to look into the HttpOutput implementation, and I am wondering about the calls to releaseBuffer(). It is called from three places (completed(), onWriteComplete() and recycle()). In onWriteComplete() and recycle() the call is made inside the synchronized block, but in completed() it is made after the synchronized block. Is that intentional? If not, is it possible that these three methods are called from different threads? If so, there is a small window for releasing the buffer twice (if releaseBuffer() is interrupted right after the release to the pool but before the member is reset to null on line 631).
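
To illustrate the window I mean, here is a simplified sketch with hypothetical names (not Jetty's actual HttpOutput code): releasing and then clearing the field as two separate steps leaves a gap, whereas swapping the field to null under a lock before releasing closes it:

import java.nio.ByteBuffer;

// Hypothetical pool interface standing in for Jetty's ByteBufferPool.
interface Pool
{
    void release(ByteBuffer buffer);
}

class BufferHolder
{
    private final Object lock = new Object();
    private final Pool pool;
    private ByteBuffer _aggregate;

    BufferHolder(Pool pool, ByteBuffer initial)
    {
        this.pool = pool;
        this._aggregate = initial;
    }

    // Risky shape: between releasing the buffer and clearing the field, another
    // thread can still see a non-null _aggregate and release the same buffer again.
    void releaseThenClear()
    {
        if (_aggregate != null)
            pool.release(_aggregate);
        _aggregate = null;
    }

    // Safer shape: swap the field to null under the lock, so only one caller
    // ever obtains the buffer, then release it at most once.
    void swapUnderLockThenRelease()
    {
        ByteBuffer buffer;
        synchronized (lock)
        {
            buffer = _aggregate;
            _aggregate = null;
        }
        if (buffer != null)
            pool.release(buffer);
    }
}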

gregw added the Bug (For general bugs on Jetty side) and High Priority labels on May 11, 2020
gregw self-assigned this on May 11, 2020
gregw added a commit that referenced this issue May 11, 2020
 + improve synchronization around releaseBuffer
 + improve synchronization around acquireBuffer
 + made acquireBuffer private.

Signed-off-by: Greg Wilkins <gregw@webtide.com>
gregw linked a pull request on May 11, 2020 that will close this issue
gregw (Contributor) commented May 11, 2020

I'm not entirely convinced that the synchronization is the problem, but then I can't convince myself that it is not.
So I have created #4864 to add some extra synchronization. @chietti any chance you could test a build of this over the next few days while we rebuild a release?

joakime (Contributor) commented May 11, 2020

PR #4864 merged.
We can close this once @chietti confirms with us.

chietti (Author) commented May 12, 2020

Hi @gregw, since we have no reproducer we cannot confirm that this fixes the bug.
I talked to our load-testing team, and we can plan the internal rollout of version 9.4.29 over the next few weeks once it has been released by you. The only thing we can test is that, in our scenarios, the new version does not behave differently from the old one in terms of throughput, memory usage, CPU and GC overhead, etc. I assume you have internal load tests and regression tests in place, so this will likely be no new information for you.

We will definitely keep an eye on this problem with our internal monitoring.

chietti (Author) commented Dec 2, 2020

I think you can close this ticket. We switched to 9.4.30 a long time ago and have never seen this problem again.

gregw (Contributor) commented Dec 2, 2020

thanks!

gregw closed this as completed on Dec 2, 2020