~2GB of objects tied up in SslHandler's pendingUnencryptedWrites #5856
Comments
Interesting.... will have a look |
I'm not entirely sure the stack trace and dump correlate but oddly enough, SslHandler's two flush() method flavors reference that one EMPTY_BUFFER that appears to be the |
@rkapsi - do you have any idea where the EmptyByteBuf's writes are originating from? Anything interesting with your use case that may generate these?
Can you clarify what the
I guess there are buffers other than |
@Scottmitch: Bad wording on my end. PendingWriteQueue's bytes is 0 and size is 21685346 which is approximately the number of https://github.com/rkapsi/sqsp-yk/blob/master/e.png So it appears, it's 21M element linked list of The affected connection is HTTP/2. We're experimenting with H2/1.1 translation using the |
Maybe one thing that is interesting... We have a generic ExceptionHandler at the end of our pipeline. It logs uncaught Exceptions and if the Channel reports it's I'm wondering... Could it be a catch exception, write, flush throws Exception, catch exception, write, flush... death spiral? There is a |
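The suspected catch, write, flush, throw loop can be sketched without any Netty types. This is a toy model under stated assumptions, not the actual handler: `DeathSpiralModel`, `EMPTY`, and the always-failing `flush()` are hypothetical stand-ins. It only shows how each round of such a loop would leave one more zero-byte entry in a pending queue.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: none of these names are Netty classes.
final class DeathSpiralModel {
    // Stands in for an empty buffer that contributes 0 bytes to the queue.
    private static final Object EMPTY = new Object();
    private final Deque<Object> pending = new ArrayDeque<>();

    // Models a flush that always fails, e.g. because a buffer was already released.
    private void flush() {
        throw new IllegalStateException("refCnt: 0");
    }

    // Models exceptionCaught(): enqueue an (empty) error response, then flush.
    // Returns true if the flush threw, i.e. the handler would be invoked again.
    private boolean exceptionCaught() {
        pending.add(EMPTY);
        try {
            flush();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Runs the loop for at most maxRounds and reports how many writes piled up.
    int run(int maxRounds) {
        int rounds = 0;
        while (rounds < maxRounds && exceptionCaught()) {
            rounds++;
        }
        return pending.size();
    }
}
```

In this model the queue grows by one entry per round while its byte count stays at zero, which would be consistent with a very long list of empty writes. Whether the real pipeline behaves this way is exactly the open question in this comment.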
Are you able to reproduce this issue? If so can you add some code to detect when a zero sized object is added when
Have you tried removing this code? |
@Scottmitch I'll see if I can write a simple standalone version and hopefully repro it. It may take a few days. We haven't re-run the previous test and aren't planning to due to some other unrelated reasons. |
@rkapsi thanks! |
@Scottmitch @normanmaurer: I've pushed something to https://github.com/rkapsi/sqsp-yk

I didn't manage to repro exactly this issue, but I found a difference in behavior between Netty 4.1.4 and 4.1.5+ while trying to compel Netty into this error state. To be clear, the ExceptionHandler class in the repro attempt is purposely buggy and should make the whole thing spin in an indefinite loop. The odd thing is that 4.1.4 doesn't. And while 4.1.5+ does spin, there are some strange steps. Please take a close look at the four different scenarios in the ExceptionHandler class.

Notice how this has no effect:

ctx.write(ctx.channel().alloc().buffer().release()).addFutureListener(... recur on error ...);

While this does:

ctx.write(ctx.channel().alloc().buffer().release());
ctx.write(Unpooled.EMPTY).addFutureListener(...recur on error ...);
@rkapsi will try to have a look in the next few days... Currently traveling so may need a few days.
@rkapsi thanks again for the "reproducer". I finally had time to check it out. I think what you see with 4.1.5+ is the correct behaviour, so I think this is now handled correctly. That said, this still doesn't help us with the "original" issue. Any more details there?
@normanmaurer cool. As of yesterday I've been testing h2 again. Our code has significantly changed and it's no longer using Netty's own I get a repro within seconds after deploying to production. With some new logging I'm able to correlate the refCnt exceptions with the YourKit snapshot. Here are some new screenshots + stack traces. The latter should be all from the same connection. https://github.com/rkapsi/sqsp-yk/tree/master/docs/2016-10-12 I can't make much sense of it yet but notice the possibly "strange" flush chain:
Now that I've an isolated repro in prod I'll be able to do some deeper profiling/snapshotting with YourKit. |
@rkapsi would it be possible to see the content of |
@normanmaurer shared a private Github repository with you. |
@rkapsi hmm.. the handler looks good. I somehow think there is a retain() missing somewhere. Not sure yet where tho
I think I'll run YK with allocation recording turned on next. Maybe tomorrow. I want to know where these empty ByteBufs are being produced. In that regard, have you thought about replacing all |
@normanmaurer - some great news! We're always running the latest SNAPSHOT of Netty and we have our own Nexus server proxying Sonatype. The last version we have in our cache is Anyways, you merged af8ef3e on Friday and released 4.1.6.Final shortly after. I don't think there was another SNAPSHOT release, or at least we didn't pick it up. I'm able to reproduce the OOME with our latest/last 4.1.6.Final-SNAPSHOT but not with the 4.1.6.Final release or 4.1.7.Final-SNAPSHOT. I therefore have to assume af8ef3e fixes our problem as well. Reproduce means in this case to run our Netty code with H2 enabled in the production environment and watch it crap out in a matter of a few minutes. The YK dumps with and without allocation information haven't been very useful to identify the underlying problem. Does this make sense?
…o account
Motivation: To guard against the case where a user enqueues a lot of empty or small buffers and so raises an OOME, we also need to take the overhead of the ChannelOutboundBuffer / PendingWriteQueue into account when detecting whether a Channel is writable.
This is related to #5856.
Modifications: When calculating the memory for a message that is enqueued, also add some extra bytes depending on the implementation.
Result: Better guard against OOME.
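The idea in the commit message above can be illustrated with a small stand-alone model. This is a sketch under assumptions, not Netty's implementation: `WritabilityModel`, the 64 KiB high-water mark, and the 64-byte per-entry overhead are all hypothetical values chosen for illustration.

```java
// Toy model of outbound-queue accounting; not Netty's actual code.
final class WritabilityModel {
    private final long highWaterMark;
    private final long perEntryOverhead; // assumed fixed cost per queued entry
    private long pendingBytes;
    private long entries;

    WritabilityModel(long highWaterMark, long perEntryOverhead) {
        this.highWaterMark = highWaterMark;
        this.perEntryOverhead = perEntryOverhead;
    }

    // Accounts for the payload plus the bookkeeping cost of the queue entry.
    void enqueue(int readableBytes) {
        pendingBytes += readableBytes + perEntryOverhead;
        entries++;
    }

    // Writable as long as the accounted size stays within the high-water mark.
    boolean isWritable() {
        return pendingBytes <= highWaterMark;
    }

    long entries() {
        return entries;
    }
}
```

With an overhead of 0, enqueuing any number of empty buffers never changes the accounted size, so the model stays writable forever. With a non-zero overhead, enough empty buffers eventually push it past the high-water mark, which gives a well-behaved writer a signal to stop.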
@normanmaurer @Scottmitch - I think we can close this issue. I attribute it to a combination of these bugs: |
@rkapsi - Thanks for following up. Let's close for now ... plz let us know if you start seeing similar strange behavior.
We did some SSL/H2 testing yesterday and one of our servers crashed in a rather odd way. It seems one SslHandler instance's pendingUnencryptedWrites managed to create a linked-list of 21M PendingWrite instances.
The referenced msg object appears to be in all cases an EmptyByteBuf and the PendingWriteQueue is itself also reporting a size of 0.

At the same time there was a ton of IllegalReferenceCountExceptions coming from
I've uploaded some YK screenshots and 2x2 stack traces. On the same thread it's always a sequence of release() followed by ensureAccessible().

https://github.com/rkapsi/sqsp-yk
I'll not be able to share the dump as it contains some pki secrets but I can dig around if you have any ideas what to look for.
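The release() followed by ensureAccessible() sequence mentioned above can be illustrated with a toy reference-counted object. This is only a sketch: `ToyRefCounted` is a hypothetical class, and it throws `IllegalStateException` from the JDK rather than Netty's `IllegalReferenceCountException`.

```java
// Toy reference-counted object; a simplified stand-in, not a Netty class.
final class ToyRefCounted {
    private int refCnt = 1;

    // Fails once the count has dropped to zero.
    void ensureAccessible() {
        if (refCnt == 0) {
            throw new IllegalStateException("refCnt: 0");
        }
    }

    void retain() {
        ensureAccessible();
        refCnt++;
    }

    // Returns true when this call dropped the count to zero.
    boolean release() {
        ensureAccessible();
        return --refCnt == 0;
    }

    // Any access after the final release() is rejected.
    int read() {
        ensureAccessible();
        return 0;
    }
}
```

In this model, a thread that calls release() and then touches the same object again hits the accessibility check, which matches the ordering seen in the stack traces. A missing retain() before handing the object to another owner would produce the same pattern.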