
~2GB of objects tied up in SslHandler's pendingUnencryptedWrites #5856

Closed
rkapsi opened this issue Sep 23, 2016 · 19 comments

@rkapsi
Member

rkapsi commented Sep 23, 2016

We did some SSL/H2 testing yesterday and one of our servers crashed in a rather odd way. It seems one SslHandler instance's pendingUnencryptedWrites managed to create a linked-list of 21M PendingWrite instances.

The referenced msg object appears to be in all cases an EmptyByteBuf and the PendingWriteQueue is itself also reporting a size of 0.

At the same time there was a ton of IllegalReferenceCountExceptions coming from

    at io.netty.channel.PendingWriteQueue.removeAndFailAll(PendingWriteQueue.java:171) [netty-all-4.1.6.Final-SNAPSHOT.jar:4.1.6.Final-SNAPSHOT]
    at io.netty.handler.ssl.SslHandler.flush(SslHandler.java:493) [netty-all-4.1.6.Final-SNAPSHOT.jar:4.1.6.Final-SNAPSHOT]

I've uploaded some YourKit (YK) screenshots and 2x2 stack traces. On the same thread it's always a sequence of release() followed by ensureAccessible().

https://github.com/rkapsi/sqsp-yk

I won't be able to share the dump as it contains some PKI secrets, but I can dig around if you have any ideas about what to look for.

@normanmaurer
Member

Interesting.... will have a look

@rkapsi
Member Author

rkapsi commented Sep 23, 2016

I'm not entirely sure the stack trace and dump correlate but oddly enough, SslHandler's two flush() method flavors reference that one EMPTY_BUFFER that appears to be the msg in the dump.

@Scottmitch
Member

Scottmitch commented Sep 23, 2016

@rkapsi - do you have any idea where the EmptyByteBuf's writes are originating from? Anything interesting with your use case that may generate these?

PendingWriteQueue is itself also reporting a size of 0.

Can you clarify what the bytes() method returns, and what the size() method returns?

there was a ton of IllegalReferenceCountExceptions coming from

I guess there are buffers other than EmptyByteBuf involved? EmptyByteBuf cannot be released and always returns 1 for refCnt.
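For reference, this is easy to verify (a minimal sketch against Netty 4.1; Unpooled.EMPTY_BUFFER is the shared EmptyByteBuf singleton):

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;

    public class EmptyByteBufDemo {
        public static void main(String[] args) {
            ByteBuf empty = Unpooled.EMPTY_BUFFER; // shared EmptyByteBuf singleton
            System.out.println(empty.refCnt());    // always prints 1
            System.out.println(empty.release());   // prints false; the singleton is never deallocated and never throws
        }
    }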

@rkapsi
Member Author

rkapsi commented Sep 23, 2016

@Scottmitch: Bad wording on my end. PendingWriteQueue's bytes() is 0 and size() is 21685346, which is approximately the number of $PendingWrite instances. Each $PendingWrite reports a size of 0 and has an EmptyByteBuf as its msg.

https://github.com/rkapsi/sqsp-yk/blob/master/e.png

So it appears to be a 21M-element linked list of $PendingWrites with EmptyByteBufs.

The affected connection is HTTP/2. We're experimenting with H2/1.1 translation using the HttpToHttp2ConnectionHandler (currently in conjunction with the stock InboundHttp2ToHttpAdapter and a maxConcurrency of 1). Other than that there is nothing special. Under normal circumstances there are other ByteBufs involved as well (as per the stack trace), so in that regard I don't know how much the exceptions and the dump correlate.
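For context, the setup described above roughly corresponds to the following (a sketch only, using the stock netty-codec-http2 builders; the content-length limit and the way concurrency is capped here are assumptions, not the actual server code):

    import io.netty.channel.ChannelPipeline;
    import io.netty.handler.codec.http2.DefaultHttp2Connection;
    import io.netty.handler.codec.http2.Http2Connection;
    import io.netty.handler.codec.http2.Http2Settings;
    import io.netty.handler.codec.http2.HttpToHttp2ConnectionHandler;
    import io.netty.handler.codec.http2.HttpToHttp2ConnectionHandlerBuilder;
    import io.netty.handler.codec.http2.InboundHttp2ToHttpAdapterBuilder;
    import io.netty.handler.ssl.SslHandler;

    final class H2PipelineSketch {
        static void configure(ChannelPipeline pipeline, SslHandler ssl) {
            Http2Connection connection = new DefaultHttp2Connection(true); // server side
            HttpToHttp2ConnectionHandler h2 = new HttpToHttp2ConnectionHandlerBuilder()
                    .frameListener(new InboundHttp2ToHttpAdapterBuilder(connection)
                            .maxContentLength(1024 * 1024) // illustrative limit
                            .propagateSettings(false)
                            .build())
                    .initialSettings(new Http2Settings().maxConcurrentStreams(1)) // the "maxConcurrency of 1"
                    .connection(connection)
                    .build();
            pipeline.addLast(ssl, h2); // application handlers (e.g. the ExceptionHandler) follow
        }
    }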

@rkapsi
Member Author

rkapsi commented Sep 23, 2016

Maybe one thing that is interesting... We have a generic ExceptionHandler at the end of our pipeline. It logs uncaught Exceptions and, if the Channel reports isActive() == true, it tries to write an error message.

I'm wondering... could it be a catch exception, write, flush throws exception, catch exception, write, flush... death spiral? There is a ctx#close() in that ExceptionHandler but it isn't in a finally block.
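Something like this, in other words (a made-up sketch of the suspected pattern, not the real handler; the error message is invented):

    import io.netty.buffer.Unpooled;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import io.netty.util.CharsetUtil;

    public class ExceptionHandler extends ChannelInboundHandlerAdapter {
        @Override
        public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
            if (ctx.channel().isActive()) {
                // If this write/flush itself fails and surfaces as another exceptionCaught(),
                // the handler loops: catch -> write -> flush -> exception -> catch -> ...
                ctx.writeAndFlush(Unpooled.copiedBuffer("internal error\n", CharsetUtil.UTF_8));
            }
            // close() is present but not in a finally block, so a synchronous throw above skips it
            ctx.close();
        }
    }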

@Scottmitch
Member

Scottmitch commented Sep 24, 2016

Are you able to reproduce this issue? If so, can you add some code to detect when a zero-sized object is added while size() != 0 && bytes() == 0, and report back?
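A cruder, drop-in variant of that idea at the pipeline level might look like the following (handler name and logging are made up; it only surfaces where empty ByteBufs are written, not the queue's internal counters):

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelOutboundHandlerAdapter;
    import io.netty.channel.ChannelPromise;
    import io.netty.util.internal.logging.InternalLogger;
    import io.netty.util.internal.logging.InternalLoggerFactory;

    public class EmptyWriteDetector extends ChannelOutboundHandlerAdapter {
        private static final InternalLogger logger =
                InternalLoggerFactory.getInstance(EmptyWriteDetector.class);

        @Override
        public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) {
            // Log a stack trace whenever a zero-byte ByteBuf is written so its origin shows up.
            if (msg instanceof ByteBuf && !((ByteBuf) msg).isReadable()) {
                logger.warn("empty ByteBuf write", new Exception("write origin"));
            }
            ctx.write(msg, promise);
        }
    }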

... death spiral

Have you tried removing this code?

@rkapsi
Member Author

rkapsi commented Sep 26, 2016

@Scottmitch I'll see if I can write a simple standalone version and hopefully repro it. It may take a few days.

We haven't re-run the previous test and aren't planning to due to some other unrelated reasons.

@normanmaurer
Member

@rkapsi thanks!

@rkapsi
Member Author

rkapsi commented Oct 1, 2016

@Scottmitch @normanmaurer: I've pushed something to https://github.com/rkapsi/sqsp-yk

I didn't manage to repro exactly this issue, but I found a different behavior between Netty 4.1.4 and 4.1.5+ while trying to compel Netty into this error state.

To be clear, the ExceptionHandler class in the repro attempt is purposely buggy and should make the whole thing spin in an indefinite loop. The odd thing is that 4.1.4 doesn't. And while 4.1.5+ does spin, there are some strange steps. Please take a close look at the four different scenarios in the ExceptionHandler class.

Notice how this has no effect:

    ByteBuf buf = ctx.channel().alloc().buffer(); buf.release();
    ctx.write(buf).addListener(... recur on error ...);

While this does:

    ByteBuf buf = ctx.channel().alloc().buffer(); buf.release();
    ctx.write(buf);
    ctx.write(Unpooled.EMPTY_BUFFER).addListener(... recur on error ...);

@normanmaurer
Member

@rkapsi will try to have a look in the next few days... Currently traveling, so it may take a few days.

@normanmaurer normanmaurer self-assigned this Oct 3, 2016
@normanmaurer
Member

@rkapsi thanks again for the "reproducer". I finally had time to check it out. I think what you see with 4.1.5+ is the correct behaviour, so I think this is now handled correctly. That said, this still doesn't help us with the "original" issue. Any more details there?

@rkapsi
Member Author

rkapsi commented Oct 12, 2016

@normanmaurer cool. As of yesterday I've been testing H2 again. Our code has changed significantly and is no longer using Netty's own InboundHttp2ToHttpAdapter.

I get a repro within seconds after deploying to production. With some new logging I'm able to correlate the refCnt exceptions with the YourKit snapshot.

Here are some new screenshots + stack traces. The latter should all be from the same connection.

https://github.com/rkapsi/sqsp-yk/tree/master/docs/2016-10-12

I can't make much sense of it yet but notice the possibly "strange" flush chain:

IllegalReferenceCountException
  AbstractReferenceCountedByteBuf.release0(AbstractReferenceCountedByteBuf.java:110)
  SslHandler.flush(SslHandler.java:493)
  Http2ConnectionHandler.flush(Http2ConnectionHandler.java:163)
  Http2ConnectionHandler.channelWritabilityChanged(Http2ConnectionHandler.java:385)
  ChannelOutboundBuffer.setWritable(ChannelOutboundBuffer.java:555)
  NioSocketChannel.doWrite(NioSocketChannel.java:442)
  SslHandler.flush(SslHandler.java:499)
  Http2ConnectionHandler.flush(Http2ConnectionHandler.java:163)

Now that I have an isolated repro in prod, I'll be able to do some deeper profiling/snapshotting with YourKit.

@normanmaurer
Member

@rkapsi would it be possible to see the content of ProxyChannelHandler? You could even send me an email if you don't want to attach it in public.

@rkapsi
Member Author

rkapsi commented Oct 13, 2016

@normanmaurer shared a private Github repository with you.

@normanmaurer
Member

@rkapsi hmm.. the handler looks good. I somehow think there is a retain() missing somewhere. Not sure yet where, though.

@rkapsi
Member Author

rkapsi commented Oct 13, 2016

I think I'll run YK with allocation recording turned on next, maybe tomorrow. I want to know where these empty ByteBufs are being produced.

In that regard, have you thought about replacing all Unpooled#EMPTY_BUFFER and similar things such as LastHttpContent#EMPTY_LAST_CONTENT with T#empty() and a system property to toggle between these singletons (default) and new T()?
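A minimal sketch of that proposal (the helper class and system property name are hypothetical, not existing Netty API):

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;

    public final class EmptyBuffers {
        // Hypothetical toggle; defaults to handing out the shared singleton.
        private static final boolean SHARE_SINGLETON =
                !Boolean.getBoolean("io.netty.noSharedEmptyBuffers");

        private EmptyBuffers() { }

        public static ByteBuf empty() {
            // Opt-in mode returns a fresh zero-capacity buffer so every "empty" write
            // is an independently tracked (and releasable) object.
            return SHARE_SINGLETON ? Unpooled.EMPTY_BUFFER : Unpooled.buffer(0);
        }
    }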

@rkapsi
Member Author

rkapsi commented Oct 17, 2016

@normanmaurer - some great news!

We're always running the latest SNAPSHOT of Netty and we have our own Nexus server proxying Sonatype. The last version we have in our cache is netty-all-4.1.6.Final-20161013.121908-45 released on Thu Oct 13 12:19:08 GMT 2016. It's already purged from yours.

Anyways, you merged af8ef3e on Friday and released 4.1.6.Final shortly after. I don't think there was another SNAPSHOT release, or at least we didn't pick it up.

I'm able to reproduce the OOME with our latest/last 4.1.6.Final-SNAPSHOT but not with the 4.1.6.Final release nor 4.1.7.Final-SNAPSHOT. I therefore have to assume af8ef3e fixes our problem as well.

Reproduce means in this case running our Netty code with H2 enabled in the production environment and watching it crap out in a matter of a few minutes. The YK dumps, with and without allocation information, haven't been very useful for identifying the underlying problem. Does this make sense?

normanmaurer added a commit that referenced this issue Oct 31, 2016
…o account

Motivation:

To guard against the case that a user will enqueue a lot of empty or small buffers and so raise an OOME, we need to also take the overhead of the ChannelOutboundBuffer / PendingWriteQueue into account when detecting if a Channel is writable or not. This is related to #5856.

Modifications:

When calculating the memory for a message that is enqueued, also add some extra bytes depending on the implementation.

Result:

Better guard against OOME.
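The gist of that change, boiled down (a simplified sketch; the class and overhead constant below are illustrative, not Netty's actual internals):

    // Each queued write is charged a fixed per-entry overhead on top of its payload
    // size, so millions of empty buffers still push the channel over the write-buffer
    // high water mark and flip it to non-writable instead of growing until OOME.
    final class PendingBytesSketch {
        private static final int PER_ENTRY_OVERHEAD = 96; // assumed bookkeeping cost per entry
        private long totalPendingSize;

        void incrementPendingOutboundBytes(long messageSize) {
            totalPendingSize += messageSize + PER_ENTRY_OVERHEAD;
        }

        void decrementPendingOutboundBytes(long messageSize) {
            totalPendingSize -= messageSize + PER_ENTRY_OVERHEAD;
        }

        boolean isWritable(long highWaterMark) {
            return totalPendingSize < highWaterMark;
        }
    }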
normanmaurer added a commit that referenced this issue Nov 3, 2016
…o account

normanmaurer added a commit that referenced this issue Nov 3, 2016
…o account
@rkapsi
Member Author

rkapsi commented Nov 17, 2016

@normanmaurer @Scottmitch - I think we can close this issue. I attribute it to a combination of these bugs:

  1. af8ef3e
  2. HTTP/2 WeightedFairQueueByteDistributor Bug #6004
  3. retained[Slice|Duplicate] buffer reference count bug #6024

@Scottmitch
Member

@rkapsi - Thanks for following up. Let's close for now ... please let us know if you start seeing similar strange behavior.

liuzhengyang pushed a commit to liuzhengyang/netty that referenced this issue Sep 10, 2017
…o account

pulllock pushed a commit to pulllock/netty that referenced this issue Oct 19, 2023
…o account