
~2GB of objects tied up in SslHandler's pendingUnencryptedWrites #5856

Closed
rkapsi opened this issue Sep 23, 2016 · 19 comments

@rkapsi
Member

rkapsi commented Sep 23, 2016

We did some SSL/H2 testing yesterday and one of our servers crashed in a rather odd way. It seems one SslHandler instance's pendingUnencryptedWrites managed to create a linked-list of 21M PendingWrite instances.

The referenced msg object appears to be in all cases an EmptyByteBuf and the PendingWriteQueue is itself also reporting a size of 0.

At the same time there was a ton of IllegalReferenceCountExceptions coming from

    at io.netty.channel.PendingWriteQueue.removeAndFailAll(PendingWriteQueue.java:171) [netty-all-4.1.6.Final-SNAPSHOT.jar:4.1.6.Final-SNAPSHOT]
    at io.netty.handler.ssl.SslHandler.flush(SslHandler.java:493) [netty-all-4.1.6.Final-SNAPSHOT.jar:4.1.6.Final-SNAPSHOT]

I've uploaded some YourKit (YK) screenshots and 2x2 stack traces. On the same thread it's always a sequence of release() followed by ensureAccessible().

https://github.com/rkapsi/sqsp-yk

I won't be able to share the dump as it contains some PKI secrets, but I can dig around if you have any ideas about what to look for.

@normanmaurer
Member

Interesting.... will have a look

@rkapsi
Member Author

rkapsi commented Sep 23, 2016

I'm not entirely sure the stack trace and dump correlate but oddly enough, SslHandler's two flush() method flavors reference that one EMPTY_BUFFER that appears to be the msg in the dump.

@Scottmitch
Member

Scottmitch commented Sep 23, 2016

@rkapsi - do you have any idea where the EmptyByteBuf's writes are originating from? Anything interesting with your use case that may generate these?

PendingWriteQueue is itself also reporting a size of 0.

Can you clarify what the bytes() method returns, and what the size() method returns?

there was a ton of IllegalReferenceCountExceptions coming from

I guess there are buffers other than EmptyByteBuf involved? EmptyByteBuf cannot be released and always returns 1 for refCnt.
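For reference, this is easy to verify (a minimal sketch against Netty 4.1; Unpooled.EMPTY_BUFFER is the shared EmptyByteBuf singleton):

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;

    public class EmptyByteBufDemo {
        public static void main(String[] args) {
            ByteBuf empty = Unpooled.EMPTY_BUFFER; // shared EmptyByteBuf singleton
            System.out.println(empty.refCnt());    // always prints 1
            System.out.println(empty.release());   // prints false; the singleton is never deallocated and never throws
        }
    }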

@rkapsi
Member Author

rkapsi commented Sep 23, 2016

@Scottmitch: Bad wording on my end. PendingWriteQueue's bytes() is 0 and size() is 21685346, which is approximately the number of $PendingWrite instances. Each $PendingWrite reports a size of 0 and has an EmptyByteBuf as its msg.

https://github.com/rkapsi/sqsp-yk/blob/master/e.png

So it appears to be a 21M-element linked list of $PendingWrites with EmptyByteBufs.

The affected connection is HTTP/2. We're experimenting with H2/1.1 translation using the HttpToHttp2ConnectionHandler (currently in conjunction with the stock InboundHttp2ToHttpAdapter and a maxConcurrency of 1). Other than that there is nothing special. Under normal circumstances there are other ByteBufs involved as well (as per the stack trace), so in that regard I don't know how much the exceptions and the dump correlate.
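For context, the setup described above roughly corresponds to the following (a sketch only, using the stock netty-codec-http2 builders; the content-length limit and the way concurrency is capped here are assumptions, not the actual server code):

    import io.netty.channel.ChannelPipeline;
    import io.netty.handler.codec.http2.DefaultHttp2Connection;
    import io.netty.handler.codec.http2.Http2Connection;
    import io.netty.handler.codec.http2.Http2Settings;
    import io.netty.handler.codec.http2.HttpToHttp2ConnectionHandler;
    import io.netty.handler.codec.http2.HttpToHttp2ConnectionHandlerBuilder;
    import io.netty.handler.codec.http2.InboundHttp2ToHttpAdapterBuilder;
    import io.netty.handler.ssl.SslHandler;

    final class H2PipelineSketch {
        static void configure(ChannelPipeline pipeline, SslHandler ssl) {
            Http2Connection connection = new DefaultHttp2Connection(true); // server side
            HttpToHttp2ConnectionHandler h2 = new HttpToHttp2ConnectionHandlerBuilder()
                    .frameListener(new InboundHttp2ToHttpAdapterBuilder(connection)
                            .maxContentLength(1024 * 1024) // illustrative limit
                            .propagateSettings(false)
                            .build())
                    .initialSettings(new Http2Settings().maxConcurrentStreams(1)) // the "maxConcurrency of 1"
                    .connection(connection)
                    .build();
            pipeline.addLast(ssl, h2); // application handlers (e.g. the ExceptionHandler) follow
        }
    }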

@rkapsi
Member Author

rkapsi commented Sep 23, 2016

Maybe one thing that is interesting... We have a generic ExceptionHandler at the end of our pipeline. It logs uncaught Exceptions and, if the Channel reports isActive() == true, it tries to write an error message.

I'm wondering... could it be a catch exception, write, flush throws exception, catch exception, write, flush... death spiral? There is a ctx#close() in that ExceptionHandler but it isn't in a finally block.
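Something like this, in other words (a made-up sketch of the suspected pattern, not the real handler; the error message is invented):

    import io.netty.buffer.Unpooled;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import io.netty.util.CharsetUtil;

    public class ExceptionHandler extends ChannelInboundHandlerAdapter {
        @Override
        public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
            if (ctx.channel().isActive()) {
                // If this write/flush itself fails and surfaces as another exceptionCaught(),
                // the handler loops: catch -> write -> flush -> exception -> catch -> ...
                ctx.writeAndFlush(Unpooled.copiedBuffer("internal error\n", CharsetUtil.UTF_8));
            }
            // close() is present but not in a finally block, so a synchronous throw above skips it
            ctx.close();
        }
    }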

@Scottmitch
Member

Scottmitch commented Sep 24, 2016

Are you able to reproduce this issue? If so, can you add some code to detect when a zero-sized object is added while size() != 0 && bytes() == 0, and report back?
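A cruder, drop-in variant of that idea at the pipeline level might look like the following (handler name and logging are made up; it only surfaces where empty ByteBufs are written, not the queue's internal counters):

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelOutboundHandlerAdapter;
    import io.netty.channel.ChannelPromise;
    import io.netty.util.internal.logging.InternalLogger;
    import io.netty.util.internal.logging.InternalLoggerFactory;

    public class EmptyWriteDetector extends ChannelOutboundHandlerAdapter {
        private static final InternalLogger logger =
                InternalLoggerFactory.getInstance(EmptyWriteDetector.class);

        @Override
        public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) {
            // Log a stack trace whenever a zero-byte ByteBuf is written so its origin shows up.
            if (msg instanceof ByteBuf && !((ByteBuf) msg).isReadable()) {
                logger.warn("empty ByteBuf write", new Exception("write origin"));
            }
            ctx.write(msg, promise);
        }
    }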

... death spiral

Have you tried removing this code?

@rkapsi
Member Author

rkapsi commented Sep 26, 2016

@Scottmitch I'll see if I can write a simple standalone version and hopefully repro it. It may take a few days.

We haven't re-run the previous test and aren't planning to due to some other unrelated reasons.

@normanmaurer
Member

@rkapsi thanks!

@rkapsi
Member Author

rkapsi commented Oct 1, 2016

@Scottmitch @normanmaurer: I've pushed something to https://github.com/rkapsi/sqsp-yk

I didn't manage to repro exactly this issue, but I found a different behavior between Netty 4.1.4 and 4.1.5+ while trying to compel Netty into this error state.

To be clear, the ExceptionHandler class in the repro attempt is purposely buggy and should make the whole thing spin in an indefinite loop. The odd thing is that 4.1.4 doesn't. And while 4.1.5+ does spin, there are some strange steps. Please take a close look at the four different scenarios in the ExceptionHandler class.

Notice how this has no effect:

    ByteBuf buf = ctx.channel().alloc().buffer(); buf.release();
    ctx.write(buf).addListener(... recur on error ...);

While this does:

    ByteBuf buf = ctx.channel().alloc().buffer(); buf.release();
    ctx.write(buf);
    ctx.write(Unpooled.EMPTY_BUFFER).addListener(... recur on error ...);

@normanmaurer
Member

@rkapsi will try to have a look in the next few days... Currently traveling, so it may take a few days.

@normanmaurer normanmaurer self-assigned this Oct 3, 2016
@normanmaurer
Member

@rkapsi thanks again for the "reproducer". I finally had time to check it out. I think what you see with 4.1.5+ is the correct behaviour, so I think this is now handled correctly. That said, this still doesn't help us with the "original" issue. Any more details there?

@rkapsi
Member Author

rkapsi commented Oct 12, 2016

@normanmaurer cool. As of yesterday I've been testing H2 again. Our code has changed significantly and is no longer using Netty's own InboundHttp2ToHttpAdapter.

I get a repro within seconds after deploying to production. With some new logging I'm able to correlate the refCnt exceptions with the YourKit snapshot.

Here are some new screenshots + stack traces. The latter should all be from the same connection.

https://github.com/rkapsi/sqsp-yk/tree/master/docs/2016-10-12

I can't make much sense of it yet but notice the possibly "strange" flush chain:

IllegalReferenceCountException
  AbstractReferenceCountedByteBuf.release0(AbstractReferenceCountedByteBuf.java:110)
  SslHandler.flush(SslHandler.java:493)
  Http2ConnectionHandler.flush(Http2ConnectionHandler.java:163)
  Http2ConnectionHandler.channelWritabilityChanged(Http2ConnectionHandler.java:385)
  ChannelOutboundBuffer.setWritable(ChannelOutboundBuffer.java:555)
  NioSocketChannel.doWrite(NioSocketChannel.java:442)
  SslHandler.flush(SslHandler.java:499)
  Http2ConnectionHandler.flush(Http2ConnectionHandler.java:163)

Now that I have an isolated repro in prod, I'll be able to do some deeper profiling/snapshotting with YourKit.

@normanmaurer
Member

@rkapsi would it be possible to see the content of ProxyChannelHandler? You could even send me an email if you don't want to attach it in public.

@rkapsi
Member Author

rkapsi commented Oct 13, 2016

@normanmaurer shared a private Github repository with you.

@normanmaurer
Member

@rkapsi hmm.. the handler looks good. I somehow think there is a retain() missing somewhere. Not sure yet where, though.

@rkapsi
Member Author

rkapsi commented Oct 13, 2016

I think I'll run YK with allocation recording turned on next, maybe tomorrow. I want to know where these empty ByteBufs are being produced.

In that regard, have you thought about replacing all Unpooled#EMPTY_BUFFER and similar things such as LastHttpContent#EMPTY_LAST_CONTENT with T#empty() and a system property to toggle between these singletons (default) and new T()?
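A minimal sketch of that proposal (the helper class and system property name are hypothetical, not existing Netty API):

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;

    public final class EmptyBuffers {
        // Hypothetical toggle; defaults to handing out the shared singleton.
        private static final boolean SHARE_SINGLETON =
                !Boolean.getBoolean("io.netty.noSharedEmptyBuffers");

        private EmptyBuffers() { }

        public static ByteBuf empty() {
            // Opt-in mode returns a fresh zero-capacity buffer so every "empty" write
            // is an independently tracked (and releasable) object.
            return SHARE_SINGLETON ? Unpooled.EMPTY_BUFFER : Unpooled.buffer(0);
        }
    }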

@rkapsi
Member Author

rkapsi commented Oct 17, 2016

@normanmaurer - some great news!

We're always running the latest SNAPSHOT of Netty and we have our own Nexus server proxying Sonatype. The last version we have in our cache is netty-all-4.1.6.Final-20161013.121908-45 released on Thu Oct 13 12:19:08 GMT 2016. It's already purged from yours.

Anyways, you merged af8ef3e on Friday and released 4.1.6.Final shortly after. I don't think there was another SNAPSHOT release, or at least we didn't pick it up.

I'm able to reproduce the OOME with our latest/last 4.1.6.Final-SNAPSHOT but not with the 4.1.6.Final release nor 4.1.7.Final-SNAPSHOT. I therefore have to assume af8ef3e fixes our problem as well.

Reproduce means in this case running our Netty code with H2 enabled in the production environment and watching it crap out in a matter of a few minutes. The YK dumps, with and without allocation information, haven't been very useful for identifying the underlying problem. Does this make sense?

normanmaurer added a commit that referenced this issue Oct 31, 2016
…o account

Motivation:

To guard against the case that a user will enqueue a lot of empty or small buffers and so raise an OOME, we need to also take the overhead of the ChannelOutboundBuffer / PendingWriteQueue into account when detecting if a Channel is writable or not. This is related to #5856.

Modifications:

When calculating the memory for a message that is enqueued, also add some extra bytes depending on the implementation.

Result:

Better guard against OOME.
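The gist of that change, boiled down (a simplified sketch; the class and overhead constant below are illustrative, not Netty's actual internals):

    // Each queued write is charged a fixed per-entry overhead on top of its payload
    // size, so millions of empty buffers still push the channel over the write-buffer
    // high water mark and flip it to non-writable instead of growing until OOME.
    final class PendingBytesSketch {
        private static final int PER_ENTRY_OVERHEAD = 96; // assumed bookkeeping cost per entry
        private long totalPendingSize;

        void incrementPendingOutboundBytes(long messageSize) {
            totalPendingSize += messageSize + PER_ENTRY_OVERHEAD;
        }

        void decrementPendingOutboundBytes(long messageSize) {
            totalPendingSize -= messageSize + PER_ENTRY_OVERHEAD;
        }

        boolean isWritable(long highWaterMark) {
            return totalPendingSize < highWaterMark;
        }
    }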
normanmaurer added a commit that referenced this issue Nov 3, 2016
…o account

normanmaurer added a commit that referenced this issue Nov 3, 2016
…o account
@rkapsi
Member Author

rkapsi commented Nov 17, 2016

@normanmaurer @Scottmitch - I think we can close this issue. I attribute it to a combination of these bugs:

  1. af8ef3e
  2. HTTP/2 WeightedFairQueueByteDistributor Bug #6004
  3. retained[Slice|Duplicate] buffer reference count bug #6024

@Scottmitch
Member

@rkapsi - Thanks for following up. Let's close for now ... please let us know if you start seeing similar strange behavior.

liuzhengyang pushed a commit to liuzhengyang/netty that referenced this issue Sep 10, 2017
…o account

pulllock pushed a commit to pulllock/netty that referenced this issue Oct 19, 2023
…o account