
Honor SslHandler.setWrapDataSize greater than SSL packet length #13551

Merged
merged 3 commits into netty:4.1 from bbeaudreault:wrap-size on Aug 18, 2023

Conversation

bbeaudreault
Contributor

@bbeaudreault bbeaudreault commented Aug 16, 2023

NOTE: The idea is sound and has been tested in my environment, but the implementation varies from my original POC. I think this approach fits better into the ecosystem, but I have not fully run all the tests locally or validated this approach yet. I wanted to post this in case anyone who sees my issue #13549 wants to see an example of my idea.

Motivation:

When processing large responses, chunking them into many 16 KB chunks causes excessive load on PoolArena, which slows down AbstractUnsafe.flush0.

Modifications:

Allocate an outNetBuf based on setWrapDataSize. If setWrapDataSize is larger than the SSL packet length, iteratively slice the data buffer, encrypting the chunks into one larger output buffer.

Reduce unnecessary composition in AbstractCoalescingBufferQueue when a call to remove() only needs to process one buffer.
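
A minimal sketch of the slicing loop described above (illustrative only: MAX_PLAINTEXT_LENGTH mirrors the TLS record plaintext limit, and WrapSketch/Wrapper are hypothetical stand-ins for SslHandler and SSLEngine internals, not the actual patch):

import io.netty.buffer.ByteBuf;

final class WrapSketch {
    // 16 * 1024 bytes: the maximum plaintext payload of a single TLS record.
    static final int MAX_PLAINTEXT_LENGTH = 16 * 1024;

    interface Wrapper {
        // Stand-in for SSLEngine.wrap(): encrypts one plaintext slice into out.
        void wrap(ByteBuf plaintextSlice, ByteBuf out);
    }

    // Encrypt a large plaintext buffer as a sequence of <= 16 KB slices,
    // all landing in one shared output buffer sized from setWrapDataSize.
    static void wrapMultiple(ByteBuf in, ByteBuf out, Wrapper wrapper) {
        while (in.isReadable() && out.isWritable()) {
            int len = Math.min(in.readableBytes(), MAX_PLAINTEXT_LENGTH);
            wrapper.wrap(in.readSlice(len), out); // readSlice avoids copying the plaintext
        }
        // Any bytes left unread are re-enqueued and handled on the next wrap pass.
    }
}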

Result:

Eliminate contention in PoolArena relative to splitting large responses, thus increasing overall throughput and reducing overall memory pressure.

Relates to #13549

@bbeaudreault
Contributor Author

Had time to do a quick test tonight. This is not working as well as my original POC, so I need to do a little more to finish it up while I do more testing tomorrow.

@Apache9
Contributor

Apache9 commented Aug 16, 2023

+1, this sounds like a good idea to smooth the allocation pressure. A 'packet' in the SSL protocol does not need to be exactly one ByteBuf in netty; we can allocate a large buffer, put many 16 KB packets in it, and write all the packets at once.

Usually SslHandler will be the last outbound handler in the pipeline, and even if it is not, any other handler that wants to parse the SSL packets should follow the typical MessageDecoder approach in netty instead of relying on SslHandler to issue one ByteBuf per packet.

Thanks.
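
A minimal sketch of that MessageDecoder approach (SslRecordFramer is a hypothetical example class, not part of netty; the 5-byte record header layout is standard TLS):

import java.util.List;
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.ByteToMessageDecoder;

public class SslRecordFramer extends ByteToMessageDecoder {
    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) {
        if (in.readableBytes() < 5) {
            return; // wait for a complete 5-byte record header
        }
        // Bytes 3-4 of the header carry the record body length.
        int recordLen = in.getUnsignedShort(in.readerIndex() + 3);
        if (in.readableBytes() < 5 + recordLen) {
            return; // wait for the complete record body
        }
        out.add(in.readRetainedSlice(5 + recordLen)); // emit one record per message
    }
}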

@normanmaurer
Member

@bbeaudreault please let me know once this is ready for review

@normanmaurer
Member

@bbeaudreault I also added two comments to your HBase PR, which I noticed while reviewing your netty-related code there.

@bbeaudreault
Contributor Author

Thank you for looking at that! I finished testing the code here (with some fixes) yesterday. Hoping to clean up and run the netty test suite today so I can submit an update here for you to review.

@bbeaudreault
Contributor Author

@normanmaurer this should be ready for review. I ran this code in HBase in our environment with tcnative (BoringSSL) and JDK SSL, with and without calling setWrapDataSize. I'm using the netty inspection and style profiles. I also ran all of the io.netty.handler.ssl and CoalescingBufferQueue-related tests; I had trouble getting my dev setup to run the full suite.

I'll keep an eye on any failures here from the automation.

I'm far from an expert on SSL or netty; this was my first big deep dive into the area. So please let me know if I missed anything. Thank you for your time.

@normanmaurer
Member

@bbeaudreault thanks a lot... I will have a look. Can you also share some "before"/"after" numbers / flame graphs?

@normanmaurer
Member

@bbeaudreault also, can you please sign our ICLA if you haven't yet? https://netty.io/s/icla

@bbeaudreault
Contributor Author

I signed the ICLA. I have a flamegraph in the issue based on my original design, which did not use setWrapDataSize. I'll work on taking new/clean ones today for before/after; I'll probably have them later in the day, ET.

@normanmaurer
Member

@bbeaudreault btw one "downside" of this approach might be that we might allocate bigger buffers than the maximum pooled buffer size. That said, I think this should be ok as it depends on the configured max wrap size. WDYT?

@bbeaudreault
Contributor Author

@normanmaurer Agreed! That's actually what informed my design here relative to my original POC. In my original POC I just always pulled full buffers from the queue. As I started implementing a cleaner solution, I realized that this could be a big problem for use cases that might send even bigger responses. I thought about adding a new limit, but realized setWrapDataSize is perfect for this.

So I do think this is ok, as it's controlled by the user configuration. I do wonder if users will get the most mileage out of aligning their setWrapDataSize to their PoolArena page size.
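
For reference, a minimal usage sketch of setWrapDataSize (SslSetup and the 64 KiB value are arbitrary examples, not recommendations from this PR):

import io.netty.channel.Channel;
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslHandler;

final class SslSetup {
    // Adds an SslHandler that may coalesce up to 64 KiB of plaintext per wrap pass.
    static void addTls(Channel ch, SslContext sslCtx) {
        SslHandler handler = sslCtx.newHandler(ch.alloc());
        handler.setWrapDataSize(64 * 1024); // example value; consider aligning with allocator sizing
        ch.pipeline().addLast(handler);
    }
}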

@bbeaudreault
Contributor Author

@normanmaurer there are some initial numbers and flamegraphs on this new HBase ticket: https://issues.apache.org/jira/browse/HBASE-28029

@chrisvest
Contributor

The difference on the flame graph looks pretty small. Do you have throughput numbers?

Without the change, I still don't think we should see any OOMs. Unless HBase has no back-pressure on writing?

@bbeaudreault
Contributor Author

HBase previously had no backpressure on writing to the channel. We added it now. In fact, even this fix would not avoid an OOM without backpressure in certain cases. So the backpressure is key, but in digging into that I noticed the contention on PoolArena.
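
A generic sketch of the writability-based backpressure pattern referred to here (BackpressureHandler and its hooks are hypothetical, not HBase's actual code):

import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;

public class BackpressureHandler extends ChannelDuplexHandler {
    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) {
        // Stop producing when the outbound buffer crosses the high water mark,
        // resume once enough has been flushed that the channel is writable again.
        if (ctx.channel().isWritable()) {
            resumeProducing();
        } else {
            suspendProducing();
        }
        ctx.fireChannelWritabilityChanged();
    }

    private void resumeProducing() { /* hypothetical application hook */ }
    private void suspendProducing() { /* hypothetical application hook */ }
}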

The jira above includes a comment with some analysis, including throughput. Despite only about a 3-4% combined change in the netty parts of the flamegraph, we see about a 9% throughput increase on the one test case we've been running. I think that's worth it.

The other thing here is that the contention is in PoolArena, which is a shared resource across all channels. Allocating larger runs of memory to reduce contention from a core handler seems like a good thing that will potentially improve many use cases.

}

if (out == null) {
    out = allocateOutNetBuf(ctx, readableBytes, buf.nioBufferCount() + numPackets);
Contributor

This looks weird: buf.nioBufferCount() + numPackets

Will we be doing a wrap per buffer component, plus a wrap per packet based on the data size?

Contributor Author

In wrapMultiple, we will pass slices up to MAX_PLAINTEXT_LENGTH in length to engine.wrap(). So we will create at least numPackets encrypted packets. If that were it, we could just pass numPackets here, and readableBytes + numPackets * wrapOverhead would be an accurate measure of the total output buffer needed.

But also, engine.wrap() takes a ByteBuffer[] and wraps each element separately. Your input buffer's backing nioBuffers() may not align perfectly to MAX_PLAINTEXT_LENGTH, so there might be a case where we pass 16 KB to engine.wrap() but across 2+ buffers, thus 2+ packets for what we thought was one.

So this attempts to account for both of those constraints, albeit very imperfectly. I'm open to other options here, but I didn't try too hard because the estimate doesn't need to be perfect. If we underestimate, we'll just end up re-enqueuing the remaining bytes and picking them up on the next outer while loop of wrap. If we overestimate, there might be some extra unused bytes at the end of our outNetBuf, which will be reclaimed once we finish.
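
In code form, the estimate amounts to roughly this sketch (OutNetBufEstimate is illustrative; the names mirror the discussion, not necessarily the exact patch):

final class OutNetBufEstimate {
    // Budget one record's overhead for every guaranteed packet, plus one
    // potential extra record per underlying NIO buffer component.
    static int estimateSize(int readableBytes, int nioBufferCount,
                            int maxPlaintextLength, int wrapOverhead) {
        int numPackets = (readableBytes + maxPlaintextLength - 1) / maxPlaintextLength; // ceil
        return readableBytes + (nioBufferCount + numPackets) * wrapOverhead;
    }
}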

Contributor

I see, hence the comments about over-/under-estimating.

Contributor Author

I could see a case being made for deciding to just use numPackets, which would likely often be an underestimate whenever input buffers have more than 1 nioBuffer. The downside of underestimating is that we need to re-enqueue the remaining portion of the input buffer back into SslHandlerCoalescingBufferQueue and re-poll it, allocate a new smaller buffer, etc. If this happens once, it's not a big deal. If it happens many times, we start to regress back to the original design.

The downside of overestimation is that we waste a bit of memory temporarily until it can be released by AbstractUnsafe.flush0. That would be bad if we could overestimate by a lot, but I don't think we will in most cases -- we might be off by a few packets * 100. In my tests so far I've noticed that if anything it usually underestimates a little bit, which is not ideal, but 2 allocations are far better than 300 (as evidenced by not really seeing PoolArena in the profile anymore), and I'm not sure I can get perfect.

@normanmaurer normanmaurer merged commit 5cb4974 into netty:4.1 Aug 18, 2023
14 checks passed
@normanmaurer
Member

@bbeaudreault thanks a lot!

@normanmaurer normanmaurer added this to the 4.1.97.Final milestone Aug 18, 2023
normanmaurer added a commit that referenced this pull request Aug 18, 2023
Motivation:

When processing large responses, chunking them into many 16kb chunks
causes excessive load on PoolArena which slows down
AbstractUnsafe.flush0.

Modifications:

Allocate an outNetBuf based on setWrapDataSize. If setWrapDataSize is
larger than SSL packet length, iteratively slice the data buffer,
encrypting chunks into the one larger output buffer.

Reduce unnecessary composition in AbstractCoalescingBufferQueue when a
call to remove() only needs to process one buffer.

Result:

Eliminate contention in PoolArena relative to splitting large responses,
thus increasing overall throughput and reducing overall memory pressure.

Relates to #13549

---------

Co-authored-by: Norman Maurer <norman_maurer@apple.com>
@bbeaudreault
Contributor Author

Thank you both for the quick review here!

@bbeaudreault bbeaudreault deleted the wrap-size branch August 18, 2023 13:40