
HttpObjectEncoder scalability issue due to instanceof checks (Fixes #12708) #12709

Merged
merged 1 commit into netty:4.1 from netty-4.1.74.Final on Aug 25, 2022

Conversation

franz1981
Contributor

@franz1981 franz1981 commented Aug 17, 2022

Motivation:

The current HTTP encoder logic causes contention on the klass cache field that the JIT uses to speed up instanceof checks against interfaces,
preventing scaling on multi-core machines.
See https://bugs.openjdk.org/browse/JDK-8180450 for more info.

Modifications:

Duplicated code to reduce the arity of the morphism of the instanceof checks in the HTTP encoder.
Stopped using inherited encoder methods, to take control of the arity of the
call-site morphism while releasing ref-counted HTTP message types.

Result:

Scalable HTTP encoding
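
For illustration, here is a rough, self-contained sketch (not Netty code; all class and interface names are made up) of the pattern behind JDK-8180450: several concrete types are checked with instanceof against different interfaces from different threads, so each check keeps rewriting the single secondary_super_cache slot of the message's class and the cores contend on the same cache line. A real measurement would need JMH; this only shows the shape of the problem.

interface HeaderTrait {}
interface ContentTrait {}

// Several concrete types so the instanceof call sites stay megamorphic
// and the JIT has to fall back to the generic subtype check.
final class FullMsg implements HeaderTrait, ContentTrait {}
final class ChunkMsg implements HeaderTrait, ContentTrait {}
final class LastChunkMsg implements HeaderTrait, ContentTrait {}

public final class SecondarySuperCacheContention {
    static volatile boolean sink;

    static void spin(Object[] msgs, boolean checkHeader) {
        for (int i = 0; i < 200_000_000; i++) {
            Object msg = msgs[i % msgs.length];
            // The generic subtype check consults and rewrites the single
            // secondary_super_cache slot of msg's class; two threads checking
            // *different* interfaces keep overwriting each other's entry.
            sink = checkHeader ? msg instanceof HeaderTrait
                               : msg instanceof ContentTrait;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Object[] msgs = { new FullMsg(), new ChunkMsg(), new LastChunkMsg() };
        Thread t1 = new Thread(() -> spin(msgs, true));
        Thread t2 = new Thread(() -> spin(msgs, false));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}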

@normanmaurer
Member

@franz1981 can you share any numbers that prove the gain?

@franz1981
Contributor Author

franz1981 commented Aug 17, 2022

Yep @normanmaurer, I'm going to publish both an end-to-end benchmark result and the micro-benchmark I've modified in Netty, but I'm also trying a different fix that won't require holding any Class field nor using specific concrete types (at the cost of a few more checks and some duplication)

@franz1981
Contributor Author

franz1981 commented Aug 17, 2022

@normanmaurer
this is the TechEmpower plaintext benchmark, with the encode method "polluted" by sending some chunked responses before the test; blue is 4.1 while green is this PR, with both using full HTTP responses at steady state (meaning they measure the same thing/same application code).

[chart: TechEmpower plaintext throughput, 4.1 (blue) vs this PR (green)]

The machine is a 32-core one; if we run the same tests with a single core there's no slowdown (as expected, given that it's a contention issue) - well, not exactly: the "type pollution" still causes the instanceof checks to happen for real, making them slower, but without any scalability impact.

@franz1981
Contributor Author

I am trying hard to create a version of this fix that doesn't require any hint about the "full concrete http type" used, making it more transparent for existing users, so please hold on :)

@franz1981
Contributor Author

franz1981 commented Aug 18, 2022

This is the result of the microbenchmark with type pollution
(note: although the original benchmark doesn't use black holes, I've already verified in the assembly that dead-code elimination isn't happening):

This PR 1 thread:

Benchmark                                   (pooledAllocator)  (typePollution)  (voidPromise)   Mode  Cnt        Score        Error  Units
HttpRequestEncoderBenchmark.chunked                      true            false           true  thrpt   10  1036698.667 ±   8107.867  ops/s
HttpRequestEncoderBenchmark.chunked                      true             true           true  thrpt   10   905807.417 ±  13772.428  ops/s
HttpRequestEncoderBenchmark.contentLength                true            false           true  thrpt   10  1561759.238 ±  10901.522  ops/s
HttpRequestEncoderBenchmark.contentLength                true             true           true  thrpt   10  1299413.850 ±  26776.253  ops/s
HttpRequestEncoderBenchmark.differentTypes               true            false           true  thrpt   10   440180.570 ±  14614.238  ops/s
HttpRequestEncoderBenchmark.differentTypes               true             true           true  thrpt   10   443385.537 ±  20332.729  ops/s
HttpRequestEncoderBenchmark.fullMessage                  true            false           true  thrpt   10  2642338.894 ± 110718.503  ops/s
HttpRequestEncoderBenchmark.fullMessage                  true             true           true  thrpt   10  2324594.524 ±  22424.194  ops/s

This PR 6 threads:

Benchmark                                   (pooledAllocator)  (typePollution)  (voidPromise)   Mode  Cnt         Score        Error  Units
HttpRequestEncoderBenchmark.chunked                      true            false           true  thrpt   10   4769551.449 ± 142854.608  ops/s
HttpRequestEncoderBenchmark.chunked                      true             true           true  thrpt   10   4477651.779 ± 419351.979  ops/s
HttpRequestEncoderBenchmark.contentLength                true            false           true  thrpt   10   7776615.466 ± 225903.978  ops/s
HttpRequestEncoderBenchmark.contentLength                true             true           true  thrpt   10   6278694.334 ± 515945.343  ops/s
HttpRequestEncoderBenchmark.differentTypes               true            false           true  thrpt   10   1964309.534 ±  69229.190  ops/s
HttpRequestEncoderBenchmark.differentTypes               true             true           true  thrpt   10   2041805.064 ±  61923.201  ops/s
HttpRequestEncoderBenchmark.fullMessage                  true            false           true  thrpt   10  10271948.355 ± 558919.801  ops/s
HttpRequestEncoderBenchmark.fullMessage                  true             true           true  thrpt   10  10328800.426 ± 754618.592  ops/s

4.1 1 thread:

Benchmark                                   (pooledAllocator)  (typePollution)  (voidPromise)   Mode  Cnt        Score       Error  Units
HttpRequestEncoderBenchmark.chunked                      true            false           true  thrpt   10   968151.152 ± 52872.773  ops/s
HttpRequestEncoderBenchmark.chunked                      true             true           true  thrpt   10   887903.816 ± 17644.514  ops/s
HttpRequestEncoderBenchmark.contentLength                true            false           true  thrpt   10  1445556.948 ± 46657.318  ops/s
HttpRequestEncoderBenchmark.contentLength                true             true           true  thrpt   10  1174235.130 ± 39772.138  ops/s
HttpRequestEncoderBenchmark.differentTypes               true            false           true  thrpt   10   377020.952 ±  5622.663  ops/s
HttpRequestEncoderBenchmark.differentTypes               true             true           true  thrpt   10   367732.657 ±  2173.944  ops/s
HttpRequestEncoderBenchmark.fullMessage                  true            false           true  thrpt   10  2467141.212 ± 99167.623  ops/s
HttpRequestEncoderBenchmark.fullMessage                  true             true           true  thrpt   10  1500541.983 ± 18486.408  ops/s

4.1 6 threads:

Benchmark                                   (pooledAllocator)  (typePollution)  (voidPromise)   Mode  Cnt         Score         Error  Units
HttpRequestEncoderBenchmark.chunked                      true            false           true  thrpt   10   4246375.439 ±  238607.376  ops/s
HttpRequestEncoderBenchmark.chunked                      true             true           true  thrpt   10   2948854.203 ±  123797.386  ops/s
HttpRequestEncoderBenchmark.contentLength                true            false           true  thrpt   10   5507903.362 ±  194094.645  ops/s
HttpRequestEncoderBenchmark.contentLength                true             true           true  thrpt   10   3505617.988 ±   77168.404  ops/s
HttpRequestEncoderBenchmark.differentTypes               true            false           true  thrpt   10   1344402.158 ±  113718.579  ops/s
HttpRequestEncoderBenchmark.differentTypes               true             true           true  thrpt   10   1376816.001 ±   49867.836  ops/s
HttpRequestEncoderBenchmark.fullMessage                  true            false           true  thrpt   10  10582142.810 ± 1443464.906  ops/s
HttpRequestEncoderBenchmark.fullMessage                  true             true           true  thrpt   10   2765103.925 ±  148623.873  ops/s

A few notes:

  • the PR improves the single-thread case too
  • scalability is greatly improved, especially with type pollution

There's still some work to do for HTTP, but this one already improves things dramatically; see fullMessage with type pollution:

  1. on 4.1, type pollution barely reaches x1.8 the single-threaded throughput when using 6 cores
  2. on the new PR, the single-threaded baseline is higher and 6 cores achieve x4.5 the single-threaded throughput (numbers below)
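
Those ratios come straight from the fullMessage rows with typePollution=true above: on 4.1 the throughput goes from ~1.50M ops/s with 1 thread to ~2.77M ops/s with 6 threads (≈ x1.8), while with this PR it goes from ~2.32M ops/s to ~10.33M ops/s (≈ x4.4-4.5).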

Code under review (diff excerpt):

    return ((FileRegion) msg).count();
    }
    throw new IllegalStateException("unexpected message type: " + StringUtil.simpleClassName(msg));
    return msg instanceof FullHttpMessage ||
Contributor Author

@franz1981 franz1981 Aug 18, 2022

@normanmaurer
We can add checks for the existing known concrete types too (default, default full, etc.), but TBH users can do that themselves by extending the response/request encoders based on how they use them, i.e. whether they have custom types or not.

note:
The sequence of checks here matches the one in encode, so that concrete types don't get their Klass cache field invalidated (and refreshed) - this will likely cause a perf hit for the last types checked in the chain (ByteBuf and FileRegion), because they are evaluated only after all the previous checks, always hitting the JIT's O(n) slow path.
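
Roughly, the idea looks like this (a hypothetical sketch of the ordering only, not the actual method bodies in this PR):

import io.netty.buffer.ByteBuf;
import io.netty.channel.FileRegion;
import io.netty.handler.codec.http.FullHttpMessage;
import io.netty.handler.codec.http.HttpContent;
import io.netty.handler.codec.http.HttpMessage;
import io.netty.handler.codec.http.LastHttpContent;
import io.netty.util.internal.StringUtil;

final class CheckOrderSketch {

    // Accept check: the interfaces are tested in one fixed order...
    static boolean acceptOutboundMessage(Object msg) {
        return msg instanceof FullHttpMessage
                || msg instanceof HttpMessage
                || msg instanceof LastHttpContent
                || msg instanceof HttpContent
                || msg instanceof ByteBuf
                || msg instanceof FileRegion;
    }

    // ...and the encode path tests them in the same order, so the message
    // class's secondary_super_cache is still pointing at the interface that
    // matched above and the first check here is a cache hit. Only the tail
    // of the chain (ByteBuf, FileRegion) always pays for all the earlier
    // misses via the O(n) secondary-supers scan.
    static void encodeSketch(Object msg) {
        if (msg instanceof FullHttpMessage) {
            // encode headers and content in one go
        } else if (msg instanceof HttpMessage) {
            // encode headers only
        } else if (msg instanceof LastHttpContent) {
            // encode the last chunk (and trailers)
        } else if (msg instanceof HttpContent) {
            // encode a content chunk
        } else if (msg instanceof ByteBuf || msg instanceof FileRegion) {
            // pass the raw bytes/region through
        } else {
            throw new IllegalStateException(
                    "unexpected message type: " + StringUtil.simpleClassName(msg));
        }
    }
}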

@franz1981
Contributor Author

@chrisvest ready to go :)
Both the end-to-end and the micro benchmarks agree that the scalability issue is solved :)

@franz1981 franz1981 force-pushed the netty-4.1.74.Final branch 3 times, most recently from 8381793 to 88e2c30 Compare August 19, 2022 09:09
@franz1981
Contributor Author

weird failure on

Build PR / windows-x86_64-java11-boringssl (pull_request)

@normanmaurer @chrisvest how can I re-run the validations / just that validation?

Contributor

@chrisvest chrisvest left a comment

Had a couple of comments.

It's unfortunate we have to play tricks like this, as the control flow was already hard to follow, and now that part is even worse.

@franz1981
Contributor Author

franz1981 commented Aug 19, 2022

It's unfortunate we have to play tricks like this, as the control flow was already hard to follow, and now that part is even worse.

Agree, and it's unfortunate that such a JDK "feature" is unlikely to be fixed, but still - there's a lesson to learn here too, i.e. using traits to drive a state machine isn't a good idea, because it relies on how well the runtime can resolve them. It's similar with polymorphism, although maybe a double-dispatch/visitor pattern would avoid triggering the klass caching mechanism.

Probably a plain old int/byte field in a single common type, acting as a bitset of known "traits", plus some switches to choose what to do for each implemented trait, is more deterministic regardless of what the JIT decides (meaning its performance characteristics survive over time/JDK bugs).
Even if I've solved (alleviated - more on this next week) the scalability issue, we still get a plain O(n) scan whenever we check against a missing trait, and that's not ideal.
I would love Netty 5 to not make use of this mechanism, and I can help (@vietj too, I believe) to do things differently there
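
For the record, a rough sketch of that alternative (hypothetical names, not what this PR does): a single int field acts as a bitset of traits and the encoder branches on bits instead of interface instanceof checks, so the cost no longer depends on how the JIT caches secondary supers.

// All names here are hypothetical; this is not what the PR implements.
final class Traits {
    static final int HTTP_MESSAGE = 1;      // has a start line + headers
    static final int HTTP_CONTENT = 1 << 1; // carries a content buffer
    static final int LAST_CONTENT = 1 << 2; // terminates the message
}

abstract class EncodableMessage {
    final int traits; // bitset of Traits.* flags, set once at construction

    EncodableMessage(int traits) {
        this.traits = traits;
    }
}

final class FullMessage extends EncodableMessage {
    FullMessage() {
        super(Traits.HTTP_MESSAGE | Traits.HTTP_CONTENT | Traits.LAST_CONTENT);
    }
}

final class TraitBasedEncoder {
    void encode(EncodableMessage msg) {
        final int t = msg.traits;
        // Plain bit tests: no interface subtype checks, so no
        // secondary_super_cache traffic and no O(n) scan on a miss.
        if ((t & Traits.HTTP_MESSAGE) != 0) {
            // encode the start line + headers
        }
        if ((t & Traits.HTTP_CONTENT) != 0) {
            // encode the body bytes
        }
        if ((t & Traits.LAST_CONTENT) != 0) {
            // encode trailers / finish the message
        }
    }
}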

@franz1981 franz1981 force-pushed the netty-4.1.74.Final branch 2 times, most recently from 97e1c1e to 200cbd2 Compare August 24, 2022 12:30
@franz1981
Contributor Author

@normanmaurer this is ready to go, unless there are other concerns - I can add more comments to help

Member

@normanmaurer normanmaurer left a comment

Can you just add a link to the PR / OpenJDK issue somewhere, so we don't revert this by mistake at some point?

@franz1981
Contributor Author

Can you just add a link to the PR / OpenJDK issue somewhere, so we don't revert this by mistake at some point?

Let me add some comments on the PR

…etty#12708)

Motivation:

The current HTTP encoder logic causes contention on the klass cache field that the JIT uses to speed up instanceof checks against interfaces,
preventing scaling on multi-core machines.
See https://bugs.openjdk.org/browse/JDK-8180450 for more info.

Modifications:

Duplicated code to reduce the arity of the morphism of the instanceof checks in the HTTP encoder.
Stopped using inherited encoder methods, to take control of the arity of the
call-site morphism while releasing ref-counted HTTP message types.

Result:

Scalable HTTP encoding
@franz1981
Contributor Author

@normanmaurer comment added, and the issue is now more verbose too (with an example that shows what's going on)

@franz1981
Contributor Author

The benchmark I've made is naive, but it's good enough for our purpose, I believe - injecting type pollution upfront just assumes a LOT of things about the JIT and its configuration (and which optimizations it should perform)

@normanmaurer normanmaurer merged commit 423a385 into netty:4.1 Aug 25, 2022
@normanmaurer
Member

@franz1981 thanks a lot for all the hard work. I learned something new ❤️

@normanmaurer
Member

@franz1981 can you also do a PR for main?

@normanmaurer normanmaurer added this to the 4.1.80.Final milestone Aug 25, 2022
@franz1981
Contributor Author

franz1981 commented Aug 25, 2022

@normanmaurer doing it for main too ;)
Although I would later change it there TBH

Let's chat on the pr

@normanmaurer
Member

@normanmaurer doing it for main too ;)

Although I would later change it there TBH

Let's chat on the pr

Agree... but better to be consistent until we have a better way in main

franz1981 added a commit to franz1981/netty that referenced this pull request Aug 31, 2022
…tly (Fixes netty#12750)

Motivation:

Changes due to netty#12709 have added a code path.

Modifications:

Restore the "just http message" case as separated by the http content ones

Result:

Same encoding as original code
yawkat added a commit to yawkat/netty that referenced this pull request Sep 1, 2022
Motivation:

netty#12709 changed HttpObjectEncoder to override the write method of MessageToMessageEncoder, with slightly changed semantics: the `msg` argument to `encode` is not released anymore. To accommodate this change, netty#12709 also updated `HttpObjectEncoder.encode` to release the `msg`. However, `HttpClientCodec.Encoder` overrides `encode` and simply forwards the message if an HTTP upgrade has been completed. This code path was not updated to release the input message. This leads to a memory leak.

Modifications:

Changed the `encode` implementation to not retain the message that is forwarded. Added a test case to verify that the refCnt of the data passed through is unchanged.

Result:

The buffer retains its correct refCnt and will be released properly.
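
For context, a hypothetical sketch of the contract change described above (simplified names, not the real Netty classes): under the old MessageToMessageEncoder contract the framework released the input after encode(), so a pass-through encoder had to retain what it forwarded; once the encoder releases or hands off the input itself, that extra retain() becomes the leak.

import io.netty.util.ReferenceCountUtil;
import java.util.List;

// Hypothetical pass-through encoder body, not the real HttpClientCodec.Encoder.
final class PassThroughSketch {

    // Old MessageToMessageEncoder contract: the framework releases the input
    // after encode() returns, so anything forwarded must be retained first.
    static void encodeUnderOldContract(Object msg, List<Object> out) {
        out.add(ReferenceCountUtil.retain(msg));
    }

    // New contract after #12709: the encoder owns the input itself, so a
    // forwarding encode() must pass the message along without the extra
    // retain(), otherwise the refCnt stays one too high and the buffer leaks.
    static void encodeUnderNewContract(Object msg, List<Object> out) {
        out.add(msg);
    }
}
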
normanmaurer pushed a commit that referenced this pull request Sep 1, 2022
normanmaurer pushed a commit that referenced this pull request Sep 1, 2022
…tly (Fixes #12750) (#12751)




Co-authored-by: Chris Vest <christianvest_hansen@apple.com>