Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minimize get byte multipart and fix buffer reuse #11001

Merged

Conversation

fredericBregier
Copy link
Member

@fredericBregier fredericBregier commented Feb 7, 2021

Fix HttpPostMultipartRequestDecoder buffer usages and allocations

Motivation:

Method getByte(position) is too often called within the current implementation
of the HttpPostMultipartRequestDecoder.
This implies too much activities which is visible when PARANOID mode is active.
This is also true in standard mode.

Apply the same fix on buffer from HttpPostMultipartRequestDecoder to HttpPostStandardRequestDecoder
made previously.

Finally in order to ensure we do not rewrite already decoded HttpData when decoding
next ones within multipart, we must ensure the buffers are copied and not a retained slice.

Modifications:

Use the bytesBefore(...) method instead of getByte(pos) in order to limit the external
access to the underlying buffer by retrieving iteratively the beginning of a correct start
position.
It is used to find both LF/CRLF and delimiter.
2 methods in HttpPostBodyUtil were created for that.

The undecodedChunk is copied when adding a chunk to a DataMultipart is loaded.
The same buffer is also rewritten in order to release the copied memory part.

Result:

Just for note, for both Memory or Disk or Mixed mode factories, the release has to be done as:

for (InterfaceHttpData httpData: decoder.getBodyHttpDatas()) {
    httpData.release();
    factory.removeHttpDataFromClean(request, httpData);
}
factory.cleanAllHttpData();
decoder.destroy();

The memory used is minimal in Disk or Mixed mode. In Memory mode, a big file is still
in memory but not more in the undecodedChunk but its own buffer (copied).

In terms of benchmarking, the results are:

Original code Benchmark                                                             Mode  Cnt  Score    Error   Units
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigAdvancedLevel   thrpt    6  0,152 ±  0,100  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigDisabledLevel   thrpt    6  0,543 ±  0,218  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigParanoidLevel   thrpt    6  0,001 ±  0,001  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigSimpleLevel     thrpt    6  0,615 ±  0,070  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighAdvancedLevel  thrpt    6  0,114 ±  0,063  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighDisabledLevel  thrpt    6  0,664 ±  0,034  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighParanoidLevel  thrpt    6  0,001 ±  0,001  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighSimpleLevel    thrpt    6  0,620 ±  0,140  ops/ms

New code Benchmark                                                                  Mode  Cnt  Score   Error   Units
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigAdvancedLevel   thrpt    6  4,253 ± 0,333  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigDisabledLevel   thrpt    6  4,422 ± 0,250  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigParanoidLevel   thrpt    6  0,877 ± 0,014  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigSimpleLevel     thrpt    6  4,151 ± 0,481  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighAdvancedLevel  thrpt    6  2,167 ± 0,098  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighDisabledLevel  thrpt    6  2,520 ± 0,043  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighParanoidLevel  thrpt    6  0,177 ± 0,003  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighSimpleLevel    thrpt    6  2,419 ± 0,061  ops/ms

In short, using big file transfers, this is about 7 times faster with new code, while
using high number of HttpData, this is about 4 times faster with new code when using Simple Level.
When using Paranoid Level, using big file transfers, this is about 800 times faster with new code, while
using high number of HttpData, this is about 170 times faster with new code.

Small improvements are also added, concerning buffer allocation and unsueful check within MixedAttribute and charset extended usage.

Add some tests to check consistency in Http Codec Multipart

Motivation:

Underlying buffer usages might be erroneous when releasing them internaly
in HttpPostMultipartRequestDecoder.

2 bugs occurs:

  1. Final File upload seems not to be of the right size
  2. Memory, even in Disk mode, is increasing continuously, while it shouldn't

Modification:

Add some tests to check consistency for HttpPostMultipartRequestDecoder.
Add a package protected method for testing purpose only.

Results:

Without fixes, those tests failed. With the fixes, they passed.

Motivation:
Underlying buffer usages might be erroneous when releasing them internaly
in HttpPostMultipartRequestDecoder.

2 bugs occurs:
1) Final File upload seems not to be of the right size
2) Memory, even in Disk mode, is increasing continuously, while it shouldn't

Modification:
Add some tests to check consistency for HttpPostMultipartRequestDecoder.
Add a package protected method for testing purpose only.
Motivation:

Method `getByte(position)` is too often called within the current implementation
of the HttpPostMultipartRequestDecoder.
This implies too much activities which is visible when PARANOID mode is active.
This is also true in standard mode.

Apply the same fix on buffer from HttpPostMultipartRequestDecoder to HttpPostStandardRequestDecoder
made previously.

Finally in order to ensure we do not rewrite already decoded HttpData when decoding
next ones within multipart, we must ensure the buffers are copied and not a retained slice.

Modifications:

Use the `bytesBefore(...)` method instead of `getByte(pos)` in order to limit the external
access to the underlying buffer by retrieving iteratively the beginning of a correct start
position.
It is used to find both LF/CRLF and delimiter.
2 methods in HttpPostBodyUtil were created for that.

The undecodedChunk is copied when adding a chunk to a DataMultipart is loaded.
The same buffer is also rewritten in order to release the copied memory part.

Result:

Just for note, for both Memory or Disk or Mixed mode factories, the release has to be done as:

      for (InterfaceHttpData httpData: decoder.getBodyHttpDatas()) {
          httpData.release();
          factory.removeHttpDataFromClean(request, httpData);
      }
      factory.cleanAllHttpData();
      decoder.destroy();

The memory used is minimal in Disk or Mixed mode. In Memory mode, a big file is still
in memory but not more in the undecodedChunk but its own buffer (copied).

In terms of benchmarking, the results are:

Original code Benchmark                                                             Mode  Cnt  Score    Error   Units
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigAdvancedLevel   thrpt    6  0,152 ±  0,100  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigDisabledLevel   thrpt    6  0,543 ±  0,218  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigParanoidLevel   thrpt    6  0,001 ±  0,001  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigSimpleLevel     thrpt    6  0,615 ±  0,070  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighAdvancedLevel  thrpt    6  0,114 ±  0,063  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighDisabledLevel  thrpt    6  0,664 ±  0,034  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighParanoidLevel  thrpt    6  0,001 ±  0,001  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighSimpleLevel    thrpt    6  0,620 ±  0,140  ops/ms

New code Benchmark                                                                  Mode  Cnt  Score   Error   Units
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigAdvancedLevel   thrpt    6  4,037 ± 0,358  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigDisabledLevel   thrpt    6  4,226 ± 0,471  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigParanoidLevel   thrpt    6  0,875 ± 0,029  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigSimpleLevel     thrpt    6  4,346 ± 0,275  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighAdvancedLevel  thrpt    6  2,044 ± 0,020  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighDisabledLevel  thrpt    6  2,278 ± 0,159  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighParanoidLevel  thrpt    6  0,174 ± 0,004  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighSimpleLevel    thrpt    6  2,370 ± 0,065  ops/ms

In short, using big file transfers, this is about 7 times faster with new code, while
using high number of HttpData, this is about 4 times faster with new code when using Simple Level.
When using Paranoid Level, using big file transfers, this is about 800 times faster with new code, while
using high number of HttpData, this is about 170 times faster with new code.
@fredericBregier
Copy link
Member Author

I try to rewrite from the beginning the previous proposal.
I believe it is clearer and also I found some bugs still there (decoding in error, memory allocation in error) in the current actual implementation.

@franz1981
Copy link
Contributor

I see that I should have used this PR benchmark to test the effectiveness of #10737 in a more realistic use case too :)

@fredericBregier
Copy link
Member Author

I see that I should have used this PR benchmark to test the effectiveness of #10737 in a more realistic use case too :)

@franz1981
I'm not sure to what point you think it is relevant but if it is using getByte when iterating on a position to find a value is quite consuming (mainly due to access being partially or totally logged within Netty to find bad releasing, but not only limited to that point): using bytesBefore seems more efficient.
If this is this point, yes, trying to not iterate on getByte is quite a good way (as it is in "almost" every language on file for instance to not read byte after byte).

In addition, we could implement a specific method within ByteBuf that allows to search for a "serie of bytes", but I think it might be too specific in my case so I decide to let this within HttpCodec as a static helper method.

@franz1981
Copy link
Contributor

I'm not sure to what point you think it is relevant

If you will use bytesBefore or indexOf you will likely get an additional speedup thanks to the PR I have linked ;)

@normanmaurer
Copy link
Member

@fredericBregier does this replace #10982 ?

@fredericBregier
Copy link
Member Author

@normanmaurer yes, I forgot to close the previous one.
It is a rewritten for both original #10623 and following ones.

@fredericBregier fredericBregier marked this pull request as ready for review February 9, 2021 08:44
Copy link
Contributor

@chrisvest chrisvest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some comments.

@chrisvest
Copy link
Contributor

I forgot to mention: the benchmarks look great, and I like how many parts of the code became simpler. Nice work!

Copy link
Contributor

@chrisvest chrisvest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had another comment, but otherwise I think this looks good.

Fix typo and naming, Charset usage, while to if, improve comments, smaller tests, factorized them and split tests according to factory type, optimize buffer allocation when given buffer is empty, improve findDelimiter, remove unuseful check in MixedAttribute

Still pending AbstractSearchProcessorFactory usage to check
@fredericBregier fredericBregier force-pushed the MinimizeGetByteMultipartFixBufferReuse branch from b2757ae to 6daeb0c Compare February 12, 2021 10:50
@fredericBregier
Copy link
Member Author

@franz1981 Done, thank you !!

@chrisvest
Copy link
Contributor

@fredericBregier Thanks! (sorry about the delay)

chrisvest pushed a commit that referenced this pull request Feb 26, 2021
Motivation:
- Underlying buffer usages might be erroneous when releasing them internaly
in HttpPostMultipartRequestDecoder.

2 bugs occurs:
1) Final File upload seems not to be of the right size.
2) Memory, even in Disk mode, is increasing continuously, while it shouldn't.

- Method `getByte(position)` is too often called within the current implementation
of the HttpPostMultipartRequestDecoder.
This implies too much activities which is visible when PARANOID mode is active.
This is also true in standard mode.

Apply the same fix on buffer from HttpPostMultipartRequestDecoder to HttpPostStandardRequestDecoder
made previously.

Finally in order to ensure we do not rewrite already decoded HttpData when decoding
next ones within multipart, we must ensure the buffers are copied and not a retained slice.

Modification:
- Add some tests to check consistency for HttpPostMultipartRequestDecoder.
Add a package protected method for testing purpose only.

- Use the `bytesBefore(...)` method instead of `getByte(pos)` in order to limit the external
access to the underlying buffer by retrieving iteratively the beginning of a correct start
position.
It is used to find both LF/CRLF and delimiter.
2 methods in HttpPostBodyUtil were created for that.

The undecodedChunk is copied when adding a chunk to a DataMultipart is loaded.
The same buffer is also rewritten in order to release the copied memory part.

Result:

Just for note, for both Memory or Disk or Mixed mode factories, the release has to be done as:

      for (InterfaceHttpData httpData: decoder.getBodyHttpDatas()) {
          httpData.release();
          factory.removeHttpDataFromClean(request, httpData);
      }
      factory.cleanAllHttpData();
      decoder.destroy();

The memory used is minimal in Disk or Mixed mode. In Memory mode, a big file is still
in memory but not more in the undecodedChunk but its own buffer (copied).

In terms of benchmarking, the results are:

Original code Benchmark                                                             Mode  Cnt  Score    Error   Units
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigAdvancedLevel   thrpt    6  0,152 ±  0,100  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigDisabledLevel   thrpt    6  0,543 ±  0,218  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigParanoidLevel   thrpt    6  0,001 ±  0,001  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigSimpleLevel     thrpt    6  0,615 ±  0,070  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighAdvancedLevel  thrpt    6  0,114 ±  0,063  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighDisabledLevel  thrpt    6  0,664 ±  0,034  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighParanoidLevel  thrpt    6  0,001 ±  0,001  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighSimpleLevel    thrpt    6  0,620 ±  0,140  ops/ms

New code Benchmark                                                                  Mode  Cnt  Score   Error   Units
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigAdvancedLevel   thrpt    6  4,037 ± 0,358  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigDisabledLevel   thrpt    6  4,226 ± 0,471  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigParanoidLevel   thrpt    6  0,875 ± 0,029  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigSimpleLevel     thrpt    6  4,346 ± 0,275  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighAdvancedLevel  thrpt    6  2,044 ± 0,020  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighDisabledLevel  thrpt    6  2,278 ± 0,159  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighParanoidLevel  thrpt    6  0,174 ± 0,004  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighSimpleLevel    thrpt    6  2,370 ± 0,065  ops/ms

In short, using big file transfers, this is about 7 times faster with new code, while
using high number of HttpData, this is about 4 times faster with new code when using Simple Level.
When using Paranoid Level, using big file transfers, this is about 800 times faster with new code, while
using high number of HttpData, this is about 170 times faster with new code.
@jameskleeh
Copy link
Contributor

This change broke the semantics of the decoder API. Will file an issue

raidyue pushed a commit to raidyue/netty that referenced this pull request Jul 8, 2022
Motivation:
- Underlying buffer usages might be erroneous when releasing them internaly
in HttpPostMultipartRequestDecoder.

2 bugs occurs:
1) Final File upload seems not to be of the right size.
2) Memory, even in Disk mode, is increasing continuously, while it shouldn't.

- Method `getByte(position)` is too often called within the current implementation
of the HttpPostMultipartRequestDecoder.
This implies too much activities which is visible when PARANOID mode is active.
This is also true in standard mode.

Apply the same fix on buffer from HttpPostMultipartRequestDecoder to HttpPostStandardRequestDecoder
made previously.

Finally in order to ensure we do not rewrite already decoded HttpData when decoding
next ones within multipart, we must ensure the buffers are copied and not a retained slice.

Modification:
- Add some tests to check consistency for HttpPostMultipartRequestDecoder.
Add a package protected method for testing purpose only.

- Use the `bytesBefore(...)` method instead of `getByte(pos)` in order to limit the external
access to the underlying buffer by retrieving iteratively the beginning of a correct start
position.
It is used to find both LF/CRLF and delimiter.
2 methods in HttpPostBodyUtil were created for that.

The undecodedChunk is copied when adding a chunk to a DataMultipart is loaded.
The same buffer is also rewritten in order to release the copied memory part.

Result:

Just for note, for both Memory or Disk or Mixed mode factories, the release has to be done as:

      for (InterfaceHttpData httpData: decoder.getBodyHttpDatas()) {
          httpData.release();
          factory.removeHttpDataFromClean(request, httpData);
      }
      factory.cleanAllHttpData();
      decoder.destroy();

The memory used is minimal in Disk or Mixed mode. In Memory mode, a big file is still
in memory but not more in the undecodedChunk but its own buffer (copied).

In terms of benchmarking, the results are:

Original code Benchmark                                                             Mode  Cnt  Score    Error   Units
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigAdvancedLevel   thrpt    6  0,152 ±  0,100  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigDisabledLevel   thrpt    6  0,543 ±  0,218  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigParanoidLevel   thrpt    6  0,001 ±  0,001  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigSimpleLevel     thrpt    6  0,615 ±  0,070  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighAdvancedLevel  thrpt    6  0,114 ±  0,063  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighDisabledLevel  thrpt    6  0,664 ±  0,034  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighParanoidLevel  thrpt    6  0,001 ±  0,001  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighSimpleLevel    thrpt    6  0,620 ±  0,140  ops/ms

New code Benchmark                                                                  Mode  Cnt  Score   Error   Units
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigAdvancedLevel   thrpt    6  4,037 ± 0,358  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigDisabledLevel   thrpt    6  4,226 ± 0,471  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigParanoidLevel   thrpt    6  0,875 ± 0,029  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigSimpleLevel     thrpt    6  4,346 ± 0,275  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighAdvancedLevel  thrpt    6  2,044 ± 0,020  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighDisabledLevel  thrpt    6  2,278 ± 0,159  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighParanoidLevel  thrpt    6  0,174 ± 0,004  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighSimpleLevel    thrpt    6  2,370 ± 0,065  ops/ms

In short, using big file transfers, this is about 7 times faster with new code, while
using high number of HttpData, this is about 4 times faster with new code when using Simple Level.
When using Paranoid Level, using big file transfers, this is about 800 times faster with new code, while
using high number of HttpData, this is about 170 times faster with new code.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants