grpc: provide a mechanism for encoded message buffer recycling #6613

Closed
wants to merge 9 commits

Conversation

@HippoBaro commented Sep 10, 2023

RELEASE NOTES:

This PR adds a new public API that lets users pass a buffer pool
(implementing the pre-existing grpc.SharedBufferPool). When used in
conjunction with a compatible encoding.Codec, the memory used to
marshal messages may be reused indefinitely, significantly reducing
garbage collection overhead for clients and servers that handle a high
number of messages or a high volume of data.
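
For illustration, a minimal usage sketch of the intended API; `grpc.NewSharedBufferPool` is the pre-existing helper, while the `EncoderBufferPool` / `WithEncoderBufferPool` option names are hypothetical placeholders rather than this PR's final surface:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Pre-existing (experimental) pool implementation shipped with grpc-go.
	pool := grpc.NewSharedBufferPool()

	// Server side: recycle encoded response buffers.
	// grpc.EncoderBufferPool is a hypothetical ServerOption name.
	srv := grpc.NewServer(grpc.EncoderBufferPool(pool))
	defer srv.Stop()

	// Client side: recycle encoded request buffers.
	// grpc.WithEncoderBufferPool is a hypothetical DialOption name.
	conn, err := grpc.Dial("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithEncoderBufferPool(pool),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```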

Motivation

We are currently working on a service that uses gRPC streaming to stream
(potentially) large files, in chunks, back to gRPC clients over the
network. We measured that the Go allocation volume per second is roughly
equal to the network throughput of the host. This creates GC cycles that
introduce latency spikes and prevent us from predictably saturating the
network at a reasonable CPU cost.

After investigation, we have isolated the source of most of these
allocations to protobuf slice creation during message serialization.

Results

Using this patch set, CPU usage was reduced by 27%, and by up to 64% in
one (quite synthetic) extreme case. Allocation volume per second dropped
by almost 100%, from 805 MiB/s to 5 MiB/s.

This PR also adds benchmarks that demonstrate the effect in an
ideal scenario:

benchmark                    old ns/op     new ns/op     delta
BenchmarkEncode1B-10         90.7          118           +30.43%
BenchmarkEncode1KiB-10       336           134           -60.05%
BenchmarkEncode8KiB-10       1806          260           -85.61%
BenchmarkEncode64KiB-10      11749         1235          -89.49%
BenchmarkEncode512KiB-10     86471         14682         -83.02%
BenchmarkEncode1MiB-10       111630        25925         -76.78%

benchmark                    old MB/s     new MB/s     speedup
BenchmarkEncode1B-10         33.08        25.37        0.77x
BenchmarkEncode1KiB-10       3057.12      7652.03      2.50x
BenchmarkEncode8KiB-10       4538.51      31539.55     6.95x
BenchmarkEncode64KiB-10      5578.28      53078.44     9.52x
BenchmarkEncode512KiB-10     6063.21      35709.85     5.89x
BenchmarkEncode1MiB-10       9393.39      40446.71     4.31x

benchmark                    old allocs     new allocs     delta
BenchmarkEncode1B-10         1              0              -100.00%
BenchmarkEncode1KiB-10       1              0              -100.00%
BenchmarkEncode8KiB-10       1              0              -100.00%
BenchmarkEncode64KiB-10      1              0              -100.00%
BenchmarkEncode512KiB-10     1              0              -100.00%
BenchmarkEncode1MiB-10       1              0              -100.00%

benchmark                    old bytes     new bytes     delta
BenchmarkEncode1B-10         3             0             -100.00%
BenchmarkEncode1KiB-10       1152          0             -100.00%
BenchmarkEncode8KiB-10       9472          0             -100.00%
BenchmarkEncode64KiB-10      73728         1             -100.00%
BenchmarkEncode512KiB-10     532482        12            -100.00%
BenchmarkEncode1MiB-10       1056772       22            -100.00%

Notes

I am unsure whether this is a direction the project maintainers are
comfortable pursuing, and this is my first time contributing. Please let
me know if you'd rather I break the PR up into smaller ones (the commit
log is clean and should let us do that easily), or if there is a better
way to achieve this!

Additionally, there are a few things of note I considered and would
appreciate feedback on:

  • I was unsure whether we should reuse the existing buffer pools that
    can already be passed today, such as through the grpc.RecvBufferPool
    or [With]SharedWriteBuffer options (the pre-existing pool interface is
    reproduced after this list for reference);
  • Whether this mechanism should also apply to compressors, which operate
    much the same way as encoders do (some already store a buffer pool
    internally);
  • Tiny messages may suffer from this change, as it adds some extra
    logic that only pays off once messages are larger than around 64
    bytes;
  • Whether this is even safe to do, given that we pass encoded message
    buffers to handlers, which may keep references to them. This strikes
    me as bad practice that would lead to bugs, but I was unable to find
    any written documentation about it.
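
For reference, the pre-existing pool interface these options revolve around (an experimental API in grpc-go v1.57+, reproduced here for context; `grpc.NewSharedBufferPool()` returns a ready-made implementation):

```go
// SharedBufferPool, as defined in google.golang.org/grpc.
type SharedBufferPool interface {
	// Get returns a buffer with the specified length from the pool.
	Get(length int) []byte

	// Put returns a buffer to the pool.
	Put(*[]byte)
}
```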

Thank you!

@linux-foundation-easycla bot commented Sep 10, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@HippoBaro force-pushed the encoder_buffer_pool branch 2 times, most recently from 2588937 to 6155d26 on September 10, 2023 at 20:33
When writing data on an HTTP writer, the call completes once the frame
is added to the control buffer queue. Therefore, there is no way to know
when the actual write is completed, as it occurs asynchronously
afterward.

`transport.Options` now allows passing a callback that is invoked exactly
once, after all the data has been committed to the underlying HTTP/2 layer.
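
For illustration only, a rough sketch of what such a hook could look like; `transport.Options` and its `Last` field exist in the internal transport package, but the callback field name below is a hypothetical placeholder, not the PR's actual change:

```go
// Sketch of google.golang.org/grpc/internal/transport.Options; only the
// hypothetical OnSendDone field is new relative to the existing struct.
type Options struct {
	// Last indicates whether this write is the last piece for this call.
	Last bool

	// OnSendDone (hypothetical name) is invoked exactly once, after the
	// message payload has been handed off to the HTTP/2 framer and is no
	// longer referenced by the transport, so the caller can recycle the
	// encoded buffer, e.g. Options{OnSendDone: func() { pool.Put(&data) }}.
	OnSendDone func()
}
```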
Users may now provide a `SharedBufferPool` for the purpose of encoding
messages.

Provide a callback to the transport layer that returns the encoded
message buffer to the buffer pool once the message has been copied or
sent over the wire and is no longer referenced anywhere.

Note that the encoded message buffer is also shared with user code
through handlers. Those should not keep references to these buffers
after they return.

Codecs may now implement an additional interface, `BufferedCodec`. It
gives codec writers an option to use pre-existing memory when
marshaling messages, if possible.

Implementing the interface is entirely optional, and no changes to
existing codecs are necessary.
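
A minimal sketch of what such an interface could look like; only the `BufferedCodec` and `MarshalWithBuffer` names come from this PR, and the exact method signature below is an assumption for illustration:

```go
// Sketch of additions to google.golang.org/grpc/encoding.
package encoding

// Codec is the existing marshaling interface and is unchanged.
type Codec interface {
	Marshal(v interface{}) ([]byte, error)
	Unmarshal(data []byte, v interface{}) error
	Name() string
}

// BufferedCodec is an optional extension a codec may implement. The
// signature is assumed: the codec receives a caller-provided buffer and
// returns the encoded message, reusing the buffer's backing array when it
// has enough capacity.
type BufferedCodec interface {
	Codec

	MarshalWithBuffer(v interface{}, buf []byte) ([]byte, error)
}
```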
When the user provides a buffer pool for message encoding and the
codec implements `codec.BufferedCodec`, `MarshalWithBuffer` is called
instead of `Marshal`.

We detect whether the codec supports memory reuse through a conditional
type assertion, akin to the technique `io.Copy` uses when its arguments
implement `io.ReaderFrom` or `io.WriterTo`.
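
In Go this detection is a plain type assertion at the call site; a hedged sketch (the helper name and surrounding package are illustrative, not the PR's actual implementation):

```go
package rpc // illustrative package, not grpc-go's actual source layout

import "google.golang.org/grpc/encoding"

// encode marshals msg, reusing buf's backing array when the codec supports
// it. The branching mirrors how io.Copy upgrades to WriterTo/ReaderFrom
// when its arguments happen to implement them. encoding.BufferedCodec is
// the hypothetical extension sketched above.
func encode(c encoding.Codec, msg interface{}, buf []byte) ([]byte, error) {
	if bc, ok := c.(encoding.BufferedCodec); ok {
		return bc.MarshalWithBuffer(msg, buf) // pooled, buffer-reusing path
	}
	return c.Marshal(msg) // fallback: existing allocate-per-message path
}
```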
The implementation uses the `proto.Buffer` API that allows passing a
pre-existing buffer for the library to marshal into.
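
As context, marshaling into a caller-provided buffer with `proto.Buffer` looks roughly like the sketch below; the method shape follows the hypothetical `MarshalWithBuffer` above, `proto.NewBuffer` and `Buffer.Marshal` are the (now-deprecated) github.com/golang/protobuf APIs, and `proto.MarshalOptions{}.MarshalAppend` from google.golang.org/protobuf is the newer equivalent:

```go
package protopool

import (
	"fmt"

	"github.com/golang/protobuf/proto"
)

type codec struct{}

// MarshalWithBuffer encodes v into buf's backing array when it has enough
// capacity; proto.Buffer grows the slice otherwise. The signature is the
// hypothetical one sketched earlier, not the PR's exact code.
func (codec) MarshalWithBuffer(v interface{}, buf []byte) ([]byte, error) {
	m, ok := v.(proto.Message)
	if !ok {
		return nil, fmt.Errorf("failed to marshal: %v is not a proto.Message", v)
	}
	b := proto.NewBuffer(buf[:0]) // length 0, keep capacity for reuse
	if err := b.Marshal(m); err != nil {
		return nil, err
	}
	return b.Bytes(), nil
}
```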
Asserts that passing a buffer pool combined with a compatible codec
(protobuf in this case) results in buffer reuse.
@HippoBaro (Author) commented Sep 11, 2023

I ran the default benchmarks on my branch as well. This compares runs without an encoder buffer pool (before) and with one (after):

unconstrained-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps            0            0      NaN%
             SendOps      7584502      6728760   -11.28%
             RecvOps      7009714      7089326     1.14%
            Bytes/op      1011.67      1102.50     8.99%
           Allocs/op        25.18        25.39     0.00%
             ReqT/op   6067601.60   5383008.00   -11.28%
            RespT/op   5607771.20   5671460.80     1.14%
            50th-Lat           0s           0s      NaN%
            90th-Lat           0s           0s      NaN%
            99th-Lat           0s           0s      NaN%
             Avg-Lat           0s           0s      NaN%
           GoVersion     go1.20.6     go1.20.6
         GrpcVersion   1.59.0-dev   1.59.0-dev

streaming-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1048576B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps         9636        10968    13.82%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op   4282773.51   3716343.87   -13.23%
           Allocs/op       443.54       435.61    -1.80%
             ReqT/op 8083262668.80 9200625254.40    13.82%
            RespT/op      7708.80      8774.40    13.83%
            50th-Lat   1.055083ms    893.417µs   -15.32%
            90th-Lat   1.264333ms   1.216042ms    -3.82%
            99th-Lat   1.476625ms   1.400875ms    -5.13%
             Avg-Lat   1.037379ms    911.368µs   -12.15%
           GoVersion     go1.20.6     go1.20.6
         GrpcVersion   1.59.0-dev   1.59.0-dev

unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1048576B-respSize_1048576B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps         4399         5064    15.12%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op   8742561.71   7814381.11   -10.62%
           Allocs/op       938.67       910.11    -2.98%
             ReqT/op 3690148659.20 4247991091.20    15.12%
            RespT/op 3690148659.20 4247991091.20    15.12%
            50th-Lat   2.213709ms   1.903834ms   -14.00%
            90th-Lat   2.484291ms   2.225166ms   -10.43%
            99th-Lat   6.142709ms      5.541ms    -9.80%
             Avg-Lat   2.272503ms   1.974372ms   -13.12%
           GoVersion     go1.20.6     go1.20.6
         GrpcVersion   1.59.0-dev   1.59.0-dev

unconstrained-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1048576B-respSize_1048576B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps            0            0      NaN%
             SendOps         7000         6929    -1.01%
             RecvOps        11670        12830     9.94%
            Bytes/op   6439865.27   5000374.41   -22.35%
           Allocs/op       604.05       604.43     0.00%
             ReqT/op 5872025600.00 5812466483.20    -1.01%
            RespT/op 9789505536.00 10762584064.00     9.94%
            50th-Lat           0s           0s      NaN%
            90th-Lat           0s           0s      NaN%
            99th-Lat           0s           0s      NaN%
             Avg-Lat           0s           0s      NaN%
           GoVersion     go1.20.6     go1.20.6
         GrpcVersion   1.59.0-dev   1.59.0-dev

unconstrained-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1B-respSize_1048576B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps            0            0      NaN%
             SendOps      6864882      6600227    -3.86%
             RecvOps        11310        11936     5.53%
            Bytes/op     11559.76      9790.57   -15.30%
           Allocs/op        25.18        25.30     0.00%
             ReqT/op   5491905.60   5280181.60    -3.86%
            RespT/op 9487515648.00 10012642508.80     5.53%
            50th-Lat           0s           0s      NaN%
            90th-Lat           0s           0s      NaN%
            99th-Lat           0s           0s      NaN%
             Avg-Lat           0s           0s      NaN%
           GoVersion     go1.20.6     go1.20.6
         GrpcVersion   1.59.0-dev   1.59.0-dev

unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1048576B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps         8990        10444    16.17%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op   4295138.78   3762462.71   -12.40%
           Allocs/op       566.20       591.24     4.42%
             ReqT/op 7541358592.00 8761062195.20    16.17%
            RespT/op      7192.00      8355.20    16.17%
            50th-Lat   1.098667ms    941.209µs   -14.33%
            90th-Lat   1.319542ms    1.24925ms    -5.33%
            99th-Lat      1.863ms   1.423583ms   -23.59%
             Avg-Lat   1.111958ms    957.167µs   -13.92%
           GoVersion     go1.20.6     go1.20.6
         GrpcVersion   1.59.0-dev   1.59.0-dev

unconstrained-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1048576B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps            0            0      NaN%
             SendOps         2504         2527     0.92%
             RecvOps     11242519      9840033   -12.47%
            Bytes/op      2570.09      2569.98    -0.04%
           Allocs/op        26.33        26.36     0.00%
             ReqT/op 2100507443.20 2119801241.60     0.92%
            RespT/op   8994015.20   7872026.40   -12.47%
            50th-Lat           0s           0s      NaN%
            90th-Lat           0s           0s      NaN%
            99th-Lat           0s           0s      NaN%
             Avg-Lat           0s           0s      NaN%
           GoVersion     go1.20.6     go1.20.6
         GrpcVersion   1.59.0-dev   1.59.0-dev

streaming-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1048576B-respSize_1048576B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps         5215         5182    -0.63%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op   8593845.68   7695533.70   -10.45%
           Allocs/op       793.87       789.09    -0.50%
             ReqT/op 4374659072.00 4346976665.60    -0.63%
            RespT/op 4374659072.00 4346976665.60    -0.63%
            50th-Lat   1.836584ms    1.82125ms    -0.83%
            90th-Lat   2.175292ms   2.122375ms    -2.43%
            99th-Lat   6.023667ms   6.055333ms     0.53%
             Avg-Lat   1.917102ms    1.92922ms     0.63%
           GoVersion     go1.20.6     go1.20.6
         GrpcVersion   1.59.0-dev   1.59.0-dev

unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps       132084       130071    -1.52%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     10102.26     10180.92     0.77%
           Allocs/op       214.42       214.98     0.00%
             ReqT/op    105667.20    104056.80    -1.52%
            RespT/op    105667.20    104056.80    -1.52%
            50th-Lat     70.375µs     72.042µs     2.37%
            90th-Lat     86.125µs     88.959µs     3.29%
            99th-Lat    179.958µs    176.625µs    -1.85%
             Avg-Lat     75.479µs      76.65µs     1.55%
           GoVersion     go1.20.6     go1.20.6
         GrpcVersion   1.59.0-dev   1.59.0-dev

streaming-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps       201847       200906    -0.47%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op      1717.47      1778.49     3.55%
           Allocs/op        65.79        66.44     1.52%
             ReqT/op    161477.60    160724.80    -0.47%
            RespT/op    161477.60    160724.80    -0.47%
            50th-Lat     46.625µs       47.5µs     1.88%
            90th-Lat     59.458µs     59.916µs     0.77%
            99th-Lat     95.292µs         81µs   -15.00%
             Avg-Lat     49.321µs      49.55µs     0.46%
           GoVersion     go1.20.6     go1.20.6
         GrpcVersion   1.59.0-dev   1.59.0-dev

unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1B-respSize_1048576B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps         8845         9765    10.40%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op   4296419.41   3782539.47   -11.96%
           Allocs/op       563.97       570.22     1.24%
             ReqT/op      7076.00      7812.00    10.40%
            RespT/op 7419723776.00 8191475712.00    10.40%
            50th-Lat   1.138625ms    989.416µs   -13.10%
            90th-Lat   1.343709ms   1.299916ms    -3.26%
            99th-Lat   1.578042ms   1.979958ms    25.47%
             Avg-Lat   1.130026ms   1.023689ms    -9.41%
           GoVersion     go1.20.6     go1.20.6
         GrpcVersion   1.59.0-dev   1.59.0-dev

streaming-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1B-respSize_1048576B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps         9796        11034    12.64%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op   4283612.42   3712975.91   -13.32%
           Allocs/op       437.95       439.56     0.46%
             ReqT/op      7836.80      8827.20    12.65%
            RespT/op 8217480396.80 9255990067.20    12.64%
            50th-Lat    1.05675ms    873.959µs   -17.30%
            90th-Lat   1.263417ms   1.208958ms    -4.31%
            99th-Lat   1.454458ms   1.596417ms     9.76%
             Avg-Lat    1.02032ms    905.854µs   -11.22%
           GoVersion     go1.20.6     go1.20.6
         GrpcVersion   1.59.0-dev   1.59.0-dev

@PapaCharlie

Hey @HippoBaro, looks like we were working on the same thing (see #6608), though yours seems significantly further along. Either way, I just want to see this feature checked in, regardless of how it's implemented, so feel free to steal whatever code you might find useful (if any) from my PR.

Add an option to use a shared encoder buffer pool during benchmarks.
@HippoBaro (Author)

> Hey @HippoBaro, looks like we were working on the same thing (see #6608), though yours seems significantly further along. Either way, I just want to see this feature checked in, regardless of how it's implemented, so feel free to steal whatever code you might find useful (if any) from my PR.

Apologies! I didn't think of looking at closed PRs. I took a quick look at yours, and it goes one step further by enabling memory reuse when compressing. Mine doesn't, for the sake of keeping it simple, but that is also something I would like to get merged eventually.

@PapaCharlie

No worries, better two brains on this than one!

@easwars (Contributor) commented Sep 11, 2023

@HippoBaro: This PR contains a bunch of changes. We appreciate the contribution, but it would be better if you could open an issue, explain the problem you are facing, and describe the proposed solution. That way, we can discuss options for the solution before having to look at code, and we can also break the effort up into smaller, more manageable PRs. Thanks.

@HippoBaro (Author) commented Sep 11, 2023

> @HippoBaro: This PR contains a bunch of changes. We appreciate the contribution, but it would be better if you could open an issue, explain the problem you are facing, and describe the proposed solution. That way, we can discuss options for the solution before having to look at code, and we can also break the effort up into smaller, more manageable PRs. Thanks.

Thank you @easwars! I added an issue to discuss the solution: #6619. I will leave this implementation PR open for reference and make it reflect any changes to the approach we discuss.

@easwars (Contributor) commented Sep 12, 2023

Marking this as blocked on: #6619

@ginayeh assigned easwars and unassigned HippoBaro on Oct 3, 2023
@easwars removed their assignment on Oct 13, 2023
@arvindbr8 linked an issue on Oct 18, 2023 that may be closed by this pull request
@dfawley (Member) commented Oct 31, 2023

I hope you don't mind, but I'd like to close this until we have more time to spend on these types of issues, and that could be some time (until next quarter). We want to look at buffer sharing/reuse more holistically before continuing down this path, and we don't have the cycles to do so right now. Let's discuss further in the issues, and please keep your branch around in case we do decide to implement this or want to reference it in discussions.

@dfawley closed this on Oct 31, 2023
@github-actions bot locked as resolved and limited conversation to collaborators on Apr 29, 2024
Successfully merging this pull request may close the linked issue: Provide a mechanism for encoded message buffer recycling.