stream: allow codecs to reuse output buffers #2816

Closed
dfawley opened this issue May 15, 2019 · 3 comments · Fixed by #3167
Labels
P3 · Type: Performance (performance improvements: CPU, network, memory, etc.)

Comments

dfawley (Member) commented May 15, 2019

This would require a codec API change/extension to recycle the memory once grpc is done writing it to the wire (or compressing it).
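For illustration, a minimal Go sketch of the kind of extension this describes follows. The `BufferReturner` interface, the `pooledCodec` type, and the simplified `Marshal` signature are all hypothetical and are not part of the grpc-go API:

```go
package main

import (
	"fmt"
	"sync"
)

// BufferReturner is a made-up optional extension interface: a transport
// that sees the codec implements it can call ReturnBuffer once the
// encoded bytes have been written (or compressed).
type BufferReturner interface {
	ReturnBuffer(buf []byte)
}

// pooledCodec keeps a pool of output buffers instead of allocating a
// fresh slice on every Marshal call.
type pooledCodec struct {
	pool sync.Pool
}

func newPooledCodec() *pooledCodec {
	return &pooledCodec{pool: sync.Pool{
		New: func() interface{} { return make([]byte, 0, 16*1024) },
	}}
}

// Marshal encodes msg into a pooled buffer. Real encoding is elided;
// here the "message" is already a byte slice for simplicity.
func (c *pooledCodec) Marshal(msg []byte) []byte {
	buf := c.pool.Get().([]byte)[:0]
	return append(buf, msg...)
}

// ReturnBuffer hands the buffer back so the next Marshal can reuse it.
func (c *pooledCodec) ReturnBuffer(buf []byte) {
	c.pool.Put(buf[:0])
}

func main() {
	c := newPooledCodec()
	out := c.Marshal([]byte("hello"))
	fmt.Printf("encoded %d bytes\n", len(out)) // pretend this hit the wire
	c.ReturnBuffer(out)                        // buffer is now reusable
}
```

A transport could type-assert its codec against such an optional interface and fall back to the existing behavior when the codec does not implement it, which keeps the extension backward compatible.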

dfawley added the P2 and Type: Performance labels on May 15, 2019
stale bot added the stale label on Sep 6, 2019
dfawley removed the stale label on Sep 6, 2019
adtac pushed a commit to adtac/grpc-go that referenced this issue Nov 8, 2019
Performance benchmarks can be found below. Obviously, a 10 KB request and
10 KB response is tailored to showcase this improvement, as this is where
codec buffer re-use shines, but I've run other benchmarks too (like
1-byte requests and responses) and there's no discernible impact on
performance.

To no one's surprise, the number of bytes allocated per operation goes
down by almost exactly 10 KB across 60k+ queries, which suggests
excellent buffer re-use. The number of allocations per operation
increases by about six, but that's probably because of a few additional
slice pointers that we need to store; these are 8-byte allocations and
should have virtually no impact on the allocator and garbage collector
(a standalone micro-benchmark illustrating the same effect follows the
table below).

    streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_10240B-respSize_10240B-compressor_off-channelz_false-preloader_false

	 Title       Before        After Percentage
      TotalOps        61821        65568     6.06%
       SendOps            0            0      NaN%
       RecvOps            0            0      NaN%
      Bytes/op    116033.83    105560.37    -9.03%
     Allocs/op       111.79       117.89     5.37%
       ReqT/op 506437632.00 537133056.00     6.06%
      RespT/op 506437632.00 537133056.00     6.06%
      50th-Lat    143.303µs    136.558µs    -4.71%
      90th-Lat    197.926µs    188.623µs    -4.70%
      99th-Lat    521.575µs    507.591µs    -2.68%
       Avg-Lat    161.294µs    152.038µs    -5.74%

Closes grpc#2816
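The table above comes from the grpc-go benchmark suite. For intuition about the Bytes/op and Allocs/op columns, here is a standalone Go micro-benchmark sketch of the same buffer-reuse effect; the package and benchmark names are made up, and this is not the grpc-go harness:

```go
package reuse

import (
	"sync"
	"testing"
)

// sink keeps the allocations live so the compiler can't elide them.
var sink []byte

var bufPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 10*1024) // 10 KB, matching reqSize/respSize above
		return &b
	},
}

// BenchmarkFreshBuffer allocates a new 10 KB buffer per operation,
// so Bytes/op reports roughly 10240.
func BenchmarkFreshBuffer(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = make([]byte, 10*1024)
	}
}

// BenchmarkPooledBuffer recycles buffers through a sync.Pool, so
// Bytes/op drops to near zero at steady state.
func BenchmarkPooledBuffer(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		buf := bufPool.Get().(*[]byte)
		sink = *buf
		bufPool.Put(buf)
	}
}
```

Running it with `go test -bench=. -benchmem` reports Bytes/op and Allocs/op in the same units as the table.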
adtac pushed a commit to adtac/grpc-go that referenced this issue Nov 9, 2019
Performance benchmarks can be found below. Obviously, a 10 KB request and
10 KB response is tailored to showcase this improvement, as this is where
codec buffer re-use shines, but I've run other benchmarks too (like
1-byte requests and responses) and there's no discernible impact on
performance.

To no one's surprise, the number of bytes allocated per operation goes
down by almost exactly 10 KB across 370k+ queries, which suggests
excellent buffer re-use. The number of allocations per operation
increases by about seven, but that's probably because of a few
additional slice pointers that we need to store; these are 8-byte
allocations and should have virtually no impact on the allocator and
garbage collector.

We do not allow reuse of buffers when stats handlers or binary logging
are turned on, because either of them may need access to the data and
payload even after the data has been written to the wire. In such cases,
we never return the buffer to the pool (a sketch of this gating follows
the table below).

streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_10240B-respSize_10240B-compressor_off-channelz_false-preloader_false

               Title       Before        After Percentage
            TotalOps       370480       372395     0.52%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op    116049.91    105488.90    -9.10%
           Allocs/op       111.59       118.27     6.27%
             ReqT/op 505828693.33 508443306.67     0.52%
            RespT/op 505828693.33 508443306.67     0.52%
            50th-Lat    142.553µs    143.951µs     0.98%
            90th-Lat    193.714µs     192.51µs    -0.62%
            99th-Lat    549.345µs    545.059µs    -0.78%
             Avg-Lat    161.506µs    160.635µs    -0.54%

Closes grpc#2816
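As a rough sketch of the gating described in the commit message, assuming hypothetical names (`sendState`, `finishSend`) rather than the actual code from the PR:

```go
package main

import "sync"

// sendState is a hypothetical stand-in for the per-stream state that
// would decide whether a send buffer can be recycled.
type sendState struct {
	bufPool       *sync.Pool
	statsEnabled  bool // e.g. a stats handler is registered
	binlogEnabled bool // e.g. binary logging is turned on
}

// finishSend would run after the encoded message has been written to
// the wire (or handed off to the compressor).
func (s *sendState) finishSend(buf *[]byte) {
	if s.statsEnabled || s.binlogEnabled {
		// A stats handler or the binlog may still reference this
		// payload, so never return it to the pool in that case.
		return
	}
	s.bufPool.Put(buf)
}

func main() {
	pool := &sync.Pool{New: func() interface{} {
		b := make([]byte, 0, 10*1024)
		return &b
	}}
	s := &sendState{bufPool: pool} // stats and binlog both off
	buf := pool.Get().(*[]byte)
	// ... encode into *buf and write it ...
	s.finishSend(buf) // safe to recycle here
}
```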
lock bot locked as resolved and limited conversation to collaborators on Jun 24, 2020
dfawley reopened this on Jan 26, 2021
dfawley (Member, Author) commented Jan 26, 2021

The PR that implemented this ultimately needed to be rolled back (#3307); this should have been reopened at that time.

grpc deleted a comment from the stale bot on May 3, 2021
menghanl (Contributor) commented May 3, 2021

There was another attempt to reuse the buffer for reads, but the team didn't have time to review the PR (#3220 (comment)).
Future performance work may reopen it or reuse some of its changes.

dfawley added the P3 label and removed the P2 label on May 16, 2022
ginayeh (Contributor) commented Sep 19, 2023

Bring tags over to #6619 and close this down.

ginayeh closed this as completed on Sep 19, 2023