Improve the per-VU buffer pool #2879
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2879 +/- ##
==========================================
+ Coverage 76.84% 76.86% +0.02%
==========================================
Files 229 224 -5
Lines 16924 16913 -11
==========================================
- Hits 13005 13000 -5
+ Misses 3077 3075 -2
+ Partials 842 838 -4
You are right, done!
…t/k6 into feature/buffer-sync-pool
setupData  []byte
console    *console
BufferPool *lib.BufferPool
Why did you add a `BufferPool` to the `js.Runner`?
I added it to reuse the same pool for all VUs. I suppose the `js.Runner` instance is unique per load test execution.
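For anyone following along, here is a minimal, hypothetical sketch of the wiring being discussed: the runner owns a single `sync.Pool`-backed buffer pool and hands the same instance to every VU it creates. The `Runner`, `VU`, and `NewVU` names below are stand-ins for illustration, not the actual k6 types.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// BufferPool is a thin wrapper around sync.Pool that hands out *bytes.Buffer values.
type BufferPool struct{ p sync.Pool }

func NewBufferPool() *BufferPool {
	return &BufferPool{p: sync.Pool{New: func() interface{} { return new(bytes.Buffer) }}}
}

func (bp *BufferPool) Get() *bytes.Buffer  { return bp.p.Get().(*bytes.Buffer) }
func (bp *BufferPool) Put(b *bytes.Buffer) { b.Reset(); bp.p.Put(b) }

// Runner stands in for js.Runner: one pool per test run.
type Runner struct{ BufferPool *BufferPool }

// VU stands in for a single virtual user; it borrows the runner's pool.
type VU struct{ pool *BufferPool }

func (r *Runner) NewVU() *VU { return &VU{pool: r.BufferPool} }

func main() {
	r := &Runner{BufferPool: NewBufferPool()}
	vu1, vu2 := r.NewVU(), r.NewVU()
	fmt.Println(vu1.pool == vu2.pool) // true: buffers released by one VU can be reused by another
}
```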
Yes, but is that actually better for performance than a separate `BufferPool` per VU? In the original simplistic checks I did for #794, a global `sync.Pool` actually performed a lot worse than a per-VU `sync.Pool`.
Maybe that has changed in recent Go and k6 versions, but we'd need to see a new benchmark comparison between both to justify the decision to go with a global pool. And, as a small nitpick, even if the global pool is performing better now, it should probably still be unexported.
I ran the test (BenchmarkReadResponseBody) with both configurations, and having a global pool improves both memory usage and execution time. The first run (with the global pool) allocates less memory and completes more iterations.
I think the result makes sense: because we reuse the pooled objects more, we don't have to allocate as much memory or spend time creating new objects.
```
Global pool for all VUs
BenchmarkReadResponseBody-10 1 30455673750 ns/op 138908381552 B/op 69924811 allocs/op
checks.........................: 100.00% ✓ 222192 ✗ 0
data_received..................: 15 GB 507 MB/s
data_sent......................: 21 MB 689 kB/s
http_req_blocked...............: avg=16.18µs min=0s med=1µs max=75.67ms p(90)=2µs p(95)=3µs
http_req_connecting............: avg=5.29µs min=0s med=0s max=31.02ms p(90)=0s p(95)=0s
http_req_duration..............: avg=2.31ms min=74µs med=1.14ms max=93.67ms p(90)=5.53ms p(95)=8.32ms
{ expected_response:true }...: avg=2.31ms min=74µs med=1.14ms max=93.67ms p(90)=5.53ms p(95)=8.32ms
http_req_failed................: 0.00% ✓ 0 ✗ 222192
http_req_receiving.............: avg=420.75µs min=9µs med=34µs max=56.95ms p(90)=938µs p(95)=1.88ms
http_req_sending...............: avg=27.8µs min=1µs med=3µs max=69.87ms p(90)=15µs p(95)=43µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=1.86ms min=60µs med=860µs max=93.54ms p(90)=4.62ms p(95)=6.87ms
http_reqs......................: 222192 7378.159847/s
iteration_duration.............: avg=108.23ms min=101.33ms med=105.72ms max=204.63ms p(90)=117.13ms p(95)=121.4ms
iterations.....................: 27774 922.269981/s
vus............................: 100 min=100 max=100
vus_max........................: 100 min=100 max=100
Individual pool per VU
BenchmarkReadResponseBody-10 1 30450333375 ns/op 174401310944 B/op 73585506 allocs/op
checks.........................: 100.00% ✓ 217008 ✗ 0
data_received..................: 15 GB 495 MB/s
data_sent......................: 20 MB 673 kB/s
http_req_blocked...............: avg=27µs min=0s med=1µs max=44.97ms p(90)=2µs p(95)=3µs
http_req_connecting............: avg=14.71µs min=0s med=0s max=42.62ms p(90)=0s p(95)=0s
http_req_duration..............: avg=2.99ms min=75µs med=1.36ms max=106.88ms p(90)=7.15ms p(95)=10.76ms
{ expected_response:true }...: avg=2.99ms min=75µs med=1.36ms max=106.88ms p(90)=7.15ms p(95)=10.76ms
http_req_failed................: 0.00% ✓ 0 ✗ 217008
http_req_receiving.............: avg=590.9µs min=10µs med=59µs max=71.81ms p(90)=1.14ms p(95)=2.56ms
http_req_sending...............: avg=34.37µs min=1µs med=3µs max=64.95ms p(90)=15µs p(95)=45µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=2.37ms min=61µs med=1.02ms max=92.94ms p(90)=5.8ms p(95)=8.92ms
http_reqs......................: 217008 7209.308715/s
iteration_duration.............: avg=110.64ms min=101.37ms med=107.07ms max=227.74ms p(90)=121.7ms p(95)=129.85ms
iterations.....................: 27126 901.163589/s
vus............................: 100 min=100 max=100
vus_max........................: 100 min=100 max=100
```
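To make the comparison concrete, here is a small, hypothetical Go benchmark sketch (not the actual `BenchmarkReadResponseBody`) that contrasts a single shared pool against one pool per goroutine; the 64 KiB body size and the helper names are made up for illustration.

```go
package pool

import (
	"bytes"
	"sync"
	"testing"
)

var body = make([]byte, 64*1024) // stand-in for a 64 KiB response body

func newPool() *sync.Pool {
	return &sync.Pool{New: func() interface{} { return new(bytes.Buffer) }}
}

// useBuffer simulates copying a response body into a pooled buffer and returning it.
func useBuffer(p *sync.Pool) {
	buf := p.Get().(*bytes.Buffer)
	buf.Write(body)
	buf.Reset()
	p.Put(buf)
}

// One pool shared by every goroutine, analogous to the global pool on js.Runner.
func BenchmarkSharedPool(b *testing.B) {
	shared := newPool()
	b.ReportAllocs()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			useBuffer(shared)
		}
	})
}

// One pool per goroutine, analogous to the previous per-VU pool.
func BenchmarkPerGoroutinePool(b *testing.B) {
	b.ReportAllocs()
	b.RunParallel(func(pb *testing.PB) {
		local := newPool()
		for pb.Next() {
			useBuffer(local)
		}
	})
}
```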
Changes look good to me, thanks 👍
I am postponing the final decision about where the buffer sync pool should live to Ned, since I feel he has better context.
OK, so I finally had time to re-run my original benchmarks from #794 (comment), with these 4 configurations:
Some details about how I ran all of the tests:
As you can see, the test has a few significant differences from the one you added to the benchmarks. It is much, much larger in terms of VUs and iterations, but it also expresses the work in terms of test iterations, not in terms of test duration. So, assuming the server is roughly constant, a more performant k6 implementation will be done with the script sooner. And having the server outside of the measured k6 process helps us to (mostly) isolate the benchmark to just the k6 performance.

So, it seems like the local per-VU

However, things are more complicated when we compare 1. and 2. 😕 A global

Personally, I would be fine with merging the PR as it is, since having a ~2.5x memory reduction is probably worth the ~10% CPU increase/throughput decrease (depending on what we measure and compare). We can just leave #794 open and investigate some of the other potential improvements from this comment (or others) at some later point? Alternatively, we can try to see if some further simple improvement won't push things enough so there are no downsides to merging the PR?

@grafana/k6-core, what do you think? @imiric, you recently did grafana/k6-docs#992 and we now have https://github.com/grafana/k6-benchmarks, maybe it's worth it to run the benchmarks with some of the same variants I tested here?

FWIW, I probably won't be able to dedicate time again for a more thorough investigation here or more benchmarks for at least the next 2 weeks 😞
Yes, the GC now has to run to remove those buffers, so it consumes some CPU when it runs; you can see it using pprof. In any case, the memory optimization is bigger than the CPU cost.
Sure, I committed the test with a small number of VUs and iterations to avoid longer builds, but of course I also ran it with bigger numbers and the result was the same: the memory optimization, plus the increase in GC activity related to this kind of pool.
Nice work @davidpst! 👏 I just have one minor comment.
I ran the script from k6-benchmarks on `master` (367163b) and this branch (b4a0314) on my local machine, and noticed the same behavior you guys did. This branch has much lower memory usage, at the slight cost of higher CPU usage and lower RPS.
See the results here. The graphs are a bit wonky, since the RAM line is squashed due to the higher order of RPS, but the effect is still visible. Otherwise, GitHub renders CSVs nicely in tables. :)
Question: would it be a good idea to make the new `BufferPool` the default implementation, but give the user the choice to fall back to the previous behavior (via an env var), if they decide they'd rather not sacrifice CPU/RPS? I know we'd like to abandon the `bpool` dependency, but this approach might be safer in some circumstances.
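A rough, hypothetical sketch of what that escape hatch could look like, assuming an invented `K6_LEGACY_BUFFER_POOL` environment variable; the interface and type names do not exist in k6, and the "legacy" branch here just allocates fresh buffers rather than pulling in `bpool`.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"sync"
)

// bufferSource is the minimal surface both implementations would share.
type bufferSource interface {
	Get() *bytes.Buffer
	Put(*bytes.Buffer)
}

// syncPool is the new sync.Pool-backed behavior from this PR (simplified).
type syncPool struct{ p sync.Pool }

func (s *syncPool) Get() *bytes.Buffer  { return s.p.Get().(*bytes.Buffer) }
func (s *syncPool) Put(b *bytes.Buffer) { b.Reset(); s.p.Put(b) }

// freshAlloc stands in for the fallback path; a real implementation would
// keep using the previous bpool-based pool instead of allocating every time.
type freshAlloc struct{}

func (freshAlloc) Get() *bytes.Buffer { return new(bytes.Buffer) }
func (freshAlloc) Put(*bytes.Buffer)  {}

// newBufferSource picks the implementation based on the hypothetical env var.
func newBufferSource() bufferSource {
	if os.Getenv("K6_LEGACY_BUFFER_POOL") != "" {
		return freshAlloc{}
	}
	return &syncPool{p: sync.Pool{New: func() interface{} { return new(bytes.Buffer) }}}
}

func main() {
	src := newBufferSource()
	buf := src.Get()
	buf.WriteString("hello")
	fmt.Println(buf.Len()) // 5
	src.Put(buf)
}
```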
This still requires more thorough testing, like @na-- says, but I don't think these additional benchmarks should block this PR from merging. We've seen that the memory improvement is substantial, and we can continue testing this on `master`.
@davidpst After an internal discussion with the team, we decided it was best to leave this PR open for a while, so that we can run a few more tests with it.
Namely, we'd like to test the performance with https://github.com/valyala/bytebufferpool, and see if we can reduce the CPU/RPS impact.
This will happen sometime in the upcoming weeks, but I can't say exactly when. Sorry about the delay!
Here are the benchmark results when using https://github.com/valyala/bytebufferpool. It's a negligible improvement in memory usage, practically within the margin of error, but it has an even bigger CPU and RPS penalty than a plain `sync.Pool`.

I also tried https://github.com/libp2p/go-buffer-pool, which does something similar to what is mentioned in this comment. It manages a pool of

It's slightly better than bytebufferpool and a single

So I don't see any conclusive improvements with these minor changes. Maybe we'll need a custom buffer implementation, like @na-- suggested, but that will have to wait for another time, or for someone else to pick it up.
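For reference, this is roughly how `valyala/bytebufferpool` reads when dropped in behind a Get/Put pair; it's only an illustrative sketch of the swap, not the code that was actually benchmarked.

```go
package main

import (
	"fmt"

	"github.com/valyala/bytebufferpool"
)

func main() {
	buf := bytebufferpool.Get() // *bytebufferpool.ByteBuffer from the package-level pool
	buf.WriteString("response body goes here")
	fmt.Println(buf.Len())
	bytebufferpool.Put(buf) // Put resets the buffer before returning it to the pool
}
```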
Let's merge it then 🤷♂️; we can keep the issue open for now, since we can probably claw back some of that CPU inefficiency with a bit of effort.
Sorry, forgot to re-approve this.
Thanks for your work @davidpst! 🙇
We detected an incompatibility with k6-core: (https://github.com/grafana/xk6-browser/actions/runs/4434049483/attempts/2) - BPool becomes BufferPool in grafana/k6#2879. This PR fixes the issue. Co-authored-by: ka3de <daniel.jimenez@grafana.com>
This is an implementation of a shared buffer pool for all the VUs using `sync.Pool`. It aims to improve the memory consumption discussed in #794.
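As a rough illustration of where such a pool pays off, here is a hedged, self-contained sketch of reading an HTTP response body through a shared `sync.Pool` of `*bytes.Buffer` values; the names and structure are simplified and are not the actual k6 code.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
)

// A shared pool of *bytes.Buffer, in the spirit of the pool this PR introduces.
var bufPool = sync.Pool{New: func() interface{} { return new(bytes.Buffer) }}

// readBody copies the response body into a pooled buffer instead of allocating
// a fresh buffer per request, then copies the bytes out before recycling it.
func readBody(resp *http.Response) ([]byte, error) {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()
		bufPool.Put(buf)
	}()
	if _, err := io.Copy(buf, resp.Body); err != nil {
		return nil, err
	}
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out, nil
}

func main() {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello from the test server")
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := readBody(resp)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```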