
Improve the per-VU buffer pool #2879

Merged: 25 commits merged into grafana:master from feature/buffer-sync-pool on Mar 15, 2023

Conversation

davidpst
Contributor

This is an implementation of a shared buffer pool for all the VUs, using sync.Pool. It aims to improve the memory consumption discussed in #794.
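Below is a minimal sketch of what a sync.Pool-backed BufferPool (as in the new lib/buffer_pool.go) can look like; the constructor and the Get/Put method names are assumptions for illustration and may differ from the code merged here.

```go
package lib

import (
	"bytes"
	"sync"
)

// BufferPool hands out reusable *bytes.Buffer values backed by a sync.Pool,
// so response bodies can be read without allocating a fresh buffer each time.
type BufferPool struct {
	pool sync.Pool
}

// NewBufferPool creates a pool whose buffers are allocated lazily on first use.
func NewBufferPool() *BufferPool {
	return &BufferPool{
		pool: sync.Pool{
			New: func() any { return new(bytes.Buffer) },
		},
	}
}

// Get returns an empty buffer, reusing a pooled one when available.
func (bp *BufferPool) Get() *bytes.Buffer {
	buf := bp.pool.Get().(*bytes.Buffer)
	buf.Reset()
	return buf
}

// Put returns a buffer to the pool so later Get calls can reuse its memory.
func (bp *BufferPool) Put(buf *bytes.Buffer) {
	bp.pool.Put(buf)
}
```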

@CLAassistant

CLAassistant commented Jan 27, 2023

CLA assistant check
All committers have signed the CLA.

@na--
Member

na-- commented Jan 27, 2023

The deps CI check probably fails because we no longer use github.com/oxtoacart/bpool, so you need to run something like go mod tidy and go mod vendor and commit the changes

@codecov-commenter

codecov-commenter commented Jan 27, 2023

Codecov Report

Merging #2879 (1acc385) into master (367163b) will increase coverage by 0.02%.
The diff coverage is 88.00%.

❗ Current head 1acc385 differs from pull request most recent head ba321dc. Consider uploading reports for the commit ba321dc to get more accurate results

@@            Coverage Diff             @@
##           master    #2879      +/-   ##
==========================================
+ Coverage   76.84%   76.86%   +0.02%     
==========================================
  Files         229      224       -5     
  Lines       16924    16913      -11     
==========================================
- Hits        13005    13000       -5     
+ Misses       3077     3075       -2     
+ Partials      842      838       -4     
Flag | Coverage Δ
ubuntu | 76.86% <88.00%> (+0.10%) ⬆️
windows | ?

Flags with carried forward coverage won't be shown.

Impacted Files | Coverage Δ
lib/vu_state.go | 84.61% <ø> (ø)
lib/buffer_pool.go | 81.81% <81.81%> (ø)
cmd/tests/test_state.go | 87.75% <88.88%> (-4.09%) ⬇️
js/runner.go | 84.92% <100.00%> (+0.03%) ⬆️
lib/netext/httpext/compression.go | 84.21% <100.00%> (-0.17%) ⬇️
loader/filesystems.go | 60.00% <0.00%> (-40.00%) ⬇️
js/common/initenv.go | 66.66% <0.00%> (-33.34%) ⬇️
lib/netext/httpext/error_codes.go | 94.52% <0.00%> (-2.74%) ⬇️
loader/loader.go | 82.87% <0.00%> (-0.69%) ⬇️
... and 12 more


@davidpst
Contributor Author

You are right, done!

na-- added this to the v0.44.0 milestone Jan 30, 2023
js/runner.go

    console    *console
    setupData  []byte
    BufferPool *lib.BufferPool
Member


Why did you add a BufferPool to the js.Runner?

Contributor Author


I added it to reuse the same pool for all VUs. I suppose the js.Runner instance is unique for a load test execution.

Member


Yes, but is that actually better for performance than a separate BufferPool per VU? In the original simplistic checks I did for #794, a global sync.Pool actually performed a lot worse than a per-VU sync.Pool.

Maybe that has changed in recent Go and k6 versions, but we'd need to see a new benchmark comparison between both to justify the decision to go with a global pool. And, as a small nitpick, even if the global pool is performing better now, it should still probably be unexported.
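As an aside for readers following the thread, here is a rough sketch of the two placements being compared; the runner and vuState names are stand-ins for js.Runner and the per-VU lib.State, not the PR's actual code.

```go
package sketch

import (
	"bytes"
	"sync"
)

// bufferPool is the minimal pool from the sketch in the PR description.
type bufferPool struct{ p sync.Pool }

func newBufferPool() *bufferPool {
	return &bufferPool{p: sync.Pool{New: func() any { return new(bytes.Buffer) }}}
}

// runner stands in for js.Runner (one instance per test run);
// vuState stands in for the per-VU lib.State.
type runner struct{ bufPool *bufferPool }
type vuState struct{ bufPool *bufferPool }

// Global pool (this PR): every VU is handed the Runner's single pool.
func (r *runner) newVUGlobal() *vuState { return &vuState{bufPool: r.bufPool} }

// Per-VU pool (the variant from the original #794 benchmarks):
// each VU creates and keeps its own pool, so there is no cross-VU sharing.
func (r *runner) newVUPerVU() *vuState { return &vuState{bufPool: newBufferPool()} }
```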

Contributor Author

davidpst commented Feb 28, 2023


I ran the test (BenchmarkReadResponseBody) with both configurations, and having a global pool improves both memory usage and execution time. The first run (with the global pool) allocates less memory and can perform more iterations.

I think the result makes sense because we reuse the pooled objects more, so we neither have to allocate memory nor spend time creating new objects.

Global pool for all VUs:

BenchmarkReadResponseBody-10 1 30455673750 ns/op 138908381552 B/op 69924811 allocs/op

 checks.........................: 100.00% ✓ 222192      ✗ 0     
 data_received..................: 15 GB   507 MB/s
 data_sent......................: 21 MB   689 kB/s
 http_req_blocked...............: avg=16.18µs  min=0s       med=1µs      max=75.67ms  p(90)=2µs      p(95)=3µs    
 http_req_connecting............: avg=5.29µs   min=0s       med=0s       max=31.02ms  p(90)=0s       p(95)=0s     
 http_req_duration..............: avg=2.31ms   min=74µs     med=1.14ms   max=93.67ms  p(90)=5.53ms   p(95)=8.32ms 
   { expected_response:true }...: avg=2.31ms   min=74µs     med=1.14ms   max=93.67ms  p(90)=5.53ms   p(95)=8.32ms 
 http_req_failed................: 0.00%   ✓ 0           ✗ 222192
 http_req_receiving.............: avg=420.75µs min=9µs      med=34µs     max=56.95ms  p(90)=938µs    p(95)=1.88ms 
 http_req_sending...............: avg=27.8µs   min=1µs      med=3µs      max=69.87ms  p(90)=15µs     p(95)=43µs   
 http_req_tls_handshaking.......: avg=0s       min=0s       med=0s       max=0s       p(90)=0s       p(95)=0s     
 http_req_waiting...............: avg=1.86ms   min=60µs     med=860µs    max=93.54ms  p(90)=4.62ms   p(95)=6.87ms 
 http_reqs......................: 222192  7378.159847/s
 iteration_duration.............: avg=108.23ms min=101.33ms med=105.72ms max=204.63ms p(90)=117.13ms p(95)=121.4ms
 iterations.....................: 27774   922.269981/s
 vus............................: 100     min=100       max=100 
 vus_max........................: 100     min=100       max=100 

Individual pool per VU:

BenchmarkReadResponseBody-10 1 30450333375 ns/op 174401310944 B/op 73585506 allocs/op

 checks.........................: 100.00% ✓ 217008      ✗ 0     
 data_received..................: 15 GB   495 MB/s
 data_sent......................: 20 MB   673 kB/s
 http_req_blocked...............: avg=27µs     min=0s       med=1µs      max=44.97ms  p(90)=2µs     p(95)=3µs     
 http_req_connecting............: avg=14.71µs  min=0s       med=0s       max=42.62ms  p(90)=0s      p(95)=0s      
 http_req_duration..............: avg=2.99ms   min=75µs     med=1.36ms   max=106.88ms p(90)=7.15ms  p(95)=10.76ms 
   { expected_response:true }...: avg=2.99ms   min=75µs     med=1.36ms   max=106.88ms p(90)=7.15ms  p(95)=10.76ms 
 http_req_failed................: 0.00%   ✓ 0           ✗ 217008
 http_req_receiving.............: avg=590.9µs  min=10µs     med=59µs     max=71.81ms  p(90)=1.14ms  p(95)=2.56ms  
 http_req_sending...............: avg=34.37µs  min=1µs      med=3µs      max=64.95ms  p(90)=15µs    p(95)=45µs    
 http_req_tls_handshaking.......: avg=0s       min=0s       med=0s       max=0s       p(90)=0s      p(95)=0s      
 http_req_waiting...............: avg=2.37ms   min=61µs     med=1.02ms   max=92.94ms  p(90)=5.8ms   p(95)=8.92ms  
 http_reqs......................: 217008  7209.308715/s
 iteration_duration.............: avg=110.64ms min=101.37ms med=107.07ms max=227.74ms p(90)=121.7ms p(95)=129.85ms
 iterations.....................: 27126   901.163589/s
 vus............................: 100     min=100       max=100 
 vus_max........................: 100     min=100       max=100 


olegbespalov previously approved these changes Mar 1, 2023
Collaborator

olegbespalov left a comment


Changes look good to me, thanks 👍

I am postponing the final decision about where the buffer sync pool should live to Ned, since I feel he has better context.

@na--
Member

na-- commented Mar 6, 2023

OK, so I finally had time to re-run my original benchmarks from #794 (comment), with these 4 configurations:

  1. [results] the current k6 master, as of 0b1fe61
  2. [results] this PR with the same master branch from 1. merged into it
  3. [results] the code from 2. but with this patch to make the sync.Pool values per-VU
  4. [results] no buffer pools (i.e. just buf := &bytes.Buffer{} in readResponseBody() 😱 ), for a control (see the sketch after this list)
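To make the control case concrete, here is a hedged sketch of the buffer handling in a readResponseBody-style function; the real code in lib/netext/httpext does much more (decompression, error handling), and readBody/bufPool are illustrative names.

```go
package sketch

import (
	"bytes"
	"io"
	"sync"
)

// bufPool stands in for the PR's shared buffer pool.
var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

// readBody shows the pooled path (variants 1-3 above); variant 4, the control,
// would simply start with buf := &bytes.Buffer{} and never call Put.
func readBody(body io.Reader) ([]byte, error) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)

	if _, err := io.Copy(buf, body); err != nil {
		return nil, err
	}
	// Copy the bytes out, because buf's backing array is reused after Put.
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out, nil
}
```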

Some details about how I ran all of the tests:

As you can see, the test has a few significant differences with the one you added to the benchmarks. It is much, much larger in terms of VUs and iterations, but it also expresses the work in terms of test iterations, not in terms of test duration. So, assuming the server is roughly constant, a more performant k6 implementation will be done with the script sooner. And having the server outside of the measured k6 process helps us to (mostly) isolate the benchmark test to just the k6 performance.

So, it seems like the local per-VU sync.Pool variant (the 3rd test case) performs the worst in all categories except the "no buffer" control (4th test case) - most memory and CPU usage for the least performance. This is somewhat surprising to me, but it is consistent with both your benchmarks and my original ones from a few years ago that I put in the issue.

However, things are more complicated when we compare 1. and 2 😕 A global sync.Pool indeed has a significantly lower memory usage! 🎉 However, it also has a noticeably higher CPU usage and lower throughput/iteration rate 😞 To be fair, this benchmark is not a very realistic real-life k6 use case. It just shows that the changes in this PR are not only positive, they also have some performance costs... 😞

Personally, I would be fine with merging the PR as it is, since having ~2.5x memory reduction is probably worth the ~10% CPU increase/throughput decrease (depending on what we measure and compare). We can just leave #794 open and investigate some of the other potential improvements from this comment (or others) at some later point? Alternatively, we can try to see if some further simple improvement won't push things enough so there are no downsides to merging the PR?

@grafana/k6-core, what do you think? @imiric, you recently did grafana/k6-docs#992 and we now have https://github.com/grafana/k6-benchmarks, maybe it's worth it to run the benchmarks with some of the same variants I tested here? FWIW, I probably won't be able to dedicate time again for a more thorough investigation here or more benchmarks for at least the next 2 weeks 😞

na-- requested a review from imiric March 6, 2023 13:47
@davidpst
Contributor Author

davidpst commented Mar 8, 2023

Personally, I would be fine with merging the PR as it is, since having ~2.5x memory reduction is probably worth the ~10% CPU increase/throughput decrease (depending on what we measure and compare). We can just leave #794 open and investigate some of the other potential improvements from #794 (comment) (or others) at some later point? Alternatively, we can try to see if some further simple improvement won't push things enough so there are no downsides to merging the PR?

Yes, now the GC has to run to remove those buffers, so it consumes some CPU when it runs; you can see it using pprof. In any case, the memory optimization outweighs the CPU cost.

As you can see, the test has a few significant differences with the one you added to the benchmarks. It is much, much larger in terms of VUs and iterations, but it also expresses the work in terms of test iterations, not in terms of test duration. So, assuming the server is roughly constant, a more performant k6 implementation will be done with the script sooner. And having the server outside of the measured k6 process helps us to (mostly) isolate the benchmark test to just the k6 performance.

Sure, I committed the test with a small number of VUs and iterations to avoid overly long builds, but of course I also ran it with bigger numbers and the result was the same: the memory optimization, and the increase in GC activity associated with this kind of pool.

Contributor

imiric left a comment


Nice work @davidpst! 👏 I just have one minor comment.

I ran the script from k6-benchmarks on master (367163b) and this branch (b4a0314) on my local machine, and noticed the same behavior you guys did. This branch has much lower memory usage, at the slight cost of higher CPU usage and lower RPS.

See the results here. The graphs are a bit wonky, since the RAM line is squashed due to the higher order of RPS, but the effect is still visible. Otherwise, GitHub renders CSVs nicely in tables. :)

Question: would it be a good idea to make the new BufferPool the default implementation, but give users the choice to fall back to the previous behavior (via an env var) if they decide they'd rather not sacrifice CPU/RPS? I know we'd like to abandon the bpool dependency, but this approach might be safer in some circumstances.
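A rough illustration of that fallback idea follows; the K6_PER_VU_BUFFER_POOL env var and the factory wiring are purely hypothetical, not something this PR adds.

```go
package sketch

import (
	"bytes"
	"os"
	"sync"
)

type syncPool struct{ p sync.Pool }

func newSyncPool() *syncPool {
	return &syncPool{p: sync.Pool{New: func() any { return new(bytes.Buffer) }}}
}

// newPoolFactory returns a constructor that is called once per VU. With the
// hypothetical K6_PER_VU_BUFFER_POOL env var set it restores the previous
// per-VU behavior; otherwise every VU shares one pool, as in this PR.
func newPoolFactory() func() *syncPool {
	if os.Getenv("K6_PER_VU_BUFFER_POOL") != "" {
		return newSyncPool // each call creates a fresh, per-VU pool
	}
	shared := newSyncPool()
	return func() *syncPool { return shared } // every call returns the same pool
}
```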

This still requires more thorough testing, like @na-- says, but I don't think these additional benchmarks should block this PR from merging. We've seen that the memory improvement is substantial, and we can continue testing this on master.

lib/buffer_pool.go (outdated inline comment, resolved)
Contributor

imiric left a comment


@davidpst After an internal discussion with the team, we decided it was best to leave this PR open for a while, so that we can run a few more tests with it.

Namely, we'd like to test the performance with https://github.com/valyala/bytebufferpool, and see if we can reduce the CPU/RPS impact.

This will happen sometime in the upcoming weeks, but I can't say exactly when. Sorry about the delay!

@imiric
Contributor

imiric commented Mar 9, 2023

Here are the benchmark results when using https://github.com/valyala/bytebufferpool.

It's a negligible improvement in memory usage, practically within the margin of error, but has an even bigger CPU and RPS penalty than plain sync.Pool.

I also tried https://github.com/libp2p/go-buffer-pool, which does something similar to what is mentioned in this comment. It manages a pool of sync.Pools, to separate buffers into buckets of similar size. The issue is that it requires knowing the buffer size upfront, which doesn't work reliably for us, since we have to depend on the Content-Length header, which may be wrong or missing. See the benchmark results.
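For context, a minimal sketch of the size-bucketing idea follows; it is not the go-buffer-pool API, just an illustration of why the expected size has to be known when a buffer is requested.

```go
package sketch

import "sync"

// One sync.Pool per capacity bucket, from 1 KiB up to 4 MiB.
var bucketCaps = [...]int{1 << 10, 1 << 14, 1 << 18, 1 << 22}

var buckets = func() []*sync.Pool {
	ps := make([]*sync.Pool, len(bucketCaps))
	for i, c := range bucketCaps {
		c := c
		ps[i] = &sync.Pool{New: func() any { b := make([]byte, 0, c); return &b }}
	}
	return ps
}()

// getBuf needs the expected size up front to pick a bucket, which is why a
// missing or wrong Content-Length header breaks this scheme for HTTP bodies.
func getBuf(size int) *[]byte {
	for i, c := range bucketCaps {
		if size <= c {
			return buckets[i].Get().(*[]byte)
		}
	}
	b := make([]byte, 0, size) // larger than any bucket: allocate directly
	return &b
}

// putBuf files the buffer back into the smallest bucket that can hold it.
func putBuf(b *[]byte) {
	*b = (*b)[:0]
	for i, c := range bucketCaps {
		if cap(*b) <= c {
			buckets[i].Put(b)
			return
		}
	}
	// Oversized buffers are simply dropped and left to the GC.
}
```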

It's slightly better than bytebufferpool and a single sync.Pool, but I would also chalk it up to being within the margin of error.

So I don't see any conclusive improvements with these minor changes. Maybe we'll need a custom buffer implementation, like @na-- suggested, but that will have to wait for another time, or someone else to pick it up.

Member

na-- left a comment


Let's :shipit: then 🤷‍♂️ We can keep the issue open for now, since we can probably claw back some of that CPU inefficiency with a bit of effort.

Contributor

imiric left a comment


Sorry, forgot to re-approve this.

Thanks for your work @davidpst! 🙇

imiric merged commit 12f5dd8 into grafana:master Mar 15, 2023
inancgumus added a commit to grafana/xk6-browser that referenced this pull request Mar 16, 2023
We detected an incompatibility with k6-core:
(https://github.com/grafana/xk6-browser/actions/runs/4434049483/attempts/2)

- BPool becomes BufferPool in grafana/k6#2879.

This PR fixes the issue.

Co-authored-by: ka3de <daniel.jimenez@grafana.com>
davidpst deleted the feature/buffer-sync-pool branch March 17, 2023 13:50
imiric mentioned this pull request Apr 21, 2023