
Reduce memory allocations for seriesChunksSet.series and its chunks slices #3739

Merged 4 commits on Dec 16, 2022

Conversation

@pracucci (Collaborator) commented Dec 15, 2022

What this PR does

Looking at streaming store-gateway profiles, I can see that about 10% of memory is allocated by loadingSeriesChunksSetIterator.Next(). There are two main memory allocations there:

  • make([]seriesEntry, nextUnloaded.len())
  • make([]storepb.AggrChunk, len(s.chunks))

In this PR I propose to use a memory pool for each of them (two separate pools). To increase the chances that a pooled slice's capacity is large enough to be reused, I've adopted the following strategies:

  • []seriesEntry: allocate it based on the batch size (all real batches have that size, except the last one)
  • []storepb.AggrChunk: use a slab pool (the number of chunks per batch is not predictable, so there's no good default capacity)

I've introduced pool.SlabPool, which is like pool.BatchBytes but works with generics and can never return an error (so it's easier to use).

If this PR is accepted, and once we feel comfortable with pool.SlabPool, I will replace pool.BatchBytes with pool.SlabPool.
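The slab idea described above can be sketched roughly as follows. This is a simplified, hypothetical version written for illustration (the names, the lack of an underlying delegate pool, and the oversize handling are assumptions, not the exact code in this PR): many small slice requests are carved out of a few large backing slices, so the allocator sees one big allocation instead of thousands of tiny ones.

```go
package main

import "fmt"

// SlabPool hands out small sub-slices carved out of a few large backing
// slices ("slabs"). NOT concurrency safe; illustrative sketch only.
type SlabPool[T any] struct {
	slabSize int
	slabs    [][]T
}

func NewSlabPool[T any](slabSize int) *SlabPool[T] {
	return &SlabPool[T]{slabSize: slabSize}
}

// Get returns a slice of the requested length backed by the current slab,
// allocating a new slab when the current one can't fit the request.
func (p *SlabPool[T]) Get(size int) []T {
	if size > p.slabSize {
		// Oversized requests bypass the slab mechanism entirely.
		return make([]T, size)
	}
	if n := len(p.slabs); n == 0 || len(p.slabs[n-1])+size > cap(p.slabs[n-1]) {
		p.slabs = append(p.slabs, make([]T, 0, p.slabSize))
	}
	n := len(p.slabs)
	slab := p.slabs[n-1]
	p.slabs[n-1] = slab[:len(slab)+size]
	// The full slice expression caps the returned slice so that appending
	// to it can't spill into the rest of the slab.
	return p.slabs[n-1][len(slab) : len(slab)+size : len(slab)+size]
}

// Release drops all slabs at once; a real implementation would return them
// to an underlying pool (the "delegate") for reuse instead.
func (p *SlabPool[T]) Release() {
	p.slabs = p.slabs[:0]
}

func main() {
	pool := NewSlabPool[int](8)
	a := pool.Get(3)
	b := pool.Get(4)
	fmt.Println(len(a), len(b), len(pool.slabs)) // both served from the same slab
	pool.Release()
}
```

The design choice here is the same trade-off the PR description mentions: slabs make the per-item allocation cost negligible, at the price of some over-allocation when a slab is only partially used.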

Benchmarks:

Notes:

  • LoadingSeriesChunksSetIterator is the micro-benchmark
  • Bucket_Series is the high-level benchmark (we see a -10% improvement in some tests that allocate a lot of memory, and a degradation in some tests on low-cardinality queries, due to the higher pre-allocation done for reuse)
  • Bucket_Series_WithSkipChunks is not expected to be impacted, because this PR only affects chunks loading

I don't fully trust the performance regressions reported by BenchmarkBucket_Series, because pooling is not effective if the GC triggers continuously. BenchmarkBucket_Series is even worse than the production scenario in this respect, because it runs both the store-gateway client and server in the same process, so a lot of memory is used to unmarshal the response and keep it in memory for assertions (which a real store-gateway server doesn't do). For this reason, I will test it in a real Mimir cluster too.

name                                                                                                                 old time/op    new time/op    delta
Bucket_Series/1000000SeriesWith1Samples/with_default_options/1of1000000-12                                              110ms ± 1%     102ms ± 4%   -7.17%  (p=0.042 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_default_options/10of1000000-12                                             109ms ± 1%     101ms ± 3%   -7.24%  (p=0.029 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_default_options/1000000of1000000-12                                        1.65s ± 3%     1.67s ± 5%     ~     (p=0.683 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1of1000000-12                             82.0ms ± 3%    78.6ms ± 3%     ~     (p=0.117 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/10of1000000-12                            81.8ms ± 1%    80.1ms ± 3%     ~     (p=0.367 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1000000of1000000-12                        1.41s ± 1%     1.44s ± 2%     ~     (p=0.243 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1of1000000-12                            68.6ms ± 3%    64.4ms ± 2%     ~     (p=0.051 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/10of1000000-12                           67.3ms ± 1%    63.7ms ± 2%   -5.34%  (p=0.024 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1000000of1000000-12                       1.31s ± 1%     1.29s ± 3%     ~     (p=0.539 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/1of10000000-12                                           6.75ms ± 4%    5.80ms ± 0%  -14.09%  (p=0.019 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/100of10000000-12                                         6.99ms ± 1%    5.90ms ± 1%  -15.52%  (p=0.000 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/10000000of10000000-12                                     181ms ± 2%     172ms ± 2%   -5.21%  (p=0.023 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/1of10000000-12                           8.45ms ± 1%    7.82ms ± 2%   -7.49%  (p=0.017 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/100of10000000-12                         8.49ms ± 2%    7.87ms ± 2%   -7.25%  (p=0.012 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12                     149ms ± 1%     145ms ± 1%     ~     (p=0.056 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/1of10000000-12                          7.09ms ± 2%    6.25ms ± 3%  -11.80%  (p=0.005 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/100of10000000-12                        6.96ms ± 2%    6.16ms ± 2%  -11.53%  (p=0.001 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12                    180ms ± 2%     161ms ± 3%  -10.60%  (p=0.006 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/1of10000000-12                            250µs ± 2%     243µs ± 1%     ~     (p=0.065 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/100of10000000-12                          254µs ± 0%     242µs ± 1%   -4.53%  (p=0.005 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12                    86.2ms ± 1%    83.8ms ± 2%     ~     (p=0.089 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/1of10000000-12                           261µs ± 1%     246µs ± 1%   -5.91%  (p=0.002 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/100of10000000-12                         258µs ± 3%     245µs ± 1%     ~     (p=0.069 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12                   85.9ms ± 1%    84.2ms ± 1%   -2.00%  (p=0.019 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/1of10000000-12                                            238µs ± 2%     225µs ± 1%   -5.70%  (p=0.014 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/100of10000000-12                                          247µs ± 6%     229µs ± 2%     ~     (p=0.134 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/10000000of10000000-12                                    79.8ms ± 1%    76.0ms ± 1%   -4.76%  (p=0.005 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_default_options/1000000of1000000-12                         1.30s ± 4%     1.21s ± 1%     ~     (p=0.073 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1000000of1000000-12         1.37s ± 2%     1.27s ± 2%   -7.70%  (p=0.005 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1000000of1000000-12        1.34s ± 1%     1.27s ± 2%   -5.48%  (p=0.024 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_default_options/10000000of10000000-12                      124ms ± 3%     119ms ± 3%     ~     (p=0.101 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12      144ms ± 2%     131ms ± 1%   -9.16%  (p=0.010 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12     160ms ± 2%     145ms ± 1%   -9.22%  (p=0.003 n=3+3)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_default_options/10000000of10000000-12                      425µs ± 0%     385µs ± 2%   -9.25%  (p=0.005 n=3+3)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12      723µs ± 2%     685µs ± 5%     ~     (p=0.154 n=3+3)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12     697µs ± 2%     666µs ± 4%     ~     (p=0.141 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10-12                                                                       1.03ms ± 2%    0.81ms ± 2%  -22.02%  (p=0.000 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_100-12                                                                      5.90ms ± 2%    4.05ms ± 4%  -31.40%  (p=0.000 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_1000-12                                                                     54.9ms ± 3%    36.0ms ± 2%  -34.33%  (p=0.000 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10000-12                                                                     544ms ± 3%     362ms ± 5%  -33.33%  (p=0.000 n=3+3)

name                                                                                                                 old alloc/op   new alloc/op   delta
Bucket_Series/1000000SeriesWith1Samples/with_default_options/1of1000000-12                                             65.9MB ± 0%    65.9MB ± 0%     ~     (p=0.917 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_default_options/10of1000000-12                                            65.9MB ± 0%    65.9MB ± 0%     ~     (p=0.412 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_default_options/1000000of1000000-12                                       1.79GB ± 0%    1.79GB ± 0%     ~     (p=0.227 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1of1000000-12                             84.1MB ± 0%    84.1MB ± 0%   +0.06%  (p=0.050 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/10of1000000-12                            84.1MB ± 0%    84.1MB ± 0%     ~     (p=0.227 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1000000of1000000-12                       1.37GB ± 0%    1.23GB ± 0%   -9.90%  (p=0.000 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1of1000000-12                            59.1MB ± 0%    59.3MB ± 0%     ~     (p=0.115 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/10of1000000-12                           59.1MB ± 0%    59.3MB ± 0%     ~     (p=0.269 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1000000of1000000-12                      1.24GB ± 0%    1.11GB ± 0%  -10.49%  (p=0.000 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/1of10000000-12                                           5.04MB ± 0%    5.04MB ± 0%     ~     (p=0.638 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/100of10000000-12                                         5.04MB ± 0%    5.05MB ± 0%     ~     (p=0.127 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/10000000of10000000-12                                     204MB ± 1%     209MB ± 3%     ~     (p=0.239 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/1of10000000-12                           8.22MB ± 0%    8.25MB ± 0%   +0.36%  (p=0.025 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/100of10000000-12                         8.22MB ± 0%    8.25MB ± 0%   +0.34%  (p=0.018 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12                     173MB ± 0%     161MB ± 0%   -7.44%  (p=0.000 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/1of10000000-12                          5.77MB ± 0%    5.85MB ± 0%   +1.26%  (p=0.004 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/100of10000000-12                        5.77MB ± 0%    5.88MB ± 1%   +1.90%  (p=0.009 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12                    168MB ± 0%     156MB ± 0%   -7.39%  (p=0.000 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/1of10000000-12                            218kB ± 0%     225kB ± 0%   +3.00%  (p=0.004 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/100of10000000-12                          219kB ± 0%     224kB ± 0%   +2.72%  (p=0.000 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12                     226MB ± 0%     226MB ± 0%   +0.04%  (p=0.029 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/1of10000000-12                           242kB ± 1%     267kB ± 0%  +10.35%  (p=0.000 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/100of10000000-12                         242kB ± 0%     269kB ± 2%  +11.03%  (p=0.007 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12                    227MB ± 0%     228MB ± 0%   +0.35%  (p=0.005 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/1of10000000-12                                            217kB ± 0%     217kB ± 0%     ~     (p=0.098 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/100of10000000-12                                          217kB ± 0%     217kB ± 0%   +0.38%  (p=0.004 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/10000000of10000000-12                                     223MB ± 0%     223MB ± 0%     ~     (p=0.730 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_default_options/1000000of1000000-12                        1.95GB ± 0%    1.95GB ± 0%     ~     (p=0.168 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1000000of1000000-12        1.54GB ± 0%    1.54GB ± 0%     ~     (p=0.465 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1000000of1000000-12       1.46GB ± 0%    1.46GB ± 0%     ~     (p=0.870 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_default_options/10000000of10000000-12                      179MB ± 0%     179MB ± 0%   +0.00%  (p=0.038 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12      153MB ± 0%     153MB ± 0%     ~     (p=0.889 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12     149MB ± 0%     148MB ± 0%     ~     (p=0.083 n=3+3)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_default_options/10000000of10000000-12                      709kB ± 0%     710kB ± 0%   +0.16%  (p=0.000 n=3+3)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12      732kB ± 0%     732kB ± 0%     ~     (p=0.928 n=3+3)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12     883kB ± 1%     885kB ± 1%     ~     (p=0.875 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10-12                                                                        869kB ± 0%     160kB ± 0%  -81.52%  (p=0.000 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_100-12                                                                      7.37MB ± 0%    0.16MB ± 0%  -97.83%  (p=0.000 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_1000-12                                                                     71.5MB ± 0%     0.2MB ± 1%  -99.71%  (p=0.000 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10000-12                                                                     712MB ± 0%       3MB ± 0%  -99.61%  (p=0.000 n=3+3)

name                                                                                                                 old allocs/op  new allocs/op  delta
Bucket_Series/1000000SeriesWith1Samples/with_default_options/1of1000000-12                                              9.73k ± 0%     9.73k ± 0%     ~     (p=0.948 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_default_options/10of1000000-12                                             9.88k ± 0%     9.90k ± 0%     ~     (p=0.141 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_default_options/1000000of1000000-12                                        18.0M ± 0%     18.0M ± 0%     ~     (p=0.226 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1of1000000-12                              14.4k ± 0%     14.4k ± 0%     ~     (p=0.683 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/10of1000000-12                             14.5k ± 0%     14.5k ± 0%     ~     (p=0.961 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1000000of1000000-12                        17.1M ± 0%     16.1M ± 0%   -5.83%  (p=0.000 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1of1000000-12                             6.21k ± 0%     6.21k ± 0%     ~     (p=0.966 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/10of1000000-12                            6.35k ± 0%     6.35k ± 0%     ~     (p=0.720 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1000000of1000000-12                       17.0M ± 0%     16.0M ± 0%   -5.87%  (p=0.000 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/1of10000000-12                                            1.14k ± 0%     1.14k ± 0%     ~     (p=0.621 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/100of10000000-12                                          1.20k ± 0%     1.20k ± 0%     ~     (p=0.138 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/10000000of10000000-12                                     1.80M ± 0%     1.81M ± 0%     ~     (p=0.236 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/1of10000000-12                            1.64k ± 0%     1.65k ± 0%   +0.20%  (p=0.013 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/100of10000000-12                          1.70k ± 0%     1.70k ± 0%     ~     (p=0.811 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12                     1.71M ± 0%     1.61M ± 0%   -5.82%  (p=0.000 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/1of10000000-12                             822 ± 0%       825 ± 0%     ~     (p=0.122 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/100of10000000-12                           877 ± 0%       874 ± 0%   -0.34%  (p=0.029 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12                    1.70M ± 0%     1.60M ± 0%   -5.86%  (p=0.000 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/1of10000000-12                              250 ± 0%       252 ± 0%   +0.67%  (p=0.038 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/100of10000000-12                            250 ± 0%       252 ± 0%     ~     (zero variance)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12                      334k ± 0%      334k ± 0%     ~     (p=0.433 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/1of10000000-12                             250 ± 0%       251 ± 0%     ~     (zero variance)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/100of10000000-12                           250 ± 0%       251 ± 0%     ~     (zero variance)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12                     334k ± 0%      334k ± 0%   +0.00%  (p=0.008 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/1of10000000-12                                              244 ± 0%       245 ± 0%     ~     (zero variance)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/100of10000000-12                                            244 ± 0%       245 ± 0%     ~     (zero variance)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/10000000of10000000-12                                      334k ± 0%      334k ± 0%   +0.00%  (p=0.009 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_default_options/1000000of1000000-12                         11.0M ± 0%     11.0M ± 0%     ~     (p=0.441 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1000000of1000000-12         10.1M ± 0%     10.1M ± 0%     ~     (p=0.892 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1000000of1000000-12        10.0M ± 0%     10.0M ± 0%     ~     (p=0.528 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_default_options/10000000of10000000-12                      1.10M ± 0%     1.10M ± 0%   +0.00%  (p=0.039 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12      1.01M ± 0%     1.01M ± 0%     ~     (p=0.619 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12     1.00M ± 0%     1.00M ± 0%     ~     (p=0.234 n=3+3)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_default_options/10000000of10000000-12                        798 ± 0%       802 ± 0%     ~     (zero variance)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12        816 ± 0%       816 ± 0%     ~     (zero variance)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12       815 ± 0%       815 ± 0%     ~     (zero variance)
LoadingSeriesChunksSetIterator/batch_size:_10-12                                                                        3.21k ± 0%     2.40k ± 0%     ~     (zero variance)
LoadingSeriesChunksSetIterator/batch_size:_100-12                                                                       12.2k ± 0%      2.4k ± 0%     ~     (zero variance)
LoadingSeriesChunksSetIterator/batch_size:_1000-12                                                                       102k ± 0%        3k ± 0%  -97.26%  (p=0.000 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10000-12                                                                     1.00M ± 0%     0.00M ± 0%  -99.68%  (p=0.000 n=3+3)

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pracucci (Collaborator, author) commented Dec 15, 2022

I've tested this PR in a dev cluster (mimir-dev-09), load testing the store-gateway with a tool we're using to benchmark the streaming store-gateway (code), and comparing the results across 2 load test scenarios.

How to read the results:

  • zone-a: store-gateway deployed on top of main, with streaming store-gateway, without mmap
  • zone-b: store-gateway deployed on this PR, with streaming store-gateway, without mmap

Scenario 1: few large requests (about 50-100k series per request)

[6 screenshots: zone-a vs zone-b metrics during scenario 1]

As a cross-check, here is a profiling comparison over the whole test range. Look at the total memory allocated by loadingSeriesChunksSetIterator.Next():

[Screenshot: profiling comparison for the whole test range]

Scenario 2: many very small requests (about 10 series per request)

[7 screenshots: zone-a vs zone-b metrics during scenario 2]

Conclusion

Not every metric has the same weight. The test cluster is deployed in read-write mode, so the backend deployment runs other components too (note that the compactor was not running during the test). I think the most reliable metrics here are allocations/sec and the profile data (even though profiles are sampled, and so not a 100% accurate representation of reality).

My conclusion is that, as expected, this optimization can reduce memory allocations (bytes) by about 10% when the store-gateway is under heavy load from a few large requests, while it's not significant for small requests.

@pracucci pracucci marked this pull request as ready for review December 16, 2022 07:30
@pracucci pracucci requested a review from a team as a code owner December 16, 2022 07:30
// If it's releasable then we try to reuse a slice from the pool.
if seriesReleasable {
	if reused := seriesEntrySlicePool.Get(); reused != nil {
		prealloc = *(reused.(*[]seriesEntry))
Contributor:

Micro-optimization, feel free to ignore, but I think it's worth mentioning: we're optimizing allocations here, but we're also throwing away a pointer, and then the reused slice escapes again:

		reuse := b.series[:0]
		seriesEntrySlicePool.Put(&reuse)

What if we keep the reused pointer in seriesChunksSet and put it back into the pool later, after overwriting the value, like:

			// here:
			b.releasableSeries = reused.(*[]seriesEntry)
			prealloc = *(b.releasableSeries)


		// ... later in release():
		*b.releasableSeries = b.series[:0] // TODO: unless the pointer is nil
		seriesEntrySlicePool.Put(b.releasableSeries)
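The reviewer's point is that storing `*[]seriesEntry` in the pool but re-boxing a fresh pointer on every Put allocates the pointer itself. A minimal self-contained sketch of the "keep the pooled pointer" pattern follows; the struct shape, `newSet`, and `release` are hypothetical scaffolding around the reviewer's suggested `releasableSeries` field, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

type seriesEntry struct{ name string }

// The pool stores *[]seriesEntry so that Put/Get don't re-box the slice
// header into an interface value on every round trip.
var seriesEntrySlicePool = sync.Pool{}

type seriesChunksSet struct {
	series []seriesEntry
	// releasableSeries keeps the pointer obtained from the pool, so
	// release() can return the very same pointer without allocating.
	releasableSeries *[]seriesEntry
}

func newSet(capacity int) *seriesChunksSet {
	s := &seriesChunksSet{}
	if reused, ok := seriesEntrySlicePool.Get().(*[]seriesEntry); ok && cap(*reused) >= capacity {
		s.releasableSeries = reused
		s.series = (*reused)[:0]
	} else {
		s.series = make([]seriesEntry, 0, capacity)
	}
	return s
}

func (s *seriesChunksSet) release() {
	if s.releasableSeries == nil {
		// First time this slice is pooled: box the pointer once.
		reuse := s.series[:0]
		s.releasableSeries = &reuse
	} else {
		*s.releasableSeries = s.series[:0]
	}
	seriesEntrySlicePool.Put(s.releasableSeries)
	s.series = nil
}

func main() {
	a := newSet(4)
	a.series = append(a.series, seriesEntry{name: "up"})
	a.release()
	b := newSet(4) // may reuse the slice released above (sync.Pool gives no guarantee)
	fmt.Println(cap(b.series) >= 4)
}
```

Note that sync.Pool may drop entries at any GC, so reuse is opportunistic; the pattern only removes the extra pointer allocation when reuse does happen.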

for c := range b.series[i].chks {
b.series[i].chks[c].MinTime = 0
b.series[i].chks[c].MaxTime = 0
b.series[i].chks[c].Raw = nil
Contributor:

Are we reusing .chks? I can see them being set to nil below, and if so, why bother zeroing them here? Is that something that chunkReleaser does? In that case it would be worth a comment, because this code looks useless from this point of view.

@dimitarvdimitrov (Contributor) commented Dec 16, 2022:

My understanding is that we zero each chunk so that the slice we put back into seriesChunksSlicePool can be used later without problems.

If that's correct, how about we change pool.SlabPool to

type Resetable interface {
  Reset()
}

type SlabPool[T Resetable] struct {
	delegate Interface
	slabSize int
	slabs    []*[]T
}

and also change SlabPool.Release to

func (b *SlabPool[T]) Release() {
	for _, slab := range b.slabs {
		for i := range *slab {
			(*slab)[i].Reset()
		}
		// The slab length will be reset to 0 in the Get().
		b.delegate.Put(slab)
	}

	b.slabs = b.slabs[:0]
}

storepb.AggrChunk already has Reset() implemented by protogen

Contributor:

Although if we go with my suggestion, using SlabPool instead of BatchBytes will be more difficult, because we don't want to call .Reset() on each byte individually.

Contributor:

Okay, I can see how it's reused in b.seriesChunksPool.Release(); my fault. I would bring this zeroing code closer to the place where it's reused. I like your suggestion too, but I'm not sure we want to force the items to be resettable.

Collaborator (author):

I don't have a strong preference. I simplified the reset code in 59937e9 and I think that's just fine. I wouldn't add further indirection.

@@ -176,3 +176,71 @@ func (b *BatchBytes) Get(sz int) ([]byte, error) {
*slab = (*slab)[:len(*slab)+sz]
return (*slab)[len(*slab)-sz : len(*slab) : len(*slab)], nil
}
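The three-index ("full") slice expression in the Get above pins the returned slice's capacity, so a later append on the returned slice cannot clobber the rest of the slab. A small self-contained illustration of this Go mechanism (not code from this PR):

```go
package main

import "fmt"

func main() {
	slab := make([]byte, 8)

	// Carve out bytes 2..4. The third index caps the result at length 3,
	// so appending to it must copy to a new array instead of writing
	// into slab[5:].
	piece := slab[2:5:5]
	fmt.Println(len(piece), cap(piece))

	piece = append(piece, 0xFF)
	fmt.Println(slab[5]) // untouched: the append reallocated
}
```

Without the capacity index (i.e. `slab[2:5]`), `cap(piece)` would be 6 and the append would silently overwrite `slab[5]`, corrupting a neighboring allocation handed out from the same slab.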

// SlabPool is NOT concurrency safe.
type SlabPool[T any] struct {
Contributor:

Can we write a comment explaining what SlabPool does? You can ask ChatGPT if you want :D

Collaborator (author):

This was a very difficult request. I attempted it in a114772.

@dimitarvdimitrov (Contributor) left a review comment:

That's pretty cool :) I really like SlabPool. I left some suggestions and questions, but overall LGTM

pkg/util/pool/pool.go (resolved thread)
// The capacity MUST be guaranteed. If it's smaller, then we forget it and
// it will be reallocated.
if cap(prealloc) < seriesCapacity {
	prealloc = nil
Contributor:

Should we return it to the pool instead of throwing it away?

Collaborator (author):

I thought about it and I have mixed feelings. Putting it back into the pool means that the next time we get it from the pool it may still be smaller than seriesCapacity, so we're losing an opportunity to reuse another slice which could be big enough.

For the specific use case of this PR, this should never happen because our batch size is currently fixed.

Contributor:

The batch size for seriesChunksSet may change when we do memory quotas. I hope we don't forget about this.

Collaborator (author):

hope we don't forget about this

Agree. When we add quotas, we'll have to revisit all pooling. My suggestion is to use bucketed batch sizes (e.g. batch sizes rounded to a fixed, small number of buckets, with one pool per bucket), but we can discuss it once we're there.
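The bucketed-pool idea mentioned above could look roughly like this. Everything here is a hypothetical sketch (bucket sizes, names, and the choice of sync.Pool are illustrative assumptions, not a committed design): requests are rounded up to a small fixed set of capacities, with one pool per capacity, so variable batch sizes still map onto reusable slices.

```go
package main

import (
	"fmt"
	"sync"
)

// bucketSizes are the fixed capacities requests are rounded up to;
// one sync.Pool per bucket keeps slices of equal capacity together.
var bucketSizes = []int{1_000, 5_000, 10_000, 50_000}

var bucketPools = func() []*sync.Pool {
	pools := make([]*sync.Pool, len(bucketSizes))
	for i := range pools {
		pools[i] = &sync.Pool{}
	}
	return pools
}()

// bucketFor returns the index of the smallest bucket that fits size,
// or -1 when the request is larger than every bucket.
func bucketFor(size int) int {
	for i, b := range bucketSizes {
		if size <= b {
			return i
		}
	}
	return -1
}

func getSlice(size int) []int {
	i := bucketFor(size)
	if i < 0 {
		return make([]int, size) // too big to pool
	}
	if reused, ok := bucketPools[i].Get().(*[]int); ok {
		return (*reused)[:size]
	}
	return make([]int, size, bucketSizes[i])
}

func putSlice(s []int) {
	// Only pool slices whose capacity exactly matches a bucket, so every
	// slice in a bucket is interchangeable.
	if i := bucketFor(cap(s)); i >= 0 && cap(s) == bucketSizes[i] {
		reuse := s[:0]
		bucketPools[i].Put(&reuse)
	}
}

func main() {
	s := getSlice(3_000) // rounded up to the 5_000 bucket
	fmt.Println(len(s), cap(s))
	putSlice(s)
}
```

The trade-off is the same one the benchmark notes describe: rounding up wastes some memory per request, but keeps the number of distinct capacities small enough that pooled slices are actually reusable.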

for c := range b.series[i].chks {
b.series[i].chks[c].MinTime = 0
b.series[i].chks[c].MaxTime = 0
b.series[i].chks[c].Raw = nil
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my understanding is that we zero each chunk so that the slice we put back into seriesChunksSlicePool can be used later without problem.

If that's correct, how about we change pool.SlabPool to

type Resetable interface {
	Reset()
}

type SlabPool[T Resetable] struct {
	delegate Interface
	slabSize int
	slabs    []*[]T
}

and also change SlabPool.Release to

func (b *SlabPool[T]) Release() {
	for _, slab := range b.slabs {
		// Index into *slab: slabs holds *[]T, which can't be ranged over
		// directly, and ranging by value would Reset a copy of each item.
		for i := range *slab {
			(*slab)[i].Reset()
		}
		// The slab length will be reset to 0 in the Get().
		b.delegate.Put(slab)
	}

	b.slabs = b.slabs[:0]
}

storepb.AggrChunk already has Reset() implemented by protogen
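One wrinkle with this suggestion: protogen's Reset() has a pointer receiver, so storepb.AggrChunk itself does not satisfy a plain `T Resetable` constraint (only *AggrChunk does). A hedged sketch of how the reset-on-release step could be written with a pointer constraint, using a toy `chunk` type and a hypothetical `resetSlab` helper:

```go
package main

import "fmt"

// Resetable matches the interface suggested above.
type Resetable interface{ Reset() }

// chunk is a toy stand-in for storepb.AggrChunk; protogen's Reset()
// similarly zeroes the struct via a pointer receiver.
type chunk struct {
	MinTime, MaxTime int64
	Raw              []byte
}

func (c *chunk) Reset() { *c = chunk{} }

// resetSlab zeroes every element of a slab before it goes back to a pool.
// The PT constraint lets us call the pointer-receiver Reset on addressable
// slice elements; ranging by value would Reset a copy, not the element.
func resetSlab[T any, PT interface {
	Resetable
	*T
}](slab []T) {
	for i := range slab {
		PT(&slab[i]).Reset()
	}
}

func main() {
	slab := []chunk{{MinTime: 1, MaxTime: 2, Raw: []byte{0xAB}}}
	resetSlab[chunk, *chunk](slab)
	fmt.Println(slab[0].MinTime, slab[0].Raw == nil) // 0 true
}
```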

pkg/storegateway/series_chunks.go

// If it's releasable then we try to reuse a slice from the pool.
if seriesReleasable {
if reused := seriesEntrySlicePool.Get(); reused != nil {
Contributor

A thought more than a suggestion: I liked how in other places we inject the pool, like the chunk bytes pool for the chunks' Raw data. It would make tests less dependent on each other (if you write one bad test you can fail a lot of them), slightly more consistent, and a bit clearer that we're managing our memory.

I don't have experience with memory management, so if it's generally better to make this pooling as invisible as possible, then your approach is better.

Collaborator Author

seriesChunksSet uses two pools and passing them around is a bit annoying. I would do it only if there's a real justification. Currently I've been able to test pooling in a clean way anyway.
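For context, the injection alternative being weighed here could look like the sketch below. Every name (`pool`, `makePool`, `newIterator`) is hypothetical; it only illustrates why passing two pools through constructors is more plumbing than package-level pools:

```go
package main

import "fmt"

// pool is a minimal generic pool interface covering what both slice
// types need.
type pool[T any] interface {
	Get(capacity int) []T
	Put(s []T)
}

// makePool is a trivial, non-reusing implementation, handy as a test double.
type makePool[T any] struct{}

func (makePool[T]) Get(capacity int) []T { return make([]T, 0, capacity) }
func (makePool[T]) Put([]T)              {}

type seriesEntry struct{}
type aggrChunk struct{}

// iterator sketches the injection approach: both pools are passed in at
// construction time, so tests can substitute doubles instead of sharing
// package-level pools.
type iterator struct {
	seriesPool pool[seriesEntry]
	chunksPool pool[aggrChunk]
}

func newIterator(sp pool[seriesEntry], cp pool[aggrChunk]) *iterator {
	return &iterator{seriesPool: sp, chunksPool: cp}
}

func main() {
	it := newIterator(makePool[seriesEntry]{}, makePool[aggrChunk]{})
	fmt.Println(cap(it.seriesPool.Get(10))) // 10
}
```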

pkg/storegateway/series_chunks_test.go
…lices

Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci force-pushed the optimize-seriesChunksSet-series-memory-allocation branch from 44500fc to 59d459e on December 16, 2022 11:09
No performance difference:

name                                                 old time/op    new time/op    delta
LoadingSeriesChunksSetIterator/batch_size:_10-12        819µs ± 2%     802µs ± 1%   ~     (p=0.221 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_100-12      4.16ms ± 1%    4.10ms ± 3%   ~     (p=0.392 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_1000-12     37.6ms ± 2%    37.3ms ± 3%   ~     (p=0.769 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10000-12     372ms ± 2%     371ms ± 5%   ~     (p=0.880 n=3+3)

name                                                 old alloc/op   new alloc/op   delta
LoadingSeriesChunksSetIterator/batch_size:_10-12        160kB ± 0%     160kB ± 0%   ~     (p=0.693 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_100-12       160kB ± 0%     160kB ± 0%   ~     (p=0.699 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_1000-12      208kB ± 2%     206kB ± 0%   ~     (p=0.362 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10000-12    2.79MB ± 0%    2.79MB ± 0%   ~     (p=0.473 n=3+3)

name                                                 old allocs/op  new allocs/op  delta
LoadingSeriesChunksSetIterator/batch_size:_10-12        2.40k ± 0%     2.40k ± 0%   ~     (zero variance)
LoadingSeriesChunksSetIterator/batch_size:_100-12       2.40k ± 0%     2.40k ± 0%   ~     (zero variance)
LoadingSeriesChunksSetIterator/batch_size:_1000-12      2.81k ± 0%     2.81k ± 0%   ~     (p=0.423 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10000-12     3.18k ± 0%     3.17k ± 0%   ~     (p=0.426 n=3+3)

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
…ator

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Contributor @dimitarvdimitrov left a comment

LGTM, thank you!

@pracucci pracucci merged commit db6e5cc into main Dec 16, 2022
@pracucci pracucci deleted the optimize-seriesChunksSet-series-memory-allocation branch December 16, 2022 12:59

3 participants