
Reduce memory allocations for seriesChunksSet.series and its chunks slices #3739

Merged 4 commits on Dec 16, 2022

Conversation

@pracucci (Collaborator) commented Dec 15, 2022

What this PR does

Looking at streaming store-gateway profiles, I can see that about 10% of memory is allocated by loadingSeriesChunksSetIterator.Next(). There are two main memory allocations there:

  • make([]seriesEntry, nextUnloaded.len())
  • make([]storepb.AggrChunk, len(s.chunks))

In this PR I propose to use a memory pool for each of them (two separate pools). To increase the chances that a pooled slice's capacity is large enough to be reused, I've adopted the following strategies:

  • []seriesEntry: allocate it based on the batch size (all real batches have that size, except the last one)
  • []storepb.AggrChunk: use a slab pool (the number of chunks per batch is not predictable, so there's no good default capacity)

I've introduced pool.SlabPool, which is like pool.BatchBytes but works with generics and can never return an error (so it's easier to use).

If this PR is accepted, and once we feel comfortable with pool.SlabPool, I will replace pool.BatchBytes with pool.SlabPool.
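The slab idea described above can be sketched roughly as follows. This is a simplified, hypothetical version written for illustration (the names, the lack of an underlying delegate pool, and the oversize handling are assumptions, not the exact code in this PR): many small slice requests are carved out of a few large backing slices, so the allocator sees one big allocation instead of thousands of tiny ones.

```go
package main

import "fmt"

// SlabPool hands out small sub-slices carved out of a few large backing
// slices ("slabs"). NOT concurrency safe; illustrative sketch only.
type SlabPool[T any] struct {
	slabSize int
	slabs    [][]T
}

func NewSlabPool[T any](slabSize int) *SlabPool[T] {
	return &SlabPool[T]{slabSize: slabSize}
}

// Get returns a slice of the requested length backed by the current slab,
// allocating a new slab when the current one can't fit the request.
func (p *SlabPool[T]) Get(size int) []T {
	if size > p.slabSize {
		// Oversized requests bypass the slab mechanism entirely.
		return make([]T, size)
	}
	if n := len(p.slabs); n == 0 || len(p.slabs[n-1])+size > cap(p.slabs[n-1]) {
		p.slabs = append(p.slabs, make([]T, 0, p.slabSize))
	}
	n := len(p.slabs)
	slab := p.slabs[n-1]
	p.slabs[n-1] = slab[:len(slab)+size]
	// The full slice expression caps the returned slice so that appending
	// to it can't spill into the rest of the slab.
	return p.slabs[n-1][len(slab) : len(slab)+size : len(slab)+size]
}

// Release drops all slabs at once; a real implementation would return them
// to an underlying pool (the "delegate") for reuse instead.
func (p *SlabPool[T]) Release() {
	p.slabs = p.slabs[:0]
}

func main() {
	pool := NewSlabPool[int](8)
	a := pool.Get(3)
	b := pool.Get(4)
	fmt.Println(len(a), len(b), len(pool.slabs)) // both served from the same slab
	pool.Release()
}
```

The design choice here is the same trade-off the PR description mentions: slabs make the per-item allocation cost negligible, at the price of some over-allocation when a slab is only partially used.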

Benchmarks:

Notes:

  • LoadingSeriesChunksSetIterator is the micro-benchmark
  • Bucket_Series is the high-level benchmark (we see a -10% improvement in some tests that allocate a lot of memory, and a degradation in some tests on low-cardinality queries, due to the higher pre-allocation done for reuse)
  • Bucket_Series_WithSkipChunks is not expected to be impacted, because this PR only affects chunks loading

I don't fully trust the performance regressions reported by BenchmarkBucket_Series, because pooling is not effective if the GC triggers continuously. BenchmarkBucket_Series is even worse than the production scenario in this respect, because it runs both the store-gateway client and server in the same process, so a lot of memory is used to unmarshal the response and keep it in memory for assertions (which a real store-gateway server doesn't do). For this reason, I will test it in a real Mimir cluster too.

name                                                                                                                 old time/op    new time/op    delta
Bucket_Series/1000000SeriesWith1Samples/with_default_options/1of1000000-12                                              110ms ± 1%     102ms ± 4%   -7.17%  (p=0.042 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_default_options/10of1000000-12                                             109ms ± 1%     101ms ± 3%   -7.24%  (p=0.029 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_default_options/1000000of1000000-12                                        1.65s ± 3%     1.67s ± 5%     ~     (p=0.683 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1of1000000-12                             82.0ms ± 3%    78.6ms ± 3%     ~     (p=0.117 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/10of1000000-12                            81.8ms ± 1%    80.1ms ± 3%     ~     (p=0.367 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1000000of1000000-12                        1.41s ± 1%     1.44s ± 2%     ~     (p=0.243 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1of1000000-12                            68.6ms ± 3%    64.4ms ± 2%     ~     (p=0.051 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/10of1000000-12                           67.3ms ± 1%    63.7ms ± 2%   -5.34%  (p=0.024 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1000000of1000000-12                       1.31s ± 1%     1.29s ± 3%     ~     (p=0.539 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/1of10000000-12                                           6.75ms ± 4%    5.80ms ± 0%  -14.09%  (p=0.019 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/100of10000000-12                                         6.99ms ± 1%    5.90ms ± 1%  -15.52%  (p=0.000 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/10000000of10000000-12                                     181ms ± 2%     172ms ± 2%   -5.21%  (p=0.023 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/1of10000000-12                           8.45ms ± 1%    7.82ms ± 2%   -7.49%  (p=0.017 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/100of10000000-12                         8.49ms ± 2%    7.87ms ± 2%   -7.25%  (p=0.012 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12                     149ms ± 1%     145ms ± 1%     ~     (p=0.056 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/1of10000000-12                          7.09ms ± 2%    6.25ms ± 3%  -11.80%  (p=0.005 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/100of10000000-12                        6.96ms ± 2%    6.16ms ± 2%  -11.53%  (p=0.001 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12                    180ms ± 2%     161ms ± 3%  -10.60%  (p=0.006 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/1of10000000-12                            250µs ± 2%     243µs ± 1%     ~     (p=0.065 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/100of10000000-12                          254µs ± 0%     242µs ± 1%   -4.53%  (p=0.005 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12                    86.2ms ± 1%    83.8ms ± 2%     ~     (p=0.089 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/1of10000000-12                           261µs ± 1%     246µs ± 1%   -5.91%  (p=0.002 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/100of10000000-12                         258µs ± 3%     245µs ± 1%     ~     (p=0.069 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12                   85.9ms ± 1%    84.2ms ± 1%   -2.00%  (p=0.019 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/1of10000000-12                                            238µs ± 2%     225µs ± 1%   -5.70%  (p=0.014 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/100of10000000-12                                          247µs ± 6%     229µs ± 2%     ~     (p=0.134 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/10000000of10000000-12                                    79.8ms ± 1%    76.0ms ± 1%   -4.76%  (p=0.005 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_default_options/1000000of1000000-12                         1.30s ± 4%     1.21s ± 1%     ~     (p=0.073 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1000000of1000000-12         1.37s ± 2%     1.27s ± 2%   -7.70%  (p=0.005 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1000000of1000000-12        1.34s ± 1%     1.27s ± 2%   -5.48%  (p=0.024 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_default_options/10000000of10000000-12                      124ms ± 3%     119ms ± 3%     ~     (p=0.101 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12      144ms ± 2%     131ms ± 1%   -9.16%  (p=0.010 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12     160ms ± 2%     145ms ± 1%   -9.22%  (p=0.003 n=3+3)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_default_options/10000000of10000000-12                      425µs ± 0%     385µs ± 2%   -9.25%  (p=0.005 n=3+3)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12      723µs ± 2%     685µs ± 5%     ~     (p=0.154 n=3+3)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12     697µs ± 2%     666µs ± 4%     ~     (p=0.141 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10-12                                                                       1.03ms ± 2%    0.81ms ± 2%  -22.02%  (p=0.000 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_100-12                                                                      5.90ms ± 2%    4.05ms ± 4%  -31.40%  (p=0.000 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_1000-12                                                                     54.9ms ± 3%    36.0ms ± 2%  -34.33%  (p=0.000 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10000-12                                                                     544ms ± 3%     362ms ± 5%  -33.33%  (p=0.000 n=3+3)

name                                                                                                                 old alloc/op   new alloc/op   delta
Bucket_Series/1000000SeriesWith1Samples/with_default_options/1of1000000-12                                             65.9MB ± 0%    65.9MB ± 0%     ~     (p=0.917 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_default_options/10of1000000-12                                            65.9MB ± 0%    65.9MB ± 0%     ~     (p=0.412 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_default_options/1000000of1000000-12                                       1.79GB ± 0%    1.79GB ± 0%     ~     (p=0.227 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1of1000000-12                             84.1MB ± 0%    84.1MB ± 0%   +0.06%  (p=0.050 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/10of1000000-12                            84.1MB ± 0%    84.1MB ± 0%     ~     (p=0.227 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1000000of1000000-12                       1.37GB ± 0%    1.23GB ± 0%   -9.90%  (p=0.000 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1of1000000-12                            59.1MB ± 0%    59.3MB ± 0%     ~     (p=0.115 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/10of1000000-12                           59.1MB ± 0%    59.3MB ± 0%     ~     (p=0.269 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1000000of1000000-12                      1.24GB ± 0%    1.11GB ± 0%  -10.49%  (p=0.000 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/1of10000000-12                                           5.04MB ± 0%    5.04MB ± 0%     ~     (p=0.638 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/100of10000000-12                                         5.04MB ± 0%    5.05MB ± 0%     ~     (p=0.127 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/10000000of10000000-12                                     204MB ± 1%     209MB ± 3%     ~     (p=0.239 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/1of10000000-12                           8.22MB ± 0%    8.25MB ± 0%   +0.36%  (p=0.025 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/100of10000000-12                         8.22MB ± 0%    8.25MB ± 0%   +0.34%  (p=0.018 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12                     173MB ± 0%     161MB ± 0%   -7.44%  (p=0.000 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/1of10000000-12                          5.77MB ± 0%    5.85MB ± 0%   +1.26%  (p=0.004 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/100of10000000-12                        5.77MB ± 0%    5.88MB ± 1%   +1.90%  (p=0.009 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12                    168MB ± 0%     156MB ± 0%   -7.39%  (p=0.000 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/1of10000000-12                            218kB ± 0%     225kB ± 0%   +3.00%  (p=0.004 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/100of10000000-12                          219kB ± 0%     224kB ± 0%   +2.72%  (p=0.000 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12                     226MB ± 0%     226MB ± 0%   +0.04%  (p=0.029 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/1of10000000-12                           242kB ± 1%     267kB ± 0%  +10.35%  (p=0.000 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/100of10000000-12                         242kB ± 0%     269kB ± 2%  +11.03%  (p=0.007 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12                    227MB ± 0%     228MB ± 0%   +0.35%  (p=0.005 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/1of10000000-12                                            217kB ± 0%     217kB ± 0%     ~     (p=0.098 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/100of10000000-12                                          217kB ± 0%     217kB ± 0%   +0.38%  (p=0.004 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/10000000of10000000-12                                     223MB ± 0%     223MB ± 0%     ~     (p=0.730 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_default_options/1000000of1000000-12                        1.95GB ± 0%    1.95GB ± 0%     ~     (p=0.168 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1000000of1000000-12        1.54GB ± 0%    1.54GB ± 0%     ~     (p=0.465 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1000000of1000000-12       1.46GB ± 0%    1.46GB ± 0%     ~     (p=0.870 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_default_options/10000000of10000000-12                      179MB ± 0%     179MB ± 0%   +0.00%  (p=0.038 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12      153MB ± 0%     153MB ± 0%     ~     (p=0.889 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12     149MB ± 0%     148MB ± 0%     ~     (p=0.083 n=3+3)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_default_options/10000000of10000000-12                      709kB ± 0%     710kB ± 0%   +0.16%  (p=0.000 n=3+3)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12      732kB ± 0%     732kB ± 0%     ~     (p=0.928 n=3+3)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12     883kB ± 1%     885kB ± 1%     ~     (p=0.875 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10-12                                                                        869kB ± 0%     160kB ± 0%  -81.52%  (p=0.000 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_100-12                                                                      7.37MB ± 0%    0.16MB ± 0%  -97.83%  (p=0.000 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_1000-12                                                                     71.5MB ± 0%     0.2MB ± 1%  -99.71%  (p=0.000 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10000-12                                                                     712MB ± 0%       3MB ± 0%  -99.61%  (p=0.000 n=3+3)

name                                                                                                                 old allocs/op  new allocs/op  delta
Bucket_Series/1000000SeriesWith1Samples/with_default_options/1of1000000-12                                              9.73k ± 0%     9.73k ± 0%     ~     (p=0.948 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_default_options/10of1000000-12                                             9.88k ± 0%     9.90k ± 0%     ~     (p=0.141 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_default_options/1000000of1000000-12                                        18.0M ± 0%     18.0M ± 0%     ~     (p=0.226 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1of1000000-12                              14.4k ± 0%     14.4k ± 0%     ~     (p=0.683 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/10of1000000-12                             14.5k ± 0%     14.5k ± 0%     ~     (p=0.961 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1000000of1000000-12                        17.1M ± 0%     16.1M ± 0%   -5.83%  (p=0.000 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1of1000000-12                             6.21k ± 0%     6.21k ± 0%     ~     (p=0.966 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/10of1000000-12                            6.35k ± 0%     6.35k ± 0%     ~     (p=0.720 n=3+3)
Bucket_Series/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1000000of1000000-12                       17.0M ± 0%     16.0M ± 0%   -5.87%  (p=0.000 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/1of10000000-12                                            1.14k ± 0%     1.14k ± 0%     ~     (p=0.621 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/100of10000000-12                                          1.20k ± 0%     1.20k ± 0%     ~     (p=0.138 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_default_options/10000000of10000000-12                                     1.80M ± 0%     1.81M ± 0%     ~     (p=0.236 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/1of10000000-12                            1.64k ± 0%     1.65k ± 0%   +0.20%  (p=0.013 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/100of10000000-12                          1.70k ± 0%     1.70k ± 0%     ~     (p=0.811 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12                     1.71M ± 0%     1.61M ± 0%   -5.82%  (p=0.000 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/1of10000000-12                             822 ± 0%       825 ± 0%     ~     (p=0.122 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/100of10000000-12                           877 ± 0%       874 ± 0%   -0.34%  (p=0.029 n=3+3)
Bucket_Series/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12                    1.70M ± 0%     1.60M ± 0%   -5.86%  (p=0.000 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/1of10000000-12                              250 ± 0%       252 ± 0%   +0.67%  (p=0.038 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/100of10000000-12                            250 ± 0%       252 ± 0%     ~     (zero variance)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12                      334k ± 0%      334k ± 0%     ~     (p=0.433 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/1of10000000-12                             250 ± 0%       251 ± 0%     ~     (zero variance)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/100of10000000-12                           250 ± 0%       251 ± 0%     ~     (zero variance)
Bucket_Series/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12                     334k ± 0%      334k ± 0%   +0.00%  (p=0.008 n=3+3)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/1of10000000-12                                              244 ± 0%       245 ± 0%     ~     (zero variance)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/100of10000000-12                                            244 ± 0%       245 ± 0%     ~     (zero variance)
Bucket_Series/1SeriesWith10000000Samples/with_default_options/10000000of10000000-12                                      334k ± 0%      334k ± 0%   +0.00%  (p=0.009 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_default_options/1000000of1000000-12                         11.0M ± 0%     11.0M ± 0%     ~     (p=0.441 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_series_streaming_(1K_per_batch)/1000000of1000000-12         10.1M ± 0%     10.1M ± 0%     ~     (p=0.892 n=3+3)
Bucket_Series_WithSkipChunks/1000000SeriesWith1Samples/with_series_streaming_(10K_per_batch)/1000000of1000000-12        10.0M ± 0%     10.0M ± 0%     ~     (p=0.528 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_default_options/10000000of10000000-12                      1.10M ± 0%     1.10M ± 0%   +0.00%  (p=0.039 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12      1.01M ± 0%     1.01M ± 0%     ~     (p=0.619 n=3+3)
Bucket_Series_WithSkipChunks/100000SeriesWith100Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12     1.00M ± 0%     1.00M ± 0%     ~     (p=0.234 n=3+3)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_default_options/10000000of10000000-12                        798 ± 0%       802 ± 0%     ~     (zero variance)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_series_streaming_(1K_per_batch)/10000000of10000000-12        816 ± 0%       816 ± 0%     ~     (zero variance)
Bucket_Series_WithSkipChunks/1SeriesWith10000000Samples/with_series_streaming_(10K_per_batch)/10000000of10000000-12       815 ± 0%       815 ± 0%     ~     (zero variance)
LoadingSeriesChunksSetIterator/batch_size:_10-12                                                                        3.21k ± 0%     2.40k ± 0%     ~     (zero variance)
LoadingSeriesChunksSetIterator/batch_size:_100-12                                                                       12.2k ± 0%      2.4k ± 0%     ~     (zero variance)
LoadingSeriesChunksSetIterator/batch_size:_1000-12                                                                       102k ± 0%        3k ± 0%  -97.26%  (p=0.000 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10000-12                                                                     1.00M ± 0%     0.00M ± 0%  -99.68%  (p=0.000 n=3+3)

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pracucci (Collaborator, author) commented Dec 15, 2022

I've tested this PR in a dev cluster (mimir-dev-09), load testing the store-gateway with a tool we're using to benchmark the streaming store-gateway (code), and comparing the results across 2 load test scenarios.

How to read the results:

  • zone-a: store-gateway deployed on top of main, with streaming store-gateway, without mmap
  • zone-b: store-gateway deployed on this PR, with streaming store-gateway, without mmap

Scenario 1: few large requests (about 50-100k series per request)

[6 screenshots: zone-a vs zone-b metrics during scenario 1]

As a cross-check, here is a profiling comparison over the whole test range. Look at the total memory allocated by loadingSeriesChunksSetIterator.Next():

[Screenshot: profiling comparison for the whole test range]

Scenario 2: many very small requests (about 10 series per request)

[7 screenshots: zone-a vs zone-b metrics during scenario 2]

Conclusion

Not every metric has the same weight. The test cluster is deployed in read-write mode, so the backend deployment runs other components too (note that the compactor was not running during the test). I think the most reliable metrics here are allocations/sec and the profile data (even though profiles are sampled, and so not a 100% accurate representation of reality).

My conclusion is that, as expected, this optimization can reduce memory allocations (bytes) by about 10% when the store-gateway is under heavy load from a few large requests, while it's not significant for small requests.

@pracucci pracucci marked this pull request as ready for review December 16, 2022 07:30
@pracucci pracucci requested a review from a team as a code owner December 16, 2022 07:30
// If it's releasable then we try to reuse a slice from the pool.
if seriesReleasable {
	if reused := seriesEntrySlicePool.Get(); reused != nil {
		prealloc = *(reused.(*[]seriesEntry))
Contributor:

Micro-optimization, feel free to ignore, but I think it's worth mentioning: we're optimizing allocations here, but we're also throwing away a pointer, and then the reused slice escapes again:

		reuse := b.series[:0]
		seriesEntrySlicePool.Put(&reuse)

What if we keep the reused pointer in seriesChunksSet and put it back into the pool later, after overwriting the value, like:

			// here:
			b.releasableSeries = reused.(*[]seriesEntry)
			prealloc = *(b.releasableSeries)


		// ... later in release():
		*b.releasableSeries = b.series[:0] // TODO: unless the pointer is nil
		seriesEntrySlicePool.Put(b.releasableSeries)
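The reviewer's point is that storing `*[]seriesEntry` in the pool but re-boxing a fresh pointer on every Put allocates the pointer itself. A minimal self-contained sketch of the "keep the pooled pointer" pattern follows; the struct shape, `newSet`, and `release` are hypothetical scaffolding around the reviewer's suggested `releasableSeries` field, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

type seriesEntry struct{ name string }

// The pool stores *[]seriesEntry so that Put/Get don't re-box the slice
// header into an interface value on every round trip.
var seriesEntrySlicePool = sync.Pool{}

type seriesChunksSet struct {
	series []seriesEntry
	// releasableSeries keeps the pointer obtained from the pool, so
	// release() can return the very same pointer without allocating.
	releasableSeries *[]seriesEntry
}

func newSet(capacity int) *seriesChunksSet {
	s := &seriesChunksSet{}
	if reused, ok := seriesEntrySlicePool.Get().(*[]seriesEntry); ok && cap(*reused) >= capacity {
		s.releasableSeries = reused
		s.series = (*reused)[:0]
	} else {
		s.series = make([]seriesEntry, 0, capacity)
	}
	return s
}

func (s *seriesChunksSet) release() {
	if s.releasableSeries == nil {
		// First time this slice is pooled: box the pointer once.
		reuse := s.series[:0]
		s.releasableSeries = &reuse
	} else {
		*s.releasableSeries = s.series[:0]
	}
	seriesEntrySlicePool.Put(s.releasableSeries)
	s.series = nil
}

func main() {
	a := newSet(4)
	a.series = append(a.series, seriesEntry{name: "up"})
	a.release()
	b := newSet(4) // may reuse the slice released above (sync.Pool gives no guarantee)
	fmt.Println(cap(b.series) >= 4)
}
```

Note that sync.Pool may drop entries at any GC, so reuse is opportunistic; the pattern only removes the extra pointer allocation when reuse does happen.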

for c := range b.series[i].chks {
b.series[i].chks[c].MinTime = 0
b.series[i].chks[c].MaxTime = 0
b.series[i].chks[c].Raw = nil
Contributor:

Are we reusing .chks? I can see them being set to nil below, and if so, why bother zeroing them here? Is that something that chunkReleaser does? In that case it would be worth a comment, because this code looks useless from this point of view.

@dimitarvdimitrov (Contributor) commented Dec 16, 2022:

My understanding is that we zero each chunk so that the slice we put back into seriesChunksSlicePool can be used later without problems.

If that's correct, how about we change pool.SlabPool to

type Resetable interface {
  Reset()
}

type SlabPool[T Resetable] struct {
	delegate Interface
	slabSize int
	slabs    []*[]T
}

and also change SlabPool.Release to

func (b *SlabPool[T]) Release() {
	for _, slab := range b.slabs {
		for i := range *slab {
			(*slab)[i].Reset()
		}
		// The slab length will be reset to 0 in the Get().
		b.delegate.Put(slab)
	}

	b.slabs = b.slabs[:0]
}

storepb.AggrChunk already has Reset() implemented by protogen

Contributor:

Although if we go with my suggestion, using SlabPool instead of BatchBytes will be more difficult, because we don't want to call .Reset() on each byte individually.

Contributor:

Okay, I can see how it's reused in b.seriesChunksPool.Release(); my fault. I would bring this zeroing code closer to the place where it's reused. I like your suggestion too, but I'm not sure we want to force the items to be resettable.

Collaborator (author):

I don't have a strong preference. I simplified the reset code in 59937e9 and I think that's just fine. I wouldn't add further indirection.

@@ -176,3 +176,71 @@ func (b *BatchBytes) Get(sz int) ([]byte, error) {
*slab = (*slab)[:len(*slab)+sz]
return (*slab)[len(*slab)-sz : len(*slab) : len(*slab)], nil
}
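The three-index ("full") slice expression in the Get above pins the returned slice's capacity, so a later append on the returned slice cannot clobber the rest of the slab. A small self-contained illustration of this Go mechanism (not code from this PR):

```go
package main

import "fmt"

func main() {
	slab := make([]byte, 8)

	// Carve out bytes 2..4. The third index caps the result at length 3,
	// so appending to it must copy to a new array instead of writing
	// into slab[5:].
	piece := slab[2:5:5]
	fmt.Println(len(piece), cap(piece))

	piece = append(piece, 0xFF)
	fmt.Println(slab[5]) // untouched: the append reallocated
}
```

Without the capacity index (i.e. `slab[2:5]`), `cap(piece)` would be 6 and the append would silently overwrite `slab[5]`, corrupting a neighboring allocation handed out from the same slab.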

// SlabPool is NOT concurrency safe.
type SlabPool[T any] struct {
Contributor:

Can we write a comment explaining what SlabPool does? You can ask ChatGPT if you want :D

Collaborator (author):

This was a very difficult request. I attempted it in a114772.

@dimitarvdimitrov (Contributor) left a review comment:

That's pretty cool :) I really like SlabPool. I left some suggestions and questions, but overall LGTM

pkg/util/pool/pool.go (resolved thread)
// The capacity MUST be guaranteed. If it's smaller, then we forget it and
// it will be reallocated.
if cap(prealloc) < seriesCapacity {
	prealloc = nil
Contributor:

Should we return it to the pool instead of throwing it away?

Collaborator (author):

I thought about it and I have mixed feelings. Putting it back into the pool means that the next time we get it from the pool it may still be smaller than seriesCapacity, so we're losing an opportunity to reuse another slice which could be big enough.

For the specific use case of this PR, this should never happen because our batch size is currently fixed.

Contributor:

The batch size for seriesChunksSet may change when we do memory quotas. I hope we don't forget about this.

Collaborator (author):

hope we don't forget about this

Agree. When we add quotas, we'll have to revisit all pooling. My suggestion is to use bucketed batch sizes (e.g. batch sizes rounded to a fixed, small number of buckets, with one pool per bucket), but we can discuss it once we're there.
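The bucketed-pool idea mentioned above could look roughly like this. Everything here is a hypothetical sketch (bucket sizes, names, and the choice of sync.Pool are illustrative assumptions, not a committed design): requests are rounded up to a small fixed set of capacities, with one pool per capacity, so variable batch sizes still map onto reusable slices.

```go
package main

import (
	"fmt"
	"sync"
)

// bucketSizes are the fixed capacities requests are rounded up to;
// one sync.Pool per bucket keeps slices of equal capacity together.
var bucketSizes = []int{1_000, 5_000, 10_000, 50_000}

var bucketPools = func() []*sync.Pool {
	pools := make([]*sync.Pool, len(bucketSizes))
	for i := range pools {
		pools[i] = &sync.Pool{}
	}
	return pools
}()

// bucketFor returns the index of the smallest bucket that fits size,
// or -1 when the request is larger than every bucket.
func bucketFor(size int) int {
	for i, b := range bucketSizes {
		if size <= b {
			return i
		}
	}
	return -1
}

func getSlice(size int) []int {
	i := bucketFor(size)
	if i < 0 {
		return make([]int, size) // too big to pool
	}
	if reused, ok := bucketPools[i].Get().(*[]int); ok {
		return (*reused)[:size]
	}
	return make([]int, size, bucketSizes[i])
}

func putSlice(s []int) {
	// Only pool slices whose capacity exactly matches a bucket, so every
	// slice in a bucket is interchangeable.
	if i := bucketFor(cap(s)); i >= 0 && cap(s) == bucketSizes[i] {
		reuse := s[:0]
		bucketPools[i].Put(&reuse)
	}
}

func main() {
	s := getSlice(3_000) // rounded up to the 5_000 bucket
	fmt.Println(len(s), cap(s))
	putSlice(s)
}
```

The trade-off is the same one the benchmark notes describe: rounding up wastes some memory per request, but keeps the number of distinct capacities small enough that pooled slices are actually reusable.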

for c := range b.series[i].chks {
b.series[i].chks[c].MinTime = 0
b.series[i].chks[c].MaxTime = 0
b.series[i].chks[c].Raw = nil
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my understanding is that we zero each chunk so that the slice we put back into seriesChunksSlicePool can be used later without problem.

If that's correct, how about we change pool.SlabPool to

type Resetable interface {
	Reset()
}

type SlabPool[T Resetable] struct {
	delegate Interface
	slabSize int
	slabs    []*[]T
}

and also change SlabPool.Release to

func (b *SlabPool[T]) Release() {
	for _, slab := range b.slabs {
		// Index into *slab: slabs holds *[]T, which can't be ranged over
		// directly, and ranging by value would Reset a copy of each item.
		for i := range *slab {
			(*slab)[i].Reset()
		}
		// The slab length will be reset to 0 in the Get().
		b.delegate.Put(slab)
	}

	b.slabs = b.slabs[:0]
}

storepb.AggrChunk already has Reset() implemented by protogen
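One wrinkle with this suggestion: protogen's Reset() has a pointer receiver, so storepb.AggrChunk itself does not satisfy a plain `T Resetable` constraint (only *AggrChunk does). A hedged sketch of how the reset-on-release step could be written with a pointer constraint, using a toy `chunk` type and a hypothetical `resetSlab` helper:

```go
package main

import "fmt"

// Resetable matches the interface suggested above.
type Resetable interface{ Reset() }

// chunk is a toy stand-in for storepb.AggrChunk; protogen's Reset()
// similarly zeroes the struct via a pointer receiver.
type chunk struct {
	MinTime, MaxTime int64
	Raw              []byte
}

func (c *chunk) Reset() { *c = chunk{} }

// resetSlab zeroes every element of a slab before it goes back to a pool.
// The PT constraint lets us call the pointer-receiver Reset on addressable
// slice elements; ranging by value would Reset a copy, not the element.
func resetSlab[T any, PT interface {
	Resetable
	*T
}](slab []T) {
	for i := range slab {
		PT(&slab[i]).Reset()
	}
}

func main() {
	slab := []chunk{{MinTime: 1, MaxTime: 2, Raw: []byte{0xAB}}}
	resetSlab[chunk, *chunk](slab)
	fmt.Println(slab[0].MinTime, slab[0].Raw == nil) // 0 true
}
```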

pkg/storegateway/series_chunks.go

// If it's releasable then we try to reuse a slice from the pool.
if seriesReleasable {
if reused := seriesEntrySlicePool.Get(); reused != nil {
Contributor

A thought more than a suggestion: I liked how in other places we inject the pool, like the chunk bytes pool for the chunks' Raw data. It would make tests less dependent on each other (if you write one bad test you can fail a lot of them), slightly more consistent, and a bit clearer that we're managing our memory.

I don't have experience with memory management, so if it's generally better to make this pooling as invisible as possible, then your approach is better.

Collaborator Author

seriesChunksSet uses two pools and passing them around is a bit annoying. I would do it only if there's a real justification. Currently I've been able to test pooling in a clean way anyway.
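For context, the injection alternative being weighed here could look like the sketch below. Every name (`pool`, `makePool`, `newIterator`) is hypothetical; it only illustrates why passing two pools through constructors is more plumbing than package-level pools:

```go
package main

import "fmt"

// pool is a minimal generic pool interface covering what both slice
// types need.
type pool[T any] interface {
	Get(capacity int) []T
	Put(s []T)
}

// makePool is a trivial, non-reusing implementation, handy as a test double.
type makePool[T any] struct{}

func (makePool[T]) Get(capacity int) []T { return make([]T, 0, capacity) }
func (makePool[T]) Put([]T)              {}

type seriesEntry struct{}
type aggrChunk struct{}

// iterator sketches the injection approach: both pools are passed in at
// construction time, so tests can substitute doubles instead of sharing
// package-level pools.
type iterator struct {
	seriesPool pool[seriesEntry]
	chunksPool pool[aggrChunk]
}

func newIterator(sp pool[seriesEntry], cp pool[aggrChunk]) *iterator {
	return &iterator{seriesPool: sp, chunksPool: cp}
}

func main() {
	it := newIterator(makePool[seriesEntry]{}, makePool[aggrChunk]{})
	fmt.Println(cap(it.seriesPool.Get(10))) // 10
}
```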

pkg/storegateway/series_chunks_test.go
…lices

Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci force-pushed the optimize-seriesChunksSet-series-memory-allocation branch from 44500fc to 59d459e on December 16, 2022 11:09
No performance difference:

name                                                 old time/op    new time/op    delta
LoadingSeriesChunksSetIterator/batch_size:_10-12        819µs ± 2%     802µs ± 1%   ~     (p=0.221 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_100-12      4.16ms ± 1%    4.10ms ± 3%   ~     (p=0.392 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_1000-12     37.6ms ± 2%    37.3ms ± 3%   ~     (p=0.769 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10000-12     372ms ± 2%     371ms ± 5%   ~     (p=0.880 n=3+3)

name                                                 old alloc/op   new alloc/op   delta
LoadingSeriesChunksSetIterator/batch_size:_10-12        160kB ± 0%     160kB ± 0%   ~     (p=0.693 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_100-12       160kB ± 0%     160kB ± 0%   ~     (p=0.699 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_1000-12      208kB ± 2%     206kB ± 0%   ~     (p=0.362 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10000-12    2.79MB ± 0%    2.79MB ± 0%   ~     (p=0.473 n=3+3)

name                                                 old allocs/op  new allocs/op  delta
LoadingSeriesChunksSetIterator/batch_size:_10-12        2.40k ± 0%     2.40k ± 0%   ~     (zero variance)
LoadingSeriesChunksSetIterator/batch_size:_100-12       2.40k ± 0%     2.40k ± 0%   ~     (zero variance)
LoadingSeriesChunksSetIterator/batch_size:_1000-12      2.81k ± 0%     2.81k ± 0%   ~     (p=0.423 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10000-12     3.18k ± 0%     3.17k ± 0%   ~     (p=0.426 n=3+3)

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
…ator

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Contributor @dimitarvdimitrov left a comment

LGTM, thank you!

@pracucci pracucci merged commit db6e5cc into main Dec 16, 2022
@pracucci pracucci deleted the optimize-seriesChunksSet-series-memory-allocation branch December 16, 2022 12:59

3 participants