
Store-gateway: series streaming #3348

Closed
34 of 38 tasks
dimitarvdimitrov opened this issue Nov 1, 2022 · 2 comments

Comments

@dimitarvdimitrov (Contributor)

dimitarvdimitrov commented Nov 1, 2022

Background

The current implementation of the store-gateway's Series() gRPC method fetches all data for each block concurrently and holds it all in memory until the response is sent. This increases memory usage, and because the store-gateway cannot bound the memory it uses, many heavy simultaneous requests can cause it to run out of memory.

Proposal

The proposal is to fetch the index data (symbols, chunk refs) and chunk contents in batches. This lets us control how much memory we use at a time. By reusing the memory allocated for each batch, we also keep memory usage roughly constant per request.

The tradeoff is having to make multiple requests to the cache and to the object store. We can mitigate this by using a relatively large batch size, so that only big Series() calls require multiple requests.
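As an illustration of the batching idea, the following Go sketch yields series refs in fixed-size batches while reusing one backing slice across batches, so per-request memory stays roughly constant. All names here are hypothetical, not Mimir's actual types:

```go
package main

import "fmt"

// seriesBatchIterator is a hypothetical sketch: instead of materialising
// all series refs at once, yield them in fixed-size batches and reuse one
// backing slice, so memory per request stays roughly constant regardless
// of how many series match.
type seriesBatchIterator struct {
	refs      []uint64 // all matching series refs (in reality fetched lazily)
	batchSize int
	buf       []uint64 // reused across Next() calls
	pos       int
}

func newSeriesBatchIterator(refs []uint64, batchSize int) *seriesBatchIterator {
	return &seriesBatchIterator{
		refs:      refs,
		batchSize: batchSize,
		buf:       make([]uint64, 0, batchSize), // allocated once
	}
}

// Next returns the next batch, or nil when exhausted. The returned slice
// is only valid until the following Next() call, because it is reused.
func (it *seriesBatchIterator) Next() []uint64 {
	if it.pos >= len(it.refs) {
		return nil
	}
	end := it.pos + it.batchSize
	if end > len(it.refs) {
		end = len(it.refs)
	}
	it.buf = it.buf[:0] // reuse the memory instead of reallocating
	it.buf = append(it.buf, it.refs[it.pos:end]...)
	it.pos = end
	return it.buf
}

func main() {
	it := newSeriesBatchIterator([]uint64{1, 2, 3, 4, 5}, 2)
	for b := it.Next(); b != nil; b = it.Next() {
		fmt.Println(b)
	}
}
```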

Further improvements

  • Provide constant memory usage across all in-flight requests instead of per request. This will protect the store-gateway from running out of memory.

  • Add strategies for dealing with incoming requests when at capacity:

    • Swap out new batches to disk and serve the series from there
    • Add priority to incoming requests: prefer smaller requests over larger ones (we know the number of blocks, and the number of series per block, that each request will read); prefer new requests over old ones

TODO

Week of:

Nov 21

  • POC of batching incoming requests (was part of Streaming store-gateway Series() #3355)
  • settle on a design
    • without additional tests - i.e. reusing the existing store-gateway unit tests
    • without memory limits

Nov 28

Dec 5

Dec 12

Dec 19

  • Roll out to all dev and ops cells (zone-a)

Jan 2

  • Roll out to production cells (zone-a)

Jan 16

  • Roll out to all dev and ops zones

Jan 23

  • Roll out to all prod zones

Unprioritized TODOs

Features:

  • implement memory quotas with per-tenant fairness
    • the store-gateway will have a fixed memory quota that it can use for Series() calls
    • when the quota is exceeded, calls block until some quota frees up; this blocking needs to be fair among tenants (?)
    • Consider computing the per-block batch size based on the number of blocks to query
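A minimal sketch of such a blocking memory quota, using sync.Cond. All names are hypothetical, and this deliberately omits the per-tenant fairness, which would need an explicit wait queue rather than a broadcast:

```go
package main

import (
	"fmt"
	"sync"
)

// memoryQuota is a sketch of a fixed memory budget for Series() calls:
// Acquire blocks until enough quota is free, Release frees it again.
type memoryQuota struct {
	mu    sync.Mutex
	cond  *sync.Cond
	limit int64 // total budget in bytes
	used  int64 // currently reserved bytes
}

func newMemoryQuota(limit int64) *memoryQuota {
	q := &memoryQuota{limit: limit}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// Acquire blocks until n bytes of quota are available, then reserves them.
func (q *memoryQuota) Acquire(n int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for q.used+n > q.limit {
		q.cond.Wait()
	}
	q.used += n
}

// Release frees n bytes and wakes all blocked callers so they can re-check.
func (q *memoryQuota) Release(n int64) {
	q.mu.Lock()
	q.used -= n
	q.mu.Unlock()
	q.cond.Broadcast()
}

func main() {
	q := newMemoryQuota(100)
	q.Acquire(60)
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		q.Acquire(60) // blocks until the first 60 bytes are released
		q.Release(60)
	}()
	q.Release(60) // unblocks the goroutine
	wg.Wait()
	fmt.Println("done")
}
```

Broadcast-and-recheck is simple but wakes every waiter; fairness (per tenant or FIFO) would replace it with a queue of waiters signalled individually.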

Optimizations:

  • Fetch postings asynchronously in openBlockSeriesChunkRefsSetsIterator; otherwise we wait on fetching postings from the slowest block before starting to load series, merge them, and load chunks (concurrency will likely increase memory usage, which is against the goal of this project)
  • metasToChunks(): during load testing with "few large requests", memory allocated by metasToChunks() accounted for about 7%. Can we optimise it?
  • bucketChunkReader.addLoad(): during load testing with "few large requests", memory allocated by bucketChunkReader.addLoad() accounted for about 4%. I suspect there's a lot of reslicing, because chunks are added 1-by-1 to the reader. (low impact, this hasn't shown up in profiles since)
  • add pooling for snappy decoding buffers - see this comment store-gateway: more efficient series caching #3751 (comment) (this turned out to be low impact)
  • streaming store-gateway: series and chunks limits are enforced later #3878
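The reslicing suspected in bucketChunkReader.addLoad() comes from append growing the backing array repeatedly when elements are added one by one. A standalone sketch (not Mimir code) showing how preallocating capacity avoids the reallocations:

```go
package main

import "fmt"

// countGrowths appends n elements one at a time and counts how many
// times append had to reallocate the backing array (capacity changed).
func countGrowths(initialCap, n int) int {
	s := make([]int, 0, initialCap)
	growths := 0
	for i := 0; i < n; i++ {
		before := cap(s)
		s = append(s, i)
		if cap(s) != before {
			growths++
		}
	}
	return growths
}

func main() {
	// Without preallocation, append reallocates (and copies) repeatedly;
	// with capacity known up front, it never does.
	fmt.Println("no preallocation:", countGrowths(0, 10000))
	fmt.Println("preallocated:", countGrowths(10000, 10000))
}
```

If the total chunk count is known before loading, sizing the reader's slices once up front would remove this allocation churn.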

Tests:

Code design:

  • naming: use a less repetitive naming scheme (the string seriesChunk is repeated 702 times in /pkg/storegateway)
    • maybe replace seriesChunkRefs with index (indexSet, indexSetIterator) and seriesChunks with completeSeries (completeSeriesSet, completeSeriesIterator)
    • maybe drop these from the names of iterators (deduplicatingSeriesChunkRefsSetIterator -> deduplicatingIterator)

Configuration:

Low priority (will be trivial once we stop supporting the non-streaming version)

@dimitarvdimitrov (Contributor, Author)

I think we can close this issue and move the remaining work to other issues.

The remaining work is:

  • implementing memory quotas

We have discussed internally that this is tricky and will probably involve more work. I'm also not sure if we want to do this given the improvements in memory profiles. I'm tempted to not create an issue because it's not addressing any serious problem in the store-gateway at the moment. @pracucci do you agree?

  • remove the max-concurrent limit for requests

We may do this as part of implementing memory quotas. If I create an issue for that, I will include this there.

  • enforcing chunks and series limits earlier

This already has an issue, and I will try to work on it when I have bandwidth. I also think it has enough detail for someone else to work on it.

  • using a less repetitive naming scheme

I think it's possible to introduce sub-packages series and chunks, with main structs series.Series, chunks.Ref, and chunks.Chunk. We can also move iterators and readers accordingly and drop chunkRefs and series from their names.

I think doing this with the help of an IDE may not be terribly difficult, but it will touch the majority of lines in pkg/storegateway. Do you have an opinion on this, Marco?

  • experimenting with different values for batch size and remove the config parameter

Along with #3939 I will try this out too. I've opened issue #4984; maybe someone else can pitch in with experience.

@dimitarvdimitrov (Contributor, Author)

Closing, as all follow-up items are already tracked in their own issues.
