
[RFC] Search performance on warm index #13806

Open
sohami opened this issue May 24, 2024 · 5 comments
Labels: enhancement, Roadmap:CoPS (Cost, Performance, Scale), Search:Remote Search



sohami commented May 24, 2024

Is your feature request related to a problem? Please describe

The issues #6528 and #12809 touch upon building support for a warm index on top of remote-store-enabled indices. The main idea is to serve reads and writes without keeping all of the data locally available, fetching blocks of data on demand from the remote storage. This RFC captures ideas that will be useful for improving search performance with a warm index. Some of these are touched upon in the above issues, but they are worth extracting into a separate RFC focused on the search performance of warm indices.

Describe the solution you'd like

With a remote-store-based warm index, search performance depends on whether the data needed to serve the request is available locally. If the data is not present locally, a search query will incur the data download latency; if it is, performance will be similar to a hot index. To achieve low latency on search requests for warm indices, it is crucial to reduce cache misses for blocks or reduce the wait time for downloading the required blocks. One of the main challenges is that many of the optimizations could be workload dependent, so having different configurable options to benefit different use cases could be a good starting point. This RFC captures a few items here, but please feel free to add or suggest other ideas. These performance improvements can be categorized into 3 broad areas:

  • Downloading the data faster
  • Avoiding cold cache
  • Improving cache management

Downloading Data Faster:

Concurrent Segment Search (#6798):

This is already GA in OpenSearch 2.12 and will be useful to speed up searches on warm indices. When all the data is cached locally, concurrent search will help improve latency similarly to a hot index. When data blocks are not cached locally, a sequential search will download the data blocks segment by segment. With concurrent segment search, there will be multiple slice threads, each working on a subset of segments. Each of these slice threads will download the data blocks for the segments it is working on, so the required blocks across segments will be downloaded in parallel on the node, which will help improve end-to-end search latency. This will work when there are multiple segments per shard (which is generally the case) and the cluster is sized well enough to enable concurrent segment search. Currently concurrent segment search is an opt-in feature; to enable it by default we will need: a) a dynamic capability at the node level to decide between the concurrent and sequential flow for a request based on node-level resource availability, and b) a way to choose between the concurrent and sequential flow based on the query type.
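
To make the latency benefit concrete, here is a minimal, self-contained Java sketch (not OpenSearch's actual concurrent search implementation; the segment names and the per-segment download step are illustrative) showing how assigning segment slices to separate threads also overlaps the on-demand block downloads across slices:

```java
// Sketch only: each slice thread searches its own segments, so any remote block
// downloads triggered per segment overlap across slices instead of running sequentially.
import java.util.*;
import java.util.concurrent.*;

public class ConcurrentSliceSearchSketch {

    // Hypothetical stand-in for per-segment work that may trigger remote block downloads.
    static void searchSegment(String segment) {
        // download required blocks for this segment (on demand), then search it
        System.out.println(Thread.currentThread().getName() + " searching " + segment);
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> slices = List.of(
                List.of("segment_0", "segment_1"),
                List.of("segment_2", "segment_3"),
                List.of("segment_4"));

        ExecutorService indexSearcherPool = Executors.newFixedThreadPool(slices.size());
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (List<String> slice : slices) {
                // one task per slice; downloads for different slices proceed in parallel
                futures.add(indexSearcherPool.submit(
                        () -> slice.forEach(ConcurrentSliceSearchSketch::searchSegment)));
            }
            for (Future<?> f : futures) {
                f.get(); // reduce phase waits for all slices
            }
        } finally {
            indexSearcherPool.shutdown();
        }
    }
}
```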

Avoiding Cold Cache:

Query acceleration by prefetching data from remote:

In a typical log analytics use case, we will have a bunch of filters or a bool query along with some aggregations. The time spent in these requests is mostly dominated by the aggregation operation. During search execution on each shard, the list of matching documents is found first, and then the aggregation is performed on all the matched documents using the field values from the doc values format file. Depending on the size parameter, the query phase is followed by the fetch phase to get the documents associated with the top matched doc IDs using the stored fields format file. For both the aggregation in the query phase and fetching the source in the fetch phase, the field value for each matched document ID is retrieved by downloading the block of data containing it. This means execution waits for a download to complete before moving to the next document, which could be time consuming depending on how many distinct block downloads happen. For example: let's say a request has 5 matched documents and the aggregation field value for each doc is stored in a distinct block 1, 3, 5, 7, 9. During aggregation collection, for each matched document it will download block 1, collect the value, then download block 3, and so on. If each block download takes td and value processing takes tp, then for 5 documents the total approximate time will be: tt = 5 x (td + tp)

To accelerate this, if during query time we can figure out the blocks needed from the doc values format file for each of the matched document IDs, then we can trigger the async download of all those data blocks in parallel. This removes the wait time incurred for each sequential block download, so that when the actual collect happens on the aggregation collector, the data block is already available locally. For the above example, the approximate total time becomes tt = tb + 5 x tp + td (assuming all 5 blocks are downloaded in parallel), where tb is the time to determine the different blocks and trigger their async downloads, which will be very fast. The Lucene community seems to be working on something similar in apache/lucene#13179 (thanks @msfroh for sharing the issue) to prefetch blocks into the page cache to speed up the next seek. This will help the remote store use cases as well, in terms of prefetching those blocks upfront from remote to local before access. It will be interesting to have a discussion around doc values for aggregation-related use cases too. The same mechanism can be extended to other query types and index files where the matched documents are known and processing needs to happen on that set of documents using a different index file, such as in the case of sorts.
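
A rough sketch of this prefetch flow, assuming hypothetical helpers (blockIdFor, downloadBlockAsync and collect are illustrative stand-ins, not OpenSearch or Lucene APIs): resolve the distinct blocks for the matched doc IDs, kick off all downloads asynchronously, then collect once every block is local.

```java
// Sketch only: pay the download latency roughly once (in parallel) instead of once per document.
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.Collectors;

public class DocValuePrefetchSketch {

    static final ExecutorService REMOTE_IO = Executors.newFixedThreadPool(8);

    // Hypothetical: map a doc ID to the block of the doc-values file that holds its value.
    static long blockIdFor(int docId) {
        return docId / 128L; // e.g. fixed-size blocks
    }

    // Hypothetical: download one block from the remote store into the local file cache.
    static CompletableFuture<Void> downloadBlockAsync(long blockId) {
        return CompletableFuture.runAsync(() -> {
            // remoteStore.fetch(blockId) -> fileCache.put(blockId, bytes)
        }, REMOTE_IO);
    }

    static void collect(int docId) {
        // read the (now local) doc value and feed it to the aggregation collector
    }

    public static void main(String[] args) {
        List<Integer> matchedDocs = List.of(10, 400, 650, 901, 1200);

        // 1. Figure out the distinct blocks needed (the "tb" step in the example above).
        Set<Long> blocks = matchedDocs.stream()
                .map(DocValuePrefetchSketch::blockIdFor)
                .collect(Collectors.toSet());

        // 2. Download all blocks in parallel instead of one per document sequentially (~td once, not 5 x td).
        CompletableFuture<?>[] downloads = blocks.stream()
                .map(DocValuePrefetchSketch::downloadBlockAsync)
                .toArray(CompletableFuture<?>[]::new);
        CompletableFuture.allOf(downloads).join();

        // 3. Collect values; every block is already in the local file cache (5 x tp).
        matchedDocs.forEach(DocValuePrefetchSketch::collect);
        REMOTE_IO.shutdown();
    }
}
```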

File Cache warmup using user hints or query visibility data:

We have a mechanism today in OpenSearch to perform warm-up on shards. Currently it is used for warming up eager global ordinals in the IndicesFieldDataCache at refresh time. Along similar lines, we can add a mechanism to warm up the local file cache for a shard using hints from users. This will be useful if users have many fields but only perform searches on a specific set of them. Using the hints, we can potentially download all the data blocks across different index files, or even a subset of index files, for those fields. For example: download doc values for fields used in aggregations and posting lists for fields used in term queries. This will help avoid the cold start latency, as the file cache will be warmed up with the required data blocks under the user's control, thus trading cost vs performance. It will also be useful for evolving workloads where pre-warming can be done ahead of time before running the actual queries. Potentially we can improve this by collecting information at an index and field level, such as which fields are used in what context (i.e. filter/aggs, etc.), and downloading the field-specific index file data proactively. This could be something similar to what is tracked in Query Categorization (#11365) or a different version of it.
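
As a sketch of the hint-driven warm-up idea (WarmupHints and the field-to-file mapping below are purely hypothetical, not an existing OpenSearch API), a shard-level warmer could iterate over the hinted fields and proactively download the corresponding doc values and postings files per segment:

```java
// Sketch only: users declare which fields they query, and the warmer proactively
// downloads the matching per-segment index files into the local file cache.
import java.util.*;

public class FileCacheWarmupSketch {

    // Hypothetical user-supplied hints, e.g. from an index setting or a warm-up API.
    record WarmupHints(Set<String> aggregationFields, Set<String> filterFields) {}

    static void downloadToFileCache(String segment, String field, String fileSuffix) {
        // e.g. fetch the blocks of the field's data in <segment>.<fileSuffix> from the remote store
        System.out.printf("warming %s field=%s file-suffix=%s%n", segment, field, fileSuffix);
    }

    static void warmShard(List<String> segments, WarmupHints hints) {
        for (String segment : segments) {
            // Doc values back aggregations; postings back term/filter queries.
            hints.aggregationFields().forEach(f -> downloadToFileCache(segment, f, "dvd"));
            hints.filterFields().forEach(f -> downloadToFileCache(segment, f, "doc"));
        }
    }

    public static void main(String[] args) {
        WarmupHints hints = new WarmupHints(Set.of("status_code"), Set.of("service.name"));
        warmShard(List.of("_0", "_1", "_2"), hints);
    }
}
```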

Block level access frequency for FileCache warmup:

For a warm index, the FileCache will store only the data blocks that are fetched on-demand for read/write requests. These blocks will be cached locally for future reference by other requests, improving latency. For various reasons, such as shard rebalancing or node restarts, shards will move around different nodes over the lifetime of the cluster. This means that all the pre-downloaded local blocks of data will be purged, and the newly initialized shard will start with a clean cache, thus incurring the cold start latency for any request landing on the new shard. To prevent this cold start latency, we can keep track of block access frequency at each shard's segment level. We can then use this block access frequency information to pre-warm the new shards by downloading these blocks on the new nodes. This will be most useful for cases where warm indices are not mutating a lot; otherwise the block frequency information will become stale very quickly.
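
A minimal sketch of the access-frequency tracking piece (the BlockKey record and the tracker class are illustrative, not existing classes): count accesses per (shard, file, block) on the read path and export the hottest blocks so a relocated shard can pre-download them before serving traffic.

```java
// Sketch only: lightweight per-block access counters that can be snapshotted for pre-warming.
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class BlockAccessTrackerSketch {

    record BlockKey(String shardId, String fileName, long blockIndex) {}

    private final ConcurrentHashMap<BlockKey, LongAdder> counts = new ConcurrentHashMap<>();

    // Called from the read path whenever a block is served (cache hit or after download).
    void recordAccess(BlockKey key) {
        counts.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    // Top-N hottest blocks; this is what would be shipped to the node hosting the relocated shard.
    List<BlockKey> hottestBlocks(int n) {
        return counts.entrySet().stream()
                .sorted(Comparator.comparingLong(
                        (Map.Entry<BlockKey, LongAdder> e) -> e.getValue().sum()).reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        BlockAccessTrackerSketch tracker = new BlockAccessTrackerSketch();
        BlockKey hot = new BlockKey("[logs][0]", "_0.dvd", 3);
        for (int i = 0; i < 5; i++) tracker.recordAccess(hot);
        tracker.recordAccess(new BlockKey("[logs][0]", "_0.doc", 7));
        System.out.println(tracker.hottestBlocks(2));
    }
}
```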

Improving Cache Management:

Tiered Caching (#9001):

This is a work in progress and will be useful for indices which are not mutating very frequently (warm indices will usually fall under this category). With a disk-based cache we can essentially increase the request cache size to accommodate more entries, which will result in increased cache hits for such indices. If the workload includes repeat queries, as is common for dashboard-like use cases, this will help avoid query execution and fetch results directly from the cache instead.
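
For illustration, a tiny two-tier lookup sketch (this is not the tiered-caching implementation in #9001; class and tier names are made up): check the small on-heap tier first, fall back to the larger disk tier, and only execute the query on a miss in both.

```java
// Sketch only: a disk tier extends effective request-cache capacity for repeat queries.
import java.util.*;
import java.util.function.Function;

public class TieredRequestCacheSketch<K, V> {

    private final Map<K, V> heapTier = new LinkedHashMap<>(16, 0.75f, true); // small, fast
    private final Map<K, V> diskTier = new HashMap<>();                      // larger, slower (stand-in for a disk cache)

    public V computeIfAbsent(K key, Function<K, V> executeQuery) {
        V value = heapTier.get(key);
        if (value != null) return value;            // heap hit
        value = diskTier.get(key);
        if (value != null) {                        // disk hit: promote to heap
            heapTier.put(key, value);
            return value;
        }
        value = executeQuery.apply(key);            // miss in both tiers: run the query
        heapTier.put(key, value);
        diskTier.put(key, value);
        return value;
    }

    public static void main(String[] args) {
        TieredRequestCacheSketch<String, String> cache = new TieredRequestCacheSketch<>();
        System.out.println(cache.computeIfAbsent("dashboard-query-1", q -> "computed:" + q));
        System.out.println(cache.computeIfAbsent("dashboard-query-1", q -> "computed:" + q)); // served from cache
    }
}
```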

Different cache replacement strategies for FileCache:

Depending on changing workloads or large scans of data, the LRU cache replacement strategy for the FileCache may not be an optimal choice. There are other, more sophisticated cache replacement strategies such as ARC, LIRS or LRFU which can be well suited to different use cases. We can probably offer multiple configurable options for the FileCache to cater to different workloads. To deal with cache pollution caused by background merges, we can avoid using the FileCache to track files/blocks for background activities such as merges and use pre-configured space for such operations instead. Another option is to use a background task fleet to perform such merges on different nodes, as suggested in this RFC (#12361). All these options will suit different use cases and will help users make the cost vs performance trade-offs to choose between them.
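
One way to support this is to make the eviction policy pluggable behind an interface, roughly as sketched below (the interface and policy classes are hypothetical, and the LRFU-style scoring is only indicative):

```java
// Sketch only: the cache delegates victim selection to a policy, so LRU could be swapped
// for ARC/LIRS/LRFU-style policies per workload via configuration.
import java.util.*;

public class PluggableEvictionSketch {

    interface EvictionPolicy<K> {
        void onAccess(K key);
        K selectVictim();          // which cached block to drop when space is needed
    }

    // Plain LRU: evict the least recently accessed key.
    static class LruPolicy<K> implements EvictionPolicy<K> {
        private final LinkedHashSet<K> order = new LinkedHashSet<>();
        public void onAccess(K key) { order.remove(key); order.add(key); }
        public K selectVictim() { K victim = order.iterator().next(); order.remove(victim); return victim; }
    }

    // LRFU-flavoured stub: combine recency and frequency into one decayed score (illustrative only).
    static class LrfuPolicy<K> implements EvictionPolicy<K> {
        private final Map<K, Double> score = new HashMap<>();
        public void onAccess(K key) { score.merge(key, 1.0, (old, inc) -> old * 0.8 + inc); }
        public K selectVictim() {
            K victim = Collections.min(score.entrySet(), Map.Entry.comparingByValue()).getKey();
            score.remove(victim);
            return victim;
        }
    }

    public static void main(String[] args) {
        EvictionPolicy<String> policy = new LrfuPolicy<>(); // could be chosen via a cache setting
        policy.onAccess("block-1"); policy.onAccess("block-2"); policy.onAccess("block-1");
        System.out.println("evict: " + policy.selectVictim()); // block-2 (lower combined score)
    }
}
```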

Related component

Search:Remote Search

Describe alternatives you've considered

No response

Additional context

Next Steps:

Gather and incorporate community feedback and follow up with POC/design proposals.

sohami added the enhancement, Search:Remote Search, and Roadmap:CoPS (Cost, Performance, Scale) labels on May 24, 2024

sohami commented May 24, 2024

Tagging @msfroh @andrross @ankitkala @mch2 for feedback

sohami removed the untriaged label on May 24, 2024
@Bukhtawar

Thanks @sohami
The Lucene work around IO concurrency looks promising; thanks for highlighting the change. Some of the primitives for the warm index would definitely be able to leverage it to optimise query latency. Tagging @abiesps as well to ensure we are able to plan for this change.
I am assuming this is a logical extension of the basic primitives from #8891 and similar issues like #9987, where some of the improvements, including improved cache management, are captured. Thanks for bringing this together.

For the historical query access pattern, we can consider maintaining a state of what blocks have been known to be hot, by taking point in time snapshots of the block metadata to better analyse patterns over time.


reta commented May 28, 2024

Thanks @sohami

With concurrent segment search, there will be multiple slice threads each working on subset of segments

I think in the case of the remote (warm) index, this is a blessing and a curse. If the cluster has a mix of hot (local) / warm (remote) indices, a single query could keep the index searcher pool busy downloading blocks and not searching. This could impact queries to hot (regular) indices, so we may need to think about separating these thread pools.

Using the hints we can potentially download all the data blocks across different index files or even subset of index files for those fields.

Hints sound like a good idea.


sohami commented May 29, 2024

I am assuming this is a logical extension of the basic primitives from #8891 and similar issues like #9987, where some of the improvements, including improved cache management, are captured. Thanks for bringing this together.

Thanks for sharing these other issues. Yes, some of the things captured in those touch upon FileCache management, such as creating tiers within the FileCache that operate in a SegmentLRU fashion.

For the historical query access pattern, we can consider maintaining a state of what blocks have been known to be hot, by taking point in time snapshots of the block metadata to better analyse patterns over time.

Right, which is the block-level access frequency mechanism. Alternatively, we can use the hints/query visibility mechanism to understand, at the field level, which index files will be used for different queries and download those upfront. I think both will be useful for different use cases, and providing an option to configure these per use case will be helpful.


sohami commented May 29, 2024

Thanks @sohami

With concurrent segment search, there will be multiple slice threads each working on subset of segments

I think in the case of the remote (warm) index, this is a blessing and a curse. If the cluster has a mix of hot (local) / warm (remote) indices, a single query could keep the index searcher pool busy downloading blocks and not searching. This could impact queries to hot (regular) indices, so we may need to think about separating these thread pools.

True, another way to avoid such interference is to use dedicated warm nodes for hosting the warm indices. In that case the thread pools will be separate for hot and warm indices. But agreed, for a shared setup we will need to look into isolation between hot and warm from all the different aspects.
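
For illustration, a minimal sketch of the thread-pool separation idea discussed above (pool names and sizes are made up, not OpenSearch settings): route warm-index slice work to a dedicated pool so block downloads cannot starve the hot-index searcher pool.

```java
// Sketch only: warm-index searches (which may block on remote downloads) run on their own
// executor so they cannot occupy the hot-index searcher pool.
import java.util.concurrent.*;

public class SearchPoolSeparationSketch {

    private final ExecutorService hotSearcherPool = Executors.newFixedThreadPool(8);
    private final ExecutorService warmSearcherPool = Executors.newFixedThreadPool(4);

    // Pick the pool based on whether the target index is hot (local) or warm (remote-backed).
    Future<String> submitSearch(boolean isWarmIndex, Callable<String> searchTask) {
        ExecutorService pool = isWarmIndex ? warmSearcherPool : hotSearcherPool;
        return pool.submit(searchTask);
    }

    public static void main(String[] args) throws Exception {
        SearchPoolSeparationSketch node = new SearchPoolSeparationSketch();
        Future<String> warm = node.submitSearch(true, () -> "warm result (may download blocks)");
        Future<String> hot = node.submitSearch(false, () -> "hot result");
        System.out.println(hot.get() + " | " + warm.get());
        node.hotSearcherPool.shutdown();
        node.warmSearcherPool.shutdown();
    }
}
```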
