
Explore using Bloom filter for id field index to optimize read performance #4489

Closed
itiyama opened this issue Sep 12, 2022 · 19 comments · Fixed by #11027
Labels: discuss, enhancement, Indexing & Search, v2.12.0

Comments


itiyama commented Sep 12, 2022

OpenSearch uses an FST data structure to store document ids and then searches it via a term query. For most use cases, document id queries are effectively "does_exist" checks, and a Bloom filter is known to perform better for such queries. We will explore this data structure for storing document ids and then measure read performance for various use cases:

  1. Mostly appends
  2. Upserts, but no updates. Equal number of appends and upserts
  3. Both updates and appends

This is a spin-off from a discussion captured here.
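To make the "does_exist" idea concrete, here is a minimal sketch (all names hypothetical; the real candidates are Lucene's FST terms dictionary and a Bloom filter postings format): a Bloom filter answers definite negatives cheaply, and only possible positives fall through to the authoritative, more expensive lookup.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k deterministic hashes over a bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # derive k independent positions via blake2b with distinct "person" tags
        for i in range(self.num_hashes):
            d = hashlib.blake2b(key.encode(), digest_size=8,
                                person=bytes([i]) + b"\x00" * 15).digest()
            yield int.from_bytes(d, "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False is definitive; True may be a false positive
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

def doc_exists(doc_id, bloom, fst_lookup):
    """fst_lookup stands in for the slow authoritative terms-dict walk."""
    if not bloom.might_contain(doc_id):
        return False              # definite miss: FST lookup short-circuited
    return fst_lookup(doc_id)     # possible false positive: verify
```

The win comes from the short-circuit: for workloads dominated by ids that do not exist in a segment, most lookups never touch the terms dictionary at all.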

@itiyama itiyama added enhancement Enhancement or improvement to existing feature or request untriaged labels Sep 12, 2022
@dreamer-89 dreamer-89 added Indexing & Search discuss Issues intended to help drive brainstorming and decision making distributed framework and removed untriaged labels Sep 12, 2022

mgodwan commented Aug 28, 2023

While analysing updates, I found FST lookups to be taking time as well, so I started exploring Bloom filter based postings for the id field and worked on a POC: https://github.com/mgodwan/OpenSearch/compare/d76614cb4f071989da04d0976c1227b8943e5b61..02a344004413d39b08d882cc2b04321808682909

Running this against the nyc_taxis dataset with a 25% conflict probability, we see indexing throughput gains on the order of ~30%.

Server: 1 node r6g.2xlarge, 32G Heap, 64G RAM
Client: c4.4xlarge, 16 client instances
opensearch-benchmark execute-test --workload=nyc_taxis --target-hosts "<<HOST>>" --results-format csv --pipeline benchmark-only --test-procedure update --workload-params '{"index_settings": {"number_of_shards": 1, "number_of_replicas": 0, "codec": "best_compression", "refresh_interval": "60s", "translog.flush_threshold_size": "4g"}}'

Following are the benchmark results (7 runs for each code):

[Screenshot: benchmark results table]

Following is a differential flame graph of the CPU usage analysis, which clearly shows the advantage of the short-circuiting the Bloom filter introduces:

[Screenshot: differential CPU flame graph]

Memory Used Percentage (left: baseline, right: candidate): [screenshot]
JVM Used Percentage (left: baseline, right: candidate): [screenshot]
Total Young GC time (left: baseline, right: candidate): [screenshot]
CPU Used Percentage (left: baseline, right: candidate): [screenshot]


mgodwan commented Aug 28, 2023

Based on the above results, I think we should look into exposing this as an opt-in feature to start with, while working to optimize the usage of Bloom filters internally and potentially exposing this as a data structure for other fields.

In parallel, we can continue to investigate optimizations and the impact this introduces on JVM, storage, and the get/search paths (if any), to promote this towards a default feature for indices.

@shwetathareja @backslasht @dblock @itiyama
Looking forward to your thoughts on this.

@backslasht (Contributor)

@mgodwan - Thanks for the experiments. With an approx. 2% increase in disk size, a 20%+ decrease in latency is great.

We should definitely consider bringing this in as an opt-in feature. Are you thinking of exposing the conflict probability as an option as well?


itiyama commented Aug 29, 2023

@mgodwan These are great results. Any reason for choosing this as opt-in instead of default? I am aware that some users piggyback on term features on the id field, but I am thinking that going forward we should disable term features and just use id as a mechanism for "does_exist" or get style queries, so that the optimizations can be more tailored towards these types of queries.


shwetathareja commented Aug 29, 2023

The numbers look great. +1 @mgodwan to offer this as an opt-in/experimental feature with a setting to configure the false positive %. @itiyama, it is pending full-scale testing to evaluate the impact on JVM/storage with different workloads, and also how it impacts merging, before marking it production ready.

@mgodwan, let's try to make it into the 2.10 release as an opt-in feature.


mgodwan commented Aug 30, 2023

Thanks folks for the comments.

We should definitely consider to bring it as opt-in feature. Are you thinking of exposing conflict probability as an option as well?

@backslasht Yes, we can expose the false positive probability as an argument, as it provides a trade-off between JVM/storage usage and the throughput improvement we observe for update use cases.

Any reason for choosing this as an opt-in instead of default?

@itiyama For append-only use cases with auto-generated ids, this may not help and would add additional compute/memory/storage overhead. Also, we are still profiling the overall resource utilization of the new data structure in order to make a more informed decision.

I am thinking that going forward we should disable term features and just use id as a mechanism for does_exist

@itiyama I agree that it will help optimize the majority of use cases where the FST may be an overhead. I think we can track a separate issue for evaluating the underlying deterministic structure.
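For reference on the false-positive-probability setting discussed above, the textbook Bloom filter sizing formulas quantify the storage side of the trade-off (an illustration of the standard math, not necessarily the exact sizing logic the POC uses):

```python
import math

# Standard Bloom filter sizing: for a target false positive probability p,
# the optimal filter uses
#   m/n = -ln(p) / (ln 2)^2   bits per key
#   k   = (m/n) * ln 2        hash functions
def bloom_bits_per_key(p: float) -> float:
    return -math.log(p) / (math.log(2) ** 2)

def bloom_optimal_hashes(p: float) -> int:
    return round(bloom_bits_per_key(p) * math.log(2))

for p in (0.01, 0.05, 0.1, 0.2):
    print(f"fpp={p}: {bloom_bits_per_key(p):5.2f} bits/key, "
          f"k={bloom_optimal_hashes(p)}")
```

Halving the false positive probability costs roughly 1.44 extra bits per key, which is exactly the JVM/storage vs. update-throughput trade-off being discussed.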


mgodwan commented Aug 30, 2023

@shwetathareja @backslasht

The Bloom filter implementation in Lucene is part of Lucene's codec extensions, outside of Lucene core.

Digging further while exploring the existing implementation and reviewing past changes (example), it seems the implementation has changed in the past, is marked experimental, and does not provide direct support for backward compatibility across Lucene versions.

This may be concerning for us when supporting version upgrades, and may require an implementation which can ensure backward compatibility.

Hence, I wanted to discuss:

  1. Is it alright to provide this as an experimental, feature-flag-gated setting in 2.10 given these constraints, as it provides an opportunity to gather customer feedback on the feature? This will restrict upgrades for indices using the Bloom filter postings when Lucene implementations change, unless customers disable Bloom filters through re-indexing/force-merging old segments into new ones.

  2. We may need to explore whether it would be beneficial to use the code from Lucene's implementation and ensure backward compatibility on it. This adds extra maintenance overhead for us, but also provides more control and opportunities to add support for more forms of fuzzy sets (e.g. https://arxiv.org/pdf/1912.08258.pdf) to optimize with.


shwetathareja commented Aug 30, 2023

Is it alright to provide this as an experimental feature flag gated setting in 2.10 given these constraints as it provides opportunity to gather customer feedback on the feature? This will restrict the upgrades for indices using the bloom filter postings when Lucene implementations change, unless customers disable bloom filters through re-index/force merge old segments into new.

Generally, we don't provide backward compatibility guarantees for experimental features; we would need to call that out explicitly as well. Do you know the reason why it is not part of the Lucene core codecs? Also, considering our approach may change drastically as we look into providing backward compatibility, we shouldn't rush for the 2.10 release.

@backslasht (Contributor)

Backward compatibility is a must.

I agree with @shwetathareja: if backward compatibility is not supported, let's review our approach.


mgodwan commented Aug 30, 2023

Do you know the reason why it is not part of Lucene core codecs?

It looks like only codecs in core are supported for backward compatibility. This one is marked experimental along with the other extension codecs (although it has been present in the Lucene repository for a long time), due to which the guarantees for backward compatibility may not apply. I can open an issue in the Lucene repo to discuss this with the community.

If backward compatibility is not supported, lets review our approach.

I agree.
This can be achieved through an extension of Lucene's PostingsFormat which implements a postings format similar to Lucene's while adding the semantics needed to ensure backward compatibility. Are we open to introducing custom codecs like this in OpenSearch core?

@backslasht (Contributor)

This can be achieved through an extension of Lucene's PostingsFormat which can implement a postings format similar to Lucene's while adding needed semantics for ensuring backward compatibility. Are we open to introducing custom codecs like this in OpenSearch core?

We recently introduced ZStd as a custom codec (#9658) and there are more to come (Snappy). If we can support backward compatibility via a custom codec, it is worth giving it a try.


mgodwan commented Sep 19, 2023

For the bloom filter implementation, following is the throughput observed with varying false positive probabilities:

[Plot: indexing throughput vs. false positive probability]


The storage/memory variation for different false positive probabilities on the nyc_taxis dataset is as follows:

[Plot: storage/memory vs. false positive probability]



mgodwan commented Sep 19, 2023

I'm working on collating the results for an XOR filter implementation as well, with the same parameters.
I will update the issue with them.


mgodwan commented Oct 25, 2023

I worked on comparing different fuzzy set implementations over the past few days. Following is a comparison of them at similar size overhead:

[Screenshot: indexing throughput comparison of fuzzy set implementations]

We observe the XOR filter to provide the highest reported indexing throughput among the implementations. However, this comes at the cost of higher construction time and JVM overhead during construction (the XOR filter POC is based on https://github.com/FastFilter/fastfilter_java).


Following is a result of micro-benchmarks performed using JMH to profile filter construction times:

Filter Construction

Benchmark                                (numIds)  (setTypeAlias)  Mode  Cnt      Score       Error  Units
FilterConstructionBenchmark.buildFilter   1000000    bloom_filter  avgt    3     26.229 ±     9.165  ms/op
FilterConstructionBenchmark.buildFilter   1000000      xor_filter  avgt    3     78.331 ±    27.188  ms/op
FilterConstructionBenchmark.buildFilter   1000000   cuckoo_filter  avgt    3     52.045 ±     9.371  ms/op
FilterConstructionBenchmark.buildFilter  10000000    bloom_filter  avgt    3    422.899 ±   578.249  ms/op
FilterConstructionBenchmark.buildFilter  10000000      xor_filter  avgt    3   2363.247 ±   781.973  ms/op
FilterConstructionBenchmark.buildFilter  10000000   cuckoo_filter  avgt    3    943.940 ±  1622.361  ms/op
FilterConstructionBenchmark.buildFilter  50000000    bloom_filter  avgt    3   8545.999 ±  1791.090  ms/op
FilterConstructionBenchmark.buildFilter  50000000      xor_filter  avgt    3  20415.212 ± 26552.341  ms/op
FilterConstructionBenchmark.buildFilter  50000000   cuckoo_filter  avgt    3  10126.537 ±   887.455  ms/op
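As context for the "similar size overhead" comparison above, the theoretical space cost of the two main candidates at an equal false positive rate can be computed from standard formulas (illustrative arithmetic only, not measurements from the POC; an xor8 filter stores one 8-bit fingerprint per slot, with roughly 1.23 slots per key):

```python
import math

def bloom_bits_per_key(p: float) -> float:
    # optimal Bloom filter: -ln(p) / (ln 2)^2 bits per key
    return -math.log(p) / (math.log(2) ** 2)

xor8_fpp = 1 / 256            # 8-bit fingerprints give ~2^-8 false positives
xor8_bits_per_key = 1.23 * 8  # ~9.84 bits per key

bloom_bits = bloom_bits_per_key(xor8_fpp)  # ~11.54 bits per key
print(f"xor8:  {xor8_bits_per_key:.2f} bits/key at fpp ~{xor8_fpp:.4f}")
print(f"bloom: {bloom_bits:.2f} bits/key at the same fpp "
      f"(~{100 * (bloom_bits / xor8_bits_per_key - 1):.0f}% more)")
```

At a matched false positive rate the XOR filter is the smaller structure at query time; the benchmarks above show that its cost is shifted into construction instead.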


mgodwan commented Oct 25, 2023

Side note: While trying to use existing cuckoo filter implementation in OpenSearch, I encountered an issue: #10582

Will fix it separately.

@shwetathareja (Member)

@mgodwan: do you also see scope for improvement in XOR filter construction, considering it provides better throughput and lower latencies compared to the Bloom and cuckoo filters?


mgodwan commented Nov 6, 2023

@shwetathareja The implementation used for the XOR filter was based on https://arxiv.org/pdf/1912.08258.pdf, where the mapping step requires a few auxiliary data structures which consume high memory during construction. There was no evident way to eliminate these during the experiments.
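To make the construction-memory point concrete, here is a toy 3-wise XOR filter sketch along the lines of the paper (illustrative Python with made-up names, not the POC code; assumes unique keys). The peeling ("mapping") step needs per-slot key sets and an assignment stack that exist only during construction; that transient footprint is what dominates peak memory, while only the compact fingerprint table survives:

```python
import hashlib
from collections import defaultdict, deque

def _h(key: str, seed: int, which: int, mod: int) -> int:
    # deterministic keyed hash; the salt encodes (seed, hash index)
    d = hashlib.blake2b(key.encode(), digest_size=8,
                        salt=bytes([seed, which]) + b"\x00" * 14).digest()
    return int.from_bytes(d, "big") % mod

class ToyXorFilter8:
    def __init__(self, keys):
        # ~1.23 slots per key, split into 3 equal blocks (one per hash)
        self.block = (int(1.23 * len(keys)) + 32) // 3 + 1
        for seed in range(256):          # retry on (rare) peeling failure
            stack = self._peel(keys, seed)
            if stack is not None:
                break
        else:
            raise RuntimeError("peeling failed for all seeds")
        self.seed = seed
        self.table = [0] * (3 * self.block)
        # assign 8-bit fingerprints in reverse peeling order
        for key, slot in reversed(stack):
            val = _h(key, seed, 9, 256)
            for s in self._slots(key, seed):
                if s != slot:
                    val ^= self.table[s]
            self.table[slot] = val

    def _slots(self, key, seed):
        return [i * self.block + _h(key, seed, i, self.block)
                for i in range(3)]

    def _peel(self, keys, seed):
        # auxiliary structures: per-slot key sets plus a stack; discarded
        # after construction, but they dominate peak memory
        slot_keys = defaultdict(set)
        for k in keys:
            for s in self._slots(k, seed):
                slot_keys[s].add(k)
        queue = deque(s for s, ks in slot_keys.items() if len(ks) == 1)
        stack = []
        while queue:
            slot = queue.popleft()
            if len(slot_keys[slot]) != 1:
                continue
            (key,) = slot_keys[slot]
            stack.append((key, slot))
            for s in self._slots(key, seed):
                slot_keys[s].discard(key)
                if len(slot_keys[s]) == 1:
                    queue.append(s)
        return stack if len(stack) == len(keys) else None

    def might_contain(self, key) -> bool:
        s0, s1, s2 = self._slots(key, self.seed)
        return _h(key, self.seed, 9, 256) == (
            self.table[s0] ^ self.table[s1] ^ self.table[s2])
```

The finished table is small, but during construction the per-slot sets and the stack hold every key in heap-allocated structures at once, which matches the JVM overhead observed for the XOR filter POC.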

@kiranprakash154 (Contributor)

Hi, are we on track for this to be released in 2.12?

@shwetathareja (Member)

@kiranprakash154 Yes, we are targeting 2.12 for the release of Bloom filter support for document ids in OpenSearch.
