Background
Elasticsearch has a search_type parameter that can be sent with every query to control how Elasticsearch scores results across multiple independent shards.
These days, there are really only two options: query_then_fetch, the default, and dfs_query_then_fetch.
dfs_query_then_fetch only differs from the default in one way: it adds an additional step to calculate term frequencies for scoring across the entire index, rather than allowing each shard to do it independently.
This generally introduces a small but noticeable latency and performance hit, but prevents issues where scores are skewed by the essentially random distribution of documents across multiple shards. The scoring differences are most notable when there are only a few documents in the index, and are usually irrelevant with a few million documents or more.
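To make the skew concrete, here is a small standalone simulation (not Elasticsearch code) using a simplified IDF formula: with query_then_fetch each shard scores against its own local term statistics, while dfs_query_then_fetch first gathers global statistics so every shard scores consistently. The document counts below are made up for illustration.

```javascript
// Simplified IDF; real Lucene scoring is more involved, but the shape is similar.
function idf(totalDocs, docsWithTerm) {
  return Math.log(1 + (totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
}

// A tiny index of 10 documents, with matches for a term split unevenly
// across two shards: 3 matching docs on shard A, 1 on shard B.
const shardA = { docs: 5, matches: 3 };
const shardB = { docs: 5, matches: 1 };

// query_then_fetch: each shard scores with its own local statistics,
// so the same term is weighted differently depending on which shard
// happened to receive the document.
const localA = idf(shardA.docs, shardA.matches);
const localB = idf(shardB.docs, shardB.matches);

// dfs_query_then_fetch: an extra first pass gathers global statistics,
// so both shards use the same IDF value.
const globalIdf = idf(shardA.docs + shardB.docs, shardA.matches + shardB.matches);

console.log(localA !== localB); // true: identical docs would score differently
console.log(globalIdf > 0);     // true: one consistent value for every shard
```

With millions of documents the per-shard statistics converge toward the global ones, which is why the skew mostly matters for small indices.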
History in Pelias
Pelias has used dfs_query_then_fetch essentially since the start of the project, because it's very important that Pelias generates correct scores even with multiple shards.
Occasionally, we have seen queries that return potentially incorrect results without dfs_query_then_fetch, even on full planet builds with hundreds of millions of documents. However, in these cases the "right" result is usually still shown, just lower in the list, and is not clearly more correct than the first result.
Scenarios for consideration
Since this parameter is closely tied to the number of shards, it's important to consider different scenarios for shard count and performance needs when deciding how to use this parameter.
Development or very small builds (<2M documents)
Developers actively working on Pelias code generally have up to a few million documents in their index for testing, are not particularly sensitive to query performance, but definitely want to avoid any scoring errors due to small document counts.
Therefore, development users will want either a single shard, in which case dfs_query_then_fetch has no impact, or several shards with dfs_query_then_fetch enabled.
For a small build in production, document counts under 2 million generally mean queries will be fast enough that even a single shard is fine. In that case, again, the search type is irrelevant.
Medium-size production workloads (<50M documents)
Running Pelias in production with a moderate size index means that performance of queries starts to become an important consideration.
While anything under around 50 million documents probably doesn't require multiple shards for storage reasons, multiple shards can still make sense from a latency perspective: Elasticsearch executes each shard's query in a separate thread, so an index with multiple shards can take advantage of parallelization to run a single query across several cores. This doesn't change overall cluster throughput in terms of requests per second, but can substantially reduce latency for individual requests.
In the distant past, we saw users first moving to production get bitten by the high query latency of a single-shard index, leading us to set the default to 5 shards (pelias/config#60).
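A multi-shard setup like the 5-shard default above boils down to index settings at creation time. This is an illustrative sketch, not a sizing recommendation for any particular build:

```javascript
// Illustrative index settings for a medium-size build. The shard count
// mirrors the old 5-shard default; replicas are a separate, assumed choice.
const indexSettings = {
  settings: {
    index: {
      number_of_shards: 5,    // each query fans out across up to 5 threads
      number_of_replicas: 1   // replicas add read throughput and resilience
    }
  }
};

// With an Elasticsearch client, a body like this would be supplied when
// creating the index; shard count cannot be changed without reindexing.
console.log(indexSettings.settings.index.number_of_shards); // 5
```

Because number_of_shards is fixed at index creation, this is the setting that determines whether "growing" a build later requires a full reindex.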
Taking all that into account, medium-size production users want either a single shard or multiple shards, probably with dfs_query_then_fetch enabled.
Full planet production workloads (600M+ documents)
Full planet or other very large builds will always have multiple shards. Elasticsearch recommends keeping each shard under 50GB, and performance considerations will usually make multiple shards sensible long before that limit is reached.
However, full planet builds are also the most sensitive to performance, and therefore avoiding use of dfs_query_then_fetch is worth investigating.
Determining good project default values
Putting all this together, we want to determine what good default values for the Pelias project would be. Some considerations are:
- ensure development environments work seamlessly, with no confusing results due to distribution of documents across shards
- where possible, allow users to "grow" from small development to moderate-sized builds without having to re-index to alleviate performance issues
- ensure configuration allows for maximal performance of full planet builds
With that in mind, it looks like we do indeed want to default to multiple shards, even though Elasticsearch is changing the default shard count from 5 to 1 in the future (TODO: find the reference for this).
Therefore, we also need to continue defaulting to dfs_query_then_fetch, but should allow disabling it via configuration for large builds.
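As a sketch, that kind of opt-out could look like the following. The configuration shape and key name here are hypothetical, not part of the actual pelias/config schema:

```javascript
// Hypothetical configuration shape (NOT the real pelias/config schema):
// a single flag that large-build operators could flip to trade scoring
// consistency for latency.
const peliasConfig = {
  api: {
    // hypothetical key; defaults to the safer dfs_query_then_fetch
    elasticsearchSearchType: 'dfs_query_then_fetch'
  }
};

// A full planet operator chasing latency would override it:
peliasConfig.api.elasticsearchSearchType = 'query_then_fetch';
console.log(peliasConfig.api.elasticsearchSearchType); // query_then_fetch
```

Keeping the safe value as the default matches the development and medium-size scenarios above, where scoring correctness matters more than a small latency win.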
Benchmarking
In order to see what sort of impact dfs_query_then_fetch has, I ran a subset of geocode.earth queries with the default search type instead of dfs_query_then_fetch. The results are really interesting and unexpected:
All query types showed a small but notable latency improvement, except for autocomplete, which got a little slower.
This suggests that we actually want to allow controlling the Elasticsearch search type for each different type of Pelias query.
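One way to sketch that per-query-type control, assuming the benchmark result holds up: keep dfs_query_then_fetch for autocomplete and use the default elsewhere. The mapping and function names below are illustrative, not existing Pelias APIs.

```javascript
// Hypothetical per-query-type search type selection, based on the
// benchmark above: autocomplete was the one query type that benefited
// from dfs_query_then_fetch.
const SEARCH_TYPE_BY_QUERY = {
  autocomplete: 'dfs_query_then_fetch',
  search: 'query_then_fetch',
  reverse: 'query_then_fetch'
};

function searchTypeFor(queryType) {
  // fall back to the safer option for unknown query types
  return SEARCH_TYPE_BY_QUERY[queryType] || 'dfs_query_then_fetch';
}

console.log(searchTypeFor('autocomplete')); // dfs_query_then_fetch
console.log(searchTypeFor('search'));       // query_then_fetch
```

The safe fallback means a newly added query type gets consistent scoring until someone benchmarks it.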