Allow fine control of Elasticsearch search type parameter #1335

orangejulius commented Jul 15, 2019

Background

Elasticsearch has a search type parameter that can be sent with every query to control the method by which Elasticsearch scores results across multiple independent shards.

These days, there are really only two options: query_then_fetch (the default) and dfs_query_then_fetch.

dfs_query_then_fetch only differs from the default in one way: it adds an additional step to calculate term frequencies for scoring across the entire index, rather than allowing each shard to do it independently.

This generally introduces a small but noticeable latency and performance hit, but prevents issues where scores are skewed by the essentially random distribution of documents across multiple shards. The scoring differences are most notable when there are only a few documents in the index, and are usually irrelevant with a few million documents or more.
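
To make the parameter itself concrete, here's a minimal sketch (not Pelias code, just the raw Elasticsearch REST API, with `pelias` as a placeholder index name and an illustrative query) showing how the search type is passed per request:

```ts
// Minimal sketch, not Pelias code: send the same query with and without
// dfs_query_then_fetch. Assumes Node 18+ (global fetch) and a local
// Elasticsearch node with a "pelias" index (both placeholders).
async function search(searchType?: 'query_then_fetch' | 'dfs_query_then_fetch') {
  const qs = searchType ? `?search_type=${searchType}` : '';
  const res = await fetch(`http://localhost:9200/pelias/_search${qs}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    // illustrative query only; real Pelias queries are built by the query layer
    body: JSON.stringify({ query: { match: { 'name.default': 'london' } } }),
  });
  return res.json();
}

async function main() {
  await search();                        // query_then_fetch: per-shard term frequencies
  await search('dfs_query_then_fetch');  // extra round trip for index-wide term frequencies
}

main().catch(console.error);
```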

History in Pelias

Pelias has used dfs_query_then_fetch essentially since the start of the project, since it's very important that Pelias generates correct scores even with multiple shards.

Occasionally, we have seen queries that return potentially incorrect results without dfs_query_then_fetch, even on full planet builds with hundreds of millions of documents. However, in these cases the "right" result is usually still returned, just not in first place, and it is not clearly more correct than the result that does come first.

Scenarios for consideration

Since this parameter is closely tied to the number of shards, it's important to consider different scenarios for shard count and performance needs when deciding how to use it.

Development or very small builds (<2M documents)

Developers actively working on Pelias code generally have up to a few million documents in their index for testing, are not particularly sensitive to query performance, but definitely want to avoid any scoring errors due to small document counts.

Therefore it makes sense for development users to use either a single shard, in which case dfs_query_then_fetch has no impact, or several shards with dfs_query_then_fetch enabled.

When a small build is run in production, document counts under 2 million generally mean performance will be so fast that even a single shard is fine. In that case, again, the search type is irrelevant.
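
Since the shard count is fixed when the index is created, the single-shard case comes down to the index settings; here's a small sketch (again the raw REST API rather than pelias/schema, with the index name as a placeholder):

```ts
// Sketch only: create an index with a single primary shard, the case where
// search_type makes no difference. The real Pelias index (mappings, analyzers)
// is created by pelias/schema; this shows just the shard setting itself.
async function createSingleShardIndex() {
  const res = await fetch('http://localhost:9200/pelias', {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      settings: {
        index: {
          number_of_shards: 1,
          number_of_replicas: 0, // typical for a local development node
        },
      },
    }),
  });
  console.log(await res.json());
}

createSingleShardIndex().catch(console.error);
```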

Medium size production workloads (<50M documents)

Running Pelias in production with a moderate size index means that performance of queries starts to become an important consideration.

While anything under around 50 million documents probably doesn't require multiple shards, using them might still make sense from a latency perspective: Elasticsearch executes each shard's query in a separate thread, so an index with multiple shards can take advantage of parallelization to run a single query across several cores. This doesn't change total cluster throughput in terms of requests per second, but can massively improve latency for individual requests.

In the distant past, we saw users first moving to production get bitten by the higher query latency of a single-shard index, leading us to set the default to 5 shards (pelias/config#60).

Taking all that into account, medium size production users want either a single shard or multiple shards, probably with dfs_query_then_fetch enabled.
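
For anyone weighing these options, one way to see how documents actually end up distributed across shards (and therefore how much room there is for per-shard term frequencies to diverge) is the `_cat/shards` API; a quick sketch:

```ts
// Sketch: print per-shard document counts for the "pelias" index (placeholder name).
// Noticeably uneven primaries hint that per-shard term frequencies will differ,
// which is exactly what dfs_query_then_fetch compensates for.
async function showShardDistribution() {
  const res = await fetch(
    'http://localhost:9200/_cat/shards/pelias?v&h=shard,prirep,docs,store,node'
  );
  console.log(await res.text());
}

showShardDistribution().catch(console.error);
```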

Full planet production workloads (600M+ documents)

Full planet or other very large builds will always have multiple shards: Elasticsearch recommends keeping each shard under 50GB, and performance considerations will usually make multiple shards sensible long before that limit is reached.

However, full planet builds are also the most sensitive to performance, and therefore avoiding use of dfs_query_then_fetch is worth investigating.

Determining good project default values

Putting all this together, we want to determine what good default values for the Pelias project would be. Some considerations are:

  • ensure development environments work seamlessly, with no confusing results due to distribution of documents across shards
  • where possible, allow for users to "grow" from small development to moderate sized builds without having to re-index to alleviate performance issues
  • ensure configuration allows for maximal performance of full planet builds

With that in mind, it looks like we do indeed want to default to multiple shards, even though Elasticsearch itself has changed its default shard count from 5 to 1 as of Elasticsearch 7.0.

Therefore, we also need to continue defaulting to dfs_query_then_fetch, but allow disabling it via configuration for large builds.

Benchmarking

In order to see what sort of impact dfs_query_then_fetch has, I ran a subset of geocode.earth queries with the default search type instead of dfs_query_then_fetch. The results are really interesting and unexpected:
[Chart: per-query-type latency comparison, default search type vs. dfs_query_then_fetch (dfs_cropped)]

All query types showed a small but notable latency improvement, except for autocomplete, which got a little slower.

This suggests that we actually want to allow controlling the Elasticsearch search type for each different type of Pelias query.
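
To make that concrete, here's a rough sketch of what per-query-type control could look like; the type names, defaults, and override mechanism below are hypothetical, not an existing pelias.json option or pelias/api code:

```ts
// Hypothetical sketch only: none of these names exist in pelias/config or pelias/api yet.
type PeliasQueryType = 'search' | 'structured' | 'autocomplete' | 'reverse';
type ElasticsearchSearchType = 'query_then_fetch' | 'dfs_query_then_fetch';

// Project-wide default: keep dfs_query_then_fetch everywhere, per the reasoning above.
const defaults: Record<PeliasQueryType, ElasticsearchSearchType> = {
  search: 'dfs_query_then_fetch',
  structured: 'dfs_query_then_fetch',
  autocomplete: 'dfs_query_then_fetch',
  reverse: 'dfs_query_then_fetch',
};

// A large deployment could override individual entries via configuration
// (for example, based on the benchmark above, switching every query type
// except autocomplete to query_then_fetch), and the query layer would pass
// the resolved value through as the search_type parameter on each request.
function searchTypeFor(
  queryType: PeliasQueryType,
  overrides: Partial<Record<PeliasQueryType, ElasticsearchSearchType>> = {}
): ElasticsearchSearchType {
  return overrides[queryType] ?? defaults[queryType];
}
```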
