Background
Elasticsearch has a search_type parameter that can be sent with every query to control how Elasticsearch scores results across multiple independent shards.
These days, there are really only two options: query_then_fetch, the default, and dfs_query_then_fetch.
dfs_query_then_fetch only differs from the default in one way: it adds an additional step to calculate term frequencies for scoring across the entire index, rather than allowing each shard to do it independently.
This generally introduces a small but noticeable latency and performance hit, but prevents issues where scores are skewed by the essentially random distribution of documents across multiple shards. The scoring differences are most notable when there are only a few documents in the index, and are usually irrelevant with a few million documents or more.
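To make the skew concrete, here is a small standalone simulation (not Elasticsearch code) using a simplified IDF formula: with query_then_fetch each shard scores against its own local term statistics, while dfs_query_then_fetch first gathers global statistics so every shard scores consistently. The document counts below are made up for illustration.

```javascript
// Simplified IDF; real Lucene scoring is more involved, but the shape is similar.
function idf(totalDocs, docsWithTerm) {
  return Math.log(1 + (totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
}

// A tiny index of 10 documents, with matches for a term split unevenly
// across two shards: 3 matching docs on shard A, 1 on shard B.
const shardA = { docs: 5, matches: 3 };
const shardB = { docs: 5, matches: 1 };

// query_then_fetch: each shard scores with its own local statistics,
// so the same term is weighted differently depending on which shard
// happened to receive the document.
const localA = idf(shardA.docs, shardA.matches);
const localB = idf(shardB.docs, shardB.matches);

// dfs_query_then_fetch: an extra first pass gathers global statistics,
// so both shards use the same IDF value.
const globalIdf = idf(shardA.docs + shardB.docs, shardA.matches + shardB.matches);

console.log(localA !== localB); // true: identical docs would score differently
console.log(globalIdf > 0);     // true: one consistent value for every shard
```

With millions of documents the per-shard statistics converge toward the global ones, which is why the skew mostly matters for small indices.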
History in Pelias
Pelias has used dfs_query_then_fetch essentially since the start of the project, because it's very important that Pelias generates correct scores even with multiple shards.
Occasionally, we have seen queries that return potentially incorrect results without dfs_query_then_fetch, even on full planet builds with hundreds of millions of documents. However, in these cases the "right" result is usually still shown, just lower in the list, and is not clearly more correct than the first result.
Scenarios for consideration
Since this parameter is closely tied to the number of shards, it's important to consider different scenarios for shard count and performance needs when deciding how to use this parameter.
Development or very small builds (<2M documents)
Developers actively working on Pelias code generally have up to a few million documents in their index for testing, are not particularly sensitive to query performance, but definitely want to avoid any scoring errors due to small document counts.
Therefore, development users will want either a single shard, in which case dfs_query_then_fetch has no impact, or several shards with dfs_query_then_fetch enabled.
For a small build in production, document counts under 2 million generally mean queries will be fast enough that even a single shard is fine. In that case, again, the search type is irrelevant.
Medium-size production workloads (<50M documents)
Running Pelias in production with a moderate size index means that performance of queries starts to become an important consideration.
While anything under around 50 million documents probably doesn't require multiple shards for storage reasons, multiple shards can still make sense from a latency perspective: Elasticsearch executes each shard's query in a separate thread, so an index with multiple shards can take advantage of parallelization to run a single query across several cores. This doesn't change overall cluster throughput in terms of requests per second, but can substantially reduce latency for individual requests.
In the distant past, we saw users first moving to production get bitten by the high query latency of a single-shard index, leading us to set the default to 5 shards (pelias/config#60).
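A multi-shard setup like the 5-shard default above boils down to index settings at creation time. This is an illustrative sketch, not a sizing recommendation for any particular build:

```javascript
// Illustrative index settings for a medium-size build. The shard count
// mirrors the old 5-shard default; replicas are a separate, assumed choice.
const indexSettings = {
  settings: {
    index: {
      number_of_shards: 5,    // each query fans out across up to 5 threads
      number_of_replicas: 1   // replicas add read throughput and resilience
    }
  }
};

// With an Elasticsearch client, a body like this would be supplied when
// creating the index; shard count cannot be changed without reindexing.
console.log(indexSettings.settings.index.number_of_shards); // 5
```

Because number_of_shards is fixed at index creation, this is the setting that determines whether "growing" a build later requires a full reindex.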
Taking all that into account, medium-size production users want either a single shard or multiple shards, probably with dfs_query_then_fetch enabled.
Full planet production workloads (600M+ documents)
Full planet or other very large builds will always have multiple shards. Elasticsearch recommends keeping each shard under 50GB, and performance considerations will usually make multiple shards sensible long before that limit is reached.
However, full planet builds are also the most sensitive to performance, and therefore avoiding use of dfs_query_then_fetch is worth investigating.
Determining good project default values
Putting all this together, we want to determine what good default values for the Pelias project would be. Some considerations are:
- ensure development environments work seamlessly, with no confusing results due to distribution of documents across shards
- where possible, allow users to "grow" from small development to moderate-sized builds without having to re-index to alleviate performance issues
- ensure configuration allows for maximal performance of full planet builds
With that in mind, it looks like we do indeed want to default to multiple shards, even though Elasticsearch is changing the default shard count from 5 to 1 in the future (TODO: find the reference for this).
Therefore, we also need to continue defaulting to dfs_query_then_fetch, but should allow disabling it via configuration for large builds.
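As a sketch, that kind of opt-out could look like the following. The configuration shape and key name here are hypothetical, not part of the actual pelias/config schema:

```javascript
// Hypothetical configuration shape (NOT the real pelias/config schema):
// a single flag that large-build operators could flip to trade scoring
// consistency for latency.
const peliasConfig = {
  api: {
    // hypothetical key; defaults to the safer dfs_query_then_fetch
    elasticsearchSearchType: 'dfs_query_then_fetch'
  }
};

// A full planet operator chasing latency would override it:
peliasConfig.api.elasticsearchSearchType = 'query_then_fetch';
console.log(peliasConfig.api.elasticsearchSearchType); // query_then_fetch
```

Keeping the safe value as the default matches the development and medium-size scenarios above, where scoring correctness matters more than a small latency win.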
Benchmarking
In order to see what sort of impact dfs_query_then_fetch has, I ran a subset of geocode.earth queries with the default search type instead of dfs_query_then_fetch. The results are really interesting and unexpected:
All query types showed a small but notable latency improvement, except for autocomplete, which got a little slower.
This suggests that we actually want to allow controlling the Elasticsearch search type for each different type of Pelias query.
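One way to sketch that per-query-type control, assuming the benchmark result holds up: keep dfs_query_then_fetch for autocomplete and use the default elsewhere. The mapping and function names below are illustrative, not existing Pelias APIs.

```javascript
// Hypothetical per-query-type search type selection, based on the
// benchmark above: autocomplete was the one query type that benefited
// from dfs_query_then_fetch.
const SEARCH_TYPE_BY_QUERY = {
  autocomplete: 'dfs_query_then_fetch',
  search: 'query_then_fetch',
  reverse: 'query_then_fetch'
};

function searchTypeFor(queryType) {
  // fall back to the safer option for unknown query types
  return SEARCH_TYPE_BY_QUERY[queryType] || 'dfs_query_then_fetch';
}

console.log(searchTypeFor('autocomplete')); // dfs_query_then_fetch
console.log(searchTypeFor('search'));       // query_then_fetch
```

The safe fallback means a newly added query type gets consistent scoring until someone benchmarks it.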