
feature request: routing by _id #34

Open
bnewbold opened this issue Aug 13, 2020 · 1 comment

@bnewbold

This Elasticsearch blog post implies that bulk indexing batches in which all documents go to the same shard improves performance: https://www.elastic.co/blog/how-kenna-security-speeds-up-elasticsearch-indexing-at-scale-part-1

The feature request for esbulk would be to somehow automate this speed-up, without users needing to re-sort or partition documents themselves. Some unstructured thoughts about this:

  • probably controlled by a CLI flag. Could esbulk fetch mapping info from the cluster, e.g. the number of shards?
  • could require a _routing field in documents, or fall back to _id or a key field if set
  • esbulk could partition documents across the existing worker threads. I think this might "just work" even if the number of worker threads does not equal the number of index shards, but it would probably work better if each batch targeted a single shard at a time
  • or, esbulk could keep per-shard caches internally, and when any individual shard cache reaches the bulk batch size, send that batch to a worker thread. This would increase memory consumption, particularly with large documents and a large number of shards, but that might be fine
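The per-shard cache idea from the last bullet could be sketched in Go (esbulk's language). This is only an illustration: Elasticsearch actually routes documents with a murmur3 hash of the routing value modulo the shard count, while this sketch substitutes FNV-1a from the standard library, and the `partitioner` type and flush callback are invented here, not part of esbulk.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardID approximates Elasticsearch routing. Elasticsearch computes
// murmur3(_routing) % numShards; FNV-1a stands in for murmur3 here
// to keep the sketch dependency-free.
func shardID(id string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(id))
	return int(h.Sum32() % uint32(numShards))
}

// flushFn is a hypothetical callback that would hand a full batch to a worker.
type flushFn func(shard int, batch []string)

// partitioner keeps one cache per shard and flushes a cache as soon as it
// reaches batchSize, so each bulk request targets a single shard.
type partitioner struct {
	numShards int
	batchSize int
	caches    [][]string
	flush     flushFn
}

func newPartitioner(numShards, batchSize int, flush flushFn) *partitioner {
	return &partitioner{
		numShards: numShards,
		batchSize: batchSize,
		caches:    make([][]string, numShards),
		flush:     flush,
	}
}

// Add routes a document (by id) into its shard cache and flushes the cache
// once it is full.
func (p *partitioner) Add(id, doc string) {
	s := shardID(id, p.numShards)
	p.caches[s] = append(p.caches[s], doc)
	if len(p.caches[s]) >= p.batchSize {
		p.flush(s, p.caches[s])
		p.caches[s] = nil
	}
}

func main() {
	total := 0
	p := newPartitioner(4, 2, func(shard int, batch []string) {
		total += len(batch)
		fmt.Printf("flushing %d docs to shard %d\n", len(batch), shard)
	})
	for i := 0; i < 10; i++ {
		p.Add(fmt.Sprintf("doc-%d", i), "{...}")
	}
	fmt.Println("docs flushed so far:", total)
}
```

Leftover partial caches would still need a final flush at shutdown, which the sketch omits.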
@miku
Owner

miku commented Mar 25, 2021

Great point.

While reading https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html#_making_a_routing_value_required I think a custom routing value would not simplify things, since routing values have to be supplied at both index and query time, etc.

The way I see this being done would be a per-shard cache (option 4), in memory (or even in temp files, if there are many shards).
