Support hourly Elasticsearch indexing #2369

Open
libeilin opened this issue Jan 29, 2019 · 23 comments

Comments

@libeilin libeilin commented Jan 29, 2019

elasticSearch

Contributor

@adriancole adriancole commented Jan 29, 2019

This will not work out of the box, as some other logic would need to change. We can leave this open to see whether it is popular.

Author

@libeilin libeilin commented Jan 29, 2019

OK, thanks for your reply. We ran into some problems when compiling it ourselves, so we are asking for your help here.

If this feature is implemented, I hope you can release it as soon as possible. At present, one day's data volume is too large, and Elasticsearch query speed cannot keep up with it.

Contributor

@adriancole adriancole commented Jan 29, 2019

@openzipkin/elasticsearch any interest in this?

Contributor

@xeraa xeraa commented Jan 29, 2019

  1. I assume this can't be easily fixed with alias trickery, right? 2019-01-01 pointing to 2019-01-01-00 and you switch that every hour. As long as the alias is pointing to a single index you can write to it. Pointing to multiple indices makes it read-only.
  2. Probably the better approach long-term is a rollover index where you can specify a certain age or number of docs or size. I'd generally go for size so you have a very even distribution of data per shard (otherwise weekends might be oversharded and a peak during the week undersharded). Also note that we will very soon have Index Lifecycle Management (ILM) built into Elasticsearch and Kibana, which will make the management of rollover indices and deleting old data much simpler. Though it's under the (free) Basic license and not Apache2 — not sure if that is acceptable to be used in Zipkin then.
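
For readers unfamiliar with rollover, here is a minimal sketch of the size-based approach from point 2. It assumes a write alias named `zipkin-span-write` and the thresholds shown, all of which are hypothetical and not something Zipkin creates or configures. The helper only builds the request for Elasticsearch's `_rollover` endpoint and sends nothing.

```python
import json


def rollover_request(write_alias, max_size=None, max_age=None, max_docs=None):
    """Build (method, path, body) for an Elasticsearch rollover call.

    Only the conditions that are set are included. Elasticsearch rolls the
    alias over to a new index as soon as any one of them is met.
    """
    conditions = {}
    if max_size is not None:
        conditions["max_size"] = max_size
    if max_age is not None:
        conditions["max_age"] = max_age
    if max_docs is not None:
        conditions["max_docs"] = max_docs
    return "POST", f"/{write_alias}/_rollover", {"conditions": conditions}


# Hypothetical alias name and thresholds, for illustration only.
method, path, body = rollover_request("zipkin-span-write", max_size="50gb", max_age="1d")
print(method, path, json.dumps(body))
```

Rolling over on size rather than age is what evens out the shard sizes between quiet weekends and weekday peaks.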

@shakuzen shakuzen changed the title from "ElasticSearch can only support day level index, can it support hour level index ?" to "Support Elasticsearch indexing other than daily" on Jan 29, 2019

Contributor

@adriancole adriancole commented Feb 18, 2019

@xeraa do you know which version rollover index was added? I agree the core issue here is size.

Contributor

@xeraa xeraa commented Feb 18, 2019

@adriancole 6.6 (the current version): https://www.elastic.co/guide/en/elasticsearch/reference/6.6/index-lifecycle-management.html

You can fully manage it through the Elasticsearch API, but Kibana also provides a UI for it. And as I said: not open source, but free to use (Basic license).

Contributor

@adriancole adriancole commented Feb 27, 2019

@libeilin before we experiment with a non-OSS feature, can you comment on whether rollover indexing is desirable? Maintaining features has a cost, especially with non-OSS distributions (as it affects how we do testing), so we want to make sure there is user buy-in.

It is also possible for us to explore hourly indexes regardless.

@untergeek untergeek commented Apr 22, 2019

There's always Elastic Curator if you want to use Rollover, but are using OSS Elasticsearch (no Basic license). It's OSS, and requires no license.

Contributor

@adriancole adriancole commented Apr 22, 2019

@untergeek thanks for the pointer. I think you are pointing to this specifically right? https://www.elastic.co/guide/en/elasticsearch/client/curator/5.6/ex_rollover.html

To elaborate on this approach, we'd need more details about what it takes in practice: Curator config vs. index template config, any extra processes Curator needs to run, and what, if anything, the aliasing implies when we do reads or writes. I wonder if someone already has this set up with a Zipkin site (or anything that uses daily indexes and rollover with no client call changes needed).

@singhabhinav03 singhabhinav03 commented May 2, 2019

We recently started using Zipkin for OpenTracing. Our company also has a requirement for a monthly or weekly Zipkin index. It would be great if you added this support.

Contributor

@xeraa xeraa commented May 2, 2019

Just as an idea: Maybe this is going a bit too deep down the rabbit hole for one datastore and it would make more sense to leave that part to Curator or ILM (by documenting the right configurations to be used)? There are various use cases about time based index patterns, rollover, deletion of data,... that are kind of solved externally already.

Contributor

@adriancole adriancole commented May 3, 2019

Member

@shakuzen shakuzen commented May 8, 2019

> Our company also has a requirement for a monthly or weekly Zipkin index. It would be great if you added this support.

@singhabhinav03 could you elaborate on what you're trying to achieve that you cannot currently? The original request is to be able to have finer-grain indexes than daily because the data volume in one day is too large. Weekly or monthly indexes are only likely usable with relatively small amounts of tracing data.

@adriancole adriancole changed the title from "Support Elasticsearch indexing other than daily" to "Support hourly Elasticsearch indexing" on Aug 21, 2019

Contributor

@adriancole adriancole commented Aug 21, 2019

I think this issue got stuck as we were worried about how to address varied granularity. @narayaruna opened #2767 which doesn't imply varied granularity.

If we limit this to hourly indexes, anyone can still use Curator or similar to rescale these to daily, weekly, or monthly, correct? cc @openzipkin/elasticsearch

Contributor

@xeraa xeraa commented Aug 21, 2019

> If we limit this to hourly indexes, anyone can still use Curator or similar to rescale these to daily, weekly, or monthly, correct?

Not sure I'm reading this correctly, but combining hourly indices into a daily one (merging 24 indices) isn't easily possible — that would require a reindex (where you use a script to change the _index field).

My concern with hourly indices is that this will be a lot of shards. Just using 1 primary and 1 replica you'll end up with 48 shards for a single day. Our recommendation is to have less than 20 shards per GB of heap and each shard should be around 10 to 50GB in size. I can see how this works out for some heavy users, but it will be a bad choice for many others.
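
The arithmetic behind that concern can be worked through in a few lines. This is only a sketch: the 7-day retention is an assumed figure for illustration, the heap number is cluster-wide, and the guidance is treated as "at most 20 shards per GB" for simplicity.

```python
def shards_per_day(indices_per_day, primaries=1, replicas=1):
    """Total shards (primaries plus replica copies) created per day."""
    return indices_per_day * primaries * (1 + replicas)


def min_heap_gb(total_shards, max_shards_per_gb=20):
    """Smallest whole number of heap GB that keeps shards per GB within the limit."""
    return -(-total_shards // max_shards_per_gb)  # ceiling division


retention_days = 7  # assumed retention, for illustration only

hourly = shards_per_day(24)  # 24 indices * 1 primary * 2 copies
daily = shards_per_day(1)    # 1 index * 1 primary * 2 copies

print(hourly, min_heap_gb(hourly * retention_days))
print(daily, min_heap_gb(daily * retention_days))
```

Under these assumptions, a week of hourly indices means 336 shards, versus 14 for daily ones, which is why hourly only pays off when each hour's data is large enough to fill a reasonably sized shard.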

IMO a combination of rollover and write index alias would be the more generic solution that gives users fewer chances for bad configurations.

Do you have like a sample app where I could add the right config to show how this works? Might be easier than discussing it.

Contributor

@adriancole adriancole commented Aug 21, 2019

@xeraa so I think the concern from @narayaruna is that with TB-scale indexes, search requires bumping read timeouts to 60s, even with our cherry-picked indexing. So this is more about the query side than the write side, iiuc.

Contributor

@adriancole adriancole commented Aug 21, 2019

So the thinking is: for data sets that naturally fit the heap-per-shard guidance at hourly granularity or finer, putting that data in hourly indexes should make more sense than daily ones. The query side could also be better optimized, because instead of requesting a daily index for a search it could request an hourly one, without any special features.

Am I missing something? (PS: thanks for mentioning where hourly does not make sense! Possibly we can do a discover check to warn if the config doesn't make sense.)

Contributor

@xeraa xeraa commented Aug 21, 2019

Yes, if you are looking at a short timeframe (like 1h). I'm not sure what the common access pattern is to be honest.

On the other hand if you have a filter on the timeframe and access it frequently enough then that will be cached and should also be pretty fast as well. I couldn't say how much win to expect (depends on so many factors including the access pattern — timeframe and frequency).

Contributor

@adriancole adriancole commented Aug 22, 2019

Literally, the default lookback is 1 hour, and currently a query will grab one daily index, or possibly two if it is just past midnight. This is probably why Nara mentions this: hourly indexes would lower the blast radius of the default query to at most 2 hours of data, when it is just past the hour.

[Image: screenshot, 2019-08-22 8:24:28 AM]
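
To make the "one or two indexes" point concrete, here is a small sketch of index selection for a lookback window. The `zipkin-span-` prefix and the `-HH` hourly suffix are assumptions for illustration; the hourly format in particular is hypothetical, and this is not Zipkin's actual implementation.

```python
from datetime import datetime, timedelta, timezone


def indices_for_range(end, lookback, hourly=False, prefix="zipkin-span-"):
    """List index names covering [end - lookback, end]."""
    step = timedelta(hours=1) if hourly else timedelta(days=1)
    fmt = "%Y-%m-%d-%H" if hourly else "%Y-%m-%d"
    start = end - lookback
    # Truncate the start to the beginning of its bucket (hour or day).
    cursor = start.replace(minute=0, second=0, microsecond=0)
    if not hourly:
        cursor = cursor.replace(hour=0)
    names = []
    while cursor <= end:
        names.append(prefix + cursor.strftime(fmt))
        cursor += step
    return names


just_past_midnight = datetime(2019, 8, 22, 0, 24, tzinfo=timezone.utc)
one_hour = timedelta(hours=1)

# Daily buckets: a 1-hour lookback spans two full days of data.
print(indices_for_range(just_past_midnight, one_hour))
# Hourly buckets: the same lookback spans two hours of data.
print(indices_for_range(just_past_midnight, one_hour, hourly=True))
```

Both cases return two index names, but the hourly pair covers at most two hours of spans instead of two days.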

Contributor

@adriancole adriancole commented Aug 22, 2019

At any rate, we could put a branch up and see how it goes. If it isn't helpful we wouldn't do it, but for some sites this could be an easy-to-reason-about, low-tech option to speed some things up.

Ack on the reindexing thing if someone needs to re-scale data. We can put more notes in the readme with knowledge gained here regardless of if the change is implemented.

Contributor

@adriancole adriancole commented Aug 22, 2019

PS I opened this because I think I was the one who came up with the hour search default :) #2772

Contributor

@xeraa xeraa commented Aug 22, 2019

Sounds good on trying it out on a branch.

On the re-scaling: Rather than reindexing indices together, you could have an index template with 3 primary shards (just as an example for spreading the ingestion over 3 nodes), but once the index is readonly you could shrink it down to a single primary shard. That should be the better pattern for more parallelization at first and then reducing the number of shards later on. And this is just a question of index template and then Elastic Curator / ILM / ... — would probably just need a little documentation on the Zipkin side.
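
A rough sketch of the shrink pattern described above, assuming hypothetical index and node names. It only builds the two requests (prepare the finished index, then shrink it) and sends nothing; in practice Curator or ILM would issue the equivalent calls.

```python
def shrink_requests(source, target, node_name, target_primaries=1):
    """Return the (method, path, body) calls for shrinking a finished index."""
    prepare = (
        "PUT",
        f"/{source}/_settings",
        {
            # Shrink needs a copy of every shard on a single node...
            "index.routing.allocation.require._name": node_name,
            # ...and the source index must block writes.
            "index.blocks.write": True,
        },
    )
    shrink = (
        "POST",
        f"/{source}/_shrink/{target}",
        {"settings": {"index.number_of_shards": target_primaries}},
    )
    return [prepare, shrink]


# Hypothetical names, for illustration only.
for request in shrink_requests(
    "zipkin-span-2019-08-22", "zipkin-span-2019-08-22-shrunk", "node-1"
):
    print(request)
```

The target shard count has to be a factor of the source's, so 3 primaries can shrink to 1. A real setup would likely also clear the allocation requirement on the shrunken index and wait for relocation to finish between the two calls.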
