Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What about less accurate but more efficient percentile calculation? #513

Closed
tisba opened this issue May 7, 2014 · 6 comments
Closed

What about less accurate but more efficient percentile calculation? #513

tisba opened this issue May 7, 2014 · 6 comments

Comments

@tisba
Copy link

tisba commented May 7, 2014

I was wondering, what your opinion is about offering a less accurate, but way more efficient method for percentile calculation.

AFAIK influx currently calculates percentiles in a naive way. There are more efficient (and way faster approaches) to do this (t-DIGESTS used in Elasticsearch e.g.). If you have many events, distributed over your cluster, then it might be nice to offer the user a way to trace accuracy for speed.

I have to deal with millions of events that I'd like to run percentile (actually I calculate entire percentile ranges) calculations over. The performance is okay-isch right now, but not very good for an interactive use case.

@jvshahid
Copy link
Contributor

jvshahid commented May 7, 2014

Sounds interesting. We've talking about this for a while. I'll read the paper and try to get something out there soon for you to try.

@jvshahid jvshahid added this to the Next release milestone May 12, 2014
@tisba
Copy link
Author

tisba commented Oct 8, 2014

Did you had any time to take a look at this, @jvshahid? Performance of percentile calculation is starting to give me a little headache.

@jvshahid jvshahid modified the milestones: 0.8.4, Next release Oct 9, 2014
@jvshahid jvshahid modified the milestone: 0.8.4 Oct 21, 2014
@tisba
Copy link
Author

tisba commented Oct 27, 2014

Just FYI

I've read the t-DIGESTS paper during my vacation multiple times and I think it's a great fit for InfluxDB. I will have to take a closer look at this algorithm for a project (most probably in Erlang), but I'm happy to share my remarks regarding a possible implementation in Go.

Why I think t-DIGESTS is a good fit:

  • it's very easy (and efficiently) parallelizable
  • precision (error) can be easily configured and there are very few parameters to configure tradeoffs (sample amount, compression ratio and size for digest data structures)

@jvshahid
Copy link
Contributor

@tisba that would be great

@toddboom toddboom added the idea label Nov 25, 2014
@beckettsean beckettsean added this to the Next Point Release milestone May 15, 2015
@beckettsean beckettsean modified the milestones: Next Point Release, Longer term Aug 6, 2015
@seiflotfy
Copy link
Contributor

I am looking through the right place in the code to implement it.
Also I am considering adding some new SQL statement like in Redshift: SELECT APPROXIMATE DISTINCT(COUNT()) ... Which implements a hyperloglog++. Should be faster

@beckettsean
Copy link
Contributor

As mentioned in my post to the mailing list we are experimenting with simplifying our open GitHub Issues. This feature request has been rolled into an aggregate issue for all function requests, so that we can close this issue until we are ready to work on it.

You may continue to make comments here. Closing the issue does not mean we are rejecting this idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants