Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

approximate histograms #58

Closed
belm0 opened this issue Apr 17, 2019 · 3 comments
Closed

approximate histograms #58

belm0 opened this issue Apr 17, 2019 · 3 comments

Comments

@belm0
Copy link

belm0 commented Apr 17, 2019

I'm following the paper (http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf) implemented by https://github.com/carsonfarmer/streamhist, and the notion of approximate histograms seems elegant and efficient.

After seeing the internals of streamhist (trying to fix bugs) and reading the paper, I can imagine ways to make a better implementation: e.g. much more efficient discovery of bins to be joined, and avoiding temporary lists when possible. Also the code seems overly complex, partially due to features like "bin freezing" which try to workaround poor bin joining performance.

Anyway since streamhist is defunct, I'm thinking about trying an implementation. I wonder if this kind of histogram would fit into physt (and if sortedcollections would be reasonable as a dependency).

@belm0
Copy link
Author

belm0 commented Apr 25, 2019

I've prototyped a new implementation of the Ben-Haim streaming histogram in pure Python, no package dependencies. Given a max bin count of 64 (streamhist default), adding a value takes 12 usec vs. 36 usec for streamhist (on 5 year old MacBook Air). That's a compelling speed given that the algorithm dynamically adjusts bins on every insert at steady state. It's possible to maintain that insert speed at much larger max bin counts (say 1,000), but the implementation gets significantly more complex (maintain a sorted container of bin distances).

More interesting is how flexible the algorithm is regarding bin merging strategies. The strategy in the Ben-Haim paper merges nearest bins, which I suppose preserves the most information about a distribution. However with a simple tweak it can instead maintain equal bin sizes-- ideal for arbitrary percentile queries. Then with some cheating I've made a "best effort" equal bin size strategy that is amortized O(1) regardless of max bin count, with fill running at 3 usec. Other merge strategies are possible, such as providing specific percentiles using a minim number of bins (e.g. accurate 50% and 90% using only 5 bins).

@belm0
Copy link
Author

belm0 commented May 19, 2019

implementation update

I improved the fill performance of the simple Python implementation and added a numba implementation:

  • 64 bins: streamhist 30 µs, new 6.5 µs, new_numba 2.3 µs
  • 1024 bins: streamhist 192 µs, new 31 µs, new_numba 4.7 µs

(I wonder if physt would take on numba as extra_requires.)

I've implemented equivalent of numpy.quantile(). It matches quantile() output exactly until max_bins is reached, then transitions gracefully to approximate quantiles.

@janpipek
Copy link
Owner

Sorry, this will probably never be implemented :-/

@janpipek janpipek closed this as not planned Won't fix, can't repro, duplicate, stale Nov 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants