approximate histograms #58

belm0 · 2019-04-17T14:43:24Z

I'm following the paper (http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf) implemented by https://github.com/carsonfarmer/streamhist, and the notion of approximate histograms seems elegant and efficient.

After seeing the internals of streamhist (trying to fix bugs) and reading the paper, I can imagine ways to make a better implementation: e.g. much more efficient discovery of bins to be joined, and avoiding temporary lists when possible. Also the code seems overly complex, partially due to features like "bin freezing" which try to workaround poor bin joining performance.

Anyway since streamhist is defunct, I'm thinking about trying an implementation. I wonder if this kind of histogram would fit into physt (and if sortedcollections would be reasonable as a dependency).

belm0 · 2019-04-25T13:11:32Z

I've prototyped a new implementation of the Ben-Haim streaming histogram in pure Python, no package dependencies. Given a max bin count of 64 (streamhist default), adding a value takes 12 usec vs. 36 usec for streamhist (on 5 year old MacBook Air). That's a compelling speed given that the algorithm dynamically adjusts bins on every insert at steady state. It's possible to maintain that insert speed at much larger max bin counts (say 1,000), but the implementation gets significantly more complex (maintain a sorted container of bin distances).

More interesting is how flexible the algorithm is regarding bin merging strategies. The strategy in the Ben-Haim paper merges nearest bins, which I suppose preserves the most information about a distribution. However with a simple tweak it can instead maintain equal bin sizes-- ideal for arbitrary percentile queries. Then with some cheating I've made a "best effort" equal bin size strategy that is amortized O(1) regardless of max bin count, with fill running at 3 usec. Other merge strategies are possible, such as providing specific percentiles using a minim number of bins (e.g. accurate 50% and 90% using only 5 bins).

belm0 · 2019-05-19T11:08:50Z

implementation update

I improved the fill performance of the simple Python implementation and added a numba implementation:

64 bins: streamhist 30 µs, new 6.5 µs, new_numba 2.3 µs
1024 bins: streamhist 192 µs, new 31 µs, new_numba 4.7 µs

(I wonder if physt would take on numba as extra_requires.)

I've implemented equivalent of numpy.quantile(). It matches quantile() output exactly until max_bins is reached, then transitions gracefully to approximate quantiles.

janpipek · 2022-11-28T13:20:02Z

Sorry, this will probably never be implemented :-/

belm0 mentioned this issue Apr 25, 2019

fill() performance #57

Open

janpipek closed this as not planned Won't fix, can't repro, duplicate, stale Nov 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

approximate histograms #58

approximate histograms #58

belm0 commented Apr 17, 2019

belm0 commented Apr 25, 2019 •

edited

belm0 commented May 19, 2019 •

edited

janpipek commented Nov 28, 2022

approximate histograms #58

approximate histograms #58

Comments

belm0 commented Apr 17, 2019

belm0 commented Apr 25, 2019 • edited

belm0 commented May 19, 2019 • edited

janpipek commented Nov 28, 2022

belm0 commented Apr 25, 2019 •

edited

belm0 commented May 19, 2019 •

edited