
suggestions #1

Open · belm0 opened this issue Jun 20, 2020 · 3 comments

Comments

belm0 commented Jun 20, 2020

Hello! I noticed the link to this project from carsonfarmer/streamhist. I've been tinkering with the optimization and correctness of this algorithm for about a year, and contributed some correctness changes to streamhist. Most of the performance work I did hasn't been published yet.

Suggestions after browsing this project:

consider a few small changes to the algorithm to make the implementation "exact" when the histogram is below capacity - ideally it should match the output of a well-tested library like numpy in this mode. I contributed such support to streamhist.

carsonfarmer/streamhist#11

support reporting of multiple quantile points efficiently - this is a common use, the numpy and streamhist APIs support it, and streamhist is an example of efficiently computing the sums one time in advance and reusing them for each quantile
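The "compute the sums once" idea can be sketched roughly as below. The bin layout and function name are hypothetical (not distogram's or streamhist's actual API), and real Ben-Haim/Tom-Tov quantiles also interpolate between bin centers, which this crude nearest-bin version skips:

```python
# Hypothetical bin layout: a list of (center, count) pairs, sorted by center.
import bisect
from itertools import accumulate

def quantiles(bins, qs):
    """Approximate several quantiles, reusing one pass of prefix sums."""
    counts = [c for _, c in bins]
    cumsum = list(accumulate(counts))      # computed once, up front
    total = cumsum[-1]
    result = []
    for q in qs:
        rank = q * total
        i = bisect.bisect_left(cumsum, rank)   # reuse the same prefix sums
        result.append(bins[min(i, len(bins) - 1)][0])
    return result

bins = [(1.0, 10), (2.0, 30), (3.0, 40), (4.0, 20)]
print(quantiles(bins, [0.25, 0.5, 0.9]))   # [2.0, 3.0, 4.0]
```

The prefix sums cost one O(bins) pass, after which each additional quantile is only a binary search, instead of re-walking the counts per quantile.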

Beyond that, one of the things I've developed which is missing here is a fast implementation based on numba + numpy, for projects which cannot use pypy. The implementation is about 10-15x faster than streamhist last time I measured. The dependencies could be optional. Would an enhancement like that fit into distogram?

MainRo (Member) commented Jun 20, 2020

consider a few small changes to the algorithm to make the implementation "exact" when the histogram is below capacity - ideally it should match the output of a well-tested library like numpy in this mode. I contributed such support to streamhist.

I wonder whether this is really necessary: using this library does not make sense with such small datasets. See the next point for more explanation.

support reporting of multiple quantile points efficiently - this is a common use, the numpy and streamhist APIs support it, and streamhist is an example of efficiently computing the sums one time in advance and reusing them for each quantile

The main idea of distogram was to keep its implementation very simple, and eventually use it as a foundation for other higher-level libraries. For example, the equivalent of the pandas describe function can be implemented on top of the current functions. There is an example here:
https://github.com/maki-nage/rxsci/blob/master/rxsci/math/dist/__init__.py#L60
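As an illustration of that layering idea, a describe()-style summary can be composed from a handful of primitives. The helpers below are stand-ins operating on a raw sorted sample rather than distogram's real API; only the composition pattern is the point:

```python
# Stand-in "histogram" = a sorted raw sample; in a real higher-level library
# these primitives would be backed by the streaming histogram instead.
def h_count(h):
    return len(h)

def h_mean(h):
    return sum(h) / len(h)

def h_quantile(h, q):
    # crude nearest-rank quantile on the sorted sample
    return h[min(int(q * len(h)), len(h) - 1)]

def describe(h):
    """Compose a pandas-describe-like summary from the primitives above."""
    return {
        "count": h_count(h),
        "mean": h_mean(h),
        "min": h[0],
        "25%": h_quantile(h, 0.25),
        "50%": h_quantile(h, 0.50),
        "75%": h_quantile(h, 0.75),
        "max": h[-1],
    }

sample = sorted(range(1, 101))   # 1..100
print(describe(sample))
```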
I see that your implementation of multi-quantile computation contains optimizations, but is this necessary? I mean, unless quantiles are computed very often, this is not a performance bottleneck.
The previous point on exact computation for small datasets is similar: this could be done in higher-level libraries (by the way, this is a good idea that I will certainly implement in rxsci).
On the other hand, your changes for such support are rather small, so if some people are willing to use distogram directly, we could add these features.

Beyond that, one of the things I've developed which is missing here is a fast implementation based on numba + numpy, for projects which cannot use pypy. The implementation is about 10-15x faster than streamhist last time I measured. The dependencies could be optional. Would an enhancement like that fit into distogram?

Performance-wise, I consider pypy the only option today for a real application. I adapted the distogram bench to streamhist to compare the update function: on CPython distogram is 25% faster, and on pypy distogram is 13 times faster.
I do not want to use numpy because in a streaming application it will never allow CPython to close the gap with pypy, and numpy just breaks pypy's JIT. However, I am interested in numba support, especially because it should be compatible with pypy at some point (cf. numba/numba#5766).

belm0 (Author) commented Jun 21, 2020

Thank you for responding.

I wonder whether this is really necessary: Using this library does not make sense with such small datasets.

Some applications are general, and the dataset size is not known in advance. For example, I use streaming histograms in performance timers. As a general utility, it is not known whether the code under test will be invoked 10 times or 1 million times during the session. It's an advantage to use the available bin capacity and report exact quantiles when possible, yet have the implementation transition gracefully to an approximate histogram as the data points grow. It only requires minor tweaks to the Ben-Haim algorithm.
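A minimal sketch of that "exact until capacity, approximate after" behaviour, using a toy structure rather than distogram's or streamhist's actual implementation:

```python
import bisect

class TinyHist:
    """Toy histogram: exact while under the bin budget, merging after."""

    def __init__(self, max_bins=4):
        self.max_bins = max_bins
        self.bins = []                       # sorted (center, count) pairs

    def update(self, x):
        i = bisect.bisect_left([v for v, _ in self.bins], x)
        self.bins.insert(i, (x, 1))
        if len(self.bins) > self.max_bins:
            # merge the closest adjacent pair, Ben-Haim/Tom-Tov style
            gaps = [self.bins[j + 1][0] - self.bins[j][0]
                    for j in range(len(self.bins) - 1)]
            j = gaps.index(min(gaps))
            (v1, c1), (v2, c2) = self.bins[j], self.bins[j + 1]
            self.bins[j:j + 2] = [((v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2)]

    def is_exact(self):
        # while every bin still holds one raw value, quantiles can be exact
        return all(c == 1 for _, c in self.bins)
```

While is_exact() holds, quantiles can be answered from the raw values (matching numpy's output); once a merge happens, the usual approximation takes over.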

Performance-wise, I consider pypy the only option today for a real application.

I'm not sure what this means. pypy has limitations (it is not a 1:1 replacement for CPython), and there are legacy applications which cannot transition to pypy easily, or which are heavily dependent on numpy. For such applications, a numpy + numba implementation is useful. Having such an implementation does not preclude having a pure Python implementation which supports pypy; they can exist alongside each other.

I do not want to use numpy because in a streaming application it will never allow CPython to close the gap with pypy, and numpy just breaks pypy's JIT. However, I am interested in numba support,

A numba-only implementation will not perform well, because numba does not support fast (nopython) mode with plain Python arrays. The only way to get performance on this algorithm with numba is via numpy arrays.
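For illustration, the kind of kernel in question is the scan for the closest adjacent bin pair. Written over contiguous numpy arrays it is exactly the sort of function numba compiles well; numba itself is omitted here, and its @njit decorator could be applied as an optional extra:

```python
import numpy as np

def closest_pair(centers):
    """Index of the smallest gap between adjacent bin centers."""
    gaps = centers[1:] - centers[:-1]    # vectorized adjacent differences
    return int(np.argmin(gaps))

centers = np.array([1.0, 2.0, 3.5, 3.6, 9.0])
print(closest_pair(centers))             # smallest gap: 3.5 to 3.6
```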

I adapted the distogram bench to streamhist to compare the update function. On CPython distogram is 25% faster, and on pypy distogram is 13 times faster.

I measured my pure Python implementation (no numba or numpy) vs. distogram. It is 20% faster (and less code, though I didn't compare closely). The implementation uses a "maintain a cost function array" approach just as distogram does, so distogram appears to have some room for improvement.
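The "maintain a cost function array" idea can be sketched as follows (a hypothetical helper, not belm0's or distogram's actual code): after an insert or merge at index j, only the gaps adjacent to j change, so the cost array is patched locally instead of being rebuilt on every update.

```python
def patch_costs(costs, centers, j):
    """Refresh only the adjacent-gap costs around position j, in place."""
    for k in (j - 1, j):
        if 0 <= k < len(centers) - 1:
            costs[k] = centers[k + 1] - centers[k]

centers = [1.0, 2.0, 4.0, 7.0]
costs = [centers[i + 1] - centers[i] for i in range(len(centers) - 1)]
centers[2] = 5.0                 # a bin center moved after a merge
patch_costs(costs, centers, 2)   # only costs[1] and costs[2] are recomputed
print(costs)                     # [1.0, 3.0, 2.0]
```

The O(1) patch replaces an O(bins) rebuild of the gap list on every update; finding the minimum-cost pair still costs a scan unless a heap is layered on top.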

My numba+numpy implementation is 20x faster than streamhist (with 64 bins).

MainRo (Member) commented Jun 21, 2020

Some applications are general, and the dataset size is not known in advance. For example, I use streaming histograms in performance timers. As a general utility, it is not known whether the code under test will be invoked 10 times or 1 million times during the session. It's an advantage to use the available bin capacity and report exact quantiles when possible, yet have the implementation transition gracefully to an approximate histogram as the data points grow. It only requires minor tweaks to the Ben-Haim algorithm.

OK, this makes sense. Since this is a minor evolution, we can add it.

I created issue #2 for a dedicated numba/numpy discussion.
