-
-
Notifications
You must be signed in to change notification settings - Fork 9.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: push histogram calculations to compiled_base #9910
Conversation
np.histogramdd(self.d, (200, 200), ((50, 51), (50, 51))) | ||
|
||
def time_fine_binning(self): | ||
np.histogramdd(self.d, (10000, 10000), ((0, 100), (0, 100))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the interest of seeing a comparison at https://pv.github.io/numpy-bench/, can you make a separate PR with just the benchmarks, that we can merge a day earlier?
dres[i].imag = (dy[j+1].imag - dy[j].imag)*(x_val - dx[j])* | ||
inv_dx + dy[j].imag; | ||
inv_dx + dy[j].imag; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This indentation looks worse to me than before
const npy_cdouble *dy; | ||
npy_cdouble lval, rval; | ||
const npy_cdouble *dy; | ||
npy_cdouble lval, rval; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're going to touch this unrelated whitespace, can you do it in a separate commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no problem. should it be in a separate PR as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these two lines were missed. all the other whitespace changes are on a separate commit now as part of this PR.
numpy/lib/function_base.py
Outdated
n += _histogram_uniform(tmp_a, bin_edges, tmp_w).astype(ntype) | ||
|
||
# Rename the bin edges for return. | ||
bins = bin_edges |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't needed after a change I made recently - we now return bin_edges
anyway
* Get the number of bins for each array since | ||
* the edges are passed in as a flat array | ||
*/ | ||
arr_bins = (PyArrayObject *)PyArray_ContiguousFromAny(obj_bins, NPY_INTP, 1, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this accept and round floating point values, when it ought to be erroring?
goto fail; | ||
} | ||
{ | ||
npy_intp nedges_total = 0, i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be clearer as two declarations.
max[d] = e[bins[d]]; | ||
|
||
if (bins[d] <= 0) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Brace style doesn't match elsewhere here
} | ||
if ((max - min) <= 0) | ||
{ | ||
PyErr_SetString(PyExc_ValueError, "Bin edges must be increasing"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be a regression - equal bins edges are allowed in other places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do you mean. Just overzealous error checking? Should I remove this?
} | ||
} | ||
|
||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If they're awkward to pull into functions, it'd be good if each of these blocks could at least get a comment summarizing what it does.
min = e[0]; | ||
max = e[bins]; | ||
|
||
if (bins <= 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check needs to happen before e[bins]
to avoid UB / segfaults
if (arr_edges == NULL) { | ||
goto fail; | ||
} | ||
bins = (npy_intp)(PyArray_DIM(arr_edges, 0) - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this cast necessary?
The fast-histogram python package demonstrated that histogramming real data onto a continuous range with evenly-spaced bins can be made significantly faster. Two methods were added to multiarray/compiled_base.c: _arr_histogram_uniform(x, edges, weights) and _arr_histogramdd_uniform(x, edges, weights). These methods make the assumption that the edges are uniform-linear, i.e. the result of np.linspace() or similar. The ensurance that this is case is handled on the python side in np.histogram() and np.histogramdd(). Benchmarks were added for 1D and 2D histogramming for when the histogram spans the range of the sample and when it spans only 1% of the range of the sample (per dimension).
3ea37c4
to
dc21a2d
Compare
I'm a bit bothered by this being in Any idea on why the windows tests fail? |
* Get the number of bins for each array since | ||
* the edges are passed in as a flat array | ||
*/ | ||
arr_bins = (PyArrayObject *)PyArray_FROM_O(obj_bins); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is suspect regarding test failures on windows. The idea was that obj_bins should be an integer array containing the number of bins along each dimension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as for being in multiarray, the thought was that histogram is closely related to bincount but I'm open to doing a refactor/move if wanted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unaware of a better way to get a PyObject that must consist of integers (errors instead of rounds off when floats are used)
There were two small changes that very likely lead to the breakage on windows. Is it OK to use appveyor to test reverting these changes one by one (I don't have a windows machine)? |
Sure, go for it. |
The previous failures on windows was due to my use of As it stands, this array is created in |
I'm afraid this needs a rebase / merge, as histograms have moved to |
I think I'm done with the rewrite of histogram stuff, so it should be safe to rebase this. I am worried that this replaces type-generic code with float-only code - perhaps some templated loops would be useful here. |
@theodoregoetz @eric-wieser Do you still wish to pursue implementing this feature? |
We discussed this in the meeting a bit. It seems like a good idea in general, but considering the age of the PR and that there is still seems quite bit to do to finish it off, we decided to close the PR. Please do not hesitate to open a new PR (or reopen) if this work is continued. We should maybe discuss the approach a bit before spending too much time on it. |
This is a resubmittal of PR #9627 (my sincere apologies for the gitmess)
Faster numpy.histogram() and histogramdd()
The fast-histogram python package demonstrates that histogramming real data onto a continuous range with evenly-spaced bins, can be made significantly faster. I took this idea of pushing the bin calculation and the filling of the histogram array into a C function and implemented it in NumPy.
For details, please my gist concerning this patch:
https://gist.github.com/theodoregoetz/10d2351421689bf2660b4f2fca350e6e