Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: push histogram calculations to compiled_base #9910

Closed
wants to merge 3 commits into from

Conversation

theodoregoetz
Copy link
Contributor

This is a resubmittal of PR #9627 (my sincere apologies for the gitmess)

Faster numpy.histogram() and histogramdd()

The fast-histogram python package demonstrates that histogramming real data onto a continuous range with evenly-spaced bins, can be made significantly faster. I took this idea of pushing the bin calculation and the filling of the histogram array into a C function and implemented it in NumPy.

For details, please my gist concerning this patch:
https://gist.github.com/theodoregoetz/10d2351421689bf2660b4f2fca350e6e

np.histogramdd(self.d, (200, 200), ((50, 51), (50, 51)))

def time_fine_binning(self):
np.histogramdd(self.d, (10000, 10000), ((0, 100), (0, 100)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interest of seeing a comparison at https://pv.github.io/numpy-bench/, can you make a separate PR with just the benchmarks, that we can merge a day earlier?

dres[i].imag = (dy[j+1].imag - dy[j].imag)*(x_val - dx[j])*
inv_dx + dy[j].imag;
inv_dx + dy[j].imag;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This indentation looks worse to me than before

const npy_cdouble *dy;
npy_cdouble lval, rval;
const npy_cdouble *dy;
npy_cdouble lval, rval;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're going to touch this unrelated whitespace, can you do it in a separate commit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no problem. should it be in a separate PR as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these two lines were missed. all the other whitespace changes are on a separate commit now as part of this PR.

n += _histogram_uniform(tmp_a, bin_edges, tmp_w).astype(ntype)

# Rename the bin edges for return.
bins = bin_edges
Copy link
Member

@eric-wieser eric-wieser Oct 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't needed after a change I made recently - we now return bin_edges anyway

* Get the number of bins for each array since
* the edges are passed in as a flat array
*/
arr_bins = (PyArrayObject *)PyArray_ContiguousFromAny(obj_bins, NPY_INTP, 1, 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this accept and round floating point values, when it ought to be erroring?

goto fail;
}
{
npy_intp nedges_total = 0, i;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be clearer as two declarations.

max[d] = e[bins[d]];

if (bins[d] <= 0)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brace style doesn't match elsewhere here

}
if ((max - min) <= 0)
{
PyErr_SetString(PyExc_ValueError, "Bin edges must be increasing");
Copy link
Member

@eric-wieser eric-wieser Oct 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be a regression - equal bins edges are allowed in other places

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do you mean. Just overzealous error checking? Should I remove this?

}
}

{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they're awkward to pull into functions, it'd be good if each of these blocks could at least get a comment summarizing what it does.

min = e[0];
max = e[bins];

if (bins <= 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check needs to happen before e[bins] to avoid UB / segfaults

if (arr_edges == NULL) {
goto fail;
}
bins = (npy_intp)(PyArray_DIM(arr_edges, 0) - 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this cast necessary?

The fast-histogram python package demonstrated that
histogramming real data onto a continuous range with
evenly-spaced bins can be made significantly faster.
Two methods were added to multiarray/compiled_base.c:
_arr_histogram_uniform(x, edges, weights) and
_arr_histogramdd_uniform(x, edges, weights). These
methods make the assumption that the edges are
uniform-linear, i.e. the result of np.linspace() or
similar. The ensurance that this is case is handled on
the python side in np.histogram() and np.histogramdd().
Benchmarks were added for 1D and 2D histogramming for
when the histogram spans the range of the sample and
when it spans only 1% of the range of the sample (per
dimension).
@charris
Copy link
Member

charris commented Oct 24, 2017

I'm a bit bothered by this being in multiarray, but that is probably a bigger issue of code organization.

Any idea on why the windows tests fail?

* Get the number of bins for each array since
* the edges are passed in as a flat array
*/
arr_bins = (PyArrayObject *)PyArray_FROM_O(obj_bins);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is suspect regarding test failures on windows. The idea was that obj_bins should be an integer array containing the number of bins along each dimension.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as for being in multiarray, the thought was that histogram is closely related to bincount but I'm open to doing a refactor/move if wanted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unaware of a better way to get a PyObject that must consist of integers (errors instead of rounds off when floats are used)

@theodoregoetz
Copy link
Contributor Author

There were two small changes that very likely lead to the breakage on windows. Is it OK to use appveyor to test reverting these changes one by one (I don't have a windows machine)?

@charris
Copy link
Member

charris commented Oct 24, 2017

Sure, go for it.

@theodoregoetz
Copy link
Contributor Author

The previous failures on windows was due to my use of PyArray_FROM_O so my first attempt to add type checking on obj_bins failed (see commit e6ce638) and I haven't yet figured out another way to do so. Any help would be much appreciated.

As it stands, this array is created in numpy/lib/function_base.py and so will always be integers. If it does get floats and rounds down it will be caught in the check of the total number bins later in the function.

@eric-wieser
Copy link
Member

I'm afraid this needs a rebase / merge, as histograms have moved to np.lib.histograms. I can maybe try this myself at some point

@eric-wieser
Copy link
Member

I think I'm done with the rewrite of histogram stuff, so it should be safe to rebase this.

I am worried that this replaces type-generic code with float-only code - perhaps some templated loops would be useful here.

Base automatically changed from master to main March 4, 2021 02:04
@InessaPawson InessaPawson added the 52 - Inactive Pending author response label Jun 8, 2022
@InessaPawson
Copy link
Member

@theodoregoetz @eric-wieser Do you still wish to pursue implementing this feature?

@seberg
Copy link
Member

seberg commented Jun 29, 2022

We discussed this in the meeting a bit. It seems like a good idea in general, but considering the age of the PR and that there is still seems quite bit to do to finish it off, we decided to close the PR.

Please do not hesitate to open a new PR (or reopen) if this work is continued. We should maybe discuss the approach a bit before spending too much time on it.

@seberg seberg closed this Jun 29, 2022
@seberg seberg added the 64 - Good Idea Inactive PR with a good start or idea. Consider studying it if you are working on a related issue. label Jun 29, 2022
@InessaPawson InessaPawson added the triaged Issue/PR that was discussed in a triage meeting label Jun 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
01 - Enhancement 52 - Inactive Pending author response 55 - Needs work 64 - Good Idea Inactive PR with a good start or idea. Consider studying it if you are working on a related issue. component: numpy._core triaged Issue/PR that was discussed in a triage meeting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants