ENH: push histogram calculations to compiled_base #9910

theodoregoetz · 2017-10-23T22:01:52Z

This is a resubmission of PR #9627 (my sincere apologies for the git mess).

Faster numpy.histogram() and histogramdd()

The fast-histogram Python package demonstrates that histogramming real data onto a continuous range with evenly spaced bins can be made significantly faster. I took this idea of pushing the bin calculation and the filling of the histogram array into a C function and implemented it in NumPy.

For details, please see my gist concerning this patch:
https://gist.github.com/theodoregoetz/10d2351421689bf2660b4f2fca350e6e
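
To make the idea concrete, here is a minimal pure-Python sketch of filling a histogram with evenly spaced bins by computing each bin index arithmetically instead of searching the edge array. It is an illustration only, not the C code in this PR; the name `histogram_uniform_sketch` and its signature are hypothetical.

```python
def histogram_uniform_sketch(samples, nbins, lo, hi, weights=None):
    """Histogram `samples` into `nbins` equal-width bins spanning [lo, hi].

    Hypothetical helper for illustration; not the function added by this PR.
    """
    if nbins <= 0:
        raise ValueError("nbins must be positive")
    if not hi > lo:
        raise ValueError("hi must be greater than lo")
    counts = [0.0] * nbins
    scale = nbins / (hi - lo)
    for i, x in enumerate(samples):
        # Samples outside the range are ignored (this also drops NaN).
        if not (lo <= x <= hi):
            continue
        # Direct index computation: O(1) per sample, no search over edges.
        idx = int((x - lo) * scale)
        # The last bin is closed on the right, so x == hi lands in bin nbins - 1.
        idx = min(idx, nbins - 1)
        counts[idx] += 1.0 if weights is None else weights[i]
    return counts
```

A production implementation would also need to guard against floating-point rounding for samples that sit very close to a bin boundary; this sketch does not attempt that.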

eric-wieser · 2017-10-23T22:15:44Z

benchmarks/benchmarks/bench_function_base.py

+        np.histogramdd(self.d, (200, 200), ((50, 51), (50, 51)))
+
+    def time_fine_binning(self):
+        np.histogramdd(self.d, (10000, 10000), ((0, 100), (0, 100)))


In the interest of seeing a comparison at https://pv.github.io/numpy-bench/, can you make a separate PR with just the benchmarks, that we can merge a day earlier?

eric-wieser · 2017-10-24T04:38:12Z

numpy/core/src/multiarray/compiled_base.c

                    dres[i].imag = (dy[j+1].imag - dy[j].imag)*(x_val - dx[j])*
-			inv_dx + dy[j].imag;
+                    inv_dx + dy[j].imag;


This indentation looks worse to me than before

eric-wieser · 2017-10-24T04:38:47Z

numpy/core/src/multiarray/compiled_base.c

-    const npy_cdouble *dy; 
-    npy_cdouble lval, rval; 
+    const npy_cdouble *dy;
+    npy_cdouble lval, rval;


If you're going to touch this unrelated whitespace, can you do it in a separate commit?

No problem. Should it be in a separate PR as well?

These two lines were missed. All the other whitespace changes are now in a separate commit as part of this PR.

eric-wieser · 2017-10-24T04:40:06Z

numpy/lib/function_base.py

+                n += _histogram_uniform(tmp_a, bin_edges, tmp_w).astype(ntype)
+
+        # Rename the bin edges for return.
+        bins = bin_edges


This isn't needed after a change I made recently - we now return bin_edges anyway

eric-wieser · 2017-10-24T04:43:35Z

numpy/core/src/multiarray/compiled_base.c

+     * Get the number of bins for each array since
+     * the edges are passed in as a flat array
+     */
+    arr_bins = (PyArrayObject *)PyArray_ContiguousFromAny(obj_bins, NPY_INTP, 1, 1);


Does this accept and round floating point values, when it ought to be erroring?

eric-wieser · 2017-10-24T04:44:25Z

numpy/core/src/multiarray/compiled_base.c

+        goto fail;
+    }
+    {
+        npy_intp nedges_total = 0, i;


Would be clearer as two declarations.

eric-wieser · 2017-10-24T04:44:46Z

numpy/core/src/multiarray/compiled_base.c

+            max[d] = e[bins[d]];
+
+            if (bins[d] <= 0)
+            {


Brace style doesn't match elsewhere here

eric-wieser · 2017-10-24T04:52:03Z

numpy/core/src/multiarray/compiled_base.c

+    }
+    if ((max - min) <= 0)
+    {
+        PyErr_SetString(PyExc_ValueError, "Bin edges must be increasing");


I think this might be a regression - equal bin edges are allowed in other places

How do you mean? Is it just overzealous error checking? Should I remove this?
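
One reading of the concern above is that repeated edges should be tolerated and only decreasing edges rejected, with the zero-width case handled separately so the uniform path never divides by zero. A rough Python sketch under that assumption follows; `check_bin_edges` is a hypothetical name, not something in this PR.

```python
def check_bin_edges(edges):
    """Reject decreasing edges but allow equal neighbours.

    Returns True when the overall range has positive width, i.e. when a
    uniform fast path (which divides by edges[-1] - edges[0]) is usable.
    Hypothetical helper for illustration only.
    """
    if len(edges) < 2:
        raise ValueError("need at least two bin edges")
    for left, right in zip(edges, edges[1:]):
        # Equal edges pass this test; only a strict decrease is an error.
        if left > right:
            raise ValueError("bin edges must increase monotonically")
    return edges[-1] > edges[0]
```

With this shape, a caller could fall back to the general code path when the function returns False instead of raising.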

eric-wieser · 2017-10-24T04:54:31Z

numpy/core/src/multiarray/compiled_base.c

+        }
+    }
+
+    {


If they're awkward to pull into functions, it'd be good if each of these blocks could at least get a comment summarizing what it does.

eric-wieser · 2017-10-24T04:55:48Z

numpy/core/src/multiarray/compiled_base.c

+    min = e[0];
+    max = e[bins];
+
+    if (bins <= 0)


This check needs to happen before e[bins] to avoid UB / segfaults

eric-wieser · 2017-10-24T04:56:25Z

numpy/core/src/multiarray/compiled_base.c

+    if (arr_edges == NULL) {
+        goto fail;
+    }
+    bins = (npy_intp)(PyArray_DIM(arr_edges, 0) - 1);


Is this cast necessary?

The fast-histogram Python package demonstrated that histogramming real data onto a continuous range with evenly spaced bins can be made significantly faster. Two methods were added to multiarray/compiled_base.c: _arr_histogram_uniform(x, edges, weights) and _arr_histogramdd_uniform(x, edges, weights). These methods assume that the edges are uniform-linear, i.e. the result of np.linspace() or similar. Ensuring that this is the case is handled on the Python side in np.histogram() and np.histogramdd(). Benchmarks were added for 1D and 2D histogramming, both for when the histogram spans the full range of the sample and for when it spans only 1% of the range of the sample (per dimension).
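
The Python-side check that the edges are uniform is not shown in this thread. As a rough sketch of what such a guard could look like, assuming a tolerance-based comparison against ideal evenly spaced edges (the function name and the tolerance are both assumptions):

```python
import math


def edges_look_uniform(edges, rel_tol=1e-9):
    """Return True if `edges` are evenly spaced to within a tolerance.

    Hypothetical helper; the tolerance value is an assumption, not taken
    from the PR.
    """
    nbins = len(edges) - 1
    if nbins < 1:
        return False
    lo, hi = edges[0], edges[-1]
    span = hi - lo
    if not span > 0:
        # Zero-width or reversed range: not usable by a uniform fast path.
        return False
    width = span / nbins
    # Compare every edge with where an ideal linspace would put it.
    return all(
        math.isclose(e, lo + k * width, rel_tol=rel_tol, abs_tol=rel_tol * span)
        for k, e in enumerate(edges)
    )
```

Edges that fail this test would go through the general (non-uniform) code path.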

charris · 2017-10-24T17:47:51Z

I'm a bit bothered by this being in multiarray, but that is probably a bigger issue of code organization.

Any idea why the Windows tests fail?

theodoregoetz · 2017-10-24T18:01:28Z

numpy/core/src/multiarray/compiled_base.c

+     * Get the number of bins for each array since
+     * the edges are passed in as a flat array
+     */
+    arr_bins = (PyArrayObject *)PyArray_FROM_O(obj_bins);


This line is suspect regarding the test failures on Windows. The idea was that obj_bins should be an integer array containing the number of bins along each dimension.

As for being in multiarray, the thought was that histogram is closely related to bincount, but I'm open to doing a refactor/move if wanted.

I am unaware of a better way to get a PyObject that must consist of integers (one that raises an error instead of rounding when floats are passed).

theodoregoetz · 2017-10-24T18:34:24Z

There were two small changes that very likely led to the breakage on Windows. Is it OK to use AppVeyor to test reverting these changes one by one (I don't have a Windows machine)?

charris · 2017-10-24T19:20:11Z

Sure, go for it.

theodoregoetz · 2017-10-25T15:39:48Z

The previous failures on Windows were due to my use of PyArray_FROM_O, so my first attempt to add type checking on obj_bins failed (see commit e6ce638), and I haven't yet figured out another way to do so. Any help would be much appreciated.

As it stands, this array is created in numpy/lib/function_base.py and so will always contain integers. If it does get floats and rounds them down, that will be caught by the check of the total number of bins later in the function.
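
Since the bin counts are built on the Python side, one option is to validate them there before they reach C. The standard-library `operator.index` raises `TypeError` for floats instead of rounding them. The sketch below only illustrates that behaviour; `validate_bin_counts` is a hypothetical name and this is not the check used in the PR.

```python
import operator


def validate_bin_counts(bins_per_dim):
    """Convert per-dimension bin counts to ints, refusing floats.

    Hypothetical helper for illustration only.
    """
    counts = []
    for b in bins_per_dim:
        try:
            # operator.index accepts true integers only; 2.0 or 2.5 raises.
            n = operator.index(b)
        except TypeError:
            raise TypeError("bin counts must be integers, got %r" % (b,))
        if n < 1:
            raise ValueError("each dimension needs at least one bin")
        counts.append(n)
    return counts
```

This does not answer the question of how to do the equivalent check through the C API; it only shows that the rounding case can be ruled out before the C function is called.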

eric-wieser · 2017-12-27T08:06:03Z

I'm afraid this needs a rebase / merge, as histograms have moved to np.lib.histograms. I can maybe try this myself at some point

eric-wieser · 2018-04-10T18:00:12Z

I think I'm done with the rewrite of histogram stuff, so it should be safe to rebase this.

I am worried that this replaces type-generic code with float-only code - perhaps some templated loops would be useful here.

InessaPawson · 2022-06-24T20:03:20Z

@theodoregoetz @eric-wieser Do you still wish to pursue implementing this feature?

seberg · 2022-06-29T17:01:48Z

We discussed this in the meeting a bit. It seems like a good idea in general, but considering the age of the PR and that there still seems to be quite a bit to do to finish it off, we decided to close the PR.

Please do not hesitate to open a new PR (or reopen) if this work is continued. We should maybe discuss the approach a bit before spending too much time on it.

eric-wieser reviewed Oct 23, 2017

theodoregoetz mentioned this pull request Oct 23, 2017

BENCH: histogramming benchmarks #9912

Merged

eric-wieser reviewed Oct 24, 2017

theodoregoetz added 2 commits October 24, 2017 10:00

STY: whitespace cleanup

c59b9e7

theodoregoetz force-pushed the faster-histogram branch from 3ea37c4 to dc21a2d Compare October 24, 2017 17:02

charris added 01 - Enhancement component: numpy._core labels Oct 24, 2017

BUG: remove testing integer type of obj_bins

e6ce638

eric-wieser added the 55 - Needs work label Jul 30, 2018

Base automatically changed from master to main March 4, 2021 02:04

InessaPawson added the 52 - Inactive Pending author response label Jun 8, 2022

seberg closed this Jun 29, 2022

seberg added the "64 - Good Idea" label (Inactive PR with a good start or idea. Consider studying it if you are working on a related issue.) Jun 29, 2022

InessaPawson added the "triaged" label (Issue/PR that was discussed in a triage meeting) Jun 29, 2022
