
specify multiple aggregation functions #3

Closed
d1manson opened this issue Jun 4, 2015 · 3 comments

@d1manson (Collaborator) commented Jun 4, 2015

Discussed initially in this other issue.

Design of syntax

Suggested syntax for input:

# using tuple...
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs=('mean', 'std', 'min', 'max'))
# using list...
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs=['mean', 'std', 'min', 'max'])
# using comma/semicolon and/or white-space-delimited string
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs='mean std min max')

For output, you could argue that a dict, class or namedtuple would be a safer solution as the user is less likely to mix up the order. Actually, it seems that namedtuple is probably a pretty good solution because it will naturally unpack if the user wants it to, otherwise the user can treat it like a class/dict. Incidentally, the field_names arg for namedtuple supports basically the same set of input syntaxes described above for aggregate. I guess you would need to dynamically generate the namedtuple classes based on the requested funcs list, and then store the class definitions in a little cache - but that's easy enough to do.
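
For illustration, a minimal sketch of that namedtuple caching idea (the names here are hypothetical, not part of the actual API):

from collections import namedtuple

# hypothetical cache of dynamically generated namedtuple classes,
# keyed by the tuple of requested function names
_result_class_cache = {}

def _result_class(func_names):
    key = tuple(func_names)
    if key not in _result_class_cache:
        _result_class_cache[key] = namedtuple('AggregateResult', key)
    return _result_class_cache[key]

cls = _result_class(('mean', 'std', 'min', 'max'))
res = cls(mean=1.0, std=0.5, min=0.0, max=2.0)
res.mean                            # attribute access, like a class/dict
a_mean, a_std, a_min, a_max = res   # or natural tuple unpacking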

If the output is a namedtuple, the following is possible:

# direct unpacking...
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs='mean std min max')
# using as object...
a_agg = aggregate(group_idx, a, funcs='mean std min max')
plt.errorbar(a_agg.mean, yerr=a_agg.std) # or whatever

As previously discussed, the advantage of accepting multiple functions is that the aggregation machinery then has scope for a variety of optimizations (though it could do the simplest thing and loop over each function in turn, or simply raise NotImplementedError). Any alternative syntax (e.g. wrapping the aggregation into an object and then providing methods on that object in the way that pandas does) is likely to require some degree of explicit caching and a bunch of complex logic to deal with it, whereas this syntax should keep any logic reasonably straightforward and permit optimal use of memory.

Some suggestions on possible optimizations

In JIT-C implementations there is the option of squashing everything into a single pass over the array, which should offer very nice speed-ups. To get the most out of the cache it would probably make sense to arrange each of the multiple outputs for a given group contiguously in memory, e.g. for min max the output would be [min_0 max_0 min_1 max_1 .. min_n max_n]. Whether or not the output is produced in this manner will not actually be visible to the user, as they will only get views onto it, which can be provided with basic numpy indexing, e.g. return custom_tuple(min=combined[:, 0], max=combined[:, 1]). I think that will be safe, because there is no overlap between the views into the combined array, so the user could never accidentally overwrite one variable and unexpectedly affect another. One thing to note, though, is that different outputs may need different dtypes - it is still possible to achieve this with numpy ndarrays, but it's an extra implementation detail to get right.
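
As an illustration of the views idea (just a sketch, not actual implementation code):

import numpy as np

ngroups = 3
# hypothetical interleaved layout: one row per group, one column per
# function, so min_i and max_i for group i are adjacent in memory
combined = np.empty((ngroups, 2))
# ...a single JIT-compiled pass over the data would fill combined here...
mins = combined[:, 0]   # basic indexing returns a view, not a copy
maxs = combined[:, 1]
assert mins.base is combined and maxs.base is combined   # no data duplicated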

In the numpy implementation there are a few functions which go nicely together: sum, mean, var, and std all start with a sum, so that can be reused, etc. min and max can be run using argsort(a) combined with the first/last trick. any and all both need to work on a.astype(bool), and allnan and anynan both need to work on isnan(a).
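
For example, a sketch of that reuse (using bincount-based grouping for brevity, not the library's actual internals):

import numpy as np

def sum_mean_var_std(group_idx, a, ngroups):
    # the per-group sum and count are computed once...
    sums = np.bincount(group_idx, weights=a, minlength=ngroups)
    counts = np.bincount(group_idx, minlength=ngroups)
    # ...and reused by mean, var and std
    means = sums / counts
    sq_sums = np.bincount(group_idx, weights=a * a, minlength=ngroups)
    variances = sq_sums / counts - means ** 2   # E[x^2] - E[x]^2
    return sums, means, variances, np.sqrt(variances)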

Of course, if nan- versions are used, then that overhead only needs to be paid once for multiple functions. The same is true of multi_ravel_index, and potentially all bounds-checking needs to be done only once.
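
Again just as a sketch, the shared nan- preprocessing could be as simple as:

import numpy as np

# hypothetical helper: the NaN filtering is paid once, and every
# requested nan- variant then runs on the already-cleaned arrays
def strip_nans(group_idx, a):
    good = ~np.isnan(a)
    return group_idx[good], a[good]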

edit: and if custom functions are involved, it's obviously easier to loop over each of them for a given array rather than redoing all the sorting etc. for each function... although admittedly that has to be a pretty rare use case, and the custom function could have handled it internally anyway... though I guess this is relevant to the pure-python implementation, since everything there is done in this manner.

@ml31415 (Owner) commented Apr 23, 2018

I guess we should close this. Reasons:

In terms of syntax, we're not benefiting a lot. It would always be possible to also write:

a_mean, a_std, a_min, a_max = (aggregate(group_idx, a, f) for f in ('mean', 'std', 'min', 'max'))

Our call syntax for aggregate already has enough keyword options, and func can be a function, a string, and so on, making the call structure complicated enough that it's just on the edge of being intuitive - and I'd like to preserve that.

In terms of speed, it only makes sense to talk about the fast numba (and weave) options. If you're concerned about speed, you won't take the slower versions into consideration anyway. Let's also leave weave aside, as I'd consider it kind of deprecated anyway. So, talking about the numba implementation, what would have to change in order to actually do that:

  • Compile a new function for every function/datatype combination you throw in
  • For every single entry in group_idx / a, iterate over all given functions

Both things come with major speed concerns. While a plain sum, plain mean, etc. will already be compiled when run, a sum-mean combo or something even more complex usually won't. Having done some benchmarking lately, the compile time is orders of magnitude higher than running the function itself on a midsized array like in our benchmarks. Also, the loop over the given functions is a big speed hindrance, as every loop comes with its own overhead. Inlining might be possible - not sure to what extent it happens automatically - but in the worst case you'd have to create separate functions for 2, 3, or however many functions are given. And this would make the already not-that-simple code in the numba implementation even uglier.
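
To illustrate the compile-time point, a sketch with a hand-rolled fused kernel (not numpy-groupies code; timings will vary by machine):

import time
import numpy as np
from numba import njit

@njit
def group_sum_mean(group_idx, a, ngroups):
    # fused single-pass kernel for a sum+mean combo
    sums = np.zeros(ngroups)
    counts = np.zeros(ngroups)
    for i in range(len(a)):
        sums[group_idx[i]] += a[i]
        counts[group_idx[i]] += 1
    return sums, sums / counts

group_idx = np.random.randint(0, 100, 100_000)
a = np.random.rand(100_000)

t0 = time.perf_counter()
group_sum_mean(group_idx, a, 100)   # first call includes compilation
t1 = time.perf_counter()
group_sum_mean(group_idx, a, 100)   # later calls run the compiled code
t2 = time.perf_counter()
print(f"compile+run: {t1 - t0:.3f}s   run only: {t2 - t1:.6f}s")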

So, concluding, I guess the negatives outweigh the positives. I'd also like to settle any open questions concerning the API, aiming for a 1.0 version. Let's get this off the table!

@bscully27 commented

Do any of the aggregates allow multiple funcs to be passed in a single call?
Numpy, numpy_ufuncs and Numba didn't work for me.

Also, even if multiple reduction funcs were passed, would it even be faster than looping?
In other words, would

aggregate(... funcs=[A, B]  )

be any faster than 

aggregate(... funcs=A )
aggregate(... funcs=B )

Thanks.

@ml31415 (Owner) commented Apr 6, 2020

No, we had thought about implementing that, but in the end we decided to keep it as is, since for the majority of functions there would be no speed gain from this batching.
