
specify multiple aggregation functions #3

Closed
d1manson opened this issue Jun 4, 2015 · 3 comments

@d1manson (Collaborator) commented Jun 4, 2015

Discussed initially in this other issue.

Design of syntax

Suggested syntax for input:

# using tuple...
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs=('mean', 'std', 'min', 'max'))
# using list...
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs=['mean', 'std', 'min', 'max'])
# using comma/semicolon and/or white-space-delimited string
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs='mean std min max')

For output, you could argue that a dict, class or namedtuple would be a safer solution as the user is less likely to mix up the order. Actually, it seems that namedtuple is probably a pretty good solution because it will naturally unpack if the user wants it to, otherwise the user can treat it like a class/dict. Incidentally, the field_names arg for namedtuple supports basically the same set of input syntaxes described above for aggregate. I guess you would need to dynamically generate the namedtuple classes based on the requested funcs list, and then store the class definitions in a little cache - but that's easy enough to do.
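
For illustration, a minimal sketch of that namedtuple caching idea (the names here are hypothetical, not part of the actual API):

from collections import namedtuple

# hypothetical cache of dynamically generated namedtuple classes,
# keyed by the tuple of requested function names
_result_class_cache = {}

def _result_class(func_names):
    key = tuple(func_names)
    if key not in _result_class_cache:
        _result_class_cache[key] = namedtuple('AggregateResult', key)
    return _result_class_cache[key]

cls = _result_class(('mean', 'std', 'min', 'max'))
res = cls(mean=1.0, std=0.5, min=0.0, max=2.0)
res.mean                            # attribute access, like a class/dict
a_mean, a_std, a_min, a_max = res   # or natural tuple unpacking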

If the output is a namedtuple, the following is possible:

# direct unpacking...
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs='mean std min max')
# using as object...
a_agg = aggregate(group_idx, a, funcs='mean std min max')
plt.errorbar(a_agg.mean, yerr=a_agg.std) # or whatever

As previously discussed, the advantage of accepting multiple functions is that the aggregation machinery then has scope for a variety of optimizations (though it could do the simplest thing and loop over each function in turn, or simply raise NotImplementedError). Any alternative syntax (e.g. wrapping the aggregation into an object and then providing methods on that object in the way that pandas does) is likely to require some degree of explicit caching and a bunch of complex logic to deal with it, whereas this syntax should keep any logic reasonably straightforward and permit optimal use of memory.

Some suggestions on possible optimizations

In JIT-C implementations there is the option of squashing everything into a single pass over the array, which should offer very nice speed-ups. To get the most out of the cache it would probably make sense to arrange each of the multiple outputs for a given group contiguously in memory, e.g. for min max the output would be [min_0 max_0 min_1 max_1 .. min_n max_n]. Whether or not the output is produced in this manner will not actually be visible to the user, as they will only get views onto it, which can be provided with basic numpy indexing, e.g. return custom_tuple(min=combined[:, 0], max=combined[:, 1]). I think that will be safe, because there is no overlap between the views into the combined array, so the user could never accidentally overwrite one variable and unexpectedly affect another. One thing to note, though, is that different outputs may need different dtypes - it is still possible to achieve this with numpy ndarrays, but it's an extra implementation detail to get right.
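
As an illustration of the views idea (just a sketch, not actual implementation code):

import numpy as np

ngroups = 3
# hypothetical interleaved layout: one row per group, one column per
# function, so min_i and max_i for group i are adjacent in memory
combined = np.empty((ngroups, 2))
# ...a single JIT-compiled pass over the data would fill combined here...
mins = combined[:, 0]   # basic indexing returns a view, not a copy
maxs = combined[:, 1]
assert mins.base is combined and maxs.base is combined   # no data duplicated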

In the numpy implementation there are a few functions which go nicely together: sum, mean, var, and std all start with a sum, so that can be reused, etc. min and max can be run using argsort(a) combined with the first/last trick. any and all both need to work on a.astype(bool), and allnan and anynan both need to work on isnan(a).
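
For example, a sketch of that reuse (using bincount-based grouping for brevity, not the library's actual internals):

import numpy as np

def sum_mean_var_std(group_idx, a, ngroups):
    # the per-group sum and count are computed once...
    sums = np.bincount(group_idx, weights=a, minlength=ngroups)
    counts = np.bincount(group_idx, minlength=ngroups)
    # ...and reused by mean, var and std
    means = sums / counts
    sq_sums = np.bincount(group_idx, weights=a * a, minlength=ngroups)
    variances = sq_sums / counts - means ** 2   # E[x^2] - E[x]^2
    return sums, means, variances, np.sqrt(variances)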

Of course, if nan- versions are used, then that overhead only needs to be paid once for multiple functions. The same is true of multi_ravel_index, and potentially all bounds-checking needs to be done only once.
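
Again just as a sketch, the shared nan- preprocessing could be as simple as:

import numpy as np

# hypothetical helper: the NaN filtering is paid once, and every
# requested nan- variant then runs on the already-cleaned arrays
def strip_nans(group_idx, a):
    good = ~np.isnan(a)
    return group_idx[good], a[good]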

edit: and if custom functions are involved, it's obviously easier to loop over each of them for a given array rather than redoing all the sorting etc. for each function... although admittedly that has to be a pretty rare use case, and the custom function could have handled it internally anyway... though I guess this is relevant to the pure-python implementation, since everything there is done in this manner.

@ml31415 (Owner) commented Apr 23, 2018

I guess we should close this. Reasons:

In terms of syntax, we're not benefiting a lot. It would always be possible to also write:

a_mean, a_std, a_min, a_max = (aggregate(group_idx, a, f) for f in ('mean', 'std', 'min', 'max'))

Our call syntax for aggregate already has enough keyword options, and func can be a function, a string, and so on, making the call structure complicated enough that it's just on the edge of being intuitive - and I'd like to preserve that.

In terms of speed, it only makes sense to talk about the fast numba (and weave) options. If you're concerned about speed, you won't take the slower versions into consideration anyway. Let's also leave weave aside, as I'd consider it kind of deprecated anyway. So, talking about the numba implementation, what would have to change in order to actually do that:

  • Compile a new function for every function/datatype combination you throw in
  • For every single entry in group_idx / a, iterate over all given functions

Both things come with major speed concerns. While a plain sum, plain mean, etc. will already be compiled when run, a sum-mean combo or something even more complex usually won't. Having done some benchmarking lately, the compile time is orders of magnitude higher than running the function itself on a midsized array like in our benchmarks. Also, the loop over the given functions is a big speed hindrance, as every loop comes with its own overhead. Inlining might be possible - not sure to what extent it happens automatically - but in the worst case you'd have to create separate functions for 2, 3, or however many functions are given. And this would make the already not-that-simple code in the numba implementation even uglier.
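
To illustrate the compile-time point, a sketch with a hand-rolled fused kernel (not numpy-groupies code; timings will vary by machine):

import time
import numpy as np
from numba import njit

@njit
def group_sum_mean(group_idx, a, ngroups):
    # fused single-pass kernel for a sum+mean combo
    sums = np.zeros(ngroups)
    counts = np.zeros(ngroups)
    for i in range(len(a)):
        sums[group_idx[i]] += a[i]
        counts[group_idx[i]] += 1
    return sums, sums / counts

group_idx = np.random.randint(0, 100, 100_000)
a = np.random.rand(100_000)

t0 = time.perf_counter()
group_sum_mean(group_idx, a, 100)   # first call includes compilation
t1 = time.perf_counter()
group_sum_mean(group_idx, a, 100)   # later calls run the compiled code
t2 = time.perf_counter()
print(f"compile+run: {t1 - t0:.3f}s   run only: {t2 - t1:.6f}s")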

So, concluding, I guess the negatives outweigh the positives. I'd also like to settle any open questions concerning the API, aiming for a 1.0 version. Let's get this off the table!

@bscully27 commented

Do any of the aggregates allow multiple funcs to be passed in a single call?
Numpy, numpy_ufuncs and Numba didn't work for me.

Also, even if multiple reduction funcs were passed, would it even be faster than looping?
In other words, would

aggregate(... funcs=[A, B]  )

be any faster than 

aggregate(... funcs=A )
aggregate(... funcs=B )

Thanks.

@ml31415 (Owner) commented Apr 6, 2020

No, we had thought about implementing that, but in the end we decided to keep it as is, since for the majority of functions there would be no speed gain from this batching.
