
ufunc.at performance >10x too slow #5922

Closed
d1manson opened this issue May 27, 2015 · 19 comments
@d1manson

I have created a Matlab-like accumarray function which tries to squeeze as much performance from numpy as possible for a specific list of functions: sum, any, all, max, min, mean, etc.

The function is available as a gist here.

There is another accumarray implementation available on github here, which was originally created by @ml31415, but over the last couple of days has had many of my suggestions incorporated (I have also updated my own gist in line with my recent suggestions).

The original purpose of this other implementation was to use scipy.weave to get massively improved performance over what was assumed to be the best-possible raw-numpy version. However, it appears that most of the functions can be fairly heavily optimized without scipy.weave. [It's not really important here, but for reference, the remaining functions which are difficult to optimize are min, max, and prod.]

The main point, however, is that it should be simple to optimize by just using ufunc.at for the relevant ufunc (as this is exactly what ufunc.at is intended for), yet ufunc.at gets miserable performance: about 15x slower than scipy.weave and 10-25x slower than a bunch of carefully written alternative numpy algorithms (where such optimized algorithms exist). Surely ufunc.at could be improved?

Also, and on a separate note, would there be any interest in including accumarray itself in numpy?
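
For concreteness, here is a minimal sketch of the accumarray-style semantics under discussion (toy data, hypothetical variable names): accumulate vals into out at the positions given by idx, which is exactly the job ufunc.at is meant for.

import numpy as np

idx = np.array([0, 2, 2, 1, 0])             # group label for each value
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

out = np.zeros(3)
np.add.at(out, idx, vals)                   # accumulate: out[idx[k]] += vals[k]
# out is now array([6., 4., 5.])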


Here are some benchmarking stats produced by my function (testing and benchmarking code is included at the bottom of the gist). Note that the baseline times are obtained by sorting, splitting, and looping, using the named numpy function for each group, whereas the optimised functions do some kind of handcrafted vectorised operation in most cases, except max, min, and prod, which use ufunc.at (a sketch of both strategies follows the table). Note also that the actual observed speedup depends on a variety of properties of the input. Here we are using 100,000 indices uniformly picked from the interval [0, 1000). Specifically, about 25% of the values are 0 (for use with the bool tests); the remainder are uniformly distributed on [-50, 25).

std baseline 190.4 ms optimised 7.3 ms ... 26.2 x faster
all baseline 77.2 ms optimised 8.6 ms ... 9.0 x faster
min baseline 65.3 ms optimised 50.0 ms ... 1.3 x faster
max baseline 64.4 ms optimised 45.8 ms ... 1.4 x faster
sum baseline 64.6 ms optimised 2.4 ms ... 27.3 x faster
var baseline 173.4 ms optimised 7.7 ms ... 22.4 x faster
prod baseline 63.5 ms optimised 50.2 ms ... 1.3 x faster
any baseline 75.5 ms optimised 7.0 ms ... 10.8 x faster
mean baseline 100.2 ms optimised 3.7 ms ... 26.9 x faster
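
To make the baseline and optimised strategies concrete, here is a rough sketch for sum only, with synthetic data roughly matching the description above (the actual gist differs in details):

import numpy as np

np.random.seed(0)
idx = np.random.randint(0, 1000, 100000)
vals = np.random.rand(100000) * 75 - 50     # roughly uniform on [-50, 25)

# Baseline: sort by group, split into per-group arrays, loop with np.sum.
order = np.argsort(idx, kind='stable')
sidx, svals = idx[order], vals[order]
boundaries = np.flatnonzero(sidx[1:] != sidx[:-1]) + 1
baseline = np.array([g.sum() for g in np.split(svals, boundaries)])
# (baseline only covers the groups that actually occur in idx)

# Optimised: a single vectorised call; for sum, np.bincount does the job.
optimised = np.bincount(idx, weights=vals, minlength=1000)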
@mhvk
Contributor

mhvk commented May 27, 2015

Intrigued, I tried direct numpy examples, which confirm that np.bincount is almost 60 times faster than np.add.at. This seems surprising.

import numpy as np
test_idx = np.random.randint(0, 1000, 100000)
test_vals = np.random.rand(100000) * 100 - 50
res = np.zeros(1000)
%timeit np.add.at(res, test_idx, test_vals)
#--> 100 loops, best of 3: 13 ms per loop
%timeit res[:] += np.bincount(test_idx, weights=test_vals)
#--> 1000 loops, best of 3: 220 µs per loop

p.s. I confirmed both give identical results.

@ml31415

ml31415 commented May 27, 2015

+1 for fixing this. Would be great to have ufunc.at working with decent speed!

@jaimefrio
Member

It's what you pay for flexibility: np.bincount only does addition over an array of np.intp indices and another of np.double values, both 1D. ufunc.at works on multidimensional indices, for any type and almost any ufunc.

I looked into this some time ago, and ended up giving up: in all its generality, there's no easy way out of the current slowness of ufunc.at.

I can try to look into it once more, and report back with what, if anything, would be easy to speed-up through specialization, e.g. if all inputs are 1-D, then we may be able to streamline things...
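
To illustrate the flexibility gap with a sketch (not from the benchmarks above): ufunc.at accepts repeated indices, multidimensional targets and any ufunc, while np.bincount only sums 1-D weights over 1-D integer indices.

import numpy as np

# ufunc.at: any ufunc, repeated indices, multidimensional targets.
a = np.zeros((4, 5))
np.add.at(a, (np.array([0, 0, 3]), np.array([1, 1, 2])), 1.0)       # a[0, 1] ends up as 2.0

m = np.full(4, -np.inf)
np.maximum.at(m, np.array([0, 0, 2]), np.array([3.0, 7.0, 1.0]))    # group-wise max

# np.bincount: addition only, over 1-D integer indices and 1-D double weights.
sums = np.bincount(np.array([0, 0, 2]), weights=np.array([3.0, 7.0, 1.0]), minlength=4)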

@ml31415

ml31415 commented May 27, 2015

It may make sense to implement some specialized loops for common cases like mean, std etc. (if not done already). My C implementation is basically also just dispatching on the function's name and then choosing the corresponding inner loop. The generalization for different data types is done by scipy.weave there, but it should also be possible to precompile it for all relevant combinations.

@d1manson
Author

Surely it's easy to go from 1d to nd, because you can ravel and/or ravel_multi_index/reshape outside the loop (see the sketch below). I guess life becomes more complicated if you need to support broadcasting, but that's a pretty complicated usage case for this already well-hidden functionality (I hadn't even thought to try it, and I do use broadcasting a lot elsewhere).
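
A sketch of that flattening idea (hypothetical shapes), done once outside any inner loop so that a fast 1-D path could then be used:

import numpy as np

out = np.zeros((20, 30))
ii = np.random.randint(0, 20, 1000)
jj = np.random.randint(0, 30, 1000)
vals = np.random.rand(1000)

# Collapse the nd index to a flat index, accumulate in 1-D, reshape back.
flat = np.ravel_multi_index((ii, jj), out.shape)
out = np.bincount(flat, weights=vals, minlength=out.size).reshape(out.shape)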

As to the question of types, would it not be worth upgrading the types if it makes the computation go a lot faster? Then maybe in the long term, aim to reduce the extent of redundant upgrading.

The reality of implementing this may be messier than I imagine, but the potential benefits are clearly pretty high.

@jaimefrio
Member

I revisited this last night, here are my findings...

The way ufunc.at(op1, idx, op2) works is more or less:

  • It creates a PyArrayMapIterObject with op1 and idx. This lets you iterate over a data pointer into op1 in the order that the indexing in idx dictates.
  • It creates a PyArrayIterObject over op2, and broadcasts it to the shape of the other iterator.
  • We can now iterate simultaneously over op1 and op2, so all that would be left is to call the ufunc on each point until both iterators are exhausted. Unfortunately it isn't that easy, because to call the ufunc you have to have your data in an aligned memory location, with a non-swapped type, and cast to the right type (e.g. there is no inner loop to add double and int; it requires first casting the int to double).
  • The way ufuncs normally handle this is by filling a contiguous, aligned buffer with information from both operands, cast to the right type, operating on the buffers, then copying from the buffer into the output array. Iteration over the buffer is just a for loop, so it is really fast to run the ufunc over the whole buffer, and most of the time is typically spent in the buffer loading and unloading.
  • ufunc.at cannot use a large buffer: if an output entry comes up twice in the buffer, the operations will not be accumulated onto a single value, which would give wrong results. So it instead uses single-item buffers. And because all the casting and buffering operations are messy to implement, it reuses the functionality already present in the iterator. But to use it, it has to create yet another NpyIter object, set to iterate over the single entries pointed to by the two other iterators. And this iterator has to be reset for every point.
  • If you create a separate branch, checking that both operands are aligned, not byteswapped, and of the right dtype, so that none of the buffering or casting is needed, things run about 6x faster:
import numpy as np
a = np.zeros(1000)
idx = np.random.randint(1000, size=(10000))
b = np.random.rand(10000)
%timeit np.add.at(a, idx, b)
%timeit np.add.at(a, idx, b)
1000 loops, best of 3: 968 µs per loop  # <- current master
10000 loops, best of 3: 157 µs per loop  # <- new method
  • This is better, but not great either, and it doesn't kick in for some very simple cases, just because all of a sudden casting is required:
%timeit np.add.at(a, idx, 1)
1000 loops, best of 3: 1.11 ms per loop
%timeit np.add.at(a, idx, 1.0)
10000 loops, best of 3: 158 µs per loop

So in summary, we could maybe squeeze out a 5x performance improvement for some of the more typical use cases by being smart about making copies of full arrays to avoid one-at-a-time casting into buffers. I have a few other ideas I will not go into right now, that could perhaps squeeze out an additional 2x. The point is that the resulting function is still going to be some 10x slower than what one would expect from using np.bincount.

If anyone wants to try it out, this is the branch I have been playing around with:

https://github.com/jaimefrio/numpy/tree/ufunc_at_fast

Thoughts are welcome, but I am not 100% sure that the improvements justify the extra complexity. They probably do, but...

@seberg
Member

seberg commented May 28, 2015

Sounds good; I knew it was slow due to the buffering always being used. The other optimization would be to not use the MapIter functions interface, but to do the iteration at a low level (and possibly specialized).
However, I think it is silly to expect it to be as fast as bincount. Bincount is pretty specialized, and I think even limited to contiguous arrays.

EDIT: There is one-at-a-time casting, but it is done in a very inefficient way and would likely be relatively easy to speed up.

@jaimefrio
Member

Yes, one possibility that is very easy to specialize is the "all three items are 1-D arrays" case. Although even without any of the iterator overhead, we are still going to have two function calls per item (the one to innerloop, which in turn calls the one coming from innerloopdata), instead of one for every buffer plus one for every item in the buffer, so some slowdown is unavoidable.

Related to this, the other possible optimization I had in mind, if the arrays are aligned, is handing 2 items at a time to the innerloop function, by passing it the first data pointer, and strides calculated to send it to the next one.

What do you have in mind for the casting speed-up?

@seberg
Member

seberg commented May 28, 2015

The casting is currently done using an NpyIter-based hack; I think you could write it using the dtype casting functions, which should be much faster. I think I would have to look again at the new mapiter structure to say more. In the case of a non-empty subspace (i.e. slicing involved), the casting can possibly be moved into the mapiter part (if you do not use the public API).

I am not sure how you look at that "all three items are 1-D" optimization. I think one method would be to look at the 3 or so branches that MapIter uses and basically do all of those branches as well.

@d1manson
Author

Update: @ml31415 and I have created a package called numpy-groupies, which consists of a bunch of tools for doing group-indexing operations, including aggregation. The package is available on PyPI with pip install numpy-groupies.
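
A minimal usage sketch (assuming the package's aggregate entry point; see its README for the full interface):

import numpy as np
import numpy_groupies as npg

idx = np.random.randint(0, 1000, 100000)
vals = np.random.rand(100000) * 75 - 50

# Group-wise aggregation; func can be 'sum', 'min', 'max', 'mean', 'std', ...
sums = npg.aggregate(idx, vals, func='sum', size=1000)
means = npg.aggregate(idx, vals, func='mean', size=1000)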

The full list of benchmarks now looks like this:

function  pure-py  np-grouploop   np-ufuncat np-optimised    pandas        ratio
     std  1737.8ms       171.8ms     no-impl       7.0ms    no-impl   247.1: 24.4:  -  : 1.0 :  -  
     all  1280.8ms        62.2ms      41.8ms       6.6ms    550.7ms   193.5: 9.4 : 6.3 : 1.0 : 83.2
     min  1358.7ms        59.6ms      42.6ms      42.7ms     24.5ms    55.4: 2.4 : 1.7 : 1.7 : 1.0 
     max  1538.3ms        55.9ms      38.8ms      37.5ms     18.8ms    81.9: 3.0 : 2.1 : 2.0 : 1.0 
     sum  1532.8ms        62.6ms      40.6ms       1.9ms     20.4ms   808.5: 33.0: 21.4: 1.0 : 10.7
     var  1756.8ms       146.2ms     no-impl       6.3ms    no-impl   279.1: 23.2:  -  : 1.0 :  -  
    prod  1448.8ms        55.2ms      39.9ms      38.7ms     20.2ms    71.7: 2.7 : 2.0 : 1.9 : 1.0 
     any  1399.5ms        69.1ms      41.1ms       5.7ms    558.8ms   246.2: 12.2: 7.2 : 1.0 : 98.3
    mean  1321.3ms        88.3ms     no-impl       4.0ms     20.9ms   327.6: 21.9:  -  : 1.0 : 5.2 
Python 2.7.9, Numpy 1.9.2, Win7 Core i7.

Here we are using 100,000 indices uniformly picked from [0, 1000). Specifically, about 25% of the values are 0 (for use with bool operations); the remainder are uniformly distributed on [-50, 25).

  • pure-py - uses nothing but pure python, relying partly on itertools.groupby.
  • np-grouploop - uses numpy to sort values based on idx, then uses split to create separate arrays, and then loops over these arrays, running the relevant numpy function for each array.
  • np-ufuncat - uses the ufunc.at method under discussion here.
  • np-optimised - uses custom numpy indexing/other tricks to beat the above two implementations (except for min, max, and prod, which rely on ufunc.at).
  • pandas - does something like: pd.DataFrame({'idx': idx, 'vals': vals}).groupby('idx').sum().

Note that some of the no-impl entries may be unwarranted, but I haven't bothered to get them working yet. Also, note I haven't included the weave implementation here, which I believe is the fastest in almost all cases.

kevin-keraudren pushed a commit to kevin-keraudren/randomferns-python that referenced this issue Feb 25, 2016
…tween `a[indices] += 1` and `np.add.at(a,indices,1)`

and the trick of using `np.bincount`, which is faster than `ufunc.at`.

See: ml31415/numpy-groupies#1 and numpy/numpy#5922
@ml31415

ml31415 commented Mar 30, 2017

@jaimefrio any news on this? I guess your 5x improvements would be well worth integrating!

@seberg
Member

seberg commented Mar 30, 2017

@ml31415 I don't think doing this is impossibly complicated, but doing it does require quite a bit of time and spare brain cycles, which unfortunately a lot of us are notoriously short on.... And then there are often other things higher on our (personal) priority lists.

@nschloe
Contributor

nschloe commented Jun 25, 2018

Just for reference, fastfunc (a small pybind11 project of mine) speeds things up by a factor of about 40.

(benchmark plot)

@mattip
Member

mattip commented Jun 25, 2018

Only if things are limited to int64 data with no overflow or divide-by-0 handling and no broadcasting. It is small and clearly written, though. Note that pybind11 seems to incur a penalty for len(array) < 10 or so compared to NumPy's direct use of C extensions.

@nschloe
Contributor

nschloe commented Jul 15, 2019

Another hint: if you use numpy.add.at, an alternative that is about 8 times faster is numpy.bincount with its optional weights argument:

import perfplot
import numpy as np

np.random.seed(0)


def np_add_at(a, i):
    out0 = np.zeros(1000)
    np.add.at(out0, i, a)
    return out0


def np_bincount(a, i):
    return np.bincount(i, weights=a, minlength=1000)


b = perfplot.bench(
    setup=lambda n: (np.random.rand(n), np.random.randint(0, 1000, n)),
    kernels=[np_add_at, np_bincount],
    n_range=[2**k for k in range(24)],
    title=f"NumPy {np.__version__}",
)

b.save("out.png")
b.show()

(benchmark plot: out.png)

@sradc

sradc commented Feb 4, 2021

Anecdotally, I've found it ~6x faster to loop manually rather than use add.at (not a completely elementwise loop).

@mattip
Member

mattip commented Feb 8, 2023

Closing: #23136 was merged. Please rerun the groupies benchmarks.

@nschloe
Contributor

nschloe commented Feb 8, 2023

Redid the benchmark from above with the latest main:

(updated benchmark plot)

@seberg
Member

seberg commented Feb 8, 2023

Wow, nice! That we have a decent amount of additional overhead should be expected, but very cool that it levels off at the same place. (Although probably not too surprising, since this should be heavily memory bound at large sizes)
