Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: allow numpy.apply_along_axis() to work with ndarray subclasses #7918

Merged
merged 5 commits into from
Oct 11, 2016

Conversation

bennyrowland
Copy link
Contributor

This commit modifies the numpy.apply_along_axis() function so that if
it is called with an ndarray subclass, the internal func1d calls
receive subclass instances and the overall function returns an instance
of the subclass. There are two new tests for these two behaviours.

This commit modifies the numpy.apply_along_axis() function so that if
it is called with an ndarray subclass, the internal func1d calls
receive subclass instances and the overall function returns an instance
of the subclass. There are two new tests for these two behaviours.
@mhvk
Copy link
Contributor

mhvk commented Aug 8, 2016

I like anything that makes numpy more subclass aware, but worry this will not quite work at least for the ndarray subclass I am responsible for, astropy's Quantity. Specifically, you assume here that the output array should be similar to the input array, but this may not necessarily be true. E.g., for a quantity with units of m, my func1d could be calculating a variance, which would result in an output quantity with units of m^2. Or func1d might be a counting function, in which case it would return not a quantity but an integer; or a checking function that returns a bool array, etc.

I think one could circumvent these problems by not running arr.__array_wrap__ at the very end, but rather run res.__array_wrap__ just after creating the output array (see in-line comment). To me, it seems this would make sense for the non-scalar case only (i.e., where func1d returns an ndarray subclass); for the scalar case, I think the output should be a plain ndarray, i.e., no wrapping should be done.

@@ -107,7 +107,7 @@ def apply_along_axis(func1d, axis, arr, *args, **kwargs):
res = func1d(arr[tuple(i.tolist())], *args, **kwargs)
outarr[tuple(ind)] = res
k += 1
return outarr
return arr.__array_wrap__(outarr)
else:
Ntot = product(outshape)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, for this branch, I would suggest to run

res.__array_wrap__(outarr)

just after creating outarr.

@mhvk
Copy link
Contributor

mhvk commented Aug 8, 2016

After looking at the test cases: what is your own usage where you would want the subclass to be preserved?

Sticking with numpy's own ndarray subclasses, I think a case with a MaskedArray might be more logical, where, e.g., one does want to keep track of a result that was masked.

@bennyrowland
Copy link
Contributor Author

My use case is this: I have an ndarray subclass storing sets of temporal MR data which has additional metadata properties such as the time between data points, spectrometer frequency etc. For example I might have a 3D array which contains temporal data on an x-y grid.
First use case: I have a function which calculates the frequency of an oscillation in the data in Hz, which requires the sampling frequency stored in the subclass. By using asanyarray() on line 77 I ensure that func1d gets passed a single temporal dataset with its metadata attached. In this case it returns a scalar and I think you are absolutely correct that it would be more appropriate for the 2D grid to come in a bare ndarray.
Second use case: I want to apply some processing to each temporal dataset, e.g. a windowing function. In this case the function takes in the subclass slice and returns an identically sized new instance of the same subclass, containing the same metadata. Obviously I want the overall 3D return grid to also have the same values included in it. After reading your comments above I completely agree that it makes more sense to use the return array from the func1d call to wrap the output, rather than the input array, and I will make this change to the code.

In light of your suggestions I agree that the matrix test case is not necessarily an ideal choice, it was motivated by a desire to use existing functions rather than add my own additional ones, but that caused it to become rather far from my original use case. I will change this to something closer to what I wanted.
In the second test there is no check that the return value preserves the subclass, the check is that the func1d calls get the subclass to operate on.

Modified pull request so that result of func1d call gets to call
__array_wrap__ instead of the input array. Modified tests to work
with this
@bennyrowland
Copy link
Contributor Author

I have also run into the same issue with array scalars coming from np.mean etc. I guess the correct solution in this case is to set the shape of the outarr to be 1 in that dimension if the shape of res == (). This can then be short-circuited if desired at the subclass level in array_wrap by checking if the array is in fact an array scalar and returning the scalar instead, which allows functions like np.mean to be used to get the scalars if desired.

Modified pull request to handle case where calling func1d generates
scalar arrays
@mhvk
Copy link
Contributor

mhvk commented Aug 8, 2016

@bennyrowland - OK, sounds like we agree on what it should do! I also agree that setting the dimension to 1 for the purposes of filling the array makes sense, but I'm not sure about keeping that dimension. Better would be to have a keepdims keyword which decides whether or not that dimension gets kept. Since this presumably is not something func1d would ever need, it may be possible to add that -- you'd just get it out in the code using keepdims = kwargs.pop('keepdims', False).

But maybe best to let the real numpy developers judge that (I'm guessing this may become sufficiently "new feature" that an e-mail to the mailing list would be in order; @ahaldane? @charris?)

I did manage to come up with what I think is a decent test case using masked arrays (sadly, simple things like np.sum are simply going to fail because of the way those return either numpy scalars or the masked singleton; no fault of the code here, though!):

def my_func1d(a):
    return a.dot(a)

m = np.ma.MaskedArray([[1, 2], [3, 4]], mask=[[True, False], [True, True]])
np.apply_along_axis(my_func1d, 0, m)
# Should give [4, --]
# Note that m[0].dot(m[0]) yields a masked array scalar unlike things like m[0].sum()

@bennyrowland
Copy link
Contributor Author

@mhvk Actually in my current version I decided to squeeze that dimension if a scalar array is returned. Certainly could add a keepdims keyword argument though. If we are going to start adding arguments to apply_along_axis, one thing I thought might be interesting was optionally passing the index of the current slice to func1d, so that each call knows where it is in the larger grid. This could be used to index into other arguments passed through from the apply_along_axis call, for example to give each axis custom data. An example use case for this might be: if I have a 2D array containing a set of temporal data and I want to filter a different frequency from each dataset.

def filter_dataset(data, frequency_list, index):
    return data.filter(frequency_list[index])
result = np.apply_along_axis(filter_dataset, 1, input, pass_index=True)

I did send a notification email to numpy-discussion to invite comment on this PR, was that not the right mailing list?

@@ -112,8 +112,9 @@ def apply_along_axis(func1d, axis, arr, *args, **kwargs):
Ntot = product(outshape)
holdshape = outshape
outshape = list(arr.shape)
outshape[axis] = len(res)
outshape[axis] = len(res) if res.shape != () else 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this could just be outshape[axis] = res.size.

@mhvk
Copy link
Contributor

mhvk commented Aug 8, 2016

@bennyrowland - I had two very small additional comments, where the most important might be whether or not to assume that func1d returns something array-like. It seems some of the code does not assume so, and the easiest may simply be to force it using res = np.asanyarray(res).
I certainly think it would be OK that way myself. But it might be good to have someone who has looked at this code before have a look, so cc @charris, @rkern (git blame suggests it goes back to Travis...).

p.s. Of course, with this PR you do beg the question of why not also do apply_over_axes -- I think there all you need is a change from asarray to asanyarray 😈

Modified pull request to cast result of func1d into ndarray (or subclass)
and ensure that it has the necessary size, dtype and __array_wrap properties
@bennyrowland
Copy link
Contributor Author

@mhvk - some further excellent suggestions I have acted on, as you say we will have to wait for an opinion from @charris or @rkern. I did take a look at apply_over_axes but I don't really get the use case so I left it alone. If this PR gets approved then I think the next step is a new function apply_along_axes which behaves like apply_along_axis but can extract multi-dimensional slices from the parent ndarray.

@bennyrowland
Copy link
Contributor Author

@charris or @rkern, would still like to get this merged. Please let me know if there is anything else I need to do first.

@shoyer
Copy link
Member

shoyer commented Oct 7, 2016

This definitely needs tests before it's ready to merge.

@bennyrowland
Copy link
Contributor Author

@shoyer, thanks for commenting. There are already three new tests in this PR, what further tests would you like to see included?

def test_preserve_subclass(self):
def double(row):
return row * 2
m = np.matrix(np.arange(4).reshape((2, 2)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor style thing -- I think this test would be clearer as just np.matrix([[0, 1], [2, 3]]), without the reshape.

result = apply_along_axis(double, 0, m)
assert isinstance(result, np.matrix)
assert_array_equal(
result, np.matrix([0, 2, 4, 6]).reshape((2, 2))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. I find reshape on np.matrix especially non-intuitive (and I could only guess at whether passing 1D input to np.matrix does).

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple (very) minor suggestions for the tests, but otherwise looks good to me.

@shoyer
Copy link
Member

shoyer commented Oct 11, 2016

@bennyrowland oops, somehow I missed those :)

Modified pull request to make stylistic changes to tests suggested by
@shoyer
@shoyer shoyer merged commit 84b11f5 into numpy:master Oct 11, 2016
@shoyer
Copy link
Member

shoyer commented Oct 11, 2016

OK, in it goes. Thanks!

@mortonjt mortonjt mentioned this pull request Dec 22, 2016
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants