Deprecate PyArrayObject* direct field access #116

Merged
merged 25 commits into from Jul 26, 2011

Conversation

Projects
None yet
3 participants
@mwiebe
Member

mwiebe commented Jul 19, 2011

This pull request deprecates direct access to PyArrayObject fields. This direct access has been discouraged for a while through comments in the header file and documentation, but up till now, there was no way to disable it. I've created such a mechanism, and C extensions can test that they don't use deprecated C API by #defining NPY_NO_DEPRECATED_API at the top of the C file.

I've confirmed that scipy master builds against this branch, and its test failures look unrelated to these changes (iterative methods failures). Additional testing of different versions and platforms would be appreciated!

This also includes a few other miscellaneous changes:

mwiebe and others added some commits Jul 11, 2011

Mark Wiebe
BUG: core: PyArray_GetArrayParamsFromObject was treating __array_inte…
…rface__ incorrectly

I'm not sure why this came up after the arrayobject field access
changes, since it looks like the bug was in there before already.
Mark Wiebe
BUG: mmap: Make the memmap subclass rely on the Python mmap's destruc…
…tor being correct

There was some fishy code flushing and closing the _mmap property,
which looked as if it was paranoid in not trusting mmap.mmap to
behave properly. This was done in a way which assumed the .base
attribute isn't collapsed, something which I've changed in the refactoring.

The easiest way to fix this is to trust mmap.mmap - if this is incorrect,
the specific platform on which this fails should have been already
documented in the comments!
Mark Wiebe
ENH: core: Add access macros back in conditionally, for backwards com…
…patibilities

Sticking with inline functions when NPY_NO_DEPRECATED_API is defined.
may be set to True to add a mask to an array which does not have one.
-Second is 'arr.flags.ownnamask', which is True if the array owns the
+Second is 'arr.flags.ownmaskna', which is True if the array owns the
memory to the mask, and False if the array has no mask, or has a view
into the mask of another array. If this is set to False in a masked

This comment has been minimized.

@charris

charris Jul 20, 2011

Member

It might be simpler if copying the mask could be specified when making the view, something like a 'copymask' keyword.

@charris

charris Jul 20, 2011

Member

It might be simpler if copying the mask could be specified when making the view, something like a 'copymask' keyword.

This comment has been minimized.

@charris

charris Jul 20, 2011

Member

Nevermind, I see this is done below.

@charris

charris Jul 20, 2011

Member

Nevermind, I see this is done below.

numpy/core/_internal.py
- return newlist
-
-format_re = re.compile(asbytes(r'(?P<order1>[<>|=]?)(?P<repeats> *[(]?[ ,0-9]*[)]? *)(?P<order2>[<>|=]?)(?P<dtype>[A-Za-z0-9.]*)'))
+format_re = re.compile(asbytes(r'(?P<order1>[<>|=]?)(?P<repeats> *[(]?[ ,0-9]*[)]? *)(?P<order2>[<>|=]?)(?P<dtype>[A-Za-z0-9.]*(?:\[[a-zA-Z0-9,.]+\])?)'))

This comment has been minimized.

@charris

charris Jul 20, 2011

Member

Might be nice to split up the string, 154 char is a bit long.

@charris

charris Jul 20, 2011

Member

Might be nice to split up the string, 154 char is a bit long.

This comment has been minimized.

@mwiebe

mwiebe Jul 20, 2011

Member

Sure

numpy/core/code_generators/numpy_api.py
@@ -318,6 +318,7 @@ multiarray_funcs_api = {
# End 1.6 API
'PyArray_MaskedCopyInto': 281,
'PyArray_MaskedMoveInto': 282,
+ 'PyArray_SetBase': 283,

This comment has been minimized.

@charris

charris Jul 20, 2011

Member

Does this need to be in the API?

@charris

charris Jul 20, 2011

Member

Does this need to be in the API?

This comment has been minimized.

@mwiebe

mwiebe Jul 20, 2011

Member

To support full separation of the ABI from the memory layout of PyArrayObject *, all of its accessors will have to move into the API. This one seemed like a no-brainer to start, as it's doing more than just setting the field.

@mwiebe

mwiebe Jul 20, 2011

Member

To support full separation of the ABI from the memory layout of PyArrayObject *, all of its accessors will have to move into the API. This one seemed like a no-brainer to start, as it's doing more than just setting the field.

- PyArray_FLAGS(obj) &= ~NPY_ARRAY_UPDATEIFCOPY; \
- } \
- Py_XDECREF(obj)
+static NPY_INLINE void

This comment has been minimized.

@charris

charris Jul 20, 2011

Member

The use of NPY_INLINE in a public include probably means that extensions need to use the same compiler that numpy was compiled with. Hmm, wonder if that affects cross compiling?

@charris

charris Jul 20, 2011

Member

The use of NPY_INLINE in a public include probably means that extensions need to use the same compiler that numpy was compiled with. Hmm, wonder if that affects cross compiling?

This comment has been minimized.

@mwiebe

mwiebe Jul 20, 2011

Member

The neighborhood iterator was already doing this, so there is at least some precedence.

@mwiebe

mwiebe Jul 20, 2011

Member

The neighborhood iterator was already doing this, so there is at least some precedence.

+ * the Python struct HEAD.
+ */
+#ifdef NPY_NO_DEPRECATED_API
+typedef struct tagPyArrayObject {

This comment has been minimized.

@charris

charris Jul 20, 2011

Member

I believe tags are in their own namespace, so the tag prefix isn't really needed. Adding '_t' to the typedef is more commonly done.

@charris

charris Jul 20, 2011

Member

I believe tags are in their own namespace, so the tag prefix isn't really needed. Adding '_t' to the typedef is more commonly done.

This comment has been minimized.

@mwiebe

mwiebe Jul 20, 2011

Member

I haven't noticed any consistent convention in NumPy with regards to this. I agree that typedefs like npy_int32 should be npy_int32_t as you suggest, but there are lots of things to change to make it consistent. This is the kind of thing that would go in a NumPy coding standards document somewhere...

@mwiebe

mwiebe Jul 20, 2011

Member

I haven't noticed any consistent convention in NumPy with regards to this. I agree that typedefs like npy_int32 should be npy_int32_t as you suggest, but there are lots of things to change to make it consistent. This is the kind of thing that would go in a NumPy coding standards document somewhere...

- (it)->dataptr += (it)->strides[0] - \
- (it)->backstrides[1]; \
- } \
+#define _PyArray_ITER_NEXT2(it) { \

This comment has been minimized.

@charris

charris Jul 20, 2011

Member

Now I contradict myself and wonder if these should be inline functions. Having a typecheck on the 'it' variable could be helpful.

@charris

charris Jul 20, 2011

Member

Now I contradict myself and wonder if these should be inline functions. Having a typecheck on the 'it' variable could be helpful.

This comment has been minimized.

@mwiebe

mwiebe Jul 20, 2011

Member

It's unfortunate that all the NumPy array creation functions return PyObject* instead of PyArrayObject*.

@mwiebe

mwiebe Jul 20, 2011

Member

It's unfortunate that all the NumPy array creation functions return PyObject* instead of PyArrayObject*.

doc/neps/missing-data.rst
produce arrays with values [1.0, 2.0, <inaccessible>, 7.0] /
-mask [Unmasked, Unmasked, Masked, Unmasked], and
+mask [Exposed, Exposed, Exposed, Hidden], and

This comment has been minimized.

@WeatherGod

WeatherGod Jul 20, 2011

Contributor

I think you mean "[Exposed, Exposed, Hidden, Exposed]"

@WeatherGod

WeatherGod Jul 20, 2011

Contributor

I think you mean "[Exposed, Exposed, Hidden, Exposed]"

This comment has been minimized.

@mwiebe

mwiebe Jul 20, 2011

Member

Indeed

@mwiebe

mwiebe Jul 20, 2011

Member

Indeed

doc/neps/missing-data.rst
produce arrays with values [1.0, 2.0, <inaccessible>, 7.0] /
-mask [Unmasked, Unmasked, Masked, Unmasked], and
+mask [Exposed, Exposed, Exposed, Hidden], and
values [1.0, 2.0, <NA bitpattern>, 7.0] respectively.

This comment has been minimized.

@WeatherGod

WeatherGod Jul 20, 2011

Contributor

When you say "respectively", I would expect the same number of outputs as there are examples, but you only have two here. Confusion could occur if someone thought that the mask line was a separate output example, somehow.

Plus, I am a little concerned about having different print outputs. It might be ok, but I am not sure it is the best idea. On the other hand, it does allow for future mixing of NA-bitpatterns and masks...

@WeatherGod

WeatherGod Jul 20, 2011

Contributor

When you say "respectively", I would expect the same number of outputs as there are examples, but you only have two here. Confusion could occur if someone thought that the mask line was a separate output example, somehow.

Plus, I am a little concerned about having different print outputs. It might be ok, but I am not sure it is the best idea. On the other hand, it does allow for future mixing of NA-bitpatterns and masks...

This comment has been minimized.

@mwiebe

mwiebe Jul 20, 2011

Member

Yes, the respectively got out of kilter with multiple edits. The explanation here was to show what's under the hood. I think the reprs should be different, otherwise you wouldn't be able to tell the difference, and eval-ing the repr would produce a different object.

@mwiebe

mwiebe Jul 20, 2011

Member

Yes, the respectively got out of kilter with multiple edits. The explanation here was to show what's under the hood. I think the reprs should be different, otherwise you wouldn't be able to tell the difference, and eval-ing the repr would produce a different object.

doc/neps/missing-data.rst
@@ -461,7 +471,7 @@ will also use the unmasked value counts for their calculations if
Some examples::
- >>> a = np.array([1., 3., np.NA, 7.], namasked=True)
+ >>> a = np.array([1., 3., np.NA, 7.], maskna=True)
>>> np.sum(a)
array(NA, dtype='<f8', masked=True)

This comment has been minimized.

@WeatherGod

WeatherGod Jul 20, 2011

Contributor

"masked=True" -> "maskna=True"

@WeatherGod

WeatherGod Jul 20, 2011

Contributor

"masked=True" -> "maskna=True"

This comment has been minimized.

@mwiebe

mwiebe Jul 20, 2011

Member

changed

@mwiebe

mwiebe Jul 20, 2011

Member

changed

doc/neps/missing-data.rst
False
- >>> np.any(np.array([False, NA, False], namasked=True))
+ >>> np.any(np.array([False, NA, False], maskna=True))

This comment has been minimized.

@WeatherGod

WeatherGod Jul 20, 2011

Contributor

Nit-picking here, but shouldn't it be "np.NA", not "NA"?

@WeatherGod

WeatherGod Jul 20, 2011

Contributor

Nit-picking here, but shouldn't it be "np.NA", not "NA"?

This comment has been minimized.

@mwiebe

mwiebe Jul 20, 2011

Member

changed

@mwiebe

mwiebe Jul 20, 2011

Member

changed

@WeatherGod

This comment has been minimized.

Show comment
Hide comment
@WeatherGod

WeatherGod Jul 20, 2011

Contributor

The NEP might benefit from some consideration of ufuncs applied to 2-D arrays along an axis. I think masked arrays right now have some behaviors in those situations that might differ from what I might logically extrapolate from the 1-D examples.

Contributor

WeatherGod commented Jul 20, 2011

The NEP might benefit from some consideration of ufuncs applied to 2-D arrays along an axis. I think masked arrays right now have some behaviors in those situations that might differ from what I might logically extrapolate from the 1-D examples.

doc/neps/missing-data.rst
True
- >>> np.all(np.array([True, NA, True], namasked=True))
+ >>> np.all(np.array([True, NA, True], maskna=True))

This comment has been minimized.

@WeatherGod

WeatherGod Jul 20, 2011

Contributor

Just thinking out loud here... would it make sense to have skipna=True available for np.all?

@WeatherGod

WeatherGod Jul 20, 2011

Contributor

Just thinking out loud here... would it make sense to have skipna=True available for np.all?

This comment has been minimized.

@mwiebe

mwiebe Jul 20, 2011

Member

Makes sense

@mwiebe

mwiebe Jul 20, 2011

Member

Makes sense

@mwiebe

This comment has been minimized.

Show comment
Hide comment
@mwiebe

mwiebe Jul 20, 2011

Member

Do you have an example case you're thinking for a ufunc along an axis of a multidimensional array which would be surprising? I guess I thought this part was pretty obvious, but I can make some examples.

Member

mwiebe commented Jul 20, 2011

Do you have an example case you're thinking for a ufunc along an axis of a multidimensional array which would be surprising? I guess I thought this part was pretty obvious, but I can make some examples.

doc/neps/missing-data.rst
>>> a + b
- array([NA(0), 5, NA(2)], namasked='multi')
+ array([NA(0), 5, NA(2)], maskna='multi')

This comment has been minimized.

@WeatherGod

WeatherGod Jul 20, 2011

Contributor

Just to clarify, the implication here is not to mix masked and bit-patterns? Consider the following:

...

a = np.array([np.NA, 2, 5], maskna=True)
b = np.array([1, np.NA, 7], dtype='NA')
a + b
...
What should be the result? Now, let's say that -- somehow -- resulting array retains the distinction between the masked-NA and the bit-pattern NA. And the user wants to skip masked-NA values, but not skip bitpattern-NA values. I don't see how that is going to be possible in the current framework. Note, I am not saying that it should be possible, I am just raising a possible future issue with maintaining the distinction between bitpattern-NA and masked NA.

@WeatherGod

WeatherGod Jul 20, 2011

Contributor

Just to clarify, the implication here is not to mix masked and bit-patterns? Consider the following:

...

a = np.array([np.NA, 2, 5], maskna=True)
b = np.array([1, np.NA, 7], dtype='NA')
a + b
...
What should be the result? Now, let's say that -- somehow -- resulting array retains the distinction between the masked-NA and the bit-pattern NA. And the user wants to skip masked-NA values, but not skip bitpattern-NA values. I don't see how that is going to be possible in the current framework. Note, I am not saying that it should be possible, I am just raising a possible future issue with maintaining the distinction between bitpattern-NA and masked NA.

This comment has been minimized.

@mwiebe

mwiebe Jul 20, 2011

Member

That's correct, it isn't about mixing masks and bitpatterns, its about having multiple distinct NAs in both masks and bitpatterns. The NEP makes no distinction between bitpattern-NA and masked NA with regard to computations, however the multi-NA idea would allow a user to distinguish between them by using different multi-NA payloads.

@mwiebe

mwiebe Jul 20, 2011

Member

That's correct, it isn't about mixing masks and bitpatterns, its about having multiple distinct NAs in both masks and bitpatterns. The NEP makes no distinction between bitpattern-NA and masked NA with regard to computations, however the multi-NA idea would allow a user to distinguish between them by using different multi-NA payloads.

@WeatherGod

This comment has been minimized.

Show comment
Hide comment
@WeatherGod

WeatherGod Jul 20, 2011

Contributor

Consider the following:

>>> import numpy as np
>>> a = np.random.random((3, 2))
>>> b = np.ma.masked_array(a, mask=[[False, True], [True, True], [False, False]])
>>> b
masked_array(data =
 [[0.110804969841 --]
 [-- --]
 [0.955128477746 0.440430735546]],
             mask =
 [[False  True]
 [ True  True]
 [False False]],
       fill_value = 1e+20)
>>> b.mean(axis=0)
masked_array(data = [0.532966723794 0.440430735546],
             mask = [False False],
       fill_value = 1e+20)

>>> b.mean(axis=1)
masked_array(data = [0.110804969841 -- 0.697779606646],
             mask = [False  True False],
       fill_value = 1e+20)

Note that for masked arrays, they also do not return 0.0 for completely masked rows:

>>> b[1].sum()
masked

I am not saying that this is the way it should be, but if there is going to be a difference, then this should be made very clear for former np.ma users.

Contributor

WeatherGod commented Jul 20, 2011

Consider the following:

>>> import numpy as np
>>> a = np.random.random((3, 2))
>>> b = np.ma.masked_array(a, mask=[[False, True], [True, True], [False, False]])
>>> b
masked_array(data =
 [[0.110804969841 --]
 [-- --]
 [0.955128477746 0.440430735546]],
             mask =
 [[False  True]
 [ True  True]
 [False False]],
       fill_value = 1e+20)
>>> b.mean(axis=0)
masked_array(data = [0.532966723794 0.440430735546],
             mask = [False False],
       fill_value = 1e+20)

>>> b.mean(axis=1)
masked_array(data = [0.110804969841 -- 0.697779606646],
             mask = [False  True False],
       fill_value = 1e+20)

Note that for masked arrays, they also do not return 0.0 for completely masked rows:

>>> b[1].sum()
masked

I am not saying that this is the way it should be, but if there is going to be a difference, then this should be made very clear for former np.ma users.

@mwiebe

This comment has been minimized.

Show comment
Hide comment
@mwiebe

mwiebe Jul 20, 2011

Member

I've added a section 'Differences with numpy.ma' which uses your example to show how things behave differently. Thanks!

Member

mwiebe commented Jul 20, 2011

I've added a section 'Differences with numpy.ma' which uses your example to show how things behave differently. Thanks!

Mark Wiebe
ENH: dtype: Make handling of struct dtype align= flag more rigorous (…
…also fixes ticket #1912)

This adds an 'aligned'= property to the format dict, which gets put
in the str() representation when necessary to preserve it.
@mwiebe

This comment has been minimized.

Show comment
Hide comment
@mwiebe

mwiebe Jul 22, 2011

Member

I've added another commit which makes the 'align=' struct dtype flag stickiness better, and fixes ticket #1912. Are there any more changes you think are necessary before this gets merged?

Member

mwiebe commented Jul 22, 2011

I've added another commit which makes the 'align=' struct dtype flag stickiness better, and fixes ticket #1912. Are there any more changes you think are necessary before this gets merged?

@charris

This comment has been minimized.

Show comment
Hide comment
@charris

charris Jul 22, 2011

Member

I haven't had time to finish looking through it, but I suspect I won't have anything substantive to add. However, I've been thinking the name PyArray_SetBase is too generic. What is the Base and what does setting it do? So maybe another name and some explanation of the intended use of the function would help.

Member

charris commented Jul 22, 2011

I haven't had time to finish looking through it, but I suspect I won't have anything substantive to add. However, I've been thinking the name PyArray_SetBase is too generic. What is the Base and what does setting it do? So maybe another name and some explanation of the intended use of the function would help.

@mwiebe

This comment has been minimized.

Show comment
Hide comment
@mwiebe

mwiebe Jul 22, 2011

Member

Sure, the name PyArray_SetBase is what it is because it matches with PyArray_BASE.

I think the possible confusion is due to poor naming conventions in the NumPy API. First, Py* should be reserved for CPython, not NumPy. I think a transition to Npy* should be done when the API has matured some more. Second, PyArray_* has all the PyArrayObject-specific functions and random other functions mixed in together.

The overloading of the BASE property together with the UPDATEIFCOPY flag also seems unfortunate, as it significantly changes the meaning of that property depending on the flag.

Member

mwiebe commented Jul 22, 2011

Sure, the name PyArray_SetBase is what it is because it matches with PyArray_BASE.

I think the possible confusion is due to poor naming conventions in the NumPy API. First, Py* should be reserved for CPython, not NumPy. I think a transition to Npy* should be done when the API has matured some more. Second, PyArray_* has all the PyArrayObject-specific functions and random other functions mixed in together.

The overloading of the BASE property together with the UPDATEIFCOPY flag also seems unfortunate, as it significantly changes the meaning of that property depending on the flag.

@charris

This comment has been minimized.

Show comment
Hide comment
@charris

charris Jul 22, 2011

Member

I'm guessing that PyArray_SetBase sets a pointer to the array from which a view is taken, so maybe something like PyArray_SetBaseArray or PyArray_SetRootArray, or some such. I really think all arrays should be views of some separate memory/buffer object, but that is another issue.

Can you expand on the interplay between the BASE and UPDATECOPY flags? I don't think it matters for this commit, but I would like to learn something ;)

Member

charris commented Jul 22, 2011

I'm guessing that PyArray_SetBase sets a pointer to the array from which a view is taken, so maybe something like PyArray_SetBaseArray or PyArray_SetRootArray, or some such. I really think all arrays should be views of some separate memory/buffer object, but that is another issue.

Can you expand on the interplay between the BASE and UPDATECOPY flags? I don't think it matters for this commit, but I would like to learn something ;)

@mwiebe

This comment has been minimized.

Show comment
Hide comment
@mwiebe

mwiebe Jul 22, 2011

Member

Yes, except the base can be any object that owns the memory, not just an array. PyArray_SetBaseObject? I agree with you about the separate memory/buffer object.

When the UPDATEIFCOPY flag of arr is set, arr's base is then the original array from which a copy is made. Arr's base is set readonly as well. When arr is freed, it copies its data into its base and sets arr to writeable again. Quite different from arr's base owning the memory...

Member

mwiebe commented Jul 22, 2011

Yes, except the base can be any object that owns the memory, not just an array. PyArray_SetBaseObject? I agree with you about the separate memory/buffer object.

When the UPDATEIFCOPY flag of arr is set, arr's base is then the original array from which a copy is made. Arr's base is set readonly as well. When arr is freed, it copies its data into its base and sets arr to writeable again. Quite different from arr's base owning the memory...

@charris

This comment has been minimized.

Show comment
Hide comment
@charris

charris Jul 22, 2011

Member

PyArray_SetBaseObject sounds good, or maybe PyArray_SetBufferObject or PyArray_SetMemoryObject.

The UPDATEIFCOPY sounds like two paradigms getting mixed together.

Member

charris commented Jul 22, 2011

PyArray_SetBaseObject sounds good, or maybe PyArray_SetBufferObject or PyArray_SetMemoryObject.

The UPDATEIFCOPY sounds like two paradigms getting mixed together.

@mwiebe

This comment has been minimized.

Show comment
Hide comment
@mwiebe

mwiebe Jul 22, 2011

Member

I'll switch it to PyArray_SetBaseObject. Changing it to a name other than Base would need a bigger cascading set of changes.

Member

mwiebe commented Jul 22, 2011

I'll switch it to PyArray_SetBaseObject. Changing it to a name other than Base would need a bigger cascading set of changes.

@mwiebe

This comment has been minimized.

Show comment
Hide comment
@mwiebe

mwiebe Jul 22, 2011

Member

I've changed the name to PyArray_SetBaseObject, and added some documentation about it.

Member

mwiebe commented Jul 22, 2011

I've changed the name to PyArray_SetBaseObject, and added some documentation about it.

@mwiebe

This comment has been minimized.

Show comment
Hide comment
@mwiebe

mwiebe Jul 26, 2011

Member

Chuck, did you have anything else you'd like changed before merging this?

Member

mwiebe commented Jul 26, 2011

Chuck, did you have anything else you'd like changed before merging this?

@charris

This comment has been minimized.

Show comment
Hide comment
@charris

charris Jul 26, 2011

Member

No, go for it. Might clean up the trailing whitespace though ;)

Member

charris commented Jul 26, 2011

No, go for it. Might clean up the trailing whitespace though ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment