Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issues with zero-width string fields #6430

Merged
merged 5 commits into from
Jun 2, 2016
Merged

Conversation

embray
Copy link
Contributor

@embray embray commented Oct 8, 2015

There is a behavior that probably dates far back, that Numpy tries really hard not to let you make arrays with zero-width strings as the dtype. It normally will promote them to a width of at least one (or more, depending on the context). That can be a little surprising, but we probably don't want to change it.

However, (perhaps as an oversight) it is possible to make a structured array that has zero-width strings for one of its fields. But when viewing such a field the dtype of the view is converted to 'S1' (or 'U1' for unicode), leading to data corruption, overflows, etc. So this PR ensures that arrays of zero-width strings can be created at least in the special case of viewing a field of a structured array.

This also lifts the restrictions in the dtype constructor that prevented creation of structured dtypes with zero-width fields.

Fixes #2196, #473, #4955, #2585, and #1901 .

(I'll add some regression tests in a bit)

@njsmith
Copy link
Member

njsmith commented Oct 8, 2015

Do you have a sense of how difficult it would be to simply enable
zero-width strings in general? This approach seems like a reasonable
strategy (haven't looked at the patch yet), but it would be nice to have a
fair comparison to the alternative, since the alternative has fewer special
cases and, who knows, maybe it's easier than we think.
On Oct 8, 2015 3:06 PM, "Erik Bray" notifications@github.com wrote:

There is a behavior that probably dates far back, that Numpy tries really
hard not to let you make arrays with zero-width strings as the dtype. It
normally will promote them to a width of at least one (or more, depending
on the context). That can be a little surprising, but we probably don't
want to change it.

However, (perhaps as an oversight) it is possible to make a structured
array that has zero-width strings for one of its fields. But when viewing
such a field the dtype of the view is converted to 'S1' (or 'U1' for
unicode), leading to data corruption, overflows, etc. So this PR ensures
that arrays of zero-width strings can be created at least in the
special case of viewing a field of a structured array.

This also lifts the restrictions in the dtype constructor that prevented
creation of structured dtypes with zero-width fields.

Fixes #2196 #2196, #473
#473, #4955
#4955, #2585
#2585, and #1901
#1901 .

(I'll add some regression tests in a bit)

You can view, comment on, or merge this pull request online at:

#6430
Commit Summary

  • PyArray_NewFromDescr normally does not allow zero-width string
    dtypes (or rather, it automatically converts them to 1-width strings). This
    affects any code that uses PyArray_NewFromDescr, which is a lot. So we
    extend PyArray_NewFromDescr_int to allow disabling this functionality in a
    couple narrow cases where it's appropriate--one is when extracting a field
    from a structured array that has a zero-width string dtype (which,
    intentionally or not, has been allowed). The other, which is related, is
    returning a view of an array that has a zero-width string dtype. This
    shouldn't otherwise break or change any existing behavior.
  • Remove roadblocks to creating structured dtypes with zero-width
    fields using dict-based constructors (this was possible by other means such
    as the list-based constructor--with the previous fix in particular it
    should not be necessary to block this anymore).

File Changes

Patch Links:


Reply to this email directly or view it on GitHub
#6430.

@embray
Copy link
Contributor Author

embray commented Oct 8, 2015

It wouldn't be difficult technically, but it would be difficult for preserving backwards compat.

For example currently if one does:

>>> np.array(['abc', 'def'], dtype=str)

this is equivalent (on Python 3) to

>>> np.array(['abc', 'def'], dtype='U0')

but this will automatically set the dtype to fit the length of the largest string in the list. There's no obvious way to distinguish this "auto-string" feature from explicitly requesting a dtype of zero width. Perhaps one exception would be if all the strings in the input array are zero width (currently this results in 'U1' where one would expect 'U0').

@embray
Copy link
Contributor Author

embray commented Oct 8, 2015

In either case, I think the behavior should be at least documented somewhere, if it isn't already.

@njsmith
Copy link
Member

njsmith commented Oct 8, 2015

OK, so right now "U0" means "whatever width fits the data, or else width 1".
.
I see how it would be difficult to make "U0" mean "always width 0". Would
it be difficult to make it mean "whatever width fits the data, or else
width 0"? And would that be simpler all around, e.g. would it let us avoid
special casing struct dtypes?
On Oct 8, 2015 3:18 PM, "Erik Bray" notifications@github.com wrote:

In either case, I think the behavior should be at least documented
somewhere, if it isn't already.


Reply to this email directly or view it on GitHub
#6430 (comment).

@embray
Copy link
Contributor Author

embray commented Oct 8, 2015

No, it's like, a 1 line change or so.

However, it would impact anything that uses PyArray_NewFromDescr. I'll try making the change really quick and just see what breaks, if anything, in the test suite.

@njsmith
Copy link
Member

njsmith commented Oct 8, 2015

Thanks!
On Oct 8, 2015 3:31 PM, "Erik Bray" notifications@github.com wrote:

No, it's like, a 1 line change or so.

However, it would impact anything that uses PyArray_NewFromDescr. I'll
try making the change really quick and just see what breaks, if anything,
in the test suite.


Reply to this email directly or view it on GitHub
#6430 (comment).

@njsmith
Copy link
Member

njsmith commented Oct 8, 2015

(In the long run I think the ideal would be something like: dtype=str and
dtype="U0" are different, with the former autodetecting and erroring out if
used in a context that doesn't do autodetection, and the latter not doing
any autodetection at all. But that's a bigger piece of surgery.)
On Oct 8, 2015 3:35 PM, "Nathaniel Smith" njs@pobox.com wrote:

Thanks!
On Oct 8, 2015 3:31 PM, "Erik Bray" notifications@github.com wrote:

No, it's like, a 1 line change or so.

However, it would impact anything that uses PyArray_NewFromDescr. I'll
try making the change really quick and just see what breaks, if anything,
in the test suite.


Reply to this email directly or view it on GitHub
#6430 (comment).

@embray
Copy link
Contributor Author

embray commented Oct 8, 2015

So basically allowing zero-width strings in general is just a matter of deleting these lines:

PyArray_DESCR_REPLACE(descr);
if (descr == NULL) {
return NULL;
}
if (descr->type_num == NPY_STRING) {
sd = descr->elsize = 1;
}
else {
sd = descr->elsize = sizeof(npy_ucs4);
}

it doesn't break any of the existing tests that I can tell. It would change the existing behavior of several functions--but it's behavior that it would seem wrong for anyone to be relying on.

In principle I think we could even delete the entire outer if statement that belongs to, and not prevent empty void types either. It doesn't seem to be a problem.

@embray
Copy link
Contributor Author

embray commented Oct 8, 2015

I agree that might be ideal. But currently it would be too big a change for almost no benefit. I don't think anyone really cares about making arrays of zero-width strings in general. If they really wanted to, with the change I suggested above they can with np.zeros(N, dtype=str).

@njsmith
Copy link
Member

njsmith commented Oct 8, 2015

Can you send a note to the list describing the issue and giving an example
of some of the functions that would change behavior? We definitely have to
be careful with changing things like this, but let's find out if anyone
knows any reason to object :-)
On Oct 8, 2015 3:41 PM, "Erik Bray" notifications@github.com wrote:

So basically allowing zero-width strings in general is just a matter of
deleting these lines:

PyArray_DESCR_REPLACE(descr);
if (descr == NULL) {
return NULL;
}
if (descr->type_num == NPY_STRING) {
sd = descr->elsize = 1;
}
else {
sd = descr->elsize = sizeof(npy_ucs4);
}

it doesn't break any of the existing tests that I can tell. It would
change the existing behavior of several functions--but it's behavior that
it would seem wrong for anyone to be relying on.

In principle I think we could even delete the entire outer if statement
that belongs to, and not prevent empty void types either. It doesn't seem
to be a problem.


Reply to this email directly or view it on GitHub
#6430 (comment).

@embray
Copy link
Contributor Author

embray commented Oct 9, 2015

Sure.

@ahaldane
Copy link
Member

I ran py2 unit tests for scipy and pandas after deleting the lines you pointed to. FWIW I don't see any relevant test failures, though probably we are more worried about random user scripts.

Also, it looks like numpy.io might not be set up to handle 0-sized arrays:

>>> np.save('tmp', np.zeros(3, dtype='S'))
ZeroDivisionError: integer division or modulo by zero

though I haven't investigated that much. (Also, in any case probably older numpy will not be able to load zero-sized arrays saved in newer numpy with these changes).

I like the idea of always allowing zero-width strings. I haven't thought of any clear cases we are breaking someone's code, but maybe that's just for lack of imagination!

@embray
Copy link
Contributor Author

embray commented Oct 12, 2015

Interesting--thanks for pointing out the above case. Should probably be fixed. I can add the fix for that to this PR.

I agree I have a hard time thinking of anything it would break. Who knows...

@ahaldane
Copy link
Member

Another one to think about (I'm not sure it's a problem):

>>> np.ones(3, dtype='S')

Old result:

array(['1', '1', '1'], 
  dtype='|S1')

New result:

array(['', '', ''], 
  dtype='|S0')

Note that ones(3, dtype='S1') gives the old behavior.

@embray
Copy link
Contributor Author

embray commented Oct 12, 2015

Oof, that's terrible :) Especially considering

>>> np.zeros(3, dtype='S')
array(['', '', ''], dtype='|S1')

and not

array(['0', '0', '0'], dtype='|S1)

@embray
Copy link
Contributor Author

embray commented Oct 13, 2015

Well, didn't seem to be a lot of strong feelings about this on the mailing list, so I'll give it a few more days.

In the meantime, how about this: For the sake of getting this in soon as a bug fix maybe merge this PR as is (once I add tests). I'll then open a separate issue for making the change to support zero-width strings in general. (Yes, that would mean going from a more complicated change to a simpler change, but it would still probably have the least impact overall, and only changes internal functions).

@homu
Copy link
Contributor

homu commented Oct 18, 2015

☔ The latest upstream changes (presumably #6208) made this pull request unmergeable. Please resolve the merge conflicts.

@embray
Copy link
Contributor Author

embray commented Oct 19, 2015

Will rebase this shortly and add the requisite tests.

It turns out this affects me more than I previously thought: I'm now finding myself in a position where I need to be able to create dtypes with arbitrary offsets for the fields, and supporting zero-width fields. As far as I've been able to find this is currently impossible but I may be missing some subtle workaround...?

If I can't find a workaround I'm going to have to write my own custom dtype constructor in C to use in my code, for now :(

@ahaldane
Copy link
Member

I would guess the problem is this line. If you fix the dtype conversion behavior in this PR, I am pretty sure that line can be removed, presumably solving your problem.

@embray
Copy link
Contributor Author

embray commented Oct 19, 2015

This PR does fix that line. Doesn't help me for older Numpy versions though.

@embray
Copy link
Contributor Author

embray commented Oct 21, 2015

This is strange--since I rebased on master my fix has completely stopped working.

Nevermind, it was just due to #6208 after all. At first it wasn't obvious to me what changed there that would impact this but now it's clear.

@embray
Copy link
Contributor Author

embray commented Oct 21, 2015

Okay, I was able to mostly fix this against #6208, but the test case from #1901 is still failing. This is because, for some reason, assigning to a zero-width string array from a non-zero string array is still resulting in data being copied into the destination array. I'm going to double check, but I'm pretty sure that this did not happen before. Not clear why that would have changed though. Any ideas, @ahaldane ?

Nevermind--I double checked and that case was not fixed previously.

@embray
Copy link
Contributor Author

embray commented Oct 21, 2015

This is probably about ready to go, I think.

It would be nice to have this as a bug fix, so my suggestion would be to use this version for now, but maybe open an issue to later outright remove the behavior of converting empty-string dtypes to 1 character string dtypes for a later version.

Let me know if there's anything I should do in terms of rebasing / squashing commits.

@ahaldane
Copy link
Member

The code looks good to me. I would just like to try to think briefly of any special cases we might have missed.

For instance, saving the zero-sized arrays with np.save doesn't work with the error from above. I'm ok leaving that broken for now though (who would want to do it?).

Here are a few more cases I thought about:

>>> a = np.zeros(3, dtype='S0,u1')['f0']  # easiest way to get a 0-sized array?
>>> a.view('S0')
TypeError: data-type must not be 0-sized

That seems fine to leave alone.

>>> np.empty(3, dtype='S0,S0')
TypeError: Empty data-type

same.

>>> a = np.zeros(4, dtype=[('a', 'S0,S0'), ('b', 'u1')])
>>> a['a']
TypeError: Empty data-type

That's more dubious.

>>> a = np.zeros(4, dtype='S0,u1')['f0']
>>> a.reshape((2,2))
array([['', ''],
       ['', '']], dtype='|S1')

This also seems a bit dubious.

Any thoughts on those?

Edit: The last one definitely needs to be fixed (segfault risk). I think you might fix the middle two as well, as they just need a change in logic in the region of PyArray_NewFromDescr_int you've already modified.

@embray
Copy link
Contributor Author

embray commented Oct 22, 2015

I'm ok leaving that broken for now though (who would want to do it?)

Probably nobody would want to do it explicitly, though it could crash code that isn't expecting it. I'll see if I can fix that one.

The view and empty cases I deliberately left alone as being the current behavior. Your third example is a bit strange, but makes sense if you consider that a dtype of "S0,S0" is packed into a "|V0". So it takes a different code path (and one that I'm leaving alone for now in this PR). But I agree that it's dubious and could be changed.

The reshape example is interesting. That should be fixed too.

@charris charris added this to the 1.10.2 release milestone Oct 29, 2015
@ahaldane
Copy link
Member

ahaldane commented Jun 2, 2016

Oh right, I forgot about that.

This PR ended up being a bug-fix/enhancement rather than changed behavior, but you're right it should be noted in case of problems. I think in summary we allowed 'S0' types in np.ndarray and no longer (incorrectly) promote string lengths in structured array fields.

Ping @embray for first dibs on making the PR for that. The note goes in doc/release/1.12.0-notes.rst

@embray
Copy link
Contributor Author

embray commented Jun 3, 2016

Sorry; didn't know what Numpy's procedure is for adding to release notes. I can do so--as @ahaldane wrote the most visible change is in ndarray.__new__. Most users wouldn't notice, and it's doubtful that any users depended on the HIGHLY CONFUSING AND WRONG (sorry for the all-caps) previous behavior. But definitely worth noting.

@embray embray deleted the issue-473 branch June 7, 2016 06:11
@embray
Copy link
Contributor Author

embray commented Jun 7, 2016

So what sections should I put thins in the release notes? The ndarray.__new__ change should probably go in "Compatibility notes" I don't really think it's much of an enhancement but it is worth noting. The rest are just bug fixes--I don't know if you list those exhaustively in your release notes. From looking at past ones it looks like maybe not (I typically do for most of my projects which is why I'm not sure--fine either way though).

@ahaldane
Copy link
Member

ahaldane commented Jun 7, 2016

I don't think we're too picky about exactly which section it goes. Since these are all related changes I might to keep them together in a single note though.

I kind of like putting it in "Improvements", since I don't think we're breaking any code that previously worked (so it's not a compatibility issue), but we are fixing a bunch of cases.

embray added a commit to embray/numpy that referenced this pull request Jun 13, 2016
charris added a commit that referenced this pull request Jun 13, 2016
DOC: Mention the changes of #6430 in the release notes.
@juliantaylor
Copy link
Contributor

what is the use case of an 'S0' array?
right now we just end up dumping uninitialized memory when such an array is used depending on whether the code uses the constructor that allows empty strings or not, which is more or less random (e.g. subscript and flatten/ravel do not allow it)

@eric-wieser
Copy link
Member

Can you give a code example that dumps uninitialized memory?

@juliantaylor
Copy link
Contributor

juliantaylor commented Apr 20, 2017

import numpy as np
dt = np.dtype([('I', int), ('S', 'S0')])
x = np.zeros(4, dtype=dt)
xx = x['S'].reshape((2, 2))
print(repr(xx))

here it happens due to the ravel in repr, but tostring() as used in the testsuite also triggers it

@eric-wieser
Copy link
Member

eric-wieser commented Apr 20, 2017

depending on whether the code uses the constructor that allows empty strings or not,

It seems to me that we should always use that constructor, and that the one that does not should not even exist

@juliantaylor
Copy link
Contributor

I am wondering why we allow it at all, I cannot think of a practical use case for this, it is an array of zeros that cannot be modified.
If it was an oversight to allow it in structured types I'd rather fix that instead of trying to audit every single code path that can create arrays.

@MSeifert04
Copy link
Contributor

I wonder if this is because of astropys fits module. However that's just an educated guess.

@eric-wieser
Copy link
Member

eric-wieser commented Apr 20, 2017

it is an array of zeros that cannot be modified.

That doesn't sound right - it's an empty array with no contents that occupies 0 bytes (all of which are modifiable, but that's kinda meaningless).

I think there's a consistency argument that says we should allow it, but I do struggle to construct an actual use case. Perhaps so that a function can use np.dtype([('val', int), ('comment', string, n)]), and there's no memory overhead when n == 0 (rather than special casing and constructing a dtype without that field when n == 0)

@embray
Copy link
Contributor Author

embray commented Apr 21, 2017

I am wondering why we allow it at all, I cannot think of a practical use case for this, it is an array of zeros that cannot be modified.

For one there is a consistency argument to be made, but irrespective of any use case known to you, you could just point out bugs--perhaps in a separate issue where it would be relevant--without superfluous challenges to the legitimacy of some use case.

If it was an oversight to allow it in structured types

In fact this is specifically the use case where for me it was most important. As @MSeifert04 suggests, the FITS format allows columns in binary tables that zero width. Why? I'm not sure. I've explored some possible reasons for this in the past but I don't remember. Nevertheless they have and do occur in the wild. It's very difficult to represent such a table as a structured dtype without supporting zero-width fields in the structure because it means lots of extra book-keeping to map fields in a structured array with logical fields in the FITS format. It's much easier and more consistent to allow fields that are just empty arrays.

In fact, as pointed out, this was already supported in the form of |V0, but the treatment of S0 was not consistent with this, and in fact buggy and inconsistent in its own right (some code paths explicitly disallowed it, others allowed it but silently converted S0 to S1 resulting in other weird errors).

I'd rather fix that instead of trying to audit every single code path that can create arrays.

In fact I (tried) to do exactly that. And it's not exactly a difficult case to support--in fact in a way it takes less effort and doing so in fact makes for more correct code in any many cases, rather than brushing it under the rug (as the conversion from S0 to S1 in some cases was doing). The only problem is that historically more code in Numpy has taken the brushing under the rug approach, so it's still not hard to miss a few cases here and there.

@embray
Copy link
Contributor Author

embray commented Apr 21, 2017

Perhaps so that a function can use np.dtype([('val', int), ('comment', string, n)]), and there's no memory overhead when n == 0 (rather than special casing and constructing a dtype without that field when n == 0)

I've had cases like this as well come up too. You may want to have the same data structure throughout even if n = 0 without gobs of special-casing.

@juliantaylor
Copy link
Contributor

hm it should be fixable by not attempting to change the string description when creating a view.

@eric-wieser
Copy link
Member

@njsmith:

(In the long run I think the ideal would be something like: dtype=str and
dtype="U0" are different, with the former autodetecting and erroring out if
used in a context that doesn't do autodetection, and the latter not doing
any autodetection at all. But that's a bigger piece of surgery.

Surgery complete at #8970

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants