Fix issues with zero-width string fields #6430

embray · 2015-10-08T22:05:59Z

There is a behavior that probably dates far back, that Numpy tries really hard not to let you make arrays with zero-width strings as the dtype. It normally will promote them to a width of at least one (or more, depending on the context). That can be a little surprising, but we probably don't want to change it.

However, (perhaps as an oversight) it is possible to make a structured array that has zero-width strings for one of its fields. But when viewing such a field the dtype of the view is converted to 'S1' (or 'U1' for unicode), leading to data corruption, overflows, etc. So this PR ensures that arrays of zero-width strings can be created at least in the special case of viewing a field of a structured array.

This also lifts the restrictions in the dtype constructor that prevented creation of structured dtypes with zero-width fields.

Fixes #2196, #473, #4955, #2585, and #1901 .

(I'll add some regression tests in a bit)

njsmith · 2015-10-08T22:09:44Z

Do you have a sense of how difficult it would be to simply enable
zero-width strings in general? This approach seems like a reasonable
strategy (haven't looked at the patch yet), but it would be nice to have a
fair comparison to the alternative, since the alternative has fewer special
cases and, who knows, maybe it's easier than we think.
On Oct 8, 2015 3:06 PM, "Erik Bray" notifications@github.com wrote:

There is a behavior that probably dates far back, that Numpy tries really
hard not to let you make arrays with zero-width strings as the dtype. It
normally will promote them to a width of at least one (or more, depending
on the context). That can be a little surprising, but we probably don't
want to change it.

However, (perhaps as an oversight) it is possible to make a structured
array that has zero-width strings for one of its fields. But when viewing
such a field the dtype of the view is converted to 'S1' (or 'U1' for
unicode), leading to data corruption, overflows, etc. So this PR ensures
that arrays of zero-width strings can be created at least in the
special case of viewing a field of a structured array.

This also lifts the restrictions in the dtype constructor that prevented
creation of structured dtypes with zero-width fields.

Fixes #2196 #2196, #473
#473, #4955
#4955, #2585
#2585, and #1901
#1901 .

(I'll add some regression tests in a bit)

You can view, comment on, or merge this pull request online at:

#6430
Commit Summary

PyArray_NewFromDescr normally does not allow zero-width string
dtypes (or rather, it automatically converts them to 1-width strings). This
affects any code that uses PyArray_NewFromDescr, which is a lot. So we
extend PyArray_NewFromDescr_int to allow disabling this functionality in a
couple narrow cases where it's appropriate--one is when extracting a field
from a structured array that has a zero-width string dtype (which,
intentionally or not, has been allowed). The other, which is related, is
returning a view of an array that has a zero-width string dtype. This
shouldn't otherwise break or change any existing behavior.

Remove roadblocks to creating structured dtypes with zero-width
fields using dict-based constructors (this was possible by other means such
as the list-based constructor--with the previous fix in particular it
should not be necessary to block this anymore).

File Changes

M numpy/core/_internal.py
https://github.com/numpy/numpy/pull/6430/files#diff-0 (4)

M numpy/core/records.py
https://github.com/numpy/numpy/pull/6430/files#diff-1 (4)

M numpy/core/src/multiarray/convert.c
https://github.com/numpy/numpy/pull/6430/files#diff-2 (5)

M numpy/core/src/multiarray/ctors.c
https://github.com/numpy/numpy/pull/6430/files#diff-3 (33)

M numpy/core/src/multiarray/ctors.h
https://github.com/numpy/numpy/pull/6430/files#diff-4 (6)

M numpy/core/src/multiarray/descriptor.c
https://github.com/numpy/numpy/pull/6430/files#diff-5 (2)

M numpy/core/src/multiarray/methods.c
https://github.com/numpy/numpy/pull/6430/files#diff-6 (14)

Patch Links:

https://github.com/numpy/numpy/pull/6430.patch

https://github.com/numpy/numpy/pull/6430.diff

—
Reply to this email directly or view it on GitHub
#6430.

embray · 2015-10-08T22:17:49Z

It wouldn't be difficult technically, but it would be difficult for preserving backwards compat.

For example currently if one does:

>>> np.array(['abc', 'def'], dtype=str)

this is equivalent (on Python 3) to

>>> np.array(['abc', 'def'], dtype='U0')

but this will automatically set the dtype to fit the length of the largest string in the list. There's no obvious way to distinguish this "auto-string" feature from explicitly requesting a dtype of zero width. Perhaps one exception would be if all the strings in the input array are zero width (currently this results in 'U1' where one would expect 'U0').

embray · 2015-10-08T22:18:09Z

In either case, I think the behavior should be at least documented somewhere, if it isn't already.

njsmith · 2015-10-08T22:25:50Z

OK, so right now "U0" means "whatever width fits the data, or else width 1".
.
I see how it would be difficult to make "U0" mean "always width 0". Would
it be difficult to make it mean "whatever width fits the data, or else
width 0"? And would that be simpler all around, e.g. would it let us avoid
special casing struct dtypes?
On Oct 8, 2015 3:18 PM, "Erik Bray" notifications@github.com wrote:

In either case, I think the behavior should be at least documented
somewhere, if it isn't already.

—
Reply to this email directly or view it on GitHub
#6430 (comment).

embray · 2015-10-08T22:31:05Z

No, it's like, a 1 line change or so.

However, it would impact anything that uses PyArray_NewFromDescr. I'll try making the change really quick and just see what breaks, if anything, in the test suite.

njsmith · 2015-10-08T22:35:35Z

Thanks!
On Oct 8, 2015 3:31 PM, "Erik Bray" notifications@github.com wrote:

No, it's like, a 1 line change or so.

However, it would impact anything that uses PyArray_NewFromDescr. I'll
try making the change really quick and just see what breaks, if anything,
in the test suite.

—
Reply to this email directly or view it on GitHub
#6430 (comment).

njsmith · 2015-10-08T22:38:35Z

(In the long run I think the ideal would be something like: dtype=str and
dtype="U0" are different, with the former autodetecting and erroring out if
used in a context that doesn't do autodetection, and the latter not doing
any autodetection at all. But that's a bigger piece of surgery.)
On Oct 8, 2015 3:35 PM, "Nathaniel Smith" njs@pobox.com wrote:

Thanks!
On Oct 8, 2015 3:31 PM, "Erik Bray" notifications@github.com wrote:

No, it's like, a 1 line change or so.

However, it would impact anything that uses PyArray_NewFromDescr. I'll
try making the change really quick and just see what breaks, if anything,
in the test suite.

—
Reply to this email directly or view it on GitHub
#6430 (comment).

embray · 2015-10-08T22:41:18Z

So basically allowing zero-width strings in general is just a matter of deleting these lines:

numpy/numpy/core/src/multiarray/ctors.c

Lines 946 to 955 in 6cbd724

    
           PyArray_DESCR_REPLACE(descr); 
        
           if (descr == NULL) { 
        
               return NULL; 
        
           } 
        
           if (descr->type_num == NPY_STRING) { 
        
               sd = descr->elsize = 1; 
        
           } 
        
           else { 
        
               sd = descr->elsize = sizeof(npy_ucs4); 
        
           }

it doesn't break any of the existing tests that I can tell. It would change the existing behavior of several functions--but it's behavior that it would seem wrong for anyone to be relying on.

In principle I think we could even delete the entire outer if statement that belongs to, and not prevent empty void types either. It doesn't seem to be a problem.

embray · 2015-10-08T22:43:48Z

I agree that might be ideal. But currently it would be too big a change for almost no benefit. I don't think anyone really cares about making arrays of zero-width strings in general. If they really wanted to, with the change I suggested above they can with np.zeros(N, dtype=str).

njsmith · 2015-10-08T22:54:56Z

Can you send a note to the list describing the issue and giving an example
of some of the functions that would change behavior? We definitely have to
be careful with changing things like this, but let's find out if anyone
knows any reason to object :-)
On Oct 8, 2015 3:41 PM, "Erik Bray" notifications@github.com wrote:

So basically allowing zero-width strings in general is just a matter of
deleting these lines:

numpy/numpy/core/src/multiarray/ctors.c

Lines 946 to 955 in 6cbd724

PyArray_DESCR_REPLACE(descr);

if (descr == NULL) {

return NULL;

}

if (descr->type_num == NPY_STRING) {

sd = descr->elsize = 1;

}

else {

sd = descr->elsize = sizeof(npy_ucs4);

}

it doesn't break any of the existing tests that I can tell. It would
change the existing behavior of several functions--but it's behavior that
it would seem wrong for anyone to be relying on.

In principle I think we could even delete the entire outer if statement
that belongs to, and not prevent empty void types either. It doesn't seem
to be a problem.

—
Reply to this email directly or view it on GitHub
#6430 (comment).

embray · 2015-10-09T13:58:15Z

Sure.

ahaldane · 2015-10-12T18:51:20Z

I ran py2 unit tests for scipy and pandas after deleting the lines you pointed to. FWIW I don't see any relevant test failures, though probably we are more worried about random user scripts.

Also, it looks like numpy.io might not be set up to handle 0-sized arrays:

>>> np.save('tmp', np.zeros(3, dtype='S'))
ZeroDivisionError: integer division or modulo by zero

though I haven't investigated that much. (Also, in any case probably older numpy will not be able to load zero-sized arrays saved in newer numpy with these changes).

I like the idea of always allowing zero-width strings. I haven't thought of any clear cases we are breaking someone's code, but maybe that's just for lack of imagination!

embray · 2015-10-12T19:16:56Z

Interesting--thanks for pointing out the above case. Should probably be fixed. I can add the fix for that to this PR.

I agree I have a hard time thinking of anything it would break. Who knows...

ahaldane · 2015-10-12T19:53:14Z

Another one to think about (I'm not sure it's a problem):

>>> np.ones(3, dtype='S')

Old result:

array(['1', '1', '1'], 
  dtype='|S1')

New result:

array(['', '', ''], 
  dtype='|S0')

Note that ones(3, dtype='S1') gives the old behavior.

embray · 2015-10-12T20:04:39Z

Oof, that's terrible :) Especially considering

>>> np.zeros(3, dtype='S')
array(['', '', ''], dtype='|S1')

and not

array(['0', '0', '0'], dtype='|S1)

embray · 2015-10-13T14:51:01Z

Well, didn't seem to be a lot of strong feelings about this on the mailing list, so I'll give it a few more days.

In the meantime, how about this: For the sake of getting this in soon as a bug fix maybe merge this PR as is (once I add tests). I'll then open a separate issue for making the change to support zero-width strings in general. (Yes, that would mean going from a more complicated change to a simpler change, but it would still probably have the least impact overall, and only changes internal functions).

homu · 2015-10-18T20:15:41Z

☔ The latest upstream changes (presumably #6208) made this pull request unmergeable. Please resolve the merge conflicts.

embray · 2015-10-19T16:50:00Z

Will rebase this shortly and add the requisite tests.

It turns out this affects me more than I previously thought: I'm now finding myself in a position where I need to be able to create dtypes with arbitrary offsets for the fields, and supporting zero-width fields. As far as I've been able to find this is currently impossible but I may be missing some subtle workaround...?

If I can't find a workaround I'm going to have to write my own custom dtype constructor in C to use in my code, for now :(

ahaldane · 2015-10-19T17:08:52Z

I would guess the problem is this line. If you fix the dtype conversion behavior in this PR, I am pretty sure that line can be removed, presumably solving your problem.

embray · 2015-10-19T17:17:57Z

This PR does fix that line. Doesn't help me for older Numpy versions though.

embray · 2015-10-21T18:16:56Z

~~This is strange--since I rebased on master my fix has completely stopped working.~~

Nevermind, it was just due to #6208 after all. At first it wasn't obvious to me what changed there that would impact this but now it's clear.

embray · 2015-10-21T19:28:18Z

Okay, I was able to mostly fix this against #6208, but the test case from #1901 is still failing. This is because, for some reason, assigning to a zero-width string array from a non-zero string array is still resulting in data being copied into the destination array. I'm going to double check, but I'm pretty sure that this did not happen before. Not clear why that would have changed though. Any ideas, @ahaldane ?

Nevermind--I double checked and that case was not fixed previously.

embray · 2015-10-21T22:04:36Z

This is probably about ready to go, I think.

It would be nice to have this as a bug fix, so my suggestion would be to use this version for now, but maybe open an issue to later outright remove the behavior of converting empty-string dtypes to 1 character string dtypes for a later version.

Let me know if there's anything I should do in terms of rebasing / squashing commits.

ahaldane · 2015-10-22T03:17:09Z

The code looks good to me. I would just like to try to think briefly of any special cases we might have missed.

For instance, saving the zero-sized arrays with np.save doesn't work with the error from above. I'm ok leaving that broken for now though (who would want to do it?).

Here are a few more cases I thought about:

>>> a = np.zeros(3, dtype='S0,u1')['f0']  # easiest way to get a 0-sized array?
>>> a.view('S0')
TypeError: data-type must not be 0-sized

That seems fine to leave alone.

>>> np.empty(3, dtype='S0,S0')
TypeError: Empty data-type

same.

>>> a = np.zeros(4, dtype=[('a', 'S0,S0'), ('b', 'u1')])
>>> a['a']
TypeError: Empty data-type

That's more dubious.

>>> a = np.zeros(4, dtype='S0,u1')['f0']
>>> a.reshape((2,2))
array([['', ''],
       ['', '']], dtype='|S1')

This also seems a bit dubious.

Any thoughts on those?

Edit: The last one definitely needs to be fixed (segfault risk). I think you might fix the middle two as well, as they just need a change in logic in the region of PyArray_NewFromDescr_int you've already modified.

embray · 2015-10-22T13:45:02Z

I'm ok leaving that broken for now though (who would want to do it?)

Probably nobody would want to do it explicitly, though it could crash code that isn't expecting it. I'll see if I can fix that one.

The view and empty cases I deliberately left alone as being the current behavior. Your third example is a bit strange, but makes sense if you consider that a dtype of "S0,S0" is packed into a "|V0". So it takes a different code path (and one that I'm leaving alone for now in this PR). But I agree that it's dubious and could be changed.

The reshape example is interesting. That should be fixed too.

ahaldane · 2016-06-02T16:59:54Z

Oh right, I forgot about that.

This PR ended up being a bug-fix/enhancement rather than changed behavior, but you're right it should be noted in case of problems. I think in summary we allowed 'S0' types in np.ndarray and no longer (incorrectly) promote string lengths in structured array fields.

Ping @embray for first dibs on making the PR for that. The note goes in doc/release/1.12.0-notes.rst

embray · 2016-06-03T17:21:07Z

Sorry; didn't know what Numpy's procedure is for adding to release notes. I can do so--as @ahaldane wrote the most visible change is in ndarray.__new__. Most users wouldn't notice, and it's doubtful that any users depended on the HIGHLY CONFUSING AND WRONG (sorry for the all-caps) previous behavior. But definitely worth noting.

embray · 2016-06-07T06:26:39Z

So what sections should I put thins in the release notes? The ndarray.__new__ change should probably go in "Compatibility notes" I don't really think it's much of an enhancement but it is worth noting. The rest are just bug fixes--I don't know if you list those exhaustively in your release notes. From looking at past ones it looks like maybe not (I typically do for most of my projects which is why I'm not sure--fine either way though).

ahaldane · 2016-06-07T19:11:38Z

I don't think we're too picky about exactly which section it goes. Since these are all related changes I might to keep them together in a single note though.

I kind of like putting it in "Improvements", since I don't think we're breaking any code that previously worked (so it's not a compatibility issue), but we are fixing a bunch of cases.

DOC: Mention the changes of #6430 in the release notes.

juliantaylor · 2017-04-20T18:19:21Z

what is the use case of an 'S0' array?
right now we just end up dumping uninitialized memory when such an array is used depending on whether the code uses the constructor that allows empty strings or not, which is more or less random (e.g. subscript and flatten/ravel do not allow it)

eric-wieser · 2017-04-20T18:27:15Z

Can you give a code example that dumps uninitialized memory?

juliantaylor · 2017-04-20T18:31:51Z

import numpy as np
dt = np.dtype([('I', int), ('S', 'S0')])
x = np.zeros(4, dtype=dt)
xx = x['S'].reshape((2, 2))
print(repr(xx))

here it happens due to the ravel in repr, but tostring() as used in the testsuite also triggers it

eric-wieser · 2017-04-20T18:37:03Z

depending on whether the code uses the constructor that allows empty strings or not,

It seems to me that we should always use that constructor, and that the one that does not should not even exist

juliantaylor · 2017-04-20T18:47:31Z

I am wondering why we allow it at all, I cannot think of a practical use case for this, it is an array of zeros that cannot be modified.
If it was an oversight to allow it in structured types I'd rather fix that instead of trying to audit every single code path that can create arrays.

MSeifert04 · 2017-04-20T19:02:24Z

I wonder if this is because of astropys fits module. However that's just an educated guess.

eric-wieser · 2017-04-20T19:09:07Z

it is an array of zeros that cannot be modified.

That doesn't sound right - it's an empty array with no contents that occupies 0 bytes (all of which are modifiable, but that's kinda meaningless).

I think there's a consistency argument that says we should allow it, but I do struggle to construct an actual use case. Perhaps so that a function can use np.dtype([('val', int), ('comment', string, n)]), and there's no memory overhead when n == 0 (rather than special casing and constructing a dtype without that field when n == 0)

embray · 2017-04-21T09:09:19Z

I am wondering why we allow it at all, I cannot think of a practical use case for this, it is an array of zeros that cannot be modified.

For one there is a consistency argument to be made, but irrespective of any use case known to you, you could just point out bugs--perhaps in a separate issue where it would be relevant--without superfluous challenges to the legitimacy of some use case.

If it was an oversight to allow it in structured types

In fact this is specifically the use case where for me it was most important. As @MSeifert04 suggests, the FITS format allows columns in binary tables that zero width. Why? I'm not sure. I've explored some possible reasons for this in the past but I don't remember. Nevertheless they have and do occur in the wild. It's very difficult to represent such a table as a structured dtype without supporting zero-width fields in the structure because it means lots of extra book-keeping to map fields in a structured array with logical fields in the FITS format. It's much easier and more consistent to allow fields that are just empty arrays.

In fact, as pointed out, this was already supported in the form of |V0, but the treatment of S0 was not consistent with this, and in fact buggy and inconsistent in its own right (some code paths explicitly disallowed it, others allowed it but silently converted S0 to S1 resulting in other weird errors).

I'd rather fix that instead of trying to audit every single code path that can create arrays.

In fact I (tried) to do exactly that. And it's not exactly a difficult case to support--in fact in a way it takes less effort and doing so in fact makes for more correct code in any many cases, rather than brushing it under the rug (as the conversion from S0 to S1 in some cases was doing). The only problem is that historically more code in Numpy has taken the brushing under the rug approach, so it's still not hard to miss a few cases here and there.

embray · 2017-04-21T09:12:02Z

Perhaps so that a function can use np.dtype([('val', int), ('comment', string, n)]), and there's no memory overhead when n == 0 (rather than special casing and constructing a dtype without that field when n == 0)

I've had cases like this as well come up too. You may want to have the same data structure throughout even if n = 0 without gobs of special-casing.

juliantaylor · 2017-04-21T11:44:54Z

hm it should be fixable by not attempting to change the string description when creating a view.

eric-wieser · 2017-04-21T11:46:46Z

@njsmith:

(In the long run I think the ideal would be something like: dtype=str and
dtype="U0" are different, with the former autodetecting and erroring out if
used in a context that doesn't do autodetection, and the latter not doing
any autodetection at all. But that's a bigger piece of surgery.

Surgery complete at #8970

rgommers added component: numpy._core 03 - Maintenance labels Oct 10, 2015

embray force-pushed the issue-473 branch from 1b4c30c to cd142cd Compare October 21, 2015 19:50

charris mentioned this pull request Oct 29, 2015

dtype(('S0', ())) raise unexpected exception #6471

Open

charris added this to the 1.10.2 release milestone Oct 29, 2015

This was referenced Jun 2, 2016

data leakage with S0 in record dtype (Trac #1992) #2585

Closed

Bug in rec.fromarrays in handling length-0 strings (Trac #1303), now with segfault. #1901

Closed

embray deleted the issue-473 branch June 7, 2016 06:11

embray added a commit to embray/numpy that referenced this pull request Jun 13, 2016

DOC: Mention the changes of numpy#6430 in the release notes.

abbf794

embray mentioned this pull request Jun 13, 2016

DOC: Mention the changes of #6430 in the release notes. #7739

Merged

charris added a commit that referenced this pull request Jun 13, 2016

Merge pull request #7739 from embray/doc/6430

fd946f0

DOC: Mention the changes of #6430 in the release notes.

embray mentioned this pull request Aug 29, 2016

BUG: Let linspace accept input that has an array_interface. #6659

Merged

embray mentioned this pull request Sep 6, 2016

Few optimizations for FITS tables astropy/astropy#4985

Merged

eric-wieser mentioned this pull request Apr 21, 2017

ENH/API: Change flexible types to indicate resizability with elsize == -1 #8970

Closed

juliantaylor mentioned this pull request Apr 22, 2017

BUG: do not change size 0 description when viewing data #8971

Merged

pllim mentioned this pull request Jun 6, 2018

Remove numpy hack in io.fits when Numpy minversion is 1.12 astropy/astropy#7545

Closed

Fix issues with zero-width string fields #6430

Fix issues with zero-width string fields #6430

Conversation

embray commented Oct 8, 2015

njsmith commented Oct 8, 2015

(I'll add some regression tests in a bit)

embray commented Oct 8, 2015

embray commented Oct 8, 2015

njsmith commented Oct 8, 2015

embray commented Oct 8, 2015

njsmith commented Oct 8, 2015

njsmith commented Oct 8, 2015

embray commented Oct 8, 2015

embray commented Oct 8, 2015

njsmith commented Oct 8, 2015

embray commented Oct 9, 2015

ahaldane commented Oct 12, 2015

embray commented Oct 12, 2015

ahaldane commented Oct 12, 2015

embray commented Oct 12, 2015

embray commented Oct 13, 2015

homu commented Oct 18, 2015

embray commented Oct 19, 2015

ahaldane commented Oct 19, 2015

embray commented Oct 19, 2015

embray commented Oct 21, 2015

embray commented Oct 21, 2015

embray commented Oct 21, 2015

ahaldane commented Oct 22, 2015

embray commented Oct 22, 2015

ahaldane commented Jun 2, 2016

embray commented Jun 3, 2016

embray commented Jun 7, 2016

ahaldane commented Jun 7, 2016

juliantaylor commented Apr 20, 2017

eric-wieser commented Apr 20, 2017

juliantaylor commented Apr 20, 2017 • edited Loading

eric-wieser commented Apr 20, 2017 • edited Loading

juliantaylor commented Apr 20, 2017

MSeifert04 commented Apr 20, 2017

eric-wieser commented Apr 20, 2017 • edited Loading

embray commented Apr 21, 2017

embray commented Apr 21, 2017

juliantaylor commented Apr 21, 2017

eric-wieser commented Apr 21, 2017

juliantaylor commented Apr 20, 2017 •

edited

Loading

eric-wieser commented Apr 20, 2017 •

edited

Loading

eric-wieser commented Apr 20, 2017 •

edited

Loading