dtype field names cannot be unicode (Trac #1814) #2407

Open
numpy-gitbot opened this Issue Oct 19, 2012 · 22 comments

@numpy-gitbot

Original ticket http://projects.scipy.org/numpy/ticket/1814 on 2011-04-29 by @jonovik, assigned to unknown.

Is there a reason why Unicode strings are not accepted as field names for record arrays?

>>> np.dtype([("a", int)])
dtype([('a', '<i4')])
>>> np.dtype([(u"a", int)])
TypeError: data type not understood

A workaround is .encode("ascii").

>>> np.dtype([(u"a".encode("ascii"), int)])
dtype([('a', '<i4')])

It is okay for the type specification to be Unicode.

>>> np.dtype([("a", u"S2")])
dtype([('a', '|S2')])

I came across this while building a record array from data in a Microsoft Excel spreadsheet using pythoncom and win32com. Converting to ascii isn't too much of a hassle, but maybe it wouldn't be difficult to allow Unicode strings as field names?
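For readers arriving on Python 3 (where every string literal is unicode), a modern NumPy accepts the original example as-is; a quick sketch:

```python
import numpy as np

# On Python 3, str is unicode, and modern NumPy accepts it for field names.
dt = np.dtype([("a", int)])

# The field is addressable by its (unicode) name.
arr = np.zeros(3, dtype=dt)
arr["a"] = [1, 2, 3]
```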

@numpy-gitbot

numpy-gitbot commented Oct 19, 2012

trac user jkmacc wrote on 2011-11-01

Bump?

@burnpanck

burnpanck commented Feb 22, 2013

I run into this issue all the time when I try to be Python 3 forward compatible and do

from __future__ import unicode_literals

which means that all string literals are unicode. I wonder how this is handled in true Python 3, since numpy now supports it?

@evertrol

evertrol commented May 7, 2013

No idea if this is being addressed, but I'm still running into it (Python 2.7 + numpy 1.7.0 and Python 3.3 + numpy 1.8 dev).
One horrendous workaround I've found to keep the code compatible with both Python 2 and 3 while using from __future__ import unicode_literals is a ternary expression:

np.dtype([("a" if isinstance("a", str) else b"a", int)])
@njsmith

Member

njsmith commented May 7, 2013

I think you can just use b"foo" and skip the ternary ugliness, on both py2 and py3.

@evertrol

evertrol commented May 7, 2013

No, that's where things go wrong in Python 3:

 >>> np.dtype([(b'a', int)])
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 TypeError: data type not understood

numpy doesn't like bytes for the field name either, so a more accurate title for this issue would be "dtype field names cannot be unicode or bytes".
As with the remarks above, bytes can be used for the field type specification.

Hence I think this needs to be addressed on the numpy side, so that either unicode or bytes is allowed as the field name (next to str).
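The bytes rejection described above is still observable on Python 3; a minimal sketch, assuming a modern NumPy (the exact exception type has varied across versions, so both are caught):

```python
import numpy as np

# Unicode (str) field names are accepted on Python 3...
dt_ok = np.dtype([("a", int)])

# ...but bytes field names are rejected.
try:
    np.dtype([(b"a", int)])
    raised = False
except (TypeError, ValueError):
    raised = True
```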

@pv

Member

pv commented May 7, 2013

On Python 2 and Python 3, the field names must be of type str, so b"foo" won't do. In itself, this is perfectly consistent behavior.

What can be done is allowing ASCII encoding on entry. Allowing both str and unicode internally would probably lead to a mess.

@njsmith

Member

njsmith commented May 7, 2013

Oh, I see, sorry. Field names have to be bytes on py2 and unicode on py3, so normally you just write "foo" and it works on both, but unicode_literals on py2 breaks this.

In that case, wouldn't a workaround be to write str("foo")?

Aside from workarounds, what kind of change would you propose? I don't think we're likely to start storing unicode field names on py2; the compatibility issues would be a mess. Should py2 automatically convert incoming unicode strings to bytes, using the default encoding? That's also a mess, but at least it's an instance of a mess that py3 solves, so eventually it will go away when the py3 migration is complete... Or if the str() hack works, maybe that's good enough?
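The str("foo") workaround suggested here can be sketched as follows. On Python 2 under unicode_literals, str() coerced an ASCII literal to the native str (bytes) type that field names required there; on Python 3 it is a no-op:

```python
import numpy as np

# Under `from __future__ import unicode_literals` on Python 2, "x" was a
# unicode literal; wrapping it in str() yielded the native string type that
# dtype field names require on that version.  On Python 3 this is a no-op.
dt = np.dtype([(str("x"), int)])
```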

@evertrol

evertrol commented May 7, 2013

Doh: str("foo") works. For some reason I figured it couldn't be that easy and tried the hard way. Perhaps I thought that str wouldn't be able to convert the Python 2 unicode to a string, but for basic ASCII strings that's of course not a problem. For me, the str() workaround is good enough; hopefully for others as well. Thanks for that.

@Nodd

Contributor

Nodd commented Oct 9, 2013

Maybe the str() workaround could be included somehow in numpy? Or is it considered a corner case? from __future__ import unicode_literals is quite common during the transition between Python 2 and 3.

@pv

Member

pv commented Oct 9, 2013

Yes, adding something like

#if defined(NPY_PY3K)
    tmp = PyUnicode_FromEncodedObject(obj, "ascii", "strict");
    if (tmp == NULL) { goto fail; }
#else
    tmp = PyUnicode_AsASCIIString(obj);
    if (tmp == NULL) { goto fail; }
#endif

probably would be OK.
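In Python terms, the proposed coercion amounts to something like the following sketch (the helper name is illustrative, not NumPy API): accept native str, decode bytes as strict ASCII, and fail loudly on anything non-ASCII.

```python
def coerce_field_name(name):
    """Coerce a dtype field name to the native str type, ASCII-only.

    Illustrative Python-level equivalent of the C snippet above, not
    actual NumPy code.
    """
    if isinstance(name, str):
        # Round-trip through ASCII so non-ASCII names fail loudly.
        name.encode("ascii")
        return name
    if isinstance(name, bytes):
        # Strict ASCII decode, mirroring PyUnicode_FromEncodedObject.
        return name.decode("ascii")
    raise TypeError("field name must be str or bytes")
```

For example, `coerce_field_name(b"a")` returns `"a"`, while a non-ASCII name such as `u"\u0394"` raises UnicodeEncodeError.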

@burnpanck

burnpanck commented Oct 10, 2013

I strongly agree with @pv! There is one more corner case where this solution would be helpful: when unpickling data in Python 3 that was written in Python 2, one might end up with byte strings in the field names. By default the unpickler decodes Python 2 byte strings to unicode objects, but this breaks objects that pickle true binary data as byte strings; in that case one tweaks the unpickler to leave byte strings as-is. See also http://bugs.python.org/issue6784.

@charris

Member

charris commented Feb 20, 2014

Anyone want to step up to this?

@acjackson

acjackson commented Nov 27, 2014

Bump, anything new on this issue?

@hpaulj

Contributor

hpaulj commented Feb 20, 2015

In numpy/core/src/multiarray/descriptor.c, field names are tested in two ways.

If the dtype specification is the list-of-tuples form, it checks whether each name is a string (as defined by py2 or py3):

PyUString_Check(name)

But if the dtype specification is a dictionary, {'names': [...], 'formats': [...], ...}, the py2 case also allows unicode names:

#if defined(NPY_PY3K)
        if (!PyUString_Check(name)) {
#else
        if (!(PyUString_Check(name) || PyUnicode_Check(name))) {
#endif

Byte strings are disallowed on py3 in both cases. Titles allow both strings and unicode.

http://stackoverflow.com/questions/28586238/does-numpy-recfunctions-append-fields-fail-when-when-array-names-are-unicode/

In this Stack Overflow question, the user created an array with

np.core.records.fromarrays([x1, x2, x3], names=[u'a', u'b', u'c'])

but ran into problems when trying to extend it with

recfunctions.append_fields(FileData, 'DateTime', data=DT)

fromarrays uses format_parser, which in turn creates the dtype with the dictionary format. append_fields creates an empty array with a list-of-tuples dtype: base.dtype.descr + data.dtype.descr. The first allows unicode names (on py2), but the second only strings.

A possible fix is to allow unicode names in the list-of-tuples case as well. That would be in the _convert_from_array_descr() function.

I haven't looked yet at whether the unit tests address this issue. I think the pickling code uses unicode names for maximum compatibility between py2 and py3.
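On Python 3 the two dtype specification forms described above now behave identically; a small sketch, assuming a modern NumPy:

```python
import numpy as np

# Dictionary form, as produced by format_parser / fromarrays.
dt_dict = np.dtype({"names": ["a", "b"], "formats": ["i4", "f8"]})

# List-of-tuples form, as used by append_fields via dtype.descr.
dt_list = np.dtype([("a", "i4"), ("b", "f8")])
```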

@tacaswell tacaswell referenced this issue in matplotlib/matplotlib Mar 21, 2015

Closed

dtype problems with record arrays #4253

@jmlarson1

jmlarson1 commented Apr 17, 2015

Since this has not been addressed, it is not possible to open numpy structured arrays saved in Python 3 in Python 2.

$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
>>> a = np.zeros(4, dtype=[('x',int)])
>>> np.save('a.npy', {'a': a})
>>>
$ python2
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
>>> np.load('a.npy')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 393, in load
    return format.read_array(fid)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/format.py", line 602, in read_array
    array = pickle.load(fp)
ValueError: non-string names in Numpy dtype unpickling

Is such functionality not going to be supported by numpy?
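Note that the example above saves a dict, which forces the array through pickle. Saving the structured array itself writes the plain .npy header format, which involves no pickling for simple dtypes and so sidesteps the cross-version unpickling failure; a sketch:

```python
import os
import tempfile

import numpy as np

a = np.zeros(4, dtype=[("x", int)])

# Saving the array directly (not wrapped in a dict) uses the .npy header
# format; no pickle is involved for plain structured dtypes.
path = os.path.join(tempfile.mkdtemp(), "a.npy")
np.save(path, a)

# Loading works even with allow_pickle=False (the modern default).
b = np.load(path)
```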

@jakirkham

Contributor

jakirkham commented May 6, 2015

Also seeing this problem.

@rainwoodman

Contributor

rainwoodman commented Mar 11, 2016

Just got bitten by this 10 minutes ago. Any news? It's been a year since the last update.

Is there a decision on what encoding the internal (C-API) side of numpy will use for column names?

Once that is settled, every input column name (from Python 2 and 3) can easily be converted to that internal encoding upon arrival, be it from pickle or Python source code.

@pv

Member

pv commented Mar 11, 2016

As discussed above, field names have the Python string type (bytes on py2, unicode on py3). There's no internal encoding in either case.

Regarding the pickling failures above: py2 and py3 pickles are in general incompatible, and there's not much that can be done here without resorting either to mojibake or data corruption. .npy files containing pickles are not portable between py2 and py3.

In practice, automatic ASCII conversion could perhaps be safe to add (to dtype field name assignment etc.), as that's the typical lowest common denominator. However, doing this on py3 reintroduces the bytes/strings confusion of py2.

@pv

Member

pv commented Mar 11, 2016

I think a PR adding automatic conversion from unicode to bytes for dtype field names on Python 2 only, using e.g. ASCII (or your favourite alternative encoding, e.g. UTF-8), has good chances of being merged.

@marcocamma

any news?

@J-Sand

Contributor

J-Sand commented Dec 5, 2016

I'm having a look at this. Before I go any further, could I check what the appropriate fix is? There seems to be agreement that on py2, unicode field names should be encoded into strs somehow (UTF-8? unicode-escape?), but at present it's already possible to have unicode field names by specifying the fields with a dict instead of a list:

>>> np.dtype({'names': [u'\u0394'], 'formats': ['i']})
dtype([(u'\u0394', '<i4')])

The name doesn't get encoded here, and it generally seems to work OK (the only exception I've found is that pickle chokes on it). So would it be OK to just keep the field name as a unicode object on py2, or should we encode it everywhere?

Actually, there are some other odd inconsistencies between the dict and list forms. With the dict form you can have the empty string as a field name:

>>> np.dtype({'names': [''], 'formats': ['i']})
dtype([('', '<i4')])

With the list form, it attempts to use the title instead, if there is one, but then immediately complains that the name and title are the same:

>>> np.dtype([(('title', ''), 'i')])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: title already used as a name or title.

If there is no title, it falls back to 'f0', 'f1', etc.
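The fallback naming mentioned above can be observed directly; a sketch, assuming a modern NumPy where the list form with empty names still auto-generates 'f0', 'f1', ...:

```python
import numpy as np

# List-of-tuples form: empty names are replaced by positional defaults.
dt = np.dtype([("", "i4"), ("", "f8")])
```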

@AlexisMignon AlexisMignon referenced this issue in pandas-dev/pandas Jan 3, 2017

Closed

BUG: Fix a bug when using DataFrame.to_records with unicode column names #13462


@TomAugspurger TomAugspurger referenced this issue in statsmodels/statsmodels May 15, 2017

Closed

Maint/TST: test failure pandas, numpy compat #3658

@chrisjbillington

Contributor

chrisjbillington commented Feb 27, 2018

On Python 2 it should encode to ASCII and just raise the error if the name can't be encoded. Then you can pass in unicode string literals on Python 2 with unicode_literals imported from __future__.

The only use case here is code that is supposed to work on both Python 2 and Python 3. Wrapping str() around the names might work, but it's ugly and interacts badly with the fact that people's code often defines str = unicode on Python 2 in order to refer to the same datatypes on both versions.

chrisjbillington added a commit to chrisjbillington/numpy that referenced this issue Feb 27, 2018

Resolve issue #2407
Allow unicode names in record arrays when datatype specified as tuples.
Name is encoded as ascii if possible, raising an exception if not.