ENH: Allow dtype field names to be ascii encoded unicode in Python2 #10672

Merged
merged 10 commits into from Mar 8, 2018

Conversation

chrisjbillington
Contributor

@chrisjbillington chrisjbillington commented Feb 27, 2018

(#2407)

Allow unicode names in record arrays when datatype specified as tuples in Python 2.
Name is encoded as ascii if possible, raising an exception if not.

The result passes round-tripping with pickle

# -*- coding: UTF-8 -*-
from __future__ import unicode_literals, print_function
import numpy as np
import cPickle as pickle

x = np.zeros(1, dtype=[('test', int)])
print('created array:', repr(x))

y = pickle.loads(pickle.dumps(x))
print('round-tripped pickle:', repr(y))
assert x == y

print('should raise an error now...')
x = np.zeros(1, dtype=[('tëßt', int)])

Result:

created array: array([(0,)], dtype=[('test', '<i8')])
round-tripped pickle: array([(0,)], dtype=[('test', '<i8')])
should raise an error now...
Traceback (most recent call last):
  File "12.py", line 18, in <module>
    x = np.zeros(1, dtype=[('tëßt', int)])
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)
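
The rule stated in the description (encode the name as ASCII if possible, raise otherwise) can be sketched in pure Python. This is only an illustration that runs on Python 3; `normalize_field_name` is a hypothetical helper, not a NumPy function, and the actual change is made in NumPy's C code, as the diff excerpts in this thread show.

```python
def normalize_field_name(name):
    """Hypothetical helper (not part of NumPy): return `name` as ASCII bytes.

    Text that is not pure ASCII raises UnicodeEncodeError, mirroring the
    behaviour this PR describes for unicode field names on Python 2.
    """
    if isinstance(name, bytes):
        # Already a byte string: pass it through unchanged.
        return name
    # str.encode('ascii') raises UnicodeEncodeError on non-ASCII characters.
    return name.encode('ascii')


print(normalize_field_name('test'))
try:
    normalize_field_name('tëßt')
except UnicodeEncodeError as exc:
    print('rejected:', exc.reason)
```

Letting the codec raise, rather than silently replacing characters, keeps a field name from changing without the caller noticing.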

Allow unicode names in record arrays when datatype specified as tuples.
Name is encoded as ascii if possible, raising an exception if not.
if (PyUString_Check(name)) {
#else
if ((PyUString_Check(name) || PyUnicode_Check(name))) {
#endif
Member

This is PyBaseString_Check(name) (from npy_3kcompat.h)


#if !defined(NPY_PY3)
/* convert unicode name to ascii on Python 2 if possible */
if PyUnicode_Check(name){
Member

Syntax error

/* convert unicode name to ascii on Python 2 if possible */
if PyUnicode_Check(name){
Py_DECREF(name);
name = PyUnicode_AsASCIIString(name);
Member

Not safe - you need to convert before you decref

@eric-wieser
Member

The approach looks good, just some implementation details to fix.

@chrisjbillington
Contributor Author

Thanks for the comments! I've made the changes.

@mhvk mhvk changed the title from "Resolve issue #2407" to "Allow dtype field name definitions to be unicode in Python2" Feb 27, 2018
@mhvk
Contributor

mhvk commented Feb 27, 2018

(No review, just changed the title to be meaningful.)

/* convert unicode name to ascii on Python 2 if possible */
if (PyUnicode_Check(name)) {
PyObject *tmp;
tmp = PyUnicode_AsASCIIString(name);
Member

Nit: could combine these lines

# But raises UnicodeEncodeError if it can't be encoded:
nonencodable_title = u'\uc3bc'
assert_raises(UnicodeEncodeError, np.dtype, [(nonencodable_title, int)])
assert_raises(UnicodeEncodeError, np.dtype, [(('a', nonencodable_title), int)])
Member

Why does this fail on py3 too?

Contributor Author

I'm not completely sure what you're asking, but this part of test_multiarray.py is within an if statement such that it only is defined on Python 2, if that answers your question.

Member

Got it. I would have expected a @skipIf decorator for that, but that's cleanup for another time.
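
As a rough sketch of the decorator-based alternative mentioned here, a version-specific test can be skipped rather than defined inside an `if` block. This uses the standard library's `unittest.skipIf`, which is an assumption and may not be the decorator NumPy's test suite would use; `encode_field_name` is a hypothetical stand-in for the Python 2 name handling, not a NumPy API.

```python
import sys
import unittest


def encode_field_name(name):
    # Hypothetical stand-in for the Python 2 name handling; not a NumPy API.
    return name.encode('ascii')


class TestUnicodeFieldNames(unittest.TestCase):
    # The test is always defined, but reported as skipped on Python 3.
    @unittest.skipIf(sys.version_info[0] >= 3, "Python 2 only behaviour")
    def test_non_ascii_name_rejected(self):
        with self.assertRaises(UnicodeEncodeError):
            encode_field_name(u'\uc3bc')


# Run the case explicitly so the skip shows up in the result object.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestUnicodeFieldNames)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

With a skip, the test still appears in the run report on Python 3, which makes it easier to see that it exists and why it did not run.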

Member

@eric-wieser eric-wieser left a comment

Implementation and test look correct to me.

I'll leave it to someone else to decide whether we want to fix #2407 at all - I think I'm +0.75 on it.

This maybe needs a release note, but I'd be ok without one. Again, will let someone else decide.

@eric-wieser eric-wieser added this to the 1.15.0 release milestone Feb 28, 2018
@mhvk
Contributor

mhvk commented Feb 28, 2018

I am in favour of the fix: this particular bug meant that people were discouraged from using unicode_literals in their code, and thus made writing code for both Python 2 and 3 just that little bit more annoying. It is good to remove that annoyance even if it is nearly not needed any more.

Given the above, I think a release note is a good idea.

@charris charris changed the title from "Allow dtype field name definitions to be unicode in Python2" to "ENH: Allow dtype field names to be unicode in Python2" Feb 28, 2018
@chrisjbillington
Contributor Author

Thanks very much! I look forward to removing my workarounds for this once 1.15 is out.

@pv
Member

pv commented Mar 1, 2018 via email

@charris charris added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Mar 4, 2018
@charris
Member

charris commented Mar 4, 2018

@chrisjbillington Needs a release note in `doc/release/1.15.0-notes.rst` under Enhancements.

@chrisjbillington
Contributor Author

Release note added!

@charris charris removed the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Mar 8, 2018
@charris charris changed the title from "ENH: Allow dtype field names to be unicode in Python2" to "ENH: Allow dtype field names to be ascii encoded unicode in Python2" Mar 8, 2018
@charris charris merged commit 6fb8622 into numpy:master Mar 8, 2018
@charris
Member

charris commented Mar 8, 2018

Thanks @chrisjbillington .


5 participants