A fix for the €˜PyUnicodeObject Python 3.3 #372

Merged
merged 4 commits into from Aug 4, 2012

4 participants

@certik

This is the polished fix from #366. The same approach is used, except that no hacks are necessary, we simply swap the buffer if needed, then call PyUnicode_FromKindAndData() and that's it.

There is a bug in tests, that sometime they produce invalid unicode. This is also fixed. See the description of the individual commits for more details.

certik added some commits Jul 27, 2012
@certik certik FIX: Fixes the PyUnicodeObject problem in py-3.3
Previously NumPy did not compile in Python 3.3 due to the changes in PEP 393.
This patch simply calls PyUnicode_FromKindAndData() from the new Python 3.3 API
and possibly swaps the data before calling it if needed. The data in NumPy is always UCS4 and the PyUnicode_FromKindAndData() internally converts it to UCS1, UCS2 or UCS4 depending on the maximum code point.

The following tests now fail, because they produce invalid unicode in the
process (will be fixed in the next patch):

======================================================================
ERROR: Check byteorder of 0-dimensional objects
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ondrej/py33/lib/python3.3/site-packages/numpy/core/tests/test_unicode.py", line 277, in test_values0D
    self.assertTrue(ua[()] != ua2[()])
SystemError: invalid maximum character passed to PyUnicode_New

======================================================================
ERROR: Check byteorder of multi-dimensional objects
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ondrej/py33/lib/python3.3/site-packages/numpy/core/tests/test_unicode.py", line 297, in test_valuesMD
    self.assertTrue(ua[0,0,0] != ua2[0,0,0])
SystemError: invalid maximum character passed to PyUnicode_New

======================================================================
ERROR: Check byteorder of single-dimensional objects
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ondrej/py33/lib/python3.3/site-packages/numpy/core/tests/test_unicode.py", line 286, in test_valuesSD
    self.assertTrue(ua[0] != ua2[0])
SystemError: invalid maximum character passed to PyUnicode_New

======================================================================
ERROR: Check byteorder of 0-dimensional objects
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ondrej/py33/lib/python3.3/site-packages/numpy/core/tests/test_unicode.py", line 277, in test_values0D
    self.assertTrue(ua[()] != ua2[()])
SystemError: invalid maximum character passed to PyUnicode_New

======================================================================
ERROR: Check byteorder of multi-dimensional objects
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ondrej/py33/lib/python3.3/site-packages/numpy/core/tests/test_unicode.py", line 297, in test_valuesMD
    self.assertTrue(ua[0,0,0] != ua2[0,0,0])
SystemError: invalid maximum character passed to PyUnicode_New

======================================================================
ERROR: Check byteorder of single-dimensional objects
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ondrej/py33/lib/python3.3/site-packages/numpy/core/tests/test_unicode.py", line 286, in test_valuesSD
    self.assertTrue(ua[0] != ua2[0])
SystemError: invalid maximum character passed to PyUnicode_New
a9d58ab
@certik certik FIX: Make sure the tests produce valid unicode
The tests are testing byte order for unicode, so we can only use such unicode
data, so that both versions (swapped and unswapped) are valid unicode.
4234b6b
@travisbot

This pull request passes (merged 4234b6b into fd15162).

@charris charris commented on an outdated diff Aug 3, 2012
numpy/core/src/multiarray/scalarapi.c
@@ -743,6 +778,7 @@
#endif
return obj;
}
+#endif // PY_VERSION_HEX < 0x03030000
@charris
NumPy member
charris added a note Aug 3, 2012

C++ comments aren't portable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@charris charris commented on an outdated diff Aug 3, 2012
numpy/core/src/multiarray/scalarapi.c
@@ -641,6 +641,40 @@
itemsize = (((itemsize - 1) >> 2) + 1) << 2;
}
}
+#if PY_VERSION_HEX >= 0x03030000
+ if (type_num == NPY_UNICODE) {
+ PyObject *u, *args;
+ char *buffer;
+ if (swap) {
+ buffer = malloc(itemsize);
+ if (buffer == NULL) {
+ PyErr_NoMemory();
+ return NULL;
+ }
+ memcpy(buffer, data, itemsize);
+ byte_swap_vector(buffer, itemsize >> 2, 4);
+ } else {
@charris
NumPy member
charris added a note Aug 3, 2012

Style (see doc/C_STYLE_GUIDE.rst.txt).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@charris charris and 1 other commented on an outdated diff Aug 3, 2012
numpy/core/src/multiarray/scalarapi.c
@@ -641,6 +641,40 @@
itemsize = (((itemsize - 1) >> 2) + 1) << 2;
}
}
+#if PY_VERSION_HEX >= 0x03030000
+ if (type_num == NPY_UNICODE) {
+ PyObject *u, *args;
+ char *buffer;
@charris
NumPy member
charris added a note Aug 3, 2012

Blank line after declarations.

@certik
certik added a note Aug 3, 2012

Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@charris charris and 1 other commented on an outdated diff Aug 3, 2012
numpy/core/src/multiarray/scalarapi.c
@@ -641,6 +641,40 @@
itemsize = (((itemsize - 1) >> 2) + 1) << 2;
}
}
+#if PY_VERSION_HEX >= 0x03030000
+ if (type_num == NPY_UNICODE) {
+ PyObject *u, *args;
+ char *buffer;
+ if (swap) {
+ buffer = malloc(itemsize);
+ if (buffer == NULL) {
+ PyErr_NoMemory();
+ return NULL;
+ }
+ memcpy(buffer, data, itemsize);
+ byte_swap_vector(buffer, itemsize >> 2, 4);
@charris
NumPy member
charris added a note Aug 3, 2012

Might want to do itemsize/sizeof(NPY_UNICODE). The compiler will probably implement it as a shift.

@certik
certik added a note Aug 3, 2012

Well -- then one should do this change everywhere (right now this line is consistent to other usage of byte_swap_vector). Also the "4" should be changed to sizeof(NPY_UNICODE). So I think it's better to either keep it as it is, or create a patch that does this change everywhere in numpy. Well I don't know, either is fine with me.

@charris
NumPy member
charris added a note Aug 3, 2012

True, there are lots of places with hard coded values of this and that in Numpy. A NPY_UNICODE_SIZE macro might be useful, I didn't see one with a quick grep. However, I don't think current usage is a good argument for not making changes, you can look at the code in, say, the 1.1 release to see how much has changed in that regard. Just thought it might be useful to start doing things differently going forward if there is agreement on usage. I think the explicit sizeof is a bit more self documenting.

@certik
certik added a note Aug 3, 2012

That's a good argument. However, I just noticed, that NPY_UNICODE is just an "enum", see ndarraytypes.h:71. So I don't think that's a good idea to do sizeof(), because that will always return 4, no matter what the actual size is. Is there some other solution?

@charris
NumPy member
charris added a note Aug 3, 2012

Yeah, my mistake. The type is npy_ucs4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@charris charris commented on the diff Aug 3, 2012
numpy/core/src/multiarray/scalartypes.c.src
@@ -2592,7 +2592,11 @@ finish:
*((npy_@name@ *)dest) = *((npy_@name@ *)src);
#elif @default@ == 1 /* unicode and strings */
if (itemsize == 0) { /* unicode */
+#if PY_VERSION_HEX >= 0x03030000
+ itemsize = PyUnicode_GetLength(robj) * PyUnicode_KIND(robj);
@charris
NumPy member
charris added a note Aug 3, 2012

Can't say I'm happy with PyUnicode_Kind being changed to be a size, but that is a Python issue...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@charris
NumPy member

Looks good except for some minor style issues.

@certik

Thanks @charris for the review. I've fixed the C style problems.

Let me know about the byte_swap_vector(), either solution is fine with me.

@teoliphant
NumPy member

I think we should look into two things:

1) Using PyUnicode_DecodeUTF32 to create the unicode object --- with an argument to byteorder that depends on swap.
2) Look into the type constructor (after the Unicode object is created) to look at how the new Unicode Array Scalar is being created. It would be best if in-essence the type-object was simply changed to point to a the Unicode Array Scalar object. Right now, I'm not clear on what happens --- is "yet another copy happening"?

@certik

I'll investigate 1) and report back.

As to 2), that also occurred to me --- I already spent some time to find the constructor but so far I failed. Does anybody know where the numpy's str_ constructor is implemented? I'll try to use gdb to nail it.

@travisbot

This pull request passes (merged 09d2c51 into fd15162).

@teoliphant
NumPy member
@certik certik Use PyUnicode_DecodeUTF32()
This function handles the swapping automatically and it returns a unicode
object in one of: UCS1, UCS2 or UCS4 internal Python format.
f2ac38f
@travisbot

This pull request passes (merged f2ac38f into fd15162).

@certik

@teoliphant, indeed, the PyUnicode_DecodeUTF32() is a much better option! See my patch above (f2ac38f). Everything works. It handles the byte order itself and then at the end it converts it (internally) into UCS1, UCS2 or UCS4 Python formats. See Python sources of the PyUnicode_DecodeUTF32(), PyUnicode_DecodeUTF32Stateful() and unicode_resize() functions in Objects/unicodeobject.c (if anybody is reading the sources, the way it works is that first UCS1 is created, filled in using unicode_putchar() and then the unicode_resize() function converts it to the proper internal Python format --- however this is just implementation detail, important is that we use public API) .
So this answers 1).

@teoliphant
NumPy member

This looks good to me for the time being --- quite an improvement in-fact. I looked into the unicode constructor and it looks like it does indeed create yet a new object for a sub-type and copies the data over again. It would be much better if PyUnicode_DecodeUTF32 were able to return a sub-type.

What I am looking for is the ability to view the unicode object as a numpy.unicode_ object --- with no data copy.

@teoliphant teoliphant merged commit 4676f33 into numpy:master Aug 4, 2012

1 check passed

Details default The Travis build passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment