-
-
Notifications
You must be signed in to change notification settings - Fork 14
FEAT: Adding array casting support to fixed length strings #225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Slightly off topic - what's the current timeline for the next release? |
Got a bit busy with new stuff and job, NumPy 2.4 is about to be out so probably at same time. |
ngoldbaum
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few comments. I didn't carefully review the reference counting.
| PyErr_Format(PyExc_ValueError, | ||
| "Cannot cast non-ASCII character '%c' to QuadPrecision", c); | ||
| return -1; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
won't the string-conversion functions called by cstring_to_quad catch this?
By the way, you might take inspiration from the string_to_long_double function in numpy:
You may also want to simply copy/paste and vendor the code in NumPyOS_ascii_strtold into this project to get a parser that act exactly like numpy's.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm reading the code, there is not much difference, we directly working with UCS but NumPy converts it first to UTF-8 ( we can do this) and special case handling, white spaces, etc are already being handled by the Sleef_strotoq and strtold except the handling of locale which we might not be able to do as
- SLEEF's conversion function does no support that
strtold_l(used byNumPyOS_ascii_strtold) is not C standard (a posix extension)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also just realised NumPy's longdouble and float64 string to dtype conversion's are inconsistent
a = np.array(['1.0000000000000002', " 1 "], dtype=np.str_)
print(repr(a))
print(a.astype(QuadPrecDType()))
print(a.astype(np.float64))
print(a.astype(np.longdouble))
# output
array(['1.0000000000000002', ' 1 '], dtype='<U18')
[1. 1.]
[1. 1.]
Traceback (most recent call last):
File "/home/OSS/numpy-user-dtypes/main.py", line 64, in <module>
print(a.astype(np.longdouble))
^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for long double: 1 longdouble does not handle the end trailing whitespaces (In quaddtype I am handling it, by setting require_full_parse to false)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can fix it, loadtxt strips whitespace anyway (unless you quote), np.fromtxt probably chokes on it, although I wouldn't recommend it over loadtxt anyway.
So yeah, probably reasonable to fix, but also seems very low priority to me.
|
The overall diff became quite big but most of the code is inspired from NumPy with slight change logic + Additional tests. So probably Maybe put the casting helper functions into utility file as well (to cut the diff) but as of now any other files does not seem to include them |
ngoldbaum
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I skimmed this - unfortunately I don't have time to give this a low-level code review.
Overall, this looks awesome. I think clearly indicating when stuff is copied from NumPy is fine.
Maybe @seberg wants to look this over too?
|
Cool also independent of this work, @ngoldbaum @seberg should we also patch the longdouble inconsistency? shown #225 (comment) within this thread |
seberg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't have a very close look, but overall this looks reasonable to me. There is a small issue with byte-swapped unicode string inputs, but that is rather niche (and will fail obvious anyway).
| npy_intp *view_offset) | ||
| { | ||
| Py_INCREF(given_descrs[0]); | ||
| loop_descrs[0] = given_descrs[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This really needs a canonical/ISNBO call, since you presumably do not support byte-swapping.
But, I have to ask: I guess NumPy forces you to implement this? Maybe we should change this in some form, even passing the slot as NULL explicitly to mean: Just use the default already.
Because this smells like the default would work fine so that this is all just a bunch of boilerplate...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting the slot NULL, gives segfault at runtime, also in future (post version-1 release) I was thinking to support the byteswapping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you can't set it explicitly to NULL right now. But omitting it doesn't work, I guess? I.e. NumPy rejects it because your dtype is parametric or so?!
If you want to support NBO, then yes, the default won't work for you anyway. I was thinking of filling in the default if the slot is NULL, but not erroring in case omitting it is rejected by NumPy. (I.e. explicitly tell NumPy: I know the default promoter is fine, even if you think it may not be.)
I don't think byte-swapping support is really worthwhile to bother here, but sure, if you want to add it you'll need it anyway, I guess.
| } | ||
|
|
||
| static int | ||
| unicode_to_quad_strided_loop_unaligned(PyArrayMethod_Context *context, char *const data[], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw you don't have to implement an unaligned version if you don't want to. NumPy will just make another cast/copy for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh maybe that's why in my earlier experiments I was almost never able to hit the unaligned loop execution. I thought maybe my system is sophisticated and they might be required for older systems.
Also @seberg is this true always, I mean in that case I can remove all the unaligned loops added but still keep the flag of NPY_METH_SUPPORTS_UNALIGNED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You must support unaligned for the quaddtype <-> quaddtype conversions, but not for any others (because for any other, NumPy can chain them to ensure alignment).
For most systems these days unaligned access just works (although it may be a bit slower, IIRC). But, UBSAN or so likes to shout at you anyway and also it is not true for all systems.
If you know that it is irrelevant for a system, you could probably only write the unaligned version and the compiler will be smart enough to optimize the extra memcpy's away (the other way works, but UBSAN...).
Either way, there is no big need to set NPY_METH_SUPPORTS_UNALIGNED, only for the "copy" one I enforced it in NumPy to ensure that unaligned arrays are guaranteed to work.
|
Cool so I think this PR is fine as some pointed out work can be useful in future if we support the byte-swapping (othewise we can refactor it there itself) Proceeding to merge it in! |
Not quite. As is, the tests should fail if you pass in a byteswapped unicode string (legacy dtype)? The resolve_dtypes function needs to ensure a native byte order for unicode. |
|
Oh sorry I mistakenly pushed that commit into a different branch (for byte casting support) Keeping the unaligned as you mentioned UBSAN complains about it so |
| npy_intp *view_offset) | ||
| { | ||
| if (!PyArray_ISNBO(given_descrs[0]->byteorder)) { | ||
| loop_descrs[0] = PyArray_DescrNewByteorder(given_descrs[0], NPY_NATIVE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating new one might be better than just throwing incompatibility error
| else { | ||
| Py_INCREF(given_descrs[0]); | ||
| loop_descrs[0] = given_descrs[0]; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm yeah this works, so fine to move on.
I am a bit surprised that (but not much) that I never made something like NPY_DT_CALL_ensure_canonical public? That seems like a pretty big oversight as it is a rather common need here.
(For simple builtins, canonical is the same as ensuring native byteorder)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also is there a case that PyArray_GetDefaultDescr gives the descriptor with native byte order?
I use this for float->quad casting and with '>f4' this seems to work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That resolve_descriptor is basically a general descriptor for every numeric->quad cast,
IDK why I did that (1 year ago) sound illogical now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PyArray_GetDefaultDescr will always be native and is defined for these non-parametric types.
So I think that would also be fine to use for sure. (There is one silly difference that doesn't even matter here, this path preserves metadata I suspect. But while metadata seems kinda useful these days, I don't think propagating it is useful.)
|
okay merging this in now! |
This PR introduces the support for
np.str_andstr(will work on bytes after this gets merged)