New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String to int conversion #3937
String to int conversion #3937
Conversation
@Alexander-Makaryev thanks for this PR, is it ready for review? If not, please just comment on here when you'd like it reviewed and the core devs will schedule it. Thanks! |
@stuartarchibald yes, it is ready for review. |
@Alexander-Makaryev great, thanks for confirming, I'll schedule them for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I have only reviewed the organization of the code and not much of the C code. )
Instead of creating a ext_string_conversion
, I'd suggest adding the feature into _helperlib
so that the new C functions are exposed via the c_helpers
. For reference, you can look at how numba_dict_new_minsize
is implemented in _dictobject.c
and exposed in _helpermod.c
(with declmethod(dict_new_minsize);
). Lastly, dictobject.py
creates an intrinsic to reference it via a .get_or_insert_function(fnty, name='numba_dict_new_minsize')
setup.py
Outdated
@@ -276,9 +276,13 @@ def check_file_at_path(path2file): | |||
depends=['numba/_pymodule.h'], | |||
include_dirs=["numba"]) | |||
|
|||
ext_string_conversion = Extension(name='string_conversion_ext', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is making a top-level module. It should be nested inside numba. i.e. name='numba._string_conversion_ext'
. But better yet is to build numba/_string_conversion.c
into numba._helperlib
.
numba/unicode.py
Outdated
@@ -31,6 +31,9 @@ | |||
from numba.targets.hashing import _Py_hash_t | |||
from numba.unsafe.bytes import memcpy_region | |||
|
|||
import llvmlite.binding as ll | |||
import string_conversion_ext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string_conversion_ext
should not be a top-level module. See comment in setup.py
numba/unicode.py
Outdated
@@ -810,3 +813,13 @@ def iternext_unicode(context, builder, sig, args, result): | |||
# bump index for next cycle | |||
nindex = cgutils.increment_index(builder, index) | |||
builder.store(nindex, iterobj.index) | |||
|
|||
|
|||
ll.add_symbol('str2int', string_conversion_ext.str2int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please reuse the machinery with _helperlib
to expose C functions. The manual registration of str2int
here will not work with ahead-of-time compiled code.
Thanks for the PR. Further to @sklam's comments, having taken a quick look at this, I'm of the view that testing that is considerably more extensive needs writing ahead of re-review. Thanks. |
@sklam is this approach did you mean? |
@shssf Yes, the approach presented is far closer to that needed for use in Numba, thanks. Following on from gitter chat, I've added this branch https://github.com/stuartarchibald/numba/tree/pr_3937 which has stuartarchibald@a9bf601 as a patch to fix and demo the code working:
there are still a large number of things that are working essentially by accident, but I figure it's probably easier to fix something that half works with a bunch of problems than something that doesn't. I also added a load of comments to the code to try and explain literally what is going on, this however doesn't cover commenting on the issues in the current code. Main issues at present are 1) assumption that strings are equivalent to Hope this helps, we can definitely discuss more. |
@stuartarchibald Thank you very much! |
@shssf No problem. In terms of kind-2 and kind-4 strings, I think that yes this needs to work, consider the following: from numba import njit
@njit
def foo():
string = "🐍⚡123"
part = string[2:]
return int(part)
print(foo.py_func())
print(foo()) the above is something that could reasonably occur in practice but will be incorrectly handled by the current code. Further, I'm not convinced that the |
@stuartarchibald In this case |
@shssf The purpose of the demonstration was to highlight that wider kind unicode representations of numbers that could be encoded as ASCII are possible and trivial to achieve. @njit
def foo():
string = "🐍⚡123" # Unicode kind-2
part = string[2:] # This slice is also unicode kind-2
return int(part) # the `int` call now has to handle a unicode kind-2 repr of "123"
Correct, Numba has to reproduce exactly what CPython does, if CPython doesn't support it then Numba doesn't have to either. |
Just checking in on the state of this PR. @shssf @Alexander-Makaryev was wondering if you have time to address the changes above, if not, don't worry, one of the core developers can take over to do the fixes. Thanks. |
@stuartarchibald Sorry we ran out of time to continue this. |
@shssf thanks, am going to close this for now, there's merge conflicts and additional effort required to fully support unicode strings. Please feel free to reopen if updates are made to address these concerns. Thanks again. |
This is the port of native implementation from cpython.