Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String to int conversion #3937

Closed
wants to merge 17 commits into from
Closed

String to int conversion #3937

wants to merge 17 commits into from

Conversation

Alexander-Makaryev
Copy link
Contributor

This is the port of native implementation from cpython.

@Alexander-Makaryev Alexander-Makaryev changed the title [WIP] String to int conversion String to int conversion Apr 17, 2019
@stuartarchibald
Copy link
Contributor

@Alexander-Makaryev thanks for this PR, is it ready for review? If not, please just comment on here when you'd like it reviewed and the core devs will schedule it. Thanks!

@Alexander-Makaryev
Copy link
Contributor Author

@stuartarchibald yes, it is ready for review.
Another one (#3985) is ready too. It is done on the top of this branch and implementation is very close to this PR, so I think that it is a good idea to review both PRs by one developer at the same time.

@stuartarchibald
Copy link
Contributor

@Alexander-Makaryev great, thanks for confirming, I'll schedule them for review.

Copy link
Member

@sklam sklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I have only reviewed the organization of the code and not much of the C code. )
Instead of creating a ext_string_conversion, I'd suggest adding the feature into _helperlib so that the new C functions are exposed via the c_helpers. For reference, you can look at how numba_dict_new_minsize is implemented in _dictobject.c and exposed in _helpermod.c (with declmethod(dict_new_minsize);). Lastly, dictobject.py creates an intrinsic to reference it via a .get_or_insert_function(fnty, name='numba_dict_new_minsize')

setup.py Outdated
@@ -276,9 +276,13 @@ def check_file_at_path(path2file):
depends=['numba/_pymodule.h'],
include_dirs=["numba"])

ext_string_conversion = Extension(name='string_conversion_ext',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is making a top-level module. It should be nested inside numba. i.e. name='numba._string_conversion_ext'. But better yet is to build numba/_string_conversion.c into numba._helperlib.

numba/unicode.py Outdated
@@ -31,6 +31,9 @@
from numba.targets.hashing import _Py_hash_t
from numba.unsafe.bytes import memcpy_region

import llvmlite.binding as ll
import string_conversion_ext
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string_conversion_ext should not be a top-level module. See comment in setup.py

numba/unicode.py Outdated
@@ -810,3 +813,13 @@ def iternext_unicode(context, builder, sig, args, result):
# bump index for next cycle
nindex = cgutils.increment_index(builder, index)
builder.store(nindex, iterobj.index)


ll.add_symbol('str2int', string_conversion_ext.str2int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please reuse the machinery with _helperlib to expose C functions. The manual registration of str2int here will not work with ahead-of-time compiled code.

@stuartarchibald
Copy link
Contributor

Thanks for the PR. Further to @sklam's comments, having taken a quick look at this, I'm of the view that testing that is considerably more extensive needs writing ahead of re-review. Thanks.

@shssf
Copy link
Contributor

shssf commented May 24, 2019

@sklam is this approach did you mean?
@stuartarchibald I did changes I mentioned. The code is not working. Please help.

@stuartarchibald
Copy link
Contributor

@shssf Yes, the approach presented is far closer to that needed for use in Numba, thanks.

Following on from gitter chat, I've added this branch https://github.com/stuartarchibald/numba/tree/pr_3937 which has stuartarchibald@a9bf601 as a patch to fix and demo the code working:

./runtests.py numba.tests.test_unicode.TestUnicodeStr2Int.test_str2int_demo

there are still a large number of things that are working essentially by accident, but I figure it's probably easier to fix something that half works with a bunch of problems than something that doesn't. I also added a load of comments to the code to try and explain literally what is going on, this however doesn't cover commenting on the issues in the current code. Main issues at present are 1) assumption that strings are equivalent to char *, which they are not, Numba's internal representation handles unicode! :) 2) Most of the types need more thought to make it work everywhere. 3) Overloading int will need more thought in general.

Hope this helps, we can definitely discuss more.

@shssf
Copy link
Contributor

shssf commented May 25, 2019

@stuartarchibald Thank you very much!
Are we going to support kind-2 and kind-4 in string conversion procedure? If so, could you recommend where I can find good examples of digits represented in unicode kind-4 for tests?

@stuartarchibald
Copy link
Contributor

@shssf No problem. In terms of kind-2 and kind-4 strings, I think that yes this needs to work, consider the following:

from numba import njit

@njit
def foo():
    string = "🐍⚡123"
    part = string[2:]
    return int(part)

print(foo.py_func())
print(foo())

the above is something that could reasonably occur in practice but will be incorrectly handled by the current code. Further, I'm not convinced that the strto*(3) family produces sufficiently equivalent behaviour to that of the CPython implementation.

@shssf
Copy link
Contributor

shssf commented May 28, 2019

@stuartarchibald In this case part = string[2:] Numba will not support other digits like this https://ltl-taiwan.com/chinese-numbers/#chapter-1

@stuartarchibald
Copy link
Contributor

@shssf The purpose of the demonstration was to highlight that wider kind unicode representations of numbers that could be encoded as ASCII are possible and trivial to achieve.

@njit
def foo():
    string = "🐍⚡123" # Unicode kind-2
    part = string[2:] # This slice is also unicode kind-2
    return int(part) # the `int` call now has to handle a unicode kind-2 repr of "123"

Numba will not support other digits like this https://ltl-taiwan.com/chinese-numbers/#chapter-1

Correct, Numba has to reproduce exactly what CPython does, if CPython doesn't support it then Numba doesn't have to either.

@stuartarchibald
Copy link
Contributor

Just checking in on the state of this PR. @shssf @Alexander-Makaryev was wondering if you have time to address the changes above, if not, don't worry, one of the core developers can take over to do the fixes. Thanks.

@shssf
Copy link
Contributor

shssf commented Aug 12, 2019

@stuartarchibald Sorry we ran out of time to continue this.

@stuartarchibald
Copy link
Contributor

@shssf thanks, am going to close this for now, there's merge conflicts and additional effort required to fully support unicode strings. Please feel free to reopen if updates are made to address these concerns. Thanks again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Waiting on author Waiting for author to respond to review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants