Support for str.split and str.join #3678
OK, so currently |
Basically, I'm trying to understand if the statement Update: The speed issue seems to be mostly due to the performance of constructing a string slice. |
After some additional experimentation and differential benchmarking, I'm now convinced that the speed difference is specifically the time required to allocate an empty string. The time required to fill the empty string with data seems to be trivial. It isn't surprising to learn that the pymalloc allocator is much better for small allocations than the system allocator. Not sure what we can do about this, aside from experimenting with enabling other memory allocators in Numba (which goes beyond the scope of this PR). I'll add a note to the string docs to warn people about the performance issues.
Perf difference is still not entirely understood, but I think now goes beyond the scope of this PR. Now just holding for default argument support in |
Some links below about string immutability issues we discussed. Looks like the aspects we discussed are the main considerations. https://stackoverflow.com/questions/18042042/pythons-immutable-strings-and-their-slices
With #3704 merged now, |
OK, I think I've addressed all of Stuart's comments, unless we want |
thanks for the fixes, there's just a couple more things to sort out.
# Handle empty separator exception
with self.assertRaises(TypingError) as raises:
    cfunc('', [1,2,3])
Did flake8 not complain about this line?!
When I enabled flake8, I copied over Dask's error suppressions (as that seemed a good style to copy):
ignore =
    E20,        # Extra space in brackets
    E231,E241,  # Multiple spaces around ","
    E26,        # Comments
    E731,       # Assigning lambda expression
    E741,       # Ambiguous variable names
    W503,       # line break before binary operator
    W504,       # line break after binary operator
max-line-length = 120
numba/unicode.py
        return parts
    return split_impl
elif sep is None:
None is fine as a literal, but as an argument types.nonetype, this would fail. e.g.:
from numba import njit

def foo(x, y):
    return x.split(sep=y)

args = ('abacadae', None)
print('"%s"' % njit(foo)(*args))
print(foo(*args))
as a consequence of adding literals, both literal and as-arg typing is needed. This sort of pattern seems to be working:
sep is None or isinstance(sep, types.NoneType) or getattr(sep, 'value', False) is None:
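To make the three branches of that check concrete, here is a minimal pure-Python sketch. The `NoneType` and `Literal` classes below are hypothetical stand-ins that only mimic the shape of the type objects (a class to test with `isinstance`, and a wrapper exposing `.value`); they are not Numba's classes, and the helper name `sep_is_none` is made up for illustration.

```python
# Hypothetical stand-ins: they only mimic the shape of the type objects
# discussed above and are NOT Numba's real classes.
class NoneType:
    """Stand-in for the type of a None passed as a runtime argument."""


class Literal:
    """Stand-in for a literal type that carries its constant in `.value`."""

    def __init__(self, value):
        self.value = value


def sep_is_none(sep):
    # Same three-way test as the pattern quoted above.
    return (
        sep is None                              # the Python object None itself
        or isinstance(sep, NoneType)             # None typed as an argument
        or getattr(sep, 'value', False) is None  # literal wrapping None
    )


print(sep_is_none(None), sep_is_none(NoneType()), sep_is_none(Literal(None)))
print(sep_is_none(Literal(',')))
```

The `getattr(..., False)` default matters: an object with no `.value` attribute yields `False`, which is not `None`, so it falls through to the non-None branch.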
fixed
numba/unicode.py
        return parts
    return split_impl
elif sep is None:
    def split_whitespace_impl(a):
The @overload safety net needs extending to @overload_method. This function declaration has a signature that does not match the typing signature. maxsplit is not implemented and sep is missing as a kwarg, this breaks as a result:
from numba import njit

def foo(x):
    return x.split(maxsplit=3)

args = ('\taa a a aa a aa a',)
print('"%s"' % njit(foo)(*args))
fixed
class TestBytesIntrinsic(TestCase):
    """Tests for numba.unsafe.tuple
Perhaps numba.unsafe.bytes?
good catch. typo should be fixed now
numba/unicode.py
@njit
def _is_whitespace(code_point):
    # unrolling this for speed
    return code_point == _WHITESPACE_SPACE or \
These probably need including: https://github.com/python/cpython/blob/5105483acb3aca318304bed056dcfd7e188fe4b5/Objects/unicodetype_db.h#L5996-L6031
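One way to cross-check a hand-unrolled whitespace test against the interpreter is to enumerate the code points that `str.isspace()` accepts. This is a sketch under the assumption that `str.isspace()` on the running interpreter reflects the linked table; the names `WHITESPACE_CODE_POINTS` and `is_whitespace` are hypothetical and not part of this PR.

```python
import sys

# Every code point that this interpreter's str.isspace() treats as whitespace.
# The exact set depends on the Unicode version bundled with the interpreter.
WHITESPACE_CODE_POINTS = frozenset(
    cp for cp in range(sys.maxunicode + 1) if chr(cp).isspace()
)


def is_whitespace(code_point):
    """Set-membership version of the whitespace test (illustrative only)."""
    return code_point in WHITESPACE_CODE_POINTS


# Print the set as hex so it can be compared with the unrolled constants.
print(sorted(hex(cp) for cp in WHITESPACE_CODE_POINTS))
```

A test could then loop over this set and assert that the unrolled `_is_whitespace` agrees for every member.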
ok, should be fixed
Thanks for your persistence in fixing these last few issues. I think conditional on CI pass this can be merged.
CI is passing (except the known Windows Py27 issue).
As noted in #3674, support for str.split and str.join is pretty straightforward.

Remaining todos:
- maxsplit argument of str.split with default value of -1 (not currently possible with @overload_method, so we should fix that first)
- str.join just for lists (rather than generic iterables) that makes two passes over the list to preallocate the full size of the resulting string
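The two-pass join idea can be sketched in pure Python as follows. This is only an illustration of the approach, not the code in this PR: the function name `join_two_pass` is hypothetical, and a UTF-32 `bytearray` stands in for the preallocated string buffer so that every code point has a fixed width.

```python
def join_two_pass(sep, parts):
    """Pure-Python sketch of sep.join(parts) for a list of str."""
    if not parts:
        return ''
    width = 4  # bytes per code point in UTF-32
    # Pass 1: compute the final length so the buffer is allocated exactly once.
    total = sum(len(p) for p in parts) + len(sep) * (len(parts) - 1)
    buf = bytearray(total * width)
    sep_bytes = sep.encode('utf-32-le')
    # Pass 2: copy the separator and each piece into place.
    pos = 0
    for idx, p in enumerate(parts):
        if idx > 0:
            buf[pos:pos + len(sep_bytes)] = sep_bytes
            pos += len(sep_bytes)
        data = p.encode('utf-32-le')
        buf[pos:pos + len(data)] = data
        pos += len(data)
    return buf.decode('utf-32-le')


print(join_two_pass(', ', ['a', 'bb', 'ccc']) == ', '.join(['a', 'bb', 'ccc']))
```

Restricting the input to a list is what makes the first pass cheap: a generic iterable might be consumable only once, so its total length could not be computed ahead of the copy.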