Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for str.split and str.join #3678

Merged
merged 30 commits into from Feb 19, 2019
Merged

Conversation

seibert
Copy link
Contributor

@seibert seibert commented Jan 14, 2019

As noted in #3674, support for str.split and str.join is pretty straightforward.

Remaining todos:

  • enable support for maxsplit argument of str.split with default value of -1 (not currently possible with @overload_method, so we should fix that first)
  • improve performance of split
  • fast implementation of str.join just for lists (rather than generic iterables) that makes two passes over the list to preallocate the full size of the resulting string
  • fast path string copy to use memcpy when character widths match
  • figure out remaining performance spread between Numba and CPython
  • update docs

@sklam sklam added this to In Progress in Active Jan 15, 2019
@stuartarchibald stuartarchibald added this to the Numba 0.43 RC milestone Jan 16, 2019
@seibert
Copy link
Contributor Author

seibert commented Jan 21, 2019

OK, so currently join seems to be 40% slower than CPython, which is probably within the accuracy of my ability to benchmark things. split is still 6x slower, which is a very large improvement from where we started, but still concerning. The split loop does need to construct new strings and append to lists, which are places where Numba's slower atomic reference counting can hurt performance, but it isn't clear that explains all of the difference.

@seibert
Copy link
Contributor Author

seibert commented Jan 21, 2019

Basically, I'm trying to understand if the statement parts.append(a[last:idx]) is resulting in code that is very suboptimal.

Update: The speed issue seems to be mostly due to the performance of constructing a string slice.

@seibert
Copy link
Contributor Author

seibert commented Jan 22, 2019

After some additional experimentation and differential benchmarking, I'm now convinced that the speed difference is specifically the time required to allocate an empty string. The time required to fill the empty string with data seems to be trivial.

It isn't surprising to learn that the pymalloc allocator is much better for small allocations than the system allocator. Not sure what we can do about this, aside from experiment with enabling other memory allocators in Numba (which goes beyond the scope of this PR). I'll add a note to the string docs to warn people about the performance issues.

@seibert
Copy link
Contributor Author

seibert commented Jan 22, 2019

Perf difference is still not entirely understood, but I think now goes beyond the scope of this PR. Now just holding for default argument support in overload_method().

@ehsantn
Copy link
Collaborator

ehsantn commented Jan 25, 2019

@seibert
Copy link
Contributor Author

seibert commented Feb 4, 2019

With #3704 merged now, maxsplit support has been enabled. This PR is ready for review now.

@seibert seibert moved this from In Progress to Need Review in Active Feb 4, 2019
@seibert seibert changed the title [WIP] Support for str.split and str.join Support for str.split and str.join Feb 4, 2019
@seibert seibert mentioned this pull request Feb 4, 2019
@sklam sklam moved this from Need Review to Reviewed... discussion/fixes taking place in Active Feb 12, 2019
@seibert
Copy link
Contributor Author

seibert commented Feb 18, 2019

OK, I think I've addressed all of Stuart's comments, unless we want str.join(str) to have a faster implementation.

@seibert seibert added 4 - Waiting on reviewer Waiting for reviewer to respond to author and removed 4 - Waiting on author Waiting for author to respond to review labels Feb 18, 2019
Copy link
Contributor

@stuartarchibald stuartarchibald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the fixes, there's just a couple more things to sort out.


# Handle empty separator exception
with self.assertRaises(TypingError) as raises:
cfunc('', [1,2,3])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did flake8 not complain about this line ?!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I enabled flake8, I copied over Dask's error suppressions (as that seemed a good style to copy):

ignore =
    E20,   # Extra space in brackets
    E231,E241,  # Multiple spaces around ","
    E26,   # Comments
    E731,  # Assigning lambda expression
    E741,  # Ambiguous variable names
    W503,  # line break before binary operator
    W504,  # line break after binary operator
max-line-length = 120

numba/unicode.py Outdated

return parts
return split_impl
elif sep is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None is fine as a literal, but as an argument types.nonetype, this would fail. e.g.:

from numba import njit

def foo(x, y):
    return x.split(sep=y)

args= ('abacadae', None)
print('"%s"' % njit(foo)(*args))
print(foo(*args))

as a consequence of adding literals, both literal and as-arg typing is needed. This sort of pattern seems to be working:

sep is None or isinstance(sep, types.NoneType) or getattr(sep, 'value', False) is None:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

numba/unicode.py Outdated
return parts
return split_impl
elif sep is None:
def split_whitespace_impl(a):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The @overload safety net needs extending to @overload_method. This function declaration has a signature that does not match the typing signature. maxsplit is not implemented and sep is missing as a kwarg, this breaks as a result:

from numba import njit

def foo(x):
    return x.split(maxsplit=3)

args= ('\taa a a aa a aa a',)
print('"%s"' % njit(foo)(*args))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed



class TestBytesIntrinsic(TestCase):
"""Tests for numba.unsafe.tuple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps numba.unsafe.bytes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. typo should be fixed now

numba/unicode.py Outdated
@njit
def _is_whitespace(code_point):
# unrolling this for speed
return code_point == _WHITESPACE_SPACE or \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, should be fixed

@stuartarchibald
Copy link
Contributor

Thanks for your persistence in fixing these last few issues. I think conditional on CI pass this can be merged.

@seibert
Copy link
Contributor Author

seibert commented Feb 19, 2019

CI is passing (except the known Windows Py27 issue).

@seibert seibert added 5 - Ready to merge Review and testing done, is ready to merge and removed 4 - Waiting on reviewer Waiting for reviewer to respond to author labels Feb 19, 2019
@seibert seibert merged commit f5b2867 into numba:master Feb 19, 2019
Active automation moved this from Reviewed... discussion/fixes taking place to Done Feb 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to merge Review and testing done, is ready to merge
Projects
Development

Successfully merging this pull request may close these issues.

None yet

3 participants