Support for str.split and str.join #3678

seibert · 2019-01-14T23:08:18Z

As noted in #3674, support for str.split and str.join is pretty straightforward.

Remaining todos:

enable support for maxsplit argument of str.split with default value of -1 (not currently possible with @overload_method, so we should fix that first)
improve performance of split
fast implementation of str.join just for lists (rather than generic iterables) that makes two passes over the list to preallocate the full size of the resulting string
fast path string copy to use memcpy when character widths match
figure out remaining performance spread between Numba and CPython
update docs
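
The two-pass join idea in the todo list can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: `join_two_pass` is a hypothetical name, and a fixed 4-byte (UTF-32) buffer stands in for Numba's internal string storage so that every copy is a same-width block copy.

```python
WIDTH = 4  # assumption: every character stored as 4 bytes (UTF-32, little endian)


def _raw(s):
    # Fixed-width encoding, so len(_raw(s)) == WIDTH * len(s).
    return s.encode("utf-32-le")


def join_two_pass(sep, parts):
    """Join a list of strings, preallocating the full result first."""
    if not parts:
        return ""
    # Pass 1: compute the exact number of characters in the result.
    total = len(sep) * (len(parts) - 1)
    for p in parts:
        total += len(p)
    # Allocate the output once, at its final size.
    buf = bytearray(total * WIDTH)
    # Pass 2: copy each piece into place (the memcpy-like step).
    sep_raw = _raw(sep)
    pos = 0
    for i, p in enumerate(parts):
        if i > 0:
            buf[pos:pos + len(sep_raw)] = sep_raw
            pos += len(sep_raw)
        raw = _raw(p)
        buf[pos:pos + len(raw)] = raw
        pos += len(raw)
    return buf.decode("utf-32-le")
```

Restricting the input to a list (rather than a generic iterable) is what makes the first pass possible, since the pieces have to be visited twice.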

seibert · 2019-01-21T21:53:34Z

OK, so currently join seems to be 40% slower than CPython, which is probably within the accuracy of my ability to benchmark things. split is still 6x slower, which is a very large improvement from where we started, but still concerning. The split loop does need to construct new strings and append to lists, which are places where Numba's slower atomic reference counting can hurt performance, but it isn't clear that explains all of the difference.

seibert · 2019-01-21T22:08:23Z

Basically, I'm trying to understand if the statement parts.append(a[last:idx]) is resulting in code that is very suboptimal.

Update: The speed issue seems to be mostly due to the performance of constructing a string slice.
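
For readers following along, a split loop of the kind discussed here can be sketched in plain Python. This is a hypothetical reconstruction built around the quoted statement `parts.append(a[last:idx])`, not the implementation in this PR; `split_sketch` is an illustrative name and the `maxsplit` handling simply mirrors CPython's documented behaviour.

```python
def split_sketch(a, sep, maxsplit=-1):
    """Split `a` on a non-empty separator, like str.split(sep, maxsplit)."""
    if sep == "":
        raise ValueError("empty separator")
    parts = []
    last = 0
    idx = a.find(sep, last)
    # Each iteration allocates one new string (the slice) and appends it.
    while idx != -1 and (maxsplit < 0 or len(parts) < maxsplit):
        parts.append(a[last:idx])
        last = idx + len(sep)
        idx = a.find(sep, last)
    # Whatever remains after the final separator is the last piece.
    parts.append(a[last:])
    return parts
```

Every pass through the loop creates a fresh string for the slice, which is why the cost of constructing a string slice dominates the timing.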

seibert · 2019-01-22T13:55:53Z

After some additional experimentation and differential benchmarking, I'm now convinced that the speed difference is specifically the time required to allocate an empty string. The time required to fill the empty string with data seems to be trivial.

It isn't surprising to learn that the pymalloc allocator is much better for small allocations than the system allocator. Not sure what we can do about this, aside from experiment with enabling other memory allocators in Numba (which goes beyond the scope of this PR). I'll add a note to the string docs to warn people about the performance issues.

seibert · 2019-01-22T18:35:34Z

Perf difference is still not entirely understood, but I think now goes beyond the scope of this PR. Now just holding for default argument support in overload_method().

ehsantn · 2019-01-25T19:35:33Z

Some links below about string immutability issues we discussed. Looks like the aspects we discussed are the main considerations.

https://stackoverflow.com/questions/18042042/pythons-immutable-strings-and-their-slices
https://stackoverflow.com/questions/6742923/if-strings-are-immutable-in-net-then-why-does-substring-take-on-time
https://blogs.msdn.microsoft.com/ericlippert/2011/07/19/strings-immutability-and-persistence/

seibert · 2019-02-04T16:11:28Z

With #3704 merged now, maxsplit support has been enabled. This PR is ready for review now.

seibert · 2019-02-18T20:02:36Z

OK, I think I've addressed all of Stuart's comments, unless we want str.join(str) to have a faster implementation.

stuartarchibald

thanks for the fixes, there's just a couple more things to sort out.

stuartarchibald · 2019-02-19T12:13:17Z

numba/tests/test_unicode.py

+
+        # Handle empty separator exception
+        with self.assertRaises(TypingError) as raises:
+            cfunc('', [1,2,3])


Did flake8 not complain about this line?!

When I enabled flake8, I copied over Dask's error suppressions (as that seemed a good style to copy):

ignore =
    E20,        # Extra space in brackets
    E231,E241,  # Multiple spaces around ","
    E26,        # Comments
    E731,       # Assigning lambda expression
    E741,       # Ambiguous variable names
    W503,       # line break before binary operator
    W504,       # line break after binary operator
max-line-length = 120

stuartarchibald · 2019-02-19T13:59:21Z

numba/unicode.py

+
+            return parts
+        return split_impl
+    elif sep is None:


None is fine as a literal, but passing it as an argument (typed as types.nonetype) would fail, e.g.:

from numba import njit

def foo(x, y):
    return x.split(sep=y)

args = ('abacadae', None)
print('"%s"' % njit(foo)(*args))
print(foo(*args))

As a consequence of adding literals, both the literal and the as-argument typing need to be handled. This sort of pattern seems to work:

sep is None or isinstance(sep, types.NoneType) or getattr(sep, 'value', False) is None:
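
To see why the condition has three branches, here is a minimal sketch that uses stand-in classes instead of Numba's real types. `NoneType`, `Literal` and `UnicodeType` below are hypothetical placeholders defined only for this illustration, as is the helper name `sep_is_none`.

```python
class NoneType:
    """Hypothetical stand-in for the type of a None passed as an argument."""


class Literal:
    """Hypothetical stand-in for a literal type that carries a constant."""

    def __init__(self, value):
        self.value = value


class UnicodeType:
    """Hypothetical stand-in for a string type; it has no `value` attribute."""


def sep_is_none(sep):
    return (
        sep is None                                # argument omitted, default used
        or isinstance(sep, NoneType)               # None passed as a runtime argument
        or getattr(sep, "value", False) is None    # None written as a literal
    )
```

The `False` default in `getattr` matters: a type without a `value` attribute must not be mistaken for a literal None.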

stuartarchibald · 2019-02-19T14:15:45Z

numba/unicode.py

+            return parts
+        return split_impl
+    elif sep is None:
+        def split_whitespace_impl(a):


The @overload safety net needs extending to @overload_method. This function declaration has a signature that does not match the typing signature: maxsplit is not implemented and sep is missing as a kwarg. The following breaks as a result:

from numba import njit

def foo(x):
    return x.split(maxsplit=3)

args = ('\taa a a aa a aa a',)
print('"%s"' % njit(foo)(*args))
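
The kind of safety net being asked for can be approximated with the standard `inspect` module. The sketch below is an assumption about what such a check might look like, not Numba's actual implementation; `check_signatures_match`, `typing_stub` and the two `impl` functions are illustrative names.

```python
import inspect


def check_signatures_match(typing_func, impl_func):
    """Raise TypeError unless both functions share parameter names and defaults."""

    def describe(func):
        params = inspect.signature(func).parameters.values()
        return [(p.name, p.default) for p in params]

    expected = describe(typing_func)
    actual = describe(impl_func)
    if expected != actual:
        raise TypeError(
            "implementation signature %r does not match typing signature %r"
            % ([name for name, _ in actual], [name for name, _ in expected])
        )


# Typing-level signature: what callers are allowed to pass.
def typing_stub(a, sep=None, maxsplit=-1):
    pass


# Mismatched implementation, like the one flagged in this review comment.
def bad_impl(a):
    pass


# Implementation whose signature agrees with the typing signature.
def good_impl(a, sep=None, maxsplit=-1):
    pass
```

With a check like this applied at registration time, `check_signatures_match(typing_stub, bad_impl)` would fail early with a clear message rather than breaking when a caller passes `maxsplit=3`.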

stuartarchibald · 2019-02-19T14:30:39Z

numba/tests/test_unsafe_intrinsics.py

+
+
+class TestBytesIntrinsic(TestCase):
+    """Tests for numba.unsafe.tuple


Perhaps numba.unsafe.bytes?

good catch. typo should be fixed now

stuartarchibald · 2019-02-19T14:43:25Z

numba/unicode.py

+@njit
+def _is_whitespace(code_point):
+    # unrolling this for speed
+    return code_point == _WHITESPACE_SPACE or \


These probably need including: https://github.com/python/cpython/blob/5105483acb3aca318304bed056dcfd7e188fe4b5/Objects/unicodetype_db.h#L5996-L6031

ok, should be fixed
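
As a reference point, the full set of code points that CPython treats as whitespace can be written out as a lookup. The list below is my reading of the linked CPython table and should be checked against it; `is_whitespace` is an illustrative name, and a set lookup is used here instead of the unrolled comparisons in the PR.

```python
# Code points CPython's str.isspace() accepts (assumed from unicodetype_db.h).
_WHITESPACE_CODE_POINTS = frozenset(
    list(range(0x0009, 0x000E))      # tab, LF, VT, FF, CR
    + list(range(0x001C, 0x0020))    # file, group, record, unit separators
    + [0x0020, 0x0085, 0x00A0, 0x1680]
    + list(range(0x2000, 0x200B))    # en quad through hair space
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)


def is_whitespace(code_point):
    """Return True if the integer code point counts as whitespace."""
    return code_point in _WHITESPACE_CODE_POINTS
```

Note that U+200B (zero width space) is deliberately absent, which matches `'\u200b'.isspace()` being false in CPython.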

stuartarchibald · 2019-02-19T16:24:51Z

Thanks for your persistence in fixing these last few issues. I think conditional on CI pass this can be merged.

seibert · 2019-02-19T19:23:30Z

CI is passing (except the known Windows Py27 issue).

seibert added 3 commits January 14, 2019 12:48

fix existing flake8 failures and remove flake8 exclusions for these f…

31adc51

…iles

support for str.split

81f2020

str.join support

e1f4313

seibert added the 2 - In Progress label Jan 15, 2019

sklam added this to In Progress in Active Jan 15, 2019

Special cast join for lists of strings for massive speedup.

dfc55a6

stuartarchibald added this to the Numba 0.43 RC milestone Jan 16, 2019

seibert added 4 commits January 18, 2019 10:09

Add fast path memcpy when string character widths match

5ee14c9

Speed up split by 20x

5e7dbf9

add split, join to docs

f8b5973

Disable NRT on functions that should not change refcount

d753704

Fast path for slicing with stride 1

c927c97

Clarify string performance caveats

b266931

seibert added 2 commits January 23, 2019 08:33

Merge branch 'master' into str_split_join

8dda70c

Fix flake8 fail caused by merging with master

fd1bbac

seibert added 2 commits February 4, 2019 09:45

Merge master

96f2219

Add maxsplit support

70abb58

seibert added 3 - Ready for Review and removed 2 - In Progress labels Feb 4, 2019

seibert moved this from In Progress to Need Review in Active Feb 4, 2019

seibert changed the title ~~[WIP] Support for str.split and str.join~~ Support for str.split and str.join Feb 4, 2019

seibert mentioned this pull request Feb 4, 2019

UnicodeView type #3736

Closed

Fix flake8

28cedc4

stuartarchibald self-requested a review February 7, 2019 20:07

stuartarchibald removed the 3 - Ready for Review label Feb 8, 2019

sklam moved this from Need Review to Reviewed... discussion/fixes taking place in Active Feb 12, 2019

seibert added 8 commits February 18, 2019 11:24

Fix up some docs

c0d31d3

Add split on whitespace

ff7e835

merge master

93e047f

Only accept str.join on list<str>

bc2b9ad

Support join on standalone string

0d8d96e

check maxsplit as kwarg

802d9ed

Create memcpy_region intrinsic in numba.unsafe.bytes

9f99521

Fix flake8

1f0b6b6

seibert added the 4 - Waiting on reviewer label and removed the 4 - Waiting on author label Feb 18, 2019

Make exception message test more general to cover all platforms

78c489f

stuartarchibald reviewed Feb 19, 2019

seibert added 5 commits February 19, 2019 08:58

respond to review comments

1a14c1d

Fix error message test

592be35

raise typing error on non-integer maxsplit

feec704

missed testing non-integer maxsplit when sep is None

07f4691

flake8

b23b152

stuartarchibald approved these changes Feb 19, 2019

seibert added the 5 - Ready to merge label and removed the 4 - Waiting on reviewer label Feb 19, 2019

fix merge conflict with master

ccb00de

seibert merged commit f5b2867 into numba:master Feb 19, 2019

Active automation moved this from Reviewed... discussion/fixes taking place to Done Feb 19, 2019
