Slower string manipulation performance than CPython #7535

dlee992 · 2021-11-04T08:29:09Z

Feature request

After investigating Numba's usage and internals, I want to promote Numba in my work group.
However, our existing Python business code often involves user-defined functions that work on Pandas DataFrames and do string manipulation.
I know Numba is excellent at numeric work but not good at strings.
I looked into Numba's internals: it already supports the Python unicode type, but it is much slower than CPython.

I wonder:

1. Will Numba improve string performance in the near future?
2. If I dive in and modify the Numba source myself to speed up strings, building on the current unicode support, is there a theoretical upper limit on performance? Specifically, I think I would have to optimize each string operation (e.g., CPython's string find() implementation contains many shortcuts, a bloom filter, the KMP algorithm and so on, while Numba just implements a brute-force _finder() to overload string find()). What else would be needed? A dedicated compilation optimization pass for string manipulation?
3. Or, to get much better performance, such as a ~10x speedup over CPython, would I have to rewrite all the unicode code already in Numba?
4. Or can you recommend existing Numba extensions or tools with better string support?

Besides, I know the Intel SDC project is under development. Are there other Numba extensions with Pandas DataFrame support?
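To illustrate the gap described above between a brute-force finder and a search that skips ahead, here is a minimal pure-Python sketch. The function names brute_force_find and horspool_find are hypothetical, and the second function is a plain Boyer-Moore-Horspool search, which is only one ingredient of the kind of shortcut CPython uses; it is not CPython's or Numba's actual code.

```python
def brute_force_find(text, pattern):
    """Try every alignment and compare character by character."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i
    return -1


def horspool_find(text, pattern):
    """Boyer-Moore-Horspool: on a mismatch, skip ahead using a shift table."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    # For each pattern character except the last, store its distance
    # from the end of the pattern (rightmost occurrence wins).
    shift = {}
    for k in range(m - 1):
        shift[pattern[k]] = m - 1 - k
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        # Shift by the entry for the text character under the pattern's
        # last position; characters not in the table allow a full skip of m.
        i += shift.get(text[i + m - 1], m)
    return -1


if __name__ == "__main__":
    print(brute_force_find("abracadabra", "cad"))
    print(horspool_find("abracadabra", "cad"))
```

Both functions return the same index as str.find(); the difference is that the second can skip up to len(pattern) positions per mismatch instead of one.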

For example, I tested a small snippet of code:

sklam · 2021-11-04T21:30:33Z

Thank you for your interest in Numba. Currently, the unicode support in Numba is primitive and does not perform well for applications that rely on string manipulation. IIRC, it is particularly bad on code that processes individual characters, such as the text_distance() code. Unfortunately, we don't have any immediate plans to optimize the string support.

As for the performance limitation, I think the reference-counting operations are preventing optimizations. In CPython, many of the implementations can access the underlying buffer directly and bypass any reference-counted operations. Numba needs similar direct access to the chars.
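One user-level way to get something like direct access to the characters is to convert each string to an array of code points before entering compiled code, so the inner loop indexes integers instead of creating one-character string objects. The sketch below assumes this workaround is acceptable for the workload. The names to_code_points and edit_distance are hypothetical, edit_distance is a generic Levenshtein distance standing in for the text_distance() code (which is not shown in this thread), and the ImportError fallback exists only so the sketch still runs without Numba.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: without Numba, run the same code as plain Python.
    def njit(func):
        return func


def to_code_points(s):
    """Encode a str as a little-endian uint32 array of Unicode code points."""
    return np.frombuffer(s.encode("utf-32-le"), dtype=np.dtype("<u4"))


@njit
def edit_distance(a, b):
    """Levenshtein distance between two code-point arrays (two-row DP)."""
    n, m = a.shape[0], b.shape[0]
    prev = np.empty(m + 1, dtype=np.int64)
    curr = np.empty(m + 1, dtype=np.int64)
    for j in range(m + 1):
        prev[j] = j
    for i in range(1, n + 1):
        curr[0] = i
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        # Reuse the two rows instead of allocating a new one per iteration.
        prev, curr = curr, prev
    return prev[m]


if __name__ == "__main__":
    print(edit_distance(to_code_points("kitten"), to_code_points("sitting")))
```

The conversion itself happens in the interpreter, so this only pays off when the compiled loop does enough work per string to outweigh the encoding cost.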

Lastly, besides Intel SDC, I only know of Bodo for pandas support. (ping @ehsantn)

dlee992 · 2021-11-18T10:17:46Z

@sklam, @ehsantn, thanks.

Recently, I saw that SDC provides a str_ext implementation. The core idea is as follows:

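import numba
from numba import types
from numba.extending import overload_method

@overload_method(types.UnicodeType, 'replace')
def str_replace_overload(in_str, old, new, count=-1):

    def _str_replace_impl(in_str, old, new, count=-1):
        # Run str.replace in the Python interpreter and return the result as a Numba unicode value
        with numba.objmode(out='unicode_type'):
            out = in_str.replace(old, new, count)
        return out

    return _str_replace_impl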

It seems SDC uses a with objmode block to call back into the Python interpreter.

But I don't know whether this approach unboxes the Python unicode object to a Numba UnicodeType and boxes it back whenever replace or another str operation is called. Can this boost string performance, e.g., reach performance similar to CPython's? It really confuses me, haha.

I also found some c.pyapi usage in the Numba source code. Does this call into ctypes.pythonapi or something like it? If we call back entirely into the Python C API, can we boost Numba's string performance?

Any comments are welcome, and if there is a Numba discussion group I could join, please let me know, haha. Thanks.

github-actions · 2021-12-19T01:52:07Z

This issue is marked as stale as it has had no activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with any updates and confirm that this issue still needs to be addressed.

dlee992 · 2023-08-15T14:11:30Z

Closing this. I think few users care about string operation performance when using Numba; we focus more on numeric computing. Feel free to reopen it.

sklam added the needtriage, discussion, performance, and performance - run time labels Nov 4, 2021

github-actions bot added the stale label Dec 19, 2021

gmarkall removed the needtriage and stale labels Dec 20, 2021

dlee992 closed this as completed Aug 15, 2023
