Slower string manipulation performance than CPython #7535

Closed
dlee992 opened this issue Nov 4, 2021 · 4 comments
Labels
discussion An issue requiring discussion performance - run time Performance issue occurring at run time. performance performance related issue

Comments

@dlee992
Contributor

dlee992 commented Nov 4, 2021

Feature request

After investigating Numba's usage and internals, I would like to promote Numba in my work group.
However, our existing Python business code often consists of user-defined functions that use Pandas DataFrames and string manipulation.
I know Numba is excellent at numeric work but not good at string handling.
Looking into Numba's internals, I see that it already supports the Python unicode type, but it is much slower than CPython.

I wonder:

  1. Will Numba improve string performance in the near future?
  2. If I modify the Numba source myself to speed up strings, building on the current unicode support, is there a theoretical upper limit on performance? Specifically, I think I would have to optimize each string operation individually (e.g., CPython's string find() implementation contains many shortcuts, a bloom filter, the KMP algorithm, and so on, while Numba just implements a brute-force _finder() to overload string find()). What else would be needed? A dedicated compilation optimization pass for string manipulation?
  3. Or, for much better performance, such as a ~10x speedup over CPython, would I have to rewrite all of the unicode code already in Numba?
  4. Can you recommend any existing Numba extensions or tools with better string support?
  5. Also, I know the Intel SDC project is under development; are there other Numba extensions with Pandas DataFrame support?

For example, I tested a small snippet of code:

[image: screenshot of the tested code snippet]
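
To illustrate the gap described in item 2 of the list above, here is a minimal pure-Python sketch (not the snippet from the screenshot) contrasting a brute-force search with a Boyer-Moore-Horspool search that skips ahead using a bad-character table. The names find_naive and find_horspool are hypothetical. Neither function is Numba's _finder() nor CPython's actual find() implementation, which is written in C and combines several techniques; the sketch only shows the kind of shortcut a skip table provides.

```python
def find_naive(text, pattern):
    """Brute-force search: try every alignment, compare char by char."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i
    return -1


def find_horspool(text, pattern):
    """Boyer-Moore-Horspool search using a bad-character shift table."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # For every character except the last one in the pattern, record the
    # distance from its last occurrence to the end of the pattern.
    shift = {ch: m - 1 - k for k, ch in enumerate(pattern[:-1])}
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        # Look at the text character under the pattern's last position.
        # Characters that never occur in the pattern allow a jump of m.
        i += shift.get(text[i + m - 1], m)
    return -1
```

Both functions return the same index as str.find for a given pair of strings; the Horspool variant simply visits fewer alignments when the text contains many characters that are absent from the pattern.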

@sklam sklam added needtriage discussion An issue requiring discussion performance performance related issue performance - run time Performance issue occurring at run time. labels Nov 4, 2021
@sklam
Member

sklam commented Nov 4, 2021

Thank you for your interest in Numba. Currently, the unicode support in Numba is primitive and does not perform well for applications that rely on string manipulation. IIRC, it is particularly bad on code that processes individual characters, such as the text_distance() code. Unfortunately, we don't have any immediate plans to optimize the string support.

As for the performance limitation, I think the reference counting operations are preventing optimizations. In CPython, many of the implementations can access the underlying buffer directly and bypass any reference-counted operations. Numba needs similar direct access to the chars.

Lastly, besides Intel SDC, I only know of bodo for pandas support. (ping @ehsantn)
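
As a rough illustration of the "direct access to the chars" point above, one possible workaround is to copy each string into a flat integer array of code points before entering compiled code, so the inner loop indexes a plain buffer instead of string objects. The sketch below assumes NumPy is installed and falls back to plain Python when Numba is not available. The names to_codepoints and edit_distance are hypothetical, and edit_distance is a generic Levenshtein distance, not the text_distance() code discussed in this thread. No speedup is claimed here; that would need to be measured.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # assumption: run as plain Python when Numba is absent
    def njit(func):
        return func


def to_codepoints(text):
    """Copy a str into a flat uint32 array of Unicode code points."""
    return np.frombuffer(text.encode("utf-32-le"), dtype=np.uint32)


@njit
def edit_distance(a, b):
    """Levenshtein distance between two code-point arrays."""
    n = len(b)
    prev = np.arange(n + 1, dtype=np.int64)  # distances for the empty prefix
    curr = np.empty(n + 1, dtype=np.int64)
    for i in range(1, len(a) + 1):
        curr[0] = i
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev, curr = curr, prev  # reuse the two rows
    return prev[n]
```

A call looks like edit_distance(to_codepoints("kitten"), to_codepoints("sitting")). The conversion itself happens in the interpreter, so this only pays off when the per-character work dominates the cost of encoding the strings.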

@dlee992
Contributor Author

dlee992 commented Nov 18, 2021

@sklam, @ehsantn, thanks.

Recently, I saw that SDC provides a str_ext implementation; the core idea is as follows:

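import numba
from numba import types
from numba.extending import overload_method

# Overload str.replace so that compiled code delegates to the interpreter.
@overload_method(types.UnicodeType, 'replace')
def str_replace_overload(in_str, old, new, count=-1):

    def _str_replace_impl(in_str, old, new, count=-1):
        # The objmode block runs in the Python interpreter; its result is
        # converted back to a Numba unicode_type on exit.
        with numba.objmode(out='unicode_type'):
            out = in_str.replace(old, new, count)
        return out

    return _str_replace_impl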

It seems SDC uses a with objmode block to call back into Python interpreter code.

But I don't know whether this kind of operation will unbox the Python unicode type to Numba's UnicodeType and box it back when calling replace or other str operations. Can this boost string performance, e.g., achieve performance similar to CPython's? It really confuses me, haha.

I also found some c.pyapi usage in the Numba source code. Does this call into ctypes.pythonapi or something like it? If we call back entirely into the Python C API, can we boost Numba's string performance?

Any comments are welcome, and if there is a Numba discussion group I could join, please let me know, haha. Thanks.

@github-actions

This issue is marked as stale as it has had no activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with any updates and confirm that this issue still needs to be addressed.

@github-actions github-actions bot added the stale Marker label for stale issues. label Dec 19, 2021
@gmarkall gmarkall removed needtriage stale Marker label for stale issues. labels Dec 20, 2021
@dlee992
Contributor Author

dlee992 commented Aug 15, 2023

Closing this. I think not many users care about string operation performance when using Numba; we focus more on numeric computing. Feel free to reopen it.

@dlee992 dlee992 closed this as completed Aug 15, 2023