Performance issue on basic string operations #742

Open
github-gauthier-perrod opened this issue Jun 7, 2024 · 3 comments
github-gauthier-perrod commented Jun 7, 2024

Hello,
First of all thank you for this amazing project.
We have a few questions regarding string manipulation in ONNX Runtime Extensions. Specifically, we are trying to incorporate simple string manipulations directly into our ONNX models, such as "string_upper" and "string_join".

However, we have observed a significant impact on performance. These operations appear to be unexpectedly expensive; they seem more expensive than matrix multiplication (our strings are always rather short, fewer than 30-40 characters). For instance, adding a "string_upper" operation on five features increases the inference time by a factor of 3 for a batch size of 1, and doubles the inference time for a batch size of 10 in our benchmarks.

Even worse, though expected: when the operation is not fully vectorised (for example, when we want to upper-case some features and lower-case others), those times add up practically linearly. Processing one vector with 10 features is almost 10 times faster than splitting that vector and applying the operation separately to each of the 10 features. This can be a real performance issue if we want to apply a different simple transform per feature.
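
For reference, a minimal sketch of the kind of one-op model involved (illustrative only; the actual benchmark is in the attached notebook below). It assumes the extensions op is named `StringUpper` in the `ai.onnx.contrib` domain and that `onnx`, `onnxruntime`, and `onnxruntime-extensions` are installed:

```python
# Minimal sketch (illustrative): a one-node ONNX model applying StringUpper
# from onnxruntime-extensions to a batch of strings. Assumes the op is named
# "StringUpper" in the "ai.onnx.contrib" domain, as registered by the library.
import numpy as np
from onnx import helper, TensorProto
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

node = helper.make_node("StringUpper", ["text"], ["upper"], domain="ai.onnx.contrib")
graph = helper.make_graph(
    [node],
    "string_upper_demo",
    [helper.make_tensor_value_info("text", TensorProto.STRING, [None])],
    [helper.make_tensor_value_info("upper", TensorProto.STRING, [None])],
)
model = helper.make_model(
    graph,
    opset_imports=[helper.make_opsetid("", 17), helper.make_opsetid("ai.onnx.contrib", 1)],
)

so = ort.SessionOptions()
so.register_custom_ops_library(get_library_path())   # load the extensions ops
sess = ort.InferenceSession(model.SerializeToString(), so)

batch = np.array(["hello", "world"], dtype=object)    # strings under 30-40 chars
print(sess.run(None, {"text": batch})[0])             # -> ['HELLO' 'WORLD']
```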

We generated a benchmark to illustrate this, which you can regenerate using the linked Python notebook.
Benchmark_example.ipynb.zip

Here is the profiling with the split/unsplit string-upper operation. We can see that a simple operation like toUpper takes more time by itself than the whole MLP:

(Screenshots, 2024-06-07: profiler traces for the split and unsplit StringUpper runs.)

We suspect that the high cost may be due to copying overhead. Is there a reason for all those copies? Are they actually necessary?

PS: We have verified that we are correctly using the en_US.utf8 locale.

Have you encountered this issue before? Is this performance impact expected? Could you provide any insights or recommendations on how to optimize these operations?
Thanks a lot

wenbingl (Member) commented Jun 7, 2024

I think most of the time was spent on Python->C++->Python conversion and on new/delete of objects in C++.
The copying you mentioned actually creates an output object, which is then re-used later in the loop.
Heavy use of string manipulation was not thoroughly considered during development, and the implementation is straightforward C++. It would require fine-grained CPU profiling to see how we can improve efficiency, which may lead to a more complicated C++ implementation.

Is there a real case where a model needs more string operations than MatMul and other operations, or is it just a test?

FYI, the input strings should be UTF-8 encoded.
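
One way to test the conversion-overhead hypothesis is to run the same string batch through a pass-through Identity model, so that everything measured is session plus Python/C++ string conversion cost rather than the op itself. A minimal sketch using only standard ONNX/ORT APIs:

```python
# Sketch: isolate session + Python<->C++ string conversion overhead with a
# pass-through Identity model (Identity supports tensor(string)); any time
# measured here is overhead, since no real string work is done.
import numpy as np
from onnx import helper, TensorProto
import onnxruntime as ort

node = helper.make_node("Identity", ["text"], ["out"])
graph = helper.make_graph(
    [node],
    "string_identity",
    [helper.make_tensor_value_info("text", TensorProto.STRING, [None])],
    [helper.make_tensor_value_info("out", TensorProto.STRING, [None])],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])

sess = ort.InferenceSession(model.SerializeToString())
batch = np.array(["hello"] * 10, dtype=object)
sess.run(None, {"text": batch})   # time this call: it is pure overhead
```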

@github-louis-fruleux

Hello @wenbingl, sorry for the delayed answer.

Actually, this is something we are experiencing in production: string preprocessing takes 70% of the total CPU time (with unstack, upper, and stack operations).
What bothers us is that the string_upper ONNX operator seems slower than Python's equivalent str.upper().

Our issue is in Scala but seems reproducible in Python, so I will just give you a minimal Python example.

We ran two benchmarks, one doing str.upper() in Python, the other letting the string_upper ONNX operator do it, and got a substantial difference in performance:

Execution time (Python str.upper()): 8.37116679904284 +/- 1.2217764614101005 us
Execution time (ONNX string_upper): 35.20608749677194 +/- 1.7959354508589989 us

(here is the benchmark reproducer)
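
For context, a hedged sketch of what such a benchmark could look like (illustrative, not the exact reproducer; it assumes `sess` is an InferenceSession over a one-node StringUpper model with an input named "text", as in the sketch earlier in the thread):

```python
# Hedged sketch of the two benchmarks (illustrative, not the exact reproducer).
# Assumes `sess` is an InferenceSession over a one-node StringUpper model
# with a string input named "text" (see the sketch earlier in the thread).
import timeit
import numpy as np

strings = ["some short feature value"] * 10          # short strings, as in our case
batch = np.array(strings, dtype=object)
n = 10_000

py_time = timeit.timeit(lambda: [s.upper() for s in strings], number=n)
ort_time = timeit.timeit(lambda: sess.run(None, {"text": batch}), number=n)

print(f"Python str.upper(): {py_time / n * 1e6:.2f} us per call")
print(f"ONNX StringUpper:   {ort_time / n * 1e6:.2f} us per call")
```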

Regarding performance, I could take a look with some profiling of the C++ code if you think this could be a valuable contribution!
Do you know of anyone who has had similar issues? Are we missing something obvious (such as string encoding)?

Thanks for your time and help

@wenbingl (Member)

> Regarding performance, I could take a look with some profiling of the C++ code if you think this could be a valuable contribution!

Yes, C++ profiling would be very helpful to see how much time is spent in the upper function itself versus the ORT session. Then we can decide on next steps.
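
As a first pass before any native profiling, ORT's built-in profiler already reports per-node timings from Python; a minimal sketch (the model path here is hypothetical):

```python
# Sketch: ORT's built-in profiler gives per-node timings before any native
# profiling. The model path here is hypothetical.
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

so = ort.SessionOptions()
so.enable_profiling = True
so.register_custom_ops_library(get_library_path())
sess = ort.InferenceSession("model_with_string_upper.onnx", so)

# ... run the benchmark loop against `sess` here ...

trace_file = sess.end_profiling()   # Chrome-trace JSON with per-op timings
print("profile written to", trace_file)
```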
