-
-
Notifications
You must be signed in to change notification settings - Fork 20.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a string replace method to handle multiple search keys efficiently. #92433
base: master
Are you sure you want to change the base?
Conversation
There are multiple places where we want to replace multiple characters or strings with one character (e.g. cleaning up paths), and calling replace for every source string is wasteful, we can do it all in one go.
3809ffc
to
5235392
Compare
Would be nice if this was exposed to the script API. |
Ok, I fixed the off-by one error [insert meme here], moved things to static storage, and exposed the replace overload as |
@@ -744,6 +744,14 @@ | |||
Replaces all occurrences of [param what] inside the string with the given [param forwhat]. | |||
</description> | |||
</method> | |||
<method name="replace_all" qualifiers="const"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, the name replace_multiple
would be better, since otherwise the user might think that replace_all()
is a version of replace()
. But replace()
already replaces all occurrences, not just the first one (like in some languages/libraries).
We should also think about naming if we decide to add a pair replacement method in the future (replace_*(["a", "b"], ["x", "y"])
or replace_*({"a": "x", "b": "y"})
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might also be good to clarify that it does not replace recursively, i.e. "banana".replace_all(["b","na"],"x")
gives "xaxx"
not "xxx"
Co-authored-by: Aaron Pagano <136198872+aaronp64@users.noreply.github.com>
The efficiency looks obvious, but it would still be nice to do some benchmarks, if you don't mind. Also, in my opinion, a separate method would be better (than an overload), since it reduces the differences between the core and the scripting API. It's easier for new contributors and search-and-replace changes. |
I agree on benchmarks primarily to aid in deciding where to use these methods, as the convenience of this is great, but important to not carelessly use it if it is slower (especially on input where only a few keys are present, or the keys are long, etc.) |
name = name.replace("'", ""); | ||
name = name.replace(".", ""); | ||
static Vector<String> to_replace({ " ", "-", "_", "'", "." }); | ||
name = name.replace(to_replace, ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since several of these are plain erasures of individual characters I wonder if we should add an erase method as well, or even instead, that takes a vector of char32_t
instead, it would be faster and cleaner I'd say
Out of 10 cases 5 replace a single character with an empty string
As a general feature this is a great idea though as well
Having a dedicated char->char
replacement would probably be very helpful for performance as it would avoid allocating new strings etc., it could also replace in-place if the output is restricted to not be \0
. So we could add a erase_char(s)
and replace_char(s)
to do the brunt of that work, like the many replace("\\", "/")
for paths, quite a few replace("\n", " ")
, replace("-", "_")
, replace("_", "-")
. As well as many replace("\n", "")
, replace(" ", "")
, etc. with erase_char(s)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll see about writing a PR for these two options later
Co-authored-by: A Thousand Ships <96648715+AThousandShips@users.noreply.github.com>
This feels a bit backwards tbh. I only added the scripting API because someone requested it, not because it's needed for this. What this is attempting to solve is it being very inefficient to call replace n times in C++, essentially going over an entire string (which could be 10 or 10k characters) n times over. C++ can handle overloads just fine, and this replace pattern is used a lot everywhere. If I remove the scripting registration, then the overload is fine and perfectly adequate, so renaming this to an unwieldy long name because the scripting layer can't handle overloads feels a tad odd to me.
It's not just speed, it's memory allocation - every time replace is called, the source string is duplicated for the return value. I'm not setup to do benchmarks in C++, but I ran some quickly over at quick-bench and this is the result: |
Literally it can never be slower, basically only equal - one key is the equivalent of calling the single replace once, but two keys is faster and with less memory overhead than calling replace twice, by virtue of cutting out one of the loops (the input string only being processed once). |
I agree that this is highly debatable, but I have given practical reasons above. Many of us often use plain text and regular expressions to search in the codebase. And often the core API intended to be exposed for scripting is designed taking into account the limitations of the latter. Lines 313 to 314 in dbc6f2a
These overloads are very similar to each other, they do the same thing: replace a substring with a substring (let's consider a character a substring too). It's just that the second one is more efficient and is not intended for scripting. While the new method is somewhat different: it replaces a set of substrings with a substring. Here's an example where we use a separate method instead of an overload, even though Lines 288 to 290 in dbc6f2a
I'm not sure that the benchmark is as applicable to Godot, since it does not use STL and the results may differ for the better or for the worse. I meant comparison for different source string lengths, number and length of keys, etc. But I don't insist on benchmarks if it's tedious for you. |
This benchmark is using the two versions of replace and calling them. I copy pasted the code directly, and the only difference is that the String class is the std implementation and not ours, but none of that matters for the algorithm performance. The core bulk of the logic is our code in both cases, only the What you see in that benchmark is the result of a |
What if we defer the decision to expose the method to scripting (and the name bikeshedding), and keep this as an internal core performance optimization? GDScript users already can use RexEx.sub if they need multiple things replaced in one pass. |
Yes please. Let's not block performance optimizations so that we can bikeshed scripting API naming conventions. As for the internal naming. I can see there being a slight benefit to having a unique name per overload. But this function already has overloads, so I don't think it is the right time to be establishing a new naming convention |
If the main goal of this change is performance optimization, I think it makes sense to benchmark using Godot's Even with the |
For benchmarking I'd say it's important to compare replacing strings with almost all the sequences searched for not being found, compared to repeated replacements, there might be slowdowns because of the way it works there though I haven't tested, but it's a possible edge case where performance might be hurting. |
It seems like the big difference with big datasets is that in the new approach handles not finding any matches worse. So a source string like:
Is much faster with the new method than with the old. But inserting lorem ipsum in the middle makes it slower than the old method
Aaaaaand replacing String's find method with Godot's find method makes the new method take the lead again: https://quick-bench.com/q/uWnljUL9PkwmapILxlTTgJMGshM It seems that Godot's find method is waaaaay slower than std's. Original Benchmark with the lorem ipsum text from above: Same as above, but using Godot's find method: Tested with the "hello hello hello hello hello hello hello hello hello hello" string and the new method is also faster when compared against Godot's find instead of the one from the STL. Here is another benchmark with an optimized version of find that makes the old replace almost as fast as this new replace https://quick-bench.com/q/Nlq8b3E0_7E_55p8Woyl6-s4w0w Using the Lorem ipsum string from above "hello hello hello hello hello hello hello hello hello hello" Using the first string above (with a lot of matches) Last one, Using the Lorem Ipsum version above, but using really short keys:
|
This is what I suspected, I think optimization for searches would be best served by adding better underlying search mechanisms, and in some critical places using RegEx when possible etc. Using a proper searcher, like a Knuth–Morris–Pratt or other pre-processed methods, when applicable would probably do the most for improvement, some reading here to start The problem is that the approaches are pretty brute force, and struggles for partial matches etc. This feature is great for usability, but to solve performance the underlying mechanisms need improving IMO The above data shows this most critically: Fast in the right conditions, slow under the wrong, is worse than average in both |
@clayjohn Super interesting finds, thanks for digging into that! I see other PRs happening around this, so feel free to close this one if there's going to be an equivalent but different improvement incoming from other PRs.
Not necessarily completely unknown; for example, when sanitizing filepaths, we know the input data is small, so the impact of no matches when doing a replace all is negligible, but the memory impact is equal or better, since we're doing a lot of keys and therefore have more likelihood of multiple matches. |
There are multiple places where we want to replace multiple characters or strings with one character (e.g. cleaning up paths), and calling replace for every source string is wasteful, we can do it all in one go.
Contributed with ❤️ by W4Games