Add a string replace method to handle multiple search keys efficiently. #92433

shana · 2024-05-27T14:35:40Z

There are multiple places where we want to replace multiple characters or strings with one character (e.g. cleaning up paths), and calling replace for every source string is wasteful, we can do it all in one go.

Contributed with ❤️ by W4Games

There are multiple places where we want to replace multiple characters or strings with one character (e.g. cleaning up paths), and calling replace for every source string is wasteful, we can do it all in one go.

core/math/color.cpp

YeldhamDev · 2024-05-27T15:57:59Z

Would be nice if this was exposed to the script API.

shana · 2024-05-27T17:39:28Z

Ok, I fixed the off-by one error [insert meme here], moved things to static storage, and exposed the replace overload as replace_all to script (which feels logical to me, open to suggestions on the naming)

core/string/ustring.cpp

dalexeev · 2024-05-28T09:24:57Z

doc/classes/String.xml

@@ -744,6 +744,14 @@
 				Replaces all occurrences of [param what] inside the string with the given [param forwhat].
 			</description>
 		</method>
+		<method name="replace_all" qualifiers="const">


In my opinion, the name replace_multiple would be better, since otherwise the user might think that replace_all() is a version of replace(). But replace() already replaces all occurrences, not just the first one (like in some languages/libraries).

We should also think about naming if we decide to add a pair replacement method in the future (replace_*(["a", "b"], ["x", "y"]) or replace_*({"a": "x", "b": "y"})).

Might also be good to clarify that it does not replace recursively, i.e. "banana".replace_all(["b","na"],"x") gives "xaxx" not "xxx"

Co-authored-by: Aaron Pagano <136198872+aaronp64@users.noreply.github.com>

dalexeev · 2024-05-28T09:41:53Z

The efficiency looks obvious, but it would still be nice to do some benchmarks, if you don't mind.

Also, in my opinion, a separate method would be better (than an overload), since it reduces the differences between the core and the scripting API. It's easier for new contributors and search-and-replace changes.

core/string/ustring.cpp

AThousandShips · 2024-05-28T09:44:01Z

I agree on benchmarks primarily to aid in deciding where to use these methods, as the convenience of this is great, but important to not carelessly use it if it is slower (especially on input where only a few keys are present, or the keys are long, etc.)

AThousandShips · 2024-05-28T09:48:53Z

core/math/color.cpp

-	name = name.replace("'", "");
-	name = name.replace(".", "");
+	static Vector<String> to_replace({ " ", "-", "_", "'", "." });
+	name = name.replace(to_replace, "");


Since several of these are plain erasures of individual characters I wonder if we should add an erase method as well, or even instead, that takes a vector of char32_t instead, it would be faster and cleaner I'd say

Out of 10 cases 5 replace a single character with an empty string

As a general feature this is a great idea though as well

Having a dedicated char->char replacement would probably be very helpful for performance as it would avoid allocating new strings etc., it could also replace in-place if the output is restricted to not be \0. So we could add a erase_char(s) and replace_char(s) to do the brunt of that work, like the many replace("\\", "/") for paths, quite a few replace("\n", " "), replace("-", "_"), replace("_", "-"). As well as many replace("\n", ""), replace(" ", ""), etc. with erase_char(s)

I'll see about writing a PR for these two options later

Co-authored-by: A Thousand Ships <96648715+AThousandShips@users.noreply.github.com>

shana · 2024-05-28T14:43:14Z

Also, in my opinion, a separate method would be better (than an overload), since it reduces the differences between the core and the scripting API. It's easier for new contributors and search-and-replace changes.

This feels a bit backwards tbh. I only added the scripting API because someone requested it, not because it's needed for this. What this is attempting to solve is it being very inefficient to call replace n times in C++, essentially going over an entire string (which could be 10 or 10k characters) n times over. C++ can handle overloads just fine, and this replace pattern is used a lot everywhere. If I remove the scripting registration, then the overload is fine and perfectly adequate, so renaming this to an unwieldy long name because the scripting layer can't handle overloads feels a tad odd to me.

The efficiency looks obvious, but it would still be nice to do some benchmarks, if you don't mind.

It's not just speed, it's memory allocation - every time replace is called, the source string is duplicated for the return value. I'm not setup to do benchmarks in C++, but I ran some quickly over at quick-bench and this is the result:

shana · 2024-05-28T14:55:48Z

I agree on benchmarks primarily to aid in deciding where to use these methods, as the convenience of this is great, but important to not carelessly use it if it is slower (especially on input where only a few keys are present, or the keys are long, etc.)

Literally it can never be slower, basically only equal - one key is the equivalent of calling the single replace once, but two keys is faster and with less memory overhead than calling replace twice, by virtue of cutting out one of the loops (the input string only being processed once).

dalexeev · 2024-05-28T15:36:09Z

C++ can handle overloads just fine, and this replace pattern is used a lot everywhere. If I remove the scripting registration, then the overload is fine and perfectly, so renaming this to an unwieldy long name because the scripting layer can't handle overloads feels a tad odd to me.

I agree that this is highly debatable, but I have given practical reasons above. Many of us often use plain text and regular expressions to search in the codebase. And often the core API intended to be exposed for scripting is designed taking into account the limitations of the latter.

godot/core/string/ustring.h

Lines 313 to 314 in dbc6f2a

    
           String replace(const String &p_key, const String &p_with) const; 
        
           String replace(const char *p_key, const char *p_with) const;

These overloads are very similar to each other, they do the same thing: replace a substring with a substring (let's consider a character a substring too). It's just that the second one is more efficient and is not intended for scripting. While the new method is somewhat different: it replaces a set of substrings with a substring.

Here's an example where we use a separate method instead of an overload, even though find_char() isn't exposed:

godot/core/string/ustring.h

Lines 288 to 290 in dbc6f2a

    
           int find(const String &p_str, int p_from = 0) const; ///< return <0 if failed 
        
           int find(const char *p_str, int p_from = 0) const; ///< return <0 if failed 
        
           int find_char(const char32_t &p_char, int p_from = 0) const; ///< return <0 if failed

It's not just speed, it's memory allocation - every time replace is called, the source string is duplicated for the return value. I'm not setup to do benchmarks in C++, but I ran some quickly over at quick-bench and this is the result

I'm not sure that the benchmark is as applicable to Godot, since it does not use STL and the results may differ for the better or for the worse. I meant comparison for different source string lengths, number and length of keys, etc. But I don't insist on benchmarks if it's tedious for you.

shana · 2024-05-28T17:26:05Z

I'm not sure that the benchmark is as applicable to Godot, since it does not use STL and the results may differ for the better or for the worse. I meant comparison for different source string lengths, number and length of keys, etc. But I don't insist on benchmarks if it's tedious for you.

This benchmark is using the two versions of replace and calling them. I copy pasted the code directly, and the only difference is that the String class is the std implementation and not ours, but none of that matters for the algorithm performance. The core bulk of the logic is our code in both cases, only the find and substr implementations are the ones from std instead of our versions - and I'm pretty sure ours are not a magnitude faster than the ones in std, so the difference really doesn't matter.

What you see in that benchmark is the result of a replace of a bunch of keys on a very short string - you can run the benchmark yourself and put in an actual long string, and you'll see the massive difference the size of the input string makes to the overall performance of calling replace a bunch of times. Unfortunately, I don't have time or the set up required to do a thorough and detailed performance benchmark review of replace vs replace all. If someone wants to do that, that would be great, although I don't think the results are going to be different. I don't know if the test suite supports benchmarking? That would be nice to have builtin.

lyuma · 2024-05-28T17:39:17Z

What if we defer the decision to expose the method to scripting (and the name bikeshedding), and keep this as an internal core performance optimization?

GDScript users already can use RexEx.sub if they need multiple things replaced in one pass.

clayjohn · 2024-05-28T18:56:03Z

What if we defer the decision to expose the method to scripting (and the name bikeshedding), and keep this as an internal core performance optimization?

GDScript users already can use RexEx.sub if they need multiple things replaced in one pass.

Yes please. Let's not block performance optimizations so that we can bikeshed scripting API naming conventions.

As for the internal naming. I can see there being a slight benefit to having a unique name per overload. But this function already has overloads, so I don't think it is the right time to be establishing a new naming convention

aaronp64 · 2024-05-28T20:27:53Z

If the main goal of this change is performance optimization, I think it makes sense to benchmark using Godot's String class, ideally with different combinations of long/short keys, long/short original String, no/many replacements, etc. I think the use of CowData in Godot's String, and many std::string implementations using small string optimization, makes them different enough that they could have performance differences here (might not matter, but would be nice to know for sure).

Even with the std::string benchmark at https://quick-bench.com/q/oUdTdeQHp1PN1kGWefV6tXw0eJk, ReplaceAll seems to be slower with larger from strings compared to ReplaceOneAtATime. Changing "hello" to "hello hello hello hello hello hello hello hello hello hello" in both methods is enough to make ReplaceOneAtATime faster.

AThousandShips · 2024-05-28T21:21:27Z

For benchmarking I'd say it's important to compare replacing strings with almost all the sequences searched for not being found, compared to repeated replacements, there might be slowdowns because of the way it works there though I haven't tested, but it's a possible edge case where performance might be hurting.

clayjohn · 2024-05-29T01:34:12Z

It seems like the big difference with big datasets is that in the new approach handles not finding any matches worse. So a source string like:

": Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type"

Is much faster with the new method than with the old.

But inserting lorem ipsum in the middle makes it slower than the old method

: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.ModedsdsLorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type

Aaaaaand replacing String's find method with Godot's find method makes the new method take the lead again: https://quick-bench.com/q/uWnljUL9PkwmapILxlTTgJMGshM

It seems that Godot's find method is waaaaay slower than std's.

Original Benchmark with the lorem ipsum text from above:

Same as above, but using Godot's find method:

Tested with the "hello hello hello hello hello hello hello hello hello hello" string and the new method is also faster when compared against Godot's find instead of the one from the STL.

Here is another benchmark with an optimized version of find that makes the old replace almost as fast as this new replace https://quick-bench.com/q/Nlq8b3E0_7E_55p8Woyl6-s4w0w

Using the Lorem ipsum string from above
Blue: This PR
Pink: Godot's find
Green: STL's find
Purple: My optimized find

"Hello"

"hello hello hello hello hello hello hello hello hello hello"

Using the first string above (with a lot of matches)

Last one, Using the Lorem Ipsum version above, but using really short keys:

auto keys = std::vector<std::string>({ ": int",
		":",
		"-",
		"+",
		"*",
		".",
		"_",
		"|",
		":=",
		" -> void",
		" -> bool",
		" -> int",
		" -> PortType",
		" -> String",
		" -> Object" });

AThousandShips · 2024-05-29T08:03:21Z

It seems like the big difference with big datasets is that in the new approach handles not finding any matches worse.

This is what I suspected, I think optimization for searches would be best served by adding better underlying search mechanisms, and in some critical places using RegEx when possible etc.

Using a proper searcher, like a Knuth–Morris–Pratt or other pre-processed methods, when applicable would probably do the most for improvement, some reading here to start

The problem is that the approaches are pretty brute force, and struggles for partial matches etc.

This feature is great for usability, but to solve performance the underlying mechanisms need improving IMO

The above data shows this most critically:
Any method needs to be balanced and have both good best and worse case performance, as the input data is entirely unknown, so balance between many and few hits needs to be good

Fast in the right conditions, slow under the wrong, is worse than average in both

shana · 2024-05-30T17:52:09Z

@clayjohn Super interesting finds, thanks for digging into that!

I see other PRs happening around this, so feel free to close this one if there's going to be an equivalent but different improvement incoming from other PRs.

as the input data is entirely unknown

Not necessarily completely unknown; for example, when sanitizing filepaths, we know the input data is small, so the impact of no matches when doing a replace all is negligible, but the memory impact is equal or better, since we're doing a lot of keys and therefore have more likelihood of multiple matches.

clayjohn · 2024-05-30T18:07:25Z

Two more quick benchmarks

using the lorem ipsum (long string, few matches) with only single character keys

Using a shorter string with some matches "shader_name=clay+some-other*awesome--combo" with only single character keys

I think we can extract some general principles from the benchmarking above:

This PR tends to be better than vanilla replace when the source string is short
This PR is much better than vanilla replace when there are lots of matches
This PR scales worse than vanilla replace when the string is long and there are few matches

So I think we could safely use it all all cases where we have a sense that the source string is small (filepaths, variable names, etc.). But we should avoid it for long strings (shader source, script source, etc.)

shana requested review from a team as code owners May 27, 2024 14:35

Add a string replace method to handle multiple search keys efficiently.

5235392

There are multiple places where we want to replace multiple characters or strings with one character (e.g. cleaning up paths), and calling replace for every source string is wasteful, we can do it all in one go.

shana force-pushed the multi-string-replace branch from 3809ffc to 5235392 Compare May 27, 2024 14:41

shana requested a review from a team as a code owner May 27, 2024 14:41

AThousandShips added enhancement topic:core labels May 27, 2024

AThousandShips added this to the 4.x milestone May 27, 2024

lyuma reviewed May 27, 2024

View reviewed changes

core/math/color.cpp Outdated Show resolved Hide resolved

shana added 2 commits May 27, 2024 17:42

Fix off-by-one error and add a string replace test

76b6e87

Make string vectors to replace static so we're not allocating every time

60c35c4

shana requested a review from a team as a code owner May 27, 2024 16:48

Expose replace(Vector<String>, String) to script as replace_all

44235e5

shana requested a review from a team as a code owner May 27, 2024 17:38

Missed a spot

3afcedf

aaronp64 reviewed May 27, 2024

View reviewed changes

core/string/ustring.cpp Outdated Show resolved Hide resolved

dalexeev reviewed May 28, 2024

View reviewed changes

Make sure empty key strings are handled correctly.

ccb9bd4

Co-authored-by: Aaron Pagano <136198872+aaronp64@users.noreply.github.com>

AThousandShips reviewed May 28, 2024

View reviewed changes

core/string/ustring.cpp Outdated Show resolved Hide resolved

AThousandShips reviewed May 28, 2024

View reviewed changes

Update ustring.cpp

95d3161

Co-authored-by: A Thousand Ships <96648715+AThousandShips@users.noreply.github.com>

AThousandShips mentioned this pull request May 28, 2024

Add String::replace_char(s) methods for performance and convenience #92475

Open

AThousandShips mentioned this pull request May 30, 2024

[Core] Optimize String::replace methods #92546

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a string replace method to handle multiple search keys efficiently. #92433

Add a string replace method to handle multiple search keys efficiently. #92433

shana commented May 27, 2024

YeldhamDev commented May 27, 2024

shana commented May 27, 2024

dalexeev May 28, 2024

AThousandShips May 28, 2024

dalexeev commented May 28, 2024 •

edited

Loading

AThousandShips commented May 28, 2024

AThousandShips May 28, 2024 •

edited

Loading

AThousandShips May 28, 2024

shana commented May 28, 2024 •

edited

Loading

shana commented May 28, 2024

dalexeev commented May 28, 2024

shana commented May 28, 2024 •

edited

Loading

lyuma commented May 28, 2024

clayjohn commented May 28, 2024

aaronp64 commented May 28, 2024

AThousandShips commented May 28, 2024 •

edited

Loading

clayjohn commented May 29, 2024 •

edited

Loading

AThousandShips commented May 29, 2024 •

edited

Loading

shana commented May 30, 2024

clayjohn commented May 30, 2024

Add a string replace method to handle multiple search keys efficiently. #92433

Are you sure you want to change the base?

Add a string replace method to handle multiple search keys efficiently. #92433

Conversation

shana commented May 27, 2024

YeldhamDev commented May 27, 2024

shana commented May 27, 2024

dalexeev May 28, 2024

Choose a reason for hiding this comment

AThousandShips May 28, 2024

Choose a reason for hiding this comment

dalexeev commented May 28, 2024 • edited Loading

AThousandShips commented May 28, 2024

AThousandShips May 28, 2024 • edited Loading

Choose a reason for hiding this comment

AThousandShips May 28, 2024

Choose a reason for hiding this comment

shana commented May 28, 2024 • edited Loading

shana commented May 28, 2024

dalexeev commented May 28, 2024

shana commented May 28, 2024 • edited Loading

lyuma commented May 28, 2024

clayjohn commented May 28, 2024

aaronp64 commented May 28, 2024

AThousandShips commented May 28, 2024 • edited Loading

clayjohn commented May 29, 2024 • edited Loading

AThousandShips commented May 29, 2024 • edited Loading

shana commented May 30, 2024

clayjohn commented May 30, 2024

dalexeev commented May 28, 2024 •

edited

Loading

AThousandShips May 28, 2024 •

edited

Loading

shana commented May 28, 2024 •

edited

Loading

shana commented May 28, 2024 •

edited

Loading

AThousandShips commented May 28, 2024 •

edited

Loading

clayjohn commented May 29, 2024 •

edited

Loading

AThousandShips commented May 29, 2024 •

edited

Loading