Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a string replace method to handle multiple search keys efficiently. #92433

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

shana
Copy link
Contributor

@shana shana commented May 27, 2024

There are multiple places where we want to replace multiple characters or strings with one character (e.g. cleaning up paths), and calling replace for every source string is wasteful, we can do it all in one go.

Contributed with ❤️ by W4Games

@shana shana requested review from a team as code owners May 27, 2024 14:35
There are multiple places where we want to replace multiple characters or strings
with one character (e.g. cleaning up paths), and calling replace for every source
string is wasteful, we can do it all in one go.
@shana shana requested a review from a team as a code owner May 27, 2024 14:41
@AThousandShips AThousandShips added this to the 4.x milestone May 27, 2024
core/math/color.cpp Outdated Show resolved Hide resolved
@YeldhamDev
Copy link
Member

Would be nice if this was exposed to the script API.

@shana shana requested a review from a team as a code owner May 27, 2024 16:48
@shana shana requested a review from a team as a code owner May 27, 2024 17:38
@shana
Copy link
Contributor Author

shana commented May 27, 2024

Ok, I fixed the off-by one error [insert meme here], moved things to static storage, and exposed the replace overload as replace_all to script (which feels logical to me, open to suggestions on the naming)

core/string/ustring.cpp Outdated Show resolved Hide resolved
@@ -744,6 +744,14 @@
Replaces all occurrences of [param what] inside the string with the given [param forwhat].
</description>
</method>
<method name="replace_all" qualifiers="const">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, the name replace_multiple would be better, since otherwise the user might think that replace_all() is a version of replace(). But replace() already replaces all occurrences, not just the first one (like in some languages/libraries).

We should also think about naming if we decide to add a pair replacement method in the future (replace_*(["a", "b"], ["x", "y"]) or replace_*({"a": "x", "b": "y"})).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might also be good to clarify that it does not replace recursively, i.e. "banana".replace_all(["b","na"],"x") gives "xaxx" not "xxx"

Co-authored-by: Aaron Pagano <136198872+aaronp64@users.noreply.github.com>
@dalexeev
Copy link
Member

dalexeev commented May 28, 2024

The efficiency looks obvious, but it would still be nice to do some benchmarks, if you don't mind.

Also, in my opinion, a separate method would be better (than an overload), since it reduces the differences between the core and the scripting API. It's easier for new contributors and search-and-replace changes.

@AThousandShips
Copy link
Member

I agree on benchmarks primarily to aid in deciding where to use these methods, as the convenience of this is great, but important to not carelessly use it if it is slower (especially on input where only a few keys are present, or the keys are long, etc.)

name = name.replace("'", "");
name = name.replace(".", "");
static Vector<String> to_replace({ " ", "-", "_", "'", "." });
name = name.replace(to_replace, "");
Copy link
Member

@AThousandShips AThousandShips May 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since several of these are plain erasures of individual characters I wonder if we should add an erase method as well, or even instead, that takes a vector of char32_t instead, it would be faster and cleaner I'd say

Out of 10 cases 5 replace a single character with an empty string

As a general feature this is a great idea though as well

Having a dedicated char->char replacement would probably be very helpful for performance as it would avoid allocating new strings etc., it could also replace in-place if the output is restricted to not be \0. So we could add a erase_char(s) and replace_char(s) to do the brunt of that work, like the many replace("\\", "/") for paths, quite a few replace("\n", " "), replace("-", "_"), replace("_", "-"). As well as many replace("\n", ""), replace(" ", ""), etc. with erase_char(s)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll see about writing a PR for these two options later

Co-authored-by: A Thousand Ships <96648715+AThousandShips@users.noreply.github.com>
@shana
Copy link
Contributor Author

shana commented May 28, 2024

Also, in my opinion, a separate method would be better (than an overload), since it reduces the differences between the core and the scripting API. It's easier for new contributors and search-and-replace changes.

This feels a bit backwards tbh. I only added the scripting API because someone requested it, not because it's needed for this. What this is attempting to solve is it being very inefficient to call replace n times in C++, essentially going over an entire string (which could be 10 or 10k characters) n times over. C++ can handle overloads just fine, and this replace pattern is used a lot everywhere. If I remove the scripting registration, then the overload is fine and perfectly adequate, so renaming this to an unwieldy long name because the scripting layer can't handle overloads feels a tad odd to me.

The efficiency looks obvious, but it would still be nice to do some benchmarks, if you don't mind.

It's not just speed, it's memory allocation - every time replace is called, the source string is duplicated for the return value. I'm not setup to do benchmarks in C++, but I ran some quickly over at quick-bench and this is the result:

oUdTdeQHp1PN1kGWefV6tXw0eJk

@shana
Copy link
Contributor Author

shana commented May 28, 2024

I agree on benchmarks primarily to aid in deciding where to use these methods, as the convenience of this is great, but important to not carelessly use it if it is slower (especially on input where only a few keys are present, or the keys are long, etc.)

Literally it can never be slower, basically only equal - one key is the equivalent of calling the single replace once, but two keys is faster and with less memory overhead than calling replace twice, by virtue of cutting out one of the loops (the input string only being processed once).

@dalexeev
Copy link
Member

C++ can handle overloads just fine, and this replace pattern is used a lot everywhere. If I remove the scripting registration, then the overload is fine and perfectly, so renaming this to an unwieldy long name because the scripting layer can't handle overloads feels a tad odd to me.

I agree that this is highly debatable, but I have given practical reasons above. Many of us often use plain text and regular expressions to search in the codebase. And often the core API intended to be exposed for scripting is designed taking into account the limitations of the latter.

godot/core/string/ustring.h

Lines 313 to 314 in dbc6f2a

String replace(const String &p_key, const String &p_with) const;
String replace(const char *p_key, const char *p_with) const;

These overloads are very similar to each other, they do the same thing: replace a substring with a substring (let's consider a character a substring too). It's just that the second one is more efficient and is not intended for scripting. While the new method is somewhat different: it replaces a set of substrings with a substring.

Here's an example where we use a separate method instead of an overload, even though find_char() isn't exposed:

godot/core/string/ustring.h

Lines 288 to 290 in dbc6f2a

int find(const String &p_str, int p_from = 0) const; ///< return <0 if failed
int find(const char *p_str, int p_from = 0) const; ///< return <0 if failed
int find_char(const char32_t &p_char, int p_from = 0) const; ///< return <0 if failed

It's not just speed, it's memory allocation - every time replace is called, the source string is duplicated for the return value. I'm not setup to do benchmarks in C++, but I ran some quickly over at quick-bench and this is the result

I'm not sure that the benchmark is as applicable to Godot, since it does not use STL and the results may differ for the better or for the worse. I meant comparison for different source string lengths, number and length of keys, etc. But I don't insist on benchmarks if it's tedious for you.

@shana
Copy link
Contributor Author

shana commented May 28, 2024

I'm not sure that the benchmark is as applicable to Godot, since it does not use STL and the results may differ for the better or for the worse. I meant comparison for different source string lengths, number and length of keys, etc. But I don't insist on benchmarks if it's tedious for you.

This benchmark is using the two versions of replace and calling them. I copy pasted the code directly, and the only difference is that the String class is the std implementation and not ours, but none of that matters for the algorithm performance. The core bulk of the logic is our code in both cases, only the find and substr implementations are the ones from std instead of our versions - and I'm pretty sure ours are not a magnitude faster than the ones in std, so the difference really doesn't matter.

What you see in that benchmark is the result of a replace of a bunch of keys on a very short string - you can run the benchmark yourself and put in an actual long string, and you'll see the massive difference the size of the input string makes to the overall performance of calling replace a bunch of times. Unfortunately, I don't have time or the set up required to do a thorough and detailed performance benchmark review of replace vs replace all. If someone wants to do that, that would be great, although I don't think the results are going to be different. I don't know if the test suite supports benchmarking? That would be nice to have builtin.

@lyuma
Copy link
Contributor

lyuma commented May 28, 2024

What if we defer the decision to expose the method to scripting (and the name bikeshedding), and keep this as an internal core performance optimization?

GDScript users already can use RexEx.sub if they need multiple things replaced in one pass.

@clayjohn
Copy link
Member

What if we defer the decision to expose the method to scripting (and the name bikeshedding), and keep this as an internal core performance optimization?

GDScript users already can use RexEx.sub if they need multiple things replaced in one pass.

Yes please. Let's not block performance optimizations so that we can bikeshed scripting API naming conventions.

As for the internal naming. I can see there being a slight benefit to having a unique name per overload. But this function already has overloads, so I don't think it is the right time to be establishing a new naming convention

@aaronp64
Copy link
Contributor

If the main goal of this change is performance optimization, I think it makes sense to benchmark using Godot's String class, ideally with different combinations of long/short keys, long/short original String, no/many replacements, etc. I think the use of CowData in Godot's String, and many std::string implementations using small string optimization, makes them different enough that they could have performance differences here (might not matter, but would be nice to know for sure).

Even with the std::string benchmark at https://quick-bench.com/q/oUdTdeQHp1PN1kGWefV6tXw0eJk, ReplaceAll seems to be slower with larger from strings compared to ReplaceOneAtATime. Changing "hello" to "hello hello hello hello hello hello hello hello hello hello" in both methods is enough to make ReplaceOneAtATime faster.

@AThousandShips
Copy link
Member

AThousandShips commented May 28, 2024

For benchmarking I'd say it's important to compare replacing strings with almost all the sequences searched for not being found, compared to repeated replacements, there might be slowdowns because of the way it works there though I haven't tested, but it's a possible edge case where performance might be hurting.

@clayjohn
Copy link
Member

clayjohn commented May 29, 2024

It seems like the big difference with big datasets is that in the new approach handles not finding any matches worse. So a source string like:

": Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type"

Is much faster with the new method than with the old.

But inserting lorem ipsum in the middle makes it slower than the old method

: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.ModedsdsLorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type: Shader.Modedsds : VisualShader.Type

Aaaaaand replacing String's find method with Godot's find method makes the new method take the lead again: https://quick-bench.com/q/uWnljUL9PkwmapILxlTTgJMGshM

It seems that Godot's find method is waaaaay slower than std's.

Original Benchmark with the lorem ipsum text from above:
Screenshot from 2024-05-28 18-30-49

Same as above, but using Godot's find method:
Screenshot from 2024-05-28 18-31-05

Tested with the "hello hello hello hello hello hello hello hello hello hello" string and the new method is also faster when compared against Godot's find instead of the one from the STL.

Here is another benchmark with an optimized version of find that makes the old replace almost as fast as this new replace https://quick-bench.com/q/Nlq8b3E0_7E_55p8Woyl6-s4w0w

Using the Lorem ipsum string from above
Blue: This PR
Pink: Godot's find
Green: STL's find
Purple: My optimized find
image

"Hello"
image

"hello hello hello hello hello hello hello hello hello hello"
Screenshot from 2024-05-28 20-07-47

Using the first string above (with a lot of matches)
image

Last one, Using the Lorem Ipsum version above, but using really short keys:

auto keys = std::vector<std::string>({ ": int",
		":",
		"-",
		"+",
		"*",
		".",
		"_",
		"|",
		":=",
		" -> void",
		" -> bool",
		" -> int",
		" -> PortType",
		" -> String",
		" -> Object" });

image

@AThousandShips
Copy link
Member

AThousandShips commented May 29, 2024

It seems like the big difference with big datasets is that in the new approach handles not finding any matches worse.

This is what I suspected, I think optimization for searches would be best served by adding better underlying search mechanisms, and in some critical places using RegEx when possible etc.

Using a proper searcher, like a Knuth–Morris–Pratt or other pre-processed methods, when applicable would probably do the most for improvement, some reading here to start

The problem is that the approaches are pretty brute force, and struggles for partial matches etc.

This feature is great for usability, but to solve performance the underlying mechanisms need improving IMO

The above data shows this most critically:
Any method needs to be balanced and have both good best and worse case performance, as the input data is entirely unknown, so balance between many and few hits needs to be good

Fast in the right conditions, slow under the wrong, is worse than average in both

@shana
Copy link
Contributor Author

shana commented May 30, 2024

@clayjohn Super interesting finds, thanks for digging into that!

I see other PRs happening around this, so feel free to close this one if there's going to be an equivalent but different improvement incoming from other PRs.

as the input data is entirely unknown

Not necessarily completely unknown; for example, when sanitizing filepaths, we know the input data is small, so the impact of no matches when doing a replace all is negligible, but the memory impact is equal or better, since we're doing a lot of keys and therefore have more likelihood of multiple matches.

@clayjohn
Copy link
Member

Two more quick benchmarks

  1. using the lorem ipsum (long string, few matches) with only single character keys
Screenshot 2024-05-30 at 10 58 03 AM
  1. Using a shorter string with some matches "shader_name=clay+some-other*awesome--combo" with only single character keys
Screenshot 2024-05-30 at 11 01 47 AM

I think we can extract some general principles from the benchmarking above:

  1. This PR tends to be better than vanilla replace when the source string is short
  2. This PR is much better than vanilla replace when there are lots of matches
  3. This PR scales worse than vanilla replace when the string is long and there are few matches

So I think we could safely use it all all cases where we have a sense that the source string is small (filepaths, variable names, etc.). But we should avoid it for long strings (shader source, script source, etc.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants