Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: consider to add move-like optimization for some string<->[]byte operations #50846

Open
quasilyte opened this issue Jan 27, 2022 · 9 comments
Labels
FeatureRequest NeedsInvestigation
Milestone

Comments

@quasilyte
Copy link
Contributor

@quasilyte quasilyte commented Jan 27, 2022

The problem

1. Building the string

In Go, it's hard to efficiently build a string. Contrary to what strings.Builder documentation says, it's not always more efficient than a direct []byte usage with extra allocation and copying when doing string([]byte) conversion in the end. We can't blindly replace all places with strings.Builder due to its overhead for some use cases (need to write many small chunks while building the string).

A quick glance tells that we can't really make strings.Builder as fast as we might want it to be: even if it doesn't do a copycheck and we call append directly, some algorithms use copy + write by index to be really fast. For example, some code in strings.Replacer specialization uses this instead of strings.Builder. It's 1 extra allocation, but CPU-wise it's ~20% faster on average.

2. Reusing []byte algorithms

When you write a code that returns []byte, you may want to provide a second function that returns a string. For most cases, you'll probably start with something like this:

func EncodeString(v value) string {
  return string(Encode(v))
}

This will add extra data copying + allocation. Not to mention that if Encode returns a fresh allocation (and it only "escapes" in the return), there is no need to copy it. We can "move" it (transfer the ownership) to the return value. In other words, we could use slicebytetostringtmp instead of slicebytetostring.

We can do this since Encode() return value doesn't have any other references to it, it's a temporary that we can move easily.

This will make the code above very efficient. Not only it would avoid the extra allocation, it'll also benefit from the fact that Encode() is implemented over []byte instead of some cross-type manner that will make the code 20-30% slower for both cases.

This cross-type manner usually involves passing both []byte to append (for []byte case) and a bool flag to return string instead. Another approach is passing both strings.Builder and bytes.Buffer, so the function writes the data to the appropriate storage provided by the caller. As noted above, it's not as efficient as we could be with another approach, given that compiler would add some optimizations for that case.

3. Append-like API are bad for strings

Right now one of the best tricks to avoid redundant allocations is to use append-like APIs, where the caller passes a slice that can be reused later. This works well for bytes, but not so well for strings.

With the optimization requested in this issue, we could do this:

buf := make([]byte, 0, size)
buf = appendSomething(buf, data)
// If appendSomething makes buf escape only to its return value,
// then we can try to prove that buf is locally created slice that can be
// converted to a string without copying.
return string(buf)

The optimization

When converting a []byte to a string, if it's proven that after this conversion no one case reference the original slice, initialize the string using the []byte memory with slicebytetostringtmp-like operation (we could add a wrapper called slicebytetostringmove to make it's easier to separate the semantics).

Some examples of when this could be done:

  1. A function creates a local allocation of the slice b. That slice is then returned as string(b).
  2. A temporary value returned by a function passed to a string conversion, like string(f()). We need to know that f() returns a newly allocated memory though, because we shouldn't move reused or global memory. This would require an extra func bit.

(1) Helps in the first problem. We'll be able to build a string inside []byte and return it as a string without overhead.
(2) Helps in the second problem. We'll be able to do string(f()) to reuse the []byte-oriented routines.

These 2 cases solve most issues outlined above, but it may also require some extra care to distinguish how function param "leaks": leaking to a return value is OK for append API, but leaking to global memory is not. For simplicity, I would avoid handling the (3) problem, but maybe someone has a good idea of how to handle it without extra too much effort.

I believe that the opposite is not as useful, but it could be possible to remove some string->[]byte copying with this idea as well. Although I would ask for some use cases of when it could be useful.

Since it doesn't look like a trivial change, it seems fair to discuss it before going beyond the minimal prototype.

Some background

While it might look like a minor issue, it helps to fix several places in the stdlib (make them either ~15% faster or involve 1 less data copying; sometimes it just makes the code more readable because you don't need to pass 2 destinations into a common function).

It also should help to fix some issues in the application I'm trying to optimize. []byte->string conversion is quite high in the profile there, but strings.Builder or other approaches will not solve the issue completely -- it can't compete with a direct []byte manipulations.

An alternative solutions

Faster (or new) strings.Builder

If there is a way to make strings.Builder faster or add a new type that can operate like a slice of bytes with direct access (not sure it's possible to make it "safe"), this compiler optimization will not be needed as much.

Explicit moves

I don't like this approach, but figured I could leave it here as a thought.

If we had an intrinsic function like unsafe.BytesToString(b) (don't mind the naming) that compiler would check (whether it's valid to do this move). If it can prove it, it inserts an allocation-free conversion. Otherwise, it gives a compilation error.

The only benefit is probably the fact that it could be easier for the compiler to do this analysis only for the places it's asked to do it.

Manual optimizations

While it's possible to do something like this:

return unsafeutil.BytesAsString(b)

It makes the code quite fragile and it demands the unsafe usage. It can also break if b starts to escape in some unpredictable way or we start using a sync.Pool for bytes, so b could be shared.

Extra: strings.Replacer rewrite with strings.Builder

@@ -524,18 +524,21 @@ func (r *byteStringReplacer) Replace(s string) string {
        if !anyChanges {
                return s
        }
-       buf := make([]byte, newSize)
-       j := 0
+       var buf Builder
+       buf.Grow(newSize)
+       from := 0
        for i := 0; i < len(s); i++ {
-               b := s[i]
-               if r.replacements[b] != nil {
-                       j += copy(buf[j:], r.replacements[b])
-               } else {
-                       buf[j] = b
-                       j++
+               rep := r.replacements[s[i]]
+               if rep != nil {
+                       buf.WriteString(s[from:i])
+                       buf.Write(rep)
+                       from = i + 1
                }
        }
-       return string(buf)
+       if from != len(s) {
+               buf.WriteString(s[from:])
+       }
+       return buf.String()
 }

Benchmarks:

func BenchmarkReplacer(b *testing.B) {
	// smallString is exactly 32 bytes long.
	const smallString = "Hello  world  strings Rep lacer "

	type testCase struct {
		input string
		from []string
		to []string
	}

	tests := []testCase{
		{
			input: smallString,
			from: []string{"$"},
			to: []string{""},
		},
		{
			input: smallString,
			from: []string{"R"},
			to: []string{""},
		},
		{
			input: smallString,
			from: []string{" "},
			to: []string{""},
		},
		{
			input: smallString,
			from: []string{" ", "l", "s"},
			to: []string{"", "", ""},
		},
		{
			input: smallString,
			from: []string{" ", "l", "s", "e", "o"},
			to: []string{"", "", "", "", ""},
		},
	}

	var allTests []testCase
	for i := range tests {
		testX4 := tests[i]
		testX4.input = strings.Repeat(testX4.input, 4)
		testX16 := tests[i]
		testX16.input = strings.Repeat(testX16.input, 16)
		testX96 := tests[i]
		testX96.input = strings.Repeat(testX96.input, 96)
		testX512 := tests[i]
		testX512.input = strings.Repeat(testX512.input, 512)
		allTests = append(allTests, tests[i], testX4, testX16, testX96, testX512)
	}

	for i := range allTests {
		test := allTests[i]
		var pairs []string
		numMatches := 0
		for j := range test.from {
			pairs = append(pairs, test.from[j], test.to[j])
			if matches := strings.Count(test.input, string(test.from[j])); matches != -1 {
				numMatches += matches
			}
		}
		replacer := strings.NewReplacer(pairs...)
		matchesSuffix := "Matched0%"
		if numMatches != 0 {
			percentage := int((float64(numMatches) / float64(len(test.input))) * 100)
			matchesSuffix = fmt.Sprintf("Matched%d%%", percentage)
		}
		key := fmt.Sprintf("Len%d%s", len(test.input), matchesSuffix)
		b.Run(key, func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				replacer.Replace(test.input)
			}
		})
	}
}

Results (tl;dr - less allocations, slower execution time):

name                           old time/op    new time/op    delta
Replacer/Len32Matched0%-8        20.5ns ± 1%    20.4ns ± 0%     ~     (p=0.183 n=5+5)
Replacer/Len128Matched0%-8       27.0ns ± 0%    27.5ns ± 0%   +1.88%  (p=0.008 n=5+5)
Replacer/Len512Matched0%-8       43.3ns ± 1%    43.3ns ± 1%     ~     (p=1.000 n=5+5)
Replacer/Len3072Matched0%-8       152ns ± 1%     156ns ± 0%   +2.93%  (p=0.008 n=5+5)
Replacer/Len16384Matched0%-8      650ns ± 0%     705ns ± 0%   +8.51%  (p=0.008 n=5+5)
Replacer/Len32Matched3%-8         188ns ± 0%     143ns ± 1%  -24.16%  (p=0.008 n=5+5)
Replacer/Len128Matched3%-8        499ns ± 0%     360ns ± 2%  -27.88%  (p=0.008 n=5+5)
Replacer/Len512Matched3%-8       1.47µs ± 0%    1.40µs ± 0%   -4.52%  (p=0.008 n=5+5)
Replacer/Len3072Matched3%-8      9.33µs ± 0%    7.08µs ± 1%  -24.10%  (p=0.008 n=5+5)
Replacer/Len16384Matched3%-8     50.3µs ± 0%    40.8µs ± 1%  -19.06%  (p=0.008 n=5+5)
Replacer/Len32Matched21%-8        191ns ± 1%     208ns ± 1%   +8.86%  (p=0.008 n=5+5)
Replacer/Len128Matched21%-8       531ns ± 0%     694ns ± 1%  +30.72%  (p=0.008 n=5+5)
Replacer/Len512Matched21%-8      1.76µs ± 1%    2.55µs ± 2%  +44.46%  (p=0.008 n=5+5)
Replacer/Len3072Matched21%-8     10.1µs ± 0%    14.8µs ± 1%  +46.53%  (p=0.008 n=5+5)
Replacer/Len16384Matched21%-8    51.1µs ± 1%    77.5µs ± 1%  +51.88%  (p=0.008 n=5+5)
Replacer/Len32Matched40%-8        248ns ± 1%     305ns ± 1%  +23.17%  (p=0.008 n=5+5)
Replacer/Len128Matched40%-8       661ns ± 0%     981ns ± 1%  +48.40%  (p=0.008 n=5+5)
Replacer/Len512Matched40%-8      2.22µs ± 1%    3.59µs ± 1%  +61.64%  (p=0.008 n=5+5)
Replacer/Len3072Matched40%-8     12.7µs ± 0%    21.2µs ± 1%  +66.85%  (p=0.008 n=5+5)
Replacer/Len16384Matched40%-8    64.8µs ± 0%   112.8µs ± 1%  +74.08%  (p=0.008 n=5+5)
Replacer/Len32Matched56%-8        287ns ± 1%     351ns ± 1%  +22.20%  (p=0.008 n=5+5)
Replacer/Len128Matched56%-8       758ns ± 0%    1208ns ± 1%  +59.49%  (p=0.008 n=5+5)
Replacer/Len512Matched56%-8      2.49µs ± 0%    4.39µs ± 1%  +76.53%  (p=0.008 n=5+5)
Replacer/Len3072Matched56%-8     14.3µs ± 1%    26.1µs ± 1%  +82.71%  (p=0.008 n=5+5)
Replacer/Len16384Matched56%-8    75.2µs ± 0%   138.4µs ± 1%  +84.06%  (p=0.008 n=5+5)
[Geo mean]                       1.37µs         1.68µs       +22.67%

name                           old alloc/op   new alloc/op   delta
Replacer/Len32Matched0%-8         0.00B          0.00B          ~     (all equal)
Replacer/Len128Matched0%-8        0.00B          0.00B          ~     (all equal)
Replacer/Len512Matched0%-8        0.00B          0.00B          ~     (all equal)
Replacer/Len3072Matched0%-8       0.00B          0.00B          ~     (all equal)
Replacer/Len16384Matched0%-8      0.00B          0.00B          ~     (all equal)
Replacer/Len32Matched3%-8         64.0B ± 0%     32.0B ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len128Matched3%-8         256B ± 0%      128B ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len512Matched3%-8       1.02kB ± 0%    0.51kB ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len3072Matched3%-8      6.14kB ± 0%    3.07kB ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len16384Matched3%-8     32.8kB ± 0%    16.4kB ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len32Matched21%-8        64.0B ± 0%     32.0B ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len128Matched21%-8        224B ± 0%      112B ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len512Matched21%-8        832B ± 0%      416B ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len3072Matched21%-8     5.38kB ± 0%    2.69kB ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len16384Matched21%-8    27.1kB ± 0%    13.6kB ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len32Matched40%-8        48.0B ± 0%     24.0B ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len128Matched40%-8        160B ± 0%       80B ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len512Matched40%-8        640B ± 0%      320B ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len3072Matched40%-8     4.10kB ± 0%    2.05kB ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len16384Matched40%-8    19.5kB ± 0%     9.7kB ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len32Matched56%-8        32.0B ± 0%     16.0B ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len128Matched56%-8        128B ± 0%       64B ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len512Matched56%-8        448B ± 0%      224B ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len3072Matched56%-8     2.82kB ± 0%    1.41kB ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len16384Matched56%-8    16.4kB ± 0%     8.2kB ± 0%  -50.00%  (p=0.008 n=5+5)
[Geo mean]                         921B           461B       -50.00%

name                           old allocs/op  new allocs/op  delta
Replacer/Len32Matched0%-8          0.00           0.00          ~     (all equal)
Replacer/Len128Matched0%-8         0.00           0.00          ~     (all equal)
Replacer/Len512Matched0%-8         0.00           0.00          ~     (all equal)
Replacer/Len3072Matched0%-8        0.00           0.00          ~     (all equal)
Replacer/Len16384Matched0%-8       0.00           0.00          ~     (all equal)
Replacer/Len32Matched3%-8          2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len128Matched3%-8         2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len512Matched3%-8         2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len3072Matched3%-8        2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len16384Matched3%-8       2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len32Matched21%-8         2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len128Matched21%-8        2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len512Matched21%-8        2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len3072Matched21%-8       2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len16384Matched21%-8      2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len32Matched40%-8         2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len128Matched40%-8        2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len512Matched40%-8        2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len3072Matched40%-8       2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len16384Matched40%-8      2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len32Matched56%-8         2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len128Matched56%-8        2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len512Matched56%-8        2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len3072Matched56%-8       2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
Replacer/Len16384Matched56%-8      2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
[Geo mean]                         2.00           1.00       -50.00%

The string -> []byte cases

I don't have enough info and stats for this case, but I can provide a few ones nonetheless.

First, we would need a stringtoslicebytetmp function to make it possible.

Some obvious cases where it can be applied (this list is incomplete):

  • []byte(x + y)
  • append([]byte(s), data...)

If we use this knowledge about the freshly allocated data, some expressions like this can avoid excessive string->bytes allocation:

// If we know that f() always returns a fresh string (for example, it uses `string(b)` for its result)
// Also let's suppose that w is io.Writer that doesn't have WriteString method.
w.Write([]byte(f()))
@martisch
Copy link
Contributor

@martisch martisch commented Jan 27, 2022

I think if we can create a general pass that can answer is a reference to this piece of memory (byte slice backing array) leaked or is this the only reference live at this point then we can add a general transformation that uses converts slicebytetostring to slicebytetostringtmp. This can then be used for optimizations of string to []byte too.

Another way to use that pass then is to allow a call to f(string(b)) if b is the only reference locally at the time as then nothing else could change it while f is using it. This also requires a pass that can prove that the string doesnt leak.

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Jan 27, 2022

Thanks for the very detailed issue. I'm going to mark this as a feature request.

@mknyszek mknyszek added FeatureRequest NeedsInvestigation labels Jan 27, 2022
@mknyszek mknyszek added this to the Backlog milestone Jan 27, 2022
@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Jan 27, 2022

@aclements
Copy link
Member

@aclements aclements commented Feb 1, 2022

Yes, thanks for the very detailed write-up.

This seems very closely related to, if not the same as #2205. @quasilyte , can you take a look at that issue and see if you think these should be merged?

@quasilyte
Copy link
Contributor Author

@quasilyte quasilyte commented Feb 1, 2022

It does seem related in the sense that both of these issues want to avoid excessive allocations.
I saw dozens of somewhat similar issues that describe various cases of the conversions on one side or another.
Maybe it's possible to define some universal rules that can be applied or maybe it's still better to treat those as some ad-hoc optimizations that heavily rely on the context.

I think we solve the passing conversion at least partially when the buffer is small enough (it's allocated on the stack I believe).
Returning the converted value is a case that can't be optimized right now.

I would be glad if it would be possible to use some manual optimization of this code and just avoid this explicit return string(bytes), but our current mechanisms are not always a win (it can actually introduce a performance regression, as described in strings.Replacer case).

If there is some pattern that the compiler can recognize, it would be possible to adjust the code so it triggers that optimization. Right now I'm not aware of such a way (I would be glad to learn about one if it exists, apart from the unsafe []byte -> string conversion).

@jfcg
Copy link

@jfcg jfcg commented Mar 25, 2022

Simpler example (tested with Go 1.17.8 / 1.18):
pkg.go

package pkg

// ToLower converts ascii string s to lower-case
func ToLower(s string) string {
	if s == "" {
		return ""
	}
	buf := make([]byte, len(s))

	for i := 0; i < len(s); i++ {
		c := s[i]
		if 'A' <= c && c <= 'Z' {
			c |= 32
		}
		buf[i] = c
	}
	return string(buf)
}

pkg_test.go

package pkg

import "testing"

func BenchmarkToLower(b *testing.B) {
	str := "SomE StrInG"
	want := "some string"
	var res string

	for i := 0; i < b.N; i++ {
		res = ToLower(str)
	}
	if res != want {
		b.Fatal("ToLower error")
	}
}

go test -v -bench ToLower -benchmem -count=3 prints

BenchmarkToLower-4      18710726                66.86 ns/op           32 B/op          2 allocs/op
BenchmarkToLower-4      17420910                65.57 ns/op           32 B/op          2 allocs/op
BenchmarkToLower-4      17085405                63.16 ns/op           32 B/op          2 allocs/op

There is a redundant allocation & copy in return string(buf). It looks too trivial to me that I am surprised Go compiler cannot handle this.

@randall77
Copy link
Contributor

@randall77 randall77 commented Mar 25, 2022

@jfcg this is the use case for strings.Builder:

func ToLower(s string) string {
	if s == "" {
		return ""
	}
	var b strings.Builder
	b.Grow(len(s))

	for i := 0; i < len(s); i++ {
		c := s[i]
		if 'A' <= c && c <= 'Z' {
			c |= 32
		}
		b.WriteByte(c)
	}
	return b.String()
}

@jfcg
Copy link

@jfcg jfcg commented Mar 25, 2022

builder.go is 126 lines, one struct type, 12 functions, and its String() method is *(*string)(unsafe.Pointer(&b.buf))

We can have the exact same effect with just *(*string)(unsafe.Pointer(&buf)) in the original ToLower().

But the pont is why can the compiler not detect such a simple case?

@randall77
Copy link
Contributor

@randall77 randall77 commented Mar 25, 2022

You're right, the compiler probably could detect this case.
buf does not escape, and return string(buf) ensures that no one is writing to buf after the string conversion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FeatureRequest NeedsInvestigation
Projects
None yet
Development

No branches or pull requests

6 participants