Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: provide std lib pkg Clone function to copy a string efficiently without memory sharing #40200

Open
martisch opened this issue Jul 14, 2020 · 20 comments
Labels
Projects
Milestone

Comments

@martisch
Copy link
Contributor

@martisch martisch commented Jul 14, 2020

Updated 21th April 2021 to use Clone instead of Copy as name: see #45038 (comment)

Goal

Provide a fast, allocation efficient and guaranteed way to copy a string without the new string sharing memory with the old string, that does not rely on code assuming string([]byte(s)) works for this.

Proposal

Add a Clone(s string) string function (not a builtin) to a std lib package that is guaranteed to return a string with the same content as the input string, does not share memory (for the backing) array with the input string and minimises memory allocated to do so.

Note that it this function does not need to be in the strings package but could be located in a package more fitting for its special use character.

Use cases

  • creating a new string for e.g. use as a map key or substring of a string that should be guaranteed not pin more memory in the backing array then needed (which currently cant be freed by the garbage collector)
  • creating a new string from a string for which the backing array is e.g. an mmap region that should be unmapped

Existing alternatives:

  • string([]byte(s)) is currently not guaranteed by the language standard and thereby all conforming Go compilers to not just return the same backing array. While the current Go Gc compiler does not elide the double copy a future Go compiler could optimize this to a noop. If the compiler becomes smart enough to recognise the safe versions of string to byte to string conversions (#6714) it would be fragile to try to forbid some of them again. Being explicit about the intent of the copy semantics seems cleaner and a better intent communication.

  • (s + " ")[:len(s)] is as far as I understand not guaranteed by the language standard and thereby all conforming Go compilers to not just return the same backing array.

  • strings.Builder can be used to only cause one allocation currently for the copy but has no guarantees to not be worse in allocations or even avoid copying if only one string is written and retrieved again.

Implementation is simple:

func Clone(s string) string {
	b := make([]byte, len(s))
	copy(b, s)
	return *(*string)(unsafe.Pointer(&b))
}

https://go-review.googlesource.com/c/go/+/242259/1/src/strings/copy.go

The use of unsafe headers in the strings library is already present in the strings.Builder code.

Downsides

  • The garbage collector might be getting smarter in the future making the pinning of large memory for small substrings a non issue.
  • The compiler/runtime might choose to automatically dedupe strings needing to make strings.Clone special to keep semantics intact.
  • Having the Clone function could lead to confusion with (s1 = s2) and application in contexts where it is not needed.
  • Inner details as does not reuse memory can be to low level and considered to much of an implementation detail to expose.
  • It only provides a tool for some very special use cases.
  • The function can just be implemented without guarantees of compatibility by users needed such a function.

For those reasons and others I expect this proposal will be easily rejected or voted down. That is fine as I think it is still useful to have some documented discussion about agreement whether such functionality or implementation details about strings are not ok to be exposed.

@gopherbot gopherbot added this to the Proposal milestone Jul 14, 2020
@robpike
Copy link
Contributor

@robpike robpike commented Jul 14, 2020

Why do you want this? Are you worried about extracting a substring so the main string can be collected? Because if that's the use case, I don't think confusing the simple existing model (s1 = s2) is justified.

@martisch
Copy link
Contributor Author

@martisch martisch commented Jul 14, 2020

Yes that is one use case as written in the proposal. Another application I have seen is to make a string that is safe to use going forward since the orginal string was created through an unsafe process of using a memory region e.g. mmaped that should be released again.

I fully understand if the proposal is getting rejected on grounds that this is something that the garbage collector can solve in the future without any new API or that users can implement as a helper themselfs. I have seen the issue come up before but never a proposal. Even if rejected I think it will serve as a good reference point cite that this wont be implemtented/support or that these details are to low level to base an API on.

It is also in context of me experimenting with the compiler to optimze some string->byte->string... conversion chains which could unless explicitly prevented make the 'strings([]byte(s))' version a noop that some Go programs may rely on to actually make a copy due to either memory concerns or having used unsafe to create the string in the first place. I understand that issue is not yet reality and can be defered until that is the case to offer an easy escape hatch to refactor then "broken" code.

@martisch martisch changed the title proposal: strings.Copy to copy a string efficiently without memory sharing proposal: provide std lib pkg Copy function to copy a string efficiently without memory sharing Jul 14, 2020
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals Jul 14, 2020
@valyala
Copy link
Contributor

@valyala valyala commented Jul 16, 2020

What about using the following code?

func CopyString(s string) string {
    return string(append([]byte(nil), s...))
}

The compiler may optimize it to a single string copy and a single memory allocation.

@martisch
Copy link
Contributor Author

@martisch martisch commented Jul 16, 2020

If the outcome is to change the compiler to recognise a pattern for low overhead copy I do not see the advantage of string(append([]byte(nil), s...)) over the compiler recognising the simpler string([]byte(s)) and using that for single memory allocation copy.

@networkimprov
Copy link

@networkimprov networkimprov commented Jul 16, 2020

Since concatenation creates a new string, the simplest copy-this construct would be s + "" or "" + s.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jul 16, 2020

@networkimprov Right now the compiler can just turn s + "" into simply s. So to make that work as a string copy we would have to define it in the language. And it would be odd to define a specific meaning for s + "", and it would be odd to define string concatenation as always making a copy in general.

@martisch described two use cases. The first is to avoid pinning a large string. As @robpike suggests, that is something we can look into implementing in the garbage collector. I don't know how hard that would be--probably pretty hard--but it does seem like that should be the first step.

@martisch also suggests the case of a string defined in mmap'ed memory that needs to move to the regular heap. I'm probably missing something but I think getting a string in mmap'ed memory must involve some sort of unsafe operation, so I'm not sure how interesting this case is. I expect that we can move a string into the heap using an unsafe operation.

So maybe we need to decide whether the garbage collector option is possible for unpinning larage strings, and maybe we need to decide whether there is an unsafe operation that will do the copy operation. E.g., see the discussion in #19367.

@networkimprov
Copy link

@networkimprov networkimprov commented Jul 16, 2020

Why not let us explicitly copy a string, by whatever API/syntax?

Why complicate the runtime to fix a single use case of a general purpose stdlib or language feature?

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jul 16, 2020

Avoiding the unnecessary pinning of large strings is a general problem that would be nice to solve for the general case. It would mean that some programs would get better memory usage without requiring people to understand this detailed issue and requiring them to know about a relatively obscure library function.

And if we are able to fix that general problem, then we don't need the library function.

This is just my opinion, of course, others may feel differently.

@robpike
Copy link
Contributor

@robpike robpike commented Jul 17, 2020

It's easy to do yourself, although of course there is no guarantee the compiler won't optimize this away. It doesn't today.

https://play.golang.org/p/gLXMWXD1SNL

package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

func main() {
	x := "qwertyuiopasdfghjklzxcvbnm"
	y := x[3:5]
	fmt.Println(x, (*reflect.StringHeader)(unsafe.Pointer(&x)))
	fmt.Println(y, (*reflect.StringHeader)(unsafe.Pointer(&y)))
	y = mycopy(y)
	fmt.Println(y, (*reflect.StringHeader)(unsafe.Pointer(&y)))
}

func mycopy(x string) string {
	return ("x" + x)[1:]
}
@martisch
Copy link
Contributor Author

@martisch martisch commented Jul 17, 2020

Note that mycopy using concatenation my allocate using the next bigger sizeclass currently which is not needed.

I agree that deciding to have the garbage collector not pin extra memory together with an unsafe operation to copy memory will make the string copy function uneccessary and are better more general solution if they are going to be choosen.

@rsc
Copy link
Contributor

@rsc rsc commented Jul 22, 2020

Every GC'ed language has this problem. I don't think that we are going to solve it in the GC when none of the rest have.

When Java was faced with this problem they made the (in my opinion incredibly unwise) decision to change the substring operation to start making copies. That is, Java x.substring(i, j) used to be like Go's x[i:j], sharing memory with the parent, but starting in Java 1.7 it makes a copy. This kind of change makes formerly linear algorithms go quadratic, so I'm stunned they thought this was a good idea. But if nothing else it shows how much people needed a solution to the unwanted sharing pinning data.

Given the likely absence of a magic bullet for the foreseeable future, it does seem reasonable to have something like strings.Copy(string) string, with appropriate documentation.

In the past I've written something that does s[:1] + s[1:] for non-empty strings and returns "" for empty strings. Lots of other ways to do it too (like Rob's), but I think it is telling that we all have our versions of this.

@martisch
Copy link
Contributor Author

@martisch martisch commented Nov 17, 2020

To avoid some confusion of what Copy does maybe a longer name like DeepCopy might be used if the function is in the common strings package. Otherwise unsafe.StringCopy might better convey the special use case of this function even if its not an unsafe operation itself but might need to be used with other unsafe operations to have created a string that needs a copy of the backing array.

@bcmills
Copy link
Member

@bcmills bcmills commented Nov 17, 2020

@martisch, I don't think unsafe.StringCopy is a good name for this function, because there is nothing unsafe about it.

(strings.Copy seems like a clear enough name to me.)

@bcmills
Copy link
Member

@bcmills bcmills commented Nov 17, 2020

@martisch

Another application I have seen is to make a string that is safe to use going forward since the orginal string was created through an unsafe process of using a memory region e.g. mmaped that should be released again.

I think the problem there is using the string type in the first place. Go programs can (and often do) assume that variables of type string are immutable and have unbounded, garbage-collected lifetimes. The appropriate type for “a sequence of bytes that exists for a while but may not be safe to write or to access in the future” is []byte, not string.

[]byte already has this meaning in general due to concurrency. Because other goroutines may read the same data concurrently, a function that receives an arbitrary []byte cannot assume that it is safe to mutate. Because other goroutines may write to the data after the call has returned, a function that receives an arbitrary []byte cannot assume that it is safe to retain for future reading.

So I don't think “copying an external string into the Go heap” is an appropriate use-case for this function, because I don't agree that external strings should be represented as type string in the first place.

@martisch
Copy link
Contributor Author

@martisch martisch commented Mar 30, 2021

@bcmills while I agree on the high level I would like to note that on an existing ecosystem level libraries that do e.g. proto handling that may want to use arenas for decoding dont use new types but use type string and they will be hard to change.

@martisch
Copy link
Contributor Author

@martisch martisch commented Mar 30, 2021

Another use case for strings.Clone I came across is to cleanup a map from holding on to more memory than needed for string keys. Keys might have been sliced from a much larger string. Remember that the Go language spec doesnt guarantee string([]byte(s)) is not a noop and a much more optimizing compiler might just make it return the original s.

for key, value := range m {
   m[strings.Clone(key)] = value
}
@rsc
Copy link
Contributor

@rsc rsc commented Apr 21, 2021

#45038 includes a strings.Clone that should address this.

@martisch martisch changed the title proposal: provide std lib pkg Copy function to copy a string efficiently without memory sharing proposal: provide std lib pkg Clone function to copy a string efficiently without memory sharing Apr 21, 2021
@go101
Copy link

@go101 go101 commented Apr 22, 2021

Will the supposed implementation return the original s if string s occupies its whole hosting memory block?

@randall77
Copy link
Contributor

@randall77 randall77 commented Apr 22, 2021

@go101 It might.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Apr 22, 2021

If we are talking about the proposed strings.Clone function, then I don't think it should. I think that strings.Clone should always return a new copy of the string in newly allocated memory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Proposals
Incoming
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
10 participants