Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: provide std lib pkg Copy function to copy a string efficiently without memory sharing #40200

Open
martisch opened this issue Jul 14, 2020 · 11 comments
Labels
Projects
Milestone

Comments

@martisch
Copy link
Contributor

@martisch martisch commented Jul 14, 2020

Goal

Provide a fast, allocation efficient and guaranteed way to copy a string without the new string sharing memory with the old string, that does not rely on code assuming string([]byte(s)) works for this.

Proposal

Add a Copy(s string) string function (not a builtin) to a std lib package that is guaranteed to return a string with the same content as the input string, does not share memory (for the backing) array with the input string and minimises memory allocated to do so.

Note that it this function does not need to be in the strings package but could be located in a package more fitting for its special use character.

Use cases

  • creating a new string for e.g. use as a map key or substring of a string that should be guaranteed not pin more memory in the backing array then needed (which currently cant be freed by the garbage collector)
  • creating a new string from a string for which the backing array is e.g. an mmap region that should be unmapped

Existing alternatives:

  • string([]byte(s)) is currently not guaranteed by the language standard and thereby all conforming Go compilers to not just return the same backing array. While the current Go Gc compiler does not elide the double copy a future Go compiler could optimize this to a noop. If the compiler becomes smart enough to recognise the safe versions of string to byte to string conversions (#6714) it would be fragile to try to forbid some of them again. Being explicit about the intent of the copy semantics seems cleaner and a better intent communication.

  • (s + " ")[:len(s)] is as far as I understand not guaranteed by the language standard and thereby all conforming Go compilers to not just return the same backing array.

  • strings.Builder can be used to only cause one allocation currently for the copy but has no guarantees to not be worse in allocations or even avoid copying if only one string is written and retrieved again.

Implementation is simple:

func Copy(s string) string {
	b := make([]byte, len(s))
	copy(b, s)
	return *(*string)(unsafe.Pointer(&b))
}

https://go-review.googlesource.com/c/go/+/242259/1/src/strings/copy.go

The use of unsafe headers in the strings library is already present in the strings.Builder code.

Downsides

  • The garbage collector might be getting smarter in the future making the pinning of large memory for small substrings a non issue.
  • The compiler/runtime might choose to automatically dedupe strings needing to make strings.Copy special to keep semantics intact.
  • Having the Copy function could lead to confusion with (s1 = s2) and application in contexts where it is not needed.
  • Inner details as does not reuse memory can be to low level and considered to much of an implementation detail to expose.
  • It only provides a tool for some very special use cases.
  • The function can just be implemented without guarantees of compatibility by users needed such a function.

For those reasons and others I expect this proposal will be easily rejected or voted down. That is fine as I think its still useful to have some documented discussion about agreement whether such functionality or implementation details about strings are not ok to be exposed.

@gopherbot gopherbot added this to the Proposal milestone Jul 14, 2020
@gopherbot gopherbot added the Proposal label Jul 14, 2020
@robpike
Copy link
Contributor

@robpike robpike commented Jul 14, 2020

Why do you want this? Are you worried about extracting a substring so the main string can be collected? Because if that's the use case, I don't think confusing the simple existing model (s1 = s2) is justified.

@martisch
Copy link
Contributor Author

@martisch martisch commented Jul 14, 2020

Yes that is one use case as written in the proposal. Another application I have seen is to make a string that is safe to use going forward since the orginal string was created through an unsafe process of using a memory region e.g. mmaped that should be released again.

I fully understand if the proposal is getting rejected on grounds that this is something that the garbage collector can solve in the future without any new API or that users can implement as a helper themselfs. I have seen the issue come up before but never a proposal. Even if rejected I think it will serve as a good reference point cite that this wont be implemtented/support or that these details are to low level to base an API on.

It is also in context of me experimenting with the compiler to optimze some string->byte->string... conversion chains which could unless explicitly prevented make the 'strings([]byte(s))' version a noop that some Go programs may rely on to actually make a copy due to either memory concerns or having used unsafe to create the string in the first place. I understand that issue is not yet reality and can be defered until that is the case to offer an easy escape hatch to refactor then "broken" code.

@martisch martisch changed the title proposal: strings.Copy to copy a string efficiently without memory sharing proposal: provide std lib pkg Copy function to copy a string efficiently without memory sharing Jul 14, 2020
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals Jul 14, 2020
@valyala
Copy link
Contributor

@valyala valyala commented Jul 16, 2020

What about using the following code?

func CopyString(s string) string {
    return string(append([]byte(nil), s...))
}

The compiler may optimize it to a single string copy and a single memory allocation.

@martisch
Copy link
Contributor Author

@martisch martisch commented Jul 16, 2020

If the outcome is to change the compiler to recognise a pattern for low overhead copy I do not see the advantage of string(append([]byte(nil), s...)) over the compiler recognising the simpler string([]byte(s)) and using that for single memory allocation copy.

@networkimprov
Copy link

@networkimprov networkimprov commented Jul 16, 2020

Since concatenation creates a new string, the simplest copy-this construct would be s + "" or "" + s.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jul 16, 2020

@networkimprov Right now the compiler can just turn s + "" into simply s. So to make that work as a string copy we would have to define it in the language. And it would be odd to define a specific meaning for s + "", and it would be odd to define string concatenation as always making a copy in general.

@martisch described two use cases. The first is to avoid pinning a large string. As @robpike suggests, that is something we can look into implementing in the garbage collector. I don't know how hard that would be--probably pretty hard--but it does seem like that should be the first step.

@martisch also suggests the case of a string defined in mmap'ed memory that needs to move to the regular heap. I'm probably missing something but I think getting a string in mmap'ed memory must involve some sort of unsafe operation, so I'm not sure how interesting this case is. I expect that we can move a string into the heap using an unsafe operation.

So maybe we need to decide whether the garbage collector option is possible for unpinning larage strings, and maybe we need to decide whether there is an unsafe operation that will do the copy operation. E.g., see the discussion in #19367.

@networkimprov
Copy link

@networkimprov networkimprov commented Jul 16, 2020

Why not let us explicitly copy a string, by whatever API/syntax?

Why complicate the runtime to fix a single use case of a general purpose stdlib or language feature?

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jul 16, 2020

Avoiding the unnecessary pinning of large strings is a general problem that would be nice to solve for the general case. It would mean that some programs would get better memory usage without requiring people to understand this detailed issue and requiring them to know about a relatively obscure library function.

And if we are able to fix that general problem, then we don't need the library function.

This is just my opinion, of course, others may feel differently.

@robpike
Copy link
Contributor

@robpike robpike commented Jul 17, 2020

It's easy to do yourself, although of course there is no guarantee the compiler won't optimize this away. It doesn't today.

https://play.golang.org/p/gLXMWXD1SNL

package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

func main() {
	x := "qwertyuiopasdfghjklzxcvbnm"
	y := x[3:5]
	fmt.Println(x, (*reflect.StringHeader)(unsafe.Pointer(&x)))
	fmt.Println(y, (*reflect.StringHeader)(unsafe.Pointer(&y)))
	y = mycopy(y)
	fmt.Println(y, (*reflect.StringHeader)(unsafe.Pointer(&y)))
}

func mycopy(x string) string {
	return ("x" + x)[1:]
}
@martisch
Copy link
Contributor Author

@martisch martisch commented Jul 17, 2020

Note that mycopy using concatenation my allocate using the next bigger sizeclass currently which is not needed.

I agree that deciding to have the garbage collector not pin extra memory together with an unsafe operation to copy memory will make the string copy function uneccessary and are better more general solution if they are going to be choosen.

@rsc
Copy link
Contributor

@rsc rsc commented Jul 22, 2020

Every GC'ed language has this problem. I don't think that we are going to solve it in the GC when none of the rest have.

When Java was faced with this problem they made the (in my opinion incredibly unwise) decision to change the substring operation to start making copies. That is, Java x.substring(i, j) used to be like Go's x[i:j], sharing memory with the parent, but starting in Java 1.7 it makes a copy. This kind of change makes formerly linear algorithms go quadratic, so I'm stunned they thought this was a good idea. But if nothing else it shows how much people needed a solution to the unwanted sharing pinning data.

Given the likely absence of a magic bullet for the foreseeable future, it does seem reasonable to have something like strings.Copy(string) string, with appropriate documentation.

In the past I've written something that does s[:1] + s[1:] for non-empty strings and returns "" for empty strings. Lots of other ways to do it too (like Rob's), but I think it is telling that we all have our versions of this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Proposals
Incoming
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants
You can’t perform that action at this time.