New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bytes, strings: add ToValidUTF8 #25805

Open
martisch opened this Issue Jun 9, 2018 · 11 comments

Comments

Projects
None yet
7 participants
@martisch
Member

martisch commented Jun 9, 2018

I have seen multiple places were string([]rune(x)) (where x has underlying type string) or a handcrafted function was used to replace invalid utf8 sequences in strings by utf8.RuneError.
In many places only very few of the processed strings actually have invalid utf8 sequences in them.

The pattern string([]rune(x)) is short and does not require a package import but is inefficient in that it involves at least two runtime function calls (that iterate over the runes multiple times) and possibly allocates twice (once for the rune array and once for the string).

I propose (no language change needed) to detect and optimize this pattern and use a special optimized runtime function to sanitize utf8 sequences the way string([]rune(x)) does. This function should use a stack local tmp buf if the resulting string is small and does not escape (like other runtime string functions do) and should allocate only once for the result string and only when there are invalid utf8 sequences in the string.

As an alternative (or addition to the proposal) a new function could be added to the utf8 package that does the semantic equivalent of string([]rune(x)) but in a more efficient manner. However this function would not be able to make use of all the optimizations the runtime with compiler support can and would make the new function only available from a specific std lib version onwards. This would also make new programs not compile with older std libs and existing uses of string([]rune(x)) not optimized automatically. There would also be likely performance regressions for small non escaping strings where the existing string([]rune(x)) was able to stack allocate the return string but the new utf8 function could not when returning a modified string.

If this sounds something we can pursue i would like to implement an optimization of string([]rune(x)) (and/or utf8 package function) for go1.12.

/cc @josharian @randall77

@martisch martisch added this to the Go1.12 milestone Jun 9, 2018

@robpike

This comment has been minimized.

Contributor

robpike commented Jun 10, 2018

I don't understand the motivation. Is this an important problem that requires special compiler support? You haven't justified it here. You say you've seen the pattern, but it's a somewhat odd thing to do and I would expect it to be rare in well-written code anyway.

@martisch

This comment has been minimized.

Member

martisch commented Jun 10, 2018

My motivation is to remove the inefficiency of string([]rune(x)) by either providing an optimized runtime function for it (needs compiler support) or adding a optimized (e.g. utf8 package) function to replace existing uses of string([]rune)).

Most uses by computation time i have seen in a large profiled codebase for []rune(x) are due to the string([]rune(x)) pattern.

Compiler support could be advantageous but is not needed if a function in the e.g. utf8 package that does the same is added. Absent that it seems often string([]rune(x)) is used as that is what the runtime provides already instead of writing a function that does the same explicitly. Having a utf8 package function that works like string([]rune(x)) but more efficient could also avoid unnecessary allocations by providing an optimized std lib function thats easily available (like utf8.ValidString is for checking invalid sequences) and is preferred over string([]rune(x)).

Alternatively what would we expect the replacement of string([]rune(x)) to be in well written go code (e.g. self made implementation or existing function)? Replacement of invalid sequences even if ignored within go could still be needed for storage or communication with other languages and or protocols between services.

I understand that existing uses might not warrant adding an additional function to utf8 or the runtime but to me it seems this would be a good candidate to complement the existing optimized utf8.ValidString function that is simpler but likely also used more.

@martisch martisch changed the title from cmd/compile: detect and optimize string([]rune(x)) to cmd/compile, utf8: optimize string([]rune(x)) or add specialized utf8 function Jun 10, 2018

@rsc

This comment has been minimized.

Contributor

rsc commented Jun 11, 2018

Given that almost no one will know what string([]rune(x)) does beyond being a no-op at first glance, it seems like a properly-named function would be a lot clearer than a magic compiler optimization. This issue should almost certainly be proposing a new function in unicode/utf8. The question about how often this comes up remains.

@rsc rsc self-assigned this Jun 11, 2018

@rsc rsc changed the title from cmd/compile, utf8: optimize string([]rune(x)) or add specialized utf8 function to proposal: utf8: add function to coerce string to valid UTF-8 Jun 11, 2018

@rsc rsc modified the milestones: Go1.12, Proposal Jun 11, 2018

@gopherbot gopherbot added the Proposal label Jun 11, 2018

@martisch

This comment has been minimized.

Member

martisch commented Jun 11, 2018

Not all use cases are coded as string([]rune(x)). Some functions named like sanitize(UTF8), clean(UTF8), coerceToUTF8 seem to carry out the same operation. I think the use case comes up half as often in code as utf8.ValidString if google mono repo is indicative of general use.

@rsc

This comment has been minimized.

Contributor

rsc commented Jul 10, 2018

@robpike and I talked about this. This functionality fits better into strings/bytes than to unicode/utf8. Nothing in utf8 returns a string or []byte yet, while this conversion is very much like strings.ToUpper, etc. We think this should be strings.ToValidUTF8 and bytes.ToValidUTF8.

After fixing #26304 and #26305, the implementation can be:

package strings

func ToValidUTF8(s string) string {
    return strings.Map(func(r rune) rune { return r }, s)
}

@rsc rsc added NeedsFix and removed NeedsDecision labels Jul 10, 2018

@rsc rsc modified the milestones: Proposal, Go1.12 Jul 10, 2018

@rsc rsc changed the title from proposal: utf8: add function to coerce string to valid UTF-8 to bytes, strings: add ToValidUTF8 Jul 10, 2018

@gopherbot

This comment has been minimized.

gopherbot commented Oct 13, 2018

Change https://golang.org/cl/142003 mentions this issue: strings, bytes: Add ToValidUTF8

@robpike

This comment has been minimized.

Contributor

robpike commented Oct 13, 2018

Comment made in review of the CL:

This might be a better design:

// ToValidUTF8 treats s as UTF-8-encoded bytes and returns a copy with run of bytes representing invalid UTF-8 replaced with repl, which may be empty.
func ToValidUTF8(s, repl []byte) []byte {
  ...
}

Rather than blow up the bytes, replace the bad encoding. And allow the user to specify a possibly empty replacement.

ToValidUTF8(s, nil)  clears them out.
ToValidUTF8(s, []byte{'?'}) makes it readable.

The name might be wrong for this design, but it is a superior function. Also the existing decoding mechanisms don't do this directly, but replace each byte, which to me argues this is better in part because it adds functionality.

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Nov 28, 2018

Any thoughts on @robpike 's suggested API? Perhaps a better name would be ReplaceInvalidUTF8?

@martisch

This comment has been minimized.

Member

martisch commented Nov 28, 2018

SGTM will send a new CL.

I was thinking about adding another parameter to control if each run or each byte of an invalid sequence is replaced. The later might not be that much more useful apart from generating a string of same length and is easily implemented using the existing strings.Map function.

@bradfitz bradfitz modified the milestones: Go1.12, Go1.13 Nov 29, 2018

@bradfitz

This comment has been minimized.

Member

bradfitz commented Nov 29, 2018

Design didn't get hashed out for Go 1.12, so bumping to Go 1.13.

@RalphCorderoy

This comment has been minimized.

RalphCorderoy commented Dec 8, 2018

Hi @martisch, I agree using Map to create a string of the same length is trivial, but that suggests it could just as easily be done by the caller instead. ToValidUTF8 above provides new functionality by replacing runs of invalidity by a single replacement and I think that's what it should concentrate on, without cluttering its parameters with an additional flag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment