strings: Split is inconsistent with bytes.Split #53511

Closed
dsnet opened this issue Jun 23, 2022 · 4 comments
Member

@dsnet dsnet commented Jun 23, 2022

Consider the following:

in := "\xff-\xff"
sep := ""
fmt.Printf("%q\n", bytes.Split([]byte(in), []byte(sep)))   // ["\xff" "-" "\xff"]
fmt.Printf("%q\n", strings.Split(string(in), string(sep))) // ["�" "-" "\xff"]

The two results are inconsistent: the strings implementation replaces
\xff with utf8.RuneError, even though the documentation of strings.Split
mentions no such behavior and says only that it slices up the input.
It is also odd that it mangles only the first element of the result, not the last.


@gopherbot gopherbot commented Jun 23, 2022

Change https://go.dev/cl/413715 mentions this issue: strings: avoid utf8.RuneError mangling in Split

@cagedmantis cagedmantis added the NeedsInvestigation label Jun 24, 2022
@cagedmantis cagedmantis added this to the Backlog milestone Jun 24, 2022
Contributor

@cagedmantis cagedmantis commented Jun 24, 2022

@griesemer

@dmitshur dmitshur added NeedsFix and removed NeedsInvestigation labels Jun 25, 2022
@dmitshur dmitshur modified the milestones: Backlog, Go1.19 Jun 25, 2022
Contributor

@nightlyone nightlyone commented Jul 13, 2022

https://go.dev/play/p/7PPxD1AyxTq shows this alongside two other cases of UTF-8 rune splitting, each of which is handled differently again.

To me it looks like the strings and bytes packages are each inconsistent with the other documented rune-splitting methods, and both probably need to be adjusted.

Or maybe I am missing something here and those functions are supposed to support rune splitting in a different way in order to allow for edge cases I am not aware of.

Member Author

@dsnet dsnet commented Jul 14, 2022

@nightlyone, with my CL your example prints:

["\xff" "-" "\xff"]
["\xff" "-" "\xff"]
['�' '-' '�']
'�''-''�'

This brings the bytes and strings packages into conformance with each other.

Regarding the last two lines in your example, there is no documented guarantee that strings.Split(s, "") must be identical to []rune(s). In fact, strings.Split is technically bound by its documentation to always produce a list of substrings of the input. Converting 0xff to the rune error violates that. On the other hand, the language specification documents that []rune(s) does perform the rune error mangling.

@gopherbot gopherbot modified the milestones: Go1.19, Go1.20 Aug 2, 2022