Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: Go 2: for index, rune, runelen = range for strings #28599

Closed
martisch opened this issue Nov 5, 2018 · 12 comments

Comments

Projects
None yet
7 participants
@martisch
Copy link
Member

commented Nov 5, 2018

UPDATED proposal to only apply to strings.

The proposal is to allow a new 3 value range for strings:

for index, rune, runelen = range string {
   ...
}

where index and rune are the same as in for index, rune = range string and runelen is the number of bytes that was needed to encode the rune in UTF8. In case a rune could not be decoded from the string due to an invalid UTF8 byte sequence: runelen is set to 1 and rune to utf8.RuneError.

  • This is backwards compatible with existing range syntax and Go1 since 3 value ranges do not exist yet.
  • gives the rune width of the decoded rune which is useful for e.g. strings.Map and other functions where the rune width is used to determine if a real RuneError rune was decoded or how much space is needed to write the rune to a byte slice.
  • there is no performance overhead vs existing for i, r = range string loop for the 3 value loop since internally the loop already computes the utf8 byte sequence width of the rune to correctly advance the index on the next iteration. This will however allow user supplied code in the loop to use this hidden variable.

For example current strings.Map first range loop:

for i, c := range s {
	r := mapping(c)
	if r == c && c != utf8.RuneError {
	   continue
	}

	var width int
	if c == utf8.RuneError {
		c, width = utf8.DecodeRuneInString(s[i:])
		if width != 1 && r == c {
			continue
		}
	} else {
		width = utf8.RuneLen(c)
	}
        ...

could become:

for i, c, width := range s {
   r := mapping(c)
   if r == c && !(c == utf8.RuneError && width == 1) {
       continue
   }
   ...

@gopherbot gopherbot added this to the Proposal milestone Nov 5, 2018

@gopherbot gopherbot added the Proposal label Nov 5, 2018

@martisch martisch changed the title proposal: for index, size, rune = range for strings and bytes proposal: for index, size, rune = range for byte slices and strings Nov 5, 2018

@martisch martisch changed the title proposal: for index, size, rune = range for byte slices and strings proposal: for index, rune, size = range for byte slices and strings Nov 5, 2018

@theodesp

This comment has been minimized.

Copy link

commented Nov 5, 2018

Wouldn't be the better if the size is computed explicitly as its fixed and does not change? Also, I don't think range is the ideal keyword to perform this extension as if you think about it a range should iterate over a starting - ending point only and having a size would be extraneous.

@martisch

This comment has been minimized.

Copy link
Member Author

commented Nov 5, 2018

size here is the size of the rune that was decoded, its not fixed and changes from rune to rune.
Its the step between two indexes.

@randall77

This comment has been minimized.

Copy link
Contributor

commented Nov 5, 2018

I'd rather fix this by doing the optimization in the compiler.
Changing the language to support this optimization seems overkill.

@martisch

This comment has been minimized.

Copy link
Member Author

commented Nov 5, 2018

Agreed, however unless the functions are very simple the compiler cant apply this optimization as any write to a byteslice or byte through a pointer needs to assume the byteslice iterated over was changed unless the compiler can prove they cant overlap. As far as i understand the compiler for example cant optimize the bytes.Map example as the mapping function could change the bytes of the byteslice being iterated over during the range loop.

The other simplification this proposal provides is allowing to have the already computed rune size (step to the next index) in the loop be usable by code in the loop without calling utf8.DecodeRune(InString).

Updated the proposal to reflect those points better.

@martisch martisch changed the title proposal: for index, rune, size = range for byte slices and strings proposal: for index, rune, runelen = range for byte slices and strings Nov 5, 2018

@deanveloper

This comment has been minimized.

Copy link

commented Nov 6, 2018

This should definitely not be done for []byte. Bytes do not have any concept of a "rune" in any way, and shouldn't.

edit - To add on to this, this would mean that doing i, e, _ := range byteslice would be entirely different from i, e := range byteslice

@martisch

This comment has been minimized.

Copy link
Member Author

commented Nov 6, 2018

This should definitely not be done for []byte. Bytes do not have any concept of a "rune" in any way, and shouldn't.

I would not think this is that of an absolute counter point given the existence of utf8.DecodeRune and other utf8 function operating on byte slices and the general symmetry of byteslices and string in various packages like utf8 showing byte slices frequently being interpreted as utf8 encoded runes.

edit - To add on to this, this would mean that doing i, e, _ := range byteslice would be entirely different from i, e := range byteslice

fair point that will indeed be confusing.

Maybe something like i, r = range(rune) byteslice would better signify that range interprets the byteslice as runes and byteslice does not have an inherent rune concept.

@deanveloper

This comment has been minimized.

Copy link

commented Nov 6, 2018

I would not think this is that of an absolute counter point given the existence of utf8.DecodeRune and other utf8 function operating on byte slices and the general symmetry of byteslices and string in various packages like utf8 showing byte slices frequently being interpreted as utf8 encoded runes.

That's because utf8 is an encoding. When one does utf8.DecodeRune(byteslice), they are saying "take this byte slice and look at it like it's encoded in utf8. Whats the first rune?". The byteslice has no knowledge of its encoding, and it rightfully shouldn't. It's just a bunch of numbers.

And of course the utf8 package is going to assume utf8 encoded bytes. I could say the same about the utf16 package but say that we should be using utf16 instead.

The point I'm trying to make is, bytes aren't special. They are just a uint8, and should be treated as such. If we want to introduce a concept of runes for your byte slice, either use a package that specifies how to convert from bytes to runes (like utf8), or convert it to a string.

Maybe something like i, r = range(rune) byteslice would better signify that range interprets the byteslice as runes and byteslice does not have an inherent rune concept.

Or i, r = range string(byteslice) 😉

@martisch

This comment has been minimized.

Copy link
Member Author

commented Nov 6, 2018

That's because utf8 is an encoding. When one does utf8.DecodeRune(byteslice), they are saying "take this byte slice and look at it like it's encoded in utf8. Whats the first rune?". The byteslice has no knowledge of its encoding, and it rightfully shouldn't. It's just a bunch of numbers.

The same is true for strings. https://golang.org/ref/spec#String_types "A string type represents the set of string values. A string value is a (possibly empty) sequence of bytes." but currently range UTF8 decodes runes from the bytes of a string. The range keyword introduces the way the string or byteslice is decoded even if both have no knowledge of this on their own.

Or i, r = range string(byteslice) 😉

Which is not the same semantic since changes to the byteslice are not taken into account during iteration and therefore often a copy is needed.

@deanveloper

This comment has been minimized.

Copy link

commented Nov 6, 2018

The same is true for strings. https://golang.org/ref/spec#String_types "A string type represents the set of string values. A string value is a (possibly empty) sequence of bytes."

I thought this would be brought up. I wasn't going to do it, but I was thinking that if you did I would mention that both ranging over a string and converting it to []rune assume utf8 encoding, as you have mentioned already.

The range keyword introduces the way the string or byteslice is decoded even if both have no knowledge of this on their own.

There's not a single thing built-in to the language that assumes that a []byte is utf8 encoded IIRC.

Which is not the same semantic since changes to the byteslice are not taken into account during iteration and therefore often a copy is needed.

This is a fair assessment, although I feel like in the case where you want it to be available to be changed, the utf8 package (as verbose as it would be) should be used.

import "utf8"

func main() {
	byteslice := []byte("안녕!")

	for i := 0; i < len(byteslice); {
		r, rlen := utf8.DecodeRune(byteslice[i:])
		
		// code here
		
		i += rlen
	}
}

Anyway, the point I'm trying to make is that there's not a single feature in the language that assumes that a []byte is utf8 encoded, and I don't think we should add one. I'm alright with this proposal for string, but I don't agree that we should add it to []byte as well.

@martisch martisch changed the title proposal: for index, rune, runelen = range for byte slices and strings proposal: for index, rune, runelen = range for strings Nov 6, 2018

@martisch

This comment has been minimized.

Copy link
Member Author

commented Nov 6, 2018

updated the proposal to the strings version only.

@ianlancetaylor ianlancetaylor changed the title proposal: for index, rune, runelen = range for strings proposal: Go 2: for index, rune, runelen = range for strings Nov 7, 2018

@bradfitz

This comment has been minimized.

Copy link
Member

commented Dec 11, 2018

Or if we do the bug where rune32 and int32 are separate types (#29012), then we could define len(a rune32) to be the UTF-8 length of the rune.

But in general we're moving away from hard-coding UTF-8 assumptions from the language.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Dec 11, 2018

If performance of this kind of operation is essential, we should teach the compiler to treat the relevant functions from the utf8 package as intrinsics. Ranging over a string is already somewhat non-orthogonal. Let's not make it even more different.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.