
proposal: Go 2: make rune be a new type of concrete type int32, not an alias #29012

Open
Zemnmez opened this issue Nov 29, 2018 · 12 comments

@Zemnmez commented Nov 29, 2018

Issue

Currently, rune is an alias for int32. This behavior predates the introduction of general type aliases in Go 1.9, and it works like this:

s := "hello world!"
var i int32 = 1

for _, r := range s {
	fmt.Println(r + i)
}
/* Output: 
105
102
...
*/

This also means, rather more subtly, that functions handling runes and functions handling int32s implicitly accept each other's inputs without any kind of warning or error, like this:

var identity = func(r rune) rune { return r }
var i int32 = 10
fmt.Println(identity(i))
// Output: 10

Though this makes sense once you know about the little-known aliasing, it gets especially confusing in cases where a rune value is rejected and the error message doesn't mention the alias:

./test.go:11: invalid operation: a + c (mismatched types rune and int32)

Because rune and int32 are aliases for the same type, it's not possible to put both in a type switch; attempting to do so produces a mysterious and confusing error like this:

var a int32
var b rune
var which = func(i interface{}) string {
	switch i.(type){
	case int32: return "int32"
	case rune: return "rune"
	}
	panic("unknown!")
}
	
fmt.Println(which(a))
fmt.Println(which(b))
/* error:
prog.go:16:3: duplicate case rune in type switch
	previous case at prog.go:15:3
*/

Of course, if you reflect a rune, it thinks it's an int32:

var a int32
var b rune
var which = func(i interface{}) string {
	return reflect.TypeOf(i).String()
}
	
fmt.Println(which(a))
fmt.Println(which(b))
// Output:
// int32
// int32

It is additionally odd that rune constants are untyped numeric constants, not actually of type rune, and will do things ordinary runes can't if you perform an operation on them, for example:

var i int64 = 10
fmt.Println(i + 'a')
// Output: 107

Proposal

It's my opinion that it can be surprising and inconvenient to those new to the language that runes are invisibly int32s. It would make more sense for operations mixing runes and int32s to require a conversion, like this:

s := "hello world!"
var i int32 = 1

for _, r := range s {
	fmt.Println(r + rune(i)) // or: int32(r) + i
}

Additionally, rune constants could be of type untyped rune, the same way string constants are of type untyped string:

var i int32 = 3
fmt.Println(int32('a') + i)

In a modern language that standardises on runes as UTF-32 encoded characters, I think it only makes sense for runes to be only explicitly convertible to types that ostensibly represent numeric scalars.

It's my understanding that, since at this juncture it's impossible to know whether a value is a rune or an int32, fmt and similar tools can only print the numeric value of the character, rather than the code point it represents.

@gopherbot gopherbot added this to the Proposal milestone Nov 29, 2018

@gopherbot gopherbot added the Proposal label Nov 29, 2018

@ianlancetaylor (Contributor) commented Nov 29, 2018

The recent blog entry https://blog.golang.org/go2-here-we-come suggests that each language change should address an important issue for many people with minimal impact on everybody else. To me this issue, while real, doesn't seem all that important, and making this change would break an unknown number of currently working and correct Go programs. At first glance I'm not sure this passes the bar.

@alanfo commented Nov 30, 2018

Exactly the same arguments could be applied to byte which is an alias of uint8.

Personally, I've never found these aliases confusing. If I'm using characters (or whatever one should call them in Go) I use the aliases, otherwise I use the ordinary numeric types.

Also there's no inheritance in Go and so rune couldn't be a subtype of int32. It would just have to be a type which, although represented internally as an int32, would be distinct from it and therefore require an explicit conversion when converting between the two.

Sorry, but I'm not a fan of the idea.

@Zemnmez (Author) commented Nov 30, 2018

@ianlancetaylor

The recent blog entry https://blog.golang.org/go2-here-we-come suggests that each language change should address an important issue for many people with minimal impact on everybody else. To me this issue, while real, doesn't seem all that important, and making this change would break an unknown number of currently working and correct Go programs. At first glance I'm not sure this passes the bar.

breaking changes from the rune constant change are rare, and can be addressed

As it stands, my proposed change to rune constants would break parsers that operate on Unicode code points as numeric values, wherever those are compared with non-rune types. The most common use of these constants is, to my knowledge, C-style code point parsing, like this:

func allLatinAlpha(s string) bool {
	for _, r := range s {
		if !((r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z')) {
			return false
		}
	}
	return true
}

This would not be broken by my constant proposal (since all the types here are rune). An uncommon usage that would break is bitwise operations on runes where the shift operand is a numeric constant, e.g. 'a' << 1, which could potentially be used by character-encoding parsers.

I think there are ways to address this use case by making the semantic upgrade from untyped integer to untyped rune implicit, while keeping the semantic downgrade from untyped rune to untyped integer explicit. In code:

// allowed: the untyped integer `3` can be converted to rune implicitly
'a' + 3

// not allowed: the untyped rune 'a' cannot be converted to int32 implicitly
'a' + int32(3)

Though the type system already does something like this, with 'ideal' untyped types being convertible to their corresponding real types but not in reverse, it is obviously a little complex, and I'm not terribly tied to the idea of reworking rune constants this way. I think it makes a lot of sense, and especially helps new Go programmers, but I'd be happy enough just to see rune unaliased from int32.

using the rune and int32 types like this is rare

It's very rare to have int32 and rune types interacting in this way, since, to use this 'feature', a programmer would in essence have to already know the types were identical (something that couldn't even be expressed in the Go type system until Go 1.9) and choose to exploit that to save a type conversion. int32 is not a terribly common type (usually plain int is used), so I find it hard to imagine this happening by coincidence.

I'd love to have better data on this, if I could use, for example godoc.org as a dataset.

@alanfo

Exactly the same arguments could be applied to byte which is an alias of uint8.

rune is defined to be a UTF-32 representation of a code point; int32 is not.

I'd argue that byte has identical semantic meaning to 'an unsigned integer of eight bits'. They're both semantically arbitrary, unknown numeric scalar data.

A rune however is specifically stipulated by spec to be a UTF-32 encoded Unicode codepoint, which in my opinion carries significantly different meaning.

Personally, I've never found these aliases confusing. If I'm using characters (or whatever one should call them in Go) I use the aliases, otherwise I use the ordinary numeric types.

You're using the types to denote their use, as I am pushing for. However, because these types are aliased, it's not possible to determine whether a value is an int32 or a rune at runtime, which makes the distinction essentially void for conditional execution based on type.

rune and int32 should be used to denote 'characters' and 'numbers' respectively, and they are impossible to differentiate at present.

As a concrete example, I've personally been bitten writing a simple interpreter that uses Go types. It's not possible, for example, to store reflect.TypeOf(new(rune)) and reflect.TypeOf(new(int32)) as distinct types in memory, because they are the same to the type system.

It's also not possible at this juncture to use fmt to immediately show a rune as a string, because it's not possible to determine whether the value is a numeric int32 or a rune value.

Also there's no inheritance in Go and so rune couldn't be a subtype of int32. It would just have to be a type which, although represented internally as an int32, would be distinct from it and therefore require an explicit conversion when converting between the two.

I'm aware that Go does not have a traditional inheritance model. It's my understanding that the construction type rune int32, rather than type rune = int32, would make rune a defined type rather than an alias.

@deanveloper commented Dec 3, 2018

Exactly the same arguments could be applied to byte which is an alias of uint8.

Except that byte and uint8 represent the same thing, numbers. rune and int32 are separate concepts, where runes represent Unicode codepoints and int32s represent numbers. When type aliases were introduced, it was mentioned that they should be used very conservatively and that they were a "necessary evil", yet we use one to alias rune and int32, which are two different concepts. It makes much more intuitive sense for them to be separate types.

Personally, I've never found these aliases confusing. If I'm using characters (or whatever one should call them in Go) I use the aliases, otherwise I use the ordinary numeric types.

All this proposal does is force you to use separate rune/int32 types. You can still easily convert between the two if you need to.

Also there's no inheritance in Go and so rune couldn't be a subtype of int32. It would just have to be a type which, although represented internally as an int32, would be distinct from it and therefore require an explicit conversion when converting between the two.

Yes, that's what this proposal is trying to do. Represent them as separate types, because that's how they should be treated. Separate concepts should have separate types. If I make a UUID type, it should be a separate type from [2]uint64, even if that's what I use to represent it internally, and if I want to convert between the two, I should have to do it explicitly. [2]uint64 and UUID are two different concepts, so they shouldn't be aliased.

@deanveloper commented Dec 3, 2018

While I overall support this proposal, I don't like the idea of untyped rune constants being separate from untyped integer constants; a rune constant should still be an untyped integer constant, and all untyped integer constants should be assignable to runes. This stays consistent with how type definitions work in current Go.

@alanfo commented Dec 6, 2018

Well, remembering that strings are a sequence of bytes and that you can (and frequently do) assign rune literals in the ASCII range to byte variables, I don't agree that people think of bytes as being just numbers. If they did, then why bother having the byte alias in the first place?

If we could go back in time to Go 1.0, then I would be much more sympathetic to this proposal as I remember being surprised when I first studied Go that it didn't have a separate 'char' type as many other languages do.

However, making this change in Go 2 is going to break a fair bit of code and, for me at least, it certainly doesn't satisfy the recently introduced '10 x gain/pain' criterion for filtering out potentially implementable proposals.

@deanveloper commented Dec 6, 2018

If we could go back in time to Go 1.0, then I would be much more sympathetic to this proposal as I remember being surprised when I first studied Go that it didn't have a separate 'char' type as many other languages do.

Several other languages do the same thing: the C family has char (which is used as a number in a lot of cases), Java's char is essentially an unsigned 16-bit integer, etc. It's definitely not an uncommon practice to make a language's character type a number type rather than a fully separate type.

That being said, I personally don't think it's good practice, since in my head I see them as different types.

@Zemnmez (Author) commented Dec 6, 2018

Well, remembering that strings are a sequence of bytes and that you can (and frequently do) assign rune literals in the ASCII range to byte variables,

This ... is odd, and I don't understand why you would go out of your way to do this. None of the functions in bytes or unicode consume a single byte; they all consume rune or []byte. If you want a []byte from a []rune or a rune, you already have to perform an explicit conversion that would not be altered by my proposed change.

I don't agree that people think of bytes as being just numbers. If they did, then why bother having the byte alias in the first place?

We're getting into the semantic weeds of why types exist here. int in C and the rest is, in essence, just a 'storage unit' large enough to hold an integer. In this sense, you're correct: byte is fundamentally just an arbitrary storage unit that can represent anything.

char in C, for example, simply exists to define a value of arbitrary size that is defined to be able to contain the system's 'native character set'. These types represent much lower-level concepts than those in modern languages like Python or JavaScript, many of whose values are significant abstractions over the underlying memory. In both of those languages, 'letters' are single-character strings, and getting their less-abstract numeric values requires an explicit step, e.g. "x".charCodeAt(0) in JavaScript or ord("x") in Python.

Why do I mention this? In the more dynamic languages Go draws inspiration from, types don't just confer size; they also confer meaning.

Why do we use error and not always, perhaps interface{ Error() string}? Why do we have separate string and []byte types that require conversion? Why do we have rune and int32 as aliases, or byte and uint8 as aliases if they're defined to be the same size?

It's not just for ease of typing -- these types all confer contextual meaning. If I see a func([]byte) rune I know it consumes a series of bytes and returns a UTF-32 character. To use rune, or even []uint8 in any other way would be bizarre.

If we could go back in time to Go 1.0, then I would be much more sympathetic to this proposal as I remember being surprised when I first studied Go that it didn't have a separate 'char' type as many other languages do.

rune is simply a UTF-32 version of the ASCII char type. char is defined to hold a single-byte character. rune is defined to hold a 32-bit character.

@Zemnmez (Author) commented Dec 6, 2018

I wrote a lot of words in my last comment about type systems, but a more succinct way to explain what I mean: in a typical program you might define type UserID int. This type is, without any methods, essentially an int. But it carries extra meaning along with it: that the value being described is an identifier for a user, and the type checker ensures that I don't accidentally cross the streams by using a type ArticleID int instead. rune is, for all concrete intents and purposes, an int32. However, when rune is used instead of int32, it signifies that what we're handling is a UTF-32 encoded character, as defined in the spec. These use cases are the same.

@bradfitz bradfitz changed the title proposal: Go 2: make `rune` a subtype of `int32`, rather than an alias proposal: Go 2: make rune be a new type of concrete type int32, not an alias Dec 11, 2018

@bradfitz (Member) commented Jan 8, 2019

We want to collect data on how much code this would break in the wild. @griesemer was going to work on that.

@hewenyang commented Feb 9, 2019

I agree that rune and int32 should be two different concepts and must be converted explicitly.
When you think about a rune you often don't care how it is encoded. If only 2^21 different characters existed on the planet, rune could have been a [3]byte; or it might have been a [4]byte instead of an int32, since UTF-32 is really a stream of bytes rather than integers, which would prevent confusion over endianness. Only a parser that needs the raw data cares about the underlying type.

This is also useful for marshaling/unmarshaling. Assume the following struct:

type Message struct {
	Val1 rune
	Val2 []rune
}

Then the encoder could decide to encode Val1 as either an integer or a single-character UTF-32 string, depending on the options passed to the encoder. While rune is an alias for int32, the encoder can't make that distinction, and worse, it has to encode Val2 as an array of integers even though Val2 clearly means a Unicode string.

@vvakame commented Mar 6, 2019

+1 for this topic.
I want to see the letter in the debugger.

[screenshot: rune_in_debugger]
