Proposal: Go 2 Number Literal Changes

Russ Cox
Robert Griesemer

Last updated: March 6, 2019

golang.org/design/19308-number-literals

Discussion at:

Abstract

We propose four related changes to number literals in Go:

  1. Add binary integer literals, as in 0b101.
  2. Add alternate octal integer literals, as in 0o377.
  3. Add hexadecimal floating-point literals, as in 0x1p-1021.
  4. Allow _ as a digit separator in number literals.

Background

Go adopted C’s number literal syntax and in so doing joined a large group of widely-used languages that all broadly agree about how numbers are written. The group of such “C-numbered languages” includes at least C, C++, C#, Java, JavaScript, Perl, PHP, Python, Ruby, Rust, and Swift.

In the decade since Go’s initial design, nearly all the C-numbered languages have extended their number literals to add one or more of the four changes in this proposal. Extending Go in the same way makes it easier for developers to move between these languages, eliminating an unnecessary rough edge without adding significant complexity to the language.

Binary Integer Literals

The idea of writing a program’s integer literals in binary is quite old, dating back at least to PL/I (1964), which used '01111000'B.

In C’s lineage, CPL (1966) supported decimal, binary, and octal integers. Binary and octal were introduced by an underlined 2 or 8 prefix. BCPL (1967) removed binary but retained octal, still introduced by an 8 (it’s unclear whether the 8 was underlined or followed by a space). B (1972) introduced the leading zero syntax for octal, as in 0377. C as of 1974 had only decimal and octal. Hexadecimal 0x12ab had been added by the time K&R (1978) was published. Possibly the earliest use of the exact 0b01111000 syntax was in Caml Light 0.5 (1992), which was written in C and borrowed 0x12ab for hexadecimal.

Binary integer literals using the 0b01111000 syntax were added in C++14 (2014), C# 7.0 (2017), Java 7 (2011), JavaScript ES6 (2015), Perl 5.005_55 (1998), PHP 5.4.0 (2012), Python 2.6 (2008), Ruby 1.4.0 (1999), Rust 0.1 or earlier (2012), and Swift 1.0 or earlier (2014).

The syntax is a leading 0b prefix followed by some number of 0s and 1s. There is no corresponding character escape sequence (that is, no '\b01111000' for 'x', since '\b' is already used for backspace, U+0008). Most languages also updated their integer parsing and formatting routines to support binary forms as well.

Although C++14 added binary integer literals, C itself has not, as of C18.

Octal Integer Literals

As noted earlier, octal was the most widely-used form for writing bit patterns in the early days of computing (after binary itself). Even though octal today is far less common, B’s introduction of 0377 as syntax for octal carried forward into C, C++, Go, Java, JavaScript, Python, Perl, PHP, and Ruby. But because programmers don't see octal much, it sometimes comes as a surprise that 01234 is not 1234 decimal or that 08 is a syntax error.

Caml Light 0.5 (1992), mentioned above as possibly the earliest language with 0b01111000 for binary, may also have been the first to use the analogous notation 0o377 for octal.

JavaScript ES3 (1999) technically removed support for 0377 as octal, but of course allowed implementations to continue recognizing them. ES5 (2009) added “strict mode,” in which, among other restrictions, octal literals are disallowed entirely (0377 is an error, not decimal). ES6 (2015) introduced the 0o377 syntax, allowed even in strict mode.

Python’s initial release (1991) used 0377 syntax for octal. Python 3 (2008) changed the syntax to 0o377, removing the 0377 syntax (0377 is an error, not decimal). Python 2.7 (2010) backported 0o377 as an alternate octal syntax (0377 is still supported).

Rust (2012) initially had no octal syntax but added 0o377 in Rust 0.9 (2014). Swift’s initial release (2014) used 0o377 for octal. Both Rust and Swift allow decimals to have leading zeros (0377 is decimal 377), creating a potential point of confusion for programmers coming from other C-numbered languages.

Hexadecimal Floating-Point

The exact decimal floating-point literal syntax of C and its successors (1.23e4) appears to have originated at IBM in Fortran (1956), some time after the 1954 draft. The syntax was not used in Algol 60 (1960) but was adopted by PL/I (1964) and Algol 68 (1968), and it spread from those into many other languages.

Hexadecimal floating-point literals appear to have originated in C99 (1999), spreading to C++17 (2017), Java 5 (2004), Perl 5.22 (2015), and Swift's initial release (2014). IEEE 754-2008 also added hexadecimal floating-point literals, citing C99.

All these languages use the syntax 0x123.fffp5, where the “pN” specifies a decimal number interpreted as a power of two: 0x123.fffp5 is (0x123 + 0xfff/0x1000) x 2^5. In all languages, the exponent is required: 0x123.fff is not a valid hexadecimal floating-point literal.
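As a quick check of that arithmetic, the sketch below evaluates the formula by hand and compares the result with the literal itself. It assumes a Go toolchain that already implements this proposal; the function name hexFloatByHand is illustrative only.

```go
package main

import (
	"fmt"
	"math"
)

// hexFloatByHand evaluates (0x123 + 0xfff/0x1000) x 2^5 step by step.
func hexFloatByHand() float64 {
	// Mantissa: integer part plus three hex fraction digits (291 + 4095/4096).
	mantissa := float64(0x123) + float64(0xfff)/float64(0x1000)
	// The "p5" exponent scales the mantissa by 2^5.
	return math.Ldexp(mantissa, 5)
}

func main() {
	fmt.Println(hexFloatByHand())                // 9343.9921875
	fmt.Println(hexFloatByHand() == 0x123.fffp5) // true
}
```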

The fraction may be omitted, as in 0x1p-1000. C, C++, Java, Perl, and the IEEE 754-2008 standard allow omitting the digits before or after the hexadecimal point: 0x1.p0 and 0x.fp0 are valid hexadecimal floating-point literals just as 1. and .9 are valid decimal literals. Swift requires digits on both sides of a decimal or hexadecimal point; that is, in Swift, 0x1.p0, 0x.fp0, 1., and .9 are all invalid.

Adding hexadecimal floating-point literals also requires adding library support. C99 added the %a and %A printf formats for formatting and %a for scanning. It also redefined strtod to accept hexadecimal floating-point values. The other languages made similar changes.

C# (as of C# 7.3, which has no published language specification), JavaScript (as of ES8), PHP (as of PHP 7.3.0), Python (as of Python 3.7.2), Ruby (as of Ruby 2.6.0), and Rust (as of Rust 1.31.1) do not support hexadecimal floating-point literals.

Digit Separators

Allowing the use of an underscore to separate digits in a number literal into groups dates back at least to Ada 83, possibly earlier.

A digit-separating underscore was added to C# 7.0 (2017), Java 7 (2011), Perl 2.0 (1988), Python 3.6 (2016), Ruby 1.0 or earlier (1998), Rust 0.1 or earlier (2012), and Swift 1.0 or earlier (2014).

C has not yet added digit separators as of C18. C++14 uses single-quote as a digit separator to avoid an ambiguity with C++11 user-defined integer suffixes that might begin with underscore. JavaScript is considering adding underscore as a digit separator but ran into a similar problem with user-defined suffixes. PHP considered but decided against adding digit separators.

The design space for a digit separator feature reduces to four questions: (1) whether to accept a separator immediately after the single-digit octal 0 base prefix, as in 0_1; (2) whether to accept a separator immediately after non-digit base prefixes like 0b, 0o, and 0x, as in 0x_1; (3) whether to accept multiple separators in a row, as in 1__2; and (4) whether to accept trailing separators, as in 1_. (Note that a “leading separator” would create a variable name, as in _1.) These four questions produce sixteen possible approaches.

Case 0b0001: If the name “digit separator” is understood literally, so that each underscore must separate (appear between) digits, then the answers should be that 0_1 is allowed but 0x_1, 1__2, and 1_ are all disallowed. This is the approach taken by Ada 83 (using 8#123# for octal and so avoiding question 1), C++14, Java 7, and Swift (using only 0o for octal and thereby also avoiding question 1).

Case 0b0011: If we harmonize the treatment of the 0 octal base prefix with the 0b, 0o, and 0x base prefixes by allowing a digit separator between a base prefix and leading digit, then the answers are that 0_1 and 0x_1 are allowed but 1__2 and 1_ are disallowed. This is the approach taken in Python 3.6 and Ruby 1.8.0.

Case 0b0111: If we allow runs of multiple separators as well, that allows 0_1, 0x_1, and 1__2, but not 1_. This is the approach taken in C# 7.2 and Ruby 1.6.2.

Case 0b1111: If we then also accept trailing digit separators, the implementation becomes trivial: ignore digit separators wherever they appear. Perl takes this approach, as does Rust.

Other combinations have been tried: C# 7.0 used 0b0101 (0x_1 and 1_ disallowed) before moving to case 0b0111 in C# 7.2. Ruby 1.0 used 0b1110 (only 0_1 disallowed) and Ruby 1.3.1 used 0b1101 (only 0x_1 disallowed), before Ruby 1.6.2 tried 0b0111 and Ruby 1.8.0 settled on 0b0011.
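The case numbers used above read as four-bit masks, with the lowest bit answering question 1 and the highest bit answering question 4. The following sketch decodes a case number into the example literals it accepts. The constant and function names are hypothetical and chosen only for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Each bit answers one of the four questions; a set bit means "accepted".
// These names are illustrative and do not come from any language specification.
const (
	afterZeroPrefix   = 1 << iota // question 1: 0_1
	afterLetterPrefix             // question 2: 0x_1
	repeated                      // question 3: 1__2
	trailing                      // question 4: 1_
)

// accepted lists the example literals that the given case number allows.
func accepted(c uint) string {
	var out []string
	if c&afterZeroPrefix != 0 {
		out = append(out, "0_1")
	}
	if c&afterLetterPrefix != 0 {
		out = append(out, "0x_1")
	}
	if c&repeated != 0 {
		out = append(out, "1__2")
	}
	if c&trailing != 0 {
		out = append(out, "1_")
	}
	return strings.Join(out, " ")
}

func main() {
	for _, c := range []uint{0b0001, 0b0011, 0b0111, 0b1111} {
		fmt.Printf("case 0b%04b accepts: %s\n", c, accepted(c))
	}
}
```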

A similar question arises for whether to allow underscore between a decimal point and a decimal digit in a floating-point number, or between the literal e and the exponent. We won’t enumerate the cases here, but again languages make surprising choices. For example, in Rust, 1_.2 is valid but 1._2 is not.

Proposal

We propose to add binary integer literals, to add octal 0o377 as an alternate octal literal syntax, to add hexadecimal floating-point literals, and to add underscore as a base-prefix-or-digit separator (case 0b0011 above; see rationale below), along with appropriate library support. Finally, to fit the existing imaginary literals seamlessly into the new number literals, we propose that the imaginary suffix i may be used on any (non-imaginary) number literal.

Language Changes

The definitions in https://golang.org/ref/spec#Letters_and_digits add:

binary_digit = "0" | "1" .

The https://golang.org/ref/spec#Integer_literals section would be amended to read:

An integer literal is a sequence of digits representing an integer constant. An optional prefix sets a non-decimal base: 0, 0o, or 0O for octal, 0b or 0B for binary, 0x or 0X for hexadecimal. A single 0 is considered a decimal zero. In hexadecimal literals, letters a-f and A-F represent values 10 through 15. For readability, an underscore may appear after a base prefix or between successive digits; such underscores do not change the literal value.

int_lit        = decimal_lit | binary_lit | octal_lit | hex_lit .
decimal_lit    = "0" | ( "1" … "9" ) [ [ "_" ] decimal_digits ] .
binary_lit     = "0" ( "b" | "B" ) [ "_" ] binary_digits .
octal_lit      = "0" [ "o" | "O" ] [ "_" ] octal_digits .
hex_lit        = "0" ( "x" | "X" ) [ "_" ] hex_digits .

decimal_digits = decimal_digit { [ "_" ] decimal_digit } .
binary_digits  = binary_digit { [ "_" ] binary_digit } .
octal_digits   = octal_digit { [ "_" ] octal_digit } .
hex_digits     = hex_digit { [ "_" ] hex_digit } .

42
4_2
0600
0_600
0o600
0O600       // second character is capital letter 'O'
0xBadFace
0xBad_Face
0x_67_7a_2f_cc_40_c6
170141183460469231731687303715884105727
170_141183_460469_231731_687303_715884_105727

_42         // an identifier, not an integer literal
42_         // invalid: _ must separate successive digits
4__2        // invalid: only one _ at a time
0_xBadFace  // invalid: _ must separate successive digits
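One way to sanity-check the grammar against these examples is to transcribe int_lit into a regular expression. The sketch below is only an illustration, not how the compiler or go/scanner recognizes literals; the names intLit and isIntLit are hypothetical. The final line assumes a toolchain that implements this proposal.

```go
package main

import (
	"fmt"
	"regexp"
)

// intLit transcribes the int_lit grammar: each alternative mirrors one
// production, and (?:_?d)+ encodes an optional "_" before every digit.
var intLit = regexp.MustCompile(`^(?:` +
	`0|[1-9](?:_?[0-9])*` + // decimal_lit
	`|0[bB](?:_?[01])+` + // binary_lit
	`|0[oO]?(?:_?[0-7])+` + // octal_lit
	`|0[xX](?:_?[0-9a-fA-F])+` + // hex_lit
	`)$`)

// isIntLit reports whether s is a complete integer literal under the grammar.
func isIntLit(s string) bool { return intLit.MatchString(s) }

func main() {
	examples := []string{
		"4_2", "0_600", "0O600", "0x_67_7a_2f_cc_40_c6", // valid
		"_42", "42_", "4__2", "0_xBadFace", // not integer literals
	}
	for _, s := range examples {
		fmt.Printf("%-22s %v\n", s, isIntLit(s))
	}
	// The compiler agrees on the values of the valid forms.
	fmt.Println(4_2 == 42, 0_600 == 0o600, 0xBad_Face == 0xBadFace)
}
```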

The https://golang.org/ref/spec#Floating-point_literals section would be amended to read:

A floating-point literal is a decimal or hexadecimal representation of a floating-point constant. A decimal floating-point literal consists of an integer part (decimal digits), a decimal point, a fractional part (decimal digits) and an exponent part (e or E followed by an optional sign and decimal digits). One of the integer part or the fractional part may be elided; one of the decimal point or the exponent part may be elided. A hexadecimal floating-point literal consists of a 0x or 0X prefix, an integer part (hexadecimal digits), a decimal point, a fractional part (hexadecimal digits), and an exponent part (p or P followed by an optional sign and decimal digits). One of the integer part or the fractional part may be elided; the decimal point may be elided as well, but the exponent part is required. (This syntax matches the one given in IEEE 754-2008 §5.12.3.) For readability, an underscore may appear after a base prefix or between successive digits; such underscores do not change the literal value.

float_lit         = decimal_float_lit | hex_float_lit .

decimal_float_lit = decimal_digits "." [ decimal_digits ] [ decimal_exponent ] |
                    decimal_digits decimal_exponent |
                    "." decimal_digits [ decimal_exponent ] .
decimal_exponent  = ( "e" | "E" ) [ "+" | "-" ] decimal_digits .

hex_float_lit     = "0" ( "x" | "X" ) hex_mantissa hex_exponent .
hex_mantissa      = [ "_" ] hex_digits "." [ hex_digits ] |
                    [ "_" ] hex_digits |
                    "." hex_digits .
hex_exponent      = ( "p" | "P" ) [ "+" | "-" ] decimal_digits .


0.
72.40
072.40       // == 72.40
2.71828
1.e+0
6.67428e-11
1E6
.25
.12345E+5
1_5.         // == 15.0
0.15e+0_2    // == 15.0

0x1p-2       // == 0.25
0x2.p10      // == 2048.0
0x1.Fp+0     // == 1.9375
0X.8p-0      // == 0.5
0X_1FFFP-16  // == 0.1249847412109375
0x15e-2      // == 0x15e - 2 (integer subtraction)

0x.p1        // invalid: mantissa has no digits
1p-2         // invalid: p exponent requires hexadecimal mantissa
0x1.5e-2     // invalid: hexadecimal mantissa requires p exponent
1_.5         // invalid: _ must separate successive digits
1._5         // invalid: _ must separate successive digits
1.5_e1       // invalid: _ must separate successive digits
1.5e_1       // invalid: _ must separate successive digits
1.5e1_       // invalid: _ must separate successive digits
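The values in the comments above can be confirmed with a short program, assuming a toolchain that implements this proposal. The function names are illustrative. Uppercase prefixes compile as written, although gofmt would lowercase them.

```go
package main

import "fmt"

// floatExamples pairs each literal's source spelling with its float64 value.
func floatExamples() map[string]float64 {
	return map[string]float64{
		"1_5.":        1_5.,
		"0.15e+0_2":   0.15e+0_2,
		"0x1p-2":      0x1p-2,
		"0x2.p10":     0x2.p10,
		"0x1.Fp+0":    0x1.Fp+0,
		"0X.8p-0":     0X.8p-0,
		"0X_1FFFP-16": 0X_1FFFP-16,
	}
}

// hexIntMinusTwo shows that 0x15e-2 is integer subtraction, not a float:
// 'e' is a hexadecimal digit, so the literal is 0x15e (350) and the result is 348.
func hexIntMinusTwo() int { return 0x15e - 2 }

func main() {
	for src, v := range floatExamples() {
		fmt.Printf("%-12s = %v\n", src, v)
	}
	fmt.Println("0x15e-2      =", hexIntMinusTwo())
}
```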

The https://golang.org/ref/spec#Imaginary_literals section would be amended to read:

An imaginary literal represents the imaginary part of a complex constant. It consists of an integer or floating-point literal followed by the lower-case letter i. The value of an imaginary literal is the value of the respective integer or floating-point literal multiplied by the imaginary unit i.

imaginary_lit = (decimal_digits | int_lit | float_lit) "i" .

For backward-compatibility, an imaginary literal's integer part consisting entirely of decimal digits (and possibly underscores) is considered a decimal integer, not octal, even if it starts with a leading 0.

0i
0123i         // == 123i for backward-compatibility
0o123i        // == 0o123 * 1i == 83i
0xabci        // == 0xabc * 1i == 2748i
0.i
2.71828i
1.e+0i
6.67428e-11i
1E6i
.25i
.12345E+5i
0x1p-2i       // == 0x1p-2 * 1i == 0.25i
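The backward-compatibility rule is easiest to see side by side with the new prefixes. The sketch below assumes a toolchain that implements this proposal and uses an illustrative function name.

```go
package main

import "fmt"

// imaginaryExamples pairs each literal's source spelling with its value.
func imaginaryExamples() map[string]complex128 {
	return map[string]complex128{
		"0123i":   0123i,   // decimal 123 for backward compatibility, not octal 83
		"0o123i":  0o123i,  // explicit octal: 83i
		"0xabci":  0xabci,  // hexadecimal: 2748i
		"0x1p-2i": 0x1p-2i, // hexadecimal floating-point: 0.25i
	}
}

func main() {
	for src, v := range imaginaryExamples() {
		fmt.Printf("%-8s = %v\n", src, v)
	}
}
```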

Library Changes

In fmt, Printf with %#b will format an integer argument in binary with a leading 0b prefix. Today, %b already formats an integer in binary with no prefix; %#b does the same but is rejected by go vet, including during go test, so redefining %#b will not break vetted, tested programs.

Printf with %#o is already defined to format an integer argument in octal with a leading 0 (not 0o) prefix, and all the other available format flags have defined effects too. It appears no change is possible here. Clients can use 0o%o, at least for non-negative arguments.

Printf with %x will format a floating-point argument in hexadecimal floating-point syntax. (Today, %x on a floating-point argument formats as a %!x error and also provokes a vet error.) Scanf will accept both decimal and hexadecimal floating-point forms where it currently accepts decimal.
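A minimal sketch of the three formatting cases, assuming a fmt package that implements the changes described above. The helper names are hypothetical. The octal helper takes an unsigned argument because the 0o%o workaround is only correct for non-negative values.

```go
package main

import (
	"fmt"
	"strconv"
)

// binaryWithPrefix formats n in binary with the leading 0b that %#b adds.
func binaryWithPrefix(n int) string { return fmt.Sprintf("%#b", n) }

// octalWithPrefix spells out the 0o prefix by hand, since %#o keeps
// its existing meaning of a bare leading 0.
func octalWithPrefix(n uint) string { return fmt.Sprintf("0o%o", n) }

// hexFloatRoundTrips formats f with %x and checks that parsing the
// result yields exactly the same float64.
func hexFloatRoundTrips(f float64) bool {
	back, err := strconv.ParseFloat(fmt.Sprintf("%x", f), 64)
	return err == nil && back == f
}

func main() {
	fmt.Println(binaryWithPrefix(5))    // 0b101
	fmt.Println(octalWithPrefix(0o377)) // 0o377
	fmt.Printf("%x\n", 0.25)            // hexadecimal floating-point form
	fmt.Println(hexFloatRoundTrips(0.1))
}
```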

In go/scanner, the implementation must change to understand the new syntax, but the public API needs no changes. Because text/scanner recognizes Go’s number syntax as well, it will be updated to add the new numbers too.
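Because the public API is unchanged, existing go/scanner clients see the new forms as ordinary INT, FLOAT, and IMAG tokens. The sketch below, with a hypothetical helper name, assumes a go/scanner that already implements the new syntax.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// literalKinds scans src as Go source text and returns the token kind
// (for example "INT" or "FLOAT") of each literal it finds.
func literalKinds(src string) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0) // nil handler: errors are only counted
	var kinds []string
	for {
		_, tok, _ := s.Scan()
		if tok == token.EOF {
			break
		}
		if tok.IsLiteral() { // skips the automatically inserted semicolon
			kinds = append(kinds, tok.String())
		}
	}
	return kinds
}

func main() {
	fmt.Println(literalKinds("0b1010 0o17 0x1p-2 1_000 0o123i"))
}
```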

In math/big, Int.SetString with base set to zero accepts binary integer literals already; it will change to recognize the new octal prefix and the underscore digit separator. ParseFloat and Float.Parse with base set to zero, Float.SetString, and Rat.SetString each accept binary integer literals and hexadecimal floating-point literals already; they will change to recognize the new octal prefix and the underscore digit separator. Calls using non-zero bases will continue to reject inputs with underscores.
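As an illustration of the base-zero versus fixed-base distinction in math/big, the following sketch wraps Int.SetString in a hypothetical helper. It assumes a math/big that implements the changes described above.

```go
package main

import (
	"fmt"
	"math/big"
)

// parseBig parses s in the given base; base 0 means "detect from the prefix".
// The result is narrowed to int64 only to keep the demo output short.
func parseBig(s string, base int) (int64, bool) {
	v, ok := new(big.Int).SetString(s, base)
	if !ok {
		return 0, false
	}
	return v.Int64(), true
}

func main() {
	fmt.Println(parseBig("0b1010_1010", 0)) // 170 true
	fmt.Println(parseBig("0o377", 0))       // 255 true
	fmt.Println(parseBig("1_000", 10))      // 0 false: fixed base rejects underscores
}
```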

In strconv, ParseInt and ParseUint will change behavior. When the base argument is zero, they will recognize binary literals like 0b0111 and also allow underscore as a digit separator. Calls using non-zero bases will continue to reject inputs with underscores. ParseFloat will change to accept hexadecimal floating-point literals and the underscore digit separator. FormatFloat will add a new format x to generate hexadecimal floating-point.
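The strconv changes can be exercised together in a few lines. This is a sketch under the assumption that strconv implements the behavior described above; the wrapper names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseInt parses s in the given base; base 0 detects the base from the prefix.
func parseInt(s string, base int) (int64, error) { return strconv.ParseInt(s, base, 64) }

// parseFloat accepts decimal and hexadecimal floating-point forms.
func parseFloat(s string) (float64, error) { return strconv.ParseFloat(s, 64) }

// formatHex uses the new 'x' format with the shortest precision that round-trips.
func formatHex(f float64) string { return strconv.FormatFloat(f, 'x', -1, 64) }

func main() {
	fmt.Println(parseInt("0b0111", 0))    // 7 <nil>
	fmt.Println(parseInt("1_000_000", 0)) // 1000000 <nil>
	_, err := parseInt("1_000", 10)
	fmt.Println(err != nil)           // true: fixed base rejects underscores
	fmt.Println(parseFloat("0x1p-2")) // 0.25 <nil>
	fmt.Println(formatHex(0.25))      // hexadecimal floating-point form
}
```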

In text/template/parse, (*lex).scanNumber will need to recognize the three new syntaxes. This will provide the new literals to both html/template and text/template.

Tool Changes

Gofmt will understand the new syntax once go/scanner is updated. For legibility, gofmt will also rewrite capitalized base prefixes 0B, 0O, and 0X and exponent prefixes E and P to their lowercase equivalents 0b, 0o, 0x, e, and p. This is especially important for 0O377 vs 0o377.

To avoid introducing incompatibilities into otherwise backward-compatible code, gofmt will not rewrite 0377 to 0o377. (Perhaps in a few years we will be able to consider doing that.)
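The rewriting rules can be observed through the go/format package, which applies the same formatting as gofmt. The sketch below assumes a go/format that implements the normalization described above; formatConst is a hypothetical helper that formats a one-constant file and returns just the literal.

```go
package main

import (
	"fmt"
	"go/format"
	"strings"
)

// formatConst formats "const x = <lit>" and returns the literal as
// it appears in the formatted output.
func formatConst(lit string) (string, error) {
	const prefix = "package p\n\nconst x = "
	out, err := format.Source([]byte(prefix + lit + "\n"))
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(strings.TrimPrefix(string(out), prefix)), nil
}

func main() {
	for _, lit := range []string{"0XABC", "0O377", "0B101", "1E6", "0X1P-2", "0377"} {
		got, err := formatConst(lit)
		fmt.Printf("%-7s -> %s (err: %v)\n", lit, got, err)
	}
}
```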

Rationale

As discussed in the background section, the choices being made in this proposal match those already made in Go's broader language family. Making these same changes to Go is useful on its own and avoids unnecessary lexical differences with the other languages. This is the primary rationale for all four changes.

Octal Literals

We considered using 0o377 in the initial design of Go, but we decided that even if Go used 0o377 for octal, it would have to reject 0377 as invalid syntax (that is, Go could not accept 0377 as decimal 377), to avoid an unpleasant surprise for programmers coming from C, C++, Java, Python 2, Perl, PHP, Ruby, and so on. Given that 0377 cannot be decimal, it seemed at the time unnecessary and gratuitously different to avoid it for octal. It still seemed that way in 2015, when the issue was raised as golang.org/issue/12711.

Today, however, it seems clear that there is agreement among at least the newer C-numbered languages for 0o377 as octal (either alone or in addition to 0377). Harmonizing Go’s octal integer syntax with these languages makes sense for the same reasons as harmonizing the binary integer and hexadecimal floating-point syntax.

For backwards compatibility, we must keep the existing 0377 syntax in Go 1, so Go will have two octal integer syntaxes, like Python 2.7 and non-strict JavaScript. As noted earlier, after a few years, once there are no supported Go releases missing the 0o377 syntax, we could consider changing gofmt to at least reformat 0377 to 0o377 for clarity.

Arbitrary Bases

Another obvious change is to consider arbitrary-radix numbers, like Algol 68’s 2r101. Perhaps the form most in keeping with Go’s history would be to allow BxDIGITS where B is the base, as in 2x0101, 8x377, and 16x12ab, where 0x becomes an alias for 16x. We considered this in the initial design of Go, but it seemed gratuitously different from the common C-numbered languages, and it would still not let us interpret 0377 as decimal. It also seemed that very few programs would be aided by being able to write numbers in, say, base 3 or base 36. That logic still holds today, reinforced by the weight of existing Go usage. Better to add only the syntaxes that other languages use. For discussion, see golang.org/issue/28256.

Library Changes

In the library changes, the various number parsers are changed to accept underscores only in the base-detecting case. For example:

strconv.ParseInt("12_34",   0, 0)   // decimal with underscores
strconv.ParseInt("0b11_00", 0, 0)   // binary with underscores
strconv.ParseInt("012_34",  0, 0)   // 01234 (octal)
strconv.ParseInt("0o12_34", 0, 0)   // 0o1234 (octal)
strconv.ParseInt("0x12_34", 0, 0)   // 0x1234 (hexadecimal)

strconv.ParseInt("12_34",  10, 0)   // error: fixed base cannot use underscores
strconv.ParseInt("11_00",   2, 0)   // error: fixed base cannot use underscores
strconv.ParseInt("12_34",   8, 0)   // error: fixed base cannot use underscores
strconv.ParseInt("12_34",  16, 0)   // error: fixed base cannot use underscores

Note that the fixed-base case also rejects base prefixes (and always has):

strconv.ParseInt("0b1100",  2, 0)   // error: fixed base cannot use base prefix
strconv.ParseInt("0o1100",  8, 0)   // error: fixed base cannot use base prefix
strconv.ParseInt("0x1234", 16, 0)   // error: fixed base cannot use base prefix

The rationale for rejecting underscores when the base is known is the same as the rationale for rejecting base prefixes: the caller is likely to be parsing a substring of a larger input and would not appreciate the “flexibility.” For example, parsing hex bytes two digits at a time might use strconv.ParseInt(input[i:i+2], 16, 8), and parsers for various text formats use strconv.ParseInt(field, 10, 64) to parse a plain decimal number. These use cases should not be required to guard against underscores in the inputs themselves.
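The hex-byte case might look like the following sketch. The decodeHex helper is hypothetical, and it uses strconv.ParseUint rather than ParseInt so that byte values above 0x7f fit in 8 bits. Because the base is fixed, the decoder needs no extra guard against underscores or base prefixes.

```go
package main

import (
	"fmt"
	"strconv"
)

// decodeHex parses input two hexadecimal digits at a time.
func decodeHex(input string) ([]byte, error) {
	if len(input)%2 != 0 {
		return nil, fmt.Errorf("odd-length input %q", input)
	}
	out := make([]byte, 0, len(input)/2)
	for i := 0; i < len(input); i += 2 {
		// Fixed base 16: "1_" and "0x" are rejected as syntax errors.
		b, err := strconv.ParseUint(input[i:i+2], 16, 8)
		if err != nil {
			return nil, err
		}
		out = append(out, byte(b))
	}
	return out, nil
}

func main() {
	fmt.Println(decodeHex("cafe")) // [202 254] <nil>
	_, err := decodeHex("1_2f")
	fmt.Println(err != nil) // true
}
```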

On the other hand, uses of strconv.ParseInt and strconv.ParseUint with base argument zero already accept decimal, octal 0377, and hexadecimal literals, so they will start accepting the new binary and octal literals and digit-separating underscores. For example, command line flags defined with flag.Int will start accepting these inputs. Similarly, uses of strconv.ParseFloat, like flag.Float64 or the conversion of string-typed database entries to float64 in database/sql, will start accepting hexadecimal floating-point literals and digit-separating underscores.
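For instance, a command-line flag defined with flag.Int picks up the new forms with no change to the program. The sketch below assumes flag and strconv behave as described above. The flag name -n and the helper parseCount are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseCount parses a hypothetical -n flag from args using flag.Int.
func parseCount(args ...string) (int, error) {
	fs := flag.NewFlagSet("demo", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the demo output
	n := fs.Int("n", 0, "a count")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *n, nil
}

func main() {
	fmt.Println(parseCount("-n", "0b1_0000")) // 16 <nil>
	fmt.Println(parseCount("-n=0o20"))        // 16 <nil>
	_, err := parseCount("-n", "1__6")
	fmt.Println(err != nil) // true: repeated separators are still rejected
}
```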

Digit Separators

The main bike shed to paint is the detail about where exactly digit separators are allowed. Following discussion on golang.org/issue/19308, and matching the latest versions of Python and Ruby, this proposal adopts the rule that each digit separator must separate a digit from the base prefix or another digit: 0_1, 0x_1, and 1_2 are all allowed, while 1__2 and 1_ are not.

Compatibility

The syntaxes being introduced here were all previously invalid, either syntactically or semantically. For an example of the latter, 0x1.fffp-2 parses in current versions of Go as the value 0x1’s fffp field minus two. Of course, integers have no fields, so while this program is syntactically valid, it is still semantically invalid.

The changes to numeric parsing functions like strconv.ParseInt and strconv.ParseFloat mean that programs that might have failed before on inputs like 0x1.fffp-2 or 1_2_3 will now succeed. Some users may be surprised. Part of the rationale for limiting the changes to calls using base zero is to limit the potential surprise to those cases that already accepted multiple syntaxes.

Implementation

The implementation requires:

  • Language specification changes, detailed above.
  • Library changes, detailed above.
  • Compiler changes, in gofrontend and cmd/compile/internal/syntax.
  • Testing of compiler changes, library changes, and gofmt.

Robert Griesemer and Russ Cox plan to split the work and aim to have all the changes ready at the start of the Go 1.13 cycle, around February 1.

As noted in our blog post “Go 2, here we come!”, the development cycle will serve as a way to collect experience about these new features and feedback from (very) early adopters.

At the release freeze, May 1, we will revisit the proposed features and decide whether to include them in Go 1.13.
