proposal: spec: allow combining characters in identifiers #20706

Open
rsc opened this Issue Jun 16, 2017 · 13 comments

@rsc
Contributor

rsc commented Jun 16, 2017

Forking from #16033, which had two related but different proposals in it. The proposal for this issue, by @robpike:

On a related note, some writing systems (Devanagari is one; see #5167) require combining characters. The current identifier rules forbid combining characters; perhaps that should be relaxed, although that will require a canonicalization rule for combining characters. Unicode does have a definition for identifiers (http://unicode.org/reports/tr31/); perhaps Go should use it. Note that the addition of combining characters, allied with the export proposal above, would make it possible to export Devanagari identifiers.
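
To make the restriction concrete, here is a small sketch using only the standard library. It checks a Devanagari word against today's identifier rule via `go/token.IsIdentifier`. The helper name `isIdentifierToday` and the sample word are illustrative choices, not part of the proposal.

```go
package main

import (
	"fmt"
	"go/token"
	"unicode"
)

// isIdentifierToday reports whether name is a valid identifier under the
// current Go rules (letters, digits, '_'; no combining marks; not a keyword).
func isIdentifierToday(name string) bool {
	return token.IsIdentifier(name)
}

func main() {
	// The Devanagari word for "name": NA (U+0928), vowel sign AA (U+093E), MA (U+092E).
	// The vowel sign is a combining mark, not a letter.
	word := "\u0928\u093e\u092e"
	for _, r := range word {
		fmt.Printf("%U letter=%t mark=%t\n", r, unicode.IsLetter(r), unicode.IsMark(r))
	}
	fmt.Println(isIdentifierToday(word))  // false: the vowel sign is rejected
	fmt.Println(isIdentifierToday("nom")) // true
}
```

The word fails only because of the single combining vowel sign in the middle; both consonants are ordinary letters.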

gopherbot added this to the Proposal milestone Jun 16, 2017

gopherbot added the Proposal label Jun 16, 2017

@rsc

Contributor

rsc commented Jun 16, 2017

Re canonicalization, one possibility Rob and I discussed at one point was to require in the spec that implementations canonicalize during comparisons to establish whether two identifiers are the same but also to have gofmt canonicalize to generate its output (the former is required for the latter to be semantically safe). Then source code is consistent but the compilers will deal if not.

@griesemer

Contributor

griesemer commented Jun 16, 2017

@rsc Is this Go 2 or would you consider this for Go 1?

@rsc

Contributor

rsc commented Jun 17, 2017

For Go 2.

rsc added the Go2 label Jun 17, 2017

@rsc

Contributor

rsc commented Jun 17, 2017

Merging #5167 in here. From suraj@barkale.com in 2013:

My suggestion is to amend Go specification by allowing combining-mark & non-spacing-mark characters in identifiers.

This will be similar to Java identifier rules given at http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/Character.html#isJavaIdentifierPart(char).
A character may be part of a Java identifier if any of the following are true:

  • it is a letter
  • it is a currency symbol (such as '$')
  • it is a connecting punctuation character (such as '_')
  • it is a digit
  • it is a numeric letter (such as a Roman numeral character)
  • it is a combining mark
  • it is a non-spacing mark
  • isIdentifierIgnorable returns true for the character
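
For comparison, the list above can be approximated in Go with the `unicode` package's general-category tables. This is only a rough sketch: the function name `isJavaLikeIdentifierPart` is hypothetical, the mapping from Java's wording to Unicode categories is an assumption, and the `isIdentifierIgnorable` case is left out.

```go
package main

import (
	"fmt"
	"unicode"
)

// isJavaLikeIdentifierPart approximates the quoted Java rule with Unicode
// general categories. Identifier-ignorable characters are not handled.
func isJavaLikeIdentifierPart(r rune) bool {
	return unicode.IsLetter(r) || // letter
		unicode.IsDigit(r) || // decimal digit (Nd)
		unicode.In(r,
			unicode.Sc, // currency symbol, such as '$'
			unicode.Pc, // connecting punctuation, such as '_'
			unicode.Nl, // numeric letter, such as a Roman numeral character
			unicode.Mc, // spacing combining mark
			unicode.Mn, // non-spacing mark
		)
}

func main() {
	// U+2167 is ROMAN NUMERAL EIGHT, U+093E a Devanagari vowel sign,
	// U+094D the Devanagari virama.
	for _, r := range []rune{'a', '$', '_', '7', '\u2167', '\u093e', '\u094d', '-'} {
		fmt.Printf("%U %t\n", r, isJavaLikeIdentifierPart(r))
	}
}
```

Every rune in the sample is accepted except the hyphen.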

@bakul

bakul commented Jul 19, 2017

Note that the Java id rule link is invalid now.

The first char of an identifier is typically more constrained (e.g. no digit).

I propose that, as far as Indic languages are concerned, an identifier may not start with one of the various sign chars, digits, dependent vowels or "virama". An identifier may start with a currency sign, an "om" sign, an independent vowel (vowel letter) or a consonant.

@robpike

Contributor

robpike commented Jul 19, 2017

@bakul It's precisely that kind of minute precision we'd like to avoid. The rule must be very simple to express, as it is now. We are seeking a new but also simple-to-express rule that admits more classes of identifier without requiring natural-language-specific detail.

@bakul

bakul commented Jul 19, 2017

Would "१२३" (123 in Devanagari) be considered a number or an identifier? If the latter, that would be strange!

If you don't want nitpicky rules, a simpler rule may be that the first char satisfies unicode.IsLetter() and the succeeding chars satisfy unicode.IsLetter() or IsDigit() or IsMark(). Plus whatever is needed for CJK.
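
A minimal sketch of that simpler rule, taken literally: the first character must satisfy `unicode.IsLetter`, and the rest may be letters, digits or marks. The function name `isProposedIdentifier` is hypothetical. The sketch also ignores `_` (which Go currently treats as a letter), keywords, and the unspecified CJK additions.

```go
package main

import (
	"fmt"
	"unicode"
)

// isProposedIdentifier applies the rule as stated in the comment above:
// first rune is a letter; later runes are letters, digits, or marks.
func isProposedIdentifier(s string) bool {
	if s == "" {
		return false
	}
	for i, r := range s {
		if i == 0 {
			if !unicode.IsLetter(r) {
				return false
			}
			continue
		}
		if !unicode.IsLetter(r) && !unicode.IsDigit(r) && !unicode.IsMark(r) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isProposedIdentifier("\u0928\u093e\u092e")) // true: letter, vowel sign, letter
	fmt.Println(isProposedIdentifier("\u0967\u0968\u0969")) // false: Devanagari digits 1, 2, 3
	fmt.Println(isProposedIdentifier("\u093e\u0928"))       // false: starts with a combining mark
}
```

Under this rule "१२३" is not an identifier, because Devanagari digits are decimal digits (category Nd), not letters.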

@bakul

bakul commented Jul 19, 2017

I should add: along with identifiers, numbers should also be expressible in other languages (but I expect that'll be even more unpopular!)

@bakul

bakul commented Sep 19, 2018

In case it makes a difference, the Swift programming language allows legitimate Indic words as identifiers. From Swift Lexical Structure:

Identifiers begin with an uppercase or lowercase letter A through Z, an underscore (_), a noncombining alphanumeric Unicode character in the Basic Multilingual Plane, or a character outside the Basic Multilingual Plane that isn’t in a Private Use Area. After the first character, digits and combining Unicode characters are also allowed.

Most Indian-language words use combining characters, so it would be good to fix this.

If these identifiers are not exportable, that is fine. Prefixing with an uppercase letter from some other script is ugly but that can't be helped! I am tempted to suggest using the section symbol (§) or some such symbol as an additional exportable start char of an identifier.

@bcmills

Member

bcmills commented Nov 29, 2018

that will require a canonicalization rule for combining characters.

See #27896 for a proposal specifically about canonicalization.

That potentially affects existing programs, since (for example) μ and µ are both already allowed (and treated as distinct identifiers) in Go source code.
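
The two characters can be told apart with a short check. This sketch assumes the pair in question is GREEK SMALL LETTER MU (U+03BC) and MICRO SIGN (U+00B5), and writes them as escapes so the difference survives copy and paste. The constant and helper names are illustrative.

```go
package main

import (
	"fmt"
	"go/token"
)

const (
	greekMu   = "\u03bc" // GREEK SMALL LETTER MU
	microSign = "\u00b5" // MICRO SIGN
)

// isIdentifierToday reports whether name is a valid identifier under the
// current Go rules.
func isIdentifierToday(name string) bool {
	return token.IsIdentifier(name)
}

func main() {
	fmt.Println(isIdentifierToday(greekMu))   // true
	fmt.Println(isIdentifierToday(microSign)) // true
	fmt.Println(greekMu == microSign)         // false: two distinct identifiers today
}
```

Both are letters, so both are accepted today as separate names. A compatibility normalization such as NFKC would map the micro sign onto the Greek letter, which is why existing programs could be affected.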

@aarzilli

Contributor

aarzilli commented Nov 29, 2018

Re canonicalization, one possibility Rob and I discussed at one point was to require in the spec that implementations canonicalize during comparisons to establish whether two identifiers are the same but also to have gofmt canonicalize to generate its output (the former is required for the latter to be semantically safe). Then source code is consistent but the compilers will deal if not.

I would prefer that non-normalized identifiers were rejected, so I wouldn't have to worry whether my grep/editor/browser uses the same normalization strategy as gofmt/compiler when looking for identifiers, even in sources that weren't visited by gofmt.

@mpvl

Member

mpvl commented Dec 7, 2018

@bcmills: forcing NFKC is not backwards compatible. If we break backwards compatibility, I would prefer to simply disallow any character with a decomposition type "font" and permit all others as is.

@bcmills

Member

bcmills commented Dec 7, 2018

If we break backwards compatibility, I would prefer to simply disallow any character with a decomposition type "font" and permit all others as is.

I think that would lead to a pretty unfortunate user experience. Consider the snippet:

	var jalapeño = "🌶️"
	fmt.Println(jalapeño)

Without any sort of normalization, the otherwise-equivalent identifiers jalapeño and jalapeño refer to two completely different variables, and as far as I am aware none of the characters involved have decomposition type font.
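
The two spellings look the same when rendered, so here is a sketch that writes them with explicit escapes. It assumes the comment means the precomposed form (ñ as U+00F1) and the decomposed form (n followed by U+0303 COMBINING TILDE). The constant and helper names are illustrative.

```go
package main

import (
	"fmt"
	"go/token"
	"unicode"
)

const (
	precomposed = "jalape\u00f1o"  // ñ as the single code point U+00F1
	decomposed  = "jalapen\u0303o" // n followed by U+0303 COMBINING TILDE
)

// isIdentifierToday reports whether name is a valid identifier under the
// current Go rules.
func isIdentifierToday(name string) bool {
	return token.IsIdentifier(name)
}

func main() {
	fmt.Println(precomposed == decomposed)      // false: different code point sequences
	fmt.Println(unicode.IsMark('\u0303'))       // true: the tilde is a combining mark
	fmt.Println(isIdentifierToday(precomposed)) // true
	fmt.Println(isIdentifierToday(decomposed))  // false under the current rule
}
```

Under today's rule the decomposed spelling is not a valid identifier at all, so the confusion described above would only arise once combining characters are allowed without normalization.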
