Support 'raw identifier' syntax #11252

stedolan · 2022-05-10T14:31:27Z

This PR implements part of RFC 27 by adding support for 'raw identifiers' \#foo. The syntax \#foo means the same thing as foo, but is always an identifier even when foo is a keyword. This allows future versions of the language to introduce new keywords in a backwards-compatible way (see the RFC for full details).

The important part of this patch is the small patch to lexer.mll which adds the new concrete syntax. The entire rest of the patch is changes to various pretty-printers, to ensure that keywords are printed in appropriately escaped form should they ever occur as variable names, in types, in error messages, and the like.

tools/.depend

stedolan · 2023-01-12T14:14:40Z

@gasche You expressed interest in this patch at some point, would you be willing / have time to review?

gasche · 2023-01-12T14:32:56Z

Yes, during ICFP last September I realized that I had forgotten about the PR (I liked the RFC and ended up in the awkward position of encouraging you to send a PR about it), and then I quickly forgot about it again.

You expressed interest in this patch at some point

Yes!

would [...] have time to review?

No!

But tomorrow we have a meeting about encouraging people to do right by reviewing more stuff, let us try to designate some volunteers there. So many people eager to help, it would be a shame if I stole this opportunity from them.

dra27 · 2023-01-12T15:58:38Z

I regret that this was allowed to slip off the radar for 5.0 - it should certainly be in 5.1 and if you're happy to rebase I'm happy to either review it or ensure it gets reviewed in time!

stedolan · 2023-01-26T16:58:57Z

Now rebased!

OlivierNicole

I believe the code changes of this PR are sound.

My only interrogation is at the level of the specification. As of the current state of this PR:

Module names and value constructors can never be raw identifiers (unlike what the original RFC suggests), because raw identifiers cannot be capitalized (e.g. \#Foo).
However, polymorphic variant constructors can be raw identifiers (e.g. `\#true).

~~(I thought polymorphic variant constructors were necessarily capitalized, but I checked and apparently they don’t need to be, i.e. `x is valid.)~~ This is discouraged, see my message below.

As a consequence, the simple rule “raw identifiers can be used whenever any non-capitalized identifier can be used” applies. I think this simple rule should be stated somewhere (probably in the manual) so that it is clear how to use raw identifiers.

OlivierNicole · 2023-02-20T11:50:47Z

parsing/lexer.mll

+  | raw_ident_escape (lowercase identchar * as name)
+      { LIDENT name }


I was expecting to see the equivalent of this for uppercase and UIDENT. Without it, raw identifiers can only start with a lowercase character, and are a bit less general than what the RFC mentions:

This syntax can be used anywhere that foo can be used: ~\#foo is a labelled argument, `\#Foo is a polymorphic variant, '\#foo is a type variable, \#Foo is a module or constructor name, etc.

Arguably this is sufficient as OCaml has no capitalized keywords. I just want to make sure that’s what we want.

OlivierNicole · 2023-02-27T10:45:43Z

parsing/pprintast.ml

+  | `btrue
+  | `bfalse ]


Why is it necessary to add cases for true and false in this patch?

OK, I think I understand. true and false are the only keywords that can be used as constructors, so we must be wary of not printing them as escaped (since, on the other hand, raw identifiers cannot be used as constructor names).

OlivierNicole · 2023-02-27T18:36:57Z

Using non-capitalized tag names is in fact not considered standard and may be explicitly disallowed in the future (see #12047). So maybe we don’t want to encourage the use of raw identifiers as tag names.

lpw25 · 2023-02-27T19:45:50Z

I think that raw identifiers as tag names should have the same status as lowercase tag names i.e. undocumented but supported. Then if we ever remove lowercase tag names or make them official we can do the same with raw identifier tag names. For the same reasons that no one has managed to remove lowercase tag names -- they are actually widely used -- it's important to be able to use raw identifiers for them too. Otherwise when adding a keyword you risk not being able to use a library that uses that keyword as a tag name.

OlivierNicole · 2023-02-28T12:54:48Z

I see, that makes sense.

gasche

I looked at this PR.

I find myself confused by some of the naming choices in the code, and I think that this is because there is a lack of clarity in the terminology. When we write the OCaml expression \#foo, we may want to talk about \#foo (the identifier as it appears in source code) and foo (the internal payload / identifying information of the identifier). These two different things should have different names. Currently we both call them "identifiers", and then we have to remember to "protect" an identifier before printing it. This is the same distinction between (+) and +, (let!) and let!.

Then there is a similar question at the level of type variables, between 'a and a. (Is it exactly the same distinction, or a distinct distinction?)

I would propose to call \#foo and (+) identifiers, and foo and + the name of those identifiers. This feels natural to me, and it is consistent with the Rust documentation of raw identifiers (emphasis mine):

This is particularly useful when Rust introduces new keywords, and a library using an older edition of Rust has a variable or function with the same name as a keyword introduced in a newer edition.

If we like this terminology (other proposals are welcome), we should use it in the code, in particular protect_ident could be ident_of_name or just ident (and be properly documented) and tyvar_name could be tyvar_of_name or just tyvar.

gasche · 2023-06-14T13:46:26Z

Changes

@@ -93,7 +96,7 @@ Working version
  (Stephen Dolan, review by Xavier Leroy)

 - #11418, #11708: RISC-V multicore support.
-  (Nicolás Ojeda Bär, review by KC Sivaramakrishnan)
+  (Nicolás Ojeda Bär, review by KuC Sivaramakrishnan)


This is an unrelated typo.

gasche · 2023-06-14T13:51:24Z

parsing/pprintast.ml

@@ -95,16 +95,17 @@ let needs_parens txt =
 let needs_spaces txt =
  first_is '*' txt || last_is '*' txt

-let string_loc ppf x = fprintf ppf "%s" x.txt
-
 (* add parentheses to binders when they are in fact infix or prefix operators *)


This docstring should be updated, given that the feature scope was extended to also "protect" identifiers that are also keywords. Suggestion:

(* Turn an arbitrary identifier name into a lexically valid identifier. *)

I realize as I write this that I don't know what is the right terminology for these distinct objects (the "string payload of an identifier", which I call "identifier name" here, and the "identifier as it appears in the source file", which I call a "lexically valid identifier'.)

gasche · 2023-06-14T13:51:38Z

parsing/pprintast.ml

@@ -268,16 +273,21 @@ let iter_loc f ctxt {txt; loc = _} = f ctxt txt

 let constant_string f s = pp f "%S" s

-let tyvar ppf s =
+let tyvar_name s =


I find the new function name tyvar_name confusing. If I was asked to guess what is the "name" of a tyvar, I would say that foo is the name of the tyvar 'foo. This function is in fact doing the opposite thing.

gasche · 2023-06-14T13:52:44Z

typing/typetexp.ml

@@ -684,7 +686,8 @@ open Printtyp

 let report_error env ppf = function
  | Unbound_type_variable name ->
-      let add_name name _ l = if name = "_" then l else ("'" ^ name) :: l in
+      let add_name name _ l =
+        if name = "_" then l else Pprintast.tyvar_name name :: l in


I wonder if this logic (being a no-op on "_") should be added to tyvar_name.

gasche · 2023-06-14T14:01:17Z

@OlivierNicole I remember that we discussed this PR and formed a cunning plan, but I don't remember what the plan was. You were supposed to resubmit a rebased/updated version of this PR, and I would review it; or maybe the other way around? Do you remember, which one do you prefer now?

(If you resubmit and I review, we don't need to find yet another maintainer for approval before merging. But then if you prefer to keep reviewing that is also fine by me.)

OlivierNicole · 2023-06-20T09:28:32Z

@OlivierNicole I remember that we discussed this PR and formed a cunning plan, but I don't remember what the plan was. You were supposed to resubmit a rebased/updated version of this PR, and I would review it; or maybe the other way around? Do you remember, which one do you prefer now?

I can rebase and update it within a few days.

gasche · 2023-06-24T12:49:16Z

This has been rebased as an independent PR #12323, which is now merged. Closing. Thanks @stedolan for the nice RFC and implementation, and @OlivierNicole for going the last mile.

stedolan force-pushed the raw-idents branch from bd11249 to c863650 Compare May 10, 2022 14:37

dra27 reviewed May 10, 2022

View reviewed changes

tools/.depend Outdated Show resolved Hide resolved

stedolan force-pushed the raw-idents branch 2 times, most recently from 0b680a0 to 463589f Compare May 16, 2022 09:48

dra27 self-assigned this Jan 12, 2023

stedolan added 2 commits January 26, 2023 16:58

Support 'raw identifier' syntax \#foo

29535c8

Changes

6cd72fc

stedolan force-pushed the raw-idents branch from 463589f to 6cd72fc Compare January 26, 2023 16:58

OlivierNicole reviewed Feb 27, 2023

View reviewed changes

Octachron assigned gasche and unassigned dra27 Mar 14, 2023

gasche reviewed Jun 14, 2023

View reviewed changes

dra27 mentioned this pull request Jun 19, 2023

Add effect syntax #12309

Merged

9 tasks

OlivierNicole mentioned this pull request Jun 23, 2023

Support 'raw identifier' syntax (new version) #12323

Merged

gasche closed this Jun 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support 'raw identifier' syntax #11252

Support 'raw identifier' syntax #11252

stedolan commented May 10, 2022

stedolan commented Jan 12, 2023

gasche commented Jan 12, 2023

dra27 commented Jan 12, 2023

stedolan commented Jan 26, 2023

OlivierNicole left a comment •

edited

Loading

OlivierNicole Feb 20, 2023

OlivierNicole Feb 27, 2023

OlivierNicole Feb 27, 2023 •

edited

Loading

OlivierNicole commented Feb 27, 2023

lpw25 commented Feb 27, 2023

OlivierNicole commented Feb 28, 2023

gasche left a comment

gasche Jun 14, 2023

gasche Jun 14, 2023

gasche Jun 14, 2023

gasche Jun 14, 2023

gasche commented Jun 14, 2023

OlivierNicole commented Jun 20, 2023

gasche commented Jun 24, 2023

		\| raw_ident_escape (lowercase identchar * as name)
		{ LIDENT name }

Support 'raw identifier' syntax #11252

Support 'raw identifier' syntax #11252

Conversation

stedolan commented May 10, 2022

stedolan commented Jan 12, 2023

gasche commented Jan 12, 2023

dra27 commented Jan 12, 2023

stedolan commented Jan 26, 2023

OlivierNicole left a comment • edited Loading

Choose a reason for hiding this comment

OlivierNicole Feb 20, 2023

Choose a reason for hiding this comment

OlivierNicole Feb 27, 2023

Choose a reason for hiding this comment

OlivierNicole Feb 27, 2023 • edited Loading

Choose a reason for hiding this comment

OlivierNicole commented Feb 27, 2023

lpw25 commented Feb 27, 2023

OlivierNicole commented Feb 28, 2023

gasche left a comment

Choose a reason for hiding this comment

gasche Jun 14, 2023

Choose a reason for hiding this comment

gasche Jun 14, 2023

Choose a reason for hiding this comment

gasche Jun 14, 2023

Choose a reason for hiding this comment

gasche Jun 14, 2023

Choose a reason for hiding this comment

gasche commented Jun 14, 2023

OlivierNicole commented Jun 20, 2023

gasche commented Jun 24, 2023

OlivierNicole left a comment •

edited

Loading

OlivierNicole Feb 27, 2023 •

edited

Loading