Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support 'raw identifier' syntax #11252

Closed
wants to merge 2 commits into from
Closed

Conversation

stedolan
Copy link
Contributor

This PR implements part of RFC 27 by adding support for 'raw identifiers' \#foo. The syntax \#foo means the same thing as foo, but is always an identifier even when foo is a keyword. This allows future versions of the language to introduce new keywords in a backwards-compatible way (see the RFC for full details).

The important part of this patch is the small patch to lexer.mll which adds the new concrete syntax. The entire rest of the patch is changes to various pretty-printers, to ensure that keywords are printed in appropriately escaped form should they ever occur as variable names, in types, in error messages, and the like.

tools/.depend Outdated Show resolved Hide resolved
@stedolan stedolan force-pushed the raw-idents branch 2 times, most recently from 0b680a0 to 463589f Compare May 16, 2022 09:48
@stedolan
Copy link
Contributor Author

@gasche You expressed interest in this patch at some point, would you be willing / have time to review?

@gasche
Copy link
Member

gasche commented Jan 12, 2023

Yes, during ICFP last September I realized that I had forgotten about the PR (I liked the RFC and ended up in the awkward position of encouraging you to send a PR about it), and then I quickly forgot about it again.

You expressed interest in this patch at some point

Yes!

would [...] have time to review?

No!

But tomorrow we have a meeting about encouraging people to do right by reviewing more stuff, let us try to designate some volunteers there. So many people eager to help, it would be a shame if I stole this opportunity from them.

@dra27 dra27 self-assigned this Jan 12, 2023
@dra27
Copy link
Member

dra27 commented Jan 12, 2023

I regret that this was allowed to slip off the radar for 5.0 - it should certainly be in 5.1 and if you're happy to rebase I'm happy to either review it or ensure it gets reviewed in time!

@stedolan
Copy link
Contributor Author

Now rebased!

Copy link
Contributor

@OlivierNicole OlivierNicole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the code changes of this PR are sound.

My only interrogation is at the level of the specification. As of the current state of this PR:

  • Module names and value constructors can never be raw identifiers (unlike what the original RFC suggests), because raw identifiers cannot be capitalized (e.g. \#Foo).
  • However, polymorphic variant constructors can be raw identifiers (e.g. `\#true).

(I thought polymorphic variant constructors were necessarily capitalized, but I checked and apparently they don’t need to be, i.e. `x is valid.) This is discouraged, see my message below.

As a consequence, the simple rule “raw identifiers can be used whenever any non-capitalized identifier can be used” applies. I think this simple rule should be stated somewhere (probably in the manual) so that it is clear how to use raw identifiers.

Comment on lines +412 to +413
| raw_ident_escape (lowercase identchar * as name)
{ LIDENT name }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting to see the equivalent of this for uppercase and UIDENT. Without it, raw identifiers can only start with a lowercase character, and are a bit less general than what the RFC mentions:

This syntax can be used anywhere that foo can be used: ~\#foo is a labelled argument, `\#Foo is a polymorphic variant, '\#foo is a type variable, \#Foo is a module or constructor name, etc.

Arguably this is sufficient as OCaml has no capitalized keywords. I just want to make sure that’s what we want.

Comment on lines +139 to +140
| `btrue
| `bfalse ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it necessary to add cases for true and false in this patch?

Copy link
Contributor

@OlivierNicole OlivierNicole Feb 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I understand. true and false are the only keywords that can be used as constructors, so we must be wary of not printing them as escaped (since, on the other hand, raw identifiers cannot be used as constructor names).

@OlivierNicole
Copy link
Contributor

Using non-capitalized tag names is in fact not considered standard and may be explicitly disallowed in the future (see #12047). So maybe we don’t want to encourage the use of raw identifiers as tag names.

@lpw25
Copy link
Contributor

lpw25 commented Feb 27, 2023

I think that raw identifiers as tag names should have the same status as lowercase tag names i.e. undocumented but supported. Then if we ever remove lowercase tag names or make them official we can do the same with raw identifier tag names. For the same reasons that no one has managed to remove lowercase tag names -- they are actually widely used -- it's important to be able to use raw identifiers for them too. Otherwise when adding a keyword you risk not being able to use a library that uses that keyword as a tag name.

@OlivierNicole
Copy link
Contributor

I see, that makes sense.

@Octachron Octachron assigned gasche and unassigned dra27 Mar 14, 2023
Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at this PR.

I find myself confused by some of the naming choices in the code, and I think that this is because there is a lack of clarity in the terminology. When we write the OCaml expression \#foo, we may want to talk about \#foo (the identifier as it appears in source code) and foo (the internal payload / identifying information of the identifier). These two different things should have different names. Currently we both call them "identifiers", and then we have to remember to "protect" an identifier before printing it. This is the same distinction between (+) and +, (let!) and let!.

Then there is a similar question at the level of type variables, between 'a and a. (Is it exactly the same distinction, or a distinct distinction?)

I would propose to call \#foo and (+) identifiers, and foo and + the name of those identifiers. This feels natural to me, and it is consistent with the Rust documentation of raw identifiers (emphasis mine):

This is particularly useful when Rust introduces new keywords, and a library using an older edition of Rust has a variable or function with the same name as a keyword introduced in a newer edition.

If we like this terminology (other proposals are welcome), we should use it in the code, in particular protect_ident could be ident_of_name or just ident (and be properly documented) and tyvar_name could be tyvar_of_name or just tyvar.

@@ -93,7 +96,7 @@ Working version
(Stephen Dolan, review by Xavier Leroy)

- #11418, #11708: RISC-V multicore support.
(Nicolás Ojeda Bär, review by KC Sivaramakrishnan)
(Nicolás Ojeda Bär, review by KuC Sivaramakrishnan)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unrelated typo.

@@ -95,16 +95,17 @@ let needs_parens txt =
let needs_spaces txt =
first_is '*' txt || last_is '*' txt

let string_loc ppf x = fprintf ppf "%s" x.txt

(* add parentheses to binders when they are in fact infix or prefix operators *)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docstring should be updated, given that the feature scope was extended to also "protect" identifiers that are also keywords. Suggestion:

(* Turn an arbitrary identifier name into a lexically valid identifier. *)

I realize as I write this that I don't know what is the right terminology for these distinct objects (the "string payload of an identifier", which I call "identifier name" here, and the "identifier as it appears in the source file", which I call a "lexically valid identifier'.)

@@ -268,16 +273,21 @@ let iter_loc f ctxt {txt; loc = _} = f ctxt txt

let constant_string f s = pp f "%S" s

let tyvar ppf s =
let tyvar_name s =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the new function name tyvar_name confusing. If I was asked to guess what is the "name" of a tyvar, I would say that foo is the name of the tyvar 'foo. This function is in fact doing the opposite thing.

@@ -684,7 +686,8 @@ open Printtyp

let report_error env ppf = function
| Unbound_type_variable name ->
let add_name name _ l = if name = "_" then l else ("'" ^ name) :: l in
let add_name name _ l =
if name = "_" then l else Pprintast.tyvar_name name :: l in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this logic (being a no-op on "_") should be added to tyvar_name.

@gasche
Copy link
Member

gasche commented Jun 14, 2023

@OlivierNicole I remember that we discussed this PR and formed a cunning plan, but I don't remember what the plan was. You were supposed to resubmit a rebased/updated version of this PR, and I would review it; or maybe the other way around? Do you remember, which one do you prefer now?

(If you resubmit and I review, we don't need to find yet another maintainer for approval before merging. But then if you prefer to keep reviewing that is also fine by me.)

@dra27 dra27 mentioned this pull request Jun 19, 2023
9 tasks
@OlivierNicole
Copy link
Contributor

@OlivierNicole I remember that we discussed this PR and formed a cunning plan, but I don't remember what the plan was. You were supposed to resubmit a rebased/updated version of this PR, and I would review it; or maybe the other way around? Do you remember, which one do you prefer now?

I can rebase and update it within a few days.

@gasche
Copy link
Member

gasche commented Jun 24, 2023

This has been rebased as an independent PR #12323, which is now merged. Closing. Thanks @stedolan for the nice RFC and implementation, and @OlivierNicole for going the last mile.

@gasche gasche closed this Jun 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants