Modest support for Unicode letters in identifiers, take 2#12664
Conversation
|
@shindere I am out of ideas for who would want to review this PR, and I would rather have someone else than myself do it. Would you by chance be able to ask someone at Tarides to do a review on their upstream maintenance time? |
|
Gabriel Scherer (2024/01/17 10:10 -0800):
@shindere I am out of ideas for who would want to review this PR, and
I would rather have someone else than myself do it. Would you by
chance be able to ask someone at Tarides to do a review on their
upstream maintenance time?
Investigating. Will report back here as soon as done.
|
|
@shindere : ping. |
|
Gabriel Scherer (2024/01/31 23:37 -0800):
@shindere : ping.
Awaiting for a reposnse internally myself.
Will report back here anyway.
|
|
@Julow I wonder if you would be interested in helping review this PR. It looks like I will do it myself otherwise, but maybe we could split some of the work. |
Julow
left a comment
There was a problem hiding this comment.
Nice step in the direction of full Unicode support :)
| - digits 0-9, underscore, single quote | ||
| *) | ||
|
|
||
| module UString : sig |
There was a problem hiding this comment.
This module name is a bit too general. What do you think of UIdent ?
normalize, capitalize and uncapitalize are not as general as they seem as they work only on "identifier characters". The other functions end in _identifier.
There was a problem hiding this comment.
UIdent does sound clearer to me, thanks for the suggestion!
There was a problem hiding this comment.
This module name is now ambiguous with the UIDENT token, where the "U" means "uppercase" and not "unicode". I don't have a better suggestion though.
There was a problem hiding this comment.
Yep, UIdent is a problem. Let's think of a better name.
There was a problem hiding this comment.
Proposals:
Ident_nameIdent_textUtf_ident
While looking at the mli, I also wondered if we should rename is_valid_identifier into is_valid, and validate_identifier into validate.
There was a problem hiding this comment.
Ident is already taken by typecore for something similar but slightly different: unique names.
Perhaps Name is the natural name for this concept ?
There was a problem hiding this comment.
I went with the very explicit Utf8_lexeme which conveys information on both the encoding and the content.
| { OPTLABEL name } | ||
| | "?" raw_ident_escape (ident_ext as raw_name) ':' | ||
| { let name = ident_for_extended lexbuf raw_name in | ||
| if UString.is_capitalized name |
There was a problem hiding this comment.
The previous rule matches names that always pass this check. Why duplicate the rule ?
Going further, perhaps we could avoid using identstart and identchar entirely, ensuring that we support "extended" identifiers everywhere.
There was a problem hiding this comment.
This is a performance vs maintainability tradeoff. With the fast ascii path separate from the extended unicode path, we keep the same performance as the current lexer on micro-benchmarks. Merging the two paths in one results into a 2x slowdown on the same benchmark (see https://github.com/Octachron/ocaml-lexer-microbenchmarks).
There was a problem hiding this comment.
I see that the difference goes down to 1.30x with an extremely large and more representative example (all .ml files of this repo concatenated). Is lexing time that important ?
There was a problem hiding this comment.
Thinking about it maybe a better compromise would be to keep the fast ascii rules for identifiers and labels, but merge the raw_identifier case with the unicode one? This reduces the number of rules to:
- 3 for identifiers: lowercase_ascii, uppercase_ascii, complex
- 2 for labels and optional labels: lowercase_ascii, complex
with (most probably) no bearing on the performances.
There was a problem hiding this comment.
I think reducing the number of rules can only be good for maintenance. Do uppercase ASCII identifiers need a fast path too ?
There was a problem hiding this comment.
My opinion is that the ascii rule are also quite simpler to read:
| lowercase identchar * as name
{ try Hashtbl.find keyword_table name
with Not_found -> LIDENT name }
| uppercase identchar * as name
{ UIDENT name } (* No capitalized keywords *)compared to
| (raw_ident_escape? as escape) (ident_ext as raw_name)
{ let name = ident_for_extended lexbuf raw_name in
if UIdent.is_capitalized name then begin
if escape="" then UIDENT name
else error lexbuf (Capitalized_raw_identifier name)
end else
LIDENT name
} (* No non-ascii keywords *)Consequently, I belive that keeping both rules for the ascii case is fine (or possibly very slightly better?) in term of maintenance.
There was a problem hiding this comment.
I think in this example the rule is made complicated by the Capitalized_raw_identifier check. It could be defined elsewhere, like it's done for other checks.
The is_capitalized if seems as readable as two separate rules, though it's nice that the keyword lookup is done in a separate rule. Also, the more complex rule is not made even more complex by dropping the simpler rule.
| LABEL name } | ||
| | "~" (lowercase_latin1 identchar_latin1 * as name) ':' | ||
| { warn_latin1 lexbuf; | ||
| | "~" (ident_ext as raw_name) ':' |
There was a problem hiding this comment.
There's now 4 very similar rules for labels ! Any way to collapse them to 1 or 2 ?
12ef987 to
40908f9
Compare
|
@Octachron does this need a fresh review? |
|
@tmcgilchrist , @Julow has started reviewing this PRs but more eyes are always welcome. |
Julow
left a comment
There was a problem hiding this comment.
I've done a full review and all my comments are nitpicks. I think this is ready to be merged!
This should be documented in the manual, though that can be done later.
| LABEL name } | ||
| | "~" (lowercase_latin1 identchar_latin1 * as name) ':' | ||
| { warn_latin1 lexbuf; | ||
| | "~" (raw_ident_escape? as escape) (ident_ext as raw_name) ':' |
There was a problem hiding this comment.
Nice simplification ! Though, does the lexing performances of labels matter ? I'd expect it to matter less than LIDENT and we could get rid of more rules.
There was a problem hiding this comment.
It matters less, but I wouldn't be surprised if some coding style ends up with a significant minority of label identifiers.
| { warn_latin1 lexbuf; | ||
| | "~" (raw_ident_escape? as escape) (ident_ext as raw_name) ':' | ||
| { let name = ident_for_extended lexbuf raw_name in | ||
| if escape="" then check_label_name lexbuf name; |
There was a problem hiding this comment.
It's not immediately obvious how and why the check is different when escape is present. I suggest passing whether an escape is present to the check_label_name function. For example, check_label_name ~has_escape:(escape<>"").
This way, the logic could be deduplicated from the OPTLABEL rule and documented in the check function.
There was a problem hiding this comment.
I took an alternative path, and added an check_unicode_label_name to factorize the two tests.
There was a problem hiding this comment.
check_label_name and check_unicode_label_name are practically identical. Perhaps they could be merged or shared more.
There was a problem hiding this comment.
Indeed, they are now merged together (with an optional argument).
| { let name = ident_for_extended lexbuf raw_name in | ||
| if UIdent.is_capitalized name then begin | ||
| if escape="" then UIDENT name | ||
| else error lexbuf (Capitalized_raw_identifier name) |
There was a problem hiding this comment.
Here would be a nice place to document why capitalized raw ident escapes are not allowed. I guess this is because it would allow capitalized LIDENT, which would be annoying during typing and pprinting ?
There was a problem hiding this comment.
It is more because raw identifiers were introduced to allow adding new keywords without breaking backward compatibility; and we don't have capitalized keywords.
| STRING (s, loc, Some delim) } | ||
| | "{%" (extattrident as id) "|" | ||
| | "{" (ident_ext as raw_name) '|' | ||
| { let delim = validate_delim lexbuf raw_name in |
There was a problem hiding this comment.
This rule could also be squashed with ASCII one ?
There was a problem hiding this comment.
I agree that performance don't matter that much here.
| | "{" ('%' '%'? extattrident blank*)? (lowercase* as delim) "|" | ||
| { | ||
| | "{" ('%' '%'? extattrident blank*)? (ident_ext? as raw_delim) "|" | ||
| { match lax_delim raw_delim with |
There was a problem hiding this comment.
It seems weird that delimited strings are not recognized in comments because of an encoding error in an arbitrary string. What would happen if invalid ident chars were allowed here as well as uppercases ?
There was a problem hiding this comment.
Since delimited strings are recognized in comments with the aim make it straightforward to comment code like
let a = {x| *) |x}I find it is coherent to only recognize delimited strings that are lexically valid in comments.
| if delim = edelim then lexbuf.lex_start_p | ||
| else (store_lexeme lexbuf; quoted_string delim lexbuf) | ||
| } | ||
| | "|" (ident_ext as raw_edelim) "}" |
There was a problem hiding this comment.
Is this optimisation necessary ?
Also, as delim has been validated before entering this state, the None case is not necessary as the Error value returned by UIdent.normalize cannot be equal delim. This could removes duplicated code.
There was a problem hiding this comment.
Indeed, the optimisation doesn't seem necessary in this case.
| | 'A'..'Z' | 'a'..'z' | '_' | '\192'..'\214' | '\216'..'\246' | ||
| | '\248'..'\255' | '\'' | '0'..'9' -> true | ||
| | _ -> false | ||
| let modname_from_source source_file = |
There was a problem hiding this comment.
What do you think of deprecating modname_from_source in favor of a new unsafe_modname_from_source ? It's currently used to construct idents in tools/ and in driver/ where strictness is not needed but the more verbose name couldn't hurt and might avoid bad uses outside of the compiler.
There was a problem hiding this comment.
I find unsafe a bit too much since the function is not unsafe by itself. I have settled for lax_modname_from_source to make the strict and lax variant more symmetric.
Facilities are provided for NFC normalization, capitalization / uncapitalization, and checking identifier validity. Support is currently limited to Latin-9 letters.
We use the new Misc.UString module to check that the letters are allowed and to normalize their Unicode representation. The lexing of labels was changed so that labels that start with an uppercase are accepted by the lexer regexps, then rejected by the action. This works better with non-ASCII letters and produces a nicer error message.
…names
Instead of `String.{un,}capitalize_ascii` like before.
This should support compilation units whose module names start with
or contain non-ASCII letters.
Both NFC and NFD representations are used in this file (hoping Git does not change this).
We reject invalid utf8 encoding and strings containing the replacement character U+FFFD in identifiers and output targets. Similarly, we ignore existing artifacts with encoding errors whenever we are doing an artifact look-up in load paths.
The filename ça.ml is NFD normalized. The test source now the two normalizations for each unit identifiers.
Julow
left a comment
There was a problem hiding this comment.
Sorry for the late reply. I approve :)
Modest support for Unicode letters in identifiers, take 2 (cherry picked from commit 389ae57)
(cherry picked from commit 389ae57)
(cherry picked from commit 389ae57)
(cherry picked from commit 389ae57)
(cherry picked from commit 389ae57)
(cherry picked from commit 389ae57)
(cherry picked from commit 389ae57)
(cherry picked from commit 389ae57)
(cherry picked from commit 389ae57)
(cherry picked from commit 389ae57)
(cherry picked from commit 389ae57)
(cherry picked from commit 389ae57)
(cherry picked from commit 389ae57)
(cherry picked from commit 389ae57)
(cherry picked from commit 389ae57)
(cherry picked from commit 389ae57)
(cherry picked from commit 389ae57)
|
@Octachron, has this been documented in the manual ? |
This pull request is essentially a version of #11736 rebased on top of #12389 with a handful of new commits for the sake of regularity:
\#étoileand quoted strings{%éclair été|x|été}.Inspired by a remark of @dbuenzli in #11736, 9137252 tightens our policy for badly encoded filenames: we emit now an error before producing such a filename, and ignore such files in load paths.
The main motivation behind this change is to steer away from such pathological corner cases as much as possible for input under the control of users.
Similarly, concerning the previous discussion about filename collisions (up-to-normalisation) in load paths, I propose in #12550 to simply refuse to work in such confused environment .
There are also a handful of supplementary tests in 76766c8, and ff3d473, and 22da51a.
This 22da51a commit also contains a small fix for the toplevel value pretty-printer.