Skip to content

Modest support for Unicode letters in identifiers, take 2#12664

Merged
Octachron merged 28 commits into
ocaml:trunkfrom
Octachron:at-least-latin-9-in-utf8
Jul 18, 2024
Merged

Modest support for Unicode letters in identifiers, take 2#12664
Octachron merged 28 commits into
ocaml:trunkfrom
Octachron:at-least-latin-9-in-utf8

Conversation

@Octachron
Copy link
Copy Markdown
Member

This pull request is essentially a version of #11736 rebased on top of #12389 with a handful of new commits for the sake of regularity:

  • 654c014, d03f600 adds support for latin-9 identifiers in raw literal \#étoile and quoted strings {%éclair été|x|été}.
  • bd74f3b, ad5e6d3 updates the lexers of ocamldoc and the debugger to follow the new lexical convention (note that this PR contains ocamldoc: centralize lexical convention before utf-8 #12662)
  • 6314270 ensures that ocamllex is compatible with actions and comments containing latin-9 letters. However, this PR does not extend ocamllex to support latin-9 identifier outside of actions to avoid introducing a compiler-libs library dependency on ocamllex.

Inspired by a remark of @dbuenzli in #11736, 9137252 tightens our policy for badly encoded filenames: we emit now an error before producing such a filename, and ignore such files in load paths.
The main motivation behind this change is to steer away from such pathological corner cases as much as possible for input under the control of users.
Similarly, concerning the previous discussion about filename collisions (up-to-normalisation) in load paths, I propose in #12550 to simply refuse to work in such confused environment .

There are also a handful of supplementary tests in 76766c8, and ff3d473, and 22da51a.
This 22da51a commit also contains a small fix for the toplevel value pretty-printer.

@gasche
Copy link
Copy Markdown
Member

gasche commented Jan 17, 2024

@shindere I am out of ideas for who would want to review this PR, and I would rather have someone else than myself do it. Would you by chance be able to ask someone at Tarides to do a review on their upstream maintenance time?

@shindere
Copy link
Copy Markdown
Contributor

shindere commented Jan 18, 2024 via email

@gasche
Copy link
Copy Markdown
Member

gasche commented Feb 1, 2024

@shindere : ping.

@shindere
Copy link
Copy Markdown
Contributor

shindere commented Feb 1, 2024 via email

@gasche
Copy link
Copy Markdown
Member

gasche commented Mar 30, 2024

@Julow I wonder if you would be interested in helping review this PR. It looks like I will do it myself otherwise, but maybe we could split some of the work.

Copy link
Copy Markdown
Contributor

@Julow Julow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice step in the direction of full Unicode support :)

Comment thread utils/misc.mli Outdated
- digits 0-9, underscore, single quote
*)

module UString : sig
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module name is a bit too general. What do you think of UIdent ?
normalize, capitalize and uncapitalize are not as general as they seem as they work only on "identifier characters". The other functions end in _identifier.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UIdent does sound clearer to me, thanks for the suggestion!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module name is now ambiguous with the UIDENT token, where the "U" means "uppercase" and not "unicode". I don't have a better suggestion though.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, UIdent is a problem. Let's think of a better name.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposals:

  • Ident_name
  • Ident_text
  • Utf_ident

While looking at the mli, I also wondered if we should rename is_valid_identifier into is_valid, and validate_identifier into validate.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ident is already taken by typecore for something similar but slightly different: unique names.
Perhaps Name is the natural name for this concept ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with the very explicit Utf8_lexeme which conveys information on both the encoding and the content.

Comment thread utils/misc.ml
Comment thread utils/misc.mli
Comment thread parsing/unit_info.ml Outdated
Comment thread parsing/lexer.mll Outdated
{ OPTLABEL name }
| "?" raw_ident_escape (ident_ext as raw_name) ':'
{ let name = ident_for_extended lexbuf raw_name in
if UString.is_capitalized name
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous rule matches names that always pass this check. Why duplicate the rule ?

Going further, perhaps we could avoid using identstart and identchar entirely, ensuring that we support "extended" identifiers everywhere.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a performance vs maintainability tradeoff. With the fast ascii path separate from the extended unicode path, we keep the same performance as the current lexer on micro-benchmarks. Merging the two paths in one results into a 2x slowdown on the same benchmark (see https://github.com/Octachron/ocaml-lexer-microbenchmarks).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that the difference goes down to 1.30x with an extremely large and more representative example (all .ml files of this repo concatenated). Is lexing time that important ?

Copy link
Copy Markdown
Member Author

@Octachron Octachron Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it maybe a better compromise would be to keep the fast ascii rules for identifiers and labels, but merge the raw_identifier case with the unicode one? This reduces the number of rules to:

  • 3 for identifiers: lowercase_ascii, uppercase_ascii, complex
  • 2 for labels and optional labels: lowercase_ascii, complex

with (most probably) no bearing on the performances.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think reducing the number of rules can only be good for maintenance. Do uppercase ASCII identifiers need a fast path too ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My opinion is that the ascii rule are also quite simpler to read:

  | lowercase identchar * as name
      { try Hashtbl.find keyword_table name
        with Not_found -> LIDENT name }
  | uppercase identchar * as name
      { UIDENT name } (* No capitalized keywords *)

compared to

  | (raw_ident_escape? as escape) (ident_ext as raw_name)
      { let name = ident_for_extended lexbuf raw_name in
        if UIdent.is_capitalized name then begin
            if escape="" then UIDENT name
            else error lexbuf (Capitalized_raw_identifier name)
        end else
          LIDENT name
      } (* No non-ascii keywords *)

Consequently, I belive that keeping both rules for the ascii case is fine (or possibly very slightly better?) in term of maintenance.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this example the rule is made complicated by the Capitalized_raw_identifier check. It could be defined elsewhere, like it's done for other checks.
The is_capitalized if seems as readable as two separate rules, though it's nice that the keyword lookup is done in a separate rule. Also, the more complex rule is not made even more complex by dropping the simpler rule.

Comment thread parsing/lexer.mll Outdated
LABEL name }
| "~" (lowercase_latin1 identchar_latin1 * as name) ':'
{ warn_latin1 lexbuf;
| "~" (ident_ext as raw_name) ':'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's now 4 very similar rules for labels ! Any way to collapse them to 1 or 2 ?

Comment thread testsuite/tests/typing-unicode/genfiles.ml Outdated
@tmcgilchrist
Copy link
Copy Markdown
Contributor

@Octachron does this need a fresh review?

@Octachron
Copy link
Copy Markdown
Member Author

@tmcgilchrist , @Julow has started reviewing this PRs but more eyes are always welcome.

Copy link
Copy Markdown
Contributor

@Julow Julow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a full review and all my comments are nitpicks. I think this is ready to be merged!

This should be documented in the manual, though that can be done later.

Comment thread utils/misc.ml Outdated
Comment thread utils/misc.ml
Comment thread parsing/lexer.mll
LABEL name }
| "~" (lowercase_latin1 identchar_latin1 * as name) ':'
{ warn_latin1 lexbuf;
| "~" (raw_ident_escape? as escape) (ident_ext as raw_name) ':'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice simplification ! Though, does the lexing performances of labels matter ? I'd expect it to matter less than LIDENT and we could get rid of more rules.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It matters less, but I wouldn't be surprised if some coding style ends up with a significant minority of label identifiers.

Comment thread parsing/lexer.mll Outdated
{ warn_latin1 lexbuf;
| "~" (raw_ident_escape? as escape) (ident_ext as raw_name) ':'
{ let name = ident_for_extended lexbuf raw_name in
if escape="" then check_label_name lexbuf name;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not immediately obvious how and why the check is different when escape is present. I suggest passing whether an escape is present to the check_label_name function. For example, check_label_name ~has_escape:(escape<>"").
This way, the logic could be deduplicated from the OPTLABEL rule and documented in the check function.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took an alternative path, and added an check_unicode_label_name to factorize the two tests.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_label_name and check_unicode_label_name are practically identical. Perhaps they could be merged or shared more.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, they are now merged together (with an optional argument).

Comment thread parsing/lexer.mll Outdated
{ let name = ident_for_extended lexbuf raw_name in
if UIdent.is_capitalized name then begin
if escape="" then UIDENT name
else error lexbuf (Capitalized_raw_identifier name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here would be a nice place to document why capitalized raw ident escapes are not allowed. I guess this is because it would allow capitalized LIDENT, which would be annoying during typing and pprinting ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is more because raw identifiers were introduced to allow adding new keywords without breaking backward compatibility; and we don't have capitalized keywords.

Comment thread parsing/lexer.mll
STRING (s, loc, Some delim) }
| "{%" (extattrident as id) "|"
| "{" (ident_ext as raw_name) '|'
{ let delim = validate_delim lexbuf raw_name in
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rule could also be squashed with ASCII one ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that performance don't matter that much here.

Comment thread parsing/lexer.mll
| "{" ('%' '%'? extattrident blank*)? (lowercase* as delim) "|"
{
| "{" ('%' '%'? extattrident blank*)? (ident_ext? as raw_delim) "|"
{ match lax_delim raw_delim with
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems weird that delimited strings are not recognized in comments because of an encoding error in an arbitrary string. What would happen if invalid ident chars were allowed here as well as uppercases ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since delimited strings are recognized in comments with the aim make it straightforward to comment code like

let a = {x| *) |x}

I find it is coherent to only recognize delimited strings that are lexically valid in comments.

Comment thread parsing/lexer.mll Outdated
if delim = edelim then lexbuf.lex_start_p
else (store_lexeme lexbuf; quoted_string delim lexbuf)
}
| "|" (ident_ext as raw_edelim) "}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this optimisation necessary ?
Also, as delim has been validated before entering this state, the None case is not necessary as the Error value returned by UIdent.normalize cannot be equal delim. This could removes duplicated code.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, the optimisation doesn't seem necessary in this case.

Comment thread parsing/unit_info.ml Outdated
| 'A'..'Z' | 'a'..'z' | '_' | '\192'..'\214' | '\216'..'\246'
| '\248'..'\255' | '\'' | '0'..'9' -> true
| _ -> false
let modname_from_source source_file =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of deprecating modname_from_source in favor of a new unsafe_modname_from_source ? It's currently used to construct idents in tools/ and in driver/ where strictness is not needed but the more verbose name couldn't hurt and might avoid bad uses outside of the compiler.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find unsafe a bit too much since the function is not unsafe by itself. I have settled for lax_modname_from_source to make the strict and lax variant more symmetric.

xavierleroy and others added 18 commits April 30, 2024 15:14
Facilities are provided for NFC normalization, capitalization /
uncapitalization, and checking identifier validity.

Support is currently limited to Latin-9 letters.
We use the new Misc.UString module to check that the letters are allowed
and to normalize their Unicode representation.

The lexing of labels was changed so that labels that start with an
uppercase are accepted by the lexer regexps, then rejected by the
action.  This works better with non-ASCII letters and produces a nicer
error message.
…names

Instead of `String.{un,}capitalize_ascii` like before.

This should support compilation units whose module names start with
or contain non-ASCII letters.
Both NFC and NFD representations are used in this file
(hoping Git does not change this).
We reject invalid utf8 encoding and strings containing the replacement
character U+FFFD in identifiers and output targets.

Similarly, we ignore existing artifacts with encoding errors whenever we are
doing an artifact look-up in load paths.
The filename ça.ml is NFD normalized. The test source now the two
normalizations for each unit identifiers.
Copy link
Copy Markdown
Contributor

@Julow Julow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply. I approve :)

@Octachron Octachron merged commit 389ae57 into ocaml:trunk Jul 18, 2024
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 29, 2024
Modest support for Unicode letters in identifiers, take 2

(cherry picked from commit 389ae57)
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 29, 2024
(cherry picked from commit 389ae57)
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 29, 2024
(cherry picked from commit 389ae57)
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 29, 2024
(cherry picked from commit 389ae57)
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 29, 2024
(cherry picked from commit 389ae57)
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 29, 2024
(cherry picked from commit 389ae57)
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 29, 2024
(cherry picked from commit 389ae57)
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 29, 2024
(cherry picked from commit 389ae57)
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 29, 2024
(cherry picked from commit 389ae57)
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 29, 2024
(cherry picked from commit 389ae57)
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 29, 2024
(cherry picked from commit 389ae57)
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 29, 2024
(cherry picked from commit 389ae57)
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 29, 2024
(cherry picked from commit 389ae57)
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 29, 2024
(cherry picked from commit 389ae57)
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 29, 2024
(cherry picked from commit 389ae57)
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 29, 2024
(cherry picked from commit 389ae57)
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 29, 2024
(cherry picked from commit 389ae57)
@hhugo
Copy link
Copy Markdown
Contributor

hhugo commented Nov 2, 2024

@Octachron, has this been documented in the manual ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants