Modest support for Unicode letters in identifiers #11736
Conversation
Facilities are provided for NFC normalization, capitalization / uncapitalization, and checking identifier validity. Support is currently limited to Latin-9 letters.
We use the new Misc.UString module to check that the letters are allowed and to normalize their Unicode representation. The lexing of labels was changed so that labels that start with an uppercase are accepted by the lexer regexps, then rejected by the action. This works better with non-ASCII letters and produces a nicer error message.
…names, instead of `String.{un,}capitalize_ascii` as before. This should support compilation units whose module names start with or contain non-ASCII letters.
Both NFC and NFD representations are used in this file (hoping Git does not change this).
In passing, I made a small and, I hope, uncontroversial change to the lexical analysis of labels. Currently, the requirement that a label starts with a lowercase letter is hard-wired in the lexer regexps. Consequently, a capitalized label produces an uninformative error:
For this PR, it's simpler to have regexps that recognize labels with arbitrary case, then check capitalization in the semantic action. This even produces a nicer error message:
To facilitate reviewing, here are some Q&A:
- Q: Why did you put your …
- Q: Why Latin-9 and not OCaml's original Latin-1?
- Q: How much work to get full support for Unicode identifiers à la UAX 31?
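The label-lexing change described above can be sketched in plain OCaml. This is hypothetical illustration code, not the PR's actual lexer action: the regexp accepts labels of either case, and the semantic action then rejects capitalized ones with a targeted message.

```ocaml
(* Sketch of a semantic action validating a label already matched by a
   case-insensitive regexp. [check_label] is a hypothetical name. *)
let check_label (name : string) : (string, string) result =
  if name = "" then Error "empty label"
  else match name.[0] with
    | 'a'..'z' | '_' -> Ok name
    | 'A'..'Z' ->
        (* The informative error the PR description mentions. *)
        Error (Printf.sprintf "label '%s' must not be capitalized" name)
    | _ ->
        Error (Printf.sprintf "label '%s' starts with an invalid character" name)

let () =
  assert (check_label "foo" = Ok "foo");
  assert (check_label "Foo" = Error "label 'Foo' must not be capitalized")
```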
I did the following experiment in the toplevel under both Windows and Linux (WSL):
which matches what I read somewhere: these operating systems store filenames as opaque sequences of code units, without normalization. I think this means that trying to map identifiers to filenames is never going to work 100% of the time. One could do a …
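The "opaque sequences of code units" point can be seen directly at the byte level; a small sketch (assuming only the standard UTF-8 encodings of U+00E9 and of U+0065 followed by U+0301):

```ocaml
(* "é" has two canonically equivalent encodings: NFC (the precomposed
   code point U+00E9) and NFD (base "e" plus combining acute U+0301).
   As byte strings they differ, so a filesystem that compares names as
   opaque code units sees two distinct files. *)
let nfc = "\xC3\xA9"    (* U+00E9, precomposed é *)
let nfd = "e\xCC\x81"   (* U+0065 + U+0301, decomposed é *)

let () =
  assert (nfc <> nfd);            (* distinct byte strings *)
  assert (String.length nfc = 2);
  assert (String.length nfd = 3)
```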
@nojb Pardon my ignorance, but some filesystems already have issues with capitalization, don't they? For example on macOS. So you might have …
I am not sure I follow: I thought (but I don't have much experience with macOS) that the filesystem there was case-insensitive, which means that …
That's my understanding too.
Actually we already do this (readdir + normalization) in some cases: … and the corresponding code for … But you're right that … My hope for Linux and Windows is that text editors and other applications use NFC by default and create files with NFC-normalized names. I'm pretty sure this is the case for Linux. I haven't checked what Windows editors typically do.
I didn't look at the PR yet (will try to have a look on sunday evening) but the discussion is getting a bit confusing. Just to make it clear: you have to normalise both filenames and sources so that everything is in the same form inside the OCaml program when you do your comparisons (also beware that in general normalisation is not closed under concatenation). You just can't trust foreign input to be in a given form, neither pure text files nor filenames returned by system calls give you any guarantee.
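The closure caveat can be illustrated with plain byte strings: both operands below are individually in NFC, but their concatenation is not, because NFC would compose the pair into the single code point U+00E9. A minimal sketch:

```ocaml
(* "e" is NFC; a lone combining acute U+0301 is also NFC (nothing to
   compose with). Concatenating them yields <U+0065,U+0301>, which is
   NOT in NFC: the NFC form of that sequence is the single code point
   U+00E9. So NFC(a) ^ NFC(b) is not necessarily in NFC. *)
let e = "e"                (* NFC *)
let acute = "\xCC\x81"     (* U+0301, NFC on its own *)
let composed = "\xC3\xA9"  (* U+00E9, the NFC form of e + acute *)

let () = assert (e ^ acute <> composed)
```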
I don't think you have a choice but do this. Since we are not dealing with the full range of Unicode I don't think this is really expensive; as @xavierleroy said, the load path already … There's one question left, though: what do you do when you have two filenames that normalize to the same value (like in your example)?
Option 2 looks better to me.
Do ocamllex, Menhir, and ocamlyacc also need to be updated?
Today's XKCD is very relevant to this discussion: https://xkcd.com/2700
You don't need to replace anything, the hover tip reflects well that people still don't get Unicode's encoding space :-) |
So I did a first pass on this. I mainly reviewed the Unicode bits; I didn't have the time to look more precisely at the context in the load path to see if it does what we want (i.e. no normal form assumptions; there's a test though!) and what happens with what @nojb mentioned.

Regarding the general correctness of the transformation on the compiler, it's a bit difficult to assert completeness through what the PR shows, but I guess appropriate `git grep`s were performed.
There are still a few things that I would like to see:
- As it stands the lexer is slightly tweaked to accept more bytes for identifiers. After that best effort decodes are performed with U+FFFD replacements during normalisation. This means that compilation succeeds on malformed UTF-8 character streams. I don't think that's a very good idea. I would really like the OCaml compiler to assert the character stream as being UTF-8 encoded and bail out with a character stream error if it's not. This will also ensure that my (byte-escape free) string literals are valid UTF-8 which is good to have for the language users.
- Update to the manual of course!
But other than that I really like what I'm seeing here, especially proposing latin-9 characters is a nice touch which brings in a few more million people for this first step.
P.S. For other people reviewing, this latin-9 to unicode mapping can be a useful reference.
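For the first point above, a sketch of strict UTF-8 validation using the standard library's decoder (assuming OCaml >= 4.14; this is illustrative, not the PR's code):

```ocaml
(* Decode the whole string and report the byte offset of the first
   malformed sequence instead of silently substituting U+FFFD. *)
let assert_utf_8 (s : string) : (unit, int) result =
  let rec go i =
    if i >= String.length s then Ok ()
    else
      let d = String.get_utf_8_uchar s i in
      if Uchar.utf_decode_is_valid d
      then go (i + Uchar.utf_decode_length d)
      else Error i  (* offset where the character stream error occurs *)
  in
  go 0

let () =
  assert (assert_utf_8 "caf\xC3\xA9" = Ok ());  (* valid UTF-8 "café" *)
  assert (assert_utf_8 "caf\xE9" = Error 3)     (* lone Latin-1 0xE9 byte *)
```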
(* NFD representation *)
let f = function
  | Æsop -> 1 | Âcre -> 2 | Ça -> 3 | Élégant -> 4 | Öst -> 5 | Œuvre -> 6
Here are a few tests that could be added:
let () = assert (f Élégant (* NFC encoded *) = 4)
let () =
let called = ref false in
let élégant (* NFC encoded *) () = called := true in
élégant (* NFD encoded *) (); assert (!called)
(* The following two defs should error with 'Multiple definition…' *)
module Élégant (* NFC encoded *) = struct end
module Élégant (* NFD encoded *) = struct end
@@ -232,6 +232,206 @@ module Stdlib = struct
external compare : 'a -> 'a -> int = "%compare"
end
(** {1 Minimal support for Unicode characters in identifiers} *)

module UString = struct
I won't fight since this is internal to your project but personally, given `Uchar`, I would call that `Ustring`.
(0xdd, 0xfd); (* Ý, ý *) (0xde, 0xfe); (* Þ, þ *)
(0x160, 0x161); (* Š, š *) (0x17d, 0x17e); (* Ž, ž *)
(0x152, 0x153); (* Œ, œ *) (0x178, 0xff); (* Ÿ, ÿ *)
(0x1e9e, 0xdf); (* ẞ, ß *)
I crossed checked this data with Unicode's default case mapping via:
let check_case (upper, lower) =
try
let upper = Uchar.of_int upper and lower = Uchar.of_int lower in
assert (Uucp.Case.Map.to_lower upper = `Uchars [lower]);
assert (Uucp.Case.Map.to_upper lower = `Uchars [upper])
with
| Assert_failure _ ->
Printf.printf "Invalid pair U+%04X, U+%04X\n%!" upper lower
This just fails on the `(0x1e9e, 0xdf); (* ẞ, ß *)` case: `ẞ` is mapped to `ß`, but `ß` is mapped to `SS`:
# Uucp.Case.Map.to_lower (Uchar.of_int 0x1E9E);;
- : [ `Self | `Uchars of Uchar.t list ] = `Uchars [U+00DF]
# Uucp.Case.Map.to_upper (Uchar.of_int 0x00DF);;
- : [ `Self | `Uchars of Uchar.t list ] = `Uchars [U+0053; U+0053]
The reason why this is the case (yes) is detailed in this Unicode case mapping FAQ entry. I'll let you choose, or ask a German friend, whether it's a good idea to support this mapping.
('u', 0x302, 0xfb); (* û *) ('u', 0x308, 0xfc); (* ü *)
('y', 0x301, 0xfd); (* ý *) ('y', 0x308, 0xff); (* ÿ *)
('s', 0x30c, 0x161); (* š *) ('z', 0x30c, 0x17e); (* ž *)
]
I checked this data for correctness with:
let check_comp (base, comb, comp) =
  try
    let comp = Uchar.of_int comp and base = Char.code base in
    assert (Uunf.decomp comp = [| base; comb |])
  with
  | Assert_failure _ ->
      Printf.printf "Invalid composition: U+%04X U+%04X <> U+%04X"
        (Char.code base) comb comp
let uchar_is_uppercase u =
  let c = Uchar.to_int u in
  if c < 0x80 then c >= 65 && c <= 90 else
When dealing with characters as ints, in general I find it easier if:
- everything is hex (because it's easier to match with U+XXXX specs);
- when the chars are printable, they appear as a comment following the number (like you did in `uchar_valid_in_identifier`).
| Some u' ->
    norm u' i'
| None ->
    Buffer.add_utf_8_uchar buf (transform prev);
Note that in general character case mappings do not preserve normalization forms.
Here's a simple example: the sequence ß + ◌̌ (`<U+00DF,U+030C>`) is NFC. But if you apply Unicode's default upper case mapping you get (as we have seen above) S + S + ◌̌ (`<U+0053,U+0053,U+030C>`), which is NFD; NFC would be S + Š (`<U+0053,U+0160>`).
I convinced myself that on the restricted subset we are dealing with, and given how the function is used below, the results are correct. But the function gets more general, and correctness is easier to assert, if you apply `transform` first and do the composition afterwards. Don't make me think!
@@ -742,6 +742,58 @@ module Magic_number : sig
val all_kinds : kind list
end

(** {1 Minimal support for Unicode characters in identifiers} *)

(** Characters allowed in identifiers are, currently:
Characters allowed in {{!Ustring.normalize}normalized} identifiers.
(** Characters allowed in identifiers are, currently:
    - ASCII letters A-Z a-z
    - Latin-1 letters (U+00C0 - U+00FF except U+00D7 and U+00F7)
Latin-9
(* Characters allowed in identifiers. Currently:
   - ASCII letters, underscore
   - Latin-1 letters, represented in NFC
Latin-9
let identchar_latin1 =
  ['A'-'Z' 'a'-'z' '_' '\192'-'\214' '\216'-'\246' '\248'-'\255' '\'' '0'-'9']
(* This should be kept in sync with the [is_identchar] function in [env.ml] *)
let utf8 = ['\192'-'\255'] ['\128'-'\191']*
That's a very rough notion of UTF-8 :-)
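To see just how rough, here is the lexer pattern `['\192'-'\255'] ['\128'-'\191']*` transcribed as a plain OCaml predicate; it happily accepts byte sequences that are not valid UTF-8 (overlong forms, truncated sequences):

```ocaml
(* Direct transcription of the regexp: a lead byte in 0xC0-0xFF
   followed by any number of bytes in 0x80-0xBF. This is a superset
   of well-formed UTF-8. *)
let matches_lexer_utf8 s =
  String.length s > 0
  && Char.code s.[0] >= 0xC0
  && (try
        String.iteri
          (fun i c ->
             if i > 0 && not (0x80 <= Char.code c && Char.code c <= 0xBF)
             then raise Exit)
          s;
        true
      with Exit -> false)

let () =
  assert (matches_lexer_utf8 "\xC3\xA9");  (* "é": genuine UTF-8 *)
  assert (matches_lexer_utf8 "\xC0\x80");  (* overlong encoding: accepted *)
  assert (matches_lexer_utf8 "\xC3")       (* truncated sequence: accepted *)
```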
Also `lex/lexer.mll` and `yacc/reader.c` need updates.
We've had excellent feedback from @dbuenzli, but that's it. @xavierleroy has not addressed the review comments yet; he may be planning to wait for more feedback before pouring more work into the PR. I would like to see this question making progress, and overall I like what I read from Xavier and Daniel. I am approving the principle of this change. But I don't have time to spend on it myself. So here is a proposal: @dbuenzli, if you are willing to complete the review process (if/when @xavierleroy is available to iterate based on your feedback), I will be happy to green-approve on behalf of your gray-approval, if any. (We will probably need a broader expression of support for this non-small change to the lexical conventions of the language; I can take care of drumming up support.)
I'm pretty happy with the handling of Unicode identifiers proposed in this PR, but the mapping between module identifiers and file names gives me pause. There are just too many places (in the current code base) where we call … As long as we don't understand what's going on already with ASCII capitalization, there's no hope I can address @dbuenzli's questions ("what do you do when you have two filenames that normalize to the same value?") or admonitions ("you just can't trust foreign input to be in a given form, neither pure text files nor filenames returned by system calls give you any guarantee"). So, I'm afraid I'll have to close this PR, as I don't have the time and energy required to first clean up what the current code base is doing w.r.t. capitalization of file names.
Maybe we could:
I haven't done (1), but here is a strawman for (2):
I'm not sure this will help. Empirically, projects that use lowercase file names only seem to work well, and projects that use properly capitalized file names seem to work too, but are less common, so perhaps they don't exercise all the cases. In other words, I would expect projects using capitalized filenames to be as problematic or even more problematic than those using lowercase filenames.
That would be a step forward, and something we could piggy-back on to also fail when equivalent NFC and NFD encodings exist. But I think it will need some refactoring in the too-many uses of … Also, it can have a significant performance impact: checking whether a file exists with …
Maybe a preliminary PR should start by removing all of these and replacing them with … That could help clarify the places that are going to look up the file system and how they are going to do it. More precisely the question is: how often is this not done via the … I don't think the normalized matching is going to be a performance hit with the …
My plan is indeed to first factorize the module/filename pairing across the compiler in a separate PR, to decouple the issue from the question of supporting Unicode.
I was wondering whether it could be simpler not to normalize identifiers. JavaScript identifiers are not normalized, and this seems to work well in practice.
But we have an issue on macOS, where checking whether a given module exists using …
Could we help with the filename issue by saying that we require compilation unit names to be ASCII-only? We would still need a bit of work to actually check and forbid other characters, but this could be an easier first step.
It feels a bit weird to make a difference between compilation units and other modules (as far as I know, currently there is no way to differentiate compilation units from other modules at the language level).
May I say that this provides a wonderful solution to the problem of
dependency resolution in presence of `open`: always use at least one
non-ascii char in names of submodules. This way if you see unicode it
can't be a toplevel file name! ;-)
That's an interesting suggestion, thanks for the JavaScript reference. I see one potential problem with macOS, but it's not the one you mentioned. Say I have a source file … Interestingly, …
It's a bit of a cop-out, but if nothing else works... I don't know which is worse, this suggestion or the JavaScript approach :-)
Note that there are other languages, like XML, that throw normalization out of the window for identifier matching (open/close tags), but these things happen in the same file. Regarding JavaScript, people likely don't run into problems because most text is input in NFC, is saved as such in files, and identifiers are never reified from the file system. When you start doing the latter, like OCaml does, you can't ignore the issue.
Anyone who is interested in more general support for Unicode identifiers than what is provided here could take a look at my ocaml-m17n package.
Right. I was aware of -- and quite impressed by -- this experiment of yours. This "modest support" PR was intended as a first incremental step in this direction, playing on the intuition that the ISO-Latin subset of Unicode should be trivial to support. However, this PR is hitting the "filename normalization" problem hard, and it's no simpler in the ISO-Latin special case than in the general case, so I'm not pushing this PR any longer. Can you describe at a high level how you went about the filename normalization problem? Given the great many occurrences of …
Thank you! 😊
This makes sense. For context, I made the ocaml-m17n project in response to a rather sharp and angry statement of a linguist I followed on Twitter at the time, who mentioned that she has a very low opinion of OCaml (I'll choose to not quote the inflammatory things she actually said) because it was so Latin-focused, and in particular required a language with an uppercase/lowercase distinction. I think it was a good use of my time since it seems to have been appreciated both by the OCaml community and in the Unicode committee (I believe Robin Leroy submitted a proposal inspired by my work) but I don't think it made her happy in the end...
I wrote a section in the README about exactly this; I'll condense it to a one paragraph description. ocaml-m17n works with NFC internally. It expects to be able to pass paths in the NFC form to the filesystem. Since this does not always work, it lists the contents of every directory searched, and issues a diagnostic for every path that looks similar but not equal to the NFC paths it is looking for. Because of the algorithm toNFKC_Casefold used in the "looks similar" comparison, that catches instances of mis-normalization (e.g. someone saving a module with a name in NFD on Linux), of the programmer supplying names with incorrect letter case after the initial letter, and also some obscure lookalikes that are probably not relevant. |
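The lookup scheme just described can be sketched as a pure function over a directory listing. Here `String.lowercase_ascii` stands in for toNFKC_Casefold (which would come from a Unicode library such as uucp), so only the shape of the algorithm is shown, not the real similarity test:

```ocaml
(* Look for [target] among [entries] (as returned by a readdir): exact
   match wins; otherwise entries that fold to the same key are
   candidates for a "did you mean / mis-normalized name" diagnostic. *)
type lookup = Found of string | Similar of string list | Absent

let lookup ~fold target entries =
  if List.mem target entries then Found target
  else
    match List.filter (fun e -> fold e = fold target) entries with
    | [] -> Absent
    | sims -> Similar sims

let () =
  let entries = ["Foo.ml"; "bar.ml"] in
  let fold = String.lowercase_ascii in
  assert (lookup ~fold "bar.ml" entries = Found "bar.ml");
  assert (lookup ~fold "foo.ml" entries = Similar ["Foo.ml"]);
  assert (lookup ~fold "baz.ml" entries = Absent)
```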
For reference, the "Normalizing identifiers" section of PEP 672 describes how Python handles Unicode identifiers, including module names.
I had a look at a version of this PR rebased on top of #12389 at https://github.com/octachron/ocaml/tree/unified_file_info+pr11736. Pleasingly, it reduces the fa5514c commit to 90c8948. There are still some decisions to make for file collisions within the same library (whenever a library ends up with both été.cmi, été.cmi, Été.cmi, ...), but I think it is better to discuss that point independently of this PR.
I'm in awe of the work by @Octachron on refactoring module name <-> file name conversions and on adapting this PR accordingly. As discussed with him, I'm closing this PR so that he can open a new one based on https://github.com/octachron/ocaml/tree/unified_file_info+pr11736. Thanks!
This is a continuation of #11717. The context is the same:
As a leftover from the 1990's, OCaml currently recognizes accented letters in identifiers provided they are encoded with the now-defunct Latin-1 character set (ISO 8859-1). There's considerable pressure to get rid of this special case and accept ASCII identifiers only + arbitrary UTF-8 in strings and comments, see #1802 for instance. However, I still like my accented letters in identifiers, because they work beautifully for textbooks written in Western languages other than English.
The present PR is my second and more general attempt at supporting UTF-8 encoded accented letters (and possibly more generally letters beyond ASCII letters) in identifiers.
A new module `UString` is added to `Misc` (sigh...) to provide the basic operations over these identifiers: … module names `Foo` and file names `foo.cmi`, `Foo.ml`, etc.

The lexer catches the UTF-8 form of these identifiers with a rather broad regexp, then uses the facilities from `UString` to check that they are supported and classify them into "capitalized" or "not capitalized".

Everywhere `String.capitalize_ascii` and `String.uncapitalize_ascii` were used to map between module names and file names, I used `Misc.Ustring.capitalize` and `Misc.Ustring.uncapitalize` instead. These functions perform NFC normalization of Latin-9 letters (other Unicode characters are left as is) and case change of the first character if it is an ASCII or Latin-9 letter. I hope this is enough to correctly map between accented module names like `Éléphant` and file names like `éléphant.cmi`, both on macOS (which stores file names in NFD but matches them up to NFC/NFD conversion) and on Linux and Windows (which seem to use mostly or even exclusively NFC for file names). This is being lightly tested in a new test added to the test suite.

Constructive criticism welcome!