
Modest support for Unicode letters in identifiers #11736

Closed
wants to merge 7 commits

Conversation

xavierleroy
Contributor

This is a continuation of #11717. The context is the same:

As a leftover from the 1990s, OCaml currently recognizes accented letters in identifiers provided they are encoded with the now-defunct Latin-1 character set (ISO 8859-1). There's considerable pressure to get rid of this special case and accept only ASCII identifiers, plus arbitrary UTF-8 in strings and comments; see #1802 for instance. However, I still like my accented letters in identifiers, because they work beautifully for textbooks written in Western languages other than English.

The present PR is my second and more general attempt at supporting UTF-8 encoded accented letters (and possibly more generally letters beyond ASCII letters) in identifiers.

A new module UString is added to Misc (sigh...) to provide the basic operations over these identifiers:

  • Recognizing valid identifiers. Currently, we only accept Latin-9 (as in ISO 8859-15) letters in addition to ASCII letters, digits, underscore and single quote, but this could be extended later.
  • Normalization (to NFC), so that identifiers can be compared with normal string equality.
  • Capitalization (change of case for the first character), so as to convert between module names Foo and file names foo.cmi, Foo.ml, etc.
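To illustrate why the normalization step matters, here is a toy sketch (not the PR's actual UString code; `normalize_toy` and its single-pair table are hypothetical simplifications):

```ocaml
(* The same accented letter has two Unicode encodings, so identifiers must
   be normalized before they can be compared with ordinary string
   equality. *)
let nfc = "\u{00E9}"   (* é as one precomposed code point (NFC) *)
let nfd = "e\u{0301}"  (* e followed by a combining acute accent (NFD) *)
let () = assert (nfc <> nfd)  (* byte-wise different *)

(* A toy NFC step covering just this one pair, in the spirit of the PR's
   normalization (hypothetical simplification): *)
let normalize_toy s =
  let b = Buffer.create (String.length s) in
  let i = ref 0 in
  let n = String.length s in
  while !i < n do
    if !i + 3 <= n && s.[!i] = 'e' && String.sub s (!i + 1) 2 = "\u{0301}"
    then (Buffer.add_string b "\u{00E9}"; i := !i + 3)
    else (Buffer.add_char b s.[!i]; incr i)
  done;
  Buffer.contents b

let () = assert (normalize_toy nfd = nfc)
```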

The lexer catches the UTF-8 form of these identifiers with a rather broad regexp, then uses the facilities from UString to check that they are supported and classify them into "capitalized" or "not capitalized".

Everywhere String.capitalize_ascii and String.uncapitalize_ascii were used to map between module names and file names, I used Misc.UString.capitalize and Misc.UString.uncapitalize instead. These functions perform NFC normalization of Latin-9 letters (other Unicode characters are left as is) and case change of the first character if it is an ASCII or Latin-9 letter. I hope this is enough to correctly map between accented module names like Éléphant and file names like éléphant.cmi, both on macOS (which stores file names in NFD but matches them up to NFC/NFD conversion) and on Linux and Windows (which seem to use mostly or even exclusively NFC for file names). This is being lightly tested in a new test added to the test suite.
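The intended first-character case change can be sketched like this (`capitalize_toy` is a hypothetical simplification handling just ASCII plus one precomposed letter; the real functions cover all of Latin-9):

```ocaml
(* Toy first-character capitalization: upcase an initial ASCII letter or
   an initial precomposed é (U+00E9 -> U+00C9).  Hypothetical sketch, not
   the PR's code. *)
let capitalize_toy s =
  if String.length s >= 2 && String.sub s 0 2 = "\u{00E9}" then
    (* replace the 2-byte UTF-8 sequence for é with the one for É *)
    "\u{00C9}" ^ String.sub s 2 (String.length s - 2)
  else if s <> "" then
    String.mapi (fun i c -> if i = 0 then Char.uppercase_ascii c else c) s
  else s

let () = assert (capitalize_toy "foo" = "Foo")
let () = assert (capitalize_toy "\u{00E9}l\u{00E9}phant"
                 = "\u{00C9}l\u{00E9}phant")
```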

Constructive criticism welcome!

Facilities are provided for NFC normalization, capitalization /
uncapitalization, and checking identifier validity.

Support is currently limited to Latin-9 letters.
We use the new Misc.UString module to check that the letters are allowed
and to normalize their Unicode representation.

The lexing of labels was changed so that labels that start with an
uppercase are accepted by the lexer regexps, then rejected by the
action.  This works better with non-ASCII letters and produces a nicer
error message.
…names

Instead of `String.{un,}capitalize_ascii` like before.

This should support compilation units whose module names start with
or contain non-ASCII letters.
Both NFC and NFD representations are used in this file
(hoping Git does not change this).
@xavierleroy
Contributor Author

In passing, I made a small and I hope uncontroversial change to the lexical analysis of labels ~foo: and ?foo:.

Currently, the requirement that a label starts with a lowercase letter is hard-wired in the lexer regexps. Consequently, a capitalized label produces a non-informative error:

# let f ~LBL:x = x + 1;;
         ^^^
Error: Syntax error

For this PR, it's simpler to have regexps that recognize labels with arbitrary case, then check capitalization in the semantic action. This even produces a nicer error message:

# let f ~LBL:x = x + 1;;
         ^^^
Error: `LBL' cannot be used as label name, it must start with a lowercase letter
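The "recognize broadly, check in the semantic action" pattern can be sketched as follows (an illustration only, not the lexer's actual code; `check_label_name` is a made-up name, and the argument is assumed to be a nonempty identifier already matched by the regexp):

```ocaml
(* Validate capitalization after the regexp has matched, which allows a
   precise error message instead of a generic syntax error. *)
let check_label_name name =
  match name.[0] with
  | 'a'..'z' | '_' -> Ok name
  | _ ->
      Error (Printf.sprintf
               "`%s' cannot be used as label name, \
                it must start with a lowercase letter" name)

let () = assert (check_label_name "lbl" = Ok "lbl")
let () = assert (check_label_name "LBL" <> Ok "LBL")
```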

@xavierleroy
Contributor Author

To facilitate reviewing, here are some Q&A :

Q: Why did you put your UString module in Misc and not in a separate compilation unit, to reduce clutter?
A: I started with a separate compilation unit, then realized that Misc.find_in_path_uncap needs the uncapitalize function from UString, so UString has to be linked whenever Misc is used, i.e. almost all the time.

Q: Why Latin-9 and not OCaml's original Latin-1?
A: Latin-9 fixes some deficiencies of Latin-1, esp. the letters Œ and œ that are dear to French purists. It still lacks a capital form of ß, but I took the capital ẞ that Unicode added in 2008, just for fun.

Q: How much work to get full support for Unicode identifiers a la UAX 31?
A: Technically, we'd need to reimplement the Misc.UString operations on top of a serious Unicode library, to get proper character classification, NFKC normalization, etc.
Conceptually, OCaml gives semantic meaning to the uppercase / lowercase distinction that is missing from many scripts. So, we might not accept that a constructor or module name starts with a Chinese ideogram, for example; and we may need an additional way to indicate "this is a constructor or module name".
Finally, visual confusion between identifiers is a problem and would need to be addressed by new warnings.

@nojb
Contributor

nojb commented Nov 18, 2022

Linux and Windows (which seem to use mostly or even exclusively NFC for file names).

I did the following experiment in the toplevel under both Windows and Linux (WSL):

# close_out (open_out "\u{e9}"); close_out (open_out "\u{65}\u{301}");;
- : unit = ()
# Sys.readdir ".";;
- : string array = [|"é"; "é"|]

which matches what I read somewhere: these operating systems store filenames as opaque sequences of code units, without normalization. I think this means that trying to map identifiers to filenames is never going to work 100% of the time. One could do a readdir and normalize the results before looking up the corresponding filename, but this is more expensive...

@c-cube
Contributor

c-cube commented Nov 18, 2022

@nojb pardon my ignorance, but some filesystems already have issues with capitalization, don't they? For example on macOS. So you might have A.ml and a.ml and existing normalization (String.capitalize_ascii) will already have this issue that two files map to the same module name.

@nojb
Contributor

nojb commented Nov 18, 2022

some filesystems already have issues with capitalization, don't they? For example on macOS. So you might have A.ml and a.ml and existing normalization (String.capitalize_ascii) will already have this issue that two files map to the same module name.

I am not sure I follow: I thought (but I don't have much experience with macOS) that the filesystem there was case-insensitive which means that A.ml and a.ml would be the same file, right? What am I missing?

@xavierleroy
Contributor Author

xavierleroy commented Nov 18, 2022

these operating systems store filenames as opaque sequences of code units, without normalization.

That's my understanding too.

One could do a readdir and normalize the results

Actually we already do this (readdir + normalization) in some cases:
https://github.com/xavierleroy/ocaml/blob/ce5097327f8f1d1a06ab1740a45456f764c18733/utils/load_path.ml#L40-L48

and the corresponding code for find could easily use normalization too. (I'll put this on my todo list.)

But you're right that Misc.find_in_path and related functions would need to be rewritten based on readdir to handle normalization differences in Linux and Windows. (macOS should be fine, as the file system seems to handle these differences by itself.)

My hope for Linux and Windows is that text editors and other applications use NFC by default and create files with NFC-normalized names. I'm pretty sure this is the case for Linux. I haven't checked what Windows editors typically do.

@c-cube
Contributor

c-cube commented Nov 18, 2022

@nojb right, that's the other way around, my bad (see e.g. this).

@dbuenzli
Contributor

I didn't look at the PR yet (will try to have a look on Sunday evening) but the discussion is getting a bit confusing.

Just to make it clear: you have to normalise both filenames and sources so that everything is in the same form inside the OCaml program when you do your comparisons (also beware that in general normalisation is not closed under concatenation).

You just can't trust foreign input to be in a given form, neither pure text files nor filenames returned by system calls give you any guarantee.

@dbuenzli
Contributor

One could do a readdir and normalize the results before the looking up the corresponding filename, but this is more expensive...

I don't think you have a choice but to do this. Since we are not dealing with the full range of Unicode, I don't think this is really expensive: as @xavierleroy said, the load path already readdirs, and the US-ASCII-only path can be made fast; it can be just a scan that returns the same string.

There's one question left, though: what do you do when you have two filenames that normalize to the same value (as in your example)?

  1. You favour one, e.g. the "most NFC" by choosing the shortest one in bytes.
  2. You refuse to work in these conditions and ask users to be sane about their directories.

Option 2. looks better to me.
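Option 2 could be sketched along these lines (a hypothetical illustration; `index_dir` and its `normalize` parameter are made-up names, not the compiler's code):

```ocaml
(* Build a map from normalized filename to the actual filename found by
   readdir, refusing to proceed when two filenames normalize to the same
   value (option 2 above). *)
module M = Map.Make (String)

let index_dir normalize files =
  Array.fold_left
    (fun acc file ->
       let key = normalize file in
       match M.find_opt key acc with
       | None -> M.add key file acc
       | Some other ->
           failwith
             (Printf.sprintf "%s and %s normalize to the same name"
                other file))
    M.empty files

let () =
  (* With the identity as a stand-in normalization: *)
  let m = index_dir (fun s -> s) [| "foo.cmi"; "bar.cmi" |] in
  assert (M.find "foo.cmi" m = "foo.cmi");
  (* Two names colliding after normalization is an error: *)
  assert (try ignore (index_dir (fun _ -> "x") [| "a"; "b" |]); false
          with Failure _ -> true)
```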

@DemiMarie
Contributor

Do ocamllex, Menhir, and ocamlyacc also need to be updated?

@xavierleroy
Contributor Author

Today's XKCD is very relevant to this discussion: https://xkcd.com/2700
(Replace "null string terminator" by "decomposed accented letter"...)

@dbuenzli
Contributor

Today's XKCD is very relevant to this discussion: https://xkcd.com/2700
(Replace "null string terminator" by "decomposed accented letter"...)

You don't need to replace anything, the hover tip reflects well that people still don't get Unicode's encoding space :-)

@dbuenzli left a comment

So I did a first pass on this. I mainly reviewed the Unicode bits; I didn't have the time to look more precisely at the context in the load path to see if it does what we want (i.e. no normal form assumptions; there's a test though!) and what happens with what @nojb mentioned.

Regarding the general correctness of the transformation on the compiler, it's a bit difficult to assert completeness through what the PR shows, but I guess appropriate git greps were performed.

There are still a few things that I would like to see:

  1. As it stands, the lexer is slightly tweaked to accept more bytes for identifiers. After that, best-effort decodes are performed with U+FFFD replacements during normalisation. This means that compilation succeeds on malformed UTF-8 character streams. I don't think that's a very good idea. I would really like the OCaml compiler to assert that the character stream is UTF-8 encoded and bail out with a character-stream error if it's not. This will also ensure that my (byte-escape-free) string literals are valid UTF-8, which is good to have for the language users.

  2. Update to the manual of course!

But other than that I really like what I'm seeing here; especially proposing Latin-9 characters is a nice touch, which brings in a few more million people for this first step.


P.S. For other people reviewing, this Latin-9 to Unicode mapping can be a useful reference.

(* NFD representation *)

let f = function
| Æsop -> 1 | Âcre -> 2 | Ça -> 3 | Élégant -> 4 | Öst -> 5 | Œuvre -> 6

Here are a few tests that could be added:

let () = assert (f Élégant (* NFC encoded *) = 4)

let () = 
  let called = ref false in 
  let élégant (* NFC encoded *) () = called := true in 
  élégant (* NFD encoded *) (); assert (!called) 

(* The following two defs should error with 'Multiple definition…' *)
module Élégant (* NFC encoded *) = struct end
module Élégant (* NFD encoded *) = struct end

@@ -232,6 +232,206 @@ module Stdlib = struct
external compare : 'a -> 'a -> int = "%compare"
end

(** {1 Minimal support for Unicode characters in identifiers} *)

module UString = struct

I won't fight since this is internal to your project, but personally, given Uchar, I would call that Ustring.

(0xdd, 0xfd); (* Ý, ý *) (0xde, 0xfe); (* Þ, þ *)
(0x160, 0x161); (* Š, š *) (0x17d, 0x17e); (* Ž, ž *)
(0x152, 0x153); (* Œ, œ *) (0x178, 0xff); (* Ÿ, ÿ *)
(0x1e9e, 0xdf); (* ẞ, ß *)

I crossed checked this data with Unicode's default case mapping via:

let check_case (upper, lower) =
  try
    let upper = Uchar.of_int upper and lower = Uchar.of_int lower in
    assert (Uucp.Case.Map.to_lower upper = `Uchars [lower]);
    assert (Uucp.Case.Map.to_upper lower = `Uchars [upper])
  with
  | Assert_failure _ ->
      Printf.printf "Invalid pair U+%04X, U+%04X\n%!" upper lower

This just fails on the (0x1e9e, 0xdf); (* ẞ, ß *) case: ẞ is mapped to ß, but ß is mapped to SS:

# Uucp.Case.Map.to_lower (Uchar.of_int 0x1E9E);;
- : [ `Self | `Uchars of Uchar.t list ] = `Uchars [U+00DF]
# Uucp.Case.Map.to_upper (Uchar.of_int 0x00DF);;
- : [ `Self | `Uchars of Uchar.t list ] = `Uchars [U+0053; U+0053]

The reason why this is the case (yes) is detailed in this Unicode case mapping FAQ entry. I'll let you choose, or ask a German friend, whether it's a good idea to support this mapping.

('u', 0x302, 0xfb); (* û *) ('u', 0x308, 0xfc); (* ü *)
('y', 0x301, 0xfd); (* ý *) ('y', 0x308, 0xff); (* ÿ *)
('s', 0x30c, 0x161); (* š *) ('z', 0x30c, 0x17e); (* ž *)
]

I checked this data for correctness with:

let check_comp (base, comb, comp) =
  try
    let comp = Uchar.of_int comp and base = Char.code base in
    assert (Uunf.decomp comp = [| base; comb |])
  with
  | Assert_failure _ ->
      Printf.printf "Invalid composition: U+%04X U+%04X <> U+%04X\n%!"
        (Char.code base) comb comp


let uchar_is_uppercase u =
  let c = Uchar.to_int u in
  if c < 0x80 then c >= 65 && c <= 90 else

When dealing with characters as ints in general I find it easier if:

  1. Everything is hex (because it's easier to match with U+XXXX specs)
  2. When the chars are printable, have them as a comment following the number (like you did in uchar_valid_in_identifier)

| Some u' ->
    norm u' i'
| None ->
    Buffer.add_utf_8_uchar buf (transform prev);
@dbuenzli commented Nov 20, 2022

Note that in general character case mappings do not preserve normalization forms.

Here's a simple example: the sequence ß + ◌̌ (<U+00DF,U+030C>) is NFC. But if you apply Unicode's default upper case mapping you get (as we have seen above) S + S + ◌̌ (<U+0053,U+0053,U+030C>) which is NFD, NFC would be S + Š (<U+0053,U+0160>).

I convinced myself that, on the restricted subset we are dealing with and given how the function is used below, the results are correct. But the function gets more general, and correctness is easier to assert, if you apply transform first and do the composition afterwards. Don't make me think!

@@ -742,6 +742,58 @@ module Magic_number : sig
val all_kinds : kind list
end

(** {1 Minimal support for Unicode characters in identifiers} *)

(** Characters allowed in identifiers are, currently:

Characters allowed in {{!Ustring.normalize}normalized} identifiers.


(** Characters allowed in identifiers are, currently:
- ASCII letters A-Z a-z
- Latin-1 letters (U+00C0 - U+00FF except U+00D7 and U+00F7)

Latin-9


(* Characters allowed in identifiers. Currently:
- ASCII letters, underscore
- Latin-1 letters, represented in NFC

Latin-9

let identchar_latin1 =
['A'-'Z' 'a'-'z' '_' '\192'-'\214' '\216'-'\246' '\248'-'\255' '\'' '0'-'9']
(* This should be kept in sync with the [is_identchar] function in [env.ml] *)
let utf8 = ['\192'-'\255'] ['\128'-'\191']*

That's a very rough notion of UTF-8 :-)


Also lex/lexer.mll and yacc/reader.c need updates.

@gasche
Member

gasche commented Jan 13, 2023

We've had excellent feedback from @dbuenzli, but that's it. @xavierleroy has not addressed the review comments yet; he may be planning to wait for more feedback before pouring more work into the PR.

I would like to see this question making progress, and overall I like what I read from Xavier and Daniel. I am approving the principle of this change. But I don't have time to spend on it myself. So here is a proposal: @dbuenzli, if you are willing to complete the review process (if/when @xavierleroy is available to iterate based on your feedback), I will be happy to green-approve on behalf of your gray-approval, if any. (We will probably need a broader expression of support for this non-small change to the lexical conventions of the language; I can take care of drumming up support.)

@xavierleroy
Contributor Author

I'm pretty happy with the handling of Unicode identifiers proposed in this PR, but the mapping between module identifiers and file names gives me pause. There are just too many places (in the current code base) where we call String.capitalize_ascii and String.uncapitalize_ascii in an apparent attempt to convert between the two. I have the feeling this cannot be right, e.g. if we have foo.cmi, Foo.cmi, foo.cmo and Foo.cmo in the same directory, which ones does ocamlc pick? And does it pick consistently?

As long as we don't understand what's already going on with ASCII capitalization, there's no hope I can address @dbuenzli's questions ("what do you do when you have two filenames that normalize to the same value?") or admonitions ("you just can't trust foreign input to be in a given form; neither pure text files nor filenames returned by system calls give you any guarantee").

So, I'm afraid I'll have to close this PR, as I don't have the time and energy required to first clean up what the current code base is doing w.r.t. capitalization of file names.

@gasche
Member

gasche commented Jan 13, 2023

I don't have the time and energy required to first clean up what the current code base is doing w.r.t. capitalization of file names.

Maybe we could:

  1. Articulate precisely the issue.
  2. Agree on what should be done.
  3. Ask for a volunteer to do it.

I haven't done (1), but here is a strawman for (2):

  • encourage people to use capitalized filenames from the start (maybe: add a warning when we need capitalization, disabled by default)
  • fail when both capitalizations exist (this is broken on case-insensitive filesystems)

@xavierleroy
Contributor Author

xavierleroy commented Jan 13, 2023

encourage people to use capitalized filenames from the start (maybe: add a warning when we need capitalization, disabled by default)

I'm not sure this will help. Empirically, projects that use lowercase file names only seem to work well, and projects that use properly capitalized file names seem to work too, but are less common, so perhaps they don't exercise all the cases.

In other words, I would expect projects using capitalized filenames to be as problematic or even more problematic than those using lowercase filenames.

fail when both capitalizations exist (this is broken on case-insensitive filesystems)

That would be a step forward, and something we could piggy-back on to also fail when equivalent NFC and NFD encodings exist. But I think it will need some refactoring in the too-many uses of String.{un,}capitalize_ascii that we have today scattered all over the code base.

Also, it can have a significant performance impact: checking whether a file exists with stat() takes much less time than reading the enclosing directory to check that one capitalization/encoding of the file exists and no other capitalization/encoding. We have some mechanisms in place to cache directories, but they are not always used, e.g. for interactive use...

@dbuenzli
Contributor

dbuenzli commented Jan 13, 2023

But I think it will need some refactoring in the too-many uses of String.{un,}capitalize_ascii that we have today scattered all over the code base.

Maybe a preliminary PR should start by removing all of these and replacing them with module_of_filename (already exists in Compenv) and a new function module_to_filename.

That could help clarify the places that look up the file system and how they do it. More precisely, the question is: how often is this not done via the Load_path?

I don't think the normalized matching is going to be a performance hit with the Load_path (currently, cap/uncap lookups are not even cached, see find_uncap). The files field here should be changed to store both filenames and their normalization in a convenient data structure for matching when the directory is loaded. Depending on which data structure we use, e.g. a map, we can easily detect and warn (or abort) at that point if a directory has multiple filenames with the same normalization (representing the same module name).

@Octachron
Member

My plan is indeed to first factorize the module/filename pairing across the compiler in a separate PR to decouple the issue from the question of supporting Unicode.

@vouillon
Member

I was wondering whether it could be simpler not to normalize identifiers. JavaScript identifiers are not normalized, and this seems to work well in practice.

Two IdentifierNames that are canonically equivalent according to the Unicode Standard are not equal unless, after replacement of each UnicodeEscapeSequence, they are represented by the exact same sequence of code points.

But we have an issue on macOS, where checking whether a given module exists using stat would not give the same result as creating modules corresponding to the files in a directory using readdir, since the system does a case-insensitive match in the first case, but we have to choose a normal form (presumably NFC) in the second case.

@gasche
Member

gasche commented Jan 31, 2023

Could we help with the filename issue by saying that we require compilation unit names to be ascii-only? We would still need a bit of work to actually check and forbid other characters, but this could be an easier first step.

@nojb
Contributor

nojb commented Jan 31, 2023

Could we help with the filename issue by saying that we require compilation unit names to be ascii-only? We would still need a bit of work to actually check and forbid other characters, but this could be an easier first step.

It feels a bit weird to make a difference between compilation units and other modules (as far as I know, there is currently no way to differentiate compilation units from other modules at the language level).

@c-cube
Contributor

c-cube commented Jan 31, 2023 via email

@xavierleroy
Contributor Author

I was wondering whether it could be simpler not to normalize identifiers. JavaScript identifiers are not normalized, and this seems to work well in practice.

That's an interesting suggestion, thanks for the JavaScript reference. I see one potential problem with macOS, but it's not the one you mentioned.

Say I have a source file été.ml and another, hiver.ml, that does let x = Été.f 10. Text editors seem to use NFC, while the file name été.ml is in NFD. That's OK as long as we stay on a Mac, since the file system ignores NFC/NFD differences in file names. However, if I make a tarball with these sources and try to use it under Linux, the file name été.ml is still in NFD, while the reference Été.f is in NFC, resulting in an "Unbound module Été" error. Same thing with Zip archives.

Interestingly, git commit on the Mac normalizes the file name été.ml back to NFC. So, if we use Git to share files and produce tarballs, we avoid this issue.

@xavierleroy
Contributor Author

Could we help with the filename issue by saying that we require compilation unit names to be ascii-only? We would still need a bit of work to actually check and forbid other characters, but this could be an easier first step.

It's a bit of a cop-out, but if nothing else works... I don't know which is worse, this suggestion or the JavaScript approach :-)

@dbuenzli
Contributor

dbuenzli commented Feb 5, 2023

Note that there are other languages, like XML, that throw normalization out of the window for identifier matching (open/close tags), but there the matching happens within the same file.

Regarding JavaScript, people likely don't run into problems because most text is input in NFC and saved as such in files, and identifiers are never reified from the file system. When you start doing the latter, as OCaml does, you can't ignore the issue.

@whitequark
Member

Anyone who is interested in more general support for Unicode identifiers than what is provided here could take a look at my ocaml-m17n package.

@xavierleroy
Contributor Author

Anyone who is interested in more general support for Unicode identifiers than what is provided here could take a look at my ocaml-m17n package.

Right. I was aware of -- and quite impressed by -- this experiment of yours. This "modest support" PR was intended as a first incremental step in this direction, playing on the intuition that the ISO-Latin subset of Unicode should be trivial to support.

However, this PR is hitting the "filename normalization" problem hard, and it's no simpler in the ISO-Latin special case than in the general case, so I'm not pushing this PR any longer.

Can you describe at a high level how you went about the filename normalization problem? Given the great many occurrences of String.capitalize and String.uncapitalize in the OCaml code base, how did you find those that correspond to matching file names with compilation unit names, and how did you handle them?

@whitequark
Member

whitequark commented Mar 12, 2023

Right. I was aware of -- and quite impressed by -- this experiment of yours.

Thank you! 😊

This "modest support" PR was intended as a first incremental step in this direction, playing on the intuition that the ISO-Latin subset of Unicode should be trivial to support.

This makes sense.

For context, I made the ocaml-m17n project in response to a rather sharp and angry statement of a linguist I followed on Twitter at the time, who mentioned that she has a very low opinion of OCaml (I'll choose to not quote the inflammatory things she actually said) because it was so Latin-focused, and in particular required a language with an uppercase/lowercase distinction.

I think it was a good use of my time since it seems to have been appreciated both by the OCaml community and in the Unicode committee (I believe Robin Leroy submitted a proposal inspired by my work) but I don't think it made her happy in the end...

Can you describe at a high level how you went about the filename normalization problem? Given the great many occurrences of String.capitalize and String.uncapitalize in the OCaml code base, how did you find those that correspond to matching file names with compilation unit names, and how did you handle those?

I wrote a section in the README about exactly this; I'll condense it to a one paragraph description.

ocaml-m17n works with NFC internally. It expects to be able to pass paths in the NFC form to the filesystem. Since this does not always work, it lists the contents of every directory searched, and issues a diagnostic for every path that looks similar but not equal to the NFC paths it is looking for. Because of the algorithm toNFKC_Casefold used in the "looks similar" comparison, that catches instances of mis-normalization (e.g. someone saving a module with a name in NFD on Linux), of the programmer supplying names with incorrect letter case after the initial letter, and also some obscure lookalikes that are probably not relevant.

@omasanori
Contributor

For reference, the "Normalizing identifiers" section of PEP 672 describes how Python handles Unicode identifiers, including module names.

@Octachron
Member

I had a look at a version of this PR rebased on top of #12389 at https://github.com/octachron/ocaml/tree/unified_file_info+pr11736. Pleasingly, it reduces the fa5514c commit to 90c8948.

There are still some decisions to make for file collisions within the same library (whenever a library ends up with both été.cmi, été.cmi, Été.cmi, ...), but I think it is better to discuss that point independently of this PR.

@xavierleroy
Contributor Author

I'm in awe of the work by @Octachron on refactoring module name <-> file name conversions and on adapting this PR accordingly. As discussed with him, I'm closing this PR so that he can open a new one based on https://github.com/octachron/ocaml/tree/unified_file_info+pr11736. Thanks!


10 participants