Modest support for Unicode letters in identifiers #11736
Conversation
Facilities are provided for NFC normalization, capitalization / uncapitalization, and checking identifier validity. Support is currently limited to Latin-9 letters.
We use the new Misc.UString module to check that the letters are allowed and to normalize their Unicode representation. The lexing of labels was changed so that labels that start with an uppercase are accepted by the lexer regexps, then rejected by the action. This works better with non-ASCII letters and produces a nicer error message.
…names, instead of `String.{un,}capitalize_ascii` as before. This should support compilation units whose module names start with or contain non-ASCII letters.
Both NFC and NFD representations are used in this file (hoping Git does not change this).
In passing, I made a small and, I hope, uncontroversial change to the lexical analysis of labels. Currently, the requirement that a label starts with a lowercase letter is hard-wired in the lexer regexps. Consequently, a capitalized label produces an uninformative error:
For this PR, it's simpler to have regexps that recognize labels with arbitrary case, then check capitalization in the semantic action. This even produces a nicer error message:
To facilitate reviewing, here are some Q&A:
- Q: Why did you put your …
- Q: Why Latin-9 and not OCaml's original Latin-1?
- Q: How much work to get full support for Unicode identifiers à la UAX 31?
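The label-lexing change described above can be sketched in plain OCaml. This is hypothetical illustration code, not the PR's actual lexer action: the regexp accepts labels of either case, and the semantic action then rejects capitalized ones with a targeted message.

```ocaml
(* Sketch of a semantic action validating a label already matched by a
   case-insensitive regexp. [check_label] is a hypothetical name. *)
let check_label (name : string) : (string, string) result =
  if name = "" then Error "empty label"
  else match name.[0] with
    | 'a'..'z' | '_' -> Ok name
    | 'A'..'Z' ->
        (* The informative error the PR description mentions. *)
        Error (Printf.sprintf "label '%s' must not be capitalized" name)
    | _ ->
        Error (Printf.sprintf "label '%s' starts with an invalid character" name)

let () =
  assert (check_label "foo" = Ok "foo");
  assert (check_label "Foo" = Error "label 'Foo' must not be capitalized")
```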
I did the following experiment in the toplevel under both Windows and Linux (WSL):
which matches what I read somewhere: these operating systems store filenames as opaque sequences of code units, without normalization. I think this means that trying to map identifiers to filenames is never going to work 100% of the time. One could do a …
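The "opaque sequences of code units" point can be seen directly at the byte level; a small sketch (assuming only the standard UTF-8 encodings of U+00E9 and of U+0065 followed by U+0301):

```ocaml
(* "é" has two canonically equivalent encodings: NFC (the precomposed
   code point U+00E9) and NFD (base "e" plus combining acute U+0301).
   As byte strings they differ, so a filesystem that compares names as
   opaque code units sees two distinct files. *)
let nfc = "\xC3\xA9"    (* U+00E9, precomposed é *)
let nfd = "e\xCC\x81"   (* U+0065 + U+0301, decomposed é *)

let () =
  assert (nfc <> nfd);            (* distinct byte strings *)
  assert (String.length nfc = 2);
  assert (String.length nfd = 3)
```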
@nojb Pardon my ignorance, but some filesystems already have issues with capitalization, don't they? For example on macOS. So you might have …
I am not sure I follow: I thought (but I don't have much experience with macOS) that the filesystem there was case-insensitive, which means that …
That's my understanding too.
Actually we already do this (readdir + normalization) in some cases: … and the corresponding code for … But you're right that … My hope for Linux and Windows is that text editors and other applications use NFC by default and create files with NFC-normalized names. I'm pretty sure this is the case for Linux. I haven't checked what Windows editors typically do.
I didn't look at the PR yet (will try to have a look on sunday evening) but the discussion is getting a bit confusing. Just to make it clear: you have to normalise both filenames and sources so that everything is in the same form inside the OCaml program when you do your comparisons (also beware that in general normalisation is not closed under concatenation). You just can't trust foreign input to be in a given form, neither pure text files nor filenames returned by system calls give you any guarantee.
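The closure caveat can be illustrated with plain byte strings: both operands below are individually in NFC, but their concatenation is not, because NFC would compose the pair into the single code point U+00E9. A minimal sketch:

```ocaml
(* "e" is NFC; a lone combining acute U+0301 is also NFC (nothing to
   compose with). Concatenating them yields <U+0065,U+0301>, which is
   NOT in NFC: the NFC form of that sequence is the single code point
   U+00E9. So NFC(a) ^ NFC(b) is not necessarily in NFC. *)
let e = "e"                (* NFC *)
let acute = "\xCC\x81"     (* U+0301, NFC on its own *)
let composed = "\xC3\xA9"  (* U+00E9, the NFC form of e + acute *)

let () = assert (e ^ acute <> composed)
```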
I don't think you have a choice but do this. Since we are not dealing with the full range of Unicode I don't think this is really expensive; as @xavierleroy said, the load path already … There's one question left, though: what do you do when you have two filenames that normalize to the same value (like in your example)?
Option 2 looks better to me.
Do ocamllex, Menhir, and ocamlyacc also need to be updated?
Today's XKCD is very relevant to this discussion: https://xkcd.com/2700
You don't need to replace anything, the hover tip reflects well that people still don't get Unicode's encoding space :-) |
So I did a first pass on this. I mainly reviewed the Unicode bits; I didn't have the time to look more precisely at the context in the load path to see if it does what we want (i.e. no normal form assumptions; there's a test though!) and what happens with what @nojb mentioned.

Regarding the general correctness of the transformation on the compiler, it's a bit difficult to assert completeness through what the PR shows, but I guess appropriate `git grep`s were performed.
There are still a few things that I would like to see:
- As it stands the lexer is slightly tweaked to accept more bytes for identifiers. After that best effort decodes are performed with U+FFFD replacements during normalisation. This means that compilation succeeds on malformed UTF-8 character streams. I don't think that's a very good idea. I would really like the OCaml compiler to assert the character stream as being UTF-8 encoded and bail out with a character stream error if it's not. This will also ensure that my (byte-escape free) string literals are valid UTF-8 which is good to have for the language users.
- Update to the manual of course!
But other than that I really like what I'm seeing here, especially proposing latin-9 characters is a nice touch which brings in a few more million people for this first step.
P.S. For other people reviewing, this latin-9 to unicode mapping can be a useful reference.
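For the first point above, a sketch of strict UTF-8 validation using the standard library's decoder (assuming OCaml >= 4.14; this is illustrative, not the PR's code):

```ocaml
(* Decode the whole string and report the byte offset of the first
   malformed sequence instead of silently substituting U+FFFD. *)
let assert_utf_8 (s : string) : (unit, int) result =
  let rec go i =
    if i >= String.length s then Ok ()
    else
      let d = String.get_utf_8_uchar s i in
      if Uchar.utf_decode_is_valid d
      then go (i + Uchar.utf_decode_length d)
      else Error i  (* offset where the character stream error occurs *)
  in
  go 0

let () =
  assert (assert_utf_8 "caf\xC3\xA9" = Ok ());  (* valid UTF-8 "café" *)
  assert (assert_utf_8 "caf\xE9" = Error 3)     (* lone Latin-1 0xE9 byte *)
```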
(* NFD representation *)
let f = function
  | Æsop -> 1 | Âcre -> 2 | Ça -> 3 | Élégant -> 4 | Öst -> 5 | Œuvre -> 6
Here are a few tests that could be added:
let () = assert (f Élégant (* NFC encoded *) = 4)
let () =
let called = ref false in
let élégant (* NFC encoded *) () = called := true in
élégant (* NFD encoded *) (); assert (!called)
(* The following two defs should error with 'Multiple definition…' *)
module Élégant (* NFC encoded *) = struct end
module Élégant (* NFD encoded *) = struct end
@@ -232,6 +232,206 @@ module Stdlib = struct
external compare : 'a -> 'a -> int = "%compare"
end
(** {1 Minimal support for Unicode characters in identifiers} *)

module UString = struct
I won't fight since this is internal to your project but personally, given `Uchar`, I would call that `Ustring`.
(0xdd, 0xfd); (* Ý, ý *) (0xde, 0xfe); (* Þ, þ *)
(0x160, 0x161); (* Š, š *) (0x17d, 0x17e); (* Ž, ž *)
(0x152, 0x153); (* Œ, œ *) (0x178, 0xff); (* Ÿ, ÿ *)
(0x1e9e, 0xdf); (* ẞ, ß *)
I crossed checked this data with Unicode's default case mapping via:
let check_case (upper, lower) =
try
let upper = Uchar.of_int upper and lower = Uchar.of_int lower in
assert (Uucp.Case.Map.to_lower upper = `Uchars [lower]);
assert (Uucp.Case.Map.to_upper lower = `Uchars [upper])
with
| Assert_failure _ ->
Printf.printf "Invalid pair U+%04X, U+%04X\n%!" upper lower
This just fails on the `(0x1e9e, 0xdf); (* ẞ, ß *)` case: `ẞ` is mapped to `ß`, but `ß` is mapped to `SS`:
# Uucp.Case.Map.to_lower (Uchar.of_int 0x1E9E);;
- : [ `Self | `Uchars of Uchar.t list ] = `Uchars [U+00DF]
# Uucp.Case.Map.to_upper (Uchar.of_int 0x00DF);;
- : [ `Self | `Uchars of Uchar.t list ] = `Uchars [U+0053; U+0053]
The reason why this is the case (yes) is detailed in this Unicode case mapping FAQ entry. I'll let you choose, or ask a German friend, whether it's a good idea to support this mapping.
('u', 0x302, 0xfb); (* û *) ('u', 0x308, 0xfc); (* ü *)
('y', 0x301, 0xfd); (* ý *) ('y', 0x308, 0xff); (* ÿ *)
('s', 0x30c, 0x161); (* š *) ('z', 0x30c, 0x17e); (* ž *)
]
I checked this data for correctness with:
let check_comp (base, comb, comp) =
  try
    let comp = Uchar.of_int comp and base = Char.code base in
    assert (Uunf.decomp comp = [| base; comb |])
  with
  | Assert_failure _ ->
      Printf.printf "Invalid composition: U+%04X U+%04X <> U+%04X"
        (Char.code base) comb comp
let uchar_is_uppercase u =
  let c = Uchar.to_int u in
  if c < 0x80 then c >= 65 && c <= 90 else
When dealing with characters as ints, in general I find it easier if:
- everything is hex (because it's easier to match with U+XXXX specs);
- when the chars are printable, they appear as a comment following the number (like you did in `uchar_valid_in_identifier`).
| Some u' ->
    norm u' i'
| None ->
    Buffer.add_utf_8_uchar buf (transform prev);
Note that in general character case mappings do not preserve normalization forms.
Here's a simple example: the sequence ß + ◌̌ (`<U+00DF,U+030C>`) is NFC. But if you apply Unicode's default upper case mapping you get (as we have seen above) S + S + ◌̌ (`<U+0053,U+0053,U+030C>`), which is NFD; NFC would be S + Š (`<U+0053,U+0160>`).
I convinced myself that on the restricted subset we are dealing with, and given how the function is used below, the results are correct. But the function gets more general, and correctness is easier to assert, if you apply `transform` first and do the composition afterwards. Don't make me think!
@@ -742,6 +742,58 @@ module Magic_number : sig
val all_kinds : kind list
end

(** {1 Minimal support for Unicode characters in identifiers} *)

(** Characters allowed in identifiers are, currently:
Characters allowed in {{!Ustring.normalize}normalized} identifiers.
(** Characters allowed in identifiers are, currently:
    - ASCII letters A-Z a-z
    - Latin-1 letters (U+00C0 - U+00FF except U+00D7 and U+00F7)
Latin-9
(* Characters allowed in identifiers. Currently:
   - ASCII letters, underscore
   - Latin-1 letters, represented in NFC
Latin-9
let identchar_latin1 =
  ['A'-'Z' 'a'-'z' '_' '\192'-'\214' '\216'-'\246' '\248'-'\255' '\'' '0'-'9']
(* This should be kept in sync with the [is_identchar] function in [env.ml] *)
let utf8 = ['\192'-'\255'] ['\128'-'\191']*
That's a very rough notion of UTF-8 :-)
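To see just how rough, here is the lexer pattern `['\192'-'\255'] ['\128'-'\191']*` transcribed as a plain OCaml predicate; it happily accepts byte sequences that are not valid UTF-8 (overlong forms, truncated sequences):

```ocaml
(* Direct transcription of the regexp: a lead byte in 0xC0-0xFF
   followed by any number of bytes in 0x80-0xBF. This is a superset
   of well-formed UTF-8. *)
let matches_lexer_utf8 s =
  String.length s > 0
  && Char.code s.[0] >= 0xC0
  && (try
        String.iteri
          (fun i c ->
             if i > 0 && not (0x80 <= Char.code c && Char.code c <= 0xBF)
             then raise Exit)
          s;
        true
      with Exit -> false)

let () =
  assert (matches_lexer_utf8 "\xC3\xA9");  (* "é": genuine UTF-8 *)
  assert (matches_lexer_utf8 "\xC0\x80");  (* overlong encoding: accepted *)
  assert (matches_lexer_utf8 "\xC3")       (* truncated sequence: accepted *)
```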
Also `lex/lexer.mll` and `yacc/reader.c` need updates.
We've had excellent feedback from @dbuenzli, but that's it. @xavierleroy has not addressed the review comments yet; he may be planning to wait for more feedback before pouring more work into the PR. I would like to see this question making progress, and overall I like what I read from Xavier and Daniel. I am approving the principle of this change. But I don't have time to spend on it myself. So here is a proposal: @dbuenzli, if you are willing to complete the review process (if/when @xavierleroy is available to iterate based on your feedback), I will be happy to green-approve on behalf of your gray-approval, if any. (We will probably need a broader expression of support for this non-small change to the lexical conventions of the language; I can take care of drumming up support.)
I'm pretty happy with the handling of Unicode identifiers proposed in this PR, but the mapping between module identifiers and file names gives me pause. There are just too many places (in the current code base) where we call … As long as we don't understand what's going on already with ASCII capitalization, there's no hope I can address @dbuenzli's questions ("what do you do when you have two filenames that normalize to the same value?") or admonitions ("you just can't trust foreign input to be in a given form, neither pure text files nor filenames returned by system calls give you any guarantee"). So, I'm afraid I'll have to close this PR, as I don't have the time and energy required to first clean up what the current code base is doing w.r.t. capitalization of file names.
Maybe we could:
I haven't done (1), but here is a strawman for (2):
I'm not sure this will help. Empirically, projects that use lowercase file names only seem to work well, and projects that use properly capitalized file names seem to work too, but are less common, so perhaps they don't exercise all the cases. In other words, I would expect projects using capitalized filenames to be as problematic or even more problematic than those using lowercase filenames.
That would be a step forward, and something we could piggy-back on to also fail when equivalent NFC and NFD encodings exist. But I think it will need some refactoring in the too-many uses of … Also, it can have a significant performance impact: checking whether a file exists with …
Maybe a preliminary PR should start by removing all of these and replacing them with … That could help clarify the places that are going to look up the file system and how they are going to do it. More precisely the question is: how often is this not done via the … I don't think the normalized matching is going to be a performance hit with the …
My plan is indeed to first factorize the module/filename pairing across the compiler in a separate PR, to decouple the issue from the question of supporting Unicode.
I was wondering whether it could be simpler not to normalize identifiers. JavaScript identifiers are not normalized, and this seems to work well in practice.
But we have an issue on macOS, where checking whether a given module exists using …
Could we help with the filename issue by saying that we require compilation unit names to be ASCII-only? We would still need a bit of work to actually check and forbid other characters, but this could be an easier first step.
It feels a bit weird to make a difference between compilation units and other modules (as far as I know, currently there is no way to differentiate compilation units from other modules at the language level).
May I say that this provides a wonderful solution to the problem of
dependency resolution in presence of `open`: always use at least one
non-ascii char in names of submodules. This way if you see unicode it
can't be a toplevel file name! ;-)
That's an interesting suggestion, thanks for the JavaScript reference. I see one potential problem with macOS, but it's not the one you mentioned. Say I have a source file … Interestingly, …
It's a bit of a cop-out, but if nothing else works... I don't know which is worse, this suggestion or the JavaScript approach :-)
Note that there are other languages, like XML, that throw normalization out of the window for identifier matching (open/close tags), but these things happen in the same file. Regarding JavaScript, people likely don't run into problems because most text is input in NFC, is saved as such in files, and identifiers are never reified from the file system. When you start doing the latter, like OCaml does, you can't ignore the issue.
Anyone who is interested in more general support for Unicode identifiers than what is provided here could take a look at my ocaml-m17n package.
Right. I was aware of -- and quite impressed by -- this experiment of yours. This "modest support" PR was intended as a first incremental step in this direction, playing on the intuition that the ISO-Latin subset of Unicode should be trivial to support. However, this PR is hitting the "filename normalization" problem hard, and it's no simpler in the ISO-Latin special case than in the general case, so I'm not pushing this PR any longer. Can you describe at a high level how you went about the filename normalization problem? Given the great many occurrences of …
Thank you! 😊
This makes sense. For context, I made the ocaml-m17n project in response to a rather sharp and angry statement of a linguist I followed on Twitter at the time, who mentioned that she has a very low opinion of OCaml (I'll choose to not quote the inflammatory things she actually said) because it was so Latin-focused, and in particular required a language with an uppercase/lowercase distinction. I think it was a good use of my time since it seems to have been appreciated both by the OCaml community and in the Unicode committee (I believe Robin Leroy submitted a proposal inspired by my work) but I don't think it made her happy in the end...
I wrote a section in the README about exactly this; I'll condense it to a one paragraph description. ocaml-m17n works with NFC internally. It expects to be able to pass paths in the NFC form to the filesystem. Since this does not always work, it lists the contents of every directory searched, and issues a diagnostic for every path that looks similar but not equal to the NFC paths it is looking for. Because of the algorithm toNFKC_Casefold used in the "looks similar" comparison, that catches instances of mis-normalization (e.g. someone saving a module with a name in NFD on Linux), of the programmer supplying names with incorrect letter case after the initial letter, and also some obscure lookalikes that are probably not relevant. |
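The lookup scheme just described can be sketched as a pure function over a directory listing. Here `String.lowercase_ascii` stands in for toNFKC_Casefold (which would come from a Unicode library such as uucp), so only the shape of the algorithm is shown, not the real similarity test:

```ocaml
(* Look for [target] among [entries] (as returned by a readdir): exact
   match wins; otherwise entries that fold to the same key are
   candidates for a "did you mean / mis-normalized name" diagnostic. *)
type lookup = Found of string | Similar of string list | Absent

let lookup ~fold target entries =
  if List.mem target entries then Found target
  else
    match List.filter (fun e -> fold e = fold target) entries with
    | [] -> Absent
    | sims -> Similar sims

let () =
  let entries = ["Foo.ml"; "bar.ml"] in
  let fold = String.lowercase_ascii in
  assert (lookup ~fold "bar.ml" entries = Found "bar.ml");
  assert (lookup ~fold "foo.ml" entries = Similar ["Foo.ml"]);
  assert (lookup ~fold "baz.ml" entries = Absent)
```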
For reference, the "Normalizing identifiers" section of PEP 672 describes how Python handles Unicode identifiers, including module names.
I had a look at a version of this PR rebased on top of #12389 at https://github.com/octachron/ocaml/tree/unified_file_info+pr11736. Pleasingly, it reduces the fa5514c commit to 90c8948. There are still some decisions to make for file collisions within the same library (whenever a library ends up with both été.cmi, été.cmi, Été.cmi, ...), but I think it is better to discuss that point independently of this PR.
I'm in awe of the work by @Octachron on refactoring module name <-> file name conversions and on adapting this PR accordingly. As discussed with him, I'm closing this PR so that he can open a new one based on https://github.com/octachron/ocaml/tree/unified_file_info+pr11736. Thanks!
This is a continuation of #11717. The context is the same:
As a leftover from the 1990's, OCaml currently recognizes accented letters in identifiers provided they are encoded with the now-defunct Latin-1 character set (ISO 8859-1). There's considerable pressure to get rid of this special case and accept ASCII identifiers only + arbitrary UTF-8 in strings and comments, see #1802 for instance. However, I still like my accented letters in identifiers, because they work beautifully for textbooks written in Western languages other than English.
The present PR is my second and more general attempt at supporting UTF-8 encoded accented letters (and possibly more generally letters beyond ASCII letters) in identifiers.
A new module `UString` is added to `Misc` (sigh...) to provide the basic operations over these identifiers: … module names `Foo` and file names `foo.cmi`, `Foo.ml`, etc.

The lexer catches the UTF-8 form of these identifiers with a rather broad regexp, then uses the facilities from `UString` to check that they are supported and classify them into "capitalized" or "not capitalized".

Everywhere `String.capitalize_ascii` and `String.uncapitalize_ascii` were used to map between module names and file names, I used `Misc.Ustring.capitalize` and `Misc.Ustring.uncapitalize` instead. These functions perform NFC normalization of Latin-9 letters (other Unicode characters are left as is) and case change of the first character if it is an ASCII or Latin-9 letter. I hope this is enough to correctly map between accented module names like `Éléphant` and file names like `éléphant.cmi`, both on macOS (which stores file names in NFD but matches them up to NFC/NFD conversion) and on Linux and Windows (which seem to use mostly or even exclusively NFC for file names). This is being lightly tested in a new test added to the test suite.

Constructive criticism welcome!