As I already made clear in previous discussions on the caml-list,
I find that OCaml's current support for Unicode is outstanding
(both literally and figuratively).
I don't think introducing a Unicode string data structure and
a corresponding syntax for literals would be a good thing to
do. If one wanted to do that in a correct and useful way, it
would entail importing a good deal of the Unicode processing machinery
(e.g. normalization) into the compiler, and I really think it's better to
leave that outside the compiler. Unicode processing can perfectly well be
left to a set of modularized, external libraries. I also think it's
actually a good idea to proceed that way, as libraries are in a better
position to evolve with the standard (e.g. newly encoded characters in
Unicode standard updates may imply changes to normalization results,
which would otherwise entail updates to the compiler).
There is however one thing that I really find missing to get utterly
excellent Unicode support in OCaml: an abstract datatype, in the
standard library, to represent a Unicode scalar value (by abusing
terminology: a Unicode character). A Unicode scalar value is
simply an integer in the ranges 0x0000…0xD7FF or 0xE000…0x10FFFF.
Such a data type would allow independent libraries dealing with
Unicode characters (e.g. ulex, camomile, uutf, uunf, uucp, uucd) to interchange data
without relying on ints, and as such strengthen the abstractions and
guarantees a bit: no more documentation warnings that the given ints
need to be in the above range, no needless (re)checks when
data flows among modules. In short, the basic advantages
of data abstraction.
This proposal simply adds such a minimal data type along with a few
functions which by themselves don't do much except integrating with
the standard library; doing real Unicode processing is left to
external libraries, as it should be.
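
To make the idea concrete, here is a minimal sketch of what such an abstract type could look like. The module name `Uchar_sketch` and its function names are hypothetical and chosen for illustration only; they are not the interface put forward by this proposal. The only fact taken from the text above is the pair of valid ranges.

```ocaml
(* Hypothetical sketch: names and signatures are illustrative assumptions,
   not the actual interface of the proposal. *)
module Uchar_sketch : sig
  type t
  (* The type is abstract: a value of type [t] is always a scalar value. *)

  val is_valid : int -> bool
  val of_int : int -> t option
  val to_int : t -> int
  val equal : t -> t -> bool
  val compare : t -> t -> int
end = struct
  type t = int

  (* Scalar values are 0x0000…0xD7FF and 0xE000…0x10FFFF;
     the gap in between is the surrogate range. *)
  let is_valid i =
    (0x0000 <= i && i <= 0xD7FF) || (0xE000 <= i && i <= 0x10FFFF)

  (* The range check happens once, at the abstraction boundary. *)
  let of_int i = if is_valid i then Some i else None
  let to_int u = u
  let equal : int -> int -> bool = ( = )
  let compare : int -> int -> int = compare
end

let () =
  match Uchar_sketch.of_int 0x1F42B with
  | Some u -> Printf.printf "U+%04X\n" (Uchar_sketch.to_int u)
  | None -> print_endline "not a Unicode scalar value"
```

Since `t` is abstract, a library receiving a `Uchar_sketch.t` can rely on the range invariant without rechecking it, which is the point made above about data flowing among modules.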
One question is whether a Pervasives.uchar type equal to Uchar.t
should be introduced (not part of this proposal). I don't think it's
essential, though it could be a nice touch.
A question for Daniel: would you mind having to spell your name in pure ASCII? As part of the (slow) transition away from Latin-1, I'm trying to get all the source code of the system in pure ASCII, even in the comments.