add a UChar module to the standard library #6525

vicuna opened this issue Aug 28, 2014 · 3 comments


@vicuna vicuna commented Aug 28, 2014

Original bug ID: 6525
Reporter: @gasche
Status: closed (set by @alainfrisch on 2016-01-27T08:22:38Z)
Resolution: fixed
Priority: normal
Severity: feature
Target version: 4.03.0+dev / +beta1
Category: standard library
Tags: github, patch
Monitored by: @dbuenzli

Bug description

Github Pull Request by Daniel Bünzli:

As I already made clear in previous discussions on the caml-list,
I find that OCaml's current support for Unicode is outstanding
(literally as well as figuratively).

I don't think introducing a Unicode string data structure and
a corresponding syntax for literals would be a good thing to
do, since, if one wanted to do that in a correct and useful way, it
would entail importing a good deal of the Unicode processing machinery
(e.g. normalization) in the compiler and I really think it's better to
leave that outside the compiler. Unicode processing can perfectly be
left to a set of modularized, external libraries. I also think it's
actually a good idea to proceed that way as libraries are in a better
position to evolve with the standard (e.g. newly encoded characters on
Unicode standard updates may imply changes to normalisation results
and would entail updates to the compiler).

There is however one thing that I find really missing from
Unicode support in OCaml: an abstract datatype, in the standard
library, to represent a Unicode scalar value (by abusing
terminology: a Unicode character). A Unicode scalar value is
simply an integer in the ranges 0x0000…0xD7FF or 0xE000…0x10FFFF.
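
To make the range concrete, here is a minimal sketch of the validity check described above. The function name `is_scalar_value` is hypothetical and chosen for illustration; it is not taken from the pull request.

```ocaml
(* Hypothetical helper: true iff [i] lies in one of the two ranges
   given above, 0x0000…0xD7FF or 0xE000…0x10FFFF. The gap between
   them (0xD800…0xDFFF) is the surrogate range, which is excluded. *)
let is_scalar_value (i : int) : bool =
  (0x0000 <= i && i <= 0xD7FF) || (0xE000 <= i && i <= 0x10FFFF)
```

Any integer outside both ranges, including negative values and anything above 0x10FFFF, is rejected.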

Such a data type would allow independent libraries dealing with
Unicode characters (e.g. ulex, uucd) to interchange data
without relying on ints, and as such strengthen the abstractions and
guarantees a bit: it avoids documentation warnings that the given
ints need to be in the above range and avoids needless (re)checks when
data flows among modules. Well, you get the idea: the basic advantages
of data abstraction.

This proposal simply adds such a minimal data type along with a few
functions which by themselves don't do much except integrating with
the standard library; doing real Unicode processing is left to
external libraries, as it should be.
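
As a rough illustration of what such a minimal abstract type could look like, here is a sketch under stated assumptions. The module name `Scalar` and every function in its signature are hypothetical; this is not the interface proposed in the pull request, only an example of the data-abstraction idea.

```ocaml
(* Hypothetical module, for illustration only. The type [t] is abstract,
   so a value of type [Scalar.t] can only be built through [of_int],
   which performs the range check once. *)
module Scalar : sig
  type t
  val is_valid : int -> bool
  val of_int : int -> t option
  val to_int : t -> int
  val equal : t -> t -> bool
  val compare : t -> t -> int
end = struct
  type t = int

  (* The two ranges of Unicode scalar values. *)
  let is_valid i =
    (0x0000 <= i && i <= 0xD7FF) || (0xE000 <= i && i <= 0x10FFFF)

  (* Returns [None] for integers that are not scalar values. *)
  let of_int i = if is_valid i then Some i else None

  let to_int u = u
  let equal (u : int) (v : int) = u = v
  let compare (u : int) (v : int) = compare u v
end

(* A consumer receiving a [Scalar.t] needs no further range check. *)
let describe (u : Scalar.t) : string =
  Printf.sprintf "U+%04X" (Scalar.to_int u)
```

Because the representation is hidden, a library that accepts a `Scalar.t` can rely on the range invariant instead of documenting it and rechecking it at every module boundary.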

One question is whether a Pervasives.uchar type equal to Uchar.t
should be introduced (not part of this proposal). I don't think it's
essential, it could be a nice touch though.


@vicuna vicuna commented Nov 24, 2014

Comment author: @damiendoligez

A question for Daniel: would you mind having to spell your name in pure ASCII? As part of the (slow) transition away from Latin-1, I'm trying to get all the source code of the system in pure ASCII, even in the comments.

@vicuna vicuna commented Dec 6, 2014

Comment author: @dbuenzli

Sorry forgot to monitor and missed your request. Done in the PR.

@vicuna vicuna commented Jan 27, 2016

Comment author: @alainfrisch

The Github PR has been merged.

@vicuna vicuna closed this as completed Jan 27, 2016
@vicuna vicuna added the stdlib label Mar 14, 2019
@vicuna vicuna added this to the 4.03.0 milestone Mar 14, 2019