Add Uchar module to the standard library. #80

Merged
merged 1 commit into from Jan 6, 2016

Contributor

dbuenzli commented Jul 9, 2014

As I already made clear in previous discussions on the caml-list, I find that OCaml's current support for Unicode is outstanding (both literally and figuratively).

I don't think introducing a Unicode string data structure and a corresponding syntax for literals would be a good thing to do. If one wanted to do that in a correct and useful way, it would entail importing a good deal of the Unicode processing machinery (e.g. normalization) into the compiler, and I really think it's better to leave that outside the compiler. Unicode processing can perfectly well be left to a set of modularized, external libraries. I also think it's actually a good idea to proceed that way, as libraries are in a better position to evolve with the standard (e.g. newly encoded characters on Unicode standard updates may imply changes to normalisation results, which would otherwise entail updates to the compiler).

There is however one thing that I really find missing to get utterly excellent Unicode support in OCaml: an abstract datatype, in the standard library, to represent a Unicode scalar value (by abusing terminology: a Unicode character). A Unicode scalar value is simply an integer in the ranges 0x0000…0xD7FF or 0xE000…0x10FFFF.

Such a data type would allow independent libraries dealing with Unicode characters (e.g. ulex, camomile, uutf, uunf, uucp, uucd) to interchange data without relying on ints, and as such strengthen the abstractions and guarantees a bit: no more documentation warnings that the given ints need to be in the above range, no needless (re)checks when data flows among modules; well, you get the idea, the basic advantages of data abstraction...

This proposal simply adds such a minimal data type along with a few functions which by themselves don't do much except integrating with the standard library; doing real Unicode processing is left to external libraries, as it should be.
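
As a rough illustration of what such a minimal data type could look like, here is a sketch. The module name `Scalar` and the exact set of functions are assumptions made for this example; they are not the interface proposed in the patch.

```ocaml
(* Hypothetical sketch: an abstract type for Unicode scalar values.
   [Scalar] is an illustrative name, not the module added by this PR. *)
module Scalar : sig
  type t
  val is_valid : int -> bool
  val of_int : int -> t        (* raises Invalid_argument if out of range *)
  val to_int : t -> int
  val equal : t -> t -> bool
  val compare : t -> t -> int
end = struct
  type t = int

  (* Scalar values are 0x0000…0xD7FF and 0xE000…0x10FFFF. *)
  let is_valid i =
    (0x0000 <= i && i <= 0xD7FF) || (0xE000 <= i && i <= 0x10FFFF)

  let of_int i =
    if is_valid i then i
    else invalid_arg (Printf.sprintf "U+%04X is not a Unicode scalar value" i)

  let to_int u = u
  let equal : int -> int -> bool = ( = )
  let compare : int -> int -> int = compare
end
```

Because `Scalar.t` is abstract, the range check in `of_int` is the only way in, so any function receiving a `Scalar.t` can rely on the invariant without rechecking it.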

One question is whether a Pervasives.uchar type equal to Uchar.t should be introduced (not part of this proposal). I don't think it's essential, it could be a nice touch though.

stdlib/uchar.mli
+
+ @since 4.03 *)
+
+type t

@Chris00

Chris00 Jul 10, 2014

Member

Should it be abstract or a private int?

@dbuenzli

dbuenzli Jul 10, 2014

Contributor

Good question. I personally have no problem in doing

match Uchar.to_int u with 
| 0x000A -> ...
... 

I don't know what the dev team's stance is on using so-called "Language extensions" in the stdlib.

@diml

diml Jul 10, 2014

Member

You would still need to write Uchar.to_int u or write a coercion if t was defined as private int. Having t = private int is more an optimization: if the compiler knows that Uchar.t is always represented by an immediate value, the code generator can skip calls to caml_modify and/or float array checks.
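
To make the trade-off concrete, here is a small sketch of the `private int` variant. The module name `U` and the helper `is_line_feed` are hypothetical and only serve the example.

```ocaml
(* Hypothetical sketch: the same kind of type exposed as [private int]. *)
module U : sig
  type t = private int
  val of_int : int -> t
end = struct
  type t = int
  let of_int i =
    if (0 <= i && i <= 0xD7FF) || (0xE000 <= i && i <= 0x10FFFF) then i
    else invalid_arg "not a Unicode scalar value"
end

(* Clients can read the integer through an explicit coercion... *)
let is_line_feed (u : U.t) =
  match (u :> int) with
  | 0x000A -> true
  | _ -> false

(* ...but cannot forge a value: [(0xD800 : U.t)] and [(0xD800 :> U.t)]
   are both rejected by the type checker. *)
```

The coercion `(u :> int)` plays the role of `to_int`, so the pattern match still needs an explicit conversion step, as noted above.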

stdlib/uchar.mli
+ val unsafe_of_int : int -> t
+(**/**)
+
+val to_int : t -> int

@Chris00

Chris00 Jul 10, 2014

Member

Shouldn't this be external for efficiency (if private int is not used)?

@dbuenzli

dbuenzli Jul 10, 2014

Contributor

Isn't it sufficient to have it as external in the .ml file? The .cmx files should take care of the inlining business, no?

@Chris00

Chris00 Jul 10, 2014

Member

On Thu, 10 Jul 2014 02:47:35 -0700, Daniel Bünzli wrote:

> +val to_int : t -> int
>
> Isn't it sufficient to have it as external in the .ml file? The .cmx files
> should take care of the inlining business, no?

I do not think it is sufficient.

@yallop

yallop Jul 10, 2014

Member

It seems that inlining works fine with external in the .ml file and val in the .mli file. Here's a quick example. Note the absence of calls to to_int in the generated code.

$ opam switch install 4.03.0+pr80
[...]
$ eval `opam config env`
$ cat test.ml
let f x y = Uchar.(to_int x + to_int y)
$ ocamlopt -dclambda test.ml
(seq
  (let
    (f/1008
       (closure  (fun camlTest__f_1008 2  x/1009 y/1010 (+ x/1009 y/1010)) ))
    (setfield_imm 0 (global camlTest!) f/1008))
  0a)
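
For readers who want to reproduce the arrangement without the patched compiler, it can be approximated in a single file as sketched below. The module name `M` is hypothetical; `"%identity"` is the compiler's identity primitive. Whether the conversion is actually compiled away across separate compilation units depends on the .cmx information being available, which is what the `-dclambda` output above checks.

```ocaml
(* Hypothetical sketch of the arrangement discussed above: the signature
   (the .mli) declares a plain [val], while the structure (the .ml)
   defines the function as the compiler primitive "%identity". *)
module M : sig
  type t
  val of_int : int -> t
  val to_int : t -> int
end = struct
  type t = int
  let of_int i = i   (* range check omitted to keep the sketch short *)
  external to_int : t -> int = "%identity"
end

let f x y = M.to_int x + M.to_int y
```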

@Chris00

Chris00 Jul 11, 2014

Member

On Thu, 10 Jul 2014 03:24:22 -0700, yallop wrote:

> It seems that inlining works fine with external in the .ml file and val in the
> .mli file. [...]

Indeed. Can it be confirmed that it works in all cases? Is it a recent improvement? Besides my remembering otherwise (which does not have much value), the interfaces of the stdlib use "external" in the .mli; why, if it is not needed?

stdlib/uchar.mli
+val compare : t -> t -> int
+(** [compare u u'] is [Pervasives.compare u u']. *)
+
+val pp : Format.formatter -> t -> unit

@Chris00

Chris00 Jul 10, 2014

Member

Isn't a way to print scalar values (as the character they represent) missing?

@dbuenzli

dbuenzli Jul 10, 2014

Contributor

That would entail getting into the encoding business, which I specifically want to keep out of the system.

@c-cube

c-cube Jul 10, 2014

Contributor

I like this idea of only adding standard types in the compiler library. It makes interoperability much easier and still doesn't require Inria people to support and maintain such complicated things as comprehensive unicode support... I don't see any drawback to this PR.

@whitequark

whitequark Jul 10, 2014

Contributor

I think this is an excellent idea!

stdlib/uchar.mli
+(** [equal u u'] is [u = u']. *)
+
+val compare : t -> t -> int
+(** [compare u u'] is [Pervasives.compare u u']. *)

@bobot

bobot Jul 13, 2014

Contributor

Could you add a hash function? It would just be an alias for to_int, but it is useful when applying Hashtbl.Make.

@dbuenzli

dbuenzli Jul 13, 2014

Contributor

Right, added a hash function.
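
The use case can be sketched as follows. The example spells out the functor argument using only `Uchar.equal`, `Uchar.to_int` and `Uchar.of_int`, assuming the `Uchar` module as shipped in OCaml 4.03 or later; the names `Utbl` and `names` are made up for the illustration.

```ocaml
(* Sketch: a hash table keyed by Unicode scalar values. It relies only on
   [Uchar.equal], [Uchar.to_int] and [Uchar.of_int] from the standard
   library (OCaml 4.03 or later is assumed). *)
module Utbl = Hashtbl.Make (struct
  type t = Uchar.t
  let equal = Uchar.equal
  let hash = Uchar.to_int   (* the "alias for to_int" suggested above *)
end)

let names : string Utbl.t = Utbl.create 16

let () =
  Utbl.replace names (Uchar.of_int 0x00E9) "LATIN SMALL LETTER E WITH ACUTE";
  Utbl.replace names (Uchar.of_int 0x03A9) "GREEK CAPITAL LETTER OMEGA"
```

With a `hash` function exported by the module itself, the anonymous structure becomes unnecessary and the module can be passed to `Hashtbl.Make` directly.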

stdlib/uchar.ml
+let compare : int -> int -> int = Pervasives.compare
+let hash = to_int
+
+let pp ppf u = Format.fprintf ppf "U+%04X" u

@bobot

bobot Jul 14, 2014

Contributor

Do you think it would be useful to add a pp_to_string defined by Printf.sprintf "U+%04X" u, just for the few places where one doesn't use a formatter or uses another formatting library?

@dbuenzli

dbuenzli Jul 14, 2014

Contributor

Nowadays we have Format.asprintf so that's just a Format.asprintf "%a" Uchar.pp away.
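
A minimal sketch of that idiom follows. The printer is defined locally (mirroring the `pp` of the patch) so that the example does not depend on which printing functions a given standard library version exports; `to_string` is a name chosen for this example.

```ocaml
(* Sketch: a locally defined printer equivalent to the [pp] in the patch.
   It is defined here so that the example does not depend on which
   functions a given standard library version exports. *)
let pp ppf u = Format.fprintf ppf "U+%04X" (Uchar.to_int u)

(* [Format.asprintf] turns any formatter-based printer into a string. *)
let to_string u = Format.asprintf "%a" pp u

let () = print_endline (to_string (Uchar.of_int 0x00E9))  (* prints U+00E9 *)
```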

@bobot

bobot Jul 14, 2014

Contributor

I agree it is a nice idea to add the abstract datatype in the standard library, and only that. What is the opinion of other unicode ocaml library makers? @yoriyuki @alainfrisch

@yoriyuki

yoriyuki Jul 14, 2014

I do not see the point of adding a Uchar module without a standard Unicode string data type and literals. They are needed for precisely the same reason as Uchar: interoperability between Unicode processing libraries. We do not need normalization etc. inside the stdlib.

That said, adding Uchar is a good step toward more satisfactory Unicode support in OCaml. I have only minor comments.

  • Code points like 0xFFFF are also noncharacters. Should we raise an error for them or not?
  • Should we mark the functions that raise an exception with a suffix, say _exn? (I know it is a controversial point.)

@dbuenzli

dbuenzli Jul 14, 2014

Contributor

On Monday, 14 July 2014 at 12:57, Yoriyuki Yamagata wrote:

> I do not see the point of adding a Uchar module without a standard Unicode string data type and literals. They are needed for precisely the same reason as Uchar: interoperability between Unicode processing libraries. We do not need normalization etc. inside the stdlib.

I disagree with that. If you introduce a Unicode string data type and literals, then you most likely also want pattern matching on them. And if you want pattern matching on them, you need to take normalization into account; in particular you want to be able to specify in which normalisation form your literal is supposed to be, otherwise it is useless, deceiving and could even be the source of a new class of potential security bugs. Formal Unicode string literals without normalisation would be irresponsible IMHO.

It is currently perfectly possible to write unnormalized UTF-8 literals in OCaml, which is entirely sufficient for many programs out there, and these are a function away from being translated into the representation of your particular library, at a negligible initial runtime cost. Introducing the Uchar module greatly enhances the possibility of modular implementations of Unicode and allows, for example, ulex to talk to uunf with strong invariants guaranteed by the abstraction.

> Code points like 0xFFFF are also noncharacters. Should we raise an error for them or not?

I would say no, for the following reasons (references are to the PDF of Unicode 6.2):

  1. Applications are allowed to use non-characters internally (D12 p. 68, Coded character sequence, bullets 2 and 3). Also, on p. 24 we have:

"Noncharacter code points are reserved for internal use, such as for sentinel values. They should never be interchanged. They do, however, have well-formed representations in Unicode encoding forms and survive conversions between encoding forms. This allows sentinel values to be preserved internally across Unicode encoding forms, even though they are not designed to be used in open interchange."

  2. Applications should not interchange (serialize to UTF-X) non-characters (D14 p. 68), but stricto sensu these code points may occur in interchange as they do not produce invalid sequences of bytes: UTF-X are explicitly defined as a map from scalar values to byte code units (see 3.9 p. 89, D79 p. 90), and non-characters are part of the scalar values. More specifically, on p. 560 we have:

"Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD replacement character, to indicate the problem in the text. It is not recommended to simply delete noncharacter code points from such text, because of the potential security issues caused by deleting uninterpreted characters."

As such it's better if we have a way to represent these characters since UTF-X decoders can then pass them to the application which is then free to take the appropriate context dependent action.
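
To illustrate, the sketch below accepts noncharacters as ordinary scalar values and leaves the policy to the application. The helpers `is_noncharacter` and `sanitize` are hypothetical names for this example; the standard library's `Uchar` module (OCaml 4.03 or later) is assumed for `of_int` and `to_int`.

```ocaml
(* Sketch: noncharacters are U+FDD0…U+FDEF plus the last two code points
   of each of the 17 planes (U+xxFFFE and U+xxFFFF). *)
let is_noncharacter u =
  let i = Uchar.to_int u in
  (0xFDD0 <= i && i <= 0xFDEF) || ((i land 0xFFFE) = 0xFFFE)

(* Example of a context-dependent action: replace noncharacters received
   in interchange by U+FFFD, as the quoted passage suggests. *)
let sanitize u = if is_noncharacter u then Uchar.of_int 0xFFFD else u
```

Since `Uchar.of_int 0xFFFF` does not raise, a decoder can hand the value to the application, which then decides whether to keep, replace or reject it.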

> Should we mark the functions that raise an exception with a suffix, say _exn? (I know it is a controversial point.)

I would say no: a) that's for people who like Hungarian notation; b) remember that Invalid_argument means programming error: you are not supposed to behave in a way that raises this exception and should not try to catch it, except at the toplevel as part of a general recovery procedure, see [1].

Best,

Daniel

[1] https://sympa.inria.fr/sympa/arc/caml-list/2007-10/msg00475.html

@yoriyuki

yoriyuki Jul 14, 2014

For the latter two points, I now concur. I am not against merging your patch.

For the first point,

> I disagree with that. If you introduce a Unicode string data type and literals, then you most likely also want
> pattern matching on them. And if you want pattern matching on them, you need to take normalization into
> account; in particular you want to be able to specify in which normalisation form your literal is supposed to
> be, otherwise it is useless, deceiving and could even be the source of a new class of potential security bugs.
> Formal Unicode string literals without normalisation would be irresponsible IMHO.

If you mean that comparison and pattern matching should always respect canonical equivalence, and that all string literals should be in normal form, then I disagree with you. Code-point comparison has its place, like the comparison used in binary trees, say, OCaml's Set. String literals in non-normalized form also have their place, for example when passing strings to legacy encodings. Unicode security is a complex issue. Leave it to the programmer; we should make sure that the necessary tools are provided by the compiler and libraries.

> It is currently perfectly possible to write unnormalized UTF-8 literals in OCaml, which is entirely sufficient for
> many programs out there, and these are a function away from being translated into the representation of your
> particular library, at a negligible initial runtime cost.

Using a raw byte string encoded in UTF-8 as an alternative to a proper Unicode string is a troubling tendency. UTF-8 encoding can be broken, which creates serious security issues. It is much worse than your normalization apocalypse.

But this topic (whether we need a standard Unicode string or not) is not related to your patch. If you want to continue the discussion, let us move to caml-list.

@dbuenzli

dbuenzli Jul 14, 2014

Contributor

On Monday, 14 July 2014 at 16:08, Yoriyuki Yamagata wrote:

> If you mean that comparison and pattern matching should always respect canonical equivalence,

That's exactly not what I said. First, I never talked about comparison at all; pattern matching is about equality, and what I was precisely suggesting is that the equality you'd like (i.e. the underlying Unicode equivalence) depends on context, which is why literals should be able to indicate the normal form you want them to be in, in order to be useful in pattern matching. You could say we want the literal notation without the pattern matching, but that would feel odd as it would mismatch all the other literal notations we have in the language.

> Code-point comparison has its place, like the comparison used in binary trees, say, OCaml's Set.

Again, I never talked about comparison here. Pay attention to the words I use.

> Unicode security is a complex issue. Leave it to the programmer; we should make sure that the necessary tools are provided by the compiler and libraries.

That's precisely the aim of this proposal.

> Using a raw byte string encoded in UTF-8 as an alternative to a proper Unicode string is a troubling tendency. UTF-8 encoding can be broken, which creates serious security issues.

I don't think so. You are not supposed to, and can't, use them blindly: if you do any processing with them you must have them go through some validating function (which will detect malformed sequences), if only to be able to normalize them so that you can match them against normalized user-provided input.

Best,

Daniel

@yoriyuki

yoriyuki Jul 14, 2014

> That's exactly not what I said. First, I never talked about comparison at all; pattern matching is about
> equality, and what I was precisely suggesting is that the equality you'd like (i.e. the underlying Unicode
> equivalence) depends on context, which is why literals should be able to indicate the normal form you want
> them to be in, in order to be useful in pattern matching. You could say we want the literal notation without the
> pattern matching, but that would feel odd as it would mismatch all the other literal notations we have in the
> language.

Comparison has a broader meaning, which includes equality tests, I think. Although my example of Set uses comparison in the narrow sense, there are plenty of cases in which code-point equality tests are used (say, hash tables).

As for pattern matching, code-point comparison is enough. If you need canonical equivalence or something else, you can preprocess the input and put the literals in normal form by hand, or use when clauses.

> I don't think so. You are not supposed to, and can't, use them blindly: if you do any processing with them you must have them go through some validating function (which will detect malformed sequences), if only to be able to normalize them so that you can match them against normalized user-provided input.

Of course we must have them validated, but the type system gives no guarantee that such validation is performed. Having an abstract Unicode string type enforces validation and increases safety.
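
One way to picture this argument is the sketch below, where a value of the abstract type can only be obtained by passing validation. The module name `Utf8` and its functions are hypothetical; `String.is_valid_utf_8` is assumed to be available, which requires OCaml 4.14 or later.

```ocaml
(* Hypothetical sketch: a string type whose values are known to be valid
   UTF-8. [Utf8] is an illustrative name; [String.is_valid_utf_8] is
   assumed to be available (OCaml 4.14 or later). *)
module Utf8 : sig
  type t
  val of_string : string -> t option   (* [None] on malformed input *)
  val to_string : t -> string
end = struct
  type t = string
  let of_string s = if String.is_valid_utf_8 s then Some s else None
  let to_string s = s
end
```

This sketch only covers well-formedness of the encoding; it says nothing about normalization, which is the other half of the disagreement in this thread.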

@dbuenzli

dbuenzli Jul 14, 2014

Contributor

On Monday, 14 July 2014 at 17:48, Yoriyuki Yamagata wrote:

> Comparison has a broader meaning, which includes equality tests, I think. Although my example of Set uses comparison in the narrow sense, there are plenty of cases in which code-point equality tests are used (say, hash tables).

I think you are making this discussion more confusing than it should be. Binary comparison which includes binary equality has its uses, especially when you have normalized your inputs including your string literals and you actually know in which normal form they are.

> As for pattern matching, code-point comparison is enough. If you need canonical equivalence or something else, you can preprocess the input and put the literals in normal form by hand, or use when clauses.

Well, it's enough if you want people to write broken Unicode programs. Making a normal form by hand is certainly painful, and when clauses are impossible: you need to normalize the literal constant of the pattern, otherwise you are just acting on variables, which you can already do perfectly well right now:

let ustr nf s = (* function that validates the UTF-8 encoded s and normalizes to nf *)
let cst = ustr `NFD "Éole"

match ustr `NFD x with
| x when x = cst -> ...

Overall I think that Unicode string literals without pattern matching and normalization are just a waste of time for everybody.

Daniel

@yoriyuki

yoriyuki Jul 14, 2014

I think you miss my points.

I think you are making this discussion more confusing than it should be. Binary comparison which includes binary equality has its uses, especially when you have normalized your inputs including your string literals and you actually know in which normal form they are.

My point here is that there are cases where binary comparison and equality are enough, or even necessary, without normalization.

The first examples of this kind are data structures which only require a consistent equality or ordering over Unicode strings. The second example is interacting with legacy encodings which, say, distinguish Ω (unit) from Greek Ω.

Well it's enough if you want people to write broken Unicode programs. Making a normal form by hand is certainly painful and when clauses are impossible: you need to normalize the literal constant of the pattern, otherwise you are just acting on variables which you can already perfectly do right now:

let ustr nf s = (* function that validates the UTF-8 encoded s and normalizes to nf *)
let cst = ustr `NFD "Éole"

match ustr `NFD x with
| x when x = cst -> ...

Overall I think that unicode string literals without pattern matching and normalization is just a waste of time for everybody.

Again, you miss my point. My point is that, by introducing an abstract Unicode string type, we can enforce through the type system that the internal representation of a Unicode string (say, UTF-8) is valid. We need string literals just as a convenience for writing down values of such an abstract data type. We do not need pattern matching for this purpose.

Besides, if you use a UTF-8 encoded byte string to represent a Unicode string, a.[0], a.[1]... are bytes of the UTF-8 encoded string, not its first and second Unicode characters. I think it is conceptually ugly.
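
For readers following along, the indexing remark can be checked with a minimal sketch. It is not part of the patch; it only assumes that the string holds UTF-8 bytes, and the bytes are written as escapes so the result does not depend on the encoding of the source file.

```ocaml
(* "é" (U+00E9) encoded in UTF-8 is the two bytes 0xC3 0xA9. *)
let e_acute = "\xC3\xA9"

let () =
  (* String.length counts bytes, not Unicode characters. *)
  assert (String.length e_acute = 2);
  (* Indexing yields the individual bytes of the encoding. *)
  assert (Char.code e_acute.[0] = 0xC3);
  assert (Char.code e_acute.[1] = 0xA9)
```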

@dbuenzli

dbuenzli Jul 14, 2014

Contributor

My point here is that there are cases where binary comparison and equality are enough, or even necessary, without normalization.

They are certainly not the average case. There may be a few specific cases, or some data sets may give you the illusion that this is the case, until you stumble on a damned decomposed é. Even if you want to deal with something "relatively simple" like latin1 characters it's not going to be enough; better not to lure programmers into fallacies, they already seem to have a hard enough time understanding all of this. I think you miss both the social and technical point here.

Again, you miss my point. My point is that, by introducing an abstract Unicode string type, we can enforce through the type system that the internal representation of a Unicode string (say, UTF-8) is valid.

I perfectly get that point: it has the same basis as this very proposal, on which we agree. Sure, it would be useful. But it's much more contentious: for example, I expect there will already be disagreement over the actual internal representation (e.g. I would make them immutable arrays of ints, not UTF-8 encoded strings) and over what the minimal support should be (as we have right at the moment). Then if you want to introduce literals you will need to hook a UTF-8 decoder into the compiler, and then you will need to find an actual syntax in the very crowded surface syntax of OCaml, and all this for not much gain in my opinion, unless we get pattern matching and normalization, which, unlike what you suggest, is a basic need in most cases to perform correct Unicode processing. I prefer nothing to broken things that will confuse everyone. I prefer small things that improve my coding life to nothing because the change was too invasive.

We need string literals just as a convenience for writing down values of such an abstract data type. We do not need pattern matching for this purpose.

I don't like the idea of having literals on which you cannot pattern match. This is conceptually ugly.

Besides, if you use a UTF-8 encoded byte string to represent a Unicode string, a.[0], a.[1]... are bytes of the UTF-8 encoded string, not its first and second Unicode characters.

As I already said on the caml-list, indexing Unicode characters is worthless in general. From an abstract character point of view, for layout purposes, etc., direct indexing doesn't bring you anything, so I don't really care about that, and in real programs it has never been a problem for me not to have direct indexing. UTF-8 encoded source files/strings may not be a perfect solution but they work well enough in real programs. Having that as a basis we can move to consolidate it, step by step.

I think it is conceptually ugly.

It's not a concept! It was not made for that… It's a way to move forward. Progress is made in small steps. I'm already glad we don't have the conceptual mess other languages have with their Unicode support. Again, I'd rather have nothing than broken things. The actual literal notation you'd like is a function call away; from a pragmatic point of view I'd say it is not at the moment (if ever) worth pursuing the idea (that is, unless the dev. team is willing to commit to some form of useful Unicode string support in the compiler).
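
The "decomposed é" pitfall mentioned in this comment can be illustrated with a small sketch. It is not taken from the patch; it only assumes UTF-8 encoded strings, written with byte escapes so that nothing depends on the editor's own normalization.

```ocaml
(* Precomposed form: U+00E9, UTF-8 bytes C3 A9. *)
let precomposed = "\xC3\xA9"

(* Decomposed form: U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT,
   whose UTF-8 encoding is the bytes CC 81. *)
let decomposed = "e\xCC\x81"

let () =
  (* Both sequences are canonically equivalent and render as "é",
     yet binary (byte-wise) equality tells them apart. *)
  assert (precomposed <> decomposed);
  assert (String.length precomposed = 2);
  assert (String.length decomposed = 3)
```

Comparing the two only gives the expected answer once both have been brought to the same normal form, which is the job of an external normalization library.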

stdlib/uchar.mli
+(** [to_int u] is [u] as an integer. *)
+
+val is_char : t -> bool
+(** [is_char u] is [true] iff [u] is a latin1 OCaml character. *)

@chambart

chambart Nov 4, 2014

Contributor

It was suggested that this function should be named is_valid because we don't want to encourage opening this module, and Uchar.is_char is ugly.

@dbuenzli

dbuenzli Nov 4, 2014

Contributor

I don't see the connection to opening the module. Why not another name, but Uchar.is_valid wouldn't make sense at all: we are talking about a function that checks whether [u] can be represented by char. Maybe is_latin1? That would make it less consistent with Uchar.of_char and Uchar.to_char, but why not. What do you think?

@alainfrisch

alainfrisch Nov 4, 2014

Contributor

I think the question was rather on is_uchar.

@dbuenzli

dbuenzli Nov 4, 2014

Contributor

Ah! Makes more sense. OK, I'll rename it.

@chambart

chambart Nov 5, 2014

Contributor

Oops, sorry for the misleading typo...

@mshinwell

mshinwell Nov 4, 2014

Contributor

Daniel, in your first comment, you put in emphasis "in the standard library". Can you provide some more justification for that? (In particular, with the advent of OPAM simplifying the writing of new libraries, could we put this in a "base Unicode" library that the other Unicode libraries all depend on?)

@dbuenzli

dbuenzli Nov 4, 2014

Contributor

On Tuesday, 4 November 2014 at 11:50, Mark Shinwell wrote:

Daniel, in your first comment, you put in emphasis "in the standard library". Can you provide some more justification for that? (In particular, with the advent of OPAM simplifying the writing of new libraries, could we put this in a "base Unicode" library that the other Unicode libraries all depend on?)

We could of course publish this module separately but it would be a real maintenance burden (not code-wise, infrastructure-wise) for such small functionality — 31 loc which are basically cast in stone. In the end every program using some form of unicode character (and which don't these days ?) would end up with this tiny package in their dependency list and the only benefit would be, in my opinion, to introduce noise in the whole infrastructure; e.g. if you take uutf, uucp, ulex or camomile they don't have any dependencies at the moment. Having it in the standard library is also a better way of enforcing use of that representation for such a fundamental type.

Best,

Daniel

@dbuenzli

dbuenzli Nov 4, 2014

Contributor

Renamed Uchar.is_uchar to Uchar.is_valid.
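
As a rough illustration of what the renamed check guards, here is a self-contained sketch. The module name Uchar_sketch and its implementation are assumptions made for this example, not the code of the patch; the ranges are the ones given in the pull request description (0x0000…0xD7FF and 0xE000…0x10FFFF).

```ocaml
(* Hypothetical stand-in for the proposed module: the function names follow
   the thread, the implementation is only an assumption for illustration. *)
module Uchar_sketch : sig
  type t
  val is_valid : int -> bool
  val of_int : int -> t
  val to_int : t -> int
end = struct
  type t = int

  (* A Unicode scalar value is any code point that is not a surrogate. *)
  let is_valid i =
    (0x0000 <= i && i <= 0xD7FF) || (0xE000 <= i && i <= 0x10FFFF)

  let of_int i =
    if is_valid i then i
    else invalid_arg "Uchar_sketch.of_int: not a Unicode scalar value"

  let to_int u = u
end

let () =
  assert (Uchar_sketch.is_valid 0x00E9);
  assert (not (Uchar_sketch.is_valid 0xD800));   (* surrogate *)
  assert (not (Uchar_sketch.is_valid 0x110000)); (* past the last code point *)
  assert (Uchar_sketch.to_int (Uchar_sketch.of_int 0x1F42B) = 0x1F42B)
```

Because the type is abstract, a library receiving a value of this type does not need to re-check the range, which is the interoperability argument made in the description.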

@dbuenzli

dbuenzli Dec 6, 2014

Contributor

Removed UTF-8 comment as per request.

@damiendoligez

damiendoligez Dec 8, 2014

Member

I'm in favor of adding this to the stdlib.

@avsm

avsm Feb 15, 2015

Member

Is there anything blocking this from being merged into trunk now? It would be useful to be able to start depending on it, and to put a transitional package into OPAM for older compiler revisions (as we did for bytes).

@gasche

gasche Feb 15, 2015

Member

I wouldn't mind merging it if there was a clear consensus in favor, but right now I'm not sure there is -- apparently it wasn't discussed at the last developer meeting? Maybe you could ask other developers for their opinion.

@Drup Drup referenced this pull request Mar 2, 2015

Closed

added result type #147

@dbuenzli

dbuenzli Apr 14, 2015

Contributor

It seems this PR goes against the very idea of the stdlib. So let's just close this.

@dbuenzli dbuenzli closed this Apr 14, 2015

@lpw25

lpw25 Apr 15, 2015

Contributor

Reopening. Whilst I appreciate Daniel's frustration, this is a pull request with fairly broad support that I would very much like to see merged.

@lpw25 lpw25 reopened this Apr 15, 2015

@gasche

gasche Nov 18, 2015

Member

We discussed this PR at the developer meeting today, and the consensus was to accept it if we could get three different Unicode-related library authors to agree with it. @alainfrisch was in the meeting and agreed. @dbuenzli and @yoriyuki, do you agree with merging the Uchar module as defined?

Daniel: if we decide to go ahead and merge, it would be helpful to also provide the Uchar module code as a package to use on OCaml versions <4.03 (the package would be empty for OCaml >=4.03) so that users can write Uchar-using code that works on all OCaml versions. Would you be ready to take responsibility for this?

@dbuenzli

dbuenzli Nov 19, 2015

Contributor

Would you be ready to take responsibility for this?

No problem. However, it would be nice if such a package could live under the ocaml organisation; I do not want to claim ownership over it.

@avsm

avsm Nov 20, 2015

Member

Would you be ready to take responsibility for this?
No problem. However it would be nice if such a package could live under the ocaml organisation I do not want to claim ownership over it.

That's fine by me. I can move it in there if you could create the repository on your personal GitHub and issue a transfer.

@dbuenzli

dbuenzli Nov 30, 2015

Contributor

I made the change from Uchar.pp to Uchar.dump and added a minimal test suite to the commit. The compatibility package lives in https://github.com/ocaml/uchar and will soon be released after a bit of testing w.r.t the build stuff.

+(* *)
+(* OCaml *)
+(* *)
+(* Daniel C. Buenzli *)

@dbuenzli

dbuenzli Nov 30, 2015

Contributor

I realize this is not centered but it's the first time we get a compiler hacking session with proper wine.

@dbuenzli

dbuenzli Nov 30, 2015

Contributor

Thanks to @avsm's hosting.

@damiendoligez

damiendoligez Dec 4, 2015

Member

Centering is not mandatory :-)

@dbuenzli

dbuenzli Dec 1, 2015

Contributor

The patch was missing a line in stdlib/StdlibModules which has now been added. A compatibility module has been released, see ocaml/opam-repository#5218.

@lpw25

lpw25 Dec 1, 2015

Contributor

I have been wondering whether this module should be called something more general like Character. I've always thought that names like "wchar" or "uchar" relegate unicode to being some superfluous extra thing rather than what it should be: the default way of representing text. To me it seems reasonable to have a module Char for what is essentially the C char type and Character for representing textual characters. Similarly, in the future we may have a String module for representing the C "string" type and Text for representing text.

Of course we can always make this change at a later date. Perhaps if things like unicode character expression literals are added to the language itself.

@dbuenzli

dbuenzli Dec 1, 2015

Contributor

"uchar" relegate unicode to being some superfluous extra thing rather than what it should be: the default way of representing text.

You will find it difficult in general to process Unicode characters without referring to Unicode itself, so I'd argue the U is an important bit of information in our context: it tells you the kind of standardized processing algorithms you can apply to them and which kinds of text you are actually able to represent using the datatype.

I wouldn't mind dropping the U if there was a single notion of character in the language, but since we now have two for historical reasons, it is much clearer to have the U here. Having Char.t be a latin1 character and Character.t be a Unicode scalar value would be, name-wise, more confusing than enlightening in my opinion.
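
The relation between the two notions of character can be sketched as follows. This assumes an OCaml version (4.03 or later) in which Uchar is part of the standard library with of_char, to_char, is_char, of_int and to_int; the variable names are only illustrative.

```ocaml
(* Assumes OCaml >= 4.03, where Uchar ships with the standard library. *)

(* A latin1 char maps to the Unicode scalar value with the same number. *)
let e_acute = Uchar.of_char '\xE9'

(* U+20AC EURO SIGN lies outside latin1, so it has no char counterpart. *)
let euro = Uchar.of_int 0x20AC

let () =
  assert (Uchar.to_int e_acute = 0xE9);
  assert (Uchar.is_char e_acute);
  assert (Uchar.to_char e_acute = '\xE9');
  assert (not (Uchar.is_char euro))
```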

@gasche

gasche Dec 1, 2015

Member

So @dbuenzli, do you feel that the PR should be merged in its current state? There is no strong hurry (mid-December being the target), so if your inner self still wants to ponder over some changes... Of course you can always send subsequent PRs later.

@dbuenzli

dbuenzli Dec 1, 2015

Contributor

For me it can be merged, but the PR stands in a little conflicted zone, so I have no problem in leaving this open in case we get further comments (which we may also get through the compatibility package).

@gasche

gasche Dec 1, 2015

Member

@damiendoligez , as release manager, do you have an opinion? I would say that the sooner the better (we still have large-ish changes down the road, and doing everything at the same time would/will be painful.)

@murmour

murmour Dec 2, 2015

Contributor

As a heavy user of Unicode in OCaml who is somewhat annoyed by the library interoperability issues, I would be very happy to see this integrated into the distribution. The proposed interface seems complete and future-proof. The implementation contains no obvious blemishes.

Thanks, Daniel, for proposing this change. I hope it gets in.

stdlib/uchar.mli
+(** [compare u u'] is [Pervasives.compare u u']. *)
+
+val hash : t -> int
+(** [hash u] associates a non negative integer to [u]. *)

@murmour

murmour Dec 2, 2015

Contributor

I knew something was wrong with this otherwise stellar pull request: "non negative" should be either "non-negative" or "nonnegative" (in case you find hyphens outrageous). Thank God we caught this early!

@dbuenzli

dbuenzli Dec 2, 2015

Contributor

Thanks. Dash added.
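
One practical consequence of exposing a type t together with compare is that the module can be handed directly to the standard functors. A minimal sketch, assuming OCaml 4.03 or later where Uchar is in the standard library; the Umap name and the table contents are only illustrative.

```ocaml
(* Assumes OCaml >= 4.03. Uchar provides [type t] and [compare],
   which is all that Map.Make requires of its argument. *)
module Umap = Map.Make (Uchar)

(* Two distinct scalar values that look alike when rendered. *)
let names =
  Umap.empty
  |> Umap.add (Uchar.of_int 0x2126) "OHM SIGN"
  |> Umap.add (Uchar.of_int 0x03A9) "GREEK CAPITAL LETTER OMEGA"

let () =
  assert (Umap.cardinal names = 2);
  assert (Umap.find (Uchar.of_int 0x03A9) names = "GREEK CAPITAL LETTER OMEGA");
  (* The ordering is the ordering of the underlying integers. *)
  assert (Uchar.compare (Uchar.of_int 0x03A9) (Uchar.of_int 0x2126) < 0)
```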

@damiendoligez

damiendoligez Dec 4, 2015

Member

Let's merge it now.

@gasche

gasche Dec 4, 2015

Member

@damiendoligez any reason not to merge it yourself?

@alainfrisch

alainfrisch Dec 9, 2015

Contributor

Minor nitpicks: can you add an entry to Changes and update copyright headers to 2015 for new files?

@damiendoligez damiendoligez self-assigned this Dec 21, 2015

@hcarty

hcarty Jan 6, 2016

Contributor

@alainfrisch @damiendoligez If the Changes and copyright changes are holding a merge, I can submit a separate PR with those changes after this gets in.

alainfrisch added a commit that referenced this pull request Jan 6, 2016

Merge pull request #80 from dbuenzli/uchar
Add Uchar module to the standard library.

@alainfrisch alainfrisch merged commit 4b59df8 into ocaml:trunk Jan 6, 2016

1 check failed

continuous-integration/travis-ci/pr The Travis CI build failed
@alainfrisch

alainfrisch Jan 6, 2016

Contributor

I can submit a separate PR with those changes after this gets in.

That would be very nice of you!

@dbuenzli

dbuenzli Jan 6, 2016

Contributor

I don't see why copyright dates should be changed; they all correspond to the year when the code was written.

@alainfrisch

alainfrisch Jan 6, 2016

Contributor

Yeah ok, what matters is really the Changes file.

@dbuenzli dbuenzli deleted the dbuenzli:uchar branch Jan 6, 2016

stedolan pushed a commit to stedolan/ocaml that referenced this pull request Mar 14, 2017
