Unsigned 32-bit and 64-bit integers #1201

Open
wants to merge 22 commits into
from

Conversation

@yallop
Member

yallop commented Jun 14, 2017

This pull request adds support for unsigned 32-bit and 64-bit integers, a frequently-requested feature. The patch adds primitive types, standard library modules, and syntax for expressions and patterns, together with corresponding updates to the manual and testsuite. Several opportunities for further enhancement are left for future patches (see Future features below).

Motivation

Libraries that implement protocols or that bind system interfaces frequently need fixed-size unsigned integer types to faithfully represent the values involved. For example

Furthermore, there are occasions where unsigned integers are needed in the OCaml distribution -- for example, for representing inode values.

Consequently, there have been several requests for support for unsigned integers in OCaml:

Unsigned integers as a library?

It is possible to give a fairly convincing implementation of unsigned integers as a library -- for example, the integers and stdint libraries both provide signed and unsigned integer types of various widths. Similarly, it would also be possible to implement the standard int32 and int64 types as libraries, rather than including them in the standard distribution.
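As a rough illustration of the library approach, here is a minimal sketch of an unsigned 32-bit type built on top of int32 (module and function names are illustrative, not taken from the integers or stdint libraries). Note that every operation is an ordinary function call on a boxed value -- exactly the overhead that built-in support can eliminate.

```ocaml
(* A library-only unsigned 32-bit type: same bit patterns as int32,
   with operations reinterpreted for the unsigned view. *)
module Uint32_lib : sig
  type t
  val of_int32 : int32 -> t
  val to_int32 : t -> int32
  val add : t -> t -> t
  val compare : t -> t -> int
end = struct
  type t = int32
  let of_int32 x = x
  let to_int32 x = x
  (* Two's-complement addition is the same for signed and unsigned. *)
  let add = Int32.add
  (* Unsigned ordering: offsetting both operands by min_int flips the
     sign bit, turning the unsigned order into the signed one. *)
  let compare a b =
    Int32.compare (Int32.add a Int32.min_int) (Int32.add b Int32.min_int)
end

let () =
  let open Uint32_lib in
  (* 0xFFFFFFFF (the all-ones pattern, -1l as an int32) is the largest
     unsigned value, so it compares above 1. *)
  assert (compare (of_int32 (-1l)) (of_int32 1l) > 0)
```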

But there are some rather compelling advantages to making both signed and unsigned integers built-in types:

  • using built-in types enables numerous optimizations
    • operations can be implemented as primitives, not C calls
    • the compiler can unbox arguments to foreign functions
    • the compiler can unbox local values
    • the compiler can simplify arithmetic expressions
  • integration with other parts of the standard distribution, including
    • format strings
    • bigarrays

Features added by this PR

This PR adds fairly comprehensive support for fixed-size unsigned integers, including

  • predefined types uint32 and uint64, unsigned counterparts to the existing int32 and int64, with support for OCaml's various polymorphic operations (hashing, equality, ordering, serialization)

  • Uint32 and Uint64 modules, with interfaces that follow the conventions of Int32, Int64 and Nativeint

  • Syntax for decimal, binary and hexadecimal unsigned integers. Unsigned 32-bit and 64-bit integer literals are respectively suffixed with u and U:

    # 0b1010u;;
    - : uint32 = 10u
    # 0xDEAD_BEEFU;;
    - : uint64 = 3735928559U
    # 543U;;
    - : uint64 = 543U
  • Pattern matching for unsigned integers

    match x, y with
    | 0u, 0U -> `zero
    | ...
  • Documentation, both for the new modules and in the user manual

  • A test suite that exercises all the functions in the Uint32 and Uint64 modules

Additional features, including various optimizations, are not implemented here (see below). Once there is consensus on what's in this PR, implementation of the remainder should be straightforward.

Future features

Having unsigned integers in the language and standard library, rather than in third-party libraries, will be a significant improvement. There are several opportunities for further improvements, which are left for future patches, including

  • using primitives in place of C calls
  • unboxing and arithmetic optimizations
  • format string and bigarray integration

@yallop yallop referenced this pull request in ocamllabs/ocaml-ctypes Jun 14, 2017

Closed

Split off Signed/Unsigned modules into a new package #446

@gasche

Member

gasche commented Jun 14, 2017

(I have a strong feeling of déjà vu, did you already post a previous PR about this feature?)

@yallop

Member

yallop commented Jun 14, 2017

@dbuenzli

Contributor

dbuenzli commented Jun 14, 2017
@yallop Thanks very much for this! Other data points include bindings to OpenGL and mtime, where I had to resort to Hacker's Delight knowledge to devise a comparison function from Int64.compare.

A documentation nit: it would be nice to specify the exact kind of conversion and boundary behaviour that happen on the Uint{32,64}.{of,to}_OTHER_INT functions.

A naming nit: I don't know if you took the convention from somewhere else, but Uint{32,64}.of_string_opt is misleading to me. If you follow the way types read, I would rather advocate for Uint{32,64}.opt_of_string.


@gasche

gasche approved these changes Jun 14, 2017

I reviewed the patch and the implementation seems fine. I am very happy that there is a good testsuite and that the manual was extended.

(As previously discussed in PR#7280 I am rather in favor of having new types for handling unsigned values, so I marked the pull request as "approved". There is still the thorny issue of adding new module names to the standard library and the issues this may cause, but I think in this case it would be reasonable, and #1010 is supposed to alleviate this anyway.)

A digression on format conversions

The code naturally suggests that it would be nice to extend format strings with uint{32,64} formatters (as noted as Future Work in the PR description). The existing int{32,64} conversions are {l,L}{d,i,u,x,X,o} (d,i are synonyms for signed decimal printing, and all of u,x,X,o already format the (signed) numbers as unsigned). The natural conversions would then be {u,U}{d,i,u,x,X,o}, with the possibly counter-intuitive effects that:

  • lu takes an Int32.t instead of a Uint32.t (for this you need uu)
  • ud prints the Uint32.t number as signed rather than unsigned
  • u alone already exists and just prints an int as unsigned

The problem with this is that it breaks backward-compatibility in the following way: today %ud is a supported integer format that prints a %u (integer printed as unsigned) and then outputs a d character. The good news is that this change of semantics would always result in a type error rather than be silent, and that it is possible to write upward-compatible code by using %u%,d instead -- so the type errors can be fixed in a backward-compatible way. %Ud is fine given that %U is not a valid format.

Another option would be to reuse the existing lu and Lu conversions instead of adding new formats, but this also breaks code (it changes their input type), restricts expressivity (no way to print a Uint32 as octal), breaks the consistency between the int conversions and the int32 conversions, and breaks the consistency between the conversion prefix and the integer literal suffix.

(On a related note, currently ppx authors can rely on the fact that u,U literal suffixes are parsed correctly but rejected by the language to give them a custom semantics, and this property would be broken by the change. I guess this is ok.)
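The backward-compatibility point above can be checked with the compiler as it stands today; this snippet uses only the existing %u conversion:

```ocaml
(* Today "%ud" parses as the %u conversion (print an int as unsigned
   decimal) followed by a literal d character. Under the proposed
   {u,U} prefixes it would instead denote an unsigned-integer
   conversion, so existing uses like this would become type errors. *)
let () =
  assert (Printf.sprintf "%u" 42 = "42");
  assert (Printf.sprintf "%ud" 42 = "42d")
```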

+ | Uconst_block _ -> 6
+ | Uconst_float_array _ -> 7
+ | Uconst_string _ -> 8
+ | Uconst_closure _ -> 9

@gasche

gasche Jun 14, 2017

Member

(rank_structured_constant is only used to do tag-based comparison of values of different tags, so changing the outputs is fine as long as the relative ordering is preserved. The function is not exported outside the module so there are no compatibility issues.)

asmcomp/cmmgen.ml
@@ -878,18 +882,24 @@ let box_int_constant bi n =
Pnativeint -> Uconst_nativeint n
| Pint32 -> Uconst_int32 (Nativeint.to_int32 n)
| Pint64 -> Uconst_int64 (Int64.of_nativeint n)
+ | Puint32 -> Uconst_uint32 (Uint32.of_int32 (Nativeint.to_int32 n))
+ | Puint64 -> Uconst_uint64 (Uint64.of_int64 (Int64.of_nativeint n))

@gasche

gasche Jun 14, 2017

Member

I found it a bit surprising at first that Uint64 would not have an of_nativeint function if Int64 has one (are the two modules not supposed to share a common interface on the non-sign-related parts?). I suppose that one of the reasons is that the natural implementation is fun n -> of_int64 (Int64.of_nativeint n), and that you wanted to avoid having Uint64 depend on Int64? In general, is there a rationale for which conversion functions should or should not be provided by each module? (Why would Nativeint.to_uint32 not exist, for example?)

asmcomp/cmmgen.ml
- | Pint64 -> "int64" in
+ | Pint64 -> "int64"
+ | Puint32 -> "uint32"
+ | Puint64 -> "uint64" in

@gasche

gasche Jun 14, 2017

Member

If you moved that in to its own line, the diff for {u,}int128 would not have this disgraceful change to an existing line.

asmcomp/cmmgen.ml
+ Csymbol_address("caml_uint32_ops") :: Cint32 n :: Cint32 0n :: cont
+ else
+ Csymbol_address("caml_uint32_ops") :: Cint n :: cont
+

@gasche

gasche Jun 14, 2017

Member

It could make sense to factorize the two functions with an emit_boxed_32_constant header (n : nativeint) cont function.

bytecomp/matching.ml
@@ -2280,10 +2280,19 @@ let combine_constant loc arg cst partial ctx def
fail
(Pbintcomp(Pnativeint, Cneq)) (Pbintcomp(Pnativeint, Clt))
arg const_lambda_list

@gasche

gasche Jun 14, 2017

Member

I think it would be more natural (and more readable in the future) to move the nativeint case above the {,u}int{32,64} case, as in the rest of the changes.

bytecomp/translcore.ml
@@ -364,7 +378,8 @@ let prim_restore_raw_backtrace =
let specialize_comparison table env ty =
let (gencomp, intcomp, floatcomp, stringcomp, bytescomp,
- nativeintcomp, int32comp, int64comp, _) = table in
+ nativeintcomp, int32comp, int64comp,
+ uint32comp, uint64comp, _) = table in

@gasche

gasche Jun 14, 2017

Member

(I would support having nativeintcomp on its own line and breaking a line before _) to get nice diffs in the future. There is no nit too small to be picked.)

bytecomp/typeopt.ml
- || Path.same p Predef.path_int64 then Addr
+ || Path.same p Predef.path_int64
+ || Path.same p Predef.path_uint32
+ || Path.same p Predef.path_uint64 then Addr

@gasche

gasche Jun 14, 2017

Member

(then on its own line?)

+ }
+ }
+ return p;
+}

@gasche

gasche Jun 14, 2017

Member

(grmbl)

+ uint32_t divisor = Uint32_val(b);
+ if (divisor == 0) caml_raise_zero_divide();
+ return caml_copy_uint32(dividend / divisor);
+}

@gasche

gasche Jun 14, 2017

Member

I personally have the impression that all of this would be easier to review and maintain if each uint{32,64} operation was right after the corresponding int{32,64} operation, which would also allow sharing the ifdef logic between both (for instance ARCH_ALIGN_INT64). But the code looks ok.

@@ -229,7 +237,7 @@ let common_initial_env add_type add_extension empty_env =
add_type ident_char decl_abstr_imm (
add_type ident_int decl_abstr_imm (
add_type ident_extension_constructor decl_abstr (
- empty_env)))))))))))))))))))))))))))
+ empty_env)))))))))))))))))))))))))))))

@gasche

gasche Jun 14, 2017

Member

Someday my prince will come... and use @@ here.

@yallop yallop referenced this pull request in mirage/mirage-solo5 Jun 14, 2017

Merged

Add caml_get_addr #20

@mshinwell

Contributor

mshinwell commented Jun 15, 2017
Thanks for working on this; I had a request for this functionality just yesterday, and am also a strong proponent myself.

Following on from @dbuenzli's second paragraph above, I would like to argue for better naming and semantics for the narrowing conversion functions. For example, there might be functions from unsigned 64-bit integers to unsigned 32-bit integers that either (a) truncate; (b) raise an exception if truncation occurs; (c) return an option, where None is returned if truncation would occur. In particular I think (c) is essential. (Also see https://github.com/janestreet/base/blob/master/src/int_conversions.mli for reference.)

I wonder if the widening functions should actually be explicitly annotated with "_zero_extend" for clarity, though I'm less sure of that.

It would be useful for the compiler to have a function that returns both halves of a 64-bit unsigned integer as 32-bit unsigned integers.
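The both-halves operation suggested above can be sketched with today's Int64/Int32 modules standing in for the proposed unsigned types (split64 is a hypothetical name, not a function from this PR):

```ocaml
(* Split a 64-bit word into its high and low 32-bit halves.
   Int64.to_int32 truncates, keeping the low 32 bits. *)
let split64 (x : int64) : int32 * int32 =
  let hi = Int64.to_int32 (Int64.shift_right_logical x 32) in
  let lo = Int64.to_int32 x in
  (hi, lo)

let () =
  let hi, lo = split64 0x12345678_9ABCDEF0L in
  assert (hi = 0x12345678l);
  assert (lo = 0x9ABCDEF0l)
```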


@rleonid

Contributor

rleonid commented Jun 16, 2017

If unsigned integers are now primitive types, should array/string indexing now use them? This seems like a huge breaking change, but a sensible one for a language like OCaml.

@gasche

Member

gasche commented Jun 16, 2017

I think the compatibility costs are completely unreasonable, and the performance implications unclear (we don't have an "unsigned tagged integer" type, so we would pay boxing costs for indices). Besides, this doesn't actually improve safety that much, given that bounds checks have both lower and upper bounds, and only the lower bound would be type-enforced. (Actually it wouldn't be, given that subtractions at those types currently silently overflow. So you would need exceptions on overflow, and the optimizations to make that fast, and...)

@yallop

Member

yallop commented Jun 20, 2017
Thanks for the reviews, @dbuenzli, @gasche, @mshinwell. I've pushed new commits that take your comments into account.

The new commits complete the panoply of numeric conversions, with functions that convert between uint32/uint64 and the existing numeric types (int, int32, int64, nativeint, float). The conversions follow existing standard library conventions for naming, for location, and for behaviour.

I've also documented the boundary behaviour of all the conversion functions, and added more tests.

Finally, I've addressed most of the stylistic suggestions that @gasche left. (The most significant suggestion that I haven't followed is moving the various uint C functions alongside their int counterparts. The current code follows the existing layout of the file, with functions grouped by type rather than operation. Still, I see @gasche's point, and could be persuaded to move the code if pressed.)

@dbuenzli and @mshinwell raised the question of naming, and particularly of the _opt functions. Those functions come from #885, which added option-returning versions of various (perhaps all) stdlib functions that raise exceptions. For backwards compatibility, the functions added in #885 have names that follow different conventions to Jane Street's -- in Core, unsuffixed functions return option and functions that raise exceptions are suffixed; in the stdlib, unsuffixed functions raise exceptions and functions that return option are suffixed. The new functions in this PR, such as Uint32.of_string_opt, follow the pattern of Int32.of_string_opt and the rest of the stdlib.

Regarding the semantics of conversions, the new functions follow standard library conventions, wrapping on (i.e. reducing modulo 2ⁿ) out-of-bounds values. As with the existing functions (e.g. Int64.to_int32), this relies on implementation-defined behaviour in C in places (namely on conversions to signed values; conversions to unsigned values are fully defined), but it only relies on behaviour that is already required by the existing implementation. As with Int64.of_float and similar functions, the behaviour of conversions from out-of-range floats is left unspecified.
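The wrapping convention described here is the one the stdlib already uses for its existing conversions, which can be observed today:

```ocaml
(* Out-of-range values are reduced modulo 2^n. Int64.to_int32 keeps
   the low 32 bits, so 2^32 + 5 wraps to 5, and an all-ones low word
   reads back as -1l. *)
let () =
  assert (Int64.to_int32 0x1_0000_0005L = 5l);
  assert (Int64.to_int32 0xFFFF_FFFFL = -1l)
```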

Finally, I'm happy to squash some of these commits before this is merged. However, mergers should be aware that the existing commit sequence maintains the property that each commit is sufficient to bootstrap subsequent commits in the history. Several bootstrapping cycles are needed, since the standard library depends on the predefined type, and the literal syntax (used in the standard library) depends on parsing functions in the standard library. If the whole commit sequence is squashed then it will no longer be possible to reconstruct the bootstrapped compiler.


@gasche

Member

gasche commented Jun 20, 2017

and could be persuaded to move the code if pressed

I won't press.

I won't merge by myself because I believe a larger agreement is needed to add standard library modules and this may be made dependent on #1010 or an alternative compilation-unit-name-conflict-avoiding proposal. But as far as I'm concerned, this is good to go.

@gasche gasche added the stdlib label Jun 20, 2017

@mshinwell

Contributor

mshinwell commented Jun 20, 2017

I would like to review this again before any merge is considered. I probably won't be able to get to it for a couple of days though, I'm afraid.

@yallop

Member

yallop commented Jun 20, 2017

There's a problem showing up in the tests, which currently pass with the standard runtime and fail with the debug runtime. I plan to take a closer look soon.
(Update: the problem with the debug runtime is resolved. There's still a problem with generic comparison of 64-bit unsigned values on Windows.)
(Update: the Windows problem is fixed.)

@yallop

Member

yallop commented Jun 22, 2017

I would like to review this again before any merge is considered.

A review would be appreciated, especially of the backend (asmcomp/bytecomp) portions.

@mshinwell

Contributor

mshinwell commented Jun 23, 2017
I'd like to discuss the narrowing conversion functions again before looking at the backend parts. I think we could improve the design by adhering less strongly to the appearance of other "integer" modules in the standard library. I also think this can be done without introducing the potential for confusion across such modules.

Starting on narrowing casts without changing signedness, using Uint64 as an example: I cannot recall any example involving unsigned integers where the existing semantics of the function Uint64.to_uint32 are what is wanted [edit: as the default narrowing conversion]. I think it is a recipe for error. Typically you have some number that should fit inside 32 bits but happens to be inside a wider container and it is an error if it does not fit. I think the default conversion function should either return None or raise an exception in such cases.

I think a truncating function should be provided that returns the bottom half of an unsigned 64-bit integer for the case where the top half contains irrelevant data. The converse, for returning the top half, likewise. In fact the usage of unsigned integers to manipulate packed protocol words is so commonplace that some kind of generic String.sub-like operation should perhaps be considered. Code that extracts particular parts of large words using various shifts and bit masks is very easy to get wrong.

Now there is the matter of conversions of the same underlying width but to a signed type (e.g. Uint64.to_int64). I think there are two common use cases here (resp. for 32-bit): firstly, "reinterpret this 64-bit word as a signed integer" (currently relying on C implementation-defined behaviour, if I understand correctly) and secondly, production of a non-negative 64-bit signed integer from a 64-bit unsigned integer so long as it is small enough. I would like to see these both as separate operations with clear names. The second case should have an option-returning and an exception-raising version. Coming back to the point in my first paragraph, I think these and other names can be chosen so they do not overlap with existing functions in other similar modules. I cannot emphasise enough the importance of good naming.

For the reinterpretation of 64/32-bit unsigned integers as their signed counterparts, can we not write our own implementation so that we are detached from the C language implementation-defined behaviour? To me this seems desirable, although perhaps there are reasons not to do it. (If it is not desirable then I don't think the existence of previous uses should be an argument for adding another use.)

Finally, I think matters would be simplified and clarified if we do not include functions that both narrow and change between signedness of types (e.g. Uint64.to_int32). This would force the user to think about the intended semantics for each step.
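A sketch of the option-returning narrowing conversion proposed here, using int64/int32 as stand-ins for the proposed unsigned types (to_uint32_opt is a hypothetical name, not part of this PR):

```ocaml
(* Succeed only when the value fits in 32 bits; never truncate silently. *)
let to_uint32_opt (x : int64) : int32 option =
  if Int64.shift_right_logical x 32 = 0L
  then Some (Int64.to_int32 x)  (* high word is zero: truncation is lossless *)
  else None                     (* high bits set: narrowing would lose data *)

let () =
  assert (to_uint32_opt 42L = Some 42l);
  assert (to_uint32_opt 0x1_0000_0000L = None)
```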


@yallop

yallop Jun 23, 2017

Member

These notes on an alternative design for conversion functions are interesting. If you fancy a go at an implementation then a pull request against this branch is welcome. Otherwise I'll wait for an indication that your preferred design is the consensus of the development team.

Starting on narrowing casts without changing signedness, using Uint64 as an example: I cannot recall any example involving unsigned integers where the existing semantics of the function Uint64.to_uint32 are what is wanted. I think it is a recipe for error.

I think a truncating function should be provided that returns the bottom half of an unsigned 64-bit integer for the case where the top half contains irrelevant data.

There seems to be a misunderstanding here. The existing semantics of the function Uint64.to_uint32 (described as "I cannot recall any example where the existing semantics ... are what is wanted"; "a recipe for error") are precisely the truncating semantics that "should be provided".

# Uint64.to_uint32 0x12345678_9ABCDEF0U = 0x9ABCDEF0u;;
- : bool = true

Do you just mean that you'd like to keep the existing Uint64.to_uint32, but give it a different name?

@mshinwell

mshinwell Jun 23, 2017

Contributor

Sorry, that wasn't written very clearly; I've added a slight edit. To explain further: I distinguish "narrowing" cases from "substring" cases. The narrowing case is when the programmer has the expectation that they have, say, a 32-bit number that happens to be in a wider 64-bit container. All of the bits in the 64-bit word are relevant. The substring case is when the programmer wishes to extract a particular subset of the bits from a word, perhaps the lower 32 bits of a 64-bit word, without caring about what the remaining bits are set to. I think the "narrowing" case would naturally be the one carried out by a function called, for example, Uint64.to_uint32; it follows that truncation semantics don't seem correct here, since we would be failing to catch a violation of the programmer's expectations.

I agree that the lower-half truncation proposed, which is an example of a "substring" case, has the same implementation as the existing Uint64.to_uint32. Maybe it could be called least_significant_half?
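The "substring" operations described here can be sketched on the existing int64 type; the names least_significant_half and most_significant_half follow the suggestion above and are hypothetical:

```ocaml
(* "Substring" extraction from a 64-bit word: the caller deliberately
   discards the other half.  Hypothetical names. *)
let least_significant_half (x : int64) : int32 =
  Int64.to_int32 x  (* keeps only the low 32 bits *)

let most_significant_half (x : int64) : int32 =
  Int64.to_int32 (Int64.shift_right_logical x 32)
```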

@mshinwell

mshinwell Jun 23, 2017

Contributor

(Another way of looking at it: narrowing functions should not lose information.)

@yallop

yallop Jun 23, 2017

Member

I think it's worth surfacing the reasons why there are different views here, since I doubt there's much disagreement on the underlying principles.

When you have an enclosed ecosystem -- e.g. a mono-repo and an easy way of communicating with the entire userbase, like at Jane Street -- you can approach every design question from first principles. If you decide it's always better to raise exceptions on overflow, for example, then you can implement those semantics, and change all the existing code to match, so you always end up with something coherent even if your views now are different to your views in the past.

The OCaml standard library is not in this situation. The userbase is large, diverse, and difficult to reach, and there's lots of OCaml code in use that's no longer actively maintained, so once a decision's made it's pretty painful to revisit it. (For example, switching to immutable strings started in OCaml 4.02.0, and is still not complete, several major versions later.) In most cases it's impractical to change the semantics of existing functions, even if there's agreement that the existing design is not optimal.

So yes, naming is very important. But coherent design is at least as important. If Uint64.to_uint32 raises an exception on overflow, but Int64.to_int32 truncates on overflow, then users will not be able to apply what they learn from one part of the standard library to another part, which would be a serious loss. If we start every design process from first principles, ignoring the existing interface, we won't have a gradually evolving and consistent unit; we'll end up with a user-hostile and incoherent hodge-podge.

For these reasons, it's not a good idea to have Uint64.to_uint32 raise exceptions unless Int64.to_int32 also raises exceptions. Perhaps there'll be consensus that we should change Int64.to_int32, but I'll be surprised if there is.

(For what it's worth, I don't see a real need to change the behaviour of Int64.to_int32. Integers in most programming languages are all about truncation: succ max_int is min_int, 7 / 2 is 3, abs min_int is negative, et cetera. It's certainly good to encourage users to think about truncation, but it's not enough to look just at conversion functions, since lots of other operations can also lose information. Where it's absolutely vital to avoid truncation, it's probably better to use a specialized library like Hannes's usane that eliminates truncation altogether.)

@mshinwell

mshinwell Jun 27, 2017

Contributor

I agree that we shouldn't have functions with very similar names having different overflow behaviour, etc., unless that behaviour is reflected in the type system.

Changing existing code doesn't necessarily all have to be done at once. We could introduce new functions, with new names, throughout the relevant stdlib modules. In the existing modules the new names would effectively be aliases for the old names (with the old names perhaps being deprecated); in the new module only the new names would be present. This may be slightly less discoverable at first, but I suspect not significantly (especially in the presence of deprecated attributes).

In this specific case, one option might be to add e.g. Uint64.Narrow.to_uint32 (and Uint64.Trunc.to_uint32); we would just not have Uint64.to_uint32.
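A minimal sketch of that layout, with int64/int32 standing in for the unsigned types (the Narrow/Trunc module and function names follow the suggestion above and are hypothetical):

```ocaml
module Uint64 = struct
  type t = int64  (* bits interpreted as an unsigned 64-bit integer *)

  (* Narrowing: all 64 bits are significant; failure to fit is an error. *)
  module Narrow = struct
    let to_uint32 (x : t) : int32 =
      if Int64.logand x 0xFFFF_FFFF_0000_0000L = 0L
      then Int64.to_int32 x
      else invalid_arg "Uint64.Narrow.to_uint32: out of range"
  end

  (* Truncation: only the low 32 bits are wanted; the rest is discarded. *)
  module Trunc = struct
    let to_uint32 (x : t) : int32 = Int64.to_int32 x
  end
end
```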

@yallop

yallop Jun 23, 2017

Member

Both those ideas (new names as aliases, and two more descriptive names in place of one shorter name) sound pretty reasonable to me.

@hannesm

hannesm Jun 23, 2017

Member

FWIW, I would appreciate deprecating conversion functions which lose information (such as Int64.to_int32, Int64.to_int, ...) and introducing, under more explicit names, functions with clear semantics (that either return an option or raise an exception when the value is out of range for the desired type).

@yallop

yallop Jun 27, 2017

Member

Redesigning the conversion functions in the standard library is a fine idea.

However, this PR is not that. It simply extends the existing interface with unsigned integers in a way that is consistent with the existing design. It's deliberately as unopinionated as possible, which includes taking no stance on the design of conversion functions besides consistency with the standard library.

I'd like to retain the focus in this PR on unsigned integers. There's time before the 4.06 freeze to redesign the conversion functions, but this isn't the best place to do it. Could we please consider here whether we'd like to add unsigned integers, and continue the other (admittedly interesting) discussions elsewhere?

@mshinwell

mshinwell Jun 27, 2017

Contributor

I personally reckon that it would be better to think about any redesign of these functions now, before we end up with yet more code that might need changing, and may be prone to error.

I haven't heard any arguments against adding unsigned integers as a general concept (quite the opposite in fact).

@damiendoligez

damiendoligez Jun 27, 2017

Member

@xavierleroy might want to chime in at this point...

@xavierleroy

xavierleroy Jun 27, 2017

Contributor

Well, I don't like to rain on someone else's parade, and I have already stated my position on this topic at several developers meetings. But here we go again:

I find unsigned integer types confusing, of little practical use, and not worth the cost of implementing them like this PR does. OCaml doesn't need to replicate all the horrors of C...

I find the Java approach much more reasonable for a language like OCaml: have signed integers only and provide as library functions the few operations we need when unsigned interpretation is required, namely shift-right-unsigned (we already have it), unsigned compare, unsigned divide, unsigned modulo.
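For reference, the unsigned comparison and division in this list can be written on top of the existing signed Int64 operations. A hedged sketch (the function names are hypothetical; unsigned_compare flips the sign bit so the signed order coincides with the unsigned one, and unsigned_div uses the usual halve-divide-correct trick):

```ocaml
(* Unsigned comparison on int64: adding min_int flips the sign bit,
   which maps the unsigned order onto the signed order. *)
let flip x = Int64.add x Int64.min_int

let unsigned_compare (a : int64) (b : int64) : int =
  Int64.compare (flip a) (flip b)

(* Unsigned division via signed operations: halve the dividend so it
   becomes non-negative, divide, then correct the quotient by at most 1. *)
let unsigned_div (n : int64) (d : int64) : int64 =
  if Int64.compare d 0L < 0 then
    (* divisor >= 2^63: the quotient can only be 0 or 1 *)
    (if unsigned_compare n d < 0 then 0L else 1L)
  else begin
    let q =
      Int64.shift_left (Int64.div (Int64.shift_right_logical n 1) d) 1 in
    let r = Int64.sub n (Int64.mul q d) in
    if unsigned_compare r d >= 0 then Int64.add q 1L else q
  end
```

(OCaml later gained Int64.unsigned_compare and Int64.unsigned_div in the stdlib, but at the time of this discussion they had to be written by hand as above.)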

@mshinwell

mshinwell Jun 27, 2017

Contributor

I hadn't actually realised that @xavierleroy was against this; I was remembering the concern more as one about the potential proliferation of differently-sized integer types (which may still be a concern).

Perhaps one proposal might be to add Unsigned sub-modules to e.g. Int64, containing just a few key functions.

@bluddy

bluddy Jun 27, 2017

I find the Java approach much more reasonable for a language like OCaml: have signed integers only and provide as library functions the few operations we need when unsigned interpretation is required, namely shift-right-unsigned (we already have it), unsigned compare, unsigned divide, unsigned modulo.

I think there's a key difference between OCaml and Java that matters here: Java is huge, and therefore gets to create its own rules. OCaml needs to interface with C a lot in order to provide access to all the important libraries that aren't written in OCaml, and it's much easier to do so with unsigned types.

I generally agree that unsigned types should be completely frowned upon in pure OCaml code. In general, unsigned types are a mistake and should never have been created. Were it not for limited memory and bandwidth in the early days of computing, they would not have been. But they're almost a necessity for sane compatibility with C libraries.

One way to wall them off and to send a message that they're not for the average user is to make sure that there are very few functions in the standard library and other major libraries that accept unsigned types, i.e. just stick to the status quo.

@bobot

bobot Jun 27, 2017

Contributor

OCaml needs to interface with C a lot in order to provide access to all the important libraries that aren't written in OCaml, and it's much easier to do so with unsigned types.

But for most functions, working with signed or unsigned types is identical. So instead of copying all the machinery for the unsigned types, it is possible to just add the functions which differ in behaviour (Xavier's list + to_string + of_string + parsing). In the documentation for C stubs we could add that Int64_val and Val_int64 also work for unsigned types (or just add another version Uint64_val that adds the cast/reinterpretation); the same goes for @unboxed in external declarations.
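For the string conversions in that list, the existing stdlib already covers the unsigned interpretation of int64: Printf's %Lu conversion prints the bits as an unsigned decimal, and Int64.of_string accepts a 0u prefix for values above max_int. A small sketch (the wrapper names are hypothetical):

```ocaml
(* Print an int64, interpreting its bits as an unsigned 64-bit integer.
   %Lu is the existing Printf conversion for unsigned int64 output. *)
let unsigned_to_string (x : int64) : string = Printf.sprintf "%Lu" x

(* Parse an unsigned decimal into an int64, reusing the existing "0u"
   prefix that Int64.of_string understands. *)
let unsigned_of_string (s : string) : int64 = Int64.of_string ("0u" ^ s)
```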

@xavierleroy

xavierleroy Jun 27, 2017

Contributor

To amplify @bobot's point: I tend to think of fixed-width integer types (int32, int64) as bit vectors that can be used as signed integers, unsigned integers, or tuples of bits, depending on the operations applied to them. That's an assembly-level view on integers, but in the end I find it less confusing than C's attempts at separating signed integers (viewed mostly as integers, with undefined overflows, etc) and unsigned integers (viewed mostly as bit-vectors).

The dividing line between the two viewpoints (uint32 as a different type than int32 vs. int32 with signed and unsigned interpretations) is the behavior of generic comparisons. With two different types, proper unsigned comparison can be implemented. With a single type, it's signed comparison by default.

@bluddy

bluddy Jun 27, 2017

But for most functions, working with signed or unsigned types is identical. So instead of copying all the machinery for the unsigned types, it is possible to just add the functions which differ in behaviour (Xavier's list + to_string + of_string + parsing).

It seems odd to insist on not adding a type for some safety when doing this stuff. Having to operate with just the right functions, but having no types to make sure you didn't accidentally use signed compare when you needed unsigned compare, or that an unsigned number didn't accidentally leak into a piece of code that expects a signed number, is very far from optimal OCaml design. There's a reason people have created these types already in user libraries.

The next argument is whether this type should be recognized within the compiler. I think the PR description lays out a good case for doing that. In fact, I just realized that everything I've said here is just rehashing the PR's original argument, so I'll just summarize my addition to that argument as: Unsigned types exist in the world whether we like it or not. OCaml isn't going to change that, and OCaml's success depends on how well it can interface with the rest of the world (specifically the C and network protocol world). We must interact with these types one way or another, so let's do it properly, so that our users can leverage these external libraries as efficiently and easily as possible.

@avsm

avsm Jun 27, 2017

Member

Unsigned types exist in the world whether we like it or not. OCaml isn't going to change that, and OCaml's success depends on how well it can interface with the rest of the world (specifically the C and network protocol world). We must interact with these types one way or another, so let's do it properly, so that our users can leverage these external libraries as efficiently and easily as possible

While these points are fine and valid, they are not an argument for having full unsigned types in the compiler. A small external library (like uint) would also work fine for many purposes, if the necessary unsigned primitives were present in the compiler.

From @yallop's usecase list above, the first three (interfaces to existing uses of unsigned integers in the wild) would work fine with an external library as well. The main difference will be in an ugly corner of the language: how this external library interacts gracefully with polymorphic comparisons.

@bluddy

bluddy Jun 27, 2017

While these points are fine and valid, they are not an argument for having full unsigned types in the compiler. A small external library (like uint) would also work fine for many purposes, if the necessary unsigned primitives were present in the compiler.

Rather than rehashing @yallop's points about this, I referred above to the fact that the PR already makes a good case for integration into the compiler:

Unsigned integers as a library?

It is possible to give a fairly convincing implementation of unsigned integers as a library -- for example, the integers and stdint libraries both provide signed and unsigned integer types of various widths. Similarly, it would also be possible to implement the standard int32 and int64 types as libraries, rather than including them in the standard distribution.

But there are some rather compelling advantages to making both signed and unsigned integers as built-in types:

  • using built-in types enables numerous optimizations
    • operations can be implemented as primitives, not C calls
    • the compiler can unbox arguments to foreign functions
    • the compiler can unbox local values
    • the compiler can simplify arithmetic expressions
  • integration with other parts of the standard distribution, including
    • format strings
    • bigarrays
@yallop

This comment has been minimized.

Show comment
Hide comment
@yallop

yallop Jun 27, 2017

Member

The dividing line between the two viewpoints (uint32 as a different type than int32 vs. int32 with signed and unsigned interpretations) is the behavior of generic comparisons. With two different types, proper unsigned comparison can be implemented. With a single type, it's signed comparison by default.

Yes, comparison is the crux of the matter. If it weren't for polymorphic comparison it'd be possible to give a reasonably convincing implementation of unsigned integers as a pure library. (Of course, it'd still be convenient to have literal syntax for unsigned integers in the compiler, along with a built-in type.)

My long-term vision for this feature is as follows:

  • the uintN and intN types share a representation
  • the IntN and UintN modules share most of their implementation
  • polymorphic comparison is replaced by abstraction-preserving operations that dispatch on type, not on representation
  • consequently, most of the code in this PR disappears, but the interface is completely unchanged

So I agree with Xavier that there's currently a disproportionately large cost to implementing this feature. In my view it's worth adding the feature now despite that cost, since we can eliminate the cost in the future, and enjoy the benefits in the meantime.
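To make the comparison point concrete, here is a small sketch (not code from the PR) showing that the same int32 bit pattern orders differently under the built-in signed comparison and under an unsigned reading of the bits:

```ocaml
(* 0xFFFF_FFFFl is -1l as a signed int32, but 4294967295 read unsigned,
   so the two interpretations disagree about its order relative to 1l. *)
let unsigned_compare x y =
  Int32.(compare (sub x min_int) (sub y min_int))

let () =
  let a = 0xFFFF_FFFFl and b = 1l in
  assert (Int32.compare a b < 0);     (* signed order: a = -1 < 1 *)
  assert (unsigned_compare a b > 0);  (* unsigned order: a = 2^32-1 > 1 *)
  print_endline "ok"
```

With a single shared type, the built-in (and polymorphic) comparison necessarily picks one of these two orders; with two distinct types, each can get the right one.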


@bluddy


bluddy Jun 27, 2017

polymorphic comparison is replaced by abstraction-preserving operations that dispatch on type, not on representation

Is this wishful thinking, a reference to modular implicits, or something else?


@yallop


yallop Jun 28, 2017

Member

Is this wishful thinking, a reference to modular implicits, or something else?

It's just what it says. Polymorphic comparison is a pragmatic solution to a real problem, but it also complicates the runtime and breaks abstraction. In the long term there would be many advantages to replacing polymorphic comparison with an approach that doesn't break abstraction. One advantage is directly relevant to this PR: if ordering were not tied to representation then it would become possible to share representations between types with different orderings.
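One way to picture "dispatch on type, not on representation" (purely illustrative; the module and function names below are invented for this sketch) is to pass the ordering explicitly as a first-class module, so two types sharing the int32 representation can still be ordered differently:

```ocaml
(* The comparison used is chosen by the module supplied at the call site,
   i.e. by the type's abstraction, not by inspecting the value's bits. *)
module type ORD = sig type t val compare : t -> t -> int end

module Int32_signed : ORD with type t = int32 = struct
  type t = int32
  let compare = Int32.compare
end

module Int32_unsigned : ORD with type t = int32 = struct
  type t = int32  (* same representation as Int32_signed.t *)
  let compare x y = Int32.(compare (sub x min_int) (sub y min_int))
end

let compare_by (type a) (module M : ORD with type t = a) (x : a) (y : a) =
  M.compare x y

let () =
  assert (compare_by (module Int32_signed) 0xFFFF_FFFFl 1l < 0);
  assert (compare_by (module Int32_unsigned) 0xFFFF_FFFFl 1l > 0);
  print_endline "ok"
```

Under such a scheme the runtime no longer needs distinct tags for types that differ only in their ordering, which is the sharing the comment envisions.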


@bluddy


bluddy Jun 28, 2017

It's just what it says. Polymorphic comparison is a pragmatic solution to a real problem, but it also complicates the runtime and breaks abstraction. In the long term there would be many advantages to replacing polymorphic comparison with an approach that doesn't break abstraction. One advantage is directly relevant to this PR: if ordering were not tied to representation then it would become possible to share representations between types with different orderings.

I don't disagree -- I think polymorphic comparison is absolutely terrible. It's the weakest part of the language by far, causing subtle bugs that can lurk in OCaml code for years. I'm just not sure why you think it will ever be removed, or whether you think it will. Are you saying (a) 'if we ever hypothetically get rid of polymorphic comparison, this will still be valid', or (b) 'we're clearly going to get rid of polymorphic comparison soon-ish, and this will still be valid when that happens'?

Even with modular implicits integrated (which will hopefully happen within the next 5 years), I don't see polymorphic comparison going away, for backwards compatibility reasons.


@yallop


yallop Jun 28, 2017

Member

I'd prefer not to fill this thread with pointless speculation about when future changes to the language might occur. I think my last two comments are clear enough about the relationship between polymorphic comparison and this PR. Feel free to contact me offline if you need further clarification.


@toots


toots Jul 31, 2017

Hey guys. It's a very long conversation but I'd like to add my $.02. I was recently working on interfacing OCaml with a DB library, with the purpose of packing and unpacking bit-wise values as fast as possible. Although, ideally, one would like to equate integers in computer code with integers in mathematical equations, the reality is that that is not how they are actually implemented. It's not just C; it goes down to the numerical operations and the representation of negative values. When manipulating bit-wise elements, this can very easily blow up in your face (as it did for me) without much warning. Having the possibility to work in a type-safe unsigned environment is actually a big plus and, frankly, a big benefit of the typing system. Would love to see this PR merged one day ;-)


@marco-m


marco-m Aug 24, 2017

Hello, any news?


@damiendoligez damiendoligez added this to the long-term milestone Sep 25, 2017

@talex5 talex5 referenced this pull request in capnproto/capnp-ocaml Oct 3, 2017

Open

port from ocaml-uint to ocaml-stdint #9

@talex5


talex5 Oct 3, 2017

As it seems this won't be merged soon, could someone state the currently preferred alternative?

It seems that currently I must choose between:

These all appear to be incompatible with each other. None of the external libraries works with mirage currently. Should I try to add mirage support to one of them?

capnp doesn't need to do much with these values, so I could create my own Capnp.UInt64 module that doesn't use any C stubs. However, that would mean that the ordering would be wrong, unless perhaps I add and subtract some offset when converting between my type and int64. Also, it would mean creating yet another type, different to all the existing ones.

What is the recommended approach here?


@dbuenzli


dbuenzli Oct 3, 2017

Contributor

@talex5 In general, if I need to expose these things at the API level, I simply define type uint64 = int64 somewhere in the API to clearly document the signatures, and don't do anything more about it.

Regarding wrong ordering: in which context do you need ordering? If you are relying on polymorphic equality in the library, then I think you should get rid of that idea altogether. If you need comparison somewhere, use (as linked above):

let uint64_compare a b = Int64.(compare (sub a min_int) (sub b min_int))

(I still think it would be better to have a Uint64.t type, though.)
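A brief gloss on why this bias trick works (my note, not part of the comment): subtracting min_int flips the sign bit, which maps the unsigned order of the 64-bit patterns monotonically onto the signed order. For example:

```ocaml
(* Unsigned comparison via the sign-bit flip: unsigned 0 .. 2^64-1
   maps to signed min_int .. max_int, preserving order. *)
let uint64_compare a b = Int64.(compare (sub a min_int) (sub b min_int))

let () =
  (* -1L is the all-ones pattern, i.e. 2^64 - 1 read unsigned. *)
  assert (uint64_compare (-1L) 0L > 0);
  assert (uint64_compare 0L 1L < 0);
  (* min_int's bit pattern is 2^63, which exceeds max_int's 2^63 - 1. *)
  assert (uint64_compare Int64.min_int Int64.max_int > 0);
  print_endline "ok"
```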


@xavierleroy


xavierleroy Oct 3, 2017

Contributor

I think that at a minimum we should do what Java 8 does and provide div_unsigned, mod_unsigned and compare_unsigned functions in modules Int32, Int64 and Nativeint. There are portable implementations of these operations in terms of the existing signed operations, but those portable implementations are not obvious, hence worth adding to the stdlib. If we really want we can also implement these as C primitives or even as processor instructions, but this can wait until there's a need.

I am still reluctant to add "true" unsigned integers to the standard library, for the same reasons why Java eschews unsigned integer types: most (C and C++) programmers don't understand the subtle differences between unsigned and signed integers, and it's a lot of code to implement unsigned integers properly.
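For illustration only -- a sketch of one portable approach in the style of Java 8's Long.divideUnsigned, not code from this PR -- unsigned comparison and division for Int64 can be built from the existing signed operations:

```ocaml
(* Unsigned comparison: subtracting min_int flips the sign bit, mapping
   unsigned order onto signed order. *)
let unsigned_compare n m =
  Int64.(compare (sub n min_int) (sub m min_int))

(* Portable unsigned division via halving: 2 * ((n >>> 1) / d) is within
   one of the true unsigned quotient, so correct it with the remainder.
   A divisor with its top bit set forces a quotient of 0 or 1. *)
let unsigned_div n d =
  if Int64.compare d 0L < 0 then
    (if unsigned_compare n d < 0 then 0L else 1L)
  else begin
    let q = Int64.(shift_left (div (shift_right_logical n 1) d) 1) in
    let r = Int64.(sub n (mul q d)) in
    if unsigned_compare r d >= 0 then Int64.succ q else q
  end

let () =
  assert (unsigned_div (-1L) 2L = Int64.max_int);  (* (2^64-1)/2 = 2^63-1 *)
  assert (unsigned_div 10L 3L = 3L);
  print_endline "ok"
```

As the comment notes, the non-obviousness of exactly this kind of derivation is the argument for putting it in the stdlib rather than leaving each user to rediscover it.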


@talex5


talex5 Oct 4, 2017

@dbuenzli Internally, capnp uses the type for IDs, where ordering doesn't matter. However, uint64 is also a type that can be used in messages (https://capnproto.org/language.html#built-in-types). If e.g. a Python application sends a large uint64, I'd like it to still be a large int when it arrives at the OCaml side, etc. I don't think there's anything I can do to stop people comparing them with the built-in operators (and, annoyingly, it will work most of the time so it's hard to spot).


@talex5 talex5 referenced this pull request in ocaml/opam-repository Jan 10, 2018

Merged

Package capnp-rpc.0.3.1 #11206
