Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Printf: alternative format for integers #1182

Merged
merged 4 commits into from May 29, 2018

Conversation

Projects
None yet
@ygrek
Copy link
Contributor

commented May 25, 2017

OCaml lexer and Scanf support inputting integers with _ delimiter, which is useful when inputting large numbers. Alas, there is no easy way to output integers with such delimiter.
This patch implements alternative format for integers %#d %#i %#u to do just that, separating groups of 3 digits :

# Printf.printf "%#d %#i\n" 1234 ~-123456;;
1_234 -123_456

One corner case :

# Printf.printf "%#09d\n" 1234 ;;
00001_234

but that seems consistent with definition of padding.

PS The implementation of inserting delimiter doesn't make any (even reasonable) assumptions about behaviour of C printf, at the cost of being not very efficient..

@dra27

This comment has been minimized.

Copy link
Contributor

commented May 25, 2017

Thanks for this interesting proposal! I agree that the handling of padding is logical (especially when considering "%# 9d)

Two thoughts:

  1. Is it worth generalising the 3. I haven't thought about it in great detail, but I don't think having an optional number before the # creates ambiguity so Printf.sprintf "%2#9d" 1234 = "000012_34"
  2. The int32, int64 and nativeint versions should certainly support this (#ld, #li, #lu, #nd, #ni, #nu, #Ld, #Li and #Lu) and really shouldn't #x, #X, #o, #lx, #lX, #lo, #Lx, #LX, #Lo, #nx, #nX and #no?

Thanks for including a test, though the documentation also needs updating (and a Changes entry) - though I'd wait on further opinions on the change before spending time on it.

I'm for this, as long is gets at least the int32, int64 and nativeint support for the three cases already implemented; I think supporting the grouping in hex and octal output is a good idea and I'm entirely easy on the suggestion of being able to alter the size of the grouping (the only possible use for that might be when grouping hex, where one might prefer a default group size of 4?)

@dra27

This comment has been minimized.

Copy link
Contributor

commented May 25, 2017

Ah, re-reading the code I see that the int32, int64 and nativeint cases are already there: good!

I also forgot that "%#X" means something else already. Perhaps another flag entirely?

@gasche
Copy link
Member

left a comment

I don't see a test for the padding case, nor where it is handled in the implementation, could you add a testcase? As you mention, I agree that the right property would be for the string produced by %#09d to be 9 characters wide when the number fits in 9 digits.

done;
Buffer.contents buf
| _ -> s

This comment has been minimized.

Copy link
@gasche

gasche May 25, 2017

Member

Buffer-elimination challenge accepted:

let add_spaces str =
  let inlen = String.length str in
  let outlen =
    inlen * 4 / 3 - (if inlen mod 3 = 0 then 1 else 0)
  in
  String.init
    outlen
    (fun i ->
       let outidx = outlen - 1 - i in
       let offset = outidx mod 4 in
       if offset = 3 then '_'
       else
         let inidx = outidx / 4 * 3 + offset in
         str.[inlen - 1 - inidx])

P.S.: I tested this code with

Array.init 30 (fun i ->
  let s = string_of_int (truncate (exp (float i))) in
  (s, add_spaces s));;

This comment has been minimized.

Copy link
@ygrek

ygrek May 25, 2017

Author Contributor

I was considering this but I was not sure what C printf is allowed to do. Another option is to do string_of_int on each triplet separately.

This comment has been minimized.

Copy link
@gasche

gasche May 25, 2017

Member

I think your idea of post-processing the C output is just fine. My suggestion remains in this style (as a string -> string function to apply on the returned value), but it allocates one less copy of the string. (I think some minds that work differently than mine may even consider it more readable.)

This comment has been minimized.

Copy link
@ygrek

ygrek May 25, 2017

Author Contributor

Right, more precisely I meant that my implementation guarantees delimiter is between groups of three digits. There are two cases : sign in front and hypothetical case of printf inserting something in the middle of string (I have no idea if this can really happen in any locale?). Though in the latter case any implementation will be not pretty, but first needs to be addressed :

# add_spaces "-123";;
- : string = "-_123"

This comment has been minimized.

Copy link
@gasche

gasche May 25, 2017

Member

Indeed, I see. Let's keep your implementation then. But still, do you actually respect the padding length specification? (Not that my proposal above would have helped.)

@xavierleroy

This comment has been minimized.

Copy link
Contributor

commented May 25, 2017

This is a nice idea, but please do keep in mind that grouping digits by 3 is a European convention: many Asian countries use groups of 4 digits (China, Japan) or something more complicated (India). So, you have a tricky internationalization problem here.

@dra27

This comment has been minimized.

Copy link
Contributor

commented May 25, 2017

@xavierleroy - to your knowledge, would %x_y#d be sufficient (so Indian would be Printf.sprintf "%2_3#d" 500000 = "5_00_000")?

@gasche

This comment has been minimized.

Copy link
Member

commented May 25, 2017

Be careful with the fact that changing the format of format conversions is tricky. # is syntax that already exists (and is correctly parsed) for all conversions, so @ygrek was able to implement this behavior without changing the format parsing (or typing) machinery at all, which made life simple. Things that do not fit the current conversion format could be sensibly more work.

@ygrek

This comment has been minimized.

Copy link
Contributor Author

commented May 25, 2017

@gasche is right, changing format's format and typing is likely beyond my skills.
@xavierleroy considering that we were stuck with latin1 in runtime for so long :)
I can propose solving this at compiler configuration time or as runtime switch (via Sys and/or OCAMLRUNPARAM), if deemed really needed.

@ygrek

This comment has been minimized.

Copy link
Contributor Author

commented May 25, 2017

@gasche @dra27 added tests for padding, it is handled on the next level after formatting the digits so I didn't need to do anything special about it.

@xavierleroy

This comment has been minimized.

Copy link
Contributor

commented May 26, 2017

@dra27 : the Indian style xx_xx_xxx (crores / lakhs / thousands) is the most complicated I know of, but I'm definitely no expert in non-European scripts! I hope OCaml users from Asia will chime in and tell us what they would actually use in their programs.

I share @gasche's worries about further complicating the syntax of format strings.

@bobzhang

This comment has been minimized.

Copy link
Member

commented May 26, 2017

I can confirm in China, we use such convention 1_0000_0000

@ygrek

This comment has been minimized.

Copy link
Contributor Author

commented May 26, 2017

http://www.statisticalconsultants.co.nz/blog/how-the-world-separates-its-digits.html

Add to Sys or Printf :

type int_alt_format = Int_fmt_none | Int_fmt_west | Int_fmt_china | Int_fmt_india

(** sets type of Printf alternative int formatting (e.g. %#d) *)
val set_alt_int_format : int_alt_format -> unit

Works?

@dbuenzli

This comment has been minimized.

Copy link
Contributor

commented May 26, 2017

val set_alt_int_format : int_alt_format -> unit
Works?

Please no. This kind of global state leads to nightmares.

Frankly if you want to print these digits for a locale you should rather use a proper localization definition with appropriate data. However since this is to write OCaml literals and not properly localized data and with all due respect to number systems of the world it feels a bit out of scope to me, why simply not standardize on SI prefixes (i.e. 3 as in the proposal) ?

@ygrek

This comment has been minimized.

Copy link
Contributor Author

commented Jun 23, 2017

ftr my work here is done at this point, waiting for review/decision

@damiendoligez

This comment has been minimized.

Copy link
Member

commented Jun 26, 2017

About the various formats: I don't think this level of internationalization has its place in the standard library. It's the kind of thing that can be done much more thoroughly by an external library.

@pierreweis

This comment has been minimized.

Copy link
Contributor

commented Jun 29, 2017

@mshinwell

This comment has been minimized.

Copy link
Contributor

commented Aug 9, 2017

Right, so it looks like the decision to be made is: either reject this pull request, or include it in the current form, with support for SI groups of 3 numbers only. Which are we going to do?

@diml

This comment has been minimized.

Copy link
Member

commented Aug 9, 2017

Personally I'm in favor of this PR in its current form. I just find it unfortunate that # means something completely different for x/o and d/u/i. Inserting underscores seems much more interesting to me than the current behavior of %#x and %#o, especially since they follow C lexical conventions rather than OCaml ones.

@damiendoligez damiendoligez added this to the 4.07-or-later milestone Sep 29, 2017

@alainfrisch alainfrisch added the stdlib label Feb 28, 2018

@ygrek

This comment has been minimized.

Copy link
Contributor Author

commented Apr 14, 2018

this doesn't go into 4.07? 😢

@gasche

gasche approved these changes Apr 14, 2018

Copy link
Member

left a comment

We could at least considering merging in trunk -- my vague impression is that there is consensus on grouping-by-3.

@damiendoligez damiendoligez removed this from the consider-for-4.07 milestone May 28, 2018

@damiendoligez

This comment has been minimized.

Copy link
Member

commented May 28, 2018

Right. Grouping by 3 is simple enough, and serious internationalization can be done by an external library.

@gasche

This comment has been minimized.

Copy link
Member

commented May 28, 2018

@ygrek apologies for the delay, would you have the time to rebase the PR so that I can merge it in trunk?

@ygrek ygrek force-pushed the ygrek:printf_alt_int branch from 0e4f92a to 3a97858 May 29, 2018

@ygrek

This comment has been minimized.

Copy link
Contributor Author

commented May 29, 2018

done

@gasche gasche merged commit 148e811 into ocaml:trunk May 29, 2018

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@ygrek ygrek deleted the ygrek:printf_alt_int branch Jun 3, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.