A new format implementation based on GADTs #6017

vicuna · 2013-05-18T00:39:44Z

Original bug ID: 6017
Reporter: bvaugon
Assigned to: @gasche
Status: closed (set by @xavierleroy on 2017-02-16T14:14:38Z)
Resolution: fixed
Priority: high
Severity: feature
Version: 4.00.1
Target version: 4.02.0+dev
Fixed in version: 4.02.0+dev
Category: standard library
Tags: patch
Related to: #5949 #6286 #6297 #7034
Monitored by: ibnfirnas @gasche @lpw25 mcclurmc @diml "Pascal Cuoq" bobot @chambart @jmeber @hcarty @yakobowski @alainfrisch

Bug description

I wrote a new implementation of formats unsing GADTs.

It contains modifications of Printf, Format and Scanf standard library
modules, and OCaml typer adaptations (principally in typing/typecore.ml).

In this new implementation, formats are compiled into GADTs trees by
the compiler and are "evaluated" by the printing and scanning functions.
The main advantages are :

No longer use of Obj.magic in printf.ml, scanf.ml and format.ml.
Better performances.
Stronger static verifications of format validity (flag compatibilities, ...).
Factorisation of format parsing in a single code (for all the stdlib
and the OCaml compiler!) to ensure compatibility.

This new version is "mostly compatible" with the old-one. Differencies are :

Some inconsistent formats are now statically rejected, like
"%1.1s", "%#s", "%0c" (I had to slightly adapt the testsuite,
principally to comment absud tests which no longer compile, so, obviously,
I did not remove any test which still compile).
Some bug fixes :
- The semantic of (printf "%{...%}") (from comments of the testsuite).
- The semantic of (scanf "%[^...]@x").
- The '@%' escape syntax, like in (Format.printf "@%d%s" 42 "hello").
- The managment of nested "%{...%}" and "%(...%)".
- Buffer overflows dues to too big padding (or precision).
- ...
To obtain compatibility of Printf and Format formats, some '@' usages
with an "invalid Format-like syntax" but a "valid Printf-like syntax"
are now accepted by Format printing functions and literally printed.
It is a small disadvantage because it is slightly less convenient to
learn Format "formatting info" syntax by practice, sorry.

Since the OCaml standard library and the OCaml compilers use formats, the
bootstrap is a bit complicated.

I attach two patches (ocaml-4.00.1_new-formats_patch-{1,2}.diff) and
instructions to apply them into "howto-apply-new-formats-patches.txt".
This .txt file can be used (with a slight adaptation from your config.)
as an installation shell-script.

Please do not hesitate to send me any comments or questions about this code.

Benoît Vaugon.

File attachments

vicuna · 2013-05-18T07:23:13Z

Comment author: @gasche

While I have not yet reviewed the implementation in detail, I have
been in contact with Benoît during the polishing part of this
considerable work, which I happily support. Here are some general
comments on the general design to help people get an understanding of
this change. Apologies in advance for any inaccuracy.

The change can be summarized as this: instead of being kept as
a format string all the way to runtime use of the format (where it is
parsed by the printf/scanf/formatf) function, it is replaced by a GADT
value that describes the format (with a subtle enough typing to mirror
the way phantom types are currently used in the format6 type). Instead
of re-parsing the format at runtime, the GADT value is matched over
the produce the desired semantics.

Below is a more detailed high-level comment.

The current way OCaml handles format strings can be roughly summarized
as follows:

there is no syntactic difference between formats and string literals
the type inference engines handles string literals as having an
ambiguous type (either [string] or [(...) format6]), which is
propagated through inference, and possibly forced by other operators
that do expect either a string or a format (printf/scanf/format,
format_of_string...); the type parameters of the [(...) format6]
type are determined by parsing the format to determine the number
and type of its expected parameters, etc.
At runtime, the functions in Printf/Scanf/Format parse the format
string for a second time, printing or reading as it goes. This part
is full of unsafe code (Obj.magic) to get an undetermined number of
parameters, etc. The details of how advanced features of the format
work are often quite horrible; for example, advanced number
formatting options (padding, decimal, etc.) are interpreted by
calling the C [sprintf] function, but those formats are sometimes
rewritten instead of being passed as is (to handle advanced feature
such as "%.*d which allows to pass the decimal length as a dynamic
parameter, implemented as rewriting the format passed to [sprintf]
at runtime).

The duplicated parsing, once at type-checking time and once at
runtime, is source of numerous causes of bugs: some formats that are
ill-supported at runtime are accepted by the type-checker. The use of
unsafe Obj.magic is also a cause of bugs. Several segfault bugs have
been found in the last few months, and Benoît has found some more
during his reimplementation work.

The change by Benoît turns format handling into the following picture:

still no syntactic treatment of format strings (this is to be
constrasted with the various Camlp4 extensions implementing custom
formats by adding a custom lexical syntax for formats; which is
probably the right design after all, but cannot be adopted here for
compatibility reasons)
at typing time, the (tricky) inference part is unchanged; but when
a string is determined to be a format, it is parsed in the same way
as before (with some bugfixes "en passant" from Benoît), but this
information is used to construct a term at a GADT type ([(...)
format6] is defined as a GADT), which has precisely the expected
type. Here is an choice piece of the new format6 declaration:
- | Float : (* %[feEgGF] *)
- ```
 float_conv * ('x, 'y) padding * ('y, float -> 'a) precision *
```
- ```
 ('a, 'b, 'c, 'd, 'e, 'f) format6 ->
```
- ```
   ('x, 'b, 'c, 'd, 'e, 'f) format6
```
- | Bool : (* %[bB] *)
- ```
 ('a, 'b, 'c, 'd, 'e, 'f) format6 ->
```
- ```
   (bool -> 'a, 'b, 'c, 'd, 'e, 'f) format6
```
- | Flush : (* %! *)
- ```
 ('a, 'b, 'c, 'd, 'e, 'f) format6 ->
```
- ```
   ('a, 'b, 'c, 'd, 'e, 'f) format6
```
- | String_literal : (* abc *)
- ```
 string * ('a, 'b, 'c, 'd, 'e, 'f) format6 ->
```
- ```
   ('a, 'b, 'c, 'd, 'e, 'f) format6
```
[..]
- | End_of_format :
- ```
   ('f, 'b, 'c, 'e, 'e, 'f) format6
```
The format "this is %b" will be transformed into the following GADT
value (approximatively):
String_literal ("this is ", Bool (End_of_format))
at runtime, there is no format string to be parsed again: the
printf/scanf/format implementation can directly interpret the GADT
description of the format. Thanks to the rich GADT typing, there
is no need for unsafe features anymore: Obj.magic is gone. An
amusing aspect of the change is that
[string_of_format : (...) format6 -> string] was previously the
identity (as the runtime representation of formats was precisely
the string), and is now a function "unparsing" the GADT value:

+let bprint_format buf fmt =
- let rec fmtiter : type a b c d e f .
- ```
 (a, b, c, d, e, f) format6 -> bool -> unit =
```
- fun fmt ign_flag -> match fmt with
- | String (pad, rest) ->
- ```
 buffer_add_char buf '%'; bprint_ignored_flag buf ign_flag;
```
- ```
 bprint_padding buf pad; buffer_add_char buf 's';
```
- ```
 fmtiter rest false;
```
- | Caml_string (pad, rest) ->
- ```
 buffer_add_char buf '%'; bprint_ignored_flag buf ign_flag;
```
- ```
 bprint_padding buf pad; buffer_add_char buf 'S';
```
- ```
 fmtiter rest false;
```
- | Int (iconv, pad, prec, rest) ->
- ```
 bprint_int_fmt buf ign_flag iconv pad prec;
```
- ```
 fmtiter rest false;
```

I believe this change is highly desirable: it rationalizes the runtime
handling of formats, removes a heap of bugs, documents a pile of
obscure features, and has as its core an elegant use of GADTs.

However, due to the large number of features in the OCaml format
types, the patch is not small. The [format6] GADT has 23
constructors, each handling an orthogonal feature of format. Handling
all this stuff makes for a very large patch: in total, 1835 lines are
removed (mostly the runtime handling of format), and 3304 lines are
added. I find the code well-written and easy to read, but its review
and discussion may still take a bit of time.

vicuna · 2013-05-22T13:29:28Z

Comment author: @yallop

I like this idea a lot. If I'm not mistaken, format could now be written entirely as a library, without special compiler support, using ppx or camlp4 to turn the strings into a GADT tree. (Of course, we'll want to keep format around for backwards compatibility).

vicuna · 2013-05-22T15:04:42Z

Comment author: @alainfrisch

This is an impressive piece of work, which I'd like to see integrated at some point. It probably deserves to be tested in the wild, on large code bases. I know there is some reluctance against using recent additions to the language within the core compiler or stdlib. What about integrating the patch on the trunk after the branch for the next release has been created? This would make it easy to test it, while providing a full release cycle before it is officially release, and a release branch for bug fix releases, if required.

vicuna · 2013-05-22T16:02:50Z

Comment author: @johnwhitington

This looks great.

What is the effect on the readability of type errors? An improvement? Worse? The same? It's rather hard to explain to students those complicated format4 / format6 types...

vicuna · 2013-05-23T00:11:18Z

Comment author: bvaugon

I confirm that typing errors with format types are difficult to understand (and to explain to students). Unfortunately, this patch does not change anything because the inferred types are the same, and the unification produces the same typing errors.

However, I have a suggestion to try to improve this problem about error messages containing format types, and I would like some outside opinions on this potential issue. What do you think to modify the type pretty-printer to display format types in this way:

"%d%s%c" format

rather than:

(int -> string -> char -> 'a,' b, 'c,' d, 'd, 'a) format6

We then get this kind of error message:

This expression has type "%d%s%c" format but an expression was expected of type "%d%d%c" format.

Unfortunately, all type parameters are not necessarily instantiated when generating the error message. These cases would require the display of format types like:

"%d%?%?" format

or maybe:

"%d%['a]%['c]" format

to name type parameters.

Does someone have an opinion or a better idea ?

vicuna · 2013-05-23T08:22:39Z

Comment author: @gasche

I'd rather not mix a discussion about a change to the runtime with new feature ideas about printf/scanf/format. My gut feeling would be to first discuss and handle what can be done purely as an internal change, with no modification to the observable behavior. Then, once we know for sure the status of the change, we can act on it and have room for discussion of observable changes based on this experience.

If you think there is value in having these other discussions right now rather than in a few weeks/months, I would suggest a separate bugtracker issue.

vicuna · 2013-05-23T11:40:57Z

Comment author: bobot

That's a great work. I was wondering if the author can think of ways for improving Format.ikfprintf, Format.ifprintf, Printf.ifprintf:

let debug = ref false

let dprintf fmt =
if !debug
then Format.fprintf Format.err_formatter fmt
else Format.ifprintf Format.err_formatter fmt

Such a function is useful for printing conditionaly debugging information. With the old implementation the cost is quite high since even when debug is false the format string is parsed. With the new implementation that should be faster but there is some allocation.

Currently the best dprintf version that I know, consists to use specialize version without ifprintf:

let dprintf0 fmt =
if !debug
then Format.fprintf Format.err_formatter fmt

let dprintf1 fmt x1 =
if !debug
then Format.fprintf Format.err_formatter fmt x1

let dprintf2 fmt x1 x2 =
if !debug
then Format.fprintf Format.err_formatter fmt x1 x2

...

But the user must select the correct version.

Is it possible to reduce the cost of ifprintf? A specialized version of make_printf: (a, b, c, c, c, unit) CamlinternalFormatBasics.format6 -> a that construct nothing except closure? By adding the number of remaining argument in format6 for faster application for small arities?

Once again thank you for this great job,

vicuna · 2013-05-23T12:44:25Z

Comment author: @gasche

I'm not sure there is some allocation: the generated GADTs are constant immutable values, and there should therefore be pre-allocated by the compilers.

vicuna · 2013-05-23T13:53:22Z

Comment author: bvaugon

The GADT representing the format is not allocated by printing
functions, but an accumulator (an heterogeneous linked list
storing arguments to delay printing) and intermediate closures
are allocated each time {if,k,...}printf are called. I can easily
write an ifprintf which do not allocate the accumulator, but it
is more difficult to remove closure creations.

vicuna · 2013-05-24T10:35:21Z

Comment author: bvaugon

Oh, sorry, to write an ifprintf which allocates nothing (as in
bytecode and native version) if the number of arguments is less than a
bound N (3 in the following example) is simple with the new
implementation of formats. It looks like this:

type ('a, 'b) iprintf_acc =
| IPrintf_arg_0 : ('b, 'b) iprintf_acc
| IPrintf_arg_1 : ('a1 -> 'b, 'b) iprintf_acc
| IPrintf_arg_2 : ('a1 -> 'a2 -> 'b, 'b) iprintf_acc
| IPrintf_arg_3 : ('a1 -> 'a2 -> 'a3 -> 'b, 'b) iprintf_acc
| IPrintf_arg_N : ('aR, 'b) iprintf_acc -> ('ai -> 'aR, 'b) iprintf_acc

let rec fun_of_iprintf_acc : type a . (a, unit) iprintf_acc -> a =
fun acc -> match acc with
| IPrintf_arg_0 -> ()
| IPrintf_arg_1 -> iprintf_ignore_1
| IPrintf_arg_2 -> iprintf_ignore_2
| IPrintf_arg_3 -> iprintf_ignore_3
| IPrintf_arg_N rest -> fun _ -> fun_of_iprintf_acc rest
and iprintf_ignore_1 : type a1 . a1 -> unit = fun _ -> ()
and iprintf_ignore_2 : type a1 a2 . a1 -> a2 -> unit = fun _ _ -> ()
and iprintf_ignore_3 : type a1 a2 a3 . a1 -> a2 -> a3 -> unit = fun _ _ _ -> ()

let rec iprintf_acc_succ : type a b c .
(a, c -> b) iprintf_acc -> (a, b) iprintf_acc =
fun acc -> match acc with
| IPrintf_arg_0 -> IPrintf_arg_1
| IPrintf_arg_1 -> IPrintf_arg_2
| IPrintf_arg_2 -> IPrintf_arg_3
| IPrintf_arg_3 -> IPrintf_arg_N IPrintf_arg_3
| IPrintf_arg_N rest -> IPrintf_arg_N (iprintf_acc_succ rest)

let rec make_iprintf fmt = make_iprintf_aux IPrintf_arg_0 fmt
and make_iprintf_aux : type a b c x .
(x, a) iprintf_acc ->
(a, b, c, c, c, unit) CamlinternalFormatBasics.format6 -> x =
fun acc fmt -> match fmt with
| Char rest -> make_iprintf_aux (iprintf_acc_succ acc) rest
| Bool rest -> make_iprintf_aux (iprintf_acc_succ acc) rest
| String_literal (_, rest) -> make_iprintf_aux acc rest
| End_of_format -> fun_of_iprintf_acc acc
| _ -> assert false (* Other cases not yet implemented *)

let ifprintf _ fmt = make_iprintf fmt

However, if this code is integrated into the standard library, it may
be even increase the size of camlinternalFormat.ml. Does it matter?

vicuna · 2013-05-24T13:33:08Z

Comment author: bobot

Nice, you still need to look at all the formatter but at least it does just that.

I tried by adding the size of the formatter in format6 (your format6 is format6_aux):
type ('a, 'b, 'c, 'd, 'e, 'f) format6 =
| App0: ('f, 'b, 'c, 'd, 'e, 'f) format6_aux
-> ('f, 'b, 'c, 'd, 'e, 'f) format6
| App1: ('a1 -> 'f, 'b, 'c, 'd, 'e, 'f) format6_aux
-> ('a1 -> 'f, 'b, 'c, 'd, 'e, 'f) format6
| App2: ('a1 -> 'a2 -> 'f, 'b, 'c, 'd, 'e, 'f) format6_aux
-> ('a1 -> 'a2 -> 'f, 'b, 'c, 'd, 'e, 'f) format6
| App3: ('a1 -> 'a2 -> 'a3 -> 'f, 'b, 'c, 'd, 'e, 'f) format6_aux
-> ('a1 -> 'a2 -> 'a3 -> 'f, 'b, 'c, 'd, 'e, 'f) format6
| App4: ('a1 -> 'a2 -> 'a3 -> 'a4 -> 'f, 'b, 'c, 'd, 'e, 'f) format6_aux
-> ('a1 -> 'a2 -> 'a3 -> 'a4 -> 'f, 'b, 'c, 'd, 'e, 'f) format6
| Appn: ('a, 'b, 'c, 'd, 'e, 'f) format6_aux ->
('a, 'b, 'c, 'd, 'e, 'f) format6

The nice point is that ifprintf and printf, fprintf, ... can be implemented without any closure but you need one function as big as make_printf by arity.

...

let app4 : type a1 a2 a3 a4 b c d .
(b,c) print -> (b -> d) -> b ->
(a1 -> a2 -> a3 -> a4 -> d, b, c, c, c, d) format6_aux ->
a1 -> a2 -> a3 -> a4 -> d =
fun print k o fmt x1 x2 x3 x4 ->
begin match docst print o fmt with
| Bool fmt -> print.string o (string_of_bool x1); app3 print k o fmt x2 x3 x4
| Alpha fmt -> print.alpha o (x1 o x2); app2 print k o fmt x3 x4
| String_literal _ -> assert false (* docst should have take care of it *)
end

let make_print2 : type a b c d .
(b,c) print -> (b -> d) ->
(b -> (b, c) acc -> d) ->
b ->
(a, b, c, c, c, d) format6 -> a =
fun print k output o fmt ->
match fmt with
| App0 fmt -> app0 print k o fmt
| App1 fmt -> app1 print k o fmt
| App2 fmt -> app2 print k o fmt
| App3 fmt -> app3 print k o fmt
| App4 fmt -> app4 print k o fmt
| Appn fmt -> make_printf output o End_of_acc fmt

(all the code for just a few case is uploaded as special_case_for_small_arities.ml)

I can't think of a generic version of app that doesn't use a closure, except if the inliner can create all the version by itself with the following definitions:

let app_gen f1 f2 x2 x1 o = ... f1 o ... f2 x2 o ...

let app3 x3 x2 x1 o = app_gen (app1 x3) (app2 x3) x2 x1 o
let app4 x4 x3 x2 x1 o = app_gen (app2 x4 x3) (app3 x4 x3) x2 x1 o
let app5 x5 x3 x2 x1 o = app_gen (app3 x5 x4 x3) (app4 x5 x4 x3) x2 x1 o

vicuna · 2013-05-24T13:44:55Z

Comment author: bobot

Just another question about your patch:

Do you only need the type acc to start printing only when the application is complete? Is it specified in the documentation or is it for respecting the old behavior?

vicuna · 2013-05-24T15:11:41Z

Comment author: @gasche

If the question is "do you really need acc?", the answer is 'yes'. As it is the interpretation of the GADT values that does the consumption of the arguments one after the other, you need the accumulator to delay all printing actions until all arguments have been consumed.

Otherwise constructing the closure

let print_id = Printf.printf "%s: %d\n" Conf.i18n_id_name

would start printing stuff before the remaining %d argument is given. This is clearly not the correct semantics (because it would be printed only once, instead of each time print_id is applied, as expected by the user).

vicuna · 2013-05-25T10:12:15Z

Comment author: bobot

My second question was: Is this behavior specified somewhere in the documentation or is it only for respecting the old behavior (which must be done!)?

Gabriel, in another world where it is not specified if printf can start printing before complete application (it is also a possible semantics), you will write

let print_id id = Printf.printf "%s: %d\n" Conf.i18n_id_name id

and in this other world Benoît will implement a make_printf which process directly the argument and which is easier to inline. So its easier in this world to make printf as fast as using directly the output functions:

http://www.lexifi.com/blog/note-about-performance-printf-and-format

But we are not in this other world, so Benoît created the acc type, and we should hope that the improvement done in the inliner will be greater in this world than in the other.

I really like the patch proposed by Benoît and it already give speed-up and simplification, that's why I want to know if we can push it further.

vicuna · 2013-05-25T10:52:03Z

Comment author: @gasche

I am strongly opposed to any change of the semantics that would amount to printing starting before complete application. This is certain to break existing programs¹, and I don't think a patch doing that should be accepted.

¹: and not even some hitting some bug or misbehavior of the current implementation and not noticing it; some perfectly reasonable user programs.

There is in fact a similar problem with invalid_arg failures due to runtime-invalid format, that is done "wrong" in the patch. I'm writing a longer (too long) message about my first forays at patch review, and will write about this more than anyone (except maybe Benoît) wants to read :-)

Regarding performances, Benoît had made early measurements on a much simpler prototype, that shown that the performance it when compared to "using directly the output functions" was reasonable (around 10%-30%, iirc., compared to the several-times slowdown observed by Alain). Benoît, do you think you could update those small speed experiments for the current patch?

I would even be in favor of considering reverting some of the changes made by Alain on performance grounds, turning Format.printf into direct printers in the compiler code. But that will be evaluated and discussed with him in due time.

On a related note, replacing formats with direct printing function may result in large increase in code size. Benoît's expansion into a GADT preserves the compact code size we had with the previous one -- as opposed to some other translation schemes that produce large code closures. It is unclear that always inlining to the native functions will actually improve performances (I've often seen excessive -inline tunings results in slowdown) -- but that will be left to the tricky inliner heuristics to decide.

vicuna · 2013-05-25T11:15:45Z

Comment author: @gasche

I started to look at the details of the patch, and here is some early feedback. I have not done any
kind of exhaustive reviewing so far (reading code linearly), but only looking at the big picture and
specific possible source of defects or improvements.

Organisation of the patch (explanation)

There is code specific to Scanf or Format, that is present in scanf.ml and format.ml
respectively. Most of the code is in fact common to all users of format, and it is either in a new
submodule of pervasives.ml, CamlinternalFormatBasics, or in a separate file
camlinternalFormat.ml. Printf is now only a very small shell over CamlinternalFormat
(defining simple printer continuations).

The split between CamlinternalFormatBasics in pervasives.ml on one side, and camlinternalFormat.ml
on the other, is determined by what is exported by Pervasives regarding format:

the [(...) format6] type and related aliases
the format concatenation operator [(^^)]
the [string_of_format] function

All the code needed to implement these is in pervasives.ml, all the rest is in
camlinternalFormat.ml.

Size of the addition to pervasives.ml (change attempt, no change)

The code added to pervasives.ml is really long (850 lines, added to a file that was previously 460
lines long), and that's one of the not-pleasing aspects of the patch. I have been wondering about
how to reduce this part of the code.

Here is how each of the format-related export of pervasives.ml contributes to code size under the
new implementation:

the [(...) format6] type (and related aliases) whose new definition as a GADT is a mouthful
(about 350 lines), but an expected and interesting one
the [(^^^)] operator for concatenation of format, that is in fact quite easy to define with the
new GADT formulation: it only takes 41 lines
the [string_of_format] operator is the source of most of the code inflation, at almost 500 lines

In the previous implementation, formats are stored as strings until they are passed to the relevant
Printf/Scanf/Format functions, so [string_of_format] was just the identity. The new implementation
forgets about the string when it translates them to those fancy GADT values, and [string_of_format]
is now implemented as a derived operation, actually doing quite a lot of stuff.

One other possibility (that was used in some previous prototypes of Benoît, using functions instead
of GADT values) is to instead translate each format string into the pair (, ), so that [string_of_format] is just extraction of the preserved format string.

I tried to evaluate the possibility of reverting to this style, for the purpose of getting rid of
the format-printing code that is presently in [pervasives.ml]. I have not pushed the experimentation
to its limit, because the format printing logic is also used in other parts of the
implementation. Notably:

the most adventurous modifiers of the float printer %f are still implemented by generating
a format string to pass to C's [sprintf]; I think this is the right design choice as it can
preserve the previous semantics which much less surprise than a OCaml implementation of these
features would create
some of the more advanced format features, in particular %{foo%}, require some parts of format
printing at runtime (this means in particular that the format string cannot be stored only at the
toplevel of the format descrption, but some sub-strings may need to be stored in only some of the
intermediate nodes in the format description)
error messages may want to print relevant part of format string

So it appears the (, ) compilation scheme may allow to get rid of that
code in pervasives.ml, but the code would still need to be preserved somewhere else. Making
ambitious changes to an already ambitious patch just to move some chunks of code from one file to
another does not look like that good a deal.

Besides, there are arguable correctness benefit in keeping the format-string around
([string_of_format] is guaranteed to be correct, vs. the current implementation which may have bug),
but as format printing would still be used in other parts of the implementation anyway (in the
float part), we need to make sure the implementation is correct in any case. (I'm thinking of simply
fuzzing the format parser, if time allows.)

Ideas toward an efficient Format.ifprintf (change idea, no change)

Regarding the request of François, I considered moving the generated GADT value from the structure
(example with "foo%s%d%b")

String_literal("foo",
String(padding,
Int(conv, padding, precision,
Bool(
End_of_format)))

to something more like

Arg((Some(String_literal "foo"), String_tag, String_arg(padding)),
Arg((None, Int_tag, Int_arg(conv,padding,precision)),
Arg((None, Bool_tag, Bool_arg),
End(None))))

This changes makes the GADT type more complicated (but isomorphic to the previous one), and it is
unclear how to handle some of the more exotic format features (%(...%) and %{...%}). It does allow
an efficient yet simple ifprintf implementation

The problem is that it means a very large change to the current patch, which makes everything else
more complex (and slightly slower at runtime due to less efficient data representation). I don't
think it is worth it -- or at least it should be considered separately after the present
implementation and its principles have been reviewed and discussed.

Driving the point further: I think there are other ways to do what François want (preprocessing,
with camlp4 or ppx, to turn non-debug calls into nothing at all; or the killer inliner if it
comes someday), and we need not focus too much on ifprintf performance as a design pressure. Safety
of the implementation, absence of bugs, and performance in the common case are more important. It
would be nice if we could have that easily, but there is no point jumping through hoops
(making large changes or adding huge (non-linear), repetitive chunks of code) otherwise.

The problem with [invalid_arg] errors (bug, change required)

The proposed patches makes changes to the way format errors are handled that are problematic.

(First: some nonsensical formats where previously rejected at runtime and are now rejected at
compile-time (see the change to testsuite/tests/lib-printf for some examples); we probably need to
discuss this choice, but I think it is reasonable and I don't included it in the "problematic"
aspects discussed here.)

(1) The error messages are different from those of the previous implementation, for the worse:
previous message include an error position, and print the complete format string. I believe we
should preserve this behavior, but that is not trivial -- I believe it requires to include
position information in the gadt value generated at format parsing time, and "lazily" reprint
the format string when an error must be printed. I am not sure about the way
(new implementation's) Scanf handles invalid_arg failure by catching them to wrap them with more
information (I would need to be convinced that it works for format strings generated during the
scanning with %(..%) and friends, and would feel more confident with an implementation pushing
the relevant error information locally, instead of playing the exception re-raising game).

(2) The error message may happen before all the format arguments have been readed. This is precisely
an instance of the "eager output" behaviour that we were discussing with François in previous
messages. Consider for example the following program:

  let test = Printf.printf "%d%[a-z]%d" 1

The format %[a-z] is nonsensical as an argument to Printf (it only makes sense for Scanf), but
that is only detected at runtime², when the [printf] codes run on this format.

Note that there are missing arguments, so the previous semantics delays printing: it does
absolutely nothing and in particular no error is shown.

On the contrary, evaluating this program with Benoît's implementation fails with an
Invalid_argument error. I believe this semantics change is problematic and should be fixed.

(Internally, the exception is raised while building the "accumulator" that we were discussing
with François, so before any printing happens and before all arguments are consumed. I believe
the implementation should be changed to accumulate the errors as well, and raise them only once
the accumulator is traversed to effect the printing actions.)

Note that the following code exhibits a related but slightly different change in behavior:

  let test = Printf.printf "%d\n%!%[a-c]%d" 1 "foo" 3 in

This time, all arguments are given. The previous implementation will print 1, and then fail with
an Invalid_argument error. Benoît's implementation will fail with an Invalid_argument error
right away.

I believe that for Benoît's quite ambitious change to be acceptable, it should guarantee maximal
compatiblity with previous behaviour. Given the lack of specification of what happen in the edge
cases, I would find it acceptable that some obviously non-controversial change that may only
affect wrong programs be made -- note that the OCaml maintainers may not share that opinion. But
the changes presented here are not obviously non-controversial, they are making a different answer
to a non-obvious question. This is out of scope of the desirable internal Obj/GADT change, and
I think it should not be included in the present patch.

²: Benoît considered embedding static typing information about which formats are valid for all
Scanf/Printf/Format, and which are only valid for some of them. That would be a nice design, but we
don't think it is possible to do that without changing the involved types
(adding a phantom parameter), so he decided not to do that and I think that's the right choice -- we
are already discussing enough changes.

vicuna · 2013-05-26T09:03:23Z

Comment author: bobot

You can simply add this type on top of Benoît's format6 and have an efficient version for ifprintf, the problem of repeated code was only with make_printf when you want to remove the multiple closure. The old code has that kind of special case. Gabriel's solution has the good point to let the choice of the number of special case for the iprintf, but as Benoît's one it need to access all the format string.

type ('a, 'b, 'c, 'd, 'e, 'f) format6 =
| App0: ('f, 'b, 'c, 'd, 'e, 'f) format6_aux
-> ('f, 'b, 'c, 'd, 'e, 'f) format6
| App1: ('a1 -> 'f, 'b, 'c, 'd, 'e, 'f) format6_aux
-> ('a1 -> 'f, 'b, 'c, 'd, 'e, 'f) format6
| App2: ('a1 -> 'a2 -> 'f, 'b, 'c, 'd, 'e, 'f) format6_aux
-> ('a1 -> 'a2 -> 'f, 'b, 'c, 'd, 'e, 'f) format6
| App3: ('a1 -> 'a2 -> 'a3 -> 'f, 'b, 'c, 'd, 'e, 'f) format6_aux
-> ('a1 -> 'a2 -> 'a3 -> 'f, 'b, 'c, 'd, 'e, 'f) format6
| App4: ('a1 -> 'a2 -> 'a3 -> 'a4 -> 'f, 'b, 'c, 'd, 'e, 'f) format6_aux
-> ('a1 -> 'a2 -> 'a3 -> 'a4 -> 'f, 'b, 'c, 'd, 'e, 'f) format6
| Appn: ('a, 'b, 'c, 'd, 'e, 'f) format6_aux ->
('a, 'b, 'c, 'd, 'e, 'f) format6

(* make_printf or Benoît make_ifprintf *)

Gabriel can you explain:

some of the more advanced format features, in particular %{foo%}, require some parts of format printing at runtime (this means in particular that the format string cannot be stored only at the toplevel of the format descrption, but some sub-strings may need to be stored in only some of the intermediate nodes in the format description)

Why not only at toplevel? For make_printf, string_of_format is used on a whole format6. Another function need that?

vicuna · 2013-05-26T10:42:24Z

Comment author: @gasche

I see. I had probably not understood your proposal correctly the first time, and it's true that this looks rather simple.
string_of_format is called on line 615 of camlinternalFormat to support the behavior of printing "%{foo%}"-formats. I'm not sure what you mean by "it is used on a whole format6" but that it probably the case here. I only meant that you cannot take the current definition of format6, replace all recursion sites by format6_gadt, and redefine format6 as (string * format6_gadt): some places inside the format GADT will still need to refer to a "whole format6". But it may not be an issue for the implementation; I don't claim I grasp it fully yet, and Benoît can probably comment better on the upsides and downsides of the idea of keeping the pair to get rid of string_of_format.

vicuna · 2013-05-26T13:46:11Z

Comment author: @gasche

I wrote a small format fuzzer and quickly found out that string_of_format does not preserve the format strings. Here are some examples:

%L is printed back as %N
%b is printed back as %B
%@c is printed back as @c

I did make the necessary changes to fix the first of these (patch to be uploaded), but that amounts to polluting the GADT description of formats with details that are irrelevant to the semantics, and only meant to enable faithful re-printing of the format string.

I suspect the right design choice is to move back to a (, ) pair, as I previously proposed. Benoît, what do you think?

vicuna · 2013-05-26T14:29:58Z

Comment author: bvaugon

I thought that changing the semantics of programs which are fundamentally
wrong like:

Printf.printf "%d%[a-z]%d" 1

or

Printf.printf "%d\n%!%[a-c]%d" 1 "foo" 3

was not a problem, because, with the new verion, the program break
always earlier and never later than with the old version, so, it is
rather an advantage.

However, it is very simple to reaquire the old behaviour of this kind
of wrong programs (by adding an "Error" constructor to acc). So, if
you really want this, you have just to apply my
fix-incompatible-formats-raising-time-patch.diff attached patch.

===

About:

%L is printed back as %N
%b is printed back as %B

It is not a bug, it is completely normal. Indeed, %L and %N are
semantically equivalent, idem for %b and %B. The function
"string_of_format" has to return a string representing the format, not
the original string from the orginal source code. You have lots of
similar differencies, for example "%-+d" is equivalent to "%+-d", ...

About:

%@c is printed back as @c
It is a bug, sorry.

Can be fixed by adding the case :
| '@' -> buffer_add_string buf "%@"
in the bprint_char_literal function.

As a consequence, string_of_format "@" will equal "%@" but it is not a
problem for the same reason as before.

===

About make_printf versions wich do not allocate intermediate closures
(when printf is applied to a little number of arguments), they are
interresting but they greatly increase the source and generated code
size and reduce the readability. So, it may be better to implement
generic optimisations in the compiler as a kind of "partial
evaluation". The compiler has just to "unloop" the make_printf
function on the format to automatically produce an "optimised" (if not
"optimal") code like:

print_string "Hello ";
print_string (string_of_int n);
print_char '\n';
flush stdout;;

from:

printf "Hello %d\n%!" n;;

vicuna · 2013-05-26T14:39:43Z

Comment author: @gasche

Next week I will consult the people that are even grumpier than me about those backward changes, but my gut feeling is that it's better to have as few of them as possible. Make this patch a purely internal change with no observable difference¹, and let's discuss improvements to the behavior as separate step.

It's much easier to discuss and review a behavior change when it is not embedded in a thousands-of-line diff. Besides, not having behavior changes makes it easier to find bugs: currently my fuzzer stops 99 times out of 100 on minor difference you introduced in error message reporting, and this risks hiding the 1/100 bug it may spot.

¹: I told before I thought some behavior changes, such as failing statically instead of at runtime, were acceptable. I'm not sure where to draw the line between what's ok and what's not, and it would be easier to review and discuss the patch if there was no such line to draw.

vicuna · 2013-05-26T14:49:34Z

Comment author: bvaugon

Remarks that, with the old version, you have :

Printf.printf "%{%d%d%}" "%d%d";;

%i%i- : unit = ()

So, "%d" is transformed into "%i". The hypothesis that formats are literally pretty-printed was already wrong.

vicuna · 2013-05-26T15:04:40Z

Comment author: bvaugon

I think that the GADT definition of formats should verify the two properties :

You cannot (by typing) construct format values which do not make sens
Except for String_literal and Char_literal which is an optimisation, the representation of semantically equivalent formats is as unique as possible.

If you add a constructor for "%b" and an other for "%B" (idem for all others), you increase the code size and break the second property.

The solution (, ) is interresting because it reduce the size of CamlinternalFormatBasics, but it break the first property.

vicuna · 2013-05-26T15:14:43Z

Comment author: @gasche

The first property would still hold of the GADT part, which is indeed the one that contains the structured information. The property is helpful inasmuch as it helps ensure the correctness of the typecore translation and later of the printf/scanf/format functions. With the pair representation, transcore is still trivial to audit for correctness, and printf/scanf/format that immediately take the second part are still as correct as before.

And furthermore, the pair representation guarantees that string_of_format behaves as it previously did, which I still insist is an important property. You think it is about the semantics of format strings, but that is wrong. If people use string_of_format, it is precisely to have a non-semantic representation and play dirty tricks with it. We don't know, and we don't want to know, what changing string_of_format may provoke there.

I think we should do the change to the pair representation.

vicuna · 2013-05-26T21:40:22Z

Comment author: bvaugon

Disadvantages of the (, ) representation are:

it small slow-downs the speed of printing and scanning functions
it increases the memory usage of formats
it duplicates format contents in the (mutable) string and the GADT (inelegance and sensitivity to coherence problems)

Advantages are:

it really reduces the size of CamlinternalFormatBasics
it makes faster and simplier string_of_format
forall s, string_of_format (format_of_string s) = s

If everybody think that the performance costs are acceptable and that nobody will touch the format strings embeded in formats (which can't be abstracted, but which are hidden in CamlinternalFormatBasics), I agree.

Do you want to implement it yourself or I do it ?

If I implement it, do I remake clean patches from original 4.00.1 ?
What's about my fix-incompatible-formats-raising-time-patch.diff patch ?
What's about the faster and no-alloc implementation of ifprintf I proposed above (with IPrintf_arg0, etc., suggested by bobot) ?

I think that error messages about invalid formats obtained at compile time are acceptable. They are already much better than before. However, it is difficult to obtain good error messages at runtime (typically by embedding the whole format in messages) when "%[" or "%_" are used in formats passing to printing functions without reducing performances of all prints. Is it so important ?

vicuna · 2013-05-26T21:54:07Z

Comment author: @gasche

Do you want to implement it yourself or I do it ?

I wouldn't be able to work on that before next week (beginning of June).

If I implement it, do I remake clean patches from original 4.00.1 ?

As you prefer. So far I applied your patches (to the last commit of Damien closing the 4.00 branch, in fact) and worked/reviewed on top of that.

What's about my fix-incompatible-formats-raising-time-patch.diff patch ?

I think it's good to have, and included it in my working copy. If you start from 4.00.1 again, you should integrate it in the result.

What's about the faster and no-alloc implementation of ifprintf I proposed above (with IPrintf_arg0, etc., suggested by bobot) ?

That's rather orthogonal to the change being discussed now and I think we can discuss it later.

vicuna · 2013-05-27T07:32:54Z

Comment author: @alainfrisch

Before doing the change, I'd suggest to check first if string_of_format is actually used in real code. I don't remember every using it.

vicuna · 2013-05-27T10:02:35Z

Comment author: bvaugon

I agree that a further discussion about the utility of the pair (, ) representation would be useful. It would be necessary to look into the influence on the sensitive tradeoff and the implicit drawbacks on the overall performances and style.

vicuna · 2013-05-27T11:24:25Z

Comment author: bobot

FWIW string_of_format is not used in Coq, Frama-C, Why3, Why2, janest-core, lwt and all the other ml files in my home.

vicuna · 2013-05-28T09:39:51Z

Comment author: @diml

I know one place were [string_of_format] is used: ocaml-gettext. In gettext strings are converted by looking into a table mapping original strings to their translations, so obviously one needs the exact string as it appears in the source code.

vicuna · 2013-07-22T21:55:43Z

Comment author: bvaugon

I wrote a new version using pairs (Gadt, "string") to encode formats as suggested by Gabriel (see howto-apply-new-formats-PAIR-patches.txt, ocaml-4.00.1_new-formats_PAIR_patch-1.diff and ocaml-4.00.1_new-formats_PAIR_patch-2.diff) and package the old version including the fix-incompatible-formats-raising-time-patch.diff (see howto-apply-new-formats-NOPAIR-patches.txt, ocaml-4.00.1_new-formats_NOPAIR_patch-1.diff and ocaml-4.00.1_new-formats_NOPAIR_patch-2.diff) to facilitate comparisons of the two versions. So, you can choose the one you prefer.

vicuna · 2014-02-20T01:44:22Z

Comment author: berenger

Since I maintain a lazy logger (it's called dolog in OPAM), I am very interested by this feature. Especially, I am interested in all the functions to ignore some material when conditionally printing.
And, currently, several of these functions are missing (cf. related issues).

vicuna · 2014-05-12T15:44:55Z

Comment author: @gasche

Merged in trunk. There is still a small incompatibility introduced by this patch when you use %(..%), but hopefully it should be ironed out before the release.

vicuna · 2014-12-05T14:19:51Z

Comment author: pboutill

I wrote in my project (https://github.com/Kappa-Dev/KaSim) the function :

let tag_if_debug s =
if !Parameter.debugModeOn
then Format.kfprintf (fun f -> Format.pp_print_newline f ())
Format.std_formatter s
else Format.ifprintf Format.std_formatter s

hoping that "tag_if_debug "%a" Pp.map my_huge_map" costs nothing when not in mode debug.
Gasche told me that it is not the case and asks me to report this here as this question has been raised already.

vicuna · 2014-12-05T15:57:41Z

Comment author: bvaugon

Remarks that in your example, the callback function Pp.map won't be
called with your my_huge_map as argument when !Parameter.debugModeOn
is false.

Consequently, in no-debug mode, the cost to pay with this program is
only the aggregation of the 2, 3 or ... arguments of the call of your
tag_if_debug function in a heterogeneous list. This implies some
allocations (4, 6 or ...) which might be a bit reduced with a better
ifprintf implementation that do not aggregate arguments.

However, your example suggests that it is the reading of your
"my_huge_map" that you want to avoid, and this is already avoided.
Do you want to also avoid the allocation of the list of arguments ?

If yes, it might be done by implementing, in the library, a specialised
ifprintf function that just read the format, then get and ignore
arguments rather than a generic usage of CamlinternalFormat.make_printf.
However, it will be difficult to avoid all allocations in any case
(typically for some intermediate closures).

vicuna · 2014-12-05T16:03:50Z

Comment author: @gasche

If yes, it might be done by implementing, in the library, a specialised
ifprintf function that just read the format, then get and ignore
arguments rather than a generic usage of CamlinternalFormat.make_printf.

This is what I had in mind when asking Pierre to reply to this PR. I think we should try to see if this does not duplicate too much code (my guess it that it wouldn't), without imposing hard-limits such as "no allocation at all" but trying to do something reasonable.

Pierre: indeed as Benoît points out I was wrong when I said that the %a and %t functions would be evaluated in any case -- I'll put that on the lack of lunch. They aren't in the i(k)fprintf case.

vicuna · 2014-12-08T10:16:22Z

Comment author: bobot

If the call of tag_if_debug is in a hot path of your program. I advice you to not use a variadic function. Do as in

#6017#c9322

Add a version of tag_if_debug for each arity and don't use ifprintf, it removes the cost of the intermediate closure.

vicuna · 2015-10-31T16:11:48Z

Comment author: @gasche

Jeremy Yallop just re-implemented the i*printf functions in a way that should be more sensibly efficient (no accumulator is built during the format processing):

#267

vicuna closed this as completed Feb 16, 2017

vicuna added high-priority stdlib labels Mar 14, 2019

vicuna added this to the 4.02.0 milestone Mar 14, 2019

vicuna assigned gasche Mar 14, 2019

vicuna added the feature-wish label Mar 20, 2019

A new format implementation based on GADTs #6017

A new format implementation based on GADTs #6017

Comments

vicuna commented May 18, 2013 • edited by nojb Loading

Bug description

File attachments

vicuna commented May 18, 2013

vicuna commented May 22, 2013

vicuna commented May 22, 2013

vicuna commented May 22, 2013

vicuna commented May 23, 2013

vicuna commented May 23, 2013

vicuna commented May 23, 2013

vicuna commented May 23, 2013

vicuna commented May 23, 2013

vicuna commented May 24, 2013

vicuna commented May 24, 2013

vicuna commented May 24, 2013

vicuna commented May 24, 2013

vicuna commented May 25, 2013

vicuna commented May 25, 2013

vicuna commented May 25, 2013

vicuna commented May 26, 2013

vicuna commented May 26, 2013

vicuna commented May 26, 2013

vicuna commented May 26, 2013

vicuna commented May 26, 2013

vicuna commented May 26, 2013

Printf.printf "%{%d%d%}" "%d%d";;

vicuna commented May 26, 2013

vicuna commented May 26, 2013

vicuna commented May 26, 2013

vicuna commented May 26, 2013

vicuna commented May 27, 2013

vicuna commented May 27, 2013

vicuna commented May 27, 2013

vicuna commented May 28, 2013

vicuna commented Jul 22, 2013

vicuna commented Feb 20, 2014

vicuna commented May 12, 2014

vicuna commented Dec 5, 2014

vicuna commented Dec 5, 2014

vicuna commented Dec 5, 2014

vicuna commented Dec 8, 2014

vicuna commented Oct 31, 2015

vicuna commented May 18, 2013 •

edited by nojb

Loading