Extend ocamllex with actions before refilling #7

Closed
Contributor

let-def commented Feb 2, 2014

The patch introduces a new "refill" action associated to a lexing rule.
It is optional and if not used the lexer specification and behavior are unchanged.

When specified, it allows the user to control the way the lexer is refilled. For example, an appropriate refill handler could perform the blocking operations of refilling under a concurrency monad such as Lwt or Async, to work better in a cooperative concurrency setting.

To use this feature, a lexing rule should be upgraded from:

rule entry_name arg1 = parse
  | ...

to:

rule entry_name arg1 = refill {refill_function} parse
  | ...

General idea

refill_function is a function which will be invoked by the lexer immediately before refilling the buffer. The function will receive as arguments the continuation to invoke to resume the lexing, as well as all other values to be passed to the continuation.

More precisely, it is a function of type:

('param_0 -> ... -> 'param_n -> Lexing.lexbuf -> int -> 'a) ->
 'param_0 -> ... -> 'param_n -> Lexing.lexbuf -> int -> 'a

where:

  • 'param_0, ..., 'param_n are types of the parameters of the lexing rule
  • the int represents the state of the lexing automaton
  • the first argument is the continuation (which has the exact same type as the rest of the function), which captures the processing ocamllex would usually perform (refilling the buffer, then calling the lexing function again)
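
As a minimal sketch of a handler with this shape, assume a rule with a single parameter `counter : int ref`, as in the example further down. The names `refills`, `count_refills` and `fake_continuation` are hypothetical and only serve to exercise the handler outside of a generated lexer; the handler counts refill requests and then hands control straight back to the continuation.

```ocaml
(* Hypothetical counter, incremented each time a refill is requested. *)
let refills = ref 0

(* Handler for a rule with one parameter of type [int ref].
   [k] is the continuation supplied by the generated lexer; it is called
   immediately, so lexing proceeds as it would without a handler, apart
   from the side effect on [refills]. *)
let count_refills
    (k : int ref -> Lexing.lexbuf -> int -> 'a)
    (counter : int ref) (lexbuf : Lexing.lexbuf) (state : int) : 'a =
  incr refills;
  k counter lexbuf state

(* Stand-in continuation: it just returns the automaton state it was
   resumed with, so the call below can be checked without ocamllex. *)
let fake_continuation _counter _lexbuf state = state

let resumed_state =
  count_refills fake_continuation (ref 0) (Lexing.from_string "") 7
```

The handler must forward every argument it received, unchanged, unless it deliberately wants to resume with a different lexbuf or state.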

Anatomy of generated lexers

Let's start from a simple lexer summing all numbers found in input:

    rule main counter = parse
      | (['0'-'9']+ as num)
        { counter := !counter + int_of_string num;
          main counter lexbuf
        }
      | eof { !counter }
      | _ { main counter lexbuf }

The code generated for the rule looks like:

    let rec main counter lexbuf =
        __ocaml_lex_main_rec counter lexbuf 0
    and __ocaml_lex_main_rec counter lexbuf __ocaml_lex_state =
      match Lexing.engine __ocaml_lex_tables __ocaml_lex_state lexbuf with
        | 0 ->
            let num = Lexing.sub_lexeme lexbuf lexbuf.Lexing.lex_start_pos lexbuf.Lexing.lex_curr_pos in
            ( counter := !counter + int_of_string num;
              main counter lexbuf )
        | 1 -> ( !counter )
        | 2 -> ( main counter lexbuf )
        | __ocaml_lex_state -> lexbuf.Lexing.refill_buff lexbuf; __ocaml_lex_main_rec counter lexbuf __ocaml_lex_state

  1. The main function's only purpose is to invoke __ocaml_lex_main_rec starting from the initial state.
  2. The __ocaml_lex_main_rec function first calls the lexing engine, then dispatches on its result:
    • in terminal states, user actions are executed,
    • in other states, the buffer is refilled first, then the code loops.

Let's change the code to include some refill action:

    rule main counter =
      refill {fun k counter lexbuf state ->
                prerr_endline "let's refill!";
                k counter lexbuf state}
      parse
      | (['0'-'9']+ as num)
        { counter := !counter + int_of_string num;
          main counter lexbuf
        }
      | eof { !counter }
      | _ { main counter lexbuf }

The generated code now looks like this:

    let rec main counter lexbuf =
        __ocaml_lex_main_rec counter lexbuf 0
    and __ocaml_lex_main_rec counter lexbuf __ocaml_lex_state =
      match Lexing.engine __ocaml_lex_tables __ocaml_lex_state lexbuf with
        | 0 ->
            let num = Lexing.sub_lexeme lexbuf lexbuf.Lexing.lex_start_pos lexbuf.Lexing.lex_curr_pos in
            ( counter := !counter + int_of_string num;
              main counter lexbuf )
        | 1 -> ( !counter )
        | 2 -> ( main counter lexbuf )
        | __ocaml_lex_state ->
            (fun k counter lexbuf state ->
              prerr_endline "let's refill!";
              k counter lexbuf state)
                __ocaml_lex_main_refill
                counter lexbuf __ocaml_lex_state

      and __ocaml_lex_main_refill counter lexbuf __ocaml_lex_state =
        lexbuf.Lexing.refill_buff lexbuf;
        __ocaml_lex_main_rec counter lexbuf __ocaml_lex_state

  1. The first part is unchanged.
  2. The refill case is now split in two parts:
    • the __ocaml_lex_main_refill function does the work that was previously done directly in the branch action: refilling the buffer and looping,
    • the branch itself just calls the user refill action with __ocaml_lex_main_refill as the continuation.

Design considerations

Scope of refill action

One may wonder why the refill action is local to a rule and not globally applied to all rules in the file.

In a more realistic example, the action could make use of one or more of the parameters, e.g. to access some information relative to the lexbuf, like a rendez-vous point shared with the lexbuf refilling function.

As such this action depends on the type of the rule and imposing the same to all rules might get in the way of the user.

Also, it is likely that if someone makes multiple uses of a refill action, most of the code will be put in the prolog of the lexer and shared across different rules. Repeating the refill action at the beginning of each rule will therefore be lightweight, and may ease readability by making the special flow more explicit.

Type of refill action

The refill action is passed the lexbuf and the local state, exposing some internals of the lexer. One may argue that more safety could be added, like passing the local state as an abstract type or wrapped in a closure.

However I believe that this choice is consistent with the rest of the ocamllex design: the code is still simple and straightforward, internals are exposed when it is easier (e.g. Lexing.lexbuf type) and it is obvious that messing with those values will lead to undesirable behaviors.

Furthermore, this choice ensures a runtime cost close to zero.

Testing separately, examples

The repository https://github.com/def-lkb/ocamllex provides a standalone version
of ocamllex with this extension and is otherwise completely compatible with
ocaml 4.01 ocamllex.

You can see here a sample integration with Lwt_io.
Also, Merlin's lexer now uses refill actions.

Note to other reviewers: the last true parameter means add_parens, and will cause copy_chunk to automagically insert parentheses around the refill action, as can be seen in the example described in the pull request. I wondered about that.

Member

gasche commented Feb 3, 2014

I did a review before the patch was submitted, and my concerns were addressed. I think this is a very good change -- one of the first pieces of Merlin's work that should be integrated upstream.

Contributor

c-cube commented Feb 3, 2014

I think it's very interesting that ocamllex (and later ocamlyacc or menhir) becomes compatible with Lwt or other non-blocking frameworks. It would make writing parsers that accept this kind of input much, much easier (currently one basically has to either hand-write a state machine, or use something like stream parsers, I guess).

Member

samoht commented Feb 3, 2014

Very useful patch.

Member

gasche commented Feb 3, 2014

I discussed the patch with Luc and a few questions were raised (but he won't have much time to look at it, at least before the end of the week):

  • is the state parameter really useful (same question, maybe, for lexbuf)?
  • could this approach also support the code produced by the -ml flag (which doesn't really have an equivalent notion of state, hence the question above)?
  • is it really useful to have one refill_function per lexing rule?

I asked this same question during my previous review, and answered it myself by thinking that rules that manipulate global state (such as the "string buffer" in the string- and comment-lexing rules of the OCaml lexer, as can be seen in the Merlin lexer (global state has been changed into a parameter, but still)) may want to (re)store in some way during refilling. But actually the merlin-lexer example demonstrates that, no, mutable structures are untouched by the refill function, the logic remains the same in each rule.

Of course that wouldn't type-check with the current scheme, where refill functions are passed the parameters (as distinct rules can have different numbers and types of parameters). Is it necessary to access the parameters in the refill function? The Lwt_io example uses that, but I would claim that it is a phony example: the state that is supposedly made local by being passed as a parameter is in fact relied upon by the global/implicit lexbuf variable, so it could just as well be a global mutable variable (possibly made re-entrant by functorization in the style of the arithmetic example).

Contributor

let-def commented Feb 3, 2014

The -ml case is actually a problem. The solution to uniformize everything (and optionally hide lexbuf), is to allocate a closure to pass as the continuation.

If we are going to allocate a closure then all remaining problems are just matters of taste:

  • the automaton state is now hidden in closure,
  • whether or not we hide the lexbuf can be discussed… It might make sense to allow changing the lexbuf: the patch provides a resumable lexer, and if we allow changing the lexbuf it can be turned into a persistent lexer (speaking as a developer of merlin, it could make sense to resume and backtrack from the middle of a token),
  • a local or global refill_action does not change much, yet I believe that having access to parameters avoids the hack you suggested (going through a functor to make the lexer reentrant is kind of heavyweight… the same argument applies to rule parameters in the first place: they are not strictly necessary, we could either lift actions in a reader monad or make the whole file a functor).

Member

gasche commented Feb 3, 2014

I don't think allocating a closure would be a problem; it's drowned in the noise in the Lwt case for example. Of course, rules without a refill_action should not pay any cost, but your patch already does that. If you really wanted to avoid the closure construction, you could make the state parameter a private type in the generated code, and use unit for -ml-generated parsers.
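
The private-type alternative could look roughly like the sketch below. Everything in it is hypothetical (the `Generated` module, `initial`, `resume`, `handler`); it only illustrates how a private type abbreviation would let a refill handler pass the automaton state along and read it, without being able to forge one.

```ocaml
(* Hypothetical shape of a generated lexer that exposes its automaton
   state as a private type. ocamllex does not generate such a module. *)
module Generated : sig
  type state = private int
  val initial : state
  val resume : state -> string
end = struct
  type state = int
  let initial = 0
  let resume s = "resumed in state " ^ string_of_int s
end

(* A handler can forward the state it was given... *)
let handler k (s : Generated.state) = k s
let message = handler Generated.resume Generated.initial

(* ...and can read it through an explicit coercion, but an expression
   such as [(3 : Generated.state)] would be rejected by the type checker. *)
let as_int = (Generated.initial :> int)
```

This needs a signature for the generated code, which is the cost discussed later in the thread.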

My impression would be that a global refill_action would be more convenient (e.g. it avoids the difference between refill and refill' in the Merlin case), and avoids the silly mistake of forgetting to attach the refill_action to one of the rules (which the compiler will not catch). That said, it's not an important point, and as the code author you ultimately get to decide: if you think that local rules are more useful or elegant, I'm fine with that.

Owner

avsm commented Feb 3, 2014

This is an extremely useful feature. Just a note about something that @yallop observed a while back: it's possible to use exceptions to implement a restartable Lexer without this patch.

https://github.com/yallop/code-snippets/tree/master/restartable-lexer

The proposed patch here is more elegant, but I'd be curious about a performance comparison between the two approaches if you have time. The exception version would allocate fewer closures in principle, but may be slower if exception backtraces were enabled.

Owner

avsm commented Feb 3, 2014

@yallop: thanks! I withdraw my request for a performance comparison, in preference for a working lexer ;-)

Contributor

let-def commented Feb 3, 2014

I changed my mind :). While I still believe rule local actions are interesting, all the use cases so far relied on a global action.
Parameters can be recovered with a monadic scheme (e.g. reader monad). The only benefit lost is that previously all allocations were avoided, because we didn't have to allocate a closure and could jump directly to the continuation (@avsm seems to worry about performance ;)).

However, we only need to close over the local state (with the C engine), so it should not make a big difference (assuming that the refill action/function does real work).
With the ML engine, keeping the local continuation is unavoidable anyway.

The private type solution is a bit too much, I feel. ocamllex doesn't generate an interface at the moment; doing so would complicate its use a lot for… almost no benefit?!

Unless you have other concerns to discuss, I will rewrite my patch to allow just one main refill action.

Revert "lex: enable user-defined refill action"
This reverts commit e55022a.

Revert "lex: cleanup whitespaces, no semantic changes"
This reverts commit 5fb8660.
Contributor

let-def commented Feb 7, 2014

I updated the code according to Luc's remarks. Here is an explanation of the new implementation:

Contributor

let-def commented Feb 7, 2014

The patch introduces a new "refill" action. It's optional and if unused the lexer specification and behavior are unchanged.

When specified, it allows the user to control the way the lexer is
refilled. For example, an appropriate refill handler could perform
the blocking operations of refilling under a concurrency monad such
as Lwt or Async, to work better in a cooperative concurrency
setting.

To make use of this feature, add

refill {refill_function}

between the header and the first rule.

General idea

refill_function is a function which will be invoked by the lexer
immediately before refilling the buffer. The function will receive as
arguments the continuation to invoke to resume the lexing, and the current
lexing buffer.

More precisely, it's a function of type:

(Lexing.lexbuf -> 'a) ->
 Lexing.lexbuf -> 'a

where:

  • the first argument is the continuation which captures the processing
    ocamllex would usually perform (refilling the buffer, then calling the lexing
    function again),
  • the result type 'a should unify with the result types of all rules.
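
To illustrate the unification constraint, here is a minimal sketch in which every rule would return a value of a hypothetical `step` type, standing in for a real concurrency monad such as Lwt or Async. The names `step`, `suspend_on_refill` and `run` are assumptions made for this example; the continuation passed in the last definition is a stand-in for the one the generated lexer would supply.

```ocaml
(* Hypothetical result type for a lexer that can pause at refill points. *)
type 'a step =
  | Done of 'a                      (* lexing produced a value *)
  | Suspended of (unit -> 'a step)  (* lexing paused before a refill *)

(* Refill handler of type
     (Lexing.lexbuf -> 'a step) -> Lexing.lexbuf -> 'a step
   Instead of refilling right away, it returns control to the caller and
   keeps the continuation for later. *)
let suspend_on_refill k (lexbuf : Lexing.lexbuf) =
  Suspended (fun () -> k lexbuf)

(* Drive a computation to completion, counting how often it paused. *)
let run computation =
  let rec loop pauses = function
    | Done v -> (v, pauses)
    | Suspended resume -> loop (pauses + 1) (resume ())
  in
  loop 0 computation

(* Stand-in continuation: pretend the lexer finishes with 42 once resumed. *)
let result =
  run (suspend_on_refill (fun _lexbuf -> Done 42) (Lexing.from_string ""))
```

With such a handler, every rule action has to produce a `step` as well (for instance `Done token`), since the handler and the rules share the same result type.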

Anatomy of generated lexers

Let's start from a simple lexer summing all numbers found in input:

rule main counter = parse
  | (['0'-'9']+ as num) 
    { counter := !counter + int_of_string num;
      main counter lexbuf
    }
  | eof { !counter }
  | _ { main counter lexbuf }

C engine

The code generated for the rule looks like:

let rec main counter lexbuf =
    __ocaml_lex_main_rec counter lexbuf 0
and __ocaml_lex_main_rec counter lexbuf __ocaml_lex_state =
  match Lexing.engine __ocaml_lex_tables __ocaml_lex_state lexbuf with
    | 0 ->
        let num = Lexing.sub_lexeme lexbuf lexbuf.Lexing.lex_start_pos lexbuf.Lexing.lex_curr_pos in
        ( counter := !counter + int_of_string num;
          main counter lexbuf )
    | 1 -> ( !counter )
    | 2 -> ( main counter lexbuf )
    | __ocaml_lex_state -> lexbuf.Lexing.refill_buff lexbuf; __ocaml_lex_main_rec counter lexbuf __ocaml_lex_state

  1. The main function's only purpose is to invoke __ocaml_lex_main_rec,
    starting from the initial state.
  2. The __ocaml_lex_main_rec function first calls the lexing engine, then
    dispatches on its result:
    • in terminal states, user actions are executed,
    • in other states, the buffer is refilled first, then the code loops.

Let's change it to include some refill action:

refill 
  { fun k lexbuf -> 
    prerr_endline "let's refill!";
    k lexbuf
  }

rule main counter = 
  parse
  | (['0'-'9']+ as num) 
    { counter := !counter + int_of_string num;
      main counter lexbuf
    }
  | eof { !counter }
  | _ { main counter lexbuf }

The generated code now looks like:

let __ocaml_lex_refill : (Lexing.lexbuf -> 'a) -> (Lexing.lexbuf -> 'a) =
  ( fun k lexbuf -> 
    prerr_endline "let's refill!";
    k lexbuf
  )

let rec main counter lexbuf =
    __ocaml_lex_main_rec counter lexbuf 0
and __ocaml_lex_main_rec counter lexbuf __ocaml_lex_state =
  match Lexing.engine __ocaml_lex_tables __ocaml_lex_state lexbuf with
    | 0 ->
        let num = Lexing.sub_lexeme lexbuf lexbuf.Lexing.lex_start_pos lexbuf.Lexing.lex_curr_pos in
        ( counter := !counter + int_of_string num;
          main counter lexbuf )
    | 1 -> ( !counter )
    | 2 -> ( main counter lexbuf )
    | __ocaml_lex_state -> __ocaml_lex_refill 
        (fun lexbuf -> lexbuf.Lexing.refill_buff lexbuf; 
           __ocaml_lex_main_rec counter lexbuf __ocaml_lex_state) lexbuf

  1. The refill handler is bound to a lexer-private name.
  2. The rule entry is unchanged.
  3. The refill case passes the actual refilling code as an argument to the
    refill handler.

ML engine

The ML generator first generates some generic definitions, notably:

val __ocaml_lex_next_char : Lexing.lexbuf -> int

This function is responsible for returning the next character from the lexbuf.
If the buffer needs a refill, then __ocaml_lex_next_char calls refill_buff
and retries. If the buffer has reached eof, then the function returns 256.

Then, it generates the automaton as a group of mutually recursive functions.
Only shifting states call the __ocaml_lex_next_char function.

let rec __ocaml_lex_state0 lexbuf = match __ocaml_lex_next_char lexbuf with
  |256 -> 
    __ocaml_lex_state2 lexbuf
  |48|49|50|51|52|53|54|55|56|57 ->
    __ocaml_lex_state3 lexbuf
  | _ -> 
    __ocaml_lex_state1 lexbuf

and __ocaml_lex_state1 lexbuf = 2

and ...

The result of the automaton is an integer against which the main rule will
dispatch to execute the relevant action.

let rec main counter lexbuf =
  __ocaml_lex_init_lexbuf lexbuf 0; 
  let __ocaml_lex_result = __ocaml_lex_state0 lexbuf in
  ... (* dispatch against __ocaml_lex_result *)

Now in the refilling case:

  • the same __ocaml_lex_refill as above is output,
  • a new exception is defined:
    exception Ocaml_lex_refill of (Lexing.lexbuf -> int)
  • __ocaml_lex_next_char is modified so that refill cases no longer loop
    but just return -1,
  • the automaton states now have a special transition on -1: raising
    Ocaml_lex_refill <current-state>,
  • the entry of a rule initializes the buffer, then jumps to a
    dedicated function that executes the automaton while catching this exception and
    calling the refill handler appropriately.

let rec __ocaml_lex_state0 lexbuf = match __ocaml_lex_next_char lexbuf with
  | -1 -> 
    raise (Ocaml_lex_refill __ocaml_lex_state0)
  |256 -> 
    __ocaml_lex_state2 lexbuf
  |48|49|50|51|52|53|54|55|56|57 ->
    __ocaml_lex_state3 lexbuf
  | _ -> 
    __ocaml_lex_state1 lexbuf

and __ocaml_lex_state1 lexbuf = 2

and ...

let rec main counter lexbuf =
  __ocaml_lex_init_lexbuf lexbuf 0; 
  __ocaml_lex_main_rec __ocaml_lex_state0 counter lexbuf

and __ocaml_lex_main_rec __ocaml_lex_state counter lexbuf =
  try
    (* run the automaton from the state we were asked to resume from *)
    let __ocaml_lex_result = __ocaml_lex_state lexbuf in
    ... (* dispatch against __ocaml_lex_result *)
  with Ocaml_lex_refill __ocaml_lex_state ->
    __ocaml_lex_refill 
      (fun lexbuf -> lexbuf.Lexing.refill_buff lexbuf;
        __ocaml_lex_main_rec __ocaml_lex_state counter lexbuf)
      lexbuf

Member

gasche commented Feb 7, 2014

From your description alone (I didn't know about -ml before), I don't understand why you need two distinct control-flow mechanisms: returning -1, and raising an exception. When you match on -1, could you not call __refill (... current state ..) lexbuf directly? (If the goal is to avoid code duplication, simply binding the refill action to a name should work). Alternatively, you could pass the current state as an additional parameter to ocaml_lex_next_char and have it call refill in tail-position; could that avoid both the exception and the -1?

Contributor

let-def commented Feb 7, 2014

By passing the current state to ocaml_lex_next_char, we can indeed skip the return -1 step and directly raise an exception. If we want to avoid raising an exception, then slightly more work is needed.

The structure of the stack when calling ocaml_lex_next_char looks like:
  ... [user-code] / rule entry / automaton state / ocaml_lex_next_char (top).
Responsibilities:

  • ocaml_lex_next_char is expected to return the next character,
  • the automaton state traverses the automaton for as long as needed, then returns an integer,
  • the rule entry executes the action corresponding to the integer.

The exception is caught by the rule entry, to resume from the right state and, hopefully, dispatch the action later. In particular, neither the state functions nor ocaml_lex_next_char are in tail position (and as such, cannot call the refill action).

The other scheme I see to avoid raising an exception is to invert the flow when calling the automaton.
The type of a state would be Lexing.lexbuf -> (int -> 'a) -> 'a and
ocaml_lex_next_char would have type Lexing.lexbuf -> (Lexing.lexbuf -> (int -> 'a) -> 'a) -> (int -> 'a) -> 'a.
This way everything is tail-recursive, no allocation is done on the hot-path, only when entering the rule (to allocate a closure capturing rule parameters).

This is perfectionism right here, but the previous implementation has two spaces before the match (line 138 in this diff), and you removed them here. This makes your output better in the case where match happens right after a function declaration (No_remember or both Remember and has_refill), but also slightly worse in the other case (the match is not indented). Whitespace is not important, but I think preserving past behavior is a good idea (eg. people might want to diff the produced lexer before and after the ocamllex change), so I would add those two spaces back. In that case you can also remove the trailing space line 151 to have only two spaces after the equal sign, instead of three.

Member

gasche commented Feb 8, 2014

If I understand correctly, the current try..with breaks tail-recursivity of (tail-recursive) calls to the lexer rules in user productions. Is that correct? That would be a much stronger argument against this design than mere elegance considerations.

Contributor

let-def commented Feb 16, 2014

I pushed a new version.
Now in the -ml mode, one closure may be allocated at the entry point and on refill. All calls are in tail position, and continuations are explicitly passed from function to function.

__ocaml_lex_next_char receives two continuations: the current state and the final continuation that dispatches on the result. Each state receives the current char and the final continuation.

States consuming characters are split into two functions: __ocaml_lex_stateX and __ocaml_lex_stateX_next.
__ocaml_lex_stateX sets up anything needed on the lexbuf, then calls __ocaml_lex_next_char using __ocaml_lex_stateX_next as continuation.

let rec __ocaml_lex_next_char lexbuf state k =
  if lexbuf.Lexing.lex_curr_pos >= lexbuf.Lexing.lex_buffer_len then begin
    if lexbuf.Lexing.lex_eof_reached then
      state lexbuf k 256
    else begin
      __ocaml_lex_refill (fun lexbuf ->
        lexbuf.Lexing.refill_buff lexbuf ;
        __ocaml_lex_next_char lexbuf state k)
        lexbuf
    end
  end else begin
    let i = lexbuf.Lexing.lex_curr_pos in
    let c = lexbuf.Lexing.lex_buffer.[i] in
    lexbuf.Lexing.lex_curr_pos <- i+1 ;
    state lexbuf k (Char.code c)
  end

let rec __ocaml_lex_state0 lexbuf k = __ocaml_lex_next_char lexbuf __ocaml_lex_state0_next k
and __ocaml_lex_state0_next lexbuf k = function
  |256 -> 
    __ocaml_lex_state2 lexbuf k
  |48|49|50|51|52|53|54|55|56|57 ->
    __ocaml_lex_state3 lexbuf k
  | _ -> 
    __ocaml_lex_state1 lexbuf k

and __ocaml_lex_state1 lexbuf k =
  k lexbuf 2

and ...

let rec main counter lexbuf =
  __ocaml_lex_init_lexbuf lexbuf 0; 
  __ocaml_lex_state0 lexbuf 
  (* final continuation: dispatch against __ocaml_lex_result *)
  (fun lexbuf __ocaml_lex_result ->
  lexbuf.Lexing.lex_start_p <- lexbuf.Lexing.lex_curr_p;
  lexbuf.Lexing.lex_curr_p <- {lexbuf.Lexing.lex_curr_p with
    Lexing.pos_cnum = lexbuf.Lexing.lex_abs_pos+lexbuf.Lexing.lex_curr_pos};
    ... 
  )
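
The same control flow can be tried out in isolation with a toy model that does not depend on ocamllex. The sketch below is only an approximation of the scheme above: the `buf` record, the chunked input and all function names are hypothetical, and each "token" is a single character so that no backtracking is needed. It sums the digits found in the input and counts how many times the refill handler ran.

```ocaml
(* Toy stand-in for Lexing.lexbuf: input arrives chunk by chunk. *)
type buf = {
  mutable data : string;          (* chunk currently being read *)
  mutable pos : int;              (* read position inside [data] *)
  mutable pending : string list;  (* chunks not loaded yet *)
}

let refill_count = ref 0

(* User refill handler, same shape as in the patch: (buf -> 'a) -> buf -> 'a *)
let refill_handler k buf =
  incr refill_count;
  k buf

(* [state] is the state to resume in, [k] the final continuation. *)
let rec next_char buf state k =
  if buf.pos < String.length buf.data then begin
    let c = buf.data.[buf.pos] in
    buf.pos <- buf.pos + 1;
    state buf k (Char.code c)
  end else
    match buf.pending with
    | [] -> state buf k 256  (* end of input *)
    | chunk :: rest ->
      (* the actual refilling is wrapped in a closure, in tail position *)
      refill_handler
        (fun buf ->
           buf.data <- chunk;
           buf.pos <- 0;
           buf.pending <- rest;
           next_char buf state k)
        buf

(* State 0 is split in two: [state0] asks for a character,
   [state0_next] dispatches on it and reports a result to [k]. *)
let rec state0 buf k = next_char buf state0_next k
and state0_next buf k = function
  | 256 -> k buf 1                        (* eof *)
  | c when c >= 48 && c <= 57 -> k buf 0  (* a digit *)
  | _ -> k buf 2                          (* anything else *)

(* Rule entry: the final continuation plays the role of the rule actions. *)
let rec main total buf =
  state0 buf (fun buf result ->
      match result with
      | 0 ->
        let digit = Char.code buf.data.[buf.pos - 1] - Char.code '0' in
        main (total + digit) buf
      | 1 -> total
      | _ -> main total buf)

let sum = main 0 { data = ""; pos = 0; pending = ["1a"; "23"; "b4"] }
```

Here the only closures are the one built in `main` for the final continuation and the one built on each refill; the calls between `next_char`, the states and the continuations are all tail calls.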

Lines 38 and 39 could have 2 more spaces of indentation.

Member

gasche commented Feb 17, 2014

I reviewed the new patch and I think it is good; I am in favor of inclusion.

There is one allocation at each refill (only when a refill action is present), which builds the closure that captures the state. As the state does not have the same type in the normal or -ml backend, it needs to be either hidden in a closure (as it is currently done), or forced to be polymorphic in the refill-action's type. We could have made that latter choice to avoid the allocation, but it would make the interface less convenient for only a marginal gain. In particular, we could expect that users that want user-specified refill-action will do something more costly than just one allocation -- the Lwt example may perform arbitrary scheduling for example, and even the merlin case allocates.

Contributor

maranget commented Mar 7, 2014

The new patch was buggy. I have a patch that corrects the bug. However,
I have no time to learn how to transmit it in a GitHub-friendly manner....

For the impatient the patch is at
http://moscova.inria.fr/~maranget/vrac/ocamllex.patch

Sorry about that

--Luc

Contributor

let-def commented Mar 10, 2014

I applied your patch, thanks!

Member

gasche commented Mar 14, 2014

Patch applied in trunk. Thanks!

@gasche gasche closed this Mar 14, 2014
