Introduction to batEnum

bobzhang edited this page May 8, 2011 · 13 revisions

##Introduction This page was dedicated to the module batEnum (enum), which is widely used for two purposes:

  1. As a bridge to convert from one data structure to another
  2. Efficient implementation of most one-pass algorithms(more algorithms can be linearized than you might expect).

Enum is no more powerful than LazyList, but runs faster. Since it's destructive, it does not cache the result in most cases. In order to read from an enum multiple times, you can clone, create an independent copy which will produce the same values in the same order.

To understand this page, a basic understanding of LazyList is assumed.

PartI ##The fundamental stuff

type 'a t= {
mutable count : unit -> int ;
mutable next : unit -> 'a;
mutable clone : unit -> 'a t; 
mutable fast : bool;
}

The field fast indicates whether the count can be done in O(1) time. That all the fields are mutable is important, since they may all be changed, we will see later.

When you invoke the count function, it may enumerate the whole list, but the current position in the list is kept.

The field clone is to make a clone object, the object and the clone will not interfere with each other if the computation is pure. (This's the most cases in impure languages, when you fetch the value of the lazy block, it will force the evaluation, and you get it the second time, the value is cached, but the side effect will not be reproduced.)

During the following notations, T stands for time complexity, P : Purpose, S : Suggestion, A: Attention, Tip: Tips. For simplicity, 'a t is the same as 'a Enum.t. D: dependent on some functions

##Basic Functions (from basic to high level)

of_enum, enum : 'a Enum.t -> 'a Enum.t 
(*T: O(1)
  P: provided for a uniform interface *)

make : 
next:(unit -> 'a) -> count:(unit-> int)-> clone:(unit -> 'a t)-> 'a t
(*T: O(1)
  Tip: just fill the field of 'a t, the `fast` field is filled with true,
  auxiliary function.
S: a bug:
  let b= LazyList.(Enum.range 1 |> of_enum|> enum );;
  b.fast is true (it uses make, and fast field is true)
 *)
force : 'a t -> unit 
(*T: O(n)
  Tip: Cache the enumeration to a strict list, if the list is very long, then 
probably the memory will be used up. However, after you force a list, the 
clone, count, next, operations is very cheap, since they already has the whole 
list in the memory. If no clone is operated on this enum later, and you 
enumerate the list again, the memory will be reclaimed. Otherwise when clone 
is invoked after force, you may be careful to avoid memory leak, this is also 
quite common in Haskell
   A: force will change its all fields
*)
type 'a _mut_list = {hd :'a ; mutable tl : 'a _mut_list}
(*Tip: This type is mainly for generating a list from head to tail 
 without recursion, and one pass, it uses compiler trick *)
let force t = 
  let rec clone enum count = 
      (*Here the enum is of type 'a list ref, count is of type int. Clone is a 
function clone the state of 'a list ref content, and its length is known,  
since it already has the whole list, this  function is O(1), and did not 
allocate space any more, and fast is true. its clone field is also easy to 
define. 
*)
     let enum = ref !enum and count = ref !count in 
     {
       count = (fun () -> !count );
       next = (fun () -> match !enum with 
                              |[] -> raise No_more_elements
                              |h::t -> decr count ; enum := t ; h );
       clone = (fun () -> let enum = ref !enum and count = ref !count in 
                          clone enum count );
       fast = true ;}
  in 
  let count = ref 0 in 
  let _empty = Obj.magic [] in 
  let rec loop dst = 
      (* we will invoke the code later
         enum = {hd=t.next (); tl = _empty}
         loop !enum ;
so this function will enumerate the t, and fill all the data in enum, then we 
use (Obj.magic enum) to force enum to a list.  If we did not use Obj.magic, we 
get the elements to from the front to  end, and will reverse it, more time 
consuming. This a trick to build a immutable list from front to end using 
Obj.magic*)
      let x = {hd = t.next (); tl = _empty } in 
      incr count; 
      dst.tl <- x;
      loop x 
  in 
  let enum = ref _empty in 
    (try 
      enum := {hd = t.next (); tl = _empty };
      incr count;
      loop !enum 
     with No_more_elements -> ());
   let tc = clone (Obj.magic enum) count in 
      (*here we replace t with tc. Since all the fileds of 'a t are mutable
it changes all the fields of t *)
      t.clone <- tc.clone;
      t.next  <- tc.next;
      t.count <- tc.count;
      t.fast <- true
empty : unit -> 'a t 
(*T: o(1).
  Tip: It is mutable, so its signature should be unit -> 'a t, not like 
functional empty, Set.empty : 'a Set.t
  S: this makes me uncomfortable, the interface is inconsistent, how about 
change its name to make_empty or other ways to make a uniform interface?? *)
from : (unit -> 'a) -> 'a t 
(*T: o(1)
  Tip: very important function in batEnum. 
Field fast is false, next is f. 
Field count is (fun _ -> force e; e.count ())

The clone is tricky, since we need to clone the state, one way is when clone 
is invoked, we force the list, and then clone. If we adopt this way, the clone 
will be O(n), and will cache the whole list. Here we chose another way.
The other way is to make use of the lazy list. when we invoke the clone,
We use LazyList.from to build a lazy list, and LazyList.enum to hold the 
state. so the state consists of 'a LazyList.t ref.
(* Here we use MicroLazyList for lazyList, since LazyList depends on batEnum, 
otherwise it cause circular dependence. 
For MicroLazyList, when it is converted to enum, 
its next field is eval one step, and mutate the state, 
its count is (fun () -> force e; e.count ()), 
its fast field is false, its clone is just sharing the reference. 
*)
(*S: batEnum use MicroLazyList.enum, 
its count field is (fun () -> force e; e.count ());
then it will build a new list unnecessary. 
For example, if we have an enum_a built from 
let enum_a = from f
let enum_b = clone a 
later invoke enum_a.count, since we cloned a, enum_a.count will force enum_a
to evaluate all the elements and pushed it to 'a list. Whereas enum_b still 
hold the evaluated 'a LazyList.t ref. So enum_b and enum_a will hold two
evaluated list, this is quite unnecessary. I recommend to use 
batLazyList.enum, change its count field *)

To conclude, the new generated enum's clone field is o(1), but when you clone  
the num, it will change the record totally, fill the state with a LazyList.t 
ref.

The order of clone, count matters, when you count first you cache the whole 
list at once, and clone is cheap.
When you clone first, its state is kept via a lazy list's reference. 

I recommend to modify MicroLazyList.enum to make it pure when count invoked.
*)
let from f = 
   let e = {
      next = f;
      count = _dummy;
      clone = _dummy;
      fast = false} in 
   e.count <- (fun () -> force e; e.count ());
   e.clone <- (fun () -> 
               let e' = MicroLazyList.enum (MicroLazyList.from f) in 
                e.next <- e'.next ;
                e.clone <- e'.clone;
                e.count () <- (fun () -> force e ; e.count);
(*
Here we use MicroLazyList.enum (MicroLazyList.from f) to keep the state of the 
closure f, and use the result to replace the previous enum totally. 

However,  e.count<- e'.count is wrong, since when e.count is invoked,
it will force e', and will modify e' which will not be reflected on e, or else 
we could add one indirection e.fast <- (fun () -> e'.fast ()).

S: all make this bug is due to the fact that we use force in 
MicroLazyList.enum's count field, if we use batLazyList.enum's count then, the 
count is not destructive, the code will be more consistent. Here we can write 
e.count <- e'.count. 
The only disadvantage is e.count will always  is for LazyList.enum the state 
is 'a LazyList.t ref, and the fast will always be false whether or not the 
list is totally calculated, but I think it does not matter.
*)
                e.fast <- e'.fast ;
                e.clone ());
  e
from2 : (unit -> 'a) -> (unit -> 'a t) -> 'a t
(*T: O(1).
  Tip: fields next, clone are provided by the user
fast is false, count <- (fun () -> force e ; e.count ())
This is the same as from, except that it provide a chance to customize the 
clone function, it is useful, we will see it later
*)

Part II

##High level function (based on previous functions)

init : int -> (int -> 'a) -> 'a t
(*T: o(1) 
  D:from
  Tip: fast is true, count is kept as a state, people from Haskell 
should notice that init n f, init 4 (fun _ -> init 3 identity) and init 4 (const (init 3 identity)) are quite different, since the block init 3 identity is a mutable, should not be shared. This is also true for Array.init, etc.
*)
get : 'a t -> 'a option 
(*T: o(1) 
  D: field next *)
push : 'a t -> 'a -> unit 
(*T: o(1). 
  Tip: it add one element to the head 
use a state next_called = ref false to record.
its t.next is tricky and destructive, when you invoke t.next 
it will modify t.next,count,clone, this is quite common and buggy
*)
let push t e = 
  let rec make t = 
     let (fnext,fcount,fclone,next_called) = (t.next,t.count,t.clone,ref false) in 
     t.next <- (fun () -> 
           t.next <- (fun () -> next_called := true;
                      t.next <- fnext ;
                      t.count <- fcount;
                      t.clone <- fclone; e);
           t.count <- (fun () -> 
                let n = fcount () in 
                if !next_called then n else n+1);
            (* S: next_called is always false, otherwise this function
               is changed, so remove it? 
here people may worry if fcount is destructive, which means it can modify the fields of t, then t.count will be changed, but we replace t.next with the old t.next.  However, here we  wouldn't get a bug, since when fcount is destructive, it modifies all the fields, the next field is also changed, so
don't worry. But we may introduce some bugs via make???
The batList.enum, its count is not destructive, since it does  not modify the count function, just modify the state of the closure *)
           t.clone <- (fun () -> 
                let tc = fclone () in 
                if not !next_called then make tc;
                tc);
             (*here the fclone may be destructive, however, when it's
destructive, it updates all the fields, the definition of from, etc *)
    in make t 
peek : 'a t -> 'a option
(*T: o(1) 
  D: get, push *)
take : int -> 'a t -> 'a t 
(*T: o(n), I recommend to change this function,
first, it's not efficient, it takes the first n elements (<=n when it has no
more than n elements, still successful), and cache it in a list. but the count
for the new list is re-calculated, we can improve to reduce the unnecessary calculation.
Second, in most cases we hope the take function is lazy, I suggest renaming
take to take_strict, and use the next function take_lazy
*)
take_lazy : int -> 'a t -> 'a t 
(*time: o(1). my customized function *)
let take_lazy n t =
  let m = ref n in 
  let next () = if !m > 0 then begin 
    decr m;
    match get t with Some v -> v |None -> raise No_more_elements
  end else 
      raise No_more_elements in 
  from next ;;
(*there is another commented version take in the source code, however its clone is o(n), I recommended to use from, first to reduce the complexity, second, the fast count is indeed unnecessary and wrong, since we may have less
than n elements, right? *)

junk : 'a t -> unit
(*time:o(1) trivial*)
is_empty : 'a t -> bool
(*based on the assumption, if t.fast then t.count should be o(1) *)
count,fast_count,clone
(*trivial*)
iter : 'a t -> ('a -> unit) -> unit 
(*o(n) loop t.next , destructive *)
iteri (*trivial*)
iter2 : ('a ->'b -> 'c ) -> 'a t -> 'b t -> unit
(*
time: o(min(m,n)) ; be careful when get the element is not pure.
if the first enum is used up first, then the short is chosen,
  otherwise, the second is shorter, we should push one element of 
  the first element back.
based on push 
*)
iter2i (*trivial*)
fold,reduce,sum  (*time: o(n) trivial*)
exists,for_all 
(*time : o(n/2), be careful, it will destruct the enum until unnecessary, 
so the side effect is hard to predict. you may make use of it, or bitten by it
*)
scanl: ('a -> 'b -> 'a ) -> 'a -> 'b t -> 'a t 
(* time: o(1), based on from, push .
let y = scanl f init x
be careful, x and y is related, when you enumerate x, it will have an effect on y. This is all caused by lazy evaluation in the impure world.
*)
scan (*trivial, see scanl*)
foldi (*trivial, see fold *)
fold2 (*trivial, see fold + iter2  based on push*)
fold2i (*trivial, see fold2,+iteri *)
map : 'a t -> ('a -> 'b) -> 'b t 
(*time o(1)
let ys = map xs f 
(/@) = map (left associate)
(@/) = flip map (right associate)
lazy, so xs and ys are related see scanl.
*)
let rec map f t = 
  {count = t.count; next = (fun () -> f(t.next ())); 
   clone = (fun () -> map f (t.clone ()));
   fast = t.fast ;}
(* let ys = map xs f 
ys.count is not destructive, since xs.count dose not update ys, which means
ys.count will not call ys.next.
let zs = xs /@ f /@ g
you will see zs.count is the same as g.count. This is quite like Haskell, if
you call length, it will not evaluate the inner block. It's also the same that it's quite subtle to argue the time complexity.
we can not use from next here, since the count is customized, and the clone
, fast is also customized. So we clone ys will cause xs also to be cloned.
could any one explain the benefit??
another counter example:
let xs = range 1 ~until:20 /@ (fun x -> printf "%d\n" x; 2 * x)
let ys = clone xs 
junk xs (*print 1 *)
junk ys 
(*print 1 the side effect is caused twice, most times this is not true *)
*)

find : ('a -> bool) -> 'a t (*see exists *)
mapi : (see map)
filter : ('a -> bool) -> 'a t -> 'a t 
(* time : o(1) (//) = flip filter (left associate) 
based on from2 
let ys = xs // f 
the same as before, could anyone explain why provided a customized clone 
function?
let xs = range 1 ~until:20 // (fun x -> printf "%d\n" x; x mod 2= 0)
let ys = clone xs
junk xs
junk ys (*side effect appeared again *)
what's the benefit over from next ?
*)

filter_map (*see filter*)
append : 'a t -> 'a t -> 'a t
(*time:o(1) *)
let rec append ta tb = 
   let t = { count = (fun () -> ta.count () + tb.count ());
             next = _dummy;
             clone = (fun () -> append (ta.clone ()) (tb.clone ());
             (* here clone t will cause ta, tb to be cloned, however, since
                append does not introduce side effects, it's ok.
                another benefit here, if tb is a clone of tb they will
                share the memory (*very good*)
             *)
             fast = ta.fast && tb.fast } in 
  t.next <- (fun () -> 
             try ta.next ()
             with No_more_elements -> 
                 t.next <- (fun () -> tb.next ());
                 t.count <- (fun () -> tb.count ());
                 (*add one indirection, since 
                   let tc = append ta tb
                   ... (*mutate tb , this is quite common
                   when invoke t.count, it may mutate tb
                   *)
                   the update of tb should be reflected on tc 
                  *)
                 t.fast <- tb.fast ;
                 (* a bug here ??? why not add an indirection *)
                 t.next ());
 t

Examples

To be added

Others

Just for notes (to be added)

  1. the printer for enum is unnecessary on the toplevel (battop.ml)
  2. Seq.fold and Enum.fold signature does not correspond
  3. Some useful functions missed : concat_map
  4. slazy (a bug? fast should be false?)
  5. batCharParser, batParserCo unusable, error message lacked
  6. Stream syntax extension for Enum?
  7. polymorphic heap (this is necessary even though it is unsafe)
  8. double-linked list (it is strange that it does not even support zero element list, should support)
  9. a generic tree data structure is missing