introduce a more efficient mutable list for Enum.force and Enum.take #362

Closed
wants to merge 1 commit into
from

4 participants

@c-cube
ocaml-batteries-team member

Use unrolled mutable lists to store elements of a forced BatEnum.t; it should take less memory on big enumerations. Also added some tests for BatEnum.take and BatEnum.force.

@gasche
ocaml-batteries-team member

Do you have some numeric evidence that it actually improves performances?

I must say I'm ill-disposed towards this commit. The semantic model and code for force is an absolute horror to get right (see eg. the mysterious group_by bug), and I would not necessarily trust myself to review that your different implementation really is equivalent (I haven't tried yet).

If you send me the tests as a separate pull request, I'll happily add them, but I need much more time to get my thoughts in order about Enum.force (I don't even remember what the problems were).

@c-cube
ocaml-batteries-team member

I don't really know how to benchmark this (with two different checkouts of batteries, before and after the commit?). I can also add more tests if you wish. We add a discussion with @thelema last day on #ocaml about this change. I'm curious about the bugs you had with the previous implementation (which looked quite complicated to me: isn't force supposed to eagerly construct a list of the values, then enumerate on this list?).

@gasche
ocaml-batteries-team member

The bug was due to intricate interactions between the various parts of Enum, not only force by itself (but if you change one part...). If the behaviour of the new versions of force and take are provably preserved (and they're not just new implementations of the same informal specification), then they may be correct.

Serious discussions on IRC are problematic, because this is valuable content that is lost, or at least stored in an unstructred form that makes looking back at it more painful than necessary. Being heavy-handed into Github discussions rather than a nice mailing-list with archives as we like them is already a regression, we should devise punishments for the IRC types. More seriously, do you have a link to an archive of the discussion?

@gasche
ocaml-batteries-team member

Nobody in his right mind would commit a performance-improving change to an already insanely complicated code without hard data on the performance improving aspect! I would rather be patching the OCaml GC code than reviewing an Enum change.

@c-cube
ocaml-batteries-team member

Sure, sorry, I had a typing issue that I only solved late in the evening. I admit this pull request is prematurate. Can we pause this until I add some benchmarks?

@gasche
ocaml-batteries-team member

Sure, there is no hurry. Thanks for the work in any case.

@agarwal
ocaml-batteries-team member
@c-cube
ocaml-batteries-team member

The piece of code that implements an unrolled list to store the elements of an Enum is actually adapted from my Sequence library. I do have benchmarks of it (against ExtLib's Enum though, not Batteries'), which was the point of the discussion yesterday. For what I've seen, an Enum is actually quite close to a Stream but with additional capabilities like counting its elements without consuming it; both are driven by the caller (who can call 'get' or 'next'), whereas the only way to consume a Sequence is by iterating/folding on it. @agarwal : does your code in Biocaml use a lot 'peek', 'next' and the likes, or mostly 'iter', 'map', 'filter', 'fold' and other whole-enumeration combinators?

@gasche
ocaml-batteries-team member

@agarwal I have little love for Enum and would be ready to part with it. We inherited it from Extlib, and I think the original idea (let's have a cornerstone datastructure to convert from and to other structures) the good, but the implementation is just too clever for me.
That said, I have also had troubles with Stream (see this yet-unsolved issue for example).

I think we need to do some serious research of what the interface for destructive streams should be, what invariants they should respect, etc. Lure some researcher to work on this not-terribly-sexy subject, get a beautiful LaTeX paper with a correctness proof in Coq, extract the resulting code to OCaml, and never use anything else.

In the meantime, I think non-destructive sequences are a saner choice. Pick something like lazy lists or simon's Sequence (Batteries' Seq is also reasonable) when possible. I would be interested in reports on what the performance implications of picking a non-destructive sequence implementation are in Biocaml. By the way, did you obtain any improvement with Stream instead of Enum?

@c-cube
ocaml-batteries-team member

My goal when writing Sequence was minimal overhead (no allocation of many closures, or lazy list constructors, etc.): it is therefore just a wrapper around the iter functions that allows composition (map, filter, etc.). However, it is not necessarily non-destructive: if you use Sequence.of_hashtbl (iterating on (key,value) pairs of a hashtable), then mutating the hashtable's content also mutates the sequence's "content". On immutable data, like lists, however, a Sequence really is non-destructive and can be used many times. For destructive sequences, in the current master branch, there is a Sequence.persistent that does exactly the same thing as Enum.force.

About extracting from Coq, that would indeed be interesting (provided you can prove termination ;)). I think a functional implementation of Sequence, based on "fold" rather than iter, should be possible.

@gasche
ocaml-batteries-team member

By "destructive" I mean the fact that the idiomatic way to access the next element deletes it from the structure. Both Stream and Enum enforce linear use (only one-time traversal of their sequences). In theory this has good memory consumption properties (avoiding the infamous let avg seq = sum seq / count seq in avg (1 -- 1_000_000) problem), in the sense that all programs that try to behave otherwise become fast and wrong rather than correct but slow. A debatable bargain.

@agarwal
ocaml-batteries-team member
@c-cube
ocaml-batteries-team member

@agarwal isn't there a problem if the data structure is updated during a stream traversal? I also tend to prefer the function of_stream to have the signature

val of_stream : 'a t -> 'a Stream.t -> 'a t

so that one can efficiently update a structure using a Stream (for instance, update a Set rather than using Set.union set (Set.of_stream stream) which is less efficient).

@agarwal
ocaml-batteries-team member
@agarwal
ocaml-batteries-team member
@gasche
ocaml-batteries-team member

I believe this is an interesting discussion that would deserve happening in the caml-list or batteries-devel mailing-list rather than lost in a bugtracker item.

@c-cube
ocaml-batteries-team member

Indeed. Is there a way to export this? Or should we make a summary to start a thread on the caml-list, since it looks like this design issue concerns many libraries?

@agarwal
ocaml-batteries-team member
@c-cube
ocaml-batteries-team member

What are the semantics? Is it that of_stream t s returns a data structure with the elements of both t and s?

Yes, so you can build a data structure using an already existing one (possibly new/empty) and one or more streams. In the case of hashtbl, though, I have two such operators, with the semantic of Hashtbl.replace and Hashtbl.add respectively.

@agarwal
ocaml-batteries-team member
@c-cube
ocaml-batteries-team member

Right, in case the structure is aware that Stream exists, it can have a more efficient way of implementing of_stream.

@agarwal
ocaml-batteries-team member
@thelema thelema commented on the diff Feb 28, 2013
src/batEnum.ml
- let remaining = ref n in
- let f () =
- if !remaining >= 0 then
- let result = e.next () in
- decr remaining;
- result
- else raise No_more_elements
- in let e = make
- ~next: f
- ~count:(fun () -> !remaining)
- ~clone:_dummy
- in e.clone <- (fun () -> force e; e.clone ());
- e*)
+let take (n : int) (e : 'a t) : 'a t =
+ let l = MList.make 64 in
+ (* consume the enum [t] and put its elements into [l] *)
@thelema
ocaml-batteries-team member
thelema added a line comment Feb 28, 2013

Consume the head of [t]; only n elements are consumed; the rest remain.

@c-cube
ocaml-batteries-team member
c-cube added a line comment Feb 28, 2013

Something that may improve the behavior of force on small enums, is to start with a small size of MList block, than then have them get bigger and bigger.

@thelema
ocaml-batteries-team member
thelema added a line comment Feb 28, 2013

i.e. variably sized MList blocks? hmmm... Interesting, but it'd have to be benchmarked.

@c-cube
ocaml-batteries-team member
c-cube added a line comment Feb 28, 2013

The point is not to improve performance in the case of large Enum (should be quite similar), but rather to waste less memory for small/very small Enums.

@thelema
ocaml-batteries-team member
thelema added a line comment Feb 28, 2013

In this particular case, can't you just use min n 64 as the size of the MList node?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@thelema
ocaml-batteries-team member

Closing, as it doesn't look like this is mergeable anymore.

@thelema thelema closed this Aug 26, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment