Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stdlib functional iterators #1002

Merged
merged 25 commits into from Mar 16, 2018

Conversation

@c-cube
Copy link
Contributor

c-cube commented Jan 10, 2017

See #635 for previous discussion and design.

This new proposal uses a purely functional type of delayed lists, as suggested by @fpottier
and already implemented in Batteries. The module is named Seq.

type 'a node =
  | Nil
  | Cons of 'a * 'a t

and 'a t = unit -> 'a node

The type is not structural, sadly (using type 'a t = unit -> ('a * 'a t) option does not pass the acyclicity check), but is very simple to use, and has decent performances (see previous PR). Ping @Drup @alainfrisch @gasche @jakemcarthur @paurkedal for more discussions.

About the choice of unit -> 'a node rather than 'a node lazy_t: it is much faster if the iterator is traversed once, and it entails fewer memory leaks (e.g. for range 0 1_000_000, there will be no memoization). One can still write a memoization combinator (I wrote one for gen which is very efficient, using an underlying buffer).

@paurkedal
Copy link

paurkedal left a comment

I generally think this approach looks good; since this is meant as a generic way to get out data from containers, I prefer it simple and performant rather than abstract (which I agree would be needed) and featureful. Though, I don't know how the performance compares to the previous PR.

@@ -42,6 +42,7 @@ STDLIB_MODULES=\
hashtbl \
int32 \
int64 \
iter \

This comment has been minimized.

@paurkedal

paurkedal Jan 11, 2017

Should be seq, right?


let of_seq i =
let n = ref 0 in
let buf = ref (make 256 '\000') in

This comment has been minimized.

@paurkedal

paurkedal Jan 11, 2017

Would be simpler to use Buffer, since geometric resizing is used anyway. (But good you avoid double iteration, since we don't know the complexity of traversing the input.)

This comment has been minimized.

@c-cube

c-cube Jan 11, 2017

Author Contributor

Yes, relying on Buffer would be more convenient, but it would create circular dependencies since Buffer already depends on Bytes.

This comment has been minimized.

@paurkedal

paurkedal Jan 11, 2017

Of course. Your partial reimplementation is quite lean anyway.

@@ -354,6 +354,35 @@ let stats h =
max_bucket_length = mbl;
bucket_histogram = histo }

(** {6 Seq.tors} *)

This comment has been minimized.

@paurkedal

paurkedal Jan 11, 2017

Search & replace error in heading?

in ascending order, from key [k] or above.
@since NEXT_RELEASE *)

val add_seq : 'a t -> (key * 'a) Seq.t -> 'a t

This comment has been minimized.

@paurkedal

paurkedal Jan 11, 2017

val add_seq : (key * 'a) Seq.t -> 'a t -> 'a t for consistency with add and to make the partial application composable.

This comment has been minimized.

@c-cube

c-cube Jan 11, 2017

Author Contributor

Indeed, will do.

(** Iterate on the whole set, in ascending order
@since NEXT_RELEASE *)

val add_seq : t -> elt Seq.t -> t

This comment has been minimized.

@paurkedal

paurkedal Jan 11, 2017

add_seq : elt Seq.t -> t -> t, cf comment on Map.add_seq.

This comment has been minimized.

@c-cube

c-cube Jan 11, 2017

Author Contributor

fixed in last commit

@c-cube

This comment has been minimized.

Copy link
Contributor Author

c-cube commented Jan 11, 2017

There were some basic performance comparisons in #635. As far as map, flat_map and the likes are concerned, @Drup made some benchmarks a while ago that put the various types discussed previously roughly in the same order, except for ('a -> unit) -> unit that is much faster on some of the operations it supports.

in
fill (len-1) tl

let of_seq i =

This comment has been minimized.

@alainfrisch

alainfrisch Jan 12, 2017

Contributor

This can be improved later, but this function might deserve a more efficient implementation which does not build the intermediate list at least for small enough arrays (e.g. keeping the first N values on the stack using a non-tail rec loop and creating the array while unwinding the recursion). Even for larger arrays, it might be more efficient to keep the intermediate values directly in arrays (either a resizing buffer, or a list of chunks). To be checked.

This comment has been minimized.

@c-cube

c-cube Jan 12, 2017

Author Contributor

Why not, will consider it indeed.


let to_seq b =
let rec aux i () =
if i = b.position then Seq.Nil

This comment has been minimized.

@alainfrisch

alainfrisch Jan 12, 2017

Contributor

Because of Buffer.{clear, reset, truncate}, you should probably test for i >= b.position instead. (Even if mutation during iteration is specified as being undefined...)

This comment has been minimized.

@c-cube

c-cube Jan 12, 2017

Author Contributor

will do.


let to_seqi b =
let rec aux i () =
if i = b.position then Seq.Nil

This comment has been minimized.

@alainfrisch

alainfrisch Jan 12, 2017

Contributor

Same as above.

(** {6 Iterators} *)

val to_seq : 'a array -> 'a Seq.t
(** Iterate on the array, in increasing order

This comment has been minimized.

@alainfrisch

alainfrisch Jan 12, 2017

Contributor

You should specify that mutation during iteration is not specified (as you did for Buffer). Or, alternatively specify the behavior as you did for Bytes.

val add: 'a t -> key -> 'a -> unit
val remove: 'a t -> key -> unit
val find: 'a t -> key -> 'a
val copy : 'a t -> 'a t

This comment has been minimized.

@alainfrisch

alainfrisch Jan 12, 2017

Contributor

No reason for these whitespace changes (and the recommended style would rather be the one without the whitespace).

val stats: 'a t -> statistics
end

module type FULL =

This comment has been minimized.

@alainfrisch

alainfrisch Jan 12, 2017

Contributor

Why do we need this distinction between S and FULL?

This comment has been minimized.

@c-cube

c-cube Jan 12, 2017

Author Contributor

that was because of the sharing of S with some ephemeron table, but I removed the sharing, so this has no reason to exist anymore indeed.

This comment has been minimized.

@alainfrisch

alainfrisch Mar 13, 2018

Contributor

@c-cube Can you address this point?

@@ -215,6 +215,36 @@ val stats : ('a, 'b) t -> statistics
buckets by size.
@since 4.00.0 *)

(** {6 Seq.iterators} *)

This comment has been minimized.

@alainfrisch

alainfrisch Jan 12, 2017

Contributor

Typo? (also in the doc strings below)

(** Iterate on the whole map, in ascending order
@since NEXT_RELEASE *)

val to_seq_at : key -> 'a t -> (key * 'a) Seq.t

This comment has been minimized.

@alainfrisch

alainfrisch Jan 12, 2017

Contributor

I'm not very fond of the name. to_seq_starting_from or to_seq_from, perhaps?

This comment has been minimized.

@paurkedal

paurkedal Jan 12, 2017

I agree to_seq_from is more descriptive than to_seq_at. But we could also incorporate more general slicing in the main sequence extractor,

val to_seq : ?first: (key -> bool) -> ?last: (key -> bool) -> 'a t -> (key * 'a) Seq.t

borrowing the semantics of find_first and find_last. While the end-of-sequence can be adjusted with a sequence combinator, direct support for it has the advantage of only holding on to the part of the container it needs plus O(log(n). Corr.: There is no such guarantee on space complexity here, since the sequence may contain a single element greater than the top-level node. The should be a gain in expected complexity for small sequence lengths.


val fold_left : ('a -> 'b -> 'a) -> 'a -> 'b t -> 'a

val iter : ('a -> unit) -> 'a t -> unit

This comment has been minimized.

@alainfrisch

alainfrisch Jan 12, 2017

Contributor

What about providing a function that memoize the iteration?

This comment has been minimized.

@c-cube

c-cube Jan 12, 2017

Author Contributor

That is something non-trivial to do efficiently, in my experience. Using lazy is relatively slow, so I have something better (a kind of append-only buffer) in the gen library that could be used as a building block for memoize. However, this is imho the kind of function that might be more suited to a 3rd party library on opam (which could provide many more combinators and evolve faster than the stdlib).

This comment has been minimized.

@alainfrisch

alainfrisch Jan 12, 2017

Contributor

Fair enough. It might be worth adding a note to seq.mli stating that the generator must be prepared to be used in a non-linear way, so that people don't expose an overly naive generator iterating over lines of a text file.

@alainfrisch

This comment has been minimized.

Copy link
Contributor

alainfrisch commented Jan 12, 2017

I'm a bit concerned by the addition of a module to the stdlib with such a short name. One way to avoid conflicts would be to add it as a submodule of Pervasives. Another would be to use a longer name (CamlSeq) and expose an alias module Seq = CamlSeq in Pervasives. (There probably requires some tweaks to the build system.)

@shindere

This comment has been minimized.

Copy link
Contributor

shindere commented Jan 12, 2017

@alainfrisch

This comment has been minimized.

Copy link
Contributor

alainfrisch commented Jan 12, 2017

I wouldn't bet that sequence.ml is not used by any OCaml project (and actually, there is at least https://github.com/c-cube/sequence/blob/4e2595df8983f64cb094334f8b4e2851e3f837fa/src/sequence.ml).

@c-cube

This comment has been minimized.

Copy link
Contributor Author

c-cube commented Jan 12, 2017

Indeed, I have been using Sequence as a toplevel module, it's in several projects. Maybe the Caml_seq + alias would be better (although I'm afraid when a maintainer mentions "tweaks to the build system").

@diml

This comment has been minimized.

Copy link
Member

diml commented Jan 12, 2017

I started working on a patch to prefix all the stdlib modules and use module aliases to get the short names back. It's all working, except for a few tests where types are not printed properly and some ocamldoc errors. I should be able to finish the patch tomorrow or next week, then we'll be free to add new modules to the stdlib without guilt. The branch is here: https://github.com/diml/ocaml/tree/prefix-the-stdlib

@mshinwell mshinwell added the stdlib label Jan 13, 2017

@diml diml referenced this pull request Jan 13, 2017

Merged

Prefix the stdlib #1010

@c-cube

This comment has been minimized.

Copy link
Contributor Author

c-cube commented Jan 24, 2017

is it worth the trouble updating this PR now, or should I wait for #1010 to be merged?

@c-cube c-cube force-pushed the c-cube:stdlib-iterators branch from 18970e4 to af64940 Jan 24, 2017

val filter_map_inplace: (key -> 'a -> 'a option) -> 'a t -> unit
val fold : (key -> 'a -> 'b -> 'b) -> 'a t -> 'b -> 'b
val length : 'a t -> int
val stats: 'a t -> Hashtbl.statistics

This comment has been minimized.

@bobot

bobot Mar 31, 2017

Contributor

The inclusion of Hashtbl.S was not used just for avoiding a copy, it was also a way to force people to keep the interface of ephemerons synchronized with the one of hashtables. Could you try to add the functions to the ephemerons?

This comment has been minimized.

@c-cube

c-cube Mar 31, 2017

Author Contributor

I'm not even sure what it means to iterate on a weak hashtable… I will certainly not do that before #1010 is merged, my quota of motivation is running dry right now.

This comment has been minimized.

@alainfrisch

alainfrisch Mar 14, 2018

Contributor

@bobot Do you agree with delaying the addition of Seq-related functions to ephemerons?

This comment has been minimized.

@bobot

bobot Mar 14, 2018

Contributor

I think the work just amount to copy the version of hashtbl and apply the same modification than for iter. Lets see, I take the code of this PR, I compare Hashtbl.iter and Ephemerons.MakeSeeded.iter, and ... code written but not typed or tested:

let to_seq tbl =
  (* capture current array, so that even if the table is resized we
     keep iterating on the same array *)
  let tbl_data = tbl.data in
  (* state: index * next bucket to traverse *)
  let rec aux i buck () = match buck with
    | Empty ->
        if i = Array.length tbl_data
        then Seq.Nil
        else aux(i+1) tbl_data.(i) ()
    | Cons (_, c, next) ->
            begin match H.get_key c, H.get_data c with
            | None, _ | _, None -> ()
            | Some key, Some data ->
                 Seq.Cons ((key, data), aux i next)
            end; do_bucket rest in
        
  in
  aux 0 Empty

So I prefer if @c-cube could take this code, in his MR. I would prefer that the signature of ephemerons stay the same than the one of hashtbl.


val return : 'a -> 'a t

val map : ('a -> 'b) -> 'a t -> 'b t

This comment has been minimized.

@ghost

ghost May 10, 2017

should we also have a >>=?

This comment has been minimized.

@paurkedal

paurkedal May 11, 2017

Wouldn't that be the same as flat_map found below?


val iter : ('a -> unit) -> 'a t -> unit

val memo : 'a t -> 'a t

This comment has been minimized.

@ghost

ghost May 11, 2017

Should we have an imperative yield function?

let yield (x:' a Seq.t) : ('a option * 'a Sequence.t) = 
  let open Seq in match x () with 
  | Nil -> (None, empty)
  | Cons (a, r) -> (Some a, r)

This comment has been minimized.

@c-cube

c-cube May 11, 2017

Author Contributor

you mean a memoizing peek or something like that? I don't really get what yield would be used for.

besides, until this has chances of being merged (i.e. #1010 merged?) I won't spend any more time on that PR.

This comment has been minimized.

@ghost

ghost May 13, 2017

Hi, sorry for the late reply, it is more like next in Core.Sequence. But we can wait until this PR is merged and then consider this function as a useful extension. I personally used it a lot. Not sure how others feel.

@damiendoligez damiendoligez added this to the 4.07-or-later milestone Sep 29, 2017

@infinity0

This comment has been minimized.

Copy link
Contributor

infinity0 commented Jan 21, 2018

@alainfrisch

This comment has been minimized.

Copy link
Contributor

alainfrisch commented Mar 13, 2018

Frankly, I appreciate your heroic efforts to move this forward, I understand the process can be frustrating, and I'm trying to help, but this is an important change to the stdlib and it deserves proper review and polishing. Btw, some review comments from last year remain unaddressed.

@c-cube

This comment has been minimized.

Copy link
Contributor Author

c-cube commented Mar 13, 2018

Right. Apologies, I'm not at the top today. I've just udpated the code following some reviews, but might have missed some of them, would you be so kind as to indicate which ones are missing, if any?

@ubsan ubsan referenced this pull request Mar 13, 2018

Closed

Adding a Seq type to Base? #41

end
(** The output signature of the functor {!Hashtbl.Make}. *)
(** The core output signature of the functor {!Hashtbl.Make}. *)

This comment has been minimized.

@alainfrisch

alainfrisch Mar 14, 2018

Contributor

This 'core' is no longer relevant.

end
(** The output signature of the functor {!Hashtbl.MakeSeeded}.
(** The core output signature of the functor {!Hashtbl.MakeSeeded}.

This comment has been minimized.

@alainfrisch

alainfrisch Mar 14, 2018

Contributor

Same as above.

Changes Outdated
@@ -50,6 +50,10 @@ Working version

### Standard library:

- GPR #1002: add a new `Seq` module defining a list-of-thunks style iterator.

This comment has been minimized.

@alainfrisch

alainfrisch Mar 14, 2018

Contributor

Spurious whitespace after GPR.

@alainfrisch

This comment has been minimized.

Copy link
Contributor

alainfrisch commented Mar 14, 2018

LGTM, modulo a few tiny inline comments. We will want to add more functions and optimize the implementation of current ones but this can be done later.

I will merge later this week unless someone objects to it.

@c-cube Can you replace NEXT_RELEASE with 4.07?

@gasche

This comment has been minimized.

Copy link
Member

gasche commented Mar 14, 2018

Congratulations on getting this far!

What is (roughly) the amount of test coverage for the new functions? I only re-skimmed the patch, but I see tests for Seq.filter. Is the idea to eventually get more? (I think that basic tests for the various of_seq/to_seq functions for example could be interesting.)

@c-cube

This comment has been minimized.

Copy link
Contributor Author

c-cube commented Mar 14, 2018

I'll do the ephemeron changes and that should be good, then. Thank you @alainfrisch!

@c-cube

This comment has been minimized.

Copy link
Contributor Author

c-cube commented Mar 14, 2018

I added the basics for Ephemeron, following @bobot's advice. @gasche I added a few tests for Set and Map, and adding basic {to_seq,of_seq} hashtbl tests right now.

else aux(i+1) tbl_data.(i) ()
| Cons (_, c, next) ->
begin match H.get_key c, H.get_data c with
| None, _ | _, None -> Seq.Nil

This comment has been minimized.

@bobot

bobot Mar 15, 2018

Contributor

@c-cube Ok, my bad I put () in my comment (here Seq.Nil) but it should be aux i next, even if the binding is not there anymore we still need to look at the other bindings.

@bobot

This comment has been minimized.

Copy link
Contributor

bobot commented Mar 15, 2018

I added the basics for Ephemeron, following @bobot's advice.

Thank you!

c-cube and others added some commits Mar 15, 2018

@alainfrisch

This comment has been minimized.

Copy link
Contributor

alainfrisch commented Mar 16, 2018

I fixed a conflict in Makefile, now waiting for CI green light before merging. I second @gasche's suggestion to add more tests at some point, though.

@alainfrisch alainfrisch merged commit df80f34 into ocaml:trunk Mar 16, 2018

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@alainfrisch

This comment has been minimized.

Copy link
Contributor

alainfrisch commented Mar 16, 2018

Merged, just in time for the week-end! (With a squash merge, since the history did not seem so useful.)

@c-cube

This comment has been minimized.

Copy link
Contributor Author

c-cube commented Mar 16, 2018

Thank you @alainfrisch !! 😁
Now I think I should write a compatibility package for the 'a Seq.t type, and add more tests.

@alainfrisch

This comment has been minimized.

Copy link
Contributor

alainfrisch commented Mar 23, 2018

(@c-cube It occurred to me that such contribution would probably require a CLA. Have you already signed one?)

@c-cube

This comment has been minimized.

Copy link
Contributor Author

c-cube commented Mar 23, 2018

I don't think I have, but this was written on my free time, so there should be not problem. Where should I sign the CLA?

@c-cube

This comment has been minimized.

Copy link
Contributor Author

c-cube commented Mar 23, 2018

There isn't an online form for the CLA? 😕
This is going to take time, I don't have a printer…

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.