Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guarantee that Set.add doesn't allocate and returns the original set if the added element is already in the set #6645

Closed
vicuna opened this Issue Nov 4, 2014 · 7 comments

Comments

Projects
None yet
2 participants
@vicuna
Copy link
Collaborator

vicuna commented Nov 4, 2014

Original bug ID: 6645
Reporter: @alainfrisch
Assigned to: @alainfrisch
Status: closed (set by @xavierleroy on 2016-12-07T10:49:02Z)
Resolution: fixed
Priority: low
Severity: tweak
Target version: 4.02.2+dev / +rc1
Fixed in version: 4.03.0+dev / +beta1
Category: standard library
Tags: patch
Monitored by: @ygrek jmeber @hcarty

Bug description

I suggest to replace the implementation of Set.add with:

let rec add x = function
    Empty -> Node(Empty, x, Empty, 1)
  | Node(l, v, r, _) as t ->
      let c = Ord.compare x v in
      if c = 0 then t else
      if c < 0 then
        let ll = add x l in
        if l == ll then t else bal ll v r
      else
        let rr = add x r in
        if r == rr then t else bal l v rr

This improves performance by not allocating when the added element is already in the set, but equally importantly, it guarantees that the result is physically equal to the original set in that case. This makes it possible to implement the common pattern "add if not already present and do something different in the two cases" with a single traversal.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Jan 13, 2015

Comment author: @damiendoligez

Just note that strictly speaking it doesn't guarantee anything because the specification of == is very weak.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Jan 13, 2015

Comment author: @alainfrisch

So, do you think we shouldn't write down on the function documentation that its result is physically equal to its argument when the element is already in the set? This property could be very useful to avoid double lookup.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Mar 12, 2015

Comment author: @alainfrisch

Just note that strictly speaking it doesn't guarantee anything because the specification of == is very weak.

It is indeed the case today that storing a value in a data structure and loading it back doesn't necessarily return something physically equal to the original value. This happens when a float is stored in an array or in a unboxed-float record.

But here, we only rely on the fact that:

function
| (...) as t -> .... t ....

returns something physically equal to its argument (plus usual data flow rules: if the "then" branch is chosen, the result of "if-then-else" is physically equal to the one of the "then" branch; or: the value seen be a function's caller is physically equal to the value computed by the function's body).

Do you envision cases where the suggested implementation of Set.add would indeed not satisfy the stated invariant? (e.g. with js_of_ocaml for instance)

Apart from this question of whether to mention the would-be invariant in the documentation, do you agree with the proposal?

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Mar 25, 2015

Comment author: @damiendoligez

I agree with the proposal and I agree with fully documenting the invariant. I was just being pedant.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Mar 26, 2015

Comment author: @alainfrisch

Thanks! I've applied this to trunk (rev 15966), documenting the fact that the result is physically equal to the argument (but not the fact that the function doesn't allocate).

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Mar 26, 2015

Comment author: @xavierleroy

Preservation of sharing is important for some applications (e.g. static analysis -- one of the reasons that Astree scales is that it is very careful with sharing). However, this is a slippery slope: you did it for Set.add, but what about other Set operations? (union, etc) What about Map operations? Preserving sharing in Map.merge is especially sensitive for static analysis applications.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Mar 26, 2015

Comment author: @alainfrisch

Actually, this is only a first step. LexiFi's version of set.ml indeed has similar optimizations for other operations. For instance, our version of Set.union tries to benefit from physical equality and preserve it:

===============
let rec union s1 s2 =
      if s1 == s2 then s1 (* LEXIFI *)
      else match (s1, s2) with
        (Empty, t2) -> t2
      | (t1, Empty) -> t1
      | (Node(l1, v1, r1, h1), Node(l2, v2, r2, h2)) ->
          if h1 >= h2 then
            if h2 = 1 then add v2 s1 else begin
              (* BEGIN LEXIFI *)
              let (l2, _, r2) = split v1 s2 in
              let ll = union l1 l2 in
              let rr = union r1 r2 in
              if ll == l1 && rr == r1 then s1
              else join ll v1 rr
              (* END LEXIFI *)
            end
          else
            if h1 = 1 then add v1 s2 else begin
              (* BEGIN LEXIFI *)
              let (l1, _, r1) = split v2 s1 in
              let ll = union l1 l2 in
              let rr = union r1 r2 in
              if ll == l2 && rr == r2 then s2
              else join ll v2 rr
              (* END LEXIFI *)
            end
===============

The reason I've started with Set.add is that it's easier to specify what the optimization does and there is a direct use case (avoid Set.mem for the typical memoization pattern). With Set.union as above, the performance depends on how the two sets have been built (whether they share internal nodes), and it's difficult to reason on that, so the optimization is more like a "best effort" than a guaranteed invariant.

What's your opinion on adding such optimizations e.g. for Set.union?

For Map.merge, we also have a specialized version:

val union: (key -> 'a -> 'a -> 'a) -> 'a t -> 'a t -> 'a t

which is more efficient than Map.merge (although it currently doesn't do anything special w.r.t. physical equality yet):

===============
    let rec union f s1 s2 =
      match (s1, s2) with
      | (Empty, s) | (s, Empty) -> s
      | (Node (l1, v1, d1, r1, h1), Node (l2, v2, d2, r2, h2)) ->
          if h1 >= h2 then
            let (l2, d2, r2) = split v1 s2 in
            let l = union f l1 l2 and r = union f r1 r2 in
            let d =
              match d2 with
              | None -> d1
              | Some d2 -> f v1 d1 d2
            in
            join l v1 d r
          else
            let (l1, d1, r1) = split v2 s1 in
            let l = union f l1 l2 and r = union f r1 r2 in
            let d =
              match d1 with
              | None -> d2
              | Some d1 -> f v2 d1 d2
            in
            join l v2 d r
===============

@vicuna vicuna closed this Dec 7, 2016

@vicuna vicuna added the stdlib label Mar 14, 2019

@vicuna vicuna added this to the 4.02.2 milestone Mar 14, 2019

@vicuna vicuna added the bug label Mar 20, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.