
flambda: Improve transitive closure in invariant_params_in_recursion #13150

Merged: 5 commits into ocaml:trunk on May 9, 2024

Conversation

@fweimer-rh (Contributor) commented May 6, 2024

This change rewrites the transitive closure computation for invariant_params_in_recursion. I believe this is more or less a traditional implementation (of the depth-first search variant).

It fixes a performance issue where Coccinelle can barely be compiled by flambda.

The relation that triggers this is not even particularly large (12,732 arguments with 3,449 Top relations and 56,289 non-Top relations).

The old implementation did not really exploit the sparseness of the graph because it used newly discovered edges in later iterations. The new implementation processes each original relation only once per starting node, and does not re-process newly discovered relations.
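
For readers who want the shape of the new algorithm without reading the flambda sources, here is a minimal, self-contained sketch of the same depth-first idea over a plain integer graph. The IntMap/IntSet representation and the function names are illustrative assumptions; the actual patch works on Variable.Pair.Map/Set and additionally has to propagate the Top element.

(* A minimal sketch (not the flambda code) of the depth-first closure:
   nodes are ints, the relation is a map from node to successor set. *)
module IntMap = Map.Make (Int)
module IntSet = Set.Make (Int)

type graph = IntSet.t IntMap.t

(* Close a single starting set: pop candidates, and whenever a node is newly
   added to the result, push it on the frontier so that its successors are
   expanded exactly once.  Each original edge is thus inspected at most once
   per starting node. *)
let closure_of (g : graph) (start : IntSet.t) : IntSet.t =
  let successors node =
    try IntMap.find node g with Not_found -> IntSet.empty in
  let rec loop candidates frontier result =
    match candidates, frontier with
    | [], [] -> result
    | [], node :: frontier ->
      loop (IntSet.elements (successors node)) frontier result
    | node :: candidates, frontier ->
      let result' = IntSet.add node result in
      if result' != result then
        (* Physical inequality means [node] is new; expand it later. *)
        loop candidates (node :: frontier) result'
      else
        loop candidates frontier result
  in
  loop [] (IntSet.elements start) start

(* As in the PR, the closure of each node is computed independently. *)
let transitive_closure (g : graph) : graph =
  IntMap.map (closure_of g) g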
@rwmjones (Contributor) commented May 7, 2024

I produced a flamegraph of ocamlopt compiling the problematic source file:
http://oirase.annexia.org/2024-05-ocamlopt.svg

Most of the time (about 86%) is spent in these two functions in ocamlopt when flambda is enabled:

let once state =
  Variable.Pair.Map.fold (fun arg _ state -> update arg state) state state
in
let rec fp state =
  let state' = once state in
  if Variable.Pair.Map.equal equal state state'
  then state
  else fp state'
in
fp state

This is the issue that Florian's patch aims to solve.

Review comment on middle_end/flambda/invariant_params.ml (outdated):
Some (match set with
| Top -> Top
| Implication set ->
loop [] (Variable.Pair.Set.elements set) set))
@gasche (Member) commented May 7, 2024

If I understand correctly, you are computing the transitive closure of each element of the input state independently -- without reusing the computations you have already done for the previous elements. This sounds wasteful: if (v1, s1) is in the input state and we compute its transitive closure s1', and then we encounter another element (v2, s2) with v1 reachable from s2, then (if I understand correctly) you will recompute s1' from s1 at this point, whereas you could directly use s1'. I would thus expect a fold on the input state rather than a map. Am I missing something?


(In fact it might even be possible to compute the transitive closure for the whole mapping at once, instead of computing element-per-element, which saves even more work as intermediate computations can also be reused. I suppose that the "candidates" would be a pair variable * candidate list, and the frontier would be a list (variable * candidate list) list. But this is a slightly more invasive change than just using a fold above, and maybe there is no performance justification for making the code slightly more complex.)

@fweimer-rh (Contributor, Author):

I tried to explain this in the commit message. In the Coccinelle/Menhir case, the input relation is sparse, but the result is not. With the fold, it's no longer obvious that the individual steps for each node process at most m relations (m being the number of input relations) because we get extra relations from previous steps.

@gasche (Member) commented May 7, 2024

For a non-expert trying to understand the code (I haven't studied the details as you have, so apologies for any misunderstandings), the problem with the previous implementation is the fact that whenever any part of the state changes, all variables are recomputed on the next iteration, leading to wasteful recomputations that are probably quadratic in practice.

You are pointing out that with my suggestion there is a risk of unnecessary work as well: if s1' is much larger than s1, but most of it is already in s2, it is wasteful to compute the union s1' U s2 rather than s1 U s2. My intuition would be that this is fine as the union computation is efficient when sets overlap.

With the fold, it's no longer obvious that the individual steps for each node process at most m relations (m being the number of input relations) because we get extra relations from previous steps.

Yes, but on the other hand the number of steps may increase a lot with the map due to duplicated closure work -- in the worst case, where the transitive closure is fully connected, the computation goes from O(n) to O(n^2). (For example the cycle: 1 -> [2], 2 -> [3], ..., n-1 -> [n], n -> [1].)
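
To make that worst case concrete, the cycle relation can be written down in the illustrative int-graph representation sketched earlier (a hypothetical helper, not part of the PR); its transitive closure maps every node to the full set {1, ..., n}, so each of the n independent per-node searches visits all n nodes.

(* Hypothetical worst-case input: the cycle 1 -> {2}, 2 -> {3}, ..., n -> {1}.
   Every node reaches every other node, so each per-node DFS does O(n) work
   and the independent per-node computation totals O(n^2) steps. *)
let cycle n : graph =
  let next i = if i = n then 1 else i + 1 in
  let rec build i g =
    if i > n then g
    else build (i + 1) (IntMap.add i (IntSet.singleton (next i)) g)
  in
  build 1 IntMap.empty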

@fweimer-rh (Contributor, Author):

Regarding the fold idea, I tried this:

diff --git a/middle_end/flambda/invariant_params.ml b/middle_end/flambda/invariant_params.ml
index bbadb7db51..0de56dff01 100644
--- a/middle_end/flambda/invariant_params.ml
+++ b/middle_end/flambda/invariant_params.ml
@@ -68,7 +68,7 @@ let transitive_closure state =
   (* Depth-first search for all implications for one argument.
      Arguments are moved from candidate to frontier, assuming
      they are newly added to the result. *)
-  let rec loop candidate frontier result =
+  let rec loop candidate frontier result state =
     match (candidate, frontier) with
     | ([], []) -> Implication result
     | ([], frontier::fs) ->
@@ -77,21 +77,24 @@ let transitive_closure state =
               | Not_found -> Implication Variable.Pair.Set.empty) with
        | Top -> Top
        | Implication candidate ->
-         loop (Variable.Pair.Set.elements candidate) fs result)
+         loop (Variable.Pair.Set.elements candidate) fs result state)
     | (candidate::cs, frontier) ->
       let result' = Variable.Pair.Set.add candidate result in
       if result' != result then
         (* Result change means candidate becomes part of frontier. *)
-        loop cs (candidate :: frontier) result'
+        loop cs (candidate :: frontier) result' state
       else
-        loop cs frontier result
+        loop cs frontier result state
   in
-    Variable.Pair.Map.mapi
-      (fun _ set ->
-         match set with
-        | Top -> Top
-        | Implication set -> loop [] (Variable.Pair.Set.elements set) set)
-      state
+    Variable.Pair.Map.fold
+      (fun key set state ->
+         Variable.Pair.Map.add key
+           (match set with
+           | Top -> Top
+           | Implication set ->
+              loop [] (Variable.Pair.Set.elements set) set state)
+           state)
+      state state
 
 (* CR-soon pchambart: to move to Flambda_utils and document
    mshinwell: I think this calculation is basically the same as

But it blows up again for the Coccinelle example.

@gasche (Member):

Could you try the small tweak of making frontier a set instead of a list?

@fweimer-rh (Contributor, Author):

@gasche Could you explain why? This is essentially used as a stack, and as far as I can see, by construction, there cannot be any duplicated elements:

  • The initial contents are the elements of a set. No duplicates there.
  • We push an element onto the stack only if it is newly added to the result set. This cannot add a duplicate, either.

@gasche (Member) commented May 7, 2024

To take a concrete example, suppose we start from the state: 1 -> [2], 2 -> [3], 3 -> [4]. With the PR we will first compute the closure for 1, that is [2, 3, 4], then we will compute the closure for 2, that is [3, 4], then for 3, that is [4]. Using a fold, if the fold goes left-to-right, the computation will be the same as with map (no reuse); if it goes right-to-left, when we compute the closure of 1 we already have 2 -> [3, 4] available and we can compute [2, 3, 4] in a single step.

I think that we could compute the fixpoint for all elements at once so that reuse happens no matter the order, but I'm not sure how to write the traversal for this. (I think that I would try a recursive DFS where the state is a shared mutable reference instead of a parameter to the recursive function.)
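
As an illustration of this shared-state idea (a hypothetical sketch in the int-graph representation from the earlier sketch, not code from this PR): keep a mutable cache of nodes whose closure is already finished; when a search reaches such a node, union in its cached closure instead of re-expanding it, so reuse happens regardless of traversal order. Nodes on a cycle with the current start are not yet cached and simply fall back to normal expansion, and, as discussed above, unioning large intermediate closures has a cost of its own.

let transitive_closure_shared (g : graph) : graph =
  let cache : (int, IntSet.t) Hashtbl.t = Hashtbl.create 16 in
  let successors node =
    try IntMap.find node g with Not_found -> IntSet.empty in
  let close start =
    let rec loop candidates frontier result =
      match candidates, frontier with
      | [], [] -> result
      | [], node :: frontier ->
        (match Hashtbl.find_opt cache node with
         | Some closed ->
           (* [closed] is already transitively closed: union it in and skip
              re-expanding [node]'s successors. *)
           loop [] frontier (IntSet.union result closed)
         | None ->
           loop (IntSet.elements (successors node)) frontier result)
      | node :: candidates, frontier ->
        let result' = IntSet.add node result in
        if result' != result then
          loop candidates (node :: frontier) result'
        else
          loop candidates frontier result
    in
    loop [] (IntSet.elements start) start
  in
  IntMap.mapi
    (fun node succ ->
       let closed = close succ in
       Hashtbl.replace cache node closed;
       closed)
    g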

@gasche (Member) commented May 7, 2024

Taking a step back:

  • you have found a real issue on some machine-generated programs and your patch fixes it
  • I believe that the code that you propose is correct
  • there is no need to use a "perfect" fixpoint algorithm here, just something that is good enough (it works on more programs than before); but if we take shortcuts, the same problem may arise for a different sort of generated code (maybe Menhir changes its code-generation strategy in the future and it falls in another bad case)

If I was convinced that this new code does not introduce bad cases that the previous code handles better, I would suggest to merge it. I am not convinced yet -- I think that in the fully-connected case I sketched, the previous algorithm may be asymptotically better.

Context

The computation that we are talking about is computing relations between the argument variables of a nest of mutually-recursive functions. For human-written programs there are at most dozens of them, so any correct algorithm is fine. Apparently for Menhir-generated programs there can be thousands of them, so being quadratic hurts.

Fixpoint as a service?

Implementing a good generic fixpoint-computation algorithm for a set of recursive inequations is not too much work, but somewhat different from this PR. (For example, earlier versions of @fpottier's Fix were a small amount of code.)

It could be used here, and also in typing/typedecl_properties. I haven't found other clear use-cases in the type-checker codebase. asmcomp/dataflow contains a fixpoint computation over a Mach program, but it is more specialized -- it is designed to handle nested recursion well.

@fpottier (Contributor) commented May 7, 2024

Hello! I have not studied this PR, but I can confirm that Menhir's code back-end generates lots of mutually recursive functions (possibly in the tens of thousands). In my opinion, the OCaml compiler should be prepared to handle this!

Regarding fixed point computations, I believe that using a generic (reusable) algorithm is a good idea. My library Fix provides two such algorithms:

  • Fix.Fix requires less work on the user's part (the user does not need to describe the dependency graph; dependencies are dynamically discovered) but is a bit more complex and less efficient; it is well-suited to backward analyses.
  • Fix.DataFlow requires more work from the user (dependencies must be explicitly described) but is less complex and more efficient; it is well-suited to forward analyses.

I would suggest re-using one of these algorithms, if possible, or taking inspiration from them to develop your own generic algorithm. Once you have such a tool, it becomes useful in many places. Menhir uses Fix for many purposes.
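
For a rough idea of what this looks like with Fix's dynamic-dependency solver, here is a sketch of a reachability computation written against the Fix.Fix.ForOrderedType/lfp interface of the fix opam package (module and function names as I remember the documented API; the Node type and the successors function are made-up inputs, not the flambda data structures):

module Node = struct
  type t = int
  let compare = Int.compare
end

module NodeSet = Set.Make (Node)

(* The property lattice: sets of reachable nodes, ordered by inclusion. *)
module ReachableSet = struct
  type property = NodeSet.t
  let bottom = NodeSet.empty
  let equal = NodeSet.equal
  let is_maximal _ = false
end

module Solver = Fix.Fix.ForOrderedType (Node) (ReachableSet)

(* reachable node = the successors of node plus everything reachable from
   them; [get] asks the solver for another variable's current value, so the
   dependency graph is discovered dynamically. *)
let reachable (successors : Node.t -> Node.t list) : Node.t -> NodeSet.t =
  Solver.lfp (fun node get ->
    List.fold_left
      (fun acc succ -> NodeSet.add succ (NodeSet.union acc (get succ)))
      NodeSet.empty
      (successors node))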

@elfring commented May 7, 2024

The relation that triggers this is not even particularly large (12,732 arguments with 3,449 Top relations and 56,289 non-Top relations).

💭 How was such information determined?

@fweimer-rh (Contributor, Author):

The relation that triggers this is not even particularly large (12,732 arguments with 3,449 Top relations and 56,289 non-Top relations).

💭 How was such information determined?

@elfring The usual printf debugging, I'm afraid.

@elfring commented May 7, 2024

…, but I can confirm that Menhir's code back-end generates lots of mutually recursive functions (possibly in the tens of thousands).

Can such numbers be reduced another bit by improving the usage of corresponding function and class libraries? 🤔

In my opinion, the OCaml compiler should be prepared to handle this!

🔮 I became curious how affected software components will evolve further.

@elfring commented May 7, 2024

The usual printf debugging, I'm afraid.

💭 Would you like to check and compare software run time characteristics in more detail with the help of any higher level analysis and development tools?

@elfring commented May 7, 2024

I produced a flamegraph of ocamlopt compiling the problematic source file:

@rwmjones:

  • 🤔 Were such data generated according to a complete program execution?
  • 💭 Would you like to share all program parameters?

@fpottier (Contributor) commented May 7, 2024

Can such numbers be reduced another bit by improving the usage of corresponding function and class libraries? 🤔

No. The generated code does not depend on any library. It is just a large finite-state automaton.

@elfring commented May 7, 2024

Can such numbers be reduced another bit by improving the usage of corresponding function and class libraries? 🤔

No. The generated code does not depend on any library. It is just a large finite-state automaton.

🔮 Can such a view trigger any additional development ideas?

@rwmjones (Contributor) commented May 7, 2024

Can such numbers be reduced another bit by improving the usage of corresponding function and class libraries? 🤔

No. The generated code does not depend on any library. It is just a large finite-state automaton.

The guy's a troll, you're wasting your time replying.

@elfring commented May 7, 2024

The guy's a troll, you're wasting your time replying.

💭 I hope that the running software clarification can be kept more constructive.

@Octachron (Member):

@elfring , please stop spamming this pull request with repeated inquiries on the same subject. It is courteous to wait for replies before asking other questions. As a moderator, I will have to ban you if you continue to ignore basic netiquette.

@chambart (Contributor) commented May 7, 2024

I haven't tested that code, but I don't expect it to be noticeably worse than the current one on existing code. If it works on Menhir-generated code, that's enough to be convincing (I think you might also want to try Dolmen, which also has a big parser for SMT-LIB files). As @gasche said, I don't expect it to matter in any human-written code, and I don't know of other tools that generate gigantic recursive function packs (maybe deriving on recursive types, but that's usually much smaller). There is no need to delay it for a better generic fixpoint lib.

I made a few minor suggestions. I think it would be a tad safer to apply @gasche's suggestion of using a set for frontier.
👍

By the way, thank you for fixing my old nasty code.

@fpottier (Contributor) commented May 7, 2024

I don't know of other tools that generate gigantic recursive function packs.

Maybe ocamllex -ml.

@Gbury (Contributor) commented May 7, 2024

I can confirm that I have seen an instance of this code taking far too much time (more than 15 minutes on CI machines) on a real-life parser, as @chambart said, although this happened on the parser for the TPTP language rather than SMT-LIB; see the parser and my (bad) fix, which was simply to compile with classic mode.

@fweimer-rh (Contributor, Author):

If I was convinced that this new code does not introduce bad cases that the previous code handles better, I would suggest to merge it. I am not convinced yet -- I think that in the fully-connected case I sketched, the previous algorithm may be asymptotically better.

@gasche I put together a small test harness:

You are right that for the fully-connected/complete case, the old code is faster (by a factor of 5 to 6), but I'm not sure it's asymptotically so.

I expect that an imperative version with an adjacency matrix etc. would be faster. Would you be interested in that?

@gasche (Member) left a review comment:

I ran your benchmarks locally and I was convinced that the new code uniformly performs better, except in this corner case of a fully connected graph. I think that the gap between the two algorithms is not a constant factor in that case (more like a sublinear multiplicative factor), but also that it does not matter in practice in any known case. I am thus happy to approve this as a credible improvement. Thanks!

@gasche (Member) commented May 7, 2024

Could you please include a Changes entry? Maybe something like:

- #13150: improve a transitive-closure computation algorithm in the flambda middle-end
  to avoid a compilation time blowup on Menhir-generated code
  (Florian Weimer, review by Gabriel Scherer and Pierre Chambart, report by Richard Jones)

@gasche added the merge-me label on May 8, 2024
@fweimer-rh (Contributor, Author):

Could you please include a Changes entry?

@gasche Added.

Regarding the nonlinear factor for complete graphs, it should be possible to eliminate that with an array-based, imperative implementation. But I assume it won't be necessary.
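
For what it's worth, one plausible shape of such an array-based version (a hypothetical sketch, not code that was written or benchmarked for this PR): represent nodes as integers 0..n-1 and reuse one stamp array across all starting nodes, which removes the functional-set overhead; a recursive DFS like this would still want an explicit stack to be safe on very deep graphs.

(* Hypothetical imperative sketch: [succ.(i)] lists the successors of node i.
   The result lists, for each node, everything reachable in at least one step. *)
let transitive_closure_imperative (succ : int list array) : int list array =
  let n = Array.length succ in
  let stamp = Array.make n (-1) in   (* stamp.(v) = last start that visited v *)
  Array.init n (fun start ->
    let acc = ref [] in
    let rec dfs u =
      List.iter
        (fun v ->
           if stamp.(v) <> start then begin
             stamp.(v) <- start;
             acc := v :: !acc;
             dfs v               (* recursion depth can reach n *)
           end)
        succ.(u)
    in
    dfs start;
    !acc)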

@gasche (Member) commented May 8, 2024

@fweimer-rh apologies for the nitpick, but Check-typo complains about tabulation characters at lines 92/93 of invariant_params.ml, can you remove them?

./middle_end/flambda/invariant_params.ml:92.15: [tab] TAB character(s)
./middle_end/flambda/invariant_params.ml:93.69: [tab] TAB character(s)

@fweimer-rh (Contributor, Author):

apologies for the nitpick, but Check-typo complains about tabulation characters at lines 92/93 of invariant_params.ml, can you remove them

@gasche Fixed in 9f3e832, I hope.

@gasche (Member) commented May 9, 2024

Thanks! In the meantime I slept on it and had some ideas on how to improve the complete case. I will merge this one here, if the CI agrees, and then possibly invite you (if you are available) to review an improved version.

@gasche merged commit 787b4fb into ocaml:trunk on May 9, 2024
15 checks passed
@fweimer-rh (Contributor, Author) commented May 9, 2024

I already tried a big-step variant that uses Set.union and Set.diff instead of the inner loop. It's faster for the complete case, but much slower otherwise. This approach would likely benefit from a big-union operator (Set.t list -> Set.t) that uses the internal height property to optimize the order of operations.
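
The standard library does not expose the balanced-tree height, so a hypothetical approximation of that big-union operator could order the sets by cardinality instead, unioning the small sets first so that the largest sets are combined as late and as few times as possible (a sketch only; IntSet stands in for the actual pair sets):

module IntSet = Set.Make (Int)

(* Union a list of sets, smallest first; cardinality is a stand-in for the
   internal height mentioned above, which Stdlib.Set does not expose. *)
let big_union (sets : IntSet.t list) : IntSet.t =
  sets
  |> List.sort (fun a b -> Int.compare (IntSet.cardinal a) (IntSet.cardinal b))
  |> List.fold_left IntSet.union IntSet.empty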

gasche added a commit to gasche/ocaml that referenced this pull request May 9, 2024
This change improves the asymptotic behavior of the
closure-computation by ensuring that closure-computation work is
shared between nodes instead of being duplicated.

Consider for example the relation

  a -> b
  b -> c
  c -> a

The previous code would first compute

  a -> [b, c, a]

by traversing the graph, then

  b -> [c, a, b]

by traversing the graph again, then

  c -> [a, b, c]

again. The new code traverses the graph until it finds a cyclic
node (starting from [a] this would be [a]), then computes

  a -> [a, b, c]

using the same algorithm as before, and then immediately computes the
results for b and c from the result of a, without any additional
traversal.

Performance results using the benchmark script of Florian Weimer (
https://gitlab.com/gasche-snippets/transitive-closure ).

Implementations:
- `old` is the code in 5.2
- `new` is ocaml#13150
- `again` is the algorithm in this commit

Tests:
- `isolated` is a diagonal relation (1 -> 1, 2 -> 2, 3 -> 3, ...)
- `cycle` is (1 -> 2, 2 -> 3, 3 -> 4, ..., n -> 1)
- `complete` is (1 -> [1, 2, ..., n], 2 -> [1, 2, ..., n], ...)

| isolated |   Mean [ms] |    Relative |
|:---------|------------:|------------:|
| `old`    | 162.1 ± 6.3 | 1.66 ± 0.08 |
| `new`    |  97.7 ± 2.9 |        1.00 |
| `again`  | 203.1 ± 4.1 | 2.08 ± 0.07 |

| cycle   |      Mean [s] |        Relative |
|:--------|--------------:|----------------:|
| `old`   | 1.952 ± 0.054 | 596.91 ± 154.38 |
| `new`   | 0.034 ± 0.002 |    10.29 ± 2.71 |
| `again` | 0.003 ± 0.001 |            1.00 |

| complete |    Mean [ms] |     Relative |
|:---------|-------------:|-------------:|
| `old`    |   59.3 ± 5.3 |  8.46 ± 1.38 |
| `new`    | 218.5 ± 15.8 | 31.15 ± 4.83 |
| `again`  |    7.0 ± 1.0 |         1.00 |

The new code has a larger constant factor than the ocaml#13150 algorithm
for isolated graphs, which shows as a 2x slowdown. This makes it
slightly slower than the "old" code in 5.2 in this trivial case.

It performs much better than both trunk and 5.2 in the other cases.
@gasche (Member) commented May 9, 2024

I propose an improved variant in #13156. This is not a radically different algorithm, merely an improvement to the scheduling and reuse of computations in @fweimer-rh's algorithm, but it performs asymptotically better.

@fweimer-rh deleted the invariant_params_in_recursion branch on May 18, 2024, 16:27
@rwmjones (Contributor) commented May 29, 2024

I thought you'd be interested to know that I pulled this patch into Fedora 41 + OCaml 5.2 and rebuilt most things with it.

The only possible weirdness we have seen is a stack overflow when compiling flocq on ppc64le, which seems to be unrelated.
