
flambda: Improve transitive closure in invariant_params_in_recursion #13150

Merged: 5 commits into ocaml:trunk on May 9, 2024

Conversation

@fweimer-rh (Contributor) commented May 6, 2024

This change rewrites the transitive closure computation for invariant_params_in_recursion. I believe this is more or less a traditional implementation (of the depth-first search variant).

It fixes a performance issue where Coccinelle can barely be compiled by flambda.

The relation that triggers this is not even particularly large (12,732 arguments with 3,449 Top relations and 56,289 non-Top relations).

The old implementation did not really exploit the sparseness of the graph because it used newly discovered edges in later iterations. The new implementation processes each original relation only once per starting node, and does not re-process newly discovered relations.
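
For readers who want the shape of the new algorithm without reading the flambda sources, here is a minimal, self-contained sketch of the same depth-first idea over a plain integer graph. The IntMap/IntSet representation and the function names are illustrative assumptions; the actual patch works on Variable.Pair.Map/Set and additionally has to propagate the Top element.

(* A minimal sketch (not the flambda code) of the depth-first closure:
   nodes are ints, the relation is a map from node to successor set. *)
module IntMap = Map.Make (Int)
module IntSet = Set.Make (Int)

type graph = IntSet.t IntMap.t

(* Close a single starting set: pop candidates, and whenever a node is newly
   added to the result, push it on the frontier so that its successors are
   expanded exactly once.  Each original edge is thus inspected at most once
   per starting node. *)
let closure_of (g : graph) (start : IntSet.t) : IntSet.t =
  let successors node =
    try IntMap.find node g with Not_found -> IntSet.empty in
  let rec loop candidates frontier result =
    match candidates, frontier with
    | [], [] -> result
    | [], node :: frontier ->
      loop (IntSet.elements (successors node)) frontier result
    | node :: candidates, frontier ->
      let result' = IntSet.add node result in
      if result' != result then
        (* Physical inequality means [node] is new; expand it later. *)
        loop candidates (node :: frontier) result'
      else
        loop candidates frontier result
  in
  loop [] (IntSet.elements start) start

(* As in the PR, the closure of each node is computed independently. *)
let transitive_closure (g : graph) : graph =
  IntMap.map (closure_of g) g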
@rwmjones (Contributor) commented May 7, 2024

I produced a flamegraph of ocamlopt compiling the problematic source file:
http://oirase.annexia.org/2024-05-ocamlopt.svg

Most of the time (about 86%) is spent in these two functions in ocamlopt when flambda is enabled:

let once state =
  Variable.Pair.Map.fold (fun arg _ state -> update arg state) state state
in
let rec fp state =
  let state' = once state in
  if Variable.Pair.Map.equal equal state state'
  then state
  else fp state'
in
fp state

This is the issue that Florian's patch aims to solve.

Review comment on middle_end/flambda/invariant_params.ml (outdated):
Some (match set with
| Top -> Top
| Implication set ->
loop [] (Variable.Pair.Set.elements set) set))
@gasche (Member) commented May 7, 2024

If I understand correctly, you are computing the transitive closure of each element of the input state independently -- without reusing the computations you have already done for the previous elements. This sounds wasteful: if (v1, s1) is in the input state and we compute its transitive closure s1', and then we encounter another element (v2, s2) with v1 reachable from s2, then (if I understand correctly) you will recompute s1' from s1 at this point, whereas you could directly use s1'. I would thus expect a fold on the input state rather than a map. Am I missing something?


(In fact it might even be possible to compute the transitive closure for the whole mapping at once, instead of computing element-per-element, which saves even more work as intermediate computations can also be reused. I suppose that the "candidates" would be a pair variable * candidate list, and the frontier would be a list (variable * candidate list) list. But this is a slightly more invasive change than just using a fold above, and maybe there is no performance justification for making the code slightly more complex.)

@fweimer-rh (Contributor, Author):

I tried to explain this in the commit message. In the Coccinelle/Menhir case, the input relation is sparse, but the result is not. With the fold, it's no longer obvious that the individual steps for each node process at most m relations (m being the number of input relations) because we get extra relations from previous steps.

@gasche (Member) commented May 7, 2024

For a non-expert trying to understand the code (I haven't studied the details as you have, so apologies for any misunderstandings), the problem with the previous implementation is the fact that whenever any part of the state changes, all variables are recomputed on the next iteration, leading to wasteful recomputations that are probably quadratic in practice.

You are pointing out that with my suggestion there is a risk of unnecessary work as well: if s1' is much larger than s1, but most of it is already in s2, it is wasteful to compute the union s1' U s2 rather than s1 U s2. My intuition would be that this is fine as the union computation is efficient when sets overlap.

With the fold, it's no longer obvious that the individual steps for each node process at most m relations (m being the number of input relations) because we get extra relations from previous steps.

Yes, but on the other hand the number of steps may increase a lot with the map due to duplicated closure work -- in the worst case, where the transitive closure is fully connected, the computation goes from O(n) to O(n^2). (For example the cycle: 1 -> [2], 2 -> [3], ..., n-1 -> [n], n -> [1].)
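
To make that worst case concrete, the cycle relation can be written down in the illustrative int-graph representation sketched earlier (a hypothetical helper, not part of the PR); its transitive closure maps every node to the full set {1, ..., n}, so each of the n independent per-node searches visits all n nodes.

(* Hypothetical worst-case input: the cycle 1 -> {2}, 2 -> {3}, ..., n -> {1}.
   Every node reaches every other node, so each per-node DFS does O(n) work
   and the independent per-node computation totals O(n^2) steps. *)
let cycle n : graph =
  let next i = if i = n then 1 else i + 1 in
  let rec build i g =
    if i > n then g
    else build (i + 1) (IntMap.add i (IntSet.singleton (next i)) g)
  in
  build 1 IntMap.empty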

@fweimer-rh (Contributor, Author):

Regarding the fold idea, I tried this:

diff --git a/middle_end/flambda/invariant_params.ml b/middle_end/flambda/invariant_params.ml
index bbadb7db51..0de56dff01 100644
--- a/middle_end/flambda/invariant_params.ml
+++ b/middle_end/flambda/invariant_params.ml
@@ -68,7 +68,7 @@ let transitive_closure state =
   (* Depth-first search for all implications for one argument.
      Arguments are moved from candidate to frontier, assuming
      they are newly added to the result. *)
-  let rec loop candidate frontier result =
+  let rec loop candidate frontier result state =
     match (candidate, frontier) with
     | ([], []) -> Implication result
     | ([], frontier::fs) ->
@@ -77,21 +77,24 @@ let transitive_closure state =
               | Not_found -> Implication Variable.Pair.Set.empty) with
        | Top -> Top
        | Implication candidate ->
-         loop (Variable.Pair.Set.elements candidate) fs result)
+         loop (Variable.Pair.Set.elements candidate) fs result state)
     | (candidate::cs, frontier) ->
       let result' = Variable.Pair.Set.add candidate result in
       if result' != result then
         (* Result change means candidate becomes part of frontier. *)
-        loop cs (candidate :: frontier) result'
+        loop cs (candidate :: frontier) result' state
       else
-        loop cs frontier result
+        loop cs frontier result state
   in
-    Variable.Pair.Map.mapi
-      (fun _ set ->
-         match set with
-        | Top -> Top
-        | Implication set -> loop [] (Variable.Pair.Set.elements set) set)
-      state
+    Variable.Pair.Map.fold
+      (fun key set state ->
+         Variable.Pair.Map.add key
+           (match set with
+           | Top -> Top
+           | Implication set ->
+              loop [] (Variable.Pair.Set.elements set) set state)
+           state)
+      state state
 
 (* CR-soon pchambart: to move to Flambda_utils and document
    mshinwell: I think this calculation is basically the same as

But it blows up again for the Coccinelle example.

@gasche (Member):

Could you try the small tweak of making frontier a set instead of a list?

@fweimer-rh (Contributor, Author):

@gasche Could you explain why? This is essentially used as a stack, and as far as I can see, by construction, there cannot be any duplicated elements:

  • The initial contents are the elements of a set. No duplicates there.
  • We push an element onto the stack only if it is newly added to the result set. This cannot add a duplicate, either.

@gasche (Member) commented May 7, 2024

To take a concrete example, suppose we start from the state: 1 -> [2], 2 -> [3], 3 -> [4]. With the PR we will first compute the closure for 1, that is [2, 3, 4], then we will compute the closure for 2, that is [3, 4], then for 3, that is [4]. Using a fold, if the fold goes left-to-right, the computation will be the same as with map (no reuse); if it goes right-to-left, when we compute the closure of 1 we already have 2 -> [3, 4] available and we can compute [2, 3, 4] in a single step.

I think that we could compute the fixpoint for all elements at once so that reuse happens no matter the order, but I'm not sure how to write the traversal for this. (I think that I would try a recursive DFS where the state is a shared mutable reference instead of a parameter to the recursive function.)
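
As an illustration of this shared-state idea (a hypothetical sketch in the int-graph representation from the earlier sketch, not code from this PR): keep a mutable cache of nodes whose closure is already finished; when a search reaches such a node, union in its cached closure instead of re-expanding it, so reuse happens regardless of traversal order. Nodes on a cycle with the current start are not yet cached and simply fall back to normal expansion, and, as discussed above, unioning large intermediate closures has a cost of its own.

let transitive_closure_shared (g : graph) : graph =
  let cache : (int, IntSet.t) Hashtbl.t = Hashtbl.create 16 in
  let successors node =
    try IntMap.find node g with Not_found -> IntSet.empty in
  let close start =
    let rec loop candidates frontier result =
      match candidates, frontier with
      | [], [] -> result
      | [], node :: frontier ->
        (match Hashtbl.find_opt cache node with
         | Some closed ->
           (* [closed] is already transitively closed: union it in and skip
              re-expanding [node]'s successors. *)
           loop [] frontier (IntSet.union result closed)
         | None ->
           loop (IntSet.elements (successors node)) frontier result)
      | node :: candidates, frontier ->
        let result' = IntSet.add node result in
        if result' != result then
          loop candidates (node :: frontier) result'
        else
          loop candidates frontier result
    in
    loop [] (IntSet.elements start) start
  in
  IntMap.mapi
    (fun node succ ->
       let closed = close succ in
       Hashtbl.replace cache node closed;
       closed)
    g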

@gasche (Member) commented May 7, 2024

Taking a step back:

  • you have found a real issue on some machine-generated programs and your patch fixes it
  • I believe that the code that you propose is correct
  • there is no need to use a "perfect" fixpoint algorithm here, just something that is good enough (it works on more programs than before); but if we take shortcuts, the same problem may arise for a different sort of generated code (maybe Menhir changes its code-generation strategy in the future and it falls in another bad case)

If I was convinced that this new code does not introduce bad cases that the previous code handles better, I would suggest to merge it. I am not convinced yet -- I think that in the fully-connected case I sketched, the previous algorithm may be asymptotically better.

Context

The computation that we are talking about is computing relations between the argument variables of a nest of mutually-recursive functions. For human-written programs there are at most dozens of them, so any correct algorithm is fine. Apparently for Menhir-generated programs there can be thousands of them, so being quadratic hurts.

Fixpoint as a service?

Implementing a good generic fixpoint-computation algorithm for a set of recursive inequations is not too much work, but somewhat different from this PR. (For example, earlier versions of @fpottier's Fix were a small amount of code.)

It could be used here, and also in typing/typedecl_properties. I haven't found other clear use-cases in the type-checker codebase. asmcomp/dataflow contains a fixpoint computation over a Mach program, but it is more specialized -- it is designed to handle nested recursion well.

@fpottier (Contributor) commented May 7, 2024

Hello! I have not studied this PR, but I can confirm that Menhir's code back-end generates lots of mutually recursive functions (possibly in the tens of thousands). In my opinion, the OCaml compiler should be prepared to handle this!

Regarding fixed point computations, I believe that using a generic (reusable) algorithm is a good idea. My library Fix provides two such algorithms:

  • Fix.Fix requires less work on the user's part (the user does not need to describe the dependency graph; dependencies are dynamically discovered) but is a bit more complex and less efficient; it is well-suited to backward analyses.
  • Fix.DataFlow requires more work from the user (dependencies must be explicitly described) but is less complex and more efficient; it is well-suited to forward analyses.

I would suggest re-using one of these algorithms, if possible, or taking inspiration from them to develop your own generic algorithm. Once you have such a tool, it becomes useful in many places. Menhir uses Fix for many purposes.
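
For a rough idea of what this looks like with Fix's dynamic-dependency solver, here is a sketch of a reachability computation written against the Fix.Fix.ForOrderedType/lfp interface of the fix opam package (module and function names as I remember the documented API; the Node type and the successors function are made-up inputs, not the flambda data structures):

module Node = struct
  type t = int
  let compare = Int.compare
end

module NodeSet = Set.Make (Node)

(* The property lattice: sets of reachable nodes, ordered by inclusion. *)
module ReachableSet = struct
  type property = NodeSet.t
  let bottom = NodeSet.empty
  let equal = NodeSet.equal
  let is_maximal _ = false
end

module Solver = Fix.Fix.ForOrderedType (Node) (ReachableSet)

(* reachable node = the successors of node plus everything reachable from
   them; [get] asks the solver for another variable's current value, so the
   dependency graph is discovered dynamically. *)
let reachable (successors : Node.t -> Node.t list) : Node.t -> NodeSet.t =
  Solver.lfp (fun node get ->
    List.fold_left
      (fun acc succ -> NodeSet.add succ (NodeSet.union acc (get succ)))
      NodeSet.empty
      (successors node))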

@elfring commented May 7, 2024

The relation that triggers this is not even particularly large (12,732 arguments with 3,449 Top relations and 56,289 non-Top relations).

💭 How was such information determined?

@fweimer-rh (Contributor, Author):

The relation that triggers this is not even particularly large (12,732 arguments with 3,449 Top relations and 56,289 non-Top relations).

💭 How was such information determined?

@elfring The usual printf debugging, I'm afraid.

@elfring commented May 7, 2024

…, but I can confirm that Menhir's code back-end generates lots of mutually recursive functions (possibly in the tens of thousands).

Can such numbers be reduced another bit by improving the usage of corresponding function and class libraries? 🤔

In my opinion, the OCaml compiler should be prepared to handle this!

🔮 I became curious how affected software components will evolve further.

@elfring commented May 7, 2024

The usual printf debugging, I'm afraid.

💭 Would you like to check and compare software run time characteristics in more detail with the help of any higher level analysis and development tools?

@elfring commented May 7, 2024

I produced a flamegraph of ocamlopt compiling the problematic source file:

@rwmjones:

  • 🤔 Were such data generated according to a complete program execution?
  • 💭 Would you like to share all program parameters?

@fpottier (Contributor) commented May 7, 2024

Can such numbers be reduced another bit by improving the usage of corresponding function and class libraries? 🤔

No. The generated code does not depend on any library. It is just a large finite-state automaton.

@elfring commented May 7, 2024

Can such numbers be reduced another bit by improving the usage of corresponding function and class libraries? 🤔

No. The generated code does not depend on any library. It is just a large finite-state automaton.

🔮 Can such a view trigger any additional development ideas?

@rwmjones (Contributor) commented May 7, 2024

Can such numbers be reduced another bit by improving the usage of corresponding function and class libraries? 🤔

No. The generated code does not depend on any library. It is just a large finite-state automaton.

The guy's a troll, you're wasting your time replying.

@elfring commented May 7, 2024

The guy's a troll, you're wasting your time replying.

💭 I hope that the running software clarification can be kept more constructive.

@Octachron (Member):

@elfring , please stop spamming this pull request with repeated inquiries on the same subject. It is courteous to wait for replies before asking other questions. As a moderator, I will have to ban you if you continue to ignore basic netiquette.

@chambart (Contributor) commented May 7, 2024

I haven't tested that code, but I don't expect it to be noticeably worse than the current one on existing code. If it works on Menhir-generated code, that's enough to be convincing (I think you might also want to try Dolmen, which also has a big parser for SMT-LIB files). As @gasche said, I don't expect it to matter in any human-written code, and I don't know of other tools that generate gigantic recursive function packs (maybe deriving on recursive types, but that's usually much smaller). There is no need to delay it for a better generic fixpoint lib.

I made a few minor suggestions. I think it would be a tad safer to apply @gasche's suggestion of using a set for frontier.
👍

By the way, thank you for fixing my old nasty code.

@fpottier (Contributor) commented May 7, 2024

I don't know of other tools that generate gigantic recursive function packs.

Maybe ocamllex -ml.

@Gbury (Contributor) commented May 7, 2024

I can confirm that I have seen an instance of this code taking far too much time (more than 15 minutes on CI machines) on a real-life parser, as @chambart said, although this happened on the parser for the TPTP language rather than SMT-LIB; see the parser and my (bad) fix, which was simply to compile with classic mode.

@fweimer-rh (Contributor, Author):

If I was convinced that this new code does not introduce bad cases that the previous code handles better, I would suggest to merge it. I am not convinced yet -- I think that in the fully-connected case I sketched, the previous algorithm may be asymptotically better.

@gasche I put together a small test harness:

You are right that for the fully-connected/complete case, the old code is faster (by a factor of 5 to 6), but I'm not sure it's asymptotically so.

I expect that an imperative version with an adjacency matrix etc. would be faster. Would you be interested in that?

@gasche (Member) left a review comment:

I ran your benchmarks locally and I was convinced that the new code uniformly performs better, except in this corner case of a fully connected graph. I think that the gap between the two algorithms is not a constant factor in that case (more like a sublinear multiplicative factor), but also that it does not matter in practice in any known case. I am thus happy to approve this as a credible improvement. Thanks!

@gasche (Member) commented May 7, 2024

Could you please include a Changes entry? Maybe something like:

- #13150: improve a transitive-closure computation algorithm in the flambda middle-end
  to avoid a compilation time blowup on Menhir-generated code
  (Florian Weimer, review by Gabriel Scherer and Pierre Chambart, report by Richard Jones)

@gasche added the merge-me label on May 8, 2024
@fweimer-rh (Contributor, Author):

Could you please include a Changes entry?

@gasche Added.

Regarding the nonlinear factor for complete graphs, it should be possible to eliminate that with an array-based, imperative implementation. But I assume it won't be necessary.
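
For what it's worth, one plausible shape of such an array-based version (a hypothetical sketch, not code that was written or benchmarked for this PR): represent nodes as integers 0..n-1 and reuse one stamp array across all starting nodes, which removes the functional-set overhead; a recursive DFS like this would still want an explicit stack to be safe on very deep graphs.

(* Hypothetical imperative sketch: [succ.(i)] lists the successors of node i.
   The result lists, for each node, everything reachable in at least one step. *)
let transitive_closure_imperative (succ : int list array) : int list array =
  let n = Array.length succ in
  let stamp = Array.make n (-1) in   (* stamp.(v) = last start that visited v *)
  Array.init n (fun start ->
    let acc = ref [] in
    let rec dfs u =
      List.iter
        (fun v ->
           if stamp.(v) <> start then begin
             stamp.(v) <- start;
             acc := v :: !acc;
             dfs v               (* recursion depth can reach n *)
           end)
        succ.(u)
    in
    dfs start;
    !acc)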

@gasche (Member) commented May 8, 2024

@fweimer-rh apologies for the nitpick, but Check-typo complains about tabulation characters at lines 92/93 of invariant_params.ml, can you remove them?

./middle_end/flambda/invariant_params.ml:92.15: [tab] TAB character(s)
./middle_end/flambda/invariant_params.ml:93.69: [tab] TAB character(s)

@fweimer-rh (Contributor, Author):

apologies for the nitpick, but Check-typo complains about tabulation characters at lines 92/93 of invariant_params.ml, can you remove them

@gasche Fixed in 9f3e832, I hope.

@gasche (Member) commented May 9, 2024

Thanks! In the meantime I slept on it and had some ideas on how to improve the complete case. I will merge this one here, if the CI agrees, and then possibly invite you (if you are available) to review an improved version.

@gasche merged commit 787b4fb into ocaml:trunk on May 9, 2024
15 checks passed
@fweimer-rh (Contributor, Author) commented May 9, 2024

I already tried a big-step variant that uses Set.union and Set.diff instead of the inner loop. It's faster for the complete case, but much slower otherwise. This approach would likely benefit from a big-union operator (Set.t list -> Set.t) that uses the internal height property to optimize the order of operations.
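
The standard library does not expose the balanced-tree height, so a hypothetical approximation of that big-union operator could order the sets by cardinality instead, unioning the small sets first so that the largest sets are combined as late and as few times as possible (a sketch only; IntSet stands in for the actual pair sets):

module IntSet = Set.Make (Int)

(* Union a list of sets, smallest first; cardinality is a stand-in for the
   internal height mentioned above, which Stdlib.Set does not expose. *)
let big_union (sets : IntSet.t list) : IntSet.t =
  sets
  |> List.sort (fun a b -> Int.compare (IntSet.cardinal a) (IntSet.cardinal b))
  |> List.fold_left IntSet.union IntSet.empty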

gasche added a commit to gasche/ocaml that referenced this pull request May 9, 2024
This change improves the asymptotic behavior of the
closure-computation by ensuring that closure-computation work is
shared between nodes instead of being duplicated.

Consider for example the relation

  a -> b
  b -> c
  c -> a

The previous code would first compute

  a -> [b, c, a]

by traversing the graph, then

  b -> [c, a, b]

by traversing the graph again, then

  c -> [a, b, c]

again. The new code traverses the graph until it finds a cyclic
node (starting from [a] this would be [a]), then computes

  a -> [a, b, c]

using the same algorithm as before, and then immediately computes the
results for b and c from the result of a, without any additional
traversal.

Performance results using the benchmark script of Florian Weimer (
https://gitlab.com/gasche-snippets/transitive-closure ).

Implementations:
- `old` is the code in 5.2
- `new` is ocaml#13150
- `again` is the algorithm in this commit

Tests:
- `isolated` is a diagonal relation (1 -> 1, 2 -> 2, 3 -> 3, ...)
- `cycle` is (1 -> 2, 2 -> 3, 3 -> 4, ..., n -> 1)
- `complete` is (1 -> [1, 2, ..., n], 2 -> [1, 2, ..., n], ...)

| isolated |   Mean [ms] |    Relative |
|:---------|------------:|------------:|
| `old`    | 162.1 ± 6.3 | 1.66 ± 0.08 |
| `new`    |  97.7 ± 2.9 |        1.00 |
| `again`  | 203.1 ± 4.1 | 2.08 ± 0.07 |

| cycle   |      Mean [s] |        Relative |
|:--------|--------------:|----------------:|
| `old`   | 1.952 ± 0.054 | 596.91 ± 154.38 |
| `new`   | 0.034 ± 0.002 |    10.29 ± 2.71 |
| `again` | 0.003 ± 0.001 |            1.00 |

| complete |    Mean [ms] |     Relative |
|:---------|-------------:|-------------:|
| `old`    |   59.3 ± 5.3 |  8.46 ± 1.38 |
| `new`    | 218.5 ± 15.8 | 31.15 ± 4.83 |
| `again`  |    7.0 ± 1.0 |         1.00 |

The new code has a larger constant factor than the ocaml#13150 algorithm
for isolated graphs, which shows as a 2x slowdown. This makes it
slightly slower than the "old" code in 5.2 in this trivial case.

It performs much better than both trunk and 5.2 in the other cases.
@gasche (Member) commented May 9, 2024

I propose an improved variant in #13156. This is not a radically different algorithm, merely an improvement to the scheduling and reuse of computations in @fweimer-rh's algorithm, but it performs asymptotically better.

@fweimer-rh deleted the invariant_params_in_recursion branch on May 18, 2024, 16:27
@rwmjones (Contributor) commented May 29, 2024

I thought you'd be interested to know that I pulled this patch into Fedora 41 + OCaml 5.2 and rebuilt most things with it.

The only possible weirdness we have seen is a stack overflow when compiling flocq on ppc64le, which seems to be unrelated.
