flambda: Improve transitive closure in invariant_params_in_recursion #13150
Conversation
The old implementation did not really exploit the sparseness of the graph because it used newly discovered edges in later iterations. The new implementation processes each original relation only once per starting node, and does not re-process newly discovered relations.
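As a rough illustration of this description (hypothetical names, plain `int` nodes instead of the compiler's `Variable.Pair.t`, and no `Top` case), the one-node traversal can be sketched like this: the frontier is a work list, a node is pushed only when it is new to the result, and only adjacency sets of the original relation are ever expanded:

```ocaml
module IntSet = Set.Make (Int)

(* Closure of one start node over a fixed relation.  Each node enters the
   frontier at most once, so each adjacency set is expanded at most once
   per start node. *)
let closure_of (graph : IntSet.t array) (start : int) : IntSet.t =
  let rec loop frontier result =
    match frontier with
    | [] -> result
    | v :: rest ->
      let add u (fr, res) =
        if IntSet.mem u res then (fr, res)
        else (u :: fr, IntSet.add u res)
      in
      let fr, res = IntSet.fold add graph.(v) (rest, result) in
      loop fr res
  in
  loop [start] (IntSet.singleton start)
```

For example, on the 3-cycle `[| {1}; {2}; {0} |]`, `closure_of graph 0` returns `{0; 1; 2}` after expanding each of the three adjacency sets exactly once.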
I produced a flamegraph of ocamlopt compiling the problematic source file: most of the time (about 86%) is spent in these two functions when flambda is enabled (ocaml/middle_end/flambda/invariant_params.ml, lines 99 to 108 in e202de7).
This is the issue that Florian's patch aims to solve.
Some (match set with
 | Top -> Top
 | Implication set ->
   loop [] (Variable.Pair.Set.elements set) set))
If I understand correctly, you are computing the transitive closure of each element of the input state independently -- without reusing the computations you have already done for the previous elements. This sounds wasteful: if (v1, s1) is in the input state and we compute its transitive closure s1', and then we encounter another element (v2, s2) where v1 is reachable from s2, iiuc you will recompute s1' from s1 at this point, whereas you could directly use s1'. I would thus expect a fold on the input state rather than a map. Am I missing something?
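A minimal sketch of this fold idea (hypothetical names; plain `int` nodes instead of `Variable.Pair.t`, and the `Top` case omitted): the fold threads the finished closures through, and when a traversal reaches a node whose closure was computed in an earlier step, that closure is unioned in directly instead of being re-traversed. This reuse is safe here because the threaded map only ever contains fully completed closures.

```ocaml
module IntSet = Set.Make (Int)
module IntMap = Map.Make (Int)

let all_closures (graph : IntSet.t IntMap.t) : IntSet.t IntMap.t =
  let succ v =
    match IntMap.find_opt v graph with
    | Some s -> s
    | None -> IntSet.empty
  in
  (* Closure of [start], reusing already-finished closures in [done_]. *)
  let close done_ start =
    let rec loop frontier result =
      match frontier with
      | [] -> result
      | v :: rest ->
        (match IntMap.find_opt v done_ with
         | Some c ->
           (* [c] is already transitively closed: union it in directly. *)
           loop rest (IntSet.union c result)
         | None ->
           let add u (fr, res) =
             if IntSet.mem u res then (fr, res)
             else (u :: fr, IntSet.add u res)
           in
           let fr, res = IntSet.fold add (succ v) (rest, result) in
           loop fr res)
    in
    loop (IntSet.elements (succ start)) (succ start)
  in
  IntMap.fold (fun v _ done_ -> IntMap.add v (close done_ v) done_)
    graph IntMap.empty
```

On the 3-cycle 1 -> 2 -> 3 -> 1 this computes the full closure {1; 2; 3} by traversal only for node 1; the entries for nodes 2 and 3 are mostly assembled by reusing that finished closure.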
(In fact it might even be possible to compute the transitive closure for the whole mapping at once, instead of computing element-per-element, which saves even more work as intermediate computations can also be reused. I suppose that the "candidates" would be a variable * candidate list pair, and the frontier would be a (variable * candidate list) list. But this is a slightly more invasive change than just using a fold above, and maybe there is no performance justification for making the code slightly more complex.)
I tried to explain this in the commit message. In the Coccinelle/Menhir case, the input relation is sparse, but the result is not. With the fold, it is no longer obvious that the individual steps for each node process at most m relations (m being the number of input relations), because we get extra relations from previous steps.
For a non-expert trying to understand the code (I haven't studied the details as you have, so apologies for any misunderstandings), the problem with the previous implementation is that whenever any part of the state changes, all variables are recomputed on the next iteration, leading to wasteful recomputations that are probably quadratic in practice.
You are pointing out that with my suggestion there is a risk of unnecessary work as well: if s1' is much larger than s1, but most of it is already in s2, it is wasteful to compute the union s1' U s2 rather than s1 U s2. My intuition would be that this is fine, as the union computation is efficient when sets overlap.
With the fold, it's no longer obvious that the individual steps for each node process at most m relations (m being the number of input relations) because we get extra relations from previous steps.
Yes, but on the other hand the number of steps may increase a lot with the map due to duplicated closure work -- in the worst case of a fully connected graph, the computation goes from O(n) to O(n^2). (For example: 1 -> [2], 2 -> [3], ..., n-1 -> [n], n -> [0].)
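A small self-contained experiment (hypothetical `edge_visits_per_node`, with plain `int` nodes) makes this blow-up concrete: on an n-cycle, a per-node traversal expands every node's adjacency set once for each of the n starting points, so the expansion count grows as n² even though the relation only has n edges.

```ocaml
module IntSet = Set.Make (Int)

(* Count frontier expansions when each node's closure is computed
   independently over an n-cycle (i -> (i + 1) mod n). *)
let edge_visits_per_node n =
  let graph = Array.init n (fun i -> IntSet.singleton ((i + 1) mod n)) in
  let visits = ref 0 in
  let closure start =
    let rec loop frontier result =
      match frontier with
      | [] -> result
      | v :: rest ->
        incr visits;  (* one adjacency-set expansion *)
        let add u (fr, res) =
          if IntSet.mem u res then (fr, res)
          else (u :: fr, IntSet.add u res)
        in
        let fr, res = IntSet.fold add graph.(v) (rest, result) in
        loop fr res
    in
    ignore (loop [start] (IntSet.singleton start))
  in
  for i = 0 to n - 1 do closure i done;
  !visits
```

Here `edge_visits_per_node 10` is 100: each of the 10 starting points walks the whole cycle, whereas a computation that shared results across starting points would expand each of the 10 edges only a constant number of times.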
Regarding the fold idea, I tried this:
diff --git a/middle_end/flambda/invariant_params.ml b/middle_end/flambda/invariant_params.ml
index bbadb7db51..0de56dff01 100644
--- a/middle_end/flambda/invariant_params.ml
+++ b/middle_end/flambda/invariant_params.ml
@@ -68,7 +68,7 @@ let transitive_closure state =
(* Depth-first search for all implications for one argument.
Arguments are moved from candidate to frontier, assuming
they are newly added to the result. *)
- let rec loop candidate frontier result =
+ let rec loop candidate frontier result state =
match (candidate, frontier) with
| ([], []) -> Implication result
| ([], frontier::fs) ->
@@ -77,21 +77,24 @@ let transitive_closure state =
| Not_found -> Implication Variable.Pair.Set.empty) with
| Top -> Top
| Implication candidate ->
- loop (Variable.Pair.Set.elements candidate) fs result)
+ loop (Variable.Pair.Set.elements candidate) fs result state)
| (candidate::cs, frontier) ->
let result' = Variable.Pair.Set.add candidate result in
if result' != result then
(* Result change means candidate becomes part of frontier. *)
- loop cs (candidate :: frontier) result'
+ loop cs (candidate :: frontier) result' state
else
- loop cs frontier result
+ loop cs frontier result state
in
- Variable.Pair.Map.mapi
- (fun _ set ->
- match set with
- | Top -> Top
- | Implication set -> loop [] (Variable.Pair.Set.elements set) set)
- state
+ Variable.Pair.Map.fold
+ (fun key set state ->
+ Variable.Pair.Map.add key
+ (match set with
+ | Top -> Top
+ | Implication set ->
+ loop [] (Variable.Pair.Set.elements set) set state)
+ state)
+ state state
But it blows up again for the Coccinelle example.
Could you try the small tweak of making frontier a set instead of a list?
@gasche Could you explain why? This is essentially used as a stack, and as far as I can see, by construction, there cannot be any duplicated elements:
- The initial contents are the elements of a set. No duplicates there.
- We push an element onto the stack only if it is newly added to the result set. This cannot add a duplicate, either.
To take a concrete example, suppose we start from the state:
I think that we could compute the fixpoint for all elements at once so that reuse happens no matter the order, but I'm not sure how to write the traversal for this. (I think that I would try a recursive DFS where the state is a shared mutable reference instead of a parameter to the recursive function.)
Taking a step back:
If I was convinced that this new code does not introduce bad cases that the previous code handles better, I would suggest merging it. I am not convinced yet -- I think that in the fully-connected case I sketched, the previous algorithm may be asymptotically better.
Context
The computation that we are talking about computes relations between the argument variables of a nest of mutually-recursive functions. For human-written programs there are at most dozens of them, so any correct algorithm is fine. Apparently for menhir-generated programs there can be thousands of them, so being quadratic hurts.
Fixpoint as a service?
Implementing a good generic fixpoint-computation algorithm for a set of recursive inequations is not too much work, but somewhat different from this PR. (For example, earlier versions of @fpottier's Fix were a small amount of code.) It could be used here, and also in
Hello! I have not studied this PR, but I can confirm that Menhir's code back-end generates lots of mutually recursive functions (possibly in the tens of thousands). In my opinion, the OCaml compiler should be prepared to handle this! Regarding fixed point computations, I believe that using a generic (reusable) algorithm is a good idea. My library Fix provides two such algorithms:
I would suggest reusing one of these algorithms, if possible, or taking inspiration from them to develop your own generic algorithm. Once you have such a tool, it becomes useful in many places. Menhir uses
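To make the "generic fixpoint" idea concrete, here is a deliberately naive sketch with hypothetical names (this is not how Fix works internally: Fix tracks dependencies so that only equations whose inputs changed are re-evaluated, while this re-evaluates everything each round). The state maps each node to the set of nodes known to be reachable from it, and one step extends every entry by one edge.

```ocaml
module IntSet = Set.Make (Int)
module IntMap = Map.Make (Int)

(* Naive Kleene iteration: apply a monotone step function to the whole
   state until it stops changing. *)
let kleene_fixpoint ~(equal : 'a -> 'a -> bool) (step : 'a -> 'a) (bottom : 'a) : 'a =
  let rec go x =
    let x' = step x in
    if equal x x' then x else go x'
  in
  go bottom

(* Transitive closure expressed as such a fixpoint. *)
let closure (graph : IntSet.t IntMap.t) : IntSet.t IntMap.t =
  let step state =
    IntMap.map
      (fun reach ->
        IntSet.fold
          (fun v acc ->
            match IntMap.find_opt v state with
            | Some more -> IntSet.union more acc
            | None -> acc)
          reach reach)
      state
  in
  kleene_fixpoint ~equal:(IntMap.equal IntSet.equal) step graph
```

The payoff of the generic formulation is that `kleene_fixpoint` knows nothing about graphs; the same driver works for any monotone system of equations over a domain with decidable equality.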
💭 How was such information determined?
@elfring The usual
Can such numbers be reduced another bit by improving the usage of corresponding function and class libraries? 🤔
🔮 I became curious how affected software components will evolve further.
💭 Would you like to check and compare software run time characteristics in more detail with the help of any higher level analysis and development tools?
No. The generated code does not depend on any library. It is just a large finite-state automaton.
🔮 Can such a view trigger any additional development ideas?
The guy's a troll, you're wasting your time replying.
💭 I hope that the running software clarification can be kept more constructive.
@elfring, please stop spamming this pull request with repeated inquiries on the same subject. It is courteous to wait for replies before asking other questions. As a moderator, I will have to ban you if you continue to ignore basic netiquette.
I haven't tested that code, but I don't expect it to be noticeably worse than the current one on existing code. I made a few minor suggestions. I think it would be a tad safer to apply @gasche's suggestion of using a set for frontier. By the way, thank you for fixing my old nasty code.
Maybe
I can confirm that I have seen an instance of this code taking way too much time (i.e. more than 15 minutes on CI machines) on a real-life parser, as @chambart said (although this happened on the parser for the TPTP language and not SMT-LIB); see the parser and my (bad) fix, which was to just compile with classic mode.
@gasche I put together a small test harness:
You are right that for the fully-connected/complete case, the old code is faster (by a factor of 5 to 6), but I'm not sure it's asymptotically so. I expect that an imperative version with an adjacency matrix etc. would be faster. Would you be interested in that?
I ran your benchmarks locally and I was convinced that the new code uniformly performs better, except in this corner case of a fully connected graph. I think that the gap between the two algorithms is not a constant factor in that case (more like a sublinear multiplicative factor), but also that it does not matter in practice in any known case. I am thus happy to merge this as a credible improvement. Thanks!
Could you please include a Changes entry? Maybe something like:
@gasche Added. Regarding the nonlinear factor for complete graphs, it should be possible to eliminate that with an array-based, imperative implementation. But I assume it won't be necessary. |
@fweimer-rh apologies for the nitpick, but Check-typo complains about tabulation characters at lines 92/93 of invariant_params.ml, can you remove them?
Thanks! In the meantime I slept on it and had some ideas on how to improve the complete case. I will merge this one here, if the CI agrees, and then possibly invite you (if you are available) to review an improved version. |
I already tried a big-step variant that uses
This change improves the asymptotic behavior of the closure computation by ensuring that closure-computation work is shared between nodes instead of being duplicated. Consider for example the relation

a -> b
b -> c
c -> a

The previous code would first compute a -> [b, c, a] by traversing the graph, then b -> [c, a, b] by traversing the graph again, then c -> [a, b, c] again. The new code traverses the graph until it finds a cyclic node (starting from [a] this would be [a]), then computes a -> [a, b, c] using the same algorithm as before, and then immediately computes the results for b and c from the result of a, without any additional traversal.

Performance results using the benchmark script of Florian Weimer (https://gitlab.com/gasche-snippets/transitive-closure).

Implementations:
- `old` is the code in 5.2
- `new` is ocaml#13150
- `again` is the algorithm in this commit

Tests:
- `isolated` is a diagonal relation (1 -> 1, 2 -> 2, 3 -> 3, ...)
- `cycle` is (1 -> 2, 2 -> 3, 3 -> 4, ..., n -> 1)
- `complete` is (1 -> [1, 2, ..., n], 2 -> [1, 2, ..., n], ...)

| isolated | Mean [ms]   | Relative    |
|:---------|------------:|------------:|
| `old`    | 162.1 ± 6.3 | 1.66 ± 0.08 |
| `new`    | 97.7 ± 2.9  | 1.00        |
| `again`  | 203.1 ± 4.1 | 2.08 ± 0.07 |

| cycle   | Mean [s]      | Relative        |
|:--------|--------------:|----------------:|
| `old`   | 1.952 ± 0.054 | 596.91 ± 154.38 |
| `new`   | 0.034 ± 0.002 | 10.29 ± 2.71    |
| `again` | 0.003 ± 0.001 | 1.00            |

| complete | Mean [ms]    | Relative     |
|:---------|-------------:|-------------:|
| `old`    | 59.3 ± 5.3   | 8.46 ± 1.38  |
| `new`    | 218.5 ± 15.8 | 31.15 ± 4.83 |
| `again`  | 7.0 ± 1.0    | 1.00         |

The new code has a larger constant factor than the ocaml#13150 algorithm for isolated graphs, which shows as a 2x slowdown. This makes it slightly slower than the "old" code in 5.2 in this trivial case. It performs much better than both trunk and 5.2 in the other cases.
I propose an improved variant in #13156. This is not a radically different algorithm, merely an improvement to the scheduling and reuse of computations in @fweimer-rh's version, but it performs asymptotically better.
I thought you'd be interested to know that I pulled this patch into Fedora 41 + OCaml 5.2 and rebuilt most things with it.
This change rewrites the transitive closure computation for invariant_params_in_recursion. I believe this is more or less a traditional implementation (of the depth-first-search variant). It fixes a performance issue where Coccinelle can barely be compiled by flambda: the relation that triggers it is not even particularly large (12,732 arguments with 3,449 Top relations and 56,289 non-Top relations).