dune build -w restricting job number for no reason #5549
@Alizter if you do
@ejgallego Yes.
From the meeting: there must be some mutable state remaining in dune that is causing this. We need to scan the code base once more for mutable state and convert it to Memo.
I am seeing the same behavior described in the issue for a mid-to-large Melange build. After watch mode restarts when the first change happens, the job count drops to 1, with very short peaks of 2-3 jobs. At Ahrefs, we use CPUs with high parallelism, and most of the build can be done using 255 parallel jobs, so the performance impact is severe. We would like to move away from external watch tools like watchexec, as they interfere with the recently added lock that prevents multiple build commands (#6360), but the job-restriction issue would block the migration to Dune watch.
Are there any previous examples in PRs that show this type of conversion and what it looks like? Also, is there some way to debug the dependency graph information? I understand that for a set of rules that can be executed in parallel to effectively become a sequence, each rule must at some point start depending on the previous one somehow, starting from the second and subsequent builds?
@jchavarri Does this only happen with Melange builds? Have you observed it with OCaml? If not, then that might be an indicator of what is going wrong, since Melange and Coq do some similar things.
It also happens with OCaml. Some more observations:
Thanks for the info. Could you find out the following:
I did not find any relation between ml files having mli or not.
There are hundreds of modules that depend on the .ml files. I noticed that rather than specific .ml files being problematic, it seems to be an issue across the library boundary (maybe). I was able to find a simplified repro using synthetic code, which I put up at https://github.com/jchavarri/dune_repro_issue_5549. Instructions from the readme are copied below: Dune repro issue 5549
Some added notes:
dune/src/dune_engine/build_system.ml Line 221 in 8d88ee8
Could it be related somehow? Is there a way to debug this to find the problem?
Another note: the bug is reproducible on 3.0.2 but not on 2.9.3 (both installed from opam). I am now trying to figure out the exact commit where it regressed.
I can pin the regression down to this commit:
Do you mind trying the following:
If so, then I would suggest bisecting further but with this environment variable switched on.
You can try the following experiments to see if it helps with the issue:
Perhaps one of these will fix the issue.
@rgrinberg I tried multiple things along the lines of what you suggested. The only thing that ultimately worked was changing the implementation of the `get` function in the `Settings` module to:

```ocaml
let get () =
  let+ (_ : Memo.Run.t) = Memo.current_run () in
  builtin_default
```

I am trying now to go forward and identify where this code is in the current codebase.
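For readers following along, here is a rough intuition for why tying a value to the current run helps (a toy model of my own, not dune's actual Memo API): a value bound to the current run is reused within a run but treated as fresh input at the start of every new run, instead of being a constant cached for the whole watch session.

```ocaml
(* Toy per-run cache: a hypothetical simplification of binding a value
   to Memo.current_run. The cached value is reused within a run but
   recomputed once per new run. *)
type 'a per_run = {
  compute : unit -> 'a;
  mutable cached : (int * 'a) option; (* (run id, value) *)
}

let current_run = ref 0
let start_new_run () = incr current_run

let get cell =
  match cell.cached with
  | Some (run, v) when run = !current_run -> v (* hit within this run *)
  | _ ->
    (* miss: either never computed, or cached in a previous run *)
    let v = cell.compute () in
    cell.cached <- Some (!current_run, v);
    v
```

For example, a cell built from an expensive `compute` is evaluated once per run: repeated `get` calls in the same run hit the cache, while calling `start_new_run ()` forces the next `get` to recompute.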
I can't find out how to reapply the fix in the most recent version. For the record, here's the small patch (output of `git diff`) that fixes the problem when applied to 621e4e2; there's no need to return constants or anything:

```diff
diff --git a/src/dune_engine/source_tree.ml b/src/dune_engine/source_tree.ml
index 1b1bbc5f8..362ce6927 100644
--- a/src/dune_engine/source_tree.ml
+++ b/src/dune_engine/source_tree.ml
@@ -377,7 +377,9 @@ module Settings = struct
   let set x = Fdecl.set t x

-  let get () = Fdecl.get t
+  let get () =
+    let* (_ : Memo.Run.t) = Memo.current_run () in
+    Fdecl.get t
 end

 let init = Settings.set
```

I guess I will try now the opposite direction: move forward in time, version by version, reapplying this patch and checking if it still fixes the problem.
I have noticed that commenting the following lines in
It seems that unlike the patch in #7224, this change doesn't discard every piece of memoized information completely, but just the However, when using the version of dune with the above patch in a real project, the issue was not fully gone. Given a "leaf" library Z and a library Y that Z depends upon:
The second case seems to be fixed by removing memoization from
Are these 2 optimizations
@rgrinberg Is there anything else I can log, measure, or debug considering the above information?
Did you actually observe a speed-up from these patches? Yes, removing memoization will make dune run more things in parallel, but that memoization was probably preventing them from running at all in the first place. So perhaps it was not faster at all to remove it.
They're pretty critical because they make sure dune doesn't rebuild rules or aliases in the same build.
I would say that it would be very valuable to get some perf measurements on what exactly dune is doing when it's building things serially.
Btw, what's missing from this ticket is some analysis as to why you think dune is missing opportunities for parallelism. In particular, we still aren't sure if this is in fact a bug in the engine or just how the rules are set up. It's not enough to have 256 cores and see 1 job running to determine that dune is not building concurrently enough. We need to demonstrate which two rules it fails to parallelize. It's a bit of a shame we don't have any good tools in dune to do this, but I'd be more than happy to add them if it was possible. For now, I would suggest
Yes, I definitely do. Note that the issue always occurs when modifying files in a library Y that another library Z depends upon. In those cases, due to the way library dependencies are defined, all modules in Z will be rebuilt. So with the patches applied, if Z has a few hundred modules, they get built in a very short time with 255 jobs, while it can take almost a minute without the patch, as dune will build 1 module at a time. Some specific numbers are in the traces shared below.
Because I can see that dune builds the modules from the library sequentially, when it is clear they can be built in parallel when running dune in regular (non-watch) mode, or with the patch.
Find below a couple of traces, with and without the patch. In both cases I am running dune watch using the repro I shared above. I start dune watch with a failing build (modifying a module in the
I think I understand the issue now. The problem stems from how we invalidate dependency nodes. We do so by scanning the nodes one by one until we find a node that is out of date or fails the cutoff predicate. This scan is done linearly because the node is invalid even if a single dependency is invalid, and we want to stop scanning and evaluating dependency nodes as soon as we find out our node is invalidated. Unfortunately, this check can evaluate many dependency nodes linearly and as a result completely undo the concurrency of the underlying memoized function. @snowleopard Have you run into this problem at all internally?
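As a toy sketch of the scan described above (an illustrative simplification of mine, not dune's actual Memo internals): checking a node's dependencies one by one short-circuits on the first invalid one, but it also forces every potentially expensive dependency check to run strictly in sequence.

```ocaml
(* Hypothetical node: up to date iff its own flag is set and all of
   its dependencies are up to date. *)
type node = {
  name : string;
  mutable fresh : bool;
  deps : node list;
}

(* The linear scan: List.for_all visits deps left to right and stops
   at the first invalid one. Even when the checks are independent,
   they run one after another, which is what serializes the rebuild. *)
let rec is_up_to_date n = n.fresh && List.for_all is_up_to_date n.deps
```

The short-circuiting is what the comment above refers to: the scan avoids evaluating dependencies past the first invalid one, at the cost of never fanning out in parallel.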
Oh, this seems like a problem indeed. I will try to reproduce the performance difference internally.
I couldn't yet reproduce the problem internally but I'll keep trying. @rgrinberg My understanding is that you think the problem is related to this comment: Lines 422 to 450 in a0dd515
We start with a partial order (due to parallelism) but we linearise it during the incremental graph traversal. Ideally we would preserve the parallelism structure, but it gets completely erased since
That's an example of when the problem is most acute, but the problem exists even without any concurrency in the memoized function.
I think there's always a risk of computing unnecessarily. To rephrase the problem: when we check whether a node is invalid, we can optimize for two different properties:
It's of course impossible to satisfy both properties in the general case. However, our invalidation algorithm is suboptimal by both of these criteria: the current algorithm traverses all nodes one by one, evaluating them sequentially if necessary. So it's both sequential, and it evaluates unnecessary nodes. I would propose the following improvement:
IMO, the proposed scheme would be better than our current algorithm, although it would still cause unnecessary computation in the second step. I would suggest that we add a cancellation mechanism for evaluating memoized nodes to mitigate all the unnecessary computation.
Actually, scratch the above. The issue is indeed only relevant to concurrent computation. What I didn't realize is that when we're doing evaluation to see if the node is out of date, all the nodes that we're going to evaluate will be needed to bring it up to date regardless, so there's really no work wasted. The real issue is that our dependency list for nodes has been flattened, so we're unable to traverse it in parallel where possible.
Reproduces the loss of concurrency observed in #5549 in a unit test. Signed-off-by: Rudi Grinberg <me@rgrinberg.com>
#7251 attempts to reproduce this |
Yep. One approach that I saw used in practice is switching from essentially a
Of course, we still need to get those inner batches from somewhere, and that seems to be the tricky part.
What about speculatively executing the cut-off nodes in parallel and cancelling the remaining computations once we find that one of them changed according to the cutoff? That's not always ideal, but it seems like a good trade-off if you have a lot of concurrent resources. If it's not always desirable, we could allow it specifically for functions where we know it would benefit, like
I think we should do our best to avoid executing unnecessary computations. Memo guarantees to do no unnecessary computations, and in my opinion that's one of its strongest features compared to other incremental computation libraries. Surely, preserving the shape of computations (concurrent vs sequential) is the better approach in the long term, since it unlocks a bunch of other improvements (particularly around cutoffs).
To address this specific issue, could you just try removing some of the cutoffs in Dune rules? Some cutoffs are more useful than others, and it might just happen that it's the useless ones that are causing the problem. One reason I'm thinking this is that we don't appear to be that affected by this internally, so perhaps our cutoff structure is just better.
I created a synthetic project + benchmark to reproduce the issue using dune watch, and integrated it with CI. Once I had some reference measurement, I modified
The results are shown in the chart below. The dune watch rebuild takes ~134.3s when using current
The benchmark table pasted below can be found at https://jchavarri.github.io/dune/dev/bench/. I hope this helps with navigating the issue and tracking the performance impact of any solutions to it. Please let me know if anything can be adapted or extended to be more informative.
@snowleopard I tried removing cutoffs in The cutoff function of dune/src/dune_engine/build_system.ml Line 1027 in aafa992
The part that makes the cutoff function turns
Let me know if I can help test any other things.
Cool, that's interesting. My understanding is that the current benchmark only examines one scenario, where we change a file in a way that makes the resulting digest change. How about also adding a scenario where, after a change is made, the result of action execution remains the same? That is the scenario that benefits from the existence of the cutoff. It would be interesting to see if this benefit is significant (and we'll presumably lose it when dropping the cutoff). In light of the previous discussion about concurrency:

```ocaml
let build_deps deps = Dep.Map.parallel_map deps ~f:(fun dep () -> build_dep dep)
```

I'll look into making it possible to preserve concurrency in such cases during incremental Memo updates.
Sorry, I am not sure I understand how I can model an example of the given scenario. Would you maybe have a specific example in mind that I could test? As I was not able to find an example for the above, I took some measurements of how many times the function
Here are the results, still for the synth benchmark shared in #7255. The three steps all happen during a single dune watch run, and consist of an initial successful build, modify
With cutoff
Without cutoff
So in this example, the number of times the cutoff allows memoizing calls to
I also measured the time taken by each call to this function; it goes from 0.1 to ~1.3ms. This time increases (as expected) when parallelization is at its best, most probably because each thread has to wait for I/O or other bottlenecks to proceed. I understand this time depends mainly on what actual rules the example is running (in this case, I assume calls to
Would it make sense to make the cutoff configurable? I would say that this scenario:
is not all that relevant for most projects. I can't think of many compilation commands that can be changed without the output changing in a project. Re-ordering command-line flags is one thing that comes to mind, but it hardly seems an important case.
@jchavarri I'm not sure it fits well into your synthetic benchmark, but a typical example would be changing the source of a generator in a way that doesn't change the generated file (say, just adding a comment to the generator). If many rules depend on this generated file, then the
It's worth noting that even if we remove the
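To make the early-cutoff behavior discussed here concrete, a toy model (my own simplification, not dune's Memo API): a node re-runs whenever its input changes, but its consumers are only invalidated when the recomputed output actually differs from the cached one.

```ocaml
(* Hypothetical early-cutoff cell: recompute when the input changes,
   but report a change downstream only when the output differs. *)
type ('i, 'o) cell = {
  f : 'i -> 'o;
  mutable last : ('i * 'o) option; (* (last input, last output) *)
}

(* Returns (output, changed-for-consumers). *)
let eval cell input =
  match cell.last with
  | Some (i, o) when i = input -> (o, false) (* input unchanged *)
  | _ ->
    let o = cell.f input in
    let changed =
      match cell.last with
      | Some (_, o') -> o <> o' (* the cutoff: equal output stops propagation *)
      | None -> true
    in
    cell.last <- Some (input, o);
    (o, changed)
```

This matches the generator example: editing only a comment in the generator's source re-runs the generator, but since the generated file is byte-for-byte identical, the (possibly many) dependent rules are not invalidated.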
@rgrinberg If you refer to the
Are you talking about having an option to essentially disable all Memo
I meant just making the cutoff on
OK, I guess you are right. Now that we have support for cheap experimental configuration settings, it makes sense to use it. I would probably give it a slightly more general name:
Based on a discussion with @snowleopard, I think that this has been fixed at JS and that the fix is going to be contributed soon.
Yes, sorry, it's on my stack to upstream the corresponding changes. The fix isn't entirely free (we now need to record series-parallel dependency traces, which increases Memo's overheads), but the performance boost for incremental builds is worth it.
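For intuition, a series-parallel dependency trace might look something like the following (my own sketch of the idea, not the actual upstreamed data structure): instead of a flat list, the trace remembers which dependencies were registered sequentially and which concurrently, so the invalidation check can fan out again at the parallel nodes.

```ocaml
(* Hypothetical series-parallel trace of a node's dependencies. *)
type 'a trace =
  | Dep of 'a                  (* a single dependency *)
  | Seq of 'a trace * 'a trace (* right side was registered after the left finished *)
  | Par of 'a trace * 'a trace (* both sides were registered concurrently *)

(* Checking validity: Seq must be walked left to right (the right side
   may only make sense if the left side is unchanged), while both sides
   of Par could be checked concurrently. Sequential sketch: *)
let rec all_valid ~valid = function
  | Dep d -> valid d
  | Seq (a, b) -> all_valid ~valid a && all_valid ~valid b
  | Par (a, b) ->
    (* a real implementation would fork these two checks in parallel;
       here we simply evaluate both *)
    let ra = all_valid ~valid a in
    let rb = all_valid ~valid b in
    ra && rb
```

The overhead mentioned above comes from recording the `Seq`/`Par` structure at all: a flat dependency list is cheaper to build, but it erases exactly the information needed to parallelize the check.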
Problem description

There seems to be an issue with `dune build -w`. It can be observably slow due to a restriction of jobs: stopping watch mode and doing a fresh `dune build` will be much faster. The issue seems to be during the rule-finding phase.

Here is a demonstration GIF of this behaviour. Details of what is happening are below:

1. `dune build -w` is run, and it works as expected. Most of the time is spent rule finding, but due to the cache and everything already being built, it finishes quickly.
2. `dune build` is triggered, and as expected everything builds normally and quickly.

There are some key details with this setup. This is the main Coq repo being changed, and importantly there is ml code being edited which is later depended on by dune Coq stanzas. Retrying the same test, but this time with the `@check` target (building only ml code), results in no observable slowdown, leading me to believe that this is an issue with the interaction of rule finding and the Coq rules.

Reproduction

1. Run `dune build -w`. This will take a while, but once it starts building .v files (see with `--display=short` or `--verbose`) you can do the next step.
2. Change `kernel/declareops.ml`; you can even drop `let _hello = "world"` inside.

Specifications

- dune (output of `dune --version`): 3.0.3
- ocaml (output of `ocamlc --version`): 4.12.1

cc @rgrinberg