
Trace files should also contain dune internal computations #3862

Open
Khady opened this issue Oct 12, 2020 · 21 comments
Labels
feature-request, postponed (This issue/PR is postponed and will be reviewed later)

Comments

@Khady
Contributor

Khady commented Oct 12, 2020

Desired Behavior

When building a target with the --trace-file option, dune currently records how long each external process it runs takes (nproc, ocamldep, ocamlopt, ...). But the trace doesn't contain any detail about what's happening inside dune itself. For example, it doesn't tell us how long it takes to scan all the dependencies in a project (per directory? unsure what the best granularity would be), how long it needs to compute the rules, and so on.

We end up with dune build ./some_target.bc --trace-file=trace.json invocations that appear to do nothing for close to 8 seconds, without any detail about what is causing such a delay, or with long pauses at the start and in the middle of a build without knowing what is going on.

Also it would be great to know how many cores are assigned to each task.
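To illustrate the kind of instrumentation I have in mind, here is a rough sketch in OCaml with made-up names (this is not actual dune code, and it assumes the trace file keeps its current catapult-style JSON events): an internal phase would be wrapped and emitted as an extra event alongside the existing per-process entries.

(* Hypothetical sketch only: none of these names exist in dune. *)
let emit_internal_event oc ~name ~start_s ~dur_s =
  Printf.fprintf oc
    "{\"name\": %S, \"cat\": \"dune-internal\", \"ph\": \"X\", \"ts\": %.0f, \"dur\": %.0f, \"pid\": 0, \"tid\": 0},\n"
    name (start_s *. 1e6) (dur_s *. 1e6)

(* Wrap an internal phase, e.g. scanning the workspace or computing rules. *)
let timed_phase oc name f =
  let t0 = Unix.gettimeofday () in
  let result = f () in
  emit_internal_event oc ~name ~start_s:t0 ~dur_s:(Unix.gettimeofday () -. t0);
  result

Something like timed_phase trace "compute-rules" (fun () -> ...) around the main internal phases would already answer most of the questions above.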

@rgrinberg
Member

We discussed this in a meeting and here's a quick recap:

The slowdown you're experiencing may come from three different sources:

  1. Scanning of the filesystem. I believe we've ruled this out since your workspace wasn't that big.
  2. Parsing of dune files. This could be a factor since you've mentioned you have some fairly large dune files.
  3. Evaluation of dune rules.

It's possible for us to add tracing for all of these operations, with caveats on the 3rd item. While dune rules are evaluated per directory, much computation is shared between the rules. In other words, while it may seem as if loading the rules for dir A is much faster than loading the rules for dir B, it's possible that A and B share the result of some computation and B is simply the one paying for it by forcing it. We'd prefer to avoid misleading the user, so we've decided not to offer per-directory tracing of rules for now, although we could make it experimental to see if it's at all helpful in some cases.
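To see why such timings would mislead, here is a toy illustration (plain OCaml laziness, not dune code): whichever directory forces the shared work first pays for all of it.

(* dir A looks slow and dir B looks instant, but the cost is really a shared
   computation that A happened to force first. *)
let shared_scan = lazy (Unix.sleepf 2.0) (* stands in for an expensive project-wide computation *)

let time f =
  let t0 = Unix.gettimeofday () in
  f ();
  Unix.gettimeofday () -. t0

let () =
  let load_rules _dir = Lazy.force shared_scan in
  Printf.printf "dir A: %.2fs\n" (time (fun () -> load_rules "A"));
  Printf.printf "dir B: %.2fs\n" (time (fun () -> load_rules "B"))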

Additionally, it may be easier to diagnose your problem with existing profiling tools such as memtrace. @aalekseyev will follow up on what precisely he'd like to see to diagnose this.

@aalekseyev shall we still add tracing for 1. and 2.? It seems like something that we can time reliably.

@Khady
Contributor Author

Khady commented Oct 15, 2020

If the sum of the 3 points listed is basically how long it takes to run dune rules target, then I think it is interesting information, but it is also not enough, as it takes only 0.8s in the repo where dune build ./some_target.bc --trace-file=trace.json takes 8s.

Regarding 3, maybe it is possible to see the duration per dune rule rather than per directory? And maybe have a visual representation of dependencies between rules.

@aalekseyev
Collaborator

I don't want to rush into adding tracing that may be confusing and may not really solve the problem.
Adding good tracing for internal computations is not trivial, and I think we're not prepared to take this on right now, probably not until we have the architecture for incremental watching mode ironed out.

@Khady, so we don't have great user-facing ways to profile dune's internal computations right now. I think the best way to diagnose a Dune internal performance issue is to use one of the general OCaml profiling tools: perf, statmemprof, or spacetime.

I know we have successfully profiled dune with spacetime in the past by following this guide: https://blog.janestreet.com/a-brief-trip-through-spacetime/ (may be slightly outdated).
statmemprof+memtrace is a new shiny profiler. I never used it with dune, but I used it with other projects and it's great: it has very small overhead compared to spacetime, but still offers a similar (in some ways better) level of detail.
I'm not at all an expert on using perf and it can be confusing at first, but it's a useful tool as well. The upside is that it can directly measure time (along with many other things) instead of memory allocations.
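For reference, opting into statmemprof+memtrace is a one-line change at program startup; if I remember correctly, the memtrace library only writes a trace when the MEMTRACE environment variable points at an output file, so the instrumented binary behaves normally otherwise. A minimal sketch (outside of dune's own code):

(* Typical memtrace setup: call this once, as early as possible. *)
let () =
  Memtrace.trace_if_requested ();
  (* ... the rest of the program ... *)
  ()

You would then run with something like MEMTRACE=trace.ctf set in the environment and open the resulting file in the memtrace viewer.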

If the problem is in (1) and (2) then I expect either of those methods will clearly point the finger at the relevant phase, but I doubt that's the case. For (3) the result may be tricky to interpret, but I don't know how to make it easier with tracing.

If you end up using any of those methods and need help interpreting the findings, I'm happy to try and help.

@aalekseyev added the "postponed" label (This issue/PR is postponed and will be reviewed later) on Oct 15, 2020
@Khady
Contributor Author

Khady commented Dec 11, 2020

So I tried to use perf and memtrace to get more information about this problem. Without too much success so far.

First, I haven't been able to compile dune with memtrace in it. I think I am missing something in the bootstrap process. Here are the diff and the compilation error I get. This is based on the 2.7.1 tag.

diff --git a/bin/dune b/bin/dune
index 0bccec73f..10a7f0ea3 100644
--- a/bin/dune
+++ b/bin/dune
@@ -1,7 +1,8 @@
 (executable
  (name main)
- (libraries memo dune_lang fiber stdune unix cache_daemon cache dune_rules
-   dune_engine dune_util cmdliner threads.posix build_info dune_csexp)
+ (libraries memtrace memo dune_lang fiber stdune unix cache_daemon cache
+   dune_rules dune_engine dune_util cmdliner threads.posix build_info
+   dune_csexp)
  (bootstrap_info bootstrap-info))
 
 (rule
diff --git a/bin/main.ml b/bin/main.ml
index 68159d2f1..6e30720a9 100644
--- a/bin/main.ml
+++ b/bin/main.ml
@@ -257,6 +257,7 @@ let default =
         ] )
 
 let () =
+  Memtrace.trace_if_requested ();
   Colors.setup_err_formatter_colors ();
   try
     match Term.eval_choice default all ~catch:false with
$ make release
ocaml bootstrap.ml
ocamlc -output-complete-exe -w -24 -g -o .duneboot.exe -I boot unix.cma boot/libs.ml boot/duneboot.ml
./.duneboot.exe
Done: 711/716 (jobs: 1)cd _boot && /home/louis/.opam/duneperf/bin/ocamlopt.opt -c -g -no-alias-deps -w -49 -I +threads main.ml
File "bin/main.ml", line 260, characters 2-29:
Error: Unbound module Memtrace
make: *** [Makefile:46: dune.exe] Error 2

Regarding perf, I went with this command: opam exec -- perf record -F 8000 --call-graph=lbr -- dune build src/main.bc. But without knowing the internals of dune, it is hard to interpret the results.

$ perf report -n -g folded --no-children --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 52K of event 'cycles:uppp'
# Event count (approx.): 19945798793
#
# Overhead       Samples  Command          Shared Object       Symbol                                                                               
# ........  ............  ...............  ..................  .....................................................................................
#
    16.95%          8772  dune             dune                [.] mark_slice
1.26% mark_slice;caml_major_collection_slice;caml_alloc_string;camlStdlib__$5e_130;camlStdune__path__append_2809;camlStdune__path__to_string_6072;camlDune_engine__dep__trace_file_1695;camlSt
1.15% mark_slice;caml_major_collection_slice;caml_alloc_string;camlStdlib__$5e_130;camlStdune__path__append_2809;camlStdune__path__to_string_6072;camlDune_engine__dep__trace_file_1695;camlSt
0.67% mark_slice;caml_major_collection_slice;caml_alloc_string;camlStdlib__$5e_130;camlStdune__path__append_2809;camlStdune__path__to_string_6072;camlDune_engine__dep__trace_file_1695;camlSt
    11.19%          5793  dune             dune                [.] caml_MD5Transform
5.71% caml_MD5Transform;caml_MD5Update;caml_md5_string;camlStdlib__digest__string_133;camlDune_engine__build_system__fun_14028;caml_apply2
3.63% caml_MD5Transform;caml_MD5Update;caml_md5_string;camlStdlib__digest__string_133;camlDune_engine__build_system__fun_14028;caml_apply2;camlFiber__apply_405
0.71% caml_MD5Transform;caml_MD5Update;caml_md5_string;camlStdlib__digest__string_133;camlDune_engine__build_system__fun_14028;caml_apply2;camlFiber__apply_405;camlStdlib__list__iter_236
     7.92%          4170  dune             dune                [.] sweep_slice
0.74% sweep_slice;caml_major_collection_slice;caml_alloc_string;camlStdlib__$5e_130;camlStdune__path__append_2809;camlStdune__path__to_string_6072;camlDune_engine__dep__trace_file_1695;camlS
     6.98%          3617  dune             dune                [.] caml_page_table_lookup

I don't think the problem is with the parsing or interpretation of the rules given that dune rules src/main.bc takes only 0.8s.

@aalekseyev
Collaborator

I don't understand the dune bootstrap process well enough to know how to fix that, but you can try bootstrapping without memtrace and then building with memtrace:

make dev
git checkout "$rev_with_memtrace"
./dune.exe build @install
./_build/install/default/bin/dune build something-or-other

That just worked for me (at least I was able to build dune).

Looking at the perf results, it seems clear that Dep.trace_file in Dune_engine contributes a significant amount, with digest computation also being up there (probably called via Digest.generic from the compute_rule_digest function in build_system.ml).

Both of these scale with the number of dependency edges between rules (the deps argument of compute_rule_digest). Maybe there are just too many edges.
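For intuition, the cost has roughly this shape (a simplified sketch, not the real compute_rule_digest): every rule rebuilds a string out of all of its dependency edges before hashing, so the work is proportional to the total number of edges even when the per-file digests themselves are cached.

(* Simplified sketch: deps are (path, cached file digest) pairs; the
   concatenation and the final hash are redone for every rule. *)
let compute_rule_digest_sketch ~action ~(deps : (string * string) list) =
  Digest.string
    (String.concat ";"
       (action :: List.map (fun (path, digest) -> path ^ "=" ^ digest) deps))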

@aalekseyev
Collaborator

@rgrinberg, I don't know if your "deforestation" efforts are related to this issue, but it seems relevant.

@rgrinberg
Member

Both of these scale with the number of dependency edges between rules (the deps argument of compute_rule_digest). Maybe there are just too many edges.

Perhaps there are also some really large files where calculating the digest is taking some time.

I don't know if your "deforestation" efforts are related to this issue, but it seems relevant.

No relation. Although I suppose it could help a tiny bit.

@aalekseyev
Collaborator

Perhaps there are also some really large files where calculating the digest is taking some time.

I think in that case Digest.file would probably show up in the stack trace and not Digest.string.
That said, I don't know much about lbr backtraces and how trustworthy they are.
I prefer dwarf backtraces for OCaml.

I expect the individual files are not a problem because they are cached, whereas the rest of compute_rule_digest computation is not cached: if the same set of files is depended on by many rules, then we will traverse the whole set many times over.

@aalekseyev
Collaborator

By the way, in jenga we have an optimization where you can "group" dependencies by including the digest of one Dep.Trace.t into another Dep.Trace.t (the equivalent type in jenga is called Proxy_map, for those familiar with the code).
That lets us share these computations and save a lot of time and memory, at the cost of more rebuilding in the case where the same set ends up being constructed in two different ways.

We will probably need to have the same optimization in Dune sooner or later.
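Roughly, the idea looks like this (hypothetical types, not jenga's or dune's actual code): a dependency trace can contain whole pre-digested groups, so rules that share a group hash one small digest instead of re-hashing every member.

type dep_trace =
  | File of string * Digest.t (* a path and its (cached) digest *)
  | Group of Digest.t         (* digest of another dep_trace list, shared between rules *)

let digest_trace (trace : dep_trace list) : Digest.t =
  trace
  |> List.map (function
       | File (path, d) -> path ^ "=" ^ Digest.to_hex d
       | Group d -> "group=" ^ Digest.to_hex d)
  |> String.concat ";"
  |> Digest.string

The trade-off described above is that two different groupings of the same flat set produce different digests, hence some extra rebuilding.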

@rgrinberg
Member

I think in that case Digest.file would probably show up in the stack trace and not Digest.string.

That's true. This is also supported by Path.to_string appearing in the trace. I doubt it would show up if we weren't iterating over a ton of files. @Khady, are you using a lot of glob dependencies?

That lets us share these computations and save a lot of time and memory, at the cost of more rebuilding in the case where the same set ends up being constructed in two different ways.

I'm not familiar with the jenga code base, but I'm trying to think of ways this may occur in dune. Something like (glob_files *.ml) and (foo.ml bar.ml baz.ml) would cause recomputation, for example. Two globs may also evaluate to the same set of files. Anything else?

We will probably need to have the same optimization in Dune sooner or later.

Definitely.

@aalekseyev
Collaborator

I'm trying to think of ways how this may occur in dune

I wouldn't worry about this. We've never seen this be a problem in practice. The examples you give are probably not really examples, though, because globs need to be tracked in full regardless. The examples go along the lines of:

  • [group [a; b; c]; d] is different from [group [a; b]; c; d];
  • [group [a; b; c]; a] is different from [group [b; c]; a]

Anyway, we should probably not derail this thread too much.

@Khady
Contributor Author

Khady commented Jan 6, 2021

The commit tries to solve a performance issue, but I don't think that it provides what was asked for in this issue. And given the trouble I have to go through to guess what is taking time in our builds, I still believe that having a trace of what happens inside dune would be valuable. Could this issue be re-opened?

@rgrinberg
Member

Indeed, let's keep it open.

@rgrinberg rgrinberg reopened this Jan 6, 2021
@Khady
Contributor Author

Khady commented Jan 8, 2021

So I managed to use memtrace thanks to @aalekseyev's instructions. I don't know enough about dune's internals to understand what is going on exactly.

trace.zip

Then I ran with perf again but using dwarf instead of lbr.

opam exec -- perf record -F 8000 --call-graph=dwarf -- ~/Code/github/dune/_build/install/default/bin/dune build src/main.bc
$ perf report -n -g folded --no-children --stdio 
Failed to open /home/louis/Code/ahrefs/frontend/.git/index, continuing without symbols
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 45K of event 'cycles:uppp'
# Event count (approx.): 16293020863
#
# Overhead       Samples  Command          Shared Object       Symbol                                                                                
# ........  ............  ...............  ..................  ......................................................................................
#
    17.07%          7494  dune             dune.exe            [.] mark_slice
     8.64%          3909  dune             dune.exe            [.] sweep_slice
     5.64%          2434  git              libz.so.1.2.11      [.] inflate
5.54% inflate;0x5621c0361bbe;0x5621c02c2348;0x5621c02c4286;0x5621c02c4595;0x5621c032566e;0x5621c0325869;0x5621c0325b81;0x5621c0256446;0x5621c01ba52a;0x5621c01bad0c;0x5621c01bb5a0;0x5621c0195
     5.11%          2253  dune             dune.exe            [.] caml_page_table_lookup
     2.57%          1095  dune             dune.exe            [.] camlStdlib__string__rindex_rec_opt_292
     2.42%          1073  dune             dune.exe            [.] caml_apply2
     2.25%           990  dune             dune.exe            [.] caml_string_compare
     2.05%           914  dune             dune.exe            [.] extern_value
     1.60%           724  dune             dune.exe            [.] caml_oldify_one
     1.51%           673  dune             dune.exe            [.] caml_MD5Transform
     1.08%           478  dune             libc-2.28.so        [.] __memcmp_avx2_movbe
     1.03%           465  dune             dune.exe            [.] caml_oldify_mopup
     0.96%           423  dune             dune.exe            [.] nf_allocate
     0.96%           424  dune             dune.exe            [.] camlStdlib__set__add_185
     0.93%           401  git              libz.so.1.2.11      [.] 0x000000000000e217
0.90% 0x7f9e7e3d9217;inflate;0x5621c0361bbe;0x5621c02c2348;0x5621c02c4286;0x5621c02c4595;0x5621c032566e;0x5621c0325869;0x5621c0325b81;0x5621c0256446;0x5621c01ba52a;0x5621c01bad0c;0x5621c01bb
     0.93%           414  dune             dune.exe            [.] caml_alloc_shr_for_minor_gc
     0.92%           396  git              libc-2.28.so        [.] __memcmp_avx2_movbe
0.91% __memcmp_avx2_movbe;0x5621c032276f;0x5621c02c3799;0x5621c02c47ae;0x5621c0325572;0x5621c0325869;0x5621c0325b81;0x5621c0256446;0x5621c01ba52a;0x5621c01bad0c;0x5621c01bb5a0;0x5621c01953a7
     0.89%           387  dune             dune.exe            [.] compare_val
     0.88%           375  dune             dune.exe            [.] camlDune_rules__Module__sources_3041
     0.85%           379  dune             dune.exe            [.] caml_hash
     0.82%           371  dune             dune.exe            [.] camlStdlib__hashtbl__find_opt_686
     0.80%           374  dune             dune.exe            [.] extern_record_location.part.0
     0.78%           341  dune             dune.exe            [.] camlStdune__Set__compare_534
     0.77%           340  dune             dune.exe            [.] caml_alloc_string
     0.71%           312  dune             dune.exe            [.] camlStdune__String__compare_256
     0.67%           287  dune             dune.exe            [.] camlStdlib__map__fold_383
     0.65%           276  dune             dune.exe            [.] camlStdune__List__loop_294
     0.64%           283  dune             dune.exe            [.] camlStdlib__set__bal_164
     0.63%           274  git              libz.so.1.2.11      [.] 0x0000000000009def
0.63% 0x7f9e7e3d4def;inflate;0x5621c0361bbe;0x5621c02c2348;0x5621c02c4286;0x5621c02c4595;0x5621c032566e;0x5621c0325869;0x5621c0325b81;0x5621c0256446;0x5621c01ba52a;0x5621c01bad0c;0x5621c01bb
     0.56%           250  dune             dune.exe            [.] nf_allocate_block
     0.52%           229  dune             dune.exe            [.] camlStdlib__list__rev_append_121
     0.51%           227  dune             libc-2.28.so        [.] __memmove_avx_unaligned_erms
     0.46%           198  dune             dune.exe            [.] camlStdune__Path__parent_3040

and the full report: perf-report.txt

I could share the perf trace but it is over 300mb. So I created a smaller one with a frequency of 120. Findings are about the same.
perf.zip

@rgrinberg
Member

Thanks for the data points @Khady. Some good news and bad news. The bad news is that I think some of the slowdown in your case is essential. You likely have a non-trivial number of modules, libraries, and packages, and a clean build has to do a minimum amount of work to process all of this.

Now for some good news. There seem to be three pieces of low-hanging fruit that we can fix:

  • Inefficiencies in the merlin config generation.
  • We needlessly set up odoc rules for all libraries.
  • We do a fair amount of unnecessary allocation in the install rules.

We could fix all of these things, but I doubt this will do more than shave off a couple more seconds. The long-term solution is to reuse the memoized rules between the builds themselves.

@Khady
Contributor Author

Khady commented Jan 9, 2021

I gathered some numbers to give some context. The project consists of 1200 ml(i) files (about 90k lines total), 500 atd files, and 150 custom files. Each of those 150 custom files is turned into 1 atd + 2 ml files. Those files are divided into 15 libs. In the end it amounts to 3600 ml(i) files to compile. The project depends (directly) on about 40 external (opam) libs. Generating main.bc requires 17000 dune rules (I'm guessing this number from the done x/17000 (40 jobs) line; dune rules src/main.bc | wc -l shows 533).

While I understand that it takes some work to check all of that, I feel that there is something wrong. It shouldn't take 6s to run 10k stat calls on files. The project is not that big; it's definitely smaller than some duniverses out there. And I don't know what the code layout at janestreet looks like, but if they have 10 million lines of code, I wonder how dune copes there.

I have the impression that dune is using only one thread to check the rules. That's a real limitation. I can get a server with 8/40/256 cores, but I'll still face the same issue.

The computation of the rules isn't super fast but isn't the main blocker either. dune rules src/main.bc takes 0.8s.

It would be nice if dune could provide some stats about a project. For now I have to explore the file system myself and do some guessing. Ideally: the number of human-written rules in the project (as in rules/stanzas written by a human), the total number of rules (17k in this case?), the number of files in the project, the number of libs, and the time taken to compile each lib; I'd even like the time taken for each human-written stanza. Maybe I should open another issue for this feature request?

@rgrinberg
Member

While I understand that it is some work to check all that, I feel that there is something wrong. It shouldn't take 6s to run 10k stat on files.

This is not where the time is being spent. Before dune can check that your 17k rules are up to date, dune must first generate these rules. Generating these rules requires quite a bit of computation.

And I don't know what is the code layout at janestreet, but if they have 10 million lines of code, I wonder how dune is doing there.

This is precisely one of the reasons why dune in its current form is not usable at janestreet. Generating the rules from scratch on every run does not scale. This is why we've added the rule memoization framework. We have yet to use it to save computation between runs, but it should address this issue.
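For readers following along, the caching half of that idea is sketched below; the hard parts that the real Memo framework handles (dependency tracking and invalidation) are left out, and the names are illustrative only.

(* Minimal sketch of memoizing a per-directory computation. *)
let memoize (type a) (f : string -> a) : string -> a =
  let table : (string, a) Hashtbl.t = Hashtbl.create 16 in
  fun key ->
    match Hashtbl.find_opt table key with
    | Some v -> v
    | None ->
      let v = f key in
      Hashtbl.add table key v;
      v

To actually save work between runs, the table would also have to outlive the process (be persisted, or live in a long-running watch/daemon mode), and invalidation would have to be correct when inputs change.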

I've the impression that dune is using only one thread to check the rules. That's a real limitation. I can get a server with 8/40/256 cores, but I'll still face the same issue.

That is definitely a limitation. Hopefully once multicore is available, we can experiment with making things faster.

@aalekseyev
Collaborator

aalekseyev commented Jan 11, 2021

It seems that roughly a third of the allocation samples in the memtrace come from the Merlin rules, which is very impressive.
That does look like a massive piece of low-hanging fruit indeed.

I couldn't get perf to show me backtraces, though (in fact I believe the data file doesn't have symbols at all).

@aalekseyev
Collaborator

@Khady, you showed that the rules computation is fast, but this result can be misleading because rule computation is lazy: it will only compute the portion of the rules that is needed to answer a given query. For example, it won't compute the command lines of most commands, and it probably won't compute the merlin config files, which is what takes a large chunk of the time.
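A small illustration of that laziness (illustrative types only, not dune's): a rules-listing query touches only the cheap fields, while a real build forces the expensive ones.

type rule = {
  targets : string list;
  deps : string list;
  command : string Lazy.t;       (* only forced when the rule is actually executed *)
  merlin_config : string Lazy.t; (* not forced by a `dune rules`-style query *)
}

(* What a rules listing needs: no Lazy.force anywhere. *)
let rule_summary r =
  String.concat " " r.targets ^ " <- " ^ String.concat " " r.deps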

@ELLIOTTCABLE
Contributor

Would a draft PR on this topic be welcome?

I've done a lot of "asynchronous tracing" work recently, and have been building out some tooling for that (with @c-cube); I think it could be well-applied in Dune to help @Khady/@jchavarri narrow down some of these issues on our codebase. (=

@rgrinberg
Member

Sure, drafts are fine. I don't know what you have in mind, but it's worth noting that we are unlikely to add new dependencies just for this.
