Ocaml fails to build when too many cores are used #10235

Closed
vsiles opened this issue Feb 19, 2021 · 15 comments

@vsiles

vsiles commented Feb 19, 2021

Hi !

I'm seeing a lot of OCaml build failures on machines with many cores (24 and 80 cores). Here is a repro that fails most of the time, but not always, and not always at the same step.

export OPAMROOT="$PWD/myroot"
export OPAMYES="1"
rm -rf "$OPAMROOT"
mkdir -p "$OPAMROOT"
rm -rf ./tmp
mkdir -p tmp
opam init --disable-sandboxing --no-setup --bare
opam switch create my_switch --empty
cd tmp
opam source ocaml-variants.4.11.1+fp
cd ocaml-variants.4.11.1+fp
./configure
make -j23 world
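
Because the failure is intermittent, it can help to rerun the build in a loop and keep the log of the first failing attempt. The following is a minimal sketch, not part of the original report: `repeat_until_failure` and the `LOGDIR` variable are hypothetical names, and the usage line assumes that `make clean` is enough to reset the tree between attempts.

```shell
# Hypothetical helper: run a command up to MAX times, one log file per attempt.
# Prints the number of the first failing attempt (0 if no attempt failed).
repeat_until_failure() {
  max=$1
  shift
  logdir=${LOGDIR:-.}
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    # Capture stdout and stderr of this attempt in its own log file.
    if ! "$@" > "$logdir/build_$attempt.log" 2>&1; then
      echo "$attempt"
      return 1
    fi
    attempt=$((attempt + 1))
  done
  echo 0
}
```

Possible usage from the source directory: `repeat_until_failure 20 sh -c 'make clean && make -j23 world'`.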

The two main errors I'm witnessing are: Error: Could not find the .cmi file for interface but the .cmi file in question is there, and Error: Unbound module but the module is there.

The errors are really random.
You can find more context in the original report (I first thought it was opam related) at ocaml/opam#4552

Please find an example of .env file at https://pastebin.com/KtKmfpSP and .out file at https://pastebin.com/5YHy5Rdp from my last failure

@dra27
Member

dra27 commented Feb 19, 2021

I'm not quickly seeing it, although the machine is still looping. Could you give some more environment details - platform, etc.?

@dra27
Member

dra27 commented Feb 19, 2021

Also exactly which version of make is it running with?

@vsiles
Author

vsiles commented Feb 19, 2021

Sorry, here is more information:
Linux - Centos

$ cat /etc/centos-release
CentOS Stream release 8
$ free -m
              total        used        free      shared  buff/cache   available
Mem:         114682       28444        1478        4801       84760       80323
Swap:         65535        6293       59242
$ cat /proc/cpuinfo # 24 of them
processor       : 23
vendor_id       : GenuineIntel
cpu family      : 6
model           : 61
model name      : Intel Core Processor (Broadwell)
stepping        : 2
microcode       : 0x1
cpu MHz         : 2394.255
cache size      : 16384 KB
physical id     : 23
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 23
initial apicid  : 23
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips        : 4890.48
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
$ make --version
GNU Make 4.2.1
Built for x86_64-redhat-linux-gnu
Copyright (C) 1988-2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

$ gcc --version
gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Is there anything else I can provide?
FYI, I saw this happen on this machine, but also on our CI machines (IIRC they are virtual servers with 80 cores, same setup).

@gasche
Member

gasche commented Feb 19, 2021

The two main errors I'm witnessing are: Error: Could not find the .cmi file for interface but the .cmi file in question is there, and Error: Unbound module but the module is there.

It's interesting that the two errors seem filesystem-related. Which filesystem are you using? (I am not saying it must be a filesystem error, I can't think of a scenario where make would see files that are not visible to its own subprocesses, but I have never thought about this at all. The most likely cause for the error is incorrect dependencies in the Make build system.)

@vsiles
Author

vsiles commented Feb 19, 2021

I'm currently using EdenFS (https://github.com/facebookexperimental/eden), a FUSE-based filesystem. I don't know much more than that, to be honest.

@vsiles
Author

vsiles commented Feb 19, 2021

Let me correct my last message:

  • OPAMROOT is on a fuse/EdenFS file system
  • I managed to make the error trigger with the build artefacts created on a fuse/EdenFS and a btrfs file system

I'll try to replicate in full btrfs

@gasche
Member

gasche commented Feb 19, 2021

(Today I learned that Facebook uses a non-decentralized fork of Mercurial internally, whose monorepo-optimized server component is called Mononoke.)

@vsiles
Author

vsiles commented Feb 19, 2021

FYI, I can't repro the failure after moving the OPAMROOT to a btrfs file system... so I'm confused :) If I can run other tests to help investigate, feel free to ask.

@lthls
Contributor

lthls commented Feb 19, 2021

With the default opam options at least, all your build artefacts will be created inside the OPAMROOT folder (the packages you're compiling, including the compiler, are first copied into $OPAMROOT/$OPAMSWITCH/.opam-switch/build). So the file system used to store your code should not be relevant to the problem (which seems to be what you've observed).
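
To see which filesystem a given directory actually lives on, something like the following may help. It is only a sketch that assumes GNU coreutils `stat` (as shipped on CentOS); `fs_type_of` is a hypothetical helper name, and `my_switch` is the switch name taken from the repro script earlier in this thread.

```shell
# Hypothetical helper: print the type of the filesystem that holds a path.
# Assumes GNU stat: -f queries the filesystem, %T prints its type name.
fs_type_of() {
  stat -f -c %T "$1"
}

# The opam build directory described above (checked only if it exists).
build_dir="${OPAMROOT:-$HOME/.opam}/my_switch/.opam-switch/build"
if [ -d "$build_dir" ]; then
  echo "build dir: $(fs_type_of "$build_dir")"
fi
echo "current dir: $(fs_type_of .)"
```

Comparing the two lines shows whether the sources and the build artefacts sit on different filesystems.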

So the most likely culprit, I think, is the fuse/EdenFS combination allowing ocamlc -c foo.mli to think that foo.cmi has been correctly created while in fact it is still being processed by the EdenFS daemon. Then make, thinking that foo.cmi has been created, triggers ocamlc -c foo.ml. This in turn tries to access foo.cmi, which EdenFS will report as not existing since it still hasn't finished creating the file.
Do you have a way to ask the EdenFS devs to check whether this could happen?

@vsiles
Author

vsiles commented Feb 19, 2021

I'll try to ping them on this. Not sure they will care since ocaml is not "an officially supported language" at FB, but it is worth a try

@gasche
Member

gasche commented Feb 19, 2021

I'm no make expert; the two sorts of parallel-make failures I have observed with the OCaml build are as follows:

  • Missing dependencies, that result in trying to use an artifact that is not there, or (much more common) reusing a stale build artifact in an incremental-recompilation scenario.
  • Race conditions, where several processes race to create the same file, and some processes observe a corrupted version in the meantime. For example, if ocamlc and ocamlopt race to create the same .cmi, some consumers could see an empty or partially-truncated file. (This particular issue should have been fixed now that we use a temporary file with an atomic mv at the end; see #1307, "Create .cmi files atomically (MPR#7472)".)

In your example failure log, the two relevant actions seem to be the following:

# 399
./boot/ocamlrun ./boot/ocamlc [...] -c asmcomp/printlinear.mli

# 410
./boot/ocamlrun ./boot/ocamlc [...] -c asmcomp/printlinear.ml

# Error
File "[...]/.opam-switch/build/ocaml-variants.4.11.1+fp/asmcomp/printlinear.ml", line 1:
Error: Could not find the .cmi file for interface
       [...]/.opam-switch/build/ocaml-variants.4.11.1+fp/asmcomp/printlinear.mli

There is a dependency from printlinear.cmo on printlinear.cmi in .depend (so this is not the "missing dependency" scenario), so we know that Make made sure to compile the .mli sequentially before the .ml (so this should not be the "race condition" scenario: line 410 should have started after 399 completed). But it might be a filesystem-interaction issue: we are assuming that after 399 completes, all further commands see the produced .cmi file, and apparently this does not happen here. On the OCaml side, the .cmi file is populated by a Sys.rename call, and then read again by a Sys.readdir call. There might be an issue in those, or the filesystem may be incorrectly delaying the visibility of the file (but then I would assume that many other parallel-build workflows would be affected, not just OCaml).
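
The assumption in question (a file renamed into place is immediately visible to the next process that lists the directory) can be probed directly on the suspect filesystem. A minimal sketch follows; `check_rename_visibility` and the `probe_*.cmi` file names are made up for illustration, and the loop is sequential, so it exercises far less concurrency than the real build.

```shell
# Hypothetical probe: write a temp file, rename it into place, then have a
# fresh process list the directory. Prints how many renamed files were missed.
check_rename_visibility() {
  dir=$1
  count=$2
  misses=0
  k=0
  while [ "$k" -lt "$count" ]; do
    target="$dir/probe_$k.cmi"
    echo data > "$target.tmp"
    mv "$target.tmp" "$target"
    # ls runs as a separate process, like a later compiler invocation would.
    if ! ls "$dir" | grep -qxF "probe_$k.cmi"; then
      misses=$((misses + 1))
    fi
    k=$((k + 1))
  done
  echo "$misses"
}
```

On a filesystem that behaves as expected, `check_rename_visibility "$(mktemp -d)" 100` should print 0; a non-zero count in a directory on the FUSE mount would point at the filesystem rather than at OCaml.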

@vsiles
Author

vsiles commented Feb 19, 2021

I asked the EdenFS folks whether that seemed plausible; I will forward their answer (if possible :D).

@gasche
Member

gasche commented Feb 19, 2021

If this is a filesystem issue, you may be able to reproduce it without OCaml:

  • A double-indexed family of files foo{i,j}, all containing the string foo.
  • Initially only the foo{0,j} are present on disk.
  • A final target that depends on all foo{N,j}
    (N is the length of the dependency chain, just pick 5 for testing)
    (the j indices only serve to introduce parallelism in the build)
  • Makefile knows how to produce foo{i+1,j} from foo{i,j} by creating a temporary file with content foo (fopen, .., fclose) and then renaming this temporary file into the target.
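
The recipe above could be sketched as a small generator script. This is an illustration under assumptions, not a tested reproducer: the names `gen_makefile`, `seed_files`, `Makefile.repro` and the `foo_i_j` file naming are hypothetical, and the `grep -q foo $<` step is an addition so that each rule actually reads its dependency, the way the compiler reads a .cmi.

```shell
# Hypothetical helper: create the initial files foo_0_j for j in 0..J-1.
seed_files() {
  j=0
  while [ "$j" -lt "$1" ]; do
    echo foo > "foo_0_$j"
    j=$((j + 1))
  done
}

# Hypothetical generator: print a Makefile with J independent chains of
# length N. Each rule reads its dependency, writes a temp file, renames it.
gen_makefile() {
  n=$1
  jmax=$2
  printf '.PHONY: all\nall:'
  j=0
  while [ "$j" -lt "$jmax" ]; do
    printf ' foo_%s_%s' "$n" "$j"
    j=$((j + 1))
  done
  printf '\n\n'
  i=0
  while [ "$i" -lt "$n" ]; do
    next=$((i + 1))
    j=0
    while [ "$j" -lt "$jmax" ]; do
      printf 'foo_%s_%s: foo_%s_%s\n' "$next" "$j" "$i" "$j"
      printf '\tgrep -q foo $<\n'       # fails if the dependency is not readable
      printf '\techo foo > $@.tmp\n'    # write the content to a temporary file
      printf '\tmv $@.tmp $@\n\n'       # rename it into place
      j=$((j + 1))
    done
    i=$((i + 1))
  done
}
```

In an empty directory on the filesystem under test, one could then run `seed_files 23`, `gen_makefile 5 23 > Makefile.repro` and `make -j23 -f Makefile.repro all`, repeating after deleting the generated `foo_*` files.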

@xavierleroy
Contributor

For what it's worth: OCaml's CI includes a parallel build test, which does make -j60 on a 40-core, 80-thread Linux server. The test has been successful for quite a while, which gives me hope that dependencies are good.

@vsiles
Author

vsiles commented Feb 22, 2021

If this is a filesystem issue, you may be able to reproduce it without OCaml:

  • A double-indexed family of files foo{i,j}, all containing the string foo.
  • Initially only the foo{0,j} are present on disk.
  • A final target that depends on all foo{N,j}
    (N is the length of the dependency chain, just pick 5 for testing)
    (the j indices only serve to introduce parallelism in the build)
  • Makefile knows how to produce foo{i+1,j} from foo{i,j} by creating a temporary file with content foo (fopen, .., fclose) and then renaming this temporary file into the target.

I didn't manage to replicate the issue this way. I'll try a few other things, but I think we are now fairly convinced the issue is FS-related. I've contacted the EdenFS devs and filed a report there. Let's close this one :) Thanks for the help investigating the issue.

@vsiles vsiles closed this as completed Feb 22, 2021
symphorien pushed a commit to symphorien/nixpkgs that referenced this issue Feb 27, 2022
Enable parallel building for ocaml-4.08 and above. tested as:

    $ nix build -f. ocaml-ng.ocamlPackages_{4_{00_1,01_0,02,03,04,05,06,07,08,09,10,11,12,13},latest}.ocaml --keep-going

ocaml build system supports parallel building, but not for multiple
top-level targets at the same time, as it usually spawns $(MAKE)
subprocesses that occasionally conflict with one another. To work around
it, we use a tiny Makefile with a single rule that calls the top-level
targets sequentially via $(MAKE):

    nixpkgs_world_bootstrap_world_opt:
       $(MAKE) world
       $(MAKE) bootstrap
       $(MAKE) world.opt

On a 16-core machine ocaml-4.12 build speeds up from 6m55s to 1m35s.

Releases 4_00_1, 4_01_0, 4_04 and 4_05 still have some race in them.
Thus this change enables parallel builds only for ocaml-4.06 and above.

Adapted from NixOS#142723

upstream's CI tests the parallel makefile: ocaml/ocaml#10235 (comment)
The limit was chosen to be 4.08 because it was released in 2019, not too
long before the above link.