Add GNU make jobserver client support #1139

stefanb2 · 2016-04-27T10:31:13Z

As long as ninja is the only build execution tool, the current ninja -jN implementation works fine.

But when you try to convert parts of an existing recursive GNU make based SW build system to ninja, then you have the following situation:

top-level GNU Make (with -jX, acts as job server)
M instances of GNU make (with -j, act as job server clients)
N instances of ninja (don't know anything about job server)

Simply calling `ninja -jY' isn't enough, because then the ninja instances will try to run Y*N jobs, plus the X jobs from the GNU make instances, causing the build host to overload. Relying on -lZ to fix this issue is sub-optimal, because load average is sometimes too slow to reflect the actual situation on the build host.

It would be nice if GNU make jobserver client support could be added to Ninja. Then the N ninja instances would cooperate with the M GNU make instances and on the build host only X jobs would be executed at one time.
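To put numbers on the overload described above, here is a small back-of-the-envelope sketch. The function names and the sample values (X=8, N=4, Y=8) are made up for illustration; the formula itself is just the "Y*N plus X" estimate from the description.

```python
def worst_case_jobs(make_jobs: int, ninja_instances: int, ninja_jobs: int) -> int:
    """Upper bound on concurrent jobs when ninja ignores the jobserver.

    make_jobs       -- X, the limit enforced by the top-level GNU make
    ninja_instances -- N, ninja processes started somewhere in the tree
    ninja_jobs      -- Y, the value each ninja gets via -jY
    """
    # The GNU make instances share X job slots, while every ninja
    # instance schedules up to Y jobs on its own.
    return make_jobs + ninja_instances * ninja_jobs


def cooperative_jobs(make_jobs: int) -> int:
    """With jobserver client support, all tools share the same X slots."""
    return make_jobs


if __name__ == "__main__":
    x, n, y = 8, 4, 8  # illustrative values only
    print("without jobserver client:", worst_case_jobs(x, n, y))
    print("with jobserver client:   ", cooperative_jobs(x))
```

With these sample values the uncoordinated build can reach 40 concurrent jobs on a host that was asked to run 8.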


- add new TokenPool interface
- GNU make implementation for TokenPool parses and verifies the magic information from the MAKEFLAGS environment variable
- RealCommandRunner tries to acquire TokenPool
  * if no token pool is available then there is no change in behaviour
- When a token pool is available then RealCommandRunner behaviour changes as follows
  * CanRunMore() only returns true if TokenPool::Acquire() returns true
  * StartCommand() calls TokenPool::Reserve()
  * WaitForCommand() calls TokenPool::Release()

Documentation for GNU make jobserver
http://make.mad-scientist.net/papers/jobserver-implementation/

Fixes ninja-build#1139
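For readers who have not looked at the protocol, the "magic information" in MAKEFLAGS can be sketched roughly as follows. This is not the code from the patch (which is C++ and also verifies the descriptors); it is a minimal Python illustration that only parses. The option spellings `--jobserver-fds=R,W`, `--jobserver-auth=R,W` and `--jobserver-auth=fifo:PATH` are taken from the GNU make documentation, not from this thread, and `JobserverInfo` and `parse_makeflags` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class JobserverInfo(NamedTuple):
    """Hypothetical container for what a client learns from MAKEFLAGS."""
    kind: str                 # "pipe" or "fifo"
    read_fd: Optional[int]    # set for kind == "pipe"
    write_fd: Optional[int]   # set for kind == "pipe"
    path: Optional[str]       # set for kind == "fifo"


# Older GNU make releases spell the option --jobserver-fds, newer ones
# --jobserver-auth (assumption based on the GNU make manual).
_AUTH_RE = re.compile(r"--jobserver-(?:auth|fds)=(\S+)")


def parse_makeflags(makeflags: str) -> Optional[JobserverInfo]:
    """Return jobserver details, or None if no usable jobserver is advertised."""
    matches = _AUTH_RE.findall(makeflags)
    if not matches:
        return None
    value = matches[-1]  # if the option is repeated, the last one is relevant
    if value.startswith("fifo:"):
        return JobserverInfo("fifo", None, None, value[len("fifo:"):])
    m = re.fullmatch(r"(-?\d+),(-?\d+)", value)
    if m is None:
        return None
    read_fd, write_fd = int(m.group(1)), int(m.group(2))
    if read_fd < 0 or write_fd < 0:
        return None  # negative descriptors: jobserver not available to us
    return JobserverInfo("pipe", read_fd, write_fd, None)


if __name__ == "__main__":
    print(parse_makeflags(" -j8 --jobserver-auth=3,4"))
    print(parse_makeflags("-k"))
```

A real client would additionally check that the descriptors are open and refer to a pipe before trusting them, which is the "verifies" part of the commit message.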

stefanb2 · 2016-04-27T11:02:51Z

I have tested this implementation over the last few weeks in two different recursive GNU make based build systems that originally had M+1 GNU make instances:

use case A: top-level GNU make, 1 ninja instance, M-1 GNU make instances
use case B: top-level GNU make, N ninja instances, M-N GNU make instances

FYI: google/kati was used to convert existing single makefile GNU make parts to Ninja build file.

nico · 2016-04-27T13:44:48Z

Thanks for the patch!

We've discussed this on the mailing list a few times (e.g. here https://groups.google.com/forum/#!searchin/ninja-build/jobserver/ninja-build/PUlsr7-jpI0/Ga19TOg1c14J). Ninja works best if it knows about the whole build. Now that kati exists, one can convert those to ninja files and munge them up to have a single build manifest (that's Android's transition strategy from Make to Ninja -- they use kati to get everything converted to Ninja files, and then they're incrementally converting directories to use something-not-make -- and then kati produces parts of their Ninja files and the new thing produces parts of the ninja files.)

Is your use case that you have recursive makefiles?

stefanb2 · 2016-04-27T17:27:49Z

I could have guessed that this has been discussed before, because I'm surely not the first person facing such a situation.

Here are my reasons for requesting this:

1. recursion: kati currently can't translate recursive GNU make based build systems, like Linux kernel kbuild. IMHO that would be a major effort, and unfortunately I can't wait for kati to provide it, hence such sub-component builds will have to stay with GNU make for the time being.
2. missing features: kati currently can't translate fully modularized GNU make based build systems, i.e. where each component is built in isolation and in a separate build directory, so that all build.ninja files could be merged into a single one. While IMHO not such a major issue as (1), it is much simpler to replace the lowest-level $(MAKE) recipe with a kati/ninja recipe. Parsing + merging might also introduce unnecessary build delay (it remains to be seen what would happen in real life).
3. technical barriers: e.g. sub-component builds that run behind a "chroot firewall". Even if everything moves to Ninja, you would still need 1 (main) + N (one for each chroot) ninja instances that need to cooperate. Ninja doesn't offer anything like that.
4. too simple workarounds: AOSP makeparallel + kati/ninja runs all $(MAKE) instances hard-coded with "make -j4", with no cooperation between any of the GNU make instances. That is only acceptable if you have no, only a few, or only small $(MAKE) invocations from the build.ninja file.
5. organizational barriers: even if it might be possible to use kati/ninja to convert an existing GNU make based sub-part of the system, you might not be allowed to do so. Such sub-component builds need to stay with GNU make.
6. You ask: why not split the build up and run them as separate builds? Goto (5)...

IMHO my patch provides a good solution, considering

how small the required changes to ninja are,
that the default behaviour is completely unchanged, and
that this will make the life easier for many other ninja users which face the same issues

ghost · 2016-05-23T03:22:59Z

wow +1

maximuska · 2016-08-07T13:47:36Z

Another possible reason for having jobserver in ninja seems to be LTO support in gcc. -flto=jobserver tells gcc to use GNU make's job server mode to determine the number of parallel jobs. The alternative is to spawn a fixed number of jobs with e.g., -flto=16.

fabio-porcedda · 2017-03-10T10:48:50Z

I would also like to have this feature merged. I simply cannot convert all projects to ninja-build because I'm not allowed to do that.

@stefanb2 Thanks a lot for your work

dublet · 2017-04-12T17:44:05Z

Can I just add my voice to the list of people who would like this to be merged? At my company we also use a nested build system, and with this patch it makes ninja behave very nicely indeed. We're not in the position to make ninja build everything yet.

glandium · 2017-05-26T01:02:15Z

Please note that from a quick glance at the commit on @stefanb2's branch, I expect it doesn't work on Windows, where Make uses a different setup.

stefanb2 · 2017-05-26T06:22:06Z

@glandium correct, in the Windows build a no-op token pool implementation is included. But I fail to see why this would be a relevant reason for rejecting this pull request.

That said, I'm pretty sure that it would be possible to provide an update that implements the token protocol used by Windows GNU make 4.x. Probably tokenpool-gnu-make.cc could be refactored into system agnostic and UNIX-dependent bits.

nox · 2017-11-11T10:09:25Z

This would be really useful too when invoking ninja as part of another build tool, such as cargo.

comicfans · 2017-11-12T23:44:19Z

This would be very useful for super-project builds. In our large code base, due to different compiler/environment configs, we cannot include all projects in one single ninja build, so we have 1 top-level build and N sub-projects built by ninja. This configuration triggers the Y*N problem.

xqms · 2017-12-06T10:10:01Z

+1 - this is highly interesting for parallel builds with catkin_tools (https://catkin-tools.readthedocs.io/en/latest/). A catkin_tools workspace consists of separate CMake projects which are built in isolation. To control the CPU consumption of parallel make runs, catkin_tools contains a GNU Make jobserver implementation.
In this way, the make jobserver is starting to become a standard "protocol" for controlling resource consumption of parallel builds.

Note that in the catkin_tools scenario, it is not easy to merge the individual build.ninja files into a hierarchy of subninja files, because

Targets/individual rules will clash - would need CMake changes to keep them apart.
We would need some way of encoding inter-package dependencies (build this subninja before that).
catkin_tools needs to perform additional installation steps after a package has been built.
Also, catkin_tools provides many nice features which would be defeated by a merged build (package-level monitoring, build output grouped by packages, ...).

yann-morin-1998 · 2018-01-06T17:44:13Z

@nico I would like to add my voice in favour of GNU make job-server support in ninja.

Meta-buildsystems like OpenEmbedded (Yocto), OpenWRT, Buildroot and a lot of others,
are tasked with generating systems by building a lot of various packages from various sources,
all using various buildsystems. I'll mostly use Buildroot as an example, as I'm very familiar with
it, but the following is in principle applicable to all the buildsystems as well.

Such build systems will typically have this sequence per package they build:

download sources of a package
extract the sources
configure the package
build the package
install it in a staging location

And they will repeat that sequence for each and all packages that are needed to build the
target system:

build busybox
build coreutils
build foo
build bar
etc...

Once all packages have been built and installed in the staging location, a system image
(e.g. a bootloader + Linux Kernel + root filesystem for example) is generated from that
staging location. That system image can then be directly flashed onto a device.

Now, that was the quick overview.

Since a system can be made of a lot of packages, we want to build as many packages in
parallel as possible (respecting the dependency chain, of course). But then for each package, we also want
to take advantage of parallel compilation, in case no other package is being built at the same
time.

So, if we have a 8-core machine, we would want to build up to 8 jobs in parallel, which means
we have to distribute those jobs to the various packages that need to be built at some point in
time, so that we maximize the number of jobs, but do not over-shoot the 8-CPU limit.

For example, if 8 ninja-based packages are built in parallel and they do not share a job-server,
they will each be building 8 jobs, which is a total of 64 parallel jobs. On the other hand, limiting
the ninja builds to a single job will be a waste of time when only a single package is built at some
point in time (e.g. because the other ones have already finished building, or because the
dependency chain needs that one package before continuing).
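The effect of a shared pool can be illustrated with a toy simulation. This is only a sketch: threads stand in for package builds, a semaphore stands in for the job-server, and `simulate` is a made-up name. It ignores the detail that each real jobserver client also owns one implicit job slot.

```python
import threading
import time


def simulate(packages: int, jobs_per_package: int, tokens: int) -> int:
    """Run packages * jobs_per_package fake jobs and return the peak concurrency."""
    pool = threading.BoundedSemaphore(tokens)  # stand-in for the shared job-server
    lock = threading.Lock()
    running = 0
    peak = 0

    def job() -> None:
        nonlocal running, peak
        with pool:                    # take a token; it is returned on exit
            with lock:
                running += 1
                peak = max(peak, running)
            time.sleep(0.01)          # stand-in for a compile step
            with lock:
                running -= 1

    threads = [threading.Thread(target=job)
               for _ in range(packages * jobs_per_package)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


if __name__ == "__main__":
    # 8 packages that each want 8 jobs, sharing 8 tokens: never more than 8 at once.
    print("peak with shared pool:", simulate(8, 8, 8))
```

All 64 jobs still get done, and when only one package has work left it can use every token that is free, which is the behaviour neither "-j8 each" nor "-j1 each" can provide.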

And as has been already explained in previous posts in this thread, not every package is based
on ninja, and not every package is even conceivably switchable to ninja. And even if every package
were using ninja, we can't simply aggregate all the ninja definitions to have a super-build, because
everything would end up clashing with everything else... So we still need to be able to cooperate with
the rest of the world, especially when that rest of the world has been established for decades now... ;-)

Thanks for reading so far! :-)

This reverts commit 0e6689d. Parallel builds are broken due to a mix of Make/Ninja and the job server not being operational. See ninja-build/ninja#1139 Signed-off-by: Anas Nashif <anas.nashif@intel.com>

ihnorton · 2018-03-14T16:37:46Z

+1. We also face this issue of Y*N ninjas while using CMake ExternalProject functionality.

avikivity · 2024-02-04T17:44:06Z

Please consider merging this, it's helpful for build systems that have to recurse into other build systems, and for LTO links.

mattgodbolt · 2024-02-04T21:05:46Z

Seconded; our LTO builds suffer from either overcommitting CPU resources, or under utilizing as they don't play nicely with the overarching ninja setup.

mathstuf · 2024-02-05T02:11:07Z

Note that this is only about making ninja take into account running under make. ninja is not setting up a jobserver to communicate with a make or any other tool running under it. It also doesn't (AFAIK) communicate with any commands ninja runs that may want to also participate (e.g., a build rule in build.ninja won't be able to tell a sub-make command about the job server either)

mattgodbolt · 2024-02-06T14:46:06Z

Got it! Thanks...I got myself confused: I'm after job server support which I think is #1139 😊
edit: or maybe not! Maybe that's not actually filed anywhere: my use case is ninja being able to run the linker with lto options that limit the number of CPUs it uses in the same way as ninja itself limits things.

avikivity · 2024-02-06T15:22:14Z

I think it would work by running ninja under make, so make would be the jobserver for ninja and anything it spawns.

mathstuf · 2024-02-06T16:15:16Z

I think it would work by running ninja under make, so make would be the jobserver for ninja and anything it spawns.

Doesn't ninja need to coordinate to keep the right files open and environment intact for its rules to communicate with the job server?

eli-schwartz · 2024-02-06T16:25:49Z

Doesn't ninja need to coordinate to keep the right files open and environment intact for its rules to communicate with the job server?

No, the new "fifo" jobserver explicitly allows GNU Make to act as the coordinator for your entire process tree, regardless of whether or not any individual process in the process tree supports it, as long as a recursive descendant knows how to communicate via the fifo.

This is a benefit over the classic anonymous pipe for two reasons:

if ninja does NOT support the jobserver, gcc -flto=jobserver can still coordinate with the jobserver when run by ninja
if ninja DOES support the jobserver, it only needs to act as a client and ask for jobs, it doesn't need to act as a server for the jobs it acquired and pass them on to gcc -flto=jobserver.
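As a rough illustration of what "knows how to communicate via the fifo" means for a client, here is a minimal Python sketch. It is not taken from any of the PRs: `FifoTokenClient` and `demo` are hypothetical names, and the demo plays the role of make itself by creating a fifo and preloading two tokens. It assumes a POSIX system where opening a fifo with O_RDWR does not block (true on Linux, not guaranteed by POSIX).

```python
import os
import tempfile
from typing import List, Optional


class FifoTokenClient:
    """Hypothetical minimal client for a named-pipe (fifo) token pool."""

    def __init__(self, path: str) -> None:
        # Each open() of the fifo yields its own file description, so
        # setting O_NONBLOCK here does not affect other processes.
        self._fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)

    def try_acquire(self) -> Optional[bytes]:
        """Take one token if available; return None instead of blocking."""
        try:
            token = os.read(self._fd, 1)
        except BlockingIOError:
            return None
        return token or None

    def release(self, token: bytes) -> None:
        """Give a token back by writing the same byte that was read."""
        os.write(self._fd, token)

    def close(self) -> None:
        os.close(self._fd)


def demo() -> List[Optional[bytes]]:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "jobserver.fifo")
        os.mkfifo(path)
        server = os.open(path, os.O_RDWR)  # plays the role of make
        os.write(server, b"++")            # pool with two tokens
        client = FifoTokenClient(path)
        first = client.try_acquire()
        second = client.try_acquire()
        third = client.try_acquire()       # pool is empty now
        client.release(first)
        fourth = client.try_acquire()      # the returned token is available again
        client.close()
        os.close(server)
        return [first, second, third, fourth]


if __name__ == "__main__":
    print(demo())
```

Because any descendant can open the path by name, an intermediate process does not have to keep descriptors open or pass them on, which is the point made in the two items above.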

mathstuf · 2024-02-06T17:11:00Z

Ah, neat, thanks. The fifo mechanism sounds much better then.

avikivity · 2024-02-06T18:26:18Z

Let's meet again, same place, next year.

xim · 2024-02-07T23:09:30Z

Let's meet again, same place, next year.

Count me in

degasus · 2024-02-28T11:45:07Z

The PR is still being worked on. Not sure what you want us to "fix".

@jhasse After reading all of the comments here, I'm under the impression that pull request #1140 has been finished for more than half a decade, and is just rebased every year.

@stefanb2 Please correct me if I'm wrong, but it seems like you are waiting for a decision for either

ninja does not want to have any jobserver client support, and this issue and these three PRs shall be closed
ninja does want to have a jobserver support, but not the GNU Make jobserver protocol. So we shall look for alternatives
ninja does want to have GNU Make jobserver client support, but you don't like the implementation in Add GNU make jobserver client support #1140. So what shall be modified?
ninja does want to have GNU Make jobserver client support, and there is nothing wrong in Add GNU make jobserver client support #1140. So how many more years shall this wait?

digit-google · 2024-02-28T13:26:51Z

I do not know @jhasse's exact point of view on the topic, but I can see several issues with the PR:

There is no regression test suite for what is a major change to Ninja's behavior. While there are unit-tests that verify some parts of the implementation, a real regression test suite that can be run on CI would verify that the Ninja binary works as expected, either as a client, a server, or both at the same time. This requires writing new Python tests under misc/ that simulate a jobserver-enabled build with multiple scenarios.
The code is hard to understand and maintain. In particular the way Posix signals are used is scary and brittle. It will very likely lead to flaky and unexpected failures under heavy loads and non-conventional runtime environments (think containers or qemu user emulation). The Win32 part writes directly to the completion port of SubprocessSet. At a minimum, all signal-twiddling and completion-related code should be part of subprocess-posix.cc or subprocess-win32.cc, which would provide a sane API for the token pool implementation.

Ideally, Ninja would implement an asynchronous loop API that would allow waiting for several I/O conditions and timers concurrently in the main thread and acting upon them, and the SubprocessSet and TokenPool classes would both be users of it, but that's probably for another PR.
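To make the idea of a single wait point a bit more concrete, here is a tiny sketch using Python's standard `selectors` module. It is only an analogy for the design being proposed, not Ninja code: `wait_for_events` is a made-up helper, the two pipes stand in for subprocess output and the token pool, and the timeout plays the role of a timer. It assumes a POSIX system, where pipes can be registered with a selector.

```python
import os
import selectors
from typing import Dict, List


def wait_for_events(fds: Dict[str, int], timeout: float) -> List[str]:
    """Wait once for any of the named descriptors to become readable.

    Returns the names that are ready, or an empty list if the timeout
    expired first.
    """
    sel = selectors.DefaultSelector()
    try:
        for name, fd in fds.items():
            sel.register(fd, selectors.EVENT_READ, data=name)
        return [key.data for key, _ in sel.select(timeout)]
    finally:
        sel.close()


if __name__ == "__main__":
    sub_r, sub_w = os.pipe()   # stand-in for subprocess output
    tok_r, tok_w = os.pipe()   # stand-in for the jobserver token pipe
    os.write(tok_w, b"+")      # a token becomes available
    print(wait_for_events({"subprocess": sub_r, "tokens": tok_r}, 0.1))
    for fd in (sub_r, sub_w, tok_r, tok_w):
        os.close(fd)
```

With one call that reports which source is ready, neither consumer would need signals or direct access to the other's completion mechanism.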

Minor: I recommend reworking the commits in the PR to ensure that each one of them is final, correct, individually testable, and updates both configure.py and CMakeLists.txt at the same time.

eli-schwartz · 2024-02-28T13:49:40Z

@digit-google it would be productive to make review comments directly on the PR.

Preferably any time in the past 8 years, but no time like the present! :)

Note that regression testing can be somewhat accounted for by the fact that an extremely large number of people have been running the patchset in production for years now.

digit-google · 2024-02-28T14:01:37Z

I agree, but I was responding to @degasus who was asking in this thread what could be changed in the PR. However, I'll add similar comments there too.

It is great that the current patchset has been working well, and I encourage putting actual metrics, like number of users, build performance improvement times, in the actual PR description and final patchset.

However, unit and regression testing is about ensuring that future Ninja changes do not break its behavior unexpectedly. Given the complexity of the feature and the fact that it changes how Ninja interacts with its runtime environment, unit-testing is not enough. But that's just my humble opinion.

eli-schwartz · 2024-02-28T14:14:30Z

You can say the same thing about all the existing functionality ninja has.

My opinion is that it isn't fair or reasonable to ask this PR to be a special exception, but it would be fair iff someone wrote an end-to-end testing suite, then asked the jobserver PR to include jobserver coverage in it.

digit-google · 2024-02-28T14:53:17Z

Frankly, that PR would be fine to me, even without a full regression test suite, if it didn't spread tricky signal-handling code in what looks like unrelated parts of the source tree. This is a hackish design that is bound to be a maintenance nightmare for anyone that accepts that in their git repository. I assume that's why @jhasse, who has very very limited bandwidth to maintain Ninja, has not felt confident in accepting it.

And for full disclosure, I am not an official Ninja maintainer in any way, but I maintain my own Ninja fork for the Fuchsia project in order to support a number of important additional features.

While I do plan to implement jobserver support there too, this will not be based on this PR for exactly this reason.

stefanb2 · 2024-03-02T16:59:25Z

@stefanb2 Please correct me if I'm wrong, but it seems like you are waiting for a decision for either

The short answer: I'm not waiting for anything.

The long answer:

This contribution is a side product of the migration of the internal code base at my former workplace to Android N. Android N build system introduced the kati-ninja-combo, which had severe negative impacts on build performance. These were not acceptable for the company, so I looked into adding jobserver client support to ninja. This turned out to be rather simple and the build performance problem was solved. As the resulting changes were already paid for, I requested for permission to contribute them upstream.

IMHO there is nothing for me to do. Either

the project makes a decision about the contribution, or
my former employer requests me to withdraw it.

segevfiner · 2024-03-19T21:08:17Z

Kitware (CMake's authors) also maintain https://github.com/Kitware/ninja which is a fork/build with this PR, and the ninja you can install from PyPI https://pypi.org/project/ninja/, is actually this fork.

jcfr · 2024-03-21T14:19:12Z

Kitware (CMake's authors) also maintain https://github.com/Kitware/ninja

Ditto. We have been using our fork as both (1) a staging area for features in review and (2) the version built and distributed¹ on PyPI.

For context, the distribution of both cmake and ninja on PyPI was motivated to support the scikit-build² initiative.

Ninja has a PR for adding make jobserver support [1] that has been a widely debated PR for many... many years. Given that many people have forked to incorporate this PR, and it claims to solve a problem we have (OOM on gcc processes), it seems like it would be worthwhile using a well maintained fork instead of the main project.

This is not a one way door. If we find that the project goes unmaintained, doesn't build, or otherwise has problems, we can always go back to using mainline.

Of the forks that have pulled this in, there are:
- The Fuchsia project [2]: their targets seem more specific and less generic, although their improvements seem more extensive.
- Kitware [3]: maintains a fork of ninja
- Docker [4]

[1] ninja-build/ninja#1139
[2] https://fuchsia.googlesource.com/third_party/github.com/ninja-build/ninja/+/refs/heads/main/README.fuchsia
[3] https://github.com/Kitware/ninja
[4] https://github.com/dockbuild/ninja-jobserver

'''
EXTRA_OEMESON_COMPILE:append = " \
    --ninja-args='--tokenpool-master=fifo' \
"
PARALLEL_MAKE = "-j 20"
BB_NUMBER_THREADS = "20"
'''

Signed-off-by: Ed Tanous <ed@tanous.net>

mortie · 2024-06-03T18:18:59Z

What's the current status on this? I'm interested in it from the meta build system perspective, where many different projects written in different languages and using different build systems are all compiled in a coordinated manner. Without make jobserver client support in ninja, meta build systems are forced to make one of the following terrible trade-offs:

Build the projects sequentially, relying on the individual build systems' concurrency support. This is bad, since significant parts of from-scratch build times are single-threaded (especially the configure at the beginning and the linking at the end). AFAIK, this is what Buildroot does by default.
Build the projects concurrently, but limit each individual project to use one core. This works well if there are many small projects, but terrible if there are one or two projects which are significantly bigger than the others. You do not want to build Chromium with only one core.
Build the projects concurrently and let each project use many cores. This is probably the fastest if you have enough RAM, but if you have one of those 32-thread systems, it means you're running 32*32=1024 compiler processes at the same time in the worst case. This requires an immense amount of RAM. This is what Bitbake does by default.

If Ninja and other build systems supported the jobserver protocol, there would be another option:

Build the projects concurrently, and let each project use multiple cores, but run one central job server which limits the total concurrency across all projects.

To my knowledge, Ninja is the only real hold-out to make this a practical possibility. GNU Make and Rust's Cargo already support being jobserver clients.

eli-schwartz · 2024-06-03T18:55:36Z

The current status is that after @stefanb2's PR died the death of eternally pending review, @hundeboll reimplemented it two weeks ago in #2450 and it has been approved and scheduled for inclusion in ninja 1.13.0 (but the merge button hasn't been hit).

No jobserver master support, only client support, but this is probably not a worry for you.

It would be nice if the new PR had linked to the issue as well but it is what it is.

stefanb2 linked a pull request Apr 27, 2016 that will close this issue

Add GNU make jobserver client support #1140

Open

evmar mentioned this issue Mar 7, 2017

ninja jobs explode when nesting. #1253

Open

nashif mentioned this issue Mar 6, 2018

Revert "sanitycheck: Default to using Ninja" zephyrproject-rtos/zephyr#6479

Merged

stefanb2 linked a pull request Feb 22, 2023 that will close this issue

Add GNU make jobserver style "fifo" support #2263

Open

dothebart mentioned this issue May 27, 2023

Consideration Needed: Upgrade V8 and Transition from GYP to GN Build System arangodb/arangodb#19116

Open

dothebart mentioned this issue Jul 6, 2023

Dishonorable mentions Flet/rejected-github-profile-achievements#15

Open

haampie mentioned this issue Oct 27, 2023

Use /proc/loadavg on Linux #2218

Open

lf- mentioned this issue May 11, 2024

stdenv: single make jobserver across multiple nix builds NixOS/nixpkgs#143820

Closed

robertu94 mentioned this issue May 13, 2024

switch to kitware's ninja as the default upstream for ninja and configure it to respect jobserver in spack env depfile -o Makefile spack/spack#44166

Open

jhasse added this to the 1.13.0 milestone Jun 3, 2024

chillenb mentioned this issue Jul 3, 2024

Linking is really slow with Ninja CMake generator, but not Unix Makefiles pybind/pybind11#5223

Open
