Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CMake][LIT] Add option to run lit testsuites in parallel #82899

Merged
merged 1 commit into from
Feb 28, 2024

Conversation

patrickdoc
Copy link
Contributor

Currently add_lit_target sets the USES_TERMINAL CMake option. When using Ninja, this forces all lit testsuite targets into the single-threaded console pool.

This PR adds a new option LLVM_PARALLEL_LIT which drops the USES_TERMINAL flag, allowing Ninja to run them in parallel.

The default setting (LLVM_PARALLEL_LIT=OFF) retains the existing behavior of serial testsuite execution.

Copy link

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write
permissions for the repository. In which case you can instead tag reviewers by
name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review
by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate
is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot llvmbot added the cmake Build system in general and CMake in particular label Feb 24, 2024
@JoeLoser JoeLoser self-requested a review February 24, 2024 23:45
Copy link
Member

@JoeLoser JoeLoser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pushing this through upstream! Let me know if you need me to commit this on your behalf.

@patrickdoc
Copy link
Contributor Author

@JoeLoser yes please merge, I do not have write access to this repo.

@JoeLoser JoeLoser merged commit 782147e into llvm:main Feb 28, 2024
6 of 7 checks passed
Copy link

@patrickdoc Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested
by our build bots. If there is a problem with a build, you may recieve a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as
the builds can include changes from many authors. It is not uncommon for your
change to be included in a build that fails due to someone else's changes, or
infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself.
This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

@patrickdoc patrickdoc deleted the lit-parallelism branch March 3, 2024 16:19
@joker-eph
Copy link
Collaborator

joker-eph commented Mar 8, 2024

Can you clarify why this is useful/desirable?

(it's good practice in general for the PR description to describe the context and motivation for the change, that is the "why" instead of "what" is done)

Same for the documentation change: it does not talk about when is it useful/desirable for a user to do.

@patrickdoc
Copy link
Contributor Author

Running the test targets in parallel can result in faster build times depending on your build graph + available resources on your machine. For example, individual targets can be started earlier or you can get increased cpu usage if threads are paused on I/O.

Here's a simple example of a handful of test targets being:

ninja -C build check-llvm-analysis-aliasset check-llvm-analysis-assumptioncache check-llvm-analysis-basicaa check-llvm-analysis-blockfrequencyinfo check-llvm-analysis-branchprobabilityinfo check-llvm-analysis-callgraph check-llvm-analysis-costmodel

Ninja trace for parallelism disabled (pre-existing behavior and current default):
Screenshot 2024-03-08 at 10 36 45 AM

Ninja trace for parallelism enabled (by enabling the new flag):
Screenshot 2024-03-08 at 10 47 46 AM

I would summarize it as: if you have multiple lit targets and idle cpu during your builds, try enabling this.

It greatly depends on your setup, but in my case it cut build times in half.

@joker-eph
Copy link
Collaborator

joker-eph commented Mar 8, 2024

Oh you're running very tiny lit target!

The problem of this is that most of the lit suite are massively parallel themselves.

The way to achieve good parallelism with lit isn't to call the targets the way you're doing here, instead you can do:

ninja && bin/llvm-lit ../llvm/test/Analysis/{AliasSet,AssumptionCache,BasicAA,BlockFrequencyInfo,BranchProbabilityInfo,CallGraph,CostModel}

Or:

LIT_OPTS=" --filter='Analysis/AliasSet|Analysis/AssumptionCache|Analysis/BasicAA|Analysis/BlockFrequencyInfo|Analysis/BranchProbabilityInfo|Analysis/CallGraph|Analysis/CostModel' " \
ninja check

I believe these are giving you "perfect" parallelism without over-subscription problem associated with the cmake option you introduced.

Your screenshots are actually misleading in that they show multiple threads apparently "balanced", which looks all nice, but it's actually a "lie": this is just the ninja view of the world. Each of the lit invocation will spawn as many threads as cores on the machine (the over-subscription I'm referring to).

@Mogball
Copy link
Contributor

Mogball commented Mar 9, 2024

@patrickdoc @JoeLoser Can we try what Mehdi is suggesting instead of what is enabled by this PR? If that works, I don't see a need for this PR and we can revert it, since there is no clear use case.

@patrickdoc
Copy link
Contributor Author

Hey @joker-eph, the pictures in my example weren't my actual builds, just a demonstration of the intended change. I'm aware that lit also tries to parallelize on top of ninja's parallelism so have reduced lit's parallelism with -j X to compensate.

My use-case is CI builds, where we are running the whole build + all the tests, with the goal of being as fast as possible.

Using a single mega lit target has some downsides:

  1. It artificially splits the build into first a compile phase, then a test phase. This means that tests which could be running against early compilation targets get delayed until the end of the build
  2. It doesn't allow flexibility for different sizes of tests. i.e. N small unit tests would have poor utilization of available cpu/mem/etc. while N larger end-to-end integration tests could run into the oversubscribing issue you've described
  3. It doesn't play well with other testing targets in the build. As you mentioned, lit assumes it owns the machine and so runs one test per core. But Ninja doesn't know that so can schedule other targets alongside lit.

The wins I am seeing come from moving parallelism logic out of lit and into Ninja. So we end up with a much smoother build graph because Ninja knows more about the build than lit does.

But to do that, we need to allow Ninja to flexibly schedule the lit targets. The upside of the USES_TERMINAL setting is that you can get live progress bar output, but the downside is that Ninja can only run one target as a time.

Truthfully even just removing this flag altogether wouldn't affect the "single lit target" use-case (aside from losing the progress bar) because it is already delayed until after every other target anyway. But I've added it as an option here to allow folks to opt-in.


For clarification in the docs, would something like this be better:

**LLVM_PARALLEL_LIT**:BOOL
  Defaults to ``OFF``. If using more than one lit CMake target and the Ninja
  backend, setting this to ``ON`` allows Ninja to run the targets in parallel. This
  could lead to too many tests running at once, so it is recommended to also
  reduce lit's internal parallelism with something like`-DLLVM_LIT_ARGS="-j 1"`.

@joker-eph
Copy link
Collaborator

My use-case is CI builds, where we are running the whole build + all the tests, with the goal of being as fast as possible.

This is weird, your motivation stated earlier was "I want to run a very small set of the tests", I'm having a hard time following now what is the motivating example really!

It artificially splits the build into first a compile phase, then a test phase. This means that tests which could be running against early compilation targets get delayed until the end of the build

This is true in theory, however in practice in LLVM the check targets depends on all the binaries needed for all the tests (unless you have many lit test suites in the first place, but I'd like to see an example). Another problem is that running the tests in parallel concurrently with the build is again oversubscribing the machine.

It doesn't allow flexibility for different sizes of tests. i.e. N small unit tests would have poor utilization of available cpu/mem/etc. while N larger end-to-end integration tests could run into the oversubscribing issue you've described

I don't understand your point about small/large? As long as each test is single threader, it does not matter if they are small or large.

The wins I am seeing come from moving parallelism logic out of lit and into Ninja. So we end up with a much smoother build graph because Ninja knows more about the build than lit does.

If you want to do this in a principle way, then we should

  1. Discover the lit tests targets using lit
  2. Feed these individual tests to ninja for scheduling and let ninja do the scheduling.

If you care about solving this properly, I'd be happy to see something in this direction, however this isn't the granularity you operate at right now.

I rather see this reverted and a more principled approach discussed first.

Mogball added a commit that referenced this pull request Mar 11, 2024
Mogball added a commit that referenced this pull request Mar 11, 2024
…84813)

Reverts #82899

Per the discussion on the PR, this needs more design and justification.
@JoeLoser
Copy link
Member

JoeLoser commented Mar 11, 2024

@joker-eph sorry about approving/merging this early before engaging more with the community here on the feedback and approach. As I talked with Jeff this morning, he reverted this in 8846b91 and we'll re-evaluate how we want to proceed. Sorry again, and we appreciate your feedback!

@joker-eph
Copy link
Collaborator

No need to apologize, I assumed all good intents here!
I'd love to find a good solution here though!

@joker-eph
Copy link
Collaborator

One more subtlety: when lit runs the tests, it'll run them all and reports the failures.
When we split the lit invocation across ninja targets, ninja stops at the first failure by default...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cmake Build system in general and CMake in particular
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants