
[ci] Upgrade to new runners and disable unsupported jobs. #2818

Merged

merged 21 commits into main from new_ci Jan 28, 2024
Conversation

stellaraccident
Collaborator

@stellaraccident stellaraccident commented Jan 27, 2024

Per the RFC and numerous conversations on Discord, this rebuilds the torch-mlir CI and discontinues the infra and coupling to the binary releases (https://discourse.llvm.org/t/rfc-discontinuing-pytorch-1-binary-releases/76371).

I iterated on this to get latency back to about what it was with the old (much larger and non-ephemeral) runners: roughly 4 to 4.5 minutes for an incremental change.

Behind the scenes changes:

  • Uses a new runner pool operated by AMD. It is currently set to manual scaling and has two runners (32-core, 64GiB RAM) while we get some traction. We can either fiddle with some auto-scaling or use a schedule to give it an increase during certain high traffic hours.
  • Builds are now completely isolated and cannot have run-to-run interference like we were getting before (e.g. stale lock files and permission issues).
  • The GHA runner is installed directly into a manylinux 2.28 container with upgraded dev tools. This eliminates the need to do sub-invocations of docker on Linux in order to run on the same OS that is used to build wheels.
  • While we are not using it yet, this setup was cloned from another project that posts the built artifacts to the job and fans out testing. That might be useful here later.
  • Uses a special git cache that lets us have ephemeral runners and still check out the repo and deps (incl. llvm) in ~13s.
  • Running in an Azure VM Scale Set.

In-repo changes:

  • Disables (but does not yet delete):
    • Old buildAndTest.yml jobs
    • releaseSnapshotPackage.yml
  • Adds a new `ci.yml` pipeline and scripts the steps in `build_tools/ci` (by decomposing the existing `build_linux_packages.sh` for in-tree builds and modularizing it a bit better).
  • Test framework changes:
    • Adds a `TORCH_MLIR_TEST_CONCURRENCY` env var that can be used to bound the multiprocess concurrency. I ended up not using this in the final version, but it is useful to have as a knob.
    • Changes the default concurrency to `nproc * 0.8 + 1` vs `nproc * 1.1`. We're running on systems with significantly less virtual memory, and I did a bit of fiddling to find a good tradeoff.
    • Changed multiprocess mode to spawn instead of fork; otherwise I was getting instability (as discussed on Discord).
    • Added MLIR configuration to disable multithreaded contexts globally for the project. Constantly spawning `nproc * nproc` threads (more than that, actually) was OOM'ing.
    • Added a test timeout of 5 minutes. If a multiprocess worker crashes, the framework can get wedged indefinitely (and then is only reaped after multiple hours). We should fix this, but the timeout at least keeps the CI pool from wedging on stuck jobs.
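
The concurrency, spawn, and timeout changes above can be sketched roughly as follows. This is a minimal illustration using the standard library, not the actual torch-mlir test harness; the function names `default_concurrency`, `test_concurrency`, and `run_tests` are hypothetical, and only `TORCH_MLIR_TEST_CONCURRENCY` comes from the PR itself.

```python
import multiprocessing
import os

def default_concurrency() -> int:
    # Default worker count: nproc * 0.8 + 1 (down from nproc * 1.1),
    # to stay within the smaller memory budget of the ephemeral runners.
    return int(os.cpu_count() * 0.8) + 1

def test_concurrency() -> int:
    # TORCH_MLIR_TEST_CONCURRENCY, if set, bounds the multiprocess
    # concurrency (it never raises it above the default).
    env = os.environ.get("TORCH_MLIR_TEST_CONCURRENCY")
    if env is not None:
        return min(int(env), default_concurrency())
    return default_concurrency()

def run_tests(tests):
    # Use "spawn" instead of "fork": forking a threaded parent process
    # was a source of instability.
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=test_concurrency()) as pool:
        async_results = [pool.apply_async(t) for t in tests]
        # 5-minute timeout per test result so a crashed worker cannot
        # wedge the whole run indefinitely.
        return [r.get(timeout=300) for r in async_results]
```

The `min()` in `test_concurrency` reflects one reading of "bound the concurrency"; an override semantics would work too.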

Functional changes needing followup:

  • No matter what I did, I couldn't get the LTC tests to work, and I'm not 100% sure they were being run in the old setup as the scripts were a bit twisty. I disabled them and left a comment.
  • Dropped the out-of-tree build variants. These were not providing much signal and increased CI needs by 50%.
  • Dropped macOS and Windows builds. Now that we are "just a library" and not building releases, there is less pressure to test these commit by commit. Further, since we bump torch-mlir to known-good commits on these platforms, it has been a long time since either of these jobs provided much signal (and they take over an hour to run). We can add them back later post-submit if ever needed.

@stellaraccident stellaraccident marked this pull request as ready for review January 28, 2024 02:35
@stellaraccident stellaraccident merged commit 77c14ab into main Jan 28, 2024
3 checks passed
@stellaraccident stellaraccident deleted the new_ci branch January 28, 2024 02:35
zjgarvey pushed a commit to zjgarvey/torch-mlir that referenced this pull request Jan 29, 2024