Skip to content

Commit

Permalink
[ci] Upgrade to new runners and disable unsupported jobs. (#2818)
Browse files Browse the repository at this point in the history
Per the RFC and numerous conversations on Discord, this rebuilds the
torch-mlir CI and discontinues the infra and coupling to the binary
releases
(https://discourse.llvm.org/t/rfc-discontinuing-pytorch-1-binary-releases/76371).

I iterated on this to get latency back to about what it was with the old
(much larger and non-ephemeral) runners: About 4m - 4.5m for an
incremental change.

Behind the scenes changes:

* Uses a new runner pool operated by AMD. It is currently set to manual
scaling and has two runners (32-core, 64GiB RAM) while we get some
traction. We can either fiddle with some auto-scaling or use a schedule
to give it an increase during certain high traffic hours.
* Builds are now completely isolated and cannot have run-to-run
interference like we were getting before (i.e. lock file/permissions
stuff).
* The GHA runner is installed directly into a manylinux 2.28 container
with upgraded dev tools. This eliminates the need to do sub-invocations
of docker on Linux in order to run on the same OS that is used to build
wheels.
* While not using it now, this setup was cloned from another project
that posts the built artifacts to the job and fans out testing. Might be
useful here later.
* Uses a special git cache that lets us have ephemeral runners and still
check out the repo and deps (incl. llvm) in ~13s.
* Running in an Azure VM Scale Set.

In-repo changes:

* Disables (but does not yet delete):
  * Old buildAndTest.yml jobs
  * releaseSnapshotPackage.yml
* Adds a new `ci.yml` pipeline and scripts the steps in `build_tools/ci`
(by decomposing the existing `build_linux_packages.sh` for in-tree
builds and modularizing it a bit better).
* Test framework changes:
* Adds a `TORCH_MLIR_TEST_CONCURRENCY` env var that can be used to bound
the multiprocess concurrency. Ended up not using this in the final
version but is useful to have as a knob.
* Changes the default concurrency to `nproc * 0.8 + 1` vs `nproc * 1.1`.
We're running on systems with significantly less virtual memory and I
did a bit of fiddling to find a good tradeoff.
* Changed multiprocess mode to spawn instead of fork. Otherwise, I was
getting instability (as discussed on discord).
* Added MLIR configuration to disable multithreaded contexts globally
for the project. Constantly spawning `nproc * nproc` threads (more than
that actually) was OOM'ing.
* Added a test timeout of 5 minutes. If a multiprocess worker crashes,
the framework can get wedged indefinitely (and then will just be reaped
after multiple hours). We should fix this, but this at least keeps the
CI pool from wedging with stuck jobs.

Functional changes needing followup:

* No matter what I did, I couldn't get the LTC tests to work, and I'm
not 100% sure they were being run in the old setup as the scripts were a
bit twisty. I disabled them and left a comment.
* Dropped out-of-tree build variants. These were not providing much
signal and increase CI needs by 50%.
* Dropped MacOS and Windows builds. Now that we are "just a library" and
not building releases, there is less pressure to test these commit by
commit. Further, since we bump torch-mlir to known good commits on these
platforms, it has been a long time since either of these jobs have
provided much signal (and they take ~an hour+ to run). We can add them
back later post-submit if ever needed.
  • Loading branch information
stellaraccident committed Jan 28, 2024
1 parent 4a4d80a commit 77c14ab
Show file tree
Hide file tree
Showing 11 changed files with 319 additions and 13 deletions.
12 changes: 6 additions & 6 deletions .github/workflows/buildAndTest.yml
Original file line number Diff line number Diff line change
Expand Up @@ -2,11 +2,11 @@
name: Build and Test

Check warning on line 2 in .github/workflows/buildAndTest.yml

View workflow job for this annotation

GitHub Actions / Static Lint Checks

2:1 [document-start] missing document start "---"

on:

Check warning on line 4 in .github/workflows/buildAndTest.yml

View workflow job for this annotation

GitHub Actions / Static Lint Checks

4:1 [truthy] truthy value should be one of [false, true]
pull_request:
branches: [main]
push:
branches: [main]
workflow_dispatch:
# pull_request:

Check warning on line 5 in .github/workflows/buildAndTest.yml

View workflow job for this annotation

GitHub Actions / Static Lint Checks

5:3 [comments-indentation] comment not indented like content
# branches: [main]
# push:
# branches: [main]
# workflow_dispatch:

# Ensure that only a single job or workflow using the same
# concurrency group will run at a time. This would cancel
Expand All @@ -30,7 +30,7 @@ jobs:
strategy:
fail-fast: true
matrix:
os-arch: [ubuntu-x86_64, macos-arm64, windows-x86_64]
os-arch: [macos-arm64, windows-x86_64]
llvm-build: [in-tree, out-of-tree]
torch-binary: [ON]

Check warning on line 35 in .github/workflows/buildAndTest.yml

View workflow job for this annotation

GitHub Actions / Static Lint Checks

35:24 [truthy] truthy value should be one of [false, true]
torch-version: [nightly, stable]
Expand Down
77 changes: 77 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
name: CI

on:
workflow_dispatch:
workflow_call:
pull_request:
branches: [main]
push:
branches: [main]

concurrency:
# A PR number if a pull request and otherwise the commit hash. This cancels
# queued and in-progress runs for the same PR (presubmit) or commit
# (postsubmit).
group: ci-build-test-cpp-linux-${{ github.event.number || github.sha }}
cancel-in-progress: true

jobs:
build-test-linux:
strategy:
fail-fast: true
matrix:
torch-version: [nightly, stable]
name: Build and Test (Linux, torch-${{ matrix.torch-version }}, assertions)
runs-on: torch-mlir-cpubuilder-manylinux-x86-64
env:
CACHE_DIR: ${{ github.workspace }}/.container-cache
steps:
- name: Configure local git mirrors
run: |
# Our stock runners have access to certain local git caches. If these
# files are available, it will prime the cache and configure git to
# use them. Practically, this eliminates network/latency for cloning
# llvm.
if [[ -x /gitmirror/scripts/trigger_update_mirrors.sh ]]; then
/gitmirror/scripts/trigger_update_mirrors.sh
/gitmirror/scripts/git_config.sh
fi
- name: "Checking out repository"
uses: actions/checkout@8f4b7f84864484a7bf31766abe9204da3cbe65b3 # v3.5.0
with:
submodules: true

- name: Enable cache
uses: actions/cache/restore@v3
with:
path: ${{ env.CACHE_DIR }}
key: build-test-cpp-asserts-manylinux-v2-${{ github.sha }}
restore-keys: |
build-test-cpp-asserts-manylinux-v2-
- name: Install python deps (torch-${{ matrix.torch-version }})
run: |
export cache_dir="${{ env.CACHE_DIR }}"
bash build_tools/ci/install_python_deps.sh ${{ matrix.torch-version }}
- name: Build project
run: |
export cache_dir="${{ env.CACHE_DIR }}"
bash build_tools/ci/build_posix.sh
- name: Save cache
uses: actions/cache/save@v3
if: ${{ !cancelled() }}
with:
path: ${{ env.CACHE_DIR }}
key: build-test-cpp-asserts-manylinux-v2-${{ github.sha }}

- name: Test project (torch-${{ matrix.torch-version }})
run: |
export cache_dir="${{ env.CACHE_DIR }}"
bash build_tools/ci/test_posix.sh ${{ matrix.torch-version }}
- name: Check generated sources (torch-nightly only)
if: ${{ matrix.torch-version == 'nightly' }}
run: |
bash build_tools/ci/check_generated_sources.sh
5 changes: 2 additions & 3 deletions .github/workflows/releaseSnapshotPackage.yml
Original file line number Diff line number Diff line change
Expand Up @@ -2,9 +2,8 @@
name: Release snapshot package

on:
schedule:
- cron: '0 11 * * *'

# schedule:
# - cron: '0 11 * * *'
workflow_dispatch:

jobs:
Expand Down
60 changes: 60 additions & 0 deletions build_tools/ci/build_posix.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
#!/bin/bash

set -eu -o errtrace

this_dir="$(cd $(dirname $0) && pwd)"
repo_root="$(cd $this_dir/../.. && pwd)"
build_dir="$repo_root/build"
install_dir="$repo_root/install"
mkdir -p "$build_dir"
build_dir="$(cd $build_dir && pwd)"
cache_dir="${cache_dir:-}"

# Setup cache dir.
if [ -z "${cache_dir}" ]; then
cache_dir="${repo_root}/.build-cache"
mkdir -p "${cache_dir}"
cache_dir="$(cd ${cache_dir} && pwd)"
fi
echo "Caching to ${cache_dir}"
mkdir -p "${cache_dir}/ccache"
mkdir -p "${cache_dir}/pip"

python="$(which python)"
echo "Using python: $python"

export CMAKE_TOOLCHAIN_FILE="$this_dir/linux_default_toolchain.cmake"
export CC=clang
export CXX=clang++
export CCACHE_DIR="${cache_dir}/ccache"
export CCACHE_MAXSIZE="350M"
export CMAKE_C_COMPILER_LAUNCHER=ccache
export CMAKE_CXX_COMPILER_LAUNCHER=ccache

# Clear ccache stats.
ccache -z

cd $repo_root

echo "::group::CMake configure"
cmake -S "$repo_root/externals/llvm-project/llvm" -B "$build_dir" \
-GNinja \
-DCMAKE_BUILD_TYPE=Release \
-DPython3_EXECUTABLE="$(which python)" \
-DLLVM_ENABLE_ASSERTIONS=ON \
-DCMAKE_INSTALL_PREFIX="$install_dir" \
-DCMAKE_INSTALL_LIBDIR=lib \
-DLLVM_ENABLE_PROJECTS=mlir \
-DLLVM_EXTERNAL_PROJECTS="torch-mlir" \
-DLLVM_EXTERNAL_TORCH_MLIR_SOURCE_DIR="$repo_root" \
-DLLVM_TARGETS_TO_BUILD=host \
-DMLIR_ENABLE_BINDINGS_PYTHON=ON \
-DTORCH_MLIR_ENABLE_LTC=ON
echo "::endgroup::"

echo "::group::Build"
cmake --build "$build_dir" --target tools/torch-mlir/all -- -k 0
echo "::endgroup::"

# Show ccache stats.
ccache --show-stats
45 changes: 45 additions & 0 deletions build_tools/ci/check_generated_sources.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
#!/bin/bash

set -eu -o errtrace

this_dir="$(cd $(dirname $0) && pwd)"
repo_root="$(cd $this_dir/../.. && pwd)"

function _check_file_not_changed_by() {
# _check_file_not_changed_by <cmd> <file>
cmd="$1"
file="$2"
file_backup="$PWD/$(basename $file)"
file_new="$PWD/$(basename $file).new"
# Save the original file.
cp "$file" "$file_backup"
# Run the command to regenerate it.
"$1" || return 1
# Save the new generated file.
cp "$file" "$file_new"
# Restore the original file. We want this function to not change the user's
# working tree state.
mv "$file_backup" "$file"
# We use git-diff as "just a diff program" (no SCM stuff) because it has
# nicer output than regular `diff`.
if ! git diff --no-index --quiet "$file" "$file_new"; then
echo "#######################################################"
echo "Generated file '${file}' is not up to date (see diff below)"
echo ">>> Please run '${cmd}' to update it <<<"
echo "#######################################################"
git diff --no-index --color=always "$file" "$file_new"
# TODO: Is there a better cleanup strategy that doesn't require duplicating
# this inside and outside the `if`?
rm "$file_new"
return 1
fi
rm "$file_new"
}

echo "::group:: Check that update_abstract_interp_lib.sh has been run"
_check_file_not_changed_by $repo_root/build_tools/update_abstract_interp_lib.sh $repo_root/lib/Dialect/Torch/Transforms/AbstractInterpLibrary.cpp
echo "::endgroup::"

echo "::group:: Check that update_torch_ods.sh has been run"
_check_file_not_changed_by $repo_root/build_tools/update_torch_ods.sh $repo_root/include/torch-mlir/Dialect/Torch/IR/GeneratedTorchOps.td
echo "::endgroup::"
34 changes: 34 additions & 0 deletions build_tools/ci/install_python_deps.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
#!/bin/bash

set -eu -o errtrace

this_dir="$(cd $(dirname $0) && pwd)"
repo_root="$(cd $this_dir/../.. && pwd)"
torch_version="${1:-unknown}"

echo "::group::installing llvm python deps"
python -m pip install --no-cache-dir -r $repo_root/externals/llvm-project/mlir/python/requirements.txt
echo "::endgroup::"

case $torch_version in
nightly)
echo "::group::installing nightly torch"
python3 -m pip install --no-cache-dir -r $repo_root/requirements.txt
python3 -m pip install --no-cache-dir -r $repo_root/torchvision-requirements.txt
echo "::endgroup::"
;;
stable)
echo "::group::installing stable torch"
python3 -m pip install --no-cache-dir torch torchvision --index-url https://download.pytorch.org/whl/cpu
python3 -m pip install --no-cache-dir -r $repo_root/build-requirements.txt
echo "::endgroup::"
;;
*)
echo "Unrecognized torch version '$torch_version' (specify 'nightly' or 'stable' with cl arg)"
exit 1
;;
esac

echo "::group::installing test requirements"
python -m pip install --no-cache-dir -r $repo_root/test-requirements.txt
echo "::endgroup::"
14 changes: 14 additions & 0 deletions build_tools/ci/linux_default_toolchain.cmake
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
message(STATUS "Enabling thin archives (static libraries will not be relocatable)")
set(CMAKE_C_ARCHIVE_APPEND "<CMAKE_AR> qT <TARGET> <LINK_FLAGS> <OBJECTS>")
set(CMAKE_CXX_ARCHIVE_APPEND "<CMAKE_AR> qT <TARGET> <LINK_FLAGS> <OBJECTS>")
set(CMAKE_C_ARCHIVE_CREATE "<CMAKE_AR> crT <TARGET> <LINK_FLAGS> <OBJECTS>")
set(CMAKE_CXX_ARCHIVE_CREATE "<CMAKE_AR> crT <TARGET> <LINK_FLAGS> <OBJECTS>")

set(CMAKE_EXE_LINKER_FLAGS_INIT "-fuse-ld=lld -Wl,--gdb-index")
set(CMAKE_MODULE_LINKER_FLAGS_INIT "-fuse-ld=lld -Wl,--gdb-index")
set(CMAKE_SHARED_LINKER_FLAGS_INIT "-fuse-ld=lld -Wl,--gdb-index")

set(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} -gsplit-dwarf -ggnu-pubnames")
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -gsplit-dwarf -ggnu-pubnames")
set(CMAKE_C_FLAGS_RELWITHDEBINFO "${CMAKE_C_FLAGS_RELWITHDEBINFO} -gsplit-dwarf -ggnu-pubnames")
set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO} -gsplit-dwarf -ggnu-pubnames")
44 changes: 44 additions & 0 deletions build_tools/ci/test_posix.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
#!/bin/bash

set -eu -o errtrace

this_dir="$(cd $(dirname $0) && pwd)"
repo_root="$(cd $this_dir/../.. && pwd)"
torch_version="${1:-unknown}"

export PYTHONPATH="$repo_root/build/tools/torch-mlir/python_packages/torch_mlir:$repo_root/projects/pt1"

echo "::group::Run Linalg e2e integration tests"
python -m e2e_testing.main --config=linalg -v
echo "::endgroup::"

echo "::group::Run make_fx + TOSA e2e integration tests"
python -m e2e_testing.main --config=make_fx_tosa -v
echo "::endgroup::"

echo "::group::Run TOSA e2e integration tests"
python -m e2e_testing.main --config=tosa -v
echo "::endgroup::"

case $torch_version in
nightly)
# Failing with: NotImplementedError:
# Could not run 'aten::empty.memory_format' with arguments from the 'Lazy' backend.
# As of 2024-01-07
# echo "::group::Run Lazy Tensor Core e2e integration tests"
# python -m e2e_testing.main --config=lazy_tensor_core -v
# echo "::endgroup::"

# TODO: There is one failing test in this group on stable. It could
# be xfailed vs excluding entirely.
echo "::group::Run TorchDynamo e2e integration tests"
python -m e2e_testing.main --config=torchdynamo -v
echo "::endgroup::"
;;
stable)
;;
*)
echo "Unrecognized torch version '$torch_version' (specify 'nightly' or 'stable' with cl arg)"
exit 1
;;
esac
24 changes: 20 additions & 4 deletions projects/pt1/python/torch_mlir_e2e_test/framework.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,11 +24,19 @@
from typing import Any, Callable, List, NamedTuple, Optional, TypeVar, Union, Dict
from itertools import repeat

import os
import sys
import traceback

import torch
import multiprocess as mp
from multiprocess import set_start_method
try:
set_start_method("spawn")
except RuntimeError:
# Children can error here so we suppress.
pass

import torch

TorchScriptValue = Union[int, float, List['TorchScriptValue'],
Dict['TorchScriptValue',
Expand Down Expand Up @@ -316,8 +324,16 @@ def compile_and_run_test(test: Test, config: TestConfig, verbose=False) -> Any:


def run_tests(tests: List[Test], config: TestConfig, sequential=False, verbose=False) -> List[TestResult]:
"""Invoke the given `Test`'s with the provided `TestConfig`."""
num_processes = min(int(mp.cpu_count() * 1.1), len(tests))
"""Invoke the given `Test`'s with the provided `TestConfig`."""
num_processes = min(int(mp.cpu_count() * 0.8) + 1, len(tests))
try:
env_concurrency = int(os.getenv("TORCH_MLIR_TEST_CONCURRENCY", "0"))
except ValueError as e:
raise ValueError("Bad value for TORCH_MLIR_TEST_CONCURRENCY env var: "
"Expected integer.") from e
if env_concurrency > 0:
num_processes = min(num_processes, env_concurrency)

# TODO: We've noticed that on certain 2 core machine parallelizing the tests
# makes the llvm backend legacy pass manager 20x slower than using a
# single process. Need to investigate the root cause eventually. This is a
Expand All @@ -344,7 +360,7 @@ def run_tests(tests: List[Test], config: TestConfig, sequential=False, verbose=F
pool = mp.Pool(num_processes)
arg_list = zip(tests, repeat(config))
handles = pool.starmap_async(compile_and_run_test, arg_list)
results = handles.get()
results = handles.get(timeout=360)

tests_with_results = {result.unique_name for result in results}
all_tests = {test.unique_name for test in tests}
Expand Down
8 changes: 8 additions & 0 deletions python/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,13 @@ declare_mlir_python_sources(TorchMLIRPythonSources.Tools
tools/import_onnx/__main__.py
)

declare_mlir_python_sources(TorchMLIRSiteInitialize
ROOT_DIR "${TORCH_MLIR_PYTHON_ROOT_DIR}"
ADD_TO_PARENT TorchMLIRPythonSources
SOURCES
_mlir_libs/_site_initialize_0.py
)

################################################################################
# Extensions
################################################################################
Expand Down Expand Up @@ -79,6 +86,7 @@ set(_source_components
MLIRPythonExtension.RegisterEverything
TorchMLIRPythonSources
TorchMLIRPythonExtensions
TorchMLIRSiteInitialize

# Sources related to optional Torch extension dependent features. Typically
# empty unless if project features are enabled.
Expand Down
9 changes: 9 additions & 0 deletions python/torch_mlir/_mlir_libs/_site_initialize_0.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
# See https://llvm.org/LICENSE.txt for license information.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
# Also available under a BSD-style license. See LICENSE.

# Multi-threading rarely helps the frontend and we are also running in contexts
# where we want to run a lot of test parallelism (and nproc*nproc threads
# puts a large load on the system and virtual memory).
disable_multithreading = True

0 comments on commit 77c14ab

Please sign in to comment.