ci: containerize CI pipeline with pre-built Docker image #2529

Open
BrendanWalsh wants to merge 1 commit into master from brwals/containerize-ci
Conversation

@BrendanWalsh (Collaborator) commented Mar 27, 2026

Summary

Containerizes the entire SynapseML CI pipeline using a pre-built Docker image, reducing wall-clock time from ~66 minutes to ~27 minutes (59% reduction) and total compute from ~1228 to ~817 agent-minutes (33.5% reduction).

Changes

New files:

  • tools/docker/ci/Dockerfile: Ubuntu 22.04 image with JDK 8, pinned Miniconda, SBT, Spark 3.5, and pre-cached test datasets (3.6GB compressed). Published to mmlsparkmcr.azurecr.io/synapseml/ci. Uses HTTPS + SHA256 verification for libssl1.1, tightened permissions (755 instead of 777).
  • templates/free_disk.yml: Disk cleanup helper template for container jobs.
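The contents of free_disk.yml are not shown in this diff. A minimal sketch of the kind of host cleanup such a template typically scripts — every path and command below is an assumption about the hosted-agent image, not taken from the PR:

```shell
#!/bin/sh
# Hypothetical disk-cleanup step for an Azure DevOps hosted agent, in the
# spirit of free_disk.yml. Every command is best-effort: paths that do not
# exist on a given agent are simply skipped.
reclaim_disk() {
    # Large preinstalled toolchains a containerized job never touches
    # (these paths are an assumption about the hosted-agent image).
    for d in /usr/share/dotnet /usr/local/lib/android /opt/ghc; do
        sudo -n rm -rf "$d" 2>/dev/null || true
    done
    # Drop dangling Docker layers left by earlier jobs, if Docker is present.
    if command -v docker >/dev/null 2>&1; then
        docker system prune -af >/dev/null 2>&1 || true
    fi
    # Report what is left on the root volume.
    df -h / | tail -n 1
}

reclaim_disk
```

Freeing host disk before the job pulls a multi-gigabyte CI image matters because the pre-cached datasets alone are 3.6GB compressed.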

Modified files:

  • pipeline.yaml (bulk of changes):
    • Add container resource definition with ACR endpoint and Workload Identity Federation auth
    • Add BuildCIImage job with content-hash-based caching (~30s on cache hit)
    • Add container: ci to all test/build/publish jobs (except FabricE2E — see note below)
    • Add scoped compilation via PROJECT variable in UnitTests matrix (each job compiles only its module)
    • Replace conda.yml / update_cli.yml templates with container-native equivalents
    • Add free_disk.yml template to jobs
    • Tune SBT_OPTS (-Xmx4G, ForkJoinPool thread config)
    • Add succeeded() gating to DatabricksE2E condition
    • Increase sbt setup timeout from 5m to 30m
  • build.sbt: ForkJoinPool deadlock fix for parallel test execution in container
  • TestBase.scala: Add 10-minute per-test hard timeout for all test suites (overridable in subclasses)
  • CodegenPlugin.scala: Remove redundant root publishLocal in installPipPackage
  • .gitignore: Add Dockerfile-related entries
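The content-hash caching that BuildCIImage relies on can be sketched as follows. The exact script in pipeline.yaml is not shown here, so the flow is an assumed reconstruction; only the registry path comes from the PR description.

```shell
#!/bin/sh
# Assumed sketch of content-hash image caching: the tag is a digest of the
# Dockerfile, so rebuilding with an unchanged Dockerfile finds the already
# published tag and skips the expensive docker build (the ~30s cache-hit path).
set -eu

REGISTRY="mmlsparkmcr.azurecr.io/synapseml/ci"   # from the PR description

# Stand-in Dockerfile so this sketch is self-contained.
workdir=$(mktemp -d)
printf 'FROM ubuntu:22.04\n' > "$workdir/Dockerfile"

# Tag = first 12 hex chars of the SHA-256 of the build-context inputs.
tag=$(sha256sum "$workdir/Dockerfile" | cut -c1-12)
echo "candidate image: ${REGISTRY}:${tag}"

# 'docker manifest inspect' succeeds iff the tag already exists remotely.
if docker manifest inspect "${REGISTRY}:${tag}" >/dev/null 2>&1; then
    echo "cache hit: reusing existing image"
else
    echo "cache miss: docker build -t ${REGISTRY}:${tag} ${workdir} would run here"
fi
```

In a real pipeline the hash would cover every file that influences the image (entrypoint scripts, pinned environment files), not just the Dockerfile, so any input change forces a rebuild.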

FabricE2E exclusion: The FabricE2E job intentionally runs on the host VM (not containerized) because it requires the Fabric SDK and sempy packages that are not in the CI container image.

Performance (measured on ADO Pipeline 17563)

| Metric | Master (8-build avg) | Container (2 green builds) | Improvement |
| --- | --- | --- | --- |
| Wall clock | 66.3m | 26.5-27.5m | 59% faster |
| Total compute | ~1228 agent-min | ~817 agent-min | 33.5% savings |
| UnitTests avg | 18.0m/job | 13.1m/job | -4.9m/job |
| PythonTests avg | 28.4m/job | 19.6m/job | -8.8m/job |

Dependencies

This PR is independent but works best with these companion PRs:

All were split from the original #2506 for independent review.

Testing

  • 64/66 jobs pass on container branch (only Databricks GPU times out due to pre-existing notebook flakiness, not container-related)
  • All unit tests, Python tests, R tests, Fabric E2E, and style checks pass
  • Container image build is idempotent with content-hash caching

This is the core containerization PR, split from #2506.

Copilot AI review requested due to automatic review settings March 27, 2026 05:28
@github-actions

Hey @BrendanWalsh 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.

We use semantic commit messages to streamline the release process.
Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix.
This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

  • fix: Fix LightGBM crashes with empty partitions
  • feat: Make HTTP on Spark back-offs configurable
  • docs: Update Spark Serving usage
  • build: Add codecov support
  • perf: improve LightGBM memory usage
  • refactor: make python code generation rely on classes
  • style: Remove nulls from CNTKModel
  • test: Add test coverage for CNTKModel

To test your commit locally, please follow our guide on building from source.
Check out the developer guide for additional guidance on testing your change.

@github-actions

github-actions bot commented Mar 27, 2026

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️ No snapshots were found for the head SHA aec4efd.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files

None

Copilot AI (Contributor) left a comment

Pull request overview

Containerizes most Azure DevOps CI jobs to run inside a pre-built Docker image, aiming to reduce end-to-end CI time by pre-baking toolchains, caches, and datasets.

Changes:

  • Add a new CI Docker image definition and a pipeline job to build/retag it using a content-hash tag.
  • Move major CI jobs (style/tests/publish) to run with container: ci, add disk cleanup, and scope compilation for unit tests.
  • Add dataset cache support in build.sbt and introduce a global ScalaTest per-test timeout in TestBase.

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| tools/docker/ci/Dockerfile | Defines the CI container image (JDK 8, conda env, Spark, SBT warmup, datasets). |
| templates/free_disk.yml | Adds a reusable host disk cleanup step for container jobs. |
| pipeline.yaml | Introduces the ci container resource and BuildCIImage job, and migrates many jobs to containers/scoped compilation. |
| build.sbt | Uses DATASET_CACHE to avoid re-downloading test datasets in CI. |
| core/src/test/.../TestBase.scala | Wraps all tests with a global failAfter timeout. |
| project/CodegenPlugin.scala | Removes redundant LocalRootProject publishLocal from installPipPackage. |
| .gitignore | Ignores pipeline.yaml.bak. |

pipeline.yaml Outdated
(timeout 5m sbt setup) || (echo "retrying" && timeout 5m sbt setup) || (echo "retrying" && timeout 5m sbt setup)
sbt codegen
sbt publishM2
timeout 5m sbt setup codegen publishM2

Copilot AI Mar 27, 2026

This step wraps setup, codegen, and publishM2 under a single timeout 5m. setup compiles all projects (including tests) and downloads datasets, so 5 minutes is likely too short and may introduce flaky timeouts; consider either increasing the timeout substantially or only applying the 5m timeout to sbt setup (leaving codegen/publishM2 unbounded or separately timed).

Suggested change
timeout 5m sbt setup codegen publishM2
timeout 30m sbt setup
sbt codegen publishM2

Collaborator Author

Fixed. Increased to timeout 30m and consolidated setup+codegen+publishM2 into a single sbt invocation to avoid repeated JVM startup.
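The consolidated step can be expressed as a generic retry-with-timeout wrapper. The `timeout 30m` bound and the `sbt setup codegen publishM2` invocation mirror the fix described above; the wrapper function itself is illustrative, not taken from pipeline.yaml:

```shell
#!/bin/sh
# Illustrative retry wrapper around a time-limited command. Mirrors the
# pipeline's pattern of bounding 'sbt setup codegen publishM2' with
# 'timeout 30m'; the retry loop itself is an assumption.
retry_with_timeout() {
    limit=$1; attempts=$2; shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if timeout "$limit" "$@"; then
            return 0
        fi
        echo "attempt $n of $attempts failed; retrying" >&2
        n=$((n + 1))
    done
    return 1
}

# In the pipeline this pattern would bound the consolidated step:
#   retry_with_timeout 30m 3 sbt setup codegen publishM2
# Harmless demonstration with a command that always fails fast:
retry_with_timeout 1s 2 false || echo "gave up after 2 attempts"
```

Consolidating the three sbt tasks into one invocation pays JVM and sbt startup cost once instead of three times, which is why a single larger timeout beats three small ones.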

Comment on lines 234 to 238
cancelTimeoutInMinutes: 0
pool:
  vmImage: $(UBUNTU_VERSION)
container: ci
strategy:

Copilot AI Mar 27, 2026

DatabricksE2E’s job-level condition currently overrides the default succeeded() gating (it only checks the parameter), so this job will still run even if earlier jobs fail. If the intent is consistent with the other updated jobs, include succeeded() (and optionally variables.runTests) in the condition to avoid spending time/resources on E2E when the build is already failing.

Collaborator Author

Fixed. Added succeeded() to the DatabricksE2E condition.

Comment on lines +152 to +160
// Global per-test timeout (10 minutes). Override in subclass if needed.
val testTimeoutInSeconds: Int = 10 * 60

override def test(testName: String, testTags: Tag*)(testFun: => Any)(implicit pos: Position): Unit = {
  super.test(testName, testTags: _*) {
    failAfter(Span(testTimeoutInSeconds, Seconds)) {
      testFun
    }
  }

Copilot AI Mar 27, 2026

The PR description mentions a 30-minute hard timeout for test suites, but the new global per-test timeout here is set to 10 minutes. If 30 minutes is the intended limit, update testTimeoutInSeconds accordingly (or make it configurable via an env var/system property); otherwise the PR description should be updated to avoid confusion.

Collaborator Author

PR description updated. The 10-minute per-test timeout is correct — the "30 minutes" in the description referred to the sbt setup timeout, not the per-test limit. Subclasses can override testTimeoutInSeconds for longer-running tests (e.g., DatabricksTestHelper).

Comment on lines +30 to +32
&& wget -q http://archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2.24_amd64.deb -O /tmp/libssl1.1.deb \
&& dpkg -i /tmp/libssl1.1.deb \
&& rm /tmp/libssl1.1.deb

Copilot AI Mar 27, 2026

The image installs libssl1.1 by downloading a .deb over plain HTTP and installing it with dpkg without any signature/checksum verification. This is vulnerable to tampering and also bypasses apt’s normal integrity checks; please switch to HTTPS and verify the artifact (e.g., checksum/GPG), or use a trusted apt repository/package source for this dependency.

Collaborator Author

Fixed. Switched to HTTPS and added SHA256 checksum verification.
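The HTTPS-plus-checksum pattern this fix describes can be sketched as below. The URL and digest are placeholders, not the actual values from the Dockerfile:

```shell
#!/bin/sh
# Sketch of a checksum-verified download, in the spirit of the libssl1.1 fix.
# Any real use must substitute the actual artifact URL and its published
# SHA-256 digest; nothing here is copied from the Dockerfile.
set -eu

verify_sha256() {
    # Succeeds only if file $1 hashes to exactly $2 (two spaces are required
    # by the 'sha256sum -c' input format: "HASH  FILENAME").
    printf '%s  %s\n' "$2" "$1" | sha256sum -c - >/dev/null 2>&1
}

fetch_verified() {
    url=$1; expected=$2; dest=$3
    wget -q "$url" -O "$dest"
    if ! verify_sha256 "$dest" "$expected"; then
        rm -f "$dest"
        echo "checksum mismatch for $url" >&2
        return 1
    fi
}

# In the Dockerfile this would continue with something like:
#   fetch_verified "$URL" "$EXPECTED_SHA256" /tmp/libssl1.1.deb \
#     && dpkg -i /tmp/libssl1.1.deb && rm /tmp/libssl1.1.deb
```

Deleting the artifact on mismatch matters in a Docker RUN step: a leftover unverified .deb in an earlier layer could otherwise be installed by a later step.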

COPY build.sbt /tmp/sbt-warmup/build.sbt
COPY sonatype.sbt /tmp/sbt-warmup/sonatype.sbt
RUN cd /tmp/sbt-warmup \
&& sbt --batch -Dsbt.supershell=false "update; Test/update" 2>/dev/null || true \

Copilot AI Mar 27, 2026

sbt ... 2>/dev/null || true suppresses all warmup errors and discards stderr, which can silently produce an image with an incomplete/empty dependency cache and make CI failures harder to diagnose. Consider keeping logs (at least on failure) and failing the image build if the warmup command errors unexpectedly.

Suggested change
&& sbt --batch -Dsbt.supershell=false "update; Test/update" 2>/dev/null || true \
&& sbt --batch -Dsbt.supershell=false "update; Test/update" \

Collaborator Author

Fixed. Removed 2>/dev/null so warmup errors are visible. Kept || true since the warmup is best-effort — a partial cache is still better than no cache, and the actual build will fail definitively if dependencies are missing.
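The resulting "visible but non-fatal" warmup can be captured in a tiny wrapper; the wrapper itself is illustrative, while the sbt invocation it would guard is the one quoted above:

```shell
#!/bin/sh
# Best-effort warmup: errors stay visible on stderr, but a failure does not
# abort the image build, since a partial dependency cache still helps.
warmup() {
    "$@" || { echo "warmup failed (continuing; cache may be partial)" >&2; true; }
}

# In the Dockerfile this would guard:
#   warmup sbt --batch -Dsbt.supershell=false "update; Test/update"
# Harmless stand-in so the sketch runs anywhere:
warmup echo "simulated warmup command"
```

The trade-off is deliberate: the warmup only pre-populates caches, so the definitive failure signal is left to the actual CI build that consumes the image.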

RUN cd /tmp/sbt-warmup \
&& sbt --batch -Dsbt.supershell=false "update; Test/update" 2>/dev/null || true \
&& rm -rf /tmp/sbt-warmup /tmp/.sbt \
&& chmod -R 777 $COURSIER_CACHE

Copilot AI Mar 27, 2026

chmod -R 777 $COURSIER_CACHE makes the dependency cache world-writable. Even in CI, it’s safer to scope write permissions to the agent user/group (e.g., chown to the container user/UID used by ADO) to reduce the chance of cache poisoning between steps/jobs.

Suggested change
&& chmod -R 777 $COURSIER_CACHE
&& chown -R 1001:1001 "$COURSIER_CACHE" \
&& chmod -R 775 "$COURSIER_CACHE"

Collaborator Author

Fixed. Changed to chmod -R 755. ADO container jobs run as UID 1001 but the files are owned by root from the build — 755 gives read+execute access without world-writable.

Comment on lines 275 to 280
pool:
  vmImage: $(UBUNTU_VERSION)
steps:
  #- template: templates/ivy_cache.yml
  - template: templates/update_cli.yml
  - template: templates/conda.yml
  - template: templates/kv.yml

Copilot AI Mar 27, 2026

FabricE2E is still running on the host VM without container: ci and still includes the old update_cli.yml/conda.yml setup. That conflicts with the PR goal/description of containerizing the entire CI pipeline and may leave this job as a remaining slow/non-deterministic outlier; either migrate this job to the ci container (and use free_disk.yml like the others) or update the PR description to explicitly exclude Fabric E2E from containerization.

Collaborator Author

Intentional exclusion. FabricE2E requires the Fabric SDK and sempy packages which are not in the CI container. Added an explicit comment in pipeline.yaml explaining the exclusion, and updated the PR description.

Comment on lines +37 to +41
RUN wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh \
&& bash /tmp/miniconda.sh -b -p /opt/conda \
&& rm /tmp/miniconda.sh \
&& /opt/conda/bin/conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main \
&& /opt/conda/bin/conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

Copilot AI Mar 27, 2026

Using Miniconda3-latest makes the CI image build non-reproducible (the contents can change over time without any repo change), which can introduce hard-to-debug CI breakages. Consider pinning to a specific Miniconda installer version (or checksum-verifying the installer) so the image build is deterministic.

Collaborator Author

Fixed. Pinned to Miniconda3-py311_24.11.1-0. Note: this version ships conda 24.x which predates the TOS acceptance requirement, so the conda tos accept lines were removed.

nvidia-cuda-runtime-cu12 nvidia-cudnn-cu12 nvidia-cufft-cu12 \
nvidia-curand-cu12 nvidia-cusolver-cu12 nvidia-cusparse-cu12 \
nvidia-nccl-cu12 nvidia-nvjitlink-cu12 nvidia-nvtx-cu12 2>/dev/null || true \
&& chmod -R 777 /opt/conda/envs
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chmod -R 777 /opt/conda/envs makes the entire conda env tree world-writable inside the image. This broad permission can enable unintended modification of executables/libraries during CI runs; prefer a narrower permission model (e.g., chown to the agent UID/GID, or group-writable with a dedicated group) to reduce tampering risk.

Suggested change
&& chmod -R 777 /opt/conda/envs
&& chmod -R 755 /opt/conda/envs

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Changed to chmod -R 755 — same rationale as the COURSIER_CACHE change.

@BrendanWalsh force-pushed the brwals/containerize-ci branch from 6f18643 to bbb23c6 on March 27, 2026 19:12
@BrendanWalsh
Collaborator Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@codecov-commenter

codecov-commenter commented Mar 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.86%. Comparing base (895752c) to head (bbb23c6).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2529      +/-   ##
==========================================
+ Coverage   84.61%   84.86%   +0.24%     
==========================================
  Files         335      335              
  Lines       17708    17708              
  Branches     1612     1612              
==========================================
+ Hits        14984    15028      +44     
+ Misses       2724     2680      -44     

☔ View full report in Codecov by Sentry.

@BrendanWalsh force-pushed the brwals/containerize-ci branch 2 times, most recently from 025d3e6 to 7c1b41d on March 31, 2026 00:51
Add a BuildCIImage job that produces a content-hash-cached Docker image
(Ubuntu 22.04, JDK 8, conda, SBT, Spark, pre-cached datasets) published
to mmlsparkmcr.azurecr.io/synapseml/ci via Workload Identity Federation.

All test/build jobs now run inside this container, eliminating per-job
conda setup, CLI updates, and dataset downloads. Key changes:

- tools/docker/ci/Dockerfile: multi-stage image definition with pinned
  Miniconda, SHA256-verified libssl1.1, tightened file permissions (755)
- pipeline.yaml: container resource, BuildCIImage job, scoped compilation
  via PROJECT variable, SBT_OPTS tuning, free_disk.yml template,
  succeeded() gating on DatabricksE2E condition
- build.sbt: ForkJoinPool deadlock fix for parallel test execution
- TestBase.scala: 10-minute per-test hard timeout for all test suites
  (overridable in subclasses, e.g. DatabricksTestHelper uses longer)
- CodegenPlugin.scala: remove redundant root publishLocal in installPipPackage
- templates/free_disk.yml: disk cleanup helper

Note: FabricE2E intentionally excluded from containerization — it
requires Fabric SDK / sempy packages not in the CI container image.

Performance: ~59% wall-clock reduction (66m -> 27m), 33.5% total compute
savings (1228m -> 817m agent-minutes) across 59 test jobs.
@BrendanWalsh force-pushed the brwals/containerize-ci branch from 7c1b41d to aec4efd on March 31, 2026 00:51
