[CI] add LLVM IR bloat regression check by arrayka · Pull Request #1083 · microsoft/DiskANN

arrayka · 2026-05-18T23:15:30Z

This pull request introduces an automated check in the CI workflow to monitor and report on LLVM IR code size growth, helping to detect regressions in monomorphization cost. The main changes include adding a new CI job and a supporting shell script for comparing LLVM IR line counts between the current branch and the baseline.

CI workflow enhancements:

Added a new llvm-lines job to .github/workflows/ci.yml that compares the LLVM IR line counts of the current branch against the main branch using cargo-llvm-lines. This job is not required for PR merges but provides regression reports and enforces a configurable growth threshold.

Supporting scripts:

Introduced .github/scripts/compare-llvm-lines.sh, a Bash script that parses and compares the output of cargo-llvm-lines for the baseline and current builds, generating a markdown-formatted regression report for easier review.

Add an llvm-lines job to CI that compares LLVM IR line counts between the current branch and main to detect monomorphization bloat regressions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Adds a CI job that compares LLVM IR line counts for diskann-benchmark between the PR branch and main to detect monomorphization bloat, plus a standalone bash script that produces a markdown delta report from two cargo llvm-lines outputs.

Changes:

New llvm-lines job in .github/workflows/ci.yml (gated by needs: basics) that installs cargo-llvm-lines, runs it on both checkouts, computes overall growth, and writes a delta report to $GITHUB_STEP_SUMMARY.
New .github/scripts/compare-llvm-lines.sh that parses two cargo llvm-lines outputs and emits a per-function markdown comparison table sorted by delta.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
`.github/workflows/ci.yml`	Adds an advisory `llvm-lines` job that builds baseline/current `cargo llvm-lines` output, checks total growth against `LLVM_LINES_GROWTH_THRESHOLD`, and surfaces a report to the job summary.
`.github/scripts/compare-llvm-lines.sh`	New bash script that diffs two `cargo llvm-lines` outputs and prints a sorted markdown table of per-function regressions.
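
To illustrate the kind of per-function comparison the table above describes, here is a minimal sketch. It is not the actual contents of `compare-llvm-lines.sh`; the function name `compare_reports` and the output columns are hypothetical. It assumes rows shaped like `cargo llvm-lines` output (`<lines> (<pct>) <copies> (<pct>) <function>`). Functions that appear in only one report are kept with a count of 0, and rows are sorted by delta, largest growth first.

```shell
#!/usr/bin/env bash
# Sketch only: compare_reports is a hypothetical helper, not the PR's script.
# Usage: compare_reports BASELINE_FILE CURRENT_FILE
compare_reports() {
  echo '| Function | Baseline | Current | Delta |'
  echo '|---|---:|---:|---:|'
  awk '
    # Assumed row shape: "  2240 (4.3%,  4.3%)     1 (0.1%,  0.1%)  fn_name".
    # Header, separator, and (TOTAL) rows do not match and are skipped.
    /^ *[0-9]+ +\([^)]*\) +[0-9]+ +\([^)]*\) +/ {
      lines = $1
      name = $0
      sub(/^ *[0-9]+ +\([^)]*\) +[0-9]+ +\([^)]*\) +/, "", name)
      seen[name] = 1
      if (FILENAME == ARGV[1]) { base[name] = lines } else { cur[name] = lines }
    }
    END {
      # Full outer join: a function missing from one side counts as 0 lines.
      for (name in seen) {
        b = base[name] + 0
        c = cur[name] + 0
        # Emit the delta as a leading sort key; it is cut off again below.
        printf "%d\t| `%s` | %d | %d | %+d |\n", c - b, name, b, c, c - b
      }
    }
  ' "$1" "$2" | sort -t "$(printf '\t')" -k1,1nr | cut -f2-
}
```

Carrying the delta as a temporary first column lets `sort` order the rows numerically without having to parse the markdown cells.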

Comments suppressed due to low confidence (2)

.github/workflows/ci.yml:529

With GitHub Actions' default shell options (`bash --noprofile --norc -eo pipefail`, which `defaults.run.shell: bash` at the top of this file selects), `cargo llvm-lines ... | head -100 | tee ...` is very likely to fail the step. Once `head` has read 100 lines it closes its stdin; `cargo llvm-lines` then dies with SIGPIPE (exit code 141), and `pipefail` propagates that failure even though `head` and `tee` succeeded. Capture the full output first (e.g. `cargo llvm-lines ... | tee ../current-llvm-lines.txt`) and only apply `head` later when feeding the report, or disable `pipefail` locally for this command.

The same issue applies to the `head -100` on line 541.

        run: cargo llvm-lines --package diskann-benchmark --release | head -100 | tee ../baseline-llvm-lines.txt

      - name: Generate current LLVM lines
        working-directory: diskann_rust
        run: cargo llvm-lines --package diskann-benchmark --release | head -100 | tee ../current-llvm-lines.txt
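
One way to apply the suggestion above is to write the full output to a file first and truncate only when reading it back. The following is a minimal sketch, assuming a hypothetical helper named `capture_then_head` that is not part of the PR. Because `head` reads a regular file here, the producing command is never cut off by a closed pipe.

```shell
#!/usr/bin/env bash
# Sketch only: capture_then_head is a hypothetical helper, not code from the PR.
# Usage: capture_then_head OUT_FILE LIMIT COMMAND [ARGS...]
capture_then_head() {
  local out_file limit
  out_file=$1
  limit=$2
  shift 2
  # Run the command to completion, saving everything it prints to stdout.
  "$@" > "$out_file" || return
  # Truncate afterwards; head reads a file, so no writer can receive SIGPIPE.
  head -n "$limit" "$out_file"
}
```

In the workflow this might be invoked as `capture_then_head ../current-llvm-lines.txt 100 cargo llvm-lines --package diskann-benchmark --release`, with the flags copied from the quoted step.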

.github/workflows/ci.yml:539

If `baseline-llvm-lines.txt` does not contain a `(TOTAL)` line (e.g. `cargo llvm-lines` output format changed, the build was warning/error-filtered, or the SIGPIPE issue above produces an empty file), `baseline_total` will be the empty string and `$(( ... / baseline_total ))` will fail with a bash arithmetic error, taking down the step with a confusing message. `compare-llvm-lines.sh` defends against `baseline_total -eq 0` on line 23 but this inline computation does not. Consider either delegating the threshold check to the script or adding the same guard here (and defaulting empty values to 0).

          baseline_total=$(awk '/\(TOTAL\)/{print $1}' baseline-llvm-lines.txt)
          current_total=$(awk '/\(TOTAL\)/{print $1}' current-llvm-lines.txt)
          growth=$(( (current_total - baseline_total) * 100 / baseline_total ))

          if [ "$growth" -gt "$LLVM_LINES_GROWTH_THRESHOLD" ]; then
            echo "::warning::LLVM IR grew ${growth}% ($baseline_total → $current_total lines)"
          fi
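
The guard requested in this comment could look roughly like the sketch below. The helper names (`extract_total`, `is_uint`, `growth_percent`) are hypothetical and do not come from the PR; the `awk` pattern is the one used in the quoted snippet. The idea is to validate both totals before doing any arithmetic, so a missing `(TOTAL)` row produces a clean non-zero return instead of a bash arithmetic error.

```shell
#!/usr/bin/env bash
# Sketch only: these helper names are hypothetical, not code from the PR.

# Print the line count from the "(TOTAL)" row, or nothing if there is none.
extract_total() {
  awk '/\(TOTAL\)/ { print $1; exit }' "$1"
}

# Succeed only for a non-empty string of decimal digits.
is_uint() {
  case $1 in
    '' | *[!0-9]*) return 1 ;;
  esac
}

# Usage: growth_percent BASELINE_FILE CURRENT_FILE
# Prints integer percent growth; returns 1 instead of hitting an arithmetic
# error when a total is missing, non-numeric, or the baseline is zero.
growth_percent() {
  local baseline current
  baseline=$(extract_total "$1") || return 1
  current=$(extract_total "$2") || return 1
  is_uint "$baseline" && is_uint "$current" || return 1
  [ "$baseline" -gt 0 ] || return 1
  echo $(( (current - baseline) * 100 / baseline ))
}
```

A caller could then compare the printed value against `LLVM_LINES_GROWTH_THRESHOLD` and emit the `::warning::` annotation only when `growth_percent` succeeds.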


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

codecov-commenter · 2026-05-19T00:24:54Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.45%. Comparing base (205aad7) to head (f9296de).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1083      +/-   ##
==========================================
- Coverage   89.46%   89.45%   -0.01%     
==========================================
  Files         459      458       -1     
  Lines       85482    85398      -84     
==========================================
- Hits        76475    76392      -83     
+ Misses       9007     9006       -1

Flag	Coverage Δ
miri	`89.45% <ø> (-0.01%)`	⬇️
unittests	`89.08% <ø> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown.
see 18 files with indirect coverage changes


@harsha-simhadri

# DiskANN v0.53.0 Release Notes

## Breaking Changes

An AI-generated, human-reviewed list of changes is summarized below.

### Paged search overhauled: channel-based API ([#1078](#1078))

`PagedSearchState` and its `'static`-bound pause/resume model have been replaced with an async, channel-based interface. The recommended way to drive paged search is now via a `tokio::sync::mpsc` channel, with the searcher embedded in an otherwise-`'static` future. See the [rendered RFC](https://github.com/microsoft/DiskANN/blob/main/rfcs/01078-paged-search.md) for the new shape.

Callers wired against `PagedSearchState` must migrate to the channel API.

Users of paged search via `wrapped_async::DiskANNIndex` that know their inner futures will never suspend can use the new `wrapped_async::DiskANNIndex::paged_search_no_await`; this will efficiently run paged searches with minimal synchronization overhead.

### `DiskANNIndex::flat_search` removed ([#1076](#1076))

`DiskANNIndex::flat_search` and the `IdIterator` trait have been removed from the `diskann` crate. Equivalent functionality lives on the new inherent method `DiskIndexSearcher::flat_search` in `diskann-disk`. This unblocks the experimental directions in #1067 and #983.

```rust
// Before
diskann_index.flat_search(query, ...)?;

// After
disk_index_searcher.flat_search(query, ...).await?;
```

### `DiskIndexSearcher::flat_search` now batched ([#1097](#1097))

The new `DiskIndexSearcher::flat_search` uses the bulk `pq_distances` path instead of one-vector-at-a-time `Accessor::build_query_computer` + `evaluate_similarity`. Downstream behavior is equivalent but tighter resource bounds apply.

### `centroid` removed from PQ interfaces ([#1010](#1010))

The dataset-centroid argument has been removed from `FixedChunkPQTable` construction, `populate`, and most other PQ APIs. The shift only ever worked for L2 distance and was silently ignored for inner-product / cosine, so passing it was a footgun.

When an L2 shift is required, fold it into the PQ pivots instead (the library now does this internally).

```rust
// Before
let table = FixedChunkPQTable::new(.., centroid, ..);

// After: drop the centroid argument
let table = FixedChunkPQTable::new(.., ..);
```

### Flat search interface ([#983](#983))

A new `flat` module in `diskann` adds a provider-agnostic brute-force search surface, mirroring the shape of graph search. Backends implement a single trait, `DistancesUnordered<C>` (in `flat/strategy.rs`), which fuses iteration and distance computation, allowing any backend (in-memory, quantized, disk, remote) to plug into a shared algorithm. See the [rendered RFC](https://github.com/microsoft/DiskANN/blob/main/rfcs/00983-flat-search.md).

This is additive but is the new canonical surface; direct ad-hoc flat-search call sites should migrate.

### `bf_tree` extracted into `diskann-bftree` crate ([#1020](#1020))

The bf_tree provider has been moved out of `diskann-providers` (previously at `diskann-providers/src/model/graph/provider/async_/bf_tree/`) into a new standalone `diskann-bftree` crate. Along with the move:

- Switched from PQ to spherical quantization.
- Dropped dependencies on `DeletionCheck`, `AsDeletionCheck`, and `RemoveDeletedIdsAndCopy`.
- Simplified generics.

Consumers must update their `Cargo.toml` to depend on `diskann-bftree` and update import paths.

### `direct_distance_impl` and `inner_product_raw` re-exposed ([#1081](#1081))

`direct_distance_impl` (free function) and `FixedChunkPQTable::inner_product_raw` are `pub` again after being privatized in #1044. Restored to unblock a downstream user. Not breaking in the typical direction; this restores previously available API surface.

### MinMax `recompress` takes a grid-scale parameter ([#1109](#1109))

The MinMax `recompress` API now accepts a grid-scale parameter.

## New Features

- SIMD-optimized L2-squared norm ([#1107](#1107))
- Significantly faster bitmap computation ([#1099](#1099))
  - Large speedup on the bitmap construction path used by filtered search.
- LLVM IR bloat regression check in CI ([#1083](#1083))
  - CI now flags regressions in generated LLVM IR size, helping catch unintended monomorphization blow-ups.
- Recall computation fix for under-k groundtruth ([#1069](#1069))

## Merged PRs

* Revise README for DiskANN3 by @harsha-simhadri in #1046
* [CI] Try to fix publishing step by @hildebrandmw in #1057
* [benchmark] Remove `DispatchRule` by @hildebrandmw in #1064
* [benchmark] Automatic Input Registration by @hildebrandmw in #1066
* Remove centroid from most PQ interfaces by @hildebrandmw in #1010
* [diskann/disk] Remove `flat_search` from `DiskANNIndex` by @hildebrandmw in #1076
* macos build and miri check to nightly by @harsha-simhadri in #1058
* [API] Make some methods public again by @hildebrandmw in #1081
* [benchmark] Simply `Inputs` more by @hildebrandmw in #1077
* Turn on stack protection for the diskann-garnet NuGet build by @jackmoffitt in #1082
* Fix options for diskann-garnet nuget pipeline by @jackmoffitt in #1091
* [CI] add LLVM IR bloat regression check by @arazumov in #1083
* Bump openssl from 0.10.79 to 0.10.80 by @dependabot[bot] in #1093
* [Disk CI benchmarks] Use 1ES.Pool=diskann-github by @arazumov in #869
* Fix recall computation for fewer than k groundtruth results by @magdalendobson in #1069
* bf_tree migration away from diskann-providers by @JordanMaples in #1020
* [RFC/diskann] Overhaul paged search by @hildebrandmw in #1078
* Remove unsafe code from compute_vec_l2sq by @arazumov in #1094
* Remove direct accessor call in `diskann-garnet` by @hildebrandmw in #1098
* Refactor `DiskIndexSearcher::flat_search` to use batching by @hildebrandmw in #1097
* [flat index] Flat Search Interface by @arkrishn94 in #983
* migrating multi-hop tests from diskann-providers to diskann by @JordanMaples in #928
* Significantly speed up bitmap computation by @magdalendobson in #1099
* `compute_vecs_l2sq`: Replace scalar L2 Squared norm with SIMD-optimized FastL2NormSquared by @arazumov in #1107
* [minmax] Add grid scaling to recompress API by @arkrishn94 in #1109

**Full Changelog**: v0.52.0...v0.53.0

ci: add LLVM IR bloat regression check

7448fd2


arrayka requested review from a team and Copilot May 18, 2026 23:15

arrayka changed the title ~~ci: add LLVM IR bloat regression check~~ [CI] add LLVM IR bloat regression check May 18, 2026

Copilot started reviewing on behalf of arrayka May 18, 2026 23:20

Copilot AI reviewed May 18, 2026


Comment thread .github/workflows/ci.yml Outdated

Comment thread .github/scripts/compare-llvm-lines.sh Outdated

Comment thread .github/workflows/ci.yml Outdated

Comment thread .github/workflows/ci.yml Outdated

arrayka and others added 4 commits May 18, 2026 16:25

Add --features "$DISKANN_FEATURES"

728ee52

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

023108d

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Fixing issues

b9f51aa

Minor fixes

878d415

harsha-simhadri reviewed May 18, 2026


Comment thread .github/workflows/ci.yml Outdated

Comment thread .github/workflows/ci.yml

Fixes Compare LLVM lines

f31a6d9

hildebrandmw reviewed May 19, 2026


Comment thread .github/workflows/ci.yml Outdated

Comment thread .github/workflows/ci.yml Outdated

Comment thread .github/scripts/compare-llvm-lines.sh Outdated

Alex Razumov (from Dev Box) added 2 commits May 19, 2026 14:58

Use pr instead of diskann_rust

d9434d2

Run with --all-features

c139f9c

harsha-simhadri approved these changes May 19, 2026


Alex Razumov (from Dev Box) added 2 commits May 19, 2026 15:10

Do full outer join

498e134

Minor fix

a957ad5

hildebrandmw approved these changes May 19, 2026


arrayka enabled auto-merge (squash) May 19, 2026 22:51

arrayka disabled auto-merge May 19, 2026 22:54

Delta and Current columns sort numerically

f9296de

hildebrandmw approved these changes May 19, 2026


arrayka enabled auto-merge (squash) May 19, 2026 23:32

arrayka merged commit 24f43e2 into main May 19, 2026
22 checks passed

arrayka deleted the u/arrayka/llvm-lines branch May 19, 2026 23:36

arkrishn94 mentioned this pull request May 28, 2026

Bump version to 0.53.0 #1111

Merged

arrayka merged 11 commits into main from u/arrayka/llvm-lines


5 participants
