fix(profiler): make MySQL median deterministic by IceS2 · Pull Request #27815 · open-metadata/OpenMetadata

IceS2 · 2026-04-29T11:37:07Z

Summary

MySQL MedianFn returned non-deterministic values across runs on identical data. Empirically, 3 sequential runs returned 680, 650, 650 for [600, 650, 680, 720, 750] (the textbook median is 680).
Two upstream bugs in the dialect compile dispatch:
1. ROW_NUMBER() OVER () lacked a window ORDER BY, so row numbers were assigned in implementation-defined storage order — unrelated to the sorted column position the median needs.
2. (SELECT @counter := COUNT(*) FROM {table}) t_count cross-join relied on user-variable side-effect ordering, which MySQL explicitly leaves undefined for expressions involving user variables.
Replaced with ROW_NUMBER() OVER (ORDER BY {col}) + COUNT(*) OVER () AS total_count, matching the pattern the Doris and SQLite dialects in this same file already use. Both the correlated (dimension_col) and non-correlated branches were updated symmetrically.
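
To make the replacement pattern concrete, here is a minimal sketch of the same query shape run through Python's built-in sqlite3 module. SQLite is only a stand-in for MySQL here, and the `orders` table and `price` column are hypothetical names, not taken from the profiler templates. It assumes a SQLite build with window-function support (3.25 or newer).

```python
import sqlite3

# Hypothetical table/column names; SQLite stands in for MySQL in this sketch.
QUERY = """
SELECT price
FROM (
    SELECT
        price,
        ROW_NUMBER() OVER (ORDER BY price) AS row_num,
        COUNT(*) OVER () AS total_count
    FROM orders
) temp
WHERE temp.row_num = ROUND(:percentile * temp.total_count)
"""


def percentile_disc(values, percentile):
    """Return the row whose sorted rank equals ROUND(percentile * N)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (price INTEGER)")
        conn.executemany(
            "INSERT INTO orders (price) VALUES (?)", [(v,) for v in values]
        )
        row = conn.execute(QUERY, {"percentile": percentile}).fetchone()
        return row[0] if row else None
    finally:
        conn.close()


# Insertion order is deliberately shuffled: the window ORDER BY, not the
# storage order, decides which row gets rank 3.
median = percentile_disc([720, 600, 750, 650, 680], 0.5)
print(median)  # 680
```

Because the row number is tied to the sorted value and the count comes from the same result set, repeated runs pick the same row regardless of how the rows were stored.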

Transitive impact

firstQuartile, thirdQuartile, and interQuartileRange all reuse MedianFn via PercentilMixin._compute_sqa_fn with different percentile arguments. They were silently non-deterministic on MySQL too. This PR makes them deterministic as a side effect.
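
As a rough illustration of why the quartile metrics inherit the fix, the rank rule ROUND(p * N) can be emulated in plain Python with different percentile arguments. This is a sketch, not the profiler's code: the helper names are hypothetical, the rounding is assumed to be half away from zero (Python's own `round()` uses banker's rounding, so it is not used), and the interquartile range is assumed to be the usual third quartile minus first quartile.

```python
import math


def round_half_up(x):
    """Round a non-negative number half away from zero."""
    return math.floor(x + 0.5)


def percentile_disc(values, percentile):
    """Pick the value at 1-based sorted rank ROUND(percentile * N)."""
    ordered = sorted(values)
    rank = round_half_up(percentile * len(ordered))
    if rank < 1 or rank > len(ordered):
        return None  # no row matches the computed rank
    return ordered[rank - 1]


data = [600, 650, 680, 720, 750]
first_quartile = percentile_disc(data, 0.25)  # rank ROUND(1.25) = 1
median = percentile_disc(data, 0.50)          # rank ROUND(2.50) = 3
third_quartile = percentile_disc(data, 0.75)  # rank ROUND(3.75) = 4
inter_quartile_range = third_quartile - first_quartile

print(first_quartile, median, third_quartile, inter_quartile_range)
# 600 680 720 120
```

All three ranks depend on a stable sort of the column, which is why an unordered ROW_NUMBER() would make every one of these metrics unstable, not just the median.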

History

Bug present since #10962 (Apr 2023). The original PR description noted "Tested only external to OM" — no in-tree integration test against actual median values, so the 6 existing unit tests (which assert SQL strings, not returned values) all passed against the broken impl. The dimensionality copy-paste in #24166 (Feb 2024) inherited the same pattern.

Test plan

Existing tests/unit/observability/profiler/sqlalchemy/mysql/test_mysql_median.py still passes (6/6 verified locally).
Local end-to-end: ran the MySQL profiler 10 sequential times against a 5-row table seeded with [600, 650, 680, 720, 750]. Pre-fix: median flipped between 680 and 650 across runs. Post-fix: 10/10 returned 680 (the textbook 3rd-sorted value).
CI runs the MySQL profiler integration suite across affected metrics.

Out of scope (worth follow-ups)

NULL filtering: Median.fn() does not filter NULLs before calling _compute_sqa_fn. MySQL's default ORDER BY col ASC places NULLs first, so if NULLs make up half the rows or more, median returns NULL. Pre-existing behavior unchanged by this PR.
In-tree integration test asserting actual median values (not just SQL strings) against seeded data — would have caught both bugs in original CI.
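
The NULL follow-up can be reproduced with a small sketch. SQLite is used as a stand-in because, like MySQL, it sorts NULLs first in ascending order; the `orders` table, `price` column, and the `WHERE price IS NOT NULL` filter are illustrative assumptions, not the project's chosen fix.

```python
import sqlite3

# Hypothetical names; {where} is filled with an optional NULL filter.
TEMPLATE = """
SELECT price
FROM (
    SELECT
        price,
        ROW_NUMBER() OVER (ORDER BY price) AS row_num,
        COUNT(*) OVER () AS total_count
    FROM orders
    {where}
) temp
WHERE temp.row_num = ROUND(0.5 * temp.total_count)
"""


def median(values, skip_nulls):
    """Run the rank-based median, optionally dropping NULLs before ranking."""
    where = "WHERE price IS NOT NULL" if skip_nulls else ""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (price INTEGER)")
        conn.executemany(
            "INSERT INTO orders (price) VALUES (?)", [(v,) for v in values]
        )
        row = conn.execute(TEMPLATE.format(where=where)).fetchone()
        return row[0] if row else None
    finally:
        conn.close()


data = [None, 600, None, 650, None]
unfiltered = median(data, skip_nulls=False)  # rank 3 of 5 lands on a NULL
filtered = median(data, skip_nulls=True)     # rank 1 of 2 non-NULL rows
print(unfiltered, filtered)  # None 600
```

Filtering inside the subquery matters: the filter has to run before the window functions so that both the row numbers and the total count ignore NULL rows.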

fix(profiler): make MySQL median deterministic

The MySQL MedianFn returned non-deterministic values across runs on identical data. Two bugs:

1. ROW_NUMBER() OVER () lacked a window ORDER BY, so row numbers were assigned in implementation-defined storage order, unrelated to the sorted column position the median needs.
2. The (SELECT @counter := COUNT(*) FROM tbl) t_count cross-join relied on user-variable side-effect ordering, which MySQL explicitly leaves undefined for expressions involving user variables.

Replaced with ROW_NUMBER() OVER (ORDER BY {col}) + COUNT(*) OVER () AS total_count, matching the pattern Doris and SQLite dialects in this same file already use. Both correlated (dimension_col) and non-correlated branches updated symmetrically.

Transitive impact: firstQuartile, thirdQuartile, and interQuartileRange all reuse MedianFn via PercentilMixin and become deterministic on MySQL as a side effect.

Bug present since #10962 (2023-04-11). The original PR noted "Tested only external to OM"; with no in-tree integration test against actual median values, the 6 existing unit tests (which assert SQL strings) all passed against the broken impl.

Verified locally: 10/10 sequential runs returned median=680 for [600,650,680,720,750] post-fix; 3/3 returned mixed 680/650/650 pre-fix.

Copilot

Pull request overview

This PR makes MySQL percentile/median computation deterministic in the ingestion profiler by eliminating MySQL user-variable side effects and ensuring window functions rank rows in a defined order.

Changes:

Add ORDER BY {col} to ROW_NUMBER() windows so row numbering matches the value ordering used for median selection.
Replace @counter := COUNT(*) (user-variable cross join) with COUNT(*) OVER () AS total_count in both correlated and non-correlated branches.

Copilot · 2026-04-29T11:41:46Z

+                ROW_NUMBER() OVER (ORDER BY {col}) AS row_num,
+                COUNT(*) OVER () AS total_count
+            FROM `{table}` AS median_inner
            WHERE median_inner.{dimension_col} = `{table}`.{dimension_col}
-            ORDER BY {col}
            ) temp
-        WHERE temp.row_num = ROUND({percentile} * @counter)
+        WHERE temp.row_num = ROUND({percentile} * temp.total_count)


The MySQL implementation still selects a single ranked row via ROUND(percentile * total_count), which does not match percentile_cont semantics for even-sized datasets (median/percentiles should interpolate/average between the two middle ranks). This can make MySQL results diverge from other dialects using percentile_cont and from the Pandas implementation (np.median). Consider switching to selecting the two adjacent ranks and returning AVG(col) (similar to the SQLite/Informix patterns) so even row counts return the midpoint rather than an arbitrary existing value.

TeddyCr · Apr 29, 2026

I think this is relevant @IceS2

Copilot · 2026-04-29T11:41:46Z

+                ROW_NUMBER() OVER (ORDER BY {col}) AS row_num,
+                COUNT(*) OVER () AS total_count
+            FROM `{table}`
            ) temp
-        WHERE temp.row_num = ROUND({percentile} * @counter)
+        WHERE temp.row_num = ROUND({percentile} * temp.total_count)


Same as correlated branch: using ROUND(percentile * temp.total_count) picks a single row. For even counts this differs from standard median/percentile interpolation (percentile_cont) and from the Pandas metric behavior. Consider computing the two neighboring rank positions and averaging them to produce consistent results across engines.
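
The difference the two review comments describe shows up only for even row counts. The sketch below compares the single-rank rule with a two-middle-ranks average in plain Python; the function names are hypothetical, half-away-from-zero rounding is assumed for ROUND, and `statistics.median` is used only as a reference for the interpolating behaviour.

```python
import math
import statistics


def single_rank_median(values):
    """Percentile-discrete median: the value at sorted rank ROUND(0.5 * N)."""
    ordered = sorted(values)
    if not ordered:
        return None
    rank = math.floor(0.5 * len(ordered) + 0.5)  # half away from zero
    return ordered[rank - 1]


def two_rank_average_median(values):
    """Average of the two middle ranks (they coincide for odd N)."""
    ordered = sorted(values)
    if not ordered:
        return None
    n = len(ordered)
    lower = ordered[(n - 1) // 2]
    upper = ordered[n // 2]
    return (lower + upper) / 2


even = [600, 650, 680, 720]
odd = [600, 650, 680, 720, 750]

print(single_rank_median(even), two_rank_average_median(even))  # 650 665.0
print(single_rank_median(odd), two_rank_average_median(odd))    # 680 680.0
```

For the odd-sized list both rules agree on 680; for the even-sized list the single-rank rule returns an existing value (650) while the averaging rule returns the midpoint (665.0), which matches `statistics.median`.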

test(profiler): add MySQL median integration test (regression sentinel)

Mirrors the existing test_median_mariadb.py shape: testcontainers spins up a real MySQL 8.0 container, seeds 10 rows across 2 categories, then asserts MedianFn returns the correct percentile-discrete value across all six combinations (p=0.25/0.50/0.75 × non-correlated/dimension_col).

Two extra regression sentinels guarding against the pre-fix bugs:

- test_compiled_sql_uses_window_order_by: asserts ROW_NUMBER() OVER (ORDER BY ...) is in the generated SQL and the broken `OVER ()` pattern is absent.
- test_compiled_sql_avoids_user_variable_counter: asserts @counter is absent and COUNT(*) OVER () is present.

Plus a 10x determinism check (test_median_non_correlated_deterministic_across_runs) that would have flagged the original bug from #10962 had it existed at the time.

The MySQL median is percentile-discrete (picks a row at ROUND(p*N)) whereas MariaDB's PERCENTILE_CONT interpolates, so the same seed data produces different expected values across the two dialects; both are documented inline in the test.

Wait strategy uses LogMessageWaitStrategy("ready for connections").with_startup_timeout(120): testcontainers' default regex expects the message twice (which only MariaDB emits) and times out at 10s before MySQL 8 finishes initializing.
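
The two SQL-string sentinels could look roughly like the checks below. This is a sketch under assumptions: the helper names and the two sample SQL strings are hypothetical stand-ins, not the compiled output of MedianFn or the actual test code.

```python
import re

# Illustrative SQL strings only; the real tests compile MedianFn instead.
FIXED_SQL = """
SELECT price FROM (
    SELECT price,
           ROW_NUMBER() OVER (ORDER BY price) AS row_num,
           COUNT(*) OVER () AS total_count
    FROM `orders`
) temp
WHERE temp.row_num = ROUND(0.5 * temp.total_count)
"""

BROKEN_SQL = """
SELECT price FROM (
    SELECT price, ROW_NUMBER() OVER () AS row_num
    FROM `orders`, (SELECT @counter := COUNT(*) FROM `orders`) t_count
    ORDER BY price
) temp
WHERE temp.row_num = ROUND(0.5 * @counter)
"""


def uses_window_order_by(sql):
    """True if ROW_NUMBER() has an ordered window and no empty OVER ()."""
    ordered = re.search(
        r"ROW_NUMBER\(\)\s+OVER\s*\(\s*ORDER\s+BY\b", sql, re.IGNORECASE
    )
    empty = re.search(r"ROW_NUMBER\(\)\s+OVER\s*\(\s*\)", sql, re.IGNORECASE)
    return ordered is not None and empty is None


def avoids_user_variable_counter(sql):
    """True if @counter is gone and COUNT(*) OVER () supplies the total."""
    window_count = re.search(r"COUNT\(\*\)\s+OVER\s*\(\s*\)", sql, re.IGNORECASE)
    return "@counter" not in sql and window_count is not None


print(uses_window_order_by(FIXED_SQL), avoids_user_variable_counter(FIXED_SQL))
print(uses_window_order_by(BROKEN_SQL), avoids_user_variable_counter(BROKEN_SQL))
```

String checks like these only guard the query shape; the value assertions against a real MySQL container are what verify the returned median.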

github-actions · 2026-04-29T11:55:01Z

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Copilot · 2026-04-29T12:34:58Z

+    # for the single "ready for connections" log line from the main server
+    # (the testcontainers default regex expects two occurrences which only
+    # MariaDB emits — MySQL emits one).
+    container = MySqlContainer(image="mysql:8.0", dbname="test_db").waiting_for(


The testcontainer image tag mysql:8.0 is floating and can change underneath CI (new patches, behavior changes, startup timing differences). Consider pinning to a specific patch version (or aligning with mysql:8.4.5 used in other MySQL integration tests) to improve reproducibility.

Suggested change

-    container = MySqlContainer(image="mysql:8.0", dbname="test_db").waiting_for(
+    container = MySqlContainer(image="mysql:8.4.5", dbname="test_db").waiting_for(

gitar-bot · 2026-04-29T14:21:14Z

Code Review ✅ Approved

Enforces deterministic median calculation in the MySQL profiler by adding an ORDER BY clause. No issues found.

sonarqubecloud · 2026-04-29T15:45:48Z

Quality Gate passed for 'open-metadata-ingestion'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2026-04-29T16:43:26Z

🟡 Playwright Results — all passed (14 flaky)

✅ 3968 passed · ❌ 0 failed · 🟡 14 flaky · ⏭️ 86 skipped

Shard	Passed	Flaky	Skipped
🟡 Shard 1	298	1	4
🟡 Shard 2	739	6	8
🟡 Shard 3	753	2	7
🟡 Shard 4	757	2	18
✅ Shard 5	687	0	41
🟡 Shard 6	734	3	8

🟡 14 flaky test(s) (passed on retry)

Features/CustomizeDetailPage.spec.ts › Glossary Term - customization should work (shard 1, 1 retry)
Features/ActivityAPI.spec.ts › Activity event is created when description is updated (shard 2, 1 retry)
Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
Features/Glossary/GlossaryExpandAllWithStatusFilter.spec.ts › Expand All with Approved filter shows all terms (shard 2, 1 retry)
Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 2 retries)
Features/IncidentManager.spec.ts › Verify filters in Incident Manager's page (shard 2, 2 retries)
Features/IncidentManager.spec.ts › Next, Previous and page indicator (shard 2, 2 retries)
Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
Features/UserProfileOnlineStatus.spec.ts › Should show "Active recently" for users active within last hour (shard 3, 1 retry)
Pages/DataContracts.spec.ts › Create Data Contract and validate for Api Collection (shard 4, 1 retry)
Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Set (shard 4, 1 retry)
Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
Pages/ServiceEntity.spec.ts › Tier Add, Update and Remove (shard 6, 1 retry)

📦 Download artifacts

How to debug locally

# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

IceS2 requested a review from a team as a code owner April 29, 2026 11:37

Copilot AI review requested due to automatic review settings April 29, 2026 11:37

github-actions Bot added Ingestion safe to test Add this label to run secure Github workflows on PRs labels Apr 29, 2026

Copilot started reviewing on behalf of IceS2 April 29, 2026 11:37 View session

Copilot AI reviewed Apr 29, 2026

IceS2 had a problem deploying to test April 29, 2026 11:46 — with GitHub Actions Error

IceS2 had a problem deploying to test April 29, 2026 12:00 — with GitHub Actions Error

style(profiler): apply ruff format to MySQL median test

242f828

Single-line set comprehension per `make py_format` (CI checkstyle).

Copilot AI review requested due to automatic review settings April 29, 2026 12:28

Copilot started reviewing on behalf of IceS2 April 29, 2026 12:28 View session

Copilot AI reviewed Apr 29, 2026

IceS2 had a problem deploying to test April 29, 2026 12:38 — with GitHub Actions Error

IceS2 temporarily deployed to test April 29, 2026 12:38 — with GitHub Actions Inactive

IceS2 had a problem deploying to test April 29, 2026 12:38 — with GitHub Actions Error

IceS2 temporarily deployed to test April 29, 2026 12:38 — with GitHub Actions Inactive

Merge branch 'main' into fix/mysql-profiler-median-determinism

c924b8f

IceS2 temporarily deployed to test April 29, 2026 14:31 — with GitHub Actions Inactive

pmbrull approved these changes Apr 30, 2026

pmbrull merged commit 8291e06 into main Apr 30, 2026
51 checks passed

pmbrull deleted the fix/mysql-profiler-median-determinism branch April 30, 2026 05:36

pmbrull merged 4 commits into main from fix/mysql-profiler-median-determinism
