fix(profiler): make MySQL median deterministic#27815
Conversation
The MySQL MedianFn returned non-deterministic values across runs on identical data. Two bugs: 1. ROW_NUMBER() OVER () lacked a window ORDER BY, so row numbers were assigned in implementation-defined storage order, unrelated to the sorted column position the median needs. 2. The (SELECT @counter := COUNT(*) FROM tbl) t_count cross-join relied on user-variable side-effect ordering, which MySQL explicitly leaves undefined for expressions involving user variables. Replaced with ROW_NUMBER() OVER (ORDER BY {col}) + COUNT(*) OVER () AS total_count, matching the pattern Doris and SQLite dialects in this same file already use. Both correlated (dimension_col) and non-correlated branches updated symmetrically. Transitive impact: firstQuartile, thirdQuartile, and interQuartileRange all reuse MedianFn via PercentilMixin and become deterministic on MySQL as a side effect. Bug present since #10962 (2023-04-11). The original PR noted "Tested only external to OM" — no in-tree integration test against actual median values, so the 6 existing unit tests (which assert SQL strings) all passed against the broken impl. Verified locally: 10/10 sequential runs returned median=680 for [600,650,680,720,750] post-fix; 3/3 returned mixed 680/650/650 pre-fix.
There was a problem hiding this comment.
Pull request overview
This PR makes MySQL percentile/median computation deterministic in the ingestion profiler by eliminating MySQL user-variable side effects and ensuring window functions rank rows in a defined order.
Changes:
- Add
ORDER BY {col}toROW_NUMBER()windows so row numbering matches the value ordering used for median selection. - Replace
@counter := COUNT(*)(user-variable cross join) withCOUNT(*) OVER () AS total_countin both correlated and non-correlated branches.
| ROW_NUMBER() OVER (ORDER BY {col}) AS row_num, | ||
| COUNT(*) OVER () AS total_count | ||
| FROM `{table}` AS median_inner | ||
| WHERE median_inner.{dimension_col} = `{table}`.{dimension_col} | ||
| ORDER BY {col} | ||
| ) temp | ||
| WHERE temp.row_num = ROUND({percentile} * @counter) | ||
| WHERE temp.row_num = ROUND({percentile} * temp.total_count) |
There was a problem hiding this comment.
The MySQL implementation still selects a single ranked row via ROUND(percentile * total_count), which does not match percentile_cont semantics for even-sized datasets (median/percentiles should interpolate/average between the two middle ranks). This can make MySQL results diverge from other dialects using percentile_cont and from the Pandas implementation (np.median). Consider switching to selecting the two adjacent ranks and returning AVG(col) (similar to the SQLite/Informix patterns) so even row counts return the midpoint rather than an arbitrary existing value.
| ROW_NUMBER() OVER (ORDER BY {col}) AS row_num, | ||
| COUNT(*) OVER () AS total_count | ||
| FROM `{table}` | ||
| ) temp | ||
| WHERE temp.row_num = ROUND({percentile} * @counter) | ||
| WHERE temp.row_num = ROUND({percentile} * temp.total_count) |
There was a problem hiding this comment.
Same as correlated branch: using ROUND(percentile * temp.total_count) picks a single row. For even counts this differs from standard median/percentile interpolation (percentile_cont) and from the Pandas metric behavior. Consider computing the two neighboring rank positions and averaging them to produce consistent results across engines.
Mirrors the existing test_median_mariadb.py shape — testcontainers spins up a real MySQL 8.0 container, seeds 10 rows across 2 categories, then asserts MedianFn returns the correct percentile-discrete value across all six combinations (p=0.25/0.50/0.75 × non-correlated/dimension_col). Two extra regression sentinels guarding against the pre-fix bugs: - test_compiled_sql_uses_window_order_by — asserts ROW_NUMBER() OVER (ORDER BY ...) is in the generated SQL and the broken `OVER ()` pattern is absent. - test_compiled_sql_avoids_user_variable_counter — asserts @counter is absent and COUNT(*) OVER () is present. Plus a 10x determinism check (test_median_non_correlated_deterministic_ across_runs) that would have flagged the original bug from #10962 had it existed at the time. The MySQL median is percentile-discrete (picks a row at ROUND(p*N)) whereas MariaDB's PERCENTILE_CONT interpolates — same seed data produces different expected values across the two dialects, both documented inline in the test. Wait strategy uses LogMessageWaitStrategy("ready for connections") .with_startup_timeout(120) — testcontainers' default regex expects the message twice (which only MariaDB emits) and times out at 10s before MySQL 8 finishes initializing.
|
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
Single-line set comprehension per `make py_format` (CI checkstyle).
| # for the single "ready for connections" log line from the main server | ||
| # (the testcontainers default regex expects two occurrences which only | ||
| # MariaDB emits — MySQL emits one). | ||
| container = MySqlContainer(image="mysql:8.0", dbname="test_db").waiting_for( |
There was a problem hiding this comment.
The testcontainer image tag mysql:8.0 is floating and can change underneath CI (new patches, behavior changes, startup timing differences). Consider pinning to a specific patch version (or aligning with mysql:8.4.5 used in other MySQL integration tests) to improve reproducibility.
| container = MySqlContainer(image="mysql:8.0", dbname="test_db").waiting_for( | |
| container = MySqlContainer(image="mysql:8.4.5", dbname="test_db").waiting_for( |
Code Review ✅ ApprovedEnforces deterministic median calculation in the MySQL profiler by adding an ORDER BY clause. No issues found. OptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
|
🟡 Playwright Results — all passed (14 flaky)✅ 3968 passed · ❌ 0 failed · 🟡 14 flaky · ⏭️ 86 skipped
🟡 14 flaky test(s) (passed on retry)
How to debug locally# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace |
* fix(profiler): make MySQL median deterministic The MySQL MedianFn returned non-deterministic values across runs on identical data. Two bugs: 1. ROW_NUMBER() OVER () lacked a window ORDER BY, so row numbers were assigned in implementation-defined storage order, unrelated to the sorted column position the median needs. 2. The (SELECT @counter := COUNT(*) FROM tbl) t_count cross-join relied on user-variable side-effect ordering, which MySQL explicitly leaves undefined for expressions involving user variables. Replaced with ROW_NUMBER() OVER (ORDER BY {col}) + COUNT(*) OVER () AS total_count, matching the pattern Doris and SQLite dialects in this same file already use. Both correlated (dimension_col) and non-correlated branches updated symmetrically. Transitive impact: firstQuartile, thirdQuartile, and interQuartileRange all reuse MedianFn via PercentilMixin and become deterministic on MySQL as a side effect. Bug present since open-metadata#10962 (2023-04-11). The original PR noted "Tested only external to OM" — no in-tree integration test against actual median values, so the 6 existing unit tests (which assert SQL strings) all passed against the broken impl. Verified locally: 10/10 sequential runs returned median=680 for [600,650,680,720,750] post-fix; 3/3 returned mixed 680/650/650 pre-fix. * test(profiler): add MySQL median integration test (regression sentinel) Mirrors the existing test_median_mariadb.py shape — testcontainers spins up a real MySQL 8.0 container, seeds 10 rows across 2 categories, then asserts MedianFn returns the correct percentile-discrete value across all six combinations (p=0.25/0.50/0.75 × non-correlated/dimension_col). Two extra regression sentinels guarding against the pre-fix bugs: - test_compiled_sql_uses_window_order_by — asserts ROW_NUMBER() OVER (ORDER BY ...) is in the generated SQL and the broken `OVER ()` pattern is absent. - test_compiled_sql_avoids_user_variable_counter — asserts @counter is absent and COUNT(*) OVER () is present. Plus a 10x determinism check (test_median_non_correlated_deterministic_ across_runs) that would have flagged the original bug from open-metadata#10962 had it existed at the time. The MySQL median is percentile-discrete (picks a row at ROUND(p*N)) whereas MariaDB's PERCENTILE_CONT interpolates — same seed data produces different expected values across the two dialects, both documented inline in the test. Wait strategy uses LogMessageWaitStrategy("ready for connections") .with_startup_timeout(120) — testcontainers' default regex expects the message twice (which only MariaDB emits) and times out at 10s before MySQL 8 finishes initializing. * style(profiler): apply ruff format to MySQL median test Single-line set comprehension per `make py_format` (CI checkstyle).
* fix(profiler): make MySQL median deterministic The MySQL MedianFn returned non-deterministic values across runs on identical data. Two bugs: 1. ROW_NUMBER() OVER () lacked a window ORDER BY, so row numbers were assigned in implementation-defined storage order, unrelated to the sorted column position the median needs. 2. The (SELECT @counter := COUNT(*) FROM tbl) t_count cross-join relied on user-variable side-effect ordering, which MySQL explicitly leaves undefined for expressions involving user variables. Replaced with ROW_NUMBER() OVER (ORDER BY {col}) + COUNT(*) OVER () AS total_count, matching the pattern Doris and SQLite dialects in this same file already use. Both correlated (dimension_col) and non-correlated branches updated symmetrically. Transitive impact: firstQuartile, thirdQuartile, and interQuartileRange all reuse MedianFn via PercentilMixin and become deterministic on MySQL as a side effect. Bug present since open-metadata#10962 (2023-04-11). The original PR noted "Tested only external to OM" — no in-tree integration test against actual median values, so the 6 existing unit tests (which assert SQL strings) all passed against the broken impl. Verified locally: 10/10 sequential runs returned median=680 for [600,650,680,720,750] post-fix; 3/3 returned mixed 680/650/650 pre-fix. * test(profiler): add MySQL median integration test (regression sentinel) Mirrors the existing test_median_mariadb.py shape — testcontainers spins up a real MySQL 8.0 container, seeds 10 rows across 2 categories, then asserts MedianFn returns the correct percentile-discrete value across all six combinations (p=0.25/0.50/0.75 × non-correlated/dimension_col). Two extra regression sentinels guarding against the pre-fix bugs: - test_compiled_sql_uses_window_order_by — asserts ROW_NUMBER() OVER (ORDER BY ...) is in the generated SQL and the broken `OVER ()` pattern is absent. - test_compiled_sql_avoids_user_variable_counter — asserts @counter is absent and COUNT(*) OVER () is present. Plus a 10x determinism check (test_median_non_correlated_deterministic_ across_runs) that would have flagged the original bug from open-metadata#10962 had it existed at the time. The MySQL median is percentile-discrete (picks a row at ROUND(p*N)) whereas MariaDB's PERCENTILE_CONT interpolates — same seed data produces different expected values across the two dialects, both documented inline in the test. Wait strategy uses LogMessageWaitStrategy("ready for connections") .with_startup_timeout(120) — testcontainers' default regex expects the message twice (which only MariaDB emits) and times out at 10s before MySQL 8 finishes initializing. * style(profiler): apply ruff format to MySQL median test Single-line set comprehension per `make py_format` (CI checkstyle).



Summary
MedianFnreturned non-deterministic values across runs on identical data. Empirical: 3 sequential runs returned 680, 650, 650 for[600, 650, 680, 720, 750](textbook median is 680).ROW_NUMBER() OVER ()lacked a windowORDER BY, so row numbers were assigned in implementation-defined storage order — unrelated to the sorted column position the median needs.(SELECT @counter := COUNT(*) FROM {table}) t_countcross-join relied on user-variable side-effect ordering, which MySQL explicitly leaves undefined for expressions involving user variables.ROW_NUMBER() OVER (ORDER BY {col})+COUNT(*) OVER () AS total_count, matching the pattern the Doris and SQLite dialects in this same file already use. Both the correlated (dimension_col) and non-correlated branches were updated symmetrically.Transitive impact
firstQuartile,thirdQuartile, andinterQuartileRangeall reuseMedianFnviaPercentilMixin._compute_sqa_fnwith different percentile arguments. They were silently non-deterministic on MySQL too. This PR makes them deterministic as a side effect.History
Bug present since #10962 (Apr 2023). The original PR description noted "Tested only external to OM" — no in-tree integration test against actual median values, so the 6 existing unit tests (which assert SQL strings, not returned values) all passed against the broken impl. The dimensionality copy-paste in #24166 (Feb 2024) inherited the same pattern.
Test plan
tests/unit/observability/profiler/sqlalchemy/mysql/test_mysql_median.pystill passes (6/6 verified locally).[600, 650, 680, 720, 750]. Pre-fix: median flipped between 680 and 650 across runs. Post-fix: 10/10 returned 680 (the textbook 3rd-sorted value).Out of scope (worth follow-ups)
Median.fn()does not filter NULLs before calling_compute_sqa_fn. MySQL's defaultORDER BY col ASCplaces NULLs first → if NULLs outnumber half the rows, median returns NULL. Pre-existing behavior unchanged by this PR.