
fix(profiler): make MySQL median deterministic#27815

Merged
pmbrull merged 4 commits into main from fix/mysql-profiler-median-determinism
Apr 30, 2026

Conversation

Contributor

@IceS2 IceS2 commented Apr 29, 2026

Summary

  • MySQL MedianFn returned non-deterministic values across runs on identical data. Empirical: 3 sequential runs returned 680, 650, 650 for [600, 650, 680, 720, 750] (textbook median is 680).
  • Two upstream bugs in the dialect compile dispatch:
    1. ROW_NUMBER() OVER () lacked a window ORDER BY, so row numbers were assigned in implementation-defined storage order — unrelated to the sorted column position the median needs.
    2. (SELECT @counter := COUNT(*) FROM {table}) t_count cross-join relied on user-variable side-effect ordering, which MySQL explicitly leaves undefined for expressions involving user variables.
  • Replaced with ROW_NUMBER() OVER (ORDER BY {col}) + COUNT(*) OVER () AS total_count, matching the pattern the Doris and SQLite dialects in this same file already use. Both the correlated (dimension_col) and non-correlated branches were updated symmetrically.
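The replacement pattern can be tried outside MySQL. The following is a minimal sketch, assuming Python's bundled `sqlite3` (SQLite 3.25 or later, for window functions) as a stand-in for MySQL. The table `scores`, the column `val`, and the helper `percentile_row` are hypothetical names for illustration, not identifiers from this PR.

```python
import sqlite3

# Hypothetical stand-in for the dialect template: rank rows by the column
# value and carry the total row count alongside every row.
QUERY = """
SELECT val FROM (
    SELECT val,
           ROW_NUMBER() OVER (ORDER BY val) AS row_num,
           COUNT(*) OVER () AS total_count
    FROM scores
) temp
WHERE temp.row_num = ROUND(? * temp.total_count)
"""


def percentile_row(conn, percentile):
    """Return the value at rank ROUND(percentile * N), or None if no row matches."""
    row = conn.execute(QUERY, (percentile,)).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (val INTEGER)")
# Insert out of order so the result cannot lean on storage order.
conn.executemany(
    "INSERT INTO scores VALUES (?)", [(720,), (600,), (750,), (680,), (650,)]
)

median = percentile_row(conn, 0.5)
print(median)
```

Because the window is ordered by `val`, rank 3 of 5 is the third-smallest value regardless of insertion order. The sketch does not reproduce the pre-fix MySQL behaviour; it only shows the shape of the fixed query.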

Transitive impact

firstQuartile, thirdQuartile, and interQuartileRange all reuse MedianFn via PercentilMixin._compute_sqa_fn with different percentile arguments. They were silently non-deterministic on MySQL too. This PR makes them deterministic as a side effect.
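To make the reuse concrete, here is a small pure-Python emulation of the rank selection `ROUND(percentile * N)` applied with three percentile arguments. The function name `discrete_percentile` is hypothetical, and rounding half away from zero is an assumption about how the SQL `ROUND()` treats these inputs.

```python
import math


def discrete_percentile(values, percentile):
    """Pick the value at 1-based rank ROUND(percentile * N) in sorted order.

    Rounds half away from zero (an assumption for this sketch) and returns
    None when the computed rank does not match any row.
    """
    ordered = sorted(values)
    rank = math.floor(percentile * len(ordered) + 0.5)
    if rank < 1 or rank > len(ordered):
        return None
    return ordered[rank - 1]


data = [600, 650, 680, 720, 750]
first_quartile = discrete_percentile(data, 0.25)  # rank ROUND(1.25) = 1
median = discrete_percentile(data, 0.50)          # rank ROUND(2.5) = 3
third_quartile = discrete_percentile(data, 0.75)  # rank ROUND(3.75) = 4
inter_quartile_range = third_quartile - first_quartile
```

All three metrics depend on the same ranked row lookup, so an unordered `ROW_NUMBER()` affects each of them in the same way.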

History

Bug present since #10962 (Apr 2023). The original PR description noted "Tested only external to OM" — no in-tree integration test against actual median values, so the 6 existing unit tests (which assert SQL strings, not returned values) all passed against the broken impl. The dimensionality copy-paste in #24166 (Feb 2024) inherited the same pattern.

Test plan

  • Existing tests/unit/observability/profiler/sqlalchemy/mysql/test_mysql_median.py still passes (6/6 verified locally).
  • Local end-to-end: ran the MySQL profiler 10 sequential times against a 5-row table seeded with [600, 650, 680, 720, 750]. Pre-fix: median flipped between 680 and 650 across runs. Post-fix: 10/10 returned 680 (the textbook 3rd-sorted value).
  • CI runs the MySQL profiler integration suite across affected metrics.
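The repeated-run check described above can be approximated as follows. This is a sketch under stated assumptions: `sqlite3` stands in for MySQL, the table `scores` and helper `median_for_insert_order` are hypothetical, and each run reseeds a fresh table in a shuffled insertion order to stress the dependence on storage order.

```python
import random
import sqlite3

MEDIAN_SQL = """
SELECT val FROM (
    SELECT val,
           ROW_NUMBER() OVER (ORDER BY val) AS row_num,
           COUNT(*) OVER () AS total_count
    FROM scores
) temp
WHERE temp.row_num = ROUND(0.5 * temp.total_count)
"""


def median_for_insert_order(values):
    """Seed a fresh in-memory table in the given order and return the median row."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE scores (val INTEGER)")
        conn.executemany("INSERT INTO scores VALUES (?)", [(v,) for v in values])
        return conn.execute(MEDIAN_SQL).fetchone()[0]
    finally:
        conn.close()


rng = random.Random(0)  # fixed seed so the sketch itself is repeatable
seed_rows = [600, 650, 680, 720, 750]
# Collect the distinct results of 10 runs; a deterministic query yields one value.
observed = {
    median_for_insert_order(rng.sample(seed_rows, len(seed_rows))) for _ in range(10)
}
```

Collecting results into a set makes the assertion simple: a deterministic median leaves exactly one element.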

Out of scope (worth follow-ups)

  • NULL filtering: Median.fn() does not filter NULLs before calling _compute_sqa_fn. MySQL's default ORDER BY col ASC places NULLs first → if NULLs outnumber half the rows, median returns NULL. Pre-existing behavior unchanged by this PR.
  • In-tree integration test asserting actual median values (not just SQL strings) against seeded data — would have caught both bugs in original CI.
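The NULL-ordering concern can be illustrated with a short sketch. It assumes `sqlite3` as a stand-in, which, like MySQL, sorts NULLs first in ascending order; the table `scores` and the filtered subquery alias `non_null` are hypothetical.

```python
import sqlite3

MEDIAN_SQL = """
SELECT val FROM (
    SELECT val,
           ROW_NUMBER() OVER (ORDER BY val) AS row_num,
           COUNT(*) OVER () AS total_count
    FROM {source}
) temp
WHERE temp.row_num = ROUND(0.5 * temp.total_count)
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (val INTEGER)")
# Three NULLs out of five rows: NULLs outnumber half of the table.
conn.executemany(
    "INSERT INTO scores VALUES (?)", [(None,), (None,), (None,), (600,), (650,)]
)

# Without filtering, rank 3 of 5 lands on a NULL row.
unfiltered = conn.execute(MEDIAN_SQL.format(source="scores")).fetchone()[0]

# Filtering NULLs first leaves two rows, and rank ROUND(1.0) = 1 picks 600.
filtered_source = "(SELECT val FROM scores WHERE val IS NOT NULL) AS non_null"
filtered = conn.execute(MEDIAN_SQL.format(source=filtered_source)).fetchone()[0]
```

The filtered variant is only one possible shape for a follow-up; this PR leaves the existing NULL behaviour unchanged.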

fix(profiler): make MySQL median deterministic

The MySQL MedianFn returned non-deterministic values across runs on
identical data. Two bugs:

1. ROW_NUMBER() OVER () lacked a window ORDER BY, so row numbers were
   assigned in implementation-defined storage order, unrelated to the
   sorted column position the median needs.

2. The (SELECT @counter := COUNT(*) FROM tbl) t_count cross-join relied
   on user-variable side-effect ordering, which MySQL explicitly leaves
   undefined for expressions involving user variables.

Replaced with ROW_NUMBER() OVER (ORDER BY {col}) + COUNT(*) OVER () AS
total_count, matching the pattern Doris and SQLite dialects in this
same file already use. Both correlated (dimension_col) and
non-correlated branches updated symmetrically.

Transitive impact: firstQuartile, thirdQuartile, and
interQuartileRange all reuse MedianFn via PercentilMixin and become
deterministic on MySQL as a side effect.

Bug present since #10962 (2023-04-11). The original PR noted "Tested
only external to OM" — no in-tree integration test against actual
median values, so the 6 existing unit tests (which assert SQL strings)
all passed against the broken impl.

Verified locally: 10/10 sequential runs returned median=680 for
[600,650,680,720,750] post-fix; 3/3 returned mixed 680/650/650 pre-fix.
@IceS2 IceS2 requested a review from a team as a code owner April 29, 2026 11:37
Copilot AI review requested due to automatic review settings April 29, 2026 11:37
@github-actions github-actions Bot added the Ingestion and safe to test labels Apr 29, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR makes MySQL percentile/median computation deterministic in the ingestion profiler by eliminating MySQL user-variable side effects and ensuring window functions rank rows in a defined order.

Changes:

  • Add ORDER BY {col} to ROW_NUMBER() windows so row numbering matches the value ordering used for median selection.
  • Replace @counter := COUNT(*) (user-variable cross join) with COUNT(*) OVER () AS total_count in both correlated and non-correlated branches.

Comment on lines +195 to +200
ROW_NUMBER() OVER (ORDER BY {col}) AS row_num,
COUNT(*) OVER () AS total_count
FROM `{table}` AS median_inner
WHERE median_inner.{dimension_col} = `{table}`.{dimension_col}
ORDER BY {col}
) temp
- WHERE temp.row_num = ROUND({percentile} * @counter)
+ WHERE temp.row_num = ROUND({percentile} * temp.total_count)

Copilot AI Apr 29, 2026


The MySQL implementation still selects a single ranked row via ROUND(percentile * total_count), which does not match percentile_cont semantics for even-sized datasets (median/percentiles should interpolate/average between the two middle ranks). This can make MySQL results diverge from other dialects using percentile_cont and from the Pandas implementation (np.median). Consider switching to selecting the two adjacent ranks and returning AVG(col) (similar to the SQLite/Informix patterns) so even row counts return the midpoint rather than an arbitrary existing value.
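The difference raised here shows up on any even-sized input. The sketch below contrasts a pure-Python emulation of the `ROUND(0.5 * N)` row pick with an interpolating median. It uses the standard library's `statistics.median` in place of `np.median` to stay dependency-free, and `discrete_median` is a hypothetical helper that assumes rounding half away from zero.

```python
import math
import statistics


def discrete_median(values):
    """Emulate picking the single row at 1-based rank ROUND(0.5 * N).

    Assumes a non-empty input and rounding half away from zero.
    """
    ordered = sorted(values)
    rank = math.floor(0.5 * len(ordered) + 0.5)
    return ordered[rank - 1]


even_sized = [600, 650, 680, 720]
picked = discrete_median(even_sized)          # rank ROUND(2.0) = 2
interpolated = statistics.median(even_sized)  # mean of the two middle values
```

For four rows the discrete pick returns an existing value (the second-smallest), while the interpolating median returns the midpoint of the two middle values.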

Collaborator


I think this is relevant @IceS2

Comment on lines +214 to +218
ROW_NUMBER() OVER (ORDER BY {col}) AS row_num,
COUNT(*) OVER () AS total_count
FROM `{table}`
) temp
- WHERE temp.row_num = ROUND({percentile} * @counter)
+ WHERE temp.row_num = ROUND({percentile} * temp.total_count)

Copilot AI Apr 29, 2026


Same as correlated branch: using ROUND(percentile * temp.total_count) picks a single row. For even counts this differs from standard median/percentile interpolation (percentile_cont) and from the Pandas metric behavior. Consider computing the two neighboring rank positions and averaging them to produce consistent results across engines.
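One way the suggested averaging could look, for the median only, is sketched below. This is not the approach taken in this PR. It assumes `sqlite3` as a stand-in, a hypothetical table `scores`, and SQLite's integer division for `/`; in MySQL the `/` operator is not integer division, so a port would need `DIV` or `FLOOR`.

```python
import sqlite3

# For N rows, the two middle ranks are (N + 1) / 2 and (N + 2) / 2 under
# integer division; they coincide when N is odd.
AVERAGED_MEDIAN_SQL = """
SELECT AVG(val) FROM (
    SELECT val,
           ROW_NUMBER() OVER (ORDER BY val) AS row_num,
           COUNT(*) OVER () AS total_count
    FROM scores
) temp
WHERE temp.row_num IN ((temp.total_count + 1) / 2, (temp.total_count + 2) / 2)
"""


def averaged_median(values):
    """Return the mean of the middle one or two ranked values."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE scores (val INTEGER)")
        conn.executemany("INSERT INTO scores VALUES (?)", [(v,) for v in values])
        return conn.execute(AVERAGED_MEDIAN_SQL).fetchone()[0]
    finally:
        conn.close()


even_result = averaged_median([600, 650, 680, 720])
odd_result = averaged_median([600, 650, 680, 720, 750])
```

This covers only the 0.5 percentile; the quartiles would need a general interpolation between neighbouring ranks.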

test(profiler): add MySQL median integration test (regression sentinel)

Mirrors the existing test_median_mariadb.py shape — testcontainers spins
up a real MySQL 8.0 container, seeds 10 rows across 2 categories, then
asserts MedianFn returns the correct percentile-discrete value across
all six combinations (p=0.25/0.50/0.75 × non-correlated/dimension_col).

Two extra regression sentinels guarding against the pre-fix bugs:

- test_compiled_sql_uses_window_order_by — asserts ROW_NUMBER() OVER
  (ORDER BY ...) is in the generated SQL and the broken `OVER ()`
  pattern is absent.

- test_compiled_sql_avoids_user_variable_counter — asserts @counter
  is absent and COUNT(*) OVER () is present.

Plus a 10x determinism check
(test_median_non_correlated_deterministic_across_runs) that would have
flagged the original bug from #10962 had it existed at the time.
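The two SQL-string sentinels can be sketched as plain string checks. The SQL text and the helper names `uses_window_order_by` and `avoids_user_variable_counter` below are hypothetical and only shaped like the snippets quoted in this PR. Note that `COUNT(*) OVER ()` legitimately contains `OVER ()`, so the first check targets `ROW_NUMBER() OVER ()` specifically.

```python
import re

# Hypothetical compiled SQL after the fix.
fixed_sql = """
SELECT val FROM (
    SELECT val,
           ROW_NUMBER() OVER (ORDER BY val) AS row_num,
           COUNT(*) OVER () AS total_count
    FROM `scores`
) temp
WHERE temp.row_num = ROUND(0.5 * temp.total_count)
"""

# Hypothetical compiled SQL before the fix.
broken_sql = """
SELECT val FROM (
    SELECT val, ROW_NUMBER() OVER () AS row_num
    FROM `scores`, (SELECT @counter := COUNT(*) FROM `scores`) t_count
) temp
WHERE temp.row_num = ROUND(0.5 * @counter)
"""


def uses_window_order_by(sql):
    """True when ROW_NUMBER() has an ordered window and no empty one."""
    ordered = re.search(r"ROW_NUMBER\(\)\s+OVER\s*\(\s*ORDER BY\b", sql)
    unordered = re.search(r"ROW_NUMBER\(\)\s+OVER\s*\(\s*\)", sql)
    return ordered is not None and unordered is None


def avoids_user_variable_counter(sql):
    """True when the user variable is gone and the window count is present."""
    return "@counter" not in sql and "COUNT(*) OVER ()" in sql
```

String checks like these guard the query shape, while the value assertions against seeded data guard the actual results.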

The MySQL median is percentile-discrete (picks a row at ROUND(p*N))
whereas MariaDB's PERCENTILE_CONT interpolates — same seed data
produces different expected values across the two dialects, both
documented inline in the test.

Wait strategy uses LogMessageWaitStrategy("ready for connections")
.with_startup_timeout(120) — testcontainers' default regex expects
the message twice (which only MariaDB emits) and times out at 10s
before MySQL 8 finishes initializing.
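The timeout described above follows from how a "message appears twice" pattern behaves on a log that contains the message once. The sketch below uses illustrative regular expressions and made-up log excerpts; neither the exact default pattern used by testcontainers nor real container output is quoted in this PR.

```python
import re

# Illustrative patterns only (assumptions for this sketch).
EXPECTS_TWO = re.compile(r"ready for connections.*ready for connections", re.DOTALL)
EXPECTS_ONE = re.compile(r"ready for connections")

# Made-up log excerpts, not captured from real containers.
single_emission = "[Server] mysqld: ready for connections.\n"
double_emission = (
    "[Note] mysqld: ready for connections.\n"
    "[Note] restarting after initialization\n"
    "[Note] mysqld: ready for connections.\n"
)


def is_ready(pattern, logs):
    """Return True once the pattern is found anywhere in the accumulated logs."""
    return pattern.search(logs) is not None
```

A waiter polling with the two-occurrence pattern never succeeds on the single-emission log, so it runs until its timeout; matching a single occurrence and allowing a longer startup window avoids that.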
@github-actions
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

style(profiler): apply ruff format to MySQL median test

Single-line set comprehension per `make py_format` (CI checkstyle).
Copilot AI review requested due to automatic review settings April 29, 2026 12:28
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

# for the single "ready for connections" log line from the main server
# (the testcontainers default regex expects two occurrences which only
# MariaDB emits — MySQL emits one).
container = MySqlContainer(image="mysql:8.0", dbname="test_db").waiting_for(

Copilot AI Apr 29, 2026


The testcontainer image tag mysql:8.0 is floating and can change underneath CI (new patches, behavior changes, startup timing differences). Consider pinning to a specific patch version (or aligning with mysql:8.4.5 used in other MySQL integration tests) to improve reproducibility.

Suggested change
- container = MySqlContainer(image="mysql:8.0", dbname="test_db").waiting_for(
+ container = MySqlContainer(image="mysql:8.4.5", dbname="test_db").waiting_for(


gitar-bot Bot commented Apr 29, 2026

Code Review ✅ Approved

Enforces deterministic median calculation in the MySQL profiler by adding an ORDER BY clause. No issues found.



@github-actions
Contributor

🟡 Playwright Results — all passed (14 flaky)

✅ 3968 passed · ❌ 0 failed · 🟡 14 flaky · ⏭️ 86 skipped

Shard        Passed  Failed  Flaky  Skipped
🟡 Shard 1      298       0      1        4
🟡 Shard 2      739       0      6        8
🟡 Shard 3      753       0      2        7
🟡 Shard 4      757       0      2       18
✅ Shard 5      687       0      0       41
🟡 Shard 6      734       0      3        8
🟡 14 flaky test(s) (passed on retry)
  • Features/CustomizeDetailPage.spec.ts › Glossary Term - customization should work (shard 1, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event is created when description is updated (shard 2, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/Glossary/GlossaryExpandAllWithStatusFilter.spec.ts › Expand All with Approved filter shows all terms (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 2 retries)
  • Features/IncidentManager.spec.ts › Verify filters in Incident Manager's page (shard 2, 2 retries)
  • Features/IncidentManager.spec.ts › Next, Previous and page indicator (shard 2, 2 retries)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/UserProfileOnlineStatus.spec.ts › Should show "Active recently" for users active within last hour (shard 3, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Api Collection (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Set (shard 4, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/ServiceEntity.spec.ts › Tier Add, Update and Remove (shard 6, 1 retry)


How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@pmbrull pmbrull merged commit 8291e06 into main Apr 30, 2026
51 checks passed
@pmbrull pmbrull deleted the fix/mysql-profiler-median-determinism branch April 30, 2026 05:36
mohitjeswani01 pushed a commit to mohitjeswani01/OpenMetadata that referenced this pull request Apr 30, 2026
jatinmasaram pushed a commit to jatinmasaram/OpenMetadata that referenced this pull request May 2, 2026