Skip to content

FIX: Flaky tests in CI and mark stress tests#617

Merged
bewithgaurav merged 9 commits into
mainfrom
bewithgaurav/fix-flaky-ci-tests
Jun 3, 2026
Merged

FIX: Flaky tests in CI and mark stress tests#617
bewithgaurav merged 9 commits into
mainfrom
bewithgaurav/fix-flaky-ci-tests

Conversation

@bewithgaurav
Copy link
Copy Markdown
Collaborator

@bewithgaurav bewithgaurav commented Jun 2, 2026

Work Item / Issue Reference

AB#45424

GitHub Issue:


Summary

Fix all known flaky test failures in PR validation. Audited 30 recent builds — 12 had failures, all from 2 test files on macOS or ARM64 QEMU. Also cap macOS benchmark step to prevent job timeouts.


Audit (30 PR validation builds)

test_022 — GIL heartbeat (8/30 builds, macOS only)

Test Hits Cause
test_query_does_not_block_other_python_threads 5/30 Heartbeat tick threshold too aggressive for CI CPU contention
test_commit_does_not_block_other_python_threads 3/30 Same

test_013 — shutdown handle cleanup (7/30 builds, macOS + QEMU)

Test Hits Cause
test_dbc_handle_cleanup_at_shutdown 2/30 macOS timeout + QEMU SEGV
test_force_gc_finalization_order_issue 2/30 macOS timeout
test_aggressive_dbc_segfault_reproduction 1/30 macOS timeout
test_cleanup_connections_weakset_modification_during_iteration 1/30 macOS timeout
test_weakref_cleanup_at_shutdown 1/30 QEMU SEGV
test_cleanup_connections_scenarios 1/30 QEMU
test_exception_during_query_with_shutdown 1/30 QEMU SEGV (found during this PR)

macOS benchmark timeout (2/30 builds)

Benchmark step had no timeout. On macOS, it takes 10–28 min (vs 3 min Linux, 5 min Windows). Slow runs eat the 60 min job budget and kill the entire job.


Changes

tests/test_022_concurrent_query_gil_release.py

  • Mark both heartbeat tests with @pytest.mark.stress
  • Excluded from PR validation by pytest.ini addopts; runs in nightly stress pipeline

tests/test_013_SqlHandle_free_shutdown.py

  • Bump subprocess timeout from 5s/3s to 15s on all ODBC shutdown tests (fixes macOS TimeoutExpired)
  • Skip entire TestHandleFreeShutdown class on QEMU (verified 0/400 on native ARM64 — all subprocess shutdown tests can SEGV under QEMU user-mode emulation)

tests/conftest.py

  • Add shared is_qemu_emulated() helper and QEMU flag for reuse across test files

eng/pipelines/pr-validation-pipeline.yml

  • Add timeoutInMinutes: 20 on macOS benchmark step with continueOnError: true
  • Slow benchmarks are terminated gracefully without failing the build

What this does NOT change

  • No test logic or C++ changes
  • All tests run on every native platform (Linux x86, Windows, macOS, native ARM64)
  • Benchmark script unchanged — macOS benchmark variance (2.7x) is a separate investigation

Benchmark duration analysis (13 builds)

Platform Range Variance
Linux Ubuntu 3m06s – 3m24s 1.1x
Windows 4m35s – 6m12s 1.3x
macOS 10m28s – 28m39s 2.7x

Additional findings (not in this PR)

  • is_python_finalizing() in ddbc_bindings.cpp checks sys._is_finalizing which was renamed to sys.is_finalizing in Python 3.13+. Shutdown protection silently degrades on 3.13+, but atexit cleanup saves us. Separate fix needed.
  • macOS benchmark variance needs deeper investigation (runner contention vs code issue).

Copilot AI review requested due to automatic review settings June 2, 2026 07:17
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to prevent timing-sensitive/flaky tests from blocking pull requests by classifying them as stress tests and explicitly excluding stress-marked tests from the PR validation pipeline’s pytest runs.

Changes:

  • Mark the two GIL-release heartbeat tests in test_022_concurrent_query_gil_release.py with @pytest.mark.stress.
  • Add an explicit -m "not stress" filter to all pytest invocations in eng/pipelines/pr-validation-pipeline.yml as defense-in-depth.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File Description
tests/test_022_concurrent_query_gil_release.py Marks two timing-sensitive concurrency tests as stress so they’re excluded from default PR test runs.
eng/pipelines/pr-validation-pipeline.yml Updates PR pipeline pytest commands to explicitly exclude stress tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread eng/pipelines/pr-validation-pipeline.yml Outdated
Comment thread eng/pipelines/pr-validation-pipeline.yml Outdated
Comment thread eng/pipelines/pr-validation-pipeline.yml Outdated
Comment thread eng/pipelines/pr-validation-pipeline.yml Outdated
Comment thread eng/pipelines/pr-validation-pipeline.yml Outdated
Comment thread eng/pipelines/pr-validation-pipeline.yml Outdated
Comment thread tests/test_022_concurrent_query_gil_release.py Outdated
@bewithgaurav bewithgaurav force-pushed the bewithgaurav/fix-flaky-ci-tests branch from b1aed71 to d4d5d61 Compare June 2, 2026 07:30
@bewithgaurav bewithgaurav changed the title TEST: Exclude stress-marked tests from PR validation pipeline CHORE: Exclude stress-marked tests from PR validation pipeline Jun 2, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 2, 2026

📊 Code Coverage Report

🔥 Diff Coverage

100%


🎯 Overall Coverage

80%


📈 Total Lines Covered: 6654 out of 8256
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

No lines with coverage information in this diff.


📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 59.7%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 76.1%
mssql_python.row.py: 76.9%
mssql_python.__init__.py: 77.3%
mssql_python.pybind.connection.connection.cpp: 77.3%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.logging.py: 85.5%
mssql_python.connection.py: 85.6%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

@github-actions github-actions Bot added the pr-size: small Minimal code update label Jun 2, 2026
jahnvi480
jahnvi480 previously approved these changes Jun 2, 2026
Copy link
Copy Markdown
Contributor

@jahnvi480 jahnvi480 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@bewithgaurav bewithgaurav changed the title CHORE: Exclude stress-marked tests from PR validation pipeline FIX: Eliminate CI flakiness by excluding stress-marked tests from PR validation pipeline Jun 3, 2026
@bewithgaurav bewithgaurav force-pushed the bewithgaurav/fix-flaky-ci-tests branch from 0ffec34 to eadd358 Compare June 3, 2026 06:03
@bewithgaurav bewithgaurav changed the title FIX: Eliminate CI flakiness by excluding stress-marked tests from PR validation pipeline FIX: Eliminate CI flakiness by marking stress tests Jun 3, 2026
@bewithgaurav bewithgaurav changed the title FIX: Eliminate CI flakiness by marking stress tests FIX: Flaky tests in CI and mark stress tests Jun 3, 2026
@bewithgaurav bewithgaurav changed the title FIX: Flaky tests in CI and mark stress tests FIX: Flaky tests in CI and mark stress tests (PR-Validation-Pipeline) Jun 3, 2026
@bewithgaurav bewithgaurav changed the title FIX: Flaky tests in CI and mark stress tests (PR-Validation-Pipeline) FIX: Flaky tests in CI and mark stress tests Jun 3, 2026
bewithgaurav and others added 5 commits June 3, 2026 12:28
Mark test_query_does_not_block_other_python_threads and
test_commit_does_not_block_other_python_threads with @pytest.mark.stress.

These tests use timing thresholds that flake on macOS CI (especially
pre-release Python 3.14) due to sleep() overshoot and GIL re-acquisition
latency. pytest.ini addopts already excludes stress-marked tests from
default runs; the nightly stress-test-pipeline covers them.

Update module docstring to reflect the new classification.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The ODBC handle teardown during Python shutdown on macOS CI runners
occasionally exceeds the 5s subprocess timeout, causing
TimeoutExpired failures in test_013_SqlHandle_free_shutdown.py.

Bump from 5s to 15s for all 12 ODBC-exercising subprocess tests.
Fast environments still finish in 1- the timeout is just a ceiling.2s
Leave the timeout=3 (mock/unit tests) and timeout=10 unchanged.

This was the #1 cause of flaky reruns, hitting 4/5 recent failing builds.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add QEMU user-mode emulation detection via /proc/cpuinfo CPU
  implementer 0x51. Skip 4 tests that SIGSEGV under QEMU but pass
  on all native platforms (0/400 locally).

- Bump remaining timeout=3 subprocess tests to 15s (these also
  flaked on macOS  test_cleanup_connections_scenarios andCI
  test_cleanup_connections_weakset_modification_during_iteration).

- Replace @pytest.mark.stress on test_022 with lower heartbeat
 15%). CI worst case was 12 ticks (30%);
  15% threshold (6 ticks) gives 2x margin while still catching
  real GIL starvation (0-2 ticks). Tests stay in PR validation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move is_qemu_emulated() and QEMU flag to tests/conftest.py so it is
available to any test file. Remove duplicate from test_013.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mark both test_022 heartbeat tests with @pytest.mark.stress.
pytest.ini addopts already excludes stress tests from PR validation;
the nightly stress-test-pipeline covers them.

No threshold or docstring  original test logic preserved.changes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bewithgaurav bewithgaurav force-pushed the bewithgaurav/fix-flaky-ci-tests branch from da9f7b4 to 39f8b92 Compare June 3, 2026 06:59
bewithgaurav and others added 3 commits June 3, 2026 12:34
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
All subprocess shutdown tests can SIGSEGV under QEMU user-mode
emulation. Instead of whack-a-mole on individual tests, skip the
whole class. Tests still run on all 15+ native environments.

Remove individual skipif markers now covered by class-level skip.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
macOS CI runners have 2.7x variance on benchmark duration (10-28min)
compared to Linux (3min) and Windows (5min). Without a step timeout,
a slow benchmark run eats the entire 60min job budget and kills the
job. With timeoutInMinutes: 20 + continueOnError: true, a slow
benchmark is terminated gracefully without failing the overall build.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bewithgaurav bewithgaurav merged commit 491d7f5 into main Jun 3, 2026
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-size: small Minimal code update

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants