FIX: Flaky tests in CI and mark stress tests #617
Merged
Pull request overview
This PR aims to prevent timing-sensitive/flaky tests from blocking pull requests by classifying them as stress tests and explicitly excluding stress-marked tests from the PR validation pipeline’s pytest runs.
Changes:
- Mark the two GIL-release heartbeat tests in `test_022_concurrent_query_gil_release.py` with `@pytest.mark.stress`.
- Add an explicit `-m "not stress"` filter to all pytest invocations in `eng/pipelines/pr-validation-pipeline.yml` as defense-in-depth.
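As a rough illustration of the marker-based split described above, the following sketch shows one test carrying the `stress` marker and one without it. The test names and bodies are hypothetical placeholders, not the tests from `test_022_concurrent_query_gil_release.py`; only the `@pytest.mark.stress` decorator and the `-m "not stress"` filter come from the PR.

```python
import pytest


@pytest.mark.stress  # deselected when pytest runs with: -m "not stress"
def test_timing_sensitive_placeholder():
    # Hypothetical stand-in for a timing-sensitive heartbeat test.
    assert sum(range(10)) == 45


def test_regular_placeholder():
    # Unmarked tests are still selected under: -m "not stress"
    assert "stress".upper() == "STRESS"
```

Running `pytest -m "not stress"` on a file like this would be expected to select only the unmarked test, while `pytest -m stress` would select only the marked one.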
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `tests/test_022_concurrent_query_gil_release.py` | Marks two timing-sensitive concurrency tests as stress so they’re excluded from default PR test runs. |
| `eng/pipelines/pr-validation-pipeline.yml` | Updates PR pipeline pytest commands to explicitly exclude stress tests. |
b1aed71 to
d4d5d61
Compare
📊 Code Coverage Report
Diff Coverage
Diff: main...HEAD, staged and unstaged changes
No lines with coverage information in this diff.
📋 Files Needing Attention
📉 Files with overall lowest coverage
- mssql_python.pybind.logger_bridge.cpp: 59.2%
- mssql_python.pybind.ddbc_bindings.h: 59.7%
- mssql_python.pybind.logger_bridge.hpp: 70.8%
- mssql_python.pybind.ddbc_bindings.cpp: 76.1%
- mssql_python.row.py: 76.9%
- mssql_python.__init__.py: 77.3%
- mssql_python.pybind.connection.connection.cpp: 77.3%
- mssql_python.ddbc_bindings.py: 79.6%
- mssql_python.logging.py: 85.5%
- mssql_python.connection.py: 85.6%
0ffec34 to
eadd358
Compare
Mark test_query_does_not_block_other_python_threads and test_commit_does_not_block_other_python_threads with @pytest.mark.stress. These tests use timing thresholds that flake on macOS CI (especially pre-release Python 3.14) due to sleep() overshoot and GIL re-acquisition latency. pytest.ini addopts already excludes stress-marked tests from default runs; the nightly stress-test-pipeline covers them. Update module docstring to reflect the new classification. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The ODBC handle teardown during Python shutdown on macOS CI runners occasionally exceeds the 5s subprocess timeout, causing TimeoutExpired failures in test_013_SqlHandle_free_shutdown.py. Bump from 5s to 15s for all 12 ODBC-exercising subprocess tests. Fast environments still finish in 1-2s; the timeout is just a ceiling. Leave the timeout=3 (mock/unit tests) and timeout=10 tests unchanged. This was the #1 cause of flaky reruns, hitting 4/5 recent failing builds. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
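A minimal sketch of the subprocess pattern this commit message describes, assuming the tests launch a fresh interpreter with `subprocess.run` and a `timeout` argument. The helper name `run_shutdown_script` and the constant `SHUTDOWN_TIMEOUT_S` are hypothetical; only the 15s ceiling and the `TimeoutExpired` failure mode come from the message.

```python
import subprocess
import sys

# Ceiling taken from the commit message; fast environments finish well under it.
SHUTDOWN_TIMEOUT_S = 15


def run_shutdown_script(code: str, timeout: float = SHUTDOWN_TIMEOUT_S) -> subprocess.CompletedProcess:
    """Run `code` in a fresh interpreter so its shutdown path is exercised."""
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,  # subprocess.TimeoutExpired is raised if this is exceeded
    )


result = run_shutdown_script("print('done')")
```

Because the timeout is only an upper bound, raising it does not slow down runs that already finish quickly; it only changes how long a slow teardown is tolerated before the test fails.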
- Add QEMU user-mode emulation detection via /proc/cpuinfo CPU implementer 0x51. Skip 4 tests that SIGSEGV under QEMU but pass on all native platforms (0/400 locally).
- Bump remaining timeout=3 subprocess tests to 15s (these also flaked on macOS CI: test_cleanup_connections_scenarios and test_cleanup_connections_weakset_modification_during_iteration).
- Replace @pytest.mark.stress on test_022 with a lower heartbeat threshold (15%). CI worst case was 12 ticks (30%); the 15% threshold (6 ticks) gives 2x margin while still catching real GIL starvation (0-2 ticks). Tests stay in PR validation.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
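The heartbeat threshold arithmetic in the commit message above can be sketched as follows. The figure of 40 expected ticks is inferred from the message (12 ticks is stated to be 30%, and 6 ticks to be 15%); the function names, the tick interval, and the use of `time.sleep` as a stand-in for a GIL-releasing driver call are assumptions for illustration, not the code in test_022.

```python
import threading
import time


def count_heartbeats(blocking_call, interval_s: float) -> int:
    """Count ticks produced by a background thread while blocking_call runs."""
    ticks = 0
    stop = threading.Event()

    def beat():
        nonlocal ticks
        # Event.wait returns False on timeout, so each timeout is one tick.
        while not stop.wait(interval_s):
            ticks += 1

    worker = threading.Thread(target=beat, daemon=True)
    worker.start()
    try:
        blocking_call()
    finally:
        stop.set()
        worker.join()
    return ticks


def min_required_ticks(expected_ticks: int, threshold: float) -> int:
    """Smallest tick count accepted for a given fractional threshold."""
    return round(expected_ticks * threshold)


EXPECTED_TICKS = 40  # inferred: 12 ticks == 30% and 6 ticks == 15%
required = min_required_ticks(EXPECTED_TICKS, 0.15)

# time.sleep releases the GIL, so the heartbeat thread keeps ticking here.
observed = count_heartbeats(lambda: time.sleep(0.2), interval_s=0.005)
```

A call that held the GIL for its whole duration would leave the heartbeat thread with close to zero ticks, which is the starvation case the threshold is meant to catch.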
Move is_qemu_emulated() and QEMU flag to tests/conftest.py so it is available to any test file. Remove duplicate from test_013. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
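For illustration, a helper along these lines could perform the `/proc/cpuinfo` check described in the commits. This is a sketch under assumptions: the actual `is_qemu_emulated()` in `tests/conftest.py` is not shown in this conversation and may parse the file differently; the `cpuinfo_indicates_qemu` helper and the `cpuinfo_path` parameter are hypothetical additions that make the logic testable.

```python
from pathlib import Path


def cpuinfo_indicates_qemu(cpuinfo_text: str) -> bool:
    """Return True if any 'CPU implementer' field equals 0x51."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "CPU implementer":
            if value.strip().lower() == "0x51":
                return True
    return False


def is_qemu_emulated(cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    """Best-effort detection; returns False where the file is unavailable."""
    try:
        text = Path(cpuinfo_path).read_text(errors="replace")
    except OSError:
        # No readable cpuinfo (for example on macOS or Windows).
        return False
    return cpuinfo_indicates_qemu(text)


# Module-level flag, computed once at import time.
QEMU = is_qemu_emulated()
```

Keeping the text parsing separate from the file read lets the check be exercised with sample strings on any platform.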
Mark both test_022 heartbeat tests with @pytest.mark.stress. pytest.ini addopts already excludes stress tests from PR validation; the nightly stress-test-pipeline covers them. No threshold or docstring changes; original test logic preserved. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
da9f7b4 to
39f8b92
Compare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
All subprocess shutdown tests can SIGSEGV under QEMU user-mode emulation. Instead of whack-a-mole on individual tests, skip the whole class. Tests still run on all 15+ native environments. Remove individual skipif markers now covered by class-level skip. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
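A class-level skip of the kind described here might look like the sketch below. The `QEMU` value is a hard-coded placeholder (the PR computes the flag in `tests/conftest.py`), and the test method is a hypothetical stand-in; only the class name `TestHandleFreeShutdown` and the idea of one `skipif` on the class come from the PR.

```python
import pytest

# Placeholder value; in the PR this flag is computed in tests/conftest.py.
QEMU = False


@pytest.mark.skipif(
    QEMU,
    reason="subprocess shutdown tests can SIGSEGV under QEMU user-mode emulation",
)
class TestHandleFreeShutdown:
    def test_placeholder(self):
        # Hypothetical stand-in for the real subprocess shutdown tests.
        assert isinstance(QEMU, bool)
```

A marker applied to the class is inherited by every test method in it, which is why the individual per-test `skipif` markers become redundant.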
macOS CI runners have 2.7x variance on benchmark duration (10-28min) compared to Linux (3min) and Windows (5min). Without a step timeout, a slow benchmark run eats the entire 60min job budget and kills the job. With timeoutInMinutes: 20 + continueOnError: true, a slow benchmark is terminated gracefully without failing the overall build. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jahnvi480
approved these changes
Jun 3, 2026
subrata-ms
approved these changes
Jun 3, 2026
Work Item / Issue Reference
Summary
Fix all known flaky test failures in PR validation. Audited 30 recent builds — 12 had failures, all from 2 test files on macOS or ARM64 QEMU. Also cap macOS benchmark step to prevent job timeouts.
Audit (30 PR validation builds)
test_022: GIL heartbeat (8/30 builds, macOS only)
- test_query_does_not_block_other_python_threads
- test_commit_does_not_block_other_python_threads
test_013: shutdown handle cleanup (7/30 builds, macOS + QEMU)
- test_dbc_handle_cleanup_at_shutdown
- test_force_gc_finalization_order_issue
- test_aggressive_dbc_segfault_reproduction
- test_cleanup_connections_weakset_modification_during_iteration
- test_weakref_cleanup_at_shutdown
- test_cleanup_connections_scenarios
- test_exception_during_query_with_shutdown
macOS benchmark timeout (2/30 builds)
Benchmark step had no timeout. On macOS, it takes 10–28 min (vs 3 min Linux, 5 min Windows). Slow runs eat the 60 min job budget and kill the entire job.
Changes
| File | Change |
|---|---|
| `tests/test_022_concurrent_query_gil_release.py` | Mark both heartbeat tests with `@pytest.mark.stress`; excluded by `pytest.ini` `addopts`; runs in nightly stress pipeline |
| `tests/test_013_SqlHandle_free_shutdown.py` | Skip `TestHandleFreeShutdown` class on QEMU (verified 0/400 on native ARM64; all subprocess shutdown tests can SEGV under QEMU user-mode emulation) |
| `tests/conftest.py` | Add `is_qemu_emulated()` helper and `QEMU` flag for reuse across test files |
| `eng/pipelines/pr-validation-pipeline.yml` | Add `timeoutInMinutes: 20` on macOS benchmark step with `continueOnError: true` |
What this does NOT change
Benchmark duration analysis (13 builds)
Additional findings (not in this PR)
`is_python_finalizing()` in `ddbc_bindings.cpp` checks `sys._is_finalizing`, which was renamed to `sys.is_finalizing` in Python 3.13+. Shutdown protection silently degrades on 3.13+, but `atexit` cleanup saves us. Separate fix needed.
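As a Python-level illustration only (the actual check lives in C++ in `ddbc_bindings.cpp` and is not shown here), a lookup that tolerates either attribute name could be sketched like this. The function name `python_is_finalizing` is hypothetical, and this is not the separate fix mentioned above.

```python
import sys


def python_is_finalizing() -> bool:
    """Report interpreter finalization, trying both attribute names."""
    for name in ("is_finalizing", "_is_finalizing"):
        check = getattr(sys, name, None)
        if callable(check):
            return bool(check())
    # Neither attribute exists: assume the interpreter is not finalizing.
    return False
```

Trying both names avoids silently falling through to the "not finalizing" default when only one of them is present on a given interpreter.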