Migrate integration tests to exact MCP responses#70
Merged
Conversation
Restructure the real-binary Python integration harness so each test calls the explicit MCP client methods and compares parsed tool responses against expected dictionaries built with small helpers. This keeps assertions on the public MCP result shape instead of regexing or loosely parsing response text. Validation: - python3 -m unittest tests/test_run_integration_tests.py - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - cargo +nightly fmt - git diff --check
The timeout-dependent Python integration cases no longer assert exact output captured before a fixed one-second deadline. The interrupt case now polls busy responses until INTERRUPT_READY is observed, and the gated bundle case now checks stable busy/no-bundle facts plus final transcript contents. Validation: - python3 -m unittest tests/test_run_integration_tests.py - python3 -m py_compile tests/run_integration_tests.py tests/test_run_integration_tests.py - git diff --check - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl --case r-interrupt-restart-prefixes --case r-output-bundle-files - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - cargo +nightly fmt Review finding addressed: - [P2] Wait for interrupt readiness before exact checks — /Users/tomasz/github/posit-dev/mcp-repl/tests/run_integration_tests.py:576-577 When this runs on a slower or loaded runner, the 1s timeout can expire before R has echoed all setup lines or printed `INTERRUPT_READY`; the previous helper polled up to 20s for that marker. The new exact assertion then fails even though the worker is correctly still busy, so keep the readiness polling or assert the exact response only after the marker is observed. Response: added busy-response polling for `INTERRUPT_READY` before sending the interrupt, with helper coverage for polling and early-finish failure. Review finding addressed: - [P2] Avoid pinning output split across timeout — /Users/tomasz/github/posit-dev/mcp-repl/tests/run_integration_tests.py:665-666 When the gated bundle case runs on a slow runner, this 1s timeout can fire before or after a different amount of the pre-gate output has been captured. The exact assertion here, and the hard-coded poll preview below, then fail although the behavior is correct; assert stable facts like busy/no bundle and final transcript contents instead of requiring a fixed split across the timeout boundary. Response: replaced the timeout-bound exact setup and poll-preview assertions with success, busy/no-bundle, settled, and full-transcript content checks.
Handle transient busy responses after a Ctrl-C prefixed tail in the r-interrupt-restart-prefixes public API case by polling until the expected interrupt output settles. Add a Python regression test for the runner path that returns busy for the Ctrl-C tail before producing interrupt received and AFTER_INTERRUPT on the next empty poll. Review finding: - [P2] Poll after Ctrl-C busy replies -- /Users/tomasz/github/posit-dev/mcp-repl/tests/run_integration_tests.py:600-608 When the Ctrl-C follow-up returns a transient busy response, this exact assertion fails immediately instead of draining until the worker settles. This scenario is already handled in the Windows-specific interrupt test by polling after `\u0003...`, and the Python public API suite runs on Windows CI, so `r-interrupt-restart-prefixes` can fail even though the worker later produces `interrupt received` and `AFTER_INTERRUPT`. Addressed by polling after busy Ctrl-C-tail replies before asserting the settled public API output. Validation: - python3 -m unittest tests.test_run_integration_tests.RunIntegrationTestsCaseTests.test_r_interrupt_restart_prefixes_polls_after_transient_busy_interrupt - python3 -m unittest tests.test_run_integration_tests - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - cargo +nightly fmt - git diff --check
Format long R snippets and busy-response fixtures with textwrap.dedent() so expected test inputs stay indented with the surrounding Python code. Validation: - python3 tests/test_run_integration_tests.py - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - cargo +nightly fmt
Stop hiding macOS sandbox availability behind CODEX_SANDBOX and probe sandbox-exec with a policy that can actually execute the test command. Convert the sandbox availability skips to assertions, remove the Codex full-access TUI env gate, and make Python/Codex outside-workspace probes target home-directory paths instead of temp paths that the sandbox intentionally permits. Allow macOS sandboxed workers to write inherited stdout/stderr through /dev/stdout, /dev/stderr, and /dev/fd/[12], which surfaced once the default sandboxed test paths were enabled. Validation: - cargo +nightly fmt - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - MCP_REPL_CODEX_BACKEND=mock cargo test -j 1 --test codex_integration codex_exec_auto_backend_smoke -- --test-threads=1
Run the reticulate/keras JAX probe once on the host before entering the
network-disabled sandbox. If the host probe cannot materialize the backend
because Rscript, reticulate, keras3, or the backend cache/download setup is
unavailable, skip the sandbox test before the sandboxed assertion.
Reuse the same probe inside the sandbox so the test checks the cached backend
path without depending on network access inside the worker sandbox.
Also remove the macOS Seatbelt /dev/stdout, /dev/stderr, and /dev/fd alias
allowance from this PR. There is no real package or user workflow here that
justifies reopening inherited stdio by path, so keep that denied and make the
artificial direct-fd surface test skip when the sandbox rejects it.
Review finding:
[P2] Preserve skip for missing reticulate cache — /Users/tomasz/github/posit-dev/mcp-repl/tests/sandbox.rs:897-899
When `reticulate` and `keras3` are installed but the JAX backend is not already present in the reticulate/uv cache, this test now fails because the sandbox disables network and `use_backend("jax")` reports cache/download errors; those outcomes were previously skipped. Please keep detecting the offline-cache-missing case so `cargo test` does not depend on a prepopulated user cache.
Response:
The test now pre-populates the reticulate/keras JAX cache with a host-side
Rscript probe and skips before sandbox execution when that warm-up cannot
complete. The sandboxed test then validates the cached path with network access
disabled.
Validation:
- cargo test --test sandbox sandbox_reticulate_keras_backend -- --nocapture
- cargo test --test repl_surface direct_stdout_fd_write_falls_back_to_raw_stream -- --nocapture
- cargo +nightly fmt
- cargo check
- cargo build
- python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test --quiet
Change python_discovery_keeps_venv_probe_inside_sandbox to skip when MCP_REPL_PYTHON_EXECUTABLE is set, matching the adjacent Python discovery tests. The explicit executable override bypasses discovery, so the sandbox probe coverage is not meaningful in that environment. Review finding: - [P2] Skip discovery test under explicit Python override — /Users/tomasz/github/posit-dev/mcp-repl/tests/python_backend.rs:238-241 When `MCP_REPL_PYTHON_EXECUTABLE` is set, this assertion makes `cargo test --test python_backend python_discovery_keeps_venv_probe_inside_sandbox` fail before exercising sandbox discovery. That variable is a supported override and the adjacent Python discovery tests still skip because it bypasses discovery, so this test should skip or clear the env for the spawned server instead of panicking. Response: - Replaced the assertion with an explicit skip when the supported Python executable override is present. Validation: - MCP_REPL_PYTHON_EXECUTABLE=/bin/false cargo test --test python_backend python_discovery_keeps_venv_probe_inside_sandbox -- --exact --nocapture - cargo test --test python_backend python_discovery_keeps_venv_probe_inside_sandbox -- --exact --nocapture - cargo +nightly fmt - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet
Replace the passive outside-workspace marker probe with a fake venv Python that attempts the write from inside the sandbox and reports the caught exception through the public reply. Force discovery to use that fake candidate by clearing PATH to a temporary empty directory. The test still verifies that the marker file was not created, but now also proves the sandboxed Python side observed the write denial. Validation: cargo +nightly fmt; cargo check; cargo build; python3 tests/run_integration_tests.py --binary target/debug/mcp-repl; cargo clippy --all-targets --all-features -- -D warnings; cargo test --quiet
Replace sleep-based synchronization in the timeout spill and pager follow-up tests with host gate files and R-side marker files. This keeps the same public API coverage while making the macOS timing boundary deterministic under CI load. Validation: - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - cargo +nightly fmt
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR tightens the public integration runner so R integration cases assert exact MCP
tools/callresponse shapes instead of searching loosely through concatenated text. It adds normalized golden expectations for timeout/busy replies, output bundle paths, capped output previews, reset behavior, interrupt handling, and pager commands.The sandbox-focused tests are also made stricter and more representative: sandbox availability now fails fast where coverage depends on it, reticulate/keras JAX setup prewarms host caches before the sandboxed run, and Python discovery sandbox coverage now observes the failed write from the sandboxed Python side before verifying no marker file was created.
Public-facing changes
None intended. This changes test coverage for the public
replandrepl_resetMCP tool responses, but does not change runtime behavior.Internal-only changes
tests/run_integration_tests.pyaroundrepl()/repl_reset()helpers, exacttool_result()expectations, and response normalization..venv/bin/pythonthat attempts to write a marker in$HOME, asserts the sandboxed Python exception appears in the public reply, and verifies the marker file was not created./permissionsand probe outside-workspace writes under$HOME./dev/stdoutfallback test skippable when opening the fd path fails.Diff composition
Measured against
origin/main, this PR is803insertions and437deletions across11files.src/:+2/-0(0.2%of churn)tests/:+801/-437(99.8%of churn)Largest files:
tests/run_integration_tests.py:+393/-291tests/sandbox.rs:+102/-89tests/write_stdin_behavior.rs:+115/-23tests/test_run_integration_tests.py:+129/-0tests/python_backend.rs:+42/-10Testing
cargo +nightly fmtcargo checkcargo buildpython3 tests/run_integration_tests.py --binary target/debug/mcp-replcargo clippy --all-targets --all-features -- -D warningscargo test --quiet