Migrate integration tests to exact MCP responses by t-kalinowski · Pull Request #70 · posit-dev/mcp-repl

t-kalinowski · 2026-05-18T21:00:24Z

Summary

This PR tightens the public integration runner so R integration cases assert exact MCP tools/call response shapes instead of searching loosely through concatenated text. It adds normalized golden expectations for timeout/busy replies, output bundle paths, capped output previews, reset behavior, interrupt handling, and pager commands.

The sandbox-focused tests are also made stricter and more representative: sandbox availability now fails fast where coverage depends on it, reticulate/keras JAX setup prewarms host caches before the sandboxed run, and Python discovery sandbox coverage now observes the failed write from the sandboxed Python side before verifying no marker file was created.

Public-facing changes

None intended. This changes test coverage for the public repl and repl_reset MCP tool responses, but does not change runtime behavior.

Internal-only changes

Refactors tests/run_integration_tests.py around repl() / repl_reset() helpers, exact tool_result() expectations, and response normalization.
Adds unit coverage for the integration runner helpers and transient busy interrupt polling.
Tightens sandbox test setup and skip behavior across Rust integration tests.
Tightens Python discovery sandbox coverage with a fake .venv/bin/python that attempts to write a marker in $HOME, asserts the sandboxed Python exception appears in the public reply, and verifies the marker file was not created.
Stabilizes timeout, output-bundle, pager, and idle protocol tests by replacing timing sleeps with workspace-visible gates and deterministic delayed protocol errors.
Updates the Codex TUI permissions test to use /permissions and probe outside-workspace writes under $HOME.
Keeps the direct /dev/stdout fallback test skippable when opening the fd path fails.

Diff composition

Measured against origin/main, this PR is 803 insertions and 437 deletions across 11 files.

runtime src/: +2/-0 (0.2% of churn)
tests in tests/: +801/-437 (99.8% of churn)

Largest files:

tests/run_integration_tests.py: +393/-291
tests/sandbox.rs: +102/-89
tests/write_stdin_behavior.rs: +115/-23
tests/test_run_integration_tests.py: +129/-0
tests/python_backend.rs: +42/-10

Testing

cargo +nightly fmt
cargo check
cargo build
python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
cargo clippy --all-targets --all-features -- -D warnings
cargo test --quiet

Restructure the real-binary Python integration harness so each test calls the explicit MCP client methods and compares parsed tool responses against expected dictionaries built with small helpers. This keeps assertions on the public MCP result shape instead of regexing or loosely parsing response text. Validation: - python3 -m unittest tests/test_run_integration_tests.py - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - cargo +nightly fmt - git diff --check

The timeout-dependent Python integration cases no longer assert exact output captured before a fixed one-second deadline. The interrupt case now polls busy responses until INTERRUPT_READY is observed, and the gated bundle case now checks stable busy/no-bundle facts plus final transcript contents. Validation: - python3 -m unittest tests/test_run_integration_tests.py - python3 -m py_compile tests/run_integration_tests.py tests/test_run_integration_tests.py - git diff --check - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl --case r-interrupt-restart-prefixes --case r-output-bundle-files - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - cargo +nightly fmt Review finding addressed: - [P2] Wait for interrupt readiness before exact checks — /Users/tomasz/github/posit-dev/mcp-repl/tests/run_integration_tests.py:576-577 When this runs on a slower or loaded runner, the 1s timeout can expire before R has echoed all setup lines or printed `INTERRUPT_READY`; the previous helper polled up to 20s for that marker. The new exact assertion then fails even though the worker is correctly still busy, so keep the readiness polling or assert the exact response only after the marker is observed. Response: added busy-response polling for `INTERRUPT_READY` before sending the interrupt, with helper coverage for polling and early-finish failure. Review finding addressed: - [P2] Avoid pinning output split across timeout — /Users/tomasz/github/posit-dev/mcp-repl/tests/run_integration_tests.py:665-666 When the gated bundle case runs on a slow runner, this 1s timeout can fire before or after a different amount of the pre-gate output has been captured. The exact assertion here, and the hard-coded poll preview below, then fail although the behavior is correct; assert stable facts like busy/no bundle and final transcript contents instead of requiring a fixed split across the timeout boundary. Response: replaced the timeout-bound exact setup and poll-preview assertions with success, busy/no-bundle, settled, and full-transcript content checks.

Handle transient busy responses after a Ctrl-C prefixed tail in the r-interrupt-restart-prefixes public API case by polling until the expected interrupt output settles. Add a Python regression test for the runner path that returns busy for the Ctrl-C tail before producing interrupt received and AFTER_INTERRUPT on the next empty poll. Review finding: - [P2] Poll after Ctrl-C busy replies -- /Users/tomasz/github/posit-dev/mcp-repl/tests/run_integration_tests.py:600-608 When the Ctrl-C follow-up returns a transient busy response, this exact assertion fails immediately instead of draining until the worker settles. This scenario is already handled in the Windows-specific interrupt test by polling after `\u0003...`, and the Python public API suite runs on Windows CI, so `r-interrupt-restart-prefixes` can fail even though the worker later produces `interrupt received` and `AFTER_INTERRUPT`. Addressed by polling after busy Ctrl-C-tail replies before asserting the settled public API output. Validation: - python3 -m unittest tests.test_run_integration_tests.RunIntegrationTestsCaseTests.test_r_interrupt_restart_prefixes_polls_after_transient_busy_interrupt - python3 -m unittest tests.test_run_integration_tests - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - cargo +nightly fmt - git diff --check

Format long R snippets and busy-response fixtures with textwrap.dedent() so expected test inputs stay indented with the surrounding Python code. Validation: - python3 tests/test_run_integration_tests.py - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - cargo +nightly fmt

Stop hiding macOS sandbox availability behind CODEX_SANDBOX and probe sandbox-exec with a policy that can actually execute the test command. Convert the sandbox availability skips to assertions, remove the Codex full-access TUI env gate, and make Python/Codex outside-workspace probes target home-directory paths instead of temp paths that the sandbox intentionally permits. Allow macOS sandboxed workers to write inherited stdout/stderr through /dev/stdout, /dev/stderr, and /dev/fd/[12], which surfaced once the default sandboxed test paths were enabled. Validation: - cargo +nightly fmt - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - MCP_REPL_CODEX_BACKEND=mock cargo test -j 1 --test codex_integration codex_exec_auto_backend_smoke -- --test-threads=1

Run the reticulate/keras JAX probe once on the host before entering the network-disabled sandbox. If the host probe cannot materialize the backend because Rscript, reticulate, keras3, or the backend cache/download setup is unavailable, skip the sandbox test before the sandboxed assertion. Reuse the same probe inside the sandbox so the test checks the cached backend path without depending on network access inside the worker sandbox. Also remove the macOS Seatbelt /dev/stdout, /dev/stderr, and /dev/fd alias allowance from this PR. There is no real package or user workflow here that justifies reopening inherited stdio by path, so keep that denied and make the artificial direct-fd surface test skip when the sandbox rejects it. Review finding: [P2] Preserve skip for missing reticulate cache — /Users/tomasz/github/posit-dev/mcp-repl/tests/sandbox.rs:897-899 When `reticulate` and `keras3` are installed but the JAX backend is not already present in the reticulate/uv cache, this test now fails because the sandbox disables network and `use_backend("jax")` reports cache/download errors; those outcomes were previously skipped. Please keep detecting the offline-cache-missing case so `cargo test` does not depend on a prepopulated user cache. Response: The test now pre-populates the reticulate/keras JAX cache with a host-side Rscript probe and skips before sandbox execution when that warm-up cannot complete. The sandboxed test then validates the cached path with network access disabled. Validation: - cargo test --test sandbox sandbox_reticulate_keras_backend -- --nocapture - cargo test --test repl_surface direct_stdout_fd_write_falls_back_to_raw_stream -- --nocapture - cargo +nightly fmt - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet

Change python_discovery_keeps_venv_probe_inside_sandbox to skip when MCP_REPL_PYTHON_EXECUTABLE is set, matching the adjacent Python discovery tests. The explicit executable override bypasses discovery, so the sandbox probe coverage is not meaningful in that environment. Review finding: - [P2] Skip discovery test under explicit Python override — /Users/tomasz/github/posit-dev/mcp-repl/tests/python_backend.rs:238-241 When `MCP_REPL_PYTHON_EXECUTABLE` is set, this assertion makes `cargo test --test python_backend python_discovery_keeps_venv_probe_inside_sandbox` fail before exercising sandbox discovery. That variable is a supported override and the adjacent Python discovery tests still skip because it bypasses discovery, so this test should skip or clear the env for the spawned server instead of panicking. Response: - Replaced the assertion with an explicit skip when the supported Python executable override is present. Validation: - MCP_REPL_PYTHON_EXECUTABLE=/bin/false cargo test --test python_backend python_discovery_keeps_venv_probe_inside_sandbox -- --exact --nocapture - cargo test --test python_backend python_discovery_keeps_venv_probe_inside_sandbox -- --exact --nocapture - cargo +nightly fmt - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet

Replace the passive outside-workspace marker probe with a fake venv Python that attempts the write from inside the sandbox and reports the caught exception through the public reply. Force discovery to use that fake candidate by clearing PATH to a temporary empty directory. The test still verifies that the marker file was not created, but now also proves the sandboxed Python side observed the write denial. Validation: cargo +nightly fmt; cargo check; cargo build; python3 tests/run_integration_tests.py --binary target/debug/mcp-repl; cargo clippy --all-targets --all-features -- -D warnings; cargo test --quiet

Replace sleep-based synchronization in the timeout spill and pager follow-up tests with host gate files and R-side marker files. This keeps the same public API coverage while making the macOS timing boundary deterministic under CI load. Validation: - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - cargo +nightly fmt

t-kalinowski added 13 commits May 18, 2026 12:09

Increase timeout duration in busy recovery test for improved stability

85752e3

Use workspace-visible timeout test gates

1dd65c4

Stabilize idle protocol error test

0090179

Gate output bundle cleanup tests

c5a487e

t-kalinowski marked this pull request as ready for review May 18, 2026 23:06

t-kalinowski merged commit 2dbc6ac into main May 18, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Migrate integration tests to exact MCP responses#70

Migrate integration tests to exact MCP responses#70
t-kalinowski merged 13 commits into
mainfrom
golden-mcp-integration-responses

t-kalinowski commented May 18, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

t-kalinowski commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Public-facing changes

Internal-only changes

Diff composition

Testing

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

t-kalinowski commented May 18, 2026 •

edited

Loading