fix(windows): use OpenProcess for liveness check in pid.py (#166)#176
Merged
Conversation
`conductor stop` (and `--all` / `--port`) crashes on Windows whenever a
PID file exists in `~/.conductor/runs/`:
OSError: [WinError 11] An attempt was made to load a program with
an incorrect format
The crash happens before any process is stopped, making the command
unusable on Windows.
Root cause
----------
`_is_process_alive` used the Unix idiom `os.kill(pid, 0)` to probe
existence without sending a signal. On Windows that idiom is *not* a
no-op: per the CPython docs, any signal value other than CTRL_C_EVENT /
CTRL_BREAK_EVENT routes through TerminateProcess, and depending on the
target process's bitness/image type can raise OSError subclasses that
neither `ProcessLookupError` nor `PermissionError` cover (e.g. WinError
11 / ERROR_BAD_FORMAT). The exception escaped out of `read_pid_files`
and crashed the CLI. Even on a "successful" call it would actually
terminate the target process with exit code 0.
Fix
---
Split `_is_process_alive` into a platform dispatcher and two
implementations:
* `_is_process_alive_posix` keeps `os.kill(pid, 0)` and adds a
defensive `OSError` catch that logs at debug and returns True.
Treating an unexpected error as "assume alive" prevents a transient
OS hiccup from silently dropping a still-running workflow's PID file
or crashing `conductor stop`.
* `_is_process_alive_windows` uses ctypes to call
`OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, ...)` followed by
`GetExitCodeProcess`. Maps `ERROR_INVALID_PARAMETER` to "not alive",
`ERROR_ACCESS_DENIED` to "alive", and any other failure to "assume
alive" with a debug log line.
Tests
-----
* `TestIsProcessAlivePosix` — covers alive, `ProcessLookupError`,
`PermissionError`, and the regression case (generic `OSError`).
* `TestIsProcessAliveDispatch` — verifies the dispatcher routes to the
right implementation based on `sys.platform`.
* `TestIsProcessAliveWindows` — Windows-only smoke tests gated with
`pytest.mark.skipif`.
* `TestReadPidFilesDoesNotCrashOnOsError` — full integration through
`read_pid_files` simulating the Windows-style failure on the POSIX
path.
Fixes #166
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Follow-up from PR review of #166. Hardens the original fix and addresses the highest-leverage findings from the comment-analyzer, pr-test-analyzer, silent-failure-hunter, code-reviewer, and code-simplifier passes. src/conductor/cli/pid.py ------------------------ * Hoist Windows ctypes setup (WinDLL load, argtypes/restype, magic numbers) to module level under `if sys.platform == "win32":`. Matches the project idiom used in cli/app.py, cli/update.py, and cli/bg_runner.py; eliminates per-call work; gives the magic numbers greppable names (_PROCESS_QUERY_LIMITED_INFORMATION, _STILL_ACTIVE, _ERROR_ACCESS_DENIED, _ERROR_INVALID_PARAMETER); collapses three `# type: ignore` comments. Also exposes a single `_kernel32` symbol that tests can patch from any platform. * Promote the three "unexpected error -> assume alive" log lines from DEBUG to WARNING (POSIX `except OSError`, Windows unrecognized OpenProcess error, Windows GetExitCodeProcess failure). The default conductor log level is INFO+, so DEBUG hid operationally important information about probe failures. * Include `ctypes.FormatError(err)` in the Windows error log lines so the human-readable Windows error string is captured alongside the numeric code. * Document the well-known STILL_ACTIVE = 259 collision in the `_is_process_alive_windows` docstring (a process exiting with code 259 is reported as still alive forever; we accept this because conductor children don't exit with that code). * Clarify the POSIX docstring to note the regression-fix behavior (intentionally swallows generic OSError to keep `conductor stop` working). * Keep ERROR_ACCESS_DENIED on Windows at DEBUG since it's an expected condition (cross-session or higher-integrity targets), and add a log line so it's still discoverable. src/conductor/cli/app.py ------------------------ * Mirror the same defensive `except OSError` + `logger.warning` pattern in `_stop_process`. The original PR fixed the crash in `_is_process_alive` but `_stop_process` had the same uncaught `OSError` shape one frame deeper. The "assume alive" fallback in `_is_process_alive_windows` makes it strictly more likely that probe-failing PIDs reach `_stop_process`, so this hardening is required to fully close issue #166 across both call sites. * Print a yellow user-facing message and remove the PID file anyway, rather than crash. * Add module-level `logger = logging.getLogger(__name__)`. tests/test_cli/test_pid.py -------------------------- * New `TestIsProcessAliveWindowsMocked` (7 tests, runs on every platform) — mocks `_kernel32` and `ctypes.get_last_error` to cover every branch of `_is_process_alive_windows`: - running process (STILL_ACTIVE) -> True - exited process (exit code 0) -> False - exited with non-zero exit code -> False - OpenProcess ERROR_ACCESS_DENIED -> True - OpenProcess ERROR_INVALID_PARAMETER -> False - OpenProcess unrecognized error -> True (assume alive) - GetExitCodeProcess fails -> True + CloseHandle Includes explicit `CloseHandle.assert_called_once_with(handle)` so a future regression that drops the `try/finally` and leaks handles fails a test, not just a code review. * Renamed the prior Windows-only smoke tests to `TestIsProcessAliveWindowsReal` to clarify they're real-API integration tests on Windows CI, complementing the cross-platform mocked tests. * Refined the "WinError 11" comment in the regression test to focus on intent rather than conflating errno=11 with winerror=11. * Replaced hard-coded `/tmp/wf.yaml` with `tmp_path / "wf.yaml"`. * Dropped redundant "Tests for X" docstring sentences on new test classes. tests/test_cli/test_stop.py --------------------------- * New `TestStopProcessUnexpectedOSError` (2 tests) covering the defensive `except OSError` in `_stop_process`: - Unexpected OSError doesn't crash; user sees yellow message - PID file is still cleaned up after the OSError path Verification ------------ * `uv run pytest tests/test_cli/` -> 362 passed, 2 skipped (vs 353 pre-change; +9 new tests, no regressions) * `make lint` -> clean * `uv run ty check src/conductor/` -> 1 diagnostic (pre-existing baseline) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jrob5756
added a commit
that referenced
this pull request
May 13, 2026
Release notes for 0.1.15 added to CHANGELOG.md. Added: - Per-agent timeout_seconds for hard wall-clock timeouts (#150) - Auto-discovery of .github/instructions/**/*.instructions.md with applyTo filtering, plus polymorphic conventions refactor (#169) Fixed: - system_prompt is now rendered and forwarded to providers (#179) - conductor update on Windows: install scripts become the single upgrade path; install.sh reaches parity (#171) - install.ps1 UTF-8 BOM breaking the irm | iex one-liner (#178) - conductor stop crash on Windows from os.kill(pid, 0) liveness probe (#176) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #166.
conductor stop(and--all/--port) crashes on Windows whenever a PID file exists in~/.conductor/runs/:The crash happens before any process is stopped, making the command effectively unusable on Windows.
Root cause
_is_process_aliveused the Unix idiomos.kill(pid, 0)to probe existence without sending a signal. On Windows that idiom is not a no-op — per the CPython docs, any signal value other thanCTRL_C_EVENT/CTRL_BREAK_EVENTroutes throughTerminateProcess. Depending on the target process's bitness/image type, this can raiseOSErrorsubclasses that neitherProcessLookupErrornorPermissionErrorcover (e.g.WinError 11/ERROR_BAD_FORMAT). The exception escaped out ofread_pid_files()and crashed the CLI.It would also be unsafe even if it didn't raise: a "successful"
os.kill(pid, 0)on Windows would actually terminate the target process with exit code 0.Fix (
src/conductor/cli/pid.py)Split
_is_process_aliveinto a platform dispatcher and two implementations:_is_process_alive_posixkeepsos.kill(pid, 0)and adds a defensiveOSErrorcatch that logs at debug and returnsTrue. Treating an unexpected error as "assume alive" prevents a transient OS hiccup from silently dropping a still-running workflow's PID file or crashingconductor stop._is_process_alive_windowsusesctypesto callOpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, ...)followed byGetExitCodeProcess. MapsERROR_INVALID_PARAMETER→ not alive,ERROR_ACCESS_DENIED→ alive, and any other failure → assume alive (with a debug log).No new dependencies added —
ctypesis part of the Python stdlib. Windows-only ctypes attributes are flagged fortywith# type: ignore[unresolved-attribute], matching the codebase's existing convention for platform-conditional code.Tests (
tests/test_cli/test_pid.py)TestIsProcessAlivePosix— alive,ProcessLookupError,PermissionError, and the regression case (genericOSError, exactly the scenario the issue suggested would have caught this earlier).TestIsProcessAliveDispatch— verifies the dispatcher routes to the right implementation based onsys.platform.TestIsProcessAliveWindows— Windows-only smoke tests using realkernel32.dllfor the current process and an invalid PID, gated withpytest.mark.skipif(sys.platform != "win32").TestReadPidFilesDoesNotCrashOnOsError— full integration throughread_pid_files()simulating the Windows-style failure on the POSIX path; asserts no exception escapes and the PID file is preserved (rather than silently deleted).Verification
uv run pytest tests/test_cli/test_pid.py→ 20 passed, 2 skipped (Windows-only).uv run pytest tests/test_cli/→ 353 passed, 2 skipped.uv run ruff check/ruff format --check— clean.uv run ty check src/conductor/— back to baseline (no new diagnostics).Notes
sys.platform == "win32"and uses lazy ctypes imports, so it has zero impact on POSIX startup or runtime.STILL_ACTIVE(259) sentinel has the well-known corner case that a process can legitimately exit with code 259 and be reported as alive; for our use case (cleaning up stale workflow PID files) this is an acceptable known limitation.Closes #166