Skip to content

fix(windows): use OpenProcess for liveness check in pid.py (#166)#176

Merged
jrob5756 merged 2 commits into
mainfrom
fix/issue-166
May 11, 2026
Merged

fix(windows): use OpenProcess for liveness check in pid.py (#166)#176
jrob5756 merged 2 commits into
mainfrom
fix/issue-166

Conversation

@jrob5756
Copy link
Copy Markdown
Collaborator

Summary

Fixes #166. conductor stop (and --all / --port) crashes on Windows whenever a PID file exists in ~/.conductor/runs/:

OSError: [WinError 11] An attempt was made to load a program with an incorrect format

The crash happens before any process is stopped, making the command effectively unusable on Windows.

Root cause

_is_process_alive used the Unix idiom os.kill(pid, 0) to probe existence without sending a signal. On Windows that idiom is not a no-op — per the CPython docs, any signal value other than CTRL_C_EVENT / CTRL_BREAK_EVENT routes through TerminateProcess. Depending on the target process's bitness/image type, this can raise OSError subclasses that neither ProcessLookupError nor PermissionError cover (e.g. WinError 11 / ERROR_BAD_FORMAT). The exception escaped out of read_pid_files() and crashed the CLI.

It would also be unsafe even if it didn't raise: a "successful" os.kill(pid, 0) on Windows would actually terminate the target process with exit code 0.

Fix (src/conductor/cli/pid.py)

Split _is_process_alive into a platform dispatcher and two implementations:

  • _is_process_alive_posix keeps os.kill(pid, 0) and adds a defensive OSError catch that logs at debug and returns True. Treating an unexpected error as "assume alive" prevents a transient OS hiccup from silently dropping a still-running workflow's PID file or crashing conductor stop.
  • _is_process_alive_windows uses ctypes to call OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, ...) followed by GetExitCodeProcess. Maps ERROR_INVALID_PARAMETER → not alive, ERROR_ACCESS_DENIED → alive, and any other failure → assume alive (with a debug log).

No new dependencies added — ctypes is part of the Python stdlib. Windows-only ctypes attributes are flagged for ty with # type: ignore[unresolved-attribute], matching the codebase's existing convention for platform-conditional code.

Tests (tests/test_cli/test_pid.py)

  • TestIsProcessAlivePosix — alive, ProcessLookupError, PermissionError, and the regression case (generic OSError, exactly the scenario the issue suggested would have caught this earlier).
  • TestIsProcessAliveDispatch — verifies the dispatcher routes to the right implementation based on sys.platform.
  • TestIsProcessAliveWindows — Windows-only smoke tests using real kernel32.dll for the current process and an invalid PID, gated with pytest.mark.skipif(sys.platform != "win32").
  • TestReadPidFilesDoesNotCrashOnOsError — full integration through read_pid_files() simulating the Windows-style failure on the POSIX path; asserts no exception escapes and the PID file is preserved (rather than silently deleted).

Verification

  • uv run pytest tests/test_cli/test_pid.py20 passed, 2 skipped (Windows-only).
  • uv run pytest tests/test_cli/353 passed, 2 skipped.
  • uv run ruff check / ruff format --check — clean.
  • uv run ty check src/conductor/ — back to baseline (no new diagnostics).

Notes

  • Behavior on macOS / Linux is unchanged for the normal alive / dead / no-permission cases.
  • The Windows code path lives behind sys.platform == "win32" and uses lazy ctypes imports, so it has zero impact on POSIX startup or runtime.
  • The STILL_ACTIVE (259) sentinel has the well-known corner case that a process can legitimately exit with code 259 and be reported as alive; for our use case (cleaning up stale workflow PID files) this is an acceptable known limitation.

Closes #166

jrob5756 and others added 2 commits May 11, 2026 09:06
`conductor stop` (and `--all` / `--port`) crashes on Windows whenever a
PID file exists in `~/.conductor/runs/`:

    OSError: [WinError 11] An attempt was made to load a program with
    an incorrect format

The crash happens before any process is stopped, making the command
unusable on Windows.

Root cause
----------
`_is_process_alive` used the Unix idiom `os.kill(pid, 0)` to probe
existence without sending a signal. On Windows that idiom is *not* a
no-op: per the CPython docs, any signal value other than CTRL_C_EVENT /
CTRL_BREAK_EVENT routes through TerminateProcess, and depending on the
target process's bitness/image type can raise OSError subclasses that
neither `ProcessLookupError` nor `PermissionError` cover (e.g. WinError
11 / ERROR_BAD_FORMAT). The exception escaped out of `read_pid_files`
and crashed the CLI. Even on a "successful" call it would actually
terminate the target process with exit code 0.

Fix
---
Split `_is_process_alive` into a platform dispatcher and two
implementations:

* `_is_process_alive_posix` keeps `os.kill(pid, 0)` and adds a
  defensive `OSError` catch that logs at debug and returns True.
  Treating an unexpected error as "assume alive" prevents a transient
  OS hiccup from silently dropping a still-running workflow's PID file
  or crashing `conductor stop`.
* `_is_process_alive_windows` uses ctypes to call
  `OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, ...)` followed by
  `GetExitCodeProcess`. Maps `ERROR_INVALID_PARAMETER` to "not alive",
  `ERROR_ACCESS_DENIED` to "alive", and any other failure to "assume
  alive" with a debug log line.

Tests
-----
* `TestIsProcessAlivePosix` — covers alive, `ProcessLookupError`,
  `PermissionError`, and the regression case (generic `OSError`).
* `TestIsProcessAliveDispatch` — verifies the dispatcher routes to the
  right implementation based on `sys.platform`.
* `TestIsProcessAliveWindows` — Windows-only smoke tests gated with
  `pytest.mark.skipif`.
* `TestReadPidFilesDoesNotCrashOnOsError` — full integration through
  `read_pid_files` simulating the Windows-style failure on the POSIX
  path.

Fixes #166

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Follow-up from PR review of #166. Hardens the original fix and addresses
the highest-leverage findings from the comment-analyzer, pr-test-analyzer,
silent-failure-hunter, code-reviewer, and code-simplifier passes.

src/conductor/cli/pid.py
------------------------
* Hoist Windows ctypes setup (WinDLL load, argtypes/restype, magic
  numbers) to module level under `if sys.platform == "win32":`. Matches
  the project idiom used in cli/app.py, cli/update.py, and
  cli/bg_runner.py; eliminates per-call work; gives the magic numbers
  greppable names (_PROCESS_QUERY_LIMITED_INFORMATION, _STILL_ACTIVE,
  _ERROR_ACCESS_DENIED, _ERROR_INVALID_PARAMETER); collapses three
  `# type: ignore` comments. Also exposes a single `_kernel32` symbol
  that tests can patch from any platform.
* Promote the three "unexpected error -> assume alive" log lines from
  DEBUG to WARNING (POSIX `except OSError`, Windows unrecognized
  OpenProcess error, Windows GetExitCodeProcess failure). The default
  conductor log level is INFO+, so DEBUG hid operationally important
  information about probe failures.
* Include `ctypes.FormatError(err)` in the Windows error log lines so
  the human-readable Windows error string is captured alongside the
  numeric code.
* Document the well-known STILL_ACTIVE = 259 collision in the
  `_is_process_alive_windows` docstring (a process exiting with code
  259 is reported as still alive forever; we accept this because
  conductor children don't exit with that code).
* Clarify the POSIX docstring to note the regression-fix behavior
  (intentionally swallows generic OSError to keep `conductor stop`
  working).
* Keep ERROR_ACCESS_DENIED on Windows at DEBUG since it's an expected
  condition (cross-session or higher-integrity targets), and add a
  log line so it's still discoverable.

src/conductor/cli/app.py
------------------------
* Mirror the same defensive `except OSError` + `logger.warning`
  pattern in `_stop_process`. The original PR fixed the crash in
  `_is_process_alive` but `_stop_process` had the same uncaught
  `OSError` shape one frame deeper. The "assume alive" fallback in
  `_is_process_alive_windows` makes it strictly more likely that
  probe-failing PIDs reach `_stop_process`, so this hardening is
  required to fully close issue #166 across both call sites.
* Print a yellow user-facing message and remove the PID file anyway,
  rather than crash.
* Add module-level `logger = logging.getLogger(__name__)`.

tests/test_cli/test_pid.py
--------------------------
* New `TestIsProcessAliveWindowsMocked` (7 tests, runs on every
  platform) — mocks `_kernel32` and `ctypes.get_last_error` to cover
  every branch of `_is_process_alive_windows`:
  - running process (STILL_ACTIVE)            -> True
  - exited process (exit code 0)              -> False
  - exited with non-zero exit code            -> False
  - OpenProcess ERROR_ACCESS_DENIED           -> True
  - OpenProcess ERROR_INVALID_PARAMETER       -> False
  - OpenProcess unrecognized error            -> True (assume alive)
  - GetExitCodeProcess fails                  -> True + CloseHandle
  Includes explicit `CloseHandle.assert_called_once_with(handle)` so
  a future regression that drops the `try/finally` and leaks handles
  fails a test, not just a code review.
* Renamed the prior Windows-only smoke tests to
  `TestIsProcessAliveWindowsReal` to clarify they're real-API
  integration tests on Windows CI, complementing the cross-platform
  mocked tests.
* Refined the "WinError 11" comment in the regression test to focus on
  intent rather than conflating errno=11 with winerror=11.
* Replaced hard-coded `/tmp/wf.yaml` with `tmp_path / "wf.yaml"`.
* Dropped redundant "Tests for X" docstring sentences on new test
  classes.

tests/test_cli/test_stop.py
---------------------------
* New `TestStopProcessUnexpectedOSError` (2 tests) covering the
  defensive `except OSError` in `_stop_process`:
  - Unexpected OSError doesn't crash; user sees yellow message
  - PID file is still cleaned up after the OSError path

Verification
------------
* `uv run pytest tests/test_cli/` -> 362 passed, 2 skipped (vs 353
  pre-change; +9 new tests, no regressions)
* `make lint` -> clean
* `uv run ty check src/conductor/` -> 1 diagnostic (pre-existing baseline)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jrob5756 jrob5756 merged commit 9f54692 into main May 11, 2026
9 checks passed
@jrob5756 jrob5756 deleted the fix/issue-166 branch May 11, 2026 13:43
jrob5756 added a commit that referenced this pull request May 13, 2026
Release notes for 0.1.15 added to CHANGELOG.md.

Added:
- Per-agent timeout_seconds for hard wall-clock timeouts (#150)
- Auto-discovery of .github/instructions/**/*.instructions.md with applyTo
  filtering, plus polymorphic conventions refactor (#169)

Fixed:
- system_prompt is now rendered and forwarded to providers (#179)
- conductor update on Windows: install scripts become the single upgrade
  path; install.sh reaches parity (#171)
- install.ps1 UTF-8 BOM breaking the irm | iex one-liner (#178)
- conductor stop crash on Windows from os.kill(pid, 0) liveness probe (#176)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

conductor stop crashes on Windows: OSError [WinError 11] from os.kill(pid, 0) in _is_process_alive

1 participant