Windows: forge ui reports 'agent failed to start' for healthy daemons (Signal(0) unsupported on Windows) #59

@initializ-mk

Summary

On Windows, forge ui reports agent failed to start even when the agent process is actually running successfully — port is bound, .forge/serve.log contains the normal startup banner (REST: ..., JSON-RPC: ..., Press Ctrl+C to stop) and structured success logs ({"msg":"scheduler reloaded"}). The bug does NOT occur on macOS/Linux.

The user-visible error text literally contains the agent's successful startup output as if it were the failure cause:

agent failed to start: REST: http://localhost:9105/tasks/send JSON-RPC: POST http://localhost:9105/
──────────────────────────────────────── Press Ctrl+C to stop
{"active":0,"level":"info","msg":"scheduler reloaded","time":"2026-05-18T20:21:04Z"}

That's a strong tell: the UI thinks the daemon died, falls through to its "read the log to find the error" path, and shows the daemon's success banner to the user as "the error".

Root cause

forge-ui/process.go:205-211 defines:

func pidAlive(pid int) bool {
    proc, err := os.FindProcess(pid)
    if err != nil {
        return false
    }
    return proc.Signal(syscall.Signal(0)) == nil
}

Signal(0) is the standard Unix idiom for "is this PID alive?". It does not work on Windows. Go's os.Process.Signal on Windows implements only os.Kill (→ TerminateProcess); the os package documentation states that sending os.Interrupt is not implemented on Windows. For any other signal, including Signal(0), it returns an error without probing the process at all. So pidAlive returns false on Windows regardless of whether the PID is alive: on that platform the check always reports "dead".

The forge-cli daemon code already knows this. Compare:

// forge-cli/cmd/serve_unix.go:14-20      (uses Signal(0) — correct on Unix)
func isProcessAlive(pid int) bool {
    proc, err := os.FindProcess(pid)
    if err != nil { return false }
    return proc.Signal(syscall.Signal(0)) == nil
}

// forge-cli/cmd/serve_windows.go:14-22   (uses OpenProcess + CloseHandle — correct on Windows)
func isProcessAlive(pid int) bool {
    const processQueryLimitedInfo = 0x1000
    h, err := syscall.OpenProcess(processQueryLimitedInfo, false, uint32(pid))
    if err != nil { return false }
    _ = syscall.CloseHandle(h)
    return true
}

Both files implement a function called isProcessAlive, build-tag-gated. The forge-ui code didn't reuse this — it duplicated the logic with a single Unix-style pidAlive and forgot the Windows variant.

Symptom chain

| Step | macOS / Linux | Windows |
|---|---|---|
| 1. UI spawns forge serve start (DETACHED_PROCESS / Setsid) | OK | OK; daemon detaches cleanly |
| 2. waitForPort (forge-ui/process.go:162) reads PID from .forge/serve.json | OK | OK |
| 3. pidAlive(pid) called inside the poll loop (process.go:174) | Signal(0) succeeds → returns true | Signal(0) always errors → returns false |
| 4. waitForPort outcome | Loops until TCP port responds → returns true | Fast-fails at step 3, returns false immediately |
| 5. UI reaction | Reports success to the browser | Reports failure, calls readServeLogs |
| 6. serve.log content by this point | n/a | Contains the daemon's normal startup banner |
| 7. User-visible "error" | n/a | The banner itself, as in the issue report |
Second occurrence — same bug, different blast radius

forge-ui/discovery.go:133 reuses the same broken pidAlive:

if state.PID > 0 && !pidAlive(state.PID) {
    _ = os.Remove(statePath)   // ← deletes .forge/serve.json
    return 0, false
}

On Windows this is actively destructive: every UI discovery call thinks the daemon is dead and deletes the daemon's serve.json state file. The daemon keeps running but disappears from the UI's view. Worth confirming as a separate symptom — Windows users may also see "agent disappeared from the dashboard" / "I have to recreate the agent every page reload".

Reproduction

  1. On a Windows machine, forge init my-agent, configure a model + key.
  2. forge ui, click "Start" on the agent.
  3. Observed: red banner agent failed to start: <agent's success banner>. Backend agent is actually healthy — netstat -an | findstr <port> shows LISTEN, .forge/serve.log shows no errors, the JSON-RPC endpoint responds to curl.
  4. Refresh the dashboard: .forge/serve.json is gone (deleted by discovery.go:133); the daemon process is still running but the UI can't find it.

Steps 3 and 4 both reproduce on any Windows machine; neither reproduces on macOS or Linux.

Recommended fix (Option A)

Extract the platform-aware liveness check into a shared, build-tag-split utility consumed by both forge-cli/cmd/serve_*.go (existing isProcessAlive) and forge-ui/{process,discovery}.go (broken pidAlive). Eliminates the duplicate function and keeps the build-tag boundary in exactly one place.

Candidate location: forge-core/util/processalive/ (or similar) — both forge-cli and forge-ui already depend on forge-core, so no circular dependency risk.

Proposed signature:

// forge-core/util/processalive/processalive.go (or wherever)
package processalive

// IsAlive reports whether a process with the given PID exists.
// Implementation is platform-specific: Unix uses Signal(0); Windows uses
// OpenProcess. Returns false on any error so callers treat ambiguous
// failures (permission, system call error) as "not alive" — same semantics
// as the original Unix-only check.
func IsAlive(pid int) bool { ... }

The two files:

// processalive_unix.go     (//go:build !windows)
// processalive_windows.go  (//go:build windows)

Then:

  • forge-cli/cmd/serve_{unix,windows}.go delete their isProcessAlive and import the helper.
  • forge-ui/process.go:205-211 deletes pidAlive and calls the helper.
  • forge-ui/discovery.go:133 calls the helper.

A short alternative (Option B) — duplicate the platform-split inside forge-ui only — was rejected because it carries the same duplicate-implementation risk that caused this bug.

Acceptance criteria

  • On Windows, starting an agent from forge ui shows the success state, not "agent failed to start", for an agent that the backend actually started successfully.
  • On Windows, forge ui does not delete .forge/serve.json for a still-running daemon.
  • On macOS/Linux, behavior is unchanged.
  • There is exactly one implementation of "is this PID alive" in the repo, build-tag-split. forge-cli and forge-ui both consume it.
  • A unit test, where feasible, covers the Windows OpenProcess path. CI may not run on Windows; at minimum, the build-tag split is verified to compile with GOOS=windows.

Out of scope

  • The 5-second timeout in waitForPort is reasonable; this issue does not change it.
  • forge-ui/process.go:213+ readServeLogs is fine — the bug is that it gets called at all on Windows, not what it does.

Cross-reference

The forge-cli side already has the correct Windows implementation since the daemon-lifecycle work; this is a regression where the forge-ui handler shipped without picking up the platform-aware variant.

Labels: bug
