Claude Pilot: Bug Reports & Friction Points from Early Trial #54

@KaiKloepfer


Hi @maxritter,

I wanted to share some thoughts from trying out your product:

Trial duration: ~2 days, very heavy usage, roughly 1T tokens consumed
Environment: Linux (WSL2), Next.js/TypeScript codebase (~440 tests, 43 test suites), git worktrees, multiple parallel Claude Code sessions
Pilot version: Latest as of Feb 15, 2026 (upgraded 2 times during trial)

Overall, Pilot is very promising. Codifying spec-driven development is a great expansion on patterns that I think most real software developers already use with Claude Code, and formalizing them into a repeatable workflow is genuinely valuable. That said, at this early stage the tool adds too much friction for me to keep using it day-to-day. I wanted to share the bugs and friction I hit in case it helps you prioritize.

1. Context Monitor Reports Stale Context After Handoff

After a send-clear handoff, the context monitor hook reports the previous session's context percentage instead of the current one. The Claude Code statusline shows the real value (30-40%), but the hook fires with the old session's number (e.g., 89%). The agent trusts the hook, thinks it's nearly out of context, and immediately enters handoff mode without doing any work.

This creates a destructive loop. The agent hands off, the new session gets the same stale reading, hands off again, and so on. Because the agent deletes the continuation file as part of its handoff prep, subsequent sessions lose the context that was supposed to guide them.

[SPEC] Continue workflow from previous session. IMMEDIATELY use the Skill tool:
Skill(skill="spec", args="--continue /home/<WORKTREE_PATH>/docs/plans/<PLAN_FILE>.md")
Do NOT do anything else first.

● Skill(/spec)
⎿  Successfully loaded skill
⎿  PostToolUse:Skill hook returned blocking error
⎿  [uv run python "${CLAUDE_PLUGIN_ROOT}/hooks/context_monitor.py"]:
   💡 Context 89% - Non-obvious discovery or reusable workflow? → Invoke Skill(learn)
   ⚠️  CONTEXT 89% - PREPARE FOR HANDOFF
   Finish current task with full quality, then hand off.

● Context is at 89% - I need to act quickly. Let me read the continuation file
  and plan, then hand off properly.

Actual context at this point per statusline: ~30-40%.

I suspect the hook is caching or inheriting the context percentage from the parent process environment rather than querying the current session. It should probably call pilot check-context --json against the live session, or at least cross-check before triggering a handoff.
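To make the cross-check idea concrete, here is a minimal sketch in Python. It assumes each context reading can be tagged with the session that produced it. `ContextReading`, `should_hand_off`, the `session_id` field, and the 80% threshold are all hypothetical names and values for illustration, not Pilot's actual internals.

```python
from dataclasses import dataclass


@dataclass
class ContextReading:
    session_id: str  # session that produced the reading (hypothetical field)
    percent: float   # context usage, 0-100


def should_hand_off(reading: ContextReading, current_session_id: str,
                    threshold: float = 80.0) -> bool:
    """Trigger a handoff only when the reading belongs to the live session."""
    if reading.session_id != current_session_id:
        # Reading was left over from a previous session: treat it as stale.
        return False
    return reading.percent >= threshold
```

With a guard like this, the 89% value from the old session would be ignored rather than pushing the fresh session straight into another handoff.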

2. Install/Upgrade Overwrites User Configuration

Pilot's install and upgrade process force-overwrites Claude Code config files, including user preferences. There were 2 upgrades during my short trial, and each one:

  1. Reset my Claude Code settings (e.g., I don't use verbose mode, but Pilot enables it every time — I have to manually revert after each upgrade)
  2. Broke most of my MCP server configurations. I spent multiple Claude sessions just diagnosing and fixing plugins that stopped working after an upgrade.

The intent of providing a best-practices template makes sense, but force-overwriting on every upgrade is really disruptive. A merge strategy (only add new keys, don't touch existing ones) or even just backing up the config before overwriting would go a long way.
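As a rough illustration of the "only add new keys" strategy, the sketch below merges a template into an existing config without changing any value the user already set. `merge_missing_keys` is a hypothetical helper, and I'm assuming the config can be treated as a nested dict loaded from JSON.

```python
import copy
from typing import Any


def merge_missing_keys(user: dict[str, Any],
                       template: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `user` with template keys added only where absent."""
    merged = copy.deepcopy(user)
    for key, default in template.items():
        if key not in merged:
            # New key introduced by the template: safe to add.
            merged[key] = copy.deepcopy(default)
        elif isinstance(merged[key], dict) and isinstance(default, dict):
            # Both sides are nested sections: merge them recursively.
            merged[key] = merge_missing_keys(merged[key], default)
        # Otherwise the user's existing value wins and is left untouched.
    return merged
```

Because the function returns a new dict, the installer could also write the original to a backup file before saving the merged result.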

3. Worktree State Management Is Fragile

Pilot frequently loses track of which worktree it should be working in. I hit several related failure modes:

Agent drifts back to the main tree. During /spec implementation in an isolated worktree, the agent would sometimes start making edits in the main working tree instead. This happened most often around plan file edits and session continuations.

Manual correction causes global contamination. When I tried to fix the drift by manually cd-ing into the correct worktree and restarting Pilot, that worktree became the "main" working directory for all Pilot sessions. Every open session — including completely unrelated ones — started working in that worktree after their next handoff.

Worktree deletion crashes everything. After I deleted the contaminated worktree, all sessions that had adopted it as their working directory started crashing on handoff:

[Wrapper] Failed to start Claude: [Errno 2] No such file or directory: '/home/<WORKTREE_PATH>'
Traceback (most recent call last):
  File "<string>", line 9, in <module>
  File "pilot.pyx", line 5287, in pilot.app
  File "pilot.pyx", line 4539, in pilot.ClaudeWrapper.start
  File "pilot.pyx", line 4385, in pilot.ClaudeWrapper._run_supervisor_loop
  File "pilot.pyx", line 4142, in pilot.ClaudeWrapper._start_claude
  File "pilot.pyx", line 4134, in pilot.ClaudeWrapper._start_claude
  File "/usr/lib/python3.12/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.12/subprocess.py", line 1955, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/home/<WORKTREE_PATH>'

I had to have Claude clean out various Pilot config files to get things working again. Worktree associations should probably be scoped per-session rather than global, and if a worktree is deleted, sessions should fall back to the repo root instead of crashing.

4. Post-Tool-Use Hooks Reduce Development Velocity

The post-tool-use hooks that run linting and type checking after every file edit are well-intentioned, but they create a bad feedback loop in certain setups.

The import-ordering loop. Our ESLint config auto-organizes imports. When Claude edits a file and adds or changes an import, the post-edit lint hook rearranges the imports. On the next edit, Claude targets the pre-lint version of the file, the hook rearranges again, and so on. This repeats about 5 times per edit until Claude figures out it needs to write the entire file atomically.

Type checking on every edit. Our tsc takes a few seconds even on a powerful machine. Running it after every single file edit makes each change 10-20x slower than without Pilot.

Combined, in a repo with strict linting and moderate-complexity TypeScript, per-edit hooks reduce effective development speed by roughly 80% compared to vanilla Claude Code. Forcing checks at the end of a task is great — running them after every file edit is overkill. A configurable hook granularity (on_edit vs on_task_complete) would help a lot here.
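A minimal sketch of what configurable granularity could mean, assuming hooks receive an event name. The `Granularity` enum, `should_run_checks`, and the event names `"edit"` and `"task_complete"` are hypothetical. Only the `on_edit` and `on_task_complete` option names come from my suggestion above.

```python
from enum import Enum


class Granularity(Enum):
    ON_EDIT = "on_edit"
    ON_TASK_COMPLETE = "on_task_complete"


def should_run_checks(granularity: Granularity, event: str) -> bool:
    """Decide whether lint and type checks run for a given event name."""
    if granularity is Granularity.ON_EDIT:
        # Current behavior: check after every edit and at task end.
        return event in ("edit", "task_complete")
    # Deferred mode: skip per-edit checks, run once when the task finishes.
    return event == "task_complete"
```

In `on_task_complete` mode the checks still gate the end of a task, so quality enforcement stays in place while the per-edit cost goes away.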

5. Agent Feels Less Capable with Pilot Active

This one is harder to pin down, but with Pilot active, Claude feels noticeably less sharp than Opus 4.5/4.6 in a standard Claude Code session. Some concrete observations:

  • More basic mistakes that raw Opus never made — wrong import paths, incorrect function signatures, etc.
  • The configuration seems to prevent sub-agent spawning and other advanced features. This is especially frustrating in quick mode, where I've had to switch to Codex for tasks I'd normally just do in a quick Claude session.
  • Even in spec-driven development, the plans Pilot produces are lower quality than what I can get through manual prompting with the same model. The structure is there, but the specificity and task decomposition are weaker.

I wonder if the volume of rules/hooks/system prompts is inadvertently constraining the model. Quick mode especially should feel like an enhancement, not a downgrade.

6. Session Handoff Fails ~20% of the Time

Roughly 1 in 5 session handoffs fail silently or produce a broken state that I have to manually fix. Failure modes I've seen:

  • Continuation file not written before the session clears
  • New session starts but doesn't pick up the continuation context
  • Session restarts in the wrong working directory (related to item 3 above)
  • Pilot process hangs during the handoff wait period

When handoffs fail, I have to manually read the continuation file (if it exists), piece together what was happening, and restart the workflow. This erodes trust in the "endless mode" promise — if I can't walk away and trust that handoffs will work, I end up babysitting every context transition. Some kind of handoff verification or failure logging would help.
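For the first failure mode, a pre-clear check along these lines might be enough. This is only a sketch: `continuation_ready` and the `"handoff"` logger name are hypothetical, and I'm assuming the continuation file is a regular file at a known path.

```python
import logging
from pathlib import Path

log = logging.getLogger("handoff")


def continuation_ready(path: str) -> bool:
    """Return True only if the continuation file exists and is non-empty."""
    p = Path(path)
    if not p.is_file():
        log.error("handoff aborted: continuation file missing: %s", p)
        return False
    if p.stat().st_size == 0:
        log.error("handoff aborted: continuation file is empty: %s", p)
        return False
    return True
```

Refusing to clear the session when this returns False would turn a silent failure into a logged one that can be retried.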

7. Handoff Kills Background Processes (Unlike /compact)

When Pilot hands off via send-clear, it terminates the entire Claude process, which kills any background tasks the agent started — dev servers, Chrome MCP browser sessions, file watchers, etc. This is different from Claude Code's built-in /compact, which preserves the process and all background tasks while freeing context.

In practice this means if the agent starts a dev server and opens a Chrome MCP session for E2E testing, a handoff kills both. The next session has to restart the server, re-acquire the Chrome MCP lock, re-navigate to the right page, and re-establish any test state. For iterative UI work this adds a lot of overhead per handoff.

8. Vexor and Memory Tools Rarely Get Used

Vexor (semantic code search) and the persistent memory features are genuinely useful when they get invoked, which is very rarely. Despite rules instructing the agent to use Vexor for codebase exploration and memory for cross-session context, I almost never see either tool called unless I explicitly tell the agent to use them.

This is probably a prompt/rules tuning issue. The rules exist but may be getting diluted by the sheer volume of other instructions. Giving these tools higher priority in the system prompt or adding heuristic triggers might help.

Closing Thoughts

The core vision — codified SDD, persistent memory, seamless session continuation, structured verification — is exactly right. These are the patterns I already use manually with Claude Code, and formalizing them is valuable. All of the issues above feel solvable, and I'd be happy to revisit once things stabilize. Best of luck with the progress, and thanks for building this!

Labels: bug, enhancement