
development plan v2 #5

@huberp

Description


Development Plan v2 for Extending AgentLoop

Revised version of #4. Addresses dependency ordering, missing capabilities, vague acceptance criteria, and gaps identified during deep review. Supersedes #4.

Overview

This plan extends the basic AgentLoop — currently a ~140-line TypeScript LangChain agent with Mistral, two hardcoded demo tools, in-memory chat history, and a single-pass tool execution loop — towards a full-fledged LLM-based agent that can support software development.

Key design principles:

  • Each task is scoped to ~1–3 days of focused work and can be fed as an "agent task" to Copilot
  • All acceptance criteria are specific and testable
  • Tasks declare explicit dependencies
  • Testing strategy addresses the LLM-dependency challenge upfront

Current codebase baseline (as of initial plan):

  • src/index.ts — single-pass executeWithTools() with ChatMistralAI, InMemoryChatMessageHistory
  • src/tools.ts — two hardcoded tools (mock search, eval()-based calculator)
  • src/config.ts — dotenv-based config (Mistral key + logger settings)
  • src/logger.ts — Pino logger
  • Jest test setup in src/__tests__/

Task Dependency Graph

Phase 1: T1 → T2 → T3 → T4 → T5 → T6 → T7

Phase 2: T1 → T2 → T3 → T4 → T5 → T6 → T7 → T8
Phase 3: T1 → T2 → T3 → T4 → T5

Phase 4: T1 → T2 → T3 → T4

Phase 5: T1, T2, T3 (parallel)

Phase 6: T1, T2 (parallel)

Detailed dependency links are noted per-task below with Depends on: tags.


Phase 1: Harden the Core Loop

Goal: Make the existing agent loop production-worthy before adding any new capabilities.

Task 1.1: Iterative Tool-Calling Loop

  • Depends on: nothing (first task)
  • Description: Refactor executeWithTools() in src/index.ts from a single tool-call round-trip to a proper agentic loop that iterates until the LLM signals completion (i.e., returns a response with zero tool calls).
  • Steps:
    1. Replace the current single-pass logic with a while loop that continues as long as the LLM response contains tool_calls.
    2. Add a configurable MAX_ITERATIONS guard (default: 20) in appConfig to prevent infinite loops.
    3. Add a configurable MAX_TOKENS_BUDGET (optional, for future use) field in appConfig.
    4. When MAX_ITERATIONS is reached, return the last LLM response with a warning prefix.
    5. Add structured log entries for each iteration (iteration number, tool calls in that iteration).
    6. Write unit tests using mocked LLM responses that simulate: (a) 0 tool calls, (b) 1 round of tool calls, (c) 3 consecutive rounds, (d) hitting MAX_ITERATIONS.
  • Acceptance Criteria:
    • executeWithTools("multi-step query") iterates until no tool calls remain, up to MAX_ITERATIONS.
    • A test with a mock LLM returning 3 consecutive tool-call rounds produces exactly 3 iterations and a final text response.
    • A test hitting MAX_ITERATIONS returns gracefully with a warning, not an error.
  • Test Requirements: Unit tests with mocked ChatMistralAI (no real API calls). Use jest.mock() for the LLM.
  • Guidelines: Do not change the public API of agentExecutor.invoke(). The change is internal to executeWithTools().
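The loop described in steps 1–5 can be sketched as follows. This is a minimal sketch with stand-in types: ToolCall, LLMResponse, and ChatModel are simplified placeholders, not LangChain's real interfaces, and the actual implementation would operate on BaseMessage arrays and the tool_calls field of AIMessage.

```typescript
// Simplified stand-ins for the LangChain types used in src/index.ts.
type ToolCall = { name: string; args: Record<string, unknown> };
type LLMResponse = { content: string; toolCalls: ToolCall[] };
type ChatModel = (messages: string[]) => Promise<LLMResponse>;
type ToolExecutor = (call: ToolCall) => Promise<string>;

const MAX_ITERATIONS = 20; // would come from appConfig in the real code

export async function executeWithTools(
  llm: ChatModel,
  runTool: ToolExecutor,
  messages: string[],
): Promise<string> {
  for (let iteration = 1; iteration <= MAX_ITERATIONS; iteration++) {
    const response = await llm(messages);
    // Completion signal: a response with zero tool calls ends the loop.
    if (response.toolCalls.length === 0) return response.content;
    for (const call of response.toolCalls) {
      const result = await runTool(call);
      messages.push(`tool:${call.name}:${result}`);
    }
  }
  // Budget exhausted: return gracefully with a warning, not an error.
  return "[warning: MAX_ITERATIONS reached] partial result";
}
```

The key property to test is that the iteration count tracks the number of tool-call rounds the mock LLM emits, plus one final text round.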

Task 1.2: LLM Provider Abstraction

  • Depends on: 1.1
  • Description: Decouple the agent loop from ChatMistralAI so any LangChain-compatible chat model can be used.
  • Steps:
    1. Create src/llm.ts that exports a factory function createLLM(config): BaseChatModel.
    2. Add LLM_PROVIDER (default: "mistral") and LLM_MODEL (default: "") to appConfig.
    3. Implement the Mistral provider in the factory. Add a clear extension point (switch/map) for future providers (OpenAI, Anthropic, Ollama).
    4. Update src/index.ts to use the factory instead of directly instantiating ChatMistralAI.
    5. Make temperature and other model parameters configurable via appConfig.
    6. Write unit tests that verify the factory returns the correct provider based on config.
  • Acceptance Criteria:
    • Setting LLM_PROVIDER=mistral produces a ChatMistralAI instance.
    • Setting LLM_PROVIDER=unknown throws a descriptive error.
    • src/index.ts no longer imports ChatMistralAI directly.
  • Test Requirements: Unit tests for the factory function. Mock the provider constructors.
  • Guidelines: Use LangChain's BaseChatModel type for the return type.
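A possible shape for the factory, with the provider map as the extension point. BaseChatModel is stubbed here as a plain interface so the sketch stays self-contained; the real version would return LangChain's BaseChatModel and instantiate ChatMistralAI.

```typescript
// Stub standing in for LangChain's BaseChatModel.
interface BaseChatModel { provider: string; model: string; temperature: number }

interface LLMConfig {
  provider: string;   // maps to LLM_PROVIDER
  model?: string;     // maps to LLM_MODEL
  temperature?: number;
}

type ProviderFactory = (cfg: LLMConfig) => BaseChatModel;

// Extension point: future providers (openai, anthropic, ollama) add entries here.
const providers: Record<string, ProviderFactory> = {
  mistral: (cfg) => ({
    provider: "mistral",
    model: cfg.model ?? "",
    temperature: cfg.temperature ?? 0,
  }),
};

export function createLLM(cfg: LLMConfig): BaseChatModel {
  const factory = providers[cfg.provider];
  if (!factory) {
    throw new Error(
      `Unknown LLM_PROVIDER "${cfg.provider}". Known providers: ${Object.keys(providers).join(", ")}`,
    );
  }
  return factory(cfg);
}
```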

Task 1.3: System Prompt Management

  • Depends on: 1.1
  • Description: Replace the hardcoded "You are a helpful AI assistant." system prompt with a managed, configurable prompt system.
  • Steps:
    1. Create src/prompts/ directory.
    2. Create src/prompts/system.ts that exports a getSystemPrompt(context?: { tools?: string[], projectInfo?: string }): string function.
    3. Define a base system prompt template that includes: agent identity, available tool names, behavioral instructions.
    4. Add SYSTEM_PROMPT_PATH to appConfig (optional override — path to a .txt or .md file).
    5. Update executeWithTools() to use getSystemPrompt() instead of the hardcoded string.
    6. Write tests verifying prompt generation with and without context parameters.
  • Acceptance Criteria:
    • The system prompt dynamically includes the names of currently loaded tools.
    • If SYSTEM_PROMPT_PATH is set, the prompt is loaded from that file.
    • Changing tools changes the system prompt content.
  • Test Requirements: Unit tests. No LLM calls needed.
  • Guidelines: Keep the prompt modular so it can be extended in later phases.
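A sketch of getSystemPrompt with the signature from step 2. The prompt wording is illustrative only; the real template would carry the agreed agent identity and behavioral instructions.

```typescript
// Sketch of src/prompts/system.ts. The exact prompt text is a placeholder.
export function getSystemPrompt(context?: {
  tools?: string[];
  projectInfo?: string;
}): string {
  const parts = [
    "You are AgentLoop, an AI agent that assists with software development.",
  ];
  if (context?.tools?.length) {
    // Tool names are injected dynamically, so changing tools changes the prompt.
    parts.push(`Available tools: ${context.tools.join(", ")}.`);
  }
  if (context?.projectInfo) {
    parts.push(`Project context: ${context.projectInfo}`);
  }
  parts.push("Use tools when needed; reply with a final answer when done.");
  return parts.join("\n");
}
```

The SYSTEM_PROMPT_PATH override from step 4 would simply replace the base string before the dynamic sections are appended.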

Task 1.4: Context Window Management

  • Depends on: 1.1
  • Description: Implement token counting and context window management to prevent exceeding LLM context limits.
  • Steps:
    1. Add tiktoken (or js-tiktoken) as a dependency for token counting.
    2. Create src/context.ts with: countTokens(messages): number and trimMessages(messages, maxTokens): BaseMessage[].
    3. Add MAX_CONTEXT_TOKENS to appConfig (default: 28000 — leaving headroom for response).
    4. trimMessages strategy: always keep the system prompt and last N user messages; summarize or drop oldest messages from the middle.
    5. Integrate trimMessages into executeWithTools() before each LLM call.
    6. Add structured logging when messages are trimmed (count of removed messages, tokens saved).
    7. Write unit tests with synthetic message arrays that exceed the token limit.
  • Acceptance Criteria:
    • When conversation history exceeds MAX_CONTEXT_TOKENS, older messages are dropped/summarized.
    • The system prompt and the most recent user message are never dropped.
    • A test with 100 synthetic messages verifies the output fits within MAX_CONTEXT_TOKENS.
  • Test Requirements: Unit tests with synthetic messages. No LLM calls.
  • Guidelines: Start with a simple "drop oldest" strategy. Summarization can be added in a later task.
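The simple drop-oldest strategy from the guideline can be sketched as below. Token counting is stubbed with a characters-per-token heuristic so the sketch has no dependencies; the real countTokens would use js-tiktoken.

```typescript
type Msg = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Placeholder for tiktoken: a rough ~4-characters-per-token heuristic.
export function countTokens(messages: Msg[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}

export function trimMessages(messages: Msg[], maxTokens: number): Msg[] {
  if (countTokens(messages) <= maxTokens) return messages;
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  // Walk from newest to oldest, keeping messages while they fit.
  // The newest message is always kept, even if it alone exceeds the budget.
  const kept: Msg[] = [];
  for (let i = rest.length - 1; i >= 0; i--) {
    const candidate = [...system, rest[i], ...kept];
    if (countTokens(candidate) > maxTokens && kept.length > 0) break;
    kept.unshift(rest[i]);
  }
  return [...system, ...kept];
}
```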

Task 1.5: Error Handling & Recovery

  • Depends on: 1.1, 1.2
  • Description: Add robust error handling for LLM API failures and tool execution failures.
  • Steps:
    1. Create src/errors.ts with custom error types: LLMAPIError, ToolExecutionError, MaxIterationsError, ContextOverflowError.
    2. Wrap LLM API calls in retry logic with exponential backoff (max 3 retries, configurable).
    3. Add per-tool timeout configuration (default: 30s) and implement timeout enforcement.
    4. When a tool fails, inject the error as a ToolMessage so the LLM can reason about the failure (don't crash the loop).
    5. Add rate-limit detection for common LLM APIs (HTTP 429) with appropriate back-off.
    6. Write tests for: (a) LLM timeout → retry → success, (b) LLM timeout → max retries → graceful error, (c) tool throws → error injected as ToolMessage, (d) tool timeout.
  • Acceptance Criteria:
    • An LLM API failure retries up to 3 times with backoff before throwing LLMAPIError.
    • A tool that throws an error does not crash the agent loop; instead, the error is fed back to the LLM.
    • A tool exceeding its timeout is killed and an error ToolMessage is generated.
  • Test Requirements: Unit tests with mocked failures. No real API calls.
  • Guidelines: Use AbortController for tool timeouts.
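The retry wrapper from step 2 could look like this. LLMAPIError matches the error type named in step 1; the backoff base delay is an assumed default, not specified by the task.

```typescript
export class LLMAPIError extends Error {}

// Wraps an LLM call with exponential backoff: baseDelayMs, 2x, 4x, ...
export async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500, // assumed default, not from the task description
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new LLMAPIError(
    `LLM call failed after ${maxRetries} retries: ${lastError}`,
  );
}
```

Rate-limit handling (step 5) would extend this by inspecting the error for HTTP 429 and honoring any Retry-After value before the next attempt.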

Task 1.6: Modular Tool Architecture

  • Depends on: 1.3, 1.5
  • Description: Replace the hardcoded tools array with a modular, registry-based tool system.
  • Steps:
    1. Create src/tools/ directory. Move existing tools to src/tools/search.ts and src/tools/calculate.ts.
    2. Create src/tools/registry.ts with a ToolRegistry class: register(tool), get(name), list(), loadFromDirectory(path).
    3. Define a ToolDefinition interface: { name, description, schema, execute, permissions?, timeout? }.
    4. Implement loadFromDirectory() that dynamically imports all .ts/.js files from a directory and auto-registers tools that export a ToolDefinition.
    5. Replace the eval() calculator with a safe math expression parser (e.g., mathjs).
    6. Update src/index.ts to use ToolRegistry instead of the imported tools array.
    7. Write tests: (a) register/unregister tools, (b) load from directory, (c) duplicate name rejection, (d) tool listing.
  • Acceptance Criteria:
    • Tools are loaded dynamically from src/tools/ at startup.
    • Adding a new .ts file to src/tools/ that exports a ToolDefinition makes it available without editing any other file.
    • The eval() calculator is replaced with a safe alternative.
    • ToolRegistry.list() returns all registered tool names and descriptions.
  • Test Requirements: Unit + integration tests. Create a test tool fixture.
  • Guidelines: Each tool file should be self-contained with its own schema.
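A sketch of the ToolDefinition interface and the in-memory part of ToolRegistry. The schema field is typed loosely here; the real version would likely use a zod schema, and loadFromDirectory (dynamic import) is omitted for brevity.

```typescript
export interface ToolDefinition {
  name: string;
  description: string;
  schema: unknown; // loosely typed here; likely a zod schema in practice
  execute: (args: Record<string, unknown>) => Promise<unknown>;
  permissions?: "safe" | "cautious" | "dangerous";
  timeout?: number; // milliseconds
}

export class ToolRegistry {
  private tools = new Map<string, ToolDefinition>();

  register(tool: ToolDefinition): void {
    // Duplicate names are rejected (acceptance criterion 7c).
    if (this.tools.has(tool.name)) {
      throw new Error(`Tool "${tool.name}" is already registered`);
    }
    this.tools.set(tool.name, tool);
  }

  get(name: string): ToolDefinition | undefined {
    return this.tools.get(name);
  }

  list(): { name: string; description: string }[] {
    return [...this.tools.values()].map(({ name, description }) => ({
      name,
      description,
    }));
  }
}
```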

Task 1.7: Tool Security & Human-in-the-Loop

  • Depends on: 1.6
  • Description: Add a permission model and confirmation mechanism for tool execution.
  • Steps:
    1. Define permission levels in ToolDefinition: "safe" (auto-approve), "cautious" (log + approve), "dangerous" (require confirmation).
    2. Create src/security.ts with: ToolPermissionManager that checks permissions before tool execution.
    3. Implement a ConfirmationHandler interface with a default CLI implementation that prompts the user for "dangerous" tools.
    4. Add an AUTO_APPROVE_ALL config flag for non-interactive/CI contexts.
    5. Add an allowlist/blocklist mechanism for tool names in appConfig.
    6. Write tests: (a) safe tool auto-approved, (b) dangerous tool blocked without confirmation, (c) blocklisted tool rejected, (d) AUTO_APPROVE_ALL bypasses confirmation.
  • Acceptance Criteria:
    • A tool marked "dangerous" prompts the user before execution in interactive mode.
    • A blocklisted tool name is rejected with a descriptive error message even if registered.
    • AUTO_APPROVE_ALL=true skips all confirmation prompts.
  • Test Requirements: Unit tests with mocked ConfirmationHandler.
  • Guidelines: The ConfirmationHandler interface must be replaceable (for future UI/API integration).
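The permission flow in steps 1–5 can be sketched as below. ConfirmationHandler is the replaceable interface from the guideline; the option names here mirror the AUTO_APPROVE_ALL flag and blocklist mechanism but their exact config keys are assumptions.

```typescript
type Permission = "safe" | "cautious" | "dangerous";

// Replaceable so a future UI/API can supply its own confirmation mechanism.
export interface ConfirmationHandler {
  confirm(toolName: string): Promise<boolean>;
}

export class ToolPermissionManager {
  constructor(
    private handler: ConfirmationHandler,
    private opts: { autoApproveAll?: boolean; blocklist?: string[] } = {},
  ) {}

  async check(toolName: string, permission: Permission): Promise<boolean> {
    // Blocklisted tools are rejected even if registered.
    if (this.opts.blocklist?.includes(toolName)) {
      throw new Error(`Tool "${toolName}" is blocklisted`);
    }
    if (this.opts.autoApproveAll) return true; // AUTO_APPROVE_ALL for CI
    if (permission === "safe") return true;
    if (permission === "cautious") {
      console.log(`[cautious] approving tool: ${toolName}`); // log + approve
      return true;
    }
    // "dangerous": requires explicit confirmation.
    return this.handler.confirm(toolName);
  }
}
```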

Phase 2: Essential Coding Agent Tools

Goal: Build the concrete tools a software development agent needs. Each tool follows the ToolDefinition interface from Phase 1.

Task 2.1: Shell Command Execution Tool

  • Depends on: 1.6, 1.7
  • Description: Implement a tool that executes shell commands with safety controls.
  • Steps:
    1. Create src/tools/shell.ts implementing ToolDefinition with permission level "dangerous".
    2. Use Node.js child_process.execFile (not exec) to avoid shell injection.
    3. Support configurable working directory, environment variables, and timeout.
    4. Capture stdout, stderr, and exit code. Return structured output: { stdout, stderr, exitCode }.
    5. Implement a command blocklist (configurable, default blocks: rm -rf /, mkfs, dd, etc.).
    6. Write tests: (a) successful command, (b) failing command returns stderr + exit code, (c) blocked command rejected, (d) timeout kills process.
  • Acceptance Criteria:
    • shell.execute({ command: "echo hello" }) returns { stdout: "hello\n", stderr: "", exitCode: 0 }.
    • shell.execute({ command: "rm -rf /" }) is rejected by the blocklist before execution.
    • A command exceeding timeout is killed and returns a timeout error.
  • Test Requirements: Integration tests in a temp directory. No destructive commands.
  • Guidelines: Never use child_process.exec with unsanitized input. Permission level is "dangerous".
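The blocklist check from step 5 is a pure function that runs before any execFile call; a sketch with the example patterns from the task (the real list would be configurable):

```typescript
// Example patterns only; the production blocklist comes from config.
const DEFAULT_BLOCKED_PATTERNS = [/\brm\s+-rf\s+\//, /\bmkfs\b/, /\bdd\b/];

export function isBlocked(
  command: string,
  patterns: RegExp[] = DEFAULT_BLOCKED_PATTERNS,
): boolean {
  return patterns.some((p) => p.test(command));
}
```

Because the tool uses execFile with an args array rather than a shell string, this check is defense in depth, not the only injection barrier.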

Task 2.2: File Management Tools

  • Depends on: 1.6, 1.7
  • Description: Implement tools for reading, creating, editing, and deleting files.
  • Steps:
    1. Create src/tools/file-read.ts — reads file content, returns content + metadata (size, encoding). Permission: "safe".
    2. Create src/tools/file-write.ts — creates or overwrites files. Permission: "cautious".
    3. Create src/tools/file-edit.ts — applies targeted edits (search-and-replace or line-range replacement). Permission: "cautious".
    4. Create src/tools/file-delete.ts — deletes files. Permission: "dangerous".
    5. All tools enforce a configurable WORKSPACE_ROOT — no file operations outside this directory (path traversal prevention).
    6. Create src/tools/file-list.ts — lists directory contents with optional glob filtering. Permission: "safe".
    7. Write tests for each tool using a temp directory fixture.
  • Acceptance Criteria:
    • file-read returns file content as a string with correct encoding detection.
    • file-write creates a new file and file-read can read it back.
    • file-edit with { path, search: "old", replace: "new" } modifies only the matched content.
    • Any path outside WORKSPACE_ROOT is rejected with an error.
    • file-list with a glob pattern returns matching entries.
  • Test Requirements: Integration tests with tmp directories. Clean up after each test.
  • Guidelines: Use fs/promises. Validate all paths with path.resolve() + prefix check.
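The path.resolve() + prefix check from the guideline can be sketched as a single guard that every file tool calls before touching the filesystem:

```typescript
import * as path from "node:path";

// Resolves a candidate path and rejects anything that escapes the workspace.
export function resolveInWorkspace(
  workspaceRoot: string,
  candidate: string,
): string {
  const root = path.resolve(workspaceRoot);
  const resolved = path.resolve(root, candidate);
  // Prefix check with a trailing separator so "/ws-evil" can't pass as "/ws".
  if (resolved !== root && !resolved.startsWith(root + path.sep)) {
    throw new Error(`Path "${candidate}" escapes WORKSPACE_ROOT`);
  }
  return resolved;
}
```

Note the separator in the prefix check: comparing against the bare root string would wrongly accept sibling directories that share the root as a name prefix.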

Task 2.3: Code Search Tool

  • Depends on: 1.6
  • Description: Implement a tool for searching through codebases using text patterns and regex.
  • Steps:
    1. Create src/tools/code-search.ts implementing ToolDefinition. Permission: "safe".
    2. Support search modes: literal string match, regex, and file-name glob.
    3. Use ripgrep CLI if available (with child_process), fall back to a recursive fs + regex search.
    4. Return results as: { matches: [{ file, line, column, content, context }] } with configurable max results (default: 50).
    5. Respect .gitignore patterns when searching.
    6. Write tests: (a) literal search finds match, (b) regex search, (c) respects max results, (d) no matches returns empty array.
  • Acceptance Criteria:
    • Searching for a known string in a test fixture returns the correct file, line, and content.
    • Regex search for /function\s+\w+/ returns function definitions.
    • Results are capped at the configured maximum.
  • Test Requirements: Integration tests with a test fixture directory containing sample files.
  • Guidelines: Prefer ripgrep for performance but ensure the fallback works.

Task 2.4: Version Control (Git) Tools

  • Depends on: 2.1
  • Description: Implement tools for common Git operations.
  • Steps:
    1. Create src/tools/git-status.ts — runs git status --porcelain and returns structured output. Permission: "safe".
    2. Create src/tools/git-diff.ts — runs git diff with optional path filter. Permission: "safe".
    3. Create src/tools/git-commit.ts — stages specified files and commits with a message. Permission: "cautious".
    4. Create src/tools/git-log.ts — returns recent commit history. Permission: "safe".
    5. Use simple-git library (not nodegit — it's abandoned and has native compilation issues).
    6. Write tests using a temp Git repository fixture.
  • Acceptance Criteria:
    • git-status on a repo with uncommitted changes returns structured file status list.
    • git-diff shows the diff content for modified files.
    • git-commit creates a commit that appears in git-log output.
    • All tools fail gracefully when run outside a Git repository.
  • Test Requirements: Integration tests that create/destroy a temp Git repo with simple-git.
  • Guidelines: Do NOT implement git push yet — that's a "dangerous" operation for a later task.

Task 2.5: Code Execution Tool

  • Depends on: 2.1
  • Description: Implement a tool that runs code snippets or test commands in a controlled environment.
  • Steps:
    1. Create src/tools/code-run.ts implementing ToolDefinition. Permission: "dangerous".
    2. Support execution modes: (a) run a shell command (delegates to shell tool), (b) run a script file.
    3. Capture stdout, stderr, exit code. Enforce timeout (default: 60s).
    4. Add EXECUTION_ENVIRONMENT config for future sandboxing (default: "local").
    5. Write tests: (a) run a simple Node.js script, (b) run a failing script returns error, (c) timeout enforcement.
  • Acceptance Criteria:
    • Running node -e "console.log(42)" returns { stdout: "42\n", exitCode: 0 }.
    • Running a script with a syntax error returns stderr and non-zero exit code.
    • A script exceeding timeout is killed.
  • Test Requirements: Integration tests with simple script fixtures.
  • Guidelines: This is intentionally simple for now — sandboxed execution (Docker, etc.) is a Phase 4 concern.

Task 2.6: Workspace Context Awareness

  • Depends on: 2.2, 2.3
  • Description: Give the agent automatic awareness of the project it's working in.
  • Steps:
    1. Create src/workspace.ts with analyzeWorkspace(rootPath): WorkspaceInfo.
    2. WorkspaceInfo includes: { language, framework, packageManager, hasTests, testCommand, lintCommand, buildCommand, entryPoints, gitInitialized }.
    3. Detect via heuristics: package.json → Node/TS, requirements.txt/pyproject.toml → Python, go.mod → Go, etc.
    4. Extract test/build/lint commands from package.json scripts, Makefile, etc.
    5. Feed WorkspaceInfo into getSystemPrompt() (from Task 1.3) so the LLM knows the project context.
    6. Write tests with fixture directories for Node, Python, and Go projects.
  • Acceptance Criteria:
    • A directory with package.json is detected as Node/TypeScript with the correct test command.
    • A directory with pyproject.toml is detected as Python.
    • WorkspaceInfo is included in the system prompt.
  • Test Requirements: Unit tests with mock filesystem fixtures.
  • Guidelines: Start with Node/TS detection (since AgentLoop itself is TS). Add more languages incrementally.
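The marker-file heuristics from step 3 reduce to a lookup table; this sketch shows only language detection, while the real analyzeWorkspace would also read package.json scripts for the test/build/lint commands.

```typescript
// Ordered marker-file heuristics: first match wins.
const LANGUAGE_MARKERS: [marker: string, language: string][] = [
  ["package.json", "node"],
  ["pyproject.toml", "python"],
  ["requirements.txt", "python"],
  ["go.mod", "go"],
];

export function detectLanguage(filesInRoot: string[]): string | undefined {
  for (const [marker, language] of LANGUAGE_MARKERS) {
    if (filesInRoot.includes(marker)) return language;
  }
  return undefined;
}
```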

Task 2.7: Diff/Patch Generation Tool

  • Depends on: 2.2
  • Description: Implement a tool that generates unified diffs and can apply patches.
  • Steps:
    1. Create src/tools/diff.ts — generates a unified diff between two strings or files. Permission: "safe".
    2. Create src/tools/patch.ts — applies a unified diff patch to a file. Permission: "cautious".
    3. Use a library like diff (npm) for diff generation and application.
    4. Write tests: (a) generate diff between two strings, (b) apply patch restores expected content, (c) applying a bad patch returns an error.
  • Acceptance Criteria:
    • diff.execute({ original: "hello", modified: "hello world" }) returns a valid unified diff string.
    • Applying the generated diff to the original produces the modified content.
  • Test Requirements: Unit tests, no external dependencies.
  • Guidelines: This is foundational for the agent to communicate code changes clearly.

Task 2.8: MCP Client Integration

  • Depends on: 1.6, 1.7
  • Description: Integrate MCP (Model Context Protocol) client support so the agent can connect to external MCP tool servers.
  • Steps:
    1. Add @modelcontextprotocol/sdk as a dependency.
    2. Create src/mcp/client.ts implementing an MCP client that connects to MCP servers via stdio or SSE transport.
    3. Create src/mcp/bridge.ts that converts MCP tool definitions into ToolDefinition format and registers them in the ToolRegistry.
    4. Add MCP_SERVERS config (JSON array of server configs: { name, command, args, transport }).
    5. On startup, connect to configured MCP servers, discover their tools, and register them.
    6. Handle MCP server lifecycle: start on agent init, stop on agent shutdown.
    7. Write tests: (a) mock MCP server connection, (b) tool discovery and registration, (c) tool invocation through the bridge, (d) server crash handling.
  • Acceptance Criteria:
    • With MCP_SERVERS=[{"name":"test","command":"node","args":["test-server.js"],"transport":"stdio"}], the agent discovers and registers tools from the MCP server.
    • An MCP tool can be invoked through the normal tool execution flow.
    • If an MCP server crashes, the agent logs the error and marks those tools as unavailable.
  • Test Requirements: Integration tests with a minimal mock MCP server (a simple Node.js script).
  • Guidelines: MCP is a protocol for tool/resource servers — the agent is the client, not the server. Follow the official MCP SDK docs.

Phase 3: Subagent Architecture & Planning

Goal: Enable the agent to delegate work to specialized sub-agents and generate plans.

Task 3.1: Subagent Architecture

  • Depends on: 1.2, 1.6, 1.5
  • Description: Design and implement the core subagent framework.
  • Steps:
    1. Create src/subagents/ directory.
    2. Define SubagentDefinition interface: { name, systemPrompt, tools: string[], maxIterations, parentCommunication }.
    3. Create src/subagents/runner.ts with runSubagent(definition, task): SubagentResult, which creates a new agent loop instance with its own message history, tool set, and iteration budget.
    4. Create src/subagents/manager.ts with a SubagentManager class that tracks running subagents, enforces concurrency limits, and collects results.
    5. Subagents use the same createLLM factory and ToolRegistry but with filtered tool access.
    6. Implement result passing: subagent output is returned to the parent agent as a structured message.
    7. Write tests: (a) subagent runs to completion, (b) subagent respects its own MAX_ITERATIONS, (c) subagent only accesses its allowed tools, (d) manager enforces concurrency limit.
  • Acceptance Criteria:
    • runSubagent({ name: "planner", tools: ["file-read"], maxIterations: 5 }, "list files") runs a sub-loop and returns a result.
    • A subagent cannot access tools not in its tools list.
    • SubagentManager with concurrency limit 2 queues a third subagent.
  • Test Requirements: Unit tests with mocked LLM. Integration test with a real (cheap) LLM call as a smoke test.
  • Guidelines: Subagents are isolated agent loops — they don't share message history with the parent.

Task 3.2: Plan Generation Subagent

  • Depends on: 3.1
  • Description: Implement an LLM-based planning subagent that decomposes user requests into actionable task lists.
  • Steps:
    1. Create src/subagents/planner.ts with a specialized system prompt for plan generation.
    2. The planner takes user input + WorkspaceInfo and outputs a structured plan: { steps: [{ description, toolsNeeded, estimatedComplexity }] }.
    3. Implement a JSON-mode or structured output constraint for reliable plan parsing.
    4. Add a plan validation step that checks all referenced tools exist in the registry.
    5. Implement a feedback loop: if the parent agent rejects a plan, the planner refines it.
    6. Write tests: (a) plan generated from input contains steps, (b) plan with invalid tools is flagged, (c) refinement produces a different plan.
  • Acceptance Criteria:
    • Given "Add a new endpoint to the Express app," the planner produces a multi-step plan with tools like file-read, file-write, code-run.
    • A plan referencing a non-existent tool is flagged with a validation error.
    • After one refinement round, the plan no longer references invalid tools.
  • Test Requirements: Unit tests with mocked LLM responses. One integration smoke test.
  • Guidelines: Use structured output / JSON mode to avoid brittle text parsing.

Task 3.3: Task Execution Orchestrator

  • Depends on: 3.1, 3.2
  • Description: Implement the orchestrator that takes a plan and executes it step-by-step.
  • Steps:
    1. Create src/orchestrator.ts with executePlan(plan): ExecutionResult.
    2. For each step in the plan, either: (a) execute directly if simple (single tool call), or (b) spawn a subagent for complex steps.
    3. Implement checkpointing: after each step, save progress so execution can resume after a failure.
    4. Implement step-level error handling: if a step fails, the orchestrator can retry, skip, or abort (configurable).
    5. Provide progress reporting (step N of M, current status).
    6. Write tests: (a) 3-step plan executes all steps, (b) step failure with retry succeeds on second attempt, (c) checkpoint allows resumption.
  • Acceptance Criteria:
    • A 3-step plan executes all steps in order and returns combined results.
    • If step 2 fails, the orchestrator retries once and continues if the retry succeeds.
    • After a crash at step 2, executePlan(plan, { resumeFrom: 2 }) skips step 1.
  • Test Requirements: Unit tests with mocked tool executions.
  • Guidelines: Keep the orchestrator stateless — checkpoint state is stored externally (file or memory).
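The step loop with retry and resumeFrom can be sketched as below. Steps are reduced to plain async functions and the checkpoint is just a 1-based step index, standing in for the externally stored checkpoint state the guideline calls for.

```typescript
type Step = { description: string; run: () => Promise<string> };

export async function executePlan(
  steps: Step[],
  opts: { resumeFrom?: number; maxRetries?: number } = {},
): Promise<{ results: string[]; completed: number }> {
  const start = (opts.resumeFrom ?? 1) - 1; // resumeFrom is 1-based
  const maxRetries = opts.maxRetries ?? 1;
  const results: string[] = [];
  for (let i = start; i < steps.length; i++) {
    let lastErr: unknown;
    let done = false;
    // Retry policy: re-run a failed step up to maxRetries times.
    for (let attempt = 0; attempt <= maxRetries && !done; attempt++) {
      try {
        results.push(await steps[i].run());
        done = true;
      } catch (err) {
        lastErr = err;
      }
    }
    if (!done) throw new Error(`Step ${i + 1} failed: ${lastErr}`);
  }
  return { results, completed: results.length };
}
```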

Task 3.4: Advanced MCP Features

  • Depends on: 2.8
  • Description: Extend MCP support with resources, prompts, and sampling capabilities.
  • Steps:
    1. Implement MCP resource discovery and reading in src/mcp/client.ts.
    2. Implement MCP prompt template support — discover and use prompts from MCP servers.
    3. Implement MCP sampling support — allow MCP servers to request LLM completions from the agent.
    4. Add MCP server health checking and auto-reconnection.
    5. Write tests for each new MCP capability.
  • Acceptance Criteria:
    • The agent can list and read resources from an MCP server.
    • MCP prompt templates can be discovered and used in the agent's prompt system.
    • An MCP server requesting sampling receives a completion from the agent's LLM.
  • Test Requirements: Integration tests with a mock MCP server.
  • Guidelines: Follow the MCP specification for resources, prompts, and sampling.

Task 3.5: Multi-Agent Coordination

  • Depends on: 3.1, 3.3
  • Description: Enable multiple subagents to work in parallel and coordinate on complex tasks.
  • Steps:
    1. Extend SubagentManager with parallel execution support (runParallel(tasks[])).
    2. Implement a shared context mechanism — subagents can read (but not write) shared state.
    3. Implement result aggregation — collect outputs from parallel subagents into a unified result.
    4. Add conflict detection — if two subagents modify the same file, flag it.
    5. Write tests: (a) 3 parallel subagents complete independently, (b) conflict detection triggers on overlapping file edits, (c) aggregated results contain all outputs.
  • Acceptance Criteria:
    • runParallel([task1, task2, task3]) runs 3 subagents concurrently and returns all results.
    • Two subagents editing the same file produces a conflict warning.
  • Test Requirements: Unit tests with mocked subagents.
  • Guidelines: Use Promise.allSettled() for parallel execution.
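With Promise.allSettled, runParallel falls out naturally: one failing subagent does not abort the others. Subagent tasks are reduced to async functions in this sketch.

```typescript
type SubagentTask<T> = () => Promise<T>;

// Runs all tasks concurrently and partitions the outcomes.
export async function runParallel<T>(
  tasks: SubagentTask<T>[],
): Promise<{ ok: T[]; failed: string[] }> {
  const settled = await Promise.allSettled(tasks.map((t) => t()));
  const ok: T[] = [];
  const failed: string[] = [];
  for (const result of settled) {
    if (result.status === "fulfilled") ok.push(result.value);
    else failed.push(String(result.reason));
  }
  return { ok, failed };
}
```

Conflict detection (step 4) would layer on top: each result reports the files it touched, and the aggregator flags overlaps.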

Phase 4: Observability, Security Hardening & Streaming

Goal: Make the agent production-observable and add streaming UX.

Task 4.1: Observability & Tracing

  • Depends on: 1.1, 1.5
  • Description: Add structured tracing and cost tracking to the agent.
  • Steps:
    1. Create src/observability.ts with a Tracer interface.
    2. Implement a default tracer that logs to structured JSON files: one trace per agent invocation, with spans for each LLM call and tool execution.
    3. Track token usage (prompt + completion tokens) per LLM call using LangChain's callback mechanism.
    4. Add cumulative cost estimation based on configurable per-token pricing.
    5. Add a TRACING_ENABLED and TRACE_OUTPUT_DIR config.
    6. Write tests verifying trace files are created with expected structure.
  • Acceptance Criteria:
    • Each agent invocation produces a trace file with LLM calls, tool executions, and token counts.
    • Total tokens and estimated cost are included in the trace summary.
  • Test Requirements: Integration test that runs an agent invocation and inspects the trace file.
  • Guidelines: Design the Tracer interface to be replaceable (for OpenTelemetry, LangSmith, etc. in the future).

Task 4.2: Streaming Response Support

  • Depends on: 1.1, 1.2
  • Description: Add streaming support so partial LLM responses are delivered incrementally.
  • Steps:
    1. Add a stream option to agentExecutor.invoke(input, { stream: true }) that returns an AsyncIterable<string>.
    2. Use LangChain's .stream() method instead of .invoke() when streaming is requested.
    3. Update the CLI main loop to print tokens as they arrive.
    4. Handle tool calls during streaming (buffer until the complete tool call is received, then execute).
    5. Write tests: (a) streaming mode returns chunks, (b) tool calls are buffered correctly, (c) non-streaming mode still works.
  • Acceptance Criteria:
    • In streaming mode, the CLI prints tokens character-by-character as they arrive from the LLM.
    • Tool calls during streaming are handled correctly (buffered, executed, then streaming resumes).
  • Test Requirements: Unit tests with mocked streaming LLM responses.
  • Guidelines: Use for await...of for the streaming consumer.
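The consumer side of streaming can be sketched as below; the AsyncIterable stands in for the chunk stream LangChain's .stream() would yield, and onChunk is where the CLI would write each token as it arrives.

```typescript
// Consumes a chunk stream with for await...of, invoking onChunk per chunk
// (e.g. process.stdout.write in the CLI) and returning the full response.
export async function collectStream(
  stream: AsyncIterable<string>,
  onChunk: (chunk: string) => void,
): Promise<string> {
  let full = "";
  for await (const chunk of stream) {
    onChunk(chunk);
    full += chunk;
  }
  return full;
}
```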

Task 4.3: Security Hardening

  • Depends on: 1.7, 2.1, 2.2
  • Description: Comprehensive security audit and hardening pass.
  • Steps:
    1. Implement workspace isolation: confine all file and shell operations to WORKSPACE_ROOT (chroot-like containment).
    2. Add network access controls for tools (configurable allowlist of domains/IPs).
    3. Implement resource limits: max file size for read/write, max output size for shell commands, max concurrent tool executions.
    4. Add input sanitization for all tool parameters.
    5. Create a security test suite that attempts common attacks: path traversal, command injection, resource exhaustion.
    6. Document the security model in docs/security.md.
  • Acceptance Criteria:
    • Path traversal attempts (../../etc/passwd) are blocked.
    • Shell injection attempts (; rm -rf /) are blocked.
    • File reads exceeding MAX_FILE_SIZE are rejected.
    • The security document describes the threat model and mitigations.
  • Test Requirements: Dedicated security test suite.
  • Guidelines: Assume all LLM-generated tool inputs are potentially malicious.

Task 4.4: Execution Sandboxing (Optional/Stretch)

  • Depends on: 4.3, 2.5
  • Description: Add optional Docker-based sandboxing for code execution.
  • Steps:
    1. Create src/sandbox/docker.ts that runs code execution inside a Docker container.
    2. Use a minimal base image, mount the workspace read-only (or copy), capture output.
    3. Add SANDBOX_MODE config: "none" (default), "docker".
    4. Write tests: (a) code runs in container, (b) filesystem changes in container don't affect host.
  • Acceptance Criteria:
    • With SANDBOX_MODE=docker, code execution runs inside a container.
    • The host filesystem is not modified by sandboxed execution.
  • Test Requirements: Integration tests (require Docker installed). Mark as skippable in CI without Docker.
  • Guidelines: This is a stretch goal. The "none" mode is the default and must always work.

Phase 5: Testing, Performance & Quality

Goal: Comprehensive testing pass and performance optimization.

Task 5.1: End-to-End Test Suite

  • Depends on: all Phase 2 and 3 tasks
  • Description: Build an E2E test suite that tests complete agent workflows.
  • Steps:
    1. Create tests/e2e/ directory.
    2. Implement test scenarios: (a) "Create a new file with specific content" — exercises file-write + verification, (b) "Find and fix a bug" — exercises code-search + file-edit + code-run, (c) "Generate a plan for a feature" — exercises planner subagent.
    3. Use a mock LLM with deterministic responses for reproducibility.
    4. Add a flag E2E_USE_REAL_LLM for optional live testing (not in CI by default).
    5. Measure and assert on total execution time per scenario.
  • Acceptance Criteria:
    • All 3 E2E scenarios pass with mock LLM.
    • Each scenario completes in under 10 seconds with mock LLM.
  • Test Requirements: E2E tests. Mock LLM by default, optional real LLM.
  • Guidelines: E2E tests should be self-contained — each creates and tears down its own workspace.
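The self-containment guideline can be captured in one helper that every scenario wraps itself in (a sketch; `withTempWorkspace` is a hypothetical name, and real scenarios would likely use a Promise-returning variant so the agent run can be awaited):

```typescript
// Sketch: each E2E scenario gets a fresh temp workspace and tears it down
// even when the scenario fails.
import { mkdtempSync, rmSync, writeFileSync, existsSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

function withTempWorkspace<T>(fn: (dir: string) => T): T {
  const dir = mkdtempSync(join(tmpdir(), "agentloop-e2e-"));
  try {
    return fn(dir);
  } finally {
    rmSync(dir, { recursive: true, force: true }); // cleanup on success or failure
  }
}
```

Scenario (a) then reduces to: run the agent (with the mock LLM) against `dir`, and assert the expected file exists before the workspace is removed.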

Task 5.2: Performance Benchmarking

  • Depends on: 5.1
  • Description: Benchmark and optimize critical paths.
  • Steps:
    1. Benchmark: tool registry lookup time with 100 registered tools.
    2. Benchmark: context trimming with 1000 messages.
    3. Benchmark: code search on a 10,000-file repository.
    4. Benchmark: workspace analysis on a large project.
    5. Profile with Node.js --prof and identify hotspots.
    6. Optimize identified bottlenecks.
    7. Document benchmarks and results.
  • Acceptance Criteria:
    • Tool registry lookup is < 1ms for 100 tools.
    • Context trimming for 1000 messages completes in < 100ms.
    • All benchmarks are documented with baseline and optimized numbers.
  • Test Requirements: Benchmark scripts in benchmarks/.
  • Guidelines: Run benchmarks in CI to catch regressions.
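Step 1 might be sketched as follows, assuming the registry is Map-backed (illustrative names, not the actual implementation):

```typescript
// Sketch of a registry-lookup micro-benchmark for benchmarks/: populate 100
// tools, then time the average lookup over many iterations.
import { performance } from "node:perf_hooks";

const registry = new Map<string, { name: string }>();
for (let i = 0; i < 100; i++) {
  registry.set(`tool_${i}`, { name: `tool_${i}` });
}

const ITERATIONS = 10_000;
const start = performance.now();
for (let i = 0; i < ITERATIONS; i++) {
  registry.get(`tool_${i % 100}`); // cycle through all registered names
}
const perLookupMs = (performance.now() - start) / ITERATIONS;
console.log(`avg lookup: ${perLookupMs.toFixed(6)} ms`); // target: < 1 ms
```

The same shape (populate, loop, divide by iterations, print) applies to the context-trimming and code-search benchmarks; recording each run's number is what makes the "baseline and optimized" documentation criterion checkable.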

Task 5.3: Testing Infrastructure for LLM-Dependent Code

  • Depends on: 1.1
  • Description: Establish the mocking/recording strategy for testing LLM-dependent behavior.
  • Steps:
    1. Create tests/fixtures/llm-responses/ with recorded LLM response fixtures.
    2. Build a MockChatModel class that replays recorded responses based on input patterns.
    3. Add a recording mode that captures real LLM responses and saves them as fixtures.
    4. Document the testing strategy in docs/testing.md.
    5. Retrofit existing tests to use MockChatModel.
  • Acceptance Criteria:
    • MockChatModel can replay a recorded multi-turn conversation with tool calls.
    • Recording mode captures real responses and saves valid fixture files.
    • All existing tests pass with MockChatModel (no real API calls in CI).
  • Test Requirements: Meta-tests (tests for the test infrastructure).
  • Guidelines: This is foundational — do this early (it's in Phase 5 for execution but could be pulled into Phase 1 if needed).
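The replay idea in steps 1–2 can be sketched without any LangChain dependency (a real MockChatModel would implement LangChain's chat-model interface and load fixtures from tests/fixtures/llm-responses/; everything below is illustrative):

```typescript
// Sketch: replay recorded turns in order, checking each incoming prompt
// against the fixture's expected pattern so mismatched conversations fail loudly.
interface RecordedTurn {
  matchPattern: string; // substring the prompt must contain
  content: string;      // assistant text to replay
  toolCalls: { name: string; args: Record<string, unknown> }[];
}

class MockChatModel {
  private cursor = 0;
  constructor(private fixtures: RecordedTurn[]) {}

  invoke(prompt: string): { content: string; toolCalls: RecordedTurn["toolCalls"] } {
    const turn = this.fixtures[this.cursor];
    if (!turn) throw new Error(`No fixture left for prompt: "${prompt}"`);
    if (!prompt.includes(turn.matchPattern)) {
      throw new Error(
        `Fixture ${this.cursor} expected a prompt containing "${turn.matchPattern}"`
      );
    }
    this.cursor++;
    return { content: turn.content, toolCalls: turn.toolCalls };
  }
}
```

A multi-turn conversation with tool calls (the first acceptance criterion) is then just repeated `invoke()` calls against a fixture list whose first turn carries `toolCalls` and whose last turn carries the final answer.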

Phase 6: Documentation & Deployment

Task 6.1: Documentation

  • Depends on: all prior phases
  • Description: Comprehensive documentation covering architecture, usage, and extension.
  • Steps:
    1. Create docs/ directory structure: architecture.md, getting-started.md, tools.md, security.md, extending.md, configuration.md.
    2. architecture.md: system overview diagram, agent loop flow, subagent architecture, MCP integration.
    3. getting-started.md: installation, first run, example workflows.
    4. tools.md: catalog of all built-in tools with examples.
    5. extending.md: how to add a new tool, how to create a subagent, how to connect an MCP server.
    6. configuration.md: all environment variables and their defaults.
    7. Update README.md to reflect the new architecture.
  • Acceptance Criteria:
    • A new developer can go from clone to running the agent within 10 minutes using the docs.
    • All config options are documented with descriptions and defaults.
    • The extending guide includes a complete working example of adding a custom tool.
  • Test Requirements: Documentation review. Verify code examples actually run.
  • Guidelines: Use Mermaid diagrams in markdown for architecture visualization.

Task 6.2: Deployment & Packaging

  • Depends on: 6.1
  • Description: Package the agent for production deployment.
  • Steps:
    1. Add a proper tsconfig.json build configuration that outputs to dist/.
    2. Create a Dockerfile for containerized deployment.
    3. Add npm run build and npm run start:prod scripts.
    4. Create a GitHub Actions workflow for: build → test → publish (npm or Docker).
    5. Add a CHANGELOG.md with semantic versioning.
    6. Publish as an npm package (optional, scoped to @huberp/agentloop).
  • Acceptance Criteria:
    • npm run build produces a clean dist/ directory.
    • docker build . and docker run start the agent successfully.
    • CI pipeline runs all tests and produces a build artifact.
  • Test Requirements: CI pipeline passes. Docker build succeeds.
  • Guidelines: Use multi-stage Docker builds for minimal image size.
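A minimal multi-stage Dockerfile along the lines of steps 1–2 and the guideline might look like this (the entrypoint path and npm scripts are assumptions about the build output, not settled decisions):

```dockerfile
# Build stage: install all deps and compile TypeScript to dist/
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: production deps + compiled output only, for a small image
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
CMD ["node", "dist/index.js"]
```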

Cross-Cutting Concerns (Apply Throughout)

These are not separate tasks but principles to enforce in every task's code review:

| Concern | Requirement |
| --- | --- |
| Logging | Every tool execution, LLM call, and error is logged with structured context |
| TypeScript Strict | Enable `strict: true` in tsconfig. No `any` types except justified cases |
| Error Messages | All errors include actionable context (what failed, what was expected, how to fix) |
| Config | All magic numbers are configurable via `appConfig` with sensible defaults |
| Backwards Compat | `npx tsx src/index.ts` still starts the interactive CLI at every phase |
| Git Hygiene | One PR per task. Each PR includes tests. No PR without tests |

Summary of Changes from Original Plan (Issue #4)

| Problem in #4 | Fix in v2 |
| --- | --- |
| Phase ordering inverted (Phase 2 before 2.5) | Reordered: core tools (Phase 2) before multi-agent (Phase 3) |
| Core loop only does single tool-call pass | Added Task 1.1: iterative tool-calling loop |
| No context window management | Added Task 1.4: token counting + trimming |
| No LLM provider abstraction | Added Task 1.2: provider factory |
| MCP tasks had incorrect semantics | Rewrote as Task 2.8 and 3.4 with correct client-side MCP integration |
| No error handling strategy | Added Task 1.5: comprehensive error handling |
| No human-in-the-loop | Added Task 1.7: permission model + confirmation |
| Vague acceptance criteria | Every task now has specific, testable criteria |
| No dependency graph | Added explicit `Depends on:` for every task |
| nodegit recommendation | Replaced with simple-git |
| `eval()` in calculator | Task 1.6 explicitly replaces it with a safe math parser |
| No workspace awareness | Added Task 2.6 |
| No observability | Added Task 4.1 |
| No streaming | Added Task 4.2 |
| No diff/patch capability | Added Task 2.7 |
| No testing strategy for LLM code | Added Task 5.3 |
| Tasks not scoped for agent consumption | Each task is ~1–3 days, self-contained, with concrete steps |
