
development plan v2 #5

@huberp

Description


Development Plan v2 for Extending AgentLoop

Revised version of #4. Addresses dependency ordering, missing capabilities, vague acceptance criteria, and gaps identified during deep review. Supersedes #4.

Overview

This plan extends the basic AgentLoop — currently a ~140-line TypeScript LangChain agent with Mistral, two hardcoded demo tools, in-memory chat history, and a single-pass tool execution loop — towards a full-fledged LLM-based agent that can support software development.

Key design principles:

  • Each task is scoped to ~1–3 days of focused work and can be fed as an "agent task" to Copilot
  • All acceptance criteria are specific and testable
  • Tasks declare explicit dependencies
  • Testing strategy addresses the LLM-dependency challenge upfront

Current codebase baseline (as of initial plan):

  • src/index.ts — single-pass executeWithTools() with ChatMistralAI, InMemoryChatMessageHistory
  • src/tools.ts — two hardcoded tools (mock search, eval()-based calculator)
  • src/config.ts — dotenv-based config (Mistral key + logger settings)
  • src/logger.ts — Pino logger
  • Jest test setup in src/__tests__/

Task Dependency Graph

Phase 1: T1 → T2 → T3 → T4 → T5 → T6 → T7

Phase 2: T1 → T2 → T3 → T4 → T5 → T6 → T7 → T8
Phase 3: T1 → T2 → T3 → T4 → T5

Phase 4: T1 → T2 → T3 → T4

Phase 5: T1, T2, T3 (parallel)

Phase 6: T1, T2 (parallel)

Detailed dependency links are noted per-task below with Depends on: tags.


Phase 1: Harden the Core Loop

Goal: Make the existing agent loop production-worthy before adding any new capabilities.

Task 1.1: Iterative Tool-Calling Loop

  • Depends on: nothing (first task)
  • Description: Refactor executeWithTools() in src/index.ts from a single tool-call round-trip to a proper agentic loop that iterates until the LLM signals completion (i.e., returns a response with zero tool calls).
  • Steps:
    1. Replace the current single-pass logic with a while loop that continues as long as the LLM response contains tool_calls.
    2. Add a configurable MAX_ITERATIONS guard (default: 20) in appConfig to prevent infinite loops.
    3. Add a configurable MAX_TOKENS_BUDGET (optional, for future use) field in appConfig.
    4. When MAX_ITERATIONS is reached, return the last LLM response with a warning prefix.
    5. Add structured log entries for each iteration (iteration number, tool calls in that iteration).
    6. Write unit tests using mocked LLM responses that simulate: (a) 0 tool calls, (b) 1 round of tool calls, (c) 3 consecutive rounds, (d) hitting MAX_ITERATIONS.
  • Acceptance Criteria:
    • executeWithTools("multi-step query") iterates until no tool calls remain, up to MAX_ITERATIONS.
    • A test with a mock LLM returning 3 consecutive tool-call rounds produces exactly 3 iterations and a final text response.
    • A test hitting MAX_ITERATIONS returns gracefully with a warning, not an error.
  • Test Requirements: Unit tests with mocked ChatMistralAI (no real API calls). Use jest.mock() for the LLM.
  • Guidelines: Do not change the public API of agentExecutor.invoke(). The change is internal to executeWithTools().
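The loop described in steps 1–5 can be sketched as follows. This is a minimal sketch with stand-in types: ToolCall, LLMResponse, and ChatModel are simplified placeholders, not LangChain's real interfaces, and the actual implementation would operate on BaseMessage arrays and the tool_calls field of AIMessage.

```typescript
// Simplified stand-ins for the LangChain types used in src/index.ts.
type ToolCall = { name: string; args: Record<string, unknown> };
type LLMResponse = { content: string; toolCalls: ToolCall[] };
type ChatModel = (messages: string[]) => Promise<LLMResponse>;
type ToolExecutor = (call: ToolCall) => Promise<string>;

const MAX_ITERATIONS = 20; // would come from appConfig in the real code

export async function executeWithTools(
  llm: ChatModel,
  runTool: ToolExecutor,
  messages: string[],
): Promise<string> {
  for (let iteration = 1; iteration <= MAX_ITERATIONS; iteration++) {
    const response = await llm(messages);
    // Completion signal: a response with zero tool calls ends the loop.
    if (response.toolCalls.length === 0) return response.content;
    for (const call of response.toolCalls) {
      const result = await runTool(call);
      messages.push(`tool:${call.name}:${result}`);
    }
  }
  // Budget exhausted: return gracefully with a warning, not an error.
  return "[warning: MAX_ITERATIONS reached] partial result";
}
```

The key property to test is that the iteration count tracks the number of tool-call rounds the mock LLM emits, plus one final text round.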

Task 1.2: LLM Provider Abstraction

  • Depends on: 1.1
  • Description: Decouple the agent loop from ChatMistralAI so any LangChain-compatible chat model can be used.
  • Steps:
    1. Create src/llm.ts that exports a factory function createLLM(config): BaseChatModel.
    2. Add LLM_PROVIDER (default: "mistral") and LLM_MODEL (default: "") to appConfig.
    3. Implement the Mistral provider in the factory. Add a clear extension point (switch/map) for future providers (OpenAI, Anthropic, Ollama).
    4. Update src/index.ts to use the factory instead of directly instantiating ChatMistralAI.
    5. Make temperature and other model parameters configurable via appConfig.
    6. Write unit tests that verify the factory returns the correct provider based on config.
  • Acceptance Criteria:
    • Setting LLM_PROVIDER=mistral produces a ChatMistralAI instance.
    • Setting LLM_PROVIDER=unknown throws a descriptive error.
    • src/index.ts no longer imports ChatMistralAI directly.
  • Test Requirements: Unit tests for the factory function. Mock the provider constructors.
  • Guidelines: Use LangChain's BaseChatModel type for the return type.
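A possible shape for the factory, with the provider map as the extension point. BaseChatModel is stubbed here as a plain interface so the sketch stays self-contained; the real version would return LangChain's BaseChatModel and instantiate ChatMistralAI.

```typescript
// Stub standing in for LangChain's BaseChatModel.
interface BaseChatModel { provider: string; model: string; temperature: number }

interface LLMConfig {
  provider: string;   // maps to LLM_PROVIDER
  model?: string;     // maps to LLM_MODEL
  temperature?: number;
}

type ProviderFactory = (cfg: LLMConfig) => BaseChatModel;

// Extension point: future providers (openai, anthropic, ollama) add entries here.
const providers: Record<string, ProviderFactory> = {
  mistral: (cfg) => ({
    provider: "mistral",
    model: cfg.model ?? "",
    temperature: cfg.temperature ?? 0,
  }),
};

export function createLLM(cfg: LLMConfig): BaseChatModel {
  const factory = providers[cfg.provider];
  if (!factory) {
    throw new Error(
      `Unknown LLM_PROVIDER "${cfg.provider}". Known providers: ${Object.keys(providers).join(", ")}`,
    );
  }
  return factory(cfg);
}
```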

Task 1.3: System Prompt Management

  • Depends on: 1.1
  • Description: Replace the hardcoded "You are a helpful AI assistant." system prompt with a managed, configurable prompt system.
  • Steps:
    1. Create src/prompts/ directory.
    2. Create src/prompts/system.ts that exports a getSystemPrompt(context?: { tools?: string[], projectInfo?: string }): string function.
    3. Define a base system prompt template that includes: agent identity, available tool names, behavioral instructions.
    4. Add SYSTEM_PROMPT_PATH to appConfig (optional override — path to a .txt or .md file).
    5. Update executeWithTools() to use getSystemPrompt() instead of the hardcoded string.
    6. Write tests verifying prompt generation with and without context parameters.
  • Acceptance Criteria:
    • The system prompt dynamically includes the names of currently loaded tools.
    • If SYSTEM_PROMPT_PATH is set, the prompt is loaded from that file.
    • Changing tools changes the system prompt content.
  • Test Requirements: Unit tests. No LLM calls needed.
  • Guidelines: Keep the prompt modular so it can be extended in later phases.
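A sketch of getSystemPrompt with the signature from step 2. The prompt wording is illustrative only; the real template would carry the agreed agent identity and behavioral instructions.

```typescript
// Sketch of src/prompts/system.ts. The exact prompt text is a placeholder.
export function getSystemPrompt(context?: {
  tools?: string[];
  projectInfo?: string;
}): string {
  const parts = [
    "You are AgentLoop, an AI agent that assists with software development.",
  ];
  if (context?.tools?.length) {
    // Tool names are injected dynamically, so changing tools changes the prompt.
    parts.push(`Available tools: ${context.tools.join(", ")}.`);
  }
  if (context?.projectInfo) {
    parts.push(`Project context: ${context.projectInfo}`);
  }
  parts.push("Use tools when needed; reply with a final answer when done.");
  return parts.join("\n");
}
```

The SYSTEM_PROMPT_PATH override from step 4 would simply replace the base string before the dynamic sections are appended.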

Task 1.4: Context Window Management

  • Depends on: 1.1
  • Description: Implement token counting and context window management to prevent exceeding LLM context limits.
  • Steps:
    1. Add tiktoken (or js-tiktoken) as a dependency for token counting.
    2. Create src/context.ts with: countTokens(messages): number and trimMessages(messages, maxTokens): BaseMessage[].
    3. Add MAX_CONTEXT_TOKENS to appConfig (default: 28000 — leaving headroom for response).
    4. trimMessages strategy: always keep the system prompt and last N user messages; summarize or drop oldest messages from the middle.
    5. Integrate trimMessages into executeWithTools() before each LLM call.
    6. Add structured logging when messages are trimmed (count of removed messages, tokens saved).
    7. Write unit tests with synthetic message arrays that exceed the token limit.
  • Acceptance Criteria:
    • When conversation history exceeds MAX_CONTEXT_TOKENS, older messages are dropped/summarized.
    • The system prompt and the most recent user message are never dropped.
    • A test with 100 synthetic messages verifies the output fits within MAX_CONTEXT_TOKENS.
  • Test Requirements: Unit tests with synthetic messages. No LLM calls.
  • Guidelines: Start with a simple "drop oldest" strategy. Summarization can be added in a later task.
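The simple drop-oldest strategy from the guideline can be sketched as below. Token counting is stubbed with a characters-per-token heuristic so the sketch has no dependencies; the real countTokens would use js-tiktoken.

```typescript
type Msg = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Placeholder for tiktoken: a rough ~4-characters-per-token heuristic.
export function countTokens(messages: Msg[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}

export function trimMessages(messages: Msg[], maxTokens: number): Msg[] {
  if (countTokens(messages) <= maxTokens) return messages;
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  // Walk from newest to oldest, keeping messages while they fit.
  // The newest message is always kept, even if it alone exceeds the budget.
  const kept: Msg[] = [];
  for (let i = rest.length - 1; i >= 0; i--) {
    const candidate = [...system, rest[i], ...kept];
    if (countTokens(candidate) > maxTokens && kept.length > 0) break;
    kept.unshift(rest[i]);
  }
  return [...system, ...kept];
}
```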

Task 1.5: Error Handling & Recovery

  • Depends on: 1.1, 1.2
  • Description: Add robust error handling for LLM API failures and tool execution failures.
  • Steps:
    1. Create src/errors.ts with custom error types: LLMAPIError, ToolExecutionError, MaxIterationsError, ContextOverflowError.
    2. Wrap LLM API calls in retry logic with exponential backoff (max 3 retries, configurable).
    3. Add per-tool timeout configuration (default: 30s) and implement timeout enforcement.
    4. When a tool fails, inject the error as a ToolMessage so the LLM can reason about the failure (don't crash the loop).
    5. Add rate-limit detection for common LLM APIs (HTTP 429) with appropriate back-off.
    6. Write tests for: (a) LLM timeout → retry → success, (b) LLM timeout → max retries → graceful error, (c) tool throws → error injected as ToolMessage, (d) tool timeout.
  • Acceptance Criteria:
    • An LLM API failure retries up to 3 times with backoff before throwing LLMAPIError.
    • A tool that throws an error does not crash the agent loop; instead, the error is fed back to the LLM.
    • A tool exceeding its timeout is killed and an error ToolMessage is generated.
  • Test Requirements: Unit tests with mocked failures. No real API calls.
  • Guidelines: Use AbortController for tool timeouts.
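The retry wrapper from step 2 could look like this. LLMAPIError matches the error type named in step 1; the backoff base delay is an assumed default, not specified by the task.

```typescript
export class LLMAPIError extends Error {}

// Wraps an LLM call with exponential backoff: baseDelayMs, 2x, 4x, ...
export async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500, // assumed default, not from the task description
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new LLMAPIError(
    `LLM call failed after ${maxRetries} retries: ${lastError}`,
  );
}
```

Rate-limit handling (step 5) would extend this by inspecting the error for HTTP 429 and honoring any Retry-After value before the next attempt.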

Task 1.6: Modular Tool Architecture

  • Depends on: 1.3, 1.5
  • Description: Replace the hardcoded tools array with a modular, registry-based tool system.
  • Steps:
    1. Create src/tools/ directory. Move existing tools to src/tools/search.ts and src/tools/calculate.ts.
    2. Create src/tools/registry.ts with a ToolRegistry class: register(tool), get(name), list(), loadFromDirectory(path).
    3. Define a ToolDefinition interface: { name, description, schema, execute, permissions?, timeout? }.
    4. Implement loadFromDirectory() that dynamically imports all .ts/.js files from a directory and auto-registers tools that export a ToolDefinition.
    5. Replace the eval() calculator with a safe math expression parser (e.g., mathjs).
    6. Update src/index.ts to use ToolRegistry instead of the imported tools array.
    7. Write tests: (a) register/unregister tools, (b) load from directory, (c) duplicate name rejection, (d) tool listing.
  • Acceptance Criteria:
    • Tools are loaded dynamically from src/tools/ at startup.
    • Adding a new .ts file to src/tools/ that exports a ToolDefinition makes it available without editing any other file.
    • The eval() calculator is replaced with a safe alternative.
    • ToolRegistry.list() returns all registered tool names and descriptions.
  • Test Requirements: Unit + integration tests. Create a test tool fixture.
  • Guidelines: Each tool file should be self-contained with its own schema.
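A sketch of the ToolDefinition interface and the in-memory part of ToolRegistry. The schema field is typed loosely here; the real version would likely use a zod schema, and loadFromDirectory (dynamic import) is omitted for brevity.

```typescript
export interface ToolDefinition {
  name: string;
  description: string;
  schema: unknown; // loosely typed here; likely a zod schema in practice
  execute: (args: Record<string, unknown>) => Promise<unknown>;
  permissions?: "safe" | "cautious" | "dangerous";
  timeout?: number; // milliseconds
}

export class ToolRegistry {
  private tools = new Map<string, ToolDefinition>();

  register(tool: ToolDefinition): void {
    // Duplicate names are rejected (acceptance criterion 7c).
    if (this.tools.has(tool.name)) {
      throw new Error(`Tool "${tool.name}" is already registered`);
    }
    this.tools.set(tool.name, tool);
  }

  get(name: string): ToolDefinition | undefined {
    return this.tools.get(name);
  }

  list(): { name: string; description: string }[] {
    return [...this.tools.values()].map(({ name, description }) => ({
      name,
      description,
    }));
  }
}
```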

Task 1.7: Tool Security & Human-in-the-Loop

  • Depends on: 1.6
  • Description: Add a permission model and confirmation mechanism for tool execution.
  • Steps:
    1. Define permission levels in ToolDefinition: "safe" (auto-approve), "cautious" (log + approve), "dangerous" (require confirmation).
    2. Create src/security.ts with: ToolPermissionManager that checks permissions before tool execution.
    3. Implement a ConfirmationHandler interface with a default CLI implementation that prompts the user for "dangerous" tools.
    4. Add an AUTO_APPROVE_ALL config flag for non-interactive/CI contexts.
    5. Add an allowlist/blocklist mechanism for tool names in appConfig.
    6. Write tests: (a) safe tool auto-approved, (b) dangerous tool blocked without confirmation, (c) blocklisted tool rejected, (d) AUTO_APPROVE_ALL bypasses confirmation.
  • Acceptance Criteria:
    • A tool marked "dangerous" prompts the user before execution in interactive mode.
    • A blocklisted tool name is rejected with a descriptive error message even if registered.
    • AUTO_APPROVE_ALL=true skips all confirmation prompts.
  • Test Requirements: Unit tests with mocked ConfirmationHandler.
  • Guidelines: The ConfirmationHandler interface must be replaceable (for future UI/API integration).
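The permission flow in steps 1–5 can be sketched as below. ConfirmationHandler is the replaceable interface from the guideline; the option names here mirror the AUTO_APPROVE_ALL flag and blocklist mechanism but their exact config keys are assumptions.

```typescript
type Permission = "safe" | "cautious" | "dangerous";

// Replaceable so a future UI/API can supply its own confirmation mechanism.
export interface ConfirmationHandler {
  confirm(toolName: string): Promise<boolean>;
}

export class ToolPermissionManager {
  constructor(
    private handler: ConfirmationHandler,
    private opts: { autoApproveAll?: boolean; blocklist?: string[] } = {},
  ) {}

  async check(toolName: string, permission: Permission): Promise<boolean> {
    // Blocklisted tools are rejected even if registered.
    if (this.opts.blocklist?.includes(toolName)) {
      throw new Error(`Tool "${toolName}" is blocklisted`);
    }
    if (this.opts.autoApproveAll) return true; // AUTO_APPROVE_ALL for CI
    if (permission === "safe") return true;
    if (permission === "cautious") {
      console.log(`[cautious] approving tool: ${toolName}`); // log + approve
      return true;
    }
    // "dangerous": requires explicit confirmation.
    return this.handler.confirm(toolName);
  }
}
```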

Phase 2: Essential Coding Agent Tools

Goal: Build the concrete tools a software development agent needs. Each tool follows the ToolDefinition interface from Phase 1.

Task 2.1: Shell Command Execution Tool

  • Depends on: 1.6, 1.7
  • Description: Implement a tool that executes shell commands with safety controls.
  • Steps:
    1. Create src/tools/shell.ts implementing ToolDefinition with permission level "dangerous".
    2. Use Node.js child_process.execFile (not exec) to avoid shell injection.
    3. Support configurable working directory, environment variables, and timeout.
    4. Capture stdout, stderr, and exit code. Return structured output: { stdout, stderr, exitCode }.
    5. Implement a command blocklist (configurable, default blocks: rm -rf /, mkfs, dd, etc.).
    6. Write tests: (a) successful command, (b) failing command returns stderr + exit code, (c) blocked command rejected, (d) timeout kills process.
  • Acceptance Criteria:
    • shell.execute({ command: "echo hello" }) returns { stdout: "hello\n", stderr: "", exitCode: 0 }.
    • shell.execute({ command: "rm -rf /" }) is rejected by the blocklist before execution.
    • A command exceeding timeout is killed and returns a timeout error.
  • Test Requirements: Integration tests in a temp directory. No destructive commands.
  • Guidelines: Never use child_process.exec with unsanitized input. Permission level is "dangerous".
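The blocklist check from step 5 is a pure function that runs before any execFile call; a sketch with the example patterns from the task (the real list would be configurable):

```typescript
// Example patterns only; the production blocklist comes from config.
const DEFAULT_BLOCKED_PATTERNS = [/\brm\s+-rf\s+\//, /\bmkfs\b/, /\bdd\b/];

export function isBlocked(
  command: string,
  patterns: RegExp[] = DEFAULT_BLOCKED_PATTERNS,
): boolean {
  return patterns.some((p) => p.test(command));
}
```

Because the tool uses execFile with an args array rather than a shell string, this check is defense in depth, not the only injection barrier.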

Task 2.2: File Management Tools

  • Depends on: 1.6, 1.7
  • Description: Implement tools for reading, creating, editing, and deleting files.
  • Steps:
    1. Create src/tools/file-read.ts — reads file content, returns content + metadata (size, encoding). Permission: "safe".
    2. Create src/tools/file-write.ts — creates or overwrites files. Permission: "cautious".
    3. Create src/tools/file-edit.ts — applies targeted edits (search-and-replace or line-range replacement). Permission: "cautious".
    4. Create src/tools/file-delete.ts — deletes files. Permission: "dangerous".
    5. All tools enforce a configurable WORKSPACE_ROOT — no file operations outside this directory (path traversal prevention).
    6. Create src/tools/file-list.ts — lists directory contents with optional glob filtering. Permission: "safe".
    7. Write tests for each tool using a temp directory fixture.
  • Acceptance Criteria:
    • file-read returns file content as a string with correct encoding detection.
    • file-write creates a new file and file-read can read it back.
    • file-edit with { path, search: "old", replace: "new" } modifies only the matched content.
    • Any path outside WORKSPACE_ROOT is rejected with an error.
    • file-list with a glob pattern returns matching entries.
  • Test Requirements: Integration tests with tmp directories. Clean up after each test.
  • Guidelines: Use fs/promises. Validate all paths with path.resolve() + prefix check.
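The path.resolve() + prefix check from the guideline can be sketched as a single guard that every file tool calls before touching the filesystem:

```typescript
import * as path from "node:path";

// Resolves a candidate path and rejects anything that escapes the workspace.
export function resolveInWorkspace(
  workspaceRoot: string,
  candidate: string,
): string {
  const root = path.resolve(workspaceRoot);
  const resolved = path.resolve(root, candidate);
  // Prefix check with a trailing separator so "/ws-evil" can't pass as "/ws".
  if (resolved !== root && !resolved.startsWith(root + path.sep)) {
    throw new Error(`Path "${candidate}" escapes WORKSPACE_ROOT`);
  }
  return resolved;
}
```

Note the separator in the prefix check: comparing against the bare root string would wrongly accept sibling directories that share the root as a name prefix.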

Task 2.3: Code Search Tool

  • Depends on: 1.6
  • Description: Implement a tool for searching through codebases using text patterns and regex.
  • Steps:
    1. Create src/tools/code-search.ts implementing ToolDefinition. Permission: "safe".
    2. Support search modes: literal string match, regex, and file-name glob.
    3. Use ripgrep CLI if available (with child_process), fall back to a recursive fs + regex search.
    4. Return results as: { matches: [{ file, line, column, content, context }] } with configurable max results (default: 50).
    5. Respect .gitignore patterns when searching.
    6. Write tests: (a) literal search finds match, (b) regex search, (c) respects max results, (d) no matches returns empty array.
  • Acceptance Criteria:
    • Searching for a known string in a test fixture returns the correct file, line, and content.
    • Regex search for /function\s+\w+/ returns function definitions.
    • Results are capped at the configured maximum.
  • Test Requirements: Integration tests with a test fixture directory containing sample files.
  • Guidelines: Prefer ripgrep for performance but ensure the fallback works.

Task 2.4: Version Control (Git) Tools

  • Depends on: 2.1
  • Description: Implement tools for common Git operations.
  • Steps:
    1. Create src/tools/git-status.ts — runs git status --porcelain and returns structured output. Permission: "safe".
    2. Create src/tools/git-diff.ts — runs git diff with optional path filter. Permission: "safe".
    3. Create src/tools/git-commit.ts — stages specified files and commits with a message. Permission: "cautious".
    4. Create src/tools/git-log.ts — returns recent commit history. Permission: "safe".
    5. Use simple-git library (not nodegit — it's abandoned and has native compilation issues).
    6. Write tests using a temp Git repository fixture.
  • Acceptance Criteria:
    • git-status on a repo with uncommitted changes returns structured file status list.
    • git-diff shows the diff content for modified files.
    • git-commit creates a commit that appears in git-log output.
    • All tools fail gracefully when run outside a Git repository.
  • Test Requirements: Integration tests that create/destroy a temp Git repo with simple-git.
  • Guidelines: Do NOT implement git push yet — that's a "dangerous" operation for a later task.

Task 2.5: Code Execution Tool

  • Depends on: 2.1
  • Description: Implement a tool that runs code snippets or test commands in a controlled environment.
  • Steps:
    1. Create src/tools/code-run.ts implementing ToolDefinition. Permission: "dangerous".
    2. Support execution modes: (a) run a shell command (delegates to shell tool), (b) run a script file.
    3. Capture stdout, stderr, exit code. Enforce timeout (default: 60s).
    4. Add EXECUTION_ENVIRONMENT config for future sandboxing (default: "local").
    5. Write tests: (a) run a simple Node.js script, (b) run a failing script returns error, (c) timeout enforcement.
  • Acceptance Criteria:
    • Running node -e "console.log(42)" returns { stdout: "42\n", exitCode: 0 }.
    • Running a script with a syntax error returns stderr and non-zero exit code.
    • A script exceeding timeout is killed.
  • Test Requirements: Integration tests with simple script fixtures.
  • Guidelines: This is intentionally simple for now — sandboxed execution (Docker, etc.) is a Phase 4 concern.

Task 2.6: Workspace Context Awareness

  • Depends on: 2.2, 2.3
  • Description: Give the agent automatic awareness of the project it's working in.
  • Steps:
    1. Create src/workspace.ts with analyzeWorkspace(rootPath): WorkspaceInfo.
    2. WorkspaceInfo includes: { language, framework, packageManager, hasTests, testCommand, lintCommand, buildCommand, entryPoints, gitInitialized }.
    3. Detect via heuristics: package.json → Node/TS, requirements.txt/pyproject.toml → Python, go.mod → Go, etc.
    4. Extract test/build/lint commands from package.json scripts, Makefile, etc.
    5. Feed WorkspaceInfo into getSystemPrompt() (from Task 1.3) so the LLM knows the project context.
    6. Write tests with fixture directories for Node, Python, and Go projects.
  • Acceptance Criteria:
    • A directory with package.json is detected as Node/TypeScript with the correct test command.
    • A directory with pyproject.toml is detected as Python.
    • WorkspaceInfo is included in the system prompt.
  • Test Requirements: Unit tests with mock filesystem fixtures.
  • Guidelines: Start with Node/TS detection (since AgentLoop itself is TS). Add more languages incrementally.
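The marker-file heuristics from step 3 reduce to a lookup table; this sketch shows only language detection, while the real analyzeWorkspace would also read package.json scripts for the test/build/lint commands.

```typescript
// Ordered marker-file heuristics: first match wins.
const LANGUAGE_MARKERS: [marker: string, language: string][] = [
  ["package.json", "node"],
  ["pyproject.toml", "python"],
  ["requirements.txt", "python"],
  ["go.mod", "go"],
];

export function detectLanguage(filesInRoot: string[]): string | undefined {
  for (const [marker, language] of LANGUAGE_MARKERS) {
    if (filesInRoot.includes(marker)) return language;
  }
  return undefined;
}
```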

Task 2.7: Diff/Patch Generation Tool

  • Depends on: 2.2
  • Description: Implement a tool that generates unified diffs and can apply patches.
  • Steps:
    1. Create src/tools/diff.ts — generates a unified diff between two strings or files. Permission: "safe".
    2. Create src/tools/patch.ts — applies a unified diff patch to a file. Permission: "cautious".
    3. Use a library like diff (npm) for diff generation and application.
    4. Write tests: (a) generate diff between two strings, (b) apply patch restores expected content, (c) applying a bad patch returns an error.
  • Acceptance Criteria:
    • diff.execute({ original: "hello", modified: "hello world" }) returns a valid unified diff string.
    • Applying the generated diff to the original produces the modified content.
  • Test Requirements: Unit tests, no external dependencies.
  • Guidelines: This is foundational for the agent to communicate code changes clearly.

Task 2.8: MCP Client Integration

  • Depends on: 1.6, 1.7
  • Description: Integrate MCP (Model Context Protocol) client support so the agent can connect to external MCP tool servers.
  • Steps:
    1. Add @modelcontextprotocol/sdk as a dependency.
    2. Create src/mcp/client.ts implementing an MCP client that connects to MCP servers via stdio or SSE transport.
    3. Create src/mcp/bridge.ts that converts MCP tool definitions into ToolDefinition format and registers them in the ToolRegistry.
    4. Add MCP_SERVERS config (JSON array of server configs: { name, command, args, transport }).
    5. On startup, connect to configured MCP servers, discover their tools, and register them.
    6. Handle MCP server lifecycle: start on agent init, stop on agent shutdown.
    7. Write tests: (a) mock MCP server connection, (b) tool discovery and registration, (c) tool invocation through the bridge, (d) server crash handling.
  • Acceptance Criteria:
    • With MCP_SERVERS=[{"name":"test","command":"node","args":["test-server.js"],"transport":"stdio"}], the agent discovers and registers tools from the MCP server.
    • An MCP tool can be invoked through the normal tool execution flow.
    • If an MCP server crashes, the agent logs the error and marks those tools as unavailable.
  • Test Requirements: Integration tests with a minimal mock MCP server (a simple Node.js script).
  • Guidelines: MCP is a protocol for tool/resource servers — the agent is the client, not the server. Follow the official MCP SDK docs.

Phase 3: Subagent Architecture & Planning

Goal: Enable the agent to delegate work to specialized sub-agents and generate plans.

Task 3.1: Subagent Architecture

  • Depends on: 1.2, 1.6, 1.5
  • Description: Design and implement the core subagent framework.
  • Steps:
    1. Create src/subagents/ directory.
    2. Define SubagentDefinition interface: { name, systemPrompt, tools: string[], maxIterations, parentCommunication }.
    3. Create src/subagents/runner.ts with runSubagent(definition, task): SubagentResult, which creates a new agent loop instance with its own message history, tool set, and iteration budget.
    4. Create src/subagents/manager.ts with a SubagentManager class that tracks running subagents, enforces concurrency limits, and collects results.
    5. Subagents use the same createLLM factory and ToolRegistry but with filtered tool access.
    6. Implement result passing: subagent output is returned to the parent agent as a structured message.
    7. Write tests: (a) subagent runs to completion, (b) subagent respects its own MAX_ITERATIONS, (c) subagent only accesses its allowed tools, (d) manager enforces concurrency limit.
  • Acceptance Criteria:
    • runSubagent({ name: "planner", tools: ["file-read"], maxIterations: 5 }, "list files") runs a sub-loop and returns a result.
    • A subagent cannot access tools not in its tools list.
    • SubagentManager with concurrency limit 2 queues a third subagent.
  • Test Requirements: Unit tests with mocked LLM. Integration test with a real (cheap) LLM call as a smoke test.
  • Guidelines: Subagents are isolated agent loops — they don't share message history with the parent.

Task 3.2: Plan Generation Subagent

  • Depends on: 3.1
  • Description: Implement an LLM-based planning subagent that decomposes user requests into actionable task lists.
  • Steps:
    1. Create src/subagents/planner.ts with a specialized system prompt for plan generation.
    2. The planner takes user input + WorkspaceInfo and outputs a structured plan: { steps: [{ description, toolsNeeded, estimatedComplexity }] }.
    3. Implement a JSON-mode or structured output constraint for reliable plan parsing.
    4. Add a plan validation step that checks all referenced tools exist in the registry.
    5. Implement a feedback loop: if the parent agent rejects a plan, the planner refines it.
    6. Write tests: (a) plan generated from input contains steps, (b) plan with invalid tools is flagged, (c) refinement produces a different plan.
  • Acceptance Criteria:
    • Given "Add a new endpoint to the Express app," the planner produces a multi-step plan with tools like file-read, file-write, code-run.
    • A plan referencing a non-existent tool is flagged with a validation error.
    • After one refinement round, the plan no longer references invalid tools.
  • Test Requirements: Unit tests with mocked LLM responses. One integration smoke test.
  • Guidelines: Use structured output / JSON mode to avoid brittle text parsing.

Task 3.3: Task Execution Orchestrator

  • Depends on: 3.1, 3.2
  • Description: Implement the orchestrator that takes a plan and executes it step-by-step.
  • Steps:
    1. Create src/orchestrator.ts with executePlan(plan): ExecutionResult.
    2. For each step in the plan, either: (a) execute directly if simple (single tool call), or (b) spawn a subagent for complex steps.
    3. Implement checkpointing: after each step, save progress so execution can resume after a failure.
    4. Implement step-level error handling: if a step fails, the orchestrator can retry, skip, or abort (configurable).
    5. Provide progress reporting (step N of M, current status).
    6. Write tests: (a) 3-step plan executes all steps, (b) step failure with retry succeeds on second attempt, (c) checkpoint allows resumption.
  • Acceptance Criteria:
    • A 3-step plan executes all steps in order and returns combined results.
    • If step 2 fails, the orchestrator retries once and continues if the retry succeeds.
    • After a crash at step 2, executePlan(plan, { resumeFrom: 2 }) skips step 1.
  • Test Requirements: Unit tests with mocked tool executions.
  • Guidelines: Keep the orchestrator stateless — checkpoint state is stored externally (file or memory).
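The step loop with retry and resumeFrom can be sketched as below. Steps are reduced to plain async functions and the checkpoint is just a 1-based step index, standing in for the externally stored checkpoint state the guideline calls for.

```typescript
type Step = { description: string; run: () => Promise<string> };

export async function executePlan(
  steps: Step[],
  opts: { resumeFrom?: number; maxRetries?: number } = {},
): Promise<{ results: string[]; completed: number }> {
  const start = (opts.resumeFrom ?? 1) - 1; // resumeFrom is 1-based
  const maxRetries = opts.maxRetries ?? 1;
  const results: string[] = [];
  for (let i = start; i < steps.length; i++) {
    let lastErr: unknown;
    let done = false;
    // Retry policy: re-run a failed step up to maxRetries times.
    for (let attempt = 0; attempt <= maxRetries && !done; attempt++) {
      try {
        results.push(await steps[i].run());
        done = true;
      } catch (err) {
        lastErr = err;
      }
    }
    if (!done) throw new Error(`Step ${i + 1} failed: ${lastErr}`);
  }
  return { results, completed: results.length };
}
```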

Task 3.4: Advanced MCP Features

  • Depends on: 2.8
  • Description: Extend MCP support with resources, prompts, and sampling capabilities.
  • Steps:
    1. Implement MCP resource discovery and reading in src/mcp/client.ts.
    2. Implement MCP prompt template support — discover and use prompts from MCP servers.
    3. Implement MCP sampling support — allow MCP servers to request LLM completions from the agent.
    4. Add MCP server health checking and auto-reconnection.
    5. Write tests for each new MCP capability.
  • Acceptance Criteria:
    • The agent can list and read resources from an MCP server.
    • MCP prompt templates can be discovered and used in the agent's prompt system.
    • An MCP server requesting sampling receives a completion from the agent's LLM.
  • Test Requirements: Integration tests with a mock MCP server.
  • Guidelines: Follow the MCP specification for resources, prompts, and sampling.

Task 3.5: Multi-Agent Coordination

  • Depends on: 3.1, 3.3
  • Description: Enable multiple subagents to work in parallel and coordinate on complex tasks.
  • Steps:
    1. Extend SubagentManager with parallel execution support (runParallel(tasks[])).
    2. Implement a shared context mechanism — subagents can read (but not write) shared state.
    3. Implement result aggregation — collect outputs from parallel subagents into a unified result.
    4. Add conflict detection — if two subagents modify the same file, flag it.
    5. Write tests: (a) 3 parallel subagents complete independently, (b) conflict detection triggers on overlapping file edits, (c) aggregated results contain all outputs.
  • Acceptance Criteria:
    • runParallel([task1, task2, task3]) runs 3 subagents concurrently and returns all results.
    • Two subagents editing the same file produces a conflict warning.
  • Test Requirements: Unit tests with mocked subagents.
  • Guidelines: Use Promise.allSettled() for parallel execution.
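With Promise.allSettled, runParallel falls out naturally: one failing subagent does not abort the others. Subagent tasks are reduced to async functions in this sketch.

```typescript
type SubagentTask<T> = () => Promise<T>;

// Runs all tasks concurrently and partitions the outcomes.
export async function runParallel<T>(
  tasks: SubagentTask<T>[],
): Promise<{ ok: T[]; failed: string[] }> {
  const settled = await Promise.allSettled(tasks.map((t) => t()));
  const ok: T[] = [];
  const failed: string[] = [];
  for (const result of settled) {
    if (result.status === "fulfilled") ok.push(result.value);
    else failed.push(String(result.reason));
  }
  return { ok, failed };
}
```

Conflict detection (step 4) would layer on top: each result reports the files it touched, and the aggregator flags overlaps.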

Phase 4: Observability, Security Hardening & Streaming

Goal: Make the agent production-observable and add streaming UX.

Task 4.1: Observability & Tracing

  • Depends on: 1.1, 1.5
  • Description: Add structured tracing and cost tracking to the agent.
  • Steps:
    1. Create src/observability.ts with a Tracer interface.
    2. Implement a default tracer that logs to structured JSON files: one trace per agent invocation, with spans for each LLM call and tool execution.
    3. Track token usage (prompt + completion tokens) per LLM call using LangChain's callback mechanism.
    4. Add cumulative cost estimation based on configurable per-token pricing.
    5. Add a TRACING_ENABLED and TRACE_OUTPUT_DIR config.
    6. Write tests verifying trace files are created with expected structure.
  • Acceptance Criteria:
    • Each agent invocation produces a trace file with LLM calls, tool executions, and token counts.
    • Total tokens and estimated cost are included in the trace summary.
  • Test Requirements: Integration test that runs an agent invocation and inspects the trace file.
  • Guidelines: Design the Tracer interface to be replaceable (for OpenTelemetry, LangSmith, etc. in the future).

Task 4.2: Streaming Response Support

  • Depends on: 1.1, 1.2
  • Description: Add streaming support so partial LLM responses are delivered incrementally.
  • Steps:
    1. Add a stream option to agentExecutor.invoke(input, { stream: true }) that returns an AsyncIterable<string>.
    2. Use LangChain's .stream() method instead of .invoke() when streaming is requested.
    3. Update the CLI main loop to print tokens as they arrive.
    4. Handle tool calls during streaming (buffer until the complete tool call is received, then execute).
    5. Write tests: (a) streaming mode returns chunks, (b) tool calls are buffered correctly, (c) non-streaming mode still works.
  • Acceptance Criteria:
    • In streaming mode, the CLI prints tokens character-by-character as they arrive from the LLM.
    • Tool calls during streaming are handled correctly (buffered, executed, then streaming resumes).
  • Test Requirements: Unit tests with mocked streaming LLM responses.
  • Guidelines: Use for await...of for the streaming consumer.
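The consumer side of streaming can be sketched as below; the AsyncIterable stands in for the chunk stream LangChain's .stream() would yield, and onChunk is where the CLI would write each token as it arrives.

```typescript
// Consumes a chunk stream with for await...of, invoking onChunk per chunk
// (e.g. process.stdout.write in the CLI) and returning the full response.
export async function collectStream(
  stream: AsyncIterable<string>,
  onChunk: (chunk: string) => void,
): Promise<string> {
  let full = "";
  for await (const chunk of stream) {
    onChunk(chunk);
    full += chunk;
  }
  return full;
}
```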

Task 4.3: Security Hardening

  • Depends on: 1.7, 2.1, 2.2
  • Description: Comprehensive security audit and hardening pass.
  • Steps:
    1. Implement workspace isolation: confine all file and shell operations to WORKSPACE_ROOT (chroot-like containment).
    2. Add network access controls for tools (configurable allowlist of domains/IPs).
    3. Implement resource limits: max file size for read/write, max output size for shell commands, max concurrent tool executions.
    4. Add input sanitization for all tool parameters.
    5. Create a security test suite that attempts common attacks: path traversal, command injection, resource exhaustion.
    6. Document the security model in docs/security.md.
  • Acceptance Criteria:
    • Path traversal attempts (../../etc/passwd) are blocked.
    • Shell injection attempts (; rm -rf /) are blocked.
    • File reads exceeding MAX_FILE_SIZE are rejected.
    • The security document describes the threat model and mitigations.
  • Test Requirements: Dedicated security test suite.
  • Guidelines: Assume all LLM-generated tool inputs are potentially malicious.

Task 4.4: Execution Sandboxing (Optional/Stretch)

  • Depends on: 4.3, 2.5
  • Description: Add optional Docker-based sandboxing for code execution.
  • Steps:
    1. Create src/sandbox/docker.ts that runs code execution inside a Docker container.
    2. Use a minimal base image, mount the workspace read-only (or copy), capture output.
    3. Add SANDBOX_MODE config: "none" (default), "docker".
    4. Write tests: (a) code runs in container, (b) filesystem changes in container don't affect host.
  • Acceptance Criteria:
    • With SANDBOX_MODE=docker, code execution runs inside a container.
    • The host filesystem is not modified by sandboxed execution.
  • Test Requirements: Integration tests (require Docker installed). Mark as skippable in CI without Docker.
  • Guidelines: This is a stretch goal. The "none" mode is the default and must always work.

Phase 5: Testing, Performance & Quality

Goal: Comprehensive testing pass and performance optimization.

Task 5.1: End-to-End Test Suite

  • Depends on: all Phase 2 and 3 tasks
  • Description: Build an E2E test suite that tests complete agent workflows.
  • Steps:
    1. Create tests/e2e/ directory.
    2. Implement test scenarios: (a) "Create a new file with specific content" — exercises file-write + verification, (b) "Find and fix a bug" — exercises code-search + file-edit + code-run, (c) "Generate a plan for a feature" — exercises planner subagent.
    3. Use a mock LLM with deterministic responses for reproducibility.
    4. Add a flag E2E_USE_REAL_LLM for optional live testing (not in CI by default).
    5. Measure and assert on total execution time per scenario.
  • Acceptance Criteria:
    • All 3 E2E scenarios pass with mock LLM.
    • Each scenario completes in under 10 seconds with mock LLM.
  • Test Requirements: E2E tests. Mock LLM by default, optional real LLM.
  • Guidelines: E2E tests should be self-contained — each creates and tears down its own workspace.
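The self-containment guideline can be captured in one helper that every scenario wraps itself in (a sketch; `withTempWorkspace` is a hypothetical name, and real scenarios would likely use a Promise-returning variant so the agent run can be awaited):

```typescript
// Sketch: each E2E scenario gets a fresh temp workspace and tears it down
// even when the scenario fails.
import { mkdtempSync, rmSync, writeFileSync, existsSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

function withTempWorkspace<T>(fn: (dir: string) => T): T {
  const dir = mkdtempSync(join(tmpdir(), "agentloop-e2e-"));
  try {
    return fn(dir);
  } finally {
    rmSync(dir, { recursive: true, force: true }); // cleanup on success or failure
  }
}
```

Scenario (a) then reduces to: run the agent (with the mock LLM) against `dir`, and assert the expected file exists before the workspace is removed.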

Task 5.2: Performance Benchmarking

  • Depends on: 5.1
  • Description: Benchmark and optimize critical paths.
  • Steps:
    1. Benchmark: tool registry lookup time with 100 registered tools.
    2. Benchmark: context trimming with 1000 messages.
    3. Benchmark: code search on a 10,000-file repository.
    4. Benchmark: workspace analysis on a large project.
    5. Profile with Node.js --prof and identify hotspots.
    6. Optimize identified bottlenecks.
    7. Document benchmarks and results.
  • Acceptance Criteria:
    • Tool registry lookup is < 1ms for 100 tools.
    • Context trimming for 1000 messages completes in < 100ms.
    • All benchmarks are documented with baseline and optimized numbers.
  • Test Requirements: Benchmark scripts in benchmarks/.
  • Guidelines: Run benchmarks in CI to catch regressions.
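Step 1 might be sketched as follows, assuming the registry is Map-backed (illustrative names, not the actual implementation):

```typescript
// Sketch of a registry-lookup micro-benchmark for benchmarks/: populate 100
// tools, then time the average lookup over many iterations.
import { performance } from "node:perf_hooks";

const registry = new Map<string, { name: string }>();
for (let i = 0; i < 100; i++) {
  registry.set(`tool_${i}`, { name: `tool_${i}` });
}

const ITERATIONS = 10_000;
const start = performance.now();
for (let i = 0; i < ITERATIONS; i++) {
  registry.get(`tool_${i % 100}`); // cycle through all registered names
}
const perLookupMs = (performance.now() - start) / ITERATIONS;
console.log(`avg lookup: ${perLookupMs.toFixed(6)} ms`); // target: < 1 ms
```

The same shape (populate, loop, divide by iterations, print) applies to the context-trimming and code-search benchmarks; recording each run's number is what makes the "baseline and optimized" documentation criterion checkable.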

Task 5.3: Testing Infrastructure for LLM-Dependent Code

  • Depends on: 1.1
  • Description: Establish the mocking/recording strategy for testing LLM-dependent behavior.
  • Steps:
    1. Create tests/fixtures/llm-responses/ with recorded LLM response fixtures.
    2. Build a MockChatModel class that replays recorded responses based on input patterns.
    3. Add a recording mode that captures real LLM responses and saves them as fixtures.
    4. Document the testing strategy in docs/testing.md.
    5. Retrofit existing tests to use MockChatModel.
  • Acceptance Criteria:
    • MockChatModel can replay a recorded multi-turn conversation with tool calls.
    • Recording mode captures real responses and saves valid fixture files.
    • All existing tests pass with MockChatModel (no real API calls in CI).
  • Test Requirements: Meta-tests (tests for the test infrastructure).
  • Guidelines: This is foundational — do this early (it's in Phase 5 for execution but could be pulled into Phase 1 if needed).
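The replay idea in steps 1–2 can be sketched without any LangChain dependency (a real MockChatModel would implement LangChain's chat-model interface and load fixtures from tests/fixtures/llm-responses/; everything below is illustrative):

```typescript
// Sketch: replay recorded turns in order, checking each incoming prompt
// against the fixture's expected pattern so mismatched conversations fail loudly.
interface RecordedTurn {
  matchPattern: string; // substring the prompt must contain
  content: string;      // assistant text to replay
  toolCalls: { name: string; args: Record<string, unknown> }[];
}

class MockChatModel {
  private cursor = 0;
  constructor(private fixtures: RecordedTurn[]) {}

  invoke(prompt: string): { content: string; toolCalls: RecordedTurn["toolCalls"] } {
    const turn = this.fixtures[this.cursor];
    if (!turn) throw new Error(`No fixture left for prompt: "${prompt}"`);
    if (!prompt.includes(turn.matchPattern)) {
      throw new Error(
        `Fixture ${this.cursor} expected a prompt containing "${turn.matchPattern}"`
      );
    }
    this.cursor++;
    return { content: turn.content, toolCalls: turn.toolCalls };
  }
}
```

A multi-turn conversation with tool calls (the first acceptance criterion) is then just repeated `invoke()` calls against a fixture list whose first turn carries `toolCalls` and whose last turn carries the final answer.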

Phase 6: Documentation & Deployment

Task 6.1: Documentation

  • Depends on: all prior phases
  • Description: Comprehensive documentation covering architecture, usage, and extension.
  • Steps:
    1. Create docs/ directory structure: architecture.md, getting-started.md, tools.md, security.md, extending.md, configuration.md.
    2. architecture.md: system overview diagram, agent loop flow, subagent architecture, MCP integration.
    3. getting-started.md: installation, first run, example workflows.
    4. tools.md: catalog of all built-in tools with examples.
    5. extending.md: how to add a new tool, how to create a subagent, how to connect an MCP server.
    6. configuration.md: all environment variables and their defaults.
    7. Update README.md to reflect the new architecture.
  • Acceptance Criteria:
    • A new developer can go from clone to running the agent within 10 minutes using the docs.
    • All config options are documented with descriptions and defaults.
    • The extending guide includes a complete working example of adding a custom tool.
  • Test Requirements: Documentation review. Verify code examples actually run.
  • Guidelines: Use Mermaid diagrams in markdown for architecture visualization.

Task 6.2: Deployment & Packaging

  • Depends on: 6.1
  • Description: Package the agent for production deployment.
  • Steps:
    1. Add a proper tsconfig.json build configuration that outputs to dist/.
    2. Create a Dockerfile for containerized deployment.
    3. Add npm run build and npm run start:prod scripts.
    4. Create a GitHub Actions workflow for: build → test → publish (npm or Docker).
    5. Add a CHANGELOG.md with semantic versioning.
    6. Publish as an npm package (optional, scoped to @huberp/agentloop).
  • Acceptance Criteria:
    • npm run build produces a clean dist/ directory.
    • docker build . and docker run start the agent successfully.
    • CI pipeline runs all tests and produces a build artifact.
  • Test Requirements: CI pipeline passes. Docker build succeeds.
  • Guidelines: Use multi-stage Docker builds for minimal image size.
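A minimal multi-stage Dockerfile along the lines of steps 1–2 and the guideline might look like this (the entrypoint path and npm scripts are assumptions about the build output, not settled decisions):

```dockerfile
# Build stage: install all deps and compile TypeScript to dist/
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: production deps + compiled output only, for a small image
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
CMD ["node", "dist/index.js"]
```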

Cross-Cutting Concerns (Apply Throughout)

These are not separate tasks but principles to enforce in every task's code review:

| Concern | Requirement |
| --- | --- |
| Logging | Every tool execution, LLM call, and error is logged with structured context |
| TypeScript Strict | Enable `strict: true` in tsconfig. No `any` types except justified cases |
| Error Messages | All errors include actionable context (what failed, what was expected, how to fix) |
| Config | All magic numbers are configurable via `appConfig` with sensible defaults |
| Backwards Compat | `npx tsx src/index.ts` still starts the interactive CLI at every phase |
| Git Hygiene | One PR per task. Each PR includes tests. No PR without tests |

Summary of Changes from Original Plan (Issue #4)

| Problem in #4 | Fix in v2 |
| --- | --- |
| Phase ordering inverted (Phase 2 before 2.5) | Reordered: core tools (Phase 2) before multi-agent (Phase 3) |
| Core loop only does single tool-call pass | Added Task 1.1: iterative tool-calling loop |
| No context window management | Added Task 1.4: token counting + trimming |
| No LLM provider abstraction | Added Task 1.2: provider factory |
| MCP tasks had incorrect semantics | Rewrote as Task 2.8 and 3.4 with correct client-side MCP integration |
| No error handling strategy | Added Task 1.5: comprehensive error handling |
| No human-in-the-loop | Added Task 1.7: permission model + confirmation |
| Vague acceptance criteria | Every task now has specific, testable criteria |
| No dependency graph | Added explicit `Depends on:` for every task |
| nodegit recommendation | Replaced with simple-git |
| `eval()` in calculator | Task 1.6 explicitly replaces it with a safe math parser |
| No workspace awareness | Added Task 2.6 |
| No observability | Added Task 4.1 |
| No streaming | Added Task 4.2 |
| No diff/patch capability | Added Task 2.7 |
| No testing strategy for LLM code | Added Task 5.3 |
| Tasks not scoped for agent consumption | Each task is ~1–3 days, self-contained, with concrete steps |
