Development Plan v2 for Extending AgentLoop
Revised version of #4. Addresses dependency ordering, missing capabilities, vague acceptance criteria, and gaps identified during deep review. Supersedes #4.
Overview
This plan extends the basic AgentLoop — currently a ~140-line TypeScript LangChain agent with Mistral, two hardcoded demo tools, in-memory chat history, and a single-pass tool execution loop — towards a full-fledged LLM-based agent that can support software development.
Key design principles:
- Each task is scoped to ~1–3 days of focused work and can be fed as an "agent task" to Copilot
- All acceptance criteria are specific and testable
- Tasks declare explicit dependencies
- Testing strategy addresses the LLM-dependency challenge upfront
Current codebase baseline (as of initial plan):
- `src/index.ts` — single-pass `executeWithTools()` with `ChatMistralAI`, `InMemoryChatMessageHistory`
- `src/tools.ts` — two hardcoded tools (mock search, `eval()`-based calculator)
- `src/config.ts` — dotenv-based config (Mistral key + logger settings)
- `src/logger.ts` — Pino logger
- Jest test setup in `src/__tests__/`
Task Dependency Graph
```
Phase 1: T1 → T2 → T3 → T4 → T5 → T6 → T7
                    ↓
Phase 2: T1 → T2 → T3 → T4 → T5 → T6 → T7 → T8
                    ↓
Phase 3: T1 → T2 → T3 → T4 → T5
                    ↓
Phase 4: T1 → T2 → T3 → T4
                    ↓
Phase 5: T1, T2, T3 (parallel)
                    ↓
Phase 6: T1, T2 (parallel)
```
Detailed dependency links are noted per-task below with `Depends on:` tags.
Phase 1: Harden the Core Loop
Goal: Make the existing agent loop production-worthy before adding any new capabilities.
Task 1.1: Iterative Tool-Calling Loop
- Depends on: nothing (first task)
- Description: Refactor `executeWithTools()` in `src/index.ts` from a single tool-call round-trip to a proper agentic loop that iterates until the LLM signals completion (i.e., returns a response with zero tool calls).
- Steps:
  - Replace the current single-pass logic with a `while` loop that continues as long as the LLM response contains `tool_calls`.
  - Add a configurable `MAX_ITERATIONS` guard (default: 20) in `appConfig` to prevent infinite loops.
  - Add a configurable `MAX_TOKENS_BUDGET` (optional, for future use) field in `appConfig`.
  - When `MAX_ITERATIONS` is reached, return the last LLM response with a warning prefix.
  - Add structured log entries for each iteration (iteration number, tool calls in that iteration).
  - Write unit tests using mocked LLM responses that simulate: (a) 0 tool calls, (b) 1 round of tool calls, (c) 3 consecutive rounds, (d) hitting `MAX_ITERATIONS`.
- Acceptance Criteria:
  - `executeWithTools("multi-step query")` iterates until no tool calls remain, up to `MAX_ITERATIONS`.
  - A test with a mock LLM returning 3 consecutive tool-call rounds produces exactly 3 iterations and a final text response.
  - A test hitting `MAX_ITERATIONS` returns gracefully with a warning, not an error.
- Test Requirements: Unit tests with mocked `ChatMistralAI` (no real API calls). Use `jest.mock()` for the LLM.
- Guidelines: Do not change the public API of `agentExecutor.invoke()`. The change is internal to `executeWithTools()`.
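The loop described above can be sketched as follows. This is a minimal illustration, not the real `executeWithTools()`: `Model`, `ModelResponse`, and `runTool` are stand-ins for the LangChain model call and tool dispatch.

```typescript
// Sketch of the iterative tool-calling loop from Task 1.1.
// Types here are illustrative stand-ins, not the real LangChain API:
// `Model` abstracts the LLM call, `runTool` abstracts tool dispatch.
interface ToolCall { name: string; args: Record<string, unknown> }
interface ModelResponse { content: string; toolCalls: ToolCall[] }
type Model = (history: string[]) => ModelResponse;

const MAX_ITERATIONS = 20; // guard against infinite loops (configurable in appConfig)

function runLoop(
  model: Model,
  runTool: (c: ToolCall) => string,
  input: string,
): { content: string; iterations: number } {
  const history: string[] = [input];
  let iterations = 0;
  let response = model(history);
  // Keep looping while the model asks for tools, up to MAX_ITERATIONS.
  while (response.toolCalls.length > 0) {
    if (++iterations > MAX_ITERATIONS) {
      // Graceful exit: return the last response with a warning prefix.
      return { content: `[warning: MAX_ITERATIONS reached] ${response.content}`, iterations };
    }
    for (const call of response.toolCalls) {
      history.push(`tool:${call.name}=${runTool(call)}`); // feed tool results back
    }
    response = model(history);
  }
  return { content: response.content, iterations };
}
```

A mock model that returns tool calls three times and then plain text exercises exactly the (c) test case from the steps above.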
Task 1.2: LLM Provider Abstraction
- Depends on: 1.1
- Description: Decouple the agent loop from `ChatMistralAI` so any LangChain-compatible chat model can be used.
- Steps:
  - Create `src/llm.ts` that exports a factory function `createLLM(config): BaseChatModel`.
  - Add `LLM_PROVIDER` (default: `"mistral"`) and `LLM_MODEL` (default: `""`) to `appConfig`.
  - Implement the Mistral provider in the factory. Add a clear extension point (switch/map) for future providers (OpenAI, Anthropic, Ollama).
  - Update `src/index.ts` to use the factory instead of directly instantiating `ChatMistralAI`.
  - Make `temperature` and other model parameters configurable via `appConfig`.
  - Write unit tests that verify the factory returns the correct provider based on config.
- Acceptance Criteria:
  - Setting `LLM_PROVIDER=mistral` produces a `ChatMistralAI` instance.
  - Setting `LLM_PROVIDER=unknown` throws a descriptive error.
  - `src/index.ts` no longer imports `ChatMistralAI` directly.
- Test Requirements: Unit tests for the factory function. Mock the provider constructors.
- Guidelines: Use LangChain's `BaseChatModel` type for the return type.
Task 1.3: System Prompt Management
- Depends on: 1.1
- Description: Replace the hardcoded `"You are a helpful AI assistant."` system prompt with a managed, configurable prompt system.
- Steps:
  - Create `src/prompts/` directory.
  - Create `src/prompts/system.ts` that exports a `getSystemPrompt(context?: { tools?: string[], projectInfo?: string }): string` function.
  - Define a base system prompt template that includes: agent identity, available tool names, behavioral instructions.
  - Add `SYSTEM_PROMPT_PATH` to `appConfig` (optional override — path to a `.txt` or `.md` file).
  - Update `executeWithTools()` to use `getSystemPrompt()` instead of the hardcoded string.
  - Write tests verifying prompt generation with and without context parameters.
- Acceptance Criteria:
  - The system prompt dynamically includes the names of currently loaded tools.
  - If `SYSTEM_PROMPT_PATH` is set, the prompt is loaded from that file.
  - Changing tools changes the system prompt content.
- Test Requirements: Unit tests. No LLM calls needed.
- Guidelines: Keep the prompt modular so it can be extended in later phases.
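A possible shape for `getSystemPrompt()`, assuming the context signature from the steps above. The base wording is a placeholder, not the final prompt.

```typescript
// Sketch of getSystemPrompt() from Task 1.3. The context shape follows
// the task description; the template wording is a placeholder assumption.
interface PromptContext { tools?: string[]; projectInfo?: string }

const BASE_PROMPT = "You are AgentLoop, an AI agent that supports software development.";

function getSystemPrompt(context?: PromptContext): string {
  const parts = [BASE_PROMPT];
  if (context?.tools?.length) {
    // Dynamically include currently loaded tool names (acceptance criterion).
    parts.push(`Available tools: ${context.tools.join(", ")}.`);
  }
  if (context?.projectInfo) {
    parts.push(`Project context: ${context.projectInfo}`);
  }
  parts.push("Use tools when they help; answer directly otherwise.");
  return parts.join("\n\n");
}
```

Because the tool list is interpolated, registering a different tool set produces a different prompt, which is exactly what the acceptance criteria require.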
Task 1.4: Context Window Management
- Depends on: 1.1
- Description: Implement token counting and context window management to prevent exceeding LLM context limits.
- Steps:
  - Add `tiktoken` (or `js-tiktoken`) as a dependency for token counting.
  - Create `src/context.ts` with: `countTokens(messages): number` and `trimMessages(messages, maxTokens): BaseMessage[]`.
  - Add `MAX_CONTEXT_TOKENS` to `appConfig` (default: 28000 — leaving headroom for response).
  - `trimMessages` strategy: always keep the system prompt and last N user messages; summarize or drop oldest messages from the middle.
  - Integrate `trimMessages` into `executeWithTools()` before each LLM call.
  - Add structured logging when messages are trimmed (count of removed messages, tokens saved).
  - Write unit tests with synthetic message arrays that exceed the token limit.
- Acceptance Criteria:
  - When conversation history exceeds `MAX_CONTEXT_TOKENS`, older messages are dropped/summarized.
  - The system prompt and the most recent user message are never dropped.
  - A test with 100 synthetic messages verifies the output fits within `MAX_CONTEXT_TOKENS`.
- Test Requirements: Unit tests with synthetic messages. No LLM calls.
- Guidelines: Start with a simple "drop oldest" strategy. Summarization can be added in a later task.
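The "drop oldest" strategy can be sketched like this. Token counting is stubbed with a rough characters-per-token heuristic (the real task uses `tiktoken`), and `Msg` stands in for LangChain's `BaseMessage`.

```typescript
// Drop-oldest trimming sketch for Task 1.4. countTokens is a stubbed
// heuristic (~4 chars/token); replace with tiktoken in the real code.
interface Msg { role: "system" | "user" | "assistant" | "tool"; content: string }

function countTokens(messages: Msg[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}

function trimMessages(messages: Msg[], maxTokens: number): Msg[] {
  const system = messages.filter(m => m.role === "system");
  const rest = messages.filter(m => m.role !== "system");
  const kept: Msg[] = [];
  let budget = maxTokens - countTokens(system); // system prompt is always kept
  // Walk from newest to oldest, keeping what fits; the most recent
  // message is visited first and therefore never dropped.
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = countTokens([rest[i]]);
    if (cost > budget && kept.length > 0) break; // budget exhausted: drop the rest
    budget -= cost;
    kept.unshift(rest[i]);
    if (budget <= 0) break;
  }
  return [...system, ...kept];
}
```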
Task 1.5: Error Handling & Recovery
- Depends on: 1.1, 1.2
- Description: Add robust error handling for LLM API failures and tool execution failures.
- Steps:
  - Create `src/errors.ts` with custom error types: `LLMAPIError`, `ToolExecutionError`, `MaxIterationsError`, `ContextOverflowError`.
  - Wrap LLM API calls in retry logic with exponential backoff (max 3 retries, configurable).
  - Add per-tool timeout configuration (default: 30s) and implement timeout enforcement.
  - When a tool fails, inject the error as a `ToolMessage` so the LLM can reason about the failure (don't crash the loop).
  - Add rate-limit detection for common LLM APIs (HTTP 429) with appropriate back-off.
  - Write tests for: (a) LLM timeout → retry → success, (b) LLM timeout → max retries → graceful error, (c) tool throws → error injected as `ToolMessage`, (d) tool timeout.
- Acceptance Criteria:
  - An LLM API failure retries up to 3 times with backoff before throwing `LLMAPIError`.
  - A tool that throws an error does not crash the agent loop; instead, the error is fed back to the LLM.
  - A tool exceeding its timeout is killed and an error `ToolMessage` is generated.
- Test Requirements: Unit tests with mocked failures. No real API calls.
- Guidelines: Use `AbortController` for tool timeouts.
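The retry and timeout mechanics above might look like the following. The error class names come from the task; how they are wired together here is an assumption, not the final `src/errors.ts` design.

```typescript
// Retry-with-backoff and AbortController timeout sketch for Task 1.5.
// Error names follow the task; the wiring is an illustrative assumption.
class LLMAPIError extends Error {}
class ToolExecutionError extends Error {}

async function withRetry<T>(fn: () => Promise<T>, maxRetries = 3, baseDelayMs = 100): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) {
        // Exponential backoff: 100ms, 200ms, 400ms, ...
        await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw new LLMAPIError(`failed after ${maxRetries} retries: ${String(lastError)}`);
}

async function withTimeout<T>(run: (signal: AbortSignal) => Promise<T>, timeoutMs: number): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const timedOut = new Promise<never>((_, reject) =>
    controller.signal.addEventListener("abort", () =>
      reject(new ToolExecutionError(`tool timed out after ${timeoutMs}ms`))));
  try {
    // Race the tool against the abort signal so even a tool that
    // ignores the signal cannot block the loop past its timeout.
    return await Promise.race([run(controller.signal), timedOut]);
  } finally {
    clearTimeout(timer);
  }
}
```

The caught `ToolExecutionError` would then be converted into a `ToolMessage` rather than propagated, so the loop keeps running.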
Task 1.6: Modular Tool Architecture
- Depends on: 1.3, 1.5
- Description: Replace the hardcoded tools array with a modular, registry-based tool system.
- Steps:
  - Create `src/tools/` directory. Move existing tools to `src/tools/search.ts` and `src/tools/calculate.ts`.
  - Create `src/tools/registry.ts` with a `ToolRegistry` class: `register(tool)`, `get(name)`, `list()`, `loadFromDirectory(path)`.
  - Define a `ToolDefinition` interface: `{ name, description, schema, execute, permissions?, timeout? }`.
  - Implement `loadFromDirectory()` that dynamically imports all `.ts`/`.js` files from a directory and auto-registers tools that export a `ToolDefinition`.
  - Replace the `eval()` calculator with a safe math expression parser (e.g., `mathjs`).
  - Update `src/index.ts` to use `ToolRegistry` instead of the imported `tools` array.
  - Write tests: (a) register/unregister tools, (b) load from directory, (c) duplicate name rejection, (d) tool listing.
- Acceptance Criteria:
  - Tools are loaded dynamically from `src/tools/` at startup.
  - Adding a new `.ts` file to `src/tools/` that exports a `ToolDefinition` makes it available without editing any other file.
  - The `eval()` calculator is replaced with a safe alternative.
  - `ToolRegistry.list()` returns all registered tool names and descriptions.
- Test Requirements: Unit + integration tests. Create a test tool fixture.
- Guidelines: Each tool file should be self-contained with its own schema.
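A minimal registry sketch, assuming the `ToolDefinition` shape named in the task (with `schema` omitted here for brevity). `loadFromDirectory()` is left out since it needs dynamic `import()` against real files.

```typescript
// Minimal ToolRegistry sketch for Task 1.6. The ToolDefinition shape
// mirrors the interface named in the task; `schema` is elided here.
interface ToolDefinition {
  name: string;
  description: string;
  execute: (args: Record<string, unknown>) => Promise<string> | string;
  permissions?: "safe" | "cautious" | "dangerous";
  timeout?: number;
}

class ToolRegistry {
  private tools = new Map<string, ToolDefinition>();

  register(tool: ToolDefinition): void {
    if (this.tools.has(tool.name)) {
      throw new Error(`duplicate tool name: ${tool.name}`); // duplicate rejection
    }
    this.tools.set(tool.name, tool);
  }

  get(name: string): ToolDefinition | undefined {
    return this.tools.get(name);
  }

  list(): { name: string; description: string }[] {
    return [...this.tools.values()].map(({ name, description }) => ({ name, description }));
  }
}
```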
Task 1.7: Tool Security & Human-in-the-Loop
- Depends on: 1.6
- Description: Add a permission model and confirmation mechanism for tool execution.
- Steps:
  - Define permission levels in `ToolDefinition`: `"safe"` (auto-approve), `"cautious"` (log + approve), `"dangerous"` (require confirmation).
  - Create `src/security.ts` with: a `ToolPermissionManager` that checks permissions before tool execution.
  - Implement a `ConfirmationHandler` interface with a default CLI implementation that prompts the user for `"dangerous"` tools.
  - Add an `AUTO_APPROVE_ALL` config flag for non-interactive/CI contexts.
  - Add an allowlist/blocklist mechanism for tool names in `appConfig`.
  - Write tests: (a) safe tool auto-approved, (b) dangerous tool blocked without confirmation, (c) blocklisted tool rejected, (d) `AUTO_APPROVE_ALL` bypasses confirmation.
- Acceptance Criteria:
  - A tool marked `"dangerous"` prompts the user before execution in interactive mode.
  - A blocklisted tool name is rejected with a descriptive error message even if registered.
  - `AUTO_APPROVE_ALL=true` skips all confirmation prompts.
- Test Requirements: Unit tests with mocked `ConfirmationHandler`.
- Guidelines: The `ConfirmationHandler` interface must be replaceable (for future UI/API integration).
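The decision logic could be ordered as below: blocklist first, then permission level. This ordering is an assumption; `checkToolPermission` is an illustrative helper, not the real `ToolPermissionManager` API.

```typescript
// Permission-check sketch for Task 1.7. ConfirmationHandler is the
// replaceable interface from the task; the ordering (blocklist before
// permission level) is an assumed design choice.
type Permission = "safe" | "cautious" | "dangerous";

interface ConfirmationHandler {
  confirm(toolName: string): Promise<boolean>;
}

interface PermissionConfig {
  autoApproveAll?: boolean;
  blocklist?: string[];
}

async function checkToolPermission(
  toolName: string,
  permission: Permission,
  config: PermissionConfig,
  handler: ConfirmationHandler,
): Promise<boolean> {
  if (config.blocklist?.includes(toolName)) {
    throw new Error(`tool "${toolName}" is blocklisted`); // rejected even if registered
  }
  if (config.autoApproveAll || permission === "safe") return true;
  if (permission === "cautious") return true; // log + approve; logging elided here
  return handler.confirm(toolName); // "dangerous": require confirmation
}
```

Swapping the CLI `ConfirmationHandler` for a UI or API implementation only requires a different object with the same `confirm` method.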
Phase 2: Essential Coding Agent Tools
Goal: Build the concrete tools a software development agent needs. Each tool follows the `ToolDefinition` interface from Phase 1.
Task 2.1: Shell Command Execution Tool
- Depends on: 1.6, 1.7
- Description: Implement a tool that executes shell commands with safety controls.
- Steps:
  - Create `src/tools/shell.ts` implementing `ToolDefinition` with permission level `"dangerous"`.
  - Use Node.js `child_process.execFile` (not `exec`) to avoid shell injection.
  - Support configurable working directory, environment variables, and timeout.
  - Capture stdout, stderr, and exit code. Return structured output: `{ stdout, stderr, exitCode }`.
  - Implement a command blocklist (configurable, default blocks: `rm -rf /`, `mkfs`, `dd`, etc.).
  - Write tests: (a) successful command, (b) failing command returns stderr + exit code, (c) blocked command rejected, (d) timeout kills process.
- Acceptance Criteria:
  - `shell.execute({ command: "echo hello" })` returns `{ stdout: "hello\n", stderr: "", exitCode: 0 }`.
  - `shell.execute({ command: "rm -rf /" })` is rejected by the blocklist before execution.
  - A command exceeding timeout is killed and returns a timeout error.
- Test Requirements: Integration tests in a temp directory. No destructive commands.
- Guidelines: Never use `child_process.exec` with unsanitized input. Permission level is `"dangerous"`.
Task 2.2: File Management Tools
- Depends on: 1.6, 1.7
- Description: Implement tools for reading, creating, editing, and deleting files.
- Steps:
  - Create `src/tools/file-read.ts` — reads file content, returns content + metadata (size, encoding). Permission: `"safe"`.
  - Create `src/tools/file-write.ts` — creates or overwrites files. Permission: `"cautious"`.
  - Create `src/tools/file-edit.ts` — applies targeted edits (search-and-replace or line-range replacement). Permission: `"cautious"`.
  - Create `src/tools/file-delete.ts` — deletes files. Permission: `"dangerous"`.
  - All tools enforce a configurable `WORKSPACE_ROOT` — no file operations outside this directory (path traversal prevention).
  - Create `src/tools/file-list.ts` — lists directory contents with optional glob filtering. Permission: `"safe"`.
  - Write tests for each tool using a temp directory fixture.
- Acceptance Criteria:
  - `file-read` returns file content as a string with correct encoding detection.
  - `file-write` creates a new file and `file-read` can read it back.
  - `file-edit` with `{ path, search: "old", replace: "new" }` modifies only the matched content.
  - Any path outside `WORKSPACE_ROOT` is rejected with an error.
  - `file-list` with a glob pattern returns matching entries.
- Test Requirements: Integration tests with `tmp` directories. Clean up after each test.
- Guidelines: Use `fs/promises`. Validate all paths with `path.resolve()` + prefix check.
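The `path.resolve()` + prefix check from the guidelines can be written as a small helper. The function name and `WORKSPACE_ROOT` handling are assumptions about the eventual shape.

```typescript
// Path-confinement sketch for Task 2.2, using path.resolve() plus a
// prefix check as the guidelines suggest. Name and shape are assumed.
import * as path from "node:path";

function resolveInsideWorkspace(workspaceRoot: string, requested: string): string {
  const root = path.resolve(workspaceRoot);
  const resolved = path.resolve(root, requested);
  // The path.sep suffix prevents "/workspace-evil" from passing a
  // naive "/workspace" startsWith check.
  if (resolved !== root && !resolved.startsWith(root + path.sep)) {
    throw new Error(`path escapes WORKSPACE_ROOT: ${requested}`);
  }
  return resolved;
}
```

Every file tool would run its `path` argument through this helper before touching the filesystem.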
Task 2.3: Code Search Tool
- Depends on: 1.6
- Description: Implement a tool for searching through codebases using text patterns and regex.
- Steps:
  - Create `src/tools/code-search.ts` implementing `ToolDefinition`. Permission: `"safe"`.
  - Support search modes: literal string match, regex, and file-name glob.
  - Use the `ripgrep` CLI if available (with `child_process`); fall back to a recursive `fs` + regex search.
  - Return results as: `{ matches: [{ file, line, column, content, context }] }` with configurable max results (default: 50).
  - Respect `.gitignore` patterns when searching.
  - Write tests: (a) literal search finds match, (b) regex search, (c) respects max results, (d) no matches returns empty array.
- Acceptance Criteria:
  - Searching for a known string in a test fixture returns the correct file, line, and content.
  - Regex search for `/function\s+\w+/` returns function definitions.
  - Results are capped at the configured maximum.
- Test Requirements: Integration tests with a test fixture directory containing sample files.
- Guidelines: Prefer `ripgrep` for performance but ensure the fallback works.
Task 2.4: Version Control (Git) Tools
- Depends on: 2.1
- Description: Implement tools for common Git operations.
- Steps:
  - Create `src/tools/git-status.ts` — runs `git status --porcelain` and returns structured output. Permission: `"safe"`.
  - Create `src/tools/git-diff.ts` — runs `git diff` with optional path filter. Permission: `"safe"`.
  - Create `src/tools/git-commit.ts` — stages specified files and commits with a message. Permission: `"cautious"`.
  - Create `src/tools/git-log.ts` — returns recent commit history. Permission: `"safe"`.
  - Use the `simple-git` library (not `nodegit` — it's abandoned and has native compilation issues).
  - Write tests using a temp Git repository fixture.
- Acceptance Criteria:
  - `git-status` on a repo with uncommitted changes returns a structured file status list.
  - `git-diff` shows the diff content for modified files.
  - `git-commit` creates a commit that appears in `git-log` output.
  - All tools fail gracefully when run outside a Git repository.
- Test Requirements: Integration tests that create/destroy a temp Git repo with `simple-git`.
- Guidelines: Do NOT implement `git push` yet — that's a `"dangerous"` operation for a later task.
Task 2.5: Code Execution Tool
- Depends on: 2.1
- Description: Implement a tool that runs code snippets or test commands in a controlled environment.
- Steps:
  - Create `src/tools/code-run.ts` implementing `ToolDefinition`. Permission: `"dangerous"`.
  - Support execution modes: (a) run a shell command (delegates to the shell tool), (b) run a script file.
  - Capture stdout, stderr, and exit code. Enforce a timeout (default: 60s).
  - Add an `EXECUTION_ENVIRONMENT` config for future sandboxing (default: `"local"`).
  - Write tests: (a) run a simple Node.js script, (b) running a failing script returns an error, (c) timeout enforcement.
- Acceptance Criteria:
  - Running `node -e "console.log(42)"` returns `{ stdout: "42\n", exitCode: 0 }`.
  - Running a script with a syntax error returns stderr and a non-zero exit code.
  - A script exceeding the timeout is killed.
- Test Requirements: Integration tests with simple script fixtures.
- Guidelines: This is intentionally simple for now — sandboxed execution (Docker, etc.) is a Phase 4 concern.
Task 2.6: Workspace Context Awareness
- Depends on: 2.2, 2.3
- Description: Give the agent automatic awareness of the project it's working in.
- Steps:
  - Create `src/workspace.ts` with `analyzeWorkspace(rootPath): WorkspaceInfo`.
  - `WorkspaceInfo` includes: `{ language, framework, packageManager, hasTests, testCommand, lintCommand, buildCommand, entryPoints, gitInitialized }`.
  - Detect via heuristics: `package.json` → Node/TS, `requirements.txt`/`pyproject.toml` → Python, `go.mod` → Go, etc.
  - Extract test/build/lint commands from `package.json` scripts, `Makefile`, etc.
  - Feed `WorkspaceInfo` into `getSystemPrompt()` (from Task 1.3) so the LLM knows the project context.
  - Write tests with fixture directories for Node, Python, and Go projects.
- Acceptance Criteria:
  - A directory with `package.json` is detected as Node/TypeScript with the correct test command.
  - A directory with `pyproject.toml` is detected as Python.
  - `WorkspaceInfo` is included in the system prompt.
- Test Requirements: Unit tests with mock filesystem fixtures.
- Guidelines: Start with Node/TS detection (since AgentLoop itself is TS). Add more languages incrementally.
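The marker-file heuristic can be kept as a pure function over a directory listing, which makes it testable without touching the filesystem. The marker table follows the examples in the steps; `detectProject` and its return shape are illustrative, not the real `analyzeWorkspace`.

```typescript
// Heuristic language detection sketch for Task 2.6. The marker table
// follows the task's examples; detectProject is an illustrative helper,
// not the real analyzeWorkspace().
interface DetectedProject { language: string; packageManager?: string }

const MARKERS: [file: string, result: DetectedProject][] = [
  ["package.json", { language: "node", packageManager: "npm" }],
  ["pyproject.toml", { language: "python" }],
  ["requirements.txt", { language: "python" }],
  ["go.mod", { language: "go" }],
];

function detectProject(files: string[]): DetectedProject {
  // First matching marker wins; order encodes detection priority.
  for (const [marker, result] of MARKERS) {
    if (files.includes(marker)) return result;
  }
  return { language: "unknown" };
}
```

`analyzeWorkspace()` would read the directory listing with `fs/promises` and pass it through a function like this, then enrich the result with commands parsed from `package.json` scripts.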
Task 2.7: Diff/Patch Generation Tool
- Depends on: 2.2
- Description: Implement a tool that generates unified diffs and can apply patches.
- Steps:
  - Create `src/tools/diff.ts` — generates a unified diff between two strings or files. Permission: `"safe"`.
  - Create `src/tools/patch.ts` — applies a unified diff patch to a file. Permission: `"cautious"`.
  - Use a library like `diff` (npm) for diff generation and application.
  - Write tests: (a) generate diff between two strings, (b) apply patch restores expected content, (c) applying a bad patch returns an error.
- Acceptance Criteria:
  - `diff.execute({ original: "hello", modified: "hello world" })` returns a valid unified diff string.
  - Applying the generated diff to the original produces the modified content.
- Test Requirements: Unit tests, no external dependencies.
- Guidelines: This is foundational for the agent to communicate code changes clearly.
Task 2.8: MCP Client Integration
- Depends on: 1.6, 1.7
- Description: Integrate MCP (Model Context Protocol) client support so the agent can connect to external MCP tool servers.
- Steps:
  - Add `@modelcontextprotocol/sdk` as a dependency.
  - Create `src/mcp/client.ts` implementing an MCP client that connects to MCP servers via stdio or SSE transport.
  - Create `src/mcp/bridge.ts` that converts MCP tool definitions into `ToolDefinition` format and registers them in the `ToolRegistry`.
  - Add an `MCP_SERVERS` config (JSON array of server configs: `{ name, command, args, transport }`).
  - On startup, connect to configured MCP servers, discover their tools, and register them.
  - Handle MCP server lifecycle: start on agent init, stop on agent shutdown.
  - Write tests: (a) mock MCP server connection, (b) tool discovery and registration, (c) tool invocation through the bridge, (d) server crash handling.
- Acceptance Criteria:
  - With `MCP_SERVERS=[{"name":"test","command":"node","args":["test-server.js"],"transport":"stdio"}]`, the agent discovers and registers tools from the MCP server.
  - An MCP tool can be invoked through the normal tool execution flow.
  - If an MCP server crashes, the agent logs the error and marks those tools as unavailable.
- Test Requirements: Integration tests with a minimal mock MCP server (a simple Node.js script).
- Guidelines: MCP is a protocol for tool/resource servers — the agent is the client, not the server. Follow the official MCP SDK docs.
Phase 3: Subagent Architecture & Planning
Goal: Enable the agent to delegate work to specialized sub-agents and generate plans.
Task 3.1: Subagent Architecture
- Depends on: 1.2, 1.6, 1.5
- Description: Design and implement the core subagent framework.
- Steps:
  - Create `src/subagents/` directory.
  - Define a `SubagentDefinition` interface: `{ name, systemPrompt, tools: string[], maxIterations, parentCommunication }`.
  - Create `src/subagents/runner.ts` — `runSubagent(definition, task): SubagentResult` that creates a new agent loop instance with its own message history, tool set, and iteration budget.
  - Create `src/subagents/manager.ts` — a `SubagentManager` that tracks running subagents, enforces concurrency limits, and collects results.
  - Subagents use the same `createLLM` factory and `ToolRegistry` but with filtered tool access.
  - Implement result passing: subagent output is returned to the parent agent as a structured message.
  - Write tests: (a) subagent runs to completion, (b) subagent respects its own `MAX_ITERATIONS`, (c) subagent only accesses its allowed tools, (d) manager enforces concurrency limit.
- Acceptance Criteria:
  - `runSubagent({ name: "planner", tools: ["file-read"], maxIterations: 5 }, "list files")` runs a sub-loop and returns a result.
  - A subagent cannot access tools not in its `tools` list.
  - A `SubagentManager` with concurrency limit 2 queues a third subagent.
- Test Requirements: Unit tests with mocked LLM. Integration test with a real (cheap) LLM call as a smoke test.
- Guidelines: Subagents are isolated agent loops — they don't share message history with the parent.
Task 3.2: Plan Generation Subagent
- Depends on: 3.1
- Description: Implement an LLM-based planning subagent that decomposes user requests into actionable task lists.
- Steps:
  - Create `src/subagents/planner.ts` with a specialized system prompt for plan generation.
  - The planner takes user input + `WorkspaceInfo` and outputs a structured plan: `{ steps: [{ description, toolsNeeded, estimatedComplexity }] }`.
  - Implement a JSON-mode or structured output constraint for reliable plan parsing.
  - Add a plan validation step that checks all referenced tools exist in the registry.
  - Implement a feedback loop: if the parent agent rejects a plan, the planner refines it.
  - Write tests: (a) plan generated from input contains steps, (b) plan with invalid tools is flagged, (c) refinement produces a different plan.
- Acceptance Criteria:
  - Given "Add a new endpoint to the Express app," the planner produces a multi-step plan with tools like `file-read`, `file-write`, `code-run`.
  - A plan referencing a non-existent tool is flagged with a validation error.
  - After one refinement round, the plan no longer references invalid tools.
- Test Requirements: Unit tests with mocked LLM responses. One integration smoke test.
- Guidelines: Use structured output / JSON mode to avoid brittle text parsing.
Task 3.3: Task Execution Orchestrator
- Depends on: 3.1, 3.2
- Description: Implement the orchestrator that takes a plan and executes it step-by-step.
- Steps:
  - Create `src/orchestrator.ts` — `executePlan(plan): ExecutionResult`.
  - For each step in the plan, either: (a) execute directly if simple (single tool call), or (b) spawn a subagent for complex steps.
  - Implement checkpointing: after each step, save progress so execution can resume after a failure.
  - Implement step-level error handling: if a step fails, the orchestrator can retry, skip, or abort (configurable).
  - Provide progress reporting (step N of M, current status).
  - Write tests: (a) a 3-step plan executes all steps, (b) a step failure with retry succeeds on the second attempt, (c) a checkpoint allows resumption.
- Acceptance Criteria:
  - A 3-step plan executes all steps in order and returns combined results.
  - If step 2 fails, the orchestrator retries once and continues if the retry succeeds.
  - After a crash at step 2, `executePlan(plan, { resumeFrom: 2 })` skips step 1.
- Test Requirements: Unit tests with mocked tool executions.
- Guidelines: Keep the orchestrator stateless — checkpoint state is stored externally (file or memory).
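The retry-and-resume mechanics can be sketched as follows. `PlanStep` with a synchronous `run` is a deliberate simplification of the real plan structure and subagent dispatch.

```typescript
// Step-execution sketch for Task 3.3: per-step retry plus resumption
// from a checkpoint index. PlanStep/run are simplified stand-ins for
// the real plan structure and subagent dispatch.
interface PlanStep { description: string; run: () => string }
interface ExecOptions { resumeFrom?: number; maxRetries?: number }

function executePlan(steps: PlanStep[], opts: ExecOptions = {}): { results: string[]; completed: number } {
  const results: string[] = [];
  const start = opts.resumeFrom ?? 0; // checkpoint: skip already-completed steps
  const maxRetries = opts.maxRetries ?? 1;
  for (let i = start; i < steps.length; i++) {
    let lastError: unknown;
    let done = false;
    // Retry a failing step up to maxRetries extra attempts before aborting.
    for (let attempt = 0; attempt <= maxRetries && !done; attempt++) {
      try {
        results.push(steps[i].run());
        done = true;
      } catch (err) {
        lastError = err;
      }
    }
    if (!done) throw new Error(`step ${i} failed: ${String(lastError)}`);
  }
  return { results, completed: steps.length - start };
}
```

Keeping the function stateless (the checkpoint index is passed in, not stored) matches the guideline above.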
Task 3.4: Advanced MCP Features
- Depends on: 2.8
- Description: Extend MCP support with resources, prompts, and sampling capabilities.
- Steps:
  - Implement MCP resource discovery and reading in `src/mcp/client.ts`.
  - Implement MCP prompt template support — discover and use prompts from MCP servers.
  - Implement MCP sampling support — allow MCP servers to request LLM completions from the agent.
  - Add MCP server health checking and auto-reconnection.
  - Write tests for each new MCP capability.
- Acceptance Criteria:
  - The agent can list and read resources from an MCP server.
  - MCP prompt templates can be discovered and used in the agent's prompt system.
  - An MCP server requesting sampling receives a completion from the agent's LLM.
- Test Requirements: Integration tests with a mock MCP server.
- Guidelines: Follow the MCP specification for resources, prompts, and sampling.
Task 3.5: Multi-Agent Coordination
- Depends on: 3.1, 3.3
- Description: Enable multiple subagents to work in parallel and coordinate on complex tasks.
- Steps:
  - Extend `SubagentManager` with parallel execution support (`runParallel(tasks[])`).
  - Implement a shared context mechanism — subagents can read (but not write) shared state.
  - Implement result aggregation — collect outputs from parallel subagents into a unified result.
  - Add conflict detection — if two subagents modify the same file, flag it.
  - Write tests: (a) 3 parallel subagents complete independently, (b) conflict detection triggers on overlapping file edits, (c) aggregated results contain all outputs.
- Acceptance Criteria:
  - `runParallel([task1, task2, task3])` runs 3 subagents concurrently and returns all results.
  - Two subagents editing the same file produces a conflict warning.
- Test Requirements: Unit tests with mocked subagents.
- Guidelines: Use `Promise.allSettled()` for parallel execution.
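The `Promise.allSettled()` pattern plus file-overlap conflict detection might look like this. `SubagentTask` is a simplified stand-in for the real subagent interface.

```typescript
// Parallel-subagent sketch for Task 3.5 using Promise.allSettled(),
// with file-overlap conflict detection. SubagentTask is a simplified
// stand-in for the real subagent interface.
interface SubagentTask {
  name: string;
  run: () => Promise<{ output: string; filesTouched: string[] }>;
}

async function runParallel(tasks: SubagentTask[]) {
  // allSettled ensures one failing subagent doesn't discard the others' results.
  const settled = await Promise.allSettled(tasks.map(t => t.run()));
  const results = settled.map((s, i) => ({
    name: tasks[i].name,
    ok: s.status === "fulfilled",
    output: s.status === "fulfilled" ? s.value.output : String(s.reason),
    filesTouched: s.status === "fulfilled" ? s.value.filesTouched : [],
  }));
  // Conflict detection: flag any file modified by more than one subagent.
  const touchedBy = new Map<string, string[]>();
  for (const r of results) {
    for (const f of r.filesTouched) {
      touchedBy.set(f, [...(touchedBy.get(f) ?? []), r.name]);
    }
  }
  const conflicts = [...touchedBy.entries()].filter(([, names]) => names.length > 1);
  return { results, conflicts };
}
```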
Phase 4: Observability, Security Hardening & Streaming
Goal: Make the agent production-observable and add streaming UX.
Task 4.1: Observability & Tracing
- Depends on: 1.1, 1.5
- Description: Add structured tracing and cost tracking to the agent.
- Steps:
  - Create `src/observability.ts` with a `Tracer` interface.
  - Implement a default tracer that logs to structured JSON files: one trace per agent invocation, with spans for each LLM call and tool execution.
  - Track token usage (prompt + completion tokens) per LLM call using LangChain's callback mechanism.
  - Add cumulative cost estimation based on configurable per-token pricing.
  - Add `TRACING_ENABLED` and `TRACE_OUTPUT_DIR` config options.
  - Write tests verifying trace files are created with the expected structure.
- Acceptance Criteria:
  - Each agent invocation produces a trace file with LLM calls, tool executions, and token counts.
  - Total tokens and estimated cost are included in the trace summary.
- Test Requirements: Integration test that runs an agent invocation and inspects the trace file.
- Guidelines: Design the `Tracer` interface to be replaceable (for OpenTelemetry, LangSmith, etc. in the future).
Task 4.2: Streaming Response Support
- Depends on: 1.1, 1.2
- Description: Add streaming support so partial LLM responses are delivered incrementally.
- Steps:
  - Add a `stream` option to `agentExecutor.invoke(input, { stream: true })` that returns an `AsyncIterable<string>`.
  - Use LangChain's `.stream()` method instead of `.invoke()` when streaming is requested.
  - Update the CLI main loop to print tokens as they arrive.
  - Handle tool calls during streaming (buffer until the complete tool call is received, then execute).
  - Write tests: (a) streaming mode returns chunks, (b) tool calls are buffered correctly, (c) non-streaming mode still works.
- Acceptance Criteria:
  - In streaming mode, the CLI prints tokens character-by-character as they arrive from the LLM.
  - Tool calls during streaming are handled correctly (buffered, executed, then streaming resumes).
- Test Requirements: Unit tests with mocked streaming LLM responses.
- Guidelines: Use `for await...of` for the streaming consumer.
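The consumer side of an `AsyncIterable<string>` looks like this. `fakeStream` stands in for LangChain's `.stream()`; tool-call buffering is deliberately elided.

```typescript
// Streaming-consumer sketch for Task 4.2: consuming an
// AsyncIterable<string> of chunks with for await...of. fakeStream
// stands in for LangChain's .stream(); tool-call buffering is elided.
async function* fakeStream(text: string, chunkSize = 4): AsyncIterable<string> {
  for (let i = 0; i < text.length; i += chunkSize) {
    yield text.slice(i, i + chunkSize);
  }
}

async function collectStream(
  stream: AsyncIterable<string>,
  onChunk?: (c: string) => void,
): Promise<string> {
  let full = "";
  for await (const chunk of stream) {
    onChunk?.(chunk); // e.g. process.stdout.write(chunk) in the CLI
    full += chunk;
  }
  return full;
}
```

The CLI would pass `process.stdout.write` as `onChunk`, while the non-streaming path simply uses the accumulated return value.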
Task 4.3: Security Hardening
- Depends on: 1.7, 2.1, 2.2
- Description: Comprehensive security audit and hardening pass.
- Steps:
  - Implement workspace isolation: all file and shell operations are confined, `chroot`-like, to `WORKSPACE_ROOT`.
  - Add network access controls for tools (configurable allowlist of domains/IPs).
  - Implement resource limits: max file size for read/write, max output size for shell commands, max concurrent tool executions.
  - Add input sanitization for all tool parameters.
  - Create a security test suite that attempts common attacks: path traversal, command injection, resource exhaustion.
  - Document the security model in `docs/security.md`.
- Acceptance Criteria:
  - Path traversal attempts (`../../etc/passwd`) are blocked.
  - Shell injection attempts (`; rm -rf /`) are blocked.
  - File reads exceeding `MAX_FILE_SIZE` are rejected.
  - The security document describes the threat model and mitigations.
- Test Requirements: Dedicated security test suite.
- Guidelines: Assume all LLM-generated tool inputs are potentially malicious.
Task 4.4: Execution Sandboxing (Optional/Stretch)
- Depends on: 4.3, 2.5
- Description: Add optional Docker-based sandboxing for code execution.
- Steps:
  - Create `src/sandbox/docker.ts` that runs code execution inside a Docker container.
  - Use a minimal base image, mount the workspace read-only (or copy), capture output.
  - Add a `SANDBOX_MODE` config: `"none"` (default), `"docker"`.
  - Write tests: (a) code runs in container, (b) filesystem changes in container don't affect host.
- Acceptance Criteria:
  - With `SANDBOX_MODE=docker`, code execution runs inside a container.
  - The host filesystem is not modified by sandboxed execution.
- Test Requirements: Integration tests (require Docker installed). Mark as skippable in CI without Docker.
- Guidelines: This is a stretch goal. The `"none"` mode is the default and must always work.
Phase 5: Testing, Performance & Quality
Goal: Comprehensive testing pass and performance optimization.
Task 5.1: End-to-End Test Suite
- Depends on: all Phase 2 and 3 tasks
- Description: Build an E2E test suite that tests complete agent workflows.
- Steps:
  - Create a `tests/e2e/` directory.
  - Implement test scenarios: (a) "Create a new file with specific content" — exercises file-write + verification, (b) "Find and fix a bug" — exercises code-search + file-edit + code-run, (c) "Generate a plan for a feature" — exercises the planner subagent.
  - Use a mock LLM with deterministic responses for reproducibility.
  - Add an `E2E_USE_REAL_LLM` flag for optional live testing (not in CI by default).
  - Measure and assert on total execution time per scenario.
- Acceptance Criteria:
  - All 3 E2E scenarios pass with the mock LLM.
  - Each scenario completes in under 10 seconds with the mock LLM.
- Test Requirements: E2E tests. Mock LLM by default, optional real LLM.
- Guidelines: E2E tests should be self-contained — each creates and tears down its own workspace.
Task 5.2: Performance Benchmarking
- Depends on: 5.1
- Description: Benchmark and optimize critical paths.
- Steps:
  - Benchmark: tool registry lookup time with 100 registered tools.
  - Benchmark: context trimming with 1000 messages.
  - Benchmark: code search on a 10,000-file repository.
  - Benchmark: workspace analysis on a large project.
  - Profile with Node.js `--prof` and identify hotspots.
  - Optimize identified bottlenecks.
  - Document benchmarks and results.
- Acceptance Criteria:
  - Tool registry lookup is < 1ms for 100 tools.
  - Context trimming for 1000 messages completes in < 100ms.
  - All benchmarks are documented with baseline and optimized numbers.
- Test Requirements: Benchmark scripts in `benchmarks/`.
- Guidelines: Run benchmarks in CI to catch regressions.
Task 5.3: Testing Infrastructure for LLM-Dependent Code
- Depends on: 1.1
- Description: Establish the mocking/recording strategy for testing LLM-dependent behavior.
- Steps:
  - Create `tests/fixtures/llm-responses/` with recorded LLM response fixtures.
  - Build a `MockChatModel` class that replays recorded responses based on input patterns.
  - Add a recording mode that captures real LLM responses and saves them as fixtures.
  - Document the testing strategy in `docs/testing.md`.
  - Retrofit existing tests to use `MockChatModel`.
- Acceptance Criteria:
  - `MockChatModel` can replay a recorded multi-turn conversation with tool calls.
  - Recording mode captures real responses and saves valid fixture files.
  - All existing tests pass with `MockChatModel` (no real API calls in CI).
- Test Requirements: Meta-tests (tests for the test infrastructure).
- Guidelines: This is foundational — do this early (it's in Phase 5 for execution but could be pulled into Phase 1 if needed).
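A minimal sketch of the replay idea, with the fixture shape and `invoke` signature as assumptions (a real `MockChatModel` would extend LangChain's chat model base class rather than this standalone class):

```typescript
// Illustrative fixture-replay mock; names and fixture shape are assumptions.
type ToolCall = { name: string; args: Record<string, unknown> };
type RecordedTurn = { match: RegExp; content: string; toolCalls: ToolCall[] };

class ReplayChatModel {
  private turn = 0;
  constructor(private readonly fixtures: RecordedTurn[]) {}

  // Replays the next recorded turn, failing loudly if the incoming input
  // does not match the pattern captured during recording.
  async invoke(input: string): Promise<{ content: string; toolCalls: ToolCall[] }> {
    const fixture = this.fixtures[this.turn];
    if (!fixture) throw new Error(`No recorded fixture for turn ${this.turn}`);
    if (!fixture.match.test(input)) {
      throw new Error(`Turn ${this.turn}: input does not match recorded pattern`);
    }
    this.turn += 1;
    return { content: fixture.content, toolCalls: fixture.toolCalls };
  }
}

// Usage: a two-turn conversation where the first turn requests a tool call.
const model = new ReplayChatModel([
  { match: /weather/i, content: "", toolCalls: [{ name: "search", args: { q: "weather" } }] },
  { match: /tool result/i, content: "It is sunny.", toolCalls: [] },
]);
```

Matching on input patterns (rather than exact strings) keeps fixtures robust against minor prompt wording changes while still catching real regressions.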
Phase 6: Documentation & Deployment
Task 6.1: Documentation
- Depends on: all prior phases
- Description: Comprehensive documentation covering architecture, usage, and extension.
- Steps:
  - Create the `docs/` directory structure: `architecture.md`, `getting-started.md`, `tools.md`, `security.md`, `extending.md`, `configuration.md`.
  - `architecture.md`: system overview diagram, agent loop flow, subagent architecture, MCP integration.
  - `getting-started.md`: installation, first run, example workflows.
  - `tools.md`: catalog of all built-in tools with examples.
  - `extending.md`: how to add a new tool, how to create a subagent, how to connect an MCP server.
  - `configuration.md`: all environment variables and their defaults.
  - Update `README.md` to reflect the new architecture.
- Acceptance Criteria:
  - A new developer can go from clone to running the agent within 10 minutes using the docs.
  - All config options are documented with descriptions and defaults.
  - The extending guide includes a complete working example of adding a custom tool.
- Test Requirements: Documentation review. Verify code examples actually run.
- Guidelines: Use Mermaid diagrams in markdown for architecture visualization.
Task 6.2: Deployment & Packaging
- Depends on: 6.1
- Description: Package the agent for production deployment.
- Steps:
  - Add a proper `tsconfig.json` build configuration that outputs to `dist/`.
  - Create a `Dockerfile` for containerized deployment.
  - Add `npm run build` and `npm run start:prod` scripts.
  - Create a GitHub Actions workflow for: build → test → publish (npm or Docker).
  - Add a `CHANGELOG.md` with semantic versioning.
  - Publish as an npm package (optional, scoped to `@huberp/agentloop`).
- Acceptance Criteria:
  - `npm run build` produces a clean `dist/` directory.
  - `docker build .` and `docker run` start the agent successfully.
  - CI pipeline runs all tests and produces a build artifact.
- Test Requirements: CI pipeline passes. Docker build succeeds.
- Guidelines: Use multi-stage Docker builds for minimal image size.
Cross-Cutting Concerns (Apply Throughout)
These are not separate tasks but principles to enforce in every task's code review:
| Concern | Requirement |
|---|---|
| Logging | Every tool execution, LLM call, and error is logged with structured context |
| TypeScript Strict | Enable `strict: true` in tsconfig. No `any` types except justified cases |
| Error Messages | All errors include actionable context (what failed, what was expected, how to fix) |
| Config | All magic numbers are configurable via `appConfig` with sensible defaults |
| Backwards Compat | `npx tsx src/index.ts` still starts the interactive CLI at every phase |
| Git Hygiene | One PR per task. Each PR includes tests. No PR without tests |
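As an illustration of the logging requirement, a wrapper like the following (hypothetical names; the real code would route through the existing Pino logger in `src/logger.ts`) attaches structured context to both success and failure paths of a tool execution:

```typescript
// Illustrative sketch: names and field shapes are assumptions, not existing code.
type LogFn = (ctx: Record<string, unknown>, msg: string) => void;

// Wraps a tool execution so that success and failure both emit a structured
// log record with the tool name, arguments, and duration.
async function executeToolLogged(
  name: string,
  args: Record<string, unknown>,
  run: () => Promise<string>,
  log: LogFn = (ctx, msg) => console.log(JSON.stringify({ msg, ...ctx }))
): Promise<string> {
  const start = Date.now();
  try {
    const result = await run();
    log({ tool: name, args, durationMs: Date.now() - start, ok: true }, "tool executed");
    return result;
  } catch (err) {
    log({ tool: name, args, error: String(err), ok: false }, "tool failed");
    throw err;
  }
}
```

Centralizing this in one wrapper keeps the per-tool code free of logging boilerplate and guarantees the "every tool execution is logged" rule cannot be skipped by accident.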
Summary of Changes from Original Plan (Issue #4)
| Problem in #4 | Fix in v2 |
|---|---|
| Phase ordering inverted (Phase 2 before 2.5) | Reordered: core tools (Phase 2) before multi-agent (Phase 3) |
| Core loop only does single tool-call pass | Added Task 1.1: iterative tool-calling loop |
| No context window management | Added Task 1.4: token counting + trimming |
| No LLM provider abstraction | Added Task 1.2: provider factory |
| MCP tasks had incorrect semantics | Rewrote as Task 2.8 and 3.4 with correct client-side MCP integration |
| No error handling strategy | Added Task 1.5: comprehensive error handling |
| No human-in-the-loop | Added Task 1.7: permission model + confirmation |
| Vague acceptance criteria | Every task now has specific, testable criteria |
| No dependency graph | Added explicit Depends on: for every task |
| `nodegit` recommendation | Replaced with `simple-git` |
| `eval()` in calculator | Task 1.6 explicitly replaces it with a safe math parser |
| No workspace awareness | Added Task 2.6 |
| No observability | Added Task 4.1 |
| No streaming | Added Task 4.2 |
| No diff/patch capability | Added Task 2.7 |
| No testing strategy for LLM code | Added Task 5.3 |
| Tasks not scoped for agent consumption | Each task is ~1-3 days, self-contained, with concrete steps |