Task 2.9: Secure Web Research Tool
Add-on to the Development Plan v2 (#5). Extends Phase 2 with web research capabilities not covered in the original plan.
- Depends on: 1.6, 1.7, 2.8 (optional — MCP path)
- Estimated effort: 2–3 days
- Description: Implement two tools (`web-search`, `web-fetch`) that allow the agent to search the web and fetch/read web pages, returning clean Markdown content. The tools must enforce strict URL hygiene (tracking-parameter stripping, domain allowlist/blocklist, SSRF protection) and be built primarily as thin glue code over well-maintained libraries.
Motivation
The agent currently has no ability to look up external information — documentation, API references, Stack Overflow answers, changelogs, CVE details, etc. A web-research capability closes this gap and is critical for real-world software development tasks. This was identified as a missing piece in the Development Plan v2 (#5).
Steps
1. Create `src/tools/web-search.ts`
   - Implements `ToolDefinition`. Permission: `"cautious"`.
   - Schema: `{ query: string, maxResults?: number }` (default `maxResults: 5`).
   - Delegates the actual search to one of the configured backends (see Recommended Libraries below).
   - Returns: `{ results: [{ title, url, snippet }] }` — URLs in results are cleaned (tracking params stripped).
   - When `WEB_SEARCH_PROVIDER=none` (default), returns an error message instructing the user to configure a provider.
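A minimal sketch of the provider dispatch and the `WEB_SEARCH_PROVIDER=none` error path. The `ToolDefinition` interface below is a simplified stand-in for the project's real one, and `makeWebSearchTool` is a hypothetical factory used only for illustration:

```typescript
// Simplified stand-ins — the project's real ToolDefinition lives elsewhere.
type SearchResult = { title: string; url: string; snippet: string };
type SearchProvider = "brave" | "tavily" | "mcp" | "none";

interface WebSearchInput {
  query: string;
  maxResults?: number; // default: 5
}

interface ToolDefinition<I, O> {
  name: string;
  permission: "cautious" | "dangerous";
  execute(input: I): O | { error: string };
}

type SearchBackend = (query: string, maxResults: number) => SearchResult[];

export function makeWebSearchTool(
  provider: SearchProvider,
  backends: Partial<Record<SearchProvider, SearchBackend>>
): ToolDefinition<WebSearchInput, { results: SearchResult[] }> {
  return {
    name: "web-search",
    permission: "cautious",
    execute({ query, maxResults = 5 }) {
      if (provider === "none") {
        return {
          error:
            "No search provider configured. Set WEB_SEARCH_PROVIDER to brave, tavily, or mcp.",
        };
      }
      const backend = backends[provider];
      if (!backend) return { error: `Provider "${provider}" is not available.` };
      // Backends return already-cleaned URLs (tracking params stripped in web-utils).
      return { results: backend(query, maxResults) };
    },
  };
}
```

Real backends are async HTTP calls; they are shown synchronous here only to keep the sketch small.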
2. Create `src/tools/web-fetch.ts`
   - Implements `ToolDefinition`. Permission: `"cautious"`.
   - Schema: `{ url: string, extractMode?: "readability" | "raw" }` (default: `"readability"`).
   - Pipeline: sanitize URL → blocklist/allowlist check → SSRF check → fetch → extract readable content → convert to Markdown → truncate.
   - Returns: `{ url, title, markdown, byline?, excerpt? }`.
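The pipeline order can be sketched with each stage injected as a function, which also makes the flow testable without network access, jsdom, or Turndown. All stage names here are illustrative; the real stages are tidy-url, the web-utils checks, `fetch()`, Readability, and Turndown:

```typescript
type FetchResult = {
  url: string;
  title: string;
  markdown: string;
  byline?: string;
  excerpt?: string;
};

interface PipelineStages {
  sanitize: (url: string) => string;            // tidy-url wrapper
  checkAllowed: (url: string) => void;          // throws on blocklist/allowlist/SSRF violation
  fetchHtml: (url: string) => Promise<string>;  // fetch() with timeout + size cap
  extract: (html: string) => { title: string; content: string } | null; // Readability
  toMarkdown: (html: string) => string;         // Turndown
  truncate: (markdown: string) => string;       // WEB_MAX_CONTENT_CHARS limit
}

export async function runFetchPipeline(
  rawUrl: string,
  stages: PipelineStages
): Promise<FetchResult> {
  const url = stages.sanitize(rawUrl);
  stages.checkAllowed(url); // reject before any network I/O
  const html = await stages.fetchHtml(url);
  const article = stages.extract(html);
  // Fallback: when Readability finds no article, convert the full page.
  const body = article ? article.content : html;
  return {
    url,
    title: article?.title ?? url,
    markdown: stages.truncate(stages.toMarkdown(body)),
  };
}
```

Note the security checks run on the sanitized URL and strictly before the fetch, matching the order in the pipeline diagram below.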
3. Create `src/tools/web-utils.ts` — URL Sanitization & Security Layer
| Concern | Implementation |
|---|---|
| Tracking parameter stripping | Use tidy-url to remove utm_*, fbclid, gclid, mc_eid, _ga, ref, and 1500+ other known tracker patterns automatically. One function call — the library maintains general + domain-specific rulesets. |
| Domain blocklist | Configurable WEB_DOMAIN_BLOCKLIST in appConfig. Default: common malware/phishing domains, localhost, internal hostnames. Reject any URL whose hostname matches. |
| Domain allowlist | Optional WEB_DOMAIN_ALLOWLIST — when non-empty, only listed domains are permitted. Useful for enterprise/locked-down environments. |
| Protocol enforcement | Only https: allowed by default. http: opt-in via WEB_ALLOW_HTTP=true. |
| SSRF protection | Reject URLs that resolve to private/loopback addresses (127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, ::1, 169.254.0.0/16, etc.) to prevent server-side request forgery. Use dns.resolve + private-range check before fetching. |
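A rough sketch of the checks in this table. In the real tool, parameter stripping is delegated to tidy-url; the regex below covers only the handful of parameters named above, to show the shape of the functions:

```typescript
// Tiny subset of tidy-url's ruleset, for illustration only.
const TRACKING_PARAMS = /^(utm_.+|fbclid|gclid|mc_eid|_ga|ref)$/;

/** Remove known tracking query parameters from a URL. */
export function stripTrackingParams(rawUrl: string): string {
  const url = new URL(rawUrl);
  for (const key of [...url.searchParams.keys()]) {
    if (TRACKING_PARAMS.test(key)) url.searchParams.delete(key);
  }
  return url.toString();
}

/** Blocklist always wins; a non-empty allowlist switches to allow-only mode. */
export function domainAllowed(
  hostname: string,
  blocklist: string[],
  allowlist: string[]
): boolean {
  if (blocklist.includes(hostname)) return false;
  if (allowlist.length > 0) return allowlist.includes(hostname);
  return true;
}

/** Private/loopback/link-local IPv4 ranges for the SSRF check (run after dns.resolve). */
export function isPrivateIPv4(ip: string): boolean {
  const parts = ip.split(".").map(Number);
  if (parts.length !== 4 || parts.some((p) => !Number.isInteger(p) || p < 0 || p > 255)) {
    return false;
  }
  const [a, b] = parts;
  return (
    a === 127 ||                          // 127.0.0.0/8 (loopback)
    a === 10 ||                           // 10.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) ||  // 172.16.0.0/12
    (a === 192 && b === 168) ||           // 192.168.0.0/16
    (a === 169 && b === 254)              // 169.254.0.0/16 (link-local)
  );
}
```

The IPv6 ranges from the table (`::1`, and ideally `fc00::/7`) need a parallel check; only the IPv4 side is sketched here.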
4. Content Extraction Pipeline (inside `web-fetch.ts`)
URL ──▶ tidy-url ──▶ blocklist/allowlist ──▶ SSRF check
│
▼
fetch(url) ──▶ jsdom ──▶ @mozilla/readability
│
▼
turndown ──▶ Markdown
│
truncate to limit
- Fetch: Use Node's built-in `fetch()` (Node 18+) with configurable timeout, `User-Agent`, and max response size (`WEB_MAX_RESPONSE_BYTES`, default: 5 MB).
- Readability extraction: Use `@mozilla/readability` + `jsdom` to extract the main article content (strips nav, ads, sidebars, footers — like Firefox Reader View).
- HTML → Markdown: Use `turndown` to convert cleaned HTML to Markdown. Configure it to preserve code blocks, headings, links, lists, and tables.
- Fallback: If Readability returns `null`, fall back to Turndown conversion of the full `<body>`.
- Truncation: Truncate the final Markdown to `WEB_MAX_CONTENT_CHARS` (default: 20000) without breaking mid-word.
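The truncation step is small enough to sketch directly: cut at `WEB_MAX_CONTENT_CHARS`, then back up to the last whitespace so no word is split. The trailing `*[content truncated]*` marker is an assumption, not part of the spec:

```typescript
export function truncateMarkdown(markdown: string, maxChars: number): string {
  if (markdown.length <= maxChars) return markdown;
  const slice = markdown.slice(0, maxChars);
  // Back up to the last space or newline so we never cut mid-word.
  const lastWs = Math.max(slice.lastIndexOf(" "), slice.lastIndexOf("\n"));
  const cut = lastWs > 0 ? slice.slice(0, lastWs) : slice;
  return cut + "\n\n*[content truncated]*"; // marker text is an assumption
}
```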
5. Configuration — add to `appConfig`

| Variable | Type | Default | Description |
|---|---|---|---|
| `WEB_SEARCH_PROVIDER` | `"brave" \| "tavily" \| "mcp" \| "none"` | `"none"` | Search backend. Disabled until configured. |
| `BRAVE_API_KEY` | `string` | `""` | API key for Brave Search (free tier: 2000 queries/mo). |
| `TAVILY_API_KEY` | `string` | `""` | API key for Tavily Search (free tier: 1000 queries/mo). |
| `WEB_DOMAIN_BLOCKLIST` | `string` (comma-separated) | `""` | Domains to block (e.g., `"malware.com,evil.org"`). |
| `WEB_DOMAIN_ALLOWLIST` | `string` (comma-separated) | `""` | When non-empty, only these domains are allowed. |
| `WEB_ALLOW_HTTP` | `boolean` | `false` | Allow `http://` URLs (insecure). |
| `WEB_MAX_RESPONSE_BYTES` | `number` | `5242880` (5 MB) | Max HTTP response body size. |
| `WEB_MAX_CONTENT_CHARS` | `number` | `20000` | Max Markdown output length. |
| `WEB_USER_AGENT` | `string` | `"AgentLoop/1.0"` | User-Agent header for fetch requests. |
| `WEB_FETCH_TIMEOUT_MS` | `number` | `15000` | Fetch timeout in milliseconds. |
6. Write tests
- (a) URL sanitization strips `utm_*`, `fbclid`, `gclid` parameters.
- (b) Blocklisted domain rejected with descriptive error.
- (c) Allowlist-only mode rejects unlisted domains.
- (d) Private IP / SSRF URL rejected (e.g., `http://169.254.169.254/`, `http://localhost:3000/`).
- (e) `http://` URL rejected when `WEB_ALLOW_HTTP=false`.
- (f) HTML → Readability → Turndown pipeline produces clean Markdown from a fixture HTML file.
- (g) Content exceeding `WEB_MAX_CONTENT_CHARS` is truncated.
- (h) Search returns structured results (mock the HTTP layer).
Recommended Libraries — Implementation Should Be Thin Glue Code
The agent-side code should be < 50 lines per tool's `execute` function. All heavy lifting is delegated to proven, maintained libraries:
| Concern | Recommended Library | Why | Agent Code |
|---|---|---|---|
| Web Search (Option A — preferred if Task 2.8 is done) | Brave Search MCP Server via MCP bridge (Task 2.8) | Official MCP server, free tier (2000 queries/mo), web+local+news search. Zero agent-side search code — just MCP config. | Config only |
| Web Search (Option B) | Tavily API via direct HTTP or MCP remote (`mcp-remote https://mcp.tavily.com/mcp/`) | AI-optimized search results designed for agent use, generous free tier (1000 queries/mo). | ~20 lines |
| Web Search (Option C — simplest standalone) | Brave Search API via direct fetch to `https://api.search.brave.com/res/v1/web/search` | No MCP dependency. Simple API key + fetch wrapper. | ~30 lines |
| Tracking param removal | `tidy-url` | Maintained, 1500+ rules, handles `utm_*`, `fbclid`, `gclid`, `mc_eid`, plus domain-specific patterns. | 1 function call |
| Readable content extraction | `@mozilla/readability` + `jsdom` | Mozilla's battle-tested reader-mode algorithm (powers Firefox Reader View). Industry standard for RAG content pipelines. | ~10 lines |
| HTML → Markdown | `turndown` | Most popular HTML→Markdown converter. Configurable rules, plugin system, handles code blocks and tables. | ~5 lines |
| SSRF prevention | Manual `dns.resolve` + private-range check | Essential — the LLM generates arbitrary URLs. Must block requests to internal infrastructure. | ~20 lines |
Preferred Architecture: MCP-First for Search, Native for Fetch
┌──────────────────────────────────────────────────────────────┐
│ web-search tool │
│ ┌──────────────────┐ OR ┌────────────────────────────┐ │
│ │ MCP bridge │ │ Direct HTTP fetch to │ │
│ │ (Brave/Tavily │ │ Brave/Tavily API │ │
│ │ MCP server) │ │ (~30 lines glue) │ │
│ └────────┬─────────┘ └─────────────┬──────────────┘ │
│ └──────────────┬─────────────────┘ │
│ ▼ │
│ { title, url, snippet }[] │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ web-fetch tool │
│ │
│ URL ──▶ tidy-url ──▶ blocklist/allowlist ──▶ SSRF check │
│ │ │
│ ▼ │
│ fetch(url) ──▶ jsdom ──▶ @mozilla/readability │
│ │ │
│ ▼ │
│ turndown ──▶ Markdown │
│ │ │
│ truncate to limit │
└──────────────────────────────────────────────────────────────┘
New Dependencies
| Package | Purpose | Size Impact |
|---|---|---|
| `tidy-url` | Tracking param removal | ~50 KB (pure JS, zero native deps) |
| `@mozilla/readability` | Article content extraction | ~40 KB (zero deps) |
| `jsdom` | DOM implementation for Readability | ~2 MB (likely already in devDeps for tests) |
| `turndown` | HTML → Markdown | ~30 KB (zero deps) |
Total new production deps: 4. Search backend is either an MCP server (no agent deps) or a simple fetch call (no new deps).
Acceptance Criteria
- `web-search.execute({ query: "express.js middleware tutorial" })` returns an array of `{ title, url, snippet }` results with clean URLs (no tracking params).
- `web-fetch.execute({ url: "https://example.com/article?utm_source=twitter&fbclid=abc" })` fetches `https://example.com/article` (params stripped) and returns Markdown content.
- A URL pointing to `http://169.254.169.254/` (AWS metadata) or `http://localhost:3000/` is rejected with a descriptive error.
- A blocklisted domain returns an error, not an HTTP request.
- When `WEB_SEARCH_PROVIDER=none`, the search tool returns an error instructing the user to configure a provider.
- Fetched content is truncated to `WEB_MAX_CONTENT_CHARS` without breaking mid-word.
- The search tool works with at least one of: Brave MCP (via Task 2.8), Brave direct HTTP, or Tavily direct HTTP.
- All URL sanitization, blocklist, allowlist, and SSRF tests pass without network calls.
Test Requirements
- Unit tests for `web-utils.ts`: URL sanitization, blocklist/allowlist, SSRF protection, protocol enforcement (no network calls).
- Integration tests for the Readability + Turndown pipeline using fixture HTML files.
- Mock-HTTP tests for both `web-search` and `web-fetch` tools.
- One optional smoke test gated by `WEB_SEARCH_SMOKE=true` that hits a real search API (not in CI by default).
Guidelines
- Keep the tool files thin — delegate to `web-utils.ts` for URL handling and to libraries for content extraction. Each tool's `execute` function should be < 50 lines.
- `web-search` and `web-fetch` are separate tools so the LLM can search without fetching (cheaper/faster) or fetch a known URL without searching.
- Permission level is `"cautious"` (not `"dangerous"`) — no local filesystem or shell access, but network access warrants audit logging.
- Never inject raw HTML into the context window. All content passes through the Readability + Turndown pipeline.
- Follow the same patterns as existing tools (`shell.ts`, `file-edit.ts`, etc.) — export a `toolDefinition` constant.