
add-on websearch tool #36

@huberp

Description


Task 2.9: Secure Web Research Tool

Add-on to the Development Plan v2 (#5). Extends Phase 2 with web research capabilities not covered in the original plan.

  • Depends on: 1.6, 1.7, 2.8 (optional — MCP path)
  • Estimated effort: 2–3 days
  • Description: Implement two tools (web-search, web-fetch) that allow the agent to search the web and fetch/read web pages, returning clean Markdown content. The tools must enforce strict URL hygiene (tracking-parameter stripping, domain allowlist/blocklist, SSRF protection) and be built primarily as thin glue code over well-maintained libraries.

Motivation

The agent currently has no ability to look up external information — documentation, API references, Stack Overflow answers, changelogs, CVE details, etc. A web-research capability closes this gap and is critical for real-world software development tasks. This was identified as a missing piece in the Development Plan v2 (#5).


Steps

1. Create src/tools/web-search.ts

  • Implements ToolDefinition. Permission: "cautious".
  • Schema: { query: string, maxResults?: number } (default maxResults: 5).
  • Delegates the actual search to one of the configured backends (see Recommended Libraries below).
  • Returns: { results: [{ title, url, snippet }] } — URLs in results are cleaned (tracking params stripped).
  • When WEB_SEARCH_PROVIDER=none (default), returns an error message instructing the user to configure a provider.
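Step 1 could be sketched roughly as below. `ToolDefinition`, `appConfig`, `searchBackend`, and `cleanUrl` are hypothetical stand-ins for the project's real interfaces and helpers; rename them to match the patterns in the existing tool files.

```typescript
// Sketch of src/tools/web-search.ts (interfaces are assumptions, not the real ones).
type SearchResult = { title: string; url: string; snippet: string };

interface ToolDefinition<I, O> {
  name: string;
  permission: "cautious" | "dangerous";
  execute(input: I): Promise<O>;
}

const appConfig = { WEB_SEARCH_PROVIDER: process.env.WEB_SEARCH_PROVIDER ?? "none" };

// Stand-ins: the real versions live in web-utils.ts and the backend module.
async function searchBackend(query: string, max: number): Promise<SearchResult[]> {
  throw new Error("wire up the Brave/Tavily/MCP backend here");
}
function cleanUrl(url: string): string {
  return url; // real code delegates to tidy-url
}

export const webSearchTool: ToolDefinition<
  { query: string; maxResults?: number },
  { results: SearchResult[] } | { error: string }
> = {
  name: "web-search",
  permission: "cautious",
  async execute({ query, maxResults = 5 }) {
    if (appConfig.WEB_SEARCH_PROVIDER === "none") {
      return {
        error:
          "Web search is not configured. Set WEB_SEARCH_PROVIDER to 'brave', 'tavily', or 'mcp'.",
      };
    }
    const raw = await searchBackend(query, maxResults);
    // Strip tracking parameters from every result URL before returning.
    return { results: raw.map((r) => ({ ...r, url: cleanUrl(r.url) })) };
  },
};
```

The provider check comes first so the tool degrades into a helpful error instead of a failed network call.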

2. Create src/tools/web-fetch.ts

  • Implements ToolDefinition. Permission: "cautious".
  • Schema: { url: string, extractMode?: "readability" | "raw" } (default: "readability").
  • Pipeline: sanitize URL → blocklist/allowlist check → SSRF check → fetch → extract readable content → convert to Markdown → truncate.
  • Returns: { url, title, markdown, byline?, excerpt? }.

3. Create src/tools/web-utils.ts — URL Sanitization & Security Layer

| Concern | Implementation |
| --- | --- |
| Tracking parameter stripping | Use tidy-url to remove utm_*, fbclid, gclid, mc_eid, _ga, ref, and 1500+ other known tracker patterns automatically. One function call — the library maintains general + domain-specific rulesets. |
| Domain blocklist | Configurable WEB_DOMAIN_BLOCKLIST in appConfig. Default: common malware/phishing domains, localhost, internal hostnames. Reject any URL whose hostname matches. |
| Domain allowlist | Optional WEB_DOMAIN_ALLOWLIST — when non-empty, only listed domains are permitted. Useful for enterprise/locked-down environments. |
| Protocol enforcement | Only https: allowed by default. http: opt-in via WEB_ALLOW_HTTP=true. |
| SSRF protection | Reject URLs that resolve to private/loopback addresses (127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, ::1, 169.254.0.0/16, etc.) to prevent server-side request forgery. Use dns.resolve + private-range check before fetching. |
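A minimal sketch of this layer using only Node builtins. The real implementation would delegate the tracking-param rules to tidy-url; the small regex below is a dependency-free stand-in, and the private-range list is illustrative rather than exhaustive.

```typescript
// Sketch of src/tools/web-utils.ts (stand-in rules; real code uses tidy-url).
import { lookup } from "node:dns/promises";
import { isIP } from "node:net";

const TRACKING_PARAMS = /^(utm_|fbclid$|gclid$|mc_eid$|_ga$)/;

export function sanitizeUrl(raw: string): URL {
  const url = new URL(raw);
  if (url.protocol !== "https:") throw new Error(`Blocked protocol: ${url.protocol}`);
  for (const key of [...url.searchParams.keys()]) {
    if (TRACKING_PARAMS.test(key)) url.searchParams.delete(key);
  }
  return url;
}

// True for loopback, RFC 1918, link-local, and IPv6 loopback/ULA addresses.
export function isPrivateAddress(addr: string): boolean {
  if (isIP(addr) === 6) {
    return addr === "::1" || addr.startsWith("fe80:") ||
      addr.startsWith("fc") || addr.startsWith("fd");
  }
  const [a, b] = addr.split(".").map(Number);
  return (
    a === 127 || a === 10 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254)
  );
}

export async function assertNotSsrf(url: URL): Promise<void> {
  // Resolve the hostname and reject anything landing in a private range.
  const host = url.hostname;
  const addrs = isIP(host) ? [{ address: host }] : [await lookup(host)];
  for (const { address } of addrs) {
    if (isPrivateAddress(address)) {
      throw new Error(`Blocked private address: ${address} (${host})`);
    }
  }
}
```

Note that the DNS check must happen immediately before the fetch; resolving once and fetching later reopens a DNS-rebinding window.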

4. Content Extraction Pipeline (inside web-fetch.ts)

```
URL ──▶ tidy-url ──▶ blocklist/allowlist ──▶ SSRF check
                         │
                         ▼
fetch(url) ──▶ jsdom ──▶ @mozilla/readability
                         │
                         ▼
turndown ──▶ Markdown ──▶ truncate to limit
```

  • Fetch: Use Node built-in fetch() (Node 18+) with configurable timeout, User-Agent, and max response size (WEB_MAX_RESPONSE_BYTES, default: 5 MB).
  • Readability extraction: Use @mozilla/readability + jsdom to extract the main article content (strips nav, ads, sidebars, footers — like Firefox Reader View).
  • HTML → Markdown: Use turndown to convert cleaned HTML to Markdown. Configure to preserve code blocks, headings, links, lists, and tables.
  • Fallback: If Readability returns null, fall back to Turndown conversion of the full <body>.
  • Truncation: Truncate final Markdown to WEB_MAX_CONTENT_CHARS (default: 20000) without breaking mid-word.

5. Configuration — add to appConfig

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| WEB_SEARCH_PROVIDER | `"brave" \| "tavily" \| "mcp" \| "none"` | `"none"` | Search backend. Disabled until configured. |
| BRAVE_API_KEY | string | `""` | API key for Brave Search (free tier: 2000 queries/mo). |
| TAVILY_API_KEY | string | `""` | API key for Tavily Search (free tier: 1000 queries/mo). |
| WEB_DOMAIN_BLOCKLIST | string (comma-separated) | `""` | Domains to block (e.g., "malware.com,evil.org"). |
| WEB_DOMAIN_ALLOWLIST | string (comma-separated) | `""` | When non-empty, only these domains are allowed. |
| WEB_ALLOW_HTTP | boolean | false | Allow http:// URLs (insecure). |
| WEB_MAX_RESPONSE_BYTES | number | 5242880 (5 MB) | Max HTTP response body size. |
| WEB_MAX_CONTENT_CHARS | number | 20000 | Max Markdown output length. |
| WEB_USER_AGENT | string | `"AgentLoop/1.0"` | User-Agent header for fetch requests. |
| WEB_FETCH_TIMEOUT_MS | number | 15000 | Fetch timeout in milliseconds. |
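Assuming appConfig reads plain environment variables, the additions could look like this sketch; adapt it to however the project already parses configuration.

```typescript
// Hypothetical appConfig additions mirroring the table above.
export const webConfig = {
  WEB_SEARCH_PROVIDER: (process.env.WEB_SEARCH_PROVIDER ?? "none") as
    | "brave" | "tavily" | "mcp" | "none",
  BRAVE_API_KEY: process.env.BRAVE_API_KEY ?? "",
  TAVILY_API_KEY: process.env.TAVILY_API_KEY ?? "",
  // Comma-separated lists become string arrays; empty string means "unset".
  WEB_DOMAIN_BLOCKLIST: (process.env.WEB_DOMAIN_BLOCKLIST ?? "").split(",").filter(Boolean),
  WEB_DOMAIN_ALLOWLIST: (process.env.WEB_DOMAIN_ALLOWLIST ?? "").split(",").filter(Boolean),
  WEB_ALLOW_HTTP: process.env.WEB_ALLOW_HTTP === "true",
  WEB_MAX_RESPONSE_BYTES: Number(process.env.WEB_MAX_RESPONSE_BYTES ?? 5_242_880),
  WEB_MAX_CONTENT_CHARS: Number(process.env.WEB_MAX_CONTENT_CHARS ?? 20_000),
  WEB_USER_AGENT: process.env.WEB_USER_AGENT ?? "AgentLoop/1.0",
  WEB_FETCH_TIMEOUT_MS: Number(process.env.WEB_FETCH_TIMEOUT_MS ?? 15_000),
};
```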

6. Write tests

  • (a) URL sanitization strips utm_*, fbclid, gclid parameters.
  • (b) Blocklisted domain rejected with descriptive error.
  • (c) Allowlist-only mode rejects unlisted domains.
  • (d) Private IP / SSRF URL rejected (e.g., http://169.254.169.254/, http://localhost:3000/).
  • (e) http:// URL rejected when WEB_ALLOW_HTTP=false.
  • (f) HTML → Readability → Turndown pipeline produces clean Markdown from a fixture HTML file.
  • (g) Content exceeding WEB_MAX_CONTENT_CHARS is truncated.
  • (h) Search returns structured results (mock the HTTP layer).

Recommended Libraries — Implementation Should Be Thin Glue Code

The agent-side code should be < 50 lines per tool execute function. All heavy lifting is delegated to proven, maintained libraries:

| Concern | Recommended Library | Why | Agent Code |
| --- | --- | --- | --- |
| Web Search (Option A — preferred if Task 2.8 is done) | Brave Search MCP Server via MCP bridge (Task 2.8) | Official MCP server, free tier (2000 queries/mo), web+local+news search. Zero agent-side search code — just MCP config. | Config only |
| Web Search (Option B) | Tavily API via direct HTTP or MCP remote (`mcp-remote https://mcp.tavily.com/mcp/`) | AI-optimized search results designed for agent use, generous free tier (1000 queries/mo). | ~20 lines |
| Web Search (Option C — simplest standalone) | Brave Search API via direct fetch to `https://api.search.brave.com/res/v1/web/search` | No MCP dependency. Simple API key + fetch wrapper. | ~30 lines |
| Tracking param removal | tidy-url | Maintained, 1500+ rules, handles utm_*, fbclid, gclid, mc_eid, plus domain-specific patterns. | 1 function call |
| Readable content extraction | @mozilla/readability + jsdom | Mozilla's battle-tested reader-mode algorithm (powers Firefox Reader View). Industry standard for RAG content pipelines. | ~10 lines |
| HTML → Markdown | turndown | Most popular HTML→Markdown converter. Configurable rules, plugin system, handles code blocks and tables. | ~5 lines |
| SSRF prevention | Manual dns.resolve + private-range check | Essential — the LLM generates arbitrary URLs. Must block requests to internal infrastructure. | ~20 lines |
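Option C really can stay around 30 lines, as in the sketch below. The endpoint path, `X-Subscription-Token` header, and `web.results` response shape reflect Brave's public API documentation, but double-check field names against the current reference before relying on them.

```typescript
// Sketch of a direct Brave Search backend (Option C); response shape assumed
// from Brave's docs, verify against the live API.
type SearchResult = { title: string; url: string; snippet: string };

export async function braveSearch(
  query: string,
  maxResults: number,
  apiKey: string,
  timeoutMs = 15_000,
): Promise<SearchResult[]> {
  const endpoint = new URL("https://api.search.brave.com/res/v1/web/search");
  endpoint.searchParams.set("q", query);
  endpoint.searchParams.set("count", String(maxResults));

  const res = await fetch(endpoint, {
    headers: { "X-Subscription-Token": apiKey, Accept: "application/json" },
    signal: AbortSignal.timeout(timeoutMs), // built-in fetch timeout, Node 18+
  });
  if (!res.ok) throw new Error(`Brave Search failed: HTTP ${res.status}`);

  const body = (await res.json()) as {
    web?: { results?: { title: string; url: string; description: string }[] };
  };
  return (body.web?.results ?? []).slice(0, maxResults).map((r) => ({
    title: r.title,
    url: r.url,
    snippet: r.description,
  }));
}
```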

Preferred Architecture: MCP-First for Search, Native for Fetch

```
┌──────────────────────────────────────────────────────────┐
│                     web-search tool                      │
│ ┌──────────────────┐  OR  ┌────────────────────────────┐ │
│ │   MCP bridge     │      │ Direct HTTP fetch to       │ │
│ │  (Brave/Tavily   │      │ Brave/Tavily API           │ │
│ │   MCP server)    │      │ (~30 lines glue)           │ │
│ └────────┬─────────┘      └─────────────┬──────────────┘ │
│          └───────────────┬──────────────┘                │
│                          ▼                               │
│             { title, url, snippet }[]                    │
└──────────────────────────────────────────────────────────┘
```

```
┌──────────────────────────────────────────────────────────┐
│                      web-fetch tool                      │
│                                                          │
│  URL ──▶ tidy-url ──▶ blocklist/allowlist ──▶ SSRF check │
│                          │                               │
│                          ▼                               │
│     fetch(url) ──▶ jsdom ──▶ @mozilla/readability        │
│                          │                               │
│                          ▼                               │
│              turndown ──▶ Markdown                       │
│                          │                               │
│                          ▼                               │
│                   truncate to limit                      │
└──────────────────────────────────────────────────────────┘
```

New Dependencies

| Package | Purpose | Size Impact |
| --- | --- | --- |
| tidy-url | Tracking param removal | ~50 KB (pure JS, zero native deps) |
| @mozilla/readability | Article content extraction | ~40 KB (zero deps) |
| jsdom | DOM implementation for Readability | ~2 MB (likely already in devDeps for tests) |
| turndown | HTML → Markdown | ~30 KB (zero deps) |

Total new production deps: 4. Search backend is either an MCP server (no agent deps) or a simple fetch call (no new deps).


Acceptance Criteria

  • web-search.execute({ query: "express.js middleware tutorial" }) returns an array of { title, url, snippet } results with clean URLs (no tracking params).
  • web-fetch.execute({ url: "https://example.com/article?utm_source=twitter&fbclid=abc" }) fetches https://example.com/article (params stripped), returns Markdown content.
  • A URL pointing to http://169.254.169.254/ (AWS metadata) or http://localhost:3000/ is rejected with a descriptive error.
  • A blocklisted domain returns an error, not an HTTP request.
  • When WEB_SEARCH_PROVIDER=none, the search tool returns an error instructing the user to configure a provider.
  • Fetched content is truncated to WEB_MAX_CONTENT_CHARS without breaking mid-word.
  • The search tool works with at least one of: Brave MCP (via Task 2.8), Brave direct HTTP, or Tavily direct HTTP.
  • All URL sanitization, blocklist, allowlist, and SSRF tests pass without network calls.

Test Requirements

  • Unit tests for web-utils.ts: URL sanitization, blocklist/allowlist, SSRF protection, protocol enforcement (no network calls).
  • Integration tests for the Readability + Turndown pipeline using fixture HTML files.
  • Mock-HTTP tests for both web-search and web-fetch tools.
  • One optional smoke test gated by WEB_SEARCH_SMOKE=true that hits a real search API (not in CI by default).

Guidelines

  • Keep the tool files thin — delegate to web-utils.ts for URL handling and to libraries for content extraction. Each tool's execute function should be < 50 lines.
  • web-search and web-fetch are separate tools so the LLM can search without fetching (cheaper/faster) or fetch a known URL without searching.
  • Permission level is "cautious" (not "dangerous") — no local filesystem or shell access, but network access warrants audit logging.
  • Never inject raw HTML into the context window. All content passes through the Readability + Turndown pipeline.
  • Follow the same patterns as existing tools (shell.ts, file-edit.ts, etc.) — export a toolDefinition constant.
