
add-on websearch tool #36

@huberp

Description


Task 2.9: Secure Web Research Tool

Add-on to the Development Plan v2 (#5). Extends Phase 2 with web research capabilities not covered in the original plan.

  • Depends on: 1.6, 1.7, 2.8 (optional — MCP path)
  • Estimated effort: 2–3 days
  • Description: Implement two tools (web-search, web-fetch) that allow the agent to search the web and fetch/read web pages, returning clean Markdown content. The tools must enforce strict URL hygiene (tracking-parameter stripping, domain allowlist/blocklist, SSRF protection) and be built primarily as thin glue code over well-maintained libraries.

Motivation

The agent currently has no ability to look up external information — documentation, API references, Stack Overflow answers, changelogs, CVE details, etc. A web-research capability closes this gap and is critical for real-world software development tasks. This was identified as a missing piece in the Development Plan v2 (#5).


Steps

1. Create src/tools/web-search.ts

  • Implements ToolDefinition. Permission: "cautious".
  • Schema: { query: string, maxResults?: number } (default maxResults: 5).
  • Delegates the actual search to one of the configured backends (see Recommended Libraries below).
  • Returns: { results: [{ title, url, snippet }] } — URLs in results are cleaned (tracking params stripped).
  • When WEB_SEARCH_PROVIDER=none (default), returns an error message instructing the user to configure a provider.
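Step 1 could be sketched roughly as below. `ToolDefinition`, `appConfig`, `searchBackend`, and `cleanUrl` are hypothetical stand-ins for the project's real interfaces and helpers; rename them to match the patterns in the existing tool files.

```typescript
// Sketch of src/tools/web-search.ts (interfaces are assumptions, not the real ones).
type SearchResult = { title: string; url: string; snippet: string };

interface ToolDefinition<I, O> {
  name: string;
  permission: "cautious" | "dangerous";
  execute(input: I): Promise<O>;
}

const appConfig = { WEB_SEARCH_PROVIDER: process.env.WEB_SEARCH_PROVIDER ?? "none" };

// Stand-ins: the real versions live in web-utils.ts and the backend module.
async function searchBackend(query: string, max: number): Promise<SearchResult[]> {
  throw new Error("wire up the Brave/Tavily/MCP backend here");
}
function cleanUrl(url: string): string {
  return url; // real code delegates to tidy-url
}

export const webSearchTool: ToolDefinition<
  { query: string; maxResults?: number },
  { results: SearchResult[] } | { error: string }
> = {
  name: "web-search",
  permission: "cautious",
  async execute({ query, maxResults = 5 }) {
    if (appConfig.WEB_SEARCH_PROVIDER === "none") {
      return {
        error:
          "Web search is not configured. Set WEB_SEARCH_PROVIDER to 'brave', 'tavily', or 'mcp'.",
      };
    }
    const raw = await searchBackend(query, maxResults);
    // Strip tracking parameters from every result URL before returning.
    return { results: raw.map((r) => ({ ...r, url: cleanUrl(r.url) })) };
  },
};
```

The provider check comes first so the tool degrades into a helpful error instead of a failed network call.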

2. Create src/tools/web-fetch.ts

  • Implements ToolDefinition. Permission: "cautious".
  • Schema: { url: string, extractMode?: "readability" | "raw" } (default: "readability").
  • Pipeline: sanitize URL → blocklist/allowlist check → SSRF check → fetch → extract readable content → convert to Markdown → truncate.
  • Returns: { url, title, markdown, byline?, excerpt? }.

3. Create src/tools/web-utils.ts — URL Sanitization & Security Layer

| Concern | Implementation |
| --- | --- |
| Tracking parameter stripping | Use tidy-url to remove utm_*, fbclid, gclid, mc_eid, _ga, ref, and 1500+ other known tracker patterns automatically. One function call — the library maintains general + domain-specific rulesets. |
| Domain blocklist | Configurable WEB_DOMAIN_BLOCKLIST in appConfig. Default: common malware/phishing domains, localhost, internal hostnames. Reject any URL whose hostname matches. |
| Domain allowlist | Optional WEB_DOMAIN_ALLOWLIST — when non-empty, only listed domains are permitted. Useful for enterprise/locked-down environments. |
| Protocol enforcement | Only https: allowed by default. http: opt-in via WEB_ALLOW_HTTP=true. |
| SSRF protection | Reject URLs that resolve to private/loopback addresses (127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, ::1, 169.254.0.0/16, etc.) to prevent server-side request forgery. Use dns.resolve + private-range check before fetching. |
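A minimal sketch of this layer using only Node builtins. The real implementation would delegate the tracking-param rules to tidy-url; the small regex below is a dependency-free stand-in, and the private-range list is illustrative rather than exhaustive.

```typescript
// Sketch of src/tools/web-utils.ts (stand-in rules; real code uses tidy-url).
import { lookup } from "node:dns/promises";
import { isIP } from "node:net";

const TRACKING_PARAMS = /^(utm_|fbclid$|gclid$|mc_eid$|_ga$)/;

export function sanitizeUrl(raw: string): URL {
  const url = new URL(raw);
  if (url.protocol !== "https:") throw new Error(`Blocked protocol: ${url.protocol}`);
  for (const key of [...url.searchParams.keys()]) {
    if (TRACKING_PARAMS.test(key)) url.searchParams.delete(key);
  }
  return url;
}

// True for loopback, RFC 1918, link-local, and IPv6 loopback/ULA addresses.
export function isPrivateAddress(addr: string): boolean {
  if (isIP(addr) === 6) {
    return addr === "::1" || addr.startsWith("fe80:") ||
      addr.startsWith("fc") || addr.startsWith("fd");
  }
  const [a, b] = addr.split(".").map(Number);
  return (
    a === 127 || a === 10 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254)
  );
}

export async function assertNotSsrf(url: URL): Promise<void> {
  // Resolve the hostname and reject anything landing in a private range.
  const host = url.hostname;
  const addrs = isIP(host) ? [{ address: host }] : [await lookup(host)];
  for (const { address } of addrs) {
    if (isPrivateAddress(address)) {
      throw new Error(`Blocked private address: ${address} (${host})`);
    }
  }
}
```

Note that the DNS check must happen immediately before the fetch; resolving once and fetching later reopens a DNS-rebinding window.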

4. Content Extraction Pipeline (inside web-fetch.ts)

```
URL ──▶ tidy-url ──▶ blocklist/allowlist ──▶ SSRF check
                         │
                         ▼
fetch(url) ──▶ jsdom ──▶ @mozilla/readability
                         │
                         ▼
turndown ──▶ Markdown ──▶ truncate to limit
```

  • Fetch: Use Node built-in fetch() (Node 18+) with configurable timeout, User-Agent, and max response size (WEB_MAX_RESPONSE_BYTES, default: 5 MB).
  • Readability extraction: Use @mozilla/readability + jsdom to extract the main article content (strips nav, ads, sidebars, footers — like Firefox Reader View).
  • HTML → Markdown: Use turndown to convert cleaned HTML to Markdown. Configure to preserve code blocks, headings, links, lists, and tables.
  • Fallback: If Readability returns null, fall back to Turndown conversion of the full <body>.
  • Truncation: Truncate final Markdown to WEB_MAX_CONTENT_CHARS (default: 20000) without breaking mid-word.

5. Configuration — add to appConfig

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| WEB_SEARCH_PROVIDER | `"brave" \| "tavily" \| "mcp" \| "none"` | `"none"` | Search backend. Disabled until configured. |
| BRAVE_API_KEY | string | `""` | API key for Brave Search (free tier: 2000 queries/mo). |
| TAVILY_API_KEY | string | `""` | API key for Tavily Search (free tier: 1000 queries/mo). |
| WEB_DOMAIN_BLOCKLIST | string (comma-separated) | `""` | Domains to block (e.g., "malware.com,evil.org"). |
| WEB_DOMAIN_ALLOWLIST | string (comma-separated) | `""` | When non-empty, only these domains are allowed. |
| WEB_ALLOW_HTTP | boolean | false | Allow http:// URLs (insecure). |
| WEB_MAX_RESPONSE_BYTES | number | 5242880 (5 MB) | Max HTTP response body size. |
| WEB_MAX_CONTENT_CHARS | number | 20000 | Max Markdown output length. |
| WEB_USER_AGENT | string | `"AgentLoop/1.0"` | User-Agent header for fetch requests. |
| WEB_FETCH_TIMEOUT_MS | number | 15000 | Fetch timeout in milliseconds. |
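Assuming appConfig reads plain environment variables, the additions could look like this sketch; adapt it to however the project already parses configuration.

```typescript
// Hypothetical appConfig additions mirroring the table above.
export const webConfig = {
  WEB_SEARCH_PROVIDER: (process.env.WEB_SEARCH_PROVIDER ?? "none") as
    | "brave" | "tavily" | "mcp" | "none",
  BRAVE_API_KEY: process.env.BRAVE_API_KEY ?? "",
  TAVILY_API_KEY: process.env.TAVILY_API_KEY ?? "",
  // Comma-separated lists become string arrays; empty string means "unset".
  WEB_DOMAIN_BLOCKLIST: (process.env.WEB_DOMAIN_BLOCKLIST ?? "").split(",").filter(Boolean),
  WEB_DOMAIN_ALLOWLIST: (process.env.WEB_DOMAIN_ALLOWLIST ?? "").split(",").filter(Boolean),
  WEB_ALLOW_HTTP: process.env.WEB_ALLOW_HTTP === "true",
  WEB_MAX_RESPONSE_BYTES: Number(process.env.WEB_MAX_RESPONSE_BYTES ?? 5_242_880),
  WEB_MAX_CONTENT_CHARS: Number(process.env.WEB_MAX_CONTENT_CHARS ?? 20_000),
  WEB_USER_AGENT: process.env.WEB_USER_AGENT ?? "AgentLoop/1.0",
  WEB_FETCH_TIMEOUT_MS: Number(process.env.WEB_FETCH_TIMEOUT_MS ?? 15_000),
};
```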

6. Write tests

  • (a) URL sanitization strips utm_*, fbclid, gclid parameters.
  • (b) Blocklisted domain rejected with descriptive error.
  • (c) Allowlist-only mode rejects unlisted domains.
  • (d) Private IP / SSRF URL rejected (e.g., http://169.254.169.254/, http://localhost:3000/).
  • (e) http:// URL rejected when WEB_ALLOW_HTTP=false.
  • (f) HTML → Readability → Turndown pipeline produces clean Markdown from a fixture HTML file.
  • (g) Content exceeding WEB_MAX_CONTENT_CHARS is truncated.
  • (h) Search returns structured results (mock the HTTP layer).

Recommended Libraries — Implementation Should Be Thin Glue Code

The agent-side code should be < 50 lines per tool execute function. All heavy lifting is delegated to proven, maintained libraries:

| Concern | Recommended Library | Why | Agent Code |
| --- | --- | --- | --- |
| Web Search (Option A — preferred if Task 2.8 is done) | Brave Search MCP Server via MCP bridge (Task 2.8) | Official MCP server, free tier (2000 queries/mo), web+local+news search. Zero agent-side search code — just MCP config. | Config only |
| Web Search (Option B) | Tavily API via direct HTTP or MCP remote (`mcp-remote https://mcp.tavily.com/mcp/`) | AI-optimized search results designed for agent use, generous free tier (1000 queries/mo). | ~20 lines |
| Web Search (Option C — simplest standalone) | Brave Search API via direct fetch to `https://api.search.brave.com/res/v1/web/search` | No MCP dependency. Simple API key + fetch wrapper. | ~30 lines |
| Tracking param removal | tidy-url | Maintained, 1500+ rules, handles utm_*, fbclid, gclid, mc_eid, plus domain-specific patterns. | 1 function call |
| Readable content extraction | @mozilla/readability + jsdom | Mozilla's battle-tested reader-mode algorithm (powers Firefox Reader View). Industry standard for RAG content pipelines. | ~10 lines |
| HTML → Markdown | turndown | Most popular HTML→Markdown converter. Configurable rules, plugin system, handles code blocks and tables. | ~5 lines |
| SSRF prevention | Manual dns.resolve + private-range check | Essential — the LLM generates arbitrary URLs. Must block requests to internal infrastructure. | ~20 lines |
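Option C really can stay around 30 lines, as in the sketch below. The endpoint path, `X-Subscription-Token` header, and `web.results` response shape reflect Brave's public API documentation, but double-check field names against the current reference before relying on them.

```typescript
// Sketch of a direct Brave Search backend (Option C); response shape assumed
// from Brave's docs, verify against the live API.
type SearchResult = { title: string; url: string; snippet: string };

export async function braveSearch(
  query: string,
  maxResults: number,
  apiKey: string,
  timeoutMs = 15_000,
): Promise<SearchResult[]> {
  const endpoint = new URL("https://api.search.brave.com/res/v1/web/search");
  endpoint.searchParams.set("q", query);
  endpoint.searchParams.set("count", String(maxResults));

  const res = await fetch(endpoint, {
    headers: { "X-Subscription-Token": apiKey, Accept: "application/json" },
    signal: AbortSignal.timeout(timeoutMs), // built-in fetch timeout, Node 18+
  });
  if (!res.ok) throw new Error(`Brave Search failed: HTTP ${res.status}`);

  const body = (await res.json()) as {
    web?: { results?: { title: string; url: string; description: string }[] };
  };
  return (body.web?.results ?? []).slice(0, maxResults).map((r) => ({
    title: r.title,
    url: r.url,
    snippet: r.description,
  }));
}
```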

Preferred Architecture: MCP-First for Search, Native for Fetch

```
┌──────────────────────────────────────────────────────────┐
│                     web-search tool                      │
│ ┌──────────────────┐  OR  ┌────────────────────────────┐ │
│ │   MCP bridge     │      │ Direct HTTP fetch to       │ │
│ │  (Brave/Tavily   │      │ Brave/Tavily API           │ │
│ │   MCP server)    │      │ (~30 lines glue)           │ │
│ └────────┬─────────┘      └─────────────┬──────────────┘ │
│          └───────────────┬──────────────┘                │
│                          ▼                               │
│             { title, url, snippet }[]                    │
└──────────────────────────────────────────────────────────┘
```

```
┌──────────────────────────────────────────────────────────┐
│                      web-fetch tool                      │
│                                                          │
│  URL ──▶ tidy-url ──▶ blocklist/allowlist ──▶ SSRF check │
│                          │                               │
│                          ▼                               │
│     fetch(url) ──▶ jsdom ──▶ @mozilla/readability        │
│                          │                               │
│                          ▼                               │
│              turndown ──▶ Markdown                       │
│                          │                               │
│                          ▼                               │
│                   truncate to limit                      │
└──────────────────────────────────────────────────────────┘
```

New Dependencies

| Package | Purpose | Size Impact |
| --- | --- | --- |
| tidy-url | Tracking param removal | ~50 KB (pure JS, zero native deps) |
| @mozilla/readability | Article content extraction | ~40 KB (zero deps) |
| jsdom | DOM implementation for Readability | ~2 MB (likely already in devDeps for tests) |
| turndown | HTML → Markdown | ~30 KB (zero deps) |

Total new production deps: 4. Search backend is either an MCP server (no agent deps) or a simple fetch call (no new deps).


Acceptance Criteria

  • web-search.execute({ query: "express.js middleware tutorial" }) returns an array of { title, url, snippet } results with clean URLs (no tracking params).
  • web-fetch.execute({ url: "https://example.com/article?utm_source=twitter&fbclid=abc" }) fetches https://example.com/article (params stripped), returns Markdown content.
  • A URL pointing to http://169.254.169.254/ (AWS metadata) or http://localhost:3000/ is rejected with a descriptive error.
  • A blocklisted domain returns an error, not an HTTP request.
  • When WEB_SEARCH_PROVIDER=none, the search tool returns an error instructing the user to configure a provider.
  • Fetched content is truncated to WEB_MAX_CONTENT_CHARS without breaking mid-word.
  • The search tool works with at least one of: Brave MCP (via Task 2.8), Brave direct HTTP, or Tavily direct HTTP.
  • All URL sanitization, blocklist, allowlist, and SSRF tests pass without network calls.

Test Requirements

  • Unit tests for web-utils.ts: URL sanitization, blocklist/allowlist, SSRF protection, protocol enforcement (no network calls).
  • Integration tests for the Readability + Turndown pipeline using fixture HTML files.
  • Mock-HTTP tests for both web-search and web-fetch tools.
  • One optional smoke test gated by WEB_SEARCH_SMOKE=true that hits a real search API (not in CI by default).

Guidelines

  • Keep the tool files thin — delegate to web-utils.ts for URL handling and to libraries for content extraction. Each tool's execute function should be < 50 lines.
  • web-search and web-fetch are separate tools so the LLM can search without fetching (cheaper/faster) or fetch a known URL without searching.
  • Permission level is "cautious" (not "dangerous") — no local filesystem or shell access, but network access warrants audit logging.
  • Never inject raw HTML into the context window. All content passes through the Readability + Turndown pipeline.
  • Follow the same patterns as existing tools (shell.ts, file-edit.ts, etc.) — export a toolDefinition constant.
