Security: add URL allowlist for web_search and web_fetch #19042

Closed
smartprogrammer93 wants to merge 13 commits into openclaw:main from
smartprogrammer93:feat/web-tools-url-allowlist

Conversation

@smartprogrammer93 (Contributor) commented Feb 17, 2026

Summary

  • Problem: web_search and web_fetch had no mechanism to restrict which external domains the agent could reach, making it impossible to run a network-isolated or domain-scoped agent.
  • Why it matters: Operators deploying OpenClaw in restricted environments (corporate proxies, research sandboxes, prompt-injection hardening) need a declarative allowlist enforced at the tool layer — not just at the network layer.
  • What changed: Added an optional tools.web.urlAllowlist config field. When set, both web_search and web_fetch enforce it. When unset, all URLs are allowed (fully backwards compatible).
  • What did NOT change: Fetch/search behaviour, caching semantics, provider selection, or any auth flow.

URL matching rules

| Pattern | Matches |
| --- | --- |
| `example.com` | exact domain only |
| `*.github.com` | all subdomains (including deeply nested) |
| `*` / `*.` | rejected at config parse time |
| `localhost`, `*.localhost`, `*.local`, `*.internal` | rejected at config parse time (the SSRF guard blocks these unconditionally) |
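The matching rules above can be sketched as follows. The PR exports a real matcher (`matchesHostnameAllowlist` in ssrf.ts) whose exact signature is not shown in this description, so `matchesAllowlist` here is a hypothetical stand-in:

```typescript
// Sketch of the matching rules: exact match for bare domains, suffix match
// for "*." wildcards (any subdomain depth, but never the bare apex domain).
function matchesAllowlist(hostname: string, patterns: string[]): boolean {
  const host = hostname.toLowerCase();
  return patterns.some((pattern) => {
    const p = pattern.toLowerCase();
    if (p.startsWith("*.")) {
      const suffix = p.slice(1); // "*.github.com" -> ".github.com"
      // "a.b.github.com" matches; bare "github.com" does not.
      return host.endsWith(suffix) && host.length > suffix.length;
    }
    return host === p; // exact domain only
  });
}
```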

Example config

tools:
  web:
    urlAllowlist:
      - "example.com"
      - "*.github.com"
      - "docs.openclaw.ai"

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

User-visible / Behavior Changes

  • New optional config field: tools.web.urlAllowlist: string[]
  • When configured: web_fetch blocks requests (and redirect targets) not matching the allowlist, returning a structured { error: "allowlist_blocked" } tool result. web_search filters results, citations, and inlineCitations from all providers (Brave, Perplexity, Grok, Kimi, Gemini) post-cache so unfiltered data is stored and re-filtered on every read.
  • When not configured: no change in behaviour.
  • SSRF-blocked hostnames (localhost, localhost.localdomain, metadata.google.internal) and wildcard patterns like *.localhost, *.local, *.internal are rejected at config parse time with a clear error message, since the SSRF guard blocks them unconditionally at the network level.
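The parse-time rejection described above can be sketched as a plain validator. The real checks live in the Zod schema in src/config/zod-schema.agent-runtime.ts; the function name and error strings here are illustrative only:

```typescript
// Hypothetical sketch of the config-parse-time validation for allowlist
// entries. Returns an error message, or null when the pattern is acceptable.
const SSRF_BLOCKED = new Set([
  "localhost",
  "localhost.localdomain",
  "metadata.google.internal",
]);
const BLOCKED_WILDCARDS = new Set(["*.localhost", "*.local", "*.internal"]);

function validateAllowlistEntry(pattern: string): string | null {
  if (pattern.includes("://") || pattern.includes("/")) {
    return `"${pattern}": use a bare domain like example.com, not a URL with protocol or path`;
  }
  if (pattern === "*" || pattern === "*.") {
    return `"${pattern}": wildcard-only patterns are not allowed`;
  }
  if (SSRF_BLOCKED.has(pattern) || BLOCKED_WILDCARDS.has(pattern)) {
    return `"${pattern}": the SSRF guard blocks this hostname unconditionally, so allowlisting it would never take effect`;
  }
  return null;
}
```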

Security Impact (required)

  • New permissions/capabilities? No — this is a restriction mechanism, not a capability expansion.
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? Yes. urlAllowlist narrows the set of URLs reachable by web_fetch and visible via web_search. Risk: none (additive restriction). Mitigation: config is opt-in and defaults to unrestricted.

Implementation details

web_fetch:

  • Allowlist check before fetch (returns structured error immediately).
  • Second check on finalUrl after redirect resolution — throws AllowlistBlockedError (new typed error exported from ssrf.ts) which is caught in execute and returned as { error: "allowlist_blocked" }. The redirect check is post-connection by design: the SSRF guard already validates each redirect hop's DNS at the network level, so the allowlist check is a content-policy gate only.
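The two-stage check can be sketched as below. `AllowlistBlockedError` mirrors the typed error the PR exports from ssrf.ts; `fetchPage` and `isAllowed` are hypothetical stand-ins for the real fetch pipeline and matcher, since their signatures are not shown in this description:

```typescript
// Sketch of web_fetch's two-stage allowlist check: a pre-fetch check on the
// requested URL, then a second check on the resolved finalUrl after redirects.
class AllowlistBlockedError extends Error {
  constructor(public readonly url: string) {
    super(`URL blocked by urlAllowlist: ${url}`);
    this.name = "AllowlistBlockedError";
  }
}

async function guardedFetch(
  url: string,
  allowlist: string[] | undefined,
  fetchPage: (url: string) => Promise<{ finalUrl: string; body: string }>,
  isAllowed: (url: string, allowlist: string[]) => boolean,
): Promise<{ body: string } | { error: "allowlist_blocked" }> {
  // Stage 1: block before any network activity.
  if (allowlist && !isAllowed(url, allowlist)) {
    return { error: "allowlist_blocked" };
  }
  try {
    const page = await fetchPage(url);
    // Stage 2: redirects may land on a different host; re-check finalUrl.
    if (allowlist && !isAllowed(page.finalUrl, allowlist)) {
      throw new AllowlistBlockedError(page.finalUrl);
    }
    return { body: page.body };
  } catch (err) {
    // The typed error is caught in execute and surfaced as a structured
    // tool result, giving both stages the same error shape.
    if (err instanceof AllowlistBlockedError) {
      return { error: "allowlist_blocked" };
    }
    throw err;
  }
}
```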

web_search:

  • applyUrlAllowlistToPayload filters all three citation shapes returned by LLM-based search providers:
    • results: Array<{url}> — Brave, Perplexity-sonar
    • citations: string[] — Perplexity-chat, Grok, Kimi, Gemini
    • inlineCitations: Array<{url}> — Grok inline citations
  • Cache stores unfiltered results; allowlist applied on read so config changes take effect without cache invalidation.
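The payload filtering across the three citation shapes can be sketched as follows. The real helper is `applyUrlAllowlistToPayload`; its exact signature is not shown here, so `filterPayload` and the `SearchPayload` shape are assumptions. Note that `citations` and `inlineCitations` are mapped to a placeholder rather than filtered, so positional [N] references in the response text stay aligned:

```typescript
// Sketch of post-read allowlist filtering over the three citation shapes
// listed above. results is filtered; citations/inlineCitations are replaced
// with a placeholder so [N] indices in the content remain valid.
interface SearchPayload {
  results?: Array<{ url: string }>;        // Brave, Perplexity-sonar
  citations?: string[];                    // Perplexity-chat, Grok, Kimi, Gemini
  inlineCitations?: Array<{ url: string }>; // Grok inline citations
}

const PLACEHOLDER = "[blocked by urlAllowlist]";

function filterPayload(
  payload: SearchPayload,
  isAllowed: (url: string) => boolean,
): SearchPayload {
  return {
    results: payload.results?.filter((r) => isAllowed(r.url)),
    citations: payload.citations?.map((u) => (isAllowed(u) ? u : PLACEHOLDER)),
    inlineCitations: payload.inlineCitations?.map((c) =>
      isAllowed(c.url) ? c : { url: PLACEHOLDER },
    ),
  };
}
```

Because this runs on every read, a cache hit is re-filtered against the current config, which is why the cache can safely store unfiltered results.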

Repro + Verification

Environment

  • OS: Linux (Ubuntu)
  • Runtime: Node 22
  • Model/provider: any
  • Relevant config: tools.web.urlAllowlist: ["example.com"]

Steps

  1. Set tools.web.urlAllowlist: ["example.com"] in config
  2. Ask agent to fetch https://evil.com via web_fetch
  3. Ask agent to search for something and observe filtered results

Expected

  • Step 2: fetch blocked with allowlist error
  • Step 3: only example.com URLs in results/citations

Actual

  • Matches expected

Evidence

  • 228-line test suite (web-tools.url-allowlist.test.ts) covering all path combinations, importing directly from production exports
  • AllowlistBlockedError smoke test
  • Citations filtering smoke test for LLM-provider payloads

Human Verification (required)

  • Verified scenarios: allowlist enforced on direct fetch, redirect targets, search results (results array), and LLM-provider citation arrays.
  • Edge cases checked: empty allowlist (no-op), wildcard patterns, invalid URLs in results, redirect to non-allowlisted domain.
  • What I did not verify: live Firecrawl fallback bypass (tested via code path analysis and unit tests).

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes — field is optional, defaults to unrestricted.
  • Config/env changes? No — new optional field only.
  • Migration needed? No

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: remove or leave tools.web.urlAllowlist unset in config.
  • Files/config to restore: tools.web.urlAllowlist key in openclaw.json / openclaw.yaml.
  • Known bad symptoms: agent reports URL blocked unexpectedly → check allowlist patterns.

Risks and Mitigations

  • Risk: allowlist silently misconfigured (e.g. https://example.com instead of example.com).
    • Mitigation: Zod schema rejects URLs with protocol/path at parse time with a descriptive error. Patterns like * and *. are also rejected.
  • Risk: SSRF-blocked hostnames in allowlist causing confusing errors.
    • Mitigation: Schema rejects localhost, localhost.localdomain, metadata.google.internal, and wildcard *.localhost/*.local/*.internal patterns with a clear error message explaining why.

@openclaw-barnacle Bot added labels agents (Agent runtime and tooling) and size: M on Feb 17, 2026
@alaindimabuyo

@greptileai please review


@greptile-apps greptile-apps Bot left a comment


10 files reviewed, 7 comments


Comment thread src/agents/tools/web-fetch.ts Outdated
@greptile-apps Bot commented Feb 17, 2026

Additional Comments (1)

src/agents/tools/web-search.ts
Allowlist filtering bypassed by module-level cache

The SEARCH_CACHE (line 36) is a module-level Map, but the cache key (lines 609–614) does not incorporate the urlAllowlist. Filtered results are written to the cache at line 748. If the same process hosts multiple tool instances with different allowlist configs (or if the allowlist is changed across tool re-creations without restarting), a cache hit at line 617 will return previously-filtered (or unfiltered) results, bypassing the current allowlist.

In practice this is unlikely since the config is typically stable for the lifetime of a process, but it could be surprising in multi-agent setups with per-agent configs. Consider either:

  • Including a hash of the allowlist in the cache key, or
  • Applying filterResultsByAllowlist to cache hits as well
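The first option can be sketched as below. The names (`SEARCH_CACHE`, `makeCacheKey`) follow the review comment, not the actual source, and the PR ultimately addressed this differently, by re-filtering on every cache read:

```typescript
import { createHash } from "node:crypto";

// Sketch of folding the allowlist into the cache key so tool instances with
// different allowlist configs in one process never share cache entries.
const SEARCH_CACHE = new Map<string, unknown>();

function makeCacheKey(
  query: string,
  provider: string,
  urlAllowlist?: string[],
): string {
  // Sort before hashing so pattern order doesn't fragment the cache.
  const allowlistHash = urlAllowlist
    ? createHash("sha256")
        .update([...urlAllowlist].sort().join("\n"))
        .digest("hex")
        .slice(0, 12)
    : "none";
  return `${provider}:${allowlistHash}:${query}`;
}
```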

@smartprogrammer93 force-pushed the feat/web-tools-url-allowlist branch from af37732 to 1eaf383 on February 17, 2026 08:37
@smartprogrammer93
Contributor Author

@greptileai please review


@greptile-apps greptile-apps Bot left a comment


8 files reviewed, 2 comments


Comment thread src/agents/tools/web-search.ts Outdated
Comment thread src/agents/tools/web-search.ts Outdated
@smartprogrammer93
Contributor Author

@greptileai please review


@greptile-apps greptile-apps Bot left a comment


8 files reviewed, 1 comment


Comment thread src/agents/tools/web-fetch.ts Outdated
@smartprogrammer93
Contributor Author

@greptileai please review

1 similar comment
@smartprogrammer93
Contributor Author

@greptileai please review


@greptile-apps greptile-apps Bot left a comment


9 files reviewed, 1 comment


Comment thread src/config/zod-schema.agent-runtime.ts Outdated
@smartprogrammer93
Contributor Author

@greptileai please review

@smartprogrammer93
Contributor Author

@greptileai please review

@smartprogrammer93
Contributor Author

ready for review and merge

@smartprogrammer93 changed the title from "feat(tools): add URL allowlist for web_search and web_fetch" to "Security: add URL allowlist for web_search and web_fetch" on Feb 19, 2026
@smartprogrammer93
Contributor Author

smartprogrammer93 commented Feb 22, 2026

Hey,
@steipete @sebslight
This PR replaces #18584 that was reverted. Please consider.

Apologies for the ping; I don't usually ping maintainers, but this PR has been ignored for 3 releases and was reverted, so I felt it has some special circumstances.

@smartprogrammer93 force-pushed the feat/web-tools-url-allowlist branch 2 times, most recently from dcd5714 to ebd1c23 on February 28, 2026 00:04
@smartprogrammer93
Contributor Author

@greptileai please review

@greptile-apps

greptile-apps Bot commented Feb 28, 2026

Greptile Summary

This PR adds an optional tools.web.urlAllowlist config field that restricts which external domains web_fetch and web_search can reach. When unset, all existing behaviour is unchanged (fully backwards-compatible). The implementation is well-structured and all major concerns from prior review rounds have been addressed in this revision.

Key changes:

  • web_fetch blocks non-allowlisted URLs before the fetch and before following each redirect (pre-connection, via a new onRedirectUrl hook in fetchWithSsrFGuard). AllowlistBlockedError is re-thrown in the Firecrawl-fallback catch block, preventing the bypass that existed in an earlier revision.
  • Both the initial-URL block and the redirect block now return the same { error: "allowlist_blocked" } structured tool result, giving the agent a consistent error surface.
  • web_search applies applyUrlAllowlistToPayload post-execute (after resolved.definition.execute(args)), so cache hits are re-filtered on every read rather than storing pre-filtered results — the previously flagged caching bug is resolved.
  • All three citation shapes are covered: results (Brave/Perplexity-sonar) is filtered; citations and inlineCitations (Grok/Kimi/Gemini/Perplexity-chat) use map() + "[blocked by urlAllowlist]" placeholder to preserve [N] positional index alignment in content.
  • Zod schema rejects SSRF-blocked patterns (*.localhost, *.local, *.internal, bare localhost, IP literals) at config parse time with clear error messages.
  • Duplicate resolver functions consolidated into a single resolveUrlAllowlist in web-shared.ts.
  • Tests now import directly from production exports, eliminating the prior drift risk.
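The pre-connection redirect check noted above can be sketched as below. The real hook is `onRedirectUrl` in `fetchWithSsrFGuard`; `followRedirects` and the `head` callback are hypothetical stand-ins showing where the hook fires relative to the network call:

```typescript
// Sketch of a pre-connection redirect hook: the callback runs on each
// redirect target BEFORE the next request is issued, so a throwing hook
// (e.g. the allowlist check) blocks the hop without ever connecting.
async function followRedirects(
  startUrl: string,
  head: (url: string) => Promise<{ status: number; location?: string }>,
  onRedirectUrl: (nextUrl: string) => void, // throws to abort the hop
  maxHops = 5,
): Promise<string> {
  let url = startUrl;
  for (let hop = 0; hop < maxHops; hop++) {
    const res = await head(url);
    if (res.status < 300 || res.status >= 400 || !res.location) {
      return url; // not a redirect: this is the final URL
    }
    const next = new URL(res.location, url).toString();
    onRedirectUrl(next); // policy check fires here, pre-connection
    url = next;
  }
  throw new Error("too many redirects");
}
```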

Confidence Score: 5/5

  • This PR is safe to merge — it is a backward-compatible, opt-in restriction mechanism with no capability expansion.
  • All critical issues identified in prior review rounds (Firecrawl bypass, redirect-target bypass, LLM-provider citation leakage, pre-filtered caching, inconsistent error surface, import placement, test drift) have been addressed. The implementation correctly handles the pre-connection redirect check, citation index alignment, and schema validation edge cases. No new bugs were found during this review pass.
  • No files require special attention.

Last reviewed commit: "fix: pre-connection ..."

Comment thread src/config/schema.help.ts Outdated
Comment thread src/config/schema.labels.ts Outdated
Comment thread src/agents/tools/web-fetch.ts Outdated
Comment thread src/config/zod-schema.agent-runtime.ts Outdated
@smartprogrammer93
Contributor Author

@greptileai please review

1 similar comment
@smartprogrammer93
Copy link
Copy Markdown
Contributor Author

@greptileai please review

smartprogrammer93 and others added 12 commits March 22, 2026 11:12
Add optional urlAllowlist config at tools.web level that restricts which
URLs can be accessed by web tools:

- Config types: Add urlAllowlist?: string[] to tools.web
- Zod schema: Add urlAllowlist field with domain pattern validation
- Schema help: Add help text for new config fields
- web_search: Filter Brave search results by allowlist
- web_fetch: Block URLs not matching allowlist before fetching and on redirects
- ssrf.ts: Export normalizeHostnameAllowlist and matchesHostnameAllowlist
- web-shared.ts: Export resolveUrlAllowlist shared utility

URL matching supports exact domain match and wildcard patterns (*.github.com).
Single-label domains like localhost are also supported.
When urlAllowlist is not configured, all URLs are allowed (backwards compatible).

Tests: Add web-tools.url-allowlist.test.ts with 32 tests
…ing as blocked"

This reverts commit eebeb98e3378843f5b9dc5e4b742e3a168132e80.
@smartprogrammer93 force-pushed the feat/web-tools-url-allowlist branch from 558c9e5 to 4b159d2 on March 22, 2026 08:13
@smartprogrammer93
Contributor Author

@greptileai please review

@smartprogrammer93
Contributor Author

@greptileai please review

@openclaw-barnacle

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle

Closing due to inactivity.
If you believe this PR should be revived, post in #pr-thunderdome-dangerzone on Discord to talk to a maintainer.
That channel is the escape hatch for high-quality PRs that get auto-closed.

@dennisvanderpool

I solved it for myself using Pipelock
https://github.com/luckyPipewrench/pipelock


Labels

agents (Agent runtime and tooling), docs (Improvements or additions to documentation), size: L, stale (Marked as stale due to inactivity)
