Security: add URL allowlist for web_search and web_fetch #19042
smartprogrammer93 wants to merge 13 commits into openclaw:main
Conversation
@greptileai please review
Additional Comments (1)
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/tools/web-search.ts
Line: 609:619
Comment:
**Allowlist filtering bypassed by module-level cache**
The `SEARCH_CACHE` (line 36) is a module-level `Map`, but the cache key (lines 609–614) does not incorporate the `urlAllowlist`. Filtered results are written to the cache at line 748. If the same process hosts multiple tool instances with different allowlist configs (or if the allowlist is changed across tool re-creations without restarting), a cache hit at line 617 will return previously-filtered (or unfiltered) results, bypassing the current allowlist.
In practice this is unlikely since the config is typically stable for the lifetime of a process, but it could be surprising in multi-agent setups with per-agent configs. Consider either:
- Including a hash of the allowlist in the cache key, or
- Applying `filterResultsByAllowlist` to cache hits as well
How can I resolve this? If you propose a fix, please make it concise.
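A minimal self-contained sketch of both suggestions. Only `SEARCH_CACHE` and `filterResultsByAllowlist` are names from this file; the payload shape, key format, and helper names are illustrative:

```typescript
// Sketch only: SearchPayload is simplified, and the exact-match filter stands
// in for the real allowlist matching exported elsewhere.
type SearchPayload = { results: Array<{ url: string }> };

const SEARCH_CACHE = new Map<string, SearchPayload>();

function filterResultsByAllowlist(
  payload: SearchPayload,
  allowlist: string[] | undefined,
): SearchPayload {
  if (!allowlist?.length) return payload; // unset => allow everything
  return {
    results: payload.results.filter((r) =>
      allowlist.includes(new URL(r.url).hostname),
    ),
  };
}

// Option 1: fold the allowlist into the cache key, so results filtered under
// one config are never served under another.
function cacheKeyFor(query: string, allowlist: string[] | undefined): string {
  return `${query}::${JSON.stringify(allowlist ?? [])}`;
}

// Option 2: store UNFILTERED payloads and re-filter on every hit, so a stale
// entry can never bypass the allowlist active on this tool instance.
function getCached(
  query: string,
  allowlist: string[] | undefined,
): SearchPayload | undefined {
  const hit = SEARCH_CACHE.get(query);
  return hit ? filterResultsByAllowlist(hit, allowlist) : undefined;
}
```

Per the summary further down, the PR ultimately took the second route: `web_search` filters post-cache, so unfiltered data is stored and re-filtered on every read.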
Force-pushed from af37732 to 1eaf383
@greptileai please review

6 similar comments
ready for review and merge
Hey, apologies for the ping. I don't usually ping maintainers, but this PR has been ignored for 3 releases and it was reverted, so I felt it was a special circumstance.
Force-pushed from dcd5714 to ebd1c23
@greptileai please review
Greptile Summary: This PR adds an optional `tools.web.urlAllowlist` config field that restricts which URLs `web_search` and `web_fetch` can reach.
Confidence Score: 5/5
Last reviewed commit: "fix: pre-connection ..."
@greptileai please review

2 similar comments
Add optional urlAllowlist config at tools.web level that restricts which URLs can be accessed by web tools:
- Config types: Add urlAllowlist?: string[] to tools.web
- Zod schema: Add urlAllowlist field with domain pattern validation
- Schema help: Add help text for new config fields
- web_search: Filter Brave search results by allowlist
- web_fetch: Block URLs not matching allowlist before fetching and on redirects
- ssrf.ts: Export normalizeHostnameAllowlist and matchesHostnameAllowlist
- web-shared.ts: Export resolveUrlAllowlist shared utility

URL matching supports exact domain match and wildcard patterns (*.github.com). Single-label domains like localhost are also supported. When urlAllowlist is not configured, all URLs are allowed (backwards compatible).

Tests: Add web-tools.url-allowlist.test.ts with 32 tests
…locked hostnames in schema
…ing as blocked"
This reverts commit eebeb98e3378843f5b9dc5e4b742e3a168132e80.
…dError for Firecrawl bypass
Force-pushed from 558c9e5 to 4b159d2
@greptileai please review

1 similar comment
This pull request has been automatically marked as stale due to inactivity.

Closing due to inactivity.
I solved it for myself using Pipelock
Summary
- `web_search` and `web_fetch` had no mechanism to restrict which external domains the agent could reach, making it impossible to run a network-isolated or domain-scoped agent.
- This PR adds an optional `tools.web.urlAllowlist` config field. When set, both `web_search` and `web_fetch` enforce it. When unset, all URLs are allowed (fully backwards compatible).

URL matching rules
- `example.com` — exact domain match
- `*.github.com` — wildcard match for any subdomain
- Bare `*` and `*.` are rejected at config parse time
- `localhost`, `*.localhost`, `*.local`, `*.internal` are rejected, since the SSRF guard blocks them unconditionally

Example config
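A minimal example for `openclaw.json`, assuming the nesting implied by the `tools.web.urlAllowlist` path (the patterns are the ones from the rules above):

```json
{
  "tools": {
    "web": {
      "urlAllowlist": ["example.com", "*.github.com"]
    }
  }
}
```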
Change Type (select all)
Scope (select all touched areas)
User-visible / Behavior Changes
- New config field: `tools.web.urlAllowlist: string[]`
- `web_fetch` blocks requests (and redirect targets) not matching the allowlist, returning a structured `{ error: "allowlist_blocked" }` tool result.
- `web_search` filters `results`, `citations`, and `inlineCitations` from all providers (Brave, Perplexity, Grok, Kimi, Gemini) post-cache, so unfiltered data is stored and re-filtered on every read.
- SSRF-guard-blocked hostnames (e.g. `localhost`, `localhost.localdomain`, `metadata.google.internal`) and wildcard patterns like `*.localhost`, `*.local`, `*.internal` are rejected at config parse time with a clear error message, since the SSRF guard blocks them unconditionally at the network level.

Security Impact (required)
- `urlAllowlist` narrows the set of URLs reachable by `web_fetch` and visible via `web_search`. Risk: none (additive restriction). Mitigation: config is opt-in and defaults to unrestricted.

Implementation details
- `web_fetch`: the allowlist is enforced on `finalUrl` after redirect resolution — throws `AllowlistBlockedError` (a new typed error exported from `ssrf.ts`), which is caught in `execute` and returned as `{ error: "allowlist_blocked" }`. The redirect check is post-connection by design: the SSRF guard already validates each redirect hop's DNS at the network level, so the allowlist check is a content-policy gate only.
- `web_search`: `applyUrlAllowlistToPayload` filters all three citation shapes returned by LLM-based search providers (see the sketch below):
  - `results: Array<{url}>` — Brave, Perplexity-sonar
  - `citations: string[]` — Perplexity-chat, Grok, Kimi, Gemini
  - `inlineCitations: Array<{url}>` — Grok inline citations
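A condensed sketch of that filtering step. The three payload fields and provider notes come from this PR; `matchesAllowlist` stands in for the hostname-matching logic exported from `ssrf.ts`, and the types are simplified:

```typescript
// Sketch only: simplified payload shapes; matchesAllowlist is a stand-in.
interface SearchPayload {
  results?: Array<{ url: string }>;         // Brave, Perplexity-sonar
  citations?: string[];                     // Perplexity-chat, Grok, Kimi, Gemini
  inlineCitations?: Array<{ url: string }>; // Grok inline citations
}

function matchesAllowlist(hostname: string, allowlist: string[]): boolean {
  return allowlist.some((pattern) =>
    pattern.startsWith("*.")
      ? hostname.endsWith(pattern.slice(1)) // "*.github.com" => any subdomain
      : hostname === pattern,               // "example.com"  => exact match
  );
}

function applyUrlAllowlistToPayload(
  payload: SearchPayload,
  allowlist: string[] | undefined,
): SearchPayload {
  if (!allowlist?.length) return payload; // unset => allow all (backwards compatible)
  const allowed = (url: string): boolean => {
    try {
      return matchesAllowlist(new URL(url).hostname, allowlist);
    } catch {
      return false; // malformed URL: never let it through the filter
    }
  };
  return {
    ...payload,
    results: payload.results?.filter((r) => allowed(r.url)),
    citations: payload.citations?.filter(allowed),
    inlineCitations: payload.inlineCitations?.filter((c) => allowed(c.url)),
  };
}
```

Because this runs post-cache, the same function can be applied to cache hits, which is what closes the cache-bypass concern raised in review.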
Repro + Verification

Environment
tools.web.urlAllowlist: ["example.com"]Steps
tools.web.urlAllowlist: ["example.com"]in confighttps://evil.comviaweb_fetchExpected
- `web_fetch` returns `{ error: "allowlist_blocked" }` for `https://evil.com`.
- `web_search` returns only `example.com` URLs in results/citations.

Actual
Evidence
- 32 tests (`web-tools.url-allowlist.test.ts`) covering all path combinations, importing directly from production exports
- `AllowlistBlockedError` smoke test

Human Verification (required)
Review Conversations
Compatibility / Migration
Failure Recovery (if this breaks)
- The feature is inert with `tools.web.urlAllowlist` unset in config.
- To recover, remove the `tools.web.urlAllowlist` key in `openclaw.json`/`openclaw.yaml`.

Risks and Mitigations
- Users may enter full URLs instead of bare domains (e.g. `https://example.com` instead of `example.com`); these are rejected at config parse time. Bare `*` and `*.` are also rejected.
- Config validation rejects `localhost`, `localhost.localdomain`, `metadata.google.internal`, and wildcard `*.localhost`/`*.local`/`*.internal` patterns with a clear error message explaining why.
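A sketch of the parse-time validation behind these mitigations. The blocked host and wildcard lists mirror the description above; the function shape and error strings are illustrative, not the PR's actual Zod wiring:

```typescript
// Sketch of parse-time validation for allowlist entries.
const SSRF_BLOCKED_HOSTS = new Set([
  "localhost",
  "localhost.localdomain",
  "metadata.google.internal",
]);
const SSRF_BLOCKED_WILDCARDS = new Set(["*.localhost", "*.local", "*.internal"]);

// Returns an error message for a rejected entry, or null if it is accepted.
function validateAllowlistEntry(entry: string): string | null {
  if (entry === "*" || entry === "*.") {
    return `"${entry}" would allow every host and is rejected`;
  }
  if (entry.includes("://")) {
    return `"${entry}" looks like a URL; use a bare domain such as "example.com"`;
  }
  if (SSRF_BLOCKED_HOSTS.has(entry) || SSRF_BLOCKED_WILDCARDS.has(entry)) {
    return `"${entry}" is unconditionally blocked by the SSRF guard at the network level, so allowlisting it would have no effect`;
  }
  return null;
}
```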