fix(security): make URL fetching opt-in (addresses #432) #449
Force-pushed from ddccc74 to 8eb9778.
Thanks for addressing this and for the mention. The opt-in approach for `fetch_urls` seems like the right call. Since this change was motivated by the issue identified in #432, would it be possible to track it via a GitHub Security Advisory or CVE for proper attribution? Also, note that the path traversal issue flagged in the same report remains unaddressed. Happy to help further if needed.
Flip the `lx.extract(..., fetch_urls=...)` default from `True` to `False` and expand the docstring with the SSRF caveats that come with enabling it. The library passes URLs directly to `requests.get` without SSRF protection: redirects, DNS rebinding, percent-encoded hosts, authority spoofs, and cloud metadata endpoints are all reachable. The safest default is to treat strings as literal text. Callers who need URL fetching should set `fetch_urls=True` only when they trust the source of the URL and run in a sandbox that cannot reach internal services.
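The dispatch logic this implies can be sketched as follows. This is a minimal illustration, not the library's actual implementation: `resolve_input`, `_URL_RE`, and the placeholder fetch are hypothetical names standing in for the real `lx.extract` / `io.download_text_from_url` path.

```python
import re

# Assumption: a simple scheme check stands in for the library's URL detection.
_URL_RE = re.compile(r"^https?://", re.IGNORECASE)

def resolve_input(text_or_url: str, fetch_urls: bool = False) -> str:
    """Decide whether a string is fetched or treated as literal text.

    Mirrors the new opt-in behavior: with fetch_urls=False (the default),
    even a URL-looking string is returned unchanged as literal text.
    """
    if fetch_urls and _URL_RE.match(text_or_url):
        # Placeholder for the real downloader (io.download_text_from_url);
        # only reached when the caller explicitly opts in.
        return f"<fetched {text_or_url}>"
    return text_or_url

# URL-looking input is left untouched by default.
print(resolve_input("https://example.com/notes.txt"))
# → https://example.com/notes.txt

# Fetching happens only on explicit opt-in.
print(resolve_input("https://example.com/notes.txt", fetch_urls=True))
```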
Force-pushed from 8eb9778 to f88ae19.
Thanks for flagging both, @l3tchupkt. Really appreciate you taking the time here. This PR is likely not the right shape for a formal advisory, since it's a small change focused on documentation and a more conservative default rather than a targeted code patch. That said, happy to keep the conversation open; feel free to reach out if you'd like to discuss it further (email is on my GitHub profile). The path-traversal finding is best tracked separately. Thanks again!
@aksg87 Thanks for the clarification; really appreciate the feedback. Understood regarding the SSRF being handled as a design change. For the path traversal issue, happy to open a focused PR or provide a minimal patch specifically for it if helpful.
Follow-up on the path-traversal note: see #451, which adds a docstring note covering it.
Summary
- Flip the `lx.extract(..., fetch_urls=...)` default from `True` to `False` so URL strings are treated as literal text unless the caller explicitly opts in.
- Expand the `fetch_urls` docstring with the SSRF risk surface (internal metadata endpoints, loopback, RFC 1918, CGNAT, DNS rebinding, redirects, authority spoofs) and guidance to only enable it when the URL source is trusted and the process runs in a sandbox.

Background
#432 (thanks @l3tchupkt) flagged that `langextract` currently hands caller-supplied URLs directly to `requests.get` with no validation. After evaluating the full surface area (internal IPs, CGNAT, mapped-IPv6, redirect validation, DNS rebinding, percent-encoded hosts, parser-mismatch bypasses, etc.), we chose an opt-in model rather than layering URL validation inside the library: strings are treated as literal text by default, and callers can set `fetch_urls=True` under controlled conditions. This change is intentionally small and security-oriented. A richer allowlist-based fetch helper is a reasonable future contribution.
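To show why such a helper is a larger project than it looks, here is a rough sketch of the allowlist approach. `ALLOWED_HOSTS` and `is_allowed_url` are hypothetical names, not part of langextract; and even this version still leaves DNS rebinding (the address can change between check and request) and redirect handling to request time, which is exactly why the PR opts for a conservative default instead.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Assumption: a caller-maintained allowlist of permitted hostnames.
ALLOWED_HOSTS = {"example.com"}

def is_allowed_url(url: str) -> bool:
    """Hostname allowlist plus a check on the resolved addresses.

    Rejects non-HTTP(S) schemes, hosts outside the allowlist, and hosts
    that resolve to private, loopback, link-local, or reserved addresses.
    Caveat: still vulnerable to DNS rebinding and says nothing about
    redirects followed by the eventual requests.get call.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or parts.hostname is None:
        return False
    if parts.hostname not in ALLOWED_HOSTS:
        return False
    try:
        infos = socket.getaddrinfo(parts.hostname, None)
    except socket.gaierror:
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True
```

Percent-encoded hosts and authority spoofs are partially handled by letting `urlsplit` do the parsing, but parser-mismatch bypasses remain possible whenever the validator and the HTTP client parse the URL differently.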
Test plan
- New tests in `tests/extract_fetch_urls_test.py`:
  - `test_url_is_not_fetched_by_default` — a URL-looking string does not trigger `io.download_text_from_url` under the new default.
  - `test_fetch_urls_true_invokes_downloader` — explicit `fetch_urls=True` still invokes the downloader.
- `tox -e format` passes (pyink + isort).
- `tox -e lint-src` passes (pylint 10.00/10).
- `tox -e lint-tests` passes (pylint 10.00/10).
- `pytest tests/ -ra -m "not live_api"` — 447 passing.

Notes
- Callers that previously relied on URL fetching must now pass `fetch_urls=True` explicitly. The new docstring calls this out.
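The two tests in the plan above follow a standard `unittest.mock` pattern. This sketch is self-contained: `io` and `extract` below are stand-ins for langextract's `io.download_text_from_url` and `lx.extract` (names taken from the PR text), not the library itself.

```python
import unittest
from types import SimpleNamespace
from unittest import mock

def _real_download(url):
    # Stand-in downloader; tests must never reach the network.
    raise RuntimeError("network disabled in tests")

# Stand-in for the library's io module, so the test pattern is runnable here.
io = SimpleNamespace(download_text_from_url=_real_download)

def extract(text_or_url, fetch_urls=False):
    # Stand-in for lx.extract's dispatch under the new default.
    if fetch_urls and text_or_url.startswith(("http://", "https://")):
        return io.download_text_from_url(text_or_url)
    return text_or_url

class FetchUrlsDefaultTest(unittest.TestCase):

    def test_url_is_not_fetched_by_default(self):
        with mock.patch.object(io, "download_text_from_url") as dl:
            result = extract("https://example.com/doc.txt")
        dl.assert_not_called()
        self.assertEqual(result, "https://example.com/doc.txt")

    def test_fetch_urls_true_invokes_downloader(self):
        with mock.patch.object(io, "download_text_from_url",
                               return_value="body") as dl:
            result = extract("https://example.com/doc.txt", fetch_urls=True)
        dl.assert_called_once_with("https://example.com/doc.txt")
        self.assertEqual(result, "body")
```

The mock-based shape keeps both tests offline: the default-path test asserts the downloader is never touched, and the opt-in test asserts it is called exactly once.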