fix(security): make URL fetching opt-in (addresses #432) #449

Merged: aksg87 merged 1 commit into main from security/fetch-urls-opt-in on Apr 19, 2026

@aksg87 (Collaborator) commented Apr 19, 2026

Summary

  • Flip lx.extract(..., fetch_urls=...) default from True to False so URL strings are treated as literal text unless the caller explicitly opts in.
  • Expand the fetch_urls docstring with the SSRF risk surface (internal metadata endpoints, loopback, RFC 1918, CGNAT, DNS rebinding, redirects, authority spoofs) and guidance to only enable it when the URL source is trusted and the process runs in a sandbox.

Background

#432 (thanks @l3tchupkt) flagged that langextract currently hands caller-supplied URLs directly to requests.get with no validation. After evaluating the full surface area (internal IPs, CGNAT, mapped-IPv6, redirect validation, DNS rebinding, percent-encoded hosts, parser-mismatch bypasses, etc.), we chose an opt-in model rather than layering URL validation inside the library:

  • The safest and simplest default is to not fetch caller-supplied URLs at all.
  • Callers who want URL fetching can set fetch_urls=True under controlled conditions.
  • Callers who need stricter behavior can download the text with their own vetted HTTP client and pass it as a string.
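
As a rough illustration of the last option, a caller-side check could look something like the sketch below. It uses only the standard library; `is_public_address` and `vet_url` are hypothetical helper names, not part of langextract. It rejects non-HTTP schemes and any host that resolves to a non-public address, but it is not a complete SSRF defense: the HTTP client must still connect to the vetted address (to close the DNS rebinding window) and re-vet every redirect target.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_address(address: str) -> bool:
    """Return True only for globally routable IP literals."""
    ip = ipaddress.ip_address(address)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) before classifying.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    # is_global is False for loopback, RFC 1918, link-local (where cloud
    # metadata endpoints live), and the CGNAT range 100.64.0.0/10.
    return ip.is_global


def vet_url(url: str) -> str:
    """Hypothetical pre-fetch check: return the host or raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    # .hostname is the parsed authority, so "user@host" spoofs yield the
    # real host rather than the decoy in the userinfo part.
    host = parts.hostname
    if not host:
        raise ValueError("URL has no host")
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        raise ValueError(f"cannot resolve {host!r}") from exc
    # Require every address the name maps to, not just the first, to be public.
    addresses = {info[4][0] for info in infos}
    if not all(is_public_address(a) for a in addresses):
        raise ValueError(f"{host!r} resolves to a non-public address")
    return host
```

Once a URL passes a check like this and the body has been downloaded with the caller's own client, the resulting text can be passed to `lx.extract` as a plain string with `fetch_urls` left at its default.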

This change is intentionally small and security-oriented. A richer allowlist-based fetch helper is a reasonable future contribution.

Test plan

  • New tests in tests/extract_fetch_urls_test.py:
    • test_url_is_not_fetched_by_default — URL-looking string does not trigger io.download_text_from_url under the new default.
    • test_fetch_urls_true_invokes_downloader — explicit fetch_urls=True still invokes the downloader.
  • tox -e format passes (pyink + isort).
  • tox -e lint-src passes (pylint 10.00/10).
  • tox -e lint-tests passes (pylint 10.00/10).
  • pytest tests/ -ra -m "not live_api" — 447 passing.
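
For readers who want the shape of the two new tests without opening the test file, here is a minimal, self-contained sketch. It does not import langextract: `extract` and `download_text_from_url` below are simplified stand-ins, and injecting the downloader as a parameter is an assumption made so the sketch runs on its own (the real tests may patch the downloader instead).

```python
from unittest import mock


def download_text_from_url(url: str) -> str:
    """Stand-in downloader; the sketch never calls it un-mocked."""
    raise RuntimeError("network access is not expected in this sketch")


def extract(text: str, fetch_urls: bool = False,
            downloader=download_text_from_url) -> str:
    """Hypothetical, simplified stand-in for the opt-in behavior."""
    if fetch_urls and text.startswith(("http://", "https://")):
        return downloader(text)
    # Under the new default, URL-looking strings stay literal text.
    return text


def test_url_is_not_fetched_by_default() -> None:
    downloader = mock.Mock(return_value="downloaded body")
    url = "https://example.com/doc.txt"
    assert extract(url, downloader=downloader) == url
    downloader.assert_not_called()


def test_fetch_urls_true_invokes_downloader() -> None:
    downloader = mock.Mock(return_value="downloaded body")
    url = "https://example.com/doc.txt"
    assert extract(url, fetch_urls=True, downloader=downloader) == "downloaded body"
    downloader.assert_called_once_with(url)
```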

@github-actions (bot) added the size/S label (pull request with 50-150 lines changed) on Apr 19, 2026
@aksg87 force-pushed the security/fetch-urls-opt-in branch 4 times, most recently from ddccc74 to 8eb9778, on April 19, 2026 04:58
@l3tchupkt commented

Thanks for addressing this and for the mention.

The opt-in approach for fetch_urls makes sense and effectively reduces the SSRF risk surface.

Since this change was motivated by the issue identified in #432, would it be possible to track this via a GitHub Security Advisory or CVE for proper attribution?

Also, note that the path traversal issue in save_annotated_documents() from #432 is still reproducible and may require a separate fix.

Happy to help further if needed.

Flip `lx.extract(..., fetch_urls=...)` default from True to False and
expand the docstring with the SSRF caveats that come with enabling it.

The library passes URLs directly to `requests.get` without SSRF
protection: redirects, DNS rebinding, percent-encoded hosts, authority
spoofs, and cloud metadata endpoints are all reachable. The safest
default is to treat strings as literal text. Callers who need URL
fetching should set `fetch_urls=True` only when they trust the source
of the URL and run in a sandbox that cannot reach internal services.
@aksg87 force-pushed the security/fetch-urls-opt-in branch from 8eb9778 to f88ae19 on April 19, 2026 05:15
@aksg87 (Collaborator, Author) commented Apr 19, 2026

Thanks for flagging both, @l3tchupkt. Really appreciate you taking the time here.

This PR is likely not the right shape for a formal advisory, since it's a small change focused on documentation and a more conservative default rather than a targeted code patch. That said, happy to keep the conversation open. Feel free to reach out if you'd like to discuss it further (email is on my GitHub profile).

The path-traversal finding in save_annotated_documents is noted and we'll look at it separately.

Thanks again!

@aksg87 merged commit 252b5f5 into main on Apr 19, 2026 (14 checks passed)
@aksg87 deleted the security/fetch-urls-opt-in branch on April 19, 2026 05:59
@l3tchupkt commented

@aksg87 Thanks for the clarification, really appreciate the feedback.

Understood regarding the SSRF being handled as a design change.

For the path traversal issue in save_annotated_documents(), I can confirm it is independently reproducible and allows writing outside the intended directory via ../ sequences.

Happy to open a focused PR or provide a minimal patch specifically for that issue if helpful.

@aksg87 (Collaborator, Author) commented Apr 19, 2026

Follow-up on the path-traversal note: #451 adds a docstring note that save_annotated_documents(output_name=...) isn't sanitized. LangExtract is a library, not a hosted service, so callers exposing it to untrusted input (e.g. filenames coming from an HTTP request) are expected to validate on their side. Thanks again for the flag!
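
A caller-side check along these lines is one way to do that validation before a request-supplied name reaches `save_annotated_documents(output_name=...)`. This is only a sketch: `safe_output_path` is a hypothetical helper, and `output_dir` is an assumed name for whatever directory the caller intends to write into. It needs Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path


def safe_output_path(output_dir: str, output_name: str) -> Path:
    """Hypothetical helper: resolve output_name inside output_dir or raise."""
    base = Path(output_dir).resolve()
    # resolve() collapses "../" segments; an absolute output_name replaces
    # base entirely when joined, so both cases are caught by the check below.
    candidate = (base / output_name).resolve()
    if candidate == base or not candidate.is_relative_to(base):
        raise ValueError(f"output name escapes {base}: {output_name!r}")
    return candidate
```

With this in place, a name such as `../../etc/cron.d/job` or `/etc/passwd` raises `ValueError`, while `results.jsonl` resolves to a path under the intended directory.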
