fix(security): make URL fetching opt-in (addresses #432) by aksg87 · Pull Request #449 · google/langextract

aksg87 · 2026-04-19T04:47:34Z

Summary

Flip lx.extract(..., fetch_urls=...) default from True to False so URL strings are treated as literal text unless the caller explicitly opts in.
Expand the fetch_urls docstring with the SSRF risk surface (internal metadata endpoints, loopback, RFC 1918, CGNAT, DNS rebinding, redirects, authority spoofs) and guidance to only enable it when the URL source is trusted and the process runs in a sandbox.
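The opt-in semantics described above can be illustrated with a small stand-in. This is a sketch only, not langextract's implementation: `resolve_text` and its `downloader` parameter are hypothetical names invented for the example, and the URL check is a simple prefix test assumed for illustration.

```python
from typing import Callable


def resolve_text(
    text_or_url: str,
    downloader: Callable[[str], str],
    fetch_urls: bool = False,
) -> str:
    """Hypothetical stand-in showing opt-in URL fetching.

    With the default fetch_urls=False, a URL-looking string is returned
    unchanged and the downloader is never called.
    """
    looks_like_url = text_or_url.startswith(("http://", "https://"))
    if fetch_urls and looks_like_url:
        # Only reached when the caller explicitly opted in.
        return downloader(text_or_url)
    return text_or_url


def fake_downloader(url: str) -> str:
    """Hypothetical downloader used for illustration; performs no network I/O."""
    return f"<contents of {url}>"


literal = resolve_text("https://example.com/doc.txt", fake_downloader)
fetched = resolve_text("https://example.com/doc.txt", fake_downloader, fetch_urls=True)
```

Under this sketch, `literal` is the URL string itself, while `fetched` is whatever the downloader returned.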

Background

#432 (thanks @l3tchupkt) flagged that langextract currently hands caller-supplied URLs directly to requests.get with no validation. After evaluating the full surface area (internal IPs, CGNAT, mapped-IPv6, redirect validation, DNS rebinding, percent-encoded hosts, parser-mismatch bypasses, etc.), we chose an opt-in model rather than layering URL validation inside the library:

The safest and simplest default is to not fetch caller-supplied URLs at all.
Callers who want URL fetching can set fetch_urls=True under controlled conditions.
Callers who need stricter behavior can download the text with their own vetted HTTP client and pass it as a string.

This change is intentionally small and security-oriented. A richer allowlist-based fetch helper is a reasonable future contribution.
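For callers who fetch text themselves, a pre-flight allowlist check might look like the sketch below. It is not part of langextract: `is_url_allowed` and the `allowed_hosts` parameter are hypothetical, and the policy (HTTPS only, exact host match, no userinfo, no IP literals) is an assumption chosen for illustration. It does not cover redirects or DNS rebinding, so the HTTP client would still need redirects disabled or re-validated and network-level isolation.

```python
import ipaddress
from urllib.parse import urlsplit


def is_url_allowed(url: str, allowed_hosts: frozenset) -> bool:
    """Hypothetical pre-flight check; returns True only for allowlisted HTTPS hosts."""
    try:
        parts = urlsplit(url)
        host = (parts.hostname or "").rstrip(".")
    except ValueError:
        # Malformed URLs (for example a broken IPv6 bracket) are rejected.
        return False

    if parts.scheme != "https":
        return False
    # Reject userinfo to avoid authority spoofs such as
    # https://trusted.example@internal-host/.
    if parts.username is not None or parts.password is not None:
        return False
    if not host:
        return False

    # IP literals are never allowed; only named, allowlisted hosts pass.
    try:
        ipaddress.ip_address(host)
    except ValueError:
        pass  # Not an IP literal, continue to the allowlist check.
    else:
        return False

    # urlsplit lowercases the hostname, so the allowlist should be lowercase.
    return host in allowed_hosts


ALLOWED = frozenset({"example.com"})
ok = is_url_allowed("https://example.com/report.txt", ALLOWED)
spoofed = is_url_allowed("https://example.com@169.254.169.254/latest/meta-data", ALLOWED)
```

A caller would run this check first, download the text with their own client, and then pass the resulting string to `lx.extract` with `fetch_urls` left at its default.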

Test plan

New tests in tests/extract_fetch_urls_test.py:
- test_url_is_not_fetched_by_default — URL-looking string does not trigger io.download_text_from_url under the new default.
- test_fetch_urls_true_invokes_downloader — explicit fetch_urls=True still invokes the downloader.
tox -e format passes (pyink + isort).
tox -e lint-src passes (pylint 10.00/10).
tox -e lint-tests passes (pylint 10.00/10).
pytest tests/ -ra -m "not live_api" — 447 passing.

Notes

Minor breaking change for callers who relied on the previous default; they now need fetch_urls=True explicitly. The new docstring calls this out.
Closes #432 ("fix(security): prevent SSRF and path traversal in URL fetching and file output") in favor of this simpler approach; @l3tchupkt's original PR led directly to this decision.

l3tchupkt · 2026-04-19T05:02:06Z

Thanks for addressing this and for the mention.

The opt-in approach for fetch_urls makes sense and effectively reduces the SSRF risk surface.

Since this change was motivated by the issue identified in #432, would it be possible to track this via a GitHub Security Advisory or CVE for proper attribution?

Also, note that the path traversal issue in save_annotated_documents() from #432 is still reproducible and may require a separate fix.

Happy to help further if needed.

Flip `lx.extract(..., fetch_urls=...)` default from True to False and expand the docstring with the SSRF caveats that come with enabling it. The library passes URLs directly to `requests.get` without SSRF protection: redirects, DNS rebinding, percent-encoded hosts, authority spoofs, and cloud metadata endpoints are all reachable. The safest default is to treat strings as literal text. Callers who need URL fetching should set `fetch_urls=True` only when they trust the source of the URL and run in a sandbox that cannot reach internal services.

aksg87 · 2026-04-19T05:58:50Z

Thanks for flagging both, @l3tchupkt. Really appreciate you taking the time here.

This PR is likely not the right shape for a formal advisory, since it's a small change focused on documentation and a more conservative default rather than a targeted code patch. That said, happy to keep the conversation open. Feel free to reach out if you'd like to discuss it further (email is on my GitHub profile).

The path-traversal finding in save_annotated_documents is noted and we'll look at it separately.

Thanks again!

l3tchupkt · 2026-04-19T06:02:46Z

@aksg87 Thanks for the clarification, really appreciate the feedback.

Understood regarding the SSRF being handled as a design change.

For the path traversal issue in save_annotated_documents(), I can confirm it is independently reproducible and allows writing outside the intended directory via ../ sequences.

Happy to open a focused PR or provide a minimal patch specifically for that issue if helpful.

aksg87 · 2026-04-19T17:29:43Z

Follow-up on the path-traversal note: #451 adds a docstring note that save_annotated_documents(output_name=...) isn't sanitized. LangExtract is a library, not a hosted service, so callers exposing it to untrusted input (e.g. filenames coming from an HTTP request) are expected to validate on their side. Thanks again for the flag!
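Caller-side validation of an untrusted output name could be sketched as follows. This is an illustration, not langextract code: `safe_output_path` is a hypothetical helper, and it assumes Python 3.9+ for `Path.is_relative_to`. It resolves the candidate path and rejects anything that lands outside the intended directory, which covers `../` sequences and absolute paths.

```python
from pathlib import Path


def safe_output_path(output_dir: str, output_name: str) -> Path:
    """Hypothetical helper: resolve output_name inside output_dir or raise ValueError."""
    base = Path(output_dir).resolve()
    # Joining with an absolute output_name discards base, so the check
    # below also rejects names such as "/etc/passwd".
    candidate = (base / output_name).resolve()
    if candidate == base or not candidate.is_relative_to(base):
        raise ValueError(f"output_name escapes output directory: {output_name!r}")
    return candidate
```

A caller handling a filename from an HTTP request would validate the name with this helper before passing it as `output_name` to `save_annotated_documents`.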

github-actions bot added the size/S label (pull request with 50-150 lines changed) on Apr 19, 2026

aksg87 force-pushed the security/fetch-urls-opt-in branch 4 times, most recently from ddccc74 to 8eb9778, on April 19, 2026 04:58

aksg87 force-pushed the security/fetch-urls-opt-in branch from 8eb9778 to f88ae19 on April 19, 2026 05:15

aksg87 mentioned this pull request Apr 19, 2026

fix(security): prevent SSRF and path traversal in URL fetching and file output #432

Closed

aksg87 merged commit 252b5f5 into main Apr 19, 2026
14 checks passed

aksg87 deleted the security/fetch-urls-opt-in branch April 19, 2026 05:59

aksg87 mentioned this pull request Apr 19, 2026

docs: note output_name is not sanitized in save_annotated_documents #451

Merged
