Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sitemap specify default filter url (Fixes: CVE-2023-46229) #11925

Merged
merged 5 commits into from Oct 17, 2023

Conversation

eyurtsev
Copy link
Collaborator

@eyurtsev eyurtsev commented Oct 17, 2023

Specify default filter URL in sitemap loader and add a security note

Fixes: CVE-2023-46229

@vercel
Copy link

vercel bot commented Oct 17, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Oct 17, 2023 5:08pm

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Oct 17, 2023
@@ -20,8 +21,43 @@ def _batch_block(iterable: Iterable, size: int) -> Generator[List[dict], None, N
yield item


def _extract_domain_and_scheme(url: str) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consider renaming to the order in which the components appear in the string

Suggested change
def _extract_domain_and_scheme(url: str) -> str:
def _extract_scheme_and_domain(url: str) -> str:

Comment on lines 117 to 119
self.filter_urls: Optional[List[str]] = [
_extract_domain_and_scheme(web_path)
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filter URLs are treated as a regex, so extracting http://example.com from the web path will inappropriately match both http://examplexcom.org and http://example.com.attacker.com/

A better approach would be to parse each target URL and check it against the scheme and domain, then match it against the regex with proper escaping.

self.filter_urls = filter_urls
# Define a list of URL patterns (interpreted as regular expressions) that
# will be allowed to be loaded.
# restrict_to_same_domain takes precedence over filter_urls when
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mention the precedence in the docstring?

Comment on lines +159 to +161
if self.allow_url_patterns and not any(
re.match(regexp_pattern, loc_text)
for regexp_pattern in self.allow_url_patterns
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we care about performance at all, but this will recompile all the regexes for every visited URL which is a bit of a heavyweight operation. The regexes don't change so we could compile them once and then use the compiled forms each time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not important in this case, can always optimize later if needed

Co-authored-by: Predrag Gruevski <2348618+obi1kenobi@users.noreply.github.com>
@eyurtsev eyurtsev merged commit 90e9ec6 into master Oct 17, 2023
32 checks passed
@eyurtsev eyurtsev deleted the eugene/sitemap_fix branch October 17, 2023 17:19
@eyurtsev
Copy link
Collaborator Author

Fixes: CVE-2023-46229

@eyurtsev eyurtsev changed the title Sitemap specify default filter url Sitemap specify default filter url (Fixes: CVE-2023-46229) Oct 19, 2023
hoanq1811 pushed a commit to hoanq1811/langchain that referenced this pull request Feb 2, 2024
Specify default filter URL in sitemap loader and add a security note

---------

Co-authored-by: Predrag Gruevski <2348618+obi1kenobi@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants