Sitemap specify default filter url (Fixes: CVE-2023-46229) #11925

Merged

eyurtsev merged 5 commits into master from eugene/sitemap_fix

Oct 17, 2023

Collaborator

eyurtsev commented Oct 17, 2023 •

edited

Specify default filter URL in sitemap loader and add a security note

Fixes: CVE-2023-46229

eyurtsev added 2 commits

October 17, 2023 11:11


          qxqx

fa28e38

c1a42da

vercel bot commented Oct 17, 2023 •

edited

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment

Name	Status	Preview	Comments	Updated (UTC)
langchain	⬜️ Ignored (Inspect)	Visit Preview		Oct 17, 2023 5:08pm

dosubot bot added Ɑ: doc loader 🤖:improvement labels

eyurtsev requested review from obi1kenobi and baskaryan

October 17, 2023 15:45

obi1kenobi reviewed

libs/langchain/langchain/document_loaders/sitemap.py Outdated

		@@ -20,8 +21,43 @@ def _batch_block(iterable: Iterable, size: int) -> Generator[List[dict], None, N
		yield item


		def _extract_domain_and_scheme(url: str) -> str:

Collaborator

obi1kenobi Oct 17, 2023

nit: consider renaming to the order in which the components appear in the string

Suggested change

      
            def _extract_domain_and_scheme(url: str) -> str:
          
            def _extract_scheme_and_domain(url: str) -> str:

libs/langchain/langchain/document_loaders/sitemap.py Outdated

Comment on lines 117 to 119

+                          self.filter_urls: Optional[List[str]] = [
+                              _extract_domain_and_scheme(web_path)
+                          ]

Collaborator

obi1kenobi Oct 17, 2023

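Filter URLs are treated as regexes, so using http://example.com (extracted from the web path) as a filter will inappropriately match both http://examplexcom.org and http://example.com.attacker.com/: the unescaped "." matches any character, and the match is not anchored at the end of the domain.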

A better approach would be to parse each target URL and check it against the scheme and domain, then match it against the regex with proper escaping.
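To make the concern above concrete, here is a minimal sketch using only the standard library. It shows the unescaped pattern matching all three URLs, shows that `re.escape` alone still lets the `example.com.attacker.com` host through, and then compares parsed scheme and host instead. The helper name `is_same_scheme_and_domain` is hypothetical and is not taken from this PR's diff.

```python
import re
from urllib.parse import urlparse

# Pattern derived naively from the sitemap's web path (unescaped, unanchored).
naive_pattern = "http://example.com"

candidates = [
    "http://example.com/page",
    "http://examplexcom.org",
    "http://example.com.attacker.com/",
]

# The "." matches any character and re.match only anchors at the start,
# so every candidate is accepted.
naive_matches = [u for u in candidates if re.match(naive_pattern, u)]

# Escaping fixes the "." problem but not the missing end-of-domain anchor.
escaped_pattern = re.escape(naive_pattern)
escaped_matches = [u for u in candidates if re.match(escaped_pattern, u)]


def is_same_scheme_and_domain(url: str, base_url: str) -> bool:
    """Hypothetical helper: compare parsed scheme and netloc exactly."""
    target = urlparse(url)
    base = urlparse(base_url)
    return (target.scheme, target.netloc) == (base.scheme, base.netloc)


strict_matches = [
    u for u in candidates if is_same_scheme_and_domain(u, "http://example.com")
]
```

Only the parsed comparison rejects both look-alike hosts; any regex filters can then be applied on top of it.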

200d0ab

obi1kenobi reviewed

libs/langchain/langchain/document_loaders/sitemap.py

-                      self.filter_urls = filter_urls
+                      # Define a list of URL patterns (interpreted as regular expressions) that
+                      # will be allowed to be loaded.
+                      # restrict_to_same_domain takes precedence over filter_urls when

Collaborator

obi1kenobi Oct 17, 2023

Mention the precedence in the docstring?

libs/langchain/langchain/document_loaders/sitemap.py

Comment on lines +159 to +161

+                          if self.allow_url_patterns and not any(
+                              re.match(regexp_pattern, loc_text)
+                              for regexp_pattern in self.allow_url_patterns

Collaborator

obi1kenobi Oct 17, 2023

I don't know if we care about performance at all, but this will recompile all the regexes for every visited URL, which is a bit of a heavyweight operation. The regexes don't change, so we could compile them once and then use the compiled forms each time.

Collaborator Author

eyurtsev Oct 17, 2023

Not important in this case, can always optimize later if needed
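For reference, the compile-once variant discussed in this thread could look roughly like the sketch below. The function names `compile_patterns` and `is_allowed` are hypothetical and not part of the loader; the empty-list behavior mirrors the `if self.allow_url_patterns and not any(...)` check quoted above. Note that Python's `re` module already caches recently used compiled patterns for the module-level functions, so the saving is likely smaller than a full recompile per URL.

```python
import re
from typing import List


def compile_patterns(patterns: List[str]) -> List[re.Pattern]:
    """Compile each allow pattern once, up front."""
    return [re.compile(pattern) for pattern in patterns]


def is_allowed(loc_text: str, compiled: List[re.Pattern]) -> bool:
    """Return True if no patterns are configured or any pattern matches.

    Uses Pattern.match, which anchors at the start of the string,
    the same semantics as re.match in the reviewed code.
    """
    return not compiled or any(p.match(loc_text) for p in compiled)


# Usage: compile once, then reuse for every URL found in the sitemap.
allow = compile_patterns([r"https://example\.com/blog/"])
blog_ok = is_allowed("https://example.com/blog/post", allow)
about_ok = is_allowed("https://example.com/about", allow)
```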

91c0a02

obi1kenobi reviewed

libs/langchain/langchain/document_loaders/sitemap.py Outdated


          Update libs/langchain/langchain/document_loaders/sitemap.py

787f492

Co-authored-by: Predrag Gruevski <2348618+obi1kenobi@users.noreply.github.com>

eyurtsev merged commit 90e9ec6 into master

32 checks passed

eyurtsev deleted the eugene/sitemap_fix branch

October 17, 2023 17:19

Collaborator Author

eyurtsev commented Oct 19, 2023

Fixes: CVE-2023-46229

eyurtsev changed the title ~~Sitemap specify default filter url~~ Sitemap specify default filter url (Fixes: CVE-2023-46229)

hoanq1811 pushed a commit to hoanq1811/langchain that referenced this pull request


          Sitemap specify default filter url (langchain-ai#11925)

ff17f02

Specify default filter URL in sitemap loader and add a security note

---------

Co-authored-by: Predrag Gruevski <2348618+obi1kenobi@users.noreply.github.com>
