Using the right encoding to parse the web page in RecursiveUrlLoader #20632
libs/community/langchain_community/document_loaders/recursive_url_loader.py
@@ -169,6 +178,12 @@ def _get_child_links_recursive(
        visited.add(url)
        try:
            response = requests.get(url, timeout=self.timeout, headers=self.headers)

            if self.encoding is not None:
This looks reasonable to me. It might be better not to do it by mutating the response encoding,
and instead to modify the code below to do:
encoding = ...
text = response.content.decode(encoding=encoding)
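A minimal sketch of that suggestion, decoding the raw bytes directly so the response object is never modified. The helper name `decode_body` and the `fallback` parameter are hypothetical and are not part of RecursiveUrlLoader or requests; `errors="replace"` is an assumption chosen so that undecodable bytes do not raise.

```python
from typing import Optional


def decode_body(
    content: bytes, encoding: Optional[str] = None, fallback: str = "utf-8"
) -> str:
    """Decode raw response bytes without touching the response object.

    `encoding` plays the role of the loader's user-supplied encoding;
    `fallback` is a hypothetical default used when none is given.
    """
    chosen = encoding if encoding is not None else fallback
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    return content.decode(encoding=chosen, errors="replace")


# Bytes as they might arrive from a GBK-encoded page.
raw = "你好, world".encode("gbk")
text = decode_body(raw, encoding="gbk")
```

With a real response, the call would look like `decode_body(response.content, encoding)`; `response.encoding` is left as the server reported it.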
response.text already does what you described, and it is how RecursiveUrlLoader currently consumes the response:
langchain/libs/community/langchain_community/document_loaders/recursive_url_loader.py
Line 198 in 804390b
content = self.extractor(response.text)
langchain/libs/community/langchain_community/document_loaders/recursive_url_loader.py
Line 202 in 804390b
metadata=self.metadata_extractor(response.text, url, response),
langchain/libs/community/langchain_community/document_loaders/recursive_url_loader.py
Lines 206 to 207 in 804390b
sub_links = extract_sub_links(
    response.text,
Internally, response.text decodes response.content with:
content = str(self.content, encoding, errors="replace")
return content
This is essentially the same as your text = response.content.decode(encoding=encoding), except that it passes errors="replace".
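To make the comparison concrete, here is a small sketch using only built-in Python behaviour (no requests call is made). It shows that `str(bytes, encoding, errors="replace")` and `bytes.decode(encoding, errors="replace")` produce the same text, and that the two differ only when `errors` is left at its default of `"strict"`. The sample byte strings are made up for illustration.

```python
# Well-formed input: both spellings give identical results.
raw = "café".encode("utf-8")
via_str = str(raw, "utf-8", errors="replace")
via_decode = raw.decode(encoding="utf-8", errors="replace")

# Malformed input: errors="replace" substitutes U+FFFD, while the
# default errors="strict" raises UnicodeDecodeError.
bad = b"\xff\xfeabc"
lenient = str(bad, "utf-8", errors="replace")
try:
    bad.decode(encoding="utf-8")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```

So a plain `response.content.decode(encoding=encoding)` is stricter than `response.text` unless `errors="replace"` is passed explicitly.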
As shown in #13749, RecursiveUrlLoader has an encoding issue. This PR fixes it.
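For readers unfamiliar with this class of bug, the sketch below shows one common way it can arise: bytes written in one encoding are decoded with another. This is an illustration under assumed inputs, not necessarily the exact case reported in #13749; the sample string and the choice of ISO-8859-1 as the wrong encoding are assumptions.

```python
# A page body encoded as UTF-8 by the server.
original = "日本語のページ"
body = original.encode("utf-8")

# Decoding with the wrong charset never fails for ISO-8859-1
# (every byte maps to a character) but yields mojibake.
garbled = body.decode("iso-8859-1")

# Decoding with the encoding the page actually uses recovers the text.
correct = body.decode("utf-8")
```

Letting the caller state the encoding explicitly, as this PR proposes, avoids relying on a guessed or defaulted charset.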