Using the right encoding to parse the web page in RecursiveUrlLoader #20632

Merged
merged 12 commits into from
Apr 30, 2024

Conversation

fubuki8087
Contributor

As shown in #13749, RecursiveUrlLoader has an encoding issue. This PR fixes it.

@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Apr 19, 2024
@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Apr 19, 2024
@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Apr 19, 2024
@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Apr 23, 2024
@@ -169,6 +178,12 @@ def _get_child_links_recursive(
        visited.add(url)
        try:
            response = requests.get(url, timeout=self.timeout, headers=self.headers)

            if self.encoding is not None:
Collaborator

This looks reasonable to me. It might be better not to do this by mutating the response encoding.

Instead, modify the code below to do:

encoding = ...
text = response.content.decode(encoding=encoding)
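One way to read this suggestion is sketched below: pick an encoding and decode the raw bytes directly, leaving the response object untouched. The helper name `decode_body`, its parameters, and the fallback order (user override, then the declared encoding, then UTF-8) are assumptions made for illustration; they are not taken from the PR, and the `encoding = ...` above is left unspecified by the reviewer.

```python
from typing import Optional


def decode_body(
    content: bytes,
    override: Optional[str],
    declared: Optional[str],
    default: str = "utf-8",
) -> str:
    """Hypothetical helper: decode bytes without mutating any response object.

    override: encoding supplied by the caller (assumed to play the role of
              the loader's `encoding` setting)
    declared: encoding reported by the HTTP response, if any
    default:  assumed last-resort encoding
    """
    encoding = override or declared or default
    return content.decode(encoding=encoding)


# A GBK-encoded body whose server wrongly declares ISO-8859-1.
body = "编码".encode("gbk")
print(decode_body(body, override="gbk", declared="ISO-8859-1"))
```

Note that `bytes.decode` is strict by default, so undecodable bytes raise `UnicodeDecodeError` in this sketch.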

Contributor Author

response.text actually does what you describe, and it is how RecursiveUrlLoader currently uses the response:


metadata=self.metadata_extractor(response.text, url, response),

sub_links = extract_sub_links(
response.text,

Internally, response.text decodes response.content with:

content = str(self.content, encoding, errors="replace")
return content

That is essentially the same as your text = response.content.decode(encoding=encoding)
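The comparison can be checked with the standard library alone. The sketch below mirrors the decoding line quoted from response.text and sets it beside a plain `bytes.decode` call; the function names and sample bytes are illustrative assumptions. The two agree on validly encoded input and differ only in error handling, since the quoted line passes errors="replace" while `bytes.decode` is strict by default.

```python
def decode_like_response_text(content: bytes, encoding: str) -> str:
    # Mirrors the line quoted above: undecodable bytes become U+FFFD.
    return str(content, encoding, errors="replace")


def decode_strict(content: bytes, encoding: str) -> str:
    # The reviewer's form: strict error handling by default.
    return content.decode(encoding=encoding)


page = "中文网页".encode("gbk")

# With the right encoding, both approaches return the same text.
same = decode_like_response_text(page, "gbk") == decode_strict(page, "gbk")

# With the wrong encoding, decoding still succeeds but yields mojibake,
# which is the kind of symptom an explicit encoding override addresses.
garbled = decode_like_response_text(page, "ISO-8859-1")

# On invalid input, the strict form raises and the replacing form does not.
bad = b"\xffabc"
replaced = decode_like_response_text(bad, "utf-8")
try:
    decode_strict(bad, "utf-8")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```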

@baskaryan baskaryan enabled auto-merge (squash) April 30, 2024 18:33
@baskaryan baskaryan merged commit f1c3687 into langchain-ai:master Apr 30, 2024
59 checks passed
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: doc loader Related to document loader module (not documentation) lgtm PR looks good. Use to confirm that a PR is ready for merging. size:S This PR changes 10-29 lines, ignoring generated files.

3 participants