Using the right encoding to parse the web page in RecursiveUrlLoader #20632

fubuki8087 · 2024-04-19T00:49:35Z

As shown in #13749, RecursiveUrlLoader has an encoding issue. This PR fixes it.

eyurtsev · 2024-04-25T17:00:07Z

libs/community/langchain_community/document_loaders/recursive_url_loader.py

@@ -169,6 +178,12 @@ def _get_child_links_recursive(
        visited.add(url)
        try:
            response = requests.get(url, timeout=self.timeout, headers=self.headers)
+
+            if self.encoding is not None:


This looks reasonable to me. It might be better not to do it by mutating the response encoding.

Instead, modify the code below to do:

encoding = ...
text = response.content.decode(encoding=encoding)
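A minimal sketch of the suggestion above, using only the standard library. The helper name `decode_body` and its `detected` parameter are hypothetical and are not part of RecursiveUrlLoader or requests; `detected` merely stands in for whatever encoding the HTTP client guessed. The point is that the raw bytes are decoded with an explicitly chosen encoding and no response object is mutated.

```python
from typing import Optional


def decode_body(
    content: bytes,
    encoding: Optional[str] = None,
    detected: Optional[str] = None,
) -> str:
    """Decode raw bytes with an explicit encoding, leaving the caller's objects untouched.

    `encoding` is the user-supplied override; `detected` is a stand-in for the
    encoding guessed by the HTTP client. Both names are illustrative assumptions.
    """
    chosen = encoding or detected or "utf-8"
    return content.decode(encoding=chosen, errors="replace")


# A Shift_JIS page whose encoding was wrongly guessed as ISO-8859-1.
raw = "日本語のページ".encode("shift_jis")

wrong = decode_body(raw, detected="iso-8859-1")  # mojibake
right = decode_body(raw, encoding="shift_jis", detected="iso-8859-1")  # override wins
```

With an override, the text round-trips correctly; without one, the wrongly guessed encoding silently produces garbled characters rather than an error.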

fubuki8087 · 2024-04-28

response.text actually does what you describe, and it is how RecursiveUrlLoader currently consumes the response:

langchain/libs/community/langchain_community/document_loaders/recursive_url_loader.py

Line 198 in 804390b

content = self.extractor(response.text)

langchain/libs/community/langchain_community/document_loaders/recursive_url_loader.py

Line 202 in 804390b

metadata=self.metadata_extractor(response.text, url, response),

langchain/libs/community/langchain_community/document_loaders/recursive_url_loader.py

Lines 206 to 207 in 804390b

sub_links = extract_sub_links(
    response.text,

Internally, response.text decodes response.content with:

content = str(self.content, encoding, errors="replace")
return content

That is the same as your text = response.content.decode(encoding=encoding), except that undecodable bytes are replaced instead of raising an error.
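The equivalence claimed here can be checked with plain bytes, without any HTTP request. This is only an illustrative sketch: the sample byte strings are made up, and the one behavioral difference it highlights is the `errors="replace"` handler, which substitutes U+FFFD for invalid bytes where a strict decode would raise.

```python
# Valid UTF-8 input: both spellings give the same string.
raw = "café".encode("utf-8")
via_str = str(raw, "utf-8", errors="replace")
via_decode = raw.decode(encoding="utf-8", errors="replace")

# Invalid UTF-8 input: errors="replace" substitutes U+FFFD,
# whereas the default strict handler raises UnicodeDecodeError.
bad = b"caf\xff"
replaced = str(bad, "utf-8", errors="replace")

try:
    bad.decode(encoding="utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

So the two approaches differ only in how decoding is triggered, not in the resulting text, as long as the same encoding and error handler are used.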

fix langchain-ai#13749

8bd830a

dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Apr 19, 2024

dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Apr 19, 2024

fubuki8087 and others added 2 commits April 19, 2024 09:18

Merge branch 'master' into master

b1ca618

Merge branch 'master' into master

b1c4a05

baskaryan approved these changes Apr 19, 2024

dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Apr 19, 2024

Merge branch 'master' into master

3a1d568

Merge branch 'master' into master

eace2fb

eyurtsev reviewed Apr 19, 2024

fubuki8087 added 2 commits April 23, 2024 16:32

Merge branch 'langchain-ai:master' into master

9f4330a

make encoding optional

a94a589

dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Apr 23, 2024

fubuki8087 requested a review from eyurtsev April 23, 2024 09:06

baskaryan and others added 2 commits April 25, 2024 09:54

Merge branch 'master' into master

c36d097

fmt

8e67283

eyurtsev approved these changes Apr 25, 2024

fubuki8087 and others added 3 commits April 28, 2024 14:15

Merge branch 'master' into master

f52694d

Merge branch 'master' into master

fab7513

Merge branch 'master' into master

4702734

baskaryan enabled auto-merge (squash) April 30, 2024 18:33

baskaryan merged commit f1c3687 into langchain-ai:master Apr 30, 2024
59 checks passed
