UnstructuredURLLoader Error 403 #1829

Closed
waxxdu opened this issue Mar 20, 2023 · 9 comments · Fixed by #6246

@waxxdu

waxxdu commented Mar 20, 2023

Issue: When trying to read data from some URLs, I get a 403 error during load. I assume this is because the web server does not allow all user agents.

Expected behavior: It would be great if I could specify a user agent (e.g. standard browsers like Mozilla, maybe also Google bots) for making the URL requests.

My code

from langchain.document_loaders import UnstructuredURLLoader
urls = ["https://dsgvo-gesetz.de/art-1"]
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()

Error message

ValueError                                Traceback (most recent call last)
Cell In[62], line 1
----> 1 data = loader.load()

File /opt/conda/lib/python3.10/site-packages/langchain/document_loaders/url.py:28, in UnstructuredURLLoader.load(self)
     26 docs: List[Document] = list()
     27 for url in self.urls:
---> 28     elements = partition_html(url=url)
     29     text = "\n\n".join([str(el) for el in elements])
     30     metadata = {"source": url}

File /opt/conda/lib/python3.10/site-packages/unstructured/partition/html.py:72, in partition_html(filename, file, text, url, encoding, include_page_breaks, include_metadata, parser)
     70 response = requests.get(url)
     71 if not response.ok:
---> 72     raise ValueError(f"URL return an error: {response.status_code}")
     74 content_type = response.headers.get("Content-Type", "")
     75 if not content_type.startswith("text/html"):

ValueError: URL return an error: 403

For reference, a URL that works without the 403 error:
https://www.heise.de/newsticker/

@hkaraoguz

I have worked with UnstructuredURLLoader a bit too. Under the hood it uses the requests library to fetch data. I have implemented my own solution that fetches the URL and parses the content using unstructured. You can then pass the parsed HTML text to LangChain as a document if you want.
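A minimal sketch of the fetch-it-yourself approach described above, not the commenter's actual code. To keep it dependency-free it uses the standard library's `urllib` for the request and `html.parser` as a stand-in for unstructured's HTML partitioning. The `BROWSER_UA` string and all function names are assumptions chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Hypothetical browser-like User-Agent string, used only for illustration.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Build a GET request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Join the text nodes of an HTML string with blank lines."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.chunks)


def fetch_text(url, user_agent=BROWSER_UA, timeout=10):
    """Fetch a page and return its text plus a {"source": url} metadata dict."""
    with urlopen(build_request(url, user_agent), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return {"page_content": html_to_text(html), "metadata": {"source": url}}
```

The returned dictionary mirrors the text-plus-`source` shape visible in the traceback above, so it could be wrapped in a LangChain `Document` if desired. A server may still answer 403 for reasons unrelated to the User-Agent header.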

@cragwolfe
Contributor

cragwolfe commented Mar 28, 2023

I created an issue, #1944, which would allow passing the User-Agent header or any other headers that might be needed.

@cragwolfe
Contributor

See this PR for how to pass a User-Agent header to UnstructuredURLLoader:
#2105

@Bohan-J

Bohan-J commented Apr 6, 2023

@waxxdu I get the same problem here. Have you figured out how to solve it? I updated my langchain version and passed a User-Agent to the loader, but I still get error code 403 in return.
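One way to narrow down a 403 that persists after setting a User-Agent is to compare the status codes returned for different header sets. The sketch below is a diagnostic aid under stated assumptions: it uses the standard library's `urllib` rather than the loader itself, and the header sets in `HEADER_SETS` are hypothetical examples. Whether any particular site inspects these headers is not confirmed by this thread.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Hypothetical header sets for comparison; the values are illustrative only.
HEADER_SETS = {
    "default": {},
    "user_agent_only": {"User-Agent": "Mozilla/5.0"},
    "browser_like": {
        "User-Agent": "Mozilla/5.0",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    },
}


def status_for(url, headers, timeout=10):
    """Return the HTTP status code the server sends for this header set."""
    try:
        with urlopen(Request(url, headers=headers), timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is on the error.
        return err.code


def compare_header_sets(url, header_sets, fetch_status=status_for):
    """Map each named header set to the status code it produces."""
    return {name: fetch_status(url, headers) for name, headers in header_sets.items()}
```

Calling `compare_header_sets(url, HEADER_SETS)` performs one request per header set. If every set returns 403, the block is probably based on something other than request headers.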

@waxxdu
Author

waxxdu commented Apr 6, 2023

@Bohan-J I have not found the time to try again yet. It is on my agenda, possibly Monday or later next week. I will post the results if I succeed.

@adheeshenoy

Looking for a solution to this as well.

@crallan

crallan commented Apr 18, 2023

I'm trying to use the WebBrowserTool and I get the same code (403) on some sites. I created a separate issue for that: #3118

@simplyviki

Same error. Any resolution?

@crallan

crallan commented Jun 7, 2023

As a workaround, we created a custom tool that uses Crawlee (https://crawlee.dev/) and stores the results in a vector store. Crawlee already has some good techniques to avoid 403 errors, such as fake user agents, request headers, and proxy pools.

hwchase17 added a commit that referenced this issue Jun 19, 2023
To bypass SSL verification errors during web scraping, you can include
the ssl_verify=False parameter along with the headers parameter. This
combination of arguments proves useful, especially for beginners in the
field of web scraping.
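A hedged sketch of the call this commit message describes. It assumes a LangChain release that includes this change and an installed version of unstructured that accepts `headers` and `ssl_verify`. The helper names `build_loader_kwargs` and `load_with_headers` are hypothetical and exist only to keep the example testable.

```python
from typing import Any, Dict, List


def build_loader_kwargs(
    urls: List[str], user_agent: str, ssl_verify: bool = True
) -> Dict[str, Any]:
    """Assemble the keyword arguments named in the commit message."""
    return {
        "urls": list(urls),
        "headers": {"User-Agent": user_agent},
        "ssl_verify": ssl_verify,
    }


def load_with_headers(urls: List[str], user_agent: str, ssl_verify: bool = True):
    """Load the URLs with a custom User-Agent (assumes LangChain is installed)."""
    # Imported lazily so build_loader_kwargs works without LangChain present.
    from langchain.document_loaders import UnstructuredURLLoader

    loader = UnstructuredURLLoader(**build_loader_kwargs(urls, user_agent, ssl_verify))
    return loader.load()
```

For example, `load_with_headers(["https://dsgvo-gesetz.de/art-1"], "Mozilla/5.0", ssl_verify=False)` would pass both arguments through to the loader. Whether the server then responds with something other than 403 depends on the site.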

Fixes #1829 

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
kacperlukawski pushed a commit to kacperlukawski/langchain that referenced this issue Jun 29, 2023