UnstructuredURLLoader Error 403 #1829

waxxdu · 2023-03-20T21:26:40Z

Issue: When trying to read data from some URLs, I get a 403 error during load. I assume this is due to the web-server not allowing all user agents.

Expected behavior: It would be great if I could specify a user agent (e.g. standard browsers like Mozilla, maybe also Google bots) for making the URL requests.

My code

from langchain.document_loaders import UnstructuredURLLoader
urls = ["https://dsgvo-gesetz.de/art-1"]
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()

Error message

ValueError                                Traceback (most recent call last)
Cell In[62], line 1
----> 1 data = loader.load()

File /opt/conda/lib/python3.10/site-packages/langchain/document_loaders/url.py:28, in UnstructuredURLLoader.load(self)
     26 docs: List[Document] = list()
     27 for url in self.urls:
---> 28     elements = partition_html(url=url)
     29     text = "\n\n".join([str(el) for el in elements])
     30     metadata = {"source": url}

File /opt/conda/lib/python3.10/site-packages/unstructured/partition/html.py:72, in partition_html(filename, file, text, url, encoding, include_page_breaks, include_metadata, parser)
     70 response = requests.get(url)
     71 if not response.ok:
---> 72     raise ValueError(f"URL return an error: {response.status_code}")
     74 content_type = response.headers.get("Content-Type", "")
     75 if not content_type.startswith("text/html"):

ValueError: URL return an error: 403

for reference: URL that works without the 403 error
https://www.heise.de/newsticker/
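One way to test the assumption that the server filters on user agent is to request the same URL with two different User-Agent values and compare the status codes. The sketch below uses only the standard library (the loader itself goes through requests); both User-Agent strings and the helper names `status_for` and `compare_user_agents` are illustrative assumptions, not part of langchain or unstructured.

```python
import urllib.error
import urllib.request

# Hypothetical User-Agent strings for the comparison; any realistic values work.
DEFAULT_UA = "python-requests/2.28.2"
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/111.0 Safari/537.36"
)


def status_for(url, user_agent, opener=urllib.request.urlopen):
    """Return the HTTP status code the server sends for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is on the exception.
        return err.code


def compare_user_agents(url, opener=urllib.request.urlopen):
    """Map each User-Agent string to the status code it receives."""
    return {ua: status_for(url, ua, opener) for ua in (DEFAULT_UA, BROWSER_UA)}
```

Calling `compare_user_agents("https://dsgvo-gesetz.de/art-1")` returns one status code per User-Agent. If both come back as 403, the block is probably based on something other than the User-Agent header.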


hkaraoguz · 2023-03-22T22:27:01Z

I have worked with UnstructuredURLLoader a bit too. Under the hood it uses the requests library to fetch data. I have implemented my own solution to fetch the URL and parse the content using unstructured. You can then pass the parsed HTML text as a document to LangChain if you want.
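A minimal sketch of that approach, not the commenter's actual code: fetch the page yourself with a browser-like User-Agent, hand the HTML to unstructured's `partition_html(text=...)`, and wrap the result in a LangChain `Document`. It uses the standard library's urllib instead of requests so the fetch part has no extra dependencies. The User-Agent value and the helper names are assumptions, and the `Document` import path is the one from the 2023 langchain releases discussed in this thread.

```python
import urllib.request

# Assumed browser-like User-Agent; a hypothetical value, not taken from the thread.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/111.0 Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent=USER_AGENT):
    """Download the page and decode it using the charset the server declares."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def html_to_document(html, url):
    """Parse HTML with unstructured and wrap the text in a LangChain Document."""
    # Imported here so the fetch helpers work without these packages installed.
    from langchain.docstore.document import Document
    from unstructured.partition.html import partition_html

    elements = partition_html(text=html)
    text = "\n\n".join(str(el) for el in elements)
    return Document(page_content=text, metadata={"source": url})
```

With both packages installed, `html_to_document(fetch_html(url), url)` produces a document in the same shape the loader builds internally (joined element text plus a `source` metadata entry).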

cragwolfe · 2023-03-28T05:36:26Z

I created an issue, #1944, which would allow passing the User-Agent header, or any other headers that might be needed.

cragwolfe · 2023-03-29T04:32:07Z

See this PR for how to pass a User-Agent header to UnstructuredURLLoader:
#2105
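As a rough illustration of what that looks like from the caller's side, the sketch below passes a `headers` dictionary to `UnstructuredURLLoader`. It assumes a langchain release that includes the change and an unstructured version whose `partition_html` accepts headers. The User-Agent string and the helper names `browser_headers` and `load_urls` are hypothetical.

```python
# Assumed browser-like User-Agent; a hypothetical value for illustration.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/111.0 Safari/537.36"
)


def browser_headers(user_agent=None):
    """Return request headers carrying the given (or default) User-Agent."""
    return {"User-Agent": user_agent or DEFAULT_USER_AGENT}


def load_urls(urls, headers):
    """Load the URLs through UnstructuredURLLoader with custom request headers."""
    # Imported here so browser_headers() can be used without langchain installed.
    from langchain.document_loaders import UnstructuredURLLoader

    loader = UnstructuredURLLoader(urls=urls, headers=headers)
    return loader.load()
```

For example, `load_urls(["https://dsgvo-gesetz.de/art-1"], browser_headers())`. A custom User-Agent is not guaranteed to clear the 403: some servers also check other headers or block by IP address.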

Bohan-J · 2023-04-06T02:34:16Z

@waxxdu I have the same problem here. Have you figured out how to solve it? I updated my langchain version and passed a User-Agent to the loader, but I still get error code 403 in return.

waxxdu · 2023-04-06T22:37:21Z

@Bohan-J I did not find the time yet to try again. It is on my agenda - possibly Monday or later next week. I will post successful results if any.

adheeshenoy · 2023-04-13T23:53:03Z

Looking for a solution to this as well.

crallan · 2023-04-18T22:11:47Z

I'm trying to use the WebBrowserTool and I got the same code (403) on some sites. I created a separate issue for that #3118

simplyviki · 2023-06-07T16:27:16Z

Same error. Any resolution?

crallan · 2023-06-07T16:40:36Z

What we did as a workaround is to create a custom tool that uses crawlee (https://crawlee.dev/) and stores the results in a vector store. Crawlee already has some good techniques to avoid the 403 errors, like fake agents, request headers and proxy pools.

@hwchase17

To bypass SSL verification errors during web scraping, you can include the ssl_verify=False parameter along with the headers parameter. This combination of arguments proves useful, especially for beginners in the field of web scraping.

Fixes #1829

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
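A hedged sketch of combining the two arguments. It assumes a langchain version whose `UnstructuredURLLoader` forwards extra keyword arguments such as `ssl_verify` to unstructured (older releases may reject the argument). The helper names and the `verify_ssl` flag are hypothetical. Note that `ssl_verify=False` addresses certificate errors, not the 403 itself, and it turns off certificate checking for those requests.

```python
def loader_kwargs(urls, user_agent, verify_ssl=True):
    """Assemble keyword arguments for UnstructuredURLLoader."""
    kwargs = {"urls": list(urls), "headers": {"User-Agent": user_agent}}
    if not verify_ssl:
        # Only add the flag when verification should be skipped.
        kwargs["ssl_verify"] = False
    return kwargs


def load(urls, user_agent, verify_ssl=True):
    """Load the URLs, optionally skipping SSL certificate verification."""
    # Imported here so loader_kwargs() can be used without langchain installed.
    from langchain.document_loaders import UnstructuredURLLoader

    return UnstructuredURLLoader(**loader_kwargs(urls, user_agent, verify_ssl)).load()
```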


jackfrost1411 mentioned this issue Jun 15, 2023

Add markdown to specify important arguments #6246

Merged

hwchase17 closed this as completed in #6246 Jun 19, 2023
