Getting "HTTP Error 404: Not Found" Error Using ArxivLoader #20909

Ahmetyasin · 2024-04-25T21:06:55Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_community.document_loaders import ArxivLoader

docs = ArxivLoader(query = "Advanced regression models for real estate price prediction", load_max_docs=100, load_all_available_meta=True).load()

Error Message and Stack Trace (if applicable)

HTTPError Traceback (most recent call last)
Cell In[9], line 7
1 from langchain_community.document_loaders import ArxivLoader
3 #ArxivLoader
4 #HTTPError: HTTP Error 404: Not Found
5
6 # Load documents using ArxivLoader with the concise_query
----> 7 docs = ArxivLoader(query = "Advanced regression models for real estate price prediction", load_max_docs=100, load_all_available_meta=True).load()

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain_community/document_loaders/arxiv.py:27, in ArxivLoader.load(self)
26 def load(self) -> List[Document]:
---> 27 return self.client.load(self.query)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain_community/utilities/arxiv.py:208, in ArxivAPIWrapper.load(self, query)
206 for result in results:
207 try:
--> 208 doc_file_name: str = result.download_pdf()
209 with fitz.open(doc_file_name) as doc_file:
210 text: str = "".join(page.get_text() for page in doc_file)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/arxiv/__init__.py:214, in Result.download_pdf(self, dirpath, filename)
212 filename = self._get_default_filename()
213 path = os.path.join(dirpath, filename)
--> 214 written_path, _ = urlretrieve(self.pdf_url, path)
215 return written_path

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:241, in urlretrieve(url, filename, reporthook, data)
224 """
225 Retrieve a URL into a temporary location on disk.
226
(...)
237 data file as well as the resulting HTTPMessage object.
238 """
239 url_type, path = _splittype(url)
--> 241 with contextlib.closing(urlopen(url, data)) as fp:
242 headers = fp.info()
244 # Just return the local path and the "headers" for file://
245 # URLs. No sense in performing a copy unless requested.

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:216, in urlopen(url, data, timeout, cafile, capath, cadefault, context)
214 else:
215 opener = _opener
--> 216 return opener.open(url, data, timeout)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:525, in OpenerDirector.open(self, fullurl, data, timeout)
523 for processor in self.process_response.get(protocol, []):
524 meth = getattr(processor, meth_name)
--> 525 response = meth(req, response)
527 return response

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:634, in HTTPErrorProcessor.http_response(self, request, response)
631 # According to RFC 2616, "2xx" code indicates that the client's
632 # request was successfully received, understood, and accepted.
633 if not (200 <= code < 300):
--> 634 response = self.parent.error(
635 'http', request, response, code, msg, hdrs)
637 return response

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:563, in OpenerDirector.error(self, proto, *args)
561 if http_err:
562 args = (dict, 'default', 'http_error_default') + orig_args
--> 563 return self._call_chain(*args)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:496, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)
494 for handler in handlers:
495 func = getattr(handler, meth_name)
--> 496 result = func(*args)
497 if result is not None:
498 return result

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:643, in HTTPDefaultErrorHandler.http_error_default(self, req, fp, code, msg, hdrs)
642 def http_error_default(self, req, fp, code, msg, hdrs):
--> 643 raise HTTPError(req.full_url, code, msg, hdrs, fp)

HTTPError: HTTP Error 404: Not Found

Description

I am trying to use ArxivLoader from langchain_community.document_loaders to load 100 academic articles, with metadata, from the arXiv database for certain keywords, but I get the error above. The same code ran successfully before; this time it fails with a 404 error.

System Info

MacOS, Python 3.10.13

pip freeze | grep langchain

langchain==0.1.16
langchain-community==0.0.34
langchain-core==0.1.46
langchain-openai==0.0.6
langchain-text-splitters==0.0.1
langchainhub==0.1.14

jkpawlowski96 · 2024-04-29T19:19:42Z

I decided to support the project and fix this error.
Can someone take a look at #21042?

…21042)

### Description:
When attempting to download PDF files from arXiv, an unexpected 404 error frequently occurs. This error halts the operation, regardless of whether there are additional documents to process. As a solution, I suggest implementing a mechanism to ignore and communicate this error and continue processing the next document from the list.

Proposed Solution:
To address the issue of unexpected 404 errors during PDF downloads from arXiv, I propose implementing the following solution:
- Error Handling: Implement error handling mechanisms to catch and handle 404 errors gracefully.
- Communication: Inform the user or logging system about the occurrence of the 404 error.
- Continued Processing: After encountering a 404 error, continue processing the remaining documents from the list without interruption.

This solution ensures that the application can handle unexpected errors without terminating the entire operation. It promotes resilience and robustness in the face of intermittent issues encountered during PDF downloads from arXiv.

### Issue: #20909
### Dependencies: none

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
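The three points of the proposed solution (catch the 404, report it, keep going) can be sketched with the standard library alone. This is a minimal illustration of the pattern, not the code merged in #21042: the `download_all` helper and its `fetch` callback are hypothetical names introduced here, and only the `continue_on_failure` flag name is borrowed from the thread.

```python
import logging
from typing import Callable, Iterable, List
from urllib.error import HTTPError

logger = logging.getLogger(__name__)


def download_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    continue_on_failure: bool = True,
) -> List[str]:
    """Fetch every URL and return the paths that `fetch` produced.

    `fetch` is a hypothetical callback standing in for a real download
    function such as the one that raised in the traceback above.
    """
    paths: List[str] = []
    for url in urls:
        try:
            paths.append(fetch(url))
        except HTTPError as err:
            # Only 404s are skipped; any other HTTP error still propagates.
            if continue_on_failure and err.code == 404:
                logger.warning("Skipping %s: %s", url, err)
                continue
            raise
    return paths
```

With `continue_on_failure=False` the helper behaves like the original loader: the first 404 propagates and ends the whole run.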

jkpawlowski96 · 2024-04-30T19:53:45Z

@Ahmetyasin This is solved. After the next release, you can prevent this from happening again by creating the loader like this:

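from langchain_community.document_loaders import ArxivLoader

# continue_on_failure=True reports a 404 on a PDF download and moves on
# to the next paper instead of raising.
loader = ArxivLoader(
    query="Advanced regression models for real estate price prediction",
    load_max_docs=100,
    load_all_available_meta=True,
    continue_on_failure=True,
)
docs = loader.load()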

This was referenced Apr 29, 2024

community: Skip unexpected 404 HTTP Error in Arxiv download jkpawlowski96/langchain#1

Closed

community: Skip unexpected 404 HTTP Error in Arxiv download #21042

Merged
