Getting "HTTP Error 404: Not Found" Error Using ArxivLoader #20909

Open
5 tasks done
Ahmetyasin opened this issue Apr 25, 2024 · 2 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: doc loader Related to document loader module (not documentation)

Comments

@Ahmetyasin

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_community.document_loaders import ArxivLoader

docs = ArxivLoader(query = "Advanced regression models for real estate price prediction", load_max_docs=100, load_all_available_meta=True).load()

Error Message and Stack Trace (if applicable)


HTTPError Traceback (most recent call last)
Cell In[9], line 7
1 from langchain_community.document_loaders import ArxivLoader
3 #ArxivLoader
4 #HTTPError: HTTP Error 404: Not Found
5
6 # Load documents using ArxivLoader with the concise_query
----> 7 docs = ArxivLoader(query = "Advanced regression models for real estate price prediction", load_max_docs=100, load_all_available_meta=True).load()

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain_community/document_loaders/arxiv.py:27, in ArxivLoader.load(self)
26 def load(self) -> List[Document]:
---> 27 return self.client.load(self.query)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain_community/utilities/arxiv.py:208, in ArxivAPIWrapper.load(self, query)
206 for result in results:
207 try:
--> 208 doc_file_name: str = result.download_pdf()
209 with fitz.open(doc_file_name) as doc_file:
210 text: str = "".join(page.get_text() for page in doc_file)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/arxiv/__init__.py:214, in Result.download_pdf(self, dirpath, filename)
212 filename = self._get_default_filename()
213 path = os.path.join(dirpath, filename)
--> 214 written_path, _ = urlretrieve(self.pdf_url, path)
215 return written_path

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:241, in urlretrieve(url, filename, reporthook, data)
224 """
225 Retrieve a URL into a temporary location on disk.
226
(...)
237 data file as well as the resulting HTTPMessage object.
238 """
239 url_type, path = _splittype(url)
--> 241 with contextlib.closing(urlopen(url, data)) as fp:
242 headers = fp.info()
244 # Just return the local path and the "headers" for file://
245 # URLs. No sense in performing a copy unless requested.

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:216, in urlopen(url, data, timeout, cafile, capath, cadefault, context)
214 else:
215 opener = _opener
--> 216 return opener.open(url, data, timeout)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:525, in OpenerDirector.open(self, fullurl, data, timeout)
523 for processor in self.process_response.get(protocol, []):
524 meth = getattr(processor, meth_name)
--> 525 response = meth(req, response)
527 return response

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:634, in HTTPErrorProcessor.http_response(self, request, response)
631 # According to RFC 2616, "2xx" code indicates that the client's
632 # request was successfully received, understood, and accepted.
633 if not (200 <= code < 300):
--> 634 response = self.parent.error(
635 'http', request, response, code, msg, hdrs)
637 return response

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:563, in OpenerDirector.error(self, proto, *args)
561 if http_err:
562 args = (dict, 'default', 'http_error_default') + orig_args
--> 563 return self._call_chain(*args)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:496, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)
494 for handler in handlers:
495 func = getattr(handler, meth_name)
--> 496 result = func(*args)
497 if result is not None:
498 return result

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:643, in HTTPDefaultErrorHandler.http_error_default(self, req, fp, code, msg, hdrs)
642 def http_error_default(self, req, fp, code, msg, hdrs):
--> 643 raise HTTPError(req.full_url, code, msg, hdrs, fp)

HTTPError: HTTP Error 404: Not Found

Description

I am trying to use ArxivLoader from langchain_community.document_loaders to load 100 academic articles, with metadata, from the arXiv database for a keyword query. The call fails with the error above. The same code ran successfully before, but this time it raises a 404 error.

System Info

MacOS, Python 3.10.13

pip freeze | grep langchain

langchain==0.1.16
langchain-community==0.0.34
langchain-core==0.1.46
langchain-openai==0.0.6
langchain-text-splitters==0.0.1
langchainhub==0.1.14

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Apr 25, 2024
@jkpawlowski96
Contributor

I decided to support the project and fix this error.
Can someone take a look at #21042?

baskaryan added a commit that referenced this issue Apr 30, 2024
…21042)

### Description:
When attempting to download PDF files from arXiv, an unexpected 404
error frequently occurs. This error halts the operation, regardless of
whether there are additional documents to process. As a solution, I
suggest implementing a mechanism to ignore and communicate this error
and continue processing the next document from the list.

Proposed Solution: To address the issue of unexpected 404 errors during
PDF downloads from arXiv, I propose implementing the following solution:

- Error Handling: Implement error handling mechanisms to catch and
handle 404 errors gracefully.
- Communication: Inform the user or logging system about the occurrence
of the 404 error.
- Continued Processing: After encountering a 404 error, continue
processing the remaining documents from the list without interruption.

This solution ensures that the application can handle unexpected errors
without terminating the entire operation. It promotes resilience and
robustness in the face of intermittent issues encountered during PDF
downloads from arXiv.

### Issue:
#20909 
### Dependencies:
none

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
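
The approach described in the commit message above (catch the 404, report it, keep going) can be illustrated with a minimal, generic sketch. This is not the code merged in the referenced commit: `download_all` and its `download` callback are hypothetical names used only for illustration, and only the `continue_on_failure` flag name is taken from this thread.

```python
import logging
from typing import Callable, Iterable, List
from urllib.error import HTTPError

logger = logging.getLogger(__name__)


def download_all(
    urls: Iterable[str],
    download: Callable[[str], str],
    continue_on_failure: bool = True,
) -> List[str]:
    """Download each URL and return the local paths of the successful ones.

    `download` is a hypothetical callback that fetches one URL and returns
    the path it wrote to; it may raise urllib.error.HTTPError.
    """
    paths: List[str] = []
    for url in urls:
        try:
            paths.append(download(url))
        except HTTPError as err:
            if not continue_on_failure:
                # Preserve the old behaviour: the first failure aborts the run.
                raise
            # Report the failure, then move on to the next document.
            logger.warning("Skipping %s: HTTP %s %s", url, err.code, err.reason)
    return paths
```

With `continue_on_failure=False` the first `HTTPError` still propagates, which matches the behaviour reported in this issue; with `True`, failed downloads are logged and omitted from the result.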
@jkpawlowski96
Contributor

jkpawlowski96 commented Apr 30, 2024

@Ahmetyasin this is fixed. Once the next release is out, you can avoid the failure by creating the loader like this:

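from langchain_community.document_loaders import ArxivLoader

loader = ArxivLoader(
    query="Advanced regression models for real estate price prediction",
    load_max_docs=100,
    load_all_available_meta=True,
    continue_on_failure=True,  # skip papers whose PDF download fails
)
docs = loader.load()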
