Getting "HTTP Error 404: Not Found" Error Using ArxivLoader #20909
Labels
🤖:bug
Related to a bug, vulnerability, unexpected error with an existing feature
Ɑ: doc loader
Related to document loader module (not documentation)
Example Code
from langchain_community.document_loaders import ArxivLoader
docs = ArxivLoader(query = "Advanced regression models for real estate price prediction", load_max_docs=100, load_all_available_meta=True).load()
Error Message and Stack Trace (if applicable)
HTTPError Traceback (most recent call last)
Cell In[9], line 7
1 from langchain_community.document_loaders import ArxivLoader
3 #ArxivLoader
4 #HTTPError: HTTP Error 404: Not Found
5
6 # Load documents using ArxivLoader with the concise_query
----> 7 docs = ArxivLoader(query = "Advanced regression models for real estate price prediction", load_max_docs=100, load_all_available_meta=True).load()
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain_community/document_loaders/arxiv.py:27, in ArxivLoader.load(self)
26 def load(self) -> List[Document]:
---> 27 return self.client.load(self.query)
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain_community/utilities/arxiv.py:208, in ArxivAPIWrapper.load(self, query)
206 for result in results:
207 try:
--> 208 doc_file_name: str = result.download_pdf()
209 with fitz.open(doc_file_name) as doc_file:
210 text: str = "".join(page.get_text() for page in doc_file)
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/arxiv/__init__.py:214, in Result.download_pdf(self, dirpath, filename)
212 filename = self._get_default_filename()
213 path = os.path.join(dirpath, filename)
--> 214 written_path, _ = urlretrieve(self.pdf_url, path)
215 return written_path
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:241, in urlretrieve(url, filename, reporthook, data)
224 """
225 Retrieve a URL into a temporary location on disk.
226
(...)
237 data file as well as the resulting HTTPMessage object.
238 """
239 url_type, path = _splittype(url)
--> 241 with contextlib.closing(urlopen(url, data)) as fp:
242 headers = fp.info()
244 # Just return the local path and the "headers" for file://
245 # URLs. No sense in performing a copy unless requested.
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:216, in urlopen(url, data, timeout, cafile, capath, cadefault, context)
214 else:
215 opener = _opener
--> 216 return opener.open(url, data, timeout)
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:525, in OpenerDirector.open(self, fullurl, data, timeout)
523 for processor in self.process_response.get(protocol, []):
524 meth = getattr(processor, meth_name)
--> 525 response = meth(req, response)
527 return response
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:634, in HTTPErrorProcessor.http_response(self, request, response)
631 # According to RFC 2616, "2xx" code indicates that the client's
632 # request was successfully received, understood, and accepted.
633 if not (200 <= code < 300):
--> 634 response = self.parent.error(
635 'http', request, response, code, msg, hdrs)
637 return response
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:563, in OpenerDirector.error(self, proto, *args)
561 if http_err:
562 args = (dict, 'default', 'http_error_default') + orig_args
--> 563 return self._call_chain(*args)
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:496, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)
494 for handler in handlers:
495 func = getattr(handler, meth_name)
--> 496 result = func(*args)
497 if result is not None:
498 return result
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:643, in HTTPDefaultErrorHandler.http_error_default(self, req, fp, code, msg, hdrs)
642 def http_error_default(self, req, fp, code, msg, hdrs):
--> 643 raise HTTPError(req.full_url, code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
Description
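I am trying to use ArxivLoader from langchain_community.document_loaders to load up to 100 arXiv papers, with full metadata, for a keyword query. The call fails with "HTTP Error 404: Not Found". The same code ran successfully before. According to the traceback, the error is raised while a result's PDF is being downloaded (result.download_pdf() inside ArxivAPIWrapper.load), not during the search itself.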
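A possible workaround, offered as a sketch rather than a fix for ArxivLoader itself: since the 404 comes from fetching a single result's pdf_url, the download step can be wrapped so that one missing PDF does not abort the whole batch. The helper name download_available_pdfs is hypothetical. It only assumes that each result exposes download_pdf(dirpath=...) and pdf_url, which is what arxiv.Result shows in the traceback above.

```python
from urllib.error import HTTPError


def download_available_pdfs(results, dirpath="."):
    """Download each result's PDF, skipping results whose PDF URL fails.

    `results` is any iterable of objects with a `download_pdf(dirpath=...)`
    method and a `pdf_url` attribute (assumed to match arxiv.Result).
    Returns (downloaded_paths, skipped), where `skipped` is a list of
    (pdf_url, http_status_code) pairs for the downloads that failed.
    """
    downloaded, skipped = [], []
    for result in results:
        try:
            downloaded.append(result.download_pdf(dirpath=dirpath))
        except HTTPError as err:
            # Record the failing URL and status instead of aborting the loop.
            skipped.append((result.pdf_url, err.code))
    return downloaded, skipped
```

If the installed arxiv package provides arxiv.Client and arxiv.Search (an assumption about the version in use), the results could come from arxiv.Client().results(arxiv.Search(query="...", max_results=100)). The skipped list then shows which PDF URLs returned 404, which should help narrow down the paper that triggers the error.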
System Info
macOS, Python 3.10.13
pip freeze | grep langchain
langchain==0.1.16
langchain-community==0.0.34
langchain-core==0.1.46
langchain-openai==0.0.6
langchain-text-splitters==0.0.1
langchainhub==0.1.14