arXiv is a project by the Cornell University Library that provides open access to 1,000,000+ articles in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics.
query: an arXiv query string. Advanced query formats are documented in the arXiv API User's Manual.

id_list: a list of arXiv record IDs (typically of the format "0710.5765v1"). See the arXiv API User's Manual for documentation of the interaction between query and id_list.

max_results: the maximum number of results to be returned in an execution of this search. To fetch every result available, set max_results=float('inf') (the default); to fetch up to 10 results, set max_results=10. The API's limit is 300,000 results.

sort_by: the sort criterion for results: relevance, lastUpdatedDate, or submittedDate.

sort_order: the sort order for results: 'descending' or 'ascending'.

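As a rough illustration of how these parameters might reach the API, the sketch below builds a request URL using only the standard library. The parameter names (search_query, id_list, sortBy, sortOrder, start, max_results) and the endpoint are taken from the module source reproduced in this document; the helper name build_query_url and its defaults are hypothetical and are not part of the package.

```python
from urllib.parse import urlencode

# Endpoint format as it appears in the package source; treat it as an assumption.
QUERY_URL_FORMAT = "http://export.arxiv.org/api/query?{}"


def build_query_url(query="", id_list=(), sort_by="relevance",
                    sort_order="descending", start=0, page_size=10):
    """Hypothetical helper: encode search parameters into an API request URL."""
    url_args = {
        "search_query": query,
        # The API expects a comma-separated list of IDs.
        "id_list": ",".join(id_list),
        "sortBy": sort_by,
        "sortOrder": sort_order,
        # Paging parameters: index of the first result and the page size.
        "start": start,
        "max_results": page_size,
    }
    return QUERY_URL_FORMAT.format(urlencode(url_args))


url = build_query_url(query="quantum", sort_by="submittedDate", page_size=10)
print(url)
```

Note that max_results in the URL is the size of a single page, not the overall limit set on the Search; the client is responsible for requesting further pages.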
To fetch arXiv records matching a Search, use search.results() or (Client).results(search) to get a generator yielding Results.

Example: fetching results

Print the titles of the 10 most recent articles related to the keyword "quantum":
result.links: Up to three URLs associated with this result, as arxiv.Result.Link objects.

result.pdf_url: A URL for the result's PDF if present. Note: this URL also appears among result.links.

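The relationship between result.links and result.pdf_url can be sketched as follows, assuming (as in the module source reproduced in this document) that the PDF link is the one whose title is "pdf". The Link namedtuple is a stand-in for the package's link objects, find_pdf_url is a hypothetical name, and the URLs are illustrative.

```python
from collections import namedtuple

# Stand-in for the package's link objects; only the fields used here.
Link = namedtuple("Link", ["href", "title"])


def find_pdf_url(links):
    """Return the href of the first link titled 'pdf', or None if absent."""
    pdf_urls = [link.href for link in links if link.title == "pdf"]
    return pdf_urls[0] if pdf_urls else None


links = [
    Link(href="http://arxiv.org/abs/1605.08386v1", title=None),
    Link(href="http://arxiv.org/pdf/1605.08386v1", title="pdf"),
]
print(find_pdf_url(links))
```

Because the value may be None when no PDF link is present, callers that download papers may want to check it first.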
Results also expose helper methods for downloading papers: (Result).download_pdf() and (Result).download_source().

Example: downloading papers

To download a PDF of the paper with ID "1605.08386v1", run a Search and then use (Result).download_pdf():

import arxiv

paper = next(arxiv.Search(id_list=["1605.08386v1"]).results())
# Download the PDF to the PWD with a default filename.
paper.download_pdf()
# Download the PDF to the PWD with a custom filename.
paper.download_pdf(filename="downloaded-paper.pdf")
# Download the PDF to a specified directory with a custom filename.
paper.download_pdf(dirpath="./mydir", filename="downloaded-paper.pdf")

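When no filename is given, the module source reproduced in this document derives one from the short ID and the title. The standalone sketch below mirrors that logic under the assumption that it matches the installed version; default_filename is a hypothetical name and the title is made up for illustration.

```python
import re


def default_filename(entry_id, title, extension="pdf"):
    """Hypothetical helper: build '{short_id}.{clean_title}.{extension}'."""
    # The short ID is whatever follows 'arxiv.org/abs/' in the entry URL.
    short_id = entry_id.split("arxiv.org/abs/")[-1]
    nonempty_title = title if title else "UNTITLED"
    # Keep only runs of word characters, joined by underscores.
    clean_title = "_".join(re.findall(r"\w+", nonempty_title))
    return "{}.{}.{}".format(short_id, clean_title, extension)


name = default_filename("http://arxiv.org/abs/1605.08386v1", "An Example: Title")
print(name)
```

A legacy identifier such as "quant-ph/0201082v1" contains a slash, so passing an explicit filename may be the safer choice for older papers.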
The same interface is available for downloading .tar.gz files of the paper source:

import arxiv

paper = next(arxiv.Search(id_list=["1605.08386v1"]).results())
# Download the archive to the PWD with a default filename.
paper.download_source()
# Download the archive to the PWD with a custom filename.
paper.download_source(filename="downloaded-paper.tar.gz")
# Download the archive to a specified directory with a custom filename.
paper.download_source(dirpath="./mydir", filename="downloaded-paper.tar.gz")

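In the module source reproduced in this document, the source archive URL is derived from the PDF URL by swapping one path segment. The sketch below shows that substitution in isolation; source_url_from_pdf_url is a hypothetical name, and whether arXiv keeps serving this URL scheme is an assumption.

```python
def source_url_from_pdf_url(pdf_url):
    """Hypothetical helper: map a '/pdf/' URL to the matching '/src/' URL."""
    return pdf_url.replace("/pdf/", "/src/")


src = source_url_from_pdf_url("http://arxiv.org/pdf/1605.08386v1")
print(src)
```

Since the mapping depends on the PDF URL, a result without a PDF link cannot be downloaded this way.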
Client

A Client specifies a strategy for fetching results from arXiv's API; it hides pagination and retry logic from the caller.

For most use cases the default client should suffice. You can construct it explicitly with arxiv.Client(), or use it via the (Search).results() method.
page_size: the number of papers to fetch from arXiv per page of results. Smaller pages can be retrieved faster, but may require more round-trips. The API's limit is 2000 results per page.

delay_seconds: the number of seconds to wait between requests for pages. arXiv's Terms of Use ask that you "make no more than one request every three seconds."

num_retries: the number of times the client will retry a request that fails, either with a non-200 HTTP status code or with an unexpectedly empty page of results.

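To make the interplay of delay_seconds and num_retries concrete, here is a minimal, library-independent sketch of a rate-limited retry loop. It is not the package's implementation: fetch_with_retries, its parameters, and the flaky_fetch stand-in are all hypothetical, and the real client decides what counts as a failure differently.

```python
import time


def fetch_with_retries(fetch, num_retries=3, delay_seconds=3.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call fetch() up to 1 + num_retries times, spacing calls by delay_seconds.

    fetch is any zero-argument callable that returns a page or raises.
    """
    last_request = None
    last_err = None
    for _attempt in range(num_retries + 1):
        # If the previous request was too recent, wait out the remainder.
        if last_request is not None:
            remaining = delay_seconds - (clock() - last_request)
            if remaining > 0:
                sleep(remaining)
        try:
            return fetch()
        except Exception as err:  # sketch only: real code should be selective
            last_err = err
        finally:
            last_request = clock()
    # Every try failed: surface the last error encountered.
    raise last_err


attempts = []


def flaky_fetch():
    """Stand-in request that fails twice before succeeding."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("simulated failure")
    return "page"


result = fetch_with_retries(flaky_fetch, num_retries=5, delay_seconds=0)
print(result, len(attempts))
```

Injecting sleep and clock keeps the sketch testable without real waiting; with the default arguments it would pause between tries as the Terms of Use request.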
Example: fetching results with a custom client

(Search).results() uses the default client settings. If you want to use a client you've defined instead of the defaults, use (Client).results(...):

import arxiv

big_slow_client = arxiv.Client(
    page_size=1000,
    delay_seconds=10,
    num_retries=5
)

# Prints 1000 titles before needing to make another request.
for result in big_slow_client.results(arxiv.Search(query="quantum")):
    print(result.title)

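The trade-off between page size and delay can be estimated with a little arithmetic. The sketch below plans the (offset, size) requests needed for a given number of results, assuming every page comes back full; plan_pages is a hypothetical helper, and the 2500-result total is an arbitrary illustration.

```python
def plan_pages(total_results, page_size):
    """Yield (offset, size) pairs that together cover total_results items."""
    offset = 0
    while offset < total_results:
        # The last page may be smaller than page_size.
        size = min(page_size, total_results - offset)
        yield offset, size
        offset += size


pages = list(plan_pages(total_results=2500, page_size=1000))
# With delay_seconds=10, consecutive requests are at least 10 seconds apart.
min_wait_seconds = (len(pages) - 1) * 10
print(pages, min_wait_seconds)
```

Under these assumptions, three requests are needed and at least 20 seconds are spent waiting, before counting network time or retries.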
Example: logging

To inspect this package's network behavior and API logic, configure an INFO-level logger.

>>> import logging, arxiv
>>> logging.basicConfig(level=logging.INFO)
>>> paper = next(arxiv.Search(id_list=["1605.08386v1"]).results())
INFO:arxiv.arxiv:Requesting 100 results at offset 0
INFO:arxiv.arxiv:Requesting page of results
INFO:arxiv.arxiv:Got first page; 1 of inf results available

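If configuring the root logger is too noisy, one option is to attach a handler to the package's logger alone. The sketch below assumes the logger name "arxiv.arxiv" seen in the output above; ListHandler is a hypothetical in-memory handler, and the final call only simulates a message the package might emit.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted log messages in memory (for illustration)."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(self.format(record))


handler = ListHandler()
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

# "arxiv" is the parent of the "arxiv.arxiv" logger, so it receives its records.
package_logger = logging.getLogger("arxiv")
package_logger.setLevel(logging.INFO)
package_logger.addHandler(handler)

# Simulate a message from the package's module-level logger.
logging.getLogger("arxiv.arxiv").info("Requesting %d results at offset %d", 100, 0)
print(handler.messages)
```

In real use a logging.StreamHandler or logging.FileHandler would replace ListHandler; the level and logger name are the parts that matter.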
import logging
import time
import feedparser
import re
import os
import warnings

from urllib.parse import urlencode
from urllib.request import urlretrieve
from datetime import datetime, timedelta, timezone
from calendar import timegm

from enum import Enum
from typing import Dict, Generator, List

logger = logging.getLogger(__name__)

_DEFAULT_TIME = datetime.min


class Result(object):
    """
    An entry in an arXiv query results feed.

    See [the arXiv API User's Manual: Details of Atom Results
    Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
    """

    entry_id: str
    """A URL of the form `http://arxiv.org/abs/{id}`."""
    updated: datetime
    """When the result was last updated."""
    published: datetime
    """When the result was originally published."""
    title: str
    """The title of the result."""
    authors: list
    """The result's authors."""
    summary: str
    """The result abstract."""
    comment: str
    """The authors' comment if present."""
    journal_ref: str
    """A journal reference if present."""
    doi: str
    """A URL for the resolved DOI to an external resource if present."""
    primary_category: str
    """
    The result's primary arXiv category. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    categories: List[str]
    """
    All of the result's categories. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    links: list
    """Up to three URLs associated with this result."""
    pdf_url: str
    """The URL of a PDF version of this result if present among links."""
    _raw: feedparser.FeedParserDict
    """
    The raw feedparser result object if this Result was constructed with
    Result._from_feed_entry.
    """

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List['Result.Author'] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List['Result.Link'] = [],
        _raw: feedparser.FeedParserDict = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories
        self.links = links
        # Calculated members
        self.pdf_url = Result._get_pdf_url(links)
        # Debugging
        self._raw = _raw

    @staticmethod
    def _from_feed_entry(entry: feedparser.FeedParserDict) -> 'Result':
        """
        Converts a feedparser entry for an arXiv search result feed into a
        Result object.
        """
        if not hasattr(entry, "id"):
            raise Result.MissingFieldError("id")
        # Title attribute may be absent for certain titles. Defaulting to "0" as
        # it's the only title observed to cause this bug.
        # https://github.com/lukasschwab/arxiv.py/issues/71
        # title = entry.title if hasattr(entry, "title") else "0"
        title = "0"
        if hasattr(entry, "title"):
            title = entry.title
        else:
            logger.warning(
                "Result %s is missing title attribute; defaulting to '0'",
                entry.id
            )
        return Result(
            entry_id=entry.id,
            updated=Result._to_datetime(entry.updated_parsed),
            published=Result._to_datetime(entry.published_parsed),
            title=re.sub(r'\s+', ' ', title),
            authors=[Result.Author._from_feed_author(a) for a in entry.authors],
            summary=entry.summary,
            comment=entry.get('arxiv_comment'),
            journal_ref=entry.get('arxiv_journal_ref'),
            doi=entry.get('arxiv_doi'),
            primary_category=entry.arxiv_primary_category.get('term'),
            categories=[tag.get('term') for tag in entry.tags],
            links=[Result.Link._from_feed_link(link) for link in entry.links],
            _raw=entry
        )

    def __str__(self) -> str:
        return self.entry_id

    def __repr__(self) -> str:
        return (
            '{}(entry_id={}, updated={}, published={}, title={}, authors={}, '
            'summary={}, comment={}, journal_ref={}, doi={}, '
            'primary_category={}, categories={}, links={})'
        ).format(
            _classname(self),
            repr(self.entry_id),
            repr(self.updated),
            repr(self.published),
            repr(self.title),
            repr(self.authors),
            repr(self.summary),
            repr(self.comment),
            repr(self.journal_ref),
            repr(self.doi),
            repr(self.primary_category),
            repr(self.categories),
            repr(self.links)
        )

    def __eq__(self, other) -> bool:
        if isinstance(other, Result):
            return self.entry_id == other.entry_id
        return False

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"http://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `"2107.05580v1"`.

        + If the result URL is `"http://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split('arxiv.org/abs/')[-1]

    def _get_default_filename(self, extension: str = "pdf") -> str:
        """
        Builds a default filename of the form
        `{short_id}.{clean_title}.{extension}` for the extension given.
        """
        nonempty_title = self.title if self.title else "UNTITLED"
        # Remove disallowed characters.
        clean_title = '_'.join(re.findall(r'\w+', nonempty_title))
        return "{}.{}.{}".format(self.get_short_id(), clean_title, extension)

    def download_pdf(self, dirpath: str = './', filename: str = '') -> str:
        """
        Downloads the PDF for this result to the specified directory.

        If `filename` is empty, a default filename is generated by
        `_get_default_filename`. Returns the path written.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        written_path, _ = urlretrieve(self.pdf_url, path)
        return written_path

    def download_source(self, dirpath: str = './', filename: str = '') -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory.

        If `filename` is empty, a default filename is generated by
        `_get_default_filename('tar.gz')`. Returns the path written.
        """
        if not filename:
            filename = self._get_default_filename('tar.gz')
        path = os.path.join(dirpath, filename)
        # Bodge: construct the source URL from the PDF URL.
        source_url = self.pdf_url.replace('/pdf/', '/src/')
        written_path, _ = urlretrieve(source_url, path)
        return written_path

    @staticmethod
    def _get_pdf_url(links: list) -> str:
        """
        Finds the PDF link among a result's links and returns its URL, or
        `None` if there is no PDF link.

        Should only be called once for a given `Result`, in its constructor.
        After construction, the URL should be available in `Result.pdf_url`.
        """
        pdf_urls = [link.href for link in links if link.title == 'pdf']
        if len(pdf_urls) == 0:
            return None
        elif len(pdf_urls) > 1:
            logger.warning(
                "Result has multiple PDF links; using %s",
                pdf_urls[0]
            )
        return pdf_urls[0]

    @staticmethod
    def _to_datetime(ts: time.struct_time) -> datetime:
        """
        Converts a UTC time.struct_time into a time-zone-aware datetime.

        This will be replaced with feedparser functionality [when it becomes
        available](https://github.com/kurtmckee/feedparser/issues/212).
        """
        return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)

    class Author(object):
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        @staticmethod
        def _from_feed_author(
            feed_author: feedparser.FeedParserDict
        ) -> 'Result.Author':
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return '{}({})'.format(_classname(self), repr(self.name))

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False

    class Link(object):
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: str
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        @staticmethod
        def _from_feed_link(
            feed_link: feedparser.FeedParserDict
        ) -> 'Result.Link':
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get('title'),
                rel=feed_link.get('rel'),
                content_type=feed_link.get('content_type')
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return '{}({}, title={}, rel={}, content_type={})'.format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type)
            )

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False

    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return '{}({})'.format(
                _classname(self),
                repr(self.missing_field)
            )


class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    """
    A SortOrder indicates the order in which search results are sorted
    according to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """
    Ascending = "ascending"
    Descending = "descending"


class Search(object):
    """
    A specification for a search of arXiv's database.

    To run a search, use `Search.results` to use a default client or
    `Client.results` with a specific client.
    """

    query: str
    """
    A query string.

    See [the arXiv API User's Manual: Details of Query
    Construction](https://arxiv.org/help/api/user-manual#query_details).
    """
    id_list: list
    """
    A list of arXiv article IDs to which to limit the search.

    See [the arXiv API User's
    Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
    for documentation of the interaction between `query` and `id_list`.
    """
    max_results: float
    """
    The maximum number of results to be returned in an execution of this
    search.

    To fetch every result available, set `max_results=float('inf')`.
    """
    sort_by: SortCriterion
    """The sort criterion for results."""
    sort_order: SortOrder
    """The sort order for results."""

    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: float = float('inf'),
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list
        self.max_results = max_results
        self.sort_by = sort_by
        self.sort_order = sort_order

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return (
            '{}(query={}, id_list={}, max_results={}, sort_by={}, '
            'sort_order={})'
        ).format(
            _classname(self),
            repr(self.query),
            repr(self.id_list),
            repr(self.max_results),
            repr(self.sort_by),
            repr(self.sort_order)
        )

    def _url_args(self) -> Dict[str, str]:
        """
        Returns a dict of search parameters that should be included in an API
        request for this search.
        """
        return {
            "search_query": self.query,
            "id_list": ','.join(self.id_list),
            "sortBy": self.sort_by.value,
            "sortOrder": self.sort_order.value
        }

    def get(self) -> Generator[Result, None, None]:
        """
        **Deprecated** after 1.2.0; use `Search.results`.
        """
        warnings.warn(
            "The 'get' method is deprecated, use 'results' instead",
            DeprecationWarning,
            stacklevel=2
        )
        return self.results()

    def results(self) -> Generator[Result, None, None]:
        """
        Executes the specified search using a default arXiv API client.

        For info on default behavior, see `Client.__init__` and `Client.results`.
        """
        return Client().results(self)


class Client(object):
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class hides pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = 'http://export.arxiv.org/api/query?{}'
    """The arXiv query API endpoint format."""
    page_size: int
    """Maximum number of results fetched in a single API request."""
    delay_seconds: int
    """Number of seconds to wait between API requests."""
    num_retries: int
    """Number of times to retry a failing API request."""
    _last_request_dt: datetime

    def __init__(
        self,
        page_size: int = 100,
        delay_seconds: int = 3,
        num_retries: int = 3
    ):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return '{}(page_size={}, delay_seconds={}, num_retries={})'.format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries)
        )

    def get(self, search: Search) -> Generator[Result, None, None]:
        """
        **Deprecated** after 1.2.0; use `Client.results`.
        """
        warnings.warn(
            "The 'get' method is deprecated, use 'results' instead",
            DeprecationWarning,
            stacklevel=2
        )
        return self.results(search)

    def results(self, search: Search) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        offset = 0
        # total_results may be reduced according to the feed's
        # opensearch:totalResults value.
        total_results = search.max_results
        first_page = True
        while offset < total_results:
            page_size = min(self.page_size, search.max_results - offset)
            logger.info("Requesting {} results at offset {}".format(
                page_size,
                offset,
            ))
            page_url = self._format_url(search, offset, page_size)
            feed = self._parse_feed(page_url, first_page)
            if first_page:
                # NOTE: this is an ugly fix for a known bug. The totalresults
                # value is set to 1 for results with zero entries. If that API
                # bug is fixed, we can remove this conditional and always set
                # `total_results = min(...)`.
                if len(feed.entries) == 0:
                    logger.info("Got empty results; stopping generation")
                    total_results = 0
                else:
                    total_results = min(
                        total_results,
                        int(feed.feed.opensearch_totalresults)
                    )
                    logger.info("Got first page; {} of {} results available".format(
                        total_results,
                        search.max_results
                    ))
                # Subsequent pages are not the first page.
                first_page = False
            # Update offset for next request: account for received results.
            offset += len(feed.entries)
            # Yield query results until page is exhausted.
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError:
                    logger.warning("Skipping partial result")
                    continue

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Constructs a request URL for `search` that returns up to `page_size`
        results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update({
            "start": start,
            "max_results": page_size,
        })
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self,
        url: str,
        first_page: bool = True
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        # Invoke the recursive helper with initial available retries.
        return self.__try_parse_feed(
            url,
            first_page=first_page,
            retries_left=self.num_retries
        )

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        retries_left: int,
        last_err: Exception = None,
    ) -> feedparser.FeedParserDict:
        """
        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
        number of seconds has not passed since the last request was made,
        sleeps until delay_seconds seconds have passed.
        """
        retry = self.num_retries - retries_left
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping for %f seconds", to_sleep)
                time.sleep(to_sleep)
        logger.info("Requesting page of results", extra={
            'url': url,
            'first_page': first_page,
            'retry': retry,
            'last_err': last_err.message if last_err is not None else None,
        })
        feed = feedparser.parse(url)
        self._last_request_dt = datetime.now()
        err = None
        if feed.status != 200:
            err = HTTPError(url, retry, feed)
        elif len(feed.entries) == 0 and not first_page:
            err = UnexpectedEmptyPageError(url, retry)
        if err is not None:
            if retries_left > 0:
                return self.__try_parse_feed(
                    url,
                    first_page=first_page,
                    retries_left=retries_left - 1,
                    last_err=err,
                )
            # Feed was never returned in self.num_retries tries. Raise the last
            # exception encountered.
            raise err
        return feed


class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return '{} ({})'.format(self.message, self.url)


class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """
    def __init__(self, url: str, retry: int):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return '{}({}, {})'.format(
            _classname(self),
            repr(self.url),
            repr(self.retry)
        )


class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status reported by feedparser."""
    entry: feedparser.FeedParserDict
    """The feed entry describing the error, if present."""

    def __init__(self, url: str, retry: int, feed: feedparser.FeedParserDict):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = feed.status
        # If the feed is valid and includes a single entry, trust it's an
        # explanation.
        if not feed.bozo and len(feed.entries) == 1:
            self.entry = feed.entries[0]
        else:
            self.entry = None
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}: {}".format(
                self.status,
                self.entry.summary if self.entry else None,
            ),
        )

    def __repr__(self) -> str:
        return '{}({}, {}, {})'.format(
            _classname(self),
            repr(self.url),
            repr(self.retry),
            repr(self.status)
        )


def _classname(o):
    """
    A helper function for use in __repr__ methods: returns a qualified name
    such as `arxiv.Result.Link`.
    """
    return 'arxiv.{}'.format(o.__class__.__qualname__)

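Two of the small conversions in the module above, extracting a short ID and turning a UTC struct_time into an aware datetime, can be tried without network access. The sketch below restates them as free functions using only the standard library; short_id and to_datetime are hypothetical names, and the inputs are the examples from the get_short_id docstring plus the Unix epoch.

```python
import time
from calendar import timegm
from datetime import datetime, timezone


def short_id(entry_id):
    """Return the part of an entry URL that follows 'arxiv.org/abs/'."""
    return entry_id.split("arxiv.org/abs/")[-1]


def to_datetime(ts):
    """Convert a UTC time.struct_time into a time-zone-aware datetime."""
    # timegm interprets the struct as UTC, unlike time.mktime (local time).
    return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)


print(short_id("http://arxiv.org/abs/2107.05580v1"))
print(short_id("http://arxiv.org/abs/quant-ph/0201082v1"))
print(to_datetime(time.gmtime(0)))
```

Using timegm rather than time.mktime matters here: the feed's parsed timestamps are in UTC, and mktime would shift them by the local offset.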
+ returnResult.Link(
+ href=feed_link.href,
+ title=feed_link.get('title'),
+ rel=feed_link.get('rel'),
+ content_type=feed_link.get('content_type')
+ )
+
+ def__str__(self)->str:
+ returnself.href
+
+ def__repr__(self)->str:
+ return'{}({}, title={}, rel={}, content_type={})'.format(
+ _classname(self),
+ repr(self.href),
+ repr(self.title),
+ repr(self.rel),
+ repr(self.content_type)
+ )
+
+ def__eq__(self,other)->bool:
+ ifisinstance(other,Result.Link):
+ returnself.href==other.href
+ returnFalse
+
+ classMissingFieldError(Exception):
+ """
+ An error indicating an entry is unparseable because it lacks required
+ fields.
+ """
+
+ missing_field:str
+ """The required field missing from the would-be entry."""
+ message:str
+ """Message describing what caused this error."""
+
+ def__init__(self,missing_field):
+ self.missing_field=missing_field
+ self.message="Entry from arXiv missing required info"
+
+ def__repr__(self)->str:
+ return'{}({})'.format(
+ _classname(self),
+ repr(self.missing_field)
+ )
+
def get_short_id(self) -> str:
+    """
+    Returns the short ID for this result.
+
+    + If the result URL is `"http://arxiv.org/abs/2107.05580v1"`,
+    `result.get_short_id()` returns `2107.05580v1`.
+
+    + If the result URL is `"http://arxiv.org/abs/quant-ph/0201082v1"`,
+    `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
+    2007 arXiv identifier format).
+
+    For an explanation of the difference between arXiv's legacy and current
+    identifiers, see [Understanding the arXiv
+    identifier](https://arxiv.org/help/arxiv_identifier).
+    """
+    return self.entry_id.split('arxiv.org/abs/')[-1]
+
+
+
+
+
Returns the short ID for this result.
+
+
+
If the result URL is "http://arxiv.org/abs/2107.05580v1",
+result.get_short_id() returns 2107.05580v1.
+
If the result URL is "http://arxiv.org/abs/quant-ph/0201082v1",
+result.get_short_id() returns "quant-ph/0201082v1" (the pre-March
+2007 arXiv identifier format).
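
As a minimal sketch of the behaviour described above: the short ID is whatever follows `arxiv.org/abs/` in the entry URL. The standalone function name `short_id` is hypothetical and used only for illustration; in the package this logic lives in the `get_short_id` method of `Result`.

```python
def short_id(entry_id: str) -> str:
    """Return the part of an arXiv abstract URL after 'arxiv.org/abs/'."""
    # Hypothetical standalone helper mirroring Result.get_short_id.
    return entry_id.split('arxiv.org/abs/')[-1]


# Current-style identifier.
print(short_id("http://arxiv.org/abs/2107.05580v1"))
# Legacy (pre-March 2007) identifier; note that it contains a slash.
print(short_id("http://arxiv.org/abs/quant-ph/0201082v1"))
```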
def download_pdf(self, dirpath: str = './', filename: str = '') -> str:
+    """
+    Downloads the PDF for this result to the specified directory.
+
+    If `filename` is empty, a default filename is generated by calling
+    `self._get_default_filename()`.
+    """
+    if not filename:
+        filename = self._get_default_filename()
+    path = os.path.join(dirpath, filename)
+    written_path, _ = urlretrieve(self.pdf_url, path)
+    return written_path
+
+
+
+
+
Downloads the PDF for this result to the specified directory.
+
+
If filename is empty, a default filename is generated by calling self._get_default_filename().
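
The default filename can be sketched as follows. This is a standalone approximation of the private `_get_default_filename` helper shown in the source listing; the function name `default_filename` and the sample title are assumptions made for illustration.

```python
import re


def default_filename(short_id: str, title: str, extension: str = "pdf") -> str:
    """Build '<short id>.<cleaned title>.<extension>' (hypothetical helper)."""
    # Fall back to a placeholder when the title is empty.
    nonempty_title = title if title else "UNTITLED"
    # Keep only runs of word characters, joined by underscores.
    clean_title = '_'.join(re.findall(r'\w+', nonempty_title))
    return "{}.{}.{}".format(short_id, clean_title, extension)


print(default_filename("2107.05580v1", "A Title: with/odd chars!"))
```

Only the title is cleaned. A legacy short ID such as `quant-ph/0201082v1` keeps its slash, so the resulting name would contain a path separator.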
def download_source(self, dirpath: str = './', filename: str = '') -> str:
+    """
+    Downloads the source tarfile for this result to the specified
+    directory.
+
+    If `filename` is empty, a default filename is generated by calling
+    `self._get_default_filename('tar.gz')`.
+    """
+    if not filename:
+        filename = self._get_default_filename('tar.gz')
+    path = os.path.join(dirpath, filename)
+    # Bodge: construct the source URL from the PDF URL.
+    source_url = self.pdf_url.replace('/pdf/', '/src/')
+    written_path, _ = urlretrieve(source_url, path)
+    return written_path
+
+
+
+
+
Downloads the source tarfile for this result to the specified
+directory.
+
+
If filename is empty, a default filename is generated by calling self._get_default_filename('tar.gz').
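
The source listing derives the source archive URL from the PDF URL by swapping one path segment. A minimal sketch of that substitution, with a hypothetical function name and an example URL assumed for illustration:

```python
def source_url_from_pdf_url(pdf_url: str) -> str:
    """Derive a source-archive URL by replacing '/pdf/' with '/src/'."""
    # Hypothetical helper; mirrors the replace() call in download_source.
    return pdf_url.replace('/pdf/', '/src/')


# The example URL below is an assumed PDF link, not taken from a real feed.
print(source_url_from_pdf_url("http://arxiv.org/pdf/1605.08386v1"))
```

Because this is plain string replacement, it relies on the PDF URL containing `/pdf/`. If `pdf_url` is `None`, as when a result has no PDF link, the call fails.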
classAuthor(object):
+ """
+ A light inner class for representing a result's authors.
+ """
+
+ name:str
+ """The author's name."""
+
+ def__init__(self,name:str):
+ """
+ Constructs an `Author` with the specified name.
+
+ In most cases, prefer using `Author._from_feed_author` to parsing
+ and constructing `Author`s yourself.
+ """
+ self.name=name
+
+ def_from_feed_author(
+ feed_author:feedparser.FeedParserDict
+ )->'Result.Author':
+ """
+ Constructs an `Author` with the name specified in an author object
+ from a feed entry.
+
+ See usage in `Result._from_feed_entry`.
+ """
+ returnResult.Author(feed_author.name)
+
+ def__str__(self)->str:
+ returnself.name
+
+ def__repr__(self)->str:
+ return'{}({})'.format(_classname(self),repr(self.name))
+
+ def__eq__(self,other)->bool:
+ ifisinstance(other,Result.Author):
+ returnself.name==other.name
+ returnFalse
+
+
+
+
+
A light inner class for representing a result's authors.
def __init__(self, name: str):
+    """
+    Constructs an `Author` with the specified name.
+
+    In most cases, prefer using `Author._from_feed_author` to parsing
+    and constructing `Author`s yourself.
+    """
+    self.name = name
+
+
+
+ class
+ SortCriterion(enum.Enum):
+
+
+
+
class SortCriterion(Enum):
+    """
+    A SortCriterion identifies a property by which search results can be
+    sorted.
+
+    See [the arXiv API User's Manual: sort order for return
+    results](https://arxiv.org/help/api/user-manual#sort).
+    """
+    Relevance = "relevance"
+    LastUpdatedDate = "lastUpdatedDate"
+    SubmittedDate = "submittedDate"
+
+
+
+
+
A SortCriterion identifies a property by which search results can be
+sorted.
class SortOrder(Enum):
+    """
+    A SortOrder indicates the order in which search results are sorted
+    according to the specified arxiv.SortCriterion.
+
+    See [the arXiv API User's Manual: sort order for return
+    results](https://arxiv.org/help/api/user-manual#sort).
+    """
+    Ascending = "ascending"
+    Descending = "descending"
+
+
+
+
+
A SortOrder indicates the order in which search results are sorted according
+to the specified arxiv.SortCriterion.
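
To illustrate how the two enums reach the API, the sketch below redefines them locally with the same members as in the listings and encodes their values as the `sortBy` and `sortOrder` query parameters. The helper name `sort_args` is hypothetical.

```python
from enum import Enum
from urllib.parse import urlencode


class SortCriterion(Enum):
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    Ascending = "ascending"
    Descending = "descending"


def sort_args(sort_by: SortCriterion, sort_order: SortOrder) -> dict:
    """Map the enum members to the API's query parameter values."""
    return {"sortBy": sort_by.value, "sortOrder": sort_order.value}


query_string = urlencode(
    sort_args(SortCriterion.SubmittedDate, SortOrder.Descending)
)
print(query_string)  # sortBy=submittedDate&sortOrder=descending
```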
classSearch(object):
+ """
+ A specification for a search of arXiv's database.
+
+ To run a search, use `Search.results` to use a default client or `Client.results`
+ with a specific client.
+ """
+
+ query:str
+ """
+ A query string.
+
+ See [the arXiv API User's Manual: Details of Query
+ Construction](https://arxiv.org/help/api/user-manual#query_details).
+ """
+ id_list:list
+ """
+ A list of arXiv article IDs to which to limit the search.
+
+ See [the arXiv API User's
+ Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
+ for documentation of the interaction between `query` and `id_list`.
+ """
+ max_results:float
+ """
+ The maximum number of results to be returned in an execution of this
+ search.
+
+ To fetch every result available, set `max_results=float('inf')`.
+ """
+ sort_by:SortCriterion
+ """The sort criterion for results."""
+ sort_order:SortOrder
+ """The sort order for results."""
+
+ def__init__(
+ self,
+ query:str="",
+ id_list:List[str]=[],
+ max_results:float=float('inf'),
+ sort_by:SortCriterion=SortCriterion.Relevance,
+ sort_order:SortOrder=SortOrder.Descending
+ ):
+ """
+ Constructs an arXiv API search with the specified criteria.
+ """
+ self.query=query
+ self.id_list=id_list
+ self.max_results=max_results
+ self.sort_by=sort_by
+ self.sort_order=sort_order
+
+ def__str__(self)->str:
+ # TODO: develop a more informative string representation.
+ returnrepr(self)
+
+ def__repr__(self)->str:
+ return(
+ '{}(query={}, id_list={}, max_results={}, sort_by={}, '
+ 'sort_order={})'
+ ).format(
+ _classname(self),
+ repr(self.query),
+ repr(self.id_list),
+ repr(self.max_results),
+ repr(self.sort_by),
+ repr(self.sort_order)
+ )
+
+ def_url_args(self)->Dict[str,str]:
+ """
+ Returns a dict of search parameters that should be included in an API
+ request for this search.
+ """
+ return{
+ "search_query":self.query,
+ "id_list":','.join(self.id_list),
+ "sortBy":self.sort_by.value,
+ "sortOrder":self.sort_order.value
+ }
+
+ defget(self)->Generator[Result,None,None]:
+ """
+ **Deprecated** after 1.2.0; use `Search.results`.
+ """
+ warnings.warn(
+ "The 'get' method is deprecated, use 'results' instead",
+ DeprecationWarning,
+ stacklevel=2
+ )
+ returnself.results()
+
+ defresults(self)->Generator[Result,None,None]:
+ """
+ Executes the specified search using a default arXiv API client.
+
+ For info on default behavior, see `Client.__init__` and `Client.results`.
+ """
+ returnClient().results(self)
+
+
+
+
+
A specification for a search of arXiv's database.
+
+
To run a search, use Search.results to use a default client or Client.results
+with a specific client.
def results(self) -> Generator[Result, None, None]:
+    """
+    Executes the specified search using a default arXiv API client.
+
+    For info on default behavior, see `Client.__init__` and `Client.results`.
+    """
+    return Client().results(self)
+
+
+
+
+
Executes the specified search using a default arXiv API client.
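
The sketch below shows roughly how a search's fields could be turned into a request URL, following the `_url_args` and `_format_url` helpers in the source listing. The standalone function `build_query_url` and its argument defaults are assumptions for illustration; the endpoint format string is the one that appears in the `Client` listing.

```python
from typing import List
from urllib.parse import urlencode

QUERY_URL_FORMAT = 'http://export.arxiv.org/api/query?{}'


def build_query_url(
    query: str,
    id_list: List[str],
    start: int,
    page_size: int,
    sort_by: str = "relevance",
    sort_order: str = "descending",
) -> str:
    """Assemble a paged query URL (hypothetical standalone helper)."""
    url_args = {
        "search_query": query,
        # IDs are sent as a single comma-separated parameter.
        "id_list": ','.join(id_list),
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "start": start,
        "max_results": page_size,
    }
    return QUERY_URL_FORMAT.format(urlencode(url_args))


print(build_query_url("quantum", [], start=0, page_size=100))
print(build_query_url("", ["1605.08386v1", "0710.5765v1"], start=0, page_size=2))
```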
classClient(object):
+ """
+ Specifies a strategy for fetching results from arXiv's API.
+
+ This class obscures pagination and retry logic, and exposes
+ `Client.results`.
+ """
+
+ query_url_format='http://export.arxiv.org/api/query?{}'
+ """The arXiv query API endpoint format."""
+ page_size:int
+ """Maximum number of results fetched in a single API request."""
+ delay_seconds:int
+ """Number of seconds to wait between API requests."""
+ num_retries:int
+ """Number of times to retry a failing API request."""
+ _last_request_dt:datetime
+
+ def__init__(
+ self,
+ page_size:int=100,
+ delay_seconds:int=3,
+ num_retries:int=3
+ ):
+ """
+ Constructs an arXiv API client with the specified options.
+
+ Note: the default parameters should provide a robust request strategy
+ for most use cases. Extreme page sizes, delays, or retries risk
+ violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
+ brittle behavior, and inconsistent results.
+ """
+ self.page_size=page_size
+ self.delay_seconds=delay_seconds
+ self.num_retries=num_retries
+ self._last_request_dt=None
+
+ def__str__(self)->str:
+ # TODO: develop a more informative string representation.
+ returnrepr(self)
+
+ def__repr__(self)->str:
+ return'{}(page_size={}, delay_seconds={}, num_retries={})'.format(
+ _classname(self),
+ repr(self.page_size),
+ repr(self.delay_seconds),
+ repr(self.num_retries)
+ )
+
+ defget(self,search:Search)->Generator[Result,None,None]:
+ """
+ **Deprecated** after 1.2.0; use `Client.results`.
+ """
+ warnings.warn(
+ "The 'get' method is deprecated, use 'results' instead",
+ DeprecationWarning,
+ stacklevel=2
+ )
+ returnself.results(search)
+
+ defresults(self,search:Search)->Generator[Result,None,None]:
+ """
+ Uses this client configuration to fetch one page of the search results
+ at a time, yielding the parsed `Result`s, until `max_results` results
+ have been yielded or there are no more search results.
+
+ If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.
+
+ For more on using generators, see
+ [Generators](https://wiki.python.org/moin/Generators).
+ """
+ offset=0
+ # total_results may be reduced according to the feed's
+ # opensearch:totalResults value.
+ total_results=search.max_results
+ first_page=True
+ whileoffset<total_results:
+ page_size=min(self.page_size,search.max_results-offset)
+ logger.info("Requesting {} results at offset {}".format(
+ page_size,
+ offset,
+ ))
+ page_url=self._format_url(search,offset,page_size)
+ feed=self._parse_feed(page_url,first_page)
+ iffirst_page:
+ # NOTE: this is an ugly fix for a known bug. The totalresults
+ # value is set to 1 for results with zero entries. If that API
+ # bug is fixed, we can remove this conditional and always set
+ # `total_results = min(...)`.
+ iflen(feed.entries)==0:
+ logger.info("Got empty results; stopping generation")
+ total_results=0
+ else:
+ total_results=min(
+ total_results,
+ int(feed.feed.opensearch_totalresults)
+ )
+ logger.info("Got first page; {} of {} results available".format(
+ total_results,
+ search.max_results
+ ))
+ # Subsequent pages are not the first page.
+ first_page=False
+ # Update offset for next request: account for received results.
+ offset+=len(feed.entries)
+ # Yield query results until page is exhausted.
+ forentryinfeed.entries:
+ try:
+ yieldResult._from_feed_entry(entry)
+ exceptResult.MissingFieldError:
+ logger.warning("Skipping partial result")
+ continue
+
+ def_format_url(self,search:Search,start:int,page_size:int)->str:
+ """
+ Constructs a request URL for `search` that returns up to `page_size`
+ results starting with the result at index `start`.
+ """
+ url_args=search._url_args()
+ url_args.update({
+ "start":start,
+ "max_results":page_size,
+ })
+ returnself.query_url_format.format(urlencode(url_args))
+
+ def_parse_feed(
+ self,
+ url:str,
+ first_page:bool=True
+ )->feedparser.FeedParserDict:
+ """
+ Fetches the specified URL and parses it with feedparser.
+
+ If a request fails or is unexpectedly empty, retries the request up to
+ `self.num_retries` times.
+ """
+ # Invoke the recursive helper with initial available retries.
+ returnself.__try_parse_feed(
+ url,
+ first_page=first_page,
+ retries_left=self.num_retries
+ )
+
+ def__try_parse_feed(
+ self,
+ url:str,
+ first_page:bool,
+ retries_left:int,
+ last_err:Exception=None,
+ )->feedparser.FeedParserDict:
+ """
+ Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
+ number of seconds has not passed since `_parse_feed` was last called,
+ sleeps until delay_seconds seconds have passed.
+ """
+ retry=self.num_retries-retries_left
+ # If this call would violate the rate limit, sleep until it doesn't.
+ ifself._last_request_dtisnotNone:
+ required=timedelta(seconds=self.delay_seconds)
+ since_last_request=datetime.now()-self._last_request_dt
+ ifsince_last_request<required:
+ to_sleep=(required-since_last_request).total_seconds()
+ logger.info("Sleeping for %f seconds",to_sleep)
+ time.sleep(to_sleep)
+ logger.info("Requesting page of results",extra={
+ 'url':url,
+ 'first_page':first_page,
+ 'retry':retry,
+ 'last_err':last_err.messageiflast_errisnotNoneelseNone,
+ })
+ feed=feedparser.parse(url)
+ self._last_request_dt=datetime.now()
+ err=None
+ iffeed.status!=200:
+ err=HTTPError(url,retry,feed)
+ eliflen(feed.entries)==0andnotfirst_page:
+ err=UnexpectedEmptyPageError(url,retry)
+ iferrisnotNone:
+ ifretries_left>0:
+ returnself.__try_parse_feed(
+ url,
+ first_page=first_page,
+ retries_left=retries_left-1,
+ last_err=err,
+ )
+ # Feed was never returned in self.num_retries tries. Raise the last
+ # exception encountered.
+ raiseerr
+ returnfeed
+
+
+
+
+
Specifies a strategy for fetching results from arXiv's API.
+
+
This class hides pagination and retry logic, and exposes
+Client.results.
+
+
+
+
+
+
+
+ Client(page_size: int = 100, delay_seconds: int = 3, num_retries: int = 3)
+
+
+
+
def __init__(
+    self,
+    page_size: int = 100,
+    delay_seconds: int = 3,
+    num_retries: int = 3
+):
+    """
+    Constructs an arXiv API client with the specified options.
+
+    Note: the default parameters should provide a robust request strategy
+    for most use cases. Extreme page sizes, delays, or retries risk
+    violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
+    brittle behavior, and inconsistent results.
+    """
+    self.page_size = page_size
+    self.delay_seconds = delay_seconds
+    self.num_retries = num_retries
+    self._last_request_dt = None
+
+
+
+
+
Constructs an arXiv API client with the specified options.
+
+
Note: the default parameters should provide a robust request strategy
+for most use cases. Extreme page sizes, delays, or retries risk
+violating the arXiv API Terms of Use,
+brittle behavior, and inconsistent results.
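
The `delay_seconds` option is enforced by sleeping before a request that would otherwise arrive too soon after the previous one. A minimal sketch of that calculation, as a pure function so it can be checked without actually sleeping; the name `seconds_to_sleep` and its signature are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def seconds_to_sleep(
    last_request: Optional[datetime],
    now: datetime,
    delay_seconds: float,
) -> float:
    """Return how long to wait so requests are at least delay_seconds apart."""
    if last_request is None:
        # No previous request: nothing to wait for.
        return 0.0
    required = timedelta(seconds=delay_seconds)
    since_last_request = now - last_request
    if since_last_request < required:
        return (required - since_last_request).total_seconds()
    return 0.0


t0 = datetime(2021, 1, 1, 0, 0, 0)
print(seconds_to_sleep(t0, t0 + timedelta(seconds=1), 3))  # 2.0
```

A caller would pass the result to `time.sleep` before issuing the next request.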
defresults(self,search:Search)->Generator[Result,None,None]:
+ """
+ Uses this client configuration to fetch one page of the search results
+ at a time, yielding the parsed `Result`s, until `max_results` results
+ have been yielded or there are no more search results.
+
+ If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.
+
+ For more on using generators, see
+ [Generators](https://wiki.python.org/moin/Generators).
+ """
+ offset=0
+ # total_results may be reduced according to the feed's
+ # opensearch:totalResults value.
+ total_results=search.max_results
+ first_page=True
+ whileoffset<total_results:
+ page_size=min(self.page_size,search.max_results-offset)
+ logger.info("Requesting {} results at offset {}".format(
+ page_size,
+ offset,
+ ))
+ page_url=self._format_url(search,offset,page_size)
+ feed=self._parse_feed(page_url,first_page)
+ iffirst_page:
+ # NOTE: this is an ugly fix for a known bug. The totalresults
+ # value is set to 1 for results with zero entries. If that API
+ # bug is fixed, we can remove this conditional and always set
+ # `total_results = min(...)`.
+ iflen(feed.entries)==0:
+ logger.info("Got empty results; stopping generation")
+ total_results=0
+ else:
+ total_results=min(
+ total_results,
+ int(feed.feed.opensearch_totalresults)
+ )
+ logger.info("Got first page; {} of {} results available".format(
+ total_results,
+ search.max_results
+ ))
+ # Subsequent pages are not the first page.
+ first_page=False
+ # Update offset for next request: account for received results.
+ offset+=len(feed.entries)
+ # Yield query results until page is exhausted.
+ forentryinfeed.entries:
+ try:
+ yieldResult._from_feed_entry(entry)
+ exceptResult.MissingFieldError:
+ logger.warning("Skipping partial result")
+ continue
+
+
+
+
+
Uses this client configuration to fetch one page of the search results
+at a time, yielding the parsed Results, until max_results results
+have been yielded or there are no more search results.
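
The paging loop can be sketched with a generator. This is a simplified illustration, not the package's implementation: `paginate` and `fetch_page` are hypothetical names, and the total number of available results is passed in up front rather than read from the first page's feed.

```python
from typing import Callable, Iterator, List


def paginate(
    fetch_page: Callable[[int, int], List[str]],
    total_available: int,
    max_results: float,
    page_size: int = 100,
) -> Iterator[str]:
    """Yield items page by page until the cap or the data runs out."""
    offset = 0
    # max_results may be float('inf'); min() then picks the available total.
    total = min(max_results, total_available)
    while offset < total:
        size = min(page_size, total - offset)
        page = fetch_page(offset, size)
        if not page:
            # Stop rather than loop forever on an empty page.
            break
        offset += len(page)
        yield from page


# A fake in-memory "API" standing in for network requests.
records = ["paper-{}".format(i) for i in range(250)]
titles = list(paginate(lambda start, n: records[start:start + n], 250, 10))
print(len(titles))  # 10
```

Because results are yielded lazily, a caller that stops iterating early never triggers the remaining page requests.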
+
+
+ class
+ ArxivError(builtins.Exception):
+
+
+
+
classArxivError(Exception):
+ """This package's base Exception class."""
+
+ url:str
+ """The feed URL that could not be fetched."""
+ retry:int
+ """
+ The request try number which encountered this error; 0 for the initial try,
+ 1 for the first retry, and so on.
+ """
+ message:str
+ """Message describing what caused this error."""
+
+ def__init__(self,url:str,retry:int,message:str):
+ """
+ Constructs an `ArxivError` encountered while fetching the specified URL.
+ """
+ self.url=url
+ self.retry=retry
+ self.message=message
+ super().__init__(self.message)
+
+ def__str__(self)->str:
+ return'{} ({})'.format(self.message,self.url)
+
classUnexpectedEmptyPageError(ArxivError):
+ """
+ An error raised when a page of results that should be non-empty is empty.
+
+ This should never happen in theory, but happens sporadically due to
+ brittleness in the underlying arXiv API; usually resolved by retries.
+
+ See `Client.results` for usage.
+ """
+ def__init__(self,url:str,retry:int):
+ """
+ Constructs an `UnexpectedEmptyPageError` encountered for the specified
+ API URL after `retry` tries.
+ """
+ self.url=url
+ super().__init__(url,retry,"Page of results was unexpectedly empty")
+
+ def__repr__(self)->str:
+ return'{}({}, {})'.format(
+ _classname(self),
+ repr(self.url),
+ repr(self.retry)
+ )
+
+
+
+
+
An error raised when a page of results that should be non-empty is empty.
+
+
This should never happen in theory, but happens sporadically due to
+brittleness in the underlying arXiv API; usually resolved by retries.
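
Since the empty-page condition is usually transient, retrying is the natural remedy. The sketch below shows one way to retry a fetch a bounded number of times and raise the last error if every try fails. `EmptyPageError` and `fetch_with_retries` are hypothetical stand-ins, and the loop is an iterative simplification of the recursive helper in the source listing.

```python
from typing import Callable, List, Optional


class EmptyPageError(Exception):
    """Hypothetical stand-in for an 'unexpectedly empty page' error."""


def fetch_with_retries(
    fetch: Callable[[], List[str]],
    num_retries: int = 3,
) -> List[str]:
    """Call fetch() once, then retry up to num_retries times if it is empty."""
    last_err: Optional[Exception] = None
    # One initial try plus num_retries retries.
    for attempt in range(num_retries + 1):
        page = fetch()
        if page:
            return page
        last_err = EmptyPageError("empty page on try {}".format(attempt))
    # Every try came back empty: surface the last error encountered.
    assert last_err is not None
    raise last_err


# Two empty responses followed by a good one.
responses = iter([[], [], ["entry"]])
print(fetch_with_retries(lambda: next(responses)))  # ['entry']
```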
def __init__(self, url: str, retry: int):
+    """
+    Constructs an `UnexpectedEmptyPageError` encountered for the specified
+    API URL after `retry` tries.
+    """
+    self.url = url
+    super().__init__(url, retry, "Page of results was unexpectedly empty")
+
classHTTPError(ArxivError):
+ """
+ A non-200 status encountered while fetching a page of results.
+
+ See `Client.results` for usage.
+ """
+
+ status:int
+ """The HTTP status reported by feedparser."""
+ entry:feedparser.FeedParserDict
+ """The feed entry describing the error, if present."""
+
+ def__init__(self,url:str,retry:int,feed:feedparser.FeedParserDict):
+ """
+ Constructs an `HTTPError` for the specified status code, encountered for
+ the specified API URL after `retry` tries.
+ """
+ self.url=url
+ self.status=feed.status
+ # If the feed is valid and includes a single entry, trust it's an
+ # explanation.
+ ifnotfeed.bozoandlen(feed.entries)==1:
+ self.entry=feed.entries[0]
+ else:
+ self.entry=None
+ super().__init__(
+ url,
+ retry,
+ "Page request resulted in HTTP {}: {}".format(
+ self.status,
+ self.entry.summaryifself.entryelseNone,
+ ),
+ )
+
+ def__repr__(self)->str:
+ return'{}({}, {}, {})'.format(
+ _classname(self),
+ repr(self.url),
+ repr(self.retry),
+ repr(self.status)
+ )
+
+
+
+
+
A non-200 status encountered while fetching a page of results.
def__init__(self,url:str,retry:int,feed:feedparser.FeedParserDict):
+ """
+ Constructs an `HTTPError` for the specified status code, encountered for
+ the specified API URL after `retry` tries.
+ """
+ self.url=url
+ self.status=feed.status
+ # If the feed is valid and includes a single entry, trust it's an
+ # explanation.
+ ifnotfeed.bozoandlen(feed.entries)==1:
+ self.entry=feed.entries[0]
+ else:
+ self.entry=None
+ super().__init__(
+ url,
+ retry,
+ "Page request resulted in HTTP {}: {}".format(
+ self.status,
+ self.entry.summaryifself.entryelseNone,
+ ),
+ )
+
+
+
+
+
Constructs an HTTPError for the specified status code, encountered for
+the specified API URL after retry tries.
arXiv is a project by the Cornell University Library that provides open access to 1,000,000+ articles in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics.
query: an arXiv query string. Advanced query formats are documented in the arXiv API User Manual.
-
id_list: list of arXiv record IDs (typically of the format "0710.5765v1"). See the arXiv API User's Manual for documentation of the interaction between query and id_list.
-
max_results: The maximum number of results to be returned in an execution of this search. To fetch every result available, set max_results=float('inf') (default); to fetch up to 10 results, set max_results=10. The API's limit is 300,000 results.
-
sort_by: The sort criterion for results: relevance, lastUpdatedDate, or submittedDate.
-
sort_order: The sort order for results: 'descending' or 'ascending'.
-
-
-
To fetch arXiv records matching a Search, use search.results() or (Client).results(search) to get a generator yielding Results.
-
-
Example: fetching results
-
-
Print the titles of the 10 most recent articles related to the keyword "quantum":
result.links: Up to three URLs associated with this result, as arxiv.Links.
-
result.pdf_url: A URL for the result's PDF if present. Note: this URL also appears among result.links.
-
-
-
They also expose helper methods for downloading papers: (Result).download_pdf() and (Result).download_source().
-
-
Example: downloading papers
-
-
To download a PDF of the paper with ID "1605.08386v1," run a Search and then use (Result).download_pdf():
-
-
importarxiv
-
-paper=next(arxiv.Search(id_list=["1605.08386v1"]).results())
-# Download the PDF to the PWD with a default filename.
-paper.download_pdf()
-# Download the PDF to the PWD with a custom filename.
-paper.download_pdf(filename="downloaded-paper.pdf")
-# Download the PDF to a specified directory with a custom filename.
-paper.download_pdf(dirpath="./mydir",filename="downloaded-paper.pdf")
-
-
-
The same interface is available for downloading .tar.gz files of the paper source:
-
-
importarxiv
-
-paper=next(arxiv.Search(id_list=["1605.08386v1"]).results())
-# Download the archive to the PWD with a default filename.
-paper.download_source()
-# Download the archive to the PWD with a custom filename.
-paper.download_source(filename="downloaded-paper.tar.gz")
-# Download the archive to a specified directory with a custom filename.
-paper.download_source(dirpath="./mydir",filename="downloaded-paper.tar.gz")
-
-
-
Client
-
-
A Client specifies a strategy for fetching results from arXiv's API; it obscures pagination and retry logic.
-
-
For most use cases the default client should suffice. You can construct it explicitly with arxiv.Client(), or use it via the (Search).results() method.
page_size: the number of papers to fetch from arXiv per page of results. Smaller pages can be retrieved faster, but may require more round-trips. The API's limit is 2000 results.
-
delay_seconds: the number of seconds to wait between requests for pages. arXiv's Terms of Use ask that you "make no more than one request every three seconds."
-
num_retries: The number of times the client will retry a request that fails, either with a non-200 HTTP status code or with an unexpected number of results given the search parameters.
-
-
-
Example: fetching results with a custom client
-
-
(Search).results() uses the default client settings. If you want to use a client you've defined instead of the defaults, use (Client).results(...):
-
-
importarxiv
-
-big_slow_client=arxiv.Client(
- page_size=1000,
- delay_seconds=10,
- num_retries=5
-)
-
-# Prints 1000 titles before needing to make another request.
-forresultinbig_slow_client.results(arxiv.Search(query="quantum")):
- print(result.title)
-
-
-
Example: logging
-
-
To inspect this package's network behavior and API logic, configure an INFO-level logger.
-
-
>>> importlogging,arxiv
->>> logging.basicConfig(level=logging.INFO)
->>> paper=next(arxiv.Search(id_list=["1605.08386v1"]).results())
-INFO:arxiv.arxiv:Requesting 100 results at offset 0
-INFO:arxiv.arxiv:Requesting page of results
-INFO:arxiv.arxiv:Got first page; 1 of inf results available
-
-
-
-
-
""".. include:: ../README.md"""
-importlogging
-importtime
-importfeedparser
-importre
-importos
-importwarnings
-
-fromurllib.parseimporturlencode
-fromurllib.requestimporturlretrieve
-fromdatetimeimportdatetime,timedelta,timezone
-fromcalendarimporttimegm
-
-fromenumimportEnum
-fromtypingimportDict,Generator,List
-
-logger=logging.getLogger(__name__)
-
-_DEFAULT_TIME=datetime.min
-
-
-classResult(object):
- """
- An entry in an arXiv query results feed.
-
- See [the arXiv API User's Manual: Details of Atom Results
- Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
- """
-
- entry_id:str
- """A url of the form `http://arxiv.org/abs/{id}`."""
- updated:time.struct_time
- """When the result was last updated."""
- published:time.struct_time
- """When the result was originally published."""
- title:str
- """The title of the result."""
- authors:list
- """The result's authors."""
- summary:str
- """The result abstract."""
- comment:str
- """The authors' comment if present."""
- journal_ref:str
- """A journal reference if present."""
- doi:str
- """A URL for the resolved DOI to an external resource if present."""
- primary_category:str
- """
- The result's primary arXiv category. See [arXiv: Category
- Taxonomy](https://arxiv.org/category_taxonomy).
- """
- categories:List[str]
- """
- All of the result's categories. See [arXiv: Category
- Taxonomy](https://arxiv.org/category_taxonomy).
- """
- links:list
- """Up to three URLs associated with this result."""
- pdf_url:str
- """The URL of a PDF version of this result if present among links."""
- _raw:feedparser.FeedParserDict
- """
- The raw feedparser result object if this Result was constructed with
- Result._from_feed_entry.
- """
-
- def__init__(
- self,
- entry_id:str,
- updated:datetime=_DEFAULT_TIME,
- published:datetime=_DEFAULT_TIME,
- title:str="",
- authors:List['Result.Author']=[],
- summary:str="",
- comment:str="",
- journal_ref:str="",
- doi:str="",
- primary_category:str="",
- categories:List[str]=[],
- links:List['Result.Link']=[],
- _raw:feedparser.FeedParserDict=None,
- ):
- """
- Constructs an arXiv search result item.
-
- In most cases, prefer using `Result._from_feed_entry` to parsing and
- constructing `Result`s yourself.
- """
- self.entry_id=entry_id
- self.updated=updated
- self.published=published
- self.title=title
- self.authors=authors
- self.summary=summary
- self.comment=comment
- self.journal_ref=journal_ref
- self.doi=doi
- self.primary_category=primary_category
- self.categories=categories
- self.links=links
- # Calculated members
- self.pdf_url=Result._get_pdf_url(links)
- # Debugging
- self._raw=_raw
-
- def_from_feed_entry(entry:feedparser.FeedParserDict)->'Result':
- """
- Converts a feedparser entry for an arXiv search result feed into a
- Result object.
- """
- ifnothasattr(entry,"id"):
- raiseResult.MissingFieldError("id")
- # Title attribute may be absent for certain titles. Defaulting to "0" as
- # it's the only title observed to cause this bug.
- # https://github.com/lukasschwab/arxiv.py/issues/71
- # title = entry.title if hasattr(entry, "title") else "0"
- title="0"
- ifhasattr(entry,"title"):
- title=entry.title
- else:
- logger.warning(
- "Result %s is missing title attribute; defaulting to '0'",
- entry.id
- )
- returnResult(
- entry_id=entry.id,
- updated=Result._to_datetime(entry.updated_parsed),
- published=Result._to_datetime(entry.published_parsed),
- title=re.sub(r'\s+',' ',title),
- authors=[Result.Author._from_feed_author(a)forainentry.authors],
- summary=entry.summary,
- comment=entry.get('arxiv_comment'),
- journal_ref=entry.get('arxiv_journal_ref'),
- doi=entry.get('arxiv_doi'),
- primary_category=entry.arxiv_primary_category.get('term'),
- categories=[tag.get('term')fortaginentry.tags],
- links=[Result.Link._from_feed_link(link)forlinkinentry.links],
- _raw=entry
- )
-
- def__str__(self)->str:
- returnself.entry_id
-
- def__repr__(self)->str:
- return(
- '{}(entry_id={}, updated={}, published={}, title={}, authors={}, '
- 'summary={}, comment={}, journal_ref={}, doi={}, '
- 'primary_category={}, categories={}, links={})'
- ).format(
- _classname(self),
- repr(self.entry_id),
- repr(self.updated),
- repr(self.published),
- repr(self.title),
- repr(self.authors),
- repr(self.summary),
- repr(self.comment),
- repr(self.journal_ref),
- repr(self.doi),
- repr(self.primary_category),
- repr(self.categories),
- repr(self.links)
- )
-
- def__eq__(self,other)->bool:
- ifisinstance(other,Result):
- returnself.entry_id==other.entry_id
- returnFalse
-
- defget_short_id(self)->str:
- """
- Returns the short ID for this result.
-
- + If the result URL is `"http://arxiv.org/abs/2107.05580v1"`,
- `result.get_short_id()` returns `2107.05580v1`.
-
- + If the result URL is `"http://arxiv.org/abs/quant-ph/0201082v1"`,
- `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
- 2007 arXiv identifier format).
-
- For an explanation of the difference between arXiv's legacy and current
- identifiers, see [Understanding the arXiv
- identifier](https://arxiv.org/help/arxiv_identifier).
- """
- returnself.entry_id.split('arxiv.org/abs/')[-1]
-
- def_get_default_filename(self,extension:str="pdf")->str:
- """
- A default `to_filename` function for the extension given.
- """
- nonempty_title=self.titleifself.titleelse"UNTITLED"
- # Remove disallowed characters.
- clean_title='_'.join(re.findall(r'\w+',nonempty_title))
- return"{}.{}.{}".format(self.get_short_id(),clean_title,extension)
-
- defdownload_pdf(self,dirpath:str='./',filename:str='')->str:
- """
- Downloads the PDF for this result to the specified directory.
-
- If `filename` is empty, a default is generated by `_get_default_filename`.
- """
- ifnotfilename:
- filename=self._get_default_filename()
- path=os.path.join(dirpath,filename)
- written_path,_=urlretrieve(self.pdf_url,path)
- returnwritten_path
-
- defdownload_source(self,dirpath:str='./',filename:str='')->str:
- """
- Downloads the source tarfile for this result to the specified
- directory.
-
- If `filename` is empty, a default is generated by `_get_default_filename`.
- """
- ifnotfilename:
- filename=self._get_default_filename('tar.gz')
- path=os.path.join(dirpath,filename)
- # Bodge: construct the source URL from the PDF URL.
- source_url=self.pdf_url.replace('/pdf/','/src/')
- written_path,_=urlretrieve(source_url,path)
- returnwritten_path
-
- def_get_pdf_url(links:list)->str:
- """
- Finds the PDF link among a result's links and returns its URL.
-
- Should only be called once for a given `Result`, in its constructor.
- After construction, the URL should be available in `Result.pdf_url`.
- """
- pdf_urls=[link.hrefforlinkinlinksiflink.title=='pdf']
- iflen(pdf_urls)==0:
- returnNone
- eliflen(pdf_urls)>1:
- logger.warning(
- "Result has multiple PDF links; using %s",
- pdf_urls[0]
- )
- returnpdf_urls[0]
-
- def_to_datetime(ts:time.struct_time)->datetime:
- """
- Converts a UTC time.struct_time into a time-zone-aware datetime.
-
- This will be replaced with feedparser functionality [when it becomes
- available](https://github.com/kurtmckee/feedparser/issues/212).
- """
- returndatetime.fromtimestamp(timegm(ts),tz=timezone.utc)
-
- classAuthor(object):
- """
- A light inner class for representing a result's authors.
- """
-
- name:str
- """The author's name."""
-
- def__init__(self,name:str):
- """
- Constructs an `Author` with the specified name.
-
- In most cases, prefer using `Author._from_feed_author` to parsing
- and constructing `Author`s yourself.
- """
- self.name=name
-
- def_from_feed_author(
- feed_author:feedparser.FeedParserDict
- )->'Result.Author':
- """
- Constructs an `Author` with the name specified in an author object
- from a feed entry.
-
- See usage in `Result._from_feed_entry`.
- """
- returnResult.Author(feed_author.name)
-
- def__str__(self)->str:
- returnself.name
-
- def__repr__(self)->str:
- return'{}({})'.format(_classname(self),repr(self.name))
-
- def__eq__(self,other)->bool:
- ifisinstance(other,Result.Author):
- returnself.name==other.name
- returnFalse
-
- classLink(object):
- """
- A light inner class for representing a result's links.
- """
-
- href:str
- """The link's `href` attribute."""
- title:str
- """The link's title."""
- rel:str
- """The link's relationship to the `Result`."""
- content_type:str
- """The link's HTTP content type."""
-
- def__init__(
- self,
- href:str,
- title:str=None,
- rel:str=None,
- content_type:str=None
- ):
- """
- Constructs a `Link` with the specified link metadata.
-
- In most cases, prefer using `Link._from_feed_link` to parsing and
- constructing `Link`s yourself.
- """
- self.href=href
- self.title=title
- self.rel=rel
- self.content_type=content_type
-
- def_from_feed_link(
- feed_link:feedparser.FeedParserDict
- )->'Result.Link':
- """
- Constructs a `Link` with link metadata specified in a link object
- from a feed entry.
-
- See usage in `Result._from_feed_entry`.
- """
- returnResult.Link(
- href=feed_link.href,
- title=feed_link.get('title'),
- rel=feed_link.get('rel'),
- content_type=feed_link.get('content_type')
- )
-
- def__str__(self)->str:
- returnself.href
-
- def__repr__(self)->str:
- return'{}({}, title={}, rel={}, content_type={})'.format(
- _classname(self),
- repr(self.href),
- repr(self.title),
- repr(self.rel),
- repr(self.content_type)
- )
-
- def__eq__(self,other)->bool:
- ifisinstance(other,Result.Link):
- returnself.href==other.href
- returnFalse
-
- classMissingFieldError(Exception):
- """
- An error indicating an entry is unparseable because it lacks required
- fields.
- """
-
- missing_field:str
- """The required field missing from the would-be entry."""
- message:str
- """Message describing what caused this error."""
-
- def__init__(self,missing_field):
- self.missing_field=missing_field
- self.message="Entry from arXiv missing required info"
-
- def__repr__(self)->str:
- return'{}({})'.format(
- _classname(self),
- repr(self.missing_field)
- )
-
-
-classSortCriterion(Enum):
- """
- A SortCriterion identifies a property by which search results can be
- sorted.
-
- See [the arXiv API User's Manual: sort order for return
- results](https://arxiv.org/help/api/user-manual#sort).
- """
- Relevance="relevance"
- LastUpdatedDate="lastUpdatedDate"
- SubmittedDate="submittedDate"
-
-
-classSortOrder(Enum):
- """
- A SortOrder indicates order in which search results are sorted according
- to the specified arxiv.SortCriterion.
-
- See [the arXiv API User's Manual: sort order for return
- results](https://arxiv.org/help/api/user-manual#sort).
- """
- Ascending="ascending"
- Descending="descending"
-
-
-classSearch(object):
- """
- A specification for a search of arXiv's database.
-
- To run a search, use `Search.results` to use a default client or
- `Client.results` with a specific client.
- """
-
- query:str
- """
- A query string.
-
- See [the arXiv API User's Manual: Details of Query
- Construction](https://arxiv.org/help/api/user-manual#query_details).
- """
- id_list:list
- """
- A list of arXiv article IDs to which to limit the search.
-
- See [the arXiv API User's
- Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
- for documentation of the interaction between `query` and `id_list`.
- """
- max_results:float
- """
- The maximum number of results to be returned in an execution of this
- search.
-
- To fetch every result available, set `max_results=float('inf')`.
- """
- sort_by:SortCriterion
- """The sort criterion for results."""
- sort_order:SortOrder
- """The sort order for results."""
-
- def__init__(
- self,
- query:str="",
- id_list:List[str]=[],
- max_results:float=float('inf'),
- sort_by:SortCriterion=SortCriterion.Relevance,
- sort_order:SortOrder=SortOrder.Descending
- ):
- """
- Constructs an arXiv API search with the specified criteria.
- """
- self.query=query
- self.id_list=id_list
- self.max_results=max_results
- self.sort_by=sort_by
- self.sort_order=sort_order
-
- def__str__(self)->str:
- # TODO: develop a more informative string representation.
- returnrepr(self)
-
- def__repr__(self)->str:
- return(
- '{}(query={}, id_list={}, max_results={}, sort_by={}, '
- 'sort_order={})'
- ).format(
- _classname(self),
- repr(self.query),
- repr(self.id_list),
- repr(self.max_results),
- repr(self.sort_by),
- repr(self.sort_order)
- )
-
- def_url_args(self)->Dict[str,str]:
- """
- Returns a dict of search parameters that should be included in an API
- request for this search.
- """
- return{
- "search_query":self.query,
- "id_list":','.join(self.id_list),
- "sortBy":self.sort_by.value,
- "sortOrder":self.sort_order.value
- }
-
- defget(self)->Generator[Result,None,None]:
- """
- **Deprecated** after 1.2.0; use `Search.results`.
- """
- warnings.warn(
- "The 'get' method is deprecated, use 'results' instead",
- DeprecationWarning,
- stacklevel=2
- )
- returnself.results()
-
- defresults(self)->Generator[Result,None,None]:
- """
- Executes the specified search using a default arXiv API client.
-
- For info on default behavior, see `Client.__init__` and `Client.results`.
- """
- returnClient().results(self)
-
-
-classClient(object):
- """
- Specifies a strategy for fetching results from arXiv's API.
-
- This class obscures pagination and retry logic, and exposes
- `Client.results`.
- """
-
- query_url_format='http://export.arxiv.org/api/query?{}'
- """The arXiv query API endpoint format."""
- page_size:int
- """Maximum number of results fetched in a single API request."""
- delay_seconds:int
- """Number of seconds to wait between API requests."""
- num_retries:int
- """Number of times to retry a failing API request."""
- _last_request_dt:datetime
-
- def__init__(
- self,
- page_size:int=100,
- delay_seconds:int=3,
- num_retries:int=3
- ):
- """
- Constructs an arXiv API client with the specified options.
-
- Note: the default parameters should provide a robust request strategy
- for most use cases. Extreme page sizes, delays, or retries risk
- violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
- brittle behavior, and inconsistent results.
- """
- self.page_size=page_size
- self.delay_seconds=delay_seconds
- self.num_retries=num_retries
- self._last_request_dt=None
-
- def__str__(self)->str:
- # TODO: develop a more informative string representation.
- returnrepr(self)
-
- def__repr__(self)->str:
- return'{}(page_size={}, delay_seconds={}, num_retries={})'.format(
- _classname(self),
- repr(self.page_size),
- repr(self.delay_seconds),
- repr(self.num_retries)
- )
-
- defget(self,search:Search)->Generator[Result,None,None]:
- """
- **Deprecated** after 1.2.0; use `Client.results`.
- """
- warnings.warn(
- "The 'get' method is deprecated, use 'results' instead",
- DeprecationWarning,
- stacklevel=2
- )
- returnself.results(search)
-
- defresults(self,search:Search)->Generator[Result,None,None]:
- """
- Uses this client configuration to fetch one page of the search results
- at a time, yielding the parsed `Result`s, until `max_results` results
- have been yielded or there are no more search results.
-
- If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.
-
- For more on using generators, see
- [Generators](https://wiki.python.org/moin/Generators).
- """
- offset=0
- # total_results may be reduced according to the feed's
- # opensearch:totalResults value.
- total_results=search.max_results
- first_page=True
- whileoffset<total_results:
- page_size=min(self.page_size,search.max_results-offset)
- logger.info("Requesting {} results at offset {}".format(
- page_size,
- offset,
- ))
- page_url=self._format_url(search,offset,page_size)
- feed=self._parse_feed(page_url,first_page)
- iffirst_page:
- # NOTE: this is an ugly fix for a known bug. The totalresults
- # value is set to 1 for results with zero entries. If that API
- # bug is fixed, we can remove this conditional and always set
- # `total_results = min(...)`.
- iflen(feed.entries)==0:
- logger.info("Got empty results; stopping generation")
- total_results=0
- else:
- total_results=min(
- total_results,
- int(feed.feed.opensearch_totalresults)
- )
- logger.info("Got first page; {} of {} results available".format(
- total_results,
- search.max_results
- ))
- # Subsequent pages are not the first page.
- first_page=False
- # Update offset for next request: account for received results.
- offset+=len(feed.entries)
- # Yield query results until page is exhausted.
- forentryinfeed.entries:
- try:
- yieldResult._from_feed_entry(entry)
- exceptResult.MissingFieldError:
- logger.warning("Skipping partial result")
- continue
-
- def_format_url(self,search:Search,start:int,page_size:int)->str:
- """
- Constructs a request URL for the search that returns up to `page_size`
- results starting with the result at index `start`.
- """
- url_args=search._url_args()
- url_args.update({
- "start":start,
- "max_results":page_size,
- })
- returnself.query_url_format.format(urlencode(url_args))
-
- def_parse_feed(
- self,
- url:str,
- first_page:bool=True
- )->feedparser.FeedParserDict:
- """
- Fetches the specified URL and parses it with feedparser.
-
- If a request fails or is unexpectedly empty, retries the request up to
- `self.num_retries` times.
- """
- # Invoke the recursive helper with initial available retries.
- returnself.__try_parse_feed(
- url,
- first_page=first_page,
- retries_left=self.num_retries
- )
-
- def__try_parse_feed(
- self,
- url:str,
- first_page:bool,
- retries_left:int,
- last_err:Exception=None,
- )->feedparser.FeedParserDict:
- """
- Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
- number of seconds has not passed since `_parse_feed` was last called,
- sleeps until delay_seconds seconds have passed.
- """
- retry=self.num_retries-retries_left
- # If this call would violate the rate limit, sleep until it doesn't.
- ifself._last_request_dtisnotNone:
- required=timedelta(seconds=self.delay_seconds)
- since_last_request=datetime.now()-self._last_request_dt
- ifsince_last_request<required:
- to_sleep=(required-since_last_request).total_seconds()
- logger.info("Sleeping for %f seconds",to_sleep)
- time.sleep(to_sleep)
- logger.info("Requesting page of results",extra={
- 'url':url,
- 'first_page':first_page,
- 'retry':retry,
- 'last_err':last_err.messageiflast_errisnotNoneelseNone,
- })
- feed=feedparser.parse(url)
- self._last_request_dt=datetime.now()
- err=None
- iffeed.status!=200:
- err=HTTPError(url,retry,feed)
- eliflen(feed.entries)==0andnotfirst_page:
- err=UnexpectedEmptyPageError(url,retry)
- iferrisnotNone:
- ifretries_left>0:
- returnself.__try_parse_feed(
- url,
- first_page=first_page,
- retries_left=retries_left-1,
- last_err=err,
- )
- # Feed was never returned in self.num_retries tries. Raise the last
- # exception encountered.
- raiseerr
- returnfeed
-
-
-classArxivError(Exception):
- """This package's base Exception class."""
-
- url:str
- """The feed URL that could not be fetched."""
- retry:int
- """
- The request try number which encountered this error; 0 for the initial try,
- 1 for the first retry, and so on.
- """
- message:str
- """Message describing what caused this error."""
-
- def__init__(self,url:str,retry:int,message:str):
- """
- Constructs an `ArxivError` encountered while fetching the specified URL.
- """
- self.url=url
- self.retry=retry
- self.message=message
- super().__init__(self.message)
-
- def__str__(self)->str:
- return'{} ({})'.format(self.message,self.url)
-
-
-classUnexpectedEmptyPageError(ArxivError):
- """
- An error raised when a page of results that should be non-empty is empty.
-
- This should never happen in theory, but happens sporadically due to
- brittleness in the underlying arXiv API; usually resolved by retries.
-
- See `Client.results` for usage.
- """
- def__init__(self,url:str,retry:int):
- """
- Constructs an `UnexpectedEmptyPageError` encountered for the specified
- API URL after `retry` tries.
- """
- self.url=url
- super().__init__(url,retry,"Page of results was unexpectedly empty")
-
- def__repr__(self)->str:
- return'{}({}, {})'.format(
- _classname(self),
- repr(self.url),
- repr(self.retry)
- )
-
-
-classHTTPError(ArxivError):
- """
- A non-200 status encountered while fetching a page of results.
-
- See `Client.results` for usage.
- """
-
- status:int
- """The HTTP status reported by feedparser."""
- entry:feedparser.FeedParserDict
- """The feed entry describing the error, if present."""
-
- def__init__(self,url:str,retry:int,feed:feedparser.FeedParserDict):
- """
- Constructs an `HTTPError` for the specified status code, encountered for
- the specified API URL after `retry` tries.
- """
- self.url=url
- self.status=feed.status
- # If the feed is valid and includes a single entry, trust it's an
- # explanation.
- ifnotfeed.bozoandlen(feed.entries)==1:
- self.entry=feed.entries[0]
- else:
- self.entry=None
- super().__init__(
- url,
- retry,
- "Page request resulted in HTTP {}: {}".format(
- self.status,
- self.entry.summaryifself.entryelseNone,
- ),
- )
-
- def__repr__(self)->str:
- return'{}({}, {}, {})'.format(
- _classname(self),
- repr(self.url),
- repr(self.retry),
- repr(self.status)
- )
-
-
-def_classname(o):
- """A helper function for use in __repr__ methods: arxiv.Result.Link."""
- return'arxiv.{}'.format(o.__class__.__qualname__)
-
classResult(object):
- """
- An entry in an arXiv query results feed.
-
- See [the arXiv API User's Manual: Details of Atom Results
- Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
- """
-
- entry_id:str
- """A url of the form `http://arxiv.org/abs/{id}`."""
- updated:datetime
- """When the result was last updated."""
- published:datetime
- """When the result was originally published."""
- title:str
- """The title of the result."""
- authors:list
- """The result's authors."""
- summary:str
- """The result abstract."""
- comment:str
- """The authors' comment if present."""
- journal_ref:str
- """A journal reference if present."""
- doi:str
- """A URL for the resolved DOI to an external resource if present."""
- primary_category:str
- """
- The result's primary arXiv category. See [arXiv: Category
- Taxonomy](https://arxiv.org/category_taxonomy).
- """
- categories:List[str]
- """
- All of the result's categories. See [arXiv: Category
- Taxonomy](https://arxiv.org/category_taxonomy).
- """
- links:list
- """Up to three URLs associated with this result."""
- pdf_url:str
- """The URL of a PDF version of this result if present among links."""
- _raw:feedparser.FeedParserDict
- """
- The raw feedparser result object if this Result was constructed with
- Result._from_feed_entry.
- """
-
- def__init__(
- self,
- entry_id:str,
- updated:datetime=_DEFAULT_TIME,
- published:datetime=_DEFAULT_TIME,
- title:str="",
- authors:List['Result.Author']=[],
- summary:str="",
- comment:str="",
- journal_ref:str="",
- doi:str="",
- primary_category:str="",
- categories:List[str]=[],
- links:List['Result.Link']=[],
- _raw:feedparser.FeedParserDict=None,
- ):
- """
- Constructs an arXiv search result item.
-
- In most cases, prefer using `Result._from_feed_entry` to parsing and
- constructing `Result`s yourself.
- """
- self.entry_id=entry_id
- self.updated=updated
- self.published=published
- self.title=title
- self.authors=authors
- self.summary=summary
- self.comment=comment
- self.journal_ref=journal_ref
- self.doi=doi
- self.primary_category=primary_category
- self.categories=categories
- self.links=links
- # Calculated members
- self.pdf_url=Result._get_pdf_url(links)
- # Debugging
- self._raw=_raw
-
- def_from_feed_entry(entry:feedparser.FeedParserDict)->'Result':
- """
- Converts a feedparser entry for an arXiv search result feed into a
- Result object.
- """
- ifnothasattr(entry,"id"):
- raiseResult.MissingFieldError("id")
- # Title attribute may be absent for certain titles. Defaulting to "0" as
- # it's the only title observed to cause this bug.
- # https://github.com/lukasschwab/arxiv.py/issues/71
- # title = entry.title if hasattr(entry, "title") else "0"
- title="0"
- ifhasattr(entry,"title"):
- title=entry.title
- else:
- logger.warning(
- "Result %s is missing title attribute; defaulting to '0'",
- entry.id
- )
- returnResult(
- entry_id=entry.id,
- updated=Result._to_datetime(entry.updated_parsed),
- published=Result._to_datetime(entry.published_parsed),
- title=re.sub(r'\s+',' ',title),
- authors=[Result.Author._from_feed_author(a)forainentry.authors],
- summary=entry.summary,
- comment=entry.get('arxiv_comment'),
- journal_ref=entry.get('arxiv_journal_ref'),
- doi=entry.get('arxiv_doi'),
- primary_category=entry.arxiv_primary_category.get('term'),
- categories=[tag.get('term')fortaginentry.tags],
- links=[Result.Link._from_feed_link(link)forlinkinentry.links],
- _raw=entry
- )
-
- def__str__(self)->str:
- returnself.entry_id
-
- def__repr__(self)->str:
- return(
- '{}(entry_id={}, updated={}, published={}, title={}, authors={}, '
- 'summary={}, comment={}, journal_ref={}, doi={}, '
- 'primary_category={}, categories={}, links={})'
- ).format(
- _classname(self),
- repr(self.entry_id),
- repr(self.updated),
- repr(self.published),
- repr(self.title),
- repr(self.authors),
- repr(self.summary),
- repr(self.comment),
- repr(self.journal_ref),
- repr(self.doi),
- repr(self.primary_category),
- repr(self.categories),
- repr(self.links)
- )
-
- def__eq__(self,other)->bool:
- ifisinstance(other,Result):
- returnself.entry_id==other.entry_id
- returnFalse
-
- defget_short_id(self)->str:
- """
- Returns the short ID for this result.
-
- + If the result URL is `"http://arxiv.org/abs/2107.05580v1"`,
- `result.get_short_id()` returns `2107.05580v1`.
-
- + If the result URL is `"http://arxiv.org/abs/quant-ph/0201082v1"`,
- `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
- 2007 arXiv identifier format).
-
- For an explanation of the difference between arXiv's legacy and current
- identifiers, see [Understanding the arXiv
- identifier](https://arxiv.org/help/arxiv_identifier).
- """
- returnself.entry_id.split('arxiv.org/abs/')[-1]
-
- def_get_default_filename(self,extension:str="pdf")->str:
- """
- A default `to_filename` function for the extension given.
- """
- nonempty_title=self.titleifself.titleelse"UNTITLED"
- # Remove disallowed characters.
- clean_title='_'.join(re.findall(r'\w+',nonempty_title))
- return"{}.{}.{}".format(self.get_short_id(),clean_title,extension)
-
- defdownload_pdf(self,dirpath:str='./',filename:str='')->str:
- """
- Downloads the PDF for this result to the specified directory.
-
- If `filename` is empty, a default is generated by `_get_default_filename`.
- """
- ifnotfilename:
- filename=self._get_default_filename()
- path=os.path.join(dirpath,filename)
- written_path,_=urlretrieve(self.pdf_url,path)
- returnwritten_path
-
- defdownload_source(self,dirpath:str='./',filename:str='')->str:
- """
- Downloads the source tarfile for this result to the specified
- directory.
-
- If `filename` is empty, a default is generated by `_get_default_filename`.
- """
- ifnotfilename:
- filename=self._get_default_filename('tar.gz')
- path=os.path.join(dirpath,filename)
- # Bodge: construct the source URL from the PDF URL.
- source_url=self.pdf_url.replace('/pdf/','/src/')
- written_path,_=urlretrieve(source_url,path)
- returnwritten_path
-
- def_get_pdf_url(links:list)->str:
- """
- Finds the PDF link among a result's links and returns its URL.
-
- Should only be called once for a given `Result`, in its constructor.
- After construction, the URL should be available in `Result.pdf_url`.
- """
- pdf_urls=[link.hrefforlinkinlinksiflink.title=='pdf']
- iflen(pdf_urls)==0:
- returnNone
- eliflen(pdf_urls)>1:
- logger.warning(
- "Result has multiple PDF links; using %s",
- pdf_urls[0]
- )
- returnpdf_urls[0]
-
- def_to_datetime(ts:time.struct_time)->datetime:
- """
- Converts a UTC time.struct_time into a time-zone-aware datetime.
-
- This will be replaced with feedparser functionality [when it becomes
- available](https://github.com/kurtmckee/feedparser/issues/212).
- """
- returndatetime.fromtimestamp(timegm(ts),tz=timezone.utc)
-
- classAuthor(object):
- """
- A light inner class for representing a result's authors.
- """
-
- name:str
- """The author's name."""
-
- def__init__(self,name:str):
- """
- Constructs an `Author` with the specified name.
-
- In most cases, prefer using `Author._from_feed_author` to parsing
- and constructing `Author`s yourself.
- """
- self.name=name
-
- def_from_feed_author(
- feed_author:feedparser.FeedParserDict
- )->'Result.Author':
- """
- Constructs an `Author` with the name specified in an author object
- from a feed entry.
-
- See usage in `Result._from_feed_entry`.
- """
- returnResult.Author(feed_author.name)
-
- def__str__(self)->str:
- returnself.name
-
- def__repr__(self)->str:
- return'{}({})'.format(_classname(self),repr(self.name))
-
- def__eq__(self,other)->bool:
- ifisinstance(other,Result.Author):
- returnself.name==other.name
- returnFalse
-
- classLink(object):
- """
- A light inner class for representing a result's links.
- """
-
- href:str
- """The link's `href` attribute."""
- title:str
- """The link's title."""
- rel:str
- """The link's relationship to the `Result`."""
- content_type:str
- """The link's HTTP content type."""
-
- def__init__(
- self,
- href:str,
- title:str=None,
- rel:str=None,
- content_type:str=None
- ):
- """
- Constructs a `Link` with the specified link metadata.
-
- In most cases, prefer using `Link._from_feed_link` to parsing and
- constructing `Link`s yourself.
- """
- self.href=href
- self.title=title
- self.rel=rel
- self.content_type=content_type
-
- def_from_feed_link(
- feed_link:feedparser.FeedParserDict
- )->'Result.Link':
- """
- Constructs a `Link` with link metadata specified in a link object
- from a feed entry.
-
- See usage in `Result._from_feed_entry`.
- """
- returnResult.Link(
- href=feed_link.href,
- title=feed_link.get('title'),
- rel=feed_link.get('rel'),
- content_type=feed_link.get('content_type')
- )
-
- def__str__(self)->str:
- returnself.href
-
- def__repr__(self)->str:
- return'{}({}, title={}, rel={}, content_type={})'.format(
- _classname(self),
- repr(self.href),
- repr(self.title),
- repr(self.rel),
- repr(self.content_type)
- )
-
- def__eq__(self,other)->bool:
- ifisinstance(other,Result.Link):
- returnself.href==other.href
- returnFalse
-
- classMissingFieldError(Exception):
- """
- An error indicating an entry is unparseable because it lacks required
- fields.
- """
-
- missing_field:str
- """The required field missing from the would-be entry."""
- message:str
- """Message describing what caused this error."""
-
- def__init__(self,missing_field):
- self.missing_field=missing_field
- self.message="Entry from arXiv missing required info"
-
- def__repr__(self)->str:
- return'{}({})'.format(
- _classname(self),
- repr(self.missing_field)
- )
-
def get_short_id(self) -> str:
-     """
-     Returns the short ID for this result.
-
-     + If the result URL is `"http://arxiv.org/abs/2107.05580v1"`,
-     `result.get_short_id()` returns `2107.05580v1`.
-
-     + If the result URL is `"http://arxiv.org/abs/quant-ph/0201082v1"`,
-     `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
-     2007 arXiv identifier format).
-
-     For an explanation of the difference between arXiv's legacy and current
-     identifiers, see [Understanding the arXiv
-     identifier](https://arxiv.org/help/arxiv_identifier).
-     """
-     return self.entry_id.split('arxiv.org/abs/')[-1]
-
-
-
-
-
Returns the short ID for this result.
-
-
-
If the result URL is "http://arxiv.org/abs/2107.05580v1",
-result.get_short_id() returns 2107.05580v1.
-
If the result URL is "http://arxiv.org/abs/quant-ph/0201082v1",
-result.get_short_id() returns "quant-ph/0201082v1" (the pre-March
-2007 arXiv identifier format).
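The ID extraction described above can be reproduced outside the library. The following is a minimal sketch, assuming a standalone helper named `short_id` (a hypothetical name, not part of the package) that mirrors the `split` call in `get_short_id`:

```python
def short_id(entry_id: str) -> str:
    """Return everything after 'arxiv.org/abs/' in an entry URL.

    Hypothetical standalone helper; it mirrors the split used by
    Result.get_short_id but is not part of the arxiv package.
    """
    # split() returns the whole string in a one-element list when the
    # marker is absent, so [-1] is always safe.
    return entry_id.split("arxiv.org/abs/")[-1]


# Current-style identifier.
print(short_id("http://arxiv.org/abs/2107.05580v1"))
# Legacy (pre-March 2007) identifier, which keeps its archive prefix.
print(short_id("http://arxiv.org/abs/quant-ph/0201082v1"))
```

Because the split keeps whatever follows the marker, the version suffix (`v1`) and any legacy archive prefix are preserved as-is.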
def download_pdf(self, dirpath: str = './', filename: str = '') -> str:
-     """
-     Downloads the PDF for this result to the specified directory.
-
-     If `filename` is empty, a default filename is generated by
-     `_get_default_filename`.
-     """
-     if not filename:
-         filename = self._get_default_filename()
-     path = os.path.join(dirpath, filename)
-     written_path, _ = urlretrieve(self.pdf_url, path)
-     return written_path
-
-
-
-
-
Downloads the PDF for this result to the specified directory.
-
-
If filename is empty, a default filename is generated by _get_default_filename.
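To make the default naming concrete, here is a minimal sketch of how a default filename could be derived from a short ID and a title. The function name `default_filename` and the sample title are assumptions for illustration; the logic follows the `_get_default_filename` source shown on this page.

```python
import re


def default_filename(short_id: str, title: str, extension: str = "pdf") -> str:
    """Build '<short id>.<cleaned title>.<extension>'.

    Hypothetical standalone version of the default-filename logic;
    the name and signature are illustrative only.
    """
    nonempty_title = title if title else "UNTITLED"
    # Keep only runs of word characters and join them with underscores,
    # which drops punctuation and whitespace.
    clean_title = "_".join(re.findall(r"\w+", nonempty_title))
    return "{}.{}.{}".format(short_id, clean_title, extension)


# The title below is a made-up example, not a real paper title.
print(default_filename("2107.05580v1", "A Sample Title: Part 1"))
print(default_filename("2107.05580v1", "", "tar.gz"))
```

Note that in this sketch a legacy identifier such as `quant-ph/0201082v1` passes through unchanged, so the resulting name contains a slash.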
def download_source(self, dirpath: str = './', filename: str = '') -> str:
-     """
-     Downloads the source tarfile for this result to the specified
-     directory.
-
-     If `filename` is empty, a default filename is generated by
-     `_get_default_filename`.
-     """
-     if not filename:
-         filename = self._get_default_filename('tar.gz')
-     path = os.path.join(dirpath, filename)
-     # Bodge: construct the source URL from the PDF URL.
-     source_url = self.pdf_url.replace('/pdf/', '/src/')
-     written_path, _ = urlretrieve(source_url, path)
-     return written_path
-
-
-
-
-
Downloads the source tarfile for this result to the specified
-directory.
-
-
If filename is empty, a default filename is generated by _get_default_filename.
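The source download relies on rewriting the PDF URL. A minimal sketch of that rewrite follows, assuming a hypothetical helper `source_url_from_pdf_url`; the explicit check for a missing PDF URL is an addition of this sketch, not something the method shown above does.

```python
from typing import Optional


def source_url_from_pdf_url(pdf_url: Optional[str]) -> str:
    """Derive a source-tarball URL by swapping '/pdf/' for '/src/'.

    Hypothetical helper for illustration. The None check is an
    assumption of this sketch: a result without a PDF link has no
    URL to rewrite.
    """
    if pdf_url is None:
        raise ValueError("result has no PDF URL to derive a source URL from")
    return pdf_url.replace("/pdf/", "/src/")


print(source_url_from_pdf_url("http://arxiv.org/pdf/1605.08386v1"))
```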
class Author(object):
-     """
-     A light inner class for representing a result's authors.
-     """
-
-     name: str
-     """The author's name."""
-
-     def __init__(self, name: str):
-         """
-         Constructs an `Author` with the specified name.
-
-         In most cases, prefer using `Author._from_feed_author` to parsing
-         and constructing `Author`s yourself.
-         """
-         self.name = name
-
-     def _from_feed_author(
-         feed_author: feedparser.FeedParserDict
-     ) -> 'Result.Author':
-         """
-         Constructs an `Author` with the name specified in an author object
-         from a feed entry.
-
-         See usage in `Result._from_feed_entry`.
-         """
-         return Result.Author(feed_author.name)
-
-     def __str__(self) -> str:
-         return self.name
-
-     def __repr__(self) -> str:
-         return '{}({})'.format(_classname(self), repr(self.name))
-
-     def __eq__(self, other) -> bool:
-         if isinstance(other, Result.Author):
-             return self.name == other.name
-         return False
-
-
-
-
-
A light inner class for representing a result's authors.
def __init__(self, name: str):
-     """
-     Constructs an `Author` with the specified name.
-
-     In most cases, prefer using `Author._from_feed_author` to parsing
-     and constructing `Author`s yourself.
-     """
-     self.name = name
-
class SortCriterion(Enum):
-     """
-     A SortCriterion identifies a property by which search results can be
-     sorted.
-
-     See [the arXiv API User's Manual: sort order for return
-     results](https://arxiv.org/help/api/user-manual#sort).
-     """
-     Relevance = "relevance"
-     LastUpdatedDate = "lastUpdatedDate"
-     SubmittedDate = "submittedDate"
-
-
-
-
-
A SortCriterion identifies a property by which search results can be
-sorted.
class SortOrder(Enum):
-     """
-     A SortOrder indicates the order in which search results are sorted
-     according to the specified arxiv.SortCriterion.
-
-     See [the arXiv API User's Manual: sort order for return
-     results](https://arxiv.org/help/api/user-manual#sort).
-     """
-     Ascending = "ascending"
-     Descending = "descending"
-
-
-
-
-
A SortOrder indicates the order in which search results are sorted according
-to the specified arxiv.SortCriterion.
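As an illustration of how the two enums end up in a request, the sketch below rebuilds the query-string parameters the way `Search._url_args` and `Client._format_url` do in the module source. The enum classes are redefined locally so the example runs without the package installed, and the free function `url_args` is a hypothetical stand-in for the method.

```python
from enum import Enum
from typing import Dict, List
from urllib.parse import urlencode


class SortCriterion(Enum):
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    Ascending = "ascending"
    Descending = "descending"


def url_args(query: str, id_list: List[str],
             sort_by: SortCriterion, sort_order: SortOrder) -> Dict[str, str]:
    """Hypothetical free-function version of Search._url_args."""
    return {
        "search_query": query,
        "id_list": ",".join(id_list),
        # The enum's .value is the string the API expects.
        "sortBy": sort_by.value,
        "sortOrder": sort_order.value,
    }


args = url_args("quantum", [], SortCriterion.SubmittedDate, SortOrder.Descending)
# Paging parameters are added per request, as in Client._format_url.
args.update({"start": 0, "max_results": 10})
query_string = urlencode(args)
print("http://export.arxiv.org/api/query?{}".format(query_string))
```

Passing enum members rather than raw strings means a misspelled criterion fails when the search is constructed rather than when the API responds.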
classSearch(object):
- """
- A specification for a search of arXiv's database.
-
- To run a search, use `Search.results` to use a default client or
- `Client.results` with a specific client.
- """
-
- query:str
- """
- A query string.
-
- See [the arXiv API User's Manual: Details of Query
- Construction](https://arxiv.org/help/api/user-manual#query_details).
- """
- id_list:list
- """
- A list of arXiv article IDs to which to limit the search.
-
- See [the arXiv API User's
- Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
- for documentation of the interaction between `query` and `id_list`.
- """
- max_results:float
- """
- The maximum number of results to be returned in an execution of this
- search.
-
- To fetch every result available, set `max_results=float('inf')`.
- """
- sort_by:SortCriterion
- """The sort criterion for results."""
- sort_order:SortOrder
- """The sort order for results."""
-
- def__init__(
- self,
- query:str="",
- id_list:List[str]=[],
- max_results:float=float('inf'),
- sort_by:SortCriterion=SortCriterion.Relevance,
- sort_order:SortOrder=SortOrder.Descending
- ):
- """
- Constructs an arXiv API search with the specified criteria.
- """
- self.query=query
- self.id_list=id_list
- self.max_results=max_results
- self.sort_by=sort_by
- self.sort_order=sort_order
-
- def__str__(self)->str:
- # TODO: develop a more informative string representation.
- returnrepr(self)
-
- def__repr__(self)->str:
- return(
- '{}(query={}, id_list={}, max_results={}, sort_by={}, '
- 'sort_order={})'
- ).format(
- _classname(self),
- repr(self.query),
- repr(self.id_list),
- repr(self.max_results),
- repr(self.sort_by),
- repr(self.sort_order)
- )
-
- def_url_args(self)->Dict[str,str]:
- """
- Returns a dict of search parameters that should be included in an API
- request for this search.
- """
- return{
- "search_query":self.query,
- "id_list":','.join(self.id_list),
- "sortBy":self.sort_by.value,
- "sortOrder":self.sort_order.value
- }
-
-     def get(self) -> Generator[Result, None, None]:
-         """
-         **Deprecated** after 1.2.0; use `Search.results`.
-         """
-         warnings.warn(
-             "The 'get' method is deprecated, use 'results' instead",
-             DeprecationWarning,
-             stacklevel=2
-         )
-         return self.results()
-
-     def results(self) -> Generator[Result, None, None]:
-         """
-         Executes the specified search using a default arXiv API client.
-
-         For info on default behavior, see `Client.__init__` and `Client.results`.
-         """
-         return Client().results(self)
-
-
-
-
-
A specification for a search of arXiv's database.
-
-
To run a search, use Search.results with a default client or Client.results
-with a specific client.
def results(self) -> Generator[Result, None, None]:
-
-
-
-
-
Executes the specified search using a default arXiv API client.
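
As a rough illustration of how a search's fields end up in a request, the sketch below mirrors what `Search._url_args` and `Client._format_url` do in the listing above, using only the standard library. The function name `build_query_url` and its plain-string parameters are assumptions made for this example; they are not part of the package.

```python
from urllib.parse import urlencode

# Endpoint format copied from Client.query_url_format in the source listing.
QUERY_URL_FORMAT = "http://export.arxiv.org/api/query?{}"


def build_query_url(query="", id_list=(), sort_by="relevance",
                    sort_order="descending", start=0, page_size=100):
    """Hypothetical helper: combine search and paging parameters into a URL."""
    url_args = {
        "search_query": query,
        "id_list": ",".join(id_list),  # IDs are sent as one comma-separated value
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "start": start,                # index of the first result on this page
        "max_results": page_size,      # per-request limit, not the search total
    }
    return QUERY_URL_FORMAT.format(urlencode(url_args))


url = build_query_url(query="quantum", sort_by="submittedDate", page_size=10)
print(url)
```

Note that the `max_results` URL parameter carries the page size for a single request, while `Search.max_results` bounds the total number of results yielded across all pages.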
class Client(object):
-     """
-     Specifies a strategy for fetching results from arXiv's API.
-
-     This class obscures pagination and retry logic, and exposes
-     `Client.results`.
-     """
-
-     query_url_format = 'http://export.arxiv.org/api/query?{}'
-     """The arXiv query API endpoint format."""
-     page_size: int
-     """Maximum number of results fetched in a single API request."""
-     delay_seconds: int
-     """Number of seconds to wait between API requests."""
-     num_retries: int
-     """Number of times to retry a failing API request."""
-     _last_request_dt: datetime
-
-     def __init__(
-         self,
-         page_size: int = 100,
-         delay_seconds: int = 3,
-         num_retries: int = 3
-     ):
-         """
-         Constructs an arXiv API client with the specified options.
-
-         Note: the default parameters should provide a robust request strategy
-         for most use cases. Extreme page sizes, delays, or retries risk
-         violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
-         brittle behavior, and inconsistent results.
-         """
-         self.page_size = page_size
-         self.delay_seconds = delay_seconds
-         self.num_retries = num_retries
-         self._last_request_dt = None
-
-     def __str__(self) -> str:
-         # TODO: develop a more informative string representation.
-         return repr(self)
-
-     def __repr__(self) -> str:
-         return '{}(page_size={}, delay_seconds={}, num_retries={})'.format(
-             _classname(self),
-             repr(self.page_size),
-             repr(self.delay_seconds),
-             repr(self.num_retries)
-         )
-
-     def get(self, search: Search) -> Generator[Result, None, None]:
-         """
-         **Deprecated** after 1.2.0; use `Client.results`.
-         """
-         warnings.warn(
-             "The 'get' method is deprecated, use 'results' instead",
-             DeprecationWarning,
-             stacklevel=2
-         )
-         return self.results(search)
-
-     def results(self, search: Search) -> Generator[Result, None, None]:
-         """
-         Uses this client configuration to fetch one page of the search results
-         at a time, yielding the parsed `Result`s, until `max_results` results
-         have been yielded or there are no more search results.
-
-         If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.
-
-         For more on using generators, see
-         [Generators](https://wiki.python.org/moin/Generators).
-         """
-         offset = 0
-         # total_results may be reduced according to the feed's
-         # opensearch:totalResults value.
-         total_results = search.max_results
-         first_page = True
-         while offset < total_results:
-             page_size = min(self.page_size, search.max_results - offset)
-             logger.info("Requesting {} results at offset {}".format(
-                 page_size,
-                 offset,
-             ))
-             page_url = self._format_url(search, offset, page_size)
-             feed = self._parse_feed(page_url, first_page)
-             if first_page:
-                 # NOTE: this is an ugly fix for a known bug. The totalresults
-                 # value is set to 1 for results with zero entries. If that API
-                 # bug is fixed, we can remove this conditional and always set
-                 # `total_results = min(...)`.
-                 if len(feed.entries) == 0:
-                     logger.info("Got empty results; stopping generation")
-                     total_results = 0
-                 else:
-                     total_results = min(
-                         total_results,
-                         int(feed.feed.opensearch_totalresults)
-                     )
-                     logger.info("Got first page; {} of {} results available".format(
-                         total_results,
-                         search.max_results
-                     ))
-                 # Subsequent pages are not the first page.
-                 first_page = False
-             # Update offset for next request: account for received results.
-             offset += len(feed.entries)
-             # Yield query results until page is exhausted.
-             for entry in feed.entries:
-                 try:
-                     yield Result._from_feed_entry(entry)
-                 except Result.MissingFieldError:
-                     logger.warning("Skipping partial result")
-                     continue
-
-     def _format_url(self, search: Search, start: int, page_size: int) -> str:
-         """
-         Constructs a request URL for `search` that returns up to `page_size`
-         results starting with the result at index `start`.
-         """
-         url_args = search._url_args()
-         url_args.update({
-             "start": start,
-             "max_results": page_size,
-         })
-         return self.query_url_format.format(urlencode(url_args))
-
-     def _parse_feed(
-         self,
-         url: str,
-         first_page: bool = True
-     ) -> feedparser.FeedParserDict:
-         """
-         Fetches the specified URL and parses it with feedparser.
-
-         If a request fails or is unexpectedly empty, retries the request up to
-         `self.num_retries` times.
-         """
-         # Invoke the recursive helper with initial available retries.
-         return self.__try_parse_feed(
-             url,
-             first_page=first_page,
-             retries_left=self.num_retries
-         )
-
-     def __try_parse_feed(
-         self,
-         url: str,
-         first_page: bool,
-         retries_left: int,
-         last_err: Exception = None,
-     ) -> feedparser.FeedParserDict:
-         """
-         Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
-         number of seconds has not passed since `_parse_feed` was last called,
-         sleeps until delay_seconds seconds have passed.
-         """
-         retry = self.num_retries - retries_left
-         # If this call would violate the rate limit, sleep until it doesn't.
-         if self._last_request_dt is not None:
-             required = timedelta(seconds=self.delay_seconds)
-             since_last_request = datetime.now() - self._last_request_dt
-             if since_last_request < required:
-                 to_sleep = (required - since_last_request).total_seconds()
-                 logger.info("Sleeping for %f seconds", to_sleep)
-                 time.sleep(to_sleep)
-         logger.info("Requesting page of results", extra={
-             'url': url,
-             'first_page': first_page,
-             'retry': retry,
-             'last_err': last_err.message if last_err is not None else None,
-         })
-         feed = feedparser.parse(url)
-         self._last_request_dt = datetime.now()
-         err = None
-         if feed.status != 200:
-             err = HTTPError(url, retry, feed)
-         elif len(feed.entries) == 0 and not first_page:
-             err = UnexpectedEmptyPageError(url, retry)
-         if err is not None:
-             if retries_left > 0:
-                 return self.__try_parse_feed(
-                     url,
-                     first_page=first_page,
-                     retries_left=retries_left - 1,
-                     last_err=err,
-                 )
-             # Feed was never returned in self.num_retries tries. Raise the last
-             # exception encountered.
-             raise err
-         return feed
-
-
-
-
-
Specifies a strategy for fetching results from arXiv's API.
-
-
This class obscures pagination and retry logic, and exposes
-Client.results.
-
-
-
-
-
-
-
- Client(page_size: int = 100, delay_seconds: int = 3, num_retries: int = 3)
-
-
-
-
-
-
-
-
-
Constructs an arXiv API client with the specified options.
-
-
Note: the default parameters should provide a robust request strategy
-for most use cases. Extreme page sizes, delays, or retries risk
-violating the arXiv API Terms of Use,
-brittle behavior, and inconsistent results.
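
The `delay_seconds` option amounts to a minimum gap between consecutive requests. A minimal sketch of that idea follows, modeled on the timestamp comparison in `__try_parse_feed`; the `Throttle` class and its `wait` method are hypothetical names introduced for this example, and the short delay is chosen only so the demonstration finishes quickly.

```python
import time
from datetime import datetime, timedelta


class Throttle:
    """Hypothetical helper: enforce a minimum delay between calls."""

    def __init__(self, delay_seconds: float = 3):
        self.delay_seconds = delay_seconds
        self._last_request_dt = None  # no request has been made yet

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        slept = 0.0
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last = datetime.now() - self._last_request_dt
            if since_last < required:
                slept = (required - since_last).total_seconds()
                time.sleep(slept)
        # Record the time of this call so the next one can be measured against it.
        self._last_request_dt = datetime.now()
        return slept


throttle = Throttle(delay_seconds=0.2)
first = throttle.wait()   # nothing to wait for on the first call
second = throttle.wait()  # called right away, so it sleeps for the remainder
print(first, second)
```

Because the timestamp lives on the instance, the delay applies across every search run through the same client, not per search.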
def results(self, search: Search) -> Generator[Result, None, None]:
-
-
-
-
-
Uses this client configuration to fetch one page of the search results
-at a time, yielding the parsed Results, until max_results results
-have been yielded or there are no more search results.
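
The paging behavior described above can be tried without touching the network. The sketch below is an approximation under stated assumptions: `paginate`, `fake_fetch`, and `RECORDS` are hypothetical names, and the fake backend stands in for an API request that reports how many results exist in total. Unlike the library, which raises `UnexpectedEmptyPageError` after exhausting retries, this sketch simply stops on an empty later page.

```python
from typing import Callable, Iterator, List, Tuple


def paginate(
    fetch_page: Callable[[int, int], Tuple[List[str], int]],
    max_results: float = float("inf"),
    page_size: int = 100,
) -> Iterator[str]:
    """Yield entries page by page until max_results or the data runs out.

    fetch_page(offset, size) is a stand-in for one API request; it returns
    (entries, total_available).
    """
    offset = 0
    total = max_results
    first_page = True
    while offset < total:
        size = int(min(page_size, total - offset))
        entries, available = fetch_page(offset, size)
        if first_page:
            # The first response tells us how many results actually exist.
            total = min(total, available) if entries else 0
            first_page = False
        if not entries:
            break  # simplification: stop instead of retrying or raising
        offset += len(entries)
        yield from entries


# A fake backend with 250 records, used only to exercise the loop.
RECORDS = ["paper-{}".format(i) for i in range(250)]
requests_made = []


def fake_fetch(offset, size):
    requests_made.append((offset, size))
    return RECORDS[offset:offset + size], len(RECORDS)


first_120 = list(paginate(fake_fetch, max_results=120))
print(len(first_120), requests_made)
```

Since the function is a generator, no request is made until the caller starts iterating, and a caller that stops early never triggers the remaining page requests.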
-
-
- class ArxivError(builtins.Exception):
-
-
-
-
class ArxivError(Exception):
-     """This package's base Exception class."""
-
-     url: str
-     """The feed URL that could not be fetched."""
-     retry: int
-     """
-     The request try number which encountered this error; 0 for the initial try,
-     1 for the first retry, and so on.
-     """
-     message: str
-     """Message describing what caused this error."""
-
-     def __init__(self, url: str, retry: int, message: str):
-         """
-         Constructs an `ArxivError` encountered while fetching the specified URL.
-         """
-         self.url = url
-         self.retry = retry
-         self.message = message
-         super().__init__(self.message)
-
-     def __str__(self) -> str:
-         return '{} ({})'.format(self.message, self.url)
-
class UnexpectedEmptyPageError(ArxivError):
-     """
-     An error raised when a page of results that should be non-empty is empty.
-
-     This should never happen in theory, but happens sporadically due to
-     brittleness in the underlying arXiv API; usually resolved by retries.
-
-     See `Client.results` for usage.
-     """
-     def __init__(self, url: str, retry: int):
-         """
-         Constructs an `UnexpectedEmptyPageError` encountered for the specified
-         API URL after `retry` tries.
-         """
-         self.url = url
-         super().__init__(url, retry, "Page of results was unexpectedly empty")
-
-     def __repr__(self) -> str:
-         return '{}({}, {})'.format(
-             _classname(self),
-             repr(self.url),
-             repr(self.retry)
-         )
-
-
-
-
-
An error raised when a page of results that should be non-empty is empty.
-
-
This should never happen in theory, but happens sporadically due to
-brittleness in the underlying arXiv API; usually resolved by retries.
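
To make "usually resolved by retries" concrete, here is a small sketch of a bounded retry loop. It uses a loop rather than the recursive helper in the library source, and `EmptyPageError`, `fetch_with_retries`, and `flaky_fetch` are hypothetical stand-ins written for this example.

```python
class EmptyPageError(Exception):
    """Hypothetical stand-in for UnexpectedEmptyPageError."""


def fetch_with_retries(fetch, num_retries=3):
    """Call fetch() up to 1 + num_retries times; raise the last error seen."""
    last_err = None
    for attempt in range(num_retries + 1):
        page = fetch()
        if page:
            return page
        # Remember the failure; try number 0 is the initial request.
        last_err = EmptyPageError("page was empty on try {}".format(attempt))
    raise last_err


calls = {"n": 0}


def flaky_fetch():
    """Simulated request that returns an empty page twice, then succeeds."""
    calls["n"] += 1
    return [] if calls["n"] < 3 else ["entry"]


page = fetch_with_retries(flaky_fetch)
print(page, calls["n"])
```

With `num_retries=3` a request is attempted at most four times, which matches the try numbering documented for `ArxivError.retry` (0 for the initial try, 1 for the first retry, and so on).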
def __init__(self, url: str, retry: int):
-
class HTTPError(ArxivError):
-     """
-     A non-200 status encountered while fetching a page of results.
-
-     See `Client.results` for usage.
-     """
-
-     status: int
-     """The HTTP status reported by feedparser."""
-     entry: feedparser.FeedParserDict
-     """The feed entry describing the error, if present."""
-
-     def __init__(self, url: str, retry: int, feed: feedparser.FeedParserDict):
-         """
-         Constructs an `HTTPError` for the specified status code, encountered for
-         the specified API URL after `retry` tries.
-         """
-         self.url = url
-         self.status = feed.status
-         # If the feed is valid and includes a single entry, trust it's an
-         # explanation.
-         if not feed.bozo and len(feed.entries) == 1:
-             self.entry = feed.entries[0]
-         else:
-             self.entry = None
-         super().__init__(
-             url,
-             retry,
-             "Page request resulted in HTTP {}: {}".format(
-                 self.status,
-                 self.entry.summary if self.entry else None,
-             ),
-         )
-
-     def __repr__(self) -> str:
-         return '{}({}, {}, {})'.format(
-             _classname(self),
-             repr(self.url),
-             repr(self.retry),
-             repr(self.status)
-         )
-
-
-
-
-
A non-200 status encountered while fetching a page of results.
def __init__(self, url: str, retry: int, feed: feedparser.FeedParserDict):
-
-
-
-
-
Constructs an HTTPError for the specified status code, encountered for
-the specified API URL after retry tries.