Unreliable results: pages from API are unexpectedly empty #43

Ecanlilar · 2020-07-29T20:43:00Z

Describe the bug
When running the query below, I receive an inconsistent count of results on nearly every run. The results below show the record counts generated by the code provided in the "To Reproduce" section. As you can see, I receive wildly different results each time. Is there a parameter setting I can adjust to receive more reliable results?

1: 10,000
2: 10,000
3: 14,800
4: 14,800
5: 14,800
6 (no max chunk results): 23,000
7 (no max chunk results): 8,000

To Reproduce
import arxiv
import pandas as pd

test = arxiv.query(
    query="quantum",
    id_list=[],
    max_results=None,
    start=0,
    sort_by="relevance",
    sort_order="descending",
    prune=True,
    iterative=False,
    max_chunk_results=1000,
)
test_df = pd.DataFrame(test)
print(len(test_df))

Expected behavior
I am expecting a consistent count of results from this query when run back to back (say within a few minutes or so of each other).

Versions

python version: 3.7.4
arxiv.py version: 0.5.3


lukasschwab · 2020-08-06T18:49:45Z

I can repro this without using pandas:

>>> for _ in range(50):
...     test = arxiv.query(query="quantum", id_list=[], max_results=None, start = 0, sort_by="relevance", sort_order="descending", prune=True, iterative=False, max_chunk_results=1000)
...     print(_, len(test))
...
0 5000
1 13000
2 4200
3 3000
4 1000
...

I suspect this has to do with something on arXiv's end rather than something in this client library––some rate limiting, for example––but I need to investigate more.

MohamedAliRashad · 2020-10-05T05:58:57Z

Any updates on this?

jannisborn · 2020-10-19T22:54:09Z

I observed the same issue as @Ecanlilar and would also be interested in a solution. Maybe the time_sleep argument in Search could help? It can't be controlled via query though.

jonas-nothnagel · 2021-03-31T09:57:45Z

Same problem. Any updates?

lukasschwab · 2021-04-02T05:01:13Z

Diagnosis

After some extended testing tonight, I'm confident that this is an issue with the underlying arXiv API and not with this client library; I'll close this issue here accordingly. The team that maintains the arXiv API is focused on building a JSON API to replace the existing Atom API; I'll raise this issue with them.

Unfortunately, I'm not sure what the root cause is on their end, so I don't have a recommendation. If you do want to fork and modify this package to add retries (described below), you might consider a smaller max_chunk_results page size than 1000 to make the retries faster.

The issue: the arXiv API sometimes returns a valid, but empty, feed with a 200 status even though there are entries available for the specified query at the specified offset. More at the bottom of this comment.

Available improvements

This client library can, and perhaps should, be modified to mitigate this issue using retries. arXiv feeds include an opensearch:totalResults property indicating the total number of entries corresponding to the query; Search._get_next should:

Pull this property to use it to limit pagination: n_left = min(self.max_results, <totalResults from feed>)
Retry if n_left > 0 but results is an empty list.
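
The two steps above can be sketched as follows. This is a minimal illustration and not the library's implementation: `iter_results`, `FetchPage`, and `make_flaky_fetcher` are hypothetical names, and the fetcher is assumed to return the page's entries together with the feed's totalResults value.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical fetcher signature: (start, page_size) -> (entries, total_results),
# where total_results mirrors the feed's opensearch:totalResults value.
FetchPage = Callable[[int, int], Tuple[List[dict], int]]


def iter_results(fetch_page: FetchPage,
                 max_results: Optional[int] = None,
                 page_size: int = 100,
                 num_retries: int = 3) -> Iterator[dict]:
    """Yield entries page by page, retrying pages that come back empty.

    num_retries is the number of attempts per page and must be >= 1.
    """
    start = 0
    limit = max_results  # None until the API reports the total
    while limit is None or start < limit:
        want = page_size if limit is None else min(page_size, limit - start)
        entries: List[dict] = []
        for _attempt in range(num_retries):
            entries, total = fetch_page(start, want)
            # Never expect more entries than the API says exist.
            limit = total if limit is None else min(limit, total)
            if entries or start >= limit:
                break
        if not entries:
            if start < limit:
                raise RuntimeError(
                    f"page at start={start} still empty after {num_retries} attempts")
            return
        page = entries[:max(limit - start, 0)]
        yield from page
        start += len(page)


def make_flaky_fetcher(total: int) -> FetchPage:
    """Fake API for demonstration: each page after the first is empty once."""
    seen = set()

    def fetch(start: int, size: int) -> Tuple[List[dict], int]:
        if start > 0 and start not in seen:
            seen.add(start)
            return [], total  # valid but empty page
        return [{"index": i} for i in range(start, min(start + size, total))], total

    return fetch


results = list(iter_results(make_flaky_fetcher(total=250), page_size=100))
print(len(results))  # 250
```

Passing the fetcher in as an argument keeps the pagination logic testable without network access; a real fetcher would wrap the feed request and parsing.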

I spotted an unrelated bug in _prune_result which I'll fix shortly.

I'm actually inclined to clean up this client more deeply, which will probably lead to a 1.0.0 release (and perhaps an interface that'll play nicer with the new JSON API when it's released).

Testing

I tested using the query I constructed earlier in this issue:

import arxiv
test = arxiv.query(query="quantum", id_list=[], max_results=None, start = 0, sort_by="relevance", sort_order="descending", prune=True, iterative=False, max_chunk_results=1000)

I modified two functions to shed some light on why _get_next stopped iterating:

I modified _parse to log the requested URL for each request, the resulting HTTP code, and the number of entries.
I modified _get_next to, when n_fetched == 0, double-check that result by reinvoking _parse with the arguments that yielded zero entries.

In one such run, I got an empty 200 response at start=6000:

{'bozo': False, 'entries': [], 'feed': {'links': [{'href': 'http://arxiv.org/api/query?search_query%3Dquantum%26id_list%3D%26start%3D6000%26max_results%3D1000', 'rel': 'self', 'type': 'application/atom+xml'}], 'title': 'ArXiv Query: search_query=quantum&amp;id_list=&amp;start=6000&amp;max_results=1000', 'title_detail': {'type': 'text/html', 'language': None, 'base': 'http://export.arxiv.org/api/query?search_query=quantum&id_list=&start=6000&max_results=1000&sortBy=relevance&sortOrder=descending', 'value': 'ArXiv Query: search_query=quantum&amp;id_list=&amp;start=6000&amp;max_results=1000'}, 'id': 'http://arxiv.org/api/U9c7OUmEOZDvAXlaxzJl09rG9z0', 'guidislink': True, 'link': 'http://arxiv.org/api/U9c7OUmEOZDvAXlaxzJl09rG9z0', 'updated': '2021-04-02T00:00:00-04:00', 'updated_parsed': time.struct_time(tm_year=2021, tm_mon=4, tm_mday=2, tm_hour=4, tm_min=0, tm_sec=0, tm_wday=4, tm_yday=92, tm_isdst=0), 'opensearch_totalresults': '320665', 'opensearch_startindex': '6000', 'opensearch_itemsperpage': '1000'}, 'headers': {'date': 'Fri, 02 Apr 2021 04:41:09 GMT', 'server': 'Apache', 'access-control-allow-origin': '*', 'vary': 'Accept-Encoding,User-Agent', 'content-encoding': 'gzip', 'content-length': '412', 'connection': 'close', 'content-type': 'application/atom+xml; charset=UTF-8'}, 'href': 'http://export.arxiv.org/api/query?search_query=quantum&id_list=&start=6000&max_results=1000&sortBy=relevance&sortOrder=descending', 'status': 200, 'encoding': 'UTF-8', 'version': 'atom10', 'namespaces': {'': 'http://www.w3.org/2005/Atom', 'opensearch': 'http://a9.com/-/spec/opensearch/1.1/'}}

But re-calling _parse yielded 1000 entries. In this case, a retry would continue _get_next's iteration.
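
A check for this situation could look roughly like the sketch below. It assumes a plain dict shaped like the dump above (the keys `opensearch_totalresults` and `opensearch_startindex` are taken from that dump); `is_unexpectedly_empty` is a hypothetical helper, not part of the library.

```python
def is_unexpectedly_empty(parsed: dict) -> bool:
    """True if the page has no entries although the reported total says it should."""
    feed = parsed.get("feed", {})
    total = int(feed.get("opensearch_totalresults", 0))
    start = int(feed.get("opensearch_startindex", 0))
    return len(parsed.get("entries", [])) == 0 and start < total


# Trimmed-down copy of the empty 200 response shown above.
empty_page = {
    "entries": [],
    "feed": {
        "opensearch_totalresults": "320665",
        "opensearch_startindex": "6000",
        "opensearch_itemsperpage": "1000",
    },
    "status": 200,
}

print(is_unexpectedly_empty(empty_page))  # True
```

A page that is empty because the offset is past the reported total returns False, so only the suspicious case would trigger a retry.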

lukasschwab · 2021-04-02T07:08:24Z

Did some more work on this tonight.

Anecdotally, retries (and other weird behavior like partial pages) seem to happen more with large page sizes; reducing the page size from 1000 to 100 makes this issue hard to reproduce. Hope that's helpful!

I've started sketching out a v1.0.0 client that adds retries; in my cursory testing so far, a small number of retries (default: 3) seems to make this behave more robustly.

That sketch is here: https://github.com/lukasschwab/arxiv.py/tree/v1.0.0-rewrite

But beware:

There's still a lot of v0.x functionality to reimplement
I need to clean up a lot (tests, docs, removing now-unused code)
This is definitely a breaking change; I may go so far as to define a result-entry class, to make results here easier to work with than the existing dicts.
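
As a rough back-of-the-envelope model of why a small number of retries helps, the sketch below assumes, purely for illustration, that each request independently comes back empty with some fixed probability. The real failure pattern on arXiv's side is unknown, and the 10% figure is made up.

```python
def completion_probability(p_empty: float, pages: int, attempts: int = 1) -> float:
    """Chance that every one of `pages` pages succeeds within `attempts` tries.

    Assumes independent failures with probability `p_empty` per request,
    which is an illustrative assumption rather than measured behavior.
    """
    page_fails = p_empty ** attempts    # all attempts for one page come back empty
    return (1.0 - page_fails) ** pages  # no page exhausts its attempts


# Hypothetical numbers: 20 pages of 1000 results, 10% of requests empty.
print(completion_probability(0.10, pages=20, attempts=1))
print(completion_probability(0.10, pages=20, attempts=3))
```

Under this model a single empty response ends a run when there are no retries, which is consistent with the widely varying counts reported earlier in the thread.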

Thanks for the input on this issue; I think this'll lead to a meaningful improvement in this package 😁

lukasschwab · 2021-04-04T05:01:37Z

v1.0.0 is released, and it implements retries! https://github.com/lukasschwab/arxiv.py/releases/tag/1.0.0

Cheers

jonas-nothnagel · 2021-04-12T15:07:40Z

Hi @lukasschwab,

First of all: wow. Thank you so much for this comprehensive update and new release. It's really great to see how much work you put into this and how well you document it!

I am now testing the 1.0.1 release and encounter the following issue with a minimal working example:

import arxiv

def query_arxiv(string_query):
    search = arxiv.Search(
        query=string_query,
        # max_results=20000,
    )
    return search

for i in range(0, 5):
    print("try:", i)
    search = query_arxiv("abs:disaster risk management")

If I set max_results to a small value it works well and the results are consistent. However, for values >1000, or when I don't set a value at all, I always run into "Page of results was unexpectedly empty".

I feel this was already answered somewhere else, but what is the best way to query tens of thousands of results with the new release?

For example for leaving max_results unspecified I run into:

lukasschwab · 2021-04-12T19:18:37Z

@jonas-nothnagel I think you're successfully fetching your 20,000 results! I just need to make the logging clearer. (Opened #56)

Each of those log lines is written from UnexpectedEmptyPageError.__init__. The error is constructed, but it is only raised if all retries are exhausted:

arxiv.py/arxiv/arxiv.py

Lines 374 to 385 in bb625a2

for retry in range(self.num_retries):
    logger.info("Requesting feed", extra={'retry': retry, 'url': url})
    feed = feedparser.parse(url)
    self._last_request_dt = datetime.now()
    if feed.status != 200:
        err = HTTPError(url, retry, feed.status)
    elif len(feed.entries) == 0 and not first_page:
        err = UnexpectedEmptyPageError(url, retry)
    else:
        return feed
# Raise the last exception encountered.
raise err

If all the retries are exhausted and the error is raised, then the generator will stop producing results and you'll see the full exception logged to the console. No more pages will be fetched.

The logs you're seeing say this: the API sometimes sends you empty pages, but the retried requests are succeeding! Otherwise the whole thing would stop.

Does that make sense?
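
The "constructed but not raised" behavior can be reproduced in isolation. The sketch below is a simplified stand-in, not the library's code: `EmptyPageError`, `fetch_with_retries`, and the `retry_demo` logger name are hypothetical, and the fetcher is a fake that returns an empty page once before succeeding.

```python
import logging

logger = logging.getLogger("retry_demo")  # hypothetical logger name


class EmptyPageError(Exception):
    """Stand-in for UnexpectedEmptyPageError: it logs as soon as it is constructed."""

    def __init__(self, url: str, retry: int):
        self.url = url
        self.retry = retry
        super().__init__(f"Page of results was unexpectedly empty ({url})")
        logger.warning("%s (retry %d)", self, retry)


def fetch_with_retries(fetch, url: str, num_retries: int = 3):
    """Return the first non-empty page; raise only if every attempt is empty."""
    err = None
    for retry in range(num_retries):
        entries = fetch(url)
        if entries:
            return entries
        err = EmptyPageError(url, retry)  # logged here, but not raised yet
    raise err


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    responses = iter([[], ["entry-1", "entry-2"]])  # first attempt is empty
    page = fetch_with_retries(lambda url: next(responses), "http://example.invalid/query")
    print(page)  # the warning was logged once, yet the call still succeeded
```

Seeing the warning therefore only means that an attempt came back empty; the fetch has failed only if the exception actually propagates.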

If you'd like, you can log the results as you go along to see that they're still being fetched:

import arxiv

generator = arxiv.Search("abs:disaster risk management", max_results=20000).get()
results = []

for result in generator:
  print("Got result:", result.entry_id)
  results.append(result)

When I improve the logging it'll be easier to see the underlying requests as they happen.

bilalazhar72 · 2023-10-17T13:38:42Z

> I observed the same issue as @Ecanlilar and would also be interested in a solution. Maybe the time_sleep argument in Search could help? It can't be controlled via query though.

In my script I tried time.sleep; still the same issue.

lukasschwab · 2023-10-17T23:23:17Z

@bilalazhar72 see #129. I recommend upgrading to the v2.0.0 client if you haven't already.

lukasschwab · 2024-03-15T19:57:46Z

@vaish30 wrong place for this — responding here: #155

lukasschwab closed this as completed Apr 2, 2021

lukasschwab added the api Issues that correspond to arXiv API behavior rather than behavior introduced by this wrapper. label Apr 2, 2021

lukasschwab mentioned this issue Apr 4, 2021

Introduce v1.0.0 client: refactor into Client, Search, Result #51

Merged

3 tasks

jannisborn mentioned this issue Apr 6, 2021

Randomness in arxiv API requests jannisborn/paperscraper#8

Closed

lukasschwab added the 1.0.0 Tasks blocking a 1.0.0 release or related to its breaking changes. label Apr 11, 2021

lukasschwab mentioned this issue Apr 12, 2021

Use Python logging best practices #56

Open

lukasschwab changed the title ~~Unreliable results~~ Unreliable results: pages from API are unexpectedly empty Apr 12, 2021

lukasschwab mentioned this issue Apr 20, 2021

Update instructions for arxiv client v1.1.0 EPS-Libraries-Berkeley/volt#162

Merged

lukasschwab mentioned this issue Aug 15, 2021

Upgrade arxiv dependency to 1.4.2 jannisborn/paperscraper#10

Merged

robisen1 mentioned this issue Jun 29, 2023

UnexpectedEmptyPageError and associated errorscre jannisborn/paperscraper#31

Closed

lukasschwab mentioned this issue Oct 15, 2023

Investigating: arXiv API flakiness #129

Closed

lukasschwab mentioned this issue Mar 15, 2024

Request for help #155

Closed
