Unreliable results: pages from API are unexpectedly empty #43
Comments
I can repro this without using
I suspect this has to do with something on arXiv's end rather than something in this client library (some rate limiting, for example), but I need to investigate more.
Any updates on this?
I observed the same issue as @Ecanlilar and would also be interested in a solution. Maybe the
Same problem. Any updates?
Diagnosis
After some extended testing tonight, I'm confident that this is an issue with the underlying arXiv API and not with this client library; I'll close this issue here accordingly. The team that maintains the arXiv API is focused on building a JSON API to replace the existing Atom API; I'll raise this issue with them.
Unfortunately, I'm not sure what the root cause is on their end, so I don't have a recommendation. If you do want to fork and modify this package to add retries (described below), you might consider a smaller
The issue: the arXiv API sometimes returns a valid, but empty, feed with a 200 status even though there are entries available for the specified query at the specified offset. More at the bottom of this comment.
Available improvements
This client library can, and perhaps should, be modified to mitigate this issue using retries. arXiv feeds include a
I spotted an unrelated bug in
I'm actually inclined to clean up this client more deeply, which will probably lead to a 1.0.0 release (and perhaps an interface that'll play nicer with the new JSON API when it's released).
Testing
I tested using the query I constructed earlier in this issue:
import arxiv
test = arxiv.query(query="quantum", id_list=[], max_results=None, start=0, sort_by="relevance", sort_order="descending", prune=True, iterative=False, max_chunk_results=1000)
I modified two functions to shed some light on why
In one such run, I got an empty 200 response at
But re-calling
Did some more work on this tonight. Anecdotally, retries (and other weird behavior like partial pages) seem to happen more with large page sizes; reducing the page size from 1000 to 100 makes this issue hard to reproduce. Hope that's helpful!
I've started sketching out a v1.0.0 client that adds retries; in my cursory testing so far, a small number of retries (default: 3) seems to make this behave more robustly. That sketch is here: https://github.com/lukasschwab/arxiv.py/tree/v1.0.0-rewrite
But beware:
Thanks for the input on this issue; I think this'll lead to a meaningful improvement in this package 😁
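For anyone patching a fork in the meantime, here is a minimal sketch of the retry idea discussed above. It is not the code from the v1.0.0 rewrite: `fetch_page`, `fetch_page_with_retries`, and `EmptyPageError` are hypothetical names, and `fetch_page` stands in for whatever performs the request and parses the feed. It also assumes the caller already knows that entries exist at the requested offset, since an empty page at the true end of a result set is legitimate.

```python
import time
from typing import Callable, List


class EmptyPageError(Exception):
    """Hypothetical error: a page stayed empty after every retry."""


def fetch_page_with_retries(
    fetch_page: Callable[[int, int], List[str]],
    start: int,
    page_size: int = 100,   # smaller pages were harder to break in testing
    num_retries: int = 3,
    delay_seconds: float = 0.0,
) -> List[str]:
    """Call fetch_page(start, page_size) until it returns at least one entry."""
    for attempt in range(1 + num_retries):
        entries = fetch_page(start, page_size)
        if entries:
            return entries
        if attempt < num_retries:
            # Pause before asking for the same page again.
            time.sleep(delay_seconds)
    raise EmptyPageError(
        f"page at start={start} still empty after {num_retries} retries"
    )


if __name__ == "__main__":
    attempts = []

    def flaky_fetch(start: int, page_size: int) -> List[str]:
        # Simulated API: the first response is an empty page, later ones are fine.
        attempts.append(start)
        if len(attempts) == 1:
            return []
        return [f"entry-{i}" for i in range(start, start + page_size)]

    print(fetch_page_with_retries(flaky_fetch, start=0, page_size=3))
```

Passing the fetcher in as a parameter keeps the retry policy separate from the HTTP and parsing code, so it can be exercised against a fake like `flaky_fetch` without touching the network.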
v1.0.0 is released, and it implements retries! https://github.com/lukasschwab/arxiv.py/releases/tag/1.0.0 Cheers
Hi @lukasschwab, first of all: wow. Thank you so much for this comprehensive update and new release. Really great to see how much work you put into this and how well you document it! I am now testing the 1.0.1 release and encounter the following issue with a minimal working example:
If I set
I feel this was already answered somewhere else, but what is the best way to query tens of thousands of results with the new release? For example for leaving
@jonas-nothnagel I think you're successfully fetching your 20,000 results! I just need to make the logging clearer. (Opened #56) Each of those log lines is written from Lines 374 to 385 in bb625a2
If all the retries are exhausted and the error is raised, then the generator will stop producing results and you'll see the full exception logged to the console. No more pages will be fetched.
The logs you're seeing say this: the API sometimes sends you empty pages, but the retried requests are succeeding! Otherwise the whole thing would stop. Does that make sense?
If you'd like, you can log the results as you go along to see that they're still being fetched:
import arxiv
generator = arxiv.Search("abs:disaster risk management", max_results=20000).get()
results = []
for result in generator:
    print("Got result:", result.entry_id)
    results.append(result)
When I improve the logging it'll be easier to see the underlying requests as they happen.
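To make the behaviour described above concrete, here is a small self-contained simulation. It is only a sketch of the semantics, not the library's actual code: `iter_results`, `RetriesExhausted`, and the fake fetchers are hypothetical names, and it assumes the total result count is known up front. An empty page is retried silently, and the generator stops (by raising) only when every attempt for the same page comes back empty.

```python
from typing import Callable, Iterator, List


class RetriesExhausted(Exception):
    """Hypothetical error: every attempt for one page came back empty."""


def iter_results(
    fetch_page: Callable[[int, int], List[str]],
    total: int,
    page_size: int = 100,
    num_retries: int = 3,
) -> Iterator[str]:
    """Yield results page by page, retrying pages that come back empty."""
    start = 0
    while start < total:
        for _ in range(1 + num_retries):
            page = fetch_page(start, page_size)
            if page:
                break
        else:
            # All attempts were empty: raising ends the generator,
            # so no further pages are requested.
            raise RetriesExhausted(f"no entries at offset {start}")
        page = page[: total - start]  # never yield past the requested total
        yield from page
        start += len(page)


if __name__ == "__main__":
    def dead_after_200(start: int, page_size: int) -> List[str]:
        # Simulated API that returns only empty pages from offset 200 onward.
        if start >= 200:
            return []
        return [f"id-{i}" for i in range(start, start + page_size)]

    collected = []
    try:
        for result in iter_results(dead_after_200, total=500):
            collected.append(result)
    except RetriesExhausted as err:
        print("Stopped early:", err)
    print("Collected", len(collected), "results before the generator stopped")
```

Results yielded before the failure are still in `collected`, which is why appending as you iterate (as in the snippet above) is a reasonable way to check progress on a long fetch.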
In my script I tried time.sleep; still the same issue.
@bilalazhar72 see #129. I recommend upgrading to the v2.0.0 client if you haven't already.
Describe the bug
When running the query below, I receive an inconsistent count of results nearly every run. The results below contain the record count generated by the code provided in the "To Reproduce" section. As you can see, I receive wildly different results each time. Is there a parameter setting I can adjust to receive more reliable results?
To Reproduce
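import arxiv
import pandas as pd

# Fetch every result for "quantum" in chunks of 1000 (arxiv.py 0.5.3 API).
test = arxiv.query(
    query="quantum",
    id_list=[],
    max_results=None,
    start=0,
    sort_by="relevance",
    sort_order="descending",
    prune=True,
    iterative=False,
    max_chunk_results=1000,
)
test_df = pd.DataFrame(test)
# The record count printed here differs from run to run.
print(len(test_df))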
Expected behavior
I am expecting a consistent count of results from this query when run back to back (say within a few minutes or so of each other).
Versions
python version: 3.7.4
arxiv.py version: 0.5.3