Investigating: arXiv API flakiness #129

Closed
lukasschwab opened this issue Oct 15, 2023 · 10 comments
Labels
api: Issues that correspond to arXiv API behavior rather than behavior introduced by this wrapper.
bug: Deviations from documented behavior.

@lukasschwab
Owner

lukasschwab commented Oct 15, 2023

Description

The arXiv API seems to be degraded. I expect to see more bug reports about this until the underlying issue is resolved.

Behavior identified in #43 seems to have intensified or changed in character (e.g. increased clustering, such that retries are more likely to re-fail, perhaps because of cached bad responses).

Why can't you fix the API?
: I'm not affiliated with arXiv — I maintain a wrapper library for an API I don't administer. I've written the arxiv-api Google Group about this issue.

Why aren't you merging bug fixes?
: Some of the proposed changes here (e.g. consolidating on HTTPS, pinning a specific feedparser version, etc.) are probably good changes regardless of the API's stability. I'm hesitant to rush merging and releasing changes without having a strong sense, through integration tests, that they don't damage this library's behavior. That judgment is subject to change, esp. if this issue persists.

Steps to reproduce

Steps to reproduce the behavior; ideally, include a code snippet.

Versions

  • python version: independent.
  • arxiv.py version: 1.*.*.

Additional context

PRs directly addressing the instability:

lukasschwab added the bug and api labels Oct 15, 2023
lukasschwab self-assigned this Oct 15, 2023
lukasschwab pinned this issue Oct 16, 2023
@lukasschwab
Owner Author

Good example of flakiness between identical versions/protocols: #132 (comment)

@liyucheng09

Good diagnosis for this issue. I guess there is not too much we can do unless they fix the backend.

@liyucheng09

BTW, I found that arXiv treats requests from programmatic clients differently from requests from real browsers. I suspect this flakiness is on purpose.

@lukasschwab
Owner Author

@liyucheng09 can you share any details on that investigation? In #127 I tried tweaking the user-agent.

@liyucheng09

I made about 300 attempts per hour today, more than 3000 in total. 0 out of 3000 succeeded.
When I passed a user-agent to feedparser, 28 out of 100 succeeded.
I suppose we can safely say arXiv is declining requests from programmatic clients.

@Ar4ikov

Ar4ikov commented Oct 17, 2023

Hello! feedparser, which the arxiv library needs in order to work, contains this (feedparser/__init__.py). I really can't describe my emotions when I saw it for the first time:

USER_AGENT = "feedparser/%s +https://github.com/kurtmckee/feedparser/" % __version__

Does the developer find this funny?

Instead of using a normally working application, I need to cp -r /path/to/site-packages/feedparser /path/to/my-project-dir/, change USER_AGENT to my real one, and finally the arXiv API works 100 times out of 100.

It would be much, MUCH better if feedparser used something like this:

from os import environ

# <...>
USER_AGENT = environ.get('PYTHON_FEEDPARSER_USER_AGENT', "feedparser/%s +https://github.com/kurtmckee/feedparser/" % __version__)  # thanks for the joke: I threw myself and 2 days in the garbage getting my project, which uses langchain and ArxivLoader, to run

@lukasschwab
Owner Author

lukasschwab commented Oct 17, 2023

@Ar4ikov I believe all currently-released versions of feedparser support specifying the User-Agent header through a named parameter (agent) to feedparser.parse, but — to your point — this package neither overrides the default nor exposes a way to set it.

I think the most robust change is to make the HTTP calls from arxiv (e.g. with requests), then pass the body to feedparser for parsing.

Nonetheless, my testing hasn't shown that updating the user agent makes the tests pass 100% of the time. Still searching, but I'll investigate this angle more.
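
A minimal sketch of the separation described above (make the HTTP call yourself, then hand the body to the parser). It is not arxiv.py's implementation. As an assumption for portability, the standard library's urllib.request stands in for requests, and the names build_request, fetch_body, and fetch_and_parse are hypothetical. The parse argument can be any callable that accepts a raw feed body, for example feedparser.parse.

```python
import urllib.request
from typing import Any, Callable


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_body(request: urllib.request.Request, timeout: float = 10.0) -> bytes:
    """Perform the HTTP call and return the raw response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def fetch_and_parse(
    url: str,
    user_agent: str,
    parse: Callable[[bytes], Any],
    fetch: Callable[[urllib.request.Request], bytes] = fetch_body,
) -> Any:
    """Fetch the feed ourselves, then pass only the body to the parser.

    `fetch` is injectable (a hypothetical design choice) so the HTTP layer
    can be swapped or stubbed out in tests without touching the parser.
    """
    request = build_request(url, user_agent)
    return parse(fetch(request))
```

With feedparser installed, a call might look like fetch_and_parse(url, "my-app/0.1", feedparser.parse), where url is an arXiv API query URL and the User-Agent string is a placeholder. Keeping the HTTP call outside the parser means headers, timeouts, and retries are controlled in one place. As noted above, changing the user agent alone did not make the tests pass every time.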

Update: I published the major version release.

If you find any issues with the new version unrelated to the API instability, please open separate issues for those! I rolled this release in a hurry.

@lukasschwab
Owner Author

The API seems much more stable now than it was over the weekend. CI is consistently succeeding locally.

I'm going to close this issue for the time being. I'll reopen it in the future if I see similar instability (increased rate of unexpectedly empty first pages, ConnectionReset errors).
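
For readers who hit the two symptoms named above (unexpectedly empty first pages and connection resets), here is a hedged client-side sketch of a retry loop. It is not part of arxiv.py: the name fetch_page_with_retries, its parameters, and the fixed delay are all assumptions. Which exception type a reset surfaces as depends on the HTTP client, so the except clause may need adapting.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def fetch_page_with_retries(
    fetch_page: Callable[[], List[T]],
    max_attempts: int = 3,
    delay_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Call fetch_page until it returns a non-empty page or attempts run out.

    An empty page is treated as suspicious and retried. If every attempt
    comes back empty, the empty page is returned, since the query may
    genuinely have no results.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            page = fetch_page()
            if page:
                return page
            last_error = None  # the latest attempt failed "softly" (empty page)
        except ConnectionResetError as err:
            # Assumption: the client raises the built-in ConnectionResetError;
            # other HTTP libraries may wrap it in their own exception type.
            last_error = err
        if attempt < max_attempts:
            sleep(delay_seconds)
    if last_error is not None:
        raise last_error
    return []
```

Because the failures reported in this thread appeared clustered, a short fixed delay may simply re-fail. A longer or increasing delay is a reasonable variation, and it also avoids adding load to the API.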

@jaypantone

I know this is closed, but I just wanted to add that over the last week or two I have started to experience this issue. The API calls occasionally return empty results erroneously.

@lukasschwab
Owner Author

lukasschwab commented Mar 19, 2024

@jaypantone yeah, lots of inbound issues about this. I don't work for arXiv, so I can't effect a change there directly.

Don't overload them with requests, but you might consider describing your issue on the arXiv mailing list:

I've pinned this issue in the hopes that more people find it rather than creating new ones.
