Enable user to use .export for PDF download #87

Open
dev-89 opened this issue Nov 23, 2021 · 5 comments
Labels: enhancement (Requests for new features or improvements.)

Comments

@dev-89

dev-89 commented Nov 23, 2021

Motivation

The arxiv library queries papers through the export.arxiv.org subdomain, but downloads them directly from arxiv.org. As a result, a user who downloads too many papers can get blocked by arXiv.

Solution

A solution would be to modify the paper PDF url to point to the corresponding .export subdomain. In the code for my personal use I simply use:

idx = paper.pdf_url.index('arxiv')
paper.pdf_url = paper.pdf_url[:idx] + 'export.' + paper.pdf_url[idx:]

where paper is a Result instance. This solution is incomplete, though: the export subdomain is not guaranteed to exist, so that would need to be checked. I would add this functionality to the _get_pdf_url method. A boolean flag user_export could be introduced for users who wish to download directly from arxiv.org, even though that is not advised according to the "Play Nice" section of https://arxiv.org/help/bulk_data.
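A slightly more defensive version of the rewrite above might look like the sketch below. It parses the URL instead of searching for the substring 'arxiv', so a URL that already points at export.arxiv.org is left alone. The helper name to_export_url is hypothetical and not part of the library; user_export is the flag proposed above, and the example paper ID is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def to_export_url(pdf_url: str, user_export: bool = True) -> str:
    """Hypothetical helper: point an arxiv.org URL at export.arxiv.org.

    `user_export` mirrors the flag proposed in this issue; when False the
    URL is returned unchanged.
    """
    if not user_export:
        return pdf_url
    parts = urlsplit(pdf_url)
    host = parts.hostname or ""
    # Only rewrite the main site's host. URLs that already use the export
    # subdomain, or that point somewhere else entirely, pass through as-is.
    if host in ("arxiv.org", "www.arxiv.org"):
        return urlunsplit(parts._replace(netloc="export.arxiv.org"))
    return pdf_url


if __name__ == "__main__":
    print(to_export_url("http://arxiv.org/pdf/1605.08386v1"))
```

Unlike the index-based snippet, calling this twice on the same URL does not produce an export.export. prefix.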

@dev-89 dev-89 added the enhancement Requests for new features or improvements. label Nov 23, 2021
@lukasschwab
Owner

Out of curiosity, did you run into rate-limiting yourself? Do you know when it kicked in (roughly)?

There's an export.arxiv.org record for every result from the API, so it should be safe to add the export subdomain before downloading, but it might be best to manage this with an optional flag in the download_pdf/download_source arguments.

We also need to confirm the download behavior when a PDF does not already exist for the export.arxiv.org record. In the browser, there's an intermediate "we're generating this PDF from source" page (screenshot below), then a redirect to the PDF once it's generated.

[Screenshot, 2021-11-25: the intermediate "generating this PDF from source" page]

These cases must be handled gracefully.
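One way to handle the "PDF not generated yet" case is to poll until the response actually is a PDF. The sketch below is only an illustration under assumptions: it assumes the intermediate page comes back as a non-PDF response (which has not been confirmed for export.arxiv.org), and fetch_pdf_when_ready and the Fetch callable are hypothetical names, not part of the library. The HTTP call is injected so the retry logic can be exercised without touching arXiv.

```python
import time
from typing import Callable, Tuple

# One GET request: returns (content_type, body). Injected by the caller so
# the retry logic below can be tested offline with a fake.
Fetch = Callable[[str], Tuple[str, bytes]]


def fetch_pdf_when_ready(
    url: str,
    fetch: Fetch,
    attempts: int = 5,
    wait_seconds: float = 3.0,
) -> bytes:
    """Hypothetical helper: poll `url` until the response is a PDF."""
    for attempt in range(attempts):
        content_type, body = fetch(url)
        # Check the header and the PDF magic bytes, so an HTML placeholder
        # page is never saved to disk as if it were the paper.
        if content_type.startswith("application/pdf") and body.startswith(b"%PDF"):
            return body
        # Wait before retrying, but not after the final attempt.
        if attempt < attempts - 1:
            time.sleep(wait_seconds)
    raise RuntimeError(f"no PDF at {url} after {attempts} attempts")
```

A real fetch callable could wrap urllib.request.urlopen and return the Content-Type header together with the response bytes; the wait between attempts also keeps the polling polite.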

@brandonrobertz

I honestly think this library should default to using export.arxiv.org for everything, with an optional flag to use the live site, which isn't meant for robots. The first thing I did with this library was accidentally run a query that got me blocked from arXiv for several hours. I bet a lot of users run into this, given the default values (a default page size of 300000, for example, is enough to get one blocked).

@lukasschwab
Owner

@brandonrobertz this library does use export.arxiv.org for everything except download URLs:

query_url_format = 'http://export.arxiv.org/api/query?{}'

The difference is that it receives download URLs from the API instead of building them.

Digression: let's chat limits.

default page size of 300000, for example, is enough to get one blocked

The default (Client).page_size is 100.

If you're interpreting the max_results limit in README.md, max_results isn't a page size; it's the maximum number of results across all pages for a search. If (Search).max_results = 300000 and (Client).page_size = 100, the client will make up to 3000 requests (iff there are ≥300,000 results available).

  • Maybe there should be a lower default.
  • Maybe there's a bug in the client code around delay_seconds. That delay between requests is meant to appease arXiv's rate limits, even for large queries.
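To make the arithmetic above concrete, here is a rough sketch of how max_results, page_size, and delay_seconds combine. The function estimate_paging_cost is hypothetical and not part of the client; it assumes the delay is applied once between each pair of consecutive page requests, and the 3.0 passed for delay_seconds is a placeholder value, not necessarily the client's default.

```python
from typing import Tuple


def estimate_paging_cost(
    max_results: int, page_size: int, delay_seconds: float
) -> Tuple[int, float]:
    """Hypothetical estimate: (number of requests, minimum seconds spent waiting)."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Ceiling division: a final partial page still costs one request.
    requests = (max_results + page_size - 1) // page_size
    # Assumes one delay between each pair of consecutive requests.
    waiting = max(requests - 1, 0) * delay_seconds
    return requests, waiting


if __name__ == "__main__":
    requests, waiting = estimate_paging_cost(300_000, 100, 3.0)
    print(requests, waiting)  # 3000 8997.0
```

With these inputs, a full 300,000-result search would spend about two and a half hours just waiting between requests, which is one argument for a lower default.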

Did you call (Result).download_pdf or (Result).download_source 300,000 times? If no, mind opening a separate issue to discuss your use case?

@brandonrobertz

Interesting. Sorry about the bad assumption; I didn't realize this used the export site. That's even more perplexing, then. And no, I didn't call download_pdf 300k times. I got a 403 after attempting results = arxiv.Search(query="cat:cs.LG").results()

I can open a separate PR.

@lukasschwab
Owner

@brandonrobertz No worries! Happy to advise.
