How to use the arXiv API from Python

Last updated: Sep 2025

In [1]:
import time
import subprocess

from utils import inspect_dictionary
from scrapers import get_arxiv_records, get_keywords

The arXiv API can be queried with an HTTP request. The base URL is:

http://export.arxiv.org/api/query

To find papers on a certain subject, use the `search_query` parameter with the `all:` prefix, which searches across all fields (title, author, abstract, comment, journal reference, subject category, report number). Spaces are not allowed in URLs, so encode them as + signs.

To get the most recent papers, sort by date. The `sortBy=submittedDate` option sorts by the date the article was first submitted; combine it with `sortOrder=descending` to put the newest papers first. To rank papers by how well they match the query, use `sortBy=relevance`. Note that this is a measure of match quality rather than popularity: results are ordered by an internally computed relevance score (the system's best estimate of how well each document matches the query). This appears to be Lucene's default relevance ranking, but I can't currently find the algorithmic details.
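The search and sort parameters described above can be assembled with the standard library. This is a minimal sketch, not the code behind `get_arxiv_records`; the name `build_query_url` and its default values are assumptions made for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/api/query"


def build_query_url(terms, sort_by="submittedDate", sort_order="descending", max_results=2):
    """Return an arXiv API URL that searches all fields for `terms`."""
    params = {
        "search_query": f"all:{terms}",
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "max_results": max_results,
    }
    # urlencode turns spaces into "+"; safe=":" keeps the "all:" prefix unescaped.
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


print(build_query_url("artificial intelligence"))
```

Passing `sort_by="relevance"` switches the same URL to relevance ordering.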

Example query to get the 2 most recent papers related to "artificial intelligence":

http://export.arxiv.org/api/query?search_query=all:artificial+intelligence&sortBy=submittedDate&sortOrder=descending&max_results=2
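The API answers with an Atom XML feed. The sketch below shows one way to turn such a feed into a dictionary of records. It is not the implementation of `get_arxiv_records`: `parse_feed`, the choice of dictionary keys, and `SAMPLE_FEED` are assumptions for illustration, and the sample is a hand-written placeholder rather than a real response.

```python
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Hand-written placeholder feed (not a real API response).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2025-01-01T00:00:00Z</published>
    <title>A placeholder title that wraps
  onto a second line</title>
    <summary>Placeholder abstract.</summary>
    <author><name>A. Author</name></author>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1" rel="related" type="application/pdf"/>
  </entry>
</feed>"""


def parse_feed(xml_text):
    """Map each entry's ID URL to a few selected fields."""
    records = {}
    for entry in ET.fromstring(xml_text).findall("atom:entry", ATOM):
        # An entry can carry more than one <link>; pick the one with title="pdf".
        pdf_link = entry.find("atom:link[@title='pdf']", ATOM)
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        published = entry.findtext("atom:published", default="", namespaces=ATOM)
        records[entry.findtext("atom:id", namespaces=ATOM)] = {
            # Long titles may be wrapped across lines, so collapse the whitespace.
            "title": " ".join(title.split()),
            # Keep only the YYYY-MM-DD part of the timestamp.
            "date_submitted": published[:10],
            "authors": [
                author.findtext("atom:name", namespaces=ATOM)
                for author in entry.findall("atom:author", ATOM)
            ],
            "abstract": entry.findtext("atom:summary", default="", namespaces=ATOM).strip(),
            "pdf_url": pdf_link.get("href") if pdf_link is not None else None,
        }
    return records


for record in parse_feed(SAMPLE_FEED).values():
    print(f"{record['date_submitted']} == {record['title']}")
```

For a live query, pass the response body (for example from `urllib.request.urlopen(url).read()`) to `parse_feed` instead of the sample.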

In [2]:
records = get_arxiv_records("artificial intelligence", sort_by="date", max_results=5)

for paper in records:
    print(f"{records[paper]['date_submitted']} == {records[paper]['title']}")
    # try:
    #     keywords = get_keywords(abstract=records[paper]["abstract"])
    #     time.sleep(3)
    #     print(keywords)
    #     records[paper]["generated_keywords"] = keywords
    # except:
    #     print(f"[ServerError] for paper {paper}: ")
    
print(f"Starting download of {len(records)} articles...")


# Print (but do not run) the download command for each paper.
for paper in records:
    pdf_url = records[paper].get("pdf_url")
    print(f"arxiv-downloader {pdf_url} -d papers")
    

2025-09-22 == UniPixel: Unified Object Referring and Segmentation for Pixel-Level
  Visual Reasoning
2025-09-22 == SEQR: Secure and Efficient QR-based LoRA Routing
2025-09-22 == OnePiece: Bringing Context Engineering and Reasoning to Industrial
  Cascade Ranking System
2025-09-22 == Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative
  Decoding
2025-09-22 == Reasoning Core: A Scalable RL Environment for LLM Symbolic Reasoning
Starting download of 5 articles...
arxiv-downloader http://arxiv.org/pdf/2509.18094v1 -d papers
arxiv-downloader http://arxiv.org/pdf/2509.18093v1 -d papers
arxiv-downloader http://arxiv.org/pdf/2509.18091v1 -d papers
arxiv-downloader http://arxiv.org/pdf/2509.18085v1 -d papers
arxiv-downloader http://arxiv.org/pdf/2509.18083v1 -d papers


In [None]:
! arxiv-downloader http://arxiv.org/pdf/2509.18094v1 -d ./papers

'arxiv-downloader' is not recognized as an internal or external command,
operable program or batch file.
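The error above means the `arxiv-downloader` command is not on the PATH of the shell that the notebook launches. One way around it, sketched below, is to check for the tool first and otherwise fetch the PDF with the standard library. The helper names `pdf_filename` and `download_pdf` are hypothetical, and the fallback does not try to reproduce whatever file naming `arxiv-downloader` uses.

```python
import shutil
import subprocess
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlopen


def pdf_filename(pdf_url):
    """Derive a local file name from the last path segment of the URL."""
    return PurePosixPath(urlparse(pdf_url).path).name + ".pdf"


def download_pdf(pdf_url, target_dir="papers"):
    """Download one PDF into target_dir and return that directory."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    if shutil.which("arxiv-downloader"):
        # Same command line as in the cell above, run without a shell.
        subprocess.run(["arxiv-downloader", pdf_url, "-d", str(target)], check=True)
    else:
        # Standard-library fallback: stream the response body to disk.
        with urlopen(pdf_url) as response, open(target / pdf_filename(pdf_url), "wb") as out:
            shutil.copyfileobj(response, out)
    return target


print(pdf_filename("http://arxiv.org/pdf/2509.18094v1"))
```

Calling `download_pdf(records[paper]["pdf_url"])` inside the loop over `records` would then replace the printed commands with actual downloads; a pause between requests, as with the `time.sleep(3)` used elsewhere in this notebook, keeps the request rate low.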


## Attempting to return specific papers
Target paper: "Attention Is All You Need" (https://arxiv.org/abs/1706.03762)

In [None]:
# specific query for Google's attention paper
records = get_arxiv_records("Attention Is All You Need", sort_by="popularity", max_results=3)

for paper in records:
    try:
        keywords = get_keywords(abstract=records[paper]["abstract"])
        time.sleep(3)
        print(keywords)
        records[paper]["generated_keywords"] = keywords
    except Exception as exc:
        print(f"[ServerError] for paper {paper}: {exc}")

In [None]:
# same paper, this time searched by author names instead of title
records_authors = get_arxiv_records("Ashish Vaswani, Noam Shazeer, Niki Parmar", sort_by="popularity", max_results=3)
for paper in records_authors:
    try:
        keywords = get_keywords(abstract=records_authors[paper]["abstract"])
        time.sleep(3)
        print(keywords)
        records_authors[paper]["generated_keywords"] = keywords
    except Exception as exc:
        print(f"[ServerError] for paper {paper}: {exc}")

In [None]:
# Results of the title search
for entry in records:
    print(f"======== {records[entry].get('title')}")
    print(f"        {records[entry].get('authors')}")
    print(f"        {records[entry].get('generated_keywords')}")
print()
print()
# Results of the author search
for entry in records_authors:
    print(f"======== {records_authors[entry].get('title')}")
    print(f"        {records_authors[entry].get('authors')}")
    print(f"        {records_authors[entry].get('generated_keywords')}")

NOTE: Searching for one specific paper this way seems difficult; the relevance metric is pretty opaque.
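The arXiv API also accepts field prefixes other than `all:` (for example `ti:` for title and `au:` for author), Boolean operators such as `AND`, and an `id_list` parameter for fetching papers by identifier. These may be more predictable than relying on relevance ranking. The sketch below only builds such URLs; `targeted_url` is a hypothetical helper, and whether `get_arxiv_records` can pass these queries through is not known from this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/api/query"


def targeted_url(**params):
    """Build a query URL from raw API parameters (search_query, id_list, ...)."""
    # safe=":" leaves field prefixes such as "ti:" unescaped; quotes become %22.
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


# Exact-phrase search restricted to the title field.
by_title = targeted_url(search_query='ti:"Attention Is All You Need"', max_results=3)
# Two authors combined with AND (surnames only).
by_authors = targeted_url(search_query="au:Vaswani AND au:Shazeer", max_results=3)
# Direct lookup by arXiv identifier, taken from the link earlier in this section.
by_id = targeted_url(id_list="1706.03762")

for url in (by_title, by_authors, by_id):
    print(url)
```

When the identifier is already known, the `id_list` form sidesteps ranking entirely.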