How to use the arXiv API from Python

Last updated: Sep 2025

In [2]:
import time

from utils import inspect_dictionary
from scrapers import get_arxiv_records, get_keywords

The arXiv API can be queried using an HTTP request. The base query is of the form:

http://export.arxiv.org/api/query

To find papers on a certain subject, use the `search_query=all:` parameter followed by your search terms. The `all:` prefix searches across all fields (title, author, abstract, comment, journal reference, subject category, report number). Spaces are not allowed in URLs, so encode them as `+` signs.
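As a small illustration of the encoding step, the standard library's `urllib.parse.quote_plus` turns spaces into `+` signs. The helper name `encode_search_term` is made up for this sketch and is not part of the `scrapers` module used in this notebook.

```python
from urllib.parse import quote_plus


def encode_search_term(term: str) -> str:
    """Return an `all:`-prefixed search term with spaces encoded as `+`.

    Hypothetical helper for illustration only.
    """
    return "all:" + quote_plus(term)


print(encode_search_term("artificial intelligence"))
# prints: all:artificial+intelligence
```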

To get the most recent papers, sort by date. The `sortBy=submittedDate` option sorts by the date the article was first submitted, and `sortOrder=descending` puts the newest papers first. For the papers that best match the query, use `sortBy=relevance` instead. Note that relevance is not a measure of popularity: it uses Lucene's default relevance search, which orders results by an internally computed relevance score (the system's best estimate of how well each document matches the query). The algorithmic details of this score do not appear to be documented.

Example query to get the 2 most recent papers related to "artificial intelligence":

http://export.arxiv.org/api/query?search_query=all:artificial+intelligence&sortBy=submittedDate&sortOrder=descending&max_results=2
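The same URL can be assembled programmatically. The sketch below uses only the standard library; `build_query_url` and its argument names are assumptions made for illustration and are not necessarily how `get_arxiv_records` builds its request. Passing `safe=":"` to `urlencode` keeps the colon in `all:` unescaped so the result matches the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/api/query"


def build_query_url(topic: str,
                    sort_by: str = "submittedDate",
                    sort_order: str = "descending",
                    max_results: int = 2) -> str:
    """Build an arXiv API query URL (hypothetical helper for illustration)."""
    params = {
        "search_query": f"all:{topic}",
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "max_results": max_results,
    }
    # urlencode encodes spaces as '+'; safe=":" leaves the field prefix intact
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


print(build_query_url("artificial intelligence"))
```

Switching to `sort_by="relevance"` yields the relevance-ordered query described earlier.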

In [3]:
records = get_arxiv_records("artificial intelligence", sort_by="date", max_results=5)
for paper in records:
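    try:
        keywords = get_keywords(abstract=records[paper]["abstract"])
        time.sleep(3)  # wait 3 seconds between keyword requests
        print(keywords)
        records[paper]["generated_keywords"] = keywords
    except Exception as exc:
        # Catch Exception rather than using a bare except, and report the cause
        print(f"[ServerError] for paper {paper}: {exc}")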

['large language models', 'data compliance', 'multilingual representation', 'open models', 'Goldfish objective']
['text-to-image generation', 'exam benchmark', 'multidisciplinary', 'visual reasoning', 'artificial general intelligence']
['generative artificial intelligence', 'data security', 'membership inference attacks', 'diffusion models', 'critically-damped higher-order Langevin dynamics']
['language models', 'training order', 'linear encoding', 'temporal signal', 'knowledge modification']
['Stochastic Optimization', 'Banach Spaces', 'Bregman Geometry', 'Machine Learning', 'Deep Learning']


In [4]:
for entry in records:
    print(records[entry].get("generated_keywords"))

['large language models', 'data compliance', 'multilingual representation', 'open models', 'Goldfish objective']
['text-to-image generation', 'exam benchmark', 'multidisciplinary', 'visual reasoning', 'artificial general intelligence']
['generative artificial intelligence', 'data security', 'membership inference attacks', 'diffusion models', 'critically-damped higher-order Langevin dynamics']
['language models', 'training order', 'linear encoding', 'temporal signal', 'knowledge modification']
['Stochastic Optimization', 'Banach Spaces', 'Bregman Geometry', 'Machine Learning', 'Deep Learning']
