Resolve a scholarly article identifier to its retrievable artifacts — the full-text body and any supplementary material — and fetch their bytes.
litfetch is two cooperating seams:
- a fetch ladder: pluggable Fetcher backends (PMC Open Access S3, Europe PMC, Elsevier OA) tried in priority order; the first to serve the body wins, returning a Blob (a File plus its bytes);
- an optional resolver layer: pluggable Resolvers that enrich what you know about a paper (pmid→pmcid/doi, etc.) so the ladder can act.
You hand it an ArticleIds bundle (any of pmid / pmcid / doi). Resolution
is demand-driven: a resolver only runs when the next fetcher needs an
identifier you don't yet have, and runs at most once.
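The ladder and its demand-driven resolution can be pictured with a small self-contained sketch. It does not use litfetch itself: `Ids`, `ToyFetcher`, and `run_ladder` are hypothetical stand-ins invented for illustration, the toy resolver takes only the ids (the real one also receives an Http), and the actual internals may differ.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Ids:
    """Hypothetical stand-in for an identifier bundle (not litfetch's ArticleIds)."""
    pmid: Optional[str] = None
    pmcid: Optional[str] = None
    doi: Optional[str] = None

    def has(self, names: frozenset) -> bool:
        return all(getattr(self, name) is not None for name in names)


@dataclass(frozen=True)
class ToyFetcher:
    """Hypothetical backend: declares the ids it needs and may or may not serve."""
    name: str
    requires: frozenset
    serves: bool

    async def fetch(self, ids: Ids) -> Optional[bytes]:
        return f"{self.name}:{ids.pmcid or ids.doi}".encode() if self.serves else None


async def run_ladder(ids, fetchers, resolver=None):
    """Try fetchers in priority order; resolve lazily and at most once."""
    resolved = False
    for fetcher in fetchers:
        # Only pay for resolution when this rung needs an id we lack.
        if not ids.has(fetcher.requires) and resolver is not None and not resolved:
            ids = await resolver(ids)
            resolved = True
        if not ids.has(fetcher.requires):
            continue  # still missing an id: skip this rung
        body = await fetcher.fetch(ids)
        if body is not None:
            return fetcher.name, body  # first backend to serve wins
    return None


calls = []


async def toy_resolver(ids: Ids) -> Ids:
    calls.append(ids)
    return replace(ids, pmcid="PMC0000001")  # made-up identifier


ladder = [
    ToyFetcher("s3", frozenset({"pmcid"}), serves=False),
    ToyFetcher("epmc", frozenset({"pmcid"}), serves=True),
    ToyFetcher("els", frozenset({"doi"}), serves=True),
]
result = asyncio.run(run_ladder(Ids(pmid="1"), ladder, toy_resolver))
print(result, len(calls))  # ('epmc', b'epmc:PMC0000001') 1
```

In this toy run the resolver fires once, when the first rung needs a pmcid, and the third rung is never reached because the second already served the body.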
An article is modelled as a file-set: a collection of File references (the
body in its various media types, plus supplementary material, distinguished by
FileKind), each hosted upstream. litfetch fetches the raw artifacts and reports
their access terms; it does not render them. To turn a fetched JATS/Elsevier
body into markdown, run litdown
on the bytes yourself (see Render to markdown).
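As a rough mental model of the file-set, the sketch below uses plain dataclasses. The names `Kind`, `FileRef`, and `FetchedBlob`, their fields beyond those the examples in this README use (`source`, `media_type`, `content`), the `BODY` member, and the example URLs are all assumptions made for illustration, not litfetch's actual types.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    """Hypothetical counterpart of FileKind; BODY is an assumed member."""
    BODY = "body"
    SUPPLEMENTARY = "supplementary"


@dataclass(frozen=True)
class FileRef:
    """A reference to an upstream-hosted file: no bytes yet."""
    source: str
    url: str
    media_type: str
    kind: Kind


@dataclass(frozen=True)
class FetchedBlob:
    """A reference plus the bytes that were fetched for it."""
    file: FileRef
    content: bytes


def select(files, kind):
    """Filter a file-set down to one kind."""
    return [f for f in files if f.kind is kind]


# One article, several artifacts: two body renditions and one supplement.
file_set = [
    FileRef("pmc", "https://example.org/article.xml", "application/xml", Kind.BODY),
    FileRef("pmc", "https://example.org/article.pdf", "application/pdf", Kind.BODY),
    FileRef("pmc", "https://example.org/table-s1.csv", "text/csv", Kind.SUPPLEMENTARY),
]
supplementary = select(file_set, Kind.SUPPLEMENTARY)
blob = FetchedBlob(supplementary[0], b"gene,score\n")
print(blob.file.media_type, len(blob.content))  # text/csv 11
```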
The examples below are a tour; docs/api.md is the full
reference for the public surface.

    pip install litfetch

bioRxiv / medRxiv preprint full text needs a browser-fingerprint HTTP client,
enabled by the biorxiv extra:

    pip install 'litfetch[biorxiv]'

Hand fetch_body an ArticleIds; the default ladder serves the first available
body as a Blob:

    from litfetch import ArticleIds, fetch_body

    # Top-level await: run inside an async function or the `python -m asyncio` REPL.
    blob = await fetch_body(ArticleIds(pmcid='PMC5334499'))
    if blob:
        print(blob.file.source, blob.file.media_type, len(blob.content))

litfetch returns raw bytes, not markdown. Convert a JATS/Elsevier body with litdown; you pick and pin the converter:

    import io

    import litdown
    from litfetch import ArticleIds, fetch_body

    blob = await fetch_body(ArticleIds(pmcid='PMC5334499'))
    if blob:
        markdown = litdown.convert(io.BytesIO(blob.content))

A resolver is an async (ArticleIds, Http) -> ArticleIds; the session running
it supplies the Http. Enrich from whatever you have, such as a corpus client, a
local cache, or an API, and merge it in (this one ignores Http, hence _http):

    from litfetch import ArticleIds, Http, fetch_body

    async def my_resolver(ids: ArticleIds, _http: Http) -> ArticleIds:
        if not ids.pmid:
            return ids
        # my_corpus is your own lookup client, not part of litfetch
        pmcid, doi = await my_corpus.lookup(ids.pmid)
        return ids.merge(ArticleIds(pmcid=pmcid, doi=doi))

    blob = await fetch_body(ArticleIds(pmid='29622564'), resolver=my_resolver)

Bundled resolvers are constructed with their config, then passed in the same
slot. chain(...) composes several (yours first, fallbacks after); it stops
once every identifier is known:

    from litfetch import ArticleIds, fetch_body
    from litfetch.resolvers import SemanticScholarResolver, NcbiIdConverterResolver, chain

    resolver = chain(
        my_resolver,                              # your own
        SemanticScholarResolver(api_key=S2_KEY),  # bundled
        NcbiIdConverterResolver(tool='myapp'),    # bundled
    )
    blob = await fetch_body(ArticleIds(pmid='29622564'), resolver=resolver)

Polite-pool identification (NCBI/Crossref email, Unpaywall's required email)
comes from a session contact, not a hardcoded default. Set it on the session:

    async with litfetch.Session(contact='you@example.org') as s:
        await s.fetch_body(...)

default_resolver() is a batteries-included, keyless chain
(Europe PMC search + NCBI ID Converter).
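The composition rule described above (run resolvers in order, stop once every identifier is known, never overwrite a known id) can be sketched without litfetch. `Ids`, `toy_chain`, and the three toy resolvers are hypothetical; whether the real chain merges each result itself or trusts each resolver to do so is an assumption here.

```python
import asyncio
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Ids:
    """Hypothetical stand-in for an identifier bundle."""
    pmid: Optional[str] = None
    pmcid: Optional[str] = None
    doi: Optional[str] = None

    def merge(self, other: "Ids") -> "Ids":
        # Fill gaps only: a value we already know always wins.
        merged = {}
        for f in fields(self):
            mine = getattr(self, f.name)
            merged[f.name] = mine if mine is not None else getattr(other, f.name)
        return Ids(**merged)

    def complete(self) -> bool:
        return all(getattr(self, f.name) is not None for f in fields(self))


def toy_chain(*resolvers):
    """Compose resolvers; later ones run only while something is still unknown."""
    async def chained(ids: Ids, http=None) -> Ids:
        for resolver in resolvers:
            if ids.complete():
                break
            ids = ids.merge(await resolver(ids, http))
        return ids
    return chained


called = []


async def first(ids, http):
    called.append("first")
    return Ids(pmcid="PMC1")


async def second(ids, http):
    called.append("second")
    # The conflicting pmcid is ignored by merge; only the doi gap is filled.
    return Ids(pmcid="PMC-OTHER", doi="10.1000/example")


async def third(ids, http):
    called.append("third")
    return ids


ids = asyncio.run(toy_chain(first, second, third)(Ids(pmid="1")))
print(ids.pmcid, ids.doi, called)  # PMC1 10.1000/example ['first', 'second']
```

Putting your own resolver first means its answers take precedence, since later resolvers can only fill what is still missing.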
A non-PubMed paper you only have a DOI for, plus your own Elsevier key:

    blob = await fetch_body(
        ArticleIds(doi='10.1016/j.cell.2020.01.001'),
        credentials={'elsevier_api_key': key},
    )

list_files enumerates the file-set (references, no bytes); fetch_file
materialises one:

    from litfetch import ArticleIds, FileKind, list_files, fetch_file

    files = await list_files(ArticleIds(pmcid='PMC5334499'), kind=FileKind.SUPPLEMENTARY)
    for file in files:
        blob = await fetch_file(file)

Read the licence from the fetched bytes, falling back to an access authority (Unpaywall) when the bytes carry none:

    from litfetch import ArticleIds, extract_source_metadata, resolve_access

    meta = extract_source_metadata(blob)  # from the JATS/Elsevier bytes
    if meta.licence is None:
        meta = await resolve_access(ArticleIds(doi='10.1016/j.cell.2020.01.001'))

Each resolver is usable on its own as a cross-reference tool, independent of
fetching. A resolver is given the Http to use, so run it inside a session:

    from litfetch import ArticleIds, Session
    from litfetch.resolvers import SemanticScholarResolver

    async with Session() as s:
        ids = await SemanticScholarResolver()(ArticleIds(doi='10.1016/j.cell.2020.01.001'), s)
        print(ids.pmid, ids.pmcid)

The one-shot functions above each open a throwaway session. For many papers,
hold one Session (pooled connections, shared pacing) and open a scope per
paper. The scope caches within itself, so a duplicate upstream call (e.g.
Unpaywall for both licence and PDF) is fetched once:

    from litfetch import ArticleIds, Session

    async with Session() as session:
        for pmid in pmids:
            async with session.scope() as s:
                blob = await s.fetch_body(ArticleIds(pmid=pmid))
                access = await s.resolve_access(ArticleIds(pmid=pmid))

- A new body fetcher: implement the Fetcher protocol (a name, a requires: frozenset[str] of the ArticleIds fields it needs, and an async fetch(ids, *, credentials, http) returning a body Blob or None). Add it to a fetchers= list (or your own default_fetchers).
- A new file source: implement the FileSource protocol (a name, and async list_files(ids, ...) / fetch_file(file, ...)) to enumerate and materialise an article's file-set (body renditions and supplementary alike).
- A new resolver: write an async (ArticleIds, Http) -> ArticleIds that fills gaps via ArticleIds.merge and never overwrites a known id.
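To make the shape of a body fetcher concrete, here is a minimal sketch that follows the three members listed above. It deliberately avoids importing litfetch: `Ids`, `FetcherLike`, and `LocalMirrorFetcher` are hypothetical, the sketch returns raw bytes where the real protocol returns a Blob, and the real `http` type is replaced by `Any`.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class Ids:
    """Hypothetical stand-in for an identifier bundle."""
    pmid: Optional[str] = None
    pmcid: Optional[str] = None
    doi: Optional[str] = None


@runtime_checkable
class FetcherLike(Protocol):
    """Assumed shape: a name, the id fields required, and an async fetch."""
    name: str
    requires: frozenset[str]

    async def fetch(
        self, ids: Ids, *, credentials: Mapping[str, str], http: Any
    ) -> Optional[bytes]: ...


class LocalMirrorFetcher:
    """Toy backend that serves bodies from an in-memory store keyed by pmcid."""
    name = "local-mirror"
    requires = frozenset({"pmcid"})

    def __init__(self, store: Mapping[str, bytes]) -> None:
        self._store = store

    async def fetch(self, ids, *, credentials, http):
        # Returning None tells the ladder to fall through to the next backend.
        return self._store.get(ids.pmcid)


fetcher = LocalMirrorFetcher({"PMC1": b"<article/>"})
hit = asyncio.run(fetcher.fetch(Ids(pmcid="PMC1"), credentials={}, http=None))
miss = asyncio.run(fetcher.fetch(Ids(pmcid="PMC2"), credentials={}, http=None))
print(isinstance(fetcher, FetcherLike), hit, miss)  # True b'<article/>' None
```

Because the protocol is structural, the class needs no base class; declaring `requires` is what lets a ladder decide whether resolution is worth running before this backend is tried.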
uv sync
uv run ruff check . && uv run ruff format --check .
uv run pyright
uv run pytest