
feat(arxiv): full abstract/authors + surface pdf/categories/comment + new recent <category>#1289

Merged
jackwener merged 2 commits into main from feat/arxiv-fix-paper-add-recent
May 4, 2026
Conversation

@jackwener
Owner

What

Three changes to the arxiv adapter, all addressing agent-facing data-quality / coverage gaps I hit while exploring:

1. paper <id> — stop silently truncating the data

The Atom feed for 1706.03762 has 8 authors and a ~200-word abstract. Old behavior:

  • Abstract was cut at 200 chars (silent: the caller had no signal)
  • Authors were .slice(0, 3).join(', ') (also silent)
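
The before/after author handling, as a toy sketch (placeholder names, not the actual diff):

```javascript
// Illustrative before/after of the silent truncation (toy data):
const authors = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']; // 8 authors from the feed

const oldAuthors = authors.slice(0, 3).join(', '); // 'A, B, C' (5 dropped, no signal)
const newAuthors = authors.join(', ');             // all 8 preserved

console.log(oldAuthors.split(', ').length, newAuthors.split(', ').length); // 3 8
```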

New behavior: full abstract, full author list. Plus surface fields the API was already returning that we threw away:

  • pdf (from <link rel="related" type="application/pdf">, with id-based fallback)
  • primary_category (e.g. cs.CL)
  • categories (all <category term> entries comma-joined)
  • comment (e.g. "15 pages, 5 figures", conference info, etc.)
  • updated
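
The pdf extraction with its id-based fallback can be sketched like this (regex-based and illustrative; the adapter's real parser, and the attribute order in real feeds, may differ):

```javascript
// Sketch: pull the pdf link from an arXiv Atom <entry>, falling back to an
// id-derived URL when the <link ... type="application/pdf"> element is absent.
function pdfUrl(entryXml, id) {
  const m = entryXml.match(/<link[^>]*type="application\/pdf"[^>]*href="([^"]+)"/);
  return m ? m[1] : `https://arxiv.org/pdf/${id}`;
}

const entry =
  '<entry><link rel="related" type="application/pdf" href="http://arxiv.org/pdf/1706.03762v7"/></entry>';
console.log(pdfUrl(entry, '1706.03762'));      // explicit link from the feed
console.log(pdfUrl('<entry/>', '1706.03762')); // id-based fallback
```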

2. search <query> — minor cleanup

3. recent <category> — new

You couldn't ask "what's new in cs.CL today?" without inventing a search term. Now:

opencli arxiv recent cs.CL --limit 10
opencli arxiv recent math.PR --limit 5

Uses search_query=cat:<category>&sortBy=submittedDate&sortOrder=descending. Validates the category string against the arxiv <archive>.<sub> shape (also accepts archive-only like math or q-bio). Bad input → ArgumentError.
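
A sketch of the category check plus query construction (the regex is my assumption matching the `<archive>.<sub>` shape described above, with hyphenated archives like `q-bio` allowed; the adapter's exact pattern and error class may differ):

```javascript
// Assumed shape: <archive> or <archive>.<sub>; archives may contain one hyphen.
const CATEGORY_RE = /^[a-z]+(?:-[a-z]+)?(?:\.[A-Za-z-]+)?$/;

function recentUrl(category, limit = 10) {
  if (!CATEGORY_RE.test(category)) {
    // Stand-in for the project's ArgumentError type.
    throw new Error(`ArgumentError: bad category "${category}"`);
  }
  return 'http://export.arxiv.org/api/query' +
    `?search_query=cat:${category}&sortBy=submittedDate&sortOrder=descending` +
    `&max_results=${limit}`;
}

console.log(recentUrl('cs.CL', 3));
```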

Why

I'm an agent that calls these adapters. With the old paper, an agent fetching "Attention Is All You Need" got 3 of 8 authors and a clipped first paragraph, silently. That's a worse failure mode than an outright error: the caller has no idea data is missing.

recent just fills an obvious gap: previously, arxiv search "anything" was the only way to read the Atom feed.

Tested

  • clis/arxiv/arxiv.test.js: 4 tests covering command registration, full-abstract / all-authors parsing, XML entity decoding in titles, pdf/categories/comment extraction, and category validation.
  • Live runs:
    • opencli arxiv paper 1706.03762 → all 8 authors, full abstract, primary_category=cs.CL, pdf URL present.
    • opencli arxiv recent cs.CL --limit 3 → 3 newest cs.CL papers, sorted by submitted date desc.
    • opencli arxiv search "attention is all you need" --limit 3 → 3 hits with primary_category surfaced.
    • opencli arxiv recent "BAD..STR" → ArgumentError (exit 2).

jackwener added 2 commits May 4, 2026 18:16
…dd `recent <category>`

`paper` was silently truncating the abstract to 200 chars and dropping all but
the first 3 authors — agents calling it for a paper summary lost data. Stop
truncating, return all authors, and surface the rest of what the Atom feed
already gives us: pdf url (`<link rel="related">`), all `categories`,
`primary_category`, and the author `comment` (page count, conference, etc.).

`search` keeps a compact list shape (no abstract column, but adds
`primary_category`).

New `arxiv recent <category>` lists newest submissions in a category sorted by
`submittedDate desc` — fills a gap (previously you had to know a search term
to surface anything). Validates the category string and rejects malformed
input via `ArgumentError`.

`search` also switches its no-results path from `CliError('NOT_FOUND', ...)`
to `EmptyResultError` to match the convention other public-API adapters use.

Tests cover: command registration, full-abstract / all-authors parsing, XML
entity decoding in titles, pdf/categories/comment extraction, and category
validation.
@jackwener jackwener merged commit e848594 into main May 4, 2026
11 checks passed
jackwener added a commit that referenced this pull request May 4, 2026
* feat(indeed): add `search` and `job` adapters (US site)

Adds an Indeed adapter that fills the US job-search gap (alongside
existing 51job / boss-zhipin / linkedin coverage). Both commands run
through a real browser session because Indeed sits behind Cloudflare
and answers bare HTTP fetches with `403` + `cf-mitigated: challenge`.
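
A toy predicate for the response shape described above (the adapter's real detection may inspect more than this pair):

```javascript
// Cloudflare answers bare fetches with 403 plus a `cf-mitigated: challenge`
// response header; treat that combination as "challenge, need a real browser".
function isCfChallenge(status, headers) {
  return status === 403 && headers['cf-mitigated'] === 'challenge';
}

console.log(isCfChallenge(403, { 'cf-mitigated': 'challenge' })); // true
console.log(isCfChallenge(200, {}));                              // false
```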

## Commands

- `indeed search <query>` — keyword job search
  - args: `query`, `--location`, `--fromage`, `--sort`, `--start`, `--limit`
  - columns: `rank, id, title, company, location, salary, tags, url`
- `indeed job <jk>` (alias `detail`, `view`) — full job posting
  - args: `id` (positional, the 16-char hex `jk` from `search`)
  - columns: `id, title, company, location, salary, job_type, description, url`

## Listing↔detail id pairing

`search.id` is the Indeed `jk` (job key, 16-char lowercase hex). It feeds
directly into `indeed job <jk>`. Conforms to the listing↔detail id
pairing convention proposed in #1297.

## CF challenge handling

The adapter polls the result selectors for up to 15s after navigation,
giving the browser time to clear the Cloudflare interstitial. If the
challenge is still up after the wait, the adapter throws a
`CommandExecutionError` with a hint pointing the user at the connected
browser to clear it once. Subsequent calls reuse the warmed cookies via
`Strategy.COOKIE`, mirroring the v2ex / boss / linkedin patterns.
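
The wait loop has roughly this shape (illustrative only: `page.$` is a Puppeteer-style selector query, and the helper name and timings here are my assumptions, not the adapter's code):

```javascript
// Poll for a result selector for up to `timeoutMs`, giving the browser time
// to clear the Cloudflare interstitial between checks.
async function waitForResults(page, selector, timeoutMs = 15000, stepMs = 500) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await page.$(selector)) return true;            // results rendered
    await new Promise(r => setTimeout(r, stepMs));      // challenge may still be up
  }
  return false; // caller throws CommandExecutionError with a hint
}
```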

## Validation

`utils.js` keeps argument validation pure and unit-testable:

- `requireJobKey` rejects anything that isn't a 16-char lowercase hex
- `requireFromage` only accepts `1` / `3` / `7` / `14` (Indeed's enum)
- `requireSort` only accepts `relevance` / `date`
- `requireBoundedInt(limit, default=15, max=25)` — Indeed serves at most
  one page (10 jobs/page); ArgumentError on out-of-range, no silent
  clamping, per the typed-error feedback in #1289.
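
A hedged sketch of two of the validators listed above (`ArgumentError` here stands in for the project's real error class; behavior follows the description, not the diff):

```javascript
class ArgumentError extends Error {}

// jk must be a 16-char lowercase hex job key.
function requireJobKey(jk) {
  if (!/^[0-9a-f]{16}$/.test(jk)) throw new ArgumentError(`invalid jk: ${jk}`);
  return jk;
}

// Out-of-range limits raise rather than silently clamping.
function requireBoundedInt(value, def = 15, max = 25) {
  if (value === undefined) return def;
  const n = Number(value);
  if (!Number.isInteger(n) || n < 1 || n > max) {
    throw new ArgumentError(`limit must be 1..${max}, got ${value}`);
  }
  return n;
}
```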

## Tests

18 unit tests in `clis/indeed/indeed.test.js` cover registration,
validators, URL builders, and DOM-card normalizers. Browser-driven
verification stays out of CI by design (CF challenge is interactive).

## Docs

- `docs/adapters/browser/indeed.md` — full adapter doc with prerequisite
  CF-challenge notes and listing↔detail id pairing callout.
- Sidebar entry + adapter index row.

* fix(indeed): tighten timeout fail-fast and runtime tests

* fix(indeed): align readiness with search parser
jackwener pushed a commit that referenced this pull request May 4, 2026
* docs(cases): add three researcher workflow examples

Add use cases under cases/ that exercise the recently-landed
researcher-friendly adapters:

- daily-rl-research-monitor.md uses arxiv recent + openreview venue
  + hf top to compress a morning paper-skim into one shell pipeline.
- find-paper-implementation.md chains arxiv search/paper + dblp
  search + hf top + openreview search to map a paper's canonical
  record, follow-ups, and community uptake.
- track-conference-papers.md walks openreview venue + reviews to
  shortlist accepted papers and digest review threads in batch.

Each file is a real workflow built on commands from #1289 (arxiv
recent), #1294 (openreview), and #1299 (dblp).

* docs(cases): correct venue ids and forum example to ones that return data

The first revision used "ICLR.cc/2026/Conference" and "ICLR 2026 oral"
as venue strings. Both return EMPTY_RESULT today because the venue is
not open. Update each case to use natural-language venue text that
OpenReview currently exposes ("ICLR 2024 oral", "NeurIPS 2025 oral")
and a real forum id (KS8mIvetg2, "Proving Test Set Contamination in
Black-Box Language Models") in the reviews / paper drill-down. Note
the arxiv free-text-search ranking quirk so the worked DPO example
makes sense.
