
feat(arxiv): full abstract/authors + surface pdf/categories/comment + new recent <category>#1289

Merged
jackwener merged 2 commits into main from feat/arxiv-fix-paper-add-recent
May 4, 2026
Conversation

@jackwener
Owner

What

Three changes to the arxiv adapter, all addressing agent-facing data-quality / coverage gaps I hit while exploring:

1. paper <id> — stop silently truncating the data

The Atom feed for 1706.03762 has 8 authors and a ~200-word abstract. Old behavior:

  • Abstract was cut at 200 chars (silent: the caller had no signal)
  • Authors were .slice(0, 3).join(', ') (also silent)
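
The before/after author handling, as a toy sketch (placeholder names, not the actual diff):

```javascript
// Illustrative before/after of the silent truncation (toy data):
const authors = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']; // 8 authors from the feed

const oldAuthors = authors.slice(0, 3).join(', '); // 'A, B, C' (5 dropped, no signal)
const newAuthors = authors.join(', ');             // all 8 preserved

console.log(oldAuthors.split(', ').length, newAuthors.split(', ').length); // 3 8
```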

New behavior: full abstract, full author list. Plus surface fields the API was already returning that we threw away:

  • pdf (from <link rel="related" type="application/pdf">, with id-based fallback)
  • primary_category (e.g. cs.CL)
  • categories (all <category term> entries comma-joined)
  • comment (e.g. "15 pages, 5 figures", conference info, etc.)
  • updated
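
The pdf extraction with its id-based fallback can be sketched like this (regex-based and illustrative; the adapter's real parser, and the attribute order in real feeds, may differ):

```javascript
// Sketch: pull the pdf link from an arXiv Atom <entry>, falling back to an
// id-derived URL when the <link ... type="application/pdf"> element is absent.
function pdfUrl(entryXml, id) {
  const m = entryXml.match(/<link[^>]*type="application\/pdf"[^>]*href="([^"]+)"/);
  return m ? m[1] : `https://arxiv.org/pdf/${id}`;
}

const entry =
  '<entry><link rel="related" type="application/pdf" href="http://arxiv.org/pdf/1706.03762v7"/></entry>';
console.log(pdfUrl(entry, '1706.03762'));      // explicit link from the feed
console.log(pdfUrl('<entry/>', '1706.03762')); // id-based fallback
```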

2. search <query> — minor cleanup

3. recent <category> — new

You couldn't ask "what's new in cs.CL today?" without inventing a search term. Now:

opencli arxiv recent cs.CL --limit 10
opencli arxiv recent math.PR --limit 5

Uses search_query=cat:<category>&sortBy=submittedDate&sortOrder=descending. Validates the category string against the arxiv <archive>.<sub> shape (also accepts archive-only like math or q-bio). Bad input → ArgumentError.
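
A sketch of the category check plus query construction (the regex is my assumption matching the `<archive>.<sub>` shape described above, with hyphenated archives like `q-bio` allowed; the adapter's exact pattern and error class may differ):

```javascript
// Assumed shape: <archive> or <archive>.<sub>; archives may contain one hyphen.
const CATEGORY_RE = /^[a-z]+(?:-[a-z]+)?(?:\.[A-Za-z-]+)?$/;

function recentUrl(category, limit = 10) {
  if (!CATEGORY_RE.test(category)) {
    // Stand-in for the project's ArgumentError type.
    throw new Error(`ArgumentError: bad category "${category}"`);
  }
  return 'http://export.arxiv.org/api/query' +
    `?search_query=cat:${category}&sortBy=submittedDate&sortOrder=descending` +
    `&max_results=${limit}`;
}

console.log(recentUrl('cs.CL', 3));
```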

Why

I'm an agent that calls these adapters. With the old paper, an agent fetching "Attention Is All You Need" got 3 of 8 authors and a clipped first paragraph, silently. That's a worse failure mode than an outright error: the caller has no idea data is missing.

recent just fills an obvious gap: previously, arxiv search "anything" was the only way to read the Atom feed.

Tested

  • clis/arxiv/arxiv.test.js: 4 tests covering command registration, full-abstract / all-authors parsing, XML entity decoding in titles, pdf/categories/comment extraction, and category validation.
  • Live runs:
    • opencli arxiv paper 1706.03762 → all 8 authors, full abstract, primary_category=cs.CL, pdf URL present.
    • opencli arxiv recent cs.CL --limit 3 → 3 newest cs.CL papers, sorted by submitted date desc.
    • opencli arxiv search "attention is all you need" --limit 3 → 3 hits with primary_category surfaced.
    • opencli arxiv recent "BAD..STR" → ArgumentError (exit 2).

jackwener added 2 commits May 4, 2026 18:16
…dd `recent <category>`

`paper` was silently truncating the abstract to 200 chars and dropping all but
the first 3 authors — agents calling it for a paper summary lost data. Stop
truncating, return all authors, and surface the rest of what the Atom feed
already gives us: pdf url (`<link rel="related">`), all `categories`,
`primary_category`, and the author `comment` (page count, conference, etc.).

`search` keeps a compact list shape (no abstract column, but adds
`primary_category`).

New `arxiv recent <category>` lists newest submissions in a category sorted by
`submittedDate desc` — fills a gap (previously you had to know a search term
to surface anything). Validates the category string and rejects malformed
input via `ArgumentError`.

`search` also switches its no-results path from `CliError('NOT_FOUND', ...)`
to `EmptyResultError` to match the convention other public-API adapters use.

Tests cover: command registration, full-abstract / all-authors parsing, XML
entity decoding in titles, pdf/categories/comment extraction, and category
validation.
@jackwener jackwener merged commit e848594 into main May 4, 2026
11 checks passed
jackwener added a commit that referenced this pull request May 4, 2026
* feat(indeed): add `search` and `job` adapters (US site)

Adds an Indeed adapter that fills the US job-search gap (alongside
existing 51job / boss-zhipin / linkedin coverage). Both commands run
through a real browser session because Indeed sits behind Cloudflare
and answers bare HTTP fetches with `403` + `cf-mitigated: challenge`.
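
A toy predicate for the response shape described above (the adapter's real detection may inspect more than this pair):

```javascript
// Cloudflare answers bare fetches with 403 plus a `cf-mitigated: challenge`
// response header; treat that combination as "challenge, need a real browser".
function isCfChallenge(status, headers) {
  return status === 403 && headers['cf-mitigated'] === 'challenge';
}

console.log(isCfChallenge(403, { 'cf-mitigated': 'challenge' })); // true
console.log(isCfChallenge(200, {}));                              // false
```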

## Commands

- `indeed search <query>` — keyword job search
  - args: `query`, `--location`, `--fromage`, `--sort`, `--start`, `--limit`
  - columns: `rank, id, title, company, location, salary, tags, url`
- `indeed job <jk>` (alias `detail`, `view`) — full job posting
  - args: `id` (positional, the 16-char hex `jk` from `search`)
  - columns: `id, title, company, location, salary, job_type, description, url`

## Listing↔detail id pairing

`search.id` is the Indeed `jk` (job key, 16-char lowercase hex). It feeds
directly into `indeed job <jk>`. Conforms to the listing↔detail id
pairing convention proposed in #1297.

## CF challenge handling

The adapter polls the result selectors for up to 15s after navigation,
giving the browser time to clear the Cloudflare interstitial. If the
challenge is still up after the wait, the adapter throws a
`CommandExecutionError` with a hint pointing the user at the connected
browser to clear it once. Subsequent calls reuse the warmed cookies via
`Strategy.COOKIE`, mirroring the v2ex / boss / linkedin patterns.
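
The wait loop has roughly this shape (illustrative only: `page.$` is a Puppeteer-style selector query, and the helper name and timings here are my assumptions, not the adapter's code):

```javascript
// Poll for a result selector for up to `timeoutMs`, giving the browser time
// to clear the Cloudflare interstitial between checks.
async function waitForResults(page, selector, timeoutMs = 15000, stepMs = 500) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await page.$(selector)) return true;            // results rendered
    await new Promise(r => setTimeout(r, stepMs));      // challenge may still be up
  }
  return false; // caller throws CommandExecutionError with a hint
}
```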

## Validation

`utils.js` keeps argument validation pure and unit-testable:

- `requireJobKey` rejects anything that isn't a 16-char lowercase hex
- `requireFromage` only accepts `1` / `3` / `7` / `14` (Indeed's enum)
- `requireSort` only accepts `relevance` / `date`
- `requireBoundedInt(limit, default=15, max=25)` — Indeed serves at most
  one page (10 jobs/page); ArgumentError on out-of-range, no silent
  clamping, per the typed-error feedback in #1289.
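
A hedged sketch of two of the validators listed above (`ArgumentError` here stands in for the project's real error class; behavior follows the description, not the diff):

```javascript
class ArgumentError extends Error {}

// jk must be a 16-char lowercase hex job key.
function requireJobKey(jk) {
  if (!/^[0-9a-f]{16}$/.test(jk)) throw new ArgumentError(`invalid jk: ${jk}`);
  return jk;
}

// Out-of-range limits raise rather than silently clamping.
function requireBoundedInt(value, def = 15, max = 25) {
  if (value === undefined) return def;
  const n = Number(value);
  if (!Number.isInteger(n) || n < 1 || n > max) {
    throw new ArgumentError(`limit must be 1..${max}, got ${value}`);
  }
  return n;
}
```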

## Tests

18 unit tests in `clis/indeed/indeed.test.js` cover registration,
validators, URL builders, and DOM-card normalizers. Browser-driven
verification stays out of CI by design (CF challenge is interactive).

## Docs

- `docs/adapters/browser/indeed.md` — full adapter doc with prerequisite
  CF-challenge notes and listing↔detail id pairing callout.
- Sidebar entry + adapter index row.

* fix(indeed): tighten timeout fail-fast and runtime tests

* fix(indeed): align readiness with search parser
jackwener pushed a commit that referenced this pull request May 4, 2026
* docs(cases): add three researcher workflow examples

Add use cases under cases/ that exercise the recently-landed
researcher-friendly adapters:

- daily-rl-research-monitor.md uses arxiv recent + openreview venue
  + hf top to compress a morning paper-skim into one shell pipeline.
- find-paper-implementation.md chains arxiv search/paper + dblp
  search + hf top + openreview search to map a paper's canonical
  record, follow-ups, and community uptake.
- track-conference-papers.md walks openreview venue + reviews to
  shortlist accepted papers and digest review threads in batch.

Each file is a real workflow built on commands from #1289 (arxiv
recent), #1294 (openreview), and #1299 (dblp).

* docs(cases): correct venue ids and forum example to ones that return data

The first revision used "ICLR.cc/2026/Conference" and "ICLR 2026 oral"
as venue strings. Both return EMPTY_RESULT today because the venue is
not open. Update each case to use natural-language venue text that
OpenReview currently exposes ("ICLR 2024 oral", "NeurIPS 2025 oral")
and a real forum id (KS8mIvetg2, "Proving Test Set Contamination in
Black-Box Language Models") in the reviews / paper drill-down. Note
the arxiv free-text-search ranking quirk so the worked DPO example
makes sense.
