…dd `recent <category>`
`paper` was silently truncating the abstract to 200 chars and dropping all but
the first 3 authors — agents calling it for a paper summary lost data. Stop
truncating, return all authors, and surface the rest of what the Atom feed
already gives us: pdf url (`<link rel="related">`), all `categories`,
`primary_category`, and the author `comment` (page count, conference, etc.).
`search` keeps a compact list shape (no abstract column, but adds
`primary_category`).
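The extra Atom fields called out above can be pulled from a feed entry with a few line-level matches. A minimal sketch, regex-based for brevity; the adapter's real parser, namespace handling, and id-based pdf fallback may differ:

```javascript
// Illustrative extraction of the extra Atom fields (pdf link, categories,
// primary_category, comment) from a single <entry> blob of XML text.
function extractExtras(entryXml) {
  const pdf = (entryXml.match(/<link[^>]*rel="related"[^>]*href="([^"]+)"/) || [])[1] || null;
  const categories = [...entryXml.matchAll(/<category[^>]*term="([^"]+)"/g)].map(m => m[1]);
  const primary = (entryXml.match(/<arxiv:primary_category[^>]*term="([^"]+)"/) || [])[1] || null;
  const comment = (entryXml.match(/<arxiv:comment[^>]*>([^<]*)<\/arxiv:comment>/) || [])[1] || null;
  return { pdf, categories: categories.join(","), primary_category: primary, comment };
}
```

Note the categories regex only matches literal `<category` tags, so the namespaced `<arxiv:primary_category>` element is not double-counted.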
New `arxiv recent <category>` lists newest submissions in a category sorted by
`submittedDate desc` — fills a gap (previously you had to know a search term
to surface anything). Validates the category string and rejects malformed
input via `ArgumentError`.
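The query construction and category check described above can be sketched as one small function, assuming the public arxiv export endpoint (`http://export.arxiv.org/api/query`); function and constant names here are illustrative, not the adapter's:

```javascript
// Sketch of `recent <category>`: validate the category, then build the feed URL.
// CATEGORY_RE accepts <archive>.<sub> ids (cs.CL) or archive-only ids (math, q-bio).
const CATEGORY_RE = /^[a-z-]+(\.[A-Za-z-]+)?$/;

function buildRecentUrl(category, limit = 10) {
  if (!CATEGORY_RE.test(category)) {
    // The adapter raises a typed ArgumentError here.
    throw new Error(`invalid arxiv category: ${category}`);
  }
  const params = new URLSearchParams({
    search_query: `cat:${category}`,
    sortBy: "submittedDate",
    sortOrder: "descending",
    max_results: String(limit),
  });
  return `http://export.arxiv.org/api/query?${params}`;
}
```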
`search` also switches its no-results path from `CliError('NOT_FOUND', ...)`
to `EmptyResultError` to match the convention other public-API adapters use.
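The convention swap can be sketched as follows; the `EmptyResultError` constructor shape here is an assumption for illustration (the real class lives in the repo's shared error module):

```javascript
// Sketch: no-results handling after the swap. Previously this path threw
// CliError('NOT_FOUND', ...); now an adapter-scoped EmptyResultError.
class EmptyResultError extends Error {
  constructor(adapter, message) {
    super(message);
    this.name = "EmptyResultError";
    this.adapter = adapter;
  }
}

function handleEntries(entries) {
  if (entries.length === 0) {
    throw new EmptyResultError("arxiv", "no papers matched the query");
  }
  return entries;
}
```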
Tests cover: command registration, full-abstract / all-authors parsing, XML
entity decoding in titles, pdf/categories/comment extraction, and category
validation.
This was referenced May 4, 2026

jackwener added a commit that referenced this pull request on May 4, 2026
* feat(indeed): add `search` and `job` adapters (US site)

  Adds an Indeed adapter that fills the US job-search gap (alongside existing 51job / boss-zhipin / linkedin coverage). Both commands run through a real browser session because Indeed sits behind Cloudflare and answers bare HTTP fetches with `403` + `cf-mitigated: challenge`.

  ## Commands

  - `indeed search <query>` — keyword job search
    - args: `query`, `--location`, `--fromage`, `--sort`, `--start`, `--limit`
    - columns: `rank, id, title, company, location, salary, tags, url`
  - `indeed job <jk>` (alias `detail`, `view`) — full job posting
    - args: `id` (positional, the 16-char hex `jk` from `search`)
    - columns: `id, title, company, location, salary, job_type, description, url`

  ## Listing↔detail id pairing

  `search.id` is the Indeed `jk` (job key, 16-char lowercase hex). It feeds directly into `indeed job <jk>`. Conforms to the listing↔detail id pairing convention proposed in #1297.

  ## CF challenge handling

  The adapter polls the result selectors for up to 15s after navigation, giving the browser time to clear the Cloudflare interstitial. If the challenge is still up after the wait, the adapter throws a `CommandExecutionError` with a hint pointing the user at the connected browser to clear it once. Subsequent calls reuse the warmed cookies via `Strategy.COOKIE`, mirroring the v2ex / boss / linkedin patterns.

  ## Validation

  `utils.js` keeps argument validation pure and unit-testable:

  - `requireJobKey` rejects anything that isn't a 16-char lowercase hex
  - `requireFromage` only accepts `1` / `3` / `7` / `14` (Indeed's enum)
  - `requireSort` only accepts `relevance` / `date`
  - `requireBoundedInt(limit, default=15, max=25)` — Indeed serves at most one page (10 jobs/page); ArgumentError on out-of-range, no silent clamping, per the typed-error feedback in #1289.

  ## Tests

  18 unit tests in `clis/indeed/indeed.test.js` cover registration, validators, URL builders, and DOM-card normalizers. Browser-driven verification stays out of CI by design (CF challenge is interactive).

  ## Docs

  - `docs/adapters/browser/indeed.md` — full adapter doc with prerequisite CF-challenge notes and listing↔detail id pairing callout.
  - Sidebar entry + adapter index row.

* fix(indeed): tighten timeout fail-fast and runtime tests

* fix(indeed): align readiness with search parser
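The pure validators listed under Validation can be sketched roughly as below. These are illustrative shapes only: the real versions live in `clis/indeed/utils.js` and throw the adapter's typed `ArgumentError` rather than a plain `Error`:

```javascript
// Illustrative sketches of the Indeed argument validators described above.
function requireJobKey(jk) {
  // Indeed job keys are 16 chars of lowercase hex.
  if (!/^[0-9a-f]{16}$/.test(jk)) throw new Error(`invalid jk: ${jk}`);
  return jk;
}

function requireFromage(days) {
  // Indeed's posting-age filter is a closed enum.
  if (!["1", "3", "7", "14"].includes(String(days))) throw new Error(`invalid fromage: ${days}`);
  return Number(days);
}

function requireSort(sort) {
  if (!["relevance", "date"].includes(sort)) throw new Error(`invalid sort: ${sort}`);
  return sort;
}

function requireBoundedInt(value, { def = 15, max = 25 } = {}) {
  if (value === undefined) return def;
  const n = Number(value);
  // Reject out-of-range rather than silently clamping.
  if (!Number.isInteger(n) || n < 1 || n > max) throw new Error(`out of range: ${value}`);
  return n;
}
```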
jackwener pushed a commit that referenced this pull request on May 4, 2026
* docs(cases): add three researcher workflow examples

  Add use cases under cases/ that exercise the recently-landed researcher-friendly adapters:

  - daily-rl-research-monitor.md uses arxiv recent + openreview venue + hf top to compress a morning paper-skim into one shell pipeline.
  - find-paper-implementation.md chains arxiv search/paper + dblp search + hf top + openreview search to map a paper's canonical record, follow-ups, and community uptake.
  - track-conference-papers.md walks openreview venue + reviews to shortlist accepted papers and digest review threads in batch.

  Each file is a real workflow built on commands from #1289 (arxiv recent), #1294 (openreview), and #1299 (dblp).

* docs(cases): correct venue ids and forum example to ones that return data

  The first revision used "ICLR.cc/2026/Conference" and "ICLR 2026 oral" as venue strings. Both return EMPTY_RESULT today because the venue is not open. Update each case to use natural-language venue text that OpenReview currently exposes ("ICLR 2024 oral", "NeurIPS 2025 oral") and a real forum id (KS8mIvetg2, "Proving Test Set Contamination in Black-Box Language Models") in the reviews / paper drill-down. Note the arxiv free-text-search ranking quirk so the worked DPO example makes sense.
## What
Three changes to the arxiv adapter, all agent-facing data quality / coverage gaps I hit while exploring:
1. `paper <id>` — stop silently truncating the data

   The Atom feed for `1706.03762` has 8 authors and a ~200-word abstract. Old behavior: abstract truncated to 200 chars with a trailing `…` (silent — caller had no signal); authors cut via `.slice(0, 3).join(', ')` (also silent). New behavior: full abstract, full author list. Plus surface fields the API was already returning that we threw away:

   - `pdf` (from `<link rel="related" type="application/pdf">`, with id-based fallback)
   - `primary_category` (e.g. `cs.CL`)
   - `categories` (all `<category term>` entries comma-joined)
   - `comment` (e.g. `"15 pages, 5 figures"`, conference info, etc.)
   - `updated`

2. `search <query>` — minor cleanup

   Adds `primary_category` to columns (cheap signal of paper area). Switches the no-results path from `CliError('NOT_FOUND', …)` to `EmptyResultError('arxiv', …)` to match the convention other public-API adapters use (matches the lesson from #1285, "fix(douban): drop unparseable fields from movie-hot, add id/votes").

3. `recent <category>` — new

   You couldn't ask "what's new in cs.CL today?" without inventing a search term. Now: uses `search_query=cat:<category>&sortBy=submittedDate&sortOrder=descending`. Validates the category string against the arxiv `<archive>.<sub>` shape (also accepts archive-only like `math` or `q-bio`). Bad input → `ArgumentError`.

## Why
I'm an agent that calls these adapters. With the old `paper`, an agent fetching "Attention Is All You Need" got 3 of 8 authors and a clipped first paragraph — silently. That's a worse failure mode than an outright error: the caller has no idea data is missing. `recent` is just an obvious gap — `arxiv search "anything"` was previously the only way to read the Atom feed.

## Tested
- `clis/arxiv/arxiv.test.js`: 4 tests covering command registration, full-abstract / all-authors parsing, XML entity decoding in titles, pdf/categories/comment extraction, and category validation.
- `opencli arxiv paper 1706.03762` → all 8 authors, full abstract, `primary_category=cs.CL`, pdf URL present.
- `opencli arxiv recent cs.CL --limit 3` → 3 newest cs.CL papers, sorted by submitted date desc.
- `opencli arxiv search "attention is all you need" --limit 3` → 3 hits with `primary_category` surfaced.
- `opencli arxiv recent "BAD..STR"` → `ARGUMENT` error (exit 2).