v4.6.0 — Phase D: Agent + Unified Scrape + Onboarding
Phase D of IMPROVEMENT_PLAN.md — "Firecrawl-Competitive: Agent + Unified Scrape + Onboarding". Closes the three Firecrawl feature gaps with no clean CrawlForge equivalent — an autonomous agent, a unified scrape entry point, and ranked map — plus a one-command onboarding flow, all local-first (MCP-native primitives + local-LLM via Ollama; no cloud proxy/reliability layer). Purely additive: tool count 24 → 26, no breaking changes to existing tools. Regression coverage ships in tests/unit/phaseD-regressions.test.js.
Added
- scrape tool (unified scrape): one call takes a formats array ("markdown" | "html" | "rawHtml" | "text" | "links" | "metadata" | "screenshot" or {type:"json", schema, prompt?}) plus an onlyMainContent flag, does a single fetch + one cheerio load, and dispatches each requested format from that one parse (reusing extractBlockText, the Readability→markdown helper, htmlToMarkdown, and ExtractWithLlm for JSON). Partial success is non-fatal: a failed format records a warnings[] entry rather than failing the whole call. onlyMainContent maps to the existing Readability boilerplate-removal branch. New: src/tools/scrape/unifiedScrape.js.
- agent tool (autonomous): natural-language prompt → autonomous search / navigate / extract → prose-or-structured output, no URLs required. Input: prompt, optional urls[] (≤20), optional schema, model: 'default'|'pro', maxSteps (≤10), maxUrls (≤20). Built as a hardcoded PLAN → GATHER → ACT → DECIDE → SHAPE state machine over existing pieces (SearchWebTool, fetchAndParse, ExtractWithLlm, SamplingClient, and ResearchOrchestrator for the pro tier). Three independent hard stops (steps, URLs, and a wall-clock budget) plus "answer found", all enforced in the orchestrator and never delegated to the LLM. The no-LLM-key path returns a degraded-but-useful result ({degraded:true, reason, ...evidence}) so the host LLM can finish, mirroring deep_research's raw-evidence behavior. ElicitationHelper confirms before a pro/expensive run (fail-open). New: src/core/AgentOrchestrator.js, src/tools/agent/agent.js.
- map_site search= ranking: an optional search string ranks discovered URLs by relevance via the existing ResultRanker.rankResults() (slug adapter over the URL path) and emits ranked_urls:[{url, score}] sorted descending. Default (no-search) output shape is unchanged (back-compat). The ranker is constructed lazily/once to avoid its CacheManager timer leaking per request. src/tools/crawl/mapSite.js, server.js
- crawlforge init CLI command: one command orchestrates existing pieces, namely API-key detection, skill installation (install({target})), and idempotent merge of the MCP server stanza into the detected client config (~/.claude.json, Claude Desktop's OS-specific config, Cursor ~/.cursor/mcp.json), without clobbering other servers. Flags: --all, --client <name>, --yes. New: src/cli/commands/init.js; registered in src/cli/index.js; package.json postinstall hint updated.
- SKILL.md: canonical, agent-fetchable capabilities reference generated by concatenating src/skills/*.md (the same concatenateSkills() source used by the installer), with a "Phase D New Tools" section documenting scrape and agent. Referenced from README.md.
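The single-parse, partial-success dispatch described for the scrape tool can be sketched roughly as follows. This is a minimal illustration, not the code in src/tools/scrape/unifiedScrape.js: the handlers map, the dispatchFormats name, the data/warnings result shape, and the regex-based handlers are all assumptions made for the example, and a plain object stands in for the loaded cheerio instance.

```javascript
// Hypothetical sketch: every name here is illustrative, not CrawlForge's actual API.
// `parsed` stands in for the result of the single fetch + single cheerio load.
const handlers = new Map([
  ['html', (parsed) => parsed.body],
  ['text', (parsed) => parsed.body.replace(/<[^>]+>/g, '').trim()],
  ['links', (parsed) => [...parsed.body.matchAll(/href="([^"]+)"/g)].map((m) => m[1])],
  // Simulates a format that cannot be produced in this environment.
  ['screenshot', () => { throw new Error('no browser available'); }],
]);

function dispatchFormats(parsed, formats) {
  const result = { data: {}, warnings: [] };
  for (const format of formats) {
    // A format is either a string or an object such as {type: "json", schema}.
    const name = typeof format === 'string' ? format : format.type;
    try {
      const handler = handlers.get(name);
      if (!handler) throw new Error(`unsupported format: ${name}`);
      result.data[name] = handler(parsed, format);
    } catch (err) {
      // Partial success: record the failure and keep processing other formats.
      result.warnings.push({ format: name, message: err.message });
    }
  }
  return result;
}

const parsed = { body: '<p>Hello <a href="/docs">docs</a></p>' };
const out = dispatchFormats(parsed, ['text', 'links', 'screenshot']);
console.log(out.data.text);   // Hello docs
console.log(out.data.links);  // [ '/docs' ]
console.log(out.warnings);    // one entry, for "screenshot"
```

Every handler receives the same parsed input, so adding a format never triggers a second fetch, and a throwing handler only costs that one format.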
Changed
- extract_text reuse: extractBlockText($) is now exported and the Readability→markdown conversion is factored into a reusable exported helper, so the new scrape tool reuses them against an already-loaded cheerio instance without re-fetching. No behavior change to extract_text. src/tools/basic/extractText.js
- Cost model: getToolCost() adds scrape: 2 and agent: 8; projectCost() scales scrape with the number of requested formats and agent with maxUrls + the pro tier (external LLM usage is billed by the provider, not in credits). src/core/AuthManager.js
- Tool count 24 → 26: scrape and agent registered (both withAuth), added to the startup tool list and graceful-shutdown cleanup; server description updated.
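As a rough illustration of the cost model item above, the sketch below shows one way such a projection could be shaped. The only figures taken from this entry are the base costs (scrape: 2, agent: 8). How projectCost() actually scales them is not stated here, so the linear per-format scaling, the PER_URL_COST and PRO_MULTIPLIER values, and the projectCostSketch name are placeholder assumptions and do not reflect src/core/AuthManager.js.

```javascript
// Base credit costs as listed in this changelog entry.
const BASE_COST = { scrape: 2, agent: 8 };

// ASSUMPTIONS: placeholder scaling factors, not values from AuthManager.js.
const PER_URL_COST = 1;
const PRO_MULTIPLIER = 2;

function projectCostSketch(tool, params = {}) {
  if (!Object.prototype.hasOwnProperty.call(BASE_COST, tool)) {
    throw new Error(`unknown tool: ${tool}`);
  }
  const base = BASE_COST[tool];

  if (tool === 'scrape') {
    // Assumed: cost grows linearly with the number of requested formats.
    const formatCount = Math.max(1, (params.formats || []).length);
    return base * formatCount;
  }

  // agent: assumed base + per-URL charge, scaled up for the pro tier.
  // maxUrls is capped at 20, matching the input limit documented above.
  const maxUrls = Math.min(20, Math.max(0, params.maxUrls ?? 0));
  const subtotal = base + PER_URL_COST * maxUrls;
  return params.model === 'pro' ? subtotal * PRO_MULTIPLIER : subtotal;
}

console.log(projectCostSketch('scrape', { formats: ['markdown', 'links', 'text'] }));
console.log(projectCostSketch('agent', { maxUrls: 5, model: 'pro' }));
```

External LLM usage stays out of the projection entirely, consistent with the note that the provider bills it separately.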
Verified
tests/unit/phaseD-regressions.test.js 34/34 pass (mocked LLM/search/fetch — no live network; covers the agent loop's maxSteps/maxUrls/wall-clock hard stops and clamps, the no-LLM degraded path, unified scrape single-fetch multi-format + partial-success warnings, and map_site search= ranking). Full npm run test:unit green except the pre-existing streamableHttp / searchWebSearxng suites, which fail only under the sandbox's listen EPERM (localhost-bind) restriction and pass cleanly with the sandbox disabled (0 failures). npm test MCP harness exits 0 (0 errors; 60% rate unchanged from v4.5.0). node test-tools.js 15/15 pass + 5 network-skipped (100%). Live MCP smoke tests are deferred — they require publishing + reinstalling the global binary. Version bumped 4.5.0 → 4.6.0; tool count 24 → 26.
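The hard stops and clamps that the regression suite exercises can be pictured with a small loop. This is a sketch under stated assumptions rather than the AgentOrchestrator code: runAgentLoop, the step callback, the injected now clock, the default values, and the stoppedBy strings are all hypothetical. Only the limits (maxSteps ≤ 10, maxUrls ≤ 20) and the principle that the orchestrator, not the LLM, enforces the stops come from the notes above.

```javascript
// Hypothetical sketch of orchestrator-enforced hard stops; names are illustrative.
const clamp = (value, min, max) => Math.min(max, Math.max(min, value));

function runAgentLoop({ step, maxSteps = 5, maxUrls = 10, budgetMs = 60000, now = Date.now }) {
  const stepLimit = clamp(maxSteps, 1, 10); // caller input is clamped, never trusted
  const urlLimit = clamp(maxUrls, 1, 20);
  const deadline = now() + budgetMs;

  let steps = 0;
  let urlsVisited = 0;
  while (true) {
    // Each limit is checked independently before the next step runs.
    if (steps >= stepLimit) return { stoppedBy: 'maxSteps', steps, urlsVisited };
    if (urlsVisited >= urlLimit) return { stoppedBy: 'maxUrls', steps, urlsVisited };
    if (now() >= deadline) return { stoppedBy: 'wallClock', steps, urlsVisited };

    // `step` stands in for one pass of the state machine (an LLM call in practice).
    const { urlsFetched = 0, answerFound = false } = step(steps);
    steps += 1;
    urlsVisited += urlsFetched;
    if (answerFound) return { stoppedBy: 'answerFound', steps, urlsVisited };
  }
}

// A fake clock lets the wall-clock stop be shown without waiting or network access.
let fakeTime = 0;
const result = runAgentLoop({
  step: () => { fakeTime += 400; return { urlsFetched: 1 }; },
  maxSteps: 50, // clamped to 10
  budgetMs: 1000,
  now: () => fakeTime,
});
console.log(result); // { stoppedBy: 'wallClock', steps: 3, urlsVisited: 3 }
```

Injecting the clock and the step function is what makes this kind of loop testable with mocks only, which is the approach the suite above describes.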