Skip to content

v4.6.0 — Phase D: Agent + Unified Scrape + Onboarding

Choose a tag to compare

@mysleekdesigns mysleekdesigns released this 07 Jun 17:59
· 18 commits to main since this release

Phase D of IMPROVEMENT_PLAN.md — "Firecrawl-Competitive: Agent + Unified Scrape + Onboarding". Closes the three Firecrawl feature gaps with no clean CrawlForge equivalent — an autonomous agent, a unified scrape entry point, and ranked map — plus a one-command onboarding flow, all local-first (MCP-native primitives + local-LLM via Ollama; no cloud proxy/reliability layer). Purely additive: tool count 24 → 26, no breaking changes to existing tools. Regression coverage ships in tests/unit/phaseD-regressions.test.js.

Added

  • scrape tool (unified scrape) — one call takes a formats array ("markdown" | "html" | "rawHtml" | "text" | "links" | "metadata" | "screenshot" or {type:"json", schema, prompt?}) plus an onlyMainContent flag, does a single fetch + one cheerio load, and dispatches each requested format from that one parse (reusing extractBlockText, the Readability→markdown helper, htmlToMarkdown, and ExtractWithLlm for JSON). Partial-success is non-fatal: a failed format records a warnings[] entry rather than failing the whole call. onlyMainContent maps to the existing Readability boilerplate-removal branch. New: src/tools/scrape/unifiedScrape.js.
  • agent tool (autonomous) — natural-language prompt → autonomous search / navigate / extract → prose-or-structured output, no URLs required. Input: prompt, optional urls[] (≤20), optional schema, model:'default'|'pro', maxSteps (≤10), maxUrls (≤20). Built as a hardcoded PLAN → GATHER → ACT → DECIDE → SHAPE state machine over existing pieces (SearchWebTool, fetchAndParse, ExtractWithLlm, SamplingClient, and ResearchOrchestrator for the pro tier). Three independent hard stops — steps, URLs, and a wall-clock budget — plus "answer found", all enforced in the orchestrator and never delegated to the LLM. No-LLM-key path returns a degraded-but-useful result ({degraded:true, reason, ...evidence}) so the host LLM can finish, mirroring deep_research's raw-evidence behavior. ElicitationHelper confirms before a pro/expensive run (fail-open). New: src/core/AgentOrchestrator.js, src/tools/agent/agent.js.
  • map_site search= ranking — optional search string ranks discovered URLs by relevance via the existing ResultRanker.rankResults() (slug adapter over the URL path) and emits ranked_urls:[{url, score}] sorted descending. Default (no-search) output shape is unchanged (back-compat). The ranker is constructed lazily/once to avoid its CacheManager timer leaking per request. src/tools/crawl/mapSite.js, server.js
  • crawlforge init CLI command — one command orchestrates existing pieces: API-key detection, skill installation (install({target})), and idempotent merge of the MCP server stanza into the detected client config (~/.claude.json, Claude Desktop's OS-specific config, Cursor ~/.cursor/mcp.json) — without clobbering other servers. Flags: --all, --client <name>, --yes. New: src/cli/commands/init.js; registered in src/cli/index.js; package.json postinstall hint updated.
  • SKILL.md — canonical, agent-fetchable capabilities reference generated by concatenating src/skills/*.md (the same concatenateSkills() source used by the installer), with a "Phase D New Tools" section documenting scrape and agent. Referenced from README.md.

Changed

  • extract_text reuseextractBlockText($) is now exported and the Readability→markdown conversion is factored into a reusable exported helper, so the new scrape tool reuses them against an already-loaded cheerio instance without re-fetching. No behavior change to extract_text. src/tools/basic/extractText.js
  • Cost modelgetToolCost() adds scrape: 2 and agent: 8; projectCost() scales scrape with the number of requested formats and agent with maxUrls + the pro tier (external LLM usage is billed by the provider, not in credits). src/core/AuthManager.js
  • Tool count 24 → 26scrape and agent registered (both withAuth), added to the startup tool list and graceful-shutdown cleanup; server description updated.

Verified

tests/unit/phaseD-regressions.test.js 34/34 pass (mocked LLM/search/fetch — no live network; covers the agent loop's maxSteps/maxUrls/wall-clock hard stops and clamps, the no-LLM degraded path, unified scrape single-fetch multi-format + partial-success warnings, and map_site search= ranking). Full npm run test:unit green except the pre-existing streamableHttp / searchWebSearxng suites, which fail only under the sandbox's listen EPERM (localhost-bind) restriction and pass cleanly with the sandbox disabled (0 failures). npm test MCP harness exits 0 (0 errors; 60% rate unchanged from v4.5.0). node test-tools.js 15/15 pass + 5 network-skipped (100%). Live MCP smoke tests are deferred — they require publishing + reinstalling the global binary. Version bumped 4.5.0 → 4.6.0; tool count 24 → 26.