v4.5.0 — Phase C: Robustness, Security & Polish
Phase C of IMPROVEMENT_PLAN.md — "Robustness, Security & Polish". Closes all C-series items so tools are robust, polite on the network, and consistent in their contracts. Regression coverage ships in tests/unit/phaseC-regressions.test.js (27 tests).
Added
- get_batch_results tool: paginated retrieval of batch_scrape results by batchId (page/pageSize). Tool count 23 → 24. Also restored list_ollama_models to the startup tool list. (server.js, src/tools/advanced/batchScrape/index.js)
- stealth_mode engine selection: the engine option takes 'chromium' (default) or 'camoufox', wired through the operation-based scrape_with_stealth → createStealthContext → launchStealthBrowser path; a mismatched running browser is torn down before switching. (src/core/StealthBrowserManager.js)
- extract_with_llm structured output: when a schema is provided and the provider is Anthropic, output is forced via tool-use (tools + tool_choice), guaranteeing schema-shaped JSON; output is then validated with zod (valid/validationErrors in the result). Truncation metadata (truncated, original_length) is surfaced. (src/tools/extract/extractWithLlm.js)
- process_document page ranges: setting options.pageRange to {start, end} (1-based, inclusive) returns exactly those pages via per-page pagerender capture. The server options schema is now passthrough so granular options (maxPages, pageRange, extractText, …) actually reach the tool instead of being stripped. (src/core/processing/PDFProcessor.js, src/tools/extract/processDocument.js, server.js)
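To make the 1-based, inclusive pageRange semantics concrete, here is a minimal sketch. It is not the code in PDFProcessor.js: the helper name selectPageRange, the pages array of per-page text, and the choice to clamp end to the document length are all assumptions made for illustration.

```javascript
// Hypothetical helper (not from the codebase): `pages` is an array of
// per-page text where index 0 holds page 1.
function selectPageRange(pages, pageRange) {
  // No range requested: return a copy of every page.
  if (!pageRange) return pages.slice();

  const { start, end } = pageRange;
  if (!Number.isInteger(start) || !Number.isInteger(end) || start < 1 || end < start) {
    throw new RangeError(`Invalid pageRange: ${JSON.stringify(pageRange)}`);
  }

  // Array.prototype.slice takes a 0-based start and an exclusive end, so a
  // 1-based inclusive {start, end} maps to slice(start - 1, end).
  // Clamping `end` to the page count is an assumption of this sketch.
  return pages.slice(start - 1, Math.min(end, pages.length));
}

const pages = ['p1', 'p2', 'p3', 'p4', 'p5'];
console.log(selectPageRange(pages, { start: 2, end: 4 })); // [ 'p2', 'p3', 'p4' ]
```

Validating the range up front, rather than logging a warning and returning every page, keeps the "returns exactly those pages" contract explicit to callers.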
Fixed
- fetch_url body-size cap: a Content-Length pre-check plus a streaming byte-count guard (configurable via MAX_FETCH_BODY_SIZE, default 25 MB) prevents memory exhaustion across all basic tools. The guard is defensive: responses without a Headers object or a ReadableStream body are returned unchanged so native .text()/.json() keep working. (src/tools/basic/_fetch.js, src/constants/config.js)
- Ineffective fetch timeouts: replaced the no-op timeout: option (ignored by Node fetch) with AbortSignal.timeout(...) in extract_content, process_document, and track_changes. (src/tools/extract/extractContent.js, src/tools/extract/processDocument.js, src/tools/tracking/trackChanges/differ.js)
- generate_llms_txt intrusive probing: security-path and rate-limit probing are now opt-in (checkSecurity, probeRateLimit default false); remaining probes run in bounded parallel batches instead of long sequential loops. (src/core/LLMsTxtAnalyzer.js, src/tools/llmstxt/generateLLMsTxt.js)
- crawl_deep rate limiting & logging: per-domain rate-limiter map (reused rather than recreated per URL); filter/robots block messages routed through logger.debug instead of raw console.error (stdout-hygiene). (src/core/crawlers/BFSCrawler.js)
- stealth_mode sec-ch-ua mismatch: sec-ch-ua brand versions are derived from the resolved User-Agent's Chrome major version (was hardcoded 120 against a 121 UA). (src/core/StealthBrowserManager.js)
- Stale User-Agent: fetch_url/extract_structured now send a version-derived CrawlForge/<version> (+https://crawlforge.dev) UA (was CrawlForge/1.0.0 / CrawlForge-MCP/3.0). (src/tools/basic/_fetch.js, src/tools/extract/extractStructured.js)
- localization geo-blocking & phone regex: handle_geo_blocking renamed to detect_geo_blocking (it only detects and recommends; no bypass is applied); fixed the US phone regex (\\d → \d). (src/core/LocalizationManager.js, server.js)
- extract_with_llm JSON recovery: extracts the first balanced embedded JSON object/array (string/escape-aware), tolerating prose both before and after the JSON; previously only leading-prose-then-trailing-JSON was recovered. (src/tools/extract/extractWithLlm.js)
- list_ollama_models robustness: hardened against a non-array models field; modified_at normalized to ISO 8601. (src/tools/extract/listOllamaModels.js)
- process_document page extraction: extractPDFPages now produces a real page range; previously its endPage was clobbered by maxPages, and startPage > 1 only logged a warning while returning all pages. (src/core/processing/PDFProcessor.js)
- batch_scrape markdown title / webhook status: markdown builder de-dups the <title> heading against the first <h1>; webhook delivery status is returned on the batch result. (src/tools/advanced/batchScrape/worker.js, reporter.js, index.js)
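The string/escape-aware balanced JSON recovery described for extract_with_llm can be sketched roughly as follows. This is an illustration under assumptions, not the code in extractWithLlm.js: the function names extractFirstJson and findBalancedEnd are hypothetical, and returning null when nothing parses is a choice made for this sketch.

```javascript
// Hypothetical helpers (not from the codebase).
// Returns the first embedded JSON object/array in `text`, or null if none parses.
function extractFirstJson(text) {
  for (let start = 0; start < text.length; start++) {
    const open = text[start];
    if (open !== '{' && open !== '[') continue;

    const end = findBalancedEnd(text, start);
    if (end === -1) continue; // unbalanced from this opener; try the next one

    try {
      return JSON.parse(text.slice(start, end + 1));
    } catch {
      // Balanced but not valid JSON (e.g. "{like this}"); keep scanning.
    }
  }
  return null;
}

// Returns the index of the bracket that closes the one at `start`, or -1.
// Brackets inside string literals are ignored, including after escaped quotes.
function findBalancedEnd(text, start) {
  const stack = [];
  let inString = false;
  let escaped = false;

  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;          // character after a backslash
      else if (ch === '\\') escaped = true;  // start of an escape sequence
      else if (ch === '"') inString = false; // unescaped closing quote
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === '{' || ch === '[') stack.push(ch);
    else if (ch === '}' || ch === ']') {
      const expected = ch === '}' ? '{' : '[';
      if (stack.pop() !== expected) return -1; // mismatched closer
      if (stack.length === 0) return i;
    }
  }
  return -1;
}

const reply = 'Sure! Here it is: {"a": "x } y", "b": [1, 2]} Hope that helps.';
console.log(extractFirstJson(reply)); // { a: 'x } y', b: [ 1, 2 ] }
```

Because the scan tracks string and escape state, a closing brace inside a string value does not end the match early, and prose on either side of the JSON is simply skipped.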
Verified
- npm run test:unit: 360/360 (sandbox-off; the sandbox-on listen EPERM failures are the pre-existing HTTP-transport/searxng port-binding cases).
- node test-tools.js: 20/20 (100%).
- npm test: MCP harness exits 0 (0 errors).
- npm audit: 4 pre-existing moderate advisories (uuid/node-cron transitive), out of Phase-C scope.
- Version bumped 4.4.0 → 4.5.0; tool count 23 → 24.