Switch browser engine from puppeteer-extra/stealth to CloakBrowser #307
Merged
orangecoding merged 12 commits into master on May 10, 2026
Conversation
- Replace `puppeteer`, `puppeteer-extra`, and `puppeteer-extra-plugin-stealth` with `cloakbrowser` + `puppeteer-core`; CloakBrowser applies 49 source-level C++ fingerprint patches that cannot be detected at the JS layer.
- Enable `humanize: true` in `launchBrowser()` for Bézier mouse curves, natural keyboard timing, and realistic scroll physics (see the sketch after this list).
- Remove manual `userDataDir` management and the ARM64 `executablePath` override; CloakBrowser ships its own binary for x86_64 and arm64.
- Pass the proxy via CloakBrowser's native `proxy` option instead of the `--proxy-server` Chrome flag.
- Dockerfile: add `fonts-noto-color-emoji` + `fonts-freefont-ttf` so canvas fingerprint hashes match real browsers (required for Kasada/Akamai); replace `npx puppeteer browsers install` with a `node ensureBinary()` call; remove the `TARGETARCH` ARG and the ARM64 system-Chromium branch.
- Update the test mock to reflect the simplified browser object (no `__fredy_*` fields).

https://claude.ai/code/session_01WXzA3orbwE2hdk723c6MgH
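For orientation, a minimal sketch of what the new `launchBrowser()` might look like. The `humanize` and native `proxy` options come from this commit; the named `launch` export and the remaining call shape are assumptions, not the verified cloakbrowser API:

```js
// Hedged sketch of launchBrowser(); option names `humanize` and `proxy`
// are from the commit text, everything else is assumed.
import { launch } from 'cloakbrowser';

export async function launchBrowser({ proxyUrl } = {}) {
  return launch({
    headless: true,                           // assumption: same headless default as before
    humanize: true,                           // Bézier mouse curves, natural typing, scroll physics
    ...(proxyUrl ? { proxy: proxyUrl } : {}), // native proxy option replaces --proxy-server
  });
}
```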
CloakBrowser ships its own Chromium binary with an independent CA bundle. This flag prevents ERR_CERT_AUTHORITY_INVALID failures in environments with SSL-inspecting proxies or non-standard root CAs (Docker CI, corporate networks). https://claude.ai/code/session_01WXzA3orbwE2hdk723c6MgH
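The commit subject naming the flag is truncated here. Assuming it is Chromium's `--ignore-certificate-errors` (the standard switch that suppresses ERR_CERT_AUTHORITY_INVALID), a minimal sketch might look like this; the `args` passthrough on CloakBrowser's launch options is itself an assumption:

```js
// Hedged sketch: both the exact flag and the args passthrough are assumptions.
const browser = await launch({
  humanize: true,
  args: ['--ignore-certificate-errors'], // tolerate SSL-inspecting proxies / custom root CAs
});
```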
- Remove all CDP overrides (`applyBotPreventionToPage`, `applyLanguagePersistence`, `applyPostNavigationHumanSignals`) that created detectable inconsistencies on top of CloakBrowser's C++ patches; pass the locale to CloakBrowser at launch instead (see the sketch below).
- Drop the `--lang` arg (replaced by the CloakBrowser locale flag).
- Extend the immowelt `puppeteerTimeout` to 90 s to accommodate React SPA rendering latency under CloakBrowser's humanize delays.
- Fix the kleinanzeigen detail test: serve the offline fixture for the search URL so only individual detail pages are fetched live, avoiding rate limiting from a second fresh session hitting the same search endpoint.

https://claude.ai/code/session_01WXzA3orbwE2hdk723c6MgH
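A hedged sketch of the locale change; the `locale` option name is taken from the commit text, the rest of the call shape is assumed:

```js
// Before: locale was faked after launch via CDP overrides, which could
// contradict CloakBrowser's C++-level fingerprint patches.
// After (sketch): the locale is handed to CloakBrowser at launch.
const browser = await launch({
  humanize: true,
  locale: 'de-DE', // replaces the removed --lang Chrome arg and applyLanguagePersistence()
});
```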
…ed detail test

Immowelt's CDN challenges cold browser sessions before React can render the listing grid, causing the old `waitForSelector` approach to silently time out.

- Add a `preNavigateUrl` option to `puppeteerExtractor`: visits a warm-up page first so the site sees an established session before the search URL.
- Add a `waitForNetworkIdle` option: a second idle-wait phase after `domcontentloaded` that catches React's listing API round-trip (which fires long after the initial HTML is parsed); errors are swallowed so the partial DOM is still used (see the sketch after this list).
- Switch the immowelt config to `waitForSelector=null` + networkidle warm-up so `page.content()` is returned after the SPA has loaded its data.
- Set the immowelt `preNavigateUrl` to the homepage to warm the session.
- In the detail enrichment test, spy on `puppeteerExtractor` to serve the offline fixture for the search URL; only individual listing detail pages are fetched live (they are far less aggressively protected).

https://claude.ai/code/session_01WXzA3orbwE2hdk723c6MgH
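A sketch of how the two new options could fit into `puppeteerExtractor`, using the standard puppeteer-core Page API. The option names come from the commit; the function shape, the idle timeouts, and the `page` parameter are assumptions:

```js
// Hypothetical core of puppeteerExtractor with the two new options.
async function extractContent(page, { url, preNavigateUrl, waitForNetworkIdle }) {
  if (preNavigateUrl) {
    // Warm-up visit: the CDN sees an established session before the search URL.
    await page.goto(preNavigateUrl, { waitUntil: 'domcontentloaded' });
  }
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  if (waitForNetworkIdle) {
    // Second idle-wait catches React's listing API round-trip, which fires
    // long after the initial HTML parses; errors are swallowed so a partial
    // DOM is still usable.
    await page.waitForNetworkIdle({ idleTime: 1000, timeout: 30_000 }).catch(() => {});
  }
  return page.content();
}
```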
Add a Vitest globalSetup that calls ensureBinary() once in the main process before workers start. Without this, running yarn test on a fresh checkout (or after the binary cache is cleared) immediately fails every browser-based test with "Failed to launch the browser process" before any useful output appears. The setup is a no-op in offline mode and when the binary is already cached. https://claude.ai/code/session_01WXzA3orbwE2hdk723c6MgH
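A sketch of what such a globalSetup might look like; the offline-mode env check is a placeholder, and the real condition in the repo may differ:

```js
// test/globalSetup.js: runs once in the main process before Vitest workers start.
import { ensureBinary } from 'cloakbrowser'; // named export shape is an assumption

export default async function setup() {
  // Hypothetical offline guard; the repo's actual offline-mode switch may differ.
  if (process.env.FREDY_OFFLINE_TESTS === 'true') return;
  await ensureBinary(); // no-op when the binary is already cached
}
```

This would be wired up via the `globalSetup` option in the Vitest config.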
Direct runs (yarn start:backend) on a fresh checkout have no binary and only crash when the first scraping job fires. Calling ensureBinary() at startup downloads it on first run and is instant when already cached. In Docker it stays a no-op since the binary is pre-baked during docker build. https://claude.ai/code/session_01WXzA3orbwE2hdk723c6MgH
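The startup counterpart, sketched under the same assumptions:

```js
// index.js (startup): fetch the binary before the first scraping job fires.
// Instant when already cached; effectively a no-op in Docker, where the
// binary is pre-baked during `docker build`.
await ensureBinary();
```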
The "Invalid file descriptor to ICU data received" crash seen in Sparkasse tests was caused by a partially-extracted CloakBrowser .4 binary that contained only the chrome executable but was missing icudtl.dat and other resource files. The ensureBinary() function returned this incomplete installation because latest_version_linux-x64 pointed to .4. The --no-zygote flag is kept as a safeguard for container environments with limited kernel namespaces, but the comment now accurately describes its purpose rather than attributing it to a non-existent fd inheritance issue. https://claude.ai/code/session_01WXzA3orbwE2hdk723c6MgH
…installs

CloakBrowser's `ensureBinary()` only checks that the chrome executable exists, not that required resource files (`icudtl.dat`, `resources.pak`) are present. A partial extraction (e.g. an interrupted update) can leave a directory that passes `ensureBinary()`'s check but causes Chrome to crash immediately with "Invalid file descriptor to ICU data received".

`ensureValidBinary()` wraps `ensureBinary()` with a completeness check (see the sketch below):

- If the required resource files are missing, it removes the corrupt directory and all `latest_version*` markers, then calls `ensureBinary()` again so it falls back to (or re-downloads) a complete build.
- It pins the validated path via `CLOAKBROWSER_BINARY_PATH` so CloakBrowser's own internal `ensureBinary()` call inside `launch()` always uses the same, verified binary.

Used in `index.js` (app startup) and `test/globalSetup.js` (before live tests).

https://claude.ai/code/session_01WXzA3orbwE2hdk723c6MgH
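A sketch of `ensureValidBinary()` as described above. The resource file names, the `latest_version*` markers, and the `CLOAKBROWSER_BINARY_PATH` variable come from the commit messages; that `ensureBinary()` returns the executable path and that the cache lives under `~/.cloakbrowser` are assumptions:

```js
import fs from 'node:fs';
import os from 'node:os';
import path from 'node:path';
import { ensureBinary } from 'cloakbrowser'; // export shape is an assumption

const REQUIRED = ['icudtl.dat', 'resources.pak']; // Linux/Windows resource files

export async function ensureValidBinary() {
  let binaryPath = await ensureBinary(); // assumption: resolves to the executable path
  const dir = path.dirname(binaryPath);  // versioned dir on Linux; see the macOS fix below
  const complete = REQUIRED.every((f) => fs.existsSync(path.join(dir, f)));
  if (!complete) {
    // Wipe the corrupt installation and every latest_version* marker, then
    // retry so ensureBinary() falls back to (or re-downloads) a complete build.
    fs.rmSync(dir, { recursive: true, force: true });
    const cache = path.join(os.homedir(), '.cloakbrowser'); // assumed cache location
    for (const entry of fs.readdirSync(cache)) {
      if (entry.startsWith('latest_version')) fs.rmSync(path.join(cache, entry));
    }
    binaryPath = await ensureBinary();
  }
  // Pin the verified path so launch()'s internal ensureBinary() reuses it.
  process.env.CLOAKBROWSER_BINARY_PATH = binaryPath;
  return binaryPath;
}
```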
…e-limiting

The second sparkasse test launched a fresh browser against the live search endpoint right after the first test already had, leaving the IP in a suspicious state that caused bot detection or rate limiting to return empty results. When `getListings()` returns nothing, `execute()` resolves to `undefined` and `expect(listings).toBeInstanceOf(Array)` fails.

Apply the same hybrid fixture approach used by kleinanzeigen and immowelt (sketched below): intercept `puppeteerExtractor` calls whose pathname matches the search URL and return the offline fixture, while letting individual detail page requests go live (they are less aggressively rate-limited than the search endpoint).

https://claude.ai/code/session_01WXzA3orbwE2hdk723c6MgH
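A sketch of the hybrid interception in Vitest; the module path, the export name, the search pathname, and the `readFixture` helper are all assumptions for illustration:

```js
import { vi } from 'vitest';
import * as puppeteerExtractorMod from '../lib/services/puppeteerExtractor.js'; // assumed path
import { readFixture } from './utils/fixtures.js'; // hypothetical fixture helper

const realExtractor = puppeteerExtractorMod.puppeteerExtractor;

vi.spyOn(puppeteerExtractorMod, 'puppeteerExtractor').mockImplementation(async (url, opts) => {
  // Search endpoint: serve the offline fixture to avoid a second cold hit.
  if (new URL(url).pathname === '/immobilien/suche') { // hypothetical search pathname
    return readFixture('sparkasse-search.html');       // hypothetical fixture name
  }
  // Detail pages are less aggressively rate-limited: let them go live.
  return realExtractor(url, opts);
});
```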
Remove the fixture-backed spy; live tests must hit the real server.

Root problem: two cold browser sessions hitting sparkasse in quick succession triggered bot detection, causing the second search request to return empty results and `execute()` to resolve to `undefined`.

Fix (see the sketch below):

- One browser is launched in `beforeAll` and reused across both tests, so both the search and detail requests come from the same warm session.
- The detail test calls `provider.config.fetchDetails()` directly on the listings returned by the first test instead of re-running the full pipeline. This avoids a redundant second scrape of the search page while still exercising the live detail endpoint.

https://claude.ai/code/session_01WXzA3orbwE2hdk723c6MgH
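A sketch of the shared-session pattern in Vitest. `fetchDetails()` and the `beforeAll`/`afterAll` structure come from the commit text; the `Fredy` constructor signature, `launchBrowser`, `jobConfig`, and `provider` wiring are assumptions:

```js
import { afterAll, beforeAll, expect, it } from 'vitest';

let browser;
let listings;

beforeAll(async () => {
  browser = await launchBrowser(); // one warm session shared by both tests
});

afterAll(async () => {
  await browser?.close();
});

it('finds listings on the live search page', async () => {
  // Hypothetical wiring: pass the shared browser into Fredy's constructor.
  listings = await new Fredy(jobConfig, browser).execute();
  expect(listings).toBeInstanceOf(Array);
});

it('enriches a listing with details', async () => {
  // Reuses the listings and browser from the first test; no second cold scrape.
  const enriched = await provider.config.fetchDetails(listings[0], browser);
  expect(enriched).toBeDefined();
});
```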
…il tests

All five provider tests with a "with provider_details enabled" describe block were either (a) intercepting the search URL with an offline fixture to avoid hitting the live server twice, or (b) re-running the full `execute()` pipeline with a fresh browser, which triggered rate limiting / bot detection on the second cold request.

Pattern applied to all five (the same shared-session approach sketched above):

- immowelt, kleinanzeigen, wgGesucht, immobilienDe: launch one browser in `beforeAll`/`afterAll`, pass it to the first test's Fredy constructor, and call `provider.config.fetchDetails()` directly in the second test using the listings and browser already in hand. One warm session, two live endpoints tested.
- immoscout: API-based (no browser), so no browser sharing needed. The second test calls `provider.config.fetchDetails()` directly on `liveListings[0]` from the first test instead of re-querying the search API.

Removed: all `readFixture` spies, `getKnownListingHashesForJobAndProvider` mocks, and the `puppeteerExtractorMod` imports that were only needed for the spy.

https://claude.ai/code/session_01WXzA3orbwE2hdk723c6MgH
On macOS the CloakBrowser binary lives at:

~/.cloakbrowser/chromium-X.Y.Z/Chromium.app/Contents/MacOS/Chromium

`path.dirname()` gave `Contents/MacOS/`, but `icudtl.dat` and `resources.pak` are inside `Contents/Frameworks/…`, not next to the binary. So the old code incorrectly flagged every macOS installation as corrupt, deleted only the `MacOS/` subdirectory (not the full versioned dir), then failed again.

Fixes (see the sketch below):

- `isBinaryComplete`: on macOS, check for `Info.plist` and `Frameworks/` inside `Chromium.app/Contents/` instead of looking for Linux resource files next to the binary. On Linux/Windows the existing check is unchanged.
- `getVersionedDir`: resolves the full `chromium-X.Y.Z/` directory regardless of platform (4 levels up on macOS, 1 on Linux/Windows) so `removeCorruptInstallation` always deletes the entire versioned tree.
- `missingDescription`: reports the correct missing items per platform.

https://claude.ai/code/session_01WXzA3orbwE2hdk723c6MgH
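A sketch of the platform-aware checks. The paths, file names, and level counts come from the commit message; the exact helper signatures mirror the description but are otherwise assumptions:

```js
import fs from 'node:fs';
import path from 'node:path';

function isBinaryComplete(binaryPath) {
  if (process.platform === 'darwin') {
    // binaryPath = .../Chromium.app/Contents/MacOS/Chromium
    const contents = path.resolve(binaryPath, '..', '..'); // Chromium.app/Contents
    return (
      fs.existsSync(path.join(contents, 'Info.plist')) &&
      fs.existsSync(path.join(contents, 'Frameworks'))
    );
  }
  // Linux/Windows: resource files live next to the executable.
  const dir = path.dirname(binaryPath);
  return ['icudtl.dat', 'resources.pak'].every((f) => fs.existsSync(path.join(dir, f)));
}

function getVersionedDir(binaryPath) {
  // macOS: 4 levels up (MacOS -> Contents -> Chromium.app -> chromium-X.Y.Z);
  // Linux/Windows: 1 level up. Deleting this dir removes the whole install.
  const levels = process.platform === 'darwin' ? 4 : 1;
  return path.resolve(binaryPath, ...Array(levels).fill('..'));
}
```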
Summary
- Replaces `puppeteer`, `puppeteer-extra`, and `puppeteer-extra-plugin-stealth` in favour of `cloakbrowser` + `puppeteer-core`. CloakBrowser applies 49 source-level C++ fingerprint patches (canvas, WebGL, audio, WebRTC, automation signals) that cannot be detected at the JS layer, unlike stealth plugins that inject JavaScript at runtime.
- `humanize: true` is set in `launchBrowser()`, adding Bézier-curve mouse movement, natural keyboard timing with thinking pauses, and realistic scroll physics out of the box.
- Manual `userDataDir` management and the ARM64 `executablePath` override are removed; CloakBrowser ships its own binary for `x86_64` and `arm64` and handles session isolation internally. Proxy is now passed via CloakBrowser's native `proxy` option instead of a `--proxy-server` Chrome flag.
- Dockerfile: adds `fonts-noto-color-emoji` and `fonts-freefont-ttf` to the `apt-get` install step. Without these packages, canvas fingerprint hashes in minimal Linux environments don't match any real browser, triggering detection on Kasada and Akamai. Replaces `npx puppeteer browsers install chrome` with a `node ensureBinary()` call (supports both architectures). Removes the now-unnecessary `TARGETARCH` ARG and ARM64 system-Chromium branch (see the Dockerfile sketch below).
- Tests: the `__fredy_removeUserDataDir` field is removed from the mock browser object; `isConnected: () => true` is added to match the real Puppeteer Browser API.
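A sketch of the Dockerfile changes described above. The package names come from the PR; the surrounding layer structure and the inline node invocation are assumptions:

```dockerfile
# Fonts required so canvas fingerprint hashes match real browsers
# (Kasada/Akamai compare against known-good hashes).
RUN apt-get update && apt-get install -y --no-install-recommends \
      fonts-noto-color-emoji \
      fonts-freefont-ttf \
    && rm -rf /var/lib/apt/lists/*

# Replaces `npx puppeteer browsers install chrome`; downloads the CloakBrowser
# binary for the build platform (x86_64 or arm64) at image build time.
RUN node -e "import('cloakbrowser').then((m) => m.ensureBinary())"
```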
Test plan

- `yarn lint`: no errors
- `yarn test:offline`: all 76 tests pass
- Docker build verified (`docker build .`)

Generated by Claude Code