Releases: info-suvastutech/scrapy-stealth
v0.6.9a2
Fixed
- Windows browser-restart log noise (WinError 995)
Suppressed benign Windows Proactor teardown errors logged when the event loop and proxy relay are torn down during a browser restart.
The loop exception handler now ignores WinError 995 (ERROR_OPERATION_ABORTED) and WinError 64 (ERROR_NETNAME_DELETED)
alongside the existing 10054 (WSAECONNRESET); genuine errors are still surfaced. The restart itself was always succeeding; only
the spurious ERROR tracebacks are gone.
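The filtering described for the loop exception handler can be pictured roughly as follows. This is a minimal sketch, not the package's code: the names BENIGN_WINERRORS, is_benign_teardown_error, and quiet_exception_handler are hypothetical, and only the three error codes come from the notes.

```python
import asyncio

# WinError codes the release notes describe as benign during teardown.
BENIGN_WINERRORS = {995, 64, 10054}


def is_benign_teardown_error(exc):
    """Return True for OSErrors carrying one of the benign WinError codes."""
    if not isinstance(exc, OSError):
        return False
    # `winerror` is only populated on Windows; getattr keeps this portable.
    return getattr(exc, "winerror", None) in BENIGN_WINERRORS


def quiet_exception_handler(loop, context):
    """Drop benign teardown errors and hand everything else to asyncio."""
    if is_benign_teardown_error(context.get("exception")):
        return
    loop.default_exception_handler(context)


# Hypothetical wiring: install the handler on a freshly created loop.
loop = asyncio.new_event_loop()
loop.set_exception_handler(quiet_exception_handler)
loop.close()
```

Anything that is not an OSError with one of the listed codes still reaches the default handler, which matches the note that genuine errors are still surfaced.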
v0.6.9a1
Added
- Proxy bypass list (BROWSER_PROXY_BYPASS_LIST)
Route chosen domains around the proxy in the browser engine. The user-supplied list is passed to Chrome's --proxy-bypass-list
launch flag, so requests to those domains connect to the origin directly instead of through the proxy relay. Supports the full Chrome
bypass syntax: bare hostnames, wildcards (*.example.com), IP/CIDR ranges, ports, and the <local> token. Configured globally via
config/settings; only takes effect when a proxy is in use.
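As an illustration of how a bypass list could end up on the Chrome command line, here is a small sketch. It assumes the setting is a list of strings, which the notes do not state, and build_proxy_args is a hypothetical helper rather than part of the package. Chrome accepts the rules as a single semicolon-separated value.

```python
def build_proxy_args(proxy_url, bypass_list):
    """Build Chrome launch arguments for a proxy plus an optional bypass list."""
    if not proxy_url:
        # Per the notes, the bypass list only matters when a proxy is in use.
        return []
    args = [f"--proxy-server={proxy_url}"]
    if bypass_list:
        # Chrome reads the bypass rules from one semicolon-separated string.
        args.append("--proxy-bypass-list=" + ";".join(bypass_list))
    return args


# Example values only; the relay address and domains are made up.
example_args = build_proxy_args(
    "http://127.0.0.1:8899",
    ["*.example.com", "10.0.0.0/8", "<local>"],
)
```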
v0.6.8
Added
- Intelligent content wait (_smart_wait)
Automatically detects JavaScript challenges, CAPTCHAs, and anti-bot interstitial pages and waits for meaningful page content before returning a response, improving success rates on protected websites.
- Advanced challenge detection
Added detection for Cloudflare, DataDome, Akamai, Kasada, and other common anti-bot challenge pages.
- Randomized browser fingerprinting
Browser sessions now launch with realistic randomized window sizes and language configurations to reduce fingerprint consistency across sessions.
- Intelligent browser restart (BROWSER_RESTART_AFTER_BANS)
Browser instances are now restarted only after a configurable number of consecutive bans or challenge responses, replacing the previous fixed-request restart strategy.
- Static asset blocking (BROWSER_STATIC_ASSETS_BLOCK)
Optional blocking of images, fonts, stylesheets, and other non-essential assets via Chrome DevTools Protocol, reducing bandwidth usage and improving page load performance.
- StealthDependencyError
New typed exception for optional dependency loading failures, providing platform-specific guidance for resolving missing native libraries and runtime dependencies.
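A settings fragment using the two options named above might look like this. The value 5 is the default given in the v0.6.8b1 notes; treating BROWSER_STATIC_ASSETS_BLOCK as a boolean is an assumption, since the notes only say the blocking is optional.

```python
# settings.py (Scrapy project settings)

# Restart Chrome after this many consecutive banned or challenged responses.
BROWSER_RESTART_AFTER_BANS = 5

# Assumed to be a boolean toggle; check the package docs for the exact type.
BROWSER_STATIC_ASSETS_BLOCK = True
```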
Fixed
- Windows browser restart race condition
Resolved event-loop teardown and restart timing issues that could produce InvalidStateError exceptions during browser restarts.
- Windows dependency loading failures
Improved handling of wreq and curl_cffi DLL loading errors with actionable error messages instead of opaque import tracebacks.
- Deferred dependency loading
Optional browser-profile dependencies are now loaded lazily, preventing unrelated engines from failing when specific native dependencies are unavailable.
- Browser response rendering
Improved response handling to ensure successful pages are fully rendered before being returned to Scrapy.
Changed
- Browser restart strategy
Replaced the request-count-based restart mechanism with ban-aware restart logic, reducing unnecessary browser restarts during healthy crawls.
- Test suite refactoring
Simplified browser-related test cases and reduced mock complexity for improved maintainability.
Performance
- Reduced bandwidth consumption
Static asset blocking can significantly decrease network usage and page load times when visual assets are not required.
- Improved browser stability
Smarter restart behavior reduces browser churn while maintaining long-running crawl reliability.
v0.6.8a2
Added
- StealthDependencyError: typed exception for compiled-dependency failures
New exception class in exceptions.py that inherits from both StealthException and
ImportError, fitting naturally into both the package exception hierarchy and standard
except ImportError handlers.
Raised whenever a compiled optional dependency (wreq, curl_cffi) fails to load,
typically because a required native DLL or shared library could not be found.
The exception provides a platform-aware, actionable message at raise time:
- Windows: instructs the user to install both x64 and x86 Visual C++ Redistributables
(2015–2022) with direct download links.
- Linux: suggests the appropriate apt-get / yum packages for missing system
libraries (libssl, libcurl).
StealthDependencyError is exported from the top-level package and added to __all__,
making it catchable in user code alongside the other stealth exceptions.
Fixed
- engines/basic.py: "ImportError: DLL load failed while importing wreq" on fresh Windows
The bare module-level imports from wreq.blocking import Client and from wreq.proxy import Proxy
crashed immediately on machines without the Visual C++ Redistributable installed,
surfacing as an opaque DLL load failed traceback deep inside Scrapy's middleware loader.
Both imports are now wrapped in try/except ImportError and delegate to
StealthDependencyError.check("wreq", exc) for a clear, actionable error message.
- engines/turbo.py: same DLL failure for curl_cffi on fresh Windows
The imports from curl_cffi import CurlHttpVersion and from curl_cffi.requests import Session suffer
the same failure path as wreq when the VCRT is absent.
Both imports are now guarded with StealthDependencyError.check("curl_cffi", exc).
- utils/profiles.py: wreq.emulation crash at import time propagated silently
The statement from wreq.emulation import Emulation, Profile was a module-level import, meaning the
entire profiles module, and by extension every engine that imports it, failed to load
on VCRT-missing machines, producing the same deep DLL load failed traceback.
The import is now guarded with a _WREQ_AVAILABLE flag; Emulation and Profile fall
back to None so the module loads cleanly. The private _require_wreq() helper raises
StealthDependencyError at the point of actual use (inside _resolve_basic), not at
import time, keeping the turbo and browser drivers unaffected on machines where
wreq is broken but curl_cffi loads fine.
v0.6.8b1
Added
- Intelligent browser restart (BROWSER_RESTART_AFTER_BANS)
The browser engine now restarts Chrome (fresh fingerprint, cookies, and CDP session) only
when it actually needs to: after BROWSER_RESTART_AFTER_BANS (default 5) consecutive
responses are classified as banned or challenged by AntiBotDetector. A single clean
response resets the streak to zero, so a browser sailing through cleanly is never restarted,
no matter how many requests it has served. Replaces the previous fixed-count
BROWSER_RESTART_EVERY restart, which fired blindly every N requests regardless of whether
anything was actually going wrong.
Implemented via a small BanStreakTracker helper in utils/browser.py.
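The streak logic can be sketched in a few lines. The class name comes from the notes, but the record method, its return value, and the reset after a restart is signalled are assumptions made for illustration.

```python
class BanStreakTracker:
    """Count consecutive banned responses and signal when to restart."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.streak = 0

    def record(self, banned):
        """Record one response; return True when a restart is due."""
        if not banned:
            # A single clean response resets the streak to zero.
            self.streak = 0
            return False
        self.streak += 1
        if self.streak >= self.threshold:
            # Assumed: the restarted browser begins with a clean streak.
            self.streak = 0
            return True
        return False


tracker = BanStreakTracker(threshold=3)
outcomes = [tracker.record(b) for b in (True, True, False, True, True, True)]
# The clean response in the middle delays the restart until the sixth response.
```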
Fixed
- Browser engine: restart/teardown race on Windows
_reset_browser() now waits for the old event loop's thread to fully stop (_stop_loop())
before starting a new loop and thread. Previously the old ProactorEventLoop could keep
polling its selector after the replacement loop was already running, surfacing as an
InvalidStateError crash or unretrieved OSError task exceptions on Windows.
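The ordering that matters here is stop, then join, then start the replacement. A minimal sketch of that sequence, using a hypothetical LoopThread wrapper rather than the engine's real _stop_loop() and _reset_browser():

```python
import asyncio
import threading


class LoopThread:
    """Run an asyncio event loop in a dedicated background thread."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        asyncio.set_event_loop(self.loop)
        self.loop.run_forever()
        # Close the loop in its own thread once run_forever() returns.
        self.loop.close()

    def stop(self, timeout=5.0):
        """Ask the loop to stop and wait until its thread has exited."""
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.thread.join(timeout)
        return not self.thread.is_alive()


old = LoopThread()
# Only create the replacement once the old thread has fully stopped.
old_stopped = old.stop()
new = LoopThread()
new_stopped = new.stop()
```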
v0.6.8a1
Added
- Intelligent content wait (_smart_wait)
The browser engine now detects if a page is a JS challenge, CAPTCHA, or script-heavy stub
(e.g., Cloudflare, DataDome) and automatically waits for the real content to populate.
It uses a heuristic based on body length and tag structure to decide whether to wait,
significantly improving success rates on protected sites while maintaining speed on
normal pages.
- JS challenge detection (_JS_IS_CHALLENGE)
A JavaScript-based detector that identifies common anti-bot platforms
(Cloudflare, DataDome, Akamai, Kasada) and challenge states (Ray ID, "Checking your browser")
by scanning the DOM and window title.
- Randomized browser fingerprinting
Chrome is now launched with randomized --window-size and --lang arguments selected from
a curated list of common configurations. As a result, browser sessions (and
proxy-rotated requests) present varied, realistic identities to anti-bot systems rather than one fixed fingerprint.
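Picking launch arguments from a fixed pool could look like the sketch below. The window sizes and languages are example values, not the package's curated list, and random_launch_args is a hypothetical helper.

```python
import random

# Example pools only; the package's curated list is not given in the notes.
WINDOW_SIZES = [(1366, 768), (1440, 900), (1536, 864), (1920, 1080)]
LANGUAGES = ["en-US", "en-GB", "de-DE", "fr-FR"]


def random_launch_args(rng=random):
    """Return --window-size and --lang arguments chosen at random."""
    width, height = rng.choice(WINDOW_SIZES)
    language = rng.choice(LANGUAGES)
    return [f"--window-size={width},{height}", f"--lang={language}"]


# A seeded generator makes the choice reproducible in tests.
launch_args = random_launch_args(random.Random(0))
```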
Changed
- Refactored test cases
Simplified fetch mocks in tests by removing the unnecessary proxy argument and
streamlining assertions.
Fixed
- Browser engine: improved response handling
Integrated _smart_wait into the fetch pipeline, ensuring 2xx responses are fully
rendered before returning.
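The notes do not spell out the body-length and tag-structure heuristic behind _smart_wait. As a rough Python analogue only, with an assumed threshold and tag list and a hypothetical function name, a stub check might compare visible text against script count:

```python
import re


def looks_like_stub(html, min_text_chars=200):
    """Guess whether a page is a challenge or stub that still needs to render."""
    scripts = len(re.findall(r"<script\b", html, flags=re.I))
    content_tags = len(
        re.findall(r"<(?:p|article|main|h1|h2|li|table)\b", html, flags=re.I)
    )
    # Drop script/style bodies, then all remaining tags, to estimate visible text.
    text = re.sub(r"<(script|style)\b.*?</\1>", " ", html, flags=re.I | re.S)
    text = re.sub(r"<[^>]+>", " ", text)
    visible = len(" ".join(text.split()))
    # Little visible text and at least as many scripts as content tags.
    return visible < min_text_chars and scripts >= content_tags


stub_page = "<html><body><script>runChallenge()</script></body></html>"
real_page = "<html><body><main><p>" + "word " * 100 + "</p></main></body></html>"
```

A caller would keep waiting while the check returns True and return the response once it flips to False or a timeout expires.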
v0.6.7
Changed
- Browser engine: single persistent browser for both proxy and non-proxy modes
Previously, proxy mode spawned a fresh Chrome process for every request and tore it down
immediately after, making concurrent proxy crawls extremely expensive. The engine now runs
one persistent browser regardless of whether a proxy is configured.
A local auth-injecting relay (_start_proxy_relay) is started once at browser initialisation
and the browser is launched with --proxy-server=http://127.0.0.1:<relay_port> baked in.
Each request opens an isolated tab (via new_tab=True) and closes it when done, identical
to non-proxy mode. Proxy credentials are injected at the TCP level by the relay and never
touch the browser.
Impact: one Chrome process per spider instead of one per request; dramatically lower memory
and startup overhead on proxy-enabled crawls.
- Browser engine: splash screen loaded once at startup, not per request
The project logo / chrome://welcome splash was previously loaded in every request tab as a
warm-up step before navigating to the real target. It is now loaded once on browser.main_tab
immediately after the browser starts (_start()), warming up the renderer, stealth patches,
and (when proxied) the relay tunnel before any spider request arrives. Request tabs navigate
directly to the target URL with no splash overhead.
- Browser engine: early return on non-2xx responses
_do_fetch now reads the HTTP status code before waiting for page content. Responses in the
2xx range receive the full _wait_for_content() + settle delay as before. Non-2xx responses
(4xx, 5xx) skip the content wait and return immediately with whatever the browser has already
rendered, avoiding up to 10 seconds of unnecessary polling on error pages.
Added
- _wait_for_status(page, timeout=8.0) utility
The Navigation Timing API value (performance.getEntriesByType('navigation')[0].responseStatus)
is populated asynchronously by Chrome and can return 0 immediately after page.wait(),
especially through a proxy or after redirects. The new helper polls every 250 ms until a
non-zero status is available, then returns it. If the entry never populates within 8 seconds
(rare SPA edge case) it falls back to 200, the safest assumption when the page loaded but
left no timing entry.
The _JS_STATUS default changed from ?? 200 to ?? 0 to expose the
"not ready" state to the poller rather than masking it.
Fixed
- Browser engine: ConnectionResetError / BrokenPipeError log noise on Windows
On Windows with Python 3.13+, closing a Chrome tab or stopping the browser triggers
_ProactorBasePipeTransport._call_connection_lost(), which raises
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host.
This is harmless (the connection is already gone), but asyncio logged it as
an unhandled exception on every tab close. The loop exception handler now suppresses
ConnectionResetError, BrokenPipeError, and raw OSError with winerror == 10054
(the unwrapped variant seen on some Python 3.14 builds).
- Browser engine: relay and tab-semaphore torn down correctly on browser restart
_reset_browser() now closes the proxy relay server and clears _relay_server /
_relay_port before spinning up a new event loop, so the restarted browser gets a fresh
relay rather than pointing at a dead port.
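For the _wait_for_status helper added in this release, the polling loop might look like the sketch below. It takes a generic async read_status callable instead of a browser page so it runs on its own; that parameter, the function name without the leading underscore, and the fallback argument are assumptions. The 250 ms interval, 8 second timeout, and fallback to 200 come from the notes.

```python
import asyncio


async def wait_for_status(read_status, timeout=8.0, interval=0.25, fallback=200):
    """Poll read_status() until it returns non-zero or the timeout elapses."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = await read_status()
        if status:
            # A non-zero value means the timing entry has been populated.
            return status
        if loop.time() >= deadline:
            # The entry never appeared; assume the page loaded normally.
            return fallback
        await asyncio.sleep(interval)


async def _demo():
    # Simulated reads: "not ready" twice, then a real status code.
    values = iter([0, 0, 404])

    async def fake_read():
        return next(values)

    return await wait_for_status(fake_read, timeout=1.0, interval=0.01)


demo_status = asyncio.run(_demo())
```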
v0.6.6
Added
- BROWSER_EXECUTABLE_PATH configuration option
New setting allows specifying a custom Chrome/Chromium/Brave binary path for the browser engine.
Set via config.BROWSER_EXECUTABLE_PATH or BROWSER_EXECUTABLE_PATH in Scrapy settings.
Useful when Chrome is installed in a non-standard location or when using alternative browsers like Brave.
Proper error messages guide users to set the config if the binary is not found at the configured path.
- Unified logger output for browser engine
Replaced direct console module usage with logger throughout the browser engine for consistent,
structured logging that integrates with Scrapy's logging system. All browser startup messages,
restarts, and warnings now appear in the standard [scrapy-stealth] log format.
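The path check behind BROWSER_EXECUTABLE_PATH is not shown in the notes. One plausible shape, with a hypothetical resolve_browser_binary helper and placeholder error wording, is:

```python
import os


def resolve_browser_binary(configured_path):
    """Validate a configured browser path; None means auto-detect."""
    if configured_path is None:
        return None
    if not os.path.isfile(configured_path):
        # Placeholder wording; the package's own message may differ.
        raise FileNotFoundError(
            f"No browser binary at {configured_path!r}; "
            "set BROWSER_EXECUTABLE_PATH to a valid Chrome/Chromium/Brave binary."
        )
    return configured_path
```

In a Scrapy project the setting itself would be a single line in settings.py, for example BROWSER_EXECUTABLE_PATH = "/usr/bin/brave-browser", where the path is only an example.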
Changed
- Browser engine: simplified stealth approach for improved detection evasion
The BrowserEngine has been streamlined to focus on real Chrome behavior without aggressive JavaScript injection.
Removed the _STEALTH_JS injection (which masked CDP fingerprints and spoofed Windows platform attributes)
because anti-bot systems increasingly detect the injections themselves rather than the CDP presence.
The engine now:
- Removes all custom user-agent forcing (uses Chrome's default)
- Eliminates JavaScript navigator property overrides (webdriver, platform, plugins, languages, WebGL, UAv4)
- Simplifies browser arguments to essential flags only (disables only the AutomationControlled blink feature)
- Maintains Xvfb support for non-headless Chrome on Linux without $DISPLAY
- Keeps persistent browser reuse for performance
- Works identically in headless and non-headless modes
Result: headless=False with a real display/Xvfb now evades detection more effectively because
the browser appears "normal" to anti-bot systems rather than heavily modified.
Fixed
- Browser engine: bans when using headless=False with injection-based detection
Anti-bot systems like Akamai specifically scan for the telltale patterns in commonly-used CDP stealth scripts.
Removing the injection eliminates a major detection surface while maintaining the evasion benefits of running
a real browser process.
Optimized
- Browser engine: code duplication eliminated
Extracted a _start_browser() helper method that centralizes browser startup and BROWSER_EXECUTABLE_PATH
error handling. _start() (persistent browser) and _do_fetch() (per-proxy browser) now call the same
code path, reducing maintenance burden and ensuring consistent behavior across non-proxy and proxy modes.
v0.6.6a2
Added
- Xvfb virtual display support for Docker / Zyte
On Linux without a $DISPLAY, the browser engine now automatically starts
Xvfb :99 before launching Chrome. This lets Chrome run in non-headless mode
against a virtual framebuffer, identical to a real desktop session, which is
significantly harder for anti-bot systems to detect than --headless=new.
Falls back to headless silently if Xvfb is not installed.
Requires apt-get install -y xvfb in your Docker image.
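The decision described above has three outcomes: use the existing display, start Xvfb, or fall back to headless. A small sketch of that branching, with a hypothetical choose_display_mode function and mode names invented for illustration:

```python
import os
import shutil
import sys


def choose_display_mode(platform, display, xvfb_path):
    """Return 'native', 'xvfb', or 'headless' for a non-headless request."""
    if not platform.startswith("linux") or display:
        # A display already exists, or the platform does not use X11.
        return "native"
    if xvfb_path:
        # Xvfb is installed: start it (e.g. `Xvfb :99`) before launching Chrome.
        return "xvfb"
    # No display and no Xvfb: fall back to headless.
    return "headless"


# Evaluate against the current machine.
current_mode = choose_display_mode(
    sys.platform, os.environ.get("DISPLAY"), shutil.which("Xvfb")
)
```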
v0.6.6a1
Added
- BROWSER_NO_SANDBOX config option
New BROWSER_NO_SANDBOX: bool | None setting controls Chrome's sandbox mode.
Defaults to None (auto-detect): sandbox is disabled automatically when the process runs
as root on Linux (e.g. Zyte, Docker). Set True to force no-sandbox, False to keep
the sandbox even as root. Configurable via settings.py (BROWSER_NO_SANDBOX = True) or
the config object.
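The three-state setting resolves to a plain yes/no at launch time. A sketch of that resolution, where should_disable_sandbox and its platform and is_root parameters are hypothetical and exist mainly to make the logic testable:

```python
import os
import sys


def should_disable_sandbox(setting, platform=sys.platform, is_root=None):
    """Resolve BROWSER_NO_SANDBOX (True, False, or None) to a boolean."""
    if setting is not None:
        # An explicit True or False always wins over auto-detection.
        return bool(setting)
    if is_root is None:
        # os.geteuid is unavailable on Windows, hence the hasattr guard.
        is_root = hasattr(os, "geteuid") and os.geteuid() == 0
    # Auto-detect: disable the sandbox only for root on Linux.
    return platform.startswith("linux") and is_root
```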
Fixed
- Browser engine fails on Docker (running as root)
Chrome refuses to start without --no-sandbox when the process is root. The engine now
auto-detects root and adds both --no-sandbox and --disable-dev-shm-usage (required
in containers with limited /dev/shm).
- headless=False crashes in display-less environments
When no $DISPLAY is set on Linux (Docker, CI), the engine now silently overrides
headless=False to headless=True, preventing Chrome from crashing on startup.