Releases: info-suvastutech/scrapy-stealth

v0.6.9a2

18 Jun 13:40

Pre-release

Fixed

  • Windows browser-restart log noise (WinError 995)
    Suppressed benign Windows Proactor teardown errors logged when the event loop and proxy relay are torn down during a browser restart.
    The loop exception handler now ignores WinError 995 (ERROR_OPERATION_ABORTED) and WinError 64 (ERROR_NETNAME_DELETED)
    alongside the existing 10054 (WSAECONNRESET); genuine errors are still surfaced. The restart itself always succeeded; this
    change only removes the spurious ERROR tracebacks.
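
A minimal sketch of such a filter, assuming it keys on OSError.winerror. The names BENIGN_WINERRORS, is_benign_teardown_error, quiet_exception_handler, and install are hypothetical and are not taken from the scrapy-stealth source; only the three error codes come from the note above.

```python
import asyncio

# Windows error codes named in the release note above.
BENIGN_WINERRORS = {
    64,     # ERROR_NETNAME_DELETED
    995,    # ERROR_OPERATION_ABORTED
    10054,  # WSAECONNRESET
}


def is_benign_teardown_error(exc):
    """Return True for an OSError whose winerror marks a harmless teardown."""
    # getattr keeps this safe on non-Windows builds, where winerror is absent.
    return isinstance(exc, OSError) and getattr(exc, "winerror", None) in BENIGN_WINERRORS


def quiet_exception_handler(loop, context):
    """Hypothetical handler: drop benign teardown errors, surface the rest."""
    if is_benign_teardown_error(context.get("exception")):
        return
    loop.default_exception_handler(context)


def install(loop):
    """Attach the handler to an event loop."""
    loop.set_exception_handler(quiet_exception_handler)
```

Anything that is not one of the listed codes still reaches asyncio's default handler, which matches the "genuine errors are still surfaced" behaviour described above.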

v0.6.9a1

18 Jun 12:12

Pre-release

Added

  • Proxy bypass list (BROWSER_PROXY_BYPASS_LIST)
    Route chosen domains around the proxy in the browser engine. The user-supplied list is passed to Chrome's --proxy-bypass-list
    launch flag, so requests to those domains connect to the origin directly instead of through the proxy relay. Supports the full Chrome
    bypass syntax — bare hostnames, wildcards (*.example.com), IP/CIDR ranges, ports, and the <local> token. Configured globally via
    config/settings; only takes effect when a proxy is in use.
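
As a rough illustration, the bypass list could be turned into launch arguments as follows. This is a sketch that assumes the setting holds a list of rule strings (the actual type of BROWSER_PROXY_BYPASS_LIST is not stated here); proxy_launch_args is a hypothetical helper. --proxy-server and --proxy-bypass-list are Chrome flags, and Chrome separates bypass rules with semicolons.

```python
def proxy_launch_args(proxy_server, bypass_entries=()):
    """Build Chrome proxy flags; bypass rules are ignored when no proxy is set."""
    if not proxy_server:
        # Mirrors the note above: the bypass list only matters with a proxy.
        return []
    args = [f"--proxy-server={proxy_server}"]
    # Drop empty entries and surrounding whitespace before joining.
    rules = [entry.strip() for entry in bypass_entries if entry and entry.strip()]
    if rules:
        args.append("--proxy-bypass-list=" + ";".join(rules))
    return args
```

For example, passing ["*.example.com", "10.0.0.0/8", "<local>"] together with a relay address yields one --proxy-server flag and one --proxy-bypass-list flag.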

v0.6.8

18 Jun 07:58

Added

  • Intelligent content wait (_smart_wait)
    Automatically detects JavaScript challenges, CAPTCHAs, and anti-bot interstitial pages and waits for meaningful page content before returning a response, improving success rates on protected websites.
  • Advanced challenge detection
    Added comprehensive detection for Cloudflare, DataDome, Akamai, Kasada, and other common anti-bot challenge pages.
  • Randomized browser fingerprinting
    Browser sessions now launch with realistic randomized window sizes and language configurations to reduce fingerprint consistency across sessions.
  • Intelligent browser restart (BROWSER_RESTART_AFTER_BANS)
    Browser instances are now restarted only after a configurable number of consecutive bans or challenge responses, replacing the previous fixed-request restart strategy.
  • Static asset blocking (BROWSER_STATIC_ASSETS_BLOCK)
    Optional blocking of images, fonts, stylesheets, and other non-essential assets via Chrome DevTools Protocol, reducing bandwidth usage and improving page load performance.
  • StealthDependencyError
    New typed exception for optional dependency loading failures, providing platform-specific guidance for resolving missing native libraries and runtime dependencies.

Fixed

  • Windows browser restart race condition
    Resolved event-loop teardown and restart timing issues that could produce InvalidStateError exceptions during browser restarts.
  • Windows dependency loading failures
    Improved handling of wreq and curl_cffi DLL loading errors with actionable error messages instead of opaque import tracebacks.
  • Deferred dependency loading
    Optional browser-profile dependencies are now loaded lazily, preventing unrelated engines from failing when specific native dependencies are unavailable.
  • Browser response rendering
    Improved response handling to ensure successful pages are fully rendered before being returned to Scrapy.

Changed

  • Browser restart strategy
    Replaced the request-count-based restart mechanism with ban-aware restart logic, reducing unnecessary browser restarts during healthy crawls.
  • Test suite refactoring
    Simplified browser-related test cases and reduced mock complexity for improved maintainability.

Performance

  • Reduced bandwidth consumption
    Static asset blocking can significantly decrease network usage and page load times when visual assets are not required.
  • Improved browser stability
    Smarter restart behavior reduces browser churn while maintaining long-running crawl reliability.
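
For the static asset blocking mentioned above, one plausible shape of the Chrome DevTools Protocol traffic is sketched below. It assumes blocking is done with the Network.setBlockedURLs command; the pattern list and the blocking_commands helper are illustrative, not the package's actual configuration, and the accepted parameters may differ between Chrome versions.

```python
import json

# Illustrative wildcard patterns for images, fonts, and stylesheets.
BLOCKED_PATTERNS = [
    "*.png", "*.jpg", "*.jpeg", "*.gif", "*.webp", "*.svg",
    "*.woff", "*.woff2", "*.ttf",
    "*.css",
]


def blocking_commands(patterns=BLOCKED_PATTERNS, first_id=1):
    """Return the CDP messages that enable the Network domain and block patterns."""
    return [
        {"id": first_id, "method": "Network.enable"},
        {
            "id": first_id + 1,
            "method": "Network.setBlockedURLs",
            "params": {"urls": list(patterns)},
        },
    ]


if __name__ == "__main__":
    # Each message would be sent as JSON over the tab's CDP websocket.
    for message in blocking_commands():
        print(json.dumps(message))
```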

v0.6.8a2

18 Jun 07:33
b957631

Pre-release

Added

  • StealthDependencyError — typed exception for compiled-dependency failures
    New exception class in exceptions.py that inherits from both StealthException and
    ImportError, fitting naturally into both the package exception hierarchy and standard
    except ImportError handlers.
    Raised whenever a compiled optional dependency (wreq, curl_cffi) fails to load —
    typically because a required native DLL or shared library could not be found.

    The exception provides a platform-aware, actionable message at raise time:

    • Windows — instructs the user to install both x64 and x86 Visual C++ Redistributables
      (2015–2022) with direct download links.
    • Linux — suggests the appropriate apt-get / yum packages for missing system
      libraries (libssl, libcurl).

    StealthDependencyError is exported from the top-level package and added to __all__,
    making it catchable in user code alongside the other stealth exceptions.
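
The dual inheritance can be sketched as follows. StealthException here is a stand-in for the package's base class, and the constructor signature, the _hint helper, and the message wording are assumptions made for illustration; only the two base classes and the platform-specific guidance come from the description above.

```python
import sys


class StealthException(Exception):
    """Stand-in for the package's base exception (assumed shape)."""


def _hint(platform):
    """Return platform-specific guidance for a missing native library."""
    if platform.startswith("win"):
        return "Install both the x64 and x86 Visual C++ Redistributables (2015-2022)."
    if platform.startswith("linux"):
        return "Install the missing system libraries (for example libssl, libcurl) with apt-get or yum."
    return "Check that the package's native libraries are installed."


class StealthDependencyError(StealthException, ImportError):
    """Raised when a compiled optional dependency fails to load."""

    def __init__(self, package, platform=None):
        platform = sys.platform if platform is None else platform
        # ImportError accepts name=, so err.name identifies the failing package.
        super().__init__(f"Could not load '{package}'. {_hint(platform)}", name=package)
```

Because ImportError is one of the bases, existing except ImportError blocks in user code catch the new exception without modification.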

Fixed

  • engines/basic.py: "ImportError: DLL load failed while importing wreq" on fresh Windows
    The bare from wreq.blocking import Client and from wreq.proxy import Proxy module-level
    imports crashed immediately on machines without the Visual C++ Redistributable installed,
    surfacing as an opaque DLL load failed traceback deep inside Scrapy's middleware loader.
    Both imports are now wrapped in try/except ImportError and delegate to
    StealthDependencyError.check("wreq", exc) for a clear, actionable error message.

  • engines/turbo.py — same DLL failure for curl_cffi on fresh Windows
    from curl_cffi import CurlHttpVersion and from curl_cffi.requests import Session follow
    the same failure path as wreq when the Visual C++ runtime (VCRT) is absent.
    Both imports are now guarded with StealthDependencyError.check("curl_cffi", exc).

  • utils/profiles.py: wreq.emulation crash at import time propagated silently
    from wreq.emulation import Emulation, Profile was a module-level import, meaning the
    entire profiles module — and by extension every engine that imports it — failed to load
    on VCRT-missing machines, producing the same deep DLL load failed traceback.
    The import is now guarded with a _WREQ_AVAILABLE flag; Emulation and Profile fall
    back to None so the module loads cleanly. The private _require_wreq() helper raises
    StealthDependencyError at the point of actual use (inside _resolve_basic), not at
    import time, keeping the turbo and browser drivers unaffected on machines where
    wreq is broken but curl_cffi loads fine.

v0.6.8b1

16 Jun 07:38

Pre-release

Added

  • Intelligent browser restart (BROWSER_RESTART_AFTER_BANS)
    The browser engine now restarts Chrome (fresh fingerprint, cookies, and CDP session) only
    when it actually needs to — after BROWSER_RESTART_AFTER_BANS (default 5) consecutive
    responses are classified as banned or challenged by AntiBotDetector. A single clean
    response resets the streak to zero, so a browser sailing through cleanly is never restarted,
    no matter how many requests it has served. Replaces the previous fixed-count
    BROWSER_RESTART_EVERY restart, which fired blindly every N requests regardless of whether
    anything was actually going wrong.
    Implemented via a small BanStreakTracker helper in utils/browser.py.
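
A minimal sketch of what such a tracker might look like. The class name comes from the note above, but the record method, its return value, and the reset-after-trigger behaviour are assumptions rather than the package's actual interface.

```python
class BanStreakTracker:
    """Counts consecutive banned responses and signals when to restart."""

    def __init__(self, threshold=5):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.streak = 0

    def record(self, banned):
        """Record one response; return True when a restart is due."""
        if not banned:
            # A single clean response resets the streak to zero.
            self.streak = 0
            return False
        self.streak += 1
        if self.streak >= self.threshold:
            # Start counting afresh for the restarted browser.
            self.streak = 0
            return True
        return False
```

With a threshold of 3, the sequence banned, banned, clean, banned, banned, banned triggers a restart only on the final response, because the clean response in the middle resets the count.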

Fixed

  • Browser engine — restart/teardown race on Windows
    _reset_browser() now waits for the old event loop's thread to fully stop (_stop_loop())
    before starting a new loop and thread. Previously the old ProactorEventLoop could keep
    polling its selector after the replacement loop was already running, surfacing as an
    InvalidStateError crash or unretrieved OSError task exceptions on Windows.
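
The ordering fix can be illustrated with a small stand-alone sketch: stop the old loop, join its thread, and only then create the replacement. LoopThread and restart are hypothetical names; the package's _reset_browser() and _stop_loop() are not reproduced here.

```python
import asyncio
import threading


class LoopThread:
    """Runs an asyncio event loop on a dedicated daemon thread."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self.loop.run_forever, daemon=True)

    def start(self):
        self.thread.start()

    def stop(self, timeout=5.0):
        # loop.stop() must be scheduled onto the loop's own thread.
        self.loop.call_soon_threadsafe(self.loop.stop)
        # Block until the thread has really exited before continuing.
        self.thread.join(timeout)
        if not self.thread.is_alive():
            self.loop.close()


def restart(old):
    """Tear down the old loop completely, then start a fresh one."""
    old.stop()
    new = LoopThread()
    new.start()
    return new
```

The join is the important step: without it the old loop may still be polling while the new one is already running.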

v0.6.8a1

12 Jun 09:39

Pre-release

Added

  • Intelligent content wait (_smart_wait)
    The browser engine now detects if a page is a JS challenge, CAPTCHA, or script-heavy stub
    (e.g., Cloudflare, DataDome) and automatically waits for the real content to populate.
    It uses a heuristic based on body length and tag structure to decide whether to wait,
    significantly improving success rates on protected sites while maintaining speed on
    normal pages.
  • JS challenge detection (_JS_IS_CHALLENGE)
    A comprehensive JavaScript-based detector that identifies common anti-bot platforms
    (Cloudflare, DataDome, Akamai, Kasada) and challenge states (Ray ID, "Checking your browser")
    by scanning the DOM and window title.
  • Randomized browser fingerprinting
    Chrome is now launched with randomized --window-size and --lang arguments selected from
    a curated list of common configurations. This ensures that every browser session (and
    every proxy-rotated request) presents a unique, realistic identity to anti-bot systems.
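
Randomized launch arguments of this kind might be generated as below. The window sizes and languages listed are illustrative placeholders, not the package's curated list, and random_launch_args is a hypothetical helper; --window-size and --lang are the Chrome flags named above.

```python
import random

# Placeholder values; the real curated list is not given in these notes.
WINDOW_SIZES = [(1366, 768), (1440, 900), (1536, 864), (1920, 1080)]
LANGUAGES = ["en-US", "en-GB", "de-DE", "fr-FR"]


def random_launch_args(rng=random):
    """Pick one window size and one language for a new browser session."""
    width, height = rng.choice(WINDOW_SIZES)
    language = rng.choice(LANGUAGES)
    return [f"--window-size={width},{height}", f"--lang={language}"]
```

Passing a seeded random.Random instance as rng makes the choice reproducible in tests.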

Changed

  • Refactored test cases
    Simplified fetch mocks in tests by removing the unnecessary proxy argument and
    streamlining assertions.

Fixed

  • Browser engine — improved response handling
    Integrated _smart_wait into the fetch pipeline, ensuring 2xx responses are fully
    rendered before returning.
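
To make the wait decision concrete, here is a rough sketch of a body-length and tag-structure heuristic. The threshold, the list of content tags, and the should_wait function are assumptions chosen for illustration; the real _smart_wait logic runs inside the browser and is not shown in these notes. The two marker strings are the challenge states mentioned above.

```python
from html.parser import HTMLParser

CHALLENGE_MARKERS = ("checking your browser", "ray id")
CONTENT_TAGS = {"p", "article", "section", "main", "li", "h1", "h2", "h3", "table"}


class _PageStats(HTMLParser):
    """Counts visible text, script tags, and content-bearing tags."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.script_tags = 0
        self.content_tags = 0
        self._skip_depth = 0  # inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            if tag == "script":
                self.script_tags += 1
        elif tag in CONTENT_TAGS:
            self.content_tags += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def should_wait(html, min_text_chars=200):
    """Return True when the page looks like a challenge or an unrendered stub."""
    lowered = html.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return True
    stats = _PageStats()
    stats.feed(html)
    # Thin visible text, or more scripts than content elements.
    return stats.text_chars < min_text_chars or stats.script_tags > stats.content_tags
```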

v0.6.7

10 Jun 11:45

Changed

  • Browser engine — single persistent browser for both proxy and non-proxy modes
    Previously, proxy mode spawned a fresh Chrome process for every request and tore it down
    immediately after, making concurrent proxy crawls extremely expensive. The engine now runs
    one persistent browser regardless of whether a proxy is configured.
    A local auth-injecting relay (_start_proxy_relay) is started once at browser initialisation
    and the browser is launched with --proxy-server=http://127.0.0.1:<relay_port> baked in.
    Each request opens an isolated tab (via new_tab=True) and closes it when done — identical
    to non-proxy mode. Proxy credentials are injected at the TCP level by the relay and never
    touch the browser.
    Impact: one Chrome process per spider instead of one per request; dramatically lower memory
    and startup overhead on proxy-enabled crawls.
  • Browser engine — splash screen loaded once at startup, not per request
    The project logo / chrome://welcome splash was previously loaded in every request tab as a
    warm-up step before navigating to the real target. It is now loaded once on browser.main_tab
    immediately after the browser starts (_start()), warming up the renderer, stealth patches,
    and (when proxied) the relay tunnel — before any spider request arrives. Request tabs navigate
    directly to the target URL with no splash overhead.
  • Browser engine — early return on non-2xx responses
    _do_fetch now reads the HTTP status code before waiting for page content. Responses in the
    2xx range receive the full _wait_for_content() + settle delay as before. Non-2xx responses
    (4xx, 5xx) skip the content wait and return immediately with whatever the browser has already
    rendered, avoiding up to 10 seconds of unnecessary polling on error pages.

Added

  • _wait_for_status(page, timeout=8.0) utility
    The Navigation Timing API (performance.getEntriesByType('navigation')[0].responseStatus)
    is written asynchronously by Chrome and can return 0 immediately after page.wait(),
    especially through a proxy or after redirects. The new helper polls every 250 ms until a
    non-zero status is available, then returns it. If the entry never populates within 8 seconds
    (rare SPA edge case) it falls back to 200 — the safest assumption when the page loaded but
    left no timing entry. _JS_STATUS default changed from ?? 200 to ?? 0 to expose the
    "not ready" state to the poller rather than masking it.

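The polling behaviour of the status helper described above can be sketched without a browser. Here read_status is a hypothetical async callable standing in for evaluating the Navigation Timing expression in the page; the defaults (8 seconds, 250 ms, fallback 200) are the values given above, and the function name and signature are simplified for illustration.

```python
import asyncio


async def wait_for_status(read_status, timeout=8.0, interval=0.25, fallback=200):
    """Poll read_status() until it returns a non-zero HTTP status."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = await read_status()
        if status:
            # A non-zero value means Chrome has written the timing entry.
            return status
        if loop.time() + interval > deadline:
            # The entry never populated: assume the page loaded normally.
            return fallback
        await asyncio.sleep(interval)
```

Returning 0 for "not ready" is what lets the poller distinguish a missing entry from a genuine 200.
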
Fixed

  • Browser engine — ConnectionResetError / BrokenPipeError log noise on Windows
    On Windows with Python 3.13+, closing a Chrome tab or stopping the browser triggers
    _ProactorBasePipeTransport._call_connection_lost(), which raises
    ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the
    remote host. This is harmless (the connection is already gone), but asyncio logged it as
    an unhandled exception on every tab close. The loop exception handler now suppresses
    ConnectionResetError, BrokenPipeError, and raw OSError with winerror == 10054
    (the unwrapped variant seen on some Python 3.14 builds).
  • Browser engine — relay and tab-semaphore torn down correctly on browser restart
    _reset_browser() now closes the proxy relay server and clears _relay_server /
    _relay_port before spinning up a new event loop, so the restarted browser gets a fresh
    relay rather than pointing at a dead port.

v0.6.6

08 Jun 13:04

Added

  • BROWSER_EXECUTABLE_PATH configuration option
    New setting allows specifying a custom Chrome/Chromium/Brave binary path for the browser engine.
    Set via config.BROWSER_EXECUTABLE_PATH or BROWSER_EXECUTABLE_PATH in Scrapy settings.
    Useful when Chrome is installed in a non-standard location or when using alternative browsers like Brave.
    If the binary is not found at the configured path, the error message points users to this setting.

  • Unified logger output for browser engine
    Replaced direct console module usage with logger throughout the browser engine for consistent,
    structured logging that integrates with Scrapy's logging system. All browser startup messages,
    restarts, and warnings now appear in the standard [scrapy-stealth] log format.

Changed

  • Browser engine — simplified stealth approach for improved detection evasion
    The BrowserEngine has been streamlined to focus on real Chrome behavior without aggressive JavaScript injection.
    Removed the _STEALTH_JS injection (which masked CDP fingerprints and spoofed Windows platform attributes)
    because anti-bot systems increasingly detect the injections themselves rather than the CDP presence.

    The engine now:

    • Removes all custom user-agent forcing (uses Chrome's default)
    • Eliminates JavaScript navigator property overrides (webdriver, platform, plugins, languages, WebGL, UAv4)
    • Simplifies browser arguments to essential flags only (disables only AutomationControlled blink feature)
    • Maintains Xvfb support for non-headless Chrome on Linux without $DISPLAY
    • Keeps persistent browser reuse for performance
    • Works identically in headless and non-headless modes

    Result: headless=False with real display/Xvfb now evades detection more effectively because
    the browser appears "normal" to anti-bot systems rather than heavily modified.

Fixed

  • Browser engine — bans when using headless=False with injection-based detection
    Anti-bot systems like Akamai specifically scan for the telltale patterns in commonly-used CDP stealth scripts.
    Removing the injection eliminates a major detection surface while maintaining the evasion benefits of running
    a real browser process.

Optimized

  • Browser engine — code duplication eliminated
    Extracted _start_browser() helper method that centralizes browser startup and BROWSER_EXECUTABLE_PATH
    error handling. _start() (persistent browser) and _do_fetch() (per-proxy browser) now call the same
    code path, reducing maintenance burden and ensuring consistent behavior across non-proxy and proxy modes.
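
The centralized path check could look roughly like the sketch below. resolve_executable is a hypothetical helper, and both the error type and the message text are assumptions; only the BROWSER_EXECUTABLE_PATH setting name comes from these notes.

```python
import os


def resolve_executable(configured_path):
    """Return a validated browser binary path, or None to use auto-detection."""
    if not configured_path:
        return None
    path = os.path.expanduser(configured_path)
    if not (os.path.isfile(path) and os.access(path, os.X_OK)):
        raise FileNotFoundError(
            f"No browser executable at {path!r}; set BROWSER_EXECUTABLE_PATH "
            "to a valid Chrome, Chromium, or Brave binary."
        )
    return path
```

Calling one helper from both startup paths means a misconfigured path produces the same message in proxy and non-proxy modes.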

v0.6.6a2

04 Jun 11:04
389c39a

Pre-release

Added

  • Xvfb virtual display support for Docker / Zyte
    On Linux without a $DISPLAY, the browser engine now automatically starts
    Xvfb :99 before launching Chrome. This lets Chrome run in non-headless mode
    against a virtual framebuffer — identical to a real desktop session — which is
    significantly harder for anti-bot systems to detect than --headless=new.
    Falls back to headless silently if Xvfb is not installed.
    Requires apt-get install -y xvfb in your Docker image.
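
The display decision can be sketched as a small pure function plus a launcher. display_plan and start_xvfb are hypothetical names, and the three return labels are invented for this example; the :99 display number and the silent headless fallback follow the description above.

```python
import os
import shutil
import subprocess
import sys


def display_plan(env=None, platform=None, which=shutil.which):
    """Return 'native', 'xvfb', or 'headless' for the current environment."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    if not platform.startswith("linux") or env.get("DISPLAY"):
        # A display already exists, or the platform does not use X11.
        return "native"
    # No display: use Xvfb when installed, otherwise fall back to headless.
    return "xvfb" if which("Xvfb") else "headless"


def start_xvfb(display=":99"):
    """Start a virtual framebuffer and point DISPLAY at it."""
    process = subprocess.Popen(
        ["Xvfb", display],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    os.environ["DISPLAY"] = display
    return process
```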

v0.6.6a1

04 Jun 09:57
6cc7b07

Pre-release

Added

  • BROWSER_NO_SANDBOX config option
    New BROWSER_NO_SANDBOX: bool | None setting controls Chrome's sandbox mode.
    Defaults to None (auto-detect): sandbox is disabled automatically when the process runs
    as root on Linux (e.g. Zyte, Docker). Set True to force no-sandbox, False to keep
    sandbox even as root. Configurable via settings.py (BROWSER_NO_SANDBOX = True) or
    the config object.

Fixed

  • Browser engine fails on Docker (running as root)
    Chrome refuses to start without --no-sandbox when the process is root. The engine now
    auto-detects root and adds both --no-sandbox and --disable-dev-shm-usage (required
    in containers with limited /dev/shm).

  • headless=False crashes in display-less environments
    When no $DISPLAY is set on Linux (Docker, CI), the engine now silently overrides
    headless=False to headless=True, preventing Chrome from crashing on startup.
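
Taken together, the sandbox and headless rules in this release amount to a small decision table, sketched below. resolve_launch is a hypothetical helper; in particular, tying --disable-dev-shm-usage to the same decision as --no-sandbox when the setting is forced to True is an assumption, since the note above only describes the auto-detected root case.

```python
def resolve_launch(no_sandbox=None, headless=False, is_root=False, has_display=True):
    """Return (extra_chrome_args, effective_headless) for a launch."""
    # None means auto-detect: disable the sandbox only when running as root.
    disable_sandbox = is_root if no_sandbox is None else no_sandbox
    args = []
    if disable_sandbox:
        args += ["--no-sandbox", "--disable-dev-shm-usage"]
    # Without a display, a non-headless request is overridden to headless.
    effective_headless = headless or not has_display
    return args, effective_headless
```

An explicit False for no_sandbox keeps the sandbox even as root, matching the setting's documented behaviour.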