
Bypassing Cloudflare

Sudipto Chandra edited this page Jun 10, 2026 · 2 revisions

A Complete Technical Reference for Web Scrapers (2026)

Note

Scope: This article covers every known detection layer used by Cloudflare, the tools and techniques available to bypass each one, and — critically — the layers that currently have no reliable bypass. It is written for developers building legitimate web scrapers and data pipelines who need to understand the full landscape of modern bot protection in 2026.




Background: How Modern Bot Protection Works

The bot detection industry shifted from simple IP blocklists to composite trust scoring between 2022 and 2026. Modern systems like Cloudflare no longer make a binary "bot or not" decision on a single signal. Instead, they compute a score from multiple independent detection layers simultaneously. A scraper only gets through if it passes all active layers at once — failing even one is usually enough to trigger a challenge or outright block.

flowchart LR
    Req[Incoming Request] --> Layers{Detection Layers<br/>evaluated in parallel}
    Layers --> L1[IP Reputation]
    Layers --> L2[TLS / JA4 Fingerprint]
    Layers --> L3[HTTP/2 + Header Order]
    Layers --> L4[Browser JS Fingerprint]
    Layers --> L5[CDP Detection]
    Layers --> L6[Behavioral / ML]
    L1 --> Score[Composite Trust Score]
    L2 --> Score
    L3 --> Score
    L4 --> Score
    L5 --> Score
    L6 --> Score
    Score --> Decision{Score-based decision}
    Decision -->|High trust| Allow[✓ 200 OK — content served]
    Decision -->|Uncertain| Challenge[JS Challenge or Turnstile]
    Decision -->|Low trust| Block[✗ 403 / 1020 — blocked]
    Challenge -->|solved| Allow
    Challenge -->|failed| Block

The Conjunctive-Pass Property

The composite score is non-compensatory: a strong score on many layers does not cancel out a single strongly anomalous one. If we write pᵢ for the probability that a given adversarial strategy passes detector dᵢ, admission requires passing every active detector, so the overall evasion probability is bounded by the single hardest layer:

P_evade ≤ min(p₁, p₂, ... pₙ)

This is the formal explanation of why no silver bullet exists. Correcting a previously failing layer yields no improvement once some other layer is the binding constraint: the floor is now set by the new minimum. In the other direction, a defender gains disproportionate value from adding even one layer with near-zero pass probability, since it drives P_evade toward zero regardless of what all other layers score.
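The bound can be checked numerically. The sketch below uses made-up per-layer pass probabilities (they are illustrative assumptions, not measurements) to show that fixing a layer that is not the binding constraint leaves the bound unchanged, and that one near-zero layer dominates everything else. The independence-based product is included only as a comparison and is itself a strong assumption.

```python
from math import prod


def evasion_bound(pass_probs):
    """Upper bound on joint evasion: the weakest layer sets the ceiling."""
    return min(pass_probs)


def evasion_if_independent(pass_probs):
    """Joint pass probability if layers were independent (a strong assumption)."""
    return prod(pass_probs)


# Hypothetical pass probabilities for one strategy across three layers.
before = [0.99, 0.60, 0.10]        # third layer is the binding constraint
after_fix = [0.99, 0.99, 0.10]     # second layer improved, third untouched
with_hard_layer = after_fix + [0.001]  # defender adds one near-zero layer

bound_before = evasion_bound(before)
bound_after = evasion_bound(after_fix)
bound_hard = evasion_bound(with_hard_layer)
```

Here `bound_before` and `bound_after` are both 0.10, while `bound_hard` drops to 0.001, which is the asymmetry the paragraph describes.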

The Emit-Versus-Possess Distinction

The most useful lens for understanding why different layers have different bypass hardness is not "network vs. browser" — it is whether a layer reads an emitted artifact or a possessed property:

| Type | Definition | Evasion hardness |
|---|---|---|
| Emit-reading | Inspects an artifact the client transmits at connection time: TLS ClientHello bytes, HTTP/2 frame ordering, header sequence. The artifact carries no secret and no history, so anything a genuine browser emits can be reproduced verbatim by a client engineered to do so. | Low (structurally reproducible) |
| Possess-reading | Interrogates a property the client must continuously hold and cannot fabricate on demand: accumulated per-zone behavioral history, or a private cryptographic key. These resist forgery not because the check is obscure, but because reproducing them requires genuinely being what the check tests for. | High (structurally resistant) |

This distinction predicts bypass hardness more reliably than any other framing. Every layer that reads an emitted artifact is, in the limit, reproducible. Every layer that reads a continuously possessed property resists forgery.

Important

The four transport-and-header layers (TLS, HTTP/2, HTTP/3, header order) all read emitted artifacts and are strongly correlated: a client built to reproduce one browser's network stack tends to pass them together. They do not act as four independent barriers; they collapse into a single effective term. Adding another emit-reading layer barely moves P_evade, whereas adding one possess-reading layer can drive it to near zero.

Cloudflare protects an estimated 74–79% of all websites that use any bot mitigation tooling, making it by far the most commonly encountered wall in web scraping. Its detection stack has grown substantially more sophisticated since 2024, with the introduction of per-zone ML models, JA4+ fingerprinting, post-quantum TLS signals, and the new Web Bot Auth system.


Layer 1 — IP Reputation

How it works

Every incoming IP is scored against Cloudflare's global threat intelligence database. The score incorporates: known datacenter ASN ranges (AWS, GCP, Azure, Hetzner, DigitalOcean, etc.), Tor exit node lists, residential proxy pool reputation (shared flagged history propagates across users of the same pool), and CGNAT classification for mobile carrier IPs.

IP reputation is a hybrid layer in the emit-versus-possess framework: the address itself is freely chosen (emitted), but the reputation attached to it is a possessed property built up over time — something an adversary can rent but not fabricate. Its residual hardness is economic rather than technical. Datacenter ranges are cheaply and safely blockable because legitimate human traffic rarely originates there. Mobile carrier ranges are nearly unblockable because a single CGNAT address multiplexes thousands of real subscribers, so any block inflicts mass collateral damage.

Bypass options

| IP Type | Effectiveness (2026) | Notes |
|---|---|---|
| Mobile carrier (4G/5G) | ✅ Best | CGNAT economics prevent aggressive blocking: one IP stands for thousands of real subscribers, so Cloudflare cannot block it without mass collateral damage to legitimate traffic. |
| Real residential | ⚠️ Degraded | Shared pool history means flagged IPs propagate fast. Success rates dropped significantly in 2025–2026 as Cloudflare tightened ASN-level signals. |
| ISP (static residential) | ⚠️ Middling | Better than datacenter, worse than real residential. Unreliable on heavily protected targets. |
| Datacenter | ❌ Blocked | Cloudflare maintains a complete registry of all major datacenter ASN ranges. These IPs are flagged on arrival regardless of other signals. |
| Tor exit nodes | ❌ Blocked | Cloudflare maintains a complete Tor exit node list and challenges or blocks all of them by default. |

Tip

Use mobile carrier proxies for high-security targets and real residential proxies for medium-difficulty targets. Never use datacenter IPs for Cloudflare-protected scraping.


Layer 2 — TLS / JA3 / JA4 Fingerprinting

Evasion type: Emit-reading — structurally reproducible

How it works

When any HTTP client connects over HTTPS, it sends a TLS ClientHello handshake. Cloudflare extracts a fingerprint from this handshake and compares it against a database of over 15 million known-good browser profiles analyzed daily.

  • JA3 (2017): MD5 hash of TLS version, cipher suites, extensions, elliptic curves, and elliptic curve point formats, taken in the order the client sends them.
  • JA4 (2023, now standard): Multi-part fingerprint covering TLS version, SNI presence, cipher count, extension count, ALPN, and separate hashes for ciphers and extensions. Because JA4 sorts ciphers and extensions before hashing, shuffling their order does not change the fingerprint, which makes it more resistant to evasion than JA3. Cloudflare now checks both.
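To make the JA3 construction concrete, here is a minimal sketch of the public JA3 recipe: five comma-separated fields, values within a field joined by dashes, GREASE values dropped, then MD5. The numeric inputs are illustrative and do not correspond to any specific browser release; parsing a real ClientHello is out of scope.

```python
import hashlib


def is_grease(value: int) -> bool:
    """GREASE placeholders (RFC 8701) look like 0x0a0a, 0x1a1a, ... 0xfafa."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the pre-hash JA3 string, preserving the order the client sent."""
    fields = [
        [version],
        [c for c in ciphers if not is_grease(c)],
        [e for e in extensions if not is_grease(e)],
        [g for g in curves if not is_grease(g)],
        list(point_formats),
    ]
    return ",".join("-".join(str(v) for v in field) for field in fields)


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Illustrative values only: 771 is TLS 1.2 (0x0303) in decimal.
example = ja3_string(771, [0x0A0A, 4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
```

Because JA3 keeps the sent order, swapping two cipher suites yields a different hash. That order sensitivity is what JA4's sorting removes.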

Python's requests, httpx, aiohttp, and most standard HTTP clients produce a TLS handshake that looks nothing like Chrome. Cloudflare identifies the mismatch in milliseconds — not because of what you requested, but because of how you connected.

Because the handshake carries no secret, a client engineered to reproduce a target browser's exact byte sequence becomes indistinguishable on this axis. Its defensive value is real but bounded: it reliably catches naive clients and contributes nothing against a faithful impersonator.

Post-Quantum TLS (X25519MLKEM768) and Encrypted ClientHello

Chrome 131 (November 2024) shipped the standardized hybrid post-quantum key share X25519MLKEM768 — classical X25519 combined with ML-KEM-768, the NIST FIPS 203 standard — by default in every ClientHello, superseding the earlier X25519Kyber768Draft00 experiment. Every Chrome release since carries it. Cloudflare folds this into the JA4 fingerprint, so a scraper that claims to be a current Chrome but omits the post-quantum key share contradicts itself with a hard anomaly signal.

Encrypted ClientHello (ECH) is rolling out gradually in 2026. ECH encrypts the SNI field, hiding the destination hostname from network observers. Chrome enables ECH opportunistically when supported by the target. Scrapers that don't support ECH may stand out as the only "modern Chrome" clients not using it, depending on the destination.

This layer remains emit-reading: updating the impersonation profile to include the correct key share is sufficient to neutralize the PQ check. It penalizes stale profiles rather than providing structural protection.

Important

Don't hard-pin a Chrome version — use the impersonate="chrome" alias. Since curl_cffi v0.15.1, the bare chrome, safari, and firefox aliases automatically track the latest supported fingerprint, so you no longer need to chase version numbers manually. An old pinned profile (e.g. chrome120) is itself a detection signal in 2026, because no real user runs a two-year-old browser. If you must pin, pin a current release (real tools in mid-2026 run Chrome 145–148), and run curl-cffi update to pull the newest fingerprints without upgrading the package. Profiles missing the standardized X25519MLKEM768 key share create an anomaly against a post-quantum-aware fingerprinter.

Bypass options

  • curl_cffi (Python) — The most practical Python solution. Wraps curl-impersonate to reproduce exact browser TLS signatures including cipher suite order, extensions, ALPN, and post-quantum key shares. Since v0.15.x it also supports HTTP/3 fingerprints (Chrome 145/146, Firefox 147) and runtime fingerprint updates via curl-cffi update. Near drop-in replacement for requests:

    from curl_cffi import requests
    # The bare "chrome" alias auto-tracks the latest supported fingerprint
    response = requests.get(url, impersonate="chrome")
  • tls-client (Python/Go) — More fingerprint profiles available; slightly less actively maintained than curl_cffi.

  • cycletls (Go) — Preferred in Go-based scraping pipelines; excellent for Akamai-grade JA4 spoofing as well.

  • got-scraping (Node.js) — Native Node.js solution with automatic header generation matching the impersonated browser.

What does NOT work:

  • Raw requests or httpx — Both produce non-browser TLS signatures. httpx with http2=True fixes the protocol version but not the cipher suite order or extension fingerprint.
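  • Any HTTP client that does not reproduce the browser's ClientHello at the network-stack level, which is what curl-impersonate does. Setting headers or a User-Agent on top of a stock TLS library changes nothing the fingerprint reads.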
  • Stale impersonate profiles (e.g. chrome120 in 2026) — missing the standardized post-quantum key share and, ironically, themselves a detection signal. Use the chrome alias instead.

Layer 3 — HTTP/2 and HTTP/3 Fingerprinting

Evasion type: Emit-reading — structurally reproducible (collapses with Layer 2 and 4)

How it works

Real browsers send HTTP/2 with a specific frame ordering: SETTINGS → WINDOW_UPDATE → HEADERS. The SETTINGS frame contains specific parameter values (header table size, initial window size, max concurrent streams) that are version-specific to each browser release. Cloudflare's heuristics engine includes approximately 50 hand-written rules targeting HTTP/2 frame fingerprints.

HTTP/3 / QUIC fingerprinting: Current Chrome prefers HTTP/3 (over QUIC) when the server advertises it via the Alt-Svc header. Cloudflare has HTTP/3 enabled on most zones by default. QUIC has its own fingerprinting characteristics — transport parameters, frame patterns, and initial packet structure all vary by client. A scraper using HTTP/2 against a Cloudflare zone that advertises HTTP/3 isn't blocked outright, but the fingerprint mismatch is a weak signal. For maximum stealth on HTTP/3-enabled zones, use a client that supports it.

Python requests defaults to HTTP/1.1. Even httpx with http2=True still produces a non-browser fingerprint: it sends the right frame types, but with parameter values and ordering that no browser uses. Neither speaks HTTP/3.
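One way to picture what an HTTP/2 fingerprint captures is to serialize the connection preface into a single string, in the style of the passive HTTP/2 fingerprint format popularized by Akamai researchers (SETTINGS, WINDOW_UPDATE increment, PRIORITY frames, pseudo-header order). The sketch below is an assumption-laden illustration: the "browser-like" and "library-like" values are examples chosen for contrast, not authoritative profiles of any release, and Cloudflare's actual rules are not public.

```python
def h2_fingerprint(settings, window_update, priority_frames, pseudo_header_order):
    """Serialize HTTP/2 preface traits as 'SETTINGS|WINDOW_UPDATE|PRIORITY|PSEUDO'."""
    settings_part = ";".join(f"{ident}:{value}" for ident, value in settings)
    priority_part = ",".join(priority_frames) if priority_frames else "0"
    # Pseudo-headers are abbreviated to their first letter: ':method' -> 'm'.
    pseudo_part = ",".join(name.lstrip(":")[0] for name in pseudo_header_order)
    return f"{settings_part}|{window_update}|{priority_part}|{pseudo_part}"


# SETTINGS identifiers per RFC 9113: 1=HEADER_TABLE_SIZE, 2=ENABLE_PUSH,
# 3=MAX_CONCURRENT_STREAMS, 4=INITIAL_WINDOW_SIZE, 6=MAX_HEADER_LIST_SIZE.
browser_like = h2_fingerprint(
    settings=[(1, 65536), (2, 0), (4, 6291456), (6, 262144)],  # illustrative
    window_update=15663105,                                     # illustrative
    priority_frames=[],
    pseudo_header_order=[":method", ":authority", ":scheme", ":path"],
)

library_like = h2_fingerprint(
    settings=[(1, 4096), (2, 0), (3, 100), (4, 65535)],         # hypothetical
    window_update=16777216,                                     # hypothetical
    priority_frames=[],
    pseudo_header_order=[":method", ":scheme", ":authority", ":path"],
)
```

Both clients speak valid HTTP/2, yet the two strings differ in every field, which is why enabling the protocol alone does not help.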

Bypass options

  • curl_cffi — Fixes both TLS and HTTP/2 frame ordering simultaneously with the same impersonate parameter. Supports HTTP/3 via the underlying curl binary when compiled with QUIC support. This triple fix is the primary reason it is the recommended baseline tool.
  • Real browser (Camoufox, Patchright, nodriver, any) — Sends correct HTTP/2 or HTTP/3 by definition since it is a real browser engine.

Warning

Enabling HTTP/2 in httpx or aiohttp does not resolve fingerprinting. The frames are sent, but with wrong parameter values and in the wrong order. The TLS + HTTP/2 combination must match a specific browser version exactly.


Layer 4 — HTTP Header Order Fingerprinting

Evasion type: Emit-reading — structurally reproducible (collapses with Layer 2 and 3)

How it works

Browser engines send HTTP request headers in a deterministic, version-specific order. For Chrome on Windows, the order is consistent across millions of real requests: Host, Connection, Pragma, Cache-Control, sec-ch-ua, sec-ch-ua-mobile, sec-ch-ua-platform, Upgrade-Insecure-Requests, User-Agent, Accept, Sec-Fetch-Site, Sec-Fetch-Mode, Sec-Fetch-User, Sec-Fetch-Dest, Accept-Encoding, Accept-Language.

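In HTTP/2, the pseudo-headers (:method, :authority, :scheme, :path) precede the regular headers, and their relative order is itself browser-specific. Chrome, for example, sends them as :method, :authority, :scheme, :path, while other browser engines use different orders.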

A scraper that uses the correct User-Agent and TLS fingerprint but sends headers in a different order — or omits Sec-Fetch headers entirely — produces an anomalous signature. Cloudflare evaluates header order as a signal independent of TLS. This is why manually constructing headers in requests or httpx doesn't work even if every individual header value is correct: it's the order that matters, not just the presence.
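A detector-side view makes the point clearer. The sketch below is a hypothetical check, not Cloudflare's implementation: it takes the Chrome-on-Windows order listed above as the reference, then reports whether the headers a client sent appear in the same relative order and which Sec-Fetch headers are absent.

```python
# Reference order copied from the Chrome-on-Windows list in this section.
CHROME_WINDOWS_ORDER = [
    "Host", "Connection", "Pragma", "Cache-Control",
    "sec-ch-ua", "sec-ch-ua-mobile", "sec-ch-ua-platform",
    "Upgrade-Insecure-Requests", "User-Agent", "Accept",
    "Sec-Fetch-Site", "Sec-Fetch-Mode", "Sec-Fetch-User", "Sec-Fetch-Dest",
    "Accept-Encoding", "Accept-Language",
]


def header_order_report(observed, expected=CHROME_WINDOWS_ORDER):
    """Return (in_expected_order, missing_sec_fetch_headers) for a header list."""
    rank = {name.lower(): index for index, name in enumerate(expected)}
    observed_lower = [name.lower() for name in observed]
    # Only headers that exist in the reference list take part in the order check.
    known = [name for name in observed_lower if name in rank]
    in_order = all(rank[a] < rank[b] for a, b in zip(known, known[1:]))
    missing_sec_fetch = [
        name for name in expected
        if name.lower().startswith("sec-fetch-") and name.lower() not in observed_lower
    ]
    return in_order, missing_sec_fetch


# A hand-built header set: every value could be correct, but the order is not.
hand_built = ["User-Agent", "Accept", "Host", "Accept-Language"]
report = header_order_report(hand_built)
```

For `hand_built`, the report flags both problems at once: the order is wrong and all four Sec-Fetch headers are missing.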

Bypass options

  • curl_cffi with impersonate — The impersonate parameter sets the correct header order alongside TLS and HTTP/2. This is handled automatically; no manual configuration needed. This is the third reason curl_cffi is the baseline tool — it fixes Layer 2, 3, and 4 simultaneously.
  • Real browser — Header order is inherently correct.
  • DrissionPage — A Python library that unifies requests-style control and browser-based control in a single API, with built-in anti-detection including correct header ordering:
    from DrissionPage import ChromiumPage
    page = ChromiumPage()
    page.get(url)
    html = page.html

Caution

When using curl_cffi, do not pass a custom headers dict unless you are certain of the correct order for the browser version you are impersonating. Passing headers={...} replaces the impersonated order with your own, breaking the impersonation. Let impersonate handle headers automatically; only override specific values when necessary.


Layer 5 — Browser / JavaScript Fingerprinting

Evasion type: Emit-reading (high-dimensional, coupled) — moderate hardness

How it works

When a challenge page or protected page loads, Cloudflare's injected JavaScript actively collects device and browser fingerprints:

  • Canvas fingerprint — Renders a hidden canvas element and hashes the output. GPUs produce slightly different outputs per hardware/driver combination. Headless browsers produce consistent, suspiciously uniform outputs.
  • WebGL fingerprint — GPU renderer string and vendor. Headless Chrome reports "SwiftShader" or "Mesa" — both are immediate bot signals.
  • Audio context fingerprint — Processes a short audio buffer through the AudioContext API. Results vary by hardware configuration.
  • Font enumeration — Checks for system fonts that correlate with the claimed OS/platform.
  • navigator.webdriver — Set to true in all standard Playwright, Selenium, and Puppeteer instances. Cloudflare checks this directly.
  • Permissions API anomalies — Headless browsers handle notification, camera, and microphone permission queries differently than real browsers.
  • Plugin list — Real Chrome has browser plugins; headless Chrome has none by default.
  • Screen dimensions and color depth — Headless defaults are often unset, zero, or inconsistent with the claimed operating system.
  • JS engine timing — Real V8 engines (Chrome) and SpiderMonkey (Firefox) have characteristic performance.now() resolution and execution timing patterns. Automated browsers driven through CDP introduce subtle timing anomalies.
  • WebRTC IP leak — Real IP leakage through WebRTC STUN requests, which can reveal the machine's actual IP even when routing traffic through a proxy.
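The signals above are most useful in combination, as a coherence check on the whole reported environment. The sketch below is a hypothetical scoring helper written from the list in this section; the dictionary keys and the rules are assumptions for illustration and do not reflect Cloudflare's actual probe format.

```python
def coherence_flags(fp: dict) -> list:
    """Return human-readable reasons a reported fingerprint looks automated."""
    flags = []
    if fp.get("webdriver") is True:
        flags.append("navigator.webdriver is true")
    renderer = str(fp.get("webgl_renderer", "")).lower()
    if "swiftshader" in renderer or "mesa" in renderer:
        flags.append("software WebGL renderer")
    if not fp.get("plugins"):
        flags.append("empty plugin list")
    width, height = fp.get("screen", (0, 0))
    if width <= 0 or height <= 0:
        flags.append("unset screen dimensions")
    webrtc_ip, connection_ip = fp.get("webrtc_ip"), fp.get("connection_ip")
    if webrtc_ip and connection_ip and webrtc_ip != connection_ip:
        flags.append("WebRTC IP differs from connection IP")
    return flags


# Hypothetical reports: a default headless session versus a desktop-like one.
headless_like = {
    "webdriver": True,
    "webgl_renderer": "Google SwiftShader",
    "plugins": [],
    "screen": (0, 0),
    "webrtc_ip": "203.0.113.7",
    "connection_ip": "198.51.100.20",
}
desktop_like = {
    "webdriver": False,
    "webgl_renderer": "ANGLE (NVIDIA GeForce RTX 3060)",
    "plugins": ["PDF Viewer"],
    "screen": (1920, 1080),
    "webrtc_ip": "198.51.100.20",
    "connection_ip": "198.51.100.20",
}
```

The headless-like report trips every rule, while the desktop-like one trips none. Patching a single property such as navigator.webdriver would still leave four flags raised.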

WebRTC — critical for proxy-based scraping

Warning

WebRTC can expose the real IP address of the machine running the browser, regardless of proxy or VPN configuration. Cloudflare's JavaScript probes can trigger WebRTC STUN requests and detect if the reported IP differs from the connection IP. This affects all browser-based scrapers using proxies unless WebRTC is explicitly disabled.

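  • Chromium/Chrome (Patchright, nodriver): Launch with --force-webrtc-ip-handling-policy=disable_non_proxied_udp so WebRTC cannot open UDP paths that bypass the proxy, or use a WebRTC leak prevention extension. The widely quoted --disable-webrtc is not a switch that current desktop Chromium builds act on, so do not rely on it.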
  • Firefox (Camoufox): Set media.peerconnection.enabled = false in Firefox preferences before launch:
    from camoufox.sync_api import Camoufox
    with Camoufox(headless=False, firefox_user_prefs={
        "media.peerconnection.enabled": False
    }) as browser:
        page = browser.new_page()
        page.goto(url)

Bypass options

  • Camoufox — Firefox-based stealth browser, actively maintained as of 2026. Patches canvas, WebGL, audio, font, and navigator fingerprints. Performs well on targets where Chrome-shaped fingerprints are specifically being targeted. WebRTC disabled via prefs as shown above.

    from camoufox.sync_api import Camoufox
    with Camoufox(headless=False) as browser:
        page = browser.new_page()
        page.goto(url)
        html = page.content()
  • Patchright: Chromium-based fork of Playwright with 49+ C++ patches targeting fingerprint anomalies. Preferred when Chrome fingerprints are specifically required. Add --force-webrtc-ip-handling-policy=disable_non_proxied_udp to args to keep WebRTC traffic behind the proxy.

  • SeleniumBase UC Mode — Patches navigator.webdriver and several other common signals. Lighter weight than Camoufox or Patchright; appropriate for mid-difficulty targets.

What does NOT work in 2026:

  • playwright-extra with the stealth plugin — Deprecated by its maintainers in February 2025. Cloudflare fingerprinting detects it reliably.
  • Standard Playwright or Puppeteer without patching — Flagged on navigator.webdriver alone, before any fingerprint analysis.

Layer 6 — CDP / DevTools Protocol Detection

Evasion type: Emit → Possess (transitional) — partial barrier, not fully solved

How it works

Chrome DevTools Protocol (CDP) is the interface used by Playwright, Puppeteer, Selenium, and all similar automation tools to control a browser. Cloudflare's JavaScript probes for CDP artifacts that persist even after common surface-level patches like setting navigator.webdriver = false. Detection points include:

  • Specific property descriptors on window and document that CDP modifies at a low level
  • Timing anomalies in how certain JS APIs respond when driven by automation
  • Inconsistencies in how Runtime.evaluate and Page.addScriptToEvaluateOnNewDocument modify the JavaScript environment
  • The presence of __playwright_target__ and similar internal properties

Bypass options

  • nodriver: Successor to undetected-chromedriver. Drives Chrome through its own direct DevTools client, with no WebDriver, chromedriver, or Playwright layer in between, which removes the framework-level automation artifacts that detectors target. Currently the most effective freely available tool for automation-protocol-level detection:

    import nodriver as uc
    
    async def main():
        browser = await uc.start(headless=False)
        page = await browser.get(url)
        await page.sleep(2)
        html = await page.get_content()
        browser.stop()
    
    uc.loop().run_until_complete(main())
  • Headful mode (headless=False) — Always produces better results than headless across all browser tools. GPU renderer, screen properties, and several timing-based signals only appear correct in headful mode.

  • CloakBrowser — A newer patched-Chromium fork (49 source-level C++ fingerprint modifications) with a drop-in Playwright API. An alternative to Patchright for Chrome-based stealth; free, MIT-licensed wrapper.

Note

2026 benchmark reality check. An independent June 2026 benchmark (7 stealth tools, 31 Cloudflare targets, 651 verdicts, headed mode on a residential IP) found a clear split: nodriver passed all 31 targets with zero blocks — the only tool to do so. Patchright, CloakBrowser, and Camoufox clustered in the middle. Notably, raw curl_cffi (a ~21-line wrapper, no JS engine) tied CloakBrowser on raw pass count — confirming that for TLS/header-only targets, the lightweight HTTP client matches a heavy Chromium fork. The decisive factor for Turnstile-protected targets was having zero Playwright in the protocol stack, which is exactly what nodriver achieves and what every Playwright fork (Camoufox, Patchright, CloakBrowser) cannot.

Important

Partially unresolvable: Camoufox, Patchright, and CloakBrowser all use Playwright's automation protocol internally, so they remain detectable by deep automation-protocol fingerprinting even with all their other patches applied. nodriver is the only freely available tool that meaningfully addresses this specific layer by replacing the control channel wholesale. For the hardest targets, no open-source tool fully solves CDP detection. One caveat: nodriver does not simulate mouse movement, scroll, or keystroke timing, so it gets through CDP detection but can still be caught by heavy behavioral scoring (Layer 7) — pair it with the behavioral mitigations below.


Layer 7 — Behavioral / ML Signals

Evasion type: Possess-reading — high, non-portable hardness

How it works

Since 2025, Cloudflare deploys per-zone ML anomaly models trained separately per website zone — what looks like normal behavior on a developer documentation site may be flagged as anomalous on an e-commerce checkout flow. Customer data is not shared across zones. Behavioral signals include:

  • Request timing regularity — Perfectly timed requests at fixed intervals are a strong bot signal. Human browsing has irregular, non-uniform timing.
  • Navigation patterns — Real users follow referrer chains (homepage → listing → detail page). Scrapers often jump directly to deep URLs with no referrer.
  • Scroll and mouse events — On JS-rendered pages, sessions that never scroll or move the mouse before reading content are distinguishable from human sessions.
  • Cookie and session age — Fresh sessions with zero history are treated as higher risk than returning sessions with accumulated cookies and visit history.
  • Inter-request intervals — Sub-second intervals between full page loads are a strong bot signal.
  • Session history depth — Real users typically have several pages of history; a session that opens directly to a scraping target with no prior navigation looks anomalous.
  • Concurrent session count per IP — Multiple parallel sessions from the same IP — especially with similar fingerprints — is itself a bot signal. Real users rarely have 10+ tabs hammering the same site in parallel. Keep concurrency per IP modest (1–3 parallel sessions max for most targets).
  • Inter-request signals (Aug 2024+) — Cloudflare correlates fingerprints, JA4 hashes, and behavior across multiple requests from the same source over time. A single request looks fine in isolation but stands out across a session window. Mitigated only by being consistent across the entire session.
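The timing-regularity signal can be illustrated with a simple statistic. The sketch below computes the coefficient of variation of the gaps between requests; this metric is an assumption chosen for illustration, since the features Cloudflare's per-zone models actually use are not public.

```python
from statistics import mean, pstdev


def interval_cv(timestamps):
    """Coefficient of variation of inter-request gaps (0.0 means perfectly regular)."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three timestamps")
    return pstdev(gaps) / mean(gaps)


# Hypothetical request times in seconds since the session started.
fixed_interval = [0, 2, 4, 6, 8]              # a sleep(2) loop
irregular = [0.0, 1.7, 9.2, 11.0, 25.4]       # bursts and long pauses

cv_fixed = interval_cv(fixed_interval)
cv_irregular = interval_cv(irregular)
```

A fixed sleep produces a value of exactly zero, which is trivially separable from the irregular session. Adding small uniform jitter raises the value only slightly.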

Bypass options

  • Randomize delays between requests. random.uniform(1.5, 5.0) is a workable baseline, but a skewed distribution (exponential or gamma) is closer to human timing, with many short pauses and occasional long ones.
  • Session warm-up: Visit the homepage first, let JavaScript execute fully, accumulate cookies, then navigate naturally to target pages. Simulate the referrer chain a real user would follow.
  • Pass realistic Referer headers matching natural navigation: e.g. Referer: https://site.com/listing/ when requesting a detail page.
  • Persist cookies and session state across requests. Never start a fresh cookieless session for every scraping run.
  • For browser-based scrapers, add randomized mouse movement before any interaction: call Playwright's page.mouse.move() along a Bézier curve or through randomized intermediate waypoints.
  • Concurrency control — Cap parallel requests per IP at 1–3. If you need more throughput, scale horizontally with more IPs, not more parallel requests per IP.
  • Browser profile persistence — Save and reuse the browser's user data directory across scraping sessions. A browser profile that has visited the target site 5 times over several days looks significantly more like a real returning user than one starting cold every run:
    # Chromium / Patchright: launch_persistent_context returns a BrowserContext
    context = await playwright.chromium.launch_persistent_context(
        user_data_dir="/path/to/persistent/profile",
        headless=False
    )
    
    # nodriver
    browser = await uc.start(
        headless=False,
        user_data_dir="/path/to/persistent/profile"
    )

    Store one profile directory per proxy IP so that IP reputation and session history stay consistent with each other.
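Two of the mitigations above, skewed delays and curved mouse paths, can be sketched with the standard library alone. The helper names and default parameters below are hypothetical choices for illustration and need calibration per target.

```python
import random


def human_delay(mean_s=3.0, shape=2.0, floor_s=1.0):
    """Gamma-distributed pause in seconds: mostly short, occasionally long."""
    # gammavariate(alpha, beta) has mean alpha * beta, so beta = mean / shape.
    return max(floor_s, random.gammavariate(shape, mean_s / shape))


def bezier_path(start, end, steps=20, spread=100.0):
    """Points along a cubic Bezier curve with randomized control points."""
    (x0, y0), (x3, y3) = start, end
    # Control points sit near one third and two thirds of the way, then jitter.
    x1 = x0 + (x3 - x0) / 3 + random.uniform(-spread, spread)
    y1 = y0 + (y3 - y0) / 3 + random.uniform(-spread, spread)
    x2 = x0 + 2 * (x3 - x0) / 3 + random.uniform(-spread, spread)
    y2 = y0 + 2 * (y3 - y0) / 3 + random.uniform(-spread, spread)
    points = []
    for i in range(steps + 1):
        t = i / steps
        a, b, c, d = (1 - t) ** 3, 3 * (1 - t) ** 2 * t, 3 * (1 - t) * t ** 2, t ** 3
        points.append((a * x0 + b * x1 + c * x2 + d * x3,
                       a * y0 + b * y1 + c * y2 + d * y3))
    return points


random.seed(7)  # fixed seed only so the example is repeatable
path = bezier_path((0.0, 0.0), (300.0, 200.0))
pause = human_delay()
```

In a Playwright-style session, each point in `path` would be passed to page.mouse.move(x, y) with a short pause in between, and `human_delay()` would replace a fixed sleep between page loads.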

Important

Per-zone ML makes behavioral bypassing a moving target. What works on one Cloudflare-protected site may not work on another. Calibration per target is necessary for high-value scraping.


Layer 8 — Managed Challenge (JavaScript Challenge)

How it works

When Cloudflare's composite score is uncertain — not confident enough to block outright, not confident enough to allow through — it serves an interstitial "Checking your browser..." page. This page executes JavaScript, verifies the client can execute it correctly, evaluates the resulting fingerprint, and if satisfied issues a cf_clearance cookie that grants session access for a limited time.

Cloudflare's two key cookies

  • cf_clearance — Issued after successfully solving a managed challenge or Turnstile. Grants access to protected content. Typically valid for 30 minutes to 2 hours. Cryptographically bound to the IP address it was issued on. This binding is architecturally significant: by tying the proof-of-solve to the address, Cloudflare manufactures a possess-like property — continued control of one address — out of an otherwise reusable token. This couples the challenge layer to the IP-reputation and behavioral layers, forcing an adversary to defeat them jointly rather than serially.
  • __cf_bm — Cloudflare Bot Management's session fingerprint cookie. Issued on every visit, even before any challenge. Stores a short-lived (~30 min) fingerprint of the current browser session. Used to track session consistency over the session window. More tightly bound to the specific browser session than cf_clearance.

When reusing cookies, you generally need both — cf_clearance proves you solved the challenge, __cf_bm proves the session has been continuous. Missing or mismatched __cf_bm can trigger a re-challenge even when cf_clearance is still valid.

Sticky vs. rotating proxies — critical requirement

Caution

cf_clearance is cryptographically bound to the IP address it was issued on. Using a rotating proxy that changes IP between the solve step and the request step will cause the cookie to be rejected. This is one of the most common and costly mistakes in production scraping pipelines.

Always use sticky session proxies (same IP for the full duration of a scraping session) when implementing cookie reuse. Configure your proxy provider's sticky session mode explicitly — do not rely on defaults.
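One way to keep these constraints from drifting apart in a pipeline is to store the cookies, the proxy, and the User-Agent as a single unit and refuse to use them separately. The class below is a hypothetical sketch: its name, fields, and the conservative 30-minute default are assumptions, with the default taken from the lower end of the validity range quoted above.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ClearanceSession:
    """Cookies plus the exact proxy and User-Agent they were issued under."""
    proxy: str
    user_agent: str
    cookies: dict
    solved_at: float
    max_age_s: float = 30 * 60  # conservative assumption; real lifetime varies

    def usable_via(self, proxy, now=None):
        """True only for the same sticky proxy, with both cookies, before expiry."""
        now = time.time() if now is None else now
        has_both = "cf_clearance" in self.cookies and "__cf_bm" in self.cookies
        return proxy == self.proxy and has_both and (now - self.solved_at) < self.max_age_s


# Hypothetical values for illustration.
session = ClearanceSession(
    proxy="http://user:pass@sticky.proxy.example:8000",
    user_agent="Mozilla/5.0 (example)",
    cookies={"cf_clearance": "abc", "__cf_bm": "def"},
    solved_at=1_000.0,
)
same_ip = session.usable_via("http://user:pass@sticky.proxy.example:8000", now=1_600.0)
rotated = session.usable_via("http://user:pass@other.proxy.example:8000", now=1_600.0)
expired = session.usable_via("http://user:pass@sticky.proxy.example:8000", now=1_000.0 + 31 * 60)
```

Checking `usable_via` before each request turns a silent rejection on a rotated IP into an explicit decision to re-solve.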

Bypass options

Option A — Cookie reuse (most efficient for scale)

sequenceDiagram
    autonumber
    participant Scraper
    participant nodriver as Stealth Browser<br/>(nodriver/Camoufox)
    participant curl as curl_cffi
    participant CF as Cloudflare
    participant Origin

    Scraper->>nodriver: Launch with sticky proxy
    nodriver->>CF: GET site
    CF-->>nodriver: Challenge page (JS)
    nodriver->>nodriver: Execute JS challenge
    nodriver->>CF: Submit answer
    CF-->>nodriver: cf_clearance + __cf_bm cookies
    nodriver->>Scraper: Extract cookies + User-Agent
    Scraper->>curl: Pass cookies + same sticky proxy

    loop For each subsequent page (fast path)
        curl->>CF: GET page (with cookies, matching UA)
        CF->>Origin: Forward request
        Origin-->>CF: HTML response
        CF-->>curl: 200 OK + content
    end

    Note over Scraper,curl: When cf_clearance expires (~30min–2hr),<br/>re-solve via browser and refresh cookies

Solve the challenge once with a full stealth browser, extract the resulting cookies, and pass them to a lightweight HTTP client for all subsequent requests. cf_clearance is typically valid for 30 minutes to 2 hours:

  1. Launch nodriver or Camoufox on a sticky proxy, navigate to the target site
  2. The browser executes the challenge natively and receives cf_clearance + __cf_bm
  3. Extract both cookies plus the exact User-Agent string used
  4. Pass these cookies to curl_cffi with a matching impersonate profile and the same User-Agent
  5. Make all subsequent requests via curl_cffi — no browser overhead per request
  6. Monitor for 403 responses indicating cookie expiry; re-solve when detected
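
import nodriver as uc
from curl_cffi import requests as cffi_requests

async def get_clearance_cookies(url: str, proxy: str) -> dict:
    """Solve the challenge once in a real browser and return its cookies."""
    browser = await uc.start(
        headless=False,
        browser_args=[
            f"--proxy-server={proxy}",
            # Keep WebRTC from opening UDP paths that bypass the proxy.
            "--force-webrtc-ip-handling-policy=disable_non_proxied_udp",
        ],
    )
    page = await browser.get(url)
    await page.sleep(4)  # allow challenge to resolve
    # browser.cookies is nodriver's cookie jar; get_all() returns cookie
    # objects with .name and .value (all domains, so filter if needed).
    cookies = {c.name: c.value for c in await browser.cookies.get_all()}
    browser.stop()
    return cookies

def make_request(url: str, cookies: dict, proxy: str) -> str:
    """Reuse the solved cookies from a lightweight client on the same IP."""
    session = cffi_requests.Session()
    for name, value in cookies.items():
        session.cookies.set(name, value)
    resp = session.get(
        url,
        impersonate="chrome",  # should match the browser that solved the challenge
        proxies={"http": proxy, "https": proxy},  # MUST be the same sticky IP
    )
    return resp.text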
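Step 6 of the workflow (watch for 403 responses and re-solve) can be sketched independently of any particular client. In the sketch below, `fetch` and `solve` are hypothetical callables that you would supply, for example thin wrappers around an HTTP client and a browser-based solver; nothing here is part of the curl_cffi or nodriver API.

```python
from typing import Callable, Dict, Tuple


def fetch_with_resolve(
    url: str,
    fetch: Callable[[str, Dict[str, str]], Tuple[int, str]],
    solve: Callable[[], Dict[str, str]],
    cookies: Dict[str, str],
    max_resolves: int = 1,
) -> Tuple[str, Dict[str, str]]:
    """Fetch url; on a 403, refresh cookies via solve() and retry a bounded number of times."""
    for attempt in range(max_resolves + 1):
        status, body = fetch(url, cookies)
        if status != 403:
            return body, cookies
        if attempt < max_resolves:
            cookies = solve()  # expensive path: runs the browser again
    raise RuntimeError(f"still blocked after {max_resolves} re-solve(s): {url}")


# Fakes standing in for the real client and solver, for illustration only.
solve_calls = []


def fake_solve():
    solve_calls.append(1)
    return {"cf_clearance": "fresh"}


def fake_fetch(url, cookies):
    return (200, "content") if cookies.get("cf_clearance") == "fresh" else (403, "")


body, current = fetch_with_resolve(
    "https://example.test/page", fake_fetch, fake_solve, {"cf_clearance": "stale"}
)
```

Bounding the number of re-solves matters: a 403 that persists after a fresh solve usually points to a different failing layer, and looping on the browser would only burn resources.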

Option B — Solve inline (simpler pipeline)

Use nodriver or Camoufox for every request. Higher resource cost but no cookie management complexity. Appropriate when session volume is low or when JavaScript execution is required to render the page content.

What does NOT work in 2026:

  • FlareSolverr — Deprecated. Cloudflare detects its browser fingerprint reliably since early 2025.
  • cloudscraper — Challenge format has changed multiple times since its last significant update. Consider it defunct.

Layer 9 — Cloudflare Turnstile CAPTCHA

How it works

Turnstile replaced hCaptcha as Cloudflare's user-facing CAPTCHA. It runs a client-side JavaScript challenge that evaluates browser authenticity, canvas fingerprint, and interaction history before showing anything to the user. On lower-risk sessions it may be invisible (non-interactive); on suspicious sessions it presents an interactive checkbox, and it does not fall back to image puzzles. Unlike the managed challenge, Turnstile actively evaluates the quality of the browser environment itself before issuing a token.

Bypass options

Option A — Click-based solver (free, lower reliability)

A stealth browser (Camoufox or Patchright) automatically clicks the Turnstile checkbox. No external service needed. Works on lower-sensitivity configurations. Standard Playwright without patching is detected and blocked before reaching the click interaction.

Option B — CAPTCHA solving service (paid, higher reliability)

Services like 2Captcha and CapSolver accept the Turnstile challenge parameters, solve via real humans or AI, and return a valid token. Cost is approximately $2–4 per 1,000 solves; latency is 10–30 seconds per solve. CapSolver maintains an AI-based system that adapts to Cloudflare updates faster than most DIY implementations.
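The quoted price and latency ranges translate directly into a budget. The sketch below is plain arithmetic on the figures in the paragraph above; the function names are hypothetical and the job size is an example.

```python
def solve_cost_range(n_solves, low_per_1k=2.0, high_per_1k=4.0):
    """Return (low, high) cost in dollars for n_solves at the quoted rates."""
    return n_solves / 1000 * low_per_1k, n_solves / 1000 * high_per_1k


def serial_latency_range_s(n_solves, low_s=10, high_s=30):
    """Return (low, high) total seconds if solves run one after another."""
    return n_solves * low_s, n_solves * high_s


# Example job: 50,000 Turnstile solves.
cost_low, cost_high = solve_cost_range(50_000)
wait_low, wait_high = serial_latency_range_s(50_000)
```

For this job the cost lands between 100 and 200 dollars, while serial latency adds up to between 500,000 and 1,500,000 seconds. The bottleneck is therefore waiting time rather than price, which argues for running solves concurrently across sessions.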

Option C — Token reuse

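Turnstile tokens expire after approximately 5 minutes and are single-use: the site's backend can validate a given token only once, so the token itself cannot be replayed across requests. What can be reused is the session the site grants after validating the token (typically a cookie). For scraping multiple pages in a single session, solve once and carry that session on all subsequent requests for as long as the site honors it.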


Layer 10 — Bot Fight Mode Tiers (Free, Pro, and Under Attack)

How it works

Cloudflare offers three distinct protection tiers at the lower end of the bot management stack, often conflated but meaningfully different:

flowchart TD
    Free[🆓 Free Plan] -->|default| BFM[Bot Fight Mode<br/>━━━━━━━━━━<br/>Heuristics only<br/>No ML]
    Pro[💼 Pro / Business] -->|optional| SBFM[Super Bot Fight Mode<br/>━━━━━━━━━━<br/>+ Definitely automated detection<br/>+ Verified bot allowlist<br/>+ Static resource protection]
    Ent[🏢 Enterprise] -->|paid| BM[Bot Management v9<br/>━━━━━━━━━━<br/>Full ML scoring<br/>Per-zone models<br/>Behavioral analysis]
    AnyPlan[Any Plan] -->|manual toggle| UAM[⚠️ Under Attack Mode<br/>━━━━━━━━━━<br/>JS challenge for EVERY visitor<br/>No exceptions by score]
    May2025[🆕 May 2025] -->|opt-in only| WBA[🔐 Web Bot Auth<br/>━━━━━━━━━━<br/>Cryptographic agent passport<br/>Fails open currently]
| Tier | Plan | What It Does |
| --- | --- | --- |
| Bot Fight Mode | Free | Simple heuristics: datacenter IP blocklists, missing headers, obviously automated User-Agents, absence of browser cookies. No ML. |
| Super Bot Fight Mode | Pro / Business | Adds "definitely automated" classification, a verified bots allowlist, and protection for static resources (images, CSS, JS). Still heuristic-based but more comprehensive. |
| Under Attack Mode | All plans (manual toggle) | Serves a JavaScript challenge to every visitor regardless of bot score, including legitimate users. Zero exceptions by score. |

Under Attack Mode deserves specific attention. Sites activate it during active DDoS events or when experiencing aggressive scraping. When it is on, it is qualitatively different from all other tiers — no request bypasses the challenge based on reputation or prior session quality. Every connection must execute the JavaScript challenge and obtain cf_clearance before receiving any content.

Bypass options

  • Bot Fight Mode (free): curl_cffi with impersonate="chrome" + any residential or mobile IP + correct headers. Bypassed reliably by the baseline stack.
  • Super Bot Fight Mode (Pro/Business): Same baseline stack. Slightly more aggressive header checks — ensure Sec-Fetch-* headers are present and correct (handled automatically by curl_cffi impersonation).
  • Under Attack Mode: No lightweight bypass. Every session must solve the managed challenge via nodriver or Camoufox, regardless of other signal quality. Implement cookie reuse (Layer 8, Option A) to minimize browser invocations.

Layer 11 — ML Bot Score v9 (Paid Bot Management)

How it works

Available to paying Cloudflare customers. The v9 model (default on new zones since late 2025) is a composite ML score combining all previous layers simultaneously: TLS fingerprint, HTTP/2 fingerprint, header order, browser JS signals, CDP detection signals, behavioral patterns, IP reputation, session history, and cross-zone threat intelligence. A site actively using Bot Management v9 is substantially harder to scrape reliably than one on any free or Pro tier protection.

Bypass options

No single tool resolves this. The minimum working stack for a v9 zone:

  1. Mobile carrier IP on a sticky session — Non-flagged, CGNAT-protected
  2. curl_cffi with the chrome alias — Correct TLS, HTTP/2, header order, and PQ key share
  3. cf_clearance + __cf_bm from nodriver or Camoufox — Valid session cookies from a genuine browser challenge solve
  4. Persistent browser profile — Session history on the target site
  5. Session warming — Homepage visit, cookie accumulation, realistic referrer chain
  6. Randomized delays — Non-uniform inter-request intervals
  7. Concurrency cap — Max 1–3 parallel sessions per IP

With this full stack, success rates are high for light and medium Bot Management configurations. For zones with aggressive behavioral tuning, success remains inconsistent. Targets with v9 + behavioral tuning are typically high-value commercial sites (large e-commerce, ticketing platforms, financial services) rather than general content sites.


Layer 12 — Cloudflare Workers Custom Detection

How it works

Cloudflare Workers allows site operators to deploy custom JavaScript that runs at Cloudflare's edge — before a request reaches origin and before standard bot management rules apply. Workers can implement arbitrary detection logic not available in Cloudflare's standard products, including:

  • Custom rate limiting algorithms per URL pattern
  • Honeypot endpoint detection (fake URLs that only bots would visit)
  • Custom fingerprinting based on request timing, body shape, or query parameter patterns
  • Challenge-response mechanisms unique to the site
  • Geo-fencing or time-of-day access restrictions
  • Integration with first-party analytics to detect scraping patterns in session data

Sites running custom Workers detection are identifiable by unusual response headers (often cf-cache-status: DYNAMIC combined with non-standard challenge pages) or 403 responses that don't match the standard Cloudflare error page format.

Bypass options

There is no generic bypass for custom Workers detection — by definition, each implementation is unique. Approaches:

  • Analyze the response — Custom challenge pages typically contain JavaScript. Reading it reveals what signals are being evaluated.
  • Respect the honeypot — Never request URLs that are not linked from a real page. Build URL lists by following links from rendered pages, not by guessing paths.
  • Study request patterns — Capture real browser traffic with DevTools and replicate the exact sequence: which endpoints are called, in what order, with what query parameters and timing.
  • Look for unprotected subdomains — Workers rules often apply to specific paths or subdomains. The API subdomain (api.site.com) may have different or no Workers coverage.

Note

Cannot generically bypass — Custom Workers detection requires per-site analysis and adaptation. There is no tool that handles arbitrary custom logic.


Layer 13 — AI Bot Blocker (2024+)

How it works

In July 2024, Cloudflare launched a one-click toggle, available on all plans including Free, that blocks known AI crawler bots. It targets specific User-Agent strings associated with AI scrapers and training data collectors: GPTBot, ClaudeBot, CCBot (Common Crawl), Diffbot, and others.

Bypass

Use a standard browser User-Agent. This feature targets labeled AI crawlers specifically, not general scrapers presenting as browsers. Trivially avoided by any scraper already using a browser User-Agent string.


Layer 14 — AI Labyrinth (Honeypot / Data Poisoning)

Evasion type: Behavioral trap — silent fingerprinting + data poisoning

How it works

AI Labyrinth, launched March 2025 as a free opt-in toggle, is a fundamentally different kind of control. Instead of blocking a suspected bot, Cloudflare embeds hidden nofollow links on the page that lead to a maze of pre-generated, AI-written decoy pages (built with Workers AI, stored in R2, screened for XSS). The decoy content is factually plausible but completely irrelevant to the real site.

The links are invisible to humans (via styling and metadata) and marked so compliant search crawlers ignore them. But a scraper that blindly follows every link walks into the maze. Two things then happen:

  1. Data poisoning — The scraper fills its database with AI-generated nonsense mixed in with real data. If that data trains a model, it can contribute to model collapse.
  2. Silent fingerprinting — "No real human goes four links deep into a maze of AI-generated nonsense." Anything that does is definitively flagged as a bot, and that fingerprint is shared across Cloudflare's entire network — not just the site being scraped.

Warning

AI Labyrinth is dangerous precisely because it is silent. Traditional protection gives instant feedback — you see a CAPTCHA or a 403 and know your scraper needs work. With AI Labyrinth, your scraper may appear to be working perfectly while collecting garbage and getting your IP and fingerprint permanently flagged network-wide. The damage is invisible until you audit the data.

Bypass / avoidance options

  • Never follow nofollow links. AI Labyrinth links carry rel="nofollow". A scraper that respects nofollow (and only follows links a human would plausibly click) never enters the maze. Filter these out at the link-extraction step.
  • Only follow links that are visible to humans. Check computed CSS — links hidden via display:none, visibility:hidden, zero-size, or off-screen positioning are honeypots. Render the page and extract only visually present, clickable links.
  • Build URL lists from rendered, visible content — not from raw HTML link enumeration, and never by guessing or brute-forcing paths.
  • Validate content relevance. If you start seeing pages whose topic drifts away from the site's actual subject matter (e.g. an e-commerce site suddenly serving science encyclopedia articles), you are in the maze. Stop and discard the session.
  • The deeper defense is not triggering detection at all — AI Labyrinth only activates when a session is already considered suspicious. A high-trust session (mobile IP, clean fingerprint, human-like behavior) is far less likely to be served the honeypot links in the first place.
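
The first two avoidance rules can be applied at the link-extraction step. The sketch below uses only the standard library and checks inline attributes (`rel`, `style`, `hidden`, `aria-hidden`); it is an approximation, because links hidden through external stylesheets or off-screen positioning can only be caught by rendering the page and reading computed styles. `SafeLinkExtractor` and `extract_followable_links` are hypothetical names.

```python
from html.parser import HTMLParser

# Inline-style markers only; stylesheet-based hiding needs a rendered page.
HIDDEN_MARKERS = ("display:none", "visibility:hidden")


class SafeLinkExtractor(HTMLParser):
    """Collect hrefs, skipping nofollow links and links hidden inline."""

    def __init__(self):
        super().__init__()
        self.links = []    # links considered safe to follow
        self.skipped = []  # links rejected as possible honeypots

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        rel = (attr.get("rel") or "").lower().split()
        style = (attr.get("style") or "").lower().replace(" ", "")
        hidden = (
            "hidden" in attr
            or attr.get("aria-hidden") == "true"
            or any(marker in style for marker in HIDDEN_MARKERS)
        )
        if "nofollow" in rel or hidden:
            self.skipped.append(href)
        else:
            self.links.append(href)


def extract_followable_links(html: str) -> list[str]:
    parser = SafeLinkExtractor()
    parser.feed(html)
    return parser.links
```

Keeping the rejected links in `skipped` makes it possible to log how many candidate honeypots a page carried, which is a useful early signal that a session is being treated as suspicious.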

Caution

Crawl politely and shallowly. Aggressive depth-first crawling that follows every link is exactly the behavior AI Labyrinth is built to punish. Limit crawl depth, respect visible site structure, and treat sudden topical drift as a tripwire.


Layer 15 — Web Bot Auth (May 2025)

Evasion type: Possess-reading — cryptographically unforgeable (where mandated)

How it works

Web Bot Auth is a cryptographic agent passport system proposed in May 2025 and folded into Cloudflare's verified-bot program in mid-2025. Sites that opt in can require connecting agents to present a cryptographically signed credential proving membership in a registered, authorized set.

Technically: an agent signs selected components of each HTTP request — at minimum the @authority field it is addressing, together with a short created/expires validity window that bounds replay — under a private Ed25519 key. The corresponding public key is published at a well-known directory. The edge fetches the published key and validates the signature using the HTTP Message Signatures standard (RFC 9421). No TLS, HTTP/2, header, fingerprint, or behavioral emulation substitutes for the secret — the artifact is a signature over the request, not a reproducible client byte sequence.

This is the first layer in the entire detection lineage that is possess-reading at the protocol level. There is no forge-the-artifact path.
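
For a registered agent that holds its own key, the signed artifact is the RFC 9421 signature base. The sketch below only builds that string for the `@authority` component and the `created`/`expires` window described above; the `tag` value, the parameter order, and the key identifier are assumptions, and the Ed25519 signing step is left as a comment because it needs a cryptography library. The function names are hypothetical.

```python
def signature_params(covered, created, expires, keyid, tag="web-bot-auth"):
    """Serialize the covered components and parameters (an RFC 9421 inner list).

    The tag value and parameter order are assumptions for illustration.
    """
    components = " ".join(f'"{name}"' for name in covered)
    return (
        f"({components});created={created};expires={expires}"
        f';keyid="{keyid}";tag="{tag}"'
    )


def signature_base(component_values, params):
    """Build the string that gets signed: one line per covered component,
    followed by the @signature-params line, joined by newlines."""
    lines = [f'"{name}": {value}' for name, value in component_values.items()]
    lines.append(f'"@signature-params": {params}')
    return "\n".join(lines)


# Signing (not shown): sign signature_base(...).encode() with the agent's
# private Ed25519 key, then send the result as
#   Signature-Input: sig1=<params>
#   Signature: sig1=:<base64 signature>:
```

Because the base covers the authority and a short validity window, a captured signature is useless against another host or after expiry, which is why header or fingerprint emulation cannot substitute for the key.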

Warning

Current deployment fails open. Web Bot Auth is currently a positive identification signal, not a hard gate. A request bearing a valid signature is exempted from bot challenges as an authenticated agent. A request bearing no signature is not thereby blocked — it simply falls back to the rest of the stack. Web Bot Auth becomes a true admission gate only where a site explicitly mandates a valid signature. Most sites have not done this.

Current adoption state (mid-2026):

  • IETF standardization is underway — standards-track specifications were targeted for the IESG by April 2026, with a Best Current Practice document on key management due August 2026. This is moving from one vendor's proposal to an internet standard.
  • Cross-platform adoption is broad — beyond Cloudflare, support has been implemented by AWS WAF, Vercel, Shopify, and Akamai. It is becoming shared infrastructure rather than a Cloudflare-only feature.
  • Payment networks have adopted it — Visa's Trusted Agent Protocol (TAP) and Mastercard Agent Pay both build on the Web Bot Auth signature mechanism, positioning it as the "front door" for agentic commerce.
  • Major agent operators sign by default — OpenAI (Operator), Browserbase, Anchor Browser, and Amazon Bedrock AgentCore Browser now sign their traffic. Cloudflare validates signatures against keys published in its Radar bot directory.
  • Scale context — AI bot traffic on Cloudflare's network already exceeds 10 billion requests per week; the CEO has predicted bot traffic will surpass human traffic by 2027. This is the trajectory the entire bot-management industry is betting on.

Bypass:

  • Against sites that do not mandate signatures (the current majority): Falls back to the standard stack. Treat as any other Cloudflare-protected target.
  • Against sites that mandate a valid signature: ❌ No bypass. This is cryptographic authentication. The only path is legitimate agent registration via the Cloudflare Bot Submission Form (or the equivalent for whichever CDN/WAF the site uses).

The long-run trajectory is toward mandatory rather than optional signatures. With IETF standardization, multi-CDN support, and payment-network integration all landing in 2026, the window in which Web Bot Auth "fails open" everywhere is closing. Understanding this layer now is preparation for targets that will harden it over 2026–2027.


Layer 16 — Cloudflare Access (Zero Trust)

How it works

Login-gated content behind Cloudflare's identity provider. Requires authentication via SSO, OAuth, or email OTP before any page content is served.

Caution

Bypass: Not applicable. This is authentication, not anti-bot protection. Bypassing without credentials constitutes unauthorized access. If content requires login, the correct path is authenticated scraping using a legitimately obtained account and session.


Cloudflare Bypass Summary Table

The Evasion Type column applies the emit-versus-possess framework: emit-reading layers are structurally reproducible; possess-reading layers resist forgery by design.

| Layer | Evasion Type | Difficulty | Best Tool / Approach | Bypassable? |
| --- | --- | --- | --- | --- |
| IP Reputation | Hybrid (emit + possess) | Medium | Mobile carrier proxy (sticky) | ✅ Yes (economic cost) |
| TLS / JA3 / JA4 | Emit | Medium | curl_cffi impersonate="chrome" alias | ✅ Yes |
| Post-Quantum TLS (X25519MLKEM768) + ECH | Emit | Low–Medium | curl_cffi chrome alias (auto-included) | ✅ Yes (keep profile current) |
| HTTP/2 + HTTP/3 Fingerprint | Emit | Medium | curl_cffi (handled by impersonate) | ✅ Yes |
| HTTP Header Order | Emit | Medium | curl_cffi (handled by impersonate) | ✅ Yes |
| Browser JS Fingerprint + WebRTC | Emit (coupled) | High | Camoufox (disable WebRTC) / Patchright | ✅ Mostly |
| CDP Detection | Emit → Possess | High | nodriver | ⚠️ Partially |
| Behavioral / ML (per-zone) | Possess | High | Profile persistence + session warming + low concurrency | ⚠️ Partially |
| Managed Challenge | Emit + bound | Medium | nodriver + cookie reuse (sticky IP) | ✅ Yes |
| Turnstile | Emit + bound | Medium–High | CapSolver / 2Captcha | ✅ Yes (paid) |
| Bot Fight Mode (free) | Emit | Low | curl_cffi + residential/mobile IP | ✅ Yes |
| Super Bot Fight Mode | Emit | Low–Medium | curl_cffi + correct Sec-Fetch headers | ✅ Yes |
| Under Attack Mode | Emit + bound | Medium | nodriver/Camoufox + cookie reuse | ✅ Yes (every session must solve) |
| ML Bot Score v9 | Emit + Possess | Very High | Full stack (all above combined) | ⚠️ Inconsistent |
| Workers Custom Detection | Arbitrary | Variable | Per-site analysis; no generic tool | ⚠️ Site-specific |
| AI Bot Blocker | Declaration | Very Low | Browser User-Agent | ✅ Trivially |
| AI Labyrinth (honeypot) | Behavioral trap | Low (if careful) | Ignore nofollow + hidden links; crawl shallow | ✅ Avoid by not triggering |
| Web Bot Auth | Possess | N/A | | ⚠️ Fails open currently; ❌ No if site mandates |
| Cloudflare Access | N/A (auth) | N/A | | ❌ No (requires credentials) |

Recommended Stacks by Scenario

flowchart TD
    Start([Target site protected by Cloudflare]) --> Q0{Is the page already<br/>in the Wayback Machine?}
    Q0 -->|Yes, and snapshot is<br/>fresh enough| ScenE[Scenario E: Wayback fetch<br/>━━━━━━━━━━<br/>Zero anti-bot<br/>Cheapest option]
    Q0 -->|No / too stale| Q1{Is there a JSON API<br/>behind the page?}
    Q1 -->|Yes| ScenC[Scenario C: API Interception<br/>━━━━━━━━━━<br/>Cleanest live solution<br/>Usually unprotected]
    Q1 -->|No| Q2{Does the site serve<br/>a JS challenge?}
    Q2 -->|No| ScenA[Scenario A: curl_cffi<br/>+ residential proxy<br/>━━━━━━━━━━<br/>70-80% of cases]
    Q2 -->|Yes| Q3{Request volume?}
    Q3 -->|Low<br/>under 100/hr| ScenB1[nodriver/Camoufox<br/>inline for each request]
    Q3 -->|High<br/>over 100/hr| ScenB2[Scenario B: Solve once,<br/>reuse cf_clearance<br/>━━━━━━━━━━<br/>Most efficient pattern]
    ScenB2 --> Q4{Is Bot Management v9<br/>actively tuned?}
    Q4 -->|No| Done([✓ Working solution])
    Q4 -->|Yes| ScenD[Scenario D: Managed API<br/>━━━━━━━━━━<br/>ScrapFly/BrightData/ScrapingBee]

Scenario A — Basic Protection (Bot Fight Mode / No JS Challenge)

The correct starting point for any scraping project. Covers roughly 70–80% of Cloudflare-protected sites encountered in practice.

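from curl_cffi import requests

url = "https://target-site.com"  # placeholder target

session = requests.Session()
response = session.get(
    url,
    impersonate="chrome",  # alias that tracks the latest Chrome fingerprint
    proxies={"https": "http://user:pass@residential-proxy:port"},  # placeholder proxy
)
print(response.status_code, len(response.text))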

Tip

Do not pass a custom headers dict — let impersonate set headers automatically to preserve correct order.


Scenario B — Managed Challenge / Turnstile / Under Attack Mode

Solve once, reuse cookie. Requires sticky session proxy.

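import nodriver as uc
from curl_cffi import requests as cffi_requests

# Placeholder sticky-session proxy. The browser and curl_cffi must both exit
# through this same IP, because cf_clearance is bound to it.
STICKY_PROXY = "http://user:pass@proxy-sticky:port"

async def solve_challenge(url: str) -> dict:
    """Open the page in a real browser, wait out the challenge, return cookies."""
    browser = await uc.start(
        headless=False,
        browser_args=[
            # Chrome ignores credentials embedded in --proxy-server; an
            # authenticated proxy needs IP allowlisting or a local forwarder.
            f"--proxy-server={STICKY_PROXY}",
            # Stop WebRTC from sending UDP around the proxy (real-IP leak).
            "--force-webrtc-ip-handling-policy=disable_non_proxied_udp",
        ],
        user_data_dir="./browser_profile",  # persist profile
    )
    await browser.get(url)
    await browser.sleep(4)  # fixed wait; slow challenges may need longer
    # browser.cookies is a CookieJar; get_all() returns objects with .name/.value
    cookies = {c.name: c.value for c in await browser.cookies.get_all()}
    browser.stop()
    return cookies

def scrape(url: str, cookies: dict) -> str:
    session = cffi_requests.Session()
    for name, value in cookies.items():
        session.cookies.set(name, value)
    # cf_clearance is also tied to the User-Agent that solved the challenge,
    # so the impersonated Chrome version should match the browser used above.
    resp = session.get(
        url,
        impersonate="chrome",
        proxies={"https": STICKY_PROXY},  # same IP as solve
    )
    resp.raise_for_status()
    return resp.text

async def main():
    cookies = await solve_challenge("https://target-site.com")
    html = scrape("https://target-site.com/data", cookies)
    print(html[:500])

if __name__ == "__main__":
    # nodriver's documented entry point; it manages its own event loop
    uc.loop().run_until_complete(main())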
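
In production the solve step is usually wrapped in a small cache, so the browser is only launched when no fresh cookie exists for a given target and sticky IP. The sketch below is one possible shape, not a fixed recipe: `ClearanceCache` is a hypothetical helper, the `solver` callable stands in for a synchronous wrapper around the browser solve, and the 25-minute TTL is an assumption (the real cookie lifetime is set by the site's configuration).

```python
import time
from typing import Callable, Dict, Tuple


class ClearanceCache:
    """Cache challenge cookies per (host, proxy) pair with a time-to-live."""

    def __init__(
        self,
        solver: Callable[[str, str], Dict[str, str]],
        ttl_seconds: float = 1500.0,  # assumed lifetime; tune per target
        clock: Callable[[], float] = time.monotonic,
    ):
        self._solver = solver
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[Tuple[str, str], Tuple[float, Dict[str, str]]] = {}

    def get(self, host: str, proxy: str) -> Dict[str, str]:
        key = (host, proxy)  # cookies are IP-bound, so the proxy is part of the key
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        cookies = self._solver(host, proxy)  # expensive: launches a browser
        self._store[key] = (now, cookies)
        return cookies

    def invalidate(self, host: str, proxy: str) -> None:
        """Drop a cached entry, e.g. after a 403 or a fresh challenge page."""
        self._store.pop((host, proxy), None)
```

Including the proxy in the cache key enforces the sticky-IP rule structurally: a cookie solved on one IP can never be handed to a request leaving through another.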

Scenario C — API Interception (Often the Cleanest Solution)

Before investing in browser automation, check whether the target site's data is served by an underlying JSON API. Modern web applications built on React, Vue, or Next.js almost universally fetch their data from API endpoints that are separate from the HTML-serving layer. These API endpoints are frequently:

  • Less aggressively protected than the HTML CDN layer
  • Served from a different subdomain (api.site.com, cdn-api.site.com) that may have weaker or no Cloudflare coverage
  • Accessible to mobile apps (iOS/Android) via a distinct API with different — often weaker — bot protection
  • Returning clean structured JSON rather than requiring HTML parsing

How to find them:

  1. Open the target page in Chrome DevTools → Network tab → filter by Fetch/XHR
  2. Navigate the page as a real user and observe all API calls made
  3. Look for calls returning JSON that contains the data you need
  4. Note the endpoint URL, required headers, and any authentication tokens

How to replicate them:

from curl_cffi import requests

# Copy the exact headers from DevTools → right-click request → "Copy as cURL"
# Then translate to curl_cffi:
proxy = "http://user:pass@residential-proxy:port"  # placeholder proxy
session = requests.Session()
resp = session.get(
    "https://api.target-site.com/v2/listings",
    impersonate="chrome",
    params={"page": 1, "limit": 50},
    headers={
        "x-api-key": "observed-from-devtools",
        "origin": "https://target-site.com",
    },
    proxies={"https": proxy},
)
data = resp.json()

API endpoints often don't serve Cloudflare managed challenges (no HTML page to inject JS into), so curl_cffi with TLS + header impersonation is frequently sufficient. This approach eliminates the need for browser automation entirely on many targets.

Tip

Mobile app APIs: If the site has an Android or iOS app, install the app in an emulator with HTTP traffic proxied through mitmproxy or Charles. Mobile APIs often use OAuth tokens with longer validity than browser sessions, are lighter on bot protection, and return cleaner data structures.


Scenario D — Full Bot Management v9 / High-Security Target

For sites with Cloudflare's paid Bot Management actively configured and tuned, DIY maintenance becomes a full-time engineering task. Cloudflare updates its detection models continuously. Consider:

Managed scraping APIs (ScrapFly, BrightData, ScrapingBee, ZenRows): Handle TLS fingerprinting, browser rendering, proxy rotation, and challenge solving internally. Per-request cost of $0.003–$0.01 is almost always lower than the engineering time required to maintain reliable DIY bypasses against v9 targets.

If you must DIY, the minimum working stack:

  • Mobile carrier IP on a sticky session
  • curl_cffi impersonate="chrome" (auto-tracks latest)
  • cf_clearance + __cf_bm obtained via nodriver with persistent --user-data-dir
  • WebRTC disabled in the browser
  • Session warming (homepage → referrer chain → target)
  • Gamma-distributed delays (mean ~3s, shape ~2)
  • All requests on the exact same proxy IP as the challenge solve
  • Concurrency cap at 1–3 parallel sessions per IP
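
The gamma-distributed delay in the list above can be drawn with the standard library. For a gamma distribution the mean equals shape × scale, so a mean of about 3 s at shape 2 implies a scale of 1.5. The `floor` parameter and the function name `human_delay` are assumptions added for illustration.

```python
import random


def human_delay(rng: random.Random, mean: float = 3.0, shape: float = 2.0,
                floor: float = 0.3) -> float:
    """Draw one inter-request delay in seconds from a gamma distribution.

    mean = shape * scale, so scale is derived from the requested mean.
    The floor (an assumed safeguard) prevents near-zero gaps.
    """
    scale = mean / shape
    return max(floor, rng.gammavariate(shape, scale))
```

Typical use is `time.sleep(human_delay(random.Random()))` between requests. The gamma shape gives mostly short pauses with an occasional long one, which is less regular than a uniform jitter around a fixed interval.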

Scenario E — Wayback Machine Fallback (When All Else Fails)

When a live target proves unbypassable — or before burning effort and proxy budget on a hard target — check whether the page is already archived. The Internet Archive's Wayback Machine has captured over 800 billion pages, and an archived snapshot sidesteps the live site's entire protection stack:

  • No live anti-bot. The archived copy is served by web.archive.org, not the protected origin. There is no Cloudflare challenge, no Turnstile, no Bot Management — because you are not touching the protected site at all. The Wayback crawler already paid that cost, often years ago, sometimes before the protection even existed.
  • Static HTML. Most snapshots are plain HTML, far easier to parse than a live JS-rendered page.
  • Historical dimension. You can pull multiple snapshots over time to study how content (pricing, listings, messaging) evolved — something the live site can't give you.

The catch: snapshots may be stale, incomplete (missing CSS/JS/images), or simply absent for the URL you want. This is a fallback and a historical-data source, not a real-time scraping solution. And web.archive.org itself rate-limits and returns frequent 503s, so be gentle and add retry logic.

The CDX API — query what's archived before fetching:

import time

import requests as plain_requests  # archive.org has no anti-bot; plain requests is fine

def find_snapshots(target_url: str, from_date=None, to_date=None) -> list:
    """Query the Wayback CDX API for all archived snapshots of a URL."""
    params = {
        "url": target_url,
        "output": "json",
        "collapse": "digest",   # dedupe identical captures
        "fl": "timestamp,original,statuscode",
    }
    if from_date:
        params["from"] = from_date   # YYYYMMDD
    if to_date:
        params["to"] = to_date
    resp = plain_requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params=params, timeout=30,
    )
    resp.raise_for_status()  # surface 429/503 here instead of failing inside .json()
    rows = resp.json()
    return rows[1:] if rows else []   # first row is the column header

def fetch_snapshot(timestamp: str, original_url: str) -> str:
    """Fetch a specific archived snapshot. The 'id_' suffix returns the
    raw original capture without the Wayback navigation toolbar injected."""
    archive_url = f"https://web.archive.org/web/{timestamp}id_/{original_url}"
    for attempt in range(3):
        resp = plain_requests.get(archive_url, timeout=30)
        if resp.status_code == 200:
            return resp.text
        time.sleep(2 ** attempt)   # archive.org 503s frequently — back off
    return ""

# Example: get the most recent archived copy of a blocked page
snaps = find_snapshots("https://target-site.com/blocked-page")
if snaps:
    ts, original, status = snaps[-1]   # last = most recent
    html = fetch_snapshot(ts, original)
    print(f"Recovered snapshot from {ts}: {len(html)} bytes")
else:
    print("No archive available — fall back to live scraping")

Tip

The id_ suffix in the snapshot URL (/web/{timestamp}id_/{url}) is the key detail: it returns the raw original capture without Wayback's injected navigation toolbar and rewritten links, giving you clean HTML to parse. Without it you get the human-facing archive view wrapped in archive.org's chrome.

Note

If a page has never been archived, you can request a fresh capture by hitting https://web.archive.org/save/{url} — but this triggers archive.org's own crawler against the live (protected) site, which may itself be blocked. It works for lightly protected pages and is worth trying as a one-off, but it is not a reliable bulk strategy.

This is the option most scraping guides omit: the cheapest bypass is often not bypassing the live site at all. Always check the archive before committing engineering effort to a hard target.

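Bot protection responses use specific HTTP status codes, plus Cloudflare-specific error codes (the 1xxx values, which appear in the error page body alongside an HTTP 403 or 429), each with a distinct meaning. Handling them correctly prevents wasted requests and unnecessary blocks.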

| Status | Meaning | Correct Response |
| --- | --- | --- |
| 200 (with challenge HTML) | JS challenge served as 200 | Detect by checking response.text for cf-browser-verification or __cf_chl_ in body. Re-solve the challenge. |
| 403 | Hard block or expired cf_clearance | Rotate proxy IP, re-solve challenge, rebuild session. |
| 429 | Rate limited | Exponential backoff: wait 2^attempt * random(0.5, 1.5) seconds. Rotate IP after 3 failures. |
| 503 | Under Attack Mode or origin overloaded | Wait 10–30 seconds, solve managed challenge, retry. |
| 1020 (with HTTP 403) | Cloudflare firewall rule block | IP is specifically blocked by a WAF rule. Switch to mobile carrier IP and fresh session. |
| 1015 (with HTTP 429) | Cloudflare rate limit | Slow down. Use Retry-After header value if present. |
| 1010 (with HTTP 403) | Browser integrity check failed | CDP/browser fingerprint detection. Switch to nodriver or Camoufox. |

Detecting silent challenges (200 with challenge body):

import asyncio
import random
import time

def is_challenge_page(response) -> bool:
    """Return True if the body looks like a Cloudflare challenge page."""
    body = response.text.lower()  # lowercase once, not once per signal
    signals = [
        "cf-browser-verification",
        "checking your browser",
        "jschl_vc",
        "jschl_answer",
        "__cf_chl_",
    ]
    return any(s in body for s in signals)

def fetch_with_retry(url, session, proxy, max_retries=3):
    # solve_challenge() is the coroutine defined in Scenario B.
    for attempt in range(max_retries):
        resp = session.get(url, impersonate="chrome",
                           proxies={"https": proxy})
        if is_challenge_page(resp):
            new_cookies = asyncio.run(solve_challenge(url))
            for name, value in new_cookies.items():
                session.cookies.set(name, value)
            continue
        if resp.status_code == 429:
            wait = (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(wait)
            continue
        if resp.status_code == 403:
            raise Exception("Hard block — rotate IP and rebuild session")
        resp.raise_for_status()
        return resp
    raise Exception(f"Failed after {max_retries} retries")
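
The table's advice to honor Retry-After when present can be folded into the backoff calculation. The sketch below is one way to do it under stated assumptions: `retry_delay` is a hypothetical helper, `headers` is assumed to behave like a dict keyed by the exact name `Retry-After`, and the 300-second cap is arbitrary. The header may carry either a number of seconds or an HTTP date, so both forms are handled.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay(headers, attempt, now=None, cap=300.0, rng=random):
    """Seconds to wait before retrying: Retry-After if usable, else backoff."""
    value = headers.get("Retry-After")
    if value:
        value = value.strip()
        if value.isdigit():  # delta-seconds form, e.g. "12"
            return min(float(value), cap)
        try:  # HTTP-date form, e.g. "Thu, 01 Jan 2026 00:00:30 GMT"
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            when = None
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            now = now or datetime.now(timezone.utc)
            return min(max((when - now).total_seconds(), 0.0), cap)
    # No usable header: exponential backoff with jitter, as in the table.
    return min((2 ** attempt) * rng.uniform(0.5, 1.5), cap)
```

A caller would replace the fixed backoff expression with `time.sleep(retry_delay(resp.headers, attempt))`, so a server-provided wait takes precedence over the local estimate.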

What No Longer Works in 2026

| Tool / Technique | Status | Reason |
| --- | --- | --- |
| cloudscraper | ❌ Dead | Challenge format has changed multiple times since last meaningful update |
| FlareSolverr | ❌ Dead | Cloudflare detects its browser fingerprint reliably since early 2025. Deprecated and unmaintained. |
| playwright-extra stealth plugin | ❌ Dead | Deprecated February 2025; Cloudflare fingerprinting detects it reliably |
| undetected-chromedriver standalone | ❌ Insufficient | Does not address CDP protocol detection; use nodriver instead |
| Datacenter proxies | ❌ Blocked | All major datacenter ASNs are in Cloudflare's blocklist |
| Raw requests or httpx | ❌ Flagged | TLS fingerprint, HTTP/2 frames, and header order all wrong |
| Custom headers dict overriding curl_cffi impersonate | ⚠️ Breaks header order | Overrides the correct header order set by impersonation |
| Hard-pinned impersonate profiles (e.g. chrome120) | ⚠️ Degraded + signal | Missing PQ key share; a stale profile is itself anomalous. Use the chrome alias. |
| ISP proxies on heavy Bot Management | ⚠️ Unreliable | Increasingly flagged by tightened ASN-level signals |
| Rotating proxies with cf_clearance reuse | ❌ Broken | Cookie is IP-bound; rotating breaks the session. Use sticky sessions. |
| Headless browser without WebRTC disabled | ⚠️ Leaks real IP | WebRTC STUN reveals machine IP even through proxy |
| High concurrent sessions per IP | ⚠️ Flagged | 10+ parallel sessions from one IP is a behavioral signal |
| Blindly following every link (incl. nofollow/hidden) | ❌ Walks into AI Labyrinth | Triggers honeypot fingerprinting + poisons your data. Follow only human-visible links. |
| Aggressive depth-first crawling | ⚠️ Triggers honeypots | AI Labyrinth and Workers honeypots specifically target this behavior |

Legal and Ethical Considerations

Important

Bypassing bot detection is a technical capability, not a legal right. Before using any of these techniques, consider the following.

  • Terms of Service — Most websites prohibit automated access in their ToS. Violating ToS may expose you to civil liability depending on jurisdiction, even for publicly accessible content.
  • Computer fraud laws — In some jurisdictions (CFAA in the US, Computer Misuse Act in the UK), accessing a system in a manner that circumvents technical protection measures may carry legal risk even for publicly visible content.
  • Rate and impact — Even technically permitted scraping can become tortious if the request volume materially impairs the target server's ability to serve other users.
  • Data protection regulations — If scraped data includes personal information (names, emails, profiles, contact details), GDPR, CCPA, PDPA, and similar regulations may apply to how that data is stored, processed, and shared.
  • robots.txt — While not legally binding in most jurisdictions, respecting robots.txt is an established community norm and is considered in legal assessments of scraping intent.
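
Checking robots.txt before a crawl takes only a few lines with the standard library. The sketch below parses robots.txt text that has already been fetched, so it makes no network call; `allowed_by_robots` and the `my-crawler` agent name are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules for illustration only.
EXAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
```

Calling `allowed_by_robots(EXAMPLE_ROBOTS, "my-crawler", url)` at the point where URLs enter the crawl queue keeps the check in one place and leaves a record of what the site asked for.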

The techniques in this article are documented for developers engaged in legitimate data collection — price monitoring, research, content aggregation of public information, accessibility tools, competitive intelligence, and similar use cases. Where an official API exists for the target data, use it — it is faster, more stable, and free of legal ambiguity.


Conclusion

Modern bot protection in 2026 is a multi-layer composite system, not a single wall. The critical insight is that all active layers must be passed simultaneously — a stack that is perfect on five layers but wrong on one will still fail. And the layers are not equally hard: the emit-versus-possess distinction determines where durable resistance actually comes from.

The practical hierarchy:

  1. Check the Wayback Machine first — before spending any effort on a hard target, query the CDX API. An archived snapshot sidesteps the live protection stack entirely and is the cheapest possible "bypass." Often the page you need is already sitting in 800+ billion captures.
  2. Try API interception next — many targets expose unprotected or lightly protected JSON APIs that eliminate the need for browser automation entirely.
  3. curl_cffi with the impersonate="chrome" alias is the baseline for all HTTP-level scraping. It resolves TLS (including X25519MLKEM768), HTTP/2, HTTP/3, and header order fingerprinting in a single call, and the alias auto-tracks the latest browser fingerprint so you never ship a stale profile. The four transport-and-header layers (TLS, the post-quantum key share, the HTTP/2 and HTTP/3 fingerprint, and header order) are all emit-reading and collapse into one effective term, so solving them together with curl_cffi is the right approach.
  4. nodriver is the strongest available open-source tool for JS challenge solving and CDP detection avoidance — it won an independent June 2026 benchmark outright (zero blocks across 31 Cloudflare targets) by keeping Playwright entirely out of the protocol stack.
  5. Camoufox is the best choice when Firefox fingerprints are needed or when Chrome-shaped tools are specifically being targeted. Always disable WebRTC. (Note: all Playwright-driven tools, including Camoufox, Patchright, and CloakBrowser, remain detectable at the automation-protocol layer; only nodriver avoids it.)
  6. Mobile carrier IPs on sticky sessions provide the best IP reputation foundation. Never use datacenter IPs.
  7. Cookie reuse (solve once with browser → execute many with curl_cffi) is the most resource-efficient production pattern. Reuse both cf_clearance and __cf_bm. The address-binding of cf_clearance is intentional — it manufactures a possess-like property out of the token.
  8. Persistent browser profiles meaningfully improve behavioral scoring at zero additional cost.
  9. Concurrency control matters — cap parallel sessions per IP at 1–3 to avoid behavioral flags.
  10. Never follow hidden or nofollow links — AI Labyrinth turns aggressive crawling into both a fingerprinting trap and a data-poisoning attack. Crawl shallow, follow only human-visible links, and treat sudden topical drift as a tripwire.
  11. Correct error handling (detecting silent 200 challenge pages, exponential backoff on 429) is essential for production reliability.
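
The error-handling point in item 11 can be sketched as a response classifier plus a backoff schedule. This is a minimal illustration, not a definitive detector. The function names `classify_response` and `backoff_delay` are hypothetical, and the body markers are heuristic assumptions that can produce false positives on pages that legitimately contain those strings. The `cf-mitigated: challenge` response header is the signal Cloudflare documents for challenge pages. The backoff uses the common "full jitter" variant, which is one reasonable choice among several.

```python
import random

# Heuristic, assumed markers of a challenge page body. They are not an official
# or exhaustive list and should be validated against real responses.
CHALLENGE_BODY_MARKERS = ("just a moment", "challenge-platform", "cf-chl")


def classify_response(status: int, headers: dict, body: str) -> str:
    """Label a response as 'ok', 'challenge', 'rate_limited', 'blocked', or 'error'."""
    lowered = {k.lower(): str(v).lower() for k, v in headers.items()}

    # Header check first: it catches challenge pages regardless of status code.
    if lowered.get("cf-mitigated") == "challenge":
        return "challenge"
    if status == 429:
        return "rate_limited"

    # Body check catches challenge HTML served with a 200 status.
    text = body.lower()
    if any(marker in text for marker in CHALLENGE_BODY_MARKERS):
        return "challenge"
    if status in (403, 503):
        return "blocked"
    if 200 <= status < 300:
        return "ok"
    return "error"


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=random.random) -> float:
    """Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2**attempt))."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

A caller would typically sleep for `backoff_delay(attempt)` seconds after a `rate_limited` result, hand a `challenge` result back to the browser-based solver, and treat only `ok` as parseable content. The `rng` parameter exists so the schedule can be made deterministic in tests.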

The two structural hard limits:

  • Web Bot Auth — unforgeable by cryptographic design where a site mandates a valid signature. Most deployments still fail open (unsigned requests fall back to the regular stack), but that window is closing fast: IETF standardization landed in 2026, AWS WAF / Vercel / Shopify / Akamai have implemented it, and Visa and Mastercard built their agentic-commerce protocols on top of it. Monitor per-target as enforcement spreads.
  • Cloudflare Access — authentication, not anti-bot protection. No bypass without credentials.

The deeper principle: The transport-fingerprint layers (TLS, HTTP/2, header order) attract the most practitioner attention but are structurally the shallowest — they are all emit-reading artifacts that a faithful impersonator reproduces exactly. Durable assurance lives in the behavioral (possess-reading, per-zone, non-portable) and cryptographic layers. As imitation tooling continues closing the emit-readable gap, bot management migrates toward properties that cannot be emitted: verified behavioral continuity and, ultimately, cryptographic agent identity. The shift is already visible — AI Labyrinth poisons rather than blocks, and Web Bot Auth replaces imitation with a held secret.


Note

Last updated: June 2026. Bot detection systems update continuously; validate tool versions and success rates against current community benchmarks before production deployment.
