Bypassing Cloudflare
Note
Scope: This article covers every known detection layer used by Cloudflare, the tools and techniques available to bypass each one, and — critically — the layers that currently have no reliable bypass. It is written for developers building legitimate web scrapers and data pipelines who need to understand the full landscape of modern bot protection in 2026.
- Background: How Modern Bot Protection Works
- Layer 1 — IP Reputation
- Layer 2 — TLS / JA3 / JA4 Fingerprinting
- Layer 3 — HTTP/2 and HTTP/3 Fingerprinting
- Layer 4 — HTTP Header Order Fingerprinting
- Layer 5 — Browser / JavaScript Fingerprinting
- Layer 6 — CDP / DevTools Protocol Detection
- Layer 7 — Behavioral / ML Signals
- Layer 8 — Managed Challenge (JavaScript Challenge)
- Layer 9 — Cloudflare Turnstile CAPTCHA
- Layer 10 — Bot Fight Mode Tiers
- Layer 11 — ML Bot Score v9
- Layer 12 — Cloudflare Workers Custom Detection
- Layer 13 — AI Bot Blocker
- Layer 14 — AI Labyrinth (Honeypot)
- Layer 15 — Web Bot Auth
- Layer 16 — Cloudflare Access
- Bypass Summary Table
- Recommended Stacks by Scenario
- What No Longer Works in 2026
- Legal and Ethical Considerations
- Conclusion
Background: How Modern Bot Protection Works
The bot detection industry shifted from simple IP blocklists to composite trust scoring between 2022 and 2026. Modern systems like Cloudflare no longer make a binary "bot or not" decision on a single signal. Instead, they compute a score from multiple independent detection layers simultaneously. A scraper only gets through if it passes all active layers at once — failing even one is usually enough to trigger a challenge or outright block.
```mermaid
flowchart LR
    Req[Incoming Request] --> Layers{Detection Layers<br/>evaluated in parallel}
    Layers --> L1[IP Reputation]
    Layers --> L2[TLS / JA4 Fingerprint]
    Layers --> L3[HTTP/2 + Header Order]
    Layers --> L4[Browser JS Fingerprint]
    Layers --> L5[CDP Detection]
    Layers --> L6[Behavioral / ML]
    L1 --> Score[Composite Trust Score]
    L2 --> Score
    L3 --> Score
    L4 --> Score
    L5 --> Score
    L6 --> Score
    Score --> Decision{Score-based decision}
    Decision -->|High trust| Allow[✓ 200 OK — content served]
    Decision -->|Uncertain| Challenge[JS Challenge or Turnstile]
    Decision -->|Low trust| Block[✗ 403 / 1020 — blocked]
    Challenge -->|solved| Allow
    Challenge -->|failed| Block
```
The composite score is non-compensatory: a strong score on many layers does not cancel out a single strongly anomalous one. If we write pᵢ for the probability that a given adversarial strategy passes detector dᵢ, admission requires passing every active detector, so the overall evasion probability is bounded by the single hardest layer:
P_evade ≤ min(p₁, p₂, ..., pₙ)
This is the formal explanation of why no silver bullet exists. Correcting a previously failing layer yields no improvement once some other layer is the binding constraint: the floor is now set by the new minimum. In the other direction, a defender gains disproportionate value from adding even one layer with near-zero pass probability, since it drives P_evade toward zero regardless of what all other layers score.
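The bound can be checked numerically. The sketch below uses made-up pass probabilities (the layer names and values are illustrative, not measurements) and also prints the product of the probabilities, which is what the evasion probability would be under the extra assumption that layers are independent.

```python
from math import prod


def evasion_bounds(pass_probs):
    """Return (upper_bound, independent_estimate) for passing every layer."""
    if not pass_probs:
        raise ValueError("need at least one layer")
    if any(not 0.0 <= p <= 1.0 for p in pass_probs):
        raise ValueError("probabilities must lie in [0, 1]")
    upper = min(pass_probs)         # the bound stated in the text
    independent = prod(pass_probs)  # assumption: layers are independent
    return upper, independent


# Illustrative values only: the behavioral layer is the binding constraint.
layers = {"ip": 0.9, "tls": 0.95, "js": 0.8, "behavioral": 0.1}
before = evasion_bounds(list(layers.values()))

layers["js"] = 0.99  # improve a layer that is not the minimum
after = evasion_bounds(list(layers.values()))

print(before, after)
```

Raising the `js` pass probability leaves the upper bound unchanged, because the minimum is still set by the behavioral layer.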
The most useful lens for understanding why different layers have different bypass hardness is not "network vs. browser" — it is whether a layer reads an emitted artifact or a possessed property:
| Type | Definition | Evasion hardness |
|---|---|---|
| Emit-reading | Inspects an artifact the client transmits at connection time: TLS ClientHello bytes, HTTP/2 frame ordering, header sequence. The artifact carries no secret and no history, so anything a genuine browser emits can be reproduced verbatim by a client engineered to do so. | Low — structurally reproducible |
| Possess-reading | Interrogates a property the client must continuously hold and cannot fabricate on demand: accumulated per-zone behavioral history, or a private cryptographic key. These resist forgery not because the check is obscure, but because reproducing them requires genuinely being what the check tests for. | High — structurally resistant |
This distinction predicts bypass hardness more reliably than any other framing. Every layer that reads an emitted artifact is, in the limit, reproducible. Every layer that reads a continuously possessed property resists forgery.
Important
The four transport-and-header layers (TLS, HTTP/2, HTTP/3, header order) all read emitted artifacts and are strongly correlated: a client built to reproduce one browser's network stack tends to pass them together. They do not act as four independent barriers; they collapse into a single effective term. Adding another emit-reading layer barely moves P_evade, whereas adding one possess-reading layer can drive it to near zero.
Cloudflare protects an estimated 74–79% of all websites that use any bot mitigation tooling, making it by far the most commonly encountered wall in web scraping. Its detection stack has grown substantially more sophisticated since 2024, with the introduction of per-zone ML models, JA4+ fingerprinting, post-quantum TLS signals, and the new Web Bot Auth system.
Layer 1: IP Reputation
How it works
Every incoming IP is scored against Cloudflare's global threat intelligence database. The score incorporates: known datacenter ASN ranges (AWS, GCP, Azure, Hetzner, DigitalOcean, etc.), Tor exit node lists, residential proxy pool reputation (shared flagged history propagates across users of the same pool), and CGNAT classification for mobile carrier IPs.
IP reputation is a hybrid layer in the emit-versus-possess framework: the address itself is freely chosen (emitted), but the reputation attached to it is a possessed property built up over time — something an adversary can rent but not fabricate. Its residual hardness is economic rather than technical. Datacenter ranges are cheaply and safely blockable because legitimate human traffic rarely originates there. Mobile carrier ranges are nearly unblockable because a single CGNAT address multiplexes thousands of real subscribers, so any block inflicts mass collateral damage.
Bypass options
| IP Type | Effectiveness (2026) | Notes |
|---|---|---|
| Mobile carrier (4G/5G) | ✅ Best | CGNAT economics prevent aggressive blocking — one IP = thousands of real subscribers. Cloudflare cannot block without mass collateral damage to legitimate traffic. |
| Real residential | | Shared pool history means flagged IPs propagate fast. Success rates dropped significantly in 2025–2026 as Cloudflare tightened ASN-level signals. |
| ISP (static residential) | | Better than datacenter, worse than real residential. Unreliable on heavily protected targets. |
| Datacenter | ❌ Blocked | Cloudflare maintains a complete registry of all major datacenter ASN ranges. These IPs are flagged on arrival regardless of other signals. |
| Tor exit nodes | ❌ Blocked | Cloudflare maintains a complete Tor exit node list and challenges or blocks all of them by default. |
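The table can be read as a preference order. Here is a minimal sketch of encoding it as a lookup, assuming hypothetical tier names (`mobile`, `residential`, `isp`, `datacenter`, `tor`) that do not correspond to any proxy provider's API.

```python
# Preference order taken from the table; tier names are illustrative.
PREFERENCE = ["mobile", "residential", "isp"]
BLOCKED_ON_ARRIVAL = {"datacenter", "tor"}


def pick_proxy_tier(available, high_security=False):
    """Return the best usable tier from `available`, or None if none is usable."""
    usable = [
        tier for tier in PREFERENCE
        if tier in available and tier not in BLOCKED_ON_ARRIVAL
    ]
    if high_security:
        # This layer's recommendation: mobile carrier IPs for high-security targets.
        usable = [tier for tier in usable if tier == "mobile"]
    return usable[0] if usable else None


choice = pick_proxy_tier({"datacenter", "residential", "isp"})
strict = pick_proxy_tier({"datacenter", "residential"}, high_security=True)
print(choice, strict)
```

Returning `None` instead of falling back to a datacenter IP keeps the failure visible to the caller.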
Tip
Mobile carrier proxies for high-security targets; real residential proxies for medium-difficulty targets; never use datacenter IPs for Cloudflare-protected scraping.
Layer 2: TLS / JA3 / JA4 Fingerprinting
Evasion type: Emit-reading — structurally reproducible
How it works
When any HTTP client connects over HTTPS, it sends a TLS ClientHello handshake. Cloudflare extracts a fingerprint from this handshake and compares it against a database of over 15 million known-good browser profiles analyzed daily.
- JA3 (2017) — MD5 hash of TLS version, cipher suites, extensions, elliptic curves, and elliptic curve formats.
- JA4 (2023, now standard) — Multi-part fingerprint covering TLS version, SNI presence, cipher count, extension count, ALPN, and separate hashes for ciphers and extensions. More resistant to evasion than JA3. Cloudflare now checks both.
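To make the JA3 construction concrete, here is a sketch of how the hash is assembled from already-parsed ClientHello fields. It assumes the fields arrive as lists of integers (the parsing itself is out of scope) and drops GREASE values before hashing; the sample values are illustrative, not a real browser profile.

```python
import hashlib


def is_grease(value: int) -> bool:
    """GREASE values (RFC 8701) look like 0x0a0a, 0x1a1a, ..., 0xfafa."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(version, ciphers, extensions, curves, point_formats) -> str:
    """Comma-separated fields, each a dash-separated list of decimal values."""
    fields = [ciphers, extensions, curves, point_formats]
    parts = [str(version)] + [
        "-".join(str(v) for v in field if not is_grease(v)) for field in fields
    ]
    return ",".join(parts)


def ja3_hash(version, ciphers, extensions, curves, point_formats) -> str:
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    # MD5 is used as a fingerprint here, not for security.
    return hashlib.md5(raw.encode("ascii"), usedforsecurity=False).hexdigest()


# Illustrative input: 0x0A0A is a GREASE cipher value and is dropped.
sample = (771, [0x0A0A, 4865, 4866], [0, 23], [29, 23], [0])
print(ja3_string(*sample))
print(ja3_hash(*sample))
```

Because the cipher and extension lists are hashed in the order sent, two clients offering the same ciphers in a different order produce different JA3 hashes. This order sensitivity is why an ordinary HTTP library stands out.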
Python's requests, httpx, aiohttp, and most standard HTTP clients produce a TLS handshake that looks nothing like Chrome. Cloudflare identifies the mismatch in milliseconds — not because of what you requested, but because of how you connected.
Because the handshake carries no secret, a client engineered to reproduce a target browser's exact byte sequence becomes indistinguishable on this axis. Its defensive value is real but bounded: it reliably catches naive clients and contributes nothing against a faithful impersonator.
Post-Quantum TLS (X25519MLKEM768) and Encrypted ClientHello
Chrome 131 (November 2024) shipped the standardized hybrid post-quantum key share X25519MLKEM768 — classical X25519 combined with ML-KEM-768, the NIST FIPS 203 standard — by default in every ClientHello, superseding the earlier X25519Kyber768Draft00 experiment. Every Chrome release since carries it. Cloudflare folds this into the JA4 fingerprint, so a scraper that claims to be a current Chrome but omits the post-quantum key share contradicts itself with a hard anomaly signal.
Encrypted ClientHello (ECH) is rolling out gradually in 2026. ECH encrypts the SNI field, hiding the destination hostname from network observers. Chrome enables ECH opportunistically when supported by the target. Scrapers that don't support ECH may stand out as the only "modern Chrome" clients not using it, depending on the destination.
This layer remains emit-reading: updating the impersonation profile to include the correct key share is sufficient to neutralize the PQ check. It penalizes stale profiles rather than providing structural protection.
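The consistency check described here can be sketched from the defender's side. The function name and return strings are hypothetical. The group identifiers are the TLS named-group codepoints for X25519MLKEM768 and the earlier draft Kyber hybrid, and the cutoff of 131 follows the Chrome release mentioned above.

```python
from typing import Optional

X25519 = 0x001D
X25519MLKEM768 = 0x11EC          # standardized hybrid group
X25519KYBER768_DRAFT00 = 0x6399  # earlier draft experiment
FIRST_CHROME_WITH_MLKEM = 131


def pq_key_share_anomaly(claimed_chrome_major: int, key_share_groups) -> Optional[str]:
    """Return a description of the mismatch, or None if the claim is consistent."""
    groups = set(key_share_groups)
    if claimed_chrome_major < FIRST_CHROME_WITH_MLKEM:
        return None  # this sketch only checks claims of Chrome 131 or later
    if X25519MLKEM768 in groups:
        return None
    if X25519KYBER768_DRAFT00 in groups:
        return "stale profile: draft Kyber key share on a Chrome >= 131 claim"
    return "no post-quantum key share on a Chrome >= 131 claim"


current = pq_key_share_anomaly(146, [X25519MLKEM768, X25519])
stale = pq_key_share_anomaly(146, [X25519])
print(current, stale)
```

The check is cheap and deterministic, which is why a stale impersonation profile is penalized so reliably.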
Important
Don't hard-pin a Chrome version — use the impersonate="chrome" alias. Since curl_cffi v0.15.1, the bare chrome, safari, and firefox aliases automatically track the latest supported fingerprint, so you no longer need to chase version numbers manually. An old pinned profile (e.g. chrome120) is itself a detection signal in 2026, because no real user runs a two-year-old browser. If you must pin, pin a current release (real tools in mid-2026 run Chrome 145–148), and run curl-cffi update to pull the newest fingerprints without upgrading the package. Profiles missing the standardized X25519MLKEM768 key share create an anomaly against a post-quantum-aware fingerprinter.
Bypass options
- `curl_cffi` (Python): The most practical Python solution. Wraps `curl-impersonate` to reproduce exact browser TLS signatures including cipher suite order, extensions, ALPN, and post-quantum key shares. Since v0.15.x it also supports HTTP/3 fingerprints (Chrome 145/146, Firefox 147) and runtime fingerprint updates via `curl-cffi update`. Near drop-in replacement for `requests`:

  ```python
  from curl_cffi import requests

  url = "https://example.com"  # placeholder target

  # The bare "chrome" alias auto-tracks the latest supported fingerprint
  response = requests.get(url, impersonate="chrome")
  ```

- `tls-client` (Python/Go): More fingerprint profiles available; slightly less actively maintained than `curl_cffi`.
- `cycletls` (Go): Preferred in Go-based scraping pipelines; excellent for Akamai-grade JA4 spoofing as well.
- `got-scraping` (Node.js): Native Node.js solution with automatic header generation matching the impersonated browser.
What does NOT work:
- Raw `requests` or `httpx`: Both produce non-browser TLS signatures. `httpx` with `http2=True` fixes the protocol version but not the cipher suite order or extension fingerprint.
- Any HTTP client that doesn't reproduce the browser handshake at the network stack level, as `curl-impersonate` does.
- Stale `impersonate` profiles (e.g. `chrome120` in 2026): they lack the standardized post-quantum key share and, ironically, are themselves a detection signal. Use the `chrome` alias instead.
Layer 3: HTTP/2 and HTTP/3 Fingerprinting
Evasion type: Emit-reading — structurally reproducible (collapses with Layer 2 and 4)
How it works
Real browsers send HTTP/2 with a specific frame ordering: SETTINGS → WINDOW_UPDATE → HEADERS. The SETTINGS frame contains specific parameter values (header table size, initial window size, max concurrent streams) that are version-specific to each browser release. Cloudflare's heuristics engine includes approximately 50 hand-written rules targeting HTTP/2 frame fingerprints.
HTTP/3 / QUIC fingerprinting: Current Chrome prefers HTTP/3 (over QUIC) when the server advertises it via the Alt-Svc header. Cloudflare has HTTP/3 enabled on most zones by default. QUIC has its own fingerprinting characteristics — transport parameters, frame patterns, and initial packet structure all vary by client. A scraper using HTTP/2 against a Cloudflare zone that advertises HTTP/3 isn't blocked outright, but the fingerprint mismatch is a weak signal. For maximum stealth on HTTP/3-enabled zones, use a client that supports it.
Python `requests` speaks only HTTP/1.1. Even `httpx` with `http2=True` sends a non-browser fingerprint: the frame types are right, but the SETTINGS parameter values and the frame ordering are not. Neither speaks HTTP/3.
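One way to picture what is being compared is to collapse the SETTINGS values, the WINDOW_UPDATE increment, and the pseudo-header order into one string. The notation below is loosely modelled on common passive HTTP/2 fingerprint strings. The function name is hypothetical and both sample profiles are illustrative values, not captures from a specific browser build.

```python
def h2_fingerprint(settings, window_update, pseudo_header_order):
    """Collapse HTTP/2 connection preface details into a comparable string.

    settings: ordered (identifier, value) pairs from the SETTINGS frame
    window_update: increment from the first WINDOW_UPDATE frame
    pseudo_header_order: e.g. [":method", ":authority", ":scheme", ":path"]
    """
    settings_part = ";".join(f"{ident}:{value}" for ident, value in settings)
    order_part = ",".join(name.lstrip(":")[0] for name in pseudo_header_order)
    return f"{settings_part}|{window_update}|{order_part}"


# Illustrative browser-like profile.
browser_like = h2_fingerprint(
    [(1, 65536), (2, 0), (4, 6291456), (6, 262144)],
    15663105,
    [":method", ":authority", ":scheme", ":path"],
)

# Illustrative generic-library profile: same frame types, different values.
library_like = h2_fingerprint(
    [(1, 4096), (2, 0), (4, 65535)],
    16777216,
    [":method", ":path", ":scheme", ":authority"],
)

print(browser_like)
print(library_like)
```

Both clients speak valid HTTP/2, yet the strings differ in every field. This is the sense in which enabling HTTP/2 alone does not help.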
Bypass options
- `curl_cffi`: Fixes TLS and HTTP/2 frame ordering simultaneously with the same `impersonate` parameter, and supports HTTP/3 via the underlying `curl` binary when compiled with QUIC support. This triple fix is the primary reason it is the recommended baseline tool.
- Real browser (Camoufox, Patchright, nodriver, any): Sends correct HTTP/2 or HTTP/3 by definition since it is a real browser engine.
Warning
Enabling HTTP/2 in httpx or aiohttp does not resolve fingerprinting. The frames are sent, but with wrong parameter values and in the wrong order. The TLS + HTTP/2 combination must match a specific browser version exactly.
Layer 4: HTTP Header Order Fingerprinting
Evasion type: Emit-reading — structurally reproducible (collapses with Layer 2 and 3)
How it works
Browser engines send HTTP request headers in a deterministic, version-specific order. For Chrome on Windows, the order is consistent across millions of real requests: Host, Connection, Pragma, Cache-Control, sec-ch-ua, sec-ch-ua-mobile, sec-ch-ua-platform, Upgrade-Insecure-Requests, User-Agent, Accept, Sec-Fetch-Site, Sec-Fetch-Mode, Sec-Fetch-User, Sec-Fetch-Dest, Accept-Encoding, Accept-Language.
In HTTP/2, pseudo-headers (:method, :path, :scheme, :authority) also follow a browser-specific order before the regular headers.
A scraper that uses the correct User-Agent and TLS fingerprint but sends headers in a different order — or omits Sec-Fetch headers entirely — produces an anomalous signature. Cloudflare evaluates header order as a signal independent of TLS. This is why manually constructing headers in requests or httpx doesn't work even if every individual header value is correct: it's the order that matters, not just the presence.
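A sketch of the kind of check this implies, using the Chrome-on-Windows order quoted above as the reference. The function and its anomaly messages are hypothetical; a real detector would key the reference order on the claimed browser version.

```python
REFERENCE_ORDER = [
    "host", "connection", "pragma", "cache-control", "sec-ch-ua",
    "sec-ch-ua-mobile", "sec-ch-ua-platform", "upgrade-insecure-requests",
    "user-agent", "accept", "sec-fetch-site", "sec-fetch-mode",
    "sec-fetch-user", "sec-fetch-dest", "accept-encoding", "accept-language",
]
RANK = {name: index for index, name in enumerate(REFERENCE_ORDER)}


def header_order_anomalies(sent_names):
    """Compare the order of sent header names against the reference order."""
    names = [name.lower() for name in sent_names]
    anomalies = []
    ranks = [RANK[name] for name in names if name in RANK]
    if ranks != sorted(ranks):
        anomalies.append("known headers out of reference order")
    missing = [
        name for name in REFERENCE_ORDER
        if name.startswith("sec-fetch-") and name not in names
    ]
    if missing:
        anomalies.append("missing " + ", ".join(missing))
    return anomalies


browser_like = list(REFERENCE_ORDER)
hand_built = ["User-Agent", "Accept", "Host", "Accept-Encoding"]

print(header_order_anomalies(browser_like))
print(header_order_anomalies(hand_built))
```

Every header value in `hand_built` could be correct and the request would still be flagged, because only names and positions are inspected.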
Bypass options
- `curl_cffi` with `impersonate`: The `impersonate` parameter sets the correct header order alongside TLS and HTTP/2. This is handled automatically; no manual configuration needed. This is the third reason `curl_cffi` is the baseline tool: it fixes Layers 2, 3, and 4 simultaneously.
- Real browser: Header order is inherently correct.
- `DrissionPage`: A Python library that unifies requests-style control and browser-based control in a single API, with built-in anti-detection including correct header ordering:

  ```python
  from DrissionPage import ChromiumPage

  page = ChromiumPage()
  page.get(url)
  html = page.html
  ```
Caution
When using `curl_cffi`, do not pass a custom `headers` dict unless you are certain of the correct order for the browser version you are impersonating. A hand-built `headers={...}` dict replaces the impersonated order with your own, breaking the impersonation. Let `impersonate` handle headers automatically; only override specific values when necessary.
Layer 5: Browser / JavaScript Fingerprinting
Evasion type: Emit-reading (high-dimensional, coupled) — moderate hardness
How it works
When a challenge page or protected page loads, Cloudflare's injected JavaScript actively collects device and browser fingerprints:
- Canvas fingerprint: Renders a hidden canvas element and hashes the output. GPUs produce slightly different outputs per hardware/driver combination. Headless browsers produce consistent, suspiciously uniform outputs.
- WebGL fingerprint: GPU renderer string and vendor. Headless Chrome reports "SwiftShader" or "Mesa"; both are immediate bot signals.
- Audio context fingerprint: Processes a short audio buffer through the AudioContext API. Results vary by hardware configuration.
- Font enumeration: Checks for system fonts that correlate with the claimed OS/platform.
- `navigator.webdriver`: Set to `true` in all standard Playwright, Selenium, and Puppeteer instances. Cloudflare checks this directly.
- Permissions API anomalies: Headless browsers handle notification, camera, and microphone permission queries differently than real browsers.
- Plugin list: Real Chrome has browser plugins; headless Chrome has none by default.
- Screen dimensions and color depth: Headless defaults are often unset, zero, or inconsistent with the claimed operating system.
- JS engine timing: Real V8 engines (Chrome) and SpiderMonkey (Firefox) have characteristic `performance.now()` resolution and execution timing patterns. Automated browsers driven through CDP introduce subtle timing anomalies.
- WebRTC IP leak: Real IP leakage through WebRTC STUN requests, which can reveal the machine's actual IP even when routing traffic through a proxy.
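Several of these signals are simple consistency rules once the values have been collected. The sketch below assumes the fingerprint arrives as a plain dictionary with hypothetical key names (`webdriver`, `webgl_renderer`, `plugins`, `screen`, `ua_platform`, `font_platform`). It illustrates the logic only, not Cloudflare's actual scoring.

```python
SOFTWARE_RENDERERS = ("swiftshader", "mesa")


def fingerprint_flags(fp: dict) -> list:
    """Return human-readable reasons a collected fingerprint looks automated."""
    flags = []
    if fp.get("webdriver"):
        flags.append("navigator.webdriver is true")
    renderer = str(fp.get("webgl_renderer", "")).lower()
    if any(name in renderer for name in SOFTWARE_RENDERERS):
        flags.append("software WebGL renderer")
    if not fp.get("plugins"):
        flags.append("empty plugin list")
    width, height = fp.get("screen", (0, 0))
    if width <= 0 or height <= 0:
        flags.append("unset screen dimensions")
    ua_os, font_os = fp.get("ua_platform"), fp.get("font_platform")
    if ua_os and font_os and ua_os != font_os:
        flags.append("fonts do not match claimed OS")
    return flags


headless = {
    "webdriver": True,
    "webgl_renderer": "Google SwiftShader",
    "plugins": [],
    "screen": (0, 0),
    "ua_platform": "Windows",
    "font_platform": "Linux",
}
desktop = {
    "webdriver": False,
    "webgl_renderer": "ANGLE (NVIDIA GeForce RTX 3060)",
    "plugins": ["PDF Viewer"],
    "screen": (1920, 1080),
    "ua_platform": "Windows",
    "font_platform": "Windows",
}

print(fingerprint_flags(headless))
print(fingerprint_flags(desktop))
```

The last rule is the coupling that makes this layer harder than the transport layers: each value has to be plausible on its own and consistent with the others.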
WebRTC — critical for proxy-based scraping
Warning
WebRTC can expose the real IP address of the machine running the browser, regardless of proxy or VPN configuration. Cloudflare's JavaScript probes can trigger WebRTC STUN requests and detect if the reported IP differs from the connection IP. This affects all browser-based scrapers using proxies unless WebRTC is explicitly disabled.
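- Chromium/Chrome (Patchright, nodriver): Restrict WebRTC at launch, for example with `--force-webrtc-ip-handling-policy=disable_non_proxied_udp`, or use a WebRTC leak prevention extension. The often-quoted `--disable-webrtc` does not appear among the documented Chromium switches, so verify it against your build before relying on it.
- Firefox (Camoufox): Set `media.peerconnection.enabled = false` in Firefox preferences before launch:

  ```python
  from camoufox.sync_api import Camoufox

  with Camoufox(
      headless=False,
      firefox_user_prefs={"media.peerconnection.enabled": False},
  ) as browser:
      page = browser.new_page()
      page.goto(url)
  ```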
Bypass options
- Camoufox: Firefox-based stealth browser, actively maintained as of 2026. Patches canvas, WebGL, audio, font, and navigator fingerprints. Performs well on targets where Chrome-shaped fingerprints are specifically being targeted. WebRTC is disabled via prefs as shown above.

  ```python
  from camoufox.sync_api import Camoufox

  with Camoufox(headless=False) as browser:
      page = browser.new_page()
      page.goto(url)
      html = page.content()
  ```

- Patchright: Chromium-based fork of Playwright with 49+ C++ patches targeting fingerprint anomalies. Preferred when Chrome fingerprints are specifically required. Restrict WebRTC through its launch args as well.
- SeleniumBase UC Mode: Patches `navigator.webdriver` and several other common signals. Lighter weight than Camoufox or Patchright; appropriate for mid-difficulty targets.
What does NOT work in 2026:
- `playwright-extra` with the stealth plugin: Deprecated by its maintainers in February 2025. Cloudflare fingerprinting detects it reliably.
- Standard Playwright or Puppeteer without patching: Flagged on `navigator.webdriver` alone, before any fingerprint analysis.
Layer 6: CDP / DevTools Protocol Detection
Evasion type: Emit → Possess (transitional) — partial barrier, not fully solved
How it works
Chrome DevTools Protocol (CDP) is the interface used by Playwright, Puppeteer, Selenium, and all similar automation tools to control a browser. Cloudflare's JavaScript probes for CDP artifacts that persist even after common surface-level patches like setting navigator.webdriver = false. Detection points include:
- Specific property descriptors on `window` and `document` that CDP modifies at a low level
- Timing anomalies in how certain JS APIs respond when driven by automation
- Inconsistencies in how `Runtime.evaluate` and `Page.addScriptToEvaluateOnLoad` modify the JavaScript environment
- The presence of `__playwright_target__` and similar internal properties
Bypass options
- `nodriver`: Successor to `undetected-chromedriver`. Talks to the browser over its own direct CDP connection, with no Selenium, WebDriver, or Playwright layer in between, which avoids the framework artifacts that detectors target. Currently the most effective freely available tool against automation-protocol-level detection:

  ```python
  import nodriver as uc


  async def main():
      browser = await uc.start(headless=False)
      page = await browser.get(url)
      await page.sleep(2)
      html = await page.get_content()
      browser.stop()


  uc.loop().run_until_complete(main())
  ```

- Headful mode (`headless=False`): Always produces better results than headless across all browser tools. GPU renderer, screen properties, and several timing-based signals only appear correct in headful mode.
- CloakBrowser: A newer patched-Chromium fork (49 source-level C++ fingerprint modifications) with a drop-in Playwright API. An alternative to Patchright for Chrome-based stealth; free, MIT-licensed wrapper.
Note
2026 benchmark reality check. An independent June 2026 benchmark (7 stealth tools, 31 Cloudflare targets, 651 verdicts, headed mode on a residential IP) found a clear split: nodriver passed all 31 targets with zero blocks — the only tool to do so. Patchright, CloakBrowser, and Camoufox clustered in the middle. Notably, raw curl_cffi (a ~21-line wrapper, no JS engine) tied CloakBrowser on raw pass count — confirming that for TLS/header-only targets, the lightweight HTTP client matches a heavy Chromium fork. The decisive factor for Turnstile-protected targets was having zero Playwright in the protocol stack, which is exactly what nodriver achieves and what every Playwright fork (Camoufox, Patchright, CloakBrowser) cannot.
Important
Partially unresolvable: Camoufox, Patchright, and CloakBrowser all use Playwright's automation protocol internally, so they remain detectable by deep automation-protocol fingerprinting even with all their other patches applied. nodriver is the only freely available tool that meaningfully addresses this specific layer by replacing the control channel wholesale. For the hardest targets, no open-source tool fully solves CDP detection. One caveat: nodriver does not simulate mouse movement, scroll, or keystroke timing, so it gets through CDP detection but can still be caught by heavy behavioral scoring (Layer 7) — pair it with the behavioral mitigations below.
Layer 7: Behavioral / ML Signals
Evasion type: Possess-reading — high, non-portable hardness
How it works
Since 2025, Cloudflare deploys per-zone ML anomaly models trained separately per website zone — what looks like normal behavior on a developer documentation site may be flagged as anomalous on an e-commerce checkout flow. Customer data is not shared across zones. Behavioral signals include:
- Request timing regularity — Perfectly timed requests at fixed intervals are a strong bot signal. Human browsing has irregular, non-uniform timing.
- Navigation patterns — Real users follow referrer chains (homepage → listing → detail page). Scrapers often jump directly to deep URLs with no referrer.
- Scroll and mouse events — On JS-rendered pages, sessions that never scroll or move the mouse before reading content are distinguishable from human sessions.
- Cookie and session age — Fresh sessions with zero history are treated as higher risk than returning sessions with accumulated cookies and visit history.
- Inter-request intervals — Sub-second intervals between full page loads are a strong bot signal.
- Session history depth — Real users typically have several pages of history; a session that opens directly to a scraping target with no prior navigation looks anomalous.
- Concurrent session count per IP — Multiple parallel sessions from the same IP — especially with similar fingerprints — is itself a bot signal. Real users rarely have 10+ tabs hammering the same site in parallel. Keep concurrency per IP modest (1–3 parallel sessions max for most targets).
- Inter-request signals (Aug 2024+) — Cloudflare correlates fingerprints, JA4 hashes, and behavior across multiple requests from the same source over time. A single request looks fine in isolation but stands out across a session window. Mitigated only by being consistent across the entire session.
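The first and fifth signals reduce to simple statistics over request timestamps. The sketch below flags a session when the gaps are too regular (low coefficient of variation) or too short. Both thresholds are assumptions chosen for illustration, not values Cloudflare is known to use.

```python
from statistics import mean, stdev


def request_gaps(timestamps):
    """Seconds between consecutive requests, given ascending timestamps."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def looks_scripted(timestamps, cv_threshold=0.1, min_gap=1.0):
    """True if the gaps are sub-second or suspiciously uniform."""
    gaps = request_gaps(timestamps)
    if len(gaps) < 2:
        raise ValueError("need at least three timestamps")
    if min(gaps) < min_gap:
        return True  # sub-second interval between page loads
    # Coefficient of variation: spread of the gaps relative to their mean.
    return stdev(gaps) / mean(gaps) < cv_threshold


fixed_interval = [0.0, 2.0, 4.0, 6.0, 8.0]
irregular = [0.0, 3.1, 9.4, 11.0, 19.7]

print(looks_scripted(fixed_interval), looks_scripted(irregular))
```

A fixed two-second sleep passes the minimum-gap rule but fails the regularity rule, which is why a constant delay is not enough.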
Bypass options
- Randomize delays between requests. `random.uniform(1.5, 5.0)` is a workable minimum, but exponential or gamma distributed delays look more natural and human-like than uniform ones.
- Session warm-up: Visit the homepage first, let JavaScript execute fully, accumulate cookies, then navigate naturally to target pages. Simulate the referrer chain a real user would follow.
- Pass realistic `Referer` headers matching natural navigation, e.g. `Referer: https://site.com/listing/` when requesting a detail page.
- Persist cookies and session state across requests. Never start a fresh cookieless session for every scraping run.
- For browser-based scrapers, add randomized mouse movement before any interaction, using Playwright's `page.mouse.move()` along a bezier curve or through randomized intermediate waypoints.
- Concurrency control: Cap parallel requests per IP at 1–3. If you need more throughput, scale horizontally with more IPs, not more parallel requests per IP.
- Browser profile persistence: Save and reuse the browser's user data directory across scraping sessions. A browser profile that has visited the target site 5 times over several days looks significantly more like a real returning user than one starting cold every run:

  ```python
  # Chromium / Patchright
  browser = await playwright.chromium.launch_persistent_context(
      user_data_dir="/path/to/persistent/profile",
      headless=False,
  )

  # nodriver
  browser = await uc.start(
      headless=False,
      user_data_dir="/path/to/persistent/profile",
  )
  ```

  Store one profile directory per proxy IP to maintain consistency between IP reputation and session history.
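For the delay recommendation, a gamma-distributed sampler with a floor is one option. This is a sketch: the shape, scale, floor, and ceiling are assumptions to calibrate per target, and `human_delay` is a hypothetical helper name.

```python
import random


def human_delay(rng, shape=2.0, scale=1.2, floor=1.5, ceiling=30.0):
    """Sample a right-skewed delay in seconds: mostly short, occasionally long."""
    return min(ceiling, floor + rng.gammavariate(shape, scale))


rng = random.Random(42)  # seeded only so the example is repeatable
delays = [human_delay(rng) for _ in range(1000)]
average = sum(delays) / len(delays)

print(f"min={min(delays):.2f}s avg={average:.2f}s max={max(delays):.2f}s")
```

In a scraper the value would be passed to `time.sleep()` or `asyncio.sleep()` between page loads. The floor keeps every gap above the sub-second range listed earlier as a strong bot signal.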
Important
Per-zone ML makes behavioral bypassing a moving target. What works on one Cloudflare-protected site may not work on another. Calibration per target is necessary for high-value scraping.
Layer 8: Managed Challenge (JavaScript Challenge)
How it works
When Cloudflare's composite score is uncertain — not confident enough to block outright, not confident enough to allow through — it serves an interstitial "Checking your browser..." page. This page executes JavaScript, verifies the client can execute it correctly, evaluates the resulting fingerprint, and if satisfied issues a cf_clearance cookie that grants session access for a limited time.
Cloudflare's two key cookies
- `cf_clearance`: Issued after successfully solving a managed challenge or Turnstile. Grants access to protected content. Typically valid for 30 minutes to 2 hours. Cryptographically bound to the IP address it was issued on. This binding is architecturally significant: by tying the proof-of-solve to the address, Cloudflare manufactures a possess-like property (continued control of one address) out of an otherwise reusable token. This couples the challenge layer to the IP-reputation and behavioral layers, forcing an adversary to defeat them jointly rather than serially.
- `__cf_bm`: Cloudflare Bot Management's session fingerprint cookie. Issued on every visit, even before any challenge. Stores a short-lived (~30 min) fingerprint of the current browser session. Used to track session consistency over the session window. More tightly bound to the specific browser session than `cf_clearance`.
When reusing cookies, you generally need both — cf_clearance proves you solved the challenge, __cf_bm proves the session has been continuous. Missing or mismatched __cf_bm can trigger a re-challenge even when cf_clearance is still valid.
Sticky vs. rotating proxies — critical requirement
Caution
cf_clearance is cryptographically bound to the IP address it was issued on. Using a rotating proxy that changes IP between the solve step and the request step will cause the cookie to be rejected. This is one of the most common and costly mistakes in production scraping pipelines.
Always use sticky session proxies (same IP for the full duration of a scraping session) when implementing cookie reuse. Configure your proxy provider's sticky session mode explicitly — do not rely on defaults.
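Before reusing stored cookies, a pipeline can check the conditions described in this section. The sketch below uses a hypothetical `Clearance` record and a conservative 30-minute age limit taken from the lower end of the validity range; real expiry depends on the zone's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clearance:
    cookies: dict      # cookie name -> value, as extracted from the browser
    exit_ip: str       # proxy exit IP the challenge was solved on
    user_agent: str    # exact User-Agent the browser sent
    issued_at: float   # Unix timestamp of the solve


def reuse_problems(clearance, exit_ip, user_agent, now, max_age_s=30 * 60):
    """Return reasons the stored clearance should not be reused (empty if fine)."""
    problems = []
    for name in ("cf_clearance", "__cf_bm"):
        if name not in clearance.cookies:
            problems.append(f"missing {name}")
    if clearance.exit_ip != exit_ip:
        problems.append("exit IP changed since solve")
    if clearance.user_agent != user_agent:
        problems.append("User-Agent differs from solve")
    if now - clearance.issued_at > max_age_s:
        problems.append("clearance older than max age")
    return problems


stored = Clearance(
    cookies={"cf_clearance": "abc", "__cf_bm": "def"},
    exit_ip="203.0.113.7",
    user_agent="UA-1",
    issued_at=1_000.0,
)
ok = reuse_problems(stored, "203.0.113.7", "UA-1", now=1_600.0)
rotated = reuse_problems(stored, "203.0.113.99", "UA-1", now=1_600.0)
print(ok, rotated)
```

Any non-empty result means the cheaper path is to re-solve in the browser rather than send a request that will be rejected and logged against the IP.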
Bypass options
Option A — Cookie reuse (most efficient for scale)
```mermaid
sequenceDiagram
    autonumber
    participant Scraper
    participant nodriver as Stealth Browser<br/>(nodriver/Camoufox)
    participant curl as curl_cffi
    participant CF as Cloudflare
    participant Origin
    Scraper->>nodriver: Launch with sticky proxy
    nodriver->>CF: GET site
    CF-->>nodriver: Challenge page (JS)
    nodriver->>nodriver: Execute JS challenge
    nodriver->>CF: Submit answer
    CF-->>nodriver: cf_clearance + __cf_bm cookies
    nodriver->>Scraper: Extract cookies + User-Agent
    Scraper->>curl: Pass cookies + same sticky proxy
    loop For each subsequent page (fast path)
        curl->>CF: GET page (with cookies, matching UA)
        CF->>Origin: Forward request
        Origin-->>CF: HTML response
        CF-->>curl: 200 OK + content
    end
    Note over Scraper,curl: When cf_clearance expires (~30min–2hr),<br/>re-solve via browser and refresh cookies
```
Solve the challenge once with a full stealth browser, extract the resulting cookies, and pass them to a lightweight HTTP client for all subsequent requests. cf_clearance is typically valid for 30 minutes to 2 hours:
- Launch `nodriver` or `Camoufox` on a sticky proxy and navigate to the target site
- The browser executes the challenge natively and receives `cf_clearance` + `__cf_bm`
- Extract both cookies plus the exact User-Agent string used
- Pass these cookies to `curl_cffi` with a matching `impersonate` profile and the same User-Agent
- Make all subsequent requests via `curl_cffi`, with no browser overhead per request
- Monitor for 403 responses indicating cookie expiry; re-solve when detected
```python
import nodriver as uc
from curl_cffi import requests as cffi_requests


async def get_clearance_cookies(url: str, proxy: str) -> dict:
    browser = await uc.start(
        headless=False,
        browser_args=[
            f"--proxy-server={proxy}",
            # Keep WebRTC off non-proxied UDP so it cannot leak the real IP
            "--force-webrtc-ip-handling-policy=disable_non_proxied_udp",
        ],
    )
    page = await browser.get(url)
    await page.sleep(4)  # allow challenge to resolve
    # nodriver's cookie jar returns objects with .name / .value attributes
    cookies = {c.name: c.value for c in await browser.cookies.get_all()}
    browser.stop()
    return cookies


def make_request(url: str, cookies: dict, proxy: str) -> str:
    session = cffi_requests.Session()
    for name, value in cookies.items():
        session.cookies.set(name, value)
    resp = session.get(
        url,
        impersonate="chrome",
        proxies={"https": proxy},  # MUST be the same sticky IP
    )
    return resp.text
```

Option B — Solve inline (simpler pipeline)
Use nodriver or Camoufox for every request. Higher resource cost but no cookie management complexity. Appropriate when session volume is low or when JavaScript execution is required to render the page content.
What does NOT work in 2026:
- FlareSolverr: Deprecated. Cloudflare has detected its browser fingerprint reliably since early 2025.
- `cloudscraper`: Challenge format has changed multiple times since its last significant update. Consider it defunct.
Layer 9: Cloudflare Turnstile CAPTCHA
How it works
Turnstile replaced hCaptcha as Cloudflare's user-facing CAPTCHA. It runs a client-side JavaScript challenge that evaluates browser authenticity, canvas fingerprint, and interaction history before asking anything of the user. On lower-risk sessions it may be invisible (non-interactive); on suspicious sessions it presents an interactive checkbox rather than an image puzzle. Unlike the managed challenge, Turnstile actively evaluates the quality of the browser environment itself before issuing a token.
Bypass options
Option A — Click-based solver (free, lower reliability)
A stealth browser (Camoufox or Patchright) automatically clicks the Turnstile checkbox. No external service needed. Works on lower-sensitivity configurations. Standard Playwright without patching is detected and blocked before reaching the click interaction.
Option B — CAPTCHA solving service (paid, higher reliability)
Services like 2Captcha and CapSolver accept the Turnstile challenge parameters, solve via real humans or AI, and return a valid token. Cost is approximately $2–4 per 1,000 solves; latency is 10–30 seconds per solve. CapSolver maintains an AI-based system that adapts to Cloudflare updates faster than most DIY implementations.
Option C — Token reuse
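Turnstile tokens are valid for approximately 5 minutes, but each token is single-use: once the site's backend has validated it, it cannot be redeemed again. What can be reused is the session or clearance cookie the site issues after that validation. For scraping multiple pages in a single session, solve once, let the site set its session state, and carry that state across subsequent requests instead of resubmitting the token.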
Layer 10: Bot Fight Mode Tiers
How it works
Cloudflare offers three distinct protection tiers at the lower end of the bot management stack, often conflated but meaningfully different:
```mermaid
flowchart TD
    Free[🆓 Free Plan] -->|default| BFM[Bot Fight Mode<br/>━━━━━━━━━━<br/>Heuristics only<br/>No ML]
    Pro[💼 Pro / Business] -->|optional| SBFM[Super Bot Fight Mode<br/>━━━━━━━━━━<br/>+ Definitely automated detection<br/>+ Verified bot allowlist<br/>+ Static resource protection]
    Ent[🏢 Enterprise] -->|paid| BM[Bot Management v9<br/>━━━━━━━━━━<br/>Full ML scoring<br/>Per-zone models<br/>Behavioral analysis]
    AnyPlan[Any Plan] -->|manual toggle| UAM[⚠️ Under Attack Mode<br/>━━━━━━━━━━<br/>JS challenge for EVERY visitor<br/>No exceptions by score]
    May2025[🆕 May 2025] -->|opt-in only| WBA[🔐 Web Bot Auth<br/>━━━━━━━━━━<br/>Cryptographic agent passport<br/>Fails open currently]
```
| Tier | Plan | What It Does |
|---|---|---|
| Bot Fight Mode | Free | Simple heuristics: datacenter IP blocklists, missing headers, obviously automated User-Agents, absence of browser cookies. No ML. |
| Super Bot Fight Mode | Pro / Business | Adds "definitely automated" classification, a verified bots allowlist, and protection for static resources (images, CSS, JS). Still heuristic-based but more comprehensive. |
| Under Attack Mode | All plans (manual toggle) | Serves a JavaScript challenge to every visitor regardless of bot score — including legitimate users. Zero exceptions by score. |
Under Attack Mode deserves specific attention. Sites activate it during active DDoS events or when experiencing aggressive scraping. When it is on, it is qualitatively different from all other tiers — no request bypasses the challenge based on reputation or prior session quality. Every connection must execute the JavaScript challenge and obtain cf_clearance before receiving any content.
Bypass options
- Bot Fight Mode (free): curl_cffi with impersonate="chrome" + any residential or mobile IP + correct headers. Bypassed reliably by the baseline stack.
- Super Bot Fight Mode (Pro/Business): Same baseline stack. Slightly more aggressive header checks — ensure Sec-Fetch-* headers are present and correct (handled automatically by curl_cffi impersonation).
- Under Attack Mode: No lightweight bypass. Every session must solve the managed challenge via nodriver or Camoufox, regardless of other signal quality. Implement cookie reuse (Layer 8, Option A) to minimize browser invocations.
How it works
Available to paying Cloudflare customers. The v9 model (default on new zones since late 2025) is a composite ML score combining all previous layers simultaneously: TLS fingerprint, HTTP/2 fingerprint, header order, browser JS signals, CDP detection signals, behavioral patterns, IP reputation, session history, and cross-zone threat intelligence. A site actively using Bot Management v9 is substantially harder to scrape reliably than one on any free or Pro tier protection.
Bypass options
No single tool resolves this. The minimum working stack for a v9 zone:
- Mobile carrier IP on a sticky session — Non-flagged, CGNAT-protected
- curl_cffi with the chrome alias — Correct TLS, HTTP/2, header order, and PQ key share
- cf_clearance + __cf_bm from nodriver or Camoufox — Valid session cookies from a genuine browser challenge solve
- Persistent browser profile — Session history on the target site
- Session warming — Homepage visit, cookie accumulation, realistic referrer chain
- Randomized delays — Non-uniform inter-request intervals
- Concurrency cap — Max 1–3 parallel sessions per IP
With this full stack, success rates are high for light and medium Bot Management configurations. For zones with aggressive behavioral tuning, success remains inconsistent. Targets with v9 + behavioral tuning are typically high-value commercial sites (large e-commerce, ticketing platforms, financial services) rather than general content sites.
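The last two items in the stack (randomized delays and a per-IP concurrency cap) can be sketched with asyncio. This is a minimal illustration, not a tuned implementation: `PerProxyLimiter` and the `fetch_fn` callback are hypothetical names, and the delay bounds are placeholder values.

```python
import asyncio
import random


class PerProxyLimiter:
    """Caps concurrent sessions per proxy IP and spaces requests unevenly."""

    def __init__(self, max_parallel: int = 2,
                 min_delay: float = 1.0, max_delay: float = 5.0):
        self._max_parallel = max_parallel
        self._min_delay = min_delay
        self._max_delay = max_delay
        self._semaphores = {}  # proxy URL -> asyncio.Semaphore

    def _semaphore_for(self, proxy: str) -> asyncio.Semaphore:
        # One semaphore per proxy, created lazily on first use
        if proxy not in self._semaphores:
            self._semaphores[proxy] = asyncio.Semaphore(self._max_parallel)
        return self._semaphores[proxy]

    async def run(self, proxy: str, fetch_fn, *args):
        """Run fetch_fn(*args) once a slot for this proxy is free."""
        async with self._semaphore_for(proxy):
            # Non-uniform pause before each request on this IP
            await asyncio.sleep(random.uniform(self._min_delay, self._max_delay))
            return await fetch_fn(*args)
```

A caller would wrap every request in `await limiter.run(proxy_url, fetch, url)` so that no proxy ever carries more than `max_parallel` in-flight sessions.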
How it works
Cloudflare Workers allows site operators to deploy custom JavaScript that runs at Cloudflare's edge — before a request reaches origin and before standard bot management rules apply. Workers can implement arbitrary detection logic not available in Cloudflare's standard products, including:
- Custom rate limiting algorithms per URL pattern
- Honeypot endpoint detection (fake URLs that only bots would visit)
- Custom fingerprinting based on request timing, body shape, or query parameter patterns
- Challenge-response mechanisms unique to the site
- Geo-fencing or time-of-day access restrictions
- Integration with first-party analytics to detect scraping patterns in session data
Sites running custom Workers detection are identifiable by unusual response headers (often cf-cache-status: DYNAMIC combined with non-standard challenge pages) or 403 responses that don't match the standard Cloudflare error page format.
Bypass options
There is no generic bypass for custom Workers detection — by definition, each implementation is unique. Approaches:
- Analyze the response — Custom challenge pages typically contain JavaScript. Reading it reveals what signals are being evaluated.
- Respect the honeypot — Never request URLs that are not linked from a real page. Build URL lists by following links from rendered pages, not by guessing paths.
- Study request patterns — Capture real browser traffic with DevTools and replicate the exact sequence: which endpoints are called, in what order, with what query parameters and timing.
- Look for unprotected subdomains — Workers rules often apply to specific paths or subdomains. The API subdomain (api.site.com) may have different or no Workers coverage.
Note
Cannot generically bypass — Custom Workers detection requires per-site analysis and adaptation. There is no tool that handles arbitrary custom logic.
How it works
In July 2024, Cloudflare launched a one-click toggle available on all plans including free to block known AI crawler bots. It targets specific User-Agent strings associated with AI scrapers and training data collectors: GPTBot, ClaudeBot, CommonCrawl, CCBot, Diffbot, and others.
Bypass
Use a standard browser User-Agent. This feature targets labeled AI crawlers specifically, not general scrapers presenting as browsers. Trivially avoided by any scraper already using a browser User-Agent string.
Evasion type: Behavioral trap — silent fingerprinting + data poisoning
How it works
AI Labyrinth, launched March 2025 as a free opt-in toggle, is a fundamentally different kind of control. Instead of blocking a suspected bot, Cloudflare embeds hidden nofollow links on the page that lead to a maze of pre-generated, AI-written decoy pages (built with Workers AI, stored in R2, screened for XSS). The decoy content is factually plausible but completely irrelevant to the real site.
The links are invisible to humans (via styling and metadata) and marked so compliant search crawlers ignore them. But a scraper that blindly follows every link walks into the maze. Two things then happen:
- Data poisoning — The scraper fills its database with AI-generated nonsense mixed in with real data. If that data trains a model, it can contribute to model collapse.
- Silent fingerprinting — "No real human goes four links deep into a maze of AI-generated nonsense." Anything that does is definitively flagged as a bot, and that fingerprint is shared across Cloudflare's entire network — not just the site being scraped.
Warning
AI Labyrinth is dangerous precisely because it is silent. Traditional protection gives instant feedback — you see a CAPTCHA or a 403 and know your scraper needs work. With AI Labyrinth, your scraper may appear to be working perfectly while collecting garbage and getting your IP and fingerprint permanently flagged network-wide. The damage is invisible until you audit the data.
Bypass / avoidance options
- Never follow nofollow links. AI Labyrinth links carry rel="nofollow". A scraper that respects nofollow (and only follows links a human would plausibly click) never enters the maze. Filter these out at the link-extraction step.
- Only follow links that are visible to humans. Check computed CSS — links hidden via display:none, visibility:hidden, zero-size, or off-screen positioning are honeypots. Render the page and extract only visually present, clickable links.
- Build URL lists from rendered, visible content — not from raw HTML link enumeration, and never by guessing or brute-forcing paths.
- Validate content relevance. If you start seeing pages whose topic drifts away from the site's actual subject matter (e.g. an e-commerce site suddenly serving science encyclopedia articles), you are in the maze. Stop and discard the session.
- The deeper defense is not triggering detection at all — AI Labyrinth only activates when a session is already considered suspicious. A high-trust session (mobile IP, clean fingerprint, human-like behavior) is far less likely to be served the honeypot links in the first place.
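The first two avoidance rules can be approximated at the link-extraction step. The sketch below uses only the standard library and is deliberately limited: it inspects the rel attribute, the hidden attribute, aria-hidden, and inline styles, so it cannot see links hidden by external stylesheets (that requires rendering the page and reading computed CSS). `SafeLinkExtractor` and `extract_followable_links` are hypothetical names.

```python
from html.parser import HTMLParser

# Inline-style fragments that mark a link as invisible (whitespace removed)
HIDDEN_STYLE_MARKERS = ("display:none", "visibility:hidden")


class SafeLinkExtractor(HTMLParser):
    """Collects hrefs, skipping nofollow links and obviously hidden anchors."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = {name: (value or "") for name, value in attrs}
        href = attr.get("href", "")
        if not href:
            return
        # rel is a space-separated token list, e.g. "nofollow noopener"
        if "nofollow" in attr.get("rel", "").lower().split():
            return
        if "hidden" in attr or attr.get("aria-hidden", "").lower() == "true":
            return
        style = attr.get("style", "").lower().replace(" ", "")
        if any(marker in style for marker in HIDDEN_STYLE_MARKERS):
            return
        self.links.append(href)


def extract_followable_links(html: str) -> list:
    """Return hrefs that pass the nofollow and inline-visibility checks."""
    parser = SafeLinkExtractor()
    parser.feed(html)
    return parser.links
```

Treat this as a first-pass filter in front of a rendered-page visibility check, not a replacement for one.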
Caution
Crawl politely and shallowly. Aggressive depth-first crawling that follows every link is exactly the behavior AI Labyrinth is built to punish. Limit crawl depth, respect visible site structure, and treat sudden topical drift as a tripwire.
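The "topical drift as a tripwire" idea can be sketched as a vocabulary-overlap check combined with a depth limit. This is a crude heuristic under stated assumptions: the reference text is a sample of known-good content from the site, and the thresholds (`min_overlap`, `max_strikes`, `max_depth`) are illustrative placeholders that would need tuning per target. All names here are hypothetical.

```python
import re


def vocabulary(text: str) -> set:
    """Lowercased words of four or more letters, as a crude topic signature."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))


def topical_overlap(reference_text: str, page_text: str) -> float:
    """Fraction of the page's vocabulary that also appears in the reference."""
    page_words = vocabulary(page_text)
    if not page_words:
        return 0.0
    return len(page_words & vocabulary(reference_text)) / len(page_words)


class DriftTripwire:
    """Signals a stop at excessive depth or after consecutive off-topic pages."""

    def __init__(self, reference_text: str, min_overlap: float = 0.15,
                 max_strikes: int = 2, max_depth: int = 3):
        self.reference_text = reference_text
        self.min_overlap = min_overlap
        self.max_strikes = max_strikes
        self.max_depth = max_depth
        self.strikes = 0

    def should_stop(self, page_text: str, depth: int) -> bool:
        if depth > self.max_depth:
            return True
        if topical_overlap(self.reference_text, page_text) < self.min_overlap:
            self.strikes += 1
        else:
            self.strikes = 0  # an on-topic page resets the counter
        return self.strikes >= self.max_strikes
```

When `should_stop` returns True, the session and everything it collected since the last on-topic page are candidates for discarding.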
Evasion type: Possess-reading — cryptographically unforgeable (where mandated)
How it works
Web Bot Auth is a cryptographic agent passport system proposed in May 2025 and folded into Cloudflare's verified-bot program in mid-2025. Sites that opt in can require connecting agents to present a cryptographically signed credential proving membership in a registered, authorized set.
Technically: an agent signs selected components of each HTTP request — at minimum the @authority field it is addressing, together with a short created/expires validity window that bounds replay — under a private Ed25519 key. The corresponding public key is published at a well-known directory. The edge fetches the published key and validates the signature using the HTTP Message Signatures standard (RFC 9421). No TLS, HTTP/2, header, fingerprint, or behavioral emulation substitutes for the secret — the artifact is a signature over the request, not a reproducible client byte sequence.
This is the first layer in the entire detection lineage that is possess-reading at the protocol level. There is no forge-the-artifact path.
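To make the "signature over the request" point concrete, the sketch below assembles the text an agent would sign in the RFC 9421 signature-base format, covering only @authority. It is an illustration under assumptions: the parameter set (created, expires, keyid, alg, tag) follows the description above, the `sig1` label and `build_signature_base` name are hypothetical, and the Ed25519 signing step itself is omitted because it needs a cryptography library and the agent's private key.

```python
def build_signature_base(authority: str, keyid: str,
                         created: int, lifetime_s: int = 60):
    """Return (signature_base, signature_input_header_value).

    The signature base is the exact text the agent signs with its private
    key; the edge rebuilds the same text and verifies it with the public key.
    """
    expires = created + lifetime_s
    params = (
        f'("@authority");created={created};expires={expires}'
        f';keyid="{keyid}";alg="ed25519";tag="web-bot-auth"'
    )
    base = (
        f'"@authority": {authority.lower()}\n'
        f'"@signature-params": {params}'
    )
    # Value for the Signature-Input header, under a hypothetical label
    return base, f"sig1={params}"
```

Because the signed text includes the target authority and a short validity window, a captured signature cannot be replayed against another host or after it expires, and nothing in the text can be produced without the private key.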
Warning
Current deployment fails open. Web Bot Auth is currently a positive identification signal, not a hard gate. A request bearing a valid signature is exempted from bot challenges as an authenticated agent. A request bearing no signature is not thereby blocked — it simply falls back to the rest of the stack. Web Bot Auth becomes a true admission gate only where a site explicitly mandates a valid signature. Most sites have not done this.
Current adoption state (mid-2026):
- IETF standardization is underway — standards-track specifications were targeted for the IESG by April 2026, with a Best Current Practice document on key management due August 2026. This is moving from one vendor's proposal to an internet standard.
- Cross-platform adoption is broad — beyond Cloudflare, support has been implemented by AWS WAF, Vercel, Shopify, and Akamai. It is becoming shared infrastructure rather than a Cloudflare-only feature.
- Payment networks have adopted it — Visa's Trusted Agent Protocol (TAP) and Mastercard Agent Pay both build on the Web Bot Auth signature mechanism, positioning it as the "front door" for agentic commerce.
- Major agent operators sign by default — OpenAI (Operator), Browserbase, Anchor Browser, and Amazon Bedrock AgentCore Browser now sign their traffic. Cloudflare validates signatures against keys published in its Radar bot directory.
- Scale context — AI bot traffic on Cloudflare's network already exceeds 10 billion requests per week; the CEO has predicted bot traffic will surpass human traffic by 2027. This is the trajectory the entire bot-management industry is betting on.
Bypass:
- Against sites that do not mandate signatures (the current majority): Falls back to the standard stack. Treat as any other Cloudflare-protected target.
- Against sites that mandate a valid signature: ❌ No bypass. This is cryptographic authentication. The only path is legitimate agent registration via the Cloudflare Bot Submission Form (or the equivalent for whichever CDN/WAF the site uses).
The long-run trajectory is toward mandate rather than optional. With IETF standardization, multi-CDN support, and payment-network integration all landing in 2026, the window in which Web Bot Auth "fails open" everywhere is closing. Understanding this layer now is preparation for targets that will harden it over 2026–2027.
How it works
Login-gated content behind Cloudflare's identity provider. Requires authentication via SSO, OAuth, or email OTP before any page content is served.
Caution
Bypass: Not applicable. This is authentication, not anti-bot protection. Bypassing without credentials constitutes unauthorized access. If content requires login, the correct path is authenticated scraping using a legitimately obtained account and session.
The Evasion Type column applies the emit-versus-possess framework: emit-reading layers are structurally reproducible; possess-reading layers resist forgery by design.
| Layer | Evasion Type | Difficulty | Best Tool / Approach | Bypassable? |
|---|---|---|---|---|
| IP Reputation | Hybrid (emit + possess) | Medium | Mobile carrier proxy (sticky) | ✅ Yes (economic cost) |
| TLS / JA3 / JA4 | Emit | Medium | curl_cffi impersonate="chrome" alias | ✅ Yes |
| Post-Quantum TLS (X25519MLKEM768) + ECH | Emit | Low–Medium | curl_cffi chrome alias (auto-included) | ✅ Yes (keep profile current) |
| HTTP/2 + HTTP/3 Fingerprint | Emit | Medium | curl_cffi (handled by impersonate) | ✅ Yes |
| HTTP Header Order | Emit | Medium | curl_cffi (handled by impersonate) | ✅ Yes |
| Browser JS Fingerprint + WebRTC | Emit (coupled) | High | Camoufox (disable WebRTC) / Patchright | ✅ Mostly |
| CDP Detection | Emit → Possess | High | nodriver | |
| Behavioral / ML (per-zone) | Possess | High | Profile persistence + session warming + low concurrency | |
| Managed Challenge | Emit + bound | Medium | nodriver + cookie reuse (sticky IP) | ✅ Yes |
| Turnstile | Emit + bound | Medium–High | CapSolver / 2Captcha | ✅ Yes (paid) |
| Bot Fight Mode (free) | Emit | Low | curl_cffi + residential/mobile IP | ✅ Yes |
| Super Bot Fight Mode | Emit | Low–Medium | curl_cffi + correct Sec-Fetch headers | ✅ Yes |
| Under Attack Mode | Emit + bound | Medium | nodriver/Camoufox + cookie reuse | ✅ Yes (every request) |
| ML Bot Score v9 | Emit + Possess | Very High | Full stack (all above combined) | |
| Workers Custom Detection | Arbitrary | Variable | Per-site analysis; no generic tool | |
| AI Bot Blocker | Declaration | Very Low | Browser User-Agent | ✅ Trivially |
| AI Labyrinth (honeypot) | Behavioral trap | Low (if careful) | Ignore nofollow + hidden links; crawl shallow | ✅ Avoid by not triggering |
| Web Bot Auth | Possess | N/A | — | |
| Cloudflare Access | N/A (auth) | N/A | — | ❌ No (requires credentials) |
flowchart TD
Start([Target site protected by Cloudflare]) --> Q0{Is the page already<br/>in the Wayback Machine?}
Q0 -->|Yes, and snapshot is<br/>fresh enough| ScenE[Scenario E: Wayback fetch<br/>━━━━━━━━━━<br/>Zero anti-bot<br/>Cheapest option]
Q0 -->|No / too stale| Q1{Is there a JSON API<br/>behind the page?}
Q1 -->|Yes| ScenC[Scenario C: API Interception<br/>━━━━━━━━━━<br/>Cleanest live solution<br/>Usually unprotected]
Q1 -->|No| Q2{Does the site serve<br/>a JS challenge?}
Q2 -->|No| ScenA[Scenario A: curl_cffi<br/>+ residential proxy<br/>━━━━━━━━━━<br/>70-80% of cases]
Q2 -->|Yes| Q3{Request volume?}
Q3 -->|Low<br/>under 100/hr| ScenB1[nodriver/Camoufox<br/>inline for each request]
Q3 -->|High<br/>over 100/hr| ScenB2[Scenario B: Solve once,<br/>reuse cf_clearance<br/>━━━━━━━━━━<br/>Most efficient pattern]
ScenB2 --> Q4{Is Bot Management v9<br/>actively tuned?}
Q4 -->|No| Done([✓ Working solution])
Q4 -->|Yes| ScenD[Scenario D: Managed API<br/>━━━━━━━━━━<br/>ScrapFly/BrightData/ScrapingBee]
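The decision tree above can also be read as a short function. This is only a restatement of the flowchart for reference: `choose_scenario` and its parameter names are hypothetical, and treating exactly 100 requests per hour as "high" is an assumption, since the chart leaves the boundary unspecified.

```python
def choose_scenario(archived_fresh: bool, has_json_api: bool,
                    serves_js_challenge: bool, requests_per_hour: int,
                    bot_management_tuned: bool) -> str:
    """Walk the flowchart top to bottom and return the scenario label."""
    if archived_fresh:
        return "E"          # Wayback fetch: no live anti-bot at all
    if has_json_api:
        return "C"          # API interception
    if not serves_js_challenge:
        return "A"          # curl_cffi + residential proxy
    if requests_per_hour < 100:
        return "B-inline"   # nodriver/Camoufox for each request
    if bot_management_tuned:
        return "D"          # managed scraping API
    return "B"              # solve once, reuse cf_clearance
```

The ordering matters: the cheaper options are checked first, so a fresh archive or an open JSON API short-circuits everything below it.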
The correct starting point for any scraping project. Covers roughly 70–80% of Cloudflare-protected sites encountered in practice.
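```python
from curl_cffi import requests

url = "https://target-site.com"  # placeholder target

session = requests.Session()
response = session.get(
    url,
    impersonate="chrome",  # alias for the newest bundled Chrome profile
    proxies={"https": "http://user:pass@residential-proxy:port"},
)
print(response.status_code, len(response.text))
```

Tip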
Do not pass a custom headers dict — let impersonate set headers automatically to preserve correct order.
Solve once, reuse cookie. Requires sticky session proxy.
```python
import asyncio

import nodriver as uc
from curl_cffi import requests as cffi_requests

STICKY_PROXY = "http://user:pass@proxy-sticky:port"


async def solve_challenge(url: str) -> dict:
    """Solve the managed challenge in a real browser and return its cookies."""
    browser = await uc.start(
        headless=False,
        browser_args=[
            # Chrome ignores credentials embedded in --proxy-server; an
            # authenticated proxy needs IP allowlisting or a local forwarder.
            f"--proxy-server={STICKY_PROXY}",
            # Keep WebRTC from sending traffic around the proxy
            "--force-webrtc-ip-handling-policy=disable_non_proxied_udp",
        ],
        user_data_dir="./browser_profile",  # persist profile
    )
    page = await browser.get(url)
    await page.sleep(4)  # give the challenge time to complete
    # nodriver exposes cookies through the browser's cookie jar
    cookies = {c.name: c.value for c in await browser.cookies.get_all()}
    browser.stop()
    return cookies


def scrape(url: str, cookies: dict) -> str:
    """Reuse the browser's cookies for a lightweight curl_cffi request."""
    session = cffi_requests.Session()
    for name, value in cookies.items():
        session.cookies.set(name, value)
    resp = session.get(
        url,
        impersonate="chrome",
        proxies={"https": STICKY_PROXY},  # same IP as solve
    )
    resp.raise_for_status()
    return resp.text


async def main():
    cookies = await solve_challenge("https://target-site.com")
    html = scrape("https://target-site.com/data", cookies)
    print(html[:500])


asyncio.run(main())
```

Before investing in browser automation, check whether the target site's data is served by an underlying JSON API. Modern web applications built on React, Vue, or Next.js almost universally fetch their data from API endpoints that are separate from the HTML-serving layer. These API endpoints are frequently:
- Less aggressively protected than the HTML CDN layer
- Served from a different subdomain (api.site.com, cdn-api.site.com) that may have weaker or no Cloudflare coverage
- Accessible to mobile apps (iOS/Android) via a distinct API with different — often weaker — bot protection
- Returning clean structured JSON rather than requiring HTML parsing
How to find them:
- Open the target page in Chrome DevTools → Network tab → filter by Fetch/XHR
- Navigate the page as a real user and observe all API calls made
- Look for calls returning JSON that contains the data you need
- Note the endpoint URL, required headers, and any authentication tokens
How to replicate them:
```python
from curl_cffi import requests

# Copy the exact headers from DevTools → right-click request → "Copy as cURL"
# Then translate to curl_cffi:
proxy = "http://user:pass@residential-proxy:port"  # placeholder proxy URL

session = requests.Session()
resp = session.get(
    "https://api.target-site.com/v2/listings",
    impersonate="chrome",
    params={"page": 1, "limit": 50},
    headers={
        "x-api-key": "observed-from-devtools",
        "origin": "https://target-site.com",
    },
    proxies={"https": proxy},
)
data = resp.json()
```

API endpoints often don't serve Cloudflare managed challenges (no HTML page to inject JS into), so curl_cffi with TLS + header impersonation is frequently sufficient. This approach eliminates the need for browser automation entirely on many targets.
Tip
Mobile app APIs: If the site has an Android or iOS app, install the app in an emulator with HTTP traffic proxied through mitmproxy or Charles. Mobile APIs often use OAuth tokens with longer validity than browser sessions, are lighter on bot protection, and return cleaner data structures.
For sites with Cloudflare's paid Bot Management actively configured and tuned, DIY maintenance becomes a full-time engineering task. Cloudflare updates its detection models continuously. Consider:
Managed scraping APIs (ScrapFly, BrightData, ScrapingBee, ZenRows): Handle TLS fingerprinting, browser rendering, proxy rotation, and challenge solving internally. Per-request cost of $0.003–$0.01 is almost always lower than the engineering time required to maintain reliable DIY bypasses against v9 targets.
If you must DIY, the minimum working stack:
- Mobile carrier IP on a sticky session
- curl_cffi impersonate="chrome" (auto-tracks latest)
- cf_clearance + __cf_bm obtained via nodriver with persistent --user-data-dir
- WebRTC disabled in the browser
- Session warming (homepage → referrer chain → target)
- Gamma-distributed delays (mean ~3s, shape ~2)
- All requests on the exact same proxy IP as the challenge solve
- Concurrency cap at 1–3 parallel sessions per IP
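The gamma-distributed delay in the list above can be sketched with the standard library. For a gamma distribution the mean equals shape × scale, so a mean of about 3 s with shape 2 implies a scale of 1.5. The `gamma_delay` name and the small floor value are assumptions added for illustration.

```python
import random


def gamma_delay(mean_s: float = 3.0, shape: float = 2.0,
                floor_s: float = 0.3) -> float:
    """Sample an inter-request delay in seconds from a gamma distribution.

    random.gammavariate takes (shape, scale); scale is derived from the
    requested mean. The floor avoids near-zero gaps between requests.
    """
    scale = mean_s / shape
    return max(floor_s, random.gammavariate(shape, scale))
```

Calling `time.sleep(gamma_delay())` between requests gives mostly short pauses with an occasional long one, which is closer to human pacing than a fixed or uniform interval.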
When a live target proves unbypassable — or before burning effort and proxy budget on a hard target — check whether the page is already archived. The Internet Archive's Wayback Machine has captured over 800 billion pages, and an archived snapshot sidesteps the live site's entire protection stack:
- No live anti-bot. The archived copy is served by web.archive.org, not the protected origin. There is no Cloudflare challenge, no Turnstile, no Bot Management — because you are not touching the protected site at all. The Wayback crawler already paid that cost, often years ago, sometimes before the protection even existed.
- Static HTML. Most snapshots are plain HTML, far easier to parse than a live JS-rendered page.
- Historical dimension. You can pull multiple snapshots over time to study how content (pricing, listings, messaging) evolved — something the live site can't give you.
The catch: snapshots may be stale, incomplete (missing CSS/JS/images), or simply absent for the URL you want. This is a fallback and a historical-data source, not a real-time scraping solution. And web.archive.org itself rate-limits and returns frequent 503s, so be gentle and add retry logic.
The CDX API — query what's archived before fetching:
```python
import time

import requests as plain_requests  # archive.org has no anti-bot; plain requests is fine


def find_snapshots(target_url: str, from_date=None, to_date=None) -> list:
    """Query the Wayback CDX API for all archived snapshots of a URL."""
    params = {
        "url": target_url,
        "output": "json",
        "collapse": "digest",  # dedupe identical captures
        "fl": "timestamp,original,statuscode",
    }
    if from_date:
        params["from"] = from_date  # YYYYMMDD
    if to_date:
        params["to"] = to_date
    resp = plain_requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params=params, timeout=30,
    )
    resp.raise_for_status()
    rows = resp.json()
    return rows[1:] if rows else []  # first row is the column header


def fetch_snapshot(timestamp: str, original_url: str) -> str:
    """Fetch a specific archived snapshot. The 'id_' suffix returns the
    raw original capture without the Wayback navigation toolbar injected."""
    archive_url = f"https://web.archive.org/web/{timestamp}id_/{original_url}"
    for attempt in range(3):
        resp = plain_requests.get(archive_url, timeout=30)
        if resp.status_code == 200:
            return resp.text
        time.sleep(2 ** attempt)  # archive.org 503s frequently — back off
    return ""


# Example: get the most recent archived copy of a blocked page
snaps = find_snapshots("https://target-site.com/blocked-page")
if snaps:
    ts, original, status = snaps[-1]  # last = most recent
    html = fetch_snapshot(ts, original)
    print(f"Recovered snapshot from {ts}: {len(html)} bytes")
else:
    print("No archive available — fall back to live scraping")
```

Tip
The id_ suffix in the snapshot URL (/web/{timestamp}id_/{url}) is the key detail: it returns the raw original capture without Wayback's injected navigation toolbar and rewritten links, giving you clean HTML to parse. Without it you get the human-facing archive view wrapped in archive.org's chrome.
Note
If a page has never been archived, you can request a fresh capture by hitting https://web.archive.org/save/{url} — but this triggers archive.org's own crawler against the live (protected) site, which may itself be blocked. It works for lightly protected pages and is worth trying as a one-off, but it is not a reliable bulk strategy.
This is the option most scraping guides omit: the cheapest bypass is often not bypassing the live site at all. Always check the archive before committing engineering effort to a hard target.
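Bot protection responses use specific HTTP status codes and Cloudflare error codes with distinct meanings. The 10xx values below are Cloudflare error codes shown in the response body, typically alongside an HTTP 403 or 429, rather than HTTP status codes themselves. Handling them correctly prevents wasted requests and unnecessary blocks.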
| Status | Meaning | Correct Response |
|---|---|---|
| 200 (with challenge HTML) | JS challenge served as 200 | Detect by checking response.text for cf-browser-verification or __cf_chl_ in body. Re-solve the challenge. |
| 403 | Hard block or expired cf_clearance | Rotate proxy IP, re-solve challenge, rebuild session. |
| 429 | Rate limited | Exponential backoff: wait 2^attempt * random(0.5, 1.5) seconds. Rotate IP after 3 failures. |
| 503 | Under Attack Mode or origin overloaded | Wait 10–30 seconds, solve managed challenge, retry. |
| 1020 | Cloudflare firewall rule block | IP is specifically blocked by a WAF rule. Switch to mobile carrier IP and fresh session. |
| 1015 | Cloudflare rate limit | Slow down. Use Retry-After header value if present. |
| 1010 | Browser integrity check failed | CDP/browser fingerprint detection. Switch to nodriver or Camoufox. |
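The 429 and 1015 rows combine two rules: honor Retry-After when the server sends it, otherwise fall back to jittered exponential backoff. A minimal sketch of that wait calculation follows; `backoff_seconds` is a hypothetical helper, and it assumes Retry-After may arrive either as a number of seconds or as an HTTP date.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def backoff_seconds(attempt: int, retry_after=None, now=None) -> float:
    """Return how long to wait before the next attempt, in seconds."""
    if retry_after:
        value = retry_after.strip()
        if value.isdigit():
            return float(value)  # delta-seconds form, e.g. "120"
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            when = None  # unparseable header: fall through to backoff
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            now = now or datetime.now(timezone.utc)
            return max(0.0, (when - now).total_seconds())
    # No usable header: 2^attempt seconds with +/-50% jitter
    return (2 ** attempt) * random.uniform(0.5, 1.5)
```

In a retry loop this would be called as `time.sleep(backoff_seconds(attempt, resp.headers.get("Retry-After")))`.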
Detecting silent challenges (200 with challenge body):
```python
import asyncio
import random
import time


def is_challenge_page(response) -> bool:
    """Return True when a response body looks like a Cloudflare challenge."""
    body = response.text.lower()
    signals = [
        "cf-browser-verification",
        "checking your browser",
        "jschl_vc",
        "jschl_answer",
        "__cf_chl_",
    ]
    return any(s in body for s in signals)


def fetch_with_retry(url, session, proxy, max_retries=3):
    # solve_challenge() is the nodriver helper from the cookie-reuse example
    for attempt in range(max_retries):
        resp = session.get(url, impersonate="chrome",
                           proxies={"https": proxy})
        if is_challenge_page(resp):
            # Silent challenge: re-solve in a browser and refresh cookies
            new_cookies = asyncio.run(solve_challenge(url))
            for name, value in new_cookies.items():
                session.cookies.set(name, value)
            continue
        if resp.status_code == 429:
            wait = (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(wait)
            continue
        if resp.status_code == 403:
            raise Exception("Hard block — rotate IP and rebuild session")
        resp.raise_for_status()
        return resp
    raise Exception(f"Failed after {max_retries} retries")
```

| Tool / Technique | Status | Reason |
|---|---|---|
| cloudscraper | ❌ Dead | Challenge format has changed multiple times since last meaningful update |
| FlareSolverr | ❌ Dead | Cloudflare detects its browser fingerprint reliably since early 2025. Deprecated and unmaintained. |
| playwright-extra stealth plugin | ❌ Dead | Deprecated February 2025; Cloudflare fingerprinting detects it reliably |
| undetected-chromedriver standalone | ❌ Insufficient | Does not address CDP protocol detection; use nodriver instead |
| Datacenter proxies | ❌ Blocked | All major datacenter ASNs are in Cloudflare's blocklist |
| Raw requests or httpx | ❌ Flagged | TLS fingerprint, HTTP/2 frames, and header order all wrong |
| Custom headers dict overriding curl_cffi impersonate | | Overrides the correct header order set by impersonation |
| Hard-pinned impersonate profiles (e.g. chrome120) | | Missing PQ key share; a stale profile is itself anomalous. Use the chrome alias. |
| ISP proxies on heavy Bot Management | | Increasingly flagged by tightened ASN-level signals |
| Rotating proxies with cf_clearance reuse | ❌ Broken | Cookie is IP-bound; rotating breaks the session. Use sticky sessions. |
| Headless browser without WebRTC disabled | | WebRTC STUN reveals machine IP even through proxy |
| High concurrent sessions per IP | | 10+ parallel sessions from one IP is a behavioral signal |
| Blindly following every link (incl. nofollow/hidden) | ❌ Walks into AI Labyrinth | Triggers honeypot fingerprinting + poisons your data. Follow only human-visible links. |
| Aggressive depth-first crawling | | AI Labyrinth and Workers honeypots specifically target this behavior |
Important
Bypassing bot detection is a technical capability, not a legal right. Before using any of these techniques, consider the following.
- Terms of Service — Most websites prohibit automated access in their ToS. Violating ToS may expose you to civil liability depending on jurisdiction, even for publicly accessible content.
- Computer fraud laws — In some jurisdictions (CFAA in the US, Computer Misuse Act in the UK), accessing a system in a manner that circumvents technical protection measures may carry legal risk even for publicly visible content.
- Rate and impact — Even technically permitted scraping can become tortious if the request volume materially impairs the target server's ability to serve other users.
- Data protection regulations — If scraped data includes personal information (names, emails, profiles, contact details), GDPR, CCPA, PDPA, and similar regulations may apply to how that data is stored, processed, and shared.
- robots.txt — While not legally binding in most jurisdictions, respecting robots.txt is an established community norm and is considered in legal assessments of scraping intent.
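Honoring robots.txt can be automated with the standard library. The sketch below parses an already-downloaded robots.txt body and returns a checker for a given crawler name; `build_robots_checker` is a hypothetical helper, and fetching the file itself is left to whatever HTTP client the scraper already uses.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str):
    """Parse robots.txt content and return a function: url -> allowed?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed
```

Running every candidate URL through the returned function before queuing it keeps disallowed paths out of the crawl entirely.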
The techniques in this article are documented for developers engaged in legitimate data collection — price monitoring, research, content aggregation of public information, accessibility tools, competitive intelligence, and similar use cases. Where an official API exists for the target data, use it — it is faster, more stable, and free of legal ambiguity.
Modern bot protection in 2026 is a multi-layer composite system, not a single wall. The critical insight is that all active layers must be passed simultaneously — a stack that is perfect on five layers but wrong on one will still fail. And the layers are not equally hard: the emit-versus-possess distinction determines where durable resistance actually comes from.
The practical hierarchy:
- Check the Wayback Machine first — before spending any effort on a hard target, query the CDX API. An archived snapshot sidesteps the live protection stack entirely and is the cheapest possible "bypass." Often the page you need is already sitting in 800+ billion captures.
- Try API interception next — many targets expose unprotected or lightly protected JSON APIs that eliminate the need for browser automation entirely.
- curl_cffi with the impersonate="chrome" alias is the baseline for all HTTP-level scraping. It resolves TLS (including X25519MLKEM768), HTTP/2, HTTP/3, and header order fingerprinting in a single call, and the alias auto-tracks the latest browser fingerprint so you never ship a stale profile. The transport-and-header layers (Layers 2–4) are all emit-reading and collapse into one effective term — solving them together with curl_cffi is the right approach.
- nodriver is the strongest available open-source tool for JS challenge solving and CDP detection avoidance — it won an independent June 2026 benchmark outright (zero blocks across 31 Cloudflare targets) by keeping Playwright entirely out of the protocol stack.
- Camoufox is the best choice when Firefox fingerprints are needed or when Chrome-shaped tools are specifically being targeted. Always disable WebRTC. (Note: all Playwright forks — Camoufox, Patchright, CloakBrowser — remain detectable at the automation-protocol layer; only nodriver avoids it.)
- Mobile carrier IPs on sticky sessions provide the best IP reputation foundation. Never use datacenter IPs.
- Cookie reuse (solve once with browser → execute many with curl_cffi) is the most resource-efficient production pattern. Reuse both cf_clearance and __cf_bm. The address-binding of cf_clearance is intentional — it manufactures a possess-like property out of the token.
- Persistent browser profiles meaningfully improve behavioral scoring at zero additional cost.
- Concurrency control matters — cap parallel sessions per IP at 1–3 to avoid behavioral flags.
- Never follow hidden or nofollow links — AI Labyrinth turns aggressive crawling into both a fingerprinting trap and a data-poisoning attack. Crawl shallow, follow only human-visible links, and treat sudden topical drift as a tripwire.
- Correct error handling (detecting silent 200 challenge pages, exponential backoff on 429) is essential for production reliability.
The two structural hard limits:
- Web Bot Auth — unforgeable by cryptographic design where a site mandates a valid signature. Most deployments still fail open (unsigned requests fall back to the regular stack), but that window is closing fast: IETF standardization is underway in 2026, AWS WAF / Vercel / Shopify / Akamai have implemented it, and Visa and Mastercard built their agentic-commerce protocols on top of it. Monitor per-target as enforcement spreads.
- Cloudflare Access — authentication, not anti-bot protection. No bypass without credentials.
The deeper principle: The transport-fingerprint layers (TLS, HTTP/2, header order) attract the most practitioner attention but are structurally the shallowest — they are all emit-reading artifacts that a faithful impersonator reproduces exactly. Durable assurance lives in the behavioral (possess-reading, per-zone, non-portable) and cryptographic layers. As imitation tooling continues closing the emit-readable gap, bot management migrates toward properties that cannot be emitted: verified behavioral continuity and, ultimately, cryptographic agent identity. The shift is already visible — AI Labyrinth poisons rather than blocks, and Web Bot Auth replaces imitation with a held secret.
Note
Last updated: June 2026. Bot detection systems update continuously; validate tool versions and success rates against current community benchmarks before production deployment.