network interception with Fetch.enable breaks cloudflare #123

Closed
milahu opened this issue Nov 30, 2023 · 43 comments
Labels
bug Something isn't working enhancement New feature or request

Comments

@milahu

milahu commented Nov 30, 2023

im trying to capture all responses as described in readme#use-events

cloudflare says

Please unblock challenges.cloudflare.com to proceed.

chrome shows a warning in the address bar

your connection to this site is not secure

fixed by adding options.add_argument("--disable-web-security")
so that chrome does not enforce the same-origin policy

test_selenium_driverless.py
#!/usr/bin/env python3

import asyncio
import base64
import sys
import time
import traceback

from cdp_socket.exceptions import CDPError

from selenium_driverless import webdriver


async def on_request(params, global_conn):

    url = params["request"]["url"]
    _params = {"requestId": params['requestId']}
    if params.get('responseStatusCode') in [301, 302, 303, 307, 308]:
        # redirected request
        return await global_conn.execute_cdp_cmd("Fetch.continueResponse", _params)
    else:
        try:
            body = await global_conn.execute_cdp_cmd("Fetch.getResponseBody", _params, timeout=1)
        except CDPError as e:
            if e.code == -32000 and e.message == 'Can only get response body on requests captured after headers received.':
                print(params, "\n", file=sys.stderr)
                traceback.print_exc()
                await global_conn.execute_cdp_cmd("Fetch.continueResponse", _params)
            else:
                raise e
        else:
            start = time.monotonic()
            body_decoded = base64.b64decode(body['body'])

            # modify body here

            body_modified = base64.b64encode(body_decoded).decode("ascii")
            fulfill_params = {"responseCode": 200, "body": body_modified}
            fulfill_params.update(_params)
            _time = time.monotonic() - start
            if _time > 0.01:
                print(f"decoding took long: {_time} s")
            await global_conn.execute_cdp_cmd("Fetch.fulfillRequest", fulfill_params)
            print("Mocked response", url)


async def main():
    options = webdriver.ChromeOptions()
    options.add_argument("--window-size=500,900")
    # fix: please unblock challenges.cloudflare.com to proceed
    # Don't enforce the same-origin policy
    options.add_argument("--disable-web-security")
    async with webdriver.Chrome(options=options, max_ws_size=2 ** 30) as driver:
        driver.base_target.socket.on_closed.append(lambda code, reason: print("chrome exited"))
        global_conn = driver.base_target
        await driver.get("about:blank")
        await global_conn.execute_cdp_cmd("Fetch.enable", cmd_args={"patterns": [{"requestStage": "Response", "urlPattern":"*"}]})
        await global_conn.add_cdp_listener("Fetch.requestPaused", lambda data: on_request(data, global_conn))
        await driver.get(
            #'https://wikipedia.org',
            "https://nowsecure.nl/#relax", # test cloudflare
            timeout=60, wait_load=False)
        while True:
            #time.sleep(10) # no. cloudflare would hang
            await asyncio.sleep(10)


asyncio.run(main())
@kaliiiiiiiiii
Owner

I can confirm this. However, I suspect this is a timing leak, with Cloudflare therefore sending a 403 back => not really a way to fix it.

@milahu or any other thoughts//ideas on that?

@juhacz

juhacz commented Nov 30, 2023

The problem is that one of Cloudflare's engineers is watching this repository... :)

@kaliiiiiiiiii
Owner

The problem is that one of Cloudflare's engineers is watching this repository... :)

@juhacz
Likely, yes.

So in case some @Cloudflare staff are reading this:

Why not hire me directly instead of needing someone to analyse & understand the code on here ? :)

@juhacz

juhacz commented Nov 30, 2023

@kaliiiiiiiiii Because we need people like you more :) I suggest creating a profile at https://www.buymeacoffee.com/ I think people will confirm my words :)

@milahu
Author

milahu commented Dec 1, 2023

I suspect this to be a timing leak

you mean the python response handler is too slow?

or maybe the continueResponse/fulfillRequest logic has a bug
(note: continueResponse is experimental)

but yeah, it seems to be a new problem
with the error message
"Please unblock challenges.cloudflare.com to proceed."
i only find a tapatalk.com thread from 2023-10-30 with no solution

any other thoughts//ideas on that?

so far i used the "export HAR" function of chrome devtools network
but that is slower than capturing the live traffic

the exported HAR file does not include the bodies of binary responses
which is actually good for large binaries
i dont want to store a 1GB response body in RAM
but let chrome write it to the filesystem

chromium is open source, so it should be easy to find
how the "record network log" command works

an alternative would be a local http proxy
i guess Fetch.enable also works with a http proxy inside of chrome
and maybe that proxy is visible to cloudflare

in the long term, they will replace captchas with government ID logins
and to bypass that, we will need p2p scraping tools...

@kaliiiiiiiiii
Owner

@kaliiiiiiiiii Because we need people like you more :) I suggest creating a profile at https://www.buymeacoffee.com/ I think people will confirm my words :)

@juhacz
added:) https://github.com/kaliiiiiiiiii#support-me

@milahu
Author

milahu commented Dec 1, 2023

chromium is open source, so it should be easy to find
how the "record network log" command works

chromium devtools sources

chromium/src/third_party/devtools-frontend/src/front_end/panels/network/network-meta.ts

UIStrings.recordNetworkLog

UI.ActionRegistration.registerActionExtension({
  actionId: 'network.toggle-recording',
  category: UI.ActionRegistration.ActionCategory.NETWORK,
  iconClass: UI.ActionRegistration.IconClass.START_RECORDING,
  toggleable: true,
  toggledIconClass: UI.ActionRegistration.IconClass.STOP_RECORDING,
  toggleWithRedColor: true,
  contextTypes() {
    return maybeRetrieveContextTypes(Network => [Network.NetworkPanel.NetworkPanel]);
  },
  async loadActionDelegate() {
    const Network = await loadNetworkModule();
    return new Network.NetworkPanel.ActionDelegate();
  },
  options: [
    {
      value: true,
      title: i18nLazyString(UIStrings.recordNetworkLog),
    },
    {
      value: false,
      title: i18nLazyString(UIStrings.stopRecordingNetworkLog),
    },
  ],

chromium/src/third_party/devtools-frontend/src/front_end/panels/network/NetworkPanel.ts

network.toggle-recording

export class ActionDelegate implements UI.ActionRegistration.ActionDelegate {
  handleAction(context: UI.Context.Context, actionId: string): boolean {
    const panel = context.flavor(NetworkPanel);
    if (panel === null) {
      return false;
    }
    switch (actionId) {
      case 'network.toggle-recording': {
        panel.toggleRecord(!panel.recordLogSetting.get());
        return true;
      }

panel.toggleRecord

  toggleRecord(toggled: boolean): void {
    this.toggleRecordAction.setToggled(toggled);
    if (this.recordLogSetting.get() !== toggled) {
      this.recordLogSetting.set(toggled);
    }

    this.networkLogView.setRecording(toggled);
    if (!toggled && this.filmStripRecorder) {
      this.filmStripRecorder.stopRecording(this.filmStripAvailable.bind(this));
    }
  }

this.filmStripRecorder

  private willReloadPage(): void {
    if (this.pendingStopTimer) {
      clearTimeout(this.pendingStopTimer);
      delete this.pendingStopTimer;
    }
    if (this.isShowing() && this.filmStripRecorder) {
      this.filmStripRecorder.startRecording();
    }
  }

this.filmStripRecorder

      this.filmStripRecorder = new FilmStripRecorder(this.networkLogView.timeCalculator(), this.filmStripView);

FilmStripRecorder

export class FilmStripRecorder implements TraceEngine.TracingManager.TracingManagerClient {
  // ...
  startRecording(): void {
    // ...
    const tracingManager =
        SDK.TargetManager.TargetManager.instance().scopeTarget()?.model(TraceEngine.TracingManager.TracingManager);
    // ...
    this.tracingManager = tracingManager;
    this.resourceTreeModel = this.tracingManager.target().model(SDK.ResourceTreeModel.ResourceTreeModel);
    this.tracingModel = new TraceEngine.Legacy.TracingModel();
    void this.tracingManager.start(this, '-*,disabled-by-default-devtools.screenshot', '');
    // ...
  }
  // ...
  stopRecording(callback: (filmStrip: TraceEngine.Extras.FilmStrip.Data) => void): void {
    // ...
    this.tracingManager.stop();
    // ...
  }
}

FilmStripRecorder implements TraceEngine.TracingManager.TracingManagerClient

SDK.TargetManager.TargetManager.instance

import * as SDK from '../../core/sdk/sdk.js';

chromium/src/third_party/devtools-frontend/src/front_end/core/sdk/sdk.ts

import * as TargetManager from './TargetManager.js';

chromium/src/third_party/devtools-frontend/src/front_end/core/sdk/TargetManager.ts

TraceEngine.TracingManager.TracingManager

chromium/src/third_party/devtools-frontend/src/front_end/models/trace/TracingManager.ts

export class TracingManager extends SDK.SDKModel.SDKModel<void> {
  readonly #tracingAgent: ProtocolProxyApi.TracingApi;
  // ...
  async start(client: TracingManagerClient, categoryFilter: string, options: string):
      Promise<Protocol.ProtocolResponseWithError> {
    // ...
    const args = {
      bufferUsageReportingInterval: bufferUsageReportingIntervalMs,
      categories: categoryFilter,
      options: options,
      transferMode: Protocol.Tracing.StartRequestTransferMode.ReportEvents,
    };
    const response = await this.#tracingAgent.invoke_start(args);
    // ...
  }

chromium/src/third_party/devtools-frontend/src/front_end/generated/protocol-proxy-api.d.ts

/**
 * API generated from Protocol commands and events.
 */
declare namespace ProtocolProxyApi {
  // ...
  export interface TracingApi {
    // ...
    invoke_start(params: Protocol.Tracing.StartRequest): Promise<Protocol.ProtocolResponseWithError>;

bufferUsageReportingInterval

chromium sources

chromium/src/out/Debug/gen/content/browser/devtools/protocol/tracing.cc

bufferUsageReportingInterval

struct startParams : public crdtp::DeserializableProtocolObject<startParams> {
    Maybe<String> categories;
    Maybe<String> options;
    Maybe<double> bufferUsageReportingInterval;
    Maybe<String> transferMode;
    Maybe<String> streamFormat;
    Maybe<String> streamCompression;
    Maybe<protocol::Tracing::TraceConfig> traceConfig;
    Maybe<Binary> perfettoConfig;
    Maybe<String> tracingBackend;
    DECLARE_DESERIALIZATION_SUPPORT();
};

startParams

void DomainDispatcherImpl::start(const crdtp::Dispatchable& dispatchable)
{
    // Prepare input parameters.
    auto deserializer = crdtp::DeferredMessage::FromSpan(dispatchable.Params())->MakeDeserializer();
    startParams params;
    if (!startParams::Deserialize(&deserializer, &params)) {
      ReportInvalidParams(dispatchable, deserializer);
      return;
    }

    m_backend->Start(std::move(params.categories), std::move(params.options), std::move(params.bufferUsageReportingInterval), std::move(params.transferMode), std::move(params.streamFormat), std::move(params.streamCompression), std::move(params.traceConfig), std::move(params.perfettoConfig), std::move(params.tracingBackend), std::make_unique<StartCallbackImpl>(weakPtr(), dispatchable.CallId(), dispatchable.Serialized()));
}

or simply: Tracing.start

@kaliiiiiiiiii
Owner

@milahu

you mean the python response handler is too slow?

Yep, or maybe even the interception on the C++ side of Chromium is too slow over a single websocket.

  1. A long-term workaround here would be using something like selenium-wire; this, however, requires some development to fix the SSL pinning.

or maybe the continueResponse/fulfillRequest logic has a bug (note: continueResponse is experimental)

Yep, there surely are some bugs. I could also imagine that some iframes don't get intercepted correctly and therefore differ detectably from the main frame.

so far i used the "export HAR" function of chrome devtools network but that is slower than capturing the live traffic

Yep, that works as well of course, but it's more of a workaround :)

an alternative would be a local http proxy i guess Fetch.enable also works with a http proxy inside of chrome and maybe that proxy is visible to cloudflare

See 1.
I assumed chrome intercepts directly between frames | boringssl and doesn't tunnel it through a proxy after boringssl.
Maybe we can find some source-code on that?

another thing to try is

  1. Network.setRequestInterception (deprecated though).

Soo feel free to share a POC & status if you try that

@kaliiiiiiiiii
Owner

Yep, there surely are some bugs. I could also imagine that some iframes don't get intercepted correctly and therefore differ detectably from the main frame.

That would then explain why disabling site isolation works

@milahu
Author

milahu commented Dec 1, 2023

interception

for my use case, i dont need any active interception of requests/responses
i just need a passive live-stream of http traffic

so i will use Tracing.start

edit: no. the Tracing.dataCollected events are only sent after Tracing.end
and the Tracing.dataCollected events dont contain http traffic 0__o

i still dont understand how devtools network log gets the live network traffic
the network log uses Tracing.start only to get the trace categories
"-*,disabled-by-default-devtools.screenshot"

@kaliiiiiiiiii kaliiiiiiiiii added bug Something isn't working enhancement New feature or request labels Dec 2, 2023
@kaliiiiiiiiii kaliiiiiiiiii changed the title network interception with Fetch.enable breaks cloudflare network interception with Fetch.enable breaks cloudflare Dec 2, 2023
@milahu
Author

milahu commented Dec 4, 2023

an alternative would be a local http proxy

selenium-wire uses a patched version of mitmproxy as http proxy

this also allows for active network interception
without chromium --disable-web-security
because we can tell chromium to trust the proxy's certificate

@kaliiiiiiiiii
Owner

an alternative would be a local http proxy

selenium-wire uses a patched version of mitmproxy as http proxy

this also allows for active network interception without chromium --disable-web-security because we can tell chromium to trust the proxy's certificate

Still pretty sure the SSL/TLS fingerprint doesn't match Chrome's, as the proxy doesn't use boringssl. See wkeeling/selenium-wire#215 (comment)

@kaliiiiiiiiii
Owner

Interesting note here that:

from cdp_socket.utils.utils import launch_chrome, random_port
from cdp_socket.socket import CDPSocket
import os
import asyncio

sock1 = None  # assigned in main(), used by the event handler


async def on_resumed(params):
    global sock1
    await sock1.exec("Fetch.continueRequest", {"requestId": params['requestId']})
    print(params["request"]["url"])


async def main():
    global sock1
    PORT = random_port()
    process = launch_chrome(PORT)

    async with CDPSocket(PORT) as base_socket:
        targets = await base_socket.targets
        target = targets[0]
        sock1 = await base_socket.get_socket(target)
        await sock1.exec("Network.clearBrowserCookies")
        await sock1.exec("Fetch.enable")
        sock1.add_listener("Fetch.requestPaused", on_resumed)
        await sock1.exec("Page.navigate", {"url": "https://nowsecure.nl#relax"})
        await asyncio.sleep(5)

    os.kill(process.pid, 15)


asyncio.run(main())

works just fine

@milahu
Author

milahu commented Dec 18, 2023

works just fine

this works for requests, but not for responses
because Fetch.getResponseBody always throws CDPError -32000

test.py
#!/usr/bin/env python3

# https://github.com/kaliiiiiiiiii/Selenium-Driverless/issues/123#issuecomment-1858803756

from cdp_socket.utils.utils import launch_chrome, random_port
from cdp_socket.socket import CDPSocket
from cdp_socket.exceptions import CDPError

import os
import asyncio
import json
import base64
import sys
import time
import traceback

sock1 = None  # assigned in main(), used by the event handler


async def on_request_paused(params):
    global sock1

    url = params["request"]["url"]
    url_clean = url.split("?")[0]
    if len(url_clean) > 60:
        url_clean = url_clean[:60] + "..."
    _params = {"requestId": params['requestId']}
    #if params.get('responseStatusCode') in [301, 302, 303, 307, 308]:
    #    # redirected request
    #    return await sock1.exec("Fetch.continueResponse", _params)
    try:
        #print("Fetch.getResponseBody ...", url_clean)
        body = await sock1.exec("Fetch.getResponseBody", _params, timeout=30)
    except CDPError as e:
        #print("Fetch.getResponseBody CDPError", url_clean)
        if e.code == -32000:
            # Can only get response body on HeadersReceived pattern matched requests.
            print("Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse", url_clean)
            #print("Fetch.continueResponse ...", url_clean)
            res = await sock1.exec("Fetch.continueResponse", _params, timeout=30)
            #print("Fetch.continueResponse done", url_clean)
            return res
        else:
            print("Fetch.getResponseBody CDPError raise", url_clean)
            raise e
    else:
        print("Fetch.getResponseBody done", url_clean)
        start = time.monotonic()
        body_decoded = base64.b64decode(body['body'])
        # modify body here
        body_modified = base64.b64encode(body_decoded).decode("ascii")
        fulfill_params = {"responseCode": 200, "body": body_modified}
        fulfill_params.update(_params)
        _time = time.monotonic() - start
        if _time > 0.01:
            print(f"decoding took long: {_time} s")
        print("Fetch.fulfillRequest ...")
        res = await sock1.exec("Fetch.fulfillRequest", fulfill_params, timeout=30)
        print("Fetch.fulfillRequest done", url_clean)
        print("Mocked response", url_clean)
        return res


async def main():
    global sock1
    PORT = random_port()
    process = launch_chrome(PORT)

    async with CDPSocket(PORT) as base_socket:
        targets = await base_socket.targets
        target = targets[0]
        sock1 = await base_socket.get_socket(target)
        await sock1.exec("Network.clearBrowserCookies")
        await sock1.exec("Fetch.enable")
        sock1.add_listener("Fetch.requestPaused", on_request_paused)
        # timeout: fix: asyncio.exceptions.TimeoutError
        await sock1.exec("Page.navigate", {"url": "https://nowsecure.nl#relax"}, timeout=30)
        print("waiting after Page.navigate")
        await asyncio.sleep(5)

    os.kill(process.pid, 15)  # 15 = SIGTERM, as in the script above


asyncio.run(main())
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/
waiting after Page.navigate
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/cdn-cgi/styles/challenges.css
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/cdn-cgi/challenge-platform/h/g/orchestr...
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://challenges.cloudflare.com/turnstile/v0/g/74bd6362/ap...
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/favicon.ico
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/cdn-cgi/challenge-platform/h/g/flow/ov1...
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/favicon.ico
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://challenges.cloudflare.com/cdn-cgi/challenge-platform...

similar...
https://github.com/cloud-browser/scrapy-cloud-browser/blob/main/scrapy_cloud_browser/scenarist/page.py
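
A possible explanation for the -32000 errors above: Fetch.enable is called without patterns, and a request pattern defaults to the Request stage, so every Fetch.requestPaused event arrives before a response exists. A minimal sketch of the two pieces that would be needed, assuming nothing beyond the CDP Fetch domain. The helper names `fetch_enable_params` and `is_response_stage` are made up for illustration and are not part of cdp_socket or selenium_driverless:

```python
def fetch_enable_params(url_pattern="*", stage="Response"):
    """Build params for Fetch.enable that pause requests at the given stage.

    stage is "Request" (the CDP default when omitted) or "Response".
    """
    if stage not in ("Request", "Response"):
        raise ValueError(f"unknown request stage: {stage!r}")
    return {"patterns": [{"urlPattern": url_pattern, "requestStage": stage}]}


def is_response_stage(event):
    """True if a Fetch.requestPaused payload was paused at the response stage.

    Per the CDP docs, the event is at the response stage if either
    responseStatusCode or responseErrorReason is present.
    """
    return "responseStatusCode" in event or "responseErrorReason" in event
```

With the script above this would be used as `await sock1.exec("Fetch.enable", fetch_enable_params())`, calling Fetch.getResponseBody only when `is_response_stage(params)` is true.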

@milahu
Author

milahu commented Dec 20, 2023

chrome://net-export/ could be useful for passive capturing of traffic

Click the button to start logging future network activity to a file on disk. The log includes details of network activity from all of Chrome, including incognito and non-incognito tabs, visited URLs, and information about the network configuration

via chrome://net-internals/
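
The same NetLog can also be started from the command line instead of clicking through chrome://net-export/. A small sketch, assuming the Chromium flags `--log-net-log` and `--net-log-capture-mode`; the helper name `netlog_args` is made up, and whether the resulting log contains enough of the traffic for this use case has not been checked here:

```python
def netlog_args(path, capture_everything=False):
    """Return Chromium command line arguments that write a NetLog to path."""
    args = [f"--log-net-log={path}"]
    if capture_everything:
        # the default capture mode strips some data, "Everything" keeps more
        args.append("--net-log-capture-mode=Everything")
    return args
```

Each returned argument could then be passed to `options.add_argument(...)` before starting the driver.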

@kaliiiiiiiiii
Owner

Looks like Network.setRequestInterception has the same issues. I wonder, though, why it's flagged as "Insecure" even though the request is over HTTPS.

import asyncio
import base64
import sys
import time
import traceback

from cdp_socket.exceptions import CDPError

from selenium_driverless import webdriver


async def on_request(params, global_conn):
    url = params["request"]["url"]
    _params = {"interceptionId": params['interceptionId']}
    if params.get('responseStatusCode') in [301, 302, 303, 307, 308]:
        # redirected request
        return await global_conn.execute_cdp_cmd("Network.continueInterceptedRequest", _params)
    else:
        try:
            body = await global_conn.execute_cdp_cmd("Network.getResponseBodyForInterception", _params, timeout=1)
        except CDPError as e:
            if e.code == -32000 and e.message == 'Can only get response body on requests captured after headers received.':
                print(params, "\n", file=sys.stderr)
                traceback.print_exc()
                # _params holds an interceptionId, so continue via the Network domain
                await global_conn.execute_cdp_cmd("Network.continueInterceptedRequest", _params)
            else:
                raise e
        else:
            start = time.monotonic()
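            body_decoded = base64.b64decode(body['body'])

            # modify body here

            body_modified = base64.b64encode(body_decoded).decode()
            # NOTE: per the CDP docs, rawResponse is expected to be the full raw HTTP
            # response (status line and headers included), not only the body
            fulfill_params = {"rawResponse": body_modified}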
            fulfill_params.update(_params)
            _time = time.monotonic() - start
            if _time > 0.01:
                print(f"decoding took long: {_time} s")
            await global_conn.execute_cdp_cmd("Network.continueInterceptedRequest", fulfill_params)
            print("Mocked response", url)


async def main():
    options = webdriver.ChromeOptions()
    async with webdriver.Chrome(options=options, max_ws_size=2 ** 30) as driver:
        driver.base_target.socket.on_closed.append(lambda code, reason: print("chrome exited"))
        global_conn = driver.current_target
        await driver.get("about:blank")
        await global_conn.execute_cdp_cmd("Network.enable", {"maxTotalBufferSize": 1_000_000,  # bytes, i.e. 1 MB (not 1 GB)
                                                             "maxResourceBufferSize":1_000_000,
                                                             "maxPostDataSize":1_000_000
                                                             })
        await global_conn.execute_cdp_cmd("Network.setRequestInterception", {"patterns":[{"urlPattern":"*", "interceptionStage":"HeadersReceived"}]})
        await global_conn.add_cdp_listener("Network.requestIntercepted", lambda data: on_request(data, global_conn))
        await driver.get(
            'https://nowsecure.nl',
            timeout=60, wait_load=False)
        while True:
            await asyncio.sleep(10)


asyncio.run(main())

@milahu
Author

milahu commented Dec 31, 2023

I wonder, though, why it's flagged as "Insecure" even though the request is over HTTPS

i guess it uses a local https proxy with a self-signed certificate
without adding that certificate as "trusted cert" to ~/.pki/nssdb/

but still, this fails to bypass cloudflare

Please unblock challenges.cloudflare.com to proceed.

@kaliiiiiiiiii
Owner

Also interesting here: local overrides in the Chrome DevTools work just fine.

i guess it uses a local https proxy with a self-signed certificate
without adding that certificate as "trusted cert" to ~/.pki/nssdb/

ahh yep, that makes sense

but still, this fails to bypass cloudflare

Maybe there's a way to detect self-signed certificate usage? If not, it's probably timing or SSL/TLS fingerprinting, I guess.

I see 2 possible approaches here:

  1. Check if we can access that through a chrome extension (check if existing ones work). @milahu feel free to lmk if you find a working one. Getting the source code & analysing it shouldn't be that hard.
  2. What if we, instead of modifying the body binary, point the url to a local webserver?

@milahu
Author

milahu commented Dec 31, 2023

for now i gave up on intercepting requests...
chrome seems to make it really hard, also to provide security against MITM attacks

probably i would try the frida route
as i described in wkeeling/selenium-wire#656 (comment)

@kaliiiiiiiiii
Owner

for now i gave up on intercepting requests... chrome seems to make it really hard, also to provide security against MITM attacks

probably i would try the frida route as i described in wkeeling/selenium-wire#656 (comment)

Well yeah, even though I assume that the memory manipulation / DLL hooking solutions are specific to:

  • chrome versions
  • OS

and therefore hard to maintain long-term :/

@kaliiiiiiiiii
Owner

Regarding

chrome://net-export/ could be useful for passive capturing of traffic

Click the button to start logging future network activity to a file on disk. The log includes details of network activity from all of Chrome, including incognito and non-incognito tabs, visited URLs, and information about the network configuration

via chrome://net-internals/

Uhh, I think passive capturing works as well with Fetch.enable or Network.setRequestInterception, as long as you don't modify the body, btw

@kaliiiiiiiiii
Owner

Even changing request headers works just fine


import asyncio
import base64
import sys
import time
import traceback

from cdp_socket.exceptions import CDPError

from selenium_driverless import webdriver


async def on_request(params, global_conn):
    url = params["request"]["url"]
    _params = {"interceptionId": params['interceptionId']}
    if params.get('responseStatusCode') in [301, 302, 303, 307, 308]:
        # redirected request
        return await global_conn.execute_cdp_cmd("Network.continueInterceptedRequest", _params)
    else:

        fulfill_params = {"headers":params["request"]["headers"]}
        fulfill_params["headers"]["test"] = "Hello World!"
        fulfill_params.update(_params)
        await global_conn.execute_cdp_cmd("Network.continueInterceptedRequest", fulfill_params)
        print(url)


async def main():
    options = webdriver.ChromeOptions()
    async with webdriver.Chrome(options=options, max_ws_size=2 ** 30) as driver:
        driver.base_target.socket.on_closed.append(lambda code, reason: print("chrome exited"))
        global_conn = driver.current_target
        await driver.get("about:blank")
        await global_conn.execute_cdp_cmd("Network.enable", {"maxTotalBufferSize": 1_000_000,  # bytes, i.e. 1 MB (not 1 GB)
                                                             "maxResourceBufferSize": 1_000_000,
                                                             "maxPostDataSize": 1_000_000
                                                             })
        await global_conn.execute_cdp_cmd("Network.setRequestInterception",
                                          {"patterns": [{"urlPattern": "*",
                                                         # "interceptionStage": "HeadersReceived"
                                                         }]})
        await global_conn.add_cdp_listener("Network.requestIntercepted", lambda data: on_request(data, global_conn))
        await driver.get(
            'https://nowsecure.nl',
            timeout=60, wait_load=False)
        while True:
            await asyncio.sleep(10)


asyncio.run(main())
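
Since Network.setRequestInterception is deprecated, the same header change can probably be done in the Fetch domain with Fetch.continueRequest. One difference to keep in mind: the paused event carries request headers as a plain dict, while Fetch.continueRequest expects a list of name/value entries. A minimal sketch of that conversion; the helper name `add_request_header` is made up for illustration:

```python
def add_request_header(event, name, value):
    """Build Fetch.continueRequest params that add or replace one request header.

    event is a Fetch.requestPaused payload. Its request.headers is a dict,
    but Fetch.continueRequest takes a list of {"name": ..., "value": ...}.
    """
    # copy without any existing header of the same name (case-insensitive)
    headers = {
        k: v for k, v in event["request"]["headers"].items()
        if k.lower() != name.lower()
    }
    headers[name] = value
    return {
        "requestId": event["requestId"],
        "headers": [{"name": k, "value": str(v)} for k, v in headers.items()],
    }
```

In a Fetch.requestPaused listener this would be sent as `execute_cdp_cmd("Fetch.continueRequest", add_request_header(params, "test", "Hello World!"))`.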

@milahu
Author

milahu commented Dec 31, 2023

print(url)

and where is the response body?

@milahu
Author

milahu commented Jan 13, 2024

and where is the response body?

Network.getResponseBody

#!/usr/bin/env python3

import asyncio
from selenium_driverless import webdriver
from selenium_driverless.types.by import By
import base64

async def main():

    driver = await webdriver.Chrome()
    #await asyncio.sleep(1)

    target = None

    async def requestWillBeSent(args):
        #print("requestWillBeSent", args)
        print("requestWillBeSent", args["request"]["url"])

    async def requestWillBeSentExtraInfo(args):
        print("requestWillBeSentExtraInfo", args)

    async def responseReceived(args):
        # TODO better. get target of this response
        nonlocal target
        #print("responseReceived", args)
        status = args["response"]["status"]
        url = args["response"]["url"]
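        headers = args["response"]["headers"]
        # header names may arrive lowercased (e.g. over HTTP/2)
        _type = headers.get("Content-Type", headers.get("content-type"))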

        # TODO better. detect when response data is ready
        # fix: No data found for resource with given identifier
        await asyncio.sleep(1)

        args = {
            "requestId": args["requestId"],
        }
        body = await target.execute_cdp_cmd("Network.getResponseBody", args)
        body = base64.b64decode(body["body"]) if body["base64Encoded"] else body["body"]

        print("responseReceived", status, url, _type, repr(body[:20]) + "...")

    async def responseReceivedExtraInfo(args):
        print("responseReceivedExtraInfo", args)

    async def targetCreated(args):
        print("targetCreated", args)

    async def targetInfoChanged(args):
        #print("targetInfoChanged", args)
        print("targetInfoChanged")

    target = await driver.current_target
    #print("target.id", target.id)

    # enable Target events
    args = {
        "discover": True,
        #"filter": ...
    }
    await target.execute_cdp_cmd("Target.setDiscoverTargets", args)

    await target.add_cdp_listener("Target.targetCreated", targetCreated)
    await target.add_cdp_listener("Target.targetInfoChanged", targetInfoChanged)

    #print("driver.targets", await driver.targets)

    # enable Network events
    args = {
        "maxTotalBufferSize": 1_000_000,  # bytes, i.e. 1 MB (not 1 GB)
        "maxResourceBufferSize": 1_000_000,
        "maxPostDataSize": 1_000_000
    }
    await target.execute_cdp_cmd("Network.enable", args)

    await target.add_cdp_listener("Network.requestWillBeSent", requestWillBeSent)
    #await target.add_cdp_listener("Network.requestWillBeSentExtraInfo", requestWillBeSentExtraInfo)
    await target.add_cdp_listener("Network.responseReceived", responseReceived)
    #await target.add_cdp_listener("Network.responseReceivedExtraInfo", responseReceivedExtraInfo)



    #await asyncio.sleep(1)

    url = "http://httpbin.org/get"
    print("driver.get", url)
    await driver.get(url)
    await asyncio.sleep(3)

    #print("driver.targets", await driver.targets)

    """
    print("hit enter to close")
    input()
    """

    await driver.close()

asyncio.run(main())

example output

driver.get http://httpbin.org/get
requestWillBeSent http://httpbin.org/get
targetInfoChanged
requestWillBeSent http://httpbin.org/favicon.ico
responseReceived 200 http://httpbin.org/get application/json '{\n  "args": {}, \n  "'...
responseReceived 404 http://httpbin.org/favicon.ico text/html '<!DOCTYPE HTML PUBLI'...
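
Regarding the `asyncio.sleep(1)` TODO in the script above: the body should be available once Network.loadingFinished has fired for the same requestId, so the sleep could be replaced by waiting for that event. A minimal sketch of the bookkeeping, with made-up names (`ResponseTracker`, `decode_body`) that are not part of selenium_driverless:

```python
import base64


class ResponseTracker:
    """Pair Network.responseReceived with Network.loadingFinished by requestId."""

    def __init__(self):
        self._pending = {}

    def on_response_received(self, event):
        # remember the metadata until the body has finished loading
        response = event["response"]
        self._pending[event["requestId"]] = {
            "status": response["status"],
            "url": response["url"],
            "mimeType": response.get("mimeType"),
        }

    def on_loading_finished(self, event):
        # returns the stored metadata, or None for requests we never saw
        return self._pending.pop(event["requestId"], None)

    def on_loading_failed(self, event):
        # no body will arrive for this request, so forget it
        self._pending.pop(event["requestId"], None)


def decode_body(result):
    """Turn a Network.getResponseBody result into bytes."""
    body = result["body"]
    return base64.b64decode(body) if result.get("base64Encoded") else body.encode("utf-8")
```

The Network.loadingFinished listener would then call Network.getResponseBody only when `on_loading_finished(args)` returns metadata, and pass the result through `decode_body`.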

@milahu
Author

milahu commented Jan 24, 2024

Please unblock challenges.cloudflare.com to proceed.

this error appears when Fetch.fulfillRequest has no response headers

fix:

    async def requestPaused(args):
        # ...
        body = base64.b64encode(body).decode("ascii")
        _args = {
            "requestId": args["requestId"],
            "responseCode": args["responseStatusCode"],
            # fix: Please unblock challenges.cloudflare.com to proceed.
            "responseHeaders": args["responseHeaders"],
            "body": body,
        }
        if args["responseStatusText"] != "":
            # empty string throws "Invalid http status code or phrase"
            _args["responsePhrase"] = args["responseStatusText"]
        await target.execute_cdp_cmd("Fetch.fulfillRequest", _args)

passive capturing works as well with Fetch.enable or Network.setRequestInterception as long as you don't modify the body

im looking for a generic solution, based on streams
so i can handle infinite-size responses without storing the whole response in RAM
and so i can handle streams of events with low latency

see also https://github.com/milahu/aiohttp_chromium/tree/main/test/stream-response

feel free to copy/paste/modify these scripts to Selenium-Driverless/examples/

@kaliiiiiiiiii
Owner

see also https://github.com/milahu/aiohttp_chromium/tree/main/test/stream-response

feel free to copy/paste/modify these scripts to Selenium-Driverless/examples/

ah yep, thanks. Might be nice if you can keep it up long-term somewhere in your repo for reference

broken: Network.enable and Network.streamResourceContent and Network.dataReceived - this is broken in chromium 117, because data is always empty.

ah heck, well then Network usage should probably be avoided as it's deprecated and more stuff might break in future chrome versions

Please unblock challenges.cloudflare.com to proceed.

this error appears when Fetch.fulfillRequest has no response headers

    async def requestPaused(args):
        # ...
        body = base64.b64encode(body).decode("ascii")
        _args = {
            "requestId": args["requestId"],
            "responseCode": args["responseStatusCode"],
            # fix: Please unblock challenges.cloudflare.com to proceed.
            "responseHeaders": args["responseHeaders"],
            "body": body,
        }
        if args["responseStatusText"] != "":
            # empty string throws "Invalid http status code or phrase"
            _args["responsePhrase"] = args["responseStatusText"]
        await target.execute_cdp_cmd("Fetch.fulfillRequest", _args)

Uh nice that we've finally got it working! Great job!
Wonder, is there any way to optimize base64.b64encode(body).decode("ascii") even more btw?

And also, are we sure that Fetch.enable intercepts as well:

  1. WebWorkers & service-workers
  2. cross-origin/OOPIF iframes
  3. background scripts in extensions.

I remember there being Network.setBypassServiceWorker, however no idea if it affects Fetch.enable as well.

If some still don't get intercepted, target interception might be worth considering, see https://github.com/kaliiiiiiiiii/Selenium-Driverless/blob/4b71a5ab59a193d41eab80ed8f68a66e8ad5c230/tests/target_interception.py . I'm however not sure how reliable it is and how bad the timing leaks are.

@milahu
Author

milahu commented Jan 24, 2024

then Network usage should probably be avoided as it's deprecated and more stuff might break in future chrome versions

Network.streamResourceContent and Network.dataReceived
are not deprecated, but experimental
so i expect them to work in newer versions

is there any way to optimize base64.b64encode(body).decode("ascii")

im afraid no... i also would prefer a binary protocol, no base64, no json

base64 is needed for Fetch.fulfillRequest

body: string: A response body. If absent, original response body will be used if the request is intercepted at the response stage and empty body will be used if the request is intercepted at the request stage. (Encoded as a base64 string when passed over JSON)

when i pass the body as bytes i get

TypeError: Object of type bytes is not JSON serializable
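
A minimal sketch of why the base64 step can't be skipped as long as the message goes out as JSON: `json.dumps` rejects `bytes`, while the base64 text form round-trips without loss. The helper name `encode_body` is made up for illustration, it is not part of selenium_driverless.

```python
import base64
import json


def encode_body(body: bytes) -> str:
    """Return the base64 text form that Fetch.fulfillRequest expects for `body`."""
    return base64.b64encode(body).decode("ascii")


raw = b"\x00\xffhello"

# Raw bytes cannot be placed into a JSON message.
try:
    json.dumps({"body": raw})
    bytes_are_serializable = True
except TypeError:
    bytes_are_serializable = False

# The base64 string serializes fine and decodes back to the same bytes.
message = json.dumps({"body": encode_body(raw)})
decoded = base64.b64decode(json.loads(message)["body"])
```

The cost is the usual base64 overhead (4 output characters per 3 input bytes), plus holding the whole body in memory.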

per CDP docs, the only non-JSON endpoint is

WebSocket /devtools/page/{targetId}
The WebSocket endpoint for the protocol.

are we sure that Fetch.enable intercepts as well

no idea, i dont need these targets

in Fetch.requestPaused.py im calling

    target = await driver.current_target
    # ...
    await target.execute_cdp_cmd("Fetch.enable", args)
    await target.add_cdp_listener("Fetch.requestPaused", requestPaused)

but this also works with

    await driver.execute_cdp_cmd("Fetch.enable", args)
    await driver.add_cdp_listener("Fetch.requestPaused", requestPaused)

then requestPaused should be called for all targets
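
The `args` value is not shown above. As a rough sketch, assuming interception at the response stage for all URLs, it could be built like this. The `patterns`, `urlPattern` and `requestStage` keys come from the CDP `Fetch.enable` parameters; the helper name `fetch_enable_args` is hypothetical.

```python
def fetch_enable_args(url_pattern: str = "*", stage: str = "Response") -> dict:
    """Build the parameter dict for the CDP command Fetch.enable.

    `stage` selects where requests are paused: "Request" (before sending)
    or "Response" (after response headers are received).
    """
    if stage not in ("Request", "Response"):
        raise ValueError(f"unknown request stage: {stage!r}")
    return {"patterns": [{"urlPattern": url_pattern, "requestStage": stage}]}


args = fetch_enable_args()

# Usage, following the calls shown above (not executed here):
#   await driver.execute_cdp_cmd("Fetch.enable", args)
#   await driver.add_cdp_listener("Fetch.requestPaused", requestPaused)
```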

@kaliiiiiiiiii
Owner

kaliiiiiiiiii commented Jan 24, 2024

Also, I'm just thinking: if we can't stream the responses when intercepting the requests, there's technically a way to detect the timing (if the server responds in chunks), right?

And even if it were possible, I suppose there could be a way to set up a server with specific chunk timing & size + detect that in JavaScript.

See http://scatter.cowchimp.com/ for a poc on scattering the chunk timing

@milahu
Author

milahu commented Jan 24, 2024

aah, now i understand your question

are we sure that Fetch.enable intercepts as well

so ideally, all targets should be intercepted
to add the same latency to all requests

practically, i would avoid this premature optimization
because different latencies can have legitimate reasons
like different cpu loads on different cpu cores

maybe put this on a todo list / future work list / debug ideas list
in case cloudflare blocking becomes more aggressive

@kaliiiiiiiiii
Owner

target = await driver.current_target
# ...
await target.execute_cdp_cmd("Fetch.

yeah ofc - as this executes the CDP command on the same target.

I'm not sure if/how driver.base_target behaves tbh. I could imagine that service-worker requests are only covered by base_target. At least for target interception, this is the case.

@milahu
Author

milahu commented Jan 24, 2024

Network.streamResourceContent and Network.dataReceived
are not deprecated, but experimental
so i expect them to work in newer versions

bad news: this also fails with chromium 120

maybe this is a bug in selenium_driverless?
tomorrow i will port Network.dataReceived.py to selenium
i would be surprised if this is a chromium bug

@kaliiiiiiiiii
Owner

Network.streamResourceContent and Network.dataReceived
are not deprecated, but experimental
so i expect them to work in newer versions

bad news: this also fails with chromium 120

maybe this is a bug in selenium_driverless? tomorrow i will port Network.dataReceived.py to selenium i would be surprised if this is a chromium bug

mhh maybe try with bare CDP-socket. Wouldn't know why driverless could break this. Unless it's some chrome flag which gets applied by default

@milahu
Author

milahu commented Jan 26, 2024

tomorrow i will port Network.dataReceived.py to selenium

not possible, because chromedriver does not support the Network.streamResourceContent command

so there is no

await session.execute(devtools.network.stream_resource_content(request_id))
# or
driver.execute("Network.streamResourceContent", {"requestId": request_id})

there is only network.take_response_body_for_interception_as_stream

await session.execute(devtools.network.take_response_body_for_interception_as_stream(interception_id))

... but that requires an interception_id
and there is still no IO.write so i cannot send the stream to chromium

see also Selenium 4: how add event listeners in CDP

CDP is broken by design?

i have the impression that this feature (reading and writing of streams)
is deliberately not implemented by CDP

see also Fetch.fulfillRequest and (very) long body

Unfortunately, there's no streaming support for Fetch network interception at the moment

yeah, totally "unfortunately" and totally "at the moment"

no, i guess this is very deliberate sabotage, to prevent "abusing" chromium as a generic http client
which is pretty much what we are trying to do here...

dynamic analysis

so... i really tried to avoid this part (because i have zero experience here)
but i will have to use frida to insert hooks into the chromium binary

for now i gave up on intercepting requests... chrome seems to make it really hard, also to provide security against MITM attacks

probably i would try the frida route as i described in wkeeling/selenium-wire#656 (comment)

lets see what tomorrow will bring ; )

@kaliiiiiiiiii
Owner

tomorrow i will port Network.dataReceived.py to selenium

not possible, because chromedriver does not support the Network.streamResourceContent command

so there is no

await session.execute(devtools.network.stream_resource_content(request_id))
# or
driver.execute("Network.streamResourceContent", {"requestId": request_id})

there is only network.take_response_body_for_interception_as_stream

await session.execute(devtools.network.take_response_body_for_interception_as_stream(interception_id))

... but that requires an interception_id and there is still no IO.write so i cannot send the stream to chromium

see also Selenium 4: how add event listeners in CDP

CDP is broken by design?

i have the impression that this feature (reading and writing of streams) is deliberately not implemented by CDP

Yeah that might indeed be the case. Also for security reasons, such as someone streaming everything encrypted through a proxy.

see also Fetch.fulfillRequest and (very) long body

Unfortunately, there's no streaming support for Fetch network interception at the moment

yeah, totally "unfortunately" and totally "at the moment"

no, i guess this is very deliberate sabotage, to prevent "abusing" chromium as a generic http client which is pretty much what we are trying to do here...

yeah, I guess so

dynamic analysis

so... i really tried to avoid this part (because i have zero experience here) but i will have to use frida to insert hooks into the chromium binary

for now i gave up on intercepting requests... chrome seems to make it really hard, also to provide security against MITM attacks
probably i would try the frida route as i described in wkeeling/selenium-wire#656 (comment)

lets see what tomorrow will bring ; )

well, have fun haha 👀 gonna be a pain. Pretty sure Chrome has stuff implemented against that.

@kaliiiiiiiiii
Owner

not resolved yet lol

@kaliiiiiiiiii kaliiiiiiiiii reopened this Jan 29, 2024
@milahu
Author

milahu commented Jan 29, 2024

well... the original issue is fixed by sending responseHeaders

currently i dont have time to implement reading and writing of streams
also i guess this is out-of-scope for selenium_driverless
because this is not possible with CDP

@kaliiiiiiiiii
Owner

kaliiiiiiiiii commented Feb 1, 2024

well... the original issue is fixed by sending responseHeaders

currently i dont have time to implement reading and writing of streams also i guess this is out-of-scope for selenium_driverless because this is not possible with CDP

Hmm does https://bugs.chromium.org/p/chromium/issues/detail?id=1138839 still apply tho?
Also, I'm not that sure the headers all keep their original order tbh

Maybe using binaryResponseHeaders for continuing the request would be more safe?
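
A rough sketch of what that could look like. Per the CDP docs, `binaryResponseHeaders` is a \0-separated series of `name: value` pairs, base64-encoded when passed over JSON. The helper name and the choice of UTF-8 for encoding the header text are assumptions made for illustration, not something verified against Chromium.

```python
import base64
from typing import Dict, List


def to_binary_response_headers(headers: List[Dict[str, str]]) -> str:
    """Convert CDP HeaderEntry dicts ({"name": ..., "value": ...}) into the
    base64 string form of binaryResponseHeaders, keeping the given order."""
    # Assumption: header text is encoded as UTF-8 before base64 encoding.
    raw = "\0".join(f'{h["name"]}: {h["value"]}' for h in headers)
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


example = to_binary_response_headers([
    {"name": "Content-Type", "value": "text/html"},
    {"name": "Set-Cookie", "value": "a=1"},
])

# Usage idea (not executed here): pass the result as
# "binaryResponseHeaders" instead of "responseHeaders" in the
# Fetch.fulfillRequest arguments.
```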

@kaliiiiiiiiii
Owner

@milahu

probably i would try the frida route

Maybe https://github.com/tomer8007/chromium-ipc-sniffer could be worth considering 👀
screenshot below is 4 years old, some stuff might have changed ofc.

@milahu
Author

milahu commented Feb 29, 2024

i would be surprised if that works
the raw HTTP traffic is hidden for better security

However, this project won't see anything that doesn't go over pipes, which is mostly shared memory IPC:

  • Mojo data pipe contents (raw networking buffers, audio, etc.)

... so the raw HTTP traffic is in shared memory

the most promising method is running chromium in a debugger, either gdb or lldb
but i have to disable sandboxing to set breakpoints on BIO_read and BIO_write
radare is too slow, frida fails to hook the functions
gdb works, but parsing its output is slow, and gdb in python is kinda broken
lldb would be better for interfacing with python (or native code), but its kinda broken...
see also chromium-capture-http

but all these are workarounds
and a proper fix would be to implement full http stream support
to fix either Fetch.requestPaused.py or Network.dataReceived.py

effectively, this would allow inserting an http proxy
with full control over request and response streams

its surprising that such a basic feature is missing

there is Fetch.takeResponseBodyAsStream and IO.read
but not Fetch.giveResponseBodyAsStream and IO.write

there is Network.takeResponseBodyForInterceptionAsStream and IO.read
but not Network.giveResponseBodyForInterceptionAsStream and IO.write
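
For reference, the read half that does exist can be sketched roughly like this. `send` stands for any coroutine function that takes a CDP method name and a params dict, which is an assumption about the caller; `target.execute_cdp_cmd` as used earlier in this thread should fit. The helper name `read_body_stream` is hypothetical.

```python
import asyncio
import base64
from typing import AsyncIterator, Awaitable, Callable

SendCdp = Callable[[str, dict], Awaitable[dict]]


async def read_body_stream(
    send: SendCdp, request_id: str, chunk_size: int = 65536
) -> AsyncIterator[bytes]:
    """Yield the body of a paused response chunk by chunk via IO.read."""
    res = await send("Fetch.takeResponseBodyAsStream", {"requestId": request_id})
    handle = res["stream"]
    try:
        while True:
            chunk = await send("IO.read", {"handle": handle, "size": chunk_size})
            data = chunk["data"]
            if data:
                # IO.read flags binary chunks with base64Encoded.
                if chunk.get("base64Encoded"):
                    yield base64.b64decode(data)
                else:
                    yield data.encode("utf-8")
            if chunk["eof"]:
                break
    finally:
        # Release the stream handle once reading stops.
        await send("IO.close", {"handle": handle})
```

This only covers reading. Once the body is taken as a stream, the request can't simply be continued; it has to be cancelled or fulfilled with a complete body, which is exactly the missing write half described above.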

currently this has zero priority for me, i just dont need it

@kaliiiiiiiiii
Owner

will be fixed with https://github.com/kaliiiiiiiiii/Selenium-Driverless/blob/dev/src/selenium_driverless/scripts/network_interceptor.py

I'll close this issue when it's released & the documentation is complete

@kaliiiiiiiiii
Owner

resolved with https://kaliiiiiiiiii.github.io/Selenium-Driverless/api/RequestInterception/

@milahu
Author

milahu commented May 27, 2024

a proper fix would be to implement full http stream support

nothing new from google
https://issues.chromium.org/issues/332570739

just another feature request
which would be easy to implement, but is ignored as "low priority"
