**Run on Google Colab (Quickstart)**

```bash
! git clone --branch main --single-branch https://github.com/sbaaihamza/scrapping-lib.git
%cd scrapping-lib
! pip install -e ".[browser,dev]"
# Download the Playwright browser binaries:
! playwright install
# Preferred (also installs OS deps automatically on supported distros):
! playwright install --with-deps chromium
# Manual deps fallback, only needed if the command above fails:
! apt-get update
! apt-get install -y libxcomposite1 libxcursor1 libgtk-3-0 libatk1.0-0 libcairo2 libgdk-pixbuf2.0-0
%cd /content/scrapping-lib/notebooks
```

*Note: Playwright has both sync and async APIs. These notebooks are designed to be async-safe for Jupyter/Colab. If you encounter OS dependency issues, use the `playwright install --with-deps chromium` command.*



# Engine In-Depth: Browser Engine Cases

This notebook explores the `browser` engine, focusing on JS-heavy pages, page actions, and professional diagnostics using Playwright.

### Learning Outcomes
*   Understand **Playwright Architecture** (Browser vs. Context vs. Page).
*   Master **Debugging Tools** (Headed mode, slowMo, Traces).
*   Conduct a **Hard Targets Diagnostics Lab** to identify bot-detection signals.
*   Implement **Challenge Detection** (Turnstile/reCAPTCHA) with safe stop-and-record behavior.

## 0) Setup & Readiness

In [None]:
import json
import os
import sys
from pathlib import Path

import pandas as pd

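# Put the repository root on sys.path *before* importing from `scrapping`.
# Assumes the notebook was started from the `notebooks/` directory; re-running
# this cell after the chdir below would move REPO_ROOT one level too far up.
REPO_ROOT = Path.cwd().parent
sys.path.append(str(REPO_ROOT))
os.chdir(str(REPO_ROOT))

from scrapping.diagnostics.classifiers import diagnose_rendered_dom, recommend_next_step
from scrapping.engines.browser import BrowserEngine, BrowserEngineOptions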

ONLINE = os.getenv('ONLINE', '0') == '1'
HARD_TARGETS = os.getenv('HARD_TARGETS', '0') == '1'

print(f'Online mode: {ONLINE} | Hard Targets: {HARD_TARGETS}')
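
The two switches above are read from the environment, and only the exact string `'1'` enables them. A minimal sketch of setting them from inside a notebook session follows. `env_flag` is a hypothetical helper that mirrors the comparison used in the setup cell.

```python
import os


def env_flag(name: str, default: str = "0") -> bool:
    """Read a boolean switch from the environment; only the string '1' enables it."""
    return os.getenv(name, default) == "1"


# Enable network access for this session only; hard targets stay opt-in.
os.environ["ONLINE"] = "1"
os.environ.pop("HARD_TARGETS", None)

print(env_flag("ONLINE"), env_flag("HARD_TARGETS"))
```

Because the setup cell reads the flags once, set them before running it, or re-evaluate `ONLINE` and `HARD_TARGETS` afterwards.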

## 1) Playwright Primer

The Browser engine uses **Playwright** to execute a real browser instance.
*   **Browser**: The executable (Chromium, Firefox).
*   **Context**: An isolated incognito-like session (cookies, storage).
*   **Page**: A single tab within a context.
*   **Waits**: Ensuring the DOM is ready using selectors or network idle states.
*   **Artifacts**: Automatic capture of HTML, Screenshots, and Traces.
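
To make the Browser, Context, and Page hierarchy concrete, here is a minimal sketch that uses Playwright's async API directly, without the `BrowserEngine` wrapper. `fetch_rendered_html` is a hypothetical name. How `BrowserEngine` manages these objects internally is not shown in this notebook, so treat this as an illustration of the Playwright layer only.

```python
async def fetch_rendered_html(url: str, selector: str) -> str:
    """Walk the Browser -> Context -> Page hierarchy and return the rendered HTML."""
    # Imported lazily so this cell can be defined before Playwright is installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)  # Browser: the executable
        context = await browser.new_context()             # Context: isolated cookies/storage
        page = await context.new_page()                   # Page: one tab in that context
        try:
            await page.goto(url)
            # Wait: block until the selector matches a visible element.
            await page.wait_for_selector(selector)
            return await page.content()
        finally:
            await context.close()
            await browser.close()
```

In a notebook cell this could be called as `html = await fetch_rendered_html("https://quotes.toscrape.com/js/", ".quote")`, since IPython supports top-level `await`.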

In [None]:
opts = BrowserEngineOptions(headless=True, save_artifacts=True)
engine = BrowserEngine(options=opts)

url = "https://quotes.toscrape.com/js/"
try:
    if ONLINE:
        # Wait for the first quote element so the JS-rendered content is present.
        res = engine.get_rendered(url, wait_for=".quote")
        if res.ok:
            print(f"Fetch OK | Text Len: {len(res.text)}")
        else:
            print(f"Fetch Failed: {res.short_error()}")
        print(f"Artifacts: {res.engine_trace[-1].get('artifacts')}")
    else:
        print("Offline: Engine initialized. Use fixtures to simulate.")
finally:
    # Release the browser even if the fetch raises.
    engine.close()

## 2) Debugging: Headed, slowMo, and Traces

When a script fails, use these tools to see what the browser is doing:
*   **Headed Mode**: `headless=False` to see the window.
*   **slowMo**: Add a delay (e.g., 500ms) between actions to follow along.
*   **Trace Viewer**: Comprehensive recording of every event, network call, and DOM change.
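
The three tools above map onto Playwright options as sketched below, again using Playwright's async API directly. `debug_launch_kwargs` and `trace_visit` are hypothetical helpers, and the default trace path is an assumption. Whether `BrowserEngineOptions` exposes a slow-motion setting is not shown in this notebook.

```python
from pathlib import Path


def debug_launch_kwargs(headed: bool = True, slow_mo_ms: int = 500) -> dict:
    """Build keyword arguments for Playwright's `launch()` in a debugging session."""
    return {"headless": not headed, "slow_mo": slow_mo_ms}


async def trace_visit(url: str, trace_path: str = "debug_artifacts/trace.zip") -> Path:
    """Open `url` in a slowed-down browser and record a Playwright trace."""
    # Imported lazily so this cell can be defined before Playwright is installed.
    from playwright.async_api import async_playwright

    out = Path(trace_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    async with async_playwright() as p:
        browser = await p.chromium.launch(**debug_launch_kwargs())
        context = await browser.new_context()
        # Record screenshots, DOM snapshots, and source locations for the viewer.
        await context.tracing.start(screenshots=True, snapshots=True, sources=True)
        try:
            page = await context.new_page()
            await page.goto(url)
        finally:
            await context.tracing.stop(path=str(out))
            await browser.close()
    return out
```

The recorded file can then be opened with `playwright show-trace debug_artifacts/trace.zip`. A headed launch needs a display, so on Colab pass `headed=False` and rely on the trace instead.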

In [None]:
# Example configuration for debugging (manual run only)
debug_opts = BrowserEngineOptions(
    headless=False, 
    trace=True, 
    artifacts_dir='debug_artifacts'
)
print("To debug: Run in a local environment where a UI is available.")

## 3) Hard Targets: Bot-Detection Lab

We test our engine against bot-detectors to see which automation signals (such as `navigator.webdriver` or headless-browser markers) they flag.
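
As an illustration of what such signals look like, the sketch below inspects a small browser fingerprint for two well-known indicators. `flag_automation_signals`, `FINGERPRINT_JS`, and the sample user agent are all made up for this example. The library's own diagnostics may check different or additional signals.

```python
def flag_automation_signals(fingerprint: dict) -> list[str]:
    """Return the names of automation signals present in a browser fingerprint."""
    flags = []
    # navigator.webdriver is true when the browser is driven by automation.
    if fingerprint.get("webdriver") is True:
        flags.append("webdriver")
    # Headless Chromium has historically advertised itself in the user agent.
    if "HeadlessChrome" in fingerprint.get("userAgent", ""):
        flags.append("headless_user_agent")
    return flags


# JavaScript that could be passed to Playwright's page.evaluate() to collect the fingerprint.
FINGERPRINT_JS = "() => ({ webdriver: navigator.webdriver, userAgent: navigator.userAgent })"

# Made-up fingerprint for demonstration only.
sample = {"webdriver": True, "userAgent": "Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/120.0.0.0"}
print(flag_automation_signals(sample))
```

The helper only reports what is visible. It does not alter the browser, which keeps the lab on the diagnostic side of the line.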

In [None]:
from scrapping.orchestrator import Orchestrator, OrchestratorOptions

config_path = 'examples/configs/real/hard_targets/browser_bot_diagnostics.json'
with open(config_path) as f:
    cfg = json.load(f)

if ONLINE and HARD_TARGETS:
    orch = Orchestrator(options=OrchestratorOptions(results_dir='results/bot_lab'))
    report = orch.run(cfg)
    
    results = []
    for s_id, s_res in report['sources'].items():
        results.append({
            "Target": s_id, 
            "Diagnosis": s_res.get('diagnosis', {}).get('label'),
            "Reason": s_res.get('diagnosis', {}).get('reason')
        })
    print(pd.DataFrame(results))
else:
    print("Lab skipped: Set ONLINE=1 and HARD_TARGETS=1 to run bot diagnostics.")

## 4) Challenge Pages (Compliance-First)

When facing Turnstile or reCAPTCHA, our goal is **detection + safe stop**. We do not attempt bypass.
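
A detection-and-stop policy can be sketched with plain string matching, as below. This is not the implementation of `diagnose_rendered_dom`. The marker list, `ChallengeRecord`, and `stop_if_challenged` are assumptions chosen for illustration, and the markers are not exhaustive.

```python
from dataclasses import dataclass
from typing import Optional

# Substrings commonly found on pages that embed these widgets (assumed, not exhaustive).
CHALLENGE_MARKERS = {
    "turnstile": ("cf-turnstile", "challenges.cloudflare.com/turnstile"),
    "recaptcha": ("g-recaptcha", "www.google.com/recaptcha"),
}


@dataclass
class ChallengeRecord:
    vendor: str
    marker: str


def detect_challenge(html: str) -> Optional[ChallengeRecord]:
    """Return a record for the first challenge marker found, or None."""
    lowered = html.lower()
    for vendor, markers in CHALLENGE_MARKERS.items():
        for marker in markers:
            if marker in lowered:
                return ChallengeRecord(vendor=vendor, marker=marker)
    return None


def stop_if_challenged(html: str) -> dict:
    """Record the challenge and stop; never attempt to solve or bypass it."""
    record = detect_challenge(html)
    if record is not None:
        return {"status": "stopped", "vendor": record.vendor, "marker": record.marker}
    return {"status": "ok"}
```

Returning a record instead of raising keeps the run going for other sources while leaving an auditable note of why this one stopped.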

In [None]:

html_sample = "<html><body><div id='cf-turnstile-widget'></div></body></html>"
diag = diagnose_rendered_dom(html_sample)

print(f"Sample Detection Result: {diag.label}")
print(f"Action: {recommend_next_step(diag)}")

if diag.label == "challenge_detected":
    print("✅ Success == Detection + Graceful Stop.")

### Verification Checklist
- [ ] Engine initializes without loop errors.
- [ ] `wait_for` ensures content is rendered before extraction.
- [ ] Screenshots are saved for every failure.
- [ ] Bot-detection signals are analyzed but not evaded.