**Run on Google Colab (Quickstart)**

```bash
! git clone --branch main --single-branch https://github.com/sbaaihamza/scrapping-lib.git
%cd scrapping-lib
! pip install -e ".[browser,dev]"
# Preferred: install Chromium together with its OS dependencies (supported distros only):
! playwright install --with-deps chromium
# Fallback, only if the command above fails: install the browser and the OS libraries manually.
# ! playwright install chromium
# ! apt-get update
# ! apt-get install -y libxcomposite1 libxcursor1 libgtk-3-0 libatk1.0-0 libcairo2 libgdk-pixbuf2.0-0
%cd /content/scrapping-lib/notebooks
```

*Note: Playwright has both sync and async APIs. These notebooks are designed to be async-safe for Jupyter/Colab. If you encounter OS dependency issues, use the `playwright install --with-deps chromium` command.*



# Online Scraping Playbook: Building & Adjusting Configs

This notebook teaches you how to build and refine scraping configurations for new websites using the `scrapping` library. It focuses on real-world challenges like engine selection, link extraction, and quality control.

## Purpose
- Learn the step-by-step workflow for onboarding a new site.
- Understand how to diagnose and handle common scraping obstacles responsibly.
- Master the configuration schema to balance speed and accuracy.

## Responsible Scraping: Do's and Don'ts
Compliance and ethics are central to our scraping practice.

### DO:
- **Check `robots.txt` and Terms of Service**: Always respect the site's guidelines.
- **Use APIs first**: If a site provides a legitimate API, prefer it over HTML scraping.
- **Rate limit**: Be a good citizen. Don't hammer servers; add delays and limit concurrency.
- **Obtain permission**: For large-scale data collection, reach out to the site owner when in doubt.

### DON'T:
- **Bypass CAPTCHAs**: We do not include evasion logic. If blocked by a CAPTCHA, stop and seek an alternative path.
- **Circumvent Access Controls**: Do not attempt to bypass login walls or restricted areas without proper authorization.
- **Ignore Rate Limits**: Bypassing 429 errors by cycling IPs aggressively is against our policy.
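
The `robots.txt` check from the DO list can be automated before any crawl. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt body and the `my-bot` user agent are made-up sample values, and whether the `scrapping` library performs this check itself is not stated here.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (invented for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robots_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_robots_parser(ROBOTS_TXT)
allowed_public = parser.can_fetch("my-bot", "https://example.com/items/1")
allowed_private = parser.can_fetch("my-bot", "https://example.com/private/data")
delay_s = parser.crawl_delay("my-bot")  # None when the site sets no Crawl-delay

print(f"public allowed: {allowed_public}")
print(f"private allowed: {allowed_private}")
print(f"crawl delay: {delay_s}")
```

For a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` fetches the file instead of parsing a string. A reported crawl delay is a reasonable lower bound for the delay between requests.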

## 1. Setup
Initialize the environment and detect the repository root.

In [None]:
import json
import os
import sys
from pathlib import Path

from scrapping.engines.http import HttpEngine
from scrapping.extraction.link_extractors import LinkExtractRequest, extract_links
from scrapping.orchestrator import Orchestrator, OrchestratorOptions, validate_config
from scrapping.processing.html_to_structured import html_to_structured
from scrapping.processing.quality_filters import evaluate_quality


def find_repo_root(start_path):
    p = Path(start_path).resolve()
    for parent in [p] + list(p.parents):
        if (parent / 'pyproject.toml').exists():
            return parent
    return p

REPO_ROOT = find_repo_root(Path.cwd())
sys.path.append(str(REPO_ROOT))
os.chdir(str(REPO_ROOT))

ONLINE = os.getenv('ONLINE', '0') == '1'
RESULTS_DIR = Path('results_notebook_online')
print(f'Python version: {sys.version}')
print(f'Repo root: {REPO_ROOT}')
print(f'Online mode: {ONLINE}')
print(f'Results will be saved in: {RESULTS_DIR}')

## 2. Config Anatomy Refresher
A typical source config consists of several key sections that define the 'what', 'how', and 'where' of a scraping task.

In [None]:
minimal_config = {
    "source_id": "example_site",
    "engine": { 
        "type": "http",           # Choose: http, browser, hybrid
        "timeout_s": 15,          # Max time per request
        "verify_ssl": True        # Safety first
    },
    "entrypoints": [ 
        { "url": "https://example.com/items?page={page}", "paging": {"mode": "page", "start": 1, "end": 3} }
    ],
    "discovery": {
        "link_extract": {
            "method": "regex",
            "pattern": r"https://example\.com/items/\d+"
        }
    },
    "quality": {
        "min_text_len": 200
    }
}
print(json.dumps(minimal_config, indent=2))

## 3. Website Triage Checklist
Choosing the right engine is crucial for performance and reliability.

### Decision Guide:
1. **Use HTTP** when:
   - Server renders HTML (check `view-source` in your browser).
   - Links and text are clearly visible in the raw response.
   - Site is fast and handles high-volume requests well.
2. **Use Browser** when:
   - Content is loaded via JavaScript after the initial page load.
   - The site is a Single Page Application (SPA).
   - You need to interact with the page (scrolling, clicking) to reveal content.
3. **Use Hybrid** when:
   - The listing pages are fast and static (HTTP is fine).
   - Individual detail pages require JS rendering (Browser is needed).

### Signals to Check:
- **Raw Source**: Press `Ctrl+U` to view the page source. If the text you want isn't there, you probably need the `browser` engine.
- **Network Tab**: Press `F12` and check the Network tab. If you see JSON responses with your data, you might be able to target an API directly via HTTP.
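
The decision guide above can be approximated in code once you have saved the raw listing and detail responses. This is a rough heuristic sketch, not part of the `scrapping` library: `suggest_engine` and the sample markup are hypothetical, and the markers are text you expect to see on each page.

```python
def suggest_engine(listing_html, detail_html, listing_marker, detail_marker):
    """Suggest an engine type from whether expected text appears in the raw HTML."""
    listing_ok = listing_marker.lower() in listing_html.lower()
    detail_ok = detail_marker.lower() in detail_html.lower()
    if listing_ok and detail_ok:
        return "http"      # everything is server-rendered
    if listing_ok:
        return "hybrid"    # listing is static, detail needs JS rendering
    return "browser"       # even the listing is rendered client-side


# Invented sample responses.
static_listing = '<ul><li><a href="/items/1">Blue Widget</a></li></ul>'
static_detail = "<h1>Blue Widget</h1><p>Ships in two days.</p>"
js_shell = '<div id="root"></div><script src="/app.js"></script>'

print(suggest_engine(static_listing, static_detail, "Blue Widget", "Ships in two days"))
print(suggest_engine(static_listing, js_shell, "Blue Widget", "Ships in two days"))
print(suggest_engine(js_shell, js_shell, "Blue Widget", "Ships in two days"))
```

The result is only a starting point: a page can contain the marker text inside an embedded JSON blob, in which case targeting the underlying API over HTTP may be the better choice.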

## 4. Online Debug Harness
These helpers allow us to quickly test site responses and detect blocking.

In [None]:
def debug_fetch_http(url):
    engine = HttpEngine()
    res = engine.get(url)
    print(f'Status: {res.status_code}')
    print(f'Length: {len(res.text) if res.text else 0}')
    if res.text:
        print(f'Snippet: {res.text[:500]}...')
    return res

def detect_blocking(fetch_result):
    if not fetch_result.ok:
        if fetch_result.status_code in (403, 429):
            return 'likely_blocked'
        return 'failed_request'
    
    text = (fetch_result.text or '').lower()
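    # Heuristic only: these substrings can also appear on legitimate pages
    # (e.g. a script served via Cloudflare), so inspect the response before concluding.
    blocked_patterns = ['captcha', 'cloudflare', 'unusual traffic', 'access denied', 'forbidden']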
    for p in blocked_patterns:
        if p in text:
            return 'likely_blocked'
    
    auth_patterns = ['login required', 'sign in to continue', 'please log in']
    for p in auth_patterns:
        if p in text:
            return 'requires_auth'
            
    return 'ok'

## 5. Step-by-Step: Build a Config for a New Site
Let's walk through creating a configuration for a hypothetical site.

### Step A & B: Template and Validation

In [None]:
source_id = 'my_new_site'
entrypoint = 'https://example.com/listings'  # Replace with a real URL when ONLINE=1

new_source = {
    'source_id': source_id,
    'engine': {'type': 'http'},
    'entrypoints': [{'url': entrypoint}],
    'storage': {'items_format': 'jsonl'}
}

v = validate_config({'sources': [new_source]})
print(f'Validation ok: {v["ok"]}')

### Step C: Fetch & Inspect Listing
We extract links to find our detail pages.

In [None]:
if ONLINE:
    print(f'Fetching: {entrypoint}')
    res = debug_fetch_http(entrypoint)
    html = res.text
else:
    print('ONLINE=0: using fixtures (naukrigulf)')
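    with open('tests/fixtures/html/listing_naukrigulf.html', encoding='utf-8') as f:
        html = f.read()

# Define extraction (adjust the pattern to the site you inspected).
# The character class keeps each match inside a single URL; a greedy `.*`
# can span several links when they share one line.
link_pattern = r'https://www\.naukrigulf\.com/[^\s"\'<>]*-jobs-\d+'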
req = LinkExtractRequest(html=html, method='regex', pattern=link_pattern)
links = extract_links(req)
print(f'Found {len(links)} links. Top 3: {links[:3]}')
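
When tuning a regex link pattern, watch for greedy wildcards. The sketch below uses only the standard `re` module on invented single-line markup (minified pages often look like this); `regex_links` is a hypothetical stand-in and makes no claim about how `extract_links` works internally.

```python
import re

# Invented listing markup with two links on one line.
html = (
    '<a href="https://example.com/python-developer-jobs-101">A</a>'
    '<a href="https://example.com/data-engineer-jobs-202">B</a>'
)


def regex_links(html, pattern):
    """Return unique regex matches in document order."""
    return list(dict.fromkeys(m.group(0) for m in re.finditer(pattern, html)))


# Greedy `.*` runs from the first URL to the last "-jobs-<id>" on the line.
greedy = regex_links(html, r'https://example\.com/.*-jobs-\d+')

# A character class that excludes quotes, whitespace and angle brackets
# stops each match at the end of its own URL.
bounded = regex_links(html, r'https://example\.com/[^\s"\'<>]*-jobs-\d+')

print(f"greedy matches: {len(greedy)}")
print(f"bounded matches: {bounded}")
```

If the link count looks suspiciously low, or an extracted "URL" contains HTML tags, an unbounded wildcard is a likely cause.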

### Step D, E & F: Parse Detail and Quality
We take one link, fetch it, and extract structured data.

In [None]:
detail_url = links[0] if links else 'https://example.com/detail/1'
if ONLINE:
    print(f'Fetching detail: {detail_url}')
    res = debug_fetch_http(detail_url)
    detail_html = res.text
else:
    print('ONLINE=0: using fixtures (naukrigulf)')
    with open('tests/fixtures/html/detail_naukrigulf.html', encoding='utf-8') as f:
        detail_html = f.read()

doc = html_to_structured(detail_html, url=detail_url)
item = doc.as_item()
print(f'Extracted Title: {item["title"]}')

q = evaluate_quality(item, rules={'min_text_len': 300})
print(f'Quality Check: Keep={q.keep}, Issues={[i.code for i in q.issues]}')
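
To make the idea of a quality gate concrete, here is a minimal sketch of a length-based rule. It is not the implementation of `evaluate_quality`: the `QualityResult` class, the `text` field name, and the issue codes are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class QualityResult:
    keep: bool
    issues: list = field(default_factory=list)


def check_min_text_len(item, min_text_len):
    """Flag items with no title or with too little body text."""
    issues = []
    if not (item.get("title") or "").strip():
        issues.append("missing_title")      # hypothetical issue code
    if len((item.get("text") or "").strip()) < min_text_len:
        issues.append("text_too_short")     # hypothetical issue code
    return QualityResult(keep=not issues, issues=issues)


short_item = {"title": "Data Engineer", "text": "Apply now."}
long_item = {"title": "Data Engineer", "text": "x" * 400}

print(check_min_text_len(short_item, min_text_len=300))
print(check_min_text_len(long_item, min_text_len=300))
```

A threshold that rejects most items usually means the text extraction missed the main content, not that the pages are low quality, so check the extracted text before lowering the limit.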

### Step G & H: Paging and Actions
Configure how to handle multiple pages and interactions.

In [None]:
paging_example = {
    'mode': 'page',
    'start': 1,
    'end': 10,
    'page_param': 'p' # e.g. ?p=1, ?p=2
}

actions_example = [
    {'type': 'scroll', 'params': {'mode': 'down', 'repeat': 3}},
    {'type': 'wait_for', 'selector': '.content-loaded'}
]
print('Paging and Actions are set in the source config to handle dynamic loading.')
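
To see which URLs a paging block would produce, it helps to expand it by hand. The sketch below assumes that `end` is inclusive, that a `{page}` placeholder in the URL takes precedence, and that otherwise the page number is set as the `page_param` query parameter; the library's actual expansion rules may differ, and `expand_paging` is a hypothetical helper.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def expand_paging(url, paging):
    """Expand a paging config into the list of page URLs (assumed semantics)."""
    pages = range(paging["start"], paging["end"] + 1)  # `end` treated as inclusive
    if "{page}" in url:
        return [url.replace("{page}", str(n)) for n in pages]

    param = paging.get("page_param", "page")
    parts = urlparse(url)
    # Keep existing query parameters, dropping any previous page value.
    base_query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param]
    urls = []
    for n in pages:
        query = urlencode(base_query + [(param, str(n))])
        urls.append(urlunparse(parts._replace(query=query)))
    return urls


templated = expand_paging(
    "https://example.com/items?page={page}",
    {"mode": "page", "start": 1, "end": 3},
)
by_param = expand_paging(
    "https://example.com/listings?sort=new",
    {"mode": "page", "start": 1, "end": 2, "page_param": "p"},
)
print(templated)
print(by_param)
```

Opening the first and last generated URL in a browser is a quick way to confirm that the site really uses that parameter and that the last page exists.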

### Step I: Small Online Trial
Run the full orchestrator for a limited scope.

In [None]:
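from copy import deepcopy

if ONLINE:
    # Deep-copy so the trial restriction does not modify new_source itself
    trial_cfg = {'sources': [deepcopy(new_source)]}
    # Restrict to 1 page for trial
    trial_cfg['sources'][0]['entrypoints'][0]['paging'] = {'pages': 1}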
    
    orch = Orchestrator(options=OrchestratorOptions(results_dir=RESULTS_DIR))
    out = orch.run(trial_cfg)
    print(f'Trial complete. Summary: {out["summary"]}')
else:
    print('ONLINE=0: skipping trial.')
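
After a trial run, a quick look at the saved items tells you more than the summary alone. The sketch below summarizes a JSONL items file with the standard library. Where the orchestrator writes that file under the results directory is not specified here, so the path is a parameter, and the sample records are invented.

```python
import json
import tempfile
from pathlib import Path


def summarize_jsonl(path):
    """Count items and how many lack a title in a JSONL file (one JSON object per line)."""
    total = 0
    missing_title = 0
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            item = json.loads(line)
            total += 1
            if not item.get("title"):
                missing_title += 1
    return {"items": total, "missing_title": missing_title}


# Demo on a temporary file with invented records.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "items.jsonl"
    sample.write_text(
        '{"title": "Data Engineer", "url": "https://example.com/items/1"}\n'
        '{"title": "", "url": "https://example.com/items/2"}\n',
        encoding="utf-8",
    )
    summary = summarize_jsonl(sample)

print(summary)
```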

## 6. Real-World Multi-Source Example
Applying the logic to `examples/configs/example_multi_sources.json`.

In [None]:
with open('examples/configs/example_multi_sources.json', encoding='utf-8') as f:
    multi_cfg = json.load(f)

for s in multi_cfg['sources']:
    sid = s['source_id']
    eng_type = s.get('engine', {}).get('type')
    print(f'Source: {sid:<15} | Engine: {eng_type}')
    
    if ONLINE:
        # Tiny trial run
        sub_cfg = {'sources': [s]}
        orch = Orchestrator(options=OrchestratorOptions(results_dir=RESULTS_DIR / 'multi_trial'))
        res = orch.run(sub_cfg)
        print(f'  -> Run Result: {res["summary"]}')

## 7. Troubleshooting Cookbook (Safe)
| Symptom | Likely Cause | Safe Fix |
| :--- | :--- | :--- |
| **403 Forbidden / 429 Too Many Requests** | Rate limiting / Bot detection | Lower the request rate (`rps`/`burst`), increase `min_delay_s`, or use a retry policy with backoff. |
| **Empty HTML / No links found** | Client-side rendering (JS) | Switch to `browser` or `hybrid` engine and add `wait_for` selectors. |
| **'Login Required' messages** | Authenticated session needed | Use a legitimate API or provide an authenticated session token you already possess. |
| **CAPTCHA appears** | Suspicious traffic detected | **Stop scraping**. Use a permitted access path or API (No bypass logic permitted). |
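
The backoff fix from the first row can be sketched as follows. This is a generic illustration, not the library's retry policy: `fetch_with_backoff`, `backoff_delay`, and the `(status, retry_after, body)` return shape of `fetch` are assumptions. It retries only on 429, honors a server-provided wait when one is given, and gives up after a fixed number of attempts.

```python
import random
import time


def backoff_delay(attempt, base_s=1.0, factor=2.0, max_s=60.0, jitter=0.0):
    """Exponential delay for a 0-based attempt, capped at max_s, with optional jitter."""
    delay = min(max_s, base_s * factor ** attempt)
    return delay + random.uniform(0, jitter * delay)


def fetch_with_backoff(fetch, url, max_attempts=4, sleep=time.sleep, **backoff):
    """Call fetch(url) -> (status, retry_after_s_or_None, body), backing off on 429."""
    status = None
    for attempt in range(max_attempts):
        status, retry_after, body = fetch(url)
        if status != 429:
            return status, body  # success, or an error that retrying will not fix
        if attempt < max_attempts - 1:
            # Prefer the server's own hint over our computed delay.
            wait_s = retry_after if retry_after is not None else backoff_delay(attempt, **backoff)
            sleep(wait_s)
    return status, None  # still rate limited: stop instead of pushing harder


# Demo with a fake fetcher and a fake sleep so nothing touches the network.
responses = iter([(429, None, ""), (429, 5, ""), (200, None, "ok")])
slept = []
result = fetch_with_backoff(lambda url: next(responses), "https://example.com/items", sleep=slept.append)
print(result, slept)
```

Passing `jitter=0.2` adds up to 20% random extra delay, which helps when several workers would otherwise retry at the same moment.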

## 8. From Config to Regression Tests
Once a config is working, lock it in with a test using snapshots.

In [None]:
def create_regression_test(source_id, listing_html, detail_html, expected_link_count):
    # Outline only: prints the steps instead of writing snapshots or tests
    print(f'1. Saving listing snapshot: tests/fixtures/html/listing_{source_id}.html')
    print(f'2. Saving detail snapshot:  tests/fixtures/html/detail_{source_id}.html')
    print(f'3. Adding assertion: assert len(extract_links(...)) == {expected_link_count}')

create_regression_test('my_new_site', '...', '...', expected_link_count=len(links))
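
A snapshot-based regression test could look roughly like the sketch below. It uses only the standard library so that it runs anywhere: `re.findall` stands in for `extract_links`, the link pattern and markup are invented, and the snapshot is written to a temporary directory instead of `tests/fixtures/html`.

```python
import re
import tempfile
from pathlib import Path

LINK_PATTERN = r'https://example\.com/items/\d+'  # hypothetical pattern


def save_snapshot(fixtures_dir, name, html):
    """Write an HTML snapshot and return its path."""
    path = Path(fixtures_dir) / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path


def count_links(snapshot_path, pattern=LINK_PATTERN):
    """Count unique links matching the pattern in a saved snapshot."""
    html = Path(snapshot_path).read_text(encoding="utf-8")
    return len(set(re.findall(pattern, html)))


def test_listing_link_count():
    listing = (
        '<a href="https://example.com/items/1">One</a>'
        '<a href="https://example.com/items/2">Two</a>'
        '<a href="https://example.com/items/2">Two again</a>'
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = save_snapshot(tmp, "listing_my_new_site.html", listing)
        assert count_links(path) == 2  # the duplicate link counts once


test_listing_link_count()
print("regression check passed")
```

In the real test suite, the snapshot would be committed once and the test would only read it, so a change in extraction behavior shows up as a failing assertion.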