**Run on Google Colab (Quickstart)**

```bash
! git clone --branch main --single-branch https://github.com/sbaaihamza/scrapping-lib.git
%cd scrapping-lib
! pip install -e ".[browser,dev]"
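# Download the Playwright browser binaries:
! playwright install
# Preferred: installs Chromium together with its OS dependencies on supported distros:
! playwright install --with-deps chromium
# Manual fallback, only if the step above still reports missing OS libraries: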
! apt-get update
! apt-get install -y libxcomposite1 libxcursor1 libgtk-3-0 libatk1.0-0 libcairo2 libgdk-pixbuf2.0-0
%cd /content/scrapping-lib/notebooks
```

*Note: Playwright offers both sync and async APIs. These notebooks are written to be async-safe in Jupyter/Colab, where an event loop is already running. If a browser fails to launch because of missing OS dependencies, rerun `playwright install --with-deps chromium`.*
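
To illustrate why async-safety matters in a notebook, here is a minimal sketch using only `asyncio`. The `fetch_title` coroutine is a hypothetical stand-in for an async browser call (no browser is launched), and `notebook_cell` simulates code running while an event loop is already active, as it is in Jupyter/Colab.

```python
import asyncio


async def fetch_title() -> str:
    """Hypothetical stand-in for an async browser call; no browser is launched."""
    await asyncio.sleep(0)
    return "Example Domain"


async def notebook_cell() -> dict:
    """Simulate a notebook cell: an event loop is already running here."""
    outcome = {}

    # Blocking entry points such as asyncio.run() refuse to start a second loop.
    coro = fetch_title()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        outcome["blocking_call"] = f"RuntimeError: {exc}"

    # Awaiting the coroutine directly works inside the running loop.
    outcome["awaited_call"] = await fetch_title()
    return outcome


# A plain script must start the loop itself. In Jupyter/Colab, write
# `result = await notebook_cell()` instead of the line below.
result = asyncio.run(notebook_cell())
print(result)
```

The same reasoning suggests preferring awaited (async) calls over blocking ones whenever browser code runs directly in a notebook cell.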



# End-to-End Scraping Pipeline Walkthrough

This notebook demonstrates the `scrapping` library's full pipeline using a concrete multi-source configuration.

### Purpose
- Explain the core components of the library.
- Demonstrate how to load, validate, and execute scraping configurations.
- Show the transition from raw HTML to structured, quality-filtered data.

### Output Directory
Results from this notebook will be written to `results_notebook/`.

## 1. Setup & Imports
We start by importing the necessary modules and setting up the environment. To make the notebook work regardless of where it is launched, we locate the repository root by walking up from the current directory until we find `pyproject.toml`, then switch to that directory.

In [None]:
import json
import os
import sys
from pathlib import Path

import pandas as pd

from scrapping.config.migration import migrate_config
from scrapping.extraction.link_extractors import LinkExtractRequest, extract_links
from scrapping.orchestrator import Orchestrator, OrchestratorOptions, validate_config
from scrapping.processing.html_to_structured import html_to_structured
from scrapping.processing.quality_filters import evaluate_quality


# Robust path detection to find repo root
def find_repo_root(start_path):
    p = Path(start_path).resolve()
    for parent in [p] + list(p.parents):
        if (parent / "pyproject.toml").exists():
            return parent
    return p


REPO_ROOT = find_repo_root(Path.cwd())
sys.path.append(str(REPO_ROOT))
os.chdir(str(REPO_ROOT))

print(f"Python version: {sys.version}")
print(f"Repo root: {REPO_ROOT}")

# Toggle between offline and online mode
ONLINE = os.getenv("ONLINE", "0") == "1"
print(f"Online mode: {ONLINE}")

## 2. Load and Explain the Config
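We load the multi-source job-sites configuration and pass it through `migrate_config`, which returns the resolved config together with a flag indicating whether a migration was applied. We then summarize each source in a table.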
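
As a rough guide to the config shape the summary cell expects, here is a minimal sketch. The field names mirror the keys that cell reads. The values (source IDs, URLs, patterns, the `"http"` engine type, and the `"regex"`/`"css"` method names) are invented for illustration and may not match the real example file.

```python
import json

# Hypothetical minimal config: keys mirror those read by the summary cell,
# values are invented for illustration only.
sample_cfg = {
    "sources": [
        {
            "source_id": "demo_jobs",
            "engine": {"type": "http"},
            "entrypoints": [{"url": "https://example.com/jobs"}],
            "discovery": {"link_extract": {"method": "regex", "pattern": r"/jobs/\d+"}},
            "quality": {"min_text_len": 200},
        },
        {
            "source_id": "demo_board",
            "engine": {"type": "browser"},
            "entrypoints": [{"url": "https://example.org/board"}],
            "discovery": {"link_extract": {"method": "css", "selector": "a.job-link"}},
        },
    ]
}


def summarize_source(source: dict) -> dict:
    """Flatten one source into a table row, tolerating missing optional sections."""
    link_extract = source.get("discovery", {}).get("link_extract", {})
    return {
        "source_id": source["source_id"],
        "engine_type": source.get("engine", {}).get("type"),
        "entrypoint": source["entrypoints"][0]["url"],
        "link_extract_method": link_extract.get("method"),
        "pattern_selector": link_extract.get("pattern") or link_extract.get("selector"),
        "min_text_len": source.get("quality", {}).get("min_text_len"),
    }


rows = [summarize_source(s) for s in sample_cfg["sources"]]
print(json.dumps(rows, indent=2))
```

Optional sections fall back to `None`, so a source without a `quality` block still produces a complete row.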

In [None]:
config_path = "examples/configs/example_multi_sources.json"
with open(config_path, encoding="utf-8") as f:
    raw_cfg = json.load(f)

cfg, was_migrated = migrate_config(raw_cfg)
print(f"Config migrated: {was_migrated}")

sources_data = []
for s in cfg["sources"]:
    eng = s.get("engine", {})
    disc = s.get("discovery", {})
    le = disc.get("link_extract", {})
    sources_data.append(
        {
            "source_id": s["source_id"],
            "engine_type": eng.get("type"),
            "entrypoint": s["entrypoints"][0]["url"],
            "link_extract_method": le.get("method"),
            "pattern_selector": le.get("pattern") or le.get("selector"),
            "min_text_len": s.get("quality", {}).get("min_text_len"),
        }
    )

df_sources = pd.DataFrame(sources_data)
df_sources

### Resolved Config (Post-Migration)
Here is the first source after migration. Note how fields such as `config_version` and `storage` are normalized.
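
The general pattern behind a migration step can be sketched as follows. This is not the library's `migrate_config`: the version number, the normalization rules, and where these fields live in the real schema are all assumptions made for illustration. The sketch only shows the `(config, was_migrated)` return shape and the property that a second pass changes nothing.

```python
import copy

CURRENT_VERSION = 2  # hypothetical target version


def migrate_sketch(raw: dict) -> tuple:
    """Return (resolved_config, was_migrated) without mutating the input."""
    cfg = copy.deepcopy(raw)
    was_migrated = False

    # Stamp configs that predate versioning (hypothetical rule).
    if cfg.get("config_version") != CURRENT_VERSION:
        cfg["config_version"] = CURRENT_VERSION
        was_migrated = True

    # Normalize a shorthand string into a dict (hypothetical rule).
    storage = cfg.get("storage")
    if isinstance(storage, str):
        cfg["storage"] = {"format": storage}
        was_migrated = True

    return cfg, was_migrated


legacy = {"storage": "jsonl", "sources": []}
resolved, changed = migrate_sketch(legacy)
again, changed_again = migrate_sketch(resolved)
print(changed, changed_again)
```

Working on a deep copy keeps the raw config available for comparison with the resolved one.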

In [None]:
print(json.dumps(cfg["sources"][0], indent=2))

## 3. Validate Config
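Before running anything, we check the configuration against the schema. `validate_config` returns a dict with an `ok` flag and a list of `issues`, each carrying a `level` and a `msg`.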
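
To make the result shape concrete, here is a small stand-in validator that produces the same `ok`/`issues` structure. The specific checks and messages are hypothetical, and the library's schema is likely stricter. The convention that only `error`-level issues make a config invalid is also an assumption.

```python
def validate_sketch(cfg: dict) -> dict:
    """Hypothetical validator returning {'ok': bool, 'issues': [{'level', 'msg'}]}."""
    issues = []

    sources = cfg.get("sources")
    if not isinstance(sources, list) or not sources:
        issues.append({"level": "error", "msg": "'sources' must be a non-empty list"})
        sources = []

    for idx, source in enumerate(sources):
        label = source.get("source_id", f"sources[{idx}]")
        if "source_id" not in source:
            issues.append({"level": "error", "msg": f"{label}: missing 'source_id'"})
        if not source.get("entrypoints"):
            issues.append({"level": "error", "msg": f"{label}: no entrypoints defined"})
        if "quality" not in source:
            issues.append({"level": "warning", "msg": f"{label}: no quality rules set"})

    # Assumption: warnings are reported but do not invalidate the config.
    ok = not any(issue["level"] == "error" for issue in issues)
    return {"ok": ok, "issues": issues}


good = validate_sketch(
    {"sources": [{"source_id": "demo", "entrypoints": [{"url": "https://example.com"}]}]}
)
bad = validate_sketch({"sources": [{"entrypoints": []}]})
print(good["ok"], bad["ok"])
```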

In [None]:
validation_result = validate_config(cfg, verbose=True)
if validation_result["ok"]:
    print("Config is VALID.")
else:
    print("Config is INVALID. Issues:")
    for issue in validation_result["issues"]:
        lvl = issue["level"].upper()
        msg = issue["msg"]
        print(f"- [{lvl}] {msg}")

## 4. Offline Fixtures
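In offline mode, local HTML files stand in for fetched pages. Each source is expected to have a `listing_<source_id>.html` and a `detail_<source_id>.html` under `tests/fixtures/html/`; the next cell reports which of them exist.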
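
When adding a new source, it can help to list exactly which fixtures are still missing. The sketch below applies the same naming convention to a temporary directory, so it runs without the repository. The `missing_fixtures` helper and the `demo_jobs` source ID are hypothetical.

```python
import tempfile
from pathlib import Path


def fixture_path(root: Path, kind: str, source_id: str) -> Path:
    """Build the fixture path using the <kind>_<source_id>.html convention."""
    return root / "tests" / "fixtures" / "html" / f"{kind}_{source_id}.html"


def missing_fixtures(root: Path, source_ids, kinds=("listing", "detail")) -> list:
    """Return the file names of fixtures that do not exist yet."""
    return [
        fixture_path(root, kind, sid).name
        for sid in source_ids
        for kind in kinds
        if not fixture_path(root, kind, sid).exists()
    ]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Create only the listing fixture so the detail fixture shows up as missing.
    listing = fixture_path(root, "listing", "demo_jobs")
    listing.parent.mkdir(parents=True)
    listing.write_text("<html></html>", encoding="utf-8")
    missing = missing_fixtures(root, ["demo_jobs"])

print(missing)  # ['detail_demo_jobs.html']
```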

In [None]:
def get_fixture_path(kind, source_id):
    return REPO_ROOT / f"tests/fixtures/html/{kind}_{source_id}.html"


for s in cfg["sources"]:
    sid = s["source_id"]
    listing_p = get_fixture_path("listing", sid)
    detail_p = get_fixture_path("detail", sid)
    print(
        f"Source {sid}: Listing fixture exists: {listing_p.exists()}, Detail fixture exists: {detail_p.exists()}"
    )

## 5. Link Extraction Demo
We extract detail-page links from each listing fixture, using the regex pattern or CSS selector defined under `discovery.link_extract` in the config.
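
For intuition about the two strategies, here is a standard-library sketch. It is not how `extract_links` is implemented. The regex variant filters `href` values by pattern, and the selector variant only emulates a simple `a.<class>` selector via `html.parser`. The sample HTML, base URL, and function names are invented.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

LISTING_HTML = """
<ul>
  <li><a class="job-link" href="/jobs/101">Data Engineer</a></li>
  <li><a class="job-link" href="/jobs/102">ML Engineer</a></li>
  <li><a class="nav" href="/about">About</a></li>
</ul>
"""
BASE_URL = "https://example.com/careers"


def _absolute_unique(hrefs, base_url):
    """Resolve relative links against base_url and drop duplicates, keeping order."""
    return list(dict.fromkeys(urljoin(base_url, h) for h in hrefs))


def links_by_regex(html: str, pattern: str, base_url: str) -> list:
    """Keep href values that match the given regex pattern."""
    hrefs = re.findall(r'href="([^"]+)"', html)
    return _absolute_unique([h for h in hrefs if re.search(pattern, h)], base_url)


class _AnchorClassParser(HTMLParser):
    """Collect hrefs of <a> tags carrying a given class (emulates 'a.<class>')."""

    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "a" and self.css_class in classes and attr.get("href"):
            self.hrefs.append(attr["href"])


def links_by_class(html: str, css_class: str, base_url: str) -> list:
    parser = _AnchorClassParser(css_class)
    parser.feed(html)
    return _absolute_unique(parser.hrefs, base_url)


regex_links = links_by_regex(LISTING_HTML, r"/jobs/\d+", BASE_URL)
class_links = links_by_class(LISTING_HTML, "job-link", BASE_URL)
print(regex_links)
print(class_links)
```

Both strategies skip the navigation link here. In practice, a selector tends to survive URL changes, while a regex tends to survive markup changes.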

In [None]:
for s in cfg["sources"]:
    sid = s["source_id"]
    le_cfg = s["discovery"]["link_extract"]

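    # Skip sources without a listing fixture instead of raising FileNotFoundError
    listing_path = get_fixture_path("listing", sid)
    if not listing_path.exists():
        print(f"Source {sid}: no listing fixture, skipping.\n")
        continue
    with open(listing_path, encoding="utf-8") as f:
        html = f.read()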

    req = LinkExtractRequest(
        html=html,
        base_url=s["entrypoints"][0]["url"],
        method=le_cfg["method"],
        pattern=le_cfg.get("pattern"),
        selector=le_cfg.get("selector"),
        normalize=True,
    )

    links = extract_links(req)
    print(f"Source {sid}: Found {len(links)} links. Samples:")
    for url in links[:3]:
        print(f"  - {url}")
    print()

## 6. HTML -> Structured Demo
This step converts raw HTML into a structured document object that exposes the page `title`, the main `text`, and the name of the `extractor` that produced them.
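
A stripped-down version of the idea, using only the standard library, might look like the sketch below. It is an illustration, not the library's extractor: `StructuredDoc` and `to_structured_sketch` are hypothetical names, and real main-content extraction is considerably more involved than collecting visible text.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class StructuredDoc:
    """Hypothetical container holding the extracted title and text."""
    title: str
    text: str


class _TextParser(HTMLParser):
    SKIP = {"script", "style"}  # tags whose content is never visible text

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        chunk = " ".join(data.split())  # collapse whitespace
        if not chunk:
            return
        if self._in_title:
            self.title_parts.append(chunk)
        elif not self._skip_depth:
            self.text_parts.append(chunk)


def to_structured_sketch(html: str) -> StructuredDoc:
    parser = _TextParser()
    parser.feed(html)
    return StructuredDoc(" ".join(parser.title_parts), " ".join(parser.text_parts))


SAMPLE = """<html><head><title>Data Engineer</title><style>p {color: red}</style></head>
<body><h1>Data Engineer</h1><p>Build pipelines.</p><script>track()</script></body></html>"""

doc_sketch = to_structured_sketch(SAMPLE)
print(doc_sketch)
```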

In [None]:
for s in cfg["sources"]:
    sid = s["source_id"]
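    # Skip sources without a detail fixture instead of raising FileNotFoundError
    detail_path = get_fixture_path("detail", sid)
    if not detail_path.exists():
        print(f"Source {sid}: no detail fixture, skipping.\n")
        continue
    with open(detail_path, encoding="utf-8") as f:
        html = f.read()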

    doc = html_to_structured(html, url=f"https://example.com/mock/{sid}")
    print(f"Source {sid}:")
    print(f"  - Title: {doc.title}")
    print(f"  - Text length: {len(doc.text)}")
    print(f"  - Extractor: {doc.extractor}")
    print(f"  - Snippet: {doc.text[:100]}...")
    print()

## 7. Quality Filters Demo
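We check items against each source's quality rules (e.g., minimum text length). To exercise the thresholds predictably, the cell uses two dummy items, one too short and one long enough, rather than the documents extracted above.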
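
The result shape used in this section (a `keep` flag plus a list of issues with a `code`) can be mimicked with a small stand-in. The rule set and the issue codes `text_too_short` and `missing_title` are hypothetical, and the real `evaluate_quality` may apply different rules and defaults.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    code: str    # machine-readable reason (hypothetical codes below)
    detail: str  # human-readable explanation


@dataclass
class QualityResult:
    keep: bool
    issues: list = field(default_factory=list)


def evaluate_sketch(item: dict, rules: dict) -> QualityResult:
    """Apply a minimum-length rule and a title check; keep only issue-free items."""
    issues = []

    min_len = rules.get("min_text_len", 0)
    text = (item.get("text") or "").strip()
    if len(text) < min_len:
        issues.append(Issue("text_too_short", f"{len(text)} < {min_len} characters"))

    if not (item.get("title") or "").strip():
        issues.append(Issue("missing_title", "title is empty"))

    return QualityResult(keep=not issues, issues=issues)


rules_sketch = {"min_text_len": 200}
short = evaluate_sketch({"title": "Short", "text": "Too brief."}, rules_sketch)
full = evaluate_sketch({"title": "Full Post", "text": "Long enough. " * 20}, rules_sketch)
print(short.keep, [i.code for i in short.issues])
print(full.keep, [i.code for i in full.issues])
```

Returning issue codes instead of a bare boolean makes it possible to count rejection reasons per source later.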

In [None]:
for s in cfg["sources"]:
    sid = s["source_id"]
    rules = s.get("quality", {})
    min_len = rules.get("min_text_len", 200)

    # Using dummy items to test thresholds
    test_items = [
        {"title": "Short", "text": "Too brief."},  # Should fail
        {
            "title": "Full Post",
            "text": "A long enough post to pass the quality threshold. " * 20,
        },  # Should pass
    ]

    print(f"Source {sid} (min_text_len: {min_len}):")
    for item in test_items:
        res = evaluate_quality(item, rules=rules)
        issues = [i.code for i in res.issues]
        print(f"  - Text length {len(item['text']):>4}: Keep={res.keep} Issues={issues}")
    print()

## 8. Orchestrator Dry Run
A dry run validates the configuration and plans the run without making any network calls.

In [None]:
orch = Orchestrator(options=OrchestratorOptions(results_dir="results_notebook", dry_run=True))
out = orch.run(cfg)
print(json.dumps(out, indent=2))

## 9. Mini Offline 'Simulated Run'
We simulate the pipeline by manually processing our fixtures and saving the artifacts to `results_notebook/simulated_run/`.
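
The `jsonl` format used here stores one JSON object per line. As a standalone illustration of that format, the sketch below writes and re-reads a file with the standard library. The directory layout (`run_demo/demo_jobs/items.jsonl`) and helper names are hypothetical and are not taken from the library's `Layout` or `write_items`.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path: Path, items: list) -> Path:
    """Write one JSON object per line, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        for item in items:
            fh.write(json.dumps(item, ensure_ascii=False) + "\n")
    return path


def read_jsonl(path: Path) -> list:
    """Read the file back, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


items_sketch = [
    {"title": "Data Engineer", "text": "Build pipelines."},
    {"title": "Ingénieur ML", "text": "Entraîner des modèles."},
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = write_jsonl(Path(tmp) / "run_demo" / "demo_jobs" / "items.jsonl", items_sketch)
    line_count = len(out_path.read_text(encoding="utf-8").splitlines())
    round_trip = read_jsonl(out_path)

print(line_count, round_trip == items_sketch)
```

Because each line is independent, such files can be appended to and streamed without loading the whole run into memory.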

In [None]:
from scrapping.storage.layouts import Layout
from scrapping.storage.writers import WriterOptions, write_items

sim_dir = REPO_ROOT / "results_notebook/simulated_run"
layout = Layout(root=sim_dir)
run_id = "sim_run_001"
writer_opts = WriterOptions()

for s in cfg["sources"]:
    sid = s["source_id"]
    print(f"Simulating source: {sid}")

    # 1. Load detail fixture
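    detail_path = get_fixture_path("detail", sid)
    if not detail_path.exists():
        print("  - No detail fixture, skipping.")
        continue
    with open(detail_path, encoding="utf-8") as f:
        html = f.read()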

    # 2. Extract and Parse
    doc = html_to_structured(html)
    item = doc.as_item()

    # 3. Write items
    path = write_items(
        layout, run_id, sid, name="items", items=[item], fmt="jsonl", options=writer_opts
    )
    print(f"  - Items written to: {path}")

print("\nSimulation complete. Artifacts saved in:", sim_dir / f"run_{run_id}")

## 10. Optional Online Run
If the `ONLINE` environment variable is set to `1` before the setup cell runs, we execute a real scrape against the live websites. Outputs go to `results_notebook_online/`.
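
The toggle is a strict string comparison, which the following sketch makes explicit. The `is_online` helper is hypothetical and simply repeats the check from the setup cell against an arbitrary mapping so it can be tried without touching the real environment.

```python
import os


def is_online(env=None) -> bool:
    """Mirror the setup cell's check: only the exact string '1' enables online mode."""
    env = os.environ if env is None else env
    return env.get("ONLINE", "0") == "1"


print(is_online({}))                  # False: unset defaults to offline
print(is_online({"ONLINE": "1"}))     # True
print(is_online({"ONLINE": "true"}))  # False: any value other than '1' stays offline
```

In a notebook, one way to opt in is to run `os.environ["ONLINE"] = "1"` and then rerun the setup cell so that `ONLINE` is recomputed.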

In [None]:
if ONLINE:
    print("Running online scrape...")
    # Note: This executes via CLI logic to reuse parallel orchestration
    from scrapping.cli import main

    argv = [
        "run",
        "--config",
        config_path,
        "--results",
        "results_notebook_online",
        "--parallelism",
        "4",
    ]
    main(argv)
    print("Online run completed. Check results_notebook_online/ for the latest run folder.")
else:
    print("ONLINE=0, skipping online run.")

## 11. Troubleshooting & Next Steps
### When a site changes
- **Check link extraction**: Update `discovery.link_extract` regex or selector.
- **Check rendering**: If content is missing, use `engine: { "type": "browser" }` and add `wait_for` selectors.
- **Adjust actions**: Use `actions` to scroll, click, or bypass simple overlays.
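
When link extraction stops finding anything, a quick way to narrow down the cause is to try the old and a candidate pattern against a saved copy of the listing page. The sketch below does this with the standard library. The HTML snippet, both patterns, and the `match_report` helper are invented for illustration.

```python
import re


def match_report(html: str, patterns: list) -> dict:
    """Map each candidate pattern to the href values it matches."""
    hrefs = re.findall(r'href="([^"]+)"', html)
    return {p: [h for h in hrefs if re.search(p, h)] for p in patterns}


# Hypothetical saved listing page after the site changed its URL scheme.
SAVED_LISTING = """
<a href="/careers/open/101-data-engineer">Data Engineer</a>
<a href="/careers/open/102-ml-engineer">ML Engineer</a>
<a href="/about">About</a>
"""

old_pattern = r"/jobs/\d+"
new_pattern = r"/careers/open/\d+"
report = match_report(SAVED_LISTING, [old_pattern, new_pattern])

for pattern, hits in report.items():
    print(f"{pattern!r}: {len(hits)} match(es)")
```

If no candidate pattern matches the saved HTML at all, the links are probably rendered client-side, which points to the browser engine rather than to the pattern.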

### Future Roadmap
- **config_agent**: Automatically generate these JSON configs by probing URLs.
- **tests_agent**: Automatically generate golden-file tests for each source.
- **Prefect Integration**: Scale these runs using Prefect for distributed scheduling and retries.