**Run on Google Colab (Quickstart)**

```bash
! git clone --branch main --single-branch https://github.com/sbaaihamza/scrapping-lib.git
%cd scrapping-lib
! pip install -e ".[browser,dev]"
! playwright install
# Preferred (installs OS deps automatically on supported distros):
! playwright install --with-deps chromium
# If needed (manual deps fallback):
! apt-get update
! apt-get install -y libxcomposite1 libxcursor1 libgtk-3-0 libatk1.0-0 libcairo2 libgdk-pixbuf2.0-0
%cd /content/scrapping-lib/notebooks
```

*Note: Playwright has both sync and async APIs. These notebooks are designed to be async-safe for Jupyter/Colab. If you encounter OS dependency issues, use the `playwright install --with-deps chromium` command.*



# Jobs Aggregator: Multi-Source Methodology

This notebook is an operational guide for scaling your scraping operations across multiple job sites using the `Jobs Aggregator` recipe.

## 0) What this notebook teaches

*   **Configuration Strategy**: How to move from flat configs to structured, source-specific objects.
*   **Patterns**: Deep dive into **Paging** and **Link Extraction** strategies.
*   **Lab**: A step-by-step workshop on **onboarding a new job site** in under 30 minutes.
*   **Quality**: Managing **JobPostItem** validation and rejection flows.
*   **Observability**: Monitoring multi-source runs via `jobs_tracking.json`.

## Setup

In [None]:
import json
import os
import sys
import shutil
from pathlib import Path

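# The notebook runs from `notebooks/`, so the repository root is one level up.
REPO_ROOT = Path.cwd().parent
sys.path.append(str(REPO_ROOT))
os.chdir(str(REPO_ROOT))

# Set ONLINE=1 in the environment to allow live requests; the default is offline.
ONLINE = os.getenv('ONLINE', '0') == '1'
RUN_DIR = Path('results/jobs_nb_guided')

# Start every run from an empty output directory.
if RUN_DIR.exists():
    shutil.rmtree(RUN_DIR)
RUN_DIR.mkdir(parents=True, exist_ok=True)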

print(f'Online mode: {ONLINE}')

## 1) Config Anatomy: The `JobSourceConfig` Object

Each site is defined by a `JobSourceConfig`. Here are the primary blocks:

| Block | Description | Key Fields |
| :--- | :--- | :--- |
| **Engine** | How to fetch content | `type` (http/browser), `timeout_s`, `verify_ssl` |
| **Entrypoints** | Where to start | `url` template, `paging` (mode, pages, step) |
| **Discovery** | How to find jobs | `link_extract` (method, pattern/selector) |
| **Parsing** | How to extract fields | `item_extract` (selectors for title, company, etc.) |
| **Policies** | Quality controls | `checkpoint_every_n`, `min_description_len` |

## 2) Paging Patterns

The recipe supports several pagination strategies out of the box.

In [None]:
from scrapping.recipes.jobs_aggregator import DiscoverListingPagesPhase, JobRecipeContext, JobSourceConfig, StateManager
import logging

# 1. Page template (?p=1, ?p=2, ...)
cfg_page = JobSourceConfig(source_id="page_site", entrypoints=[{"url": "http://site.com/jobs?p={page}", "paging": {"mode": "page", "pages": 3}}])

# 2. Offset template (0, 10, 20...)
cfg_offset = JobSourceConfig(source_id="offset_site", entrypoints=[{"url": "http://site.com/api?start={offset}", "paging": {"mode": "offset", "pages": 3, "step": 10}}])

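def preview_paging(cfg):
    """Run only the listing-discovery phase offline and return the listing URLs it generated."""
    state = StateManager(output_dir="tmp/paging_test")
    # engine=None and online=False: this preview does not fetch any pages.
    ctx = JobRecipeContext(engine=None, state=state, config=cfg, online=False, log=logging.getLogger("test"))
    DiscoverListingPagesPhase().run(ctx)
    return state.metadata['listing_urls']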

print(f"Page Mode URLs: {preview_paging(cfg_page)}")
print(f"Offset Mode URLs: {preview_paging(cfg_offset)}")
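
As a rough mental model of what the preview above prints, the two modes can be approximated with plain string formatting. This is a sketch, not the recipe's implementation: `expand_listing_urls` is a hypothetical helper, and it assumes that `page` mode starts counting at 1 and that `offset` mode starts at 0 and advances by `step`.

```python
from typing import List


def expand_listing_urls(template: str, mode: str, pages: int, step: int = 1) -> List[str]:
    """Expand a URL template into one listing URL per page (illustrative only)."""
    urls = []
    for i in range(pages):
        if mode == "page":
            # Assumption: page numbering starts at 1.
            urls.append(template.format(page=i + 1))
        elif mode == "offset":
            # Assumption: offsets start at 0 and grow by `step` (0, step, 2*step, ...).
            urls.append(template.format(offset=i * step))
        else:
            raise ValueError(f"unsupported paging mode: {mode!r}")
    return urls


page_urls = expand_listing_urls("http://site.com/jobs?p={page}", "page", pages=3)
offset_urls = expand_listing_urls(
    "http://site.com/api?start={offset}", "offset", pages=3, step=10
)
print(page_urls)
print(offset_urls)
```

If the URLs printed by `preview_paging` differ from this sketch (for example, a different starting index), trust the recipe output and adjust the template accordingly.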

## 3) Link Extraction Patterns

Getting link extraction right is the most important step for a successful run: any job link that the selector or pattern misses never reaches the parsing phase.

In [None]:
from scrapping.extraction.link_extractors import LinkExtractRequest, extract_links

html = """
<div class='job'><a href='/j/101'>Software Engineer</a></div>
<div class='job'><a href='https://other.com/jobs/202'>Data Scientist</a></div>
<div class='ad'><a href='/promo'>Buy our coffee</a></div>
"""

# 1. CSS Extraction (Targeted)
req_css = LinkExtractRequest(html=html, base_url="https://mysite.com", method="css", selector=".job a")
print(f"CSS Links: {extract_links(req_css)}")

# 2. Regex Extraction (Specific Pattern)
req_rx = LinkExtractRequest(html=html, base_url="https://mysite.com", method="regex", pattern=r"/j/\d+")
print(f"Regex Links: {extract_links(req_rx)}")

# 3. Normalization (Clean trailing slash and fragments)
req_norm = LinkExtractRequest(html="<a href='/j/1/#frag'>Job 1</a>", base_url="https://mysite.com", method="css", selector="a", normalize=True)
print(f"Normalized Link: {extract_links(req_norm)}")
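
To make the regex and normalization steps above concrete, here is a standard-library sketch of the same idea. It is not how `extract_links` works internally; `normalize_link` and `regex_links` are hypothetical helpers, and the normalization rule (resolve against the base URL, drop the fragment, strip a trailing slash) is an assumption based on the comment in the cell above.

```python
import re
from typing import List
from urllib.parse import urldefrag, urljoin


def normalize_link(href: str, base_url: str) -> str:
    """Resolve href against base_url, drop the fragment, strip a trailing slash."""
    absolute = urljoin(base_url, href)
    without_fragment, _fragment = urldefrag(absolute)
    return without_fragment.rstrip("/")


def regex_links(html: str, base_url: str, pattern: str) -> List[str]:
    """Return normalized hrefs matching `pattern`, de-duplicated in document order."""
    hrefs = re.findall(r"""href=['"]([^'"]+)['"]""", html)
    seen = {}  # dicts preserve insertion order, so this doubles as an ordered set
    for href in hrefs:
        if re.search(pattern, href):
            seen.setdefault(normalize_link(href, base_url), None)
    return list(seen)


sample_html = """
<div class='job'><a href='/j/101'>Software Engineer</a></div>
<div class='job'><a href='/j/101/#apply'>Software Engineer (apply)</a></div>
<div class='ad'><a href='/promo'>Buy our coffee</a></div>
"""
links = regex_links(sample_html, "https://mysite.com", r"/j/\d+")
print(links)
```

Normalizing before de-duplicating is the point of the sketch: `/j/101` and `/j/101/#apply` collapse into a single job URL, so the same posting is not fetched twice.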

In [None]:
import re
# Verify the jobs link regex
test_job = "/j/999"
pattern_job = r"/j/\d+"
matches_job = re.findall(pattern_job, test_job)
print(f"Jobs regex check: {matches_job}")
assert len(matches_job) == 1


## 4) Job Schema & Quality Filters

The `JobPostItem` model gives every source the same output schema, so items from different sites can be validated and stored uniformly.

In [None]:
from scrapping.schemas.job_items import JobPostItem

job = JobPostItem(
    source_id="linkedin",
    url="http://linkedin.com/j1",
    title="Python Developer",
    company="Tech Corp",
    location="Remote",
    description="Must know Python and scrapers..." * 20
)
print(f"Valid Job: {job.title} at {job.company}")

# Rejection example
print("\nIf 'description' is shorter than 50 characters, the job is rejected and saved to 'jobs_rejected.jsonl'.")
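
The rejection flow mentioned in the cell above can be pictured as a simple split on description length. The sketch below is illustrative only: `SimpleJob` is a hypothetical stand-in for `JobPostItem`, `split_by_quality` and the `reject_reason` field are assumptions, and only the 50-character threshold comes from this notebook.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple

MIN_DESCRIPTION_LEN = 50  # threshold quoted in the cell above


@dataclass
class SimpleJob:
    """Hypothetical stand-in for JobPostItem, reduced to a few fields."""
    source_id: str
    url: str
    title: str
    description: str


def split_by_quality(
    jobs: List[SimpleJob], min_len: int = MIN_DESCRIPTION_LEN
) -> Tuple[List[SimpleJob], List[dict]]:
    """Separate jobs into accepted items and rejected records with a reason."""
    accepted, rejected = [], []
    for job in jobs:
        if len(job.description.strip()) < min_len:
            record = asdict(job)
            record["reject_reason"] = "description_too_short"  # hypothetical field
            rejected.append(record)
        else:
            accepted.append(job)
    return accepted, rejected


jobs = [
    SimpleJob("demo", "http://example.com/j1", "Python Developer",
              "Must know Python and scrapers. " * 5),
    SimpleJob("demo", "http://example.com/j2", "Data Engineer", "TBD"),
]
accepted, rejected = split_by_quality(jobs)
# One JSON object per line, the layout implied by the `.jsonl` extension.
rejected_lines = [json.dumps(record) for record in rejected]
print(len(accepted), "accepted,", len(rejected), "rejected")
```

Keeping a reason alongside each rejected record makes it easier to tell a broken selector (many short descriptions from one source) from genuinely thin postings.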

## 5) "Add a New Site" Guided Lab

Follow these steps to onboard a new target:

### Step A: Start from Template
```python
new_site = {
    "source_id": "my_new_job_site",
    "engine": {"type": "http"}, # Step B
    "entrypoints": [{"url": "...", "paging": {"mode": "page", "pages": 1}}],
    "discovery": {"link_extract": {"method": "regex", "pattern": "..."}}, # Step C
    "parsing": {"item_extract": {"fields": {"title": {"selector": "h1"}}}} # Step D
}
```

### Step B: Engine Triage
*   Use **HTTP** if the job data is already present in the raw page source (view it with `Ctrl+U`).
*   Use **Browser** if the content only appears 2-3 seconds after load or after scrolling, which usually means it is rendered by JavaScript.
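
The triage above can also be done programmatically on a saved copy of the raw page source. The helper below is a hypothetical sketch, not part of the library: it assumes you already know a piece of text that is visible on the rendered page (such as a job title) and only checks whether that text exists in the raw HTML.

```python
def suggest_engine(raw_html: str, expected_text: str) -> str:
    """Suggest "http" if text seen on the rendered page is in the raw source, else "browser"."""
    # Case-insensitive substring check; a heuristic, not a guarantee.
    return "http" if expected_text.lower() in raw_html.lower() else "browser"


static_page = "<html><body><h1>Senior Python Developer</h1></body></html>"
js_page = "<html><body><div id='root'></div><script src='/app.js'></script></body></html>"

print(suggest_engine(static_page, "Senior Python Developer"))
print(suggest_engine(js_page, "Senior Python Developer"))
```

A `"browser"` suggestion only means the text was not found in the raw source; it is still worth checking whether the site exposes the data through a JSON endpoint before paying the cost of a browser engine.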

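### Step C: Link Extraction
Set `discovery.link_extract` to a `css` selector or a `regex` pattern, and verify it against a saved listing page as shown in section 3.

### Step D: Field Selectors
Add a selector for each field under `parsing.item_extract.fields` (title, company, etc.) and verify them against a saved detail page.

### Step E: Run Offline on Fixture
Save a sample listing page as `tests/fixtures/html/jobs/my_new_job_site/listing.html`, then run the recipe with `online=False`.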

In [None]:
from scrapping.recipes.jobs_aggregator import run_jobs_recipe

LAB_DIR = RUN_DIR / "lab_trial"
cfg = JobSourceConfig(
    source_id="quotes_jobs_mock", 
    entrypoints=[{"url": "http://quotes.toscrape.com", "paging": {"mode": "page", "pages": 1}}],
    discovery={"link_extract": {"method": "css", "selector": ".quote a"}},
    parsing={"item_extract": {"fields": {"title": {"selector": ".text"}, "company": {"selector": ".author"}}}}
)

run_jobs_recipe([cfg], output_root=str(LAB_DIR), online=False)
print(f"Lab run complete. Check {LAB_DIR}/quotes_jobs_mock/jobs.jsonl")

## 6) Observability & Artifacts Inspection

Every multi-source run writes a `jobs_tracking.json` file to its output root, with one entry per source.

In [None]:
tracking_file = LAB_DIR / "jobs_tracking.json"
if tracking_file.exists():
    tracking = json.loads(tracking_file.read_text(encoding="utf-8"))
    for sid, data in tracking.items():
        # Skip top-level entries that are not per-source dictionaries.
        if isinstance(data, dict):
            print(f"Source: {sid} | Status: {data.get('status')} | Results: {len(data.get('results', []))} phases")
else:
    print(f"No tracking file found at {tracking_file}")
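
For runs with many sources, a per-status count is often easier to scan than one line per source. The sketch below assumes the shape that the cell above reads (a mapping from source id to a dict with a `status` key); the sample data, including the status values and the `generated_at` key, is hypothetical.

```python
from collections import Counter


def summarize_tracking(tracking: dict) -> Counter:
    """Count sources per status, ignoring entries that are not per-source dicts."""
    counts = Counter()
    for source_id, data in tracking.items():
        if not isinstance(data, dict):
            continue  # same guard as the inspection cell above
        counts[str(data.get("status"))] += 1
    return counts


# Hypothetical tracking content; real status values may differ.
sample_tracking = {
    "site_a": {"status": "ok", "results": [{}, {}]},
    "site_b": {"status": "failed", "results": [{}]},
    "generated_at": "2024-01-01T00:00:00Z",
}
summary = summarize_tracking(sample_tracking)
print(dict(summary))
```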

### Summary Checklist for New Sources
- [ ] Config defined and validated.
- [ ] Engine selected (HTTP vs Browser).
- [ ] Link extraction regex/selector verified.
- [ ] Detail page selectors verified.
- [ ] Offline run successful on fixtures.
- [ ] Regression test added to test suite.
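
The last checklist item can start very small. The sketch below is one possible shape for such a check, under stated assumptions: `check_listing_fixture` is a hypothetical helper, it uses the `/j/\d+` pattern verified earlier in this notebook rather than the library's extractor, and it writes a throwaway fixture to a temporary directory instead of `tests/fixtures/`.

```python
import re
import tempfile
from pathlib import Path
from typing import List

LINK_PATTERN = r"/j/\d+"  # the pattern verified earlier in this notebook


def check_listing_fixture(fixture_path: Path, expected_min: int = 1) -> List[str]:
    """Fail loudly if a saved listing page yields fewer job links than expected."""
    html = fixture_path.read_text(encoding="utf-8")
    links = sorted(set(re.findall(LINK_PATTERN, html)))
    assert len(links) >= expected_min, (
        f"expected at least {expected_min} links, found {links}"
    )
    return links


with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp) / "listing.html"
    fixture.write_text(
        "<a href='/j/101'>A</a><a href='/j/102'>B</a><a href='/promo'>Ad</a>",
        encoding="utf-8",
    )
    found = check_listing_fixture(fixture, expected_min=2)
print(found)
```

In a real test suite the same assertion would read the committed fixture file, so a change in the site's markup shows up as a failing test rather than as an empty `jobs.jsonl`.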