**Run on Google Colab (Quickstart)**

```bash
! git clone --branch main --single-branch https://github.com/sbaaihamza/scrapping-lib.git
%cd scrapping-lib
! pip install -e ".[browser,dev]"
! playwright install
# Preferred (installs OS deps automatically on supported distros):
! playwright install --with-deps chromium
# If needed (manual deps fallback):
! apt-get update
! apt-get install -y libxcomposite1 libxcursor1 libgtk-3-0 libatk1.0-0 libcairo2 libgdk-pixbuf2.0-0
%cd /content/scrapping-lib/notebooks
```

*Note: Playwright has both sync and async APIs. These notebooks are designed to be async-safe for Jupyter/Colab. If you encounter OS dependency issues, use the `playwright install --with-deps chromium` command.*



# Alibaba Scraper: Hard Target Workshop

This notebook is a workshop-style guide to building a production-grade scraper for Alibaba. We follow the **Methodology Ladder**: starting simple, diagnosing failures, and escalating to browser rendering and the Recipe framework.

## 0) Readiness Check
Before we start, we check that the environment has all the OS dependencies the Browser engine requires.

In [None]:
from scrapping.orchestrator import doctor_environment
doctor = doctor_environment()
pw_check = doctor['checks'].get('playwright_browsers', {})
if pw_check.get('ok'):
    print("✅ Playwright is ready.")
else:
    print(f"❌ Playwright Issue: {pw_check.get('msg')}")
    if 'hint' in pw_check:
        print(f"👉 Hint: {pw_check['hint']}")

## 1) Step 1: Minimal Config & HTTP Attempt

We start with the simplest engine (HTTP) to see if the content is static.

In [None]:
import os
from scrapping.engines.http import HttpEngine
from scrapping.diagnostics.classifiers import diagnose_http_response, recommend_next_step

ONLINE = os.getenv('ONLINE', '0') == '1'
url = "https://www.alibaba.com/trade/search?SearchText=mechanical+keyboard"

if ONLINE:
    engine = HttpEngine()
    res = engine.get(url)
    diag = diagnose_http_response(res.status_code, res.response_meta.headers, res.text)
    print(f"HTTP Status: {res.status_code} | Text Len: {len(res.text or '')}")
    print(f"Diagnosis: {diag.label.value} | Rec: {recommend_next_step(diag)}")
else:
    print("OFFLINE: Skipping live HTTP attempt. (Alibaba typically requires JS or triggers challenges on raw HTTP requests)")
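
To make the diagnosis step less of a black box, here is a rough sketch of the kind of heuristics such a classifier might apply. It is not the implementation of `diagnose_http_response`. The function `classify_response`, the marker list, the thresholds, and the `blocked`/`ok` labels are all assumptions made for illustration; only `js_required` and `challenge_detected` come from this workshop.

```python
import re
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed markers; real challenge pages vary and change over time.
CHALLENGE_MARKERS = ("captcha", "cf-turnstile", "verify you are human")


@dataclass
class Diagnosis:
    label: str
    reason: str


def visible_text_len(html: str) -> int:
    """Approximate the amount of human-visible text in an HTML string."""
    no_scripts = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    no_tags = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    return len(" ".join(no_tags.split()))


def classify_response(
    status: int, headers: Mapping[str, str], body: Optional[str]
) -> Diagnosis:
    """Hypothetical classifier: challenge first, then block, then JS shell."""
    text = (body or "").lower()
    lowered = {k.lower(): v for k, v in headers.items()}

    if any(marker in text for marker in CHALLENGE_MARKERS):
        return Diagnosis("challenge_detected", "challenge marker in body")
    if status in (403, 429):
        retry = lowered.get("retry-after", "n/a")
        return Diagnosis("blocked", f"HTTP {status}, Retry-After={retry}")
    # A page that ships scripts but almost no visible text is likely a JS shell.
    if "<script" in text and visible_text_len(text) < 200:
        return Diagnosis("js_required", "scripts present, little visible text")
    return Diagnosis("ok", "static content looks usable")


shell = "<html><body><div id='root'></div><script src='app.js'></script></body></html>"
print(classify_response(200, {}, shell).label)
```

Checking for challenge markers before the status code matters because many challenge pages are served with a 403, and the more specific label is the more useful one.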

## 2) Step 2: Escalate to Browser

If HTTP fails (e.g., `js_required` or `challenge_detected`), we escalate to the **Browser Engine**.

In [None]:
from scrapping.engines.browser import BrowserEngine, BrowserEngineOptions
from scrapping.diagnostics.classifiers import diagnose_rendered_dom

if ONLINE:
    opts = BrowserEngineOptions(headless=True, save_artifacts=True)
    engine = BrowserEngine(options=opts)
    try:
        # Wait for a product-card selector so the DOM is read only after results render
        res = engine.get_rendered(url, wait_for=".m-results-item, .item-main")
        diag = diagnose_rendered_dom(res.text or "")

        print(f"Browser Fetch OK: {res.ok} | Diagnosis: {diag.label.value}")
        if not res.ok:
            print(f"Error: {res.short_error()}")
    finally:
        # Always release the browser, even if rendering raises
        engine.close()
else:
    print("OFFLINE: In a real scenario, we'd now see the rendered HTML and artifacts.")

## 3) Step 3: Define Discovery & Extraction

Now we define how to find links and extract product data.

In [None]:
from scrapping.extraction.link_extractors import LinkExtractRequest, extract_links
from scrapping.extraction.parsers import select_text_bs4

# Sample extraction logic on mock HTML (swap in the rendered HTML from Step 2 when online)
html_source = "<div class='item-main'><a href='/product/123.html'>Product 1</a><span class='price'>$10</span></div>"

req = LinkExtractRequest(html=html_source, method='regex', pattern=r'/product/\d+\.html')
links = extract_links(req)
print(f"Extracted Links: {links}")

title = select_text_bs4(html_source, "a")
price = select_text_bs4(html_source, ".price")
print(f"Fields: title='{title}', price='{price}'")
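
The regex above matches relative paths, so a later fetch step would still need absolute, de-duplicated URLs. The sketch below shows one way to do that with only the standard library. `extract_product_links` is a hypothetical helper, not part of `scrapping.extraction`, and the base URL is the search URL from Step 1.

```python
import re
from typing import List
from urllib.parse import urljoin

HREF_RE = re.compile(r"""href=["']([^"']+)["']""", re.IGNORECASE)


def extract_product_links(
    html: str, base_url: str, pattern: str = r"/product/\d+\.html"
) -> List[str]:
    """Return absolute product URLs in document order, without duplicates."""
    path_re = re.compile(pattern)
    seen = set()
    links: List[str] = []
    for href in HREF_RE.findall(html):
        if not path_re.search(href):
            continue  # not a product link
        absolute = urljoin(base_url, href)  # resolve relative paths
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


sample = (
    "<div class='item-main'><a href='/product/123.html'>Product 1</a></div>"
    "<div class='item-main'><a href='/product/123.html'>Product 1 again</a></div>"
    "<a href='/help/contact.html'>Help</a>"
)
print(extract_product_links(sample, "https://www.alibaba.com/trade/search"))
```

A regex over `href` attributes is fragile on unusual markup; an HTML parser is the safer choice once the selectors are stable.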

## 4) Step 4: Full Recipe Integration

Finally, we wrap everything in the **Alibaba L3 Recipe** for resumability and state management.

In [None]:
from scrapping.recipes.alibaba_l3 import run_single_keyword, AlibabaConfig

RUN_DIR = "results/alibaba_workshop"
config = AlibabaConfig(max_pages=1, checkpoint_every_n=5)

print("Running full recipe (Offline mode)...")
results = run_single_keyword(
    keyword='drone', 
    output_dir=RUN_DIR, 
    config=config, 
    online=False
)

for r in results:
    print(f"Phase {r.name}: {'✅' if r.ok else '❌'} in {r.elapsed_ms:.0f}ms")
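
To illustrate what resumability with a setting like `checkpoint_every_n=5` can mean in practice, here is a minimal sketch of periodic checkpointing. It is an assumption about the general technique, not the recipe's actual code: `process_with_checkpoints`, the JSON layout, and the file name are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Set


def _save(path: Path, done: Set[str]) -> None:
    """Write the checkpoint atomically: temp file first, then rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps({"done": sorted(done)}), encoding="utf-8")
    os.replace(tmp, path)


def process_with_checkpoints(
    items: Iterable[str],
    handler: Callable[[str], None],
    checkpoint_path: Path,
    every_n: int = 5,
) -> int:
    """Process items, skipping those recorded by a previous run.

    Returns the number of items handled in this run.
    """
    path = Path(checkpoint_path)
    done: Set[str] = set()
    if path.exists():
        done = set(json.loads(path.read_text(encoding="utf-8"))["done"])

    handled = 0
    for item in items:
        if item in done:
            continue  # already processed in an earlier run
        handler(item)
        done.add(item)
        handled += 1
        if handled % every_n == 0:
            _save(path, done)  # periodic checkpoint
    _save(path, done)  # final checkpoint
    return handled


with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt = Path(tmp_dir) / "checkpoint.json"
    urls = [f"/product/{i}.html" for i in range(7)]
    first = process_with_checkpoints(urls, lambda u: None, ckpt)
    second = process_with_checkpoints(urls, lambda u: None, ckpt)
    print(f"first run handled {first}, second run handled {second}")
```

The trade-off is the usual one: a smaller `every_n` loses less work after a crash but writes to disk more often.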


## 5) Defining Success for Hard Targets

When a site uses advanced challenges (Turnstile/reCAPTCHA), "Success" is defined as:
1.  **Correct Detection**: Identification of the challenge.
2.  **Artifact Capture**: Saving HTML + Screenshot for legal/technical review.
3.  **Graceful Stop**: Avoiding infinite retries or account flagging.

**Success Paths**:
*   If data is found -> Continue.
*   If challenge detected -> Stop + Record.
*   If blocked -> Diagnose + Escalate (e.g., to Official API).
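
The success paths above can be sketched as a small decision table. This is an illustrative sketch only: the `Outcome` and `Action` enums, `handle_outcome`, and the `challenge.html` artifact name are hypothetical, and only the HTML half of artifact capture is shown (a screenshot would come from the browser engine).

```python
import tempfile
from enum import Enum
from pathlib import Path


class Outcome(Enum):
    DATA_FOUND = "data_found"
    CHALLENGE = "challenge_detected"
    BLOCKED = "blocked"


class Action(Enum):
    CONTINUE = "continue"
    STOP_AND_RECORD = "stop_and_record"
    DIAGNOSE_AND_ESCALATE = "diagnose_and_escalate"


# One entry per success path; there is deliberately no "retry forever" option.
NEXT_ACTION = {
    Outcome.DATA_FOUND: Action.CONTINUE,
    Outcome.CHALLENGE: Action.STOP_AND_RECORD,
    Outcome.BLOCKED: Action.DIAGNOSE_AND_ESCALATE,
}


def handle_outcome(outcome: Outcome, html: str, artifact_dir: Path) -> Action:
    """Pick the next action and save the page HTML when a challenge is seen."""
    action = NEXT_ACTION[outcome]
    if action is Action.STOP_AND_RECORD:
        out = Path(artifact_dir)
        out.mkdir(parents=True, exist_ok=True)
        (out / "challenge.html").write_text(html, encoding="utf-8")
    return action


with tempfile.TemporaryDirectory() as tmp_dir:
    action = handle_outcome(Outcome.CHALLENGE, "<html>challenge</html>", Path(tmp_dir))
    print(action.value, (Path(tmp_dir) / "challenge.html").exists())
```

Keeping the mapping in one table makes the stop conditions easy to audit: every outcome has exactly one next step.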