A powerful web scraping library built with Playwright that provides a declarative, step-by-step approach to web automation and data extraction.
- 🚀 Declarative Scraping: Define scraping workflows using Python dictionaries or dataclasses
- 🔄 Pagination Support: Built-in support for next button and scroll-based pagination
- 📊 Data Collection: Extract text, HTML, values, and files from web pages
- 🔗 Multi-tab Support: Handle multiple tabs and complex navigation flows
- 📄 PDF Generation: Save pages as PDFs or trigger print-to-PDF actions
- 📥 File Downloads: Download files with automatic directory creation
- 🔁 Looping & Iteration: ForEach loops for processing multiple elements
- 📡 Streaming Results: Real-time result processing with callbacks
- 🎯 Error Handling: Graceful error handling with configurable termination
- 🔧 Flexible Selectors: Support for ID, class, tag, and XPath selectors
- 🔁 Retry Logic: Automatic retry on failure with configurable delays
- 🎛️ Conditional Execution: Skip or execute steps based on JavaScript conditions
- ⏳ Smart Waiting: Wait for selectors before actions with configurable timeouts
- 🔀 Fallback Selectors: Multiple selector fallbacks for increased robustness
- 🖱️ Enhanced Clicks: Double-click, right-click, modifier keys, and force clicks
- ⌨️ Input Enhancements: Clear before input, human-like typing delays
- 🔍 Data Transformations: Regex extraction, JavaScript transformations, default values
- 🌐 Page Actions: Reload, get URL/title, meta tags, cookies, localStorage, viewport
- 🤖 Human-like Behavior: Random delays to mimic human interaction
- ✅ Element State Checks: Require visible/enabled before actions
 
# Using pip
pip install stepwright
# Using pip with development dependencies
pip install "stepwright[dev]"
# From source
git clone https://github.com/lablnet/stepwright.git
cd stepwright
pip install -e .

import asyncio
from stepwright import run_scraper, TabTemplate, BaseStep
async def main():
    templates = [
        TabTemplate(
            tab="example",
            steps=[
                BaseStep(
                    id="navigate",
                    action="navigate",
                    value="https://example.com"
                ),
                BaseStep(
                    id="get_title",
                    action="data",
                    object_type="tag",
                    object="h1",
                    key="title",
                    data_type="text"
                )
            ]
        )
    ]
    results = await run_scraper(templates)
    print(results)
if __name__ == "__main__":
    asyncio.run(main())

run_scraper: Main function that executes scraping templates.
Parameters:
- templates: List of TabTemplate objects
- options: Optional RunOptions object

Returns: List[Dict[str, Any]]
results = await run_scraper(templates, RunOptions(
    browser={"headless": True}
))

run_scraper_with_callback: Executes scraping and streams each result to a callback.
Parameters:
- templates: List of TabTemplate objects
- on_result: Callback function for each result (can be sync or async)
- options: Optional RunOptions object
async def process_result(result, index):
    print(f"Result {index}: {result}")
await run_scraper_with_callback(templates, process_result)

@dataclass
class TabTemplate:
    tab: str
    initSteps: Optional[List[BaseStep]] = None      # Steps executed once before pagination
    perPageSteps: Optional[List[BaseStep]] = None   # Steps executed for each page
    steps: Optional[List[BaseStep]] = None          # Single steps array
    pagination: Optional[PaginationConfig] = None

@dataclass
class BaseStep:
    id: str
    description: Optional[str] = None
    object_type: Optional[SelectorType] = None  # 'id' | 'class' | 'tag' | 'xpath'
    object: Optional[str] = None
    action: Literal[
        "navigate", "input", "click", "data", "scroll", 
        "eventBaseDownload", "foreach", "open", "savePDF", 
        "printToPDF", "downloadPDF", "downloadFile",
        "reload", "getUrl", "getTitle", "getMeta", "getCookies", 
        "setCookies", "getLocalStorage", "setLocalStorage", 
        "getSessionStorage", "setSessionStorage", "getViewportSize", 
        "setViewportSize", "screenshot", "waitForSelector", "evaluate"
    ] = "navigate"
    value: Optional[str] = None
    key: Optional[str] = None
    data_type: Optional[DataType] = None        # 'text' | 'html' | 'value' | 'default' | 'attribute'
    wait: Optional[int] = None
    terminateonerror: Optional[bool] = None
    subSteps: Optional[List["BaseStep"]] = None
    autoScroll: Optional[bool] = None
    
    # Retry configuration
    retry: Optional[int] = None                  # Number of retries on failure (default: 0)
    retryDelay: Optional[int] = None            # Delay between retries in ms (default: 1000)
    
    # Conditional execution
    skipIf: Optional[str] = None                 # JavaScript expression - skip step if true
    onlyIf: Optional[str] = None                # JavaScript expression - execute only if true
    
    # Element waiting and state
    waitForSelector: Optional[str] = None        # Wait for selector before action
    waitForSelectorTimeout: Optional[int] = None # Timeout for waitForSelector in ms (default: 30000)
    waitForSelectorState: Optional[Literal["visible", "hidden", "attached", "detached"]] = None
    
    # Multiple selector fallbacks
    fallbackSelectors: Optional[List[Dict[str, str]]] = None  # List of {object_type, object}
    
    # Click enhancements
    clickModifiers: Optional[List[ClickModifier]] = None  # ['Control', 'Meta', 'Shift', 'Alt']
    doubleClick: Optional[bool] = None            # Perform double click
    forceClick: Optional[bool] = None            # Force click even if not visible/actionable
    rightClick: Optional[bool] = None            # Perform right click
    
    # Input enhancements
    clearBeforeInput: Optional[bool] = None      # Clear input before typing (default: True)
    inputDelay: Optional[int] = None           # Delay between keystrokes in ms
    
    # Data extraction enhancements
    required: Optional[bool] = None             # Raise error if extraction returns None/empty
    defaultValue: Optional[str] = None          # Default value if extraction fails
    regex: Optional[str] = None                 # Regex pattern to extract from data
    regexGroup: Optional[int] = None            # Regex group to extract (default: 0)
    transform: Optional[str] = None             # JavaScript expression to transform data
    
    # Timeout configuration
    timeout: Optional[int] = None                # Step-specific timeout in ms
    
    # Navigation enhancements
    waitUntil: Optional[Literal["load", "domcontentloaded", "networkidle", "commit"]] = None
    
    # Human-like behavior
    randomDelay: Optional[Dict[str, int]] = None # {min: ms, max: ms} for random delay
    
    # Element state checks
    requireVisible: Optional[bool] = None        # Require element visible (default: True for click)
    requireEnabled: Optional[bool] = None       # Require element enabled
    
    # Skip/continue logic
    skipOnError: Optional[bool] = None          # Skip step if error occurs (default: False)
    continueOnEmpty: Optional[bool] = None      # Continue if element not found (default: True)

@dataclass
class RunOptions:
    browser: Optional[dict] = None  # Playwright launch options
    onResult: Optional[Callable] = None

Navigate to a URL.
BaseStep(
    id="go_to_page",
    action="navigate",
    value="https://example.com"
)

Fill form fields.
BaseStep(
    id="search",
    action="input",
    object_type="id",
    object="search-box",
    value="search term"
)

Click on elements.
BaseStep(
    id="submit",
    action="click",
    object_type="class",
    object="submit-button"
)

Extract data from elements.
BaseStep(
    id="get_title",
    action="data",
    object_type="tag",
    object="h1",
    key="title",
    data_type="text"
)

Process multiple elements.
BaseStep(
    id="process_items",
    action="foreach",
    object_type="class",
    object="item",
    subSteps=[
        BaseStep(
            id="get_item_title",
            action="data",
            object_type="tag",
            object="h2",
            key="title",
            data_type="text"
        )
    ]
)

BaseStep(
    id="download_file",
    action="eventBaseDownload",
    object_type="class",
    object="download-link",
    value="./downloads/file.pdf",
    key="downloaded_file"
)

BaseStep(
    id="download_pdf",
    action="downloadPDF",
    object_type="class",
    object="pdf-link",
    value="./output/document.pdf",
    key="pdf_file"
)

BaseStep(
    id="save_pdf",
    action="savePDF",
    value="./output/page.pdf",
    key="pdf_file"
)

PaginationConfig(
    strategy="next",
    nextButton=NextButtonConfig(
        object_type="class",
        object="next-page",
        wait=2000
    ),
    maxPages=10
)

PaginationConfig(
    strategy="scroll",
    scroll=ScrollConfig(
        offset=800,
        delay=1500
    ),
    maxPages=5
)

Paginate first, then collect data from each page:
TabTemplate(
    tab="news",
    initSteps=[...],
    perPageSteps=[...],  # Collect data from each page
    pagination=PaginationConfig(
        strategy="next",
        nextButton=NextButtonConfig(...),
        paginationFirst=True  # Go to next page before collecting
    )
)

Paginate through all pages first, then collect all data at once:
TabTemplate(
    tab="articles",
    initSteps=[...],
    perPageSteps=[...],  # Collect all data after all pagination
    pagination=PaginationConfig(
        strategy="next",
        nextButton=NextButtonConfig(...),
        paginateAllFirst=True  # Load all pages first
    )
)

from stepwright import run_scraper, RunOptions
results = await run_scraper(templates, RunOptions(
    browser={
        "proxy": {
            "server": "http://proxy-server:8080",
            "username": "user",
            "password": "pass"
        }
    }
))

results = await run_scraper(templates, RunOptions(
    browser={
        "headless": False,
        "slow_mo": 1000,
        "args": ["--no-sandbox", "--disable-setuid-sandbox"]
    }
))

async def process_result(result, index):
    print(f"Result {index}: {result}")
    # Process result immediately (e.g., save to database)
    await save_to_database(result)
await run_scraper_with_callback(
    templates, 
    process_result,
    RunOptions(browser={"headless": True})
)

Use collected data in subsequent steps:
BaseStep(
    id="get_title",
    action="data",
    object_type="id",
    object="page-title",
    key="page_title",
    data_type="text"
),
BaseStep(
    id="save_with_title",
    action="savePDF",
    value="./output/{{page_title}}.pdf",  # Uses collected page_title
    key="pdf_file"
)

Use loop index in foreach steps:
BaseStep(
    id="process_items",
    action="foreach",
    object_type="class",
    object="item",
    subSteps=[
        BaseStep(
            id="save_item",
            action="savePDF",
            value="./output/item_{{i}}.pdf"  # i = 0, 1, 2, ...
            # or, for 1-based numbering, use this value instead:
            # value="./output/item_{{i_plus1}}.pdf"  # i_plus1 = 1, 2, 3, ...
        )
    ]
)

Steps can be configured to terminate on error:
BaseStep(
    id="critical_step",
    action="click",
    object_type="id",
    object="important-button",
    terminateonerror=True  # Stop execution if this fails
)

Without terminateonerror=True, errors are logged but execution continues.
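The difference between the two modes can be sketched in plain Python. This is only an illustration of the behaviour described here, not stepwright's executor: `Step`, `run_steps`, `ok`, and `fail` are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """Hypothetical stand-in for a step; not stepwright's BaseStep."""
    id: str
    run: Callable[[], None]
    terminateonerror: Optional[bool] = None


def run_steps(steps: List[Step]) -> List[str]:
    """Run steps in order and return the ids of the steps that succeeded."""
    completed: List[str] = []
    for step in steps:
        try:
            step.run()
            completed.append(step.id)
        except Exception as exc:
            if step.terminateonerror:
                # Critical step: stop the whole run.
                print(f"{step.id} failed, terminating: {exc}")
                break
            # Non-critical step: log the error and move on to the next step.
            print(f"{step.id} failed, continuing: {exc}")
    return completed


def ok() -> None:
    pass


def fail() -> None:
    raise RuntimeError("element not found")


lenient = run_steps([Step("a", ok), Step("b", fail), Step("c", ok)])
strict = run_steps([Step("a", ok), Step("b", fail, terminateonerror=True), Step("c", ok)])
```

In this sketch `lenient` still reaches step `c`, while `strict` stops at the failing step, which is the reason to reserve `terminateonerror=True` for steps that later steps depend on.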
Automatically retry failed steps with configurable delays:
BaseStep(
    id="click_button",
    action="click",
    object_type="id",
    object="flaky-button",
    retry=3,              # Retry up to 3 times
    retryDelay=1000        # Wait 1 second between retries
)

Execute or skip steps based on JavaScript conditions:
# Skip step if condition is true
BaseStep(
    id="optional_click",
    action="click",
    object_type="id",
    object="optional-button",
    skipIf="document.querySelector('.modal').classList.contains('hidden')"
)
# Execute only if condition is true
BaseStep(
    id="conditional_data",
    action="data",
    object_type="id",
    object="dynamic-content",
    key="content",
    onlyIf="document.querySelector('#dynamic-content') !== null"
)

Wait for elements to appear before performing actions:
BaseStep(
    id="click_after_load",
    action="click",
    object_type="id",
    object="target-button",
    waitForSelector="#loading-indicator",      # Wait for this selector
    waitForSelectorTimeout=5000,               # Timeout: 5 seconds
    waitForSelectorState="hidden"              # Wait until hidden
)

Provide multiple selector options for increased robustness:
BaseStep(
    id="click_with_fallback",
    action="click",
    object_type="id",
    object="primary-button",                   # Try this first
    fallbackSelectors=[
        {"object_type": "class", "object": "btn-primary"},
        {"object_type": "class", "object": "submit-btn"},
        {"object_type": "xpath", "object": "//button[contains(text(), 'Submit')]"}
    ]
)

Advanced click options for different interaction types:
# Double click
BaseStep(
    id="double_click",
    action="click",
    object_type="id",
    object="item",
    doubleClick=True
)
# Right click (context menu)
BaseStep(
    id="right_click",
    action="click",
    object_type="id",
    object="context-menu-trigger",
    rightClick=True
)
# Click with modifier keys (Ctrl/Cmd+Click)
BaseStep(
    id="multi_select",
    action="click",
    object_type="class",
    object="item",
    clickModifiers=["Control"]  # or ["Meta"] for Mac
)
# Force click (click hidden elements)
BaseStep(
    id="force_click",
    action="click",
    object_type="id",
    object="hidden-button",
    forceClick=True
)

More control over input behavior:
# Clear input before typing (default: True)
BaseStep(
    id="clear_and_input",
    action="input",
    object_type="id",
    object="search-box",
    value="new search term",
    clearBeforeInput=True  # Clear existing value first
)
# Human-like typing with delays
BaseStep(
    id="human_like_input",
    action="input",
    object_type="id",
    object="form-field",
    value="slowly typed text",
    inputDelay=100  # 100ms delay between each character
)

Advanced data extraction and transformation options:
# Extract with regex
BaseStep(
    id="extract_price",
    action="data",
    object_type="id",
    object="price",
    key="price",
    regex=r"\$(\d+\.\d+)",        # Extract dollar amount
    regexGroup=1                   # Get first capture group
)
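
The `regex` and `regexGroup` fields follow the usual capture-group convention: group 0 is the whole match, group 1 the first parenthesised part. A rough sketch in plain Python is below. The helper name `extract_with_regex` is hypothetical, and using `re.search` (first match anywhere in the text) is an assumption about how the pattern is applied, not stepwright's actual code.

```python
import re
from typing import Optional


def extract_with_regex(text: str, pattern: str, group: int = 0) -> Optional[str]:
    """Return the requested group of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group(group)


price_pattern = r"\$(\d+\.\d+)"

# Group 0 keeps the dollar sign, group 1 keeps only the digits.
whole_match = extract_with_regex("Price: $19.99", price_pattern)
first_group = extract_with_regex("Price: $19.99", price_pattern, 1)
no_match = extract_with_regex("Price: free", price_pattern, 1)
```

Trying a pattern this way against sample text before putting it in a step makes it easier to pick the right `regexGroup` value.
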
# Transform extracted data with JavaScript
BaseStep(
    id="transform_data",
    action="data",
    object_type="id",
    object="raw-data",
    key="processed",
    transform="value.toUpperCase().trim()"  # JavaScript transformation
)
# Required field with default value
BaseStep(
    id="get_required_data",
    action="data",
    object_type="id",
    object="important-field",
    key="important",
    required=True,                 # Raise error if not found
    defaultValue="N/A"            # Use if extraction fails
)
# Continue even if element not found
BaseStep(
    id="optional_data",
    action="data",
    object_type="id",
    object="optional-content",
    key="optional",
    continueOnEmpty=True           # Don't raise error if not found
)

Validate element state before actions:
BaseStep(
    id="click_visible",
    action="click",
    object_type="id",
    object="button",
    requireVisible=True,           # Ensure element is visible
    requireEnabled=True            # Ensure element is enabled
)

Add human-like random delays to actions:
BaseStep(
    id="human_like_action",
    action="click",
    object_type="id",
    object="button",
    randomDelay={"min": 500, "max": 2000}  # Random delay between 500-2000ms
)

Skip steps that fail instead of stopping execution:
BaseStep(
    id="optional_step",
    action="click",
    object_type="id",
    object="optional-button",
    skipOnError=True  # Continue even if this step fails
)

Reload the current page with optional wait conditions:
BaseStep(
    id="reload",
    action="reload",
    waitUntil="networkidle"  # Wait for network to be idle
)

BaseStep(
    id="get_url",
    action="getUrl",
    key="current_url"  # Store in collector
)

BaseStep(
    id="get_title",
    action="getTitle",
    key="page_title"
)

# Get specific meta tag
BaseStep(
    id="get_description",
    action="getMeta",
    object="description",  # Meta name or property
    key="meta_description"
)
# Get all meta tags
BaseStep(
    id="get_all_meta",
    action="getMeta",
    key="all_meta_tags"  # Returns dictionary of all meta tags
)

# Get all cookies
BaseStep(
    id="get_cookies",
    action="getCookies",
    key="cookies"
)
# Get specific cookie
BaseStep(
    id="get_session_cookie",
    action="getCookies",
    object="session_id",
    key="session"
)
# Set cookie
BaseStep(
    id="set_cookie",
    action="setCookies",
    object="preference",
    value="dark_mode"
)

# Get localStorage value
BaseStep(
    id="get_storage",
    action="getLocalStorage",
    object="user_preference",
    key="preference"
)
# Set localStorage value
BaseStep(
    id="set_storage",
    action="setLocalStorage",
    object="theme",
    value="dark"
)
# Get all localStorage items
BaseStep(
    id="get_all_storage",
    action="getLocalStorage",
    key="all_storage"
)
# SessionStorage (same pattern)
BaseStep(
    id="get_session",
    action="getSessionStorage",
    object="temp_data",
    key="data"
)

# Get viewport size
BaseStep(
    id="get_viewport",
    action="getViewportSize",
    key="viewport"
)
# Set viewport size
BaseStep(
    id="set_viewport",
    action="setViewportSize",
    value="1920x1080"  # or "1920,1080" or "1920 1080"
)

# Full page screenshot
BaseStep(
    id="screenshot",
    action="screenshot",
    value="./screenshots/page.png",
    data_type="full"  # Full page, omit for viewport only
)
# Element screenshot
BaseStep(
    id="element_screenshot",
    action="screenshot",
    object_type="id",
    object="content-area",
    value="./screenshots/element.png",
    key="screenshot_path"
)

Explicit wait for element state:
BaseStep(
    id="wait_for_element",
    action="waitForSelector",
    object_type="id",
    object="dynamic-content",
    value="visible",      # visible, hidden, attached, detached
    wait=5000,            # Timeout in ms
    key="wait_result"     # Stores True/False
)

Execute custom JavaScript:
BaseStep(
    id="custom_js",
    action="evaluate",
    value="() => document.querySelector('.counter').textContent",
    key="counter_value"
)

import asyncio
from pathlib import Path
from stepwright import (
    run_scraper,
    TabTemplate,
    BaseStep,
    PaginationConfig,
    NextButtonConfig,
    RunOptions
)
async def main():
    templates = [
        TabTemplate(
            tab="news_scraper",
            initSteps=[
                BaseStep(
                    id="navigate",
                    action="navigate",
                    value="https://news-site.com"
                ),
                BaseStep(
                    id="search",
                    action="input",
                    object_type="id",
                    object="search-box",
                    value="technology"
                )
            ],
            perPageSteps=[
                BaseStep(
                    id="collect_articles",
                    action="foreach",
                    object_type="class",
                    object="article",
                    subSteps=[
                        BaseStep(
                            id="get_title",
                            action="data",
                            object_type="tag",
                            object="h2",
                            key="title",
                            data_type="text"
                        ),
                        BaseStep(
                            id="get_content",
                            action="data",
                            object_type="tag",
                            object="p",
                            key="content",
                            data_type="text"
                        ),
                        BaseStep(
                            id="get_link",
                            action="data",
                            object_type="tag",
                            object="a",
                            key="link",
                            data_type="value"
                        )
                    ]
                )
            ],
            pagination=PaginationConfig(
                strategy="next",
                nextButton=NextButtonConfig(
                    object_type="id",
                    object="next-page",
                    wait=2000
                ),
                maxPages=5
            )
        )
    ]
    # Run scraper
    results = await run_scraper(templates, RunOptions(
        browser={"headless": True}
    ))
    # Process results
    for i, article in enumerate(results):
        print(f"\nArticle {i + 1}:")
        print(f"Title: {article.get('title')}")
        # Fall back to an empty string so a missing 'content' key does not raise TypeError
        print(f"Content: {(article.get('content') or '')[:100]}...")
        print(f"Link: {article.get('link')}")
if __name__ == "__main__":
    asyncio.run(main())

# Clone repository
git clone https://github.com/lablnet/stepwright.git
cd stepwright
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev]"
# Install Playwright browsers
playwright install chromium

# Run all tests
pytest
# Run with verbose output
pytest -v
# Run specific test file
pytest tests/test_scraper.py
# Run specific test class
pytest tests/test_scraper.py::TestGetBrowser
# Run specific test
pytest tests/test_scraper.py::TestGetBrowser::test_create_browser_instance
# Run with coverage
pytest --cov=src --cov-report=html
# Run integration tests only
pytest tests/test_integration.py

stepwright/
├── src/
│   ├── __init__.py
│   ├── step_types.py      # Type definitions and dataclasses
│   ├── helpers.py         # Utility functions
│   ├── executor.py        # Core step execution logic
│   ├── parser.py          # Public API (run_scraper)
│   ├── scraper.py         # Low-level browser automation
│   ├── handlers/          # Action-specific handlers
│   │   ├── __init__.py
│   │   ├── data_handlers.py      # Data extraction handlers
│   │   ├── file_handlers.py      # File download/PDF handlers
│   │   ├── loop_handlers.py      # Foreach/open handlers
│   │   └── page_actions.py       # Page-related actions (reload, getUrl, etc.)
│   └── scraper_parser.py  # Backward compatibility
├── tests/
│   ├── __init__.py
│   ├── conftest.py        # Pytest configuration
│   ├── test_page.html     # Test HTML page
│   ├── test_page_enhanced.html  # Enhanced test page for new features
│   ├── test_scraper.py    # Core scraper tests
│   ├── test_parser.py     # Parser function tests
│   ├── test_new_features.py  # Tests for new features
│   └── test_integration.py # Integration tests
├── pyproject.toml         # Package configuration
├── setup.py               # Setup script
├── pytest.ini             # Pytest configuration
├── README.md              # This file
└── README_TESTS.md        # Detailed test documentation
# Format code with black
black src/ tests/
# Lint with flake8
flake8 src/ tests/
# Type checking with mypy
mypy src/

The codebase follows separation of concerns:
- step_types.py: All type definitions (BaseStep, TabTemplate, etc.)
- helpers.py: Utility functions (placeholder replacement, locator creation, condition evaluation)
- executor.py: Core execution logic (execute steps, handle pagination, retry logic)
- parser.py: Public API (run_scraper, run_scraper_with_callback)
- scraper.py: Low-level Playwright wrapper (navigate, click, get_data)
- handlers/: Action-specific handlers organized by functionality
  - data_handlers.py: Data extraction logic with transformations
  - file_handlers.py: File download and PDF operations
  - loop_handlers.py: Foreach loops and new tab/window handling
  - page_actions.py: Page-related actions (reload, getUrl, cookies, storage, etc.)
- scraper_parser.py: Backward compatibility wrapper
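
To make the "placeholder replacement" role of helpers.py more concrete, here is a minimal sketch of how `{{key}}` tokens such as `{{page_title}}`, `{{i}}`, and `{{i_plus1}}` could be filled in. It is an illustration under assumptions, not the library's implementation: `fill_placeholders` is a hypothetical name, and leaving unknown keys untouched is a choice made for this example.

```python
import re
from typing import Any, Dict, Optional

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_placeholders(
    template: str,
    collected: Dict[str, Any],
    index: Optional[int] = None,
) -> str:
    """Replace {{key}} tokens with collected values and optional loop indices."""
    values = dict(collected)
    if index is not None:
        values["i"] = index            # zero-based loop index
        values["i_plus1"] = index + 1  # one-based loop index

    def substitute(match: "re.Match[str]") -> str:
        key = match.group(1)
        # Unknown keys are left as-is so a typo stays visible in the output.
        return str(values[key]) if key in values else match.group(0)

    return PLACEHOLDER.sub(substitute, template)


pdf_path = fill_placeholders("./output/{{page_title}}.pdf", {"page_title": "Report"})
item_path = fill_placeholders("./output/item_{{i_plus1}}.pdf", {}, index=0)
untouched = fill_placeholders("./output/{{missing}}.pdf", {})
```

Because collected values end up in file paths, it may be worth sanitising them (for example, stripping slashes) before substitution; the sketch leaves that out for brevity.
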
 
You can import from the main module or specific submodules:
# From main module (recommended)
from stepwright import run_scraper, TabTemplate, BaseStep
# From specific modules
from stepwright.step_types import TabTemplate, BaseStep
from stepwright.parser import run_scraper
from stepwright.helpers import replace_data_placeholders

See README_TESTS.md for detailed testing documentation.
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Add tests for new functionality
- Ensure all tests pass (pytest)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
 
MIT License - see LICENSE file for details.
- 🐛 Issues: GitHub Issues
- 📖 Documentation: README.md and README_TESTS.md
- 💬 Discussions: GitHub Discussions

- Built with Playwright
- Inspired by declarative web scraping patterns
- Original TypeScript version: framework-Island/stepwright
 
Muhammad Umer Farooq (@lablnet)