Scrava is a powerful, composable web scraping framework for Python that provides a unified API for building scalable web scrapers by orchestrating the best tools in the Python ecosystem.
Built by Nextract Data Solutions, your partner for enterprise web scraping and data extraction.
Scrava doesn't reinvent the wheel. Instead, it takes a composition-over-invention approach:
- Unifying Force: Eliminates boilerplate and integration complexity
- Battle-Tested Libraries: Built on httpx, Playwright, parsel, and more
- Developer Experience: Designed to be intuitive and approachable for newcomers
- Production-Ready: Structured logging, statistics, error handling, and more
- Async-First: Built on asyncio for maximum performance
- Dual-Mode Fetching: HTTP (httpx) and Browser (Playwright) support
- Flexible Queuing: In-memory or Redis-backed with duplicate filtering
- Powerful Hooks: Intercept and modify requests, responses, and data flow
- Pipeline System: MongoDB, JSON, or custom data storage
- Pydantic Integration: Type-safe data models with validation
- Structured Logging: Production-grade logging with structlog
- Config Management: YAML + Pydantic for type-safe configuration
- CLI Tools: Project scaffolding, bot runner, and interactive shell
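The duplicate filtering mentioned in the list above is commonly built by fingerprinting each request and remembering which fingerprints have already been seen. The following is a minimal sketch of that idea using only the standard library. `fingerprint` and `SeenFilter` are hypothetical names chosen for illustration and are not part of Scrava's API; the framework's own filter may normalize URLs differently.

```python
import hashlib
from typing import Set
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(url: str, method: str = "GET") -> str:
    """Return a stable hash for a request (hypothetical helper)."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 hash identically.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Lowercase scheme and host, and drop the fragment (it is never sent to the server).
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, "")
    )
    return hashlib.sha1(f"{method.upper()} {normalized}".encode("utf-8")).hexdigest()


class SeenFilter:
    """In-memory duplicate filter; a Redis set could play the same role."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def is_new(self, url: str, method: str = "GET") -> bool:
        """Record the request and report whether it was unseen before."""
        fp = fingerprint(url, method)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True
```

A queue would call `is_new()` before enqueuing a request and silently drop anything that returns `False`.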
- Python 3.8 or higher
- pip (latest version recommended)
macOS (Apple Silicon - M1/M2/M3/M4):
# Use native ARM64 Python for best performance
arch -arm64 pip install scrava
macOS (Intel):
pip install scrava
Windows:
pip install scrava
Linux:
pip install scrava
# Basic installation (works on all platforms)
pip install scrava
# With browser support (Playwright)
# (quotes keep shells such as zsh from treating the brackets as a glob pattern)
pip install "scrava[browser]"
# With Redis queue support
pip install "scrava[redis]"
# With MongoDB pipeline support
pip install "scrava[mongodb]"
# Install everything
pip install "scrava[all]"
# Clone and install in editable mode
git clone https://github.com/yourusername/scrava.git
cd scrava
pip install -e .
# With all optional dependencies
pip install -e ".[all]"
For easier installation, use our platform-specific scripts:
macOS/Linux:
# Auto-detects architecture and installs correctly
curl -sSL https://raw.githubusercontent.com/yourusername/scrava/main/install.sh | bash
# Or download and run manually
chmod +x install.sh
./install.sh
Windows (PowerShell):
# Download and run the installation script
iwr -useb https://raw.githubusercontent.com/yourusername/scrava/main/install.ps1 | iex
# Or download and run manually
.\install.ps1
# Check if Scrava is properly installed
scrava version
# Run the welcome screen
scrava
If you encounter installation issues, see PLATFORM.md for detailed platform-specific instructions.
scrava new my_project
cd my_project
# bots/book_bot.py
from pydantic import BaseModel, HttpUrl
from scrava import BaseBot, Request, Response


class Book(BaseModel):
    """A scraped book record."""
    title: str
    price: float
    url: HttpUrl
    in_stock: bool = True


class BookBot(BaseBot):
    """Bot for scraping books.toscrape.com"""
    start_urls = ['https://books.toscrape.com']

    async def process(self, response: Response):
        """Extract book data from the page."""
        # Extract books using parsel selectors
        for book in response.selector.css('article.product_pod'):
            title = book.css('h3 a::attr(title)').get()
            price_text = book.css('.price_color::text').get()
            price = float(price_text.replace('£', ''))
            url = response.urljoin(book.css('h3 a::attr(href)').get())
            yield Book(
                title=title,
                price=price,
                url=url
            )
        # Follow pagination
        next_page = response.selector.css('.next a::attr(href)').get()
        if next_page:
            yield Request(response.urljoin(next_page))
scrava run book_bot
from scrava import Request, Response
# Create a request
request = Request(
    url='https://example.com',
    method='GET',
    headers={'User-Agent': 'MyBot/1.0'},
    priority=10,  # Higher priority = processed first
    meta={'browser': True}  # Use browser rendering
)
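The `priority` comment above says that higher values are processed first. Python's `heapq` is a min-heap, so one common way to get that ordering is to store the negated priority alongside an insertion counter. The sketch below illustrates the technique only; `PriorityRequestQueue` is a hypothetical name, and Scrava's own queue classes may be implemented differently.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class PriorityRequestQueue:
    """Hypothetical in-memory queue: highest priority first, FIFO among ties."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        # Monotonic counter used as a tie-breaker to preserve insertion order.
        self._counter = itertools.count()

    def push(self, item: Any, priority: int = 0) -> None:
        # heapq pops the smallest tuple, so negate the priority
        # to make the largest priority come out first.
        heapq.heappush(self._heap, (-priority, next(self._counter), item))

    def pop(self) -> Any:
        _, _, item = heapq.heappop(self._heap)
        return item

    def __len__(self) -> int:
        return len(self._heap)
```

The counter also keeps the heap from ever comparing two queued items directly, which matters when the items themselves are not orderable.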
# Response provides powerful selectors
async def process(self, response: Response):
    # CSS selectors
    title = response.selector.css('h1::text').get()
    # XPath selectors
    links = response.selector.xpath('//a/@href').getall()
    # Join relative URLs
    absolute_url = response.urljoin('/path')
from scrava import BaseBot, Response
class MyBot(BaseBot):
    start_urls = ['https://example.com']

    async def setup(self):
        """Called before crawling starts."""
        self.session_data = {}

    async def process(self, response: Response):
        """Main processing method."""
        yield Record(...)   # placeholder: an instance of your own Pydantic model
        yield Request(...)  # placeholder: a follow-up request to crawl

    async def teardown(self):
        """Called after crawling completes."""
        pass
from scrava import Crawler
from scrava.queue import MemoryQueue, RedisQueue
# In-memory queue (default)
crawler = Crawler(queue=MemoryQueue())
# Redis-backed queue for distributed crawls
crawler = Crawler(queue=RedisQueue(redis_url="redis://localhost:6379/0"))
# HTTP fetcher (default)
from scrava.fetchers import HttpxFetcher
crawler = Crawler(
    fetcher=HttpxFetcher(
        timeout=30.0,
        follow_redirects=True,
        verify_ssl=True
    )
)
# Browser fetcher for JavaScript-heavy sites
from scrava.fetchers import PlaywrightFetcher
crawler = Crawler(
    browser_fetcher=PlaywrightFetcher(
        headless=True,
        browser_type='chromium',
        context_pool_size=5
    ),
    enable_browser=True
)
# Use browser for specific requests
yield Request(url, meta={'browser': True})
from scrava.hooks import RequestHook
class UserAgentHook(RequestHook):
    async def process_req(self, request, bot):
        # Modify request before fetching
        request.headers['User-Agent'] = 'MyBot/1.0'
        return None

    async def process_res(self, request, response, bot):
        # Process response after fetching
        print(f"Got {response.status} from {response.url}")
        return None
crawler = Crawler(request_hooks=[UserAgentHook()])
from scrava.hooks import CacheHook
# Enable caching
crawler = Crawler(
    request_hooks=[
        CacheHook(expiration=86400)  # Cache for 1 day
    ]
)
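The `expiration=86400` argument above is a time-to-live in seconds. As a rough illustration of how an expiring cache behaves, here is a minimal in-memory sketch keyed by URL. `TTLCache` is a hypothetical class, not Scrava's `CacheHook`, which presumably persists entries to disk; the injectable `clock` parameter is an assumption added so expiry can be exercised without sleeping.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical response cache whose entries expire after `expiration` seconds."""

    def __init__(
        self, expiration: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self.expiration = expiration
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def set(self, url: str, body: bytes) -> None:
        # Remember when the body was stored so its age can be checked later.
        self._entries[url] = (self._clock(), body)

    def get(self, url: str) -> Optional[bytes]:
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, body = entry
        if self._clock() - stored_at >= self.expiration:
            # Stale entry: drop it so the caller refetches.
            del self._entries[url]
            return None
        return body
```

A request hook could consult `get()` before fetching and call `set()` after a successful response.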
# Disable caching for specific requests
yield Request(url, meta={'cache': False})
from scrava.pipelines import JsonPipeline, MongoPipeline
# JSON output
crawler = Crawler(
    pipelines=[JsonPipeline(output_file='output.jsonl')]
)
# MongoDB with batching
crawler = Crawler(
    pipelines=[
        MongoPipeline(
            uri='mongodb://localhost:27017',
            database='scrava',
            batch_size=100,
            batch_timeout=5.0
        )
    ]
)
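The `batch_size` and `batch_timeout` parameters above suggest that records are buffered and written in groups, either when enough have accumulated or when the oldest one has waited long enough. The sketch below shows that flush-on-size-or-age logic in isolation. `BatchBuffer` and its `flush` callback are hypothetical, and the real `MongoPipeline` may behave differently (for example, by flushing from a background timer).

```python
import time
from typing import Any, Callable, List


class BatchBuffer:
    """Hypothetical buffer that flushes on size or age, whichever comes first."""

    def __init__(
        self,
        flush: Callable[[List[Any]], None],
        batch_size: int = 100,
        batch_timeout: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self.batch_size = batch_size
        self.batch_timeout = batch_timeout
        self._clock = clock
        self._items: List[Any] = []
        self._first_at = 0.0

    def add(self, item: Any) -> None:
        if not self._items:
            # Age is measured from the oldest buffered item.
            self._first_at = self._clock()
        self._items.append(item)
        too_big = len(self._items) >= self.batch_size
        too_old = self._clock() - self._first_at >= self.batch_timeout
        if too_big or too_old:
            self.close()

    def close(self) -> None:
        """Write out whatever is buffered, e.g. when the crawl ends."""
        if self._items:
            batch, self._items = self._items, []
            self._flush(batch)
```

In this simplified version the timeout is only checked when a new item arrives, so `close()` must be called at shutdown to avoid losing a partial batch.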
# Custom pipeline
from scrava.pipelines import BasePipeline
class CustomPipeline(BasePipeline):
    async def process_rec(self, record, bot):
        # Process and store record
        await self.save_to_db(record)
        return record
# config/settings.yaml
project_name: "my_project"
scrava:
  concurrent_reqs: 16
  download_delay: 0.0
  enable_browser: false
cache:
  enabled: true
  path: ".scrava_cache"
  expiration_secs: 86400
queue:
  backend: "scrava.queue.memory.MemoryQueue"
  redis_url: "redis://localhost:6379/0"
pipeline:
  enabled:
    - scrava.pipelines.json.JsonPipeline
  mongodb_uri: "mongodb://localhost:27017"
  mongodb_database: "scrava"
logging:
  level: "INFO"
  format: "console"  # or "json" for production
  use_colors: true
from scrava.config import load_settings
settings = load_settings('config/settings.yaml')
from scrava.logging import setup_logging, get_logger
# Setup logging
setup_logging(
    level="INFO",
    format="console",  # "json" for production
    use_colors=True
)
# Get logger
logger = get_logger(__name__)
logger.info("Bot started", bot_name="my_bot", url="https://example.com")
# Output: 2024-10-27 10:30:05 [info] Bot started bot_name=my_bot url=https://example.com
# Create a new project
scrava new <project_name>
# Run a bot
scrava run <bot_name>
# List all bots
scrava list
# Interactive selector shell
scrava shell <url>
scrava shell <url> --browser # Use browser rendering
# Show version
scrava version
class ProductBot(BaseBot):
    start_urls = ['https://shop.example.com']

    async def process(self, response: Response):
        # Extract category links
        for category in response.selector.css('.category'):
            url = response.urljoin(category.css('a::attr(href)').get())
            yield Request(url, callback=self.parse_category)

    async def parse_category(self, response: Response):
        # Extract products
        for product in response.selector.css('.product'):
            yield Request(
                response.urljoin(product.css('a::attr(href)').get()),
                callback=self.parse_product
            )

    async def parse_product(self, response: Response):
        yield Product(
            name=response.selector.css('h1::text').get(),
            price=float(response.selector.css('.price::text').get())
        )
async def process(self, response: Response):
    # Scroll page, click buttons, etc. with JavaScript
    yield Request(
        url='https://spa-site.com',
        meta={
            'browser': True,
            'wait_for': '.dynamic-content',
            'scroll': True
        }
    )
class RetryHook(RequestHook):
    async def process_exc(self, request, exception, bot):
        if request.meta.get('retry_count', 0) < 3:
            # Retry with incremented counter
            request.meta['retry_count'] = request.meta.get('retry_count', 0) + 1
            await bot.queue.push(request)
        return None
class ValidationPipeline(BasePipeline):
    async def process_rec(self, record, bot):
        # Pydantic automatically validates
        if record.price < 0:
            logger.warning("Invalid price", record=record)
            return None  # Filter out
        return record
- Use Pydantic Models: Define clear schemas for your scraped data
- Leverage Hooks: Keep bot logic clean by using hooks for cross-cutting concerns
- Configure Delays: Be respectful with download_delay to avoid overwhelming servers
- Enable Caching: Speed up development with the built-in CacheHook
- Structure Logs: Use structured logging for easy debugging and monitoring
- Handle Errors: Implement retry logic and error hooks for robust crawls
- Test Selectors: Use scrava shell <url> to test CSS/XPath selectors interactively
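The "Configure Delays" and "Handle Errors" practices above combine naturally into a retry helper that waits longer after each failure. The following is a minimal sketch of retry with exponential backoff, written against plain asyncio rather than Scrava's hook API. `backoff_delays` and `retry_async` are hypothetical helpers, and the injectable `sleep` parameter is an assumption added to make the logic testable without real waiting.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(base: float, retries: int, factor: float = 2.0) -> List[float]:
    """Delays to wait before each retry: base, base*factor, base*factor**2, ..."""
    return [base * factor**attempt for attempt in range(retries)]


async def retry_async(
    func: Callable[[], Awaitable[T]],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Call `func`, retrying up to `retries` times after a failure."""
    for delay in backoff_delays(base_delay, retries):
        try:
            return await func()
        except Exception:
            # A real crawler would catch only transient errors (timeouts, 5xx).
            await sleep(delay)  # wait before the next attempt
    # Final attempt: let any exception propagate to the caller.
    return await func()
```

With the defaults, a request that keeps failing is attempted four times in total, with waits of 1, 2, and 4 seconds between attempts.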
┌─────────────┐
│     Bot     │  ← Your scraping logic
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    Core     │  ← Orchestrator (asyncio event loop)
└──────┬──────┘
       │
       ├─ Queue (MemoryQueue / RedisQueue)
       ├─ Fetcher (HttpxFetcher / PlaywrightFetcher)
       ├─ Hooks (RequestHook / BotHook)
       └─ Pipelines (MongoPipeline / JsonPipeline)
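The diagram above can be read as a loop: the core pulls a URL from the queue, hands it to a fetcher, passes the response to the bot, and routes what the bot yields either back to the queue or on to the pipelines. The sketch below imitates that flow with plain asyncio so the moving parts are visible. Every name in it (`crawl`, `fetch`, `process`, `store`) is hypothetical, hooks are omitted, and it is not how Scrava's core is actually written.

```python
import asyncio
from typing import Awaitable, Callable, List


async def crawl(
    start_urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    process: Callable[[str, str], List[object]],
    store: Callable[[object], None],
    concurrency: int = 4,
) -> None:
    """Hypothetical core loop: queue -> fetch -> process -> pipeline."""
    queue = asyncio.Queue()
    seen = set(start_urls)
    for url in start_urls:
        queue.put_nowait(url)

    async def worker() -> None:
        while True:
            url = await queue.get()
            try:
                body = await fetch(url)
                for result in process(url, body):
                    # Strings stand in for follow-up requests;
                    # anything else is treated as a scraped record.
                    if isinstance(result, str):
                        if result not in seen:
                            seen.add(result)
                            queue.put_nowait(result)
                    else:
                        store(result)
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(concurrency)]
    await queue.join()  # returns once every queued URL has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
```

This sketch has no error handling: a `fetch` that raises ends its worker, which is the gap that retry hooks are meant to fill.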
For full documentation, visit: https://scrava.readthedocs.io
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - see LICENSE file for details
Scrava is built on the shoulders of giants:
- httpx - HTTP client
- Playwright - Browser automation
- parsel - Data extraction
- Pydantic - Data validation
- structlog - Structured logging
- Typer - CLI framework
Scrava is developed and maintained by Nextract Data Solutions, a leading provider of enterprise web scraping and data extraction services.
Need enterprise-grade data extraction?
While Scrava is perfect for developers building their own scrapers, Nextract Data Solutions offers done-for-you web scraping and data pipelines for businesses that need:
- Custom enterprise scraping solutions
- Data-as-a-Service (DaaS) subscriptions
- Data enrichment and validation
- 99.9% accuracy and reliability
- Dedicated support and SLA guarantees
- Website: https://nextract.dev
- Email: hello@nextract.dev
- Phone: +91 85110-98799
- GitHub: @nextractdevelopers
Happy Scraping!