RusticSoup 🦀🍲

Lightning-fast HTML parser and data extractor built in Rust


🚀 Why RusticSoup?

| Feature | BeautifulSoup | RusticSoup | Advantage |
| --- | --- | --- | --- |
| Google Shopping | 8.1ms | 3.9ms | 2.1x faster |
| Product grids | 14ms | 1.2ms | 12x faster |
| Bulk processing | Sequential | Parallel | Up to 100x faster |
| Attribute extraction | Manual loops | @href syntax | Zero loops needed |
| WebPage API | ❌ | ✅ | web-poet inspired |
| CSS selectors | ✅ | ✅ | Same API |
| Memory usage | High | Low | Rust efficiency |

⚑ Quick Start

pip install rusticsoup

Option 1: WebPage API (Recommended - web-poet style)

from rusticsoup import WebPage

html = """
<div class="product">
    <h2>Amazing Product</h2>
    <span class="price">$29.99</span>
    <a href="/buy" class="buy-btn">Buy Now</a>
    <img src="/image.jpg" alt="product">
</div>
"""

# Create a WebPage
page = WebPage(html, url="https://example.com/products")

# Extract single values
title = page.text("h2")                    # "Amazing Product"
price = page.text("span.price")            # "$29.99"
link = page.attr("a.buy-btn", "href")      # "/buy"

# Or extract structured data
product = page.extract({
    "title": "h2",
    "price": "span.price",
    "link": "a.buy-btn@href",   # @ syntax for attributes
    "image": "img@src"
})
# {'title': 'Amazing Product', 'price': '$29.99', 'link': '/buy', 'image': '/image.jpg'}

Option 2: Universal Extraction (Original API)

import rusticsoup

# Define what you want to extract
field_mappings = {
    "title": "h2",              # Text content
    "price": "span.price",      # Text content
    "link": "a.buy-btn@href",   # Attribute extraction with @
    "image": "img@src"          # Any attribute: @src, @href, @alt, etc.
}

# Extract data - no manual loops, no site-specific logic
products = rusticsoup.extract_data(html, "div.product", field_mappings)

print(products)
# [{"title": "Amazing Product", "price": "$29.99", "link": "/buy", "image": "/image.jpg"}]

📚 Documentation & Examples

🎯 Core Features

🌟 NEW in v0.4.0: ItemPage with extract_all()

The cleanest extraction pattern with per-field transforms:

from rusticsoup import WebPage, Field, ItemPage

# Define your data model once
class ProductReview(ItemPage):
    author = Field(css='span.author', transform=str.strip)
    rating = Field(css='span.rating', transform=lambda s: float(s.split()[0]))
    text = Field(css='p.review-text', transform=str.strip)
    # Fallback selectors for robustness
    date = Field(css=['time.published', 'span.date'])

# One line to extract everything with transforms applied!
page = WebPage(html)
reviews = page.extract_all('div.review', ProductReview)

# Clean, type-safe access
for review in reviews:
    print(f"{review.author} ({review.rating}★): {review.text}")

Benefits:

  • ✅ Declarative field definitions with transforms
  • ✅ No post-processing list comprehensions
  • ✅ Reusable data models
  • ✅ Type-safe attribute access
  • ✅ Fallback selectors built-in

📖 Full ItemPage Documentation

🌟 WebPage API (web-poet inspired)

High-level, declarative API for web scraping:

from rusticsoup import WebPage

page = WebPage(html, url="https://example.com")

# Simple extraction
title = page.text("h1")
links = page.attr_all("a", "href")

# Extract multiple items at once
products = page.extract_all(".product", {
    "name": "h2",
    "price": ".price",
    "url": "a@href"
})

# Check existence
if page.has("nav.menu"):
    nav_items = page.text_all("nav.menu a")

# URL resolution
absolute_url = page.absolute_url("/products/123")

📖 Full WebPage API Documentation | 🚀 Quick Start Guide | 🆘 Help Center | 🧪 Examples

✅ Universal Extraction

Works with any HTML structure - no site-specific parsers needed:

# Google Shopping
rusticsoup.extract_data(html, 'tr[data-is-grid-offer="true"]', {
    'seller': 'a.b5ycib',
    'price': 'span.g9WBQb',
    'link': 'a.UxuaJe@href'
})

# Amazon Products
rusticsoup.extract_data(html, '[data-component-type="s-search-result"]', {
    'title': 'h2 a span',
    'price': '.a-price-whole',
    'rating': '.a-icon-alt',
    'url': 'h2 a@href'
})

# Any website
rusticsoup.extract_data(html, 'your-container-selector', {
    'any_field': 'any.css.selector',
    'any_attribute': 'element@attribute_name'
})

✅ Bulk Processing

Process multiple pages in parallel:

# Process 100 pages simultaneously
pages = [html1, html2, html3, ...]  # List of HTML strings
results = rusticsoup.extract_data_bulk(pages, "div.product", field_mappings)

# Each page processed in parallel using Rust's Rayon
# 10-100x faster than sequential processing

✅ Attribute Extraction

No more manual loops for collecting href, src, and other attributes:

# Before (BeautifulSoup)
links = []
for element in soup.select('a'):
    if element.get('href'):
        links.append(element['href'])

# After (RusticSoup)
data = rusticsoup.extract_data(html, 'div', {'links': 'a@href'})

✅ Browser-Grade Parsing

Built on html5ever - the HTML5 parser from Mozilla's Servo project:

  • Handles malformed HTML gracefully
  • WHATWG HTML5 compliant
  • Blazing fast native performance
  • Memory safe (Rust)

📊 Performance Benchmarks

Real-world scraping performance vs BeautifulSoup:

# Google Shopping: 30 ads per page
BeautifulSoup:  8.1ms per page
RusticSoup:     3.9ms per page  (2.1x faster)

# Product grids: 50 products per page
BeautifulSoup:  14ms per page
RusticSoup:     1.2ms per page  (12x faster)

# Bulk processing: 100 pages
BeautifulSoup:  Sequential ~1.4s
RusticSoup:     Parallel ~14ms   (100x faster)
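
These timings depend on hardware and page size. Below is a minimal sketch for measuring extract_data on your own pages with Python's timeit; the sample HTML and field mappings are placeholders, not the benchmark suite used above:

import timeit

import rusticsoup

# Placeholder page - substitute any saved HTML you want to measure
html = '<div class="product"><h2>Amazing Product</h2><span class="price">$29.99</span></div>' * 50

field_mappings = {"title": "h2", "price": "span.price"}

# Average per-page cost over 100 runs, reported in milliseconds
total = timeit.timeit(
    lambda: rusticsoup.extract_data(html, "div.product", field_mappings),
    number=100,
)
print(f"{total / 100 * 1000:.2f} ms per page")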

🛠️ API Reference

Two Powerful APIs

RusticSoup provides two complementary APIs:

  1. WebPage API - High-level, object-oriented (Recommended for new projects)
  2. Universal Extraction API - Function-based, great for batch processing

WebPage API

from rusticsoup import WebPage

page = WebPage(html, url="https://example.com")

Key Methods:

  • text(selector) - Extract text from first match
  • text_all(selector) - Extract text from all matches
  • attr(selector, attribute) - Extract attribute from first match
  • attr_all(selector, attribute) - Extract attribute from all matches
  • extract(mappings) - Extract structured data
  • extract_all(container, mappings) - Extract multiple items
  • has(selector) - Check if selector matches
  • count(selector) - Count matching elements
  • absolute_url(url) - Convert relative to absolute URL

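A short sketch that exercises count(), text_all(), and absolute_url() from the list above; only the signatures listed here are assumed:

from rusticsoup import WebPage

html = '<h2>One</h2><h2>Two</h2><a href="/docs">Docs</a>'
page = WebPage(html, url="https://example.com")

# Count matching elements and pull text from every match
print(page.count("h2"))        # 2
print(page.text_all("h2"))     # ["One", "Two"]

# Resolve a relative link against the page URL
print(page.absolute_url("/docs"))  # "https://example.com/docs"
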
📖 Full WebPage Documentation

🔄 Field Transforms (NEW in v0.2.2)

Apply transformations to extracted data automatically:

from rusticsoup import WebPage, Field
from rusticsoup_helpers import ItemPage

class Article(ItemPage):
    # Single transform
    title = Field(css="h1", transform=str.upper)

    # Chain multiple transforms
    author = Field(
        css=".author",
        transform=[
            str.strip,
            str.title,
            lambda s: s.replace("by ", "")
        ]
    )

    # Transform with attribute extraction
    price = Field(
        css=".price",
        transform=[
            str.strip,
            lambda s: float(s.replace("$", ""))
        ]
    )

    # Transform lists
    tags = Field(
        css=".tag",
        get_all=True,
        transform=lambda tags: [t.upper() for t in tags]
    )

page = WebPage(html)
article = Article(page)

print(article.title)   # "UNDERSTANDING RUST"
print(article.author)  # "Jane Smith"
print(article.price)   # 19.99
print(article.tags)    # ["PYTHON", "RUST", "WEB"]

Benefits:

  • ✅ No manual post-processing needed
  • ✅ Clean, declarative field definitions
  • ✅ Reusable transform functions
  • ✅ Chain multiple transforms in order
  • ✅ Works with single values, lists, and attributes

📖 Full Transform Documentation

Universal Extraction API

extract_data(html, container_selector, field_mappings)

Universal HTML data extraction - works with any website structure.

Parameters:

  • html: HTML string to parse
  • container_selector: CSS selector for container elements
  • field_mappings: Dict mapping field names to CSS selectors

Returns: List of dictionaries with extracted data

extract_data_bulk(html_pages, container_selector, field_mappings)

Parallel processing of multiple HTML pages.

Parameters:

  • html_pages: List of HTML strings
  • container_selector: CSS selector for container elements
  • field_mappings: Dict mapping field names to CSS selectors

Returns: List of lists - one result list per input page
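
Because results come back as one list per input page, they line up with the input by index. A minimal sketch of pairing them back up, with trivial inline pages standing in for fetched HTML:

import rusticsoup

page_html = '<div class="product"><h2>Item</h2><span class="price">$5.00</span></div>'
html_pages = [page_html, page_html, page_html]   # normally these come from your fetch layer

field_mappings = {"title": "h2", "price": "span.price"}
results = rusticsoup.extract_data_bulk(html_pages, "div.product", field_mappings)

# results[i] holds the items extracted from html_pages[i]
for i, items in enumerate(results):
    print(f"page {i}: {len(items)} items")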

parse_html(html)

Low-level HTML parsing - returns WebScraper object for manual DOM traversal.

Parameters:

  • html: HTML string to parse

Returns: WebScraper object with select(), text(), attr() methods
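
There is no parse_html example elsewhere in this README, so here is a minimal sketch. It assumes WebScraper's text() and attr() take a selector like the WebPage methods of the same names; that detail is not spelled out above, so treat the exact calls as illustrative:

import rusticsoup

html = '<ul><li><a href="/first">First</a></li><li><a href="/second">Second</a></li></ul>'
scraper = rusticsoup.parse_html(html)

# Assumed signatures, mirroring WebPage: text(selector) and attr(selector, attribute)
print(scraper.text("li a"))           # "First"
print(scraper.attr("li a", "href"))   # "/first"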

Selector Syntax

| Syntax | Description | Example |
| --- | --- | --- |
| "selector" | Extract text content | "h1" → "Page Title" |
| "selector@attr" | Extract attribute | "a@href" → "/page.html" |
| "selector@get_all" | Extract all text | "p@get_all" → ["P1", "P2"] |
| "complex selector" | Any CSS selector | "div.class > p:first-child" |

Supported Attributes

Any HTML attribute: @href, @src, @alt, @class, @id, @data-*, etc.
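
A small sketch showing a data-* attribute pulled with the same @ syntax as href or src; data-sku is an arbitrary attribute name chosen for the example:

import rusticsoup

html = '<div class="product"><a href="/buy" data-sku="SKU-123">Buy Now</a></div>'

items = rusticsoup.extract_data(html, "div.product", {
    "sku": "a@data-sku",   # any data-* attribute works like a built-in one
    "link": "a@href",
})
print(items)  # [{'sku': 'SKU-123', 'link': '/buy'}]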

🏗️ Advanced Usage

Custom Processing

# Extract data then post-process
ads = rusticsoup.extract_data(html, "tr.ad", {
    "price": "span.price",
    "link": "a@href"
})

# Post-process the results
for ad in ads:
    # Clean price: "$29.99" → 29.99
    ad["price"] = float(ad["price"].replace("$", ""))

    # Convert relative URLs to absolute
    if ad["link"].startswith("/"):
        ad["link"] = f"https://example.com{ad['link']}"

Table Extraction

# Extract HTML tables easily
table_data = rusticsoup.extract_table_data(html, "table.data")
# Returns: [["Header1", "Header2"], ["Row1Col1", "Row1Col2"], ...]
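
Since the first sub-list is the header row, turning the remaining rows into dictionaries is plain Python; a small follow-up sketch:

import rusticsoup

html = """
<table class="data">
  <tr><th>Header1</th><th>Header2</th></tr>
  <tr><td>Row1Col1</td><td>Row1Col2</td></tr>
</table>
"""

table_data = rusticsoup.extract_table_data(html, "table.data")
header, *rows = table_data                           # first sub-list is the header row
records = [dict(zip(header, row)) for row in rows]
print(records)  # [{'Header1': 'Row1Col1', 'Header2': 'Row1Col2'}]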

Error Handling

try:
    data = rusticsoup.extract_data(html, "div.product", field_mappings)
except Exception as e:
    print(f"Parsing error: {e}")
    data = []

🆚 Migration from BeautifulSoup

Option 1: WebPage API (Recommended)

# BeautifulSoup - Imperative, verbose
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
products = []

for product in soup.select('div.product'):
    title = product.select_one('h2')
    price = product.select_one('span.price')
    link = product.select_one('a')

    products.append({
        'title': title.text if title else '',
        'price': price.text if price else '',
        'link': link.get('href') if link else ''
    })

# RusticSoup WebPage - Declarative, concise
from rusticsoup import WebPage

page = WebPage(html)
products = page.extract_all('div.product', {
    'title': 'h2',
    'price': 'span.price',
    'link': 'a@href'
})

Option 2: Universal Extraction API

# RusticSoup Universal API - Function-based
import rusticsoup

products = rusticsoup.extract_data(html, 'div.product', {
    'title': 'h2',
    'price': 'span.price',
    'link': 'a@href'
})

90% less code, 2-10x faster, handles attributes automatically!

web-poet to RusticSoup

RusticSoup's WebPage API is compatible with web-poet patterns:

# web-poet (async, slower)
from web_poet import WebPage

async def parse(page: WebPage):
    title = page.css("h1::text").get()
    links = page.css("a::attr(href)").getall()
    return {"title": title, "links": links}

# RusticSoup WebPage (sync, faster - no async needed!)
from rusticsoup import WebPage

def parse(html: str):
    page = WebPage(html)
    title = page.text("h1")
    links = page.attr_all("a", "href")
    return {"title": title, "links": links}

🔧 Installation

From PyPI (Recommended)

pip install rusticsoup

From Source

# Requires Rust toolchain
git clone https://github.com/iristech-systems/RusticSoup
cd RusticSoup
maturin develop --release

System Requirements

  • Python 3.11+
  • No additional dependencies (self-contained)

📈 Use Cases

Perfect for:

  • Web scraping - Extract data from any website
  • Data mining - Process large amounts of HTML
  • Price monitoring - Track e-commerce prices
  • Content aggregation - Collect articles, posts, listings
  • SEO analysis - Extract meta tags, titles, links
  • API alternatives - Scrape when no API exists

🀝 Contributing

Contributions welcome! Please read CONTRIBUTING.md first.

  1. Fork the repository
  2. Create your feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

  • Built on html5ever - Mozilla's HTML5 parser
  • Powered by scraper - CSS selector support
  • Inspired by BeautifulSoup - the original HTML parsing library
  • WebPage API inspired by web-poet - declarative web scraping

Made with 🦀 and ❤️ - RusticSoup: Where Rust meets HTML parsing perfection
