RusticSoup 🦀🍲

Lightning-fast HTML parser and data extractor built in Rust


🚀 Why RusticSoup?

| Feature | BeautifulSoup | RusticSoup | Advantage |
| --- | --- | --- | --- |
| Google Shopping | 8.1ms | 3.9ms | 2.1x faster |
| Product grids | 14ms | 1.2ms | 12x faster |
| Bulk processing | Sequential | Parallel | Up to 100x faster |
| Attribute extraction | Manual loops | @href syntax | Zero loops needed |
| WebPage API | ❌ | ✅ | web-poet inspired |
| CSS selectors | ✅ | ✅ | Same API |
| Memory usage | High | Low | Rust efficiency |

⚑ Quick Start

pip install rusticsoup

Option 1: WebPage API (Recommended - web-poet style)

from rusticsoup import WebPage

html = """
<div class="product">
    <h2>Amazing Product</h2>
    <span class="price">$29.99</span>
    <a href="/buy" class="buy-btn">Buy Now</a>
    <img src="/image.jpg" alt="product">
</div>
"""

# Create a WebPage
page = WebPage(html, url="https://example.com/products")

# Extract single values
title = page.text("h2")                    # "Amazing Product"
price = page.text("span.price")            # "$29.99"
link = page.attr("a.buy-btn", "href")      # "/buy"

# Or extract structured data
product = page.extract({
    "title": "h2",
    "price": "span.price",
    "link": "a.buy-btn@href",   # @ syntax for attributes
    "image": "img@src"
})
# {'title': 'Amazing Product', 'price': '$29.99', 'link': '/buy', 'image': '/image.jpg'}

Option 2: Universal Extraction (Original API)

import rusticsoup

# Define what you want to extract
field_mappings = {
    "title": "h2",              # Text content
    "price": "span.price",      # Text content
    "link": "a.buy-btn@href",   # Attribute extraction with @
    "image": "img@src"          # Any attribute: @src, @href, @alt, etc.
}

# Extract data - no manual loops, no site-specific logic
products = rusticsoup.extract_data(html, "div.product", field_mappings)

print(products)
# [{"title": "Amazing Product", "price": "$29.99", "link": "/buy", "image": "/image.jpg"}]

📚 Documentation & Examples

🎯 Core Features

🌟 NEW in v0.4.0: ItemPage with extract_all()

The cleanest extraction pattern with per-field transforms:

from rusticsoup import WebPage, Field, ItemPage

# Define your data model once
class ProductReview(ItemPage):
    author = Field(css='span.author', transform=str.strip)
    rating = Field(css='span.rating', transform=lambda s: float(s.split()[0]))
    text = Field(css='p.review-text', transform=str.strip)
    # Fallback selectors for robustness
    date = Field(css=['time.published', 'span.date'])

# One line to extract everything with transforms applied!
page = WebPage(html)
reviews = page.extract_all('div.review', ProductReview)

# Clean, type-safe access
for review in reviews:
    print(f"{review.author} ({review.rating}★): {review.text}")

Benefits:

  • ✅ Declarative field definitions with transforms
  • ✅ No post-processing list comprehensions
  • ✅ Reusable data models
  • ✅ Type-safe attribute access
  • ✅ Fallback selectors built-in

📖 Full ItemPage Documentation

🌟 WebPage API (web-poet inspired)

High-level, declarative API for web scraping:

from rusticsoup import WebPage

page = WebPage(html, url="https://example.com")

# Simple extraction
title = page.text("h1")
links = page.attr_all("a", "href")

# Extract multiple items at once
products = page.extract_all(".product", {
    "name": "h2",
    "price": ".price",
    "url": "a@href"
})

# Check existence
if page.has("nav.menu"):
    nav_items = page.text_all("nav.menu a")

# URL resolution
absolute_url = page.absolute_url("/products/123")

📖 Full WebPage API Documentation | 🚀 Quick Start Guide | 🆘 Help Center | 🧪 Examples

✅ Universal Extraction

Works with any HTML structure - no site-specific parsers needed:

# Google Shopping
rusticsoup.extract_data(html, 'tr[data-is-grid-offer="true"]', {
    'seller': 'a.b5ycib',
    'price': 'span.g9WBQb',
    'link': 'a.UxuaJe@href'
})

# Amazon Products
rusticsoup.extract_data(html, '[data-component-type="s-search-result"]', {
    'title': 'h2 a span',
    'price': '.a-price-whole',
    'rating': '.a-icon-alt',
    'url': 'h2 a@href'
})

# Any website
rusticsoup.extract_data(html, 'your-container-selector', {
    'any_field': 'any.css.selector',
    'any_attribute': 'element@attribute_name'
})

✅ Bulk Processing

Process multiple pages in parallel:

# Process 100 pages simultaneously
pages = [html1, html2, html3, ...]  # List of HTML strings
results = rusticsoup.extract_data_bulk(pages, "div.product", field_mappings)

# Each page processed in parallel using Rust's Rayon
# 10-100x faster than sequential processing

✅ Attribute Extraction

No more manual loops for collecting href, src, and other attributes:

# Before (BeautifulSoup)
links = []
for element in soup.select('a'):
    if element.get('href'):
        links.append(element['href'])

# After (RusticSoup)
data = rusticsoup.extract_data(html, 'div', {'links': 'a@href'})

✅ Browser-Grade Parsing

Built on html5ever - the HTML5 parser from Mozilla's Servo project:

  • Handles malformed HTML gracefully
  • WHATWG HTML5 compliant
  • Blazing fast native performance
  • Memory safe (Rust)

📊 Performance Benchmarks

Real-world scraping performance vs BeautifulSoup:

# Google Shopping: 30 ads per page
BeautifulSoup:  8.1ms per page
RusticSoup:     3.9ms per page  (2.1x faster)

# Product grids: 50 products per page
BeautifulSoup:  14ms per page
RusticSoup:     1.2ms per page  (12x faster)

# Bulk processing: 100 pages
BeautifulSoup:  Sequential ~1.4s
RusticSoup:     Parallel ~14ms   (100x faster)
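
These timings depend on hardware and page size. Below is a minimal sketch for measuring extract_data on your own pages with Python's timeit; the sample HTML and field mappings are placeholders, not the benchmark suite used above:

import timeit

import rusticsoup

# Placeholder page - substitute any saved HTML you want to measure
html = '<div class="product"><h2>Amazing Product</h2><span class="price">$29.99</span></div>' * 50

field_mappings = {"title": "h2", "price": "span.price"}

# Average per-page cost over 100 runs, reported in milliseconds
total = timeit.timeit(
    lambda: rusticsoup.extract_data(html, "div.product", field_mappings),
    number=100,
)
print(f"{total / 100 * 1000:.2f} ms per page")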

🛠️ API Reference

Two Powerful APIs

RusticSoup provides two complementary APIs:

  1. WebPage API - High-level, object-oriented (Recommended for new projects)
  2. Universal Extraction API - Function-based, great for batch processing

WebPage API

from rusticsoup import WebPage

page = WebPage(html, url="https://example.com")

Key Methods:

  • text(selector) - Extract text from first match
  • text_all(selector) - Extract text from all matches
  • attr(selector, attribute) - Extract attribute from first match
  • attr_all(selector, attribute) - Extract attribute from all matches
  • extract(mappings) - Extract structured data
  • extract_all(container, mappings) - Extract multiple items
  • has(selector) - Check if selector matches
  • count(selector) - Count matching elements
  • absolute_url(url) - Convert relative to absolute URL

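A short sketch that exercises count(), text_all(), and absolute_url() from the list above; only the signatures listed here are assumed:

from rusticsoup import WebPage

html = '<h2>One</h2><h2>Two</h2><a href="/docs">Docs</a>'
page = WebPage(html, url="https://example.com")

# Count matching elements and pull text from every match
print(page.count("h2"))        # 2
print(page.text_all("h2"))     # ["One", "Two"]

# Resolve a relative link against the page URL
print(page.absolute_url("/docs"))  # "https://example.com/docs"
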
📖 Full WebPage Documentation

🔄 Field Transforms (NEW in v0.2.2)

Apply transformations to extracted data automatically:

from rusticsoup import WebPage, Field
from rusticsoup_helpers import ItemPage

class Article(ItemPage):
    # Single transform
    title = Field(css="h1", transform=str.upper)

    # Chain multiple transforms
    author = Field(
        css=".author",
        transform=[
            str.strip,
            str.title,
            lambda s: s.replace("by ", "")
        ]
    )

    # Transform with attribute extraction
    price = Field(
        css=".price",
        transform=[
            str.strip,
            lambda s: float(s.replace("$", ""))
        ]
    )

    # Transform lists
    tags = Field(
        css=".tag",
        get_all=True,
        transform=lambda tags: [t.upper() for t in tags]
    )

page = WebPage(html)
article = Article(page)

print(article.title)   # "UNDERSTANDING RUST"
print(article.author)  # "Jane Smith"
print(article.price)   # 19.99
print(article.tags)    # ["PYTHON", "RUST", "WEB"]

Benefits:

  • ✅ No manual post-processing needed
  • ✅ Clean, declarative field definitions
  • ✅ Reusable transform functions
  • ✅ Chain multiple transforms in order
  • ✅ Works with single values, lists, and attributes

📖 Full Transform Documentation

Universal Extraction API

extract_data(html, container_selector, field_mappings)

Universal HTML data extraction - works with any website structure.

Parameters:

  • html: HTML string to parse
  • container_selector: CSS selector for container elements
  • field_mappings: Dict mapping field names to CSS selectors

Returns: List of dictionaries with extracted data

extract_data_bulk(html_pages, container_selector, field_mappings)

Parallel processing of multiple HTML pages.

Parameters:

  • html_pages: List of HTML strings
  • container_selector: CSS selector for container elements
  • field_mappings: Dict mapping field names to CSS selectors

Returns: List of lists - one result list per input page
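
Because results come back as one list per input page, they line up with the input by index. A minimal sketch of pairing them back up, with trivial inline pages standing in for fetched HTML:

import rusticsoup

page_html = '<div class="product"><h2>Item</h2><span class="price">$5.00</span></div>'
html_pages = [page_html, page_html, page_html]   # normally these come from your fetch layer

field_mappings = {"title": "h2", "price": "span.price"}
results = rusticsoup.extract_data_bulk(html_pages, "div.product", field_mappings)

# results[i] holds the items extracted from html_pages[i]
for i, items in enumerate(results):
    print(f"page {i}: {len(items)} items")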

parse_html(html)

Low-level HTML parsing - returns WebScraper object for manual DOM traversal.

Parameters:

  • html: HTML string to parse

Returns: WebScraper object with select(), text(), attr() methods
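
There is no parse_html example elsewhere in this README, so here is a minimal sketch. It assumes WebScraper's text() and attr() take a selector like the WebPage methods of the same names; that detail is not spelled out above, so treat the exact calls as illustrative:

import rusticsoup

html = '<ul><li><a href="/first">First</a></li><li><a href="/second">Second</a></li></ul>'
scraper = rusticsoup.parse_html(html)

# Assumed signatures, mirroring WebPage: text(selector) and attr(selector, attribute)
print(scraper.text("li a"))           # "First"
print(scraper.attr("li a", "href"))   # "/first"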

Selector Syntax

| Syntax | Description | Example |
| --- | --- | --- |
| "selector" | Extract text content | "h1" → "Page Title" |
| "selector@attr" | Extract attribute | "a@href" → "/page.html" |
| "selector@get_all" | Extract all text | "p@get_all" → ["P1", "P2"] |
| "complex selector" | Any CSS selector | "div.class > p:first-child" |

Supported Attributes

Any HTML attribute: @href, @src, @alt, @class, @id, @data-*, etc.
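
A small sketch showing a data-* attribute pulled with the same @ syntax as href or src; data-sku is an arbitrary attribute name chosen for the example:

import rusticsoup

html = '<div class="product"><a href="/buy" data-sku="SKU-123">Buy Now</a></div>'

items = rusticsoup.extract_data(html, "div.product", {
    "sku": "a@data-sku",   # any data-* attribute works like a built-in one
    "link": "a@href",
})
print(items)  # [{'sku': 'SKU-123', 'link': '/buy'}]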

🏗️ Advanced Usage

Custom Processing

# Extract data then post-process
ads = rusticsoup.extract_data(html, "tr.ad", {
    "price": "span.price",
    "link": "a@href"
})

# Post-process the results
for ad in ads:
    # Clean price: "$29.99" → 29.99
    ad["price"] = float(ad["price"].replace("$", ""))

    # Convert relative URLs to absolute
    if ad["link"].startswith("/"):
        ad["link"] = f"https://example.com{ad['link']}"

Table Extraction

# Extract HTML tables easily
table_data = rusticsoup.extract_table_data(html, "table.data")
# Returns: [["Header1", "Header2"], ["Row1Col1", "Row1Col2"], ...]
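
Since the first sub-list is the header row, turning the remaining rows into dictionaries is plain Python; a small follow-up sketch:

import rusticsoup

html = """
<table class="data">
  <tr><th>Header1</th><th>Header2</th></tr>
  <tr><td>Row1Col1</td><td>Row1Col2</td></tr>
</table>
"""

table_data = rusticsoup.extract_table_data(html, "table.data")
header, *rows = table_data                           # first sub-list is the header row
records = [dict(zip(header, row)) for row in rows]
print(records)  # [{'Header1': 'Row1Col1', 'Header2': 'Row1Col2'}]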

Error Handling

try:
    data = rusticsoup.extract_data(html, "div.product", field_mappings)
except Exception as e:
    print(f"Parsing error: {e}")
    data = []

🆚 Migration from BeautifulSoup

Option 1: WebPage API (Recommended)

# BeautifulSoup - Imperative, verbose
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
products = []

for product in soup.select('div.product'):
    title = product.select_one('h2')
    price = product.select_one('span.price')
    link = product.select_one('a')

    products.append({
        'title': title.text if title else '',
        'price': price.text if price else '',
        'link': link.get('href') if link else ''
    })

# RusticSoup WebPage - Declarative, concise
from rusticsoup import WebPage

page = WebPage(html)
products = page.extract_all('div.product', {
    'title': 'h2',
    'price': 'span.price',
    'link': 'a@href'
})

Option 2: Universal Extraction API

# RusticSoup Universal API - Function-based
import rusticsoup

products = rusticsoup.extract_data(html, 'div.product', {
    'title': 'h2',
    'price': 'span.price',
    'link': 'a@href'
})

90% less code, 2-10x faster, handles attributes automatically!

web-poet to RusticSoup

RusticSoup's WebPage API is compatible with web-poet patterns:

# web-poet (async, slower)
from web_poet import WebPage

async def parse(page: WebPage):
    title = page.css("h1::text").get()
    links = page.css("a::attr(href)").getall()
    return {"title": title, "links": links}

# RusticSoup WebPage (sync, faster - no async needed!)
from rusticsoup import WebPage

def parse(html: str):
    page = WebPage(html)
    title = page.text("h1")
    links = page.attr_all("a", "href")
    return {"title": title, "links": links}

🔧 Installation

From PyPI (Recommended)

pip install rusticsoup

From Source

# Requires Rust toolchain
git clone https://github.com/iristech-systems/RusticSoup
cd RusticSoup
maturin develop --release

System Requirements

  • Python 3.11+
  • No additional dependencies (self-contained)

📈 Use Cases

Perfect for:

  • Web scraping - Extract data from any website
  • Data mining - Process large amounts of HTML
  • Price monitoring - Track e-commerce prices
  • Content aggregation - Collect articles, posts, listings
  • SEO analysis - Extract meta tags, titles, links
  • API alternatives - Scrape when no API exists

🀝 Contributing

Contributions welcome! Please read CONTRIBUTING.md first.

  1. Fork the repository
  2. Create your feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

  • Built on html5ever - Mozilla's HTML5 parser
  • Powered by scraper - CSS selector support
  • Inspired by BeautifulSoup - the original HTML parsing library
  • WebPage API inspired by web-poet - declarative web scraping

Made with 🦀 and ❤️ - RusticSoup: Where Rust meets HTML parsing perfection
