
Extracto Banner

🔍 Extracto

Turn any URL into structured JSON using LLMs
No more brittle web scrapers that break when HTML changes

Python 3.11+ • License: MIT • Code style: black

Features • Installation • Quick Start • AI Providers • Examples • API


🎯 The Problem

Traditional web scrapers are fragile:

  • ❌ Break when websites change their HTML structure
  • ❌ Require constant maintenance of CSS selectors
  • ❌ Fail on JavaScript-heavy sites
  • ❌ Need separate logic for each website

✨ The Solution

Extracto uses Large Language Models to intelligently extract data:

  • ✅ Resilient - Works even when HTML changes
  • ✅ Universal - Same code for any website
  • ✅ Type-safe - Pydantic schema validation
  • ✅ Smart - Handles dynamic JavaScript content

🚀 Features

  • 🎭 Playwright Integration - Handles dynamic JavaScript-heavy websites
  • 🤖 Multi-Provider AI - OpenAI, Claude, Gemini, Ollama, Groq, and more
  • 🔒 Type Safety - Pydantic models with automatic validation via Instructor
  • 💰 Cost Optimized - Smart HTML cleaning reduces token usage by 50-70%
  • ⚡ Async First - Built on asyncio for high performance
  • 🎨 Beautiful CLI - Rich terminal interface with Typer
  • 🌐 REST API - FastAPI microservice ready
  • 🆓 Local Option - Use Ollama for free, private extraction

📦 Installation

Using pip

pip install extracto
playwright install chromium

Using Poetry (recommended)

poetry add extracto
poetry run playwright install chromium

From source

git clone https://github.com/meklasdev/extracto.git
cd extracto
poetry install
poetry run playwright install chromium

⚡ Quick Start

1. Define Your Schema

from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float | None = None
    rating: float | None = None
    in_stock: bool | None = None

2. Extract Data

import asyncio
from extracto import Extractor

async def main():
    extractor = Extractor()  # Uses OPENAI_API_KEY from env
    
    result = await extractor.extract(
        url="https://example.com/product",
        schema=Product,
    )
    
    if result.success:
        print(f"Product: {result.data.name}")
        print(f"Price: ${result.data.price}")
        print(f"Rating: {result.data.rating}/5")

asyncio.run(main())

3. That's it! 🎉

No CSS selectors. No XPath. No maintenance.


🤖 Supported AI Providers

Extracto works with any OpenAI-compatible API:

| Provider | Model Example | Base URL | Cost |
|---|---|---|---|
| OpenAI | gpt-4o | Default | $$ |
| Anthropic Claude | claude-3-5-sonnet-20241022 | Default | $$ |
| Google Gemini | gemini-pro | Custom | $ |
| Ollama 🆓 | llama3, qwen, mistral | http://localhost:11434/v1 | Free |
| Groq ⚡ | llama-3.1-70b | https://api.groq.com/openai/v1 | Free tier |
| LM Studio 🆓 | Local models | http://localhost:1234/v1 | Free |
| Together AI | meta-llama/... | https://api.together.xyz/v1 | $ |

Using Different Providers

OpenAI (Default)

extractor = Extractor()  # Uses OPENAI_API_KEY env var

Ollama (Local, Free)

extractor = Extractor(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Dummy key
)

Anthropic Claude

extractor = Extractor(
    api_key="sk-ant-...",  # ANTHROPIC_API_KEY
)

Google Gemini

extractor = Extractor(
    api_key="your-google-key",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

Groq (Fast & Free Tier)

extractor = Extractor(
    api_key="gsk-...",
    base_url="https://api.groq.com/openai/v1"
)

💡 Examples

E-commerce Product Scraping

from pydantic import BaseModel, Field
from extracto import Extractor

class ProductInfo(BaseModel):
    name: str = Field(..., description="Product name")
    price: float | None = Field(None, description="Current price")
    rating: float | None = Field(None, description="Rating 0-5")
    review_count: int | None = Field(None, description="Number of reviews")
    availability: str | None = Field(None, description="Stock status")
    brand: str | None = Field(None, description="Brand name")

extractor = Extractor()
result = await extractor.extract(
    url="https://amazon.com/product/...",
    schema=ProductInfo
)

News Article Extraction

class Article(BaseModel):
    title: str
    author: str | None = None
    publish_date: str | None = None
    summary: str | None = None
    tags: list[str] = []

result = await extractor.extract(
    url="https://news-site.com/article",
    schema=Article
)

Job Posting Data

class JobPosting(BaseModel):
    title: str
    company: str
    location: str | None = None
    salary_range: str | None = None
    requirements: list[str] = []
    remote: bool | None = None

result = await extractor.extract(
    url="https://careers.company.com/job/123",
    schema=JobPosting
)
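Because `extract` is a coroutine, several pages can be processed concurrently with `asyncio.gather`. A minimal sketch of the pattern, using a stand-in coroutine (`fake_extract`, our own name) in place of a live `Extractor`, since a real run needs an API key:

```python
import asyncio

# Illustrative only: fake_extract stands in for Extractor().extract;
# the gather pattern is the point, not the stub.
async def fake_extract(url: str) -> dict:
    await asyncio.sleep(0)  # pretend to fetch and extract
    return {"url": url, "success": True}

async def extract_many(urls: list[str]) -> list[dict]:
    # Schedule every extraction at once; asyncio interleaves the I/O waits.
    return await asyncio.gather(*(fake_extract(u) for u in urls))

results = asyncio.run(extract_many(["https://a.example", "https://b.example"]))
```

In real use, replace `fake_extract(u)` with `extractor.extract(url=u, schema=MySchema)` and reuse one `Extractor` instance across the batch.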

🖥️ CLI Usage

Extracto includes a beautiful CLI powered by Typer and Rich:

# Basic extraction
python main.py extract https://example.com

# With custom model
python main.py extract https://example.com --model gpt-4o

# Using Ollama locally
python main.py extract https://example.com \
  --model llama3 \
  --base-url http://localhost:11434/v1

# Save to file
python main.py extract https://example.com --output result.json

# List supported providers
python main.py providers

🌐 REST API

Run as a microservice:

uvicorn extracto.api:app --reload

API Endpoints

POST /extract

{
  "url": "https://example.com/product",
  "schema_fields": {
    "name": "str",
    "price": "float | None",
    "rating": "float | None"
  },
  "model": "gpt-4o"
}

GET /providers

List all supported AI providers

GET /health

Health check endpoint
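The POST /extract endpoint above can be exercised from Python's standard library alone. A hedged sketch, assuming the service is running locally on uvicorn's default port 8000; `build_extract_payload` and `post_extract` are our own helper names, not part of Extracto:

```python
import json
import urllib.request

def build_extract_payload(url: str, fields: dict, model: str = "gpt-4o") -> dict:
    """Assemble the JSON body expected by POST /extract (shape from the README)."""
    return {"url": url, "schema_fields": fields, "model": model}

def post_extract(payload: dict, base: str = "http://localhost:8000") -> dict:
    """POST the payload to a locally running `uvicorn extracto.api:app`."""
    req = urllib.request.Request(
        f"{base}/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_extract_payload(
    "https://example.com/product",
    {"name": "str", "price": "float | None", "rating": "float | None"},
)
# post_extract(payload) returns the extracted JSON once the API is up.
```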


πŸ—οΈ Architecture

Architecture Flow

Data Flow:

URL → Playwright → Raw HTML → Cleaner → Markdown → LLM + Instructor → Validated JSON

Key Components

  • browser.py - Async Playwright automation
  • cleaner.py - HTML → Markdown conversion (critical for cost optimization)
  • extractor.py - Main extraction engine with Instructor
  • cli.py - Beautiful CLI interface
  • api.py - FastAPI microservice

🔧 Advanced Configuration

Custom System Prompt

result = await extractor.extract(
    url="https://example.com",
    schema=MySchema,
    system_prompt="You are a specialized product data extractor..."
)

Browser Options

from extracto import fetch_page_content

html = await fetch_page_content(
    url="https://example.com",
    timeout=60000,  # 60 seconds
    wait_for_selector=".product-price",  # Wait for specific element
    wait_for_network_idle=True,  # Wait for network
    user_agent="Custom User Agent"
)

HTML Cleaning Options

from extracto import clean_html

cleaned = clean_html(
    html,
    remove_scripts=True,
    remove_styles=True,
    keep_attributes=["href", "src", "alt"]
)
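For intuition, the kind of noise-stripping the cleaner performs can be sketched with the standard library's `html.parser`. This is an illustration of the idea, not Extracto's actual implementation:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Drop <script>/<style> content, keep visible text (toy cleaner)."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.text_parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())

def strip_noise(html: str) -> str:
    parser = TagStripper()
    parser.feed(html)
    return " ".join(parser.text_parts)

print(strip_noise("<p>Price: $9</p><script>track()</script>"))  # Price: $9
```

Dropping scripts, styles, and boilerplate markup before the LLM sees the page is what drives the token savings described in the next section.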

💰 Cost Optimization

Extracto includes aggressive HTML cleaning to reduce LLM costs:

| Stage | Size | Reduction |
|---|---|---|
| Raw HTML | 150 KB | - |
| After cleaning | 45 KB | 70% |
| As Markdown | 20 KB | 87% |

This can save you 80%+ on API costs!
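The reduction percentages follow directly from the size figures in the table; a quick check of the arithmetic:

```python
# Size figures from the table above, in KB.
raw_html, cleaned, markdown = 150, 45, 20

def reduction(after_kb: int, before_kb: int = raw_html) -> int:
    """Percentage size reduction relative to the raw HTML."""
    return round((before_kb - after_kb) / before_kb * 100)

print(reduction(cleaned), reduction(markdown))  # 70 87
```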


🧪 Testing

# Run tests
poetry run pytest

# With coverage
poetry run pytest --cov=extracto

# Type checking
poetry run mypy src/

# Linting
poetry run ruff check src/

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments


🌟 Star History

If you find this project useful, please consider giving it a star! ⭐


Built with ❤️ by meklasdev

Report Bug • Request Feature
