# Extracto

Turn any URL into structured JSON using LLMs. No more brittle web scrapers that break when the HTML changes.

Features • Installation • Quick Start • AI Providers • Examples • API
## Why Extracto?

Traditional web scrapers are fragile:

- ❌ Break when websites change their HTML structure
- ❌ Require constant maintenance of CSS selectors
- ❌ Fail on JavaScript-heavy sites
- ❌ Need separate logic for each website

Extracto uses Large Language Models to intelligently extract data instead:

- ✅ **Resilient** - Works even when the HTML changes
- ✅ **Universal** - Same code for any website
- ✅ **Type-safe** - Pydantic schema validation
- ✅ **Smart** - Handles dynamic JavaScript content
## Features

- 🎭 **Playwright Integration** - Handles dynamic, JavaScript-heavy websites
- 🤖 **Multi-Provider AI** - OpenAI, Claude, Gemini, Ollama, Groq, and more
- 🛡️ **Type Safety** - Pydantic models with automatic validation via Instructor
- 💰 **Cost Optimized** - Smart HTML cleaning reduces token usage by 50-70%
- ⚡ **Async First** - Built on asyncio for high performance
- 🎨 **Beautiful CLI** - Rich terminal interface with Typer
- 🌐 **REST API** - FastAPI microservice ready
- 🏠 **Local Option** - Use Ollama for free, private extraction
## Installation

### pip

```bash
pip install extracto
playwright install chromium
```

### Poetry

```bash
poetry add extracto
poetry run playwright install chromium
```

### From source

```bash
git clone https://github.com/meklasdev/extracto.git
cd extracto
poetry install
poetry run playwright install chromium
```

## Quick Start

Define a schema:

```python
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float | None = None
    rating: float | None = None
    in_stock: bool | None = None
```

Then extract:

```python
import asyncio
from extracto import Extractor

async def main():
    extractor = Extractor()  # Uses OPENAI_API_KEY from env
    result = await extractor.extract(
        url="https://example.com/product",
        schema=Product,
    )
    if result.success:
        print(f"Product: {result.data.name}")
        print(f"Price: ${result.data.price}")
        print(f"Rating: {result.data.rating}/5")

asyncio.run(main())
```

No CSS selectors. No XPath. No maintenance.
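Live pages occasionally fail with transient network or rate-limit errors. A small retry helper can wrap the call above; this is a plain-asyncio sketch, not part of Extracto's API, and the `attempts`/`delay` defaults are illustrative:

```python
import asyncio

async def with_retries(make_call, attempts: int = 3, delay: float = 1.0):
    """Await make_call(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the last error
            await asyncio.sleep(delay * 2 ** attempt)

# In practice:
# result = await with_retries(
#     lambda: extractor.extract(url="https://example.com/product", schema=Product)
# )
```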
## AI Providers

Extracto works with any OpenAI-compatible API:

| Provider | Model Example | Base URL | Cost |
|---|---|---|---|
| OpenAI | `gpt-4o` | Default | $$ |
| Anthropic Claude | `claude-3-5-sonnet-20241022` | Default | $$ |
| Google Gemini | `gemini-pro` | Custom | $ |
| Ollama 🏠 | `llama3`, `qwen`, `mistral` | `http://localhost:11434/v1` | Free |
| Groq ⚡ | `llama-3.1-70b` | `https://api.groq.com/openai/v1` | Free tier |
| LM Studio 💻 | Local models | `http://localhost:1234/v1` | Free |
| Together AI | `meta-llama/...` | `https://api.together.xyz/v1` | $ |
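The table above can be collapsed into a small config map. The dict and helper below are illustrative (they do not ship with Extracto); `None` means the provider uses the default OpenAI base URL:

```python
# Illustrative provider -> base URL map, mirroring the table above.
# None means Extractor's default (OpenAI) endpoint.
PROVIDER_BASE_URLS = {
    "openai": None,
    "anthropic": None,
    "gemini": "https://generativelanguage.googleapis.com/v1beta/openai/",
    "ollama": "http://localhost:11434/v1",
    "groq": "https://api.groq.com/openai/v1",
    "lm_studio": "http://localhost:1234/v1",
    "together": "https://api.together.xyz/v1",
}

def extractor_kwargs(provider: str, api_key: str) -> dict:
    """Build keyword arguments for Extractor() from a provider name."""
    kwargs = {"api_key": api_key}
    base_url = PROVIDER_BASE_URLS[provider]
    if base_url is not None:
        kwargs["base_url"] = base_url
    return kwargs
```

Usage would then look like `Extractor(**extractor_kwargs("groq", "gsk-..."))`.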
### Configuration

**OpenAI (default):**

```python
extractor = Extractor()  # Uses OPENAI_API_KEY env var
```

**Ollama (local, free):**

```python
extractor = Extractor(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Dummy key
)
```

**Anthropic Claude:**

```python
extractor = Extractor(
    api_key="sk-ant-...",  # ANTHROPIC_API_KEY
)
```

**Google Gemini:**

```python
extractor = Extractor(
    api_key="your-google-key",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)
```

**Groq:**

```python
extractor = Extractor(
    api_key="gsk-...",
    base_url="https://api.groq.com/openai/v1",
)
```

## Examples

### E-commerce product

```python
from pydantic import BaseModel, Field
from extracto import Extractor

class ProductInfo(BaseModel):
    name: str = Field(..., description="Product name")
    price: float | None = Field(None, description="Current price")
    rating: float | None = Field(None, description="Rating 0-5")
    review_count: int | None = Field(None, description="Number of reviews")
    availability: str | None = Field(None, description="Stock status")
    brand: str | None = Field(None, description="Brand name")

extractor = Extractor()
result = await extractor.extract(
    url="https://amazon.com/product/...",
    schema=ProductInfo,
)
```

### News article

```python
class Article(BaseModel):
    title: str
    author: str | None = None
    publish_date: str | None = None
    summary: str | None = None
    tags: list[str] = []

result = await extractor.extract(
    url="https://news-site.com/article",
    schema=Article,
)
```

### Job posting

```python
class JobPosting(BaseModel):
    title: str
    company: str
    location: str | None = None
    salary_range: str | None = None
    requirements: list[str] = []
    remote: bool | None = None

result = await extractor.extract(
    url="https://careers.company.com/job/123",
    schema=JobPosting,
)
```

## CLI

Extracto includes a beautiful CLI powered by Typer and Rich:
```bash
# Basic extraction
python main.py extract https://example.com

# With a custom model
python main.py extract https://example.com --model gpt-4o

# Using Ollama locally
python main.py extract https://example.com \
    --model llama3 \
    --base-url http://localhost:11434/v1

# Save to file
python main.py extract https://example.com --output result.json

# List supported providers
python main.py providers
```

## REST API

Run as a microservice:

```bash
uvicorn extracto.api:app --reload
```

Example extraction request body:

```json
{
  "url": "https://example.com/product",
  "schema_fields": {
    "name": "str",
    "price": "float | None",
    "rating": "float | None"
  },
  "model": "gpt-4o"
}
```

Other endpoints:

- List all supported AI providers
- Health check endpoint
## Architecture

Data flow:

```
URL → Playwright → Raw HTML → Cleaner → Markdown → LLM + Instructor → Validated JSON
```

Key modules:

- `browser.py` - Async Playwright automation
- `cleaner.py` - HTML → Markdown conversion (critical for cost optimization)
- `extractor.py` - Main extraction engine with Instructor
- `cli.py` - Beautiful CLI interface
- `api.py` - FastAPI microservice
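The data flow above is plain function composition. The stages below are trivial stand-ins for the real modules (Playwright rendering, Markdown conversion, and the LLM call), meant only to show the pipeline's shape:

```python
import re

def fetch(url: str) -> str:
    """Stand-in for browser.py: would render the page with Playwright."""
    return "<h1>Widget</h1><script>trackUser()</script>"

def clean(html: str) -> str:
    """Stand-in for cleaner.py: drop noise, convert to Markdown."""
    html = re.sub(r"<script>.*?</script>", "", html, flags=re.S)
    return html.replace("<h1>", "# ").replace("</h1>", "").strip()

def extract(markdown: str) -> dict:
    """Stand-in for extractor.py: would send the Markdown to the LLM."""
    return {"name": markdown.lstrip("# ")}

def pipeline(url: str) -> dict:
    return extract(clean(fetch(url)))
```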
## Advanced Usage

**Custom system prompt:**

```python
result = await extractor.extract(
    url="https://example.com",
    schema=MySchema,
    system_prompt="You are a specialized product data extractor...",
)
```

**Fetching page content directly:**

```python
from extracto import fetch_page_content

html = await fetch_page_content(
    url="https://example.com",
    timeout=60000,  # 60 seconds
    wait_for_selector=".product-price",  # Wait for a specific element
    wait_for_network_idle=True,  # Wait for the network to settle
    user_agent="Custom User Agent",
)
```

**Cleaning HTML manually:**

```python
from extracto import clean_html

cleaned = clean_html(
    html,
    remove_scripts=True,
    remove_styles=True,
    keep_attributes=["href", "src", "alt"],
)
```

## Cost Optimization

Extracto includes aggressive HTML cleaning to reduce LLM costs:
| Stage | Size | Reduction |
|---|---|---|
| Raw HTML | 150 KB | - |
| After cleaning | 45 KB | 70% |
| As Markdown | 20 KB | 87% |
This can save you 80%+ on API costs!
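The savings scale directly with per-token pricing. A rough back-of-envelope, assuming the common heuristic of ~4 characters per token for English text (real tokenizers vary by model):

```python
def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token count; real tokenizers vary by model and language."""
    return int(num_chars / chars_per_token)

raw_tokens = estimate_tokens(150_000)       # raw HTML, per the table above
markdown_tokens = estimate_tokens(20_000)   # after cleaning + Markdown

savings = 1 - markdown_tokens / raw_tokens
print(f"{raw_tokens} -> {markdown_tokens} tokens ({savings:.0%} fewer)")
# prints: 37500 -> 5000 tokens (87% fewer)
```

The 87% token reduction matches the size reduction in the table, since token count is roughly proportional to character count.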
## Development

```bash
# Run tests
poetry run pytest

# With coverage
poetry run pytest --cov=extracto

# Type checking
poetry run mypy src/

# Linting
poetry run ruff check src/
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Instructor - For making LLM outputs type-safe
- Playwright - For reliable browser automation
- Pydantic - For data validation
- FastAPI - For the API framework
- Typer - For the beautiful CLI
If you find this project useful, please consider giving it a star! ⭐

Built with ❤️ by meklasdev

