
Extracto Banner

🔍 Extracto

Turn any URL into structured JSON using LLMs
No more brittle web scrapers that break when HTML changes

Python 3.11+ • License: MIT • Code style: black

Features • Installation • Quick Start • AI Providers • Examples • API


🎯 The Problem

Traditional web scrapers are fragile:

  • ❌ Break when websites change their HTML structure
  • ❌ Require constant maintenance of CSS selectors
  • ❌ Fail on JavaScript-heavy sites
  • ❌ Need separate logic for each website

✨ The Solution

Extracto uses Large Language Models to intelligently extract data:

  • ✅ Resilient - Works even when HTML changes
  • ✅ Universal - Same code for any website
  • ✅ Type-safe - Pydantic schema validation
  • ✅ Smart - Handles dynamic JavaScript content

🚀 Features

  • 🎭 Playwright Integration - Handles dynamic JavaScript-heavy websites
  • 🤖 Multi-Provider AI - OpenAI, Claude, Gemini, Ollama, Groq, and more
  • 🔒 Type Safety - Pydantic models with automatic validation via Instructor
  • 💰 Cost Optimized - Smart HTML cleaning reduces token usage by 50-70%
  • ⚡ Async First - Built on asyncio for high performance
  • 🎨 Beautiful CLI - Rich terminal interface with Typer
  • 🌐 REST API - FastAPI microservice ready
  • 🆓 Local Option - Use Ollama for free, private extraction

📦 Installation

Using pip

pip install extracto
playwright install chromium

Using Poetry (recommended)

poetry add extracto
poetry run playwright install chromium

From source

git clone https://github.com/meklasdev/extracto.git
cd extracto
poetry install
poetry run playwright install chromium

⚡ Quick Start

1. Define Your Schema

from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float | None = None
    rating: float | None = None
    in_stock: bool | None = None

2. Extract Data

import asyncio
from extracto import Extractor

async def main():
    extractor = Extractor()  # Uses OPENAI_API_KEY from env
    
    result = await extractor.extract(
        url="https://example.com/product",
        schema=Product,
    )
    
    if result.success:
        print(f"Product: {result.data.name}")
        print(f"Price: ${result.data.price}")
        print(f"Rating: {result.data.rating}/5")

asyncio.run(main())

3. That's it! 🎉

No CSS selectors. No XPath. No maintenance.


🤖 Supported AI Providers

Extracto works with any OpenAI-compatible API:

| Provider | Model Example | Base URL | Cost |
|---|---|---|---|
| OpenAI | gpt-4o | Default | $$ |
| Anthropic Claude | claude-3-5-sonnet-20241022 | Default | $$ |
| Google Gemini | gemini-pro | Custom | $ |
| Ollama 🆓 | llama3, qwen, mistral | http://localhost:11434/v1 | Free |
| Groq ⚡ | llama-3.1-70b | https://api.groq.com/openai/v1 | Free tier |
| LM Studio 🆓 | Local models | http://localhost:1234/v1 | Free |
| Together AI | meta-llama/... | https://api.together.xyz/v1 | $ |

Using Different Providers

OpenAI (Default)

extractor = Extractor()  # Uses OPENAI_API_KEY env var

Ollama (Local, Free)

extractor = Extractor(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Dummy key
)

Anthropic Claude

extractor = Extractor(
    api_key="sk-ant-...",  # ANTHROPIC_API_KEY
)

Google Gemini

extractor = Extractor(
    api_key="your-google-key",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

Groq (Fast & Free Tier)

extractor = Extractor(
    api_key="gsk-...",
    base_url="https://api.groq.com/openai/v1"
)

💡 Examples

E-commerce Product Scraping

from pydantic import BaseModel, Field
from extracto import Extractor

class ProductInfo(BaseModel):
    name: str = Field(..., description="Product name")
    price: float | None = Field(None, description="Current price")
    rating: float | None = Field(None, description="Rating 0-5")
    review_count: int | None = Field(None, description="Number of reviews")
    availability: str | None = Field(None, description="Stock status")
    brand: str | None = Field(None, description="Brand name")

extractor = Extractor()
result = await extractor.extract(
    url="https://amazon.com/product/...",
    schema=ProductInfo
)

News Article Extraction

class Article(BaseModel):
    title: str
    author: str | None = None
    publish_date: str | None = None
    summary: str | None = None
    tags: list[str] = []

result = await extractor.extract(
    url="https://news-site.com/article",
    schema=Article
)

Job Posting Data

class JobPosting(BaseModel):
    title: str
    company: str
    location: str | None = None
    salary_range: str | None = None
    requirements: list[str] = []
    remote: bool | None = None

result = await extractor.extract(
    url="https://careers.company.com/job/123",
    schema=JobPosting
)
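Because `extract` is a coroutine, several pages can be processed concurrently with `asyncio.gather`. A minimal sketch of the pattern, using a stand-in coroutine (`fake_extract`, our own name) in place of a live `Extractor`, since a real run needs an API key:

```python
import asyncio

# Illustrative only: fake_extract stands in for Extractor().extract;
# the gather pattern is the point, not the stub.
async def fake_extract(url: str) -> dict:
    await asyncio.sleep(0)  # pretend to fetch and extract
    return {"url": url, "success": True}

async def extract_many(urls: list[str]) -> list[dict]:
    # Schedule every extraction at once; asyncio interleaves the I/O waits.
    return await asyncio.gather(*(fake_extract(u) for u in urls))

results = asyncio.run(extract_many(["https://a.example", "https://b.example"]))
```

In real use, replace `fake_extract(u)` with `extractor.extract(url=u, schema=MySchema)` and reuse one `Extractor` instance across the batch.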

🖥️ CLI Usage

Extracto includes a beautiful CLI powered by Typer and Rich:

# Basic extraction
python main.py extract https://example.com

# With custom model
python main.py extract https://example.com --model gpt-4o

# Using Ollama locally
python main.py extract https://example.com \
  --model llama3 \
  --base-url http://localhost:11434/v1

# Save to file
python main.py extract https://example.com --output result.json

# List supported providers
python main.py providers

🌐 REST API

Run as a microservice:

uvicorn extracto.api:app --reload

API Endpoints

POST /extract

{
  "url": "https://example.com/product",
  "schema_fields": {
    "name": "str",
    "price": "float | None",
    "rating": "float | None"
  },
  "model": "gpt-4o"
}

GET /providers

List all supported AI providers

GET /health

Health check endpoint
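The POST /extract endpoint above can be exercised from Python's standard library alone. A hedged sketch, assuming the service is running locally on uvicorn's default port 8000; `build_extract_payload` and `post_extract` are our own helper names, not part of Extracto:

```python
import json
import urllib.request

def build_extract_payload(url: str, fields: dict, model: str = "gpt-4o") -> dict:
    """Assemble the JSON body expected by POST /extract (shape from the README)."""
    return {"url": url, "schema_fields": fields, "model": model}

def post_extract(payload: dict, base: str = "http://localhost:8000") -> dict:
    """POST the payload to a locally running `uvicorn extracto.api:app`."""
    req = urllib.request.Request(
        f"{base}/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_extract_payload(
    "https://example.com/product",
    {"name": "str", "price": "float | None", "rating": "float | None"},
)
# post_extract(payload) returns the extracted JSON once the API is up.
```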


πŸ—οΈ Architecture

Architecture Flow

Data Flow:

URL → Playwright → Raw HTML → Cleaner → Markdown → LLM + Instructor → Validated JSON

Key Components

  • browser.py - Async Playwright automation
  • cleaner.py - HTML → Markdown conversion (critical for cost optimization)
  • extractor.py - Main extraction engine with Instructor
  • cli.py - Beautiful CLI interface
  • api.py - FastAPI microservice

🔧 Advanced Configuration

Custom System Prompt

result = await extractor.extract(
    url="https://example.com",
    schema=MySchema,
    system_prompt="You are a specialized product data extractor..."
)

Browser Options

from extracto import fetch_page_content

html = await fetch_page_content(
    url="https://example.com",
    timeout=60000,  # 60 seconds
    wait_for_selector=".product-price",  # Wait for specific element
    wait_for_network_idle=True,  # Wait for network
    user_agent="Custom User Agent"
)

HTML Cleaning Options

from extracto import clean_html

cleaned = clean_html(
    html,
    remove_scripts=True,
    remove_styles=True,
    keep_attributes=["href", "src", "alt"]
)
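For intuition, the kind of noise-stripping the cleaner performs can be sketched with the standard library's `html.parser`. This is an illustration of the idea, not Extracto's actual implementation:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Drop <script>/<style> content, keep visible text (toy cleaner)."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.text_parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())

def strip_noise(html: str) -> str:
    parser = TagStripper()
    parser.feed(html)
    return " ".join(parser.text_parts)

print(strip_noise("<p>Price: $9</p><script>track()</script>"))  # Price: $9
```

Dropping scripts, styles, and boilerplate markup before the LLM sees the page is what drives the token savings described in the next section.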

💰 Cost Optimization

Extracto includes aggressive HTML cleaning to reduce LLM costs:

| Stage | Size | Reduction |
|---|---|---|
| Raw HTML | 150 KB | - |
| After cleaning | 45 KB | 70% |
| As Markdown | 20 KB | 87% |

This can save you 80%+ on API costs!
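The reduction percentages follow directly from the size figures in the table; a quick check of the arithmetic:

```python
# Size figures from the table above, in KB.
raw_html, cleaned, markdown = 150, 45, 20

def reduction(after_kb: int, before_kb: int = raw_html) -> int:
    """Percentage size reduction relative to the raw HTML."""
    return round((before_kb - after_kb) / before_kb * 100)

print(reduction(cleaned), reduction(markdown))  # 70 87
```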


🧪 Testing

# Run tests
poetry run pytest

# With coverage
poetry run pytest --cov=extracto

# Type checking
poetry run mypy src/

# Linting
poetry run ruff check src/

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments


🌟 Star History

If you find this project useful, please consider giving it a star! ⭐


Built with ❤️ by meklasdev

Report Bug • Request Feature
