A high-performance Rust-based HTTP service for extracting text from PDF documents with OCR support for scanned documents. Built for fast, reliable PDF processing with Railway deployment support.
- Fast PDF Text Extraction: Uses the `pdf-extract` crate for efficient text extraction
- OCR Support: Tesseract OCR fallback for scanned PDFs
- High Performance: Handles 100+ concurrent requests
- API Key Authentication: Secure Bearer token authentication
- Rate Limiting: Global concurrent request limiting
- Railway Ready: Optimized for Railway deployment
- Comprehensive Logging: Structured logging with tracing
- Health Monitoring: Built-in health check endpoint
- Rust 1.75+
- Tesseract OCR (for scanned PDF support)
# macOS
brew install tesseract
# Ubuntu
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev
- Clone and setup:
git clone <repository-url>
cd elicit
cp .env.example .env
- Configure environment variables:
# Edit .env file
VALID_API_KEYS=your-secret-key-1,your-secret-key-2
MAX_FILE_SIZE_MB=10
MAX_CONCURRENT_REQUESTS=100
- Run the service:
cargo run
The service will start on http://localhost:8080
# Build the image
docker build -t elicit .
# Run with environment variables
docker run -p 8080:8080 \
-e VALID_API_KEYS="your-key-1,your-key-2" \
-e MAX_FILE_SIZE_MB=10 \
-e MAX_CONCURRENT_REQUESTS=100 \
elicit
Endpoint: POST /api/v1/extract
Headers:
Authorization: Bearer your-api-key
Content-Type: multipart/form-data
Request Body: Multipart form with file field containing PDF
Example with curl:
curl -X POST http://localhost:8080/api/v1/extract \
-H "Authorization: Bearer your-api-key" \
-F "file=@document.pdf"
Success Response (200):
{
  "success": true,
  "data": {
    "text": "Extracted text content from the PDF...",
    "pages": 5,
    "metadata": {
      "title": null,
      "author": null,
      "creation_date": null,
      "modification_date": null,
      "file_size_bytes": 1048576,
      "ocr_used": false
    }
  },
  "processing_time_ms": 1250
}
Error Responses:
- 400 Bad Request: Invalid file or missing file
- 401 Unauthorized: Invalid or missing API key
- 413 Payload Too Large: File exceeds the configured limit (`MAX_FILE_SIZE_MB`, 10 MB by default)
- 429 Too Many Requests: Concurrent request limit exceeded
- 500 Internal Server Error: Processing failed
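
The mapping from failure to status code listed above can be sketched as a small error type. This is only an illustration: the enum, its variant names, and the message strings are hypothetical and are not taken from the service's `src/error/` module.

```rust
/// Hypothetical error type; variant names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExtractError {
    InvalidFile,      // invalid or missing file
    Unauthorized,     // invalid or missing API key
    PayloadTooLarge,  // file exceeds the configured size limit
    TooManyRequests,  // concurrent request limit exceeded
    ProcessingFailed, // extraction or OCR failed
}

impl ExtractError {
    /// HTTP status code documented for each failure.
    pub fn status_code(self) -> u16 {
        match self {
            ExtractError::InvalidFile => 400,
            ExtractError::Unauthorized => 401,
            ExtractError::PayloadTooLarge => 413,
            ExtractError::TooManyRequests => 429,
            ExtractError::ProcessingFailed => 500,
        }
    }

    /// Deliberately generic text, so internal details are not exposed to clients.
    /// The wording is an assumption, not the service's actual messages.
    pub fn public_message(self) -> &'static str {
        match self {
            ExtractError::InvalidFile => "invalid or missing file",
            ExtractError::Unauthorized => "invalid or missing API key",
            ExtractError::PayloadTooLarge => "file exceeds the size limit",
            ExtractError::TooManyRequests => "too many concurrent requests",
            ExtractError::ProcessingFailed => "processing failed",
        }
    }
}
```

Keeping the status code and the client-facing message on one type makes it harder for a handler to return a detailed internal error by accident.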
Endpoint: GET /health
curl http://localhost:8080/health
Response:
{
  "status": "healthy",
  "version": "0.1.0",
  "service": "elicit-pdf-extractor"
}
- Connect your repository to Railway
- Set environment variables in Railway dashboard:
VALID_API_KEYS=your-production-key-1,your-production-key-2
MAX_FILE_SIZE_MB=10
MAX_CONCURRENT_REQUESTS=100
RUST_LOG=info
Railway will automatically:
- Build using the Dockerfile
- Set the `PORT` environment variable
- Handle HTTPS termination
- Provide a public URL
# Check health
curl https://your-app.railway.app/health
# Test extraction
curl -X POST https://your-app.railway.app/api/v1/extract \
-H "Authorization: Bearer your-production-key" \
-F "file=@test.pdf"

| Variable | Default | Description |
|---|---|---|
| `SERVER_HOST` | `0.0.0.0` | Server bind address |
| `SERVER_PORT` | `8080` | Server port (Railway sets `PORT`) |
| `MAX_FILE_SIZE_MB` | `10` | Maximum file size in MB |
| `MAX_CONCURRENT_REQUESTS` | `100` | Global concurrent request limit |
| `VALID_API_KEYS` | - | Comma-separated API keys |
| `REQUEST_TIMEOUT_SECONDS` | `30` | Request timeout |
| `WORKER_THREADS` | `4` | Tokio worker threads |
| `RUST_LOG` | `info` | Log level |
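
As a rough illustration of how these variables and defaults could be read, here is a minimal standard-library sketch. The `Config` struct, `parse_or`, and `from_vars` are hypothetical names, not the contents of `src/config/`. Two behaviours are assumptions: Railway's `PORT` is taken to override `SERVER_PORT`, and an unparsable value silently falls back to the default.

```rust
use std::collections::HashMap;

/// Hypothetical configuration struct mirroring the table above.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub server_host: String,
    pub server_port: u16,
    pub max_file_size_mb: u64,
    pub max_concurrent_requests: usize,
    pub valid_api_keys: Vec<String>,
    pub request_timeout_seconds: u64,
    pub worker_threads: usize,
}

/// Parse `key` from `vars`, using `default` when it is unset or unparsable.
fn parse_or<T: std::str::FromStr>(vars: &HashMap<String, String>, key: &str, default: T) -> T {
    vars.get(key)
        .and_then(|v| v.trim().parse().ok())
        .unwrap_or(default)
}

impl Config {
    /// Build a config from a map of variables, which keeps the logic testable.
    pub fn from_vars(vars: &HashMap<String, String>) -> Config {
        // Assumption: when Railway sets PORT, it takes precedence over SERVER_PORT.
        let port_key = if vars.contains_key("PORT") { "PORT" } else { "SERVER_PORT" };
        // Split the comma-separated key list, dropping blanks and stray whitespace.
        let valid_api_keys: Vec<String> = vars
            .get("VALID_API_KEYS")
            .map(|raw| {
                raw.split(',')
                    .map(str::trim)
                    .filter(|k| !k.is_empty())
                    .map(String::from)
                    .collect()
            })
            .unwrap_or_default();
        Config {
            server_host: vars
                .get("SERVER_HOST")
                .cloned()
                .unwrap_or_else(|| "0.0.0.0".to_string()),
            server_port: parse_or(vars, port_key, 8080),
            max_file_size_mb: parse_or(vars, "MAX_FILE_SIZE_MB", 10),
            max_concurrent_requests: parse_or(vars, "MAX_CONCURRENT_REQUESTS", 100),
            valid_api_keys,
            request_timeout_seconds: parse_or(vars, "REQUEST_TIMEOUT_SECONDS", 30),
            worker_threads: parse_or(vars, "WORKER_THREADS", 4),
        }
    }

    /// Read the process environment and apply the same defaults.
    pub fn from_env() -> Config {
        let vars: HashMap<String, String> = std::env::vars().collect();
        Config::from_vars(&vars)
    }
}
```

A stricter variant could reject an empty `VALID_API_KEYS` at startup, since the table lists no default for it.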
- Throughput: 100+ concurrent requests
- Latency: < 5 seconds for typical PDFs (< 10MB)
- Memory: < 512MB per request
- File Size: Up to 10MB PDFs
- OCR Fallback: Automatic for scanned documents
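
The automatic OCR fallback can be pictured as "try direct extraction, and run OCR when too little text comes back". The sketch below is an assumption about the general shape, not the service's actual logic: the `MIN_CHARS_PER_PAGE` threshold, the `Extraction` struct, and both function names are hypothetical, and the `extract` and `ocr` closures stand in for the real `pdf-extract` and Tesseract calls.

```rust
/// Assumed threshold: fewer visible characters per page than this is
/// treated as a scanned document. The value is illustrative.
const MIN_CHARS_PER_PAGE: usize = 20;

#[derive(Debug, PartialEq)]
pub struct Extraction {
    pub text: String,
    pub ocr_used: bool,
}

/// True when direct extraction produced too little text to be trusted.
pub fn needs_ocr(text: &str, pages: usize) -> bool {
    let visible = text.chars().filter(|c| !c.is_whitespace()).count();
    visible < MIN_CHARS_PER_PAGE * pages.max(1)
}

/// Try direct extraction first, then fall back to OCR.
/// `extract` and `ocr` are placeholders for the real PDF and OCR backends.
pub fn extract_with_fallback<E, O>(
    pdf: &[u8],
    pages: usize,
    extract: E,
    ocr: O,
) -> Result<Extraction, String>
where
    E: Fn(&[u8]) -> Result<String, String>,
    O: Fn(&[u8]) -> Result<String, String>,
{
    match extract(pdf) {
        // Enough embedded text: return it without running OCR.
        Ok(text) if !needs_ocr(&text, pages) => Ok(Extraction { text, ocr_used: false }),
        // Sparse text or a failed extraction: assume a scanned document.
        _ => ocr(pdf).map(|text| Extraction { text, ocr_used: true }),
    }
}
```

The `ocr_used` flag corresponds to the field of the same name in the response metadata, so a caller can tell which path produced the text.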
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   HTTP Client   │───▶│   Axum Server    │───▶│  PDF Processor  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                        │
                                ▼                        ▼
                       ┌──────────────────┐    ┌─────────────────┐
                       │ Auth Middleware  │    │   OCR Service   │
                       └──────────────────┘    └─────────────────┘
                                │                        │
                                ▼                        ▼
                       ┌──────────────────┐    ┌─────────────────┐
                       │  Rate Limiting   │    │    Tesseract    │
                       └──────────────────┘    └─────────────────┘
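
The rate-limiting stage in the diagram caps the number of requests in flight across the whole service. A minimal standard-library sketch of that idea follows. It is not the service's middleware (which may well use a semaphore or a tower layer instead), and `ConcurrencyLimiter` and `Permit` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical global limiter: at most `max` permits exist at once.
#[derive(Clone)]
pub struct ConcurrencyLimiter {
    in_flight: Arc<AtomicUsize>,
    max: usize,
}

/// Held for the duration of one request; releases its slot on drop.
pub struct Permit {
    in_flight: Arc<AtomicUsize>,
}

impl ConcurrencyLimiter {
    pub fn new(max: usize) -> Self {
        ConcurrencyLimiter { in_flight: Arc::new(AtomicUsize::new(0)), max }
    }

    /// Returns a permit, or `None` when the limit is reached
    /// (the point at which a handler would answer 429).
    pub fn try_acquire(&self) -> Option<Permit> {
        let max = self.max;
        self.in_flight
            // Atomically increment only while the count is below the limit.
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < max { Some(n + 1) } else { None }
            })
            .ok()
            .map(|_| Permit { in_flight: Arc::clone(&self.in_flight) })
    }

    /// Number of permits currently held.
    pub fn in_flight(&self) -> usize {
        self.in_flight.load(Ordering::Acquire)
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        // Free the slot even if the request handler returns early or panics.
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}
```

Tying the release to `Drop` means a slot cannot leak when a request fails partway through, which matters for a limit that is shared by every client.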
elicit/
├── src/
│ ├── main.rs # Application entry point
│ ├── lib.rs # Library root
│ ├── config/ # Configuration management
│ ├── handlers/ # HTTP request handlers
│ ├── middleware/ # Auth, rate limiting, logging
│ ├── services/ # PDF processing, OCR
│ ├── models/ # Request/response models
│ └── error/ # Error handling
├── Cargo.toml # Dependencies
├── Dockerfile # Railway deployment
├── .env.example # Environment template
└── README.md # This file
# Unit tests
cargo test
# Integration tests
cargo test --test integration
# With logging
RUST_LOG=debug cargo test
- New endpoints: Add to `src/handlers/`
- Middleware: Add to `src/middleware/`
- Services: Add to `src/services/`
- Models: Add to `src/models/`
- Tesseract not found:

      # Install Tesseract OCR
      brew install tesseract            # macOS
      sudo apt install tesseract-ocr    # Ubuntu

- Rate limit errors:
  - Check the `MAX_CONCURRENT_REQUESTS` setting
  - Monitor concurrent request usage
- File size errors:
  - Verify the `MAX_FILE_SIZE_MB` configuration
  - Check actual file sizes
- OCR failures:
  - Ensure Tesseract is properly installed
  - Check that the PDF contains scannable images
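
When chasing file size or invalid file errors, it can help to reproduce the checks locally. The sketch below shows one plausible form of upload validation. It assumes that 1 MB means 1024 × 1024 bytes and that validation looks for the `%PDF-` header at the start of the file. `UploadError` and `validate_upload` are hypothetical names, and the service's real checks may be stricter.

```rust
/// Hypothetical validation errors for an uploaded file.
#[derive(Debug, PartialEq, Eq)]
pub enum UploadError {
    Empty,
    TooLarge { size_bytes: usize, limit_bytes: usize },
    NotPdf,
}

/// Check size and file signature before any PDF parsing is attempted.
pub fn validate_upload(bytes: &[u8], max_file_size_mb: usize) -> Result<(), UploadError> {
    // Assumption: the limit is expressed in binary megabytes.
    let limit_bytes = max_file_size_mb * 1024 * 1024;
    if bytes.is_empty() {
        return Err(UploadError::Empty);
    }
    if bytes.len() > limit_bytes {
        return Err(UploadError::TooLarge { size_bytes: bytes.len(), limit_bytes });
    }
    // PDF files begin with the "%PDF-" header.
    if !bytes.starts_with(b"%PDF-") {
        return Err(UploadError::NotPdf);
    }
    Ok(())
}
```

In terms of the documented responses, `TooLarge` would correspond to 413 and the other two variants to 400.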
Enable debug logging:
RUST_LOG=debug cargo run
Structured JSON logging:
LOG_FORMAT=json RUST_LOG=info cargo run
- API Key Authentication: All endpoints except `/health` require valid API keys
- Rate Limiting: Global concurrent request limiting prevents abuse
- File Validation: Strict PDF validation and size limits
- Memory Safety: Rust's memory safety prevents common vulnerabilities
- No Sensitive Data: Error messages don't expose sensitive information
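
To make the API key check concrete, here is a minimal sketch of validating an `Authorization: Bearer <key>` header value against the configured keys. The function names are hypothetical and the service's auth middleware may differ. The sketch requires the exact `Bearer ` prefix, and the constant-time comparison is a suggested hardening step rather than something the source describes.

```rust
/// Extract the token from an `Authorization` header value of the form
/// `Bearer <token>`. Returns `None` for other schemes or an empty token.
pub fn bearer_token(header_value: &str) -> Option<&str> {
    let token = header_value.strip_prefix("Bearer ")?.trim();
    if token.is_empty() { None } else { Some(token) }
}

/// Compare two byte strings without stopping at the first mismatch,
/// so timing reveals less about how much of a key was correct.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}

/// True when the header carries a token that matches one of `valid_keys`
/// (for example, the parsed contents of `VALID_API_KEYS`).
pub fn is_authorized(header_value: Option<&str>, valid_keys: &[String]) -> bool {
    match header_value.and_then(|h| bearer_token(h)) {
        Some(token) => valid_keys
            .iter()
            .any(|k| constant_time_eq(k.as_bytes(), token.as_bytes())),
        // A missing or malformed header is treated like a wrong key.
        None => false,
    }
}
```

A missing header and a wrong key both return `false`, which matches the single 401 response documented for either case.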
MIT License - see LICENSE file for details.
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
For issues and questions:
- Create an issue in the repository
- Check the troubleshooting section
- Review the logs for error details