Elicit - PDF Text Extractor Service

A high-performance HTTP service, written in Rust, that extracts text from PDF documents and falls back to OCR for scanned ones. It is built for fast, reliable PDF processing and ships with Railway deployment support.

Features

Fast PDF Text Extraction: Uses pdf-extract crate for efficient text extraction
OCR Support: Tesseract OCR fallback for scanned PDFs
High Performance: Handles 100+ concurrent requests
API Key Authentication: Secure Bearer token authentication
Rate Limiting: Global concurrent request limiting
Railway Ready: Optimized for Railway deployment
Comprehensive Logging: Structured logging with tracing
Health Monitoring: Built-in health check endpoint

Quick Start

Prerequisites

Rust 1.75+
Tesseract OCR (for scanned PDF support)

Install Tesseract (macOS)

brew install tesseract

Install Tesseract (Ubuntu/Debian)

sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev

Local Development

Clone and setup:

git clone <repository-url>
cd elicit
cp .env.example .env

Configure environment variables:

# Edit .env file
VALID_API_KEYS=your-secret-key-1,your-secret-key-2
MAX_FILE_SIZE_MB=10
MAX_CONCURRENT_REQUESTS=100
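
VALID_API_KEYS holds several keys in one comma-separated value. As a minimal sketch of how such a value could be split into individual keys (the function name `parse_api_keys` is hypothetical, and trimming whitespace and skipping empty entries are assumptions, not confirmed behaviour of this service):

```rust
use std::collections::HashSet;

/// Split a comma-separated `VALID_API_KEYS` value into a set of keys.
/// Assumption: surrounding whitespace is ignored and empty entries are skipped.
fn parse_api_keys(raw: &str) -> HashSet<String> {
    raw.split(',')
        .map(str::trim)
        .filter(|key| !key.is_empty())
        .map(String::from)
        .collect()
}

fn main() {
    // Same shape as the .env example, with a stray space and trailing comma.
    let keys = parse_api_keys("your-secret-key-1, your-secret-key-2,");
    println!("loaded {} API keys", keys.len());
}
```

Storing the keys in a set keeps the per-request lookup cheap and removes accidental duplicates.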

Run the service:

cargo run

The service listens on http://localhost:8080 by default.

Docker Development

# Build the image
docker build -t elicit .

# Run with environment variables
docker run -p 8080:8080 \
  -e VALID_API_KEYS="your-key-1,your-key-2" \
  -e MAX_FILE_SIZE_MB=10 \
  -e MAX_CONCURRENT_REQUESTS=100 \
  elicit

API Usage

Extract Text from PDF

Endpoint: POST /api/v1/extract

Headers:

Authorization: Bearer your-api-key
Content-Type: multipart/form-data

Request Body: Multipart form with a `file` field containing the PDF

Example with curl:

curl -X POST http://localhost:8080/api/v1/extract \
  -H "Authorization: Bearer your-api-key" \
  -F "file=@document.pdf"
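
The Authorization header above carries the API key as a Bearer token. The following is a rough sketch of how a server could pull the token out of that header and compare it with the configured keys. The helper names `bearer_token` and `is_authorized` are hypothetical, and the service's actual middleware may work differently.

```rust
/// Extract the token from an `Authorization` header value such as
/// `Bearer your-api-key`. Returns `None` when the scheme is not `Bearer`
/// or the token part is empty.
fn bearer_token(header_value: &str) -> Option<&str> {
    let (scheme, token) = header_value.trim().split_once(' ')?;
    if !scheme.eq_ignore_ascii_case("bearer") {
        return None;
    }
    let token = token.trim();
    if token.is_empty() {
        None
    } else {
        Some(token)
    }
}

/// Check an optional header value against the configured keys.
/// A missing header, a wrong scheme, and an unknown key are all rejected.
fn is_authorized(header_value: Option<&str>, valid_keys: &[&str]) -> bool {
    header_value
        .and_then(bearer_token)
        .map_or(false, |token| valid_keys.contains(&token))
}

fn main() {
    let valid_keys = ["your-secret-key-1", "your-secret-key-2"];
    let header = Some("Bearer your-secret-key-1");
    println!("authorized: {}", is_authorized(header, &valid_keys));
}
```

A request that fails this check would correspond to the 401 Unauthorized response.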

Success Response (200):

{
  "success": true,
  "data": {
    "text": "Extracted text content from the PDF...",
    "pages": 5,
    "metadata": {
      "title": null,
      "author": null,
      "creation_date": null,
      "modification_date": null,
      "file_size_bytes": 1048576,
      "ocr_used": false
    }
  },
  "processing_time_ms": 1250
}

Error Responses:

400 Bad Request: Invalid file or missing file
401 Unauthorized: Invalid or missing API key
413 Payload Too Large: File exceeds the MAX_FILE_SIZE_MB limit (10MB by default)
429 Too Many Requests: Concurrent request limit exceeded
500 Internal Server Error: Processing failed
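
One way to keep this mapping in a single place is an error enum with a method that returns the status code. This is only an illustration: the type `ExtractError` and its variant names are hypothetical and are not taken from the service's source.

```rust
/// Hypothetical error type mirroring the documented error responses.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ExtractError {
    InvalidFile,      // invalid or missing file
    Unauthorized,     // invalid or missing API key
    FileTooLarge,     // file exceeds the size limit
    TooManyRequests,  // concurrent request limit exceeded
    ProcessingFailed, // extraction or OCR failed
}

impl ExtractError {
    /// HTTP status code for each error, as listed in the README.
    fn status_code(self) -> u16 {
        match self {
            ExtractError::InvalidFile => 400,
            ExtractError::Unauthorized => 401,
            ExtractError::FileTooLarge => 413,
            ExtractError::TooManyRequests => 429,
            ExtractError::ProcessingFailed => 500,
        }
    }
}

fn main() {
    let errors = [
        ExtractError::InvalidFile,
        ExtractError::Unauthorized,
        ExtractError::FileTooLarge,
        ExtractError::TooManyRequests,
        ExtractError::ProcessingFailed,
    ];
    for error in errors {
        println!("{:?} -> {}", error, error.status_code());
    }
}
```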

Health Check

Endpoint: GET /health

curl http://localhost:8080/health

Response:

{
  "status": "healthy",
  "version": "0.1.0",
  "service": "elicit-pdf-extractor"
}

Railway Deployment

1. Prepare for Railway

Connect your repository to Railway

Set environment variables in Railway dashboard:

VALID_API_KEYS=your-production-key-1,your-production-key-2
MAX_FILE_SIZE_MB=10
MAX_CONCURRENT_REQUESTS=100
RUST_LOG=info

2. Deploy

Railway will automatically:

Build using the Dockerfile
Set the PORT environment variable
Handle HTTPS termination
Provide a public URL

3. Verify Deployment

# Check health
curl https://your-app.railway.app/health

# Test extraction
curl -X POST https://your-app.railway.app/api/v1/extract \
  -H "Authorization: Bearer your-production-key" \
  -F "file=@test.pdf"

Configuration

Environment Variables

Variable	Default	Description
`SERVER_HOST`	`0.0.0.0`	Server bind address
`SERVER_PORT`	`8080`	Server port (Railway sets `PORT`)
`MAX_FILE_SIZE_MB`	`10`	Maximum file size in MB
`MAX_CONCURRENT_REQUESTS`	`100`	Global concurrent request limit
`VALID_API_KEYS`	-	Comma-separated API keys
`REQUEST_TIMEOUT_SECONDS`	`30`	Request timeout
`WORKER_THREADS`	`4`	Tokio worker threads
`RUST_LOG`	`info`	Log level
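
As an illustration of how these variables and defaults might be read at startup, here is a std-only sketch. The `Config` struct, its field names, and the rule that Railway's `PORT` takes precedence over `SERVER_PORT` are assumptions; `VALID_API_KEYS` and `RUST_LOG` are left out for brevity.

```rust
use std::env;
use std::str::FromStr;

/// Hypothetical settings struct; defaults follow the table above.
#[derive(Debug)]
struct Config {
    server_host: String,
    server_port: u16,
    max_file_size_mb: u64,
    max_concurrent_requests: usize,
    request_timeout_seconds: u64,
    worker_threads: usize,
}

/// Parse an optional raw value, falling back to `default` when the
/// variable is unset or cannot be parsed.
fn parse_or<T: FromStr>(raw: Option<String>, default: T) -> T {
    raw.and_then(|value| value.trim().parse().ok())
        .unwrap_or(default)
}

impl Config {
    /// Build the config from any lookup function, so it can be exercised
    /// without touching the real process environment.
    fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Self {
        Config {
            server_host: lookup("SERVER_HOST").unwrap_or_else(|| "0.0.0.0".to_string()),
            // Assumption: PORT (set by Railway) wins over SERVER_PORT when both exist.
            server_port: parse_or(lookup("PORT").or_else(|| lookup("SERVER_PORT")), 8080),
            max_file_size_mb: parse_or(lookup("MAX_FILE_SIZE_MB"), 10),
            max_concurrent_requests: parse_or(lookup("MAX_CONCURRENT_REQUESTS"), 100),
            request_timeout_seconds: parse_or(lookup("REQUEST_TIMEOUT_SECONDS"), 30),
            worker_threads: parse_or(lookup("WORKER_THREADS"), 4),
        }
    }

    /// Read the config from the process environment.
    fn from_env() -> Self {
        Self::from_lookup(|name| env::var(name).ok())
    }
}

fn main() {
    let config = Config::from_env();
    println!("{config:?}");
}
```

Falling back to the default on an unparsable value keeps the sketch short; a production service might prefer to fail fast with a clear error instead.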

Performance

Throughput: 100+ concurrent requests
Latency: < 5 seconds for typical PDFs (< 10MB)
Memory: < 512MB per request
File Size: PDFs up to 10MB by default (configurable via MAX_FILE_SIZE_MB)
OCR Fallback: Automatic for scanned documents
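
The global concurrent request limit can be pictured as a counter that hands out permits up to MAX_CONCURRENT_REQUESTS and rejects anything beyond that. The sketch below uses only the standard library; the names `ConcurrencyLimiter` and `Permit` are hypothetical, and the real service may rely on a different mechanism (in an async server, a semaphore such as `tokio::sync::Semaphore` would be a natural choice).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical global limiter: at most `max` permits exist at a time.
struct ConcurrencyLimiter {
    active: AtomicUsize,
    max: usize,
}

/// RAII permit: dropping it frees one slot.
struct Permit {
    limiter: Arc<ConcurrencyLimiter>,
}

impl ConcurrencyLimiter {
    fn new(max: usize) -> Arc<Self> {
        Arc::new(Self { active: AtomicUsize::new(0), max })
    }

    /// Try to reserve a slot without waiting. `None` means the limit is
    /// reached, which the HTTP layer could translate into a 429 response.
    fn try_acquire(self: &Arc<Self>) -> Option<Permit> {
        let max = self.max;
        self.active
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < max { Some(n + 1) } else { None }
            })
            .ok()
            .map(|_| Permit { limiter: Arc::clone(self) })
    }

    /// Number of permits currently held.
    fn active(&self) -> usize {
        self.active.load(Ordering::Acquire)
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.limiter.active.fetch_sub(1, Ordering::AcqRel);
    }
}

fn main() {
    let limiter = ConcurrencyLimiter::new(2);
    let first = limiter.try_acquire();
    let second = limiter.try_acquire();
    let third = limiter.try_acquire(); // over the limit
    println!(
        "granted: {} {} {}",
        first.is_some(),
        second.is_some(),
        third.is_some()
    );
    drop(first); // finishing a request releases its slot
    println!("in flight after release: {}", limiter.active());
}
```

Tying the release to `Drop` means a slot is returned even when a request handler exits early with an error.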

Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   HTTP Client   │───▶│   Axum Server    │───▶│  PDF Processor  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                        │
                                ▼                        ▼
                       ┌──────────────────┐    ┌─────────────────┐
                       │  Auth Middleware │    │   OCR Service   │
                       └──────────────────┘    └─────────────────┘
                                │                        │
                                ▼                        ▼
                       ┌──────────────────┐    ┌─────────────────┐
                       │ Rate Limiting    │    │   Tesseract     │
                       └──────────────────┘    └─────────────────┘
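
The right-hand path in the diagram (PDF Processor, then OCR Service, then Tesseract) implies a decision about when direct extraction is not enough. The README does not say how that decision is made, so the following is only one plausible heuristic: treat a document as scanned when direct extraction yields very little visible text per page. The threshold `MIN_CHARS_PER_PAGE` and both function names are assumptions, and the closures stand in for the real pdf-extract and Tesseract calls.

```rust
/// Assumed threshold: fewer visible characters per page than this
/// suggests the PDF has no usable text layer.
const MIN_CHARS_PER_PAGE: usize = 20;

/// Decide whether directly extracted text looks too sparse to trust.
fn needs_ocr(extracted_text: &str, pages: usize) -> bool {
    let visible = extracted_text
        .chars()
        .filter(|c| !c.is_whitespace())
        .count();
    visible < MIN_CHARS_PER_PAGE * pages.max(1)
}

/// Run direct extraction first and only invoke OCR when needed.
/// Returns the text and a flag matching the `ocr_used` metadata field.
fn extract_with_fallback(
    direct: impl FnOnce() -> String,
    ocr: impl FnOnce() -> String,
    pages: usize,
) -> (String, bool) {
    let text = direct();
    if needs_ocr(&text, pages) {
        (ocr(), true)
    } else {
        (text, false)
    }
}

fn main() {
    let (text, ocr_used) = extract_with_fallback(
        || String::new(), // stand-in for a scanned PDF with no text layer
        || "text recovered by OCR".to_string(),
        1,
    );
    println!("ocr_used={ocr_used} text={text:?}");
}
```

Because OCR is passed as a closure, the expensive Tesseract step is never started for PDFs that already contain a text layer.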

Development

Project Structure

elicit/
├── src/
│   ├── main.rs              # Application entry point
│   ├── lib.rs               # Library root
│   ├── config/              # Configuration management
│   ├── handlers/            # HTTP request handlers
│   ├── middleware/          # Auth, rate limiting, logging
│   ├── services/            # PDF processing, OCR
│   ├── models/              # Request/response models
│   └── error/               # Error handling
├── Cargo.toml               # Dependencies
├── Dockerfile               # Railway deployment
├── .env.example             # Environment template
└── README.md                # This file

Running Tests

# Unit tests
cargo test

# Integration tests
cargo test --test integration

# With logging
RUST_LOG=debug cargo test

Adding Features

New endpoints: Add to src/handlers/
Middleware: Add to src/middleware/
Services: Add to src/services/
Models: Add to src/models/

Troubleshooting

Common Issues

Tesseract not found:

# Install Tesseract OCR
brew install tesseract  # macOS
sudo apt install tesseract-ocr  # Ubuntu

Rate limit errors:

- Check the MAX_CONCURRENT_REQUESTS setting
- Monitor concurrent request usage

File size errors:

- Verify the MAX_FILE_SIZE_MB configuration
- Check actual file sizes

OCR failures:

- Ensure Tesseract is properly installed
- Check that the PDF contains scannable images

Logging

Enable debug logging:

RUST_LOG=debug cargo run

Structured JSON logging:

LOG_FORMAT=json RUST_LOG=info cargo run

Security

API Key Authentication: All endpoints except /health require valid API keys
Rate Limiting: Global concurrent request limiting prevents abuse
File Validation: Strict PDF validation and size limits
Memory Safety: Rust's ownership and borrow checking rule out common memory-corruption bugs (buffer overflows, use-after-free) in safe code
No Sensitive Data: Error messages don't expose sensitive information
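
To make the file validation point concrete, here is a minimal sketch of an upload check that enforces the size limit and looks for the `%PDF-` marker that PDF files begin with. The function `validate_upload` and the `ValidationError` type are hypothetical, the sketch assumes 1 MB means 1024 * 1024 bytes, and the service's real validation is likely stricter than a header check.

```rust
/// Hypothetical validation outcome for an uploaded file.
#[derive(Debug, PartialEq, Eq)]
enum ValidationError {
    Empty,
    TooLarge { size_bytes: usize, limit_bytes: usize },
    NotPdf,
}

/// Check size and the PDF header before any parsing work begins.
/// Assumption: `max_file_size_mb` is measured in units of 1024 * 1024 bytes.
fn validate_upload(bytes: &[u8], max_file_size_mb: usize) -> Result<(), ValidationError> {
    let limit_bytes = max_file_size_mb * 1024 * 1024;
    if bytes.is_empty() {
        return Err(ValidationError::Empty);
    }
    if bytes.len() > limit_bytes {
        return Err(ValidationError::TooLarge {
            size_bytes: bytes.len(),
            limit_bytes,
        });
    }
    // PDF files start with a header of the form "%PDF-1.x".
    if !bytes.starts_with(b"%PDF-") {
        return Err(ValidationError::NotPdf);
    }
    Ok(())
}

fn main() {
    let upload = b"%PDF-1.7\n% rest of the file...";
    println!("{:?}", validate_upload(upload, 10));
    println!("{:?}", validate_upload(b"GIF89a", 10));
}
```

Rejecting oversized or non-PDF uploads this early avoids spending extraction or OCR time on input that can never succeed.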

License

MIT License - see LICENSE file for details.

Contributing

Fork the repository
Create a feature branch
Add tests for new functionality
Ensure all tests pass
Submit a pull request

Support

For issues and questions:

Create an issue in the repository
Check the troubleshooting section
Review the logs for error details
