Paddlefish

A Python + C implementation for image-based PDF page layout analysis and content extraction.

(This project is just getting started.)

Features

PDF Processing

📄 PDF page content understanding through image-based visual analysis that segments tables and text boxes
🧪 Layout analysis results verified by unit tests for quality assurance
🚀 High-speed analysis: image processing written in NumPy + scikit-image, achieving 3 pages/sec per 1000 Geekbench points on a single core
🧬 Conversion from PDF files to structured JSON
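
To give a feel for what image-based segmentation followed by JSON output can look like, here is a minimal sketch. It is not Paddlefish's actual algorithm or output schema: the helper names (`find_blocks`, `_box`), the `min_gap` parameter, and the `{"blocks": [{"bbox": ...}]}` layout are all assumptions made for illustration. It uses only NumPy and splits a boolean "ink" mask into blocks wherever enough blank rows separate them.

```python
import json
import numpy as np


def _box(ink, top, bottom):
    """Tight bounding box (top, left, bottom, right), end-exclusive, for rows top..bottom."""
    cols = np.flatnonzero(ink[top:bottom].any(axis=0))
    return (int(top), int(cols[0]), int(bottom), int(cols[-1]) + 1)


def find_blocks(ink, min_gap=3):
    """Split a boolean ink mask into blocks separated by >= min_gap blank rows.

    Hypothetical helper: a simple row-projection split, not the project's method.
    """
    row_has_ink = ink.any(axis=1)
    blocks = []
    start = None     # first row of the block being built
    last_ink = None  # most recent row that contained ink
    for y, has_ink in enumerate(row_has_ink):
        if not has_ink:
            continue
        if start is None:
            start = y
        elif y - last_ink - 1 >= min_gap:
            # The blank gap is wide enough: close the block and open a new one.
            blocks.append(_box(ink, start, last_ink + 1))
            start = y
        last_ink = y
    if start is not None:
        blocks.append(_box(ink, start, last_ink + 1))
    return blocks


# Synthetic 20x30 "page" with two separate ink regions.
page = np.zeros((20, 30), dtype=bool)
page[2:5, 4:20] = True
page[10:16, 6:28] = True

blocks = find_blocks(page)
# Assumed JSON layout, for illustration only.
result = json.dumps({"blocks": [{"bbox": list(b)} for b in blocks]})
print(result)
```

A real page would first need to be rendered and binarized, and tables would need column-wise analysis as well; the sketch only shows the row-splitting idea on a synthetic mask.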
