heshiming/paddlefish

 
 

Paddlefish

A Python + C implementation for image-based PDF page layout analysis and content extraction.

(This project is just getting started.)

Features

PDF Processing

  • 📄 PDF page content understanding through an image-based, visual method that segments tables and text boxes
  • 🧪 Layout analysis results verified by unit tests for quality assurance
  • 🚀 High-speed analysis: image processing written in NumPy + scikit-image, achieving 3 pages/sec per 1000 points of Geekbench score on a single core
  • 🧬 Conversion from PDF files to structured JSON
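
To give a feel for what "image-based segmentation to structured JSON" can look like, here is a minimal sketch. It is not Paddlefish's algorithm or API: it assumes a page already rasterized to a tiny binary grid (1 = ink, 0 = blank) and uses a plain row projection split in pure Python instead of NumPy + scikit-image. The names `find_bands`, `band_to_box`, `page_to_json`, and the JSON keys are all hypothetical.

```python
import json

# Hypothetical 6x6 binary raster of a page: 1 = ink, 0 = blank.
PAGE = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]


def find_bands(page):
    """Return (top, bottom) row ranges containing ink, split at blank rows."""
    bands, start = [], None
    for y, row in enumerate(page):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new band begins
        elif not has_ink and start is not None:
            bands.append((start, y - 1))   # a blank row closes the band
            start = None
    if start is not None:                  # band runs to the last row
        bands.append((start, len(page) - 1))
    return bands


def band_to_box(page, top, bottom):
    """Compute the tight bounding box of the ink inside one band."""
    cols = [x for y in range(top, bottom + 1)
            for x, value in enumerate(page[y]) if value]
    return {"kind": "region", "x0": min(cols), "y0": top,
            "x1": max(cols), "y1": bottom}


def page_to_json(page):
    """Serialize the detected regions as a JSON string (assumed schema)."""
    boxes = [band_to_box(page, top, bottom) for top, bottom in find_bands(page)]
    return json.dumps({"width": len(page[0]), "height": len(page),
                       "boxes": boxes})


if __name__ == "__main__":
    print(page_to_json(PAGE))
```

A real pipeline would need to tell tables from text boxes and handle side-by-side columns, which this row-only split does not attempt.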

Languages

  • C++ 67.0%
  • Python 32.1%
  • Other 0.9%