pdf2table

pdf2table is a Python library designed to extract tabular data from PDF files and images efficiently and accurately. It uses an enhanced version of the img2table library's algorithm for table detection and the TATR model from Microsoft's Table Transformer for table structure recognition and content extraction.

Features

High Detection Precision: Compared with Table Transformer's DETR model, the rule-based algorithm is less likely to misidentify text blocks as table regions.
Preserves Structural Information: Uses the Table Transformer (TATR) model for table structure recognition to preserve the structural information of tables.
Flexible Input: Supports both PDF files and image formats for table extraction. (More file formats will be available later.)
Easy to Use: A simple API allows for straightforward integration into Python projects.

Installation

Install pdf2table using pip:

pip install pdf2table

Usage

Here's a quick example of how to use pdf2table to extract tables from a PDF file:

from pdf2table import Driver

# Initialize the driver
driver = Driver()

# Extract tables from a PDF
# which returns a list of dataframes
tables = driver.extract_tables("sample.pdf")
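
Since the example above returns a list of dataframes, a common next step is to write each one to disk. The following is a minimal sketch, not part of pdf2table: `save_tables` and the `table_<n>.csv` naming are hypothetical, and it assumes each returned item is a pandas DataFrame (it relies only on the `to_csv` method).

```python
from pathlib import Path


def save_tables(tables, out_dir="tables"):
    """Write each extracted table to its own CSV file and return the paths.

    Hypothetical helper: `tables` is assumed to be the list of pandas
    DataFrames returned by driver.extract_tables(...).
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    written = []
    for index, table in enumerate(tables, start=1):
        csv_path = out_path / f"table_{index}.csv"
        # DataFrame.to_csv; index=False omits the row-index column
        table.to_csv(csv_path, index=False)
        written.append(csv_path)
    return written
```

Calling `save_tables(tables, "output")` on the `tables` list from the example would then produce one CSV file per extracted table in the `output` directory.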

The Driver object encapsulates detection and extraction for both PDF and Image objects. If you only need table detection, refer to the following example:

from pdf2table.document import Image, PDF

# Initialize an Image object
img = Image("sample.jpg")

# Extract all tables from the image
# which returns a list of Table objects
img_tables = img.extract_tables()

# Initialize a PDF object
pdf = PDF("sample.pdf")

# Extract all tables from the PDF
pdf_tables = pdf.extract_tables()

Refer to the tutorial for more details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

Thanks to the creators of the img2table library and Microsoft's Table Transformer model for providing the robust foundations for this tool.
