PDF Tablr

Version: 0.1.0

PDF Tablr is a Python 3 module and command-line utility that analyzes the XML output of pdftohtml to extract tables from PDF files and write the data out as CSV.

For example:

pdftohtml -xml -stdout file.pdf | pdftable -f file%d.csv

See also pdftable -h and http://sourceforge.net/projects/pdftable

As per Kyle's code, this version is licensed under GPLv3. See LICENSE file.

Installation

Install pdftohtml, which is provided by the poppler-utils package (Linux) or the poppler package (macOS).

Then install the module:

python setup.py install

or

pip install pdftablr

Command line usage

Extract each table into a separate CSV file:

pdftohtml -xml -stdout file.pdf | pdftable -f file%d.csv

Extract all tabular data into a single CSV file:

pdftohtml -xml -stdout file.pdf | pdftable -f file.csv
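
The same pipeline can also be driven from a Python script. The following is a minimal sketch, not part of pdftablr: the function names pdftohtml_command, pdftable_command and extract_tables are hypothetical, and it assumes both pdftohtml and pdftable are on the PATH. It uses only the flags shown in the commands above.

```python
import subprocess


def pdftohtml_command(pdf_path):
    """Build the pdftohtml invocation shown above as an argument list."""
    return ["pdftohtml", "-xml", "-stdout", pdf_path]


def pdftable_command(output_pattern):
    """Build the pdftable invocation; output_pattern is passed to -f unchanged."""
    return ["pdftable", "-f", output_pattern]


def extract_tables(pdf_path, output_pattern):
    """Pipe pdftohtml's XML into pdftable, mirroring the shell pipeline."""
    producer = subprocess.Popen(pdftohtml_command(pdf_path), stdout=subprocess.PIPE)
    try:
        # pdftable reads the XML from its standard input
        subprocess.run(pdftable_command(output_pattern), stdin=producer.stdout, check=True)
    finally:
        producer.stdout.close()
        returncode = producer.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, producer.args)
```

Calling extract_tables("file.pdf", "file%d.csv") would then correspond to the first command above, and extract_tables("file.pdf", "file.csv") to the second.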

Module usage

from pdftablr.table_extractor import Extractor

# XML file created from pdftohtml
input_path = '/path/to/file.xml'

# Output CSV file
output_path = '/path/to/output.csv'

with open(output_path, 'w') as output_file:
    table_extractor = Extractor(output_file=output_file)

    with open(input_path) as f:
        table_extractor.read_file(f)
    
    tables = table_extractor.extract()
    for table in tables:
        table.output(writer=None)
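
Before passing an XML file to the Extractor, it can help to look at what pdftohtml produced. The sketch below is an illustration only and is not how pdftablr detects tables: it assumes the usual pdftohtml -xml layout of page elements containing text elements with top and left attributes, and naively groups fragments that share the same top value into rows. SAMPLE_XML and rows_by_top are hypothetical names, and the sample data is made up.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written sample in the assumed shape of pdftohtml -xml output
SAMPLE_XML = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0">Name</text>
<text top="100" left="200" width="30" height="12" font="0">Qty</text>
<text top="120" left="50" width="45" height="12" font="0"><b>Apple</b></text>
<text top="120" left="200" width="10" height="12" font="0">3</text>
</page>
</pdf2xml>"""


def rows_by_top(xml_text):
    """Group the text fragments on each page by their 'top' coordinate."""
    root = ET.fromstring(xml_text)
    pages = {}
    for page in root.iter("page"):
        rows = defaultdict(list)
        for text in page.iter("text"):
            # itertext() also collects text nested in <b> or <i> children
            content = "".join(text.itertext()).strip()
            rows[int(text.get("top"))].append((int(text.get("left")), content))
        # Order rows top to bottom and fragments left to right
        pages[int(page.get("number"))] = [
            [content for _, content in sorted(fragments)]
            for _, fragments in sorted(rows.items())
        ]
    return pages


if __name__ == "__main__":
    for number, rows in rows_by_top(SAMPLE_XML).items():
        print(number, rows)
```

Real documents rarely align this cleanly, since fragments in the same visual row often differ by a pixel or two, which is one reason a dedicated extractor is useful.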

TODO

Investigate why Table.columns is sometimes initialised with empty columns
Refactor all the file handling
Execute pdftohtml within the code to allow PDF input
