Support for PDF format #82

fawkesley · 2013-07-30T16:28:33Z

We've been exploring different options for parsing PDFs. Currently we're using an (alpha) in-house library called pdftables (we blogged about it here).

This pull request integrates pdftables into messytables. It is an optional requirement - if pdftables is not installed, messytables will work as usual and the PDF tests will be skipped.
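The optional-dependency behaviour described above could be sketched roughly as follows. This is only an illustration, not the code from this pull request: the `HAVE_PDFTABLES` flag, the `pdf_supported` helper and the `PDFSupportTest` class are hypothetical names, and `unittest.skipUnless` needs Python 2.7 or later, so a Python 2.6 build would need a different skip mechanism.

```python
import unittest

try:
    # Optional dependency: nothing below fails if it is missing.
    import pdftables
    HAVE_PDFTABLES = True
except ImportError:
    pdftables = None
    HAVE_PDFTABLES = False


def pdf_supported():
    """Return True when the optional PDF backend can be imported."""
    return HAVE_PDFTABLES


class PDFSupportTest(unittest.TestCase):
    # Hypothetical test case: skipped, not failed, without pdftables.
    @unittest.skipUnless(HAVE_PDFTABLES, "pdftables is not installed")
    def test_backend_importable(self):
        self.assertIsNotNone(pdftables)
```

With this pattern, users who only need CSV parsing never import the PDF stack, and the test run reports the PDF tests as skipped instead of erroring.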

We're looking into other ways of extracting tables from PDFs, but either way we'll need the messytables integration.

rossjones · 2013-07-30T17:00:11Z

I think you need to add pdftables to the test requirements file, assuming it's on PyPI.

rossjones · 2013-07-31T09:30:38Z

Sorry, you might need to rebase since I merged #81. I'm interested in @domoritz's opinion on this one :)

domoritz · 2013-07-31T09:42:40Z

My opinion is that you should never, ever change the history of something in the main repo (not even on a branch). It's better to create a new PR. However, I'm in favour of rebasing on external or private branches because this keeps the history cleaner.

rossjones · 2013-07-31T09:44:16Z

I meant opinion on the feature, not on rebasing on their private branch ;)

domoritz · 2013-07-31T09:57:24Z

Ahh. IMHO, parsing tables in PDFs is super difficult but would be really awesome. As long as someone who just wants simple CSV parsing does not have to install pdfminer and everything, I am for this feature.

@rossjones We talked about this before: I think we should move the requirements that are only important for certain features to a requirements.text file.

fawkesley · 2013-07-31T10:00:46Z

@domoritz Agreed on it being super difficult. We'll stick to this approach of PDF support being optional.

rossjones · 2013-07-31T10:01:04Z

I agree, as long as it is only the optional requirements rather than the core ones I am all for it.

Also @paulfurley don't forget the changelog ;)

fawkesley · 2013-07-31T10:01:50Z

I'll get pdftables working on Python 2.6 now and I'll give you a shout once I've rebased and updated the changelog :)

fawkesley · 2013-07-31T12:44:36Z

OK, tests passing and rebased, think we're good to go :) @rossjones

frabcus and others added 12 commits July 31, 2013 12:10

Main pdftables support code

09480fa

Tests for PDF support

44b2bd4

Added missing import for PDFTableSet

314922a

Added pdftables as an optional requirement

b61e2fb

Made PDF tests skip if pdftables is not installed

d51b67a

Implemented better names for PDF tables based on page and table index

0a7b87f

Updated to use pdftables' new page / table number interface

4dfcf2b

Removed test for number of rows in simple.pdf, this should be part of the underlying library ideally.

0eea865

Removed a stray TODO

94ac8c9

Added pdftables to requirements-test.txt

4894dc8

Bumped required version of pdftables to version which supports python 2.6

f40c5d8

Updated changelog

ca12d84

rossjones added a commit that referenced this pull request Jul 31, 2013

Merge pull request #82 from scraperwiki/pdftables-support-okfn

64a72a0

Support for PDF format

rossjones merged commit 64a72a0 into okfn:master Jul 31, 2013

fawkesley deleted the pdftables-support-okfn branch July 31, 2013 13:09
