Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for PDF 1.3 logical structure #963

Merged
merged 5 commits into from
Nov 9, 2023

Conversation

dhdaines
Copy link
Contributor

@dhdaines dhdaines commented Aug 10, 2023

As promised, here is the other PR supporting the structure tree using pdfminer.six - so no overhead and no typing weirdness. In the end the implementation is rather nice and simple. somewhat complex once we take into account the multiplicity of optional features in the structure tree specification.

There is one caveat, which is mentioned in the docstring: whereas other PDF engines will include empty structure elements in the structure tree, this implementation does not, for kind of the same reason that #961 doesn't do anything for marked structure points. Since pdfplumber is based around extracting objects from the PDF, it isn't very useful to have structure that can't be associated to any objects, at least in my opinion.

Also, in the case where there are unparsed pages in a PDF, it isn't quite clear what to do about structure elements with no explicit page ID, unless we assume that elements with no marked content are always excluded.

But, if you like, we can (optionally?) add these structure elements, it isn't too hard to do.

@dhdaines dhdaines changed the base branch from stable to develop August 10, 2023 04:32
@codecov
Copy link

codecov bot commented Aug 10, 2023

Codecov Report

Merging #963 (036044d) into develop (336f83f) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           develop      #963    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           18        19     +1     
  Lines         1613      1897   +284     
==========================================
+ Hits          1613      1897   +284     
Files Changed Coverage Δ
pdfplumber/cli.py 100.00% <100.00%> (ø)
pdfplumber/page.py 100.00% <100.00%> (ø)
pdfplumber/pdf.py 100.00% <100.00%> (ø)
pdfplumber/structure.py 100.00% <100.00%> (ø)

@dhdaines
Copy link
Contributor Author

This should be complete now, I'll let you review it at your leisure! It works for me (tm)

@dhdaines
Copy link
Contributor Author

dhdaines commented Sep 6, 2023

Hi! If you get a chance could you review this soon? The test suite is now pretty extensive since I learned how to create "synthetic" PDFs with a text editor, and I've removed all but one of the pragma: nocover comments (the remaining one is a "shouldn't happen" case).

I think it is really a good implementation of PDF logical structure though obviously there will be weird PDFs out there that do undefined behaviour!

@jsvine jsvine merged commit 35ed9e0 into jsvine:develop Nov 9, 2023
7 checks passed
@jsvine
Copy link
Owner

jsvine commented Nov 9, 2023

Thanks for this, @dhdaines! My apologies for not getting to it sooner; it took me a little while to wrap my head around it. Now merged. One quick follow-up: Want to note the method in the README.md, summarized however you best see fit (or just linking to your docs/structure.md file?

@jsvine jsvine mentioned this pull request Nov 9, 2023
@dhdaines
Copy link
Contributor Author

dhdaines commented Nov 9, 2023

Thanks for this, @dhdaines! My apologies for not getting to it sooner; it took me a little while to wrap my head around it. Now merged. One quick follow-up: Want to note the method in the README.md, summarized however you best see fit (or just linking to your docs/structure.md file?

Ah! You're right, it ought to be in README.md, I thought that I had put it there. I can submit another PR for this.

@dhdaines
Copy link
Contributor Author

dhdaines commented Nov 9, 2023

Thanks for merging as well! No problem about the delay, it is a large and complex feature. There is one quirk to the implementation that might require a follow-on: structure elements are allowed to span multiple pages, which is complicated to handle properly because PDF is otherwise extremely page-oriented (marked content sections notably can't do this). This means that objects that are in the structure tree might not appear to be in some situations. I will file this as an issue once I find a good test case for it.

@jsvine
Copy link
Owner

jsvine commented Nov 10, 2023

Ah, interesting. I think I understand in theory, but not quite sure in practice — so looking forward to that test case. Thanks!

@dhdaines dhdaines deleted the structure_tree branch February 5, 2024 12:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants