Add support for structure tree and marked content sections #937
Closed
28 commits
6dc9be5 feat: preliminary structure tree extractor using pypdfium2 (dhdaines)
2098ca3 test: basic test of structure tree (dhdaines)
3c0e9c2 chore: add contributor (dhdaines)
8de9fc2 refactor: fix imports and isorts and types, oh my (dhdaines)
da46833 docs: initial documentation for structure tree (dhdaines)
f9d01a2 feat: extract MCID and add it to `chars` (dhdaines)
3018536 fix: types (dhdaines)
297ddac test: test mcid on extract_words (dhdaines)
8e74285 test: disable coverage for untestable attributes (dhdaines)
9de3200 fix: actually get the right attributes (doh) (dhdaines)
4741d36 fix: always add mcid so it works in extra_attrs (dhdaines)
084cf40 test: add test data (dhdaines)
06025a7 fix: fix types, again (dhdaines)
b58947b fix: tagstack not needed (dhdaines)
ac158cb fix: skip empty (not on this page) children (dhdaines)
13b0e92 fix: fix fix to fix (dhdaines)
8bd484a fix: remove unnecessary pdf (dhdaines)
dbccaab docs: note about lang and image tags (dhdaines)
e595a71 test: test and document alt_text and mcid on images (dhdaines)
47010a2 docs: minimally document structure and mcid here (dhdaines)
20dc157 fix: give default value to cur_mcid (dhdaines)
827726c test: add test of structured PDF from Word 365 (dhdaines)
a5de5f8 fix: pragma nocover no longer needed (thanks, word365) (dhdaines)
cc7a378 test: fix CSV tests to include/exclude mcid field (dhdaines)
0077bf0 test: sample pdf with weird tables and stuff (dhdaines)
2fd9f80 fix: ctypes/mypy/py3.8 errors (dhdaines)
b774531 fix: really fix py38 (hardcoded venv/ in makefile! argh!) (dhdaines)
42ec17e fix: put mcids on lines and curves in figure (dhdaines)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
# Structure Tree | ||
|
||
Since PDF 1.3, a PDF can contain logical structure, stored in a | ||
*structure tree*. In conjunction with PDF 1.2 [marked | ||
content sections](#marked-content-sections) this forms the basis of | ||
Tagged PDF and other accessibility features. | ||
|
||
Unfortunately, since all of these standards are optional and variably | ||
implemented in PDF authoring tools, and are frequently not enabled by | ||
default, it is not possible to rely on them to extract the structure | ||
of a PDF and associated content. Nonetheless they can be useful as | ||
features for a heuristic or machine-learning based system, or for | ||
extracting particular structures such as tables. | ||
|
||
Since `pdfplumber`'s API is page-based, the structure is available for | ||
a particular page, using the `structure_tree` attribute: | ||
|
||
    with pdfplumber.open(pdffile) as pdf: | ||
        for element in pdf.pages[0].structure_tree: | ||
            print(element["type"], element.get("mcids", [])) | ||
            for child in element.get("children", []): | ||
                print(child["type"], child.get("mcids", [])) | ||
|
||
The `type` field contains the type of the structure element - the | ||
standard structure types can be seen in section 10.7.3 of [the PDF 1.7 | ||
reference | ||
document](https://ghostscript.com/~robin/pdf_reference17.pdf#page=898), | ||
but usually they are rather HTML-like, if created by a recent PDF | ||
authoring tool (notably, older tools may simply produce `P` for | ||
everything). | ||
|
||
The `mcids` field contains the list of marked content section IDs | ||
corresponding to this element. You can use this to match the element | ||
to words or characters using the API described below. | ||
|
||
The `lang` field is often present as well, and contains a language | ||
code for the text content, e.g. `"EN-US"` or `"FR-CA"`. | ||
|
||
The `alt_text` field will be present if the author has helpfully added | ||
alternate text to an image. In theory, `title` and `actual_text` may | ||
also be present, but not all tools seem to support these. | ||
|
||
The `id` field is of unknown origin and use. Please find a PDF that | ||
contains it so we can test it. | ||
|
||
Likewise, attributes for structure elements (which, confusingly, come | ||
as a *list* of dictionaries) are not supported because I haven't found | ||
a PDF using them to test with yet. | ||
|
||
# Marked Content Sections | ||
|
||
The structure of a PDF obviously isn't all that useful unless you can, | ||
minimally, attach some text to the elements. This is where marked | ||
content sections come in. | ||
|
||
`pdfplumber` adds an optional field called `mcid` to the items in the | ||
`objects` and `chars` properties of a page, which tells you which | ||
marked content section a given character or other object belongs to. | ||
|
||
You can propagate `mcid` to the words returned by `extract_words` by | ||
adding it to the `extra_attrs` argument, e.g.: | ||
|
||
words = pdf.pages[0].extract_words(extra_attrs=["mcid"]) |
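As a rough illustration of how the two pieces described above fit together, the sketch below walks a structure tree and collects the words belonging to each element by matching `mcid` values. It assumes the tree is a list of dictionaries with optional `mcids` and `children` keys (the shape produced by `to_dict` in this PR) and that each word carries an `mcid` key as returned by `extract_words(extra_attrs=["mcid"])`. The helper names `iter_elements` and `text_for_elements` and the sample data are hypothetical, written by hand so the example runs without a PDF.

```python
from collections import defaultdict
from typing import Any, Dict, Iterator, List, Tuple

Element = Dict[str, Any]


def iter_elements(
    elements: List[Element], depth: int = 0
) -> Iterator[Tuple[int, Element]]:
    """Yield (depth, element) for every element in the tree, depth first."""
    for element in elements:
        yield depth, element
        yield from iter_elements(element.get("children", []), depth + 1)


def text_for_elements(
    tree: List[Element], words: List[Dict[str, Any]]
) -> List[Tuple[int, str, str]]:
    """Return (depth, type, text) for each element, joining words by MCID."""
    by_mcid: Dict[int, List[str]] = defaultdict(list)
    for word in words:
        mcid = word.get("mcid")
        # MCID 0 is valid, so compare against None rather than truthiness
        if mcid is not None:
            by_mcid[mcid].append(word["text"])
    result = []
    for depth, element in iter_elements(tree):
        texts = [
            text
            for mcid in element.get("mcids", [])
            for text in by_mcid.get(mcid, [])
        ]
        result.append((depth, element.get("type", ""), " ".join(texts)))
    return result


# Hand-written sample data (an assumption, not taken from a real PDF)
sample_tree = [
    {
        "type": "Document",
        "children": [
            {"type": "H1", "mcids": [0]},
            {"type": "P", "mcids": [1, 2]},
        ],
    }
]
sample_words = [
    {"text": "Title", "mcid": 0},
    {"text": "Hello", "mcid": 1},
    {"text": "world", "mcid": 2},
    {"text": "footer", "mcid": None},  # untagged content is ignored
]

for depth, type_, text in text_for_elements(sample_tree, sample_words):
    print("  " * depth + type_, repr(text))
```

With a real file, `sample_tree` would be replaced by `page.structure_tree` and `sample_words` by `page.extract_words(extra_attrs=["mcid"])`, assuming the page is tagged at all.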
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,157 @@ | ||
import ctypes | ||
from io import BufferedReader, BytesIO | ||
from typing import TYPE_CHECKING, Callable, Iterator, Optional, Union | ||
|
||
import pypdfium2 # type: ignore | ||
import pypdfium2.raw as pdfium_c # type: ignore | ||
|
||
from ._typing import T_obj | ||
|
||
if TYPE_CHECKING:  # pragma: nocover | ||
    fpdf_structelement_t = ctypes._Pointer[pdfium_c.fpdf_structelement_t__] | ||
    fpdf_structtree_t = ctypes._Pointer[pdfium_c.fpdf_structtree_t__] | ||
    c_char_array = ctypes.Array[ctypes.c_char] | ||
else: | ||
    fpdf_structelement_t = ctypes._Pointer | ||
    fpdf_structtree_t = ctypes._Pointer | ||
    c_char_array = ctypes.Array | ||
|
||
|
||
class PdfStructElement: | ||
    def __init__(self, raw: fpdf_structelement_t): | ||
        self.raw = raw | ||
|
||
    @property | ||
    def children(self) -> Iterator["PdfStructElement"]: | ||
        n_children = pdfium_c.FPDF_StructElement_CountChildren(self.raw) | ||
        for idx in range(n_children): | ||
            child = PdfStructElement( | ||
                pdfium_c.FPDF_StructElement_GetChildAtIndex(self.raw, idx) | ||
            ) | ||
            # Skip empty children (e.g. those not on this page) | ||
            if child.type: | ||
                yield child | ||
|
||
    def string_accessor( | ||
        self, | ||
        pdffunc: Callable[ | ||
            [ | ||
                fpdf_structelement_t, | ||
                Optional[c_char_array], | ||
                int, | ||
            ], | ||
            int, | ||
        ], | ||
    ) -> str: | ||
        # First call with no buffer to get the required size in bytes | ||
        n_bytes = pdffunc(self.raw, None, 0) | ||
        buffer = ctypes.create_string_buffer(n_bytes) | ||
        # Second call fills the buffer | ||
        pdffunc(self.raw, buffer, n_bytes) | ||
        # Drop the 2-byte terminator before decoding | ||
        return buffer.raw[: n_bytes - 2].decode("utf-16-le") | ||
|
||
    @property | ||
    def id(self) -> str: | ||
        return self.string_accessor(pdfium_c.FPDF_StructElement_GetID) | ||
|
||
    @property | ||
    def lang(self) -> str: | ||
        return self.string_accessor(pdfium_c.FPDF_StructElement_GetLang) | ||
|
||
    @property | ||
    def title(self) -> str: | ||
        return self.string_accessor(pdfium_c.FPDF_StructElement_GetTitle) | ||
|
||
    @property | ||
    def type(self) -> str: | ||
        return self.string_accessor(pdfium_c.FPDF_StructElement_GetType) | ||
|
||
    @property | ||
    def alt_text(self) -> str: | ||
        return self.string_accessor(pdfium_c.FPDF_StructElement_GetAltText) | ||
|
||
    @property | ||
    def actual_text(self) -> str: | ||
        return self.string_accessor(pdfium_c.FPDF_StructElement_GetActualText) | ||
|
||
    @property | ||
    def mcid(self) -> Optional[int]: | ||
        mcid: int = pdfium_c.FPDF_StructElement_GetMarkedContentID(self.raw) | ||
        if mcid == -1: | ||
            return None | ||
        else: | ||
            return mcid | ||
|
||
    @property | ||
    def mcids(self) -> Iterator[int]: | ||
        mcid_count = pdfium_c.FPDF_StructElement_GetMarkedContentIdCount(self.raw) | ||
        if mcid_count == -1: | ||
            return | ||
        else: | ||
            for idx in range(mcid_count): | ||
                mcid = pdfium_c.FPDF_StructElement_GetMarkedContentIdAtIndex( | ||
                    self.raw, idx | ||
                ) | ||
                if mcid != -1: | ||
                    yield mcid | ||
|
||
    def to_dict(self) -> T_obj: | ||
        eldict: T_obj = {} | ||
        if self.id: | ||
            eldict["id"] = self.id  # pragma: nocover | ||
        if self.lang: | ||
            eldict["lang"] = self.lang | ||
        if self.title: | ||
            eldict["title"] = self.title  # pragma: nocover | ||
        if self.type: | ||
            eldict["type"] = self.type | ||
        if self.alt_text: | ||
            eldict["alt_text"] = self.alt_text | ||
        if self.actual_text: | ||
            eldict["actual_text"] = self.actual_text | ||
        # MCID 0 is valid, so test against None rather than truthiness | ||
        mcid = self.mcid | ||
        if mcid is not None: | ||
            eldict["mcids"] = [mcid] | ||
        else: | ||
            mcids = list(self.mcids) | ||
            if mcids: | ||
                eldict["mcids"] = mcids | ||
        children = [] | ||
        for child in self.children: | ||
            if child.type: | ||
                children.append(child.to_dict()) | ||
        if children: | ||
            eldict["children"] = children | ||
        return eldict | ||
|
||
|
||
class PdfStructTree: | ||
    def __init__(self, raw: fpdf_structtree_t): | ||
        self.raw = raw | ||
|
||
    @classmethod | ||
    def from_page(cls, page: pypdfium2.PdfPage) -> "PdfStructTree": | ||
        raw = pdfium_c.FPDF_StructTree_GetForPage(page) | ||
        return cls(raw) | ||
|
||
    @property | ||
    def children(self) -> Iterator[PdfStructElement]: | ||
        n_children = pdfium_c.FPDF_StructTree_CountChildren(self.raw) | ||
        for idx in range(n_children): | ||
            child = PdfStructElement( | ||
                pdfium_c.FPDF_StructTree_GetChildAtIndex(self.raw, idx) | ||
            ) | ||
            if child.type: | ||
                yield child | ||
|
||
|
||
def get_page_structure( | ||
    stream: Union[BufferedReader, BytesIO], | ||
    page_ix: int, | ||
    password: Optional[str] = None, | ||
) -> PdfStructTree: | ||
    # If we are working with a file object saved to disk | ||
    if hasattr(stream, "name"): | ||
        src = stream.name | ||
    # If we instead are working with a BytesIO stream | ||
    else: | ||
        stream.seek(0) | ||
        src = stream | ||
    # Pass the password through; it was previously accepted but ignored | ||
    pdf = pypdfium2.PdfDocument(src, password=password) | ||
    return PdfStructTree.from_page(pdf[page_ix]) |
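The `string_accessor` method above relies on a two-call pattern: ask for the required size with a null buffer, allocate, call again to fill the buffer, then strip the terminator and decode as UTF-16LE. The following is a minimal standalone sketch of that pattern, assuming the accessor returns a byte count that includes a 2-byte terminator, as the `n_bytes - 2` slice in the diff implies. `fake_get_type` is a hypothetical stand-in for a PDFium accessor so the example runs without pypdfium2, and `read_utf16_string` is likewise an illustrative name. The guard for results shorter than 2 bytes is an addition for clarity, not part of the PR.

```python
import ctypes
from typing import Callable, Optional

Accessor = Callable[[object, Optional[ctypes.Array], int], int]


def fake_get_type(element: object, buffer: Optional[ctypes.Array], buflen: int) -> int:
    """Hypothetical stand-in for a PDFium string accessor."""
    data = "Figure".encode("utf-16-le") + b"\x00\x00"
    # Only write when the caller supplied a large enough buffer
    if buffer is not None and buflen >= len(data):
        buffer.raw = data
    # Always report the size needed, terminator included
    return len(data)


def fake_get_empty(element: object, buffer: Optional[ctypes.Array], buflen: int) -> int:
    """Hypothetical accessor for an attribute that is absent."""
    return 0


def read_utf16_string(pdffunc: Accessor, element: object) -> str:
    # First call: no buffer, just learn how many bytes are needed
    n_bytes = pdffunc(element, None, 0)
    if n_bytes < 2:
        return ""
    buffer = ctypes.create_string_buffer(n_bytes)
    # Second call: fill the buffer
    pdffunc(element, buffer, n_bytes)
    # Strip the 2-byte terminator, then decode
    return buffer.raw[: n_bytes - 2].decode("utf-16-le")


print(read_utf16_string(fake_get_type, None))
```

An empty string is what `to_dict` treats as "attribute not present", which is why each field there is guarded by a truthiness check.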
Binary files not shown (6 files).
What is the reason for tagging `image` and `char` objects, but not `line`, `rect`, or `curve` objects?
Ah, this could be an omission on my part (I didn't have a PDF with lines/rects/curves in a Figure handy to test with, but I should be able to create one myself with LibreOffice). Thanks for catching it, I'll add a test case.
Yes, indeed, this was wrong. I've fixed it to tag the lines/rects/curves with the MCID.