Accessibility tagging #909

Open
NathanTech7713 opened this issue Jun 20, 2023 · 16 comments

@NathanTech7713

Hi there,

I was wondering whether, when the dev is particularly bored, you would mind considering implementing extraction of accessibility tagging?

Thank you

NathanTech7713 added the feature-request label Jun 20, 2023
@jsvine
Owner

jsvine commented Jun 20, 2023

Hi @NathanTech7713 — thanks for your interest in this library, and for this suggestion. For my own notes and for others who may be less familiar:

And some general questions: What should the output of this extraction look like? A nested tree of tags? Something else?

@NathanTech7713: Do you have any examples of other PDF extraction libraries that have a feature like this, and which you think would provide a useful model?

@dhdaines
Contributor

dhdaines commented Jul 7, 2023

Hi! I was about to make this same feature request. I've done a bit of exploration here as I am working on extracting the structure from PDFs and, obviously, it makes sense to use explicit structure if it's there... well, sort of.

Most of the libraries that support tagged PDF are closed-source, but some functionality to extract it exists in Poppler and pdf.js, and you can see the tags by running pdfinfo -struct on a PDF (or pdfinfo -struct-text to see the content of the tags as well). Unfortunately the generation of structure and tags is, to put it mildly, highly variable across different PDF authoring tools, and I haven't come remotely close to understanding the (very convoluted) specification. The W3C has a nice overview of logical structure and tagged PDF here: https://www.w3.org/TR/2014/NOTE-WCAG20-TECHS-20140408/pdf_notes.html
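For anyone who wants to script the Poppler check mentioned above, here is a minimal sketch. It only wraps the `pdfinfo -struct` and `pdfinfo -struct-text` invocations named in the comment; the function names are made up for illustration, and it assumes the `pdfinfo` binary from Poppler is on the PATH (it returns `None` otherwise).

```python
import shutil
import subprocess
from typing import List, Optional


def pdfinfo_struct_command(pdf_path: str, with_text: bool = False) -> List[str]:
    """Build the pdfinfo command line for dumping the structure tree.

    -struct prints the tags only; -struct-text also prints their content.
    """
    flag = "-struct-text" if with_text else "-struct"
    return ["pdfinfo", flag, pdf_path]


def dump_structure(pdf_path: str, with_text: bool = False) -> Optional[str]:
    """Run pdfinfo and return its stdout, or None if pdfinfo is not installed.

    Hypothetical helper: error output is ignored here, although the thread
    notes that real documents often produce plenty of warnings.
    """
    if shutil.which("pdfinfo") is None:
        return None
    result = subprocess.run(
        pdfinfo_struct_command(pdf_path, with_text),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

An untagged PDF simply produces no structure output, so an empty result is not necessarily an error.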

Basically there are a couple of moving parts, which you can find starting in section 10.5 of the PDF 1.7 spec (or maybe section 14, if you have the Adobe/ISO document?):

  • Marked content sections - this is what pdfminer.six will give you if you use TagExtractor, which I think we can agree is a sub-optimal API (I am not really sure how it could be integrated in pdfplumber). These are the sections of text/objects/whatever in the PDF that correspond to structural units. Sometimes they have meaningful tags attached directly to them (notably, LibreOffice will do this) but usually they are all tagged as "P" and you have to look in the "logical structure" to get more useful information.
  • Logical structure - this is how you get a table of contents in the sidebar in your PDF reader, and it is pretty well supported by open-source libraries like Poppler and pdf.js, though usually with a torrent of error messages since there is so much variability in the way PDF creation tools implement it (probably because the spec is difficult to understand and full of options). The Poppler implementation is unreadable since it's written in C++ ;-) so look at the pdf.js implementation instead. You can get at this from the StructTreeRoot, RoleMap, ParentTree and sometimes ClassMap entries in the document catalog. It's a horrible, cyclical (notably pdfminer.six will crash with a stack overflow trying to resolve it) mess of PDFObject references. At some point (and there are multiple ways this can happen) you will end up at a leaf node which gives you an MCID that you can use to refer back to the marked content sections noted above. But they might be indirected through the ParentTree because Reasons.
  • Tagged PDF - this defines a whole bunch of extra standards on top of the two previous things, along with a (supposedly) standardized set of structural tags, a vaguely HTML+CSS-like layout model, and some extra attributes to help distinguish main content from headers, footers, etc, and also (yes!) actually define the "words" in the document which we know PDF doesn't do by default.
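
The walk from the structure tree down to MCIDs described above can be sketched in plain Python. This is only an illustration over a hypothetical, heavily simplified object table (a dict of dicts standing in for resolved PDF objects); it is not pdfminer.six or pdfplumber code. The key names "S" (structure type) and "K" (kids) follow the PDF spec, but here integer kids stand for MCIDs and string kids stand for indirect references, which glosses over the other forms a real "K" entry can take. An explicit stack and a set of visited ids keep the walk from recursing forever on cyclical references.

```python
from typing import Any, Dict, Iterator, List, Optional, Set, Tuple

# Hypothetical stand-in for resolved PDF objects: object id -> dictionary.
Objects = Dict[str, Dict[str, Any]]


def iter_mcids(
    objects: Objects,
    root_id: str,
    role_map: Optional[Dict[str, str]] = None,
) -> Iterator[Tuple[str, int]]:
    """Yield (tag, mcid) pairs in document order, skipping repeated references."""
    role_map = role_map or {}
    seen: Set[str] = set()
    # Each stack entry is (kind, value, tag); kind is "ref" or "mcid".
    stack: List[Tuple[str, Any, str]] = [("ref", root_id, "")]
    while stack:
        kind, value, tag = stack.pop()
        if kind == "mcid":
            yield tag, value
            continue
        if value in seen:
            continue  # already visited: this is what breaks cycles
        seen.add(value)
        node = objects[value]
        raw_tag = node.get("S", "")
        # The role map translates producer-specific tags to standard ones.
        node_tag = role_map.get(raw_tag, raw_tag)
        kids = node.get("K", [])
        if not isinstance(kids, list):
            kids = [kids]  # "K" may hold a single kid instead of an array
        # Push in reverse so that popping visits kids in their original order.
        for kid in reversed(kids):
            if isinstance(kid, int):
                stack.append(("mcid", kid, node_tag))
            else:
                stack.append(("ref", kid, ""))


objects = {
    "root": {"S": "Document", "K": ["h1", "p1"]},
    "h1": {"S": "Heading1", "K": 0},
    "p1": {"S": "P", "K": [1, "root", 2]},  # "root" here is a cyclical reference
}
pairs = list(iter_mcids(objects, "root", role_map={"Heading1": "H1"}))
```

With this toy input, `pairs` holds one entry per marked content section, each labelled with the role-mapped tag of its parent structure element.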

See https://github.com/dhdaines/alexi/blob/main/scripts/pdfstructure.py for a quick-and-dirty script (based on pdfminer.six code) which prints MCID sections and tags and attempts (but doesn't really succeed) to resolve the structure tree, and https://github.com/dhdaines/alexi/blob/main/test/data/pdf_structure.pdf for a test document with structure and tags.

@dhdaines
Contributor

dhdaines commented Jul 7, 2023

What I would find minimally useful (but I can't speak for the original author of this issue) would be:

  • A method to extract marked content sections and their attributes in a page, akin to extract_words, and some way to place words from extract_words within a given content section (yes, this could just be done with the bounding box)
  • A method to extract (a simplified version of) the structure tree from the document such that one could easily get to the marked content sections from it and vice versa.
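
The bounding-box placement mentioned in the first point could look roughly like this. It is a sketch under assumptions: words are dicts with `text`, `x0`, `x1`, `top` and `bottom` keys (the shape `extract_words` returns), while the section dicts, their `mcid` key and the `tolerance` parameter are hypothetical and only exist for this example.

```python
from typing import Any, Dict, List, Optional


def contains(outer: Dict[str, Any], inner: Dict[str, Any], tolerance: float = 0.5) -> bool:
    """True if inner's box lies within outer's box, allowing a small tolerance."""
    return (
        inner["x0"] >= outer["x0"] - tolerance
        and inner["x1"] <= outer["x1"] + tolerance
        and inner["top"] >= outer["top"] - tolerance
        and inner["bottom"] <= outer["bottom"] + tolerance
    )


def assign_words(
    words: List[Dict[str, Any]],
    sections: List[Dict[str, Any]],
    tolerance: float = 0.5,
) -> Dict[Optional[int], List[str]]:
    """Group word texts by the MCID of the first section containing them.

    Words that fall inside no section are collected under the key None.
    """
    assigned: Dict[Optional[int], List[str]] = {}
    for word in words:
        mcid = None
        for section in sections:
            if contains(section, word, tolerance):
                mcid = section["mcid"]
                break
        assigned.setdefault(mcid, []).append(word["text"])
    return assigned


sections = [
    {"mcid": 0, "tag": "H1", "x0": 72, "x1": 300, "top": 70, "bottom": 90},
    {"mcid": 1, "tag": "P", "x0": 72, "x1": 540, "top": 100, "bottom": 140},
]
words = [
    {"text": "Title", "x0": 80, "x1": 150, "top": 72, "bottom": 88},
    {"text": "Body", "x0": 80, "x1": 120, "top": 105, "bottom": 118},
    {"text": "7", "x0": 500, "x1": 506, "top": 760, "bottom": 770},
]
grouped = assign_words(words, sections)
```

Matching on geometry is a fallback: it can misassign words when sections overlap, which is why carrying the MCID on each extracted object would be more reliable.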

@NathanTech7713
Author

Woops! Got to be honest, thought I replied and then didn't!

@dhdaines sums it up quite well in what I am also hoping for.

I think I mentioned quite a while ago about eventually wanting to put together an accessible PDF reader for screen reader (totally blind) users of Windows, so accessibility tagging would be a solid way of identifying structure.

@jsvine
Owner

jsvine commented Jul 10, 2023

Thank you both, these are very helpful notes/context. I can't promise I'll get to this soon, but it does seem worth trying to add.

jsvine self-assigned this Jul 10, 2023
@dhdaines
Contributor

Thank you both, these are very helpful notes/context. I can't promise I'll get to this soon, but it does seem worth trying to add.

If it helps I can make a preliminary PR with something like what I mentioned above (extraction of marked content sections + structure tree parsing)

@jsvine
Owner

jsvine commented Jul 10, 2023

@dhdaines Thanks for the offer! Is there a particular subset of this functionality that would be easiest to start trying to integrate into pdfplumber? (I.e., require the least modification of existing code / least performance impact.)

@dhdaines
Contributor

@dhdaines Thanks for the offer! Is there a particular subset of this functionality that would be easiest to start trying to integrate into pdfplumber? (I.e., require the least modification of existing code / least performance impact.)

At first glance - extracting the structure tree is relatively easy and can be done on-demand as it's all in the document catalog - linking it to the MCIDs might have more of a performance impact, at least with pdfminer.six, since it seems like we have to decode and parse the entire document to get them, even for a single page, but I could be mistaken about this!

@jsvine
Owner

jsvine commented Jul 11, 2023

Thanks! That sounds like a reasonable place to start. I suppose we could expose that similarly to how we do with Page.annots — i.e., outside the main parsing function?
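
One way to picture "outside the main parsing function" is a lazily computed property, so the structure tree is only resolved when someone asks for it. The sketch below is an assumption about the general pattern, not pdfplumber's code: `LazyPage`, its `structure_tree` property and the loader callable are all hypothetical names.

```python
from functools import cached_property
from typing import Any, Callable, Dict, List


class LazyPage:
    """Hypothetical page wrapper that defers structure-tree extraction."""

    def __init__(self, load_structure: Callable[[], List[Dict[str, Any]]]) -> None:
        # The loader stands in for whatever walks the document catalog.
        self._load_structure = load_structure

    @cached_property
    def structure_tree(self) -> List[Dict[str, Any]]:
        # Runs on first access only; the result is cached on the instance.
        return self._load_structure()


calls = []


def fake_loader() -> List[Dict[str, Any]]:
    """Stand-in loader that records how many times it was invoked."""
    calls.append(1)
    return [{"type": "Document", "children": []}]


page = LazyPage(fake_loader)
calls_before_access = len(calls)
first = page.structure_tree
second = page.structure_tree
```

Users who never touch the property pay nothing for it, and repeated access does not re-parse the tree.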

@dhdaines
Contributor

The pypdfium2 interface to the underlying pdfium API may be useful for this: https://pdfium.googlesource.com/pdfium/+/refs/heads/main/public/fpdf_structtree.h

@dhdaines
Contributor

The pypdfium2 interface to the underlying pdfium API may be useful for this: https://pdfium.googlesource.com/pdfium/+/refs/heads/main/public/fpdf_structtree.h

Actually this is quite easy. I should have a PR for you tonight or tomorrow, I hope.

@dhdaines
Contributor

Ready for review, see PR above. I'll test it more on my PDFs of interest, but it is functional and somewhat documented, see docs/structure.md and tests/test_structure.py for examples.

@jsvine
Owner

jsvine commented Jul 19, 2023

Many thanks, @dhdaines, and a particular thanks for the documentation. It might take me a little while to review the PR, due to other workload and me being relatively new to the topic/feature, but on first glance, it seems like a helpful contribution.

@jsvine
Owner

jsvine commented Nov 9, 2023

Now that #961 and #963 are merged, is this issue all clear to close? Or are there other features that would need to be in place for us to say we've handled accessibility tagging?

@dhdaines
Contributor

dhdaines commented Nov 9, 2023

Thanks! There is at least one small add-on to consider - #961 doesn't give access to the tag attributes, only the tag name. These allow you to distinguish between different types of artifacts (header, footer, etc).

I'm not sure if we want to add them as a dictionary-valued attribute for each object in a marked content section, as this could produce large outputs (it shouldn't be a huge problem for memory consumption since it's the same dictionary...)

"Tagged PDF" is a fairly vaguely defined standard (or perhaps I just don't fully understand it yet) so there may be other things too.

@jsvine
Owner

jsvine commented Nov 10, 2023

Thanks, @dhdaines. A couple of follow-up questions:

I'm not sure if we want to add them as a dictionary-valued attribute for each object in a marked content section

Could you share an example of what this would look like?

as this could produce large outputs

I agree with the general inclination here. Could we have it both ways and allow users to opt-in to this additional output?
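
To make the opt-in idea concrete, here is a rough sketch of what such output might look like. Everything in it is an assumption for illustration: the `tag_objects` function, the `include_attrs` flag and the `attrs` key are hypothetical names, not something the merged PRs are known to provide. The artifact attributes shown (`Type` of `Pagination`, `Subtype` of `Header`) are the kind of values that distinguish headers from other artifacts. Every object in a section points at the same dictionary, so memory use stays flat even if serialized output grows.

```python
from typing import Any, Dict, List


def tag_objects(
    objs: List[Dict[str, Any]],
    section: Dict[str, Any],
    include_attrs: bool = False,
) -> List[Dict[str, Any]]:
    """Return copies of objs annotated with the section's MCID and tag.

    With include_attrs=True, each copy also references the section's
    attribute dictionary (the same dict object, not a per-object copy).
    """
    tagged = []
    for obj in objs:
        new_obj = dict(obj, mcid=section["mcid"], tag=section["tag"])
        if include_attrs:
            new_obj["attrs"] = section["attrs"]  # shared reference
        tagged.append(new_obj)
    return tagged


section = {
    "mcid": 3,
    "tag": "Artifact",
    "attrs": {"Type": "Pagination", "Subtype": "Header"},
}
chars = [{"text": "A"}, {"text": "B"}]

plain = tag_objects(chars, section)
detailed = tag_objects(chars, section, include_attrs=True)
```

Leaving the flag off by default keeps existing output unchanged for users who only need the tag name.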
