
Incorrect extraction in tables with overlapping columns #912

Open
gnadlr opened this issue Jun 22, 2023 · 20 comments


gnadlr commented Jun 22, 2023

This is a continuation of a discussion posted here; please check it for more info.

Describe the bug

When the PDF has overlapping columns (i.e., the columns do not wrap text), all extraction methods (extract_tables, extract_text, extract_words) give incorrect results.

  • When using extract_tables() or extract_text(), the 1st column is truncated at the separator line, and the rest of the 1st column overlaps alternately with the 2nd column.
  • When using extract_words(use_text_flow=True), the last word of the 1st column (everything after the last space, or the entire cell if there is no space) is joined with the 2nd column.

Original text

'aaaa b|bbb' and '1111' (the | is the separator line between the columns)

Expected behavior

'aaaa bbbb' and '1111'

Actual behavior

'aaaa b' and 'b1b1b11' when using extract_tables() or extract_text()

'aaaa' and 'bbbb1111' when using extract_words(use_text_flow=True)

Sample pdf

https://github.com/jsvine/pdfplumber/files/11782271/sample.2.pdf
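For what it's worth, the alternating overlap can be reproduced without any PDF at all: when two columns occupy overlapping x-ranges, sorting characters strictly by their x0 position interleaves them. A minimal sketch with invented coordinates (not taken from the sample PDF):

```python
# Illustrative sketch (invented coordinates, not from the sample PDF):
# a first-column word overflows into the x-range of the second column.
col1 = [{"text": c, "x0": 10 + 2 * i} for i, c in enumerate("bbbb")]
col2 = [{"text": c, "x0": 11 + 2 * i} for i, c in enumerate("1111")]

# Re-sorting all characters left-to-right by x0 interleaves the columns:
merged = sorted(col1 + col2, key=lambda ch: ch["x0"])
print("".join(ch["text"] for ch in merged))  # b1b1b1b1
```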

@gnadlr gnadlr added the bug label Jun 22, 2023

jsvine commented Jun 28, 2023

Thanks for opening this issue, @gnadlr; and thanks for your contributions to the related discussion and other recent ones, @cmdlineluser! Some observations:

  • 'aaaa bbbb' and '1111' do not seem to appear in the sample PDF, so it's difficult to diagnose the situation precisely. Are you able to share the PDF that contains those strings? Or, alternatively, restate the problem in terms of the sample PDF that you've shared above?

  • Your issue here and the sample PDF also helped me to diagnose a bug in the way pdfplumber handles use_text_flow=True. Hoping to push a fix for this soon, and hoping it helps you with your PDF.

  • In the sample PDF attached, there is an interesting quirk: It uses a "clipping path" to make the text that overflows certain cells invisible. If you're curious, it's this part (and another like it) in the raw PDF commands:

[Screenshot of the raw PDF commands showing the clipping path]


gnadlr commented Jun 29, 2023

You can check out these two sample PDFs:
1a.pdf
2a.pdf

Issue with extract_tables(): text mingling. It is very apparent in the 2nd PDF.

Issue with extract_words(): spaces are not recognized properly between the last word of a column and the first word of the next column. It is very apparent in the 2nd PDF (columns overlapping) and also in the 1st PDF (columns not overlapping, but words of the 2 columns are very close).


cmdlineluser commented Jun 29, 2023

Apologies for any confusion regarding this, @jsvine

  • The initial sample PDF came from another discussion that displayed behaviour similar to what was being described here
  • I had given it as an example to @gnadlr in case their own PDF was private

https://drive.google.com/file/d/1SlQAkQ7W28O6mZXvKx2Lvw__DjdhnX82/view

Using row 1 from your updated page 2 sample as an example:

page2.search('Desloratadin.*')[0]['text']
'Desloratadin Uống 0,5mg/ml x 60 mlDestacure VN-16773-13VN-16773-13 
 Gracure PharmIancdeiuatical LtHd ộp 1 lọ 60 ml Chai 4.120 65.000 267.800.000 
 Công ty TNNH2H Dược phGẩ1m 1A ViệBt NVa Đma khoaB tắỉnch K Bạắnc 
 K1ạn69/QĐ-SY0T4/3/2022
'

With use_text_flow=True it fixes some overlap, e.g. Pharmaceutical

page2.search('Desloratadin.*', use_text_flow=True)[0]['text']
'Desloratadin Uống 0,5mg/ml x 60 mlDestacure VN-16773-13VN-16773-13 
 Gracure Pharmaceutical India Ltd Hộp 1 lọ 60 ml Chai 4.120 65.000 267.800.000          
 Công ty TNHH NamN2 Dược phẩm G1 1A Việt BV Đa khoa KạnBắc tỉnh Kạn Bắc 
 169/QĐ-SYT04/3/2022
'

keep_blank_chars=True seems to fix some more, e.g. Việt Nam (although N2 is merged and there is a space)

page2.search('Desloratadin.*', use_text_flow=True, keep_blank_chars=True)[0]['text']
'Desloratadin Uống 0,5mg/ml x 60 mlDestacure VN-16773-13VN-16773-13 
 Gracure Pharmaceutical Ltd India Hộp 1 lọ 60 ml Chai 4.120      65.000       267.800.000         
 Công ty TNHH Dược phẩm 1A Việt NamN2 G1 BV Đa khoa tỉnh Bắc KạnBắc Kạn                   
 169/QĐ-SYT04/3/2022
'

I'm not sure if all the text there is now correct; there are still spacing issues, e.g. Việt NamN2

It does appear you can pass text_use_text_flow/text_keep_blank_chars to the table methods: #764 (comment) (I did not realize this was possible)

pd.DataFrame(page2.extract_table()).iloc[[1], :10]

#    0             1     2                3           4            5            6             7                  8                  9
# 1  1  Desloratadin  Uống  0,5mg/ml x 60 m  lDestacure  VN-16773-13  VN-16773-13  Gracure Phar  mIancdeiuatical L  tHd ộp 1 lọ 60 ml
pd.DataFrame(page2.extract_table({"text_use_text_flow": True})).iloc[[1], :10]

#    0             1     2                3           4            5            6             7                  8                  9
# 1  1  Desloratadin  Uống  0,5mg/ml x 60 m  lDestacure  VN-16773-13  VN-16773-13  Gracure Phar  maceutical LIndia  td Hộp 1 lọ 60 ml

There does seem to be something else going on though.


gnadlr commented Jun 29, 2023

@cmdlineluser is on point. Here are some more comparisons so it is easier to see the issue.
Continuing with the example (the first 10 columns of the 1st row of PDF sample 2), we have the following:

Expected result

#    0             1     2                 3          4            5            6                           7     8             9
# 1  1  Desloratadin  Uống  0,5mg/ml x 60 ml  Destacure  VN-16773-13  VN-16773-13  Gracure Pharmaceutical Ltd India Hộp 1 lọ 60 ml

extract_table()

Text is split where there is overlap; the split text merges with the next column (lDestacure) and gets mingled (Gracure Pharmaceutical Ltd becomes 3 columns: Gracure Phar | mIancdeiuatical L | tHd)

#    0             1       2               3          4           5           6            7                 8                 9
# 1  1  Desloratadin	Uống 0,5mg/ml x 60 m lDestacure	VN-16773-13 VN-16773-13 Gracure Phar mIancdeiuatical L	tHd ộp 1 lọ 60 ml

extract_table({"text_use_text_flow": True, "text_keep_blank_chars": True})

Same split behavior (lDestacure). Text mingling is better but still not correct (Gracure Pharmaceutical Ltd becomes 3 columns: Gracure Phar | maceutical LIndia | td).

#    0             1     2                3          4            5            6              7                 8                 9
# 1  1  Desloratadin  Uống  0,5mg/ml x 60 m  lDestacure  VN-16773-13  VN-16773-13  Gracure Phar maceutical LIndia td Hộp 1 lọ 60 ml

extract_words(use_text_flow=True, keep_blank_chars=True)

All text is correct (no mingling), but several spaces are not detected, so it's not possible to parse the output correctly into tables (spaces are missed when text in adjacent columns is too close or when columns overlap)

1Desloratadin Uống 0,5mg/ml x 60 mlDestacure VN-16773-13VN-16773-13 Gracure Pharmaceutical Ltd India Hộp 1 lọ 60 ml


gnadlr commented Jul 7, 2023

I was able to extract individual characters with their coordinates using extract_text_lines().
The automatic line detection of extract_text_lines() sometimes detects lines incorrectly, so I have to merge all characters into a single list and write another parser to sort them into rows.

Now the only thing left to do is to parse them into columns.

import pdfplumber

def pdfplumber_extractchars(pdf_path, column_separators):
    # column_separators: sorted x-positions of the vertical separator lines
    with pdfplumber.open(pdf_path) as pdf:
        all_rows = []
        for page in pdf.pages:
            data = []
            lines = page.extract_text_lines(use_text_flow=True, keep_blank_chars=True,
                                            x_tolerance=1, return_chars=True)
            # Merge the chars of all detected lines into a single list
            chars = [char for line in lines for char in line["chars"]]
            if not chars:
                continue
            # Parse into rows: a row change is detected when a char jumps
            # from the last column back into the first
            row = [chars[0]]
            for i in range(1, len(chars)):
                if chars[i]["x0"] < column_separators[1] and chars[i - 1]["x0"] > column_separators[-2]:
                    data.append(row)
                    row = [chars[i]]
                else:
                    row.append(chars[i])
            data.append(row)
            all_rows.extend(data)
        return all_rows
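A sketch of that remaining column step, assuming the same column_separators list of separator x-positions used above: bisecting each character's x0 against the separators assigns it to a cell.

```python
from bisect import bisect_right

def split_row_into_columns(row, column_separators):
    """Assign each char dict to a cell by comparing its x0 against the
    sorted x-positions of the vertical separator lines."""
    cells = [[] for _ in range(len(column_separators) + 1)]
    for char in row:
        cells[bisect_right(column_separators, char["x0"])].append(char)
    return ["".join(c["text"] for c in cell) for cell in cells]

# Toy row with invented coordinates and one separator at x=20:
row = [{"text": "a", "x0": 5}, {"text": "b", "x0": 12}, {"text": "1", "x0": 25}]
print(split_row_into_columns(row, [20]))  # ['ab', '1']
```

This only works when the separator positions are known in advance (e.g. read from the table's ruling lines), which matches the approach in the snippet above.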

jsvine added a commit that referenced this issue Jul 16, 2023
As noted in #912, `use_text_flow` was not being handled consistently, as
characters and words were being re-sorted without checking first if this
parameter was set to `True`.
@jsvine jsvine mentioned this issue Jul 16, 2023

jsvine commented Jul 17, 2023

  • Your issue here and the sample PDF also helped me to diagnose a bug in the way pdfplumber handles use_text_flow=True. Hoping to push a fix for this soon, and hoping it helps you with your PDF.

FYI v0.10.0, now available, contains this fix. Hopefully it helps with this issue more broadly. I'll be eager to know what you think.


gnadlr commented Jul 18, 2023

Thank you very much for the fix.

Space is now correctly detected when the text of 2 columns physically overlaps.

However, space is not detected when the text of 2 columns is very close but does not overlap.

[Screenshot of the PDF viewed in Adobe Acrobat]

Using extract_words, this comes out as a single word "mlDestacure" (after the "m" there is an "l", but it is hidden and occupies the empty space before "D"). It should have been "ml" and "Destacure" separately.

Because of the above, my current best option is to extract individual characters. But I'm not sure how to fix this based on character coordinates alone, since in this case the coordinates are close and continuous, very similar to those of any standalone word.

My idea is that pdfplumber could detect which characters are "visible" and which are "hidden"; then I could write a parser to split words when this attribute changes from "hidden" to "visible".
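To illustrate the idea (purely hypothetical: pdfplumber chars expose no such attribute today), a parser over an assumed per-char is_clipped flag might look like:

```python
# Hypothetical sketch: "is_clipped" is NOT a real pdfplumber char attribute;
# it assumes clipping-path info were someday surfaced per character.
def split_on_clip_change(chars):
    words, current = [], []
    for ch in chars:
        # Start a new word when a character goes from clipped (hidden by
        # the clipping path) back to visible.
        if current and current[-1]["is_clipped"] and not ch["is_clipped"]:
            words.append(current)
            current = []
        current.append(ch)
    if current:
        words.append(current)
    return ["".join(c["text"] for c in w) for w in words]

chars = (
    [{"text": "m", "is_clipped": False}]
    + [{"text": "l", "is_clipped": True}]  # the hidden "l"
    + [{"text": c, "is_clipped": False} for c in "Destacure"]
)
print(split_on_clip_change(chars))  # ['ml', 'Destacure']
```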

If you have a better idea, please let me know.

Hope it makes sense. Thank you.

@cmdlineluser

I wonder why Gracure Pharmaceutical is "hidden" - yet it seems to be parsed correctly?

Trying all the various tools/libraries for extracting text - they all seem to extract mlDestacure as a single word.

In looking for debugging options, I found the mutool trace command which appears to translate the raw pdf commands into XML:

    <fill_text colorspace="DeviceGray" color="0" transform="1 0 0 -1 0 595.32">
        <span font="Times New Roman" wmode="0" bidi="0" trm="6 0 0 6">
            <g unicode="0" glyph="zero" x="172.1" y="479.74" adv=".5"/>
            <g unicode="," glyph="comma" x="175.1" y="479.74" adv=".25"/>
            <g unicode="5" glyph="five" x="176.654" y="479.74" adv=".5"/>
            <g unicode="m" glyph="m" x="179.654" y="479.74" adv=".778"/>
            <g unicode="g" glyph="g" x="184.09401" y="479.74" adv=".5"/>
            <g unicode="/" glyph="slash" x="187.09401" y="479.74" adv=".278"/>
            <g unicode="m" glyph="m" x="188.76201" y="479.74" adv=".778"/>
            <g unicode="l" glyph="l" x="193.214" y="479.74" adv=".278"/>
            <g unicode=" " glyph="space" x="194.654" y="479.74" adv=".25"/>
            <g unicode="x" glyph="x" x="196.20801" y="479.74" adv=".5"/>
            <g unicode=" " glyph="space" x="199.08802" y="479.74" adv=".25"/>
            <g unicode="6" glyph="six" x="200.64202" y="479.74" adv=".5"/>
            <g unicode="0" glyph="zero" x="203.64202" y="479.74" adv=".5"/>
            <g unicode=" " glyph="space" x="206.64202" y="479.74" adv=".25"/>
            <g unicode="m" glyph="m" x="208.19602" y="479.74" adv=".778"/>
            <g unicode="l" glyph="l" x="212.63602" y="479.74" adv=".278"/>
        </span>
    </fill_text>
<pop_clip/>
<end_layer/>
<layer name="P"/>
<clip_path winding="eofill" transform="1 0 0 -1 0 595.32">
    <moveto x="51.96" y="59.28"/>
    <lineto x="782.62" y="59.28"/>
    <lineto x="782.62" y="540.6"/>
    <lineto x="51.96" y="540.6"/>
    <closepath/>
</clip_path>
    <fill_text colorspace="DeviceGray" color="0" transform="1 0 0 -1 0 595.32">
        <span font="Times New Roman" wmode="0" bidi="0" trm="6 0 0 6">
            <g unicode="D" glyph="D" x="213.26" y="479.74" adv=".722"/>
            <g unicode="e" glyph="e" x="217.592" y="479.74" adv=".444"/>
            <g unicode="s" glyph="s" x="220.22" y="479.74" adv=".389"/>
            <g unicode="t" glyph="t" x="222.5" y="479.74" adv=".278"/>
            <g unicode="a" glyph="a" x="224.294" y="479.74" adv=".444"/>
            <g unicode="c" glyph="c" x="226.934" y="479.74" adv=".444"/>
            <g unicode="u" glyph="u" x="229.574" y="479.74" adv=".5"/>
            <g unicode="r" glyph="r" x="232.574" y="479.74" adv=".333"/>
            <g unicode="e" glyph="e" x="234.608" y="479.74" adv=".444"/>
        </span>
    </fill_text>
<pop_clip/>
<end_layer/>

The <clip_path winding="eofill" entries appear to be the W* commands @jsvine showed us #912 (comment).

However, mutool text extraction still extracts mlDestacure as a single "word" (unless you use the preserve-spans option)

$ mutool convert -O preserve-spans -o 2a.txt Downloads/2a.pdf 
$ grep -C 3 Dest 2a.txt 
 Desloratadin
Uống
0,5mg/ml x 60 ml
Destacure
 VN-16773-13
 VN-16773-13
Gracure Pharmaceutical Ltd 

From what I can find, the clipping commands are currently no-ops in pdfminer: pdfminer/pdfminer.six#414 - I'm not sure if this is something that needs to be supported in order for pdfplumber to be able to handle this?


jsvine commented Jul 23, 2023

From what I can find, the clipping commands are currently no-ops in pdfminer: pdfminer/pdfminer.six#414 - I'm not sure if this is something that needs to be supported in order for pdfplumber to be able to handle this?

Thanks for this extra context, @cmdlineluser, and for flagging the pdfminer no-op. Unfortunately, that no-op blocks pdfplumber from making use of clipping paths. So not sure we can do much with this here. I keep a fairly close eye on pdfminer.six releases; if/when a future release includes clipping path information, I'll aim to incorporate it. (Maybe something like char["is_clipped"]: bool.)

Trying all the various tools/libraries for extracting text - they all seem to extract mlDestacure as a single word.

I think the issue is that the l in ml bumps right up against the D in Destacure, thus providing no indication that they're part of separate words. The separation between the other examples of clipped text and the following column of text can be detected because the clipped text either extends beyond the beginning of the next column or stops a bit short of it.

I wonder why Gracure Pharmaceutical is "hidden" - yet it seems to be parsed correctly?

Hmm, I see Gracure Pharm [...] in the PDF as not-hidden. But perhaps I'm misunderstanding?

@jsvine jsvine removed the bug label Jul 23, 2023
@cmdlineluser

It's possible I am the one misunderstanding things, or using the wrong terminology @jsvine

the "ml" is hidden:

[screenshot: the clipped "ml"]

and here the "maceutical":

[screenshot: the clipped text in "Gracure Pharmaceutical"]

I was just wondering how come it doesn't extract as PharmaceuticalIndia - similar to the mlDestacure case - but perhaps it's because there is an actual space character in the text, even though it is "hidden".


jsvine commented Jul 24, 2023

Ah, I see; this is a good motivation for me to write more comprehensive documentation about how word segmentation works in pdfplumber. Until then:

  • With the default parameters, pdfplumber scans left-to-right; each character's x0 and x1 positions are compared to one another. If next_char["x0"] > curr_char["x1"] + x_tolerance, we consider next_char to begin a new word; otherwise next_char is appended to the current word.
  • With use_text_flow=True, the rule changes slightly. Rather than scan strictly left-to-right, pdfplumber instead examines the characters in the sequence they appear in the PDF's actual commands. Like before, next_char["x0"] > curr_char["x1"] + x_tolerance will trigger a new word. But now so will next_char["x0"] < curr_char["x0"], indicative of the text "backtracking" to a further-left location.

Because Gracure Pharmaceutical extends (substantially, in fact) beyond the beginning of the next chunk of text, it triggers that second condition — as, I think, it should.

But because the l in ml begins just a bit before the D in Destacure, neither condition is met.


Note: Technically, both criteria are tested when use_text_flow=False; it's just that the pre-sorting by x0 means that the second condition will never be triggered.
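Those two rules can be sketched over plain char dicts (a simplified model of the segmentation described above, ignoring vertical position and the upright flag; only "text", "x0", and "x1" are used):

```python
def segment(chars, x_tolerance=3, use_text_flow=False):
    """Simplified sketch of the word-splitting rules described above."""
    if not use_text_flow:
        # Default behavior: pre-sort characters left-to-right first.
        chars = sorted(chars, key=lambda ch: ch["x0"])
    words, current = [], []
    for ch in chars:
        if current:
            prev = current[-1]
            gap_too_big = ch["x0"] > prev["x1"] + x_tolerance
            backtracked = ch["x0"] < prev["x0"]  # text jumped back leftward
            if gap_too_big or backtracked:
                words.append(current)
                current = []
        current.append(ch)
    if current:
        words.append(current)
    return ["".join(c["text"] for c in w) for w in words]

# "l" ends at x1=214.30 but "D" starts at 213.26: a slight, non-backtracking
# overlap, so neither condition fires and the chunk stays a single word.
ml_d = [
    {"text": "m", "x0": 208.20, "x1": 212.86},
    {"text": "l", "x0": 212.64, "x1": 214.30},
    {"text": "D", "x0": 213.26, "x1": 217.59},
]
print(segment(ml_d, use_text_flow=True))  # ['mlD']
```

A run that extends well past the start of the next chunk (like Gracure Pharmaceutical) does trip the backtracking condition, splitting the word as intended.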


gnadlr commented Jul 24, 2023

This is exactly what I was trying to say (though it should be next_char["x0"] > curr_char["x1"] + x_tolerance?).

If detection of clipping is not possible, another idea is to check character spacing: with a given font and font size, the spacing between 2 specific characters should be consistent (in theory).

For example, in our example mlDestacure, the spacing between "m" and "l" is consistent, while the spacing between "l" and "D" would be slightly different from the regular spacing "l" and "D" would have had they been part of the same word.

However, I haven't figured out the spacing rules in PDF files (sometimes characters even have negative spacing).

Since these PDF files have a consistent font and font size, if I can figure out the spacing rule, I can write a parser that checks individual spacing to decide whether two characters are part of the same word.

Note: by "spacing" I mean next_char["x0"] - curr_char["x1"]
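The mutool trace quoted earlier in the thread makes this concrete. With font size 6 (the trm "6 0 0 6"), each glyph's right edge is x + adv * 6, so the spacing can be computed directly for the "m", "l", "D" glyphs:

```python
# Glyph data taken from the mutool trace earlier in the thread:
# (unicode, x, adv), with font size 6 so width = adv * 6.
SIZE = 6
glyphs = [("m", 208.19602, 0.778), ("l", 212.63602, 0.278), ("D", 213.26, 0.722)]

gaps = {}
for (a, ax, aadv), (b, bx, _) in zip(glyphs, glyphs[1:]):
    # spacing = next_char x0 minus current char's right edge
    gaps[f"{a}->{b}"] = round(bx - (ax + aadv * SIZE), 3)

print(gaps)  # {'m->l': -0.228, 'l->D': -1.044}
```

Both gaps are negative, and the cross-word gap ("l" to "D") is actually the larger overlap, which illustrates why a fixed spacing threshold struggles here.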


jsvine commented Jul 24, 2023

This is correctly what I was trying to say (though it should be next_char["x0"] > curr_char["x1"] + x_tolerance?).

Thanks! Updated the comment to fix that.

If detection of clipping is not possible, another idea is to check character spacing: With a certain font and font-size, spacing between 2 specific characters should be consistent (in theory).

Yes, I think the difficulty here is the "(in theory)" part. In practice, I think we'd see a lot of unexpected violations of this theory — enough that it'd create a whole class of edge cases perhaps more common than the thing it's trying to fix.

That said, I'm quite open to being persuaded otherwise with examples and testing!


cmdlineluser commented Jul 24, 2023

But because the l in ml begins just a bit before the D in Destacure, neither condition is met.

Ah, I see - the ml doesn't actually overlap - so while both examples may look similar visually, they're different.

Thanks for the explanation @jsvine.

I noticed from the mutool trace output in #912 (comment) that it knows the text boundaries, since each run is in its own <layer>

(I'm not sure if that is just a property of this particular PDF?)

From some poking around, it looks like these are the do_BDC() and do_EMC() commands in pdfminer:

https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/pdfinterp.py#L771-L779

Adding in some debug prints and in do_TJ():

[LAYER]
    [SHOW_TEXT] seq=[b'0,', -9, b'5m', 38, b'g/m', 36, b'l', 38, b' ', -9, b'x', 20, b' ', -9, b'60 ', -9, b'm', 38, b'l']
[LAYER]
    [SHOW_TEXT] seq=[b'De', 6, b's', 9, b't', -21, b'a', 4, b'c', 4, b'ur', -6, b'e']

Perhaps you know if it's somehow possible to use this layer information to help with this?


jsvine commented Jul 25, 2023

Really interesting, thanks for sharing @cmdlineluser. I think you're right about those layers being created by marked-content commands. As it happens @dhdaines is doing some experimentation with extracting those sections in #937.

Forcing a word-split when crossing in/out of a marked content section makes sense; certainly something worth trying out if we're able to merge that info.

Another option (perhaps defaulting to False) would be to force word-splits on those <span>s, which seem to correspond to TJ calls. It seems that would work in this particular example, and would be practically universally available across PDFs (unlike marked content commands, which only some PDFs implement).

Unfortunately, getting access to TJ calls/information would seem to require subclassing pdfminer.six; something to consider, but feels like a last resort. (Of course, there's the more radical option of swapping out pdfminer.six for another library — something I've considered over the years — but that's going to require a lot more thinking/planning. Still: Open to your opinion on this!)

@dhdaines
Contributor

Ironically, I have a similar problem, where a space character appears for unknown reasons just above a line of text and causes a word break due to the sorting of characters: in this PDF I get "63 5" instead of "635" at the bottom of the page. The solution is either to use y_tolerance=1 or use_text_flow=True.

Forcing a word-split when crossing in/out of a marked content section makes sense; certainly something worth trying out if we're able to merge that info.

This can be problematic because marked content section boundaries can show up just about anywhere - take this PDF for example, running:

import sys

import pdfplumber

pdf = pdfplumber.open(sys.argv[1])
page = pdf.pages[0]
for word in page.extract_words(extra_attrs=["mcid"]):
    print(word["mcid"], word["text"])

You will see that basically every word has its own MCID, but also many words are split into multiple marked content sections:

90 personnage
92 historique
94 dé
95 c
96 é
97 d
98 é

Unfortunately, getting access to TJ calls/information would seem to require subclassing pdfminer.six; something to consider, but feels like a last resort.

I already do this in #937 ;-) there is really no other option, particularly since pdfminer.six does not appear to be actively maintained: https://github.com/jsvine/pdfplumber/pull/937/files#diff-646d362173010ce6a7ab11c23aba8e777a3943f84c3ac1f39a9e3a47c3ad6719R122

It kind of seems to me like the subset of functionality in pdfminer.six that is actually used could simply be incorporated into pdfplumber, which would give the opportunity to make it more efficient and fix problems like the bad type annotations...


dhdaines commented Jul 25, 2023

As for switching to a different library ... there doesn't seem to exist one that has:

  • A sufficiently permissive license
  • Low-level access to PDF structure
  • Non-onerous build and runtime dependencies
  • Pythonicity

Maybe pdf-rs could be interesting in the future ... binding Python to Rust is relatively painless.


dhdaines commented Jul 25, 2023

You will see that basically every word has its own MCID, but also many words are split into multiple marked content sections:

Actually - sorry for the spam here ... but in this case the MCIDs correspond to inline Span elements in the structure tree, so they should not be expected to force word breaks. See the pdfinfo -struct-text output:

    Span (inline)
      "historique"
    Span (inline)
      " "
    Span (inline)
      "dé"
    Span (inline)
      "c"
    Span (inline)
      "é"
    Span (inline)
      "d"
    Span (inline)
      "é"
    Span (inline)
      " "

So basically, no, we should not put word breaks at marked content section boundaries unless we know that they are block elements.


jsvine commented Aug 2, 2023

Thanks for the notes, @dhdaines. Thoughts/responses below:

This can be problematic because marked content section boundaries can show up just about anywhere - take this PDF for example, running: [...]

Ah, very interesting, thanks. I wouldn't want MCIDs to be incorporated into word-splitting by default, but it might be a nice option to have available.

Unfortunately, getting access to TJ calls/information would seem to require subclassing pdfminer.six; something to consider, but feels like a last resort.

I already do this in #937 ;-) there is really no other option, particularly since pdfminer.six does not appear to be actively maintained: https://github.com/jsvine/pdfplumber/pull/937/files#diff-646d362173010ce6a7ab11c23aba8e777a3943f84c3ac1f39a9e3a47c3ad6719R122

Although it's true that pdfminer.six hasn't had a commit in nine months, I'm not quite ready to give up on it. Activity on the project has been sporadic in the past, followed by spurts of improvements. I worry that monkey-patching puts us down a path of substantially greater development complexity, and may make it more difficult to incorporate pdfminer.six's future improvements (if they occur).

It kind of seems to me like the subset of functionality in pdfminer.six that is actually used could simply be incorporated into pdfplumber

Although it may not seem so at first, pdfminer.six does a lot of heavy lifting, handling a lot of the frustrating edge-cases, nooks, and crannies of the PDF spec. To take a random-ish example, see https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/cmapdb.py

As for switching to a different library ... there doesn't seem to exist one that has: [...]

pypdfium2 seems the closest to me at this point. I haven't quite figured out how to get low-level access to individual char/path/etc. objects, but it seems like it should be possible. What do you think?


dhdaines commented Aug 2, 2023

Although it's true that pdfminer.six hasn't had a commit in nine months, I'm not quite ready to give up on it. Activity on the project has been sporadic in the past, followed by spurts of improvements. I worry that monkey-patching puts us down a path of substantially greater development complexity, and may make it more difficult to incorporate pdfminer.six's future improvements (if they occur).

Well... it's only sort of monkey-patching, since the pdfminer.six API is designed around inheritance (which is really a bad idea in my opinion, but it is what it is). I can at least submit a pull request to support MCIDs in the PDFPageAggregator, or better yet just a method to access the last object created by a PDFLayoutAnalyzer.

pypdfium2 seems the closest to me at this point. I haven't quite figured out how to get low-level access to individual char/path/etc. objects, but it seems like it should be possible. What do you think?

Ah, yes, indeed. The Python bindings won't let you do this, but it is easy to call the underlying C API, which, at least for text, seems to give you everything you need to get individual characters and all their attributes:

https://pdfium.googlesource.com/pdfium/+/refs/heads/main/public/fpdf_text.h

You can see how to do this in the pypdfium2 documentation as well as in my code to read the structure tree.

I shouldn't let my allergy to Google-origin software cloud my judgement here :) and anyway PDFium wasn't originally created by Google and doesn't seem to have been infected by their software engineering practices and tools (monorepo, bazel, abseil, and that whole bestiary)
