Extracting text using the Y coordinates from outlines only sometimes works #1377

lucasgadams · 2022-10-02T19:34:26Z

I'm trying to use the Destinations from reader.outlines to extract the text corresponding to each outline. I am using the top attribute from each destination to start a text selection, and then using the next (possibly nested) destination top attribute from the outline to mark the end of the text selection.

Then I am using the visitor_text argument of Page.extract_text to slice text according to the outline ranges, as shown in the documentation.

I am finding that sometimes the extractions work, but a good percentage of them don't align correctly. When looking at the PDF in Preview on Mac, the outlines link to the correct place in the document when clicked.

The relevant code is below. Am I doing this correctly? Do I have the right idea, that I should be able to extract the chunk of text underneath an outline, and I can do that by using the top attribute of subsequent (possibly nested) outline Destinations? Note that all of the left values of the destinations are the same, so I believe I only need to deal with y coordinates. Is there anything else I can do to make this work?

I've tested a few different ranges for extract_text and can see where it seems to be getting messed up. So I don't think the issue is with the actual top values; it must be in the extract_text algorithm itself.
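
The range construction described above can be sketched roughly as follows. This is only an illustration: `OutlineNode` here is a hypothetical stand-in for the custom class used in this issue, `y_ranges` is an invented helper, and falling back to `0.0` (bottom of the page) when the next outline item is on another page is an assumption, not something stated in the issue.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OutlineNode:
    # Hypothetical stand-in for the custom node class mentioned in the issue.
    title: str
    page_number: int
    top: float


def y_ranges(nodes: List[OutlineNode]) -> List[Tuple[str, int, float, float]]:
    """Pair each node's top (ymax) with the next node's top (ymin).

    `nodes` is expected in DFS order of the outline tree.
    """
    ranges = []
    for cur, nxt in zip(nodes, nodes[1:] + [None]):
        if nxt is not None and nxt.page_number == cur.page_number:
            ymin = nxt.top
        else:
            # Assumption: the section runs to the bottom of its page.
            ymin = 0.0
        ranges.append((cur.title, cur.page_number, cur.top, ymin))
    return ranges
```

Each resulting tuple can then be passed to a function like the `extract_text(ymax, ymin, page_number)` shown below.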

Environment

Python 3.9.13
Mac OS, M1 chip

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-12.6-arm64-arm-64bit

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.11.0

Code + PDF

This is a minimal, complete example that shows the issue:

def extract_text(ymax, ymin, page_number):
    node_text = []
    def visitor_text(text, cm, tm, fontDict, fontSize):
        y = tm[5]
        if ymin < y <= ymax:
            node_text.append(text)
    reader.pages[page_number].extract_text(visitor_text=visitor_text)
    return "".join(node_text)

# `nodes` holds instances of my own class that wraps outline Destinations. Consecutive entries in the list follow the DFS order of the outline tree.

nodes[7]
> [OutlineNode] Title=(i) The Company's failure to p; n_child=0; page_number=1; top=639; left=70

nodes[8]
> [OutlineNode] Title=(ii) A Conversion Failure as d; n_child=0; page_number=1; top=576; left=70

nodes[9]
> [OutlineNode] Title=(iii) The Company or any subsi; n_child=0; page_number=1; top=550; left=70

extract_text(639, 576, 1)
> '\n(i) '

extract_text(639, 565, 1)
> '\n(i) '

extract_text(639, 564, 1)
> "\n(i) The Company's failure to pay to the Holder any amount of Principal, Interest, or other amounts when and as due under this Note (including, without limitation, the Company's failure to pay any redemption payments or amounts hereunder) or any other Transaction Document;  \n(ii) "

extract_text(576, 550, 1)
> "The Company's failure to pay to the Holder any amount of Principal, Interest, or other amounts when and as due under this Note (including, without limitation, the Company's failure to pay any redemption payments or amounts hereunder) or any other Transaction Document;  \n(ii) "

The text selection from nodes[7] to nodes[8] should read:

(i) The Company's failure to pay to the Holder any amount of Principal, Interest, or other amounts when and as due under this Note (including, without limitation, the Company's failure to pay any redemption payments or amounts hereunder) or any other Transaction
Document;

But instead is just:

'\n(i) '

You can see that I need to go all the way down to y coordinate 564 in order to get the text selection, but that extends beyond the subsequent outline's top attribute (576).

I've attached an image of the PDF selection; I cannot upload the whole document for privacy reasons.


lucasgadams · 2022-10-03T13:40:52Z

Tried some other packages for reference. pdfminer.six also gets the same bookmarks, but I can't figure out how to select text from a coordinate range with that package. However, printing out the full text of the page in question, I can see there are some ordering issues, so I bet it would have the same problem extracting. The commercial pdftron Python bindings do seem to work, however. I used this example https://www.pdftron.com/documentation/samples/py/TextExtractTest (with a trial API key), and using the y coordinate ranges from the bookmarks I was able to extract the correct text using the bounding boxes. Just FYI.

pubpub-zz · 2022-10-03T17:06:26Z

Hard to say without the PDF... can't you just extract this page?
What I would recommend is adding some print statements as debug output to identify the exact values.

lucasgadams · 2022-10-03T17:43:51Z

Yes I can extract the page, but I want to basically be able to link an outline item to the text that it corresponds to. So I'm trying to use subsequent outline item coordinates to define the range to select text from. The issue seems to be in the algorithm that decides what text is contained within the bounding box.

pubpub-zz · 2022-10-03T18:34:59Z

"extract this page" I mean produce a new pdf with this page only?

lucasgadams · 2022-10-03T20:09:19Z

convert_note_page_2.pdf

Ah right, here is the page in question

pubpub-zz · 2022-10-03T21:21:43Z

Thanks.
I've used the following function:
def visitor_text(text, cm, tm, fontDict, fontSize):
    print(cm, tm, text)

One point to note: the absolute position (which is consistent with the outline positions) is the result of the matrix product cm.tm. Here, the cm matrix is the identity, so looking at tm only is OK.
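
For documents where cm is not the identity, the combined position could be computed along these lines. This is a sketch in plain Python, not PyPDF2 code: `mult`, `absolute_origin` and `make_visitor` are invented names, matrices are assumed to be the usual 6-element PDF form `[a, b, c, d, e, f]`, and the order (tm applied first, then cm) is an assumption based on the PDF row-vector convention.

```python
def mult(m, n):
    """Multiply two PDF matrices given as 6-element sequences [a, b, c, d, e, f]."""
    return [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]


def absolute_origin(cm, tm):
    """Return (x, y) of the text origin in page space (tm first, then cm)."""
    combined = mult(tm, cm)
    return combined[4], combined[5]


def make_visitor(ymin, ymax, sink):
    """Build a visitor that keeps text whose absolute y lies in (ymin, ymax]."""
    def visitor_text(text, cm, tm, font_dict, font_size):
        _, y = absolute_origin(cm, tm)
        if ymin < y <= ymax:
            sink.append(text)
    return visitor_text
```

With an identity cm this reduces to reading `tm[4]` and `tm[5]` directly, which matches the simpler check used elsewhere in this thread.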

When I reran your code:
extract_text(639,576,0)
I get:
"\n(i) The Company's failure to pay to the Holder any amount of \nPrincipal, Interest, or other amounts when and as due under this Note (including, without limitation, the \nCompany's failure to pay any redemp tion payments or amounts hereunder) or any other Transaction \nDocument; "

Can you confirm you are getting the same result? Also, maybe you should add a nonlocal statement to ensure the variable's scope.

lucasgadams · 2022-10-04T13:13:46Z

Interesting, I do get the same result as you when calling the function on a document that only contains that single page. However I retested using the full document and indeed it is still the wrong selection.

from PyPDF2 import PdfReader
import logging

logger = logging.getLogger("PyPDF2")
logger.setLevel(logging.INFO)

reader_short = PdfReader("../convert_note_page_2.pdf")
reader_full = PdfReader("../convertible_note.pdf")

def extract_text(ymax, ymin, page_number, reader):
    node_text = []
    def visitor_text(text, cm, tm, fontDict, fontSize):
        y = tm[5]
        if ymin < y <= ymax:
            node_text.append(text)
    reader.pages[page_number].extract_text(visitor_text=visitor_text)
    return "".join(node_text)

extract_text(639, 576, 0, reader_short)

"\n(i) The Company's failure to pay to the Holder any amount of \nPrincipal, Interest, or other amounts when and as due under this Note  (including, without limitation, the \nCompany's failure to pay any redemp tion payments or amounts hereunder) or any other Transaction \nDocument;   "

extract_text(639, 576, 1, reader_full)

'\n(i) '

extract_text(639, 550, 1, reader_full)

"\n(i) The Company's failure to pay to the Holder any amount of Principal, Interest, or other amounts when and as due under this Note (including, without limitation, the Company's failure to pay any redemption payments or amounts hereunder) or any other Transaction Document;  \n(ii) "

lucasgadams · 2022-10-04T13:16:15Z

nonlocal doesn't change anything. It shouldn't be needed because I'm not reassigning the variable name, just appending to the object.
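
The distinction can be shown with a small standalone example (the function names are made up for illustration): mutating a list from an inner function works without nonlocal, while rebinding a name with `+=` on a string does need it.

```python
def collect_by_append():
    items = []

    def add(x):
        # Mutates the existing list object; the name `items` is never rebound.
        items.append(x)

    add("a")
    add("b")
    return items


def collect_by_rebinding():
    text = ""

    def add(x):
        # `+=` on a str rebinds the name, so nonlocal is required here.
        nonlocal text
        text += x

    add("a")
    add("b")
    return text
```

Without the `nonlocal` line, the second variant would raise UnboundLocalError on the first call to `add`.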

lucasgadams · 2022-10-04T13:51:02Z

OK, I tried a few things like exporting the first 2 pages, redacting info, and exporting a full copy, all in Mac Preview, and every time it actually fixed the issue and the text extraction was correct. However, when I did the same thing in Adobe Acrobat, I was able to repro the issue. So here are the first 2 pages with redacted info (using Adobe), and the issue should now repro.
convert_note_duplicate_Redacted.pdf

from PyPDF2 import PdfReader
import logging

logger = logging.getLogger("PyPDF2")
logger.setLevel(logging.INFO)

reader_short = PdfReader("../convert_note_page_2.pdf")
reader_full = PdfReader("../convertible_note.pdf")
reader_redacted_2 = PdfReader("../convert_note_duplicate_Redacted.pdf")

def extract_text(ymax, ymin, page_number, reader):
    node_text = []
    def visitor_text(text, cm, tm, fontDict, fontSize):
        y = tm[5]
        if ymin < y <= ymax:
            node_text.append(text)
    reader.pages[page_number].extract_text(visitor_text=visitor_text)
    return "".join(node_text)

extract_text(639, 576, 0, reader_short)

"\n(i) The Company's failure to pay to the Holder any amount of \nPrincipal, Interest, or other amounts when and as due under this Note  (including, without limitation, the \nCompany's failure to pay any redemp tion payments or amounts hereunder) or any other Transaction \nDocument;   "

extract_text(639, 576, 1, reader_full)

'\n(i) '

extract_text(639, 550, 1, reader_full)

"\n(i) The Company's failure to pay to the Holder any amount of Principal, Interest, or other amounts when and as due under this Note (including, without limitation, the Company's failure to pay any redemption payments or amounts hereunder) or any other Transaction Document;  \n(ii) "

extract_text(639, 576, 1, reader_redacted_2)

'\n(i) '

extract_text(639, 550, 1, reader_redacted_2)

"\n(i) The Company's failure to pay to the Holder any amount of Principal, Interest, or other amounts when and as due under this Note (including, without limitation, the Company's failure to pay any redemption payments or amounts hereunder) or any other Transaction Document;  \n(ii) "

pubpub-zz · 2022-10-04T17:32:42Z

Got it!
You are facing the issue that has been/will be fixed in PR #1373 (produced by @srogmann).

This PR should be merged soon; meanwhile, you should be able to copy it locally.

pubpub-zz · 2022-10-09T11:46:01Z

PR released. This issue can be closed

Before, the text-visitor function had been called at each change of the output. But this can lead to wrong coordinates, because the output may be sent after the text matrix has been changed for the next text. As an example, have a look at resources/Sample_Td-matrix.pdf: the text_matrix is computed correctly at the Td operations, but the text had been sent after applying the next transformation.

In this pull request the texts are sent inside the TJ and Tj operations. This may lead to sending letters instead of words:

```
x=264.53, y=403.13, text='M'
x=264.53, y=403.13, text='etad'
x=264.53, y=403.13, text='ata'
x=307.85, y=403.13, text=' '
```

Therefore there is a second commit which introduces a temporary visitor inside the processing of TJ. The temp visitor is used to collect the letters of TJ, which will be sent after processing of TJ. When setting the temp visitor, the original parameter is manipulated. I don't know if this is bad style in Python. In case of bad style, a local variable current_text_visitor may be introduced.

See also issue #1377. I haven't checked if #1377 had the Td-matrix problem or the one to be solved by this PR.

--

This PR is a copy of #1389. The PR #1389 was made a long time ago (before we renamed to pypdf), but it still seems valuable. This PR migrated the changes to the new codebase. Full credit to rogmann for all of the changes.

Co-authored-by: rogmann <github@rogmann.org>
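
A toy model may help show why sending text late shifts its coordinates. The sketch below is not PyPDF2's code: `run`, its `emit_inside_tj` flag, `select`, and the operator list with its coordinates are all invented for illustration, and only a simplified Td (translate) and Tj (show text) are modelled.

```python
def run(ops, emit_inside_tj):
    """Interpret a toy list of ("Td", dx, dy) / ("Tj", text) operations.

    Returns a list of (y, text) pairs as a visitor would see them.
    """
    x, y = 0.0, 0.0
    pending = ""
    seen = []
    for op, *args in ops:
        if op == "Td":
            # The text position changes first ...
            x, y = x + args[0], y + args[1]
            if not emit_inside_tj and pending:
                # ... so a late flush reports the NEW y for the OLD text.
                seen.append((y, pending))
                pending = ""
        elif op == "Tj":
            if emit_inside_tj:
                seen.append((y, args[0]))
            else:
                pending += args[0]
    if pending:
        seen.append((y, pending))
    return seen


def select(seen, ymax, ymin):
    """Join the text whose reported y lies in (ymin, ymax]."""
    return "".join(text for y, text in seen if ymin < y <= ymax)


# Made-up content stream loosely modelled on the page in this issue.
OPS = [
    ("Td", 70, 639), ("Tj", "(i) "),
    ("Td", 0, -13), ("Tj", "The Company's failure"),
    ("Td", 0, -63), ("Tj", "(ii) "),
]
```

In this toy model, `select(run(OPS, True), 639, 576)` keeps both the marker and the body text, while `select(run(OPS, False), 639, 576)` keeps only the marker, because the body text is reported at the y of the line that follows it. That resembles the symptom reported above, though whether it is the exact cause in the original PDF is not confirmed in this thread.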

pubpub-zz closed this as completed Oct 9, 2022

srogmann mentioned this issue Oct 10, 2022

Coords in extract text #1389

Closed

MartinThoma mentioned this issue Dec 24, 2023

MAINT: Change the positions of the calls of the visitor-function #2364

Open
