Extracting text using the Y coordinates from outlines only sometimes works #1377

Closed
lucasgadams opened this issue Oct 2, 2022 · 11 comments

Comments

@lucasgadams

I'm trying to use the Destinations from reader.outlines to extract the text corresponding to each outline. I am using the top attribute from each destination to start a text selection, and then using the next (possibly nested) destination top attribute from the outline to mark the end of the text selection.

Then I am using the visitor_text argument of Page.extract_text to slice text according to the outline ranges, as shown in the documentation.

I am finding that sometimes the extractions work, but a good percentage of them don't align correctly. When looking at the PDF in Preview on Mac, the outlines link to the correct place in the document when clicked.

The relevant code is below. Am I doing this correctly? Do I have the right idea, that I should be able to extract the chunk of text underneath an outline, and I can do that by using the top attribute of subsequent (possibly nested) outline Destinations? Note that all of the left values of the destinations are the same, so I believe I only need to deal with y coordinates. Is there anything else I can do to make this work?
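For illustration, the range-building idea described above could be sketched as follows. This is a minimal sketch, not code from the issue: the OutlineNode dataclass is a hypothetical stand-in for the custom node class mentioned later, y_ranges and page_bottom are made-up names, and a section that continues onto the next page is simply cut off at the bottom of its starting page.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class OutlineNode:
    # Hypothetical stand-in for the custom node class used in this issue.
    title: str
    page_number: int
    top: float


def y_ranges(
    nodes: List[OutlineNode], page_bottom: float = 0.0
) -> List[Tuple[int, float, float]]:
    """Return (page_number, ymax, ymin) for each node, given nodes in DFS order."""
    ranges = []
    for i, node in enumerate(nodes):
        nxt: Optional[OutlineNode] = nodes[i + 1] if i + 1 < len(nodes) else None
        if nxt is not None and nxt.page_number == node.page_number:
            # The next outline item on the same page closes the range.
            ymin = nxt.top
        else:
            # Assumption: otherwise the range runs to the bottom of the page.
            ymin = page_bottom
        ranges.append((node.page_number, node.top, ymin))
    return ranges


nodes = [
    OutlineNode("(i)", 1, 639),
    OutlineNode("(ii)", 1, 576),
    OutlineNode("(iii)", 1, 550),
]
print(y_ranges(nodes))
```

Each tuple can then be passed as (ymax, ymin, page_number) to a visitor-based extraction function like the one below.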

I've tested a few different ranges for extract_text and can see where it seems to be getting messed up. So I don't think it is an issue with the actual top values, it must be in the extract_text algorithm itself.

Environment

Python 3.9.13
Mac OS, M1 chip

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-12.6-arm64-arm-64bit

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.11.0

Code + PDF

This is a minimal, complete example that shows the issue:

def extract_text(ymax, ymin, page_number):
    node_text = []
    def visitor_text(text, cm, tm, fontDict, fontSize):
        y = tm[5]
        if ymin < y <= ymax:
            node_text.append(text)
    reader.pages[page_number].extract_text(visitor_text=visitor_text)
    return "".join(node_text)

# Nodes here are my own class that hold outline Destinations. Subsequent ones in the list correspond to DFS order of the outline tree.

nodes[7]
> [OutlineNode] Title=(i) The Company's failure to p; n_child=0; page_number=1; top=639; left=70

nodes[8]
> [OutlineNode] Title=(ii) A Conversion Failure as d; n_child=0; page_number=1; top=576; left=70

nodes[9]
> [OutlineNode] Title=(iii) The Company or any subsi; n_child=0; page_number=1; top=550; left=70

extract_text(639, 576, 1)
> '\n(i) '

extract_text(639, 565, 1)
> '\n(i) '

extract_text(639, 564, 1)
> "\n(i) The Company's failure to pay to the Holder any amount of Principal, Interest, or other amounts when and as due under this Note (including, without limitation, the Company's failure to pay any redemption payments or amounts hereunder) or any other Transaction Document;  \n(ii) "

extract_text(576, 550, 1)
> "The Company's failure to pay to the Holder any amount of Principal, Interest, or other amounts when and as due under this Note (including, without limitation, the Company's failure to pay any redemption payments or amounts hereunder) or any other Transaction Document;  \n(ii) "

The text selection from nodes[7] to nodes[8] should read:

(i) The Company's failure to pay to the Holder any amount of Principal, Interest, or other amounts when and as due under this Note (including, without limitation, the Company's failure to pay any redemption payments or amounts hereunder) or any other Transaction
Document;

But instead is just:

'\n(i) '

You can see that I need to go all the way down to y coordinate 564 in order to get the text selection, but that extends beyond the subsequent outline's top attribute.

I've attached an image of the PDF selection, I cannot upload the whole thing due to privacy.

image

@lucasgadams
Author

Tried some other packages for reference. pdfminer.six also gets the same bookmarks, but I can't figure out how to select text from a coordinate range with that package. However, printing out the full text of the page in question, I can see there are some ordering issues, so I bet it would have the same issue extracting. The commercial pdftron Python bindings seem to work, however. I used this example https://www.pdftron.com/documentation/samples/py/TextExtractTest (with a trial API key), and using the y coordinate ranges from the bookmarks I was able to extract the correct text using the bounding boxes. Just FYI.

@pubpub-zz
Collaborator

Hard to say without the PDF... can't you just extract this page?
What I would recommend is to add some print statements as debug output to identify the exact values.

@lucasgadams
Author

Yes I can extract the page, but I want to basically be able to link an outline item to the text that it corresponds to. So I'm trying to use subsequent outline item coordinates to define the range to select text from. The issue seems to be in the algorithm that decides what text is contained within the bounding box.

@pubpub-zz
Collaborator

By "extract this page" I mean: can you produce a new PDF containing this page only?

@lucasgadams
Author

convert_note_page_2.pdf

Ah right, here is the page in question

@pubpub-zz
Collaborator

Thanks.
I've used the following function:
def visitor_text(text, cm, tm, fontDict, fontSize):
    # Print both matrices and the text fragment for every visitor call
    print(cm, tm, text)

One point to note: the absolute position (which is consistent with the outline positions) is the result of the matrix product cm.tm. Here the cm matrix is the identity, so looking at tm only is OK.
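For documents where cm is not the identity, the absolute position could be computed along these lines. This is a minimal sketch: mult and absolute_xy are illustrative helper names, and the multiplication order (text matrix applied first, then cm) is an assumption based on the usual PDF convention of six-element [a, b, c, d, e, f] matrices acting on row vectors.

```python
def mult(m, n):
    """Multiply two PDF matrices given as [a, b, c, d, e, f] (m applied first)."""
    return [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]


def absolute_xy(cm, tm):
    """Return the (x, y) translation of the combined matrix (assumed order: tm, then cm)."""
    combined = mult(tm, cm)
    return combined[4], combined[5]


identity = [1, 0, 0, 1, 0, 0]
# With an identity cm the result is simply (tm[4], tm[5]).
print(absolute_xy(identity, [1, 0, 0, 1, 70, 639]))
```

Inside a visitor function, the y returned by absolute_xy(cm, tm) would then replace the plain tm[5] comparison.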

When I've rerun the code as you did:
extract_text(639,576,0)
I get:
"\n(i) The Company's failure to pay to the Holder any amount of \nPrincipal, Interest, or other amounts when and as due under this Note (including, without limitation, the \nCompany's failure to pay any redemp tion payments or amounts hereunder) or any other Transaction \nDocument; "

Can you confirm you are getting the same result? Also, maybe you should add a nonlocal statement to ensure the variable's scope.

@lucasgadams
Author

Interesting, I do get the same result as you when calling the function on a document that only contains that single page. However I retested using the full document and indeed it is still the wrong selection.

from PyPDF2 import PdfReader
import logging

logger = logging.getLogger("PyPDF2")
logger.setLevel(logging.INFO)
reader_short = PdfReader("../convert_note_page_2.pdf")
reader_full = PdfReader("../convertible_note.pdf")
def extract_text(ymax, ymin, page_number, reader):
    node_text = []
    def visitor_text(text, cm, tm, fontDict, fontSize):
        y = tm[5]
        if ymin < y <= ymax:
            node_text.append(text)
    reader.pages[page_number].extract_text(visitor_text=visitor_text)
    return "".join(node_text)
extract_text(639, 576, 0, reader_short)
"\n(i) The Company's failure to pay to the Holder any amount of \nPrincipal, Interest, or other amounts when and as due under this Note  (including, without limitation, the \nCompany's failure to pay any redemp tion payments or amounts hereunder) or any other Transaction \nDocument;   "
extract_text(639, 576, 1, reader_full)
'\n(i) '
extract_text(639, 550, 1, reader_full)
"\n(i) The Company's failure to pay to the Holder any amount of Principal, Interest, or other amounts when and as due under this Note (including, without limitation, the Company's failure to pay any redemption payments or amounts hereunder) or any other Transaction Document;  \n(ii) "

@lucasgadams
Author

nonlocal doesn't change anything, it shouldn't be needed because I'm not reassigning the variable name, just appending to the object.
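A small standalone illustration of that distinction (the names collect, visit, items and count are made up for this sketch): appending to a list from the enclosing scope works without nonlocal, while rebinding a name such as a counter requires it.

```python
def collect(parts):
    items = []
    count = 0

    def visit(text):
        nonlocal count      # required: "count += 1" rebinds the name
        items.append(text)  # no nonlocal needed: the list is only mutated
        count += 1

    for part in parts:
        visit(part)
    return "".join(items), count


print(collect(["a", "b"]))
```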

@lucasgadams
Author

OK, I tried a few things like exporting the first 2 pages, redacting info, and exporting a full copy, all in Mac Preview, and every time it actually fixed the issue and the text extraction was correct. However, when I did the same thing in Adobe Acrobat, I was able to repro the issue. So here are the first 2 pages with redacted info (using Adobe), and the issue should now repro.
convert_note_duplicate_Redacted.pdf

from PyPDF2 import PdfReader
import logging

logger = logging.getLogger("PyPDF2")
logger.setLevel(logging.INFO)
reader_short = PdfReader("../convert_note_page_2.pdf")
reader_full = PdfReader("../convertible_note.pdf")
reader_redacted_2 = PdfReader("../convert_note_duplicate_Redacted.pdf")
def extract_text(ymax, ymin, page_number, reader):
    node_text = []
    def visitor_text(text, cm, tm, fontDict, fontSize):
        y = tm[5]
        if ymin < y <= ymax:
            node_text.append(text)
    reader.pages[page_number].extract_text(visitor_text=visitor_text)
    return "".join(node_text)
extract_text(639, 576, 0, reader_short)
"\n(i) The Company's failure to pay to the Holder any amount of \nPrincipal, Interest, or other amounts when and as due under this Note  (including, without limitation, the \nCompany's failure to pay any redemp tion payments or amounts hereunder) or any other Transaction \nDocument;   "
extract_text(639, 576, 1, reader_full)
'\n(i) '
extract_text(639, 550, 1, reader_full)
"\n(i) The Company's failure to pay to the Holder any amount of Principal, Interest, or other amounts when and as due under this Note (including, without limitation, the Company's failure to pay any redemption payments or amounts hereunder) or any other Transaction Document;  \n(ii) "
extract_text(639, 576, 1, reader_redacted_2)   
'\n(i) '
extract_text(639, 550, 1, reader_redacted_2)   
"\n(i) The Company's failure to pay to the Holder any amount of Principal, Interest, or other amounts when and as due under this Note (including, without limitation, the Company's failure to pay any redemption payments or amounts hereunder) or any other Transaction Document;  \n(ii) "

@pubpub-zz
Collaborator

Got it!
You are facing the issue that has been / will be fixed in PR #1373 (produced by @srogmann).

This PR should be merged soon; meanwhile, you should be able to apply it locally.

@pubpub-zz
Collaborator

PR released. This issue can be closed

MartinThoma added a commit that referenced this issue Dec 24, 2023
Previously, the text visitor function was called at each change of the output.
But this can lead to wrong coordinates because the output may be sent after the text matrix has been changed for the next text.
As an example, have a look at resources/Sample_Td-matrix.pdf: the text_matrix is computed correctly at the Td operations, but the text was sent after applying the next transformation.

In this pull request the texts are sent inside the TJ and Tj operations.
This may lead to sending letters instead of words:

```
    x=264.53, y=403.13, text='M'
    x=264.53, y=403.13, text='etad'
    x=264.53, y=403.13, text='ata'
    x=307.85, y=403.13, text=' '
```

Therefore there is a second commit which introduces a temporary visitor inside the processing of TJ.
The temp visitor is used to collect the letters of TJ, which are sent after TJ has been processed.
When setting the temp visitor, the original parameter is manipulated. I don't know if this is bad style in Python.
If it is bad style, a local variable current_text_visitor may be introduced.

See also issue #1377. I haven't checked if #1377 had the Td-matrix-problem or the one to be solved by this PR.
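
The temporary-visitor idea can be illustrated with a simplified sketch. This is not the code from the pull request: process_tj, temp_visitor and the two-argument visitor signature are hypothetical simplifications that only show how the fragments of one TJ operation can be collected and sent once, with the text matrix that was active for that operation.

```python
def process_tj(fragments, tm, visitor):
    """Hypothetical stand-in for handling one TJ operation."""
    collected = []

    def temp_visitor(text):
        # Collect the pieces instead of forwarding each one immediately.
        collected.append(text)

    for fragment in fragments:
        temp_visitor(fragment)

    # Send the joined text once, with the matrix that belongs to this TJ.
    visitor("".join(collected), tm)


seen = []
process_tj(
    ["M", "etad", "ata"],
    [1, 0, 0, 1, 264.53, 403.13],
    lambda text, tm: seen.append((tm[4], tm[5], text)),
)
print(seen)
```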

--

This PR is a copy of #1389
The PR#1389 was made a long time ago (before we renamed to pypdf),
but it seems still valuable.

This PR migrated the changes to the new codebase. Full credit
to rogmann for all of the changes.

Co-authored-by: rogmann <github@rogmann.org>