Extracting text using the Y coordinates from outlines only sometimes works #1377
Comments
Tried some other packages for reference.
Hard to say without the PDF. Can't you just extract this page?
Yes, I can extract the page, but I basically want to link an outline item to the text it corresponds to. So I'm trying to use the coordinates of subsequent outline items to define the range to select text from. The issue seems to be in the algorithm that decides what text is contained within the bounding box.
By "extract this page" I mean: can you produce a new PDF containing only this page?
Ah right, here is the page in question.
Thanks. One point to note: the absolute position (which is consistent with the outline positions) is the result of the matrix product cm.tm. Here, since the cm matrix is the identity, looking at tm only is OK. When I reran the code as you did: Can you confirm you are getting the result? Also, maybe you should add a nonlocal statement to make the variable's scope explicit.
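To illustrate the point about combining `cm` and `tm`, here is a minimal sketch. It assumes both matrices arrive in the visitor as six-element PDF matrices `[a, b, c, d, e, f]` and that the text origin is first mapped through `tm` and then through `cm` (the PDF row-vector convention). The helper names `absolute_position` and `make_visitor` are hypothetical and not part of PyPDF2.

```python
def absolute_position(cm, tm):
    """Map the text-space origin through tm, then through cm.

    Both arguments are six-element PDF matrices [a, b, c, d, e, f].
    Returns the (x, y) position of the text origin in page space.
    """
    x = tm[4] * cm[0] + tm[5] * cm[2] + cm[4]
    y = tm[4] * cm[1] + tm[5] * cm[3] + cm[5]
    return x, y


def make_visitor(ymin, ymax, collected):
    """Build a visitor that keeps text whose absolute y lies in (ymin, ymax]."""
    def visitor_text(text, cm, tm, font_dict, font_size):
        _, y = absolute_position(cm, tm)
        if ymin < y <= ymax:
            collected.append(text)
    return visitor_text
```

When `cm` is the identity, as on the page discussed here, `absolute_position` simply returns `(tm[4], tm[5])`, which is why reading `tm[5]` alone works in this case. The visitor could be passed as `visitor_text=make_visitor(576, 639, node_text)`; because it only appends to a list, no `nonlocal` statement is needed.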
Interesting, I do get the same result as you when calling the function on a document that only contains that single page. However, I retested using the full document and it still produces the wrong selection.

from PyPDF2 import PdfReader
import logging

logger = logging.getLogger("PyPDF2")
logger.setLevel(logging.INFO)

reader_short = PdfReader("../convert_note_page_2.pdf")
reader_full = PdfReader("../convertible_note.pdf")

def extract_text(ymax, ymin, page_number, reader):
    node_text = []
    def visitor_text(text, cm, tm, fontDict, fontSize):
        y = tm[5]
        if ymin < y <= ymax:
            node_text.append(text)
    reader.pages[page_number].extract_text(visitor_text=visitor_text)
    return "".join(node_text)

extract_text(639, 576, 0, reader_short)
extract_text(639, 576, 1, reader_full)
extract_text(639, 550, 1, reader_full)
OK, I tried a few things, like exporting the first 2 pages, redacting info, and exporting a full copy, all in Mac Preview, and every time it actually fixed the issue and the text extraction was correct. However, when I did the same thing in Adobe Acrobat, I was able to repro the issue. So here are the first 2 pages with redacted info (using Adobe), and the issue should now repro.

from PyPDF2 import PdfReader
import logging

logger = logging.getLogger("PyPDF2")
logger.setLevel(logging.INFO)

reader_short = PdfReader("../convert_note_page_2.pdf")
reader_full = PdfReader("../convertible_note.pdf")
reader_redacted_2 = PdfReader("../convert_note_duplicate_Redacted.pdf")

def extract_text(ymax, ymin, page_number, reader):
    node_text = []
    def visitor_text(text, cm, tm, fontDict, fontSize):
        y = tm[5]
        if ymin < y <= ymax:
            node_text.append(text)
    reader.pages[page_number].extract_text(visitor_text=visitor_text)
    return "".join(node_text)

extract_text(639, 576, 0, reader_short)
extract_text(639, 576, 1, reader_full)
extract_text(639, 550, 1, reader_full)
extract_text(639, 576, 1, reader_redacted_2)
extract_text(639, 550, 1, reader_redacted_2)
PR released. This issue can be closed.
Previously, the text visitor function was called at each change of the output. This can lead to wrong coordinates, because the output may be sent after the text matrix has already been changed for the next text. As an example, have a look at resources/Sample_Td-matrix.pdf: the text_matrix is computed correctly at the Td operations, but the text was sent after applying the next transformation.

In this pull request the texts are sent inside the TJ and Tj operations. This may lead to sending letters instead of words:

```
x=264.53, y=403.13, text='M'
x=264.53, y=403.13, text='etad'
x=264.53, y=403.13, text='ata'
x=307.85, y=403.13, text=' '
```

Therefore there is a second commit which introduces a temporary visitor inside the processing of TJ. The temporary visitor is used to collect the letters of TJ, which are sent after TJ has been processed. When setting the temporary visitor, the original parameter is manipulated. I don't know if this is bad style in Python. If it is, a local variable current_text_visitor may be introduced.

See also issue #1377. I haven't checked whether #1377 had the Td-matrix problem or the one solved by this PR.

--

This PR is a copy of #1389. PR #1389 was made a long time ago (before we renamed to pypdf), but it still seems valuable. This PR migrates the changes to the new codebase. Full credit to rogmann for all of the changes.

Co-authored-by: rogmann <github@rogmann.org>
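The letter-by-letter output shown in the PR description can also be regrouped on the caller's side. The following is only a sketch of that idea, not the implementation from the PR: it assumes fragments are available as `(x, y, text)` tuples in emission order, and the helper name `merge_fragments` is hypothetical.

```python
def merge_fragments(fragments):
    """Join consecutive fragments that report the same (x, y) position."""
    merged = []
    for x, y, text in fragments:
        if merged and merged[-1][0] == x and merged[-1][1] == y:
            # Same position as the previous fragment: extend its text.
            px, py, ptext = merged[-1]
            merged[-1] = (px, py, ptext + text)
        else:
            merged.append((x, y, text))
    return merged


# Sample values taken from the PR description.
fragments = [
    (264.53, 403.13, "M"),
    (264.53, 403.13, "etad"),
    (264.53, 403.13, "ata"),
    (307.85, 403.13, " "),
]
print(merge_fragments(fragments))
```

Only adjacent fragments are merged, so two separate runs of text that happen to share a position elsewhere on the page stay separate.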
I'm trying to use the `Destinations` from `reader.outlines` to extract the text corresponding to each outline. I am using the `top` attribute from each destination to start a text selection, and then using the `top` attribute of the next (possibly nested) destination in the outline to mark the end of the text selection.

Then I am using the `visitor_text` argument of `Page.extract_text` to slice text according to the outline ranges, as shown in the documentation.

I am finding that sometimes the extractions work, but a good percentage of them don't align correctly. When looking at the PDF in Preview on Mac, the outlines link to the correct place in the document when clicked.

The relevant code is below. Am I doing this correctly? Do I have the right idea, that I should be able to extract the chunk of text underneath an outline, and that I can do that by using the `top` attribute of subsequent (possibly nested) outline `Destinations`? Note that all of the `left` values of the destinations are the same, so I believe I only need to deal with y coordinates. Is there anything else I can do to make this work?

I've tested a few different ranges for `extract_text` and can see where it seems to be getting messed up. So I don't think it is an issue with the actual `top` values; it must be in the `extract_text` algorithm itself.

Environment
Python 3.9.13
Mac OS, M1 chip
Which environment were you using when you encountered the problem?
$ python -m platform
macOS-12.6-arm64-arm-64bit
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.11.0
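The range-building idea described in the question (start at one destination's `top`, stop at the next destination's `top`) can be sketched as follows. This is an illustration under stated assumptions, not the code from the report: `Dest` is a hypothetical stand-in for an outline destination carrying only a title, page index, and `top` value, and the outline is assumed to be a list in which nested levels appear as sub-lists.

```python
from typing import NamedTuple


class Dest(NamedTuple):
    """Hypothetical stand-in for an outline destination."""
    title: str
    page: int
    top: float


def flatten(outline):
    """Yield destinations in reading order, descending into nested lists."""
    for item in outline:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item


def y_ranges(outline, page_bottom=0.0):
    """Return (title, page, ymax, ymin) for each destination.

    ymin is the next destination's top when it is on the same page;
    otherwise the range runs to page_bottom (an assumed default of 0.0).
    """
    flat = list(flatten(outline))
    ranges = []
    for cur, nxt in zip(flat, flat[1:] + [None]):
        same_page = nxt is not None and nxt.page == cur.page
        ymin = nxt.top if same_page else page_bottom
        ranges.append((cur.title, cur.page, cur.top, ymin))
    return ranges
```

Each resulting `(ymax, ymin)` pair could then be passed to a helper like the `extract_text(ymax, ymin, page_number, reader)` function used in this thread. The sketch does not handle a section that continues onto the following page; that case would need a second range starting at the top of the next page.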
Code + PDF
This is a minimal, complete example that shows the issue:
The text selection from node[7] to node[8] should read:
But instead is just:
You can see that I need to go all the way down to y coordinate 564 in order to get the text selection, but that extends beyond the subsequent outline's `top` attribute.

I've attached an image of the PDF selection; I cannot upload the whole thing due to privacy.