ExtractText yields nothing for apparently good PDF #168

Closed
chrisinmtown opened this issue Jan 8, 2015 · 12 comments
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow

Comments

@chrisinmtown

PyPDF2 version 1.23 fails to extract any text from the first 3 pages of this PDF file:
http://emma.msrb.org/EP295293-EP10300-EP632440.pdf

The file seems well-formed to me; both Acrobat and evince display it nicely. The linux utility pdftotext converts it to text and I see the expected content just fine.

Here's the relevant bit of my little script:

    from PyPDF2 import PdfFileReader

    with open(filename, "rb") as pdf_file:
        try:
            pdf_obj = PdfFileReader(pdf_file)
            # gather properties
            prop_en = pdf_obj.getIsEncrypted()
            err = ""
            if not prop_en:
                # Look for any text on the first N pages
                prop_img = True
                prop_pg = pdf_obj.getNumPages()
                for i in range(min(prop_pg, 3)):
                    pagei = pdf_obj.getPage(i)
                    pageitext = pagei.extractText()
                    # Set property and stop searching at first text found
                    if len(pageitext) > 0:
                        prop_img = False
                        break
        except Exception as exc:
            # Record the failure so one bad file does not stop the run
            err = str(exc)

Is there a gotcha here that I'm missing? Please advise, and thanks in advance for any help.
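
For anyone reproducing this against a newer release, the same check can be sketched roughly as follows. This is only a sketch under assumptions: it uses the PyPDF2 2.x names `PdfReader`, `is_encrypted`, `pages` and `extract_text()` rather than the 1.23 calls in the script above, and the helper names `has_extractable_text` and `check_file` are made up for illustration. Unlike the original script, it treats whitespace-only output as "no text".

```python
def has_extractable_text(reader, max_pages=3):
    """Return True if any of the first max_pages pages yields non-blank text.

    `reader` is assumed to behave like PyPDF2 2.x's PdfReader: it exposes
    `is_encrypted` and a `pages` sequence whose items have `extract_text()`.
    """
    if reader.is_encrypted:
        raise ValueError("encrypted document: decrypt it before extracting text")
    for index in range(min(len(reader.pages), max_pages)):
        # Treat a None result as an empty string
        text = reader.pages[index].extract_text() or ""
        if text.strip():
            return True
    return False


def check_file(path, max_pages=3):
    """Open a PDF from disk and run the check (hypothetical helper)."""
    from PyPDF2 import PdfReader  # lazy import; assumes PyPDF2 >= 2.0 is installed

    with open(path, "rb") as pdf_file:
        return has_extractable_text(PdfReader(pdf_file), max_pages)
```

A `False` result only means the library returned nothing for those pages. As this issue shows, that can happen for a PDF that does contain text, so it is not proof that the file is image-only.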

@chrisinmtown
Author

I would like to mention that I have many unprotected, machine-searchable (i.e., non-image) PDF files like this - I just posted one link. Unlike the last issue I opened about a freak PDF with a botched header, in this case PyPDF2 fails to get text from annoyingly many of the files I'm trying to process. Thanks for listening.

@zevaverbach

@chrisinmtown I ran into a similar issue today; my PDF and yours are "page extraction: not allowed" according to Adobe Reader. :(

@chrisinmtown
Author

@zevav thanks for the comment but please let's not confuse issues.

Protected files are a whole different ball of wax and I don't expect PyPDF2 to extract anything from such files given no password.

The link I provided above yields a PDF that is not password protected. On this document Adobe Acrobat makes no complaint about extracting text; it happily saves as plain text, and the result is totally usable.

@zevaverbach

@chrisinmtown hey, sorry to send you in the wrong direction; in Acrobat on my machine that document does show (document info) as "page extraction not allowed."

@chrisinmtown
Author

Thanks for clarifying. Now I'm concerned: I don't want to waste anyone's time here on non-issues!

I am using Adobe Acrobat XI on Win7_x64. With this document open in Acrobat I pick File -> Properties, switch to the Security tab of the Document Properties dialog, and there I read "Security Method: No Security", and under the restrictions everything is allowed (Printing, Changing, Copying ...).

Could there possibly be a difference in behavior between Reader and Acrobat on this document?

@zevaverbach

Okay, my bad: I wrote "Acrobat" in my second comment, but I meant "Reader." Here's a screenshot of your file's info in that, on OS X.10, Reader 11.0.10.

@chrisinmtown
Author

I see the exact same thing in the Win7 version of Acrobat Reader XI: Document Assembly and Page Extract Not Allowed; all the rest (Content Copying ..) are Allowed. FWIW, PyPDF2 declares this document unprotected.

I'm starting to think the Properties window is reflecting features of Acrobat Reader rather than the document, do you agree? In my tests of Reader on other PDF documents, it invariably declares "Page Extract Not Allowed". Reader by definition cannot extract pages, right?

Just to be clear, I am sticking to my position :) that the original document is a valid PDF, unprotected, with text content, and I really would like PyPDF2 to be extended so it can handle this doc.
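
As background on the "Not Allowed" entries discussed above: in an encrypted PDF, the restrictions are stored as bit flags in the `/P` entry of the encryption dictionary, and an unencrypted file has no such entry at all. The sketch below decodes such a value using the bit positions from the PDF specification's user access permissions table. The function name, the dictionary keys and the example values are illustrative and do not come from this thread or from PyPDF2.

```python
# Bit positions are 1-based in the PDF specification, so bit 3 has
# the value 1 << 2. The key names here are illustrative labels.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy_or_extract_content": 5,
    "annotate": 6,
    "fill_forms": 9,
    "extract_for_accessibility": 10,
    "assemble_document": 11,
    "print_high_quality": 12,
}


def decode_permissions(p_value):
    """Map each permission name to True (allowed) or False (not allowed).

    `p_value` is the integer stored under /P; it is usually negative
    because the high bits are set. Python's bitwise AND treats negative
    integers as two's complement, so no masking is needed here.
    """
    return {
        name: bool(p_value & (1 << (bit - 1)))
        for name, bit in PERMISSION_BITS.items()
    }
```

For example, `decode_permissions(-1025)` reports `assemble_document` as not allowed and everything else as allowed, because only the bit with value 1024 is cleared.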

@zevaverbach

Dang, you're right! I didn't think to check a PDF that I know PyPDF2 can extract the text of; Reader does indeed show that property for all PDFs. :(

What method in PyPDF2 tells you whether or not a document is protected?

@chrisinmtown
Author

The relevant method on PdfFileReader is getIsEncrypted()

@Rob1080
Contributor

Rob1080 commented Feb 20, 2016

I realise this is an old post, did you ever find the reason for text not being extracted?

@droid-surbhi

Facing same problem. PyPDF2 version 1.26

@MartinThoma MartinThoma added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests labels Apr 7, 2022
@MartinThoma MartinThoma added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label Apr 16, 2022
@py-pdf py-pdf deleted a comment from warriorgithub Jun 6, 2022
@py-pdf py-pdf deleted a comment from SallyRagab Jun 6, 2022
@py-pdf py-pdf deleted a comment from Bitseat Jun 6, 2022
@py-pdf py-pdf deleted a comment from psulin Jun 6, 2022
@MartinThoma
Member

Sadly, the PDF mentioned above is no longer reachable. I think that #924 fixed the issue, and hence I am closing this issue.

It might also share the underlying cause of #242.

If you face the same issue, please open a new bug ticket and upload a PDF that shows the issue (one to which you hold the copyright).

MartinThoma added a commit that referenced this issue Jun 6, 2022
The highlight of the 2.1.0 release is the most massive improvement to the
text extraction capabilities of PyPDF2 since 2016 🥳🎊 A very big thank you goes
to [pubpub-zz](https://github.com/pubpub-zz) who took a lot of time and
knowledge about the PDF format to finally get those improvements into PyPDF2.
Thank you 🤗💚

In case the new function causes any issues, you can use `_extract_text_old`
for the old functionality. Please also open a bug ticket in that case.
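
A minimal sketch of such a fallback might look like this. It assumes that `_extract_text_old` is a method on the page object that takes no arguments, which the note above does not spell out; the wrapper name `extract_text_with_fallback` is hypothetical.

```python
import logging


def extract_text_with_fallback(page):
    """Try the new extractor first; use the old one if it fails or is empty.

    Assumption: `page` has `extract_text()` and, optionally, a no-argument
    `_extract_text_old()` method.
    """
    try:
        text = page.extract_text() or ""
    except Exception:
        # Log the failure so it can still be reported as a bug
        logging.exception("extract_text() failed; trying the old extractor")
        text = ""
    if text.strip():
        return text
    old_extractor = getattr(page, "_extract_text_old", None)
    if callable(old_extractor):
        return old_extractor() or ""
    return text
```

Because the old extractor is looked up with `getattr`, the wrapper still returns whatever the new extractor produced on versions where `_extract_text_old` is not available.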

There were several people who have attempted to bring similar improvements to
PyPDF2. All of those were valuable. The main reason why they didn't get merged
is the big amount of open PRs / issues. pubpub-zz was the most comprehensive
PR which also incorporated the latest changes of PyPDF2 2.0.0.

Thank you to [VictorCarlquist](https://github.com/VictorCarlquist) for #858 and
[asabramo](https://github.com/asabramo) for #464 🤗

New Features (ENH):
-  Massive text extraction improvement (#924). Closed many open issues:
    - Exceptions / missing spaces in extract_text() method (#17) 🕺
      - Whitespace issues in extract_text() (#42) 💃
      - pypdf2 reads the hifenated words in a new line (#246)
    - PyPDF2 failing to read unicode character (#37)
      - Unable to read bullets (#230)
    - ExtractText yields nothing for apparently good PDF (#168) 🎉
    - Encoding issue in extract_text() (#235)
    - extractText() doesn't work on Chinese PDF (#252)
    - encoding error (#260)
    - Trouble with apostophes in names in text "O'Doul" (#384)
    - extract_text works for some PDF files, but not the others (#437)
    - Euro sign not being recognized by extractText (#443)
    - Failed extracting text from French texts (#524)
    - extract_text doesn't extract ligatures correctly (#598)
    - reading spanish text - mark convert issue (#635)
    - Read PDF changed from text to random symbols (#654)
    - .extractText() reads / as 1. (#789)
-  Update glyphlist (#947) - inspired by #464
-  Allow adding PageRange objects (#948)

Bug Fixes (BUG):
-  Delete .python-version file (#944)
-  Compare StreamObject.decoded_self with None (#931)

Robustness (ROB):
-  Fix some conversion errors on non conform PDF (#932)

Documentation (DOC):
-  Elaborate on PDF text extraction difficulties (#939)
-  Add logo (#942)
-  rotate vs Transformation().rotate (#937)
-  Example how to use PyPDF2 with AWS S3 (#938)
-  How to deprecate (#930)
-  Fix typos on robustness page (#935)
-  Remove scripts (pdfcat) from docs (#934)

Developer Experience (DEV):
-  Ignore .python-version file
-  Mark deprecated code with no-cover (#943)
-  Automatically create Github releases from tags (#870)

Testing (TST):
-  Text extraction for non-latin alphabets (#954)
-  Ignore PdfReadWarning in benchmark (#949)
-  writer.remove_text (#946)
-  Add test for Tree and _security (#945)

Code Style (STY):
-  black, isort, Flake8, splitting buildCharMap (#950)

Full Changelog: 2.0.0...2.1.0