Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add check_extractable argument to high_level.extract_text #350

Closed
Recursing opened this issue Jan 5, 2020 · 18 comments · Fixed by #453
Closed

Add check_extractable argument to high_level.extract_text #350

Recursing opened this issue Jan 5, 2020 · 18 comments · Fixed by #453

Comments

@Recursing
Copy link
Contributor

Is your feature request related to a problem? Please describe.
When using pdfminer.high_level.extract_text on some files, I get pdfminer.pdfdocument.PDFTextExtractionNotAllowed: Text extraction is not allowed

Describe the solution you'd like
Add a keyword argument check_extractable to pdfminer.high_level.extract_text, and pass it to PDFPage.get_pages

@igormp
Copy link
Contributor

igormp commented Jan 6, 2020

Is it possible for you to give us a sample PDF so we can reproduce this issue?

@Recursing
Copy link
Contributor Author

asw_oct06_p33-41.pdf

@pietermarsman
Copy link
Member

pietermarsman commented Jan 6, 2020

Hi all,

I am against adding the check_extractable() parameter to the high-level functions extract_text() and extract_text_to_fp(). I think these function signatures are already bloated, especially extract_text_to_fp().

The high-level functions (should) cover the most common use-cases. Changing the check_extractable flag is not imho a common use-case. Instead, you should use the more adaptable composable api instead. Also see the (currently minimal) docs for this.

In your case, I recommend something like this:

from io import StringIO

from converter import TextConverter
from layout import LAParams
from pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfpage import PDFPage

layout_parameters = LAParams()

with open('asw_oct06_p33-41.pdf', "rb") as fp, StringIO() as output_string:
    resource_manager = PDFResourceManager()
    device = TextConverter(resource_manager, output_string, 
                           laparams=layout_parameters)
    interpreter = PDFPageInterpreter(resource_manager, device)

    for page in PDFPage.get_pages(fp, check_extractable=False):
        interpreter.process_page(page)
    
    print(output_string.getvalue())

You could also wrap this as a function.

I like to know if this suits your needs, and if I am missing something.

@Recursing
Copy link
Contributor Author

Of course rewriting the high level extract_text myself with check_extractable=False would suit my needs, but I think that if adding an argument is a problem, a more reasonable default for the high level extract_text would be to work even on pdfs like the attached one.

Also, I think we might want to streamline the parameters of extract_text and extract_text_to_fp (and remove its unused kwargs parameter)

@pietermarsman
Copy link
Member

Of course rewriting the high level extract_text myself with check_extractable=False would suit my needs

Great! I'm closing this issue and the corresponding PR, let me know if you disagree.

but I think that if adding an argument is a problem, a more reasonable default for the high level extract_text would be to work even on pdfs like the attached one.

A PDF is not extractable if the document itself signals that it is not allowed to extract the content if opened with user-level permissions. In my opinion, we should respect the choice of the application that created the PDF, and consequently check_extractable=True is a great default.

Also, I think we might want to streamline the parameters of extract_text and extract_text_to_fp (and remove its unused kwargs parameter)

I agree! The function signature of extract_text_to_fp is been around for long and changing it would be a breaking change. The extract_text() function is only recently created, and its arguments are much more concise and understandable.

If this bothers you, you can create a new issue for this. We could deprecate the use of extract_text_to_fp and create some new methods.

@Recursing
Copy link
Contributor Author

Recursing commented Jan 6, 2020

I kind of disagree on closing this, as the main purpose of the high level API is to extract text from pdfs and in my tests pdfs that raise PDFTextExtractionNotAllowed are not that rare (happened to both me and another user).
But it's not that big of a deal as it's easy to rewrite the high level api

I would argue that laparams, codec and caching and maxpages are all less useful parameters (especially the last one as a user can just specify a range of pages)

@pietermarsman
Copy link
Member

Then I will reopen this. Lets see what others think.

@pietermarsman pietermarsman reopened this Jan 7, 2020
@igormp
Copy link
Contributor

igormp commented Jan 7, 2020

While I think that the high level functions should be a no brainer to use in most situations, where you just pass the file and it extracts the text without problems, I do think that we should adopt some sane defaults that would at least throw a warning about something as a PDF not meant to be extractable.

Passing another parameter to a method that is meant to be simple doesn't sound right to me, so maybe we should just change check_extractable's default to False and issue a warning of some kind, since people will want to extract its contents anyway.

@Recursing
Copy link
Contributor Author

I would agree with the warning

@pietermarsman
Copy link
Member

pietermarsman commented Jan 7, 2020

I like it too.

In summary:

  • Set the default value for check_extractable to False.
  • If check_extractable is True we throw an Error, if False we raise a warning.
  • Remove the explicit arguments for check_extractable from the high_level module.

As an alternative; we could also drop the check_extractable argument completely and always raise a warning and never an error.

@igormp
Copy link
Contributor

igormp commented Jan 7, 2020

As an alternative; we could also drop the check_extractable argument completely and always raise a warning and never an error.

I really like that.

crccheck added a commit to crccheck/atx-bandc that referenced this issue Jan 21, 2020
These PDFs are extractable, but pdfminer won't extract them pdfminer/pdfminer.six#350

This forks the high level `extract_text` function to fix this. I could have combined `_get_pdf_page_count` with but then I wouldn't be able to delete this code in the future if pdfminer implements a fix.

Part of #38
@filipopo
Copy link

Encountered this too
Or just make a ignore flag, there's many tools that ignore this, hell even opening the .pdf in firefox allows me to copy the text of protected .pdf's

@pietermarsman
Copy link
Member

@filip98, if you feel like creating a PR, I'll make sure that it gets reviewed and merged.

@HeroadZ
Copy link

HeroadZ commented Jun 16, 2020

Can't believe this problem still exists.
In my opinion, some users have suffered for this problem. There are also some discussion on stackoverflow.
I mean, "add an optional argument" is simple and doesn't have any bad effect on old codes.
I agree with the optional argument, check_extractable.

@pietermarsman
Copy link
Member

Hi @HeroadZ, feel free to work on this. I can review, merge and publish the code.

@Recursing
Copy link
Contributor Author

Recursing commented Jun 23, 2020

You can still merge #351 , maybe see if you want to change the default to False

@madhurcodes
Copy link
Contributor

Encountered this error today causing me to redo a batch job :<
Submitted this PR #453 to change the error to a warning with a descriptive message to the user.

@pietermarsman
Copy link
Member

Hi @madhurcodes, thanks for your work on this. I'll review the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants