"TypeError: 'NoneType' object is not subscriptable" almost every time I make inference #21

Closed
nicolas-gervais opened this issue Jul 10, 2021 · 9 comments

Comments

@nicolas-gervais

Traceback (most recent call last):
  File "c:\users\ngervais\anaconda3\envs\montrium\lib\site-packages\pdftitle.py", line 669, in run
    title = get_title_from_file(args.pdf)
  File "c:\users\ngervais\anaconda3\envs\montrium\lib\site-packages\pdftitle.py", line 557, in get_title_from_file
    return get_title_from_io(raw_file)
  File "c:\users\ngervais\anaconda3\envs\montrium\lib\site-packages\pdftitle.py", line 452, in get_title_from_io
    dev.recover_last_paragraph()
  File "c:\users\ngervais\anaconda3\envs\montrium\lib\site-packages\pdftitle.py", line 340, in recover_last_paragraph
    if len(self.current_block[4]) > 0:
TypeError: 'NoneType' object is not subscriptable
@metebalci
Owner

Can you share how you are using it and with which PDF file?

@nicolas-gervais
Author

I found that if one of the pages doesn't contain text, this occurs. I'm able to avoid these documents like this:

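import pdftotext


def contains_text(filename):
    """Return True only if every page of the PDF has extractable text."""
    with open(filename, "rb") as f:
        pdf = pdftotext.PDF(f)
    # A page whose extracted text is a lone form feed ('\x0c') is treated as empty.
    return all(text != '\x0c' for text in pdf)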

@metebalci
Owner

I cannot reproduce the error. I checked both an empty PDF and a PDF with an empty page between non-empty pages. If you can share the PDF, I can check that one. Maybe there is a different problem.

@seamustuohy

Here are some example files which fail. From some very cursory debugging, it looks like the error was introduced when the eliot algorithm was added. TextOnlyDevices had some changes made to it that fail when certain assumptions are not met. Specifically, the files below are constructed in a manner where process_string never gets run when the PDF is being parsed. This in turn means that draw_cid never sets self.current_block. That leads recover_last_paragraph to fail when it tries to pull the fifth item from self.current_block, which is still None.

SYSTEM V - application binary interface.pdf

Taking_The_Pulse_Of_Hacking-A_Risk_Basis_For_Security_Research_2018.pdf

anti-reverse-engineering-linux.pdf
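
As a rough illustration of the failure described above, here is a minimal sketch. The class `TextOnlyDeviceSketch` and its block layout are hypothetical stand-ins, not pdftitle's actual code; only the names `current_block`, `draw_cid`, and `recover_last_paragraph` come from this thread. It shows how a None check in `recover_last_paragraph` would avoid the TypeError when no glyph was ever drawn.

```python
class TextOnlyDeviceSketch:
    """Hypothetical stand-in for the device; not pdftitle's real class."""

    def __init__(self):
        # Stays None until the first glyph is drawn.
        self.current_block = None
        self.blocks = []

    def draw_cid(self, text, font_size=12.0):
        # The first glyph creates the block; index 4 holds the accumulated text.
        if self.current_block is None:
            self.current_block = [0.0, 0.0, font_size, 0.0, ""]
        self.current_block[4] += text

    def recover_last_paragraph(self):
        # Guard: if nothing was drawn, there is no paragraph to recover.
        if self.current_block is None:
            return
        if len(self.current_block[4]) > 0:
            self.blocks.append(self.current_block)
            self.current_block = None
```

Without the guard, calling `recover_last_paragraph()` on a fresh instance would subscript None and raise the same TypeError as in the traceback.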

@metebalci
Owner

Thanks for the pdfs. There are different issues with each of these.

  • anti-reverse-engineering-linux contains a single XObject (single Do operator) embedded into this PDF. My understanding is that an XObject can be many things; it is like an embedded PDF inside this PDF. If this embedded XObject is a normal PDF like others, it might be possible to extract the title from it, but it is not yet clear to me how to work with these. I will check, but it might take some time to find out whether this can be supported.

  • Taking_The.. contains 14 XObjects (so 14 Do operators) and also some other (probably unrelated to pdftitle) operators. I first thought maybe each page is another XObject but there are more than 14 pages in the document. So this is similar to the issue above, but a little different, I think.

  • SYSTEM V... looks like a regular PDF file which pdftitle should support. There is something strange (in the sense that I haven't seen it before) in the text transformation or state in this file, so none of the characters on the first page are taken into account, which causes the error you mention (no current_block). I am checking this, but I need to remember or understand the text transformation again, so I am not sure how easy it will be to fix.

@metebalci
Owner

This is a note to myself: I will try to improve the error logging a bit in the next version, so the messages are a little more human-friendly when errors happen.

@seamustuohy

I have been processing hundreds of PDF files recently and have come across a large number of these. If you would like, I can provide more. There are also a range of Unicode and other CID encoding errors I've come across that I've not reported, since they seem to lie with the underlying PDFMiner library. But I'd be happy to share a range of PDFs that cause the library to fail in different ways if you would like.

@metebalci
Owner

Thanks, I will get in touch if I need more examples.

Meanwhile I will create two different issues for the cases above, and also close this issue as it was first opened for empty pdfs which I could not reproduce.

@metebalci
Owner

metebalci commented Aug 10, 2021

@seamustuohy update about the files you mentioned:

  • anti-reverse-engineering-linux: the first page is an image, so there is no text to extract. You can pass --page-number 2 with the new version, and it should work.

  • Taking_The..: again the first page is an image, but the 3rd page also has the section title (Introduction) in a bigger font than the caption at the top of the page. Using another algorithm might work for this, but it is a very special case.

  • SYSTEM V...: should work with the new version, but there is an issue with the spacing between the first line and the second line. I want to fix the spacing issue in general, but it will probably take some time.
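
For documents whose first page is an image, the page-number approach above could be automated by trying pages in order. The following is only a sketch: `first_title` and the `extract_title` callable are hypothetical names, and how `extract_title` obtains a title for a given page (for example by running pdftitle with --page-number) is left to the caller.

```python
def first_title(extract_title, page_numbers):
    """Return (page, title) for the first page that yields a non-empty title.

    `extract_title` is a hypothetical caller-supplied callable that takes a
    1-based page number and returns a title string or None.
    """
    for page in page_numbers:
        try:
            title = extract_title(page)
        except TypeError:
            # Failure mode from this issue: the page produced no text block.
            continue
        if title and title.strip():
            return page, title.strip()
    # No page in the given range produced a usable title.
    return None
```

Catching TypeError here only works around the symptom on older versions; a page that is a scanned image will still yield no title.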
