TypeError: 'NumberObject' object is not subscriptable #1273

Closed
DL6ER opened this issue Aug 25, 2022 · 6 comments · Fixed by #1297
Labels
is-robustness-issue From a users perspective, this is about robustness

Comments

DL6ER commented Aug 25, 2022

See #1269 for further details.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-122-generic-x86_64-with-glibc2.29

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3

Code + PDF

This is a minimal, complete example that shows the issue:

import PyPDF2
with open("shiv_resume.pdf", "rb") as f:
  pdfreader = PyPDF2.PdfFileReader(f, strict=False)

PDF used above: shiv_resume.pdf

Traceback

This is the complete Traceback I see:

Xref table not zero-indexed. ID numbers for objects will be corrected.
Xref table not zero-indexed. ID numbers for objects will be corrected.
Superfluous whitespace found in object header b'17' b'23'

Traceback (most recent call last):
  File "test4.py", line 3, in <module>
    pdfreader = PyPDF2.PdfFileReader(f, strict=True)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 1775, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 275, in __init__
    self.read(stream)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 1279, in read
    self._read_xref_tables_and_trailers(stream, startxref, xref_issue_nr)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 1435, in _read_xref_tables_and_trailers
    xrefstream = self._read_pdf15_xref_stream(stream)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 1515, in _read_pdf15_xref_stream
    assert cast(str, xrefstream["/Type"]) == "/XRef"
TypeError: 'NumberObject' object is not subscriptable
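
A minimal sketch of why the last frame fails, assuming the parser landed on a number instead of a stream dictionary. `typing.cast` does nothing at runtime, and the subscript is evaluated before `cast` is even called, so it is applied to the number itself. The `NumberObject` class below is a hypothetical stand-in for PyPDF2's type, and `require_xref_stream` is an illustrative helper, not part of PyPDF2.

```python
from typing import Any, cast


class NumberObject(int):
    """Hypothetical stand-in for PyPDF2's numeric object type."""


def require_xref_stream(obj: Any) -> dict:
    """Return obj if it looks like an /XRef stream dictionary, else raise ValueError."""
    if not isinstance(obj, dict):
        raise ValueError(
            f"expected an xref stream dictionary, got {type(obj).__name__}"
        )
    if obj.get("/Type") != "/XRef":
        raise ValueError("object is not an /XRef stream")
    return obj


# What a parser might hand back after jumping to a wrong offset (assumption).
parsed = NumberObject(17)

# The subscript runs first and fails; cast() is only a hint for type checkers.
try:
    cast(str, parsed["/Type"])
except TypeError as exc:
    message = str(exc)
```

An explicit type check like the one in `require_xref_stream` would turn the bare `TypeError` into an error that a caller can catch and recover from.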
MartinThoma added the is-uncaught-exception label, then added is-robustness-issue and removed is-uncaught-exception, on Aug 31, 2022
@MartinThoma
Member

@DL6ER Thank you for sharing this issue and all the details you put in here (and in all the other issues as well) 🤗 I appreciate this a lot ❤️

@MartinThoma
Member

MartinThoma commented Aug 31, 2022

One part that might be interesting to you: PdfFileReader is deprecated; please use PdfReader instead. The only difference is that the old PdfFileReader defaults to strict=True, whereas PdfReader defaults to strict=False. We made this change because most users want strict=False, and because PdfFileReader doesn't actually need a file: a BytesIO object works fine as well.

Hence this:

import PyPDF2
with open("shiv_resume.pdf", "rb") as f:
  pdfreader = PyPDF2.PdfFileReader(f, strict=False)

can be simplified to this:

from PyPDF2 import PdfReader

reader = PdfReader("shiv_resume.pdf")

@DL6ER
Author

DL6ER commented Aug 31, 2022

Just to clarify: with, i.e. close() is also not needed?

@pubpub-zz
Collaborator

The PDF file is linearized, and there seem to be some issues in reading this part of the header. I will analyze it in more depth later.

@MartinThoma
Member

MartinThoma commented Aug 31, 2022

Just to clarify: with, i.e. close() is also not needed?

Exactly! PyPDF2 takes care of that: https://github.com/py-pdf/PyPDF2/blob/main/PyPDF2/_reader.py#L272-L274 - no file handles are left open. That is also why I typically recommend passing the file path directly to PyPDF2.
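
A rough sketch of the idea, using only the standard library: if the caller passes a path, the file is opened, read fully into memory, and closed again before parsing starts. The helper name `load_stream` is hypothetical, and this illustrates the pattern rather than reproducing the linked PyPDF2 lines.

```python
import os
import tempfile
from io import BytesIO
from pathlib import Path
from typing import IO, Union


def load_stream(source: Union[str, Path, IO[bytes]]) -> IO[bytes]:
    """Return a readable binary stream for source.

    A path is opened, read fully into memory, and closed again, so no
    file handle outlives the call. A stream is returned unchanged and
    stays the caller's responsibility.
    """
    if isinstance(source, (str, Path)):
        with open(source, "rb") as fh:
            return BytesIO(fh.read())
    return source


# Usage: write a small file, load it by path, then delete it.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.5\n")
    tmp_path = tmp.name

stream = load_stream(tmp_path)
os.remove(tmp_path)  # safe: the data already lives in memory
header = stream.read(8)
```

The trade-off is that the whole file is held in memory, which is usually acceptable for PDFs but worth knowing for very large documents.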

@pubpub-zz
Collaborator

pubpub-zz commented Aug 31, 2022

The problem is not directly due to linearization but to other errors (possibly generated by the linearization process?):

  • The pointer to the 3rd (and potentially 4th) chained xref/trailer is invalid. In such a case we will stop the xref/trailer analysis.
    (I've fixed the /Prev entry in the 2nd chained trailer to extend the test coverage.)
    shiv_resume.pdf

  • I've added a fallback that searches the file for an object when its xref pointer is invalid.

  • I've also applied the same fallback when the id/gen is not present in the xref.

PR #1297 completed
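
The fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: the function names, the `xref` dictionary keyed by `(id, generation)`, and the regular expressions are all hypothetical. The xref offset is trusted only if the expected `id gen obj` header really sits there; otherwise the raw bytes are scanned for that header.

```python
import re
from typing import Dict, Optional, Tuple


def find_object_offset(data: bytes, idnum: int, generation: int) -> Optional[int]:
    """Scan the raw bytes for the header '<idnum> <generation> obj'."""
    # The lookbehind keeps "2 0 obj" from matching inside "12 0 obj".
    pattern = re.compile(rb"(?<![0-9])%d\s+%d\s+obj\b" % (idnum, generation))
    match = pattern.search(data)
    return match.start() if match else None


def locate_object(
    data: bytes, xref: Dict[Tuple[int, int], int], idnum: int, generation: int
) -> Optional[int]:
    """Use the xref offset if it is valid, else fall back to a file scan."""
    offset = xref.get((idnum, generation))
    if offset is not None and 0 <= offset < len(data):
        header = re.compile(rb"\s*%d\s+%d\s+obj\b" % (idnum, generation))
        if header.match(data, offset):
            return offset
    # Either the id/gen is missing from the xref or its pointer is invalid.
    return find_object_offset(data, idnum, generation)


data = (
    b"%PDF-1.5\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"12 0 obj\n<< >>\nendobj\n"
    b"2 0 obj\n(hi)\nendobj\n"
)
# Object 1 has a good pointer, object 2 a bad one, object 12 no entry at all.
xref = {(1, 0): 9, (2, 0): 999}

offset_1 = locate_object(data, xref, 1, 0)
offset_2 = locate_object(data, xref, 2, 0)
offset_12 = locate_object(data, xref, 12, 0)
```

A plain byte scan like this can be fooled by header-like text inside stream data, so it is best treated as a last resort for damaged files.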

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Aug 31, 2022
* if chained xref/trailer are not good
* if the object header ('id' 'gen' obj) or if the object is not present in the xref table, will search the file for the object.

fixes  py-pdf#1273