'IndexError: list index out of range' when extracting text #1091

Closed
MartinThoma opened this issue Jul 10, 2022 · 12 comments
Labels
is-robustness-issue: From a users perspective, this is about robustness
MCVE in Tests: The MCVE was added to PyPDF2 test suite
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow

Comments

@MartinThoma
Member

MartinThoma commented Jul 10, 2022

I've got an IndexError when extracting text. The file opens fine in Chrome.

Environment

$ python -m platform
Linux-5.4.0-121-generic-x86_64-with-glibc2.31

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.2

Code + PDF

The file: pdf/5cf3eb1c20fb4bea8654f2a9b64b5a62.pdf

>>> from PyPDF2 import PdfReader
>>> reader = PdfReader('pdf/5cf3eb1c20fb4bea8654f2a9b64b5a62.pdf')
/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py:1229: PdfReadWarning: incorrect startxref pointer(1)
  warnings.warn(
>>> for page in reader.pages: print(page.extract_text())
[...]
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
[...]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1507, in extract_text
    return self._extract_text(
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1441, in _extract_text
    process_operation(operator, operands)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1301, in process_operation
    float(operands[5]),
IndexError: list index out of range

It's print(reader.pages[10].extract_text()) to be exact.
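To find out which pages of a damaged file fail without aborting the whole loop, a minimal sketch along these lines may help. `extract_pages_safely` is a hypothetical helper, not part of PyPDF2. It only assumes that each page object exposes `extract_text()`, as the pages in `reader.pages` do.

```python
from typing import Any, Iterable, List, Optional, Tuple


def extract_pages_safely(
    pages: Iterable[Any],
) -> List[Tuple[int, Optional[str], Optional[str]]]:
    """Call extract_text() on every page and record failures instead of aborting.

    Returns one (page_index, text_or_None, error_or_None) tuple per page.
    Hypothetical helper for illustration; not part of PyPDF2.
    """
    results = []
    for index, page in enumerate(pages):
        try:
            results.append((index, page.extract_text(), None))
        except (IndexError, ValueError) as exc:
            # Malformed content surfaced as IndexError in this report.
            results.append((index, None, f"{type(exc).__name__}: {exc}"))
    return results


# Possible usage with the file from this report:
# from PyPDF2 import PdfReader
# results = extract_pages_safely(PdfReader("file.pdf").pages)
# failing = [index for index, _text, error in results if error]
```

This only localizes the problem; it does not make the failing page extractable.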

@MartinThoma MartinThoma added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests labels Jul 10, 2022
@MartinThoma
Member Author

The same file gives a ValueError("invalid literal for int() with base 10: b'7267753-726774'") when trying to make an overlay.

@dkg
Contributor

dkg commented Jul 14, 2022

fwiw, i'm seeing a similar error with dump.pdf which is generated during the test suite of xml2rfc.

Python 3.10.5 (main, Jun  8 2022, 09:26:22) [GCC 11.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.31.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from PyPDF2 import PdfReader

In [2]: r = PdfReader('../dump.pdf')

In [3]: r.pages[0].extract_text()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-3-2445f91b85f4> in <module>
----> 1 r.pages[0].extract_text()

/usr/lib/python3/dist-packages/PyPDF2/_page.py in extract_text(self, Tj_sep, TJ_sep, space_width)
   1314         :return: The extracted text
   1315         """
-> 1316         return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
   1317 
   1318     def extract_xform_text(

/usr/lib/python3/dist-packages/PyPDF2/_page.py in _extract_text(self, obj, pdf, space_width, content_key)
   1127         if "/Font" in resources_dict:
   1128             for f in cast(DictionaryObject, resources_dict["/Font"]):
-> 1129                 cmaps[f] = build_char_map(f, space_width, obj)
   1130         cmap: Tuple[
   1131             Union[str, Dict[int, str]], Dict[str, str], str

/usr/lib/python3/dist-packages/PyPDF2/_cmap.py in build_char_map(font_name, space_width, obj)
     19     space_code = 32
     20     encoding, space_code = parse_encoding(ft, space_code)
---> 21     map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
     22 
     23     # encoding can be either a string for decode (on 1,2 or a variable number of bytes) of a char table (for 1 byte only for me)

/usr/lib/python3/dist-packages/PyPDF2/_cmap.py in parse_to_unicode(ft, space_code)
    244                         "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
    245                     )
--> 246                 ] = unhexlify(lst[1]).decode(
    247                     "utf-16-be", "surrogatepass"
    248                 )  # join is here as some cases where the code was split

IndexError: list index out of range

In [4]: 

dkg added a commit to dkg/pypdf that referenced this issue Jul 14, 2022
The code within the if block assumes that lst has index 0 and index 1.
So the predicate should depend on lst having at least two elements.

This resolves the error I described at
py-pdf#1091 (comment)
(I'm not sure that it would resolve the other issue raised by
@MartinThoma)
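
The idea in the commit message above can be illustrated with a small sketch. This is not the code from the commit: `add_char_mapping` is a hypothetical function, and it assumes two-byte hex tokens decoded as UTF-16-BE, which is simpler than what `_cmap.py` handles.

```python
from binascii import unhexlify
from typing import Dict, List


def add_char_mapping(tokens: List[bytes], mapping: Dict[str, str]) -> bool:
    """Add one source -> target entry, skipping lines with fewer than two tokens.

    Hypothetical helper: the body indexes tokens[0] and tokens[1], so the
    guard checks for at least two elements before touching either.
    """
    if len(tokens) < 2:
        return False
    source = unhexlify(tokens[0]).decode("utf-16-be", "surrogatepass")
    target = unhexlify(tokens[1]).decode("utf-16-be", "surrogatepass")
    mapping[source] = target
    return True
```

With this guard a one-token line is skipped rather than raising IndexError; whether skipping is the right recovery for a given file is a separate question.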
@dkg
Contributor

dkg commented Jul 14, 2022

I'm not sure that bb2d1db resolves this issue. looking at 966635.pdf (from the original report), and working from bb2d1db, when i do:

r = PdfReader('966635.pdf')
p = r.pages[10].extract_text()

I get this crash (ipython3 backtrace):

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-14-182fc7811fdb> in <module>
----> 1 p = r.pages[10].extract_text()

~/src/pypdf2/PyPDF2/PyPDF2/_page.py in extract_text(self, Tj_sep, TJ_sep, space_width)
   1424         :return: The extracted text
   1425         """
-> 1426         return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
   1427 
   1428     def extract_xform_text(

~/src/pypdf2/PyPDF2/PyPDF2/_page.py in _extract_text(self, obj, pdf, space_width, content_key)
   1402                     text = ""
   1403             else:
-> 1404                 process_operation(operator, operands)
   1405         output += text  # just in case of
   1406         return output

~/src/pypdf2/PyPDF2/PyPDF2/_page.py in process_operation(operator, operands)
   1269                     float(operands[3]),
   1270                     float(operands[4]),
-> 1271                     float(operands[5]),
   1272                 ]
   1273             elif operator == b"T*":

sorry for having commented here just because i also got an IndexError on extract_text! The issue i'd found is probably better characterized by #1111, and it is distinct from this one.

I think this report should be re-opened.

@MartinThoma MartinThoma reopened this Jul 14, 2022
@MartinThoma
Member Author

Thank you for letting me know 🤗

mtd91429 pushed a commit to mtd91429/PyPDF2 that referenced this issue Jul 15, 2022
The code within the if block assumes that `lst` has index 0 and index 1.

Fixes py-pdf#1091
Related to py-pdf#1111
@MartinThoma MartinThoma added MCVE in Tests The MCVE was added to PyPDF2 test suite and removed Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests labels Jul 17, 2022
@MartinThoma
Member Author

By the way, this is what the page causing the issue looks like:

image

@MartinThoma
Member Author

Trying it via https://www.pdf-online.com/osa/validate.aspx :

Validating file "non-compliant.pdf" for conformance level pdf1.3

  1. The 'xref' keyword was not found or the xref table is malformed.
  2. The file trailer dictionary is missing or invalid.
  3. The "Length" key of the stream object is wrong.
  4. Error in Flate stream: data error.
  5. The embedded ICC profile couldn't be read.
  6. The embedded font program 'JNLDEF+TimesNewRoman' cannot be read.
  7. The "Length" key of the stream object is wrong.
  8. Error in Flate stream: data error.
  9. The "Length" key of the stream object is wrong.
  10. The operator has an invalid number of operands.
  11. Error in Flate stream: data error.
  12. The "Length" key of the stream object is wrong.
  13. The operator has an invalid number of operands.
  14. A path start operator was missing.
  15. Error in Flate stream: data error.
  16. Graphics operator m is not allowed in page description.
  17. The "Length" key of the stream object is wrong.
  18. The operator has an invalid number of operands.
  19. A path start operator was missing.
  20. Error in Flate stream: data error.
  21. The "Length" key of the stream object is wrong.
  22. The operator has an invalid number of operands.
  23. Error in Flate stream: data error.
  24. Graphics operator l is not allowed in text object.

@MartinThoma MartinThoma added is-robustness-issue From a users perspective, this is about robustness and removed is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF labels Aug 6, 2022
@kxrob

kxrob commented Jan 2, 2023

Similar exception (v3.0.1) :

  File "C:\Python38\lib\site-packages\PyPDF2\_page.py", line 1851, in extract_text
    return self._extract_text(
  File "C:\Python38\lib\site-packages\PyPDF2\_page.py", line 1342, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "C:\Python38\lib\site-packages\PyPDF2\_cmap.py", line 28, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "C:\Python38\lib\site-packages\PyPDF2\_cmap.py", line 196, in parse_to_unicode
    process_rg, process_char, multiline_rg = process_cm_line(
  File "C:\Python38\lib\site-packages\PyPDF2\_cmap.py", line 264, in process_cm_line
    multiline_rg = parse_bfrange(l, map_dict, int_entry, multiline_rg)
  File "C:\Python38\lib\site-packages\PyPDF2\_cmap.py", line 278, in parse_bfrange
    nbi = max(len(lst[0]), len(lst[1]))
IndexError: list index out of range

at post-mortem lst has only one element:

>>> lst
[b'fffd']
>>>
>>> PyPDF2.__version__
'3.0.1'

(unfortunately cannot publish the pdf )

@pubpub-zz
Collaborator

@kxrob
the whole pdf is not required for the analysis;
can you locate the failing page and extract the fonts data with this script:

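import pypdf

failing_pdf = "xxxx.pdf"  # to be updated
failing_page = 0  # to be updated
# Copy only the failing page, then drop its content stream so that only
# the page resources (including the fonts) remain in the output.
w = pypdf.PdfWriter()
w.add_page(pypdf.PdfReader(failing_pdf).pages[failing_page])
del w.pages[0]["/Contents"]
w.write("cleaned_page.pdf")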

@kxrob

kxrob commented Jan 3, 2023

can you locate the failing page and extract the fonts data with this script

Here is the stripped page - it causes the same error with PdfReader("cleaned_page.pdf").pages[0].extract_text()
cleaned_page.pdf

@pubpub-zz
Collaborator

pubpub-zz commented Jan 4, 2023

@kxrob
Can you retry after replacing the parse_bfrange function in _cmap.py (around line 270) with the following code:

def parse_bfrange(
    l: bytes,
    map_dict: Dict[Any, Any],
    int_entry: List[int],
    multiline_rg: Union[None, Tuple[int, int]],
) -> Union[None, Tuple[int, int]]:
    lst = [x for x in l.split(b" ") if x]
    closure_found = False
    if multiline_rg is not None:
        fmt = b"%%0%dX" % (map_dict[-1] * 2)
        a = multiline_rg[0]  # a, b not in the current line
        b = multiline_rg[1]
        for sq in lst[1:]:
            if sq == b"]":
                closure_found = True
                break
            map_dict[
                unhexlify(fmt % a).decode(
                    "charmap" if map_dict[-1] == 1 else "utf-16-be",
                    "surrogatepass",
                )
            ] = unhexlify(sq).decode("utf-16-be", "surrogatepass")
            int_entry.append(a)
            a += 1
    else:
        a = int(lst[0], 16)
        b = int(lst[1], 16)
        nbi = max(len(lst[0]), len(lst[1]))
        map_dict[-1] = ceil(nbi / 2)
        fmt = b"%%0%dX" % (map_dict[-1] * 2)
        if lst[2] == b"[":
            for sq in lst[3:]:
                if sq == b"]":
                    closure_found = True
                    break
                map_dict[
                    unhexlify(fmt % a).decode(
                        "charmap" if map_dict[-1] == 1 else "utf-16-be",
                        "surrogatepass",
                    )
                ] = unhexlify(sq).decode("utf-16-be", "surrogatepass")
                int_entry.append(a)
                a += 1
        else:  # case without list
            c = int(lst[2], 16)
            fmt2 = b"%%0%dX" % max(4, len(lst[2]))
            closure_found = True
            while a <= b:
                map_dict[
                    unhexlify(fmt % a).decode(
                        "charmap" if map_dict[-1] == 1 else "utf-16-be",
                        "surrogatepass",
                    )
                ] = unhexlify(fmt2 % c).decode("utf-16-be", "surrogatepass")
                int_entry.append(a)
                a += 1
                c += 1
    return None if closure_found else (a, b)
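
The `multiline_rg` argument above carries an open range over to the next line when a bracketed array does not close on the line where it starts. A stripped-down sketch of that idea follows. It assumes the CMap lines are already reduced to space-separated hex tokens and that every target is a UTF-16-BE hex string. `parse_array_range` is a hypothetical name, not pypdf's code.

```python
from binascii import unhexlify
from typing import Dict, List, Optional


def parse_array_range(lines: List[bytes]) -> Dict[int, str]:
    """Map consecutive codes to the targets listed in a bracketed array.

    Hypothetical sketch. Expects pre-tokenised lines such as
    b"0000 0002 [ 0041", b"0042", b"0043 ]".
    """
    mapping: Dict[int, str] = {}
    next_code: Optional[int] = None  # not None while an array is still open
    for line in lines:
        tokens = [t for t in line.split(b" ") if t]
        if next_code is None:
            if len(tokens) < 3 or tokens[2] != b"[":
                continue  # not an array-style range; ignored in this sketch
            next_code = int(tokens[0], 16)
            tokens = tokens[3:]
        for token in tokens:
            if token == b"]":
                next_code = None  # array closed, stop carrying state
                break
            mapping[next_code] = unhexlify(token).decode("utf-16-be", "surrogatepass")
            next_code += 1
    return mapping
```

The point of the design is that a continuation line holding a single token is treated as the next array element rather than as the start of a new range.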

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jan 8, 2023
First Part fixing py-pdf#1091 (late)
Analysis of  'Hungarian' py-pdf#1533 still in progress
@pubpub-zz
Collaborator

@kxrob A PR has been opened. Could you confirm that it fixes your issue too?

MartinThoma pushed a commit that referenced this issue Jan 21, 2023
@pubpub-zz
Collaborator

@kxrob
I'm closing this issue, as it should now be fixed. Feel free to ask for it to be reopened if you have new input.


4 participants