'IndexError: list index out of range' when extracting text #1091

MartinThoma · 2022-07-10T09:44:25Z

I've got an IndexError when extracting text. The file opens fine in Chrome.

Environment

$ python -m platform
Linux-5.4.0-121-generic-x86_64-with-glibc2.31

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.2

Code + PDF

The file: pdf/5cf3eb1c20fb4bea8654f2a9b64b5a62.pdf

>>> from PyPDF2 import PdfReader
>>> reader = PdfReader('pdf/5cf3eb1c20fb4bea8654f2a9b64b5a62.pdf')
/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py:1229: PdfReadWarning: incorrect startxref pointer(1)
  warnings.warn(
>>> for page in reader.pages: print(page.extract_text())
[...]
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
[...]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1507, in extract_text
    return self._extract_text(
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1441, in _extract_text
    process_operation(operator, operands)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1301, in process_operation
    float(operands[5]),
IndexError: list index out of range

It's print(reader.pages[10].extract_text()) to be exact.
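
The traceback ends at `float(operands[5])`, so an operator that is expected to carry six operands (presumably a matrix operator such as `Tm`) seems to have been read with fewer. A minimal sketch of a defensive check follows; `matrix_from_operands` is a hypothetical helper for illustration and is not part of PyPDF2:

```python
from typing import List, Optional, Sequence


def matrix_from_operands(operands: Sequence[object]) -> Optional[List[float]]:
    """Return the six matrix values, or None if the operand list is malformed.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    if len(operands) < 6:
        return None  # too few operands: skip instead of raising IndexError
    try:
        return [float(op) for op in operands[:6]]
    except (TypeError, ValueError):
        return None  # an operand could not be converted to a number


print(matrix_from_operands([1, 0, 0, 1, 72, 720]))
print(matrix_from_operands([1, 0, 0, 1]))
```

Returning `None` lets the caller ignore the malformed operator and keep extracting the rest of the page, which is one possible way to treat a non-compliant content stream.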

MartinThoma · 2022-07-10T09:45:07Z

The same file gives a ValueError("invalid literal for int() with base 10: b'7267753-726774'") when trying to make an overlay.

dkg · 2022-07-14T17:31:59Z

fwiw, i'm seeing a similar error with dump.pdf which is generated during the test suite of xml2rfc.

Python 3.10.5 (main, Jun  8 2022, 09:26:22) [GCC 11.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.31.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from PyPDF2 import PdfReader

In [2]: r = PdfReader('../dump.pdf')

In [3]: r.pages[0].extract_text()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-3-2445f91b85f4> in <module>
----> 1 r.pages[0].extract_text()

/usr/lib/python3/dist-packages/PyPDF2/_page.py in extract_text(self, Tj_sep, TJ_sep, space_width)
   1314         :return: The extracted text
   1315         """
-> 1316         return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
   1317 
   1318     def extract_xform_text(

/usr/lib/python3/dist-packages/PyPDF2/_page.py in _extract_text(self, obj, pdf, space_width, content_key)
   1127         if "/Font" in resources_dict:
   1128             for f in cast(DictionaryObject, resources_dict["/Font"]):
-> 1129                 cmaps[f] = build_char_map(f, space_width, obj)
   1130         cmap: Tuple[
   1131             Union[str, Dict[int, str]], Dict[str, str], str

/usr/lib/python3/dist-packages/PyPDF2/_cmap.py in build_char_map(font_name, space_width, obj)
     19     space_code = 32
     20     encoding, space_code = parse_encoding(ft, space_code)
---> 21     map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
     22 
     23     # encoding can be either a string for decode (on 1,2 or a variable number of bytes) of a char table (for 1 byte only for me)

/usr/lib/python3/dist-packages/PyPDF2/_cmap.py in parse_to_unicode(ft, space_code)
    244                         "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
    245                     )
--> 246                 ] = unhexlify(lst[1]).decode(
    247                     "utf-16-be", "surrogatepass"
    248                 )  # join is here as some cases where the code was split

IndexError: list index out of range

In [4]:

@MartinThoma

The code within the if block assumes that lst has index 0 and index 1. So the predicate should depend on lst having at least two elements. This resolves the error I described at py-pdf#1091 (comment) (I'm not sure that it would resolve the other issue raised by @MartinThoma)
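
The guard described above can be sketched as follows. This is a simplified illustration, not the actual `_cmap.py` code: `add_bfchar_pairs` is a hypothetical helper, and it assumes two-byte codes that have already been split into a flat list of hex byte strings.

```python
from binascii import unhexlify
from typing import Dict, List


def add_bfchar_pairs(lst: List[bytes], mapping: Dict[str, str]) -> None:
    """Consume (source, target) hex pairs from lst; ignore an unpaired tail.

    Hypothetical helper for illustration; not the PyPDF2 implementation.
    """
    # Only index lst[0] and lst[1] while at least two elements remain.
    while len(lst) >= 2:
        src = unhexlify(lst[0]).decode("utf-16-be", "surrogatepass")
        dst = unhexlify(lst[1]).decode("utf-16-be", "surrogatepass")
        mapping[src] = dst
        lst = lst[2:]  # rebinding the local name; the caller's list is untouched


mapping: Dict[str, str] = {}
add_bfchar_pairs([b"0041", b"0061", b"fffd"], mapping)  # last code has no partner
print(mapping)
```

The loop condition checks the length before either index is used, so a trailing single element is skipped instead of raising `IndexError`.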

dkg · 2022-07-14T20:32:37Z

I'm not sure that bb2d1db resolves this issue. looking at 966635.pdf (from the original report), and working from bb2d1db, when i do:

r = PdfReader('966635.pdf')
p = r.pages[10].extract_text()

I get this crash (ipython3 backtrace):

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-14-182fc7811fdb> in <module>
----> 1 p = r.pages[10].extract_text()

~/src/pypdf2/PyPDF2/PyPDF2/_page.py in extract_text(self, Tj_sep, TJ_sep, space_width)
   1424         :return: The extracted text
   1425         """
-> 1426         return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
   1427 
   1428     def extract_xform_text(

~/src/pypdf2/PyPDF2/PyPDF2/_page.py in _extract_text(self, obj, pdf, space_width, content_key)
   1402                     text = ""
   1403             else:
-> 1404                 process_operation(operator, operands)
   1405         output += text  # just in case of
   1406         return output

~/src/pypdf2/PyPDF2/PyPDF2/_page.py in process_operation(operator, operands)
   1269                     float(operands[3]),
   1270                     float(operands[4]),
-> 1271                     float(operands[5]),
   1272                 ]
   1273             elif operator == b"T*":

sorry for having commented here just because i also got an IndexError on extract_text! The issue i'd found is probably better characterized by #1111, and it is distinct from this one.

I think this report should be re-opened.

MartinThoma · 2022-07-14T20:33:59Z

Thank you for letting me know 🤗

MartinThoma · 2022-08-06T06:13:05Z

By the way, this is what the page causing the issues looks like: [screenshot of the page; image not included]

MartinThoma · 2022-08-06T06:15:10Z

Trying it via https://www.pdf-online.com/osa/validate.aspx :

Validating file "non-compliant.pdf" for conformance level pdf1.3

The 'xref' keyword was not found or the xref table is malformed.
The file trailer dictionary is missing or invalid.
The "Length" key of the stream object is wrong.
Error in Flate stream: data error.
The embedded ICC profile couldn't be read.
The embedded font program 'JNLDEF+TimesNewRoman' cannot be read.
The "Length" key of the stream object is wrong.
Error in Flate stream: data error.
The "Length" key of the stream object is wrong.
The operator has an invalid number of operands.
Error in Flate stream: data error.
The "Length" key of the stream object is wrong.
The operator has an invalid number of operands.
A path start operator was missing.
Error in Flate stream: data error.
Graphics operator m is not allowed in page description.
The "Length" key of the stream object is wrong.
The operator has an invalid number of operands.
A path start operator was missing.
Error in Flate stream: data error.
The "Length" key of the stream object is wrong.
The operator has an invalid number of operands.
Error in Flate stream: data error.
Graphics operator l is not allowed in text object.

kxrob · 2023-01-02T18:52:33Z

Similar exception (v3.0.1) :

  File "C:\Python38\lib\site-packages\PyPDF2\_page.py", line 1851, in extract_text
    return self._extract_text(
  File "C:\Python38\lib\site-packages\PyPDF2\_page.py", line 1342, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "C:\Python38\lib\site-packages\PyPDF2\_cmap.py", line 28, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "C:\Python38\lib\site-packages\PyPDF2\_cmap.py", line 196, in parse_to_unicode
    process_rg, process_char, multiline_rg = process_cm_line(
  File "C:\Python38\lib\site-packages\PyPDF2\_cmap.py", line 264, in process_cm_line
    multiline_rg = parse_bfrange(l, map_dict, int_entry, multiline_rg)
  File "C:\Python38\lib\site-packages\PyPDF2\_cmap.py", line 278, in parse_bfrange
    nbi = max(len(lst[0]), len(lst[1]))
IndexError: list index out of range

at post-mortem lst has only one element:

>>> lst
[b'fffd']
>>>
>>> PyPDF2.__version__
'3.0.1'

(unfortunately cannot publish the pdf )

pubpub-zz · 2023-01-03T08:43:15Z

@kxrob
the whole pdf is not required for the analysis;
can you locate the failing page and extract the fonts data with this script:

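import pypdf

failing_pdf = "xxxx.pdf"  # to be updated
failing_page = 0  # to be updated
# Copy only the failing page into a new document
w = pypdf.PdfWriter()
w.add_page(pypdf.PdfReader(failing_pdf).pages[failing_page])
# Drop the page's content stream; the font resources stay in place
del w.pages[0]["/Contents"]
w.write("cleaned_page.pdf")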
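
To find the value for `failing_page`, one option is to try extraction on every page and record the ones that raise. A minimal sketch: `find_failing_pages` is a hypothetical helper, and `FakePage` only stands in for real page objects so that the example runs without a PDF.

```python
from typing import Any, Iterable, List, Tuple


def find_failing_pages(pages: Iterable[Any]) -> List[Tuple[int, str]]:
    """Return (page index, exception summary) for each page whose extract_text() raises."""
    failures = []
    for index, page in enumerate(pages):
        try:
            page.extract_text()
        except Exception as exc:  # broad on purpose: we only want to locate the page
            failures.append((index, f"{type(exc).__name__}: {exc}"))
    return failures


class FakePage:
    """Stand-in for a page object; raises IndexError when `broken` is true."""

    def __init__(self, broken: bool) -> None:
        self.broken = broken

    def extract_text(self) -> str:
        if self.broken:
            raise IndexError("list index out of range")
        return "some text"


print(find_failing_pages([FakePage(False), FakePage(True), FakePage(False)]))
```

With a real file, passing `pypdf.PdfReader(failing_pdf).pages` in place of the fake pages should work the same way.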

kxrob · 2023-01-03T19:08:44Z

can you locate the failing page and extract the fonts data with this script

Here is the stripped page - it causes the same error with PdfReader("cleaned_page.pdf").pages[0].extract_text()
cleaned_page.pdf

pubpub-zz · 2023-01-04T20:44:56Z

@kxrob
can you retry after replacing the function parse_bfrange in _cmap.py (around line 270) with the following code:

def parse_bfrange(
    l: bytes,
    map_dict: Dict[Any, Any],
    int_entry: List[int],
    multiline_rg: Union[None, Tuple[int, int]],
) -> Union[None, Tuple[int, int]]:
    lst = [x for x in l.split(b" ") if x]
    closure_found = False
    if multiline_rg is not None:
        fmt = b"%%0%dX" % (map_dict[-1] * 2)
        a = multiline_rg[0]  # a, b not in the current line
        b = multiline_rg[1]
        for sq in lst[1:]:
            if sq == b"]":
                closure_found = True
                break
            map_dict[
                unhexlify(fmt % a).decode(
                    "charmap" if map_dict[-1] == 1 else "utf-16-be",
                    "surrogatepass",
                )
            ] = unhexlify(sq).decode("utf-16-be", "surrogatepass")
            int_entry.append(a)
            a += 1
    else:
        a = int(lst[0], 16)
        b = int(lst[1], 16)
        nbi = max(len(lst[0]), len(lst[1]))
        map_dict[-1] = ceil(nbi / 2)
        fmt = b"%%0%dX" % (map_dict[-1] * 2)
        if lst[2] == b"[":
            for sq in lst[3:]:
                if sq == b"]":
                    closure_found = True
                    break
                map_dict[
                    unhexlify(fmt % a).decode(
                        "charmap" if map_dict[-1] == 1 else "utf-16-be",
                        "surrogatepass",
                    )
                ] = unhexlify(sq).decode("utf-16-be", "surrogatepass")
                int_entry.append(a)
                a += 1
        else:  # case without list
            c = int(lst[2], 16)
            fmt2 = b"%%0%dX" % max(4, len(lst[2]))
            closure_found = True
            while a <= b:
                map_dict[
                    unhexlify(fmt % a).decode(
                        "charmap" if map_dict[-1] == 1 else "utf-16-be",
                        "surrogatepass",
                    )
                ] = unhexlify(fmt2 % c).decode("utf-16-be", "surrogatepass")
                int_entry.append(a)
                a += 1
                c += 1
    return None if closure_found else (a, b)
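
The `fmt` lines in the function are compact, so here is a small standalone illustration of what they do. The names `nb_bytes` and `code` are chosen for this sketch and do not come from `_cmap.py`; it shows how the zero-padded hex format is built and how a numeric code becomes a dictionary key:

```python
from binascii import unhexlify

nb_bytes = 2  # plays the role of map_dict[-1]: bytes per character code
fmt = b"%%0%dX" % (nb_bytes * 2)  # "%%" is a literal "%", so fmt == b"%04X"

code = 0x41
hex_code = fmt % code  # zero-padded upper-case hex digits as bytes
key = unhexlify(hex_code).decode(
    "charmap" if nb_bytes == 1 else "utf-16-be", "surrogatepass"
)

print(fmt, hex_code, repr(key))
```

Padding to `nb_bytes * 2` hex digits matters because `unhexlify` needs an even number of digits and the decoder needs the full code width.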

pubpub-zz · 2023-01-09T21:12:46Z

@kxrob a PR has been issued. Can you confirm that it fixes your issue too?

pubpub-zz · 2023-02-05T15:45:57Z

@kxrob
I am closing this issue, as it should now be fixed. Feel free to ask for it to be reopened if you have new input.

pubpub-zz mentioned this issue Jul 12, 2022

ENH: Extract Text Enhancement (whitespaces) #1084

Merged

dkg mentioned this issue Jul 14, 2022

_cmap.py: avoid IndexError in parse_to_unicode #1110

Merged

MartinThoma closed this as completed in bb2d1db Jul 14, 2022

MartinThoma reopened this Jul 14, 2022

mtd91429 pushed a commit to mtd91429/PyPDF2 that referenced this issue Jul 15, 2022

BUG: Avoid IndexError in _cmap.parse_to_unicode (py-pdf#1110)

2ba30d6

The code within the if block assumes that `lst` has index 0 and index 1. Fixes py-pdf#1091 Related to py-pdf#1111

MartinThoma added a commit that referenced this issue Jul 17, 2022

TST: Add xfail test for IndexError when extracting text

15982f1

See #1091

MartinThoma mentioned this issue Jul 17, 2022

TST: Add xfail test for IndexError when extracting text #1124

Merged

MartinThoma added a commit that referenced this issue Jul 17, 2022

TST: Add xfail test for IndexError when extracting text

a91dce7

See #1091

MartinThoma added a commit that referenced this issue Jul 17, 2022

TST: Add xfail test for IndexError when extracting text (#1124)

b1d4ea1

See #1091

MartinThoma added the "MCVE in Tests" label (The MCVE was added to PyPDF2 test suite) and removed the "Has MCVE" label (A minimal, complete and verifiable example helps a lot to debug / understand feature requests) on Jul 17, 2022

pubpub-zz mentioned this issue Jul 22, 2022

DEV: Introduce logger_warning #1148

Merged

MartinThoma added the "is-robustness-issue" label (From a user's perspective, this is about robustness) and removed the "is-bug" label (From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF) on Aug 6, 2022

pubpub-zz mentioned this issue Jan 6, 2023

PyPDF2 throws exception during extract_text() #1533

Closed

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jan 8, 2023

FIX : fixes indexerror in cmap

395f588

First Part fixing py-pdf#1091 (late) Analysis of 'Hungarian' py-pdf#1533 still in progress

pubpub-zz mentioned this issue Jan 9, 2023

BUG: Fix error in cmap extraction #1544

Merged

MartinThoma pushed a commit that referenced this issue Jan 21, 2023

BUG: Fix error in cmap extraction (#1544)

c1f8742

Fixes #1533 and late #1091

pubpub-zz closed this as completed Feb 5, 2023
