PDFObjRef is not iterable #495

pietermarsman · 2020-09-10T18:51:29Z

Bug report

Sadly, I cannot upload the problematic PDFs due to a non-disclosure agreement. I can, however, point out the issue and share my fix.

When trying to instantiate a PDFCIDFont object at:

pdfminer.six/pdfminer/pdfinterp.py

Line 193 in 0b44f77

font = PDFCIDFont(self, spec)

Where

self
<pdfminer.pdfinterp.PDFResourceManager at 0x7f0c80493048>

spec
{'Type': /'Font',
 'Subtype': /'CIDFontType2',
 'BaseFont': /'ISOCPEUR',
 'FontDescriptor': <PDFObjRef:5>,
 'CIDSystemInfo': <PDFObjRef:41>,
 'W': <PDFObjRef:13>,
 'Encoding': /'Identity-H'}

The execution ends up at PDFStream.decode with:

self
<PDFStream(7): raw=8963, {'Length1': 28048, 'Length': 8961, 'Filter': /'FlateDecode', 'DecodeParms': <PDFObjRef:39>}>

The origin of the bug is that in this case, get_filters(self)

pdfminer.six/pdfminer/pdftypes.py

Line 258 in 0b44f77

filters = self.get_filters()

returns:

In[8]: filters
Out[8]: [(/'FlateDecode', <PDFObjRef:39>)]

As you can see, the second element of the only tuple is a PDFObjRef. It is then assigned to params, and the code fails a little further down when it evaluates 'Predictor' in params:

pdfminer.six/pdfminer/pdftypes.py

Line 297 in 0b44f77

if params and 'Predictor' in params:
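The failure can be reproduced without any PDF. Below is a minimal sketch, assuming a stand-in class `FakeObjRef` (hypothetical, not pdfminer's `PDFObjRef`) that, as the issue title implies for the real class, does not support membership tests. The helper names `has_predictor` and `raises_type_error` are also made up for illustration.

```python
class FakeObjRef:
    """Hypothetical stand-in for an indirect reference; not pdfminer's class."""

    def __init__(self, objid: int) -> None:
        self.objid = objid

    def __repr__(self) -> str:
        return f"<FakeObjRef:{self.objid}>"


def has_predictor(params: object) -> bool:
    # Same shape as the check quoted above: truthiness first, then membership.
    return bool(params and "Predictor" in params)


def raises_type_error(params: object) -> bool:
    # Report whether the check blows up with a TypeError for this value.
    try:
        has_predictor(params)
    except TypeError:
        return True
    return False


print(has_predictor({}))                  # empty dict: check is skipped
print(raises_type_error(FakeObjRef(39)))  # truthy non-container: `in` fails
```

An empty dictionary is falsy, so the membership test is never reached. An arbitrary object is truthy by default, so Python goes on to evaluate `in` and raises.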

I noticed that the default value of params is an empty dictionary, which effectively skips that check. So I extended the check to only continue if params was a dictionary:
https://github.com/imochoa/pdfminer.six/blob/2d996c9ae26c8a336711178a8afe3091e1140970/pdfminer/pdftypes.py#L297
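As a rough sketch of that kind of guard (the function name `predictor_applies` is hypothetical and this is not the code at the link above), the check could look like this:

```python
from typing import Any


def predictor_applies(params: Any) -> bool:
    # Only inspect params when it is a real dict; anything else
    # (for example an unresolved reference) is treated as "no predictor".
    return isinstance(params, dict) and "Predictor" in params


print(predictor_applies({"Predictor": 12}))
print(predictor_applies(object()))
```

A guard like this avoids the exception, but it also skips the predictor step whenever the reference would have resolved to a dictionary that does contain 'Predictor', so the decoded bytes may be wrong in that case.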

I think the underlying error is that get_filters(self) returns a PDFObjRef instead of a dictionary. I've tried to find the exact origin, but I'm not too familiar with the project and couldn't pinpoint it. The furthest I got was that, in PDFParser, the problematic <PDFObjRef:39> value is already present in the dictionary passed at:

pdfminer.six/pdfminer/pdfparser.py

Line 122 in 0b44f77

obj = PDFStream(dic, data, self.doc.decipher)

as:

dic['DecodeParms']
<PDFObjRef:39>

Since I couldn't prevent it from coming up, refining the check seemed like the next best option and it works well on the 2 problematic PDFs I have.


imochoa · 2023-06-16T08:36:30Z

Hello again :)

I found a PDF I can share where the issue is happening, in the text box at the bottom right where it says "PET BLACK"

pdfminer_testpart.pdf

EvaSDK · 2023-08-05T14:18:21Z

Jumping in, as I had the same issue with a document containing PII that I cannot share.
The minimal code sample to reproduce is the following:

from pdfminer.high_level import extract_text
extract_text("./pdfminer_testpart.pdf")

It should return:

'8\n\n7\n\n6\n\n5\n\n4\n\n3\n\n2\n\n1\n\n150,00\n\n30,00\n\n(cid:72) 0,05 A\n\n0\n0\n,\n0\n2\n\n0\n0\n,\n8\n\n(cid:69) 0,05\n\n0\n0\n,\n0\n5\n\nA\n\nF\n\nE\n\nD\n\n20,00\n\n16,00\n\n+\n0,05\n15,00 - 0,00\n\nC\n\n0\n0\n,\n0\n4\n\n0\n0\n,\n0\n2\n\nR 1 8 , 0 0\n\nM12x1.75 - 6H\n\n0\n0\n,\n5\n4\n\nB\n\nA\n\n0\n0\n,\n6\n1\n(cid:142)\n\n0\n0\n,\n6\n1\n\n+\n0,50\n15,00 - 0,00\n\n60,00 (cid:66)0,02\n\n100,00 (cid:66)0,05\n\n132,00\n\n9\nH\n0\n1\n(cid:142)\n\n9\nH\n0\n1\n(cid:142)\n\n(cid:68) 0,1 A\n\n+\n0,00\n70,00 - 0,02\n\n50,00\n\n(cid:76) 0,1\n\n(cid:76) 0,1\n\n0\n0\n,\n5\n3\n\nF\n\nE\n\nD\n\nC\n\nB\n\nAllgemeintoleranzen\n\nMATERIAL\n\nDIN ISO 2768 - mK\n\nPET BLACK\n\nFINISH\n\nEloxieren (natur)\n\nRa 1,6\n\nDate\n29.03.2021\n\nName\nLucas Giering\n\nDrawn\n\nChecked\n\nStandard\n\nArretierungshilfe\n\nA\n\n1 \n\nA2\n\n8\n\n7\n\n6\n\n5\n\n4\n\nState\n\nChanges\n\nDate\n\nName\n\n3\n\n2\n\n1\n\n\x0c'

Some PDF documents use reference to store filter params. Resolve them to allow proper extraction of content.

```
In [1]: import pdfplumber

In [2]: doc = pdfplumber.open("bill.pdf")

In [3]: s = doc.images[0]['stream']

In [4]: s.get_data()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 s.get_data()

File ~/.local/share/virtualenvs/pdfreader/lib/python3.11/site-packages/pdfminer/pdftypes.py:396, in PDFStream.get_data(self)
    394 def get_data(self) -> bytes:
    395     if self.data is None:
--> 396         self.decode()
    397     assert self.data is not None
    398     return self.data

File ~/.local/share/virtualenvs/pdfreader/lib/python3.11/site-packages/pdfminer/pdftypes.py:373, in PDFStream.decode(self)
    371                 raise PDFNotImplementedError("Unsupported filter: %r" % f)
    372             # apply predictors
--> 373             if params and "Predictor" in params:
    374                 pred = int_value(params["Predictor"])
    375                 if pred == 1:
    376                     # no predictor

TypeError: argument of type 'PDFObjRef' is not iterable

In [5]: s.get_filters()
Out[5]: [(/'FlateDecode', <PDFObjRef:21>)]
```
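The idea of resolving filter params before they are used can be sketched as follows. This is an illustration under stated assumptions, not the merged change: `FakeObjRef`, `resolve_ref` and `resolved_filters` are hypothetical names, and the object table is a plain dict. pdfminer.six itself provides a helper called `resolve1` in `pdfminer.pdftypes` that plays the role `resolve_ref` plays here.

```python
from typing import Any, Dict, List, Tuple


class FakeObjRef:
    """Hypothetical stand-in for an indirect reference; not pdfminer's class."""

    def __init__(self, objects: Dict[int, Any], objid: int) -> None:
        self.objects = objects
        self.objid = objid

    def resolve(self) -> Any:
        # Look the referenced object up in the (made-up) object table.
        return self.objects[self.objid]


def resolve_ref(value: Any) -> Any:
    # Follow references until a concrete value is reached.
    while isinstance(value, FakeObjRef):
        value = value.resolve()
    return value


def resolved_filters(pairs: List[Tuple[Any, Any]]) -> List[Tuple[Any, Any]]:
    # Resolve both the filter name and its params for every pair.
    return [(resolve_ref(f), resolve_ref(p)) for f, p in pairs]


# Made-up object table: object 21 holds the decode parameters.
objects = {21: {"Predictor": 12, "Columns": 4}}
pairs = [("FlateDecode", FakeObjRef(objects, 21))]
result = resolved_filters(pairs)
print(result)
```

Once the params are a plain dictionary, a membership test such as `"Predictor" in params` works and the predictor settings are no longer lost.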

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>

pietermarsman · 2024-01-16T19:08:46Z

Closed by #906

pietermarsman added the type: bug label Sep 10, 2020

pietermarsman added this to new in pdfminer.six via automation Sep 10, 2020

pietermarsman moved this from new to needs solution in pdfminer.six Sep 10, 2020

pietermarsman mentioned this issue Sep 10, 2020

bugfix PDFObjRef is not iterable #471

Closed

pietermarsman added the status: needs solution label Aug 7, 2022

ricardocarvalhods added a commit to ricardocarvalhods/pdfminer.six that referenced this issue Jan 27, 2023

Track the official solution in pdfminer#495

8d6468f

pdfminer#495

cmdlineluser mentioned this issue Jul 14, 2023

TypeError: argument of type 'PDFObjRef' is not iterable jsvine/pdfplumber#935

Open

EvaSDK added a commit to EvaSDK/pdfminer.six that referenced this issue Aug 5, 2023

Add test illustrating pdfminer#495 problem

04c88e8

EvaSDK mentioned this issue Aug 5, 2023

Fix #495: resolve params in PDFStream.get_filters #906

Merged

5 tasks

github-merge-queue bot pushed a commit that referenced this issue Jan 16, 2024

Fix #495: resolve params in PDFStream.get_filters (#906)

f428846

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>

pietermarsman closed this as completed Jan 16, 2024
