PDFObjRef is not iterable #495

Closed
pietermarsman opened this issue Sep 10, 2020 · 3 comments

pietermarsman commented Sep 10, 2020

Bug report

Copy of #471 (by @imochoa)

Sadly, I cannot upload the problematic PDFs due to a non-disclosure agreement. I can, however, point out the issue and share my fix.

When trying to instantiate a PDFCIDFont object at:

font = PDFCIDFont(self, spec)

Where

self
<pdfminer.pdfinterp.PDFResourceManager at 0x7f0c80493048>

spec
{'Type': /'Font',
 'Subtype': /'CIDFontType2',
 'BaseFont': /'ISOCPEUR',
 'FontDescriptor': <PDFObjRef:5>,
 'CIDSystemInfo': <PDFObjRef:41>,
 'W': <PDFObjRef:13>,
 'Encoding': /'Identity-H'}

The execution ends up at PDFStream.decode with:

self
<PDFStream(7): raw=8963, {'Length1': 28048, 'Length': 8961, 'Filter': /'FlateDecode', 'DecodeParms': <PDFObjRef:39>}>

The origin of the bug is that in this case, get_filters(self)

filters = self.get_filters()

returns:

 In[8]: filters
Out[8]: [(/'FlateDecode', <PDFObjRef:39>)]

As you can see, the second element of the first and only tuple is a PDFObjRef. It is then assigned to params, and the code fails a little further down when evaluating 'Predictor' in params:

if params and 'Predictor' in params:

I noticed that the default value of params is an empty dictionary, which is falsy and so effectively skips that check. I therefore extended the check to continue only if params is a dictionary:
https://github.com/imochoa/pdfminer.six/blob/2d996c9ae26c8a336711178a8afe3091e1140970/pdfminer/pdftypes.py#L297
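
The failure and the guard described above can be reproduced without the confidential PDFs. What follows is only a minimal sketch: `FakeObjRef` is a hypothetical stand-in for an unresolved reference (it is not pdfminer's `PDFObjRef`), and `has_predictor` is an illustrative helper, not a function in the library.

```python
class FakeObjRef:
    """Hypothetical stand-in for an unresolved indirect reference.

    It defines neither __contains__ nor __iter__, so `in` raises TypeError.
    """

    def __init__(self, objid):
        self.objid = objid


def has_predictor(params):
    """Return True only when params is a dict that contains 'Predictor'."""
    return isinstance(params, dict) and "Predictor" in params


ref = FakeObjRef(39)

# The original check: `ref` is truthy, so Python goes on to evaluate `in`,
# which fails because the object is not a container.
try:
    unguarded = bool(ref and "Predictor" in ref)
except TypeError:
    unguarded = None

# The extended check: the reference is skipped and no exception is raised.
guarded = has_predictor(ref)
with_dict = has_predictor({"Predictor": 12})
empty = has_predictor({})
```

One consequence of this approach, at least in the sketch, is that a stream whose parameters sit behind a reference is treated as having no predictor at all, so the guard avoids the crash without reading the referenced parameters.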

I think the underlying error is that get_filters(self) returns a PDFObjRef instead of a dictionary. I've tried to find the exact origin, but I'm not too familiar with the project and couldn't pinpoint it. The furthest I got was that, in PDFParser, the problematic <PDFObjRef:39> value is already present in the dictionary passed in at:

obj = PDFStream(dic, data, self.doc.decipher)

as:

dic['DecodeParms']
<PDFObjRef:39>
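
To illustrate how a reference stored under 'DecodeParms' can end up as the second element of a filter tuple, here is a deliberately simplified sketch. `pair_filters` and `FakeObjRef` are hypothetical names, and the pairing logic is an assumption made for illustration; it is not a copy of pdfminer's `get_filters`.

```python
class FakeObjRef:
    """Hypothetical stand-in for an unresolved indirect reference."""

    def __init__(self, objid):
        self.objid = objid


def pair_filters(dic):
    """Pair each filter name with its parameters, without resolving anything."""
    filters = dic.get("Filter")
    # Assumed default, matching the empty-dictionary default noted earlier.
    params = dic.get("DecodeParms", {})
    if filters is None:
        return []
    if not isinstance(filters, list):
        filters = [filters]
    if not isinstance(params, list):
        # A single params object is shared by every filter.
        params = [params] * len(filters)
    return list(zip(filters, params))


ref = FakeObjRef(39)
with_ref = pair_filters({"Filter": "FlateDecode", "DecodeParms": ref})
without_params = pair_filters({"Filter": "FlateDecode"})
no_filter = pair_filters({"Length": 8961})
```

In this sketch the reference is simply carried along untouched, which is consistent with the `[(/'FlateDecode', <PDFObjRef:39>)]` value shown above.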

Since I couldn't prevent it from coming up, refining the check seemed like the next best option and it works well on the 2 problematic PDFs I have.

@pietermarsman pietermarsman added this to new in pdfminer.six via automation Sep 10, 2020
@pietermarsman pietermarsman moved this from new to needs solution in pdfminer.six Sep 10, 2020
ricardocarvalhods added a commit to ricardocarvalhods/pdfminer.six that referenced this issue Jan 27, 2023

imochoa commented Jun 16, 2023

Hello again :)

I found a PDF I can share where the issue is happening, in the text box at the bottom right where it says "PET BLACK"

pdfminer_testpart.pdf


EvaSDK commented Aug 5, 2023

Jumping in as I had the same issue with a document containing PII, which I am not able to share.
The minimal code sample to reproduce is the following:

from pdfminer.high_level import extract_text
extract_text("./pdfminer_testpart.pdf")

It should return:

'8\n\n7\n\n6\n\n5\n\n4\n\n3\n\n2\n\n1\n\n150,00\n\n30,00\n\n(cid:72) 0,05 A\n\n0\n0\n,\n0\n2\n\n0\n0\n,\n8\n\n(cid:69) 0,05\n\n0\n0\n,\n0\n5\n\nA\n\nF\n\nE\n\nD\n\n20,00\n\n16,00\n\n+\n0,05\n15,00 - 0,00\n\nC\n\n0\n0\n,\n0\n4\n\n0\n0\n,\n0\n2\n\nR 1 8 , 0 0\n\nM12x1.75 - 6H\n\n0\n0\n,\n5\n4\n\nB\n\nA\n\n0\n0\n,\n6\n1\n(cid:142)\n\n0\n0\n,\n6\n1\n\n+\n0,50\n15,00 - 0,00\n\n60,00 (cid:66)0,02\n\n100,00 (cid:66)0,05\n\n132,00\n\n9\nH\n0\n1\n(cid:142)\n\n9\nH\n0\n1\n(cid:142)\n\n(cid:68) 0,1 A\n\n+\n0,00\n70,00 - 0,02\n\n50,00\n\n(cid:76) 0,1\n\n(cid:76) 0,1\n\n0\n0\n,\n5\n3\n\nF\n\nE\n\nD\n\nC\n\nB\n\nAllgemeintoleranzen\n\nMATERIAL\n\nDIN ISO 2768 - mK\n\nPET BLACK\n\nFINISH\n\nEloxieren (natur)\n\nRa 1,6\n\nDate\n29.03.2021\n\nName\nLucas Giering\n\nDrawn\n\nChecked\n\nStandard\n\nArretierungshilfe\n\nA\n\n1 \n\nA2\n\n8\n\n7\n\n6\n\n5\n\n4\n\nState\n\nChanges\n\nDate\n\nName\n\n3\n\n2\n\n1\n\n\x0c'

EvaSDK added a commit to EvaSDK/pdfminer.six that referenced this issue Aug 5, 2023
EvaSDK added a commit to EvaSDK/pdfminer.six that referenced this issue Aug 5, 2023
Some PDF documents use reference to store filter params. Resolve them to
allow proper extraction of content.

```
In [1]: import pdfplumber

In [2]: doc = pdfplumber.open("bill.pdf")

In [3]: s = doc.images[0]['stream']

In [4]: s.get_data()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 s.get_data()

File ~/.local/share/virtualenvs/pdfreader/lib/python3.11/site-packages/pdfminer/pdftypes.py:396, in PDFStream.get_data(self)
    394 def get_data(self) -> bytes:
    395     if self.data is None:
--> 396         self.decode()
    397         assert self.data is not None
    398     return self.data

File ~/.local/share/virtualenvs/pdfreader/lib/python3.11/site-packages/pdfminer/pdftypes.py:373, in PDFStream.decode(self)
    371     raise PDFNotImplementedError("Unsupported filter: %r" % f)
    372 # apply predictors
--> 373 if params and "Predictor" in params:
    374     pred = int_value(params["Predictor"])
    375     if pred == 1:
    376         # no predictor

TypeError: argument of type 'PDFObjRef' is not iterable

In [5]: s.get_filters()
Out[5]: [(/'FlateDecode', <PDFObjRef:21>)]
```
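
The idea in the commit message, resolving the referenced filter params before using them, can be sketched as follows. This is an illustration under stated assumptions and not necessarily how the referenced commits do it: `FakeObjRef`, its lookup table, and `resolve_params` are hypothetical stand-ins. pdfminer itself ships a `resolve1` helper in `pdfminer.pdftypes` for following references, which the sketch does not use so that it runs without the library.

```python
class FakeObjRef:
    """Hypothetical stand-in for an indirect reference backed by a lookup table."""

    def __init__(self, objid, table):
        self.objid = objid
        self._table = table

    def resolve(self):
        """Return the object this reference points to."""
        return self._table[self.objid]


def resolve_params(params):
    """Follow references until a concrete (non-reference) object is reached."""
    while isinstance(params, FakeObjRef):
        params = params.resolve()
    return params


# Object 21 holds the real parameters; object 22 is a reference to object 21.
objects = {}
objects[21] = {"Predictor": 12, "Columns": 4}
objects[22] = FakeObjRef(21, objects)

filters = [("FlateDecode", FakeObjRef(21, objects))]
resolved = [(name, resolve_params(p)) for name, p in filters]

name, params = resolved[0]
# With a plain dict in hand, the membership test from decode() is safe.
predictor = params["Predictor"] if params and "Predictor" in params else None

chained = resolve_params(FakeObjRef(22, objects))
```

Unlike an `isinstance` guard, resolving first means the predictor settings stored behind the reference are actually read.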
github-merge-queue bot pushed a commit that referenced this issue Jan 16, 2024
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
@pietermarsman
Member Author

Closed by #906
