Failed to process 54.pdf: 'PDFObjRef' object is not iterable #198

Lincyaw · 2024-07-23T14:53:40Z

Description of the bug

The script is the same as the one in this issue; this PDF input triggers the error:

2024-07-23 14:50:10.631 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:116 - doc analyze cost: 21.791197299957275
2024-07-23 14:50:11.331 | WARNING  | magic_pdf.pre_proc.equations_replace:replace_inline_equations:453 - 行内公式没有替换成功：{'bbox': [497, 444, 504, 453], 'sco
2024-07-23 14:50:12.614 | INFO     | magic_pdf.pipe.UNIPipe:pipe_mk_markdown:48 - uni_pipe mk mm_markdown finished
2024-07-23 14:50:12.614 | INFO     | __main__:process_pdf_file:41 - Processed '13.pdf' and generated '13.md'
root@034cece9ab64:/code# python main.py
2024-07-23 14:50:48.146 | ERROR    | __main__:process_pdf_file:43 - Failed to process 54.pdf: 'PDFObjRef' object is not iterable
Traceback (most recent call last):

  File "/code/main.py", line 58, in <module>
    process_pdf_files_in_directory(directory_path)
    │                              └ 'papers'
    └ <function process_pdf_files_in_directory at 0x7d2ac60936d0>

  File "/code/main.py", line 50, in process_pdf_files_in_directory
    process_pdf_file(directory, pdf_file)
    │                │          └ '54.pdf'
    │                └ 'papers'
    └ <function process_pdf_file at 0x7d2ac62fbd90>

> File "/code/main.py", line 28, in process_pdf_file
    pipe.pipe_classify()
    │    └ <function UNIPipe.pipe_classify at 0x7d2a42ebe290>
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7d2ac3c4fa00>

  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/pipe/UNIPipe.py", line 25, in pipe_classify
    self.pdf_type = AbsPipe.classify(self.pdf_bytes)
    │    │          │       │        │    └ b'%PDF-1.5\n%\xbf\xf7\xa2\xfe\n835 0 obj\n<< /Linearized 1 /L 1821626 /H [ 4464 431 ] /O 839 /E 76405 /N 12 /T 1816344
    │    │          │       │        └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7d2ac3c4fa00>
    │    │          │       └ <staticmethod(<function AbsPipe.classify at 0x7d2a8d5a4c10>)>
    │    │          └ <class 'magic_pdf.pipe.AbsPipe.AbsPipe'>
    │    └ ''
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7d2ac3c4fa00>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/pipe/AbsPipe.py", line 63, in classify
    pdf_meta = pdf_meta_scan(pdf_bytes)
               │             └ b'%PDF-1.5\n%\xbf\xf7\xa2\xfe\n835 0 obj\n<< /Linearized 1 /L 1821626 /H [ 4464 431 ] /O 839 /E 76405 /N 12 /T 1816344 >>\nen...
               └ <function pdf_meta_scan at 0x7d2a8d5a40d0>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/filter/pdf_meta_scan.py", line 339, in pdf_meta_scan
    invalid_chars = check_invalid_chars(pdf_bytes)
                    │                   └ b'%PDF-1.5\n%\xbf\xf7\xa2\xfe\n835 0 obj\n<< /Linearized 1 /L 1821626 /H [ 4464 431 ] /O 839 /E 76405 /N 12 /T 1816344 >
                    └ <function check_invalid_chars at 0x7d2a8d5a4040>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/filter/pdf_meta_scan.py", line 305, in check_invalid_chars
    return detect_invalid_chars(pdf_bytes)
           │                    └ b'%PDF-1.5\n%\xbf\xf7\xa2\xfe\n835 0 obj\n<< /Linearized 1 /L 1821626 /H [ 4464 431 ] /O 839 /E 76405 /N 12 /T 1816344 >>\nen...
           └ <function detect_invalid_chars at 0x7d2a8d59fac0>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/libs/pdf_check.py", line 44, in detect_invalid_chars
    text = extract_text(sample_pdf_file_like_object)
           │            └ <_io.BytesIO object at 0x7d2a42e6ff10>
           └ <function extract_text at 0x7d2a9299e4d0>
  File "/opt/mineru_venv/lib/python3.10/site-packages/pdfminer/high_level.py", line 169, in extract_text
    for page in PDFPage.get_pages(
                │       └ <classmethod(<function PDFPage.get_pages at 0x7d2a8d58e170>)>
                └ <class 'pdfminer.pdfpage.PDFPage'>
  File "/opt/mineru_venv/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 171, in get_pages
    for (pageno, page) in enumerate(cls.create_pages(doc)):
                                    │   │            └ <pdfminer.pdfdocument.PDFDocument object at 0x7d2a42eb1990>
                                    │   └ <classmethod(<function PDFPage.create_pages at 0x7d2a8d58e0e0>)>
                                    └ <class 'pdfminer.pdfpage.PDFPage'>
  File "/opt/mineru_venv/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 127, in create_pages
    yield cls(document, objid, tree, next(page_labels))
          │   │         │      │          └ repeat(None)
          │   │         │      └ {'Type': /'Page', 'Contents': [<PDFObjRef:3>], 'Resources': <PDFObjRef:4>, 'MediaBox': <PDFObjRef:26>, 'Annots': [<PDFObjRef:...
          │   │         └ 27
          │   └ <pdfminer.pdfdocument.PDFDocument object at 0x7d2a42eb1990>
          └ <class 'pdfminer.pdfpage.PDFPage'>
  File "/opt/mineru_venv/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 63, in __init__
    mediabox_params: List[Any] = [
                     │    └ typing.Any
                     └ typing.List

TypeError: 'PDFObjRef' object is not iterable
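
The traceback shows `'MediaBox': <PDFObjRef:26>` in the page attributes and the failure inside the `mediabox_params` list comprehension, which suggests the page's `MediaBox` is stored as an indirect reference and is iterated before it is resolved. The sketch below reproduces that pattern with stand-in code only: `ObjRef`, `resolve`, and `read_mediabox` are hypothetical names for illustration, not pdfminer.six classes or functions, and the actual upstream fix may differ.

```python
class ObjRef:
    """Hypothetical stand-in for an indirect PDF object reference."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target


def resolve(obj):
    """Follow indirect references until a direct object is reached."""
    while isinstance(obj, ObjRef):
        obj = obj.resolve()
    return obj


def read_mediabox(attrs):
    """Resolve the MediaBox entry itself first, then each of its elements."""
    box = resolve(attrs["MediaBox"])
    return [resolve(value) for value in box]


# A page whose MediaBox is an indirect reference, as in the traceback above.
page_attrs = {"MediaBox": ObjRef([0, 0, ObjRef(612), 792])}

# Iterating the reference directly fails the same way the traceback does.
try:
    list(page_attrs["MediaBox"])
    iterated_directly = True
except TypeError:
    iterated_directly = False

# Resolving before iterating yields the plain list of numbers.
mediabox = read_mediabox(page_attrs)
print(iterated_directly, mediabox)
```

If this reading is right, the problem is specific to PDFs that store `MediaBox` indirectly, which would explain why `13.pdf` in the same run was processed without error.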

How to reproduce the bug

Use this file to reproduce:

54.pdf

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cuda


myhloli · 2024-07-23T15:49:24Z

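Another user reported this problem earlier: #191
This is a new bug introduced in the latest release of pdfminer.six. I tested version 20231228 and it behaves correctly, so I recommend running

pip install pdfminer.six==20231228

to work around the problem.
Fix reference: 27e98a8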
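
For anyone scripting batch runs, the check below is a minimal sketch of how the installed pdfminer.six version could be compared against the release reported to work here. It uses only the standard library's `importlib.metadata`; the helper names `pdfminer_version` and `advice` are made up for this example, and treating 20231228 as the only known-good release is an assumption taken from this thread, not a statement about other versions.

```python
from importlib import metadata

# Release reported in this thread to process the file correctly.
KNOWN_GOOD = "20231228"


def pdfminer_version():
    """Return the installed pdfminer.six version string, or None if absent."""
    try:
        return metadata.version("pdfminer.six")
    except metadata.PackageNotFoundError:
        return None


def advice(version, known_good=KNOWN_GOOD):
    """Build a short hint based on the installed version (hypothetical helper)."""
    if version is None:
        return "pdfminer.six is not installed"
    if version == known_good:
        return f"pdfminer.six {version} matches the release reported to work"
    return (
        f"pdfminer.six {version} found; if \"'PDFObjRef' object is not iterable\" "
        f"appears, try: pip install pdfminer.six=={known_good}"
    )


print(advice(pdfminer_version()))
```

Running this before `pipe.pipe_classify()` in a batch script gives an early hint instead of a per-file traceback; it does not change any behavior on its own.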

Lincyaw added the bug Something isn't working label Jul 23, 2024

myhloli mentioned this issue Jul 23, 2024

TypeError: 'PDFObjRef' object is not iterable pdfminer/pdfminer.six#1004

Open

myhloli closed this as completed Jul 23, 2024

myhloli added upstream bug bug outside this package and removed bug Something isn't working labels Jul 24, 2024
