Failed to process 54.pdf: 'PDFObjRef' object is not iterable #198

Closed
Lincyaw opened this issue Jul 23, 2024 · 1 comment
Labels
upstream bug (bug outside this package)

Comments

@Lincyaw
Contributor

Lincyaw commented Jul 23, 2024

Description of the bug | 错误描述

The script is the same as the one in this issue. This PDF input triggers the following error:

2024-07-23 14:50:10.631 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:116 - doc analyze cost: 21.791197299957275
2024-07-23 14:50:11.331 | WARNING  | magic_pdf.pre_proc.equations_replace:replace_inline_equations:453 - 行内公式没有替换成功:{'bbox': [497, 444, 504, 453], 'sco
2024-07-23 14:50:12.614 | INFO     | magic_pdf.pipe.UNIPipe:pipe_mk_markdown:48 - uni_pipe mk mm_markdown finished
2024-07-23 14:50:12.614 | INFO     | __main__:process_pdf_file:41 - Processed '13.pdf' and generated '13.md'
root@034cece9ab64:/code# python main.py
2024-07-23 14:50:48.146 | ERROR    | __main__:process_pdf_file:43 - Failed to process 54.pdf: 'PDFObjRef' object is not iterable
Traceback (most recent call last):

  File "/code/main.py", line 58, in <module>
    process_pdf_files_in_directory(directory_path)
    │                              └ 'papers'
    └ <function process_pdf_files_in_directory at 0x7d2ac60936d0>

  File "/code/main.py", line 50, in process_pdf_files_in_directory
    process_pdf_file(directory, pdf_file)
    │                │          └ '54.pdf'
    │                └ 'papers'
    └ <function process_pdf_file at 0x7d2ac62fbd90>

> File "/code/main.py", line 28, in process_pdf_file
    pipe.pipe_classify()
    │    └ <function UNIPipe.pipe_classify at 0x7d2a42ebe290>
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7d2ac3c4fa00>

  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/pipe/UNIPipe.py", line 25, in pipe_classify
    self.pdf_type = AbsPipe.classify(self.pdf_bytes)
    │    │          │       │        │    └ b'%PDF-1.5\n%\xbf\xf7\xa2\xfe\n835 0 obj\n<< /Linearized 1 /L 1821626 /H [ 4464 431 ] /O 839 /E 76405 /N 12 /T 1816344
    │    │          │       │        └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7d2ac3c4fa00>
    │    │          │       └ <staticmethod(<function AbsPipe.classify at 0x7d2a8d5a4c10>)>
    │    │          └ <class 'magic_pdf.pipe.AbsPipe.AbsPipe'>
    │    └ ''
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7d2ac3c4fa00>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/pipe/AbsPipe.py", line 63, in classify
    pdf_meta = pdf_meta_scan(pdf_bytes)
               │             └ b'%PDF-1.5\n%\xbf\xf7\xa2\xfe\n835 0 obj\n<< /Linearized 1 /L 1821626 /H [ 4464 431 ] /O 839 /E 76405 /N 12 /T 1816344 >>\nen...
               └ <function pdf_meta_scan at 0x7d2a8d5a40d0>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/filter/pdf_meta_scan.py", line 339, in pdf_meta_scan
    invalid_chars = check_invalid_chars(pdf_bytes)
                    │                   └ b'%PDF-1.5\n%\xbf\xf7\xa2\xfe\n835 0 obj\n<< /Linearized 1 /L 1821626 /H [ 4464 431 ] /O 839 /E 76405 /N 12 /T 1816344 >
                    └ <function check_invalid_chars at 0x7d2a8d5a4040>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/filter/pdf_meta_scan.py", line 305, in check_invalid_chars
    return detect_invalid_chars(pdf_bytes)
           │                    └ b'%PDF-1.5\n%\xbf\xf7\xa2\xfe\n835 0 obj\n<< /Linearized 1 /L 1821626 /H [ 4464 431 ] /O 839 /E 76405 /N 12 /T 1816344 >>\nen...
           └ <function detect_invalid_chars at 0x7d2a8d59fac0>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/libs/pdf_check.py", line 44, in detect_invalid_chars
    text = extract_text(sample_pdf_file_like_object)
           │            └ <_io.BytesIO object at 0x7d2a42e6ff10>
           └ <function extract_text at 0x7d2a9299e4d0>
  File "/opt/mineru_venv/lib/python3.10/site-packages/pdfminer/high_level.py", line 169, in extract_text
    for page in PDFPage.get_pages(
                │       └ <classmethod(<function PDFPage.get_pages at 0x7d2a8d58e170>)>
                └ <class 'pdfminer.pdfpage.PDFPage'>
  File "/opt/mineru_venv/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 171, in get_pages
    for (pageno, page) in enumerate(cls.create_pages(doc)):
                                    │   │            └ <pdfminer.pdfdocument.PDFDocument object at 0x7d2a42eb1990>
                                    │   └ <classmethod(<function PDFPage.create_pages at 0x7d2a8d58e0e0>)>
                                    └ <class 'pdfminer.pdfpage.PDFPage'>
  File "/opt/mineru_venv/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 127, in create_pages
    yield cls(document, objid, tree, next(page_labels))
          │   │         │      │          └ repeat(None)
          │   │         │      └ {'Type': /'Page', 'Contents': [<PDFObjRef:3>], 'Resources': <PDFObjRef:4>, 'MediaBox': <PDFObjRef:26>, 'Annots': [<PDFObjRef:...
          │   │         └ 27
          │   └ <pdfminer.pdfdocument.PDFDocument object at 0x7d2a42eb1990>
          └ <class 'pdfminer.pdfpage.PDFPage'>
  File "/opt/mineru_venv/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 63, in __init__
    mediabox_params: List[Any] = [
                     │    └ typing.Any
                     └ typing.List

TypeError: 'PDFObjRef' object is not iterable
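
The last frames suggest that the page's `MediaBox` entry is itself an indirect reference (`<PDFObjRef:26>`) and that the list comprehension at `pdfpage.py` line 63 iterates over it directly. The sketch below reproduces that failure mode with a hypothetical `ObjRef` stand-in and shows the usual remedy of resolving the reference before iterating. It is a toy model under those assumptions, not pdfminer.six's code or its actual fix, and the box values are made up for illustration.

```python
class ObjRef:
    """Hypothetical stand-in for an indirect PDF object reference.

    This is NOT pdfminer's PDFObjRef; it only mimics the relevant behaviour:
    it points at another object and is not iterable itself.
    """

    def __init__(self, objid, table):
        self.objid = objid
        self._table = table

    def resolve(self):
        return self._table[self.objid]


def resolve(value):
    """Follow references until a direct (non-reference) value is reached."""
    while isinstance(value, ObjRef):
        value = value.resolve()
    return value


def read_mediabox(attrs):
    """Return the MediaBox as a list of direct values."""
    # Resolve the box first (it may be a reference), then each entry.
    box = resolve(attrs["MediaBox"])
    return [resolve(item) for item in box]


# Illustrative object table: object 26 holds the box, as in the traceback.
objects = {26: [0, 0, 612, 792]}
attrs = {"MediaBox": ObjRef(26, objects)}

try:
    # Iterating over the reference itself fails, like the frame above.
    [item for item in attrs["MediaBox"]]
except TypeError as exc:
    print("direct iteration failed:", exc)

print("resolved first:", read_mediabox(attrs))
```

Resolving the container and its entries separately covers both cases: a box stored inline whose numbers are references, and a box that is a reference as a whole.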

How to reproduce the bug | 如何复现

Reproduce with this file:

54.pdf

Operating system | 操作系统

Linux

Python version | Python 版本

3.10

Software version | 软件版本 (magic-pdf --version)

0.6.x

Device mode | 设备模式

cuda

@Lincyaw Lincyaw added the bug (Something isn't working) label Jul 23, 2024
@myhloli
Collaborator

myhloli commented Jul 23, 2024

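Another user reported this problem earlier: #191
It is a new bug introduced in the latest release of pdfminer.six. I tested version 20231228 and it works correctly, so I recommend running

pip install pdfminer.six==20231228

to work around the problem.
Fix reference: 27e98a8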

@myhloli myhloli closed this as completed Jul 23, 2024
@myhloli myhloli added the upstream bug (bug outside this package) label and removed the bug (Something isn't working) label Jul 24, 2024