Description of the bug
The script is the same as in this issue; this PDF input triggers the error.
2024-07-23 14:50:10.631 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:116 - doc analyze cost: 21.791197299957275
2024-07-23 14:50:11.331 | WARNING | magic_pdf.pre_proc.equations_replace:replace_inline_equations:453 - 行内公式没有替换成功:{'bbox': [497, 444, 504, 453], 'sco
2024-07-23 14:50:12.614 | INFO | magic_pdf.pipe.UNIPipe:pipe_mk_markdown:48 - uni_pipe mk mm_markdown finished
2024-07-23 14:50:12.614 | INFO | __main__:process_pdf_file:41 - Processed '13.pdf' and generated '13.md'
root@034cece9ab64:/code# python main.py
2024-07-23 14:50:48.146 | ERROR | __main__:process_pdf_file:43 - Failed to process 54.pdf: 'PDFObjRef' object is not iterable
Traceback (most recent call last):
  File "/code/main.py", line 58, in <module>
    process_pdf_files_in_directory(directory_path)
    │ └ 'papers'
    └ <function process_pdf_files_in_directory at 0x7d2ac60936d0>
  File "/code/main.py", line 50, in process_pdf_files_in_directory
    process_pdf_file(directory, pdf_file)
    │ │ └ '54.pdf'
    │ └ 'papers'
    └ <function process_pdf_file at 0x7d2ac62fbd90>
> File "/code/main.py", line 28, in process_pdf_file
    pipe.pipe_classify()
    │ └ <function UNIPipe.pipe_classify at 0x7d2a42ebe290>
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7d2ac3c4fa00>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/pipe/UNIPipe.py", line 25, in pipe_classify
    self.pdf_type = AbsPipe.classify(self.pdf_bytes)
    │ │ │ │ │ └ b'%PDF-1.5\n%\xbf\xf7\xa2\xfe\n835 0 obj\n<< /Linearized 1 /L 1821626 /H [ 4464 431 ] /O 839 /E 76405 /N 12 /T 1816344
    │ │ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7d2ac3c4fa00>
    │ │ │ └ <staticmethod(<function AbsPipe.classify at 0x7d2a8d5a4c10>)>
    │ │ └ <class 'magic_pdf.pipe.AbsPipe.AbsPipe'>
    │ └ ''
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7d2ac3c4fa00>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/pipe/AbsPipe.py", line 63, in classify
    pdf_meta = pdf_meta_scan(pdf_bytes)
    │ └ b'%PDF-1.5\n%\xbf\xf7\xa2\xfe\n835 0 obj\n<< /Linearized 1 /L 1821626 /H [ 4464 431 ] /O 839 /E 76405 /N 12 /T 1816344 >>\nen...
    └ <function pdf_meta_scan at 0x7d2a8d5a40d0>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/filter/pdf_meta_scan.py", line 339, in pdf_meta_scan
    invalid_chars = check_invalid_chars(pdf_bytes)
    │ └ b'%PDF-1.5\n%\xbf\xf7\xa2\xfe\n835 0 obj\n<< /Linearized 1 /L 1821626 /H [ 4464 431 ] /O 839 /E 76405 /N 12 /T 1816344 >
    └ <function check_invalid_chars at 0x7d2a8d5a4040>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/filter/pdf_meta_scan.py", line 305, in check_invalid_chars
    return detect_invalid_chars(pdf_bytes)
    │ └ b'%PDF-1.5\n%\xbf\xf7\xa2\xfe\n835 0 obj\n<< /Linearized 1 /L 1821626 /H [ 4464 431 ] /O 839 /E 76405 /N 12 /T 1816344 >>\nen...
    └ <function detect_invalid_chars at 0x7d2a8d59fac0>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/libs/pdf_check.py", line 44, in detect_invalid_chars
    text = extract_text(sample_pdf_file_like_object)
    │ └ <_io.BytesIO object at 0x7d2a42e6ff10>
    └ <function extract_text at 0x7d2a9299e4d0>
  File "/opt/mineru_venv/lib/python3.10/site-packages/pdfminer/high_level.py", line 169, in extract_text
    for page in PDFPage.get_pages(
    │ └ <classmethod(<function PDFPage.get_pages at 0x7d2a8d58e170>)>
    └ <class 'pdfminer.pdfpage.PDFPage'>
  File "/opt/mineru_venv/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 171, in get_pages
    for (pageno, page) in enumerate(cls.create_pages(doc)):
    │ │ └ <pdfminer.pdfdocument.PDFDocument object at 0x7d2a42eb1990>
    │ └ <classmethod(<function PDFPage.create_pages at 0x7d2a8d58e0e0>)>
    └ <class 'pdfminer.pdfpage.PDFPage'>
  File "/opt/mineru_venv/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 127, in create_pages
    yield cls(document, objid, tree, next(page_labels))
    │ │ │ │ └ repeat(None)
    │ │ │ └ {'Type': /'Page', 'Contents': [<PDFObjRef:3>], 'Resources': <PDFObjRef:4>, 'MediaBox': <PDFObjRef:26>, 'Annots': [<PDFObjRef:...
    │ │ └ 27
    │ └ <pdfminer.pdfdocument.PDFDocument object at 0x7d2a42eb1990>
    └ <class 'pdfminer.pdfpage.PDFPage'>
  File "/opt/mineru_venv/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 63, in __init__
    mediabox_params: List[Any] = [
    │ └ typing.Any
    └ typing.List
TypeError: 'PDFObjRef' object is not iterable
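The traceback shows the page's 'MediaBox' entry arriving as an indirect reference (<PDFObjRef:26>) and then being iterated directly in PDFPage.__init__. The sketch below illustrates that failure mode only; it is not pdfminer's code. `IndirectRef`, `resolve_if_ref`, and `mediabox_floats` are hypothetical stand-ins, and the box values are made up. In pdfminer.six itself, `resolve1` from `pdfminer.pdftypes` is the helper that dereferences such objects.

```python
from typing import Any, Dict, List


class IndirectRef:
    """Hypothetical stand-in for an indirect PDF object reference."""

    def __init__(self, objid: int, table: Dict[int, Any]) -> None:
        self.objid = objid
        self._table = table

    def resolve(self) -> Any:
        # Look the referenced object up in the (made-up) object table.
        return self._table[self.objid]


def resolve_if_ref(value: Any) -> Any:
    """Follow indirect references until a concrete value is reached."""
    while isinstance(value, IndirectRef):
        value = value.resolve()
    return value


def mediabox_floats(page_attrs: Dict[str, Any]) -> List[float]:
    """Resolve the box itself, then each coordinate, before converting."""
    box = resolve_if_ref(page_attrs["MediaBox"])
    return [float(resolve_if_ref(x)) for x in box]


# Illustrative data: object 26 holds a made-up media box.
objects = {26: [0, 0, 612, 792]}
page = {"MediaBox": IndirectRef(26, objects)}

try:
    # Iterating the unresolved reference fails, as in the traceback.
    [float(x) for x in page["MediaBox"]]
except TypeError as exc:
    print(f"unresolved reference: {exc}")

print(mediabox_floats(page))
```

Resolving the reference before iterating avoids the TypeError in this toy model; whether the upstream fix takes the same shape is not confirmed here.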
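How to reproduce the bug
Reproduce with this file:
54.pdf
Operating system
Linux
Python version
3.10
Software version (magic-pdf --version)
0.6.x
Device mode
cuda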
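Another user reported this problem earlier: #191. It is a new bug introduced in the latest release of pdfminer.six. I tested version 20231228 and it behaves correctly, so I recommend running
pip install pdfminer.six==20231228
to resolve the issue. Fix reference: 27e98a8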
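To check whether an environment still needs the pin, something like the following could be run before processing PDFs. It is a minimal sketch, not part of magic-pdf: the helper names are hypothetical, and it assumes pdfminer.six keeps using date-based version strings (YYYYMMDD), so that comparing the leading digits as integers orders releases correctly.

```python
from importlib import metadata
from typing import Optional

# Version reported in this thread as working.
KNOWN_GOOD = "20231228"


def installed_pdfminer_version() -> Optional[str]:
    """Return the installed pdfminer.six version, or None if it is absent."""
    try:
        return metadata.version("pdfminer.six")
    except metadata.PackageNotFoundError:
        return None


def leading_int(version: str) -> int:
    """Parse the leading run of digits; assumes a date-based version string."""
    digits = ""
    for ch in version:
        if not ch.isdigit():
            break
        digits += ch
    return int(digits) if digits else 0


def is_newer_than_known_good(version: str, known_good: str = KNOWN_GOOD) -> bool:
    """True when the version is later than the known-good release."""
    return leading_int(version) > leading_int(known_good)


if __name__ == "__main__":
    current = installed_pdfminer_version()
    if current is None:
        print("pdfminer.six is not installed")
    elif is_newer_than_known_good(current):
        print(f"pdfminer.six {current} is newer than {KNOWN_GOOD}; "
              f"consider: pip install pdfminer.six=={KNOWN_GOOD}")
    else:
        print(f"pdfminer.six {current} is not newer than {KNOWN_GOOD}")
```

The check only flags versions later than the known-good one; it does not know which later releases actually contain the bug or the fix.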