[BUG] Problem when trying to use PdfLoader on its own #367

tcy6 · 2024-05-27T16:38:27Z

Is there an existing issue / discussion for this?

I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

I have searched FAQ

Current Behavior

I added the following lines to self_pdf_loader.py to test how well it parses a PDF:
import os

# Resolve cat.pdf relative to the directory of this script.
file_path = 'cat.pdf'
file_path = os.path.abspath(os.path.join(os.path.dirname(__file__), file_path))
# Restrict parsing to the page range given by from_page / to_page.
loader = PdfLoader(filename=file_path, from_page=14, to_page=15, root_dir=os.path.dirname(file_path))
markdown_dir = loader.load_to_markdown()
docs = convert_markdown_to_langchaindoc(markdown_dir)
docs = PdfLoader.pdf_process(docs)
print(docs)

However, I ran into a problem where the checkpoints could not be found:
Traceback (most recent call last):
File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\core\test.py", line 203, in <module>
loader = PdfLoader(filename=file_path, root_dir=os.path.dirname(file_path))
File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\self_pdf_loader.py", line 14, in __init__
super().__init__()
File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\pdf_to_markdown\core\parser\pdf_parser.py", line 34, in __init__
self.layouter = LayoutRecognizer("layout")
File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\pdf_to_markdown\core\vision\layout_recognizer.py", line 20, in __init__
super().__init__(self.labels, domain, model_dir)
File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\pdf_to_markdown\core\vision\recognizer.py", line 21, in __init__
raise ValueError("not find model file path {}".format(
ValueError: not find model file path c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel/utils/loader/pdf_to_markdown\checkpoints/layout\layout.onnx

Expected Behavior

No response

Environment

- OS:
- NVIDIA Driver:
- CUDA:
- docker:
- docker-compose:
- NVIDIA GPU:
- NVIDIA GPU Memory:

QAnything logs

No response

Steps To Reproduce

No response

Anything else?

No response

milely · 2024-05-28T01:41:17Z

Please download the PDF parser checkpoints from ModelScope: https://www.modelscope.cn/models/netease-youdao/QAnything-pdf-parser/files
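
Before constructing PdfLoader, it may help to confirm that the downloaded checkpoints sit where the parser looks for them. The following is a minimal sketch, assuming the layout implied by the traceback above (`checkpoints/layout/layout.onnx` under the `pdf_to_markdown` directory). The helper name `missing_checkpoints` and the default file list are illustrative only; the real parser may require more files than the single one named in the error.

```python
from pathlib import Path
from typing import Iterable, List

# Only this file is named in the traceback; other required files are unknown here.
DEFAULT_REQUIRED = ("checkpoints/layout/layout.onnx",)


def missing_checkpoints(parser_root: str,
                        required: Iterable[str] = DEFAULT_REQUIRED) -> List[str]:
    """Return the relative paths under parser_root that do not exist as files."""
    root = Path(parser_root)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever QAnything is unpacked.
    root = "qanything_kernel/utils/loader/pdf_to_markdown"
    missing = missing_checkpoints(root)
    if missing:
        print("Missing checkpoint files:", missing)
    else:
        print("All listed checkpoint files are present.")
```

If the list is non-empty, the files from the ModelScope page have probably not been copied into the `checkpoints` directory yet.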

tcy6 · 2024-05-28T05:36:29Z

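OK, thank you very much. One more question: is QAnything unable to handle PDFs that contain no text elements? I tried parsing a screenshot and got an error. If that is the case, what is the OCR inside it for? Parsing tables?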

tcy6 · 2024-05-28T06:08:02Z

The error output is as follows:
<Logger debug_logger (INFO)> <Logger qa_logger (INFO)>
LOCAL DATA PATH: c:\Users\Administrator\Desktop\QAnything-1.4.1\QANY_DB\content
LOCAL_RERANK_REPO: netease-youdao/bce-reranker-base_v1
LOCAL_EMBED_REPO: netease-youdao/bce-embedding-base_v1
table model initing...
cpu
table model inited...
WARNING:root:Miss outlines
INFO:debug_logger:Start OCR！
1it [00:00, ?it/s]
INFO:debug_logger:OCR finished in 0.15695199999026954 seconds
preprocess
1it [00:00, ?it/s]
Traceback (most recent call last):
File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\core\test.py", line 204, in <module>
markdown_dir = loader.load_to_markdown()
File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\self_pdf_loader.py", line 53, in load_to_markdown
page_width = max([b["x1"] for b in self.boxes if b['layout_type'] == 'text']) - min(
ValueError: max() arg is an empty sequence

milely · 2024-05-28T07:07:52Z

The OCR module was removed because it was too slow, so the parser can currently handle only PDFs with an extractable text layer. Support for scanned, image-based PDFs will be added in the future behind a toggle switch.
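
The `max() arg is an empty sequence` error follows directly from this: when a page yields no boxes with `layout_type == 'text'`, the list passed to `max()` is empty. Below is a minimal sketch of a guard, not the project's actual fix. The function name `text_page_width` is hypothetical, and the `x0` key is an assumption, since the traceback truncates the original expression after `min(`.

```python
from typing import Dict, List, Optional


def text_page_width(boxes: List[Dict]) -> Optional[float]:
    """Width spanned by text boxes, or None when the page has no text boxes."""
    text_boxes = [b for b in boxes if b.get("layout_type") == "text"]
    if not text_boxes:
        # Pages without any text box end up here instead of raising ValueError.
        return None
    # Assumption: each box carries "x0" (left) and "x1" (right) coordinates.
    return max(b["x1"] for b in text_boxes) - min(b["x0"] for b in text_boxes)
```

A caller could treat a `None` result as "no extractable text on this page" and report that clearly rather than failing inside `max()`.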

tcy6 · 2024-05-28T15:17:16Z

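Got it, thanks a lot. Since it does not actually OCR PDFs, maybe the OCR-related parts of the PDF loader could be removed for now. Otherwise it is quite confusing, haha: the log clearly prints "OCR finished", yet no OCR has actually been run.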

xiehurricane · 2024-06-07T09:14:07Z

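Same here. Uploading a single-layer PDF that contains only images fails: no boxes are found and it errors out immediately. Stepping through the code, I found that there is no OCR.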

zhudongwork · 2024-06-24T09:32:34Z

I get the same error: Error in Powerful PDF parsing: max() arg is an empty sequence. The key point is that what I uploaded was a one-page paper PDF, not an image.

SoonyangZhang · 2024-06-27T02:27:01Z

You can preprocess the PDF with ocrmypdf.
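
As a rough sketch of that workaround: ocrmypdf can add a text layer to a scanned PDF before the file is handed to QAnything. The helper names below are illustrative; `-l` and `--skip-text` are ocrmypdf command-line options, and any language code you pass assumes the matching Tesseract language pack is installed. Whether QAnything then parses the result as expected has not been verified here.

```python
import shutil
import subprocess
from typing import List


def build_ocrmypdf_command(src: str, dst: str,
                           languages: str = "eng",
                           skip_text: bool = True) -> List[str]:
    """Build the ocrmypdf invocation that writes an OCR'd copy of src to dst."""
    cmd = ["ocrmypdf", "-l", languages]
    if skip_text:
        # Leave pages that already contain text untouched.
        cmd.append("--skip-text")
    cmd += [src, dst]
    return cmd


def add_text_layer(src: str, dst: str, languages: str = "eng") -> None:
    """Run ocrmypdf; raises if the tool is missing or exits with an error."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocrmypdf_command(src, dst, languages), check=True)
```

For example, `add_text_layer("scan.pdf", "scan_ocr.pdf", "chi_sim+eng")` would write an OCR'd copy that can then be passed to PdfLoader in place of the original file.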
