[BUG] Problem when trying to use PdfLoader on its own #367

Open
2 tasks done
tcy6 opened this issue May 27, 2024 · 9 comments

Comments

@tcy6

tcy6 commented May 27, 2024

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

  • I have searched FAQ

Current Behavior

I added the following lines to self_pdf_loader.py to test how well the PDF parsing works:
file_path = 'cat.pdf'
file_path = os.path.abspath(os.path.join(os.path.dirname(__file__), file_path))
loader = PdfLoader(filename=file_path, from_page=14, to_page=15, root_dir=os.path.dirname(file_path))
markdown_dir = loader.load_to_markdown()
docs = convert_markdown_to_langchaindoc(markdown_dir)
docs = PdfLoader.pdf_process(docs)
print(docs)

However, I ran into a problem: the checkpoints cannot be found.
Traceback (most recent call last):
  File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\core\test.py", line 203, in <module>
    loader = PdfLoader(filename=file_path, root_dir=os.path.dirname(file_path))
  File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\self_pdf_loader.py", line 14, in __init__
    super().__init__()
  File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\pdf_to_markdown\core\parser\pdf_parser.py", line 34, in __init__
    self.layouter = LayoutRecognizer("layout")
  File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\pdf_to_markdown\core\vision\layout_recognizer.py", line 20, in __init__
    super().__init__(self.labels, domain, model_dir)
  File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\pdf_to_markdown\core\vision\recognizer.py", line 21, in __init__
    raise ValueError("not find model file path {}".format(
ValueError: not find model file path c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel/utils/loader/pdf_to_markdown\checkpoints/layout\layout.onnx

Expected Behavior

No response

Environment

- OS:
- NVIDIA Driver:
- CUDA:
- docker:
- docker-compose:
- NVIDIA GPU:
- NVIDIA GPU Memory:

QAnything logs

No response

Steps To Reproduce

No response

Anything else?

No response

@milely
Collaborator

milely commented May 28, 2024

Please download the PDF parser checkpoints from ModelScope: [https://www.modelscope.cn/models/netease-youdao/QAnything-pdf-parser/files]
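
A minimal sketch for checking that the downloaded checkpoint landed where the loader looks for it, before constructing PdfLoader. The relative path is copied from the ValueError message in the traceback above; the function name `find_missing_checkpoints` is hypothetical and not part of QAnything, and other checkpoint files may be required that this thread does not mention.

```python
from pathlib import Path

# Relative location taken from the ValueError message in the traceback.
LAYOUT_CHECKPOINT = Path(
    "qanything_kernel/utils/loader/pdf_to_markdown/checkpoints/layout/layout.onnx"
)


def find_missing_checkpoints(repo_root, relative_paths=(LAYOUT_CHECKPOINT,)):
    """Return the expected checkpoint paths under repo_root that are not files.

    Hypothetical helper for illustration; it only checks for existence and
    does not validate the model files themselves.
    """
    root = Path(repo_root)
    return [root / rel for rel in relative_paths if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the QAnything checkout directory (assumption).
    for path in find_missing_checkpoints("."):
        print(f"missing checkpoint: {path}")
```

If the script prints nothing, the layout model is at the path the traceback complained about.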

@tcy6
Author

tcy6 commented May 28, 2024

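OK, thank you very much. Another question: is QAnything unable to handle PDFs that contain no text elements? I took a screenshot and tried to parse it, and got an error. If that is the case, what is the point of the OCR inside it? Is it for parsing tables?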

@tcy6
Author

tcy6 commented May 28, 2024

The error output is as follows:
<Logger debug_logger (INFO)> <Logger qa_logger (INFO)>
LOCAL DATA PATH: c:\Users\Administrator\Desktop\QAnything-1.4.1\QANY_DB\content
LOCAL_RERANK_REPO: netease-youdao/bce-reranker-base_v1
LOCAL_EMBED_REPO: netease-youdao/bce-embedding-base_v1
table model initing...
cpu
table model inited...
WARNING:root:Miss outlines
INFO:debug_logger:Start OCR!
1it [00:00, ?it/s]
INFO:debug_logger:OCR finished in 0.15695199999026954 seconds
preprocess
1it [00:00, ?it/s]
Traceback (most recent call last):
  File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\core\test.py", line 204, in <module>
    markdown_dir = loader.load_to_markdown()
  File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\self_pdf_loader.py", line 53, in load_to_markdown
    page_width = max([b["x1"] for b in self.boxes if b['layout_type'] == 'text']) - min(
ValueError: max() arg is an empty sequence
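
The failure above comes from calling max() on an empty list: no box has layout_type 'text'. A minimal sketch of guarding that computation is shown below. It is not the project's fix. The helper name `text_page_width` is hypothetical, and the "x0" key is an assumption, since the traceback only shows "x1" and cuts off at `min(`.

```python
def text_page_width(boxes):
    """Return the horizontal span covered by 'text' boxes, or None if there are none.

    Assumption: each box is a dict with "layout_type", "x0" (left edge) and
    "x1" (right edge). Only "x1" and "layout_type" are visible in the traceback.
    """
    text_boxes = [b for b in boxes if b.get("layout_type") == "text"]
    if not text_boxes:
        # Nothing to measure, e.g. an image-only page with no text layer.
        return None
    return max(b["x1"] for b in text_boxes) - min(b["x0"] for b in text_boxes)


if __name__ == "__main__":
    boxes = [
        {"layout_type": "text", "x0": 10, "x1": 50},
        {"layout_type": "figure", "x0": 0, "x1": 100},
    ]
    print(text_page_width(boxes))
    print(text_page_width([]))
```

Returning None lets the caller decide whether to skip the page or report a clearer error than "max() arg is an empty sequence".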

@milely
Collaborator

milely commented May 28, 2024

The OCR module was removed because it was too slow, so the loader can currently only handle parseable PDF files. Support for scanned, image-based PDF files will be added in the future behind a toggle switch.
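
Since only parseable PDFs are supported, it can help to check for a text layer before handing a file to the loader. The sketch below is one way to do that, assuming the third-party pypdf package is installed. It is not part of QAnything, and both function names are hypothetical.

```python
def pages_without_text(page_texts, min_chars=1):
    """Return 1-based numbers of pages with fewer than min_chars non-whitespace characters."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len("".join((text or "").split())) < min_chars
    ]


def extract_page_texts(pdf_path):
    """Extract the text of each page with pypdf (third-party, assumed installed)."""
    from pypdf import PdfReader  # imported here so the pure helper works without pypdf

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


if __name__ == "__main__":
    import sys

    empty_pages = pages_without_text(extract_page_texts(sys.argv[1]))
    if empty_pages:
        print(f"pages with no extractable text: {empty_pages}")
```

A page listed here is likely a scanned image and would need OCR before it can be parsed. Whether a page with text still produces 'text' layout boxes depends on the layout model, which this check does not cover.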

@tcy6
Author

tcy6 commented May 28, 2024

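OK, thanks a lot. Since it does not OCR PDFs, I think the OCR-related parts of the PDF loader could be removed for now. Otherwise it is quite confusing, haha: the log prints "OCR finished", but no OCR actually happens.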

@xiehurricane

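Same here. Uploading a single-layer PDF that contains only images fails: no boxes are found and it errors out immediately. Stepping through the code, I found there is no OCR.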

@Tendo33

Tendo33 commented Jun 21, 2024

I'm speechless too.

@zhudongwork

I get the same error: Error in Powerful PDF parsing: max() arg is an empty sequence. The key point is that I uploaded a one-page paper PDF, not an image.

@SoonyangZhang

Replying to @xiehurricane's comment about image-only PDFs: you can use ocrmypdf to process the PDF.
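
A minimal sketch of that workaround, assuming the ocrmypdf command-line tool is installed and on PATH. It adds a text layer to a scanned PDF so the result can then be passed to the loader. The helper names and file names are hypothetical; `--skip-text` asks ocrmypdf to leave pages that already contain text untouched.

```python
import shutil
import subprocess


def build_ocr_command(input_pdf, output_pdf, skip_text=True):
    """Build the ocrmypdf command line: ocrmypdf [--skip-text] INPUT OUTPUT."""
    cmd = ["ocrmypdf"]
    if skip_text:
        # Do not OCR pages that already have a text layer.
        cmd.append("--skip-text")
    cmd += [str(input_pdf), str(output_pdf)]
    return cmd


def add_text_layer(input_pdf, output_pdf):
    """Run ocrmypdf and return the output path; raises if the tool is missing or fails."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocr_command(input_pdf, output_pdf), check=True)
    return output_pdf


if __name__ == "__main__":
    # Hypothetical file names for illustration.
    add_text_layer("scanned.pdf", "scanned_with_text.pdf")
```

The OCR'd copy, not the original scan, is what would be handed to PdfLoader afterwards.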

6 participants