[BUG] Error in Powerful PDF parsing,强力解析报错 #405

Open
2 tasks done
allentern opened this issue Jun 18, 2024 · 2 comments


@allentern

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

  • I have searched FAQ

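Current Behavior

Python mode, running entirely on CPU and calling an external LLM.
Powerful PDF parsing is enabled in the config:

# PDF parsing parameters
pdf_config = {
    # Whether to use the fast PDF parser. When set to False, the optimized PDF parser is used, at the cost of speed.
    "USE_FAST_PDF_PARSER": False
}

After starting the service and uploading a PDF, the backend log shows:

Error in Powerful PDF parsing: PdfLoader.__init__() got an unexpected keyword argument 'root_dir', use fast PDF parser instead.
...
insert_to_faiss: success num: 1, failed num: 0

The log shows that powerful parsing fails and the fast parser is used instead.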
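
For readers unfamiliar with this kind of failure, here is a minimal sketch of the pattern the log suggests: a constructor is called with a keyword it does not declare, Python raises a TypeError, and the caller catches it and falls back. The `PdfLoader` class and the `choose_parser` helper below are hypothetical stand-ins written for illustration; they are not QAnything's actual code, and the real call site may look different.

```python
class PdfLoader:
    """Hypothetical stand-in whose constructor does not declare root_dir."""

    def __init__(self, filename):
        self.filename = filename


def choose_parser(filename, use_fast_pdf_parser):
    """Return (parser_name, error_message) for the given setting.

    Hypothetical helper: it tries the "powerful" loader first when the fast
    parser is disabled, and falls back to "fast" if construction fails.
    """
    if use_fast_pdf_parser:
        return "fast", None
    try:
        # The call passes root_dir, which the constructor above does not accept,
        # so Python raises TypeError before any parsing starts.
        PdfLoader(filename=filename, root_dir="/tmp/pdf_out")
        return "powerful", None
    except TypeError as exc:
        return "fast", str(exc)


if __name__ == "__main__":
    parser, error = choose_parser("example.pdf", use_fast_pdf_parser=False)
    print(f"parser used: {parser}")
    print(f"Error in Powerful PDF parsing: {error}, use fast PDF parser instead.")
```

If this is what happens here, the fallback hides the real problem: the upload still succeeds, so the mismatch between the call site and the installed `PdfLoader` only shows up as a WARNING in the log.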

Expected Behavior

Powerful parsing should run without errors.
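
One way to narrow this down is to check whether the `PdfLoader` class that actually gets imported accepts `root_dir` at all. The sketch below uses the standard library's `inspect.signature` for that. `OldPdfLoader` and `NewPdfLoader` are hypothetical classes used only to demonstrate the check; this report does not say which module the real `PdfLoader` comes from, so you would need to import it yourself and pass it in.

```python
import inspect


def accepts_keyword(callable_obj, name):
    """Return True if callable_obj can be called with `name=...` as a keyword."""
    params = inspect.signature(callable_obj).parameters
    # A **kwargs parameter accepts any keyword name.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return True
    param = params.get(name)
    return param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )


class OldPdfLoader:
    """Hypothetical loader without a root_dir parameter."""

    def __init__(self, filename):
        self.filename = filename


class NewPdfLoader:
    """Hypothetical loader that declares root_dir."""

    def __init__(self, filename, root_dir=None):
        self.filename = filename
        self.root_dir = root_dir


if __name__ == "__main__":
    print(accepts_keyword(OldPdfLoader, "root_dir"))
    print(accepts_keyword(NewPdfLoader, "root_dir"))
```

If the check returns False for the installed class, the caller and the loader are probably from different versions, which would explain the error.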

Environment

- OS: Ubuntu 22.04
- NVIDIA Driver: none
- CUDA: none
- docker: none
- docker-compose: none
- NVIDIA GPU: none
- NVIDIA GPU Memory: none

QAnything logs

Contents of debug.log:

2024-06-18 10:40:56,518 - [PID: 88643][MainProcess] - [Function: upload_files] - INFO - upload_files zzp
2024-06-18 10:40:56,520 - [PID: 88643][MainProcess] - [Function: upload_files] - INFO - mode: strong
2024-06-18 10:40:56,524 - [PID: 88643][MainProcess] - [Function: check_kb_exist] - INFO - check_kb_exist [('KB2baad59dd8b346f79ae06061c86da883',)]
2024-06-18 10:40:56,525 - [PID: 88643][MainProcess] - [Function: upload_files] - INFO - ori name: 建筑光伏系统应用技术标准.pdf
2024-06-18 10:40:56,525 - [PID: 88643][MainProcess] - [Function: upload_files] - INFO - decode name: 建筑光伏系统应用技术标准.pdf
2024-06-18 10:40:56,525 - [PID: 88643][MainProcess] - [Function: upload_files] - INFO - cleaned name: 建筑光伏系统应用技术标准.pdf
2024-06-18 10:40:56,526 - [PID: 88643][MainProcess] - [Function: check_user_exist_] - INFO - check_user_exist [('zzp',)]
2024-06-18 10:40:56,527 - [PID: 88643][MainProcess] - [Function: check_kb_exist] - INFO - check_kb_exist [('KB2baad59dd8b346f79ae06061c86da883',)]
2024-06-18 10:40:56,530 - [PID: 88643][MainProcess] - [Function: add_file] - INFO - add_file: e87590666140418eba9d0f135d5ea390
2024-06-18 10:40:56,530 - [PID: 88643][MainProcess] - [Function: upload_files] - INFO - 建筑光伏系统应用技术标准.pdf, e87590666140418eba9d0f135d5ea390, success
2024-06-18 10:40:56,541 - [PID: 88643][MainProcess] - [Function: __init__] - INFO - success init localfile 建筑光伏系统应用技术标准.pdf
2024-06-18 10:40:56,545 - [PID: 88643][MainProcess] - [Function: insert_files_to_faiss] - INFO - insert_files_to_faiss: KB2baad59dd8b346f79ae06061c86da883
2024-06-18 10:40:56,546 - [PID: 88643][MainProcess] - [Function: split_file_to_docs] - WARNING - Error in Powerful PDF parsing: PdfLoader.__init__() got an unexpected keyword argument 'root_dir', use fast PDF parser instead.
2024-06-18 10:40:57,513 - [PID: 88643][MainProcess] - [Function: split_file_to_docs] - INFO - before 2nd split doc lens: 8
2024-06-18 10:40:57,514 - [PID: 88643][MainProcess] - [Function: split_file_to_docs] - INFO - after 2nd split doc lens: 8
2024-06-18 10:40:57,515 - [PID: 88643][MainProcess] - [Function: split_file_to_docs] - INFO - langchain analysis content head: 住房城乡建设部信息公开
浏览专用
住房城乡建设部信息公开
浏览专用
住房城乡建设部信息公开
浏览专用
住房城乡建设部信息公开
浏览专用
住房城乡建设部信息公开
浏览
2024-06-18 10:40:57,515 - [PID: 88643][MainProcess] - [Function: inner] - INFO - 函数 split_file_to_docs 执行耗时: 0.9691917896270752 秒
2024-06-18 10:40:57,518 - [PID: 88643][MainProcess] - [Function: insert_files_to_faiss] - INFO - split time: 0.9694967269897461 8
2024-06-18 10:40:57,521 - [PID: 88643][MainProcess] - [Function: load_vector_store] - INFO - load faiss index: /root/QAnything/QANY_DB/faiss/KB2baad59dd8b346f79ae06061c86da883/faiss_index
2024-06-18 10:40:58,044 - [PID: 88643][MainProcess] - [Function: _load_kb_to_memory] - INFO - FAISS load kb_ids: ['KB2baad59dd8b346f79ae06061c86da883']
2024-06-18 10:40:58,046 - [PID: 88643][MainProcess] - [Function: get_len_safe_embeddings] - INFO - embedding number: 1
2024-06-18 10:40:59,334 - [PID: 88643][MainProcess] - [Function: get_embedding] - INFO - onnx infer time: 1.2814881801605225
2024-06-18 10:40:59,337 - [PID: 88643][MainProcess] - [Function: get_embedding] - INFO - embedding shape: (8, 768)
2024-06-18 10:40:59,342 - [PID: 88643][MainProcess] - [Function: inner] - INFO - 函数 get_len_safe_embeddings 执行耗时: 1.2964568138122559 秒
2024-06-18 10:40:59,357 - [PID: 88643][MainProcess] - [Function: add_document] - INFO - add documents number: 8
2024-06-18 10:40:59,363 - [PID: 88643][MainProcess] - [Function: add_document] - INFO - save faiss index: /root/QAnything/QANY_DB/faiss/KB2baad59dd8b346f79ae06061c86da883/faiss_index
2024-06-18 10:40:59,363 - [PID: 88643][MainProcess] - [Function: insert_files_to_faiss] - INFO - insert time: 1.847867727279663
2024-06-18 10:40:59,365 - [PID: 88643][MainProcess] - [Function: insert_files_to_faiss] - INFO - insert_to_faiss: success num: 1, failed num: 0
2024-06-18 10:41:22,223 - [PID: 88643][MainProcess] - [Function: list_docs] - INFO - list_docs zzp
2024-06-18 10:41:22,224 - [PID: 88643][MainProcess] - [Function: list_docs] - INFO - kb_id: KB2baad59dd8b346f79ae06061c86da883
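
Because the fallback is silent from the user's point of view, it can help to filter debug.log for WARNING and ERROR entries. The sketch below parses the line layout seen in the log above (timestamp, PID, process name, function, level, message) and attaches continuation lines to the preceding record. The regular expression is inferred from these sample lines only and is an assumption; other QAnything log lines may not match it.

```python
import re

# Pattern inferred from the log lines quoted in this issue (an assumption).
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"\[PID: (?P<pid>\d+)\]\[(?P<process>[^\]]+)\] - "
    r"\[Function: (?P<function>[^\]]+)\] - "
    r"(?P<level>[A-Z]+) - (?P<message>.*)$"
)


def parse_log(lines):
    """Turn raw log lines into a list of dicts, one per log record."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        match = LOG_LINE.match(line)
        if match:
            records.append(match.groupdict())
        elif records:
            # Lines without a timestamp prefix continue the previous message.
            records[-1]["message"] += "\n" + line
    return records


def problems_only(records):
    """Keep only WARNING and ERROR records."""
    return [r for r in records if r["level"] in ("WARNING", "ERROR")]


if __name__ == "__main__":
    # "debug.log" is a placeholder path; point it at your own log file.
    with open("debug.log", encoding="utf-8") as handle:
        for record in problems_only(parse_log(handle)):
            print(record["time"], record["function"], record["message"])
```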

Steps To Reproduce

1. Run in Python mode, entirely on CPU, calling an external LLM.
2. Enable powerful PDF parsing in the config.
3. Start the service.
4. Upload a PDF and watch the log.

Anything else?

No response

@allentern allentern changed the title [BUG] <title> [BUG] Error in Powerful PDF parsing,强力解析报错 Jun 18, 2024
@Sonder-JX

The same problem. Any solution?

@fi5ee

fi5ee commented Jul 11, 2024

I ran into the same problem.
