Description:
I am using markitdown with the markitdown-ocr plugin to convert scanned PDF documents into Markdown.
However, when processing a multi-page scanned PDF, the tool extracts only the content of the first page and ignores the remaining pages.
Screenshots:
- code
- result
Environment Information
- OS: Ubuntu 24.04.2 LTS
- Python: 3.12.13
- markitdown info:
  - Name: markitdown
  - Version: 0.1.5
  - Summary: Utility tool for converting various files to Markdown
  - Home-page:
  - Author:
  - Author-email: Adam Fourney <adamfo@microsoft.com>
  - License-Expression: MIT
  - Location: /root/miniconda3/envs/markitdown/lib/python3.12/site-packages
  - Requires: beautifulsoup4, charset-normalizer, defusedxml, magika, markdownify, requests
  - Required-by: markitdown-ocr
- llm_client: I use the API from Bailian (Aliyun).
  - I have tested several models, and the results are the same.
- Tested file: 👇
  4-2南实党委〔 2024〕20号-关于印发《南湖实验室采购管理办法实施细则(试行)》的通知.pdf