# PDF to Markdown Tool

This notebook uses the MinerU library to convert PDF files to Markdown.
For details, see
https://github.com/opendatalab/MinerU

## Step 1: Set the Environment Variable
Start by setting the model download source. If model downloads fail, modelscope can be used as a fallback source.


In [1]:
import os

# Set this environment variable if you run into problems downloading the models
os.environ['MINERU_MODEL_SOURCE'] = "modelscope"
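
If the download source may already have been chosen outside the notebook (for example in the shell that launched Jupyter), a gentler variant is to set it only when it is missing. The following is a minimal sketch using the standard-library `os.environ.setdefault`; the variable name and the `"modelscope"` value come from the cell above, and nothing else about MinerU's configuration is assumed.

```python
import os

# setdefault writes the value only when the variable is not already set,
# so a source chosen before starting Jupyter is left untouched.
model_source = os.environ.setdefault("MINERU_MODEL_SOURCE", "modelscope")

print(f"MINERU_MODEL_SOURCE = {model_source}")
```

The returned value is whatever ends up in the environment, which makes it easy to log which source is actually in effect.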

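## Step 2: Create the Directory Structure

Create the required directories:
- `pdf_articles`: holds the PDF files to be processed
- `md_articles`: holds the converted Markdown files

The required libraries are imported at the same time.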


In [2]:
import os
from pathlib import Path
import requests

# Create the directory that holds the sample PDF files
pdf_files_dir = Path("pdf_articles")
pdf_files_dir.mkdir(exist_ok=True)

# Create the output directory
output_dir = Path("md_articles")
output_dir.mkdir(exist_ok=True)
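
The two `mkdir` calls can be folded into one helper when more directories are needed. This is a sketch only: `ensure_dirs` is a hypothetical name introduced here, and the demo runs in a temporary directory so it does not touch the notebook's working folder.

```python
import tempfile
from pathlib import Path


def ensure_dirs(base: Path, names: list[str]) -> dict[str, Path]:
    """Create each named directory under `base` and return a name -> path mapping."""
    created = {}
    for name in names:
        path = base / name
        # parents=True also creates missing ancestors; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
        created[name] = path
    return created


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp) / "workspace"
    dirs = ensure_dirs(workspace, ["pdf_articles", "md_articles"])
    # A second call must not raise, because exist_ok=True is set
    ensure_dirs(workspace, ["pdf_articles", "md_articles"])
    all_created = all(path.is_dir() for path in dirs.values())

print(all_created)
```

In the notebook itself, `ensure_dirs(Path("."), ["pdf_articles", "md_articles"])` would create the same two directories as the cell above.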


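## Step 3: Scan for Files to Process

Scan the `pdf_articles` directory for all PDF and image files and build the list of files to process.

Supported file formats:
- PDF files: `.pdf`
- Image files: `.png`, `.jpeg`, `.jpg`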


In [3]:
# Supported file types
pdf_suffixes = [".pdf"]
image_suffixes = [".png", ".jpeg", ".jpg"]

# Find all supported documents in the source directory
doc_path_list = []
for doc_path in pdf_files_dir.glob('*'):
    if doc_path.suffix in pdf_suffixes + image_suffixes:
        doc_path_list.append(doc_path)

print(f"Number of documents to parse: {len(doc_path_list)}")
print(f"Documents to parse: {doc_path_list}")

Number of documents to parse: 5
Documents to parse: [WindowsPath('pdf_articles/article1.pdf'), WindowsPath('pdf_articles/article2.pdf'), WindowsPath('pdf_articles/article3.pdf'), WindowsPath('pdf_articles/article4.pdf'), WindowsPath('pdf_articles/article5.pdf')]
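
The suffix comparison in the scan is case-sensitive, so a file named `scan.PDF` would be skipped, and `glob` does not promise a particular order. The sketch below shows one possible variant that lowercases the suffix and sorts the result. `find_documents` is a hypothetical helper, not part of MinerU, and the file names in the demo are made up.

```python
import tempfile
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpeg", ".jpg"}


def find_documents(src_dir: Path, suffixes: set[str] = SUPPORTED_SUFFIXES) -> list[Path]:
    """Return supported files in `src_dir`, sorted by path for a stable order."""
    return sorted(
        p for p in src_dir.iterdir()
        # Lowercase the suffix so ".PDF" and ".pdf" are treated the same
        if p.is_file() and p.suffix.lower() in suffixes
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["b.PDF", "a.pdf", "scan.jpg", "notes.txt"]:
        (root / name).touch()
    found_names = [p.name for p in find_documents(root)]

print(found_names)  # ['a.pdf', 'b.PDF', 'scan.jpg']
```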


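## Step 4: Define the PDF Parsing Functions

This is the core PDF parsing module. It contains two main functions:

### `do_parse` function
- Performs the actual PDF parsing
- Supports two backends: pipeline and vlm (vision-language model)
- Can produce several output formats: Markdown, JSON, images, and so on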
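
The way `do_parse` chooses between the two backends can be summarized in a few lines. The sketch below mirrors the branching in the function body in this notebook: `"pipeline"` keeps the requested parse method, while any other value has an optional `vlm-` prefix removed and its parse method forced to `"vlm"`. `resolve_backend` is a hypothetical helper written for illustration; it is not a MinerU function.

```python
def resolve_backend(backend: str, parse_method: str = "auto") -> tuple[str, str]:
    """Return (engine name, effective parse method), following do_parse's branching."""
    if backend == "pipeline":
        # The pipeline backend honors the requested method (auto, txt, or ocr)
        return backend, parse_method
    # Any other backend is treated as a VLM backend; drop the "vlm-" prefix if present
    engine = backend[4:] if backend.startswith("vlm-") else backend
    return engine, "vlm"


for name in ["pipeline", "vlm-transformers", "vlm-sglang-client"]:
    print(name, "->", resolve_backend(name))
```

One consequence visible in `do_parse` is that the `method` argument has no effect once a VLM backend is selected.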

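### `parse_doc` function
- High-level interface that simplifies calling the PDF parser
- Supports batch processing of multiple documents
- Exposes a rich set of configuration parameters

Reference: https://github.com/opendatalab/MinerU/blob/master/demo/demo.py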

In [4]:
import copy
import json
import os
from pathlib import Path

from loguru import logger
from mineru.cli.common import convert_pdf_bytes_to_bytes_by_pypdfium2, prepare_env, read_fn
from mineru.data.data_reader_writer import FileBasedDataWriter
from mineru.utils.draw_bbox import draw_layout_bbox, draw_span_bbox
from mineru.utils.enum_class import MakeMode
from mineru.backend.vlm.vlm_analyze import doc_analyze as vlm_doc_analyze
from mineru.backend.pipeline.pipeline_analyze import doc_analyze as pipeline_doc_analyze
from mineru.backend.pipeline.pipeline_middle_json_mkcontent import union_make as pipeline_union_make
from mineru.backend.pipeline.model_json_to_middle_json import result_to_middle_json as pipeline_result_to_middle_json
from mineru.backend.vlm.vlm_middle_json_mkcontent import union_make as vlm_union_make
from mineru.utils.models_download_utils import auto_download_and_get_model_root_path

def do_parse(
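    output_dir,  # Output directory for the parsing results
    pdf_file_names: list[str],  # Names of the PDF files to parse
    pdf_bytes_list: list[bytes],  # Raw bytes of the PDF files to parse
    p_lang_list: list[str],  # List of languages, one per PDF (parse_doc passes 'ch', Chinese, by default)
    backend="pipeline",  # Backend used to parse the PDF, default 'pipeline'
    parse_method="auto",  # Method used to parse the PDF, default 'auto'
    formula_enable=True,  # Enable formula parsing
    table_enable=True,  # Enable table parsing
    server_url=None,  # Server URL for the vlm-sglang-client backend
    f_draw_layout_bbox=True,  # Whether to draw layout bounding boxes
    f_draw_span_bbox=True,  # Whether to draw text-span bounding boxes
    f_dump_md=True,  # Whether to write the Markdown file
    f_dump_middle_json=True,  # Whether to write the intermediate JSON file
    f_dump_model_output=True,  # Whether to write the raw model output file
    f_dump_orig_pdf=True,  # Whether to write a copy of the original PDF
    f_dump_content_list=True,  # Whether to write the content list file
    f_make_md_mode=MakeMode.MM_MD,  # Mode used to build the Markdown content, default MM_MD
    start_page_id=0,  # ID of the first page to parse, default 0
    end_page_id=None,  # ID of the last page to parse, default None (parse to the end of the document)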
):

    if backend == "pipeline":
        for idx, pdf_bytes in enumerate(pdf_bytes_list):
            new_pdf_bytes = convert_pdf_bytes_to_bytes_by_pypdfium2(pdf_bytes, start_page_id, end_page_id)
            pdf_bytes_list[idx] = new_pdf_bytes

        infer_results, all_image_lists, all_pdf_docs, lang_list, ocr_enabled_list = pipeline_doc_analyze(pdf_bytes_list, p_lang_list, parse_method=parse_method, formula_enable=formula_enable,table_enable=table_enable)

        for idx, model_list in enumerate(infer_results):
            model_json = copy.deepcopy(model_list)
            pdf_file_name = pdf_file_names[idx]
            local_image_dir, local_md_dir = prepare_env(output_dir, pdf_file_name, parse_method)
            image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(local_md_dir)

            images_list = all_image_lists[idx]
            pdf_doc = all_pdf_docs[idx]
            _lang = lang_list[idx]
            _ocr_enable = ocr_enabled_list[idx]
            middle_json = pipeline_result_to_middle_json(model_list, images_list, pdf_doc, image_writer, _lang, _ocr_enable, formula_enable)

            pdf_info = middle_json["pdf_info"]

            pdf_bytes = pdf_bytes_list[idx]
            if f_draw_layout_bbox:
                draw_layout_bbox(pdf_info, pdf_bytes, local_md_dir, f"{pdf_file_name}_layout.pdf")

            if f_draw_span_bbox:
                draw_span_bbox(pdf_info, pdf_bytes, local_md_dir, f"{pdf_file_name}_span.pdf")

            if f_dump_orig_pdf:
                md_writer.write(
                    f"{pdf_file_name}_origin.pdf",
                    pdf_bytes,
                )

            if f_dump_md:
                image_dir = str(os.path.basename(local_image_dir))
                md_content_str = pipeline_union_make(pdf_info, f_make_md_mode, image_dir)
                md_writer.write_string(
                    f"{pdf_file_name}.md",
                    md_content_str,
                )

            if f_dump_content_list:
                image_dir = str(os.path.basename(local_image_dir))
                content_list = pipeline_union_make(pdf_info, MakeMode.CONTENT_LIST, image_dir)
                md_writer.write_string(
                    f"{pdf_file_name}_content_list.json",
                    json.dumps(content_list, ensure_ascii=False, indent=4),
                )

            if f_dump_middle_json:
                md_writer.write_string(
                    f"{pdf_file_name}_middle.json",
                    json.dumps(middle_json, ensure_ascii=False, indent=4),
                )

            if f_dump_model_output:
                md_writer.write_string(
                    f"{pdf_file_name}_model.json",
                    json.dumps(model_json, ensure_ascii=False, indent=4),
                )

            logger.info(f"local output dir is {local_md_dir}")
    else:
        if backend.startswith("vlm-"):
            backend = backend[4:]

        f_draw_span_bbox = False
        parse_method = "vlm"
        for idx, pdf_bytes in enumerate(pdf_bytes_list):
            pdf_file_name = pdf_file_names[idx]
            pdf_bytes = convert_pdf_bytes_to_bytes_by_pypdfium2(pdf_bytes, start_page_id, end_page_id)
            local_image_dir, local_md_dir = prepare_env(output_dir, pdf_file_name, parse_method)
            image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(local_md_dir)
            middle_json, infer_result = vlm_doc_analyze(pdf_bytes, image_writer=image_writer, backend=backend, server_url=server_url)

            pdf_info = middle_json["pdf_info"]

            if f_draw_layout_bbox:
                draw_layout_bbox(pdf_info, pdf_bytes, local_md_dir, f"{pdf_file_name}_layout.pdf")

            if f_draw_span_bbox:
                draw_span_bbox(pdf_info, pdf_bytes, local_md_dir, f"{pdf_file_name}_span.pdf")

            if f_dump_orig_pdf:
                md_writer.write(
                    f"{pdf_file_name}_origin.pdf",
                    pdf_bytes,
                )

            if f_dump_md:
                image_dir = str(os.path.basename(local_image_dir))
                md_content_str = vlm_union_make(pdf_info, f_make_md_mode, image_dir)
                md_writer.write_string(
                    f"{pdf_file_name}.md",
                    md_content_str,
                )

            if f_dump_content_list:
                image_dir = str(os.path.basename(local_image_dir))
                content_list = vlm_union_make(pdf_info, MakeMode.CONTENT_LIST, image_dir)
                md_writer.write_string(
                    f"{pdf_file_name}_content_list.json",
                    json.dumps(content_list, ensure_ascii=False, indent=4),
                )

            if f_dump_middle_json:
                md_writer.write_string(
                    f"{pdf_file_name}_middle.json",
                    json.dumps(middle_json, ensure_ascii=False, indent=4),
                )

            if f_dump_model_output:
                model_output = ("\n" + "-" * 50 + "\n").join(infer_result)
                md_writer.write_string(
                    f"{pdf_file_name}_model_output.txt",
                    model_output,
                )

            logger.info(f"local output dir is {local_md_dir}")


def parse_doc(
        path_list: list[Path],
        output_dir,
        lang="ch",
        backend="pipeline",
        method="auto",
        server_url=None,
        start_page_id=0,
        end_page_id=None
):
    """
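        Parameters:
        path_list: List of document paths to parse; each can be a PDF or an image file.
        output_dir: Output directory for the parsing results.
        lang: Language option, default 'ch'. Possible values include ['ch', 'ch_server', 'ch_lite', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka'].
            Optional. Give the language of the input PDF (if known) to improve OCR accuracy.
            Applies only when the backend is set to "pipeline".
        backend: Backend used to parse the PDF:
            pipeline: More general.
            vlm-transformers: More general.
            vlm-sglang-engine: Faster (engine).
            vlm-sglang-client: Faster (client).
            Defaults to pipeline when no backend is specified.
        method: Method used to parse the PDF:
            auto: Determine the method automatically based on the file type.
            txt: Use text extraction.
            ocr: Use OCR for image-based PDFs.
            Defaults to 'auto' when no method is specified.
            Applies only when the backend is set to "pipeline".
        server_url: Required when the backend is `sglang-client`, for example `http://127.0.0.1:30000`
        start_page_id: ID of the first page to parse, default 0
        end_page_id: ID of the last page to parse, default None (parse to the end of the document)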
    """
    try:
        file_name_list = []
        pdf_bytes_list = []
        lang_list = []
        for path in path_list:
            file_name = str(Path(path).stem)
            pdf_bytes = read_fn(path)
            file_name_list.append(file_name)
            pdf_bytes_list.append(pdf_bytes)
            lang_list.append(lang)
        do_parse(
            output_dir=output_dir,
            pdf_file_names=file_name_list,
            pdf_bytes_list=pdf_bytes_list,
            p_lang_list=lang_list,
            backend=backend,
            parse_method=method,
            server_url=server_url,
            start_page_id=start_page_id,
            end_page_id=end_page_id
        )
    except Exception as e:
        logger.exception(e)
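
Because `parse_doc` names each result after `Path(path).stem`, two inputs that differ only in extension (say `report.pdf` and `report.png`) end up with the same output name and, with the same parse method, would appear to target the same output location. The sketch below is one way to detect that before parsing. `find_stem_collisions` is a hypothetical helper and the paths are made-up examples.

```python
from collections import defaultdict
from pathlib import Path


def find_stem_collisions(paths: list[Path]) -> dict[str, list[Path]]:
    """Group paths by stem and keep only the stems shared by more than one file."""
    by_stem: dict[str, list[Path]] = defaultdict(list)
    for path in paths:
        by_stem[Path(path).stem].append(Path(path))
    return {stem: group for stem, group in by_stem.items() if len(group) > 1}


docs = [
    Path("pdf_articles/report.pdf"),
    Path("pdf_articles/report.png"),
    Path("pdf_articles/notes.pdf"),
]
collisions = find_stem_collisions(docs)

for stem, group in collisions.items():
    print(f"'{stem}' is shared by {len(group)} files: {[p.name for p in group]}")
```

An empty result means every document has its own output name and the batch can be passed to `parse_doc` as is.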




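## Step 5: Run the PDF Parsing

Now run the PDF parsing. The `parse_doc` function processes all the PDF files found by the earlier scan.

**Parameters:**
- `doc_path_list`: list of document paths to process
- `output_dir`: output directory (md_articles)
- `backend="pipeline"`: parse with the pipeline backend

**Parsing process:**
1. Read the PDF file contents
2. Analyze the documents and extract their content
3. Generate the Markdown output files
4. Save the related intermediate files and images

**Outputs:**
- Markdown files (.md)
- A copy of the original PDF
- PDFs visualizing the layout and text-span bounding boxes
- Intermediate JSON files
- Extracted image files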
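
To pick up the generated Markdown afterwards, it helps to know where each file lands. The run log of this notebook reports output directories such as `md_articles\article1\auto`, and `do_parse` writes `<name>.md` into that directory, which suggests the layout `<output_dir>/<stem>/<method>/<stem>.md`. The sketch below builds that path under this assumption. `expected_markdown_path` is a hypothetical helper, and the layout may differ in other MinerU versions.

```python
import tempfile
from pathlib import Path


def expected_markdown_path(output_dir: Path, doc_path: Path, method: str = "auto") -> Path:
    """Build <output_dir>/<stem>/<method>/<stem>.md (layout inferred from the run log)."""
    stem = Path(doc_path).stem
    return Path(output_dir) / stem / method / f"{stem}.md"


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    md_path = expected_markdown_path(out, Path("pdf_articles/article1.pdf"))
    # Simulate a finished run by creating the file where the parser would write it
    md_path.parent.mkdir(parents=True)
    md_path.write_text("# demo", encoding="utf-8")
    relative = md_path.relative_to(out).as_posix()
    exists = md_path.is_file()

print(relative, exists)  # article1/auto/article1.md True
```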

Note: the first run needs to download the model weights.

In [5]:
parse_doc(doc_path_list, output_dir, backend="pipeline")

2025-07-13 15:34:33.539 | INFO     | mineru.backend.pipeline.pipeline_analyze:doc_analyze:124 - Batch 1/1: 55 pages/55 pages
2025-07-13 15:34:33.552 | INFO     | mineru.backend.pipeline.pipeline_analyze:batch_image_analyze:187 - gpu_memory: 16 GB, batch_ratio: 16
2025-07-13 15:34:33.553 | INFO     | mineru.backend.pipeline.model_init:__init__:137 - DocAnalysis init, this may take some times......


Downloading Model from https://www.modelscope.cn to directory: C:\Users\HYPER\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0


2025-07-13 15:34:36,172 - modelscope - INFO - Target directory already exists, skipping creation.


2025-07-13 15:34:52.298 | INFO     | mineru.backend.pipeline.model_init:__init__:182 - DocAnalysis init done!
2025-07-13 15:34:52.298 | INFO     | mineru.backend.pipeline.pipeline_analyze:custom_model_init:64 - model init cost: 18.744927883148193
Layout Predict: 100%|██████████| 55/55 [00:02<00:00, 21.62it/s]
MFD Predict: 100%|██████████| 55/55 [00:03<00:00, 14.39it/s]
MFR Predict: 100%|██████████| 1763/1763 [00:10<00:00, 169.79it/s]


OCR-det ch: 100%|██████████| 182/182 [00:10<00:00, 16.90it/s]
Table Predict: 100%|██████████| 6/6 [00:04<00:00,  1.46it/s]
OCR-rec Predict: 100%|██████████| 8770/8770 [00:29<00:00, 298.94it/s]
Processing pages: 100%|██████████| 12/12 [00:04<00:00,  2.89it/s]
2025-07-13 15:36:05.088 | INFO     | __main__:do_parse:102 - local output dir is md_articles\article1\auto
Processing pages: 100%|██████████| 9/9 [00:00<00:00, 19.09it/s]
2025-07-13 15:36:05.868 | INFO     | __main__:do_parse:102 - local output dir is md_articles\article2\auto
Processing pages: 100%|██████████| 11/11 [00:00<00:00, 21.42it/s]
2025-07-13 15:36:06.863 | INFO     | __main__:do_parse:102 - local output dir is md_articles\article3\auto
Processing pages: 100%|██████████| 14/14 [00:01<00:00, 13.42it/s]
2025-07-13 15:36:08.585 | INFO     | __main__:do_parse:102 - local output dir is md_articles\article4\auto
Proc