vLLM parsing speed is about the same under different GPU memory settings, and close to the pipeline backend's speed. Is something wrong? #3958

jc7ctzphbf-dotcom · 2025-11-08T01:56:37Z

0. GPU setup
GPU: 4090D
Accelerated with flash-attn and flash-infer
CUDA-related package versions:
nvidia-cublas-cu12 12.8.4.1
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
nvidia-cudnn-cu12 9.10.2.21
nvidia-cudnn-frontend 1.15.0
nvidia-cufft-cu12 11.3.3.83
nvidia-cufile-cu12 1.13.1.3
nvidia-curand-cu12 10.3.9.90
nvidia-cusolver-cu12 11.7.3.90
nvidia-cusparse-cu12 12.5.8.93
nvidia-cusparselt-cu12 0.7.1
nvidia-ml-py 13.580.82
nvidia-nccl-cu12 2.27.3
nvidia-nvjitlink-cu12 12.8.93
nvidia-nvtx-cu12 12.8.90
torch 2.8.0
torchaudio 2.8.0
torchvision 0.23.0
VLM parameters:
vlm_kwargs = {
    "backend": "vllm-engine",
    "data_parallel_size": 1,
    "model": "/root/autodl-tmp/modelscope/models/OpenDataLab/MinerU2___5-2509-1___2B",
    "gpu_memory_utilization": 0.64,
    "enable_prefix_caching": True,
    "enable_chunked_prefill": True,
    "max_num_batched_tokens": 512,
}
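A note on the max_num_batched_tokens value above: in vLLM this setting caps how many tokens the scheduler processes in one engine step, and with chunked prefill enabled, long prompts are split to fit under that cap. The sketch below is only a back-of-the-envelope lower bound on the number of steps needed to prefill one 32-page batch. The per-page prompt length is an estimate taken from the progress-bar figures in the load-test log (input toks/s divided by prompts/s), the helper min_prefill_steps is hypothetical, and the budgets 2048 and 8192 are arbitrary comparison values, not recommendations. Whether this budget is what limits throughput here is not established by the logs.

```python
import math


def min_prefill_steps(prompt_lens, max_num_batched_tokens):
    """Lower bound on scheduler steps needed to prefill all prompts.

    Assumes every step is filled to the token budget with prefill tokens
    only. Decode tokens share the same budget, so real runs need more steps.
    """
    total_tokens = sum(prompt_lens)
    return math.ceil(total_tokens / max_num_batched_tokens)


# Rough prompt length per page, estimated from one progress-bar line in the
# load-test log: 5912.50 input toks/s at 4.24 prompts/s.
approx_prompt_len = round(5912.50 / 4.24)
prompts = [approx_prompt_len] * 32  # one 32-page batch

# 512 is the value from vlm_kwargs; the other budgets are hypothetical.
for budget in (512, 2048, 8192):
    print(f"budget={budget:5d}  min prefill steps={min_prefill_steps(prompts, budget)}")
```

Under these assumptions the step count depends only on the token budget and the prompt lengths, not on gpu_memory_utilization, which is one reason the two settings may be worth examining separately.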
1. GPU memory utilization set to 64%
A single GPU processed 32 pages in about 17.5 s.
[image]
Startup log:
Starting Parse Server...
/root/de-doc_parse/mineru-distributed-parsing/.venv/lib/python3.11/site-packages/torch/cuda/init.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
/root/de-doc_parse/mineru-distributed-parsing/.venv/lib/python3.11/site-packages/paddle/utils/cpp_extension/extension_utils.py:718: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
warnings.warn(warning_message)
WARNING: OMP_NUM_THREADS set to 54, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
PLEASE USE OMP_NUM_THREADS WISELY.
/root/de-doc_parse/mineru-distributed-parsing/.venv/lib/python3.11/site-packages/paddle/utils/cpp_extension/extension_utils.py:718: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
warnings.warn(warning_message)
WARNING: OMP_NUM_THREADS set to 54, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
PLEASE USE OMP_NUM_THREADS WISELY.
/root/de-doc_parse/mineru-distributed-parsing/.venv/lib/python3.11/site-packages/paddle/utils/cpp_extension/extension_utils.py:718: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
warnings.warn(warning_message)
WARNING: OMP_NUM_THREADS set to 54, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
PLEASE USE OMP_NUM_THREADS WISELY.
INFO 11-08 09:33:13 [init.py:216] Automatically detected platform cuda.
INFO 11-08 09:33:14 [utils.py:328] non-default args: {'enable_prefix_caching': True, 'gpu_memory_utilization': 0.64, 'max_num_batched_tokens': 512, 'disable_log_stats': True, 'enable_chunked_prefill': True, 'logits_processors': [<class 'mineru_vl_utils.logits_processor.vllm_v1_no_repeat_ngram.VllmV1NoRepeatNGramLogitsProcessor'>], 'model': '/root/autodl-tmp/modelscope/models/OpenDataLab/MinerU2___5-2509-1___2B'}
INFO 11-08 09:33:21 [init.py:742] Resolved architecture: Qwen2VLForConditionalGeneration
torch_dtype is deprecated! Use dtype instead!
INFO 11-08 09:33:21 [init.py:1815] Using max model len 16384
INFO 11-08 09:33:21 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=512.
WARNING 11-08 09:33:24 [init.py:2974] We must use the spawn multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
/root/de-doc_parse/mineru-distributed-parsing/.venv/lib/python3.11/site-packages/torch/cuda/init.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
/root/de-doc_parse/mineru-distributed-parsing/.venv/lib/python3.11/site-packages/paddle/utils/cpp_extension/extension_utils.py:718: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
warnings.warn(warning_message)
WARNING: OMP_NUM_THREADS set to 54, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
PLEASE USE OMP_NUM_THREADS WISELY.
2025-11-08 09:33:27.952 | INFO | mp_main::45 | Started 1 GPU workers: ['http://127.0.0.1:8100']
INFO 11-08 09:33:30 [init.py:216] Automatically detected platform cuda.
2025-11-08 09:33:31.261 | INFO | worker_server:create_worker_app:36 | MineruService initialized on GPU 0
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:31 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:31 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='/root/autodl-tmp/modelscope/models/OpenDataLab/MinerU2___5-2509-1___2B', speculative_config=None, tokenizer='/root/autodl-tmp/modelscope/models/OpenDataLab/MinerU2___5-2509-1___2B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/root/autodl-tmp/modelscope/models/OpenDataLab/MinerU2___5-2509-1___2B, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, 
compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:34 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:34 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
(EngineCore_DP0 pid=4281) WARNING 11-08 09:33:37 [profiling.py:280] The sequence length (16384) is smaller than the pre-defined worst-case total number of multimodal tokens (18225). This may cause certain multi-modal inputs to fail during inference. To avoid this, you should increase max_model_len or reduce mm_counts.
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:37 [gpu_model_runner.py:2338] Starting to load model /root/autodl-tmp/modelscope/models/OpenDataLab/MinerU2___5-2509-1___2B...
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:37 [gpu_model_runner.py:2370] Loading model from scratch...
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:37 [cuda.py:362] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.99it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.99it/s]
(EngineCore_DP0 pid=4281)
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:38 [default_loader.py:268] Loading weights took 0.45 seconds
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:38 [gpu_model_runner.py:2392] Model loading took 2.1639 GiB and 0.719621 seconds
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:38 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 16200 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:44 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/465aec7182/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:44 [backends.py:550] Dynamo bytecode transform time: 4.74 s
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:47 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 2.881 s
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:48 [monitor.py:34] torch.compile takes 4.74 s in total
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:49 [gpu_worker.py:298] Available KV cache memory: 12.28 GiB
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:49 [kv_cache_utils.py:864] GPU KV cache size: 1,073,408 tokens
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:49 [kv_cache_utils.py:868] Maximum concurrency for 16,384 tokens per request: 65.52x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████| 67/67 [00:02<00:00, 28.60it/s]
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:52 [gpu_model_runner.py:3118] Graph capturing finished in 3 secs, took 0.41 GiB
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:52 [gpu_worker.py:391] Free memory on device (23.11/23.52 GiB) on startup. Desired GPU memory utilization is (0.64, 15.05 GiB). Actual usage is 2.16 GiB for weight, 0.56 GiB for peak activation, 0.04 GiB for non-torch memory, and 0.41 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with --kv-cache-memory=12596736245 to fit into requested memory, or --kv-cache-memory=21254752256 to fully utilize gpu memory. Current kv cache memory in use is 13190230261 bytes.
(EngineCore_DP0 pid=4281) INFO 11-08 09:33:52 [core.py:218] init engine (profile, create kv cache, warmup model) took 13.77 seconds
INFO 11-08 09:33:53 [llm.py:295] Supported_tasks: ['generate']
INFO 11-08 09:33:53 [init.py:36] No IOProcessor plugins requested by the model
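To check that the memory setting actually took effect, the figures in the startup logs can be cross-checked with a little arithmetic. The sketch below assumes the KV-cache budget is the requested memory (total device memory times gpu_memory_utilization) minus the weights, peak activation, and non-torch memory reported in the log. That formula is inferred from the log lines, not taken from vLLM's source, and the names used here are hypothetical. All numbers are copied from the 64% log above and the 90% log further down.

```python
GIB = 1024 ** 3
MAX_MODEL_LEN = 16_384

# Figures copied from the two startup logs in this post.
TOTAL_GIB = 23.52
FIXED_GIB = 2.16 + 0.56 + 0.04  # weights + peak activation + non-torch memory

RUNS = {
    0.64: {"kv_bytes": 13_190_230_261, "kv_tokens": 1_073_408},
    0.90: {"kv_bytes": 19_755_393_433, "kv_tokens": 1_607_696},
}


def expected_kv_gib(utilization):
    """Assumed formula: requested memory minus the fixed costs in the log."""
    return TOTAL_GIB * utilization - FIXED_GIB


def report(utilization):
    run = RUNS[utilization]
    return {
        "expected_kv_gib": round(expected_kv_gib(utilization), 2),
        "logged_kv_gib": round(run["kv_bytes"] / GIB, 2),
        "bytes_per_token": round(run["kv_bytes"] / run["kv_tokens"]),
        "max_concurrency": round(run["kv_tokens"] / MAX_MODEL_LEN, 2),
    }


for utilization in RUNS:
    print(utilization, report(utilization))
```

If this arithmetic holds, the KV cache grows by roughly 50% between the two runs, so the similar timings are not explained by the setting being ignored.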
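Behavior under load test (in the log lines below, "[性能日志] vlm_doc_analyze 耗时: N 秒" means "[performance log] vlm_doc_analyze took N seconds"):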
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:01<00:00, 31.46it/s]
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 34.50it/s]
Processed prompts: 100%|█████████████████████████████████| 32/32 [00:07<00:00, 4.24it/s, est. speed input: 5912.50 toks/s, output: 1136.02 toks/s]
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 366/366 [00:01<00:00, 201.86it/s]
Processed prompts: 100%|█████████████████████████████████| 32/32 [00:05<00:00, 5.90it/s, est. speed input: 8221.19 toks/s, output: 1315.33 toks/s]
Processed prompts: 100%|█████████████████████████████████| 32/32 [00:06<00:00, 5.25it/s, est. speed input: 7317.30 toks/s, output: 1116.59 toks/s]
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 314/314 [00:03<00:00, 95.82it/s]
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 295/295 [00:03<00:00, 93.74it/s]
Processed prompts: 100%|███████████████████████████████| 366/366 [00:05<00:00, 62.23it/s, est. speed input: 8615.92 toks/s, output: 1767.31 toks/s]
Processed prompts: 35%|███████████▎ | 104/295 [00:01<00:04, 42.88it/s, est. speed input: 5007.10 toks/s, output: 941.89 toks/s]2025-11-08 09:52:03.481 | ERROR | mineru_service:process_file:59 | [性能日志] vlm_doc_analyze 耗时: 19.32 秒
Processed prompts: 50%|███████████████▍ | 156/314 [00:02<00:02, 57.34it/s, est. speed input: 7118.76 toks/s, output: 1326.06 toks/s]INFO: 127.0.0.1:46068 - "POST /infer HTTP/1.1" 200 OK
INFO: 127.0.0.1:49542 - "POST /doc_parse HTTP/1.1" 200 OK
Processed prompts: 100%|██████████████████████████████| 314/314 [00:04<00:00, 65.44it/s, est. speed input: 11946.27 toks/s, output: 2375.73 toks/s]
Processed prompts: 100%|██████████████████████████████| 295/295 [00:04<00:00, 66.37it/s, est. speed input: 11484.94 toks/s, output: 2435.17 toks/s]
Adding requests: 50%|██████████████████████████████████████████████▌ | 16/32 [00:00<00:00, 47.15it/s]2025-11-08 09:52:06.733 | ERROR | mineru_service:process_file:59 | [性能日志] vlm_doc_analyze 耗时: 17.90 秒
Adding requests: 66%|█████████████████████████████████████████████████████████████ | 21/32 [00:00<00:00, 44.73it/s]2025-11-08 09:52:06.814 | ERROR | mineru_service:process_file:59 | [性能日志] vlm_doc_analyze 耗时: 18.06 秒
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 45.29it/s]
Processed prompts: 0%| | 0/32 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO: 127.0.0.1:36484 - "POST /infer HTTP/1.1" 200 OK
INFO: 127.0.0.1:49566 - "POST /doc_parse HTTP/1.1" 200 OK
INFO: 127.0.0.1:60618 - "POST /infer HTTP/1.1" 200 OK
INFO: 127.0.0.1:49562 - "POST /doc_parse HTTP/1.1" 200 OK
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 47.27it/s]
Processed prompts: 100%|█████████████████████████████████| 32/32 [00:06<00:00, 5.17it/s, est. speed input: 7207.12 toks/s, output: 1188.14 toks/s]
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 314/314 [00:01<00:00, 187.44it/s]
Processed prompts: 100%|█████████████████████████████████| 32/32 [00:05<00:00, 5.55it/s, est. speed input: 7737.34 toks/s, output: 1200.97 toks/s]
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 295/295 [00:01<00:00, 171.14it/s]
Processed prompts: 100%|██████████████████████████████| 314/314 [00:04<00:00, 74.99it/s, est. speed input: 11491.66 toks/s, output: 2365.47 toks/s]
Processed prompts: 36%|███████████▍ | 105/295 [00:02<00:04, 40.11it/s, est. speed input: 4946.81 toks/s, output: 920.79 toks/s]2025-11-08 09:52:20.042 | ERROR | mineru_service:process_file:59 | [性能日志] vlm_doc_analyze 耗时: 16.12 秒
Processed prompts: 39%|████████████▍ | 115/295 [00:02<00:04, 37.57it/s, est. speed input: 4710.36 toks/s, output: 892.43 toks/s]INFO: 127.0.0.1:46080 - "POST /infer HTTP/1.1" 200 OK
INFO: 127.0.0.1:49576 - "POST /doc_parse HTTP/1.1" 200 OK
Processed prompts: 100%|██████████████████████████████| 295/295 [00:05<00:00, 56.80it/s, est. speed input: 11009.52 toks/s, output: 2182.09 toks/s]
2025-11-08 09:52:23.970 | ERROR | mineru_service:process_file:59 | [性能日志] vlm_doc_analyze 耗时: 16.79 秒
INFO: 127.0.0.1:36480 - "POST /infer HTTP/1.1" 200 OK
INFO: 127.0.0.1:49554 - "POST /doc_parse HTTP/1.1" 200 OK
2. GPU memory utilization set to 90%
A single GPU processed 32 pages in 17.5 s.
[image]
Startup log:
Starting Parse Server...
/root/de-doc_parse/mineru-distributed-parsing/.venv/lib/python3.11/site-packages/torch/cuda/init.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
/root/de-doc_parse/mineru-distributed-parsing/.venv/lib/python3.11/site-packages/paddle/utils/cpp_extension/extension_utils.py:718: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
warnings.warn(warning_message)
WARNING: OMP_NUM_THREADS set to 54, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
PLEASE USE OMP_NUM_THREADS WISELY.
/root/de-doc_parse/mineru-distributed-parsing/.venv/lib/python3.11/site-packages/paddle/utils/cpp_extension/extension_utils.py:718: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
warnings.warn(warning_message)
WARNING: OMP_NUM_THREADS set to 54, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
PLEASE USE OMP_NUM_THREADS WISELY.
/root/de-doc_parse/mineru-distributed-parsing/.venv/lib/python3.11/site-packages/paddle/utils/cpp_extension/extension_utils.py:718: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
warnings.warn(warning_message)
WARNING: OMP_NUM_THREADS set to 54, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
PLEASE USE OMP_NUM_THREADS WISELY.
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/models/OpenDataLab/MinerU2.5-2509-1.2B
2025-11-08 09:45:34,271 - modelscope - INFO - Target directory already exists, skipping creation.
INFO 11-08 09:45:35 [init.py:216] Automatically detected platform cuda.
INFO 11-08 09:45:36 [utils.py:328] non-default args: {'enable_prefix_caching': True, 'max_num_batched_tokens': 512, 'disable_log_stats': True, 'enable_chunked_prefill': True, 'logits_processors': [<class 'mineru_vl_utils.logits_processor.vllm_v1_no_repeat_ngram.VllmV1NoRepeatNGramLogitsProcessor'>], 'model': '/root/autodl-tmp/modelscope/models/OpenDataLab/MinerU2___5-2509-1___2B'}
INFO 11-08 09:45:43 [init.py:742] Resolved architecture: Qwen2VLForConditionalGeneration
torch_dtype is deprecated! Use dtype instead!
INFO 11-08 09:45:43 [init.py:1815] Using max model len 16384
INFO 11-08 09:45:43 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=512.
WARNING 11-08 09:45:46 [init.py:2974] We must use the spawn multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
/root/de-doc_parse/mineru-distributed-parsing/.venv/lib/python3.11/site-packages/torch/cuda/init.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
/root/de-doc_parse/mineru-distributed-parsing/.venv/lib/python3.11/site-packages/paddle/utils/cpp_extension/extension_utils.py:718: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
warnings.warn(warning_message)
WARNING: OMP_NUM_THREADS set to 54, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
PLEASE USE OMP_NUM_THREADS WISELY.
2025-11-08 09:45:50.185 | INFO | mp_main::45 | Started 1 GPU workers: ['http://127.0.0.1:8100']
INFO 11-08 09:45:53 [init.py:216] Automatically detected platform cuda.
2025-11-08 09:45:53.495 | INFO | worker_server:create_worker_app:36 | MineruService initialized on GPU 0
(EngineCore_DP0 pid=10064) INFO 11-08 09:45:54 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=10064) INFO 11-08 09:45:54 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='/root/autodl-tmp/modelscope/models/OpenDataLab/MinerU2___5-2509-1___2B', speculative_config=None, tokenizer='/root/autodl-tmp/modelscope/models/OpenDataLab/MinerU2___5-2509-1___2B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/root/autodl-tmp/modelscope/models/OpenDataLab/MinerU2___5-2509-1___2B, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, 
compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=10064) INFO 11-08 09:45:56 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=10064) INFO 11-08 09:45:57 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
(EngineCore_DP0 pid=10064) WARNING 11-08 09:45:59 [profiling.py:280] The sequence length (16384) is smaller than the pre-defined worst-case total number of multimodal tokens (18225). This may cause certain multi-modal inputs to fail during inference. To avoid this, you should increase max_model_len or reduce mm_counts.
(EngineCore_DP0 pid=10064) INFO 11-08 09:45:59 [gpu_model_runner.py:2338] Starting to load model /root/autodl-tmp/modelscope/models/OpenDataLab/MinerU2___5-2509-1___2B...
(EngineCore_DP0 pid=10064) INFO 11-08 09:45:59 [gpu_model_runner.py:2370] Loading model from scratch...
(EngineCore_DP0 pid=10064) INFO 11-08 09:45:59 [cuda.py:362] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.97it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.97it/s]
(EngineCore_DP0 pid=10064)
(EngineCore_DP0 pid=10064) INFO 11-08 09:46:00 [default_loader.py:268] Loading weights took 0.45 seconds
(EngineCore_DP0 pid=10064) INFO 11-08 09:46:00 [gpu_model_runner.py:2392] Model loading took 2.1639 GiB and 0.723047 seconds
(EngineCore_DP0 pid=10064) INFO 11-08 09:46:01 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 16200 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore_DP0 pid=10064) INFO 11-08 09:46:06 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/465aec7182/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=10064) INFO 11-08 09:46:06 [backends.py:550] Dynamo bytecode transform time: 4.87 s
(EngineCore_DP0 pid=10064) INFO 11-08 09:46:10 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 2.910 s
(EngineCore_DP0 pid=10064) INFO 11-08 09:46:10 [monitor.py:34] torch.compile takes 4.87 s in total
(EngineCore_DP0 pid=10064) INFO 11-08 09:46:11 [gpu_worker.py:298] Available KV cache memory: 18.40 GiB
(EngineCore_DP0 pid=10064) INFO 11-08 09:46:11 [kv_cache_utils.py:864] GPU KV cache size: 1,607,696 tokens
(EngineCore_DP0 pid=10064) INFO 11-08 09:46:11 [kv_cache_utils.py:868] Maximum concurrency for 16,384 tokens per request: 98.13x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████| 67/67 [00:02<00:00, 29.49it/s]
(EngineCore_DP0 pid=10064) INFO 11-08 09:46:14 [gpu_model_runner.py:3118] Graph capturing finished in 3 secs, took 0.41 GiB
(EngineCore_DP0 pid=10064) INFO 11-08 09:46:14 [gpu_worker.py:391] Free memory on device (23.11/23.52 GiB) on startup. Desired GPU memory utilization is (0.9, 21.16 GiB). Actual usage is 2.16 GiB for weight, 0.56 GiB for peak activation, 0.04 GiB for non-torch memory, and 0.41 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with --kv-cache-memory=19161899417 to fit into requested memory, or --kv-cache-memory=21254752256 to fully utilize gpu memory. Current kv cache memory in use is 19755393433 bytes.
(EngineCore_DP0 pid=10064) INFO 11-08 09:46:14 [core.py:218] init engine (profile, create kv cache, warmup model) took 13.88 seconds
INFO 11-08 09:46:15 [llm.py:295] Supported_tasks: ['generate']
INFO 11-08 09:46:15 [init.py:36] No IOProcessor plugins requested by the model
Behavior under load test:
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 42.98it/s]
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:01<00:00, 31.31it/s]
Processed prompts: 100%|█████████████████████████████████| 32/32 [00:07<00:00, 4.39it/s, est. speed input: 6124.80 toks/s, output: 1176.81 toks/s]
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 366/366 [00:01<00:00, 203.41it/s]
Processed prompts: 100%|█████████████████████████████████| 32/32 [00:05<00:00, 5.57it/s, est. speed input: 7762.61 toks/s, output: 1204.90 toks/s]
Processed prompts: 100%|█████████████████████████████████| 32/32 [00:06<00:00, 4.96it/s, est. speed input: 6911.76 toks/s, output: 1570.66 toks/s]
Warning: line does not match layout format:
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 295/295 [00:02<00:00, 131.33it/s]
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 431/431 [00:02<00:00, 179.31it/s]
Processed prompts: 100%|███████████████████████████████| 366/366 [00:05<00:00, 70.22it/s, est. speed input: 9721.38 toks/s, output: 1994.07 toks/s]
Processed prompts: 44%|██████████████ | 130/295 [00:03<00:05, 29.27it/s, est. speed input: 4302.71 toks/s, output: 820.65 toks/s]2025-11-08 09:41:13.333 | ERROR | mineru_service:process_file:59 | [性能日志] vlm_doc_analyze 耗时: 18.35 秒
Processed prompts: 69%|███████████████████▉ | 296/431 [00:01<00:01, 115.36it/s, est. speed input: 10298.33 toks/s, output: 2583.01 toks/s]INFO: 127.0.0.1:56764 - "POST /infer HTTP/1.1" 200 OK
INFO: 127.0.0.1:42302 - "POST /doc_parse HTTP/1.1" 200 OK
Processed prompts: 100%|█████████████████████████████| 431/431 [00:02<00:00, 147.99it/s, est. speed input: 12477.18 toks/s, output: 2961.70 toks/s]
Processed prompts: 95%|████████████████████████████▎ | 279/295 [00:04<00:00, 141.32it/s, est. speed input: 9480.03 toks/s, output: 1873.97 toks/s]2025-11-08 09:41:15.394 | ERROR | mineru_service:process_file:59 | [性能日志] vlm_doc_analyze 耗时: 15.89 秒
Processed prompts: 100%|██████████████████████████████| 295/295 [00:05<00:00, 54.83it/s, est. speed input: 10627.66 toks/s, output: 2106.40 toks/s]
INFO: 127.0.0.1:45370 - "POST /infer HTTP/1.1" 200 OK
INFO: 127.0.0.1:42276 - "POST /doc_parse HTTP/1.1" 200 OK
Adding requests: 25%|███████████████████████▌ | 8/32 [00:00<00:00, 34.99it/s]2025-11-08 09:41:16.561 | ERROR | mineru_service:process_file:59 | [性能日志] vlm_doc_analyze 耗时: 17.57 秒
Adding requests: 75%|█████████████████████████████████████████████████████████████████████▊ | 24/32 [00:00<00:00, 36.12it/s]INFO: 127.0.0.1:33994 - "POST /infer HTTP/1.1" 200 OK
INFO: 127.0.0.1:42312 - "POST /doc_parse HTTP/1.1" 200 OK
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 35.75it/s]
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 53.53it/s]
Processed prompts: 100%|█████████████████████████████████| 32/32 [00:06<00:00, 5.24it/s, est. speed input: 7308.08 toks/s, output: 1115.17 toks/s]
Processed prompts: 100%|█████████████████████████████████| 32/32 [00:05<00:00, 5.54it/s, est. speed input: 7721.25 toks/s, output: 1235.34 toks/s]
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 295/295 [00:01<00:00, 159.97it/s]
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 314/314 [00:02<00:00, 143.69it/s]
Processed prompts: 100%|██████████████████████████████| 295/295 [00:04<00:00, 62.38it/s, est. speed input: 10793.99 toks/s, output: 2288.67 toks/s]
Processed prompts: 68%|█████████████████████ | 213/314 [00:03<00:02, 47.21it/s, est. speed input: 7481.81 toks/s, output: 1425.79 toks/s]2025-11-08 09:41:30.747 | ERROR | mineru_service:process_file:59 | [性能日志] vlm_doc_analyze 耗时: 16.97 秒
Processed prompts: 74%|██████████████████████▊ | 231/314 [00:03<00:01, 56.39it/s, est. speed input: 7563.06 toks/s, output: 1447.42 toks/s]INFO: 127.0.0.1:56780 - "POST /infer HTTP/1.1" 200 OK
INFO: 127.0.0.1:42328 - "POST /doc_parse HTTP/1.1" 200 OK
Processed prompts: 100%|██████████████████████████████| 314/314 [00:04<00:00, 67.19it/s, est. speed input: 12265.73 toks/s, output: 2439.26 toks/s]
2025-11-08 09:41:32.580 | ERROR | mineru_service:process_file:59 | [性能日志] vlm_doc_analyze 耗时: 16.74 秒
INFO: 127.0.0.1:45392 - "POST /infer HTTP/1.1" 200 OK
INFO: 127.0.0.1:42340 - "POST /doc_parse HTTP/1.1" 200 OK
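To put a number on "about the same", the per-document timings can be pulled out of the two load-test logs and averaged. This is a minimal sketch: the regular expression matches the service's own timing lines, the two strings hold only the timing values excerpted from the logs above (everything else is elided), and the function names are hypothetical. It measures per-document latency only and does not account for documents being processed concurrently.

```python
import re
from statistics import mean

# Matches the timing lines, e.g. "vlm_doc_analyze 耗时: 19.32 秒"
# ("vlm_doc_analyze took 19.32 seconds").
TIMING = re.compile(r"vlm_doc_analyze 耗时: ([0-9.]+) 秒")

# Timing lines excerpted from the two load-test logs in this post.
RUN_64 = """
[性能日志] vlm_doc_analyze 耗时: 19.32 秒
[性能日志] vlm_doc_analyze 耗时: 17.90 秒
[性能日志] vlm_doc_analyze 耗时: 18.06 秒
[性能日志] vlm_doc_analyze 耗时: 16.12 秒
[性能日志] vlm_doc_analyze 耗时: 16.79 秒
"""
RUN_90 = """
[性能日志] vlm_doc_analyze 耗时: 18.35 秒
[性能日志] vlm_doc_analyze 耗时: 15.89 秒
[性能日志] vlm_doc_analyze 耗时: 17.57 秒
[性能日志] vlm_doc_analyze 耗时: 16.97 秒
[性能日志] vlm_doc_analyze 耗时: 16.74 秒
"""


def doc_times(log_text):
    """Return every vlm_doc_analyze duration (seconds) found in the log."""
    return [float(value) for value in TIMING.findall(log_text)]


def mean_latency(log_text):
    return mean(doc_times(log_text))


mean_64 = mean_latency(RUN_64)
mean_90 = mean_latency(RUN_90)
relative_gap = (mean_64 - mean_90) / mean_64

print(f"64%: {mean_64:.2f} s per document")
print(f"90%: {mean_90:.2f} s per document")
print(f"relative difference: {relative_gap:.1%}")
```

With only five samples per run, a difference of a few percent is hard to distinguish from run-to-run noise, so a longer test would be needed before reading anything into it.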
