
Conversation

@JIElite (Contributor) commented Sep 17, 2025

Dear maintainers,

Because vLLM already supports local GGUF files, and enforce_eager is fixed to True for GGUF, all we need to do is use the HF tokenizer when running inference on a local GGUF model (as mentioned in the vLLM documentation):
[screenshot: vLLM documentation note on GGUF support]

And here is the example of using a local GGUF file from the vLLM documentation:
[screenshot: vLLM documentation example for local GGUF files]
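For reference, here is a minimal sketch of what that example boils down to, assuming the standard vLLM Python API (LLM / SamplingParams) and reusing the model and tokenizer names from my script below; it is only an illustration, not the exact snippet from the docs.

from vllm import LLM, SamplingParams

# Load local GGUF weights; the vLLM docs recommend passing the base model's
# HF tokenizer instead of converting the one embedded in the GGUF file.
llm = LLM(
    model="./models/qwen3-4b-instruct-2507-q4_k_m.gguf",
    tokenizer="Qwen/Qwen3-4B-Instruct-2507",
    enforce_eager=True,  # GGUF currently runs in eager mode (see the log below)
)

outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=64),
)
print(outputs[0].outputs[0].text)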

I tested the following script for evaluating a vLLM model in GGUF format:

export VLLM_WORKER_MULTIPROC_METHOD=spawn
TASKS=("gpqa:main")

MODEL=./models/qwen3-4b-instruct-2507-q4_k_m.gguf
TOKENIZER=Qwen/Qwen3-4B-Instruct-2507 # NOTICE: use the tokenizer from huggingface !!
SEED=1234
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,seed=$SEED,tokenizer=$TOKENIZER,max_model_length=24000,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:16384,temperature:0.7,top_p:0.8,top_k:20,min_p:0.0}"

for TASK in "${TASKS[@]}"; do
  lighteval vllm $MODEL_ARGS "lighteval|$TASK|0"
done
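For reference, the generation_parameters in MODEL_ARGS above correspond roughly to the following vLLM sampling settings (a sketch only; the actual mapping is done inside lighteval):

from vllm import SamplingParams

# Rough equivalent of generation_parameters in MODEL_ARGS above.
sampling_params = SamplingParams(
    max_tokens=16384,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    min_p=0.0,
    seed=1234,
)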

The log and evaluation results are attached below.

INFO 09-17 14:26:13 [__init__.py:241] Automatically detected platform cuda.
[2025-09-17 14:26:14,565] [    INFO]: --- INIT SEEDS --- (pipeline.py:282)
[2025-09-17 14:26:14,565] [    INFO]: --- LOADING TASKS --- (pipeline.py:243)
[2025-09-17 14:26:14,565] [ WARNING]: If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`. (registry.py:255)
[2025-09-17 14:26:14,566] [ WARNING]: Careful, the task gpqa:main is using evaluation data to build the few shot examples. (lighteval_task.py:269)
[2025-09-17 14:26:18,519] [    INFO]: --- LOADING MODEL --- (pipeline.py:210)
[2025-09-17 14:26:18,519] [ WARNING]: We were not able to detect if the chat template should be used for your model: {e}. Assuming we're using a chat template (utils.py:134)
[2025-09-17 14:26:19,527] [    INFO]: non-default args: {'model': './models/qwen3-4b-instruct-2507-q4_k_m.gguf', 'dtype': 'bfloat16', 'seed': 1234, 'max_model_len': 24000, 'gpu_memory_utilization': 0.8, 'max_num_batched_tokens': 2048, 'max_num_seqs': 128, 'disable_log_stats': True, 'revision': 'main', 'enforce_eager': True} (utils.py:326)
[2025-09-17 14:26:47,395] [    INFO]: Resolved architecture: Qwen3ForCausalLM (__init__.py:711)
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-17 14:26:47,395] [   ERROR]: Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': './models/qwen3-4b-instruct-2507-q4_k_m.gguf'. Use `repo_type` argument if needed., retrying 1 of 2 (config.py:130)
[2025-09-17 14:26:49,396] [   ERROR]: Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': './models/qwen3-4b-instruct-2507-q4_k_m.gguf'. Use `repo_type` argument if needed. (config.py:128)
[2025-09-17 14:26:49,396] [    INFO]: Downcasting torch.float32 to torch.bfloat16. (__init__.py:2816)
[2025-09-17 14:26:49,396] [    INFO]: Using max model len 24000 (__init__.py:1750)
[2025-09-17 14:26:49,397] [ WARNING]: gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models. (__init__.py:1171)
[2025-09-17 14:27:04,289] [    INFO]: Chunked prefill is enabled with max_num_batched_tokens=2048. (scheduler.py:222)
[2025-09-17 14:27:05,174] [    INFO]: Cudagraph is disabled under eager mode (__init__.py:3565)
INFO 09-17 14:27:38 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=165675) INFO 09-17 14:27:40 [core.py:636] Waiting for init message from front-end.
(EngineCore_0 pid=165675) INFO 09-17 14:27:40 [core.py:74] Initializing a V1 LLM engine (v0.10.1.1) with config: model='./models/qwen3-4b-instruct-2507-q4_k_m.gguf', speculative_config=None, tokenizer='./models/qwen3-4b-instruct-2507-q4_k_m.gguf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config={}, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=24000, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=1234, served_model_name=./models/qwen3-4b-instruct-2507-q4_k_m.gguf, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(EngineCore_0 pid=165675) INFO 09-17 14:27:41 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=165675) WARNING 09-17 14:27:41 [topk_topp_sampler.py:61] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_0 pid=165675) INFO 09-17 14:27:41 [gpu_model_runner.py:1953] Starting to load model ./models/qwen3-4b-instruct-2507-q4_k_m.gguf...
(EngineCore_0 pid=165675) INFO 09-17 14:27:41 [gpu_model_runner.py:1985] Loading model from scratch...
(EngineCore_0 pid=165675) INFO 09-17 14:27:47 [cuda.py:328] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=165675) INFO 09-17 14:27:53 [gpu_model_runner.py:2007] Model loading took 2.4538 GiB and 11.525276 seconds
(EngineCore_0 pid=165675) INFO 09-17 14:27:56 [gpu_worker.py:276] Available KV cache memory: 15.72 GiB
(EngineCore_0 pid=165675) INFO 09-17 14:27:56 [kv_cache_utils.py:849] GPU KV cache size: 114,432 tokens
(EngineCore_0 pid=165675) INFO 09-17 14:27:56 [kv_cache_utils.py:853] Maximum concurrency for 24,000 tokens per request: 4.77x
(EngineCore_0 pid=165675) INFO 09-17 14:27:56 [core.py:214] init engine (profile, create kv cache, warmup model) took 3.06 seconds
(EngineCore_0 pid=165675) INFO 09-17 14:28:09 [__init__.py:3565] Cudagraph is disabled under eager mode
[2025-09-17 14:28:09,679] [    INFO]: Supported_tasks: ['generate'] (llm.py:298)
[2025-09-17 14:28:09,679] [    INFO]: [CACHING] Initializing data cache (cache_management.py:105)
[2025-09-17 14:28:09,680] [    INFO]: --- RUNNING MODEL --- (pipeline.py:363)
[2025-09-17 14:28:09,680] [    INFO]: Running SamplingMethod.GENERATIVE requests (pipeline.py:346)
[2025-09-17 14:28:24,509] [    INFO]: Cache: Starting to process 448/448 samples (not found in cache) for tasks lighteval|gpqa:main|0 (c010de55e8c94f12, GENERATIVE) (cache_management.py:399)
[2025-09-17 14:28:24,509] [ WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:206)
Splits:   0%|                                                                                               | 0/1 [00:00<?, ?it/s(EngineCore_0 pid=165675) WARNING 09-17 14:28:24 [cudagraph_dispatcher.py:101] cudagraph dispatching keys are not initialized. No cudagraph will be used.
Adding requests: 100%|███████████████████████████████████████████████████████████████████████| 448/448 [00:00<00:00, 12447.82it/s]
Processed prompts: 100%|████████████████| 448/448 [07:54<00:00,  1.06s/it, est. speed input: 242.39 toks/s, output: 426.90 toks/s]
Splits: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [07:54<00:00, 474.92s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 84.61ba/s]
[2025-09-17 14:36:24,653] [    INFO]: Cached 448 samples of lighteval|gpqa:main|0 (c010de55e8c94f12, GENERATIVE) at /home/elichen/.cache/huggingface/lighteval/models/qwen3-4b-instruct-2507-q4_k_m.gguf/f43c64204c5fc574/lighteval|gpqa:main|0/c010de55e8c94f12/GENERATIVE.parquet. (cache_management.py:345)
Generating train split: 448 examples [00:00, 25253.31 examples/s]
[rank0]:[W917 14:36:30.310590683 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[2025-09-17 14:36:32,012] [    INFO]: --- POST-PROCESSING MODEL RESPONSES --- (pipeline.py:377)
[2025-09-17 14:36:32,013] [    INFO]: --- COMPUTING METRICS --- (pipeline.py:404)
[2025-09-17 14:36:32,065] [    INFO]: --- DISPLAYING RESULTS --- (pipeline.py:465)
|        Task         |Version|     Metric     |Value |   |Stderr|
|---------------------|-------|----------------|-----:|---|-----:|
|all                  |       |extractive_match|0.3795|±  | 0.023|
|lighteval:gpqa:main:0|       |extractive_match|0.3795|±  | 0.023|

[2025-09-17 14:36:32,075] [    INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:455)
[2025-09-17 14:36:32,075] [    INFO]: Saving experiment tracker (evaluation_tracker.py:246)

results_2025-09-17T14-36-32.075650.json

Please help review.
Thank you,
Eli Chen

Comment on lines 295 to 297
config.tokenizer
if config.tokenizer
else config.model_name, # use HF tokenizer for non-HF models, like GGUF model.
@NathanHB (Member) commented Sep 17, 2025


using config.tokenizer or config.model_name would be better here
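In other words, a sketch of the suggested change, using the names from the diff above (tokenizer_name is only an illustrative variable; in the diff the expression is passed directly as an argument):

# Fall back to the model path when no explicit tokenizer is given;
# for local GGUF files, config.tokenizer should point to the HF tokenizer.
tokenizer_name = config.tokenizer or config.model_name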

@JIElite (Contributor, Author)

Thank you for the suggestion; I've updated the implementation.

@HuggingFaceDocBuilderDev (Collaborator)

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NathanHB NathanHB merged commit d6a65ca into huggingface:main Sep 18, 2025
4 checks passed