
Conversation

@JIElite (Contributor) commented Sep 17, 2025

Dear maintainers,

Because vLLM already supports local GGUF files, and enforce_eager is fixed to True for GGUF, all we need to do is use the HF tokenizer when running inference on a local GGUF model (as mentioned in the vLLM documentation):
[screenshot: vLLM documentation note on GGUF support]

And here is the example of using a local GGUF file from the vLLM documentation:
[screenshot: vLLM documentation example for local GGUF files]
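For reference, here is a minimal sketch of what that example boils down to, assuming the standard vLLM Python API (LLM / SamplingParams) and reusing the model and tokenizer names from my script below; it is only an illustration, not the exact snippet from the docs.

from vllm import LLM, SamplingParams

# Load local GGUF weights; the vLLM docs recommend passing the base model's
# HF tokenizer instead of converting the one embedded in the GGUF file.
llm = LLM(
    model="./models/qwen3-4b-instruct-2507-q4_k_m.gguf",
    tokenizer="Qwen/Qwen3-4B-Instruct-2507",
    enforce_eager=True,  # GGUF currently runs in eager mode (see the log below)
)

outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=64),
)
print(outputs[0].outputs[0].text)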

I tested the following script for evaluating a vLLM model in GGUF format:

export VLLM_WORKER_MULTIPROC_METHOD=spawn
TASKS=("gpqa:main")

MODEL=./models/qwen3-4b-instruct-2507-q4_k_m.gguf
TOKENIZER=Qwen/Qwen3-4B-Instruct-2507 # NOTICE: use the tokenizer from huggingface !!
SEED=1234
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,seed=$SEED,tokenizer=$TOKENIZER,max_model_length=24000,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:16384,temperature:0.7,top_p:0.8,top_k:20,min_p:0.0}"

for TASK in "${TASKS[@]}"; do
  lighteval vllm $MODEL_ARGS "lighteval|$TASK|0"
done
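For reference, the generation_parameters in MODEL_ARGS above correspond roughly to the following vLLM sampling settings (a sketch only; the actual mapping is done inside lighteval):

from vllm import SamplingParams

# Rough equivalent of generation_parameters in MODEL_ARGS above.
sampling_params = SamplingParams(
    max_tokens=16384,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    min_p=0.0,
    seed=1234,
)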

The log and evaluation results are attached below.

INFO 09-17 14:26:13 [__init__.py:241] Automatically detected platform cuda.
[2025-09-17 14:26:14,565] [    INFO]: --- INIT SEEDS --- (pipeline.py:282)
[2025-09-17 14:26:14,565] [    INFO]: --- LOADING TASKS --- (pipeline.py:243)
[2025-09-17 14:26:14,565] [ WARNING]: If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`. (registry.py:255)
[2025-09-17 14:26:14,566] [ WARNING]: Careful, the task gpqa:main is using evaluation data to build the few shot examples. (lighteval_task.py:269)
[2025-09-17 14:26:18,519] [    INFO]: --- LOADING MODEL --- (pipeline.py:210)
[2025-09-17 14:26:18,519] [ WARNING]: We were not able to detect if the chat template should be used for your model: {e}. Assuming we're using a chat template (utils.py:134)
[2025-09-17 14:26:19,527] [    INFO]: non-default args: {'model': './models/qwen3-4b-instruct-2507-q4_k_m.gguf', 'dtype': 'bfloat16', 'seed': 1234, 'max_model_len': 24000, 'gpu_memory_utilization': 0.8, 'max_num_batched_tokens': 2048, 'max_num_seqs': 128, 'disable_log_stats': True, 'revision': 'main', 'enforce_eager': True} (utils.py:326)
[2025-09-17 14:26:47,395] [    INFO]: Resolved architecture: Qwen3ForCausalLM (__init__.py:711)
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-17 14:26:47,395] [   ERROR]: Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': './models/qwen3-4b-instruct-2507-q4_k_m.gguf'. Use `repo_type` argument if needed., retrying 1 of 2 (config.py:130)
[2025-09-17 14:26:49,396] [   ERROR]: Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': './models/qwen3-4b-instruct-2507-q4_k_m.gguf'. Use `repo_type` argument if needed. (config.py:128)
[2025-09-17 14:26:49,396] [    INFO]: Downcasting torch.float32 to torch.bfloat16. (__init__.py:2816)
[2025-09-17 14:26:49,396] [    INFO]: Using max model len 24000 (__init__.py:1750)
[2025-09-17 14:26:49,397] [ WARNING]: gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models. (__init__.py:1171)
[2025-09-17 14:27:04,289] [    INFO]: Chunked prefill is enabled with max_num_batched_tokens=2048. (scheduler.py:222)
[2025-09-17 14:27:05,174] [    INFO]: Cudagraph is disabled under eager mode (__init__.py:3565)
INFO 09-17 14:27:38 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=165675) INFO 09-17 14:27:40 [core.py:636] Waiting for init message from front-end.
(EngineCore_0 pid=165675) INFO 09-17 14:27:40 [core.py:74] Initializing a V1 LLM engine (v0.10.1.1) with config: model='./models/qwen3-4b-instruct-2507-q4_k_m.gguf', speculative_config=None, tokenizer='./models/qwen3-4b-instruct-2507-q4_k_m.gguf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config={}, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=24000, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=1234, served_model_name=./models/qwen3-4b-instruct-2507-q4_k_m.gguf, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(EngineCore_0 pid=165675) INFO 09-17 14:27:41 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=165675) WARNING 09-17 14:27:41 [topk_topp_sampler.py:61] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_0 pid=165675) INFO 09-17 14:27:41 [gpu_model_runner.py:1953] Starting to load model ./models/qwen3-4b-instruct-2507-q4_k_m.gguf...
(EngineCore_0 pid=165675) INFO 09-17 14:27:41 [gpu_model_runner.py:1985] Loading model from scratch...
(EngineCore_0 pid=165675) INFO 09-17 14:27:47 [cuda.py:328] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=165675) INFO 09-17 14:27:53 [gpu_model_runner.py:2007] Model loading took 2.4538 GiB and 11.525276 seconds
(EngineCore_0 pid=165675) INFO 09-17 14:27:56 [gpu_worker.py:276] Available KV cache memory: 15.72 GiB
(EngineCore_0 pid=165675) INFO 09-17 14:27:56 [kv_cache_utils.py:849] GPU KV cache size: 114,432 tokens
(EngineCore_0 pid=165675) INFO 09-17 14:27:56 [kv_cache_utils.py:853] Maximum concurrency for 24,000 tokens per request: 4.77x
(EngineCore_0 pid=165675) INFO 09-17 14:27:56 [core.py:214] init engine (profile, create kv cache, warmup model) took 3.06 seconds
(EngineCore_0 pid=165675) INFO 09-17 14:28:09 [__init__.py:3565] Cudagraph is disabled under eager mode
[2025-09-17 14:28:09,679] [    INFO]: Supported_tasks: ['generate'] (llm.py:298)
[2025-09-17 14:28:09,679] [    INFO]: [CACHING] Initializing data cache (cache_management.py:105)
[2025-09-17 14:28:09,680] [    INFO]: --- RUNNING MODEL --- (pipeline.py:363)
[2025-09-17 14:28:09,680] [    INFO]: Running SamplingMethod.GENERATIVE requests (pipeline.py:346)
[2025-09-17 14:28:24,509] [    INFO]: Cache: Starting to process 448/448 samples (not found in cache) for tasks lighteval|gpqa:main|0 (c010de55e8c94f12, GENERATIVE) (cache_management.py:399)
[2025-09-17 14:28:24,509] [ WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:206)
Splits:   0%|                                                                                               | 0/1 [00:00<?, ?it/s(EngineCore_0 pid=165675) WARNING 09-17 14:28:24 [cudagraph_dispatcher.py:101] cudagraph dispatching keys are not initialized. No cudagraph will be used.
Adding requests: 100%|███████████████████████████████████████████████████████████████████████| 448/448 [00:00<00:00, 12447.82it/s]
Processed prompts: 100%|████████████████| 448/448 [07:54<00:00,  1.06s/it, est. speed input: 242.39 toks/s, output: 426.90 toks/s]
Splits: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [07:54<00:00, 474.92s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 84.61ba/s]
[2025-09-17 14:36:24,653] [    INFO]: Cached 448 samples of lighteval|gpqa:main|0 (c010de55e8c94f12, GENERATIVE) at /home/elichen/.cache/huggingface/lighteval/models/qwen3-4b-instruct-2507-q4_k_m.gguf/f43c64204c5fc574/lighteval|gpqa:main|0/c010de55e8c94f12/GENERATIVE.parquet. (cache_management.py:345)
Generating train split: 448 examples [00:00, 25253.31 examples/s]
[rank0]:[W917 14:36:30.310590683 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[2025-09-17 14:36:32,012] [    INFO]: --- POST-PROCESSING MODEL RESPONSES --- (pipeline.py:377)
[2025-09-17 14:36:32,013] [    INFO]: --- COMPUTING METRICS --- (pipeline.py:404)
[2025-09-17 14:36:32,065] [    INFO]: --- DISPLAYING RESULTS --- (pipeline.py:465)
|        Task         |Version|     Metric     |Value |   |Stderr|
|---------------------|-------|----------------|-----:|---|-----:|
|all                  |       |extractive_match|0.3795|±  | 0.023|
|lighteval:gpqa:main:0|       |extractive_match|0.3795|±  | 0.023|

[2025-09-17 14:36:32,075] [    INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:455)
[2025-09-17 14:36:32,075] [    INFO]: Saving experiment tracker (evaluation_tracker.py:246)

results_2025-09-17T14-36-32.075650.json

Please help review.
Thank you,
Eli Chen

Comment on lines 295 to 297
config.tokenizer
if config.tokenizer
else config.model_name, # use HF tokenizer for non-HF models, like GGUF model.
@NathanHB (Member) commented Sep 17, 2025


using config.tokenizer or config.model_name would be better here
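In other words, a sketch of the suggested change, using the names from the diff above (tokenizer_name is only an illustrative variable; in the diff the expression is passed directly as an argument):

# Fall back to the model path when no explicit tokenizer is given;
# for local GGUF files, config.tokenizer should point to the HF tokenizer.
tokenizer_name = config.tokenizer or config.model_name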

@JIElite (Contributor, Author)

Thank you for the suggestion; I've updated the implementation.

@HuggingFaceDocBuilderDev (Collaborator)

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NathanHB NathanHB merged commit d6a65ca into huggingface:main Sep 18, 2025
4 checks passed