Bug Description
When we run a benchmark with a custom dataset name, the run fails at the accuracy evaluation stage with a KeyError. Looking at sample_idx_map.json, it appears that for dataset names that are not predefined (such as open_orca), the generic key "Dataset" is written instead of the configured name.
Note:
The flow executes successfully if we modify the config snippet as follows (renaming both dataset entries to the generic "Dataset"):
```yaml
datasets:
  - name: "Dataset"
    type: "accuracy"
    samples: 24576
    path: "/home/anandhusooraj/endpoints/open_orca_gpt4_tokenized_llama.sampled_24576.parquet"
    accuracy_config:
      eval_method: "rouge"
      extractor: "identity_extractor"
      ground_truth: "output"
  - name: "Dataset"
    type: "performance"
    samples: 24576
    path: "/home/anandhusooraj/endpoints/open_orca_gpt4_tokenized_llama.sampled_24576.parquet"
```
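The workaround suggests a mismatch between how the sample index map is written and how it is read. Below is a minimal, hypothetical sketch of the suspected behavior (this is NOT the actual inference_endpoint code; `KNOWN_DATASETS`, both helper functions, and the fallback logic are assumptions for illustration only): the writer collapses any non-predefined name to a generic "Dataset" key, while the reader looks up the configured name verbatim.

```python
import json
import tempfile

# Hypothetical sketch of the suspected bug (NOT the real inference_endpoint
# code): the writer substitutes a generic "Dataset" key for any dataset name
# that is not predefined, while the reader uses the configured name as-is.

KNOWN_DATASETS = {"open_orca"}  # assumed set of predefined dataset names


def write_sample_idx_map(dataset_name, indices, path):
    # Custom names collapse to the generic "Dataset" key.
    key = dataset_name if dataset_name in KNOWN_DATASETS else "Dataset"
    with open(path, "w") as f:
        json.dump({key: indices}, f)


def load_sample_idx_map(dataset_name, path):
    with open(path) as f:
        d = json.load(f)
    return d[dataset_name]  # implicitly raises KeyError for unknown keys


with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    path = f.name

write_sample_idx_map("Dataset-openorca", [0, 1, 2], path)

try:
    load_sample_idx_map("Dataset-openorca", path)
except KeyError as e:
    print(f"KeyError: {e}")  # mirrors the failure seen in the logs

# Reading with the generic key succeeds, which would explain why renaming
# the dataset to "Dataset" in the config works around the crash:
print(load_sample_idx_map("Dataset", path))
```

If this is indeed the write-side behavior, the workaround in the Note section works because the configured name happens to match the generic key.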
Steps to Reproduce
- Launch a vLLM server with llama2-70b/llama2-7b
- Config file:
```yaml
# Online Latency Benchmark
name: "online-llama2-70b-orca-benchmark"
version: "1.0"
type: "offline"
#benchmark_mode: "online"

model_params:
  name: "meta-llama/Llama-2-7b-chat-hf"
  temperature: 0
  top_p: 1
  max_new_tokens: 1024

datasets:
  - name: "Dataset-openorca"
    type: "accuracy"
    samples: 24576
    path: "/home/anandhusooraj/endpoints/open_orca_gpt4_tokenized_llama.sampled_24576.parquet"
    accuracy_config:
      eval_method: "rouge"
      extractor: "identity_extractor"
      ground_truth: "output"
  - name: "Dataset-openorca"
    type: "performance"
    samples: 24576
    path: "/home/anandhusooraj/endpoints/open_orca_gpt4_tokenized_llama.sampled_24576.parquet"

settings:
  runtime:
    min_duration_ms: 600000 # 10 minutes
    #max_duration_ms: 600000 # 10 minutes
    scheduler_random_seed: 42 # For Poisson/distribution sampling
    dataloader_random_seed: 42 # For dataset shuffling
    n_samples_to_issue: 24576
  load_pattern:
    type: "max_throughput"
    #target_qps: 10
  client:
    num_workers: 4
  metrics:
    collect:
      - "throughput"
      - "latency"
      - "ttft"
      - "tpot"

endpoint_config:
  endpoints:
    - "http://localhost:9000"
  api_key: null

report_dir: results/llama2_70b_orca_benchmark_mlperf_parq/
```
- Run command:

```shell
inference-endpoint benchmark from-config -c examples/06_Llama2-70B_Example/online_llama2_70b_orca_backup.yaml --timeout 600000
```
Environment
OS: Ubuntu 24.04
Python: 3.12.3
Endpoints repo latest commit hash: 8c0c63d
Relevant Logs
Error log:
```
(endp) anandhusooraj@mlc2:~/endpoints$ inference-endpoint benchmark from-config -c examples/06_Llama2-70B_Example/online_llama2_70b_orca_backup.yaml --timeout 600000
2026-04-16 14:46:42,753 - inference_endpoint.endpoint_client.cpu_affinity - INFO - CPU affinity: 224 online CPUs available to process
2026-04-16 14:46:42,763 - inference_endpoint.endpoint_client.cpu_affinity - INFO - CPU affinity: 112 physical cores across 2 NUMA nodes, requesting 5 for loadgen, 4 workers
2026-04-16 14:46:42,772 - inference_endpoint.endpoint_client.cpu_affinity - INFO - LoadGen pinned to 10 CPUs (5 physical cores)
2026-04-16 14:46:42,777 - inference_endpoint.commands.benchmark.execute - INFO - Loading tokenizer for model: meta-llama/Llama-2-7b-chat-hf
2026-04-16 14:46:42,884 - httpx - INFO - HTTP Request: HEAD https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/config.json "HTTP/1.1 401 Unauthorized"
2026-04-16 14:46:42,947 - httpx - INFO - HTTP Request: HEAD https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/tokenizer_config.json "HTTP/1.1 401 Unauthorized"
2026-04-16 14:46:43,000 - httpx - INFO - HTTP Request: GET https://huggingface.co/api/models/meta-llama/Llama-2-7b-chat-hf/tree/main/additional_chat_templates?recursive=false&expand=false "HTTP/1.1 404 Not Found"
2026-04-16 14:46:43,064 - httpx - INFO - HTTP Request: GET https://huggingface.co/api/models/meta-llama/Llama-2-7b-chat-hf/tree/main?recursive=true&expand=false "HTTP/1.1 200 OK"
2026-04-16 14:46:43,240 - inference_endpoint.commands.benchmark.execute - INFO - Tokenizer loaded successfully
2026-04-16 14:46:43,241 - inference_endpoint.commands.benchmark.execute - INFO - Streaming: disabled (off)
2026-04-16 14:46:43,757 - inference_endpoint.commands.benchmark.execute - INFO - Loaded <inference_endpoint.dataset_manager.dataset.Dataset object at 0x7958a9e0e810> - 24576 samples
2026-04-16 14:46:44,319 - inference_endpoint.commands.benchmark.execute - INFO - Loaded 24576 samples
2026-04-16 14:46:44,319 - inference_endpoint.commands.benchmark.execute - INFO - Mode: TestMode.PERF, Target QPS: None, Responses: False
2026-04-16 14:46:44,319 - inference_endpoint.commands.benchmark.execute - INFO - Min Duration: 600.0s, Expected samples: 49152
2026-04-16 14:46:44,320 - inference_endpoint.commands.benchmark.execute - INFO - Scheduler: MaxThroughputScheduler (pattern: max_throughput)
meta-llama/Llama-2-7b-chat-hf (Streaming: False): 0%| | 0/49152 [00:00<?, ?it/s]2026-04-16 14:46:44,327 - inference_endpoint.commands.benchmark.execute - INFO - Connecting: ['http://localhost:9000']
2026-04-16 14:46:46,889 - inference_endpoint.endpoint_client.http_client - INFO - EndpointClient initialized with num_workers=4, endpoints=['http://localhost:9000/v1/chat/completions'], adapter=OpenAIMsgspecAdapter, accumulator=OpenAISSEAccumulator, transport=zmq
2026-04-16 14:46:46,890 - inference_endpoint.commands.benchmark.execute - INFO - Running...
2026-04-16 14:46:47,550 - inference_endpoint.load_generator.session - INFO - All performance samples issued
2026-04-16 14:46:48,121 - inference_endpoint.load_generator.session - INFO - All accuracy samples issued
meta-llama/Llama-2-7b-chat-hf (Streaming: False): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 49152/49152 [17:40<00:00, 46.37it/s]
----------------- Summary -----------------
Version: 0.1.0
Git SHA: 8c0c63d
Test started at: (timestamp_ns):4653411357432119, approx. wall-clock time: (2026-04-16 14:46:46)
Total samples issued: 24576
Total samples completed: 24576
Total samples failed: 0
Duration: 654.46 seconds
QPS: 37.55
TPS: 11089.98
----------------- End of Summary -----------------
2026-04-16 15:04:25,890 - inference_endpoint.load_generator.session - INFO - Report saved to results/llama2_70b_orca_benchmark_mlperf_parq/report.txt
2026-04-16 15:04:25,910 - inference_endpoint.commands.benchmark.execute - INFO - Cleaning up...
meta-llama/Llama-2-7b-chat-hf (Streaming: False): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 49152/49152 [17:41<00:00, 46.30it/s]
2026-04-16 15:04:25,912 - inference_endpoint.endpoint_client.http_client - INFO - [bfdeb7ec] Shutting down...
2026-04-16 15:04:26,424 - inference_endpoint.endpoint_client.http_client - INFO - [bfdeb7ec] Shutdown complete.
Traceback (most recent call last):
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/main.py", line 128, in run
    app.meta()
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/cyclopts/core.py", line 1889, in __call__
    result = _run_maybe_async_command(command, bound, resolved_backend)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/cyclopts/_run.py", line 50, in _run_maybe_async_command
    return command(*bound.args, **bound.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/main.py", line 73, in launcher
    app(tokens)
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/cyclopts/core.py", line 1889, in __call__
    result = _run_maybe_async_command(command, bound, resolved_backend)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/cyclopts/_run.py", line 50, in _run_maybe_async_command
    return command(*bound.args, **bound.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/commands/benchmark/cli.py", line 112, in from_config
    _run(resolved, [], test_mode)
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/commands/benchmark/cli.py", line 54, in _run
    run_benchmark(config, mode)
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/commands/benchmark/execute.py", line 481, in run_benchmark
    finalize_benchmark(ctx, report, collector)
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/commands/benchmark/execute.py", line 405, in finalize_benchmark
    scorer_instance = eval_cfg.scorer(
                      ^^^^^^^^^^^^^^^^
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/evaluation/scoring.py", line 228, in __init__
    super().__init__(*args, **kwargs)
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/evaluation/scoring.py", line 112, in __init__
    self.sample_index_map = self._load_sample_index_map()
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/evaluation/scoring.py", line 123, in _load_sample_index_map
    return d[self.dataset_name] # Implicitly raises KeyError
           ~^^^^^^^^^^^^^^^^^^^
KeyError: 'Dataset-openorca'
```
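One possible read-side mitigation (a sketch only; the real method is `_load_sample_index_map` in `inference_endpoint/evaluation/scoring.py`, and this standalone function merely illustrates the idea, not the actual fix) would be to prefer the configured name but fall back to the generic "Dataset" key that the writer currently emits for non-predefined names:

```python
def load_sample_index_map(sample_idx_map: dict, dataset_name: str):
    """Tolerant lookup sketch (NOT the actual scoring.py method).

    Prefer the configured dataset name, fall back to the generic "Dataset"
    key, and otherwise fail with a message listing the available keys
    instead of a bare KeyError.
    """
    if dataset_name in sample_idx_map:
        return sample_idx_map[dataset_name]
    if "Dataset" in sample_idx_map:
        return sample_idx_map["Dataset"]
    raise KeyError(
        f"{dataset_name!r} not found in sample_idx_map.json "
        f"(available keys: {sorted(sample_idx_map)})"
    )


# With the map as it is currently written for a custom dataset name,
# the lookup would now succeed:
print(load_sample_index_map({"Dataset": [0, 1, 2]}, "Dataset-openorca"))
```

Alternatively (and probably more robustly), the write side could be changed to key the map by the configured dataset name so the two stages always agree.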