vdb benchmark shows a very low recall@10 because the flat_gt collection size is too small #375

@Raysmond

We ran a few dry-run tests of the vdb benchmark on a single-host system with a single Gen5 NVMe drive (Solidigm D7-PS1010).
The results show a very low mean recall@10 of only about 0.0090. We investigated and found that the collection mlps_1m_1shards_1536dim_uniform_flat_gt contains only 10000 vectors, which is just 1% of the 1M vectors in the source collection. With a FLAT collection this small, how can we get a higher recall@10? I think the FLAT collection should be the same size as the source collection, which seems more reasonable. Is the current behavior by design, or is it a bug?

(.venv) root@cnit-zz-01:~/raysmond/workspace/mlperf_v3/storage2/vdb_benchmark# python vdbbench/list_collections.py
2026-05-14 00:58:06,702 - INFO - Connected to Milvus server at 127.0.0.1:19530
2026-05-14 00:58:06,704 - INFO - Found 2 collections
2026-05-14 00:58:06,704 - INFO - Getting information for collection: mlps_1m_1shards_1536dim_uniform
2026-05-14 00:58:06,714 - INFO - Getting information for collection: mlps_1m_1shards_1536dim_uniform_flat_gt
+-----------------------------------------+----------------+-------------+---------------+----------------+--------------+
| Collection Name                         |   Vector Count |   Dimension | Index Types   | Metric Types   |   Partitions |
+=========================================+================+=============+===============+================+==============+
| mlps_1m_1shards_1536dim_uniform         |        1000000 |        1536 | DISKANN       | COSINE         |            1 |
+-----------------------------------------+----------------+-------------+---------------+----------------+--------------+
| mlps_1m_1shards_1536dim_uniform_flat_gt |          10000 |        1536 | FLAT          | COSINE         |            1 |
+-----------------------------------------+----------------+-------------+---------------+----------------+--------------+
2026-05-14 00:58:06,900 - INFO - Disconnected from Milvus server
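
To make the numbers above concrete, here is a minimal sketch of how recall@10 for one query can be computed as the overlap between the ANN result and the FLAT ground truth. This is not the code from enhanced_bench.py; the function name recall_at_k and the example ids are hypothetical and chosen only for illustration.

```python
def recall_at_k(ann_ids, gt_ids, k=10):
    """Fraction of the top-k ground-truth ids that also appear in the top-k ANN ids."""
    ann_top = set(ann_ids[:k])
    gt_top = set(gt_ids[:k])
    return len(ann_top & gt_top) / k


# Hypothetical ids: the ground truth comes from a 10000-vector FLAT collection,
# while the ANN search runs against the 1000000-vector source collection.
gt_ids = [3, 17, 42, 58, 101, 250, 777, 1203, 4096, 9999]
ann_ids = [17, 120034, 250001, 333333, 404040, 512000, 640000, 700001, 810000, 999999]

example_recall = recall_at_k(ann_ids, gt_ids, k=10)  # one shared id out of ten
coverage = 10_000 / 1_000_000  # share of the source held by the FLAT collection

print(f"recall@10 = {example_recall:.2f}, ground-truth coverage = {coverage:.2%}")
```

If the ground truth is built from only 1% of the source, most of the neighbours that the ANN index returns cannot appear in it, so a mean recall near 0.01 would not be surprising.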

Here are the commands we executed and the dryrun logs:

# load the vector database
(.venv) root@cnit-zz-01:~/raysmond/workspace/mlperf_v3/storage2# ./mlpstorage vectordb datagen     --host 127.0.0.1 --port 19530 --config default     --force --results-dir ./vdb_results --file

# run the query cmd (PATH A)
(.venv) root@cnit-zz-01:~/raysmond/workspace/mlperf_v3/storage2/vdb_benchmark# python vdbbench/enhanced_bench.py \
  --host 127.0.0.1 \
  --collection mlps_1m_1shards_1536dim_uniform \
  --auto-create-flat \
  --runtime 120 \
  --batch-size 10 \
  --processes 8 \
  --search-limit 10 \
  --search-ef 200 \
  --queries 100000 \
  --recall-k 10 \
  --cache-state cold \
  --drop-caches-cmd "sh -c 'echo 3 > /proc/sys/vm/drop_caches'"

Logs:

============================================================
ENHANCED VDB BENCH — runtime/query-count mode
============================================================
Results will be saved to: vdbbench_results/20260514_005609

============================================================
Database Verification and Collection Loading
============================================================
Connecting to Milvus server at 127.0.0.1:19530...
Collection mlps_1m_1shards_1536dim_uniform already loaded.

+---------------------------------+----------------+-------------+---------------+----------------+--------------+
| Collection Name                 |   Vector Count |   Dimension | Index Types   | Metric Types   |   Partitions |
+=================================+================+=============+===============+================+==============+
| mlps_1m_1shards_1536dim_uniform |        1000000 |        1536 | DISKANN       | COSINE         |            1 |
+---------------------------------+----------------+-------------+---------------+----------------+--------------+
Detected source vector field: 'vector'

============================================================
RECALL SETUP (outside benchmark timing)
============================================================
Ground truth is pre-computed using a FLAT (brute-force) index.
Using metric type: COSINE

Generating 1000 query vectors (dim=1536, seed=42)...
Generated 1000 query vectors.

Setting up FLAT collection: mlps_1m_1shards_1536dim_uniform_flat_gt
FLAT collection exists but has 10000 vs 1000000 vectors. Dropping and recreating...
Creating FLAT collection 'mlps_1m_1shards_1536dim_uniform_flat_gt' from source 'mlps_1m_1shards_1536dim_uniform'...
Source schema: pk_field='id' (INT64), vec_field='vector', vectors=1000000
Copying 1000000 vectors to FLAT collection (batch_size=5000)...
  Copied 152/1000000 vectors (0.0%)
  Copied 304/1000000 vectors (0.0%)
  Copied 456/1000000 vectors (0.0%)
  Copied 608/1000000 vectors (0.1%)
  Copied 760/1000000 vectors (0.1%)
  Copied 912/1000000 vectors (0.1%)
  Copied 1064/1000000 vectors (0.1%)
  Copied 1216/1000000 vectors (0.1%)
  Copied 1368/1000000 vectors (0.1%)
  Copied 1520/1000000 vectors (0.2%)
  Copied 1672/1000000 vectors (0.2%)
  Copied 1824/1000000 vectors (0.2%)
  Copied 1976/1000000 vectors (0.2%)
  Copied 2132/1000000 vectors (0.2%)
  Copied 2289/1000000 vectors (0.2%)
  Copied 2446/1000000 vectors (0.2%)
  Copied 2603/1000000 vectors (0.3%)
  Copied 2757/1000000 vectors (0.3%)
  Copied 2909/1000000 vectors (0.3%)
  Copied 3063/1000000 vectors (0.3%)
  Copied 3220/1000000 vectors (0.3%)
  Copied 3376/1000000 vectors (0.3%)
  Copied 3528/1000000 vectors (0.4%)
  Copied 3680/1000000 vectors (0.4%)
  Copied 3837/1000000 vectors (0.4%)
  Copied 3994/1000000 vectors (0.4%)
  Copied 4146/1000000 vectors (0.4%)
  Copied 4298/1000000 vectors (0.4%)
  Copied 4450/1000000 vectors (0.4%)
  Copied 4602/1000000 vectors (0.5%)
  Copied 4754/1000000 vectors (0.5%)
  Copied 4906/1000000 vectors (0.5%)
  Copied 10000/1000000 vectors (100.0%)
Building FLAT index...
FLAT collection 'mlps_1m_1shards_1536dim_uniform_flat_gt' ready with 10000 vectors.
Pre-computing ground truth for 1000 queries using FLAT index (top_k=10)...
Ground truth pre-computation complete: 1000 queries in 0.61s
Ground truth ready: 1000 queries pre-computed.

Collecting initial disk statistics...

============================================================
Benchmark Execution
============================================================
Starting benchmark: 8 processes × 12500 queries/process
Recall: 1000 pre-generated queries, recall@10
NOTE: batch_end timing is placed BEFORE recall capture — performance unaffected.
NOTE: recall hits written to per-worker recall_hits_p<N>.jsonl files.
Staggering process startup by 0.125s
Starting process 0...
Process 0 initialized
Process 0 - Loading collection
Process 0: Writing results to vdbbench_results/20260514_005609/milvus_benchmark_p0.csv
Process 0: Starting benchmark ...
Starting process 1...
Process 1 initialized



Calculating recall from per-worker JSONL files...
  Loaded ANN hits for 1000 unique query indices from 8 worker(s).
Calculating benchmark statistics...

============================================================
BENCHMARK SUMMARY
============================================================
Total Queries: 100000
Total Batches: 10000
Total Runtime: 46.55s

QUERY STATISTICS
------------------------------------------------------------
Mean Latency:      3.64 ms
Median Latency:    3.67 ms
P95 Latency:       3.98 ms
P99 Latency:       4.16 ms
P99.9 Latency:     4.52 ms
P99.99 Latency:    5.72 ms
Throughput:        2148.38 queries/second

BATCH STATISTICS
------------------------------------------------------------
Mean Batch Time:   36.40 ms
Median Batch Time: 36.74 ms
P95 Batch Time:    39.84 ms
P99 Batch Time:    41.64 ms
P99.9 Batch Time:  45.19 ms
P99.99 Batch Time: 57.24 ms
Max Batch Time:    97.02 ms
Batch Throughput:  27.47 batches/second

RECALL STATISTICS (recall@10)
------------------------------------------------------------
Mean Recall:       0.0090
Median Recall:     0.0000
Min Recall:        0.0000
Max Recall:        0.3000
P95 Recall:        0.1000
P99 Recall:        0.1000
Queries Evaluated: 1000

DISK I/O DURING BENCHMARK
------------------------------------------------------------
Total Read:        295.53 GB  (6501.54 MB/s,  832193 IOPS)
Total Write:       223.03 MB  (4.79 MB/s,  82 IOPS)

Per-Device Breakdown:
  nvme11n1:
    Read:  600.00 KB  (0.01 MB/s, 0 IOPS)
    Write: 3.82 MB  (0.08 MB/s, 8 IOPS)
  nvme11n1p3:
    Read:  600.00 KB  (0.01 MB/s, 0 IOPS)
    Write: 3.82 MB  (0.08 MB/s, 8 IOPS)
  nvme14n1:
    Read:  147.77 GB  (3250.75 MB/s, 416096 IOPS)
    Write: 105.79 MB  (2.27 MB/s, 26 IOPS)
  nvme14n1p1:
    Read:  147.77 GB  (3250.75 MB/s, 416096 IOPS)
    Write: 105.79 MB  (2.27 MB/s, 26 IOPS)
  dm-0:
    Read:  600.00 KB  (0.01 MB/s, 0 IOPS)
    Write: 3.82 MB  (0.08 MB/s, 14 IOPS)

Detailed results: vdbbench_results/20260514_005609
Recall details:   vdbbench_results/20260514_005609/recall_stats.json
============================================================

As the logs show, the test suite copies only 1% (10000) of the vectors to the new FLAT collection, yet the progress is reported as 100%.

**Root cause (guess)**
I think there might be an issue with the insert_data(collection, vectors, batch_size=10000) method in load_vdb.py.
The caller always passes only the current chunk_vectors to insert_data instead of the whole set of vectors, so insert_data generates the ids for every chunk starting again from 0, up to 10000. As a result, the final collection contains many duplicate ids.

For that reason, when we run enhanced_bench.py with --auto-create-flat, it counts and copies only 10000 vectors to the FLAT collection.
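
The suspected pattern can be sketched without Milvus. The snippet below is not the code from load_vdb.py; the helper names (chunk_ids_restarting, chunk_ids_with_offset, count_unique_ids) are hypothetical and only model the guess above: if ids restart at 0 for every chunk, 1000000 inserted rows carry just 10000 distinct ids, whereas passing a running offset keeps every id unique.

```python
def chunk_ids_restarting(chunk_len):
    """Suspected buggy pattern: ids restart at 0 for every chunk."""
    return range(chunk_len)


def chunk_ids_with_offset(chunk_len, offset):
    """Possible fix: continue the ids from where the previous chunk ended."""
    return range(offset, offset + chunk_len)


def count_unique_ids(total_vectors, chunk_size, use_offset):
    """Count distinct primary keys produced when inserting chunk by chunk."""
    seen = set()
    for start in range(0, total_vectors, chunk_size):
        chunk_len = min(chunk_size, total_vectors - start)
        if use_offset:
            seen.update(chunk_ids_with_offset(chunk_len, start))
        else:
            seen.update(chunk_ids_restarting(chunk_len))
    return len(seen)


buggy_unique = count_unique_ids(1_000_000, 10_000, use_offset=False)
fixed_unique = count_unique_ids(1_000_000, 10_000, use_offset=True)

print(f"ids restart per chunk: {buggy_unique} unique ids")
print(f"ids use running offset: {fixed_unique} unique ids")
```

The count of 10000 distinct ids in the first case matches the size of the FLAT collection in the logs, which is consistent with the guess but does not prove it.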
