Summary
The current configuration (BENCHMARK_RUNS=1, BENCHMARK_MAX_FIXED_WINDOW_SECONDS=4500) produces too few samples for reliable statistical analysis. The thesis (§4.5, Table 4.5.1) acknowledges that bootstrap CI from fewer than 30 samples is noisy. The current ceiling also caps several RQ2 experiments below their target iteration count.
Proposed changes:
- Split BENCHMARK_RUNS into per-workload counts: 30 RQ1 runs, 3 RQ2 runs
- BENCHMARK_MAX_ITERATION_SECONDS: 4500 → 3600 (60 min)
- BENCHMARK_MAX_FIXED_WINDOW_SECONDS: 75 * 60 → 5 * BENCHMARK_MAX_ITERATION_SECONDS (= 18,000s = 300 min)
- Skip national-scale-spatial-join-duckdb-large: it exceeds the 60-min per-iteration threshold (median 70.8 min). Existing data from run 2026-05-24-HBJYYT is sufficient. Report as a finding: single-node processing becomes impractical at large scale.
- main.py: partition experiments into RQ1 (pip, knn, bbox) and RQ2 (national-scale-spatial-join) by ID prefix, and loop over each group with its own run count.
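The partitioning described in the last item could look roughly like the sketch below. It is not the actual main.py change: `RUNS_PER_GROUP`, `GROUP_PREFIXES`, `SKIPPED`, `partition_experiments`, and `schedule` are hypothetical names, and the RQ1 experiment IDs used in the usage note are assumed for illustration. Only the prefixes, run counts, limits, and the skipped experiment ID come from this description.

```python
from collections import defaultdict

# Proposed limits from this PR.
BENCHMARK_MAX_ITERATION_SECONDS = 60 * 60  # 3600 s = 60 min
BENCHMARK_MAX_FIXED_WINDOW_SECONDS = 5 * BENCHMARK_MAX_ITERATION_SECONDS  # 18,000 s

# Hypothetical name: per-workload run counts replacing the single BENCHMARK_RUNS.
RUNS_PER_GROUP = {"RQ1": 30, "RQ2": 3}

# ID prefixes as listed in the description; the mapping itself is an assumption.
GROUP_PREFIXES = {
    "RQ1": ("pip", "knn", "bbox"),
    "RQ2": ("national-scale-spatial-join",),
}

# Experiment newly skipped by this PR.
SKIPPED = {"national-scale-spatial-join-duckdb-large"}


def partition_experiments(experiment_ids):
    """Group experiment IDs into RQ1/RQ2 by ID prefix, dropping skipped ones."""
    groups = defaultdict(list)
    for exp_id in experiment_ids:
        if exp_id in SKIPPED:
            continue
        for group, prefixes in GROUP_PREFIXES.items():
            # str.startswith accepts a tuple of candidate prefixes.
            if exp_id.startswith(prefixes):
                groups[group].append(exp_id)
                break
        else:
            raise ValueError(f"no workload group matches {exp_id!r}")
    return dict(groups)


def schedule(experiment_ids):
    """Yield (group, run_index, experiment_id), one pass per run in each group."""
    for group, ids in partition_experiments(experiment_ids).items():
        for run in range(1, RUNS_PER_GROUP[group] + 1):
            for exp_id in ids:
                yield group, run, exp_id
```

For example, `schedule(["pip-duckdb-small", "national-scale-spatial-join-postgis-small"])` would yield 30 RQ1 entries followed by 3 RQ2 entries. Raising on an unmatched ID is a design choice in this sketch, so a new workload cannot silently fall out of both groups.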
Estimated wall-clock per benchmark run
RQ1 — Sequential stopping (includes 10 min ingestion delay per experiment)
| Batch | Experiments | Wall-clock |
|---|---|---|
| pip-small | duckdb, local, postgis | 15 min |
| pip-large | duckdb, postgis | 29 min |
| knn-small | duckdb, local, postgis | 52 min |
| knn-large | duckdb, postgis | 30 min |
| bbox-small | duckdb, local, postgis | 15 min |
| bbox-large | duckdb, postgis | 38 min |
| RQ1 total (batched + cleanup) | 15 experiments | 189 min = 3.2h |
RQ2 — Fixed iteration (1 warmup + 5 timed, 60 min/iter, 300 min ceiling, includes 10 min ingestion)
| Experiment | Per-iter | Iters | Total | Note |
|---|---|---|---|---|
| duckdb-small | 2.8 min | 6 | 26.6 min | |
| duckdb-medium | 24.6 min | 6 | 157.3 min | |
| duckdb-large | 70.8 min | n/a | n/a | SKIP: exceeds 60 min threshold |
| postgis-small | 0.2 min | 6 | 11.4 min | |
| postgis-medium | 23.0 min | 6 | 148.1 min | |
| broadcast-2-nodes-medium | 3.3 min | 6 | 30.1 min | |
| broadcast-2-nodes-large | 7.5 min | 6 | 55.1 min | |
| broadcast-4-nodes-small | 0.3 min | 6 | 12.0 min | |
| broadcast-4-nodes-medium | 1.8 min | 6 | 20.6 min | |
| broadcast-4-nodes-large | 4.0 min | 6 | 34.0 min | |
| broadcast-8-nodes-small | 0.3 min | 6 | 11.6 min | |
| broadcast-8-nodes-medium | 1.0 min | 6 | 15.7 min | |
| broadcast-8-nodes-large | 2.1 min | 6 | 22.6 min | |
| broadcast-16-nodes-large | 1.2 min | 6 | 17.4 min | |
| partitioned-4-nodes-small | 18.4 min | 6 | 120.2 min | |
| partitioned-8-nodes-small | 9.7 min | 6 | 68.5 min | |
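The per-experiment totals above appear consistent with 10 min of ingestion plus 6 iterations at the median per-iteration time, with anything over the 60-min per-iteration threshold skipped. That formula is a reading of the table rather than something stated in the benchmark code, and the function and constant names below are hypothetical. Because the per-iteration values are rounded medians, the sketch reproduces the table only to within a few tenths of a minute.

```python
INGESTION_MINUTES = 10.0       # ingestion delay included in each estimate
ITERATIONS = 6                 # 1 warmup + 5 timed
MAX_ITERATION_MINUTES = 60.0   # proposed per-iteration threshold


def estimate_total_minutes(per_iter_minutes):
    """Return the estimated wall-clock minutes, or None if the experiment is skipped."""
    if per_iter_minutes > MAX_ITERATION_MINUTES:
        return None  # over the per-iteration threshold, so not scheduled
    return INGESTION_MINUTES + ITERATIONS * per_iter_minutes


# duckdb-medium: about 157.6 min from the rounded median (table lists 157.3 min).
duckdb_medium = estimate_total_minutes(24.6)

# duckdb-large: median 70.8 min exceeds the threshold, so it is skipped.
duckdb_large = estimate_total_minutes(70.8)
```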
| Batch | Wall-clock |
|---|---|
| small-main (duckdb, postgis, broadcast-8, partitioned-8) | 71 min |
| small-4node (broadcast-4, partitioned-4) | 123 min |
| med-main (duckdb, postgis, broadcast-8) | 160 min |
| med-4node (broadcast-4) | 24 min |
| med-2node (broadcast-2) | 33 min |
| lrg-8node (broadcast-8) | 26 min |
| lrg-broadcast (broadcast-2, broadcast-4, broadcast-16) | 58 min |
| RQ2 total (batched + cleanup) | 505 min = 8.4h |
Skipped experiments (9)
| Experiment | Reason |
|---|---|
| national-scale-spatial-join-duckdb-large | NEW: exceeds 60-min per-iteration threshold |
| national-scale-spatial-join-postgis-large | OOM on Azure PostgreSQL Flex Server |
| national-scale-spatial-join-databricks-partitioned-2-nodes-medium | executor OOM |
| national-scale-spatial-join-databricks-partitioned-2-nodes-large | executor OOM |
| national-scale-spatial-join-databricks-partitioned-4-nodes-medium | executor OOM |
| national-scale-spatial-join-databricks-partitioned-4-nodes-large | executor OOM |
| national-scale-spatial-join-databricks-partitioned-8-nodes-medium | executor OOM |
| national-scale-spatial-join-databricks-partitioned-8-nodes-large | executor OOM |
| national-scale-spatial-join-databricks-partitioned-16-nodes-large | executor OOM |
Total suite estimate
| Component | Per run | Runs | Total time | Cost |
|---|---|---|---|---|
| RQ1 | 3.2h | 30 | 94.6h | ~$14.64 |
| RQ2 | 8.4h | 3 | 25.3h | ~$37.97 |
| Total | | | 119.9h = 5.0 days | ~$52.61 |
Per-iteration times are median estimates from run 2026-05-24-HBJYYT. Cost estimates are based on Azure ACI, Databricks, and PostgreSQL pricing from the same run, adjusted for the increased iteration counts.
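The suite totals can be sanity-checked with a few lines of arithmetic. This is only a sketch over the figures quoted in this description, and the variable and function names are hypothetical. Recomputing from the rounded per-run minutes (189 and 505) gives 94.5h and 25.25h, which is within rounding of the 94.6h and 25.3h listed.

```python
# Figures copied from the suite estimate in this description.
SUITE = {
    "RQ1": {"per_run_minutes": 189, "runs": 30, "hours": 94.6, "cost_usd": 14.64},
    "RQ2": {"per_run_minutes": 505, "runs": 3, "hours": 25.3, "cost_usd": 37.97},
}


def group_hours(per_run_minutes, runs):
    """Total hours for one workload group from its rounded per-run minutes."""
    return per_run_minutes * runs / 60


# Recomputed from rounded per-run minutes; may differ slightly from the listed hours.
recomputed = {name: group_hours(g["per_run_minutes"], g["runs"]) for name, g in SUITE.items()}

# Totals from the listed per-group figures.
total_hours = sum(g["hours"] for g in SUITE.values())
total_days = total_hours / 24
total_cost = sum(g["cost_usd"] for g in SUITE.values())

print(f"{total_hours:.1f}h = {total_days:.1f} days, ~${total_cost:.2f}")
```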