6 changes: 6 additions & 0 deletions .github/workflows/pull-request-tests.yml
@@ -184,6 +184,9 @@ jobs:
- service: national-scale-spatial-join-databricks-broadcast-8-nodes
display_name: Sedona National Scale Spatial Join - Broadcast - 8 Nodes

+- service: national-scale-spatial-join-databricks-broadcast-12-nodes
+display_name: Sedona National Scale Spatial Join - Broadcast - 12 Nodes
+
- service: national-scale-spatial-join-databricks-partitioned-2-nodes
display_name: Sedona National Scale Spatial Join - Partitioned - 2 Nodes

@@ -193,6 +196,9 @@ jobs:
- service: national-scale-spatial-join-databricks-partitioned-8-nodes
display_name: Sedona National Scale Spatial Join - Partitioned - 8 Nodes

+- service: national-scale-spatial-join-databricks-partitioned-12-nodes
+display_name: Sedona National Scale Spatial Join - Partitioned - 12 Nodes
+
- service: national-scale-spatial-join-databricks-broadcast-16-nodes
display_name: Sedona National Scale Spatial Join - Broadcast - 16 Nodes

8 changes: 8 additions & 0 deletions .github/workflows/push-containers-to-acr.yml
@@ -117,6 +117,10 @@ jobs:
image: national-scale-spatial-join-databricks-broadcast-8-nodes
display_name: Sedona National Scale Spatial Join - Broadcast - 8 Nodes

+- service: national-scale-spatial-join-databricks-broadcast-12-nodes
+image: national-scale-spatial-join-databricks-broadcast-12-nodes
+display_name: Sedona National Scale Spatial Join - Broadcast - 12 Nodes
+
- service: national-scale-spatial-join-databricks-partitioned-2-nodes
image: national-scale-spatial-join-databricks-partitioned-2-nodes
display_name: Sedona National Scale Spatial Join - Partitioned - 2 Nodes
@@ -129,6 +133,10 @@ jobs:
image: national-scale-spatial-join-databricks-partitioned-8-nodes
display_name: Sedona National Scale Spatial Join - Partitioned - 8 Nodes

+- service: national-scale-spatial-join-databricks-partitioned-12-nodes
+image: national-scale-spatial-join-databricks-partitioned-12-nodes
+display_name: Sedona National Scale Spatial Join - Partitioned - 12 Nodes
+
- service: national-scale-spatial-join-databricks-broadcast-16-nodes
image: national-scale-spatial-join-databricks-broadcast-16-nodes
display_name: Sedona National Scale Spatial Join - Broadcast - 16 Nodes
26 changes: 12 additions & 14 deletions README.md
@@ -209,7 +209,7 @@ Concretely, the outer orchestrator loop is serial: it picks the next experiment
its peer batch in parallel, waits for the whole batch to finish, marks every member completed, and moves on. At any
moment one batch is in flight; within that batch every member runs on its own ACI in parallel.
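A minimal sketch of that scheduling shape, assuming each batch is a non-empty list of experiment ids and that launching one experiment on its ACI is wrapped in a blocking callable. The names `run_batches` and `run_experiment` are hypothetical and are not taken from the repository's orchestrator:

```python
from concurrent.futures import ThreadPoolExecutor


def run_batches(batches, run_experiment):
    """Run batches one after another; members of a batch run concurrently.

    `batches` is a list of lists of experiment ids and `run_experiment`
    is a blocking callable (both are illustrative assumptions).
    """
    completed = []
    for batch in batches:  # serial outer loop: one batch in flight at a time
        with ThreadPoolExecutor(max_workers=max(1, len(batch))) as pool:
            # list() consumes the iterator, so we block until every member
            # of the batch has finished (or the first failure is re-raised).
            list(pool.map(run_experiment, batch))
        completed.extend(batch)  # mark every member completed together
    return completed
```

The next batch cannot start until the `with` block exits, which is what keeps the outer loop serial while the members of a batch overlap in wall-clock time.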

-The 41 experiments are packed into 17 batches under four constraints that the `related_script_ids` graph encodes:
+The 51 experiments are packed into 17 batches under four constraints that the `related_script_ids` graph encodes:

1. **Same query type** per batch — `point-in-polygon-lookup`, `knn-search`, `bbox-filtering`, or
`national-scale-spatial-join` never mix.
@@ -228,7 +228,7 @@ by constraint 4.

### Test matrix

-The matrix below is the active set of 44 experiments grouped into 18 parallel batches. Each cell lists the engines
+The matrix below is the active set of 51 experiments grouped into 17 parallel batches. Each cell lists the engines
or Sedona configurations that launch together in the same wall-clock window; size suffixes (`-small`, `-medium`,
`-large`) are appended to the experiment ids in `benchmarks.yml` and forwarded to each container as `--dataset-size`.
Shapefile (`local`) only participates at the `small` tier per the thesis methodology — it represents the
@@ -245,20 +245,18 @@ laptop-workflow reference, not a scalable engine.
The medium tier was dropped from the surviving RQ1 queries and `attribute-spatial-compound-filter` was removed
across the board (issue #281); the 13 freed cells are reinvested in RQ2.

-**RQ2 — National-scale spatial join** (26 experiments, 11 batches)
+**RQ2 — National-scale spatial join** (36 experiments, 11 batches)

-| Engine / strategy | `small` | `medium` | `large` |
-|---------------------|---------------------|---------------------|---------------------------|
-| Single-node | duckdb · postgis | duckdb · postgis | duckdb · postgis |
-| Sedona `broadcast` | 4 / 8 nodes | 2 / 4 / 8 nodes | 2 / 4 / 8 / 12 / 16 nodes |
-| Sedona `partitioned`| 4 / 8 nodes | 2 / 4 / 8 nodes | 2 / 4 / 8 / 12 / 16 nodes |
+| Engine / strategy | `small` | `medium` | `large` |
+|---------------------|------------------------------|------------------------------|------------------------------|
+| Single-node | duckdb · postgis | duckdb · postgis | duckdb · postgis |
+| Sedona `broadcast` | 2 / 4 / 8 / 12 / 16 nodes | 2 / 4 / 8 / 12 / 16 nodes | 2 / 4 / 8 / 12 / 16 nodes |
+| Sedona `partitioned`| 2 / 4 / 8 / 12 / 16 nodes | 2 / 4 / 8 / 12 / 16 nodes | 2 / 4 / 8 / 12 / 16 nodes |
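The 36-experiment figure can be checked by enumerating the cells of the updated matrix. The sketch below is only a counting aid; the generated id strings are illustrative and may not match the actual ids in `benchmarks.yml`:

```python
SIZES = ("small", "medium", "large")
SINGLE_NODE = ("duckdb", "postgis")
STRATEGIES = ("broadcast", "partitioned")
NODE_COUNTS = (2, 4, 8, 12, 16)


def rq2_experiments():
    """Enumerate one illustrative id per cell entry of the RQ2 matrix."""
    ids = []
    for size in SIZES:
        # Two single-node engines per size tier.
        ids += [f"{engine}-{size}" for engine in SINGLE_NODE]
        # Every Sedona strategy at every cluster size per size tier.
        ids += [
            f"{strategy}-{nodes}-nodes-{size}"
            for strategy in STRATEGIES
            for nodes in NODE_COUNTS
        ]
    return ids


print(len(rq2_experiments()))  # 3 * (2 + 2 * 5) = 36
```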

Within each size column, single-node and Sedona experiments are packed into the same batches up to the 200 vCPU
Databricks budget — the table groups by strategy for readability, not by batch membership. Concrete batch
membership is whatever `related_script_ids` in `benchmarks.yml` declares; see the batch listing below.
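The vCPU figures in the batch listing are consistent with a cost of 4 vCPUs per node for each Sedona cluster, counting the workers plus one driver, with single-node engines consuming none of the Databricks budget. That cost model is an assumption inferred from the listed numbers (for example 24 vCPUs for `broadcast-2 · partitioned-2`), not something the README states. Under that assumption, a batch can be checked against the 200 vCPU budget like this:

```python
VCPU_BUDGET = 200    # Databricks budget quoted in the README
VCPUS_PER_NODE = 4   # assumption inferred from the batch listing


def cluster_vcpus(worker_nodes):
    """vCPUs for one Sedona cluster: workers plus one driver (assumed)."""
    return VCPUS_PER_NODE * (worker_nodes + 1)


def batch_vcpus(members):
    """Sum Databricks vCPUs for members such as 'broadcast-8' or 'duckdb'."""
    total = 0
    for member in members:
        strategy, _, nodes = member.partition("-")
        if strategy in ("broadcast", "partitioned"):
            total += cluster_vcpus(int(nodes))
        # Single-node engines (duckdb, postgis) are counted as 0 here.
    return total


a_s1 = ["broadcast-2", "broadcast-8", "broadcast-12",
        "partitioned-2", "partitioned-8", "partitioned-12",
        "duckdb", "postgis"]
assert batch_vcpus(a_s1) <= VCPU_BUDGET
```

With these assumptions the `A_S1` membership lands exactly on the budget, so it has no headroom for another Sedona cluster.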

-The 2-node row is omitted at `small` for `broadcast` and `partitioned`: at ~5M polygons those configurations were
-weakly differentiated; the freed cells fund the 12-/16-node extension of the scaling curve at `large`.
A `default` strategy (no hint; Spark's CBO picks the plan) was originally planned as an untuned baseline but was
dropped entirely because iterations consistently timed out or failed at this workload scale, making reliable
measurement infeasible (issue #254).
@@ -275,10 +273,10 @@ the seeded shuffle.
| K2 | knn-search | large | 0 | duckdb · postgis |
| B1 | bbox-filtering | small | 0 | duckdb · postgis · local |
| B2 | bbox-filtering | large | 0 | duckdb · postgis |
-| A_S1 | national-scale-spatial-join | small | 72 | broadcast-8 · partitioned-8 · duckdb · postgis |
-| A_S2 | national-scale-spatial-join | small | 40 | broadcast-4 · partitioned-4 |
-| A_M1 | national-scale-spatial-join | medium | 72 | broadcast-8 · partitioned-8 · duckdb · postgis |
-| A_M2 | national-scale-spatial-join | medium | 40 | broadcast-4 · partitioned-4 |
+| A_S1 | national-scale-spatial-join | small | 200 | broadcast-2 · broadcast-8 · broadcast-12 · partitioned-2 · partitioned-8 · partitioned-12 · duckdb · postgis |
+| A_S2 | national-scale-spatial-join | small | 176 | broadcast-4 · broadcast-16 · partitioned-4 · partitioned-16 |
+| A_M1 | national-scale-spatial-join | medium | 176 | broadcast-8 · broadcast-12 · partitioned-8 · partitioned-12 · duckdb · postgis |
+| A_M2 | national-scale-spatial-join | medium | 176 | broadcast-4 · broadcast-16 · partitioned-4 · partitioned-16 |
| A_M3 | national-scale-spatial-join | medium | 24 | broadcast-2 · partitioned-2 |
| A_L1 | national-scale-spatial-join | large | 80 | broadcast-16 · broadcast-2 |
| A_L2 | national-scale-spatial-join | large | 80 | partitioned-16 · partitioned-2 |
8 changes: 8 additions & 0 deletions benchmark_runner.py
@@ -24,10 +24,12 @@
national_scale_spatial_join_databricks_broadcast_2_nodes,
national_scale_spatial_join_databricks_broadcast_4_nodes,
national_scale_spatial_join_databricks_broadcast_8_nodes,
+national_scale_spatial_join_databricks_broadcast_12_nodes,
national_scale_spatial_join_databricks_broadcast_16_nodes,
national_scale_spatial_join_databricks_partitioned_2_nodes,
national_scale_spatial_join_databricks_partitioned_4_nodes,
national_scale_spatial_join_databricks_partitioned_8_nodes,
+national_scale_spatial_join_databricks_partitioned_12_nodes,
national_scale_spatial_join_databricks_partitioned_16_nodes,
)

@@ -104,6 +106,9 @@ def benchmark_runner() -> None:
case "national-scale-spatial-join-databricks-broadcast-8-nodes":
national_scale_spatial_join_databricks_broadcast_8_nodes()
return
+case "national-scale-spatial-join-databricks-broadcast-12-nodes":
+national_scale_spatial_join_databricks_broadcast_12_nodes()
+return
case "national-scale-spatial-join-databricks-broadcast-16-nodes":
national_scale_spatial_join_databricks_broadcast_16_nodes()
return
@@ -116,6 +121,9 @@ def benchmark_runner() -> None:
case "national-scale-spatial-join-databricks-partitioned-8-nodes":
national_scale_spatial_join_databricks_partitioned_8_nodes()
return
+case "national-scale-spatial-join-databricks-partitioned-12-nodes":
+national_scale_spatial_join_databricks_partitioned_12_nodes()
+return
case "national-scale-spatial-join-databricks-partitioned-16-nodes":
national_scale_spatial_join_databricks_partitioned_16_nodes()
return