9 changes: 9 additions & 0 deletions .gitignore
@@ -348,3 +348,12 @@ MigrationBackup/

# Ionide (cross platform F# VS Code tools) working folder
.ionide/

# Airlines demo benchmark results
bench/results/*.csv
bench/results/*.txt
bench/results/*.log

# Docker volumes and temporary files
docker/init/*.sql.gz
docker/init/*.disabled
165 changes: 165 additions & 0 deletions README.md
@@ -46,3 +46,168 @@ The results below represent the amount of time (ns) the operation takes per iteration
As this comparison shows, Doublets is between 1746 and 15745 times faster than PostgreSQL in write operations, and between 100 and 9694 times faster in read operations.

To get fresh numbers, please fork the repository and rerun the benchmark in GitHub Actions.

---

## Benchmark: Flight Timetable (Airlines Demo)

This benchmark compares PostgreSQL 18 and Doublets on realistic airline timetable queries, using the [PostgresPro Airlines demo database](https://postgrespro.ru/education/demodb).

### What's Being Tested

This benchmark evaluates both systems on:
- **Complex relational queries**: Multi-table joins with temporal validity checks
- **Large datasets**: 6-month to 1-year flight schedules (~250k–500k flights)
- **Real-world operations**: Airport departures/arrivals, route searches, aggregations
- **Two durability modes**:
- **Durable** (production-like): Full ACID with WAL
- **Embedded-like**: WAL-light configuration (similar to embedded databases)

### Queries

The benchmark includes 9 timetable queries:
1. Departures from an airport by date
2. Arrivals at an airport by date
3. Next available flight on a route
4. Manual join with temporal validity checks (benchmarked in departure and arrival variants)
5. Route details with airport information
6. Flight status distribution
7. Busiest routes analysis
8. Flights by date range

See [`sql/10_timetable_queries.sql`](sql/10_timetable_queries.sql) for details.
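
For a feel of the workload, the first query can be run by hand. A sketch, assuming the compose service name `pg`, the image's default `postgres` superuser, the demo database name `demo`, and the demo's frozen-clock helper `bookings.now()` (the authoritative query text lives in the SQL file):

```bash
# Departures from Sheremetyevo (SVO) on the dataset's "current" day.
docker compose exec pg psql -U postgres -d demo -c "
  SELECT flight_no, scheduled_departure, arrival_airport
  FROM bookings.flights
  WHERE departure_airport = 'SVO'
    AND scheduled_departure::date = bookings.now()::date
  ORDER BY scheduled_departure;"
```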

### Getting Started

#### Prerequisites
- Docker and Docker Compose
- ~10GB free disk space (for 1-year dataset)
- Python 3 (for result analysis)

#### Quick Start

```bash
# 1. Start PostgreSQL 18 with Airlines demo data (6 months)
cd docker
docker compose up -d

# Wait for database to load (~5 minutes)
docker compose logs -f pg

# 2. Run PostgreSQL benchmarks
cd ../bench/pg
./run.sh durable 6m 10 # Durable mode
./run.sh embedded 6m 10 # Embedded-like mode

# 3. Run Doublets benchmarks (TODO: implement)
cd ../doublets
./run.sh volatile 6m 10
./run.sh nonvolatile 6m 10

# 4. Compare results
ls -lh ../results/*.csv
```
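
Every results file shares the header `system,durability_mode,dataset,query_id,run,rows,ms`, so per-query medians fall out with standard tools. A minimal sketch against the newest Doublets CSV (adjust the glob for other runs):

```bash
# Median of the ms column (field 7) per query_id (field 4).
csv=$(ls -t ../results/doublets_volatile_6m_*.csv | head -1)
for q in $(tail -n +2 "${csv}" | cut -d, -f4 | sort -u); do
  grep ",${q}," "${csv}" | cut -d, -f7 | sort -n |
    awk -v q="${q}" '{ a[NR] = $1 } END { print q ": " a[int((NR + 1) / 2)] " ms" }'
done
```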

#### Durability Modes

**Durable Mode** (PostgreSQL default):
- Full ACID guarantees
- WAL enabled with fsync
- Production-safe
- Baseline for comparison

**Embedded-Like Mode** (PostgreSQL optimized):
- `fsync=off`, `synchronous_commit=off`
- `wal_level=minimal`
- Optional: UNLOGGED tables
- Trades durability for speed (matches embedded DB behavior)

To run in embedded-like mode:
```bash
cd docker
docker compose -f docker-compose.yml -f compose.embedded.yml up -d
```
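
The override file carries the actual settings; to confirm which mode a running server is in, query it directly. A sketch, assuming the image's default `postgres` superuser:

```bash
# Expect off / off / minimal under the embedded-like override,
# and on / on / replica under the durable defaults.
docker compose exec pg psql -U postgres \
  -c "SHOW fsync;" -c "SHOW synchronous_commit;" -c "SHOW wal_level;"
```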

### Directory Structure

```
docker/
docker-compose.yml # PostgreSQL 18 setup (durable mode)
compose.embedded.yml # Override for embedded-like mode
init/
01_download_demo.sh # Auto-download Airlines demo DB
99_unlogged.sql # Optional: convert to UNLOGGED tables

sql/
10_timetable_queries.sql # All benchmark queries

bench/
pg/
run.sh # PostgreSQL benchmark script
doublets/
run.sh # Doublets benchmark script (placeholder)
results/ # CSV output and EXPLAIN logs
schema-mapping.md # How to map Airlines schema to Doublets

docs/
HOWTO.md # Detailed setup and usage guide
```

### Documentation

- **[HOWTO.md](docs/HOWTO.md)** - Complete setup guide, dataset options, troubleshooting
- **[schema-mapping.md](bench/schema-mapping.md)** - Mapping Airlines entities to Doublets links
- **[10_timetable_queries.sql](sql/10_timetable_queries.sql)** - All queries with explanations

### Dataset Sizes

| Size | Period | Flights | PostgreSQL | Compressed | Download Time |
|------|----------|---------|------------|------------|---------------|
| 3m | 3 months | ~125k | ~1.3 GB | 133 MB | ~2 min |
| 6m | 6 months | ~250k | ~2.7 GB | 276 MB | ~5 min |
| 1y | 1 year | ~500k | ~5.4 GB | 558 MB | ~10 min |
| 2y | 2 years | ~1M | ~11 GB | 1137 MB | ~20 min |

Default: **6 months** (good balance of size and completeness)

### Implementation Status

- [x] PostgreSQL 18 Docker setup
- [x] Airlines demo database integration
- [x] Timetable queries (9 queries)
- [x] PostgreSQL benchmark script
- [x] Durability modes (durable + embedded-like)
- [x] Schema mapping documentation
- [ ] **Doublets implementation** (TODO)
- [ ] Results comparison and visualization

### Next Steps

To complete this benchmark:

1. **Implement Doublets data model** (see `bench/schema-mapping.md`)
- Map Airports, Routes, Flights to links
- Handle temporal data (validity ranges)
- Support NULL values and enums

2. **Implement equivalent queries**
   - Ensure exactly the same result sets as PostgreSQL
   - Validate with checksums (see the sketch after this list)

3. **Run comparative benchmarks**
- Two durability modes
- Two dataset sizes (6m, 1y)
- 10 runs per query

4. **Analyze and visualize results**
- Compare median times
- Identify bottlenecks
- Generate comparison charts
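
For the checksum validation in step 2, one workable approach is an order-insensitive hash over identically formatted rows. A sketch of the PostgreSQL half, assuming the `demo` database name; the Doublets side must emit byte-identical comma-separated rows before running the same `sort | md5sum`:

```bash
# Unaligned, tuples-only, comma-separated output keeps the rows diffable.
docker compose exec pg psql -U postgres -d demo -At -F',' -c \
  "SELECT flight_no, scheduled_departure, arrival_airport
   FROM bookings.flights
   WHERE departure_airport = 'SVO'" |
  sort | md5sum
```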

### References

- [PostgresPro Airlines Demo](https://postgrespro.ru/education/demodb) - Official documentation
- [PostgreSQL 18 Release Notes](https://www.postgresql.org/docs/18/) - What's new
- [Doublets Documentation](https://github.com/linksplatform/Data.Doublets) - Link storage system
- [Issue #11](https://github.com/linksplatform/Comparisons.PostgreSQLVSDoublets/issues/11) - Original requirements
164 changes: 164 additions & 0 deletions bench/doublets/run.sh
@@ -0,0 +1,164 @@
#!/bin/bash
# ============================================================================
# Doublets Airlines Demo - Benchmark Script (Placeholder)
# ============================================================================
# This script runs timetable queries against the Doublets implementation
# of the Airlines demo database and collects timing measurements.
#
# Usage: ./run.sh <durability_mode> <dataset_size> [num_runs]
# durability_mode: volatile or nonvolatile
# dataset_size: 3m, 6m, 1y, or 2y
# num_runs: number of iterations per query (default: 10)
#
# Example:
# ./run.sh volatile 6m 10
# ./run.sh nonvolatile 1y 20
#
# Output:
# - CSV file: ../results/doublets_<mode>_<dataset>_<timestamp>.csv
#
# TODO: This is a placeholder. Implement actual Doublets benchmarking logic.
# ============================================================================

set -euo pipefail

# Configuration
DURABILITY_MODE="${1:-volatile}"
DATASET_SIZE="${2:-6m}"
NUM_RUNS="${3:-10}"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
RESULTS_DIR="../results"
OUTPUT_CSV="${RESULTS_DIR}/doublets_${DURABILITY_MODE}_${DATASET_SIZE}_${TIMESTAMP}.csv"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

# Create results directory
mkdir -p "${RESULTS_DIR}"

# Initialize CSV
echo "system,durability_mode,dataset,query_id,run,rows,ms" > "${OUTPUT_CSV}"

echo -e "${GREEN}=== Doublets Benchmark ===${NC}"
echo "Mode: ${DURABILITY_MODE}"
echo "Dataset: ${DATASET_SIZE}"
echo "Runs per query: ${NUM_RUNS}"
echo "Output: ${OUTPUT_CSV}"
echo ""

echo -e "${YELLOW}=== TODO: Doublets Implementation ===${NC}"
echo ""
echo "This is a placeholder script. To complete the Doublets benchmark, implement:"
echo ""
echo "1. Data Loading:"
echo " - Load Airlines data from PostgreSQL or CSV export"
echo " - Convert entities to Doublets links (see bench/schema-mapping.md)"
echo " - Store in Doublets database (volatile or nonvolatile mode)"
echo ""
echo "2. Query Implementation:"
echo " - Implement equivalent queries using Doublets link API"
echo " - Ensure result sets match PostgreSQL exactly"
echo " - See bench/schema-mapping.md for query mappings"
echo ""
echo "3. Benchmark Execution:"
echo " - Warm-up: run each query once"
echo " - Measure: run each query ${NUM_RUNS} times"
echo " - Record: wall-clock time (ms) and row count"
echo " - Write results to CSV with same format as PostgreSQL benchmark"
echo ""
echo "4. Validation:"
echo " - Compare result sets with PostgreSQL (checksums)"
echo " - Verify performance improvements"
echo " - Report any discrepancies"
echo ""
echo "Suggested implementation approaches:"
echo ""
echo " a) Rust implementation (matching existing rust/ directory):"
echo " - Use existing Doublets Rust library"
echo " - Create Airlines data model"
echo " - Implement queries using Doublets API"
echo " - Add benchmark harness"
echo ""
echo " b) C++ implementation (matching existing cpp/ directory):"
echo " - Use existing Doublets C++ library"
echo " - Follow same approach as Rust"
echo ""
echo " c) Standalone tool:"
echo " - Create separate benchmark binary"
echo " - Load data from CSV export"
echo " - Run queries and output CSV"
echo ""
echo "Reference implementations:"
echo " - rust/benches/bench.rs - existing Doublets benchmarks"
echo " - bench/pg/run.sh - PostgreSQL benchmark (for CSV format)"
echo " - bench/schema-mapping.md - detailed mapping documentation"
echo ""
echo -e "${YELLOW}Until implementation is complete, this script generates mock data.${NC}"
echo ""

# Generate mock data for testing the analysis pipeline
echo -e "${YELLOW}Generating mock benchmark data...${NC}"

# Define query IDs (matching PostgreSQL)
QUERY_IDS=(
"departures_svo"
"arrivals_svo"
"next_flight_svx_wuh"
"manual_departures_svo"
"manual_arrivals_svo"
"route_details"
"status_counts"
"busiest_routes"
"date_range"
)

# Mock: Doublets should be ~1000-10000x faster than PostgreSQL
# Generate realistic-looking performance data
for query_id in "${QUERY_IDS[@]}"; do
# Simulate row counts (would come from actual queries)
case "${query_id}" in
"next_flight_svx_wuh")
row_count=1
;;
"route_details")
row_count=20
;;
"status_counts")
row_count=5
;;
"busiest_routes")
row_count=10
;;
"date_range")
row_count=7
;;
*)
row_count=$((RANDOM % 100 + 10))
;;
esac

# Generate ${NUM_RUNS} measurements with small variance
base_time=$((RANDOM % 50 + 10)) # 10-59 ms for Doublets (vs. seconds for PostgreSQL)

for run in $(seq 1 "${NUM_RUNS}"); do
# Add small random variance
variance=$((RANDOM % 20 - 10))
time=$((base_time + variance))
[ ${time} -lt 1 ] && time=1 # Ensure positive

echo "doublets,${DURABILITY_MODE},${DATASET_SIZE},${query_id},${run},${row_count},${time}" >> "${OUTPUT_CSV}"
done
done

echo -e "${GREEN}Mock data generated${NC}"
echo ""

echo -e "${GREEN}=== Benchmark Complete (Mock) ===${NC}"
echo "Mock results saved to: ${OUTPUT_CSV}"
echo ""
echo -e "${RED}WARNING: This data is MOCK data for testing purposes.${NC}"
echo -e "${RED}Implement actual Doublets queries to get real measurements.${NC}"
echo ""