
Performance improvements#7

Merged
lovelaced merged 11 commits into main from mku-perf-improvements-3
Feb 18, 2026
Conversation

@michalkucharczyk
Contributor

Performance: TimescaleDB migration + parallel batch writer

Motivation

We ran jip-3-spammer against v0.3.0 with 300 nodes at realistic
rates (~258K events/s total) and hit performance problems pretty quickly — the single
batch writer couldn't keep up, events started dropping, and the write buffer filled up
within seconds of the nodes connecting.

Why

The original PostgreSQL schema uses a single flat events table with 6+ indexes. At 3M events/s from 1024 nodes, every INSERT must update all indexes, all writes hit the same table, and aggregate queries (dashboards, stats) scan the entire table. This doesn't scale.

Database: PostgreSQL → TimescaleDB

TimescaleDB is a PostgreSQL extension — same Postgres, just with time-series superpowers.

Hypertable with automatic chunking
The events table is split into 1-hour chunks automatically. Queries like "events in the last hour" scan 1-2 chunks instead of the whole table. Old data is dropped per-chunk (DROP TABLE, instant) instead of DELETE + vacuum.

32 hash partitions on node_id
Each 1-hour chunk is further split into 32 sub-chunks by hashing node_id. This spreads writes from 1024 nodes across 32 parallel physical tables — 32x less lock contention on indexes and WAL. Queries filtering by node_id only scan 1/32 of each chunk. This is a DB-internal detail, transparent to application code.
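At migration time, the chunking and hash partitioning described above boil down to two TimescaleDB calls. A minimal sketch (table and column names here are assumptions, not copied from the actual migration; `add_dimension` must run while the hypertable is still empty):

```sql
-- Hypothetical migration sketch: turn a plain table into a hypertable
-- with 1-hour time chunks, then add a 32-way hash dimension on node_id.
SELECT create_hypertable(
    'events', 'timestamp',
    chunk_time_interval => INTERVAL '1 hour'
);
SELECT add_dimension('events', 'node_id', number_partitions => 32);
```

From the application's point of view nothing changes: inserts and queries still target `events`, and TimescaleDB routes them to the right physical chunk.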

Continuous aggregates (pre-computed rollups)
Instead of running COUNT(*) over billions of rows on every API request:

  • event_stats_1m — per-minute counts per node/type, refreshed every 2 min
  • event_stats_1h — per-hour counts, built from the 1m aggregate (not raw events)

These are incrementally maintained by TimescaleDB — only changed chunks get re-aggregated.
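A sketch of what the per-minute rollup and its refresh policy look like, assuming hypothetical table and column names (the real view definitions live in the migration):

```sql
-- Hypothetical sketch of the per-minute rollup.
CREATE MATERIALIZED VIEW event_stats_1m
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', "timestamp") AS bucket,
       node_id, event_type, count(*) AS event_count
FROM events
GROUP BY bucket, node_id, event_type;

-- Re-materialize recently changed buckets every 2 minutes.
SELECT add_continuous_aggregate_policy('event_stats_1m',
    start_offset      => INTERVAL '10 minutes',
    end_offset        => INTERVAL '1 minute',
    schedule_interval => INTERVAL '2 minutes');

-- event_stats_1h is defined the same way, but selects FROM event_stats_1m
-- (a hierarchical continuous aggregate) instead of from raw events.
```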

Data retention pyramid

Tier               Resolution             Retention
Raw events         Full JSONB payload     7 days
1-min aggregates   Counts per node/type   30 days
1-hour aggregates  Counts per node/type   365 days

After 7 days raw event data is gone, but you still know how many events each node sent — per-minute for 30 days, per-hour for a year.

Compression after 2 hours
Columnar compression grouped by (node_id, event_type), ordered by timestamp DESC. Typical 10-20x compression ratio. Queries can skip irrelevant segments without decompressing.
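The compression and retention settings above map onto TimescaleDB policies roughly like this (a hedged sketch with assumed table names, not the actual migration):

```sql
-- Hypothetical policy sketch: columnar compression of chunks older than
-- 2 hours, segmented by (node_id, event_type), ordered by timestamp DESC.
ALTER TABLE events SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'node_id, event_type',
    timescaledb.compress_orderby   = '"timestamp" DESC'
);
SELECT add_compression_policy('events', INTERVAL '2 hours');

-- Retention pyramid: drop raw chunks after 7 days; each aggregate keeps
-- its own, longer window.
SELECT add_retention_policy('events', INTERVAL '7 days');
SELECT add_retention_policy('event_stats_1m', INTERVAL '30 days');
SELECT add_retention_policy('event_stats_1h', INTERVAL '365 days');
```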

Other schema changes

  • event_type: INTEGER → SMALLINT (130 types fit in 2 bytes, saving ~500GB/day at full throughput)
  • id BIGSERIAL PRIMARY KEY → event_id BIGINT (no PK — hypertables don't support it, and ON CONFLICT dedup is too expensive at 3M/s)
  • Indexes reduced from 6+ to 2 (each index costs write throughput)
  • No per-row triggers (catastrophic at high throughput) — app-level batch stats instead
  • Docker image: postgres:18-alpine → timescale/timescaledb:latest-pg16

Ingestion: Work-stealing batch writer pool

Single writer replaced with 32 parallel workers sharing an Arc<Mutex<Receiver>>:

  • Each worker drains events into a local batch (up to 16k events) then flushes via COPY BINARY
  • Node stats aggregated per-worker and flushed every 5s (additive, concurrent-safe)
  • Channel buffer: 5M events to absorb bursts from 1024 nodes
  • Removed single-row store_event() — everything is batched

Server improvements

  • Per-event logging downgraded from debug! to trace! to reduce log noise
  • wait_for_connections() watch channel for deterministic test synchronization
  • Partition health check removed (TimescaleDB handles this)

Tests

  • 13 new data-driven API tests: insert real events, query endpoints, validate results
  • Event encoding added for WorkPackageReceived, Authorized, Refined, GuaranteeBuilt
  • All test setups updated for TimescaleDB

Bug fixes

  • da_stats INTEGER overflow: Status fields (num_shards, num_preimages, preimages_size) cast to BIGINT instead of INTEGER

Replace PostgreSQL partitioned schema with TimescaleDB hypertable:
- 1-hour chunks with 32 hash partitions on node_id
- Continuous aggregates: event_stats_1m (2min refresh), event_stats_1h
- Compression after 2h (segmentby node_id + event_type)
- Retention: raw 7d, 1m aggs 30d, 1h aggs 365d
- event_type SMALLINT (was INTEGER), event_id BIGINT (no PK)
- No per-row triggers (app-level batch stats instead)
- Event types lookup table with convenience view
- Docker image: timescale/timescaledb:latest-pg16
- Remove all partition management code (ensure_partitions_exist,
  spawn_partition_maintenance, shutdown, check_partition_health,
  PartitionHealth struct)
- Add store_events_batch() with COPY BINARY for >10 events,
  simple INSERT for <=10 events
- Add update_node_stats() using unnest() for concurrent-safe
  batch updates
- Update event_type from i32 to i16 (SMALLINT)
- Update store_nodes_connected_batch() with address parameter
- Add ping(), get_node_by_id(), get_cores_telemetry_agg()
- Remove PartitionHealth export from lib.rs
- Adapt all query methods for parameterized DurationPreset intervals
Replace single-task batch writer with parallel workers:
- Arc<Mutex<Receiver>> shared across all workers
- Each worker drains events into local batch (up to 16k)
  then flushes via store.store_events_batch()
- Timeout-based accumulation (100ms) prevents tiny flushes
- Separate stats flusher task aggregates node counts every 5s
- node_connected() now takes address parameter
- flush() sends sentinel to all workers and waits for responses
- Channel buffer: 5M events
- health.rs: remove partition_check() (no partitions in TimescaleDB)
- main.rs: remove partition health check (5 checks instead of 6)
- rate_limiter.rs: make MAX_CONNECTIONS pub for test access
- Restructure TelemetryServer to store TcpListener (enables port 0 binding)
- Add with_options() constructor with no_rate_limit parameter
- Add local_addr(), wait_for_connections() for deterministic test sync
- Add connection_watch channel for tracking connection count changes
- Remove BytesMutExt trait in favor of bytes::Buf
- Remove read timeouts (handled by TCP keepalive)
- api.rs: remove secondary_interval() params from store query call sites
…eBuilt

These events had stub encoding (0 bytes) which caused data-driven
tests to silently fail — events were sent as empty payloads and never stored.
- Add 13 data-driven tests validating JSONB query paths against real events
- Add now_jce_micros() helper for realistic test timestamps
- Update all test setup functions to use port 0 + local_addr() pattern
- Set test cache TTL to zero to avoid stale cache hits
- Use realistic JCE-relative timestamps so events pass time-window filters
@lovelaced lovelaced merged commit cc275c3 into main Feb 18, 2026
4 checks passed