
Add structured logging to backfill smart chunking #109

Merged
KeKs0r merged 22 commits into main from logging-on-smart-chunking
Apr 15, 2026

Conversation

KeKs0r (Contributor) commented Apr 14, 2026

Summary

  • Smart chunking rewrite for @chkit/plugin-backfill: multi-strategy planner (equal-width, quantile, group-by-key, string-prefix, temporal-bucket) with refinement passes.
  • New @logtape/logtape-based structured logging across the CLI and backfill planner. Slow ClickHouse queries (>5s) emit warnings with backoff repeats up to 30s.
  • Enabled with CHKIT_DEBUG=1; default behavior unchanged.

Test plan

  • CHKIT_DEBUG=1 chkit plugin backfill plan ... shows structured planner logs and slow-query warnings on a real ClickHouse instance
  • chkit plugin backfill plan ... (no CHKIT_DEBUG) produces no extra stderr noise
  • bun verify passes on CI

🤖 Generated with Claude Code

KeKs0r and others added 19 commits April 2, 2026 00:13
ClickHouse Cloud enables parallel replicas by default, which inflates
count() results by the replica count (observed 35x over-count). Add
SETTINGS enable_parallel_replicas=0 to all count queries used during
chunk planning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
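The fix in the commit above can be sketched as follows. This is a hypothetical helper, not the planner's actual query builder; only the setting name and the appended-SETTINGS approach come from the commit message.

```typescript
// Hypothetical sketch: append the setting to planner count queries so
// ClickHouse Cloud's parallel replicas don't multiply count() results.
// buildCountQuery is illustrative, not the PR's actual query builder.
const DISABLE_PARALLEL_REPLICAS = "SETTINGS enable_parallel_replicas = 0";

function buildCountQuery(table: string, where: string): string {
  return `SELECT count() AS rows FROM ${table} WHERE ${where} ${DISABLE_PARALLEL_REPLICAS}`;
}

console.log(buildCountQuery("events", "ts >= '2026-01-01'"));
```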
Extend the fix to min/max, GROUP BY prefix, and GROUP BY temporal
bucket queries. Tested on ObsessionDB: GROUP BY counts are inflated
by the replica count (16-35x), and min/max queries are ~5x slower
with replicas on. Extract a shared DISABLE_PARALLEL_REPLICAS constant
with a note that this is an ObsessionDB workaround.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend ClickHouseExecutor.query() to accept per-query settings and
thread querySettings through PlannerContext. The ObsessionDB workaround
(enable_parallel_replicas=0) is now set once at the plugin call site
instead of being appended to each SQL string.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
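A minimal sketch of that shape, assuming simplified interfaces (the real ClickHouseExecutor and PlannerContext in the PR may differ):

```typescript
// Simplified sketch: per-query settings flow through the executor instead
// of being appended to each SQL string. All shapes here are assumptions.
type QuerySettings = Record<string, string | number>;

interface PlannerContext {
  querySettings?: QuerySettings;
}

class ClickHouseExecutor {
  // A real executor would pass `settings` to the ClickHouse client's
  // per-request settings option; here we render them for inspection.
  query(sql: string, settings: QuerySettings = {}): string {
    const rendered = Object.entries(settings)
      .map(([key, value]) => `${key} = ${value}`)
      .join(", ");
    return rendered ? `${sql} SETTINGS ${rendered}` : sql;
  }
}

// Set once at the plugin call site; every planner query inherits it.
const ctx: PlannerContext = { querySettings: { enable_parallel_replicas: 0 } };
const executor = new ClickHouseExecutor();
console.log(executor.query("SELECT min(id), max(id) FROM events", ctx.querySettings));
```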
The backfill moves uncompressed data, so chunk sizing should be based on
uncompressed bytes rather than compressed. With ~8x compression ratios,
using compressed bytes produced chunks ~8x larger than intended. All size
comparisons, merge budgets, and row-target calculations now use
bytesUncompressed. Test maxChunkBytes values doubled to match the 2x
compression ratio in the test fixture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
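The sizing error is easy to see with back-of-envelope numbers. Assuming the ~8x compression ratio from the commit message (the function and figures below are illustrative):

```typescript
// A 1 GiB maxChunkBytes budget measured against compressed bytes
// actually moves ~8 GiB of uncompressed data per chunk at an 8x ratio.
function chunkCount(totalBytes: number, maxChunkBytes: number): number {
  return Math.ceil(totalBytes / maxChunkBytes);
}

const bytesCompressed = 8 * 2 ** 30;                          // 8 GiB on disk
const compressionRatio = 8;                                   // assumed, per the commit
const bytesUncompressed = bytesCompressed * compressionRatio; // 64 GiB actually moved
const maxChunkBytes = 1 * 2 ** 30;                            // 1 GiB budget per chunk

console.log(chunkCount(bytesCompressed, maxChunkBytes));   // 8 chunks: each ~8x too big
console.log(chunkCount(bytesUncompressed, maxChunkBytes)); // 64 chunks: correctly sized
```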
Analysis showed 3x oversampling is sufficient for equal-width range
splitting, reducing the number of estimation queries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
At root level (depth 0), use equal-width EXPLAIN ESTIMATE for the
initial split — fast metadata-only probes instead of a full GROUP BY
scan. Oversized children re-enter at depth 1+ and get GROUP BY prefix
refinement on their narrowed sub-ranges.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace prefix-based hot key discovery with a direct GROUP BY key
approach for sub-ranges with ≤100 distinct values. One query gives
exact per-key counts instead of recursive depth drilling (1→4 chars).

When a sub-range contains a single key, narrow the range to an exact
match and re-enter dispatch so focusedValue propagates to subsequent
dimension splitting (e.g. temporal buckets on the hot key).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
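The dispatch rule above can be sketched like this. The ≤100 threshold and the single-key narrowing come from the commit message; the function, types, and mode names are hypothetical.

```typescript
// Illustrative dispatch: few distinct values => one exact-count GROUP BY
// query; a lone key => narrow to an exact match and re-enter dispatch;
// otherwise fall back to recursive prefix drilling.
const MAX_DISTINCT_FOR_GROUP_BY = 100;

interface KeyCount { key: string; rows: number }

type SubRangePlan =
  | { mode: "group-by-key"; keys: KeyCount[] }  // one query, exact per-key counts
  | { mode: "narrow-to-key"; key: string }      // single key: re-dispatch on it
  | { mode: "prefix-drill" };                   // too many distinct values

function planSubRange(distinct: number, keys: KeyCount[]): SubRangePlan {
  if (distinct > MAX_DISTINCT_FOR_GROUP_BY) return { mode: "prefix-drill" };
  if (keys.length === 1) return { mode: "narrow-to-key", key: keys[0].key };
  return { mode: "group-by-key", keys };
}

console.log(planSubRange(1, [{ key: "mega-corp", rows: 8_000_000 }]).mode); // "narrow-to-key"
```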
The BigInt string boundary computation used a hardcoded 8-byte width,
silently truncating sort key values longer than 8 characters. This caused
chunk boundaries to miss rows at the end of the value range — e.g. a
partition with tenants "mega-corp" (9 chars) through "tenant-0199" (11
chars) would lose all rows past the truncated upper bound "tenant-0".

Replace the fixed width with a dynamic width derived from the actual
input range strings (max of rangeFrom/rangeTo length). Also strip
trailing null bytes from bigIntToStr output to avoid inflating boundary
strings beyond their original length, while preserving semantically
meaningful nulls via a minLength parameter.

Adds E2E test infrastructure: a seed script to populate ClickHouse with
controlled datasets, and an E2E test for the skewed power-law scenario
(80/20 tenant distribution) that verifies hot key detection, cross-dimension
splitting, estimate accuracy, and full row coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
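A minimal sketch of the dynamic-width scheme, using the tenant names from the commit message. The conversion functions are illustrative, not the PR's actual implementation:

```typescript
// Strings map to BigInts over a width derived from the actual range
// endpoints (max of rangeFrom/rangeTo length) instead of a hardcoded
// 8 bytes, so sort keys longer than 8 characters are not truncated.
function strToBigInt(s: string, width: number): bigint {
  let n = 0n;
  for (let i = 0; i < width; i++) {
    n = (n << 8n) | BigInt(i < s.length ? s.charCodeAt(i) & 0xff : 0);
  }
  return n;
}

function bigIntToStr(n: bigint, width: number, minLength = 0): string {
  const bytes: number[] = [];
  for (let i = 0; i < width; i++) {
    bytes.unshift(Number(n & 0xffn));
    n >>= 8n;
  }
  let s = String.fromCharCode(...bytes);
  // Strip trailing null padding, but preserve semantically meaningful
  // nulls down to minLength.
  while (s.length > minLength && s.endsWith("\0")) s = s.slice(0, -1);
  return s;
}

const rangeFrom = "mega-corp";   // 9 chars: already past a fixed 8-byte width
const rangeTo = "tenant-0199";   // 11 chars
const width = Math.max(rangeFrom.length, rangeTo.length); // dynamic: 11, not 8

const mid = (strToBigInt(rangeFrom, width) + strToBigInt(rangeTo, width)) / 2n;
console.log(bigIntToStr(mid, width)); // a boundary strictly between the endpoints
```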
Adds scenario 2: three tenants at ~30% each with 10% long tail, verifying
that multiple hot keys are independently detected and split on the
secondary dimension. Renames seed script to .script.ts so bun test does
not re-execute it on every test run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Development artifacts: benchmark scripts for testing chunking strategies
against live ClickHouse, query traces, and E2E scenario planning notes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch CLI debug output to @logtape/logtape and emit structured logs
from the chunk planner: introspection summaries, per-partition split
decisions, slow query warnings (>5s), and query timing. Enabled with
CHKIT_DEBUG=1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
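The gating and the >5s threshold can be sketched as follows. The real PR uses @logtape/logtape; this stub only illustrates the CHKIT_DEBUG gate and the slow-query branch, and every name below (emit, reportQuery, records) is made up for the sketch.

```typescript
// Minimal sketch of the CHKIT_DEBUG gate and the slow-query warning.
const SLOW_QUERY_MS = 5_000;
const records: Array<{ level: string; message: string; elapsedMs: number }> = [];

function emit(level: "debug" | "warning", message: string, elapsedMs: number): void {
  // Default behavior unchanged: emit nothing unless CHKIT_DEBUG=1.
  if (process.env.CHKIT_DEBUG !== "1") return;
  records.push({ level, message, elapsedMs });
  console.error(JSON.stringify({ level, message, elapsedMs }));
}

function reportQuery(sql: string, elapsedMs: number): void {
  if (elapsedMs > SLOW_QUERY_MS) {
    emit("warning", `slow ClickHouse query: ${sql}`, elapsedMs);
  } else {
    emit("debug", `query finished: ${sql}`, elapsedMs);
  }
}

process.env.CHKIT_DEBUG = "1"; // opt in, as the PR description describes
reportQuery("SELECT count() FROM events", 7_200);
```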
@KeKs0r KeKs0r changed the base branch from marc/rebase-smart-chunking to main April 14, 2026 12:06
KeKs0r and others added 3 commits April 14, 2026 14:11
Expand the SDK section with a three-stage pipeline overview, working
examples for plan / build SQL / execute, plan persistence via the
boundary codec, logging configuration, and pointers to playground
scripts that exercise each strategy against a real ClickHouse instance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The playground scripts were research artifacts from designing the smart
chunk planner, not shipped code. Trim the SDK README's playground
pointers since the scripts are gone.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@KeKs0r KeKs0r merged commit 50a34db into main Apr 15, 2026
2 checks passed
@KeKs0r KeKs0r deleted the logging-on-smart-chunking branch April 15, 2026 13:35