
Add structured logging to backfill smart chunking #109

Merged
KeKs0r merged 22 commits into main from logging-on-smart-chunking
Apr 15, 2026

Conversation

KeKs0r (Contributor) commented Apr 14, 2026

Summary

  • Smart chunking rewrite for @chkit/plugin-backfill: multi-strategy planner (equal-width, quantile, group-by-key, string-prefix, temporal-bucket) with refinement passes.
  • New @logtape/logtape-based structured logging across the CLI and backfill planner. Slow ClickHouse queries (>5s) emit warnings with backoff repeats up to 30s.
  • Enabled with CHKIT_DEBUG=1; default behavior unchanged.

Test plan

  • CHKIT_DEBUG=1 chkit plugin backfill plan ... shows structured planner logs and slow-query warnings on a real ClickHouse instance
  • chkit plugin backfill plan ... (no CHKIT_DEBUG) produces no extra stderr noise
  • bun verify passes on CI

🤖 Generated with Claude Code

KeKs0r and others added 19 commits April 2, 2026 00:13
ClickHouse Cloud enables parallel replicas by default, which inflates
count() results by the replica count (observed 35x over-count). Add
SETTINGS enable_parallel_replicas=0 to all count queries used during
chunk planning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
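The fix in the commit above can be sketched as follows. This is a hypothetical helper, not the planner's actual query builder; only the setting name and the appended-SETTINGS approach come from the commit message.

```typescript
// Hypothetical sketch: append the setting to planner count queries so
// ClickHouse Cloud's parallel replicas don't multiply count() results.
// buildCountQuery is illustrative, not the PR's actual query builder.
const DISABLE_PARALLEL_REPLICAS = "SETTINGS enable_parallel_replicas = 0";

function buildCountQuery(table: string, where: string): string {
  return `SELECT count() AS rows FROM ${table} WHERE ${where} ${DISABLE_PARALLEL_REPLICAS}`;
}

console.log(buildCountQuery("events", "ts >= '2026-01-01'"));
```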
Extend the fix to min/max, GROUP BY prefix, and GROUP BY temporal
bucket queries. Tested on ObsessionDB: GROUP BY counts are inflated
by the replica count (16-35x), and min/max queries are ~5x slower
with replicas on. Extract a shared DISABLE_PARALLEL_REPLICAS constant
with a note that this is an ObsessionDB workaround.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend ClickHouseExecutor.query() to accept per-query settings and
thread querySettings through PlannerContext. The ObsessionDB workaround
(enable_parallel_replicas=0) is now set once at the plugin call site
instead of being appended to each SQL string.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
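A minimal sketch of that shape, assuming simplified interfaces (the real ClickHouseExecutor and PlannerContext in the PR may differ):

```typescript
// Simplified sketch: per-query settings flow through the executor instead
// of being appended to each SQL string. All shapes here are assumptions.
type QuerySettings = Record<string, string | number>;

interface PlannerContext {
  querySettings?: QuerySettings;
}

class ClickHouseExecutor {
  // A real executor would pass `settings` to the ClickHouse client's
  // per-request settings option; here we render them for inspection.
  query(sql: string, settings: QuerySettings = {}): string {
    const rendered = Object.entries(settings)
      .map(([key, value]) => `${key} = ${value}`)
      .join(", ");
    return rendered ? `${sql} SETTINGS ${rendered}` : sql;
  }
}

// Set once at the plugin call site; every planner query inherits it.
const ctx: PlannerContext = { querySettings: { enable_parallel_replicas: 0 } };
const executor = new ClickHouseExecutor();
console.log(executor.query("SELECT min(id), max(id) FROM events", ctx.querySettings));
```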
The backfill moves uncompressed data, so chunk sizing should be based on
uncompressed bytes rather than compressed. With ~8x compression ratios,
using compressed bytes produced chunks ~8x larger than intended. All size
comparisons, merge budgets, and row-target calculations now use
bytesUncompressed. Test maxChunkBytes values doubled to match the 2x
compression ratio in the test fixture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
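The sizing error is easy to see with back-of-envelope numbers. Assuming the ~8x compression ratio from the commit message (the function and figures below are illustrative):

```typescript
// A 1 GiB maxChunkBytes budget measured against compressed bytes
// actually moves ~8 GiB of uncompressed data per chunk at an 8x ratio.
function chunkCount(totalBytes: number, maxChunkBytes: number): number {
  return Math.ceil(totalBytes / maxChunkBytes);
}

const bytesCompressed = 8 * 2 ** 30;                          // 8 GiB on disk
const compressionRatio = 8;                                   // assumed, per the commit
const bytesUncompressed = bytesCompressed * compressionRatio; // 64 GiB actually moved
const maxChunkBytes = 1 * 2 ** 30;                            // 1 GiB budget per chunk

console.log(chunkCount(bytesCompressed, maxChunkBytes));   // 8 chunks: each ~8x too big
console.log(chunkCount(bytesUncompressed, maxChunkBytes)); // 64 chunks: correctly sized
```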
Analysis showed 3x oversampling is sufficient for equal-width range
splitting, reducing the number of estimation queries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
At root level (depth 0), use equal-width EXPLAIN ESTIMATE for the
initial split — fast metadata-only probes instead of a full GROUP BY
scan. Oversized children re-enter at depth 1+ and get GROUP BY prefix
refinement on their narrowed sub-ranges.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace prefix-based hot key discovery with a direct GROUP BY key
approach for sub-ranges with ≤100 distinct values. One query gives
exact per-key counts instead of recursive depth drilling (1→4 chars).

When a sub-range contains a single key, narrow the range to an exact
match and re-enter dispatch so focusedValue propagates to subsequent
dimension splitting (e.g. temporal buckets on the hot key).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
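The dispatch rule above can be sketched like this. The ≤100 threshold and the single-key narrowing come from the commit message; the function, types, and mode names are hypothetical.

```typescript
// Illustrative dispatch: few distinct values => one exact-count GROUP BY
// query; a lone key => narrow to an exact match and re-enter dispatch;
// otherwise fall back to recursive prefix drilling.
const MAX_DISTINCT_FOR_GROUP_BY = 100;

interface KeyCount { key: string; rows: number }

type SubRangePlan =
  | { mode: "group-by-key"; keys: KeyCount[] }  // one query, exact per-key counts
  | { mode: "narrow-to-key"; key: string }      // single key: re-dispatch on it
  | { mode: "prefix-drill" };                   // too many distinct values

function planSubRange(distinct: number, keys: KeyCount[]): SubRangePlan {
  if (distinct > MAX_DISTINCT_FOR_GROUP_BY) return { mode: "prefix-drill" };
  if (keys.length === 1) return { mode: "narrow-to-key", key: keys[0].key };
  return { mode: "group-by-key", keys };
}

console.log(planSubRange(1, [{ key: "mega-corp", rows: 8_000_000 }]).mode); // "narrow-to-key"
```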
The BigInt string boundary computation used a hardcoded 8-byte width,
silently truncating sort key values longer than 8 characters. This caused
chunk boundaries to miss rows at the end of the value range — e.g. a
partition with tenants "mega-corp" (9 chars) through "tenant-0199" (11
chars) would lose all rows past the truncated upper bound "tenant-0".

Replace the fixed width with a dynamic width derived from the actual
input range strings (max of rangeFrom/rangeTo length). Also strip
trailing null bytes from bigIntToStr output to avoid inflating boundary
strings beyond their original length, while preserving semantically
meaningful nulls via a minLength parameter.

Adds E2E test infrastructure: a seed script to populate ClickHouse with
controlled datasets, and an E2E test for the skewed power-law scenario
(80/20 tenant distribution) that verifies hot key detection, cross-dimension
splitting, estimate accuracy, and full row coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
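A minimal sketch of the dynamic-width scheme, using the tenant names from the commit message. The conversion functions are illustrative, not the PR's actual implementation:

```typescript
// Strings map to BigInts over a width derived from the actual range
// endpoints (max of rangeFrom/rangeTo length) instead of a hardcoded
// 8 bytes, so sort keys longer than 8 characters are not truncated.
function strToBigInt(s: string, width: number): bigint {
  let n = 0n;
  for (let i = 0; i < width; i++) {
    n = (n << 8n) | BigInt(i < s.length ? s.charCodeAt(i) & 0xff : 0);
  }
  return n;
}

function bigIntToStr(n: bigint, width: number, minLength = 0): string {
  const bytes: number[] = [];
  for (let i = 0; i < width; i++) {
    bytes.unshift(Number(n & 0xffn));
    n >>= 8n;
  }
  let s = String.fromCharCode(...bytes);
  // Strip trailing null padding, but preserve semantically meaningful
  // nulls down to minLength.
  while (s.length > minLength && s.endsWith("\0")) s = s.slice(0, -1);
  return s;
}

const rangeFrom = "mega-corp";   // 9 chars: already past a fixed 8-byte width
const rangeTo = "tenant-0199";   // 11 chars
const width = Math.max(rangeFrom.length, rangeTo.length); // dynamic: 11, not 8

const mid = (strToBigInt(rangeFrom, width) + strToBigInt(rangeTo, width)) / 2n;
console.log(bigIntToStr(mid, width)); // a boundary strictly between the endpoints
```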
Adds scenario 2: three tenants at ~30% each with 10% long tail, verifying
that multiple hot keys are independently detected and split on the
secondary dimension. Renames seed script to .script.ts so bun test does
not re-execute it on every test run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Development artifacts: benchmark scripts for testing chunking strategies
against live ClickHouse, query traces, and E2E scenario planning notes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch CLI debug output to @logtape/logtape and emit structured logs
from the chunk planner: introspection summaries, per-partition split
decisions, slow query warnings (>5s), and query timing. Enabled with
CHKIT_DEBUG=1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
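The gating and the >5s threshold can be sketched as follows. The real PR uses @logtape/logtape; this stub only illustrates the CHKIT_DEBUG gate and the slow-query branch, and every name below (emit, reportQuery, records) is made up for the sketch.

```typescript
// Minimal sketch of the CHKIT_DEBUG gate and the slow-query warning.
const SLOW_QUERY_MS = 5_000;
const records: Array<{ level: string; message: string; elapsedMs: number }> = [];

function emit(level: "debug" | "warning", message: string, elapsedMs: number): void {
  // Default behavior unchanged: emit nothing unless CHKIT_DEBUG=1.
  if (process.env.CHKIT_DEBUG !== "1") return;
  records.push({ level, message, elapsedMs });
  console.error(JSON.stringify({ level, message, elapsedMs }));
}

function reportQuery(sql: string, elapsedMs: number): void {
  if (elapsedMs > SLOW_QUERY_MS) {
    emit("warning", `slow ClickHouse query: ${sql}`, elapsedMs);
  } else {
    emit("debug", `query finished: ${sql}`, elapsedMs);
  }
}

process.env.CHKIT_DEBUG = "1"; // opt in, as the PR description describes
reportQuery("SELECT count() FROM events", 7_200);
```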
@KeKs0r KeKs0r changed the base branch from marc/rebase-smart-chunking to main April 14, 2026 12:06
KeKs0r and others added 3 commits April 14, 2026 14:11
Expand the SDK section with a three-stage pipeline overview, working
examples for plan / build SQL / execute, plan persistence via the
boundary codec, logging configuration, and pointers to playground
scripts that exercise each strategy against a real ClickHouse instance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The playground scripts were research artifacts from designing the smart
chunk planner, not shipped code. Trim the SDK README's playground
pointers since the scripts are gone.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@KeKs0r KeKs0r merged commit 50a34db into main Apr 15, 2026
2 checks passed
@KeKs0r KeKs0r deleted the logging-on-smart-chunking branch April 15, 2026 13:35