Add DataFusion + Parquet query engine design doc for one_d4 #1068

Merged
aaylward merged 7 commits into main from claude/datafusion-parquet-plan-tN00g
Feb 26, 2026
Conversation

@aaylward
Collaborator


cloudflare-workers-and-pages bot commented Feb 25, 2026

Deploying with Cloudflare Workers

✅ Deployment successful: 1d4-web at commit 7920e3b, updated Feb 26 2026, 05:05 AM (UTC)

Revise the DataFusion/Parquet design doc based on the planned chariot
integration (issue #1049). Key changes:

- Motif detection stays in Java — no Rust port of detectors
- Lichess bulk ingest is a Java CLI jar reusing one_d4 detectors
- motif_query (Rust) narrows to query engine + Parquet writer only
- Parquet schema includes 7 new Phase 9 motifs
- Architecture diagram updated to show Java-centric data flow
- New open questions on Phase 9 ordering and Java Parquet writer

https://claude.ai/code/session_011dyxSbaXZV93zBZL5SMbun
Replace the direct ChessQL→SQL compilation model with a Substrait-based
pipeline: ChessQL→Substrait Plan→{SQL, DataFusion}. Key design changes:

- SubstraitCompiler produces Substrait protobuf plans from ChessQL AST
- QueryRouter dispatches plans to SQL (via substrait-java) or DataFusion
  (via datafusion-substrait) based on feature flag or cost routing
- motif_query Rust service accepts Substrait plan bytes, not SQL strings
- Shadow mode runs both backends in parallel for migration validation
- Optional cost-based routing (boolean filters→DataFusion, sequence→SQL)
- Architecture diagram shows dual-backend query flow
- Implementation phases reordered: Substrait compiler first, then crate
- New open questions on sequence() coverage and version pinning
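
The cost-based routing rule above (boolean filters→DataFusion, sequence→SQL) can be sketched as a plan walk. `ChessQlExpr`, `Backend`, and `route` are hypothetical illustrative types, not the actual compiler API, and the real QueryRouter operates on Substrait plan bytes rather than the AST:

```rust
// Hypothetical AST fragment; the real ChessQL AST is richer.
#[derive(Debug)]
enum ChessQlExpr {
    Motif(String),                           // boolean motif filter, e.g. fork = true
    And(Box<ChessQlExpr>, Box<ChessQlExpr>), // conjunction of filters
    Sequence(Vec<ChessQlExpr>),              // ordered-move pattern
}

#[derive(Debug, PartialEq)]
enum Backend {
    DataFusion, // vectorized scans over Parquet
    Sql,        // mature sequence() support
}

/// Heuristic from the design: a sequence() anywhere in the plan
/// forces SQL; pure boolean filters go to DataFusion.
fn route(expr: &ChessQlExpr) -> Backend {
    fn has_sequence(e: &ChessQlExpr) -> bool {
        match e {
            ChessQlExpr::Sequence(_) => true,
            ChessQlExpr::And(l, r) => has_sequence(l) || has_sequence(r),
            ChessQlExpr::Motif(_) => false,
        }
    }
    if has_sequence(expr) { Backend::Sql } else { Backend::DataFusion }
}

fn main() {
    let boolean_only = ChessQlExpr::And(
        Box::new(ChessQlExpr::Motif("fork".into())),
        Box::new(ChessQlExpr::Motif("pin".into())),
    );
    assert_eq!(route(&boolean_only), Backend::DataFusion);

    let with_seq = ChessQlExpr::Sequence(vec![ChessQlExpr::Motif("sacrifice".into())]);
    assert_eq!(route(&with_seq), Backend::Sql);
}
```

The feature flag mentioned above would simply bypass `route` and pin one backend; shadow mode would ignore its answer and run both.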

https://claude.ai/code/session_011dyxSbaXZV93zBZL5SMbun
Detail how the indexer transitions from per-game SQL INSERTs to Parquet:

- Current IndexWorker writes 1 game at a time; Parquet files are immutable
- Buffered writer in motif_query: accumulate rows, flush at threshold
  (5000 rows) or interval (60s), with explicit /v1/flush endpoint
- Java IndexWorker batches per month (10-100 games) before POST
- Compaction merges small files: timer-based, file-count triggered,
  lock-file coordination with writer, target 5-25 MB steady-state
- Dual-write during migration: SQL first (authoritative), Parquet second
- File size analysis: small batches → buffer flush → compaction targets
- Tradeoff analysis: why not append, why not Delta Lake/Iceberg (yet)
- Updated open questions: compaction concurrency, read-your-writes
- Removed answered questions (buffering strategy, compaction design)
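
The two flush triggers (5,000 rows or 60 seconds) plus the explicit /v1/flush endpoint reduce to a small state holder. `RowBuffer` and its fields are illustrative, not the actual motif_query types; the Arrow row accumulation and Parquet file write are elided:

```rust
use std::time::{Duration, Instant};

// Thresholds from the design above (later revised upward).
const ROW_THRESHOLD: usize = 5_000;
const FLUSH_INTERVAL: Duration = Duration::from_secs(60);

/// Decides when a Parquet flush is due for one partition's buffer.
struct RowBuffer {
    rows: usize,
    last_flush: Instant,
}

impl RowBuffer {
    fn new() -> Self {
        Self { rows: 0, last_flush: Instant::now() }
    }

    fn push(&mut self, batch: usize) {
        self.rows += batch;
    }

    /// True when either trigger fires: row count or elapsed time.
    fn should_flush(&self) -> bool {
        self.rows >= ROW_THRESHOLD || self.last_flush.elapsed() >= FLUSH_INTERVAL
    }

    /// Also invoked directly by the explicit /v1/flush endpoint.
    fn flush(&mut self) -> usize {
        let flushed = self.rows;
        self.rows = 0;
        self.last_flush = Instant::now();
        flushed
    }
}

fn main() {
    let mut buf = RowBuffer::new();
    buf.push(4_999);
    assert!(!buf.should_flush());
    buf.push(1); // hits the 5,000-row threshold
    assert!(buf.should_flush());
    assert_eq!(buf.flush(), 5_000);
    assert!(!buf.should_flush());
}
```

Note this whole mechanism is removed in a later commit in this PR, once the design moves to periodic SQL-to-Parquet export.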

https://claude.ai/code/session_011dyxSbaXZV93zBZL5SMbun
Rework the Parquet write section with concrete throughput assumptions:

- 10K-100K games/month total (~300-3,300/day), 10-100 per player-month
- Buffer threshold raised to 10K rows (most partitions flush 1-5 times)
- Time-based flush at 5 min (safety net, not primary trigger)
- Compaction downgraded to low priority: 30-min interval, only merges
  tail files < 100 KB, skip partitions with recent writes
- Compaction scaling table: when to consider Delta Lake (500K+/month)
- File size targets: 500 KB - 2.5 MB for Chess.com, 20-80 MB for Lichess
- Storage estimates updated for 50K games/month Chess.com baseline
- Write amplification analysis for single-file-per-partition approach
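
The file-size targets are a direct product of monthly volume and compressed bytes per game row. The bytes-per-row figure below is not stated in the doc; 10-50 compressed bytes per row is the value implied by the 500 KB - 2.5 MB target at the 50K games/month Chess.com baseline:

```rust
/// Back-of-envelope check that the stated file-size targets follow
/// from the monthly volumes. One Parquet file per partition per month.
fn monthly_file_bytes(games_per_month: u64, bytes_per_row: u64) -> u64 {
    games_per_month * bytes_per_row
}

fn main() {
    // Chess.com baseline: 50K games/month at 10-50 bytes/row
    // brackets the 500 KB - 2.5 MB target.
    assert_eq!(monthly_file_bytes(50_000, 10), 500_000);   // 500 KB
    assert_eq!(monthly_file_bytes(50_000, 50), 2_500_000); // 2.5 MB
}
```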

https://claude.ai/code/session_011dyxSbaXZV93zBZL5SMbun
Major simplification of the write strategy. Instead of real-time Parquet
writes with buffering/compaction/dual-write, the indexer writes to SQL
as it does today (unchanged), and a periodic batch job exports SQL data
to Parquet for analytical queries.

Architecture changes:
- IndexWorker: no changes at all — continues writing to SQL
- ParquetExportJob: weekly/monthly cron, SELECT → write one Parquet file
  per partition — no buffering, no compaction, no small-file problem
- Lichess ingest: direct to Parquet (batch job, never touches SQL)
- game_storage_backends metadata table: tracks which backend has each
  partition's data ('sql', 'parquet', 'both')
- StorageAwareQueryRouter: checks metadata, dispatches to SQL or
  DataFusion, with time-based shortcut (current month → SQL)

Eliminated: in-memory buffer, flush thresholds, compaction.rs, /v1/flush
endpoint, dual-write logic, crash recovery, read-your-writes problem.

motif_query Rust service simplified to query engine + batch file writer.
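
The StorageAwareQueryRouter's dispatch rule can be sketched as follows. The names are hypothetical (the real router would read game_storage_backends from SQL; here the metadata lookup is a closure), and the partition key is reduced to (year, month) for brevity:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Backend { Sql, DataFusion }

/// Which store holds a partition's data, per game_storage_backends.
#[derive(Debug, Clone, Copy)]
enum Storage { Sql, Parquet, Both }

/// (year, month) partition key; the real key also includes player.
type Partition = (u32, u32);

fn route(
    partition: Partition,
    current_month: Partition,
    lookup: impl Fn(Partition) -> Option<Storage>,
) -> Backend {
    // Time-based shortcut: the current month is never exported yet.
    if partition == current_month {
        return Backend::Sql;
    }
    match lookup(partition) {
        Some(Storage::Parquet | Storage::Both) => Backend::DataFusion,
        // No metadata row, or SQL-only: use the authoritative store.
        _ => Backend::Sql,
    }
}

fn main() {
    let meta = |p: Partition| match p {
        (2025, 12) => Some(Storage::Both),
        (2026, 1) => Some(Storage::Sql),
        _ => None,
    };
    let now = (2026, 2);
    assert_eq!(route((2026, 2), now, &meta), Backend::Sql); // current month
    assert_eq!(route((2025, 12), now, &meta), Backend::DataFusion); // exported
    assert_eq!(route((2026, 1), now, &meta), Backend::Sql); // not yet exported
}
```

Defaulting unknown partitions to SQL keeps the router safe while the metadata table is being backfilled.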

https://claude.ai/code/session_011dyxSbaXZV93zBZL5SMbun
Detail how adding new motifs / fixing detectors works with the SQL-first
+ periodic Parquet export architecture:

- Schema evolution: ALTER TABLE in SQL, schema-on-read in Parquet (old
  files return NULL for new columns, treated as FALSE)
- Re-analysis pipeline: read PGN from SQL, re-run detectors, UPDATE
  motif columns in place, mark partition as parquet_stale
- Re-export: export job detects stale partitions, overwrites Parquet
- Lichess re-analysis: three options (re-ingest from dump, store PGN in
  SQL, or separate PGN Parquet table) — recommend re-ingest for now
- game_storage_backends gains parquet_stale and last_reanalyzed_at
  columns for staleness tracking
- Query router falls back to SQL for stale partitions during the window
  between re-analysis and re-export
- Timeline diagram showing the re-analysis → re-export flow
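
The schema-on-read rule (NULL for a new motif column treated as FALSE) is the one piece of this that lives on the query side. A minimal sketch, using a `HashMap` as a toy stand-in for an Arrow record-batch row; `motif_value` is an illustrative name, not an actual motif_query function:

```rust
use std::collections::HashMap;

/// Schema-on-read for motif columns: Parquet files written before a
/// motif existed lack the column entirely, so both NULL values and
/// absent columns read as false.
fn motif_value(row: &HashMap<&str, Option<bool>>, motif: &str) -> bool {
    // None covers NULL values and columns missing from old files.
    row.get(motif).copied().flatten().unwrap_or(false)
}

fn main() {
    let mut row: HashMap<&str, Option<bool>> = HashMap::new();
    row.insert("fork", Some(true));
    row.insert("greek_gift", None); // NULL read from a pre-Phase-9 file

    assert!(motif_value(&row, "fork"));
    assert!(!motif_value(&row, "greek_gift")); // NULL -> false
    assert!(!motif_value(&row, "windmill"));   // column absent -> false
}
```

Re-analysis then tightens this over time: once a partition is re-exported, its files carry real values and the NULL path stops firing for that motif.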

https://claude.ai/code/session_011dyxSbaXZV93zBZL5SMbun
@aaylward aaylward merged commit 3949b97 into main Feb 26, 2026
12 checks passed
@aaylward aaylward deleted the claude/datafusion-parquet-plan-tN00g branch February 26, 2026 18:14