Add DataFusion + Parquet query engine design doc for one_d4#1068
Merged
Conversation
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful | 1d4-web | 7920e3b | Commit Preview URL · Branch Preview URL | Feb 26 2026, 05:05 AM |
Revise the DataFusion/Parquet design doc based on the planned chariot
integration (issue #1049). Key changes:
- Motif detection stays in Java — no Rust port of detectors
- Lichess bulk ingest is a Java CLI jar reusing one_d4 detectors
- motif_query (Rust) narrows to query engine + Parquet writer only
- Parquet schema includes 7 new Phase 9 motifs
- Architecture diagram updated to show Java-centric data flow
- New open questions on Phase 9 ordering and Java Parquet writer
https://claude.ai/code/session_011dyxSbaXZV93zBZL5SMbun
Replace the direct ChessQL→SQL compilation model with a Substrait-based
pipeline: ChessQL→Substrait Plan→{SQL, DataFusion}. Key design changes:
- SubstraitCompiler produces Substrait protobuf plans from ChessQL AST
- QueryRouter dispatches plans to SQL (via substrait-java) or DataFusion
(via datafusion-substrait) based on feature flag or cost routing
- motif_query Rust service accepts Substrait plan bytes, not SQL strings
- Shadow mode runs both backends in parallel for migration validation
- Optional cost-based routing (boolean filters→DataFusion, sequence→SQL)
- Architecture diagram shows dual-backend query flow
- Implementation phases reordered: Substrait compiler first, then crate
- New open questions on sequence() coverage and version pinning
https://claude.ai/code/session_011dyxSbaXZV93zBZL5SMbun
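The optional cost-based routing rule above (boolean filters → DataFusion, sequence() → SQL, gated by a feature flag) can be sketched as follows. This is a minimal illustration, not the doc's implementation: the class and parameter names are invented, and a real router would inspect the Substrait plan itself rather than a flag precomputed by the ChessQL compiler.

```java
// Hypothetical sketch of the dual-backend routing decision.
public class QueryRouter {
    public enum Backend { SQL, DATAFUSION }

    // usesSequence: whether the ChessQL query contains a sequence() clause
    // (assumed to be derivable from the Substrait plan).
    // dataFusionEnabled: the feature flag mentioned in the commit message.
    public static Backend route(boolean usesSequence, boolean dataFusionEnabled) {
        if (!dataFusionEnabled) {
            return Backend.SQL; // flag off: everything stays on the legacy SQL path
        }
        // Cost-based rule: sequence() queries go to SQL,
        // plain boolean-filter queries go to DataFusion.
        return usesSequence ? Backend.SQL : Backend.DATAFUSION;
    }
}
```

In shadow mode, the caller would invoke both backends regardless of this decision and compare results; the routing result only determines which answer is served.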
Detail how the indexer transitions from per-game SQL INSERTs to Parquet:
- Current IndexWorker writes 1 game at a time; Parquet files are immutable
- Buffered writer in motif_query: accumulate rows, flush at threshold
  (5000 rows) or interval (60s), with explicit /v1/flush endpoint
- Java IndexWorker batches per month (10-100 games) before POST
- Compaction merges small files: timer-based, file-count triggered,
  lock-file coordination with writer, target 5-25 MB steady-state
- Dual-write during migration: SQL first (authoritative), Parquet second
- File size analysis: small batches → buffer flush → compaction targets
- Tradeoff analysis: why not append, why not Delta Lake/Iceberg (yet)
- Updated open questions: compaction concurrency, read-your-writes
- Removed answered questions (buffering strategy, compaction design)
https://claude.ai/code/session_011dyxSbaXZV93zBZL5SMbun
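The flush trigger from this revision (threshold of 5000 rows or a 60-second interval, whichever fires first) reduces to a two-condition check. A minimal sketch; the class and method names are illustrative, not from the doc:

```java
// Buffered-writer flush decision as described in this revision.
public class FlushPolicy {
    static final int ROW_THRESHOLD = 5_000;     // flush when this many rows are buffered
    static final long INTERVAL_SECONDS = 60;    // or when this much time has passed

    // The explicit /v1/flush endpoint would bypass both checks
    // and flush unconditionally.
    public static boolean shouldFlush(int bufferedRows, long secondsSinceLastFlush) {
        return bufferedRows >= ROW_THRESHOLD
            || secondsSinceLastFlush >= INTERVAL_SECONDS;
    }
}
```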
Rework the Parquet write section with concrete throughput assumptions:
- 10K-100K games/month total (~300-3,300/day), 10-100 per player-month
- Buffer threshold raised to 10K rows (most partitions flush 1-5 times)
- Time-based flush at 5 min (safety net, not primary trigger)
- Compaction downgraded to low priority: 30-min interval, only merges
  tail files < 100 KB, skip partitions with recent writes
- Compaction scaling table: when to consider Delta Lake (500K+/month)
- File size targets: 500 KB - 2.5 MB for Chess.com, 20-80 MB for Lichess
- Storage estimates updated for 50K games/month Chess.com baseline
- Write amplification analysis for single-file-per-partition approach
https://claude.ai/code/session_011dyxSbaXZV93zBZL5SMbun
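The flush-count claim above follows from simple arithmetic: a partition that accumulates N rows in a month flushes ceil(N / threshold) times, so for example 50K rows at the 10K-row threshold means 5 flushes. A sketch of that calculation, assuming one Parquet row per indexed game (an assumption, not stated in the commit):

```java
// Worked arithmetic behind the "flush 1-5 times" estimate.
public class FlushEstimate {
    // ceil(monthlyRows / threshold) without floating point.
    public static long flushes(long monthlyRows, long threshold) {
        return (monthlyRows + threshold - 1) / threshold;
    }
}
```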
Major simplification of the write strategy. Instead of real-time Parquet
writes with buffering/compaction/dual-write, the indexer writes to SQL
as it does today (unchanged), and a periodic batch job exports SQL data
to Parquet for analytical queries.
Architecture changes:
- IndexWorker: no changes at all — continues writing to SQL
- ParquetExportJob: weekly/monthly cron, SELECT → write one Parquet file
per partition — no buffering, no compaction, no small-file problem
- Lichess ingest: direct to Parquet (batch job, never touches SQL)
- game_storage_backends metadata table: tracks which backend has each
partition's data ('sql', 'parquet', 'both')
- StorageAwareQueryRouter: checks metadata, dispatches to SQL or
DataFusion, with time-based shortcut (current month → SQL)
Eliminated: in-memory buffer, flush thresholds, compaction.rs, /v1/flush
endpoint, dual-write logic, crash recovery, read-your-writes problem.
motif_query Rust service simplified to query engine + batch file writer.
https://claude.ai/code/session_011dyxSbaXZV93zBZL5SMbun
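The StorageAwareQueryRouter dispatch described above (metadata lookup plus the current-month shortcut) can be sketched as follows. Names are illustrative: partition keys are assumed to be "YYYY-MM" strings, and the backend values mirror the game_storage_backends states ('sql', 'parquet', 'both').

```java
import java.util.Map;

// Sketch of the metadata-driven dispatch between SQL and DataFusion.
public class StorageAwareQueryRouter {
    public enum Backend { SQL, DATAFUSION }

    private final Map<String, String> partitionBackends; // from game_storage_backends
    private final String currentMonth;                   // e.g. "2026-02"

    public StorageAwareQueryRouter(Map<String, String> partitionBackends,
                                   String currentMonth) {
        this.partitionBackends = partitionBackends;
        this.currentMonth = currentMonth;
    }

    public Backend route(String partition) {
        // Time-based shortcut: the current month has not been exported yet,
        // so it is always served from SQL.
        if (partition.equals(currentMonth)) {
            return Backend.SQL;
        }
        // 'parquet' or 'both' means DataFusion can serve the partition;
        // anything else (including unknown partitions) falls back to SQL.
        String backend = partitionBackends.getOrDefault(partition, "sql");
        return backend.equals("sql") ? Backend.SQL : Backend.DATAFUSION;
    }
}
```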
Detail how adding new motifs / fixing detectors works with the SQL-first
+ periodic Parquet export architecture:
- Schema evolution: ALTER TABLE in SQL, schema-on-read in Parquet (old
  files return NULL for new columns, treated as FALSE)
- Re-analysis pipeline: read PGN from SQL, re-run detectors, UPDATE motif
  columns in place, mark partition as parquet_stale
- Re-export: export job detects stale partitions, overwrites Parquet
- Lichess re-analysis: three options (re-ingest from dump, store PGN in
  SQL, or separate PGN Parquet table) — recommend re-ingest for now
- game_storage_backends gains parquet_stale and last_reanalyzed_at
  columns for staleness tracking
- Query router falls back to SQL for stale partitions during the window
  between re-analysis and re-export
- Timeline diagram showing the re-analysis → re-export flow
https://claude.ai/code/session_011dyxSbaXZV93zBZL5SMbun
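The schema-on-read rule above boils down to a three-valued-to-boolean mapping: a motif column absent from an old Parquet file reads back as NULL, and NULL is treated the same as an explicit false. A one-line sketch of that rule; the class and method names are hypothetical:

```java
// Schema-on-read semantics for motif columns in old Parquet files.
public class MotifRead {
    // NULL (column missing from an old file) and explicit false both mean
    // "motif not detected"; only an explicit true counts as a hit.
    public static boolean motifPresent(Boolean columnValue) {
        return Boolean.TRUE.equals(columnValue);
    }
}
```

This is why adding a Phase 9 motif column needs no rewrite of existing Parquet files: queries over old partitions simply see the new motif as absent until re-analysis and re-export catch up.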