Eagerly materialize row cells during sort buffering #23

Open
philcunliffe wants to merge 7 commits into master from perf/eager-sort-materialization

@philcunliffe
Contributor

Summary

  • Resolves all cell values when rows are buffered for ORDER BY, replacing AsyncRow closures with plain value-returning functions
  • The original closures (which capture decompressed parquet row group data) become GC-eligible immediately
  • For tables with large text columns (~10KB/row), reduces per-row buffer cost from ~10KB to ~100B
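
A minimal sketch of the idea in the summary above, not the code in this PR. It assumes a hypothetical `AsyncRow` shape of `{ columns, cells }` where each cell is an async accessor; `materializeRow` and `bufferAndSort` are illustrative names, and `resolved` follows the payload name used in the commit messages. It shows only the shape of the change (source closures swapped for functions that capture a single value) and does not measure memory.

```javascript
// Assumed (hypothetical) row shape:
//   { columns: string[], cells: { [name]: () => Promise<value> } }

/**
 * Resolve every cell once and return a row whose cell functions close over
 * only the resolved values, not the original source data.
 */
async function materializeRow(row) {
  const values = await Promise.all(row.columns.map(name => row.cells[name]()))
  const resolved = {}
  const cells = {}
  row.columns.forEach((name, i) => {
    const value = values[i]
    resolved[name] = value
    // Plain value-returning function: captures one value and nothing else
    cells[name] = () => value
  })
  return { columns: row.columns, cells, resolved }
}

/** Buffer rows for a single-column ascending sort, materializing on arrival. */
async function bufferAndSort(rows, orderBy) {
  const buffer = []
  for await (const row of rows) {
    // After this push nothing in the buffer references the original closures
    buffer.push(await materializeRow(row))
  }
  buffer.sort((a, b) => {
    const x = a.resolved[orderBy]
    const y = b.resolved[orderBy]
    return x < y ? -1 : x > y ? 1 : 0
  })
  return buffer
}
```

Because the buffered rows no longer reference the source accessors, whatever those accessors captured can be collected as soon as the scan moves on.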

Test plan

  • All 1322 existing tests pass
  • Expensive cell access counts updated (sort now resolves cells once during buffering instead of lazily)
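
To illustrate why access counts change under eager resolution, here is a rough sketch rather than the project's actual test code. `withAccessCounter`, `materialize`, and `demo` are hypothetical names, and the row shape (`columns` plus async `cells`) is an assumption. Once rows are materialized during buffering, each source cell is read exactly once, and later reads never reach the counted accessor.

```javascript
// Hypothetical helper: wrap a row so every cell access bumps a shared counter.
function withAccessCounter(row, counter) {
  const cells = {}
  for (const name of row.columns) {
    cells[name] = () => {
      counter.count++
      return row.cells[name]()
    }
  }
  return { columns: row.columns, cells }
}

// Eager materialization as described in the summary: resolve each cell once.
async function materialize(row) {
  const cells = {}
  for (const name of row.columns) {
    const value = await row.cells[name]()
    cells[name] = () => value
  }
  return { columns: row.columns, cells }
}

async function demo() {
  const counter = { count: 0 }
  const source = [
    { columns: ['id', 'body'], cells: { id: async () => 2, body: async () => 'b' } },
    { columns: ['id', 'body'], cells: { id: async () => 1, body: async () => 'a' } },
  ]
  const buffered = []
  for (const row of source) {
    buffered.push(await materialize(withAccessCounter(row, counter)))
  }
  const afterBuffering = counter.count // 2 rows x 2 columns
  // Later reads hit the materialized values, not the counted accessors
  buffered.sort((a, b) => a.cells.id() - b.cells.id())
  buffered.forEach(row => row.cells.body())
  return { afterBuffering, afterReads: counter.count }
}
```

In this sketch both counts equal rows times columns, which is the kind of fixed number a test can assert on.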

philcunliffe and others added 5 commits April 9, 2026 16:45
Add multi-level caching and reduce per-row overhead:

- parseSql: LRU cache (64 entries) avoids re-tokenizing/parsing same SQL strings
- planSql: WeakMap cache on parsed ASTs avoids re-planning identical queries
- asyncRow: attach _data field for zero-copy collection
- collect: sync fast-path skips Promise.all when all rows have pre-materialized _data
- executeProject: pre-compute static column names, fast-path for simple identifier
  projections with direct cell passthrough and _data propagation
- executeSql: skip table normalization when no array tables are present
- compareForTerm: use module-level Set instead of per-call array allocation
- memorySource: hoist column computation outside scan loop, use Set for validation
- Add _data to AsyncRow type definition
- Cast to DerivedColumn/IdentifierNode where type narrowing is needed
- Type _data as Record<string, SqlPrimitive>
- Fix JSDoc placement for compareForTerm
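
The `collect` fast path in the list above could look roughly like the sketch below. This is an illustration under assumptions, not the committed code: it takes an already-buffered array of rows (the real function may consume an async iterable), and the row shape is hypothetical apart from the `_data` field named in the commit message.

```javascript
// Assumed (hypothetical) row shape:
//   { columns: string[], cells: { [name]: () => Promise<value> }, _data?: object }

/**
 * Collect rows into plain objects.
 * Returns an array synchronously when every row has `_data`,
 * otherwise a Promise of an array.
 */
function collect(rows) {
  // Fast path: all rows are pre-materialized, so hand back `_data` as-is
  // (no copy, no Promise.all)
  if (rows.every(row => row._data !== undefined)) {
    return rows.map(row => row._data)
  }
  // Slow path: resolve each row's cells concurrently
  return Promise.all(rows.map(async row => {
    const entries = await Promise.all(
      row.columns.map(async name => [name, await row.cells[name]()])
    )
    return Object.fromEntries(entries)
  }))
}
```

A caller can write `await collect(rows)` in both cases, since awaiting a plain array yields the array itself.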
Adapt optimizations to the new QueryResults return type:
- executeSql: keep table normalization skip, use new inline plan+execute
- executeProject: move pre-computation outside rows(), keep identifier
  fast-path and static column names inside the rows() generator
- Add _data to AsyncRow type definition
- Fix JSDoc placement and type casts for tsc
Drop the parseSql/planSql memoization caches added in 881a031. Also
rename the pre-materialized row payload from `_data` to `resolved` for
clarity, and delete stale scratch files (query-parquet.mjs, repro-525.mjs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves all cell values when rows are buffered for ORDER BY, replacing
AsyncRow closures (which capture decompressed parquet row group data)
with plain value-returning functions. The original closures become
GC-eligible immediately.

For tables with large text columns (~10KB/row), this reduces per-row
buffer cost from ~10KB (closure over parquet data) to ~100B (plain value).
…erialization

# Conflicts:
#	src/execute/execute.js
#	src/execute/sort.js
@philcunliffe philcunliffe marked this pull request as ready for review April 13, 2026 19:39
@platypii platypii force-pushed the perf/eager-sort-materialization branch from 482b0c0 to df7039d Compare April 13, 2026 20:32

2 participants