Chore/fmt typecheck repl#4
Closed
Felipe705x wants to merge 4 commits into
Closed
Conversation
Single source of truth for the compile pipeline: parse → elaborate
→ typecheck → optimize. compile_query becomes a thin wrapper that
flattens diagnostics into a single Err(String); compile_query_unchecked
and compile/compile_unchecked are unchanged.
CompileError tags failures by phase (Parse vs Type) so callers can
distinguish the two without inspecting strings. CompileResult preserves
typechecker warnings (currently dropped by compile_query) so
non-blocking diagnostics — typo'd attributes, non-boolean filters —
can reach the user.
The legacy compile_query string format ('Parse error: …', 'Type
error: …') is preserved verbatim by the wrapper so existing callers
(Python bindings, tests, examples) see no behavior change.
Locks the lib-level contract the REPL depends on: - parse failures → CompileError::Parse - typecheck failures → CompileError::Type with non-empty error vec - successful compile with non-bool filter → warnings vector populated - successful clean compile → empty warnings - legacy compile_query string-error format unchanged - compile_query_unchecked still ignores typechecker These also serve as a regression net for the lib pipeline itself — any future phase change (constant folding, etc.) lands in one place and these tests exercise it for both the structured and the legacy entry points.
Make the typechecker visible and toggleable in the REPL:
- --no-typecheck CLI flag (anywhere in argv) skips typechecking for
the session. Default: on. Flag is stripped before the existing
--import-csv / --import-json positional handling so they continue
to work unchanged, including in combination.
- Startup banner adds 'Typechecker: on' or 'off' so the user knows.
- Compile path uses compile_query_with_diagnostics when typecheck is
on and compile_query_unchecked when off.
- Errors are now phase-labeled at the prompt:
'Parse error: …' from CompileError::Parse
'Type error: …' from CompileError::Type (joined)
Previously both surfaced as 'Parse error', which was misleading.
- Typechecker warnings (e.g. 'Filter expression has type int, which
is not a boolean'; typo'd attributes under closed schemas) print
one per line as 'warning: …' before the query runs. Warnings don't
block — the query still executes.
- Usage text mentions the new flag.
Manual smoke against examples/movies.gdb confirms:
- 'MATCH (m:Movie) RETURN m.title' → 38 rows, no diagnostics
- 'MATCH (m:Movie) WHERE x.foo = 1 RETURN m.title' → Type error: x not found
- '(x WHERE 1)' → warning: not-a-boolean, then 0 rows
- Same queries under --no-typecheck → all run, no errors or warnings
Run cargo fmt over the files added on the typing/checker line of work that postdated the fmt-CI PR (#2). Those files weren't in the tree when the CI fmt run produced its baseline, so they remained unformatted: examples/typecheck_demo.rs, examples/typecheck_repl_smoke.rs, src/lib.rs (new diagnostics types), src/typing/{checker, path_type, type_environment, mod}.rs, and the new tests (compile_diagnostics, typecheck_smoke, typecheck_test). No semantic changes. All 198 tests still pass after fmt.
Felipe705x
added a commit
that referenced
this pull request
Apr 29, 2026
Now that the LDBC loader exposes `id` as a queryable property (parent commit), the IC2 query can anchor by `Person.id` directly — same as the spec. Changes: - `WHERE p.firstName = ... AND p.lastName = ...` → `WHERE p.id = ...` - `RETURN ...firstName, ...lastName, c.content, c.creationDate` → `RETURN friend.id, ...firstName, ...lastName, c.id, c.content, c.creationDate` (matches the spec's RETURN modulo `coalesce`). - `PARAMS` is now five real `Person.id` values from SF0.1 instead of `(firstName, lastName)` workaround pairs. Doc reflects the upgraded fidelity: replaces the old "divergences fixed" section with a 9-row honest divergence table covering every surface-syntax-or-semantic difference. Net divergences from the spec are reduced to two with real semantic effect: - **#6 (`coalesce`)**: gqlite has no `coalesce` builtin. Posts with empty `content` but non-empty `imageFile` return blank rather than the imageFile path. Few rows in SF0.1. - **#7 (no ORDER BY)**: gqlite parser doesn't have ORDER BY yet. Affects *which* 20 rows you see under LIMIT, not wall time. All other table rows (#1, #2, #3, #4, #5, #8, #9) are surface- syntax or naming differences with no semantic divergence — same query, just spelled to fit gqlite's grammar / data conventions. ### Result snapshot (SF0.1, single-iter) | Person.id | resolves to | rows | wall time | |-----------------|----------------|------|-----------| | 933 | Mahinda Perera | 20 | 108.5 s | | 1129 | Carmen Lepland | 20 | 100.6 s | | 8796093023296 | Hồ Chí Loan | 20 | 97.4 s | | 21990232555524 | Bryn Davies | 20 | 99.2 s | | 32985348833865 | Cheng Yu | 20 | 107.5 s | Wall times are similar to the previous (less faithful) bench. The optimizer gap (no value-equality predicate pushdown) still dominates; documented in "Things this surfaces". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Felipe705x
added a commit
that referenced
this pull request
Apr 29, 2026
* loader(ldbc): expose `id` as a queryable property on every node
Previously `load_ldbc_node_file` consumed the LDBC `id` column for the
internal node name only, excluding it from the property map. That made
LDBC's spec queries un-expressible: every IC1-IC14 query in LDBC SNB
Interactive parameterizes by `Person.id`, `Comment.id`, etc. via
`{id: $param}` shorthand or `WHERE x.id = $param`. Without `id` as a
property, those queries return NULL and the bench has to use surrogate
anchors (firstName, lastName) — a real divergence from the spec.
The fix: after the existing prop_cols loop, insert
`props["id"] = parse_ldbc_value(&id_str)`. The id is also still folded
into the internal node name (`"Person:933"`) for cross-file edge
resolution; the duplication is ~12 bytes per node (~4 MB extra on
SF0.1), required for spec-faithful queries.
Why LDBC-only and not the spanner-config CSV loader: that loader
follows different conventions (`vid`, `<Label>_id`, etc., per
CLAUDE.md) and isn't tied to a workload that requires id-as-property.
LDBC is the immediate driver; the spanner loader can argue for the
same treatment separately if needed.
Verified end-to-end on SF0.1:
$ MATCH (p: Person) WHERE p.id = 933 RETURN p.firstName, p.lastName
"Mahinda" | "Perera"
1 rows (0.026s)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench: add LDBC SNB Interactive IC2 benchmark
First slice of the LDBC Social Network Benchmark v1 (arXiv:2001.02299
§6, Interactive Complex reads) — IC2 only ("recent messages by
friends"). Other ICs need features gqlite doesn't have: shortest /
variable-length paths (IC1, IC3, IC11, IC12, IC14), date arithmetic /
OPTIONAL MATCH / ORDER BY / aggregation-with-HAVING (IC4-IC8, IC10,
IC13).
`src/bin/ldbc_bench.rs` runs IC2 against an SF0.1 .gdb, sweeping a
small curated set of `Person.firstName` values (LDBC's `Person.id` is
folded into the gqlite node name, not addressable as a property).
Reports per-param min / median / mean / max wall time. Typechecker
is off — IC queries are valid by construction; comparing typecheck
overhead vs runtime is a separate bench.
`bench/LDBC_BENCHMARK.md` documents setup (download, decompress,
import) and includes baseline numbers from the first run.
### Result snapshot (SF0.1, 327k nodes / 1.48M edges, limit=20)
| Param | min | med | mean | max |
|----------|-------|-------|-------|--------|
| Mahinda | 52 s | 56 s | 55 s | 57 s |
| Carmen | 38 s | 50 s | 65 s | 106 s |
| Bryn | 84 s | 85 s | 86 s | 88 s |
| Cheng | 87 s | 95 s | 94 s | 99 s |
These are 50-130x slower than LDBC's sub-second target. The benchmark
surfaces the cause: the optimizer's predicate pushdown handles
`x.attr is type` (per `docs/implemented-optimizations.md`) but not
value-equality predicates like `WHERE p.firstName = 'Mahinda'`. So
the firstName filter applies *after* the join enumerates every
`Person~knows~Person~hasCreator~Comment` triple — work scales with
the cartesian intermediate, not the parameter's selectivity.
Documented as a finding for a future optimizer pass — pushing `=`
constants into descriptors would collapse this into a 1-2-row anchor
on Person, dropping the join's left side from ~28k to a handful.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): close 4 IC2 fidelity divergences vs LDBC spec
The earlier bench had five divergences from LDBC IC2; four are now
fixed without touching gqlite implementation. Only ORDER BY remains
(parser feature gap).
Fixed:
- **Anchor selectivity.** `Person.firstName` → `(firstName, lastName)`.
Five params each map to *exactly one* Person in SF0.1 — same per-
query selectivity as the spec's `Person.id`, just spelled
differently because the gqlite LDBC loader folds `id` into the
internal node name rather than a property.
- **Message = Comment ∪ Post.** LDBC's `(message:Message)` covers
both via Cypher label inheritance, which gqlite doesn't have. Use
path-pattern union: `<-[:hasCreator]-(c:Comment) | <-[:hasCreator]-
(c:Post)`. `c` binds to whichever arm matches.
- **`message.creationDate <= $maxDate` filter.** Was dropped
entirely. Now in the WHERE clause; `maxDate` is a const
(1340000000000 ms ≈ mid-2012) chosen so the filter retains a
sizable fraction of SF0.1 data while still exercising.
- **RETURN columns.** Added `friend.firstName`, `friend.lastName`,
`c.content` alongside the existing `c.creationDate`. The two `id`
columns the spec wants (`friend.id`, `message.id`) remain dropped
because they're not addressable as properties; documented.
Remaining divergence (out of scope without parser changes):
- **No `ORDER BY ... DESC`** — gqlite's parser doesn't have it.
Affects which 20 rows you get under `LIMIT 20`, not wall time.
Updated `bench/LDBC_BENCHMARK.md` to spell out the spec query, the
gqlite query, and the divergence list explicitly.
### Result snapshot (SF0.1, single-iter)
| Param | rows | wall time |
|-------------------|------|-----------|
| Mahinda Perera | 20 | 96.4 s |
| Carmen Lepland | 20 | 108.1 s |
| Bryn Davies | 20 | 85.0 s |
| Cheng Yu | 20 | 105.6 s |
| Hồ Chí Loan | 20 | 110.7 s |
Wall times are slightly higher than the previous (less faithful)
bench because the union arm doubles the right side of the join
(~286k hasCreator edges vs ~151k Comment-only). The dominant cost
remains the optimizer's missing predicate pushdown on value-equality
WHERE; documented in "Things this surfaces".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): use Person.id anchor (full IC2 fidelity)
Now that the LDBC loader exposes `id` as a queryable property
(parent commit), the IC2 query can anchor by `Person.id` directly
— same as the spec.
Changes:
- `WHERE p.firstName = ... AND p.lastName = ...` → `WHERE p.id = ...`
- `RETURN ...firstName, ...lastName, c.content, c.creationDate`
→ `RETURN friend.id, ...firstName, ...lastName, c.id, c.content,
c.creationDate` (matches the spec's RETURN modulo `coalesce`).
- `PARAMS` is now five real `Person.id` values from SF0.1 instead
of `(firstName, lastName)` workaround pairs.
Doc reflects the upgraded fidelity: replaces the old "divergences
fixed" section with a 9-row honest divergence table covering every
surface-syntax-or-semantic difference. Net divergences from the
spec are reduced to two with real semantic effect:
- **#6 (`coalesce`)**: gqlite has no `coalesce` builtin. Posts
with empty `content` but non-empty `imageFile` return blank
rather than the imageFile path. Few rows in SF0.1.
- **#7 (no ORDER BY)**: gqlite parser doesn't have ORDER BY yet.
Affects *which* 20 rows you see under LIMIT, not wall time.
All other table rows (#1, #2, #3, #4, #5, #8, #9) are surface-
syntax or naming differences with no semantic divergence — same
query, just spelled to fit gqlite's grammar / data conventions.
### Result snapshot (SF0.1, single-iter)
| Person.id | resolves to | rows | wall time |
|-----------------|----------------|------|-----------|
| 933 | Mahinda Perera | 20 | 108.5 s |
| 1129 | Carmen Lepland | 20 | 100.6 s |
| 8796093023296 | Hồ Chí Loan | 20 | 97.4 s |
| 21990232555524 | Bryn Davies | 20 | 99.2 s |
| 32985348833865 | Cheng Yu | 20 | 107.5 s |
Wall times are similar to the previous (less faithful) bench. The
optimizer gap (no value-equality predicate pushdown) still
dominates; documented in "Things this surfaces".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): document parameter-generator divergence
Previous audit listed personId and maxDate substitutions as if they
were independent surface differences. They aren't — they're symptoms
of a single methodology divergence: we didn't run LDBC's
substitution_parameters generator (ldbc_snb_datagen) and instead
hand-picked params by inspection. LDBC's spec example values are
SF-specific by design (the id space is bit-packed per SF and event
timelines stretch with SF), so spec values from one SF are not
expected to work against another. The canonical workflow is to run
the generator against the chosen SF and consume its
interactive_<n>.txt output.
This doesn't affect the bench's timing claim — the join work is the
same for any valid Person.id anchor and any maxDate that doesn't cut
all rows — but it's a procedural deviation from LDBC methodology
that wasn't called out clearly. Adding it as row #10 in the audit
table, reframing #8 and #9 as derived from #10, and updating the
bench's rustdoc comment to match.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): use official LDBC interactive_2 substitution params
Closes the methodology divergence flagged in the previous commit.
LDBC ships substitution parameters as a separate archive
(substitution_parameters-sf0.1.tar.zst on datasets.ldbcouncil.org)
because our SF0.1 distribution was generated with
parametergenerator.parameters:false. Downloading that archive gives
us interactive_2_param.txt: 15 (personId, maxDate) pairs from
LDBC's official parameter generator, matching our SF0.1 build
because the generator is deterministic on the dataset seed
(verified — all 15 personIds resolve in person_0_0.csv).
Changes:
- src/bin/ldbc_bench.rs: PARAMS replaced with the 15 official
pairs verbatim. Each param now carries its own maxDate (the
generator's output is per-param), so the global MAX_DATE_MS
constant goes away. CSV stdout grows a `max_date` column.
Names alongside each ID are resolved from person_0_0.csv only
for human-readable output.
- bench/LDBC_BENCHMARK.md: audit table collapses #8/#9/#10 into
a single row #8 ("uses official LDBC params, no semantic
divergence"). Setup grows a §1b for the substitution-params
download. Summary bullets updated. Strict description is now
"IC2 with no coalesce and no ordering" — the parameter-source
caveat is gone.
The Results table is currently a re-run note pointing at the old
hand-picked numbers as a reference; the official-param run is in
progress (one param in: 101s, 20 rows — same order of magnitude
as the hand-picked-param run, as expected since the bench's
bottleneck is optimizer-bound rather than parameter-bound). A
follow-up commit will land the fresh numbers.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): document driver / reporting methodology divergence
Sweep of LDBC divergences found a bigger one I'd missed: we don't
use the LDBC SNB Driver at all. The driver defines the audited
measurement regime — workload at target throughput, mix of IC+IS
reads and IU updates, multi-minute warmup, randomized param order,
audit JSON output, throughput-at-SLA + p99 latency reporting.
We have none of that. The bench is a hand-rolled harness: direct
Runtime::run_query calls, IC2 only, sequential, no warmup, CSV
output, min/median/mean/max wall times.
The query and parameter set are still spec-faithful (rows #1-#5,
#8 in the audit); the rows returned would match an LDBC-conformant
run modulo #6 (no coalesce) and #7 (no ORDER BY). What's NOT
spec-faithful is the measurement regime itself — the numbers we
report are not directly comparable to throughput numbers in
published LDBC audit submissions.
Adding this as audit row #9 plus an expanded "Methodology /
reporting divergence" section listing the 9 specific differences
(harness, workload mix, concurrency, warmup, param order, headline
metric, output format, update streams, result validation). Strict
description updated: "IC2 with no coalesce, no ordering, and a
hand-rolled (non-driver) harness reporting wall-time stats instead
of throughput-at-SLA."
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): land Results table from official-param re-run
Full IC2 sweep against ldbc-sf0.1.gdb with the 15 official LDBC
substitution parameters. 1 iter per param, ~20 min total wall.
Stats across all 15 params:
- min 52.8 s
- med 79.7 s
- mean 83.0 s
- max 146.9 s
Same regime as the previous hand-picked-param run (97-108 s for 5
params), confirming the bottleneck is gqlite's optimizer (no
predicate pushdown for value equality on `=`, see "Things this
surfaces") rather than parameter selectivity. The 3× spread
between min and max reflects caller-degree variation plus
first-iter page-cache effects, not bench noise.
All 15 params returned 20 rows (the LIMIT). Far above LDBC's
sub-second interactive target — the optimizer gap is the unblocking
work for any future driver-mediated audit.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): note bench/impl scope split + no result validation
Two clarifications surfaced in review:
1. The optimizer pushdown gap that makes IC2 take ~100s is a real
finding but is NOT something this branch will fix. bench/ldbc is
bench-only — no changes to src/* implementation. Adding a
pointer to which files would need to change (pushdown.rs +
descriptor.rs + engine.rs) so a future impl PR has a starting
point.
2. We don't validate row contents against LDBC's expected_output
archives. Two divergences make any byte-for-byte check fail:
#7 (no ORDER BY) means our 20 rows aren't the 20 most recent,
and #6 (no coalesce) means imageFile-only Posts show blank.
Either alone invalidates a diff -q check. The bench asserts
row_count == LIMIT, not row contents.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): drop misleading min/med/mean/max labels at N=1
Sweep-review item: with --iters 1, the bench's report() printed
"min=X med=X mean=X max=X" with all four columns showing the same
number, suggesting statistical content where there is none.
Now: N=1 prints "wall=X.XXms (n=1, --iters >=3 recommended for
stable median)". N>=2 keeps the full spread display with an
explicit n=N. Bench-side fix only; doesn't touch gqlite impl.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): add SPEC_CHECKLIST.md — verbatim quotes from LDBC spec
Source-of-truth doc that maps each LDBC spec requirement to whether
it applies to research-paper use vs audited submissions, and what
our bench currently does. All quotes verbatim from
github.com/ldbc/ldbc_snb_docs (main branch), with file citations.
Headline finding: the spec itself sanctions a two-tier model.
benchmark-checklist.tex says "For most research papers, fully
audited results are unrealistic and even unaudited results can
provide insight" and lists exactly what info to disclose for
research use. We're firmly in that tier and the spec says that's
fine.
Practical use: when someone challenges "is this a real LDBC bench?"
the answer is now "yes, at the research-paper tier the spec
defines, with these specific gaps documented." Each gap (validation,
ACID, driver, run duration, scale) carries a verbatim quote and
a reason it's audit-only or out-of-scope here.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): IC2 syntax fidelity + paper-tier disclosure
Two changes from the post-spec-fetch review:
1. Switch IC2 query to use `(p: Person {id: $personId})` shorthand,
which is verbatim spec syntax. gqlite's parser accepts it and
elaborates to `(p: Person) WHERE p.id = $personId` internally;
identical post-elab behavior. Closes audit-row #1 entirely —
that row was a self-imposed stylistic divergence, not a gap.
2. Add "Paper-tier disclosure" section answering all 10 items
from `benchmark-checklist.tex` directly. Each answer is one
line, no hedging. Replaces what would otherwise be scattered
across the doc:
- Cross-validation: No (with reason: coalesce + ORDER BY gaps)
- Persistent storage: Yes
- ACID: No
- Fault-tolerance: No
- Warmup rounds: 0
- Execution rounds: 3 per param × 15 params
- Summarization: per-param med + across-param med headline
- Load phase included: No (separate)
- SUT-dev consulted: gqlite is our own system
The maxDate predicate stays in WHERE since `<=` isn't equality;
the descriptor shorthand only handles equality. That's a spec-
faithful split (the IC2 spec has the `id` filter inside the
descriptor and the date filter in WHERE, exactly as we now do).
Results table is being re-run with --iters 3 (was 1) for stable
medians; lands in a follow-up commit.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): --backend memory|lazy|disk + RSS measurement (sysinfo)
Per professor's request: compare gqlite's three GraphAccess backends
on the same IC2 query so the paper can show the speed-vs-RAM tradeoff.
- Memory (Graph from CSV): everything in RAM, no disk I/O at query
time. Uses csv_loader::load_from_ldbc_csv_dir.
- Lazy (LazyGraphStore): topology + label index in RAM, props via
per-process LRU page cache + OS page cache.
- Disk (DiskGraphStore): topology in RAM, everything else from disk
per access (via OS page cache).
Bench grows --backend and --csv-dir flags. Memory backend takes the
CSV directory directly; Lazy/Disk take the .gdb path. Each run
reports both per-param wall time and peak RSS delta over baseline.
sysinfo crate (0.30) added for cross-platform RSS reading. Used
only by ldbc_bench. RSS is OS-reported working-set; on Windows
that includes mapped file pages, so Lazy/Disk numbers reflect both
the in-process structures AND any portion of the .gdb that the OS
has resident — disclosed in the doc.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): land 3-backend Results table + cache-cost note
Memory vs Lazy vs Disk on SF0.1 IC2, per professor's request.
Median per-param: memory 4.65s, disk 7.71s, lazy 8.42s. Memory
wins on speed by ~2x, costs +605 MiB; Disk uses least RAM and
loads fastest but no in-process caching.
Three real findings, all worth flagging in the paper:
1. Memory beats Lazy/Disk by ~2x. Page-fault-free queries.
2. Lazy and Disk are within 10% on speed despite Lazy having
an in-process LRU cache. The OS file cache dominates; Lazy's
process-level cache adds bookkeeping without much benefit.
3. Cold-cache vs warm-cache cost differs by ~10x. The very first
run after machine restart shows 53-147s per param (median 80s);
subsequent runs benefit from OS-cached .gdb pages and drop to
7-10s per param. Documented as context for the warm-cache
table; cold-cache numbers preserved for paper use.
Caveat documented: sysinfo's RSS reading on Windows includes
mmap'd file pages, so Lazy/Disk RSS deltas are confounded by
which portion of the .gdb the OS has resident. Memory's +605 MiB
delta is purely allocated heap and is the most reliable absolute
number in the table.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): per-IC TOML query catalog + dispatching runner
PR #1 of the LDBC runner refactor (per the planning doc).
Architecture:
- bench/ldbc-queries/ic<n>.toml — one file per IC (1-14).
IC2 carries query template + params_file ref + divergences;
the other 13 carry status="blocked" + required_features +
blocked_reason.
- src/bin/ldbc_bench.rs rewritten to load TOML files, parse the
LDBC param file, and substitute {paramName} placeholders in the
query template. No more hardcoded PARAMS / query string.
CLI:
--ic <n>|N,M|all|blocked which IC(s) to run; default 2
--backend memory|lazy|disk default lazy
--params-dir <dir> default bench/data/substitution_parameters-sf0.1/...
--queries-dir <dir> default bench/ldbc-queries
--iters N default 3
--limit N default 20
--csv-dir <dir> required only with --backend memory
`--ic blocked` prints the inventory of blocked ICs with their
required_features (no bench run). Useful as a status report:
"what would we need to add to gqlite to run these?"
New deps: toml = "0.8" (small, well-maintained; multiline strings
in TOML are nicer than JSON for query templates).
Per professor feedback: Memory backend stays available in the bench
binary for smoke-testing but is dropped from the headline Lazy-vs-Disk
comparison in LDBC_BENCHMARK.md (Memory loads everything from CSV at
every startup — unrealistic for a database product).
Same numbers as the previous IC2-only bench; this commit is purely
the architecture change. Adding a new IC (when blocking gqlite
features land) = drop a TOML file, no bench code changes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): trim docs — drop audit-tier discussion + non-divergences
Per feedback: docs were too verbose with audit-tier content we don't
care about and "divergences" that turned out to be no-ops.
- Delete LDBC_SPEC_CHECKLIST.md entirely (314 lines of audit vs
research-paper analysis with verbatim spec quotes — irrelevant
since we're not auditing).
- Drop the "Methodology / reporting divergence (#9, expanded)"
section from LDBC_BENCHMARK.md (driver, throughput-at-SLA, ACID,
audit JSON — all audit-tier).
- Drop the "Paper-tier disclosure" section's 10-row spec-checklist
table (mostly no/no/n.a. answers; not useful when we're not
arguing for compliance).
- Trim the divergence audit from 9 rows to 2: only the real
divergences (#6 no coalesce, #7 no ORDER BY). The other 7 rows
were "None — same as spec" or "fixed in earlier commit" and
weren't divergences anyway.
- Trim the RSS caveat and cold-cache notes to a few lines each.
- Remove the long "out of scope" rationale block on the optimizer
pushdown gap; reduced to a 4-line "Fix:" pointer to which
src/* files would need to change.
LDBC_BENCHMARK.md goes from 428 -> 324 lines. Combined with the
deleted SPEC_CHECKLIST.md, total bench-doc reduction is ~420 lines.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): note LIMIT-keyword TODO in ic2.toml
When gqlite's parser gains a LIMIT keyword, append `LIMIT 20` to
the query template and drop --limit from the bench CLI. Until then
the runtime row cap (Runtime::run_query's second arg, exposed as
--limit) is the only way to enforce the spec's LIMIT 20.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): rewrite README to be how-to-run only
Per feedback: the markdown was trying to do too much.
- Divergences are already in each ic<n>.toml's [divergences] table
- Results / findings / cold-warm analysis belong in the paper, not
the bench README
- The README's job is "how does someone reproduce this run"
Cuts:
- IC2 fidelity-vs-spec section (the 2-row divergence table is now
redundant with ic2.toml; the spec quote is in the TOML's spec link)
- Results section (Lazy-vs-Disk table, per-param times, RSS caveat,
cold-cache note — all paper material)
- "Things this surfaces" (optimizer pushdown finding — paper material)
- "Architecture" section (TOML schema is self-documenting)
What stays:
- Status: 1 implemented, 13 catalogued, point at toml inventory
- Setup: download dataset, download params, build .gdb
- Run: invocation examples + flag table + output format
- Add new IC: 2-step recipe for promoting a blocked TOML
- Backends: one paragraph distinguishing the three options
- See also: pointers to the toml files, runner source, and paper
Goes from 324 to 133 lines (-258 +67). Combined with the earlier
trim and the deleted SPEC_CHECKLIST.md, total reduction across the
two LDBC bench docs is ~640 lines.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): bench_setup binary for one-command data prep
Replaces the README's three manual setup steps (curl + python zstd
+ gqlite import) with a single binary that does all three.
Idempotent: each step checks for its output and skips if present.
Re-running bench_setup with everything in place is a no-op (just
prints "[skip] ..." for each step). Pass --rebuild to force-rebuild
the .gdb.
Three deps added (only used by this binary):
- ureq 2.12 (default-features = false, "tls"): small blocking HTTP
client, ~one network call per archive
- zstd 0.13: streaming decompressor for the .tar.zst archives
- tar 0.4: streaming tar extractor
The .gdb build step shells out to the existing `gqlite --import-ldbc-csv`
binary rather than calling csv_loader directly, so the .gdb is
produced by exactly the same code path users run from the REPL — no
risk of bench_setup and gqlite drifting on the .gdb format.
Output:
data-dir: bench/data
downloading https://datasets.../ldbc-sf0.1.tar.zst
done: 17.5 MiB in 0.7s
extracting ldbc-sf0.1.tar.zst -> ldbc-sf0.1
done in 0.3s
downloading https://datasets.../substitution_parameters-sf0.1.tar.zst
done: 198.4 KiB in 0.2s
extracting substitution_parameters-sf0.1.tar.zst -> substitution_parameters-sf0.1
done in 0.0s
building ldbc-sf0.1.gdb from .../social_network-sf0.1-CsvBasic-LongDateFormatter
done: ldbc-sf0.1.gdb in 39.1s
✓ setup complete
README's "Setup" section collapsed from 3 multi-block steps to a
single command. Manual steps still recoverable from this commit's
diff if anyone needs them.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): document output format + rename ic CSV column to params
Two related fixes after a user pointed out the CSV format wasn't
self-explanatory:
1. The third CSV column was named `ic` but actually carries the
substitution-param values (the IC name is already in column 1).
Renamed to `params`. Reading the column as `params` matches the
stderr summary's parens, e.g. `(19791209300143|1354060800000)`.
2. README's "Output" section was 4 lines listing column names with
no semantics. Expanded to ~50 lines covering:
- Why two streams (stdout=data, stderr=chatter, Unix convention)
- Column-by-column meaning of the CSV
- What `(personId|maxDate)` and `(n=3)` in the stderr summary mean
- Worked cross-check showing min/median/mean/max of three CSV
rows reproduces one stderr summary line
The CSV header rename is breaking-but-trivial: any existing scripts
reading `ic` would now get `params` and need a one-character change.
Since the bench has shipped maybe 2 days ago and nothing is pinned
to it, this is the right time.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): rename param_idx -> row + done summary at end of run
Two small UX fixes:
1. The CSV's `param_idx` column was the same number as the stderr
summary's `row#N` label but with a different name. Renamed to
`row` so the two streams use the same vocabulary.
2. After the last param of the last IC, the bench now prints:
✓ done — 1 IC(s) ran to completion
IC2: 15 rows × 1 iter(s) = 15 runs;
across-row median 11506.22ms
(range 9109.93-16517.23ms)
Peak RSS during query loop: 84.5 MiB (+75.4 MiB over baseline)
Confirms the run finished cleanly (vs got killed mid-row) and
gives the across-row median without grepping. Per-IC summary line
so multi-IC runs (when more get implemented) compose cleanly.
Implementation: `report` now returns the per-row median;
`run_one_ic` collects them into an `IcSummary { ic_id, iters,
row_medians_ns }` and `run_targets` builds the final aggregate
from the Vec<IcSummary>.
README's Output section updated for both changes (column rename in
the CSV table; done-summary section added with worked example).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): pre-merge fixes from review (A1-A4 + warmup + --help)
Five fixes from the pre-merge review on the runner:
A1. Type-aware param substitution. New optional `param_types` array
in the per-IC TOML; per-column tag is "int" (raw substitution,
default) or "string" (wraps in single quotes, rejects values
containing `'` since gqlite's lexer has no escape syntax).
Unblocks future ICs with string params (firstName, country
names, tag names) without putting quote-handling onto the
TOML author. ic2.toml is unchanged — defaults to all-int.
A2. Duplicate IC id detection in load_queries. After sort, scan
adjacent for matching ids; fail fast naming both source files.
Catches the footgun of copying ic2.toml without renumbering.
A3. Median for even N. Previously returned sorted[n/2] (upper of
two midpoints) for even-sized samples — neither the true
median nor a labeled percentile. New median_of() helper does
the standard average for even N. Doesn't affect --iters 3
(odd); fixes --iters 2/4/6.
A4. Schema validation at load time. New IcQuery::validate()
enforces that "implemented" carries query+params_file and
"blocked" carries blocked_reason, plus param_columns/param_types
length agreement. Failure now surfaces before any backend is
opened — not 8 minutes into a multi-IC run.
B4. --warmup N flag. Per-row warmup that runs N extra iters of the
query before the timed ones, discarding their measurements.
Banner now reads "{warmup}+{iters} iters/param (warmup+measured)"
so the count is visible. Mirrors typecheck_bench's --warmup.
Default 0 (preserves existing behavior).
Plus a small ergonomic fix:
- `ldbc_bench --help` (with no positional .gdb) now prints usage
instead of trying to open "--help" as a database file.
What was not fixed (deliberately, low impact for now):
- B1 CSV ambiguity if param values contain `|` or `;`: real
issue but no LDBC param file has these chars in our scope.
- B2 peak RSS sampling between rows, not during query: watchdog
thread isn't worth the complexity.
- B3 mmap inflates Windows RSS: documented in the README; can't
fix from the bench side.
- B5 hardcoded SF0.1 in bench_setup: out of scope for now.
- B6 no checksums on downloads: would need LDBC's published
hashes; out of scope.
- B7 --ic 2,2 doesn't dedup: not a bug, just undocumented; fine.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): second-pass review fixes — stale docs + warmup honesty
Findings from a second-pass review of the runner + README:
Stale doc references:
- Rustdoc said `--show-blocked` mode (real flag is `--ic blocked`).
- Usage text recommended `--warmup 1` to absorb cold-cache cost.
Empirically wrong for our scope: looking at iter-level data from
the SF0.1 iter=3 run, iter 0 is NOT systematically the slowest
within a row — variance is general jitter, not first-iter cold
cache. The OS page cache is cross-row, so once row#0's iters
populate it, subsequent rows + iters all hit warm pages.
Reframed: --warmup is available for parity with typecheck_bench
and for larger SFs where the cache can't hold the working set,
but at SF0.1 it's overhead.
- Warmup loop's inline comment claimed "iter 0 cold-cache"; updated
to match the empirical finding.
Stale README:
- The flag table was missing --warmup entirely (added in a previous
commit but the doc table didn't get the row).
- The stderr summary example showed `3 iters/param` but the new
banner format is `0+3 iters/param (warmup+measured)`.
- "Adding a new IC" recipe didn't mention `param_types` for ICs
with string params (IC1 firstName, IC3/IC11 country names, IC6
tagName, etc.). Now mentions `param_types` and the gqlite
no-escape-for-quotes caveat.
What's NOT a stale-doc issue (verified during the pass):
- All `# Spec: ...` URLs in TOMLs point to ldbc/ldbc_snb_docs main
branch which is stable.
- The `_ic` parameter in `report()` is used in the eprintln line.
- Edge cases (--iters 0, --warmup with no iters, etc.) handled
gracefully via the `is_empty()` skip in the final summary.
- `param_columns` doesn't strictly need to match the param file's
header — the runtime reads the file's own header. Documented.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): real fixes from a deeper review pass + 13 unit tests
The previous "second-pass review" was thin (mostly stale-doc fixes).
A proper adversarial pass surfaced four substantive issues:
C1. Arg parsing crashed with index-out-of-bounds when a flag had
no value (e.g. `ldbc_bench db.gdb --iters` with nothing after).
New `need_value` helper checks bounds and prints a clear
"{flag} requires a value" before exiting.
C2. "All rows failed" was reported as `✓ done` with no per-IC line.
A user could read this as "ran successfully, no rows" instead of
"every row erred." IcSummary now tracks `failed_rows`; final
summary prints `⚠ done with errors` when any IC had zero
successful rows, and per-IC lines surface failure counts even
when row_medians_ns is empty.
C3. `substitute()` silently left unmatched `{...}` placeholders if
the query template referenced a column not in the param file's
header. The gqlite parser would then fail with "Parse error" —
a misleading symptom for the actual problem (TOML/header
mismatch). New `find_unsubstituted_placeholder` post-pass scan
catches bare `{ident}` tokens — careful to skip the legitimate
`{id: 933}` descriptor shorthand and `{1,3}` repetition syntax.
Verified end-to-end: changing ic2.toml's query to reference a
nonexistent {countryName} now produces:
SUBSTITUTE ERROR on row 0 (params: ...): unsubstituted
placeholder {countryName} remains after substitution; check
that the query template's column names match the param
file's header (personId, maxDate)
C4. Zero unit tests for 700 lines of code. Added a tests module
covering median_of (incl. the even-N regression A3 caught
earlier), substitute (int/string/bad-type/quote-rejection/
unmatched-placeholder), and find_unsubstituted_placeholder
(true positives + false-positive guards on descriptor
shorthand and repetition quantifiers). 13 tests, all green.
Plus reframed the warmup recommendation honestly. Earlier commits
said "empirically not needed at SF0.1" based on 6 data points — too
strong a claim from too little data. New text acknowledges OS
page-cache mechanics (real, well-known) but is calibrated about
what we've actually measured: "varies by OS / RAM / storage; on a
warm machine with free RAM >= dataset size it usually doesn't help,
on cold or cache-constrained machines it can absorb a cold-iter
spike." Default stays 0; recommendation is "raise it if the
per-row summary shows iter 0 consistently dominating."
What's still NOT fixed (deliberately, low impact):
C5 TOML schema versioning, C6 absolute-path footgun in
`params_file`, C7 .gdb freshness vs loader version. Each is
documentation territory or a system-wide gqlite concern.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): final pre-merge fixes — arg parsing + idempotency
Three more issues found in the fourth review pass:
E1. The --ic value parser still used .expect() (panic with raw
Rust error) for non-numeric values like `--ic 2,abc`.
Inconsistent with the C1 fix from the previous round.
Now a clean error: "invalid --ic value \"2,abc\":
invalid digit found in string. expected: N, N,M,K, all, or
blocked".
F1. bench_setup's arg parser had the same C1-style index-out-of-bounds
risk (`bench_setup --data-dir` with no value would panic). Added
the same `need_value` helper used in ldbc_bench.
F2. bench_setup's idempotency check was "does the directory exist?".
A partially-extracted directory (interrupted run, network drop
mid-extract) would skip re-extraction and feed garbage data into
the .gdb build. Now checks for a known sub-path that only exists
post-successful-extract:
- dataset: csv_dir/dynamic/ must be a directory
- params: params_inner/interactive_2_param.txt must be a file
Detected the partial-extract case in a synthetic test (empty
csv_dir → bench_setup correctly tried to re-download instead of
skipping).
13 unit tests still pass. End-to-end smoke tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): loop-review fixes — error handling, idempotency, lint, doc
User asked for a loop-review until I find no issues. Multiple
passes; this commit is everything that surfaced.
Pass 1 — `.expect()` panics replaced with clean exits:
- load_params on empty file (ldbc_bench)
- fs::create_dir_all on the data-dir (bench_setup)
- env::current_exe lookup (bench_setup)
- fs::remove_file on --rebuild (bench_setup)
Pass 2 — TOML schema strictness:
- #[serde(deny_unknown_fields)] on IcQuery so a typo like
`parms_file` produces an immediate parse error naming the
unknown key, instead of silently leaving params_file=None
and surfacing later as "missing `params_file`".
Pass 3 — three more `.expect()` on backend dispatch:
- csv_dir on memory backend → "--csv-dir is required..."
- csv_loader::load_from_ldbc_csv_dir → propagates the io::Error
- LazyGraphStore::open / DiskGraphStore::open → ditto
Pass 4 — README's flag table had stale strong claim
"--warmup empirically doesn't help at SF0.1" — softened to match
the calibrated runner help text (varies by OS / RAM / storage).
Pass 5 — `--iters 0` was silently allowed but produced no output:
the bench would run warmup iters (if any), record zero measured
iters, emit no CSV, and print "✓ done" with no per-IC line.
Now rejected at arg parse with "--iters must be >= 1".
Pass 8 — clippy warnings:
- cmp_owned: `data_dir == PathBuf::from("...")` → `Path::new`
- doc_overindented_list_items: re-indented two doc bullets
Pass 9 — `--ic 99` (no match) printed "✓ done — 0 IC(s) ran to
completion" which falsely implied success. Now prints
"⚠ done — 0 IC(s) ran (none of the requested ICs were
implemented in the catalog). Use --ic blocked to see the
inventory."
13 unit tests still pass. cargo clippy clean. cargo fmt clean.
End-to-end smoke (--ic blocked, --ic 2, --ic 99, --ic 0, --ic 2
with bad value) all produce clean error/output paths.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* bench(ldbc): rewrite stderr docs — full sample + RSS/baseline glossary
Two issues a reader pointed out in the README's stderr section:
1. The example was incomplete — it skipped the setup lines (the
"Loaded N IC definitions / RSS baseline / Loading .gdb / RSS
after open" header) and the "✓ done" final summary that the
bench actually emits. README only showed the per-row middle.
Now shows a complete stderr trace from --ic 2 --iters 3, top
to bottom, so readers can match what they see live to what's
documented.
2. RSS and "baseline" were used in the example without definition,
leaving readers (correctly) confused about what the numbers
mean. New "What RSS and baseline mean" subsection: RSS is OS-
reported RAM use; baseline is process startup; the
"+X MiB over baseline" form makes per-step cost readable
without manual subtraction. Worked example using the actual
sample numbers.
Per-row docs (min/med/mean/max meaning, sample-size note when n=1)
unchanged — those were already correct, just buried under the
incomplete example.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
matiastoro
added a commit
that referenced
this pull request
May 1, 2026
The LTJ TripleIndex was being rebuilt from scratch on every call to run_join / run_concat_pattern (engine.rs:561, :633), even when the graph hadn't changed. For SF0.1 that's a ~700ms tax on every query — reading 1.5M directed + undirected edges through the page cache and sorting six orderings of the resulting triples. The CLAUDE.md `Current limits` section already flagged this as #4: "TripleIndex rebuilt per query: no cross-query caching (could go on Runtime with a RefCell)." This commit is exactly that change. Add `triple_index: RefCell<Option<TripleIndex>>` to Runtime. A new Self::triple_index() lazily builds-and-caches on first use; both LTJ call sites borrow it through the cache instead of allocating. The graph is read-only over a Runtime's lifetime, so the index never needs invalidation. Measured on `bench/data/ldbc-sf0.1.gdb`: Same query 3× back-to-back (single-branch IC2): Before: 0.709s, 0.681s, 0.693s (linear — no caching) After: 0.709s, 0.001s, 0.001s (700× warm-query speedup) LDBC IC2 across-row median (full union, 15 params × 3 iters, --warmup 1, --limit 20): No index, no cache: 2417 ms Auto hash + btree, no cache: 1377 ms (1.76×) Auto hash + btree, cached: 8.74 ms (276× total, 158× from cache) For reference: GraphQLite (SQLite extension with Cypher) was 32.82ms median on the same IC2. We're now 3.75× faster on this query. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Felipe705x
added a commit
that referenced
this pull request
May 5, 2026
Per request to actually verify rather than just analyze: tried to
build webbery/gqlite on Windows + MSVC 2022 + bundled CMake
(3.31.6-msvc6), May 2026.
Hit FOUR sequential build failures, each requiring manual
intervention:
1. **git-describe submodule version**: shallow clone breaks
libmdbx's CMake tag-matcher. Worked around by --unshallow +
synthetic tag.
2. **flex/bison not on PATH for MSBuild**: README claims the
bundled `tool/{flex,bison}.exe` "needs no dependency" but
MSBuild doesn't honor `WORKING_DIRECTORY` for cwd-based binary
lookup. Worked around by pre-generating parser/lexer manually
AND prefixing `tool/` to PATH for the build invocation.
3. **libmdbx version-mismatch** in CMake-templated `version.c`:
`MDBX_VERSION_MAJOR/MINOR` mismatch between mdbx.h (0,11) and
the generated file (0,0) fires a preprocessor `#error`. Worked
around by patching the generated version.c.
4. **MASM assembly file not built/linked**: project contains x86_64
coroutine assembly stubs but CMake doesn't `enable_language(ASM_MASM)`,
so MSBuild compiles the .cpp consumers but never assembles the
.asm files into .objs. Linker fails with `LNK1181: cannot open
input file 'jump_x86_64_ms_pe_masm.obj'`.
We stopped at #4. The point isn't "Windows is hard" — every other
system in this bench (kuzu, graphqlite, gqlite) builds clean from
a fresh `git clone` on the same Windows + MSVC 2022 setup. webbery's
build issues are specific to its pre-2024 CMake config and
unmaintained submodules. **Combined with the grammar-level blockers
(no LIMIT, no label disjunction), the build issues are the second-
order signal — even if all four were patched, the DSL still can't
express IC2 faithfully.**
The verdict is unchanged: discard. But the SKIPPED.md is now
empirically grounded rather than just grammar-reading. A reviewer
who clones the repo will hit the same wall.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.