Skip to content

Typing/checker#1

Merged
matiastoro merged 14 commits into
mainfrom
typing/checker
Apr 27, 2026
Merged

Typing/checker#1
matiastoro merged 14 commits into
mainfrom
typing/checker

Conversation

@Felipe705x
Copy link
Copy Markdown
Collaborator

DO NOT MERGE -- DIFF PREVIEW ONLY

Step 0 of the fppc → gqlite typechecker port: AST and lattice mappings,
PathType option-1 decision, and a reconciliation table specifying which
gqlite lattice ops need to change to match fppc's semantics in Step 1.

Per user override, fppc is the authority on type-checking semantics:
where a gqlite op differs from fppc on a case fppc covers, gqlite is
updated to match fppc. Cases gqlite covers but fppc does not (Top,
Empty, Neg, F, Record, Group, EdgeNonDirectional, EdgeAnyDirection)
stay as-is.
Per Step 1 of the migration plan: where a gqlite lattice op diverged
from fppc on a case fppc covers, gqlite is updated to match fppc.

- simple_type.rs: SimpleType::is_subtype — flip (_, Zero) => true to
  (Zero, _) => true. Bottom is subtype of everything, not vice versa.
  (fppc B2)
- label_type.rs: LabelType::is_subtype — add the missing (Or, _) arm.
  Uses || (matches fppc's preserved-by-design behavior).
- variable_type.rs: VariableType::is_subtype — add (Zero, _),
  (Group, Group), (Union, _) and (_, Union) arms. (fppc B5 plus
  the missing Union and Group covariance arms.)
- variable_type.rs: VariableType::meet — replace the Union-LHS manual
  Union(r1, r2) construction with VariableType::join, so equal sides
  dedupe and Zero is absorbed (matches fppc's join-based combination).
- variable_type.rs: add VariableType::refine_to_nodes helper, used
  by PathType::meet (ported in Step 3).
- variable_type.rs: derive Debug+Clone on Schema so the Typechecker
  can own it by value.

Cases gqlite covers but fppc does not (LabelType::Top, Empty, Neg,
SimpleType::F, Record, Group, VariableType::Group, EdgeNonDirectional,
EdgeAnyDirection) are gqlite-only and stay as-is.

All 62 lib tests and 18 integration tests still pass.
A HashMap<String, VariableType> binding map plus pointwise union/meet
and a to_group helper for repeated patterns.

Adaptations from fppc:
- gqlite uses raw String for variable names, not fppc's Var wrapper, so
  set/get take &str directly (no set_var/get_var siblings).
- fppc's to_list becomes to_group: gqlite uses VariableType::Group for
  repetition grouping and reserves List for user-facing list values.
- fppc's meet returns Result<_, String> because its VariableType::meet
  is fallible. gqlite's VariableType::meet returns Zero on shape
  mismatch; we surface that as Err here when the meet collapses to
  bottom on inputs that weren't already empty, preserving fppc's
  observable check-level behavior.
Step 3 of the typechecker migration.

PathType (src/typing/path_type.rs) is ported as a parallel module per
option 1 of the migration plan. Mirrors fppc's structure: Node / Edge /
Union / Zero with meet, union, len, is_empty, is_unsatisfiable. The
EdgeDir enum mirrors fppc's ast::EdgeDirection (gqlite has no AST
counterpart since direction is encoded in the PathPattern variant).

Adaptations from fppc:
- NodePathType wraps DescriptorType directly (gqlite has no NodeType
  wrapper struct; VariableType::Node carries a DescriptorType).
- EdgePathType keeps fppc's shape: { p1: Box<PathType>, n2: NodePathType }.
  The edge descriptor is held by the variable's VariableType binding,
  not by the path shape.
- VariableType::Group (gqlite-only repetition grouping) maps to Zero in
  the path conversion, mirroring fppc's treatment of List.

Checker skeleton (src/typing/checker.rs) defines Typechecker /
TypecheckResult and stubs check_query / check_pattern /
check_path_pattern / check_expr / refine_pattern_node /
refine_pattern_edge with todo!() bodies. pow_path_type is implemented
since it has no dependencies. Module compiles against gqlite's types,
which proves the Step 0 mapping work.
Step 4 of the typechecker migration. The expression checker reuses
gqlite's BinOp::delta and UnOp::delta, so no operator signature table
is duplicated.

Cases:
- Const(Value): map literal to SimpleType (Int→Z, Float→F, Str→S,
  Bool→B). List and Record literals get loose types — fppc has no
  literal lists, and full element/field typing is a phase-1 punt
  recorded in the migration doc.
- Type(t): return t directly.
- AttrLookup { var, attr }: look up var, call get_attribute. Warn on
  unbound variable, missing attribute, or empty type.
- FieldAccess { base, field }: gqlite-only. Look up the field on Record
  base; pass through Star/Zero gradually; warn on non-record base.
- Binop: special-case Is and As when rhs is a Type literal (Is→Bool,
  As→declared type). Otherwise call BinOp::delta and verify both
  operands are compatible via SimpleType::meet.
- Unop: same pattern with UnOp::delta.

is_empty / != Zero is the compatibility test, mirroring fppc; no
gradual_eq is reintroduced (it was removed in fppc HEAD).
Step 5 of the typechecker migration.

refine_pattern_node builds a VariableType::Node from the descriptor
(star descriptor for anonymous nodes) and runs VariableType::refine
against the schema. refine_pattern_edge does the same, picking
EdgeDirectional or EdgeNonDirectional from the EdgeDir.

Neither pattern-matches on LabelType::{Top, Empty, Neg} or
PropertyType::{Open, Closed} directly — refinement defers to the
lattice's existing meet / is_subtype which already handle them.

Adds an assert_filters_drained sanity check: elaboration is supposed
to hoist Descriptor::value_filters into Filter nodes before typecheck
runs. If we see leftovers we surface a typechecker error so an
elaboration bug cannot silently drop constraints.
Step 6 of the typechecker migration — the load-bearing function.

Each PathPattern variant produces a (PathType, TypeEnvironment) pair:
- Node: refine descriptor against schema, wrap as PathType::Node, bind
  to env if a variable name is present.
- EdgeRight / EdgeLeft / EdgeUndirected / EdgeAnyDirection: fan out
  fppc's single Edge { direction } arm. Right/Left/Any produce
  EdgeDirectional; Undirected produces EdgeNonDirectional. The PathType
  conversion uses EdgeDir to decide forward / reversed / union shape.
- Concat / Join: meet environments (with schema refinement) and meet
  path shapes. Join is gqlite-only and reduces to Concat for typing
  purposes — both compose two patterns over shared variables; the
  endpoint-glued vs not distinction is a runtime concern. Documented in
  the migration doc; flagged for review if Join semantics ever diverge.
- Filter: type-check the expression against Bool; warn if not boolean.
- Union: union of paths, union of environments.
- Repeat / Questioned: pow_path_type on the path, env wrapped in Group
  (gqlite's name for what fppc calls List). lb capped at 3 to avoid
  exponential blowup, mirroring fppc.

create_context handles None descriptors (anonymous patterns) by
returning an empty environment.
Step 7 of the typechecker migration.

compile and compile_query now run typechecking between elaborate and
optimizer. Typechecking uses the permissive Schema::star() and
rejects the query if the checker reports errors (unbound variables,
context meet failures, etc.).

Sibling entry points compile_unchecked and compile_query_unchecked
skip typechecking — same plan as gqlite/main produced before this
branch landed. The optimizer is unchanged and works in both modes.

Typechecker::check_query simply delegates to check_pattern: gqlite's
elaborator folds WHERE clauses into PathPattern::Filter nodes inside
the pattern, so there is no separate WHERE expression to check.
RETURN-clause typing is out of scope for phase 1.

All 62 lib tests and 18 integration tests still pass.
Step 8 of the typechecker migration. Four hand-written queries that
exercise the on-by-default and opt-out paths:

- A node pattern with property return.
- A two-hop concat with a WHERE filter on a node property.
- A three-node concat pattern.
- An unbound-variable reference in WHERE — the checked path rejects
  it with a 'not found' error, the unchecked path accepts it and
  produces a plan, confirming the typechecker is the gatekeeper.

For the three valid queries the test asserts that compile_query and
compile_query_unchecked produce the same optimized plan: the checker
adds verification, not transformation.

Not a test suite — focused punch-list confirming the wiring works
end-to-end against the existing parser/elaborator/optimizer.
Step 9 of the typechecker migration. Records:

- Commit list and outcome vs acceptance criteria.
- Lattice reconciliation outcomes with fppc citations and gqlite
  before/after for every touched op.
- Five deviations from the Step-0 plan:
  - to_list -> to_group (gqlite's name for repetition grouping).
  - TypeEnvironment::meet still returns Result<_, String> by
    synthesizing Err from a VariableType::meet that collapses to Zero;
    preserves fppc's check-level observable behavior.
  - EdgeDir as a typechecker-local enum in path_type.rs (gqlite has no
    AST EdgeDirection); From<(VariableType, ...)> became
    PathType::from_variable.
  - Join treated as Concat for typing — verified: comma-join shares
    variables the same way Concat does.
  - Schema needed no new constructor (public fields suffice).
- Punts (FieldAccess on non-records, list/record literal typing, RETURN
  typing, schema sourcing, NodeId side-table).
- Two architectural follow-ups option 1 defers (parallel typing
  modules; per-segment clones at Concat/Union).
- Three findings noted for the next phase to consider, none requiring
  action now.
Closes the validation gap noted in the Step 9 final report. Each test
in tests/typecheck_fppc_parity.rs mirrors a fppc typechecker test
line-for-line and asserts the same ok/empty/warning predicates.
Surface-syntax differences ({a: int} -> {a is int}, bare -> -> ()-[]->())
are adapted; the behavior under test is unchanged.

All 45 translated tests pass on first run, with no implementation
changes needed. The skipped test (test_incompatible_records) uses
double-brace nested record syntax that gqlite's parser doesn't accept;
the underlying lattice case is covered by other tests.

This moves the equivalence claim against fppc from 'I built it that
way' to 'verified by direct test diff'. The remaining cross-direction
edge meet error path (fppc Err vs gqlite Zero) is unexercised by
fppc's tests and doesn't affect observable typecheck verdicts.

Migration doc final report updated.
Two cargo examples for hands-on use:

- examples/typecheck_demo.rs: ad-hoc CLI that takes a query as argv,
  prints the type-check result (ok/empty/path/env/errors/warnings).
  Useful for poking at individual queries:
    cargo run --release --example typecheck_demo -- '(x: {a is int})(x: {a is bool})'
  -> empty=true, prop 'a' collapses to Zero (int ⊓ bool).

- examples/typecheck_repl_smoke.rs: end-to-end smoke against
  examples/movies.gdb. Runs four representative queries through
  compile_query (typecheck on) and confirms:
  * the checked path accepts each query,
  * the unchecked path produces the same plan,
  * the runtime returns expected row counts (38 movies, 133 people, ...),
  * an unbound-variable WHERE is rejected by the checker only.
  All five cases pass.

Migration doc updated with the completed test-plan items:
- bench_test failure mode confirmed identical to main (pre-existing,
  unrelated to this branch)
- REPL-level smoke against movies.gdb passes

Removed unused TypecheckResult::failed associated function (dead code
warning).
Two fppc-parity fixes in the typing lattice, both narrowed to behavior
on variants both repos share:

- LabelType::is_subtype: swap (And, _) and (_, Or) arm order so
  decomposition matches fppc. The two orderings are sound but pick
  different short-circuit paths on (And, Or) inputs and disagree on
  nested cases (e.g. And(A, Or(B,C)) vs Or(D, And(E,F))).

- PropertyType::is_subtype: add (Zero, _) => true as the first arm.
  Without it, the lower-bound law fails when meet collapses to Zero
  on key-mismatched Closed maps (regression seed in fppc:
  Closed({"x": Star}) vs Closed({})).

Verified against ported fppc property tests + regression seeds and
the existing typecheck_fppc_parity suite.
In gqlite there's no parity-vs-non-parity distinction — these are
just the typechecker tests, mirroring the gqlite-local *_test.rs
convention (parser_test.rs, runtime_test.rs, etc).

The file's header comment is updated to record where the cases came
from (line-for-line translation of fppc/src/typechecker/checker.rs's
inline #[cfg(test)] mod tests, which is the canonical typechecker
reference suite) without framing them as a separate 'parity' suite.

No semantic changes; all 45 tests still pass.
@matiastoro matiastoro marked this pull request as ready for review April 27, 2026 15:33
@matiastoro matiastoro merged commit 1414bfb into main Apr 27, 2026
@Felipe705x Felipe705x mentioned this pull request Apr 27, 2026
Felipe705x added a commit that referenced this pull request Apr 27, 2026
Format the files on main that the fmt-CI PR (#2) didn't reach. Those
files came in via PR #1 (typing/checker) and either weren't part of
PR #2's diff or were touched in a way that didn't normalize the
formatting. Running 'cargo fmt --all' on main produces no further
changes after this lands.

Files touched:
  examples/typecheck_demo.rs
  examples/typecheck_repl_smoke.rs
  src/lib.rs
  src/typing/checker.rs
  src/typing/mod.rs
  src/typing/path_type.rs
  src/typing/type_environment.rs
  tests/typecheck_smoke.rs
  tests/typecheck_test.rs

No semantic changes. cargo build + cargo test --lib both clean.
@Felipe705x Felipe705x mentioned this pull request Apr 27, 2026
Felipe705x added a commit that referenced this pull request Apr 29, 2026
Now that the LDBC loader exposes `id` as a queryable property
(parent commit), the IC2 query can anchor by `Person.id` directly
— same as the spec.

Changes:
- `WHERE p.firstName = ... AND p.lastName = ...` → `WHERE p.id = ...`
- `RETURN ...firstName, ...lastName, c.content, c.creationDate`
  → `RETURN friend.id, ...firstName, ...lastName, c.id, c.content,
  c.creationDate` (matches the spec's RETURN modulo `coalesce`).
- `PARAMS` is now five real `Person.id` values from SF0.1 instead
  of `(firstName, lastName)` workaround pairs.

Doc reflects the upgraded fidelity: replaces the old "divergences
fixed" section with a 9-row honest divergence table covering every
surface-syntax-or-semantic difference. Net divergences from the
spec are reduced to two with real semantic effect:

  - **#6 (`coalesce`)**: gqlite has no `coalesce` builtin. Posts
    with empty `content` but non-empty `imageFile` return blank
    rather than the imageFile path. Few rows in SF0.1.
  - **#7 (no ORDER BY)**: gqlite parser doesn't have ORDER BY yet.
    Affects *which* 20 rows you see under LIMIT, not wall time.

All other table rows (#1, #2, #3, #4, #5, #8, #9) are surface-
syntax or naming differences with no semantic divergence — same
query, just spelled to fit gqlite's grammar / data conventions.

### Result snapshot (SF0.1, single-iter)

|  Person.id      | resolves to    | rows | wall time |
|-----------------|----------------|------|-----------|
| 933             | Mahinda Perera | 20   |  108.5 s  |
| 1129            | Carmen Lepland | 20   |  100.6 s  |
| 8796093023296   | Hồ Chí Loan    | 20   |   97.4 s  |
| 21990232555524  | Bryn Davies    | 20   |   99.2 s  |
| 32985348833865  | Cheng Yu       | 20   |  107.5 s  |

Wall times are similar to the previous (less faithful) bench. The
optimizer gap (no value-equality predicate pushdown) still
dominates; documented in "Things this surfaces".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Felipe705x added a commit that referenced this pull request Apr 29, 2026
Sweep of LDBC divergences found a bigger one I'd missed: we don't
use the LDBC SNB Driver at all. The driver defines the audited
measurement regime — workload at target throughput, mix of IC+IS
reads and IU updates, multi-minute warmup, randomized param order,
audit JSON output, throughput-at-SLA + p99 latency reporting.

We have none of that. The bench is a hand-rolled harness: direct
Runtime::run_query calls, IC2 only, sequential, no warmup, CSV
output, min/median/mean/max wall times.

The query and parameter set are still spec-faithful (rows #1-#5,
#8 in the audit); the rows returned would match an LDBC-conformant
run modulo #6 (no coalesce) and #7 (no ORDER BY). What's NOT
spec-faithful is the measurement regime itself — the numbers we
report are not directly comparable to throughput numbers in
published LDBC audit submissions.

Adding this as audit row #9 plus an expanded "Methodology /
reporting divergence" section listing the 9 specific differences
(harness, workload mix, concurrency, warmup, param order, headline
metric, output format, update streams, result validation). Strict
description updated: "IC2 with no coalesce, no ordering, and a
hand-rolled (non-driver) harness reporting wall-time stats instead
of throughput-at-SLA."

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Felipe705x added a commit that referenced this pull request Apr 29, 2026
Two changes from the post-spec-fetch review:

1. Switch IC2 query to use `(p: Person {id: $personId})` shorthand,
   which is verbatim spec syntax. gqlite's parser accepts it and
   elaborates to `(p: Person) WHERE p.id = $personId` internally;
   identical post-elab behavior. Closes audit-row #1 entirely —
   that row was a self-imposed stylistic divergence, not a gap.

2. Add "Paper-tier disclosure" section answering all 10 items
   from `benchmark-checklist.tex` directly. Each answer is one
   line, no hedging. Replaces what would otherwise be scattered
   across the doc:
   - Cross-validation: No (with reason: coalesce + ORDER BY gaps)
   - Persistent storage: Yes
   - ACID: No
   - Fault-tolerance: No
   - Warmup rounds: 0
   - Execution rounds: 3 per param × 15 params
   - Summarization: per-param med + across-param med headline
   - Load phase included: No (separate)
   - SUT-dev consulted: gqlite is our own system

The maxDate predicate stays in WHERE since `<=` isn't equality;
the descriptor shorthand only handles equality. That's a spec-
faithful split (the IC2 spec has the `id` filter inside the
descriptor and the date filter in WHERE, exactly as we now do).

Results table is being re-run with --iters 3 (was 1) for stable
medians; lands in a follow-up commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Felipe705x added a commit that referenced this pull request Apr 29, 2026
* loader(ldbc): expose `id` as a queryable property on every node

Previously `load_ldbc_node_file` consumed the LDBC `id` column for the
internal node name only, excluding it from the property map. That made
LDBC's spec queries un-expressible: every IC1-IC14 query in LDBC SNB
Interactive parameterizes by `Person.id`, `Comment.id`, etc. via
`{id: $param}` shorthand or `WHERE x.id = $param`. Without `id` as a
property, those queries return NULL and the bench has to use surrogate
anchors (firstName, lastName) — a real divergence from the spec.

The fix: after the existing prop_cols loop, insert
`props["id"] = parse_ldbc_value(&id_str)`. The id is also still folded
into the internal node name (`"Person:933"`) for cross-file edge
resolution; the duplication is ~12 bytes per node (~4 MB extra on
SF0.1), required for spec-faithful queries.

Why LDBC-only and not the spanner-config CSV loader: that loader
follows different conventions (`vid`, `<Label>_id`, etc., per
CLAUDE.md) and isn't tied to a workload that requires id-as-property.
LDBC is the immediate driver; the spanner loader can argue for the
same treatment separately if needed.

Verified end-to-end on SF0.1:
  $ MATCH (p: Person) WHERE p.id = 933 RETURN p.firstName, p.lastName
  "Mahinda" | "Perera"
  1 rows (0.026s)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench: add LDBC SNB Interactive IC2 benchmark

First slice of the LDBC Social Network Benchmark v1 (arXiv:2001.02299
§6, Interactive Complex reads) — IC2 only ("recent messages by
friends"). Other ICs need features gqlite doesn't have: shortest /
variable-length paths (IC1, IC3, IC11, IC12, IC14), date arithmetic /
OPTIONAL MATCH / ORDER BY / aggregation-with-HAVING (IC4-IC8, IC10,
IC13).

`src/bin/ldbc_bench.rs` runs IC2 against an SF0.1 .gdb, sweeping a
small curated set of `Person.firstName` values (LDBC's `Person.id` is
folded into the gqlite node name, not addressable as a property).
Reports per-param min / median / mean / max wall time. Typechecker
is off — IC queries are valid by construction; comparing typecheck
overhead vs runtime is a separate bench.

`bench/LDBC_BENCHMARK.md` documents setup (download, decompress,
import) and includes baseline numbers from the first run.

### Result snapshot (SF0.1, 327k nodes / 1.48M edges, limit=20)

|  Param   | min   | med   | mean  | max    |
|----------|-------|-------|-------|--------|
| Mahinda  | 52 s  | 56 s  | 55 s  |  57 s  |
| Carmen   | 38 s  | 50 s  | 65 s  | 106 s  |
| Bryn     | 84 s  | 85 s  | 86 s  |  88 s  |
| Cheng    | 87 s  | 95 s  | 94 s  |  99 s  |

These are 50-130x slower than LDBC's sub-second target. The benchmark
surfaces the cause: the optimizer's predicate pushdown handles
`x.attr is type` (per `docs/implemented-optimizations.md`) but not
value-equality predicates like `WHERE p.firstName = 'Mahinda'`. So
the firstName filter applies *after* the join enumerates every
`Person~knows~Person~hasCreator~Comment` triple — work scales with
the cartesian intermediate, not the parameter's selectivity.

Documented as a finding for a future optimizer pass — pushing `=`
constants into descriptors would collapse this into a 1-2-row anchor
on Person, dropping the join's left side from ~28k to a handful.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): close 4 IC2 fidelity divergences vs LDBC spec

The earlier bench had five divergences from LDBC IC2; four are now
fixed without touching gqlite implementation. Only ORDER BY remains
(parser feature gap).

Fixed:

- **Anchor selectivity.** `Person.firstName` → `(firstName, lastName)`.
  Five params each map to *exactly one* Person in SF0.1 — same per-
  query selectivity as the spec's `Person.id`, just spelled
  differently because the gqlite LDBC loader folds `id` into the
  internal node name rather than a property.

- **Message = Comment ∪ Post.** LDBC's `(message:Message)` covers
  both via Cypher label inheritance, which gqlite doesn't have. Use
  path-pattern union: `<-[:hasCreator]-(c:Comment) | <-[:hasCreator]-
  (c:Post)`. `c` binds to whichever arm matches.

- **`message.creationDate <= $maxDate` filter.** Was dropped
  entirely. Now in the WHERE clause; `maxDate` is a const
  (1340000000000 ms ≈ mid-2012) chosen so the filter retains a
  sizable fraction of SF0.1 data while still exercising.

- **RETURN columns.** Added `friend.firstName`, `friend.lastName`,
  `c.content` alongside the existing `c.creationDate`. The two `id`
  columns the spec wants (`friend.id`, `message.id`) remain dropped
  because they're not addressable as properties; documented.

Remaining divergence (out of scope without parser changes):

- **No `ORDER BY ... DESC`** — gqlite's parser doesn't have it.
  Affects which 20 rows you get under `LIMIT 20`, not wall time.

Updated `bench/LDBC_BENCHMARK.md` to spell out the spec query, the
gqlite query, and the divergence list explicitly.

### Result snapshot (SF0.1, single-iter)

|  Param            | rows | wall time |
|-------------------|------|-----------|
| Mahinda Perera    | 20   |   96.4 s  |
| Carmen Lepland    | 20   |  108.1 s  |
| Bryn Davies       | 20   |   85.0 s  |
| Cheng Yu          | 20   |  105.6 s  |
| Hồ Chí Loan       | 20   |  110.7 s  |

Wall times are slightly higher than the previous (less faithful)
bench because the union arm doubles the right side of the join
(~286k hasCreator edges vs ~151k Comment-only). The dominant cost
remains the optimizer's missing predicate pushdown on value-equality
WHERE; documented in "Things this surfaces".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): use Person.id anchor (full IC2 fidelity)

Now that the LDBC loader exposes `id` as a queryable property
(parent commit), the IC2 query can anchor by `Person.id` directly
— same as the spec.

Changes:
- `WHERE p.firstName = ... AND p.lastName = ...` → `WHERE p.id = ...`
- `RETURN ...firstName, ...lastName, c.content, c.creationDate`
  → `RETURN friend.id, ...firstName, ...lastName, c.id, c.content,
  c.creationDate` (matches the spec's RETURN modulo `coalesce`).
- `PARAMS` is now five real `Person.id` values from SF0.1 instead
  of `(firstName, lastName)` workaround pairs.

Doc reflects the upgraded fidelity: replaces the old "divergences
fixed" section with a 9-row honest divergence table covering every
surface-syntax-or-semantic difference. Net divergences from the
spec are reduced to two with real semantic effect:

  - **#6 (`coalesce`)**: gqlite has no `coalesce` builtin. Posts
    with empty `content` but non-empty `imageFile` return blank
    rather than the imageFile path. Few rows in SF0.1.
  - **#7 (no ORDER BY)**: gqlite parser doesn't have ORDER BY yet.
    Affects *which* 20 rows you see under LIMIT, not wall time.

All other table rows (#1, #2, #3, #4, #5, #8, #9) are surface-
syntax or naming differences with no semantic divergence — same
query, just spelled to fit gqlite's grammar / data conventions.

### Result snapshot (SF0.1, single-iter)

|  Person.id      | resolves to    | rows | wall time |
|-----------------|----------------|------|-----------|
| 933             | Mahinda Perera | 20   |  108.5 s  |
| 1129            | Carmen Lepland | 20   |  100.6 s  |
| 8796093023296   | Hồ Chí Loan    | 20   |   97.4 s  |
| 21990232555524  | Bryn Davies    | 20   |   99.2 s  |
| 32985348833865  | Cheng Yu       | 20   |  107.5 s  |

Wall times are similar to the previous (less faithful) bench. The
optimizer gap (no value-equality predicate pushdown) still
dominates; documented in "Things this surfaces".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): document parameter-generator divergence

Previous audit listed personId and maxDate substitutions as if they
were independent surface differences. They aren't — they're symptoms
of a single methodology divergence: we didn't run LDBC's
substitution_parameters generator (ldbc_snb_datagen) and instead
hand-picked params by inspection. LDBC's spec example values are
SF-specific by design (the id space is bit-packed per SF and event
timelines stretch with SF), so spec values from one SF are not
expected to work against another. The canonical workflow is to run
the generator against the chosen SF and consume its
interactive_<n>.txt output.

This doesn't affect the bench's timing claim — the join work is the
same for any valid Person.id anchor and any maxDate that doesn't cut
all rows — but it's a procedural deviation from LDBC methodology
that wasn't called out clearly. Adding it as row #10 in the audit
table, reframing #8 and #9 as derived from #10, and updating the
bench's rustdoc comment to match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): use official LDBC interactive_2 substitution params

Closes the methodology divergence flagged in the previous commit.
LDBC ships substitution parameters as a separate archive
(substitution_parameters-sf0.1.tar.zst on datasets.ldbcouncil.org)
because our SF0.1 distribution was generated with
parametergenerator.parameters:false. Downloading that archive gives
us interactive_2_param.txt: 15 (personId, maxDate) pairs from
LDBC's official parameter generator, matching our SF0.1 build
because the generator is deterministic on the dataset seed
(verified — all 15 personIds resolve in person_0_0.csv).

Changes:

- src/bin/ldbc_bench.rs: PARAMS replaced with the 15 official
  pairs verbatim. Each param now carries its own maxDate (the
  generator's output is per-param), so the global MAX_DATE_MS
  constant goes away. CSV stdout grows a `max_date` column.
  Names alongside each ID are resolved from person_0_0.csv only
  for human-readable output.
- bench/LDBC_BENCHMARK.md: audit table collapses #8/#9/#10 into
  a single row #8 ("uses official LDBC params, no semantic
  divergence"). Setup grows a §1b for the substitution-params
  download. Summary bullets updated. Strict description is now
  "IC2 with no coalesce and no ordering" — the parameter-source
  caveat is gone.

The Results table is currently a re-run note pointing at the old
hand-picked numbers as a reference; the official-param run is in
progress (one param in: 101s, 20 rows — same order of magnitude
as the hand-picked-param run, as expected since the bench's
bottleneck is optimizer-bound rather than parameter-bound). A
follow-up commit will land the fresh numbers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): document driver / reporting methodology divergence

Sweep of LDBC divergences found a bigger one I'd missed: we don't
use the LDBC SNB Driver at all. The driver defines the audited
measurement regime — workload at target throughput, mix of IC+IS
reads and IU updates, multi-minute warmup, randomized param order,
audit JSON output, throughput-at-SLA + p99 latency reporting.

We have none of that. The bench is a hand-rolled harness: direct
Runtime::run_query calls, IC2 only, sequential, no warmup, CSV
output, min/median/mean/max wall times.

The query and parameter set are still spec-faithful (rows #1-#5,
#8 in the audit); the rows returned would match an LDBC-conformant
run modulo #6 (no coalesce) and #7 (no ORDER BY). What's NOT
spec-faithful is the measurement regime itself — the numbers we
report are not directly comparable to throughput numbers in
published LDBC audit submissions.

Adding this as audit row #9 plus an expanded "Methodology /
reporting divergence" section listing the 9 specific differences
(harness, workload mix, concurrency, warmup, param order, headline
metric, output format, update streams, result validation). Strict
description updated: "IC2 with no coalesce, no ordering, and a
hand-rolled (non-driver) harness reporting wall-time stats instead
of throughput-at-SLA."

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): land Results table from official-param re-run

Full IC2 sweep against ldbc-sf0.1.gdb with the 15 official LDBC
substitution parameters. 1 iter per param, ~20 min total wall.

Stats across all 15 params:
- min  52.8 s
- med  79.7 s
- mean 83.0 s
- max 146.9 s

Same regime as the previous hand-picked-param run (97-108 s for 5
params), confirming the bottleneck is gqlite's optimizer (no
predicate pushdown for value equality on `=`, see "Things this
surfaces") rather than parameter selectivity. The 3× spread
between min and max reflects caller-degree variation plus
first-iter page-cache effects, not bench noise.

All 15 params returned 20 rows (the LIMIT). Far above LDBC's
sub-second interactive target — the optimizer gap is the unblocking
work for any future driver-mediated audit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): note bench/impl scope split + no result validation

Two clarifications surfaced in review:

1. The optimizer pushdown gap that makes IC2 take ~100s is a real
   finding but is NOT something this branch will fix. bench/ldbc is
   bench-only — no changes to src/* implementation. Adding a
   pointer to which files would need to change (pushdown.rs +
   descriptor.rs + engine.rs) so a future impl PR has a starting
   point.

2. We don't validate row contents against LDBC's expected_output
   archives. Two divergences make any byte-for-byte check fail:
   #7 (no ORDER BY) means our 20 rows aren't the 20 most recent,
   and #6 (no coalesce) means imageFile-only Posts show blank.
   Either alone invalidates a diff -q check. The bench asserts
   row_count == LIMIT, not row contents.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): drop misleading min/med/mean/max labels at N=1

Sweep-review item: with --iters 1, the bench's report() printed
"min=X med=X mean=X max=X" with all four columns showing the same
number, suggesting statistical content where there is none.

Now: N=1 prints "wall=X.XXms (n=1, --iters >=3 recommended for
stable median)". N>=2 keeps the full spread display with an
explicit n=N. Bench-side fix only; doesn't touch gqlite impl.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): add SPEC_CHECKLIST.md — verbatim quotes from LDBC spec

Source-of-truth doc that maps each LDBC spec requirement to whether
it applies to research-paper use vs audited submissions, and what
our bench currently does. All quotes verbatim from
github.com/ldbc/ldbc_snb_docs (main branch), with file citations.

Headline finding: the spec itself sanctions a two-tier model.
benchmark-checklist.tex says "For most research papers, fully
audited results are unrealistic and even unaudited results can
provide insight" and lists exactly what info to disclose for
research use. We're firmly in that tier and the spec says that's
fine.

Practical use: when someone challenges "is this a real LDBC bench?"
the answer is now "yes, at the research-paper tier the spec
defines, with these specific gaps documented." Each gap (validation,
ACID, driver, run duration, scale) carries a verbatim quote and
a reason it's audit-only or out-of-scope here.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): IC2 syntax fidelity + paper-tier disclosure

Two changes from the post-spec-fetch review:

1. Switch IC2 query to use `(p: Person {id: $personId})` shorthand,
   which is verbatim spec syntax. gqlite's parser accepts it and
   elaborates to `(p: Person) WHERE p.id = $personId` internally;
   identical post-elab behavior. Closes audit-row #1 entirely —
   that row was a self-imposed stylistic divergence, not a gap.

2. Add "Paper-tier disclosure" section answering all 10 items
   from `benchmark-checklist.tex` directly. Each answer is one
   line, no hedging. Replaces what would otherwise be scattered
   across the doc:
   - Cross-validation: No (with reason: coalesce + ORDER BY gaps)
   - Persistent storage: Yes
   - ACID: No
   - Fault-tolerance: No
   - Warmup rounds: 0
   - Execution rounds: 3 per param × 15 params
   - Summarization: per-param med + across-param med headline
   - Load phase included: No (separate)
   - SUT-dev consulted: gqlite is our own system

The maxDate predicate stays in WHERE since `<=` isn't equality;
the descriptor shorthand only handles equality. That's a spec-
faithful split (the IC2 spec has the `id` filter inside the
descriptor and the date filter in WHERE, exactly as we now do).

Results table is being re-run with --iters 3 (was 1) for stable
medians; lands in a follow-up commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): --backend memory|lazy|disk + RSS measurement (sysinfo)

Per professor's request: compare gqlite's three GraphAccess backends
on the same IC2 query so the paper can show the speed-vs-RAM tradeoff.

- Memory (Graph from CSV): everything in RAM, no disk I/O at query
  time. Uses csv_loader::load_from_ldbc_csv_dir.
- Lazy (LazyGraphStore): topology + label index in RAM, props via
  per-process LRU page cache + OS page cache.
- Disk (DiskGraphStore): topology in RAM, everything else from disk
  per access (via OS page cache).

Bench grows --backend and --csv-dir flags. Memory backend takes the
CSV directory directly; Lazy/Disk take the .gdb path. Each run
reports both per-param wall time and peak RSS delta over baseline.

sysinfo crate (0.30) added for cross-platform RSS reading. Used
only by ldbc_bench. RSS is OS-reported working-set; on Windows
that includes mapped file pages, so Lazy/Disk numbers reflect both
the in-process structures AND any portion of the .gdb that the OS
has resident — disclosed in the doc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): land 3-backend Results table + cache-cost note

Memory vs Lazy vs Disk on SF0.1 IC2, per professor's request.
Median per-param: memory 4.65s, disk 7.71s, lazy 8.42s. Memory
wins on speed by ~2x, costs +605 MiB; Disk uses least RAM and
loads fastest but no in-process caching.

Three real findings, all worth flagging in the paper:

1. Memory beats Lazy/Disk by ~2x. Page-fault-free queries.
2. Lazy and Disk are within 10% on speed despite Lazy having
   an in-process LRU cache. The OS file cache dominates; Lazy's
   process-level cache adds bookkeeping without much benefit.
3. Cold-cache vs warm-cache cost differs by ~10x. The very first
   run after machine restart shows 53-147s per param (median 80s);
   subsequent runs benefit from OS-cached .gdb pages and drop to
   7-10s per param. Documented as context for the warm-cache
   table; cold-cache numbers preserved for paper use.

Caveat documented: sysinfo's RSS reading on Windows includes
mmap'd file pages, so Lazy/Disk RSS deltas are confounded by
which portion of the .gdb the OS has resident. Memory's +605 MiB
delta is purely allocated heap and is the most reliable absolute
number in the table.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): per-IC TOML query catalog + dispatching runner

PR #1 of the LDBC runner refactor (per the planning doc).

Architecture:
- bench/ldbc-queries/ic<n>.toml — one file per IC (1-14).
  IC2 carries query template + params_file ref + divergences;
  the other 13 carry status="blocked" + required_features +
  blocked_reason.
- src/bin/ldbc_bench.rs rewritten to load TOML files, parse the
  LDBC param file, and substitute {paramName} placeholders in the
  query template. No more hardcoded PARAMS / query string.

CLI:
  --ic <n>|N,M|all|blocked   which IC(s) to run; default 2
  --backend memory|lazy|disk default lazy
  --params-dir <dir>         default bench/data/substitution_parameters-sf0.1/...
  --queries-dir <dir>        default bench/ldbc-queries
  --iters N                  default 3
  --limit N                  default 20
  --csv-dir <dir>            required only with --backend memory

`--ic blocked` prints the inventory of blocked ICs with their
required_features (no bench run). Useful as a status report:
"what would we need to add to gqlite to run these?"

New deps: toml = "0.8" (small, well-maintained; multiline strings
in TOML are nicer than JSON for query templates).

Per professor feedback: Memory backend stays available in the bench
binary for smoke-testing but is dropped from the headline Lazy-vs-Disk
comparison in LDBC_BENCHMARK.md (Memory loads everything from CSV at
every startup — unrealistic for a database product).

Same numbers as the previous IC2-only bench; this commit is purely
the architecture change. Adding a new IC (when blocking gqlite
features land) = drop a TOML file, no bench code changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): trim docs — drop audit-tier discussion + non-divergences

Per feedback: docs were too verbose with audit-tier content we don't
care about and "divergences" that turned out to be no-ops.

- Delete LDBC_SPEC_CHECKLIST.md entirely (314 lines of audit vs
  research-paper analysis with verbatim spec quotes — irrelevant
  since we're not auditing).
- Drop the "Methodology / reporting divergence (#9, expanded)"
  section from LDBC_BENCHMARK.md (driver, throughput-at-SLA, ACID,
  audit JSON — all audit-tier).
- Drop the "Paper-tier disclosure" section's 10-row spec-checklist
  table (mostly no/no/n.a. answers; not useful when we're not
  arguing for compliance).
- Trim the divergence audit from 9 rows to 2: only the real
  divergences (#6 no coalesce, #7 no ORDER BY). The other 7 rows
  were "None — same as spec" or "fixed in earlier commit" and
  weren't divergences anyway.
- Trim the RSS caveat and cold-cache notes to a few lines each.
- Remove the long "out of scope" rationale block on the optimizer
  pushdown gap; reduced to a 4-line "Fix:" pointer to which
  src/* files would need to change.

LDBC_BENCHMARK.md goes from 428 -> 324 lines. Combined with the
deleted SPEC_CHECKLIST.md, total bench-doc reduction is ~420 lines.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): note LIMIT-keyword TODO in ic2.toml

When gqlite's parser gains a LIMIT keyword, append `LIMIT 20` to
the query template and drop --limit from the bench CLI. Until then
the runtime row cap (Runtime::run_query's second arg, exposed as
--limit) is the only way to enforce the spec's LIMIT 20.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): rewrite README to be how-to-run only

Per feedback: the markdown was trying to do too much.
- Divergences are already in each ic<n>.toml's [divergences] table
- Results / findings / cold-warm analysis belong in the paper, not
  the bench README
- The README's job is "how does someone reproduce this run"

Cuts:
- IC2 fidelity-vs-spec section (the 2-row divergence table is now
  redundant with ic2.toml; the spec quote is in the TOML's spec link)
- Results section (Lazy-vs-Disk table, per-param times, RSS caveat,
  cold-cache note — all paper material)
- "Things this surfaces" (optimizer pushdown finding — paper material)
- "Architecture" section (TOML schema is self-documenting)

What stays:
- Status: 1 implemented, 13 catalogued, point at toml inventory
- Setup: download dataset, download params, build .gdb
- Run: invocation examples + flag table + output format
- Add new IC: 2-step recipe for promoting a blocked TOML
- Backends: one paragraph distinguishing the three options
- See also: pointers to the toml files, runner source, and paper

Goes from 324 to 133 lines (-258 +67). Combined with the earlier
trim and the deleted SPEC_CHECKLIST.md, total reduction across the
two LDBC bench docs is ~640 lines.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): bench_setup binary for one-command data prep

Replaces the README's three manual setup steps (curl + python zstd
+ gqlite import) with a single binary that does all three.

Idempotent: each step checks for its output and skips if present.
Re-running bench_setup with everything in place is a no-op (just
prints "[skip] ..." for each step). Pass --rebuild to force-rebuild
the .gdb.

Three deps added (only used by this binary):
- ureq 2.12 (default-features = false, "tls"): small blocking HTTP
  client, ~one network call per archive
- zstd 0.13: streaming decompressor for the .tar.zst archives
- tar 0.4: streaming tar extractor

The .gdb build step shells out to the existing `gqlite --import-ldbc-csv`
binary rather than calling csv_loader directly, so the .gdb is
produced by exactly the same code path users run from the REPL — no
risk of bench_setup and gqlite drifting on the .gdb format.

Output:
  data-dir: bench/data
  downloading https://datasets.../ldbc-sf0.1.tar.zst
    done: 17.5 MiB in 0.7s
  extracting ldbc-sf0.1.tar.zst -> ldbc-sf0.1
    done in 0.3s
  downloading https://datasets.../substitution_parameters-sf0.1.tar.zst
    done: 198.4 KiB in 0.2s
  extracting substitution_parameters-sf0.1.tar.zst -> substitution_parameters-sf0.1
    done in 0.0s
  building ldbc-sf0.1.gdb from .../social_network-sf0.1-CsvBasic-LongDateFormatter
    done: ldbc-sf0.1.gdb in 39.1s
  ✓ setup complete

README's "Setup" section collapsed from 3 multi-block steps to a
single command. Manual steps still recoverable from this commit's
diff if anyone needs them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): document output format + rename ic CSV column to params

Two related fixes after a user pointed out the CSV format wasn't
self-explanatory:

1. The third CSV column was named `ic` but actually carries the
   substitution-param values (the IC name is already in column 1).
   Renamed to `params`. Reading the column as `params` matches the
   stderr summary's parens, e.g. `(19791209300143|1354060800000)`.

2. README's "Output" section was 4 lines listing column names with
   no semantics. Expanded to ~50 lines covering:
   - Why two streams (stdout=data, stderr=chatter, Unix convention)
   - Column-by-column meaning of the CSV
   - What `(personId|maxDate)` and `(n=3)` in the stderr summary mean
   - Worked cross-check showing min/median/mean/max of three CSV
     rows reproduces one stderr summary line

The CSV header rename is breaking-but-trivial: any existing scripts
reading `ic` would now get `params` and need a one-character change.
Since the bench has shipped maybe 2 days ago and nothing is pinned
to it, this is the right time.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): rename param_idx -> row + done summary at end of run

Two small UX fixes:

1. The CSV's `param_idx` column was the same number as the stderr
   summary's `row#N` label but with a different name. Renamed to
   `row` so the two streams use the same vocabulary.

2. After the last param of the last IC, the bench now prints:

       ✓ done — 1 IC(s) ran to completion
         IC2: 15 rows × 1 iter(s) = 15 runs;
              across-row median 11506.22ms
              (range 9109.93-16517.23ms)
       Peak RSS during query loop: 84.5 MiB (+75.4 MiB over baseline)

   Confirms the run finished cleanly (vs got killed mid-row) and
   gives the across-row median without grepping. Per-IC summary line
   so multi-IC runs (when more get implemented) compose cleanly.

   Implementation: `report` now returns the per-row median;
   `run_one_ic` collects them into an `IcSummary { ic_id, iters,
   row_medians_ns }` and `run_targets` builds the final aggregate
   from the Vec<IcSummary>.

README's Output section updated for both changes (column rename in
the CSV table; done-summary section added with worked example).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): pre-merge fixes from review (A1-A4 + warmup + --help)

Five fixes from the pre-merge review on the runner:

A1. Type-aware param substitution. New optional `param_types` array
    in the per-IC TOML; per-column tag is "int" (raw substitution,
    default) or "string" (wraps in single quotes, rejects values
    containing `'` since gqlite's lexer has no escape syntax).
    Unblocks future ICs with string params (firstName, country
    names, tag names) without putting quote-handling onto the
    TOML author. ic2.toml is unchanged — defaults to all-int.

A2. Duplicate IC id detection in load_queries. After sort, scan
    adjacent for matching ids; fail fast naming both source files.
    Catches the footgun of copying ic2.toml without renumbering.

A3. Median for even N. Previously returned sorted[n/2] (upper of
    two midpoints) for even-sized samples — neither the true
    median nor a labeled percentile. New median_of() helper does
    the standard average for even N. Doesn't affect --iters 3
    (odd); fixes --iters 2/4/6.

A4. Schema validation at load time. New IcQuery::validate()
    enforces that "implemented" carries query+params_file and
    "blocked" carries blocked_reason, plus param_columns/param_types
    length agreement. Failure now surfaces before any backend is
    opened — not 8 minutes into a multi-IC run.

B4. --warmup N flag. Per-row warmup that runs N extra iters of the
    query before the timed ones, discarding their measurements.
    Banner now reads "{warmup}+{iters} iters/param (warmup+measured)"
    so the count is visible. Mirrors typecheck_bench's --warmup.
    Default 0 (preserves existing behavior).

Plus a small ergonomic fix:

  - `ldbc_bench --help` (with no positional .gdb) now prints usage
    instead of trying to open "--help" as a database file.

What was not fixed (deliberately, low impact for now):

  - B1 CSV ambiguity if param values contain `|` or `;`: real
    issue but no LDBC param file has these chars in our scope.
  - B2 peak RSS sampling between rows, not during query: watchdog
    thread isn't worth the complexity.
  - B3 mmap inflates Windows RSS: documented in the README; can't
    fix from the bench side.
  - B5 hardcoded SF0.1 in bench_setup: out of scope for now.
  - B6 no checksums on downloads: would need LDBC's published
    hashes; out of scope.
  - B7 --ic 2,2 doesn't dedup: not a bug, just undocumented; fine.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): second-pass review fixes — stale docs + warmup honesty

Findings from a second-pass review of the runner + README:

Stale doc references:

- Rustdoc said `--show-blocked` mode (real flag is `--ic blocked`).
- Usage text recommended `--warmup 1` to absorb cold-cache cost.
  Empirically wrong for our scope: looking at iter-level data from
  the SF0.1 iter=3 run, iter 0 is NOT systematically the slowest
  within a row — variance is general jitter, not first-iter cold
  cache. The OS page cache is cross-row, so once row#0's iters
  populate it, subsequent rows + iters all hit warm pages.
  Reframed: --warmup is available for parity with typecheck_bench
  and for larger SFs where the cache can't hold the working set,
  but at SF0.1 it's overhead.
- Warmup loop's inline comment claimed "iter 0 cold-cache"; updated
  to match the empirical finding.

Stale README:

- The flag table was missing --warmup entirely (added in a previous
  commit but the doc table didn't get the row).
- The stderr summary example showed `3 iters/param` but the new
  banner format is `0+3 iters/param (warmup+measured)`.
- "Adding a new IC" recipe didn't mention `param_types` for ICs
  with string params (IC1 firstName, IC3/IC11 country names, IC6
  tagName, etc.). Now mentions `param_types` and the gqlite
  no-escape-for-quotes caveat.

What's NOT a stale-doc issue (verified during the pass):
- All `# Spec: ...` URLs in TOMLs point to ldbc/ldbc_snb_docs main
  branch which is stable.
- The `_ic` parameter in `report()` is used in the eprintln line.
- Edge cases (--iters 0, --warmup with no iters, etc.) handled
  gracefully via the `is_empty()` skip in the final summary.
- `param_columns` doesn't strictly need to match the param file's
  header — the runtime reads the file's own header. Documented.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): real fixes from a deeper review pass + 13 unit tests

The previous "second-pass review" was thin (mostly stale-doc fixes).
A proper adversarial pass surfaced four substantive issues:

C1. Arg parsing crashed with index-out-of-bounds when a flag had
    no value (e.g. `ldbc_bench db.gdb --iters` with nothing after).
    New `need_value` helper checks bounds and prints a clear
    "{flag} requires a value" before exiting.

C2. "All rows failed" was reported as `✓ done` with no per-IC line.
    A user could read this as "ran successfully, no rows" instead of
    "every row erred." IcSummary now tracks `failed_rows`; final
    summary prints `⚠ done with errors` when any IC had zero
    successful rows, and per-IC lines surface failure counts even
    when row_medians_ns is empty.

C3. `substitute()` silently left unmatched `{...}` placeholders if
    the query template referenced a column not in the param file's
    header. The gqlite parser would then fail with "Parse error" —
    a misleading symptom for the actual problem (TOML/header
    mismatch). New `find_unsubstituted_placeholder` post-pass scan
    catches bare `{ident}` tokens — careful to skip the legitimate
    `{id: 933}` descriptor shorthand and `{1,3}` repetition syntax.
    Verified end-to-end: changing ic2.toml's query to reference a
    nonexistent {countryName} now produces:
      SUBSTITUTE ERROR on row 0 (params: ...): unsubstituted
      placeholder {countryName} remains after substitution; check
      that the query template's column names match the param
      file's header (personId, maxDate)

C4. Zero unit tests for 700 lines of code. Added a tests module
    covering median_of (incl. the even-N regression A3 caught
    earlier), substitute (int/string/bad-type/quote-rejection/
    unmatched-placeholder), and find_unsubstituted_placeholder
    (true positives + false-positive guards on descriptor
    shorthand and repetition quantifiers). 13 tests, all green.

Plus reframed the warmup recommendation honestly. Earlier commits
said "empirically not needed at SF0.1" based on 6 data points — too
strong a claim from too little data. New text acknowledges OS
page-cache mechanics (real, well-known) but is calibrated about
what we've actually measured: "varies by OS / RAM / storage; on a
warm machine with free RAM >= dataset size it usually doesn't help,
on cold or cache-constrained machines it can absorb a cold-iter
spike." Default stays 0; recommendation is "raise it if the
per-row summary shows iter 0 consistently dominating."

What's still NOT fixed (deliberately, low impact):
  C5 TOML schema versioning, C6 absolute-path footgun in
  `params_file`, C7 .gdb freshness vs loader version. Each is
  documentation territory or a system-wide gqlite concern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): final pre-merge fixes — arg parsing + idempotency

Three more issues found in the fourth review pass:

E1. The --ic value parser still used .expect() (panic with raw
    Rust error) for non-numeric values like `--ic 2,abc`.
    Inconsistent with the C1 fix from the previous round.
    Now a clean error: "invalid --ic value \"2,abc\":
    invalid digit found in string. expected: N, N,M,K, all, or
    blocked".

F1. bench_setup's arg parser had the same C1-style index-out-of-bounds
    risk (`bench_setup --data-dir` with no value would panic). Added
    the same `need_value` helper used in ldbc_bench.

F2. bench_setup's idempotency check was "does the directory exist?".
    A partially-extracted directory (interrupted run, network drop
    mid-extract) would skip re-extraction and feed garbage data into
    the .gdb build. Now checks for a known sub-path that only exists
    post-successful-extract:
      - dataset:  csv_dir/dynamic/ must be a directory
      - params:   params_inner/interactive_2_param.txt must be a file
    Detected the partial-extract case in a synthetic test (empty
    csv_dir → bench_setup correctly tried to re-download instead of
    skipping).

13 unit tests still pass. End-to-end smoke tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): loop-review fixes — error handling, idempotency, lint, doc

User asked for a loop-review until I find no issues. Multiple
passes; this commit is everything that surfaced.

Pass 1 — `.expect()` panics replaced with clean exits:
  - load_params on empty file (ldbc_bench)
  - fs::create_dir_all on the data-dir (bench_setup)
  - env::current_exe lookup (bench_setup)
  - fs::remove_file on --rebuild (bench_setup)

Pass 2 — TOML schema strictness:
  - #[serde(deny_unknown_fields)] on IcQuery so a typo like
    `parms_file` produces an immediate parse error naming the
    unknown key, instead of silently leaving params_file=None
    and surfacing later as "missing `params_file`".

Pass 3 — three more `.expect()` on backend dispatch:
  - csv_dir on memory backend → "--csv-dir is required..."
  - csv_loader::load_from_ldbc_csv_dir → propagates the io::Error
  - LazyGraphStore::open / DiskGraphStore::open → ditto

Pass 4 — README's flag table had stale strong claim
  "--warmup empirically doesn't help at SF0.1" — softened to match
  the calibrated runner help text (varies by OS / RAM / storage).

Pass 5 — `--iters 0` was silently allowed but produced no output:
  the bench would run warmup iters (if any), record zero measured
  iters, emit no CSV, and print "✓ done" with no per-IC line.
  Now rejected at arg parse with "--iters must be >= 1".

Pass 8 — clippy warnings:
  - cmp_owned: `data_dir == PathBuf::from("...")` → `Path::new`
  - doc_overindented_list_items: re-indented two doc bullets

Pass 9 — `--ic 99` (no match) printed "✓ done — 0 IC(s) ran to
  completion" which falsely implied success. Now prints
  "⚠ done — 0 IC(s) ran (none of the requested ICs were
  implemented in the catalog). Use --ic blocked to see the
  inventory."

13 unit tests still pass. cargo clippy clean. cargo fmt clean.
End-to-end smoke (--ic blocked, --ic 2, --ic 99, --ic 0, --ic 2
with bad value) all produce clean error/output paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* bench(ldbc): rewrite stderr docs — full sample + RSS/baseline glossary

Two issues a reader pointed out in the README's stderr section:

1. The example was incomplete — it skipped the setup lines (the
   "Loaded N IC definitions / RSS baseline / Loading .gdb / RSS
   after open" header) and the "✓ done" final summary that the
   bench actually emits. README only showed the per-row middle.
   Now shows a complete stderr trace from --ic 2 --iters 3, top
   to bottom, so readers can match what they see live to what's
   documented.

2. RSS and "baseline" were used in the example without definition,
   leaving readers (correctly) confused about what the numbers
   mean. New "What RSS and baseline mean" subsection: RSS is OS-
   reported RAM use; baseline is process startup; the
   "+X MiB over baseline" form makes per-step cost readable
   without manual subtraction. Worked example using the actual
   sample numbers.

Per-row docs (min/med/mean/max meaning, sample-size note when n=1)
unchanged — those were already correct, just buried under the
incomplete example.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants