Phase 7b: vector distance functions + ORDER BY expressions (operators deferred to 7b.1) #44
Merged
Three SQL functions land in the executor's expression evaluator:
- vec_distance_l2(a, b): Euclidean, √Σ(aᵢ−bᵢ)²
- vec_distance_cosine(a, b): 1 − a·b/(‖a‖·‖b‖); errors on zero-mag
- vec_distance_dot(a, b): −(a·b); negated so smaller-is-closer
  like the others (pgvector convention)
All three return Value::Real(f64). Internal math is f32 (matching
the Vec<f32> input type from VECTOR(N) columns); results are
widened to f64 at the return boundary so distances slot cleanly
into the executor's existing arithmetic/comparison paths.
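To make the three formulas concrete, here is a minimal standalone sketch, not the code in `src/sql/executor.rs`. It assumes slices of `f32` as input and returns `f64` directly instead of `Value::Real`; the `DistanceError` type and the `check_dims` and `dot` helpers are hypothetical names introduced only for this illustration.

```rust
/// Hypothetical error type for this sketch; the executor's real error type may differ.
#[derive(Debug, PartialEq)]
pub enum DistanceError {
    DimensionMismatch { lhs: usize, rhs: usize },
    ZeroMagnitude,
}

/// Reject inputs whose dimensions differ before doing any math.
fn check_dims(a: &[f32], b: &[f32]) -> Result<(), DistanceError> {
    if a.len() != b.len() {
        return Err(DistanceError::DimensionMismatch { lhs: a.len(), rhs: b.len() });
    }
    Ok(())
}

/// Plain f32 dot product, shared by the cosine and dot distances.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Euclidean distance: sqrt(sum((a_i - b_i)^2)), computed in f32, widened to f64.
pub fn vec_distance_l2(a: &[f32], b: &[f32]) -> Result<f64, DistanceError> {
    check_dims(a, b)?;
    let sum_sq: f32 = a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum();
    Ok(f64::from(sum_sq.sqrt()))
}

/// Cosine distance: 1 - a.b / (|a| * |b|); errors when either vector has zero magnitude.
pub fn vec_distance_cosine(a: &[f32], b: &[f32]) -> Result<f64, DistanceError> {
    check_dims(a, b)?;
    let norm_a = dot(a, a).sqrt();
    let norm_b = dot(b, b).sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return Err(DistanceError::ZeroMagnitude);
    }
    Ok(f64::from(1.0 - dot(a, b) / (norm_a * norm_b)))
}

/// Negated dot product, so that smaller still means closer.
pub fn vec_distance_dot(a: &[f32], b: &[f32]) -> Result<f64, DistanceError> {
    check_dims(a, b)?;
    Ok(f64::from(-dot(a, b)))
}
```

With these definitions, `vec_distance_l2(&[0.0, 0.0], &[3.0, 4.0])` is the 3-4-5 triangle case and orthogonal unit vectors give a cosine distance of 1, mirroring the unit tests listed in this PR.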
The KNN query shape works end-to-end:
    SELECT id FROM docs
    ORDER BY vec_distance_l2(embedding, [0.1, 0.2, ..., 0.0])
    LIMIT 10;
That requires a parser change beyond just adding the functions:
ORDER BY previously only accepted bare column refs. Phase 7b
widens it to accept arbitrary expressions, with `eval_expr`
called per-row in `sort_rowids`. This is a strict superset:
`ORDER BY col` still works because Expr::Identifier takes the
same path. Sort keys are pre-computed up front, one `eval_expr`
call per row (N evaluations), so the comparator does its
O(N log N) comparisons against pre-evaluated Values rather than
re-evaluating the expression O(N log N) times (will matter once
7d's HNSW is the hot path).
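The pre-computation described above can be sketched as follows. This is an illustration under assumptions, not the actual `sort_rowids`: the function name `sort_rowids_by_key`, the `i64` row-id type, and the `f64` key type are all hypothetical, and the closure stands in for a per-row `eval_expr` call.

```rust
/// Sort row ids by a key that is evaluated exactly once per row.
/// `eval_key` stands in for evaluating the ORDER BY expression on a row.
pub fn sort_rowids_by_key<F>(rowids: &[i64], mut eval_key: F) -> Vec<i64>
where
    F: FnMut(i64) -> f64,
{
    // Step 1: N evaluations, pairing each row id with its pre-computed key.
    let mut keyed: Vec<(f64, i64)> = rowids.iter().map(|&id| (eval_key(id), id)).collect();

    // Step 2: the comparator only touches the pre-computed keys.
    // `total_cmp` gives f64 a total order, so NaN cannot break the sort.
    keyed.sort_by(|a, b| a.0.total_cmp(&b.0));

    // Step 3: drop the keys and return the row ids in sorted order.
    keyed.into_iter().map(|(_, id)| id).collect()
}
```

Because the comparator never calls `eval_key`, an expensive expression such as a distance over a high-dimensional vector runs N times in total, no matter how many comparisons the sort performs.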
**Scope correction recorded in docs/phase-7-plan.md:** Q6 anticipated
pgvector-style operators (<-> / <=> / <#>) as a "tiny parser
change". Reality: sqlparser fails outright on `<->` and `<#>`
("Expected: an expression, found: ->"); only `<=>` parses, and
that's MySQL Spaceship null-safe-equality. Supporting all three
needs either a sqlparser fork or a SQL-string preprocessor —
neither tiny.
Decision: ship 7b with **functions only**. Operators move to a
follow-up sub-phase 7b.1. Note added to phase-7-plan.md and to
supported-sql.md so users reading the docs see the rationale.
KNN queries still work — just verbose; 7b.1 swaps to operator
form without other behavior change.
**Other parser change baked in for free:**
The Expr::Identifier evaluator now recognizes bracket-quoted
identifiers (`quote_style == Some('[')`) and parses them as
vector literals via parse_vector_literal. Same trick the INSERT
parser uses (sqlparser inherits MSSQL `[name]` syntax) — needed
its own copy here because expression-eval runs on a different
code path. Without this, the two-vector-args extractor in
eval_function() would see the bracket literal as a column ref,
look it up against the table, and find nothing.
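A rough sketch of the bracket-literal branch, under stated assumptions: the `Ident` struct below is a stand-in carrying only the `value` and `quote_style` fields of sqlparser's identifier type, the `Value` enum is reduced to two illustrative variants, and the body of `parse_vector_literal` is a guess at the behaviour rather than the real implementation. It also assumes the identifier's `value` holds the text between the brackets, without the brackets themselves.

```rust
/// Stand-in for the two fields of sqlparser's `Ident` that matter here.
pub struct Ident {
    pub value: String,
    pub quote_style: Option<char>,
}

/// Reduced, hypothetical value type; `ColumnRef` is a placeholder for a real row lookup.
#[derive(Debug, PartialEq)]
pub enum Value {
    Vector(Vec<f32>),
    ColumnRef(String),
}

/// Hypothetical body: parse comma-separated f32 elements such as "0.1, 0.2, 0.3".
pub fn parse_vector_literal(inner: &str) -> Result<Vec<f32>, String> {
    inner
        .split(',')
        .map(|part| {
            let part = part.trim();
            part.parse::<f32>()
                .map_err(|e| format!("invalid vector element '{part}': {e}"))
        })
        .collect()
}

/// Bracket-quoted identifiers become vector literals; everything else stays a column ref.
pub fn eval_identifier(ident: &Ident) -> Result<Value, String> {
    if ident.quote_style == Some('[') {
        return parse_vector_literal(&ident.value).map(Value::Vector);
    }
    Ok(Value::ColumnRef(ident.value.clone()))
}
```

The check on `quote_style` has to come before any column lookup, since a lookup of the literal text against the table would find no matching column.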
**Tests** — 200 lib tests passing (was 184; +16 new):
- Unit tests on the math: identical-is-zero, orthogonal-is-1
(cosine), 3-4-5 triangle (l2), unit-norm dot ≡ cosine-1,
zero-magnitude cosine errors, dot negation
- End-to-end via process_command: WHERE filter for each of
the 3 functions, KNN-shape ORDER BY ... LIMIT, dim-mismatch
clean error, unknown-function error mentions function name
**LOC**: ~250 for functions + ~50 for ORDER BY parser/executor
extension, ~300 total. Slightly over the original Q-time
estimate (was ~250 flat). File format unchanged from 7a's v4:
no new cell tags needed, since distance is a pure compute.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Three vector-distance SQL functions wired into the executor's expression evaluator — plus the ORDER BY parser change that makes KNN queries actually work end-to-end.
What lands
- `vec_distance_l2` / `vec_distance_cosine` / `vec_distance_dot`: `src/sql/executor.rs`, new `eval_function` dispatch
- `Expr::Identifier` with `quote_style == Some('[')` parses as `Value::Vector`: `src/sql/executor.rs`
- ORDER BY accepts arbitrary expressions: `src/sql/parser/select.rs` + `src/sql/executor.rs`
All three distance functions return `Value::Real(f64)`. Internal math is f32 (matches `Vec<f32>` from VECTOR(N) columns); widened at the return boundary so distances slot into the existing arithmetic/comparison paths.
Scope correction (recorded in `docs/phase-7-plan.md`)
Q6 anticipated pgvector-style operators (`<->` / `<=>` / `<#>`) as a "tiny parser change." Reality: sqlparser fails outright on `<->` and `<#>` ("Expected: an expression, found: ->"); only `<=>` parses (as MySQL Spaceship). Supporting all three needs either a sqlparser fork or a SQL-string preprocessor, and neither is tiny.
Decision: ship functions only. Operators move to Phase 7b.1 follow-up.
KNN queries still work, just verbosely. 7b.1 will swap function-call form to operator form without any other behavior change.
Tests
- End-to-end via `process_command`: KNN-shape `ORDER BY … LIMIT`
- Dimension mismatch reported as a clean error (`lhs=2, rhs=3`)
What stays out of scope (intentionally)
- Operators (`<->`, `<=>`, `<#>`): Phase 7b.1
- `SELECT vec_distance_l2(...) AS dist FROM ...`: needs `AS alias` support; future polish
Verified clean
- `cargo fmt --all -- --check`: clean
- `cargo check --workspace --exclude sqlrite-python --exclude sqlrite-nodejs`: clean
- `cargo check -p sqlrite-python -p sqlrite-nodejs`: clean
- `cd sdk/wasm && cargo check`: clean
- `cargo test -p sqlrite-engine --lib`: 200/200
🤖 Generated with Claude Code