Phase 7b: vector distance functions + ORDER BY expressions (operators deferred to 7b.1) #44

Merged: joaoh82 merged 1 commit into main from feat/vector-distance-functions on Apr 27, 2026

joaoh82 commented on Apr 27, 2026

Summary

Three vector-distance SQL functions wired into the executor's expression evaluator — plus the ORDER BY parser change that makes KNN queries actually work end-to-end.

-- Single distance computation in WHERE
SELECT id FROM docs
WHERE vec_distance_l2(embedding, [0.1, 0.2, ..., 0.0]) < 0.5;

-- KNN: sort by distance, take top-k
SELECT id FROM docs
ORDER BY vec_distance_l2(embedding, [0.1, 0.2, ..., 0.0]) ASC
LIMIT 10;

What lands

| What | Where |
| --- | --- |
| Functions: vec_distance_l2 / vec_distance_cosine / vec_distance_dot | src/sql/executor.rs (new eval_function dispatch) |
| Bracket literals in expressions: Expr::Identifier with quote_style == Some('[') parses as Value::Vector | src/sql/executor.rs |
| ORDER BY expressions: parser + executor accept any expression (was bare column refs only) | src/sql/parser/select.rs + src/sql/executor.rs |

All three distance functions return Value::Real(f64). Internal math is f32 (matches the Vec<f32> input type from VECTOR(N) columns) and is widened to f64 at the return boundary so distances slot into the existing arithmetic/comparison paths.

Scope correction (recorded in docs/phase-7-plan.md)

Q6 anticipated pgvector-style operators (<-> / <=> / <#>) as a "tiny parser change." Reality: sqlparser fails outright on <-> and <#> ("Expected: an expression, found: ->"); only <=> parses (as MySQL Spaceship). Supporting all three needs either a sqlparser fork or a SQL-string preprocessor — neither tiny.

Decision: ship functions only. Operators move to Phase 7b.1 follow-up.

KNN queries still work — just verbose. 7b.1 will swap function-call form to operator form without any other behavior change.

Tests

  • 200 lib tests passing (was 184 — +16 new)
  • Unit tests on the math:
    • identical-is-zero, orthogonal-is-1 (cosine)
    • 3-4-5 triangle (l2)
    • unit-norm dot ≡ cosine - 1
    • zero-magnitude cosine errors clean
    • dot product negation
  • End-to-end via process_command:
    • WHERE filter for each of the 3 functions
    • KNN-shape ORDER BY … LIMIT
    • Dim-mismatch clean error (lhs=2, rhs=3)
    • Unknown-function error mentions function name
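
Two of the math checks above can be worked by hand. The derivation below is a sketch that uses the distance definitions from this PR; the names d_cos, d_dot and d_l2 are shorthand introduced here for vec_distance_cosine, vec_distance_dot and vec_distance_l2, and the sample vectors for the 3-4-5 case are an assumed choice.

```latex
% Unit-norm case: assume \|a\| = \|b\| = 1.
\begin{aligned}
d_{\cos}(a, b) &= 1 - \frac{a \cdot b}{\|a\|\,\|b\|} = 1 - a \cdot b \\
d_{\mathrm{dot}}(a, b) &= -(a \cdot b) = d_{\cos}(a, b) - 1
\end{aligned}

% 3-4-5 triangle, with a = (0, 0) and b = (3, 4):
d_{\ell_2}(a, b) = \sqrt{(0 - 3)^2 + (0 - 4)^2} = \sqrt{9 + 16} = 5
```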

What stays out of scope (intentionally)

  • Operator forms (<->, <=>, <#>) — Phase 7b.1
  • Function calls in projection (SELECT vec_distance_l2(...) AS dist FROM ...) — needs AS alias support; future polish
  • Multi-column ORDER BY — still single sort key
  • HNSW index probing — Phase 7d. Today this is full-scan + sort.

Verified clean

  • cargo fmt --all -- --check — clean
  • cargo check --workspace --exclude sqlrite-python --exclude sqlrite-nodejs — clean
  • cargo check -p sqlrite-python -p sqlrite-nodejs — clean
  • cd sdk/wasm && cargo check — clean
  • cargo test -p sqlrite-engine --lib — 200/200
  • No file format change (no v5; the v4 envelope already covers this work)

🤖 Generated with Claude Code

Three SQL functions land in the executor's expression evaluator:

  - vec_distance_l2(a, b)      Euclidean √Σ(aᵢ−bᵢ)²
  - vec_distance_cosine(a, b)  1 − a·b/(‖a‖·‖b‖); errors on zero-mag
  - vec_distance_dot(a, b)     −(a·b); negated so smaller-is-closer
                                like the others (pgvector convention)

All three return Value::Real(f64). Internal math is f32 (matches
the Vec<f32> input type from VECTOR(N) columns); widening to f64
at the return boundary so distances slot cleanly into the
executor's existing arithmetic/comparison paths.
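
As an illustration of the three formulas and the f32-in, f64-out boundary, here is a minimal standalone sketch. It is not the code from src/sql/executor.rs: the `VecError` type, the helper names, and the slice-based signatures are assumptions made for this example, and the real functions return `Value::Real` through the executor's own error type.

```rust
/// Hypothetical error type for this sketch; the executor's real error
/// type is not shown in the PR.
#[derive(Debug, PartialEq)]
pub enum VecError {
    DimMismatch { lhs: usize, rhs: usize },
    ZeroMagnitude,
}

/// Reject operands of different dimension before doing any math.
fn check_dims(a: &[f32], b: &[f32]) -> Result<(), VecError> {
    if a.len() != b.len() {
        return Err(VecError::DimMismatch { lhs: a.len(), rhs: b.len() });
    }
    Ok(())
}

/// Plain f32 dot product, shared by the cosine and dot distances.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Euclidean distance: sqrt of the summed squared differences.
pub fn vec_distance_l2(a: &[f32], b: &[f32]) -> Result<f64, VecError> {
    check_dims(a, b)?;
    let sum_sq: f32 = a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum();
    // Math stays in f32; widen only at the return boundary.
    Ok(f64::from(sum_sq.sqrt()))
}

/// Cosine distance: 1 - a.b / (|a| * |b|); errors on a zero-magnitude operand.
pub fn vec_distance_cosine(a: &[f32], b: &[f32]) -> Result<f64, VecError> {
    check_dims(a, b)?;
    let norm_a = dot(a, a).sqrt();
    let norm_b = dot(b, b).sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return Err(VecError::ZeroMagnitude);
    }
    Ok(f64::from(1.0 - dot(a, b) / (norm_a * norm_b)))
}

/// Negated dot product, so that smaller means closer like the other two.
pub fn vec_distance_dot(a: &[f32], b: &[f32]) -> Result<f64, VecError> {
    check_dims(a, b)?;
    Ok(f64::from(-dot(a, b)))
}
```

Under these assumptions, `vec_distance_l2(&[0.0, 0.0], &[3.0, 4.0])` yields `Ok(5.0)`, and passing a 2-element and a 3-element vector yields the dimension-mismatch error with lhs=2, rhs=3.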

The KNN query shape works end-to-end:

  SELECT id FROM docs
  ORDER BY vec_distance_l2(embedding, [0.1, 0.2, ..., 0.0])
  LIMIT 10;

That requires a parser change beyond just adding the functions:
ORDER BY previously only accepted bare column refs. Phase 7b
widens it to accept arbitrary expressions, with `eval_expr`
called per-row in `sort_rowids`. This is a strict superset:
`ORDER BY col` still works because Expr::Identifier takes the
same path. Sort keys are evaluated once per row up front (N
evaluations), so the comparator makes its O(N log N) comparisons
against pre-evaluated Values instead of re-evaluating the
expression on every comparison (will matter once 7d's HNSW is
the hot path).
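
The precompute-then-sort idea can be sketched in isolation. This is a simplified illustration, not the actual `sort_rowids`: the function name `sort_rowids_by_key`, the `i64` rowid type, and the plain `f64` key are assumptions, and the closure stands in for the per-row `eval_expr` call, which in the real executor produces a Value and can fail.

```rust
/// Evaluate the sort expression once per row, then sort on the cached keys.
/// `eval_key` is a stand-in for the executor's per-row expression evaluation.
pub fn sort_rowids_by_key<F>(rowids: &[i64], mut eval_key: F, descending: bool) -> Vec<i64>
where
    F: FnMut(i64) -> f64,
{
    // Exactly one evaluation per row, done before any comparison happens.
    let mut keyed: Vec<(f64, i64)> = rowids.iter().map(|&id| (eval_key(id), id)).collect();

    // The comparator only touches cached f64 keys. `total_cmp` gives a
    // total order, so NaN keys cannot make the sort panic or misbehave.
    keyed.sort_by(|a, b| {
        let ord = a.0.total_cmp(&b.0);
        if descending { ord.reverse() } else { ord }
    });

    // Drop the keys and return rowids in sorted order.
    keyed.into_iter().map(|(_, id)| id).collect()
}
```

A KNN-shaped query would then take the first k entries of the returned vector to apply LIMIT k.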

**Scope correction recorded in docs/phase-7-plan.md:** Q6 anticipated
pgvector-style operators (<-> / <=> / <#>) as a "tiny parser
change". Reality: sqlparser fails outright on `<->` and `<#>`
("Expected: an expression, found: ->"); only `<=>` parses, and
that's MySQL Spaceship null-safe-equality. Supporting all three
needs either a sqlparser fork or a SQL-string preprocessor —
neither tiny.

Decision: ship 7b with **functions only**. Operators move to a
follow-up sub-phase 7b.1. Note added to phase-7-plan.md and to
supported-sql.md so users reading the docs see the rationale.
KNN queries still work — just verbose; 7b.1 swaps to operator
form without other behavior change.

**Other parser change baked in for free:**

The Expr::Identifier evaluator now recognizes bracket-quoted
identifiers (`quote_style == Some('[')`) and parses them as
vector literals via parse_vector_literal. Same trick the INSERT
parser uses (sqlparser inherits MSSQL `[name]` syntax) — needed
its own copy here because expression-eval runs on a different
code path. Without this, the two-vector-args extractor in
eval_function() would see the bracket literal as a column ref,
look it up against the table, and find nothing.
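
To make the bracket-literal step concrete, here is a rough sketch of what a `parse_vector_literal` helper could look like. The PR names the function but does not show its signature, so the `&str` input (the text between the brackets, as carried in the identifier's value), the `String` error, and the rejection of empty input are all assumptions for illustration.

```rust
/// Parse the text found between the brackets of a literal such as
/// `[0.5, 0.25, -1.0]` into a vector of f32 components.
/// Hypothetical signature: the real helper may differ.
pub fn parse_vector_literal(inner: &str) -> Result<Vec<f32>, String> {
    inner
        .split(',')
        .map(|tok| {
            let tok = tok.trim();
            // Any token that is not a valid float fails the whole literal.
            tok.parse::<f32>()
                .map_err(|_| format!("invalid vector element: '{}'", tok))
        })
        // Collecting into Result stops at the first parse error.
        .collect()
}
```

In the evaluator, a check on `quote_style == Some('[')` would route the identifier's text through a helper like this instead of treating it as a column lookup.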

**Tests** — 200 lib tests passing (was 184; +16 new):

  - Unit tests on the math: identical-is-zero, orthogonal-is-1
    (cosine), 3-4-5 triangle (l2), unit-norm dot ≡ cosine-1,
    zero-magnitude cosine errors, dot negation
  - End-to-end via process_command: WHERE filter for each of
    the 3 functions, KNN-shape ORDER BY ... LIMIT, dim-mismatch
    clean error, unknown-function error mentions function name

**LOC**: ~250 for functions + ~50 for ORDER BY parser/executor
extension. Slightly over the original Q-time estimate (was ~250
flat). Total file format unchanged from 7a's v4 — no new cell
tags needed; distance is a pure compute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

joaoh82 merged commit 1ba5b67 into main on Apr 27, 2026
16 checks passed