Skip to content

Commit fd69fed

Browse files
committed
docs(knn): document KNN filtering behavior with metadata vs JOINs
Add comprehensive documentation explaining when filters are applied: - Metadata columns, partition keys, distance: filtered DURING KNN search - JOIN filters, subqueries: filtered AFTER KNN search (may return < k results) - Workarounds: use metadata columns or over-fetch with LIMIT Add test-knn-filtering.py to validate documented behavior.
1 parent de0edf3 commit fd69fed

File tree

2 files changed

+356
-0
lines changed

2 files changed

+356
-0
lines changed

site/features/knn.md

Lines changed: 99 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -84,6 +84,105 @@ where contents_embedding match :query
8484

8585
<!-- TODO match on vector column, k vs limit, distance_metric configurable, etc.-->
8686

87+
## Filtering KNN Results
88+
89+
You can combine `MATCH` queries with additional `WHERE` clauses to filter results.
90+
**Where you put your filter data determines when filtering happens.**
91+
92+
### Filters applied DURING KNN search
93+
94+
These constraints are evaluated while finding nearest neighbors:
95+
96+
| Column Type | Supported Operators | Example |
97+
|-------------|---------------------|---------|
98+
| **Metadata columns** | `=`, `!=`, `<`, `<=`, `>`, `>=`, `IN`, `LIKE`, `GLOB`, `IS NULL`, `IS NOT NULL` | `AND category = 'electronics'` |
99+
| **Partition keys** | `=`, `!=`, `<`, `<=`, `>`, `>=` | `AND tenant_id = 123` |
100+
| **Distance** | `<`, `<=`, `>`, `>=` | `AND distance < 0.5` |
101+
| **Rowid** | `IN (...)` | `AND rowid IN (1, 2, 3)` |
102+
103+
```sql
104+
-- Metadata filter: applied DURING search, guarantees up to k matching results
105+
create virtual table products using vec0(
106+
embedding float[128],
107+
category text -- metadata column
108+
);
109+
110+
select rowid, distance
111+
from products
112+
where embedding match :query
113+
and k = 10
114+
and category = 'electronics'; -- filtered during KNN search
115+
```
116+
117+
### Filters applied AFTER KNN search
118+
119+
Filters on **joined tables** or **subquery results** are applied after the KNN search returns.
120+
This means you may get **fewer than k results**.
121+
122+
```sql
123+
-- ⚠️ JOIN filter: applied AFTER search, may return fewer than k results
124+
create virtual table products using vec0(embedding float[128]);
125+
create table categories (product_id integer primary key, category text);
126+
127+
select p.rowid, p.distance, c.category
128+
from products p
129+
join categories c on p.rowid = c.product_id
130+
where p.embedding match :query
131+
and k = 10
132+
and c.category = 'electronics'; -- filtered AFTER KNN returns 10 rows
133+
```
134+
135+
If KNN returns rowids `[0,1,2,3,4,5,6,7,8,9]` but only `[0,2,4,6,8]` are electronics,
136+
you get 5 results instead of 10.
137+
138+
### Workarounds for JOIN filtering
139+
140+
**Option 1: Use metadata columns** (recommended)
141+
142+
Move filterable attributes into the `vec0` table as metadata columns:
143+
144+
```sql
145+
create virtual table products using vec0(
146+
embedding float[128],
147+
category text -- now a metadata column, filtered during search
148+
);
149+
```
150+
151+
**Option 2: Over-fetch and limit**
152+
153+
Request more results than needed, then filter and limit:
154+
155+
```sql
156+
select p.rowid, p.distance, c.category
157+
from products p
158+
join categories c on p.rowid = c.product_id
159+
where p.embedding match :query
160+
and k = 100 -- fetch extra
161+
and c.category = 'electronics'
162+
limit 10; -- then limit
163+
```
164+
165+
This works but wastes computation and may still miss results if your filter
166+
is highly selective.
167+
168+
### Auxiliary columns cannot filter KNN
169+
170+
[Auxiliary columns](./vec0.md) (declared with `+` prefix) store data but are not indexed.
171+
They cannot be used in KNN `WHERE` clauses:
172+
173+
```sql
174+
create virtual table products using vec0(
175+
embedding float[128],
176+
+description text -- auxiliary: stored but not filterable
177+
);
178+
179+
-- ❌ Error: auxiliary columns cannot filter KNN queries
180+
select * from products
181+
where embedding match :query and k = 10 and description like '%foo%';
182+
```
183+
184+
Use auxiliary columns for data you need to retrieve, not filter.
185+
87186
## Manually with SQL scalar functions
88187

89188
You don't need a `vec0` virtual table to perform KNN searches with `sqlite-vec`.

tests/test-knn-filtering.py

Lines changed: 257 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,257 @@
1+
"""
2+
Tests demonstrating KNN filtering behavior: metadata columns vs JOINs.
3+
4+
Key insight: Filters on metadata columns are applied DURING the KNN search,
5+
while filters on joined tables are applied AFTER. This affects result counts.
6+
"""
7+
import pytest
8+
9+
10+
def test_metadata_filter_during_knn(db):
11+
"""Metadata column filters are applied DURING KNN search.
12+
13+
This guarantees you get up to k results matching your filter.
14+
"""
15+
db.execute("""
16+
CREATE VIRTUAL TABLE products USING vec0(
17+
embedding float[4],
18+
category TEXT
19+
)
20+
""")
21+
22+
# Insert 10 items: 5 electronics, 5 clothing
23+
for i in range(10):
24+
category = "electronics" if i < 5 else "clothing"
25+
db.execute(
26+
"INSERT INTO products(rowid, embedding, category) VALUES (?, ?, ?)",
27+
[i, f"[{i}, {i}, {i}, {i}]", category]
28+
)
29+
30+
# Request k=5 with category filter - gets 5 electronics
31+
results = db.execute("""
32+
SELECT rowid, distance, category FROM products
33+
WHERE embedding MATCH '[0,0,0,0]'
34+
AND k = 5
35+
AND category = 'electronics'
36+
ORDER BY distance
37+
""").fetchall()
38+
39+
assert len(results) == 5, "Should get exactly 5 results (all electronics)"
40+
assert all(r[2] == "electronics" for r in results)
41+
42+
43+
def test_join_filter_after_knn(db):
44+
"""JOIN filters are applied AFTER KNN search returns.
45+
46+
This means you may get fewer than k results if the join eliminates rows.
47+
"""
48+
# vec0 table with NO category metadata
49+
db.execute("""
50+
CREATE VIRTUAL TABLE products USING vec0(
51+
embedding float[4]
52+
)
53+
""")
54+
55+
# Separate categories table
56+
db.execute("""
57+
CREATE TABLE categories (
58+
product_id INTEGER PRIMARY KEY,
59+
category TEXT
60+
)
61+
""")
62+
63+
# Insert 10 items: 5 electronics, 5 clothing
64+
for i in range(10):
65+
category = "electronics" if i < 5 else "clothing"
66+
db.execute(
67+
"INSERT INTO products(rowid, embedding) VALUES (?, ?)",
68+
[i, f"[{i}, {i}, {i}, {i}]"]
69+
)
70+
db.execute(
71+
"INSERT INTO categories(product_id, category) VALUES (?, ?)",
72+
[i, category]
73+
)
74+
75+
# Request k=5 with JOIN filter - KNN returns 5 nearest (rowids 0-4),
76+
# but they happen to all be electronics, so we get 5
77+
# This is a lucky case!
78+
results = db.execute("""
79+
SELECT p.rowid, p.distance, c.category
80+
FROM products p
81+
JOIN categories c ON p.rowid = c.product_id
82+
WHERE p.embedding MATCH '[0,0,0,0]'
83+
AND k = 5
84+
AND c.category = 'electronics'
85+
ORDER BY p.distance
86+
""").fetchall()
87+
88+
# We get 5 here, but only because the 5 nearest happen to be electronics
89+
assert len(results) == 5
90+
91+
92+
def test_join_filter_reduces_results(db):
93+
"""Demonstrates the JOIN problem: asking for k results but getting fewer.
94+
95+
This is the key gotcha developers need to understand.
96+
"""
97+
db.execute("""
98+
CREATE VIRTUAL TABLE products USING vec0(
99+
embedding float[4]
100+
)
101+
""")
102+
103+
db.execute("""
104+
CREATE TABLE categories (
105+
product_id INTEGER PRIMARY KEY,
106+
category TEXT
107+
)
108+
""")
109+
110+
# Insert 10 items with interleaved categories
111+
# rowids 0,2,4,6,8 = electronics; 1,3,5,7,9 = clothing
112+
for i in range(10):
113+
category = "electronics" if i % 2 == 0 else "clothing"
114+
db.execute(
115+
"INSERT INTO products(rowid, embedding) VALUES (?, ?)",
116+
[i, f"[{i}, {i}, {i}, {i}]"]
117+
)
118+
db.execute(
119+
"INSERT INTO categories(product_id, category) VALUES (?, ?)",
120+
[i, category]
121+
)
122+
123+
# Request k=5, but filter for electronics
124+
# KNN returns rowids 0,1,2,3,4 (nearest to origin)
125+
# After JOIN filter: only 0,2,4 remain (electronics)
126+
results = db.execute("""
127+
SELECT p.rowid, p.distance, c.category
128+
FROM products p
129+
JOIN categories c ON p.rowid = c.product_id
130+
WHERE p.embedding MATCH '[0,0,0,0]'
131+
AND k = 5
132+
AND c.category = 'electronics'
133+
ORDER BY p.distance
134+
""").fetchall()
135+
136+
# Only 3 results even though we asked for k=5!
137+
assert len(results) == 3, (
138+
f"JOIN filter applied AFTER KNN - got {len(results)} instead of 5. "
139+
"This is expected behavior but often surprises developers."
140+
)
141+
assert [r[0] for r in results] == [0, 2, 4]
142+
143+
144+
def test_subquery_also_filters_after(db):
145+
"""Subqueries with external filters also apply AFTER KNN."""
146+
db.execute("""
147+
CREATE VIRTUAL TABLE products USING vec0(
148+
embedding float[4]
149+
)
150+
""")
151+
152+
db.execute("""
153+
CREATE TABLE categories (
154+
product_id INTEGER PRIMARY KEY,
155+
category TEXT
156+
)
157+
""")
158+
159+
for i in range(10):
160+
category = "electronics" if i % 2 == 0 else "clothing"
161+
db.execute(
162+
"INSERT INTO products(rowid, embedding) VALUES (?, ?)",
163+
[i, f"[{i}, {i}, {i}, {i}]"]
164+
)
165+
db.execute(
166+
"INSERT INTO categories(product_id, category) VALUES (?, ?)",
167+
[i, category]
168+
)
169+
170+
# CTE approach - same problem
171+
results = db.execute("""
172+
WITH knn AS (
173+
SELECT rowid, distance
174+
FROM products
175+
WHERE embedding MATCH '[0,0,0,0]' AND k = 5
176+
)
177+
SELECT knn.rowid, knn.distance, c.category
178+
FROM knn
179+
JOIN categories c ON knn.rowid = c.product_id
180+
WHERE c.category = 'electronics'
181+
""").fetchall()
182+
183+
assert len(results) == 3, "CTE+JOIN also filters after KNN"
184+
185+
186+
def test_workaround_increase_k(db):
187+
"""Workaround: request more results than needed, filter, then limit.
188+
189+
This works but wastes computation and may still miss results.
190+
"""
191+
db.execute("""
192+
CREATE VIRTUAL TABLE products USING vec0(
193+
embedding float[4]
194+
)
195+
""")
196+
197+
db.execute("""
198+
CREATE TABLE categories (
199+
product_id INTEGER PRIMARY KEY,
200+
category TEXT
201+
)
202+
""")
203+
204+
for i in range(10):
205+
category = "electronics" if i % 2 == 0 else "clothing"
206+
db.execute(
207+
"INSERT INTO products(rowid, embedding) VALUES (?, ?)",
208+
[i, f"[{i}, {i}, {i}, {i}]"]
209+
)
210+
db.execute(
211+
"INSERT INTO categories(product_id, category) VALUES (?, ?)",
212+
[i, category]
213+
)
214+
215+
# Request k=10 (2x what we need), then filter and limit
216+
results = db.execute("""
217+
SELECT p.rowid, p.distance, c.category
218+
FROM products p
219+
JOIN categories c ON p.rowid = c.product_id
220+
WHERE p.embedding MATCH '[0,0,0,0]'
221+
AND k = 10
222+
AND c.category = 'electronics'
223+
ORDER BY p.distance
224+
LIMIT 5
225+
""").fetchall()
226+
227+
assert len(results) == 5, "Got 5 results by over-fetching"
228+
229+
230+
def test_solution_use_metadata_columns(db):
231+
"""The correct solution: use metadata columns for filterable attributes."""
232+
db.execute("""
233+
CREATE VIRTUAL TABLE products USING vec0(
234+
embedding float[4],
235+
category TEXT
236+
)
237+
""")
238+
239+
# All category data in the vec0 table itself
240+
for i in range(10):
241+
category = "electronics" if i % 2 == 0 else "clothing"
242+
db.execute(
243+
"INSERT INTO products(rowid, embedding, category) VALUES (?, ?, ?)",
244+
[i, f"[{i}, {i}, {i}, {i}]", category]
245+
)
246+
247+
# Filter applied DURING KNN search
248+
results = db.execute("""
249+
SELECT rowid, distance, category FROM products
250+
WHERE embedding MATCH '[0,0,0,0]'
251+
AND k = 5
252+
AND category = 'electronics'
253+
ORDER BY distance
254+
""").fetchall()
255+
256+
assert len(results) == 5, "Metadata filter guarantees k results"
257+
assert [r[0] for r in results] == [0, 2, 4, 6, 8]

0 commit comments

Comments
 (0)