to 3.0-dev: cherry-pick #24027 (fulltext single-keyword fast path and correctness fixes) #24102

Merged
heni02 merged 4 commits into matrixorigin:3.0-dev from XuPeng-SH:port-fulltext-optimize-3.0-dev on Apr 15, 2026

@XuPeng-SH
Contributor

What type of PR is this?

  • Bug fix (non-breaking change which fixes an issue)
  • Improvement (non-breaking change which improves an existing feature)

Which issue(s) this PR fixes

Port of #24027 from main to 3.0-dev.

What this PR does / why we need it

Cherry-picks the fulltext optimization and correctness fixes from PR #24027 to the 3.0-dev branch.

5 Correctness Fixes

  1. SQL escape injection: escape() now escapes backslashes before single quotes to prevent SQL string framing breakage
  2. Multi-row argument bug: v.GetStringAt(0) → v.GetStringAt(nthRow); previously every row in a batch used row 0's arguments
  3. aggcnt double-counting: increment counter only for new doc_ids, not for every row
  4. BM25 doc-length dedup: GROUP BY doc_id with MAX(pos) before COUNT/AVG to prevent duplicates
  5. NULL doc-length: COALESCE(dl.pos, 0) prevents NULL from LEFT JOIN
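
To make fix 1 concrete, here is a minimal sketch of why the replacement order matters. It is not the code from this PR: `escapeSQLString` and `escapeWrongOrder` are hypothetical names, and the sketch assumes single quotes are escaped with a backslash (`\'`) rather than by doubling them.

```go
package main

import "strings"

// escapeSQLString is a hypothetical stand-in for the escape() helper.
// Assumption: quotes are escaped as \' (not doubled).
// Backslashes are escaped first, so the backslash added in front of
// each quote in the second step is never escaped again.
func escapeSQLString(s string) string {
	s = strings.ReplaceAll(s, `\`, `\\`)
	s = strings.ReplaceAll(s, `'`, `\'`)
	return s
}

// escapeWrongOrder is for illustration only. Escaping quotes first and
// backslashes second turns ' into \\' (an escaped backslash followed by
// a bare quote), which terminates the surrounding SQL string literal.
func escapeWrongOrder(s string) string {
	s = strings.ReplaceAll(s, `'`, `\'`)
	s = strings.ReplaceAll(s, `\`, `\\`)
	return s
}
```

With the correct order, an input ending in a backslash can no longer swallow the closing quote of the literal it is embedded in.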

Single-Keyword Fast Path

  • Bypass streaming pipeline for single TEXT/STAR keyword queries with limit
  • SingleKeywordTopKSQL (TF-IDF) and SingleKeywordTopKBM25SQL generate optimized SQL
  • Score computation matches full pipeline precision
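
The eligibility check for the fast path can be pictured roughly as follows. This is a sketch under stated assumptions, not the PR's implementation: the `pattern` type, the `patternKind` constants, and `useSingleKeywordFastPath` are hypothetical, and a `limit` of 0 is assumed to mean "no LIMIT clause".

```go
package main

// patternKind is a hypothetical classification of parsed search terms.
type patternKind int

const (
	kindText   patternKind = iota // plain keyword
	kindStar                      // prefix keyword such as "matri*"
	kindPhrase                    // quoted phrase; not eligible in this sketch
)

// pattern is a hypothetical parsed search term.
type pattern struct {
	Kind patternKind
	Text string
}

// useSingleKeywordFastPath reports whether a query could skip the
// streaming pipeline: exactly one TEXT or STAR keyword, and a limit.
// Assumption: limit == 0 means the query has no LIMIT clause.
func useSingleKeywordFastPath(patterns []pattern, limit uint64) bool {
	if limit == 0 || len(patterns) != 1 {
		return false
	}
	switch patterns[0].Kind {
	case kindText, kindStar:
		return true
	default:
		return false
	}
}
```

Any query that fails the check would fall through to the regular streaming pipeline, so the fast path stays a pure optimization.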

Memory Lifecycle Hardening

  • sort_topk: evicted items freed immediately (ranking) / all non-survivors freed (non-ranking)
  • streamingStarted guard in free() prevents blocking on nil/unstarted channels
  • resetRowState: proper per-row cleanup for multi-row invocations
  • normalizeDocID/outputDocID: single cache for binary↔string doc_id conversions
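
The "evicted items freed immediately" behaviour of the ranking path can be illustrated with a bounded min-heap. This is a generic sketch built on Go's `container/heap`, not the `sort_topk` code itself: `scoredDoc`, `minHeap`, `topK`, and the `release` callback are hypothetical names.

```go
package main

import "container/heap"

// scoredDoc is a hypothetical ranked result.
type scoredDoc struct {
	DocID string
	Score float64
}

// minHeap keeps the lowest score at index 0 so it can be evicted cheaply.
type minHeap []scoredDoc

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].Score < h[j].Score }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(scoredDoc)) }
func (h *minHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// topK returns the k highest-scoring items in descending order and calls
// release on every item as soon as it is known not to survive.
func topK(items []scoredDoc, k int, release func(scoredDoc)) []scoredDoc {
	if k <= 0 {
		for _, it := range items {
			release(it)
		}
		return nil
	}
	h := make(minHeap, 0, k)
	for _, it := range items {
		switch {
		case h.Len() < k:
			heap.Push(&h, it)
		case it.Score > h[0].Score:
			release(h[0]) // evict the current minimum right away
			h[0] = it
			heap.Fix(&h, 0)
		default:
			release(it) // never entered the heap
		}
	}
	// Popping yields ascending scores, so fill the result back to front.
	out := make([]scoredDoc, h.Len())
	for i := len(out) - 1; i >= 0; i-- {
		out[i] = heap.Pop(&h).(scoredDoc)
	}
	return out
}
```

Releasing at eviction time keeps at most k items alive at once, instead of holding every candidate until the final sort.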

3.0-dev API Adaptation

  • RunStreamingSql uses *process.Process (not *SqlProcess) on 3.0-dev
  • FulltextBloomFilter not available on 3.0-dev; bloom filter propagation test removed
  • All remaining tests pass

How has this been tested?

go test -short -tags matrixone_test ./pkg/fulltext ./pkg/sql/colexec/table_function
ok  github.com/matrixorigin/matrixone/pkg/fulltext  0.068s
ok  github.com/matrixorigin/matrixone/pkg/sql/colexec/table_function  0.292s

…ast path and correctness fixes)

Port PR matrixorigin#24027 from main to 3.0-dev with API adaptation.

**5 Correctness Fixes:**
1. SQL escape injection: escape backslashes before single quotes
2. Multi-row argument bug: use nthRow instead of hardcoded 0
3. aggcnt double-counting: increment only for new doc_ids
4. BM25 doc-length dedup: GROUP BY doc_id before COUNT/AVG
5. NULL doc-length: COALESCE prevents NULL from LEFT JOIN
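
As an illustration of fix 3, the following sketch counts a document only the first time its doc_id is seen. The `hit` type and `countDistinctDocs` are hypothetical and stand in for the real aggregation state.

```go
package main

// hit is a hypothetical index row: one keyword occurrence in a document.
type hit struct {
	DocID string
	Pos   int32
}

// countDistinctDocs increments the counter only when a doc_id is new.
// Incrementing per row instead would count a document once for every
// position at which the keyword occurs.
func countDistinctDocs(hits []hit) int64 {
	seen := make(map[string]struct{}, len(hits))
	var n int64
	for _, h := range hits {
		if _, ok := seen[h.DocID]; ok {
			continue
		}
		seen[h.DocID] = struct{}{}
		n++
	}
	return n
}
```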

**Single-Keyword Fast Path:**
- Bypass streaming pipeline for single TEXT/STAR keyword queries
- SingleKeywordTopKSQL (TF-IDF) and SingleKeywordTopKBM25SQL
- Score computation matches full pipeline precision

**Memory Lifecycle Hardening:**
- sort_topk: free evicted items immediately
- streamingStarted guard in free()
- resetRowState: proper per-row cleanup for multi-row invocations
- normalizeDocID/outputDocID: single cache for binary<->string conversion

**3.0-dev API adaptation:**
- RunStreamingSql uses *process.Process (not *SqlProcess)
- FulltextBloomFilter not available on 3.0-dev (removed bloom filter propagation)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
heni02 merged commit 11375aa into matrixorigin:3.0-dev on Apr 15, 2026
22 of 23 checks passed
Labels

  • kind/bug (Something isn't working)
  • kind/enhancement
  • size/XL (Denotes a PR that changes [1000, 1999] lines)
