optimize(fulltext): harden single-keyword fast path and fix BM25/runtime correctness#24027
Conversation
824a968 to
61bd9f7
Compare
61bd9f7 to
eda6918
Compare
426a121 to
9bdaaa5
Compare
…to LEFT JOIN for performance - groupby(): add len(res.Batches)==0 guard before Batches[0] access to prevent panic when streaming SQL returns 0 rows (consistent with runCountStar pattern) - SingleKeywordTopKBM25SQL / PhraseTopKBM25SQL / genBM25SQL: replace correlated subquery for doc_len with LEFT JOIN, restoring the original approach. Keeps CAST(COALESCE(dl.pos,0) AS INT) for type safety (int32, not BIGINT). Correlated subquery forced evaluation for every matching row before LIMIT; LEFT JOIN allows the planner to optimize via hash/merge join. - Remove now-unused docLenExpr helper; update 5 SQL test expectations. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…njection escape() previously only handled single quotes. Since MO's default sql_mode does not include NO_BACKSLASH_ESCAPES, backslash is treated as an escape character. A search term containing '\' could break out of a string literal (e.g. word = 'ab\'' -- backslash escapes the closing quote). Fix: escape backslashes before single quotes so both are handled correctly. Add TestEscape covering backslash, quote, combined, and plain input cases. Pre-existing issue in 6ed78f5; hardened here as part of the fulltext security improvement scope. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
XuPeng-SH
left a comment
There was a problem hiding this comment.
Review: Fulltext Single-Keyword Fast Path & Correctness Fixes
Critical Correctness Fixes ✅
-
SQL Escape Injection (
sql.go): The oldescape()only escaped'→\\'. A pattern containing\followed by'(e.g.,back\slash's) would produce a dangling string in SQL: the\\'gets parsed as an escaped quote instead of end-of-string, breaking SQL framing. The new code escapes\→\\first, then'→\\', producing correct SQL. -
Multi-Row Argument Access (
fulltext.go):v.GetStringAt(0)→v.GetStringAt(nthRow)instart(). Previously, all rows in a batch invocation used row 0's source table, index table, pattern, and mode. -
aggcntDouble-Counting (fulltext.go:groupby): The removed "update only once per doc_id" loop ran for EVERY row (both new and existing doc_ids), incrementingaggcnt[i]for all set positions including already-counted ones. The fix correctly moves the increment into the new-doc block. -
BM25 Doc-Length Dedup (
countstar_avg_sql):GROUP BY doc_idwithMAX(pos)deduplicates__DocLenentries before computing COUNT and AVG. -
NULL Doc-Length (
genBM25SQL):COALESCE(dl.pos, 0)prevents NULL propagation from LEFT JOIN. Note: defaulting to 0 favors short-document scoring — usingavgDocLenmight be more neutral, but 0 is defensible.
Fast Path Design ✅
Well-gated: only NL/DEFAULT mode, single TEXT/STAR keyword, limit > 0, !ranking. Falls back to streaming for all other cases. Score computation matches the full pipeline.
Memory Lifecycle Hardening ✅
sort_topk: evicted items freed immediately (ranking) / all non-survivors freed (non-ranking)streamingStartedguard infree(): prevents blocking on nil/unstarted channelsresetRowState: proper per-row cleanup for multi-row invocationsnormalizeDocID/outputDocID: single cache for binary↔string doc_id conversions
Minor Notes
PhraseCountSQL,PhraseTopKSQL,PhraseTopKBM25SQLdefined and tested but not called from runtime yet — presumably future phrase optimization prep.cappedTfExpr()caps TF at 255 (matching uint8 docvec). If docvec capacity increases later, this SQL cap needs updating too.
Verdict
The escape fix alone is critical. The nthRow + aggcnt fixes address real correctness bugs. Fast path is well-scoped with extensive test coverage.
Merge Queue Status
This pull request spent 59 minutes 43 seconds in the queue, including 59 minutes 14 seconds running CI. Required conditions to merge
|
What type of PR is this?
Which issue(s) this PR fixes:
issue #24026
What this PR does / why we need it:
Summary
This PR hardens MatrixOne fulltext execution while keeping only a behavior-safe optimization scope.
The original optimization direction was useful, but review surfaced several correctness, lifecycle, and performance-shape risks.
This change narrows the fast path to the safe subset, fixes the risky parts, and removes duplicated SQL work in the single-keyword
limited path.