optimize(fulltext): harden single-keyword fast path and fix BM25/runtime correctness #24027

Merged
mergify[bot] merged 6 commits into matrixorigin:main from XuPeng-SH:feat/optimize-fulltext1
Apr 13, 2026

@XuPeng-SH
Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24026

What this PR does / why we need it:

Summary

This PR hardens MatrixOne fulltext execution while keeping only a behavior-safe optimization scope.

The original optimization direction was useful, but review surfaced several correctness, lifecycle, and performance-shape risks.
This change narrows the fast path to the safe subset, fixes the risky parts, and removes duplicated SQL work in the single-keyword
limited path.

@matrix-meow matrix-meow added the size/XL Denotes a PR that changes [1000, 1999] lines label Mar 31, 2026
@XuPeng-SH XuPeng-SH changed the title optimize(fulltext): harden single-keyword fast path and fix BM25/runtime correctnessoptimize optimize(fulltext): harden single-keyword fast path and fix BM25/runtime correctness Mar 31, 2026
@XuPeng-SH XuPeng-SH force-pushed the feat/optimize-fulltext1 branch from 824a968 to 61bd9f7 Compare March 31, 2026 04:06
@XuPeng-SH XuPeng-SH force-pushed the feat/optimize-fulltext1 branch from 61bd9f7 to eda6918 Compare March 31, 2026 04:49
XuPeng-SH and others added 2 commits March 31, 2026 02:07
…to LEFT JOIN for performance

- groupby(): add len(res.Batches)==0 guard before Batches[0] access to prevent
  panic when streaming SQL returns 0 rows (consistent with runCountStar pattern)
- SingleKeywordTopKBM25SQL / PhraseTopKBM25SQL / genBM25SQL: replace correlated
  subquery for doc_len with LEFT JOIN, restoring the original approach.
  Keeps CAST(COALESCE(dl.pos,0) AS INT) for type safety (int32, not BIGINT).
  Correlated subquery forced evaluation for every matching row before LIMIT;
  LEFT JOIN allows the planner to optimize via hash/merge join.
- Remove now-unused docLenExpr helper; update 5 SQL test expectations.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
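
The empty-result guard described in the groupby() item above can be sketched as follows. This is a minimal illustration, not the PR's code: `Batch`, `Result`, and `firstCount` are hypothetical stand-ins for the executor's result types, assumed here only to show the "check length before indexing" pattern.

```go
package main

import "fmt"

// Batch and Result are hypothetical stand-ins for the executor's result
// types; the real MatrixOne types are not shown in this PR conversation.
type Batch struct {
	Counts []int64
}

type Result struct {
	Batches []*Batch
}

// firstCount returns the first value of the first batch. It reports false
// instead of panicking when the query returned no batches, a nil batch,
// or a batch with no rows.
func firstCount(res Result) (int64, bool) {
	if len(res.Batches) == 0 || res.Batches[0] == nil || len(res.Batches[0].Counts) == 0 {
		return 0, false
	}
	return res.Batches[0].Counts[0], true
}

func main() {
	fmt.Println(firstCount(Result{}))
	fmt.Println(firstCount(Result{Batches: []*Batch{{Counts: []int64{42}}}}))
}
```

Returning an explicit `ok` flag lets the caller treat "0 rows" as a normal outcome rather than an error path.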
…njection

escape() previously only handled single quotes. Since MO's default sql_mode
does not include NO_BACKSLASH_ESCAPES, backslash is treated as an escape
character. A search term containing '\' could break out of a string literal
(e.g. word = 'ab\'' -- backslash escapes the closing quote).

Fix: escape backslashes before single quotes so both are handled correctly.
Add TestEscape covering backslash, quote, combined, and plain input cases.

Pre-existing issue in 6ed78f5; hardened here as part of the fulltext
security improvement scope.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
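
A minimal sketch of the escaping order described in this commit message, assuming backslash escapes are enabled (no NO_BACKSLASH_ESCAPES). `escapeSketch` is a hypothetical name, and writing the quote as `\'` is an assumption; the actual escape() in sql.go may emit a different quote form.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeSketch escapes backslashes first, then single quotes.
// The order matters: if quotes were escaped first, the backslash added in
// front of each quote would itself be doubled by the second pass.
// Assumption: quotes are written as \' (the real helper may differ).
func escapeSketch(s string) string {
	s = strings.ReplaceAll(s, `\`, `\\`)
	s = strings.ReplaceAll(s, `'`, `\'`)
	return s
}

func main() {
	for _, in := range []string{`plain`, `it's`, `ab\`, `back\slash's`} {
		fmt.Printf("word = '%s'\n", escapeSketch(in))
	}
}
```

With this order, a term ending in a backslash becomes `ab\\` inside the literal, so the closing quote is no longer consumed by the escape.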
Comment thread coverage.out Outdated
XuPeng-SH and others added 2 commits March 31, 2026 03:24
Contributor Author

@XuPeng-SH XuPeng-SH left a comment

Review: Fulltext Single-Keyword Fast Path & Correctness Fixes

Critical Correctness Fixes ✅

  1. SQL Escape Injection (sql.go): The old escape() only escaped single quotes. A pattern containing a backslash (e.g., back\slash's, or a term ending in \) could break SQL string framing: in word = 'ab\'' the backslash escapes the closing quote, so the literal does not end where intended. The new code escapes backslashes first (\ → \\), then single quotes, producing correct SQL.

  2. Multi-Row Argument Access (fulltext.go): v.GetStringAt(0) → v.GetStringAt(nthRow) in start(). Previously, all rows in a batch invocation used row 0's source table, index table, pattern, and mode.

  3. aggcnt Double-Counting (fulltext.go:groupby): The removed "update only once per doc_id" loop ran for EVERY row (both new and existing doc_ids), incrementing aggcnt[i] for all set positions including already-counted ones. The fix correctly moves the increment into the new-doc block.

  4. BM25 Doc-Length Dedup (countstar_avg_sql): GROUP BY doc_id with MAX(pos) deduplicates __DocLen entries before computing COUNT and AVG.

  5. NULL Doc-Length (genBM25SQL): COALESCE(dl.pos, 0) prevents NULL propagation from LEFT JOIN. Note: defaulting to 0 favors short-document scoring — using avgDocLen might be more neutral, but 0 is defensible.
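
To make the note in item 5 concrete, here is a small sketch of how a doc length of 0 affects a BM25 term score. It uses the textbook BM25 form with k1 = 1.2 and b = 0.75; those constants, the function name `bm25Term`, and the sample inputs are assumptions for illustration, not values taken from MatrixOne.

```go
package main

import "fmt"

// bm25Term computes one term's contribution using the textbook BM25 form:
//   idf * tf*(k1+1) / (tf + k1*(1 - b + b*docLen/avgDocLen))
// Assumption: avgDocLen > 0. This is not MatrixOne's exact scoring code.
func bm25Term(idf, tf, docLen, avgDocLen, k1, b float64) float64 {
	norm := 1 - b + b*docLen/avgDocLen
	return idf * tf * (k1 + 1) / (tf + k1*norm)
}

func main() {
	const idf, tf, avg, k1, b = 1.0, 2.0, 100.0, 1.2, 0.75

	// docLen = 0 models COALESCE(dl.pos, 0) for a missing doc-length row.
	fmt.Printf("docLen=0:   %.4f\n", bm25Term(idf, tf, 0, avg, k1, b))
	// docLen = avg models the "neutral" alternative mentioned above.
	fmt.Printf("docLen=avg: %.4f\n", bm25Term(idf, tf, avg, avg, k1, b))
}
```

Because a smaller doc length shrinks the denominator, a document whose length defaults to 0 scores higher than the same document scored at average length.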

Fast Path Design ✅

Well-gated: only NL/DEFAULT mode, single TEXT/STAR keyword, limit > 0, !ranking. Falls back to streaming for all other cases. Score computation matches the full pipeline.
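
The gating conditions listed above can be read as a single predicate. The sketch below is illustrative only: `Query`, `Mode`, `KeywordKind`, and `useFastPath` are hypothetical names, not the identifiers used in fulltext.go.

```go
package main

import "fmt"

// Mode and KeywordKind are hypothetical enums mirroring the terms used in
// the review (NL/DEFAULT mode, TEXT/STAR keywords).
type Mode int

const (
	ModeNL Mode = iota
	ModeDefault
	ModeBoolean
)

type KeywordKind int

const (
	KindText KeywordKind = iota
	KindStar
	KindPhrase
)

// Query is a hypothetical, simplified view of a parsed fulltext search.
type Query struct {
	Mode     Mode
	Keywords []KeywordKind
	Limit    int64
	Ranking  bool
}

// useFastPath reports whether the query fits the narrow fast-path subset:
// NL/DEFAULT mode, exactly one TEXT or STAR keyword, limit > 0, no ranking.
// Anything else falls back to the streaming path.
func useFastPath(q Query) bool {
	if q.Mode != ModeNL && q.Mode != ModeDefault {
		return false
	}
	if len(q.Keywords) != 1 {
		return false
	}
	if k := q.Keywords[0]; k != KindText && k != KindStar {
		return false
	}
	return q.Limit > 0 && !q.Ranking
}

func main() {
	q := Query{Mode: ModeNL, Keywords: []KeywordKind{KindText}, Limit: 10}
	fmt.Println("fast path:", useFastPath(q))
}
```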

Memory Lifecycle Hardening ✅

  • sort_topk: evicted items freed immediately (ranking) / all non-survivors freed (non-ranking)
  • streamingStarted guard in free(): prevents blocking on nil/unstarted channels
  • resetRowState: proper per-row cleanup for multi-row invocations
  • normalizeDocID/outputDocID: single cache for binary↔string doc_id conversions
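
As an illustration of the "evicted items freed immediately" idea in the sort_topk bullet, here is a generic bounded top-k sketch built on Go's container/heap. It is not the PR's sort_topk implementation; `item`, `minHeap`, `topK`, and the `release` callback are hypothetical names.

```go
package main

import (
	"container/heap"
	"fmt"
)

// item is a hypothetical scored document.
type item struct {
	docID string
	score float64
}

// minHeap keeps the lowest score at index 0 so it can be evicted cheaply.
type minHeap []*item

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].score < h[j].score }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(*item)) }
func (h *minHeap) Pop() any {
	old := *h
	n := len(old)
	it := old[n-1]
	old[n-1] = nil
	*h = old[:n-1]
	return it
}

// topK keeps the k highest-scoring items (in heap order, not sorted) and
// calls release on every item as soon as it is known not to survive.
func topK(items []*item, k int, release func(*item)) []*item {
	h := &minHeap{}
	for _, it := range items {
		switch {
		case k <= 0:
			release(it)
		case h.Len() < k:
			heap.Push(h, it)
		case it.score <= (*h)[0].score:
			release(it) // never entered the heap
		default:
			evicted := (*h)[0]
			(*h)[0] = it
			heap.Fix(h, 0)
			release(evicted) // freed at eviction time, not at the end
		}
	}
	return *h
}

func main() {
	in := []*item{{"a", 1}, {"b", 5}, {"c", 3}, {"d", 4}, {"e", 2}}
	out := topK(in, 2, func(it *item) { fmt.Println("release", it.docID) })
	for _, it := range out {
		fmt.Println("keep", it.docID, it.score)
	}
}
```

Releasing at eviction time bounds live memory to roughly k items instead of holding every candidate until the scan ends.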

Minor Notes

  • PhraseCountSQL, PhraseTopKSQL, PhraseTopKBM25SQL defined and tested but not called from runtime yet — presumably future phrase optimization prep.
  • cappedTfExpr() caps TF at 255 (matching uint8 docvec). If docvec capacity increases later, this SQL cap needs updating too.
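
The coupling noted in the cappedTfExpr() bullet can be pictured as a clamp driven by one shared constant. This is a sketch of the arithmetic only; `maxDocVecTF` and `capTF` are hypothetical names, and the SQL text that cappedTfExpr() actually generates is not shown in this conversation.

```go
package main

import (
	"fmt"
	"math"
)

// maxDocVecTF is the largest term frequency a uint8 slot can hold.
// Assumption: deriving both the in-memory cap and the SQL cap from a single
// constant like this would keep the two in sync if the slot width changes.
const maxDocVecTF = math.MaxUint8

// capTF clamps a raw term frequency into the uint8 range.
func capTF(tf int64) uint8 {
	if tf < 0 {
		return 0
	}
	if tf > maxDocVecTF {
		return maxDocVecTF
	}
	return uint8(tf)
}

func main() {
	for _, tf := range []int64{7, 255, 1000} {
		fmt.Println(tf, "->", capTF(tf))
	}
}
```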

Verdict

The escape fix alone is critical. The nthRow + aggcnt fixes address real correctness bugs. Fast path is well-scoped with extensive test coverage.

@mergify
Contributor

mergify Bot commented Apr 13, 2026

Merge Queue Status

  • Entered queue: 2026-04-13 17:55 UTC · Rule: main
  • Checks passed · in-place
  • Merged: 2026-04-13 18:55 UTC · at 161021c85d16f86a766f578ce5b9443fd5de1731

This pull request spent 59 minutes 43 seconds in the queue, including 59 minutes 14 seconds running CI.

Required conditions to merge
  • #approved-reviews-by >= 1 [🛡 GitHub branch protection]
  • #changes-requested-reviews-by = 0 [🛡 GitHub branch protection]
  • #review-threads-unresolved = 0 [🛡 GitHub branch protection]
  • branch-protection-review-decision = APPROVED [🛡 GitHub branch protection]
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-neutral = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-skipped = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / SCA Test on Ubuntu/x86
    • check-neutral = Matrixone CI / SCA Test on Ubuntu/x86
    • check-skipped = Matrixone CI / SCA Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / UT Test on Ubuntu/x86
    • check-neutral = Matrixone CI / UT Test on Ubuntu/x86
    • check-skipped = Matrixone CI / UT Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-neutral = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-skipped = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Utils CI / Coverage
    • check-neutral = Matrixone Utils CI / Coverage
    • check-skipped = Matrixone Utils CI / Coverage

Labels

kind/enhancement size/XL Denotes a PR that changes [1000, 1999] lines

4 participants