Skip to content

fix(search): recall non-first-token matches via full-text indexes#2140

Merged
momothemage merged 6 commits into
openclaw:mainfrom
momothemage:feature/fix_search
May 11, 2026
Merged

fix(search): recall non-first-token matches via full-text indexes#2140
momothemage merged 6 commits into
openclaw:mainfrom
momothemage:feature/fix_search

Conversation

@momothemage
Copy link
Copy Markdown
Contributor

@momothemage momothemage commented May 11, 2026

Summary

  • What changed:
    • Added two Convex full-text searchIndexes to skillSearchDigest in convex/schema.ts: search_by_display_name and search_by_slug, both with filterFields: ["softDeletedAt", "isSuspicious"].
    • Appended two withSearchIndex(...) queries to the existing Promise.all in directPrefixSkillMatches (convex/search.ts) so non-first-token tokens are now recalled. The four legacy prefix-index queries are kept untouched and merged through the existing skillId-based dedup.
    • Extended the test mock in convex/search.test.ts to model Convex's token-level inverted-index semantics, and added 4 regression tests (non-first-token slug, non-first-token displayName, nonSuspiciousOnly safety, cross-path dedup).
  • Why:
    • Skill search systematically returned zero results when the user's query was not a string prefix of the skill's slug / displayName and was not the first token. For example, searching yijian or vision against baidu-yijian-vision produced an empty list even though the skill was active and non-suspicious.
    • The four prefix indexes used by directPrefixSkillMatches are anchored to the start of either the full normalized field or its first alphanumeric token, so non-first-token queries miss every path simultaneously. The lexical fallback only scans the most-recently-updated/created N rows, which explains the "publish a new version, becomes findable, then disappears again" time-decay symptom. Vector search has weak recall for transliterated tokens (e.g. Chinese pinyin), so it cannot reliably backfill the gap.
    • This is Phase 1 of a two-phase zero-downtime rollout: the legacy prefix paths stay live during the index backfill, so worst case the result set is no smaller than today; once the new search indexes show ready in the Convex dashboard, a follow-up PR will retire the redundant prefix indexes.

Linked Issue

Screenshots

For website/UI changes, attach screenshots or recordings from the real app. Include mobile/narrow views when layout changes.

  • Screenshots/recordings attached, or N/A — N/A (backend-only change to Convex query layer; no UI surface modified)

Security / Trust Impact

  • No security/trust impact
  • Security/trust impact explained

Notes (kept for reviewer context, not a separate impact):

  • The new searchIndexes declare softDeletedAt and isSuspicious as filterFields, and every new query passes .eq("softDeletedAt", undefined) plus (when nonSuspiciousOnly is set) .eq("isSuspicious", false), matching the existing prefix-query behavior.
  • The post-hydration isSkillSuspicious(skill) and isSkillHighlighted(skill) guards in directPrefixSkillMatches are unchanged, so suspicious or non-highlighted skills cannot leak through the new path.

Data / Deploy Impact

  • No data/deploy impact

  • Data/deploy impact explained

  • Schema change: two new searchIndex entries on skillSearchDigest. Convex will automatically backfill all existing documents in the background once the schema is deployed; no migration script is required.

  • Index readiness can be observed in the Convex dashboard. During backfill, queries against the new indexes still execute and progressively return more matches as the backfill advances; the legacy prefix paths run in parallel, so user-visible recall never regresses below the pre-PR baseline.

  • No writes to existing fields, no row migrations, no data deletions.

  • Rollback is safe: removing the schema entries and the two withSearchIndex calls restores the previous behavior with no data fix-up needed.

Verification

  • bun run format:check — verified via bunx oxfmt --check on convex/schema.ts, convex/search.ts, convex/search.test.ts: "All matched files use the correct format."
  • bun run lint — verified via bunx oxlint on the same three files: 0 warnings, 0 errors across 119 rules.
  • bun run testbunx vitest run convex/: 96 files / 1176 tests passed; bunx vitest run convex/search.test.ts: 38 / 38 passed (4 newly added).

@momothemage momothemage requested review from a team and Patrick-Erichsen as code owners May 11, 2026 06:51
@vercel
Copy link
Copy Markdown
Contributor

vercel Bot commented May 11, 2026

@momothemage is attempting to deploy a commit to the Amantus Machina Team on Vercel.

A member of the Team first needs to authorize it.

@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented May 11, 2026

Codex review: needs real behavior proof before merge.

Summary
The PR adds skillSearchDigest full-text search indexes for displayName and slug, queries them from directPrefixSkillMatches, and adds regression tests for middle-token recall and filtering.

Reproducibility: yes. source-reproducible from current main: the direct indexed path only covers full-string and first-token prefixes, while the fallback is bounded by recent digest windows, so a stable older baidu-yijian-vision match can miss yijian or vision. I did not run a live Convex deployment search in this read-only review.

Real behavior proof
Needs real behavior proof before merge: Needs real behavior proof before merge: the PR body contains tests and CI-style output, but no redacted live Convex/search output, logs, terminal screenshot, linked artifact, or recording showing the after-fix behavior. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, ask a maintainer to comment @clawsweeper re-review.

Next step before merge
This active external PR needs contributor real behavior proof and maintainer selection among duplicate fix branches, not an automated cleanup close or repair lane.

Security
Cleared: No concrete supply-chain, secret-handling, or code-execution concern was found; the new search paths keep softDeletedAt and isSuspicious filters plus existing post-hydration guards.

Review details

Best possible solution:

Land one reviewed full-text digest search implementation that preserves soft-delete and suspicious filters, includes real behavior proof, and supersedes the duplicate parallel branches.

Do we have a high-confidence way to reproduce the issue?

Yes, source-reproducible from current main: the direct indexed path only covers full-string and first-token prefixes, while the fallback is bounded by recent digest windows, so a stable older baidu-yijian-vision match can miss yijian or vision. I did not run a live Convex deployment search in this read-only review.

Is this the best way to solve the issue?

Yes, the updated Phase 1 approach is a narrow maintainable fix direction because it adds full-text digest indexes alongside the existing prefix paths and post-filters candidates through the existing all-token matcher. Merge should wait for real behavior proof and a canonical-branch choice among the duplicate PRs.

What I checked:

  • current-main direct search path: Current main's directPrefixSkillMatches queries only normalized full-string and first-token prefix indexes and dedupes those four result sets; it has no withSearchIndex path for middle-token recall. (convex/search.ts:348, 75e04d2c9145)
  • current-main schema: Current main's skillSearchDigest table defines ordinary indexes through by_nonsuspicious_installs and does not define search_by_display_name or search_by_slug. (convex/schema.ts:720, 75e04d2c9145)
  • current-main fallback window: The fallback search path still reads bounded updated/created digest windows before token filtering, so it is not a full-corpus deterministic lookup for older middle-token matches. (convex/search.ts:591, 75e04d2c9145)
  • latest PR diff: The latest PR head adds two schema searchIndex definitions, appends two withSearchIndex queries to directPrefixSkillMatches, and adds an all-query-token post-filter plus regression tests after the earlier overrecall review. (convex/search.ts:345, e97ea4665b9c)
  • proof and checks: The PR body reports format, lint, and Vitest runs, and GitHub CI checks are mostly successful, but the body and comments still do not include live Convex/search output, logs, a recording, or another after-fix real behavior artifact. (e97ea4665b9c)
  • parallel fix branches: Three open PRs currently target the same search-underrecall issue, including fix(search): add full-text search indexes to fix non-first-token skill underrecall #2139 and fix: add full-text skill digest search #2142, so maintainers should choose one canonical branch rather than merge duplicates.

Likely related people:

  • vincentkoc: Recent commits added normalized skill prefix recall, first-token recall, and bounded lexical fallback behavior in the central search path this PR changes. (role: recent search implementation owner; confidence: high; commits: 881514f444e4, c9fe6db34df8, 21abd07672e0; files: convex/search.ts, convex/schema.ts, convex/lib/skillSearchDigest.ts)
  • sethconvex: PR history shows sethconvex authored the merged skill search digest feature that introduced the table and search hydration structure this PR extends. (role: digest feature-history owner; confidence: high; commits: 9861da02d5fb, 872045681ba8, 5433c6620076; files: convex/schema.ts, convex/search.ts, convex/lib/skillSearchDigest.ts)
  • momothemage: Before this PR, momothemage merged adjacent unified-search suspicious-result disclosure work, relevant to the new path's nonSuspiciousOnly behavior. (role: recent adjacent contributor; confidence: medium; commits: df7e0f77a6bb, b116ef7aa175, 1477ca7120ea; files: src/lib/useUnifiedSearch.ts, src/routes/search.tsx, src/__tests__/search-route.test.tsx)

Remaining risk / open question:

  • No after-fix live Convex/search proof is attached yet; current evidence is tests and CI output only.
  • There are parallel open PRs for the same linked issue, so duplicate implementations need maintainer selection before merge.
  • The linked issue carries a security label, so the moderation-filter behavior should receive explicit maintainer attention even though this diff preserves the relevant filters.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 75e04d2c9145.

@momothemage
Copy link
Copy Markdown
Contributor Author

/clawsweeper automerge

@momothemage momothemage force-pushed the feature/fix_search branch from e600dbb to b597454 Compare May 11, 2026 09:56
@momothemage momothemage merged commit 190ce37 into openclaw:main May 11, 2026
16 of 17 checks passed
@momothemage
Copy link
Copy Markdown
Contributor Author

Merged via squash.

Thanks @momothemage!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Search] Skill search underrecall: non-first-token keywords never match, visibility decays over time

1 participant