
[Feat] Add small file shortcut for FAST mode & fix reuse scope for knowledge#140

Merged
wangxingjun778 merged 2 commits into main from fix/search on Apr 3, 2026

Conversation

Member

@wangxingjun778 wangxingjun778 commented Apr 3, 2026

Features

  • Fast Path for Small Files: Introduced a 'fast path' optimization for files under 100KB, allowing them to bypass sampling for faster processing.
  • Path-Based Knowledge Cluster Reuse: Implemented path-based scoping to enable efficient reuse of knowledge clusters.
  • Enhanced Fast Search Ranking: Improved file ranking logic in fast search by incorporating match deduplication and dynamic score pruning.

Fixes

  • Robust Root Directory Handling: Updated path comparison logic to correctly handle root directories, addressing review feedback on edge cases.
  • Token Limit Protection: Addressed concerns regarding large summaries in the fast path to prevent exceeding LLM token limits.
  • Intra-File Match Deduplication: Added logic to deduplicate redundant matches within the same file, optimizing LLM context usage and reducing noise.

@wangxingjun778 wangxingjun778 changed the title from "[Feat] Add small file shortcut for FAST mode" to "[Feat] Add small file shortcut for FAST mode & fix reuse scope for knowledge" on Apr 3, 2026
@wangxingjun778 wangxingjun778 merged commit 4bd0a95 into main on Apr 3, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces several optimizations and refinements to the search and evidence processing pipelines. Key changes include a 'fast path' for files under 100KB to bypass sampling, path-based scoping for knowledge cluster reuse, and improved file ranking in fast search through match deduplication and dynamic score pruning. Review feedback highlights the need for more robust path comparison logic to handle root directories, concerns regarding large summaries in the fast path potentially exceeding token limits, and the importance of deduplicating redundant matches within the same file to optimize LLM context usage.

Comment on lines +1030 to +1035
    normalised_scopes = [os.path.normpath(p) + os.sep for p in search_paths]
    covered = 0
    for fp in cluster_files:
        norm_fp = os.path.normpath(fp)
        if any(norm_fp.startswith(scope) or norm_fp == scope.rstrip(os.sep)
               for scope in normalised_scopes):
Severity: high

The path coverage calculation using os.sep suffixing and startswith will fail when a search path is the root directory (e.g., / on Unix). os.path.normpath("/") returns /, and appending os.sep results in //. A standard absolute path like /home/user does not start with //. Using os.path.commonpath is a more robust way to check if a file resides within a directory tree.
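The failure mode is easy to reproduce in isolation. A minimal standalone sketch (Unix paths assumed; not code from the PR):

```python
import os

# Naive check: normalise the scope and append a separator, then use startswith.
# For the root directory this produces "//", which no normal path starts with.
scope = os.path.normpath("/") + os.sep
print(scope)                              # //
print("/home/user".startswith(scope))     # False, even though /home/user is under /

# Robust check: the common path of file and scope must equal the scope itself.
norm_fp = os.path.abspath("/home/user")
norm_scope = os.path.abspath("/")
print(os.path.commonpath([norm_fp, norm_scope]) == norm_scope)  # True
```

`os.path.commonpath` also raises `ValueError` when given a mix of absolute and relative paths, which is why the suggested fix below wraps the check in a `try`/`except`.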

Suggested change

    -    normalised_scopes = [os.path.normpath(p) + os.sep for p in search_paths]
    -    covered = 0
    -    for fp in cluster_files:
    -        norm_fp = os.path.normpath(fp)
    -        if any(norm_fp.startswith(scope) or norm_fp == scope.rstrip(os.sep)
    -               for scope in normalised_scopes):
    +    normalised_scopes = [os.path.abspath(p) for p in search_paths]
    +    covered = 0
    +    for fp in cluster_files:
    +        try:
    +            norm_fp = os.path.abspath(fp)
    +            if any(os.path.commonpath([norm_fp, scope]) == scope
    +                   for scope in normalised_scopes):
    +                covered += 1
    +        except (OSError, ValueError):
    +            continue

Comment on lines +441 to +457
    if self.doc_len < _SMALL_FILE_THRESHOLD:
        await self._log.info(
            f"[MC] Small file fast path: {self.doc_len} chars < {_SMALL_FILE_THRESHOLD} threshold, "
            f"returning full content as ROI"
        )
        snippet = {
            "snippet": self.doc,
            "start": 0,
            "end": self.doc_len,
            "score": 10.0,
            "reasoning": "Small file - full content returned without sampling",
        }
        return RoiResult(
            summary=self.doc,
            is_found=True,
            snippets=[snippet],
        )

Severity: medium

Returning the full document content (up to 100,000 characters) as the summary in the RoiResult is problematic. While this is a "fast path", a 100KB string is not a summary and will lead to excessively large prompts in Phase 4 of the search pipeline (answer synthesis), potentially exceeding token limits or increasing latency and cost significantly. Consider providing a truncated version (e.g., first 2000 characters) for the summary field while keeping the full content in the snippets list.
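One way to implement the reviewer's suggestion is a small helper that caps the summary while leaving the snippets untouched. A sketch under stated assumptions: the 2000-character cap is the reviewer's example value, and `build_summary` is a hypothetical helper, not a function from the PR:

```python
_SUMMARY_CAP = 2_000  # hypothetical cap; tune to the downstream token budget

def build_summary(doc: str, cap: int = _SUMMARY_CAP) -> str:
    """Return a truncated preview for the summary field, marking the cut."""
    if len(doc) <= cap:
        return doc
    return doc[:cap] + f"\n... [truncated, {len(doc)} chars total]"
```

The full content would still ride along in the snippets list, so Phase 4 can fetch it on demand while the prompt-facing summary stays bounded.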

Comment thread src/sirchmunk/search.py
Comment on lines +1859 to +1865
    if fpath in seen:
        # Merge into existing entry
        idx = seen[fpath]
        existing = deduped[idx]
        existing["matches"].extend(entry.get("matches", []))
        existing["lines"].extend(entry.get("lines", []))
        existing["total_matches"] += entry.get("total_matches", 0)

Severity: medium

The deduplication logic for merged files accumulates matches and lines but does not deduplicate them. Since keywords are searched sequentially in _fast_find_best_file, a line matching multiple keywords will appear multiple times in the matches and lines lists. This redundancy increases the amount of text sent to the LLM without adding information. Additionally, total_matches should reflect the count of unique matches after deduplication.

    if fpath in seen:
        # Merge into existing entry
        idx = seen[fpath]
        existing = deduped[idx]
        # Deduplicate matches by line number to avoid redundancy from multi-keyword hits
        existing_lns = {
            m.get("data", {}).get("line_number")
            for m in existing["matches"]
            if m.get("data", {}).get("line_number") is not None
        }
        for m in entry.get("matches", []):
            ln = m.get("data", {}).get("line_number")
            if ln is None or ln not in existing_lns:
                existing["matches"].append(m)
                if ln is not None:
                    existing_lns.add(ln)
        # Rebuild lines from unique matches
        existing["lines"] = [
            m["data"]["lines"]["text"]
            for m in existing["matches"]
            if "data" in m and "lines" in m["data"]
        ]
        existing["total_matches"] = len(existing["matches"])
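For illustration, the merge logic above can be exercised with a minimal standalone harness. The match-entry shape (`{"data": {"line_number": ..., "lines": {"text": ...}}}`) is inferred from the suggested code, not confirmed by the PR:

```python
def merge_entry(existing: dict, entry: dict) -> None:
    """Merge a duplicate file entry, deduplicating matches by line number."""
    existing_lns = {
        m.get("data", {}).get("line_number")
        for m in existing["matches"]
        if m.get("data", {}).get("line_number") is not None
    }
    for m in entry.get("matches", []):
        ln = m.get("data", {}).get("line_number")
        if ln is None or ln not in existing_lns:
            existing["matches"].append(m)
            if ln is not None:
                existing_lns.add(ln)
    # Rebuild lines and the count from the unique matches
    existing["lines"] = [
        m["data"]["lines"]["text"]
        for m in existing["matches"]
        if "data" in m and "lines" in m["data"]
    ]
    existing["total_matches"] = len(existing["matches"])

# A line matched by two keywords appears in both entries; only one copy survives.
hit = lambda n, text: {"data": {"line_number": n, "lines": {"text": text}}}
a = {"matches": [hit(3, "def foo():")], "lines": ["def foo():"], "total_matches": 1}
b = {"matches": [hit(3, "def foo():"), hit(7, "return 1")]}
merge_entry(a, b)
print(a["total_matches"])  # 2
```

Without the line-number guard, the duplicate hit on line 3 would be counted and sent to the LLM twice.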
