
[Feat] Add small file shortcut for FAST mode & fix reuse scope for knowledge#140

Merged
wangxingjun778 merged 2 commits into main from fix/search on Apr 3, 2026

Conversation

Member

@wangxingjun778 wangxingjun778 commented Apr 3, 2026

Features

  • Fast Path for Small Files: Introduced a 'fast path' optimization for files under 100KB, allowing them to bypass sampling for faster processing.
  • Path-Based Knowledge Cluster Reuse: Implemented path-based scoping to enable efficient reuse of knowledge clusters.
  • Enhanced Fast Search Ranking: Improved file ranking logic in fast search by incorporating match deduplication and dynamic score pruning.

Fixes

  • Robust Root Directory Handling: Updated path comparison logic to correctly handle root directories, addressing review feedback on edge cases.
  • Token Limit Protection: Addressed concerns regarding large summaries in the fast path to prevent exceeding LLM token limits.
  • Intra-File Match Deduplication: Added logic to deduplicate redundant matches within the same file, optimizing LLM context usage and reducing noise.

@wangxingjun778 wangxingjun778 changed the title from "[Feat] Add small file shortcut for FAST mode" to "[Feat] Add small file shortcut for FAST mode & fix reuse scope for knowledge" on Apr 3, 2026
@wangxingjun778 wangxingjun778 merged commit 4bd0a95 into main on Apr 3, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces several optimizations and refinements to the search and evidence processing pipelines. Key changes include a 'fast path' for files under 100KB to bypass sampling, path-based scoping for knowledge cluster reuse, and improved file ranking in fast search through match deduplication and dynamic score pruning. Review feedback highlights the need for more robust path comparison logic to handle root directories, concerns regarding large summaries in the fast path potentially exceeding token limits, and the importance of deduplicating redundant matches within the same file to optimize LLM context usage.

Comment on lines +1030 to +1035
    normalised_scopes = [os.path.normpath(p) + os.sep for p in search_paths]
    covered = 0
    for fp in cluster_files:
        norm_fp = os.path.normpath(fp)
        if any(norm_fp.startswith(scope) or norm_fp == scope.rstrip(os.sep)
               for scope in normalised_scopes):
Severity: high

The path coverage calculation using os.sep suffixing and startswith will fail when a search path is the root directory (e.g., / on Unix). os.path.normpath("/") returns /, and appending os.sep results in //. A standard absolute path like /home/user does not start with //. Using os.path.commonpath is a more robust way to check if a file resides within a directory tree.
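The failure mode is easy to reproduce in isolation. A minimal standalone sketch (Unix paths assumed; not code from the PR):

```python
import os

# Naive check: normalise the scope and append a separator, then use startswith.
# For the root directory this produces "//", which no normal path starts with.
scope = os.path.normpath("/") + os.sep
print(scope)                              # //
print("/home/user".startswith(scope))     # False, even though /home/user is under /

# Robust check: the common path of file and scope must equal the scope itself.
norm_fp = os.path.abspath("/home/user")
norm_scope = os.path.abspath("/")
print(os.path.commonpath([norm_fp, norm_scope]) == norm_scope)  # True
```

`os.path.commonpath` also raises `ValueError` when given a mix of absolute and relative paths, which is why the suggested fix below wraps the check in a `try`/`except`.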

Suggested change

    -    normalised_scopes = [os.path.normpath(p) + os.sep for p in search_paths]
    -    covered = 0
    -    for fp in cluster_files:
    -        norm_fp = os.path.normpath(fp)
    -        if any(norm_fp.startswith(scope) or norm_fp == scope.rstrip(os.sep)
    -               for scope in normalised_scopes):
    +    normalised_scopes = [os.path.abspath(p) for p in search_paths]
    +    covered = 0
    +    for fp in cluster_files:
    +        try:
    +            norm_fp = os.path.abspath(fp)
    +            if any(os.path.commonpath([norm_fp, scope]) == scope
    +                   for scope in normalised_scopes):
    +                covered += 1
    +        except (OSError, ValueError):
    +            continue

Comment on lines +441 to +457
    if self.doc_len < _SMALL_FILE_THRESHOLD:
        await self._log.info(
            f"[MC] Small file fast path: {self.doc_len} chars < {_SMALL_FILE_THRESHOLD} threshold, "
            f"returning full content as ROI"
        )
        snippet = {
            "snippet": self.doc,
            "start": 0,
            "end": self.doc_len,
            "score": 10.0,
            "reasoning": "Small file - full content returned without sampling",
        }
        return RoiResult(
            summary=self.doc,
            is_found=True,
            snippets=[snippet],
        )

Severity: medium

Returning the full document content (up to 100,000 characters) as the summary in the RoiResult is problematic. While this is a "fast path", a 100KB string is not a summary and will lead to excessively large prompts in Phase 4 of the search pipeline (answer synthesis), potentially exceeding token limits or increasing latency and cost significantly. Consider providing a truncated version (e.g., first 2000 characters) for the summary field while keeping the full content in the snippets list.
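One way to implement the reviewer's suggestion is a small helper that caps the summary while leaving the snippets untouched. A sketch under stated assumptions: the 2000-character cap is the reviewer's example value, and `build_summary` is a hypothetical helper, not a function from the PR:

```python
_SUMMARY_CAP = 2_000  # hypothetical cap; tune to the downstream token budget

def build_summary(doc: str, cap: int = _SUMMARY_CAP) -> str:
    """Return a truncated preview for the summary field, marking the cut."""
    if len(doc) <= cap:
        return doc
    return doc[:cap] + f"\n... [truncated, {len(doc)} chars total]"
```

The full content would still ride along in the snippets list, so Phase 4 can fetch it on demand while the prompt-facing summary stays bounded.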

Comment thread src/sirchmunk/search.py
Comment on lines +1859 to +1865
    if fpath in seen:
        # Merge into existing entry
        idx = seen[fpath]
        existing = deduped[idx]
        existing["matches"].extend(entry.get("matches", []))
        existing["lines"].extend(entry.get("lines", []))
        existing["total_matches"] += entry.get("total_matches", 0)

Severity: medium

The deduplication logic for merged files accumulates matches and lines but does not deduplicate them. Since keywords are searched sequentially in _fast_find_best_file, a line matching multiple keywords will appear multiple times in the matches and lines lists. This redundancy increases the amount of text sent to the LLM without adding information. Additionally, total_matches should reflect the count of unique matches after deduplication.

    if fpath in seen:
        # Merge into existing entry
        idx = seen[fpath]
        existing = deduped[idx]
        # Deduplicate matches by line number to avoid redundancy from multi-keyword hits
        existing_lns = {
            m.get("data", {}).get("line_number")
            for m in existing["matches"]
            if m.get("data", {}).get("line_number") is not None
        }
        for m in entry.get("matches", []):
            ln = m.get("data", {}).get("line_number")
            if ln is None or ln not in existing_lns:
                existing["matches"].append(m)
                if ln is not None:
                    existing_lns.add(ln)
        # Rebuild lines from unique matches
        existing["lines"] = [
            m["data"]["lines"]["text"]
            for m in existing["matches"]
            if "data" in m and "lines" in m["data"]
        ]
        existing["total_matches"] = len(existing["matches"])
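For illustration, the merge logic above can be exercised with a minimal standalone harness. The match-entry shape (`{"data": {"line_number": ..., "lines": {"text": ...}}}`) is inferred from the suggested code, not confirmed by the PR:

```python
def merge_entry(existing: dict, entry: dict) -> None:
    """Merge a duplicate file entry, deduplicating matches by line number."""
    existing_lns = {
        m.get("data", {}).get("line_number")
        for m in existing["matches"]
        if m.get("data", {}).get("line_number") is not None
    }
    for m in entry.get("matches", []):
        ln = m.get("data", {}).get("line_number")
        if ln is None or ln not in existing_lns:
            existing["matches"].append(m)
            if ln is not None:
                existing_lns.add(ln)
    # Rebuild lines and the count from the unique matches
    existing["lines"] = [
        m["data"]["lines"]["text"]
        for m in existing["matches"]
        if "data" in m and "lines" in m["data"]
    ]
    existing["total_matches"] = len(existing["matches"])

# A line matched by two keywords appears in both entries; only one copy survives.
hit = lambda n, text: {"data": {"line_number": n, "lines": {"text": text}}}
a = {"matches": [hit(3, "def foo():")], "lines": ["def foo():"], "total_matches": 1}
b = {"matches": [hit(3, "def foo():"), hit(7, "return 1")]}
merge_entry(a, b)
print(a["total_matches"])  # 2
```

Without the line-number guard, the duplicate hit on line 3 would be counted and sent to the LLM twice.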
