[Feat] Add small file shortcut for FAST mode & fix reuse scope for knowledge #140
wangxingjun778 merged 2 commits into main from
Conversation
Code Review
This pull request introduces several optimizations and refinements to the search and evidence processing pipelines. Key changes include a 'fast path' for files under 100KB to bypass sampling, path-based scoping for knowledge cluster reuse, and improved file ranking in fast search through match deduplication and dynamic score pruning. Review feedback highlights the need for more robust path comparison logic to handle root directories, concerns regarding large summaries in the fast path potentially exceeding token limits, and the importance of deduplicating redundant matches within the same file to optimize LLM context usage.
```python
normalised_scopes = [os.path.normpath(p) + os.sep for p in search_paths]
covered = 0
for fp in cluster_files:
    norm_fp = os.path.normpath(fp)
    if any(norm_fp.startswith(scope) or norm_fp == scope.rstrip(os.sep)
           for scope in normalised_scopes):
```
The path coverage calculation using `os.sep` suffixing and `startswith` will fail when a search path is the root directory (e.g., `/` on Unix). `os.path.normpath("/")` returns `/`, and appending `os.sep` results in `//`. A standard absolute path like `/home/user` does not start with `//`. Using `os.path.commonpath` is a more robust way to check whether a file resides within a directory tree.
Suggested change:

```diff
-normalised_scopes = [os.path.normpath(p) + os.sep for p in search_paths]
-covered = 0
-for fp in cluster_files:
-    norm_fp = os.path.normpath(fp)
-    if any(norm_fp.startswith(scope) or norm_fp == scope.rstrip(os.sep)
-           for scope in normalised_scopes):
+normalised_scopes = [os.path.abspath(p) for p in search_paths]
+covered = 0
+for fp in cluster_files:
+    try:
+        norm_fp = os.path.abspath(fp)
+        if any(os.path.commonpath([norm_fp, scope]) == scope for scope in normalised_scopes):
+            covered += 1
+    except (OSError, ValueError):
+        continue
```
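The failure mode and the fix can be sketched in a standalone helper (the function name `is_within_scope` is illustrative, not from the PR; assumes POSIX paths):

```python
import os

def is_within_scope(file_path: str, search_paths: list[str]) -> bool:
    """Check whether file_path falls inside any search path, including '/'.

    os.path.commonpath handles the root-directory case that a
    startswith check on normpath(p) + os.sep misses:
    normpath('/') + os.sep == '//', and '/home/user'.startswith('//')
    is False.
    """
    norm_fp = os.path.abspath(file_path)
    for scope in (os.path.abspath(p) for p in search_paths):
        try:
            # commonpath returns the scope itself iff the file is inside it
            if os.path.commonpath([norm_fp, scope]) == scope:
                return True
        except ValueError:
            # Raised e.g. on Windows for paths on different drives,
            # or when mixing absolute and relative paths
            continue
    return False

print(is_within_scope("/home/user/project/a.py", ["/"]))           # True
print(is_within_scope("/home/user/project/a.py", ["/home/user"]))  # True
print(is_within_scope("/etc/passwd", ["/home/user"]))              # False
```

The first call is exactly the case the original `startswith` logic gets wrong: with a root-directory scope it returns no coverage at all.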
```python
if self.doc_len < _SMALL_FILE_THRESHOLD:
    await self._log.info(
        f"[MC] Small file fast path: {self.doc_len} chars < {_SMALL_FILE_THRESHOLD} threshold, "
        f"returning full content as ROI"
    )
    snippet = {
        "snippet": self.doc,
        "start": 0,
        "end": self.doc_len,
        "score": 10.0,
        "reasoning": "Small file - full content returned without sampling",
    }
    return RoiResult(
        summary=self.doc,
        is_found=True,
        snippets=[snippet],
    )
```
Returning the full document content (up to 100,000 characters) as the summary in the `RoiResult` is problematic. While this is a "fast path", a 100KB string is not a summary and will lead to excessively large prompts in Phase 4 of the search pipeline (answer synthesis), potentially exceeding token limits or significantly increasing latency and cost. Consider providing a truncated version (e.g., the first 2,000 characters) for the `summary` field while keeping the full content in the `snippets` list.
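A minimal sketch of the proposed truncation, using a plain dict in place of `RoiResult` (whose definition is not shown here) and an illustrative `_SUMMARY_CAP` constant that is not from the PR:

```python
_SUMMARY_CAP = 2000  # illustrative limit, not from the PR

def build_small_file_roi(doc: str) -> dict:
    """Keep the full content in snippets, but cap the summary field."""
    snippet = {
        "snippet": doc,  # full content stays available for later phases
        "start": 0,
        "end": len(doc),
        "score": 10.0,
        "reasoning": "Small file - full content returned without sampling",
    }
    summary = doc if len(doc) <= _SUMMARY_CAP else doc[:_SUMMARY_CAP] + "..."
    return {"summary": summary, "is_found": True, "snippets": [snippet]}

roi = build_small_file_roi("x" * 5000)
print(len(roi["summary"]))        # 2003 (2000 chars + "...")
print(roi["snippets"][0]["end"])  # 5000 - full content preserved
```

This keeps the fast path's zero-LLM-call property while bounding what the answer-synthesis phase has to ingest.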
```python
if fpath in seen:
    # Merge into existing entry
    idx = seen[fpath]
    existing = deduped[idx]
    existing["matches"].extend(entry.get("matches", []))
    existing["lines"].extend(entry.get("lines", []))
    existing["total_matches"] += entry.get("total_matches", 0)
```
The deduplication logic for merged files accumulates matches and lines but does not deduplicate them. Since keywords are searched sequentially in `_fast_find_best_file`, a line matching multiple keywords will appear multiple times in the `matches` and `lines` lists. This redundancy increases the amount of text sent to the LLM without adding information. Additionally, `total_matches` should reflect the count of unique matches after deduplication.
```python
if fpath in seen:
    # Merge into existing entry
    idx = seen[fpath]
    existing = deduped[idx]
    # Deduplicate matches by line number to avoid redundancy from multi-keyword hits
    existing_lns = {
        m.get("data", {}).get("line_number")
        for m in existing["matches"]
        if m.get("data", {}).get("line_number") is not None
    }
    for m in entry.get("matches", []):
        ln = m.get("data", {}).get("line_number")
        if ln is None or ln not in existing_lns:
            existing["matches"].append(m)
            if ln is not None:
                existing_lns.add(ln)
    # Rebuild lines from unique matches
    existing["lines"] = [
        m["data"]["lines"]["text"]
        for m in existing["matches"]
        if "data" in m and "lines" in m["data"]
    ]
    existing["total_matches"] = len(existing["matches"])
```
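The merge behavior can be exercised in isolation. This sketch assumes the match shape visible in the diff (`data.line_number`, `data.lines.text`); the helper name `merge_entry` is illustrative:

```python
def merge_entry(existing: dict, entry: dict) -> None:
    """Merge entry's matches into existing, dropping duplicate line numbers."""
    existing_lns = {
        m.get("data", {}).get("line_number")
        for m in existing["matches"]
        if m.get("data", {}).get("line_number") is not None
    }
    for m in entry.get("matches", []):
        ln = m.get("data", {}).get("line_number")
        if ln is None or ln not in existing_lns:
            existing["matches"].append(m)
            if ln is not None:
                existing_lns.add(ln)
    existing["lines"] = [
        m["data"]["lines"]["text"]
        for m in existing["matches"]
        if "data" in m and "lines" in m["data"]
    ]
    existing["total_matches"] = len(existing["matches"])

def mk(n, t):
    return {"data": {"line_number": n, "lines": {"text": t}}}

# Two keyword passes hit line 2 in the same file; it must be kept only once.
a = {"matches": [mk(1, "foo"), mk(2, "bar")], "lines": [], "total_matches": 0}
merge_entry(a, {"matches": [mk(2, "bar"), mk(3, "baz")]})
print(a["total_matches"])  # 3, not 4
print(a["lines"])          # ['foo', 'bar', 'baz']
```

With the original `extend`-based merge, the duplicate hit on line 2 would be sent to the LLM twice and inflate `total_matches`, skewing the score-based file ranking.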