Conversation
Fix/filename
…ils when some vars are missing
Add env vars when running backup/restore scripts as there might be fa…
…of the embedder model. This prevents errors in that case.
… each matched document to provide additional context.
Feat/chunking
Some OpenAI-compatible libs make the `model` parameter mandatory when querying the chat completions API. Thus, we now allow providing an empty model, or the actual LLM model defined in the config.
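For clients that insist on a `model` field, the request can simply carry an empty string. A minimal sketch of the payload shape (the helper name is hypothetical; the field names follow the OpenAI chat-completions schema, and the routing comment reflects this PR's described behavior):

```python
import json

def build_chat_payload(message: str, model: str = "") -> dict:
    # model="" (or the configured LLM model name) routes the request straight
    # to the LLM without RAG; any other value selects a RAG partition.
    return {
        "model": model,
        "messages": [{"role": "user", "content": message}],
        "stream": False,
    }

payload = build_chat_payload("What changed in this PR?")
body = json.dumps(payload)  # ready to POST to /v1/chat/completions
```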
Walkthrough

This PR significantly refactors the document processing and retrieval pipeline by consolidating chunking strategies to recursive splitting only, introducing LLM-based chunk contextualization, adding surrounding-chunk retrieval capabilities, integrating token-aware context formatting, and adding embedder truncation support. Configuration files are reorganized and environment variables updated accordingly.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Upload as File Upload
    participant Sanitize as File Sanitizer
    participant Save as Disk Storage
    participant Chunk as Chunking Engine
    participant Context as LLM Contextualizer
    participant Embed as Embedder
    participant VectorDB as Vector Database
    User->>Upload: Upload document
    Upload->>Sanitize: Get sanitized filename
    Sanitize-->>Save: Sanitized + unique name
    Save->>Save: Write to disk with random prefix
    Save->>Chunk: Load document
    Chunk->>Chunk: Split into chunks with recursive_splitter
    Chunk->>Chunk: Detect language & assign page numbers
    Chunk->>Context: Pass chunks for contextualization
    Context->>Context: LLM generates context for each chunk
    Context-->>Chunk: Enhanced chunks with context
    Chunk->>Embed: Send contextualized chunks
    Embed->>Embed: Truncate via max_model_len
    Embed->>VectorDB: Generate embeddings
    VectorDB->>VectorDB: Store with chunk order metadata
    VectorDB-->>User: ✓ Indexed
```
```mermaid
sequenceDiagram
    actor User
    participant API as OpenAI API
    participant Router as Request Router
    participant VectorDB as Vector Database
    participant Retriever as Retriever
    participant Reranker as Reranker
    participant Context as Context Formatter
    participant LLM as Language Model
    participant Response as Response
    User->>API: Query (model: "")
    API->>Router: is_direct_llm_model() → true
    Router->>LLM: Route to direct LLM
    LLM-->>Response: Generate without RAG
    alt With RAG (model specified)
        User->>API: Query (model: "rag_partition")
        API->>Router: is_direct_llm_model() → false
        Router->>Retriever: retrieve(with_surrounding_chunks=true)
        Retriever->>VectorDB: async_search()
        VectorDB-->>Retriever: Top-k chunks
        Retriever->>VectorDB: get_surrounding_chunks()
        VectorDB-->>Retriever: Adjacent chunks
        Retriever->>Reranker: Rerank combined results (top_k=10)
        Reranker-->>Context: Ranked chunks
        Context->>Context: format_context(max_context_tokens)
        Context->>Context: Count tokens via ChatOpenAI
        Context-->>LLM: Token-limited context
        LLM-->>Response: RAG-augmented response
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Code Health Improved
(2 files improve in Code Health)
Gates Failed
Prevent hotspot decline
(3 hotspots with Complex Method)
Enforce advisory code health rules
(4 files with Complex Method, Overall Code Complexity)
Gates Passed
4 Quality Gates Passed
See analysis details in CodeScene
Reason for failure
| Prevent hotspot decline | Violations | Code Health Impact |
|---|---|---|
| vectordb.py | 1 rule in this hotspot | 7.29 → 7.06 |
| openai.py | 1 rule in this hotspot | 7.10 → 7.08 |
| chunker.py | 1 rule in this hotspot | 8.55 → 9.39 |
| Enforce advisory code health rules | Violations | Code Health Impact |
|---|---|---|
| vectordb.py | 1 advisory rule | 7.29 → 7.06 |
| openai.py | 1 advisory rule | 7.10 → 7.08 |
| utils.py | 1 advisory rule | 7.86 → 8.57 |
| chunker.py | 1 advisory rule | 8.55 → 9.39 |
View Improvements
| File | Code Health Impact | Categories Improved |
|---|---|---|
| chunker.py | 8.55 → 9.39 | Code Duplication, Bumpy Road Ahead, Excess Number of Function Arguments |
| utils.py | 7.86 → 8.57 | Complex Method, Bumpy Road Ahead, Deep, Nested Complexity, Excess Number of Function Arguments |
Quality Gate Profile: Pay Down Tech Debt
Want more control? Customize Code Health rules or catch issues early with our IDE extension and CLI tool.
```python
def _get_chunks(
    self, content: str, metadata: Optional[dict] = None, log=None
) -> list[Document]:
    log = log or logger
    texts, tables_and_images = self._prepare_md_elements(content=content)
    combined_texts = "\n".join([e.content for e in texts])
    text_chunks = self.split_text(combined_texts)
```
❌ New issue: Complex Method
BaseChunker._get_chunks has a cyclomatic complexity of 9, threshold = 9
| from langchain_core.documents.base import Document | ||
| from langchain_openai import ChatOpenAI | ||
| from typing import Callable, Literal, Optional |
❌ New issue: Overall Code Complexity
This module has a mean cyclomatic complexity of 5.22 across 9 functions. The mean complexity threshold is 4
```python
    similarity_threshold: int = 0.80,
    partition: list[str] = None,
    filter: Optional[dict] = None,
    with_surrounding_chunks: bool = False,
```
❌ Getting worse: Complex Method
MilvusDB.async_search increases in cyclomatic complexity from 12 to 13, threshold = 9
```python
async def get_surrounding_chunks(self, docs: list[Document]) -> list[Document]:
    existant_ids = set(doc.metadata.get("_id") for doc in docs)

    # Collect all prev/next section IDs
    section_ids = [
        section_id
        for doc in docs
        for section_id in [
            doc.metadata.get("prev_section_id"),
            doc.metadata.get("next_section_id"),
        ]
        if section_id is not None
    ]

    if not section_ids:
        return []

    # Query all sections in parallel
    tasks = [
        self._async_client.query(
            collection_name=self.collection_name,
            filter=f"section_id == {section_id}",
            limit=1,
        )
        for section_id in section_ids
    ]
    responses = await asyncio.gather(*tasks)

    # Build output, skipping duplicates
    output_docs = []
    for response in responses:
        if not response:
            continue
        doc_id = response[0].get("_id")
        if doc_id not in existant_ids:
            existant_ids.add(doc_id)
            output_docs.append(
                Document(
                    page_content=response[0]["text"],
                    metadata={
                        key: value
                        for key, value in response[0].items()
                        if key not in ["text", "vector"]
                    },
                )
            )

    return output_docs
```
❌ New issue: Complex Method
MilvusDB.get_surrounding_chunks has a cyclomatic complexity of 10, threshold = 9
```diff
     user_partitions=Depends(current_user_or_admin_partitions_list),
 ):
-    model_name = request.model
+    model_name = request.model or config.llm.get("model")
```
❌ Getting worse: Complex Method
openai_completion increases in cyclomatic complexity from 13 to 14, threshold = 9
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
openrag/components/files.py (3)
35-39: Fix return type annotation mismatch.

The function returns a `str` but is annotated as returning `Path`. This causes type inconsistency and could mislead callers.

```diff
-def make_unique_filename(filename: str) -> Path:
+def make_unique_filename(filename: str) -> str:
     ts = int(time.time() * 1000)
     rand = secrets.token_hex(2)
     unique_name = f"{ts}_{rand}_(unknown)"
     return unique_name
```
54-58: Handle potential `None` filename.

`UploadFile.filename` can be `None` if the client doesn't provide a filename. This would cause issues when constructing the file path.

```diff
 if with_random_prefix:
-    filename = make_unique_filename(file.filename)
+    filename = make_unique_filename(file.filename or "unnamed")
 else:
-    filename = file.filename
+    filename = file.filename or "unnamed"
 file_path = dest_dir / filename
```
79-92: Dead code: `else` branch is unreachable.

`ray.wait([future])` without a `timeout` parameter blocks until the future completes, so `ready` will always be non-empty. The `else` branch (lines 90-92) is unreachable dead code. Either remove the dead code or reintroduce the timeout if it was intentionally removed:

```diff
 # Wait for it to complete, with timeout
-ready, _ = await asyncio.to_thread(ray.wait, [future])
+ready, _ = await asyncio.to_thread(ray.wait, [future], timeout=SERIALIZATION_TIMEOUT)
 if ready:
     try:
         doc = await ready[0]
         return doc
     except TaskCancelledError:
         raise
     except Exception:
         raise
 else:
     ray.cancel(future, recursive=True)
-    raise TimeoutError(f"Serialization task {task_id} timed out after seconds")
+    raise TimeoutError(f"Serialization task {task_id} timed out after {SERIALIZATION_TIMEOUT} seconds")
```

Or, if the timeout is intentionally removed, clean up the dead code:

```diff
 # Wait for it to complete
-ready, _ = await asyncio.to_thread(ray.wait, [future])
-
-if ready:
-    try:
-        doc = await ready[0]
-        return doc
-    except TaskCancelledError:
-        raise
-    except Exception:
-        raise
-else:
-    ray.cancel(future, recursive=True)
-    raise TimeoutError(f"Serialization task {task_id} timed out after seconds")
+await asyncio.to_thread(ray.wait, [future])
+try:
+    doc = await future
+    return doc
+except TaskCancelledError:
+    raise
```
🧹 Nitpick comments (10)
openrag/components/indexer/loaders/base.py (1)
153-153: Clarify the formatting change.

The addition of extra blank lines around the image description content may affect downstream parsing. Ensure this formatting change is intentional and compatible with consumers of this XML-wrapped content.
.hydra_config/rag/base.yaml (1)
1-4: LGTM!

Good extraction of shared RAG configuration. Consider updating the comment, since this base config is used by both `ChatBotRag` and `SimpleRag`, not just chatbot RAG.

```diff
-# Config for chatbot RAG
+# Base config for RAG modes
 mode: ''
 chat_history_depth: 4
 max_contextualized_query_len: 512
```
openrag/components/test_files.py (1)
60-83: Consider edge-case validation for empty filenames.

The test case at line 77 expects `sanitize_filename("")` to return `""`. While this test may pass, an empty filename could cause issues in actual file operations (e.g., `save_file_to_disk`). Consider whether `sanitize_filename` should validate against empty strings and raise an error, or whether the calling code should handle this validation.
openrag/routers/indexer.py (1)
250-266: Consistent filename handling, but improve exception chaining.

The filename sanitization and metadata handling mirrors `add_file` appropriately. However, the exception handling at lines 253-258 could be improved by using `raise ... from err` or `raise ... from None` to preserve the exception chain, as flagged by static analysis. Apply this diff:

```diff
 try:
     original_filename = file.filename
     file.filename = sanitize_filename(file.filename)
     file_path = await save_file_to_disk(file, save_dir, with_random_prefix=True)
-except Exception:
+except Exception as e:
     log.exception("Failed to save file to disk.")
     raise HTTPException(
         status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
         detail="Failed to save uploaded file.",
-    )
+    ) from e
```

Based on static analysis hints.
openrag/components/utils.py (2)
125-147: Consider reusing the LLM instance for token counting.

A new `ChatOpenAI` instance is created on every call to `format_context`. This incurs unnecessary overhead for repeated invocations. Consider caching the LLM instance or accepting it as a parameter.

```diff
+_token_counter_llm = None
+
+def _get_token_counter():
+    global _token_counter_llm
+    if _token_counter_llm is None:
+        _token_counter_llm = ChatOpenAI(**config.llm)
+    return _token_counter_llm
+
 def format_context(docs: list[Document], max_context_tokens: int = 4096) -> str:
     if not docs:
         return "No document found from the database"
-    llm = ChatOpenAI(**config.llm)
-    _length_function = llm.get_num_tokens
+    _length_function = _get_token_counter().get_num_tokens
```
150-157: Hard-coded path may cause issues outside Docker.

The `lang_detect_cache_dir` is hard-coded to `/app/model_weights/`, which assumes a specific deployment environment. Consider making this configurable via the config system.

```diff
 # Initialize language detector
-lang_detect_cache_dir = "/app/model_weights/"
+lang_detect_cache_dir = getattr(config.paths, "model_weights_dir", "/app/model_weights/")
 lang_detector_config = LangDetectConfig(
     max_input_length=1024,  # chars
     model="auto",
     cache_dir=lang_detect_cache_dir,
 )
```
openrag/components/indexer/chunker/utils.py (1)
17-31: Consider using a dataclass for MDElement.

The `MDElement` class is a simple data container. Using `@dataclass` would reduce boilerplate and provide `__eq__`, `__hash__`, etc. automatically.

```diff
+from dataclasses import dataclass
+
-class MDElement:
-    """Class representing a segment of markdown content."""
-
-    def __init__(
-        self,
-        type: Literal["text", "table", "image"],
-        content: str,
-        page_number: Optional[int] = None,
-    ):
-        self.type = type  # 'text', 'table', 'image'
-        self.content = content
-        self.page_number = page_number
-
-    def __repr__(self):
-        return f"Element(type={self.type}, page_number={self.page_number}, content={self.content[:100]}...)"
+@dataclass
+class MDElement:
+    """Class representing a segment of markdown content."""
+    type: Literal["text", "table", "image"]
+    content: str
+    page_number: Optional[int] = None
+
+    def __repr__(self):
+        return f"Element(type={self.type}, page_number={self.page_number}, content={self.content[:100]}...)"
```
openrag/components/indexer/vectordb/vectordb.py (2)
387-396: Add `strict=True` to zip for safety.

Per static analysis hint B905, adding `strict=True` ensures the iterables have equal length and prevents silent data loss if they don't match.

```diff
-    for chunk, vector, order_metadata in zip(chunks, vectors, order_metadata_l):
+    for chunk, vector, order_metadata in zip(chunks, vectors, order_metadata_l, strict=True):
```
555-602: N+1 query problem for surrounding chunks; consider batching with the IN operator.

The current code makes a separate query for each surrounding chunk. Milvus supports the IN operator for batch queries in filter expressions. To optimize, batch the queries:

```python
# Batch query using IN operator instead of N separate queries
filter_expr = f"section_id in {section_ids}"
responses = await self._async_client.query(
    collection_name=self.collection_name,
    filter=filter_expr,
    limit=len(section_ids),
)
```

If you prefer keeping the parallel approach with `asyncio.gather()`, add a comment explaining the trade-off. Use `filter_params` for proper parameter escaping if needed.
openrag/components/indexer/chunker/chunker.py (1)
62-66: Catching broad Exception masks specific errors.

Per static analysis (BLE001), catching `Exception` broadly can hide unexpected issues. Consider catching more specific exceptions, or at least logging the exception type.

```diff
 except Exception as e:
     logger.warning(
-        f"Error contextualizing chunk of document `(unknown)`: {e}"
+        f"Error contextualizing chunk of document `(unknown)`: {type(e).__name__}: {e}"
     )
     return ""
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (36)
- `.coderabbit.yaml` (1 hunks)
- `.github/workflows/smoke_test.yaml` (2 hunks)
- `.github/workflows/smoke_test/.env` (1 hunks)
- `.hydra_config/chunker/semantic_splitter.yaml` (0 hunks)
- `.hydra_config/chunker/token_splitter.yaml` (0 hunks)
- `.hydra_config/config.yaml` (3 hunks)
- `.hydra_config/rag/ChatBotRag.yaml` (1 hunks)
- `.hydra_config/rag/SimpleRag.yaml` (1 hunks)
- `.hydra_config/rag/base.yaml` (1 hunks)
- `.hydra_config/retriever/base.yaml` (1 hunks)
- `docker-compose.yaml` (2 hunks)
- `docs/content/docs/documentation/API.mdx` (1 hunks)
- `docs/content/docs/documentation/backup_restore.md` (6 hunks)
- `docs/content/docs/documentation/env_vars.md` (4 hunks)
- `openrag/components/files.py` (1 hunks)
- `openrag/components/indexer/chunker/chunker.py` (2 hunks)
- `openrag/components/indexer/chunker/test_chunking.py` (1 hunks)
- `openrag/components/indexer/chunker/utils.py` (1 hunks)
- `openrag/components/indexer/embeddings/openai.py` (2 hunks)
- `openrag/components/indexer/loaders/base.py` (1 hunks)
- `openrag/components/indexer/vectordb/vectordb.py` (11 hunks)
- `openrag/components/pipeline.py` (3 hunks)
- `openrag/components/prompts/prompts.py` (2 hunks)
- `openrag/components/retriever.py` (5 hunks)
- `openrag/components/test_files.py` (3 hunks)
- `openrag/components/utils.py` (3 hunks)
- `openrag/routers/indexer.py` (4 hunks)
- `openrag/routers/openai.py` (6 hunks)
- `openrag/routers/utils.py` (1 hunks)
- `openrag/scripts/backup.sh.example` (1 hunks)
- `openrag/scripts/restore.sh.example` (1 hunks)
- `prompts/example1/chunk_contextualizer_tmpl.txt` (1 hunks)
- `prompts/example1/sys_prompt_tmpl.txt` (1 hunks)
- `pyproject.toml` (1 hunks)
- `pytest.ini` (1 hunks)
- `quick_start/docker-compose.yaml` (2 hunks)
💤 Files with no reviewable changes (2)
- .hydra_config/chunker/semantic_splitter.yaml
- .hydra_config/chunker/token_splitter.yaml
🧰 Additional context used
🧬 Code graph analysis (7)
openrag/routers/openai.py (1)
openrag/models/openai.py (2)
`OpenAIChatCompletionRequest` (14-30), `OpenAICompletionRequest` (75-92)
openrag/components/pipeline.py (1)
openrag/components/utils.py (1)
`format_context` (125-147)
openrag/components/utils.py (2)
openrag/utils/logger.py (1)
`get_logger` (10-47)
openrag/config/config.py (1)
`load_config` (12-29)
openrag/components/test_files.py (1)
openrag/components/files.py (2)
`sanitize_filename` (13-32), `save_file_to_disk` (42-68)
openrag/routers/indexer.py (1)
openrag/components/files.py (2)
`sanitize_filename` (13-32), `save_file_to_disk` (42-68)
openrag/components/indexer/chunker/test_chunking.py (1)
openrag/components/indexer/chunker/utils.py (5)
`MDElement` (17-31), `chunk_table` (191-253), `clean_markdown_table_spacing` (256-280), `get_chunk_page_number` (110-145), `split_md_elements` (53-107)
openrag/components/indexer/vectordb/vectordb.py (1)
openrag/components/indexer/indexer.py (1)
`chunk` (53-57)
🪛 Gitleaks (8.30.0)
docs/content/docs/documentation/API.mdx
[high] 233-236: Discovered a potential authorization token provided in a curl command header, which could compromise the curl accessed resource.
(curl-auth-header)
🪛 LanguageTool
docs/content/docs/documentation/API.mdx
[grammar] ~226-~226: Ensure spelling is correct
Context: ... "stream": false }' ``` You can also direclty use this endpoint with no RAG pipeline,...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 Ruff (0.14.8)
openrag/routers/indexer.py
255-258: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
openrag/components/indexer/chunker/chunker.py
61-61: Consider moving this statement to an else block
(TRY300)
62-62: Do not catch blind exception: Exception
(BLE001)
103-103: Do not catch blind exception: Exception
(BLE001)
117-117: Unused method argument: kwargs
(ARG002)
openrag/components/indexer/vectordb/vectordb.py
388-388: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
532-532: Consider moving this statement to an else block
(TRY300)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: index-backup-restore
🔇 Additional comments (45)
docs/content/docs/documentation/API.mdx (2)
221-221: LGTM: temperature value updated.

The lower temperature value (0.1) provides more deterministic outputs, which is appropriate for documentation examples.
233-248: LGTM: non-RAG usage example is clear and consistent.

The example correctly demonstrates using an empty model string to bypass the RAG pipeline, and the temperature value matches the earlier example for consistency.

Note: the Gitleaks warning about the authorization header is a false positive; `YOUR_AUTH_TOKEN` is clearly a documentation placeholder, not an actual credential.
openrag/scripts/restore.sh.example (1)
7-7: LGTM!

The addition of `--env-file .env` ensures environment variables are properly loaded during restore operations, maintaining consistency with other docker-compose invocations across the PR.
.github/workflows/smoke_test/.env (1)
29-30: LGTM!

The new `MAX_MODEL_LEN` environment variable, with a clear comment, provides good guidance for adjusting the embedder model's maximum length based on model capabilities.
.coderabbit.yaml (1)
1-5: LGTM!

The CodeRabbit configuration is properly structured: it enables auto-review on all branches while allowing reviews to complete even if the PR is closed. The settings align with the provided schema.
pytest.ini (1)
8-11: LGTM!

The environment variables properly configure test paths for Hydra configuration, prompts, and logs, aligning with the broader configuration refactoring in this PR.
openrag/components/indexer/embeddings/openai.py (2)
17-17: LGTM!

The `max_model_len` configuration parameter is properly extracted with a sensible default of 8192, aligning with the environment variable changes across the PR.
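Client-side truncation to `max_model_len` can be sketched with a pluggable tokenizer. The helper below is illustrative, not from this repo; in practice `encode`/`decode` would come from the embedder's tokenizer (e.g. `tiktoken.get_encoding("cl100k_base")`):

```python
from typing import Callable

def truncate_text(
    text: str,
    max_model_len: int,
    encode: Callable[[str], list],
    decode: Callable[[list], str],
) -> str:
    # Encode, cut to the embedder's context length, decode back.
    tokens = encode(text)
    if len(tokens) <= max_model_len:
        return text
    return decode(tokens[:max_model_len])

# Toy whitespace "tokenizer" for demonstration only.
enc = lambda s: s.split()
dec = lambda ts: " ".join(ts)
print(truncate_text("a b c d e", 3, enc, dec))  # prints: a b c
```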
38-40: Remove or replace the unsupported `truncate_prompt_tokens` parameter.

The `truncate_prompt_tokens` parameter is not documented in the official OpenAI embeddings API. The embeddings API enforces a maximum of 8192 tokens per input, and the official guidance recommends manual truncation using tokenization before sending requests. Remove this parameter or implement manual text truncation using tiktoken before calling the API.
prompts/example1/chunk_contextualizer_tmpl.txt (1)
1-22: LGTM!

The restructured prompt template significantly improves clarity and usability, with explicit sections for Core Principles, Output Format, and Examples. The standardized field names (Filename, First Chunks, Previous Chunks, Current Chunk) and clear constraints (1-2 sentences, plain text) make the contextualization task more consistent and predictable.
openrag/components/files.py (1)
1-11: LGTM!

Import organization is appropriate, with the new modules (`asyncio`, `re`, `secrets`, `time`) needed for the added functionality.
.hydra_config/rag/SimpleRag.yaml (1)
1-3: LGTM!

Clean refactoring to use Hydra's defaults composition pattern. Shared settings are properly extracted to `base.yaml`, reducing duplication across RAG configurations.
openrag/routers/utils.py (1)
233-234: LGTM!Minor formatting adjustment.
prompts/example1/sys_prompt_tmpl.txt (1)
11-11: LGTM!Clear instruction to prevent exposing internal file metadata in responses, which improves the user experience by keeping answers focused on content rather than source details.
docker-compose.yaml (1)
47-47: LGTM: consistent alignment of the max-model-len default.

The default value for `--max-model-len` has been updated from 8194 to 8192 in both the x-vllm_template and vllm-cpu configurations, aligning with the embedder configuration changes across the PR.

Also applies to: 146-146
quick_start/docker-compose.yaml (1)
45-45: LGTM: consistent with the main docker-compose.yaml.

The max-model-len default has been updated to 8192, matching the changes in the main docker-compose.yaml file and maintaining consistency across configurations.
Also applies to: 142-142
docs/content/docs/documentation/backup_restore.md (1)
13-19: LGTM: documentation properly updated for environment variable loading.

The documentation now consistently instructs users to include `--env-file .env` in docker compose commands for backup and restore operations, aligning with the new environment-variable loading pattern established across the PR.

Also applies to: 26-32, 40-44, 55-62, 66-73, 78-82
.hydra_config/rag/ChatBotRag.yaml (1)
1-3: LGTM: proper refactoring to use base configuration.

The configuration now properly references a base configuration pattern, centralizing common RAG settings (chat_history_depth, max_contextualized_query_len) in base.yaml and reducing duplication.
openrag/scripts/backup.sh.example (1)
6-12: LGTM: consistent environment variable loading.

The backup script now includes `--env-file .env`, consistent with the pattern established in documentation and workflow changes.
100-101: LGTM: workflow properly loads environment variables.

The smoke test workflow now consistently loads environment variables from `.env` for backup and restore operations, aligning with the new environment configuration pattern.

Also applies to: 110-110, 115-116
.hydra_config/retriever/base.yaml (1)
3-4: LGTM: new surrounding-chunks retrieval configuration.

The new `with_surrounding_chunks` option enables retrieval of surrounding chunks for each retrieved chunk, with a sensible default of true. This aligns with the broader PR changes to the retriever and vectordb components.
openrag/components/pipeline.py (2)
152-152: LGTM: token-aware context formatting properly integrated.

The computed `max_context_tokens` is now properly passed to `format_context()` in both the chat-completion and completions paths, enabling token-aware context truncation based on retrieval configuration.

Also applies to: 183-183
77-79: Align config access patterns to use consistent bracket notation.

Lines 38–39 access `config.reranker["top_k"]` using bracket notation, while lines 77–78 use the defensive `.get()` pattern for the same keys. Since `reranker.top_k` and `chunker.chunk_size` are always defined in the Hydra config with built-in defaults (top_k=10, chunk_size=512), the defensive `.get()` calls are redundant. Use bracket notation consistently throughout:

```python
self.max_context_tokens = config.reranker["top_k"] * config.chunker["chunk_size"]
```

This improves readability and maintains a uniform config access pattern.
openrag/components/test_files.py (2)
32-57: LGTM!

The test correctly uses monkeypatching to mock `make_unique_filename`, ensuring deterministic behavior while testing the `with_random_prefix` flag. The test validates both the filename transformation and file content persistence.
5-6: The `components.files` import works reliably in this project because `pytest.ini` is configured with `pythonpath = ./openrag`, which adds the openrag directory to sys.path. The import is not fragile to different test-runner working directories when pytest is used correctly. However, for better clarity and to reduce reliance on pytest configuration, consider using a relative import: `from .files import sanitize_filename, save_file_to_disk`, since this test file is part of the openrag.components package.

Likely an incorrect or invalid review comment.
.hydra_config/config.yaml (3)
53-53: LGTM!

Increasing `reranker.top_k` from 5 to 10 aligns with modern LLMs having larger context windows, allowing more documents to be included for better RAG results. The inline comment clearly explains the rationale.
34-34: No action required. The default max_model_len value of 8192 correctly matches jina-embeddings-v3's maximum input length of 8192 tokens.
3-3: Clarify chunker strategy description.

The config sets `recursive_splitter` as the default chunker. However, there's no evidence of prior support for `semantic_splitter` or `markdown_splitter` options in the codebase; only `recursive_splitter` has ever been implemented as a configurable Hydra option. The documentation already accurately reflects this single available strategy.

Likely an incorrect or invalid review comment.
openrag/routers/indexer.py (1)
129-145: LGTM!

The filename handling follows good practices:

- Preserves `original_filename` for display and metadata tracking
- Sanitizes the filename to prevent filesystem issues (special characters, path traversal)
- Uses the centralized `save_file_to_disk` with a random prefix to prevent collisions
- Stores all relevant metadata for traceability
save_file_to_diskwith random prefix to prevent collisions- Stores all relevant metadata for traceability
openrag/routers/openai.py (2)
104-115: LGTM!

The `is_direct_llm_model` helper function centralizes the logic for determining when to bypass RAG partitions. The function is well documented with a clear docstring, and the implementation correctly handles the three cases: `None`, empty string, or matching the configured default model.
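The described three-case behavior can be sketched as follows (a hedged illustration of the helper the review describes, not the repo's implementation; the default model name is a placeholder for the value read from config):

```python
from typing import Optional

DEFAULT_LLM_MODEL = "my-llm"  # placeholder for config.llm["model"]

def is_direct_llm_model(model: Optional[str], default_model: str = DEFAULT_LLM_MODEL) -> bool:
    # Bypass RAG when no model is given, the model is an empty string,
    # or it matches the configured default LLM model.
    return model is None or model == "" or model == default_model

print(is_direct_llm_model(""))               # → True (direct LLM, no RAG)
print(is_direct_llm_model("rag_partition"))  # → False (routed through RAG)
```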
171-176: Consistent usage of the new helper function.

Both endpoints (`/chat/completions` and `/completions`) consistently use `is_direct_llm_model(request)` to determine partition routing. This refactoring improves maintainability by centralizing the decision logic and makes the code more readable.

Also applies to: 286-291
openrag/components/indexer/chunker/test_chunking.py (4)
12-116: LGTM!

The `TestSplitMdElements` class provides comprehensive coverage for markdown element parsing:
- Basic text, table, and image parsing
- Edge case: tables inside image descriptions are correctly ignored
- Page number assignment for different element types
The test assertions are specific and verify both element count and content, ensuring robust validation of the parsing logic.
118-152: LGTM!

The `TestGetChunkPageNumber` class thoroughly tests page-number assignment logic:
- Chunks with no markers inherit from previous chunk
- Markers at start/end/middle are handled correctly
- Boundary conditions are explicitly tested
This ensures accurate page tracking for document chunks.
154-192: LGTM!

The `TestCleanMarkdownTableSpacing` class validates table normalization:
- Excessive and inconsistent spacing is corrected
- Empty cells are preserved
- Consistent "| cell | cell |" format is enforced
These tests ensure tables are properly formatted before further processing.
194-252: LGTM!

The `TestChunkTable` class validates table chunking behavior:
- Small tables remain unchunked
- Large tables are split while preserving group integrity
- Headers are included in all chunks
- Mock length function (4 chars/token) provides deterministic testing
The test at lines 216-248 is particularly important, ensuring that related table rows (country groups) are not split across chunks, which would break semantic coherence.
docs/content/docs/documentation/env_vars.md (3)
97-109: LGTM!

The documentation clearly reflects the chunking strategy consolidation:

- `CHUNKER` now supports only `recursive_splitter`
- The strategy is well explained, with a reference to the LangChain documentation
- Users understand that text is hierarchically split while respecting `CHUNK_SIZE`
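The hierarchical splitting idea behind `recursive_splitter` can be illustrated with a toy implementation (a simplified sketch of the general technique, not LangChain's `RecursiveCharacterTextSplitter`): try to split on the coarsest separator first, and recurse with finer separators only for pieces that are still too large.

```python
def recursive_split(text: str, chunk_size: int,
                    separators=("\n\n", "\n", " ")) -> list[str]:
    """Split text hierarchically so every chunk fits within chunk_size chars."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-size slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, buf = [], ""
    for part in text.split(sep):
        candidate = buf + sep + part if buf else part
        if len(candidate) <= chunk_size:
            buf = candidate  # keep accumulating small parts
        else:
            if buf:
                chunks.append(buf)
            if len(part) > chunk_size:
                # Part is still too big: recurse with finer separators.
                chunks.extend(recursive_split(part, chunk_size, rest))
                buf = ""
            else:
                buf = part
    if buf:
        chunks.append(buf)
    return chunks

print(len(recursive_split("x" * 40, 15)))  # → 3
```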
119-119: LGTM!

The new `MAX_MODEL_LEN` parameter is well documented:
- Default value (8192) matches the configuration
- Clear explanation of truncation behavior when chunks exceed the limit
- Users understand this controls embedding model context length
186-191: LGTM!

The retriever configuration is well documented:

- `RETRIEVER_TYPE` options are clearly listed (single, multiQuery, hyde)
- The new `WITH_SURROUNDING_CHUNKS` parameter is explained with its default value
- Users understand that enabling this retrieves adjacent chunks for richer context
openrag/components/retriever.py (3)
35-56: LGTM!

The `BaseRetriever` properly adds support for surrounding chunks:

- The new `with_surrounding_chunks` parameter (default `True`) enables retrieving adjacent chunks
- The parameter is stored and forwarded to the vector database search
- The default value makes sense for RAG use cases where additional context improves results
59-60: LGTM!

The retriever classes are updated consistently:

- `SingleRetriever` naming corrected (typo fix)
- `MultiQueryRetriever` and `HyDeRetriever` properly forward `with_surrounding_chunks` to their respective search methods
- All retriever types now support the surrounding-chunks feature uniformly
Also applies to: 95-102, 136-142
146-160: LGTM!

The `RetrieverFactory` is properly updated:

- The mapping for "single" now points to the correctly named `SingleRetriever`
- The guard condition changed from checking `retriever_type is None` to `retriever_cls is None`, which is more appropriate since it validates the actual class lookup result

openrag/components/prompts/prompts.py (2)
33-33: All references to the renamed constant have been updated. The constant was successfully renamed from `CHUNK_CONTEXTUALIZER` to `CHUNK_CONTEXTUALIZER_PROMPT` for naming consistency with other prompt variables. No remaining references to the old name exist in the codebase.
11-27: The function already returns a single string, as indicated by the `-> str` return type annotation (line 15). All callers throughout the codebase (in `retriever.py`, `pipeline.py`, `loaders/base.py`, and `chunker.py`) correctly handle the string return value without any tuple unpacking. No changes are needed. Likely an incorrect or invalid review comment.
openrag/components/indexer/chunker/chunker.py (3)
283-304: RecursiveSplitter initializes text_splitter twice. The parent `BaseChunker.__init__` may lazily initialize `text_splitter` via `split_text`, but `RecursiveSplitter` also initializes it in its own `__init__`. This is fine but could be simplified. The explicit initialization in `RecursiveSplitter` with custom separators is intentional and overrides the lazy fallback in `BaseChunker.split_text`.
40-53: No changes needed: the code is compatible with the project's Python version requirement. The f-string syntax using backslash escape sequences in embedded expressions on lines 44 and 47 is valid and supported. Python 3.12 allows backslashes and unicode escape sequences in f-string expression components, and the project explicitly requires `requires-python = ">=3.12"`, making this code compatible with all supported Python versions. Likely an incorrect or invalid review comment.
194-254: Python dict merge with explicit keys correctly overrides upstream values; the concern doesn't apply here. At lines 218-221 and 230-234, rightmost dictionary values take precedence in the pattern `{**metadata, "page": x}`. Critically, the upstream `metadata` from `doc.metadata` (set at line 256) comes from document loaders and contains fields like `file_id` and `partition`, not `page` or `chunk_type`. The explicit keys intentionally set chunk-specific page numbers calculated during chunking (e.g., `subtable.page_number`, `e.page_number`, `start_page`), not carry upstream metadata forward. This behavior is correct and intentional. Likely an incorrect or invalid review comment.
```diff
 You can also direclty use this endpoint with no RAG pipeline, i.e. to directly use the LLM.
-For that, simply do not specify any model:
+For that, instead of using the `openrag` prefix for the model, you can:
+- Specify no model
+- Specify an empty model
+- Specify the openRAG configured model, e.g. `Mistral-Small-3.1-24B-Instruct-2503`.
```
Fix typo in user-facing documentation.
Line 226 contains a spelling error: "direclty" should be "directly". The expanded guidance for non-RAG usage is helpful and improves clarity.
Apply this diff to correct the typo:
```diff
-You can also direclty use this endpoint with no RAG pipeline, i.e. to directly use the LLM.
+You can also directly use this endpoint with no RAG pipeline, i.e. to directly use the LLM.
```
🤖 Prompt for AI Agents
In docs/content/docs/documentation/API.mdx around lines 226 to 230, fix the
spelling error "direclty" to "directly" in the sentence that explains using the
endpoint without a RAG pipeline; update that one word in-place so the sentence
reads "You can also directly use this endpoint..." and leave the rest of the
guidance unchanged.
```python
def sanitize_filename(filename: str) -> str:
    # Split filename into name and extension
    path = Path(filename)
    name = path.stem
    ext = path.suffix

    # Remove special characters (keep only word characters and hyphens temporarily)
    name = re.sub(r"[^\w\-]", "_", name)

    # Replace hyphens with underscores
    name = name.replace("-", "_")

    # Collapse multiple underscores
    name = re.sub(r"_+", "_", name)

    # Remove leading/trailing underscores
    name = name.strip("_")

    # Reconstruct filename
    return name + ext
```
Handle edge case where sanitized name becomes empty.
If the filename consists only of special characters (e.g., `---.txt`), the sanitized name becomes empty, resulting in just `.txt`. Consider adding a fallback:
```diff
     # Remove leading/trailing underscores
     name = name.strip("_")

+    # Fallback if name is empty after sanitization
+    if not name:
+        name = "file"
+
     # Reconstruct filename
     return name + ext
```
🤖 Prompt for AI Agents
In openrag/components/files.py around lines 13 to 32 the sanitized base name can
become empty (e.g., "---.txt") producing a filename like ".txt"; detect when
name is empty after stripping and replace it with a safe fallback (e.g., "file"
or "untitled") before reconstructing the filename, ensuring the extension is
preserved and the result does not start with a dot.
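For reference, the proposed fallback can be exercised in isolation. This is a minimal sketch, not the project's implementation; the `"file"` placeholder is the suggestion's assumption:

```python
import re
from pathlib import Path


def sanitize_filename(filename: str) -> str:
    """Sanitize a filename; fall back to a placeholder when nothing survives."""
    path = Path(filename)
    name, ext = path.stem, path.suffix
    name = re.sub(r"[^\w\-]", "_", name)  # replace special characters
    name = name.replace("-", "_")         # hyphens -> underscores
    name = re.sub(r"_+", "_", name)       # collapse runs of underscores
    name = name.strip("_")
    if not name:                          # e.g. "---.txt" would otherwise yield ".txt"
        name = "file"
    return name + ext
```

With this guard, `sanitize_filename("---.txt")` yields `file.txt` instead of a bare extension.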
```python
        filename = chunks[0].metadata.get("filename")
        if not self.contextual_retrieval or len(chunks) < 2:
            return [
                Document(
                    page_content=BASE_CHUNK_FORMAT.format(
                        chunk_context="", filename=filename, content=c.page_content
                    ),
                    metadata=c.metadata,
                )
                for c in chunks
            ]
```
Potential IndexError when chunks list is empty.
Line 139 accesses `chunks[0].metadata` without first checking if `chunks` is non-empty. The check `len(chunks) < 2` at line 140 doesn't prevent this.
```diff
     async def _apply_contextualization(
         self, chunks: list[Document], lang: Literal["en", "fr"] = "en"
     ) -> list[Document]:
         """Apply contextualization if enabled."""
+        if not chunks:
+            return []
         filename = chunks[0].metadata.get("filename")
```
🤖 Prompt for AI Agents
In openrag/components/indexer/chunker/chunker.py around lines 139-149, the code
accesses chunks[0].metadata which will raise IndexError when chunks is empty;
add an explicit guard before that access (e.g., if not chunks: return []), then
proceed to read filename and the existing branch logic—this ensures the function
returns an empty list for empty input instead of crashing.
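The guard pattern can be checked with a minimal stand-in (plain dicts instead of LangChain `Document`s, and a hypothetical format string; both are assumptions for illustration):

```python
# Stand-in for the project's BASE_CHUNK_FORMAT template (hypothetical shape)
BASE_CHUNK_FORMAT = "[{filename}] {chunk_context}{content}"


def apply_contextualization(chunks: list[dict]) -> list[str]:
    """Format chunks, guarding against empty input before touching chunks[0]."""
    if not chunks:
        return []
    filename = chunks[0].get("filename", "")
    return [
        BASE_CHUNK_FORMAT.format(chunk_context="", filename=filename, content=c["content"])
        for c in chunks
    ]
```

The empty-input case now returns `[]` instead of raising `IndexError`.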
```diff
 def split_md_elements(md_text: str) -> list[MDElement]:
     """
     Split markdown text into segments of text, tables, and images.
-    Returns a list of tuples: (type, content) where type is 'text', 'table', or 'image'
+    Returns a list of tuples:
+    - ('text', content) for text segments
+    - ('table', content, page_number) for tables
+    - ('image', content, page_number) for images
     """
+    # Find all page markers
+    page_markers = []
+    for match in PAGE_RE.finditer(md_text):
+        page_markers.append((match.start(), int(match.group(1))))
+    page_markers.sort()  # Ensure they're in order
+
     all_matches = []

     # Find image matches first and record their spans
     image_spans = []
     for match in IMAGE_RE.finditer(md_text):
         span = match.span()
-        all_matches.append((span, "image", match.group(1)))
+        page_num = get_page_number(span[0], page_markers)
+        all_matches.append((span, "image", match.group(1).strip(), page_num))
         image_spans.append(span)

     # Find table matches, but skip those that are fully inside an image description
     for match in TABLE_RE.finditer(md_text):
         span = match.span()
         if not any(span_inside(span, image_span) for image_span in image_spans):
-            all_matches.append((span, "table", match.group(1)))
+            page_num = get_page_number(span[0], page_markers)
+            all_matches.append((span, "table", match.group(1).strip(), page_num))

     # Sort matches by start position
     all_matches.sort(key=lambda x: x[0][0])

     parts = []
     last = 0

-    for (start, end), match_type, content in all_matches:
+    for (start, end), match_type, content, page_num in all_matches:
         # Add text segment before this match if there is any
         if start > last:
             text_segment = md_text[last:start]
             if text_segment.strip():  # Only add non-empty text segments
                 parts.append(("text", text_segment.strip()))

-        # Add the matched segment
-        parts.append((match_type, content.strip()))
+        # Add the matched segment with page number
+        parts.append((match_type, content, page_num))
         last = end

     # Add remaining text after the last match
     if last < len(md_text):
         remaining_text = md_text[last:]
         if remaining_text.strip():  # Only add non-empty text segments
             parts.append(("text", remaining_text.strip()))
-    return parts
+
+    return [MDElement(*p) for p in parts]
```
Docstring mismatch with actual return type.
The docstring states the function returns tuples, but line 107 returns a list of `MDElement` instances. Update the docstring to reflect the actual return type.
```diff
 def split_md_elements(md_text: str) -> list[MDElement]:
     """
     Split markdown text into segments of text, tables, and images.
-    Returns a list of tuples:
-    - ('text', content) for text segments
-    - ('table', content, page_number) for tables
-    - ('image', content, page_number) for images
+    Returns a list of MDElement instances with type, content, and page_number.
     """
```
"""🤖 Prompt for AI Agents
In openrag/components/indexer/chunker/utils.py around lines 53 to 107, the
docstring incorrectly says the function returns tuples while the function
actually returns a list of MDElement instances; update the docstring to
accurately describe the return type (e.g., "Returns a list of MDElement objects
representing segments: MDElement(type='text'|'table'|'image', content,
page_number?)") and adjust the bullet list to describe the MDElement fields
rather than tuple shapes.
```python
def chunk_table(
    table_element: MDElement,
    chunk_size: int = 512,
    length_function: Optional[Callable[[str], int]] = None,
) -> list[MDElement]:
    txt = clean_markdown_table_spacing(table_element.content)
    header_lines, groups = parse_markdown_table(txt)

    # Convert header lines → text block
    header_text = "\n".join(header_lines)

    # Convert group lists → text blocks
    group_texts = ["\n".join(g) for g in groups]

    # Precompute token length
    header_ntoks = length_function(header_text)
    groups_ntoks = [length_function(g) for g in group_texts]

    subtables = []
    current_rows = [header_text]
    current_size = header_ntoks

    prev_last_row = None  # for overlap

    for group_txt, g_ntoks in zip(group_texts, groups_ntoks, strict=True):
        # If adding this group exceeds the chunk limit
        if current_size + g_ntoks > chunk_size:
            # ---- finalize current subtable ----
            subtables.append("\n".join(current_rows))

            # ---- start new subtable with OVERLAP ----
            current_rows = [header_text]  # always restart headers
            if prev_last_row:
                current_rows.append(prev_last_row)  # add overlapping row

            current_rows.append(group_txt)
            current_size = (
                header_ntoks
                + (length_function(prev_last_row) if prev_last_row else 0)
                + g_ntoks
            )

        else:
            # fits → just append normally
            current_rows.append(group_txt)
            current_size += g_ntoks

        # track last row for overlap
        prev_last_row = group_txt

    # finalize last subtable
    if current_rows:
        subtables.append("\n".join(current_rows))

    # wrap into MDElement list
    return [
        MDElement(
            type="table",
            content=subtable,
            page_number=table_element.page_number,
        )
        for subtable in subtables
    ]
```
Missing None check before calling length_function.
If `length_function` is `None`, line 206 will raise a `TypeError`. Either enforce the parameter as required or add a fallback.
```diff
 def chunk_table(
     table_element: MDElement,
     chunk_size: int = 512,
     length_function: Optional[Callable[[str], int]] = None,
 ) -> list[MDElement]:
+    if length_function is None:
+        length_function = len  # Fallback to character count
+
     txt = clean_markdown_table_spacing(table_element.content)
     header_lines, groups = parse_markdown_table(txt)
```
🤖 Prompt for AI Agents
In openrag/components/indexer/chunker/utils.py around lines 191-253, the code
calls length_function(...) without checking for None which will raise a
TypeError if length_function is not passed; fix by providing a default length
function at the top of chunk_table (e.g., default to Python len on strings or a
small wrapper that handles None/empty strings) and use that local function for
all subsequent calls, and also guard the call length_function(prev_last_row)
with a conditional or by ensuring prev_last_row is a string (e.g., pass "" when
None) so no None value is ever passed into the length function.
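The grouping loop with the suggested `len` fallback can be demonstrated on plain strings (a simplified sketch without markdown parsing; `chunk_rows` and its signature are illustrative, not the project's API):

```python
from typing import Callable, Optional


def chunk_rows(
    header: str,
    rows: list[str],
    chunk_size: int,
    length_function: Optional[Callable[[str], int]] = None,
) -> list[str]:
    """Split rows into chunks, repeating the header and overlapping the last row."""
    if length_function is None:
        length_function = len  # fallback: character count instead of token count
    chunks: list[str] = []
    current = [header]
    size = length_function(header)
    prev = None
    for row in rows:
        r = length_function(row)
        if size + r > chunk_size:
            chunks.append("\n".join(current))
            # restart with the header and the overlapping previous row
            current = [header] + ([prev] if prev else [])
            size = length_function(header) + (length_function(prev) if prev else 0)
        current.append(row)
        size += r
        prev = row
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Every chunk begins with the header, and consecutive chunks share one overlapping row, mirroring the structure of `chunk_table` above.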
```python
def _gen_chunk_order_metadata(n: int = 20) -> list[dict]:
    # Use base timestamp + index to ensure uniqueness
    base_ts = int(time.time_ns())
    ids: list[int] = [base_ts + i for i in range(n)]
    L = []
    for i in range(n):
        prev_chunk_id = ids[i - 1] if i > 0 else None
        next_chunk_id = ids[i + 1] if i < n - 1 else None
        L.append(
            {
                "prev_section_id": prev_chunk_id,
                "section_id": ids[i],
                "next_section_id": next_chunk_id,
            }
        )
    return L
```
Timestamp-based ID generation may collide under high concurrency.
Using `time.time_ns()` with sequential offsets could produce collisions if multiple processes call this function simultaneously. Consider using UUIDs or adding a random component.
```diff
+import uuid
+
 def _gen_chunk_order_metadata(n: int = 20) -> list[dict]:
-    # Use base timestamp + index to ensure uniqueness
-    base_ts = int(time.time_ns())
-    ids: list[int] = [base_ts + i for i in range(n)]
+    # Use UUID-based IDs to ensure uniqueness across processes
+    ids: list[int] = [uuid.uuid4().int >> 64 for _ in range(n)]  # 64-bit unique IDs
     L = []
     for i in range(n):
```
Alternatively, if integer IDs are required for Milvus compatibility, consider using a combination of process ID, timestamp, and random bits.
🤖 Prompt for AI Agents
In openrag/components/indexer/vectordb/vectordb.py around lines 1022 to 1037,
the current ID generation uses time.time_ns() with sequential offsets which can
collide across concurrent processes; replace that logic with a
collision-resistant approach — either generate section IDs as UUID strings
(e.g., uuid.uuid4().hex) and update the metadata keys/types accordingly, or if
integer IDs are strictly required, build a composite integer from time_ns plus
process id and random bits (e.g., combine time_ns << N | (pid << M) |
random_bits generated with secrets.randbits) to guarantee uniqueness across
processes; ensure imports (uuid or os and secrets) are added, update type hints
from list[int] to list[str] if using UUIDs, and run tests that write to Milvus
to confirm compatibility.
```python
def detect_language(text: str):
    outputs = lang_detector.detect(text, k=1)
    return outputs[0].get("lang")
```
Missing error handling in detect_language.
If `lang_detector.detect` returns an empty list or fails, accessing `outputs[0]` will raise an `IndexError`. Add defensive handling.
```diff
 def detect_language(text: str):
     outputs = lang_detector.detect(text, k=1)
-    return outputs[0].get("lang")
+    if outputs:
+        return outputs[0].get("lang")
+    return "en"  # Default fallback
```
🤖 Prompt for AI Agents
In openrag/components/utils.py around lines 160 to 162, detect_language
currently assumes lang_detector.detect returns a non-empty list and directly
indexes outputs[0], which can raise IndexError or propagate detection errors;
update the function to defensively handle empty or invalid outputs and
exceptions by wrapping the detection call in a try/except, check that outputs is
a non-empty list and that outputs[0] contains a "lang" key before returning it,
and on failure return a safe default (e.g., None or "unknown") and optionally
log the error for diagnostics.
```toml
"langdetect>=1.0.9",
"alembic>=1.17.0",
"fast-langdetect>=1.0.0",
```
Remove the duplicate language detection dependency.
Both `langdetect>=1.0.9` (line 47) and `fast-langdetect>=1.0.0` (line 49) are present in the dependencies. Since the PR migrates to `fast-langdetect`, the original `langdetect` dependency on line 47 should be removed to avoid conflicts.
Apply this diff to remove the duplicate:
```diff
     "html-to-markdown>=2.4.0",
-    "langdetect>=1.0.9",
     "alembic>=1.17.0",
     "fast-langdetect>=1.0.0",
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "langdetect>=1.0.9", | |
| "alembic>=1.17.0", | |
| "langdetect>=1.0.9", | |
| "fast-langdetect>=1.0.0", | |
| "html-to-markdown>=2.4.0", | |
| "alembic>=1.17.0", | |
| "fast-langdetect>=1.0.0", |
🤖 Prompt for AI Agents
In pyproject.toml around lines 47 to 49, remove the duplicate dependency entry
"langdetect>=1.0.9" (line 47) since the project is migrating to
"fast-langdetect>=1.0.0"; update the dependencies block so only fast-langdetect
remains and ensure punctuation/commas in the surrounding list remain valid after
deletion.
✅ Actions performed
Reviews paused.
Summary by CodeRabbit
Release Notes
New Features
Improvements
Documentation
✏️ Tip: You can customize this high-level summary in your review settings.