Merge for release v1.1.5 by paultranvan · Pull Request #175 · linagora/openrag

paultranvan · 2025-12-17T13:21:10Z

Summary by CodeRabbit

Release Notes

New Features
- Added retrieval of surrounding chunks in search results for enhanced context
- Introduced language detection capability for better multilingual support
- Added configurable embedding model context length support
Improvements
- Enhanced context formatting to respect token limits and prevent overflow
- Improved table chunking and markdown parsing with better page awareness
- Simplified default chunking to single strategy for consistency
- Increased default reranker result count for better coverage
- Better filename sanitization and unique file naming
Documentation
- Updated API documentation with improved examples and guidance
- Clarified environment variable configuration options
- Enhanced system prompts with clearer instruction rules

_{✏️ Tip: You can customize this high-level summary in your review settings.}

Fix/filename

…ils when some vars are missing

Add env vars when running backup/restore scripts as their might be fa…

…of the embedder model. This prevents errors in that case.

… each matched document to provide additional context.

Feat/chunking

Some openai-compatible libs make the `model` parameter mandatory, to query chat completions API. Thus, we now allow providing empty model, or actual LLM model defined in the config.

coderabbitai · 2025-12-17T13:21:23Z

Note

Reviews paused

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

This PR significantly refactors the document processing and retrieval pipeline by consolidating chunking strategies to recursive splitting only, introducing LLM-based chunk contextualization, adding surrounding chunk retrieval capabilities, integrating token-aware context formatting, and adding embedder truncation support. Configuration files are reorganized and environment variables updated accordingly.

Changes

Cohort / File(s)	Summary
CodeRabbit Configuration `.coderabbit.yaml`	Adds auto-review configuration with wildcard base_branches pattern and abort_on_close set to false
Hydra Configuration & Defaults `.hydra_config/config.yaml`, `.hydra_config/rag/base.yaml`, `.hydra_config/rag/ChatBotRag.yaml`, `.hydra_config/rag/SimpleRag.yaml`, `.hydra_config/retriever/base.yaml`	Consolidates RAG configuration into base.yaml defaults; removes semantic and markdown splitter options leaving only recursive_splitter; adds embedder.max_model_len (8192); increases reranker.top_k from 5 to 10; adds retriever.with_surrounding_chunks flag
Deleted Chunker Configurations `.hydra_config/chunker/semantic_splitter.yaml`, `.hydra_config/chunker/token_splitter.yaml`	Removes deprecated chunker configuration files for semantic and markdown splitters
Environment & Docker Setup `.github/workflows/smoke_test.yaml`, `.github/workflows/smoke_test/.env`, `docker-compose.yaml`, `quick_start/docker-compose.yaml`, `openrag/scripts/backup.sh.example`, `openrag/scripts/restore.sh.example`	Adds --env-file .env to docker compose invocations; adds MAX_MODEL_LEN=8192 environment variable; updates vllm max-model-len from 8194 to 8192
Chunking System Refactor `openrag/components/indexer/chunker/chunker.py`, `openrag/components/indexer/chunker/utils.py`, `openrag/components/indexer/chunker/test_chunking.py`	Introduces ChunkContextualizer class and BaseChunker abstract base; adds MDElement abstraction for markdown content; implements page-aware chunking with TABLE_RE multiline support and PAGE_RE markers; removes semantic and markdown splitters from factory; adds async contextualization pipeline; introduces test suite for chunking utilities
File Handling Utilities `openrag/components/files.py`, `openrag/components/test_files.py`, `openrag/routers/indexer.py`	Adds sanitize_filename() and make_unique_filename() functions; implements save_file_to_disk() for disk persistence; removes timeout handling from serialize_file(); updates router to use centralized file saving with original_filename metadata preservation
Vector Database & Retrieval `openrag/components/indexer/vectordb/vectordb.py`, `openrag/components/retriever.py`	Adds with_surrounding_chunks support to async_search and async_multi_query_search; introduces get_surrounding_chunks() method; adds chunk order metadata generation (_gen_chunk_order_metadata); renames SingleRetreiver to SingleRetriever; updates RetrieverFactory to use SingleRetriever for "single" key
Embeddings & Context Formatting `openrag/components/indexer/embeddings/openai.py`, `openrag/components/utils.py`, `openrag/components/pipeline.py`	Adds max_model_len to OpenAI embeddings with truncate_prompt_tokens support; introduces token-aware format_context() with ChatOpenAI token counter; adds detect_language() function with LangDetector; computes max_context_tokens in RagPipeline as reranker.top_k × chunker.chunk_size
Prompts & API Utilities `openrag/components/prompts/prompts.py`, `openrag/routers/openai.py`, `openrag/routers/utils.py`	Changes load_prompt() return type from tuple to string; renames CHUNK_CONTEXTUALIZER to CHUNK_CONTEXTUALIZER_PROMPT; adds is_direct_llm_model() helper for route determination; updates OpenAI endpoints documentation for empty model handling
Prompt Templates `prompts/example1/chunk_contextualizer_tmpl.txt`, `prompts/example1/sys_prompt_tmpl.txt`	Restructures chunk contextualizer prompt with explicit fields (Filename, First Chunks, Previous Chunks, Current Chunk) and new Core Principles/Output Format sections; adds citation-disabling rule to system prompt
Documentation `docs/content/docs/documentation/API.mdx`, `docs/content/docs/documentation/backup_restore.md`, `docs/content/docs/documentation/env_vars.md`	Updates API documentation with lower temperature examples (0.7→0.1); adds --env-file .env to backup/restore commands; documents new CHUNKER/MAX_MODEL_LEN/WITH_SURROUNDING_CHUNKS variables; removes references to deprecated splitters
Build & Test Configuration `pyproject.toml`, `pytest.ini`	Replaces langdetect with fast-langdetect dependency; adds pytest.ini environment variables (CONFIG_PATH, PROMPTS_DIR, LOG_DIR)
Loader & Base Utilities `openrag/components/indexer/loaders/base.py`	Minor formatting: adds blank lines around XML-wrapped image descriptions

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Upload as File Upload
    participant Sanitize as File Sanitizer
    participant Save as Disk Storage
    participant Chunk as Chunking Engine
    participant Context as LLM Contextualizer
    participant Embed as Embedder
    participant VectorDB as Vector Database

    User->>Upload: Upload document
    Upload->>Sanitize: Get sanitized filename
    Sanitize-->>Save: Sanitized + unique name
    Save->>Save: Write to disk with random prefix
    Save->>Chunk: Load document
    Chunk->>Chunk: Split into chunks with recursive_splitter
    Chunk->>Chunk: Detect language & assign page numbers
    Chunk->>Context: Pass chunks for contextualization
    Context->>Context: LLM generates context for each chunk
    Context-->>Chunk: Enhanced chunks with context
    Chunk->>Embed: Send contextualized chunks
    Embed->>Embed: Truncate via max_model_len
    Embed->>VectorDB: Generate embeddings
    VectorDB->>VectorDB: Store with chunk order metadata
    VectorDB-->>User: ✓ Indexed

sequenceDiagram
    actor User
    participant API as OpenAI API
    participant Router as Request Router
    participant VectorDB as Vector Database
    participant Retriever as Retriever
    participant Reranker as Reranker
    participant Context as Context Formatter
    participant LLM as Language Model
    participant Response as Response

    User->>API: Query (model: "")
    API->>Router: is_direct_llm_model() → true
    Router->>LLM: Route to direct LLM
    LLM-->>Response: Generate without RAG
    
    alt With RAG (model specified)
        User->>API: Query (model: "rag_partition")
        API->>Router: is_direct_llm_model() → false
        Router->>Retriever: retrieve(with_surrounding_chunks=true)
        Retriever->>VectorDB: async_search()
        VectorDB-->>Retriever: Top-k chunks
        Retriever->>VectorDB: get_surrounding_chunks()
        VectorDB-->>Retriever: Adjacent chunks
        Retriever->>Reranker: Rerank combined results (top_k=10)
        Reranker-->>Context: Ranked chunks
        Context->>Context: format_context(max_context_tokens)
        Context->>Context: Count tokens via ChatOpenAI
        Context-->>LLM: Token-limited context
        LLM-->>Response: RAG-augmented response
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Chunking system refactor (chunker.py, utils.py): Introduction of ChunkContextualizer, BaseChunker, and MDElement abstractions with async contextualization pipeline requires careful validation of logic flow and integration with LLM calls
Vector database changes (vectordb.py): New with_surrounding_chunks parameter, get_surrounding_chunks() method, and chunk order metadata generation (_gen_chunk_order_metadata) add complexity to retrieval logic
Configuration consolidation (Hydra YAML files, .env additions): Changes to chunker factory, removal of deprecated splitters, and reorganization of RAG defaults require verification across config hierarchy
Public API changes: Multiple signature updates (async_search, async_multi_query_search, format_context, load_prompt) affecting downstream consumers; class rename (SingleRetreiver→SingleRetriever)
Interdependencies: Changes span across modules (chunker→embedder→vectordb→retriever→pipeline) with implicit contracts that need validation
Token counting logic in format_context uses ChatOpenAI client; needs verification of token measurement accuracy and edge cases

Possibly related PRs

Feat/chunking #165: Major overlapping changes in chunker architecture (ChunkContextualizer, BaseChunker, MDElement), vectordb surrounding chunks retrieval, retriever API changes, embedder max_model_len support, and consolidated config structure
Fix/filename #167: Introduces sanitize_filename utility and file handling refactor in components/files.py with test coverage
feat: Allow model in openAI completion routes #173: Adds is_direct_llm_model() helper in openai.py router for distinguishing direct LLM vs. RAG-augmented requests

Suggested labels

chore, enhancement, refactoring

Suggested reviewers

dodekapod
codescene-delta-analysis

🐰 Hops with glee through configs neat,
Chunks contextualized, retrieval sweet!
Surrounding echoes, tokens counted true,
A refactored pipeline, fresh and new. 🌟

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 41.56% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Title check	❓ Inconclusive	The title 'Merge for release v1.1.5' is vague and generic, using non-descriptive terms that don't convey meaningful information about the specific changes in the changeset.	Replace the generic 'Merge for release' phrase with a specific summary of the main changes, such as 'Add chunker refactoring, retriever updates, and embedder configuration' or similar, to clearly communicate the primary modifications.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

paultranvan · 2025-12-17T13:21:52Z

@coderabbitai pause

codescene-delta-analysis

Code Health Improved (2 files improve in Code Health)

Gates Failed
Prevent hotspot decline (3 hotspots with Complex Method)
Enforce advisory code health rules (4 files with Complex Method, Overall Code Complexity)

Gates Passed
4 Quality Gates Passed

See analysis details in CodeScene

Reason for failure

Prevent hotspot decline	Violations	Code Health Impact
vectordb.py	1 rule in this hotspot	7.29 → 7.06	Suppress
openai.py	1 rule in this hotspot	7.10 → 7.08	Suppress
chunker.py	1 rule in this hotspot	8.55 → 9.39	Suppress

Enforce advisory code health rules	Violations	Code Health Impact
vectordb.py	1 advisory rule	7.29 → 7.06	Suppress
openai.py	1 advisory rule	7.10 → 7.08	Suppress
utils.py	1 advisory rule	7.86 → 8.57	Suppress
chunker.py	1 advisory rule	8.55 → 9.39	Suppress

View Improvements

File	Code Health Impact	Categories Improved
chunker.py	8.55 → 9.39	Code Duplication, Bumpy Road Ahead, Excess Number of Function Arguments
utils.py	7.86 → 8.57	Complex Method, Bumpy Road Ahead, Deep, Nested Complexity, Excess Number of Function Arguments

Quality Gate Profile: Pay Down Tech Debt
Want more control? Customize Code Health rules or catch issues early with our IDE extension and CLI tool.

codescene-delta-analysis · 2025-12-17T13:22:22Z

openrag/components/indexer/chunker/chunker.py

+    def _get_chunks(
+        self, content: str, metadata: Optional[dict] = None, log=None
+    ) -> list[Document]:
+        log = log or logger
+        texts, tables_and_images = self._prepare_md_elements(content=content)
+        combined_texts = "\n".join([e.content for e in texts])
+        text_chunks = self.split_text(combined_texts)


❌ New issue: Complex Method
BaseChunker._get_chunks has a cyclomatic complexity of 9, threshold = 9

_Suppress

codescene-delta-analysis · 2025-12-17T13:22:22Z

openrag/components/indexer/chunker/utils.py

-
-from langchain_core.documents.base import Document
-from langchain_openai import ChatOpenAI
+from typing import Callable, Literal, Optional


❌ New issue: Overall Code Complexity
This module has a mean cyclomatic complexity of 5.22 across 9 functions. The mean complexity threshold is 4

_Suppress

codescene-delta-analysis · 2025-12-17T13:22:22Z

openrag/components/indexer/vectordb/vectordb.py

        similarity_threshold: int = 0.80,
        partition: list[str] = None,
        filter: Optional[dict] = None,
+        with_surrounding_chunks: bool = False,


❌ Getting worse: Complex Method
MilvusDB.async_search increases in cyclomatic complexity from 12 to 13, threshold = 9

_Suppress

codescene-delta-analysis · 2025-12-17T13:22:22Z

openrag/components/indexer/vectordb/vectordb.py

+    async def get_surrounding_chunks(self, docs: list[Document]) -> list[Document]:
+        existant_ids = set(doc.metadata.get("_id") for doc in docs)
+
+        # Collect all prev/next section IDs
+        section_ids = [
+            section_id
+            for doc in docs
+            for section_id in [
+                doc.metadata.get("prev_section_id"),
+                doc.metadata.get("next_section_id"),
+            ]
+            if section_id is not None
+        ]
+
+        if not section_ids:
+            return []
+
+        # Query all sections in parallel
+        tasks = [
+            self._async_client.query(
+                collection_name=self.collection_name,
+                filter=f"section_id == {section_id}",
+                limit=1,
+            )
+            for section_id in section_ids
+        ]
+        responses = await asyncio.gather(*tasks)
+
+        # Build output, skipping duplicates
+        output_docs = []
+        for response in responses:
+            if not response:
+                continue
+            doc_id = response[0].get("_id")
+            if doc_id not in existant_ids:
+                existant_ids.add(doc_id)
+                output_docs.append(
+                    Document(
+                        page_content=response[0]["text"],
+                        metadata={
+                            key: value
+                            for key, value in response[0].items()
+                            if key not in ["text", "vector"]
+                        },
+                    )
+                )
+
+        return output_docs


❌ New issue: Complex Method
MilvusDB.get_surrounding_chunks has a cyclomatic complexity of 10, threshold = 9

_Suppress

codescene-delta-analysis · 2025-12-17T13:22:23Z

openrag/routers/openai.py

    user_partitions=Depends(current_user_or_admin_partitions_list),
 ):
-    model_name = request.model
+    model_name = request.model or config.llm.get("model")


❌ Getting worse: Complex Method
openai_completion increases in cyclomatic complexity from 13 to 14, threshold = 9

_Suppress

coderabbitai

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

openrag/components/files.py (3)

35-39: Fix return type annotation mismatch.

The function returns a str but is annotated as returning Path. This causes type inconsistency and could mislead callers.

-def make_unique_filename(filename: str) -> Path:
+def make_unique_filename(filename: str) -> str:
     ts = int(time.time() * 1000)
     rand = secrets.token_hex(2)
     unique_name = f"{ts}_{rand}_{filename}"
     return unique_name

54-58: Handle potential None filename.

UploadFile.filename can be None if the client doesn't provide a filename. This would cause issues when constructing the file path.

     if with_random_prefix:
-        filename = make_unique_filename(file.filename)
+        filename = make_unique_filename(file.filename or "unnamed")
     else:
-        filename = file.filename
+        filename = file.filename or "unnamed"
     file_path = dest_dir / filename

79-92: Dead code: else branch is unreachable.

ray.wait([future]) without a timeout parameter blocks until the future completes, so ready will always be non-empty. The else branch (lines 90-92) is unreachable dead code.

Either remove the dead code or reintroduce the timeout if it was intentionally removed:

     # Wait for it to complete, with timeout
-    ready, _ = await asyncio.to_thread(ray.wait, [future])
+    ready, _ = await asyncio.to_thread(ray.wait, [future], timeout=SERIALIZATION_TIMEOUT)
 
     if ready:
         try:
             doc = await ready[0]
             return doc
         except TaskCancelledError:
             raise
         except Exception:
             raise
     else:
         ray.cancel(future, recursive=True)
-        raise TimeoutError(f"Serialization task {task_id} timed out after seconds")
+        raise TimeoutError(f"Serialization task {task_id} timed out after {SERIALIZATION_TIMEOUT} seconds")

Or if timeout is intentionally removed, clean up the dead code:

     # Wait for it to complete
-    ready, _ = await asyncio.to_thread(ray.wait, [future])
-
-    if ready:
-        try:
-            doc = await ready[0]
-            return doc
-        except TaskCancelledError:
-            raise
-        except Exception:
-            raise
-    else:
-        ray.cancel(future, recursive=True)
-        raise TimeoutError(f"Serialization task {task_id} timed out after seconds")
+    await asyncio.to_thread(ray.wait, [future])
+    try:
+        doc = await future
+        return doc
+    except TaskCancelledError:
+        raise

🧹 Nitpick comments (10)

openrag/components/indexer/loaders/base.py (1)

153-153: Clarify the formatting change.

The addition of extra blank lines around the image description content may affect downstream parsing. Ensure this formatting change is intentional and compatible with consumers of this XML-wrapped content.
.hydra_config/rag/base.yaml (1)
1-4: LGTM!

Good extraction of shared RAG configuration. Consider updating the comment since this base config is used by both ChatBotRag and SimpleRag, not just chatbot RAG.
-# Config for chatbot RAG
+# Base config for RAG modes
 mode: ''
 chat_history_depth: 4
 max_contextualized_query_len: 512
openrag/components/test_files.py (1)

60-83: Consider edge case validation for empty filenames.

The test case at line 77 expects sanitize_filename("") to return "". While this test may pass, an empty filename could cause issues in actual file operations (e.g., save_file_to_disk). Consider whether sanitize_filename should validate against empty strings and raise an error, or if the calling code should handle this validation.
openrag/routers/indexer.py (1)
250-266: Consistent filename handling, but improve exception chaining.

The filename sanitization and metadata handling mirrors add_file appropriately. However, the exception handling at lines 253-258 could be improved by using raise ... from err or raise ... from None to preserve the exception chain, as flagged by static analysis.

Apply this diff:
     try:
         original_filename = file.filename
         file.filename = sanitize_filename(file.filename)
         file_path = await save_file_to_disk(file, save_dir, with_random_prefix=True)
-    except Exception:
+    except Exception as e:
         log.exception("Failed to save file to disk.")
         raise HTTPException(
             status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
             detail="Failed to save uploaded file.",
-        )
+        ) from e
Based on static analysis hints.
openrag/components/utils.py (2)
125-147: Consider reusing the LLM instance for token counting.

A new ChatOpenAI instance is created on every call to format_context. This incurs unnecessary overhead for repeated invocations. Consider caching the LLM instance or accepting it as a parameter.
+_token_counter_llm = None
+
+def _get_token_counter():
+    global _token_counter_llm
+    if _token_counter_llm is None:
+        _token_counter_llm = ChatOpenAI(**config.llm)
+    return _token_counter_llm
+
 def format_context(docs: list[Document], max_context_tokens: int = 4096) -> str:
     if not docs:
         return "No document found from the database"
 
-    llm = ChatOpenAI(**config.llm)
-    _length_function = llm.get_num_tokens
+    _length_function = _get_token_counter().get_num_tokens
150-157: Hard-coded path may cause issues outside Docker.

The lang_detect_cache_dir is hard-coded to /app/model_weights/, which assumes a specific deployment environment. Consider making this configurable via the config system.
 # Initialize language detector
-lang_detect_cache_dir = "/app/model_weights/"
+lang_detect_cache_dir = getattr(config.paths, "model_weights_dir", "/app/model_weights/")
 lang_detector_config = LangDetectConfig(
     max_input_length=1024,  # chars
     model="auto",
     cache_dir=lang_detect_cache_dir,
 )
openrag/components/indexer/chunker/utils.py (1)
17-31: Consider using a dataclass for MDElement.

The MDElement class is a simple data container. Using @dataclass would reduce boilerplate and provide __eq__, __hash__, etc. automatically.
+from dataclasses import dataclass
+
-class MDElement:
-    """Class representing a segment of markdown content."""
-
-    def __init__(
-        self,
-        type: Literal["text", "table", "image"],
-        content: str,
-        page_number: Optional[int] = None,
-    ):
-        self.type = type  # 'text', 'table', 'image'
-        self.content = content
-        self.page_number = page_number
-
-    def __repr__(self):
-        return f"Element(type={self.type}, page_number={self.page_number}, content={self.content[:100]}...)"
+@dataclass
+class MDElement:
+    """Class representing a segment of markdown content."""
+    type: Literal["text", "table", "image"]
+    content: str
+    page_number: Optional[int] = None
+
+    def __repr__(self):
+        return f"Element(type={self.type}, page_number={self.page_number}, content={self.content[:100]}...)"
openrag/components/indexer/vectordb/vectordb.py (2)
387-396: Add strict=True to zip for safety.

Per static analysis hint B905, adding strict=True ensures the iterables have equal length and prevents silent data loss if they don't match.
-            for chunk, vector, order_metadata in zip(chunks, vectors, order_metadata_l):
+            for chunk, vector, order_metadata in zip(chunks, vectors, order_metadata_l, strict=True):
555-602: N+1 query problem for surrounding chunks—consider batching with IN operator.

The current code makes a separate query for each surrounding chunk. Milvus supports the IN operator for batch queries in filter expressions. To optimize, batch the queries:
# Batch query using IN operator instead of N separate queries
filter_expr = f"section_id in {section_ids}"
responses = await self._async_client.query(
    collection_name=self.collection_name,
    filter=filter_expr,
    limit=len(section_ids),
)
If you prefer keeping the parallel approach with asyncio.gather(), add a comment explaining the trade-off. Use filter_params for proper parameter escaping if needed.
openrag/components/indexer/chunker/chunker.py (1)
62-66: Catching broad Exception masks specific errors.

Per static analysis (BLE001), catching Exception broadly can hide unexpected issues. Consider catching more specific exceptions or at least logging the exception type.
             except Exception as e:
                 logger.warning(
-                    f"Error contextualizing chunk of document `{filename}`: {e}"
+                    f"Error contextualizing chunk of document `{filename}`: {type(e).__name__}: {e}"
                 )
                 return ""

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9efbad8 and 97a3f7b.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (36)

.coderabbit.yaml (1 hunks)
.github/workflows/smoke_test.yaml (2 hunks)
.github/workflows/smoke_test/.env (1 hunks)
.hydra_config/chunker/semantic_splitter.yaml (0 hunks)
.hydra_config/chunker/token_splitter.yaml (0 hunks)
.hydra_config/config.yaml (3 hunks)
.hydra_config/rag/ChatBotRag.yaml (1 hunks)
.hydra_config/rag/SimpleRag.yaml (1 hunks)
.hydra_config/rag/base.yaml (1 hunks)
.hydra_config/retriever/base.yaml (1 hunks)
docker-compose.yaml (2 hunks)
docs/content/docs/documentation/API.mdx (1 hunks)
docs/content/docs/documentation/backup_restore.md (6 hunks)
docs/content/docs/documentation/env_vars.md (4 hunks)
openrag/components/files.py (1 hunks)
openrag/components/indexer/chunker/chunker.py (2 hunks)
openrag/components/indexer/chunker/test_chunking.py (1 hunks)
openrag/components/indexer/chunker/utils.py (1 hunks)
openrag/components/indexer/embeddings/openai.py (2 hunks)
openrag/components/indexer/loaders/base.py (1 hunks)
openrag/components/indexer/vectordb/vectordb.py (11 hunks)
openrag/components/pipeline.py (3 hunks)
openrag/components/prompts/prompts.py (2 hunks)
openrag/components/retriever.py (5 hunks)
openrag/components/test_files.py (3 hunks)
openrag/components/utils.py (3 hunks)
openrag/routers/indexer.py (4 hunks)
openrag/routers/openai.py (6 hunks)
openrag/routers/utils.py (1 hunks)
openrag/scripts/backup.sh.example (1 hunks)
openrag/scripts/restore.sh.example (1 hunks)
prompts/example1/chunk_contextualizer_tmpl.txt (1 hunks)
prompts/example1/sys_prompt_tmpl.txt (1 hunks)
pyproject.toml (1 hunks)
pytest.ini (1 hunks)
quick_start/docker-compose.yaml (2 hunks)

💤 Files with no reviewable changes (2)

.hydra_config/chunker/semantic_splitter.yaml
.hydra_config/chunker/token_splitter.yaml

🧰 Additional context used

🧬 Code graph analysis (7)

openrag/routers/openai.py (1)

openrag/models/openai.py (2)

OpenAIChatCompletionRequest (14-30)

OpenAICompletionRequest (75-92)

openrag/components/pipeline.py (1)

openrag/components/utils.py (1)

format_context (125-147)

openrag/components/utils.py (2)

openrag/utils/logger.py (1)

get_logger (10-47)

openrag/config/config.py (1)

load_config (12-29)

openrag/components/test_files.py (1)

openrag/components/files.py (2)

sanitize_filename (13-32)

save_file_to_disk (42-68)

openrag/routers/indexer.py (1)

openrag/components/files.py (2)

sanitize_filename (13-32)

save_file_to_disk (42-68)

openrag/components/indexer/chunker/test_chunking.py (1)

openrag/components/indexer/chunker/utils.py (5)

MDElement (17-31)

chunk_table (191-253)

clean_markdown_table_spacing (256-280)

get_chunk_page_number (110-145)

split_md_elements (53-107)

openrag/components/indexer/vectordb/vectordb.py (1)

openrag/components/indexer/indexer.py (1)

chunk (53-57)

🪛 Gitleaks (8.30.0)

docs/content/docs/documentation/API.mdx

[high] 233-236: Discovered a potential authorization token provided in a curl command header, which could compromise the curl accessed resource.

(curl-auth-header)

🪛 LanguageTool

docs/content/docs/documentation/API.mdx

[grammar] ~226-~226: Ensure spelling is correct
Context: ... "stream": false }' ``` You can also direclty use this endpoint with no RAG pipeline,...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 Ruff (0.14.8)

openrag/routers/indexer.py

255-258: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

openrag/components/indexer/chunker/chunker.py

61-61: Consider moving this statement to an else block

(TRY300)

62-62: Do not catch blind exception: Exception

(BLE001)

103-103: Do not catch blind exception: Exception

(BLE001)

117-117: Unused method argument: kwargs

(ARG002)

openrag/components/indexer/vectordb/vectordb.py

388-388: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

532-532: Consider moving this statement to an else block

(TRY300)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: index-backup-restore

🔇 Additional comments (45)

docs/content/docs/documentation/API.mdx (2)

221-221: LGTM: Temperature value updated.

The lower temperature value (0.1) provides more deterministic outputs, which is appropriate for documentation examples.

233-248: LGTM: Non-RAG usage example is clear and consistent.

The example correctly demonstrates using an empty model string to bypass the RAG pipeline, and the temperature value matches the earlier example for consistency.

Note: The Gitleaks warning about the authorization header is a false positive—YOUR_AUTH_TOKEN is clearly a documentation placeholder, not an actual credential.

openrag/scripts/restore.sh.example (1)

7-7: LGTM!

The addition of --env-file .env ensures environment variables are properly loaded during restore operations, maintaining consistency with other docker-compose invocations across the PR.

.github/workflows/smoke_test/.env (1)

29-30: LGTM!

The new MAX_MODEL_LEN environment variable with a clear comment provides good guidance for adjusting the embedder model's maximum length based on model capabilities.

.coderabbit.yaml (1)

1-5: LGTM!

The CodeRabbit configuration is properly structured and enables auto-review on all branches while allowing reviews to complete even if the PR is closed. The settings align with the provided schema.

pytest.ini (1)

8-11: LGTM!

The environment variables properly configure test paths for Hydra configuration, prompts, and logs, aligning with the broader configuration refactoring in this PR.

openrag/components/indexer/embeddings/openai.py (2)

17-17: LGTM!

The max_model_len configuration parameter is properly extracted with a sensible default of 8192, aligning with the environment variable changes across the PR.

38-40: Remove or replace the unsupported truncate_prompt_tokens parameter.

The truncate_prompt_tokens parameter is not documented in the official OpenAI embeddings API. The embeddings API enforces a maximum of 8192 tokens per input, and the official guidance recommends manual truncation using tokenization before sending requests. Remove this parameter or implement manual text truncation using tiktoken before calling the API.

prompts/example1/chunk_contextualizer_tmpl.txt (1)

1-22: LGTM!

The restructured prompt template significantly improves clarity and usability with explicit sections for Core Principles, Output Format, and Examples. The standardized field names (Filename, First Chunks, Previous Chunks, Current Chunk) and clear constraints (1-2 sentences, plain text) make the contextualization task more consistent and predictable.

openrag/components/files.py (1)

1-11: LGTM!

Import organization is appropriate with the new modules (asyncio, re, secrets, time) needed for the added functionality.

.hydra_config/rag/SimpleRag.yaml (1)

1-3: LGTM!

Clean refactoring to use Hydra's defaults composition pattern. Shared settings are properly extracted to base.yaml, reducing duplication across RAG configurations.

openrag/routers/utils.py (1)

233-234: LGTM!

Minor formatting adjustment.

prompts/example1/sys_prompt_tmpl.txt (1)

11-11: LGTM!

Clear instruction to prevent exposing internal file metadata in responses, which improves the user experience by keeping answers focused on content rather than source details.

docker-compose.yaml (1)

47-47: LGTM - Consistent alignment of max-model-len default.

The default value for --max-model-len has been updated from 8194 to 8192 in both the x-vllm_template and vllm-cpu configurations, aligning with the embedder configuration changes across the PR.

Also applies to: 146-146

quick_start/docker-compose.yaml (1)

45-45: LGTM - Consistent with main docker-compose.yaml.

The max-model-len default has been updated to 8192, matching the changes in the main docker-compose.yaml file and maintaining consistency across configurations.

Also applies to: 142-142

docs/content/docs/documentation/backup_restore.md (1)

13-19: LGTM - Documentation properly updated for environment variable loading.

The documentation now consistently instructs users to include --env-file .env in docker compose commands for backup and restore operations, aligning with the new environment variable loading pattern established across the PR.

Also applies to: 26-32, 40-44, 55-62, 66-73, 78-82

.hydra_config/rag/ChatBotRag.yaml (1)

1-3: LGTM - Proper refactoring to use base configuration.

The configuration now properly references a base configuration pattern, centralizing common RAG settings (chat_history_depth, max_contextualized_query_len) in base.yaml and reducing duplication.

openrag/scripts/backup.sh.example (1)

6-12: LGTM - Consistent environment variable loading.

The backup script now includes --env-file .env, consistent with the pattern established in documentation and workflow changes.

.github/workflows/smoke_test.yaml (1)

100-101: LGTM - Workflow properly loads environment variables.

The smoke test workflow now consistently loads environment variables from .env for backup and restore operations, aligning with the new environment configuration pattern.

Also applies to: 110-110, 115-116

.hydra_config/retriever/base.yaml (1)

3-4: LGTM - New surrounding chunks retrieval configuration.

The new with_surrounding_chunks option enables retrieval of surrounding chunks for each retrieved chunk, with a sensible default of true. This aligns with the broader PR changes to the retriever and vectordb components.
openrag/components/pipeline.py (2)
152-152: LGTM - Token-aware context formatting properly integrated.

The computed max_context_tokens is now properly passed to format_context() in both chat completion and completions paths, enabling token-aware context truncation based on retrieval configuration.

Also applies to: 183-183

77-79: Align config access patterns to use consistent bracket notation.

Lines 38–39 access config.reranker["top_k"] using bracket notation, while lines 77–78 use the defensive .get() pattern for the same keys. Since reranker.top_k and chunker.chunk_size are always defined in the Hydra config with built-in defaults (top_k=10, chunk_size=512), the defensive .get() calls are redundant. Use bracket notation consistently throughout:
self.max_context_tokens = config.reranker["top_k"] * config.chunker["chunk_size"]
This improves readability and maintains a uniform config access pattern.
openrag/components/test_files.py (2)

32-57: LGTM!

The test correctly uses monkeypatching to mock make_unique_filename, ensuring deterministic behavior while testing the with_random_prefix flag. The test validates both the filename transformation and file content persistence.

5-6: The components.files import works reliably in this project because pytest.ini is configured with pythonpath = ./openrag, which adds the openrag directory to sys.path. The import is not fragile to different test runner working directories when pytest is used correctly. However, for better clarity and to reduce reliance on pytest configuration, consider using a relative import: from .files import sanitize_filename, save_file_to_disk since this test file is part of the openrag.components package.

Likely an incorrect or invalid review comment.

.hydra_config/config.yaml (3)

53-53: LGTM!

Increasing reranker.top_k from 5 to 10 aligns with modern LLMs having larger context windows, allowing more documents to be included for better RAG results. The inline comment clearly explains the rationale.

34-34: No action required. The default max_model_len value of 8192 correctly matches jina-embeddings-v3's maximum input length of 8192 tokens.

3-3: Clarify chunker strategy description.

The config sets recursive_splitter as the default chunker. However, there's no evidence of prior support for semantic_splitter or markdown_splitter options in the codebase—only recursive_splitter has ever been implemented as a configurable Hydra option. The documentation already accurately reflects this single available strategy.

Likely an incorrect or invalid review comment.

openrag/routers/indexer.py (1)

129-145: LGTM!

The filename handling follows good practices:

Preserves original_filename for display and metadata tracking

Sanitizes the filename to prevent filesystem issues (special characters, path traversal)

Uses centralized save_file_to_disk with random prefix to prevent collisions

Stores all relevant metadata for traceability

openrag/routers/openai.py (2)

104-115: LGTM!

The is_direct_llm_model helper function centralizes the logic for determining when to bypass RAG partitions. The function is well-documented with a clear docstring, and the implementation correctly handles the three cases: None, empty string, or matching the configured default model.

171-176: Consistent usage of the new helper function.

Both endpoints (/chat/completions and /completions) consistently use is_direct_llm_model(request) to determine partition routing. This refactoring improves maintainability by centralizing the decision logic and makes the code more readable.

Also applies to: 286-291

openrag/components/indexer/chunker/test_chunking.py (4)

12-116: LGTM!

The TestSplitMdElements class provides comprehensive coverage for markdown element parsing:

Basic text, table, and image parsing

Edge case: tables inside image descriptions are correctly ignored

Page number assignment for different element types

The test assertions are specific and verify both element count and content, ensuring robust validation of the parsing logic.

118-152: LGTM!

The TestGetChunkPageNumber class thoroughly tests page number assignment logic:

Chunks with no markers inherit from previous chunk

Markers at start/end/middle are handled correctly

Boundary conditions are explicitly tested

This ensures accurate page tracking for document chunks.

154-192: LGTM!

The TestCleanMarkdownTableSpacing class validates table normalization:

Excessive and inconsistent spacing is corrected

Empty cells are preserved

Consistent "| cell | cell |" format is enforced

These tests ensure tables are properly formatted before further processing.

194-252: LGTM!

The TestChunkTable class validates table chunking behavior:

Small tables remain unchunked

Large tables are split while preserving group integrity

Headers are included in all chunks

Mock length function (4 chars/token) provides deterministic testing

The test at lines 216-248 is particularly important, ensuring that related table rows (country groups) are not split across chunks, which would break semantic coherence.

docs/content/docs/documentation/env_vars.md (3)

97-109: LGTM!

The documentation clearly reflects the chunking strategy consolidation:

CHUNKER now supports only recursive_splitter

The strategy is well-explained with a reference to LangChain documentation

Users understand that text is hierarchically split while respecting CHUNK_SIZE

119-119: LGTM!

The new MAX_MODEL_LEN parameter is well-documented:

Default value (8192) matches the configuration

Clear explanation of truncation behavior when chunks exceed the limit

Users understand this controls embedding model context length

186-191: LGTM!

The retriever configuration is well-documented:

RETRIEVER_TYPE options are clearly listed (single, multiQuery, hyde)

New WITH_SURROUNDING_CHUNKS parameter is explained with its default value

Users understand that enabling this retrieves adjacent chunks for richer context

openrag/components/retriever.py (3)

35-56: LGTM!

The BaseRetriever properly adds support for surrounding chunks:

New with_surrounding_chunks parameter (default True) enables retrieving adjacent chunks

Parameter is stored and forwarded to the vector database search

Default value makes sense for RAG use cases where additional context improves results

59-60: LGTM!

The retriever classes are updated consistently:

SingleRetriever naming corrected (typo fix)

MultiQueryRetriever and HyDeRetriever properly forward with_surrounding_chunks to their respective search methods

All retriever types now support the surrounding chunks feature uniformly

Also applies to: 95-102, 136-142

146-160: LGTM!

The RetrieverFactory is properly updated:

Mapping for "single" now points to the correctly named SingleRetriever

Guard condition changed from checking retriever_type is None to retriever_cls is None, which is more appropriate since it validates the actual class lookup result

openrag/components/prompts/prompts.py (2)

33-33: All references to the renamed constant have been updated.

The constant was successfully renamed from CHUNK_CONTEXTUALIZER to CHUNK_CONTEXTUALIZER_PROMPT for naming consistency with other prompt variables. No remaining references to the old name exist in the codebase.

11-27: The function already returns a single string as indicated by the -> str return type annotation (line 15). All callers throughout the codebase—in retriever.py, pipeline.py, loaders/base.py, and chunker.py—correctly handle the string return value without any tuple unpacking. No changes are needed.

Likely an incorrect or invalid review comment.

openrag/components/indexer/chunker/chunker.py (3)

283-304: RecursiveSplitter initializes text_splitter twice.

The parent BaseChunker.__init__ may lazily initialize text_splitter via split_text, but RecursiveSplitter also initializes it in its own __init__. This is fine but could be simplified.

The explicit initialization in RecursiveSplitter with custom separators is intentional and overrides the lazy fallback in BaseChunker.split_text.

40-53: No changes needed — code is compatible with project's Python version requirement.

The f-string syntax using backslash escape sequences in embedded expressions on lines 44 and 47 is valid and supported. Python 3.12 allows backslashes and unicode escape sequences in f-string expression components, and the project explicitly requires requires-python = ">=3.12", making this code compatible with all supported Python versions.

Likely an incorrect or invalid review comment.

194-254: Python dict merge with explicit keys correctly overrides upstream values—concern doesn't apply here.

At lines 218–221 and 230–234, rightmost dictionary values take precedence in the pattern {**metadata, "page": x}. Critically, the upstream metadata from doc.metadata (set at line 256) comes from document loaders and contains fields like file_id and partition—not page or chunk_type. The explicit keys intentionally set chunk-specific page numbers calculated during chunking (e.g., subtable.page_number, e.page_number, start_page), not carry upstream metadata forward. This behavior is correct and intentional.

Likely an incorrect or invalid review comment.

coderabbitai · 2025-12-17T13:29:30Z

docs/content/docs/documentation/API.mdx

 You can also direclty use this endpoint with no RAG pipeline, i.e. to directly use the LLM.
-For that, simply do not specify any model: 
+For that, instead of using the `openrag` prefix for the model, you can:
+- Specify no model
+- Specify an empty model
+- Specify the openRAG configured model, e.g. `Mistral-Small-3.1-24B-Instruct-2503`.


⚠️ Potential issue | 🟡 Minor

Fix typo in user-facing documentation.

Line 226 contains a spelling error: "direclty" should be "directly". The expanded guidance for non-RAG usage is helpful and improves clarity.

Apply this diff to correct the typo:

-You can also direclty use this endpoint with no RAG pipeline, i.e. to directly use the LLM. +You can also directly use this endpoint with no RAG pipeline, i.e. to directly use the LLM.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

You can also direclty use this endpoint with no RAG pipeline, i.e. to directly use the LLM.

For that, simply do not specify any model:

For that, instead of using the `openrag` prefix for the model, you can:

- Specify no model

- Specify an empty model

- Specify the openRAG configured model, e.g. `Mistral-Small-3.1-24B-Instruct-2503`.

You can also directly use this endpoint with no RAG pipeline, i.e. to directly use the LLM.

For that, instead of using the `openrag` prefix for the model, you can:

- Specify no model

- Specify an empty model

- Specify the openRAG configured model, e.g. `Mistral-Small-3.1-24B-Instruct-2503`.

🧰 Tools

🪛 LanguageTool

[grammar] ~226-~226: Ensure spelling is correct
Context: ... "stream": false }' ``` You can also direclty use this endpoint with no RAG pipeline,...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents

In docs/content/docs/documentation/API.mdx around lines 226 to 230, fix the spelling error "direclty" to "directly" in the sentence that explains using the endpoint without a RAG pipeline; update that one word in-place so the sentence reads "You can also directly use this endpoint..." and leave the rest of the guidance unchanged.

coderabbitai · 2025-12-17T13:29:30Z

openrag/components/files.py

+def sanitize_filename(filename: str) -> str:
+    # Split filename into name and extension
+    path = Path(filename)
+    name = path.stem
+    ext = path.suffix
+
+    # Remove special characters (keep only word characters and hyphens temporarily)
+    name = re.sub(r"[^\w\-]", "_", name)
+
+    # Replace hyphens with underscores
+    name = name.replace("-", "_")
+
+    # Collapse multiple underscores
+    name = re.sub(r"_+", "_", name)
+
+    # Remove leading/trailing underscores
+    name = name.strip("_")
+
+    # Reconstruct filename
+    return name + ext


⚠️ Potential issue | 🟡 Minor

Handle edge case where sanitized name becomes empty.

If the filename consists only of special characters (e.g., ---.txt), the sanitized name becomes empty, resulting in just .txt. Consider adding a fallback:

# Remove leading/trailing underscores name = name.strip("_") + # Fallback if name is empty after sanitization + if not name: + name = "file" + # Reconstruct filename return name + ext

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def sanitize_filename(filename: str) -> str:

# Split filename into name and extension

path = Path(filename)

name = path.stem

ext = path.suffix

# Remove special characters (keep only word characters and hyphens temporarily)

name = re.sub(r"[^\w\-]", "_", name)

# Replace hyphens with underscores

name = name.replace("-", "_")

# Collapse multiple underscores

name = re.sub(r"_+", "_", name)

# Remove leading/trailing underscores

name = name.strip("_")

# Reconstruct filename

return name + ext

def sanitize_filename(filename: str) -> str:

# Split filename into name and extension

path = Path(filename)

name = path.stem

ext = path.suffix

# Remove special characters (keep only word characters and hyphens temporarily)

name = re.sub(r"[^\w\-]", "_", name)

# Replace hyphens with underscores

name = name.replace("-", "_")

# Collapse multiple underscores

name = re.sub(r"_+", "_", name)

# Remove leading/trailing underscores

name = name.strip("_")

# Fallback if name is empty after sanitization

if not name:

name = "file"

# Reconstruct filename

return name + ext

🤖 Prompt for AI Agents

In openrag/components/files.py around lines 13 to 32 the sanitized base name can become empty (e.g., "---.txt") producing a filename like ".txt"; detect when name is empty after stripping and replace it with a safe fallback (e.g., "file" or "untitled") before reconstructing the filename, ensuring the extension is preserved and the result does not start with a dot.

coderabbitai · 2025-12-17T13:29:30Z

openrag/components/indexer/chunker/chunker.py

+        filename = chunks[0].metadata.get("filename")
+        if not self.contextual_retrieval or len(chunks) < 2:
+            return [
+                Document(
+                    page_content=BASE_CHUNK_FORMAT.format(
+                        chunk_context="", filename=filename, content=c.page_content
+                    ),
+                    metadata=c.metadata,
+                )
+                for c in chunks
+            ]


⚠️ Potential issue | 🟡 Minor

Potential IndexError when chunks list is empty.

Line 139 accesses chunks[0].metadata without first checking if chunks is non-empty. The check len(chunks) < 2 at line 140 doesn't prevent this.

async def _apply_contextualization( self, chunks: list[Document], lang: Literal["en", "fr"] = "en" ) -> list[Document]: """Apply contextualization if enabled.""" + if not chunks: + return [] filename = chunks[0].metadata.get("filename")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

filename = chunks[0].metadata.get("filename")

if not self.contextual_retrieval or len(chunks) < 2:

return [

Document(

page_content=BASE_CHUNK_FORMAT.format(

chunk_context="", filename=filename, content=c.page_content

),

metadata=c.metadata,

)

for c in chunks

]

if not chunks:

return []

filename = chunks[0].metadata.get("filename")

if not self.contextual_retrieval or len(chunks) < 2:

return [

Document(

page_content=BASE_CHUNK_FORMAT.format(

chunk_context="", filename=filename, content=c.page_content

),

metadata=c.metadata,

)

for c in chunks

]

🤖 Prompt for AI Agents

In openrag/components/indexer/chunker/chunker.py around lines 139-149, the code accesses chunks[0].metadata which will raise IndexError when chunks is empty; add an explicit guard before that access (e.g., if not chunks: return []), then proceed to read filename and the existing branch logic—this ensures the function returns an empty list for empty input instead of crashing.

coderabbitai · 2025-12-17T13:29:30Z

openrag/components/indexer/chunker/utils.py

+def split_md_elements(md_text: str) -> list[MDElement]:
    """
    Split markdown text into segments of text, tables, and images.
-    Returns a list of tuples: (type, content) where type is 'text', 'table', or 'image'
+    Returns a list of tuples:
+    - ('text', content) for text segments
+    - ('table', content, page_number) for tables
+    - ('image', content, page_number) for images
    """
+    # Find all page markers
+    page_markers = []
+    for match in PAGE_RE.finditer(md_text):
+        page_markers.append((match.start(), int(match.group(1))))
+    page_markers.sort()  # Ensure they're in order
+
    all_matches = []

    # Find image matches first and record their spans
    image_spans = []
    for match in IMAGE_RE.finditer(md_text):
        span = match.span()
-        all_matches.append((span, "image", match.group(1)))
+        page_num = get_page_number(span[0], page_markers)
+        all_matches.append((span, "image", match.group(1).strip(), page_num))
        image_spans.append(span)

    # Find table matches, but skip those that are fully inside an image description
    for match in TABLE_RE.finditer(md_text):
        span = match.span()
        if not any(span_inside(span, image_span) for image_span in image_spans):
-            all_matches.append((span, "table", match.group(1)))
+            page_num = get_page_number(span[0], page_markers)
+            all_matches.append((span, "table", match.group(1).strip(), page_num))

    # Sort matches by start position
    all_matches.sort(key=lambda x: x[0][0])

    parts = []
    last = 0

-    for (start, end), match_type, content in all_matches:
+    for (start, end), match_type, content, page_num in all_matches:
        # Add text segment before this match if there is any
        if start > last:
            text_segment = md_text[last:start]
            if text_segment.strip():  # Only add non-empty text segments
                parts.append(("text", text_segment.strip()))

-        # Add the matched segment
-        parts.append((match_type, content.strip()))
+        # Add the matched segment with page number
+        parts.append((match_type, content, page_num))
        last = end

    # Add remaining text after the last match
    if last < len(md_text):
        remaining_text = md_text[last:]
        if remaining_text.strip():  # Only add non-empty text segments
            parts.append(("text", remaining_text.strip()))
-    return parts

+    return [MDElement(*p) for p in parts]


⚠️ Potential issue | 🟡 Minor

Docstring mismatch with actual return type.

The docstring states the function returns tuples, but Line 107 returns a list of MDElement instances. Update the docstring to reflect the actual return type.

def split_md_elements(md_text: str) -> list[MDElement]: """ Split markdown text into segments of text, tables, and images. - Returns a list of tuples: - - ('text', content) for text segments - - ('table', content, page_number) for tables - - ('image', content, page_number) for images + Returns a list of MDElement instances with type, content, and page_number. """

🤖 Prompt for AI Agents

In openrag/components/indexer/chunker/utils.py around lines 53 to 107, the docstring incorrectly says the function returns tuples while the function actually returns a list of MDElement instances; update the docstring to accurately describe the return type (e.g., "Returns a list of MDElement objects representing segments: MDElement(type='text'|'table'|'image', content, page_number?)") and adjust the bullet list to describe the MDElement fields rather than tuple shapes.

coderabbitai · 2025-12-17T13:29:30Z

openrag/components/indexer/chunker/utils.py

+def chunk_table(
+    table_element: MDElement,
+    chunk_size: int = 512,
+    length_function: Optional[Callable[[str], int]] = None,
+) -> list[MDElement]:
+    txt = clean_markdown_table_spacing(table_element.content)
+    header_lines, groups = parse_markdown_table(txt)
+
+    # Convert header lines → text block
+    header_text = "\n".join(header_lines)
+
+    # Convert group lists → text blocks
+    group_texts = ["\n".join(g) for g in groups]
+
+    # Precompute token length
+    header_ntoks = length_function(header_text)
+    groups_ntoks = [length_function(g) for g in group_texts]
+
+    subtables = []
+    current_rows = [header_text]
+    current_size = header_ntoks
+
+    prev_last_row = None  # for overlap
+
+    for group_txt, g_ntoks in zip(group_texts, groups_ntoks, strict=True):
+        # If adding this group exceeds the chunk limit
+        if current_size + g_ntoks > chunk_size:
+            # ---- finalize current subtable ----
+            subtables.append("\n".join(current_rows))
+
+            # ---- start new subtable with OVERLAP ----
+            current_rows = [header_text]  # always restart headers
+            if prev_last_row:
+                current_rows.append(prev_last_row)  # add overlapping row
+
+            current_rows.append(group_txt)
+            current_size = (
+                header_ntoks
+                + (length_function(prev_last_row) if prev_last_row else 0)
+                + g_ntoks
+            )
+
+        else:
+            # fits → just append normally
+            current_rows.append(group_txt)
+            current_size += g_ntoks
+
+        # track last row for overlap
+        prev_last_row = group_txt
+
+    # finalize last subtable
+    if current_rows:
+        subtables.append("\n".join(current_rows))
+
+    # wrap into MDElement list
+    return [
+        MDElement(
+            type="table",
+            content=subtable,
+            page_number=table_element.page_number,
+        )
+        for subtable in subtables
+    ]


⚠️ Potential issue | 🟠 Major

Missing None check before calling length_function.

If length_function is None, Line 206 will raise a TypeError. Either enforce the parameter as required or add a fallback.

def chunk_table( table_element: MDElement, chunk_size: int = 512, length_function: Optional[Callable[[str], int]] = None, ) -> list[MDElement]: + if length_function is None: + length_function = len # Fallback to character count + txt = clean_markdown_table_spacing(table_element.content) header_lines, groups = parse_markdown_table(txt)

🤖 Prompt for AI Agents

In openrag/components/indexer/chunker/utils.py around lines 191-253, the code calls length_function(...) without checking for None which will raise a TypeError if length_function is not passed; fix by providing a default length function at the top of chunk_table (e.g., default to Python len on strings or a small wrapper that handles None/empty strings) and use that local function for all subsequent calls, and also guard the call length_function(prev_last_row) with a conditional or by ensuring prev_last_row is a string (e.g., pass "" when None) so no None value is ever passed into the length function.

coderabbitai · 2025-12-17T13:29:30Z

openrag/components/indexer/vectordb/vectordb.py

+def _gen_chunk_order_metadata(n: int = 20) -> list[dict]:
+    # Use base timestamp + index to ensure uniqueness
+    base_ts = int(time.time_ns())
+    ids: list[int] = [base_ts + i for i in range(n)]
+    L = []
+    for i in range(n):
+        prev_chunk_id = ids[i - 1] if i > 0 else None
+        next_chunk_id = ids[i + 1] if i < n - 1 else None
+        L.append(
+            {
+                "prev_section_id": prev_chunk_id,
+                "section_id": ids[i],
+                "next_section_id": next_chunk_id,
+            }
+        )
+    return L


⚠️ Potential issue | 🟠 Major

Timestamp-based ID generation may collide under high concurrency.

Using time.time_ns() with sequential offsets could produce collisions if multiple processes call this function simultaneously. Consider using UUIDs or adding a random component.

+import uuid + def _gen_chunk_order_metadata(n: int = 20) -> list[dict]: - # Use base timestamp + index to ensure uniqueness - base_ts = int(time.time_ns()) - ids: list[int] = [base_ts + i for i in range(n)] + # Use UUID-based IDs to ensure uniqueness across processes + ids: list[int] = [uuid.uuid4().int >> 64 for _ in range(n)] # 64-bit unique IDs L = [] for i in range(n):

Alternatively, if integer IDs are required for Milvus compatibility, consider using a combination of process ID, timestamp, and random bits.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def _gen_chunk_order_metadata(n: int = 20) -> list[dict]:

# Use base timestamp + index to ensure uniqueness

base_ts = int(time.time_ns())

ids: list[int] = [base_ts + i for i in range(n)]

L = []

for i in range(n):

prev_chunk_id = ids[i - 1] if i > 0 else None

next_chunk_id = ids[i + 1] if i < n - 1 else None

L.append(

{

"prev_section_id": prev_chunk_id,

"section_id": ids[i],

"next_section_id": next_chunk_id,

}

)

return L

import uuid

def _gen_chunk_order_metadata(n: int = 20) -> list[dict]:

# Use UUID-based IDs to ensure uniqueness across processes

ids: list[int] = [uuid.uuid4().int >> 64 for _ in range(n)] # 64-bit unique IDs

L = []

for i in range(n):

prev_chunk_id = ids[i - 1] if i > 0 else None

next_chunk_id = ids[i + 1] if i < n - 1 else None

L.append(

{

"prev_section_id": prev_chunk_id,

"section_id": ids[i],

"next_section_id": next_chunk_id,

}

)

return L

🤖 Prompt for AI Agents

In openrag/components/indexer/vectordb/vectordb.py around lines 1022 to 1037, the current ID generation uses time.time_ns() with sequential offsets which can collide across concurrent processes; replace that logic with a collision-resistant approach — either generate section IDs as UUID strings (e.g., uuid.uuid4().hex) and update the metadata keys/types accordingly, or if integer IDs are strictly required, build a composite integer from time_ns plus process id and random bits (e.g., combine time_ns << N | (pid << M) | random_bits generated with secrets.randbits) to guarantee uniqueness across processes; ensure imports (uuid or os and secrets) are added, update type hints from list[int] to list[str] if using UUIDs, and run tests that write to Milvus to confirm compatibility.

coderabbitai · 2025-12-17T13:29:31Z

openrag/components/utils.py

+def detect_language(text: str):
+    outputs = lang_detector.detect(text, k=1)
+    return outputs[0].get("lang")


⚠️ Potential issue | 🟡 Minor

Missing error handling in detect_language.

If lang_detector.detect returns an empty list or fails, accessing outputs[0] will raise an IndexError. Add defensive handling.

def detect_language(text: str): outputs = lang_detector.detect(text, k=1) - return outputs[0].get("lang") + if outputs: + return outputs[0].get("lang") + return "en" # Default fallback

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def detect_language(text: str):

outputs = lang_detector.detect(text, k=1)

return outputs[0].get("lang")

def detect_language(text: str):

outputs = lang_detector.detect(text, k=1)

if outputs:

return outputs[0].get("lang")

return "en" # Default fallback

🤖 Prompt for AI Agents

In openrag/components/utils.py around lines 160 to 162, detect_language currently assumes lang_detector.detect returns a non-empty list and directly indexes outputs[0], which can raise IndexError or propagate detection errors; update the function to defensively handle empty or invalid outputs and exceptions by wrapping the detection call in a try/except, check that outputs is a non-empty list and that outputs[0] contains a "lang" key before returning it, and on failure return a safe default (e.g., None or "unknown") and optionally log the error for diagnostics.

coderabbitai · 2025-12-17T13:29:31Z

pyproject.toml

    "langdetect>=1.0.9",
    "alembic>=1.17.0",
-    "langdetect>=1.0.9",
+    "fast-langdetect>=1.0.0",


⚠️ Potential issue | 🟠 Major

Remove the duplicate language detection dependency.

Both langdetect>=1.0.9 (line 47) and fast-langdetect>=1.0.0 (line 49) are present in the dependencies. Since the PR migrates to fast-langdetect, the original langdetect dependency on line 47 should be removed to avoid conflicts.

Apply this diff to remove the duplicate:

"html-to-markdown>=2.4.0", - "langdetect>=1.0.9", "alembic>=1.17.0", "fast-langdetect>=1.0.0",

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

"langdetect>=1.0.9",

"alembic>=1.17.0",

"langdetect>=1.0.9",

"fast-langdetect>=1.0.0",

"html-to-markdown>=2.4.0",

"alembic>=1.17.0",

"fast-langdetect>=1.0.0",

🤖 Prompt for AI Agents

In pyproject.toml around lines 47 to 49, remove the duplicate dependency entry "langdetect>=1.0.9" (line 47) since the project is migrating to "fast-langdetect>=1.0.0"; update the dependencies block so only fast-langdetect remains and ensure punctuation/commas in the surrounding list remain valid after deletion.

coderabbitai · 2025-12-17T13:29:39Z

✅ Actions performed

Reviews paused.

paultranvan and others added 18 commits December 9, 2025 14:24

chore: Enable coderabbit for all branches

7d6863f

sanitize filenames to avoid indexation errors

62d2cfc

add tests for sanitize_filename function

5c8a9fb

saving original filename

512d3fa

Merge pull request #167 from linagora/fix/filename

097c92e

Fix/filename

Add env vars when running backup/restore scripts as their might be fa…

809c28e

…ils when some vars are missing

Updating backup & restore documentation and examples

c5fd711

Merge pull request #171 from linagora/backup_restore_env_file

fa2d7d7

Add env vars when running backup/restore scripts as their might be fa…

refactor: refactoring and enhancing chunking mechanism

9b532bf

Allow trucation in case the chunk to embed exceeds the MAX_MODEL_LEN …

7913977

…of the embedder model. This prevents errors in that case.

feat: Enabling fetching adjacent (preceding and following) chunks for…

ec47d26

… each matched document to provide additional context.

feat: Controlling context length when preparing context for generation

03c5cc4

refactor: cleaning hydra config for rag component

e9cf0b0

Add chunking tests.

56dc3bc

Merge pull request #165 from linagora/feat/chunking

ff47830

Feat/chunking

feat: Allow model in openAI completion routes

df60363

Some openai-compatible libs make the `model` parameter mandatory, to query chat completions API. Thus, we now allow providing empty model, or actual LLM model defined in the config.

fix: Typo in doc

d9f228b

chore: Allow coderabbit review after PR is merged

97a3f7b

codescene-delta-analysis bot reviewed Dec 17, 2025

View reviewed changes

paultranvan merged commit efa103f into main Dec 17, 2025
5 of 8 checks passed

coderabbitai bot reviewed Dec 17, 2025

View reviewed changes

This was referenced Jan 9, 2026

Add lightweight API test workflow with mock VLLM #194

Merged

fix: Prevent marker CPU issues from semaphore loops and orphaned processes #193

Merged

fix: improve LLM API error messages for better debugging #200

Merged

Feat/add ruff linting #209

Merged

This was referenced Jan 27, 2026

Fix/api token limit test #220

Merged

Feat/relationship2 #219

Closed

coderabbitai bot mentioned this pull request Jan 30, 2026

Feat/relationship3 #228

Merged

This was referenced Feb 9, 2026

Feat/add file quota2 #233

Closed

Temporality #247

Closed

coderabbitai bot mentioned this pull request Feb 16, 2026

feat: filter RAG sources by LLM citations to reduce false positives #246

Merged

This was referenced Feb 25, 2026

Feat/websearch #253

Merged

BREAKING CHANGE Step 2/3: Adding temporal fields #259

Open

This was referenced Mar 12, 2026

feat: implement query decomposition for search and enhance reranking … #266

Merged

Feat/add OpenAI reranking #288

Open

Conversation

paultranvan commented Dec 17, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

paultranvan commented Dec 17, 2025

Uh oh!

codescene-delta-analysis bot left a comment

Choose a reason for hiding this comment

Uh oh!

codescene-delta-analysis bot Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

codescene-delta-analysis bot Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

codescene-delta-analysis bot Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

codescene-delta-analysis bot Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

codescene-delta-analysis bot Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot commented Dec 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

paultranvan commented Dec 17, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 17, 2025 •

edited

Loading