Merge for release v1.1.5 #175

Merged
paultranvan merged 18 commits into main from dev
Dec 17, 2025

Conversation

paultranvan (Collaborator) commented Dec 17, 2025

Summary by CodeRabbit

Release Notes

  • New Features

    • Added retrieval of surrounding chunks in search results for enhanced context
    • Introduced language detection capability for better multilingual support
    • Added configurable embedding model context length support
  • Improvements

    • Enhanced context formatting to respect token limits and prevent overflow
    • Improved table chunking and markdown parsing with better page awareness
    • Simplified default chunking to single strategy for consistency
    • Increased default reranker result count for better coverage
    • Better filename sanitization and unique file naming
  • Documentation

    • Updated API documentation with improved examples and guidance
    • Clarified environment variable configuration options
    • Enhanced system prompts with clearer instruction rules

paultranvan and others added 18 commits December 9, 2025 14:24
Add env vars when running backup/restore scripts as there might be fa…
…of the embedder model. This prevents errors in that case.
… each matched document to provide additional context.
Some openai-compatible libs make the `model` parameter mandatory, to
query chat completions API.
Thus, we now allow providing empty model, or actual LLM model defined in the
config.
coderabbitai bot commented Dec 17, 2025

Note

Reviews paused

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR significantly refactors the document processing and retrieval pipeline by consolidating chunking strategies to recursive splitting only, introducing LLM-based chunk contextualization, adding surrounding chunk retrieval capabilities, integrating token-aware context formatting, and adding embedder truncation support. Configuration files are reorganized and environment variables updated accordingly.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CodeRabbit Configuration: .coderabbit.yaml | Adds auto-review configuration with wildcard base_branches pattern and abort_on_close set to false |
| Hydra Configuration & Defaults: .hydra_config/config.yaml, .hydra_config/rag/base.yaml, .hydra_config/rag/ChatBotRag.yaml, .hydra_config/rag/SimpleRag.yaml, .hydra_config/retriever/base.yaml | Consolidates RAG configuration into base.yaml defaults; removes semantic and markdown splitter options leaving only recursive_splitter; adds embedder.max_model_len (8192); increases reranker.top_k from 5 to 10; adds retriever.with_surrounding_chunks flag |
| Deleted Chunker Configurations: .hydra_config/chunker/semantic_splitter.yaml, .hydra_config/chunker/token_splitter.yaml | Removes deprecated chunker configuration files for semantic and markdown splitters |
| Environment & Docker Setup: .github/workflows/smoke_test.yaml, .github/workflows/smoke_test/.env, docker-compose.yaml, quick_start/docker-compose.yaml, openrag/scripts/backup.sh.example, openrag/scripts/restore.sh.example | Adds --env-file .env to docker compose invocations; adds MAX_MODEL_LEN=8192 environment variable; updates vllm max-model-len from 8194 to 8192 |
| Chunking System Refactor: openrag/components/indexer/chunker/chunker.py, openrag/components/indexer/chunker/utils.py, openrag/components/indexer/chunker/test_chunking.py | Introduces ChunkContextualizer class and BaseChunker abstract base; adds MDElement abstraction for markdown content; implements page-aware chunking with TABLE_RE multiline support and PAGE_RE markers; removes semantic and markdown splitters from factory; adds async contextualization pipeline; introduces test suite for chunking utilities |
| File Handling Utilities: openrag/components/files.py, openrag/components/test_files.py, openrag/routers/indexer.py | Adds sanitize_filename() and make_unique_filename() functions; implements save_file_to_disk() for disk persistence; removes timeout handling from serialize_file(); updates router to use centralized file saving with original_filename metadata preservation |
| Vector Database & Retrieval: openrag/components/indexer/vectordb/vectordb.py, openrag/components/retriever.py | Adds with_surrounding_chunks support to async_search and async_multi_query_search; introduces get_surrounding_chunks() method; adds chunk order metadata generation (_gen_chunk_order_metadata); renames SingleRetreiver to SingleRetriever; updates RetrieverFactory to use SingleRetriever for "single" key |
| Embeddings & Context Formatting: openrag/components/indexer/embeddings/openai.py, openrag/components/utils.py, openrag/components/pipeline.py | Adds max_model_len to OpenAI embeddings with truncate_prompt_tokens support; introduces token-aware format_context() with ChatOpenAI token counter; adds detect_language() function with LangDetector; computes max_context_tokens in RagPipeline as reranker.top_k × chunker.chunk_size |
| Prompts & API Utilities: openrag/components/prompts/prompts.py, openrag/routers/openai.py, openrag/routers/utils.py | Changes load_prompt() return type from tuple to string; renames CHUNK_CONTEXTUALIZER to CHUNK_CONTEXTUALIZER_PROMPT; adds is_direct_llm_model() helper for route determination; updates OpenAI endpoints documentation for empty model handling |
| Prompt Templates: prompts/example1/chunk_contextualizer_tmpl.txt, prompts/example1/sys_prompt_tmpl.txt | Restructures chunk contextualizer prompt with explicit fields (Filename, First Chunks, Previous Chunks, Current Chunk) and new Core Principles/Output Format sections; adds citation-disabling rule to system prompt |
| Documentation: docs/content/docs/documentation/API.mdx, docs/content/docs/documentation/backup_restore.md, docs/content/docs/documentation/env_vars.md | Updates API documentation with lower temperature examples (0.7→0.1); adds --env-file .env to backup/restore commands; documents new CHUNKER/MAX_MODEL_LEN/WITH_SURROUNDING_CHUNKS variables; removes references to deprecated splitters |
| Build & Test Configuration: pyproject.toml, pytest.ini | Replaces langdetect with fast-langdetect dependency; adds pytest.ini environment variables (CONFIG_PATH, PROMPTS_DIR, LOG_DIR) |
| Loader & Base Utilities: openrag/components/indexer/loaders/base.py | Minor formatting: adds blank lines around XML-wrapped image descriptions |

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Upload as File Upload
    participant Sanitize as File Sanitizer
    participant Save as Disk Storage
    participant Chunk as Chunking Engine
    participant Context as LLM Contextualizer
    participant Embed as Embedder
    participant VectorDB as Vector Database

    User->>Upload: Upload document
    Upload->>Sanitize: Get sanitized filename
    Sanitize-->>Save: Sanitized + unique name
    Save->>Save: Write to disk with random prefix
    Save->>Chunk: Load document
    Chunk->>Chunk: Split into chunks with recursive_splitter
    Chunk->>Chunk: Detect language & assign page numbers
    Chunk->>Context: Pass chunks for contextualization
    Context->>Context: LLM generates context for each chunk
    Context-->>Chunk: Enhanced chunks with context
    Chunk->>Embed: Send contextualized chunks
    Embed->>Embed: Truncate via max_model_len
    Embed->>VectorDB: Generate embeddings
    VectorDB->>VectorDB: Store with chunk order metadata
    VectorDB-->>User: ✓ Indexed
sequenceDiagram
    actor User
    participant API as OpenAI API
    participant Router as Request Router
    participant VectorDB as Vector Database
    participant Retriever as Retriever
    participant Reranker as Reranker
    participant Context as Context Formatter
    participant LLM as Language Model
    participant Response as Response

    User->>API: Query (model: "")
    API->>Router: is_direct_llm_model() → true
    Router->>LLM: Route to direct LLM
    LLM-->>Response: Generate without RAG
    
    alt With RAG (model specified)
        User->>API: Query (model: "rag_partition")
        API->>Router: is_direct_llm_model() → false
        Router->>Retriever: retrieve(with_surrounding_chunks=true)
        Retriever->>VectorDB: async_search()
        VectorDB-->>Retriever: Top-k chunks
        Retriever->>VectorDB: get_surrounding_chunks()
        VectorDB-->>Retriever: Adjacent chunks
        Retriever->>Reranker: Rerank combined results (top_k=10)
        Reranker-->>Context: Ranked chunks
        Context->>Context: format_context(max_context_tokens)
        Context->>Context: Count tokens via ChatOpenAI
        Context-->>LLM: Token-limited context
        LLM-->>Response: RAG-augmented response
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Chunking system refactor (chunker.py, utils.py): Introduction of ChunkContextualizer, BaseChunker, and MDElement abstractions with async contextualization pipeline requires careful validation of logic flow and integration with LLM calls
  • Vector database changes (vectordb.py): New with_surrounding_chunks parameter, get_surrounding_chunks() method, and chunk order metadata generation (_gen_chunk_order_metadata) add complexity to retrieval logic
  • Configuration consolidation (Hydra YAML files, .env additions): Changes to chunker factory, removal of deprecated splitters, and reorganization of RAG defaults require verification across config hierarchy
  • Public API changes: Multiple signature updates (async_search, async_multi_query_search, format_context, load_prompt) affecting downstream consumers; class rename (SingleRetreiver→SingleRetriever)
  • Interdependencies: Changes span across modules (chunker→embedder→vectordb→retriever→pipeline) with implicit contracts that need validation
  • Token counting logic in format_context uses ChatOpenAI client; needs verification of token measurement accuracy and edge cases
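
To make the token-budget concern in the last bullet concrete, here is a minimal sketch of token-aware context packing. The budget follows the reranker.top_k × chunker.chunk_size rule described in the walkthrough; the name pack_context, the whitespace token counter, and the separator are assumptions for illustration and not the PR's format_context implementation, which reportedly counts tokens through ChatOpenAI.

```python
from typing import Callable


def count_tokens_whitespace(text: str) -> int:
    """Toy token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


def pack_context(
    chunks: list[str],
    max_context_tokens: int,
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
    separator: str = "\n---\n",
) -> str:
    """Concatenate chunks in ranked order, stopping before the budget overflows.

    Separator tokens are not counted in this sketch.
    """
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_context_tokens:
            break  # adding this chunk would exceed the budget
        kept.append(chunk)
        used += cost
    return separator.join(kept)


# Budget derived as top_k * chunk_size (hypothetical small values).
top_k, chunk_size = 2, 3
context = pack_context(["a b c", "d e f", "g h i"], max_context_tokens=top_k * chunk_size)
```

A real tokenizer will usually disagree with a whitespace count, which is the accuracy question the bullet raises; swapping count_tokens for the model's own counter keeps the packing logic unchanged.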

Possibly related PRs

  • Feat/chunking #165: Major overlapping changes in chunker architecture (ChunkContextualizer, BaseChunker, MDElement), vectordb surrounding chunks retrieval, retriever API changes, embedder max_model_len support, and consolidated config structure
  • Fix/filename #167: Introduces sanitize_filename utility and file handling refactor in components/files.py with test coverage
  • feat: Allow model in openAI completion routes #173: Adds is_direct_llm_model() helper in openai.py router for distinguishing direct LLM vs. RAG-augmented requests
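
The routing rule behind is_direct_llm_model() can be sketched as follows, based on the commit note that an empty model or the configured LLM model is now accepted. The signature, the second parameter, and the resolve_model_name helper are assumptions for illustration; the actual helper in openrag/routers/openai.py may differ.

```python
from typing import Optional


def is_direct_llm_model(model: Optional[str], configured_llm_model: str) -> bool:
    """Return True when the request should bypass RAG and go straight to the LLM.

    Assumption: a request is "direct" if the model field is empty or missing,
    or if it names the LLM configured on the server. Anything else is treated
    as a RAG partition identifier.
    """
    if not model:
        return True
    return model == configured_llm_model


def resolve_model_name(model: Optional[str], configured_llm_model: str) -> str:
    """Fall back to the configured model when the client sends an empty one."""
    return model or configured_llm_model
```

With this shape, clients built on OpenAI-compatible libraries that require a model value can send the configured model name and still reach the LLM directly.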

Suggested labels

chore, enhancement, refactoring

Suggested reviewers

  • dodekapod
  • codescene-delta-analysis

🐰 Hops with glee through configs neat,
Chunks contextualized, retrieval sweet!
Surrounding echoes, tokens counted true,
A refactored pipeline, fresh and new. 🌟

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 41.56% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
| Title check | ❓ Inconclusive | The title 'Merge for release v1.1.5' is vague and generic, using non-descriptive terms that don't convey meaningful information about the specific changes in the changeset. | Replace the generic 'Merge for release' phrase with a specific summary of the main changes, such as 'Add chunker refactoring, retriever updates, and embedder configuration' or similar, to clearly communicate the primary modifications. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@paultranvan
Collaborator Author

@coderabbitai pause

@codescene-delta-analysis codescene-delta-analysis bot left a comment

Code Health Improved (2 files improve in Code Health)

Gates Failed
Prevent hotspot decline (3 hotspots with Complex Method)
Enforce advisory code health rules (4 files with Complex Method, Overall Code Complexity)

Gates Passed
4 Quality Gates Passed

Reason for failure

| Prevent hotspot decline | Violations | Code Health Impact |
| --- | --- | --- |
| vectordb.py | 1 rule in this hotspot | 7.29 → 7.06 |
| openai.py | 1 rule in this hotspot | 7.10 → 7.08 |
| chunker.py | 1 rule in this hotspot | 8.55 → 9.39 |

| Enforce advisory code health rules | Violations | Code Health Impact |
| --- | --- | --- |
| vectordb.py | 1 advisory rule | 7.29 → 7.06 |
| openai.py | 1 advisory rule | 7.10 → 7.08 |
| utils.py | 1 advisory rule | 7.86 → 8.57 |
| chunker.py | 1 advisory rule | 8.55 → 9.39 |

Improvements

| File | Code Health Impact | Categories Improved |
| --- | --- | --- |
| chunker.py | 8.55 → 9.39 | Code Duplication, Bumpy Road Ahead, Excess Number of Function Arguments |
| utils.py | 7.86 → 8.57 | Complex Method, Bumpy Road Ahead, Deep, Nested Complexity, Excess Number of Function Arguments |

Comment on lines +194 to +200
def _get_chunks(
    self, content: str, metadata: Optional[dict] = None, log=None
) -> list[Document]:
    log = log or logger
    texts, tables_and_images = self._prepare_md_elements(content=content)
    combined_texts = "\n".join([e.content for e in texts])
    text_chunks = self.split_text(combined_texts)

❌ New issue: Complex Method
BaseChunker._get_chunks has a cyclomatic complexity of 9, threshold = 9


from langchain_core.documents.base import Document
from langchain_openai import ChatOpenAI
from typing import Callable, Literal, Optional

❌ New issue: Overall Code Complexity
This module has a mean cyclomatic complexity of 5.22 across 9 functions. The mean complexity threshold is 4

similarity_threshold: int = 0.80,
partition: list[str] = None,
filter: Optional[dict] = None,
with_surrounding_chunks: bool = False,

❌ Getting worse: Complex Method
MilvusDB.async_search increases in cyclomatic complexity from 12 to 13, threshold = 9

Comment on lines +555 to +602
async def get_surrounding_chunks(self, docs: list[Document]) -> list[Document]:
    existant_ids = set(doc.metadata.get("_id") for doc in docs)

    # Collect all prev/next section IDs
    section_ids = [
        section_id
        for doc in docs
        for section_id in [
            doc.metadata.get("prev_section_id"),
            doc.metadata.get("next_section_id"),
        ]
        if section_id is not None
    ]

    if not section_ids:
        return []

    # Query all sections in parallel
    tasks = [
        self._async_client.query(
            collection_name=self.collection_name,
            filter=f"section_id == {section_id}",
            limit=1,
        )
        for section_id in section_ids
    ]
    responses = await asyncio.gather(*tasks)

    # Build output, skipping duplicates
    output_docs = []
    for response in responses:
        if not response:
            continue
        doc_id = response[0].get("_id")
        if doc_id not in existant_ids:
            existant_ids.add(doc_id)
            output_docs.append(
                Document(
                    page_content=response[0]["text"],
                    metadata={
                        key: value
                        for key, value in response[0].items()
                        if key not in ["text", "vector"]
                    },
                )
            )

    return output_docs

❌ New issue: Complex Method
MilvusDB.get_surrounding_chunks has a cyclomatic complexity of 10, threshold = 9
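
The method quoted above depends on prev_section_id and next_section_id being stored with each chunk. A minimal sketch of how such order metadata could be generated and then used to pick neighbours is shown below. Both function names and the integer section IDs are assumptions for illustration; this is not the PR's _gen_chunk_order_metadata, and the de-duplication here happens before querying rather than after.

```python
from typing import Optional


def gen_chunk_order_metadata(section_ids: list[int]) -> list[dict[str, Optional[int]]]:
    """Link each chunk to its neighbours via prev/next section IDs."""
    last = len(section_ids) - 1
    return [
        {
            "section_id": section_id,
            "prev_section_id": section_ids[i - 1] if i > 0 else None,
            "next_section_id": section_ids[i + 1] if i < last else None,
        }
        for i, section_id in enumerate(section_ids)
    ]


def neighbour_ids(hits: list[dict[str, Optional[int]]]) -> list[int]:
    """Collect neighbour IDs of search hits, skipping the hits themselves and duplicates."""
    hit_ids = {hit["section_id"] for hit in hits}
    seen: set[int] = set()
    result: list[int] = []
    for hit in hits:
        for key in ("prev_section_id", "next_section_id"):
            sid = hit[key]
            if sid is not None and sid not in hit_ids and sid not in seen:
                seen.add(sid)
                result.append(sid)
    return result


meta = gen_chunk_order_metadata([10, 11, 12])
```

Filtering IDs before querying would also move some branching out of the retrieval method, which may help with the complexity threshold flagged here.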

     user_partitions=Depends(current_user_or_admin_partitions_list),
 ):
-    model_name = request.model
+    model_name = request.model or config.llm.get("model")

❌ Getting worse: Complex Method
openai_completion increases in cyclomatic complexity from 13 to 14, threshold = 9

@paultranvan paultranvan merged commit efa103f into main Dec 17, 2025
5 of 8 checks passed

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
openrag/components/files.py (3)

35-39: Fix return type annotation mismatch.

The function returns a str but is annotated as returning Path. This causes type inconsistency and could mislead callers.

-def make_unique_filename(filename: str) -> Path:
+def make_unique_filename(filename: str) -> str:
     ts = int(time.time() * 1000)
     rand = secrets.token_hex(2)
     unique_name = f"{ts}_{rand}_{filename}"
     return unique_name

54-58: Handle potential None filename.

UploadFile.filename can be None if the client doesn't provide a filename. This would cause issues when constructing the file path.

     if with_random_prefix:
-        filename = make_unique_filename(file.filename)
+        filename = make_unique_filename(file.filename or "unnamed")
     else:
-        filename = file.filename
+        filename = file.filename or "unnamed"
     file_path = dest_dir / filename
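
The two filename suggestions above can be combined into one small helper, sketched here. make_unique_filename follows the corrected signature from the first suggestion; resolve_upload_filename and its "unnamed" fallback parameter are hypothetical names used only for illustration.

```python
import secrets
import time
from typing import Optional


def make_unique_filename(filename: str) -> str:
    """Prefix the name with a millisecond timestamp and a short random token."""
    ts = int(time.time() * 1000)
    rand = secrets.token_hex(2)  # 2 random bytes, rendered as 4 hex characters
    return f"{ts}_{rand}_{filename}"


def resolve_upload_filename(
    filename: Optional[str],
    with_random_prefix: bool = True,
    fallback: str = "unnamed",
) -> str:
    """Pick a usable name even when the client omitted one (hypothetical helper)."""
    name = filename or fallback
    return make_unique_filename(name) if with_random_prefix else name
```

Keeping the fallback in one place avoids repeating `file.filename or "unnamed"` in both branches.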

79-92: Dead code: else branch is unreachable.

ray.wait([future]) without a timeout parameter blocks until the future completes, so ready will always be non-empty. The else branch (lines 90-92) is unreachable dead code.

Either remove the dead code or reintroduce the timeout if it was intentionally removed:

     # Wait for it to complete, with timeout
-    ready, _ = await asyncio.to_thread(ray.wait, [future])
+    ready, _ = await asyncio.to_thread(ray.wait, [future], timeout=SERIALIZATION_TIMEOUT)
 
     if ready:
         try:
             doc = await ready[0]
             return doc
         except TaskCancelledError:
             raise
         except Exception:
             raise
     else:
         ray.cancel(future, recursive=True)
-        raise TimeoutError(f"Serialization task {task_id} timed out after seconds")
+        raise TimeoutError(f"Serialization task {task_id} timed out after {SERIALIZATION_TIMEOUT} seconds")

Or if timeout is intentionally removed, clean up the dead code:

     # Wait for it to complete
-    ready, _ = await asyncio.to_thread(ray.wait, [future])
-
-    if ready:
-        try:
-            doc = await ready[0]
-            return doc
-        except TaskCancelledError:
-            raise
-        except Exception:
-            raise
-    else:
-        ray.cancel(future, recursive=True)
-        raise TimeoutError(f"Serialization task {task_id} timed out after seconds")
+    await asyncio.to_thread(ray.wait, [future])
+    try:
+        doc = await future
+        return doc
+    except TaskCancelledError:
+        raise
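
The reachability argument can be checked with a standard-library analogue. The sketch below uses asyncio.wait as a stand-in for ray.wait (an assumption; the two APIs are similar in shape but not identical): without a timeout the "not ready" branch never runs, and with a timeout it becomes reachable.

```python
import asyncio
from typing import Optional


async def slow_task(delay: float) -> str:
    await asyncio.sleep(delay)
    return "done"


async def wait_with_timeout(delay: float, timeout: Optional[float]) -> str:
    task = asyncio.create_task(slow_task(delay))
    done, _pending = await asyncio.wait({task}, timeout=timeout)
    if done:
        return task.result()
    # Only reachable when a timeout is supplied and expires first.
    task.cancel()
    raise TimeoutError(f"task timed out after {timeout} seconds")


def run(delay: float, timeout: Optional[float]) -> str:
    """Synchronous wrapper that reports a timeout as a plain string."""
    try:
        return asyncio.run(wait_with_timeout(delay, timeout))
    except TimeoutError:
        return "timeout"
```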
🧹 Nitpick comments (10)
openrag/components/indexer/loaders/base.py (1)

153-153: Clarify the formatting change.

The addition of extra blank lines around the image description content may affect downstream parsing. Ensure this formatting change is intentional and compatible with consumers of this XML-wrapped content.

.hydra_config/rag/base.yaml (1)

1-4: LGTM!

Good extraction of shared RAG configuration. Consider updating the comment since this base config is used by both ChatBotRag and SimpleRag, not just chatbot RAG.

-# Config for chatbot RAG
+# Base config for RAG modes
 mode: ''
 chat_history_depth: 4
 max_contextualized_query_len: 512
openrag/components/test_files.py (1)

60-83: Consider edge case validation for empty filenames.

The test case at line 77 expects sanitize_filename("") to return "". While this test may pass, an empty filename could cause issues in actual file operations (e.g., save_file_to_disk). Consider whether sanitize_filename should validate against empty strings and raise an error, or if the calling code should handle this validation.

openrag/routers/indexer.py (1)

250-266: Consistent filename handling, but improve exception chaining.

The filename sanitization and metadata handling mirrors add_file appropriately. However, the exception handling at lines 253-258 could be improved by using raise ... from err or raise ... from None to preserve the exception chain, as flagged by static analysis.

Apply this diff:

     try:
         original_filename = file.filename
         file.filename = sanitize_filename(file.filename)
         file_path = await save_file_to_disk(file, save_dir, with_random_prefix=True)
-    except Exception:
+    except Exception as e:
         log.exception("Failed to save file to disk.")
         raise HTTPException(
             status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
             detail="Failed to save uploaded file.",
-        )
+        ) from e

Based on static analysis hints.
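
A small, framework-free illustration of what `raise ... from e` changes: the original error is attached as `__cause__` of the new exception. UploadError is a hypothetical stand-in for the HTTPException used in the router.

```python
class UploadError(Exception):
    """Stand-in for the HTTPException raised by the router (assumption)."""


def save_file(path: str) -> None:
    raise OSError(f"disk full while writing {path}")


def handle_upload(path: str) -> None:
    try:
        save_file(path)
    except Exception as e:
        # `from e` records the original error as __cause__ of the new one.
        raise UploadError("Failed to save uploaded file.") from e


def cause_of_failure(path: str) -> str:
    """Return the class name of the underlying error, for demonstration."""
    try:
        handle_upload(path)
    except UploadError as err:
        return type(err.__cause__).__name__
    return "no error"
```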

openrag/components/utils.py (2)

125-147: Consider reusing the LLM instance for token counting.

A new ChatOpenAI instance is created on every call to format_context. This incurs unnecessary overhead for repeated invocations. Consider caching the LLM instance or accepting it as a parameter.

+_token_counter_llm = None
+
+def _get_token_counter():
+    global _token_counter_llm
+    if _token_counter_llm is None:
+        _token_counter_llm = ChatOpenAI(**config.llm)
+    return _token_counter_llm
+
 def format_context(docs: list[Document], max_context_tokens: int = 4096) -> str:
     if not docs:
         return "No document found from the database"
 
-    llm = ChatOpenAI(**config.llm)
-    _length_function = llm.get_num_tokens
+    _length_function = _get_token_counter().get_num_tokens
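
An alternative to the module-level global suggested above is functools.lru_cache, which gives the same create-once behaviour without a `global` statement. In this sketch FakeLLM is a stand-in for the real client (an assumption, so the example runs without network access or extra dependencies).

```python
from functools import lru_cache


class FakeLLM:
    """Stand-in for an expensive client such as ChatOpenAI (assumption)."""

    instances = 0  # counts how many clients were constructed

    def __init__(self) -> None:
        FakeLLM.instances += 1

    def get_num_tokens(self, text: str) -> int:
        return len(text.split())


@lru_cache(maxsize=1)
def get_token_counter() -> FakeLLM:
    """Create the client on first use and reuse it afterwards."""
    return FakeLLM()


def count_tokens(text: str) -> int:
    return get_token_counter().get_num_tokens(text)
```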

150-157: Hard-coded path may cause issues outside Docker.

The lang_detect_cache_dir is hard-coded to /app/model_weights/, which assumes a specific deployment environment. Consider making this configurable via the config system.

 # Initialize language detector
-lang_detect_cache_dir = "/app/model_weights/"
+lang_detect_cache_dir = getattr(config.paths, "model_weights_dir", "/app/model_weights/")
 lang_detector_config = LangDetectConfig(
     max_input_length=1024,  # chars
     model="auto",
     cache_dir=lang_detect_cache_dir,
 )
openrag/components/indexer/chunker/utils.py (1)

17-31: Consider using a dataclass for MDElement.

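The MDElement class is a simple data container. Using @dataclass would reduce boilerplate and generate __init__ and __eq__ automatically. Note that a default (non-frozen) dataclass sets __hash__ to None, so hashable instances would require frozen=True.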

+from dataclasses import dataclass
+
-class MDElement:
-    """Class representing a segment of markdown content."""
-
-    def __init__(
-        self,
-        type: Literal["text", "table", "image"],
-        content: str,
-        page_number: Optional[int] = None,
-    ):
-        self.type = type  # 'text', 'table', 'image'
-        self.content = content
-        self.page_number = page_number
-
-    def __repr__(self):
-        return f"Element(type={self.type}, page_number={self.page_number}, content={self.content[:100]}...)"
+@dataclass
+class MDElement:
+    """Class representing a segment of markdown content."""
+    type: Literal["text", "table", "image"]
+    content: str
+    page_number: Optional[int] = None
+
+    def __repr__(self):
+        return f"Element(type={self.type}, page_number={self.page_number}, content={self.content[:100]}...)"
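
One detail worth checking before adopting the suggestion: a plain @dataclass generates __eq__ but leaves instances unhashable, while frozen=True restores hashing. The sketch below shows both; FrozenMDElement is a hypothetical variant, not part of the PR.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class MDElement:
    """Segment of markdown content (mirrors the suggested dataclass)."""

    type: Literal["text", "table", "image"]
    content: str
    page_number: Optional[int] = None


@dataclass(frozen=True)
class FrozenMDElement:
    """Immutable variant (hypothetical); eq plus frozen makes instances hashable."""

    type: Literal["text", "table", "image"]
    content: str
    page_number: Optional[int] = None


a = MDElement("text", "hello", 1)
b = MDElement("text", "hello", 1)
same = a == b  # generated __eq__ compares field by field
```

If MDElement instances are ever placed in sets or used as dict keys, the frozen form (or an explicit __hash__) would be needed.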
openrag/components/indexer/vectordb/vectordb.py (2)

387-396: Add strict=True to zip for safety.

Per static analysis hint B905, adding strict=True ensures the iterables have equal length and prevents silent data loss if they don't match.

-            for chunk, vector, order_metadata in zip(chunks, vectors, order_metadata_l):
+            for chunk, vector, order_metadata in zip(chunks, vectors, order_metadata_l, strict=True):

555-602: N+1 query problem for surrounding chunks—consider batching with IN operator.

The current code makes a separate query for each surrounding chunk. Milvus supports the IN operator for batch queries in filter expressions. To optimize, batch the queries:

# Batch query using IN operator instead of N separate queries
filter_expr = f"section_id in {section_ids}"
responses = await self._async_client.query(
    collection_name=self.collection_name,
    filter=filter_expr,
    limit=len(section_ids),
)

If you prefer keeping the parallel approach with asyncio.gather(), add a comment explaining the trade-off. Use filter_params for proper parameter escaping if needed.
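
A minimal sketch of building the batched expression safely is shown below. It assumes section IDs are integers (the unquoted `section_id == {section_id}` filter in the PR suggests so) and that the list literal produced by Python matches the filter syntax; build_in_filter is a hypothetical helper, not a Milvus API.

```python
def build_in_filter(field: str, ids: list[int]) -> str:
    """Build a filter such as `section_id in [1, 2, 3]` from de-duplicated integer IDs."""
    unique_ids = list(dict.fromkeys(ids))  # drops duplicates, keeps first-seen order
    if not all(isinstance(i, int) for i in unique_ids):
        raise TypeError("only integer IDs are supported in this sketch")
    return f"{field} in {unique_ids}"


expr = build_in_filter("section_id", [3, 1, 3, 2])
```

A batched query returns rows in no guaranteed order, so the caller would need to match rows back to IDs rather than rely on position as the per-ID queries do.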

openrag/components/indexer/chunker/chunker.py (1)

62-66: Catching broad Exception masks specific errors.

Per static analysis (BLE001), catching Exception broadly can hide unexpected issues. Consider catching more specific exceptions or at least logging the exception type.

             except Exception as e:
                 logger.warning(
-                    f"Error contextualizing chunk of document `{filename}`: {e}"
+                    f"Error contextualizing chunk of document `{filename}`: {type(e).__name__}: {e}"
                 )
                 return ""
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9efbad8 and 97a3f7b.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (36)
  • .coderabbit.yaml (1 hunks)
  • .github/workflows/smoke_test.yaml (2 hunks)
  • .github/workflows/smoke_test/.env (1 hunks)
  • .hydra_config/chunker/semantic_splitter.yaml (0 hunks)
  • .hydra_config/chunker/token_splitter.yaml (0 hunks)
  • .hydra_config/config.yaml (3 hunks)
  • .hydra_config/rag/ChatBotRag.yaml (1 hunks)
  • .hydra_config/rag/SimpleRag.yaml (1 hunks)
  • .hydra_config/rag/base.yaml (1 hunks)
  • .hydra_config/retriever/base.yaml (1 hunks)
  • docker-compose.yaml (2 hunks)
  • docs/content/docs/documentation/API.mdx (1 hunks)
  • docs/content/docs/documentation/backup_restore.md (6 hunks)
  • docs/content/docs/documentation/env_vars.md (4 hunks)
  • openrag/components/files.py (1 hunks)
  • openrag/components/indexer/chunker/chunker.py (2 hunks)
  • openrag/components/indexer/chunker/test_chunking.py (1 hunks)
  • openrag/components/indexer/chunker/utils.py (1 hunks)
  • openrag/components/indexer/embeddings/openai.py (2 hunks)
  • openrag/components/indexer/loaders/base.py (1 hunks)
  • openrag/components/indexer/vectordb/vectordb.py (11 hunks)
  • openrag/components/pipeline.py (3 hunks)
  • openrag/components/prompts/prompts.py (2 hunks)
  • openrag/components/retriever.py (5 hunks)
  • openrag/components/test_files.py (3 hunks)
  • openrag/components/utils.py (3 hunks)
  • openrag/routers/indexer.py (4 hunks)
  • openrag/routers/openai.py (6 hunks)
  • openrag/routers/utils.py (1 hunks)
  • openrag/scripts/backup.sh.example (1 hunks)
  • openrag/scripts/restore.sh.example (1 hunks)
  • prompts/example1/chunk_contextualizer_tmpl.txt (1 hunks)
  • prompts/example1/sys_prompt_tmpl.txt (1 hunks)
  • pyproject.toml (1 hunks)
  • pytest.ini (1 hunks)
  • quick_start/docker-compose.yaml (2 hunks)
💤 Files with no reviewable changes (2)
  • .hydra_config/chunker/semantic_splitter.yaml
  • .hydra_config/chunker/token_splitter.yaml
🧰 Additional context used
🧬 Code graph analysis (7)
openrag/routers/openai.py (1)
openrag/models/openai.py (2)
  • OpenAIChatCompletionRequest (14-30)
  • OpenAICompletionRequest (75-92)
openrag/components/pipeline.py (1)
openrag/components/utils.py (1)
  • format_context (125-147)
openrag/components/utils.py (2)
openrag/utils/logger.py (1)
  • get_logger (10-47)
openrag/config/config.py (1)
  • load_config (12-29)
openrag/components/test_files.py (1)
openrag/components/files.py (2)
  • sanitize_filename (13-32)
  • save_file_to_disk (42-68)
openrag/routers/indexer.py (1)
openrag/components/files.py (2)
  • sanitize_filename (13-32)
  • save_file_to_disk (42-68)
openrag/components/indexer/chunker/test_chunking.py (1)
openrag/components/indexer/chunker/utils.py (5)
  • MDElement (17-31)
  • chunk_table (191-253)
  • clean_markdown_table_spacing (256-280)
  • get_chunk_page_number (110-145)
  • split_md_elements (53-107)
openrag/components/indexer/vectordb/vectordb.py (1)
openrag/components/indexer/indexer.py (1)
  • chunk (53-57)
🪛 Gitleaks (8.30.0)
docs/content/docs/documentation/API.mdx

[high] 233-236: Discovered a potential authorization token provided in a curl command header, which could compromise the curl accessed resource.

(curl-auth-header)

🪛 LanguageTool
docs/content/docs/documentation/API.mdx

[grammar] ~226-~226: Ensure spelling is correct
Context: ... "stream": false }' ``` You can also direclty use this endpoint with no RAG pipeline,...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 Ruff (0.14.8)
openrag/routers/indexer.py

255-258: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

openrag/components/indexer/chunker/chunker.py

61-61: Consider moving this statement to an else block

(TRY300)


62-62: Do not catch blind exception: Exception

(BLE001)


103-103: Do not catch blind exception: Exception

(BLE001)


117-117: Unused method argument: kwargs

(ARG002)

openrag/components/indexer/vectordb/vectordb.py

388-388: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


532-532: Consider moving this statement to an else block

(TRY300)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: index-backup-restore
🔇 Additional comments (45)
docs/content/docs/documentation/API.mdx (2)

221-221: LGTM: Temperature value updated.

The lower temperature value (0.1) provides more deterministic outputs, which is appropriate for documentation examples.


233-248: LGTM: Non-RAG usage example is clear and consistent.

The example correctly demonstrates using an empty model string to bypass the RAG pipeline, and the temperature value matches the earlier example for consistency.

Note: The Gitleaks warning about the authorization header is a false positive—YOUR_AUTH_TOKEN is clearly a documentation placeholder, not an actual credential.

openrag/scripts/restore.sh.example (1)

7-7: LGTM!

The addition of --env-file .env ensures environment variables are properly loaded during restore operations, maintaining consistency with other docker-compose invocations across the PR.

.github/workflows/smoke_test/.env (1)

29-30: LGTM!

The new MAX_MODEL_LEN environment variable with a clear comment provides good guidance for adjusting the embedder model's maximum length based on model capabilities.

.coderabbit.yaml (1)

1-5: LGTM!

The CodeRabbit configuration is properly structured and enables auto-review on all branches while allowing reviews to complete even if the PR is closed. The settings align with the provided schema.

pytest.ini (1)

8-11: LGTM!

The environment variables properly configure test paths for Hydra configuration, prompts, and logs, aligning with the broader configuration refactoring in this PR.

openrag/components/indexer/embeddings/openai.py (2)

17-17: LGTM!

The max_model_len configuration parameter is properly extracted with a sensible default of 8192, aligning with the environment variable changes across the PR.


38-40: Remove or replace the unsupported truncate_prompt_tokens parameter.

The truncate_prompt_tokens parameter is not documented in the official OpenAI embeddings API. The embeddings API enforces a maximum of 8192 tokens per input, and the official guidance recommends manual truncation using tokenization before sending requests. Remove this parameter or implement manual text truncation using tiktoken before calling the API.
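
The manual truncation suggested above could look roughly like the following. This is a minimal sketch, not the project's embedder code: `truncate_to_max_tokens` and the whitespace tokenizer are hypothetical names used for illustration, and a real tokenizer's `encode`/`decode` functions would be passed in instead of the stand-ins.

```python
from typing import Callable, Sequence


def truncate_to_max_tokens(
    text: str,
    max_model_len: int,
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
) -> str:
    """Return `text` cut down to at most `max_model_len` tokens."""
    tokens = encode(text)
    if len(tokens) <= max_model_len:
        return text  # already within the limit, leave untouched
    return decode(tokens[:max_model_len])


# Whitespace stand-in tokenizer, for illustration only (an assumption,
# not the tokenizer used by the embedding model).
def whitespace_encode(text: str) -> list[str]:
    return text.split()


def whitespace_decode(tokens: Sequence[str]) -> str:
    return " ".join(tokens)


truncated = truncate_to_max_tokens(
    "one two three four five", 3, whitespace_encode, whitespace_decode
)
```

Whether truncation belongs on the client or the server depends on the backend actually serving the embeddings; the sketch only covers the client-side option.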

prompts/example1/chunk_contextualizer_tmpl.txt (1)

1-22: LGTM!

The restructured prompt template significantly improves clarity and usability with explicit sections for Core Principles, Output Format, and Examples. The standardized field names (Filename, First Chunks, Previous Chunks, Current Chunk) and clear constraints (1-2 sentences, plain text) make the contextualization task more consistent and predictable.

openrag/components/files.py (1)

1-11: LGTM!

Import organization is appropriate with the new modules (asyncio, re, secrets, time) needed for the added functionality.

.hydra_config/rag/SimpleRag.yaml (1)

1-3: LGTM!

Clean refactoring to use Hydra's defaults composition pattern. Shared settings are properly extracted to base.yaml, reducing duplication across RAG configurations.

openrag/routers/utils.py (1)

233-234: LGTM!

Minor formatting adjustment.

prompts/example1/sys_prompt_tmpl.txt (1)

11-11: LGTM!

Clear instruction to prevent exposing internal file metadata in responses, which improves the user experience by keeping answers focused on content rather than source details.

docker-compose.yaml (1)

47-47: LGTM - Consistent alignment of max-model-len default.

The default value for --max-model-len has been updated from 8194 to 8192 in both the x-vllm_template and vllm-cpu configurations, aligning with the embedder configuration changes across the PR.

Also applies to: 146-146

quick_start/docker-compose.yaml (1)

45-45: LGTM - Consistent with main docker-compose.yaml.

The max-model-len default has been updated to 8192, matching the changes in the main docker-compose.yaml file and maintaining consistency across configurations.

Also applies to: 142-142

docs/content/docs/documentation/backup_restore.md (1)

13-19: LGTM - Documentation properly updated for environment variable loading.

The documentation now consistently instructs users to include --env-file .env in docker compose commands for backup and restore operations, aligning with the new environment variable loading pattern established across the PR.

Also applies to: 26-32, 40-44, 55-62, 66-73, 78-82

.hydra_config/rag/ChatBotRag.yaml (1)

1-3: LGTM - Proper refactoring to use base configuration.

The configuration now properly references a base configuration pattern, centralizing common RAG settings (chat_history_depth, max_contextualized_query_len) in base.yaml and reducing duplication.

openrag/scripts/backup.sh.example (1)

6-12: LGTM - Consistent environment variable loading.

The backup script now includes --env-file .env, consistent with the pattern established in documentation and workflow changes.

.github/workflows/smoke_test.yaml (1)

100-101: LGTM - Workflow properly loads environment variables.

The smoke test workflow now consistently loads environment variables from .env for backup and restore operations, aligning with the new environment configuration pattern.

Also applies to: 110-110, 115-116

.hydra_config/retriever/base.yaml (1)

3-4: LGTM - New surrounding chunks retrieval configuration.

The new with_surrounding_chunks option enables retrieval of surrounding chunks for each retrieved chunk, with a sensible default of true. This aligns with the broader PR changes to the retriever and vectordb components.

openrag/components/pipeline.py (2)

152-152: LGTM - Token-aware context formatting properly integrated.

The computed max_context_tokens is now properly passed to format_context() in both chat completion and completions paths, enabling token-aware context truncation based on retrieval configuration.

Also applies to: 183-183


77-79: Align config access patterns to use consistent bracket notation.

Lines 38–39 access config.reranker["top_k"] using bracket notation, while lines 77–78 use the defensive .get() pattern for the same keys. Since reranker.top_k and chunker.chunk_size are always defined in the Hydra config with built-in defaults (top_k=10, chunk_size=512), the defensive .get() calls are redundant. Use bracket notation consistently throughout:

self.max_context_tokens = config.reranker["top_k"] * config.chunker["chunk_size"]

This improves readability and maintains a uniform config access pattern.
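
For illustration, a budget computed this way might be applied as sketched below. The real `format_context()` signature is not shown in this review, so the function name `format_context_sketch`, its parameters, and the whitespace token counter are all assumptions.

```python
def count_tokens(text: str) -> int:
    # Stand-in token counter (assumption): one token per whitespace-separated word.
    return len(text.split())


def format_context_sketch(chunks: list[str], max_context_tokens: int) -> str:
    """Concatenate chunks in order until the token budget would be exceeded."""
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n > max_context_tokens:
            break  # stop before overflowing the budget
        kept.append(chunk)
        used += n
    return "\n\n".join(kept)


# Budget derived as in the suggestion above: top_k * chunk_size.
top_k, chunk_size = 2, 3
context = format_context_sketch(["a b c", "d e f", "g h i"], top_k * chunk_size)
```

Stopping at the first chunk that does not fit keeps the highest-ranked chunks intact rather than cutting one in the middle.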

openrag/components/test_files.py (2)

32-57: LGTM!

The test correctly uses monkeypatching to mock make_unique_filename, ensuring deterministic behavior while testing the with_random_prefix flag. The test validates both the filename transformation and file content persistence.


5-6: The components.files import works reliably in this project because pytest.ini is configured with pythonpath = ./openrag, which adds the openrag directory to sys.path. The import is not fragile to different test runner working directories when pytest is used correctly. However, for better clarity and to reduce reliance on pytest configuration, consider using a relative import: from .files import sanitize_filename, save_file_to_disk since this test file is part of the openrag.components package.

Likely an incorrect or invalid review comment.

.hydra_config/config.yaml (3)

53-53: LGTM!

Increasing reranker.top_k from 5 to 10 aligns with modern LLMs having larger context windows, allowing more documents to be included for better RAG results. The inline comment clearly explains the rationale.


34-34: No action required. The default max_model_len value of 8192 correctly matches jina-embeddings-v3's maximum input length of 8192 tokens.


3-3: Clarify chunker strategy description.

The config sets recursive_splitter as the default chunker. However, there's no evidence of prior support for semantic_splitter or markdown_splitter options in the codebase—only recursive_splitter has ever been implemented as a configurable Hydra option. The documentation already accurately reflects this single available strategy.

Likely an incorrect or invalid review comment.

openrag/routers/indexer.py (1)

129-145: LGTM!

The filename handling follows good practices:

  • Preserves original_filename for display and metadata tracking
  • Sanitizes the filename to prevent filesystem issues (special characters, path traversal)
  • Uses centralized save_file_to_disk with random prefix to prevent collisions
  • Stores all relevant metadata for traceability
openrag/routers/openai.py (2)

104-115: LGTM!

The is_direct_llm_model helper function centralizes the logic for determining when to bypass RAG partitions. The function is well-documented with a clear docstring, and the implementation correctly handles the three cases: None, empty string, or matching the configured default model.
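
The three cases described above can be sketched as follows. This is an illustrative reconstruction, not the code from the PR: the `ChatRequest` dataclass and the `default_model` parameter are hypothetical, and the model name is the example given in the API documentation.

```python
from dataclasses import dataclass
from typing import Optional

# Example model name taken from the API docs; the configured value may differ.
DEFAULT_LLM_MODEL = "Mistral-Small-3.1-24B-Instruct-2503"


@dataclass
class ChatRequest:
    """Hypothetical stand-in for the request object used by the router."""

    model: Optional[str] = None


def is_direct_llm_model(request: ChatRequest, default_model: str = DEFAULT_LLM_MODEL) -> bool:
    """True when the request should bypass RAG and go straight to the LLM."""
    # Covers: no model (None), empty model (""), or the configured LLM model.
    return not request.model or request.model == default_model
```

With this shape, a request whose model carries the `openrag` prefix falls through to the RAG path, while the three direct cases share one code path.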


171-176: Consistent usage of the new helper function.

Both endpoints (/chat/completions and /completions) consistently use is_direct_llm_model(request) to determine partition routing. This refactoring improves maintainability by centralizing the decision logic and makes the code more readable.

Also applies to: 286-291

openrag/components/indexer/chunker/test_chunking.py (4)

12-116: LGTM!

The TestSplitMdElements class provides comprehensive coverage for markdown element parsing:

  • Basic text, table, and image parsing
  • Edge case: tables inside image descriptions are correctly ignored
  • Page number assignment for different element types

The test assertions are specific and verify both element count and content, ensuring robust validation of the parsing logic.


118-152: LGTM!

The TestGetChunkPageNumber class thoroughly tests page number assignment logic:

  • Chunks with no markers inherit from previous chunk
  • Markers at start/end/middle are handled correctly
  • Boundary conditions are explicitly tested

This ensures accurate page tracking for document chunks.


154-192: LGTM!

The TestCleanMarkdownTableSpacing class validates table normalization:

  • Excessive and inconsistent spacing is corrected
  • Empty cells are preserved
  • Consistent "| cell | cell |" format is enforced

These tests ensure tables are properly formatted before further processing.


194-252: LGTM!

The TestChunkTable class validates table chunking behavior:

  • Small tables remain unchunked
  • Large tables are split while preserving group integrity
  • Headers are included in all chunks
  • Mock length function (4 chars/token) provides deterministic testing

The test at lines 216-248 is particularly important, ensuring that related table rows (country groups) are not split across chunks, which would break semantic coherence.

docs/content/docs/documentation/env_vars.md (3)

97-109: LGTM!

The documentation clearly reflects the chunking strategy consolidation:

  • CHUNKER now supports only recursive_splitter
  • The strategy is well-explained with a reference to LangChain documentation
  • Users understand that text is hierarchically split while respecting CHUNK_SIZE

119-119: LGTM!

The new MAX_MODEL_LEN parameter is well-documented:

  • Default value (8192) matches the configuration
  • Clear explanation of truncation behavior when chunks exceed the limit
  • Users understand this controls embedding model context length

186-191: LGTM!

The retriever configuration is well-documented:

  • RETRIEVER_TYPE options are clearly listed (single, multiQuery, hyde)
  • New WITH_SURROUNDING_CHUNKS parameter is explained with its default value
  • Users understand that enabling this retrieves adjacent chunks for richer context
openrag/components/retriever.py (3)

35-56: LGTM!

The BaseRetriever properly adds support for surrounding chunks:

  • New with_surrounding_chunks parameter (default True) enables retrieving adjacent chunks
  • Parameter is stored and forwarded to the vector database search
  • Default value makes sense for RAG use cases where additional context improves results

59-60: LGTM!

The retriever classes are updated consistently:

  • SingleRetriever naming corrected (typo fix)
  • MultiQueryRetriever and HyDeRetriever properly forward with_surrounding_chunks to their respective search methods
  • All retriever types now support the surrounding chunks feature uniformly

Also applies to: 95-102, 136-142


146-160: LGTM!

The RetrieverFactory is properly updated:

  • Mapping for "single" now points to the correctly named SingleRetriever
  • Guard condition changed from checking retriever_type is None to retriever_cls is None, which is more appropriate since it validates the actual class lookup result
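
A rough sketch of why guarding on the lookup result is the safer check. The class bodies, the registry name, and the `create_retriever` function are placeholders rather than the project's implementation; only the type keys come from the documentation referenced above.

```python
class SingleRetriever:
    """Placeholder class; the real retriever takes configuration arguments."""


class MultiQueryRetriever:
    """Placeholder class."""


class HyDeRetriever:
    """Placeholder class."""


# Hypothetical registry keyed by the documented RETRIEVER_TYPE values.
RETRIEVERS = {
    "single": SingleRetriever,
    "multiQuery": MultiQueryRetriever,
    "hyde": HyDeRetriever,
}


def create_retriever(retriever_type: str):
    retriever_cls = RETRIEVERS.get(retriever_type)
    # Checking the lookup result catches both a missing type and an unknown one.
    if retriever_cls is None:
        raise ValueError(f"Unknown retriever type: {retriever_type!r}")
    return retriever_cls()
```

A check on `retriever_type is None` alone would let a misspelled type such as `"singel"` slip through to the instantiation step.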
openrag/components/prompts/prompts.py (2)

33-33: All references to the renamed constant have been updated.

The constant was successfully renamed from CHUNK_CONTEXTUALIZER to CHUNK_CONTEXTUALIZER_PROMPT for naming consistency with other prompt variables. No remaining references to the old name exist in the codebase.


11-27: The function already returns a single string as indicated by the -> str return type annotation (line 15). All callers throughout the codebase—in retriever.py, pipeline.py, loaders/base.py, and chunker.py—correctly handle the string return value without any tuple unpacking. No changes are needed.

Likely an incorrect or invalid review comment.

openrag/components/indexer/chunker/chunker.py (3)

283-304: RecursiveSplitter initializes text_splitter twice.

The parent BaseChunker.__init__ may lazily initialize text_splitter via split_text, but RecursiveSplitter also initializes it in its own __init__. This is fine but could be simplified.

The explicit initialization in RecursiveSplitter with custom separators is intentional and overrides the lazy fallback in BaseChunker.split_text.


40-53: No changes needed — code is compatible with project's Python version requirement.

The f-string syntax using backslash escape sequences in embedded expressions on lines 44 and 47 is valid and supported. Python 3.12 allows backslashes and unicode escape sequences in f-string expression components, and the project explicitly requires requires-python = ">=3.12", making this code compatible with all supported Python versions.

Likely an incorrect or invalid review comment.


194-254: Python dict merge with explicit keys correctly overrides upstream values—concern doesn't apply here.

At lines 218–221 and 230–234, rightmost dictionary values take precedence in the pattern {**metadata, "page": x}. Critically, the upstream metadata from doc.metadata (set at line 256) comes from document loaders and contains fields like file_id and partition—not page or chunk_type. The explicit keys intentionally set chunk-specific page numbers calculated during chunking (e.g., subtable.page_number, e.page_number, start_page), not carry upstream metadata forward. This behavior is correct and intentional.
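
The precedence rule can be checked in isolation. The metadata values below are made up for the example:

```python
# Upstream metadata as a loader might provide it (values are illustrative).
upstream = {"file_id": "doc-1", "partition": "default", "page": 1}

# Keys written after the unpacking win over keys coming from `upstream`.
chunk_metadata = {**upstream, "page": 7, "chunk_type": "table"}
```

The original `upstream` dict is left unmodified, since the merge builds a new dict.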

Likely an incorrect or invalid review comment.

Comment on lines 226 to +230
You can also direclty use this endpoint with no RAG pipeline, i.e. to directly use the LLM.
For that, simply do not specify any model:
For that, instead of using the `openrag` prefix for the model, you can:
- Specify no model
- Specify an empty model
- Specify the openRAG configured model, e.g. `Mistral-Small-3.1-24B-Instruct-2503`.

⚠️ Potential issue | 🟡 Minor

Fix typo in user-facing documentation.

Line 226 contains a spelling error: "direclty" should be "directly". The expanded guidance for non-RAG usage is helpful and improves clarity.

Apply this diff to correct the typo:

-You can also direclty use this endpoint with no RAG pipeline, i.e. to directly use the LLM.
+You can also directly use this endpoint with no RAG pipeline, i.e. to directly use the LLM.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
You can also direclty use this endpoint with no RAG pipeline, i.e. to directly use the LLM.
For that, simply do not specify any model:
For that, instead of using the `openrag` prefix for the model, you can:
- Specify no model
- Specify an empty model
- Specify the openRAG configured model, e.g. `Mistral-Small-3.1-24B-Instruct-2503`.
You can also directly use this endpoint with no RAG pipeline, i.e. to directly use the LLM.
For that, instead of using the `openrag` prefix for the model, you can:
- Specify no model
- Specify an empty model
- Specify the openRAG configured model, e.g. `Mistral-Small-3.1-24B-Instruct-2503`.
🧰 Tools
🪛 LanguageTool

[grammar] ~226-~226: Ensure spelling is correct
Context: ... "stream": false }' ``` You can also direclty use this endpoint with no RAG pipeline,...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
In docs/content/docs/documentation/API.mdx around lines 226 to 230, fix the
spelling error "direclty" to "directly" in the sentence that explains using the
endpoint without a RAG pipeline; update that one word in-place so the sentence
reads "You can also directly use this endpoint..." and leave the rest of the
guidance unchanged.

Comment on lines +13 to +32
def sanitize_filename(filename: str) -> str:
    # Split filename into name and extension
    path = Path(filename)
    name = path.stem
    ext = path.suffix

    # Remove special characters (keep only word characters and hyphens temporarily)
    name = re.sub(r"[^\w\-]", "_", name)

    # Replace hyphens with underscores
    name = name.replace("-", "_")

    # Collapse multiple underscores
    name = re.sub(r"_+", "_", name)

    # Remove leading/trailing underscores
    name = name.strip("_")

    # Reconstruct filename
    return name + ext

⚠️ Potential issue | 🟡 Minor

Handle edge case where sanitized name becomes empty.

If the filename consists only of special characters (e.g., ---.txt), the sanitized name becomes empty, resulting in just .txt. Consider adding a fallback:

     # Remove leading/trailing underscores
     name = name.strip("_")
 
+    # Fallback if name is empty after sanitization
+    if not name:
+        name = "file"
+
     # Reconstruct filename
     return name + ext
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def sanitize_filename(filename: str) -> str:
# Split filename into name and extension
path = Path(filename)
name = path.stem
ext = path.suffix
# Remove special characters (keep only word characters and hyphens temporarily)
name = re.sub(r"[^\w\-]", "_", name)
# Replace hyphens with underscores
name = name.replace("-", "_")
# Collapse multiple underscores
name = re.sub(r"_+", "_", name)
# Remove leading/trailing underscores
name = name.strip("_")
# Reconstruct filename
return name + ext
def sanitize_filename(filename: str) -> str:
# Split filename into name and extension
path = Path(filename)
name = path.stem
ext = path.suffix
# Remove special characters (keep only word characters and hyphens temporarily)
name = re.sub(r"[^\w\-]", "_", name)
# Replace hyphens with underscores
name = name.replace("-", "_")
# Collapse multiple underscores
name = re.sub(r"_+", "_", name)
# Remove leading/trailing underscores
name = name.strip("_")
# Fallback if name is empty after sanitization
if not name:
name = "file"
# Reconstruct filename
return name + ext
🤖 Prompt for AI Agents
In openrag/components/files.py around lines 13 to 32 the sanitized base name can
become empty (e.g., "---.txt") producing a filename like ".txt"; detect when
name is empty after stripping and replace it with a safe fallback (e.g., "file"
or "untitled") before reconstructing the filename, ensuring the extension is
preserved and the result does not start with a dot.
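
A self-contained version of the function with the fallback applied, so the edge case can be tried directly. The fallback name `"file"` follows the suggestion above; any other safe default would work equally well.

```python
import re
from pathlib import Path


def sanitize_filename(filename: str) -> str:
    # Split filename into name and extension
    path = Path(filename)
    name = path.stem
    ext = path.suffix

    # Replace anything that is not a word character or hyphen
    name = re.sub(r"[^\w\-]", "_", name)
    # Replace hyphens, collapse runs of underscores, trim the ends
    name = name.replace("-", "_")
    name = re.sub(r"_+", "_", name)
    name = name.strip("_")

    # Fallback if nothing survives sanitization (e.g. "---.txt")
    if not name:
        name = "file"

    return name + ext
```

With the fallback in place, `sanitize_filename("---.txt")` yields `file.txt` instead of a bare `.txt`.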

Comment on lines +139 to +149
filename = chunks[0].metadata.get("filename")
if not self.contextual_retrieval or len(chunks) < 2:
    return [
        Document(
            page_content=BASE_CHUNK_FORMAT.format(
                chunk_context="", filename=filename, content=c.page_content
            ),
            metadata=c.metadata,
        )
        for c in chunks
    ]

⚠️ Potential issue | 🟡 Minor

Potential IndexError when chunks list is empty.

Line 139 accesses chunks[0].metadata without first checking if chunks is non-empty. The check len(chunks) < 2 at line 140 doesn't prevent this.

     async def _apply_contextualization(
         self, chunks: list[Document], lang: Literal["en", "fr"] = "en"
     ) -> list[Document]:
         """Apply contextualization if enabled."""
+        if not chunks:
+            return []
         filename = chunks[0].metadata.get("filename")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
filename = chunks[0].metadata.get("filename")
if not self.contextual_retrieval or len(chunks) < 2:
return [
Document(
page_content=BASE_CHUNK_FORMAT.format(
chunk_context="", filename=filename, content=c.page_content
),
metadata=c.metadata,
)
for c in chunks
]
if not chunks:
return []
filename = chunks[0].metadata.get("filename")
if not self.contextual_retrieval or len(chunks) < 2:
return [
Document(
page_content=BASE_CHUNK_FORMAT.format(
chunk_context="", filename=filename, content=c.page_content
),
metadata=c.metadata,
)
for c in chunks
]
🤖 Prompt for AI Agents
In openrag/components/indexer/chunker/chunker.py around lines 139-149, the code
accesses chunks[0].metadata which will raise IndexError when chunks is empty;
add an explicit guard before that access (e.g., if not chunks: return []), then
proceed to read filename and the existing branch logic—this ensures the function
returns an empty list for empty input instead of crashing.

Comment on lines +53 to +107
 def split_md_elements(md_text: str) -> list[MDElement]:
     """
     Split markdown text into segments of text, tables, and images.
-    Returns a list of tuples: (type, content) where type is 'text', 'table', or 'image'
+    Returns a list of tuples:
+    - ('text', content) for text segments
+    - ('table', content, page_number) for tables
+    - ('image', content, page_number) for images
     """
+    # Find all page markers
+    page_markers = []
+    for match in PAGE_RE.finditer(md_text):
+        page_markers.append((match.start(), int(match.group(1))))
+    page_markers.sort()  # Ensure they're in order
+
     all_matches = []

     # Find image matches first and record their spans
     image_spans = []
     for match in IMAGE_RE.finditer(md_text):
         span = match.span()
-        all_matches.append((span, "image", match.group(1)))
+        page_num = get_page_number(span[0], page_markers)
+        all_matches.append((span, "image", match.group(1).strip(), page_num))
         image_spans.append(span)

     # Find table matches, but skip those that are fully inside an image description
     for match in TABLE_RE.finditer(md_text):
         span = match.span()
         if not any(span_inside(span, image_span) for image_span in image_spans):
-            all_matches.append((span, "table", match.group(1)))
+            page_num = get_page_number(span[0], page_markers)
+            all_matches.append((span, "table", match.group(1).strip(), page_num))

     # Sort matches by start position
     all_matches.sort(key=lambda x: x[0][0])

     parts = []
     last = 0

-    for (start, end), match_type, content in all_matches:
+    for (start, end), match_type, content, page_num in all_matches:
         # Add text segment before this match if there is any
         if start > last:
             text_segment = md_text[last:start]
             if text_segment.strip():  # Only add non-empty text segments
                 parts.append(("text", text_segment.strip()))

-        # Add the matched segment
-        parts.append((match_type, content.strip()))
+        # Add the matched segment with page number
+        parts.append((match_type, content, page_num))
         last = end

     # Add remaining text after the last match
     if last < len(md_text):
         remaining_text = md_text[last:]
         if remaining_text.strip():  # Only add non-empty text segments
             parts.append(("text", remaining_text.strip()))
-    return parts
+
+    return [MDElement(*p) for p in parts]

⚠️ Potential issue | 🟡 Minor

Docstring mismatch with actual return type.

The docstring states the function returns tuples, but Line 107 returns a list of MDElement instances. Update the docstring to reflect the actual return type.

 def split_md_elements(md_text: str) -> list[MDElement]:
     """
     Split markdown text into segments of text, tables, and images.
-    Returns a list of tuples:
-    - ('text', content) for text segments
-    - ('table', content, page_number) for tables
-    - ('image', content, page_number) for images
+    Returns a list of MDElement instances with type, content, and page_number.
     """
🤖 Prompt for AI Agents
In openrag/components/indexer/chunker/utils.py around lines 53 to 107, the
docstring incorrectly says the function returns tuples while the function
actually returns a list of MDElement instances; update the docstring to
accurately describe the return type (e.g., "Returns a list of MDElement objects
representing segments: MDElement(type='text'|'table'|'image', content,
page_number?)") and adjust the bullet list to describe the MDElement fields
rather than tuple shapes.

Comment on lines +191 to +253
def chunk_table(
    table_element: MDElement,
    chunk_size: int = 512,
    length_function: Optional[Callable[[str], int]] = None,
) -> list[MDElement]:
    txt = clean_markdown_table_spacing(table_element.content)
    header_lines, groups = parse_markdown_table(txt)

    # Convert header lines → text block
    header_text = "\n".join(header_lines)

    # Convert group lists → text blocks
    group_texts = ["\n".join(g) for g in groups]

    # Precompute token length
    header_ntoks = length_function(header_text)
    groups_ntoks = [length_function(g) for g in group_texts]

    subtables = []
    current_rows = [header_text]
    current_size = header_ntoks

    prev_last_row = None  # for overlap

    for group_txt, g_ntoks in zip(group_texts, groups_ntoks, strict=True):
        # If adding this group exceeds the chunk limit
        if current_size + g_ntoks > chunk_size:
            # ---- finalize current subtable ----
            subtables.append("\n".join(current_rows))

            # ---- start new subtable with OVERLAP ----
            current_rows = [header_text]  # always restart headers
            if prev_last_row:
                current_rows.append(prev_last_row)  # add overlapping row

            current_rows.append(group_txt)
            current_size = (
                header_ntoks
                + (length_function(prev_last_row) if prev_last_row else 0)
                + g_ntoks
            )

        else:
            # fits → just append normally
            current_rows.append(group_txt)
            current_size += g_ntoks

        # track last row for overlap
        prev_last_row = group_txt

    # finalize last subtable
    if current_rows:
        subtables.append("\n".join(current_rows))

    # wrap into MDElement list
    return [
        MDElement(
            type="table",
            content=subtable,
            page_number=table_element.page_number,
        )
        for subtable in subtables
    ]

⚠️ Potential issue | 🟠 Major

Missing None check before calling length_function.

If length_function is None, Line 206 will raise a TypeError. Either enforce the parameter as required or add a fallback.

 def chunk_table(
     table_element: MDElement,
     chunk_size: int = 512,
     length_function: Optional[Callable[[str], int]] = None,
 ) -> list[MDElement]:
+    if length_function is None:
+        length_function = len  # Fallback to character count
+
     txt = clean_markdown_table_spacing(table_element.content)
     header_lines, groups = parse_markdown_table(txt)
🤖 Prompt for AI Agents
In openrag/components/indexer/chunker/utils.py around lines 191-253, the code
calls length_function(...) without checking for None which will raise a
TypeError if length_function is not passed; fix by providing a default length
function at the top of chunk_table (e.g., default to Python len on strings or a
small wrapper that handles None/empty strings) and use that local function for
all subsequent calls, and also guard the call length_function(prev_last_row)
with a conditional or by ensuring prev_last_row is a string (e.g., pass "" when
None) so no None value is ever passed into the length function.
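
The fallback pattern can be shown on a deliberately simplified packer. This sketch is not `chunk_table`: it has no header repetition and no overlap row, and `pack_rows` is a hypothetical name. It only demonstrates resolving an optional `length_function` once, at the top, before any call.

```python
from typing import Callable, Optional


def pack_rows(
    rows: list[str],
    chunk_size: int = 512,
    length_function: Optional[Callable[[str], int]] = None,
) -> list[list[str]]:
    """Greedily group rows so each group stays within chunk_size."""
    if length_function is None:
        length_function = len  # fallback: count characters

    chunks: list[list[str]] = []
    current: list[str] = []
    size = 0
    for row in rows:
        n = length_function(row)
        # Start a new group when this row would overflow a non-empty one
        if current and size + n > chunk_size:
            chunks.append(current)
            current, size = [], 0
        current.append(row)
        size += n
    if current:
        chunks.append(current)
    return chunks


# Mock token counter in the spirit of the tests: roughly 4 characters per token.
def mock_length(text: str) -> int:
    return len(text) // 4
```

Because the default is resolved before the first use, every later call site can rely on `length_function` being callable.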

Comment on lines +1022 to +1037
def _gen_chunk_order_metadata(n: int = 20) -> list[dict]:
    # Use base timestamp + index to ensure uniqueness
    base_ts = int(time.time_ns())
    ids: list[int] = [base_ts + i for i in range(n)]
    L = []
    for i in range(n):
        prev_chunk_id = ids[i - 1] if i > 0 else None
        next_chunk_id = ids[i + 1] if i < n - 1 else None
        L.append(
            {
                "prev_section_id": prev_chunk_id,
                "section_id": ids[i],
                "next_section_id": next_chunk_id,
            }
        )
    return L

⚠️ Potential issue | 🟠 Major

Timestamp-based ID generation may collide under high concurrency.

Using time.time_ns() with sequential offsets could produce collisions if multiple processes call this function simultaneously. Consider using UUIDs or adding a random component.

+import uuid
+
 def _gen_chunk_order_metadata(n: int = 20) -> list[dict]:
-    # Use base timestamp + index to ensure uniqueness
-    base_ts = int(time.time_ns())
-    ids: list[int] = [base_ts + i for i in range(n)]
+    # Use UUID-based IDs to ensure uniqueness across processes
+    ids: list[int] = [uuid.uuid4().int >> 64 for _ in range(n)]  # 64-bit unique IDs
     L = []
     for i in range(n):

Alternatively, if integer IDs are required for Milvus compatibility, consider using a combination of process ID, timestamp, and random bits.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _gen_chunk_order_metadata(n: int = 20) -> list[dict]:
# Use base timestamp + index to ensure uniqueness
base_ts = int(time.time_ns())
ids: list[int] = [base_ts + i for i in range(n)]
L = []
for i in range(n):
prev_chunk_id = ids[i - 1] if i > 0 else None
next_chunk_id = ids[i + 1] if i < n - 1 else None
L.append(
{
"prev_section_id": prev_chunk_id,
"section_id": ids[i],
"next_section_id": next_chunk_id,
}
)
return L
import uuid
def _gen_chunk_order_metadata(n: int = 20) -> list[dict]:
# Use UUID-based IDs to ensure uniqueness across processes
ids: list[int] = [uuid.uuid4().int >> 64 for _ in range(n)] # 64-bit unique IDs
L = []
for i in range(n):
prev_chunk_id = ids[i - 1] if i > 0 else None
next_chunk_id = ids[i + 1] if i < n - 1 else None
L.append(
{
"prev_section_id": prev_chunk_id,
"section_id": ids[i],
"next_section_id": next_chunk_id,
}
)
return L
🤖 Prompt for AI Agents
In openrag/components/indexer/vectordb/vectordb.py around lines 1022 to 1037,
the current ID generation uses time.time_ns() with sequential offsets which can
collide across concurrent processes; replace that logic with a
collision-resistant approach — either generate section IDs as UUID strings
(e.g., uuid.uuid4().hex) and update the metadata keys/types accordingly, or if
integer IDs are strictly required, build a composite integer from time_ns plus
process id and random bits (e.g., combine time_ns << N | (pid << M) |
random_bits generated with secrets.randbits) to guarantee uniqueness across
processes; ensure imports (uuid or os and secrets) are added, update type hints
from list[int] to list[str] if using UUIDs, and run tests that write to Milvus
to confirm compatibility.
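
If integer IDs must be kept, one option is sketched below. It swaps in a simpler technique than the pid/timestamp composite described above: a random 62-bit base plus the chunk index. The assumption, to be verified against the actual schema, is that the ID field is a signed 64-bit integer, so values must stay below 2**63; note that `uuid.uuid4().int >> 64` can exceed that range because it yields an unsigned 64-bit value.

```python
import secrets


def gen_chunk_order_metadata(n: int = 20) -> list[dict]:
    """Build prev/current/next section IDs for n consecutive chunks."""
    # Random 62-bit base: base + i stays below 2**63 for any realistic n,
    # and two batches only collide if their bases land within n of each other.
    base = secrets.randbits(62)
    ids = [base + i for i in range(n)]
    return [
        {
            "prev_section_id": ids[i - 1] if i > 0 else None,
            "section_id": ids[i],
            "next_section_id": ids[i + 1] if i < n - 1 else None,
        }
        for i in range(n)
    ]
```

IDs remain consecutive within a batch, which keeps the prev/next linkage easy to inspect while removing the dependence on the clock.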

Comment on lines +160 to +162
def detect_language(text: str):
    outputs = lang_detector.detect(text, k=1)
    return outputs[0].get("lang")

⚠️ Potential issue | 🟡 Minor

Missing error handling in detect_language.

If lang_detector.detect returns an empty list or fails, accessing outputs[0] will raise an IndexError. Add defensive handling.

 def detect_language(text: str):
     outputs = lang_detector.detect(text, k=1)
-    return outputs[0].get("lang")
+    if outputs:
+        return outputs[0].get("lang")
+    return "en"  # Default fallback
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def detect_language(text: str):
outputs = lang_detector.detect(text, k=1)
return outputs[0].get("lang")
def detect_language(text: str):
outputs = lang_detector.detect(text, k=1)
if outputs:
return outputs[0].get("lang")
return "en" # Default fallback
🤖 Prompt for AI Agents
In openrag/components/utils.py around lines 160 to 162, detect_language
currently assumes lang_detector.detect returns a non-empty list and directly
indexes outputs[0], which can raise IndexError or propagate detection errors;
update the function to defensively handle empty or invalid outputs and
exceptions by wrapping the detection call in a try/except, check that outputs is
a non-empty list and that outputs[0] contains a "lang" key before returning it,
and on failure return a safe default (e.g., None or "unknown") and optionally
log the error for diagnostics.
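
A defensive variant along those lines might look like this. It is a sketch under assumptions: the detector is passed in as a parameter so the example runs without the real library, `StubDetector` is a hypothetical stand-in that only mimics the `detect(text, k=1)` call shape seen in the snippet above, and `"en"` is used as the default to match the suggested diff.

```python
import logging

logger = logging.getLogger(__name__)


def detect_language(text: str, detector, default: str = "en") -> str:
    """Return the top detected language code, or `default` on any failure."""
    try:
        outputs = detector.detect(text, k=1)
    except Exception:
        logger.exception("Language detection failed; using default %r", default)
        return default
    if not outputs:
        return default  # empty result list
    # Also fall back when the "lang" key is missing or empty
    return outputs[0].get("lang") or default


class StubDetector:
    """Hypothetical stand-in for the real detector object."""

    def __init__(self, outputs):
        self._outputs = outputs

    def detect(self, text: str, k: int = 1):
        return self._outputs


lang_ok = detect_language("bonjour", StubDetector([{"lang": "fr"}]))
lang_empty = detect_language("", StubDetector([]))
```

Returning a fixed default keeps callers simple; returning `None` instead would force each caller to handle the unknown case explicitly.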

Comment on lines 47 to +49
"langdetect>=1.0.9",
"alembic>=1.17.0",
"langdetect>=1.0.9",
"fast-langdetect>=1.0.0",

⚠️ Potential issue | 🟠 Major

Remove the duplicate language detection dependency.

Both langdetect>=1.0.9 (line 47) and fast-langdetect>=1.0.0 (line 49) are present in the dependencies. Since the PR migrates to fast-langdetect, the original langdetect dependency on line 47 should be removed to avoid conflicts.

Apply this diff to remove the duplicate:

     "html-to-markdown>=2.4.0",
-    "langdetect>=1.0.9",
     "alembic>=1.17.0",
     "fast-langdetect>=1.0.0",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"langdetect>=1.0.9",
"alembic>=1.17.0",
"langdetect>=1.0.9",
"fast-langdetect>=1.0.0",
"html-to-markdown>=2.4.0",
"alembic>=1.17.0",
"fast-langdetect>=1.0.0",
🤖 Prompt for AI Agents
In pyproject.toml around lines 47 to 49, remove the duplicate dependency entry
"langdetect>=1.0.9" (line 47) since the project is migrating to
"fast-langdetect>=1.0.0"; update the dependencies block so only fast-langdetect
remains and ensure punctuation/commas in the surrounding list remain valid after
deletion.

@coderabbitai

coderabbitai bot commented Dec 17, 2025

✅ Actions performed

Reviews paused.
