LCORE-347: Implement referenced documents support on /query endpoint #572
Walkthrough
Adds a ReferencedDocument model and a referenced_documents field to QueryResponse. Extends retrieve_response to return referenced documents. Implements helpers to parse document metadata from text/tool responses. Updates the query endpoint to propagate referenced_documents into responses. Adjusts tests to the new return shape and adds unit tests for the parsing logic.
Changes
Sequence Diagram(s)
sequenceDiagram
participant C as Client
participant Q as query_endpoint_handler
participant R as retrieve_response
participant P as parse_referenced_documents
participant M as Models (QueryResponse)
C->>Q: POST /query (payload)
Q->>R: retrieve_response(...)
R-->>Q: (turn_summary, conversation_id, referenced_documents)
Q->>P: parse_referenced_documents(turn/agent responses)
P-->>Q: aggregated referenced_documents
Q->>M: Build QueryResponse(..., referenced_documents)
M-->>Q: QueryResponse
Q-->>C: 200 OK (QueryResponse)
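The response shape at the end of this flow can be sketched roughly as follows. This is a minimal illustration that uses standard-library dataclasses as stand-ins for the PR's Pydantic models. The names ReferencedDocument, doc_url, doc_title and referenced_documents come from the PR; the `conversation_id` and `response` fields and the `build_response` helper are assumptions made only for this example.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ReferencedDocument:
    """Stand-in for the PR's Pydantic model: one document cited by the RAG tool."""

    doc_url: str
    doc_title: str


@dataclass
class QueryResponse:
    """Stand-in response; `conversation_id` and `response` are assumed fields."""

    conversation_id: str
    response: str
    # default_factory gives every instance its own empty list,
    # so responses never share one mutable default.
    referenced_documents: list[ReferencedDocument] = field(default_factory=list)


def build_response(conversation_id: str, answer: str, docs) -> QueryResponse:
    """Hypothetical helper: copy the parsed documents into the response."""
    return QueryResponse(conversation_id, answer, list(docs))
```

A response built without documents carries an empty list rather than a missing key, which keeps the JSON shape stable for clients.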
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 2
🧹 Nitpick comments (3)
src/app/endpoints/query.py (3)
439-443: Standardize metadata field names across the system.
The code expects "docs_url" and "title" fields. Ensure these field names are consistent with the actual RAG provider output format. Consider defining constants for these field names.
+# Add at module level
+METADATA_URL_FIELD = "docs_url"
+METADATA_TITLE_FIELD = "title"
+
 def parse_metadata_from_text_item(
     text_item: TextContentItem,
 ) -> Optional[ReferencedDocument]:
     # ... existing code ...
             data = ast.literal_eval(block)
-            url = data.get("docs_url")
-            title = data.get("title")
+            url = data.get(METADATA_URL_FIELD)
+            title = data.get(METADATA_TITLE_FIELD)
             if url and title:
3-9: Keep ast: it's used to parse 'Metadata' blocks; prefer JSON-first with a fallback.
query.py (~line 439) and streaming_query.py (~line 496) call ast.literal_eval on regex-captured Metadata blocks. Unit tests include both Python-dict (single-quoted) and JSON-style examples, so replacing ast.literal_eval with json.loads() alone would break Python-literal cases; implement robust parsing (try json.loads(block) and fall back to ast.literal_eval(block)) or validate/normalize the metadata format upstream.
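The JSON-first approach with a literal fallback described above could look roughly like this. It is a sketch, not the PR's code: the helper name `parse_metadata_block` is hypothetical, and it assumes a block counts as valid only when it parses to a dict.

```python
import ast
import json
from typing import Any, Optional


def parse_metadata_block(block: str) -> Optional[dict[str, Any]]:
    """Parse a metadata block as JSON first, then as a Python literal."""
    try:
        data = json.loads(block)
    except json.JSONDecodeError:
        # Single-quoted Python dict literals are not valid JSON,
        # so fall back to the safe literal evaluator.
        try:
            data = ast.literal_eval(block)
        except (ValueError, SyntaxError, TypeError):
            return None
    # Reject lists, strings, numbers: only a mapping carries the fields.
    return data if isinstance(data, dict) else None
```

Both `{"docs_url": "...", "title": "..."}` and `{'docs_url': '...', 'title': '...'}` parse to the same dict under this scheme, which is the behavior the existing unit tests rely on.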
434-436: Consider security implications of the regex pattern (ReDoS risk).
- parse_metadata_from_text_item (src/app/endpoints/query.py:434–436) uses r"Metadata:\s*({.*?})(?:\n|$)" with re.DOTALL and no input-size checks — this can be abused for catastrophic backtracking. Replace the regex with a linear, balanced-brace extractor (scan for the opening '{' and find the matching '}' while handling quotes/escapes) and then parse with ast.literal_eval/json.loads, or enforce a strict input-length cap before running the regex.
- Apply the same mitigation to METADATA_PATTERN in src/app/endpoints/streaming_query.py (around line 97): r"\nMetadata: ({.+})\n".
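A linear, balanced-brace extractor of the kind suggested above might be sketched as follows. All names here (`iter_metadata_blocks`, `_find_matching_brace`, `MAX_TEXT_LENGTH`) are hypothetical, and the length cap is an arbitrary placeholder. The sketch also assumes that scanning may stop at the first unbalanced block, which is what keeps the overall scan single-pass.

```python
from typing import Iterator, Optional

MAX_TEXT_LENGTH = 100_000  # hypothetical cap; tune for real payloads
MARKER = "Metadata:"


def _find_matching_brace(text: str, start: int) -> int:
    """Return the index of the '}' matching the '{' at text[start], or -1."""
    depth = 0
    quote: Optional[str] = None  # the quote character we are inside, if any
    i = start
    while i < len(text):
        ch = text[i]
        if quote is not None:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == quote:
                quote = None
        elif ch in ("'", '"'):
            quote = ch
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    return -1


def iter_metadata_blocks(text: str) -> Iterator[str]:
    """Yield each brace-balanced block that follows a 'Metadata:' marker."""
    if len(text) > MAX_TEXT_LENGTH:
        return
    pos = 0
    while True:
        marker = text.find(MARKER, pos)
        if marker == -1:
            return
        start = marker + len(MARKER)
        # Skip whitespace between the marker and the opening brace.
        while start < len(text) and text[start].isspace():
            start += 1
        if start >= len(text) or text[start] != "{":
            pos = start
            continue
        end = _find_matching_brace(text, start)
        if end == -1:
            return  # unbalanced block: stop rather than rescan the tail
        yield text[start:end + 1]
        pos = end + 1
```

Each yielded block can then be handed to ast.literal_eval or json.loads. Braces inside quoted strings are ignored, so a title such as "a } b" does not end the block early.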
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/app/endpoints/query.py (9 hunks)
src/models/responses.py (4 hunks)
tests/unit/app/endpoints/test_query.py (21 hunks)
🧰 Additional context used
📓 Path-based instructions (9)
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use absolute imports for internal modules (e.g., from auth import get_auth_dependency)
Files:
src/models/responses.py
src/app/endpoints/query.py
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: All modules start with descriptive module-level docstrings explaining purpose
Use logger = logging.getLogger(__name__) for module logging after import logging
Define type aliases at module level for clarity
All functions require docstrings with brief descriptions
Provide complete type annotations for all function parameters and return types
Use typing_extensions.Self in model validators where appropriate
Use modern union syntax (str | int) and Optional[T] or T | None consistently
Function names use snake_case with descriptive, action-oriented prefixes (get_, validate_, check_)
Avoid in-place parameter modification; return new data structures instead of mutating arguments
Use appropriate logging levels: debug, info, warning, error with clear messages
All classes require descriptive docstrings explaining purpose
Class names use PascalCase with conventional suffixes (Configuration, Error/Exception, Resolver, Interface)
Abstract base classes should use abc.ABC and @abstractmethod for interfaces
Provide complete type annotations for all class attributes
Follow Google Python docstring style for modules, classes, and functions, including Args, Returns, Raises, Attributes sections as needed
Files:
src/models/responses.py
tests/unit/app/endpoints/test_query.py
src/app/endpoints/query.py
src/{models/**/*.py,configuration.py}
📄 CodeRabbit inference engine (CLAUDE.md)
src/{models/**/*.py,configuration.py}: Use @field_validator and @model_validator for custom validation in Pydantic models
Use precise type hints in configuration (e.g., Optional[FilePath], PositiveInt, SecretStr)
Files:
src/models/responses.py
src/models/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/models/**/*.py: Pydantic models: use BaseModel for data models and extend ConfigurationBase for configuration
Use @model_validator and @field_validator for Pydantic model validation
Files:
src/models/responses.py
tests/{unit,integration}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/{unit,integration}/**/*.py: Use pytest for all unit and integration tests
Do not use unittest in tests; pytest is the standard
Files:
tests/unit/app/endpoints/test_query.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Use pytest-mock to create AsyncMock objects for async interactions in tests
Use the shared auth mock constant: MOCK_AUTH = ("mock_user_id", "mock_username", False, "mock_token") in tests
Files:
tests/unit/app/endpoints/test_query.py
src/app/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use standard FastAPI imports (from fastapi import APIRouter, HTTPException, Request, status, Depends) in FastAPI app code
Files:
src/app/endpoints/query.py
src/{app/**/*.py,client.py}
📄 CodeRabbit inference engine (CLAUDE.md)
Use async def for I/O-bound operations and external API calls
Files:
src/app/endpoints/query.py
src/app/endpoints/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
In API endpoints, raise FastAPI HTTPException with appropriate status codes for error handling
Files:
src/app/endpoints/query.py
🧬 Code graph analysis (1)
src/app/endpoints/query.py (6)
src/metrics/utils.py (1)
update_llm_token_count_from_turn (60-77)
src/models/requests.py (2)
Attachment (15-69)
QueryRequest (72-222)
src/models/responses.py (2)
QueryResponse (59-112)
ReferencedDocument (37-47)
src/utils/endpoints.py (4)
check_configuration_loaded (63-69)
get_system_prompt (72-112)
validate_conversation_ownership (40-60)
validate_model_provider_override (115-135)
src/utils/mcp_headers.py (2)
handle_mcp_headers_with_toolgroups (56-90)
mcp_headers_dependency (15-26)
src/utils/types.py (1)
TurnSummary (59-78)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-pr
- GitHub Check: e2e_tests
🔇 Additional comments (12)
src/models/responses.py (2)
37-48: LGTM! Well-structured data model.
The ReferencedDocument model follows Pydantic best practices with clear type annotations and descriptive fields.
81-93: Excellent documentation and examples.
The field definition properly uses default_factory=list and includes comprehensive examples in both the field definition and model config.
tests/unit/app/endpoints/test_query.py (6)
177-181: Good test coverage adaptation.
Tests properly handle the updated return signature of retrieve_response with the third element for referenced documents.
867-881: Well-tested metadata parsing with valid input.
Good test coverage for the happy path of metadata parsing from text items.
883-907: Comprehensive error handling tests.
Good coverage of edge cases including missing fields and malformed URLs.
962-982: Correct filter logic for non-RAG tools.
The test properly ensures that only knowledge_search tool responses are processed for document extraction.
933-960: SAMPLE_KNOWLEDGE_SEARCH_RESULTS verified: contains docs_url and title fields.
Found in tests/unit/app/endpoints/test_streaming_query.py; metadata blocks include 'docs_url' and 'title' (e.g. 'https://example.com/doc1', 'Doc1').
909-931: Validate test data format for metadata parsing. OK: uses "docs_url".
query.py and streaming_query.py expect "docs_url", so the test's metadata key is correct; no change required.
src/app/endpoints/query.py (4)
63-69: Documentation examples correctly updated.
The query response documentation properly includes the new referenced_documents field with appropriate example data.
240-248: Correct integration of referenced documents in endpoint handler.
The handler properly unpacks the three-element tuple and includes referenced documents in the response.
Also applies to: 275-279
608-609: Good placement of document parsing.
The referenced documents are parsed after the response is received and before returning, which is the appropriate location in the flow.
Also applies to: 628-628
513-515: Documentation correctly updated for return type.
The docstring properly reflects the updated return type as a three-element tuple.
def parse_metadata_from_text_item(
    text_item: TextContentItem,
) -> Optional[ReferencedDocument]:
    """
    Parse a single TextContentItem to extract referenced documents.
    Args:
        text_item (TextContentItem): The TextContentItem containing metadata.
    Returns:
        ReferencedDocument: A ReferencedDocument object containing 'doc_url' and 'doc_title'
        representing the referenced documents found in the metadata.
    """
    docs: list[ReferencedDocument] = []
    if not isinstance(text_item, TextContentItem):
        return docs

    metadata_blocks = re.findall(
        r"Metadata:\s*({.*?})(?:\n|$)", text_item.text, re.DOTALL
    )
    for block in metadata_blocks:
        try:
            data = ast.literal_eval(block)
            url = data.get("docs_url")
            title = data.get("title")
            if url and title:
                return ReferencedDocument(doc_url=url, doc_title=title)
            logger.debug("Invalid metadata block (missing url or title): %s", block)
        except (ValueError, SyntaxError) as e:
            logger.debug("Failed to parse metadata block: %s | Error: %s", block, e)
    return None
Fix return type inconsistency and improve error handling.
The function has several issues:
- The return type annotation says Optional[ReferencedDocument], but line 432 returns an empty list ([]).
- The variable docs is declared but never used.
- Missing validation of the URL format could cause issues downstream.
Apply this fix:
 def parse_metadata_from_text_item(
     text_item: TextContentItem,
 ) -> Optional[ReferencedDocument]:
     """
     Parse a single TextContentItem to extract referenced documents.
     Args:
         text_item (TextContentItem): The TextContentItem containing metadata.
     Returns:
-        ReferencedDocument: A ReferencedDocument object containing 'doc_url' and 'doc_title'
+        Optional[ReferencedDocument]: A ReferencedDocument object containing 'doc_url' and 'doc_title'
         representing the referenced documents found in the metadata.
     """
-    docs: list[ReferencedDocument] = []
     if not isinstance(text_item, TextContentItem):
-        return docs
+        return None
     metadata_blocks = re.findall(
         r"Metadata:\s*({.*?})(?:\n|$)", text_item.text, re.DOTALL
     )
     for block in metadata_blocks:
         try:
             data = ast.literal_eval(block)
             url = data.get("docs_url")
             title = data.get("title")
             if url and title:
                 return ReferencedDocument(doc_url=url, doc_title=title)
             logger.debug("Invalid metadata block (missing url or title): %s", block)
         except (ValueError, SyntaxError) as e:
             logger.debug("Failed to parse metadata block: %s | Error: %s", block, e)
     return None
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def parse_metadata_from_text_item(
    text_item: TextContentItem,
) -> Optional[ReferencedDocument]:
    """
    Parse a single TextContentItem to extract referenced documents.
    Args:
        text_item (TextContentItem): The TextContentItem containing metadata.
    Returns:
        Optional[ReferencedDocument]: A ReferencedDocument object containing 'doc_url' and 'doc_title'
        representing the referenced documents found in the metadata.
    """
    if not isinstance(text_item, TextContentItem):
        return None
    metadata_blocks = re.findall(
        r"Metadata:\s*({.*?})(?:\n|$)", text_item.text, re.DOTALL
    )
    for block in metadata_blocks:
        try:
            data = ast.literal_eval(block)
            url = data.get("docs_url")
            title = data.get("title")
            if url and title:
                return ReferencedDocument(doc_url=url, doc_title=title)
            logger.debug("Invalid metadata block (missing url or title): %s", block)
        except (ValueError, SyntaxError) as e:
            logger.debug("Failed to parse metadata block: %s | Error: %s", block, e)
    return None
🤖 Prompt for AI Agents
In src/app/endpoints/query.py around lines 417 to 448, the function
parse_metadata_from_text_item has inconsistent returns (returns an empty list
but annotated Optional[ReferencedDocument]), an unused docs list, and lacks URL
validation; fix it by removing the unused docs list, ensure the function only
returns a ReferencedDocument instance or None (never a list), validate the
extracted url (e.g., use urllib.parse.urlparse to check scheme/netloc or a small
regex) before constructing ReferencedDocument, and improve error logging to
include exception details (use logger.debug(..., exc_info=True)) while keeping
the try/except around ast.literal_eval.
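The URL check mentioned in this prompt is not part of the suggested diff. One way it might be done with urllib.parse.urlparse is sketched below. The helper name `is_valid_http_url` is hypothetical, and restricting the scheme to http/https is an assumption about what the service should accept.

```python
from urllib.parse import urlparse


def is_valid_http_url(url: object) -> bool:
    """Return True only for strings with an http(s) scheme and a host."""
    if not isinstance(url, str):
        return False
    try:
        parsed = urlparse(url.strip())
    except ValueError:
        # urlparse can raise on malformed input such as a broken IPv6 host.
        return False
    # Assumption: only http and https links are meaningful as doc_url.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

In parse_metadata_from_text_item, such a check would sit next to the existing `if url and title:` test, so a block with a malformed URL is logged and skipped rather than returned.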
def parse_referenced_documents(response: Turn) -> list[ReferencedDocument]:
    """
    Parse referenced documents from Turn.
    Iterate through the steps of a response and collect all referenced
    documents from rag tool responses.
    Args:
        response(Turn): The response object from the agent turn.
    Returns:
        list[ReferencedDocument]: A list of ReferencedDocument, each with 'doc_url' and 'doc_title'
        representing all referenced documents found in the response.
    """
    docs = []
    for step in response.steps:
        if not isinstance(step, ToolExecutionStep):
            continue
        for tool_response in step.tool_responses:
            # TODO(are-ces): use constant instead
            if tool_response.tool_name != "knowledge_search":
                continue
            for text_item in tool_response.content:
                if not isinstance(text_item, TextContentItem):
                    continue
                doc = parse_metadata_from_text_item(text_item)
                if doc:
                    docs.append(doc)
    return docs
🛠️ Refactor suggestion
Add TODO constant and improve function documentation.
- The TODO comment at line 469 should reference a constant for the tool name
- The docstring's Returns section should be more accurate about when an empty list vs populated list is returned
+# Add at module level with other constants
+KNOWLEDGE_SEARCH_TOOL_NAME = "knowledge_search"
+
 def parse_referenced_documents(response: Turn) -> list[ReferencedDocument]:
     """
     Parse referenced documents from Turn.
     Iterate through the steps of a response and collect all referenced
     documents from rag tool responses.
     Args:
         response(Turn): The response object from the agent turn.
     Returns:
-        list[ReferencedDocument]: A list of ReferencedDocument, each with 'doc_url' and 'doc_title'
-        representing all referenced documents found in the response.
+        list[ReferencedDocument]: A list of ReferencedDocument objects, each with 'doc_url' and 'doc_title'.
+        Returns an empty list if no referenced documents are found or if the response contains no tool execution steps.
     """
     docs = []
     for step in response.steps:
         if not isinstance(step, ToolExecutionStep):
             continue
         for tool_response in step.tool_responses:
-            # TODO(are-ces): use constant instead
-            if tool_response.tool_name != "knowledge_search":
+            if tool_response.tool_name != KNOWLEDGE_SEARCH_TOOL_NAME:
                 continue
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
KNOWLEDGE_SEARCH_TOOL_NAME = "knowledge_search"

def parse_referenced_documents(response: Turn) -> list[ReferencedDocument]:
    """
    Parse referenced documents from Turn.
    Iterate through the steps of a response and collect all referenced
    documents from rag tool responses.
    Args:
        response(Turn): The response object from the agent turn.
    Returns:
        list[ReferencedDocument]: A list of ReferencedDocument objects, each with 'doc_url' and 'doc_title'.
        Returns an empty list if no referenced documents are found or if the response contains no tool execution steps.
    """
    docs = []
    for step in response.steps:
        if not isinstance(step, ToolExecutionStep):
            continue
        for tool_response in step.tool_responses:
            if tool_response.tool_name != KNOWLEDGE_SEARCH_TOOL_NAME:
                continue
            for text_item in tool_response.content:
                if not isinstance(text_item, TextContentItem):
                    continue
                doc = parse_metadata_from_text_item(text_item)
                if doc:
                    docs.append(doc)
    return docs
🤖 Prompt for AI Agents
In src/app/endpoints/query.py around lines 450 to 479, replace the inline TODO
and hard-coded tool name check with a reference to a constant (e.g., use
KNOWLEDGE_SEARCH_TOOL_NAME instead of "knowledge_search" and add/import that
constant at the top of the module), and update the docstring Returns section to
explicitly state that the function returns an empty list when no referenced
documents are found and a list of ReferencedDocument objects (each with
'doc_url' and 'doc_title') when they are found.
tisnik left a comment
LGTM
Description
Added referenced docs (incl. doc_url and doc_title) to /query endpoint, reaching parity with /streaming_query endpoint.
Type of change
Related Tickets & Documents
Checklist before requesting a review
Testing
Set up a RAG database with LCS, post a request to the /query endpoint, and check the returned JSON. Confirm that each reference matches its source document.
Summary by CodeRabbit
New Features
Tests