LCORE-347: Implement referenced documents support on /query endpoint #572

are-ces · 2025-09-23T08:12:26Z

Description

Added referenced documents (including doc_url and doc_title) to the /query endpoint response, reaching parity with the /streaming_query endpoint.

Type of change

Related Tickets & Documents

Related Issue # LCORE-347
Closes # LCORE-347

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Testing

Please provide detailed steps to perform tests related to this code change.

Set up a RAG database with LCS, post a request to the /query endpoint, and check the returned JSON. Confirm that each returned reference matches its source document.
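The JSON check can be scripted. The following is a minimal sketch and not part of this PR: `check_referenced_documents` is a hypothetical helper, and it assumes only that the response body carries a `referenced_documents` list whose entries have `doc_url` and `doc_title`. It does not verify that a reference matches its source document; that step stays manual.

```python
import json
from urllib.parse import urlparse


def check_referenced_documents(body: str) -> list[str]:
    """Return a list of problems found in a /query response body (empty if none)."""
    payload = json.loads(body)
    docs = payload.get("referenced_documents")
    if not isinstance(docs, list):
        return ["referenced_documents is missing or not a list"]

    problems = []
    for i, doc in enumerate(docs):
        if not isinstance(doc, dict):
            problems.append(f"entry {i}: not an object")
            continue
        url = doc.get("doc_url")
        title = doc.get("doc_title")
        # An absolute URL needs both a scheme and a host.
        parsed = urlparse(url) if isinstance(url, str) else None
        if parsed is None or not (parsed.scheme and parsed.netloc):
            problems.append(f"entry {i}: doc_url is not an absolute URL")
        if not isinstance(title, str) or not title.strip():
            problems.append(f"entry {i}: doc_title is empty")
    return problems
```

Pass the raw response text from the /query request to the helper; an empty list means every entry has a well-formed URL and a non-empty title.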

Summary by CodeRabbit

New Features
- Responses now include a “Referenced documents” section with source links and titles, enabling users to see which documents informed the answer.
- Improved extraction of document metadata from tool outputs for more accurate and complete source listings.
Tests
- Expanded unit tests to validate metadata parsing and referenced document aggregation across varied scenarios.
- Updated tests to reflect the enhanced response format and ensure reliability when no tools or references are present.

coderabbitai · 2025-09-23T08:12:32Z

Walkthrough

Adds ReferencedDocument model and a referenced_documents field to QueryResponse. Extends retrieve_response to return referenced documents. Implements helpers to parse document metadata from text/tool responses. Updates query endpoint to propagate referenced_documents into responses. Adjusts tests to new return shape and adds unit tests for parsing logic.

Changes

Cohort / File(s)	Summary
Endpoint logic and parsing `src/app/endpoints/query.py`	Extended retrieve_response to return (TurnSummary, str, list[ReferencedDocument]). Added parse_metadata_from_text_item and parse_referenced_documents. Updated handler to unpack and include referenced_documents in QueryResponse.
Response models `src/models/responses.py`	Introduced ReferencedDocument (doc_url: AnyUrl, doc_title: str). Added referenced_documents: list[ReferencedDocument] to QueryResponse with defaults and examples; updated imports.
Unit tests `tests/unit/app/endpoints/test_query.py`	Adjusted to new retrieve_response return tuple. Added tests for parse_metadata_from_text_item and parse_referenced_documents, including edge cases and non-RAG tools. Updated existing tests to handle referenced_documents propagation.
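The new return shape and its propagation into the response can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: dataclasses stand in for the Pydantic models (the PR types `doc_url` as `AnyUrl`), a plain string stands in for `TurnSummary`, the `response` field name on `QueryResponse` is assumed, and `retrieve_response_stub` and `handle_query` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ReferencedDocument:
    # Stand-in for the Pydantic model; the PR uses AnyUrl for doc_url.
    doc_url: str
    doc_title: str


@dataclass
class QueryResponse:
    conversation_id: str
    response: str  # assumed field name, for illustration only
    # Defaults to an empty list so responses without references stay valid.
    referenced_documents: list[ReferencedDocument] = field(default_factory=list)


def retrieve_response_stub(query: str) -> tuple[str, str, list[ReferencedDocument]]:
    """Hypothetical stand-in returning (summary, conversation_id, referenced documents)."""
    docs = [ReferencedDocument("https://example.com/doc", "Example doc")]
    return f"answer to: {query}", "conv-123", docs


def handle_query(query: str) -> QueryResponse:
    # Unpack the three-element tuple and pass the documents through unchanged.
    summary, conversation_id, referenced_documents = retrieve_response_stub(query)
    return QueryResponse(
        conversation_id=conversation_id,
        response=summary,
        referenced_documents=referenced_documents,
    )
```

Existing callers that unpack two values from the retrieval function would fail after this change, which is why the unit tests were adjusted to the three-element tuple.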

Sequence Diagram(s)

sequenceDiagram
    participant C as Client
    participant Q as query_endpoint_handler
    participant R as retrieve_response
    participant P as parse_referenced_documents
    participant M as Models (QueryResponse)

    C->>Q: POST /query (payload)
    Q->>R: retrieve_response(...)
    R-->>Q: (turn_summary, conversation_id, referenced_documents)
    Q->>P: parse_referenced_documents(turn/agent responses)
    P-->>Q: aggregated referenced_documents
    Q->>M: Build QueryResponse(..., referenced_documents)
    M-->>Q: QueryResponse
    Q-->>C: 200 OK (QueryResponse)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I nibble links and titles bright,
A trail of docs in moonlit byte—
Hop, parse, and gather as I go,
Footnotes blooming in the snow.
Now responses carry what we found,
Little carrots of context bound.
Thump! The query’s richer sound.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title clearly and concisely summarizes the primary change — adding referenced documents support to the /query endpoint — and directly matches the PR diff and stated objectives. It is specific, readable, and highlights the most important developer-facing change without extraneous detail.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (3)

src/app/endpoints/query.py (3)
439-443: Standardize metadata field names across the system.

The code expects "docs_url" and "title" fields. Ensure these field names are consistent with the actual RAG provider output format. Consider defining constants for these field names.
+# Add at module level
+METADATA_URL_FIELD = "docs_url"
+METADATA_TITLE_FIELD = "title"
+
 def parse_metadata_from_text_item(
     text_item: TextContentItem,
 ) -> Optional[ReferencedDocument]:
     # ... existing code ...
             data = ast.literal_eval(block)
-            url = data.get("docs_url")
-            title = data.get("title")
+            url = data.get(METADATA_URL_FIELD)
+            title = data.get(METADATA_TITLE_FIELD)
             if url and title:
3-9: Keep ast — it's used to parse 'Metadata' blocks; prefer JSON-first with a fallback.

query.py (~line 439) and streaming_query.py (~line 496) call ast.literal_eval on regex-captured Metadata blocks. Unit tests include both Python-dict (single-quoted) and JSON-style examples, so replacing ast.literal_eval with json.loads() alone would break Python-literal cases; implement robust parsing (try json.loads(block) and fall back to ast.literal_eval(block)) or validate/normalize the metadata format upstream.
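A JSON-first parser with a Python-literal fallback might look like the sketch below. `parse_metadata_block` is a hypothetical helper name, not a function in the PR; it takes the already-extracted block string and returns a dict, or `None` when neither parser accepts the input.

```python
import ast
import json
from typing import Any, Optional


def parse_metadata_block(block: str) -> Optional[dict[str, Any]]:
    """Parse a metadata block as JSON first, then as a Python literal."""
    try:
        data = json.loads(block)
    except json.JSONDecodeError:
        try:
            # literal_eval only evaluates literals, so it handles
            # single-quoted dicts without executing arbitrary code.
            data = ast.literal_eval(block)
        except (ValueError, SyntaxError, TypeError):
            return None
    # Reject valid literals that are not dicts (lists, numbers, strings).
    return data if isinstance(data, dict) else None
```

Trying JSON first keeps the common case on the stricter parser, while the fallback preserves the single-quoted cases the unit tests cover.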

434-436: Consider security implications of the regex pattern (ReDoS risk).

parse_metadata_from_text_item (src/app/endpoints/query.py:434–436) uses r"Metadata:\s*({.*?})(?:\n|$)" with re.DOTALL and no input-size checks — this can be abused for catastrophic backtracking. Replace the regex with a linear, balanced-brace extractor (scan for the opening '{' and find the matching '}' while handling quotes/escapes) and then parse with ast.literal_eval/json.loads, or enforce a strict input-length cap before running the regex.

Apply the same mitigation to METADATA_PATTERN in src/app/endpoints/streaming_query.py (around line 97): r"\nMetadata: ({.+})\n".
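A linear balanced-brace extractor along these lines could replace the regex. This is a sketch under assumptions: `extract_metadata_block` and the `max_len` cap are hypothetical, and it inspects only the first `Metadata:` marker in the text, which differs from a regex search that would keep trying later positions.

```python
from typing import Optional

MARKER = "Metadata:"


def extract_metadata_block(text: str, max_len: int = 100_000) -> Optional[str]:
    """Return the first balanced {...} block after 'Metadata:', or None."""
    if len(text) > max_len:  # strict input-length cap
        return None
    marker_at = text.find(MARKER)
    if marker_at == -1:
        return None
    start = marker_at + len(MARKER)
    # Skip whitespace between the marker and the opening brace.
    while start < len(text) and text[start].isspace():
        start += 1
    if start >= len(text) or text[start] != "{":
        return None

    depth = 0
    quote = None      # active quote character, if inside a string
    escaped = False   # True when the previous character was a backslash
    for i in range(start, len(text)):
        ch = text[i]
        if quote is not None:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == quote:
                quote = None
        elif ch in ("'", '"'):
            quote = ch
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    return None  # unbalanced braces
```

Each character is visited at most once, so the run time is linear in the input length, and braces inside quoted strings do not affect the depth count. The returned string can then be handed to `json.loads` or `ast.literal_eval`.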

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 3dba25f and 2d1567a.

📒 Files selected for processing (3)

src/app/endpoints/query.py (9 hunks)
src/models/responses.py (4 hunks)
tests/unit/app/endpoints/test_query.py (21 hunks)

🧰 Additional context used

📓 Path-based instructions (9)

src/**/*.py