Add DefaultMetadataProcessor and generate_embeddings.py CLI script #104

tisnik merged 2 commits into lightspeed-core:main
Conversation
Walkthrough

Adds a concrete DefaultMetadataProcessor (providing a basename-based url_function), re-exports MetadataProcessor and DefaultMetadataProcessor at the package top level, and introduces a CLI script to generate embeddings, plus tests covering the new processor and script behavior.

Sequence Diagram(s): (mermaid diagram not captured)

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks: ✅ 3 passed
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/test_metadata_processor.py (1)
168-185: Add a collision regression test for duplicate basenames.

These tests validate basename extraction, but they don't cover two different paths sharing the same filename (and same title). Given downstream grouping by (docs_url, title), add a regression test to ensure distinct documents don't collapse into one group.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_metadata_processor.py` around lines 168 - 185, Add a regression test that ensures two different source paths with the same basename and same title are treated as distinct documents and do not collapse when grouping by (docs_url, title); in tests/test_metadata_processor.py create a new test using DefaultMetadataProcessor and its url_function to produce identical basenames for two different input paths (e.g., "/path/one/doc.md" and "/other/path/doc.md"), then simulate or call the code that groups by (docs_url, title) and assert that the group contains two separate entries (or that the grouping key preserves distinct source paths), referencing DefaultMetadataProcessor and url_function to locate where to generate the inputs and validate grouping behavior.
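Such a regression test can be sketched with a stand-in for `DefaultMetadataProcessor.url_function` (the real class lives in `lightspeed_rag_content.metadata_processor`; the grouping helper below is a hypothetical stand-in for the downstream `(docs_url, title)` grouping the comment refers to):

```python
from collections import defaultdict
from os.path import basename

def group_by_docs_url_and_title(docs, url_function):
    """Group (path, title) pairs by (docs_url, title), mirroring the
    downstream grouping the review comment refers to."""
    groups = defaultdict(list)
    for path, title in docs:
        groups[(url_function(path), title)].append(path)
    return groups

docs = [("/path/one/doc.md", "Title"), ("/other/path/doc.md", "Title")]

# Basename-based url_function (the current DefaultMetadataProcessor behavior):
# both paths map to "doc.md", so the two documents collapse into one group.
collapsed = group_by_docs_url_and_title(docs, basename)
assert len(collapsed) == 1

# A path-preserving url_function keeps the documents distinct.
distinct = group_by_docs_url_and_title(docs, lambda p: p)
assert len(distinct) == 2
```

The real regression test would assert the second behavior against the project's actual grouping code, so it fails while url_function is basename-only.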
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c34abc49-3e31-4257-8829-4b2d8b3be953
📒 Files selected for processing (3)
- src/lightspeed_rag_content/__init__.py
- src/lightspeed_rag_content/metadata_processor.py
- tests/test_metadata_processor.py

```text
Suitable for documents that carry a ``url`` field in their YAML frontmatter
(the base class will use that value directly) or when the bare filename is
an acceptable reference for RAG retrieval.
```
Docstring behavior claim is inaccurate.
On Line 103-105, the docstring says the base class uses YAML frontmatter url directly, but this implementation path only uses url_function(file_path). Please reword to match actual behavior and avoid integration confusion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lightspeed_rag_content/metadata_processor.py` around lines 103 - 105, The
docstring incorrectly states that the base class will use the YAML frontmatter
`url` value directly; update the wording in metadata_processor.py so it
accurately reflects the implementation: clarify that this path uses the provided
`url_function(file_path)` to derive the URL (and not the base class reading
`url` from frontmatter), and rephrase the sentence to avoid implying the base
class directly consumes frontmatter `url`; reference the `url_function` symbol
and the file's metadata processor/docstring in your change.
```python
def url_function(self, file_path: str) -> str:
    """Return the basename of the file as the document URL."""
    return os.path.basename(file_path)
```
Use a unique source key, not only basename.
On Line 108-110, os.path.basename(file_path) can collide for different files with the same filename (e.g., /a/doc.md and /b/doc.md). Since docs_url is later used as a grouping key and citation source, this can merge unrelated documents and corrupt attribution.
Suggested fix

```diff
 def url_function(self, file_path: str) -> str:
-    """Return the basename of the file as the document URL."""
-    return os.path.basename(file_path)
+    """Return a stable, path-based identifier for the document source."""
+    return os.path.normpath(file_path).replace(os.sep, "/")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lightspeed_rag_content/metadata_processor.py` around lines 108 - 110, The
url_function currently returns only os.path.basename(file_path), which can
produce collisions; change url_function to produce a unique source key (e.g.,
use os.path.abspath(file_path) or basename plus a stable hash of the full path
or parent directory) so docs_url (the grouping/citation key) cannot merge
unrelated documents; update url_function to return that unique string (keep the
function name url_function and ensure any code that expects docs_url continues
to receive a stable, filesystem-unique identifier).
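Run standalone, the suggested replacement behaves like this (a minimal demonstration, not the project code):

```python
import os

def url_function(file_path: str) -> str:
    # The reviewer's suggested replacement: normalize the full path and use
    # forward slashes so the identifier is stable across platforms.
    return os.path.normpath(file_path).replace(os.sep, "/")

a = "/a/doc.md"
b = "/b/doc.md"

# basename alone collides for these two distinct files...
assert os.path.basename(a) == os.path.basename(b)
# ...while the path-based identifier keeps them distinct.
assert url_function(a) != url_function(b)
```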
Force-pushed from 899c47e to 4b905cb
Actionable comments posted: 1
🧹 Nitpick comments (1)
scripts/generate_embeddings.py (1)
84-99: Consider adding basic error handling for the pipeline.

The main() function has no error handling. If process() or save() fails (e.g., invalid folder path, missing model files, disk full), the script will crash with an unhelpful traceback. For a user-facing CLI, consider catching common exceptions and providing actionable error messages.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/generate_embeddings.py` around lines 84 - 99, Wrap main()'s pipeline call in a try/except that catches common errors (e.g., FileNotFoundError, OSError, ValueError, and a broad Exception fallback) around DocumentProcessor.process and DocumentProcessor.save; use _parse_args and DefaultMetadataProcessor as before but on error log a concise, actionable message including the exception message (and optionally hint like "check folder path" or "check model files"), then exit with a non-zero status. Ensure the exception handling references DocumentProcessor.process and DocumentProcessor.save so failures there are handled, and keep the normal flow unchanged on success.
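The suggested handling can be sketched as follows; `process` and `save` are hypothetical stand-ins for `DocumentProcessor.process` and `DocumentProcessor.save`, and the hint text is illustrative:

```python
import sys

def run_pipeline(process, save):
    """Run the embedding pipeline with the error handling the review
    suggests: catch common failures and return a non-zero exit status
    with an actionable message instead of a raw traceback."""
    try:
        process()
        save()
    except FileNotFoundError as exc:
        print(f"Error: {exc} (check the folder and model paths)", file=sys.stderr)
        return 1
    except (OSError, ValueError) as exc:
        print(f"Error: {exc}", file=sys.stderr)
        return 1
    return 0
```

`main()` would then call `sys.exit(run_pipeline(...))` so CI and shell callers see the failure status.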
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9d8d31ed-084d-4b68-a7ff-07eb05e1e0f0
📒 Files selected for processing (5)
- scripts/generate_embeddings.py
- src/lightspeed_rag_content/__init__.py
- src/lightspeed_rag_content/metadata_processor.py
- tests/test_generate_embeddings.py
- tests/test_metadata_processor.py
✅ Files skipped from review due to trivial changes (1)
- src/lightspeed_rag_content/__init__.py
```python
chunk_size=args.chunk_size,
chunk_overlap=args.chunk_overlap,
model_name=args.model_name,
embeddings_model_dir=args.model_dir,
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Verify DocumentProcessor.__init__ signature and how embeddings_model_dir is used
ast-grep --pattern $'def __init__(
$$$
embeddings_model_dir: $_,
$$$
)'
```

Repository: lightspeed-core/rag-content
Length of output: 3613
🏁 Script executed:

```shell
# First, examine the generate_embeddings.py file around line 93 and check argparse setup
cat -n scripts/generate_embeddings.py | head -100
```

Repository: lightspeed-core/rag-content
Length of output: 3862
🏁 Script executed:

```shell
# Check if Path is already imported in generate_embeddings.py
rg "from pathlib import Path|import pathlib" scripts/generate_embeddings.py
```

Repository: lightspeed-core/rag-content
Length of output: 53
🏁 Script executed:

```shell
# Look at how _parse_args sets up model_dir to confirm it returns a string
sed -n '30,81p' scripts/generate_embeddings.py
```

Repository: lightspeed-core/rag-content
Length of output: 1699
🏁 Script executed:

```shell
# Check if project has type checking configuration
fd "pyproject.toml|setup.cfg|mypy.ini|pyrightconfig.json" . --type f
```

Repository: lightspeed-core/rag-content
Length of output: 87
🏁 Script executed:

```shell
# Check imports in document_processor.py to confirm Path is used from pathlib
head -30 src/lightspeed_rag_content/document_processor.py
```

Repository: lightspeed-core/rag-content
Length of output: 1134
🏁 Script executed:

```shell
# Check pyproject.toml for type checking configuration
cat pyproject.toml | grep -A 20 "\[tool.mypy\]\|\[tool.pyright\]\|mypy"
```

Repository: lightspeed-core/rag-content
Length of output: 1066
Type mismatch: embeddings_model_dir expects Path, not str.
args.model_dir is a string from argparse, but DocumentProcessor.__init__ expects embeddings_model_dir as type Path (line 603 in document_processor.py). While this works at runtime due to the str() conversion at line 635, the type annotation should be honored for consistency and static type checking.
🛠️ Proposed fix

```diff
+from pathlib import Path
+
 from lightspeed_rag_content.document_processor import DocumentProcessor
 from lightspeed_rag_content.metadata_processor import DefaultMetadataProcessor

 document_processor = DocumentProcessor(
     chunk_size=args.chunk_size,
     chunk_overlap=args.chunk_overlap,
     model_name=args.model_name,
-    embeddings_model_dir=args.model_dir,
+    embeddings_model_dir=Path(args.model_dir),
     vector_store_type=args.vector_store,
     doc_type=args.doc_type,
 )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
from pathlib import Path

from lightspeed_rag_content.document_processor import DocumentProcessor
from lightspeed_rag_content.metadata_processor import DefaultMetadataProcessor

...

document_processor = DocumentProcessor(
    chunk_size=args.chunk_size,
    chunk_overlap=args.chunk_overlap,
    model_name=args.model_name,
    embeddings_model_dir=Path(args.model_dir),
    vector_store_type=args.vector_store,
    doc_type=args.doc_type,
)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/generate_embeddings.py` at line 93, The call passing args.model_dir
(a str) to DocumentProcessor should convert the string to a Path to match the
embeddings_model_dir type annotation; update the invocation that sets
embeddings_model_dir=args.model_dir to pass a pathlib.Path instance (use
Path(args.model_dir)) so the DocumentProcessor.__init__ signature and static
typing are honored (refer to embeddings_model_dir, args.model_dir, and
DocumentProcessor.__init__ to locate the change).
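Besides wrapping the value at the call site, argparse can perform the conversion itself. A minimal sketch, assuming the flag is spelled `--model-dir` and the sample value is hypothetical:

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
# type=Path converts the command-line string to a pathlib.Path at parse
# time, so args.model_dir already satisfies a Path-typed annotation.
parser.add_argument("--model-dir", type=Path)
args = parser.parse_args(["--model-dir", "/models/bge-small"])
assert isinstance(args.model_dir, Path)
```

This keeps the conversion in one place instead of requiring `Path(args.model_dir)` at every use site.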
/ok-to-test
tisnik left a comment:

LGTM, but please fix CI failures
Added commit for fixing the CI issues. Thanks!
Motivation
The existing CLI workflow required users to author a custom_processor.py file before running the embedding pipeline. This was unnecessary boilerplate for the common case where the default fallback strategy (filename or YAML frontmatter) is sufficient.

Description
This removes the need to specify a custom metadata processor file when generating the vector embeddings:

- DefaultMetadataProcessor — a concrete, ready-to-use subclass of MetadataProcessor that falls back to the filename (basename) as the document URL. Eliminates the need for users to write a custom Python subclass for the common case.
- Re-exports DefaultMetadataProcessor from the package's public API (lightspeed_rag_content.__init__).
- scripts/generate_embeddings.py — a ready-made CLI script that wires DefaultMetadataProcessor + DocumentProcessor with all parameters exposed as flags. Users can now generate a vector database directly from the container without writing any Python.

Once #103 is merged, the YAML frontmatter strategy will become the default.
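The shape of the new processor can be sketched with a stand-in base class (a hypothetical minimal sketch; the real MetadataProcessor interface in lightspeed_rag_content.metadata_processor is richer, and only the filename fallback is shown):

```python
import os
from abc import ABC, abstractmethod

class MetadataProcessor(ABC):
    """Stand-in for the package's abstract base class."""

    @abstractmethod
    def url_function(self, file_path: str) -> str:
        ...

class DefaultMetadataProcessor(MetadataProcessor):
    """Concrete processor that falls back to the filename (basename)
    as the document URL, as described above."""

    def url_function(self, file_path: str) -> str:
        return os.path.basename(file_path)

processor = DefaultMetadataProcessor()
assert processor.url_function("/docs/guide/intro.md") == "intro.md"
```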
Usage
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Tested locally, metadata generated properly, no need for processor parameter.