
Add DefaultMetadataProcessor and generate_embeddings.py CLI script#104

Merged
tisnik merged 2 commits into lightspeed-core:main from romartin:generate-embeddings-cli
Mar 23, 2026

Conversation

@romartin
Contributor

@romartin romartin commented Mar 22, 2026

Motivation

The existing CLI workflow required users to author a custom_processor.py file before running the embedding pipeline. This was unnecessary boilerplate for the common case where the default fallback strategy (filename or YAML frontmatter) is sufficient.

Description

This change removes the need to specify a custom metadata processor file when generating vector embeddings:

  • Adds DefaultMetadataProcessor — a concrete, ready-to-use subclass of
    MetadataProcessor that falls back to the filename (basename) as the
    document URL. Eliminates the need for users to write a custom Python
    subclass for the common case.
  • Exports DefaultMetadataProcessor from the package's public API
    (lightspeed_rag_content.__init__).
  • Adds scripts/generate_embeddings.py, a ready-made CLI script that
    wires DefaultMetadataProcessor + DocumentProcessor with all
    parameters exposed as flags. Users can now generate a vector database
    directly from the container without writing any Python.
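The new class described above could be sketched roughly as follows. This is an illustrative outline, not the merged code: the `MetadataProcessor` base class here is a minimal stand-in for the package's actual abstract class, and the exact method signature is an assumption based on the PR summary.

```python
import os


class MetadataProcessor:
    """Simplified stand-in for the package's abstract base class."""

    def url_function(self, file_path: str) -> str:
        raise NotImplementedError


class DefaultMetadataProcessor(MetadataProcessor):
    """Ready-to-use fallback: the file's basename serves as the document URL."""

    def url_function(self, file_path: str) -> str:
        # No custom subclass needed for the common case -- just take the
        # basename of the input file as the document reference.
        return os.path.basename(file_path)
```

With this default in place, a caller can pass `DefaultMetadataProcessor()` straight to the document pipeline instead of writing a custom subclass.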

Once #103 is merged, the default will be the YAML frontmatter strategy.

Usage

podman run --rm \
  --userns=keep-id \
  -v "$(pwd)/my_docs:/input:Z" \
  -v "$(pwd)/vector_db:/output:Z" \
  quay.io/lightspeed-core/rag-content-<variant>:latest \
  python /rag-content/scripts/generate_embeddings.py \
    -f /input -o /output -i my-docs-index

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude
  • Generated by: Claude

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

Tested locally; metadata is generated properly without the processor parameter.

Summary by CodeRabbit

  • New Features

    • Added a new CLI utility for generating document embeddings with configurable chunk sizing, model selection, and vector store options.
    • Introduced a default metadata processor implementation that automatically derives URLs from file paths for immediate use.
    • Enhanced package imports to provide easier access to core metadata processor classes.
  • Tests

    • Added comprehensive test coverage for the new metadata processor and embedding generation tool.

@coderabbitai

coderabbitai Bot commented Mar 22, 2026

Warning

Rate limit exceeded

@romartin has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 10 minutes and 21 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a8445874-5d61-44d9-adc0-679ce46d5964

📥 Commits

Reviewing files that changed from the base of the PR and between 4b905cb and 6b0eb58.

📒 Files selected for processing (3)
  • scripts/generate_embeddings.py
  • tests/test_generate_embeddings.py
  • tests/test_metadata_processor.py

Walkthrough

Adds a concrete DefaultMetadataProcessor (providing a basename-based url_function), re-exports MetadataProcessor and DefaultMetadataProcessor at package top level, and introduces a CLI script to generate embeddings plus tests covering the new processor and script behavior.

Changes

Cohort / File(s) Summary
Package Public API
src/lightspeed_rag_content/__init__.py
Added top-level re-exports of MetadataProcessor and DefaultMetadataProcessor so they are available from the package root.
Metadata processor implementation
src/lightspeed_rag_content/metadata_processor.py
Added DefaultMetadataProcessor subclass implementing url_function(file_path) using os.path.basename(file_path) as a default URL derivation.
Embedding generator CLI
scripts/generate_embeddings.py
New CLI utility: argument parsing, constants for defaults (chunk size/overlap, model name/dir, vector store, doc type), instantiates DefaultMetadataProcessor and DocumentProcessor, runs processing, and saves vector DB.
Processor tests
tests/test_metadata_processor.py
Added tests for DefaultMetadataProcessor: instantiation, inheritance, url_function behavior on nested and bare filenames, package-level exposure, and populate() behavior including reachable/unreachable URL handling and warning logs.
CLI tests
tests/test_generate_embeddings.py
Added tests validating CLI parsing (required and optional flags/defaults), argument typing/overrides, and main() orchestration with mocked DocumentProcessor and DefaultMetadataProcessor (process/save calls).

Sequence Diagram(s)

mermaid
sequenceDiagram
participant CLI as CLI (scripts/generate_embeddings.py)
participant Meta as DefaultMetadataProcessor
participant Proc as DocumentProcessor
participant FS as File System
participant VS as VectorStore
CLI->>Meta: instantiate (defaults or provided)
CLI->>Proc: instantiate (model, chunking, vector store, doc type, metadata=Meta)
CLI->>Proc: process(folder)
Proc->>FS: read files
Proc->>Meta: populate metadata per file (calls url_function, ping_url, get_file_title)
Meta->>FS: (optional) ping URL / derive basename
Proc->>VS: save(index, output_dir)
VS-->>CLI: persist complete

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the two main additions in the changeset: DefaultMetadataProcessor class and the generate_embeddings.py CLI script.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/test_metadata_processor.py (1)

168-185: Add a collision regression test for duplicate basenames.

These tests validate basename extraction, but they don’t cover two different paths sharing the same filename (and same title). Given downstream grouping by (docs_url, title), add a regression test to ensure distinct documents don’t collapse into one group.
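The regression test the reviewer asks for could be sketched as below. The `DefaultMetadataProcessor` here is a simplified local stand-in (not the package import), and the test deliberately demonstrates the collision rather than asserting the desired post-fix behavior:

```python
import os


class DefaultMetadataProcessor:
    """Local stand-in mirroring the PR's basename-based url_function."""

    def url_function(self, file_path: str) -> str:
        return os.path.basename(file_path)


def test_duplicate_basenames_produce_same_docs_url():
    # Two distinct source files sharing a filename map to the same
    # docs_url -- exactly the (docs_url, title) grouping collision
    # the review flags.
    proc = DefaultMetadataProcessor()
    url_a = proc.url_function("/path/one/doc.md")
    url_b = proc.url_function("/other/path/doc.md")
    assert url_a == url_b == "doc.md"
```

A post-fix version of this test would instead assert that the two URLs differ, e.g. once `url_function` returns a path-unique identifier.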

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_metadata_processor.py` around lines 168 - 185, Add a regression
test that ensures two different source paths with the same basename and same
title are treated as distinct documents and do not collapse when grouping by
(docs_url, title); in tests/test_metadata_processor.py create a new test using
DefaultMetadataProcessor and its url_function to produce identical basenames for
two different input paths (e.g., "/path/one/doc.md" and "/other/path/doc.md"),
then simulate or call the code that groups by (docs_url, title) and assert that
the group contains two separate entries (or that the grouping key preserves
distinct source paths), referencing DefaultMetadataProcessor and url_function to
locate where to generate the inputs and validate grouping behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_rag_content/metadata_processor.py`:
- Around line 103-105: The docstring incorrectly states that the base class will
use the YAML frontmatter `url` value directly; update the wording in
metadata_processor.py so it accurately reflects the implementation: clarify that
this path uses the provided `url_function(file_path)` to derive the URL (and not
the base class reading `url` from frontmatter), and rephrase the sentence to
avoid implying the base class directly consumes frontmatter `url`; reference the
`url_function` symbol and the file's metadata processor/docstring in your
change.
- Around line 108-110: The url_function currently returns only
os.path.basename(file_path), which can produce collisions; change url_function
to produce a unique source key (e.g., use os.path.abspath(file_path) or basename
plus a stable hash of the full path or parent directory) so docs_url (the
grouping/citation key) cannot merge unrelated documents; update url_function to
return that unique string (keep the function name url_function and ensure any
code that expects docs_url continues to receive a stable, filesystem-unique
identifier).

---

Nitpick comments:
In `@tests/test_metadata_processor.py`:
- Around line 168-185: Add a regression test that ensures two different source
paths with the same basename and same title are treated as distinct documents
and do not collapse when grouping by (docs_url, title); in
tests/test_metadata_processor.py create a new test using
DefaultMetadataProcessor and its url_function to produce identical basenames for
two different input paths (e.g., "/path/one/doc.md" and "/other/path/doc.md"),
then simulate or call the code that groups by (docs_url, title) and assert that
the group contains two separate entries (or that the grouping key preserves
distinct source paths), referencing DefaultMetadataProcessor and url_function to
locate where to generate the inputs and validate grouping behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c34abc49-3e31-4257-8829-4b2d8b3be953

📥 Commits

Reviewing files that changed from the base of the PR and between c750c6f and 899c47e.

📒 Files selected for processing (3)
  • src/lightspeed_rag_content/__init__.py
  • src/lightspeed_rag_content/metadata_processor.py
  • tests/test_metadata_processor.py

Comment on lines +103 to +105
Suitable for documents that carry a ``url`` field in their YAML frontmatter
(the base class will use that value directly) or when the bare filename is
an acceptable reference for RAG retrieval.


⚠️ Potential issue | 🟡 Minor

Docstring behavior claim is inaccurate.

On Line 103-105, the docstring says the base class uses YAML frontmatter url directly, but this implementation path only uses url_function(file_path). Please reword to match actual behavior and avoid integration confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_rag_content/metadata_processor.py` around lines 103 - 105, The
docstring incorrectly states that the base class will use the YAML frontmatter
`url` value directly; update the wording in metadata_processor.py so it
accurately reflects the implementation: clarify that this path uses the provided
`url_function(file_path)` to derive the URL (and not the base class reading
`url` from frontmatter), and rephrase the sentence to avoid implying the base
class directly consumes frontmatter `url`; reference the `url_function` symbol
and the file's metadata processor/docstring in your change.

Comment on lines +108 to +110
def url_function(self, file_path: str) -> str:
"""Return the basename of the file as the document URL."""
return os.path.basename(file_path)


⚠️ Potential issue | 🟠 Major

Use a unique source key, not only basename.

On Line 108-110, os.path.basename(file_path) can collide for different files with the same filename (e.g., /a/doc.md and /b/doc.md). Since docs_url is later used as a grouping key and citation source, this can merge unrelated documents and corrupt attribution.

Suggested fix
 def url_function(self, file_path: str) -> str:
-    """Return the basename of the file as the document URL."""
-    return os.path.basename(file_path)
+    """Return a stable, path-based identifier for the document source."""
+    return os.path.normpath(file_path).replace(os.sep, "/")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_rag_content/metadata_processor.py` around lines 108 - 110, The
url_function currently returns only os.path.basename(file_path), which can
produce collisions; change url_function to produce a unique source key (e.g.,
use os.path.abspath(file_path) or basename plus a stable hash of the full path
or parent directory) so docs_url (the grouping/citation key) cannot merge
unrelated documents; update url_function to return that unique string (keep the
function name url_function and ensure any code that expects docs_url continues
to receive a stable, filesystem-unique identifier).

@romartin romartin force-pushed the generate-embeddings-cli branch from 899c47e to 4b905cb on March 23, 2026 at 11:31

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
scripts/generate_embeddings.py (1)

84-99: Consider adding basic error handling for the pipeline.

The main() function has no error handling. If process() or save() fails (e.g., invalid folder path, missing model files, disk full), the script will crash with an unhelpful traceback. For a user-facing CLI, consider catching common exceptions and providing actionable error messages.
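The suggested error handling could look like the sketch below. The `run_pipeline` wrapper and its callables are illustrative stand-ins for `DocumentProcessor.process` and `DocumentProcessor.save`; the actual function names and exception set are assumptions, not the merged code:

```python
import sys


def run_pipeline(process, save):
    """Run the embedding pipeline, converting common failures into
    actionable messages and a non-zero exit code instead of a raw
    traceback. `process`/`save` stand in for the DocumentProcessor calls.
    """
    try:
        process()
        save()
    except FileNotFoundError as exc:
        # FileNotFoundError must be caught before OSError (its parent).
        print(f"error: {exc} -- check the input folder path", file=sys.stderr)
        return 1
    except OSError as exc:
        print(f"error: {exc} -- check model files and free disk space",
              file=sys.stderr)
        return 1
    return 0
```

`main()` would then pass the real processor's bound methods and `sys.exit(run_pipeline(...))`, keeping the success path unchanged.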

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/generate_embeddings.py` around lines 84 - 99, Wrap main()'s pipeline
call in a try/except that catches common errors (e.g., FileNotFoundError,
OSError, ValueError, and a broad Exception fallback) around
DocumentProcessor.process and DocumentProcessor.save; use _parse_args and
DefaultMetadataProcessor as before but on error log a concise, actionable
message including the exception message (and optionally hint like "check folder
path" or "check model files"), then exit with a non-zero status. Ensure the
exception handling references DocumentProcessor.process and
DocumentProcessor.save so failures there are handled, and keep the normal flow
unchanged on success.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/generate_embeddings.py`:
- Line 93: The call passing args.model_dir (a str) to DocumentProcessor should
convert the string to a Path to match the embeddings_model_dir type annotation;
update the invocation that sets embeddings_model_dir=args.model_dir to pass a
pathlib.Path instance (use Path(args.model_dir)) so the
DocumentProcessor.__init__ signature and static typing are honored (refer to
embeddings_model_dir, args.model_dir, and DocumentProcessor.__init__ to locate
the change).

---

Nitpick comments:
In `@scripts/generate_embeddings.py`:
- Around line 84-99: Wrap main()'s pipeline call in a try/except that catches
common errors (e.g., FileNotFoundError, OSError, ValueError, and a broad
Exception fallback) around DocumentProcessor.process and DocumentProcessor.save;
use _parse_args and DefaultMetadataProcessor as before but on error log a
concise, actionable message including the exception message (and optionally hint
like "check folder path" or "check model files"), then exit with a non-zero
status. Ensure the exception handling references DocumentProcessor.process and
DocumentProcessor.save so failures there are handled, and keep the normal flow
unchanged on success.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9d8d31ed-084d-4b68-a7ff-07eb05e1e0f0

📥 Commits

Reviewing files that changed from the base of the PR and between 899c47e and 4b905cb.

📒 Files selected for processing (5)
  • scripts/generate_embeddings.py
  • src/lightspeed_rag_content/__init__.py
  • src/lightspeed_rag_content/metadata_processor.py
  • tests/test_generate_embeddings.py
  • tests/test_metadata_processor.py
✅ Files skipped from review due to trivial changes (1)
  • src/lightspeed_rag_content/__init__.py

chunk_size=args.chunk_size,
chunk_overlap=args.chunk_overlap,
model_name=args.model_name,
embeddings_model_dir=args.model_dir,


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify DocumentProcessor.__init__ signature and how embeddings_model_dir is used
ast-grep --pattern $'def __init__(
  $$$
  embeddings_model_dir: $_,
  $$$
)'

Repository: lightspeed-core/rag-content

Length of output: 3613


🏁 Script executed:

# First, examine the generate_embeddings.py file around line 93 and check argparse setup
cat -n scripts/generate_embeddings.py | head -100

Repository: lightspeed-core/rag-content

Length of output: 3862


🏁 Script executed:

# Check if Path is already imported in generate_embeddings.py
rg "from pathlib import Path|import pathlib" scripts/generate_embeddings.py

Repository: lightspeed-core/rag-content

Length of output: 53


🏁 Script executed:

# Look at how _parse_args sets up model_dir to confirm it returns a string
sed -n '30,81p' scripts/generate_embeddings.py

Repository: lightspeed-core/rag-content

Length of output: 1699


🏁 Script executed:

# Check if project has type checking configuration
fd "pyproject.toml|setup.cfg|mypy.ini|pyrightconfig.json" . --type f

Repository: lightspeed-core/rag-content

Length of output: 87


🏁 Script executed:

# Check imports in document_processor.py to confirm Path is used from pathlib
head -30 src/lightspeed_rag_content/document_processor.py

Repository: lightspeed-core/rag-content

Length of output: 1134


🏁 Script executed:

# Check pyproject.toml for type checking configuration
cat pyproject.toml | grep -A 20 "\[tool.mypy\]\|\[tool.pyright\]\|mypy"

Repository: lightspeed-core/rag-content

Length of output: 1066


Type mismatch: embeddings_model_dir expects Path, not str.

args.model_dir is a string from argparse, but DocumentProcessor.__init__ expects embeddings_model_dir as type Path (line 603 in document_processor.py). While this works at runtime due to the str() conversion at line 635, the type annotation should be honored for consistency and static type checking.

🛠️ Proposed fix
+from pathlib import Path
+
 from lightspeed_rag_content.document_processor import DocumentProcessor
 from lightspeed_rag_content.metadata_processor import DefaultMetadataProcessor
     document_processor = DocumentProcessor(
         chunk_size=args.chunk_size,
         chunk_overlap=args.chunk_overlap,
         model_name=args.model_name,
-        embeddings_model_dir=args.model_dir,
+        embeddings_model_dir=Path(args.model_dir),
         vector_store_type=args.vector_store,
         doc_type=args.doc_type,
     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
embeddings_model_dir=args.model_dir,
from pathlib import Path
from lightspeed_rag_content.document_processor import DocumentProcessor
from lightspeed_rag_content.metadata_processor import DefaultMetadataProcessor
...
document_processor = DocumentProcessor(
chunk_size=args.chunk_size,
chunk_overlap=args.chunk_overlap,
model_name=args.model_name,
embeddings_model_dir=Path(args.model_dir),
vector_store_type=args.vector_store,
doc_type=args.doc_type,
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/generate_embeddings.py` at line 93, The call passing args.model_dir
(a str) to DocumentProcessor should convert the string to a Path to match the
embeddings_model_dir type annotation; update the invocation that sets
embeddings_model_dir=args.model_dir to pass a pathlib.Path instance (use
Path(args.model_dir)) so the DocumentProcessor.__init__ signature and static
typing are honored (refer to embeddings_model_dir, args.model_dir, and
DocumentProcessor.__init__ to locate the change).

Contributor

@are-ces are-ces left a comment


LGTM

@are-ces
Contributor

are-ces commented Mar 23, 2026

/ok-to-test

Collaborator

@tisnik tisnik left a comment


LGTM, but please fix CI failures

@romartin
Contributor Author

Added commit for fixing the CI issues. Thanks!

@tisnik tisnik merged commit 2b8c8b0 into lightspeed-core:main Mar 23, 2026
14 of 15 checks passed
@romartin romartin deleted the generate-embeddings-cli branch March 23, 2026 20:41