
Add DefaultMetadataProcessor and generate_embeddings.py CLI script#104

Merged
tisnik merged 2 commits into lightspeed-core:main from romartin:generate-embeddings-cli
Mar 23, 2026

Conversation

@romartin
Contributor

@romartin romartin commented Mar 22, 2026

Motivation

The existing CLI workflow required users to author a custom_processor.py file before running the embedding pipeline. This was unnecessary boilerplate for the common case where the default fallback strategy (filename or YAML frontmatter) is sufficient.

Description

This change removes the need to specify a custom metadata processor file when generating vector embeddings:

  • Adds DefaultMetadataProcessor — a concrete, ready-to-use subclass of
    MetadataProcessor that falls back to the filename (basename) as the
    document URL. Eliminates the need for users to write a custom Python
    subclass for the common case.
  • Exports DefaultMetadataProcessor from the package's public API
    (lightspeed_rag_content.__init__).
  • Adds scripts/generate_embeddings.py, a ready-made CLI script that
    wires DefaultMetadataProcessor + DocumentProcessor with all
    parameters exposed as flags. Users can now generate a vector database
    directly from the container without writing any Python.
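The new class described above could be sketched roughly as follows. This is an illustrative outline, not the merged code: the `MetadataProcessor` base class here is a minimal stand-in for the package's actual abstract class, and the exact method signature is an assumption based on the PR summary.

```python
import os


class MetadataProcessor:
    """Simplified stand-in for the package's abstract base class."""

    def url_function(self, file_path: str) -> str:
        raise NotImplementedError


class DefaultMetadataProcessor(MetadataProcessor):
    """Ready-to-use fallback: the file's basename serves as the document URL."""

    def url_function(self, file_path: str) -> str:
        # No custom subclass needed for the common case -- just take the
        # basename of the input file as the document reference.
        return os.path.basename(file_path)
```

With this default in place, a caller can pass `DefaultMetadataProcessor()` straight to the document pipeline instead of writing a custom subclass.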

Once #103 is merged, the default will be the YAML frontmatter strategy.

Usage

podman run --rm \
  --userns=keep-id \
  -v "$(pwd)/my_docs:/input:Z" \
  -v "$(pwd)/vector_db:/output:Z" \
  quay.io/lightspeed-core/rag-content-<variant>:latest \
  python /rag-content/scripts/generate_embeddings.py \
    -f /input -o /output -i my-docs-index

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude
  • Generated by: Claude

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

Tested locally; metadata is generated properly without the processor parameter.

Summary by CodeRabbit

  • New Features

    • Added a new CLI utility for generating document embeddings with configurable chunk sizing, model selection, and vector store options.
    • Introduced a default metadata processor implementation that automatically derives URLs from file paths for immediate use.
    • Enhanced package imports to provide easier access to core metadata processor classes.
  • Tests

    • Added comprehensive test coverage for the new metadata processor and embedding generation tool.

@coderabbitai

coderabbitai Bot commented Mar 22, 2026

Warning

Rate limit exceeded

@romartin has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 10 minutes and 21 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a8445874-5d61-44d9-adc0-679ce46d5964

📥 Commits

Reviewing files that changed from the base of the PR and between 4b905cb and 6b0eb58.

📒 Files selected for processing (3)
  • scripts/generate_embeddings.py
  • tests/test_generate_embeddings.py
  • tests/test_metadata_processor.py

Walkthrough

Adds a concrete DefaultMetadataProcessor (providing a basename-based url_function), re-exports MetadataProcessor and DefaultMetadataProcessor at package top level, and introduces a CLI script to generate embeddings plus tests covering the new processor and script behavior.

Changes

Cohort / File(s) Summary
Package Public API
src/lightspeed_rag_content/__init__.py
Added top-level re-exports of MetadataProcessor and DefaultMetadataProcessor so they are available from the package root.
Metadata processor implementation
src/lightspeed_rag_content/metadata_processor.py
Added DefaultMetadataProcessor subclass implementing url_function(file_path) using os.path.basename(file_path) as a default URL derivation.
Embedding generator CLI
scripts/generate_embeddings.py
New CLI utility: argument parsing, constants for defaults (chunk size/overlap, model name/dir, vector store, doc type), instantiates DefaultMetadataProcessor and DocumentProcessor, runs processing, and saves vector DB.
Processor tests
tests/test_metadata_processor.py
Added tests for DefaultMetadataProcessor: instantiation, inheritance, url_function behavior on nested and bare filenames, package-level exposure, and populate() behavior including reachable/unreachable URL handling and warning logs.
CLI tests
tests/test_generate_embeddings.py
Added tests validating CLI parsing (required and optional flags/defaults), argument typing/overrides, and main() orchestration with mocked DocumentProcessor and DefaultMetadataProcessor (process/save calls).

Sequence Diagram(s)

mermaid
sequenceDiagram
participant CLI as CLI (scripts/generate_embeddings.py)
participant Meta as DefaultMetadataProcessor
participant Proc as DocumentProcessor
participant FS as File System
participant VS as VectorStore
CLI->>Meta: instantiate (defaults or provided)
CLI->>Proc: instantiate (model, chunking, vector store, doc type, metadata=Meta)
CLI->>Proc: process(folder)
Proc->>FS: read files
Proc->>Meta: populate metadata per file (calls url_function, ping_url, get_file_title)
Meta->>FS: (optional) ping URL / derive basename
Proc->>VS: save(index, output_dir)
VS-->>CLI: persist complete

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the two main additions in the changeset: DefaultMetadataProcessor class and the generate_embeddings.py CLI script.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/test_metadata_processor.py (1)

168-185: Add a collision regression test for duplicate basenames.

These tests validate basename extraction, but they don’t cover two different paths sharing the same filename (and same title). Given downstream grouping by (docs_url, title), add a regression test to ensure distinct documents don’t collapse into one group.
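The regression test the reviewer asks for could be sketched as below. The `DefaultMetadataProcessor` here is a simplified local stand-in (not the package import), and the test deliberately demonstrates the collision rather than asserting the desired post-fix behavior:

```python
import os


class DefaultMetadataProcessor:
    """Local stand-in mirroring the PR's basename-based url_function."""

    def url_function(self, file_path: str) -> str:
        return os.path.basename(file_path)


def test_duplicate_basenames_produce_same_docs_url():
    # Two distinct source files sharing a filename map to the same
    # docs_url -- exactly the (docs_url, title) grouping collision
    # the review flags.
    proc = DefaultMetadataProcessor()
    url_a = proc.url_function("/path/one/doc.md")
    url_b = proc.url_function("/other/path/doc.md")
    assert url_a == url_b == "doc.md"
```

A post-fix version of this test would instead assert that the two URLs differ, e.g. once `url_function` returns a path-unique identifier.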

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_metadata_processor.py` around lines 168 - 185, Add a regression
test that ensures two different source paths with the same basename and same
title are treated as distinct documents and do not collapse when grouping by
(docs_url, title); in tests/test_metadata_processor.py create a new test using
DefaultMetadataProcessor and its url_function to produce identical basenames for
two different input paths (e.g., "/path/one/doc.md" and "/other/path/doc.md"),
then simulate or call the code that groups by (docs_url, title) and assert that
the group contains two separate entries (or that the grouping key preserves
distinct source paths), referencing DefaultMetadataProcessor and url_function to
locate where to generate the inputs and validate grouping behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_rag_content/metadata_processor.py`:
- Around line 103-105: The docstring incorrectly states that the base class will
use the YAML frontmatter `url` value directly; update the wording in
metadata_processor.py so it accurately reflects the implementation: clarify that
this path uses the provided `url_function(file_path)` to derive the URL (and not
the base class reading `url` from frontmatter), and rephrase the sentence to
avoid implying the base class directly consumes frontmatter `url`; reference the
`url_function` symbol and the file's metadata processor/docstring in your
change.
- Around line 108-110: The url_function currently returns only
os.path.basename(file_path), which can produce collisions; change url_function
to produce a unique source key (e.g., use os.path.abspath(file_path) or basename
plus a stable hash of the full path or parent directory) so docs_url (the
grouping/citation key) cannot merge unrelated documents; update url_function to
return that unique string (keep the function name url_function and ensure any
code that expects docs_url continues to receive a stable, filesystem-unique
identifier).

---

Nitpick comments:
In `@tests/test_metadata_processor.py`:
- Around line 168-185: Add a regression test that ensures two different source
paths with the same basename and same title are treated as distinct documents
and do not collapse when grouping by (docs_url, title); in
tests/test_metadata_processor.py create a new test using
DefaultMetadataProcessor and its url_function to produce identical basenames for
two different input paths (e.g., "/path/one/doc.md" and "/other/path/doc.md"),
then simulate or call the code that groups by (docs_url, title) and assert that
the group contains two separate entries (or that the grouping key preserves
distinct source paths), referencing DefaultMetadataProcessor and url_function to
locate where to generate the inputs and validate grouping behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c34abc49-3e31-4257-8829-4b2d8b3be953

📥 Commits

Reviewing files that changed from the base of the PR and between c750c6f and 899c47e.

📒 Files selected for processing (3)
  • src/lightspeed_rag_content/__init__.py
  • src/lightspeed_rag_content/metadata_processor.py
  • tests/test_metadata_processor.py

Comment on lines +103 to +105
Suitable for documents that carry a ``url`` field in their YAML frontmatter
(the base class will use that value directly) or when the bare filename is
an acceptable reference for RAG retrieval.


⚠️ Potential issue | 🟡 Minor

Docstring behavior claim is inaccurate.

On Line 103-105, the docstring says the base class uses YAML frontmatter url directly, but this implementation path only uses url_function(file_path). Please reword to match actual behavior and avoid integration confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_rag_content/metadata_processor.py` around lines 103 - 105, The
docstring incorrectly states that the base class will use the YAML frontmatter
`url` value directly; update the wording in metadata_processor.py so it
accurately reflects the implementation: clarify that this path uses the provided
`url_function(file_path)` to derive the URL (and not the base class reading
`url` from frontmatter), and rephrase the sentence to avoid implying the base
class directly consumes frontmatter `url`; reference the `url_function` symbol
and the file's metadata processor/docstring in your change.

Comment on lines +108 to +110
def url_function(self, file_path: str) -> str:
"""Return the basename of the file as the document URL."""
return os.path.basename(file_path)


⚠️ Potential issue | 🟠 Major

Use a unique source key, not only basename.

On Line 108-110, os.path.basename(file_path) can collide for different files with the same filename (e.g., /a/doc.md and /b/doc.md). Since docs_url is later used as a grouping key and citation source, this can merge unrelated documents and corrupt attribution.

Suggested fix
 def url_function(self, file_path: str) -> str:
-    """Return the basename of the file as the document URL."""
-    return os.path.basename(file_path)
+    """Return a stable, path-based identifier for the document source."""
+    return os.path.normpath(file_path).replace(os.sep, "/")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_rag_content/metadata_processor.py` around lines 108 - 110, The
url_function currently returns only os.path.basename(file_path), which can
produce collisions; change url_function to produce a unique source key (e.g.,
use os.path.abspath(file_path) or basename plus a stable hash of the full path
or parent directory) so docs_url (the grouping/citation key) cannot merge
unrelated documents; update url_function to return that unique string (keep the
function name url_function and ensure any code that expects docs_url continues
to receive a stable, filesystem-unique identifier).

@romartin romartin force-pushed the generate-embeddings-cli branch from 899c47e to 4b905cb on March 23, 2026 at 11:31

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
scripts/generate_embeddings.py (1)

84-99: Consider adding basic error handling for the pipeline.

The main() function has no error handling. If process() or save() fails (e.g., invalid folder path, missing model files, disk full), the script will crash with an unhelpful traceback. For a user-facing CLI, consider catching common exceptions and providing actionable error messages.
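The suggested error handling could look like the sketch below. The `run_pipeline` wrapper and its callables are illustrative stand-ins for `DocumentProcessor.process` and `DocumentProcessor.save`; the actual function names and exception set are assumptions, not the merged code:

```python
import sys


def run_pipeline(process, save):
    """Run the embedding pipeline, converting common failures into
    actionable messages and a non-zero exit code instead of a raw
    traceback. `process`/`save` stand in for the DocumentProcessor calls.
    """
    try:
        process()
        save()
    except FileNotFoundError as exc:
        # FileNotFoundError must be caught before OSError (its parent).
        print(f"error: {exc} -- check the input folder path", file=sys.stderr)
        return 1
    except OSError as exc:
        print(f"error: {exc} -- check model files and free disk space",
              file=sys.stderr)
        return 1
    return 0
```

`main()` would then pass the real processor's bound methods and `sys.exit(run_pipeline(...))`, keeping the success path unchanged.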

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/generate_embeddings.py` around lines 84 - 99, Wrap main()'s pipeline
call in a try/except that catches common errors (e.g., FileNotFoundError,
OSError, ValueError, and a broad Exception fallback) around
DocumentProcessor.process and DocumentProcessor.save; use _parse_args and
DefaultMetadataProcessor as before but on error log a concise, actionable
message including the exception message (and optionally hint like "check folder
path" or "check model files"), then exit with a non-zero status. Ensure the
exception handling references DocumentProcessor.process and
DocumentProcessor.save so failures there are handled, and keep the normal flow
unchanged on success.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/generate_embeddings.py`:
- Line 93: The call passing args.model_dir (a str) to DocumentProcessor should
convert the string to a Path to match the embeddings_model_dir type annotation;
update the invocation that sets embeddings_model_dir=args.model_dir to pass a
pathlib.Path instance (use Path(args.model_dir)) so the
DocumentProcessor.__init__ signature and static typing are honored (refer to
embeddings_model_dir, args.model_dir, and DocumentProcessor.__init__ to locate
the change).

---

Nitpick comments:
In `@scripts/generate_embeddings.py`:
- Around line 84-99: Wrap main()'s pipeline call in a try/except that catches
common errors (e.g., FileNotFoundError, OSError, ValueError, and a broad
Exception fallback) around DocumentProcessor.process and DocumentProcessor.save;
use _parse_args and DefaultMetadataProcessor as before but on error log a
concise, actionable message including the exception message (and optionally hint
like "check folder path" or "check model files"), then exit with a non-zero
status. Ensure the exception handling references DocumentProcessor.process and
DocumentProcessor.save so failures there are handled, and keep the normal flow
unchanged on success.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9d8d31ed-084d-4b68-a7ff-07eb05e1e0f0

📥 Commits

Reviewing files that changed from the base of the PR and between 899c47e and 4b905cb.

📒 Files selected for processing (5)
  • scripts/generate_embeddings.py
  • src/lightspeed_rag_content/__init__.py
  • src/lightspeed_rag_content/metadata_processor.py
  • tests/test_generate_embeddings.py
  • tests/test_metadata_processor.py
✅ Files skipped from review due to trivial changes (1)
  • src/lightspeed_rag_content/__init__.py

chunk_size=args.chunk_size,
chunk_overlap=args.chunk_overlap,
model_name=args.model_name,
embeddings_model_dir=args.model_dir,


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify DocumentProcessor.__init__ signature and how embeddings_model_dir is used
ast-grep --pattern $'def __init__(
  $$$
  embeddings_model_dir: $_,
  $$$
)'

Repository: lightspeed-core/rag-content

Length of output: 3613


🏁 Script executed:

# First, examine the generate_embeddings.py file around line 93 and check argparse setup
cat -n scripts/generate_embeddings.py | head -100

Repository: lightspeed-core/rag-content

Length of output: 3862


🏁 Script executed:

# Check if Path is already imported in generate_embeddings.py
rg "from pathlib import Path|import pathlib" scripts/generate_embeddings.py

Repository: lightspeed-core/rag-content

Length of output: 53


🏁 Script executed:

# Look at how _parse_args sets up model_dir to confirm it returns a string
sed -n '30,81p' scripts/generate_embeddings.py

Repository: lightspeed-core/rag-content

Length of output: 1699


🏁 Script executed:

# Check if project has type checking configuration
fd "pyproject.toml|setup.cfg|mypy.ini|pyrightconfig.json" . --type f

Repository: lightspeed-core/rag-content

Length of output: 87


🏁 Script executed:

# Check imports in document_processor.py to confirm Path is used from pathlib
head -30 src/lightspeed_rag_content/document_processor.py

Repository: lightspeed-core/rag-content

Length of output: 1134


🏁 Script executed:

# Check pyproject.toml for type checking configuration
cat pyproject.toml | grep -A 20 "\[tool.mypy\]\|\[tool.pyright\]\|mypy"

Repository: lightspeed-core/rag-content

Length of output: 1066


Type mismatch: embeddings_model_dir expects Path, not str.

args.model_dir is a string from argparse, but DocumentProcessor.__init__ expects embeddings_model_dir as type Path (line 603 in document_processor.py). While this works at runtime due to the str() conversion at line 635, the type annotation should be honored for consistency and static type checking.

🛠️ Proposed fix
+from pathlib import Path
+
 from lightspeed_rag_content.document_processor import DocumentProcessor
 from lightspeed_rag_content.metadata_processor import DefaultMetadataProcessor
     document_processor = DocumentProcessor(
         chunk_size=args.chunk_size,
         chunk_overlap=args.chunk_overlap,
         model_name=args.model_name,
-        embeddings_model_dir=args.model_dir,
+        embeddings_model_dir=Path(args.model_dir),
         vector_store_type=args.vector_store,
         doc_type=args.doc_type,
     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
embeddings_model_dir=args.model_dir,
from pathlib import Path
from lightspeed_rag_content.document_processor import DocumentProcessor
from lightspeed_rag_content.metadata_processor import DefaultMetadataProcessor
...
document_processor = DocumentProcessor(
chunk_size=args.chunk_size,
chunk_overlap=args.chunk_overlap,
model_name=args.model_name,
embeddings_model_dir=Path(args.model_dir),
vector_store_type=args.vector_store,
doc_type=args.doc_type,
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/generate_embeddings.py` at line 93, The call passing args.model_dir
(a str) to DocumentProcessor should convert the string to a Path to match the
embeddings_model_dir type annotation; update the invocation that sets
embeddings_model_dir=args.model_dir to pass a pathlib.Path instance (use
Path(args.model_dir)) so the DocumentProcessor.__init__ signature and static
typing are honored (refer to embeddings_model_dir, args.model_dir, and
DocumentProcessor.__init__ to locate the change).

Contributor

@are-ces are-ces left a comment


LGTM

@are-ces
Contributor

are-ces commented Mar 23, 2026

/ok-to-test

Collaborator

@tisnik tisnik left a comment


LGTM, but please fix CI failures

@romartin
Contributor Author

Added commit for fixing the CI issues. Thanks!

@tisnik tisnik merged commit 2b8c8b0 into lightspeed-core:main Mar 23, 2026
14 of 15 checks passed
@romartin romartin deleted the generate-embeddings-cli branch March 23, 2026 20:41