Add --output-image flag to package vector DB as a container image archive #108
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings, and use the review commands to manage reviews.
Walkthrough

Adds a containerized workflow and CLI options to generate embeddings and optionally package the resulting vector DB (and optionally the embedding model) into a docker-archive (.tar) image via a new pure-Python image builder, Makefile targets, Containerfile additions, and corresponding tests and docs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant User
participant Make
participant Podman
participant Container as ToolContainer
participant CLI as generate_embeddings.py
participant Processor as DocumentProcessor
participant ImageBuilder as image_builder.py
participant FS as Filesystem
User->>Make: make generate-output-image
Make->>Podman: podman run $(TOOL_IMAGE) (mount docs->/input, output->/output)
Podman->>Container: start container
Container->>CLI: invoke script
CLI->>Processor: process documents, build embeddings
Processor-->>CLI: return vector_store_id (optional)
CLI->>FS: write vector DB files to /output
alt --output-image provided
CLI->>ImageBuilder: build_image_archive(vector_db_dir, output_tar, image_name, image_tag, extra_dirs, base_image)
ImageBuilder->>ImageBuilder: fetch base docker-archive (skopeo)
ImageBuilder->>ImageBuilder: create custom layer (vector_db + extra_dirs, gzip .db)
ImageBuilder->>ImageBuilder: update config & manifest, compute digests
ImageBuilder->>FS: write final docker-archive tar to /output
end
    FS-->>User: output archive available
```
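For orientation, here is a minimal sketch of the `build_image_archive` hand-off shown in step 5 of the diagram. The keyword names and paths are assumptions drawn from the diagram and the docstrings quoted later in this review, not a verified API:

```python
from lightspeed_rag_content import image_builder

# Hypothetical invocation mirroring the sequence diagram: package the
# freshly written vector DB (and, optionally, the embeddings model)
# into a docker-archive tar under /output.
image_builder.build_image_archive(
    vector_db_dir="/output/vector_db",
    output_tar_path="/output/my-rag.tar",
    image_name="my-rag",
    image_tag="v1",
    extra_dirs={"/output/embeddings_model": "/rag/embeddings_model"},
    base_image="registry.access.redhat.com/ubi9/ubi:latest",
)
```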
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ❌ 1 failed check (1 warning) | ✅ 2 passed checks
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
🧹 Nitpick comments (3)
Makefile (1)
117-134: Consider adding SELinux labels to volume mounts for Fedora/RHEL systems.

On SELinux-enabled systems (common in Red Hat environments), the volume mounts may need `:Z` or `:z` labels to allow the container to access the mounted directories.

🔧 Proposed fix for SELinux compatibility
```diff
 generate-output-image: ## Run the tool container and produce a vector-DB image archive
 	mkdir -p $(OUTPUT_DIR)
 	podman run --rm \
-		-v $(abspath $(DOCS_DIR)):/input:ro \
-		-v $(abspath $(OUTPUT_DIR)):/output \
+		-v $(abspath $(DOCS_DIR)):/input:ro,Z \
+		-v $(abspath $(OUTPUT_DIR)):/output:Z \
 		$(TOOL_IMAGE) \
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` around lines 117 - 134, The generate-output-image target mounts host dirs into the container without SELinux labels which breaks access on SELinux-enabled systems; update the podman run volume flags in the generate-output-image rule to append a suitable SELinux label (e.g., :Z or :z) to the mounts for $(abspath $(DOCS_DIR)):/input and $(abspath $(OUTPUT_DIR)):/output so the container can access those paths on Fedora/RHEL; keep variable names TOOL_IMAGE, OUTPUT_IMAGE_FILE, OUTPUT_IMAGE_NAME, OUTPUT_IMAGE_TAG unchanged and ensure the chosen label is applied consistently to both -v volume options.

src/lightspeed_rag_content/image_builder.py (1)
33-50: Consider validating that `vector_db_dir` exists.

The function silently proceeds if `vector_db_dir` doesn't exist, producing an empty layer. While `extra_dirs` entries are checked with `os.path.isdir()`, the primary `vector_db_dir` has no such guard. This could lead to a confusing outcome (an image with no vector DB content).

🛡️ Proposed validation
```diff
 def _make_layer_tar(
     vector_db_dir: str,
     extra_dirs: dict[str, str] | None,
 ) -> bytes:
     """Return the raw bytes of a tar archive that forms the single image layer.

     Files from *vector_db_dir* are placed under ``/rag/vector_db/`` inside
     the archive. Any additional source→destination mappings in *extra_dirs*
     are copied verbatim using the supplied destination prefix.
     """
+    if not os.path.isdir(vector_db_dir):
+        raise ValueError(f"vector_db_dir does not exist: {vector_db_dir}")
     buf = io.BytesIO()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_rag_content/image_builder.py` around lines 33 - 50, The function _make_layer_tar currently doesn't validate that vector_db_dir exists; add a check at the start (similar to the os.path.isdir check used for extra_dirs) and raise a clear exception (e.g., FileNotFoundError or ValueError) if vector_db_dir is missing or not a directory, so callers get an explicit error instead of producing an empty layer; keep using the existing _add_dir_to_tar for adding when the path is valid and include the path in the exception message for easier debugging.

tests/test_generate_embeddings.py (1)
22-27: Add a `None` check for `_spec.loader` to satisfy the type checker.

`spec_from_file_location` can return `None`, and even if it returns a spec, `spec.loader` can be `None`. This could cause an `AttributeError` at module load time.

🔧 Proposed fix with assertions
```diff
 _SCRIPT_PATH = Path(__file__).parent.parent / "scripts" / "generate_embeddings.py"
 _spec = importlib.util.spec_from_file_location("generate_embeddings", _SCRIPT_PATH)
+assert _spec is not None, f"Could not load spec from {_SCRIPT_PATH}"
+assert _spec.loader is not None, f"Spec has no loader for {_SCRIPT_PATH}"
 _generate_embeddings = importlib.util.module_from_spec(_spec)
 _spec.loader.exec_module(_generate_embeddings)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_generate_embeddings.py` around lines 22 - 27, The spec returned by importlib.util.spec_from_file_location may be None and its loader may also be None, so before calling _spec.loader.exec_module ensure both _spec and _spec.loader are not None: check if _spec is None or _spec.loader is None and raise a clear ImportError (or assert) with a descriptive message, then proceed to create _generate_embeddings and call _spec.loader.exec_module(_generate_embeddings); reference symbols: importlib.util.spec_from_file_location, _spec, _spec.loader, and _generate_embeddings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/generate_embeddings.py`:
- Around line 114-121: The DocumentProcessor constructor expects
embeddings_model_dir: Path but scripts/generate_embeddings.py passes a string
(args.model_dir); import Path from pathlib and pass a Path-wrapped value (e.g.,
Path(args.model_dir)) when constructing DocumentProcessor so
embeddings_model_dir is a Path; update the DocumentProcessor(...) call to wrap
args.model_dir and add the pathlib import near other imports.
---
Nitpick comments:
In `@Makefile`:
- Around line 117-134: The generate-output-image target mounts host dirs into
the container without SELinux labels which breaks access on SELinux-enabled
systems; update the podman run volume flags in the generate-output-image rule to
append a suitable SELinux label (e.g., :Z or :z) to the mounts for $(abspath
$(DOCS_DIR)):/input and $(abspath $(OUTPUT_DIR)):/output so the container can
access those paths on Fedora/RHEL; keep variable names TOOL_IMAGE,
OUTPUT_IMAGE_FILE, OUTPUT_IMAGE_NAME, OUTPUT_IMAGE_TAG unchanged and ensure the
chosen label is applied consistently to both -v volume options.
In `@src/lightspeed_rag_content/image_builder.py`:
- Around line 33-50: The function _make_layer_tar currently doesn't validate
that vector_db_dir exists; add a check at the start (similar to the
os.path.isdir check used for extra_dirs) and raise a clear exception (e.g.,
FileNotFoundError or ValueError) if vector_db_dir is missing or not a directory,
so callers get an explicit error instead of producing an empty layer; keep using
the existing _add_dir_to_tar for adding when the path is valid and include the
path in the exception message for easier debugging.
In `@tests/test_generate_embeddings.py`:
- Around line 22-27: The spec returned by importlib.util.spec_from_file_location
may be None and its loader may also be None, so before calling
_spec.loader.exec_module ensure both _spec and _spec.loader are not None: check
if _spec is None or _spec.loader is None and raise a clear ImportError (or
assert) with a descriptive message, then proceed to create _generate_embeddings
and call _spec.loader.exec_module(_generate_embeddings); reference symbols:
importlib.util.spec_from_file_location, _spec, _spec.loader, and
_generate_embeddings.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7d5a2a95-a694-4838-94bf-8e58ea9c8eb5
📒 Files selected for processing (5)
- Makefile
- scripts/generate_embeddings.py
- src/lightspeed_rag_content/image_builder.py
- tests/test_generate_embeddings.py
- tests/test_image_builder.py
```python
document_processor = DocumentProcessor(
    chunk_size=args.chunk_size,
    chunk_overlap=args.chunk_overlap,
    model_name=args.model_name,
    embeddings_model_dir=args.model_dir,
    vector_store_type=args.vector_store,
    doc_type=args.doc_type,
)
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Verify DocumentProcessor.__init__ signature for embeddings_model_dir type
ast-grep --pattern $'def __init__(
  self,
  $$$
  embeddings_model_dir: $_,
  $$$
):'
```

Repository: lightspeed-core/rag-content
Length of output: 53
🏁 Script executed:
```bash
# Find the DocumentProcessor class definition
find . -type f -name "*.py" -path "*document_processor*" | head -20
```

Repository: lightspeed-core/rag-content
Length of output: 250
🏁 Script executed:
```bash
# Also search for DocumentProcessor class directly
rg "class DocumentProcessor" -A 5
```

Repository: lightspeed-core/rag-content
Length of output: 507
🏁 Script executed:
```bash
# Check the generate_embeddings.py script
cat scripts/generate_embeddings.py
```

Repository: lightspeed-core/rag-content
Length of output: 4530
🏁 Script executed:
```bash
# Read DocumentProcessor __init__ method to check the embeddings_model_dir parameter type
head -800 src/lightspeed_rag_content/document_processor.py | tail -100
```

Repository: lightspeed-core/rag-content
Length of output: 3726
🏁 Script executed:
```bash
# Search for the __init__ method of DocumentProcessor and show more context
rg "def __init__" -A 30 src/lightspeed_rag_content/document_processor.py | head -50
```

Repository: lightspeed-core/rag-content
Length of output: 1867
Type mismatch: embeddings_model_dir expects Path, not str.
The DocumentProcessor.__init__ signature (line ~737 of src/lightspeed_rag_content/document_processor.py) declares embeddings_model_dir: Path, but the script passes args.model_dir which is a string from argparse. While the code works at runtime because it converts to string internally, this violates the type contract and should be fixed.
🔧 Proposed fix
Add the import:
```diff
 import argparse
+from pathlib import Path
```

And wrap the argument:

```diff
 document_processor = DocumentProcessor(
     chunk_size=args.chunk_size,
     chunk_overlap=args.chunk_overlap,
     model_name=args.model_name,
-    embeddings_model_dir=args.model_dir,
+    embeddings_model_dir=Path(args.model_dir),
     vector_store_type=args.vector_store,
     doc_type=args.doc_type,
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/generate_embeddings.py` around lines 114 - 121, The DocumentProcessor
constructor expects embeddings_model_dir: Path but
scripts/generate_embeddings.py passes a string (args.model_dir); import Path
from pathlib and pass a Path-wrapped value (e.g., Path(args.model_dir)) when
constructing DocumentProcessor so embeddings_model_dir is a Path; update the
DocumentProcessor(...) call to wrap args.model_dir and add the pathlib import
near other imports.
are-ces left a comment
LGTM, very nice. Please update the documentation with this new feature; it seems like a feature users will use extensively 😄
Converting PR to draft as it depends on #104. Once that is merged, I will work on rebasing and CI checks.
Force-pushed b4f7c75 to 81666ca (Compare)
Status update:
Pending:
In the meantime, I am building this rag-content image locally with these changes and testing them in the AAP chatbot (LCS) deployment as BYOK. Once local validation completes, I will update this PR and ask for review again. Thx!
Force-pushed 81666ca to aa6e6b6 (Compare)
```python
import io
import json
import os
import subprocess
```
Check notice (Code scanning / Bandit): Consider possible security implications associated with the subprocess module.
Actionable comments posted: 1
🧹 Nitpick comments (2)
README.md (1)
453-455: Add language specifiers to fenced code blocks.

Static analysis flags these blocks as missing language specifiers. Since they contain file paths, use `text` or leave them empty, but be consistent.

✏️ Proposed fix
````diff
-```
+```text
 src/lightspeed_rag_content/Containerfile.output-image
````

````diff
-```
+```text
 registry.access.redhat.com/ubi9/ubi:latest
````

Also applies to: 466-468

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.
In `@README.md` around lines 453 - 455, Update the README's fenced code blocks that show file paths and image names to include a language specifier (use "text") so static analysis stops flagging them; specifically add `text` after the opening fence for the blocks containing "src/lightspeed_rag_content/Containerfile.output-image" and "registry.access.redhat.com/ubi9/ubi:latest" (and the similar block referenced at lines 466-468) so the fences are used consistently.

tests/test_image_builder.py (1)

120-140: Consider using context managers for file reads.

Multiple tests read the Containerfile without a context manager. While not a bug (Python will close the file handle eventually), using `with` is more explicit.

✏️ Example improvement

```diff
 def test_containerfile_references_ubi9_ubi(self):
-    content = open(_CONTAINERFILE).read()
+    with open(_CONTAINERFILE) as f:
+        content = f.read()
     assert "ubi9/ubi" in content
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.
In `@tests/test_image_builder.py` around lines 120 - 140, Replace bare open(...) calls in the test functions (e.g., test_containerfile_has_base_image, test_containerfile_has_no_run_instruction, test_containerfile_copies_vector_db, test_containerfile_copies_embeddings_model, test_containerfile_sets_non_root_user) with context-managed file opens using the _CONTAINERFILE variable (use "with open(_CONTAINERFILE) as f:" then f.read() or f.readlines()) so files are closed deterministically; update each test to use the with-block and the same read method currently used (read() or readlines()) without changing assertions.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.
Inline comments:
In@README.md:
- Line 555: The README contains a corrupted heading line ("## UpdaUse the plan
from file `@GENERATION.md`... ting Dependencies for Hermetic Builds"); locate
that exact string and replace it with the intended heading "## Updating
Dependencies for Hermetic Builds" so the section title is restored and the
accidental pasted instruction text is removed; ensure surrounding
content/formatting remains unchanged.
Nitpick comments:
In@README.md:
- Around line 453-455: Update the README's fenced code blocks that show file
paths and image names to include a language specifier (use "text") so static
analysis stops flagging them; specifically add `text` after the opening fence for the blocks containing "src/lightspeed_rag_content/Containerfile.output-image" and the block containing "registry.access.redhat.com/ubi9/ubi:latest" (and the similar block referenced at lines 466-468) so the fences are used consistently.

In `@tests/test_image_builder.py`:
- Around line 120-140: Replace bare open(...) calls in the test functions (e.g.,
test_containerfile_has_base_image, test_containerfile_has_no_run_instruction,
test_containerfile_copies_vector_db, test_containerfile_copies_embeddings_model,
test_containerfile_sets_non_root_user) with context-managed file opens using the
_CONTAINERFILE variable (use "with open(_CONTAINERFILE) as f:" then f.read() or
f.readlines()) so files are closed deterministically; update each test to use
the with-block and the same read method currently used (read() or readlines())
without changing assertions.</details> <details> <summary>🪄 Autofix (Beta)</summary> Fix all unresolved CodeRabbit comments on this PR: - [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended) - [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes </details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: Path: .coderabbit.yaml **Review profile**: CHILL **Plan**: Pro **Run ID**: `6027f03c-0b88-481d-a1ba-93fe8d97a2c3` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between 267abc2bfb4e1c0e108dabf435d9555d3312a2d3 and cf7a588c55edf65ded9b257e3ff02a8918006aaf. </details> <details> <summary>📒 Files selected for processing (10)</summary> * `Containerfile` * `Makefile` * `README.md` * `pyproject.toml` * `scripts/generate_embeddings.py` * `src/lightspeed_rag_content/Containerfile.output-image` * `src/lightspeed_rag_content/document_processor.py` * `src/lightspeed_rag_content/image_builder.py` * `tests/test_generate_embeddings.py` * `tests/test_image_builder.py` </details> <details> <summary>✅ Files skipped from review due to trivial changes (3)</summary> * pyproject.toml * Containerfile * src/lightspeed_rag_content/Containerfile.output-image </details> <details> <summary>🚧 Files skipped from review as they are similar to previous changes (2)</summary> * Makefile * scripts/generate_embeddings.py </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
/ok-to-test
/retest
thanks @syedriko, that's an awesome review! working on it :)
Force-pushed cf7a588 to ef2810c (Compare)
@syedriko squashed and applied some of your comments, but still working on others... thx.
Actionable comments posted: 2
♻️ Duplicate comments (4)
README.md (1)
543-543: ⚠️ Potential issue | 🟡 Minor

Restore the section heading text.
This heading still contains an accidental pasted prompt, so the section title and ToC entry are broken.
✏️ Suggested fix
```diff
-## UpdaUse the plan from file `@GENERATION.md` to generate a BYOK RAG image from the sources in `@ansible-collection-docs/`. But as a rag-content (tool) image, don't use the one from 'quay.io' repository, use my local one `localhost/rag-content:local`.ting Dependencies for Hermetic Builds
+## Updating Dependencies for Hermetic Builds
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` at line 543, The section heading in README.md has an accidental pasted prompt; replace the incorrect heading text ("## UpdaUse the plan from file `@GENERATION.md`...") with the intended header "## Updating Dependencies for Hermetic Builds" and ensure the ToC entry matches that exact heading; locate the heading by searching for the malformed line starting with "## UpdaUse" and update it so the heading and table-of-contents link are restored.

src/lightspeed_rag_content/image_builder.py (3)
66-66: ⚠️ Potential issue | 🟠 Major

This hardcodes FAISS-style packaging into every backend.

Every `.db` under `vector_db_dir` is renamed to `*.gz` here, but `scripts/generate_embeddings.py` exposes `--output-image` for all vector-store types. That matches the FAISS deployment artifact shape, not generic backends that still produce a plain `*.db` contract. Please either gate this path to the supported stores or make the compression/layout backend-specific instead of applying it blindly.

Also applies to: 98-106
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_rag_content/image_builder.py` at line 66, The code in image_builder.py calls _add_dir_to_tar(vector_db_dir, "/rag/vector_db", compress_extensions={".db"}) which blindly renames/compresses all .db files (FAISS-specific) for every backend; change this to be backend-aware by checking the vector store type (e.g., a backend identifier passed into the image builder or a config variable) and only apply compress_extensions={".db"} when the backend is FAISS (or the expected packaging), otherwise pass no compress_extensions or use a backend-specific layout/compression policy; update the same logic used around lines referenced (98-106) to mirror this conditional behavior and document/select the policy by backend name (function: _add_dir_to_tar, variable: vector_db_dir, parameter: compress_extensions).
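As a sketch of the gating the prompt asks for, assuming a `vector_store_type` string can be threaded through to the builder (it is not part of the current signature), the compression policy could be a per-backend lookup, with unknown backends defaulting to no compression:

```python
# Hypothetical per-backend packaging policy; only FAISS artifacts are
# gzip-compressed, other stores keep their plain *.db contract.
_COMPRESS_POLICY: dict[str, set[str]] = {
    "faiss": {".db"},
}


def compress_extensions_for(vector_store_type: str) -> set[str]:
    """Return the file suffixes to gzip for this backend (empty set: copy verbatim)."""
    return _COMPRESS_POLICY.get(vector_store_type.lower(), set())


if __name__ == "__main__":
    assert compress_extensions_for("faiss") == {".db"}
    assert compress_extensions_for("postgres") == set()  # unknown backend: untouched
```

The result would then replace the hardcoded `compress_extensions={".db"}` argument to `_add_dir_to_tar`.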
125-134: ⚠️ Potential issue | 🟠 Major

Harden the `skopeo` subprocess call against hangs and opaque failures.

A stalled registry can block this build indefinitely, and `check=True` without captured output drops the useful stderr context when the pull fails. Please resolve the executable explicitly and wrap timeout/process failures with a clearer error.

🔧 Suggested hardening
```diff
+import shutil
+
 ...
 def _fetch_base_archive(base_image: str, dest_tar: str) -> None:
     """Download *base_image* as a docker-archive tar using ``skopeo copy``.
@@
-    subprocess.run(
-        [
-            "skopeo",
-            "copy",
-            "--remove-signatures",
-            f"docker://{base_image}",
-            f"docker-archive:{dest_tar}",
-        ],
-        check=True,
-    )
+    skopeo = shutil.which("skopeo")
+    if skopeo is None:
+        raise RuntimeError("skopeo not found on PATH")
+
+    try:
+        subprocess.run(
+            [
+                skopeo,
+                "copy",
+                "--remove-signatures",
+                f"docker://{base_image}",
+                f"docker-archive:{dest_tar}",
+            ],
+            check=True,
+            capture_output=True,
+            text=True,
+            timeout=300,
+        )
+    except subprocess.TimeoutExpired as exc:
+        raise RuntimeError(f"skopeo timed out while pulling {base_image}") from exc
+    except subprocess.CalledProcessError as exc:
+        detail = exc.stderr.strip() or exc.stdout.strip()
+        raise RuntimeError(f"skopeo failed for {base_image}: {detail}") from exc
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_rag_content/image_builder.py` around lines 125 - 134, The subprocess.run call that invokes skopeo using base_image and dest_tar should be hardened: resolve the skopeo executable via shutil.which and raise a clear error if missing, call subprocess.run with timeout and capture_output=True (or stdout/stderr pipes), and wrap it in try/except to catch subprocess.TimeoutExpired, subprocess.CalledProcessError and FileNotFoundError; on failure raise a new exception that includes the base_image, dest_tar, the timed-out status or returncode and the captured stderr/stdout to avoid opaque failures. Locate the invocation of subprocess.run (the skopeo copy call) and replace it with the resolved executable path, timeout, captured output, and explicit exception handling that surfaces helpful error text.
64-71: ⚠️ Potential issue | 🟠 Major

The archive builder still scales with total image size in RAM.

`_make_layer_tar()` returns one full `bytes`, `_read_docker_archive()` reads every base layer into memory, and `build_image_archive()` keeps all of that alive until the final tar write completes. Including a large model or moving to a larger base image can OOM the unprivileged-container workflow this feature targets. Please stream layer hashing/copying instead of materializing the whole archive first.

Also applies to: 147-163, 236-263
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_rag_content/image_builder.py` around lines 64 - 71, The current implementation materializes entire layer archives in memory causing OOM; change _make_layer_tar, _read_docker_archive, and build_image_archive to stream layer data instead of returning/holding bytes: have _make_layer_tar write the tar directly to a file-like stream (or temporary file) and return a readable file-like object or path, update _read_docker_archive to iterate/read base layers in chunks (using streaming reads and hashing as it copies) rather than loading into memory, and modify build_image_archive to assemble the final image tar by copying layer streams into the output tar incrementally (using copyfileobj or tarfile.addfile with a stream) and compute hashes on-the-fly so no full-layer bytes are retained in RAM. Ensure APIs reference _make_layer_tar, _read_docker_archive, and build_image_archive are updated to accept/return streams or paths and to close/cleanup temp files when done.
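One possible shape for the streaming refactor is to hash and copy each layer in fixed-size chunks so peak memory stays constant regardless of layer size. A self-contained sketch (the function name is illustrative, not the module's current API):

```python
import hashlib
from typing import BinaryIO

_CHUNK = 8 * 1024 * 1024  # 8 MiB


def copy_with_sha256(src: BinaryIO, dst: BinaryIO) -> str:
    """Copy *src* to *dst* chunk by chunk, returning the SHA-256 of the bytes copied."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: src.read(_CHUNK), b""):
        digest.update(chunk)
        dst.write(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import io

    src, dst = io.BytesIO(b"layer bytes"), io.BytesIO()
    sha = copy_with_sha256(src, dst)
    assert dst.getvalue() == b"layer bytes"
    assert sha == hashlib.sha256(b"layer bytes").hexdigest()
```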
🧹 Nitpick comments (1)
Makefile (1)
121-133: Make `OUTPUT_IMAGE_FILE` either authoritative or basename-only.

`basename` drops any overridden directory, so `OUTPUT_IMAGE_FILE=dist/images/foo.tar` still writes `/output/foo.tar` while the status line claims `dist/images/foo.tar`. Either mount the real parent directory or document this variable as a filename under `OUTPUT_DIR`.
Verify each finding against the current code and only fix it if needed. In `@Makefile` around lines 121 - 133, The Makefile currently strips directories from OUTPUT_IMAGE_FILE via basename causing mismatched status vs actual location; fix by making OUTPUT_IMAGE_FILE authoritative or documenting it as basename-only: either (A) treat OUTPUT_IMAGE_FILE as a full path — mount its parent into the container (use the parent directory of OUTPUT_IMAGE_FILE instead of always mounting OUTPUT_DIR) and pass the same full path to --output-image so the container writes to the authoritative path, or (B) enforce basename-only semantics — set/output a basename variable (e.g. OUTFILE from OUTPUT_IMAGE_FILE) and update documentation for OUTPUT_IMAGE_FILE to indicate it should be a filename only under OUTPUT_DIR; update the Makefile logic around OUTPUT_IMAGE_FILE, the mounted volume arguments, and the --output-image argument accordingly (look for usage of OUTPUT_IMAGE_FILE, OUTPUT_DIR, and the --output-image flag in the Makefile and the podman run invocation).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@README.md`:
- Around line 445-447: Add a clear prerequisite for host-side runs stating that
skopeo must be installed locally when using the host example that runs "uv run
python scripts/generate_embeddings.py --output-image" (or alternatively replace
that host example with the containerized Makefile flow), and repair the
corrupted heading text near "UpdaUse" by restoring it to a valid markdown
heading (e.g., a clear "Update: Use the plan from file..." style heading) so the
README renders correctly; update the README lines containing the host example
and the corrupted heading accordingly.
In `@scripts/generate_embeddings.py`:
- Around line 133-136: When args.output_image is set and args.include_model is
True, validate that args.model_dir exists and is a directory before building
extra_dirs so we fail fast on typos; if the path is missing, exit with an error
(or call the argument parser error) telling the user to provide a correct
--model-dir or pass --exclude-model, rather than letting build_image_archive
silently omit the model; update the logic around extra_dirs / args.include_model
in generate_embeddings.py to perform this check (referencing args.output_image,
args.include_model, args.model_dir, and extra_dirs).
---
Duplicate comments:
In `@README.md`:
- Line 543: The section heading in README.md has an accidental pasted prompt;
replace the incorrect heading text ("## UpdaUse the plan from file
`@GENERATION.md`...") with the intended header "## Dependencies for Hermetic
Builds" and ensure the ToC entry matches that exact heading; locate the heading
by searching for the malformed line starting with "## UpdaUse" and update it to
"## Dependencies for Hermetic Builds" so the heading and table-of-contents link
are restored.
In `@src/lightspeed_rag_content/image_builder.py`:
- Line 66: The code in image_builder.py calls _add_dir_to_tar(vector_db_dir,
"/rag/vector_db", compress_extensions={".db"}) which blindly renames/compresses
all .db files (FAISS-specific) for every backend; change this to be
backend-aware by checking the vector store type (e.g., a backend identifier
passed into the image builder or a config variable) and only apply
compress_extensions={".db"} when the backend is FAISS (or the expected
packaging), otherwise pass no compress_extensions or use a backend-specific
layout/compression policy; update the same logic used around lines referenced
(98-106) to mirror this conditional behavior and document/select the policy by
backend name (function: _add_dir_to_tar, variable: vector_db_dir, parameter:
compress_extensions).
- Around line 125-134: The subprocess.run call that invokes skopeo using
base_image and dest_tar should be hardened: resolve the skopeo executable via
shutil.which and raise a clear error if missing, call subprocess.run with
timeout and capture_output=True (or stdout/stderr pipes), and wrap it in
try/except to catch subprocess.TimeoutExpired, subprocess.CalledProcessError and
FileNotFoundError; on failure raise a new exception that includes the
base_image, dest_tar, the timed-out status or returncode and the captured
stderr/stdout to avoid opaque failures. Locate the invocation of subprocess.run
(the skopeo copy call) and replace it with the resolved executable path,
timeout, captured output, and explicit exception handling that surfaces helpful
error text.
- Around line 64-71: The current implementation materializes entire layer
archives in memory causing OOM; change _make_layer_tar, _read_docker_archive,
and build_image_archive to stream layer data instead of returning/holding bytes:
have _make_layer_tar write the tar directly to a file-like stream (or temporary
file) and return a readable file-like object or path, update
_read_docker_archive to iterate/read base layers in chunks (using streaming
reads and hashing as it copies) rather than loading into memory, and modify
build_image_archive to assemble the final image tar by copying layer streams
into the output tar incrementally (using copyfileobj or tarfile.addfile with a
stream) and compute hashes on-the-fly so no full-layer bytes are retained in
RAM. Ensure APIs reference _make_layer_tar, _read_docker_archive, and
build_image_archive are updated to accept/return streams or paths and to
close/cleanup temp files when done.
---
Nitpick comments:
In `@Makefile`:
- Around line 121-133: The Makefile currently strips directories from
OUTPUT_IMAGE_FILE via basename causing mismatched status vs actual location; fix
by making OUTPUT_IMAGE_FILE authoritative or documenting it as basename-only:
either (A) treat OUTPUT_IMAGE_FILE as a full path — mount its parent into the
container (use the parent directory of OUTPUT_IMAGE_FILE instead of always
mounting OUTPUT_DIR) and pass the same full path to --output-image so the
container writes to the authoritative path, or (B) enforce basename-only
semantics — set/output a basename variable (e.g. OUTFILE from OUTPUT_IMAGE_FILE)
and update documentation for OUTPUT_IMAGE_FILE to indicate it should be a
filename only under OUTPUT_DIR; update the Makefile logic around
OUTPUT_IMAGE_FILE, the mounted volume arguments, and the --output-image argument
accordingly (look for usage of OUTPUT_IMAGE_FILE, OUTPUT_DIR, and the
--output-image flag in the Makefile and the podman run invocation).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c35e761c-97d4-4aa3-96e1-181feb8ca7b2
📒 Files selected for processing (10)
- Containerfile
- Makefile
- README.md
- pyproject.toml
- scripts/generate_embeddings.py
- src/lightspeed_rag_content/Containerfile.output-image
- src/lightspeed_rag_content/document_processor.py
- src/lightspeed_rag_content/image_builder.py
- tests/test_generate_embeddings.py
- tests/test_image_builder.py
✅ Files skipped from review due to trivial changes (3)
- pyproject.toml
- Containerfile
- src/lightspeed_rag_content/Containerfile.output-image
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lightspeed_rag_content/document_processor.py
are-ces left a comment
Please also fix the Black formatting and linting issues.
```python
subprocess.run(
    [
```
Check notice (Code scanning / Bandit): subprocess call - check for execution of untrusted input.
Actionable comments posted: 2
♻️ Duplicate comments (2)
src/lightspeed_rag_content/image_builder.py (1)
210-210: ⚠️ Potential issue | 🟡 Minor

Fix Ruff RUF005 lint error: prefer unpacking over concatenation.

The CI pipeline fails because Ruff prefers `[*base_layer_names, layer_entry]` over `base_layer_names + [layer_entry]`.

🔧 Proposed fix
```diff
 manifest = [
     {
         "Config": config_entry,
         "RepoTags": [f"{image_name}:{image_tag}"],
-        "Layers": base_layer_names + [layer_entry],
+        "Layers": [*base_layer_names, layer_entry],
     }
 ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_rag_content/image_builder.py` at line 210, Replace the list concatenation used for the "Layers" value with unpacking to satisfy Ruff RUF005: change the expression that currently builds the layers from base_layer_names + [layer_entry] to use a list with the starred unpacking of base_layer_names and layer_entry (i.e., [*base_layer_names, layer_entry]); locate the usage in image_builder.py where the "Layers" key is assigned and update the expression referencing base_layer_names and layer_entry accordingly.

README.md (1)
445-468: ⚠️ Potential issue | 🟡 Minor

Host-side usage requires `skopeo` to be installed locally.

Lines 445-447 mention `skopeo` is included in the RAG tool container image, but the example at lines 460-468 runs `uv run python scripts/generate_embeddings.py` directly on the host. Users following this path need `skopeo` installed locally.

Consider adding a prerequisite note for host-side execution:
📝 Suggested addition after line 447
```diff
 The image is built using `skopeo` (included in the RAG tool container image) to fetch the base image and pure Python to compose the final archive — no container daemon or elevated privileges required.
+
+> **Note:** When running directly on the host (not inside the container),
+> ensure `skopeo` is installed and available in your `PATH`.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 445 - 468, Add a short prerequisite note after the paragraph that mentions skopeo is included in the container (after the text that currently says "The image is built using `skopeo`...") clarifying that if users run the example command on the host (the `uv run python scripts/generate_embeddings.py ...` invocation) they must have `skopeo` installed locally; update README.md near the Containerfile/base-image text to either add a "Prerequisites" bullet stating "skopeo must be installed on the host when running host-side commands" or insert a one-line parenthetical note before the "Basic usage" example referencing the need for local skopeo.
🧹 Nitpick comments (2)
tests/test_image_builder.py (1)
120-141: Use context managers for file reads.

Multiple `open(_CONTAINERFILE).read()` calls don't use context managers, leaving file handles to garbage collection. While unlikely to cause issues in tests, it's cleaner to use `with` statements.

♻️ Suggested refactor
```diff
 def test_containerfile_references_ubi9_ubi(self):
-    content = open(_CONTAINERFILE).read()
+    with open(_CONTAINERFILE) as f:
+        content = f.read()
     assert "ubi9/ubi" in content

 def test_containerfile_has_no_run_instruction(self):
     # No RUN instruction — packages come from the base image itself.
-    lines = open(_CONTAINERFILE).readlines()
+    with open(_CONTAINERFILE) as f:
+        lines = f.readlines()
     assert not any(line.startswith("RUN ") for line in lines)

 def test_containerfile_copies_vector_db(self):
-    content = open(_CONTAINERFILE).read()
+    with open(_CONTAINERFILE) as f:
+        content = f.read()
     assert "COPY --from=vector_db" in content
     assert "/rag/vector_db/" in content

 def test_containerfile_copies_embeddings_model(self):
-    content = open(_CONTAINERFILE).read()
+    with open(_CONTAINERFILE) as f:
+        content = f.read()
     assert "COPY --from=embeddings_model" in content
     assert "/rag/embeddings_model/" in content

 def test_containerfile_sets_non_root_user(self):
-    content = open(_CONTAINERFILE).read()
+    with open(_CONTAINERFILE) as f:
+        content = f.read()
     assert "USER 65532" in content
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_image_builder.py` around lines 120 - 141, The tests repeatedly call open(_CONTAINERFILE).read() and open(_CONTAINERFILE).readlines() without context managers; update the test functions (test_containerfile_references_ubi9_ubi, test_containerfile_has_no_run_instruction, test_containerfile_copies_vector_db, test_containerfile_copies_embeddings_model, test_containerfile_sets_non_root_user) to open the file using with open(_CONTAINERFILE, 'r') as f: and either f.read() or f.readlines(), ideally reading the file once into a local variable (e.g., content or lines) and reusing that across assertions to close the file promptly and avoid leaving file handles to garbage collection.

src/lightspeed_rag_content/image_builder.py (1)
103-111: Large `.db` files are fully loaded into memory during gzip compression.

While the layer tar streaming was addressed, the gzip compression path still reads the entire source file into memory via `gz.write(f.read())`. For very large vector database files, this could cause memory pressure.

Consider streaming the compression:

♻️ Suggested streaming approach
```diff
 if compress_extensions and suffix in compress_extensions:
     buf = io.BytesIO()
     with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
         with open(abs_path, "rb") as f:
-            gz.write(f.read())
+            for chunk in iter(lambda: f.read(8 * 1024 * 1024), b""):
+                gz.write(chunk)
     data = buf.getvalue()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_rag_content/image_builder.py` around lines 103 - 111, The current gzip path in image_builder.py reads the whole file into memory via gz.write(f.read()); change it to stream-compress to avoid memory spikes by writing compressed bytes to a temporary file instead of a BytesIO: open abs_path in binary, open a tempfile.NamedTemporaryFile(mode="wb+", delete=True), copy from the source to a gzip.GzipFile(fileobj=tempfile) in chunks (or use shutil.copyfileobj), flush and seek to start, set TarInfo(name=arcname + ".gz") with info.size = tempfile.tell(), then call tar.addfile(info, tempfile) and close/remove the temp file; keep the existing condition using compress_extensions and suffix.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@README.md`:
- Around line 454-456: The fenced code block containing the container image
reference `registry.access.redhat.com/ubi9/ubi:latest` lacks a language
specifier; update the block to include a language (e.g., change ``` to ```text)
so the fenced code becomes ```text followed by the image reference and closing
``` to satisfy MD040.
In `@src/lightspeed_rag_content/image_builder.py`:
- Around line 1-283: The file needs to be reformatted with Black; run the
provided command (uv tool run black
src/lightspeed_rag_content/image_builder.py), stage and commit the resulting
changes so CI passes; ensure you format the whole file (functions like
_make_layer_tar, _add_dir_to_tar, _fetch_base_archive, _build_image_config, and
build_image_archive) and include the updated file in your PR (or add Black to
pre-commit to avoid future CI failures).
---
Duplicate comments:
In `@README.md`:
- Around line 445-468: Add a short prerequisite note after the paragraph that
mentions skopeo is included in the container (after the text that currently says
"The image is built using `skopeo`...") clarifying that if users run the example
command on the host (the `uv run python scripts/generate_embeddings.py ...`
invocation) they must have `skopeo` installed locally; update README.md near the
Containerfile/base-image text to either add a "Prerequisites" bullet stating
"skopeo must be installed on the host when running host-side commands" or insert
a one-line parenthetical note before the "Basic usage" example referencing the
need for local skopeo.
In `@src/lightspeed_rag_content/image_builder.py`:
- Line 210: Replace the list concatenation used for the "Layers" value with
unpacking to satisfy Ruff RUF005: change the expression that currently builds
the layers from base_layer_names + [layer_entry] to use a list with the starred
unpacking of base_layer_names and layer_entry (i.e., [*base_layer_names,
layer_entry]); locate the usage in image_builder.py where the "Layers" key is
assigned and update the expression referencing base_layer_names and layer_entry
accordingly.
---
Nitpick comments:
In `@src/lightspeed_rag_content/image_builder.py`:
- Around line 103-111: The current gzip path in image_builder.py reads the whole
file into memory via gz.write(f.read()); change it to stream-compress to avoid
memory spikes by writing compressed bytes to a temporary file instead of a
BytesIO: open abs_path in binary, open a tempfile.NamedTemporaryFile(mode="wb+",
delete=True), copy from the source to a gzip.GzipFile(fileobj=tempfile) in
chunks (or use shutil.copyfileobj), flush and seek to start, set
TarInfo(name=arcname + ".gz") with info.size = tempfile.tell(), then call
tar.addfile(info, tempfile) and close/remove the temp file; keep the existing
condition using compress_extensions and suffix.
In `@tests/test_image_builder.py`:
- Around line 120-141: The tests repeatedly call open(_CONTAINERFILE).read() and
open(_CONTAINERFILE).readlines() without context managers; update the test
functions (test_containerfile_references_ubi9_ubi,
test_containerfile_has_no_run_instruction, test_containerfile_copies_vector_db,
test_containerfile_copies_embeddings_model,
test_containerfile_sets_non_root_user) to open the file using with
open(_CONTAINERFILE, 'r') as f: and either f.read() or f.readlines(), ideally
reading the file once into a local variable (e.g., content or lines) and reusing
that across assertions to close the file promptly and avoid leaving file handles
to garbage collection.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 40a0a0a9-ff1a-4e5f-b403-f28ee5ce37fb
📒 Files selected for processing (5)
- README.md
- scripts/generate_embeddings.py
- src/lightspeed_rag_content/image_builder.py
- tests/test_generate_embeddings.py
- tests/test_image_builder.py
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/generate_embeddings.py
````text
```
registry.access.redhat.com/ubi9/ubi:latest
```
````
Add a language specifier to the fenced code block.

Markdownlint reports MD040: fenced code blocks should have a language specified. Since this is a container image reference, use `text` or leave it empty, but keep the fence well-formed.
📝 Suggested fix
````diff
-```
+```text
 registry.access.redhat.com/ubi9/ubi:latest
````

🧰 Tools

🪛 markdownlint-cli2 (0.22.0)

[warning] 454-454: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.
In `@README.md` around lines 454 - 456, The fenced code block containing the container image reference `registry.access.redhat.com/ubi9/ubi:latest` lacks a language specifier; update the block to include a language (e.g., add `text` after the opening fence) so the fenced code satisfies MD040.
Just one existing comment pending to address: #108 (comment). Anyway, I think it is ready for early review again. Thanks
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lightspeed_rag_content/image_builder.py`:
- Around line 105-109: The code currently calls _gzip_compress(abs_path) which
buffers the entire compressed payload in memory before tar.addfile; instead,
stream the compression to disk to avoid OOMs: for the block handling
compress_extensions (where suffix, arcname, abs_path and tar.addfile are used),
open a tempfile.NamedTemporaryFile (or TemporaryFile) and stream-compress the
source by reading abs_path in chunks into a gzip.GzipFile writing to that temp
file, close/flush it, stat the temp file to get the compressed size, create a
tarfile.TarInfo(name=arcname + ".gz") with info.size set to the temp file size,
reopen the temp file for reading in binary and pass that file object to
tar.addfile(info, fileobj), then delete the temp file; apply the same change for
the other similar block around lines 114-120 instead of using _gzip_compress().
- Around line 250-255: Before calling _make_layer_tar, validate that
vector_db_dir exists and is a directory and fail fast if not; specifically, in
the function that creates the temp layer (the block that calls _make_layer_tar
with vector_db_dir and extra_dirs), add an early check like
os.path.exists(vector_db_dir) and os.path.isdir(vector_db_dir) and raise a clear
exception (e.g., ValueError or FileNotFoundError) if the check fails so
_add_dir_to_tar is never invoked with a missing path; keep validation for
extra_dirs as-is and reference vector_db_dir, _make_layer_tar, and
_add_dir_to_tar in your change.
- Around line 95-96: The os.walk loop over src_dir (and the similar loop
handling extra_dirs) currently ignores symlinked directories and tar.add(...,
recursive=False) preserves symlinks, which can leave missing targets in the
image; update the loops that iterate src_dir and extra_dirs to explicitly detect
symlinks (os.path.islink) and either (a) raise/return an error rejecting
symlinked dirs with a clear message referencing src_dir/extra_dirs, or (b)
dereference them when adding to the tar by resolving targets (os.path.realpath)
and adding the target file contents instead of the symlink (ensuring you replace
tar.add(..., recursive=False) behavior accordingly). Locate and change the code
around the os.walk(...) loop and the tar.add usage to implement one of these two
behaviors and apply the same change to the analogous block that processes
extra_dirs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 96ca5658-f122-47db-9972-752cdac0a312
📒 Files selected for processing (1)
- src/lightspeed_rag_content/image_builder.py
```python
for root, dirs, files in os.walk(src_dir):
    dirs.sort()
```
🧩 Analysis chain
🏁 Script executed:
```bash
# Find and examine the image_builder.py file
find . -name "image_builder.py" -type f
```

Repository: lightspeed-core/rag-content
Length of output: 116
🏁 Script executed:
```bash
# Read the image_builder.py file to see the context around lines 95-96 and 110-111
wc -l src/lightspeed_rag_content/image_builder.py
```

Repository: lightspeed-core/rag-content
Length of output: 118
🏁 Script executed:
```bash
# Read the relevant sections with context
sed -n '90,120p' src/lightspeed_rag_content/image_builder.py
```

Repository: lightspeed-core/rag-content
Length of output: 1367
🏁 Script executed:
```bash
# Get the function definition and signature
sed -n '80,115p' src/lightspeed_rag_content/image_builder.py
```

Repository: lightspeed-core/rag-content
Length of output: 1485
🏁 Script executed:
```bash
# Search for usages of this function
rg -n "tar.add\|_add_dir_to_tar" src/lightspeed_rag_content/
```

Repository: lightspeed-core/rag-content
Length of output: 53
🏁 Script executed:
```bash
# Look for the function name
rg -n "def.*add.*tar" src/lightspeed_rag_content/image_builder.py
```

Repository: lightspeed-core/rag-content
Length of output: 168
🏁 Script executed:
```bash
# Find where _add_dir_to_tar is called
rg -n "_add_dir_to_tar" src/
```

Repository: lightspeed-core/rag-content
Length of output: 311
🏁 Script executed:
```bash
# Look at generate_embeddings.py as mentioned in the review
find . -name "generate_embeddings.py" -type f
```

Repository: lightspeed-core/rag-content
Length of output: 103
🏁 Script executed:
```bash
# Look at the context where _add_dir_to_tar is called (lines 60-80)
sed -n '60,80p' src/lightspeed_rag_content/image_builder.py
```

Repository: lightspeed-core/rag-content
Length of output: 871
🏁 Script executed:
```bash
# Also check the section mentioned in review (around 110-111)
sed -n '105,120p' src/lightspeed_rag_content/image_builder.py
```

Repository: lightspeed-core/rag-content
Length of output: 718
🏁 Script executed:
```bash
# Check the generate_embeddings.py file to see if it uses extra_dirs with symlinks
cat scripts/generate_embeddings.py
```

Repository: lightspeed-core/rag-content
Length of output: 5129
🏁 Script executed:
```bash
# Verify os.walk() default behavior with symlinks
python3 << 'EOF'
import os
import tempfile
import shutil

# Create a test directory structure with symlinks
with tempfile.TemporaryDirectory() as tmpdir:
    # Create a source directory
    src = os.path.join(tmpdir, "src")
    os.makedirs(src)

    # Create a regular subdirectory
    regular_dir = os.path.join(src, "regular")
    os.makedirs(regular_dir)
    with open(os.path.join(regular_dir, "file.txt"), "w") as f:
        f.write("test")

    # Create another directory to link to
    linked_content = os.path.join(tmpdir, "linked_content")
    os.makedirs(linked_content)
    with open(os.path.join(linked_content, "linked_file.txt"), "w") as f:
        f.write("linked")

    # Create a symlink to a directory
    symlink_path = os.path.join(src, "symlink_dir")
    os.symlink(linked_content, symlink_path)

    # Test os.walk default behavior (followlinks=False)
    print("os.walk with followlinks=False (default):")
    for root, dirs, files in os.walk(src):
        print(f"  Root: {os.path.relpath(root, src)}, Files: {files}, Dirs: {dirs}")

    print("\nos.walk with followlinks=True:")
    for root, dirs, files in os.walk(src, followlinks=True):
        print(f"  Root: {os.path.relpath(root, src)}, Files: {files}, Dirs: {dirs}")
EOF
```

Repository: lightspeed-core/rag-content
Length of output: 406
🏁 Script executed:
```bash
# Verify tar.add behavior with symlinks
python3 << 'EOF'
import os
import tempfile
import tarfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Create source structure with symlink
    src = os.path.join(tmpdir, "src")
    os.makedirs(src)

    # Create a regular file
    with open(os.path.join(src, "regular.txt"), "w") as f:
        f.write("regular")

    # Create a file and a symlink to it
    target_file = os.path.join(src, "target.txt")
    with open(target_file, "w") as f:
        f.write("target content")
    symlink_file = os.path.join(src, "symlink.txt")
    os.symlink(target_file, symlink_file)

    # Create tar with recursive=False (used in the code)
    tar_path = os.path.join(tmpdir, "test.tar")
    with tarfile.open(tar_path, "w") as tar:
        tar.add(symlink_file, arcname="symlink.txt", recursive=False)

    # Check what's in the tar
    print("Tar contents (with symlink added using recursive=False):")
    with tarfile.open(tar_path, "r") as tar:
        for member in tar.getmembers():
            print(f"  Name: {member.name}, Type: {member.type}, Issym: {member.issym()}, Isreg: {member.isreg()}")
EOF
```

Repository: lightspeed-core/rag-content
Length of output: 186
🏁 Script executed:
```bash
# Check if there's any symlink handling in the codebase
rg -i "symlink|followlinks|dereferenc" src/ scripts/
```

Repository: lightspeed-core/rag-content
Length of output: 53
🏁 Script executed:
```bash
# Look at the complete function that calls _add_dir_to_tar to understand the full context
sed -n '55,78p' src/lightspeed_rag_content/image_builder.py
```

Repository: lightspeed-core/rag-content
Length of output: 1044
🏁 Script executed:
```bash
# Check if there are any tests or documentation about symlink handling
fd -e py -e md -e txt | xargs rg -l "symlink" 2>/dev/null | head -5
```

Repository: lightspeed-core/rag-content
Length of output: 53
Handle symlinks explicitly when copying source directories.

`os.walk()` skips symlinked directories by default, and `tar.add(..., recursive=False)` preserves file symlinks instead of copying their targets. If `extra_dirs` points to a model snapshot or cache tree with symlinks, the resulting image will ship with missing files or broken links. Either reject symlinks up front or dereference them while writing tar members.

Also applies to: 110-111
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lightspeed_rag_content/image_builder.py` around lines 95 - 96, The
os.walk loop over src_dir (and the similar loop handling extra_dirs) currently
ignores symlinked directories and tar.add(..., recursive=False) preserves
symlinks, which can leave missing targets in the image; update the loops that
iterate src_dir and extra_dirs to explicitly detect symlinks (os.path.islink)
and either (a) raise/return an error rejecting symlinked dirs with a clear
message referencing src_dir/extra_dirs, or (b) dereference them when adding to
the tar by resolving targets (os.path.realpath) and adding the target file
contents instead of the symlink (ensuring you replace tar.add(...,
recursive=False) behavior accordingly). Locate and change the code around the
os.walk(...) loop and the tar.add usage to implement one of these two behaviors
and apply the same change to the analogous block that processes extra_dirs.
```python
if compress_extensions and suffix in compress_extensions:
    data = _gzip_compress(abs_path)
    info = tarfile.TarInfo(name=arcname + ".gz")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
```
Stream the `.db` compression path instead of buffering it in memory.

`_gzip_compress()` still reads the full DB into memory and keeps the full `.gz` payload in a `BytesIO` before `tar.addfile()` sees it. For large vector stores, that brings peak RAM back to O(file_size) and can still OOM the build despite the streaming refactor elsewhere.
Also applies to: 114-120
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lightspeed_rag_content/image_builder.py` around lines 105 - 109, The code
currently calls _gzip_compress(abs_path) which buffers the entire compressed
payload in memory before tar.addfile; instead, stream the compression to disk to
avoid OOMs: for the block handling compress_extensions (where suffix, arcname,
abs_path and tar.addfile are used), open a tempfile.NamedTemporaryFile (or
TemporaryFile) and stream-compress the source by reading abs_path in chunks into
a gzip.GzipFile writing to that temp file, close/flush it, stat the temp file to
get the compressed size, create a tarfile.TarInfo(name=arcname + ".gz") with
info.size set to the temp file size, reopen the temp file for reading in binary
and pass that file object to tar.addfile(info, fileobj), then delete the temp
file; apply the same change for the other similar block around lines 114-120
instead of using _gzip_compress().
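A sketch of the temp-file approach the prompt describes: stream-compress in chunks to an anonymous temporary file, then hand the open handle to `tar.addfile()`. Memory stays constant at the cost of temporary disk space (the helper name is illustrative):

```python
import gzip
import shutil
import tarfile
import tempfile


def add_gzipped_file(tar: tarfile.TarFile, abs_path: str, arcname: str) -> None:
    """Stream-compress *abs_path* into *tar* as ``arcname + ".gz"``."""
    with tempfile.TemporaryFile() as tmp:
        with open(abs_path, "rb") as src, gzip.GzipFile(fileobj=tmp, mode="wb") as gz:
            shutil.copyfileobj(src, gz, length=8 * 1024 * 1024)  # 8 MiB chunks
        info = tarfile.TarInfo(name=arcname + ".gz")
        info.size = tmp.tell()  # gzip trailer is flushed when the GzipFile closes
        tmp.seek(0)
        tar.addfile(info, tmp)  # tarfile reads exactly info.size bytes from tmp
```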
```python
with tempfile.TemporaryDirectory() as tmpdir:
    # 1. Build the custom layer to a temp file; compute SHA-256 by reading
    #    it back in chunks so it never needs to coexist in RAM with the
    #    base image layers.
    layer_file = os.path.join(tmpdir, "layer.tar")
    layer_sha = _make_layer_tar(vector_db_dir, extra_dirs, layer_file)
```
Add validation to fail fast if vector_db_dir does not exist.
This is the only required input, but it is currently not validated before the layer build starts. os.walk() silently yields nothing for a missing directory, resulting in a valid-looking archive with an empty /rag/vector_db layer instead of surfacing the real problem. Note that extra_dirs entries are already validated (line 72), but vector_db_dir is passed directly to _add_dir_to_tar() without any check.
Suggested fix
```diff
     Args:
         vector_db_dir: Directory whose contents are placed at ``/rag/vector_db``
             inside the image.
         output_tar_path: Destination path for the ``.tar`` archive.
         image_name: Repository name for the image (e.g. ``my-rag``).
         image_tag: Tag for the image (e.g. ``v1``).
         extra_dirs: Optional mapping of ``{local_src_path: image_dest_path}``.
             Each existing source directory is added under the given image path.
         base_image: Parent image reference (default: UBI 9 full).
     """
+    if not os.path.isdir(vector_db_dir):
+        raise FileNotFoundError(
+            f"Vector DB directory not found: {vector_db_dir}"
+        )
+
     with tempfile.TemporaryDirectory() as tmpdir:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lightspeed_rag_content/image_builder.py` around lines 250-255, before
calling _make_layer_tar, validate that vector_db_dir exists and is a directory
and fail fast if not; specifically, in the function that creates the temp layer
(the block that calls _make_layer_tar with vector_db_dir and extra_dirs), add an
early check like os.path.exists(vector_db_dir) and os.path.isdir(vector_db_dir)
and raise a clear exception (e.g., ValueError or FileNotFoundError) if the check
fails so _add_dir_to_tar is never invoked with a missing path; keep validation
for extra_dirs as-is and reference vector_db_dir, _make_layer_tar, and
_add_dir_to_tar in your change.
Hey @tisnik @are-ces @syedriko, everything is done from my side, including the rpm lock updates, plus the previous updates described here: #108 (comment). So it is now ready for review again, thanks guys!
Implement a privilege-free mechanism to build a Docker-archive tar from a local vector DB directory using skopeo + pure Python layer composition. Includes skopeo RPM dependency (and transitive deps) for aarch64/x86_64, and makes generate-rpm-lock.sh executable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force-pushed from b8443e4 to d51e443 (Compare)
Squashed commits too!
are-ces
left a comment
TYVM very nice work, LGTM
Description
Adds the ability to produce the generated vector database embedded in a
container image, in addition to writing it to the filesystem.
When `--output-image` is supplied, `generate_embeddings.py` packages the vector DB (and optionally the embedding model) into an OCI-compatible archive (`.tar`) after the embedding pipeline completes. The archive can be loaded with `podman load` and pushed to any OCI registry.

The image is built entirely in Python + skopeo — no container daemon, elevated privileges, or user-namespace configuration required. This works safely inside an unprivileged container (e.g. the RAG tool container image).
Architecture: privilege-free image build

- `/usr/bin/skopeo copy docker://<base-image> docker-archive:<tmp.tar>` — pulls the base image with zero daemon/namespace requirements.
- The base docker-archive is parsed in Python (manifest, layer names and metadata only — layer data is not loaded into memory).
- A custom data layer is built in Python (`/rag/vector_db/` + optional `/rag/embeddings_model/`); its SHA-256 is computed by streaming the file in chunks (see the sketch below).
- The image config is updated: the new `diff_id` is appended, `User: 65532:65532` is set, and a history entry is added.
- The final archive is assembled by streaming base layers directly from the skopeo temp archive and the custom layer from its temp file — peak RAM is independent of image size.
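For illustration, the chunked hashing referenced above typically looks like this minimal sketch (the helper name and chunk size are assumptions, not the module's exact code):

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Digest a file in fixed-size chunks so peak RAM stays constant."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```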
The default base image is full UBI 9 (`ubi9/ubi:latest`).

Changes
- `src/lightspeed_rag_content/image_builder.py`: Pure-Python + skopeo utility that builds a Docker-archive tar by fetching the base image via `/usr/bin/skopeo copy` and stacking a custom data layer in Python. Layer construction and SHA-256 hashing are fully streaming (temp files, chunked reads) to keep peak memory independent of image size. `skopeo` calls have a 5-minute timeout, and failures surface stderr in the exception message (sketched below, after this list).
- `src/lightspeed_rag_content/Containerfile.output-image` (new): Static canonical specification for the output image layout. Uses `ubi9/ubi:latest`.
- `scripts/generate_embeddings.py`: Five new optional CLI flags:
  - `--output-image <path>` — path for the output `.tar` archive
  - `--image-name` — repository name in the archive (default: `rag-content-output`)
  - `--image-tag` — image tag (default: `latest`)
  - `--exclude-model` — exclude the embedding model from the image
  - `--base-image` — override the base image (default: `ubi9/ubi:latest`)

  When `--output-image` is set and the model is included (the default), the model directory is validated to exist before building; a missing directory raises `FileNotFoundError` with a hint to use `--exclude-model`.
- `Containerfile`: Installs `skopeo` for the privilege-free base image fetch.
- `tests/test_image_builder.py`: 29 unit tests covering the static Containerfile, layer tar construction, skopeo invocation, docker-archive parsing, streaming SHA-256, and the full `build_image_archive` pipeline.
- `tests/test_generate_embeddings.py`: Updated for the new CLI flags and model-dir validation; includes a `FileNotFoundError` guard test.

All 154 tests pass.
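The timeout-and-stderr behavior described above can be sketched as follows; the exact command assembly and error type are assumptions, not the module's actual code:

```python
import subprocess


def skopeo_copy_to_archive(base_image: str, dest_tar: str) -> None:
    """Pull base_image into a docker-archive tar without a daemon.

    Assumed shape: a 5-minute cap and captured stderr so failures
    surface an actionable message to the caller.
    """
    cmd = [
        "/usr/bin/skopeo", "copy",
        f"docker://{base_image}",
        f"docker-archive:{dest_tar}",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
    if result.returncode != 0:
        raise RuntimeError(f"skopeo copy failed: {result.stderr.strip()}")
```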
Usage
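A minimal sketch of driving the builder from Python; the paths are hypothetical, and the keyword names follow the `build_image_archive` docstring quoted in the review above. The resulting archive can then be imported with `podman load` and pushed to a registry:

```python
from lightspeed_rag_content.image_builder import build_image_archive

build_image_archive(
    vector_db_dir="output/vector_db",                 # required input
    output_tar_path="output/rag-content-output.tar",  # docker-archive destination
    image_name="rag-content-output",
    image_tag="latest",
    extra_dirs={"output/embeddings_model": "/rag/embeddings_model"},
    base_image="ubi9/ubi:latest",
)
```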
Type of change
Tools used to create PR
Related Tickets & Documents
Checklist before requesting a review
Testing
All 154 tests pass (`uv run pytest tests/ -v`). The image archive pipeline is covered by 29 unit tests in `test_image_builder.py` that mock the `skopeo` subprocess and use real on-disk docker-archive construction.

Summary by CodeRabbit
Release Notes
New Features

- `--output-image` flag to package the generated vector DB as a container image archive
- `--image-name`, `--image-tag`, `--exclude-model`, and `--base-image` for image customization
- Makefile targets `build-tool-image` and `generate-output-image` added for container workflows

Documentation