Add --output-image flag to package vector DB as a container image archive #108
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings, and use the review commands to manage reviews.
Walkthrough

Adds a containerized workflow and CLI options to generate embeddings and optionally package the resulting vector DB (and optionally the embedding model) into a docker-archive (.tar) image via a new pure-Python image builder, Makefile targets, Containerfile additions, and corresponding tests and docs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant User
participant Make
participant Podman
participant Container as ToolContainer
participant CLI as generate_embeddings.py
participant Processor as DocumentProcessor
participant ImageBuilder as image_builder.py
participant FS as Filesystem
User->>Make: make generate-output-image
Make->>Podman: podman run $(TOOL_IMAGE) (mount docs->/input, output->/output)
Podman->>Container: start container
Container->>CLI: invoke script
CLI->>Processor: process documents, build embeddings
Processor-->>CLI: return vector_store_id (optional)
CLI->>FS: write vector DB files to /output
alt --output-image provided
CLI->>ImageBuilder: build_image_archive(vector_db_dir, output_tar, image_name, image_tag, extra_dirs, base_image)
ImageBuilder->>ImageBuilder: fetch base docker-archive (skopeo)
ImageBuilder->>ImageBuilder: create custom layer (vector_db + extra_dirs, gzip .db)
ImageBuilder->>ImageBuilder: update config & manifest, compute digests
ImageBuilder->>FS: write final docker-archive tar to /output
end
    FS-->>User: output archive available
```
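For orientation, here is a minimal sketch of the `build_image_archive` hand-off shown in step 5 of the diagram. The keyword names and paths are assumptions drawn from the diagram and the docstrings quoted later in this review, not a verified API:

```python
from lightspeed_rag_content import image_builder

# Hypothetical invocation mirroring the sequence diagram: package the
# freshly written vector DB (and, optionally, the embeddings model)
# into a docker-archive tar under /output.
image_builder.build_image_archive(
    vector_db_dir="/output/vector_db",
    output_tar_path="/output/my-rag.tar",
    image_name="my-rag",
    image_tag="v1",
    extra_dirs={"/output/embeddings_model": "/rag/embeddings_model"},
    base_image="registry.access.redhat.com/ubi9/ubi:latest",
)
```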
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ❌ 1 failed check (1 warning) | ✅ 2 passed checks
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
🧹 Nitpick comments (3)
Makefile (1)
117-134: Consider adding SELinux labels to volume mounts for Fedora/RHEL systems.

On SELinux-enabled systems (common in Red Hat environments), the volume mounts may need `:Z` or `:z` labels to allow the container to access the mounted directories.

🔧 Proposed fix for SELinux compatibility
```diff
 generate-output-image: ## Run the tool container and produce a vector-DB image archive
 	mkdir -p $(OUTPUT_DIR)
 	podman run --rm \
-		-v $(abspath $(DOCS_DIR)):/input:ro \
-		-v $(abspath $(OUTPUT_DIR)):/output \
+		-v $(abspath $(DOCS_DIR)):/input:ro,Z \
+		-v $(abspath $(OUTPUT_DIR)):/output:Z \
 		$(TOOL_IMAGE) \
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` around lines 117 - 134, The generate-output-image target mounts host dirs into the container without SELinux labels which breaks access on SELinux-enabled systems; update the podman run volume flags in the generate-output-image rule to append a suitable SELinux label (e.g., :Z or :z) to the mounts for $(abspath $(DOCS_DIR)):/input and $(abspath $(OUTPUT_DIR)):/output so the container can access those paths on Fedora/RHEL; keep variable names TOOL_IMAGE, OUTPUT_IMAGE_FILE, OUTPUT_IMAGE_NAME, OUTPUT_IMAGE_TAG unchanged and ensure the chosen label is applied consistently to both -v volume options.

src/lightspeed_rag_content/image_builder.py (1)
33-50: Consider validating that `vector_db_dir` exists.

The function silently proceeds if `vector_db_dir` doesn't exist, producing an empty layer. While `extra_dirs` entries are checked with `os.path.isdir()`, the primary `vector_db_dir` has no such guard. This could lead to a confusing outcome (an image with no vector DB content).

🛡️ Proposed validation
```diff
 def _make_layer_tar(
     vector_db_dir: str,
     extra_dirs: dict[str, str] | None,
 ) -> bytes:
     """Return the raw bytes of a tar archive that forms the single image layer.

     Files from *vector_db_dir* are placed under ``/rag/vector_db/`` inside
     the archive. Any additional source→destination mappings in *extra_dirs*
     are copied verbatim using the supplied destination prefix.
     """
+    if not os.path.isdir(vector_db_dir):
+        raise ValueError(f"vector_db_dir does not exist: {vector_db_dir}")
     buf = io.BytesIO()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_rag_content/image_builder.py` around lines 33 - 50, The function _make_layer_tar currently doesn't validate that vector_db_dir exists; add a check at the start (similar to the os.path.isdir check used for extra_dirs) and raise a clear exception (e.g., FileNotFoundError or ValueError) if vector_db_dir is missing or not a directory, so callers get an explicit error instead of producing an empty layer; keep using the existing _add_dir_to_tar for adding when the path is valid and include the path in the exception message for easier debugging.

tests/test_generate_embeddings.py (1)
22-27: Add a `None` check for `_spec.loader` to satisfy the type checker.

`spec_from_file_location` can return `None`, and even if it returns a spec, `spec.loader` can be `None`. This could cause an `AttributeError` at module load time.

🔧 Proposed fix with assertions
```diff
 _SCRIPT_PATH = Path(__file__).parent.parent / "scripts" / "generate_embeddings.py"
 _spec = importlib.util.spec_from_file_location("generate_embeddings", _SCRIPT_PATH)
+assert _spec is not None, f"Could not load spec from {_SCRIPT_PATH}"
+assert _spec.loader is not None, f"Spec has no loader for {_SCRIPT_PATH}"
 _generate_embeddings = importlib.util.module_from_spec(_spec)
 _spec.loader.exec_module(_generate_embeddings)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_generate_embeddings.py` around lines 22 - 27, The spec returned by importlib.util.spec_from_file_location may be None and its loader may also be None, so before calling _spec.loader.exec_module ensure both _spec and _spec.loader are not None: check if _spec is None or _spec.loader is None and raise a clear ImportError (or assert) with a descriptive message, then proceed to create _generate_embeddings and call _spec.loader.exec_module(_generate_embeddings); reference symbols: importlib.util.spec_from_file_location, _spec, _spec.loader, and _generate_embeddings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/generate_embeddings.py`:
- Around line 114-121: The DocumentProcessor constructor expects
embeddings_model_dir: Path but scripts/generate_embeddings.py passes a string
(args.model_dir); import Path from pathlib and pass a Path-wrapped value (e.g.,
Path(args.model_dir)) when constructing DocumentProcessor so
embeddings_model_dir is a Path; update the DocumentProcessor(...) call to wrap
args.model_dir and add the pathlib import near other imports.
---
Nitpick comments:
In `@Makefile`:
- Around line 117-134: The generate-output-image target mounts host dirs into
the container without SELinux labels which breaks access on SELinux-enabled
systems; update the podman run volume flags in the generate-output-image rule to
append a suitable SELinux label (e.g., :Z or :z) to the mounts for $(abspath
$(DOCS_DIR)):/input and $(abspath $(OUTPUT_DIR)):/output so the container can
access those paths on Fedora/RHEL; keep variable names TOOL_IMAGE,
OUTPUT_IMAGE_FILE, OUTPUT_IMAGE_NAME, OUTPUT_IMAGE_TAG unchanged and ensure the
chosen label is applied consistently to both -v volume options.
In `@src/lightspeed_rag_content/image_builder.py`:
- Around line 33-50: The function _make_layer_tar currently doesn't validate
that vector_db_dir exists; add a check at the start (similar to the
os.path.isdir check used for extra_dirs) and raise a clear exception (e.g.,
FileNotFoundError or ValueError) if vector_db_dir is missing or not a directory,
so callers get an explicit error instead of producing an empty layer; keep using
the existing _add_dir_to_tar for adding when the path is valid and include the
path in the exception message for easier debugging.
In `@tests/test_generate_embeddings.py`:
- Around line 22-27: The spec returned by importlib.util.spec_from_file_location
may be None and its loader may also be None, so before calling
_spec.loader.exec_module ensure both _spec and _spec.loader are not None: check
if _spec is None or _spec.loader is None and raise a clear ImportError (or
assert) with a descriptive message, then proceed to create _generate_embeddings
and call _spec.loader.exec_module(_generate_embeddings); reference symbols:
importlib.util.spec_from_file_location, _spec, _spec.loader, and
_generate_embeddings.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7d5a2a95-a694-4838-94bf-8e58ea9c8eb5
📒 Files selected for processing (5)
- Makefile
- scripts/generate_embeddings.py
- src/lightspeed_rag_content/image_builder.py
- tests/test_generate_embeddings.py
- tests/test_image_builder.py
```python
document_processor = DocumentProcessor(
    chunk_size=args.chunk_size,
    chunk_overlap=args.chunk_overlap,
    model_name=args.model_name,
    embeddings_model_dir=args.model_dir,
    vector_store_type=args.vector_store,
    doc_type=args.doc_type,
)
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Verify DocumentProcessor.__init__ signature for embeddings_model_dir type
ast-grep --pattern $'def __init__(
  self,
  $$$
  embeddings_model_dir: $_,
  $$$
):'
```

Repository: lightspeed-core/rag-content
Length of output: 53
🏁 Script executed:
```bash
# Find the DocumentProcessor class definition
find . -type f -name "*.py" -path "*document_processor*" | head -20
```

Repository: lightspeed-core/rag-content
Length of output: 250
🏁 Script executed:
```bash
# Also search for DocumentProcessor class directly
rg "class DocumentProcessor" -A 5
```

Repository: lightspeed-core/rag-content
Length of output: 507
🏁 Script executed:
```bash
# Check the generate_embeddings.py script
cat scripts/generate_embeddings.py
```

Repository: lightspeed-core/rag-content
Length of output: 4530
🏁 Script executed:
```bash
# Read DocumentProcessor __init__ method to check the embeddings_model_dir parameter type
head -800 src/lightspeed_rag_content/document_processor.py | tail -100
```

Repository: lightspeed-core/rag-content
Length of output: 3726
🏁 Script executed:
```bash
# Search for the __init__ method of DocumentProcessor and show more context
rg "def __init__" -A 30 src/lightspeed_rag_content/document_processor.py | head -50
```

Repository: lightspeed-core/rag-content
Length of output: 1867
Type mismatch: embeddings_model_dir expects Path, not str.
The DocumentProcessor.__init__ signature (line ~737 of src/lightspeed_rag_content/document_processor.py) declares embeddings_model_dir: Path, but the script passes args.model_dir which is a string from argparse. While the code works at runtime because it converts to string internally, this violates the type contract and should be fixed.
🔧 Proposed fix
Add the import:
```diff
 import argparse
+from pathlib import Path
```

And wrap the argument:

```diff
 document_processor = DocumentProcessor(
     chunk_size=args.chunk_size,
     chunk_overlap=args.chunk_overlap,
     model_name=args.model_name,
-    embeddings_model_dir=args.model_dir,
+    embeddings_model_dir=Path(args.model_dir),
     vector_store_type=args.vector_store,
     doc_type=args.doc_type,
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/generate_embeddings.py` around lines 114 - 121, The DocumentProcessor
constructor expects embeddings_model_dir: Path but
scripts/generate_embeddings.py passes a string (args.model_dir); import Path
from pathlib and pass a Path-wrapped value (e.g., Path(args.model_dir)) when
constructing DocumentProcessor so embeddings_model_dir is a Path; update the
DocumentProcessor(...) call to wrap args.model_dir and add the pathlib import
near other imports.
are-ces left a comment
LGTM, very nice. Please update the documentation with this new feature; it seems like a feature users will use extensively 😄
Converting PR to draft as it depends on #104. Once that is merged, I will work on rebasing and CI checks.
Force-pushed b4f7c75 to 81666ca (Compare)
Status update:
Pending:
In the meantime, I am building this rag-content image locally with these changes and testing them in the AAP chatbot (LCS) deployment as BYOK. Once local validation completes, I will update this PR and ask for review again. Thx!
Force-pushed 81666ca to aa6e6b6 (Compare)
```python
import io
import json
import os
import subprocess
```
Check notice (Code scanning / Bandit): Consider possible security implications associated with the subprocess module.
Actionable comments posted: 1
🧹 Nitpick comments (2)
README.md (1)
453-455: Add language specifiers to fenced code blocks.

Static analysis flags these blocks as missing language specifiers. Since they contain file paths, use `text` or leave them empty, but be consistent.

✏️ Proposed fix
````diff
-```
+```text
 src/lightspeed_rag_content/Containerfile.output-image
````

````diff
-```
+```text
 registry.access.redhat.com/ubi9/ubi:latest
````

Also applies to: 466-468

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.
In `@README.md` around lines 453 - 455, Update the README's fenced code blocks that show file paths and image names to include a language specifier (use "text") so static analysis stops flagging them; specifically add `text` after the opening fence for the blocks containing "src/lightspeed_rag_content/Containerfile.output-image" and "registry.access.redhat.com/ubi9/ubi:latest" (and the similar block referenced at lines 466-468) so the fences are used consistently.

tests/test_image_builder.py (1)

120-140: Consider using context managers for file reads.

Multiple tests read the Containerfile without a context manager. While not a bug (Python will close the file handle eventually), using `with` is more explicit.

✏️ Example improvement

```diff
 def test_containerfile_references_ubi9_ubi(self):
-    content = open(_CONTAINERFILE).read()
+    with open(_CONTAINERFILE) as f:
+        content = f.read()
     assert "ubi9/ubi" in content
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.
In `@tests/test_image_builder.py` around lines 120 - 140, Replace bare open(...) calls in the test functions (e.g., test_containerfile_has_base_image, test_containerfile_has_no_run_instruction, test_containerfile_copies_vector_db, test_containerfile_copies_embeddings_model, test_containerfile_sets_non_root_user) with context-managed file opens using the _CONTAINERFILE variable (use "with open(_CONTAINERFILE) as f:" then f.read() or f.readlines()) so files are closed deterministically; update each test to use the with-block and the same read method currently used (read() or readlines()) without changing assertions.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.
Inline comments:
In@README.md:
- Line 555: The README contains a corrupted heading line ("## UpdaUse the plan
from file `@GENERATION.md`... ting Dependencies for Hermetic Builds"); locate
that exact string and replace it with the intended heading "## Updating
Dependencies for Hermetic Builds" so the section title is restored and the
accidental pasted instruction text is removed; ensure surrounding
content/formatting remains unchanged.
Nitpick comments:
In@README.md:
- Around line 453-455: Update the README's fenced code blocks that show file
paths and image names to include a language specifier (use "text") so static
analysis stops flagging them; specifically add `text` after the opening fence for the blocks containing "src/lightspeed_rag_content/Containerfile.output-image" and the block containing "registry.access.redhat.com/ubi9/ubi:latest" (and the similar block referenced at lines 466-468) so the fences are used consistently.

In `@tests/test_image_builder.py`:
- Around line 120-140: Replace bare open(...) calls in the test functions (e.g.,
test_containerfile_has_base_image, test_containerfile_has_no_run_instruction,
test_containerfile_copies_vector_db, test_containerfile_copies_embeddings_model,
test_containerfile_sets_non_root_user) with context-managed file opens using the
_CONTAINERFILE variable (use "with open(_CONTAINERFILE) as f:" then f.read() or
f.readlines()) so files are closed deterministically; update each test to use
the with-block and the same read method currently used (read() or readlines())
without changing assertions.</details> <details> <summary>🪄 Autofix (Beta)</summary> Fix all unresolved CodeRabbit comments on this PR: - [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended) - [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes </details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: Path: .coderabbit.yaml **Review profile**: CHILL **Plan**: Pro **Run ID**: `6027f03c-0b88-481d-a1ba-93fe8d97a2c3` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between 267abc2bfb4e1c0e108dabf435d9555d3312a2d3 and cf7a588c55edf65ded9b257e3ff02a8918006aaf. </details> <details> <summary>📒 Files selected for processing (10)</summary> * `Containerfile` * `Makefile` * `README.md` * `pyproject.toml` * `scripts/generate_embeddings.py` * `src/lightspeed_rag_content/Containerfile.output-image` * `src/lightspeed_rag_content/document_processor.py` * `src/lightspeed_rag_content/image_builder.py` * `tests/test_generate_embeddings.py` * `tests/test_image_builder.py` </details> <details> <summary>✅ Files skipped from review due to trivial changes (3)</summary> * pyproject.toml * Containerfile * src/lightspeed_rag_content/Containerfile.output-image </details> <details> <summary>🚧 Files skipped from review as they are similar to previous changes (2)</summary> * Makefile * scripts/generate_embeddings.py </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
/ok-to-test
/retest
thanks @syedriko, that's an awesome review! working on it :)
Force-pushed cf7a588 to ef2810c (Compare)
@syedriko squashed and applied some of your comments, but still working on others... thx.
Actionable comments posted: 2
♻️ Duplicate comments (4)
README.md (1)
543-543: ⚠️ Potential issue | 🟡 Minor

Restore the section heading text.
This heading still contains an accidental pasted prompt, so the section title and ToC entry are broken.
✏️ Suggested fix
```diff
-## UpdaUse the plan from file `@GENERATION.md` to generate a BYOK RAG image from the sources in `@ansible-collection-docs/`. But as a rag-content (tool) image, don't use the one from 'quay.io' repository, use my local one `localhost/rag-content:local`.ting Dependencies for Hermetic Builds
+## Updating Dependencies for Hermetic Builds
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` at line 543, The section heading in README.md has an accidental pasted prompt; replace the incorrect heading text ("## UpdaUse the plan from file `@GENERATION.md`...") with the intended header "## Updating Dependencies for Hermetic Builds" and ensure the ToC entry matches that exact heading; locate the heading by searching for the malformed line starting with "## UpdaUse" and update it so the heading and table-of-contents link are restored.

src/lightspeed_rag_content/image_builder.py (3)
66-66: ⚠️ Potential issue | 🟠 Major

This hardcodes FAISS-style packaging into every backend.

Every `.db` under `vector_db_dir` is renamed to `*.gz` here, but `scripts/generate_embeddings.py` exposes `--output-image` for all vector-store types. That matches the FAISS deployment artifact shape, not generic backends that still produce a plain `*.db` contract. Please either gate this path to the supported stores or make the compression/layout backend-specific instead of applying it blindly.

Also applies to: 98-106
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_rag_content/image_builder.py` at line 66, The code in image_builder.py calls _add_dir_to_tar(vector_db_dir, "/rag/vector_db", compress_extensions={".db"}) which blindly renames/compresses all .db files (FAISS-specific) for every backend; change this to be backend-aware by checking the vector store type (e.g., a backend identifier passed into the image builder or a config variable) and only apply compress_extensions={".db"} when the backend is FAISS (or the expected packaging), otherwise pass no compress_extensions or use a backend-specific layout/compression policy; update the same logic used around lines referenced (98-106) to mirror this conditional behavior and document/select the policy by backend name (function: _add_dir_to_tar, variable: vector_db_dir, parameter: compress_extensions).
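As a sketch of the gating the prompt asks for, assuming a `vector_store_type` string can be threaded through to the builder (it is not part of the current signature), the compression policy could be a per-backend lookup, with unknown backends defaulting to no compression:

```python
# Hypothetical per-backend packaging policy; only FAISS artifacts are
# gzip-compressed, other stores keep their plain *.db contract.
_COMPRESS_POLICY: dict[str, set[str]] = {
    "faiss": {".db"},
}


def compress_extensions_for(vector_store_type: str) -> set[str]:
    """Return the file suffixes to gzip for this backend (empty set: copy verbatim)."""
    return _COMPRESS_POLICY.get(vector_store_type.lower(), set())


if __name__ == "__main__":
    assert compress_extensions_for("faiss") == {".db"}
    assert compress_extensions_for("postgres") == set()  # unknown backend: untouched
```

The result would then replace the hardcoded `compress_extensions={".db"}` argument to `_add_dir_to_tar`.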
125-134: ⚠️ Potential issue | 🟠 Major

Harden the `skopeo` subprocess call against hangs and opaque failures.

A stalled registry can block this build indefinitely, and `check=True` without captured output drops the useful stderr context when the pull fails. Please resolve the executable explicitly and wrap timeout/process failures with a clearer error.

🔧 Suggested hardening
```diff
+import shutil
+
 ...
 def _fetch_base_archive(base_image: str, dest_tar: str) -> None:
     """Download *base_image* as a docker-archive tar using ``skopeo copy``.
@@
-    subprocess.run(
-        [
-            "skopeo",
-            "copy",
-            "--remove-signatures",
-            f"docker://{base_image}",
-            f"docker-archive:{dest_tar}",
-        ],
-        check=True,
-    )
+    skopeo = shutil.which("skopeo")
+    if skopeo is None:
+        raise RuntimeError("skopeo not found on PATH")
+
+    try:
+        subprocess.run(
+            [
+                skopeo,
+                "copy",
+                "--remove-signatures",
+                f"docker://{base_image}",
+                f"docker-archive:{dest_tar}",
+            ],
+            check=True,
+            capture_output=True,
+            text=True,
+            timeout=300,
+        )
+    except subprocess.TimeoutExpired as exc:
+        raise RuntimeError(f"skopeo timed out while pulling {base_image}") from exc
+    except subprocess.CalledProcessError as exc:
+        detail = exc.stderr.strip() or exc.stdout.strip()
+        raise RuntimeError(f"skopeo failed for {base_image}: {detail}") from exc
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_rag_content/image_builder.py` around lines 125 - 134, The subprocess.run call that invokes skopeo using base_image and dest_tar should be hardened: resolve the skopeo executable via shutil.which and raise a clear error if missing, call subprocess.run with timeout and capture_output=True (or stdout/stderr pipes), and wrap it in try/except to catch subprocess.TimeoutExpired, subprocess.CalledProcessError and FileNotFoundError; on failure raise a new exception that includes the base_image, dest_tar, the timed-out status or returncode and the captured stderr/stdout to avoid opaque failures. Locate the invocation of subprocess.run (the skopeo copy call) and replace it with the resolved executable path, timeout, captured output, and explicit exception handling that surfaces helpful error text.
64-71: ⚠️ Potential issue | 🟠 Major

The archive builder still scales with total image size in RAM.

`_make_layer_tar()` returns one full `bytes`, `_read_docker_archive()` reads every base layer into memory, and `build_image_archive()` keeps all of that alive until the final tar write completes. Including a large model or moving to a larger base image can OOM the unprivileged-container workflow this feature targets. Please stream layer hashing/copying instead of materializing the whole archive first.

Also applies to: 147-163, 236-263
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_rag_content/image_builder.py` around lines 64 - 71, The current implementation materializes entire layer archives in memory causing OOM; change _make_layer_tar, _read_docker_archive, and build_image_archive to stream layer data instead of returning/holding bytes: have _make_layer_tar write the tar directly to a file-like stream (or temporary file) and return a readable file-like object or path, update _read_docker_archive to iterate/read base layers in chunks (using streaming reads and hashing as it copies) rather than loading into memory, and modify build_image_archive to assemble the final image tar by copying layer streams into the output tar incrementally (using copyfileobj or tarfile.addfile with a stream) and compute hashes on-the-fly so no full-layer bytes are retained in RAM. Ensure APIs reference _make_layer_tar, _read_docker_archive, and build_image_archive are updated to accept/return streams or paths and to close/cleanup temp files when done.
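One possible shape for the streaming refactor is to hash and copy each layer in fixed-size chunks so peak memory stays constant regardless of layer size. A self-contained sketch (the function name is illustrative, not the module's current API):

```python
import hashlib
from typing import BinaryIO

_CHUNK = 8 * 1024 * 1024  # 8 MiB


def copy_with_sha256(src: BinaryIO, dst: BinaryIO) -> str:
    """Copy *src* to *dst* chunk by chunk, returning the SHA-256 of the bytes copied."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: src.read(_CHUNK), b""):
        digest.update(chunk)
        dst.write(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import io

    src, dst = io.BytesIO(b"layer bytes"), io.BytesIO()
    sha = copy_with_sha256(src, dst)
    assert dst.getvalue() == b"layer bytes"
    assert sha == hashlib.sha256(b"layer bytes").hexdigest()
```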
🧹 Nitpick comments (1)
Makefile (1)
121-133: Make `OUTPUT_IMAGE_FILE` either authoritative or basename-only.

`basename` drops any overridden directory, so `OUTPUT_IMAGE_FILE=dist/images/foo.tar` still writes `/output/foo.tar` while the status line claims `dist/images/foo.tar`. Either mount the real parent directory or document this variable as a filename under `OUTPUT_DIR`.
Verify each finding against the current code and only fix it if needed. In `@Makefile` around lines 121 - 133, The Makefile currently strips directories from OUTPUT_IMAGE_FILE via basename causing mismatched status vs actual location; fix by making OUTPUT_IMAGE_FILE authoritative or documenting it as basename-only: either (A) treat OUTPUT_IMAGE_FILE as a full path — mount its parent into the container (use the parent directory of OUTPUT_IMAGE_FILE instead of always mounting OUTPUT_DIR) and pass the same full path to --output-image so the container writes to the authoritative path, or (B) enforce basename-only semantics — set/output a basename variable (e.g. OUTFILE from OUTPUT_IMAGE_FILE) and update documentation for OUTPUT_IMAGE_FILE to indicate it should be a filename only under OUTPUT_DIR; update the Makefile logic around OUTPUT_IMAGE_FILE, the mounted volume arguments, and the --output-image argument accordingly (look for usage of OUTPUT_IMAGE_FILE, OUTPUT_DIR, and the --output-image flag in the Makefile and the podman run invocation).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@README.md`:
- Around line 445-447: Add a clear prerequisite for host-side runs stating that
skopeo must be installed locally when using the host example that runs "uv run
python scripts/generate_embeddings.py --output-image" (or alternatively replace
that host example with the containerized Makefile flow), and repair the
corrupted heading text near "UpdaUse" by restoring it to a valid markdown
heading (e.g., a clear "Update: Use the plan from file..." style heading) so the
README renders correctly; update the README lines containing the host example
and the corrupted heading accordingly.
In `@scripts/generate_embeddings.py`:
- Around line 133-136: When args.output_image is set and args.include_model is
True, validate that args.model_dir exists and is a directory before building
extra_dirs so we fail fast on typos; if the path is missing, exit with an error
(or call the argument parser error) telling the user to provide a correct
--model-dir or pass --exclude-model, rather than letting build_image_archive
silently omit the model; update the logic around extra_dirs / args.include_model
in generate_embeddings.py to perform this check (referencing args.output_image,
args.include_model, args.model_dir, and extra_dirs).
---
Duplicate comments:
In `@README.md`:
- Line 543: The section heading in README.md has an accidental pasted prompt;
replace the incorrect heading text ("## UpdaUse the plan from file
`@GENERATION.md`...") with the intended header "## Dependencies for Hermetic
Builds" and ensure the ToC entry matches that exact heading; locate the heading
by searching for the malformed line starting with "## UpdaUse" and update it to
"## Dependencies for Hermetic Builds" so the heading and table-of-contents link
are restored.
In `@src/lightspeed_rag_content/image_builder.py`:
- Line 66: The code in image_builder.py calls _add_dir_to_tar(vector_db_dir,
"/rag/vector_db", compress_extensions={".db"}) which blindly renames/compresses
all .db files (FAISS-specific) for every backend; change this to be
backend-aware by checking the vector store type (e.g., a backend identifier
passed into the image builder or a config variable) and only apply
compress_extensions={".db"} when the backend is FAISS (or the expected
packaging), otherwise pass no compress_extensions or use a backend-specific
layout/compression policy; update the same logic used around lines referenced
(98-106) to mirror this conditional behavior and document/select the policy by
backend name (function: _add_dir_to_tar, variable: vector_db_dir, parameter:
compress_extensions).
- Around line 125-134: The subprocess.run call that invokes skopeo using
base_image and dest_tar should be hardened: resolve the skopeo executable via
shutil.which and raise a clear error if missing, call subprocess.run with
timeout and capture_output=True (or stdout/stderr pipes), and wrap it in
try/except to catch subprocess.TimeoutExpired, subprocess.CalledProcessError and
FileNotFoundError; on failure raise a new exception that includes the
base_image, dest_tar, the timed-out status or returncode and the captured
stderr/stdout to avoid opaque failures. Locate the invocation of subprocess.run
(the skopeo copy call) and replace it with the resolved executable path,
timeout, captured output, and explicit exception handling that surfaces helpful
error text.
- Around line 64-71: The current implementation materializes entire layer
archives in memory causing OOM; change _make_layer_tar, _read_docker_archive,
and build_image_archive to stream layer data instead of returning/holding bytes:
have _make_layer_tar write the tar directly to a file-like stream (or temporary
file) and return a readable file-like object or path, update
_read_docker_archive to iterate/read base layers in chunks (using streaming
reads and hashing as it copies) rather than loading into memory, and modify
build_image_archive to assemble the final image tar by copying layer streams
into the output tar incrementally (using copyfileobj or tarfile.addfile with a
stream) and compute hashes on-the-fly so no full-layer bytes are retained in
RAM. Ensure APIs reference _make_layer_tar, _read_docker_archive, and
build_image_archive are updated to accept/return streams or paths and to
close/cleanup temp files when done.
---
Nitpick comments:
In `@Makefile`:
- Around line 121-133: The Makefile currently strips directories from
OUTPUT_IMAGE_FILE via basename causing mismatched status vs actual location; fix
by making OUTPUT_IMAGE_FILE authoritative or documenting it as basename-only:
either (A) treat OUTPUT_IMAGE_FILE as a full path — mount its parent into the
container (use the parent directory of OUTPUT_IMAGE_FILE instead of always
mounting OUTPUT_DIR) and pass the same full path to --output-image so the
container writes to the authoritative path, or (B) enforce basename-only
semantics — set/output a basename variable (e.g. OUTFILE from OUTPUT_IMAGE_FILE)
and update documentation for OUTPUT_IMAGE_FILE to indicate it should be a
filename only under OUTPUT_DIR; update the Makefile logic around
OUTPUT_IMAGE_FILE, the mounted volume arguments, and the --output-image argument
accordingly (look for usage of OUTPUT_IMAGE_FILE, OUTPUT_DIR, and the
--output-image flag in the Makefile and the podman run invocation).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c35e761c-97d4-4aa3-96e1-181feb8ca7b2
📒 Files selected for processing (10)
- Containerfile
- Makefile
- README.md
- pyproject.toml
- scripts/generate_embeddings.py
- src/lightspeed_rag_content/Containerfile.output-image
- src/lightspeed_rag_content/document_processor.py
- src/lightspeed_rag_content/image_builder.py
- tests/test_generate_embeddings.py
- tests/test_image_builder.py
✅ Files skipped from review due to trivial changes (3)
- pyproject.toml
- Containerfile
- src/lightspeed_rag_content/Containerfile.output-image
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lightspeed_rag_content/document_processor.py
are-ces left a comment
Please also fix the Black formatting and linting issues.
```python
subprocess.run(
    [
```
Check notice (Code scanning / Bandit): subprocess call - check for execution of untrusted input.
Actionable comments posted: 2
♻️ Duplicate comments (2)
src/lightspeed_rag_content/image_builder.py (1)
210-210: ⚠️ Potential issue | 🟡 Minor

Fix Ruff RUF005 lint error: prefer unpacking over concatenation.

The CI pipeline fails because Ruff prefers `[*base_layer_names, layer_entry]` over `base_layer_names + [layer_entry]`.

🔧 Proposed fix
```diff
 manifest = [
     {
         "Config": config_entry,
         "RepoTags": [f"{image_name}:{image_tag}"],
-        "Layers": base_layer_names + [layer_entry],
+        "Layers": [*base_layer_names, layer_entry],
     }
 ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_rag_content/image_builder.py` at line 210, Replace the list concatenation used for the "Layers" value with unpacking to satisfy Ruff RUF005: change the expression that currently builds the layers from base_layer_names + [layer_entry] to use a list with the starred unpacking of base_layer_names and layer_entry (i.e., [*base_layer_names, layer_entry]); locate the usage in image_builder.py where the "Layers" key is assigned and update the expression referencing base_layer_names and layer_entry accordingly.

README.md (1)
445-468: ⚠️ Potential issue | 🟡 Minor

Host-side usage requires `skopeo` to be installed locally.

Lines 445-447 mention `skopeo` is included in the RAG tool container image, but the example at lines 460-468 runs `uv run python scripts/generate_embeddings.py` directly on the host. Users following this path need `skopeo` installed locally.

Consider adding a prerequisite note for host-side execution:
📝 Suggested addition after line 447
```diff
 The image is built using `skopeo` (included in the RAG tool container image) to fetch the base image and pure Python to compose the final archive — no container daemon or elevated privileges required.
+
+> **Note:** When running directly on the host (not inside the container),
+> ensure `skopeo` is installed and available in your `PATH`.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 445 - 468, Add a short prerequisite note after the paragraph that mentions skopeo is included in the container (after the text that currently says "The image is built using `skopeo`...") clarifying that if users run the example command on the host (the `uv run python scripts/generate_embeddings.py ...` invocation) they must have `skopeo` installed locally; update README.md near the Containerfile/base-image text to either add a "Prerequisites" bullet stating "skopeo must be installed on the host when running host-side commands" or insert a one-line parenthetical note before the "Basic usage" example referencing the need for local skopeo.
🧹 Nitpick comments (2)
tests/test_image_builder.py (1)
120-141: Use context managers for file reads.

Multiple `open(_CONTAINERFILE).read()` calls don't use context managers, leaving file handles to garbage collection. While unlikely to cause issues in tests, it's cleaner to use `with` statements.

♻️ Suggested refactor
```diff
 def test_containerfile_references_ubi9_ubi(self):
-    content = open(_CONTAINERFILE).read()
+    with open(_CONTAINERFILE) as f:
+        content = f.read()
     assert "ubi9/ubi" in content

 def test_containerfile_has_no_run_instruction(self):
     # No RUN instruction — packages come from the base image itself.
-    lines = open(_CONTAINERFILE).readlines()
+    with open(_CONTAINERFILE) as f:
+        lines = f.readlines()
     assert not any(line.startswith("RUN ") for line in lines)

 def test_containerfile_copies_vector_db(self):
-    content = open(_CONTAINERFILE).read()
+    with open(_CONTAINERFILE) as f:
+        content = f.read()
     assert "COPY --from=vector_db" in content
     assert "/rag/vector_db/" in content

 def test_containerfile_copies_embeddings_model(self):
-    content = open(_CONTAINERFILE).read()
+    with open(_CONTAINERFILE) as f:
+        content = f.read()
     assert "COPY --from=embeddings_model" in content
     assert "/rag/embeddings_model/" in content

 def test_containerfile_sets_non_root_user(self):
-    content = open(_CONTAINERFILE).read()
+    with open(_CONTAINERFILE) as f:
+        content = f.read()
     assert "USER 65532" in content
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_image_builder.py` around lines 120 - 141, The tests repeatedly call open(_CONTAINERFILE).read() and open(_CONTAINERFILE).readlines() without context managers; update the test functions (test_containerfile_references_ubi9_ubi, test_containerfile_has_no_run_instruction, test_containerfile_copies_vector_db, test_containerfile_copies_embeddings_model, test_containerfile_sets_non_root_user) to open the file using with open(_CONTAINERFILE, 'r') as f: and either f.read() or f.readlines(), ideally reading the file once into a local variable (e.g., content or lines) and reusing that across assertions to close the file promptly and avoid leaving file handles to garbage collection.

src/lightspeed_rag_content/image_builder.py (1)
103-111: Large `.db` files are fully loaded into memory during gzip compression.

While the layer tar streaming was addressed, the gzip compression path still reads the entire source file into memory via `gz.write(f.read())`. For very large vector database files, this could cause memory pressure.

Consider streaming the compression:

♻️ Suggested streaming approach
```diff
 if compress_extensions and suffix in compress_extensions:
     buf = io.BytesIO()
     with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
         with open(abs_path, "rb") as f:
-            gz.write(f.read())
+            for chunk in iter(lambda: f.read(8 * 1024 * 1024), b""):
+                gz.write(chunk)
     data = buf.getvalue()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_rag_content/image_builder.py` around lines 103 - 111, The current gzip path in image_builder.py reads the whole file into memory via gz.write(f.read()); change it to stream-compress to avoid memory spikes by writing compressed bytes to a temporary file instead of a BytesIO: open abs_path in binary, open a tempfile.NamedTemporaryFile(mode="wb+", delete=True), copy from the source to a gzip.GzipFile(fileobj=tempfile) in chunks (or use shutil.copyfileobj), flush and seek to start, set TarInfo(name=arcname + ".gz") with info.size = tempfile.tell(), then call tar.addfile(info, tempfile) and close/remove the temp file; keep the existing condition using compress_extensions and suffix.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@README.md`:
- Around line 454-456: The fenced code block containing the container image
reference `registry.access.redhat.com/ubi9/ubi:latest` lacks a language
specifier; update the block to include a language (e.g., change ``` to ```text)
so the fenced code becomes ```text followed by the image reference and closing
``` to satisfy MD040.
In `@src/lightspeed_rag_content/image_builder.py`:
- Around line 1-283: The file needs to be reformatted with Black; run the
provided command (uv tool run black
src/lightspeed_rag_content/image_builder.py), stage and commit the resulting
changes so CI passes; ensure you format the whole file (functions like
_make_layer_tar, _add_dir_to_tar, _fetch_base_archive, _build_image_config, and
build_image_archive) and include the updated file in your PR (or add Black to
pre-commit to avoid future CI failures).
---
Duplicate comments:
In `@README.md`:
- Around line 445-468: Add a short prerequisite note after the paragraph that
mentions skopeo is included in the container (after the text that currently says
"The image is built using `skopeo`...") clarifying that if users run the example
command on the host (the `uv run python scripts/generate_embeddings.py ...`
invocation) they must have `skopeo` installed locally; update README.md near the
Containerfile/base-image text to either add a "Prerequisites" bullet stating
"skopeo must be installed on the host when running host-side commands" or insert
a one-line parenthetical note before the "Basic usage" example referencing the
need for local skopeo.
In `@src/lightspeed_rag_content/image_builder.py`:
- Line 210: Replace the list concatenation used for the "Layers" value with
unpacking to satisfy Ruff RUF005: change the expression that currently builds
the layers from base_layer_names + [layer_entry] to use a list with the starred
unpacking of base_layer_names and layer_entry (i.e., [*base_layer_names,
layer_entry]); locate the usage in image_builder.py where the "Layers" key is
assigned and update the expression referencing base_layer_names and layer_entry
accordingly.
---
Nitpick comments:
In `@src/lightspeed_rag_content/image_builder.py`:
- Around line 103-111: The current gzip path in image_builder.py reads the whole
file into memory via gz.write(f.read()); change it to stream-compress to avoid
memory spikes by writing compressed bytes to a temporary file instead of a
BytesIO: open abs_path in binary, open a tempfile.NamedTemporaryFile(mode="wb+",
delete=True), copy from the source to a gzip.GzipFile(fileobj=tempfile) in
chunks (or use shutil.copyfileobj), flush and seek to start, set
TarInfo(name=arcname + ".gz") with info.size = tempfile.tell(), then call
tar.addfile(info, tempfile) and close/remove the temp file; keep the existing
condition using compress_extensions and suffix.
In `@tests/test_image_builder.py`:
- Around line 120-141: The tests repeatedly call open(_CONTAINERFILE).read() and
open(_CONTAINERFILE).readlines() without context managers; update the test
functions (test_containerfile_references_ubi9_ubi,
test_containerfile_has_no_run_instruction, test_containerfile_copies_vector_db,
test_containerfile_copies_embeddings_model,
test_containerfile_sets_non_root_user) to open the file using with
open(_CONTAINERFILE, 'r') as f: and either f.read() or f.readlines(), ideally
reading the file once into a local variable (e.g., content or lines) and reusing
that across assertions to close the file promptly and avoid leaving file handles
to garbage collection.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 40a0a0a9-ff1a-4e5f-b403-f28ee5ce37fb
📒 Files selected for processing (5)
- README.md
- scripts/generate_embeddings.py
- src/lightspeed_rag_content/image_builder.py
- tests/test_generate_embeddings.py
- tests/test_image_builder.py
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/generate_embeddings.py
````text
```
registry.access.redhat.com/ubi9/ubi:latest
```
````
Add a language specifier to the fenced code block.

Markdownlint reports MD040: fenced code blocks should have a language specified. Since this is a container image reference, use `text` or leave it empty, but keep the fence well-formed.
📝 Suggested fix
````diff
-```
+```text
 registry.access.redhat.com/ubi9/ubi:latest
````

🧰 Tools

🪛 markdownlint-cli2 (0.22.0)

[warning] 454-454: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.
In `@README.md` around lines 454 - 456, The fenced code block containing the container image reference `registry.access.redhat.com/ubi9/ubi:latest` lacks a language specifier; update the block to include a language (e.g., add `text` after the opening fence) so the fenced code satisfies MD040.
Just one existing comment pending to address: #108 (comment). Anyway, I think it is ready for early review again. Thanks
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lightspeed_rag_content/image_builder.py`:
- Around line 105-109: The code currently calls _gzip_compress(abs_path) which
buffers the entire compressed payload in memory before tar.addfile; instead,
stream the compression to disk to avoid OOMs: for the block handling
compress_extensions (where suffix, arcname, abs_path and tar.addfile are used),
open a tempfile.NamedTemporaryFile (or TemporaryFile) and stream-compress the
source by reading abs_path in chunks into a gzip.GzipFile writing to that temp
file, close/flush it, stat the temp file to get the compressed size, create a
tarfile.TarInfo(name=arcname + ".gz") with info.size set to the temp file size,
reopen the temp file for reading in binary and pass that file object to
tar.addfile(info, fileobj), then delete the temp file; apply the same change for
the other similar block around lines 114-120 instead of using _gzip_compress().
- Around line 250-255: Before calling _make_layer_tar, validate that
vector_db_dir exists and is a directory and fail fast if not; specifically, in
the function that creates the temp layer (the block that calls _make_layer_tar
with vector_db_dir and extra_dirs), add an early check like
os.path.exists(vector_db_dir) and os.path.isdir(vector_db_dir) and raise a clear
exception (e.g., ValueError or FileNotFoundError) if the check fails so
_add_dir_to_tar is never invoked with a missing path; keep validation for
extra_dirs as-is and reference vector_db_dir, _make_layer_tar, and
_add_dir_to_tar in your change.
- Around line 95-96: The os.walk loop over src_dir (and the similar loop
handling extra_dirs) currently ignores symlinked directories and tar.add(...,
recursive=False) preserves symlinks, which can leave missing targets in the
image; update the loops that iterate src_dir and extra_dirs to explicitly detect
symlinks (os.path.islink) and either (a) raise/return an error rejecting
symlinked dirs with a clear message referencing src_dir/extra_dirs, or (b)
dereference them when adding to the tar by resolving targets (os.path.realpath)
and adding the target file contents instead of the symlink (ensuring you replace
tar.add(..., recursive=False) behavior accordingly). Locate and change the code
around the os.walk(...) loop and the tar.add usage to implement one of these two
behaviors and apply the same change to the analogous block that processes
extra_dirs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 96ca5658-f122-47db-9972-752cdac0a312
📒 Files selected for processing (1)
- src/lightspeed_rag_content/image_builder.py
```python
for root, dirs, files in os.walk(src_dir):
    dirs.sort()
```
🧩 Analysis chain
🏁 Script executed:
```bash
# Find and examine the image_builder.py file
find . -name "image_builder.py" -type f
```

Repository: lightspeed-core/rag-content
Length of output: 116
🏁 Script executed:
```bash
# Read the image_builder.py file to see the context around lines 95-96 and 110-111
wc -l src/lightspeed_rag_content/image_builder.py
```

Repository: lightspeed-core/rag-content
Length of output: 118
🏁 Script executed:
```bash
# Read the relevant sections with context
sed -n '90,120p' src/lightspeed_rag_content/image_builder.py
```

Repository: lightspeed-core/rag-content
Length of output: 1367
🏁 Script executed:
```bash
# Get the function definition and signature
sed -n '80,115p' src/lightspeed_rag_content/image_builder.py
```

Repository: lightspeed-core/rag-content
Length of output: 1485
🏁 Script executed:
```bash
# Search for usages of this function
rg -n "tar.add\|_add_dir_to_tar" src/lightspeed_rag_content/
```

Repository: lightspeed-core/rag-content
Length of output: 53
🏁 Script executed:
```bash
# Look for the function name
rg -n "def.*add.*tar" src/lightspeed_rag_content/image_builder.py
```

Repository: lightspeed-core/rag-content
Length of output: 168
🏁 Script executed:
```bash
# Find where _add_dir_to_tar is called
rg -n "_add_dir_to_tar" src/
```

Repository: lightspeed-core/rag-content
Length of output: 311
🏁 Script executed:
```bash
# Look at generate_embeddings.py as mentioned in the review
find . -name "generate_embeddings.py" -type f
```

Repository: lightspeed-core/rag-content
Length of output: 103
🏁 Script executed:
```bash
# Look at the context where _add_dir_to_tar is called (lines 60-80)
sed -n '60,80p' src/lightspeed_rag_content/image_builder.py
```

Repository: lightspeed-core/rag-content
Length of output: 871
🏁 Script executed:
```bash
# Also check the section mentioned in review (around 110-111)
sed -n '105,120p' src/lightspeed_rag_content/image_builder.py
```

Repository: lightspeed-core/rag-content
Length of output: 718
🏁 Script executed:
```bash
# Check the generate_embeddings.py file to see if it uses extra_dirs with symlinks
cat scripts/generate_embeddings.py
```

Repository: lightspeed-core/rag-content
Length of output: 5129
🏁 Script executed:
```bash
# Verify os.walk() default behavior with symlinks
python3 << 'EOF'
import os
import tempfile
import shutil

# Create a test directory structure with symlinks
with tempfile.TemporaryDirectory() as tmpdir:
    # Create a source directory
    src = os.path.join(tmpdir, "src")
    os.makedirs(src)

    # Create a regular subdirectory
    regular_dir = os.path.join(src, "regular")
    os.makedirs(regular_dir)
    with open(os.path.join(regular_dir, "file.txt"), "w") as f:
        f.write("test")

    # Create another directory to link to
    linked_content = os.path.join(tmpdir, "linked_content")
    os.makedirs(linked_content)
    with open(os.path.join(linked_content, "linked_file.txt"), "w") as f:
        f.write("linked")

    # Create a symlink to a directory
    symlink_path = os.path.join(src, "symlink_dir")
    os.symlink(linked_content, symlink_path)

    # Test os.walk default behavior (followlinks=False)
    print("os.walk with followlinks=False (default):")
    for root, dirs, files in os.walk(src):
        print(f"  Root: {os.path.relpath(root, src)}, Files: {files}, Dirs: {dirs}")

    print("\nos.walk with followlinks=True:")
    for root, dirs, files in os.walk(src, followlinks=True):
        print(f"  Root: {os.path.relpath(root, src)}, Files: {files}, Dirs: {dirs}")
EOF
```

Repository: lightspeed-core/rag-content
Length of output: 406
🏁 Script executed:
```bash
# Verify tar.add behavior with symlinks
python3 << 'EOF'
import os
import tempfile
import tarfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Create source structure with symlink
    src = os.path.join(tmpdir, "src")
    os.makedirs(src)

    # Create a regular file
    with open(os.path.join(src, "regular.txt"), "w") as f:
        f.write("regular")

    # Create a file and a symlink to it
    target_file = os.path.join(src, "target.txt")
    with open(target_file, "w") as f:
        f.write("target content")
    symlink_file = os.path.join(src, "symlink.txt")
    os.symlink(target_file, symlink_file)

    # Create tar with recursive=False (used in the code)
    tar_path = os.path.join(tmpdir, "test.tar")
    with tarfile.open(tar_path, "w") as tar:
        tar.add(symlink_file, arcname="symlink.txt", recursive=False)

    # Check what's in the tar
    print("Tar contents (with symlink added using recursive=False):")
    with tarfile.open(tar_path, "r") as tar:
        for member in tar.getmembers():
            print(f"  Name: {member.name}, Type: {member.type}, Issym: {member.issym()}, Isreg: {member.isreg()}")
EOF
```

Repository: lightspeed-core/rag-content
Length of output: 186
🏁 Script executed:
```bash
# Check if there's any symlink handling in the codebase
rg -i "symlink|followlinks|dereferenc" src/ scripts/
```

Repository: lightspeed-core/rag-content
Length of output: 53
🏁 Script executed:
```bash
# Look at the complete function that calls _add_dir_to_tar to understand the full context
sed -n '55,78p' src/lightspeed_rag_content/image_builder.py
```

Repository: lightspeed-core/rag-content
Length of output: 1044
🏁 Script executed:
```bash
# Check if there are any tests or documentation about symlink handling
fd -e py -e md -e txt | xargs rg -l "symlink" 2>/dev/null | head -5
```

Repository: lightspeed-core/rag-content
Length of output: 53
Handle symlinks explicitly when copying source directories.

`os.walk()` skips symlinked directories by default, and `tar.add(..., recursive=False)` preserves file symlinks instead of copying their targets. If `extra_dirs` points to a model snapshot or cache tree with symlinks, the resulting image will ship with missing files or broken links. Either reject symlinks up front or dereference them while writing tar members.

Also applies to: 110-111
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lightspeed_rag_content/image_builder.py` around lines 95 - 96, The
os.walk loop over src_dir (and the similar loop handling extra_dirs) currently
ignores symlinked directories and tar.add(..., recursive=False) preserves
symlinks, which can leave missing targets in the image; update the loops that
iterate src_dir and extra_dirs to explicitly detect symlinks (os.path.islink)
and either (a) raise/return an error rejecting symlinked dirs with a clear
message referencing src_dir/extra_dirs, or (b) dereference them when adding to
the tar by resolving targets (os.path.realpath) and adding the target file
contents instead of the symlink (ensuring you replace tar.add(...,
recursive=False) behavior accordingly). Locate and change the code around the
os.walk(...) loop and the tar.add usage to implement one of these two behaviors
and apply the same change to the analogous block that processes extra_dirs.
```python
if compress_extensions and suffix in compress_extensions:
    data = _gzip_compress(abs_path)
    info = tarfile.TarInfo(name=arcname + ".gz")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
```
Stream the `.db` compression path instead of buffering it in memory.

`_gzip_compress()` still reads the full DB into memory and keeps the full `.gz` payload in a `BytesIO` before `tar.addfile()` sees it. For large vector stores, that brings peak RAM back to O(file_size) and can still OOM the build despite the streaming refactor elsewhere.
Also applies to: 114-120
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lightspeed_rag_content/image_builder.py` around lines 105 - 109, The code
currently calls _gzip_compress(abs_path) which buffers the entire compressed
payload in memory before tar.addfile; instead, stream the compression to disk to
avoid OOMs: for the block handling compress_extensions (where suffix, arcname,
abs_path and tar.addfile are used), open a tempfile.NamedTemporaryFile (or
TemporaryFile) and stream-compress the source by reading abs_path in chunks into
a gzip.GzipFile writing to that temp file, close/flush it, stat the temp file to
get the compressed size, create a tarfile.TarInfo(name=arcname + ".gz") with
info.size set to the temp file size, reopen the temp file for reading in binary
and pass that file object to tar.addfile(info, fileobj), then delete the temp
file; apply the same change for the other similar block around lines 114-120
instead of using _gzip_compress().
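A sketch of the temp-file approach the prompt describes: stream-compress in chunks to an anonymous temporary file, then hand the open handle to `tar.addfile()`. Memory stays constant at the cost of temporary disk space (the helper name is illustrative):

```python
import gzip
import shutil
import tarfile
import tempfile


def add_gzipped_file(tar: tarfile.TarFile, abs_path: str, arcname: str) -> None:
    """Stream-compress *abs_path* into *tar* as ``arcname + ".gz"``."""
    with tempfile.TemporaryFile() as tmp:
        with open(abs_path, "rb") as src, gzip.GzipFile(fileobj=tmp, mode="wb") as gz:
            shutil.copyfileobj(src, gz, length=8 * 1024 * 1024)  # 8 MiB chunks
        info = tarfile.TarInfo(name=arcname + ".gz")
        info.size = tmp.tell()  # gzip trailer is flushed when the GzipFile closes
        tmp.seek(0)
        tar.addfile(info, tmp)  # tarfile reads exactly info.size bytes from tmp
```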
```python
with tempfile.TemporaryDirectory() as tmpdir:
    # 1. Build the custom layer to a temp file; compute SHA-256 by reading
    #    it back in chunks so it never needs to coexist in RAM with the
    #    base image layers.
    layer_file = os.path.join(tmpdir, "layer.tar")
    layer_sha = _make_layer_tar(vector_db_dir, extra_dirs, layer_file)
```
Add validation to fail fast if vector_db_dir does not exist.
This is the only required input, but it is currently not validated before the layer build starts. os.walk() silently yields nothing for a missing directory, resulting in a valid-looking archive with an empty /rag/vector_db layer instead of surfacing the real problem. Note that extra_dirs entries are already validated (line 72), but vector_db_dir is passed directly to _add_dir_to_tar() without any check.
Suggested fix
```diff
     Args:
         vector_db_dir: Directory whose contents are placed at ``/rag/vector_db``
             inside the image.
         output_tar_path: Destination path for the ``.tar`` archive.
         image_name: Repository name for the image (e.g. ``my-rag``).
         image_tag: Tag for the image (e.g. ``v1``).
         extra_dirs: Optional mapping of ``{local_src_path: image_dest_path}``.
             Each existing source directory is added under the given image path.
         base_image: Parent image reference (default: UBI 9 full).
     """
+    if not os.path.isdir(vector_db_dir):
+        raise FileNotFoundError(
+            f"Vector DB directory not found: {vector_db_dir}"
+        )
+
     with tempfile.TemporaryDirectory() as tmpdir:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lightspeed_rag_content/image_builder.py` around lines 250-255, before
calling _make_layer_tar, validate that vector_db_dir exists and is a directory
and fail fast if not; specifically, in the function that creates the temp layer
(the block that calls _make_layer_tar with vector_db_dir and extra_dirs), add an
early check like os.path.exists(vector_db_dir) and os.path.isdir(vector_db_dir)
and raise a clear exception (e.g., ValueError or FileNotFoundError) if the check
fails so _add_dir_to_tar is never invoked with a missing path; keep validation
for extra_dirs as-is and reference vector_db_dir, _make_layer_tar, and
_add_dir_to_tar in your change.
Hey @tisnik @are-ces @syedriko, everything is done from my side, including the rpm lock updates, plus the previous updates described here: #108 (comment). So it is now ready for review again, thanks guys!
Implement a privilege-free mechanism to build a Docker-archive tar from a local vector DB directory using skopeo + pure Python layer composition. Includes skopeo RPM dependency (and transitive deps) for aarch64/x86_64, and makes generate-rpm-lock.sh executable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force-pushed from b8443e4 to d51e443 (Compare)
Squashed commits too!
are-ces
left a comment
TYVM very nice work, LGTM
Description
Adds the ability to produce the generated vector database embedded in a
container image, in addition to writing it to the filesystem.
When `--output-image` is supplied, `generate_embeddings.py` packages the vector DB (and optionally the embedding model) into an OCI-compatible archive (`.tar`) after the embedding pipeline completes. The archive can be loaded with `podman load` and pushed to any OCI registry.

The image is built entirely in Python + skopeo — no container daemon, elevated privileges, or user-namespace configuration required. This works safely inside an unprivileged container (e.g. the RAG tool container image).
Architecture: privilege-free image build

- `/usr/bin/skopeo copy docker://<base-image> docker-archive:<tmp.tar>` — pulls the base image with zero daemon/namespace requirements.
- The base docker-archive is parsed in Python (manifest, layer names and metadata only — layer data is not loaded into memory).
- A custom data layer is built in Python (`/rag/vector_db/` + optional `/rag/embeddings_model/`); its SHA-256 is computed by streaming the file in chunks (see the sketch below).
- The image config is updated: the new `diff_id` is appended, `User: 65532:65532` is set, and a history entry is added.
- The final archive is assembled by streaming base layers directly from the skopeo temp archive and the custom layer from its temp file — peak RAM is independent of image size.
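For illustration, the chunked hashing referenced above typically looks like this minimal sketch (the helper name and chunk size are assumptions, not the module's exact code):

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Digest a file in fixed-size chunks so peak RAM stays constant."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```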
The default base image is full UBI 9 (`ubi9/ubi:latest`).

Changes
- `src/lightspeed_rag_content/image_builder.py`: Pure-Python + skopeo utility that builds a Docker-archive tar by fetching the base image via `/usr/bin/skopeo copy` and stacking a custom data layer in Python. Layer construction and SHA-256 hashing are fully streaming (temp files, chunked reads) to keep peak memory independent of image size. `skopeo` calls have a 5-minute timeout, and failures surface stderr in the exception message (sketched below, after this list).
- `src/lightspeed_rag_content/Containerfile.output-image` (new): Static canonical specification for the output image layout. Uses `ubi9/ubi:latest`.
- `scripts/generate_embeddings.py`: Five new optional CLI flags:
  - `--output-image <path>` — path for the output `.tar` archive
  - `--image-name` — repository name in the archive (default: `rag-content-output`)
  - `--image-tag` — image tag (default: `latest`)
  - `--exclude-model` — exclude the embedding model from the image
  - `--base-image` — override the base image (default: `ubi9/ubi:latest`)

  When `--output-image` is set and the model is included (the default), the model directory is validated to exist before building; a missing directory raises `FileNotFoundError` with a hint to use `--exclude-model`.
- `Containerfile`: Installs `skopeo` for the privilege-free base image fetch.
- `tests/test_image_builder.py`: 29 unit tests covering the static Containerfile, layer tar construction, skopeo invocation, docker-archive parsing, streaming SHA-256, and the full `build_image_archive` pipeline.
- `tests/test_generate_embeddings.py`: Updated for the new CLI flags and model-dir validation; includes a `FileNotFoundError` guard test.

All 154 tests pass.
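The timeout-and-stderr behavior described above can be sketched as follows; the exact command assembly and error type are assumptions, not the module's actual code:

```python
import subprocess


def skopeo_copy_to_archive(base_image: str, dest_tar: str) -> None:
    """Pull base_image into a docker-archive tar without a daemon.

    Assumed shape: a 5-minute cap and captured stderr so failures
    surface an actionable message to the caller.
    """
    cmd = [
        "/usr/bin/skopeo", "copy",
        f"docker://{base_image}",
        f"docker-archive:{dest_tar}",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
    if result.returncode != 0:
        raise RuntimeError(f"skopeo copy failed: {result.stderr.strip()}")
```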
Usage
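A minimal sketch of driving the builder from Python; the paths are hypothetical, and the keyword names follow the `build_image_archive` docstring quoted in the review above. The resulting archive can then be imported with `podman load` and pushed to a registry:

```python
from lightspeed_rag_content.image_builder import build_image_archive

build_image_archive(
    vector_db_dir="output/vector_db",                 # required input
    output_tar_path="output/rag-content-output.tar",  # docker-archive destination
    image_name="rag-content-output",
    image_tag="latest",
    extra_dirs={"output/embeddings_model": "/rag/embeddings_model"},
    base_image="ubi9/ubi:latest",
)
```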
Type of change
Tools used to create PR
Related Tickets & Documents
Checklist before requesting a review
Testing
All 154 tests pass (`uv run pytest tests/ -v`). The image archive pipeline is covered by 29 unit tests in `test_image_builder.py` that mock the `skopeo` subprocess and use real on-disk docker-archive construction.

Summary by CodeRabbit
Release Notes
New Features

- `--output-image` flag to package the generated vector DB as a container image archive
- `--image-name`, `--image-tag`, `--exclude-model`, and `--base-image` for image customization
- Makefile targets `build-tool-image` and `generate-output-image` added for container workflows

Documentation