Added YAML frontmatter support for metadata generation. by romartin · Pull Request #103 · lightspeed-core/rag-content

romartin · 2026-03-20T11:43:24Z

Description

Adds support for reading YAML frontmatter in Markdown files and using it for metadata generation (title and URL).

Type of change

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

Assisted-by: Claude
Generated by: Claude

Related Tickets & Documents

Related Issue: https://redhat.atlassian.net/browse/AAP-66454

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Testing

Tested locally; metadata is generated correctly.

Summary by CodeRabbit

New Features
- Support for YAML frontmatter to source document titles and URLs.
- Hermetic build mode to skip URL reachability checks during processing.
Documentation
- README updated with frontmatter behavior and hermetic-build guidance.
Chores
- Added a frontmatter runtime dependency; regenerated lockfiles and updated pinned hashes and wheel pins.
- Updated CI/pipeline prefetch configuration.
Tests
- Added tests for frontmatter parsing, URL selection, reachability, and hermetic mode.

coderabbitai · 2026-03-20T11:43:39Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Adds YAML frontmatter parsing to MetadataProcessor (title and URL precedence), introduces a hermetic_build flag to skip URL reachability checks, adds the python-frontmatter runtime dependency, and updates lockfiles and CI prefetch package lists.
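The hermetic-build behaviour described in the walkthrough can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `reachability` and `fake_ping` are hypothetical names, and the choice to return `None` for "not checked" is an assumption made for the example.

```python
from typing import Callable, List, Optional


def reachability(
    url: str, ping: Callable[[str], bool], hermetic_build: bool = False
) -> Optional[bool]:
    """Return the ping result, or None when the check is skipped."""
    if hermetic_build:
        # Hermetic builds are assumed to have no network access,
        # so the reachability check is skipped entirely.
        return None
    return ping(url)


calls: List[str] = []


def fake_ping(url: str) -> bool:
    """Stand-in for a real HTTP check; records every URL it is asked about."""
    calls.append(url)
    return url.startswith("https://")


skipped = reachability("https://example.com/doc", fake_ping, hermetic_build=True)
checked = reachability("https://example.com/doc", fake_ping)
```

Injecting the `ping` callable keeps the sketch testable without network access, which is the same property a hermetic build needs.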

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dependency manifests & locks: `pyproject.toml`, `requirements-build.txt`, `requirements.hashes.source.txt`, `requirements.hashes.wheel.pypi.txt`, `requirements.hashes.wheel.txt` | Added runtime dependency `python-frontmatter`; regenerated/updated lock/hashes files with multiple package version/hash changes and edits to provenance comments; a wheel pin bumped (`sqlite-vec`), and some wheel hash entries removed. |
| Metadata processor: `src/lightspeed_rag_content/metadata_processor.py` | Added `MetadataProcessor.__init__(hermetic_build: bool)`; parse top-of-file YAML frontmatter for `title` and `url` via `_get_frontmatter_url()`; `get_file_title()` and `populate()` prefer frontmatter values; `populate()` uses resolved `docs_url` consistently and skips pinging when `hermetic_build=True`. |
| Tests: `tests/test_metadata_processor.py` | Added/updated tests covering frontmatter-derived title and URL precedence, ping invocation behavior, and hermetic_build skipping of URL reachability checks. |
| Documentation: `README.md` | Updated examples and added docs describing YAML frontmatter support and the `hermetic_build` flag; adjusted example `CustomMetadataProcessor` constructor to accept `hermetic_build`. |
| CI / Tekton configs: `.tekton/rag-tool-pull-request.yaml`, `.tekton/rag-tool-push.yaml` | Modified pip prefetch package lists used by Tekton tasks (removed/added package names in parameters); pipeline steps and control flow unchanged. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main feature addition: YAML frontmatter support for metadata generation across the codebase. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

coderabbitai

🧹 Nitpick comments (2)

src/lightspeed_rag_content/metadata_processor.py (2)

58-68: Duplicate frontmatter parsing in populate() flow.

Both _get_frontmatter_url() and get_file_title() parse frontmatter independently when called from populate(). Consider extracting frontmatter once and passing it to both methods to avoid redundant I/O and parsing.

♻️ Suggested approach: Extract frontmatter once

def _load_frontmatter(self, file_path: str) -> dict | None:
    """Load frontmatter from file if present."""
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            if file.readline().startswith("---"):
                file.seek(0)  # rewind so the parser sees the opening delimiter
                post = frontmatter.load(file)  # reuse the open handle
                return dict(post.metadata) if post.metadata else None
    except Exception:  # noqa: S110 pylint: disable=broad-exception-caught
        pass
    return None

Then use this in populate() to extract both title and url from a single parse.
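To make the "parse once, read both fields" idea concrete, here is a small standard-library sketch. It is only an illustration under stated assumptions: `split_frontmatter` is a hypothetical helper that understands flat `key: value` lines between two `---` delimiters, and it is not a substitute for the `python-frontmatter` parser the PR actually uses.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split text into (metadata, body).

    Simplified stand-in for a real frontmatter parser: it only handles
    flat ``key: value`` lines between two ``---`` delimiter lines.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    metadata: Dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the body.
            return metadata, "\n".join(lines[index + 1:])
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    # No closing delimiter: treat the whole text as body.
    return {}, text


DOC = """---
title: Installing the operator
url: https://example.com/docs/install
---
# Install

Body text.
"""

# Parse exactly once, then read both fields from the same dict.
metadata, body = split_frontmatter(DOC)
title = metadata.get("title", "")
url = metadata.get("url")
```

Passing the parsed dict to both the title and URL lookups is what removes the second file read and second parse.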

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_rag_content/metadata_processor.py` around lines 58 - 68,
Duplicate frontmatter parsing occurs because _get_frontmatter_url and
get_file_title each call frontmatter.load during populate; create a single
helper (e.g. _load_frontmatter) that opens and parses frontmatter once and
returns the metadata dict (or None), then update populate to call
_load_frontmatter(file_path) and pass that metadata to both get_file_title and
_get_frontmatter_url (or refactor those to accept metadata instead of reading
the file themselves) to eliminate redundant I/O and parsing.

47-53: Consider avoiding double file read for efficiency.

The current implementation opens the file twice: once to check for --- prefix, then frontmatter.load(file_path) reopens it. While functionally correct, this could be simplified.

♻️ Suggested refactor using frontmatter directly

     def get_file_title(self, file_path: str) -> str:
         """Extract title from the plaintext doc file."""
         title = ""
         try:
             with open(file_path, "r", encoding="utf-8") as file:
-                first_line = file.readline()
-                if first_line.startswith("---"):
-                    post = frontmatter.load(file_path)
-                    title = post.get("title", "")
-                else:
-                    title = first_line.rstrip("\n").lstrip("# ")
+                post = frontmatter.load(file)
+                if post.metadata and "title" in post.metadata:
+                    title = post.get("title", "")
+                else:
+                    # No frontmatter or no title in frontmatter, read first line
+                    file.seek(0)
+                    first_line = file.readline()
+                    title = first_line.rstrip("\n").lstrip("# ")
         except Exception:  # noqa: S110 pylint: disable=broad-exception-caught
             pass
         return title

Note: frontmatter.load() accepts a file object, which avoids reopening the file.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_rag_content/metadata_processor.py` around lines 47 - 53, Avoid
reopening the file: open file_path once as file, read first_line from that file
object, and if it startswith("---") rewind the file (file.seek(0)) and call
frontmatter.load(file) passing the file object (not file_path) to get post and
title; otherwise derive title from first_line as before. Update references to
first_line, post, title, and frontmatter.load accordingly to use the single open
file object.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3ff9d38d-a179-4682-9c39-f22b4f485def

📥 Commits

Reviewing files that changed from the base of the PR and between c750c6f and a4c84e1.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (3)

pyproject.toml
src/lightspeed_rag_content/metadata_processor.py
tests/test_metadata_processor.py

syedriko · 2026-03-20T13:37:14Z

@romartin Looks great, but please run `make konflux-requirements` to update Python dependencies for Konflux:
https://github.com/lightspeed-core/rag-content/blob/main/README.md#updating-dependencies-for-hermetic-builds

romartin · 2026-03-20T18:49:33Z

@syedriko Thanks for the pointer! Konflux python deps updated, a new commit has been added.

syedriko · 2026-03-20T18:53:45Z

/ok-to-test

are-ces

LGTM, but please update the documentation to cover this new feature.

are-ces · 2026-03-23T14:39:46Z

/ok-to-test

+        except Exception:  # noqa: S110 pylint: disable=broad-exception-caught
+            pass


coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

src/lightspeed_rag_content/metadata_processor.py (1)

54-55: ⚠️ Potential issue | 🟡 Minor

Avoid silent exception swallowing in metadata extraction paths.

Line 54 and Line 66 catch broad exceptions and pass, which hides parse failures and makes bad metadata hard to diagnose. Please at least log the failure with file_path.

💡 Proposed fix

-        except Exception:  # noqa: S110 pylint: disable=broad-exception-caught
-            pass
+        except Exception:  # noqa: S110 pylint: disable=broad-exception-caught
+            LOG.warning("Failed to extract title from %s", file_path, exc_info=True)
         return title
@@
-        except Exception:  # noqa: S110 pylint: disable=broad-exception-caught
-            pass
+        except Exception:  # noqa: S110 pylint: disable=broad-exception-caught
+            LOG.warning(
+                "Failed to extract frontmatter URL from %s", file_path, exc_info=True
+            )
         return None

Also applies to: 66-67
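As a self-contained illustration of the logging approach, the sketch below reads a title from the first line of a file and logs a warning with the traceback when the read fails. Narrowing the handler to `OSError` and `UnicodeDecodeError` is an assumption made for this example; the PR's code catches `Exception`.

```python
import logging

LOG = logging.getLogger(__name__)


def get_file_title(file_path: str) -> str:
    """Return the first line of the file as a title, or "" on failure."""
    title = ""
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            title = file.readline().rstrip("\n").lstrip("# ")
    except (OSError, UnicodeDecodeError):
        # Record which file failed and why instead of passing silently.
        LOG.warning("Failed to extract title from %s", file_path, exc_info=True)
    return title
```

With `exc_info=True` the traceback is attached to the log record, so a bad file can be traced without changing the function's return contract.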

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_rag_content/metadata_processor.py` around lines 54 - 55,
Replace the silent "except Exception: pass" handlers in the metadata extraction
code with logging that records the file_path and the exception details so
failures are visible; locate the broad except blocks in
src/lightspeed_rag_content/metadata_processor.py (the metadata extraction
function(s) where parsing of files occurs) and change them to call the module
logger (or logger.exception/logging.exception) with a message like "Failed to
extract metadata for {file_path}" including the caught exception traceback, then
allow the function to continue or re-raise as appropriate for the caller's
error-handling policy.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 19697ef7-07cf-466d-b2f1-7be89616238f

📥 Commits

Reviewing files that changed from the base of the PR and between 5f58e17 and b3362c1.

📒 Files selected for processing (1)

src/lightspeed_rag_content/metadata_processor.py

coderabbitai · 2026-03-23T19:36:23Z

+                if first_line.startswith("---"):
+                    post = frontmatter.load(file_path)
+                    title = str(post.get("title", ""))
+                else:
+                    title = first_line.rstrip("\n").lstrip("# ")


⚠️ Potential issue | 🟠 Major

Missing fallback when frontmatter exists but title is absent.

Line 49 switches to frontmatter parsing, but at Line 51 the title can become empty ("") and no body-title fallback runs. This can emit blank metadata titles for valid markdown files.

💡 Proposed fix

                 if first_line.startswith("---"):
                     post = frontmatter.load(file_path)
-                    title = str(post.get("title", ""))
+                    fm_title = post.get("title")
+                    if fm_title is not None and str(fm_title).strip():
+                        title = str(fm_title)
+                    else:
+                        # Fallback to first non-empty line after frontmatter
+                        in_frontmatter = False
+                        file.seek(0)
+                        for idx, line in enumerate(file):
+                            if idx == 0 and line.startswith("---"):
+                                in_frontmatter = True
+                                continue
+                            if in_frontmatter:
+                                if line.startswith("---"):
+                                    in_frontmatter = False
+                                continue
+                            if line.strip():
+                                title = line.rstrip("\n").lstrip("# ")
+                                break
                 else:
                     title = first_line.rstrip("\n").lstrip("# ")
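The fallback logic can also be exercised in isolation. The following sketch assumes the frontmatter title has already been read by some parser; `title_from_body` and `resolve_title` are hypothetical helper names, not functions from the PR.

```python
from typing import Iterable


def title_from_body(lines: Iterable[str]) -> str:
    """Return the first non-empty line after a leading frontmatter block."""
    in_frontmatter = False
    for index, line in enumerate(lines):
        if index == 0 and line.startswith("---"):
            in_frontmatter = True
            continue
        if in_frontmatter:
            if line.startswith("---"):
                in_frontmatter = False
            continue
        if line.strip():
            # lstrip("# ") removes any leading '#' and space characters.
            return line.rstrip("\n").lstrip("# ").strip()
    return ""


def resolve_title(frontmatter_title: object, lines: Iterable[str]) -> str:
    """Prefer a non-blank frontmatter title, else fall back to the body."""
    candidate = "" if frontmatter_title is None else str(frontmatter_title).strip()
    return candidate or title_from_body(lines)


doc = ["---\n", "url: https://example.com\n", "---\n", "\n", "# Getting started\n"]
from_body = resolve_title(None, doc)
from_frontmatter = resolve_title("From frontmatter", doc)
```

Note that deriving the fallback from the file's first line would not work in the frontmatter branch, because that line is the `---` delimiter itself; the body has to be scanned past the closing delimiter.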

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_rag_content/metadata_processor.py` around lines 49 - 53, When frontmatter exists but post.get("title", "") returns an empty string, ensure you fall back to deriving the title from the document body (same logic used in the else branch). In the frontmatter branch (where frontmatter.load(file_path) is called), set title = str(post.get("title", "")). If that result is falsy/empty, compute title from first_line using first_line.rstrip("\n").lstrip("# "), so title is never blank; keep the variables frontmatter.load, post, title, and first_line as the reference points.

romartin · 2026-03-23T19:39:39Z

@are-ces @syedriko

Added two new commits: one attempting to fix the CI, and another adding the docs.

Here is the summary of the updates on the README file:

  - New "YAML Frontmatter Support" subsection (before the Faiss Vector Store section) with:
    - A table of the two recognised fields (`title`, `url`)
    - A concrete example markdown file with frontmatter
    - Explanation of fallback behaviour when fields are absent
    - A "Hermetic Builds" sub-subsection explaining `hermetic_build=True` and its effect
  - Updated `CustomMetadataProcessor` example to call `super().__init__()` (so `hermetic_build` is properly forwarded) and added a comment clarifying that `url_function` is the fallback when frontmatter has no `url` field.
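For readers without the README at hand, the constructor forwarding mentioned in the last bullet might look roughly like this. The base class below is a minimal stand-in written for this example, not the library's `MetadataProcessor`; the `base_url` parameter and the default `hermetic_build=False` are assumptions.

```python
from abc import ABC, abstractmethod


class MetadataProcessor(ABC):
    """Minimal stand-in for the library base class (hypothetical)."""

    def __init__(self, hermetic_build: bool = False) -> None:
        self.hermetic_build = hermetic_build

    @abstractmethod
    def url_function(self, file_path: str) -> str:
        """Return the URL to use when frontmatter provides none."""


class CustomMetadataProcessor(MetadataProcessor):
    """Derives a URL from the file path when frontmatter has no url field."""

    def __init__(self, base_url: str, hermetic_build: bool = False) -> None:
        # Forward the flag so the base class can honour it.
        super().__init__(hermetic_build=hermetic_build)
        self.base_url = base_url.rstrip("/")

    def url_function(self, file_path: str) -> str:
        # Fallback only: a frontmatter url, when present, takes precedence.
        return f"{self.base_url}/{file_path.lstrip('/')}"


processor = CustomMetadataProcessor("https://example.com/docs/", hermetic_build=True)
fallback_url = processor.url_function("guide/install.md")
```

Without the `super().__init__()` call, a subclass that defines its own constructor would never set the flag on the base class.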

Thanks!

romartin · 2026-03-23T20:45:09Z

Rebased

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

src/lightspeed_rag_content/metadata_processor.py (1)

50-54: ⚠️ Potential issue | 🟠 Major

Frontmatter branch can return blank titles without fallback.

On Line 52, if title is absent/empty in frontmatter, body-title extraction never runs, so metadata title can be empty.

💡 Proposed fix

-                if first_line.startswith("---"):
-                    post = frontmatter.load(file_path)
-                    title = str(post.get("title", ""))
+                if first_line.startswith("---"):
+                    post = frontmatter.load(file_path)
+                    fm_title = str(post.get("title", "")).strip()
+                    if fm_title:
+                        title = fm_title
+                    else:
+                        in_frontmatter = True
+                        for line in file:
+                            if in_frontmatter:
+                                if line.startswith("---"):
+                                    in_frontmatter = False
+                                continue
+                            if line.strip():
+                                title = line.rstrip("\n").lstrip("# ").strip()
+                                break
                 else:
                     title = first_line.rstrip("\n").lstrip("# ")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_rag_content/metadata_processor.py` around lines 50 - 54, The
frontmatter branch (using frontmatter.load(file_path)) can yield an empty title
and currently skips the body-based fallback; update the logic in
metadata_processor.py so after loading post you set title =
str(post.get("title", "")).strip() and if that result is empty then fall back to
the existing body-first-line extraction (the same logic used in the else branch
that trims "#" and newlines from first_line). Ensure you reference the same
variables (file_path, post, first_line, title) so the frontmatter path uses the
body-derived title when post.get("title") is falsy.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9f64ac07-a699-4f46-88e5-a160927f701d

📥 Commits

Reviewing files that changed from the base of the PR and between 4f7162b and 4600d1b.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (10)

.tekton/rag-tool-pull-request.yaml
.tekton/rag-tool-push.yaml
README.md
pyproject.toml
requirements-build.txt
requirements.hashes.source.txt
requirements.hashes.wheel.pypi.txt
requirements.hashes.wheel.txt
src/lightspeed_rag_content/metadata_processor.py
tests/test_metadata_processor.py

💤 Files with no reviewable changes (1)

requirements.hashes.wheel.txt

✅ Files skipped from review due to trivial changes (4)

pyproject.toml
requirements.hashes.wheel.pypi.txt
README.md
requirements-build.txt

🚧 Files skipped from review as they are similar to previous changes (2)

.tekton/rag-tool-pull-request.yaml
.tekton/rag-tool-push.yaml

coderabbitai · 2026-03-23T20:49:48Z

+                    url = post.get("url")
+                    return str(url) if url is not None else None


⚠️ Potential issue | 🟡 Minor

Normalize frontmatter url before returning it.

Line 66 currently accepts whitespace-only values as valid URLs, which can override url_function and degrade metadata quality.

💡 Proposed fix

-                    url = post.get("url")
-                    return str(url) if url is not None else None
+                    url = post.get("url")
+                    if url is None:
+                        return None
+                    normalized = str(url).strip()
+                    return normalized or None

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_rag_content/metadata_processor.py` around lines 65 - 66, The code currently returns str(url) for any non-None frontmatter value, allowing whitespace-only strings to be treated as valid URLs; update the logic where url = post.get("url") (in metadata_processor.py) to normalize and validate the value by stripping surrounding whitespace, and return None if the stripped string is empty so that whitespace-only frontmatter does not override url_function; ensure you return the stripped string (not the original) when valid.
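A short sketch of how the normalized value interacts with the fallback, assuming the frontmatter value has already been extracted. `normalize_url`, `resolve_url`, and `fallback` are hypothetical names used only for this illustration.

```python
from typing import Callable, Optional


def normalize_url(value: object) -> Optional[str]:
    """Return the stripped URL, or None for missing or blank values."""
    if value is None:
        return None
    normalized = str(value).strip()
    return normalized or None


def resolve_url(
    frontmatter_url: object, file_path: str, url_function: Callable[[str], str]
) -> str:
    """Prefer a usable frontmatter URL, else call the fallback function."""
    return normalize_url(frontmatter_url) or url_function(file_path)


def fallback(file_path: str) -> str:
    """Stand-in for a processor's url_function."""
    return "https://example.com/" + file_path


blank = resolve_url("   ", "doc.md", fallback)
explicit = resolve_url(" https://a.example/x ", "doc.md", fallback)
```

Here a whitespace-only frontmatter value no longer overrides the fallback, and a valid value is returned without its surrounding whitespace.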

syedriko · 2026-03-23T21:46:43Z

/ok-to-test

@romartin

@romartin has updated the documentation

syedriko

LGTM

coderabbitai Bot reviewed Mar 20, 2026

romartin mentioned this pull request Mar 22, 2026

Add DefaultMetadataProcessor and generate_embeddings.py CLI script #104

Merged

18 tasks

are-ces self-assigned this Mar 23, 2026

are-ces previously requested changes Mar 23, 2026

github-advanced-security AI found potential problems Mar 23, 2026

Comment thread src/lightspeed_rag_content/metadata_processor.py

Comment on lines +66 to +67

        except Exception:  # noqa: S110 pylint: disable=broad-exception-caught
            pass

Check notice: Code scanning / Bandit

Try, Except, Pass detected.

coderabbitai Bot reviewed Mar 23, 2026

romartin added 4 commits March 23, 2026 21:44

Added YAML frontmatter support for metadata generation.

fd2a41c

updated Python dependencies for Konflux.

3a21c21

Fix CI issue in step "Run Pyright tests"

1433936

Updated the documentation considering YAML frontmatter support (READM…

4600d1b

…E.md)

romartin force-pushed the yaml-frontmatter branch from 4f7162b to 4600d1b Compare March 23, 2026 20:44

coderabbitai Bot reviewed Mar 23, 2026

romartin mentioned this pull request Mar 23, 2026

BYOK Examples ansible/aap-rag-content#298

Merged

syedriko approved these changes Mar 23, 2026

syedriko merged commit 94246bf into lightspeed-core:main Mar 23, 2026
17 checks passed
