Skip to content

Added YAML frontmatter support for metadata generation.#103

Merged
syedriko merged 4 commits intolightspeed-core:mainfrom
romartin:yaml-frontmatter
Mar 23, 2026
Merged

Added YAML frontmatter support for metadata generation.#103
syedriko merged 4 commits intolightspeed-core:mainfrom
romartin:yaml-frontmatter

Conversation

@romartin
Copy link
Copy Markdown
Contributor

@romartin romartin commented Mar 20, 2026

Description

Adding support for consuming the YAML frontmatter (in MD files), for metadata generation (title and url).

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude
  • Generated by: Claude

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

Tested locally, metadata generated properly.

Summary by CodeRabbit

  • New Features

    • Support for YAML frontmatter to source document titles and URLs.
    • Hermetic build mode to skip URL reachability checks during processing.
  • Documentation

    • README updated with frontmatter behavior and hermetic-build guidance.
  • Chores

    • Added a frontmatter runtime dependency; regenerated lockfiles and updated pinned hashes and wheel pins.
    • Updated CI/pipeline prefetch configuration.
  • Tests

    • Added tests for frontmatter parsing, URL selection, reachability, and hermetic mode.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds YAML frontmatter parsing to MetadataProcessor (title and URL precedence), introduces a hermetic_build flag to skip URL reachability checks, adds the python-frontmatter runtime dependency, and updates lockfiles and CI prefetch package lists.

Changes

Cohort / File(s) Summary
Dependency manifests & locks
pyproject.toml, requirements-build.txt, requirements.hashes.source.txt, requirements.hashes.wheel.pypi.txt, requirements.hashes.wheel.txt
Added runtime dependency python-frontmatter; regenerated/updated lock/hashes files with multiple package version/hash changes and edits to provenance comments; a wheel pin bumped (sqlite-vec), and some wheel hash entries removed.
Metadata processor
src/lightspeed_rag_content/metadata_processor.py
Added MetadataProcessor.__init__(hermetic_build: bool); parse top-of-file YAML frontmatter for title and url via _get_frontmatter_url(); get_file_title() and populate() prefer frontmatter values; populate() uses resolved docs_url consistently and skips pinging when hermetic_build=True.
Tests
tests/test_metadata_processor.py
Added/updated tests covering frontmatter-derived title and URL precedence, ping invocation behavior, and hermetic_build skipping of URL reachability checks.
Documentation
README.md
Updated examples and added docs describing YAML frontmatter support and the hermetic_build flag; adjusted example CustomMetadataProcessor constructor to accept hermetic_build.
CI / Tekton configs
.tekton/rag-tool-pull-request.yaml, .tekton/rag-tool-push.yaml
Modified pip prefetch package lists used by Tekton tasks (removed/added package names in parameters); pipeline steps and control flow unchanged.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main feature addition: YAML frontmatter support for metadata generation across the codebase.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
src/lightspeed_rag_content/metadata_processor.py (2)

58-68: Duplicate frontmatter parsing in populate() flow.

Both _get_frontmatter_url() and get_file_title() parse frontmatter independently when called from populate(). Consider extracting frontmatter once and passing it to both methods to avoid redundant I/O and parsing.

♻️ Suggested approach: Extract frontmatter once
def _load_frontmatter(self, file_path: str) -> dict | None:
    """Load frontmatter from file if present."""
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            if file.readline().startswith("---"):
                post = frontmatter.load(file_path)
                return dict(post.metadata) if post.metadata else None
    except Exception:
        pass
    return None

Then use this in populate() to extract both title and url from a single parse.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_rag_content/metadata_processor.py` around lines 58 - 68,
Duplicate frontmatter parsing occurs because _get_frontmatter_url and
get_file_title each call frontmatter.load during populate; create a single
helper (e.g. _load_frontmatter) that opens and parses frontmatter once and
returns the metadata dict (or None), then update populate to call
_load_frontmatter(file_path) and pass that metadata to both get_file_title and
_get_frontmatter_url (or refactor those to accept metadata instead of reading
the file themselves) to eliminate redundant I/O and parsing.

47-53: Consider avoiding double file read for efficiency.

The current implementation opens the file twice: once to check for --- prefix, then frontmatter.load(file_path) reopens it. While functionally correct, this could be simplified.

♻️ Suggested refactor using frontmatter directly
     def get_file_title(self, file_path: str) -> str:
         """Extract title from the plaintext doc file."""
         title = ""
         try:
             with open(file_path, "r", encoding="utf-8") as file:
-                first_line = file.readline()
-                if first_line.startswith("---"):
-                    post = frontmatter.load(file_path)
-                    title = post.get("title", "")
-                else:
-                    title = first_line.rstrip("\n").lstrip("# ")
+                post = frontmatter.load(file)
+                if post.metadata and "title" in post.metadata:
+                    title = post.get("title", "")
+                else:
+                    # No frontmatter or no title in frontmatter, read first line
+                    file.seek(0)
+                    first_line = file.readline()
+                    title = first_line.rstrip("\n").lstrip("# ")
         except Exception:  # noqa: S110 pylint: disable=broad-exception-caught
             pass
         return title

Note: frontmatter.load() accepts a file object, which avoids reopening the file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_rag_content/metadata_processor.py` around lines 47 - 53, Avoid
reopening the file: open file_path once as file, read first_line from that file
object, and if it startswith("---") rewind the file (file.seek(0)) and call
frontmatter.load(file) passing the file object (not file_path) to get post and
title; otherwise derive title from first_line as before. Update references to
first_line, post, title, and frontmatter.load accordingly to use the single open
file object.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/lightspeed_rag_content/metadata_processor.py`:
- Around line 58-68: Duplicate frontmatter parsing occurs because
_get_frontmatter_url and get_file_title each call frontmatter.load during
populate; create a single helper (e.g. _load_frontmatter) that opens and parses
frontmatter once and returns the metadata dict (or None), then update populate
to call _load_frontmatter(file_path) and pass that metadata to both
get_file_title and _get_frontmatter_url (or refactor those to accept metadata
instead of reading the file themselves) to eliminate redundant I/O and parsing.
- Around line 47-53: Avoid reopening the file: open file_path once as file, read
first_line from that file object, and if it startswith("---") rewind the file
(file.seek(0)) and call frontmatter.load(file) passing the file object (not
file_path) to get post and title; otherwise derive title from first_line as
before. Update references to first_line, post, title, and frontmatter.load
accordingly to use the single open file object.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3ff9d38d-a179-4682-9c39-f22b4f485def

📥 Commits

Reviewing files that changed from the base of the PR and between c750c6f and a4c84e1.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • pyproject.toml
  • src/lightspeed_rag_content/metadata_processor.py
  • tests/test_metadata_processor.py

@syedriko
Copy link
Copy Markdown
Collaborator

@romartin Looks great, but please run make konflux-requirements to update Python dependencies for Konflux:
https://github.com/lightspeed-core/rag-content/blob/main/README.md#updating-dependencies-for-hermetic-builds

@romartin
Copy link
Copy Markdown
Contributor Author

@syedriko Thanks for the pointer! Konflux python deps updated, a new commit has been added.

@syedriko
Copy link
Copy Markdown
Collaborator

/ok-to-test

Copy link
Copy Markdown
Contributor

@are-ces are-ces left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but please update the documentation with this new feature

@are-ces
Copy link
Copy Markdown
Contributor

are-ces commented Mar 23, 2026

/ok-to-test

Comment on lines +66 to +67
except Exception: # noqa: S110 pylint: disable=broad-exception-caught
pass

Check notice

Code scanning / Bandit

Try, Except, Pass detected. Note

Try, Except, Pass detected.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/lightspeed_rag_content/metadata_processor.py (1)

54-55: ⚠️ Potential issue | 🟡 Minor

Avoid silent exception swallowing in metadata extraction paths.

Line 54 and Line 66 catch broad exceptions and pass, which hides parse failures and makes bad metadata hard to diagnose. Please at least log the failure with file_path.

💡 Proposed fix
-        except Exception:  # noqa: S110 pylint: disable=broad-exception-caught
-            pass
+        except Exception:  # noqa: S110 pylint: disable=broad-exception-caught
+            LOG.warning("Failed to extract title from %s", file_path, exc_info=True)
         return title
@@
-        except Exception:  # noqa: S110 pylint: disable=broad-exception-caught
-            pass
+        except Exception:  # noqa: S110 pylint: disable=broad-exception-caught
+            LOG.warning(
+                "Failed to extract frontmatter URL from %s", file_path, exc_info=True
+            )
         return None

Also applies to: 66-67

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_rag_content/metadata_processor.py` around lines 54 - 55,
Replace the silent "except Exception: pass" handlers in the metadata extraction
code with logging that records the file_path and the exception details so
failures are visible; locate the broad except blocks in
src/lightspeed_rag_content/metadata_processor.py (the metadata extraction
function(s) where parsing of files occurs) and change them to call the module
logger (or logger.exception/logging.exception) with a message like "Failed to
extract metadata for {file_path}" including the caught exception traceback, then
allow the function to continue or re-raise as appropriate for the caller's
error-handling policy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_rag_content/metadata_processor.py`:
- Around line 49-53: When frontmatter exists but post.get("title", "") returns
an empty string, ensure you fall back to deriving the title from the document
body (same logic used in the else branch). In the frontmatter branch (where
frontmatter.load(file_path) is called), set title = str(post.get("title", "")).
If that result is falsy/empty, compute title from first_line using
first_line.rstrip("\n").lstrip("# "), so title is never blank; keep the
variables frontmatter.load, post, title, and first_line as the reference points.

---

Duplicate comments:
In `@src/lightspeed_rag_content/metadata_processor.py`:
- Around line 54-55: Replace the silent "except Exception: pass" handlers in the
metadata extraction code with logging that records the file_path and the
exception details so failures are visible; locate the broad except blocks in
src/lightspeed_rag_content/metadata_processor.py (the metadata extraction
function(s) where parsing of files occurs) and change them to call the module
logger (or logger.exception/logging.exception) with a message like "Failed to
extract metadata for {file_path}" including the caught exception traceback, then
allow the function to continue or re-raise as appropriate for the caller's
error-handling policy.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 19697ef7-07cf-466d-b2f1-7be89616238f

📥 Commits

Reviewing files that changed from the base of the PR and between 5f58e17 and b3362c1.

📒 Files selected for processing (1)
  • src/lightspeed_rag_content/metadata_processor.py

Comment on lines +49 to +53
if first_line.startswith("---"):
post = frontmatter.load(file_path)
title = str(post.get("title", ""))
else:
title = first_line.rstrip("\n").lstrip("# ")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Missing fallback when frontmatter exists but title is absent.

Line 49 switches to frontmatter parsing, but at Line 51 the title can become empty ("") and no body-title fallback runs. This can emit blank metadata titles for valid markdown files.

💡 Proposed fix
                 if first_line.startswith("---"):
                     post = frontmatter.load(file_path)
-                    title = str(post.get("title", ""))
+                    fm_title = post.get("title")
+                    if fm_title is not None and str(fm_title).strip():
+                        title = str(fm_title)
+                    else:
+                        # Fallback to first non-empty line after frontmatter
+                        in_frontmatter = False
+                        file.seek(0)
+                        for idx, line in enumerate(file):
+                            if idx == 0 and line.startswith("---"):
+                                in_frontmatter = True
+                                continue
+                            if in_frontmatter:
+                                if line.startswith("---"):
+                                    in_frontmatter = False
+                                continue
+                            if line.strip():
+                                title = line.rstrip("\n").lstrip("# ")
+                                break
                 else:
                     title = first_line.rstrip("\n").lstrip("# ")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_rag_content/metadata_processor.py` around lines 49 - 53, When
frontmatter exists but post.get("title", "") returns an empty string, ensure you
fall back to deriving the title from the document body (same logic used in the
else branch). In the frontmatter branch (where frontmatter.load(file_path) is
called), set title = str(post.get("title", "")). If that result is falsy/empty,
compute title from first_line using first_line.rstrip("\n").lstrip("# "), so
title is never blank; keep the variables frontmatter.load, post, title, and
first_line as the reference points.

@romartin
Copy link
Copy Markdown
Contributor Author

romartin commented Mar 23, 2026

@are-ces @syedriko

Added 2 new commits, one for trying to fix the CI, another one including docs.

Here is the summary of the updates on the README file:

  - New "YAML Frontmatter Support" subsection (before the Faiss Vector Store section) with:                                                                                                                                         
    - A table of the two recognised fields (title, url)                                                                                                                                                                             
    - A concrete example markdown file with frontmatter                                                                                                                                                                             
    - Explanation of fallback behaviour when fields are absent                                                                                                                                                                      
    - A "Hermetic Builds" sub-subsection explaining hermetic_build=True and its effect                                                                                                                                              
  - Updated CustomMetadataProcessor example to call super().__init__() (so hermetic_build is properly forwarded) and added a comment clarifying that url_function is the fallback when frontmatter has no url field.                                                                                                                                                                

Thanks!

@romartin
Copy link
Copy Markdown
Contributor Author

Rebased

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/lightspeed_rag_content/metadata_processor.py (1)

50-54: ⚠️ Potential issue | 🟠 Major

Frontmatter branch can return blank titles without fallback.

On Line 52, if title is absent/empty in frontmatter, body-title extraction never runs, so metadata title can be empty.

💡 Proposed fix
-                if first_line.startswith("---"):
-                    post = frontmatter.load(file_path)
-                    title = str(post.get("title", ""))
+                if first_line.startswith("---"):
+                    post = frontmatter.load(file_path)
+                    fm_title = str(post.get("title", "")).strip()
+                    if fm_title:
+                        title = fm_title
+                    else:
+                        in_frontmatter = True
+                        for line in file:
+                            if in_frontmatter:
+                                if line.startswith("---"):
+                                    in_frontmatter = False
+                                continue
+                            if line.strip():
+                                title = line.rstrip("\n").lstrip("# ").strip()
+                                break
                 else:
                     title = first_line.rstrip("\n").lstrip("# ")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_rag_content/metadata_processor.py` around lines 50 - 54, The
frontmatter branch (using frontmatter.load(file_path)) can yield an empty title
and currently skips the body-based fallback; update the logic in
metadata_processor.py so after loading post you set title =
str(post.get("title", "")).strip() and if that result is empty then fall back to
the existing body-first-line extraction (the same logic used in the else branch
that trims "#" and newlines from first_line). Ensure you reference the same
variables (file_path, post, first_line, title) so the frontmatter path uses the
body-derived title when post.get("title") is falsy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_rag_content/metadata_processor.py`:
- Around line 65-66: The code currently returns str(url) for any non-None
frontmatter value, allowing whitespace-only strings to be treated as valid URLs;
update the logic where url = post.get("url") (in metadata_processor.py) to
normalize and validate the value by stripping surrounding whitespace, and return
None if the stripped string is empty so that whitespace-only frontmatter does
not override url_function; ensure you return the stripped string (not the
original) when valid.

---

Duplicate comments:
In `@src/lightspeed_rag_content/metadata_processor.py`:
- Around line 50-54: The frontmatter branch (using frontmatter.load(file_path))
can yield an empty title and currently skips the body-based fallback; update the
logic in metadata_processor.py so after loading post you set title =
str(post.get("title", "")).strip() and if that result is empty then fall back to
the existing body-first-line extraction (the same logic used in the else branch
that trims "#" and newlines from first_line). Ensure you reference the same
variables (file_path, post, first_line, title) so the frontmatter path uses the
body-derived title when post.get("title") is falsy.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9f64ac07-a699-4f46-88e5-a160927f701d

📥 Commits

Reviewing files that changed from the base of the PR and between 4f7162b and 4600d1b.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • .tekton/rag-tool-pull-request.yaml
  • .tekton/rag-tool-push.yaml
  • README.md
  • pyproject.toml
  • requirements-build.txt
  • requirements.hashes.source.txt
  • requirements.hashes.wheel.pypi.txt
  • requirements.hashes.wheel.txt
  • src/lightspeed_rag_content/metadata_processor.py
  • tests/test_metadata_processor.py
💤 Files with no reviewable changes (1)
  • requirements.hashes.wheel.txt
✅ Files skipped from review due to trivial changes (4)
  • pyproject.toml
  • requirements.hashes.wheel.pypi.txt
  • README.md
  • requirements-build.txt
🚧 Files skipped from review as they are similar to previous changes (2)
  • .tekton/rag-tool-pull-request.yaml
  • .tekton/rag-tool-push.yaml

Comment on lines +65 to +66
url = post.get("url")
return str(url) if url is not None else None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Normalize frontmatter url before returning it.

Line 66 currently accepts whitespace-only values as valid URLs, which can override url_function and degrade metadata quality.

💡 Proposed fix
-                    url = post.get("url")
-                    return str(url) if url is not None else None
+                    url = post.get("url")
+                    if url is None:
+                        return None
+                    normalized = str(url).strip()
+                    return normalized or None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
url = post.get("url")
return str(url) if url is not None else None
url = post.get("url")
if url is None:
return None
normalized = str(url).strip()
return normalized or None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_rag_content/metadata_processor.py` around lines 65 - 66, The
code currently returns str(url) for any non-None frontmatter value, allowing
whitespace-only strings to be treated as valid URLs; update the logic where url
= post.get("url") (in metadata_processor.py) to normalize and validate the value
by stripping surrounding whitespace, and return None if the stripped string is
empty so that whitespace-only frontmatter does not override url_function; ensure
you return the stripped string (not the original) when valid.

@syedriko
Copy link
Copy Markdown
Collaborator

/ok-to-test

@syedriko syedriko dismissed are-ces’s stale review March 23, 2026 23:14

@romartin has update the documentation

Copy link
Copy Markdown
Collaborator

@syedriko syedriko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@syedriko syedriko merged commit 94246bf into lightspeed-core:main Mar 23, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants