Added YAML frontmatter support for metadata generation.#103
Added YAML frontmatter support for metadata generation.#103syedriko merged 4 commits intolightspeed-core:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds YAML frontmatter parsing to MetadataProcessor (title and URL precedence), introduces a hermetic_build flag to skip URL reachability checks, adds the python-frontmatter runtime dependency, and updates lockfiles and CI prefetch package lists. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
src/lightspeed_rag_content/metadata_processor.py (2)
58-68: Duplicate frontmatter parsing inpopulate()flow.Both
_get_frontmatter_url()andget_file_title()parse frontmatter independently when called frompopulate(). Consider extracting frontmatter once and passing it to both methods to avoid redundant I/O and parsing.♻️ Suggested approach: Extract frontmatter once
def _load_frontmatter(self, file_path: str) -> dict | None: """Load frontmatter from file if present.""" try: with open(file_path, "r", encoding="utf-8") as file: if file.readline().startswith("---"): post = frontmatter.load(file_path) return dict(post.metadata) if post.metadata else None except Exception: pass return NoneThen use this in
populate()to extract bothtitleandurlfrom a single parse.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_rag_content/metadata_processor.py` around lines 58 - 68, Duplicate frontmatter parsing occurs because _get_frontmatter_url and get_file_title each call frontmatter.load during populate; create a single helper (e.g. _load_frontmatter) that opens and parses frontmatter once and returns the metadata dict (or None), then update populate to call _load_frontmatter(file_path) and pass that metadata to both get_file_title and _get_frontmatter_url (or refactor those to accept metadata instead of reading the file themselves) to eliminate redundant I/O and parsing.
47-53: Consider avoiding double file read for efficiency.The current implementation opens the file twice: once to check for
---prefix, thenfrontmatter.load(file_path)reopens it. While functionally correct, this could be simplified.♻️ Suggested refactor using frontmatter directly
def get_file_title(self, file_path: str) -> str: """Extract title from the plaintext doc file.""" title = "" try: with open(file_path, "r", encoding="utf-8") as file: - first_line = file.readline() - if first_line.startswith("---"): - post = frontmatter.load(file_path) - title = post.get("title", "") - else: - title = first_line.rstrip("\n").lstrip("# ") + post = frontmatter.load(file) + if post.metadata and "title" in post.metadata: + title = post.get("title", "") + else: + # No frontmatter or no title in frontmatter, read first line + file.seek(0) + first_line = file.readline() + title = first_line.rstrip("\n").lstrip("# ") except Exception: # noqa: S110 pylint: disable=broad-exception-caught pass return titleNote:
frontmatter.load()accepts a file object, which avoids reopening the file.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_rag_content/metadata_processor.py` around lines 47 - 53, Avoid reopening the file: open file_path once as file, read first_line from that file object, and if it startswith("---") rewind the file (file.seek(0)) and call frontmatter.load(file) passing the file object (not file_path) to get post and title; otherwise derive title from first_line as before. Update references to first_line, post, title, and frontmatter.load accordingly to use the single open file object.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/lightspeed_rag_content/metadata_processor.py`:
- Around line 58-68: Duplicate frontmatter parsing occurs because
_get_frontmatter_url and get_file_title each call frontmatter.load during
populate; create a single helper (e.g. _load_frontmatter) that opens and parses
frontmatter once and returns the metadata dict (or None), then update populate
to call _load_frontmatter(file_path) and pass that metadata to both
get_file_title and _get_frontmatter_url (or refactor those to accept metadata
instead of reading the file themselves) to eliminate redundant I/O and parsing.
- Around line 47-53: Avoid reopening the file: open file_path once as file, read
first_line from that file object, and if it startswith("---") rewind the file
(file.seek(0)) and call frontmatter.load(file) passing the file object (not
file_path) to get post and title; otherwise derive title from first_line as
before. Update references to first_line, post, title, and frontmatter.load
accordingly to use the single open file object.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3ff9d38d-a179-4682-9c39-f22b4f485def
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (3)
pyproject.tomlsrc/lightspeed_rag_content/metadata_processor.pytests/test_metadata_processor.py
|
@romartin Looks great, but please run |
|
@syedriko Thanks for the pointer! Konflux python deps updated, a new commit has been added. |
|
/ok-to-test |
are-ces
left a comment
There was a problem hiding this comment.
LGTM but please update the documentation with this new feature
|
/ok-to-test |
| except Exception: # noqa: S110 pylint: disable=broad-exception-caught | ||
| pass |
Check notice
Code scanning / Bandit
Try, Except, Pass detected. Note
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/lightspeed_rag_content/metadata_processor.py (1)
54-55:⚠️ Potential issue | 🟡 MinorAvoid silent exception swallowing in metadata extraction paths.
Line 54 and Line 66 catch broad exceptions and
pass, which hides parse failures and makes bad metadata hard to diagnose. Please at least log the failure withfile_path.💡 Proposed fix
- except Exception: # noqa: S110 pylint: disable=broad-exception-caught - pass + except Exception: # noqa: S110 pylint: disable=broad-exception-caught + LOG.warning("Failed to extract title from %s", file_path, exc_info=True) return title @@ - except Exception: # noqa: S110 pylint: disable=broad-exception-caught - pass + except Exception: # noqa: S110 pylint: disable=broad-exception-caught + LOG.warning( + "Failed to extract frontmatter URL from %s", file_path, exc_info=True + ) return NoneAlso applies to: 66-67
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_rag_content/metadata_processor.py` around lines 54 - 55, Replace the silent "except Exception: pass" handlers in the metadata extraction code with logging that records the file_path and the exception details so failures are visible; locate the broad except blocks in src/lightspeed_rag_content/metadata_processor.py (the metadata extraction function(s) where parsing of files occurs) and change them to call the module logger (or logger.exception/logging.exception) with a message like "Failed to extract metadata for {file_path}" including the caught exception traceback, then allow the function to continue or re-raise as appropriate for the caller's error-handling policy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lightspeed_rag_content/metadata_processor.py`:
- Around line 49-53: When frontmatter exists but post.get("title", "") returns
an empty string, ensure you fall back to deriving the title from the document
body (same logic used in the else branch). In the frontmatter branch (where
frontmatter.load(file_path) is called), set title = str(post.get("title", "")).
If that result is falsy/empty, compute title from first_line using
first_line.rstrip("\n").lstrip("# "), so title is never blank; keep the
variables frontmatter.load, post, title, and first_line as the reference points.
---
Duplicate comments:
In `@src/lightspeed_rag_content/metadata_processor.py`:
- Around line 54-55: Replace the silent "except Exception: pass" handlers in the
metadata extraction code with logging that records the file_path and the
exception details so failures are visible; locate the broad except blocks in
src/lightspeed_rag_content/metadata_processor.py (the metadata extraction
function(s) where parsing of files occurs) and change them to call the module
logger (or logger.exception/logging.exception) with a message like "Failed to
extract metadata for {file_path}" including the caught exception traceback, then
allow the function to continue or re-raise as appropriate for the caller's
error-handling policy.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 19697ef7-07cf-466d-b2f1-7be89616238f
📒 Files selected for processing (1)
src/lightspeed_rag_content/metadata_processor.py
| if first_line.startswith("---"): | ||
| post = frontmatter.load(file_path) | ||
| title = str(post.get("title", "")) | ||
| else: | ||
| title = first_line.rstrip("\n").lstrip("# ") |
There was a problem hiding this comment.
Missing fallback when frontmatter exists but title is absent.
Line 49 switches to frontmatter parsing, but at Line 51 the title can become empty ("") and no body-title fallback runs. This can emit blank metadata titles for valid markdown files.
💡 Proposed fix
if first_line.startswith("---"):
post = frontmatter.load(file_path)
- title = str(post.get("title", ""))
+ fm_title = post.get("title")
+ if fm_title is not None and str(fm_title).strip():
+ title = str(fm_title)
+ else:
+ # Fallback to first non-empty line after frontmatter
+ in_frontmatter = False
+ file.seek(0)
+ for idx, line in enumerate(file):
+ if idx == 0 and line.startswith("---"):
+ in_frontmatter = True
+ continue
+ if in_frontmatter:
+ if line.startswith("---"):
+ in_frontmatter = False
+ continue
+ if line.strip():
+ title = line.rstrip("\n").lstrip("# ")
+ break
else:
title = first_line.rstrip("\n").lstrip("# ")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lightspeed_rag_content/metadata_processor.py` around lines 49 - 53, When
frontmatter exists but post.get("title", "") returns an empty string, ensure you
fall back to deriving the title from the document body (same logic used in the
else branch). In the frontmatter branch (where frontmatter.load(file_path) is
called), set title = str(post.get("title", "")). If that result is falsy/empty,
compute title from first_line using first_line.rstrip("\n").lstrip("# "), so
title is never blank; keep the variables frontmatter.load, post, title, and
first_line as the reference points.
|
Added 2 new commits, one for trying to fix the CI, another one including docs. Here is the summary of the updates on the README file: Thanks! |
4f7162b to
4600d1b
Compare
|
Rebased |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/lightspeed_rag_content/metadata_processor.py (1)
50-54:⚠️ Potential issue | 🟠 MajorFrontmatter branch can return blank titles without fallback.
On Line 52, if
titleis absent/empty in frontmatter, body-title extraction never runs, so metadata title can be empty.💡 Proposed fix
- if first_line.startswith("---"): - post = frontmatter.load(file_path) - title = str(post.get("title", "")) + if first_line.startswith("---"): + post = frontmatter.load(file_path) + fm_title = str(post.get("title", "")).strip() + if fm_title: + title = fm_title + else: + in_frontmatter = True + for line in file: + if in_frontmatter: + if line.startswith("---"): + in_frontmatter = False + continue + if line.strip(): + title = line.rstrip("\n").lstrip("# ").strip() + break else: title = first_line.rstrip("\n").lstrip("# ")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_rag_content/metadata_processor.py` around lines 50 - 54, The frontmatter branch (using frontmatter.load(file_path)) can yield an empty title and currently skips the body-based fallback; update the logic in metadata_processor.py so after loading post you set title = str(post.get("title", "")).strip() and if that result is empty then fall back to the existing body-first-line extraction (the same logic used in the else branch that trims "#" and newlines from first_line). Ensure you reference the same variables (file_path, post, first_line, title) so the frontmatter path uses the body-derived title when post.get("title") is falsy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lightspeed_rag_content/metadata_processor.py`:
- Around line 65-66: The code currently returns str(url) for any non-None
frontmatter value, allowing whitespace-only strings to be treated as valid URLs;
update the logic where url = post.get("url") (in metadata_processor.py) to
normalize and validate the value by stripping surrounding whitespace, and return
None if the stripped string is empty so that whitespace-only frontmatter does
not override url_function; ensure you return the stripped string (not the
original) when valid.
---
Duplicate comments:
In `@src/lightspeed_rag_content/metadata_processor.py`:
- Around line 50-54: The frontmatter branch (using frontmatter.load(file_path))
can yield an empty title and currently skips the body-based fallback; update the
logic in metadata_processor.py so after loading post you set title =
str(post.get("title", "")).strip() and if that result is empty then fall back to
the existing body-first-line extraction (the same logic used in the else branch
that trims "#" and newlines from first_line). Ensure you reference the same
variables (file_path, post, first_line, title) so the frontmatter path uses the
body-derived title when post.get("title") is falsy.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9f64ac07-a699-4f46-88e5-a160927f701d
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (10)
.tekton/rag-tool-pull-request.yaml.tekton/rag-tool-push.yamlREADME.mdpyproject.tomlrequirements-build.txtrequirements.hashes.source.txtrequirements.hashes.wheel.pypi.txtrequirements.hashes.wheel.txtsrc/lightspeed_rag_content/metadata_processor.pytests/test_metadata_processor.py
💤 Files with no reviewable changes (1)
- requirements.hashes.wheel.txt
✅ Files skipped from review due to trivial changes (4)
- pyproject.toml
- requirements.hashes.wheel.pypi.txt
- README.md
- requirements-build.txt
🚧 Files skipped from review as they are similar to previous changes (2)
- .tekton/rag-tool-pull-request.yaml
- .tekton/rag-tool-push.yaml
| url = post.get("url") | ||
| return str(url) if url is not None else None |
There was a problem hiding this comment.
Normalize frontmatter url before returning it.
Line 66 currently accepts whitespace-only values as valid URLs, which can override url_function and degrade metadata quality.
💡 Proposed fix
- url = post.get("url")
- return str(url) if url is not None else None
+ url = post.get("url")
+ if url is None:
+ return None
+ normalized = str(url).strip()
+ return normalized or None📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| url = post.get("url") | |
| return str(url) if url is not None else None | |
| url = post.get("url") | |
| if url is None: | |
| return None | |
| normalized = str(url).strip() | |
| return normalized or None |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lightspeed_rag_content/metadata_processor.py` around lines 65 - 66, The
code currently returns str(url) for any non-None frontmatter value, allowing
whitespace-only strings to be treated as valid URLs; update the logic where url
= post.get("url") (in metadata_processor.py) to normalize and validate the value
by stripping surrounding whitespace, and return None if the stripped string is
empty so that whitespace-only frontmatter does not override url_function; ensure
you return the stripped string (not the original) when valid.
|
/ok-to-test |
@romartin has update the documentation
Description
Adding support for consuming the YAML frontmatter (in MD files), for metadata generation (title and url).
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Tested locally, metadata generated properly.
Summary by CodeRabbit
New Features
Documentation
Chores
Tests