Conversation
Walkthrough

Adds a new OKP metadata processing module to parse TOML frontmatter from .md files, filter files by project and presence of URL/title, provide an OKPMetadataProcessor, and add unit tests for these utilities.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant FS as Filesystem
    participant OKP as OKP Utils
    Caller->>FS: glob("*.md")
    FS-->>Caller: list of files
    loop for each file
        Caller->>OKP: parse_metadata(file)
        OKP-->>Caller: metadata or raise
        Caller->>OKP: metadata_has_url_and_title(metadata)
        OKP-->>Caller: true/false
        Caller->>OKP: is_file_related_to_projects(metadata, projects)
        OKP-->>Caller: true/false
        alt both true
            Caller-->>Caller: yield file
        else
            Caller-->>Caller: skip file (log warning if parse succeeded)
        end
    end
```

```mermaid
sequenceDiagram
    participant Client
    participant OKP as OKPMetadataProcessor
    participant Parser as parse_metadata
    Client->>OKP: url_function(file_path)
    OKP->>Parser: parse_metadata(file_path)
    Parser-->>OKP: metadata dict
    OKP-->>Client: metadata.extra.reference_url
    Client->>OKP: get_file_title(file_path)
    OKP->>Parser: parse_metadata(file_path)
    Parser-->>OKP: metadata dict
    OKP-->>Client: metadata.title
```
Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15 minutes
Actionable comments posted: 3
🧹 Nitpick comments (7)
.gitignore (1)
53-53: Consider making the ignored path explicitly a directory

If this is meant to be a directory for test artifacts, prefer a trailing slash to make intent explicit and avoid accidental matches with a file named tests/test_results.
```diff
-tests/test_results
+tests/test_results/
```

src/lightspeed_rag_content/okp.py (5)
15-18: Minor readability: avoid shadowing parameters during normalization

Reassigning projects can be confusing. Use new names for the normalized lists.

```diff
-    # Lowercase both lists
-    product_names = [p.lower() for p in product_names]
-    projects = [p.lower() for p in projects]
-
-    # Check if any project is in the product names
-    return any(p in pn for p in projects for pn in product_names)
+    # Lowercase both lists
+    product_names_lc = [p.lower() for p in product_names]
+    projects_lc = [p.lower() for p in projects]
+
+    # Check if any project is in the product names
+    return any(p in pn for p in projects_lc for pn in product_names_lc)
```

Also applies to: 19-21
23-30: Title/URL presence check looks good

Logic correctly verifies both presence and non-empty title. Consider a future enhancement to be more defensive by using metadata.get("title", "") to avoid potential KeyError if upstream callers misuse the function, but current guard is fine.
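A sketch of that defensive variant, using `.get` defaults so missing keys yield False instead of raising; this is a standalone reconstruction for illustration, not the module's actual code.

```python
def metadata_has_url_and_title(metadata: dict) -> bool:
    """Defensive variant: absent keys become falsy defaults, never KeyError."""
    url = metadata.get("extra", {}).get("reference_url", "")
    title = metadata.get("title", "")
    return bool(url) and bool(title.strip())

print(metadata_has_url_and_title({"title": "T", "extra": {"reference_url": "u"}}))  # True
print(metadata_has_url_and_title({"title": "   "}))                                 # False
print(metadata_has_url_and_title({}))                                               # False
```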
45-57: Anchor metadata regex to the front matter block

Anchoring to the beginning of the file/front-matter guards against accidentally matching a +++ sequence in the body. Inline flags keep it compact.

```diff
-    # Extract everything between the +++ markers
-    match = re.search(rb"\+{3,}\s*(.*?)\s*\+{3,}", content, re.S)
+    # Extract everything between the +++ front-matter markers (top-of-file)
+    match = re.search(rb"(?ms)^\s*\+{3}\s*\n(.*?)\n\s*\+{3}\s*", content)
```
45-47: Optional: cache parse results to avoid double reads

OKPMetadataProcessor calls parse_metadata twice per file (URL + title). Light caching avoids duplicate IO without complicating call sites. Suitable if files are immutable during a run.

```diff
+from functools import lru_cache
@@
-def parse_metadata(filepath):
+@lru_cache(maxsize=512)
+def parse_metadata(filepath):
```

Note: If you want to support both str and Path inputs consistently in the cache key, you may normalize filepath to str at the start of parse_metadata.
62-70: Harden accessors against missing keys

If metadata lacks expected keys, raise a clear ValueError instead of a KeyError. This keeps behavior consistent with parse_metadata's error style and plays better with callers.

```diff
     def url_function(self, file_path: str) -> str:
         """Return the URL for the OKP file."""
-        md = parse_metadata(file_path)
-        return md["extra"]["reference_url"]
+        md = parse_metadata(file_path)
+        try:
+            return md["extra"]["reference_url"]
+        except KeyError as e:
+            raise ValueError(f"reference_url not found in metadata for {file_path}") from e
@@
     def get_file_title(self, file_path: str) -> str:
         """Return the title of the OKP file."""
-        md = parse_metadata(file_path)
-        return md["title"]
+        md = parse_metadata(file_path)
+        try:
+            return md["title"]
+        except KeyError as e:
+            raise ValueError(f"title not found in metadata for {file_path}") from e
```

tests/test_okp.py (1)
35-42: Fix misleading docstring in negative test

Docstring says the metadata has both URL and title, but the test asserts False cases. Update for clarity.

```diff
-    def test_metadata_has_url_and_title_false(self):
-        """Test that the metadata has both URL and title."""
+    def test_metadata_has_url_and_title_false(self):
+        """Test that metadata missing URL or title returns False."""
```

Also applies to: 44-51
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these settings in your CodeRabbit configuration.
📒 Files selected for processing (3)
- .gitignore (1 hunks)
- src/lightspeed_rag_content/okp.py (1 hunks)
- tests/test_okp.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
src/lightspeed_rag_content/okp.py (1)
src/lightspeed_rag_content/metadata_processor.py (1)
MetadataProcessor(26-96)
tests/test_okp.py (1)
src/lightspeed_rag_content/okp.py (7)
- metadata_has_url_and_title (23-29)
- is_file_related_to_projects (11-20)
- parse_metadata (45-56)
- yield_files_related_to_projects (32-42)
- OKPMetadataProcessor (59-70)
- url_function (62-65)
- get_file_title (67-70)
🪛 GitHub Actions: Ruff
src/lightspeed_rag_content/okp.py
[error] 20-20: Ruff: Unnecessary list comprehension (C419) in okp.py. Command 'uv tool run ruff check src' failed.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-and-push-dev
- GitHub Check: Pylinter
- GitHub Check: mypy
🔇 Additional comments (2)
src/lightspeed_rag_content/okp.py (1)
5-7: No tomli fallback needed — project requires Python 3.12

pyproject.toml declares requires-python = "==3.12.*" (pyproject.toml:51) and uv.lock shows the same; no tomli dependency was found in the repo scan. Keep the existing import tomllib in src/lightspeed_rag_content/okp.py (lines 5-7) as-is; a tomli fallback is unnecessary unless you lower the project's minimum Python version.
- src/lightspeed_rag_content/okp.py (lines 5–7): import tomllib
- pyproject.toml:51 — requires-python = "==3.12.*"
- uv.lock:3 — requires-python = "==3.12.*"
- No tomli package found in dependencies/search
tests/test_okp.py (1)
65-87: Good coverage of parse_metadata behavior

The test exercises front-matter parsing and nested extra fields well, including list types. Using bytes for mock_open aligns with the binary read in parse_metadata.
Also applies to: 89-117
Actionable comments posted: 2
♻️ Duplicate comments (3)
src/lightspeed_rag_content/okp.py (3)
21-22: Good: any() now uses a generator expression (Ruff C419 satisfied)

Thanks for addressing the earlier C419. No allocation of intermediate list anymore.
45-46: Replace print with logging and catch tomllib/Unicode decode errors

Printing from libraries is discouraged. Also catch tomllib.TOMLDecodeError and UnicodeDecodeError to continue iteration on corrupt/invalid files. This aligns with prior review feedback.

```diff
-    except ValueError as e:
-        print(f"Skipping file {filepath}: {e}")
+    except (ValueError, tomllib.TOMLDecodeError, UnicodeDecodeError) as e:
+        LOG.warning("Skipping file %s: %s", filepath, e)
```
5-12: Fix Ruff I001: sort imports and add a module logger

Ruff flagged the import block ordering. Also, switch to a module logger for observability in downstream comments.

Apply:

```diff
-import re
-from pathlib import Path
-from typing import Any, Generator
-import tomllib
-
-from lightspeed_rag_content.metadata_processor import MetadataProcessor
+import logging
+import re
+from pathlib import Path
+from typing import Any, Generator
+
+import tomllib
+
+from lightspeed_rag_content.metadata_processor import MetadataProcessor
+
+LOG = logging.getLogger(__name__)
```
🧹 Nitpick comments (4)
src/lightspeed_rag_content/okp.py (4)
15-20: Defensive handling for portal_product_names

Some OKP sources may provide a single string or include stray whitespace. Normalize to a list of cleaned strings to avoid surprises.

```diff
-    product_names = metadata.get("extra", {}).get("portal_product_names", [])
-
-    # Lowercase both lists
-    product_names = [p.lower() for p in product_names]
-    projects = [p.lower() for p in projects]
+    product_names = metadata.get("extra", {}).get("portal_product_names", [])
+    if isinstance(product_names, str):
+        product_names = [product_names]
+
+    # Lowercase and strip both lists, guard for non-str entries
+    product_names = [p.lower().strip() for p in product_names if isinstance(p, str)]
+    projects = [p.lower().strip() for p in projects if isinstance(p, str)]
```
55-60: Anchor metadata regex to front-matter and consider pre-compilation

Anchoring reduces the chance of accidentally matching +++ blocks elsewhere in the document (e.g., code samples). Pre-compiling avoids recompilation on each call.

```diff
-    # Extract everything between the +++ markers
-    match = re.search(rb"\+{3,}\s*(.*?)\s*\+{3,}", content, re.S)
+    # Extract TOML front-matter between +++ markers at the start of lines
+    # Use a backref to ensure the same number of pluses is matched on close.
+    METADATA_RE = re.compile(rb"(?m)^(\+{3,})\s*(.*?)\s*^\1\s*$", re.S)
+    match = METADATA_RE.search(content)
```

Note: If OKP files sometimes put front-matter away from the top, keep your current regex, but still consider pre-compiling at module scope for efficiency.
3-3: Nit: docstring grammar

"Utility methods for processing OKP files." reads more naturally.

```diff
-"""Utility methods processing OKP files."""
+"""Utility methods for processing OKP files."""
```
49-61: Optional: cache metadata parsing to avoid repeated IO

If the processor tends to request both URL and title for the same file(s), caching parse_metadata will reduce repeated file reads/parsing.

Add near imports:

```python
from functools import lru_cache
```

Decorate and normalize the key:

```diff
-def parse_metadata(filepath: Path) -> dict[str, Any]:
+@lru_cache(maxsize=1024)
+def parse_metadata(filepath: Path) -> dict[str, Any]:
     """Extract metadata from the OKP file."""
-    with open(filepath, "rb") as f:
+    # Normalize key for cache consistency
+    filepath = Path(filepath)
+    with open(filepath, "rb") as f:
         content = f.read()
```

Note: If you adopt the earlier change to pass Path(file_path) at call sites, this cache will be effective. If you keep accepting both str and Path, normalize to Path inside the function as shown.
Also applies to: 66-74
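The caching effect suggested above can be sketched with a counter standing in for the real file read and TOML parse (names and return value hypothetical):

```python
from functools import lru_cache
from pathlib import Path

calls = {"n": 0}

@lru_cache(maxsize=1024)
def parse_metadata(filepath: Path) -> dict:
    calls["n"] += 1  # stands in for the real file read + TOML parse
    return {"title": filepath.name}

doc = Path("doc.md")
first = parse_metadata(doc)   # cache miss: performs the "IO"
second = parse_metadata(doc)  # cache hit: equal Paths hash the same
print(calls["n"])             # 1
print(first is second)        # True: the cached dict object is reused
```

Note that lru_cache returns the same object on a hit, so callers must treat the result as read-only; this is also why normalizing str vs Path keys matters, since `"doc.md"` and `Path("doc.md")` would otherwise occupy separate cache slots.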
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these settings in your CodeRabbit configuration.
📒 Files selected for processing (3)
- .gitignore (1 hunks)
- src/lightspeed_rag_content/okp.py (1 hunks)
- tests/test_okp.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- .gitignore
- tests/test_okp.py
🧰 Additional context used
🪛 GitHub Actions: Pyright
src/lightspeed_rag_content/okp.py
[error] 68-68: Step 'uv run pyright src' failed with Pyright error: Argument of type 'str' cannot be assigned to parameter 'filepath' of type 'Path' in function 'parse_metadata'. 'str' is not assignable to 'Path' (reportArgumentType).
[error] 73-73: Step 'uv run pyright src' failed with Pyright error: Argument of type 'str' cannot be assigned to parameter 'filepath' of type 'Path' in function 'parse_metadata'. 'str' is not assignable to 'Path' (reportArgumentType).
🪛 GitHub Actions: Ruff
src/lightspeed_rag_content/okp.py
[error] 5-10: Ruff I001: Import block is unsorted or unformatted. Organize imports. Run 'ruff --fix' to auto-fix.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Pylinter
- GitHub Check: build-and-push-dev
- GitHub Check: mypy
40b5bb7 to 14b94c5 (Compare)
Akrog
left a comment
I think some documentation is necessary.
It may be enough having an example with a brief explanation (the example could be a section of your PR for openstack lightspeed).
@Akrog thanks for the review! Regarding documentation, I will add it in the docstrings; otherwise, leaving it in the PR or the commit message may not be easily accessible for people using this module.
Actionable comments posted: 0
♻️ Duplicate comments (2)
src/lightspeed_rag_content/okp.py (2)
97-109: Broaden exception handling so one bad file doesn't halt iteration

Only ValueError is caught; TOML parse errors or bad encodings will currently bubble and stop the scan. Catch tomllib.TOMLDecodeError and UnicodeDecodeError as well, and keep logging.
Apply this diff:
```diff
-        except ValueError as e:
-            LOG.warning("Skipping OKP file %s: %s", filepath, e)
+        except (ValueError, tomllib.TOMLDecodeError, UnicodeDecodeError) as e:
+            LOG.warning("Skipping OKP file %s: %s", filepath, e)
```
64-67: Bug: returns a string instead of bool; also lacks type checks for URL/title

The current expression returns the stripped title string when truthy (not a boolean), and will raise if title is non-string. Tighten validation and ensure a boolean is returned.
Apply this diff:
```diff
-    return (
-        "reference_url" in metadata.get("extra", {})
-        and metadata.get("title", "").strip()
-    )
+    url = metadata.get("extra", {}).get("reference_url")
+    title = metadata.get("title")
+    return (
+        isinstance(url, str) and url.strip() != "" and
+        isinstance(title, str) and title.strip() != ""
+    )
```
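The bug can be reproduced in isolation: Python's `and` returns an operand rather than a bool, so the original expression leaks the stripped title string (sample metadata invented):

```python
metadata = {"title": "  My Title  ", "extra": {"reference_url": "https://example.test"}}

# Original style: `and` returns the last evaluated operand, not a bool.
result = (
    "reference_url" in metadata.get("extra", {})
    and metadata.get("title", "").strip()
)
print(repr(result))  # 'My Title' -- a str, which type checkers flag

# Fixed style: explicit type checks, always a bool.
url = metadata.get("extra", {}).get("reference_url")
title = metadata.get("title")
fixed = (
    isinstance(url, str) and url.strip() != ""
    and isinstance(title, str) and title.strip() != ""
)
print(fixed)  # True
```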
🧹 Nitpick comments (3)
src/lightspeed_rag_content/okp.py (3)
48-53: Defensively normalize inputs in is_file_related_to_projects (avoid type surprises)

portal_product_names can be a string or a non-list value; also avoid reassigning the parameter name for readability. Normalize both inputs to lists of lowercase strings before matching.
Apply this diff:
```diff
-    product_names = metadata.get("extra", {}).get("portal_product_names", [])
-    # Lowercase both lists
-    product_names = [p.lower() for p in product_names]
-    projects = [p.lower() for p in projects]
-    return any(p in pn for p in projects for pn in product_names)
+    raw_product_names = metadata.get("extra", {}).get("portal_product_names", [])
+    # Normalize product names
+    if isinstance(raw_product_names, str):
+        product_names = [raw_product_names.lower()]
+    elif isinstance(raw_product_names, (list, tuple, set)):
+        product_names = [p.lower() for p in raw_product_names if isinstance(p, str)]
+    else:
+        product_names = []
+    # Normalize project terms
+    project_terms = [p.lower() for p in projects if isinstance(p, str) and p.strip()]
+    return any(p in pn for p in project_terms for pn in product_names)
```
132-138: Anchor front-matter regex to file start to avoid false positives

As written, it can match +++ blocks found later in the body. Anchoring improves correctness for standard TOML front matter at the file start.
Apply this diff:
```diff
-    # Extract everything between the +++ markers
-    match = re.search(rb"\+{3,}\s*(.*?)\s*\+{3,}", content, re.S)
+    # Extract front matter between +++ markers at the start of the file
+    match = re.search(rb"\A\+{3}\s*\n(.*?)\n\+{3}", content, re.S)
```
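A quick sketch contrasting the two regexes on a hypothetical file that has no front matter but does contain a stray `+++` block in its body:

```python
import re

# A file with no front matter; a +++ block appears later in the body.
content = b"Intro paragraph\n+++\nnot = 'front matter'\n+++\nMore body text\n"

unanchored = re.search(rb"\+{3,}\s*(.*?)\s*\+{3,}", content, re.S)
anchored = re.search(rb"\A\+{3}\s*\n(.*?)\n\+{3}", content, re.S)

print(unanchored is not None)  # True: falsely matches the in-body block
print(anchored is None)        # True: correctly finds nothing at file start
```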
144-152: Avoid double parsing and consider resilience to missing keys

Both url_function and get_file_title parse the file separately, which duplicates IO/parse work when used back-to-back (e.g., in populate). Optionally cache or parse once per file operation, or override populate to read metadata once and derive both fields. Also, these assume keys always exist; if used outside your filtering pipeline, a KeyError will surface.
- If you want, I can propose an override of MetadataProcessor.populate in this class to parse once and extract both fields.
- Verify call sites guarantee presence of extra.reference_url and title before invoking these accessors.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- .gitignore (1 hunks)
- src/lightspeed_rag_content/okp.py (1 hunks)
- tests/test_okp.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- .gitignore
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_okp.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/lightspeed_rag_content/okp.py (1)
src/lightspeed_rag_content/metadata_processor.py (1)
MetadataProcessor(26-96)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-and-push-dev
- GitHub Check: mypy
- GitHub Check: Pylinter
🔇 Additional comments (1)
src/lightspeed_rag_content/okp.py (1)
1-26: Module scaffold and logging look solid

License header, imports, and module-level logger are set up correctly. Good foundation.
For those interested, this is how we are using it for OpenStack: openstack-lightspeed/rag-content#58
lpiwowar
left a comment
Looks good to me 👍 except for one small nit. I feel like the type: ignore comments should be last resort if we commit to the type checking.
This patch adds an OKP module with functions to help process OKP files.

Signed-off-by: Lucas Alvares Gomes <lucasagomes@gmail.com>
@tisnik whenever you have some free time. Just some common methods that I think other lightspeed teams might benefit from when handling OKP files (here is the OpenStack usage for reference: openstack-lightspeed/rag-content#58).
Description
This patch adds an OKP module with functions to help process OKP files.
Type of change
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Tests
Chores