Observability reader [ENG-347] #121
Changes from all commits: ded54f6, a2f0094, ddd7d3f, 9536757, 5cab032, 79d7940
@@ -0,0 +1,236 @@

```python
"""Read-only viewer for pipeline observability results stored in Delta Lake."""

from __future__ import annotations

from pathlib import Path
from typing import ClassVar, TYPE_CHECKING

import polars as pl

if TYPE_CHECKING:
    from upath import UPath


class ObservabilityReader:
    """Auto-discovers and queries pipeline status and log Delta tables.

    Provides two DataFrames: ``status()`` for execution state and
    ``logs()`` for execution logs. Both return clean polars DataFrames
    ready for further analysis with standard polars operations.

    Args:
        root: Path to the results output directory. Supports local paths,
            ``pathlib.Path``, and ``UPath`` for cloud storage.

    Raises:
        ValueError: If ``root`` does not exist or contains no Delta tables.
    """

    def __init__(self, root: str | Path | UPath) -> None:
        self._root = root if isinstance(root, Path) else Path(root)
        if not self._root.exists():
            raise ValueError(
                f"Results root does not exist: {self._root}"
            )

        self._status_tables: dict[str, list[Path]] = {}
        self._log_tables: dict[str, list[Path]] = {}
        self._discover_tables()

        if not self._status_tables and not self._log_tables:
            raise ValueError(
                f"No observability Delta tables found under: {self._root}"
            )

        self._status_df: pl.DataFrame | None = None
        self._logs_df: pl.DataFrame | None = None

    def _discover_tables(self) -> None:
        """Find all Delta tables and classify each as status or log."""
        for delta_log_dir in self._root.rglob("_delta_log"):
            if not delta_log_dir.is_dir():
                continue
            table_dir = delta_log_dir.parent
            parts = table_dir.relative_to(self._root).parts

            if "status" in parts:
                idx = parts.index("status")
                if idx + 1 < len(parts):
                    node_name = parts[idx + 1]
                    self._status_tables.setdefault(node_name, []).append(
                        table_dir
                    )
            elif "logs" in parts:
                idx = parts.index("logs")
                if idx + 1 < len(parts):
                    node_name = parts[idx + 1]
                    self._log_tables.setdefault(node_name, []).append(
                        table_dir
                    )

    @property
    def nodes(self) -> list[str]:
        """Sorted list of discovered node names."""
        return sorted(self._status_tables.keys())
```
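The discovery convention can be exercised without polars or Delta Lake installed. The sketch below reproduces the `_delta_log` classification logic over a throwaway directory tree; all directory and node names are illustrative, not taken from the actual pipeline:

```python
import tempfile
from pathlib import Path

def discover(root: Path) -> tuple[dict[str, list[Path]], dict[str, list[Path]]]:
    """Classify Delta table dirs by whether 'status' or 'logs' appears
    in their path relative to root, keyed by the following path part."""
    status: dict[str, list[Path]] = {}
    logs: dict[str, list[Path]] = {}
    for delta_log_dir in root.rglob("_delta_log"):
        if not delta_log_dir.is_dir():
            continue
        table_dir = delta_log_dir.parent
        parts = table_dir.relative_to(root).parts
        # Check "status" before "logs", mirroring the reader's if/elif.
        for kind, registry in (("status", status), ("logs", logs)):
            if kind in parts:
                idx = parts.index(kind)
                if idx + 1 < len(parts):
                    registry.setdefault(parts[idx + 1], []).append(table_dir)
                break
    return status, logs

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Two status tables for the same node (e.g. two schema hashes),
    # plus one log table for a second node.
    (root / "status" / "node_a" / "hash1" / "_delta_log").mkdir(parents=True)
    (root / "status" / "node_a" / "hash2" / "_delta_log").mkdir(parents=True)
    (root / "logs" / "node_b" / "_delta_log").mkdir(parents=True)
    status, logs = discover(root)
    print(sorted(status))         # ['node_a']
    print(len(status["node_a"]))  # 2 -- both tables kept, none overwritten
    print(sorted(logs))           # ['node_b']
```

Because each node maps to a list, two tables under the same node name both survive, which is the behavior the review below converges on.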
Comment on lines +73 to +74:

Suggested change:

```diff
-        """Sorted list of discovered node names."""
-        return sorted(self._status_tables.keys())
+        """Sorted list of discovered node names (from status and logs)."""
+        return sorted(set(self._status_tables) | set(self._log_tables))
```
By design — nodes is derived from status tables because the StatusObserver emits status for every node the orchestrator executes. A node without status entries was not executed. Logs are secondary — they capture stdout/stderr but are only written after packet execution completes, so a node could have status but no logs (e.g. still running). Keeping nodes status-derived gives a consistent, authoritative view of what ran.
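The difference between the status-derived `nodes` and the suggested union can be shown with hypothetical registries (the node names below are illustrative):

```python
# node_c has logs but never emitted status, i.e. it was not executed
# according to the author's model above.
status_tables = {"node_a": ["t1"], "node_b": ["t2"]}
log_tables = {"node_b": ["t3"], "node_c": ["t4"]}

status_derived = sorted(status_tables)                # current behavior
union = sorted(set(status_tables) | set(log_tables))  # suggested change

print(status_derived)  # ['node_a', 'node_b']
print(union)           # ['node_a', 'node_b', 'node_c']
```

The status-derived list reports only nodes the orchestrator actually ran; the union would also surface log-only entries.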
Copilot (AI), Mar 30, 2026:
tag_columns is inferred exclusively from the status DataFrame. If the root contains only log tables (or status tables are temporarily absent), tag_columns will be empty even though logs may contain user tag columns. Consider falling back to logs schema when status is empty/unavailable, or inferring from whichever table type is present.
Same reasoning as nodes — status tables are the primary/authoritative data source. The constructor requires at least one Delta table to exist (raises ValueError otherwise), and status() now short-circuits on empty DataFrames. If a results directory has only log tables and no status, tag_columns returns empty and status() returns an empty DataFrame, which is correct behavior — no status means no execution state to report.
Copilot (AI), Mar 30, 2026:
status() assumes a non-empty status table with a "timestamp" column; if the root contains no status tables (but does contain log tables), _get_status_df() returns an empty DataFrame and df.sort("timestamp") will raise. Consider short-circuiting when df is empty / missing required columns and returning an empty DataFrame (ideally with the expected schema).
Fixed — `status()` now short-circuits with `if df.is_empty(): return df` before attempting `sort("timestamp")`. Added a test (`test_empty_status_returns_empty_df`) that creates a results directory with only log tables and verifies `status()` returns an empty DataFrame.
_discover_tables stores only a single table_dir per node_name; if multiple Delta tables exist for the same node (e.g., different schema_hash / node hash, multiple pipeline versions), later discoveries overwrite earlier ones and data is silently dropped. Consider storing a list of table dirs per node and concatenating all of them (or selecting deterministically by version/timestamp).
Fixed — `_discover_tables` now stores a `list[Path]` per node name, and `_get_status_df` / `_get_logs_df` iterate over all tables for each node before concatenating. This handles multiple Delta tables per node (e.g. different schema hashes or pipeline versions).