Skip to content

feat(databases): add at() and base_path for sub-scoped database views (ENG-341)#123

Merged
eywalker merged 12 commits intodevfrom
eywalker/eng-341-feat-database-path-contextualization-create-sub-scoped
Apr 1, 2026
Merged

feat(databases): add at() and base_path for sub-scoped database views (ENG-341)#123
eywalker merged 12 commits intodevfrom
eywalker/eng-341-feat-database-path-contextualization-create-sub-scoped

Conversation

@kurodo3
Copy link
Copy Markdown
Contributor

@kurodo3 kurodo3 Bot commented Apr 1, 2026

Summary

Implements ENG-341: database path contextualization — sub-scoped database views.

  • Adds at(*path_components) and base_path: tuple[str, ...] to ArrowDatabaseProtocol
  • All four backends implemented: DeltaTableDatabase, InMemoryArrowDatabase, ConnectorArrowDatabase, NoOpArrowDatabase
  • InMemoryArrowDatabase and ConnectorArrowDatabase share underlying storage by reference; DeltaTableDatabase creates a fresh instance (filesystem is shared storage)
  • to_config/from_config round-trips base_path for all backends
  • DeltaTableDatabase config key rename: "base_path" (URI root) → "root_uri"; "base_path" now carries the scoping prefix tuple
  • _validate_record_path updated to check combined len(base_path) + len(record_path) depth
  • list_sources() on scoped DeltaTableDatabase instances scans from the scoped root

Closes ENG-341

Test plan

  • All 4 database backends have TestAtMethod test classes
  • Config round-trip tests for all backends
  • Combined depth validation tests
  • Shared-storage tests for InMemory and Connector backends
  • Filesystem path correctness test for DeltaTable backend
  • list_sources() scoping test for DeltaTable backend
  • Full test suite passes (excluding integration tests requiring live external services)

🤖 Generated with Claude Code

kurodo3 Bot and others added 11 commits April 1, 2026 02:50
…l and all backends

Adds base_path property (returns tuple[str, ...]) and at() method stubs to
ArrowDatabaseProtocol and all four backends (NoOpArrowDatabase, InMemoryArrowDatabase,
ConnectorArrowDatabase, DeltaTableDatabase) to satisfy protocol conformance for ENG-341.

Also renames DeltaTableDatabase's internal self.base_path (Path) instance attribute
to self._local_base_path to avoid collision with the new tuple-typed property.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rowDatabase

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds full at() and base_path implementation: scoped views share the
underlying _tables/_pending_batches/_pending_record_ids dicts by
reference, _get_record_key prepends the path prefix, and
_validate_record_path checks combined prefix+record_path depth.
to_config/from_config round-trip the base_path field.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Full implementation of path scoping: `at()` returns a new instance sharing
the connector and pending dicts by reference; `base_path` exposes `_path_prefix`;
`_get_record_key` and `_get_committed_table` prepend the prefix so scoped views
transparently read/write to prefixed SQL table names. `_validate_record_path`
checks combined depth (prefix + path). `to_config` serializes `base_path`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements full database path contextualization for DeltaTableDatabase
(ENG-341 Task 5): renames internal attributes (_base_uri→_root_uri,
_local_base_path→_local_root), adds _path_prefix support, updates
_get_table_uri to apply prefix, fixes _validate_record_path for combined
depth checking, updates list_sources() to scan from scoped root, implements
at() to return a properly scoped instance, and renames config key
"base_path" (URI) to "root_uri" while adding "base_path" for the prefix
tuple.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 96.47059% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/orcapod/protocols/database_protocols.py 60.00% 2 Missing ⚠️
src/orcapod/databases/delta_lake_databases.py 97.29% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements ENG-341 “database path contextualization” by introducing sub-scoped database views via at(*path_components) and a base_path tuple on ArrowDatabaseProtocol and all supported backends, plus tests and supporting design/plan docs.

Changes:

  • Added base_path: tuple[str, ...] and at(*path_components) to ArrowDatabaseProtocol and implemented them for DeltaTable/InMemory/Connector/NoOp backends.
  • Updated config serialization to round-trip base_path across backends, including renaming DeltaTableDatabase’s root location key from "base_path""root_uri".
  • Added/expanded unit tests for scoping behavior, config round-trips, and combined-depth validation.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/orcapod/protocols/database_protocols.py Extends the DB protocol with base_path and at() scoping API.
src/orcapod/databases/noop_database.py Tracks base_path, implements at(), and includes base_path in config.
src/orcapod/databases/in_memory_databases.py Implements scoped views by prefixing keys and sharing underlying dict storage.
src/orcapod/databases/connector_arrow_database.py Implements scoped views by prefixing SQL table naming and sharing pending state.
src/orcapod/databases/delta_lake_databases.py Implements scoped filesystem paths, updates depth validation, and renames config key to root_uri.
tests/test_databases/test_noop_database.py Adds TestAtMethod coverage for NoOp scoping behavior.
tests/test_databases/test_in_memory_database.py Adds TestAtMethod coverage + combined-depth validation for InMemory scoping.
tests/test_databases/test_connector_arrow_database.py Adds TestAtMethod coverage including SQL table name prefixing.
tests/test_databases/test_delta_table_database.py Adds scoped-path filesystem assertions and scoped list_sources() tests.
tests/test_databases/test_database_config.py Updates Delta config expectations (root_uri) and adds base_path round-trip tests.
superpowers/specs/2026-04-01-database-path-contextualization-design.md Design spec describing scoping semantics and per-backend implementation details.
superpowers/plans/2026-04-01-database-path-contextualization.md Implementation plan documenting steps, testing strategy, and file map.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 851 to 862
@classmethod
def from_config(cls, config: dict[str, Any]) -> DeltaTableDatabase:
def from_config(cls, config: dict[str, Any]) -> "DeltaTableDatabase":
"""Reconstruct a DeltaTableDatabase from a config dict."""
return cls(
base_path=config["base_path"],
base_path=config["root_uri"],
storage_options=config.get("storage_options"),
create_base_path=True,
batch_size=config.get("batch_size", 1000),
max_hierarchy_depth=config.get("max_hierarchy_depth", 10),
allow_schema_evolution=config.get("allow_schema_evolution", True),
_path_prefix=tuple(config.get("base_path", [])),
)
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeltaTableDatabase.from_config now requires config["root_uri"], which will raise KeyError when loading configs produced before this change (where the filesystem root was stored under "base_path"). Since pipeline deserialization calls cls.from_config(config) directly, this is a backwards-compatibility break. Consider accepting both shapes: treat config.get("root_uri") as preferred, but fall back to config.get("base_path") when it’s a string/Path-like root, while still supporting the new list-valued base_path prefix tuple.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa61441. from_config now checks for "root_uri" first (current format) and falls back to "base_path" as the URI string (legacy pre-ENG-341 format). When falling back, _path_prefix is set to (). Covered by test_from_config_accepts_legacy_base_path in TestDeltaTableDatabaseConfig.

Comment on lines +960 to +966
return DeltaTableDatabase(
base_path=self._root_uri,
storage_options=self._storage_options,
batch_size=self.batch_size,
max_hierarchy_depth=self.max_hierarchy_depth,
allow_schema_evolution=self.allow_schema_evolution,
_path_prefix=self._path_prefix + path_components,
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeltaTableDatabase.at() accepts arbitrary path_components but does not validate them. Since these components become part of filesystem paths, allowing values like ".." or strings containing separators can cause unexpected directory traversal / incorrect scoping. Suggest validating path_components with the same rules as record_path components (non-empty str; disallow '/', '\', '\0', etc.; and consider explicitly rejecting '.'/'..').

Suggested change
return DeltaTableDatabase(
base_path=self._root_uri,
storage_options=self._storage_options,
batch_size=self.batch_size,
max_hierarchy_depth=self.max_hierarchy_depth,
allow_schema_evolution=self.allow_schema_evolution,
_path_prefix=self._path_prefix + path_components,
validated_components: list[str] = []
for component in path_components:
if not isinstance(component, str):
raise TypeError(
f"DeltaTableDatabase.at() path components must be str, got {type(component)!r}"
)
if component == "" or "/" in component or "\\" in component or "\0" in component:
raise ValueError(
f"Invalid path component {component!r}: must be non-empty and must not contain '/', '\\\\', or NUL"
)
if component in (".", ".."):
raise ValueError(
f"Invalid path component {component!r}: '.' and '..' are not allowed"
)
validated_components.append(self._sanitize_path_component(component))
return DeltaTableDatabase(
base_path=self._root_uri,
storage_options=self._storage_options,
batch_size=self.batch_size,
max_hierarchy_depth=self.max_hierarchy_depth,
allow_schema_evolution=self.allow_schema_evolution,
_path_prefix=self._path_prefix + tuple(validated_components),

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa61441. at() now validates each component before constructing the new instance: rejects non-str, empty string, '.'/'..', and the filesystem-unsafe char set (/, \\, *, ?, ", <, >, |, NUL). I validated but did not pre-sanitize what gets stored in _path_prefix_get_table_uri already applies _sanitize_path_component at use time, keeping base_path predictable across platforms. Covered by test_at_rejects_slash_in_component, test_at_rejects_dotdot_component, test_at_rejects_empty_component, test_at_rejects_non_str_component in TestAtMethod.

Comment on lines +265 to +283
@property
def base_path(self) -> tuple[str, ...]:
"""The current relative root of this database view (always () for root instances)."""
return self._path_prefix

def at(self, *path_components: str) -> "InMemoryArrowDatabase":
"""Return a new InMemoryArrowDatabase scoped to the given sub-path.

The returned instance shares the underlying storage dicts (_tables,
_pending_batches, _pending_record_ids) by reference, so writes
through any view are visible to all views of the same root database.
"""
return InMemoryArrowDatabase(
max_hierarchy_depth=self.max_hierarchy_depth,
_path_prefix=self._path_prefix + path_components,
_shared_tables=self._tables,
_shared_pending_batches=self._pending_batches,
_shared_pending_record_ids=self._pending_record_ids,
)
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InMemoryArrowDatabase.at() builds _path_prefix directly from path_components without validating them. Because _get_record_key uses '/'.join(...) and flush reconstructs paths by splitting on '/', any path component containing '/' or '\0' will break key round-tripping and can corrupt reads/writes across views. Consider validating path_components (same checks as _validate_record_path applies to record_path components) before constructing the scoped view.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa61441. at() now validates each component before building _path_prefix: rejects empty strings, non-str values, and components containing '/' or '\\0'. These are the same characters rejected by _validate_record_path for individual components. Covered by test_at_rejects_slash_in_component, test_at_rejects_null_in_component, test_at_rejects_empty_component in TestAtMethod.

Comment on lines +279 to +298
@property
def base_path(self) -> tuple[str, ...]:
"""The current relative root of this database view (always () for root instances)."""
return self._path_prefix

def at(self, *path_components: str) -> "ConnectorArrowDatabase":
"""Return a new ConnectorArrowDatabase scoped to the given sub-path.

The returned instance shares the connector and all three pending dicts
(_pending_batches, _pending_record_ids, _pending_skip_existing) by reference.
Calling flush() on any view drains the entire shared pending queue.
"""
return ConnectorArrowDatabase(
connector=self._connector,
max_hierarchy_depth=self.max_hierarchy_depth,
_path_prefix=self._path_prefix + path_components,
_shared_pending_batches=self._pending_batches,
_shared_pending_record_ids=self._pending_record_ids,
_shared_pending_skip_existing=self._pending_skip_existing,
)
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ConnectorArrowDatabase.at() stores path_components into _path_prefix without validation. Since record keys are built via '/'.join(...) and flush reconstructs record_path using split('/'), a scoped prefix component containing '/' or '\0' will break round-tripping and could cause data to be written/read under unintended table names. Consider validating path_components (non-empty str; disallow '/' and '\0') before returning the new scoped instance.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa61441. Same validation as InMemoryArrowDatabase.at(): rejects empty strings and components containing '/' or '\\0'. Covered by test_at_rejects_slash_in_component, test_at_rejects_null_in_component, test_at_rejects_empty_component in TestAtMethod.

… support

- InMemoryArrowDatabase.at() and ConnectorArrowDatabase.at() now reject
  components containing '/' or '\0', which would corrupt the '/'-separated
  record key scheme and break flush()'s key reconstruction via split('/')
- DeltaTableDatabase.at() validates components against the full filesystem
  unsafe-char set plus '.' and '..' (directory traversal)
- DeltaTableDatabase.from_config() now accepts both current format
  ('root_uri' key) and legacy pre-ENG-341 format ('base_path' as a URI
  string), preventing KeyError when deserializing older persisted configs
- Tests added for all four changes

Addresses Copilot review on PR #123
@kurodo3
Copy link
Copy Markdown
Contributor Author

kurodo3 Bot commented Apr 1, 2026

Review round 1 addressed — fa61441

All four Copilot comments resolved in a single commit:

at() path component validation (InMemory, Connector, Delta)

InMemoryArrowDatabase.at() and ConnectorArrowDatabase.at() now reject components containing '/' or '\0' — the same characters blocked by _validate_record_path. Without this guard, a prefix component containing '/' would corrupt the '/'.join(...) key scheme and break flush()'s key reconstruction via split('/').

DeltaTableDatabase.at() validates against the full _validate_record_path unsafe-char set plus '.'/'..' (directory traversal). Components are validated but not pre-sanitized when stored in _path_prefix_get_table_uri already applies _sanitize_path_component at use time, keeping base_path predictable across platforms (avoiding ':''!' mutation on Windows).

DeltaTableDatabase.from_config backwards-compatibility

from_config now accepts both formats:

  • Current (post-ENG-341): "root_uri" key present → use it, read "base_path" as list for prefix
  • Legacy (pre-ENG-341): no "root_uri" → treat "base_path" as the root URI string, _path_prefix = ()

Tests added

  • TestAtMethod in InMemory, Connector, Delta: 3 new validation tests each (slash, null/dotdot, empty, non-str for Delta)
  • TestDeltaTableDatabaseConfig: test_from_config_accepts_legacy_base_path

405 database tests pass.

@eywalker eywalker merged commit 7d95d18 into dev Apr 1, 2026
9 checks passed
@eywalker eywalker deleted the eywalker/eng-341-feat-database-path-contextualization-create-sub-scoped branch April 1, 2026 06:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants