Skip to content

Add Upath support [ENG-235]#94

Merged
eywalker merged 4 commits intonauticalab:devfrom
brian-arnold:upath
Mar 22, 2026
Merged

Add Upath support [ENG-235]#94
eywalker merged 4 commits intonauticalab:devfrom
brian-arnold:upath

Conversation

@brian-arnold
Copy link
Copy Markdown
Collaborator

Goal: Add UPath (universal-pathlib) support so orcapod can handle remote filesystem paths (S3, GCS, etc.) alongside local pathlib.Path.

New dependency: universal-pathlib>=0.3.8 added to pyproject.toml.

Type conversion (semantic_struct_converters.py):

  • Extracted shared logic from PathStructConverter into a new FilePathStructConverterBase ABC, parameterized by path type and field name.
  • Added UPathStructConverter — stores UPath as a {"upath": ""} Arrow struct (distinct from Path's {"path": "..."} struct).
  • PathStructConverter explicitly rejects UPath instances and types to avoid ambiguity from UPath being a Path subclass.

Hashing (hash_utils.py, builtin_handlers.py):

  • hash_file() now accepts PathLike and uses path.open() instead of open(), so UPath remote-filesystem semantics are preserved.
  • _to_path() helper simplified to a single isinstance(file_path, Path) check (covers UPath as a subclass).
  • Added UPathContentHandler for semantic hashing of UPath objects, with strict type checking (rejects non-UPath inputs).

Universal converter (universal_converter.py):

  • Added Optional[T] handling in get_python_to_arrow_converter — passes through None, delegates to inner converter for non-None values.
  • Made python_type_to_arrow_type raise ValueError for complex unions (A | B) instead of silently picking the first type, consistent with
    the converter path.

Context registration (v0.1.json):

  • Registered UPathStructConverter in the semantic type registry.
  • Registered UPathContentHandler in the semantic hasher's type handler list (keyed to upath.core.UPath).

Tests:

  • test_upath_struct_converter.py — full coverage of UPath struct conversion, hashing, type discrimination, and Path/UPath boundary
    enforcement.
  • test_universal_converter.py — tests for UPath arrow type mapping, Optional[UPath] conversion, and complex union error consistency.

Brian Arnold added 2 commits March 22, 2026 00:03
  - Move top-level UPath import to TYPE_CHECKING with lazy imports in methods
  - Replace duck-typing in _to_path() with isinstance check (UPath subclasses Path)
  - Make PathStructConverter reject UPath instances to avoid dispatch ambiguity
  - Raise ValueError for complex unions in python_type_to_arrow_type (was silently picking first type)
  - Reject non-UPath objects in UPathContentHandler.handle() instead of silently wrapping
  - Add tests for PathStructConverter/UPath boundary and consistent union error
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 22, 2026

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Universal Pathlib (UPath) support to orcapod so path-like semantic types and file-content hashing can work with remote filesystems (e.g., S3/GCS) in addition to local pathlib.Path.

Changes:

  • Introduces UPathStructConverter (distinct Arrow struct signature: {"upath": "<string>"}) and refactors shared path logic into FilePathStructConverterBase.
  • Extends semantic hashing with UPathContentHandler, and updates file hashing to use path.open() to preserve path object semantics.
  • Improves type conversion: supports Optional[T] unions and rejects complex unions consistently across converter and Arrow-type mapping.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
uv.lock Locks new dependency set including universal-pathlib and its transitive deps.
pyproject.toml Adds universal-pathlib>=0.3.8 dependency.
src/orcapod/semantic_types/semantic_struct_converters.py Refactors Path converter base; adds UPathStructConverter; makes PathStructConverter explicitly reject UPath.
src/orcapod/hashing/hash_utils.py Updates hash_file() to accept PathLike and use path.open() (intended to preserve UPath semantics).
src/orcapod/hashing/semantic_hashing/builtin_handlers.py Adds UPathContentHandler for semantic hashing of UPath file content.
src/orcapod/semantic_types/universal_converter.py Adds Optional-handling converter path; raises on complex unions instead of silently selecting a type.
src/orcapod/contexts/data/v0.1.json Registers the new UPath semantic converter + UPath hashing handler in the default context.
tests/test_semantic_types/test_upath_struct_converter.py Adds tests for UPath struct conversion, hashing behavior, and Path/UPath discrimination.
tests/test_semantic_types/test_universal_converter.py Adds tests for UPath Arrow mapping, Optional[UPath] conversion, and complex union errors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/orcapod/hashing/hash_utils.py Outdated
Comment on lines +46 to +55
def _to_path(file_path: PathLike) -> Path:
"""Convert a path-like to a Path, preserving UPath instances.

If ``file_path`` is already a ``Path`` (including ``UPath`` subclasses),
return it as-is so that remote-filesystem semantics are retained.
Otherwise wrap it in ``Path()``.
"""
if isinstance(file_path, Path):
return file_path
return Path(file_path)
Copy link

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_to_path() claims to “preserve UPath instances”, but it only passes through pathlib.Path objects. Since PathLike is str | os.PathLike, a UPath is not guaranteed to be a pathlib.Path subclass; coercing it via Path(file_path) can strip remote-filesystem semantics (e.g., turning s3://... into a local path) and make path.open() operate on the wrong backend. Consider explicitly passing through upath.UPath (local import) or, more generally, returning the original object when it already provides a file API (open, is_file, etc.) instead of always wrapping in pathlib.Path.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added an explicit isinstance(file_path, UPath) check (with lazy import + ImportError guard) before the Path fallback in _to_path(). This ensures UPath instances are preserved even if the inheritance relationship with pathlib.Path ever changes.

# Path-specific implementation
class PathStructConverter(SemanticStructConverterBase):
"""Converter for pathlib.Path objects to/from semantic structs of form { path: "/value/of/path"}"""
class FilePathStructConverterBase(SemanticStructConverterBase, ABC):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I wouldn't call the base class to both a FilePathStructConverter. Rather let's just call it PathStructConverterBase, and the old PathConveeter one should be called PythonPathStructConverter. File path is actually a subset of path in general

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — renamed as requested:

  • FilePathStructConverterBasePathStructConverterBase
  • PathStructConverterPythonPathStructConverter
  • UPathStructConverter stays as-is

Updated all references across source, config JSON, and tests.

@brian-arnold brian-arnold changed the title Add Upath support Add Upath support [ENG-235] Mar 22, 2026
Comment thread src/orcapod/hashing/hash_utils.py Outdated


def hash_file(file_path, algorithm="sha256", buffer_size=65536) -> ContentHash:
def _to_path(file_path: PathLike) -> Path:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are treating UPath this specially then the return signature ought to be Path|UPath

Comment thread src/orcapod/hashing/hash_utils.py Outdated

if isinstance(file_path, UPath):
return file_path # type: ignore[return-value]
except ImportError:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't make much sense for us to expect this import fails -- UPath is specified as an explicit dependency for OrcaPod. Let's just drop the try except here and I'd actually be happy for us to just have UPath as top level import (assuming it's not a huge library that's very slow to import)

@brian-arnold brian-arnold requested a review from eywalker March 22, 2026 16:14
@eywalker eywalker merged commit f069bc6 into nauticalab:dev Mar 22, 2026
5 checks passed
eywalker pushed a commit that referenced this pull request Mar 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants