Conversation
- Move top-level UPath import to TYPE_CHECKING with lazy imports in methods - Replace duck-typing in _to_path() with isinstance check (UPath subclasses Path) - Make PathStructConverter reject UPath instances to avoid dispatch ambiguity - Raise ValueError for complex unions in python_type_to_arrow_type (was silently picking first type) - Reject non-UPath objects in UPathContentHandler.handle() instead of silently wrapping - Add tests for PathStructConverter/UPath boundary and consistent union error
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
Adds Universal Pathlib (UPath) support to orcapod so path-like semantic types and file-content hashing can work with remote filesystems (e.g., S3/GCS) in addition to local pathlib.Path.
Changes:
- Introduces
UPathStructConverter(distinct Arrow struct signature:{"upath": "<string>"}) and refactors shared path logic intoFilePathStructConverterBase. - Extends semantic hashing with
UPathContentHandler, and updates file hashing to usepath.open()to preserve path object semantics. - Improves type conversion: supports
Optional[T]unions and rejects complex unions consistently across converter and Arrow-type mapping.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| uv.lock | Locks new dependency set including universal-pathlib and its transitive deps. |
| pyproject.toml | Adds universal-pathlib>=0.3.8 dependency. |
| src/orcapod/semantic_types/semantic_struct_converters.py | Refactors Path converter base; adds UPathStructConverter; makes PathStructConverter explicitly reject UPath. |
| src/orcapod/hashing/hash_utils.py | Updates hash_file() to accept PathLike and use path.open() (intended to preserve UPath semantics). |
| src/orcapod/hashing/semantic_hashing/builtin_handlers.py | Adds UPathContentHandler for semantic hashing of UPath file content. |
| src/orcapod/semantic_types/universal_converter.py | Adds Optional-handling converter path; raises on complex unions instead of silently selecting a type. |
| src/orcapod/contexts/data/v0.1.json | Registers the new UPath semantic converter + UPath hashing handler in the default context. |
| tests/test_semantic_types/test_upath_struct_converter.py | Adds tests for UPath struct conversion, hashing behavior, and Path/UPath discrimination. |
| tests/test_semantic_types/test_universal_converter.py | Adds tests for UPath Arrow mapping, Optional[UPath] conversion, and complex union errors. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def _to_path(file_path: PathLike) -> Path: | ||
| """Convert a path-like to a Path, preserving UPath instances. | ||
|
|
||
| If ``file_path`` is already a ``Path`` (including ``UPath`` subclasses), | ||
| return it as-is so that remote-filesystem semantics are retained. | ||
| Otherwise wrap it in ``Path()``. | ||
| """ | ||
| if isinstance(file_path, Path): | ||
| return file_path | ||
| return Path(file_path) |
There was a problem hiding this comment.
_to_path() claims to “preserve UPath instances”, but it only passes through pathlib.Path objects. Since PathLike is str | os.PathLike, a UPath is not guaranteed to be a pathlib.Path subclass; coercing it via Path(file_path) can strip remote-filesystem semantics (e.g., turning s3://... into a local path) and make path.open() operate on the wrong backend. Consider explicitly passing through upath.UPath (local import) or, more generally, returning the original object when it already provides a file API (open, is_file, etc.) instead of always wrapping in pathlib.Path.
There was a problem hiding this comment.
Fixed — added an explicit isinstance(file_path, UPath) check (with lazy import + ImportError guard) before the Path fallback in _to_path(). This ensures UPath instances are preserved even if the inheritance relationship with pathlib.Path ever changes.
| # Path-specific implementation | ||
| class PathStructConverter(SemanticStructConverterBase): | ||
| """Converter for pathlib.Path objects to/from semantic structs of form { path: "/value/of/path"}""" | ||
| class FilePathStructConverterBase(SemanticStructConverterBase, ABC): |
There was a problem hiding this comment.
Actually I wouldn't call the base class to both a FilePathStructConverter. Rather let's just call it PathStructConverterBase, and the old PathConveeter one should be called PythonPathStructConverter. File path is actually a subset of path in general
There was a problem hiding this comment.
Done — renamed as requested:
FilePathStructConverterBase→PathStructConverterBasePathStructConverter→PythonPathStructConverterUPathStructConverterstays as-is
Updated all references across source, config JSON, and tests.
… harden _to_path
|
|
||
|
|
||
| def hash_file(file_path, algorithm="sha256", buffer_size=65536) -> ContentHash: | ||
| def _to_path(file_path: PathLike) -> Path: |
There was a problem hiding this comment.
If we are treating UPath this specially then the return signature ought to be Path|UPath
|
|
||
| if isinstance(file_path, UPath): | ||
| return file_path # type: ignore[return-value] | ||
| except ImportError: |
There was a problem hiding this comment.
It doesn't make much sense for us to expect this import fails -- UPath is specified as an explicit dependency for OrcaPod. Let's just drop the try except here and I'd actually be happy for us to just have UPath as top level import (assuming it's not a huge library that's very slow to import)
Goal: Add UPath (universal-pathlib) support so orcapod can handle remote filesystem paths (S3, GCS, etc.) alongside local pathlib.Path.
New dependency: universal-pathlib>=0.3.8 added to pyproject.toml.
Type conversion (semantic_struct_converters.py):
Hashing (hash_utils.py, builtin_handlers.py):
Universal converter (universal_converter.py):
the converter path.
Context registration (v0.1.json):
Tests:
enforcement.