
feat(ENG-258): add MinIO/S3 and UPath support for DeltaTableDatabase#122

Merged
eywalker merged 15 commits into dev from eywalker/eng-258-addtest-support-for-minio-integration-with
Mar 30, 2026

Conversation


@kurodo3 kurodo3 Bot commented Mar 30, 2026

Summary

Closes ENG-258.

  • Add storage_options: dict[str, str] | None parameter to DeltaTableDatabase, mirroring the deltalake API
  • Accept UPath as base_path; credentials embedded in the UPath are auto-translated from fsspec (key/secret/endpoint_url) to deltalake's AWS_* format via the new storage_utils.py helper
  • Gate local-only operations (mkdir, Path.exists) on not self._is_cloud
  • Rename _get_table_path to _get_table_uri, which returns str for both local and cloud paths
  • Add list_sources() method (local-only; raises NotImplementedError for cloud paths)
  • Full TDD test suite: test_storage_utils, local CRUD tests, and MinIO/S3 integration tests
  • GitHub Actions CI with bitnami/minio service container — no external S3 server needed
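
The credential translation described in the second bullet could look roughly like the sketch below. The function name `fsspec_to_deltalake_options`, the exact key mapping, and the rule that explicit options win are assumptions for illustration, not the contents of storage_utils.py. The `AWS_*` names are keys that deltalake accepts in `storage_options` for S3-compatible stores.

```python
from __future__ import annotations

# Assumed mapping from fsspec/s3fs option names to deltalake storage_options keys.
_FSSPEC_TO_DELTALAKE = {
    "key": "AWS_ACCESS_KEY_ID",
    "secret": "AWS_SECRET_ACCESS_KEY",
    "endpoint_url": "AWS_ENDPOINT_URL",
}


def fsspec_to_deltalake_options(
    fsspec_options: dict[str, object],
    explicit: dict[str, str] | None = None,
) -> dict[str, str]:
    """Translate fsspec-style credentials into deltalake-style storage options.

    Assumption: explicitly passed storage_options override derived ones.
    """
    derived: dict[str, str] = {}
    for src, dst in _FSSPEC_TO_DELTALAKE.items():
        value = fsspec_options.get(src)
        if value is not None:
            derived[dst] = str(value)
    # A plain-HTTP endpoint (typical for a local MinIO) needs AWS_ALLOW_HTTP.
    if derived.get("AWS_ENDPOINT_URL", "").startswith("http://"):
        derived["AWS_ALLOW_HTTP"] = "true"
    return {**derived, **(explicit or {})}
```

The resulting dict is in the shape that `DeltaTable(..., storage_options=...)` and `write_deltalake(..., storage_options=...)` expect.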

Test plan

  • uv run pytest tests/test_databases/test_storage_utils.py — passes locally
  • uv run pytest tests/test_databases/test_delta_table_database.py — passes locally
  • uv run pytest tests/test_databases/test_delta_table_database_s3.py — passes with Docker (or in CI)
  • Full CI run passes on GitHub Actions (check the Actions tab)

🤖 Generated with Claude Code

kurodo3 Bot and others added 11 commits March 30, 2026 22:04
Spec covers ENG-258: adding storage_options + UPath support to
DeltaTableDatabase, test strategy using testcontainers/MinIO, and
a new GitHub Actions CI workflow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion

Plan for ENG-258: covers TDD task breakdown, exact file paths,
complete code snippets, and CI workflow setup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rts, test coverage)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ge in S3 fixtures

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add storage_options parameter and parse_base_path/is_cloud_uri integration
- Replace _get_table_path with _get_table_uri (returns str, works for local and cloud)
- Pass storage_options to DeltaTable() and write_deltalake() calls
- Skip mkdir for cloud paths in flush_batch
- Add list_sources() method (raises NotImplementedError for cloud paths)
- Update S3 test to expect NotImplementedError instead of AttributeError

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kurodo3 kurodo3 Bot force-pushed the eywalker/eng-258-addtest-support-for-minio-integration-with branch from 860ac75 to 57a42b5 Compare March 30, 2026 22:12

codecov Bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 80.00000% with 21 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/orcapod/databases/delta_lake_databases.py | 78.78% | 14 Missing ⚠️ |
| src/orcapod/databases/storage_utils.py | 82.05% | 7 Missing ⚠️ |


kurodo3 Bot and others added 3 commits March 30, 2026 22:34
Colons appear in semantic-version hash paths (e.g. 'semantic_v0.1:abc123')
used throughout the pipeline. The colon check in _validate_record_path was
overly strict since _sanitize_path_component already handles colons on
Windows by replacing them with '!'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
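
The commit message above relies on colons being handled at sanitization time rather than rejected at validation time. A minimal sketch of that idea follows. The function `sanitize_path_component` and its `windows` parameter are hypothetical stand-ins; the real `_sanitize_path_component` may do more than this.

```python
from __future__ import annotations

import sys


def sanitize_path_component(component: str, *, windows: bool | None = None) -> str:
    """Replace ':' with '!' only on platforms that cannot store colons in names.

    Hypothetical stand-in for _sanitize_path_component; `windows` exists only
    so the behaviour can be exercised on any platform.
    """
    if windows is None:
        windows = sys.platform.startswith("win")
    return component.replace(":", "!") if windows else component
```

With sanitization in place, a component such as 'semantic_v0.1:abc123' can pass validation unchanged and is rewritten only where the filesystem requires it.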
… step

bitnami/minio is no longer reliably available. Use the official minio/minio
image started via 'docker run' with 'server /data' — this approach does not
require service container syntax and gives full control over the start command.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pygraphviz>=1.14 requires graphviz C headers (graphviz/cgraph.h) which
are not present on ubuntu-latest by default.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
  def __init__(
      self,
-     base_path: str | Path,
+     base_path: "str | Path | UPath",
Contributor

remove "" around type hint -- make sure to use from __future__ import annotations so there's no need to put quotes

Contributor Author

Fixed — added from __future__ import annotations and removed all quoted annotations throughout the file.
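
As a sketch of why the quotes become unnecessary: with `from __future__ import annotations`, annotations are stored as strings and are not evaluated when the function is defined, so a name imported only under `TYPE_CHECKING` can be used unquoted. The helper `describe_base_path` is hypothetical and exists only to carry the annotation.

```python
from __future__ import annotations  # annotations are kept as strings, not evaluated

from pathlib import Path
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; upath need not be importable at runtime.
    from upath import UPath


def describe_base_path(base_path: str | Path | UPath) -> str:
    """Hypothetical helper; the unquoted UPath annotation is safe here."""
    return str(base_path)
```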

# For cloud paths: create_base_path is silently ignored (no directory needed).

# Batch management
self._delta_table_cache: dict[str, "deltalake.DeltaTable"] = {}
Contributor

remove quotes

Contributor Author

Fixed.

)

- def _get_delta_table(self, record_path: tuple[str, ...]) -> DeltaTable | None:
+ def _get_delta_table(self, record_path: tuple[str, ...]) -> "deltalake.DeltaTable | None":
Contributor

remove quotes

Contributor Author

Fixed.


def flush(self) -> None:
    """Flush all pending batches."""
    # TODO: capture and re-raise exceptions at the end
Contributor

remove this TODO if already addressed

Contributor Author

TODO removed — implemented collect-and-re-raise in flush(): all batches are attempted, errors are collected, and a RuntimeError is raised at the end listing the failed keys.

record_path = tuple(record_key.split("/"))
try:
    self.flush_batch(record_path)
except Exception as e:
Contributor

this suppresses exceptions too broadly -- make it more specific to the expected case (i.e. failure to flush the batch).

Contributor Author

Fixed together with the TODO above: flush() now collects exceptions across all batches and re-raises at the end, so nothing is silently suppressed.
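
The collect-and-re-raise behaviour described in the two replies above can be sketched as follows. `BatchStore`, its `_pending` dict, and the simulated failure are assumptions made so the example is self-contained; only the shape of `flush()` reflects what the thread describes (attempt every batch, collect errors, raise one RuntimeError naming the failed keys).

```python
from __future__ import annotations


class BatchStore:
    """Toy stand-in for the database; only the flush logic is of interest."""

    def __init__(self) -> None:
        self._pending: dict[str, list[dict]] = {}

    def flush_batch(self, record_path: tuple[str, ...]) -> None:
        key = "/".join(record_path)
        batch = self._pending[key]
        if any(row.get("bad") for row in batch):  # simulated write failure
            raise OSError(f"cannot write {key}")
        del self._pending[key]

    def flush(self) -> None:
        """Attempt every pending batch, then raise once listing the failures."""
        errors: dict[str, Exception] = {}
        for record_key in list(self._pending):
            try:
                self.flush_batch(tuple(record_key.split("/")))
            except Exception as exc:  # collected for the final report, not dropped
                errors[record_key] = exc
        if errors:
            failed = ", ".join(sorted(errors))
            raise RuntimeError(f"Failed to flush batches: {failed}")
```

In this sketch a failed batch stays in `_pending`, so a later `flush()` can retry it; whether the real implementation keeps or discards failed batches is not stated in the thread.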


- table_path = self._get_table_path(record_path)
- table_path.mkdir(parents=True, exist_ok=True)
+ table_uri = self._get_table_uri(record_path)
Contributor

consider incorporating path creation as part of the _get_table_uri so we can minimize the number of times self._is_cloud has to be explicitly checked.

Contributor Author

Done — added a create_dir: bool = False parameter to _get_table_uri that handles the mkdir internally for local paths (no-op for cloud). flush_batch now calls _get_table_uri(record_path, create_dir=True) and no longer checks self._is_cloud directly for this.
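
A minimal sketch of the `create_dir` idea, written as a free function so it runs on its own. The parameters `base_uri` and `is_cloud` stand in for instance state and are assumptions; the real `_get_table_uri` is a method and may join paths differently.

```python
from __future__ import annotations

from pathlib import Path


def get_table_uri(
    base_uri: str,
    record_path: tuple[str, ...],
    *,
    is_cloud: bool,
    create_dir: bool = False,
) -> str:
    """Return the table location as a string for local and cloud bases alike."""
    if is_cloud:
        # Object stores have no directories, so create_dir is a no-op here.
        return "/".join([base_uri.rstrip("/"), *record_path])
    table_path = Path(base_uri).joinpath(*record_path)
    if create_dir:
        table_path.mkdir(parents=True, exist_ok=True)
    return str(table_path)
```

Keeping the cloud check inside this one function means callers such as `flush_batch` only pass `create_dir=True` and never branch on the storage type themselves.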

Comment thread src/orcapod/databases/storage_utils.py Outdated
return scheme in _CLOUD_SCHEMES


def _extract_upath_options(upath: "UPath") -> dict[str, str]:
Contributor

no need for quotes

Contributor Author

Fixed.

Comment thread src/orcapod/databases/storage_utils.py Outdated
is_upath = False

if is_upath:
    uri = str(base_path)
Contributor

extract the common component of uri = str(base_path) and then simplify to derived = _extract_upath... if is_upath else {}

Contributor Author

Fixed — extracted the common uri = str(base_path) and collapsed the branch to derived = _extract_upath_options(base_path) if is_upath else {}.
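
The simplified shape described in this thread might look like the sketch below. It is illustrative only: the contents of `_CLOUD_SCHEMES`, the duck-typed UPath check, the key mapping, and the return type of `parse_base_path` are all assumptions, chosen so the example runs without upath installed.

```python
from __future__ import annotations

from urllib.parse import urlparse

# Assumed set of schemes; the real _CLOUD_SCHEMES may differ.
_CLOUD_SCHEMES = frozenset({"s3", "s3a", "gs", "gcs", "az", "abfs", "abfss"})

# Assumed fsspec-to-deltalake key mapping.
_KEY_MAP = {
    "key": "AWS_ACCESS_KEY_ID",
    "secret": "AWS_SECRET_ACCESS_KEY",
    "endpoint_url": "AWS_ENDPOINT_URL",
}


def is_cloud_uri(uri: str) -> bool:
    """True when the URI scheme names an object store."""
    return urlparse(uri).scheme.lower() in _CLOUD_SCHEMES


def _extract_upath_options(upath: object) -> dict[str, str]:
    """Read fsspec-style credentials from a UPath-like object."""
    opts = getattr(upath, "storage_options", None) or {}
    return {dst: str(opts[src]) for src, dst in _KEY_MAP.items() if opts.get(src) is not None}


def parse_base_path(
    base_path: object,
    storage_options: dict[str, str] | None = None,
) -> tuple[str, dict[str, str]]:
    """Return (uri, merged storage options) for a str, Path, or UPath-like input."""
    # Duck-typed stand-in for isinstance(base_path, UPath).
    is_upath = hasattr(base_path, "storage_options")
    uri = str(base_path)
    derived = _extract_upath_options(base_path) if is_upath else {}
    # Assumption: explicitly supplied options take precedence over derived ones.
    return uri, {**derived, **(storage_options or {})}
```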

…sh collect+reraise, _get_table_uri create_dir

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@eywalker eywalker merged commit 1996912 into dev Mar 30, 2026
9 checks passed
@eywalker eywalker deleted the eywalker/eng-258-addtest-support-for-minio-integration-with branch March 30, 2026 23:21