feat(spiraldb): implement SpiralDBConnector (PLT-1074) #109
Conversation
Force-pushed from b39d051 to 7d7f743.
Pull request overview
Adds a new SpiralDB-backed implementation of DBConnectorProtocol so existing database abstractions (ConnectorArrowDatabase, DBTableSource) can operate on SpiralDB via Arrow-native scans/writes, alongside unit + opt-in integration tests and supporting design/plan docs.
Changes:
- Implement `SpiralDBConnector` with lazy `spiral` loading, schema introspection, full-table scans, table creation, upsert, and config serialization.
- Export `SpiralDBConnector` from `orcapod.databases`.
- Add mocked unit tests plus env-gated live integration tests; add design spec and implementation plan docs.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/orcapod/databases/spiraldb_connector.py` | New SpiralDB connector implementing `DBConnectorProtocol` methods (scan/write/introspection/config). |
| `src/orcapod/databases/__init__.py` | Re-export `SpiralDBConnector` for public import path. |
| `tests/test_databases/test_spiraldb_connector.py` | Mocked unit tests covering protocol conformance and connector behavior. |
| `tests/test_databases/test_spiraldb_connector_integration.py` | Opt-in integration tests against the dev SpiralDB project. |
| `superpowers/specs/2026-03-25-spiraldb-connector-design.md` | Design spec documenting intended behavior/limitations. |
| `superpowers/plans/2026-03-25-spiraldb-connector.md` | Implementation plan for the connector and tests. |
```python
mask = pa.array([
    tuple(row[k] for k in pk_cols) not in existing_keys
    for row in records.to_pylist()
])
```
When records is empty and skip_existing=True, records.to_pylist() produces an empty list so pa.array([]) will infer a null-typed array. records.filter(mask) then fails because the filter must be boolean. Consider early-returning when records.num_rows == 0 and/or constructing the mask with an explicit boolean type (e.g., type=pa.bool_()).
Suggested change:

```python
mask = pa.array(
    [
        tuple(row[k] for k in pk_cols) not in existing_keys
        for row in records.to_pylist()
    ],
    type=pa.bool_(),
)
```
Fixed. Added type=pa.bool_() to the pa.array() call so the filter always receives a boolean array, even when records is empty.
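To see why the explicit type matters, here is a minimal standalone reproduction of the empty-table failure mode (illustrative only; the table and column names are made up):

```python
import pyarrow as pa

records = pa.table({"id": pa.array([], type=pa.int64())})  # empty table

mask = pa.array([])            # PyArrow infers type null from an empty list
print(mask.type)               # -> null
# records.filter(mask)         # would raise: filter mask must be a boolean array

mask = pa.array([], type=pa.bool_())   # explicit boolean type
print(records.filter(mask).num_rows)   # -> 0, filters cleanly
```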
```
These tests are skipped unless ``SPIRAL_INTEGRATION_TESTS=1`` env var is set
AND valid credentials are present in ``~/.config/pyspiral/auth.json``
(obtained via ``spiral login``). If the env var is set but credentials are
absent, the tests fail rather than skip — the operator is expected to ensure
auth is in place when enabling integration tests.
```
The module docstring says tests are skipped unless both SPIRAL_INTEGRATION_TESTS=1 and credentials are present, but the actual skip condition only checks the env var. If credentials are missing and the env var is set, these tests will fail (likely during connector init / first network call). Please adjust the docstring to match the implemented behavior (skipped only by env var; fails if auth is missing).
Suggested change:

```
These tests are skipped unless the ``SPIRAL_INTEGRATION_TESTS=1`` env var is set.
They require valid credentials in ``~/.config/pyspiral/auth.json`` (obtained via
``spiral login``). If the env var is set but credentials are absent, the tests
fail rather than skip — the operator is expected to ensure auth is in place
when enabling integration tests.
```
Fixed. Rewrote the opening sentence to state the skip condition clearly (env var only), then described the credential requirement separately. The clarifying sentence about fail-vs-skip was already present and is retained.
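For reference, an env-var-only skip gate of this kind typically looks like the following (a sketch, not the repo's actual test code):

```python
import os

import pytest

# Skip the whole module unless the operator explicitly opts in. Credentials
# are checked by the connector itself, so missing auth fails rather than skips.
pytestmark = pytest.mark.skipif(
    os.environ.get("SPIRAL_INTEGRATION_TESTS") != "1",
    reason="set SPIRAL_INTEGRATION_TESTS=1 to run live SpiralDB integration tests",
)
```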
```
## Overview

Implement `SpiralDBConnector`, a concrete implementation of `DBConnectorProtocol`
backed by [pyspiral](https://pypi.org/project/pyspiral/) (`spiral` module, `>=0.11.6`).
```
This spec states pyspiral requirement as >=0.11.6, but pyproject.toml currently defines the spiraldb extra as pyspiral>=0.11.0. Please align the spec with the repo’s actual dependency constraint (or update the constraint in pyproject.toml in the same change if 0.11.6 is truly required).
Suggested change:

```
backed by [pyspiral](https://pypi.org/project/pyspiral/) (`spiral` module, `>=0.11.0`).
```
Fixed. Updated both occurrences in the spec (>=0.11.6 → >=0.11.0) to match the constraint in pyproject.toml.
```
**Architecture:** `SpiralDBConnector` bridges the SQL-style `DBConnectorProtocol` interface to SpiralDB's expression-based scan/write API. It is dataset-scoped — a `(project_id, dataset)` pair is fixed at construction. All other layers (`ConnectorArrowDatabase`, `DBTableSource`) are unmodified; only the connector varies. `pyspiral` (`spiral` module) is lazy-loaded so `import orcapod` does not fail without it.

**Tech Stack:** Python ≥ 3.10, `pyspiral >= 0.11.6` (optional extra `orcapod[spiraldb]`), `pyarrow`, `orcapod.utils.lazy_module.LazyModule`, `pytest`, `unittest.mock`
```
The plan lists the tech stack as “Python ≥ 3.10”, but the project’s pyproject.toml specifies requires-python = ">=3.11.0". Please update this document to reflect the actual minimum supported Python version to avoid confusion for anyone following the plan.
Suggested change:

```
**Tech Stack:** Python ≥ 3.11, `pyspiral >= 0.11.6` (optional extra `orcapod[spiraldb]`), `pyarrow`, `orcapod.utils.lazy_module.LazyModule`, `pytest`, `unittest.mock`
```
Fixed. Updated Python version from 3.10 → 3.11 to match requires-python = ">=3.11.0" in pyproject.toml.
```
**Architecture:** `SpiralDBConnector` bridges the SQL-style `DBConnectorProtocol` interface to SpiralDB's expression-based scan/write API. It is dataset-scoped — a `(project_id, dataset)` pair is fixed at construction. All other layers (`ConnectorArrowDatabase`, `DBTableSource`) are unmodified; only the connector varies. `pyspiral` (`spiral` module) is lazy-loaded so `import orcapod` does not fail without it.

**Tech Stack:** Python ≥ 3.10, `pyspiral >= 0.11.6` (optional extra `orcapod[spiraldb]`), `pyarrow`, `orcapod.utils.lazy_module.LazyModule`, `pytest`, `unittest.mock`
```
This plan references pyspiral >= 0.11.6, but the repo currently constrains the spiraldb extra to pyspiral>=0.11.0 in pyproject.toml. Please align the plan with the actual dependency constraint (or update the constraint if 0.11.6 is required).
Suggested change:

```
**Tech Stack:** Python ≥ 3.10, `pyspiral >= 0.11.0` (optional extra `orcapod[spiraldb]`), `pyarrow`, `orcapod.utils.lazy_module.LazyModule`, `pytest`, `unittest.mock`
```
Fixed. Updated pyspiral >= 0.11.6 → >= 0.11.0 to match the actual constraint in pyproject.toml.
- Fix potential `pa.array` null-type crash when `records` is empty and `skip_existing=True` by passing explicit `type=pa.bool_()` to `pa.array()`
- Correct integration test docstring: skip condition is env var only; clarify that missing credentials cause failure, not skip
- Align pyspiral version in spec and plan docs to `>=0.11.0` (matches pyproject.toml), not `>=0.11.6`
- Correct Python minimum version in plan doc from 3.10 to 3.11 (matches pyproject.toml `requires-python`)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Review round addressed (commit 91be361). All 5 Copilot comments resolved in a single commit; 45 unit tests still pass.
- Add `spiral-integration` job to run-tests.yml: runs on push/manual trigger only (not PRs), requires `SPIRAL_WORKLOAD_ID` secret, uses GitHub OIDC (`permissions.id-token: write`) to authenticate as workload `work_fbjjcy` (policy `work_policy_p9cxfh`, bound to nauticalab/orcapod-python)
- Parameterise integration test project/server via `SPIRAL_PROJECT_ID` and `SPIRAL_SERVER_URL` env vars (defaulting to test-orcapod-362211 / api.spiraldb.dev) so CI can target the right environment without a code change
- Update integration test docstring to document both local-dev (auth.json) and CI (OIDC workload) auth paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- `iter_batches`: drop the `batch_size` kwarg from `to_record_batches()` — the pyspiral Plan API does not accept `batch_size`; batch sizing is controlled by the Spiral execution engine. Log a warning if a non-default value is passed, for protocol compliance.
- `get_column_info`: normalize `pa.string()` → `pa.large_string()` and `pa.binary()` → `pa.large_binary()`. SpiralDB stores string columns as `string` regardless of whether `large_string` was written, causing schema comparisons in `ConnectorArrowDatabase.flush()` to fail. Consistent with SQLiteConnector and PostgreSQLConnector, which always return `large_string`. Update unit tests to reflect the new behaviour.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents both issues in detail under a new "Discovered during live integration testing" section in Known Limitations:

1. string/large_string mismatch: SpiralDB stores all string columns as utf8/`pa.string()` regardless of the written type. Without normalization, `ConnectorArrowDatabase.flush()` raised a false schema-mismatch ValueError. Fix: `get_column_info` normalizes string→large_string and binary→large_binary. Note: `iter_batches` scan results are NOT normalized (intentional).
2. `batch_size` kwarg rejected by `Plan.to_record_batches()`: the pyspiral API does not accept `batch_size` at all. Fix: drop the kwarg; emit a warning if a non-default value is passed, for protocol compliance.

Also updates:
- `get_column_info` code block to show the normalization helper
- `iter_batches` section to document `batch_size` behaviour accurately
- Type Notes section split into raw SpiralDB types vs `get_column_info` normalization, with an explicit warning that `iter_batches` returns raw types
- Integration test section to document the OIDC CI setup and that these tests were the source of both bug discoveries
- Unit test plan row for `get_column_info`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
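As an illustration of the normalization described above, the helper presumably resembles the following sketch (the name `_normalize_type` is an assumption, not necessarily what the connector calls it):

```python
import pyarrow as pa

def _normalize_type(dtype: pa.DataType) -> pa.DataType:
    """Map SpiralDB's raw string/binary types to the large_* variants that
    SQLiteConnector and PostgreSQLConnector report, so schema comparisons
    in ConnectorArrowDatabase.flush() do not raise false mismatches."""
    if dtype == pa.string():
        return pa.large_string()
    if dtype == pa.binary():
        return pa.large_binary()
    return dtype
```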
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
```
- CI (GitHub Actions): ``SPIRAL_WORKLOAD_ID=work_fbjjcy`` env var — workload
  ``work_fbjjcy`` (policy ``work_policy_p9cxfh``) has editor access on the dev
  project and is bound to ``nauticalab/orcapod-python`` via GitHub OIDC.
```
The docstring includes concrete workload/policy IDs (work_fbjjcy, work_policy_p9cxfh) and the repo name binding. These identifiers are likely to become stale and may be considered sensitive infrastructure details; it’s usually better to describe the requirement generically ("set SPIRAL_WORKLOAD_ID with a workload that has editor access") and keep specific IDs in a private runbook or CI configuration docs.
Suggested change:

```
- CI (GitHub Actions): ``SPIRAL_WORKLOAD_ID`` env var pointing to a workload
  that has editor access on the dev project and is bound to this repository via
  GitHub OIDC (see internal CI/runbook docs for the exact workload/policy IDs).
```
Fixed. Replaced the specific workload/policy IDs and repository binding with a generic description: 'SPIRAL_WORKLOAD_ID env var pointing to a workload that has editor access on the dev project and is bound to this repository via GitHub OIDC (see internal CI/runbook docs for the exact workload/policy IDs).' The concrete IDs now live only in repo secrets/vars and internal runbook docs, not in version-controlled source.
```python
yield from reader
```
iter_batches() yields pyspiral-native string/binary column types, while ConnectorArrowDatabase writes and buffers records using large_string/large_binary (e.g. it injects __record_id as pa.large_string()) and both SQLite/PostgreSQL connectors also yield large_* types. This can break ConnectorArrowDatabase.get_all_records() / get_records_by_ids() when it needs to pa.concat_tables([committed, pending]), because committed batches from Spiral will be string but pending uses large_string, causing a schema mismatch error. Consider normalizing string→large_string and binary→large_binary in iter_batches output (e.g. cast per batch) so committed and pending tables remain schema-compatible.
Suggested change (replacing `yield from reader`):

```python
# Normalize pyspiral-native string/binary columns to large_string/large_binary
# so that schemas remain compatible with ConnectorArrowDatabase and other
# connectors that use large_* types (important for pa.concat_tables).
for batch in reader:
    schema = batch.schema
    new_fields = []
    needs_cast = False
    for field in schema:
        ftype = field.type
        if ftype == pa.string():
            new_fields.append(pa.field(field.name, pa.large_string(), field.nullable, field.metadata))
            needs_cast = True
        elif ftype == pa.binary():
            new_fields.append(pa.field(field.name, pa.large_binary(), field.nullable, field.metadata))
            needs_cast = True
        else:
            new_fields.append(field)
    if needs_cast:
        target_schema = pa.schema(new_fields, metadata=schema.metadata)
        batch = batch.cast(target_schema)
    yield batch
```
Fixed. iter_batches now casts each batch before yielding: fields with pa.string() are cast to pa.large_string() and pa.binary() to pa.large_binary(). Only batches that actually contain these types incur the overhead (needs_cast guard). Consistent with normalization already applied in get_column_info, so ConnectorArrowDatabase pa.concat_tables([committed, pending]) will always see matching schemas. Added test_string_and_binary_normalized_to_large and updated test_full_scan_yields_batches to assert large_string output.
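A minimal standalone illustration of the schema mismatch this cast prevents (the column name is made up):

```python
import pyarrow as pa

committed = pa.table({"k": pa.array(["a"], type=pa.string())})       # raw Spiral scan type
pending = pa.table({"k": pa.array(["b"], type=pa.large_string())})   # buffered writes

try:
    pa.concat_tables([committed, pending])
except pa.ArrowInvalid as exc:
    print(exc)  # schema mismatch: string vs large_string

# Casting the committed side to the pending schema makes the concat succeed.
combined = pa.concat_tables([committed.cast(pending.schema), pending])
print(combined.num_rows)  # -> 2
```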
```yaml
spiral-integration:
  runs-on: ubuntu-latest
  # Only run on push/manual trigger — not on PRs — to avoid consuming SpiralDB
  # resources on every review cycle. The workload work_fbjjcy is bound to
  # nauticalab/orcapod-python via GitHub OIDC (policy work_policy_p9cxfh).
  if: github.event_name == 'push' || github.event_name == 'workflow_dispatch'
  needs: test
  permissions:
    id-token: write  # required to request a GitHub OIDC token for SpiralDB auth
    contents: read
  env:
    SPIRAL_WORKLOAD_ID: ${{ secrets.SPIRAL_WORKLOAD_ID }}
    SPIRAL_INTEGRATION_TESTS: "1"
    SPIRAL_PROJECT_ID: ${{ vars.SPIRAL_PROJECT_ID }}
    SPIRAL_SERVER_URL: ${{ vars.SPIRAL_SERVER_URL }}
```
The spiral-integration job will still run even when SPIRAL_WORKLOAD_ID / SPIRAL_PROJECT_ID / SPIRAL_SERVER_URL are unset; in GitHub Actions missing secrets.*/vars.* resolve to empty strings, which will override the test defaults and likely cause a hard failure. Consider adding an additional job-level if: guard requiring a non-empty secrets.SPIRAL_WORKLOAD_ID (and/or required vars), and/or providing expression fallbacks (e.g. ${{ vars.SPIRAL_SERVER_URL || 'http://api.spiraldb.dev' }}) so the env doesn’t get set to an empty string.
Fixed. Added secrets.SPIRAL_WORKLOAD_ID != '' to the job's if: condition so it skips gracefully when the secret is not configured. Also added || fallback defaults for SPIRAL_PROJECT_ID and SPIRAL_SERVER_URL so empty variables resolve to the dev-environment defaults (test-orcapod-362211 and http://api.spiraldb.dev) rather than empty strings.
…missing SPIRAL_WORKLOAD_ID, scrub stale workload IDs from docs

- `iter_batches`: per-batch cast of `pa.string()→pa.large_string()` and `pa.binary()→pa.large_binary()` so `ConnectorArrowDatabase.get_all_records()` and `get_records_by_ids()` can call `pa.concat_tables([committed, pending])` without a schema-mismatch error. Matches the normalization already applied in `get_column_info`.
- CI (`run-tests.yml`): add a `secrets.SPIRAL_WORKLOAD_ID != ''` guard to the `spiral-integration` job `if:` condition so the job skips gracefully when the secret is not configured. Also add `||` fallback defaults for `SPIRAL_PROJECT_ID`/`SPIRAL_SERVER_URL` so an empty variable resolves to the dev-environment defaults rather than an empty string.
- Integration-test docstring: replace specific workload/policy IDs (`work_fbjjcy`, `work_policy_p9cxfh`) with a generic description to avoid stale references as CI infrastructure evolves.
- Tests: update `test_full_scan_yields_batches` to assert `large_string` output (not raw `string`); add `test_string_and_binary_normalized_to_large` to explicitly cover the per-batch cast path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Round 3 review comments addressed — three issues fixed in commit ae7e5d9.
## Summary
- `SpiralDBConnector` — a `DBConnectorProtocol` implementation backed by SpiralDB (pyspiral), enabling `ConnectorArrowDatabase` and `DBTableSource` to work with SpiralDB
- A `(project_id, dataset)` pair is fixed at construction; all table names are plain strings within that dataset
- Full-table scans via `scan().to_record_batches()`, upsert-by-key writes, and `skip_existing=True` via a scan+filter approximation
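For orientation, a hypothetical construction/usage sketch — the parameter names below are assumptions based on this description, not the connector's confirmed signature:

```python
from orcapod.databases import SpiralDBConnector

# Hypothetical parameter names; the connector is dataset-scoped, so the
# (project_id, dataset) pair is fixed at construction per the summary above.
connector = SpiralDBConnector(
    project_id="test-orcapod-362211",  # dev project referenced in this PR
    dataset="my_dataset",              # made-up dataset name
)

# Table names are plain strings within that dataset; iter_batches streams
# Arrow record batches (normalized to large_string/large_binary).
for batch in connector.iter_batches("my_table"):
    print(batch.num_rows)
```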
## What's in this PR

- `src/orcapod/databases/spiraldb_connector.py` (new) — `SpiralDBConnector` with lazy `spiral` module loading (see the sketch after this list), all `DBConnectorProtocol` methods, documented known limitations (timezone stripping, O(n) `skip_existing`, single-column PK creation)
- `src/orcapod/databases/__init__.py` (updated) — exports `SpiralDBConnector`
- `tests/test_databases/test_spiraldb_connector.py` (new) — 45 unit tests, fully mocked (no network; the `spiral` module is patched out entirely)
- `tests/test_databases/test_spiraldb_connector_integration.py` (new) — 3 integration tests against the `test-orcapod-362211` dev project, skipped unless `SPIRAL_INTEGRATION_TESTS=1`
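The lazy loading keeps `import orcapod` working when pyspiral is absent. The repo uses `orcapod.utils.lazy_module.LazyModule`, whose exact API is not shown in this PR; the sketch below is a generic stand-in for the idea, not the actual implementation:

```python
import importlib

class LazySpiral:
    """Defer importing the optional 'spiral' module until first attribute
    access, so importing orcapod never fails when pyspiral is not installed."""

    def __init__(self) -> None:
        self._module = None

    def __getattr__(self, name: str):
        if self._module is None:
            try:
                self._module = importlib.import_module("spiral")
            except ImportError as exc:
                raise ImportError(
                    "SpiralDBConnector requires pyspiral; install the optional "
                    "extra: pip install 'orcapod[spiraldb]'"
                ) from exc
        return getattr(self._module, name)

spiral = LazySpiral()  # first attribute access triggers the real import
```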
## Test Plan

- Without `SPIRAL_INTEGRATION_TESTS=1`, the `tests/test_databases/` suite (290 tests) passes with no regressions
- With `SPIRAL_INTEGRATION_TESTS=1` after `spiral login`, integration tests pass against dev project `test-orcapod-362211`
## Known Limitations (documented in code)

- Timezone info is stripped from `timestamp` PK columns by SpiralDB at storage time (affects `skip_existing=True` reliability for tz-aware timestamps)
- `skip_existing=True` requires a full table scan (O(n))
- `create_table_if_not_exists` supports single-column PKs only; composite-PK tables must be created externally

## Related Issues

- PLT-1074
🤖 Generated with Claude Code