Document and extend Ibis support coverage by eddietejeda · Pull Request #3 · hotdata-dev/hotdata-ibis

eddietejeda · 2026-05-11T16:01:58Z

Summary

Document the supported and unsupported Ibis workflows for Hotdata.
Add direct Arrow-backed to_pyarrow() / to_pyarrow_batches() materialization.
Add limited dataset-backed create_table() and drop_table() support for local pandas/PyArrow data and Hotdata-managed datasets.

Test plan

uv run ruff check src tests examples
uv run ruff format --check src tests examples
uv run pytest tests -q


claude · 2026-05-11T16:40:30Z

+    def _dataset_database(self, database: tuple[str, str] | str | None) -> str | None:
+        if database is None:
+            return None
+        table_loc = self._to_sqlglot_table(database)
+        catalog, schema_name = self._to_catalog_db_tuple(table_loc)
+        if catalog and catalog != "datasets":
+            return "__not_datasets__"
+        return schema_name or catalog
+
+    def _find_dataset(
+        self, table_name: str, database: tuple[str, str] | str | None
+    ) -> dict[str, Any]:
+        schema_name = self._dataset_database(database)
+        matches = [
+            ds
+            for ds in self._iterate_datasets()
+            if ds["table_name"] == table_name
+            and schema_name != "__not_datasets__"
+            and (schema_name is None or ds["schema_name"] == schema_name)
+        ]
+        if not matches:
+            raise com.TableNotFound(table_name)
+        if len(matches) > 1:
+            raise com.IbisInputError(
+                f"Multiple Hotdata datasets named {table_name!r}; pass database=('datasets', schema)."
+            )
+        return matches[0]


nit: (not blocking) The "__not_datasets__" magic string is fragile — if a Hotdata schema is ever literally named __not_datasets__, lookups would silently misbehave. More importantly, when the sentinel is set, _find_dataset still iterates every page of list_datasets() only to discard each row against schema_name != "__not_datasets__". Consider short-circuiting before the iteration (e.g., raise TableNotFound directly in _find_dataset when the catalog isn't "datasets") which avoids both the sentinel and the unnecessary network calls — see test_drop_table_raises_for_non_dataset_catalog where the dataset list endpoint is hit needlessly.
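A minimal sketch of the short-circuit suggested above, under stated assumptions: `split_database` is a simplified, hypothetical stand-in for `_to_sqlglot_table` / `_to_catalog_db_tuple`, the two exception classes stand in for `com.TableNotFound` and `com.IbisInputError`, and `list_datasets` is any zero-argument callable that yields dataset dicts. It is not the backend's actual code, only an illustration of rejecting a non-`datasets` catalog before any listing call is made.

```python
from typing import Any, Callable, Dict, Iterable, Optional, Tuple, Union

DatabaseArg = Union[Tuple[str, str], str, None]


class TableNotFound(LookupError):
    """Stand-in for ibis.common.exceptions.TableNotFound (assumption)."""


class AmbiguousDataset(ValueError):
    """Stand-in for the IbisInputError raised in the diff (assumption)."""


def split_database(database: DatabaseArg) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical helper: return (catalog, schema) for a database argument."""
    if database is None:
        return None, None
    if isinstance(database, tuple):
        catalog, schema = database
        return catalog, schema
    if "." in database:
        catalog, schema = database.split(".", 1)
        return catalog, schema
    return None, database  # a bare string is treated as a schema name


def find_dataset(
    table_name: str,
    database: DatabaseArg,
    list_datasets: Callable[[], Iterable[Dict[str, Any]]],
) -> Dict[str, Any]:
    catalog, schema = split_database(database)
    if catalog is not None and catalog != "datasets":
        # Short-circuit: no sentinel string and no listing requests.
        raise TableNotFound(table_name)
    matches = [
        ds
        for ds in list_datasets()
        if ds["table_name"] == table_name
        and (schema is None or ds["schema_name"] == schema)
    ]
    if not matches:
        raise TableNotFound(table_name)
    if len(matches) > 1:
        raise AmbiguousDataset(
            f"Multiple datasets named {table_name!r}; pass database=('datasets', schema)."
        )
    return matches[0]
```

Because the catalog check happens before `list_datasets()` is called, a drop or lookup against a non-dataset catalog fails without touching the dataset list endpoint, and a schema literally named `__not_datasets__` is no longer a special case.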

claude · 2026-05-11T16:40:33Z

+        data, table_schema = self._local_table_to_parquet(obj, schema)
+        upload = self.upload_file(data, content_type="application/parquet")
+        dataset = self.create_dataset_from_upload(
+            upload_id=upload["id"],
+            label=name,
+            table_name=name,
+            file_format="parquet",
+        )
+        return ops.DatabaseTable(
+            dataset["table_name"],
+            schema=table_schema,
+            source=self,
+            namespace=ops.Namespace(catalog="datasets", database=dataset["schema_name"]),
+        ).to_expr()


nit: (not blocking) Two concerns here:

The returned expression carries table_schema derived from the local PyArrow schema, not the schema Hotdata actually stored. If Hotdata applies any type coercion during ingest (common for Parquet→catalog mapping), the returned DatabaseTable will lie about its column types, and subsequent operations on the expression may compile against the wrong types. Consider re-fetching via self.table(name, database=("datasets", dataset["schema_name"])) so the schema reflects the server's view.

If create_dataset_from_upload raises after upload_file succeeds, the upload is orphaned with no cleanup. Worth a comment acknowledging this, or wrapping in try/except to best-effort delete the upload.
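Both concerns could be addressed roughly as follows. This is a sketch, not the PR's implementation: `upload_file`, `create_dataset_from_upload`, and `table` mirror the calls visible in the diff and the comment above, while `delete_upload` is a hypothetical cleanup method that the Hotdata client may or may not provide.

```python
from typing import Any, Dict, Protocol, Tuple


class DatasetBackend(Protocol):
    """The subset of backend methods this sketch relies on."""

    def upload_file(self, data: bytes, content_type: str) -> Dict[str, Any]: ...

    def create_dataset_from_upload(
        self, *, upload_id: str, label: str, table_name: str, file_format: str
    ) -> Dict[str, Any]: ...

    # Hypothetical: not shown anywhere in the diff.
    def delete_upload(self, upload_id: str) -> None: ...

    def table(self, name: str, database: Tuple[str, str]) -> Any: ...


def create_dataset_table(backend: DatasetBackend, name: str, data: bytes) -> Any:
    upload = backend.upload_file(data, content_type="application/parquet")
    try:
        dataset = backend.create_dataset_from_upload(
            upload_id=upload["id"],
            label=name,
            table_name=name,
            file_format="parquet",
        )
    except Exception:
        # Best-effort cleanup; a cleanup failure must not mask the original error.
        try:
            backend.delete_upload(upload["id"])
        except Exception:
            pass
        raise
    # Re-fetch so the expression carries the schema the server stored,
    # not the schema inferred from the local data.
    return backend.table(
        dataset["table_name"], database=("datasets", dataset["schema_name"])
    )
```

The re-fetch costs one extra metadata request per `create_table` call; if that is not acceptable, a comment noting that the returned schema is the local one would at least document the limitation.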

claude · 2026-05-11T16:40:35Z

+    def _local_table_to_parquet(self, obj: Any, schema: sch.Schema | None):
+        import pandas as pd
+        import pyarrow as pa
+        import pyarrow.parquet as pq
+
+        from ibis.formats.pyarrow import PyArrowSchema
+
+        if obj is None:
+            if schema is None:
+                raise com.IbisInputError("create_table requires a pandas/pyarrow object or schema")
+            arrow_schema = schema.to_pyarrow()
+            table = pa.Table.from_arrays(
+                [pa.array([], type=field.type) for field in arrow_schema],
+                schema=arrow_schema,
+            )
+        elif isinstance(obj, pa.Table):
+            table = obj
+        elif isinstance(obj, pd.DataFrame):
+            table = pa.Table.from_pandas(obj, preserve_index=False)
+        else:
+            raise com.IbisInputError(
+                "create_table currently accepts pandas.DataFrame or pyarrow.Table"
+            )


super nit: (not blocking) When both obj and schema are provided, schema is silently ignored. A user passing create_table("x", df, schema=...) would reasonably expect the schema to be applied (cast/validate). Either raise on the combo or cast obj to the requested schema before serializing.
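The "raise on the combo" option might look like the guard below. It is an illustrative sketch only: `check_create_table_args` is a hypothetical helper name, and `ValueError` stands in for `com.IbisInputError` so the snippet runs without Ibis installed.

```python
from typing import Any


def check_create_table_args(obj: Any, schema: Any) -> str:
    """Return which argument drives table creation, rejecting ambiguous calls.

    Comparisons use `is None` rather than truthiness because a pandas
    DataFrame raises on bool() and an empty table is still a valid input.
    """
    if obj is not None and schema is not None:
        raise ValueError("create_table accepts either obj or schema, not both")
    if obj is None and schema is None:
        raise ValueError("create_table requires a pandas/pyarrow object or schema")
    return "schema" if obj is None else "obj"
```

Calling the guard at the top of `create_table` keeps `_local_table_to_parquet` free of the ambiguous case. Casting `obj` to the requested schema is the more permissive alternative, at the cost of having to define what happens when the cast fails.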

claude · 2026-05-11T16:40:38Z

+    def to_pyarrow_batches(
+        self,
+        expr: ir.Expr,
+        /,
+        *,
+        params: Mapping[ir.Scalar, Any] | None = None,
+        limit: int | str | None = None,
+        chunk_size: int = 1_000_000,
+        **kwargs: Any,
+    ):
+        import pyarrow as pa
+
+        table = self.to_pyarrow(expr.as_table(), params=params, limit=limit, **kwargs)
+        return pa.ipc.RecordBatchReader.from_batches(
+            table.schema,
+            table.to_batches(max_chunksize=chunk_size),
+        )


super nit: (not blocking) to_pyarrow_batches fully materializes the result via to_pyarrow before slicing into batches, so it offers no memory advantage over to_pyarrow for large results — chunk_size only affects batch granularity, not peak memory. The README phrasing ("use the Arrow IPC result data exposed by Hotdata without converting through JSON rows") is accurate, but users familiar with other backends may assume this method streams. Worth a brief docstring noting the in-memory materialization.
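To make the memory point concrete, here is a plain-Python analogy (deliberately not PyArrow, and not the backend's code). `eager_batches` has the same shape as the diff above: load everything, then slice. `streaming_batches` shows what callers of other backends may be expecting. Whether Hotdata exposes a result endpoint that could support the streaming shape is not something this thread establishes.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def eager_batches(rows: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Materialize the full result first, then slice it into batches."""
    table = list(rows)  # peak memory is the whole result, whatever chunk_size is
    for start in range(0, len(table), chunk_size):
        yield table[start : start + chunk_size]


def streaming_batches(rows: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Pull at most chunk_size rows from the source per batch."""
    source = iter(rows)
    while True:
        batch = list(islice(source, chunk_size))
        if not batch:
            return
        yield batch


def counted(n: int, seen: List[int]) -> Iterator[int]:
    """Yield 0..n-1 and record each row as the consumer pulls it."""
    for i in range(n):
        seen.append(i)
        yield i


if __name__ == "__main__":
    seen_eager: List[int] = []
    next(eager_batches(counted(10, seen_eager), 3))
    seen_stream: List[int] = []
    next(streaming_batches(counted(10, seen_stream), 3))
    # Rows pulled from the source before the first batch was returned.
    print(len(seen_eager), len(seen_stream))
```

In the eager version the first batch only arrives after every row has been read, which is the behavior a one-line docstring on `to_pyarrow_batches` would need to call out.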


docs: describe Ibis support coverage

9adc90f

Clarify which Ibis workflows Hotdata supports today and which full backend features remain outside the read-oriented SQL surface.

claude Bot previously approved these changes May 11, 2026


feat: support dataset tables and direct Arrow exports

6948932

Add dataset-backed create/drop table helpers and implement direct Arrow materialization for Ibis Arrow APIs, with coverage for local data uploads and dataset deletion behavior.

eddietejeda dismissed claude[bot]’s stale review via 6948932 May 11, 2026 16:37

eddietejeda changed the title from "Document Ibis support coverage" to "Document and extend Ibis support coverage" May 11, 2026

claude Bot reviewed May 11, 2026


claude Bot previously approved these changes May 11, 2026


fix: tighten dataset table edge cases

2d2db2f

Enforce create_table argument exclusivity and avoid dataset lookups for non-dataset drop targets so the limited dataset-backed table support matches its documented contract.

eddietejeda dismissed claude[bot]’s stale review via 2d2db2f May 11, 2026 16:57

claude Bot previously approved these changes May 11, 2026


Version metadata and tests; uv default dev group; README sync commands.

10a238d

eddietejeda dismissed claude[bot]’s stale review via 10a238d May 13, 2026 05:00

claude Bot previously approved these changes May 13, 2026


test: add architecture guardrail coverage and ignore macOS files

e65ee1e

Add the new architecture guardrail test and ignore .DS_Store files so local metadata artifacts do not pollute status output.

eddietejeda dismissed claude[bot]’s stale review via e65ee1e May 14, 2026 18:51

claude Bot previously approved these changes May 14, 2026


Standardize ibis examples on HOTDATA_API_KEY.

f80ce96

Update docs and CLI helper defaults to remove HOTDATA_TOKEN references and keep setup instructions consistent.

eddietejeda dismissed claude[bot]’s stale review via f80ce96 May 15, 2026 23:55

claude Bot previously approved these changes May 15, 2026


docs: use HOTDATA_WORKSPACE env var in examples

fd377ab

Align Ibis README and examples with the canonical workspace env var.

eddietejeda dismissed claude[bot]’s stale review via fd377ab May 17, 2026 03:09

claude Bot previously approved these changes May 17, 2026


docs: add ibis feature overview

e4f83f2

Summarize connection, catalog, execution, Arrow, SQL, upload, and dataset cleanup support near the top of the README.

eddietejeda dismissed claude[bot]’s stale review via e4f83f2 May 17, 2026 03:18

claude Bot approved these changes May 17, 2026


eddietejeda merged commit 9b96213 into main May 17, 2026
1 check passed

eddietejeda merged 8 commits into main from docs/ibis-support-overview
