Document and extend Ibis support coverage #3

Merged
eddietejeda merged 8 commits into main from docs/ibis-support-overview on May 17, 2026

@eddietejeda (Contributor) commented May 11, 2026

Summary

  • Document the supported and unsupported Ibis workflows for Hotdata.
  • Add direct Arrow-backed to_pyarrow() / to_pyarrow_batches() materialization.
  • Add limited dataset-backed create_table() and drop_table() support for local pandas/PyArrow data and Hotdata-managed datasets.

Test plan

  • uv run ruff check src tests examples
  • uv run ruff format --check src tests examples
  • uv run pytest tests -q

Clarify which Ibis workflows Hotdata supports today and which full backend features remain outside the read-oriented SQL surface.
claude[bot] previously approved these changes May 11, 2026
Add dataset-backed create/drop table helpers and implement direct Arrow materialization for Ibis Arrow APIs, with coverage for local data uploads and dataset deletion behavior.
@eddietejeda changed the title from "Document Ibis support coverage" to "Document and extend Ibis support coverage" May 11, 2026
Comment on lines +333 to +359
def _dataset_database(self, database: tuple[str, str] | str | None) -> str | None:
    if database is None:
        return None
    table_loc = self._to_sqlglot_table(database)
    catalog, schema_name = self._to_catalog_db_tuple(table_loc)
    if catalog and catalog != "datasets":
        return "__not_datasets__"
    return schema_name or catalog

def _find_dataset(
    self, table_name: str, database: tuple[str, str] | str | None
) -> dict[str, Any]:
    schema_name = self._dataset_database(database)
    matches = [
        ds
        for ds in self._iterate_datasets()
        if ds["table_name"] == table_name
        and schema_name != "__not_datasets__"
        and (schema_name is None or ds["schema_name"] == schema_name)
    ]
    if not matches:
        raise com.TableNotFound(table_name)
    if len(matches) > 1:
        raise com.IbisInputError(
            f"Multiple Hotdata datasets named {table_name!r}; pass database=('datasets', schema)."
        )
    return matches[0]

nit: (not blocking) The "__not_datasets__" magic string is fragile — if a Hotdata schema is ever literally named __not_datasets__, lookups would silently misbehave. More importantly, when the sentinel is set, _find_dataset still iterates every page of list_datasets() only to discard each row against schema_name != "__not_datasets__". Consider short-circuiting before the iteration (e.g., raise TableNotFound directly in _find_dataset when the catalog isn't "datasets") which avoids both the sentinel and the unnecessary network calls — see test_drop_table_raises_for_non_dataset_catalog where the dataset list endpoint is hit needlessly.
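
The short-circuit suggested above could look roughly like the following. This is a minimal, self-contained sketch rather than the backend's actual code: `DatasetLookup`, `find_dataset`, and `list_calls` are hypothetical names, `TableNotFound` is a local stand-in for `com.TableNotFound`, and `ValueError` stands in for `com.IbisInputError` so the example runs without Ibis installed.

```python
from typing import Any, Dict, Iterable, Iterator, Optional


class TableNotFound(Exception):
    """Local stand-in for ibis.common.exceptions.TableNotFound."""


class DatasetLookup:
    """Hypothetical, stripped-down model of the backend's dataset lookup."""

    def __init__(self, datasets: Iterable[Dict[str, Any]]) -> None:
        self._datasets = list(datasets)
        self.list_calls = 0  # counts simulated dataset-listing requests

    def _iterate_datasets(self) -> Iterator[Dict[str, Any]]:
        # In the real backend this would page through the list endpoint.
        self.list_calls += 1
        yield from self._datasets

    def find_dataset(
        self, table_name: str, catalog: Optional[str], schema_name: Optional[str]
    ) -> Dict[str, Any]:
        # Short-circuit: a catalog other than "datasets" can never match,
        # so fail before listing anything. No sentinel string is needed.
        if catalog is not None and catalog != "datasets":
            raise TableNotFound(table_name)
        matches = [
            ds
            for ds in self._iterate_datasets()
            if ds["table_name"] == table_name
            and (schema_name is None or ds["schema_name"] == schema_name)
        ]
        if not matches:
            raise TableNotFound(table_name)
        if len(matches) > 1:
            # ValueError stands in for com.IbisInputError here.
            raise ValueError(f"Multiple datasets named {table_name!r}")
        return matches[0]
```

With this shape, a lookup against a non-dataset catalog raises immediately and `list_calls` stays unchanged, which is the behavior the comment asks for in `test_drop_table_raises_for_non_dataset_catalog`.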

Comment thread src/ibis_hotdata/backend.py Outdated
Comment on lines +552 to +565
data, table_schema = self._local_table_to_parquet(obj, schema)
upload = self.upload_file(data, content_type="application/parquet")
dataset = self.create_dataset_from_upload(
    upload_id=upload["id"],
    label=name,
    table_name=name,
    file_format="parquet",
)
return ops.DatabaseTable(
    dataset["table_name"],
    schema=table_schema,
    source=self,
    namespace=ops.Namespace(catalog="datasets", database=dataset["schema_name"]),
).to_expr()

nit: (not blocking) Two concerns here:

  1. The returned expression carries table_schema derived from the local PyArrow schema, not the schema Hotdata actually stored. If Hotdata applies any type coercion during ingest (common for Parquet→catalog mapping), the returned DatabaseTable will lie about its column types, and subsequent operations on the expression may compile against the wrong types. Consider re-fetching via self.table(name, database=("datasets", dataset["schema_name"])) so the schema reflects the server's view.

  2. If create_dataset_from_upload raises after upload_file succeeds, the upload is orphaned with no cleanup. Worth a comment acknowledging this, or wrapping in try/except to best-effort delete the upload.
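
For the second concern, a best-effort cleanup might be sketched as below. The three callables are passed in so the example runs on its own; `upload_file` and `create_dataset_from_upload` mirror the names in the diff, while `delete_upload` is purely hypothetical (the diff does not show whether Hotdata exposes an upload-deletion call), as is the wrapper name `create_dataset_with_cleanup`.

```python
from typing import Any, Callable, Dict


def create_dataset_with_cleanup(
    upload_file: Callable[..., Dict[str, Any]],
    create_dataset_from_upload: Callable[..., Dict[str, Any]],
    delete_upload: Callable[[str], None],  # hypothetical cleanup hook
    data: bytes,
    name: str,
) -> Dict[str, Any]:
    """Upload data, create a dataset from it, and clean up on failure."""
    upload = upload_file(data, content_type="application/parquet")
    try:
        return create_dataset_from_upload(
            upload_id=upload["id"],
            label=name,
            table_name=name,
            file_format="parquet",
        )
    except Exception:
        # Best effort only: try to remove the orphaned upload, but never
        # let a cleanup failure hide the original error.
        try:
            delete_upload(upload["id"])
        except Exception:
            pass
        raise
```

The original exception is always re-raised, so callers see the same failure as before; the only difference is that the upload is no longer left behind when deletion succeeds.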

Comment on lines +506 to +528
def _local_table_to_parquet(self, obj: Any, schema: sch.Schema | None):
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    from ibis.formats.pyarrow import PyArrowSchema

    if obj is None:
        if schema is None:
            raise com.IbisInputError("create_table requires a pandas/pyarrow object or schema")
        arrow_schema = schema.to_pyarrow()
        table = pa.Table.from_arrays(
            [pa.array([], type=field.type) for field in arrow_schema],
            schema=arrow_schema,
        )
    elif isinstance(obj, pa.Table):
        table = obj
    elif isinstance(obj, pd.DataFrame):
        table = pa.Table.from_pandas(obj, preserve_index=False)
    else:
        raise com.IbisInputError(
            "create_table currently accepts pandas.DataFrame or pyarrow.Table"
        )

super nit: (not blocking) When both obj and schema are provided, schema is silently ignored. A user passing create_table("x", df, schema=...) would reasonably expect the schema to be applied (cast/validate). Either raise on the combo or cast obj to the requested schema before serializing.
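
Of the two options, raising on the combination is the simpler one. A minimal sketch of that check follows, assuming a hypothetical helper name `validate_create_table_args` and a local `IbisInputError` stand-in so it runs without Ibis; the second error message is illustrative wording, not taken from the PR.

```python
from typing import Any, Optional


class IbisInputError(ValueError):
    """Local stand-in for ibis.common.exceptions.IbisInputError."""


def validate_create_table_args(obj: Optional[Any], schema: Optional[Any]) -> str:
    """Return which argument drives table creation, rejecting ambiguous calls."""
    if obj is None and schema is None:
        raise IbisInputError("create_table requires a pandas/pyarrow object or schema")
    if obj is not None and schema is not None:
        # Fail loudly instead of silently dropping the caller's schema.
        raise IbisInputError("create_table accepts either obj or schema, not both")
    return "obj" if obj is not None else "schema"
```

Calling this at the top of `create_table` would turn `create_table("x", df, schema=...)` into an explicit error rather than a silently ignored argument.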

Comment on lines +458 to +474
def to_pyarrow_batches(
    self,
    expr: ir.Expr,
    /,
    *,
    params: Mapping[ir.Scalar, Any] | None = None,
    limit: int | str | None = None,
    chunk_size: int = 1_000_000,
    **kwargs: Any,
):
    import pyarrow as pa

    table = self.to_pyarrow(expr.as_table(), params=params, limit=limit, **kwargs)
    return pa.ipc.RecordBatchReader.from_batches(
        table.schema,
        table.to_batches(max_chunksize=chunk_size),
    )

super nit: (not blocking) to_pyarrow_batches fully materializes the result via to_pyarrow before slicing into batches, so it offers no memory advantage over to_pyarrow for large results — chunk_size only affects batch granularity, not peak memory. The README phrasing ("use the Arrow IPC result data exposed by Hotdata without converting through JSON rows") is accurate, but users familiar with other backends may assume this method streams. Worth a brief docstring noting the in-memory materialization.
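
The difference between the two behaviors can be illustrated with a pure-Python model. This is not PyArrow code and not the backend's implementation; `CountingSource`, `materialize_then_chunk`, and `stream_chunks` are hypothetical names chosen for the sketch. The first function mirrors the reviewed approach (build everything, then slice), the second shows what a streaming reader would do.

```python
from typing import Iterable, Iterator, List


class CountingSource:
    """Row source that records how many rows have been pulled from it."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.pulled = 0

    def __iter__(self) -> Iterator[int]:
        for i in range(self.n):
            self.pulled += 1
            yield i


def materialize_then_chunk(rows: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Model of the reviewed approach: hold the full result, then slice it."""
    table = list(rows)  # the whole result is in memory at this point
    for start in range(0, len(table), chunk_size):
        yield table[start:start + chunk_size]


def stream_chunks(rows: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Model of true streaming: at most one chunk is buffered at a time."""
    chunk: List[int] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```

Both functions yield identical chunks, but after requesting only the first chunk the materializing version has already consumed every row from the source, while the streaming version has consumed just `chunk_size` rows. That is the distinction a docstring on `to_pyarrow_batches` would need to make clear.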

claude[bot] previously approved these changes May 11, 2026
Enforce create_table argument exclusivity and avoid dataset lookups for non-dataset drop targets so the limited dataset-backed table support matches its documented contract.
claude[bot] previously approved these changes May 11, 2026
claude[bot] previously approved these changes May 13, 2026
Add the new architecture guardrail test and ignore .DS_Store files so local metadata artifacts do not pollute status output.
claude[bot] previously approved these changes May 14, 2026
Update docs and CLI helper defaults to remove HOTDATA_TOKEN references and keep setup instructions consistent.
claude[bot] previously approved these changes May 15, 2026
Align Ibis README and examples with the canonical workspace env var.
claude[bot] previously approved these changes May 17, 2026
Summarize connection, catalog, execution, Arrow, SQL, upload, and dataset cleanup support near the top of the README.
@eddietejeda merged commit 9b96213 into main May 17, 2026
1 check passed