Document and extend Ibis support coverage #3
Clarify which Ibis workflows Hotdata supports today and which full backend features remain outside the read-oriented SQL surface.
Add dataset-backed create/drop table helpers and implement direct Arrow materialization for Ibis Arrow APIs, with coverage for local data uploads and dataset deletion behavior.
    def _dataset_database(self, database: tuple[str, str] | str | None) -> str | None:
        if database is None:
            return None
        table_loc = self._to_sqlglot_table(database)
        catalog, schema_name = self._to_catalog_db_tuple(table_loc)
        if catalog and catalog != "datasets":
            return "__not_datasets__"
        return schema_name or catalog

    def _find_dataset(
        self, table_name: str, database: tuple[str, str] | str | None
    ) -> dict[str, Any]:
        schema_name = self._dataset_database(database)
        matches = [
            ds
            for ds in self._iterate_datasets()
            if ds["table_name"] == table_name
            and schema_name != "__not_datasets__"
            and (schema_name is None or ds["schema_name"] == schema_name)
        ]
        if not matches:
            raise com.TableNotFound(table_name)
        if len(matches) > 1:
            raise com.IbisInputError(
                f"Multiple Hotdata datasets named {table_name!r}; pass database=('datasets', schema)."
            )
        return matches[0]
nit: (not blocking) The "__not_datasets__" magic string is fragile — if a Hotdata schema is ever literally named __not_datasets__, lookups would silently misbehave. More importantly, when the sentinel is set, _find_dataset still iterates every page of list_datasets() only to discard each row against schema_name != "__not_datasets__". Consider short-circuiting before the iteration (e.g., raise TableNotFound directly in _find_dataset when the catalog isn't "datasets") which avoids both the sentinel and the unnecessary network calls — see test_drop_table_raises_for_non_dataset_catalog where the dataset list endpoint is hit needlessly.
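A minimal sketch of the short-circuit suggested above, written as a standalone function so it runs without Ibis or Hotdata. The `TableNotFound` class and the `find_dataset` signature are stand-ins for illustration, not the PR's actual code; the catalog and schema are assumed to be resolved by the caller.

```python
from __future__ import annotations

from typing import Any, Iterable


class TableNotFound(Exception):
    """Stand-in for ibis.common.exceptions.TableNotFound (assumption)."""


def find_dataset(
    table_name: str,
    catalog: str | None,
    schema_name: str | None,
    datasets: Iterable[dict[str, Any]],
) -> dict[str, Any]:
    # Reject non-dataset catalogs before the listing is consumed, so no
    # sentinel string is needed and no list request is issued.
    if catalog is not None and catalog != "datasets":
        raise TableNotFound(table_name)

    matches = [
        ds
        for ds in datasets
        if ds["table_name"] == table_name
        and (schema_name is None or ds["schema_name"] == schema_name)
    ]
    if not matches:
        raise TableNotFound(table_name)
    if len(matches) > 1:
        raise ValueError(f"Multiple datasets named {table_name!r}")
    return matches[0]
```

Because `datasets` is only iterated after the catalog check, passing a lazy iterator (such as a paginated listing) means the non-dataset case costs nothing.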
    data, table_schema = self._local_table_to_parquet(obj, schema)
    upload = self.upload_file(data, content_type="application/parquet")
    dataset = self.create_dataset_from_upload(
        upload_id=upload["id"],
        label=name,
        table_name=name,
        file_format="parquet",
    )
    return ops.DatabaseTable(
        dataset["table_name"],
        schema=table_schema,
        source=self,
        namespace=ops.Namespace(catalog="datasets", database=dataset["schema_name"]),
    ).to_expr()
nit: (not blocking) Two concerns here:
- The returned expression carries `table_schema` derived from the local PyArrow schema, not the schema Hotdata actually stored. If Hotdata applies any type coercion during ingest (common for Parquet→catalog mapping), the returned `DatabaseTable` will lie about its column types, and subsequent operations on the expression may compile against the wrong types. Consider re-fetching via `self.table(name, database=("datasets", dataset["schema_name"]))` so the schema reflects the server's view.
- If `create_dataset_from_upload` raises after `upload_file` succeeds, the upload is orphaned with no cleanup. Worth a comment acknowledging this, or wrapping in try/except to best-effort delete the upload.
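For the second concern, a best-effort cleanup could look roughly like the sketch below. `upload_file` and `create_dataset_from_upload` mirror the calls in the hunk above; `delete_upload` is a hypothetical method assumed here for illustration, since the diff does not show whether the client exposes one.

```python
from __future__ import annotations

from typing import Any, Protocol


class UploadClient(Protocol):
    """Shape assumed for this sketch; delete_upload is hypothetical."""

    def upload_file(self, data: bytes, content_type: str) -> dict[str, Any]: ...

    def create_dataset_from_upload(self, **kwargs: Any) -> dict[str, Any]: ...

    def delete_upload(self, upload_id: str) -> None: ...


def create_dataset_with_cleanup(
    client: UploadClient, data: bytes, name: str
) -> dict[str, Any]:
    upload = client.upload_file(data, content_type="application/parquet")
    try:
        return client.create_dataset_from_upload(
            upload_id=upload["id"],
            label=name,
            table_name=name,
            file_format="parquet",
        )
    except Exception:
        # Best effort: try to remove the orphaned upload, but never let a
        # cleanup failure mask the original error.
        try:
            client.delete_upload(upload["id"])
        except Exception:
            pass
        raise
```

The bare `raise` re-raises the original dataset-creation error, so callers see the same exception they would without the cleanup.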
    def _local_table_to_parquet(self, obj: Any, schema: sch.Schema | None):
        import pandas as pd
        import pyarrow as pa
        import pyarrow.parquet as pq

        from ibis.formats.pyarrow import PyArrowSchema

        if obj is None:
            if schema is None:
                raise com.IbisInputError("create_table requires a pandas/pyarrow object or schema")
            arrow_schema = schema.to_pyarrow()
            table = pa.Table.from_arrays(
                [pa.array([], type=field.type) for field in arrow_schema],
                schema=arrow_schema,
            )
        elif isinstance(obj, pa.Table):
            table = obj
        elif isinstance(obj, pd.DataFrame):
            table = pa.Table.from_pandas(obj, preserve_index=False)
        else:
            raise com.IbisInputError(
                "create_table currently accepts pandas.DataFrame or pyarrow.Table"
            )
super nit: (not blocking) When both obj and schema are provided, schema is silently ignored. A user passing create_table("x", df, schema=...) would reasonably expect the schema to be applied (cast/validate). Either raise on the combo or cast obj to the requested schema before serializing.
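The "raise on the combo" option could be sketched as a small argument check. `check_create_table_args` is a hypothetical helper name, and `ValueError` stands in for `com.IbisInputError` so the snippet runs without Ibis installed.

```python
from __future__ import annotations

from typing import Any


def check_create_table_args(obj: Any | None, schema: Any | None) -> None:
    """Require exactly one of obj or schema (hypothetical helper)."""
    if obj is None and schema is None:
        raise ValueError("create_table requires a pandas/pyarrow object or schema")
    if obj is not None and schema is not None:
        # Refuse rather than silently ignore the schema argument.
        raise ValueError("create_table accepts obj or schema, not both")
```

Raising is the simpler of the two options; casting `obj` to the requested schema would need a decision on how lossy casts are handled.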
    def to_pyarrow_batches(
        self,
        expr: ir.Expr,
        /,
        *,
        params: Mapping[ir.Scalar, Any] | None = None,
        limit: int | str | None = None,
        chunk_size: int = 1_000_000,
        **kwargs: Any,
    ):
        import pyarrow as pa

        table = self.to_pyarrow(expr.as_table(), params=params, limit=limit, **kwargs)
        return pa.ipc.RecordBatchReader.from_batches(
            table.schema,
            table.to_batches(max_chunksize=chunk_size),
        )
super nit: (not blocking) to_pyarrow_batches fully materializes the result via to_pyarrow before slicing into batches, so it offers no memory advantage over to_pyarrow for large results — chunk_size only affects batch granularity, not peak memory. The README phrasing ("use the Arrow IPC result data exposed by Hotdata without converting through JSON rows") is accurate, but users familiar with other backends may assume this method streams. Worth a brief docstring noting the in-memory materialization.
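To make the distinction concrete, here is a pure-Python analogy (not PyArrow code, and not the PR's implementation) contrasting batching a fully materialized result with batching a stream. Both function names are illustrative.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def batches_materialized(rows: Iterable[int], chunk_size: int) -> Iterator[list[int]]:
    # Mirrors the PR's approach: pull the whole result first, then slice.
    table = list(rows)
    for start in range(0, len(table), chunk_size):
        yield table[start : start + chunk_size]


def batches_streaming(rows: Iterable[int], chunk_size: int) -> Iterator[list[int]]:
    # What a streaming backend does: hold at most one batch at a time.
    batch: list[int] = []
    for row in rows:
        batch.append(row)
        if len(batch) == chunk_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Both produce the same batches; the difference is that the first has consumed every row before yielding its first batch, which is the behavior a docstring on `to_pyarrow_batches` would want to call out.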
Enforce create_table argument exclusivity and avoid dataset lookups for non-dataset drop targets so the limited dataset-backed table support matches its documented contract.
Add the new architecture guardrail test and ignore .DS_Store files so local metadata artifacts do not pollute status output.
Update docs and CLI helper defaults to remove HOTDATA_TOKEN references and keep setup instructions consistent.
Align Ibis README and examples with the canonical workspace env var.
Summarize connection, catalog, execution, Arrow, SQL, upload, and dataset cleanup support near the top of the README.
Summary
- to_pyarrow()/to_pyarrow_batches() materialization.
- create_table() and drop_table() support for local pandas/PyArrow data and Hotdata-managed datasets.

Test plan