Feat: unity catalog/delta lake integration #145

Merged
beinan merged 7 commits into lance-format:main from jja725:feat/unity-catalog-integration
Mar 3, 2026

@jja725 commented Feb 27, 2026

Summary

  • Add Unity Catalog (OSS) integration for browsing catalog metadata and auto-registering tables into SqlEngine
  • Introduce Presto-inspired extensible connector architecture with clean separation of catalog metadata (CatalogProvider) and data format reading (TableReader)
  • Support Delta Lake and Parquet table formats out of the box
  • Support cloud storage (S3, Azure, GCS) via storage_options

Architecture

Inspired by Presto's connector SPI:

| Layer | Component | Purpose |
|---|---|---|
| SPI | `CatalogProvider` trait | Browse catalog metadata (like Presto's `ConnectorMetadata`) |
| SPI | `TableReader` trait | Read data in specific formats (like Presto's `ConnectorPageSourceProvider`) |
| Facade | `Connector` struct | Bundles catalog + readers + storage options (like Presto's `Connector`) |

Extensibility:

  • New catalog (e.g., AWS Glue) → implement CatalogProvider, reuses existing Delta/Parquet readers
  • New format (e.g., Iceberg) → implement TableReader, works with any catalog
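The separation described above can be modeled in a few lines. This is a minimal Python sketch, not the crate's Rust code: the class names mirror the PR's `CatalogProvider`, `TableReader`, and `Connector`, but every method name, the `TableMeta` record, the in-memory catalog, the storage paths, and the choice to skip tables with no matching reader are hypothetical. It only illustrates that a reader is selected by table format, independently of which catalog supplied the metadata.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class TableMeta:
    # Hypothetical metadata record; the real TableInfo fields are not shown in this PR.
    name: str
    data_format: str  # e.g. "DELTA" or "PARQUET"
    storage_location: str


class CatalogProvider(Protocol):
    """Metadata side: knows which tables exist, not how to read them."""

    def list_tables(self, catalog: str, schema: str) -> list[TableMeta]: ...


class TableReader(Protocol):
    """Data side: knows how to read one format, not where tables are listed."""

    def format_name(self) -> str: ...

    def register_table(
        self, registry: dict, table: TableMeta, storage_options: dict[str, str]
    ) -> None: ...


class InMemoryCatalog:
    """Stand-in catalog used only for this illustration."""

    def __init__(self, tables: list[TableMeta]) -> None:
        self._tables = tables

    def list_tables(self, catalog: str, schema: str) -> list[TableMeta]:
        return list(self._tables)


class RecordingReader:
    """Stand-in reader: records what it would open instead of reading data."""

    def __init__(self, fmt: str) -> None:
        self._fmt = fmt

    def format_name(self) -> str:
        return self._fmt

    def register_table(self, registry, table, storage_options) -> None:
        registry[table.name] = (self._fmt, table.storage_location, dict(storage_options))


@dataclass
class Connector:
    """Facade bundling one catalog, any number of readers, and storage options."""

    catalog: CatalogProvider
    readers: list[TableReader]
    storage_options: dict[str, str] = field(default_factory=dict)

    def register_all(self, catalog_name: str, schema: str):
        registry, skipped = {}, []
        by_format = {reader.format_name(): reader for reader in self.readers}
        for table in self.catalog.list_tables(catalog_name, schema):
            reader = by_format.get(table.data_format)
            if reader is None:
                skipped.append(table.name)  # assumption: unsupported formats are skipped
                continue
            reader.register_table(registry, table, self.storage_options)
        return registry, skipped


catalog = InMemoryCatalog([
    TableMeta("marksheet", "DELTA", "s3://bucket/marksheet"),
    TableMeta("numbers", "PARQUET", "s3://bucket/numbers"),
    TableMeta("events", "ICEBERG", "s3://bucket/events"),
])
connector = Connector(catalog, [RecordingReader("DELTA"), RecordingReader("PARQUET")])
registry, skipped = connector.register_all("unity", "default")
```

In this model, supporting Iceberg means appending one more reader to `readers`, and swapping the catalog means passing a different object with a `list_tables` method; neither change touches the other side.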

Python API

```python
from lance_graph import UnityCatalog

# Connect to Unity Catalog
uc = UnityCatalog("http://localhost:8080/api/2.1/unity-catalog")

# Browse
catalogs = uc.list_catalogs()
tables = uc.list_tables("unity", "default")
table = uc.get_table("unity", "default", "marksheet")
print(table.columns())

# Auto-register Delta + Parquet tables and query via SQL
engine = uc.create_sql_engine("unity", "default")
result = engine.execute("SELECT * FROM marksheet WHERE mark > 80")

# Cloud storage support (S3, Azure, GCS)
uc = UnityCatalog(
    "http://localhost:8080/api/2.1/unity-catalog",
    storage_options={
        "azure_storage_account_name": "myaccount",
        "azure_storage_account_key": "...",
    },
)
```
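Outside a demo, the account key would normally not be written inline. A minimal sketch, assuming credentials arrive through environment variables: the helper `storage_options_from_env` and the variable names `AZURE_STORAGE_ACCOUNT_NAME` / `AZURE_STORAGE_ACCOUNT_KEY` are hypothetical choices for illustration; only the two option keys come from the example above.

```python
import os

# Option key (as used in the example above) -> environment variable (assumed name).
_AZURE_ENV = {
    "azure_storage_account_name": "AZURE_STORAGE_ACCOUNT_NAME",
    "azure_storage_account_key": "AZURE_STORAGE_ACCOUNT_KEY",
}


def storage_options_from_env(env=None):
    """Build a storage_options dict, failing loudly if a variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [var for var in _AZURE_ENV.values() if not env.get(var)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in _AZURE_ENV.items()}


# Demonstration with an explicit mapping instead of the real environment.
options = storage_options_from_env({
    "AZURE_STORAGE_ACCOUNT_NAME": "myaccount",
    "AZURE_STORAGE_ACCOUNT_KEY": "not-a-real-key",
})
```

The resulting dict has the shape expected by the `storage_options=` argument shown above.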

New files

SPI layer (`lance-graph-catalog`):

  • `catalog_provider.rs` — `CatalogProvider` trait + data types
  • `table_reader.rs` — `TableReader` trait with `storage_options` for cloud access
  • `connector.rs` — `Connector` facade bundling catalog + readers + storage options
  • `type_mapping.rs` — UC type → Arrow type mapping
  • `unity_catalog.rs` — OSS UC REST client
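To make the role of `type_mapping.rs` concrete, here is a small Python sketch of a UC type → Arrow type lookup. The UC type names are assumed to follow the column type names of the OSS Unity Catalog REST API. The Arrow targets are plain descriptive strings, and the specific choices (microsecond timestamps, `decimal128`, raising on unknown types) are assumptions, not the crate's actual mappings.

```python
import re

# Assumed mapping for primitive types; the crate's real table is not shown in this PR.
_SIMPLE = {
    "BOOLEAN": "bool",
    "BYTE": "int8",
    "SHORT": "int16",
    "INT": "int32",
    "LONG": "int64",
    "FLOAT": "float32",
    "DOUBLE": "float64",
    "STRING": "utf8",
    "BINARY": "binary",
    "DATE": "date32",
    "TIMESTAMP": "timestamp[us, tz=UTC]",
    "TIMESTAMP_NTZ": "timestamp[us]",
}

# Parameterized type, e.g. "DECIMAL(10, 2)".
_DECIMAL = re.compile(r"^DECIMAL\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)$")


def uc_type_to_arrow(type_text: str) -> str:
    """Translate a UC column type string into an Arrow type description."""
    normalized = type_text.strip().upper()
    if normalized in _SIMPLE:
        return _SIMPLE[normalized]
    match = _DECIMAL.match(normalized)
    if match:
        precision, scale = int(match.group(1)), int(match.group(2))
        return f"decimal128({precision}, {scale})"
    # Assumption: unknown types are rejected rather than silently coerced.
    raise ValueError(f"unsupported Unity Catalog type: {type_text!r}")


mapped = [uc_type_to_arrow(t) for t in ("int", "STRING", "decimal(10, 2)")]
```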

Implementation layer (`lance-graph`):

  • `table_readers.rs` — `ParquetTableReader` + `DeltaTableReader` (via deltalake 0.29)
  • `sql_catalog.rs` — `build_context_from_connector()` bridge to SqlEngine

Python bindings (`lance-graph-python`):

  • `catalog.rs` — UnityCatalog, CatalogInfo, SchemaInfo, TableInfo PyO3 classes

Test plan

  • 12 unit tests for UC type → Arrow type mapping
  • 15 wiremock integration tests for UC REST client
  • 9 Python unit tests for UnityCatalog class
  • 6 Python integration tests (require live UC server, skipped in CI)
  • All existing tests pass unchanged (119 Python + 566 Rust)
  • README docs updated with UC examples and cloud storage usage
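The wiremock tests exercise the REST client against canned HTTP responses rather than a live server. The same idea can be sketched in Python with only the standard library. Everything here is illustrative: `list_catalog_names` is a hypothetical helper, not the PR's client, and the response shape (`{"catalogs": [{"name": ...}]}` from `GET <base>/catalogs`) is an assumption about the OSS Unity Catalog REST API.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

API_PREFIX = "/api/2.1/unity-catalog"


class StubUnityCatalog(BaseHTTPRequestHandler):
    """Serves one canned response, standing in for a real UC server."""

    def do_GET(self):
        if self.path == f"{API_PREFIX}/catalogs":
            body = json.dumps({"catalogs": [{"name": "unity"}]}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def list_catalog_names(base_url):
    """Hypothetical client call: fetch catalog names from <base_url>/catalogs."""
    # Bypass any configured proxy so the loopback request reaches the stub.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"{base_url}/catalogs", timeout=5) as resp:
        payload = json.load(resp)
    return [entry["name"] for entry in payload.get("catalogs", [])]


# Port 0 asks the OS for a free port, so the stub never collides with a real server.
server = HTTPServer(("127.0.0.1", 0), StubUnityCatalog)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    base_url = f"http://127.0.0.1:{server.server_port}{API_PREFIX}"
    names = list_catalog_names(base_url)
finally:
    server.shutdown()
    server.server_close()
```

Because the stub runs in-process on an ephemeral port, a test like this can run in CI, unlike the integration tests that need a live UC server.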

jja725 and others added 3 commits February 25, 2026 23:28
Add SqlQuery and SqlEngine that let users run standard SQL directly
against their datasets without requiring a GraphConfig. This is useful
for data analytics workflows where users want explicit JOINs and
aggregations against node/relationship tables. DataFusion handles SQL
parsing and execution.
…r architecture

Add support for browsing and querying tables from Unity Catalog (OSS).
Inspired by Presto's connector SPI, the design cleanly separates:

- CatalogProvider trait: catalog metadata browsing (UC first, extensible
  to Hive Metastore, AWS Glue, Iceberg REST Catalog)
- TableReader trait: format-specific data reading (Parquet + Delta Lake,
  extensible to CSV, Iceberg, ORC)
- Connector struct: facade bundling catalog + readers

Key features:
- Full UC REST API client (list/get catalogs, schemas, tables, columns)
- UC type → Arrow type mapping (20 type mappings)
- ParquetTableReader via DataFusion register_parquet()
- DeltaTableReader via deltalake 0.29 (behind "delta" feature flag)
- Auto-register UC tables into SqlEngine via create_sql_engine()
- Python bindings: UnityCatalog, CatalogInfo, SchemaInfo, TableInfo
- 15 wiremock integration tests for UC REST client
- 12 type mapping unit tests
- 9 Python unit tests

Python API:
  uc = UnityCatalog("http://localhost:8080/api/2.1/unity-catalog")
  engine = uc.create_sql_engine("unity", "default")
  result = engine.execute("SELECT * FROM my_table")

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codecov-commenter commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 67.56757% with 96 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/lance-graph/src/table_readers.rs | 0.00% | 51 Missing ⚠️ |
| crates/lance-graph/src/sql_query.rs | 86.84% | 30 Missing ⚠️ |
| crates/lance-graph/src/sql_catalog.rs | 0.00% | 15 Missing ⚠️ |
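The headline percentage and the per-file rows can be cross-checked with a little arithmetic. The sketch below uses only the figures in the report; the line totals it derives are inferred by rounding and are not reported by Codecov.

```python
# (patch coverage %, missing lines) per file, copied from the report.
report = {
    "crates/lance-graph/src/table_readers.rs": (0.00, 51),
    "crates/lance-graph/src/sql_query.rs": (86.84, 30),
    "crates/lance-graph/src/sql_catalog.rs": (0.00, 15),
}
patch_pct = 67.56757


def total_lines(pct, missing):
    # missing = total * (1 - pct/100)  =>  total = missing / (1 - pct/100)
    return round(missing / (1 - pct / 100))


total_missing = sum(missing for _, missing in report.values())
patch_lines = total_lines(patch_pct, total_missing)      # inferred size of the patch
covered_lines = patch_lines - total_missing              # inferred covered lines
listed_lines = sum(total_lines(pct, missing) for pct, missing in report.values())

print(total_missing, patch_lines, covered_lines, listed_lines)
```

The per-file missing counts sum to the 96 lines in the headline, and 200 covered lines out of an inferred 296 reproduces the 67.56757% figure. The three listed files account for roughly 294 of those lines, so nearly all measured patch lines sit in files that also carry the missing coverage.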

- cargo fmt fixes across all new files
- Replace EnumName::Variant with Self::Variant (clippy::unnecessary_structure_name_repetition)
- Fix Python import sorting and line length (ruff)
…re, GCS)

- Add `storage_options` parameter to `TableReader::register_table()` trait
- `Connector::with_storage_options()` stores credentials and passes them
  to table readers during registration
- `DeltaTableReader` uses `open_table_with_storage_options()` when
  storage options are provided
- Enable deltalake cloud features: s3, azure, gcs
- Python: `UnityCatalog(url, storage_options={...})` accepts cloud creds

Usage:
  uc = UnityCatalog(
      "http://localhost:8080/api/2.1/unity-catalog",
      storage_options={
          "azure_storage_account_name": "myaccount",
          "azure_storage_account_key": "...",
      }
  )
  engine = uc.create_sql_engine("unity", "default")
Add examples for UnityCatalog browsing, create_sql_engine, and
cloud storage options (S3, Azure, GCS) to both project and Python READMEs.

@beinan left a comment

Let's merge this PR first, given the urgent use cases.

@beinan beinan merged commit 82b98a2 into lance-format:main Mar 3, 2026
8 checks passed