feat(engines): add Databricks Connect engine by tomz · Pull Request #87 · microsoft/LakeBench

tomz · 2026-05-29T21:13:47Z

Part 5/5 of a stack: #1 (lint) → #2 (cloud engines) → #3 (cli) → #4 (tpcdi) → #5 (databricks). Best reviewed in order; each builds on the previous.

What

Adds a Databricks Connect execution engine with dynamic version alignment.

- Databricks Connect engine.
- Dynamic alignment of the databricks-connect client version to the target cluster's DBR version.
- _has_spark_context() helper in utils/timer.py to support Databricks timing.

Open question for maintainers ⚠️

The version-alignment path currently does an os.execvpe re-exec inside the engine constructor to relaunch the process with the matching databricks-connect. That works for a CLI invocation but is surprising/dangerous when LakeBench is used as a library (it would replace the host process). I'd like maintainer input on the preferred approach — e.g. fail with actionable instructions, or gate the re-exec behind an explicit allow_reexec flag / CLI-only context.
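
To make the two options concrete, here is a rough sketch of a gated variant. It assumes the default is to fail with instructions and that the re-exec happens only when the caller passes `allow_reexec=True`. Everything except `allow_reexec` and `os.execvpe` is hypothetical: the `ensure_connect_version` name, the `LAKEBENCH_DBC_REEXEC` guard variable, `VersionMismatchError`, and the `relaunch_argv` parameter (the command that starts a process with the matching client, left to the caller here).

```python
import os

REEXEC_GUARD = "LAKEBENCH_DBC_REEXEC"  # hypothetical env var used as a loop guard


class VersionMismatchError(RuntimeError):
    """Raised when the client cannot or must not be realigned automatically."""


def ensure_connect_version(installed, required, relaunch_argv, *,
                           allow_reexec=False, environ=None, execvpe=os.execvpe):
    """Fail with instructions by default; re-exec only when explicitly allowed."""
    environ = os.environ if environ is None else environ
    if installed == required:
        return  # versions already aligned, nothing to do
    hint = (
        f"databricks-connect {installed} does not match cluster DBR {required}; "
        f"install a matching client, e.g. pip install 'databricks-connect=={required}.*'"
    )
    if not allow_reexec:
        # Library-safe default: never replace the host process implicitly.
        raise VersionMismatchError(hint)
    if environ.get(REEXEC_GUARD) == "1":
        # We already relaunched once and still mismatch: stop instead of looping.
        raise VersionMismatchError("already re-executed once; " + hint)
    child_env = {**environ, REEXEC_GUARD: "1"}
    execvpe(relaunch_argv[0], relaunch_argv, child_env)  # does not return on success
```

With this shape the CLI entry point would pass `allow_reexec=True`, while library callers get an exception they can catch and act on.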

Tests

Suite green (179 passed).

Add a conservative ruff config (E/F/I/W), a pre-commit hook set, and a CI lint job (lint + format both enforcing). Reformat the existing tree with `ruff format` and replace ad-hoc print() diagnostics with module-level loggers across datagen and timing helpers. Fix obvious nits surfaced by ruff (import order, bare except -> except Exception, dead assignment). Drop Python 3.8 support and move pyarrow to base dependencies (the core results/timing modules import it unconditionally). Gitignore scratch/ for workspace-specific scratchpads. W291/W293 stay globally ignored because trailing whitespace inside multi-line SQL string literals is intentional and not touched by `ruff format`.

Add remote/cloud engines that talk to managed Spark via protocol:

- Livy: Fabric / Synapse / HDInsight via the Livy REST API, with session auto-recovery, per-query timeout, multi-part SHOW TABLES.
- SparkConnect: Spark Connect gRPC client.
- FabricSpark / SynapseSpark / HDISpark: workspace-tagged Spark subclasses.

Catalog plumbing shared by all engines:

- BaseEngine.list_databases() / list_tables() / get_table_columns() defaults, overridden for the Spark family, Livy and DuckDB.
- query_timeout_seconds attribute.
- transpile_and_qualify_query() rewritten with AST-based qualification that correctly handles 3-/4-part names (workspace.lakehouse.schema, Unity catalog.schema): builds quoted identifier chains via sqlglot, preserves the caller's catalog, and leaves CTE references untouched. Adds 9 multi-part tests (previously untested).

Column auto-remap is opt-in (auto_remap_columns=False) and warns loudly when active: silently rewriting columns to match non-spec data hurts benchmark reproducibility and can mask data-prep bugs.

Register Livy as a generic engine for TPC-H / TPC-DS / ClickBench.
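
The qualification rules described above (fill in only the missing leading parts, keep whatever the caller wrote, skip CTE names) can be illustrated with a small stdlib-only sketch. This is not the sqlglot AST rewrite from the commit: it works on a single table name as a string, splits naively on dots, and assumes backtick quoting. The `qualify_table` name and its parameters are hypothetical.

```python
def qualify_table(name, default_namespace, cte_names=frozenset()):
    """Return a backtick-quoted, fully qualified form of `name`.

    `default_namespace` is a dotted prefix such as "catalog.schema" or
    "workspace.lakehouse.schema". Naive: dots inside quoted identifiers
    are not supported.
    """
    parts = [p.strip("`") for p in name.split(".")]
    # A bare name that matches a CTE refers to the CTE, so leave it untouched.
    if len(parts) == 1 and parts[0].lower() in {c.lower() for c in cte_names}:
        return name
    defaults = default_namespace.split(".")
    # Keep every part the caller supplied; prepend only the missing leading ones.
    missing = len(defaults) + 1 - len(parts)
    if missing > 0:
        parts = defaults[:missing] + parts
    return ".".join(f"`{p}`" for p in parts)
```

For example, `qualify_table("orders", "workspace.lakehouse.schema")` yields a 4-part name, while `qualify_table("mycat.s.orders", "catalog.schema")` keeps the caller's catalog.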

…iles)

A purely additive command-line surface over the existing Python API. Library consumers are unaffected.

- cli/ package: argparse plumbing, override merge (-E/--conf), output formatting (human/table/json/csv/yaml), dry-run/print-config, verbosity and meaningful exit codes.
- config.py: profile loader (~/.lakebench.json + ./lakebench.json), env-var expansion, `extends:` composition, validation, lazy engine/benchmark registries. resolve_engine() handles *_env credential references by honoring the engine signature: engines that accept the env-var NAME (Databricks, Livy) get it untouched and resolve the secret themselves; engines that accept the bare key (or **kwargs) get the resolved value. This avoids silently dropping the credential. Covered by tests/test_config.py.
- results.py / reporting.py / discover.py as before.
- Expose console_script entry point and livy/spark_connect extras + Fabric/Synapse/HDInsight aliases.

Docs: cli-quickstart, cli-reference, architecture, development.
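
The signature-based credential rule can be sketched roughly as follows. This is an illustration under assumptions, not the code in config.py: `build_engine_kwargs`, the `environ` parameter, and the error messages are hypothetical, and the real resolve_engine() may differ in details.

```python
import inspect
import os


def build_engine_kwargs(engine_cls, config, environ=os.environ):
    """Map profile keys to constructor kwargs, honoring *_env references."""
    params = inspect.signature(engine_cls.__init__).parameters
    accepts_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    kwargs = {}
    for key, value in config.items():
        if not key.endswith("_env"):
            kwargs[key] = value
            continue
        bare = key[: -len("_env")]
        if key in params:
            # Engine takes the env-var NAME and resolves the secret itself.
            kwargs[key] = value
        elif bare in params or accepts_var_kw:
            # Engine takes the bare key: resolve the secret here.
            if value not in environ:
                raise KeyError(f"environment variable {value!r} is not set")
            kwargs[bare] = environ[value]
        else:
            # Refuse to drop a credential silently.
            raise TypeError(f"{engine_cls.__name__} accepts neither {key!r} nor {bare!r}")
    return kwargs
```

The final branch is the point of the design: a credential that the engine cannot receive becomes an error rather than a missing argument.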

A whole new benchmark following LakeBench's plug-in pattern: per-engine ETL classes (DuckDB, Spark, Sail, Polars, Daft) plus canonical/duckdb DDL and an audit-validation query.

- Extract a shared DDL-load fallback helper in benchmarks/base.py (also simplifies elt_bench).
- FinWire fixed-width parser + 5 unit tests; un-skip the CLI tpcdi mode test.

Full unit suite: 178 passing.
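
As background on the FinWire parser, fixed-width parsing boils down to slicing each line at cumulative column offsets. The sketch below is a generic illustration, not the parser in this PR. The layout lists only the leading fields of a CMP record, with widths as I recall them from the TPC-DI specification; treat them as assumptions to verify against the spec. `parse_fixed_width` and `CMP_LAYOUT_PREFIX` are hypothetical names.

```python
from itertools import accumulate

# Assumed leading fields of a FinWire CMP record: (name, width in characters).
CMP_LAYOUT_PREFIX = [
    ("PTS", 15),          # posting timestamp, YYYYMMDD-HHMMSS
    ("RecType", 3),       # record type, e.g. CMP / SEC / FIN
    ("CompanyName", 60),
    ("CIK", 10),
    ("Status", 4),
]


def parse_fixed_width(line, layout):
    """Slice `line` into named fields using cumulative widths from `layout`."""
    ends = list(accumulate(width for _, width in layout))
    starts = [0] + ends[:-1]
    return {
        name: line[start:end].strip()
        for (name, _), start, end in zip(layout, starts, ends)
    }
```

A dispatcher would typically read the record type from the same fixed offsets first and then pick the layout for that type.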

- engines/databricks.py: Databricks engine on top of the Spark family, authenticating via Databricks Connect (host + token-from-env). Includes optional dynamic databricks-connect version alignment with the target cluster's DBR (re-exec guarded against loops).
- Register Databricks in the engine package and add the `databricks` extra.
- README + install-databricks / install-fabric docs. Workspace-specific demo captures live in gitignored scratch/, not docs/.

Lands last because it builds on the cloud Spark engines and the CLI.
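
The version-alignment step needs a mapping from the cluster's runtime version to a client requirement. A minimal sketch, assuming the cluster reports a runtime string of the form `15.4.x-scala2.12` and that pinning the client to the same major.minor is the goal; `connect_requirement` is a hypothetical name and the engine's actual logic may differ.

```python
import re


def connect_requirement(spark_version: str) -> str:
    """Derive a pip requirement for databricks-connect from a DBR version string."""
    # Assumption: the string starts with "<major>.<minor>.", e.g. "15.4.x-scala2.12".
    match = re.match(r"(\d+)\.(\d+)\.", spark_version)
    if match is None:
        raise ValueError(f"unrecognized DBR version string: {spark_version!r}")
    major, minor = match.groups()
    return f"databricks-connect=={major}.{minor}.*"
```

Raising on an unrecognized string keeps the alignment path from guessing a client version for custom or unexpected runtimes.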

tomz · 2026-05-29T21:14:12Z

Part 5/5 of the stack in #82, following #86. Diff note: targets main (fork-only branches), so the diff currently includes the whole stack; it narrows to just the Databricks Connect work after the predecessors merge and I rebase. Merge order: #83 → #84 → #85 → #86 → #87. Note the open design question in the description re: the constructor re-exec.

tomz added 5 commits May 29, 2026 12:32

This was referenced May 29, 2026

feat(engines): add cloud Spark engines + multi-part catalog name support #84

Draft

feat(cli): add lakebench CLI (run/results/report/discover/doctor/profiles) #85

Draft

feat(tpcdi): add TPC-DI benchmark port with six engine implementations #86

Draft

tomz wants to merge 5 commits into microsoft:main from tomz:pr5-databricks
