feat(engines): add Databricks Connect engine #87

Draft
tomz wants to merge 5 commits into microsoft:main from tomz:pr5-databricks

tomz commented May 29, 2026

Part 5/5 of a stack: #1 (lint) → #2 (cloud engines) → #3 (cli) → #4 (tpcdi) → #5 (databricks). Best reviewed in order; each builds on the previous.

What

Adds a Databricks Connect execution engine with dynamic version alignment.

  • Databricks Connect engine.
  • Dynamic alignment of the databricks-connect client version to the target cluster's DBR version.
  • _has_spark_context() helper in utils/timer.py to support Databricks timing.
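For reviewers, the alignment rule can be sketched roughly as follows. This is not the code in engines/databricks.py. It assumes the cluster reports a runtime string shaped like `15.4.x-scala2.12` and that the client should match on major.minor; the function names `connect_requirement` and `needs_alignment` are made up for illustration.

```python
import re
from typing import Optional

# Leading "major.minor" of a version string, e.g. "15.4" in "15.4.x-scala2.12".
_MAJOR_MINOR = re.compile(r"(\d+)\.(\d+)")


def connect_requirement(spark_version: str) -> Optional[str]:
    """Map a cluster runtime string to a pip requirement for databricks-connect.

    Returns None when no major.minor prefix can be parsed.
    """
    match = _MAJOR_MINOR.match(spark_version.strip())
    if match is None:
        return None
    major, minor = match.groups()
    return f"databricks-connect=={major}.{minor}.*"


def needs_alignment(installed_version: str, spark_version: str) -> bool:
    """Return True when the client's major.minor differs from the cluster's."""
    cluster = _MAJOR_MINOR.match(spark_version.strip())
    client = _MAJOR_MINOR.match(installed_version.strip())
    if cluster is None or client is None:
        # Unknown format: do not claim a mismatch that cannot be proven.
        return False
    return cluster.groups() != client.groups()
```

With these assumptions, `connect_requirement("15.4.x-scala2.12")` yields a requirement pinned to the 15.4 series, and `needs_alignment("14.3.1", "15.4.x-scala2.12")` reports a mismatch.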

Open question for maintainers ⚠️

The version-alignment path currently does an os.execvpe re-exec inside the engine constructor to relaunch the process with the matching databricks-connect. That works for a CLI invocation but is surprising/dangerous when LakeBench is used as a library (it would replace the host process). I'd like maintainer input on the preferred approach — e.g. fail with actionable instructions, or gate the re-exec behind an explicit allow_reexec flag / CLI-only context.
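To make the second option concrete, here is a minimal sketch of a gated re-exec. Everything in it is an assumption for discussion: the `align_or_fail` name, the `VersionMismatchError` type, and the `LAKEBENCH_DBCONNECT_REEXEC` marker variable are hypothetical, and the step that actually provisions the matching client before relaunching is left out. It only shows the default-off gate and a one-shot loop guard.

```python
import os
import sys
from typing import Callable, Mapping, Optional, Sequence

# Hypothetical marker name; the PR does not say how its loop guard is implemented.
REEXEC_MARKER = "LAKEBENCH_DBCONNECT_REEXEC"


class VersionMismatchError(RuntimeError):
    """Raised instead of re-executing when the caller has not opted in."""


def align_or_fail(
    requirement: str,
    allow_reexec: bool = False,
    environ: Optional[Mapping[str, str]] = None,
    argv: Optional[Sequence[str]] = None,
    execvpe: Callable[..., None] = os.execvpe,
) -> None:
    """Handle a detected client/cluster mismatch without surprising library users."""
    env = dict(os.environ if environ is None else environ)
    args = list(sys.argv if argv is None else argv)

    # Loop guard: a process that was already relaunched once must not relaunch again.
    if env.get(REEXEC_MARKER) == "1":
        raise VersionMismatchError(
            f"still mismatched after one re-exec; install {requirement!r} manually"
        )

    # Default path for library use: fail with actionable instructions.
    if not allow_reexec:
        raise VersionMismatchError(
            f"databricks-connect does not match the cluster; "
            f"run: pip install '{requirement}' or pass allow_reexec=True"
        )

    # Opt-in path (for example, set only by the CLI): replace the current process.
    env[REEXEC_MARKER] = "1"
    execvpe(sys.executable, [sys.executable, *args], env)
```

Under this shape the CLI would pass `allow_reexec=True` explicitly, while a plain library call gets an exception it can catch. The `execvpe` parameter exists only so the behavior can be unit-tested without replacing the test process.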

Tests

Suite green (179 passed).

tomz added 5 commits May 29, 2026 12:32
Add a conservative ruff config (E/F/I/W), a pre-commit hook set, and a CI
lint job (lint + format both enforcing). Reformat the existing tree with
`ruff format` and replace ad-hoc print() diagnostics with module-level
loggers across datagen and timing helpers. Fix obvious nits surfaced by ruff
(import order, bare except -> except Exception, dead assignment).

Drop Python 3.8 support and move pyarrow to base dependencies (the core
results/timing modules import it unconditionally). Gitignore scratch/ for
workspace-specific scratchpads.

W291/W293 stay globally ignored because trailing whitespace inside multi-line
SQL string literals is intentional and not touched by `ruff format`.

Add remote/cloud engines that talk to managed Spark via protocol:

- Livy   — Fabric / Synapse / HDInsight via the Livy REST API, with
           session auto-recovery, per-query timeout, multi-part SHOW TABLES.
- SparkConnect — Spark Connect gRPC client.
- FabricSpark / SynapseSpark / HDISpark — workspace-tagged Spark subclasses.

Catalog plumbing shared by all engines:
- BaseEngine.list_databases() / list_tables() / get_table_columns() defaults,
  overridden for the Spark family, Livy and DuckDB.
- query_timeout_seconds attribute.
- transpile_and_qualify_query() rewritten with AST-based qualification that
  correctly handles 3-/4-part names (workspace.lakehouse.schema, Unity
  catalog.schema): builds quoted identifier chains via sqlglot, preserves the
  caller's catalog, and leaves CTE references untouched. Adds 9 multi-part
  tests (previously untested).

Column auto-remap is opt-in (auto_remap_columns=False) and warns loudly when
active — silently rewriting columns to match non-spec data hurts benchmark
reproducibility and can mask data-prep bugs.

Register Livy as a generic engine for TPC-H / TPC-DS / ClickBench.

…iles)

A purely additive command-line surface over the existing Python API.
Library consumers are unaffected.

- cli/ package: argparse plumbing, override merge (-E/--conf), output
  formatting (human/table/json/csv/yaml), dry-run/print-config, verbosity
  and meaningful exit codes.
- config.py: profile loader (~/.lakebench.json + ./lakebench.json), env-var
  expansion, `extends:` composition, validation, lazy engine/benchmark
  registries.

  resolve_engine() handles *_env credential references by honoring the engine
  signature: engines that accept the env-var NAME (Databricks, Livy) get it
  untouched and resolve the secret themselves; engines that accept the bare
  key (or **kwargs) get the resolved value. This avoids silently dropping the
  credential. Covered by tests/test_config.py.
- results.py / reporting.py / discover.py as before.
- Expose console_script entry point and livy/spark_connect extras + Fabric/
  Synapse/HDInsight aliases.
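The signature-based handling of `*_env` keys described for resolve_engine() could look roughly like the sketch below. It is an illustration under assumptions, not the code in config.py: the helper name `resolve_env_credentials` and its arguments are hypothetical, and it raises rather than dropping a credential the engine cannot accept.

```python
import inspect
import os
from typing import Any, Callable, Dict, Mapping, Optional


def resolve_env_credentials(
    factory: Callable[..., Any],
    config: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Turn `<key>_env` config entries into kwargs the engine actually accepts."""
    env = os.environ if environ is None else environ
    params = inspect.signature(factory).parameters
    accepts_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )

    resolved: Dict[str, Any] = {}
    for key, value in config.items():
        if not key.endswith("_env"):
            resolved[key] = value
            continue
        bare = key[: -len("_env")]
        if key in params:
            # The engine takes the env-var NAME and reads the secret itself.
            resolved[key] = value
        elif bare in params or accepts_kwargs:
            # The engine takes the secret value, so look it up now.
            if value not in env:
                raise KeyError(f"environment variable {value!r} is not set")
            resolved[bare] = env[value]
        else:
            # Fail instead of silently dropping the credential.
            raise TypeError(f"{factory!r} accepts neither {key!r} nor {bare!r}")
    return resolved
```

The important property is the final branch: a credential reference that matches nothing in the signature is an error, not a no-op.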

Docs: cli-quickstart, cli-reference, architecture, development.

A whole new benchmark following LakeBench's plug-in pattern: per-engine ETL
classes (DuckDB, Spark, Sail, Polars, Daft) plus canonical/duckdb DDL and an
audit-validation query.

- Extract a shared DDL-load fallback helper in benchmarks/base.py (also
  simplifies elt_bench).
- FinWire fixed-width parser + 5 unit tests; un-skip the CLI tpcdi mode test.

Full unit suite: 178 passing.

- engines/databricks.py — Databricks engine on top of the Spark family,
  authenticating via Databricks Connect (host + token-from-env). Includes
  optional dynamic databricks-connect version alignment with the target
  cluster's DBR (re-exec guarded against loops).
- Register Databricks in the engine package and add the `databricks` extra.
- README + install-databricks / install-fabric docs.

Workspace-specific demo captures live in gitignored scratch/, not docs/.

Lands last because it builds on the cloud Spark engines and the CLI.

tomz commented May 29, 2026

Part 5/5 of the stack in #82, following #86. Diff note: this PR targets main (the predecessor branches exist only on the fork), so the diff currently includes the whole stack; it will narrow to just the Databricks Connect work once the predecessors merge and I rebase. Merge order: #83 → #84 → #85 → #86 → #87. Note the open design question in the description about the constructor re-exec.
