feat(engines): add Databricks Connect engine #87
Draft
tomz wants to merge 5 commits into
Add a conservative ruff config (E/F/I/W), a pre-commit hook set, and a CI lint job (lint + format both enforcing).
Reformat the existing tree with `ruff format` and replace ad-hoc print() diagnostics with module-level loggers across datagen and timing helpers.
Fix obvious nits surfaced by ruff (import order, bare except -> except Exception, dead assignment).
Drop Python 3.8 support and move pyarrow to base dependencies (the core results/timing modules import it unconditionally).
Gitignore scratch/ for workspace-specific scratchpads.
W291/W293 stay globally ignored because trailing whitespace inside multi-line SQL string literals is intentional and not touched by `ruff format`.
Add remote/cloud engines that talk to managed Spark via protocol:
- Livy — Fabric / Synapse / HDInsight via the Livy REST API, with
session auto-recovery, per-query timeout, multi-part SHOW TABLES.
- SparkConnect — Spark Connect gRPC client.
- FabricSpark / SynapseSpark / HDISpark — workspace-tagged Spark subclasses.
Catalog plumbing shared by all engines:
- BaseEngine.list_databases() / list_tables() / get_table_columns() defaults,
overridden for the Spark family, Livy and DuckDB.
- query_timeout_seconds attribute.
- transpile_and_qualify_query() rewritten with AST-based qualification that
correctly handles 3-/4-part names (workspace.lakehouse.schema, Unity
catalog.schema): builds quoted identifier chains via sqlglot, preserves the
caller's catalog, and leaves CTE references untouched. Adds 9 multi-part
tests (previously untested).
Column auto-remap is opt-in (auto_remap_columns=False) and warns loudly when
active — silently rewriting columns to match non-spec data hurts benchmark
reproducibility and can mask data-prep bugs.
Register Livy as a generic engine for TPC-H / TPC-DS / ClickBench.
…iles) A purely additive command-line surface over the existing Python API. Library consumers are unaffected.
- cli/ package: argparse plumbing, override merge (-E/--conf), output formatting (human/table/json/csv/yaml), dry-run/print-config, verbosity and meaningful exit codes.
- config.py: profile loader (~/.lakebench.json + ./lakebench.json), env-var expansion, `extends:` composition, validation, lazy engine/benchmark registries.
  resolve_engine() handles *_env credential references by honoring the engine signature: engines that accept the env-var NAME (Databricks, Livy) get it untouched and resolve the secret themselves; engines that accept the bare key (or **kwargs) get the resolved value. This avoids silently dropping the credential. Covered by tests/test_config.py.
- results.py / reporting.py / discover.py as before.
- Expose console_script entry point and livy/spark_connect extras + Fabric/Synapse/HDInsight aliases.
Docs: cli-quickstart, cli-reference, architecture, development.
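The signature-driven handling of `*_env` credentials described above can be sketched roughly as follows. This is a minimal illustration, not the code in config.py: the function name `resolve_env_credentials`, the two demo engine classes, and the error behavior are all assumptions made for the example.

```python
import inspect
import os


def resolve_env_credentials(engine_cls, options, environ=None):
    """Map `<name>_env` options onto whatever the engine constructor accepts.

    Hypothetical helper: the real resolve_engine() may differ in detail.
    """
    environ = os.environ if environ is None else environ
    params = inspect.signature(engine_cls.__init__).parameters
    accepts_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )

    resolved = {}
    for key, value in options.items():
        if not key.endswith("_env"):
            resolved[key] = value  # ordinary option, passed through unchanged
            continue
        bare = key[: -len("_env")]
        if key in params:
            # The engine wants the env-var NAME and reads the secret itself.
            resolved[key] = value
        elif bare in params or accepts_kwargs:
            # The engine wants the secret VALUE under the bare key.
            if value not in environ:
                raise KeyError(f"environment variable {value!r} is not set")
            resolved[bare] = environ[value]
        else:
            # Refuse to drop the credential silently.
            raise TypeError(
                f"{engine_cls.__name__} accepts neither {key!r} nor {bare!r}"
            )
    return resolved


class NameEngine:
    """Demo engine (assumption) that accepts the env-var name."""

    def __init__(self, host, token_env):
        self.host, self.token_env = host, token_env


class ValueEngine:
    """Demo engine (assumption) that accepts the resolved secret."""

    def __init__(self, host, token):
        self.host, self.token = host, token
```

With a profile such as `{"host": "h", "token_env": "MY_TOKEN"}`, the first engine would receive `token_env="MY_TOKEN"` unchanged, while the second would receive `token=` set to the value of `MY_TOKEN`.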
A whole new benchmark following LakeBench's plug-in pattern: per-engine ETL classes (DuckDB, Spark, Sail, Polars, Daft) plus canonical/duckdb DDL and an audit-validation query.
- Extract a shared DDL-load fallback helper in benchmarks/base.py (also simplifies elt_bench).
- FinWire fixed-width parser + 5 unit tests; un-skip the CLI tpcdi mode test.
Full unit suite: 178 passing.
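For readers unfamiliar with fixed-width parsing, the general technique behind a FinWire-style parser might look like the sketch below. The layout, field names, and widths here are illustrative assumptions only; they are not taken from this PR and have not been checked against the TPC-DI specification.

```python
from typing import NamedTuple


class Field(NamedTuple):
    name: str
    width: int


# Hypothetical layout for illustration; real FinWire records have more
# fields and their own widths.
DEMO_LAYOUT = [
    Field("posting_ts", 15),
    Field("rec_type", 3),
    Field("company_name", 20),
    Field("status", 4),
]


def parse_fixed_width(line, layout):
    """Slice one line into named fields by consecutive column widths."""
    record = {}
    offset = 0
    for field in layout:
        raw = line[offset : offset + field.width]
        # Padding is stripped; blank or missing columns become None.
        record[field.name] = raw.strip() or None
        offset += field.width
    return record


sample = "20240102-030405" + "CMP" + "Acme Corp".ljust(20) + "ACTV"
row = parse_fixed_width(sample, DEMO_LAYOUT)
```

Keeping the layout as data rather than hard-coded slices makes it straightforward to define one layout per record type and to unit-test each in isolation.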
- engines/databricks.py: Databricks engine on top of the Spark family, authenticating via Databricks Connect (host + token-from-env). Includes optional dynamic databricks-connect version alignment with the target cluster's DBR (re-exec guarded against loops).
- Register Databricks in the engine package and add the `databricks` extra.
- README + install-databricks / install-fabric docs. Workspace-specific demo captures live in gitignored scratch/, not docs/.
Lands last because it builds on the cloud Spark engines and the CLI.
Part 5/5 of the stack in #82, following #86. Diff note: targets |
What
Adds a Databricks Connect execution engine with dynamic version alignment.
- Optionally aligns the `databricks-connect` client version to the target cluster's DBR version.
- Adds a `_has_spark_context()` helper in `utils/timer.py` to support Databricks timing.
Open question for maintainers ⚠️
The version-alignment path currently does an `os.execvpe` re-exec inside the engine constructor to relaunch the process with the matching `databricks-connect`. That works for a CLI invocation but is surprising and dangerous when LakeBench is used as a library, because it would replace the host process. I'd like maintainer input on the preferred approach, e.g. fail with actionable instructions, or gate the re-exec behind an explicit `allow_reexec` flag / CLI-only context.
Tests
Suite green (179 passed).
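To make the open question about the re-exec concrete, one possible shape for the "fail by default, re-exec only when explicitly allowed" option is sketched below. It is a sketch under stated assumptions, not the code in this PR: the function `align_or_fail`, the guard variable `LAKEBENCH_DBCONNECT_REEXEC`, and the exception class are hypothetical names, and the step that actually installs the matching client before relaunching is deliberately left out.

```python
import os
import sys

# Hypothetical guard variable used to detect a re-exec loop.
REEXEC_GUARD = "LAKEBENCH_DBCONNECT_REEXEC"


class VersionMismatchError(RuntimeError):
    """Raised when the client and cluster versions disagree."""


def align_or_fail(installed, required, allow_reexec=False,
                  environ=None, execvpe=os.execvpe):
    """Return if versions match; otherwise fail or re-exec if allowed.

    `execvpe` is injectable so the behavior can be tested without
    replacing the current process.
    """
    environ = dict(os.environ if environ is None else environ)
    if installed == required:
        return
    if not allow_reexec:
        # Library-safe default: never replace the host process.
        raise VersionMismatchError(
            f"databricks-connect {installed} does not match the cluster "
            f"requirement {required}; install the matching version or "
            f"pass allow_reexec=True"
        )
    if environ.get(REEXEC_GUARD) == "1":
        # We already relaunched once and still mismatch: stop looping.
        raise VersionMismatchError(
            "re-exec already attempted once; refusing to loop"
        )
    environ[REEXEC_GUARD] = "1"
    # Replaces the current process; only reached with explicit opt-in.
    execvpe(sys.executable, [sys.executable, *sys.argv], environ)
```

Under this design the CLI entry point could pass `allow_reexec=True`, while library callers would get an exception with instructions unless they opt in themselves.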