This repository provides a container-based inner-loop development environment for Databricks. The container includes the following Python environments:
- local pyspark
- pyspark running on a remote cluster through Databricks Connect
  - Databricks Connect allows you to execute your local Spark code remotely on a Databricks cluster instead of in the local Spark session.
  - It enables you to interact with the data sitting in the Cloud while staying and coding in your local VSCode.
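
The payoff is that the same PySpark code can run in either environment; the active python env decides whether it executes locally or on the remote cluster. A minimal sketch (the env names `local_spark_env` and `db_connect_env` are the ones this repo sets up):

```python
from pyspark.sql import SparkSession

# In local_spark_env this starts a local Spark session;
# in db_connect_env the databricks-connect build of pyspark
# routes the same calls to the remote Databricks cluster.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
print(df.count())  # executes locally or remotely depending on the active env
```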
Databricks offers Databricks Notebooks for implementing features, but in notebooks you can't leverage python tooling such as:
- python-related auto-completion
- sophisticated debugger
- python definition peeking
- auto-formatter
- auto-linter
- auto static analyzer
- unit tests
In contrast, this repo fully leverages python tooling and VSCode extensions to accelerate your inner-loop dev cycle. It centers on VSCode as the place where most of the implementation happens, and you can switch python environments with one click in VSCode.
The suggested workflow with this repo is the following:
1. Implement features with local pyspark
2. Write unit tests with sample data locally (see the sketch after this list)
3. (Optional) When you want to interact with data sitting in the Cloud, switch to the Databricks Connect env
4. Package all the code into a library
5. Upload that library to a Databricks cluster
6. Install that library on the cluster and test it with the real data
7. (After multiple iterations of 1-6) Create a pull request
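
For the unit-testing step, here is a minimal sketch using pytest and a local Spark session (the transformation `add_double` and the file path are hypothetical, not part of this repo):

```python
# tests/test_transform.py (hypothetical example)
import pytest
from pyspark.sql import SparkSession, functions as F


@pytest.fixture(scope="session")
def spark():
    # A small local Spark session is enough for unit tests with sample data
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()


def add_double(df):
    # Hypothetical transformation under test
    return df.withColumn("doubled", F.col("value") * 2)


def test_add_double(spark):
    df = spark.createDataFrame([(1,), (2,)], ["value"])
    result = add_double(df).orderBy("value").collect()
    assert [row.doubled for row in result] == [2, 4]
```

Since pytest is installed in both environments (see the tooling list further below), the same tests run unchanged whichever env is active.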
This repo uses VSCode as the development tool together with the Remote - Containers extension, which lets you use a Docker container as a full-featured development environment.
Install the following tools:
- VSCode
- Docker Desktop for Windows or Mac
- VSCode Remote - Containers extension
- Check the System Requirements and Installation sections carefully. For example, if you are on Windows and not using the WSL 2 backend for Docker, you need to enable File Sharing.
Take the following steps to get started.
- Open this repository with VSCode.
- Copy `.env.example` and rename it to `.env`.
- Edit the `.env` file. The dev container loads the defined variables into environment variables and uses them for Databricks Connect.
  - `DATABRICKS_ADDRESS`: Databricks workspace URL
  - `DATABRICKS_API_TOKEN`: personal access token (PAT) for the Databricks workspace
  - `DATABRICKS_CLUSTER_ID`: cluster ID of the Databricks cluster
  - `DATABRICKS_ORG_ID`: org ID. See `?o=orgId` in the URL
  - `DATABRICKS_PORT`: use 15001
  - For more information about how to set up variables for Databricks Connect, see Step 2: Configure connection properties
- Edit `requirements_db_connect.txt` and match the `databricks-connect` version with your cluster version. See Step 1: Install the client for details.
- Edit `requirements_local.txt` and match the `pyspark` version with your cluster's pyspark version.
- Open the VSCode command palette (`ctrl+shift+p`) and select `Remote-Containers: Reopen in Container`. It may take a while the first time as it builds the dev container.
- Activate the `db_connect_env` python environment with `source /.envs/db_connect_env/bin/activate`.
- Run `databricks-connect test` to check that your Databricks Connect setup, based on the environment variables, works.
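
If `databricks-connect test` passes, a quick additional sanity check from python could look like the sketch below (run it inside `db_connect_env`; it relies on the `DATABRICKS_*` variables loaded from `.env`):

```python
from pyspark.sql import SparkSession

# With the databricks-connect build of pyspark active, this session
# is backed by the remote cluster identified by DATABRICKS_CLUSTER_ID.
spark = SparkSession.builder.getOrCreate()

print(spark.range(10).count())  # should print 10 if the connection works
```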
The default python environment is `local_spark_env`. If you want to switch to the Databricks Connect env, open the VSCode command palette (`ctrl+shift+p`), select `Python: Select Interpreter`, and choose `db_connect_env`.
- Open your python file
- Open the VSCode command palette (`ctrl+shift+p`) and select `Python: Run Python File in Terminal`

To test this functionality, you can open `src/main.py`. When you run it, be aware of which python env you have selected in VSCode.
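
If you are unsure which env a file will run in, one quick check is to print the interpreter path. The sketch below assumes the repo's `/.envs/...` layout (the `local_spark_env` path is inferred by analogy with `db_connect_env`):

```python
import sys
import pyspark

# e.g. /.envs/local_spark_env/bin/python or /.envs/db_connect_env/bin/python
print(sys.executable)
# The installed pyspark build also differs between the two envs
print(pyspark.__version__)
```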
If you want to change the tool settings, see `requirements_dev.txt` and `.devcontainer/devcontainer.json`. `requirements_dev.txt` states which python libraries are installed into both `local_spark_env` and `db_connect_env`. `.devcontainer/devcontainer.json` states which VSCode extensions are installed and which python tooling is enabled in VSCode.
The following libraries are enabled by default:
- yapf
- bandit
- mypy
- flake8
- pytest
Databricks' `display` function works neither on local Spark nor on Databricks Connect. When you want to visualize data, use standard python visualization libraries or use Databricks Notebooks.
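
As an alternative to `display`, you can collect a small result to pandas and plot it locally. A minimal sketch, assuming matplotlib and pandas are installed in the active environment (they may not be part of the default requirements):

```python
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10), (2, 20), (3, 15)], ["x", "y"])

# Collect a small result set to the driver and plot it with matplotlib
pdf = df.toPandas()
pdf.plot(x="x", y="y", kind="line")
plt.show()
```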