# Databricks Asset Bundles

**tl;dr** Databricks Asset Bundles automate the deployment of Databricks assets (e.g., experiments, jobs, models, and pipelines).

Learn more in the official documentation:

1. [What are Databricks Asset Bundles?](https://docs.databricks.com/en/dev-tools/bundles/index.html)
1. [Develop on Databricks](https://docs.databricks.com/en/languages/index.html)


[Databricks Asset Bundles](https://www.databricks.com/resources/demos/tours/data-engineering/databricks-asset-bundles):

> Databricks Asset Bundles (DAB) is a new capability on Databricks that **standardizes and unifies the deployment strategy** for all data products developed on the platform.
> It allows developers to describe the infrastructure and resources of their project through a **YAML configuration file**.

The main takeaways from the introduction above are as follows:

1. DAB is all about standardizing deployment of Databricks projects
1. DAB is an [Infrastructure as code (IaC)](https://en.wikipedia.org/wiki/Infrastructure_as_code) tool
1. DAB uses a YAML configuration file to declaratively describe what to deploy, and when and how to run it

[Databricks Asset Bundles went GA](https://www.databricks.com/blog/announcing-general-availability-databricks-asset-bundles) around April 23, 2024 🥳


With DABs, you can easily bundle deployable resources (jobs, pipelines, notebooks, code) so you can version, test, deploy, and collaborate on your project as a unit.

DABs help you adopt software engineering best practices for your projects on the Databricks Platform. DABs facilitate source control, code review, testing, and continuous integration and delivery (CI/CD) for all your data assets as code.


See the [slides](https://docs.google.com/presentation/d/1bnnTR19j_nZhB0bDCMoGga-8Sq6eBjhBAom-6NJ6F0I/edit) of the talk on Databricks Asset Bundles at Data & AI Summit 2023.


[Databricks Asset Bundle deployment modes](https://docs.databricks.com/en/dev-tools/bundles/deployment-modes.html):

> Bundles enable programmatic management of Databricks Workflows

Databricks Asset Bundles make it possible to express complete data, analytics, and ML projects as a collection of source files called a bundle.

## Automate Databricks Deployments

DAB is not alone in the IaC/deployment 'market' for Databricks.

Developers (and DevOps engineers) have been using the following tools for quite some time:

1. [Databricks REST API](https://docs.databricks.com/api/)
1. [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/index.html)
1. [Databricks Terraform provider](https://docs.databricks.com/en/dev-tools/terraform/index.html)

## Fun Fact: DAB == terraform

Under the hood, `databricks bundle deploy` runs Terraform. Note `terraform apply` in its output:

```
Starting resource deployment
Error: terraform apply: exit status 1

Error: cannot create job: Invalid quartz_cron_expression: '44 37 8 * * ?'. Databricks uses Quartz cron syntax, which is different from the standard cron syntax. See https://docs.databricks.com/jobs.html#schedule-a-job  for more details.

  with databricks_job.jacek_demo_meetup_job,
  on bundle.tf.json line 82, in resource.databricks_job.jacek_demo_meetup_job:
  82:       }
```

```console
$ databricks bundle validate
{
  "bundle": {
    "name": "delta_live_tables_demo",
    "target": "dev",
    "environment": "dev",
    "terraform": { ⬅️
      "exec_path": "/Users/jacek/dev/oss/learn-databricks/Databricks Asset Bundles/delta_live_tables_demo/.databricks/bundle/dev/bin/terraform"
    },
    ...
```


[workspace](https://docs.databricks.com/en/dev-tools/bundles/settings.html#workspace) (highlighting mine):

> The `state_path` mapping defaults to the default path of `${workspace.root}/state` and represents the path within your workspace to store **Terraform** state information about deployments.

> **Note:**
>
> _"to store Terraform state information about deployments"_


## Typical Development Flow

Typical development flow using Databricks Asset Bundles (`databricks bundle` commands):

* `init`
* `deploy`
* `run`

## 🚀 Demo: On Fast Track to Deploy

[Develop a job on Databricks by using Databricks asset bundles](https://docs.databricks.com/en/workflows/jobs/how-to/use-bundles-with-jobs.html)


```console
$ databricks --version
Databricks CLI v0.221.1
```


```shell
$ databricks bundle
Databricks Asset Bundles let you express data/AI/analytics projects as code.

Online documentation: https://docs.databricks.com/en/dev-tools/bundles/index.html

Usage:
  databricks bundle [command]

Available Commands:
  deploy      Deploy bundle
  deployment  Deployment related commands
  destroy     Destroy deployed bundle resources
  generate    Generate bundle configuration
  init        Initialize using a bundle template
  run         Run a job or pipeline update
  schema      Generate JSON Schema for bundle configuration
  sync        Synchronize bundle tree to the workspace
  validate    Validate configuration

Flags:
  -h, --help          help for bundle
      --var strings   set values for variables defined in bundle config. Example: --var="foo=bar"

Global Flags:
      --debug            enable debug logging
  -o, --output type      output type: text or json (default text)
  -p, --profile string   ~/.databrickscfg profile
  -t, --target string    bundle target to use (if applicable)

Use "databricks bundle [command] --help" for more information about a command.
```


### init

<br>

```shell
$ databricks bundle init
Search: █
? Template to use:
  default-python (The default Python template for Notebooks / Delta Live Tables / Workflows)
  default-sql
  dbt-sql
  mlops-stacks
  custom...
```

> ⚠️ **Note:**
>
> Two new templates added: `default-sql` and `dbt-sql`.

Select `default-python`.

```shell
Welcome to the default Python template for Databricks Asset Bundles!
Please provide the following details to tailor the template to your preferences.

Unique name for this project [my_project]:
```

...and accept the defaults, except for the stub (sample) Python package (answer `no`).

```shell
Include a stub (sample) notebook in 'my_project/src': yes
Include a stub (sample) Delta Live Tables pipeline in 'my_project/src': yes
Include a stub (sample) Python package in 'my_project/src': no
Workspace to use (auto-detected, edit in 'job_id_change/databricks.yml'): https://XXX

✨ Your new project has been created in the 'my_project' directory!

Please refer to the README.md file for "getting started" instructions.
See also the documentation at https://docs.databricks.com/dev-tools/bundles/index.html.
```

> ⚠️ **Note:**
>
> Project name must consist of letters, numbers, and underscores

> ⚠️ **Note:**
>
> Workspace to use (auto-detected, edit in 'job_id_change/databricks.yml')

```shell
$ cd my_project
```


### deploy

<br>

```shell
$ databricks bundle deploy
Uploading bundle files to /Users/jacek@japila.pl/.bundle/my_project/dev/files...
Deploying resources...
Updating deployment state...
Deployment complete!
```

> ⚠️ **Note:**
>
> Uploading bundle files to /Users/jacek@japila.pl/.bundle/my_project/dev/files...


### run

<br>

```console
$ databricks bundle run --help
Run the job or pipeline identified by KEY.

The KEY is the unique identifier of the resource to run. In addition to
customizing the run using any of the available flags, you can also specify
keyword or positional arguments as shown in these examples:

   databricks bundle run my_job -- --key1 value1 --key2 value2

Or:

   databricks bundle run my_job -- value1 value2 value3

If the specified job uses job parameters or the job has a notebook task with
parameters, the first example applies and flag names are mapped to the
parameter names.

If the specified job does not use job parameters and the job has a Python file
task or a Python wheel task, the second example applies.

Usage:
  databricks bundle run [flags] KEY

Job Flags:
      --params stringToString   comma separated k=v pairs for job parameters (default [])

Job Task Flags:
  Note: please prefer use of job-level parameters (--param) over task-level parameters.
  For more information, see https://docs.databricks.com/en/workflows/jobs/create-run-jobs.html#pass-parameters-to-a-databricks-job-task
      --dbt-commands strings                 A list of commands to execute for jobs with DBT tasks.
      --jar-params strings                   A list of parameters for jobs with Spark JAR tasks.
      --notebook-params stringToString       A map from keys to values for jobs with notebook tasks. (default [])
      --pipeline-params stringToString       A map from keys to values for jobs with pipeline tasks. (default [])
      --python-named-params stringToString   A map from keys to values for jobs with Python wheel tasks. (default [])
      --python-params strings                A list of parameters for jobs with Python tasks.
      --spark-submit-params strings          A list of parameters for jobs with Spark submit tasks.
      --sql-params stringToString            A map from keys to values for jobs with SQL tasks. (default [])

Pipeline Flags:
      --full-refresh strings   List of tables to reset and recompute.
      --full-refresh-all       Perform a full graph reset and recompute.
      --refresh strings        List of tables to update.
      --refresh-all            Perform a full graph update.
      --validate-only          Perform an update to validate graph correctness.

Flags:
  -h, --help      help for run
      --no-wait   Don't wait for the run to complete.
      --restart   Restart the run if it is already running.

Global Flags:
      --debug            enable debug logging
  -o, --output type      output type: text or json (default text)
  -p, --profile string   ~/.databrickscfg profile
  -t, --target string    bundle target to use (if applicable)
      --var strings      set values for variables defined in bundle config. Example: --var="foo=bar"
```

<br>

```shell
$ databricks bundle run
Update URL: https://training-partners.cloud.databricks.com/#joblist/pipelines/84f3895d-a910-4d9a-b8ec-ac275d4985bd/updates/ce459b1d-5323-46de-b0a4-86b459c13301

2023-10-21T12:58:16.972Z update_progress INFO "Update ce459b is WAITING_FOR_RESOURCES."
2023-10-21T13:01:50.065Z update_progress INFO "Update ce459b is INITIALIZING."
2023-10-21T13:02:36.634Z update_progress INFO "Update ce459b is SETTING_UP_TABLES."
2023-10-21T13:03:01.865Z update_progress INFO "Update ce459b is RUNNING."
2023-10-21T13:03:01.871Z flow_progress   INFO "Flow 'filtered_taxis' is QUEUED."
2023-10-21T13:03:01.893Z flow_progress   INFO "Flow 'filtered_taxis' is PLANNING."
2023-10-21T13:03:02.673Z flow_progress   INFO "Flow 'filtered_taxis' is STARTING."
2023-10-21T13:03:02.712Z flow_progress   INFO "Flow 'filtered_taxis' is RUNNING."
2023-10-21T13:03:42.162Z flow_progress   INFO "Flow 'filtered_taxis' has COMPLETED."
2023-10-21T13:03:43.702Z update_progress INFO "Update ce459b is COMPLETED."
```

## Validate configuration


From [Databricks Asset Bundle configurations](https://docs.databricks.com/en/dev-tools/bundles/settings.html):

1. A bundle configuration file must be expressed in YAML format
1. A bundle configuration file must contain at minimum the top-level [bundle](https://docs.databricks.com/en/dev-tools/bundles/settings.html#bundle-syntax-mappings-bundle) mapping
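
The two requirements above can be illustrated with what is probably the smallest possible `databricks.yml`. This is a sketch only: the bundle name `my_project` is a hypothetical placeholder, and it assumes the workspace to use is resolved from your default authentication profile.

```yaml
# databricks.yml
# Minimal sketch: only the required top-level `bundle` mapping.
# The name is a hypothetical placeholder.
bundle:
  name: my_project
```

Running `databricks bundle validate` in the directory that contains this file should find the bundle root instead of reporting that `databricks.yml` was not found.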


```console
$ databricks bundle validate
Error: unable to locate bundle root: databricks.yml not found
```

```console
$ databricks bundle validate --help
Validate configuration

Usage:
  databricks bundle validate [flags]

Flags:
  -h, --help   help for validate

Global Flags:
      --log-file file            file to write logs to (default stderr)
      --log-format type          log output format (text or json) (default text)
      --log-level format         log level (default disabled)
  -o, --output type              output type: text or json (default text)
  -p, --profile string           ~/.databrickscfg profile
      --progress-format format   format for progress logs (append, inplace, json) (default default)
  -t, --target string            bundle target to use (if applicable)
      --var strings              set values for variables defined in bundle config. Example: --var="foo=bar"
```

## Variables

[Custom variables](https://docs.databricks.com/en/dev-tools/bundles/settings.html#custom-variables):

* Use custom variables to make your bundle configuration files more modular and reusable
* Variables work only with string-based values.
* E.g., the ID of an existing cluster for various workflow runs within multiple targets


Custom variables are declared in the `variables` mapping in a bundle configuration file:

```yaml
variables:
  <variable-name>:
    description: <optional-description>
    default: <optional-default-value>
```


* You should provide the same values during both the deployment and run stages
* For variables, use substitutions in the format `${var.<variable_name>}`
* Use Databricks CLI's `--var` option to define the value of a variable


```shell
databricks bundle deploy --var "quartz_cron_expression=1"
```
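
For context, here is a hedged sketch of the configuration side of that command: a variable declared with a default and then referenced through a `${var.<variable_name>}` substitution. The job key `demo_job` and the default cron value are assumptions made for illustration; they do not come from the demo project.

```yaml
# Sketch only: the job key and the default value are hypothetical.
variables:
  quartz_cron_expression:
    description: Quartz cron expression for the job schedule
    default: "0 0 8 * * ?"   # Quartz syntax: every day at 08:00:00

resources:
  jobs:
    demo_job:
      name: demo_job
      schedule:
        # Replaced with the default above, or with the value passed via --var
        quartz_cron_expression: ${var.quartz_cron_expression}
        timezone_id: UTC
      # tasks omitted for brevity
```

A value passed with `--var "quartz_cron_expression=..."` overrides the default, which is why the same value should be provided at both the deployment and run stages.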

## ☀️ Demo: Delta Live Tables Project


```console
$ databricks bundle init
Template to use [default-python]:
Unique name for this project [my_project]: delta_live_tables_demo
Include a stub (sample) notebook in 'delta_live_tables_demo/src': yes
Include a stub (sample) Delta Live Tables pipeline in 'delta_live_tables_demo/src': yes
Include a stub (sample) Python package in 'delta_live_tables_demo/src': yes

✨ Your new project has been created in the 'delta_live_tables_demo' directory!

Please refer to the README.md of your project for further instructions on getting started.
Or read the documentation on Databricks Asset Bundles at https://docs.databricks.com/dev-tools/bundles/index.html.
```


```console
$ databricks auth profiles --help
Lists profiles from ~/.databrickscfg

Usage:
  databricks auth profiles [flags]

Flags:
  -h, --help            help for profiles
      --skip-validate   Whether to skip validating the profiles
```


```console
$ databricks auth profiles
Name     Host                                            Valid
DEFAULT  https://training-partners.cloud.databricks.com  YES
```


```console
# Uses the default target
# (the one marked with default: true)
$ databricks bundle validate
{
  "bundle": {
    "name": "delta_live_tables_demo",
    "target": "dev",
    "environment": "dev",
    "terraform": {
      "exec_path": "/Users/jacek/dev/oss/learn-databricks/Databricks Asset Bundles/delta_live_tables_demo/.databricks/bundle/dev/bin/terraform"
    },
    "lock": {
      "enabled": null,
      "force": false
    },
    "force": false,
    "git": {
      "branch": "meetup-nov-2",
      "origin_url": "https://github.com/jaceklaskowski/learn-databricks.git",
      "commit": "63f784b0000e85107ffea06be24c8151d45cc6c7"
    },
    "mode": "development"
  },
  ...
```


```console
$ databricks bundle validate --target prod
{
  "bundle": {
    "name": "delta_live_tables_demo",
    "target": "prod",
    "environment": "prod", ⬅️
    "terraform": {
      "exec_path": "/Users/jacek/dev/oss/learn-databricks/Databricks Asset Bundles/delta_live_tables_demo/.databricks/bundle/prod/bin/terraform"
      ...
```


Review `resources/delta_live_tables_demo_pipeline.yml`
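
As a rough guide to what to look for in that file, a pipeline resource definition typically has the following shape. This is a sketch, not the exact file generated by the template: the notebook path and the target schema name are assumptions, and the generated file may differ between Databricks CLI versions.

```yaml
# Sketch of a Delta Live Tables pipeline resource (assumed shape, not the generated file)
resources:
  pipelines:
    delta_live_tables_demo_pipeline:
      name: delta_live_tables_demo_pipeline
      # Schema the pipeline publishes tables to (hypothetical naming scheme)
      target: delta_live_tables_demo_${bundle.target}
      libraries:
        - notebook:
            path: ../src/dlt_pipeline.ipynb   # hypothetical notebook path
```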

## Deployment Modes

[Databricks Asset Bundle deployment modes](https://docs.databricks.com/en/dev-tools/bundles/deployment-modes.html)


1. In CI/CD workflows, developers typically code, test, deploy, and run solutions in various phases, or modes.
1. The most common deployment modes include:
    * A development mode for pre-production validation
    * A production mode for validated deliverables
1. Databricks Asset Bundles provides an optional collection of default behaviors that correspond to each of these modes.
1. Modes specify (declaratively) intended behaviors
1. Set the `mode` mapping in a target (under `targets`) and select that target at deployment time
    * `databricks bundle deploy -t <target-name>`
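
A minimal sketch of the `mode` mapping in two targets. The target names `dev` and `prod` are conventional rather than required, and a real production target usually needs more settings (e.g., a workspace host) than shown here.

```yaml
# Sketch only: target names are a convention, not a requirement.
targets:
  dev:
    mode: development
    default: true      # used when no -t option is given
  prod:
    mode: production   # selected with: databricks bundle deploy -t prod
```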


### Development mode

[Development mode](https://docs.databricks.com/en/dev-tools/bundles/deployment-modes.html#development-mode)

1. `mode: development`
1. Tags deployed jobs and pipelines with a `dev` Databricks tag
1. Marks all related deployed Delta Live Tables pipelines as `development: true`
1. _others_

### Production mode

[Production mode](https://docs.databricks.com/en/dev-tools/bundles/deployment-modes.html#production-mode)

1. `mode: production`
1. Validates that all related deployed Delta Live Tables pipelines are marked as `development: false`.
1. Validates that the current git branch is equal to the git branch that is specified in the target:
      ```yaml
      git:
        branch: main
      ```

## Bundle Templates

[Databricks Asset Bundle templates](https://docs.databricks.com/en/dev-tools/bundles/templates.html)

`databricks bundle init` accepts an optional path of the template to use to initialize a DAB project:
- `default-python` for the default Python template
- a local file system path with a template directory
- a git repository URL, e.g. https://github.com/my/repository


```shell
$ databricks bundle init --help
Initialize using a bundle template.

TEMPLATE_PATH optionally specifies which template to use. It can be one of the following:
- 'default-python' for the default Python template
- a local file system path with a template directory
- a Git repository URL, e.g. https://github.com/my/repository

See https://docs.databricks.com//dev-tools/bundles/templates.html for more information on templates.

Usage:
  databricks bundle init [TEMPLATE_PATH] [flags]

Flags:
      --config-file string    File containing input parameters for template initialization.
  -h, --help                  help for init
      --output-dir string     Directory to write the initialized template to.
      --template-dir string   Directory path within a Git repository containing the template.

Global Flags:
      --log-file file            file to write logs to (default stderr)
      --log-format type          log output format (text or json) (default text)
      --log-level format         log level (default disabled)
  -o, --output type              output type: text or json (default text)
  -p, --profile string           ~/.databrickscfg profile
      --progress-format format   format for progress logs (append, inplace, json) (default default)
  -t, --target string            bundle target to use (if applicable)
      --var strings              set values for variables defined in bundle config. Example: --var="foo=bar"
```

## 🚀 Demo: Create DAB Template (WIP)


An idea is to execute the following command with a random template name and guide the audience through errors.

```shell
databricks bundle init <random-template-name>
```

## Source Code

Judging by [this recent PR](https://github.com/databricks/cli/pull/795/files), the source code of the `bundle` command lives in the [Databricks CLI](https://github.com/databricks/cli/tree/main/cmd/bundle) repository itself.

> **Note**
>
> Phew, the source code is Go! 😬


## Bundle Configuration File (databricks.yml)

➡️ [Databricks Asset Bundle configurations](https://docs.databricks.com/en/dev-tools/bundles/settings.html#overview)
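
To give a feel for the overall layout, here is a hedged skeleton of a `databricks.yml` with the most common top-level mappings. All names and paths are hypothetical placeholders, and compute settings for the task are left out, so treat it as a sketch rather than a deployable configuration.

```yaml
# Skeleton sketch: names and paths are hypothetical placeholders.
bundle:
  name: my_project

# Pull in resource definitions kept in separate files
include:
  - resources/*.yml

# Resources can also be declared inline
resources:
  jobs:
    my_job:
      name: my_job
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./src/notebook.ipynb
          # compute settings omitted for brevity

targets:
  dev:
    default: true
```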


### Target mappings

[target mappings](https://docs.databricks.com/en/dev-tools/bundles/settings.html#targets)

## Target Workspace Resolution

<br>

The Databricks CLI uses the `-t, --target` option to select the bundle target, which in turn determines the target Databricks workspace.

[Databricks Asset Bundle configurations](https://docs.databricks.com/en/dev-tools/bundles/settings.html#examples):

* Databricks recommends that you use the `host` mapping instead of the `profile` mapping wherever possible (it makes your bundle configuration files more portable)
* Setting the `host` mapping instructs the Databricks CLI to find a matching profile in your `~/.databrickscfg` file and then use that profile’s fields to determine which Databricks authentication type to use
* If multiple profiles with a matching host field exist within your `~/.databrickscfg` file, then you must use the `profile` mapping (or the `-p, --profile` option) to instruct the Databricks CLI about which specific profile to use
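
The points above could look as follows in `databricks.yml`. This is a sketch: the target names and workspace URLs are hypothetical placeholders, and `PROD` is an assumed profile name in `~/.databrickscfg`.

```yaml
# Sketch only: URLs and the profile name are hypothetical.
targets:
  dev:
    default: true
    workspace:
      # The CLI looks for a profile in ~/.databrickscfg with a matching host
      host: https://dev-workspace.example.com
  prod:
    workspace:
      host: https://prod-workspace.example.com
      # Needed only when several profiles share this host;
      # passing -p PROD on the command line keeps the file more portable.
      # profile: PROD
```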


## Resource ID Resolution

[Retrieve an object’s ID value](https://docs.databricks.com/en/dev-tools/bundles/settings.html#retrieve-an-objects-id-value):

* For the `alert`, `cluster_policy`, `cluster`, `dashboard`, `instance_pool`, `job`, `metastore`, `pipeline`, `query`, `service_principal`, and `warehouse` object types, you can define a `lookup` for your custom variable to retrieve a named object’s ID
* The correct resolved ID of an object is always used for the variable
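
A hedged sketch of a `lookup` in a custom variable. The cluster name `12.2 shared`, the job key, and the notebook path are hypothetical; the lookup resolves the name to the ID of an existing cluster in the target workspace.

```yaml
# Sketch only: the cluster name, job key, and notebook path are hypothetical.
variables:
  my_cluster_id:
    description: ID of an existing cluster, resolved by its name
    lookup:
      cluster: "12.2 shared"

resources:
  jobs:
    my_job:
      name: my_job
      tasks:
        - task_key: main
          # Resolved to the ID of the cluster named above
          existing_cluster_id: ${var.my_cluster_id}
          notebook_task:
            notebook_path: ./src/notebook.ipynb
```

With this in place, the bundle configuration does not need to hard-code a cluster ID that differs between workspaces.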


## Project Directory Structure


### Fixtures

This folder is reserved for (test) fixtures.

Learn more in [Unit Testing and Code Modularization in Databricks](https://medium.com/@mariusz_kujawski/unit-testing-and-code-modularization-in-databricks-33f40c9f6da9)

### Tests

## Questions

1. Any relationship between DAB and Databricks SDK?