# 02.2 - Kedro on Databricks, part 2

_Note: This notebook is meant to be run locally from VS Code, using Databricks Connect._

## Kedro and Databricks Connect

**Databricks Connect** is a client library that allows you to run Spark code locally on your machine while connecting to a remote Databricks cluster for computation. It essentially lets you develop and execute Spark applications from your local IDE or notebook environment, but the actual processing happens on the Databricks cluster.
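As a minimal sketch of what this looks like in code, the helper below (the name `get_remote_spark` is hypothetical) obtains a Spark session through Databricks Connect. It assumes that `databricks-connect` is installed and that connection details are already available from the environment or a Databricks configuration profile.

```python
def get_remote_spark():
    """Return a Spark session whose computation runs on a remote Databricks cluster.

    Hypothetical helper: assumes `databricks-connect` is installed and that
    authentication is already configured outside of this code.
    """
    # Imported lazily so this module can be imported even when
    # databricks-connect is not installed.
    from databricks.connect import DatabricksSession

    # Reuse an existing session if there is one, otherwise create it.
    return DatabricksSession.builder.getOrCreate()
```

Calling `get_remote_spark().range(5).collect()` from a local Python process would then build the query locally and execute it on the cluster.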

The **Databricks extension for Visual Studio Code** lets you connect to Databricks from VS Code and perform actions such as deploying and running Databricks Asset Bundles, managing clusters, and setting up **Databricks Connect**.

Together, the two are a good fit for developing Kedro projects in VS Code, since you can work in your IDE while using Databricks compute.

Follow the official documentation to:

1. [Install the Databricks extension for VS Code](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/vscode-ext/install)
2. [Configure the appropriate cluster](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/vscode-ext/configure)
3. [Install Databricks Connect](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/vscode-ext/databricks-connect)

_Note: `databricks-connect` provides its own `pyspark` top-level module, and [pip doesn't check for conflicting packages](https://github.com/pypa/pip/issues/4625), so make sure you don't have a [conflicting `pyspark` installation](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect/python/troubleshooting#conflicting-pyspark-installations)!_
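Since pip will not flag the conflict, one way to check for it is to look at the installed distributions. The sketch below uses only the standard library; the function names are hypothetical, and it only detects the case where both `pyspark` and `databricks-connect` are installed as separate distributions.

```python
import re
from importlib import metadata


def _normalize(name: str) -> str:
    # Normalize distribution names so that "PySpark" and "databricks_connect"
    # compare equal to "pyspark" and "databricks-connect".
    return re.sub(r"[-_.]+", "-", name).lower()


def has_pyspark_conflict(installed: set[str]) -> bool:
    """Return True if both pyspark and databricks-connect are in `installed`."""
    names = {_normalize(name) for name in installed}
    return {"pyspark", "databricks-connect"} <= names


def installed_distributions() -> set[str]:
    """Collect the names of all distributions visible to this interpreter."""
    names = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing metadata
            names.add(name)
    return names
```

Running `has_pyspark_conflict(installed_distributions())` inside the project's virtual environment reports whether a standalone `pyspark` should be uninstalled first.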

In [1]:
%%sh
uv pip install ./rocketfuel

%%sh is not supported on Databricks. This notebook might fail when running on a Databricks cluster.
Consider using %sh instead.
Using Python 3.12.9 environment at: /Users/juan_cano/Projects/QuantumBlackLabs/Kedro/kedro-databricks-bootcamp/.venv
Resolved 179 packages in 20.71s
   Building rocketfuel @ file:///Users/juan_cano/Projects/QuantumBlackLabs/Kedro/kedro-databricks-bootcamp/02_databricks/rocketfuel
      Built rocketfuel @ file:///Users/juan_cano/Projects/QuantumBlackLabs/Kedro/kedro-databricks-bootcamp/02_databricks/rocketfuel
Prepared 3 packages in 3.50s
Uninstalled 5 packages in 1.54s
Installed 32 packages in 4.23s
 + aiobotocore==2.22.0
 + aiofiles==24.1.0
 + aiohappyeyeballs==2.6.1
 + aiohttp==3.12.11
 ...

In [2]:
%load_ext kedro.ipython

In [3]:
%reload_kedro rocketfuel

In [4]:
catalog.list()

[
    'companies',
    'reviews',
    'shuttles',
    'preprocessed_companies',
    'preprocessed_shuttles',
    'preprocessed_reviews',
    'model_input_table',
    'regressor',
    'parameters',
    'params:model_options',
    'params:model_options.test_size',
    'params:model_options.random_state',
    'params:model_options.features'
]

Notice how data is loaded as a PySpark DataFrame, directly from Databricks Unity Catalog!

In [5]:
catalog._get_dataset("companies")

kedro_datasets.databricks.managed_table_dataset.ManagedTableDataset(catalog='aza-databricks-b9b7aae-catalog', database='rocketfuel', table='companies', write_mode='overwrite', dataframe_type='spark', version='None')
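The project's `catalog.yml` is not shown in this notebook, but the dataset representation above suggests an entry along the following lines. This sketch expresses it as a Python dict, with field values inferred from that representation; `managed_table_entry` is a hypothetical helper, not part of Kedro.

```python
def managed_table_entry(table: str, catalog: str, database: str) -> dict[str, str]:
    """Build a catalog entry for a Unity Catalog managed table.

    Hypothetical helper: the keys mirror the fields printed in the dataset
    representation, the real project configuration may differ.
    """
    return {
        "type": "databricks.ManagedTableDataset",
        "catalog": catalog,
        "database": database,
        "table": table,
        "write_mode": "overwrite",
        "dataframe_type": "spark",
    }


companies_entry = managed_table_entry(
    table="companies",
    catalog="aza-databricks-b9b7aae-catalog",
    database="rocketfuel",
)
```

With `dataframe_type` set to `"spark"`, loading the dataset yields a PySpark DataFrame rather than a pandas one.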

In [6]:
display(catalog.load("companies"))


| Unnamed: 0 | id | company_rating | company_location | total_fleet_count | iata_approved |
|---|---|---|---|---|---|
| 0 | 3888 | 100% | Isle of Man | 1.0 | f |
| 1 | 46728 | 100% | | 1.0 | f |
| 2 | 34618 | 38% | Isle of Man | 1.0 | f |
| 3 | 28619 | 100% | Bosnia and Herzegovina | 1.0 | f |
| 4 | 8240 | | Chile | 1.0 | t |
| 5 | 16813 | 100% | Kiribati | 2.0 | f |
| 6 | 2859 | 90% | Bahrain | 1.0 | f |
| 7 | 33237 | | Nicaragua | 1.0 | f |
| 8 | 30052 | 100% | Turkmenistan | 1.0 | f |
| 9 | 43711 | 100% | Rwanda | 1.0 | f |


## Namespaces for pipeline grouping

## Integration with Databricks MLflow

### Log Kedro runs as MLflow experiments

There are two types of MLflow experiments in Databricks:
- **Workspace** experiments are not associated with any notebook, and any notebook can log a run to these experiments by using the experiment ID or the experiment name. _They cannot be created inside Git folders._
- **Notebook** experiments are associated with a specific notebook. _They are not checked into source control._

Therefore, for personal experimentation, **notebook** experiments are more appropriate, while for collaboration, **workspace** experiments can be created in a regular workspace folder outside of Git.

In [0]:
%%sh
uv pip install ./rocketfuel

In [0]:
%load_ext kedro.ipython

Kedro runs can be logged in MLflow using the [`kedro-mlflow`](https://kedro-mlflow.readthedocs.io/) community plugin.

In [7]:
%%sh
uv pip install kedro-mlflow

Using Python 3.12.9 environment at: /Users/juan_cano/Projects/QuantumBlackLabs/Kedro/kedro-databricks-bootcamp/.venv
Audited 1 package in 979ms


`kedro-mlflow` can take [configuration](https://kedro-mlflow.readthedocs.io/en/0.14.4/source/03_experiment_tracking/01_experiment_tracking/01_configuration.html) from `conf/<environment>/mlflow.yml`, which can be used to configure the experiment name:

- For **notebook** experiments, you have to configure the experiment name to match the full path of the notebook.
- For **workspace** experiments, the experiment name would be the full path to the experiments folder in the workspace.

To this end, let's add some OmegaConf syntax to `mlflow.yml` so that the experiment name can be specified from the outside, while taking a default value if not present:

In [0]:
%%writefile rocketfuel/conf/databricks/mlflow.yml
tracking:
  experiment:
    name: ${runtime_params:mlflow_experiment_name, ${kedro_root:}}
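The fallback behaviour of `${runtime_params:mlflow_experiment_name, <default>}` can be pictured with a plain-Python sketch. This is not how Kedro or OmegaConf implement the resolver; `resolve_experiment_name` is a hypothetical function that only mimics the lookup-with-default semantics.

```python
def resolve_experiment_name(runtime_params: dict[str, str], default: str) -> str:
    """Mimic `${runtime_params:mlflow_experiment_name, <default>}`.

    Returns the value passed at runtime if present, otherwise the default.
    """
    return runtime_params.get("mlflow_experiment_name", default)


# Value supplied from the outside, for example through `--params`
# (the path here is a made-up illustration)
explicit = resolve_experiment_name(
    {"mlflow_experiment_name": "/Users/someone/my_notebook"},
    default="fallback_name",
)

# No runtime parameter given: the default is used
implicit = resolve_experiment_name({}, default="fallback_name")
```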

Let's try to set up a **notebook** experiment. For this, extract the notebook path:

In [0]:
notebook_path = dbutils.entry_point.getDbutils().notebook().getContext().notebookPath().get()
notebook_path

And pass that as a runtime parameter to specify the experiment name:

_Note: Extra params cannot contain spaces when passed to `%reload_kedro`, see [this issue](https://github.com/kedro-org/kedro/issues/4813)_

In [0]:
%reload_kedro rocketfuel --env databricks --params mlflow_experiment_name=$notebook_path
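Given the restriction on spaces in extra parameters, it can help to validate values before building the `--params` string. The following is a minimal sketch with a hypothetical helper, `format_params`, which produces the comma-separated `key=value` form and rejects any whitespace.

```python
def format_params(params: dict[str, str]) -> str:
    """Join parameters as `key=value` pairs separated by commas.

    Raises ValueError if any key or value contains whitespace, since such
    parameters cannot be passed to `%reload_kedro`.
    """
    parts = []
    for key, value in params.items():
        if any(ch.isspace() for ch in f"{key}{value}"):
            raise ValueError(f"Parameter {key!r} contains whitespace: {value!r}")
        parts.append(f"{key}={value}")
    return ",".join(parts)
```

For example, a notebook path containing a space would raise an error here, which signals that the notebook needs to be renamed or moved before it can be used as the experiment name.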

Now, every time a Kedro pipeline is run, it's logged as an MLflow run:

In [0]:
session.run("load_data")

![MLflow run corresponding to a Kedro run on Databricks](./kedro-databricks-mlflow-run.png)

### Log artifacts as MLflow models in the Unity Catalog

In [0]:
%sql
SHOW CATALOGS;

In [0]:
# Test code
import mlflow
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

# Assumption: enable autologging explicitly so that `fit` creates a run
# even where Databricks Autologging is not enabled by default
mlflow.autolog()

# Assumption: target the Unity Catalog model registry explicitly
mlflow.set_registry_uri("databricks-uc")

# Train a sklearn model on the iris dataset
X, y = datasets.load_iris(return_X_y=True, as_frame=True)
clf = RandomForestClassifier(max_depth=7)
clf.fit(X, y)

# Note that the UC model name follows the pattern
# <catalog_name>.<schema_name>.<model_name>, corresponding to
# the catalog, schema, and registered model name
# in Unity Catalog under which to create the version
# The registered model will be created if it doesn't already exist
autolog_run = mlflow.last_active_run()
model_uri = f"runs:/{autolog_run.info.run_id}/model"  # NOTE: Can this be automatic?
mlflow.register_model(model_uri, "aza-databricks-b9b7aae-catalog.rocketfuel.iris_model")  # NOTE: Can this be automatic?
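The three-level naming pattern described in the comments above can be made explicit with a small helper. This is only an illustrative sketch: `uc_model_name` is a hypothetical function, and the validation is deliberately minimal, checking only for empty parts and stray dots.

```python
def uc_model_name(catalog: str, schema: str, model: str) -> str:
    """Build a `<catalog_name>.<schema_name>.<model_name>` identifier."""
    parts = (catalog, schema, model)
    for part in parts:
        # A dot inside a part would change the number of levels in the name
        if not part or "." in part:
            raise ValueError(f"Invalid name component: {part!r}")
    return ".".join(parts)


iris_model_name = uc_model_name(
    catalog="aza-databricks-b9b7aae-catalog",
    schema="rocketfuel",
    model="iris_model",
)
```

The resulting string is what would be passed as the second argument to `mlflow.register_model`.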

In [0]:
session.run("__default__")

### Register models using the Databricks Unity Catalog

_**Note**: fsspec uses the DBFS API, which is not compatible with Unity Catalog according to https://github.com/fsspec/filesystem_spec/issues/1656_