# 02.2 - Kedro on Databricks, part 2

_Note: This notebook is meant to be run locally from VS Code, using Databricks Connect._

## Kedro and Databricks Connect

**Databricks Connect** is a client library that allows you to run Spark code locally on your machine while connecting to a remote Databricks cluster for computation. It essentially lets you develop and execute Spark applications from your local IDE or notebook environment, but the actual processing happens on the Databricks cluster.
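As a rough illustration of that split between local code and remote compute, the sketch below obtains a Spark session through Databricks Connect. It assumes `databricks-connect` is installed and that authentication is already configured (for example by the VS Code extension); the helper name `get_remote_spark` is made up for this example.

```python
def get_remote_spark():
    """Return a Spark session whose queries execute on Databricks.

    Assumes `databricks-connect` is installed and a connection
    (host, credentials, cluster or serverless) is already configured.
    """
    # Imported lazily so that defining the helper does not require the package.
    from databricks.connect import DatabricksSession

    return DatabricksSession.builder.getOrCreate()


# Example usage (needs a reachable workspace, so it is left commented out):
# spark = get_remote_spark()
# spark.range(5).show()  # the query is defined locally and executed remotely
```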

The **Databricks extension for Visual Studio Code** offers several useful features for working with Databricks from VS Code, such as deploying and running Databricks Asset Bundles, managing clusters, and easily setting up **Databricks Connect**.

Together, the two are ideal companions for developing Kedro projects in VS Code, since you can work in your IDE while using Databricks compute.

Follow the official documentation to:

1. [Install the Databricks extension for VS Code](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/vscode-ext/install)
2. [Configure the appropriate cluster](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/vscode-ext/configure)
3. [Install Databricks Connect](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/vscode-ext/databricks-connect)

_Note: `databricks-connect` provides its own `pyspark` top-level module, and [pip doesn't check for conflicting packages](https://github.com/pypa/pip/issues/4625), so make sure you don't have a [conflicting `pyspark` installation](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect/python/troubleshooting#conflicting-pyspark-installations)!_
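One way to spot the conflict described above is to ask the packaging metadata whether both distributions are present. This is only a sketch using the standard library; the helper names are hypothetical, and it assumes the standalone PySpark package is installed under the distribution name `pyspark`.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def describe_pyspark_conflict() -> Optional[str]:
    """Return a warning message if both distributions are installed, else None."""
    connect_version = installed_version("databricks-connect")
    pyspark_version = installed_version("pyspark")
    if connect_version and pyspark_version:
        return (
            f"databricks-connect {connect_version} and pyspark {pyspark_version} "
            "are both installed; uninstall pyspark and reinstall databricks-connect."
        )
    return None


print(describe_pyspark_conflict() or "No conflicting pyspark distribution detected.")
```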

### Install the needed requirements again

In [None]:
%pip install -r ../../../requirements.in
%pip install hdfs s3fs

Uninstall dependencies that conflict:

In [None]:
%pip uninstall -y kedro-mlflow

Load the Kedro ipython extension.

In [None]:
%load_ext kedro.ipython

Ensure your project now contains the Databricks configuration created in the previous notebook. Add it to your `base/catalog.yml` file. The content should be as follows:

```yaml
_uc_catalog: #your catalog location
_uc_schema: #your schema location

companies_raw:
  type: spark.SparkDataset
  filepath: /Volumes/${_uc_catalog}/${_uc_schema}/bronze/companies.csv
  file_format: csv
  load_args:
    header: True
    inferSchema: True

reviews_raw:
  type: spark.SparkDataset
  filepath: /Volumes/${_uc_catalog}/${_uc_schema}/bronze/reviews.csv
  file_format: csv
  load_args:
    header: True
    inferSchema: True

companies:
  type: databricks.ManagedTableDataset
  catalog: ${_uc_catalog}
  database: ${_uc_schema}
  table: companies
  write_mode: overwrite

reviews:
  type: databricks.ManagedTableDataset
  catalog: ${_uc_catalog}
  database: ${_uc_schema}
  table: reviews
  write_mode: overwrite

shuttles:
  type: databricks.ManagedTableDataset
  catalog: ${_uc_catalog}
  database: ${_uc_schema}
  table: shuttles
  write_mode: overwrite

preprocessed_companies:
  type: databricks.ManagedTableDataset
  catalog: ${_uc_catalog}
  database: ${_uc_schema}
  table: preprocessed_companies
  write_mode: overwrite

preprocessed_shuttles:
  type: databricks.ManagedTableDataset
  catalog: ${_uc_catalog}
  database: ${_uc_schema}
  table: preprocessed_shuttles
  write_mode: overwrite

preprocessed_reviews:
  type: databricks.ManagedTableDataset
  catalog: ${_uc_catalog}
  database: ${_uc_schema}
  table: preprocessed_reviews
  write_mode: overwrite

model_input_table:
  type: databricks.ManagedTableDataset
  catalog: ${_uc_catalog}
  database: ${_uc_schema}
  table: model_input_table
  write_mode: overwrite

regressor:
  type: pickle.PickleDataset
  filepath: data/06_models/regressor.pickle
  versioned: true
```
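The `${_uc_catalog}` and `${_uc_schema}` placeholders are expanded when Kedro loads the configuration. The snippet below is only a standard-library imitation of that substitution, not Kedro's actual implementation; the function name and the catalog and schema values are hypothetical.

```python
import re


def resolve_placeholders(template: str, variables: dict) -> str:
    """Replace each ${name} in the template with variables[name].

    Raises KeyError if a placeholder has no matching variable.
    """
    return re.sub(
        r"\$\{(\w+)\}",
        lambda match: str(variables[match.group(1)]),
        template,
    )


# Hypothetical values; use your own catalog and schema names.
variables = {"_uc_catalog": "my_catalog", "_uc_schema": "my_schema"}

filepath = resolve_placeholders(
    "/Volumes/${_uc_catalog}/${_uc_schema}/bronze/companies.csv", variables
)
print(filepath)  # /Volumes/my_catalog/my_schema/bronze/companies.csv
```

Keeping the catalog and schema in two variables means that pointing the whole project at a different Unity Catalog location only requires changing two lines.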

In [None]:
%reload_kedro ../../rocketfuel

In [None]:
catalog.list()

Notice how data is loaded as a PySpark DataFrame, directly from Databricks Unity Catalog!

In [None]:
catalog._get_dataset("companies")

In [None]:
display(catalog.load("companies"))

### Exercise 1

Codify the logic of the dummy `load_data` pipeline inside the project and run it locally through `databricks-connect`. For that:

- Create the pipeline in your local development environment.
- Run `%reload_kedro` to reload the Kedro project.
- Try to execute the `load_data` pipeline from the VS Code notebook.
- Iterate until it works.

In [None]:
%reload_kedro ../../rocketfuel

In [None]:
# The command below lets you execute the `load_data` pipeline
session.run("load_data")

## Integration with Databricks MLflow

### Log Kedro runs as MLflow experiments

There are two types of MLflow experiments in Databricks:
- **Workspace** experiments are not associated with any notebook, and any notebook can log a run to these experiments by using the experiment ID or the experiment name. _They cannot be created inside Git folders._
- **Notebook** experiments are associated with a specific notebook. _They are not checked into source control._

Therefore, for personal experimentation **notebook** experiments are more appropriate, and for collaboration **workspace** experiments can be created in a regular workspace folder outside of Git.

Since you will be running this notebook locally using Databricks Connect, creating a **workspace** experiment will be more flexible. First, create the appropriate parent directory using the Databricks SDK:

In [None]:
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

current_user = w.current_user.me()
home_dir = f"/Users/{current_user.user_name}"
home_dir

Next, you will need a Databricks token:

In [None]:
# FIXME: Call `databricks configure` on the CLI instead of setting the environment variables here?
#
# import os
#
# os.environ["DATABRICKS_INSTANCE"] = w.config.host
# Do NOT commit this to version control!
# os.environ["DATABRICKS_TOKEN"] = "..."


Finally, verify that everything works:

In [None]:
import mlflow

# This workaround is needed with serverless compute, see official answer at
# https://community.databricks.com/t5/machine-learning/using-datbricks-connect-with-serverless-compute-and-mlflow/m-p/97604#M3764
mlflow.tracking._model_registry.utils._get_registry_uri_from_spark_session = (
    lambda: "databricks-uc"
)

experiment_path = f"{home_dir}/02_2-kedro-on-databricks"

mlflow.set_tracking_uri("databricks")
mlflow.set_experiment(experiment_path)

MLflow is the perfect companion for Kedro projects, thanks to the `kedro-mlflow` community plugin:

In [None]:
%pip install kedro-mlflow

`kedro-mlflow` can take [configuration](https://kedro-mlflow.readthedocs.io/en/0.14.4/source/03_experiment_tracking/01_experiment_tracking/01_configuration.html) from `conf/<environment>/mlflow.yml`, which can be used to configure the experiment name.

To this end, let's add some OmegaConf syntax to `mlflow.yml` so that the experiment name can be specified from the outside:

In [None]:
%%writefile ../conf/local/mlflow.yml
server:
  mlflow_tracking_uri: databricks

tracking:
  experiment:
    name: ${runtime_params:mlflow_experiment_name}

Now you can pass the experiment name as a runtime parameter:

_Note: Extra params cannot contain spaces when passed to `%reload_kedro`, see [this issue](https://github.com/kedro-org/kedro/issues/4813)_
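Because of that limitation, it can help to validate the value before handing it to `%reload_kedro`. The helper below is a hypothetical sketch, not part of Kedro, and the experiment path is a made-up example.

```python
def format_kedro_param(key: str, value: str) -> str:
    """Build a `key=value` pair, rejecting whitespace that `%reload_kedro` cannot handle."""
    if any(char.isspace() for char in key + value):
        raise ValueError(f"Parameter {key!r} must not contain whitespace: {value!r}")
    return f"{key}={value}"


# Hypothetical experiment path without spaces.
param = format_kedro_param(
    "mlflow_experiment_name",
    "/Users/someone@example.com/02_2-kedro-on-databricks",
)
print(param)
```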

In [None]:
%reload_kedro --params mlflow_experiment_name=$experiment_path

Now, every time a Kedro pipeline runs, it is logged as an MLflow run:

In [None]:
session.run("data_processing")

![MLflow run corresponding to a Kedro run on Databricks](./kedro-databricks-mlflow-run.png)

### Exercise 3

Make the necessary changes to the project so that you can run a Kedro pipeline and log the results as an MLflow experiment from the CLI:

```bash
(.venv) $ kedro run -p data_processing --params mlflow_experiment_name=/Users/juan_luis_cano@mckinsey.com/02_2-kedro-on-databricks
```

_Note: To run locally you might need to `export DATABRICKS_SERVERLESS_COMPUTE_ID=auto`, see the [documentation on configuring a connection to serverless compute](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect/cluster-config#configure-a-connection-to-serverless-compute)._
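If you would rather set that variable from Python (for instance at the top of a notebook), a minimal sketch could look like this. It assumes the variable is set before any Spark session is created in the same process.

```python
import os

# Same effect as `export DATABRICKS_SERVERLESS_COMPUTE_ID=auto` for this process.
# `setdefault` leaves the variable untouched if it is already set.
os.environ.setdefault("DATABRICKS_SERVERLESS_COMPUTE_ID", "auto")

print(os.environ["DATABRICKS_SERVERLESS_COMPUTE_ID"])
```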

### Register models using the Databricks Unity Catalog

Registering models on the Unity Catalog from Kedro pipelines is trivial.

Adjust the `regressor` dataset:

```diff
 regressor:
-  type: pickle.PickleDataset
-  filepath: data/06_models/regressor.pickle
-  versioned: true
+  type: kedro_mlflow.io.models.MlflowModelTrackingDataset
+  flavor: mlflow.sklearn
```

And run the `data_science` pipeline:

In [None]:
%reload_kedro --params mlflow_experiment_name=$experiment_path

In [None]:
session.run("data_science")

![Model registry](kedro-databricks-mlflow-model-registry.png)