# 02.1 - Kedro on Databricks, part 1

## Overview of deployment patterns

- Day 1: explore Kedro locally (with Pandas instead of PySpark)
- Day 2, part 1: explore Kedro fully on Databricks
- Day 2, part 2: develop Kedro locally, execute on Databricks (with `databricks-connect`)
- Day 3: intermediate topics, including Databricks Asset Bundles (with `kedro-databricks`)

## Workspace notebooks

### Install Kedro and dependencies

In [0]:
%pip install -r ../../../requirements.in

In [0]:
%pip install hdfs s3fs

In [0]:
%pip uninstall -y rich

### Load data from workspace files using `spark.SparkDataset`

Loading data directly from the workspace may be disabled depending on the cluster configuration.

First verify that the data is available from a Volume in Unity Catalog:

In [0]:
spark.read.csv("/Volumes/aza-databricks-b9b7aae-catalog/rocketfuel/bronze/companies.csv").show(5)  # Replace with your catalog, schema, and volume!
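
Unity Catalog Volume paths follow the pattern `/Volumes/<catalog>/<schema>/<volume>/<file>`. As a minimal sketch, the path used above can be assembled from its parts so that only the three names need changing. `volume_path` is a hypothetical helper for illustration, not part of Kedro or Databricks:

```python
from pathlib import PurePosixPath


def volume_path(catalog: str, schema: str, volume: str, filename: str) -> str:
    """Build a /Volumes/<catalog>/<schema>/<volume>/<filename> path (hypothetical helper)."""
    return str(PurePosixPath("/Volumes") / catalog / schema / volume / filename)


# Values taken from the cell above; replace them with your own.
companies_path = volume_path(
    "aza-databricks-b9b7aae-catalog", "rocketfuel", "bronze", "companies.csv"
)
print(companies_path)
```

The resulting string can be passed to `spark.read.csv` in place of the hard-coded path.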

Create a Kedro catalog to load the data from the Databricks workspace. We'll define the data as a `spark.SparkDataset` in the catalog, which allows us to load the data as a PySpark DataFrame.

In [0]:
from kedro.io import DataCatalog

interactive_catalog = DataCatalog.from_config(
    {
        "companies_raw": {
            "type": "spark.SparkDataset",
            "filepath": "/Volumes/aza-databricks-b9b7aae-catalog/rocketfuel/bronze/companies.csv",
            "file_format": "csv",
            "load_args": {
                "header": True,
                "inferSchema": True
            }
        }
    }
)
interactive_catalog.list()

In [0]:
display(interactive_catalog.load("companies_raw"))

Next, move this configuration into files so that the Kedro config loader can read it. First, create a `databricks` environment inside the configuration directory to hold the configuration files:


In [0]:
%sh
mkdir -p ../conf/databricks

Create a Kedro catalog yaml file to load the data from the Databricks workspace.

Don't forget to update the `_uc_catalog` and `_uc_schema` variables to match your cluster and Unity Catalog configuration.

In [0]:
%%writefile ../conf/databricks/catalog.yml
_uc_catalog: aza-databricks-b9b7aae-catalog
_uc_schema: rocketfuel

companies_raw:
  type: spark.SparkDataset
  filepath: /Volumes/${_uc_catalog}/${_uc_schema}/bronze/companies.csv
  file_format: csv
  load_args:
    header: True
    inferSchema: True
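
The `${_uc_catalog}` and `${_uc_schema}` placeholders are resolved when the catalog configuration is loaded; the leading underscore marks these keys as variables rather than dataset entries. The sketch below only imitates that substitution with the standard library's `string.Template` (Kedro itself relies on OmegaConf) to show what `filepath` resolves to:

```python
from string import Template

# Variables as defined at the top of conf/databricks/catalog.yml.
variables = {
    "_uc_catalog": "aza-databricks-b9b7aae-catalog",
    "_uc_schema": "rocketfuel",
}

# string.Template happens to share the ${name} placeholder syntax.
filepath_template = Template("/Volumes/${_uc_catalog}/${_uc_schema}/bronze/companies.csv")
resolved_filepath = filepath_template.substitute(variables)
print(resolved_filepath)
```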

Create a config loader that uses the databricks environment as the default run environment:

In [0]:
from kedro.config import OmegaConfigLoader

config_loader = OmegaConfigLoader(
    conf_source="../conf",
    base_env="base",
    default_run_env="databricks",  # Notice newly created environment
)
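
With this setup, entries from `conf/databricks` take precedence over entries with the same top-level key in `conf/base`. A rough sketch of that precedence with plain dictionaries, assuming the loader's default behaviour of replacing whole top-level entries instead of merging nested keys. The contents of both dictionaries are made up for illustration:

```python
# Hypothetical contents of conf/base/catalog.yml (local, Pandas-based).
base_catalog = {
    "companies_raw": {"type": "pandas.CSVDataset", "filepath": "data/01_raw/companies.csv"},
    "reviews_raw": {"type": "pandas.CSVDataset", "filepath": "data/01_raw/reviews.csv"},
}

# Hypothetical contents of conf/databricks/catalog.yml.
databricks_catalog = {
    "companies_raw": {"type": "spark.SparkDataset", "file_format": "csv"},
}

# The run environment wins per top-level key; nested keys are not merged.
effective_catalog = {**base_catalog, **databricks_catalog}
print(effective_catalog["companies_raw"])
```

Under this assumption, an entry overridden in `databricks` must be complete on its own, because nothing is inherited from the `base` entry with the same name.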

Load the catalog configuration using the config loader:

In [None]:
catalog_config = config_loader.get("catalog")

Inspect the `companies_raw` entry in the loaded configuration:

In [0]:
catalog_config["companies_raw"]

And finally create a Kedro `DataCatalog` object from the configuration:

In [0]:
interactive_catalog = DataCatalog.from_config(catalog_config)

Fetch and display the data:

In [0]:
display(interactive_catalog.load("companies_raw"))

### Bootstrap the Kedro project inside a Databricks notebook using the Kedro extension for IPython

In [0]:
# XXX: Shouldn't be necessary, but DISABLE_HOOKS_FOR_PLUGINS seems to have no effect
%pip uninstall -y kedro-mlflow

The `kedro.ipython` extension is a quick way to explore the Kedro `catalog`, `context`, `pipelines`, and `session` variables of your project within an IPython-compatible environment, such as Databricks notebooks or Google Colab. It is tool-independent and useful in situations where launching a Jupyter interactive environment is not possible. Use the `%load_ext` line magic to load the extension explicitly:


In [0]:
%load_ext kedro.ipython

Use the `%reload_kedro` line magic within your notebook to reload the Kedro variables (for example, if you need to update `catalog` following changes to your Data Catalog).

You don't need to restart the kernel to refresh the `catalog`, `context`, `pipelines` and `session` variables.

`%reload_kedro` accepts the optional keyword arguments `env` and `params`. For example, to use the `databricks` configuration environment:



In [0]:
%reload_kedro --env databricks

Now you can load the project catalog and fetch data from it. Note that this is a different catalog object from the `interactive_catalog` created earlier: this one is built from the Kedro project configuration.

In [0]:
catalog.list(".*_raw")  # Accepts a regular expression
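
The pattern is an ordinary Python regular expression. As a sketch of the filtering idea using only the `re` module (the dataset names are illustrative, and the exact matching rules applied by `catalog.list` may differ):

```python
import re

# Illustrative dataset names; your project catalog will contain others.
dataset_names = ["companies_raw", "reviews_raw", "companies", "reviews"]

# Keep only the names that the pattern matches.
pattern = re.compile(r".*_raw")
raw_datasets = [name for name in dataset_names if pattern.search(name)]
print(raw_datasets)
```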

In [0]:
display(catalog.load("companies_raw"))

### Exercise 1 (5 mins)

Complete the `catalog.yml` of the `databricks` environment inside the `rocketfuel` project by adding a new dataset:

- `reviews_raw` using `spark.SparkDataset` (similar configuration as `companies_raw`)

When done, reload the Kedro project and verify that loading the data works.

## Connection with the Unity Catalog

Normally we will load all the structured data from Unity Catalog tables.

Therefore, let's write a first pipeline that ingests these CSV and Excel files into Unity Catalog.

### Read and write structured data to Databricks Unity Catalog using `databricks.ManagedTableDataset`

Add a new dataset entry for the `companies` dataset that uses the `databricks.ManagedTableDataset` type. This dataset type allows you to save data to, and load data from, a Unity Catalog table in Databricks.

In [0]:
%%writefile -a ../conf/databricks/catalog.yml
companies:
  type: databricks.ManagedTableDataset
  catalog: ${_uc_catalog}
  database: ${_uc_schema}
  table: companies
  write_mode: overwrite

Reload the `kedro.ipython` extension and project with the updated catalog.

In [0]:
%load_ext kedro.ipython

In [0]:
%reload_kedro --env databricks

Load the raw data again from the Databricks workspace.

In [0]:
df_companies = catalog.load("companies_raw")
df_companies.show(1)

Now save this data into the Unity Catalog table through the `companies` catalog entry, which uses the `databricks.ManagedTableDataset` type.

In [0]:
catalog.save("companies", df_companies)

In [0]:
# Uncomment if cell below does not show any output,
# see https://github.com/kedro-org/kedro/issues/4804
# %sh uv pip uninstall rich

Verify you can read the data from the Unity Catalog table using a SQL command. This will also verify that the table was created successfully.

In [0]:
%sql
SELECT * FROM `aza-databricks-b9b7aae-catalog`.rocketfuel.companies
LIMIT 5;
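
The backticks around the catalog name are needed because it contains hyphens. A small sketch of building the same kind of query for any table from Python. `quote_identifier` and `preview_query` are hypothetical helpers, and the `spark.sql` call is left as a comment because `spark` only exists on a Databricks cluster:

```python
def quote_identifier(name: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def preview_query(catalog: str, schema: str, table: str, limit: int = 5) -> str:
    """Return a SELECT statement for catalog.schema.table (hypothetical helper)."""
    full_name = ".".join(quote_identifier(part) for part in (catalog, schema, table))
    return f"SELECT * FROM {full_name} LIMIT {limit}"


query = preview_query("aza-databricks-b9b7aae-catalog", "rocketfuel", "companies")
print(query)
# On Databricks: display(spark.sql(query))
```

Quoting every part is harmless for names without special characters, so the helper does not need to decide which identifiers require it.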

### Exercise 2 (5 mins)

Add a new dataset to `conf/databricks/catalog.yml` to represent the `reviews` Delta table,
and ingest the data manually from the notebook.

When you are done, run the appropriate SQL command to verify that everything worked.

The full documentation for the Kedro `databricks.ManagedTableDataset` is available [here](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-7.0.0/api/kedro_datasets.databricks.ManagedTableDataset.html).


### Exercise 3 (15 mins)

Codify the logic to load the raw data into the Unity Catalog in a `load_data` Kedro pipeline. For that:

- Create a function that receives a Spark DataFrame and returns it; it will be used to load both the raw companies and reviews data.
- Create two Kedro nodes to load each dataset into the Unity Catalog.
- Create a Kedro pipeline from these two nodes.
- Use the `NotebookVisualizer` to visualise the pipeline in the notebook.

In [None]:
from kedro.pipeline import node, pipeline

### Your code goes here

# node1 = node(func=..., inputs="companies_raw", outputs="companies")

# load_data = pipeline([node1, node2])

In [0]:
from kedro_viz.integrations.notebook import NotebookVisualizer

NotebookVisualizer(load_data).show()