# 01.2 - Kedro framework

<img src="../static/kedro-horizontal-color-on-light.png" width="400" alt="Kedro">

This notebook covers how to use the Kedro CLI to create and manage projects using the Kedro framework, which assembles the library components seen in [First steps with Kedro](./01_1-First%20Steps%20with%20Kedro.ipynb) in a standard way.

## Starters

A Kedro starter contains code in the form of a Cookiecutter template for a Kedro project. Using a starter is like using a pre-defined layout when creating a presentation or document.

You can find [the official list of starters](https://docs.kedro.org/en/0.19.10/starters/starters.html#official-kedro-starters) in the documentation.

The basis for the rest of this notebook will be the [`spaceflights-pandas`](https://github.com/kedro-org/kedro-starters/tree/0.19.10/spaceflights-pandas) starter, which is ideal for local execution. Later on in the bootcamp you will move on to the `spaceflights-pyspark` starter, which uses PySpark and is ready to be executed in Databricks.

To use it, you first need to have `kedro` installed. You can install it with `conda`, in a separate virtual environment, or with a Python workflow tool capable of managing global utilities, such as `pipx` or `uv`.

```bash
(.venv) $ kedro new --starter=spaceflights-pandas --name rocketfuel
```


## The directory structure

A typical Kedro project looks like this:

```
project-dir          # Parent directory of the template
├── conf             # Project configuration files
├── data             # Local project data (not committed to version control)
├── docs             # Project documentation
├── notebooks        # Project-related Jupyter notebooks (can be used for experimental code before moving the code to src)
├── src              # Project source code
├── tests            # Folder containing unit and integration tests
├── .gitignore       # Hidden file that prevents staging of unnecessary files to `git`
├── pyproject.toml   # Identifies the project root and contains configuration information
├── README.md        # Project README
└── requirements.txt # Project dependencies file
```
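The note that `pyproject.toml` identifies the project root can be made concrete. Kedro recognises a project by the presence of a `[tool.kedro]` section in that file. The following is a minimal sketch of the idea using only the standard library; `find_project_root` is a hypothetical helper written for this notebook, not part of Kedro's API, and it looks for the section header with a plain text scan rather than a full TOML parse.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path) -> Optional[Path]:
    """Walk upwards from `start` and return the first directory whose
    pyproject.toml contains a `[tool.kedro]` section, or None if there is none.

    Hypothetical helper for illustration only (not a Kedro function).
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        pyproject = candidate / "pyproject.toml"
        if pyproject.is_file() and "[tool.kedro]" in pyproject.read_text(encoding="utf-8"):
            return candidate
    return None
```

Calling `find_project_root(Path.cwd())` from anywhere inside `rocketfuel/` (for example from `rocketfuel/notebooks`) should return the `rocketfuel` directory, assuming the project was created with `kedro new`.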


## Running pipelines

You can use the Kedro CLI to run any pipeline that has been registered.

To see the pipelines already registered in your project, run:

```bash
(.venv) $ kedro registry list
```

Now you can use `kedro run` to execute the pipeline of your choice:

```bash
(.venv) $ kedro run --pipeline data_processing
```

And after the `data_processing` pipeline is executed, you should be able to see the results in the appropriate local directory:

```bash
(.venv) $ cd rocketfuel && ls data/02_intermediate
```


## Environments

A [configuration environment](https://docs.kedro.org/en/0.19.10/configuration/configuration_basics.html#configuration-environments) is a way of organising your configuration settings for different stages of your data pipeline. For example, you might have different settings for development, testing, and production environments.

By default, Kedro projects have a `base` and a `local` environment.
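To illustrate how the two default environments interact: settings in `local` take precedence over `base`, so a top-level key defined in both resolves to the `local` value. The sketch below mimics that precedence with plain dictionaries. `merge_environments` and the sample entries (including the `/tmp` path) are hypothetical and only serve the illustration; in a real project the merging is done by Kedro's config loader from the files under `conf/`.

```python
def merge_environments(base: dict, override: dict) -> dict:
    """Return the base config with top-level keys replaced by the override environment.

    Hypothetical stand-in for the precedence rule, not Kedro's implementation.
    """
    return {**base, **override}


# Illustrative catalog entries as they might appear in conf/base/catalog.yml
base_catalog = {
    "companies": {"type": "pandas.CSVDataset", "filepath": "data/01_raw/companies.csv"},
    "shuttles": {"type": "pandas.ExcelDataset", "filepath": "data/01_raw/shuttles.xlsx"},
}

# A made-up override as it might appear in conf/local/catalog.yml
local_catalog = {
    "companies": {"type": "pandas.CSVDataset", "filepath": "/tmp/companies_sample.csv"},
}

merged = merge_environments(base_catalog, local_catalog)
print(merged["companies"]["filepath"])  # /tmp/companies_sample.csv
print(merged["shuttles"]["filepath"])   # data/01_raw/shuttles.xlsx
```

Note that the whole `companies` entry is replaced rather than merged key by key, which is why the override repeats the `type`.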


## Pipeline creation

To create a new pipeline, you can use the CLI:

```bash
(.venv) $ kedro pipeline create PIPELINE_NAME
```


## Integration with VS Code

Kedro has an official extension for VS Code that adds code navigation and autocompletion for Kedro projects.


## Visualisation with Kedro Viz

You can use Kedro Viz to visualise your pipelines in three ways:

1. Using the `NotebookVisualizer` (see the [First steps with Kedro](./01_1-First%20Steps%20with%20Kedro.ipynb) notebook)
2. Using the [VS Code integration](https://docs.kedro.org/en/stable/development/set_up_vscode.html#visualise-the-pipeline-with-kedro-viz)
3. Launching the Kedro Viz web application from the command line:

```bash
(.venv) $ kedro viz run
```

### Exercise 5

Register two new pipelines, `train` and `inference`, in the pipeline registry. Each should use the appropriate nodes from the `data_science` pipeline.

Visualise them in Kedro Viz.


## Remote data paths in the catalog

Kedro datasets can be virtually anything: database connections, REST APIs, and of course files.

These files can be referenced by using remote filepaths, not just local ones.

_The following example uses DBFS, but note that storing and accessing data using DBFS root or DBFS mounts is a deprecated pattern that Databricks does not recommend._

For example, imagine your Databricks DBFS contains an example `/Shared/patients.csv`:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

assert w.dbfs.exists("/Shared/patients.csv")
```

Most [official Kedro datasets](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-7.0.0/api/kedro_datasets.html) use [`fsspec`](https://filesystem-spec.readthedocs.io/),
a Python library that allows users to easily specify remote filepaths
with [lots of different cloud filesystems](https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations),
including DBFS.

```python
import fsspec
import os

dbfs = fsspec.filesystem(
    "dbfs", instance=os.environ["DATABRICKS_INSTANCE"], token=os.environ["DATABRICKS_TOKEN"],
)
files = dbfs.ls("/Shared/")
print(files)
```

To load this file from Kedro, you will need two things:

- A proper way of specifying the credentials, and
- A URL with the appropriate protocol specifier

For credentials, Kedro can read values from environment variables, but each property must be declared explicitly in a credentials file.

The default `.gitignore` includes rules that help prevent credentials files from being committed to version control by accident.

In this case, you could use the `local` environment:

```yaml
# conf/local/credentials.yml
databricks:
  instance: ${oc.env:DATABRICKS_INSTANCE}
  token: ${oc.env:DATABRICKS_TOKEN}
```

And finally, by specifying a `dbfs://` URL and the proper `credentials` key, you can define an appropriate dataset in the Kedro catalog:

```yaml
# conf/local/catalog.yml
patients:
  type: pandas.CSVDataset
  filepath: dbfs:///Shared/patients.csv
  credentials: databricks
```
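To make the link between the two files concrete, the sketch below imitates what happens to the `credentials: databricks` key: the name is looked up in the credentials mapping and replaced by the matching dictionary before the dataset is built. `resolve_credentials` is a hypothetical stand-in written for illustration, and the placeholder environment values are made up so the snippet runs without a Databricks workspace. Kedro performs the equivalent step internally when it builds the catalog.

```python
import os


def resolve_credentials(entry: dict, credentials: dict) -> dict:
    """Return a copy of a catalog entry with its `credentials` name
    replaced by the dictionary it refers to.

    Hypothetical helper for illustration only (not a Kedro function).
    """
    resolved = dict(entry)
    name = resolved.get("credentials")
    if isinstance(name, str):
        if name not in credentials:
            raise KeyError(f"Credentials '{name}' not found")
        resolved["credentials"] = credentials[name]
    return resolved


# Placeholder values (assumption) used only if the real variables are not set
os.environ.setdefault("DATABRICKS_INSTANCE", "example.cloud.databricks.com")
os.environ.setdefault("DATABRICKS_TOKEN", "not-a-real-token")

# Plain-Python equivalent of conf/local/credentials.yml after interpolation
credentials = {
    "databricks": {
        "instance": os.environ["DATABRICKS_INSTANCE"],
        "token": os.environ["DATABRICKS_TOKEN"],
    }
}

# Plain-Python equivalent of the `patients` entry in conf/local/catalog.yml
entry = {
    "type": "pandas.CSVDataset",
    "filepath": "dbfs:///Shared/patients.csv",
    "credentials": "databricks",
}

resolved = resolve_credentials(entry, credentials)
print(sorted(resolved["credentials"]))  # ['instance', 'token']
```

The resolved dictionary carries the same `instance` and `token` keys that were passed to `fsspec.filesystem("dbfs", ...)` earlier in this section.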

Notice the triple slash `///`: the first two slashes belong to the protocol specifier `dbfs://`, and the third marks the start of the absolute path `/Shared`.
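The effect of the third slash can be checked with the standard library. This is only an illustration of the URL anatomy using `urllib.parse`; it is not a description of how Kedro or `fsspec` parse paths internally.

```python
from urllib.parse import urlsplit

# Three slashes: empty host part, absolute path starting at /Shared
parts = urlsplit("dbfs:///Shared/patients.csv")
print(parts.scheme)        # dbfs
print(repr(parts.netloc))  # ''
print(parts.path)          # /Shared/patients.csv

# Two slashes: "Shared" is read as the host part, not as part of the path
wrong = urlsplit("dbfs://Shared/patients.csv")
print(wrong.netloc)        # Shared
print(wrong.path)          # /patients.csv
```

With only two slashes, the first path segment ends up in the host position, so the path no longer points at `/Shared/patients.csv`.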

The rest of the training will favour using the Unity Catalog rather than direct access to DBFS, as recommended by Databricks.

However, this shows that you can use this pattern for direct cloud storage access (`s3://`, `abfs://`), files in websites (`https://`), and more.

In [1]:
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

assert w.dbfs.exists("/Volumes/aza-databricks-b9b7aae-catalog/rocketfuel/unstructured/companies.csv")

In [5]:
from pyspark.sql import SparkSession

# `read` is an attribute of a SparkSession, not of the `pyspark` module,
# so a session has to be created (or retrieved) first
spark = SparkSession.builder.getOrCreate()

In [6]:
spark.read.csv("/Volumes/aza-databricks-b9b7aae-catalog/rocketfuel/unstructured/companies.csv").show(5)

+-----+--------------+--------------------+-----------------+-------------+
|  _c0|           _c1|                 _c2|              _c3|          _c4|
+-----+--------------+--------------------+-----------------+-------------+
|   id|company_rating|    company_location|total_fleet_count|iata_approved|
| 3888|          100%|         Isle of Man|              1.0|            f|
|46728|          100%|                NULL|              1.0|            f|
|34618|           38%|         Isle of Man|              1.0|            f|
|28619|          100%|Bosnia and Herzeg...|              1.0|            f|
+-----+--------------+--------------------+-----------------+-------------+
only showing top 5 rows


In [7]:
spark.read.csv("dbfs:/Volumes/aza-databricks-b9b7aae-catalog/rocketfuel/unstructured/companies.csv").show(5)

+-----+--------------+--------------------+-----------------+-------------+
|  _c0|           _c1|                 _c2|              _c3|          _c4|
+-----+--------------+--------------------+-----------------+-------------+
|   id|company_rating|    company_location|total_fleet_count|iata_approved|
| 3888|          100%|         Isle of Man|              1.0|            f|
|46728|          100%|                NULL|              1.0|            f|
|34618|           38%|         Isle of Man|              1.0|            f|
|28619|          100%|Bosnia and Herzeg...|              1.0|            f|
+-----+--------------+--------------------+-----------------+-------------+
only showing top 5 rows


In [8]:
spark.read.csv("dbfs:///Volumes/aza-databricks-b9b7aae-catalog/rocketfuel/unstructured/companies.csv").show(5)

+-----+--------------+--------------------+-----------------+-------------+
|  _c0|           _c1|                 _c2|              _c3|          _c4|
+-----+--------------+--------------------+-----------------+-------------+
|   id|company_rating|    company_location|total_fleet_count|iata_approved|
| 3888|          100%|         Isle of Man|              1.0|            f|
|46728|          100%|                NULL|              1.0|            f|
|34618|           38%|         Isle of Man|              1.0|            f|
|28619|          100%|Bosnia and Herzeg...|              1.0|            f|
+-----+--------------+--------------------+-----------------+-------------+
only showing top 5 rows
