# First Steps with Kedro

<img src="static/kedro-horizontal-color-on-light.png" width="400" alt="Kedro">

This session covers the foundational concepts of Kedro, including the Data Catalog, the Config Loader, Nodes, and Pipelines. It's inspired by the [Spaceflights tutorial](https://docs.kedro.org/en/stable/tutorial/spaceflights_tutorial.html).

## The `DataCatalog`

Normally, you would read your CSV data like this:

In [None]:
import pandas as pd

pd.read_csv("data/companies.csv").head()

This is fine, and it works. However, for large projects it scales poorly:

- What if you move all your data files somewhere else? You would need to `Cmd+F` a bunch of paths across different notebooks and Python modules and change them all.
- How do you differentiate between development and production? You could maybe create an `if` block, or pass paths as environment variables. Each option has pros and cons.
- How do you quickly assess all the input files that you need in a project?
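
As a rough illustration of the second point, here is what switching paths through an environment variable might look like. The names `SPACEFLIGHTS_ENV`, `DATA_ROOTS`, and `resolve_path`, as well as the bucket URL, are hypothetical and not part of Kedro; the sketch only shows how path handling starts to leak into application code:

```python
import os

# Hypothetical mapping from environment name to data root (not a Kedro feature)
DATA_ROOTS = {
    "dev": "data",
    "prod": "s3://my-bucket/data",  # made-up bucket, for illustration only
}


def resolve_path(filename: str) -> str:
    """Build a dataset path for the environment named in SPACEFLIGHTS_ENV."""
    env = os.environ.get("SPACEFLIGHTS_ENV", "dev")
    return f"{DATA_ROOTS[env]}/{filename}"


print(resolve_path("companies.csv"))
```

Every module that reads data would need to import and call such a helper, which is the kind of bookkeeping the Data Catalog is meant to centralise.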

Kedro’s [Data Catalog](https://docs.kedro.org/en/latest/data/) is a registry of all data sources available for use by the project. It offers a separate place to declare details of the datasets your projects use. Kedro provides built-in datasets for different file types and file systems so you don’t have to write any of the logic for reading or writing data.

Kedro offers a range of datasets, including CSV, Excel, Parquet, Feather, HDF5, JSON, Pickle, SQL Tables, SQL Queries, Spark DataFrames, and more. They are supported by the APIs of pandas, spark, networkx, matplotlib, yaml, and beyond. The Data Catalog relies on fsspec to read and save data from a variety of data stores including local file systems, network file systems, cloud object stores, and Hadoop. You can pass arguments to load and save operations, and use versioning and credentials for data access.

To start using the Data Catalog, create an instance of the `DataCatalog` class with a dictionary configuration as follows, to load our first dataset, *companies*:

In [None]:
from kedro.io import DataCatalog
catalog = DataCatalog.from_config(
    {
        "companies": {
            "type": "pandas.CSVDataset",
            "filepath": "data/companies.csv",
        }
    }
)

Each entry in the dictionary represents a **dataset**, and each dataset has a **type** as well as some extra properties. Datasets are Python classes that take care of all the I/O needs in Kedro. In this case, we're using `kedro_datasets.pandas.CSVDataset`; you can read [its full documentation](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.pandas.CSVDataset.html) online.
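
To make the idea of "datasets as Python classes" more concrete, here is a toy sketch built only on the standard library. `ToyCSVDataset` is a hypothetical class and not Kedro's implementation; real Kedro datasets also deal with remote file systems, versioning, and credentials:

```python
import csv
import tempfile
from pathlib import Path


class ToyCSVDataset:
    """Hypothetical stand-in for a dataset: knows where the data lives and how to read/write it."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def load(self) -> list:
        # Read every row as a dictionary keyed by the header line
        with self._filepath.open(newline="") as f:
            return list(csv.DictReader(f))

    def save(self, rows: list) -> None:
        # Write rows back, using the keys of the first row as the header
        with self._filepath.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)


# Round trip through a temporary file
toy_path = Path(tempfile.mkdtemp()) / "toy.csv"
toy_dataset = ToyCSVDataset(str(toy_path))
toy_dataset.save([{"id": "1", "company_rating": "100%"}])
toy_rows = toy_dataset.load()
print(toy_rows)
```

The calling code only ever sees `load()` and `save()`, so the storage details can change without touching the rest of the project.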

After the catalog is created, `catalog.list()` will yield a list of the available dataset names, which you can load using the `catalog.load(<dataset_name>)` method:

In [None]:
catalog.list()

In [None]:
companies = catalog.load("companies")
type(companies)

In [None]:
companies.head(5)

Let's proceed by loading the next two datasets: *reviews* and *shuttles*. Now, instead of loading them from local files on disk, we will use URLs:

In [None]:
from kedro.io import DataCatalog
catalog = DataCatalog.from_config(
    {
        "companies": {
            "type": "pandas.CSVDataset",
            "filepath": "data/companies.csv",
        },
        "reviews": {
            "type": "pandas.CSVDataset",
            # URL instead of local file
            "filepath": "https://raw.githubusercontent.com/kedro-org/kedro-starters/refs/heads/main/spaceflights-pandas/%7B%7B%20cookiecutter.repo_name%20%7D%7D/data/01_raw/reviews.csv",
        },
        "shuttles": {
            # Different dataset
            "type": "pandas.ExcelDataset",
            "filepath": "https://github.com/kedro-org/kedro-starters/raw/refs/heads/main/spaceflights-pandas/%7B%7B%20cookiecutter.repo_name%20%7D%7D/data/01_raw/shuttles.xlsx",
            # Can add extra arguments for the underlying pandas.read_excel function
            "load_args": {
                "engine": "openpyxl"
            }
        }
    }
)
catalog.list()

In [None]:
catalog.load("reviews").head()

In [None]:
catalog.load("shuttles").head()

## The `OmegaConfigLoader`

Instead of creating the Data Catalog by hand like this, Kedro usually stores configuration in YAML files. To load them, Kedro offers a [configuration loader](https://docs.kedro.org/en/latest/configuration/configuration_basics.html) based on the [OmegaConf](https://omegaconf.readthedocs.io/) library, called the `OmegaConfigLoader`. This adds several interesting features, such as:

- Consolidating different configuration files into one
- Substitution, templating
- [Resolvers](https://omegaconf.readthedocs.io/en/2.3_branch/custom_resolvers.html)
- And [much more](https://docs.kedro.org/en/latest/configuration/advanced_configuration.html)

To start using it, first save the catalog configuration to a `catalog.yml` file, and then use `OmegaConfigLoader` as follows:

In [None]:
%%writefile catalog.yml
companies:
  type: pandas.CSVDataset
  filepath: data/companies.csv

reviews:
  type: pandas.CSVDataset
  filepath: https://raw.githubusercontent.com/kedro-org/kedro-starters/refs/heads/main/spaceflights-pandas/%7B%7B%20cookiecutter.repo_name%20%7D%7D/data/01_raw/reviews.csv

shuttles:
  type: pandas.ExcelDataset
  filepath: https://raw.githubusercontent.com/kedro-org/kedro-starters/refs/heads/main/spaceflights-pandas/%7B%7B%20cookiecutter.repo_name%20%7D%7D/data/01_raw/shuttles.xlsx
  load_args:
    engine: openpyxl

In [None]:
from kedro.config import OmegaConfigLoader

config_loader = OmegaConfigLoader(
    conf_source=".",  # Directory where configuration files are located
)

In [None]:
catalog_config = config_loader.get("catalog")
catalog_config

As you can see, `config_loader.get("catalog")` gets you the same dictionary we crafted by hand earlier.

However, the repetition in the URLs seems like an invitation to trouble. Let's declare a variable `_root` inside the YAML file using OmegaConf's interpolation syntax and load the catalog config again:

In [None]:
%%writefile catalog.yml
_root: https://raw.githubusercontent.com/kedro-org/kedro-starters/refs/heads/main/spaceflights-pandas/%7B%7B%20cookiecutter.repo_name%20%7D%7D/

companies:
  type: pandas.CSVDataset
  filepath: data/companies.csv

reviews:
  type: pandas.CSVDataset
  filepath: ${_root}data/01_raw/reviews.csv

shuttles:
  type: pandas.ExcelDataset
  filepath: ${_root}data/01_raw/shuttles.xlsx
  load_args:
    engine: openpyxl

In [None]:
catalog_config = config_loader.get("catalog")
catalog_config
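
The `${_root}` references are resolved when the configuration is loaded. As a rough analogy only, the standard library's `string.Template` uses the same `${name}` placeholder syntax, so the effect can be sketched without Kedro or OmegaConf. This is not how `OmegaConfigLoader` is implemented, and `raw_config` is an illustrative stand-in for the parsed YAML:

```python
from string import Template

# Illustrative stand-in for the parsed (but not yet resolved) YAML file
raw_config = {
    "_root": "https://raw.githubusercontent.com/kedro-org/kedro-starters/refs/heads/main/spaceflights-pandas/%7B%7B%20cookiecutter.repo_name%20%7D%7D/",
    "reviews": {"filepath": "${_root}data/01_raw/reviews.csv"},
    "shuttles": {"filepath": "${_root}data/01_raw/shuttles.xlsx"},
}

# Treat underscore-prefixed top-level keys as variables (an assumption of this sketch)
variables = {k: v for k, v in raw_config.items() if k.startswith("_")}

# Substitute the variables into every string of every dataset entry
resolved = {
    name: {key: Template(value).substitute(variables) for key, value in entry.items()}
    for name, entry in raw_config.items()
    if not name.startswith("_")
}

print(resolved["reviews"]["filepath"])
```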

In [None]:
catalog = DataCatalog.from_config(catalog_config)
catalog

In [None]:
catalog.load("companies").head(5)

In [None]:
# catalog.load("reviews").head(5)

In [None]:
# catalog.load("shuttles").head(5)

## Nodes and pipelines

Now comes the interesting part. Kedro structures the computation as Directed Acyclic Graphs (DAGs), which are created by instantiating `Pipeline` objects with a list of `Node`s. By linking the inputs and outputs of each node, Kedro is then able to perform a topological sort and produce a graph.
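
The ordering step can be sketched with the standard library's `graphlib.TopologicalSorter`. This illustrates the idea rather than Kedro's own code: the `toy_nodes` dictionary layout is an assumption made for the example, and the dataset names `preprocessed_shuttles` and `model_input_table` are hypothetical:

```python
from graphlib import TopologicalSorter

# Each toy node declares the dataset names it consumes and produces
toy_nodes = {
    "create_model_input_table": {
        "inputs": ["preprocessed_shuttles", "preprocessed_companies", "reviews"],
        "outputs": ["model_input_table"],
    },
    "preprocess_companies": {
        "inputs": ["companies"],
        "outputs": ["preprocessed_companies"],
    },
    "preprocess_shuttles": {
        "inputs": ["shuttles"],
        "outputs": ["preprocessed_shuttles"],
    },
}

# Map every dataset to the node that produces it
producers = {out: name for name, spec in toy_nodes.items() for out in spec["outputs"]}

# A node depends on whichever nodes produce its inputs;
# raw inputs such as "companies" have no producer and add no dependency
dependencies = {
    name: {producers[i] for i in spec["inputs"] if i in producers}
    for name, spec in toy_nodes.items()
}

# static_order() yields each node only after all of its dependencies
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Note that the nodes were declared in the "wrong" order on purpose: the order is derived from the dataset names alone, not from the position in the list.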

Let's start by creating a simple pipeline with a single node. This node will be a `preprocess_companies` function that cleans and prepares the `companies` input table.

In [None]:
import pandas as pd

def _is_true(x: pd.Series) -> pd.Series:
    return x == "t"

def _parse_percentage(x: pd.Series) -> pd.Series:
    x = x.str.replace("%", "")
    x = x.astype(float) / 100
    return x

def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    companies["company_rating"] = _parse_percentage(companies["company_rating"])
    return companies

In [None]:
companies = catalog.load("companies")

preprocess_companies(companies).head()

Now, let's wrap it using the `node` convenience function from Kedro:

In [None]:
from kedro.pipeline import node

preprocess_companies_node = node(func=preprocess_companies, inputs="companies", outputs="preprocessed_companies")
preprocess_companies_node

Conceptually, a `Node` is a wrapper around a Python function that defines a single step in a pipeline. It has inputs and outputs, which are the names of the Data Catalog datasets that the function will receive and return, respectively. Therefore, you could execute it as follows:

```python
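# Load every input from the catalog and pass them positionally to the wrapped function
preprocess_companies_node.func(
    *[catalog.load(input_dataset) for input_dataset in preprocess_companies_node.inputs],
)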
```

Let's not do that though; Kedro will take care of it.

The next step is to assemble the pipeline. In this case, it will only have 1 node:

In [None]:
from kedro.pipeline import pipeline

data_processing = pipeline([preprocess_companies_node])
data_processing

And finally, you can now execute the pipeline. For the purposes of this tutorial, you can use Kedro's `SequentialRunner` directly:

In [None]:
from kedro.runner import SequentialRunner

outputs = SequentialRunner().run(data_processing, catalog=catalog)

The `.run(...)` method returns "Any node outputs that cannot be processed by the `DataCatalog`". Since `preprocessed_companies` is not declared in the Data Catalog, it appears in the returned dictionary:

In [None]:
outputs.keys()

In [None]:
outputs["preprocessed_companies"].head()
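
The behaviour described above can be mimicked with a toy runner. Everything in this sketch (`toy_run`, the dictionary-based catalog, the tuple layout for nodes) is hypothetical and far simpler than `SequentialRunner`; it only shows why an output with no catalog entry ends up in the returned dictionary:

```python
def toy_run(nodes, catalog):
    """Run (func, input_name, output_name) tuples in order.

    Outputs whose name is registered in `catalog` are stored there;
    all other outputs are returned to the caller.
    """
    free_outputs = {}
    for func, input_name, output_name in nodes:
        # Inputs come from the catalog or from an earlier free output
        if input_name in catalog:
            value = catalog[input_name]
        else:
            value = free_outputs[input_name]
        result = func(value)
        if output_name in catalog:
            catalog[output_name] = result
        else:
            free_outputs[output_name] = result
    return free_outputs


def parse_percentage(text: str) -> float:
    return float(text.replace("%", "")) / 100


# "parsed_rating" is not registered in the toy catalog, so it is returned
toy_catalog = {"company_rating": "50%"}
toy_outputs = toy_run([(parse_percentage, "company_rating", "parsed_rating")], toy_catalog)
print(toy_outputs)
```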

## Exercises

### Exercise 1

Create a Python function `preprocess_shuttles` with the following result:

In [None]:
preprocess_shuttles(catalog.load("shuttles")).head()

Then create a Kedro node named `preprocess_shuttles_node` by specifying the correct function, inputs, and outputs.

In [None]:
%load solutions/nb01_ex01.py

### Exercise 2

Write a `create_model_input_table` function that joins all the 3 datasets into one using the common columns (hint: look at columns ending with `_id`):

In [None]:
shuttles = catalog.load("shuttles")
companies = catalog.load("companies")
reviews = catalog.load("reviews")

create_model_input_table(shuttles, companies, reviews).columns

In [None]:
%load solutions/nb01_ex02.py

### Exercise 3

Create and run a complete `data_processing` pipeline that assembles all the nodes written so far: preprocess two input tables and then merge three cleaned input tables.

In [None]:
outputs = SequentialRunner().run(data_processing, catalog=catalog)
outputs.keys()

In [None]:
%load solutions/nb01_ex03.py