---
badges: true
categories:
- python
- kedro
description: DataCatalog is one of the first concepts you learn in Kedro. Although it is important, users rarely have to interact with it directly. This post explains the concept with a minimal example of how to save and load data with Kedro.
toc: true
hide: false
date: '2024-03-26'
---

# How to save or load data with Kedro?
The first answer you are likely to get is "use the `DataCatalog`", but what really is the `DataCatalog`? Some may say it is the `catalog.yml`. There is some truth in that, but it's not an accurate answer.

Let's try a different explanation and forget about the existence of `DataCatalog` for a moment (you don't even need to use this class directly in your Kedro project). Instead, we focus on *how* data is loaded or saved in a Kedro project.

First, we need a Python function that takes some inputs and returns some outputs:

In [5]:
import pandas as pd

def dummy_func():
    df = pd.DataFrame([{"foo": "bar"}])
    # df.to_csv("my_data.csv") # You don't need to save it explicitly
    return df


This function takes no input but produces a DataFrame. How does Kedro know how to save this data? Kedro works with a `Node` rather than a bare function. A node is a Python function plus its inputs and outputs, where the inputs and outputs are merely the names of the data rather than the actual objects.

In [6]:
from kedro.pipeline import node, pipeline

dummy_node = node(func=dummy_func, inputs=None, outputs="my_data")

You can call the function or the node directly, but you normally don't need to, because the Kedro runner does this for you when it executes a pipeline.

In [11]:
dummy_func()

   foo
0  bar


In [13]:
result = dummy_node()
result

{'my_data':    foo
 0  bar}

The node returns the DataFrame inside a dictionary under the key "my_data" (the `outputs` name defined in the node). The last step is to save it as a file, which is where the [`DataCatalog` or `catalog.yml`](https://docs.kedro.org/en/latest/data/data_catalog.html) comes into the picture.

In [21]:
from kedro.io import DataCatalog

# Map the dataset name "my_data" to a dataset type and a file location
catalog_config = {
    "my_data": {
        "type": "pandas.CSVDataset",
        "filepath": "my_csv.csv",
    }
}
catalog = DataCatalog.from_config(catalog_config)
catalog.save("my_data", result["my_data"])

We can check if the data is saved correctly:

In [22]:
pd.read_csv("my_csv.csv")

   foo
0  bar


In [23]:
# Or use DataCatalog
catalog.load("my_data")

   foo
0  bar


## catalog.yml

In [None]:
catalog_config = {
    "my_data": {
        "type": "pandas.CSVDataset",
        "filepath": "my_csv.csv",
    }
}
catalog = DataCatalog.from_config(catalog_config)

Going back to this snippet, `catalog.yml` is simply `catalog_config` written in YAML.

In [27]:
catalog_yml = """
my_data:
  type: pandas.CSVDataset
  filepath: my_csv.csv
"""

import yaml
catalog_config = yaml.safe_load(catalog_yml)
new_catalog = DataCatalog.from_config(catalog_config)
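As a quick sanity check, the YAML text can be parsed and compared with the dictionary used earlier. This is a small sketch that only relies on PyYAML; the variable name `config_from_yaml` is introduced here for illustration.

```python
import yaml

catalog_yml = """
my_data:
  type: pandas.CSVDataset
  filepath: my_csv.csv
"""

catalog_config = {
    "my_data": {
        "type": "pandas.CSVDataset",
        "filepath": "my_csv.csv",
    }
}

# Parsing the YAML gives back an ordinary nested dictionary.
config_from_yaml = yaml.safe_load(catalog_yml)
print(config_from_yaml == catalog_config)
```

If the two are equal, the catalog built from the YAML text is configured exactly like the one built from the dictionary.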

# Summary
These abstractions are usually hidden from end users. You do not need to use the `DataCatalog` directly if you are working in a Kedro project. Behind the scenes, this is what happens:

1. The function's arguments and return values are mapped to dataset names according to the node definition.
2. `DataCatalog` loads and saves the data by name, looking each name up in `catalog.yml` to work out, for example, whether it should be read from a CSV or a Parquet file.
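
The two steps above can be imitated with a toy runner written using only the standard library. This is a rough sketch of the idea, not Kedro's actual implementation: `toy_catalog`, `run_node`, `save_csv` and `load_csv` are hypothetical names, and rows are plain dictionaries instead of DataFrames.

```python
import csv
import tempfile
from pathlib import Path


def save_csv(path, rows):
    # Write a list of dictionaries to a CSV file.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def load_csv(path):
    # Read a CSV file back into a list of dictionaries.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


# Toy stand-in for catalog.yml: dataset name -> how and where to store it.
workdir = Path(tempfile.mkdtemp())
toy_catalog = {
    "my_data": {
        "save": save_csv,
        "load": load_csv,
        "filepath": workdir / "my_csv.csv",
    }
}


def dummy_func():
    return [{"foo": "bar"}]


def run_node(func, inputs, outputs, catalog):
    # Load every input by name, using whatever the catalog says about it.
    args = [catalog[name]["load"](catalog[name]["filepath"]) for name in inputs]
    # Call the plain Python function with the loaded objects.
    result = func(*args)
    # Map the return value to the output name and save it via the catalog.
    entry = catalog[outputs]
    entry["save"](entry["filepath"], result)


run_node(dummy_func, inputs=[], outputs="my_data", catalog=toy_catalog)
reloaded = load_csv(toy_catalog["my_data"]["filepath"])
print(reloaded)
```

The function itself never mentions a file path or a file format. Swapping CSV for another format would only mean changing the catalog entry, which is the point of keeping that information in `catalog.yml`.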
