Reorganise and improve the data catalog documentation (#2888)
* First drop of newly organised data catalog docs

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* linter

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Added to-do notes

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Afternoon's work in rewriting/reorganising content

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* More changes

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Further changes

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Another chunk of changes

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Final changes

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Revise ordering of pages

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Add new CLI commands to dataset factory docs (#2935)

* Add changes from #2930

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

* Lint

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

* Apply suggestions from code review

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Make code snippets collapsable

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

---------

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Bunch of changes from feedback

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* A few more tweaks

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update h1,h2,h3 font sizes

Signed-off-by: Tynan DeBold <thdebold@gmail.com>

* Add code snippet for using DataCatalog with Kedro config

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Few more tweaks

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/data/data_catalog.md

* Upgrade kedro-datasets for docs

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Improve prose

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

---------

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Tynan DeBold <thdebold@gmail.com>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Co-authored-by: Ahdra Merali <90615669+AhdraMeraliQB@users.noreply.github.com>
Co-authored-by: Tynan DeBold <thdebold@gmail.com>
Co-authored-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
5 people committed Aug 18, 2023
1 parent 16dd1df commit c45e629
Showing 24 changed files with 1,187 additions and 1,028 deletions.
2 changes: 2 additions & 0 deletions RELEASE.md
@@ -18,6 +18,8 @@
* Updated `kedro pipeline create` and `kedro catalog create` to use new `/conf` file structure.

## Documentation changes
* Revised the `data` section to restructure beginner and advanced pages about the Data Catalog and datasets.
* Moved contributor documentation to the [GitHub wiki](https://github.com/kedro-org/kedro/wiki/Contribute-to-Kedro).
* Update example of using generator functions in nodes.
* Added migration guide from the `ConfigLoader` to the `OmegaConfigLoader`. The `ConfigLoader` is deprecated and will be removed in the `0.19.0` release.

Expand Down
6 changes: 3 additions & 3 deletions docs/source/_static/css/qb1-sphinx-rtd.css
@@ -321,16 +321,16 @@ h1, h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend {
}

.wy-body-for-nav h1 {
-  font-size: 2.6rem;
+  font-size: 2.6rem !important;
letter-spacing: -0.3px;
}

.wy-body-for-nav h2 {
-  font-size: 2.3rem;
+  font-size: 2rem;
}

.wy-body-for-nav h3 {
-  font-size: 2.1rem;
+  font-size: 2rem;
}

.wy-body-for-nav h4 {
2 changes: 1 addition & 1 deletion docs/source/configuration/credentials.md
@@ -3,7 +3,7 @@
For security reasons, we strongly recommend that you *do not* commit any credentials or other secrets to version control.
Kedro is set up so that, by default, if a file inside the `conf` folder (and its subfolders) contains `credentials` in its name, it will be ignored by git.

-Credentials configuration can be used on its own directly in code or [fed into the `DataCatalog`](../data/data_catalog.md#feeding-in-credentials).
+Credentials configuration can be used on its own directly in code or [fed into the `DataCatalog`](../data/data_catalog.md#dataset-access-credentials).
If you would rather store your credentials in environment variables instead of a file, you can use the `OmegaConfigLoader` [to load credentials from environment variables](advanced_configuration.md#how-to-load-credentials-through-environment-variables) as described in the advanced configuration chapter.

## How to load credentials in code
225 changes: 225 additions & 0 deletions docs/source/data/advanced_data_catalog_usage.md
@@ -0,0 +1,225 @@
# Advanced: Access the Data Catalog in code

You can define a Data Catalog in two ways. Most use cases are covered by a YAML configuration file, as [illustrated previously](./data_catalog.md), but you can also access the Data Catalog programmatically through [`kedro.io.DataCatalog`](/kedro.io.DataCatalog), using an API that lets you configure data sources in code and work with the IO module within notebooks.
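
If the catalog is already defined in YAML, you can also construct the `DataCatalog` object from that configuration instead of building it by hand. A minimal sketch, assuming a Kedro 0.18 project layout and the `ConfigLoader` (adjust `conf_source` and the glob patterns to your project):

```python
from kedro.config import ConfigLoader
from kedro.io import DataCatalog

# Read catalog and credentials YAML files from the project's conf folder
conf_loader = ConfigLoader(conf_source="conf")
conf_catalog = conf_loader.get("catalog*", "catalog*/**")
conf_credentials = conf_loader.get("credentials*", "credentials*/**")

# Build the DataCatalog, resolving credential references in the catalog config
io = DataCatalog.from_config(conf_catalog, credentials=conf_credentials)
```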

## How to configure the Data Catalog

To use the `DataCatalog` API, construct a `DataCatalog` object programmatically in a file like `catalog.py`.

The following example uses several pre-built data loaders documented in the [API reference documentation](/kedro_datasets).

```python
from kedro.io import DataCatalog
from kedro_datasets.pandas import (
    CSVDataSet,
    SQLTableDataSet,
    SQLQueryDataSet,
    ParquetDataSet,
)

io = DataCatalog(
    {
        "bikes": CSVDataSet(filepath="../data/01_raw/bikes.csv"),
        "cars": CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=",")),
        "cars_table": SQLTableDataSet(
            table_name="cars", credentials=dict(con="sqlite:///kedro.db")
        ),
        "scooters_query": SQLQueryDataSet(
            sql="select * from cars where gear=4",
            credentials=dict(con="sqlite:///kedro.db"),
        ),
        "ranked": ParquetDataSet(filepath="ranked.parquet"),
    }
)
```

When using `SQLTableDataSet` or `SQLQueryDataSet` you must provide a `con` key containing a [SQLAlchemy-compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) database connection string. In the example above, we pass it as part of the `credentials` argument. An alternative to `credentials` is to put `con` into `load_args` and `save_args` (`SQLTableDataSet` only), as sketched below.
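
As a sketch of that alternative, the `cars_table` entry above could pass the connection string through `load_args` and `save_args` instead:

```python
from kedro_datasets.pandas import SQLTableDataSet

# Same dataset as "cars_table" above, but with `con` supplied via
# load_args/save_args rather than the credentials argument
cars_table = SQLTableDataSet(
    table_name="cars",
    load_args=dict(con="sqlite:///kedro.db"),
    save_args=dict(con="sqlite:///kedro.db"),
)
```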

## How to view the available data sources

To review the `DataCatalog`:

```python
io.list()
```
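
For a large catalog you can narrow the listing with a regular expression. A small sketch, assuming the `regex_search` parameter of `DataCatalog.list`:

```python
# List only the datasets whose names start with "cars"
io.list(regex_search="^cars")  # ['cars', 'cars_table']
```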

## How to load datasets programmatically

To access each dataset by its name:

```python
cars = io.load("cars") # data is now loaded as a DataFrame in 'cars'
gear = cars["gear"].values
```

The following steps happened behind the scenes when `load` was called:

- The value `cars` was located in the Data Catalog
- The corresponding `AbstractDataSet` object was retrieved
- The `load` method of this dataset was called
- This `load` method delegated the loading to the underlying pandas `read_csv` function
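
In other words, the catalog call above is roughly equivalent to constructing the dataset and calling its `load` method yourself. A minimal sketch of the same operation without the catalog:

```python
from kedro_datasets.pandas import CSVDataSet

# Roughly what io.load("cars") does behind the scenes
dataset = CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=","))
cars = dataset.load()  # delegates to pandas.read_csv
```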

## How to save data programmatically

```{warning}
This pattern is not recommended unless you are using platform notebook environments (SageMaker, Databricks, etc.) or writing unit/integration tests for your Kedro pipeline. Use the YAML approach in preference.
```

### How to save data to memory

To save data using an API similar to that used to load data:

```python
from kedro.io import MemoryDataSet

memory = MemoryDataSet(data=None)
io.add("cars_cache", memory)
io.save("cars_cache", "Memory can store anything.")
io.load("cars_cache")
```

### How to save data to a SQL database for querying

To put the data in a SQLite database:

```python
import os

# This cleans up the database in case it exists at this point
try:
    os.remove("kedro.db")
except FileNotFoundError:
    pass

io.save("cars_table", cars)

# rank scooters by their mpg
ranked = io.load("scooters_query")[["brand", "mpg"]]
```

### How to save data in Parquet

To save the processed data in Parquet format:

```python
io.save("ranked", ranked)
```

```{warning}
Saving `None` to a dataset is not allowed!
```

## How to access a dataset with credentials

Before instantiating the `DataCatalog`, Kedro will first attempt to read [the credentials from the project configuration](../configuration/credentials.md). The resulting dictionary is then passed into `DataCatalog.from_config()` as the `credentials` argument.

Let's assume that the project contains the file `conf/local/credentials.yml` with the following contents:

```yaml
dev_s3:
client_kwargs:
aws_access_key_id: key
aws_secret_access_key: secret

scooters_credentials:
con: sqlite:///kedro.db

my_gcp_credentials:
id_token: key
```

Your code will look as follows:

```python
CSVDataSet(
    filepath="s3://test_bucket/data/02_intermediate/company/motorbikes.csv",
    load_args=dict(sep=",", skiprows=5, skipfooter=1, na_values=["#NA", "NA"]),
    credentials=dict(key="token", secret="key"),
)
```
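
If you build the catalog with `DataCatalog.from_config` instead, a dataset definition can reference a named credentials entry and Kedro resolves it for you. A sketch assuming the `dev_s3` entry shown above:

```python
from kedro.io import DataCatalog

catalog_config = {
    "motorbikes": {
        "type": "pandas.CSVDataSet",
        "filepath": "s3://test_bucket/data/02_intermediate/company/motorbikes.csv",
        "credentials": "dev_s3",  # resolved against the credentials dictionary below
    }
}

credentials = {
    "dev_s3": {
        "client_kwargs": {
            "aws_access_key_id": "key",
            "aws_secret_access_key": "secret",
        }
    }
}

io = DataCatalog.from_config(catalog_config, credentials=credentials)
```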

## How to version a dataset using the Code API

In an earlier section of the documentation we described how [Kedro enables dataset and ML model versioning](./data_catalog.md#dataset-versioning).

If you require programmatic control over load and save versions of a specific dataset, you can instantiate `Version` and pass it as a parameter to the dataset initialisation:

```python
from kedro.io import DataCatalog, Version
from kedro_datasets.pandas import CSVDataSet
import pandas as pd

data1 = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
data2 = pd.DataFrame({"col1": [7], "col2": [8], "col3": [9]})
version = Version(
    load=None,  # load the latest available version
    save=None,  # generate save version automatically on each save operation
)

test_data_set = CSVDataSet(
    filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_data_set": test_data_set})

# save the dataset to data/01_raw/test.csv/<version>/test.csv
io.save("test_data_set", data1)
# save the dataset into a new file data/01_raw/test.csv/<version>/test.csv
io.save("test_data_set", data2)

# load the latest version from data/01_raw/test.csv/*/test.csv
reloaded = io.load("test_data_set")
assert data2.equals(reloaded)
```

In the example above, we do not fix any versions. The behaviour of load and save operations becomes slightly different when we set a version:

```python
version = Version(
    load="my_exact_version",  # load exact version
    save="my_exact_version",  # save to exact version
)

test_data_set = CSVDataSet(
    filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_data_set": test_data_set})

# save the dataset to data/01_raw/test.csv/my_exact_version/test.csv
io.save("test_data_set", data1)
# load from data/01_raw/test.csv/my_exact_version/test.csv
reloaded = io.load("test_data_set")
assert data1.equals(reloaded)

# raises DataSetError since the path
# data/01_raw/test.csv/my_exact_version/test.csv already exists
io.save("test_data_set", data2)
```

We do not recommend passing exact load and/or save versions, since it might lead to inconsistencies between operations. For example, if versions for load and save operations do not match, a save operation would result in a `UserWarning`.

Imagine a simple pipeline with two nodes, where B takes the output from A. If you specify the load-version of the data for B to be `my_data_2023_08_16.csv`, the data that A produces (`my_data_20230818.csv`) is not used.

```text
Node_A -> my_data_20230818.csv
my_data_2023_08_16.csv -> Node_B
```

In code:

```python
version = Version(
    load="my_data_2023_08_16.csv",  # load exact version
    save="my_data_20230818.csv",  # save to exact version
)

test_data_set = CSVDataSet(
    filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_data_set": test_data_set})

io.save("test_data_set", data1)  # emits a UserWarning due to version inconsistency

# raises DataSetError since the file
# data/01_raw/test.csv/my_data_2023_08_16.csv/test.csv does not exist
reloaded = io.load("test_data_set")
```