
# Add new CLI commands to dataset factory docs #2930

Closed · wants to merge 6 commits
142 changes: 142 additions & 0 deletions docs/source/data/data_catalog.md
@@ -667,6 +667,148 @@ You can use dataset factories to define a catch-all pattern which will overwrite
Kedro will now treat all the datasets mentioned in your project's pipelines that do not appear as specific patterns or explicit entries in your catalog
as `pandas.CSVDataSet`.

### CLI commands for dataset factories

To manage your dataset factories, two new commands have been added to the Kedro CLI: `kedro catalog rank` (introduced in Kedro 0.18.12) and `kedro catalog resolve` (introduced in Kedro 0.18.13).
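Both commands are run from the root of your Kedro project, for example:

```bash
kedro catalog rank     # list dataset factories in the order they are matched against
kedro catalog resolve  # resolve dataset factories against the pipeline datasets
```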

#### Using `kedro catalog rank`

This command outputs a list of all dataset factories in the catalog, ranked in the order in which pipeline datasets are matched against them. The ordering is determined by the following criteria:

1. The number of non-placeholder characters in the pattern (descending)
2. The number of placeholders in the pattern (descending)
3. Alphabetical ordering

Consider a catalog file with the following patterns:

```yaml
"{layer}.{dataset_name}":
  type: pandas.CSVDataSet
  filepath: data/{layer}/{dataset_name}.csv

preprocessed_{dataset_name}:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/preprocessed_{dataset_name}.pq

processed_{dataset_name}:
  type: pandas.ParquetDataSet
  filepath: data/03_primary/processed_{dataset_name}.pq

"{dataset_name}_csv":
  type: pandas.CSVDataSet
  filepath: data/03_primary/{dataset_name}.csv

"{namespace}.{dataset_name}_pq":
  type: pandas.ParquetDataSet
  filepath: data/03_primary/{dataset_name}_{namespace}.pq

"{default_dataset}":
  type: pickle.PickleDataSet
  filepath: data/01_raw/{default_dataset}.pickle
```

Running `kedro catalog rank` will result in the following output:

```
- preprocessed_{dataset_name}
- processed_{dataset_name}
- '{namespace}.{dataset_name}_pq'
- '{dataset_name}_csv'
- '{layer}.{dataset_name}'
- '{default_dataset}'
```

As you can see, the entries are ranked first by the number of non-placeholder characters in the pattern, in descending order. Where two entries have the same number of non-placeholder characters, as with `{namespace}.{dataset_name}_pq` and `{dataset_name}_csv` (four each), they are then ranked by the number of placeholders, also in descending order. `{default_dataset}` is the least specific pattern possible and will always be matched against last.
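To see how the three criteria interact, the ordering can be reproduced with a short, self-contained Python sketch. This is only an illustration of the rules above, not Kedro's actual implementation; the `specificity` and `rank_patterns` helpers are hypothetical names.

```python
import re


def specificity(pattern: str) -> int:
    """Count the characters that sit outside any {placeholder}."""
    return len(re.sub(r"\{.*?\}", "", pattern))


def rank_patterns(patterns: list[str]) -> list[str]:
    # Higher specificity first, then more placeholders, then alphabetical.
    return sorted(
        patterns,
        key=lambda p: (-specificity(p), -p.count("{"), p),
    )


patterns = [
    "{layer}.{dataset_name}",
    "preprocessed_{dataset_name}",
    "processed_{dataset_name}",
    "{dataset_name}_csv",
    "{namespace}.{dataset_name}_pq",
    "{default_dataset}",
]
print(rank_patterns(patterns))
# ['preprocessed_{dataset_name}', 'processed_{dataset_name}',
#  '{namespace}.{dataset_name}_pq', '{dataset_name}_csv',
#  '{layer}.{dataset_name}', '{default_dataset}']
```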

#### Using `kedro catalog resolve`

This command resolves the dataset patterns in the catalog against the datasets used in the project pipelines. The resulting output contains all explicit dataset entries in the catalog, plus any dataset from the default pipeline that matches a dataset pattern.

To illustrate this, consider the following catalog file:

```yaml
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv

reviews:
  type: pandas.CSVDataSet
  filepath: data/01_raw/reviews.csv

shuttles:
  type: pandas.ExcelDataSet
  filepath: data/01_raw/shuttles.xlsx
  load_args:
    engine: openpyxl # Use the modern Excel engine; it is the default since Kedro 0.18.0

preprocessed_{name}:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/preprocessed_{name}.pq

"{default}":
  type: pandas.ParquetDataSet
  filepath: data/03_primary/{default}.pq
```

and the following pipeline in `pipeline.py`:

```python
from kedro.pipeline import Pipeline, node, pipeline

# The node functions are assumed to live in the pipeline's nodes module,
# as in the standard spaceflights starter.
from .nodes import create_model_input_table, preprocess_companies, preprocess_shuttles


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ]
    )
```

The resolved catalog output by the command will be as follows:

```yaml
companies:
  filepath: data/01_raw/companies.csv
  type: pandas.CSVDataSet
model_input_table:
  filepath: data/03_primary/model_input_table.pq
  type: pandas.ParquetDataSet
preprocessed_companies:
  filepath: data/02_intermediate/preprocessed_companies.pq
  type: pandas.ParquetDataSet
preprocessed_shuttles:
  filepath: data/02_intermediate/preprocessed_shuttles.pq
  type: pandas.ParquetDataSet
reviews:
  filepath: data/01_raw/reviews.csv
  type: pandas.CSVDataSet
shuttles:
  filepath: data/01_raw/shuttles.xlsx
  load_args:
    engine: openpyxl
  type: pandas.ExcelDataSet
```

In this output, `companies`, `reviews` and `shuttles` are explicit entries taken directly from the catalog; `preprocessed_companies` and `preprocessed_shuttles` were resolved by the `preprocessed_{name}` pattern; and `model_input_table`, which appears in the pipeline but not in the catalog, was resolved by the catch-all `{default}` pattern.

By default, the resolved catalog is printed to the terminal. If you wish to write it to a file instead, you can use the redirection operator `>`:

```bash
kedro catalog resolve > output_file.yaml
```

## Transcode datasets

You might come across a situation where you would like to read the same file using two different dataset implementations. Transcoding lets you load and save the same file, via its specified `filepath`, using different `DataSet` implementations.
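As a sketch of what this looks like, assuming Kedro's `@` transcoding suffix convention, a catalog could declare two entries that point at the same Parquet file; the dataset name and paths below are illustrative:

```yaml
my_dataframe@spark:
  type: spark.SparkDataSet
  filepath: data/02_intermediate/data.parquet
  file_format: parquet

my_dataframe@pandas:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/data.parquet
```

Nodes can then consume `my_dataframe@spark` or `my_dataframe@pandas` depending on which in-memory representation they need, while both entries refer to the same underlying file.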