Make sure all example code blocks for datasets are runnable #2287

merelcht · 2022-08-01T14:46:41Z

Description

Some of the code examples we provide in the API docs for datasets (https://docs.kedro.org/en/stable/kedro_datasets.html#module-kedro_datasets) aren't actually runnable. Some datasets have straightforward examples that can be copy-pasted and run straight away; others depend on external setup such as S3, and it isn't made clear that those snippets won't run as is.

Implementation

Update all code snippets in the dataset API docs to basic examples that can be run. Where a simpler example doesn't make sense, clarify that the snippet can't be run as is and state what additional setup would be needed.

Please also make sure the examples refer to kedro-datasets rather than kedro.

merelcht · 2022-11-09T16:22:45Z

Part of this task was completed in #1962

Remaining datasets are:

dask.ParquetDataSet
holoviews.HoloviewsWriter
pandas.GBQQueryDataSet
pandas.GBQTableDataSet
pandas.SQLQueryDataSet
pandas.SQLTableDataSet
redis.PickleDataSet
all Spark datasets
tensorflow.TensorFlowModelDataset

JoaoAreias · 2023-03-21T20:37:28Z

Hi, I wanted to start contributing to Kedro, it's such an amazing tool, may I take this one?

merelcht · 2023-03-22T10:13:10Z

Hi @JoaoAreias, Welcome to the Kedro community 😄 That would be wonderful! This work will need to be done in kedro-datasets in the kedro-plugins repo: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets

JoaoAreias · 2023-03-22T13:35:58Z

Thank you! Will do!

noklam · 2023-09-18T14:26:29Z

In addition, I think we should make sure the examples import from kedro-datasets, not kedro. I will add this to the requirements. @merelcht

deepyaman · 2023-09-25T13:13:35Z

It would be nice to actually use doctest for these. (Rather than verifying once manually.)

stichbury · 2023-10-18T13:41:29Z

Agreed @deepyaman! Putting the examples into a test script and including as part of ongoing CI would be awesome.
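
As a rough illustration of the doctest idea discussed above, the sketch below runs the example embedded in a class docstring and reports failures. `TextDataset` and `run_docstring` are hypothetical stand-ins invented for this sketch; they are not part of kedro-datasets, and the real CI setup may look quite different.

```python
import doctest
from pathlib import Path


class TextDataset:
    """Hypothetical stand-in for a dataset class; not part of kedro-datasets.

    Example:

    >>> import tempfile
    >>> from pathlib import Path
    >>> path = Path(tempfile.mkdtemp()) / "greeting.txt"
    >>> dataset = TextDataset(filepath=path)
    >>> dataset.save("hello")
    >>> dataset.load()
    'hello'
    """

    def __init__(self, filepath):
        self._filepath = Path(filepath)

    def save(self, data):
        # Write the string to a local file; returns None so the doctest prints nothing.
        self._filepath.write_text(data)

    def load(self):
        return self._filepath.read_text()


def run_docstring(obj):
    """Run the examples in obj's docstring and return the doctest results."""
    test = doctest.DocTestParser().get_doctest(
        obj.__doc__, {obj.__name__: obj}, obj.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


results = run_docstring(TextDataset)
print(results.failed, "failed out of", results.attempted)
```

Because the example only touches a temporary local directory, it can run in CI without credentials or external services.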

Gundalai-Batkhuu · 2023-10-20T05:50:32Z

I would like to take this one

astrojuanlu · 2023-10-20T09:19:54Z

Hi @Gundalai-Batkhuu, feel free to take #2287, #2604, #2643, and if @ggermade does not reappear in a few days, #2008. Just bear in mind that we are not assigning all the issues at once, focus on one of them and feel free to open a PR when you're ready. Other people might take them, so be quick! And rather than asking us whether you can start working on them, either open a PR directly, or if there's something you don't understand about the issue or the scope, ask for clarification in the relevant ticket. This will increase your chances of success. Good luck!

deepyaman · 2023-10-25T01:45:57Z

kedro-org/kedro-plugins#416 is a first attempt at validating using doctest. Example run: https://github.com/kedro-org/kedro-plugins/actions/runs/6634484238/job/18023981594?pr=416

As @merelcht mentioned, some of the tests reference S3, or data files that don't exist; many of these can probably be updated. There are some where the issue is just that the correct output isn't reflected. In certain cases, the doctests seem to be catching legitimate mistakes (e.g. missing arguments).

Want to take a pause and check, before investing more time on this--are we aligned on/okay with using doctest?

deepyaman · 2023-10-25T02:12:44Z

Failure details

kedro_datasets.dask.parquet_dataset.ParquetDataset: It's technically not runnable because botocore client creation fails, given AWS credentials from the environment and passed manually. However, the S3 use would fail regardless, so probably best to replace with a local example.
kedro_datasets.databricks.managed_table_dataset.ManagedTableDataset: Seems it's catching an error about wrong write mode?
kedro_datasets.matplotlib.matplotlib_writer.MatplotlibWriter: The examples are working, but need to make sure the output isn't checked using ELLIPSIS or something.
kedro_datasets.pandas.deltatable_dataset.DeltaTableDataset: Not able to find some _last_checkpoint? Seems like a legit error at first glance.
kedro_datasets.pandas.gbq_dataset.GBQQueryDataset: No actual BigQuery to connect to; assume this will have to be ignored.
kedro_datasets.pandas.gbq_dataset.GBQTableDataset: Same as above.
kedro_datasets.pandas.generic_dataset.GenericDataset: Haven't looked into it, but I assume this is a bug due to not specifying params for reading/writing with pandas, and due to how the defaults are handled with index.
kedro_datasets.pandas.sql_dataset.SQLQueryDataset: Not a valid connection string. This could potentially be done with SQLite or something.
kedro_datasets.pandas.sql_dataset.SQLTableDataset: Same as above.
kedro_datasets.partitions.incremental_dataset.IncrementalDataset: key1, etc. aren't valid arguments to filesystem constructor.
kedro_datasets.partitions.partitioned_dataset.PartitionedDataset: Same as above.
kedro_datasets.pillow.image_dataset.ImageDataset: Loading a nonexistent image. Maybe can use a public example image.
kedro_datasets.polars.lazy_polars_dataset.LazyPolarsDataset: Seems like a bug, missing file_format argument.
kedro_datasets.redis.redis_dataset.PickleDataset: Can't connect to Redis; not sure if this is doable in a doctest.
kedro_datasets.spark.deltatable_dataset.DeltaTableDataset: Delta connector needs to be installed? Not sure...
kedro_datasets.spark.spark_dataset.SparkDataset: Example works; just need to ignore the output.
kedro_datasets.spark.spark_hive_dataset.SparkHiveDataset: No Hive support.
kedro_datasets.spark.spark_hive_dataset.SparkHiveDataset: Easy first step--fix import!
kedro_datasets.video.video_dataset.VideoDataset: File doesn't exist.
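
For the entries above where the example works but the output varies, or where no external service is available, doctest's standard `ELLIPSIS` and `SKIP` directives are one possible way to handle them. The following is only a sketch: `connect_to_redis` is a hypothetical, undefined name used to show that a skipped example is never executed.

```python
import doctest

# ELLIPSIS lets "..." match the part of the output that changes between runs.
# SKIP leaves the example in the docs but never executes it.
EXAMPLES = """
>>> object()  # doctest: +ELLIPSIS
<object object at 0x...>
>>> connect_to_redis()  # doctest: +SKIP
"""

test = doctest.DocTestParser().get_doctest(EXAMPLES, {}, "directives", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print("failed:", results.failed)
```

A skipped example is not validated at all, so `SKIP` fits best where setting up the service in CI is judged not worth the effort.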

stichbury · 2023-10-30T13:28:53Z

I am more than happy with using doctest if it means we end up with solid code that works for readers. I'm somewhat overwhelmed by the issues listed above though: do we need to ticket all these?

laizaparizotto · 2023-11-15T17:32:48Z

Hi @deepyaman, I would like to contribute to updating the docstrings at kedro-plugins/kedro-datasets.

Do you have any suggestions about which ones I could look at?

deepyaman · 2023-11-20T19:33:49Z

@laizaparizotto Sorry for not responding to you earlier! Somebody just brought it to my attention this morning. Would love your help.

If you see https://github.com/kedro-org/kedro-plugins/blob/main/Makefile#L30, all these files have examples that are failing in one way or another; feel free to take any of them (ideally, that don't have a PR already). I've just raised one for dask.ParquetDataset, if that helps as an example.

Let me know if you have any questions, or run into issues!

merelcht · 2023-11-24T10:25:39Z

Fixes tried, but no success and might be too complicated:

kedro_datasets/spark/spark_hive_dataset.py: Looking at the tests for this dataset, setting up a testing Spark Hive instance is complex, and probably not worth the effort for the doctest.
kedro_datasets/spark/spark_jdbc_dataset.py: Needs a Postgres driver.
kedro_datasets/pandas/gbq_dataset.py: Needs a (mocked) BigQuery client.
kedro_datasets/redis/redis_dataset.py: Needs a Redis instance.
kedro_datasets/snowflake/snowpark_dataset.py: Has no Python example, and only works on Python 3.8.
kedro_datasets/partitions/partitioned_dataset.py: Has two examples. The local example works; the S3 one doesn't, but it's not worth setting up a whole mock S3 server.

merelcht · 2023-12-20T12:30:31Z

All fixable dataset docstrings are now fixed. The remaining examples all require complicated cloud/database client setup, which is overkill for the examples. I'll close this as completed.

merelcht assigned AhdraMeraliQB Sep 28, 2022

stichbury mentioned this issue Oct 17, 2022

Rationalise data catalog examples in documentation #1742

Closed

2 tasks

AhdraMeraliQB mentioned this issue Oct 21, 2022

Validate code snippets in datasets documentation - Part 1 #1962

Merged

5 tasks

merelcht unassigned AhdraMeraliQB Nov 9, 2022

merelcht transferred this issue from kedro-org/kedro Jan 26, 2023

merelcht added the good first issue label Jan 26, 2023

merelcht transferred this issue from kedro-org/kedro-plugins Feb 6, 2023

merelcht added this to the Package kedro.extras.datasets into its own package milestone Feb 6, 2023

stichbury added Component: Example code Example code creation/publication and removed good first issue labels Jul 5, 2023

This was referenced Sep 18, 2023

[PARENT] kedro-datasets tickets for 2.0.0 #3043

Closed

Remove kedro.extras.datasets and kedro.extras #3057

Merged

deepyaman mentioned this issue Oct 3, 2023

docs(datasets): blacken code in rst literal blocks kedro-org/kedro-plugins#365

Closed

merelcht added the Hacktoberfest label Oct 11, 2023

This was referenced Oct 20, 2023

Review diagrams/images in docs and potentially resize to a consistent maximum or use a plugin to make thumbnails #2643

Open

Improve reference documentation for the CLI #2604

Open

deepyaman assigned deepyaman and Gundalai-Batkhuu Oct 25, 2023

deepyaman mentioned this issue Oct 26, 2023

test(datasets): run doctests to check examples run kedro-org/kedro-plugins#416

Merged

4 tasks

merelcht mentioned this issue Nov 20, 2023

Release kedro-datasets 2.0.0 kedro-org/kedro-plugins#421

Closed

4 tasks

deepyaman mentioned this issue Nov 20, 2023

test(datasets): fix dask.ParquetDataset doctests kedro-org/kedro-plugins#439

Merged

4 tasks

merelcht mentioned this issue Nov 27, 2023

chore(datasets): Fix pandas.GenericDataset doctest kedro-org/kedro-plugins#445

Merged

4 tasks

merelcht removed the Hacktoberfest label Nov 27, 2023

merelcht mentioned this issue Nov 27, 2023

chore(datasets): Fix more doctest issues kedro-org/kedro-plugins#451

Merged

4 tasks

deepyaman mentioned this issue Nov 29, 2023

feat(datasets): accept os.PathLike for filepaths kedro-org/kedro-plugins#456

Open

merelcht unassigned deepyaman and Gundalai-Batkhuu Dec 11, 2023

merelcht self-assigned this Dec 19, 2023

This was referenced Dec 19, 2023

chore(datasets): Fix doctests kedro-org/kedro-plugins#488

Merged

chore(datasets): Fix delta docstrings kedro-org/kedro-plugins#489

Merged

merelcht closed this as completed Dec 20, 2023
