Make sure all example code blocks for datasets are runnable #2287

Closed
merelcht opened this issue Aug 1, 2022 · 16 comments
Labels
Component: Example code Example code creation/publication

Comments

@merelcht
Member

merelcht commented Aug 1, 2022

Description

Some of the code examples we provide in the API docs for datasets (https://docs.kedro.org/en/stable/kedro_datasets.html#module-kedro_datasets) aren't actually runnable. Some datasets have straightforward examples that can be copy-pasted and run straight away. Others reference external setup such as S3, and nothing in the docs makes clear that these snippets can't be run as is.

Implementation

Update all code snippets in the dataset API docs to basic examples that can be run. Where a simpler example doesn't make sense, state that the snippet can't be run as is and what additional setup is needed.

Please also make sure the examples refer to kedro-datasets, not kedro.

@merelcht
Member Author

merelcht commented Nov 9, 2022

Part of this task was completed in #1962

Remaining datasets are:

  • dask.ParquetDataSet
  • holoviews.HoloviewsWriter
  • pandas.GBQQueryDataSet
  • pandas.GBQTableDataSet
  • pandas.SQLQueryDataSet
  • pandas.SQLTableDataSet
  • redis.PickleDataSet
  • all Spark datasets
  • tensorflow.TensorFlowModelDataset

@merelcht merelcht transferred this issue from kedro-org/kedro Jan 26, 2023
@merelcht merelcht transferred this issue from kedro-org/kedro-plugins Feb 6, 2023
@JoaoAreias

Hi, I want to start contributing to Kedro; it's such an amazing tool. May I take this one?

@merelcht
Member Author

Hi @JoaoAreias, Welcome to the Kedro community 😄 That would be wonderful! This work will need to be done in kedro-datasets in the kedro-plugins repo: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets

@JoaoAreias

Thank you! Will do!

@stichbury stichbury added Component: Example code Example code creation/publication and removed good first issue labels Jul 5, 2023
@noklam
Contributor

noklam commented Sep 18, 2023

In addition, I think we should make sure the examples import from kedro-datasets, not kedro. I will add this to the requirements. @merelcht

@deepyaman
Member

It would be nice to actually use doctest for these. (Rather than verifying once manually.)

@stichbury
Contributor

Agreed @deepyaman! Putting the examples into a test script and including as part of ongoing CI would be awesome.
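As a rough illustration of the idea (not the approach the project ended up using), the sketch below puts an example in a docstring and runs it with the standard-library doctest module, returning counts that a CI job could check. The function normalise_columns and the helper run_docstring_examples are hypothetical names invented for this sketch; they are not part of kedro-datasets.

```python
import doctest


def normalise_columns(columns):
    """Lower-case column names and replace spaces with underscores.

    Hypothetical helper used only to carry a docstring example.

    >>> normalise_columns(["Order ID", "Unit Price"])
    ['order_id', 'unit_price']
    """
    return [name.strip().lower().replace(" ", "_") for name in columns]


def run_docstring_examples(obj):
    """Run the ``>>>`` examples in ``obj``'s docstring.

    Returns a (failed, attempted) tuple so a caller can fail the build
    when any example no longer matches its documented output.
    """
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # Pass the globals explicitly so the example can see the function
    # regardless of which module this code is executed from.
    tests = finder.find(obj, name=obj.__name__, globs={obj.__name__: obj})
    for test in tests:
        runner.run(test)
    return runner.failures, runner.tries


failed, attempted = run_docstring_examples(normalise_columns)
print(f"{failed} failed out of {attempted} examples")
```

In a real repository the same check is usually run over whole modules (for example with `python -m doctest` or a test runner's doctest integration) rather than one function at a time.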

@Gundalai-Batkhuu

I would like to take this one

@astrojuanlu
Member

Hi @Gundalai-Batkhuu, feel free to take #2287, #2604, #2643, and if @ggermade does not reappear in a few days, #2008. Just bear in mind that we are not assigning all the issues at once, focus on one of them and feel free to open a PR when you're ready. Other people might take them, so be quick! And rather than asking us whether you can start working on them, either open a PR directly, or if there's something you don't understand about the issue or the scope, ask for clarification in the relevant ticket. This will increase your chances of success. Good luck!

@deepyaman
Member

kedro-org/kedro-plugins#416 is a first attempt at validating using doctest. Example run: https://github.com/kedro-org/kedro-plugins/actions/runs/6634484238/job/18023981594?pr=416

As @merelcht mentioned, some of the tests reference S3, or data files that don't exist; many of these can probably be updated. In others, the only issue is that the correct output isn't reflected. In certain cases, the doctests seem to be catching legitimate mistakes (e.g. missing arguments).
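One way to update an example that points at S3 or at a file that doesn't exist is to have it create its own data in a temporary directory. The sketch below shows the pattern with the standard library only; save_and_load_rows is a hypothetical stand-in for a dataset's save/load round trip, not a kedro-datasets API.

```python
import csv
import tempfile
from pathlib import Path


def save_and_load_rows(rows, directory):
    """Write ``rows`` to a CSV file in ``directory`` and read them back.

    The example creates its own temporary directory, so it needs no
    bucket, credentials, or pre-existing data file:

    >>> import tempfile
    >>> tmp = tempfile.TemporaryDirectory()
    >>> save_and_load_rows([{"col1": "1", "col2": "4"}], tmp.name)
    [{'col1': '1', 'col2': '4'}]
    >>> tmp.cleanup()
    """
    path = Path(directory) / "test.csv"
    # Save: write a header row followed by the data rows.
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    # Load: the csv module returns every value as a string.
    with path.open(newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = save_and_load_rows([{"col1": "1", "col2": "4"}], tmp_dir)
print(reloaded)
```

The same shape should carry over to a dataset docstring: build a small input, save it under a temporary path, load it back, and compare.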

Want to take a pause and check, before investing more time on this--are we aligned on/okay with using doctest?

@deepyaman
Member

deepyaman commented Oct 25, 2023

Failure details

  1. kedro_datasets.dask.parquet_dataset.ParquetDataset: It's technically not runnable because botocore client creation fails, given AWS credentials from the environment and passed manually. However, the S3 use would fail regardless, so probably best to replace with a local example.
  2. kedro_datasets.databricks.managed_table_dataset.ManagedTableDataset: Seems it's catching an error about wrong write mode?
  3. kedro_datasets.matplotlib.matplotlib_writer.MatplotlibWriter: The examples are working, but need to make sure the output isn't checked using ELLIPSIS or something.
  4. kedro_datasets.pandas.deltatable_dataset.DeltaTableDataset: Not able to find some _last_checkpoint? Seems like a legit error at first glance.
  5. kedro_datasets.pandas.gbq_dataset.GBQQueryDataset: No actual BigQuery to connect to; assume this will have to be ignored.
  6. kedro_datasets.pandas.gbq_dataset.GBQTableDataset: Same as above.
  7. kedro_datasets.pandas.generic_dataset.GenericDataset: Haven't looked into it, but I assume this is a bug due to not specifying params for reading/writing with pandas, and due to how the defaults are handled with index.
  8. kedro_datasets.pandas.sql_dataset.SQLQueryDataset: Not a valid connection string. This could potentially be done with SQLite or something.
  9. kedro_datasets.pandas.sql_dataset.SQLTableDataset: Same as above.
  10. kedro_datasets.partitions.incremental_dataset.IncrementalDataset: key1, etc. aren't valid arguments to filesystem constructor.
  11. kedro_datasets.partitions.partitioned_dataset.PartitionedDataset: Same as above.
  12. kedro_datasets.pillow.image_dataset.ImageDataset: Loading a nonexistent image. Maybe can use a public example image.
  13. kedro_datasets.polars.lazy_polars_dataset.LazyPolarsDataset: Seems like a bug, missing file_format argument.
  14. kedro_datasets.redis.redis_dataset.PickleDataset: Can't connect to Redis; not sure if this is doable in a doctest.
  15. kedro_datasets.spark.deltatable_dataset.DeltaTableDataset: Delta connector needs to be installed? Not sure...
  16. kedro_datasets.spark.spark_dataset.SparkDataset: Example works; just need to ignore the output.
  17. kedro_datasets.spark.spark_hive_dataset.SparkHiveDataset: No Hive support.
  18. kedro_datasets.spark.spark_hive_dataset.SparkHiveDataset: Easy first step--fix import!
  19. kedro_datasets.video.video_dataset.VideoDataset: File doesn't exist.
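The failures above fall into a few groups that doctest can treat differently. The sketch below is illustrative only and uses hypothetical function names: ELLIPSIS for output that varies between runs, SKIP for examples that need a live external service, and an in-memory SQLite database as a local stand-in for a SQL connection. It uses the standard-library sqlite3 module rather than the pandas SQL datasets themselves.

```python
import doctest
import sqlite3


def make_handle():
    """Return an object whose repr contains a memory address.

    The address differs on every run, so the expected output uses ``...``
    with the ELLIPSIS directive instead of a literal value:

    >>> make_handle()  # doctest: +ELLIPSIS
    <object object at 0x...>
    """
    return object()


def query_remote_warehouse(sql):
    """Hypothetical call that needs a live external service.

    SKIP keeps the example visible in the docs without executing it:

    >>> query_remote_warehouse("select 1")  # doctest: +SKIP
    [(1,)]
    """
    raise ConnectionError("no warehouse available in this sketch")


def count_rows(connection, table):
    """Count the rows of ``table``.

    An in-memory SQLite database needs no server or credentials:

    >>> import sqlite3
    >>> con = sqlite3.connect(":memory:")
    >>> _ = con.execute("create table shuttles (id integer, name text)")
    >>> _ = con.execute("insert into shuttles values (1, 'Apollo')")
    >>> count_rows(con, "shuttles")
    1
    >>> con.close()
    """
    return connection.execute(f'select count(*) from "{table}"').fetchone()[0]


def run_examples(*functions):
    """Run the docstring examples of each function; return the failure count."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # Make every function visible to every example.
    globs = {function.__name__: function for function in functions}
    for function in functions:
        for test in finder.find(function, name=function.__name__, globs=globs):
            runner.run(test)
    return runner.failures


failures = run_examples(make_handle, query_remote_warehouse, count_rows)
print(f"{failures} failing examples")
```

SKIP trades verification for readability, so it probably fits best where a local substitute is not practical.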

@stichbury
Contributor

I am more than happy with using doctest if it means we end up with solid code that works for readers. I'm somewhat overwhelmed by the issues listed above though: do we need to ticket all these?

@laizaparizotto
Contributor

Hi @deepyaman, I would like to contribute to updating the docstrings at kedro-plugins/kedro-datasets.

Do you have any suggestions about which ones I could look at?

@deepyaman
Member

> Hi @deepyaman, I would like to contribute to updating the docstrings at kedro-plugins/kedro-datasets.
>
> Do you have any suggestions about which ones I could look at?

@laizaparizotto Sorry for not responding to you earlier! Somebody just brought it to my attention this morning. Would love your help.

If you see https://github.com/kedro-org/kedro-plugins/blob/main/Makefile#L30, all these files have examples that are failing in one way or another; feel free to take any of them (ideally, that don't have a PR already). I've just raised one for dask.ParquetDataset, if that helps as an example.

Let me know if you have any questions, or run into issues!

@merelcht
Member Author

merelcht commented Nov 24, 2023

Fixes attempted without success; these might be too complicated:

  1. kedro_datasets/spark/spark_hive_dataset.py: Looking at the tests for this dataset, setting up a testing Spark Hive instance is complex, and probably not worth the effort for the doctest.
  2. kedro_datasets/spark/spark_jdbc_dataset.py: Needs a Postgres driver.
  3. kedro_datasets/pandas/gbq_dataset.py: Needs a (mocked) BigQuery client.
  4. kedro_datasets/redis/redis_dataset.py: Needs a Redis instance.
  5. kedro_datasets/snowflake/snowpark_dataset.py: Has no Python example, and only works on Python 3.8.
  6. kedro_datasets/partitions/partitioned_dataset.py: Has two examples. The local example works; the S3 one doesn't, but it's not worth setting up a whole mock S3 server.

@merelcht
Member Author

All fixable dataset docstrings are now fixed. The remaining examples all require complicated cloud/database client setup, which is overkill for the examples. I'll close this as completed.
