[KED-1242] Kedro 'core' library without included io DataSets or contrib.io #178

Closed
sarchila opened this issue Dec 3, 2019 · 4 comments

@sarchila

sarchila commented Dec 3, 2019

Description

This was discussed in a comment on a separate issue, but I figured it merited its own feature request, so I'll repeat here:

I can't provide Kedro as a library to AWS Glue because its dependency list includes libraries that rely on C extensions, which break on Glue.

One thought this raises for me is the possibility of a pure-Python 'Kedro Core' library with no io or contrib.io datasets built in (besides the core AbstractDataSet), leaving each of those to be pip-installed separately as io plugins based on one's needs.

That way I could provide this hypothetical Kedro core library to Glue without worrying that it will choke on pandas or numpy (I can't use any of those io DataSets in Glue anyway).
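To illustrate the idea, here is a minimal sketch of what a pure-Python core could look like. The class names mirror concepts from this issue, but the code is a hypothetical stand-in and not Kedro's actual implementation: it only assumes that the core needs a dataset interface, one dependency-free dataset, and a catalog that talks to datasets through that interface.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class AbstractDataSet(ABC):
    """Hypothetical core interface; the only thing io plugins would depend on."""

    @abstractmethod
    def load(self) -> Any:
        ...

    @abstractmethod
    def save(self, data: Any) -> None:
        ...


class MemoryDataSet(AbstractDataSet):
    """Pure-Python dataset with no third-party imports, safe to keep in the core."""

    def __init__(self) -> None:
        self._data: Any = None

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data


class DataCatalog:
    """Hypothetical catalog that only knows about the abstract interface."""

    def __init__(self, datasets: Dict[str, AbstractDataSet]) -> None:
        self._datasets = dict(datasets)

    def load(self, name: str) -> Any:
        return self._datasets[name].load()

    def save(self, name: str, data: Any) -> None:
        self._datasets[name].save(data)


# Usage: nothing here imports pandas or numpy.
catalog = DataCatalog({"example": MemoryDataSet()})
catalog.save("example", [1, 2, 3])
```

Under this layout, a pandas- or Spark-backed dataset would live in a separate package that subclasses the interface, so installing the core alone never pulls in C extensions.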

Since then I ended up taking my thought a step further by forking Kedro and coarsely removing the non-core functionality (branch here) that causes Kedro to depend on pandas, numpy, and other libraries that I considered not part of the 'core' Kedro runtime context/catalog/pipeline/node machinery. By providing my forked "Kedro core" branch to AWS Glue, I have been able to deploy my Kedro project and run it in Glue successfully 🎉

Context

This opens up the opportunity for Kedro to handle a purely PySpark pipeline use case and allows simple deployment to AWS Glue, a good choice for running Spark in the cloud without having to manage one's own cluster.

Possible Implementation

I've also been using the AWS CDK library, and thought Kedro could take a similar approach: provide a 'core' library and ship every use-case-specific 'io' plugin as a separate small library that can be installed as needed. For an example, see https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html#hello_world_tutorial_add_bucket
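One way a core library could support separately installed io plugins is to resolve dataset classes lazily from a dotted path, so a plugin's dependencies are imported only when that dataset is actually requested. The sketch below is an assumption about how this might be done, not how Kedro or the CDK implements it; `load_dataset_class` is a hypothetical helper, and it uses only the standard-library `importlib.import_module`.

```python
import importlib
from typing import Any


def load_dataset_class(dotted_path: str) -> Any:
    """Resolve 'package.module.ClassName' to a class, importing it on demand.

    Hypothetical helper: the core never imports plugin modules at start-up,
    so a missing plugin only fails when a pipeline actually asks for it.
    """
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.ClassName', got '{dotted_path}'")
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Could not import '{module_name}'. "
            "Install the io plugin that provides it."
        ) from exc
    return getattr(module, class_name)


# Usage with a standard-library class standing in for a plugin dataset.
decoder_class = load_dataset_class("json.JSONDecoder")
```

With this design, the core's own dependency list stays empty of pandas, numpy, and similar libraries; each plugin declares its own requirements.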

@sarchila sarchila added the Issue: Feature Request New feature or improvement to existing feature label Dec 3, 2019
@lorenabalan
Contributor

Good to hear back from you @sarchila! That is excellent news, thank you for sharing! This was added to our backlog a while ago, with a view to deliver in 2020. We welcome any contributions in this space if you are interested. :)

@lorenabalan lorenabalan changed the title Kedro 'core' library without included io DataSets or contrib.io [KED-1242] Kedro 'core' library without included io DataSets or contrib.io Dec 4, 2019
@yetudada
Contributor

yetudada commented Feb 5, 2020

We're on our way to this issue! We're launching these datasets in the next release: https://github.com/quantumblacklabs/kedro/tree/develop/kedro/extras/datasets

We will give users time to switch to these datasets. The major release after that will remove the io and contrib dependencies from Kedro.

@sarchila
Author

sarchila commented Feb 5, 2020

Great news @yetudada - thanks so much for your team's responsiveness on this issue 🙌

@yetudada yetudada removed the Issue: Feature Request New feature or improvement to existing feature label Feb 14, 2020
@yetudada
Contributor

@sarchila this issue can finally be closed. Commit ecd7277 has addressed this change. Thank you so much for submitting this request!
