Add support for Databricks Dataset #974

Closed
tiencuongkieu opened this issue Oct 20, 2021 · 9 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@tiencuongkieu

Description

Kedro currently doesn't support reading from or writing to Databricks tables directly. Should we add this feature?

Context

Databricks is used in many projects. Using Databricks tables instead of, for example, S3 has several benefits: we can preview sample data in the Databricks UI, the tables carry metadata, and it is also faster.

Possible Implementation

I already implemented the dataset and used it in a client project.

@tiencuongkieu tiencuongkieu added the Issue: Feature Request New feature or improvement to existing feature label Oct 20, 2021
@datajoely
Contributor

Hi @tiencuongkieu - are you talking about regular databricks tables or delta tables?

You can do the former via the SparkJDBCDataSet today, and we also have a Delta dataset in progress in #964.

@jiriklein
Contributor

I think @tiencuongkieu might also mean the native saveAsTable functionality for tables within an environment such as Databricks, in line with https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#saving-to-persistent-tables
In that case it should be a simple subclass of SparkDataSet with the _save and _load methods overridden to call the relevant Spark methods.
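
For illustration, here is a minimal sketch of the two Spark calls such a dataset would wrap. It assumes the caller passes in a `SparkSession`; the helper names `load_table` and `save_table` are hypothetical and not part of Kedro. `SparkSession.table` and `DataFrameWriter.saveAsTable` are the PySpark methods for persistent tables.

```python
from typing import Any


def load_table(spark: Any, table_name: str) -> Any:
    """Read a persistent table registered in the metastore as a DataFrame."""
    # SparkSession.table resolves the name through the session catalog.
    return spark.table(table_name)


def save_table(df: Any, table_name: str, mode: str = "overwrite") -> None:
    """Write a DataFrame to a persistent table."""
    # The mode controls what happens if the table already exists,
    # e.g. "append", "overwrite", "ignore" or "error".
    df.write.mode(mode).saveAsTable(table_name)
```

The types are left as `Any` so the sketch does not require PySpark to be importable; in a real project they would be `SparkSession` and `DataFrame`.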

@datajoely
Contributor

Ah I wasn't aware of that! @tiencuongkieu we're always open to PRs :)

@tiencuongkieu
Author

tiencuongkieu commented Oct 21, 2021

Yes, exactly as @jiriklein mentioned above, it is for the native saveAsTable and the code is quite simple as he mentioned as well. I'll raise a PR.

@datajoely
Contributor

Thank you!

@jiriklein
Contributor

Awesome, thank you @tiencuongkieu
I recommend calling this SparkTableDataSet or similar - it's not strictly tied to Databricks, but applies to any underlying system that provides a table metastore.

The class should then inherit from SparkDataSet or AbstractVersionedDataSet, overriding __init__ (instead of a filepath, you take a "database" and a table - or a fully-qualified name) and then the _load and _save methods as discussed.
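
A rough sketch of that shape, under stated assumptions: the class name follows the suggestion above, the constructor parameters (`table`, `database`, `write_mode`, `spark`) are hypothetical, and the Kedro base class is deliberately left out so the snippet runs on its own. This is not the code from the draft PR.

```python
from typing import Any, Dict, Optional


class SparkTableDataSet:
    """Hypothetical dataset backed by a metastore table instead of a filepath.

    In a Kedro project this would subclass AbstractDataSet or SparkDataSet;
    the base class is omitted here so the sketch is standalone.
    """

    def __init__(
        self,
        table: str,
        database: Optional[str] = None,
        write_mode: str = "overwrite",
        spark: Any = None,
    ) -> None:
        # Build the fully-qualified name "database.table" when a database is given.
        self._full_name = f"{database}.{table}" if database else table
        self._write_mode = write_mode
        # Assumption: a session may be injected; otherwise one is created lazily.
        self._spark = spark

    def _get_spark(self) -> Any:
        if self._spark is None:
            # Imported lazily so the class can be defined without PySpark installed.
            from pyspark.sql import SparkSession

            self._spark = SparkSession.builder.getOrCreate()
        return self._spark

    def _load(self) -> Any:
        # Read the table through the session catalog.
        return self._get_spark().table(self._full_name)

    def _save(self, data: Any) -> None:
        # Persist the DataFrame as a table using the configured write mode.
        data.write.mode(self._write_mode).saveAsTable(self._full_name)

    def _describe(self) -> Dict[str, Any]:
        return {"table": self._full_name, "write_mode": self._write_mode}
```

Injecting the session is only a convenience for testing the sketch; versioning, which AbstractVersionedDataSet would provide, is not addressed here.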

Let us know if and when you need help with reviews.

@tiencuongkieu
Author

Thank you @jiriklein for your suggestion. I raised a draft PR. I still need to add tests, though.

@tiencuongkieu
Author

I just copied it over from my project's implementation. It inherits from AbstractDataSet only. Maybe we need versioning functionality as well?

@stale

stale bot commented Dec 20, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 20, 2021
@stale stale bot closed this as completed Dec 27, 2021
Galileo-Galilei pushed a commit to Galileo-Galilei/kedro that referenced this issue Feb 19, 2022