Add support for Databricks Dataset #974

tiencuongkieu · 2021-10-20T00:40:54Z

Description

Currently Kedro doesn't support reading from or writing to Databricks tables directly. Should we support this feature?

Context

Databricks is used in many projects. Using Databricks tables instead of, for example, S3 has several benefits: we can see sample data in the Databricks UI, the tables carry metadata, and it is also faster.

Possible Implementation

I already implemented the dataset and used it in a client project.

datajoely · 2021-10-20T08:51:54Z

Hi @tiencuongkieu - are you talking about regular Databricks tables or Delta tables?

You can do the former via the SparkJDBCDataSet today, and we also have a Delta dataset in progress in #964.

jiriklein · 2021-10-20T12:33:45Z

I think @tiencuongkieu might also mean the native saveAsTable functionality for tables in environments such as Databricks, in line with https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#saving-to-persistent-tables
In that case it should be a simple subclass of SparkDataSet with the _save and _load methods overridden to call the relevant methods.

datajoely · 2021-10-20T14:47:51Z

Ah I wasn't aware of that! @tiencuongkieu we're always open to PRs :)

tiencuongkieu · 2021-10-21T01:08:10Z

Yes, exactly as @jiriklein mentioned above, it is for the native saveAsTable and the code is quite simple as he mentioned as well. I'll raise a PR.

datajoely · 2021-10-21T09:06:55Z

Thank you!

jiriklein · 2021-10-21T10:11:11Z

Awesome, thank you @tiencuongkieu
I'd recommend calling this SparkTableDataSet or similar instead - it's not strictly related to Databricks, but to any underlying system that provides a table metastore.

The class should then inherit from SparkDataSet or AbstractVersionedDataSet and override __init__ (instead of a filepath, you have a "database" and a table, or a fully-qualified name), plus the _load and _save methods as discussed.
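
To make the suggestion above concrete, here is a minimal sketch of what such a dataset could look like. It is an illustration only, not the code from the PR: the class name SparkTableDataSet, the `table`, `database`, `write_mode` and `spark` parameters, and the small stand-in base class are all assumptions (the stand-in mimics the `_load` / `_save` / `_describe` contract of Kedro's AbstractDataSet so the snippet runs without Kedro installed). The only Spark calls used are `SparkSession.table(...)` and `DataFrame.write.mode(...).saveAsTable(...)`.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class AbstractDataSet(ABC):
    """Stand-in for kedro.io.AbstractDataSet (assumption: same three hooks)."""

    def load(self) -> Any:
        return self._load()

    def save(self, data: Any) -> None:
        self._save(data)

    @abstractmethod
    def _load(self) -> Any: ...

    @abstractmethod
    def _save(self, data: Any) -> None: ...

    @abstractmethod
    def _describe(self) -> Dict[str, Any]: ...


class SparkTableDataSet(AbstractDataSet):
    """Hypothetical dataset that reads/writes a metastore table by name."""

    def __init__(
        self,
        table: str,
        database: Optional[str] = None,
        write_mode: str = "overwrite",
        spark: Any = None,  # optional injected session, handy for testing
    ) -> None:
        # Use "database.table" when a database is given, else the bare table name.
        self._full_name = f"{database}.{table}" if database else table
        self._write_mode = write_mode
        self._spark = spark

    def _get_spark(self) -> Any:
        # Import lazily so the module can be imported without PySpark installed.
        if self._spark is None:
            from pyspark.sql import SparkSession

            self._spark = SparkSession.builder.getOrCreate()
        return self._spark

    def _load(self) -> Any:
        # Returns a Spark DataFrame backed by the named table.
        return self._get_spark().table(self._full_name)

    def _save(self, data: Any) -> None:
        # `data` is expected to be a Spark DataFrame.
        data.write.mode(self._write_mode).saveAsTable(self._full_name)

    def _describe(self) -> Dict[str, Any]:
        return {"table": self._full_name, "write_mode": self._write_mode}
```

In a real Kedro project the stand-in base class would be replaced by the import from kedro.io, and versioning (if wanted) would need AbstractVersionedDataSet instead, which this sketch does not attempt.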

Let us know if and when you need help with reviews.

tiencuongkieu · 2021-10-21T10:23:53Z

Thank you @jiriklein for your suggestion. I raised a draft PR. I still need to add tests though.

tiencuongkieu · 2021-10-21T10:26:13Z

I just copied it over from my project's implementation. It inherits from AbstractDataSet only. Maybe we need versioning functionality as well?

stale · 2021-12-20T10:55:03Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tiencuongkieu added the Issue: Feature Request New feature or improvement to existing feature label Oct 20, 2021

tiencuongkieu mentioned this issue Oct 21, 2021: Add Spark Databricks dataset #977 (Closed)

stale bot added the stale label Dec 20, 2021

stale bot closed this as completed Dec 27, 2021

Galileo-Galilei pushed a commit to Galileo-Galilei/kedro that referenced this issue Feb 19, 2022

Set checkout value in advance (kedro-org#974)

bb11692
