Unable to specify save format for SparkHiveDataSet #1528

Closed
jstammers opened this issue May 13, 2022 · 7 comments · Fixed by #1857
Labels
Community Issue/PR opened by the open-source community Issue: Bug Report 🐞 Bug that needs to be fixed

Comments

@jstammers
Contributor

Description

The implementation of SparkHiveDataSet allows the user to specify additional save arguments. This should make it possible to save a Delta table, which in PySpark is done with the following code:

table = spark.table(...)
table.write.saveAsTable("db.table", format='delta')

The same result should be achievable with the following configuration:

table:
  type: spark.SparkHiveDataSet
  database: db
  table: table
  write_mode: overwrite
  save_args:
    format: delta

However, this raises a DataSetError, because the SparkHiveDataSet constructor reads the format from save_args

self._format = self._save_args.get("format") or "hive"

but does not remove it from self._save_args when it is present. The error is raised when creating the Hive table, because saveAsTable then receives the format keyword twice: once explicitly and once through **self._save_args.

def _create_hive_table(self, data: DataFrame, mode: str = None):
    _mode: str = mode or self._write_mode
    data.write.saveAsTable(
        self._full_table_address,
        mode=_mode,
        format=self._format,
        **self._save_args,
    )
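The clash can be reproduced without Spark. The following is only a minimal sketch: `save_as_table` is a hypothetical stand-in whose keyword layout is assumed to mirror `saveAsTable` (a named `format` parameter plus `**options`), and it shows the underlying Python `TypeError` rather than the DataSetError that Kedro reports.

```python
def save_as_table(name, format=None, mode=None, **options):
    """Hypothetical stand-in for saveAsTable; it only echoes its arguments."""
    return {"name": name, "format": format, "mode": mode, "options": options}


save_args = {"format": "delta"}

# Same pattern as the constructor: "format" is read but left in save_args.
table_format = save_args.get("format") or "hive"

try:
    # "format" arrives twice: once by name and once through **save_args.
    save_as_table("db.table", mode="overwrite", format=table_format, **save_args)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

Python rejects the call before the function body runs, so the failure does not depend on the chosen format or on the table itself.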

Context

I am trying to save a table in the delta format. This is possible with the PySpark API, but is currently not supported by SparkHiveDataSet: with the current implementation, the only format that works is the default, 'hive'.

A possible solution would be to pop the 'format' value if it exists in save_args, e.g.

self._format = self._save_args.pop("format", "hive")  # returns "hive" if "format" is not in self._save_args
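As a rough illustration of the proposed change, the sketch below uses a hypothetical class, `HiveDataSetSketch`, in place of the real SparkHiveDataSet. The `writer_kwargs` helper and the deep copy of `save_args` are assumptions added for the example; the copy is there because `pop` mutates the dictionary it is called on, which would otherwise alter the caller's own `save_args`.

```python
import copy


class HiveDataSetSketch:
    """Hypothetical stand-in for the dataset; it only models argument handling."""

    def __init__(self, save_args=None):
        # Copy first so that pop() does not modify the caller's dictionary.
        self._save_args = copy.deepcopy(save_args) if save_args else {}
        # Take "format" out of save_args, falling back to "hive".
        self._format = self._save_args.pop("format", "hive")

    def writer_kwargs(self, mode="overwrite"):
        # "format" now appears exactly once, so the unpacking cannot clash.
        return dict(mode=mode, format=self._format, **self._save_args)


user_args = {"format": "delta", "path": "/tmp/example"}
delta_kwargs = HiveDataSetSketch(user_args).writer_kwargs()
default_kwargs = HiveDataSetSketch().writer_kwargs()
print(delta_kwargs)
print(default_kwargs)
```

One behavioural difference from the original line is worth noting: `get("format") or "hive"` also falls back to "hive" for an empty or `None` value, whereas `pop("format", "hive")` falls back only when the key is absent.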

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): 0.18.0
  • Python version used (python -V): 3.9
  • Operating system and version: ubuntu 18.04
@merelcht merelcht added the Community Issue/PR opened by the open-source community label May 16, 2022
@merelcht merelcht added the Issue: Bug Report 🐞 Bug that needs to be fixed label May 16, 2022
@merelcht
Member

Hi @jstammers, have you tried using the DeltaTableDataSet?

@merelcht
Member

Hi @jstammers, do you still need help with this?

@jstammers
Contributor Author

Hi @MerelTheisenQB, I've been able to save my dataset as a delta table using the DeltaTableDataSet.

If you think this functionality should be available in the SparkHiveDataSet class, I'd be happy to submit a PR that implements the change I proposed above.

Otherwise, please feel free to close this issue, and thanks for the help.

@merelcht
Member

Hi @jstammers, what differences between DeltaTableDataSet and SparkHiveDataSet make you want this functionality as part of the SparkHiveDataSet?

@jstammers
Contributor Author

Hi @MerelTheisenQB, the main difference is that we have upstream processes that insert data using spark.sql, e.g.

spark.sql("Insert into facts.volume")

which means that accessing that data through SparkHiveDataSet is more convenient. In this example, the data are saved at /user/hive/facts.db/volume.

In my current use case, I intend to use this across multiple projects where the data structure is the same but the underlying file locations differ. I expect it will be easier to handle this through the Hive metastore than by parameterising the base file location, but I am happy to hear otherwise.

@merelcht
Member

Thanks for clarifying, @jstammers, that makes sense. It sounds very reasonable to me to add saving as a delta table to the SparkHiveDataSet, so you're more than welcome to open a PR for it 🙂 And of course reach out here or on our Discord channel if you need any help.

@noklam
Contributor

noklam commented Sep 26, 2022

@MerelTheisenQB I am not very familiar with the differences between the Spark options, but this looks like a pure implementation bug to me.

See https://github.com/quantumblacklabs/private-kedro/pull/1083/files (in the old private repo). save_args was added specifically to support more formats.

I am more confident about this after skimming through the commit history of that PR. See this commit: https://github.com/quantumblacklabs/private-kedro/pull/1083/commits/443c8b0bf0ada48ff9d3ae2685ab0fc9d1ab7851

@noklam noklam linked a pull request Sep 26, 2022 that will close this issue
@merelcht merelcht moved this to Done in Kedro Framework Sep 28, 2022