Unable to specify save format for SparkHiveDataSet #1528

jstammers · 2022-05-13T15:13:48Z

Description

The implementation of SparkHiveDataSet allows the user to specify additional save arguments. This should make it possible to save a delta table, which is done in plain pyspark with the following code:

table = spark.table(...)
table.write.saveAsTable("db.table", format='delta')

The same behaviour should be achievable with the following catalog configuration:

table:
  type: spark.SparkHiveDataSet
  database: db
  table: table
  write_mode: overwrite
  save_args:
    format: delta

However, this raises a DataSetError because the SparkHiveDataSet constructor gets the format from the save_args

kedro/kedro/extras/datasets/spark/spark_hive_dataset.py

Line 117 in 805da32

self._format = self._save_args.get("format") or "hive"

but does not remove it from self._save_args when it is present. The error is raised when creating the hive table, because saveAsTable then receives two arguments named format:

kedro/kedro/extras/datasets/spark/spark_hive_dataset.py

Lines 144 to 151 in 805da32

    
    def _create_hive_table(self, data: DataFrame, mode: str = None):
        _mode: str = mode or self._write_mode
        data.write.saveAsTable(
            self._full_table_address,
            mode=_mode,
            format=self._format,
            **self._save_args,
        )
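The clash can be reproduced without Spark. The following is a minimal sketch, assuming a hypothetical stand-in function `save_as_table` that only mimics the keyword layout of `DataFrameWriter.saveAsTable` (`name, format=None, mode=None, **options`). It is not the Kedro code, and in Kedro the underlying error is reported as a DataSetError.

```python
def save_as_table(name, format=None, mode=None, **options):
    """Hypothetical stand-in: same keyword layout as saveAsTable, no Spark needed."""
    return {"name": name, "format": format, "mode": mode, "options": options}


save_args = {"format": "delta"}

# Mirrors the constructor line quoted above: the value is read,
# but the "format" key stays inside save_args.
fmt = save_args.get("format") or "hive"

error_message = None
try:
    # "format" is now passed twice: once explicitly and once via **save_args.
    save_as_table("db.table", mode="overwrite", format=fmt, **save_args)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

Python rejects the call before the function body runs, so the failure does not depend on which format is requested.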

Context

I am trying to save a table using the delta format which is possible using the pyspark API, but currently not supported using SparkHiveDataSet. With the current implementation, the only supported format is the 'hive' default.

A possible solution would be to pop the 'format' value if it exists in save_args, e.g.

self._format = self._save_args.pop("format", "hive")  # returns "hive" if "format" is not in self._save_args
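As a rough illustration of the proposal, here is a sketch under stated assumptions: `split_format` and `save_as_table` are hypothetical helper names that are not part of Kedro or pyspark, and the save arguments are copied so that the caller's dictionary is left untouched.

```python
from copy import deepcopy


def split_format(save_args=None, default="hive"):
    """Hypothetical helper: return (format, remaining save args).

    Works on a copy, so the dict passed in by the caller is not mutated.
    """
    remaining = deepcopy(save_args) if save_args else {}
    # pop() removes the key, so it cannot be forwarded a second time.
    # "or default" keeps the original fallback when the value is None or empty.
    fmt = remaining.pop("format", None) or default
    return fmt, remaining


def save_as_table(name, format=None, mode=None, **options):
    """Hypothetical stand-in with the same keyword layout as saveAsTable."""
    return {"name": name, "format": format, "mode": mode, "options": options}


user_args = {"format": "delta", "partitionBy": ["date"]}
fmt, remaining = split_format(user_args)

# "format" now appears exactly once in the call.
result = save_as_table("db.table", mode="overwrite", format=fmt, **remaining)
print(result)
```

Using `pop` with a `None` default followed by `or default` preserves the behaviour of the existing `.get("format") or "hive"` line for an explicit `format: null` entry, which a bare `pop("format", "hive")` would pass through as `None`.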

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

Kedro version used (pip show kedro or kedro -V): 0.18.0
Python version used (python -V): 3.9
Operating system and version: ubuntu 18.04


merelcht · 2022-05-16T12:41:19Z

Hi @jstammers, have you tried using the DeltaTableDataSet?

merelcht · 2022-06-20T12:55:37Z

Hi @jstammers do you still need help with this?

jstammers · 2022-06-22T07:46:51Z

Hi @MerelTheisenQB, I've been able to save my dataset as a delta table using the DeltaTableDataSet.

If you think this functionality should be available using the HiveDataSet class, I'd be happy to submit a PR that implements the change I proposed above.

Otherwise, please feel free to close this issue and thanks for the help

merelcht · 2022-07-11T13:11:19Z

Hi @jstammers, what differences between DeltaTableDataSet and SparkHiveDataSet make you want this functionality as part of SparkHiveDataSet?

jstammers · 2022-07-18T10:31:22Z

Hi @MerelTheisenQB , the main difference would be the fact that we have upstream processes that insert data using spark.sql, e.g.

spark.sql("Insert into facts.volume")

which means that accessing that data using SparkHiveDataSet is more convenient. In this example, the data are saved at /user/hive/facts.db/volume.

In my current use-case, I am intending to use this across multiple projects where the data structure will be the same, but the underlying file locations will be different. I expect it will be easier to handle this using the hive metastore rather than parameterising the base file location, but happy to hear otherwise

merelcht · 2022-07-25T10:42:44Z

Thanks for clarifying @jstammers, that makes sense. It sounds very reasonable to me to add the saving as delta table functionality to the SparkHiveDataSet, so you're more than welcome to open a PR for it 🙂 And of course reach out here or on our Discord channel if you need any help.

noklam · 2022-09-26T10:23:56Z

@MerelTheisenQB I am not very familiar with the difference between different Spark options, but this looks like a pure implementation bug to me.

See https://github.com/quantumblacklabs/private-kedro/pull/1083/files (in the old private repo). save_args was added specifically to support more formats.

I am more confident about this after skimming through the commit history of that PR. See this commit: https://github.com/quantumblacklabs/private-kedro/pull/1083/commits/443c8b0bf0ada48ff9d3ae2685ab0fc9d1ab7851

merelcht added the Community Issue/PR opened by the open-source community label May 16, 2022

merelcht added this to Kedro Framework May 16, 2022

merelcht added the Issue: Bug Report 🐞 Bug that needs to be fixed label May 16, 2022

jstammers mentioned this issue Sep 20, 2022

Fix issue with specifying format for SparkHiveDataSet #1857

Merged

5 tasks

noklam linked a pull request Sep 26, 2022 that will close this issue

merelcht closed this as completed in #1857 Sep 28, 2022

merelcht moved this to Done in Kedro Framework Sep 28, 2022
