Unable to specify save format for SparkHiveDataSet #1528
Comments
Hi @jstammers, have you tried using the `DeltaTableDataSet`?
Hi @jstammers, do you still need help with this?
Hi @MerelTheisenQB, I've been able to save my dataset as a delta table using the `DeltaTableDataSet`. If you think this functionality should be available using the `SparkHiveDataSet` class, I'd be happy to submit a PR that implements the change I proposed above. Otherwise, please feel free to close this issue, and thanks for the help.
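For reference, a catalog entry using the delta approach might look like the following sketch; the dataset name and filepath are assumptions for illustration, not taken from the thread:

```yaml
# Hypothetical kedro catalog entry (names and path are assumed)
volume:
  type: spark.DeltaTableDataSet
  filepath: s3://my-bucket/facts/volume
```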
Hi @jstammers, what are the differences for you when using `SparkHiveDataSet` compared to `DeltaTableDataSet`?
Hi @MerelTheisenQB, the main difference would be the fact that we have upstream processes that insert data using `spark.sql("Insert into facts.volume")`, which means that accessing that data through the hive metastore is more natural than through a file path.

In my current use-case, I am intending to use this across multiple projects where the data structure will be the same but the underlying file locations will be different. I expect it will be easier to handle this using the hive metastore rather than parameterising the base file location, but I am happy to hear otherwise.
Thanks for clarifying @jstammers, that makes sense. It sounds very reasonable to me to add the save-as-delta-table functionality to the `SparkHiveDataSet`.
@MerelTheisenQB I am not very familiar with the differences between the various Spark options, but this looks like a pure implementation bug to me. See https://github.com/quantumblacklabs/private-kedro/pull/1083/files (in the old private repo). I am more confident about this after skimming the commit history of the PR; see this commit: https://github.com/quantumblacklabs/private-kedro/pull/1083/commits/443c8b0bf0ada48ff9d3ae2685ab0fc9d1ab7851
Description
The implementation for `SparkHiveDataSet` allows the user to specify additional save arguments. This should make it possible to save a delta table, which can be done with the PySpark API, by replicating the equivalent call in the catalog configuration.
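The original code and configuration snippets were lost in this copy; a hedged reconstruction of the intent, assuming a PySpark call along the lines of `df.write.format("delta").mode("overwrite").saveAsTable("facts.volume")`, would be a catalog entry such as (the database, table, and `write_mode` values are assumptions):

```yaml
# Hypothetical catalog entry replicating a delta-format saveAsTable call
volume:
  type: spark.SparkHiveDataSet
  database: facts
  table: volume
  write_mode: overwrite
  save_args:
    format: delta
```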
However, this raises a `DataSetError`, because the `SparkHiveDataSet` constructor gets the format from the `save_args`
(kedro/kedro/extras/datasets/spark/spark_hive_dataset.py, line 117 in 805da32)
but, if it exists, does not remove it from `self._save_args`. The error is raised when creating a hive table, because there are then two arguments named `format`
(kedro/kedro/extras/datasets/spark/spark_hive_dataset.py, lines 144 to 151 in 805da32).
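The failure mode can be sketched without Spark. The function below is a hypothetical stand-in for `DataFrameWriter.saveAsTable`, not the actual kedro source; it only illustrates why reading `format` with `get` (rather than `pop`) causes the keyword to be passed twice:

```python
def save_as_table(name, format="hive", **options):
    """Hypothetical stand-in for DataFrameWriter.saveAsTable."""
    return {"name": name, "format": format, "options": options}

save_args = {"format": "delta", "mode": "overwrite"}

# The constructor reads the format but leaves it in save_args ...
fmt = save_args.get("format", "hive")

# ... so the later call passes `format` twice and raises a TypeError.
try:
    save_as_table("facts.volume", format=fmt, **save_args)
except TypeError as err:
    print(err)  # got multiple values for keyword argument 'format'

# Popping the key instead passes `format` exactly once.
save_args = {"format": "delta", "mode": "overwrite"}
fmt = save_args.pop("format", "hive")
result = save_as_table("facts.volume", format=fmt, **save_args)
print(result["format"])  # delta
```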
Context
I am trying to save a table using the delta format, which is possible using the PySpark API but currently not supported by `SparkHiveDataSet`. With the current implementation, the only supported format is the `'hive'` default.

A possible solution would be to `pop` the `'format'` value from `save_args` if it exists, so that it is passed to the writer only once.

Your Environment
Include as many relevant details about the environment in which you experienced the bug:

- Kedro version used (`pip show kedro` or `kedro -V`): 0.18.0
- Python version used (`python -V`): 3.9