
Add Spark fileoutputcommitter configuration to BigQuery offload template #183

Open
nj1973 opened this issue May 23, 2024 · 1 comment
Labels: enhancement (New feature or request)

Comments


nj1973 commented May 23, 2024

From https://spark.apache.org/docs/latest/cloud-integration.html:

For object stores whose consistency model means that rename-based commits are safe use the FileOutputCommitter v2 algorithm for performance; v1 for safety.
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2

The page also lists Google Cloud Storage (gs) as a safe object store. Therefore, when staging to GCS, we should use the v2 algorithm and can add the following to the offload.env.template.bigquery template file:

export OFFLOAD_TRANSPORT_SPARK_PROPERTIES='{"spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": 2}'
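For illustration only, here is a minimal PySpark sketch of what the property does once it reaches a Spark session. The bucket paths are hypothetical, and it assumes the OFFLOAD_TRANSPORT_SPARK_PROPERTIES JSON is applied verbatim as session configuration; this is not taken from the template itself.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("offload-committer-sketch")
        # FileOutputCommitter v2: task output is moved into the final destination
        # at task commit time, so job commit avoids a second round of renames.
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .getOrCreate()
    )

    df = spark.read.parquet("gs://example-staging-bucket/input/")              # hypothetical path
    df.write.mode("overwrite").parquet("gs://example-staging-bucket/staged/")  # hypothetical path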

We need to verify the information above is still accurate before working on this.

nj1973 added the enhancement label on May 23, 2024

nj1973 (Collaborator, Author) commented May 23, 2024

This should also apply to Snowflake when using GCS/Azure transport.
