
Add Spark fileoutputcommitter configuration to BigQuery offload template #183

Open
nj1973 opened this issue May 23, 2024 · 1 comment
Labels: enhancement (New feature or request)

Comments


nj1973 commented May 23, 2024

From https://spark.apache.org/docs/latest/cloud-integration.html:

For object stores whose consistency model means that rename-based commits are safe use the FileOutputCommitter v2 algorithm for performance; v1 for safety.
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2

The page also lists Google Cloud Storage (gs) as a safe object store. Therefore, when staging to GCS, we should use the v2 algorithm and can add the following to the offload.env.template.bigquery template file:

export OFFLOAD_TRANSPORT_SPARK_PROPERTIES='{"spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": 2}'
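For illustration only, here is a minimal PySpark sketch of what the property does once it reaches a Spark session. The bucket paths are hypothetical, and it assumes the OFFLOAD_TRANSPORT_SPARK_PROPERTIES JSON is applied verbatim as session configuration; this is not taken from the template itself.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("offload-committer-sketch")
        # FileOutputCommitter v2: task output is moved into the final destination
        # at task commit time, so job commit avoids a second round of renames.
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .getOrCreate()
    )

    df = spark.read.parquet("gs://example-staging-bucket/input/")              # hypothetical path
    df.write.mode("overwrite").parquet("gs://example-staging-bucket/staged/")  # hypothetical path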

We need to verify the information above is still accurate before working on this.

nj1973 added the enhancement label on May 23, 2024

nj1973 (Collaborator, Author) commented May 23, 2024

This should also apply to Snowflake when using GCS/Azure transport.
