
Saving to an EU location via the spark API #44

Closed
samelamin opened this issue Jan 19, 2017 · 5 comments
@samelamin

Hi

When saving to BQ using saveAsNewAPIHadoopDataset, the data defaults to the US location. Is there any way to save to the EU location instead?

Setting the Hadoop configuration to EU doesn't seem to be picked up by the connector.
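
For reference, this is roughly what my write looks like (project, dataset, and table names are placeholders, other required keys are omitted for brevity, and I'm only assuming the key side of each pair is ignored by the output format):

```scala
import com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat
import com.google.gson.JsonObject
import org.apache.hadoop.io.NullWritable
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("bq-eu-write"))

// Rows to write as (ignored key, JSON row) pairs; built elsewhere.
val rows: RDD[(NullWritable, JsonObject)] = ???

val conf = sc.hadoopConfiguration
conf.set("mapred.bq.project.id", "my-project")
conf.set("mapred.bq.output.project.id", "my-project")
conf.set("mapred.bq.output.dataset.id", "my_eu_dataset")
conf.set("mapred.bq.output.table.id", "my_table")
// (other keys, e.g. the output table schema, omitted for brevity)
conf.set("mapreduce.job.outputformat.class",
  classOf[BigQueryOutputFormat[_, _]].getName)

rows.saveAsNewAPIHadoopDataset(conf)
```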

@samelamin
Author

@dennishuo is this project still being maintained? No commits since December and no replies on the issues :(

@dennishuo
Contributor

Sorry for the delay, indeed this project is still being maintained.

I believe the location is a property of the destination BigQuery dataset, and IIRC the connector doesn't auto-create the output dataset if it doesn't exist yet. The location in the configuration also has to match the destination dataset, because it is used for the temporary dataset that holds uncommitted results before they are appended into the destination table during commitTask.

You should make sure the destination dataset you pre-create is already in EU, and keep setting the configuration key to match.
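
For example, something along these lines creates the destination dataset in EU up front (using the google-cloud-bigquery client; the dataset name is a placeholder):

```scala
import com.google.cloud.bigquery.{BigQueryOptions, DatasetInfo}

// Placeholder dataset name; the important part is pinning the location to EU
// before any job writes to it.
val bigquery = BigQueryOptions.getDefaultInstance.getService
bigquery.create(
  DatasetInfo.newBuilder("my_eu_dataset")
    .setLocation("EU")
    .build())
```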

Incidentally, are you using the older com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat or the newer com.google.cloud.hadoop.io.bigquery.output.IndirectBigQueryOutputFormat?

@samelamin
Author

Thanks for the reply! I was using the older version, so perhaps that was the reason.

I will check if that sorts the problem out.

Out of interest, will using the newer version give any performance improvements?

What's the difference between the old and the new?

@dennishuo
Contributor

Indeed, the newer version has been measured to have better performance.

The difference is that the old version tried to get too fancy and wrote straight into BigQuery temporary tables, calling "CopyTable" inside commitTask. This was a "simpler" flow since it doesn't require intermediate storage outside of BigQuery, but unfortunately it consumes much more "BigQuery load jobs" quota, since every task first commits its own temporary table.

The new version writes to GCS first via the GCS connector for Hadoop, and on commitJob calls a single BigQuery "load" to ingest from GCS. In theory this could be slower, because BigQuery doesn't even begin its backend ingest until commitJob, but it turns out the overhead of per-task BigQuery loads is much higher, so overall the newer version is faster.
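
Roughly, wiring up the newer output format looks something like the sketch below (treat it as a sketch only: the table name and GCS path are placeholders, `sc` is the SparkContext from the earlier snippet, and the exact BigQueryOutputConfiguration.configure overload can vary between connector releases):

```scala
import com.google.cloud.hadoop.io.bigquery.BigQueryFileFormat
import com.google.cloud.hadoop.io.bigquery.output.{BigQueryOutputConfiguration, IndirectBigQueryOutputFormat}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

val conf = sc.hadoopConfiguration

// Destination table plus a GCS staging path for the intermediate files
// (placeholders; the staging bucket should be in the same location as the dataset).
BigQueryOutputConfiguration.configure(
  conf,
  "my-project:my_eu_dataset.my_table",
  null, // table schema; left null here for brevity
  "gs://my-eu-staging-bucket/tmp/bq-staging",
  BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
  classOf[TextOutputFormat[_, _]])

// Tell the Hadoop job to use the newer, GCS-backed output format.
conf.set("mapreduce.job.outputformat.class",
  classOf[IndirectBigQueryOutputFormat[_, _]].getName)
```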

We're still in the process of switching documentation over to recommending the new connector version as the default; since it's newer it's possible there are new bugs we haven't found yet, but generally since it's just built on the well-tested GCS connector for Hadoop, it's expected to be fairly stable nonetheless.

Note, however, that neither the old nor the new version creates a new BigQuery "dataset", and the location is still determined by the dataset's location rather than by the config key.

If you use the new version, the GCS staging location you specify should also be in EU if you're going to load it into an EU BigQuery dataset.
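
For example, something like this (bucket name is a placeholder; using the google-cloud-storage client) creates a staging bucket co-located with the EU dataset:

```scala
import com.google.cloud.storage.{BucketInfo, StorageOptions}

// Placeholder bucket name; co-locate the staging bucket with the EU dataset.
val storage = StorageOptions.getDefaultInstance.getService
storage.create(
  BucketInfo.newBuilder("my-eu-staging-bucket")
    .setLocation("EU")
    .build())
```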

@samelamin
Author

I see. Well, thanks for the clarification and of course your help!
