"java.io.IOException: Not an Avro data file" when using AvroBigQueryInputFormat #14
More stacktrace from a failed executor.
Thanks for the report, @nevillelyh. I'm going to start looking at this. AFAIU, BigQuery shouldn't be producing 0-length files, but instead 0-record files. As a workaround, consider setting mapred.bq.input.sharded.export.enable to false (or AbstractBigQueryInputFormat.setEnableShardedOutput(conf, false)). The overall job will run more slowly as BigQuery won't be writing records to GCS as frequently, but it may allow you to make progress.
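A minimal sketch of that workaround, assuming a Spark job (either the raw key or the typed helper named above sets the same flag):

```scala
// Sketch of the suggested workaround on a Spark job's Hadoop configuration.
import com.google.cloud.hadoop.io.bigquery.AbstractBigQueryInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("bq-read"))
val conf = sc.hadoopConfiguration

// Raw key, as named in the comment above:
conf.set("mapred.bq.input.sharded.export.enable", "false")
// Equivalent typed helper:
AbstractBigQueryInputFormat.setEnableShardedOutput(conf, false)
```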
I tried adding the following line: conf.set(BigQueryConfiguration.ENABLE_SHARDED_EXPORT_KEY, "false"), but I'm getting this stack trace now:
@nevillelyh - apologies for the brokenness here. I've opened PR #17 to fix unsharded exports. For the original issue of Avro files being invalid with sharded exports: BigQuery is writing an initial 0-byte file to GCS before writing the finalized object, which does not seem like valid behavior (and is different from how JSON is exported), and a bug is open with the BigQuery team on this. I have not yet seen the 0-byte file issue with unsharded exports.
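That also explains the exact exception in the title: Avro's reader validates the 4-byte magic header at the start of a file, and a 0-byte object cannot contain it. A quick sketch against the plain Avro API (not the connector's code):

```scala
// Reading a 0-byte file with Avro's DataFileReader fails the magic-header
// check, reproducing the "Not an Avro data file" IOException.
import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

val empty = File.createTempFile("shard-", ".avro") // 0 bytes, like the stray GCS object
try {
  DataFileReader.openReader(empty, new GenericDatumReader[GenericRecord]())
} catch {
  case e: java.io.IOException => println(e) // java.io.IOException: Not an Avro data file
}
```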
I've done some testing and BigQuery is no longer writing 0-length files as part of sharded exports.
Code looks like this:
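(The snippet itself was lost from this extract; a hypothetical reconstruction of a minimal AvroBigQueryInputFormat read in Spark follows. The project, dataset, and bucket names are placeholders, not the reporter's values.)

```scala
// Hypothetical reconstruction, not the reporter's original code.
import com.google.cloud.hadoop.io.bigquery.{AvroBigQueryInputFormat, BigQueryConfiguration}
import org.apache.avro.generic.GenericData
import org.apache.hadoop.io.LongWritable
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("bq-avro-read"))
val conf = sc.hadoopConfiguration

conf.set(BigQueryConfiguration.PROJECT_ID_KEY, "my-project") // placeholder
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, "my-bucket")  // placeholder
BigQueryConfiguration.configureBigQueryInput(conf, "my-project:my_dataset.my_table")

// AvroBigQueryInputFormat yields (LongWritable, GenericData.Record) pairs.
val rdd = sc.newAPIHadoopRDD(
  conf,
  classOf[AvroBigQueryInputFormat],
  classOf[LongWritable],
  classOf[GenericData.Record])

println(rdd.count())
```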
Stacktrace:
I saw these in the log:
15/10/20 14:51:54 INFO bigquery.AbstractBigQueryInputFormat: Resolved GCS export path: 'gs://starship/hadoop/tmp/bigquery/job_201510201451_0002'
15/10/20 14:51:55 INFO bigquery.ShardedExportToCloudStorage: Computed '2' shards for sharded BigQuery export.
15/10/20 14:51:55 INFO bigquery.ShardedExportToCloudStorage: Table 'prefab-wave-844:bigquery_staging.spark_query_20151020144723_1201584542' to be exported has 4251779 rows and 1641186694 bytes
15/10/20 14:51:55 INFO bigquery.ShardedExportToCloudStorage: Computed '2' shards for sharded BigQuery export.
15/10/20 14:51:55 INFO bigquery.ShardedExportToCloudStorage: Table 'prefab-wave-844:bigquery_staging.spark_query_20151020144723_1201584542' to be exported has 4251779 rows and 1641186694 bytes
I verified that the export path indeed contains Avro files and nothing else.
I also tried reading the export path with spark-avro and that works fine.
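For reference, a sketch of that cross-check against the Databricks spark-avro API of that era (the export path is taken from the log above):

```scala
// Sketch: read the BigQuery export directory directly with spark-avro
// (Databricks spark-avro, Spark 1.x-era API).
import com.databricks.spark.avro._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("avro-check"))
val sqlContext = new SQLContext(sc)

// Export path from the log lines above
val df = sqlContext.read.avro("gs://starship/hadoop/tmp/bigquery/job_201510201451_0002")
println(df.count())
```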