bigquery-table-to-one-file

Using Google Cloud Dataflow, this trivial Java application reads a table in BigQuery and writes it out as one GZIP-compressed file in GCS. Why? Because BigQuery currently only supports unsharded (single-file) exports of up to 1 GB.
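Once the job has finished and you've downloaded the file, a quick sanity check is to count its lines and compare that with the row count of the source table. Below is a minimal sketch of that check using only the JDK. It assumes the output is newline-delimited text with one record per line, which you should confirm against your own output; the class name `GzipLineCounter` is made up for this example and is not part of this repo.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper (not part of this repo): counts the lines in a
// GZIP-compressed text file that has been downloaded from GCS.
final class GzipLineCounter {

    // Streams the file through GZIPInputStream so the whole file is
    // never held in memory, which matters for a ~100 GB export.
    static long countLines(Path gzipFile) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzipFile)),
                StandardCharsets.UTF_8))) {
            long count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: GzipLineCounter <path-to-file.gz>");
            return;
        }
        System.out.println(countLines(Paths.get(args[0])));
    }
}
```

If the count doesn't match the table's row count, check whether your records can contain embedded newlines before assuming rows were lost.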

https://cloud.google.com/bigquery/docs/exporting-data
https://cloud.google.com/dataflow/

It uses Application Default Credentials, picked up from the key file that the environment variable GOOGLE_APPLICATION_CREDENTIALS points to. See all about that here: https://developers.google.com/identity/protocols/application-default-credentials
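If the job fails with an authentication error, it's worth checking that the variable is set and actually points at a readable file. Here is a rough pre-flight check, again JDK only. The class and method names are invented for this example. It only looks at the environment variable, so it says nothing about the other places Application Default Credentials can come from (for example gcloud's own login or the metadata server on GCE).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Optional;

// Hypothetical pre-flight check (not part of this repo).
final class CredentialsCheck {

    static final String ENV_VAR = "GOOGLE_APPLICATION_CREDENTIALS";

    // Returns the key file path if the variable is set and points to a
    // readable regular file, otherwise an empty Optional. The environment
    // is passed in as a map so the method can be exercised without
    // touching the real process environment.
    static Optional<Path> resolveKeyFile(Map<String, String> env) {
        String value = env.get(ENV_VAR);
        if (value == null || value.isEmpty()) {
            return Optional.empty();
        }
        Path path = Paths.get(value);
        boolean usable = Files.isRegularFile(path) && Files.isReadable(path);
        return usable ? Optional.of(path) : Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> keyFile = resolveKeyFile(System.getenv());
        if (keyFile.isPresent()) {
            System.out.println("Using key file: " + keyFile.get());
        } else {
            System.out.println(ENV_VAR + " is unset or does not point to a readable file");
        }
    }
}
```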

In the code, change the table name, bucket details etc. to suit your needs. You will also need to create the GCS bucket(s) yourself. I didn't bother making them CLI parameters.

To run:

--project=<your_project_id> --runner=DataflowRunner --jobName=bigquery-table-to-one-file --maxNumWorkers=50 --zone=australia-southeast1-a --stagingLocation=gs://<your_bucket>/jars --tempLocation=gs://<your_bucket>/tmp
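If you'd rather not hand-edit that line each time, the arguments can be assembled from just the project ID and bucket name. The sketch below does that with plain string handling. The class name `PipelineArgs` is made up, and the worker count and zone are simply the values from the line above, so change them to suit.

```java
import java.util.Arrays;

// Hypothetical helper (not part of this repo): builds the Dataflow
// arguments shown in this README from a project ID and a bucket name.
final class PipelineArgs {

    static String[] build(String projectId, String bucket) {
        if (projectId == null || projectId.isEmpty()) {
            throw new IllegalArgumentException("projectId must not be empty");
        }
        if (bucket == null || bucket.isEmpty()) {
            throw new IllegalArgumentException("bucket must not be empty");
        }
        return new String[] {
            "--project=" + projectId,
            "--runner=DataflowRunner",
            "--jobName=bigquery-table-to-one-file",
            "--maxNumWorkers=50",                      // value taken from this README
            "--zone=australia-southeast1-a",           // value taken from this README
            "--stagingLocation=gs://" + bucket + "/jars",
            "--tempLocation=gs://" + bucket + "/tmp"
        };
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: PipelineArgs <project_id> <bucket>");
            return;
        }
        // Print space-separated, ready to paste into a run configuration.
        System.out.println(String.join(" ", Arrays.asList(build(args[0], args[1]))));
    }
}
```

The resulting array can be handed to the application's `main` method, or the printed line pasted into your IDE's run configuration.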

It should look like this when it's running. I tested it with the public WIKI table (1 billion rows & ~100GB) and it took about 6 hours using 50 n1-standard-1 workers:

polleyg/gcp-bigquery-table-to-one-file
