mhandria/bigquery-ingest-avro-dataflow-sample

This tutorial shows how to store Avro SpecificRecord objects in BigQuery using Cloud Dataflow. The pipeline automatically generates the BigQuery table schema and transforms the input elements into table rows. The tutorial also shows how to use Avro-generated classes to materialize or transmit intermediate data between workers in your Cloud Dataflow pipeline.

Please refer to the related article for all the steps to follow in this tutorial.
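
To make the idea of generating a table schema from an Avro schema more concrete, here is a minimal sketch in Python. It is not the code from this repository (the pipeline in BeamAvro is written in Java). The `Event` record, its fields, and the helper names `to_bq_field` and `to_bq_schema` are hypothetical and exist only for illustration. The type mapping follows the usual Avro-to-BigQuery conventions and ignores logical types, maps, and unions of several non-null types.

```python
import json

# Avro types and the BigQuery types they are commonly mapped to.
# Assumption: logical types (dates, decimals, ...) are out of scope here.
TYPE_MAP = {
    "string": "STRING",
    "int": "INTEGER",
    "long": "INTEGER",
    "float": "FLOAT",
    "double": "FLOAT",
    "boolean": "BOOLEAN",
    "bytes": "BYTES",
    "enum": "STRING",
}


def to_bq_field(name, avro_type, mode="REQUIRED"):
    """Convert one Avro field type into a BigQuery field description."""
    # A union such as ["null", "string"] becomes a NULLABLE column.
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) != 1:
            raise ValueError(f"unsupported union for field {name!r}")
        if len(non_null) < len(avro_type) and mode != "REPEATED":
            mode = "NULLABLE"
        return to_bq_field(name, non_null[0], mode)
    # Complex types are described by a nested JSON object.
    if isinstance(avro_type, dict):
        kind = avro_type["type"]
        if kind == "array":
            return to_bq_field(name, avro_type["items"], "REPEATED")
        if kind == "record":
            nested = [to_bq_field(f["name"], f["type"]) for f in avro_type["fields"]]
            return {"name": name, "type": "RECORD", "mode": mode, "fields": nested}
        return to_bq_field(name, kind, mode)
    return {"name": name, "type": TYPE_MAP[avro_type], "mode": mode}


def to_bq_schema(record):
    """Convert a top-level Avro record schema into a list of BigQuery fields."""
    return [to_bq_field(f["name"], f["type"]) for f in record["fields"]]


# Hypothetical Avro schema, used only to exercise the mapping.
AVRO_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "note", "type": ["null", "string"]},
    {"name": "tags", "type": {"type": "array", "items": "string"}}
  ]
}
""")

if __name__ == "__main__":
    print(json.dumps(to_bq_schema(AVRO_SCHEMA), indent=2))
```

The mapping is recursive because Avro schemas nest: a union resolves to its non-null branch, an array resolves to its item type with mode REPEATED, and a record resolves to a RECORD column whose fields are converted the same way.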

Contents of this repository:

  • BeamAvro: Java code for the Apache Beam pipeline deployed on Cloud Dataflow.
  • generator: Python code for the randomized event generator.

To run the example:

  1. Update the configuration in env.sh
  2. Set environment variables
    source env.sh
  3. Generate Java beans from the Avro file and run the Dataflow pipeline:
    mvn clean generate-sources compile exec:java \
      -Dexec.mainClass=com.google.cloud.solutions.beamavro.AvroToBigQuery \
      -Dexec.cleanupDaemonThreads=false \
      -Dexec.args=" \
    --project=$GOOGLE_CLOUD_PROJECT \
    --runner=DataflowRunner \
    --stagingLocation=gs://$MY_BUCKET/stage/ \
    --tempLocation=gs://$MY_BUCKET/temp/ \
    --inputPath=projects/$GOOGLE_CLOUD_PROJECT/topics/$MY_TOPIC \
    --workerMachineType=n1-standard-1 \
    --maxNumWorkers=$NUM_WORKERS \
    --region=$REGION \
    --dataset=$BQ_DATASET \
    --bqTable=$BQ_TABLE \
    --outputPath=$AVRO_OUT" \
    --file BeamAvro/pom.xml
  4. Run the event generation script:
    1. Create a Python virtual environment
      python3 -m venv ~/generator-venv
      source ~/generator-venv/bin/activate
    2. Install Python dependencies
      pip install -r generator/requirements.txt
    3. Run the generator
      python generator/gen.py -p $GOOGLE_CLOUD_PROJECT -t $MY_TOPIC -n 100 -f avro
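
The commands above reference several environment variables. Before launching the pipeline or the generator, it can help to check that none of them is empty. The following is a minimal sketch: the variable names are taken from the commands above, while the helper name `missing_vars` is hypothetical. It assumes that env.sh exports the variables, since variables that are set but not exported are not visible to a Python process.

```python
import os

# Variables referenced by the mvn and generator commands in this README.
REQUIRED_VARS = [
    "GOOGLE_CLOUD_PROJECT",
    "MY_BUCKET",
    "MY_TOPIC",
    "NUM_WORKERS",
    "REGION",
    "BQ_DATASET",
    "BQ_TABLE",
    "AVRO_OUT",
]


def missing_vars(env=None):
    """Return the required variables that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Unset variables: " + ", ".join(missing))
    else:
        print("All variables referenced by the commands are set.")
```

Run it in the same shell session in which you ran source env.sh, so that it sees the same environment as the mvn and python commands.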
