This repo holds the sample code required to deploy a Dataflow job using Flex Templates.

Dataflow is a fully managed Apache Beam service provided by Google Cloud Platform (GCP). You only have to pay for the resources your pipeline uses; the rest (spawning, scaling, and teardown) is handled by GCP.
Flex Templates use a metadata definition method for our Apache Beam pipeline, allowing us to pass dynamic parameters because the Directed Acyclic Graph (DAG) for the pipeline is created on-the-fly.

To learn more about Flex Templates, please read my blog article.
Here is the structure of this repo.
The source code of the pipeline is stored inside the `src` folder, which contains a `demo` folder for our sample pipeline. If we need to have multiple pipelines in the same repo, we can add them inside `src`.

Each pipeline (i.e. `demo`) ideally has the folder structure below.
```
├── Dockerfile
├── resources
│   ├── image_spec.json
│   └── python_command_spec.json
└── src
    ├── demo
    │   ├── __init__.py
    │   ├── main.py
    │   ├── pipeline
    │   │   ├── __init__.py
    │   │   └── word_count.py
    │   ├── requirements.txt
    │   ├── setup.py
    │   └── spec
    │       └── template_metadata
```
- `Dockerfile`: This contains the Docker image definition for our pipeline. There are several important values to be set, and they're mentioned inside the `Dockerfile`.
- `resources/image_spec.json`: This file declares the Docker image from Google Container Registry (GCR) that contains the source code for the pipeline, so we must push our pipeline to GCR as an image before triggering the pipeline.
- `resources/python_command_spec.json`: This contains the main file for the pipeline. Usually this points to `/dataflow/pipeline/main.py`. This file can be skipped if we set `FLEX_TEMPLATE_PYTHON_PY_FILE` inside the `Dockerfile`.
- `main.py`: The entry point for the pipeline. This extracts the arguments for the pipeline and triggers it.
- `pipeline/*`: This folder contains the source code for our pipeline. We usually create a `pipeline.py` with a `run()` function which triggers the Beam pipeline. However, feel free to use any structure as long as you call `pipeline.run()` (or a function which does it) from `main.py`. Please note that we use `word_count.py` to define our pipeline in this example (see the illustrative sketch after this list).
- `setup.py`: This file contains the deployment instructions for the pipeline source code package. It also lists the third-party dependencies for our pipeline, including `apache-beam` (a minimal sketch follows this list).
- `requirements.txt`: This file contains the requirements for running our pipeline locally. It will most probably have the same dependencies as those mentioned in the setup file.
- `spec/template_metadata`: This file specifies the metadata for the pipeline as a template, including the parameters for our pipeline, and it should be built as a Flex Template before use (see the example after this list).
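To make the pipeline layout concrete, here is a minimal sketch of what `pipeline/word_count.py` could look like. This is illustrative only and not the repo's actual code; the option names are assumptions for this example.

```python
# pipeline/word_count.py -- illustrative sketch, not the repo's actual code
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(input_path, output_path, pipeline_args=None):
    """Build and run a simple word-count pipeline."""
    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText(input_path)
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            | "Count" >> beam.CombinePerKey(sum)
            | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
            | "Write" >> beam.io.WriteToText(output_path)
        )
```

A matching `main.py` would then only extract the arguments and call `run()`; the `--input`/`--output` flags below are hypothetical:

```python
# main.py -- illustrative sketch: extract the arguments and trigger the pipeline
import argparse

from pipeline import word_count  # assumes the package root is src/demo

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="GCS path to read from")
    parser.add_argument("--output", required=True, help="GCS path to write to")
    known_args, pipeline_args = parser.parse_known_args()
    word_count.run(known_args.input, known_args.output, pipeline_args)
```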
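Similarly, a minimal `setup.py` could look like the sketch below; the package name and version are placeholders rather than this repo's actual values.

```python
# setup.py -- illustrative sketch: packages the pipeline code for Dataflow workers
import setuptools

setuptools.setup(
    name="demo-pipeline",  # placeholder name
    version="0.1.0",
    packages=setuptools.find_packages(),
    install_requires=[
        "apache-beam[gcp]",  # third-party dependencies go here; pin versions as needed
    ],
)
```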
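For `spec/template_metadata`, the file follows the Dataflow template metadata format. The example below is illustrative; the parameter names (`input`, `output`) are assumptions and not necessarily the ones this pipeline defines.

```json
{
  "name": "demo-word-count",
  "description": "A sample word-count pipeline deployed as a Flex Template.",
  "parameters": [
    {
      "name": "input",
      "label": "Input file pattern",
      "helpText": "GCS path of the file(s) to read, e.g. gs://bucket/path/*.txt",
      "isOptional": false
    },
    {
      "name": "output",
      "label": "Output location",
      "helpText": "GCS path prefix to write the results to.",
      "isOptional": false
    }
  ]
}
```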
- First of all, we need to point to the GCP project we're going to use by setting the `PROJECT` variable in the `Makefile`.
- Make sure that you have a Google Cloud Storage (GCS) bucket defined with `TEMPLATE_BUCKET` in the project and that you have access to it with the current Service Account.
- Make sure to update the region you want Dataflow to run in by setting the `REGION` value in the `Makefile`.
- Check whether you have the Docker client running.
- Make sure you can push to GCR.
  - Authenticate for GCP: `gcloud auth login`
  - Authenticate for Docker: `gcloud auth configure-docker`
- If this is the first time you're using the project with Dataflow, you might need to enable the Dataflow API for the project.
- Create a virtual environment for the project and activate it.

  ```bash
  make init
  source ./.venv/bin/activate
  ```
- Install the dependencies.

  ```bash
  make build
  ```
- Build the Flex Template.

  ```bash
  make template
  ```

  This command runs the `build-template` recipe, which replaces some variables defined in the `Dockerfile` to create a new `Dockerfile` inside the `_tmp` folder. This new `Dockerfile` is then used to build the pipeline code.
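For reference, a Python Flex Template `Dockerfile` typically extends Google's template launcher base image and sets a few `FLEX_TEMPLATE_*` environment variables. The sketch below is only illustrative; the paths and copy locations are assumptions, not this repo's exact `Dockerfile`.

```dockerfile
# Illustrative sketch of a Python Flex Template Dockerfile (not this repo's exact file)
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

ARG WORKDIR=/dataflow/pipeline
WORKDIR ${WORKDIR}

# Copy the pipeline package into the image (path is an assumption)
COPY src/demo/ ${WORKDIR}/

# Tell the template launcher where to find the entry point and its dependencies
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"

# Optionally pre-install the dependencies into the launcher image
RUN pip install --no-cache-dir -U -r ${WORKDIR}/requirements.txt
```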
With these steps, we have built a Docker image with our pipeline source code and pushed it to GCR. Now we can trigger it using the Dataflow API.
- Make sure the GCS paths defined in the pipeline parameters are present.
- Make sure your project has access to the subnet defined in the `subnetwork` parameter to run Dataflow jobs.
You simply have to run the `run_dataflow.py` file, which already has the basic code for triggering the Dataflow pipeline using Flex Templates.

```bash
make run
```
Alternatively, you can use the `gcloud beta dataflow flex-template run` command to trigger the pipeline. You can read more on how to do it in the docs.
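For reference, a launcher along the lines of `run_dataflow.py` could call the Flex Template launch method of the Dataflow API roughly as below. This is a hedged sketch: the project, bucket, subnetwork, and parameter values are placeholders, and the repo's actual `run_dataflow.py` may be structured differently. Running it requires the `google-api-python-client` package and application default credentials.

```python
# Illustrative sketch of launching a Flex Template job (run_dataflow.py may differ)
from googleapiclient.discovery import build

PROJECT = "my-gcp-project"   # placeholder
REGION = "europe-west1"      # placeholder
TEMPLATE_PATH = "gs://my-template-bucket/templates/demo.json"  # placeholder spec path


def launch_job():
    dataflow = build("dataflow", "v1b3")
    body = {
        "launchParameter": {
            "jobName": "demo-word-count",
            "containerSpecGcsPath": TEMPLATE_PATH,
            "parameters": {
                "input": "gs://my-bucket/input/*.txt",     # placeholder
                "output": "gs://my-bucket/output/result",  # placeholder
            },
            "environment": {
                # The project must have access to this subnetwork (see above)
                "subnetwork": "regions/europe-west1/subnetworks/my-subnet",
            },
        }
    }
    request = dataflow.projects().locations().flexTemplates().launch(
        projectId=PROJECT, location=REGION, body=body
    )
    print(request.execute())


if __name__ == "__main__":
    launch_job()
```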
You can use the `Makefile` recipes in any CI/CD framework you prefer; however, you need to set `GOOGLE_APPLICATION_CREDENTIALS` to the key file of a proper Service Account.