Base project for creating Python Apache Beam pipelines and running them in Google Dataflow on a CRON schedule

Google Dataflow Python - App Engine deployment

Full description and implementation details are in my blog post:

http://zablo.net/blog/post/python-apache-beam-google-dataflow-cron


This repository contains a basic project that can be used as an example of how to deploy a Google Dataflow (Apache Beam) pipeline to App Engine in order to run it as a CRON job. It only works in the App Engine Flexible Environment, because of the I/O used by Apache Beam (on App Engine Standard it throws an error about a read-only file system).
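
In practice the CRON job periodically pings an HTTP endpoint of the App Engine app, and the request handler submits the Beam pipeline to the Dataflow service. The snippet below is only an illustrative sketch of such a handler (Flask-based; the endpoint path, project ID, bucket and pipeline steps are placeholders), not the actual main.py from this repository.

```python
# Hypothetical sketch of the endpoint that the CRON job pings.
# Endpoint path, project ID, bucket and pipeline steps are placeholders.
from flask import Flask

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

app = Flask(__name__)


def submit_pipeline():
    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-gcp-project',            # placeholder GCP project ID
        temp_location='gs://my-bucket/tmp',  # placeholder staging bucket
        setup_file='./setup.py',             # lets Dataflow ship local packages to workers
    )
    pipeline = beam.Pipeline(options=options)
    (pipeline
     | 'Create' >> beam.Create(['hello', 'world'])
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/result'))
    pipeline.run()  # submits the job to Dataflow without waiting for it to finish


@app.route('/start-dataflow')
def start_dataflow():
    submit_pipeline()
    return 'Dataflow pipeline submitted', 200
```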

Description

  1. The setup.py file is important - without it, the Dataflow engine cannot distribute packages across the dynamically spawned Dataflow workers
  2. app.yaml contains the definition of the App Engine app which spawns the Dataflow pipeline
  3. cron.yaml contains the definition of the App Engine CRON job, which pings one of the app's endpoints (in order to spawn the Dataflow pipeline)
  4. appengine_config.py adds the locally installed packages (from the lib folder) to the app's import path (illustrative sketches of all four files follow below)
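
A minimal setup.py for a Beam pipeline usually looks roughly like the sketch below; the package name and version are placeholders, not necessarily what this repository uses. The pipeline then points at this file via the setup_file pipeline option.

```python
# setup.py - minimal sketch; name and version are placeholders
import setuptools

setuptools.setup(
    name='dataflow_pipeline',
    version='0.0.1',
    packages=setuptools.find_packages(),  # picks up the local pipeline packages
    install_requires=[],                  # extra worker dependencies go here
)
```

appengine_config.py typically follows the standard lib-vendoring pattern sketched below; the repository's actual file may differ in detail.

```python
# appengine_config.py - common vendoring pattern (may differ from the actual file)
from google.appengine.ext import vendor

# Make packages installed with `pip install -t lib` importable by the app.
vendor.add('lib')
```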

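The two YAML files can be pictured roughly as follows; the service settings, endpoint URL, schedule and Python version shown here are assumptions for illustration, not values copied from this repository.

```yaml
# app.yaml - hypothetical sketch of a Flexible Environment app definition
runtime: python
env: flex
entrypoint: gunicorn -b :$PORT main:app   # assumes a Flask app object in main.py

runtime_config:
  python_version: 2
```

```yaml
# cron.yaml - hypothetical sketch; URL and schedule are placeholders
cron:
- description: trigger the Dataflow pipeline
  url: /start-dataflow
  schedule: every 24 hours
```
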
Instructions

  1. Remember to put __init__.py files into all local packages
  2. Install all required packages into the local lib folder: pip install -r requirements.txt -t lib
  3. To deploy the App Engine app, run: gcloud app deploy app.yaml
  4. To deploy the App Engine CRON job, run: gcloud app deploy cron.yaml (the full sequence is repeated below)
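
Put together, a typical deployment session is just the commands above in order:

```sh
# Vendor dependencies locally, then deploy the app and the CRON definition.
pip install -r requirements.txt -t lib
gcloud app deploy app.yaml
gcloud app deploy cron.yaml
```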