Telemetry-Airflow

Apache Airflow is a platform to programmatically author, schedule and monitor workflows.

Telemetry-Airflow

This repository codifies the Airflow cluster that is deployed at workflow.telemetry.mozilla.org (behind SSO) and commonly referred to as "WTMO" or simply "Airflow".

Some links relevant to users and developers of WTMO:

The dags directory in this repository contains some custom DAG definitions
Many of the DAGs registered with WTMO don't live in this repository, but are instead generated from ETL task definitions in bigquery-etl
The Data SRE team maintains a WTMO Developer Guide (behind SSO)

Contributing

Forking workflow is required ONLY for contributors without write access.

This repo enforces Conventional Commit style via the Github action Semantic PRs.

Writing DAGs

See the Airflow's Best Practices guide to help you write DAGs.

⚠ Warning: Do not import resources from the dags directory in DAGs definition files ⚠

As an example, if you have dags/dag_a.py and dags/dag_b.py and want to use a helper function in both DAG definition files, define the helper function in the utils directory such as:

utils/helper.py

def helper_function():
    return "Help"

dags/dag_a.py

from airflow import DAG

from utils.helper import helper_function

with DAG("dag_a", ...):
    ...

dags/dag_b.py

from airflow import DAG

from utils.helper import helper_function

with DAG("dag_b", ...):
    ...

WTMO deployments use git-sync sidecars to synchronize DAG files from multiple repositories via telemetry-airflow-dags using git submodules. Git-sync sidecar pattern results in the following directory structure once deployed.

airflow
├─ dags
│  └── repo
│      └── telemetry-airflow-dags
│          ├── <submodule repo_a>
│          │    └── dags
│          │        └── <dag files>
│          ├── <submodule repo_b>
│          │    └── dags
│          │        └── <dag files>
│          └── <submodule repo_c>
│               └── dags
│                   └── <dag files>
├─ utils
│  └── ...
└─ plugins
   └── ...

Hence, defining helper_function() in dags/dag_a.py and importing the function in dags/dag_b.py as from dags.dag_a import helper_function will not work after deployment because of the directory structured required for git-sync sidecars.

Prerequisites

This app is built and deployed with docker and docker-compose. Dependencies are managed with pip-tools pip-compile.

You'll also need to install PostgreSQL to build the database container.

Installing dependencies locally

⚠ Make sure you use the right Python version. Refer to Dockerfile for current supported Python Version ⚠

You can install the project dependencies locally to run tests with Pytest. We use the official Airflow constraints file to simplify Airflow dependency management. Install dependencies locally using the following command:

make pip-install-local

Updating Python dependencies

Add new Python dependencies into requirements.in or requirements-dev.in then execute the following commands:

make pip-compile
make pip-install-local

Build Container

Build Airflow image with

make build

macOS

Assuming you're using Docker for Docker Desktop for macOS, start the docker service, click the docker icon in the menu bar, click on preferences and change the available memory to 4GB.

Testing

Local Deployment

To deploy the Airflow container on the docker engine, with its required dependencies, run:

make build
make up

Adding dummy credentials

Tasks often require credentials to access external credentials. For example, one may choose to store API keys in an Airflow connection or variable. These variables are sure to exist in production but are often not mirrored locally for logistical reasons. Providing a dummy variable is the preferred way to keep the local development environment up to date.

Update the resources/dev_variables.env and resources/dev_connections.env with appropriate strings to prevent broken workflows.

Usage

You can now connect to your local Airflow web console at http://localhost:8080/.

All DAGs are paused by default for local instances and our staging instance of Airflow. In order to submit a DAG via the UI, you'll need to toggle the DAG from "Off" to "On". You'll likely want to toggle the DAG back to "Off" as soon as your desired task starts running.

Testing GKE Jobs (including BigQuery-etl changes)

See https://go.corp.mozilla.com/wtmodev for more details.

make build && make up
make gke

When done:
make clean-gke

From there, connect to Airflow and enable your job.

Testing Dataproc Jobs

Dataproc jobs run on a self-contained Dataproc cluster, created by Airflow.

To test these, jobs, you'll need a sandbox account and corresponding service account. For information on creating that, see "Testing GKE Jobs". Your service account will need Dataproc and GCS permissions (and BigQuery, if you're connecting to it). Note: Dataproc requires "Dataproc/Dataproc Worker" as well as Compute Admin permissions. You'll need to ensure that the Dataproc API is enabled in your sandbox project.

Ensure that your dataproc job has a configurable project to write to. Set the project in the DAG entry to be configured based on development environment; see the ltv.py job for an example of that.

From there, run the following:

make build && make up
./bin/add_gcp_creds $GOOGLE_APPLICATION_CREDENTIALS google_cloud_airflow_dataproc

You can then connect to Airflow locally. Enable your DAG and see that it runs correctly.

Debugging

Some useful docker tricks for development and debugging:

make clean

# Remove any leftover docker volumes:
docker volume rm $(docker volume ls -qf dangling=true)

# Purge docker volumes (helps with postgres container failing to start)
# Careful as this will purge all local volumes not used by at least one container.
docker volume prune

Production Setup

This repository was structured to be deployed using the offical Airflow Helm Chart. See the Production Guide for best practices.

Production Deployments

Production deployments are automatically triggered for every commit merged in the main branch. Docker images are tagged using the git commit short SHA and automatically deployed.

Refer to the CI configuration for more details!

Dev and Stage Deployments

Non-production deployments are automatically triggered for every commit merged in the main branch. Dev and Stage deployments use the latest image tag for deployments.

It is also possible to manually trigger the image building and pushing workflow manual-publish by manually tagging a commit. This can be achieved using git on your local machine, or by creating a pre-release GitHub Release with a tag prefixed by dev- on a non-main branch commit.

Refer to the CI configuration and deployment repository for more details!

Name		Name	Last commit message	Last commit date
Latest commit History 2,260 Commits
.circleci		.circleci
.github		.github
bin		bin
config		config
dags		dags
dataproc_bootstrap		dataproc_bootstrap
jobs		jobs
operators		operators
plugins		plugins
resources		resources
tests		tests
utils		utils
.dockerignore		.dockerignore
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
constraints.txt		constraints.txt
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
requirements-dev.in		requirements-dev.in
requirements-dev.txt		requirements-dev.txt
requirements-override.txt		requirements-override.txt
requirements.in		requirements.in
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Telemetry-Airflow

Contributing

Writing DAGs

Prerequisites

Installing dependencies locally

Updating Python dependencies

Build Container

macOS

Testing

Local Deployment

Adding dummy credentials

Usage

Testing GKE Jobs (including BigQuery-etl changes)

Testing Dataproc Jobs

Debugging

Production Setup

Production Deployments

Dev and Stage Deployments

About

Releases 45

Packages

Contributors 77

Languages

License

mozilla/telemetry-airflow

Folders and files

Latest commit

History

Repository files navigation

Telemetry-Airflow

Contributing

Writing DAGs

Prerequisites

Installing dependencies locally

Updating Python dependencies

Build Container

macOS

Testing

Local Deployment

Adding dummy credentials

Usage

Testing GKE Jobs (including BigQuery-etl changes)

Testing Dataproc Jobs

Debugging

Production Setup

Production Deployments

Dev and Stage Deployments

About

Topics

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases 45

Packages 0

Contributors 77

Languages

Packages