Awesome Apache Airflow
This is a curated list of resources about Apache Airflow. Please feel free to contribute any items that should be included. Items are generally added at the top of each section so that newer items are featured more prominently.
- Vital links
- Airflow deployment solutions
- Introductions and tutorials
- Airflow Summit 2020 Videos
- Best practices, lessons learned and cool use cases
- Books, blogs, podcasts, and such
- Slide deck presentations and online videos
- Libraries, Hooks, Utilities
- Commercial Airflow-as-a-service providers
- Cloud Composer resources
- Non-English resources
Vital links
- Source code (latest stable release 1.10.12)
- Documentation (also the official website)
- Confluence page
- Slack workspace
Airflow deployment solutions
- Installing Airflow on IBM Cloud - Quick and easy deployment on IBM Cloud with IBM Bitnami Charts
- Three ways to run Airflow on Kubernetes - Tim van de Keer walks through several methods for deploying Airflow on Kubernetes.
- Apache Airflow Multi-Tier Free Deployment on Azure - A free Azure Resource Manager (ARM) template by Bitnami providing a one-click solution for Airflow deployment on Azure for production use-cases.
- KubernetesExecutor Helm Chart - A lean Helm Chart using the KubernetesExecutor for a more k8s native experience and complementary KubernetesExecutor Docker Image.
- Stable Celery Helm Chart - Curated Helm Chart in the official stable chart repository.
- Puckel's Docker Image - @Puckel_'s well-crafted Docker image has become the base for many Airflow installations. It is regularly updated and closely tracks the official Apache releases.
- Kubernetes Custom Operator for Deploying Airflow - Kubernetes Custom controller (also called operator pattern) for deploying Airflow on Kubernetes.
- airflow-pipeline - Airflow Docker container that comes preconfigured for Spark and Hadoop. It can be docker pulled at
- aws-airflow-stack - An AWS based Airflow cluster deployment with CeleryExecutor. Deploys after a few clicks with CloudFormation.
- kube-airflow - This repository contains both an Airflow Docker image (that appears to have been based on Puckel's work) and Kubernetes service definition. mumoshu's repository has not been recently updated, but there are numerous forks that may be based on more recent releases.
- airflow-on-kubernetes - A guide on all relevant resources, scripts and projects that relate to running Airflow on Kubernetes.
- airflow-k8s-executor-on-GKE - A detailed tutorial for deploying a scalable, low-maintenance Airflow Kubernetes executor environment on Google Kubernetes Engine with Helm.
- airflow-cookbook - Chef cookbook for deploying Airflow.
- Running Airflow on top of Apache Mesos - Blog describing how to configure Mesos to run all of the Airflow components.
- Integrating Apache Airflow with Apache Ambari - Mykola Mykhalov walks through using Apache Ambari to configure and deploy an Airflow instance.
- Astronomer Platform - Apache Airflow as a Service on Kubernetes. For more information visit https://www.astronomer.io.
- Bitnami Airflow Docker image - A secure and up-to-date docker image for Airflow maintained by Bitnami.
- Bitnami Airflow Scheduler Docker image - A secure and up-to-date docker image for Airflow Scheduler maintained by Bitnami.
- Bitnami Airflow Worker Docker image - A secure and up-to-date docker image for Airflow Worker maintained by Bitnami. A CeleryExecutor docker-compose deployment is available here.
- Distribute & deploy Apache Airflow via Python PEX files - Example repo with steps to bundle, distribute, & deploy Apache Airflow as PEX files.
- Introducing KEDA for Airflow - How to use KEDA scaler system to enable autoscaling of celery workers based on data stored in the Airflow metadata database.
Introductions and tutorials
- Start Building Better Data Pipelines With apache Airflow 2020-Oct - Naman Gupta covers the basics of Airflow and its concepts.
- Airflow Repository Template - A boilerplate repository for developing locally with Airflow, with linting & tests for valid DAGs and plugins. Just clone and run make start-airflow to get started! Add some CI jobs to deploy your code and you're done.
- Automate AWS Tasks Thanks to Airflow Hooks - A step by step tutorial to understand how to connect your Airflow pipeline to S3.
- How Apache Airflow Distributes Jobs on Celery workers - A short description of the steps taken by a task instance, from scheduling to success, in a distributed architecture.
- Remote spark-submit to YARN running on EMR - Azhaguselvan walks through submitting Spark jobs to existing EMR clusters with Airflow.
- Running Airflow on top of Apache Mesos and its follow-up, Mesos, Airflow & Docker - Agraj Mangal gives a quick overview of running Airflow atop Apache Mesos.
- Dustin Stansbury of Quizlet has written a four-part series that covers what workflow managers do in general, how Quizlet picked Airflow, a tour of Airflow's key concepts, and how Quizlet is now using Airflow in practice:
- Integrating Apache Airflow with Databricks - While this tutorial is focused specifically on Databricks' Spark solutions, it does have a reasonable overview of Airflow basics and demonstrates how a third party solution can quickly integrate into Airflow.
- Apache Airflow: Tutorial and Beginners Guide - This article discusses the basic concepts that stand behind Airflow and discusses the problems it solves.
- Testing and debugging Apache Airflow - Article explaining how to apply unit testing, mocking and debugging to Airflow code.
- Get started developing workflows with Apache Airflow - This brief introductory tutorial covers how to create a data pipeline and processing workflow using DAGs, operators, and sensors, and how to use XComs to communicate between operators.
- Get started with Airflow + Google Cloud Platform + Docker - Step-by-step introduction by Jayce Jiang.
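For readers new to the vocabulary used in these tutorials (DAGs, operators, XComs), the idea can be sketched in plain Python. This is an illustration only and does not use the Airflow API: the `run_pipeline` helper, the task names, and the `results` dictionary standing in for XComs are all hypothetical.

```python
from collections import deque


def run_pipeline(tasks, upstream):
    """Run callables in dependency order and return each task's result.

    tasks: dict mapping a task name to a callable that receives the
        results of all tasks that have already run.
    upstream: dict mapping a task name to the names that must run first.
    """
    # Count the unfinished upstream tasks of every task.
    pending = {name: len(upstream.get(name, [])) for name in tasks}
    downstream = {name: [] for name in tasks}
    for name, parents in upstream.items():
        for parent in parents:
            downstream[parent].append(name)

    # Tasks with no upstream dependencies can start immediately.
    ready = deque(sorted(name for name, count in pending.items() if count == 0))
    results = {}  # hypothetical stand-in for XComs: values shared between tasks
    while ready:
        name = ready.popleft()
        results[name] = tasks[name](results)
        for child in downstream[name]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    # Any task left unrun is stuck behind a dependency cycle.
    if len(results) != len(tasks):
        raise ValueError("dependency cycle detected")
    return results


tasks = {
    "extract": lambda xcom: [1, 2, 3],
    "transform": lambda xcom: [value * 10 for value in xcom["extract"]],
    "load": lambda xcom: sum(xcom["transform"]),
}
upstream = {"transform": ["extract"], "load": ["transform"]}

results = run_pipeline(tasks, upstream)
print(results["load"])  # 60
```

In Airflow itself the scheduler and executor take care of ordering and of passing small values between tasks; the sketch only shows why the dependency graph has to be acyclic for an execution order to exist.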
Airflow Summit 2020 videos
The first Airflow Summit was held in July 2020. It was a truly global, fully online event co-hosted by 9 Airflow Meetups from all over the world (Melbourne, Tokyo, Bangalore, Warsaw, Amsterdam, London, NYC, Seattle, Bay Area).
It featured 40+ talks and three workshops. You can check out the talk recordings as a YouTube Airflow Summit 2020 Playlist or see the individual talks here:
- Keynote: Airflow then and now
- Scheduler as a service - Apache Airflow at EA Digital Platform
- Keynote: How large companies use Airflow for ML and ETL pipelines
- Data DAGs with lineage for fun and for profit
- Airflow on Kubernetes: Containerizing your workflows
- Data flow with Airflow @ PayPal
- Democratised data workflows at scale
- Migrating Airflow-based Spark jobs to Kubernetes - the native way
- Keynote: Future of Airflow
- Run Airflow DAGs in a secure way
- Keynote: Making Airflow a sustainable project through D&I
- Airflow CI/CD: Github to Cloud Composer (safely)
- Advanced Apache Superset for Data Engineers
- Demo: Reducing the lines, a visual DAG editor
- AIP-31: Airflow functional DAG definition
- Autonomous driving with Airflow
- From cron to Airflow on Kubernetes: A startup story
- Achieving Airflow Observability
- Machine Learning with Apache Airflow
- Airflow: A beast character in the gaming world
- Effective Cross-DAG dependency
- What open source taught us about business
- Data engineering hierarchy of needs
- Building reuseable and trustworthy ELT pipelines (A templated approach)
- Testing Airflow workflows - ensuring your DAGs work before going into production
- Adding an executor to Airflow: A contributor overflow exception
- Migration to Airflow backport providers
- From Zero to Airflow: bootstrapping a ML platform
- Airflow the perfect match in our analytics pipeline
- Airflow at Société Générale : An open source orchestration solution in a banking environment
- Airflow as the next gen of workflow system at Pinterest
- Improving Airflow's user experience
- Teaching an old DAG new tricks
- Ask me anything with Airflow members
- Using Airflow to speed up development of data intensive tools
- Pipelines on pipelines: Agile CI/CD workflows for Airflow DAGs
- Production Docker image for Apache Airflow
- Airflow as an elastic ETL tool
- How do we reason about the reliability of our data pipeline in Wrike
- Achieving Airflow observability with Databand
- From S3 to BigQuery - How a first-time Airflow user successfully implemented a data pipeline
Best practices, lessons learned and cool use cases
- Testing in Airflow Part 2 - Chandu Kavar and Sarang Shinde have explained Integration Tests and End-to-End Pipeline Tests.
- Upgrading & Scaling Airflow at Robinhood - Abishek Ray describes how Robinhood tackled upgrading its production Airflow while minimizing downtime.
- We're all using Airflow wrong and how to fix it - Jessica Laughlin of Bluecore shares three engineering problems associated with the Airflow design and how to solve them by using the KubernetesPodOperator in two design patterns.
- Getting started with Data Lineage - Germain Tanguy of Dailymotion shares a data lineage prototype integrated to Apache Airflow.
- Collaboration between data engineers, data analysts and data scientists - Germain Tanguy of Dailymotion shares how to release to production efficiently by collaborating through Apache Airflow.
- Using Apache Airflow’s Docker Operator with Amazon’s Container Repository - Brian Campbell of Lucid has tips for integrating AWS's ECR service with Airflow's DockerOperator.
- Airflow: Lesser Known Tips, Tricks, and Best Practises - Kaxil Naik has explained the lesser-known yet very useful tips and best practises on using Airflow.
- boundary-layer: Declarative Airflow Workflows - Kevin McHale explains the open source project boundary-layer, which generates Airflow DAGs from declarative workflows.
- Testing in Airflow Part 1 - Chandu Kavar has explained different categories of tests in Airflow. It includes DAG Validation Tests, DAG Definition Tests, and unit tests.
- Improving Airflow UI Security - WePay's Joy Gao breaks down the need for Role Based Access Controls (RBAC) and how she introduced it to Airflow.
- How to Create a Workflow in Apache Airflow to Track Disease Outbreaks in India - Vinayak Mehta details how SocialCops uses Airflow to scrape India's Ministry of Health and Family Affairs to generate derived data on possible disease outbreaks.
- Airflow, Meta Data Engineering, and a Data Platform for the World’s Largest Democracy - Vinayak Mehta talks about identifying data engineering patterns (meta data engineering) to automate DAG generation and how that helped SocialCops to power DISHA, a national data platform where Indian MPs and MLAs monitor the progress of 42 national level schemes.
- Lessons learnt while Airflow-ing and Airflow Part 2: Lessons learned - Nehil Jain has written a two-part series that covers the value of workflow schedulers, some best practices and pitfalls he found while working with Airflow. The second article in particular includes many production tips.
- Why Robinhood uses Airflow - Vineet Goel walks through why financial trading platform Robinhood picked Airflow over alternative work schedulers.
- What we learned migrating off Cron to Airflow - Katie Macias describes VideoAmp's Data Engineering's journey from cron to Airflow.
- Under the Hood: Building AIR at Qubole - Sreenath Kamath and Rajat Venkatesh write about building Qubole's data discovery, insights and recommendations platform atop Airflow.
- Airflow: Why is nothing working? - TL;DR Airflow’s SubDagOperator causes deadlocks by Jessica Laughlin - Deep dive into troubleshooting a troublesome Airflow DAG with good tips on how to diagnose problems.
- Apache Airflow as an External scheduler for distributed systems - Arunkumar suggests using Airflow as a simple external scheduler for a distributed system.
- How Sift Trains Thousands of Models using Apache Airflow - Summary of Sift Science's deployment strategy for its machine learning model pipelines.
- Apache Airflow at Pandora - Ace Haidrey discusses why Pandora chose Airflow and provides a detailed breakdown of their deployment and the infrastructure behind it.
- Airflow Lessons from the Data Engineering Front in Chicago - Alison Stanton provides a list of tips to avoid gotchas in Airflow jobs.
- Data’s Inferno: 7 Circles of Data Testing Hell with Airflow - The Wholesale Banking Advanced Analytics team at ING details how they torture test their Airflow DAGs before deployment.
- Data quality checkers - Antoine Augusti describes the framework Drivy has built atop Airflow to test their datasets for completeness, consistency, timeliness, uniqueness, validity and accuracy.
- Building WePay's data warehouse using BigQuery and Airflow - The inestimable Chris Riccomini describes how WePay, one of the first adopters of Airflow, integrated it into their Google Cloud Compute environment.
- Using Apache Airflow to Create Data Infrastructure in the Public Sector - Despite an unfortunately very heavy sales pitch tone, this blog post describes how ARGO Labs, a non-profit data organization, utilizes Airflow for ETLing public sector data.
- ETL with airflow - ETL core principles and several end-to-end docker-based examples including Kimball, Data Vault on Hive and some simpler examples.
- How to aggregate data for BigQuery using Apache Airflow - Example of how to use Airflow with Google BigQuery to power a Data Studio dashboard.
- Productionizing ML with workflows at Twitter - In-depth post on why and how Twitter uses Airflow for ML workflows, including custom operators and a custom UI embedded in the Airflow web interface.
- Running Apache Airflow At Lyft - This provides an overview of how Lyft operates Apache Airflow in production (monitoring, customization, etc.).
- Deploying Apache Airflow in Azure to build and run data pipelines - It talks about running Airflow on Azure.
- The Zen of Python and Apache Airflow - Blog post about how the Zen of Python can be applied to Airflow code.
- Securing Apache Airflow UI WITH DAG Level Access - Blog post about Airflow DAG level access and how Lyft uses it.
- Upgrading Airflow with Zero Downtime - A detailed article on how to deploy Airflow with zero downtime.
- Building a Production-Level ETL Pipeline Platform Using Apache Airflow - This post describes how the system management team at Cerner uses Airflow.
- Bare minimal Airflow on Kubernetes (Local, EKS, AKS) - An article on deploying Airflow on local Kubernetes, AWS EKS and Azure AKS with bare minimal setup.
- Breaking up the Airflow DAG monorepo - This post describes how to support managing Airflow DAGs from multiple git repos through S3.
- Improving Performance of Apache Airflow Scheduler - A story of an adventure that allowed Databand to speed up DAG parsing time 10 times.
- How SSENSE is using Apache Airflow to do Data Lineage on AWS - Exploring the fundamental themes of architecting and governing a data lake on AWS using Apache Airflow.
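Several of the testing articles above describe DAG validation tests, whose most basic check is that the task graph contains no cycle. The following is a minimal plain-Python sketch of that check, assuming the dependencies are available as a dictionary. The `find_cycle` helper and the task names are hypothetical and are not part of Airflow; in a real project the test would typically load the DAG files through Airflow itself.

```python
def find_cycle(upstream):
    """Return a list of task names forming a cycle, or None if acyclic.

    upstream: dict mapping a task name to the names it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    nodes = set(upstream) | {p for parents in upstream.values() for p in parents}
    colour = {node: WHITE for node in nodes}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for parent in upstream.get(node, []):
            if colour[parent] == GREY:
                # Reached a task that is already on the current path.
                return path[path.index(parent):] + [parent]
            if colour[parent] == WHITE:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in sorted(nodes):
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


def test_pipeline_is_acyclic():
    upstream = {"transform": ["extract"], "load": ["transform"]}
    assert find_cycle(upstream) is None


def test_cycle_is_reported():
    upstream = {"a": ["b"], "b": ["c"], "c": ["a"]}
    assert find_cycle(upstream) == ["a", "b", "c", "a"]
```

The two test functions follow the pytest naming convention, so a test runner can pick them up as part of a CI job.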
Books, blogs, podcasts, and such
- Data Pipelines with Apache Airflow - A Manning book (Early Access September 2019) on Airflow.
- The Airflow Podcast - A semiregular podcast discussing all things Airflow.
- Maxime Beauchemin - Maxime's blog on medium that gives insight into the philosophy behind Apache Airflow.
- Robert Chang - Blog posts about data engineering with Apache Airflow that explain the reasoning behind it and include code examples.
- Airflow 2.0: DAG Authoring Redesigned - Blog post about new ways of writing DAGs in Airflow 2.0.
Slide deck presentations and online videos
- 2020-Feb: Apache Airflow @ Umuzi.org - Sheena O'Connell discusses how South Africa-based tech bootcamp Umuzi uses Airflow.
- Apache Airflow YouTube tutorials - Marc Lamberti has created a series of YouTube tutorials covering many aspects of Airflow concepts, configuration and deployment.
- Advanced Data Engineering Patterns with Apache Airflow - Video of Maxime Beauchemin's talk that briefly introduces Airflow and then goes into more advanced use cases, including self-service SQL queries, building A/B testing metrics frameworks and machine learning feature extraction, all via Airflow. The slides are available separately here.
- Modern Data Pipelines with Apache Airflow - A talk given by Taylor Edmiston and Andy Cooper from Astronomer.io at Momentum Dev Con 2018 on getting started with Airflow, custom components, example DAGs, and the Astronomer Airflow CLI.
- Building Better Data Pipelines using Apache Airflow - Slides from Sid Anand's talk at QCon 18 with a thorough overview of Airflow and its architecture.
- Airflow and Spark Streaming at Astronomer - How Astronomer uses dynamic DAGs to run Spark Streaming jobs with Airflow.
- Apache Airflow in the Cloud: Programmatically orchestrating workloads with Python - Slides from Kaxil Naik and Satyasheel's talk at PyData London 18 introducing the basics of Airflow and how to orchestrate workloads on Google Cloud Platform (GCP).
- Developing elegant workflows in Python code with Apache Airflow - Michał Karzyński at Europython gives a brief introduction to Airflow concepts including the role of workflow managers, DAGs and operators. Link includes both video and slides.
- Data Pipeline Management - Ben Goldberg walks the Chicago Kubernetes Meetup through how SpotHero uses Airflow. Additionally, Ben has a very complete slidedeck of how Airflow plays within Kubernetes.
- How I learned to time travel, or, data pipelining and scheduling with Airflow - Comprehensive deck by Laura Lorenz for why Airflow is necessary and how Industry Dive uses it.
- Introduction to Apache Airflow - Data Day Seattle 2016 - Sid Anand gives a thorough introduction to Airflow and how it was used at Agari.
- Operating Data Pipeline With Airflow - Airflow Meetup April-2018 - Ananth Packkildurai talks about scaling the Airflow Local Executor and best practices for operating data pipelines at Slack.
- Apache Airflow at WePay - Chris Riccomini discusses why WePay chose Airflow and provides a detailed breakdown of their deployment and the infrastructure behind it.
- Elegant data pipelining with Apache Airflow - Talks from Bolke de Bruin and Fokko Driesprong at PyData Amsterdam 2018 about methodologies that provide clarity in ETL using Airflow.
- Airflow @ Lyft - Talks from Tao Feng at SF big data analytics meetup about how Lyft monitors running Airflow in production.
- Manageable data pipelines with Airflow and Kubernetes - Talk by Jarek Potiuk and Szymon Przedwojski. An introductory talk on Airflow from GDG Warsaw DevFest 2018.
- Migrating Apache Oozie Workflows to Apache Airflow - Talk from Szymon Przedwojski from Airflow Bay Area Meetup June 2018 about Oozie-to-Airflow migration tool.
- Building data lakes with Apache Airflow - Talk by Bas Harenslak and Julian de Ruiter at the Amsterdam Apache Airflow September 2018 meetup about building data lakes with Apache Airflow as the spider in the web managing all data flows.
- First Warsaw Apache Airflow Meetup - Live streamed recording from the first Apache Airflow Meetup in Warsaw in October 2019.
- What's coming in Apache Airflow 2.0 - Joint talk by Ash Berlin-Taylor, Kaxil Naik, Jarek Potiuk, Kamil Breguła, Daniel Imberman, and Tomek Urbaszek at the Online NYC Meetup on 13 May 2020.
- Airflow Breeze - Development and Test Environment for Apache Airflow - Screencast showing how to use Breeze environment by Jarek Potiuk.
Libraries, Hooks, Utilities
- DEAfrica Airflow - Airflow libraries used by Digital Earth Africa, a humanitarian effort to utilize satellite imagery of Africa.
- Airflow plugins - Central collection of repositories of various plugins for Airflow, including mailchimp, trello, sftp, GitHub, etc.
- fileflow - Collection of modules to support large data transfers between Airflow operators through either local file system or S3. This addresses a gap where data is too large for XCOMs but too small or inconvenient for loading directly in the operator. Built by Industry Dive.
- fairflow - Library to abstract away Airflow's Operators with functional pieces that transform the data from one operator to another.
- airflow-maintenance-dags - Clairvoyant has a repo of Airflow DAGs that operate on Airflow itself, clearing out various bits of the backing metadata store.
- test_dags - A more complete solution for DAG integrity tests (the first circle of Data’s Inferno).
- dag-factory - A library for dynamically generating Apache Airflow DAGs from YAML configuration files.
- whirl - Fast iterative local development and testing of Apache Airflow workflows.
- airflow-code-editor - A plugin for Apache Airflow that allows you to edit DAGs in browser.
- Pylint-Airflow - A Pylint plugin for static code analysis on Airflow code.
- afctl - A CLI tool that includes everything required to create, manage and deploy airflow projects faster and smoother.
- Dag Dependencies viewer - A plugin which creates a view to visualize dependencies between the Airflow DAGs.
- Airflow ECR Plugin - Plugin to refresh AWS ECR login token at regular intervals. This is helpful where DockerOperator needs to pull images hosted on ECR.
- AirflowK8sDebugger - A library for generating k8s pod YAML templates from an Airflow DAG using the KubernetesPodOperator.
- Oozie to Airflow - A tool to easily convert between Apache Oozie workflows and Apache Airflow workflows.
- Airflow Ditto - An extensible framework to do transformations to an Airflow DAG and convert it into another DAG which is flow-isomorphic with the original DAG, to be able to run it on different environments (e.g. on different clouds, or even different container frameworks - Apache Spark on YARN vs Kubernetes). Comes with out-of-the-box support for EMR-to-HDInsight-DAG transforms.
- gusty - Create a DAG using any number of YAML, Python, Jupyter Notebook, or R Markdown files that represent individual tasks in the DAG. gusty also configures dependencies, DAGs, and TaskGroups, features support for your local operators, and more. A fully containerized demo is available here.
Meetups
- Amsterdam Apache Airflow Meetup
- Bangalore Apache Airflow Meetup
- Bay Area Apache Airflow Meetup
- London Apache Airflow Meetup
- Melbourne Apache Airflow Meetup
- New York City Apache Airflow Meetup
- Paris Apache Airflow Meetup
- Portland Apache Airflow Meetup
- Seattle Apache Airflow Meetup
- Tokyo Apache Airflow (incubating) Meetup
- Warsaw Apache Airflow Meetup
Commercial Airflow-as-a-service providers
- Google Cloud Composer - Google Cloud Composer is a managed service built atop Google Cloud and Airflow.
- Qubole - Qubole is mainly known as a service-and-support company for Apache Hive, but also provides Airflow as a component of its platform.
- Astronomer.io - Astronomer provides complete ETL lifecycle solutions and appears to be entirely focused on providing Airflow-based products.
Cloud Composer resources
This section contains articles that apply to Cloud Composer, a service built by Google Cloud based on Apache Airflow. The tricks and solutions described here are intended for Cloud Composer, but may also be applicable to vanilla Airflow.
- Enabling Autoscaling in Google Cloud Composer - Supercharge your Cloud Composer deployment while saving up some cost during idle periods.
- Scale your Composer environment together with your business - The Celery Executor architecture and ways to ensure high scheduler performance.
- pianka.sh - Missing command in the gcloud tool. This tool facilitates some administrative tasks.
- The Smarter Way of Scaling With Composer’s Airflow Scheduler on GKE - Roy Berkowitz discusses more effective use of nodes in the Cloud Composer service.
- Better together: orchestrating your Data Fusion pipelines with Cloud Composer - Rachael Deacon-Smith provides an overview of the operator for Datafusion use case on Cloud Composer.
Non-English resources
- Airflow Documentation-Chinese - (🇨🇳 Chinese) Apachecn has translated the Airflow official documentation.
- Gestion de Tâches avec Apache Airflow - (🇫🇷 French) Nicolas Crocfer - Overview of Airflow, basic concepts and how to write and trigger a DAG.
- apache airflow 複数worker構成のalpine版docker imageを作った - (🇯🇵 Japanese) Akio Ohta walks through his Docker image for deploying an Alpine-based Airflow system.
- Apache Airflow – Kaikki Mitä Meillä On, Lähtee Dageista - (🇫🇮 Finnish) Olli Iivonen's overview of Airflow, concepts and Airflow's usage at Solita.
- Airflow - Automatizando seu fluxo de trabalho - (🇧🇷 Portuguese) Gilson Filho's overview of Airflow, concept and basic use.
- Panduan Dasar Apache Airflow - (🇮🇩 Indonesian) Imam Digmi - Overview of Airflow, concept, basic use with use case.
- Airflow - (🇻🇳 Vietnamese) Duyet Le - Overview of Airflow, concept, basic use with use case.
To the extent possible under law, Jakob Homan has waived all copyright and related or neighboring rights to this work.