Awesome Apache Airflow
This is a curated list of resources about Apache Airflow (incubating). Please feel free to contribute any items that should be included. Items are generally added at the top of each section so that fresher items are featured more prominently.
- Vital links
- Airflow deployment solutions
- Introductions and tutorials
- Best practices, lessons learned and cool use cases
- Blogs, etc.
- Slide deck presentations and online videos
- Libraries, Hooks, Utilities
- Commercial Airflow-as-a-service providers
- Non-English resources
Vital links
- Official website: Apache Airflow
- Latest release: 1.10.0-incubating
- Official Twitter account: Apache Airflow
Airflow deployment solutions
- Puckel's Docker Image - @Puckel_'s well-crafted Docker image has become the base for many Airflow installations. It is regularly updated and closely tracks the official Apache releases.
- Kubernetes Custom Operator for Deploying Airflow - A Kubernetes custom controller (also known as the operator pattern) for deploying Airflow on Kubernetes.
- airflow-pipeline - Airflow Docker container that comes preconfigured for Spark and Hadoop. It can be docker pulled at
- kube-airflow - This repository contains both an Airflow Docker image (that appears to have been based on Puckel's work) and Kubernetes service definition. mumoshu's repository has not been recently updated, but there are numerous forks that may be based on more recent releases.
- airflow-on-kubernetes - A guide on all relevant resources, scripts and projects that relate to running Airflow on Kubernetes.
- airflow-k8s-executor-on-GKE - A detailed tutorial on deploying a scalable, low-maintenance Airflow Kubernetes executor environment on Google Kubernetes Engine with Helm.
- airflow-cookbook - Chef cookbook for deploying Airflow.
- Running Airflow on top of Apache Mesos - Blog post describing how to configure Mesos to run all of the Airflow components.
- Integrating Apache Airflow with Apache Ambari - Mykola Mykhalov walks through using Apache Ambari to configure and deploy an Airflow instance.
- Astronomer Open Edition github - Open Edition of the Astronomer Platform including Docker images for Airflow (Celery Executor), Postgres, Redis, Flower, StatsD, Prometheus, Grafana, and cAdvisor.
Introductions and tutorials
- Remote spark-submit to YARN running on EMR - Azhaguselvan walks through submitting Spark jobs to existing EMR clusters with Airflow.
- Running Airflow on top of Apache Mesos and its follow-up, Mesos, Airflow & Docker - Agraj Mangal gives a quick overview of running Airflow atop Apache Mesos.
- Dustin Stansbury of Quizlet has written a four-part series that covers what workflow managers do in general, how Quizlet picked Airflow, a tour of Airflow's key concepts, and how Quizlet is now using Airflow in practice.
- Apache Airflow for the confused - This short tutorial by Jonathan Pichot takes a different tack than most by using airplane and airport operations as an analogy for Airflow.
- Integrating Apache Airflow with Databricks - While this tutorial is focused specifically on Databricks' Spark solutions, it does have a reasonable overview of Airflow basics and demonstrates how a third party solution can quickly integrate into Airflow.
Best practices, lessons learned and cool use cases
- Testing in Airflow Part 1 - Chandu Kavar explains the different categories of tests in Airflow, including DAG validation tests, DAG definition tests, and unit tests.
- Improving Airflow UI Security - WePay's Joy Gao breaks down the need for role-based access control (RBAC) and how she introduced it to Airflow.
- How to Create a Workflow in Apache Airflow to Track Disease Outbreaks in India - Vinayak Mehta details how SocialCops uses Airflow to scrape data from India's Ministry of Health and Family Welfare and generate derived data on possible disease outbreaks.
- Airflow, Meta Data Engineering, and a Data Platform for the World’s Largest Democracy - Vinayak Mehta talks about identifying data engineering patterns (meta data engineering) to automate DAG generation and how that helped SocialCops to power DISHA, a national data platform where Indian MPs and MLAs monitor the progress of 42 national level schemes.
- Lessons learnt while Airflow-ing and Airflow Part 2: Lessons learned - Nehil Jain of Snaptravel has written a two-part series that covers the value of workflow schedulers, along with best practices and pitfalls he found while working with Airflow. The second article in particular includes many production tips.
- Why Robinhood uses Airflow - Vineet Goel walks through why financial trading platform Robinhood picked Airflow over alternative work schedulers.
- What we learned migrating off Cron to Airflow - Katie Macias describes the journey of VideoAmp's data engineering team from cron to Airflow.
- Under the Hood: Building AIR at Qubole - Sreenath Kamath and Rajat Venkatesh write about building Qubole's data discovery, insights and recommendations platform atop Airflow.
- Airflow: Why is nothing working? - TL;DR Airflow’s SubDagOperator causes deadlocks by Jessica Laughlin - Deep dive into troubleshooting a troublesome Airflow DAG, with good tips on how to diagnose problems.
- Apache Airflow as an External scheduler for distributed systems - Arunkumar suggests using Airflow as a simple external scheduler for a distributed system.
- How Sift Trains Thousands of Models using Apache Airflow - Summary of Sift Science's deployment strategy for its machine learning model pipelines.
- Apache Airflow at Pandora - Ace Haidrey discusses why Pandora chose Airflow and provides a detailed breakdown of their deployment and the infrastructure behind it.
- Airflow Lessons from the Data Engineering Front in Chicago - Alison Stanton provides a list of tips to avoid gotchas in Airflow jobs.
- Data’s Inferno: 7 Circles of Data Testing Hell with Airflow - The Wholesale Banking Advanced Analytics team at ING details how they torture test their Airflow DAGs before deployment.
- Data quality checkers - Antoine Augusti describes the framework Drivy has built atop Airflow to test their datasets for completeness, consistency, timeliness, uniqueness, validity and accuracy.
- Building WePay's data warehouse using BigQuery and Airflow - The inestimable Chris Riccomini describes how WePay, one of the first adopters of Airflow, integrated it into their Google Cloud Compute environment.
- Using Apache Airflow to Create Data Infrastructure in the Public Sector - Despite an unfortunately heavy sales-pitch tone, this blog post describes how ARGO Labs, a non-profit data organization, uses Airflow for ETL on public sector data.
- ETL with airflow - ETL core principles and several end-to-end Docker-based examples, including Kimball, Data Vault on Hive and some simpler examples.
- How to aggregate data for BigQuery using Apache Airflow - Example of how to use Airflow with Google BigQuery to power a Data Studio dashboard.
- Productionizing ML with workflows at Twitter - In-depth post on why and how Twitter uses Airflow for ML workflows, including custom operators and a custom UI embedded in the Airflow web interface.
Blogs, etc.
- The Airflow Podcast - A semiregular podcast discussing all things Airflow.
- Maxime Beauchemin - Maxime's blog on Medium, which gives insight into the philosophy behind Apache Airflow.
- Robert Chang - Blog posts about data engineering with Apache Airflow that explain the reasoning behind it and include code examples.
Slide deck presentations and online videos
- Advanced Data Engineering Patterns with Apache Airflow - Video of Maxime Beauchemin's talk that briefly introduces Airflow and then goes into more advanced use cases, including self-service SQL queries, building A/B testing metrics frameworks and machine learning feature extraction, all via Airflow. The slides are available separately.
- Modern Data Pipelines with Apache Airflow - A talk given by Taylor Edmiston and Andy Cooper from Astronomer.io at Momentum Dev Con 2018 on getting started with Airflow, custom components, example DAGs, and the Astronomer Airflow CLI.
- Building Better Data Pipelines using Apache Airflow - Slides from Sid Anand's talk at QCon 18 with a thorough overview of Airflow and its architecture.
- Airflow and Spark Streaming at Astronomer - How Astronomer uses dynamic DAGs to run Spark Streaming jobs with Airflow.
- Apache Airflow in the Cloud: Programmatically orchestrating workloads with Python - Slides from Kaxil Naik and Satyasheel's talk at PyData London 18 introducing the basics of Airflow and how to orchestrate workloads on Google Cloud Platform (GCP).
- Developing elegant workflows in Python code with Apache Airflow - Michał Karzyński at EuroPython gives a brief introduction to Airflow concepts, including the role of workflow managers, DAGs and operators. Link includes both video and slides.
- Data Pipeline Management - Ben Goldberg walks the Chicago Kubernetes Meetup through how SpotHero uses Airflow. Additionally, Ben has a very complete slide deck on how Airflow fits within Kubernetes.
- How I learned to time travel, or, data pipelining and scheduling with Airflow - Comprehensive deck by Laura Lorenz on why Airflow is necessary and how Industry Dive uses it.
- Introduction to Apache Airflow - Data Day Seattle 2016 - Sid Anand gives a thorough introduction to Airflow and how it was used at Agari.
- Operating Data Pipeline With Airflow - Airflow Meetup April-2018 - Ananth Packkildurai talks about scaling Airflow's LocalExecutor and best practices for operating data pipelines at Slack.
- Apache Airflow at WePay - Chris Riccomini discusses why WePay chose Airflow and provides a detailed breakdown of their deployment and the infrastructure behind it.
- Elegant data pipelining with Apache Airflow - Talks from Bolke de Bruin and Fokko Driesprong at PyData Amsterdam 2018 about methodologies that provide clarity in ETL using Airflow.
Libraries, Hooks, Utilities
- Airflow plugins - Central collection of repositories of various plugins for Airflow, including Mailchimp, Trello, SFTP, GitHub, etc.
- fileflow - Collection of modules to support large data transfers between Airflow operators through either the local file system or S3. This addresses a gap where data is too large for XComs but too small or inconvenient for loading directly in the operator. Built by Industry Dive.
- fairflow - Library to abstract away Airflow's Operators with functional pieces that transform the data from one operator to another.
- airflow-maintenance-dags - Clairvoyant has a repo of Airflow DAGs that operate on Airflow itself, clearing out various bits of the backing metadata store.
- test_dags - A more complete solution for DAG integrity tests (the first circle of Data’s Inferno).
Commercial Airflow-as-a-service providers
- Google Cloud Composer - Google Cloud Composer is a managed service built atop Google Cloud and Airflow.
- Qubole - Qubole is mainly known as a service-and-support company for Apache Hive, but also provides Airflow as a component of its platform.
- Astronomer.io - Astronomer provides complete ETL lifecycle solutions and appears to be entirely focused on providing Airflow-based products.
Non-English resources
- Gestion de Tâches avec Apache Airflow [French] - Nicolas Crocfer - Overview of Airflow, basic concepts and how to write and trigger a DAG.
- apache airflow 複数worker構成のalpine版docker imageを作った [Japanese] - Akio Ohta walks through his Docker image for deploying an Alpine-based Airflow system.
- Apache Airflow – Kaikki Mitä Meillä On, Lähtee Dageista [Finnish] - Olli Iivonen - Overview of Airflow, its concepts and Airflow's usage at Solita.
- Airflow - Automatizando seu fluxo de trabalho [Portuguese] - Gilson Filho - Overview of Airflow, its concepts and basic use.