# ETL Pipelines With Airflow


## Introduction

## Calling An API In Python

![api](./images/api_call.png)

## Setting Up PostgreSQL Database

![api](./images/make_db.png)


## Introduction To Airflow

Data pipelines in Airflow are made up of DAGs that are scheduled to run at specific times. A DAG is a directed acyclic graph in which each node is a task that needs to be completed. Tasks that do not depend on one another can be run in parallel. We'll go over this in much more detail in the next section.
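To make the idea of independent tasks concrete, here is a minimal sketch that uses only the Python standard library (`graphlib`, Python 3.9+) rather than Airflow itself. The task names are hypothetical and do not come from the pipeline built later; the point is only to show how a dependency graph determines which tasks could run side by side.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical ETL tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract_api": set(),
    "create_table": set(),
    "transform": {"extract_api"},
    "load": {"transform", "create_table"},
}


def parallel_batches(deps):
    """Group tasks into batches; tasks in the same batch do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose upstream tasks are all done
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(dependencies)
print(batches)
# [['create_table', 'extract_api'], ['transform'], ['load']]
```

Here `create_table` and `extract_api` land in the same batch because neither depends on the other, which is the situation in which Airflow can run tasks in parallel.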

The components of Airflow are:

- **Metadata DB  (database)** : Keeps track of tasks, how long each run took, etc.

- **Webserver (Flask based UI)** : The webserver talks to metadata db to get information to present.

- **Scheduler** : This crawls the file system for DAG files and puts tasks on the queue.

- **Workers** : These do the actual tasks. They can run on the same machine as the scheduler or on separate ones. If they are separate, you can use Celery: http://www.celeryproject.org/


Airflow will dump all information about your DAGs into logs, which are written to a file or a database. For simplicity I used a local directory,

    ~/airflow/logs
    
The directory the logs are written to is decided by what <code>base_log_folder</code> is set to in the <code>airflow.cfg</code> file. You can store the logs remotely instead by setting the <code>remote_base_log_folder</code> variable in the same file.
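As a rough illustration, the sketch below parses a made-up excerpt of an <code>airflow.cfg</code> file with Python's `configparser` and reports where logs would go. It assumes the Airflow 1.x layout, in which both options sit in the `[core]` section; the sample values and the helper name `log_folders` are hypothetical, not copied from a real installation.

```python
import configparser
import os

# Made-up excerpt of an airflow.cfg file. Assumption: Airflow 1.x layout,
# where both logging options live in the [core] section.
SAMPLE_CFG = """
[core]
base_log_folder = ~/airflow/logs
remote_base_log_folder =
"""


def log_folders(cfg_text):
    """Return (local_log_dir, remote_log_dir or None) from airflow.cfg text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    local = os.path.expanduser(parser.get("core", "base_log_folder"))
    # An empty value means remote logging is not configured.
    remote = parser.get("core", "remote_base_log_folder", fallback="") or None
    return local, remote


local_dir, remote_dir = log_folders(SAMPLE_CFG)
print(local_dir, remote_dir)
```

With the sample above, the remote folder comes back as `None`, matching a setup where logs stay on the local disk.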



### Installing Airflow

To install Airflow, first set your Airflow home directory by typing into your terminal,

    export AIRFLOW_HOME=<path_to_airflow_home>
    
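I chose to set <code>AIRFLOW_HOME=~/airflow</code>, which is the default setting. Now we can install Airflow with PostgreSQL support using pip (newer releases are published on PyPI under the name <code>apache-airflow</code>, so use that name if the command below does not give you a working install):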
    
    pip install airflow[postgres]

### Metadata DB

We can then initialize the metadata database using,

    airflow initdb 
    
Out of the box, Airflow uses a SQLite database, which you should outgrow fairly quickly since no parallelization is possible with this database backend. By default the database is the file <code>AIRFLOW_HOME/airflow.db</code>. You can change the database choice using the <code>sql_alchemy_conn</code> variable in the <code>airflow.cfg</code> file.

By default Airflow also uses the <code>SequentialExecutor</code>, which only runs task instances one at a time. This is set by the <code>executor</code> variable in the <code>airflow.cfg</code> file.
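For reference, here is a hedged sketch of what moving away from these defaults might look like. It builds a SQLAlchemy-style PostgreSQL URL and pairs it with `LocalExecutor`, expressed as the environment variables Airflow reads as overrides for <code>airflow.cfg</code> (the `AIRFLOW__{SECTION}__{KEY}` convention). It assumes the Airflow 1.x layout where both settings live in `[core]`; the credentials, database name, and the helper `airflow_env` are all hypothetical, and this is not the configuration used in this walkthrough.

```python
from urllib.parse import quote


def airflow_env(user, password, host, port, database, executor="LocalExecutor"):
    """Build environment-variable overrides for the two settings discussed above.

    Assumption: Airflow 1.x, where sql_alchemy_conn and executor are in [core].
    """
    # Percent-encode the credentials so characters like '@' or '/' do not break the URL.
    conn = "postgresql+psycopg2://{}:{}@{}:{}/{}".format(
        quote(user, safe=""), quote(password, safe=""), host, port, database
    )
    return {
        "AIRFLOW__CORE__SQL_ALCHEMY_CONN": conn,
        "AIRFLOW__CORE__EXECUTOR": executor,
    }


# Hypothetical credentials, for illustration only.
env = airflow_env("airflow_user", "p@ss/word", "localhost", 5432, "airflow")
for key, value in env.items():
    print(f"export {key}='{value}'")
```

The same two values could equally be written into <code>airflow.cfg</code> directly; environment variables are shown here only because they are easy to print and inspect.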

### Webserver
We can start the webserver locally using the command,

    airflow webserver -p 8080

Then open http://0.0.0.0:8080/ in your browser and you will get the Airflow UI. The webserver is extremely helpful for seeing which DAGs are running, how long they ran, etc., as well as for setting up connections to databases.

### Scheduler

The Airflow scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have been met. Behind the scenes, it monitors and stays in sync with a folder for all DAG objects it may contain, and periodically (every minute or so) inspects active tasks to see whether they can be triggered.
The Airflow scheduler is designed to run as a persistent service in an Airflow production environment. To kick it off, all you need to do is execute <code>airflow scheduler</code>. It will use the configuration specified in <code>airflow.cfg</code>.

    airflow scheduler
    
Note that if you run a DAG on a <code>schedule_interval</code> of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
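This timing rule is easy to get backwards, so here is a small sketch of the arithmetic in plain Python, with no Airflow import. The function name `earliest_trigger` is made up for illustration; it simply adds the interval to the date a run is stamped with.

```python
from datetime import datetime, timedelta


def earliest_trigger(execution_date, schedule_interval):
    """A run is stamped with the start of its period and can start once that period ends."""
    return execution_date + schedule_interval


stamp = datetime(2016, 1, 1)  # the run stamped 2016-01-01
print(earliest_trigger(stamp, timedelta(days=1)))
# 2016-01-02 00:00:00
```

So the run labelled 2016-01-01 covers that whole day and only becomes eligible at the start of 2016-01-02.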

### Workers
In this example I won't be using any separate workers since I'm running this on my personal computer.


## Example ETL Pipeline With Airflow

![api](./images/airflow_ui_db.png)
