# Automate & standardize the boring things

## standardizing reading, moving, and cleaning data
- intake: https://intake.readthedocs.io/en/latest/
- singer: https://www.singer.io/#what-it-is
- pyjanitor: https://pyjanitor.readthedocs.io/notebooks/dirty_data.html
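
To make concrete what standardized cleaning buys you, here is a minimal sketch using only the standard library. It is not the API of intake, singer, or pyjanitor; the helper names `clean_name` and `read_clean_rows` and the sample CSV are hypothetical. The point is that every dataset passes through the same small, testable function instead of ad-hoc notebook code.

```python
import csv
import io
import re


def clean_name(name):
    """Normalize a column name: trim, lowercase, runs of other characters -> '_'."""
    name = name.strip().lower()
    name = re.sub(r"[^0-9a-z]+", "_", name)
    return name.strip("_")


def read_clean_rows(text):
    """Parse CSV text and return a list of dicts with cleaned column names."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {clean_name(key): value.strip() for key, value in row.items()}
        for row in reader
    ]


# Hypothetical messy input, for illustration only
raw = "First Name, Total Cost ($)\nAda , 10\nGrace, 20\n"
rows = read_clean_rows(raw)
print(rows)
```

The libraries listed above cover far more cases (remote sources, schemas, DataFrame methods); this sketch only shows the shape of the idea.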

## data catalog
Important in larger organizations: a single place that lists all available data sources, each with a description of how it is generated from raw data and what each column means.

- intake: https://intake.readthedocs.io/en/latest/
- atlas: https://atlas.apache.org
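
The information a catalog entry holds can be sketched with a plain dataclass. This is an illustrative in-memory model, not how intake or atlas store their metadata; `CatalogEntry`, `register`, `describe`, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data source in a hypothetical in-memory data catalog."""
    name: str
    location: str  # where the data lives (path, table, URL)
    lineage: str   # how it is generated from raw data
    columns: dict = field(default_factory=dict)  # column name -> meaning


catalog = {}


def register(entry):
    """Add an entry to the catalog, refusing duplicate names."""
    if entry.name in catalog:
        raise ValueError(f"duplicate catalog entry: {entry.name}")
    catalog[entry.name] = entry


def describe(name):
    """Return a short human-readable summary of one entry."""
    entry = catalog[name]
    lines = [f"{entry.name} ({entry.location}): {entry.lineage}"]
    lines += [f"  {col}: {meaning}" for col, meaning in entry.columns.items()]
    return "\n".join(lines)


# Example values are made up for illustration
register(CatalogEntry(
    name="daily_sales",
    location="warehouse.sales_daily",
    lineage="aggregated nightly from raw order events",
    columns={
        "date": "calendar day (UTC)",
        "revenue": "sum of order totals in EUR",
    },
))
print(describe("daily_sales"))
```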

## automation & scheduling

- airflow https://airflow.apache.org (open source)
- prefect https://github.com/PrefectHQ/prefect (open core, commercial UI)

**An example using prefect.**

Tasks describe what is executed

In [None]:
from prefect import task

@task
def extract():
    """Get a list of data"""
    return [1, 2, 3]

@task
def transform(data):
    """Multiply the input by 10"""
    return [i * 10 for i in data]

@task
def load(data):
    """Print the data to indicate it was received"""
    print("Here's your data: {}".format(data))

A flow bundles multiple tasks

In [None]:
from prefect import Flow

with Flow('ETL') as flow:
    e = extract()
    t = transform(e)
    l = load(t)

flow.run() # prints "Here's your data: [10, 20, 30]"

In [None]:
flow.visualize()
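
Conceptually, a flow is a directed acyclic graph: each task runs once all of its upstream tasks have finished. The sketch below reproduces the ETL example with the standard library only (`graphlib` needs Python 3.9+). It is an illustration of the idea, not Prefect's implementation; the `tasks` mapping and `run_flow` are hypothetical names, and `load` returns its message instead of printing so the result is easy to inspect.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def extract():
    return [1, 2, 3]


def transform(data):
    return [i * 10 for i in data]


def load(data):
    return "Here's your data: {}".format(data)


# task name -> (function, names of upstream tasks whose results it consumes)
tasks = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run_flow(tasks):
    """Run every task after its upstream tasks and return all results."""
    graph = {name: set(upstream) for name, (_, upstream) in tasks.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, upstream = tasks[name]
        results[name] = func(*(results[u] for u in upstream))
    return results


results = run_flow(tasks)
print(results["load"])  # prints "Here's your data: [10, 20, 30]"
```

A workflow tool adds what this sketch leaves out: retries, logging, parallel execution, and scheduling.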

A schedule runs a flow repeatedly

In [None]:
from datetime import timedelta

from prefect.schedules import IntervalSchedule

# run the flow once per minute
schedule = IntervalSchedule(interval=timedelta(minutes=1))

with Flow('ETL', schedule) as flow:
    e = extract()
    t = transform(e)
    l = load(t)

flow.run() # blocks and prints "Here's your data: [10, 20, 30]" once per minute until interrupted

In [None]:
flow.visualize()
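
An interval schedule boils down to a sequence of evenly spaced run times. A minimal standard-library sketch of that idea follows; it is not Prefect's implementation, and `interval_run_times` and the start time are hypothetical.

```python
from datetime import datetime, timedelta
from itertools import islice


def interval_run_times(start, interval):
    """Yield start, start + interval, start + 2 * interval, ... without end."""
    current = start
    while True:
        yield current
        current += interval


# Hypothetical start time, for illustration only
start = datetime(2020, 1, 1, 9, 0)
next_three = list(islice(interval_run_times(start, timedelta(minutes=1)), 3))
for run_time in next_three:
    print(run_time.isoformat())
```

A scheduler then waits until each of these times and triggers one flow run.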

**An example using airflow.**

In [None]:
# initialize the metadata database (Airflow 1.x command; Airflow 2.x uses `airflow db init`)
!airflow initdb

In [None]:
# start the web server, default port is 8080
!airflow webserver -p 8080

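Visit http://localhost:8080 in the browser and enable one of the example DAGs on the home page. An enabled DAG only executes once the scheduler is also running (`airflow scheduler`).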