# Automate & standardize the boring things

## standardizing reading, moving, or cleaning data
- intake: https://intake.readthedocs.io/en/latest/
- singer: https://www.singer.io/#what-it-is
- pyjanitor: https://pyjanitor.readthedocs.io/notebooks/dirty_data.html

## data catalog
Important in larger organizations. A data catalog is a central place that lists all available data sources, describes how each one is generated from raw data, and explains what each column means.

- intake: https://intake.readthedocs.io/en/latest/
- atlas: https://atlas.apache.org
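
To make the idea concrete, here is a minimal sketch of what a catalog entry could hold, using only the standard library. It is not how intake or atlas work internally; the `CatalogEntry` and `Catalog` classes and the `daily_sales` example are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """One data source in a hypothetical in-memory catalog."""
    name: str
    description: str
    derived_from: List[str]  # raw inputs this source is built from
    columns: Dict[str, str] = field(default_factory=dict)  # column -> meaning


class Catalog:
    """Hypothetical registry that maps source names to their entries."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # refuse silent overwrites so every source has exactly one description
        if entry.name in self._entries:
            raise ValueError("duplicate entry: {}".format(entry.name))
        self._entries[entry.name] = entry

    def describe(self, name):
        entry = self._entries[name]
        lines = [
            "{}: {}".format(entry.name, entry.description),
            "  derived from: {}".format(", ".join(entry.derived_from)),
        ]
        lines += ["  {}: {}".format(col, meaning) for col, meaning in entry.columns.items()]
        return "\n".join(lines)


catalog = Catalog()
catalog.register(CatalogEntry(
    name="daily_sales",  # made-up example source
    description="Sales aggregated per day",
    derived_from=["raw_orders"],
    columns={"day": "calendar date", "revenue": "sum of order totals"},
))
print(catalog.describe("daily_sales"))
```

A real catalog would add storage, search, and access control on top of this, but the three pieces of information per source stay the same: description, lineage, and column meanings.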

## automation & scheduling

- airflow: https://airflow.apache.org (open source)
- prefect: https://github.com/PrefectHQ/prefect (open core, commercial UI)

**An example using prefect.**

Tasks describe what is executed. Each function decorated with `@task` is one unit of work.

In [None]:
from prefect import task

@task
def extract():
    """Get a list of data"""
    return [1, 2, 3]

@task
def transform(data):
    """Multiply the input by 10"""
    return [i * 10 for i in data]

@task
def load(data):
    """Print the data to indicate it was received"""
    print("Here's your data: {}".format(data))

A flow bundles multiple tasks and records how data passes from one to the next:

In [None]:
from prefect import Flow

with Flow('ETL') as flow:
    e = extract()
    t = transform(e)
    l = load(t)

flow.run() # prints "Here's your data: [10, 20, 30]"

In [None]:
flow.visualize()
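
Conceptually, a flow is a dependency graph whose tasks run in an order that respects their inputs. The following is a rough standard-library sketch of that idea, not Prefect's implementation; the `tasks` mapping and the `run_flow` helper are hypothetical, and `graphlib` requires Python 3.9 or newer.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def extract():
    return [1, 2, 3]


def transform(data):
    return [i * 10 for i in data]


def load(data):
    return "Here's your data: {}".format(data)


# task name -> (function, names of the upstream tasks whose results it consumes)
tasks = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run_flow(tasks):
    """Run every task after its upstream tasks and return all results (hypothetical helper)."""
    graph = {name: set(upstream) for name, (_, upstream) in tasks.items()}
    results = {}
    # static_order() yields each node only after all of its predecessors
    for name in TopologicalSorter(graph).static_order():
        func, upstream = tasks[name]
        results[name] = func(*(results[u] for u in upstream))
    return results


results = run_flow(tasks)
print(results["load"])  # prints "Here's your data: [10, 20, 30]"
```

The graph is also what a visualization draws: one node per task and one edge per data dependency.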

In [None]:
from datetime import timedelta

from prefect import Flow
from prefect.schedules import IntervalSchedule

# run the flow once per minute
schedule = IntervalSchedule(interval=timedelta(minutes=1))

with Flow('ETL', schedule=schedule) as flow:
    e = extract()
    t = transform(e)
    l = load(t)

flow.run() # keeps running on the schedule until interrupted, printing "Here's your data: [10, 20, 30]" on each run

In [None]:
flow.visualize()
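
An interval schedule boils down to a sequence of run times spaced a fixed distance apart. The sketch below shows that arithmetic with the standard library only; it is an illustration rather than Prefect's scheduling code, and `interval_run_times` and the start time are made up for the example.

```python
from datetime import datetime, timedelta
from itertools import islice


def interval_run_times(start, interval):
    """Yield start, start + interval, start + 2 * interval, ... without end."""
    current = start
    while True:
        yield current
        current += interval


start = datetime(2019, 12, 1, 9, 0)  # arbitrary start time for the example
next_runs = list(islice(interval_run_times(start, timedelta(minutes=1)), 3))
for run_time in next_runs:
    print(run_time.isoformat())
# prints:
# 2019-12-01T09:00:00
# 2019-12-01T09:01:00
# 2019-12-01T09:02:00
```

A scheduler then sleeps until the next run time, executes the flow, and repeats.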

## Airflow

https://airflow.apache.org

- open source
- completely free, including the UI

run:

```bash
docker run -p 8080:8080 puckel/docker-airflow webserver
```

go to: http://localhost:8080

in another shell:

```bash
docker ps

# look at the id of the airflow container
docker exec -ti <id> bash

airflow list_dags

# run your first task instance
airflow run example_bash_operator runme_0 2019-12-01

# prints the list of tasks in the "tutorial" DAG
airflow list_tasks tutorial

# prints the hierarchy of tasks in the tutorial DAG
airflow list_tasks tutorial --tree

# Generate some data in the UI
# start your backfill on a date range
airflow backfill tutorial -s 2015-06-01 -e 2015-06-07

# run a task instance from the example_xcom DAG (its tasks are push, push_by_returning and puller)
airflow run example_xcom push 2019-12-01
```
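
The backfill command above asks Airflow to run the DAG once for every schedule interval in the given date range. As a rough illustration, assuming a daily schedule and that both the start and the end date are included, the execution dates can be enumerated like this (`backfill_dates` is a hypothetical helper, not part of Airflow):

```python
from datetime import date, timedelta


def backfill_dates(start, end, interval=timedelta(days=1)):
    """Return every execution date from start to end, both inclusive."""
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += interval
    return dates


dates = backfill_dates(date(2015, 6, 1), date(2015, 6, 7))
print(len(dates), dates[0].isoformat(), dates[-1].isoformat())
# prints: 7 2015-06-01 2015-06-07
```

Each of those dates becomes one DAG run, which is why a backfill over a long range can create a lot of work at once.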


visit http://localhost:8080 in the browser and enable the example DAG on the home page


Further reading:
- https://medium.com/@tomaszdudek/yet-another-scalable-apache-airflow-with-docker-example-setup-84775af5c451
- https://medium.com/@itunpredictable/apache-airflow-on-docker-for-complete-beginners-cf76cf7b2c9a

### locally (if you want)

In [None]:
# pip install apache-airflow

In [None]:
# initialize the database
!airflow initdb

In [None]:
# start the web server on port 8081
!airflow webserver -p 8081

go to: http://localhost:8081

cleanup:


```bash
docker ps # find the right id
docker stop <id> # for example: docker stop 12bcab527bb2
```