This notebook creates a VM in your project that runs the Airflow scheduler and webserver. A default GCP zone for the VM is set below; change it if you prefer a different one.

## Airflow Dashboard
After the Airflow VM has been set up, you can view the Airflow Dashboard through an SSH tunnel to the VM. A sample command that opens the tunnel:
gcloud compute ssh --zone us-central1-b datalab-airflow -- -N -p 22 -L localhost:5000:localhost:8080

Once this tunnel is open, you can view the dashboard by navigating to http://localhost:5000 in your browser.
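
If you change the zone or the VM name, the tunnel command has to change with them. The following is a minimal sketch that assembles the sample command from variables. The helper name `build_tunnel_command` and its port parameters are illustrative assumptions, not part of datalab or gcloud.

```python
def build_tunnel_command(zone, vm_name, local_port=5000, remote_port=8080):
    """Return the gcloud argument list for an SSH tunnel to the Airflow webserver.

    Hypothetical helper: it only assembles the sample command from this notebook.
    """
    # Forward local_port on this machine to remote_port on the VM.
    forward = 'localhost:{0}:localhost:{1}'.format(local_port, remote_port)
    return ['gcloud', 'compute', 'ssh', '--zone', zone, vm_name,
            '--', '-N', '-p', '22', '-L', forward]


tunnel_command = build_tunnel_command('us-central1-b', 'datalab-airflow')
print(' '.join(tunnel_command))
```

The printed line can be pasted into a terminal on your own machine. If port 5000 is already in use there, pass a different `local_port` and open that port in the browser instead.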

In [1]:
# Install the latest datalab version, then restart the kernel so it is picked up.
!pip install --upgrade --force-reinstall datalab

Collecting datalab
  Using cached https://files.pythonhosted.org/packages/a8/e2/36982b4a3ba4f4fc59efa429961f310411e0db894bf772fd06b736d3b766/datalab-1.1.4-py2-none-any.whl
Collecting google-auth-httplib2>=0.0.2 (from datalab)
  Using cached https://files.pythonhosted.org/packages/33/49/c814d6d438b823441552198f096fcd0377fd6c88714dbed34f1d3c8c4389/google_auth_httplib2-0.0.3-py2.py3-none-any.whl
Collecting urllib3>=1.22 (from datalab)
  Using cached https://files.pythonhosted.org/packages/bd/c9/6fdd990019071a4a32a5e7cb78a1d92c53851ef4f56f62a3486e6a7d8ffb/urllib3-1.23-py2.py3-none-any.whl
Collecting pytz>=2015.4 (from datalab)
  Using cached https://files.pythonhosted.org/packages/30/4e/27c34b62430286c6d59177a0842ed90dc789ce5d1ed740887653b898779a/pytz-2018.5-py2.py3-none-any.whl
Collecting ipykernel>=4.5.2 (from datalab)
  Using cached https://files.pythonhosted.org/packages/8e/65/c7ca3e3d05f9bd51b3010076b84f4e7304b12d0abf62a48f6cec2c90c019/ipykernel-4.8.2-py2-none-any.whl
Collecting mock>

In [2]:
zone='us-central1-b'

In [3]:
from google.datalab import Context
import google.datalab.storage as storage

project = Context.default().project_id
vm_name = 'datalab-airflow'

# The name of this GCS bucket follows a convention between this notebook and 
# the 'BigQuery Pipeline' tutorial notebook, so don't change this.
gcs_dag_bucket_name = project + '-' + vm_name
gcs_dag_bucket = storage.Bucket(gcs_dag_bucket_name)
gcs_dag_bucket.create()

Google Cloud Storage Bucket gs://<project-id>-datalab-airflow
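
The bucket name depends only on the project ID and the VM name, so the location a DAG file would need to be uploaded to can be worked out ahead of time. A small sketch, assuming the `dags` folder that the VM startup script further down syncs from. The helper names `dag_bucket_name` and `dag_object_uri`, the project ID `my-project`, and the file name `example_dag.py` are placeholders.

```python
def dag_bucket_name(project_id, vm_name='datalab-airflow'):
    """Apply the naming convention used in this notebook: <project-id>-<vm-name>."""
    return project_id + '-' + vm_name


def dag_object_uri(project_id, dag_file_name, dag_folder='dags'):
    """Return the gs:// URI under which a DAG file would be synced to the VM."""
    return 'gs://{0}/{1}/{2}'.format(
        dag_bucket_name(project_id), dag_folder, dag_file_name)


# 'my-project' and 'example_dag.py' are placeholders for illustration only.
print(dag_object_uri('my-project', 'example_dag.py'))
```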

In [4]:
vm_startup_script_contents = """#!/bin/bash
apt-get update
apt-get --assume-yes install python-pip

pip install datalab==1.1.2
pip install apache-airflow==1.9.0
pip install pandas-gbq==0.3.0

export AIRFLOW_HOME=/airflow
export AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False
export AIRFLOW__CORE__LOAD_EXAMPLES=False
airflow initdb
airflow scheduler &
airflow webserver -p 8080 &

# We append a gsutil rsync command to the cron file and have this run every minute to sync dags.
PROJECT_ID=$(gcloud info --format="get(config.project)")
GCS_DAG_BUCKET=$PROJECT_ID-datalab-airflow
AIRFLOW_CRON=temp_crontab.txt
crontab -l > $AIRFLOW_CRON
DAG_FOLDER="dags"
LOCAL_DAG_PATH=$AIRFLOW_HOME/$DAG_FOLDER
mkdir $LOCAL_DAG_PATH
echo "* * * * * gsutil rsync gs://$GCS_DAG_BUCKET/$DAG_FOLDER $LOCAL_DAG_PATH" >> $AIRFLOW_CRON
crontab $AIRFLOW_CRON
rm $AIRFLOW_CRON
"""
vm_startup_script_file_name = 'vm_startup_script.sh'
# Write the startup script to a local file so gcloud can attach it as metadata.
with open(vm_startup_script_file_name, 'w') as script_file:
    script_file.write(vm_startup_script_contents)

import subprocess
# Create the VM. The startup script runs when the VM boots and installs Airflow.
print(subprocess.check_output([
    'gcloud', 'compute', '--project', project, 'instances', 'create', vm_name,
    '--zone', zone,
    '--machine-type', 'n1-standard-1',
    '--network', 'default',
    '--maintenance-policy', 'MIGRATE',
    '--scopes', 'https://www.googleapis.com/auth/cloud-platform',
    '--image', 'debian-9-stretch-v20171025',
    '--min-cpu-platform', 'Automatic',
    '--image-project', 'debian-cloud',
    '--boot-disk-size', '10',
    '--boot-disk-type', 'pd-standard',
    '--boot-disk-device-name', vm_name,
    '--metadata-from-file', 'startup-script=' + vm_startup_script_file_name]))

NAME             ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP     STATUS
datalab-airflow  us-central1-b  n1-standard-1               10.240.0.5   35.192.103.158  RUNNING
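
To see what the cron entry written by the startup script looks like for a given project, the shell variable expansion can be mirrored in Python. This is only an illustrative sketch: `render_cron_entry` is a hypothetical helper, `my-project` is a placeholder project ID, and the defaults copy the values of `AIRFLOW_HOME` and `DAG_FOLDER` from the script.

```python
def render_cron_entry(project_id, airflow_home='/airflow', dag_folder='dags'):
    """Mirror the variable expansion the startup script performs in bash."""
    gcs_dag_bucket = project_id + '-datalab-airflow'   # GCS_DAG_BUCKET
    local_dag_path = airflow_home + '/' + dag_folder   # LOCAL_DAG_PATH
    # Five asterisks mean "every minute" in crontab syntax.
    return '* * * * * gsutil rsync gs://{0}/{1} {2}'.format(
        gcs_dag_bucket, dag_folder, local_dag_path)


# 'my-project' is a placeholder project ID.
cron_entry = render_cron_entry('my-project')
print(cron_entry)
```

Because the entry runs every minute, a DAG file uploaded to the bucket's `dags` folder should appear on the VM within about a minute. Note that `gsutil rsync` without the `-r` flag does not descend into subfolders.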



# Cleanup


In [5]:
# The following cleans up the VM and associated GCS bucket. Uncomment and run.
#!gsutil rm -r gs://$gcs_dag_bucket_name
#!gcloud compute instances delete $vm_name --zone $zone --quiet

# Verify that the cleanup worked. Uncomment and run; it should show an
# error like "BucketNotFoundException: 404 ...".
#!gsutil ls gs://$gcs_dag_bucket_name
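
If you changed the zone or VM name earlier, the cleanup commands need the same values. A hedged sketch that builds both commands from variables; `build_cleanup_commands` is a hypothetical helper and `my-project` is a placeholder project ID. It only prints the commands and does not delete anything.

```python
def build_cleanup_commands(project_id, zone, vm_name='datalab-airflow'):
    """Return the argument lists for the two cleanup commands in this notebook."""
    bucket_uri = 'gs://' + project_id + '-' + vm_name
    return [
        # Remove the DAG bucket and everything in it.
        ['gsutil', 'rm', '-r', bucket_uri],
        # Delete the Airflow VM without an interactive confirmation prompt.
        ['gcloud', 'compute', 'instances', 'delete', vm_name,
         '--zone', zone, '--quiet'],
    ]


# 'my-project' is a placeholder project ID.
cleanup_commands = build_cleanup_commands('my-project', 'us-central1-b')
for command in cleanup_commands:
    print(' '.join(command))
```

To actually run the cleanup, each list could be passed to `subprocess.check_call`. Both operations are destructive, so check the printed commands first.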