This notebook creates a VM in the user's project that runs the Airflow scheduler and webserver. A default GCP zone for the VM is set below; feel free to change it.

## Airflow Dashboard
After the Airflow VM has been set up successfully, you can view the Airflow Dashboard through an SSH tunnel to the VM. For example, the following command opens such a tunnel:
gcloud compute ssh --zone us-central1-b datalab-airflow -- -N -p 22 -L localhost:5000:localhost:8080

Once the tunnel is open, you can view the dashboard by navigating to http://localhost:5000 in your browser.
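
If you change the zone, the sample tunnel command has to change with it. A minimal sketch of building the command from the notebook's settings, assuming a hypothetical helper `tunnel_command` (it is not part of datalab or gcloud) and the same flags and ports as the sample command:

```python
def tunnel_command(zone, vm_name='datalab-airflow',
                   local_port=5000, remote_port=8080):
    """Return the gcloud argv that forwards local_port to the Airflow webserver.

    Hypothetical helper for illustration; the defaults mirror the sample command.
    """
    forward = 'localhost:{0}:localhost:{1}'.format(local_port, remote_port)
    return ['gcloud', 'compute', 'ssh', '--zone', zone, vm_name,
            '--', '-N', '-p', '22', '-L', forward]


# Print the command so it can be pasted into a terminal.
print(' '.join(tunnel_command('us-central1-b')))
```

The helper only builds the argument list; run the printed command in a terminal outside the notebook, since the tunnel blocks for as long as it is open.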

In [1]:
# Get the latest datalab version, then restart the kernel.
!pip install --upgrade --force-reinstall datalab

Collecting datalab
Collecting seaborn==0.7.0 (from datalab)
Collecting pytz>=2015.4 (from datalab)
  Using cached pytz-2017.3-py2.py3-none-any.whl
Collecting pyyaml==3.11 (from datalab)
Collecting httplib2==0.10.3 (from datalab)
Collecting ipykernel==4.5.2 (from datalab)
  Using cached ipykernel-4.5.2-py2.py3-none-any.whl
Collecting scikit-learn==0.18.2 (from datalab)
  Using cached scikit_learn-0.18.2-cp27-cp27mu-manylinux1_x86_64.whl
Collecting future==0.16.0 (from datalab)
Collecting pandas==0.22.0 (from datalab)
  Using cached pandas-0.22.0-cp27-cp27mu-manylinux1_x86_64.whl
Collecting oauth2client==2.2.0 (from datalab)
Collecting pandas-profiling>=1.0.0a2 (from datalab)
  Using cached pandas_profiling-1.4.1-py2.py3-none-any.whl
Collecting requests==2.9.1 (from datalab)
  Using cached requests-2.9.1-py2.py3-none-any.whl
Collecting jsonschema==2.6.0 (from datalab)
  Using cached jsonschema-2.6.0-py2.py3-none-any.whl
Collecting configparser==3.5.0 (from datalab)
Collecting scikit-imag

Collecting pexpect; sys_platform != "win32" (from ipython>=4.0.0->ipykernel==4.5.2->datalab)
  Using cached pexpect-4.3.1-py2.py3-none-any.whl
Collecting pathlib2; python_version == "2.7" or python_version == "3.3" (from ipython>=4.0.0->ipykernel==4.5.2->datalab)
  Using cached pathlib2-2.3.0-py2.py3-none-any.whl
Collecting setuptools>=18.5 (from ipython>=4.0.0->ipykernel==4.5.2->datalab)
  Using cached setuptools-38.4.0-py2.py3-none-any.whl
Collecting prompt-toolkit<2.0.0,>=1.0.4 (from ipython>=4.0.0->ipykernel==4.5.2->datalab)
  Using cached prompt_toolkit-1.0.15-py2-none-any.whl
Collecting enum34; python_version == "2.7" (from traitlets>=4.1.0->ipykernel==4.5.2->datalab)
  Using cached enum34-1.1.6-py2-none-any.whl
Collecting ipython-genutils (from traitlets>=4.1.0->ipykernel==4.5.2->datalab)
  Using cached ipython_genutils-0.2.0-py2.py3-none-any.whl
Collecting singledispatch (from tornado>=4.0->ipykernel==4.5.2->datalab)
  Using cached singledispatch-3.4.0.3-py2.py3-none-any.whl
Co

  Found existing installation: traitlets 4.3.2
    Uninstalling traitlets-4.3.2:
      Successfully uninstalled traitlets-4.3.2
  Found existing installation: backports.shutil-get-terminal-size 1.0.0
    Uninstalling backports.shutil-get-terminal-size-1.0.0:
      Successfully uninstalled backports.shutil-get-terminal-size-1.0.0
  Found existing installation: Pygments 2.2.0
    Uninstalling Pygments-2.2.0:
      Successfully uninstalled Pygments-2.2.0
  Found existing installation: ptyprocess 0.5.2
    Uninstalling ptyprocess-0.5.2:
      Successfully uninstalled ptyprocess-0.5.2
  Found existing installation: pexpect 4.3.1
    Uninstalling pexpect-4.3.1:
      Successfully uninstalled pexpect-4.3.1
  Found existing installation: setuptools 38.4.0
    Uninstalling setuptools-38.4.0:
      Successfully uninstalled setuptools-38.4.0
  Found existing installation: wcwidth 0.1.7
    Uninstalling wcwidth-0.1.7:
      Successfully uninstalled wcwidth-0.1.7
  Found existing installation: prom

    Uninstalling gapic-google-cloud-datastore-v1-0.15.3:
      Successfully uninstalled gapic-google-cloud-datastore-v1-0.15.3
  Found existing installation: google-cloud-datastore 1.4.0
    Uninstalling google-cloud-datastore-1.4.0:
      Successfully uninstalled google-cloud-datastore-1.4.0
  Found existing installation: proto-google-cloud-error-reporting-v1beta1 0.15.3
    Uninstalling proto-google-cloud-error-reporting-v1beta1-0.15.3:
      Successfully uninstalled proto-google-cloud-error-reporting-v1beta1-0.15.3
  Found existing installation: gapic-google-cloud-error-reporting-v1beta1 0.15.3
    Uninstalling gapic-google-cloud-error-reporting-v1beta1-0.15.3:
      Successfully uninstalled gapic-google-cloud-error-reporting-v1beta1-0.15.3
  Found existing installation: proto-google-cloud-logging-v2 0.91.3
    Uninstalling proto-google-cloud-logging-v2-0.91.3:
      Successfully uninstalled proto-google-cloud-logging-v2-0.91.3
  Found existing installation: gapic-google-cloud-loggi

In [2]:
zone='us-central1-b'

In [3]:
from google.datalab import Context
import google.datalab.storage as storage

project = Context.default().project_id
vm_name = 'datalab-airflow'

# The name of this GCS bucket follows a convention between this notebook and 
# the 'BigQuery Pipeline' tutorial notebook, so don't change this.
gcs_dag_bucket_name = project + '-' + vm_name
gcs_dag_bucket = storage.Bucket(gcs_dag_bucket_name)
gcs_dag_bucket.create()

Google Cloud Storage Bucket gs://<project-id>-datalab-airflow
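
The VM startup script in this notebook syncs the `dags` folder of this bucket to the VM every minute, so a DAG file is picked up by Airflow once it is placed under that folder. A small sketch of how the destination path could be derived, assuming a hypothetical helper `dag_destination` and placeholder values (`my-project`, `example_dag.py`) that do not come from the notebook:

```python
def dag_destination(project, dag_file_name,
                    vm_name='datalab-airflow', dag_folder='dags'):
    """Return the GCS URI under which a DAG file would be synced to the VM.

    Hypothetical helper; the bucket name follows the '<project>-<vm_name>'
    convention used above, and 'dags' matches DAG_FOLDER in the startup script.
    """
    bucket_name = project + '-' + vm_name
    return 'gs://{0}/{1}/{2}'.format(bucket_name, dag_folder, dag_file_name)


# 'my-project' and 'example_dag.py' are placeholders for illustration.
print(dag_destination('my-project', 'example_dag.py'))
```

Copying a file to the printed URI (for example with `gsutil cp`) should make it appear in the VM's DAG folder after the next sync.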

In [4]:
vm_startup_script_contents = """#!/bin/bash
apt-get update
apt-get --assume-yes install python-pip

pip install datalab==1.1.2
pip install apache-airflow==1.9.0
pip install pandas-gbq==0.3.0

export AIRFLOW_HOME=/airflow
export AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False
export AIRFLOW__CORE__LOAD_EXAMPLES=False
airflow initdb
airflow scheduler &
airflow webserver -p 8080 &

# We append a gsutil rsync command to the cron file and have this run every minute to sync dags.
PROJECT_ID=$(gcloud info --format="get(config.project)")
GCS_DAG_BUCKET=$PROJECT_ID-datalab-airflow
AIRFLOW_CRON=temp_crontab.txt
crontab -l > $AIRFLOW_CRON
DAG_FOLDER="dags"
LOCAL_DAG_PATH=$AIRFLOW_HOME/$DAG_FOLDER
mkdir -p $LOCAL_DAG_PATH
echo "* * * * * gsutil rsync gs://$GCS_DAG_BUCKET/$DAG_FOLDER $LOCAL_DAG_PATH" >> $AIRFLOW_CRON
crontab $AIRFLOW_CRON
rm $AIRFLOW_CRON
"""
vm_startup_script_file_name = 'vm_startup_script.sh'
# Write the startup script to a local file so gcloud can attach it as metadata.
with open(vm_startup_script_file_name, 'w') as script_file:
    script_file.write(vm_startup_script_contents)

import subprocess
# Create the VM; the startup script runs when the instance boots.
print(subprocess.check_output([
    'gcloud', 'compute', '--project', project, 'instances', 'create', vm_name,
    '--zone', zone,
    '--machine-type', 'n1-standard-1',
    '--network', 'default',
    '--maintenance-policy', 'MIGRATE',
    '--scopes', 'https://www.googleapis.com/auth/cloud-platform',
    '--image', 'debian-9-stretch-v20171025',
    '--image-project', 'debian-cloud',
    '--min-cpu-platform', 'Automatic',
    '--boot-disk-size', '10',
    '--boot-disk-type', 'pd-standard',
    '--boot-disk-device-name', vm_name,
    '--metadata-from-file', 'startup-script=' + vm_startup_script_file_name]))

NAME             ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP     STATUS
datalab-airflow  us-central1-b  n1-standard-1               10.240.0.5   35.192.103.158  RUNNING
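
A `RUNNING` status only means the instance has booted; the startup script keeps installing packages afterwards, so the dashboard may not be reachable right away. To re-check the VM status later, a sketch along these lines may help, assuming a hypothetical helper `status_command` and gcloud's `get(status)` output format:

```python
def status_command(project, zone, vm_name='datalab-airflow'):
    """Return the gcloud argv that prints only the instance status.

    Hypothetical helper for illustration; it mirrors the flag layout of the
    'instances create' call above.
    """
    return ['gcloud', 'compute', '--project', project,
            'instances', 'describe', vm_name,
            '--zone', zone,
            '--format', 'get(status)']


# 'my-project' is a placeholder; in the notebook you would pass `project`.
print(' '.join(status_command('my-project', 'us-central1-b')))
```

In the notebook, the list can be passed to `subprocess.check_output` in the same way as the create command.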



# Cleanup


In [5]:
# The following deletes the VM and the associated GCS bucket. Uncomment and run.
#!gsutil rm -r gs://$gcs_dag_bucket_name
#!gcloud compute instances delete $vm_name --zone $zone --quiet

# Verify that the cleanup worked. Uncomment and run; it should show an error
# like "BucketNotFoundException: 404 ...".
#!gsutil ls gs://$gcs_dag_bucket_name