# TF on GKE

This notebook shows how to run the [TensorFlow CIFAR10 sample](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator) on GKE using TfJobs.

In [2]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')


## Requirements

To run this notebook you must have the following installed:
  * gcloud
  * kubectl
  * helm
  * Kubernetes Python client library
  
There is a Docker image based on Datalab suitable for running this notebook.
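
A quick way to confirm that the command-line tools are on the `PATH` before going further is sketched below. It assumes Python 3 (for `shutil.which`); the helper name `missing_tools` is illustrative and not part of this notebook.

```python
import shutil

# Command-line tools listed under Requirements.
REQUIRED_TOOLS = ("gcloud", "kubectl", "helm")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
  """Return the subset of `tools` that cannot be found on the PATH."""
  return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
  missing = missing_tools()
  if missing:
    print("Missing tools: " + ", ".join(missing))
  else:
    print("All required tools found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default is enough.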

## Preliminaries

TODO(jlewi): Should we bake the dependencies into a Docker container?

In [26]:
from kubernetes import client as k8s_client
from kubernetes import config as k8s_config
import datetime
from googleapiclient import discovery
from googleapiclient import errors
from oauth2client.client import GoogleCredentials
import os
import logging
import subprocess
import time

logging.getLogger().setLevel(logging.INFO)

Change **project** to a project you have access to.
* GKE must be enabled for that project.
* Optionally, change the cluster name.
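
If you would rather not edit the cell by hand, the settings could be read from environment variables instead. This is only a sketch: the variable names `GKE_PROJECT`, `GKE_ZONE` and `GKE_CLUSTER_NAME` and the helper `load_settings` are hypothetical, and the fallbacks are the defaults used in this notebook.

```python
import os


def load_settings(env=os.environ):
  """Read cluster settings from `env`, falling back to the notebook defaults."""
  return {
      # The environment variable names below are assumptions for illustration.
      "project": env.get("GKE_PROJECT", "cloud-ml-dev"),
      "zone": env.get("GKE_ZONE", "us-east1-d"),
      "cluster_name": env.get("GKE_CLUSTER_NAME", "gke-tf-example"),
  }


settings = load_settings()
print(settings)
```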

In [29]:
project="cloud-ml-dev"
zone="us-east1-d"
cluster_name="gke-tf-example"

gke = discovery.build("container", "v1")

In [2]:
# Install kubectl
!gcloud components install -q kubectl



Your current Cloud SDK version is: 167.0.0
Installing components from version: 167.0.0

┌──────────────────────────────────────────────┐
│     These components will be installed.      │
├─────────────────────────┬─────────┬──────────┤
│           Name          │ Version │   Size   │
├─────────────────────────┼─────────┼──────────┤
│ kubectl                 │         │          │
│ kubectl (Linux, x86_64) │   1.7.3 │ 16.0 MiB │
└─────────────────────────┴─────────┴──────────┘

For the latest full release notes, please visit:
  https://cloud.google.com/sdk/release_notes

╔════════════════════════════════════════════════════════════╗
╠═ Creating update staging area                             ═╣
╠════════════════════════════════════════════════════════════╣
╠═ Installing: kubectl                                      ═╣
╠════════════════════════════════════════════════════════════╣
╠═ Installing: kubectl (Linux, x86_64)                      ═╣
╠═══════════════════════════════════════════

In [3]:
# Install Python K8s client library
!pip install kubernetes

Collecting kubernetes
  Downloading kubernetes-3.0.0-py2.py3-none-any.whl (815kB)
Collecting urllib3!=1.21,>=1.19.1 (from kubernetes)
  Downloading urllib3-1.22-py2.py3-none-any.whl (132kB)
Collecting setuptools>=21.0.0 (from kubernetes)
  Downloading setuptools-36.6.0-py2.py3-none-any.whl (481kB)
Collecting python-dateutil>=2.5.3 (from kubernetes)
  Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)
Collecting pyyaml>=3.12 (from kubernetes)
  Downloading PyYAML-3.12.tar.gz (253kB)
Collecting websocket-client<=0.40.0,>=0.32.0 (from kubernetes)
  Downloading websocket_client-0.40.0.tar.gz (196kB)

### Some Utility Functions

In [27]:
def run(command, cwd=None):
  logging.info("Running: %s", " ".join(command))
  subprocess.check_call(command, cwd=cwd)


class TimeoutError(Exception):
  """An error indicating an operation timed out."""


def wait_for_operation(client,
                       project,
                       zone,
                       op_id,
                       timeout=datetime.timedelta(hours=1),
                       polling_interval=datetime.timedelta(seconds=5)):
  """Wait for the specified operation to complete.

  Args:
    client: Client for the API that owns the operation.
    project: ID of the project that owns the operation.
    zone: Zone of the operation. Set to None if it's a global operation.
    op_id: Operation id.
    timeout: A datetime.timedelta expressing the amount of time to wait before
      giving up.
    polling_interval: A datetime.timedelta to represent the amount of time to
      wait between requests polling for the operation status.

  Returns:
    op: The final operation.

  Raises:
    TimeoutError: if we timeout waiting for the operation to complete.
  """
  endtime = datetime.datetime.now() + timeout
  while True:
    if zone:
      op = client.projects().zones().operations().get(
          projectId=project, zone=zone,
          operationId=op_id).execute()
    else:
      op = client.globalOperations().get(project=project,
                                         operation=op_id).execute()

    status = op.get("status", "")
    # TODO: Handle statuses other than DONE.
    if status == "DONE":
      return op
    if datetime.datetime.now() > endtime:
      raise TimeoutError("Timed out waiting for op: {0} to complete.".format(
          op_id))
    time.sleep(polling_interval.total_seconds())
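
`wait_for_operation` only checks for `DONE`. The sketch below shows one way the same polling loop could also surface failures. It is not the notebook's implementation: `poll_operation`, `OperationFailedError` and `OperationTimeoutError` are illustrative names, and it assumes that a failed operation either reports the status `ABORTING` or finishes as `DONE` with a non-empty `error` field. The `fetch` callable stands in for the API request, so the loop can be exercised without a cluster.

```python
import datetime
import time


class OperationFailedError(Exception):
  """Hypothetical error for an operation that ended without succeeding."""


class OperationTimeoutError(Exception):
  """Hypothetical error for an operation that did not finish in time."""


def poll_operation(fetch,
                   timeout=datetime.timedelta(minutes=5),
                   polling_interval=datetime.timedelta(seconds=5),
                   sleep=time.sleep):
  """Call fetch() repeatedly until the operation it returns is finished.

  Args:
    fetch: Zero-argument callable returning the operation as a dict.
    timeout: A datetime.timedelta; how long to wait before giving up.
    polling_interval: A datetime.timedelta; pause between two polls.
    sleep: Function used to pause; replaceable in tests.

  Returns:
    The final operation dict.
  """
  endtime = datetime.datetime.now() + timeout
  while True:
    op = fetch()
    status = op.get("status", "")
    if status == "DONE":
      # Assumption: a failed operation carries a non-empty "error" field.
      if op.get("error"):
        raise OperationFailedError("Operation failed: {0}".format(op["error"]))
      return op
    if status == "ABORTING":
      raise OperationFailedError("Operation is aborting: {0}".format(op))
    if datetime.datetime.now() > endtime:
      raise OperationTimeoutError("Timed out waiting for the operation.")
    sleep(polling_interval.total_seconds())


# Usage with canned responses instead of a real API client.
responses = iter([{"status": "PENDING"},
                  {"status": "RUNNING"},
                  {"status": "DONE", "name": "op-1"}])
final = poll_operation(lambda: next(responses), sleep=lambda seconds: None)
print(final["name"])
```

Against the real API, `fetch` could be a lambda wrapping the `operations().get(...).execute()` call used in `wait_for_operation`.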


## GKE Cluster Setup

* The instructions below create a **CPU** cluster
* To create a GKE cluster with GPUs, sign up for the [GKE GPU Alpha](https://goo.gl/forms/ef7eh2x00hV3hahx1)
* TODO(jlewi): Update code once GPUs are in beta.

In [30]:
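def create_cluster(gke, name, project, zone):
  """Create the cluster and wait for the create operation to finish.

  Args:
    gke: Client for GKE.
    name: Name of the cluster to create.
    project: ID of the project to create the cluster in.
    zone: Zone to create the cluster in.
  """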
  cluster_request = {
      "cluster": {
          "name": name,
          "description": "A GKE cluster for TF.",
          "initialNodeCount": 1,
          "nodeConfig": {
              "machineType": "n1-standard-8",
          },
      }
  }
  request = gke.projects().zones().clusters().create(body=cluster_request,
                                                     projectId=project,
                                                     zone=zone)

  try:
    logging.info("Creating cluster; project=%s, zone=%s, name=%s", project,
                 zone, name)
    response = request.execute()
    logging.info("Response %s", response)
    create_op = wait_for_operation(gke, project, zone, response["name"])
    logging.info("Cluster creation done.\n %s", create_op)

  except errors.HttpError as e:
    logging.error("Exception occurred creating cluster: %s, status: %s",
                  e, e.resp["status"])
    # Status appears to be a string.
    # 409 means the cluster already exists, so there is nothing to create.
    if e.resp["status"] == '409':
      pass
    else:
      raise

create_cluster(gke, cluster_name, project, zone)
logging.info("Configuring kubectl")
run(["gcloud", "--project=" + project, "container",
     "clusters", "--zone=" + zone, "get-credentials", cluster_name])


INFO:root:Creating cluster; project=cloud-ml-dev, zone=us-east1-d, name=gke-tf-example
ERROR:root:Exception occured creating cluster: <HttpError 409 when requesting https://container.googleapis.com/v1/projects/cloud-ml-dev/zones/us-east1-d/clusters?alt=json returned "The resource "projects/cloud-ml-dev/zones/us-east1-d/clusters/gke-tf-example" already exists.">, status: 409
INFO:root:Configuring kubectl
INFO:root:Running: gcloud --project=cloud-ml-dev container clusters --zone=us-east1-d get-credentials gke-tf-example
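
The request body in `create_cluster` hard-codes the machine type and node count. A minimal sketch of a parameterized version follows; the helper name `build_cluster_request` and its defaults are illustrative, while the field names are the ones already used in `create_cluster`.

```python
def build_cluster_request(name, machine_type="n1-standard-8", node_count=1):
  """Build the body for a clusters().create() call.

  Args:
    name: Name of the cluster.
    machine_type: Machine type for the nodes.
    node_count: Initial number of nodes; must be at least 1.
  """
  if node_count < 1:
    raise ValueError("node_count must be at least 1")
  return {
      "cluster": {
          "name": name,
          "description": "A GKE cluster for TF.",
          "initialNodeCount": node_count,
          "nodeConfig": {
              "machineType": machine_type,
          },
      }
  }


body = build_cluster_request("gke-tf-example", node_count=3)
print(body["cluster"]["initialNodeCount"])
```

The resulting dict can be passed as `body=` to `gke.projects().zones().clusters().create(...)` in place of the inline literal.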


In [31]:
k8s_config.load_kube_config()
api_client = k8s_client.ApiClient()
v1 = k8s_client.CoreV1Api(api_client)
print(v1.api_client.host)
print(v1.list_pod_for_all_namespaces())

NAME                                            STATUS    AGE       VERSION
gke-gke-tf-example-default-pool-97078f00-mw22   Ready     16h       v1.7.6
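
To build a node summary like the one above from Python, one option is `CoreV1Api.list_node`. The following is a minimal sketch, assuming Python 3 and the attribute names of the Kubernetes Python client's node objects (`metadata.name`, `status.conditions`, `status.node_info.kubelet_version`). The helpers `is_ready`, `summarize_nodes` and `list_cluster_nodes` are illustrative names; the demo uses stand-in objects so it runs without a cluster.

```python
from types import SimpleNamespace


def is_ready(node):
  """Return True if the node reports a Ready condition with status "True"."""
  for condition in node.status.conditions or []:
    if condition.type == "Ready":
      return condition.status == "True"
  return False


def summarize_nodes(nodes):
  """Return (name, readiness, kubelet version) tuples for the given nodes."""
  return [(node.metadata.name,
           "Ready" if is_ready(node) else "NotReady",
           node.status.node_info.kubelet_version)
          for node in nodes]


def list_cluster_nodes():
  """Fetch the nodes of the cluster configured in the local kubeconfig."""
  # Imported lazily so the helpers above can be used without the library.
  from kubernetes import client, config
  config.load_kube_config()
  return client.CoreV1Api().list_node().items


# Demo with a stand-in object shaped like a node returned by the client.
fake_node = SimpleNamespace(
    metadata=SimpleNamespace(name="example-node"),
    status=SimpleNamespace(
        conditions=[SimpleNamespace(type="Ready", status="True")],
        node_info=SimpleNamespace(kubelet_version="v1.7.6")))
for row in summarize_nodes([fake_node]):
  print("%s  %s  %s" % row)
```

With a configured cluster, `summarize_nodes(list_cluster_nodes())` would produce the same tuples for the real nodes.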


In [24]:
import kubernetes
kubernetes.__path__

['/usr/local/lib/python2.7/dist-packages/kubernetes']