# MNIST E2E on Kubeflow on GKE

* **This is a work in progress; this is not ready for users yet**

This example guides you through the process of taking an example model, modifying it to run better within Kubeflow, and serving the resulting trained model.

## Prepare model

Existing distributed MNIST examples need a few changes before they run well as a TFJob.

Specifically, we must:

1. Add options in order to make the model configurable.
1. Use `tf.estimator.train_and_evaluate` to enable model exporting and serving.
1. Define serving signatures for model serving.

The resulting model is [model.py](model.py).
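
As a rough illustration of the first step, the flags that the TFJob spec in this walkthrough passes to `model.py` (`--tf-model-dir`, `--tf-export-dir`, `--tf-train-steps`, `--tf-batch-size`, `--tf-learning-rate`) could be parsed with `argparse` along these lines. This is a minimal sketch, not the actual contents of `model.py`; the function name `parse_args`, the default values, and the example bucket path are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the training flags used by the TFJob spec (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="MNIST training options")
    # Defaults below are placeholders chosen for illustration only.
    parser.add_argument("--tf-model-dir", type=str, default="/tmp/mnist/model",
                        help="Directory where checkpoints are written")
    parser.add_argument("--tf-export-dir", type=str, default="/tmp/mnist/export",
                        help="Directory where the exported model is written")
    parser.add_argument("--tf-train-steps", type=int, default=200,
                        help="Number of training steps")
    parser.add_argument("--tf-batch-size", type=int, default=100,
                        help="Training batch size")
    parser.add_argument("--tf-learning-rate", type=float, default=0.01,
                        help="Optimizer learning rate")
    return parser.parse_args(argv)


# argparse turns "--tf-train-steps" into the attribute "tf_train_steps".
args = parse_args(["--tf-train-steps=500", "--tf-model-dir=gs://my-bucket/mnist"])
print(args.tf_train_steps, args.tf_model_dir)
```

Exposing every tunable value as a flag is what lets the same image be reused for the parameter server, chief, and worker replicas with only the command line changing.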

### Verify we have a GCP account

* The cell below checks that this notebook was spawned with credentials to access GCP


In [18]:
import logging
import os
import uuid
from importlib import reload
from oauth2client.client import GoogleCredentials
credentials = GoogleCredentials.get_application_default()

## Install Required Libraries

Install and import the libraries required to train this model.

In [90]:
import notebook_setup
reload(notebook_setup)
notebook_setup.notebook_setup()

[I 200210 22:26:44 notebook_setup:26] pip installing fairing git+git://github.com/kubeflow/fairing.git@9b0d4ed4796ba349ac6067bbd802ff1d6454d015
[I 200210 22:26:48 notebook_setup:34] Checkout kubeflow/tf-operator @9238906
[I 200210 22:26:48 notebook_setup:37] Configure docker credentials
[I 200210 22:26:49 notebook_setup:52] Adding /home/jovyan/git_tf-operator/sdk/python to python path


### Configure The Docker Registry For Kubeflow Fairing

* In order to build docker images from your notebook we need a docker registry where the images will be stored
* Below you set some variables specifying a [GCR container registry](https://cloud.google.com/container-registry/docs/)
* Kubeflow Fairing provides a utility function to guess the name of your GCP project

In [56]:
from kubernetes import client as k8s_client
from kubeflow import fairing
from kubeflow.fairing import utils as fairing_utils
from kubeflow.fairing.builders import append
from kubeflow.fairing.deployers import job
from kubeflow.fairing.preprocessors import base as base_preprocessor

# Set up Google Container Registry (GCR) for storing output containers
# You can use any Docker container registry instead of GCR
GCP_PROJECT = fairing.cloud.gcp.guess_project_name()
DOCKER_REGISTRY = 'gcr.io/{}/fairing-job'.format(GCP_PROJECT)

## Use Kubeflow fairing to build the docker image

* You will use Kubeflow Fairing's Kaniko builder to build a Docker image that includes all your dependencies
  * You use Kaniko because you want to be able to run `pip` to install dependencies
  * Kaniko gives you the flexibility to build images from Dockerfiles

In [4]:
# TODO(https://github.com/kubeflow/fairing/issues/426): We should get rid of this once the default 
# Kaniko image is updated to a newer image than 0.7.0.
from kubeflow.fairing import constants
constants.constants.KANIKO_IMAGE = "gcr.io/kaniko-project/executor:v0.14.0"

In [12]:
from kubeflow.fairing.builders import cluster

# output_map is a map of extra files to add to the build context.
# It is a map from source location to the location inside the context.
output_map =  {
    "Dockerfile.model": "Dockerfile",
    "model.py": "model.py"
}


preprocessor = base_preprocessor.BasePreProcessor(
    command=["python"], # The base class will set this.
    input_files=[],
    path_prefix="/app", # irrelevant since we aren't preprocessing any files
    output_map=output_map)

preprocessor.preprocess()

set()
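
To make the source-to-destination mapping concrete, the sketch below copies each source file to its mapped name inside a temporary directory, which is roughly the effect `output_map` has on the build context. It is an illustration only and not Fairing's implementation; `stage_context` is a hypothetical helper and the file contents are placeholders.

```python
import os
import shutil
import tempfile


def stage_context(output_map, src_dir, context_dir):
    """Copy each source file to its mapped name under context_dir (hypothetical helper)."""
    staged = []
    for src, dst in output_map.items():
        target = os.path.join(context_dir, dst)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copyfile(os.path.join(src_dir, src), target)
        staged.append(dst)
    return sorted(staged)


# Placeholder files standing in for the real Dockerfile.model and model.py.
src_dir = tempfile.mkdtemp()
for name in ("Dockerfile.model", "model.py"):
    with open(os.path.join(src_dir, name), "w") as f:
        f.write("# placeholder\n")

context_dir = tempfile.mkdtemp()
staged = stage_context(
    {"Dockerfile.model": "Dockerfile", "model.py": "model.py"}, src_dir, context_dir)
print(staged)
```

Note how `Dockerfile.model` is renamed to `Dockerfile` on the way in, which is why the builder cell can refer to `dockerfile_path="Dockerfile"`.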

In [17]:
# Use a TensorFlow image as the base image.
# The base image is declared in our custom Dockerfile, so base_image is left empty here.
cluster_builder = cluster.cluster.ClusterBuilder(registry=DOCKER_REGISTRY,
                                                 base_image="", # base_image is set in the Dockerfile
                                                 preprocessor=preprocessor,
                                                 image_name="mnist",
                                                 dockerfile_path="Dockerfile",
                                                 pod_spec_mutators=[fairing.cloud.gcp.add_gcp_credentials_if_exists],
                                                 context_source=cluster.gcs_context.GCSContextSource())
cluster_builder.build()
logging.info(f"Built image {cluster_builder.image_tag}")

[I 200210 20:05:16 cluster:46] Building image using cluster builder.
[W 200210 20:05:16 base:92] Dockerfile already exists in Fairing context, skipping...
[I 200210 20:05:16 base:105] Creating docker context: /tmp/fairing_context_lmzqhmce
[W 200210 20:05:16 base:92] Dockerfile already exists in Fairing context, skipping...
[W 200210 20:05:17 manager:230] Waiting for fairing-builder-gxjqc-jbcjt to start...
[W 200210 20:05:17 manager:230] Waiting for fairing-builder-gxjqc-jbcjt to start...
[W 200210 20:05:17 manager:230] Waiting for fairing-builder-gxjqc-jbcjt to start...
[I 200210 20:05:19 manager:236] Pod started running True


ERROR: logging before flag.Parse: E0210 20:05:24.082563       1 metadata.go:241] Failed to unmarshal scopes: invalid character 'h' looking for beginning of value
INFO[0005] Resolved base name tensorflow/tensorflow:1.15.2-py3 to tensorflow/tensorflow:1.15.2-py3
INFO[0005] Resolved base name tensorflow/tensorflow:1.15.2-py3 to tensorflow/tensorflow:1.15.2-py3
INFO[0005] Downloading base image tensorflow/tensorflow:1.15.2-py3
ERROR: logging before flag.Parse: E0210 20:05:24.418208       1 metadata.go:142] while reading 'google-dockercfg' metadata: http status code: 404 while fetching url http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg
ERROR: logging before flag.Parse: E0210 20:05:24.420332       1 metadata.go:159] while reading 'google-dockercfg-url' metadata: http status code: 404 while fetching url http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg-url
INFO[0005] Error whil

NameError: name 'logging' is not defined

## Create a GCS Bucket

* Create a GCS bucket to store our models and other results
* Since we are running in Python, we use the Python client library, but you could also use the `gsutil` command-line tool

In [37]:
from google.cloud import storage
bucket = f"{GCP_PROJECT}-mnist"

client = storage.Client()
b = storage.Bucket(client=client, name=bucket)

if not b.exists():
    logging.info(f"Creating bucket {bucket}")
    b.create()
else:
    logging.info(f"Bucket {bucket} already exists")    

[I 200210 20:32:36 <ipython-input-37-c3f3efa8de59>:8] Creating bucket jlewi-dev-mnist


## Distributed training

* We will train the model by using TFJob to run a distributed training job

In [65]:
train_name = f"mnist-train-{uuid.uuid4().hex[:4]}"
num_ps = 1
num_workers = 2
model_dir = f"gs://{bucket}/mnist"
export_path = f"gs://{bucket}/mnist/export" 
train_steps = 200
batch_size = 100
learning_rate = .01
image = cluster_builder.image_tag

train_spec = f"""apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: {train_name}
spec:
  tfReplicaSpecs:
    Ps:
      replicas: {num_ps}
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccount: default-editor
          containers:
          - name: tensorflow
            command:
            - python
            - /opt/model.py
            - --tf-model-dir={model_dir}
            - --tf-export-dir={export_path}
            - --tf-train-steps={train_steps}
            - --tf-batch-size={batch_size}
            - --tf-learning-rate={learning_rate}
            image: {image}
            workingDir: /opt
          restartPolicy: OnFailure
    Chief:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccount: default-editor
          containers:
          - name: tensorflow
            command:
            - python
            - /opt/model.py
            - --tf-model-dir={model_dir}
            - --tf-export-dir={export_path}
            - --tf-train-steps={train_steps}
            - --tf-batch-size={batch_size}
            - --tf-learning-rate={learning_rate}
            image: {image}
            workingDir: /opt
          restartPolicy: OnFailure
    Worker:
      replicas: {num_workers}
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccount: default-editor
          containers:
          - name: tensorflow
            command:
            - python
            - /opt/model.py
            - --tf-model-dir={model_dir}
            - --tf-export-dir={export_path}
            - --tf-train-steps={train_steps}
            - --tf-batch-size={batch_size}
            - --tf-learning-rate={learning_rate}
            image: {image}
            workingDir: /opt
          restartPolicy: OnFailure
"""           
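
The three replica sections in the spec above share the same pod template and differ at most in their replica counts, so the same structure could also be built programmatically. The following is a sketch under that assumption; `replica_spec` and `build_tfjob` are hypothetical helpers, and the job name, image, and bucket values are placeholders.

```python
def replica_spec(replicas, image, args):
    """Return one tfReplicaSpecs entry mirroring the pod template in the YAML spec."""
    return {
        "replicas": replicas,
        "template": {
            "metadata": {"annotations": {"sidecar.istio.io/inject": "false"}},
            "spec": {
                "serviceAccount": "default-editor",
                "containers": [{
                    "name": "tensorflow",
                    "command": ["python", "/opt/model.py"] + list(args),
                    "image": image,
                    "workingDir": "/opt",
                }],
                "restartPolicy": "OnFailure",
            },
        },
    }


def build_tfjob(name, image, args, num_ps=1, num_workers=2):
    """Assemble a TFJob body as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {"tfReplicaSpecs": {
            "Ps": replica_spec(num_ps, image, args),
            "Chief": replica_spec(1, image, args),
            "Worker": replica_spec(num_workers, image, args),
        }},
    }


# Placeholder values; substitute your own bucket and image tag.
example_args = [
    "--tf-model-dir=gs://my-bucket/mnist",
    "--tf-export-dir=gs://my-bucket/mnist/export",
    "--tf-train-steps=200",
    "--tf-batch-size=100",
    "--tf-learning-rate=0.01",
]
job = build_tfjob("mnist-train-example",
                  "gcr.io/my-project/fairing-job/mnist:latest", example_args)
print(sorted(job["spec"]["tfReplicaSpecs"]))
```

Parsing a YAML manifest yields the same kind of nested dict, so a body built this way has the same shape as one loaded from the string spec.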

### Create the training job

* We could write the spec to a YAML file and then run `kubectl apply -f {FILE}`
* Since we are running in Jupyter, we will use the Kubernetes Python client instead
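
For comparison, the file-based option can be sketched as follows: write the spec string to a temporary file and hand that file to `kubectl apply -f`. The helper name `write_spec` and the tiny stand-in spec are assumptions; the `subprocess` call is left commented out because it needs `kubectl` and access to a cluster.

```python
import os
import tempfile


def write_spec(spec_text, directory=None):
    """Write a manifest to a .yaml file and return its path (hypothetical helper)."""
    fd, path = tempfile.mkstemp(suffix=".yaml", dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(spec_text)
    return path


# A shortened stand-in for the full TFJob spec string.
example_spec = (
    "apiVersion: kubeflow.org/v1\n"
    "kind: TFJob\n"
    "metadata:\n"
    "  name: mnist-train-example\n"
)
spec_path = write_spec(example_spec)
apply_cmd = ["kubectl", "apply", "-f", spec_path]
print(" ".join(apply_cmd))

# With kubectl configured for the cluster, the command could be run with:
# import subprocess
# subprocess.run(apply_cmd, check=True)
```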

In [92]:
from kubernetes import client as k8s_client
from kubernetes import config as k8s_config
from kubeflow.tfjob.api import tf_job_client as tf_job_client_module

tf_job_client = tf_job_client_module.TFJobClient()

In [93]:
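import yaml
# Assumption: `namespace` is not set in any cell shown, so take it from the notebook's own pod.
namespace = fairing_utils.get_current_k8s_namespace()
# safe_load needs no Loader argument, which yaml.load requires in PyYAML 6 and later.
tf_job_body = yaml.safe_load(train_spec)
tf_job = tf_job_client.create(tf_job_body, namespace=namespace)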


  


RuntimeError: Exception when calling CustomObjectsApi->create_namespaced_custom_object:         (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '7c265fd3-73f6-436a-b5d3-d874c7325aa2', 'Content-Type': 'application/json', 'Date': 'Mon, 10 Feb 2020 22:29:41 GMT', 'Content-Length': '250'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"tfjobs.kubeflow.org \"mnist-train-5a18\" already exists","reason":"AlreadyExists","details":{"name":"mnist-train-5a18","group":"kubeflow.org","kind":"tfjobs"},"code":409}
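
The 409 `AlreadyExists` response above is what the API server returns when a TFJob with the same name already exists, for example when the create cell is re-run without regenerating `train_name`. One way to make such a cell safe to re-run is sketched below. `create_unless_exists` is a hypothetical helper that inspects the error text, and `fake_create` is a stand-in so the example runs without a cluster.

```python
def create_unless_exists(create_fn, body):
    """Call create_fn(body); return None if the API reports the object already exists.

    Hypothetical helper: it matches on the error text rather than a structured
    status code, because the failure surfaces as a RuntimeError.
    """
    try:
        return create_fn(body)
    except RuntimeError as err:
        message = str(err)
        if "(409)" in message or "AlreadyExists" in message:
            return None
        raise  # Any other failure is still reported to the caller.


# Stand-in for the real client call; it always reports a conflict.
def fake_create(body):
    raise RuntimeError('(409)\nReason: Conflict\n{"reason":"AlreadyExists"}')


result = create_unless_exists(fake_create, {"metadata": {"name": "mnist-train-example"}})
print(result)
```

In the notebook, the call might look like `create_unless_exists(lambda body: tf_job_client.create(body, namespace=namespace), tf_job_body)`. Generating a fresh `train_name` before each submission avoids the conflict altogether.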




### Check the job

* We can use `kubectl` to get the status of our job

In [68]:
!kubectl get tfjobs -o yaml {train_name}

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  creationTimestamp: "2020-02-10T20:57:05Z"
  generation: 1
  name: mnist-train-5a18
  namespace: kubeflow-jlewi
  resourceVersion: "64083"
  selfLink: /apis/kubeflow.org/v1/namespaces/kubeflow-jlewi/tfjobs/mnist-train-5a18
  uid: e4b85f47-4c47-11ea-86b4-42010a8e01a3
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command:
            - python
            - /opt/model.py
            - --tf-model-dir=gs://jlewi-dev-mnist/mnist
            - --tf-export-dir=gs://jlewi-dev-mnist/mnist/export
            - --tf-train-steps=200
            - --tf-batch-size=100
            - --tf-learning-rate=0.01
            image: gcr.io/jlewi-dev/fairing-job/mnist:8EB3617D
            name: tensorflow
            workingDir: /opt
          restartPolicy: OnFailure
          serviceAccount: default-edit

## Wait For the Training Job to finish

In [71]:
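import time
crd_api = k8s_client.CustomObjectsApi()  # assumed setup; not defined in the cells shown
KF_GROUP, TFJOB_VERSION, TFJOB_PLURAL = "kubeflow.org", "v1", "tfjobs"  # assumed, from apiVersion kubeflow.org/v1
while True:
    tf_job = crd_api.get_namespaced_custom_object(
      KF_GROUP, TFJOB_VERSION, namespace, TFJOB_PLURAL, train_name)
    conditions = tf_job.get("status", {}).get("conditions") or []
    if conditions and conditions[-1]["type"] in ("Succeeded", "Failed"):
        break  # assumed exit test; the source cell is cut off after the status check
    time.sleep(10)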

In [None]:
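# The source cell passes no arguments; the job name, conditions, and namespace below are assumed.
tf_job = tf_job_client.wait_for_condition(train_name, expected_condition=["Succeeded", "Failed"], namespace=namespace)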

In [75]:
tf_job["status"]["conditions"][-1]

{'lastTransitionTime': '2020-02-10T20:58:03Z',
 'lastUpdateTime': '2020-02-10T20:58:03Z',
 'message': 'TFJob mnist-train-5a18 successfully completed.',
 'reason': 'TFJobSucceeded',
 'status': 'True',
 'type': 'Succeeded'}
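
The last entry in `status.conditions` is the one inspected above, so a small helper can turn it into a yes/no answer. This is a sketch only; `latest_condition` and `is_finished` are hypothetical names, and the example dict reuses the condition printed above.

```python
def latest_condition(tf_job):
    """Return the most recent condition of a TFJob dict, or None if there is none yet."""
    conditions = tf_job.get("status", {}).get("conditions") or []
    return conditions[-1] if conditions else None


def is_finished(tf_job):
    """True once the latest condition is a terminal one (Succeeded or Failed)."""
    cond = latest_condition(tf_job)
    return (cond is not None
            and cond["type"] in ("Succeeded", "Failed")
            and cond["status"] == "True")


example_job = {"status": {"conditions": [{
    "lastTransitionTime": "2020-02-10T20:58:03Z",
    "lastUpdateTime": "2020-02-10T20:58:03Z",
    "message": "TFJob mnist-train-5a18 successfully completed.",
    "reason": "TFJobSucceeded",
    "status": "True",
    "type": "Succeeded",
}]}}
print(is_finished(example_job), latest_condition(example_job)["reason"])
```

A job that has just been created may have no `status` field at all, which is why both helpers fall back to an empty list instead of indexing directly.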