# Kubeflow E2E MNIST Case: Building, Distributed Training and Serving

This example guides you through:
  1. Taking an example TensorFlow model and modifying it to support distributed training.
  1. Using `Kubeflow Fairing` to build a Docker image and launch a TFJob to train the model.
  1. Using `Kubeflow Fairing` to create an InferenceService (KFServing) for the trained model.
  1. Cleaning up the TFJob and InferenceService using the `kubeflow-tfjob` and `kfserving` SDK clients.

## Requirements

  * Kubeflow Fairing, TF-Operator and KFServing have been installed in the Kubernetes cluster.

### Prepare Training Code

We modified the [example](https://github.com/tensorflow/tensorflow/blob/9a24e8acfcd8c9046e1abaac9dbf5e146186f4c2/tensorflow/examples/learn/mnist.py) to be better suited for distributed training and model serving, because there is a gap between the existing distributed MNIST examples and what is needed to run well as a TFJob. The updated training code is [mnist.py](mnist.py).

### Install Required Libraries

In [1]:
!pip show kubeflow-fairing

Name: kubeflow-fairing
Version: 1.0.1
Summary: Kubeflow Fairing Python SDK.
Home-page: https://github.com/kubeflow/fairing
Author: Kubeflow Authors
Author-email: hejinchi@cn.ibm.com
License: Apache License Version 2.0
Location: /opt/python/python36/lib/python3.6/site-packages
Requires: azure-storage-file, ibm-cos-sdk, google-api-python-client, setuptools, google-cloud-logging, future, oauth2client, docker, notebook, numpy, boto3, azure-mgmt-storage, retrying, google-cloud-storage, kubernetes, kubeflow-pytorchjob, requests, grpcio, python-dateutil, tornado, six, google-auth, kubeflow-tfjob, kfserving, urllib3, httplib2, nbconvert, cloudpickle
Required-by: virtual-training, remote-training


### Configure the Docker Registry for Kubeflow Fairing

* To build Docker images from your notebook, you need a Docker registry where the images will be stored.

**Note:** Update the values in the section below to match your environment.

In [2]:
# Set the Docker registry where the image will be stored.
# Ensure you have permission to push images to this registry.
DOCKER_REGISTRY = 'index.docker.io/jinchi'

# Set the namespace. Note that the PVC created below must be in this namespace.
my_namespace = 'hejinchi'
# You can also get the default target namespace using the API below.
#namespace = fairing_utils.get_default_target_namespace()

## Create PV/PVC to Store the Exported Model 

Create a Persistent Volume (PV) and a Persistent Volume Claim (PVC). The PVC will be used by the training and serving pods for local mode in the steps below.

**Note:** Update the values in the section below to match your environment.

In [3]:
# For distributed training, the PVC must be accessible from all nodes in the cluster.
# This example creates an NFS PV to satisfy that requirement.
nfs_server = '172.16.189.69'
nfs_path = '/opt/kubeflow/data/mnist'
pv_name = 'kubeflow-mnist'
pvc_name = 'mnist-pvc'

(Optional) Skip the PV/PVC creation step below if you are using an existing PV and PVC.

In [None]:
import yaml  # needed for yaml.safe_load below
from kubernetes import client as k8s_client
from kubernetes import config as k8s_config
from kubeflow.fairing.utils import is_running_in_k8s

pv_yaml = f'''
apiVersion: v1
kind: PersistentVolume
metadata:
  name: {pv_name}
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    path: {nfs_path}
    server: {nfs_server}
'''
pvc_yaml = f'''
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {pvc_name}
  namespace: {my_namespace}
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 10Gi
'''

if is_running_in_k8s():
    k8s_config.load_incluster_config()
else:
    k8s_config.load_kube_config()

k8s_core_api = k8s_client.CoreV1Api()
k8s_core_api.create_persistent_volume(yaml.safe_load(pv_yaml))
k8s_core_api.create_namespaced_persistent_volume_claim(my_namespace, yaml.safe_load(pvc_yaml))
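
The cell above passes the dictionaries returned by `yaml.safe_load` to the Kubernetes client. As an alternative, the same request bodies could be written directly as Python dictionaries, which avoids string templating. The following is only a minimal sketch: the helper names `build_pv` and `build_pvc` are hypothetical (they are not part of any library), and the values simply mirror the ones used in this example.

```python
def build_pv(name, nfs_server, nfs_path, size="10Gi"):
    """Return a PersistentVolume body backed by NFS (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            # ReadWriteMany lets chief, PS and worker pods mount the volume together.
            "accessModes": ["ReadWriteMany"],
            "persistentVolumeReclaimPolicy": "Retain",
            "nfs": {"path": nfs_path, "server": nfs_server},
        },
    }


def build_pvc(name, namespace, size="10Gi"):
    """Return a PersistentVolumeClaim body (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            # An empty storage class, as in the YAML version above.
            "storageClassName": "",
            "resources": {"requests": {"storage": size}},
        },
    }


# Same values as in the cells above.
pv_body = build_pv("kubeflow-mnist", "172.16.189.69", "/opt/kubeflow/data/mnist")
pvc_body = build_pvc("mnist-pvc", "hejinchi")
print(pv_body["spec"]["nfs"])
print(pvc_body["metadata"])
```

These dictionaries could then be passed to `create_persistent_volume` and `create_namespaced_persistent_volume_claim` in place of the parsed YAML.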

## Use Kubeflow Fairing to Build the Docker Image and Launch a TFJob for Training

* Use Kubeflow Fairing to build a Docker image that includes all your dependencies.
* Launch a TFJob in the on-premises cluster to train the model.

First, set some custom training parameters for the TFJob.

In [4]:
num_chief = 1  # number of Chief replicas in the TFJob
num_ps = 1  # number of PS replicas in the TFJob
num_workers = 2  # number of Worker replicas in the TFJob
model_dir = "/mnt"
export_path = "/mnt/export" 
train_steps = "1000"
batch_size = "100"
learning_rate = "0.01"

Use Kubeflow Fairing to build a Docker image and push it to the Docker registry, then launch a TFJob in the on-premises cluster to train the model in a distributed fashion.

In [5]:
import uuid
from kubeflow import fairing   
from kubeflow.fairing.kubernetes.utils import mounting_pvc

tfjob_name = f'mnist-training-{uuid.uuid4().hex[:4]}'

output_map =  {
    "Dockerfile": "Dockerfile",
    "mnist.py": "mnist.py"
}

command=["python",
         "/opt/mnist.py",
         "--tf-model-dir=" + model_dir,
         "--tf-export-dir=" + export_path,
         "--tf-train-steps=" + train_steps,
         "--tf-batch-size=" + batch_size,
         "--tf-learning-rate=" + learning_rate]

fairing.config.set_preprocessor('python', command=command, path_prefix="/app", output_map=output_map)
fairing.config.set_builder(name='docker', registry=DOCKER_REGISTRY, base_image="",
                           image_name="mnist", dockerfile_path="Dockerfile")
fairing.config.set_deployer(name='tfjob', namespace=my_namespace, stream_log=False, job_name=tfjob_name,
                            chief_count=num_chief, worker_count=num_workers, ps_count=num_ps, 
                            pod_spec_mutators=[mounting_pvc(pvc_name=pvc_name, pvc_mount_path=model_dir)])
fairing.config.run()

[W 200727 22:57:30 utils:51] The function mounting_pvc has been deprecated,                     please use `volume_mounts`
[I 200727 22:57:30 config:134] Using preprocessor: <kubeflow.fairing.preprocessors.base.BasePreProcessor object at 0x7f9cb89424a8>
[I 200727 22:57:30 config:136] Using builder: <kubeflow.fairing.builders.docker.docker.DockerBuilder object at 0x7f9cb8942400>
[I 200727 22:57:30 config:138] Using deployer: <kubeflow.fairing.deployers.tfjob.tfjob.TfJob object at 0x7f9cb8942470>
[I 200727 22:57:30 docker:32] Building image using docker
[W 200727 22:57:30 docker:41] Docker command: ['python', '/opt/mnist.py', '--tf-model-dir=/mnt', '--tf-export-dir=/mnt/export', '--tf-train-steps=1000', '--tf-batch-size=100', '--tf-learning-rate=0.01']
[I 200727 22:57:30 base:107] Creating docker context: /tmp/fairing_context_yiq6_iyx
[W 200727 22:57:30 docker:56] Building docker image index.docker.io/jinchi/mnist:1929A63D...
[I 200727 22:57:30 docker:103] Build output: Step 1/5 : FROM t

(<kubeflow.fairing.preprocessors.base.BasePreProcessor at 0x7f9cb89424a8>,
 <kubeflow.fairing.builders.docker.docker.DockerBuilder at 0x7f9cb8942400>,
 <kubeflow.fairing.deployers.tfjob.tfjob.TfJob at 0x7f9cb8942470>)
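
The command built above passes five `--tf-*` flags to `/opt/mnist.py`, which is why the training parameters are kept as strings in the notebook: they are concatenated into command-line arguments. The sketch below shows one way a training script could parse such flags with `argparse`. It is an illustration only and is not taken from `mnist.py`; the defaults and type choices are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the --tf-* flags used in the TFJob command (illustrative only)."""
    parser = argparse.ArgumentParser(description="MNIST training flags (sketch)")
    # Defaults are assumptions that mirror the values used in this notebook.
    parser.add_argument("--tf-model-dir", type=str, default="/mnt")
    parser.add_argument("--tf-export-dir", type=str, default="/mnt/export")
    parser.add_argument("--tf-train-steps", type=int, default=1000)
    parser.add_argument("--tf-batch-size", type=int, default=100)
    parser.add_argument("--tf-learning-rate", type=float, default=0.01)
    return parser.parse_args(argv)


# The same arguments that appear in the Docker command logged above.
args = parse_args([
    "--tf-model-dir=/mnt",
    "--tf-export-dir=/mnt/export",
    "--tf-train-steps=1000",
    "--tf-batch-size=100",
    "--tf-learning-rate=0.01",
])
# argparse maps "--tf-train-steps" to the attribute "tf_train_steps".
print(args.tf_model_dir, args.tf_train_steps, args.tf_learning_rate)
```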

### Get the Created TFJobs

In [6]:
from kubeflow.tfjob import TFJobClient
tfjob_client = TFJobClient()

tfjob_client.get(tfjob_name, namespace=my_namespace)

{'apiVersion': 'kubeflow.org/v1',
 'kind': 'TFJob',
 'metadata': {'creationTimestamp': '2020-07-28T05:57:33Z',
  'generateName': 'fairing-tfjob-',
  'generation': 1,
  'labels': {'fairing-deployer': 'tfjob',
   'fairing-id': '3b20a9c4-d097-11ea-a7a8-00163e01bd45'},
  'name': 'mnist-training-820c',
  'namespace': 'hejinchi',
  'resourceVersion': '102422171',
  'selfLink': '/apis/kubeflow.org/v1/namespaces/hejinchi/tfjobs/mnist-training-820c',
  'uid': '637adb22-ae73-449d-a3be-7df1410d5ac7'},
 'spec': {'tfReplicaSpecs': {'Chief': {'replicas': 1,
    'template': {'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'},
      'labels': {'fairing-deployer': 'tfjob',
       'fairing-id': '3b20a9c4-d097-11ea-a7a8-00163e01bd45'},
      'name': 'fairing-deployer'},
     'spec': {'containers': [{'command': ['python',
         '/opt/mnist.py',
         '--tf-model-dir=/mnt',
         '--tf-export-dir=/mnt/export',
         '--tf-train-steps=1000',
         '--tf-batch-size=100',
        

### Wait for the Training Job to Finish

In [7]:
tfjob_client.wait_for_job(tfjob_name, namespace=my_namespace, watch=True)

NAME                           STATE                TIME                          
mnist-training-820c            Running              2020-07-28T05:57:37Z          
mnist-training-820c            Succeeded            2020-07-28T05:57:43Z          
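
Conceptually, waiting for a job means polling its state until it reaches a terminal value or a timeout expires. The following is a generic sketch of that pattern, not the SDK's implementation: `wait_until_terminal` and the `get_state` callable are hypothetical names, and the state sequence in the demo is simulated rather than read from a cluster.

```python
import time

# "Succeeded" appears in the output above; "Failed" is the other terminal TFJob state.
TERMINAL_STATES = {"Succeeded", "Failed"}


def wait_until_terminal(get_state, timeout_seconds=600, polling_interval=30,
                        sleep=time.sleep, clock=time.monotonic):
    """Call get_state() repeatedly until it returns a terminal state."""
    deadline = clock() + timeout_seconds
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"job still in state {state!r} after {timeout_seconds}s")
        sleep(polling_interval)


# Simulated state sequence standing in for real status lookups.
states = iter(["Running", "Running", "Succeeded"])
final_state = wait_until_terminal(lambda: next(states), sleep=lambda seconds: None)
print(final_state)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays.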


### Check Whether the TFJob Succeeded

In [8]:
tfjob_client.is_job_succeeded(tfjob_name, namespace=my_namespace)

True

### Get the Training Logs

In [9]:
tfjob_client.get_logs(tfjob_name, namespace=my_namespace)

[I 200727 22:57:51 tf_job_client:386] The logs of Pod mnist-training-820c-chief-0:
    
    
    W0728 05:57:37.944859 140369861093184 module_wrapper.py:139] From /opt/mnist.py:155: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
    
    
    W0728 05:57:37.945076 140369861093184 module_wrapper.py:139] From /opt/mnist.py:155: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
    
    
    W0728 05:57:37.946639 140369861093184 module_wrapper.py:139] From /opt/mnist.py:160: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
    
    INFO:tensorflow:TF_CONFIG {"cluster":{"chief":["mnist-training-820c-chief-0.hejinchi.svc:2222"],"ps":["mnist-training-820c-ps-0.hejinchi.svc:2222"],"worker":["mnist-training-820c-worker-0.hejinchi.svc:2222","mnist-training-820c-worker-1.hejinchi.svc:2222"]},"task":{"type":"chief","index":0},"environment":"cloud"}
    I0728 05:57:37.9
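
The `TF_CONFIG` value in the log above describes the cluster layout and the role of the current pod. The sketch below parses that exact JSON string to show how a process could work out its own role and address. The helper name `describe_task` is hypothetical; in a real pod the string would typically be read from the `TF_CONFIG` environment variable.

```python
import json


def describe_task(tf_config_json):
    """Summarize the role of the current task from a TF_CONFIG string (sketch)."""
    config = json.loads(tf_config_json)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    task_type = task.get("type")
    task_index = task.get("index", 0)
    return {
        "task_type": task_type,
        "task_index": task_index,
        # Address of this task within the cluster spec.
        "address": cluster[task_type][task_index],
        "is_chief": task_type == "chief",
        # Number of replicas per role.
        "replicas": {role: len(addresses) for role, addresses in cluster.items()},
    }


# Copied from the chief pod's log output above.
sample_tf_config = (
    '{"cluster":{"chief":["mnist-training-820c-chief-0.hejinchi.svc:2222"],'
    '"ps":["mnist-training-820c-ps-0.hejinchi.svc:2222"],'
    '"worker":["mnist-training-820c-worker-0.hejinchi.svc:2222",'
    '"mnist-training-820c-worker-1.hejinchi.svc:2222"]},'
    '"task":{"type":"chief","index":0},"environment":"cloud"}'
)
task_info = describe_task(sample_tf_config)
print(task_info)
```

The replica counts match the `num_chief`, `num_ps` and `num_workers` values set earlier (1, 1 and 2).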

## Deploy the Service Using KFServing

In [10]:
from kubeflow.fairing.deployers.kfserving.kfserving import KFServing
isvc_name = f'mnist-service-{uuid.uuid4().hex[:4]}'
isvc = KFServing('tensorflow', namespace=my_namespace, isvc_name=isvc_name,
                 default_storage_uri='pvc://' + pvc_name + '/export')
isvc.deploy(isvc.generate_isvc())

NAME                 READY      DEFAULT_TRAFFIC CANARY_TRAFFIC  URL                                               
mnist-service-f129   Unknown                                                                                      
mnist-service-f129   False                                                                                        
mnist-service-f129   False                                                                                        
mnist-service-f129   False                                                                                        
mnist-service-f129   False                                                                                        
mnist-service-f129   True       100                             http://mnist-service-f129.hejinchi.example.com/...


[I 200727 22:58:15 kfserving:127] Deployed the InferenceService mnist-service-f129 successfully.


'mnist-service-f129'

### Get the InferenceService

In [11]:
from kfserving import KFServingClient
kfserving_client = KFServingClient()
kfserving_client.get(namespace=my_namespace)

{'apiVersion': 'serving.kubeflow.org/v1alpha2',
 'items': [{'apiVersion': 'serving.kubeflow.org/v1alpha2',
   'kind': 'InferenceService',
   'metadata': {'creationTimestamp': '2020-07-28T02:39:33Z',
    'generateName': 'fairing-kfserving-',
    'generation': 5,
    'name': 'mnist-service-4db9',
    'namespace': 'hejinchi',
    'resourceVersion': '102344640',
    'selfLink': '/apis/serving.kubeflow.org/v1alpha2/namespaces/hejinchi/inferenceservices/mnist-service-4db9',
    'uid': '37f5c4c6-86ed-4a71-9572-810e03a1413f'},
   'spec': {'default': {'predictor': {'tensorflow': {'resources': {'limits': {'cpu': '1',
         'memory': '2Gi'},
        'requests': {'cpu': '1', 'memory': '2Gi'}},
       'runtimeVersion': '1.14.0',
       'storageUri': 'pvc://mnist-pvc/export'}}}},
   'status': {'canary': {},
    'conditions': [{'lastTransitionTime': '2020-07-28T02:40:03Z',
      'status': 'True',
      'type': 'DefaultPredictorReady'},
     {'lastTransitionTime': '2020-07-28T02:40:03Z',
      'sta

### Get the InferenceService and Service Endpoint

In [12]:
mnist_isvc = kfserving_client.get(isvc_name, namespace=my_namespace)
mnist_isvc_name = mnist_isvc['metadata']['name']
mnist_isvc_endpoint = mnist_isvc['status'].get('url', '')
print("MNIST Service Endpoint: " + mnist_isvc_endpoint)

MNIST Service Endpoint: http://mnist-service-f129.hejinchi.example.com/v1/models/mnist-service-f129
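
In the next step, the request is sent to the Istio ingress gateway's cluster IP and carries the service hostname in a `Host` header. As a rough sketch, both pieces can be derived from the endpoint printed above using only the standard library. The helper name `predict_target` is hypothetical, and the gateway IP is the one that appears in the `curl` output of this example run.

```python
from urllib.parse import urlparse


def predict_target(endpoint, gateway_ip):
    """Split an InferenceService endpoint into a predict URL and a Host header value."""
    parsed = urlparse(endpoint)
    # The hostname goes into the Host header so the gateway can route the request.
    host_header = parsed.netloc
    # The path is reused against the gateway IP, with the ":predict" verb appended.
    predict_url = f"http://{gateway_ip}{parsed.path}:predict"
    return predict_url, host_header


predict_url, host_header = predict_target(
    "http://mnist-service-f129.hejinchi.example.com/v1/models/mnist-service-f129",
    "10.110.51.90",
)
print(predict_url)
print(host_header)
```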


### Run a Prediction Against the InferenceService

In [13]:
ISTIO_CLUSTER_IP=!kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.clusterIP}'
CLUSTER_IP=ISTIO_CLUSTER_IP[0]
MODEL_HOST=f"Host: {mnist_isvc_name}.{my_namespace}.example.com"
!curl -v -H "{MODEL_HOST}" http://{CLUSTER_IP}/v1/models/{mnist_isvc_name}:predict -d @./input.json

* About to connect() to 10.110.51.90 port 80 (#0)
*   Trying 10.110.51.90...
* Connected to 10.110.51.90 (10.110.51.90) port 80 (#0)
> POST /v1/models/mnist-service-f129:predict HTTP/1.1
> User-Agent: curl/7.29.0
> Accept: */*
> Host: mnist-service-f129.hejinchi.example.com
> Content-Length: 2052
> Content-Type: application/x-www-form-urlencoded
> Expect: 100-continue
> 
< HTTP/1.1 100 Continue
< HTTP/1.1 200 OK
< content-length: 257
< content-type: application/json
< date: Tue, 28 Jul 2020 05:58:38 GMT
< x-envoy-upstream-service-time: 314
< server: istio-envoy
< 
{
    "predictions": [
        {
            "classes": 8,
            "predictions": [2.49753812e-05, 9.58313558e-06, 0.000792403473, 0.0170188081, 3.87205364e-05, 0.00188907969, 1.35709442e-05, 2.67499416e-07, 0.980212331, 2.4302895e-07]
        }
    ]
* Connection #0 to host 10.110.51.90 left intact
}
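
The response body contains a `classes` field and a list of ten scores that appear to be per-digit probabilities. The sketch below parses the JSON shown above (with the interleaved `curl` connection message removed) and checks that the reported class is the index of the highest score. The helper name `top_class` is hypothetical.

```python
import json

# Response body copied from the curl output above.
response_text = """
{
    "predictions": [
        {
            "classes": 8,
            "predictions": [2.49753812e-05, 9.58313558e-06, 0.000792403473,
                            0.0170188081, 3.87205364e-05, 0.00188907969,
                            1.35709442e-05, 2.67499416e-07, 0.980212331,
                            2.4302895e-07]
        }
    ]
}
"""


def top_class(text):
    """Return (reported class, index of highest score, highest score)."""
    result = json.loads(text)["predictions"][0]
    scores = result["predictions"]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return result["classes"], best_index, scores[best_index]


reported, best_index, confidence = top_class(response_text)
print(reported, best_index, round(confidence, 4))  # 8 8 0.9802
```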

## Clean Up

Delete the TFJob.

In [14]:
tfjob_client.delete(tfjob_name, namespace=my_namespace)

{'kind': 'Status',
 'apiVersion': 'v1',
 'metadata': {},
 'status': 'Success',
 'details': {'name': 'mnist-training-820c',
  'group': 'kubeflow.org',
  'kind': 'tfjobs',
  'uid': '637adb22-ae73-449d-a3be-7df1410d5ac7'}}

Delete the InferenceService.

In [15]:
kfserving_client.delete(isvc_name, namespace=my_namespace)

{'kind': 'Status',
 'apiVersion': 'v1',
 'metadata': {},
 'status': 'Success',
 'details': {'name': 'mnist-service-f129',
  'group': 'serving.kubeflow.org',
  'kind': 'inferenceservices',
  'uid': 'bde8a9ca-e7dc-4430-abd1-7eae0c298311'}}