# Kubeflow Fairing E2E MNIST Case: Building, Training and Serving

This example guides you through:
  1. Taking an example TensorFlow model and modifying it to support distributed training.
  1. Using Kubeflow Fairing to build a Docker image and launch a TFJob to train the model.
  1. Using Kubeflow Fairing to create an InferenceService (KFServing) to deploy the trained model.
  1. Cleaning up the TFJob and InferenceService using the `kubeflow-tfjob` and `kfserving` SDK clients.

## Requirements

  * TF-Operator and KFServing are installed in the Kubernetes cluster.

### Prepare Training Code

We modified the TensorFlow [MNIST example](https://github.com/tensorflow/tensorflow/blob/9a24e8acfcd8c9046e1abaac9dbf5e146186f4c2/tensorflow/examples/learn/mnist.py) to make it better suited for distributed training and model serving, because the existing distributed MNIST examples need changes before they run well as a TFJob. The updated training code is in [mnist.py](mnist.py).

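As a minimal sketch of what supporting distributed training involves: the TF-Operator tells each replica its role through the `TF_CONFIG` environment variable, and a training script can read it to decide what to do. The helper names `parse_tf_config` and `should_export`, the sample host names, and the chief-only export policy are illustrative assumptions, not taken from `mnist.py`.

```python
import json
import os


def parse_tf_config(raw=None):
    """Return (cluster, task_type, task_index) from a TF_CONFIG JSON string."""
    if raw is None:
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw or "{}")
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    # Without TF_CONFIG, fall back to a single local "chief" process.
    return cluster, task.get("type", "chief"), int(task.get("index", 0))


def should_export(task_type):
    """Assumed policy: only the chief writes the exported model."""
    return task_type == "chief"


# Hypothetical TF_CONFIG value for the second worker of a small job.
sample = ('{"cluster": {"chief": ["job-chief-0:2222"], "ps": ["job-ps-0:2222"], '
          '"worker": ["job-worker-0:2222", "job-worker-1:2222"]}, '
          '"task": {"type": "worker", "index": 1}}')
cluster, task_type, task_index = parse_tf_config(sample)
print(task_type, task_index, len(cluster["worker"]), should_export(task_type))
```

Restricting the export to a single replica is one common way to avoid several pods writing the same files to shared storage.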
### Install Required Libraries

In [1]:
!pip install git+git://github.com/kubeflow/fairing.git@dc61c4c88f233edaf22b13bbfb184ded0ed877a4

Collecting git+git://github.com/kubeflow/fairing.git@dc61c4c88f233edaf22b13bbfb184ded0ed877a4
  Cloning git://github.com/kubeflow/fairing.git (to revision dc61c4c88f233edaf22b13bbfb184ded0ed877a4) to /tmp/pip-req-build-yqjo1vet
  Running command git clone -q git://github.com/kubeflow/fairing.git /tmp/pip-req-build-yqjo1vet

Building wheels for collected packages: kubeflow-fairing
  Building wheel for kubeflow-fairing (setup.py) ... done
  Created wheel for kubeflow-fairing: filename=kubeflow_fairing-0.7.1-py3-none-any.whl size=154861 sha256=4b92aa1c6d22a629ae5c766c294641ee2ccd1763d963d006860e8dce554182a7
  Stored in directory: /root/.cache/pip/wheels/10/9f/7c/dda9d45fc21712d6ee8be6592da856aba0afe96abc0bcf6099
Successfully built kubeflow-fairing


In [2]:
import yaml
from importlib import reload
# Force a reload of kubeflow; since kubeflow is a multi namespace module
# it looks like doing this in notebook_setup may not be sufficient
import kubeflow
reload(kubeflow)

<module 'kubeflow' from '/opt/python/python36/lib/python3.6/site-packages/kubeflow/__init__.py'>

### Configure The Docker Registry For Kubeflow Fairing

* To build Docker images from your notebook, you need a Docker registry where the images will be stored.

**Note:** Update the values in the cell below to match your environment.

In [3]:
# Set the Docker registry where the image will be stored.
# Ensure you have permission to push images to this registry.
DOCKER_REGISTRY = 'index.docker.io/jinchi'

# Set the namespace. Note that the PVC created below must be in this namespace.
my_namespace = 'hejinchi'
# You can also get the default target namespace using the API below
# (requires: from kubeflow.fairing import utils as fairing_utils).
#my_namespace = fairing_utils.get_default_target_namespace()

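Before building, it can help to sanity-check these two values. The sketch below is an assumption-laden illustration: `is_valid_namespace` and `image_reference` are hypothetical helpers, and the tag is passed in by hand here, whereas Fairing generates its own tag at build time. Kubernetes requires namespace names to be DNS-1123 labels (lowercase alphanumerics and `-`, at most 63 characters).

```python
import re

# DNS-1123 label: lowercase alphanumerics or '-', starting and ending alphanumeric.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_valid_namespace(name):
    """Return True if name is acceptable as a Kubernetes namespace name."""
    return len(name) <= 63 and _DNS_LABEL.fullmatch(name) is not None


def image_reference(registry, image_name, tag):
    """Join registry, image name, and tag into a full image reference."""
    return f"{registry.rstrip('/')}/{image_name}:{tag}"


print(is_valid_namespace("hejinchi"))
print(is_valid_namespace("My_Namespace"))
print(image_reference("index.docker.io/jinchi", "mnist", "54A2DC37"))
```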
## Create PV/PVC to Store The Exported Model

Create a Persistent Volume (PV) and a Persistent Volume Claim (PVC). The training and serving pods use the PVC for local mode in the steps below.

**Note:** Update the values in the cell below to match your environment.

In [4]:
# For distributed training, the PVC must be accessible from all nodes in the cluster.
# This example creates an NFS PV to satisfy that requirement.
nfs_server = '172.16.189.69'
nfs_path = '/opt/kubeflow/data/mnist'
pv_name = 'mnist-e2e-pv'
pvc_name = 'mnist-e2e-pvc'

Skip the PV/PVC creation step below if you are using an existing PV and PVC.

In [5]:
from kubernetes import client as k8s_client
from kubernetes import config as k8s_config
from kubeflow.fairing.utils import is_running_in_k8s

pv_yaml = f'''
apiVersion: v1
kind: PersistentVolume
metadata:
  name: {pv_name}
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    path: {nfs_path}
    server: {nfs_server}
'''
pvc_yaml = f'''
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {pvc_name}
  namespace: {my_namespace}
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 10Gi
'''

if is_running_in_k8s():
    k8s_config.load_incluster_config()
else:
    k8s_config.load_kube_config()

k8s_core_api = k8s_client.CoreV1Api()
k8s_core_api.create_persistent_volume(yaml.safe_load(pv_yaml))
k8s_core_api.create_namespaced_persistent_volume_claim(my_namespace, yaml.safe_load(pvc_yaml))

{'api_version': 'v1',
 'kind': 'PersistentVolumeClaim',
 'metadata': {'annotations': None,
              'cluster_name': None,
              'creation_timestamp': datetime.datetime(2020, 2, 24, 4, 53, 57, tzinfo=tzutc()),
              'deletion_grace_period_seconds': None,
              'deletion_timestamp': None,
              'finalizers': ['kubernetes.io/pvc-protection'],
              'generate_name': None,
              'generation': None,
              'initializers': None,
              'labels': None,
              'managed_fields': None,
              'name': 'mnist-e2e-pvc',
              'namespace': 'hejinchi',
              'owner_references': None,
              'resource_version': '5761749',
              'self_link': '/api/v1/namespaces/hejinchi/persistentvolumeclaims/mnist-e2e-pvc',
              'uid': '3c81bc0d-2879-4afa-850b-86a56f17c2b9'},
 'spec': {'access_modes': ['ReadWriteMany'],
          'data_source': None,
          'resources': {'limits': None, 'requests'

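Re-running the creation cell fails if the PV or PVC already exists. One possible way to make it repeatable is to treat an HTTP 409 (Conflict) response as "already there". The wrapper below is a sketch under that assumption; `create_if_absent` is a hypothetical helper, and the fake API classes exist only so the example runs without a cluster. It relies on the Kubernetes Python client's `ApiException` exposing the HTTP code as `.status`.

```python
def create_if_absent(create_fn, *args, **kwargs):
    """Call create_fn; return None if the API reports 409 (already exists)."""
    try:
        return create_fn(*args, **kwargs)
    except Exception as err:
        # ApiException from the Kubernetes client carries the HTTP code in .status.
        if getattr(err, "status", None) == 409:
            return None
        raise


class FakeConflict(Exception):
    """Stand-in for ApiException with a 409 status (illustration only)."""
    status = 409


def fake_create(body):
    """Stand-in for a create call: 'mnist-e2e-pv' is treated as existing."""
    if body["metadata"]["name"] == "mnist-e2e-pv":
        raise FakeConflict()
    return body


print(create_if_absent(fake_create, {"metadata": {"name": "mnist-e2e-pv"}}))
print(create_if_absent(fake_create, {"metadata": {"name": "another-pv"}}))
```

With the real client, the call would look like `create_if_absent(k8s_core_api.create_persistent_volume, yaml.safe_load(pv_yaml))`.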
## Use Kubeflow Fairing to build the Docker image and launch a TFJob for training

* Use Kubeflow Fairing to build a Docker image that includes all your dependencies.
* Launch a TFJob in the on-premises cluster to train the model.

First, set some custom training parameters for the TFJob.

In [6]:
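num_ps = 1  # number of PS replicas in the TFJob
num_workers = 2  # number of Worker replicas in the TFJob
model_dir = "/mnt"
export_path = "/mnt/export"
# These values are concatenated into command-line flags, so they are kept as strings.
train_steps = "200"
batch_size = "100"
learning_rate = "0.01"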

Use Kubeflow Fairing to build a Docker image and push it to the Docker registry, then launch a TFJob in the on-prem cluster for distributed training.

In [7]:
import uuid
from kubeflow import fairing
from kubeflow.fairing.kubernetes.utils import mounting_pvc

# Append a short random suffix so repeated runs do not collide on the job name.
tfjob_name = f'mnist-training-{uuid.uuid4().hex[:4]}'

# Extra files to include in the Docker build context.
output_map = {
    "Dockerfile": "Dockerfile",
    "mnist.py": "mnist.py"
}

# Command that each TFJob replica runs inside the container.
command = ["python",
           "/opt/mnist.py",
           "--tf-model-dir=" + model_dir,
           "--tf-export-dir=" + export_path,
           "--tf-train-steps=" + train_steps,
           "--tf-batch-size=" + batch_size,
           "--tf-learning-rate=" + learning_rate]

fairing.config.set_preprocessor('python', command=command, path_prefix="/app", output_map=output_map)
fairing.config.set_builder(name='docker', registry=DOCKER_REGISTRY, base_image="",
                           image_name="mnist", dockerfile_path="Dockerfile")
# Mount the PVC at model_dir so the training pods can write the model to it.
fairing.config.set_deployer(name='tfjob', namespace=my_namespace, stream_log=False,
                            worker_count=num_workers, ps_count=num_ps, job_name=tfjob_name,
                            pod_spec_mutators=[mounting_pvc(pvc_name=pvc_name, pvc_mount_path=model_dir)])
fairing.config.run()

[I 200223 20:53:57 config:125] Using preprocessor: <kubeflow.fairing.preprocessors.base.BasePreProcessor object at 0x7f42a6ef96a0>
[I 200223 20:53:57 config:127] Using builder: <kubeflow.fairing.builders.docker.docker.DockerBuilder object at 0x7f42dc4f6eb8>
[I 200223 20:53:57 config:129] Using deployer: <kubeflow.fairing.deployers.tfjob.tfjob.TfJob object at 0x7f42a6ef95f8>
[I 200223 20:53:57 docker:32] Building image using docker
[W 200223 20:53:57 docker:41] Docker command: ['python', '/opt/mnist.py', '--tf-model-dir=/mnt', '--tf-export-dir=/mnt/export', '--tf-train-steps=200', '--tf-batch-size=100', '--tf-learning-rate=0.01']
[I 200223 20:53:57 base:107] Creating docker context: /tmp/fairing_context_iu67hth0
[W 200223 20:53:57 docker:56] Building docker image index.docker.io/jinchi/mnist:54A2DC37...
[I 200223 20:53:57 docker:103] Build output: Step 1/5 : FROM tensorflow/tensorflow:1.15.2-py3
[I 200223 20:53:57 docker:103] Build output: 
[I 200223 20:53:57 docker:103] Build output: -

(<kubeflow.fairing.preprocessors.base.BasePreProcessor at 0x7f42a6ef96a0>,
 <kubeflow.fairing.builders.docker.docker.DockerBuilder at 0x7f42dc4f6eb8>,
 <kubeflow.fairing.deployers.tfjob.tfjob.TfJob at 0x7f42a6ef95f8>)

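The flags in `command` arrive in the container as plain strings, so the training script has to convert them to numbers. The following is only a sketch of how a script could parse those flags with `argparse`; the defaults mirror the values set above and are assumptions, and the actual `mnist.py` may handle its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the training flags passed in the TFJob container command."""
    parser = argparse.ArgumentParser(description="MNIST training flags (sketch)")
    parser.add_argument("--tf-model-dir", default="/mnt")
    parser.add_argument("--tf-export-dir", default="/mnt/export")
    parser.add_argument("--tf-train-steps", type=int, default=200)
    parser.add_argument("--tf-batch-size", type=int, default=100)
    parser.add_argument("--tf-learning-rate", type=float, default=0.01)
    return parser.parse_args(argv)


# Simulate two of the flags from the command list; the rest use defaults.
args = parse_args(["--tf-train-steps=500", "--tf-learning-rate=0.001"])
print(args.tf_train_steps, args.tf_batch_size, args.tf_learning_rate, args.tf_export_dir)
```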
### Get The Created TFJobs

In [8]:
from kubeflow.tfjob import TFJobClient
tfjob_client = TFJobClient()

tfjob_client.get(tfjob_name, namespace=my_namespace)

{'apiVersion': 'kubeflow.org/v1',
 'kind': 'TFJob',
 'metadata': {'creationTimestamp': '2020-02-24T04:54:01Z',
  'generateName': 'fairing-tfjob-',
  'generation': 1,
  'labels': {'fairing-deployer': 'tfjob',
   'fairing-id': 'acd47550-56c1-11ea-b7e1-00163e01bd45'},
  'name': 'mnist-training-445b',
  'namespace': 'hejinchi',
  'resourceVersion': '5761787',
  'selfLink': '/apis/kubeflow.org/v1/namespaces/hejinchi/tfjobs/mnist-training-445b',
  'uid': '6546a875-348b-41bb-8510-04abc7dd1a58'},
 'spec': {'tfReplicaSpecs': {'Chief': {'replicas': 1,
    'template': {'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'},
      'labels': {'fairing-deployer': 'tfjob',
       'fairing-id': 'acd47550-56c1-11ea-b7e1-00163e01bd45'},
      'name': 'fairing-deployer'},
     'spec': {'containers': [{'command': ['python',
         '/opt/mnist.py',
         '--tf-model-dir=/mnt',
         '--tf-export-dir=/mnt/export',
         '--tf-train-steps=200',
         '--tf-batch-size=100',
         '-

### Wait For the Training Job to finish

In [9]:
tfjob_client.wait_for_job(tfjob_name, namespace=my_namespace, watch=True)

NAME                           STATE                TIME                          
mnist-training-445b            Created              2020-02-24T04:54:01Z          
mnist-training-445b            Created              2020-02-24T04:54:01Z          
mnist-training-445b            Created              2020-02-24T04:54:01Z          
mnist-training-445b            Created              2020-02-24T04:54:01Z          
mnist-training-445b            Running              2020-02-24T04:54:08Z          
mnist-training-445b            Succeeded            2020-02-24T04:54:16Z          


### Check if the TFJob succeeded

In [10]:
tfjob_client.is_job_succeeded(tfjob_name, namespace=my_namespace)

True

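If you prefer an explicit timeout over watching the job, a small polling loop is one option. This is a generic sketch, not part of the `kubeflow-tfjob` SDK: `wait_until` and its parameters are hypothetical, and the iterator below merely simulates the state transitions shown in the table above.

```python
import time


def wait_until(predicate, timeout=600.0, interval=5.0):
    """Poll predicate() until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulated job states, standing in for repeated status queries.
states = iter(["Created", "Running", "Succeeded"])
done = wait_until(lambda: next(states) == "Succeeded", timeout=5, interval=0)
print(done)
```

Against a real job, the predicate could be `lambda: tfjob_client.is_job_succeeded(tfjob_name, namespace=my_namespace)`. Note that this sketch keeps polling until the timeout if the job fails, so a production version would also need to check for a failed state.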
### Get the Training Logs

In [11]:
tfjob_client.get_logs(tfjob_name, namespace=my_namespace)

[I 200223 20:54:16 tf_job_client:386] The logs of Pod mnist-training-445b-chief-0:
    
    
    W0224 04:54:10.486181 139704674924352 module_wrapper.py:139] From /opt/mnist.py:155: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
    
    
    W0224 04:54:10.486561 139704674924352 module_wrapper.py:139] From /opt/mnist.py:155: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
    
    
    W0224 04:54:10.488451 139704674924352 module_wrapper.py:139] From /opt/mnist.py:160: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
    
    INFO:tensorflow:TF_CONFIG {"cluster":{"chief":["mnist-training-445b-chief-0.hejinchi.svc:2222"],"ps":["mnist-training-445b-ps-0.hejinchi.svc:2222"],"worker":["mnist-training-445b-worker-0.hejinchi.svc:2222","mnist-training-445b-worker-1.hejinchi.svc:2222"]},"task":{"type":"chief","index":0},"environment":"cloud"}
    I0224 04:54:10.4

## Deploy Service using KFServing

In [12]:
from kubeflow.fairing.deployers.kfserving.kfserving import KFServing
isvc_name = f'mnist-service-{uuid.uuid4().hex[:4]}'
isvc = KFServing('tensorflow', namespace=my_namespace, isvc_name=isvc_name,
                 default_storage_uri='pvc://' + pvc_name + '/export')
isvc.deploy(isvc.generate_isvc())

NAME                 READY      DEFAULT_TRAFFIC CANARY_TRAFFIC  URL                                               
mnist-service-5041   Unknown                                                                                      
mnist-service-5041   False                                                                                        
mnist-service-5041   False                                                                                        
mnist-service-5041   False                                                                                        
mnist-service-5041   True       100                             http://mnist-service-5041.hejinchi.example.com/...


[I 200223 20:54:36 kfserving:116] Deployed the InferenceService mnist-service-5041 successfully.


'mnist-service-5041'

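The storage URI uses a path relative to the root of the PVC, not the path seen inside the training pod. Because the PVC was mounted at `/mnt` and the model was exported to `/mnt/export`, the URI ends in `/export`. The helper below is a hypothetical sketch of that translation, not a Fairing or KFServing API.

```python
import posixpath


def pvc_storage_uri(pvc_name, mount_path, export_path):
    """Translate a path inside the training pod into a pvc:// storage URI."""
    relative = posixpath.relpath(export_path, start=mount_path)
    # Reject paths that are the mount point itself or lie outside of it.
    if relative in (".", "..") or relative.startswith("../"):
        raise ValueError(f"{export_path} is not a subdirectory of {mount_path}")
    return f"pvc://{pvc_name}/{relative}"


print(pvc_storage_uri("mnist-e2e-pvc", "/mnt", "/mnt/export"))
```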
### Get the InferenceService

In [13]:
from kfserving import KFServingClient
kfserving_client = KFServingClient()
kfserving_client.get(namespace=my_namespace)

{'apiVersion': 'serving.kubeflow.org/v1alpha2',
 'items': [{'apiVersion': 'serving.kubeflow.org/v1alpha2',
   'kind': 'InferenceService',
   'metadata': {'creationTimestamp': '2020-02-24T04:54:16Z',
    'generateName': 'fairing-kfserving-',
    'generation': 5,
    'name': 'mnist-service-5041',
    'namespace': 'hejinchi',
    'resourceVersion': '5762157',
    'selfLink': '/apis/serving.kubeflow.org/v1alpha2/namespaces/hejinchi/inferenceservices/mnist-service-5041',
    'uid': '7da546be-8831-4743-8631-4d0aad20844d'},
   'spec': {'default': {'predictor': {'tensorflow': {'resources': {'limits': {'cpu': '1',
         'memory': '2Gi'},
        'requests': {'cpu': '1', 'memory': '2Gi'}},
       'runtimeVersion': '1.14.0',
       'storageUri': 'pvc://mnist-e2e-pvc/export'}}}},
   'status': {'canary': {},
    'conditions': [{'lastTransitionTime': '2020-02-24T04:54:36Z',
      'status': 'True',
      'type': 'DefaultPredictorReady'},
     {'lastTransitionTime': '2020-02-24T04:54:36Z',
      's

### Get the InferenceService and Service Endpoint

In [14]:
mnist_isvc = kfserving_client.get(isvc_name, namespace=my_namespace)
print("MNIST Service Endpoint: " + mnist_isvc['status'].get('url', ''))

MNIST Service Endpoint: http://mnist-service-5041.hejinchi.example.com/v1/models/mnist-service-5041


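Once the endpoint is known, a prediction can be requested by posting JSON to the endpoint with `:predict` appended, following the TensorFlow Serving REST convention. The sketch below only builds the request and does not send it. `build_predict_request` is a hypothetical helper, and the input shape (one flattened 28x28 image per instance) is an assumption; the real shape depends on the serving signature exported by `mnist.py`.

```python
import json
import urllib.request


def build_predict_request(endpoint, instances):
    """Build (but do not send) a TensorFlow Serving style predict request."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        endpoint + ":predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


endpoint = "http://mnist-service-5041.hejinchi.example.com/v1/models/mnist-service-5041"
# Assumption: each instance is a flattened 28x28 image (784 floats).
request = build_predict_request(endpoint, [[0.0] * 784])
print(request.get_method(), request.full_url)
```

Sending it would be a matter of calling `urllib.request.urlopen(request)`. Depending on how the cluster ingress is set up, the host name in the URL may not resolve from outside the cluster without extra configuration.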
## Clean Up

Delete the TFJob.

In [15]:
tfjob_client.delete(tfjob_name, namespace=my_namespace)

{'kind': 'Status',
 'apiVersion': 'v1',
 'metadata': {},
 'status': 'Success',
 'details': {'name': 'mnist-training-445b',
  'group': 'kubeflow.org',
  'kind': 'tfjobs',
  'uid': '6546a875-348b-41bb-8510-04abc7dd1a58'}}

Delete the InferenceService.

In [16]:
kfserving_client.delete(isvc_name, namespace=my_namespace)

{'kind': 'Status',
 'apiVersion': 'v1',
 'metadata': {},
 'status': 'Success',
 'details': {'name': 'mnist-service-5041',
  'group': 'serving.kubeflow.org',
  'kind': 'inferenceservices',
  'uid': '7da546be-8831-4743-8631-4d0aad20844d'}}

Delete the PVC and the PV.

In [17]:
k8s_core_api.delete_namespaced_persistent_volume_claim(pvc_name, my_namespace)
k8s_core_api.delete_persistent_volume(pv_name)

{'api_version': 'v1',
 'code': None,
 'details': None,
 'kind': 'PersistentVolume',
 'message': None,
 'metadata': {'_continue': None,
              'resource_version': '5762169',
              'self_link': '/api/v1/persistentvolumes/mnist-e2e-pv'},
 'reason': None,
 'status': "{'phase': 'Bound'}"}