# MNIST E2E on Kubeflow for On-Premises Cluster

This example guides you through:
  1. Taking an example TensorFlow model and modifying it to support distributed training.
  1. Using Kubeflow Fairing to build a Docker image and launch a TFJob to train the model.
  1. Using Kubeflow Fairing to create an InferenceService (KFServing) to deploy the trained model.
  1. Cleaning up the TFJob and InferenceService using the `kubeflow-tfjob` and `kfserving` SDK clients.
  
## Requirements

  * Kubeflow 0.7 or higher is installed.

## Prepare model

The existing distributed MNIST examples need a few changes to run well as a TFJob.

Specifically, we must:

1. Add command-line options to make the model configurable.
1. Use `tf.estimator.train_and_evaluate` to enable model exporting and serving.
1. Define serving signatures for model serving.

The resulting model is [model.py](model.py).
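
As a rough illustration of the first step, the sketch below parses the flags that the TFJob command in this example passes to the training script. The flag names come from that command; the types, defaults, and help texts are assumptions chosen to mirror the values used in this example, and the real `model.py` may define them differently.

```python
import argparse


def parse_arguments(argv=None):
    """Parse the training flags passed to the model script (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="MNIST training options")
    parser.add_argument("--tf-model-dir", type=str, default="/mnt",
                        help="Directory for checkpoints and summaries")
    parser.add_argument("--tf-export-dir", type=str, default="/mnt/export",
                        help="Directory for the exported model")
    parser.add_argument("--tf-train-steps", type=int, default=200,
                        help="Number of training steps")
    parser.add_argument("--tf-batch-size", type=int, default=100,
                        help="Training batch size")
    parser.add_argument("--tf-learning-rate", type=float, default=0.01,
                        help="Optimizer learning rate")
    # parse_known_args ignores any flags this sketch does not define.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_arguments(["--tf-train-steps=500", "--tf-learning-rate=0.05"])
print(args.tf_train_steps, args.tf_learning_rate, args.tf_export_dir)
```

Exposing every hyperparameter as a flag means the same image can be reused for different runs by changing only the container command.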

## Install Required Libraries

Install the libraries required to train this model.

In [1]:
# Quote each requirement so the shell does not treat ">" as an output redirect.
!pip install "kubeflow-fairing>=0.7.1"
!pip install "kubeflow-tfjob>=0.1.3"

In [2]:
from importlib import reload
# Force a reload of kubeflow; since kubeflow is a multi namespace module
# it looks like doing this in notebook_setup may not be sufficient
import kubeflow
reload(kubeflow)

<module 'kubeflow' from '/opt/python/python36/lib/python3.6/site-packages/kubeflow/__init__.py'>

## Configure The Docker Registry For Kubeflow Fairing

* To build Docker images from your notebook, you need a Docker registry where the images will be stored.

In [3]:
# Set the Docker registry where the image will be stored.
# Ensure you have permission to push images to this registry.
DOCKER_REGISTRY = 'index.docker.io/jinchi'

# Set the namespace. Note that the PVC created below must be in this namespace.
my_namespace = 'hejinchi'
# You can also get the default target namespace using the API below.
#namespace = fairing_utils.get_default_target_namespace()

## Create PV/PVC to Store The Exported Model 

Refer to the [document](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) to create a Persistent Volume (PV) and a Persistent Volume Claim (PVC). The PVC will be used by the training and serving pods for local mode in the steps below.

In [4]:
# To support distributed training, the PVC must be accessible from all nodes in the cluster.
# This example creates an NFS PV to satisfy that requirement.
# The PVC is assumed to be named 'mnist-pvc' in this example.
nfs_server = '172.16.189.69'
nfs_path = '/opt/kubeflow/data/mnist'
pvc_name = 'mnist-pvc'

pvc_pvc = f'''
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kubeflow-mnist
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    path: {nfs_path}
    server: {nfs_server}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {pvc_name}
  namespace: {my_namespace}
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 10Gi
'''

You can write the spec to a YAML file and then run `kubectl apply -f {FILE}`.
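
A minimal sketch of that step follows. The helper names `write_spec` and `kubectl_apply_command` are hypothetical, and the short spec is only a placeholder; in the notebook you would pass the PV/PVC string defined in the cell above. The sketch builds the `kubectl` command without running it, since running it requires `kubectl` and cluster access.

```python
import os
import tempfile


def write_spec(spec, directory=None):
    """Write a Kubernetes YAML spec to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".yaml", dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(spec.strip() + "\n")
    return path


def kubectl_apply_command(path):
    """Build (but do not run) the kubectl command that applies the spec file."""
    return ["kubectl", "apply", "-f", path]


# Placeholder spec for illustration only.
example_spec = """
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mnist-pvc
"""
spec_path = write_spec(example_spec)
print(" ".join(kubectl_apply_command(spec_path)))

# To apply it for real (requires kubectl and cluster access):
# import subprocess
# subprocess.run(kubectl_apply_command(spec_path), check=True)
```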

## Use Kubeflow Fairing to build the docker image and launch a TFJob for training

* Use Kubeflow Fairing to build a Docker image that includes all your dependencies.
* Launch a TFJob in the on-premises cluster to train the model.

First, set some custom training parameters for the TFJob.

In [5]:
num_ps = 1  # number of parameter server (PS) replicas in the TFJob
num_workers = 2  # number of worker replicas in the TFJob
model_dir = "/mnt"
export_path = "/mnt/export" 
train_steps = "200"
batch_size = "100"
learning_rate = "0.01"

Then use Kubeflow Fairing to build a Docker image, push it to the Docker registry, and launch a TFJob in the on-premises cluster to train the model.

In [6]:
from kubeflow import fairing   
from kubeflow.fairing.kubernetes.utils import mounting_pvc

output_map =  {
    "Dockerfile.model": "Dockerfile",
    "model.py": "model.py"
}

command=["python",
         "/opt/model.py",
         "--tf-model-dir=" + model_dir,
         "--tf-export-dir=" + export_path,
         "--tf-train-steps=" + train_steps,
         "--tf-batch-size=" + batch_size,
         "--tf-learning-rate=" + learning_rate]

fairing.config.set_preprocessor('python', command=command, path_prefix="/app", output_map=output_map)
fairing.config.set_builder(name='docker', registry=DOCKER_REGISTRY, base_image="",
                           image_name="mnist", dockerfile_path="Dockerfile")
fairing.config.set_deployer(name='tfjob', namespace=my_namespace, stream_log=False,
                            worker_count=num_workers, ps_count=num_ps,
                            pod_spec_mutators = [mounting_pvc(pvc_name=pvc_name, pvc_mount_path='/mnt')])
fairing.config.run()

[I 200220 02:39:49 config:125] Using preprocessor: <kubeflow.fairing.preprocessors.base.BasePreProcessor object at 0x7f0cd82cf4a8>
[I 200220 02:39:49 config:127] Using builder: <kubeflow.fairing.builders.docker.docker.DockerBuilder object at 0x7f0cd82cf4e0>
[I 200220 02:39:49 config:129] Using deployer: <kubeflow.fairing.deployers.tfjob.tfjob.TfJob object at 0x7f0cd82cf518>
[I 200220 02:39:49 docker:32] Building image using docker
[W 200220 02:39:49 docker:41] Docker command: ['python', '/opt/model.py', '--tf-model-dir=/mnt', '--tf-export-dir=/mnt/export', '--tf-train-steps=200', '--tf-batch-size=100', '--tf-learning-rate=0.01']
[I 200220 02:39:49 base:107] Creating docker context: /tmp/fairing_context_xom0160i
[W 200220 02:39:49 base:94] Dockerfile already exists in Fairing context, skipping...
[W 200220 02:39:49 docker:56] Building docker image index.docker.io/jinchi/mnist:6754BE8A...
[I 200220 02:39:49 docker:103] Build output: Step 1/5 : FROM tensorflow/tensorflow:1.15.2-py3
[I 200

(<kubeflow.fairing.preprocessors.base.BasePreProcessor at 0x7f0cd82cf4a8>,
 <kubeflow.fairing.builders.docker.docker.DockerBuilder at 0x7f0cd82cf4e0>,
 <kubeflow.fairing.deployers.tfjob.tfjob.TfJob at 0x7f0cd82cf518>)

### Get The Created TFJobs

In [7]:
from kubeflow.tfjob import TFJobClient
tfjob_client = TFJobClient()

# TBD (@jinchihe) The code below for getting the TFJob name will be removed once this issue is fixed:
# https://github.com/kubeflow/fairing/issues/462
tfjobs = tfjob_client.get(namespace=my_namespace)
tfjob_name = tfjobs['items'][-1]['metadata'].get('name', '')

tfjob_client.get(tfjob_name, namespace=my_namespace)

{'apiVersion': 'kubeflow.org/v1',
 'kind': 'TFJob',
 'metadata': {'creationTimestamp': '2020-02-20T06:58:25Z',
  'generateName': 'fairing-tfjob-',
  'generation': 1,
  'labels': {'fairing-deployer': 'tfjob',
   'fairing-id': '643f4a6a-53ae-11ea-8d92-00163e01bd45'},
  'name': 'fairing-tfjob-zzxp2',
  'namespace': 'hejinchi',
  'resourceVersion': '3467592',
  'selfLink': '/apis/kubeflow.org/v1/namespaces/hejinchi/tfjobs/fairing-tfjob-zzxp2',
  'uid': 'b82aeb27-c26d-4fdd-bde2-541589156db3'},
 'spec': {'tfReplicaSpecs': {'Chief': {'replicas': 1,
    'template': {'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'},
      'labels': {'fairing-deployer': 'tfjob',
       'fairing-id': '643f4a6a-53ae-11ea-8d92-00163e01bd45'},
      'name': 'fairing-deployer'},
     'spec': {'containers': [{'command': ['python',
         '/opt/model.py',
         '--tf-model-dir=/mnt',
         '--tf-export-dir=/mnt/export',
         '--tf-train-steps=200',
         '--tf-batch-size=100',
         '-
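
The lookup above takes the last item of the list response, which assumes the most recent TFJob is returned last. Kubernetes list responses are not guaranteed to be ordered by creation time, so a slightly more defensive variant is to pick the item with the newest `creationTimestamp`. The helper `latest_resource_name` and the sample job names below are hypothetical; only the `items`, `metadata`, `name`, and `creationTimestamp` fields are taken from the output shown above.

```python
def latest_resource_name(resource_list):
    """Return the name of the most recently created item in a list response."""
    items = resource_list.get("items", [])
    if not items:
        raise ValueError("no resources found in the namespace")
    # creationTimestamp is an RFC 3339 UTC string (e.g. 2020-02-20T06:58:25Z),
    # so lexicographic order matches chronological order.
    newest = max(items, key=lambda item: item["metadata"]["creationTimestamp"])
    return newest["metadata"]["name"]


# Hand-written sample shaped like the list returned by tfjob_client.get().
sample = {"items": [
    {"metadata": {"name": "fairing-tfjob-bbbbb",
                  "creationTimestamp": "2020-02-20T06:58:25Z"}},
    {"metadata": {"name": "fairing-tfjob-aaaaa",
                  "creationTimestamp": "2020-02-19T11:00:00Z"}},
]}
print(latest_resource_name(sample))
```

In the notebook this could be called as `latest_resource_name(tfjob_client.get(namespace=my_namespace))`. The same helper works for the InferenceService list, because both responses share the `items` and `metadata` layout.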

### Wait For the Training Job to Finish

In [8]:
tfjob_client.wait_for_job(tfjob_name, namespace=my_namespace, watch=True)

NAME                           STATE                TIME                          
fairing-tfjob-zzxp2            Succeeded            2020-02-20T06:58:36Z          


### Check if the TFJob Succeeded

In [9]:
tfjob_client.is_job_succeeded(tfjob_name, namespace=my_namespace)

True

### Get the Training Logs

In [10]:
tfjob_client.get_logs(tfjob_name, namespace=my_namespace)

[I 200220 02:40:05 tf_job_client:386] The logs of Pod fairing-tfjob-zzxp2-chief-0:
    
    
    W0220 06:58:30.525008 139884460164928 module_wrapper.py:139] From /opt/model.py:153: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
    
    
    W0220 06:58:30.525343 139884460164928 module_wrapper.py:139] From /opt/model.py:153: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
    
    
    W0220 06:58:30.526986 139884460164928 module_wrapper.py:139] From /opt/model.py:158: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
    
    INFO:tensorflow:TF_CONFIG {"cluster":{"chief":["fairing-tfjob-zzxp2-chief-0.hejinchi.svc:2222"],"ps":["fairing-tfjob-zzxp2-ps-0.hejinchi.svc:2222"],"worker":["fairing-tfjob-zzxp2-worker-0.hejinchi.svc:2222","fairing-tfjob-zzxp2-worker-1.hejinchi.svc:2222"]},"task":{"type":"chief","index":0},"environment":"cloud"}
    I0220 06:58:30.5
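
The `TF_CONFIG` line in the log shows how each replica learns about the cluster: the TFJob operator sets this environment variable in every pod, listing all replicas and the role of the current one. The sketch below summarizes such a value. The helper name `describe_tf_config` is hypothetical, and the sample is a shortened copy of the value logged by the chief above.

```python
import json
import os


def describe_tf_config(raw=None):
    """Summarize the cluster layout and this replica's role from TF_CONFIG."""
    if raw is None:
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return {
        "replicas": {role: len(hosts) for role, hosts in cluster.items()},
        "role": task.get("type"),
        "index": task.get("index"),
    }


# Shortened copy of the TF_CONFIG value from the chief's log.
sample = json.dumps({
    "cluster": {
        "chief": ["fairing-tfjob-zzxp2-chief-0.hejinchi.svc:2222"],
        "ps": ["fairing-tfjob-zzxp2-ps-0.hejinchi.svc:2222"],
        "worker": ["fairing-tfjob-zzxp2-worker-0.hejinchi.svc:2222",
                   "fairing-tfjob-zzxp2-worker-1.hejinchi.svc:2222"],
    },
    "task": {"type": "chief", "index": 0},
})
print(describe_tf_config(sample))
```

The replica counts match the deployer settings used earlier: one PS, two workers, plus the chief.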

## Deploy Service using KFServing

In [11]:
from kubeflow.fairing.deployers.kfserving.kfserving import KFServing
isvc = KFServing('tensorflow', default_storage_uri='pvc://' + pvc_name + '/export', namespace=my_namespace)
isvc.deploy(isvc.generate_isvc())

NAME                 READY      DEFAULT_TRAFFIC CANARY_TRAFFIC  URL                                               
fairing-kfserving... Unknown                                                                                      
fairing-kfserving... False                                                                                        
fairing-kfserving... False                                                                                        
fairing-kfserving... False                                                                                        
fairing-kfserving... True       100                             http://fairing-kfserving-wx78s.hejinchi.example...


[I 200220 02:40:27 kfserving:114] Deployed the InferenceService fairing-kfserving-wx78s successfully.


'fairing-kfserving-wx78s'

### Get the InferenceService

In [13]:
from kfserving import KFServingClient
kfserving_client = KFServingClient()
kfserving_client.get(namespace=my_namespace)

{'apiVersion': 'serving.kubeflow.org/v1alpha2',
 'items': [{'apiVersion': 'serving.kubeflow.org/v1alpha2',
   'kind': 'InferenceService',
   'metadata': {'creationTimestamp': '2020-02-20T10:40:10Z',
    'generateName': 'fairing-kfserving-',
    'generation': 5,
    'name': 'fairing-kfserving-wx78s',
    'namespace': 'hejinchi',
    'resourceVersion': '3559436',
    'selfLink': '/apis/serving.kubeflow.org/v1alpha2/namespaces/hejinchi/inferenceservices/fairing-kfserving-wx78s',
    'uid': 'd2876c95-338d-4d10-aff8-dcbb8772838e'},
   'spec': {'default': {'predictor': {'tensorflow': {'resources': {'limits': {'cpu': '1',
         'memory': '2Gi'},
        'requests': {'cpu': '1', 'memory': '2Gi'}},
       'runtimeVersion': '1.14.0',
       'storageUri': 'pvc://mnist-pvc/export'}}}},
   'status': {'canary': {},
    'conditions': [{'lastTransitionTime': '2020-02-20T10:40:27Z',
      'status': 'True',
      'type': 'DefaultPredictorReady'},
     {'lastTransitionTime': '2020-02-20T10:40:27Z',
  

### Get the InferenceService and Service Endpoint

In [14]:
# TBD (@jinchihe) The code below for getting the InferenceService name will be removed once this issue is fixed:
# https://github.com/kubeflow/fairing/issues/463
created_isvc = kfserving_client.get(namespace=my_namespace)
service_name = created_isvc['items'][-1]['metadata'].get('name', '')

mnist_isvc = kfserving_client.get(service_name, namespace=my_namespace)
print("Mnist Service Endpoint is: " + mnist_isvc['status'].get('url', ''))

Mnist Service Endpoint is: http://fairing-kfserving-wx78s.hejinchi.example.com/v1/models/fairing-kfserving-wx78s
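
Once the endpoint is known, a prediction can be requested through the TensorFlow Serving REST API by appending `:predict` to the model URL and posting a JSON body with an `instances` list. The sketch below only builds the request. The helper name `build_predict_request`, the optional `Host` header override, and the 28x28 all-zero input are assumptions for illustration; the actual input shape depends on the serving signature defined in `model.py`.

```python
import json
import urllib.request


def build_predict_request(endpoint, instances, host_header=None):
    """Build a TensorFlow Serving REST predict request for a model endpoint."""
    url = endpoint.rstrip("/") + ":predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if host_header:
        # Useful when calling through an ingress gateway address
        # instead of resolving the service hostname via DNS.
        headers["Host"] = host_header
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


endpoint = ("http://fairing-kfserving-wx78s.hejinchi.example.com"
            "/v1/models/fairing-kfserving-wx78s")
blank_image = [[0.0] * 28 for _ in range(28)]  # assumed input shape
request = build_predict_request(endpoint, [blank_image])
print(request.get_method(), request.full_url)

# To send it (requires network access to the cluster):
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```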


## Clean Up

Delete the TFJob.

In [15]:
tfjob_client.delete(tfjob_name, namespace=my_namespace)

{'kind': 'Status',
 'apiVersion': 'v1',
 'metadata': {},
 'status': 'Success',
 'details': {'name': 'fairing-tfjob-zzxp2',
  'group': 'kubeflow.org',
  'kind': 'tfjobs',
  'uid': 'b82aeb27-c26d-4fdd-bde2-541589156db3'}}

Delete the InferenceService.

In [16]:
kfserving_client.delete(service_name, namespace=my_namespace)

{'kind': 'Status',
 'apiVersion': 'v1',
 'metadata': {},
 'status': 'Success',
 'details': {'name': 'fairing-kfserving-wx78s',
  'group': 'serving.kubeflow.org',
  'kind': 'inferenceservices',
  'uid': 'd2876c95-338d-4d10-aff8-dcbb8772838e'}}