# End-to-end image classification workflow with distributed training
The following example demonstrates an end-to-end data science workflow for building an image classifier. <br>
The model is trained on an image dataset of cats and dogs and is then deployed as a function in a serving layer. <br>
Users can send an HTTP request with an image of a cat or a dog and get back a response that identifies which of the two it is.

This typical data science workflow comprises the following steps:
* Download and label the dataset
* Train a model on the image dataset
* Deploy a function with the new model in a serving layer
* Test the function

Key technologies:
* TensorFlow-Keras for training the model
* Horovod for running distributed training
* MLRun (an open source library for tracking experiments, https://github.com/mlrun/mlrun) for building the functions and tracking experiments
* Nuclio for creating a function that runs the model in a serving layer

This demo is based on the following:<br>
* https://github.com/tensorflow/docs/tree/master/site/en/tutorials
* https://www.kaggle.com/uysimty/keras-cnn-dog-or-cat-classification/log

In [28]:
# !mlrun clean -p -r

In [29]:
# nuclio: ignore
import nuclio

## Helper functions for downloading and labeling images
The code below defines two functions:
1. `open_archive` - gets and extracts a zip file that contains cat and dog images. Users pass the source URL and the target directory, which is stored in the Iguazio data layer.
2. `categories_map_builder` - labels the dataset based on the file names. The function creates a pandas DataFrame with the filename and category (i.e. cat or dog) of each image.

Note that after running `pip install` you sometimes need to restart the Jupyter kernel.
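
The skip-unless-`refresh` behaviour of `open_archive` can be tried in isolation. The following is a minimal sketch using only the standard library; `extract_once` and the sample archive are hypothetical stand-ins, and no MLRun context or download is involved.

```python
import os
import tempfile
import zipfile


def extract_once(archive_path, target_dir, refresh=False):
    """Extract `archive_path` into `target_dir` unless it already exists.

    Returns True when extraction ran and False when it was skipped.
    (Hypothetical helper, not part of the demo code.)
    """
    if os.path.isdir(target_dir) and not refresh:
        return False  # target already present, skip the work
    os.makedirs(target_dir, exist_ok=True)
    # the context manager closes the archive even if extraction fails
    with zipfile.ZipFile(archive_path, 'r') as zip_ref:
        zip_ref.extractall(target_dir)
    return True


# Build a tiny throwaway archive so the sketch runs without any download
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, 'sample.zip')
with zipfile.ZipFile(archive, 'w') as zf:
    zf.writestr('cats_n_dogs/cat.1.jpg', b'not a real image')
    zf.writestr('cats_n_dogs/dog.1.jpg', b'not a real image')

target = os.path.join(workdir, 'images')
first = extract_once(archive, target)                # extracts
second = extract_once(archive, target)               # skipped, target exists
third = extract_once(archive, target, refresh=True)  # forced re-extract
extracted = sorted(os.listdir(os.path.join(target, 'cats_n_dogs')))
print(first, second, third, extracted)
```

Returning a flag instead of logging keeps the sketch independent of the MLRun `context` object that the real handler receives.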

#### Function config and code

In [30]:
%nuclio config spec.image = "mlrun/ml-models:0.4.6"

%nuclio: setting spec.image to 'mlrun/ml-models:0.4.6'


In [31]:
import os
import zipfile
import json
from tempfile import mktemp
import pandas as pd

# download the image archive
def open_archive(context, 
                 target_dir='content',
                 archive_url='',
                 refresh=False):
    """Open a file/object archive into a target directory
    
    note: if `refresh` is True the content will be re-downloaded
    """
    # does the target already exist, if yes skip download
    if not os.path.isdir(target_dir) or refresh is True:
        # Define locations
        os.makedirs(target_dir, exist_ok=True)
        context.logger.info('Verified directories')

        # Extract dataset from zip
        context.logger.info('Extracting zip')
        # the context manager closes the archive even if extraction fails
        with zipfile.ZipFile(archive_url, 'r') as zip_ref:
            zip_ref.extractall(target_dir)

        context.logger.info(f'extracted archive to {target_dir}')
    context.log_artifact('content', local_path=target_dir)

# build categories 
def categories_map_builder(context,
                           source_dir,
                           df_filename='file_categories_df.csv',
                           map_filename='categories_map.json'):
    """Read labeled images from a directory and create category map + df
    
    filename format: <category>.NN.jpg"""
    
    # create filenames list (jpg only)
    filenames = [file for file in os.listdir(source_dir) if file.endswith('.jpg')]
    categories = []
        
    # Create a pandas DataFrame for the full sample
    for filename in filenames:
        category = filename.split('.')[0]
        categories.append(category)

    df = pd.DataFrame({
        'filename': filenames,
        'category': categories
    })
    df['category'] = df['category'].astype('str')
    
    categories = df.category.unique()
    categories = {i: category for i, category in enumerate(categories)}
    with open(os.path.join(context.artifact_path, map_filename), 'w') as f:
        f.write(json.dumps(categories))
        
    context.logger.info(categories)
    context.log_artifact('categories_map', local_path=map_filename)
    context.log_dataset('file_categories', df=df, local_path=df_filename)
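
The labeling rule used by `categories_map_builder` (take the part of the filename before the first dot) can be checked on a few sample names without pandas or MLRun. This is a simplified sketch; `build_category_map` and the sample filenames are hypothetical.

```python
def build_category_map(filenames):
    """Derive labels from names shaped like <category>.NN.jpg."""
    # keep jpg files only, as the original function does
    jpgs = [name for name in filenames if name.endswith('.jpg')]
    # the category is everything before the first dot
    rows = [(name, name.split('.')[0]) for name in jpgs]
    # index categories in order of first appearance
    categories = {}
    for _, category in rows:
        if category not in categories.values():
            categories[len(categories)] = category
    return rows, categories


sample = ['cat.0.jpg', 'dog.0.jpg', 'cat.1.jpg', 'notes.txt']
rows, categories = build_category_map(sample)
print(rows)
print(categories)
```

Because `os.listdir` returns names in arbitrary order, the index assigned to each category can differ between runs; sorting the file list first would make the map deterministic.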

In [32]:
# nuclio: end-code

Set the MLRun database location and the base directory

<b>Setup and imports</b>

### mlconfig

In [33]:
from mlrun import mlconf

In [34]:
mlconf.dbpath = mlconf.dbpath or './'
mlconf.dbpath

'http://mlrun-api:8080'

In [35]:
vcs_branch = 'development'
base_vcs = f'https://raw.githubusercontent.com/mlrun/functions/{vcs_branch}/'

# '{name}' stays a literal placeholder (plain string, not an f-string); `name` is not defined here
mlconf.hub_url = mlconf.hub_url or base_vcs + '{name}/function.yaml'
mlconf.hub_url

'/User/repos/functions/{name}/function.yaml'
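
The hub URL is a template with a `{name}` placeholder. The sketch below shows how such a template could be filled for a given function name, assuming that a `hub://<name>` reference is resolved by substituting the name into the template. `resolve_hub_url` is a hypothetical helper for illustration and not an MLRun API.

```python
vcs_branch = 'development'
base_vcs = f'https://raw.githubusercontent.com/mlrun/functions/{vcs_branch}/'

# plain (non-f) string so that `{name}` survives as a placeholder
hub_url_template = base_vcs + '{name}/function.yaml'


def resolve_hub_url(template, name):
    """Hypothetical helper: fill the `{name}` placeholder."""
    return template.format(name=name)


resolved = resolve_hub_url(hub_url_template, 'tf2_serving')
print(resolved)
```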

In [36]:
import os
mlconf.artifact_path = mlconf.artifact_path or f'{os.environ["V3IO_HOME"]}/artifacts'
mlconf.artifact_path

'/User/artifacts'
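
The `os.environ["V3IO_HOME"]` lookup raises a `KeyError` when the variable is missing and no artifact path is configured, for example outside an Iguazio environment. A more defensive variant might look like the sketch below; `default_artifact_path` and its fallback to the current directory are assumptions, not part of the demo.

```python
import os


def default_artifact_path(configured, env=os.environ):
    """Hypothetical helper mirroring the `value or default` pattern."""
    if configured:
        return configured
    # fall back to the current directory when V3IO_HOME is not set
    home = env.get('V3IO_HOME', os.path.abspath('./'))
    return f'{home}/artifacts'


print(default_artifact_path('', {'V3IO_HOME': '/User'}))
```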

In [37]:
from os import path

# specify paths and artifacts target location
code_dir = path.join(path.abspath('./'), 'src') # Where our source code files are saved
code_dir

'/User/repos/demos/horovod-pipe/src'

In [38]:
images_path = path.join(mlconf.artifact_path, 'images') 
images_path

'/User/artifacts/images'

In [39]:
project_name='image-classification'

### Test locally: download and extract the image archive
The dataset is taken from the `iguazio-sample-data` bucket in S3. <br>
>Note that this step is captured in the MLRun database. <br>

We create a new local function with our inline code from above.  
We then define a `NewTask` with the `open_archive` function handler and the needed parameters and run it.  

In [40]:
# download images from s3 using the local `open_archive` function
from mlrun import NewTask, run_local

open_archive_task = NewTask(name='download', 
                            handler=open_archive, 
                            params={'target_dir': images_path},
                            inputs={'archive_url': 'http://iguazio-sample-data.s3.amazonaws.com/catsndogs.zip'})


# keep the run object under its own name so the `open_archive` function is not overwritten
download_run = run_local(open_archive_task, project=project_name, artifact_path=mlconf.artifact_path)

[mlrun] 2020-04-25 23:25:17,483 starting run download uid=1b2699cc0e2b41e3a11b13dd2bcc6788  -> http://mlrun-api:8080
[mlrun] 2020-04-25 23:25:17,510 downloading http://iguazio-sample-data.s3.amazonaws.com/catsndogs.zip to local tmp
[mlrun] 2020-04-25 23:25:18,488 log artifact content at /User/artifacts/User/artifacts/images, size: None, db: Y



| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| image-classification | ...2bcc6788 | 0 | Apr 25 23:25:17 | completed | download | v3io_user=admin<br>kind=handler<br>owner=admin<br>host=jupyter-5859859b4f-hwhxd | archive_url | target_dir=/User/artifacts/images | | content |


to track results use .show() or .logs() or in CLI: 
!mlrun get run 1b2699cc0e2b41e3a11b13dd2bcc6788 --project image-classification , !mlrun logs 1b2699cc0e2b41e3a11b13dd2bcc6788 --project image-classification
[mlrun] 2020-04-25 23:25:18,518 run executed, status=completed


# Complete Data-Science Pipeline with MLRun and Kubeflow

We use MLRun to run the functions and store the experiment metadata in the MLRun database. <br>
Users can query the database to view all the experiments along with their associated metadata. <br>
The pipeline consists of the following steps:
- Get the data
- Create the categories map
- Train the Horovod model on the cluster
- Deploy the model

## Create a multi-stage project (ingest, label, train, deploy model)

Projects are used to package multiple functions, workflows, and artifacts. We usually store project code and definitions in a Git archive.

The following code creates a new project in a local directory and initializes Git tracking for it.

In [41]:
from mlrun import new_project
project_dir = './'
hvdproj = new_project(project_name, project_dir)

#### Add our `utils` function to the project
We convert our inline (notebook) code to a function object and register it under our project.

In [42]:
from mlrun import code_to_function
utils = code_to_function(kind='job', 
                         name='utils',
                         image='mlrun/ml-models:0.4.6')

In [43]:
hvdproj.set_function(utils)

<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f5b2c30c730>

#### Run/test our function on the cluster 
We define a `NewTask` using the `categories_map_builder` function handler and the needed parameters.
We add an Iguazio v3io file mount (to share the local paths with our function) and run the task remotely.

In [44]:
# Create categories map
label_task = NewTask(
    name='label', 
    handler=categories_map_builder, 
    params={'source_dir': os.path.join(images_path, 'cats_n_dogs'),
            'map_filename': 'categories_map.json'})

In [45]:
from mlrun import mount_v3io
hvdproj.func('utils').apply(mount_v3io()).run(label_task, artifact_path=mlconf.artifact_path)

[mlrun] 2020-04-25 23:25:33,856 starting run label uid=f0546ec6a2e042c3ad31c22e41b72c03  -> http://mlrun-api:8080
[mlrun] 2020-04-25 23:25:33,935 Job is running in the background, pod: label-pf5tc
[mlrun] 2020-04-25 23:25:37,837 {0: 'cat', 1: 'dog'}
[mlrun] 2020-04-25 23:25:37,852 log artifact categories_map at /User/artifacts/categories_map.json, size: None, db: Y
[mlrun] 2020-04-25 23:25:37,888 log artifact file_categories at /User/artifacts/file_categories_df.csv, size: 40689, db: Y

[mlrun] 2020-04-25 23:25:37,904 run executed, status=completed
final state: succeeded


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| image-classification | ...41b72c03 | 0 | Apr 25 23:25:37 | completed | label | host=label-pf5tc<br>kind=job<br>owner=admin<br>v3io_user=admin | | map_filename=categories_map.json<br>source_dir=/User/artifacts/images/cats_n_dogs | | categories_map<br>file_categories |


to track results use .show() or .logs() or in CLI: 
!mlrun get run f0546ec6a2e042c3ad31c22e41b72c03 --project image-classification , !mlrun logs f0546ec6a2e042c3ad31c22e41b72c03 --project image-classification
[mlrun] 2020-04-25 23:25:40,065 run executed, status=completed


<mlrun.model.RunObject at 0x7f5b2c2958e0>

### Define a new function for distributed Training with TensorFlow, Keras and Horovod

Here we use the same structure as before to deploy our **[cats vs. dogs TensorFlow model training file](horovod-training.py)** to run on the defined Horovod cluster in a distributed manner.  

We define the input parameters for the training function.  
We set the function's `kind='mpijob'` to let MLRun know to apply the job to the MPI CRD and create the requested Horovod cluster.  
We set the number of workers for the Horovod cluster by setting `trainer.spec.replicas = 4` (the default is 1 replica).  
We set the number of GPUs each worker receives by calling `trainer.gpus(1)` (the default is 0 GPUs).
> Please verify that the `HOROVOD_FILE` path is accessible from the cluster (the local path and the mounted path may differ).
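
To get a feel for what the replica count changes, the sketch below computes how many steps each worker would run per epoch. It assumes that the batch size is per worker and that the data is split evenly across workers, which is a common Horovod setup but is not confirmed by the training script shown here. `steps_per_epoch` and the dataset size are hypothetical; 256 is the batch size used in the workflow definition.

```python
import math


def steps_per_epoch(num_samples, batch_size, replicas):
    """Steps each worker runs per epoch when data is split evenly.

    Assumes `batch_size` is per worker, so one synchronized step
    consumes `batch_size * replicas` samples across the cluster.
    """
    global_batch = batch_size * replicas
    return math.ceil(num_samples / global_batch)


num_samples = 20_000  # hypothetical dataset size, not taken from the demo
single = steps_per_epoch(num_samples, batch_size=256, replicas=1)
cluster = steps_per_epoch(num_samples, batch_size=256, replicas=4)
print(single, cluster)
```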

In [46]:
from mlrun import new_function

HOROVOD_FILE = os.path.join(code_dir, 'horovod-training.py')
trainer = new_function(name='trainer',
                       kind='mpijob',
                       command=HOROVOD_FILE, 
                       image='mlrun/ml-models:0.4.6')
trainer.spec.image_pull_policy = 'Always'
trainer.spec.replicas = 4
#trainer.gpus(1)
hvdproj.set_function(trainer)

<mlrun.runtimes.mpijob.MpiRuntime at 0x7f5b5c52a850>

#### Add a serving function from the functions hub (marketplace)

In [47]:
hvdproj.set_function('hub://tf2_serving', 'serving')

<mlrun.runtimes.function.RemoteRuntime at 0x7f5b2ebeca30>

#### Register the source images directory as a project artifact (can be accessed by name)

In [48]:
hvdproj.log_artifact(
    'images', 
    target_path='http://iguazio-sample-data.s3.amazonaws.com/catsndogs.zip',
    artifact_path=mlconf.artifact_path)
#print(hvdproj.to_yaml())

[mlrun] 2020-04-25 23:25:40,132 log artifact images at http://iguazio-sample-data.s3.amazonaws.com/catsndogs.zip, size: None, db: Y


#### Define and save a pipeline 

The following workflow definition is written to a file. It describes an execution graph (DAG) and how the functions are connected to form an end-to-end pipeline.

* Download the images 
* Label the images (Cats & Dogs)
* Train the model using distributed TensorFlow (Horovod)
* Deploy the model into a serverless function 
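
The ordering of these four steps can be sketched as a small dependency graph. The example below uses the standard library `graphlib` module (Python 3.9+) rather than Kubeflow Pipelines, purely to illustrate how the dependencies between steps determine the execution order; the step names match the list above.

```python
from graphlib import TopologicalSorter

# each step maps to the set of steps it depends on
dag = {
    'download': set(),
    'label': {'download'},   # runs after the archive is extracted
    'train': {'label'},      # consumes the outputs of the label step
    'deploy': {'train'},     # consumes the trained model
}

order = list(TopologicalSorter(dag).static_order())
print(order)
```

In the workflow file, the same dependencies are expressed either explicitly with `.after(...)` or implicitly by passing one step's outputs as another step's inputs.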

In [49]:
%%writefile src/workflow.py
from kfp import dsl
from mlrun import mount_v3io

funcs = {}


def init_functions(functions: dict, project=None, secrets=None):
    '''
    This function will run before running the project.
    It allows us to add our specific system configurations to the functions
    like mounts or secrets if needed.

    In this case we will add Iguazio's user mount to our functions using the
    `mount_v3io()` function to automatically set the mount with the needed
    variables taken from the environment. 
    * mount_v3io can be replaced with mlrun.platforms.mount_pvc() for 
    non-iguazio mount

    @param functions: <function_name: function_yaml> dict of functions in the
                        workflow
    @param project: project object
    @param secrets: secrets required for the functions for s3 connections and
                    such
    '''
    for f in functions.values():
        f.apply(mount_v3io())                  # On Iguazio (Auto-mount /User)
        # f.apply(mlrun.platforms.mount_pvc()) # Non-Iguazio mount
        
    functions['serving'].set_env('MODEL_CLASS', 'TFModel')
    functions['serving'].set_env('IMAGE_HEIGHT', '128')
    functions['serving'].set_env('IMAGE_WIDTH', '128')
    functions['serving'].set_env('ENABLE_EXPLAINER', 'False')


@dsl.pipeline(
    name='Image classification demo',
    description='Train an Image Classification TF Algorithm using MLRun'
)
def kfpipeline(
        image_archive='store:///images',
        images_path='/User/artifacts/images',
        source_dir='/User/artifacts/images/cats_n_dogs',
        checkpoints_dir='/User/artifacts/checkpoints',
        model_path='/User/artifacts/models/cats_n_dogs.h5',
        model_name='cat_vs_dog_v1'):

    # step 1: download images
    open_archive = funcs['utils'].as_step(name='download',
                                          handler='open_archive',
                                          params={'target_dir': images_path},
                                          inputs={'archive_url': image_archive},
                                          outputs=['content'])

    # step 2: label images
    label = funcs['utils'].as_step(name='label',
                                   handler='categories_map_builder',
                                   params={'source_dir': source_dir},
                                   outputs=['categories_map',
                                            'file_categories']).after(open_archive)

    # step 3: train the model
    train = funcs['trainer'].as_step(name='train',
                                     params={'epochs': 2,
                                             'checkpoints_dir': checkpoints_dir,
                                             'model_path'     : model_path,
                                             'data_path'      : source_dir,
                                             'batch_size'     : 256},
                                     inputs={
                                         'categories_map': label.outputs['categories_map'],
                                         'file_categories': label.outputs['file_categories']},
                                     outputs=['model'])
    train.container.set_image_pull_policy('Always')

    # deploy the model using nuclio functions
    deploy = funcs['serving'].deploy_step(models={model_name: train.outputs['model']})


Overwriting src/workflow.py


In [50]:
hvdproj.set_workflow('main', 'src/workflow.py')

In [51]:
hvdproj.save()

<a id='run-pipeline'></a>
## Run a pipeline workflow
You can check the **[workflow.py](src/workflow.py)** file to see how function objects are initialized and used (by name) inside the workflow.
The `workflow.py` file has two parts: initializing the function objects and defining the pipeline DSL (connecting the function inputs and outputs).

> Note the pipeline can include CI steps like building container images and deploying models.



### Run
Use the `run` method to execute a workflow. You can provide alternative arguments and specify the default target for workflow artifacts.<br>
The workflow ID is returned and can be used to track the progress, or you can use the hyperlinks.

> Note: The same command can be issued through CLI commands:<br>
    `mlrun project my-proj/ -r main -p "v3io:///users/admin/mlrun/kfp/{{workflow.uid}}/"`

The `dirty` flag allows us to run a project with uncommitted changes (when the notebook is in the same Git directory, the project will always be dirty).
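
The `{{workflow.uid}}` part of the artifact path is a placeholder that is filled in when the workflow runs, so each run gets its own artifact directory. The sketch below only illustrates the idea: `render` is a hypothetical stand-in for the substitution done by the workflow engine, and the uid value is made up.

```python
base = '/User/artifacts'

# in a plain (non-f) string the double braces are kept literally
template = base + '/{{workflow.uid}}'


def render(template, uid):
    """Hypothetical stand-in for the engine's placeholder substitution."""
    return template.replace('{{workflow.uid}}', uid)


rendered = render(template, 'abc123')  # 'abc123' is a made-up uid
print(template)
print(rendered)
```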

In [54]:
# the '{{workflow.uid}}' placeholder gives every workflow run its own artifact directory
artifact_path = mlconf.artifact_path + '/{{workflow.uid}}'
run_id = hvdproj.run(
    'main',
    arguments={}, 
    artifact_path=artifact_path, 
    dirty=True)

TypeError: __init__() got an unexpected keyword argument 'resource_references'

In [53]:
from mlrun import get_run_db
db = get_run_db().connect()
db.list_runs(project=hvdproj.name, labels=f'workflow={run_id}').show()

NameError: name 'run_id' is not defined

## Test the serving function

After the function has been deployed we can test it as a regular REST Endpoint using `requests`.

In [None]:
import requests
from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt

### Define test params

In [None]:
# Testing event
cat_image_url = 'https://s3.amazonaws.com/iguazio-sample-data/images/catanddog/cat.102.jpg'
response = requests.get(cat_image_url)
cat_image = response.content
img = Image.open(BytesIO(cat_image))

print('Test image:')
plt.imshow(img)

### Test The Serving Function (with Image URL)

In [None]:
addr = 'http://tf2-images-server:8080' 
model_name = 'cat_vs_dog_v1'  # must match the `model_name` deployed by the pipeline (its default is 'cat_vs_dog_v1')
headers = {'Content-type': 'text/plain'}
response = requests.post(url=f'{addr}/{model_name}/predict', 
                         data=json.dumps({'data_url': cat_image_url}), 
                         headers=headers)
print(response.content.decode('utf-8'))

### Test The Serving Function (with JPEG Image)

In [None]:
headers = {'Content-type': 'image/jpeg'}
response = requests.post(url=f'{addr}/{model_name}/predict', 
                         data=cat_image, 
                         headers=headers)
print(response.content.decode('utf-8'))
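
The two request shapes used above differ only in body and content type. The sketch below captures that in one helper, using the standard library `urllib` so that the request is built but not sent. `build_predict_request` is a hypothetical helper, the example image URL is a placeholder, and the model name is the pipeline's default.

```python
import json
from urllib.request import Request


def build_predict_request(addr, model_name, payload):
    """Hypothetical helper: build (but do not send) a predict request.

    `payload` is either an image URL (str) or raw JPEG bytes.
    """
    url = f'{addr}/{model_name}/predict'
    if isinstance(payload, bytes):
        body, content_type = payload, 'image/jpeg'
    else:
        body = json.dumps({'data_url': payload}).encode('utf-8')
        content_type = 'text/plain'
    return Request(url, data=body,
                   headers={'Content-type': content_type}, method='POST')


addr = 'http://tf2-images-server:8080'
by_url = build_predict_request(addr, 'cat_vs_dog_v1',
                               'https://example.com/cat.jpg')
by_bytes = build_predict_request(addr, 'cat_vs_dog_v1', b'\xff\xd8\xff')
print(by_url.full_url, by_url.get_header('Content-type'))
print(by_bytes.full_url, by_bytes.get_header('Content-type'))
```

Either request can then be sent with `urllib.request.urlopen(request, timeout=10)` once the serving function is reachable at `addr`.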

**[back to top](#top)**