# Mask Detection Demo - Training and Evaluation (1 / 3)

The following example demonstrates a training workflow that builds and trains a model to classify whether a person is wearing a mask. The training is auto-logged to both TensorBoard and MLRun, and can easily be distributed using Horovod. After training, we run an evaluation to check the model's performance, updating the logging as part of a routine test.

1. [Setup the Project](#section_1)
2. [Download the Data](#section_2)
3. [Write the Training and Evaluation Code](#section_3)
4. [Create the MLRun Function](#section_4)
5. [Run Training and Evaluation](#section_5)
6. [Run Distributed Training Using Horovod](#section_6)

Before we continue, we need to install MLRun and the framework of choice (comment and uncomment the framework you wish to use):

In [None]:
!pip install typing-extensions

########## For TF.Keras: ##########
!pip install tensorflow~=2.9.0

########## For PyTorch: ##########
# !pip install torch~=1.13
# !pip install torchvision~=0.14

<a id="section_1"></a>
## 1. Setup the Project

Create a project using `mlrun.get_or_create_project` (it loads the project if it already exists), which also creates the paths where we'll store the project's artifacts:

In [None]:
import mlrun
import os

# Set our project's name:
project_name = "mask-detection"

# Create the project:
project = mlrun.get_or_create_project(name=project_name, context="./", user_project=True)
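
Because `user_project=True` is passed, the user name is added to the project name, which is why the run outputs later in this notebook list the project as `mask-detection-iguazio`. A minimal sketch of that naming convention, assuming the result is simply `<name>-<username>`; `user_project_name` is a hypothetical helper for illustration and is not an MLRun function:

```python
import getpass
import os


def user_project_name(name, username=None):
    """Illustrative only: mimic the '<name>-<username>' naming seen in the run outputs."""
    # Assumption: fall back to the V3IO_USERNAME environment variable, then to the OS user.
    username = username or os.environ.get("V3IO_USERNAME") or getpass.getuser()
    return f"{name}-{username}"


print(user_project_name("mask-detection", username="iguazio"))
```

Each user therefore gets a separate project, so several people can run the demo on the same cluster without overwriting each other's artifacts.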

A project in MLRun is based on the MLRun Functions it can run. In this notebook we will see two ways to create an MLRun Function:
* `mlrun.code_to_function`: Create our own MLRun Function from code (will be used for training and evaluation in [section 4](#section_4)).
* `mlrun.import_function`: Import from [MLRun's functions marketplace](https://docs.mlrun.org/en/latest/load-from-marketplace.html) - a functions hub intended to be a centralized location for open source contributions of function components (will be used for downloading the data in [section 2](#section_2)).
 
Before we continue, **please select the desired framework** (comment and uncomment the below lines as needed):

In [3]:
framework = "tf-keras"
# framework = "pytorch"

# If you wish to train on a GPU, set this variable to 'True', otherwise 'False':
use_gpu = False

### 1.1 Build project image
Building the image to satisfy the requirements of our project, according to the selection above.

In [None]:
if framework=='tf-keras':
    commands = ['pip install tensorflow~=2.9.0',
                'pip install horovod==0.25.0']
    builder_env = {'HOROVOD_WITH_MPI':'1', 'HOROVOD_WITH_TENSORFLOW':'1'}
else:
    commands = ['pip install torch==1.13.0+cpu torchvision==0.14.0+cpu -f https://download.pytorch.org/whl/torch_stable.html',
                'pip install tensorboard==2.5.0',
                'pip install horovod==0.25.0']
    builder_env={'HOROVOD_WITH_MPI':'1',
                 'HOROVOD_WITH_PYTORCH': '1'}

if not use_gpu: # install horovod dependency (already installed in mlrun/mlrun-gpu image)
    commands = ['apt update -qqq --fix-missing \
            && apt upgrade -y \
            && apt install -y \
            build-essential \
            cmake \
            gcc \
            && apt clean \
            && rm -rf /var/lib/apt/lists/*',
            ] + commands
    builder_env['HOROVOD_WITHOUT_GLOO'] = '1'
    
img = project.build_image(image=f'.mask-detection-{framework}',
                          base_image='mlrun/mlrun-gpu' if use_gpu else 'mlrun/mlrun',
                          commands=commands,
                          builder_env=builder_env,
                          overwrite_build_params=True)
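
The selection logic in the cell above can also be written as a pure function, which makes it easy to preview what will be built for each combination before calling `project.build_image`. This is only a sketch: `select_build_settings` is a hypothetical helper, the commands and environment variables are copied from the cell above, and the `ValueError` for unknown frameworks is an addition that the original cell does not have.

```python
def select_build_settings(framework, use_gpu):
    """Return (base_image, commands, builder_env) for the chosen framework."""
    if framework not in ("tf-keras", "pytorch"):
        raise ValueError(f"Unsupported framework: {framework!r}")

    if framework == "tf-keras":
        commands = ["pip install tensorflow~=2.9.0", "pip install horovod==0.25.0"]
        builder_env = {"HOROVOD_WITH_MPI": "1", "HOROVOD_WITH_TENSORFLOW": "1"}
    else:
        commands = [
            "pip install torch==1.13.0+cpu torchvision==0.14.0+cpu "
            "-f https://download.pytorch.org/whl/torch_stable.html",
            "pip install tensorboard==2.5.0",
            "pip install horovod==0.25.0",
        ]
        builder_env = {"HOROVOD_WITH_MPI": "1", "HOROVOD_WITH_PYTORCH": "1"}

    if not use_gpu:
        # CPU image: the build tools Horovod depends on are installed first.
        commands = [
            "apt update -qqq --fix-missing && apt upgrade -y "
            "&& apt install -y build-essential cmake gcc "
            "&& apt clean && rm -rf /var/lib/apt/lists/*"
        ] + commands
        builder_env["HOROVOD_WITHOUT_GLOO"] = "1"

    base_image = "mlrun/mlrun-gpu" if use_gpu else "mlrun/mlrun"
    return base_image, commands, builder_env


# Preview the settings without touching the variables used by the build cell:
preview_image, preview_commands, preview_env = select_build_settings("tf-keras", use_gpu=False)
print(preview_image)
print(len(preview_commands), sorted(preview_env))
```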

<a id="section_2"></a>
## 2. Download the Data

### 2.1. Import a Function

We will download the images using `open_archive` - a function from MLRun's functions marketplace. We will import the function using `mlrun.import_function` and describe it to get the function's documentation:

In [5]:
# Import the function:
open_archive_function = mlrun.import_function("hub://open_archive")

# Print the function's documentation:
open_archive_function.doc()

function: open-archive
Open a file/object archive into a target directory
default handler: open_archive
entry points:
  open_archive: Open a file/object archive into a target directory
Currently supports zip and tar.gz
    context(MLClientCtx)  - function execution context, default=
    archive_url(DataItem)  - url of archive file , default=
    subdir(str)  - path within artifact store where extracted files are stored, default=content/
    key(str)  - key of archive contents in artifact store, default=content
    target_path(str)  - file system path to store extracted files, default=None


### 2.2. Run the Function - Download the Images

* **Function handlers**: We'll download the images by running the function with the `open_archive` handler, as noted in the function's documentation. An MLRun function is a collection of code, and its handlers are the Python functions defined inside it. Every function that accepts a context (type: `mlrun.MLClientCtx`) can be used as a handler.
* **Passing parameters**: An MLRun function accepts two kinds of arguments: inputs (type: `mlrun.DataItem`) and parameters. The function's documentation shows that `archive_url` is an `mlrun.DataItem`, so it is passed in the `inputs` argument of the `run` method. The others are passed via the `params` argument.
* We will pass the `local` argument as `True`, which runs the function locally and not on a pod. Using `local` is a convenient way to debug the code.

For more information regarding MLRun functions, context and data items, refer to [MLRun's documentation](https://docs.mlrun.org/en/latest/index.html).
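
To make the split between `inputs` and `params` concrete, here is a small sketch that sorts arguments according to the types listed in the `open_archive` documentation above. `split_run_arguments` and the example URL are hypothetical and only illustrate the rule; MLRun does not provide this helper.

```python
# Argument types of the 'open_archive' handler, copied from the doc() output above.
OPEN_ARCHIVE_ARGS = {
    "archive_url": "DataItem",
    "subdir": "str",
    "key": "str",
    "target_path": "str",
}


def split_run_arguments(arg_types, values):
    """Split values into (inputs, params): DataItem arguments are inputs, the rest are params."""
    inputs, params = {}, {}
    for name, value in values.items():
        if name not in arg_types:
            raise KeyError(f"Unknown handler argument: {name!r}")
        if arg_types[name] == "DataItem":
            inputs[name] = value
        else:
            params[name] = value
    return inputs, params


inputs, params = split_run_arguments(
    OPEN_ARCHIVE_ARGS,
    {"archive_url": "https://example.com/archive.zip", "target_path": "/tmp/data/"},
)
print(inputs)
print(params)
```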

In [6]:
# Setup the archive url for downloading the dataset images:
archive_url = mlrun.get_sample_path("data/prajnasb-generated-mask-detection/prajnasb_generated_mask_detection.zip")

# Set the path to download the images data to.
# The environment variable `PWD` should point to a valid, accessible path, e.g. `/v3io/projects/mask-detection/`;
# if it is not set, fall back to the current working directory:
dataset_path = os.environ.get('PWD', os.getcwd()) + '/data/'

# Run the function using the 'open_archive' handler:
open_archive_run = open_archive_function.run(
    name='download_data',
    handler='open_archive',
    inputs={'archive_url': archive_url},
    params={'target_path': dataset_path},
    local=True
)

> 2023-09-05 10:21:14,209 [info] Storing function: {'name': 'download-data', 'uid': 'af37dcded51a455ea43c8a2ef0941936', 'db': 'http://mlrun-api:8080'}


Names with underscore '_' are about to be deprecated, use dashes '-' instead. Replacing underscores with dashes.


> 2023-09-05 10:21:14,387 [info] downloading https://s3.wasabisys.com/iguazio/data/prajnasb-generated-mask-detection/prajnasb_generated_mask_detection.zip to local temp file
> 2023-09-05 10:21:30,074 [info] Logging artifact to /User/test/demos/mask-detection/data/


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mask-detection-iguazio | ...f0941936 | 0 | Sep 05 10:21:14 | completed | download-data | v3io_user=iguazio<br>kind=<br>owner=iguazio<br>host=jupyter-6f96d6d666-q5zdg | archive_url | target_path=/User/test/demos/mask-detection/data/ |  | content |





> 2023-09-05 10:21:30,271 [info] Run execution finished: {'status': 'completed', 'name': 'download-data'}


<a id="section_3"></a>
## 3. Write the Training and Evaluation Code

### TF.Keras

The code is taken from the Python file [training-and-evaluation.py](tf-keras/training-and-evaluation.py) and follows a classic, straightforward flow. We:
1. Use `_get_datasets` to get the training and validation datasets (on evaluation - the evaluation dataset).
2. Use `_get_model` to build our classifier - simple transfer learning from MobileNetV2 (`keras.applications`).
3. Call `train` to train the model.
4. Call `evaluate` to evaluate the model.

Taking this code one step further is **MLRun**'s framework for `tf.keras`:

```python
import mlrun.frameworks.tf_keras as mlrun_tf_keras

# Apply MLRun's interface for tf.keras (further optional arguments are omitted here):
mlrun_tf_keras.apply_mlrun(model=model, context=context)
```

With just one line of code, it seamlessly provides **automatic logging** (for both MLRun and TensorBoard) and **distributed training** by wrapping the `fit` and `evaluate` methods of `tf.keras.Model`.

In addition, `apply_mlrun` returns a `TFKerasModelHandler`. This class supports loading, saving and logging `tf.keras` models with ease, enabling easy versioning of the model and its results, artifacts and custom objects. Note that by default, the model is loaded and logged automatically by `apply_mlrun`.

### PyTorch

The code is taken from the Python file [training-and-evaluation.py](pytorch/training-and-evaluation.py) and follows a classic, straightforward flow. We:
1. Use `_get_datasets` to get the training and validation datasets (on evaluation - the evaluation dataset). The function initializes a `MaskDetectionDataset` to handle our images.
2. Initialize our `MaskDetector` classifier class - a simple transfer learning from MobileNetV2 (`torchvision.models`).
3. Call `train` to train the model.
4. Call `evaluate` to evaluate the model.

Taking this code one step further is **MLRun**'s framework for `torch`:

```python
import mlrun.frameworks.pytorch as mlrun_torch
```

`mlrun_torch` provides what we call "shortcut functions" for using PyTorch with ease:
* `train` - Training a model.
* `evaluate` - Evaluating a model.

Both functions enable **automatic logging** (for both MLRun and TensorBoard) and **distributed training** by simply passing the following parameters: `auto_log: bool` and `use_horovod: bool`.

In addition, you can choose to use our classes directly:
* `PyTorchMLRunInterface` - the interface for training, evaluating and predicting with a PyTorch model. Our code is highly generic and should fit any type of model.
* `CallbackHandler` - if you wish to use your own training code, use this callback mechanism to get automatic logging.
* `PyTorchModelHandler` - supports loading, saving and logging `torch` models with ease, enabling easy versioning of the model and its results, artifacts and custom objects.
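
The callback idea behind automatic logging can be illustrated without any framework. The sketch below is a generic version of the pattern and not MLRun's `CallbackHandler` API; every class and function name in it is hypothetical, and `fake_epoch` stands in for real training.

```python
class Callback:
    """Base class: hooks do nothing unless a subclass overrides them."""

    def on_epoch_end(self, epoch, metrics):
        pass


class MetricsLogger(Callback):
    """Collects the metrics reported at the end of every epoch."""

    def __init__(self):
        self.history = []

    def on_epoch_end(self, epoch, metrics):
        self.history.append({"epoch": epoch, **metrics})


def run_training(epochs, train_one_epoch, callbacks):
    """A user-owned training loop that notifies all callbacks after each epoch."""
    for epoch in range(1, epochs + 1):
        metrics = train_one_epoch(epoch)
        for callback in callbacks:
            callback.on_epoch_end(epoch, metrics)


def fake_epoch(epoch):
    # Stand-in for real training: the loss shrinks as the epochs progress.
    return {"loss": 1.0 / epoch}


logger = MetricsLogger()
run_training(epochs=3, train_one_epoch=fake_epoch, callbacks=[logger])
print(logger.history)
```

The training loop stays unaware of what the callbacks do with the metrics, which is what lets logging be added to existing training code with minimal changes.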

Both **TF.Keras** and **PyTorch** have the same features regarding MLRun's automatic logging and distributed training orchestration:
* **Automatic logging**: auto-log your training and model to both **TensorBoard** and **MLRun**. Additional settings can be passed to gain extra logging capabilities, such as:
  * Weights histograms and distributions
  * Weights statistics
  * Weights images (work in progress)
  * Tracking of static and dynamic hyperparameters
  * Logging frequency and more
* **Distributed training with Horovod**: Horovod is initialized and used automatically if the MLRun Function's `kind` attribute is equal to `"mpijob"`. No additional changes to the original code are needed! More on that later in [section 6](#section_6).

We suggest reading the documentation for advanced usage or, as in this example, using the default settings.

<a id="section_4"></a>
## 4. Create the MLRun Function

We will use MLRun's `mlrun.code_to_function` to create an MLRun Function from our code in the above-mentioned Python file. Notice our MLRun Function will have two handlers: `train` and `evaluate`.

We wish to run the training first as a Job, so we will set the `kind` parameter to `"job"`.

In [7]:
# Create the function parsing the given file code using 'code_to_function':
training_and_evaluation_function = mlrun.code_to_function(
    filename=os.path.join(framework, "training-and-evaluation.py"),
    name="training-and-evaluation",
    kind="job",
    image=f'.mask-detection-{framework}'
)

# Mount it:
training_and_evaluation_function.apply(mlrun.auto_mount())
if os.getenv('V3IO_ACCESS_KEY','False')=='False':
    training_and_evaluation_function.spec.disable_auto_mount=False

<a id="section_5"></a>
## 5. Run Training and Evaluation

### 5.1. Train the Model

We will run the training using the `train` handler. We will pass the desired hyperparameters and keep the returned run object in order to pass the trained model to the evaluation (more on that later on).

Unlike running the `open_archive` function, the training will be performed as a **job** on the **cluster** and not locally. 

> **Notice** now the `local` attribute is `False` (this is its default value) which means it will run the function on the cluster. To run the training locally, simply pass `local=True` as before.

In [8]:
training_run = training_and_evaluation_function.run(
    name="training",
    handler="train",
    params={
        "dataset_path": dataset_path,
        "batch_size": 32,
        "lr": 1e-4,
        "epochs": 3
    },
    local=False
)

> 2023-09-05 10:21:30,471 [info] Storing function: {'name': 'training', 'uid': 'e80c00ef3e69420fbb2e29a17fa053de', 'db': 'http://mlrun-api:8080'}
> 2023-09-05 10:21:30,780 [info] Job is running in the background, pod: training-5stfr
2023-09-05 10:22:25.921701: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-09-05 10:22:25.921744: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-09-05 10:22:28.127332: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-09-05 10:22:28.127398: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2023-09-05 10:22:28.127430: I tens

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mask-detection-iguazio | ...7fa053de | 0 | Sep 05 10:22:25 | completed | training | v3io_user=iguazio<br>kind=job<br>owner=iguazio<br>mlrun/client_version=1.5.0-rc9<br>mlrun/client_python_version=3.9.16<br>host=training-5stfr |  | dataset_path=/User/test/demos/mask-detection/data/<br>batch_size=32<br>lr=0.0001<br>epochs=3 | dataset_path=/User/test/demos/mask-detection/data/<br>batch_size=32<br>epochs=3<br>lr=9.999999747378752e-05<br>training_loss=0.05770158767700195<br>training_accuracy=0.9686927795410156<br>validation_loss=0.039046022627088756<br>validation_accuracy=0.9927536646525065 | training_loss.html<br>training_accuracy.html<br>validation_loss.html<br>validation_accuracy.html<br>loss_summary.html<br>accuracy_summary.html<br>lr_values.html<br>model |





> 2023-09-05 10:25:46,645 [info] Run execution finished: {'status': 'completed', 'name': 'training'}


When the training is done, there will be a list of all the <span style="background:lightgreen">artifacts created</span> in MLRun during the training run. All the (hopefully smooth) loss and metrics graphs we all love will be in both MLRun and TensorBoard, as well as the model weights and custom objects.

### 5.2. Evaluate the Model

Evaluating the model requires, you guessed it, the trained model. In order to get the model we just trained, we will use the training run object `training_run` that was returned from calling `run` on the MLRun function. Then, to get the model artifact, we will use `training_run.outputs` - a dictionary of all the function's artifacts that can be accessed by their names. So, to get the model, we will use the key "model" (as seen in the artifacts list generated above).
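
The value passed on as `model_path` is a reference into the artifact store rather than the weights themselves; in the evaluation run's output further down it appears as `store://artifacts/mask-detection-iguazio/mask_detector:e80c00ef3e69420fbb2e29a17fa053de`, where the suffix matches the training run's `uid`. A minimal sketch that splits such a reference, assuming only the `store://artifacts/<project>/<key>:<reference>` shape visible in this notebook's output; `split_store_uri` is a hypothetical helper, and MLRun resolves these references itself:

```python
def split_store_uri(uri):
    """Split 'store://artifacts/<project>/<key>:<reference>' into its parts."""
    prefix = "store://artifacts/"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an artifact store URI: {uri!r}")
    # Everything up to the first '/' is the project, the rest is '<key>:<reference>'.
    project, _, rest = uri[len(prefix):].partition("/")
    key, _, reference = rest.partition(":")
    return {"project": project, "key": key, "reference": reference}


# Value copied from the evaluation run's 'model_path' parameter in this notebook's output.
parts = split_store_uri(
    "store://artifacts/mask-detection-iguazio/mask_detector:e80c00ef3e69420fbb2e29a17fa053de"
)
print(parts)
```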

In [9]:
evaluation_run = training_and_evaluation_function.run(
    name="evaluating",
    handler="evaluate",
    params={
        "model_path": training_run.outputs['model'],  # <- Take the model we trained from the previous MLRun function via the run object.
        "dataset_path": dataset_path,
        "batch_size": 32
    }
)

> 2023-09-05 10:25:46,679 [info] Storing function: {'name': 'evaluating', 'uid': 'a899d1c8d9b5458bbfb6bef7f8e7c119', 'db': 'http://mlrun-api:8080'}
> 2023-09-05 10:25:46,979 [info] Job is running in the background, pod: evaluating-jzts5
2023-09-05 10:26:12.017860: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-09-05 10:26:12.017917: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-09-05 10:26:14.243557: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-09-05 10:26:14.243608: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2023-09-05 10:26:14.243641: I 

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mask-detection-iguazio | ...f8e7c119 | 0 | Sep 05 10:26:11 | completed | evaluating | v3io_user=iguazio<br>kind=job<br>owner=iguazio<br>mlrun/client_version=1.5.0-rc9<br>mlrun/client_python_version=3.9.16<br>host=evaluating-jzts5 |  | model_path=store://artifacts/mask-detection-iguazio/mask_detector:e80c00ef3e69420fbb2e29a17fa053de<br>dataset_path=/User/test/demos/mask-detection/data/<br>batch_size=32 | model_path=store://artifacts/mask-detection-iguazio/mask_detector:e80c00ef3e69420fbb2e29a17fa053de<br>dataset_path=/User/test/demos/mask-detection/data/<br>batch_size=32<br>evaluation_loss=0.042869803517363796<br>evaluation_accuracy=0.9927325581395349 | evaluation_loss.html<br>evaluation_accuracy.html |





> 2023-09-05 10:27:25,654 [info] Run execution finished: {'status': 'completed', 'name': 'evaluating'}


<a id="section_6"></a>
## 6. Run Distributed Training Using Horovod

Now we can see the second benefit of MLRun: we can **distribute** our model **training** across **multiple workers** (i.e., perform distributed training), assign **GPUs**, and more. We don't need to bother with Dockerfiles or K8s YAML configuration files; MLRun does all of this for us. We will simply create our function with `kind="mpijob"`.

> **Notice**: for this demo, in order to use GPUs in training, set the `use_gpu` variable to `True`. This will later assign the required configurations to use the GPUs and pass the correct image to support GPUs (image with CUDA libraries).

In [10]:
# Create the MLRun Function:
mpi_training_and_evaluation_function = mlrun.code_to_function(
    filename=os.path.join(framework, "training-and-evaluation.py"),
    name="mpi-training-and-evaluation",
    handler="train",
    kind="mpijob",
    image=f'.mask-detection-{framework}',
    with_doc=False
)

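We can set additional configurations for our run, such as the image, the number of workers (replicas) and the GPUs. The cell below sets up 1 worker; if `use_gpu` is `True` the worker is limited to 1 GPU, otherwise it requests 2 CPUs. Increase `spec.replicas` to train on more workers: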

In [11]:
# Setup the desired configurations:
mpi_training_and_evaluation_function.spec.replicas = 1
if use_gpu:
    # Select the number of GPUs per replica:
    mpi_training_and_evaluation_function.with_limits(gpus=1)
else:
    mpi_training_and_evaluation_function.with_requests(cpu=2)

# Mount it:
mpi_training_and_evaluation_function.apply(mlrun.auto_mount())

<mlrun.runtimes.mpijob.v1.MpiRuntimeV1 at 0x7fbd2b0007c0>

In [12]:
# community edition mount support
if os.getenv('V3IO_ACCESS_KEY','False')=='False':
    mpi_training_and_evaluation_function.spec.disable_auto_mount=False

Call `run`. With more than one replica, each epoch is shorter because the work is shared between the workers. As multiple workers print a lot of output, we would rather wait for completion and then show the results. For that, we will pass `watch=False` and use the run object's `wait_for_completion` and `show` methods.
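
The reason more workers shorten an epoch is plain arithmetic: in data-parallel training each worker processes only its own share of the samples. A rough sketch, assuming the samples are split evenly and every worker keeps the same batch size of 32; the dataset size of 1,200 images is a made-up number for illustration, and `steps_per_epoch` is a hypothetical helper:

```python
import math


def steps_per_epoch(num_samples, batch_size, workers):
    """Batches each worker processes per epoch when samples are split evenly across workers."""
    samples_per_worker = math.ceil(num_samples / workers)
    return math.ceil(samples_per_worker / batch_size)


# Hypothetical dataset size; batch_size=32 matches the runs in this notebook.
for workers in (1, 2, 4):
    print(workers, steps_per_epoch(num_samples=1200, batch_size=32, workers=workers))
```

Doubling the workers roughly halves the number of batches each worker has to process per epoch. The sketch ignores the communication between workers, which adds some overhead in practice.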

To see the logs, you can go into the UI by clicking the blue hyperlink "<span style="color:blue">**click here**</span>" after running the function:

In [13]:
# Run the training job:
distributed_training_run = mpi_training_and_evaluation_function.run(
    name="distributed-training",
    params={
        "dataset_path": dataset_path,
        "batch_size": 32,
        "lr": 1e-4,
        "epochs": 3,
    },
    watch=False,  # <- Turn off the logs.
)

# Wait for completion and show the results.
distributed_training_run.wait_for_completion()
distributed_training_run.show()

> 2023-09-05 10:27:25,906 [info] Storing function: {'name': 'distributed-training', 'uid': '0cb4e7dbe2a2438788038915e789d505', 'db': 'http://mlrun-api:8080'}
> 2023-09-05 10:28:17,832 [info] MpiJob distributed-training-d3ae842b launcher pod distributed-training-d3ae842b-launcher state active


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mask-detection-iguazio | ...e789d505 | 0 | Sep 05 10:27:26 | running | distributed-training | v3io_user=iguazio<br>kind=mpijob<br>owner=iguazio<br>mlrun/client_version=1.5.0-rc9<br>mlrun/client_python_version=3.9.16 |  | dataset_path=/User/test/demos/mask-detection/data/<br>batch_size=32<br>lr=0.0001<br>epochs=3 |  |  |





> 2023-09-05 10:28:17,853 [info] Run execution finished: {'status': 'running', 'name': 'distributed-training'}
> 2023-09-05 10:28:17,877 [info] run distributed-training is not completed yet, waiting for it to complete: {'current_state': 'running'}
+ POD_NAME=distributed-training-d3ae842b-worker-0
+ shift
+ /opt/kube/kubectl exec distributed-training-d3ae842b-worker-0 -- /bin/sh -c  orted -mca ess "env" -mca ess_base_jobid "2329411584" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex "distributed-training-d[1:3]ae842b-launcher,distributed-training-d[1:3]ae842b-worker-0@0(2)" -mca orte_hnp_uri "2329411584.0;tcp://10.233.67.47:37575" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "2329411584.0;tcp://10.233.67.47:37575" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
2023-09-05 10:28:22.201858: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dyn

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mask-detection-iguazio | ...e789d505 | 0 | Sep 05 10:28:21 | completed | distributed-training | v3io_user=iguazio<br>kind=mpijob<br>owner=iguazio<br>mlrun/client_version=1.5.0-rc9<br>mlrun/client_python_version=3.9.16<br>mlrun/job=distributed-training-d3ae842b<br>host=distributed-training-d3ae842b-worker-0 |  | dataset_path=/User/test/demos/mask-detection/data/<br>batch_size=32<br>lr=0.0001<br>epochs=3 | dataset_path=/User/test/demos/mask-detection/data/<br>batch_size=32<br>epochs=3<br>lr=9.999999747378752e-05<br>training_loss=0.040120527148246765<br>training_accuracy=1.0004091262817383<br>validation_loss=0.04653192311525345<br>validation_accuracy=0.9855072498321533 | training_loss.html<br>training_accuracy.html<br>validation_loss.html<br>validation_accuracy.html<br>loss_summary.html<br>accuracy_summary.html<br>lr_values.html<br>model |
