# Mask Detection Demo - Training and Evaluation (1 / 3)

The following example demonstrates a training workflow that builds and trains a model that classifies whether a person is wearing a mask. The training is auto-logged to both Tensorboard and MLRun, and can easily be distributed using Horovod. After training, we will run an evaluation to check our model's performance, logging the results as part of a routine test.

1. [Setup the Project](#section_1)
2. [Download the Data](#section_2)
3. [Write the Training and Evaluation Code](#section_3)
4. [Create the MLRun Function](#section_4)
5. [Run Training and Evaluation](#section_5)
6. [Run Distributed Training Using Horovod](#section_6)

Before we continue, we need to install MLRun and the framework of choice (comment and uncomment the framework you wish to use):

In [None]:
!pip install mlrun
!pip install -U typing-extensions

########## For TF.Keras: ##########
!pip install -U tensorflow==2.7.0

########## For PyTorch:  ##########
# !pip install -U torch==1.10
# !pip install -U torchvision==0.11.1

<a id="section_1"></a>
## 1. Setup the Project

Create a project using `mlrun.get_or_create_project` (make sure to load it in case it already exists), creating the paths where we'll store the project's artifacts:

In [1]:
import mlrun
import os

# Set our project's name:
project_name = "mask-detection"

# Create the project:
project = mlrun.get_or_create_project(name=project_name, context="./", user_project=True)

> 2021-11-07 19:30:54,211 [info] loaded project mask-detection from MLRun DB


A project in MLRun is based on the MLRun Functions it can run. In this notebook we will see two ways to create an MLRun Function:
* `mlrun.code_to_function`: Create our own MLRun Function from code (will be used for training and evaluation in [section 4](#section_4)).
* `mlrun.import_function`: Import from [MLRun's functions marketplace](https://docs.mlrun.org/en/latest/load-from-marketplace.html) - a functions hub intended to be a centralized location for open source contributions of function components (will be used for downloading the data in [section 2](#section_2)).

<a id="section_2"></a>
## 2. Download the Data

### 2.1. Import a Function

We will download the images using `open_archive` - a function from MLRun's functions marketplace. We will import the function using `mlrun.import_function` and describe it to get the function's documentation:

In [2]:
# Import the function:
open_archive_function = mlrun.import_function("hub://open_archive")

# Print the function's documentation:
open_archive_function.doc()

function: open-archive
Open a file/object archive into a target directory
default handler: open_archive
entry points:
  open_archive: Open a file/object archive into a target directory

Currently supports zip and tar.gz
    context(MLClientCtx)  - function execution context, default=
    archive_url(DataItem)  - url of archive file, default=
    subdir(str)  - path within artifact store where extracted files are stored, default=content
    key(str)  - key of archive contents in artifact store, default=content
    target_path(str)  - file system path to store extracted files (use either this or subdir), default=None
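
To make the documentation above concrete, here is a minimal standard-library sketch of what opening a zip or tar.gz archive into a target directory involves. It is not the hub function's actual code: `extract_archive` and its parameters are hypothetical names used only for illustration.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_archive(archive_path: str, target_path: str) -> Path:
    """Extract a zip or tar archive (including tar.gz) into `target_path`."""
    target = Path(target_path)
    # Create the target directory if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)

    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(target)
    elif tarfile.is_tarfile(archive_path):
        # "r:*" lets tarfile detect the compression (e.g. gzip) by itself.
        with tarfile.open(archive_path, "r:*") as archive:
            archive.extractall(target)
    else:
        raise ValueError(f"unsupported archive format: {archive_path}")
    return target
```

The hub function additionally handles downloading the archive and logging the extracted content as an artifact, which this sketch leaves out.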


### 2.2. Run the Function - Download the Images

* **Function handlers**: We'll download the images by running the function with its `open_archive` handler, as noted in the function's documentation. An MLRun function is a collection of code, and its handlers are the functions defined inside it. Every function that takes a context (type: `mlrun.MLClientCtx`) can be used as a handler.
* **Passing parameters**: An MLRun function expects two kinds of arguments: inputs (type: `mlrun.DataItem`) and parameters. The function's documentation shows that `archive_url` is an `mlrun.DataItem`, so it should be passed in the `inputs` argument of the `run` method. The others are passed via the `params` argument.
* We will use the `local` argument and pass it as `True`. That means we will run the function locally and not on a pod. Using `local` is a convenient way to debug the code.

For more information regarding MLRun functions, context and data items, refer to [MLRun's documentation](https://docs.mlrun.org/en/latest/index.html).
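
As an illustration of the handler and parameter conventions described above, here is a small sketch of what a handler can look like. The handler `archive_size` and its `unit` parameter are hypothetical and not part of this demo; the sketch only relies on `DataItem.get()` to read the input's content and `context.log_result()` to record a result. The type hints are written as strings so the snippet can be read without importing `mlrun`.

```python
def archive_size(
    context: "mlrun.MLClientCtx",   # the execution context makes this usable as a handler
    archive_url: "mlrun.DataItem",  # a data item, so it is passed through `inputs`
    unit: str = "bytes",            # a plain parameter, so it is passed through `params`
) -> None:
    """Hypothetical handler: log the size of the input archive as a run result."""
    data = archive_url.get()  # read the whole object into memory as bytes
    size = len(data) / 1024 if unit == "kb" else len(data)
    context.log_result(f"archive_size_{unit}", size)
```

Under these assumptions, such a handler would be run in the same way as `open_archive`: the archive goes in `inputs={'archive_url': ...}` and the unit in `params={'unit': 'kb'}`.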

In [3]:
# Setup the archive url for downloading the dataset images:
archive_url = mlrun.get_sample_path("data/prajnasb-generated-mask-detection/prajnasb_generated_mask_detection.zip")

# Set the path to download the images data to:
dataset_path = os.path.abspath('./Dataset')

# Run the function using the 'open_archive' handler:
open_archive_run = open_archive_function.run(
    name='download_data',
    handler='open_archive',
    inputs={'archive_url': archive_url},
    params={'target_path': dataset_path},
    local=True
)

> 2021-11-07 19:30:54,349 [info] starting run download_data uid=f3b87a845aab4d8abb063b6e2ade7c62 DB=http://mlrun-api:8080
> 2021-11-07 19:30:54,570 [info] downloading https://s3.wasabisys.com/iguazio/data/prajnasb-generated-mask-detection/prajnasb_generated_mask_detection.zip to local temp file


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mask-detection-guyl | ...2ade7c62 | 0 | Nov 07 19:30:54 | completed | download_data | v3io_user=guyl<br>kind=<br>owner=guyl<br>host=guyl-jupyter-bdbbcc6cc-x2mzd | archive_url | target_path=/User/demos/mask-detection/tf-keras/Dataset | | content |





> 2021-11-07 19:31:05,848 [info] run executed, status=completed


<a id="section_3"></a>
## 3. Write the Training and Evaluation Code

Before we continue, **please select the desired framework** (comment and uncomment the below lines as needed):

In [4]:
framework = "tf-keras"
# framework = "pytorch"

### TF.Keras

The code is taken from the python file [training-and-evaluation.py](tf-keras/training-and-evaluation.py), which is classic and straightforward. We: 
1. Use `_get_datasets` to get the training and validation datasets (on evaluation - the evaluation dataset).
2. Use `_get_model` to build our classifier - simple transfer learning from MobileNetV2 (`keras.applications`).
3. Call `train` to train the model.
4. Call `evaluate` to evaluate the model.

Taking this code one step further is **MLRun**'s framework for `tf.keras`:

```python
import mlrun.frameworks.tf_keras as mlrun_tf_keras

# Apply MLRun's interface for tf.keras:
mlrun_tf_keras.apply_mlrun(model=model, context=context, ...)
```

With just one line of code, it seamlessly provides **automatic logging** (for both MLRun and Tensorboard) and **distributed training** by wrapping the `fit` and `evaluate` methods of `tf.keras.Model`.
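
To give an intuition for what "wrapping the `fit` method" means, here is a toy, pure-Python sketch of the general technique. It is not MLRun's implementation: `TinyModel`, `apply_logging` and the fake loss values are all hypothetical stand-ins.

```python
import functools


class TinyModel:
    """Hypothetical stand-in for a model with a Keras-like `fit` method."""

    def fit(self, epochs: int) -> list:
        # Pretend the training loss halves on every epoch.
        return [1.0 / (2 ** epoch) for epoch in range(1, epochs + 1)]


def apply_logging(model, log: list):
    """Wrap `model.fit` so every call also records per-epoch losses in `log`."""
    original_fit = model.fit

    @functools.wraps(original_fit)
    def fit_with_logging(*args, **kwargs):
        losses = original_fit(*args, **kwargs)  # run the original training
        for epoch, loss in enumerate(losses, start=1):
            log.append({"epoch": epoch, "training_loss": loss})
        return losses

    model.fit = fit_with_logging  # the caller keeps calling `model.fit` as usual
    return model


records = []
model = apply_logging(TinyModel(), records)
model.fit(epochs=3)
print(records[-1])  # {'epoch': 3, 'training_loss': 0.125}
```

The point of the pattern is that the training code itself does not change: it still calls `fit`, and the logging happens as a side effect of the wrapper.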

In addition, `apply_mlrun` returns a `TFKerasModelHandler` object. This class supports loading, saving and logging `tf.keras` models with ease, enabling easy versioning of the model and its results, artifacts and custom objects. Note that by default, the model is loaded and logged automatically by `apply_mlrun`.

### PyTorch

The code is taken from the python file [training-and-evaluation.py](pytorch/training-and-evaluation.py), which is classic and straightforward. We:
1. Use `_get_datasets` to get the training and validation datasets (on evaluation - the evaluation dataset). The function initializes a `MaskDetectionDataset` to handle our images.
2. Initialize our `MaskDetector` classifier class - a simple transfer learning from MobileNetV2 (`torchvision.models`).
3. Call `train` to train the model.
4. Call `evaluate` to evaluate the model.

Taking this code one step further is **MLRun**'s framework for `torch`:

```python
import mlrun.frameworks.pytorch as mlrun_torch
```

`mlrun_torch` provides what we call "shortcut functions" for using PyTorch with ease:
* `train` - Training a model.
* `evaluate` - Evaluating a model.

Both functions enable **automatic logging** (for both MLRun and Tensorboard) and **distributed training** by simply passing the following parameters: `auto_log: bool` and `use_horovod: bool`.

In addition, you can choose to use our classes directly:
* `PyTorchMLRunInterface` - the interface for training, evaluating and predicting with a PyTorch model. Our code is highly generic and should fit any type of model.
* If you wish to use your own training code, you only need to use our callback mechanism with `CallbackHandler` to get automatic logging.
* `PyTorchModelHandler` - supports loading, saving and logging `torch` models with ease, enabling easy versioning of the model and its results, artifacts and custom objects.

Both **TF.Keras** and **PyTorch** have the same features regarding MLRun's automatic logging and distributed training orchestration:
* **Automatic logging**: auto-log your training and model to both **Tensorboard** and **MLRun**. Additional settings can be passed to gain extra logging capabilities, like:
  * Weights histograms and distributions
  * Weights statistics
  * Weights images (work in progress)
  * Static and dynamic hyperparameter tracking
  * Logging frequency and more
* **Distributed training with Horovod**: Horovod is initialized and used automatically if the MLRun Function's `kind` attribute is equal to `"mpijob"`. No additional changes to the original code are needed! More on that later in [section 6](#section_6).

For further options, we suggest reading the documentation. Otherwise, as in this example, use the default settings.

<a id="section_4"></a>
## 4. Create the MLRun Function

We will use MLRun's `mlrun.code_to_function` to create an MLRun Function from our code in the Python file mentioned above. Notice that our MLRun Function will have two handlers: `train` and `evaluate`.

We wish to run the training first as a Job, so we will set the `kind` parameter to `"job"`.

In [5]:
# Create the function parsing the given file code using 'code_to_function':
training_and_evaluation_function = mlrun.code_to_function(
    filename=os.path.join(framework, "training-and-evaluation.py"),
    name="training-and-evaluation",
    kind="job",
    image="mlrun/ml-models"
)

# Mount it:
training_and_evaluation_function.apply(mlrun.platforms.auto_mount())

<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f2db70bdb90>

<a id="section_5"></a>
## 5. Run Training and Evaluation

### 5.1. Train the Model

We will run the training using the `train` handler. We will pass the desired hyperparameters and keep the returned run object in order to pass the trained model to the evaluation (more on that later).

Unlike running the `open_archive` function, the training will be performed as a **job** on the **cluster** and not locally. 

> **Notice** now the `local` attribute is `False` (this is its default value) which means it will run the function on the cluster. To run the training locally, simply pass `local=True` as before.

In [6]:
training_run = training_and_evaluation_function.run(
    name="training",
    handler="train",
    params={
        "dataset_path": os.path.abspath('./Dataset'),
        "batch_size": 32,
        "lr": 1e-4,
        "epochs": 3
    },
    local=False
)

> 2021-11-07 19:31:05,922 [info] starting run training uid=cd20e186df22485ca0d6be479b304965 DB=http://mlrun-api:8080
> 2021-11-07 19:31:07,177 [info] Job is running in the background, pod: training-4dz85
2021-11-07 19:31:12.307252: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-11-07 19:31:12.307293: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-11-07 19:31:13.397493: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-11-07 19:31:13.397688: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-11-07 19:31:13.397712: W tensorflow/stream_exec

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mask-detection-guyl | ...9b304965 | 0 | Nov 07 19:31:13 | completed | training | v3io_user=guyl<br>kind=job<br>owner=guyl<br>host=training-4dz85 | | dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32<br>lr=0.0001<br>epochs=3 | dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32<br>epochs=3<br>lr=9.999999747378752e-05<br>training_loss=0.07825136184692383<br>training_accuracy=0.9684867858886719<br>validation_loss=0.040051152308781944<br>validation_accuracy=0.9891304439968533 | training_loss_epoch_1.html<br>training_accuracy_epoch_1.html<br>validation_loss_epoch_1.html<br>validation_accuracy_epoch_1.html<br>training_loss_epoch_2.html<br>training_accuracy_epoch_2.html<br>validation_loss_epoch_2.html<br>validation_accuracy_epoch_2.html<br>training_loss_epoch_3.html<br>training_accuracy_epoch_3.html<br>validation_loss_epoch_3.html<br>validation_accuracy_epoch_3.html<br>loss_summary.html<br>accuracy_summary.html<br>lr.html.html<br>mask_detector.zip<br>mask_detector |





> 2021-11-07 19:32:45,638 [info] run executed, status=completed


When the training is done, there will be a list of all the <span style="background:lightgreen">artifacts created</span> in MLRun during the training run. All the (hopefully smooth) loss and metrics graphs we all love will be in both MLRun and Tensorboard, as well as the model weights and custom objects.

### 5.2. Evaluate the Model

Evaluating the model requires, you guessed it, the trained model. In order to get the model we just trained, we will use the training run object `training_run` that was returned from calling `run` on the MLRun function. Then, to get the model artifact, we will use `training_run.outputs` - a dictionary of all the function's artifacts that can be accessed by their names. So, to get the model, we will use the key "model" (as seen in the artifacts list generated above).

In [7]:
evaluation_run = training_and_evaluation_function.run(
    name="evaluating",
    handler="evaluate",
    params={
        "model_path": training_run.outputs['model'],  # <- Take the model we trained from the previous MLRun function via the run object.
        "dataset_path": os.path.abspath('./Dataset'),
        "batch_size": 32
    }
)

> 2021-11-07 19:32:45,643 [info] starting run evaluating uid=4d5b18f2020149ceb8dab5297458b453 DB=http://mlrun-api:8080
> 2021-11-07 19:32:48,710 [info] Job is running in the background, pod: evaluating-d4t5b
2021-11-07 19:32:53.552564: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-11-07 19:32:53.552608: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-11-07 19:32:54.690373: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-11-07 19:32:54.690599: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-11-07 19:32:54.690622: W tensorflow/stream_

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mask-detection-guyl | ...7458b453 | 0 | Nov 07 19:32:55 | completed | evaluating | v3io_user=guyl<br>kind=job<br>owner=guyl<br>host=evaluating-d4t5b | | model_path=store://artifacts/mask-detection-guyl/mask_detector:cd20e186df22485ca0d6be479b304965<br>dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32 | model_path=store://artifacts/mask-detection-guyl/mask_detector:cd20e186df22485ca0d6be479b304965<br>dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32<br>evaluation_loss=0.03454501129860102<br>evaluation_accuracy=0.9920058139534884 | evaluation_loss_epoch_1.html<br>evaluation_accuracy_epoch_1.html |





> 2021-11-07 19:33:29,377 [info] run executed, status=completed
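
The `model_path` value in the evaluation run above is a store URI rather than a file path, and its suffix matches the uid printed when the training run started. The sketch below splits such a URI into its parts. The layout `store://<kind>/<project>/<key>:<uid>` is inferred from that single output line only, and `parse_store_uri` is a hypothetical helper, not an MLRun API.

```python
from typing import NamedTuple


class StoreUri(NamedTuple):
    kind: str     # e.g. "artifacts"
    project: str  # e.g. "mask-detection-guyl"
    key: str      # the artifact's name
    uid: str      # the part after ":" (empty if there is none)


def parse_store_uri(uri: str) -> StoreUri:
    """Split a store URI of the assumed form store://<kind>/<project>/<key>:<uid>."""
    prefix = "store://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a store URI: {uri!r}")
    kind, project, rest = uri[len(prefix):].split("/", 2)
    key, _, uid = rest.partition(":")
    return StoreUri(kind, project, key, uid)


parsed = parse_store_uri(
    "store://artifacts/mask-detection-guyl/mask_detector:cd20e186df22485ca0d6be479b304965"
)
print(parsed.key)  # mask_detector
print(parsed.uid)  # cd20e186df22485ca0d6be479b304965
```

Passing the URI instead of a local path is what lets the evaluation job, running in its own pod, fetch exactly the model that the training run logged.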


<a id="section_6"></a>
## 6. Run Distributed Training Using Horovod

Now we can see the second benefit of MLRun: we can **distribute** our model **training** across **multiple workers** (i.e., perform distributed training), assign **GPUs**, and more. We don't need to bother with Dockerfiles or K8s YAML configuration files; MLRun does all of this for us. We will simply create our function with `kind="mpijob"`.

> **Notice**: for this demo, in order to use GPUs in training, set the `use_gpu` variable to `True`. This will later assign the required configurations to use the GPUs and pass the correct image to support GPUs (image with CUDA libraries).

In [8]:
# If you wish to train on gpu, set this variable to 'True', otherwise 'False':
use_gpu = False

# Create the MLRun Function:
mpi_training_and_evaluation_function = mlrun.code_to_function(
    filename=os.path.join(framework, "training-and-evaluation.py"),
    name="mpi-training-and-evaluation",
    handler="train",
    kind="mpijob",
    image="mlrun/ml-models-gpu" if use_gpu else "mlrun/ml-models",
    with_doc=False
)

We can set additional configurations for our run, such as image, workers, GPUs and more. Here we set up 2 workers: if `use_gpu` is `True`, each worker gets 1 GPU; otherwise, each worker requests 2 CPUs:

In [9]:
# Setup the desired configurations:
mpi_training_and_evaluation_function.spec.replicas = 2
if use_gpu:
    # Select the number of GPUs per replica:
    mpi_training_and_evaluation_function.gpus(1)
else:
    mpi_training_and_evaluation_function.with_requests(cpu=2)

# Mount it:
mpi_training_and_evaluation_function.apply(mlrun.platforms.auto_mount())

<mlrun.runtimes.mpijob.v1.MpiRuntimeV1 at 0x7f2db79b3410>

Call `run`, and notice that each epoch is shorter because we now have 2 workers instead of 1. Since the 2 workers print a lot of output, we would rather wait for completion and then show the results. For that, we will pass `watch=False` and use the run object's `wait_for_completion` and `show` methods.
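
A rough way to see why epochs get shorter: in the usual data-parallel setup, each worker processes only its own share of the samples, so it runs fewer batches per epoch. The sketch below assumes an even split across workers and uses a hypothetical dataset size of 1,000 images; the real numbers depend on the dataset and on how the training code shards it.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int, workers: int = 1) -> int:
    """Batches each worker processes per epoch when samples are split evenly."""
    samples_per_worker = math.ceil(num_samples / workers)
    return math.ceil(samples_per_worker / batch_size)


num_samples = 1_000  # hypothetical dataset size, for illustration only
print(steps_per_epoch(num_samples, batch_size=32, workers=1))  # 32
print(steps_per_epoch(num_samples, batch_size=32, workers=2))  # 16
```

Halving the steps does not necessarily halve the wall-clock time, since the workers also spend time synchronizing gradients.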

To see the logs, you can go into the UI by clicking the blue hyperlink "<span style="color:blue">**click here**</span>" after running the function:

In [10]:
# Run the training job:
distributed_training_run = mpi_training_and_evaluation_function.run(
    name="distributed-training",
    params={
        "dataset_path": os.path.abspath('./Dataset'),
        "batch_size": 32,
        "lr": 1e-4,
        "epochs": 3,
    },
    watch=False,  # <- Turn off the logs.
)

# Wait for completion and show the results.
distributed_training_run.wait_for_completion()
distributed_training_run.show()

> 2021-11-07 19:33:29,443 [info] starting run distributed-training uid=2d5b8ae1bc844bf6abf7fa0ac1c6a3fd DB=http://mlrun-api:8080
> 2021-11-07 19:33:36,158 [info] MpiJob distributed-training-29437a95 launcher pod distributed-training-29437a95-launcher state active


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mask-detection-guyl | ...c1c6a3fd | 0 | Nov 07 19:33:29 | running | distributed-training | v3io_user=guyl<br>kind=mpijob<br>owner=guyl | | dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32<br>lr=0.0001<br>epochs=3 | | |





> 2021-11-07 19:33:36,174 [info] run executed, status=running


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mask-detection-guyl | ...c1c6a3fd | 0 | Nov 07 19:33:41 | completed | distributed-training | v3io_user=guyl<br>kind=mpijob<br>owner=guyl<br>mlrun/job=distributed-training-29437a95<br>host=distributed-training-29437a95-worker-0 | | dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32<br>lr=0.0001<br>epochs=3 | dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32<br>epochs=3<br>lr=0.00015999999595806003<br>training_loss=0.07515096664428711<br>training_accuracy=1.001007080078125<br>validation_loss=0.056933866606818304<br>validation_accuracy=0.9855072233412001 | training_loss_epoch_1.html<br>training_accuracy_epoch_1.html<br>validation_loss_epoch_1.html<br>validation_accuracy_epoch_1.html<br>training_loss_epoch_2.html<br>training_accuracy_epoch_2.html<br>validation_loss_epoch_2.html<br>validation_accuracy_epoch_2.html<br>training_loss_epoch_3.html<br>training_accuracy_epoch_3.html<br>validation_loss_epoch_3.html<br>validation_accuracy_epoch_3.html<br>loss_summary.html<br>accuracy_summary.html<br>lr.html.html<br>mask_detector.zip<br>mask_detector |
