# Mask Detection Demo - Training (1 / 2)
The following example demonstrates a training workflow: building and training a model that classifies whether a person is wearing a mask. The training is auto-logged to both TensorBoard and MLRun, and is easily distributed using Horovod.

#### Key Technologies:
- [**PyTorch**](https://pytorch.org/) to train the model
- [**Horovod**](https://horovod.ai/) to run distributed training
- [**MLRun**](https://www.mlrun.org/) to orchestrate the process

#### Credits:

* The model is trained on a dataset containing images of people with or without masks. The data used was taken from Prajna Bhandary, [github link](https://github.com/prajnasb/observations). 
* The training code is taken from Adrian Rosebrock, COVID-19: Face Mask Detector with OpenCV, Keras/TensorFlow, and Deep Learning, PyImageSearch, [page link](https://www.pyimagesearch.com/2020/05/04/covid-19-face-mask-detector-with-opencv-keras-tensorflow-and-deep-learning/), accessed on 29 June 2021

#### Table of Contents:
1. [Setup the Project and Environment](#section_1)
2. [Write the Training Code](#section_2)
3. [Create the Training Function](#section_3)
4. [Run Training](#section_4)

<a id="section_1"></a>
## 1. Setup the Project and Environment
Create a new project, set the environment, and define the URL of the archive from which the dataset images are downloaded:

In [1]:
import mlrun
import os

# Create the project:
project_name='mask-detection-2'
project_dir = os.path.abspath('./')
project = mlrun.new_project(project_name, project_dir)

# Set the environment:
mlrun.set_environment(project=project.metadata.name)

# Setup the archive url for downloading the dataset images:
archive_url = "https://s3.wasabisys.com/iguazio/data/prajnasb-generated-mask-detection/prajnasb_generated_mask_detection.zip"

> 2021-08-29 19:21:51,346 [info] loaded project mask-detection-2 from MLRun DB


<a id="section_2"></a>
## 2. Write the Training Code

We wrote two classes for the training code:
* [MaskDetectionDataset](./custom-objects/mask_detection_dataset.py) - for holding our labeled images of masked and unmasked faces.
* [MaskDetectionMobilenetV2](./custom-objects/mask_detection_mobilenet_v2.py) - our classifier, which is based on MobileNetV2.

From here, the training code is classic and straightforward. We:
1. Use `get_datasets` for downloading the images and initializing our `MaskDetectionDataset` datasets.
2. Initialize a new `MaskDetectionMobilenetV2` model.
3. Train the model.

**MLRun**'s framework for PyTorch takes this code one step further. Its interface provides flexible, generic training and evaluation functions, a callback mechanism that comes with auto-logging capabilities, and distributed training with Horovod. Here we will use the default training function:

```python
import mlrun.frameworks.pytorch as mlrun_torch

mlrun_torch.train(...)
```

The callback mechanism is simple and can also be used outside of the supplied training and evaluation functions: we supply a handler that holds multiple callbacks, and all you need to do is place the right calls at the right places in your own training loop.
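
To make the idea of a handler that holds multiple callbacks more concrete, here is a minimal, framework-free sketch of the pattern. The names `Callback`, `LossHistory`, `SimpleCallbacksHandler` and `run_training_loop` are hypothetical and written only for this illustration; they are not MLRun's classes, and the actual hook names and signatures in MLRun may differ.

```python
from typing import Dict, List


class Callback:
    """Hypothetical base class: every hook is a no-op unless overridden."""

    def on_epoch_begin(self, epoch: int) -> None:
        pass

    def on_train_batch_end(self, batch: int, loss: float) -> None:
        pass

    def on_epoch_end(self, epoch: int) -> None:
        pass


class LossHistory(Callback):
    """Example callback: records the mean training loss of every epoch."""

    def __init__(self) -> None:
        self.epoch_means: Dict[int, float] = {}
        self._batch_losses: List[float] = []

    def on_epoch_begin(self, epoch: int) -> None:
        self._batch_losses = []

    def on_train_batch_end(self, batch: int, loss: float) -> None:
        self._batch_losses.append(loss)

    def on_epoch_end(self, epoch: int) -> None:
        self.epoch_means[epoch] = sum(self._batch_losses) / len(self._batch_losses)


class SimpleCallbacksHandler:
    """Hypothetical handler: forwards each hook call to all held callbacks."""

    def __init__(self, callbacks: List[Callback]) -> None:
        self._callbacks = callbacks

    def on_epoch_begin(self, epoch: int) -> None:
        for callback in self._callbacks:
            callback.on_epoch_begin(epoch)

    def on_train_batch_end(self, batch: int, loss: float) -> None:
        for callback in self._callbacks:
            callback.on_train_batch_end(batch, loss)

    def on_epoch_end(self, epoch: int) -> None:
        for callback in self._callbacks:
            callback.on_epoch_end(epoch)


def run_training_loop(
    handler: SimpleCallbacksHandler, losses_per_epoch: List[List[float]]
) -> None:
    """A stand-in training loop that only shows where the handler calls go."""
    for epoch, batch_losses in enumerate(losses_per_epoch, start=1):
        handler.on_epoch_begin(epoch)
        for batch, loss in enumerate(batch_losses):
            # A real loop would run the forward and backward passes here.
            handler.on_train_batch_end(batch, loss)
        handler.on_epoch_end(epoch)


history = LossHistory()
run_training_loop(SimpleCallbacksHandler([history]), [[0.5, 0.25], [0.25, 0.25]])
print(history.epoch_means)  # {1: 0.375, 2: 0.25}
```

The training loop only talks to the handler, so adding another logger means appending one more callback to the list rather than editing the loop.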

The supplied callbacks provide logging to both TensorBoard and MLRun. This demo showcases the default (auto-logging) settings, but additional settings can be passed to the callback initializers (or to the functions that generate them) to gain extra logging capabilities, such as:

* Weights histograms and distributions
* Weights statistics
* Weights images (work in progress)
* Tracking of static and dynamic hyperparameters
* Logging frequency and more

We suggest reading the documentation for further use or, as in this example, using the default settings.

In [2]:
# mlrun: start-code

In [3]:
import os
import sys
from typing import List, Tuple
import pathlib
import zipfile

from sklearn.model_selection import train_test_split

import torch
from torch import Tensor
from torch.nn import Module
from torch.utils.data import Dataset, DataLoader
import torchvision

from PIL import Image

import mlrun
import mlrun.frameworks.pytorch as mlrun_torch

# Add our path to the custom objects directory containing 'MaskDetectionMobilenetV2':
CUSTOM_OBJECTS_DIRECTORY = os.path.join(os.getcwd(), "custom-objects")
sys.path.insert(0, CUSTOM_OBJECTS_DIRECTORY)
from mask_detection_dataset import MaskDetectionDataset
from mask_detection_mobilenet_v2 import MaskDetectionMobilenetV2

In [4]:
def get_datasets(
    archive_url: mlrun.DataItem,
    dataset_path: str,
    batch_size: int,
    train_test_split_ratio: float,
):
    # Setup directories paths:
    dataset_path = os.path.join(dataset_path, "Dataset")
    os.makedirs(dataset_path, exist_ok=True)

    # Check if needed to download the archive:
    dataset_directory_size = sum(
        [
            f.stat().st_size
            for f in pathlib.Path(dataset_path).glob("**/*")
            if f.is_file()
        ]
    )
    if dataset_directory_size == 0:
        # Download it:
        zip_file = archive_url.local()
        # Extract it, closing the archive once done:
        with zipfile.ZipFile(zip_file, "r") as archive:
            archive.extractall(dataset_path)

    # Build the dataset:
    images = []
    labels = []
    for label, directory in enumerate(["with_mask", "without_mask"]):
        images_directory = os.path.join(dataset_path, directory)
        for image_file in os.listdir(images_directory):
            images.append(os.path.join(images_directory, image_file))
            labels.append(label)

    # Perform one-hot encoding on the labels:
    labels = torch.tensor(labels)
    labels = torch.nn.functional.one_hot(labels)

    # Split the dataset into training and validation sets:
    x_train, x_test, y_train, y_test = train_test_split(
        images,
        labels,
        test_size=train_test_split_ratio,
        stratify=labels,
        random_state=42,
    )

    # Construct the datasets:
    training_set = MaskDetectionDataset(images=x_train, labels=y_train)
    validation_set = MaskDetectionDataset(images=x_test, labels=y_test)

    # Construct the data loaders:
    training_set = DataLoader(dataset=training_set, batch_size=batch_size, shuffle=True)
    validation_set = DataLoader(
        dataset=validation_set, batch_size=batch_size, shuffle=False
    )

    return training_set, validation_set
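
The "download only if needed" check at the top of `get_datasets` can be tried in isolation. The following is a small sketch of the same idea using only the standard library: a tiny zip file created in a temporary directory stands in for the real archive, and the helper names `directory_size` and `extract_if_empty` are assumptions made for this example.

```python
import pathlib
import tempfile
import zipfile


def directory_size(path: pathlib.Path) -> int:
    """Return the total size in bytes of all files under `path`."""
    return sum(f.stat().st_size for f in path.glob("**/*") if f.is_file())


def extract_if_empty(zip_file: pathlib.Path, target: pathlib.Path) -> bool:
    """Extract `zip_file` into `target` only when `target` holds no data.

    Returns True when the archive was extracted and False when it was skipped.
    """
    target.mkdir(parents=True, exist_ok=True)
    if directory_size(target) > 0:
        return False
    with zipfile.ZipFile(zip_file, "r") as archive:
        archive.extractall(target)
    return True


with tempfile.TemporaryDirectory() as workdir:
    workdir = pathlib.Path(workdir)
    # Build a tiny stand-in archive (the real one comes from `archive_url`):
    archive_path = workdir / "images.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("with_mask/a.jpg", b"fake image bytes")
        archive.writestr("without_mask/b.jpg", b"fake image bytes")

    dataset_dir = workdir / "Dataset"
    first_call = extract_if_empty(archive_path, dataset_dir)
    second_call = extract_if_empty(archive_path, dataset_dir)
    extracted = sorted(p.name for p in dataset_dir.glob("**/*") if p.is_file())

print(first_call, second_call, extracted)  # True False ['a.jpg', 'b.jpg']
```

Note that a size-based check treats a directory containing only zero-byte files as empty, and it does not detect a partially extracted archive.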

In [5]:
def train(
    context: mlrun.MLClientCtx,
    archive_url: mlrun.DataItem,
    dataset_path: str = os.path.abspath('./Dataset'),
    batch_size: int = 32,
    lr: float = 1e-4,
    epochs: int = 5,
):
    # Get the datasets:
    training_set, validation_set = get_datasets(
        archive_url=archive_url,
        dataset_path=os.path.abspath("./"),
        batch_size=batch_size,
        train_test_split_ratio=0.2,
    )

    # Initialize the model:
    model = MaskDetectionMobilenetV2()

    # Initialize the optimizer and the loss function:
    optimizer = torch.optim.Adam(lr=lr, params=model.parameters())
    loss = torch.nn.MSELoss()

    def accuracy(y_true, y_pred):
        return (sum(y_pred.argmax(1) == y_true.argmax(1)) / y_true.size()[0]).item()

    # Train the head of the network:
    mlrun_torch.train(
        model=model,
        training_set=training_set,
        loss_function=loss,
        optimizer=optimizer,
        validation_set=validation_set,
        metric_functions=[accuracy],
        epochs=epochs,
        custom_objects_map={
            "mask_detection_mobilenet_v2.py": "MaskDetectionMobilenetV2"
        },
        custom_objects_directory=CUSTOM_OBJECTS_DIRECTORY,
        context=context,
        # training_iterations=35,
    )

In [6]:
# mlrun: end-code
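
The `accuracy` metric defined inside `train` compares the arg-max of each prediction with the arg-max of its one-hot label. As a worked example of that arithmetic, here is a plain-Python sketch that uses lists instead of tensors; the prediction values are made up for illustration.

```python
from typing import List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Return the index of the largest value in `row`."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(y_true: List[Sequence[float]], y_pred: List[Sequence[float]]) -> float:
    """Fraction of samples whose predicted class matches the one-hot label."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# One-hot labels: index 0 is "with_mask", index 1 is "without_mask".
y_true = [[1, 0], [0, 1], [1, 0], [0, 1]]
# Made-up model outputs for four images:
y_pred = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.3, 0.7]]

print(accuracy(y_true, y_pred))  # 0.75, since 3 of the 4 predictions match
```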

<a id="section_3"></a>
## 3. Create the Training Function

We will use MLRun's `code_to_function` to turn the code in this notebook into an MLRun function. Notice the comments `# mlrun: start-code` and `# mlrun: end-code`: they mark which code is included in the function.

We wish to run the training first as a Job, so we will set the `kind` parameter to `"job"`.

In [7]:
training_function = mlrun.code_to_function(
    name="job-trainer",
    handler="train",
    kind="job",
    image="guyliguazio/ml-models-gpu-066:tf243",
    with_doc=False
)
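
To illustrate what the `# mlrun: start-code` and `# mlrun: end-code` comments delimit, here is a simplified sketch of selecting the notebook cells that lie between two marker cells. It is only an illustration of the concept, not MLRun's actual implementation; the function `select_marked_cells` and the sample cells are hypothetical.

```python
from typing import List

START_MARKER = "# mlrun: start-code"
END_MARKER = "# mlrun: end-code"


def select_marked_cells(cells: List[str]) -> List[str]:
    """Return the cells that lie between the start and end marker cells."""
    selected = []
    inside = False
    for cell in cells:
        stripped = cell.strip()
        if stripped == START_MARKER:
            inside = True
        elif stripped == END_MARKER:
            inside = False
        elif inside:
            selected.append(cell)
    return selected


notebook_cells = [
    "project_name = 'mask-detection-2'",
    "# mlrun: start-code",
    "import os",
    "def train(context):\n    pass",
    "# mlrun: end-code",
    "training_function = None",
]
function_code = "\n\n".join(select_marked_cells(notebook_cells))
print(function_code)
```

In this sketch, the project setup cell and the cell after the end marker are left out, so only the imports and the `train` handler end up in the function's code.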

<a id="section_4"></a>
## 4. Run Training

### 4.1. Train Locally:

First, we will run the training locally by setting `local` to `True`.

In [8]:
training_run = training_function.run(
    name="job-trainer-local-run",
    inputs={
        "archive_url": archive_url
    },
    params={
        "dataset_path": os.path.abspath('./Dataset'),
        "batch_size": 32,
        "lr": 1e-4,
        "epochs": 5
    },
    local=True
)

> 2021-08-29 19:22:02,678 [info] starting run job-trainer-local-run uid=27932cc4d5b746ba8948bbe2622c914f DB=http://mlrun-api:8080
Epoch 1/5:
Training: 100% |██████████| 35/35 [00:34<00:00,  1.01Batch/s, MSELoss=0.171, accuracy=0.833]
Validating: 100% |██████████| 9/9 [00:09<00:00,  1.02s/Batch, MSELoss=0.114, accuracy=0.95] 

Summary:
Metrics      Values
---------  --------
MSELoss    0.106625
accuracy   0.9375

Epoch 2/5:
Training: 100% |██████████| 35/35 [00:32<00:00,  1.09Batch/s, MSELoss=0.0707, accuracy=1]    
Validating: 100% |██████████| 9/9 [00:09<00:00,  1.09s/Batch, MSELoss=0.0642, accuracy=0.95] 

Summary:
Metrics      Values
---------  --------
MSELoss    0.053809
accuracy   0.9375

Epoch 3/5:
Training: 100% |██████████| 35/35 [00:30<00:00,  1.14Batch/s, MSELoss=0.0702, accuracy=0.917]
Validating: 100% |██████████| 9/9 [00:10<00:00,  1.12s/Batch, MSELoss=0.0535, accuracy=0.9]  

Summary:
Metrics       Values
---------  ---------
MSELoss    0.0448375
accuracy   0.9375

Epoch

| Field | Value |
| --- | --- |
| project | mask-detection-2 |
| uid | ...622c914f |
| iter | 0 |
| start | Aug 29 19:22:02 |
| state | completed |
| name | job-trainer-local-run |
| labels | v3io_user=admin, kind=, owner=admin, host=guyl-jupyter-9b85dff7-n26l2 |
| inputs | archive_url |
| parameters | dataset_path=/User/guyl/testing-custom-objects/Dataset, batch_size=32, lr=0.0001, epochs=5 |
| results | dataset_path=/User/guyl/testing-custom-objects/Dataset, batch_size=32, epochs=5, lr=0.0001, training_MSELoss=0.014774623326957226, training_accuracy=1.0, validation_MSELoss=0.03156984597444534, validation_accuracy=0.9375 |
| artifacts | training_MSELoss_epoch_1.html, training_accuracy_epoch_1.html, validation_MSELoss_epoch_1.html, validation_accuracy_epoch_1.html, training_MSELoss_epoch_2.html, training_accuracy_epoch_2.html, validation_MSELoss_epoch_2.html, validation_accuracy_epoch_2.html, training_MSELoss_epoch_3.html, training_accuracy_epoch_3.html, validation_MSELoss_epoch_3.html, validation_accuracy_epoch_3.html, training_MSELoss_epoch_4.html, training_accuracy_epoch_4.html, validation_MSELoss_epoch_4.html, validation_accuracy_epoch_4.html, training_MSELoss_epoch_5.html, training_accuracy_epoch_5.html, validation_MSELoss_epoch_5.html, validation_accuracy_epoch_5.html, MSELoss_summary.html, accuracy_summary.html, lr.html.html, MaskDetectionMobilenetV2.pt, MaskDetectionMobilenetV2_custom_objects_map.json, MaskDetectionMobilenetV2_custom_objects.zip, MaskDetectionMobilenetV2 |





> 2021-08-29 19:25:55,493 [info] run executed, status=completed


### 4.2. Train with Kubernetes Job:

Now, we will run the training as a job, so we set the `local` parameter we used before to `False`.

In [10]:
training_function.apply(mlrun.platforms.auto_mount())
training_run = training_function.run(
    name="job-trainer-run",
    inputs={
        "archive_url": archive_url
    },
    params={
        "dataset_path": os.path.abspath('./Dataset'),
        "batch_size": 32,
        "lr": 1e-4,
        "epochs": 3
    },
    local=False
)

> 2021-08-03 08:50:07,539 [info] starting run job-trainer-run uid=4dc847d5348e42868743ed5acd95c9d9 DB=http://mlrun-api:8080
> 2021-08-03 08:50:07,691 [info] Job is running in the background, pod: job-trainer-run-bcj8k
2021-08-03 08:50:12.122084: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-08-03 08:50:13.118334: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-08-03 08:50:13.119482: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-08-03 08:50:13.153344: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-03 08:50:13.153937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:00:1e.

| Field | Value |
| --- | --- |
| project | mask-detection |
| uid | ...cd95c9d9 |
| iter | 0 |
| start | Aug 03 08:50:13 |
| state | completed |
| name | job-trainer-run |
| labels | v3io_user=admin, kind=job, owner=admin, host=job-trainer-run-bcj8k |
| inputs | archive_url |
| parameters | dataset_path=/User/demos/mask-detection/Dataset, batch_size=32, lr=0.0001, epochs=3 |
| results | dataset_path=/User/demos/mask-detection/Dataset, batch_size=32, epochs=3, lr=9.999999747378752e-05, training_loss=0.07963061332702637, training_accuracy=1.0003585815429688, validation_loss=0.03511989116668701, validation_accuracy=0.9963767793443468 |
| artifacts | loss_summary.html, accuracy_summary.html, lr.html, model.h5, model |


to track results use .show() or .logs() or in CLI: 
!mlrun get run 4dc847d5348e42868743ed5acd95c9d9 --project mask-detection , !mlrun logs 4dc847d5348e42868743ed5acd95c9d9 --project mask-detection
> 2021-08-03 08:51:05,424 [info] run executed, status=completed


### 4.3. Train with Horovod:

Here we see another benefit of MLRun: we can **distribute** our model **training** across **multiple workers** (i.e., perform distributed training), assign **GPUs**, and more. We don't need to bother with Dockerfiles or K8s YAML configuration files; MLRun handles all of this for us.

All we need to do is create our function with `kind="mpijob"`:

In [8]:
training_function = mlrun.code_to_function(
    name="mpijob-trainer",
    handler="train",
    kind="mpijob",
    image="guyliguazio/ml-models-gpu-066:tf243",
    with_doc=False
)

We can set additional configurations for our run, such as the image, the number of workers, GPUs and more. We will set up 4 workers with 1 GPU per worker:

In [9]:
# If you wish to train on gpu, set this variable to 'True', otherwise 'False':
use_gpu = True

# Setup the desired configurations:
training_function.spec.replicas = 4
if use_gpu:
    training_function.gpus(1)
else:
    training_function.with_requests(cpu=4)
training_function.apply(mlrun.platforms.auto_mount())

<mlrun.runtimes.mpijob.v1.MpiRuntimeV1 at 0x7f47e3d69910>

Call `run`, and notice that each epoch is shorter, as we now have 4 workers instead of 1.

In [10]:
# Run the training job:
training_run = training_function.run(
    name="trainer-mpijob-run",
    inputs={
        "archive_url": archive_url
    },
    params={
        "dataset_path": os.path.abspath('./Dataset'),
        "batch_size": 32,
        "lr": 1e-4,
        "epochs": 3,
    },
    watch=False,
)

# Print the progress in steps as the 4 workers will print a lot of tf outputs...
import time
from IPython.display import clear_output

while training_run.state() not in ["completed", "error"]:
    time.sleep(3)
    clear_output(wait=True)
    training_run.show()

| Field | Value |
| --- | --- |
| project | mask-detection |
| uid | ...a37a8e22 |
| iter | 0 |
| start | Aug 03 09:09:43 |
| state | completed |
| name | trainer-mpijob-run |
| labels | v3io_user=admin, kind=mpijob, owner=admin, mlrun/job=trainer-mpijob-run-f932eb02, host=trainer-mpijob-run-f932eb02-worker-0 |
| inputs | archive_url |
| parameters | dataset_path=/User/demos/mask-detection/Dataset, batch_size=32, lr=0.0001, epochs=3 |
| results | dataset_path=/User/demos/mask-detection/Dataset, batch_size=32, epochs=3, lr=0.0002799999783746898, training_loss=0.0443190336227417, training_accuracy=1.0, validation_loss=0.06190176804860433, validation_accuracy=0.9782608879937066 |
| artifacts | loss_summary.html, accuracy_summary.html, lr.html, model.h5, model |
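
The shorter epochs with 4 workers follow from simple arithmetic if the training samples are sharded evenly between the workers, as distributed samplers typically do. The sketch below assumes such even sharding and ignores communication overhead. The sample count of 1100 is an illustrative assumption, chosen to be consistent with the 35 batches of 32 shown in the local run log; whether MLRun's training function shards the data in exactly this way is also an assumption.

```python
import math


def batches_per_worker(num_samples: int, batch_size: int, workers: int) -> int:
    """Batches each worker runs per epoch when samples are sharded evenly."""
    samples_per_worker = math.ceil(num_samples / workers)
    return math.ceil(samples_per_worker / batch_size)


# Illustrative sample count (an assumption, see the text above):
num_samples = 1100
single = batches_per_worker(num_samples, batch_size=32, workers=1)
distributed = batches_per_worker(num_samples, batch_size=32, workers=4)
print(single, distributed)  # 35 9
```

Under these assumptions each worker steps through roughly a quarter of the batches per epoch, which is why the per-epoch time drops even though every sample is still seen once.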
