# Mask Detection Pegasus Workflow

This project addresses the problem of determining what percentage of the population is properly wearing masks, to better track our collective efforts in preventing the spread of COVID-19 in public spaces. To help solve this problem, we leverage modern deep learning tools such as the Optuna hyperparameter optimization framework and the [FastRCNNPredictor](https://arxiv.org/abs/1506.01497) model. The experiment is organized as a scientific workflow and uses the Pegasus Workflow Management System to handle its execution on distributed resources.


The main input dataset consists of **images of masks on faces** and **annotations** for each image, which classify every face into one of the following **three categories**:
* wearing a mask 
* not wearing a mask
* wearing a mask incorrectly

<img src="imgs/classes.png" style="width: 400px;"/>
<br>
<img src="imgs/sample_output.png" style="width: 600px;"/>

The dataset is split into training, validation, and test sets before the workflow starts. The **pre-processing** step resizes and normalizes the images so that the data is consistent across all classes and to help avoid class imbalance. Additionally, **image augmentation** is done by injecting Gaussian noise. Next, the train and validation data are passed to the **hyperparameter optimization** step, where different learning rates are explored. The **FastRCNN** model is then **trained** with the recommended learning rate on the concatenated train and validation sets to obtain the weights. Then the **evaluation** is performed on the test set in order to generate a txt file with the scores for relevant performance metrics such as the average running loss. Finally, **predictions** can be made on any user-supplied images using the trained model to show mask detection results.

**Machine Learning steps in the workflow :**
<br>
<img src="imgs/ml_steps3.png" style="width: 900px;"/>
<br>

<br>
<br>
<img src="imgs/mask_dectection_wf2.png" style="width: 1000px;"/>
<br>


## Container
All tools required to execute the jobs are included in the container available on Docker Hub:
<br>[Mask Detection Container](https://hub.docker.com/r/zaiyancse/mask-detection), which runs Python and uses the machine learning libraries defined in `bin/Dockerfile`:
* tensorflow==2.1.0
* optuna==2.0.0
* numpy==1.18.4
* torch
* pandas 
* opencv-python
* scikit-learn 
* pytorchtools
* matplotlib
* Pillow
* torchvision
* bs4

## Input Data
Sample input data is provided in `data`, containing images and annotations for training and testing.
<br>`data/images` **:** images for training on the three categories listed above, and for predictions on unseen images
<br>`data/annotations` **:** one XML annotation file per image, recording the category it belongs to
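
To make the annotation format more concrete, the following is a minimal sketch of reading one annotation file with the standard library. The XML layout (a Pascal VOC-like structure with `object`, `name`, and `bndbox` elements), the label strings, and the helper `parse_annotation` are assumptions made for illustration; check them against the files in the sample dataset before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in a Pascal VOC-like layout (assumed, not copied from the dataset)
SAMPLE_XML = """
<annotation>
  <filename>example0.png</filename>
  <object>
    <name>with_mask</name>
    <bndbox><xmin>79</xmin><ymin>105</ymin><xmax>109</xmax><ymax>142</ymax></bndbox>
  </object>
  <object>
    <name>without_mask</name>
    <bndbox><xmin>185</xmin><ymin>100</ymin><xmax>226</xmax><ymax>144</ymax></bndbox>
  </object>
</annotation>
"""

def parse_annotation(xml_text):
    """Return (filename, [(label, (xmin, ymin, xmax, ymax)), ...])."""
    root = ET.fromstring(xml_text.strip())
    objects = []
    for obj in root.findall("object"):
        label = obj.findtext("name")
        box = obj.find("bndbox")
        coords = tuple(int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((label, coords))
    return root.findtext("filename"), objects

filename, objects = parse_annotation(SAMPLE_XML)
for label, coords in objects:
    print(filename, label, coords)
```

Each face in an image contributes one label and one bounding box, which is the information a detection model needs for training.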


## Workflow
The workflow pre-processes the input data, then trains a deep learning model to detect masks on faces and classify them into one of the three categories. The following figure shows an overview of the design of the Pegasus workflow for mask detection and classification:

<img src="imgs/wf_graph3.png" style="width: 650px;"/>

<br>The jobs in the workflow are described in the table below:

| Job Label              | Description                                                    |
| -----------------------|----------------------------------------------------------------|
| preprocess_val         | data preprocessing for the validation set of images            |
| preprocess_test        | data preprocessing for the testing set of images               |
| preprocess_aug_train   | data augmentation of the training set images                   |
| plot_class_distribution| data exploration step to visualize class distribution          |
| hpo                    | hyperparameter optimization step for FastRCNN model            |
| train_model            | training the FastRCNN model and fine-tuning it                 |
| evaluate               | generates relevant performance metrics like running loss       |
| predict                | make final prediction for mask detection on a given input image|

## 1. Create the Mask Detection Workflow

By now, you have a good idea about the Pegasus Workflow API.
We now create the workflow for mask detection based on the figure above.

All workflow parameters have been set, along with the input dataset values. The workflow runs on the sample dataset, which is included in the repository under the `data` directory. The workflow parameters and input file locations are set at the beginning of the workflow.

In [None]:
import glob
import logging
import os
import pickle
import re
from pathlib import Path

from utils.wf import split_data_filenames, create_ann_list, create_augmented_filelist

logging.basicConfig(level=logging.DEBUG)

# --- Import Pegasus API -----------------------------------------------------------
from Pegasus.api import *

# --- Top Directory Setup ----------------------------------------------------------
top_dir = Path(".").resolve()

#### Data Acquisition and Splitting
The image dataset is provided in the `data/images` directory, along with the annotations (class labels for each image) in the `data/annotations` directory. The data is split as follows:
* 70% training
* 10% validation
* 20% testing

<img src="imgs/data_split.png" align="left" style="width: 220px;"/><img src="imgs/data_acquisition.png" style="width: 200px;"/>

In [None]:
# DATA ACQUISITION
imagesList = glob.glob('data/images/*.png')
predict_images = glob.glob('data/pred_imgs/*.png')
annotationList = glob.glob('data/annotations/*.xml')

# Small values keep the sample run short
NUM_TRIALS = 2   # number of hyperparameter optimization trials
NUM_EPOCHS = 1   # epochs passed to the hyperparameter optimization job

# DATA SPLIT
train_filenames, val_filenames, test_filenames, files_split_dict = split_data_filenames(imagesList)

# ANNOTATIONS
train_imgs, train_ann = create_ann_list(train_filenames)
val_imgs, val_ann     = create_ann_list(val_filenames)
test_imgs, test_ann   = create_ann_list(test_filenames)
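
The implementation of `split_data_filenames` lives in `utils/wf.py` and is not shown here. As a rough illustration of a 70/10/20 split, the sketch below shuffles a list of filenames with a fixed seed and cuts it into three parts. The function name `split_filenames`, its parameters, and the seed are hypothetical and are not the helper used by the workflow.

```python
import random

def split_filenames(filenames, train_frac=0.7, val_frac=0.1, seed=42):
    """Shuffle deterministically, then cut into train/val/test partitions."""
    shuffled = sorted(filenames)           # sort first so the result does not depend on glob order
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains, roughly 20%
    return train, val, test

names = [f"data/images/img{i}.png" for i in range(100)]
train, val, test = split_filenames(names)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the split reproducible between runs, which matters because the replica catalog and the job inputs are derived from these lists.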

**Note:** If you plan to train the model properly, be advised that the best results were obtained with 25 epochs and 4 trials. Because the workflow is designed to run on CPUs, such a run may need **12+ hours** of training time.

#### Creating replica catalog and properties
A replica catalog is created for all input images and annotation files. In addition, `dagman.retry` is set to support checkpointing: if a job fails or times out, Pegasus retries it up to the configured number of times (once with the value `"1"` used below) and uses the checkpoint file to restart the job.

In [None]:
######################################## PROPERTIES ###########################################################
props = Properties()
props["dagman.retry"] = "1"
props["pegasus.mode"] = "development"
props.write()


###################################### REPLICA CATALOG ###########################################################

rc = ReplicaCatalog()

inputFiles = []
for img in imagesList:
    fileName = img.split("/")[-1]
    img_file = File(fileName)
    inputFiles.append(img_file)
    rc.add_replica("local", img_file,  os.path.join(os.getcwd(),str(img)))

pred_imgs = []
for img in predict_images:
    fileName = img.split("/")[-1]
    img_file = File(fileName)
    pred_imgs.append(img_file)
    rc.add_replica("local", img_file,  os.path.join(os.getcwd(),str(img)))

annFiles = []
for ann in annotationList:
    fileName = ann.split("/")[-1]
    ann_file = File(fileName)
    annFiles.append(ann_file)
    rc.add_replica("local", ann_file,  os.path.join(os.getcwd(),str(ann)))

## add checkpointing file for the hpo model job
def create_pkl(model):
    pkl_filename = "hpo_study_" + model + ".pkl"
    # Append mode creates the file if it is missing; the context manager closes it
    with open(pkl_filename, 'ab') as file:
        pickle.dump("", file, pickle.HIGHEST_PROTOCOL)
    return pkl_filename

mask_detection_pkl = create_pkl("mask_detection")
mask_detection_pkl_file = File(mask_detection_pkl)
rc.add_replica("local", mask_detection_pkl, os.path.join(os.getcwd(), mask_detection_pkl))

fastRCNNP_pkl = create_pkl("fastRCNNP")
fastRCNNP_pkl_file = File(fastRCNNP_pkl)
rc.add_replica("local", fastRCNNP_pkl, os.path.join(os.getcwd(), fastRCNNP_pkl))

rc.write()
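
The placeholder pickle files above only reserve a name for the checkpoint; the job scripts decide what to store in them. The sketch below shows one way a job could resume from such a file and save its progress. The helpers `load_checkpoint` and `save_checkpoint` and the `completed_trials` field are hypothetical, and this is not the logic used in `hpo_train.py` or `train_model.py`.

```python
import os
import pickle
import tempfile

def load_checkpoint(path, default):
    """Return the first pickled object in `path`, or `default` if nothing usable is stored."""
    try:
        with open(path, "rb") as fh:
            state = pickle.load(fh)
    except (FileNotFoundError, EOFError, pickle.UnpicklingError):
        return default
    # The placeholder written before submission is an empty string: treat it as "no progress"
    if isinstance(state, str) and state == "":
        return default
    return state

def save_checkpoint(path, state):
    """Write to a temporary file first so an interrupted job never leaves a truncated checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as fh:
        pickle.dump(state, fh, pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, path)

workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "hpo_study_demo.pkl")

state = load_checkpoint(ckpt, default={"completed_trials": 0})
state["completed_trials"] += 1             # pretend one trial finished
save_checkpoint(ckpt, state)
resumed = load_checkpoint(ckpt, default={"completed_trials": 0})
print(resumed)
```

When Pegasus retries a failed job, a script written along these lines would pick up the saved state instead of starting from scratch.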


#### Creating the Transformation catalog

* `mask_detection_wf` : the container holding all tools and libraries needed to execute a job
* `plot_class_distribution.py` : data exploration to visualize class distribution
* `data_aug.py` : data augmentation of the training set images, using Gaussian noise
* `rename_file.py` : renames the image file with ***test*** or ***val*** prefixes
* `hpo_train.py` : hyperparameter optimization for FastRCNN model
* `train_model.py` : training the FastRCNN model and fine-tuning it
* `evaluate.py` : generates performance metrics for the trained model
* `predict.py` : makes the final mask detection prediction on a given input image

In [None]:
###################################### TRANSFORMATIONS ###########################################################

# Container for all the jobs
tc = TransformationCatalog()
mask_detection_wf_cont = Container(
                "mask_detection_wf",
                Container.SINGULARITY,
                image="docker://pegasus/mask-detection:latest",
                image_site="docker_hub"
            )

tc.add_containers(mask_detection_wf_cont)


dist_plot = Transformation(
                "dist_plot",
                site = "local",
                pfn = top_dir/"bin/plot_class_distribution.py",
                is_stageable = True,
                container = mask_detection_wf_cont
            )

augment_imgs = Transformation(
                "augment_images",
                site = "local",
                pfn = top_dir/"bin/data_aug.py",
                is_stageable = True,
                container = mask_detection_wf_cont
            )

rename_imgs = Transformation(
                "rename_images",
                site = "local",
                pfn = top_dir/"bin/rename_file.py",
                is_stageable = True,
                container = mask_detection_wf_cont
            )

hpo_model = Transformation(
                "hpo_script",
                site = "local",
                pfn = top_dir/"bin/hpo_train.py",
                is_stageable = True,
                container = mask_detection_wf_cont
            )

train_model = Transformation(
                "train_script",
                site = "local",
                pfn = top_dir/"bin/train_model.py",
                is_stageable = True,
                container = mask_detection_wf_cont
            )

evaluate_model = Transformation(
                "evaluate_script",
                site = "local",
                pfn = top_dir/"bin/evaluate.py",
                is_stageable = True,
                container = mask_detection_wf_cont
            )

predict_detection = Transformation(
                "predict_script",
                site = "local",
                pfn = top_dir/"bin/predict.py",
                is_stageable = True,
                container = mask_detection_wf_cont
            )


tc.add_transformations(augment_imgs, dist_plot, rename_imgs, hpo_model, train_model, evaluate_model, predict_detection)
logging.info("writing tc with transformations: {}, containers: {}".format([k for k in tc.transformations], [k for k in tc.containers]))
tc.write()
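
Every transformation above points at a script under `bin/`, so a missing or misnamed script can be caught before planning with a small pre-flight check. The sketch below is an optional addition, not part of the Pegasus API; the helper `missing_scripts` is hypothetical and the demonstration runs against a temporary directory.

```python
import tempfile
from pathlib import Path

# Script names as registered in the transformation catalog
SCRIPTS = ["plot_class_distribution.py", "data_aug.py", "rename_file.py",
           "hpo_train.py", "train_model.py", "evaluate.py", "predict.py"]

def missing_scripts(bin_dir, names=SCRIPTS):
    """Return the script names that are not present as regular files in `bin_dir`."""
    bin_dir = Path(bin_dir)
    return [name for name in names if not (bin_dir / name).is_file()]

# Demonstration on a temporary directory that lacks the last script
demo_bin = Path(tempfile.mkdtemp())
for name in SCRIPTS[:-1]:
    (demo_bin / name).write_text("#!/usr/bin/env python3\n")
print(missing_scripts(demo_bin))
```

In the notebook, calling `missing_scripts(top_dir / "bin")` before writing the catalog should return an empty list.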

#### Creating jobs and adding them to the workflow

In [None]:
###################################### CREATE JOBS ###########################################################
wf = Workflow("mask_detection_workflow")

train_preprocessed_files = create_augmented_filelist(train_filenames,2)
distribution_plot_file = File("class_distribution.png")
val_preprocessed_files = [File("val_"+ f.split("/")[-1]) for f in val_filenames]
test_preprocessed_files = [File("test_"+ f.split("/")[-1]) for f in test_filenames]

#### Data Exploration
Takes in all the annotation files and creates a plot of the class distribution. This helps detect class imbalance, which can bias the classification of images into the three categories discussed above.

In [None]:
distribution_plot_job = Job(dist_plot)
distribution_plot_job.add_args(distribution_plot_file)
distribution_plot_job.add_inputs(*train_ann, *val_ann, *test_ann)
distribution_plot_job.add_outputs(distribution_plot_file)
wf.add_jobs(distribution_plot_job)
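
As an illustration of the idea behind this job, the following sketch counts how often each class occurs and reports its share of all annotated faces. The label strings and the helper `class_distribution` are assumptions; `plot_class_distribution.py` may compute and plot the distribution differently.

```python
from collections import Counter

# Hypothetical per-image label lists, e.g. as extracted from the XML annotations
labels_per_image = [
    ["with_mask", "with_mask", "without_mask"],
    ["with_mask"],
    ["mask_weared_incorrect", "with_mask"],
]

def class_distribution(labels_per_image):
    """Return {label: (count, share_of_total)} over all annotated faces."""
    counts = Counter(label for labels in labels_per_image for label in labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

dist = class_distribution(labels_per_image)
for label, (n, share) in sorted(dist.items()):
    print(f"{label:<24}{n:>4}{share:>8.1%}")
```

A class with a much smaller share than the others is a sign that the model may see too few examples of it during training.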

#### Data Preprocessing
Image augmentation is done on the training images by adding Gaussian noise to them. This helps to normalize the training dataset and maintain consistency.

<img src="imgs/data_augmentation.png" style="width: 350px;"/>

In addition, the test set and validation set images are renamed with the prefixes ***test*** and ***val***, which are passed to the script as input arguments.

In [None]:
# TRAIN DATA AUGMENTATION
preprocess_train_job = Job(augment_imgs)
preprocess_train_job.add_inputs(*train_imgs)
preprocess_train_job.add_outputs(*train_preprocessed_files,stage_out=False)
wf.add_jobs(preprocess_train_job)

# VAL DATA-FILE RENAMING
preprocess_val_job = Job(rename_imgs)
preprocess_val_job.add_inputs(*val_imgs)
preprocess_val_job.add_outputs(*val_preprocessed_files,stage_out=False)
preprocess_val_job.add_args("val")
wf.add_jobs(preprocess_val_job)

# TEST DATA-FILE RENAMING
preprocess_test_job = Job(rename_imgs)
preprocess_test_job.add_inputs(*test_imgs)
preprocess_test_job.add_outputs(*test_preprocessed_files,stage_out=False)
preprocess_test_job.add_args("test")
wf.add_jobs(preprocess_test_job)
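
To show what Gaussian noise augmentation does at the pixel level, here is a minimal sketch that works on a small grid of grayscale values using only the standard library. The function `add_gaussian_noise`, the noise level `sigma`, and the toy image are assumptions for illustration; `data_aug.py` may operate on full color images with a different library and different parameters.

```python
import random

def add_gaussian_noise(pixels, sigma=10.0, seed=None):
    """Add zero-mean Gaussian noise to a 2D grid of 0-255 grayscale values."""
    rng = random.Random(seed)
    noisy = []
    for row in pixels:
        noisy_row = []
        for value in row:
            perturbed = value + rng.gauss(0.0, sigma)
            # Round and clamp so the result is still a valid 8-bit intensity
            noisy_row.append(min(255, max(0, round(perturbed))))
        noisy.append(noisy_row)
    return noisy

image = [[0, 64, 128], [192, 255, 32]]    # tiny stand-in for a real image
noisy_image = add_gaussian_noise(image, sigma=10.0, seed=0)
for row in noisy_image:
    print(row)
```

The noisy copy keeps the same annotations as the original image, so each augmented file adds a training example without extra labeling effort.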

####  Hyperparameter optimization model
The hyperparameter optimization library `optuna` is used to find a suitable learning rate, backbone model for transfer learning, and more. The output is saved to a `txt` file and then used by the model training step.
<img src="imgs/hpo.png" style="width: 350px;"/>

In [None]:
hpo_params = File("best_hpo_params.txt")
hpo_job = Job(hpo_model)
hpo_job.add_args("--epochs",NUM_EPOCHS, "--trials", NUM_TRIALS)
hpo_job.add_inputs(*train_preprocessed_files,*train_ann,*val_preprocessed_files,*val_ann)
hpo_job.add_outputs(hpo_params)
hpo_job.add_checkpoint(mask_detection_pkl_file, stage_out=True)
hpo_job.add_pegasus_profile(cores=8, memory="12 GB", runtime=14400)
wf.add_jobs(hpo_job)
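
The search itself happens inside `hpo_train.py` with Optuna. To illustrate the underlying idea without that dependency, the sketch below runs a plain random search over learning rates sampled on a logarithmic scale. This is a simpler stand-in technique, not Optuna's sampler; the objective `toy_validation_loss`, the search range, and the function `random_search` are made up for the example.

```python
import math
import random

def toy_validation_loss(lr):
    """Stand-in objective with its minimum at lr = 1e-3 (purely illustrative)."""
    return (math.log10(lr) + 3.0) ** 2

def random_search(objective, n_trials, low=1e-5, high=1e-1, seed=0):
    """Sample learning rates log-uniformly and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_lr, best_loss = None, float("inf")
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(math.log10(low), math.log10(high))
        loss = objective(lr)
        if loss < best_loss:
            best_lr, best_loss = lr, loss
    return best_lr, best_loss

best_lr, best_loss = random_search(toy_validation_loss, n_trials=20)
print(f"best learning rate: {best_lr:.2e} (loss {best_loss:.4f})")
```

In the real job each trial trains the model for `NUM_EPOCHS` epochs, which is why the number of trials dominates the runtime of this step.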

#### Model training

Using the optimum hyperparameters from the last step, we train the model using the validation, testing and training image sets. The entire model is saved in `mask_detection_model.pth`, which can be used to make inferences.

<img src="imgs/training.png" style="width: 400px;"/>

In [None]:
model_file = File("mask_detection_model.pth")
model_training_job = Job(train_model)
model_training_job.add_args(hpo_params,model_file)
model_training_job.add_inputs(hpo_params,*train_imgs,
                              *train_preprocessed_files, *val_preprocessed_files,
                              *test_preprocessed_files, *annFiles)
model_training_job.add_checkpoint(fastRCNNP_pkl_file, stage_out=True)
model_training_job.add_outputs(model_file)
wf.add_jobs(model_training_job)

**Note:** If you plan to train the model properly, be advised that the best results were obtained with 25 epochs of training (which can be edited in `train_model.py`). Because the workflow is designed to run on CPUs, such a run may need **12+ hours** of training time.

#### Model evaluation

Evaluate the performance of the final model using the test data. A confusion matrix is used to evaluate the classification labels, and its plot is provided in `confusion_matrix.png`. The running loss of the trained model is provided in the `evaluation.txt` file.

<img src="imgs/cm.png" style="width: 400px;"/>

In [None]:
confusion_matrix_file = File("confusion_matrix.png")
evaluation_file = File("evaluation.txt")
model_evaluating_job = Job(evaluate_model)
model_evaluating_job.add_args(model_file,evaluation_file,confusion_matrix_file)
model_evaluating_job.add_inputs(model_file,*test_preprocessed_files, *annFiles)
model_evaluating_job.add_outputs(evaluation_file,confusion_matrix_file)
model_evaluating_job.add_pegasus_profile(cores=8, memory="12 GB", runtime=14400)
wf.add_jobs(model_evaluating_job)
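
For readers unfamiliar with confusion matrices, the following sketch builds one for three classes from lists of true and predicted labels. The label strings, the sample data, and the helper `confusion_matrix` are assumptions for illustration; `evaluate.py` may rely on a library routine instead.

```python
CLASSES = ["with_mask", "without_mask", "mask_weared_incorrect"]  # assumed label names

def confusion_matrix(y_true, y_pred, classes=CLASSES):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix

y_true = ["with_mask", "with_mask", "without_mask", "mask_weared_incorrect", "with_mask"]
y_pred = ["with_mask", "without_mask", "without_mask", "with_mask", "with_mask"]

cm = confusion_matrix(y_true, y_pred)
accuracy = sum(cm[i][i] for i in range(len(CLASSES))) / len(y_true)
for label, row in zip(CLASSES, cm):
    print(f"{label:<24}{row}")
print(f"accuracy: {accuracy:.2f}")
```

Entries on the diagonal are correct classifications; off-diagonal entries show which classes the model confuses with each other.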

#### Prediction

Predictions can be made using new images placed in `data/pred_imgs` with the "_pred__" prefix. The predicted image, with mask detections and predicted classes, is written to `predicted_image.png`. In addition, `predictions.txt` contains the **confidence scores** of the predicted classes along with the detections, which you can plot on images using your own methods. The following figure shows, for reference, the accuracy of the predictions made by models trained for different numbers of epochs.

<img src="imgs/prediction.png" style="width: 900px;"/>

In [None]:
predicted_image = File("predicted_image.png")
predicted_classes = File("predictions.txt")
predict_detection_job = Job(predict_detection)
predict_detection_job.add_args(model_file,predicted_image,predicted_classes)
predict_detection_job.add_inputs(model_file,*pred_imgs, *annFiles)
predict_detection_job.add_outputs(predicted_image,predicted_classes)
predict_detection_job.add_pegasus_profile(cores=8, memory="12 GB", runtime=14400)
wf.add_jobs(predict_detection_job)
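
When post-processing the confidence scores yourself, a common first step is to drop low-confidence detections before drawing them. The sketch below assumes each detection is a `(label, confidence, box)` tuple; this structure, the sample values, and the threshold are hypothetical, and the actual layout of `predictions.txt` may differ.

```python
# Hypothetical detections: (label, confidence, (xmin, ymin, xmax, ymax))
detections = [
    ("with_mask", 0.97, (79, 105, 109, 142)),
    ("without_mask", 0.41, (185, 100, 226, 144)),
    ("mask_weared_incorrect", 0.78, (325, 90, 360, 141)),
]

def filter_detections(detections, min_confidence=0.5):
    """Keep detections at or above the threshold, most confident first."""
    kept = [d for d in detections if d[1] >= min_confidence]
    return sorted(kept, key=lambda d: d[1], reverse=True)

confident = filter_detections(detections)
for label, score, box in confident:
    print(f"{label}: {score:.2f} at {box}")
```

Raising the threshold reduces false positives at the cost of missing some faces, so the right value depends on how the results will be used.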

## 2. Plan and Submit the Workflow

We will now plan and submit the workflow for execution. By default, jobs run on the **condorpool** site, i.e., the selected ACCESS resource.

In [None]:
try:
    wf.plan(submit=True)
except PegasusClientError as e:
    print(e.output)

After the workflow has been successfully planned and submitted, you can use the Python `Workflow` object to monitor its status. The status output shows the number of jobs in each state, including whether jobs are idle or running.

In [None]:
wf.status()

## 3. Launch Pilot Jobs on ACCESS resources

At this point you should have some idle jobs in the queue. They are idle because there are no resources to execute on yet. Resources can be brought in with the HTCondor Annex tool, by sending pilot jobs (also called glideins) to the ACCESS resource providers. These pilots have the following properties:

* A pilot can run multiple user jobs: it stays active until no more user jobs are available or until its end of life has been reached, whichever comes first.
* A pilot is partitionable: job slots are created dynamically based on the resource requirements of the user jobs. This means you can fit multiple user jobs on a compute node at the same time.
* A pilot will only run jobs for the user who started it.

The process of starting pilots is described in the [ACCESS Pegasus Documentation](https://xsedetoaccess.ccs.uky.edu/confluence/redirect/ACCESS+Pegasus.html).

## 4. Wait for the workflow to finish

In [None]:
wf.wait()

## 5. Statistics

Depending on whether the workflow finished successfully or not, you have options on what to do next. If the workflow failed, you can use `wf.analyze()` to get help finding out what went wrong. If the workflow finished successfully, we can pull out some statistics from the provenance database:

In [None]:
wf.statistics()