In [None]:
# # Imports
# import numpy as np

# from determined.experimental import Determined
# from models import ObjectDetectionModel
# from predict import predict
# from utils import check_model

# # remove warnings
# import warnings
# warnings.filterwarnings('ignore')

# # Set up .detignore file so the checkpoints directory is not packaged into future experiments
# !echo checkpoints > .detignore

<img src="https://raw.githubusercontent.com/determined-ai/determined/master/determined-logo.png" align='right' width=150 />

# Building an Object Detection Model from Infrared Sensors with MLDE

<img title="FLIR Image Example" src="https://storage.googleapis.com/roboflow-platform-sources/v2ONhxR3iuHqgfOUfmjO/5MjAkaP42DwQAO26W3em/original.jpg">


This notebook walks through the benefits of building a deep learning model with MLDE. We will build an object detection model trained on the [Self-Driving Thermal Object-Detection](https://universe.roboflow.com/thermal-imaging-0hwfw/flir-data-set) dataset.


# Table of Contents


<font size="3">
<ol>
  <li>What Modeling looks like Today</li>
  <li>Building a model with Determined
    <ol>
      <li>Single GPU training</li>
      <li>Cluster-scale multi-GPU training</li>
      <li>Adaptive hyperparameter search</li>
    </ol>
  </li>
</ol>
</font>

# What modeling looks like without Determined

First, let's look at the kind of work modelers do today. Below, we train a model that we found on GitHub and modified, printing validation set metrics after each epoch.

```python
from models import ObjectDetectionModel

NUM_EPOCHS = 10

model = ObjectDetectionModel({'lr': 0.00045, 'm': 0.72})

try:
    for epoch in range(NUM_EPOCHS):
        print(f"Training epoch {epoch + 1} of {NUM_EPOCHS}")
        model.train_one_epoch()
        iou = model.eval()
        print(f"Validation set average IoU: {iou}\n")
except KeyboardInterrupt:
    pass
```

We might also roll our own simple hyperparameter tuning:

```python
import numpy as np

from models import ObjectDetectionModel

def hp_grid_search():
    for lr in np.logspace(-4, -2, num=10):
        for m in np.linspace(0.7, 0.95, num=10):
            print(f"Training model with learning rate {lr} and momentum {m}")
            model = ObjectDetectionModel({'lr': lr, 'm': m})
            model.train_one_epoch()
            iou = model.eval()
            print(f"Validation set average IoU: {iou}\n")

try:
    hp_grid_search()
except KeyboardInterrupt:
    pass
```

# What's Missing?

<font size="4">This approach works in theory -- we could get a good model, save it, and use it for predictions.  But we're missing a lot from the ideal state:</font>
<font size="4">
<ul style="margin-top: 15px">
  <li style="margin-bottom: 10px">Distributed training</li>
  <li style="margin-bottom: 10px">Parallel search</li>
  <li style="margin-bottom: 10px">Intelligent checkpointing</li>
  <li style="margin-bottom: 10px">Interruptibility and fault tolerance</li>
  <li                            >Logging of experiment configurations and results </li>
</ul>
</font>
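
To make one of these gaps concrete, here is a minimal sketch of what hand-rolled checkpointing and resumption might look like for a loop like the one above. `StubModel`, `train_with_resume`, and the JSON checkpoint file are hypothetical stand-ins for illustration; they are not part of `models.py` or of Determined.

```python
import json
import os
import tempfile


class StubModel:
    """Hypothetical stand-in for ObjectDetectionModel; it only counts epochs."""

    def __init__(self):
        self.epochs_trained = 0

    def train_one_epoch(self):
        self.epochs_trained += 1

    def state(self):
        return {"epochs_trained": self.epochs_trained}

    def load_state(self, state):
        self.epochs_trained = state["epochs_trained"]


def train_with_resume(model, num_epochs, ckpt_path):
    """Train up to num_epochs in total, resuming from ckpt_path if it exists."""
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            model.load_state(json.load(f))
    for _ in range(model.epochs_trained, num_epochs):
        model.train_one_epoch()
        # Write to a temporary file and rename it, so an interrupt
        # never leaves a half-written checkpoint behind.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(model.state(), f)
        os.replace(tmp_path, ckpt_path)
    return model


ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
first = train_with_resume(StubModel(), num_epochs=3, ckpt_path=ckpt)
# A fresh model object resumes at epoch 3 and trains only the remaining 2 epochs.
resumed = train_with_resume(StubModel(), num_epochs=5, ckpt_path=ckpt)
print(first.epochs_trained, resumed.epochs_trained)
```

Even this toy version has to handle atomic writes and resume offsets by hand, and it still says nothing about which checkpoint is best, where checkpoints live, or how to recover on a different machine.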

<font size=6><b>Scaled Experimentation with Determined</b></font>

With less work than setting up a hand-rolled search like the one above, you can get started with Determined.

## Our First Experiment

Here is what our `configs/flir_training/const_fasterrcnn_flir.yaml` training config looks like. It trains a FasterRCNN model on the FLIR dataset:

```yaml
name: detectron2_const_fasterrcnn_flir
environment:
    image: "determinedai/example-detectron2:0.6-cuda-10.2-pytorch-1.10"
hyperparameters:
  global_batch_size: 8
  model_yaml: models/fast_rcnn_R_50_FPN_1x.yaml
  dataset_name: 'flir-camera-objects'
  output_dir: None
  fake_data: False
searcher:
  name: single
  metric: bboxAP
  max_length: 
    batches: 9000
  smaller_is_better: false
resources:
    slots_per_trial: 4
entrypoint: model_def:DetectronTrial
max_restarts: 0
min_validation_period:
  batches: 100
```
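
As a quick sanity check on the searcher settings, the sketch below mirrors a few values from this config in a plain Python dict (rather than parsing the YAML file) and derives how many samples the trial processes and the minimum number of validation runs. The dict and the helper name `summarize` are illustrative only.

```python
# Values copied by hand from the config above; nothing is read from disk.
config = {
    "hyperparameters": {"global_batch_size": 8},
    "searcher": {"max_length": {"batches": 9000}},
    "min_validation_period": {"batches": 100},
}


def summarize(cfg):
    """Derive training volume and validation count from the config values."""
    batches = cfg["searcher"]["max_length"]["batches"]
    batch_size = cfg["hyperparameters"]["global_batch_size"]
    val_period = cfg["min_validation_period"]["batches"]
    return {
        # Each batch contains global_batch_size samples across all GPUs.
        "samples_processed": batches * batch_size,
        # Validation runs at least once every val_period batches.
        "min_validation_runs": batches // val_period,
    }


summary = summarize(config)
print(summary)
```

With these values the trial sees 72,000 samples and reports `bboxAP` at least 90 times, which is what populates the metric chart in the web UI.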

For our first example, we run a simple single-GPU training job with fixed hyperparameters.

<img src="https://raw.githubusercontent.com/determined-ai/public_assets/main/images/StartAnExperiment.png" align=left width=330/>

In [None]:
!det e create configs/flir_training/const_fasterrcnn_flir.yaml .

And evaluate its performance:

In [None]:
experiment_id = 810

In [None]:
from determined.experimental import Determined
from determined import pytorch
from predict import predict

In [None]:
checkpoint = Determined().get_experiment(experiment_id).top_checkpoint()
path = checkpoint.download()

Here is a test image we will run model predictions on:

In [None]:
from PIL import Image
Image.open('test_flir.jpg')

In [None]:
# Run this cell to visualize model predictions
predict(ckpt_path=path,
            img_path='test_flir.jpg',
            confidence=0.05,
            yaml_path='/run/determined/workdir/demo_revamp/determined/examples/computer_vision/detectron2_coco_pytorch/models/fast_rcnn_R_50_FPN_1x.yaml',
            dataset_name='flir-camera-objects')
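
The source of `predict` is not shown in this notebook. Assuming its `confidence` argument acts as a score threshold, as is common in detection pipelines, the following sketch shows the effect of a low threshold such as 0.05. The `Detection` type, the helper name, and the sample scores are made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    """Hypothetical detection record: class label, score, and pixel box."""
    label: str
    score: float
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def filter_by_confidence(dets: List[Detection], threshold: float) -> List[Detection]:
    """Keep only detections whose score is at least the threshold."""
    return [d for d in dets if d.score >= threshold]


# Made-up detections, not output from the trained model.
detections = [
    Detection("person", 0.91, (10, 20, 50, 120)),
    Detection("car", 0.40, (60, 40, 200, 110)),
    Detection("bicycle", 0.03, (5, 5, 30, 30)),
]

kept_low = filter_by_confidence(detections, 0.05)
kept_high = filter_by_confidence(detections, 0.5)
print([d.label for d in kept_low])
print([d.label for d in kept_high])
```

A threshold of 0.05 keeps nearly everything, which is useful for inspecting a lightly trained model; a higher value trades recall for fewer false positives.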

## Scaling up to Distributed Training

Determined makes it trivial to move from single-GPU to multi-GPU (and even multi-node) training. The only setting you need to increase is `slots_per_trial`. Below is an example config, located in `configs/flir_training/dist_fasterrcnn_flir.yaml`. Here we adapt the config above to request 4 GPUs through `slots_per_trial` and raise the global batch size to 16 to increase data throughput.

```yaml
name: detectron2_distributed
environment:
    image: "determinedai/example-detectron2:0.6-cuda-10.2-pytorch-1.10"
    environment_variables:
      - DETECTRON2_DATASETS=/mnt/dtrain-fsx/detectron2
hyperparameters:
  global_batch_size: 16 # Detectron defaults to 16 regardless of N GPUs
  model_yaml: mask_rcnn_R_50_FPN_noaug_1x.yaml
  output_dir: None
  fake_data: False
searcher:
  name: single
  metric: bboxAP
  max_length: 
    batches: 90000
  smaller_is_better: false
resources:
    slots_per_trial: 4
    shm_size: 824600000000
entrypoint: model_def:DetectronTrial
bind_mounts:
  - host_path: /path/to/data
    container_path: /mnt/dtrain-fsx/detectron2
    read_only: true
min_validation_period:
  batches: 5000
```
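
To see how `global_batch_size` and `slots_per_trial` interact, the sketch below computes the batch size each GPU handles, assuming the global batch is split evenly across slots with integer division. The helper name `per_slot_batch_size` is chosen for this example.

```python
def per_slot_batch_size(global_batch_size: int, slots_per_trial: int) -> int:
    """Return the number of samples each slot (GPU) processes per batch."""
    if slots_per_trial < 1:
        raise ValueError("slots_per_trial must be at least 1")
    # Assumption: an even split using integer division.
    return global_batch_size // slots_per_trial


# Values taken from the two configs shown in this notebook.
const_per_slot = per_slot_batch_size(global_batch_size=8, slots_per_trial=4)
dist_per_slot = per_slot_batch_size(global_batch_size=16, slots_per_trial=4)
print(const_per_slot, dist_per_slot)
```

Doubling the global batch size at a fixed slot count doubles the work per GPU per step, so check that the larger per-slot batch still fits in GPU memory.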

In [None]:
!det experiment create configs/flir_training/dist_fasterrcnn_flir.yaml .

<img src="https://raw.githubusercontent.com/determined-ai/public_assets/main/images/4GPUexperiment.png" align=left width=530 />

## Run Distributed Hyperparameter Tuning

By simply building a config file and adapting our code to meet the Determined trial interface, we can conduct a sophisticated hyperparameter search. Instructions for how to configure different types of experiments [can be found in the Determined documentation](https://docs.determined.ai/latest/how-to/index.html). The config below (located at `configs/flir_training/search_fasterrcnn_flir.yaml`) runs a grid search over the candidate models to find which one performs best.

```yaml
name: detectron2_search_flir
environment:
    image: "determinedai/example-detectron2:0.6-cuda-10.2-pytorch-1.10"
hyperparameters:
  global_batch_size: 8
  model_yaml:
      type: categorical
      vals: ['models/fast_rcnn_R_50_FPN_1x.yaml','models/mask_rcnn_R_50_FPN_noaug_1x.yaml','models/cascade_mask_rcnn_R_50_FPN_3x.yaml']
  dataset_name: 'flir-camera-objects'
  output_dir: None
  fake_data: False
searcher:
  name: grid
  metric: bboxAP
  max_length: 
    batches: 9000
  smaller_is_better: false
resources:
    slots_per_trial: 4
entrypoint: model_def:DetectronTrial
max_restarts: 0
min_validation_period:
  batches: 100
```
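
Conceptually, a grid search expands every categorical hyperparameter into one trial per value. The sketch below illustrates that expansion for the hyperparameters in this config; `expand_grid` is a hypothetical helper written for this example and is not how Determined implements its searcher.

```python
import itertools

# Hyperparameters copied by hand from the config above.
hyperparameters = {
    "global_batch_size": 8,
    "model_yaml": {
        "type": "categorical",
        "vals": [
            "models/fast_rcnn_R_50_FPN_1x.yaml",
            "models/mask_rcnn_R_50_FPN_noaug_1x.yaml",
            "models/cascade_mask_rcnn_R_50_FPN_3x.yaml",
        ],
    },
    "dataset_name": "flir-camera-objects",
}


def expand_grid(hparams):
    """Return one dict of concrete hyperparameters per grid point."""
    names, choices = [], []
    for name, spec in hparams.items():
        names.append(name)
        if isinstance(spec, dict) and spec.get("type") == "categorical":
            choices.append(spec["vals"])
        else:
            choices.append([spec])  # constants contribute a single choice
    return [dict(zip(names, combo)) for combo in itertools.product(*choices)]


trials = expand_grid(hyperparameters)
slots_per_trial = 4
print(len(trials), len(trials) * slots_per_trial)
```

Three model choices yield three trials, and at 4 slots each they can occupy 12 GPUs if the cluster has capacity to run them concurrently.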

## Create your Experiment

Now that you've described your experiment, you'll simply need to use the command line interface to submit it to the Determined Cluster.  

In [None]:
!det experiment create configs/flir_training/search_fasterrcnn_flir.yaml .

<img src="https://raw.githubusercontent.com/determined-ai/public_assets/main/images/12GPUexperiment.png" align=left width=800 />

# Model Registry

After training, we'll want to actually use our model in some sort of system.  Determined provides a model registry to version your trained models, making them easy to retrieve for inference.

In [None]:
experiment_id = 810
MODEL_NAME = "flir_object_detection"

In [None]:
# Get the best checkpoint from the training
checkpoint = Determined().get_experiment(experiment_id).top_checkpoint()

In [None]:
from utils import check_model
model = check_model(MODEL_NAME)

In [None]:
model.register_version(checkpoint.uuid)
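
The source of `utils.check_model` is not shown here. One plausible reading is a "get or create" helper: return the registry entry if it exists, otherwise create it. The sketch below illustrates that pattern against a fake in-memory client. The method names mirror `get_model` and `create_model` on the Determined client, but `FakeClient`, `ModelNotFound`, and `get_or_create_model` are assumptions made for this example.

```python
class ModelNotFound(Exception):
    """Placeholder for whatever 'not found' error a real client raises."""


def get_or_create_model(client, name, not_found_error=ModelNotFound):
    """Return the registered model called name, creating it on first use."""
    try:
        return client.get_model(name)
    except not_found_error:
        return client.create_model(name)


class FakeClient:
    """In-memory stand-in for a registry client, for demonstration only."""

    def __init__(self):
        self.models = {}
        self.created = 0

    def get_model(self, name):
        if name not in self.models:
            raise ModelNotFound(name)
        return self.models[name]

    def create_model(self, name):
        self.created += 1
        self.models[name] = {"name": name, "versions": []}
        return self.models[name]


client = FakeClient()
first_lookup = get_or_create_model(client, "flir_object_detection")
second_lookup = get_or_create_model(client, "flir_object_detection")
print(first_lookup is second_lookup, client.created)
```

Making the lookup idempotent lets the registration cell be re-run safely: the second call finds the existing entry instead of failing on a duplicate name.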

# Inference

Once your model is versioned in the model registry, using that model for inference is straightforward:

In [None]:
# Retrieve the latest registered version of the model
latest_version = model.get_version()
print(latest_version)

In [None]:
# Download the checkpoint files for the latest registered version
from determined import pytorch

path = latest_version.checkpoint.download()

In [None]:
# Run inference as before
predict(ckpt_path=path,
            img_path='test_flir.jpg',
            confidence=0.05,
            yaml_path='/run/determined/workdir/demo_revamp/determined/examples/computer_vision/detectron2_coco_pytorch/models/fast_rcnn_R_50_FPN_1x.yaml',
            dataset_name='flir-camera-objects')