# Project 1.1: Object Detection in Urban Environments
## Model Setup and Training
#### By Jonathan L. Moran (jonathan.moran107@gmail.com)
From the Self-Driving Car Engineer Nanodegree programme offered at Udacity.

## 1. Introduction

In [None]:
!sudo python -m pip install --upgrade pip

In [None]:
!sudo pip uninstall -y protobuf
!sudo pip uninstall -y google
!sudo pip install google
!sudo pip install protobuf
!sudo pip install google-cloud
!pip install ray
!pip install omegaconf
!pip install hydra-core
!pip install packaging
!pip install importlib-resources

In [None]:
!python -m pip install waymo-open-dataset-tf-2-3-0

In [None]:
### Importing the required modules
# Doing this here to test installation

In [None]:
import datetime
import glob
import google.protobuf
import hydra
import numpy as np
import os
import ray
import sys
from tensorboard import notebook
import tensorflow as tf
import waymo_open_dataset

In [None]:
!python --version

In [None]:
ray.__version__

In [None]:
tf.__version__

In [None]:
np.__version__

In [None]:
### Setting the environment variables

In [None]:
ENV_COLAB = False               # True if running in Google Colab instance

In [None]:
# Root directory
DIR_BASE = '' if not ENV_COLAB else '/content/1-1-Object-Detection-in-Urban-Environments'
DIR_BASE = os.path.abspath(DIR_BASE)
DIR_BASE

In [None]:
# Subdirectory to save output files
DIR_OUT = os.path.join(DIR_BASE, 'out')
# Subdirectory pointing to input data
DIR_SRC = os.path.join(DIR_BASE, 'data')

In [None]:
### Creating subdirectories (if not exists)
os.makedirs(DIR_SRC, exist_ok=True)
os.makedirs(DIR_OUT, exist_ok=True)

In [None]:
# Subdirectory to model folder
DIR_MODEL = os.path.join(DIR_BASE, 'experiments/pretrained_model/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8')

In [None]:
# Load the TensorBoard notebook extension (if using Colab)
#%load_ext tensorboard

In [None]:
### Downloading the Google Cloud CLI (if folder doesn't already exist)
if not os.path.exists(os.path.join(DIR_BASE, 'addons/google-cloud-sdk')):
    # Download Cloud CLI tools from Google
    !curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-405.0.0-linux-x86_64.tar.gz
    # Unzip to addons folder
    !tar -xf google-cloud-cli-405.0.0-linux-x86_64.tar.gz -C addons/
### Setting up Google Cloud CLI tools (run commands inside interactive shell)
# ./addons/google-cloud-sdk/install.sh
# ./gcloud init
### Authenticate the service account
# ./gcloud auth activate-service-account [service-email] --key-file="[path/to/key-file]"

In [None]:
### Sweeping the working directory for Python modules

In [None]:
!pip install --editable {DIR_BASE}

### 1.1. Setup and Data Acquisition

In this section of the notebook we will be fetching the Waymo Open Dataset files from their Google Cloud Storage bucket locations. To view the file paths we will be downloading, see `filenames.txt` inside the `data/waymo_open_dataset` folder.


#### Environment setup
The Python files inside `/scripts/..` have been modified to work on the Linux Ubuntu VM provided in the Udacity workspace. Please see previous commits of this repository to obtain script files suited for macOS and Google Colab. As of now, the Ubuntu VM is running Python version 3.7.3, TensorFlow 2.3.0 and Waymo Open Dataset version `tf-2-3-0==1.4.0`. The other dependency versions should be checked for conflicts on any other machine.

Running the `!pip install --editable {DIR_BASE}` command installs this project repository in editable (development) mode, which makes its modules importable from the Python path. This is the recommended way to resolve `PYTHONPATH` issues, and is preferable to editing the `~/.bashrc` file or the `os.environ['PYTHONPATH']`/`sys.path` variables.
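
One quick way to confirm that the editable install took effect is to ask the import system whether a module can be located. The sketch below uses only the standard library. `is_importable` is a hypothetical helper, not part of this repository, and the module name passed to it would need to be replaced with whatever packages the project's `setup.py` actually declares.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located on the current Python path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised for dotted names whose parent package is missing, or for malformed names
        return False


# `json` ships with the standard library, so it serves as a sanity check here;
# substitute one of the project's own module names after the editable install.
print(is_importable("json"))
```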

#### Data acquisition

The `download_process.py` script will fetch the `.tfrecord` files from GCS and store them locally in the `data/waymo_open_dataset/raw` subdirectory to be processed. The raw `.tfrecord` files are parsed in `process_tfr`; the images, bounding box labels and attribute data are stored in a dictionary-like object and converted to `tf.data.TFRecordDataset` instances. After the files have been converted, their originals are deleted from inside the `raw` folder.

Lastly, we will split the collected data into train, test and validation subsets. By default, 80% of the files are assigned to train and 20% to test; 20% of the resulting train set is then withheld for validation, which gives roughly 64%/16%/20% train/validation/test overall. The split sizes can be customised inside the `configs/dataset/waymo_open_dataset.yaml` configuration file.
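
To make the two-stage split arithmetic concrete, here is a minimal sketch. `split_counts` is a hypothetical helper, and the rounding it uses (truncation toward zero) is an assumption; the actual `create_splits.py` script may round or shuffle differently.

```python
def split_counts(n_files: int,
                 train_test_split: float = 0.8,
                 train_val_split: float = 0.8):
    """Return (n_train, n_val, n_test) for a two-stage split of `n_files`."""
    # Stage 1: hold out the test set from all files
    n_train_full = int(n_files * train_test_split)
    n_test = n_files - n_train_full
    # Stage 2: hold out the validation set from the remaining training files
    n_train = int(n_train_full * train_val_split)
    n_val = n_train_full - n_train
    return n_train, n_val, n_test


print(split_counts(100))  # (64, 16, 20)
```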


**Disclaimer**: A lot of effort has been put in by me (the author of this notebook, Jonathan L. Moran) to mitigate platform issues between macOS/Ubuntu/Google Colab instances. Many hurdles prevent one from currently utilising the Udacity VM to carry out the full extent of this project. I'm doing my best to work with the Udacity mentors/staff to resolve these issues as they come up. If you are able to execute the project on a local setup with GPU/TPU hardware acceleration, please let me know. 

## 2. Programming Task

### 2.1. Data Acquisition

In [None]:
!python3 experiments/testing_configs.py --help

**Note**: If you see the above help message/welcome screen -- congratulations! You've set up the project successfully.

#### Downloading and processing

To download and process the `.tfrecord` files from Google Cloud Storage into `tf.data.TFRecordDataset` instances, run:

    ```python
    python3 download_process.py
    ```
with none/any/all of the following parameters:
```
        DATA_DIR:        str         Path to the `data` directory to download files to.
        LABEL_MAP_PATH:  str         Path to the dataset `label_map.pbtxt` file.
        SIZE:            int         Number of `.tfrecord` files to download from GCS.
```
Overriding parameters globally is accomplished at runtime using the Basic Override syntax provided by Hydra:

    ```python
    python3 download_process.py \
        dataset.data_dir={DATA_DIR} \
        dataset.label_map_path={LABEL_MAP_PATH} \
        dataset.size={SIZE}
    ```
See `configs/dataset/` for additional details on preconfigured values.
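
The effect of a dotted `key=value` override can be illustrated with a small standard-library sketch. This is not Hydra's implementation (Hydra also handles type conversion, validation and config groups); `apply_overrides` and the default values shown are hypothetical and serve only to show how a dotted key addresses a nested configuration entry.

```python
def apply_overrides(config: dict, overrides: list) -> dict:
    """Merge `key.path=value` strings into a nested dict, in place.

    Values are kept as strings; no type conversion is attempted.
    """
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Walk down the hierarchy, creating levels that do not exist yet
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


# Hypothetical defaults, standing in for the values under `configs/dataset/`
defaults = {"dataset": {"size": "100", "data_dir": "data"}}
print(apply_overrides(defaults, ["dataset.size=1"]))
```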

In [None]:
DATA_DIR = DIR_SRC
LABEL_MAP_PATH = os.path.join(DIR_SRC, 'waymo_open_dataset/label_map.pbtxt')
SIZE = 1

In [None]:
!python3 scripts/preprocessing/download_process.py \
    dataset.data_dir="{DATA_DIR}/waymo_open_dataset" \
    dataset.label_map_path={LABEL_MAP_PATH} \
    dataset.size={SIZE}

#### Splitting the dataset

To split the downloaded data into three subsets `train`, `val`, and `test`, run:
    
    ```python
    python3 create_splits.py
    ```

with none/any/all of the following:
```
        DATA_DIR:           str         Path to the source `data` directory.
        TRAIN:              str         Path to the `train` data directory.
        TEST:               str         Path to the `test` data directory.
        VAL:                str         Path to the `val` data directory.
        TRAIN_TEST_SPLIT:   float       Percent as [0, 1] to split train/test.
        TRAIN_VAL_SPLIT:    float       Percent as [0, 1] to split train/val.
```
Overriding parameters globally is accomplished at runtime using the Basic Override syntax provided by Hydra:

    ```python
    python3 create_splits.py \
        dataset.data_dir={DATA_DIR} \
        dataset.train={TRAIN} dataset.test={TEST} dataset.val={VAL} \
        dataset.train_test_split={TRAIN_TEST_SPLIT} \
        dataset.train_val_split={TRAIN_VAL_SPLIT}
    ```
See `configs/dataset/` for additional details on preconfigured values.

In [None]:
TRAIN = os.path.join(DIR_SRC, 'waymo_open_dataset/split/train')
TEST = os.path.join(DIR_SRC, 'waymo_open_dataset/split/test')
VAL = os.path.join(DIR_SRC, 'waymo_open_dataset/split/val')
TRAIN_TEST_SPLIT = 0.8    # 80/20 train/test split
TRAIN_VAL_SPLIT = 0.8     # 80/20 train/val split (performed on split train set)

In [None]:
!python3 create_splits.py \
    dataset.train={TRAIN} \
    dataset.test={TEST} \
    dataset.val={VAL} \
    dataset.train_test_split={TRAIN_TEST_SPLIT} \
    dataset.train_val_split={TRAIN_VAL_SPLIT}

**Note**: we will be skipping the data acquisition step for now, since the processed `.tfrecord` files have been provided to us in the `/home/workspace/data` folder.

Let's see how many files we have in each subset:

In [None]:
N_TRAIN = len(glob.glob(f"{TRAIN}/*.tfrecord"))
N_TEST = len(glob.glob(f"{TEST}/*.tfrecord"))
N_VAL = len(glob.glob(f"{VAL}/*.tfrecord"))
print('Number of training `segment` files:  ', N_TRAIN, 
      '\nNumber of testing `segment` files:   ', N_TEST,
      '\nNumber of validation `segment` files:', N_VAL)

### 2.2. Training the model

#### Modifying the config file

To configure the model for training, run:

    ```python
    python3 edit_config.py
    ```

with none/any/all of the following parameters:
```
    TRAIN:                                  str         Path to the `train` data directory.
    TEST:                                   str         Path to the `test` data directory.
    VAL:                                    str         Path to the `val` data directory.
    BATCH_SIZE:                             int         Number of examples to process per iteration.
    CHECKPOINT_DIR:                         str         Path to the pre-trained `checkpoint` folder.
    LABEL_MAP_PATH:                         str         Path to the dataset `label_map.pbtxt` file.
    PIPELINE_CONFIG_PATH:                   str         Path to the `pipeline.config` file to modify.
```

Overriding parameters globally is accomplished at runtime using the Basic Override syntax provided by Hydra:

    ```python
    python3 edit_config.py \
        dataset.train={TRAIN} dataset.test={TEST} dataset.val={VAL} \
        dataset.label_map_path={LABEL_MAP_PATH} \
        hyperparameters.batch_size={BATCH_SIZE} \
        model.checkpoint_dir={CHECKPOINT_DIR} \
        model.pipeline_config_path={PIPELINE_CONFIG_PATH}
    ```
See `configs/model/` for additional details on preconfigured values.

In [None]:
BATCH_SIZE = 2     # Modify to something reasonable for your training setup
CHECKPOINT_DIR = os.path.join(DIR_MODEL, 'tmp/model_outputs/ckpt-2')   # Starting with pre-trained weights checkpoint
PIPELINE_CONFIG_PATH = os.path.join(DIR_MODEL, 'pipeline.config')

In [None]:
CHECKPOINT_DIR

In [None]:
PIPELINE_CONFIG_PATH

In [None]:
!python3 scripts/training/edit_config.py \
    dataset.train={TRAIN} dataset.test={TEST} dataset.val={VAL} \
    dataset.label_map_path={LABEL_MAP_PATH} \
    hyperparameters.batch_size={BATCH_SIZE} \
    model.checkpoint_dir={CHECKPOINT_DIR} \
    model.pipeline_config_path={PIPELINE_CONFIG_PATH}

#### Running the training loop

For local training/evaluation run:

    ```python
    python3 model_main_tf2.py
    ```

with none/any/all of the following parameters:
```
    DIR_BASE:                               str         Path to the current `model` subdirectory.
    MODEL_OUT:                              str         Path to the `/tmp/model_outputs` folder.
    CHECKPOINT_DIR:                         str         Path to the pretrained weights/variables saved in `checkpoint` folder.
    PIPELINE_CONFIG_PATH:                   str         Path to the `pipeline_new.config` file.
    NUM_TRAIN_STEPS:                        int         Number of training steps (batch iterations) to perform. 
    EVAL_ON_TRAIN_DATA:                     bool        If True, will evaluate on training data (only supported in distributed training).
    SAMPLE_1_OF_N_EVAL_EXAMPLES:            int         Number of evaluation samples to skip (will sample 1 of every n samples per batch).
    SAMPLE_1_OF_N_EVAL_ON_TRAIN_EXAMPLES:   int         Number of training samples to skip (only used if `eval_on_train_data` is True).
    EVAL_TIMEOUT:                           int         Number of seconds to wait for an evaluation checkpoint before exiting.
    USE_TPU:                                bool        Whether or not the job is executing on a TPU.
    TPU_NAME:                               str         Name of the Cloud TPU for Cluster Resolvers.
    CHECKPOINT_EVERY_N:                     int         Defines how often to checkpoint (every n steps).
    RECORD_SUMMARIES:                       bool        Whether or not to record summaries defined by the model or training pipeline.
    NUM_WORKERS:                            int         When `num_workers` > 1, training uses 'MultiWorkerMirroredStrategy',
                                                        When `num_workers` = 1, training uses 'MirroredStrategy'.
```
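
The relationship between `USE_TPU`, `NUM_WORKERS` and the distribution strategy can be summarised as a small decision function. The sketch below only returns the strategy name and does not touch TensorFlow. `select_strategy_name` is a hypothetical helper; the worker-count branches follow the parameter table above, while giving the TPU branch priority is an assumption about how the training script behaves.

```python
def select_strategy_name(use_tpu: bool, num_workers: int) -> str:
    """Return the name of the `tf.distribute` strategy implied by the flags."""
    if use_tpu:
        # Assumption: a TPU job takes precedence over the worker count
        return "TPUStrategy"
    if num_workers > 1:
        return "MultiWorkerMirroredStrategy"
    return "MirroredStrategy"


print(select_strategy_name(use_tpu=False, num_workers=1))  # MirroredStrategy
```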

Overriding parameters globally is accomplished at runtime using the Basic Override syntax provided by Hydra:

    ```python
    python3 model_main_tf2.py \
        model.pipeline_config_path={PIPELINE_CONFIG_PATH} \
        model.model_out={MODEL_OUT} model.num_train_steps={NUM_TRAIN_STEPS} \
        model.sample_1_of_n_eval_examples={SAMPLE_1_OF_N_EVAL_EXAMPLES} \
        ...
    ```
See `configs/model/` for additional details on preconfigured values if running without parameters.

In [None]:
DIR_BASE = DIR_MODEL
MODEL_OUT = os.path.join(DIR_BASE, 'tmp/model_outputs')
PIPELINE_CONFIG_PATH = os.path.join(DIR_BASE, 'pipeline_new.config')
EPOCHS = 450
NUM_TRAIN_STEPS = N_TRAIN // BATCH_SIZE * EPOCHS
EVAL_ON_TRAIN_DATA = False
SAMPLE_1_OF_N_EVAL_EXAMPLES = 1
EVAL_TIMEOUT = 5000    # NOTE: should be sufficiently long to wait until each training checkpoint is complete
USE_TPU = False
TPU_NAME = None
RECORD_SUMMARIES = False
NUM_WORKERS = 1

In [None]:
NUM_TRAIN_STEPS
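
Note that `N_TRAIN` counts `.tfrecord` segment files rather than individual frames, so the step count computed above treats each file as one example. If the number of training frames is known, the steps needed for a given number of epochs can be estimated as sketched below. `num_train_steps` is a hypothetical helper and the example numbers are illustrative only.

```python
import math


def num_train_steps(n_examples: int, batch_size: int, epochs: int) -> int:
    """Number of batch iterations needed to see `n_examples` for `epochs` passes."""
    # A final, partially filled batch still costs one step, hence the ceiling
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs


print(num_train_steps(n_examples=1000, batch_size=2, epochs=10))  # 5000
```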

In [None]:
os.makedirs(MODEL_OUT, exist_ok=True)

In [None]:
if USE_TPU:
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
        print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
    except ValueError:
        raise RuntimeError('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')
    TPU_NAME = os.environ['COLAB_TPU_ADDR']
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)
    ### Updating training parameters to be multiples of TPU cores
    BATCH_SIZE = BATCH_SIZE * tpu_strategy.num_replicas_in_sync
    NUM_WORKERS = len(tf.config.list_logical_devices('TPU'))
    ### Check if batch size and learning rate are auto updated with USE_TPU
    ### Check if dataset call uses `prefetch` with AUTOTUNE

In [None]:
notebook.list() # View open TensorBoard instances

In [None]:
#%load_ext tensorboard

In [None]:
PIPELINE_CONFIG_PATH

In [None]:
### Run the training loop

In [None]:
# NOTE: add `--checkpoint_dir {TRAINED_CHECKPOINT_DIR}` only for continuous evaluation
# NOTE: `--eval_timeout` should be sufficiently long for continuous evaluation
!python3 experiments/model_main_tf2.py \
    --model_dir {MODEL_OUT} \
    --pipeline_config_path {PIPELINE_CONFIG_PATH} \
    --num_train_steps {NUM_TRAIN_STEPS} \
    --eval_on_train_data False \
    --sample_1_of_n_eval_examples 1 \
    --eval_timeout {EVAL_TIMEOUT} \
    --use_tpu False \
    --tpu_name None \
    --record_summaries False \
    --num_workers 1

In [None]:
# Control TensorBoard display. If no port is provided, 
# the most recently launched TensorBoard is used
notebook.display(height=1000)
# Can also use:
# %load_ext tensorboard
# %tensorboard --logdir {MODEL_OUT}

#### Exporting the trained model

In [None]:
!python experiments/exporter_main_v2.py \
    --input_type image_tensor \
    --pipeline_config_path {PIPELINE_CONFIG_PATH} \
    --trained_checkpoint_dir {MODEL_OUT} \
    --output_directory {DIR_BASE}

### 2.3. Evaluation

#### Running inference on test data

In [None]:
!pip install ffmpeg-python

In [None]:
!python inference_video.py \
    --labelmap_path {LABEL_MAP_PATH} \
    --model_path {EXPORTED_MODEL_DIR} \
    --tf_record_path {TF_RECORD_PATH} \
    --config_path {PIPELINE_CONFIG_PATH} \
    --output_path {INFERENCE_OUTPUT_PATH}
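
The inference command above refers to `EXPORTED_MODEL_DIR`, `TF_RECORD_PATH` and `INFERENCE_OUTPUT_PATH`, which are not defined earlier in this notebook. A minimal sketch of how they might be set is given below. All paths and file names here are assumptions for illustration and must be adapted to your workspace; the only convention relied upon is that the TensorFlow Object Detection API exporter writes a `saved_model` subfolder inside its output directory.

```python
import os

# Hypothetical locations, repeated here so the snippet is self-contained
DIR_MODEL = "experiments/pretrained_model/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8"
DIR_OUT = "out"
TEST = "data/waymo_open_dataset/split/test"

# The exporter places the SavedModel under `<output_directory>/saved_model`
EXPORTED_MODEL_DIR = os.path.join(DIR_MODEL, "saved_model")
# Placeholder file name: pick any one `.tfrecord` file from the test subset
TF_RECORD_PATH = os.path.join(TEST, "segment-example.tfrecord")
# Placeholder output file name for the rendered video
INFERENCE_OUTPUT_PATH = os.path.join(DIR_OUT, "animation.mp4")

print(EXPORTED_MODEL_DIR)
```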

## 3. Closing Remarks

##### Alternatives
* Run `download_process` script in different environments (e.g., Ubuntu, Google Colab, Google Compute Engine instance);
* Skip `download_process` and instead use the provided `.tfrecord` data to train model (see caveats [here](https://github.com/jonathanloganmoran/ND0013-Self-Driving-Car-Engineer/issues/21));
* Balance dataset classes ('pedestrian' and 'cyclist' classes -- see `2022-10-16-Report.md`).

##### Extensions of task
* Train model on additional data (more than 100 `.tfrecord` files);
* Compare pre-trained model performance against other models on TensorFlow's [Object Detection Model Zoo](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md);
* Customise the `pipeline.config` file to include e.g., additional [data augmentation](https://github.com/tensorflow/models/blob/master/research/object_detection/protos/preprocessor.proto) strategies (see [Exercise 1.4.3](https://github.com/jonathanloganmoran/ND0013-Self-Driving-Car-Engineer/blob/934e20c38832186c534846ba1eaaaa3abdead499/1-Computer-Vision/Exercises/1-4-3-Image-Augmentations/2022-09-19-Image-Augmentations.ipynb) for domain-specific examples).

## 4. Future Work
* ✅ Train model and evaluate using the Udacity provided `.tfrecord` data;
* ⬜️ Address model under-performance on the 'pedestrian' and 'cyclist' classes;
* ⬜️ Compare training and inference times between Udacity VM GPU and Google TPU cluster (5 v3-8 and v2-8 nodes, 100 v2-8 nodes).

## Credits

This assignment was prepared by Thomas Hossler, Michael Virgo et al., Winter 2021 (link [here](https://github.com/udacity/nd013-c1-vision-starter)).



Helpful explanations:
* [Training Custom Object Detector by L. Vladimirov | TensorFlow 2 Object Detection API tutorial](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html).