# Project 1.1: Object Detection in Urban Environments
## Model Setup and Training
#### By Jonathan L. Moran (jonathan.moran107@gmail.com)
From the Self-Driving Car Engineer Nanodegree programme offered at Udacity.

## 1. Introduction

In [None]:
!sudo python -m pip install --upgrade pip

In [None]:
!sudo pip uninstall -y protobuf
!sudo pip uninstall -y google
!sudo pip install google
!sudo pip install protobuf
!sudo pip install google-cloud
!pip install ray
!pip install omegaconf
!pip install hydra-core
!pip install packaging
!pip install importlib-resources

In [None]:
!python -m pip install waymo-open-dataset-tf-2-3-0

In [4]:
### Importing the required modules
# Doing this here to test installation

In [5]:
import datetime
import glob
import google.protobuf
import hydra
import numpy as np
import os
import ray
import sys
from tensorboard import notebook
import tensorflow as tf
import waymo_open_dataset

In [6]:
!python --version

Python 3.7.3


In [7]:
ray.__version__

'0.8.7'

In [8]:
tf.__version__

'2.3.0'

In [9]:
np.__version__

'1.18.5'

In [10]:
### Setting the environment variables

In [11]:
ENV_COLAB = False               # True if running in Google Colab instance

In [12]:
# Root directory
DIR_BASE = '' if not ENV_COLAB else '/content/1-1-Object-Detection-in-Urban-Environments'
DIR_BASE = os.path.abspath(DIR_BASE)
DIR_BASE

'/home/workspace/1-1-Object-Detection-in-Urban-Environments'

In [13]:
# Subdirectory to save output files
DIR_OUT = os.path.join(DIR_BASE, 'out')
# Subdirectory pointing to input data
DIR_SRC = os.path.join(DIR_BASE, 'data')

In [14]:
### Creating subdirectories (if not exists)
os.makedirs(DIR_SRC, exist_ok=True)
os.makedirs(DIR_OUT, exist_ok=True)

In [15]:
# Subdirectory to model folder
DIR_MODEL = os.path.join(DIR_BASE, 'experiments/pretrained_model/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8')

In [16]:
# Load the TensorBoard notebook extension (if using Colab)
#%load_ext tensorboard

In [17]:
### Downloading the Google Cloud CLI (if folder doesn't already exist)
if not os.path.exists(os.path.join(DIR_BASE, 'addons/google-cloud-sdk')):
    # Download Cloud CLI tools from Google
    !curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-405.0.0-linux-x86_64.tar.gz
    # Unzip to addons folder
    !tar -xf google-cloud-cli-405.0.0-linux-x86_64.tar.gz -C addons/
### Setting up Google Cloud CLI tools (run commands inside interactive shell)
# ./addons/google-cloud-sdk/install.sh
# ./gcloud init
### Authenticate the service account
# ./gcloud auth activate-service-account [service-email] --key-file="[path/to/key-file]"

In [18]:
### Sweeping the working directory for Python modules

In [19]:
!pip install --editable {DIR_BASE}

Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Obtaining file:///home/workspace/1-1-Object-Detection-in-Urban-Environments
Installing collected packages: 1-1-Object-Detection-in-Urban-Environments
  Running setup.py develop for 1-1-Object-Detection-in-Urban-Environments
Successfully installed 1-1-Object-Detection-in-Urban-Environments
You should consider upgrading via the '/root/miniconda3/bin/python -m pip install --upgrade pip' command.


### 1.1. Setup and Data Acquisition

In this section of the notebook we will be fetching the Waymo Open Dataset files from their Google Cloud Storage bucket locations. To view the file paths we will be downloading, see `filenames.txt` inside the `data/waymo_open_dataset` folder.


#### Environment setup
The Python files inside `scripts/` have been modified to work on the Ubuntu Linux VM provided in the Udacity workspace. Please see previous commits of this repository to obtain script files suited for macOS and Google Colab. As of now, the Ubuntu VM is running Python 3.7.3, TensorFlow 2.3.0 and the Waymo Open Dataset package `waymo-open-dataset-tf-2-3-0==1.4.0`. Check the remaining dependency versions for conflicts before running on any other machine.
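
One quick way to spot such conflicts is to compare the versions reported earlier in this notebook against what is installed on another machine. The sketch below is only an illustration: `EXPECTED_VERSIONS` and `find_mismatches` are hypothetical names introduced here, and the `installed` dictionary is made-up example data that would normally be filled in from e.g. `tf.__version__` and `np.__version__`.

```python
# Versions reported by the cells earlier in this notebook (Udacity Ubuntu VM)
EXPECTED_VERSIONS = {
    "python": "3.7.3",
    "tensorflow": "2.3.0",
    "numpy": "1.18.5",
    "ray": "0.8.7",
}


def find_mismatches(expected, installed):
    """Return {name: (expected, installed)} for each differing or missing version."""
    return {
        name: (want, installed.get(name))
        for name, want in expected.items()
        if installed.get(name) != want
    }


# Hypothetical example: a machine where only NumPy differs
installed = dict(EXPECTED_VERSIONS, numpy="1.19.2")
print(find_mismatches(EXPECTED_VERSIONS, installed))
# {'numpy': ('1.18.5', '1.19.2')}
```

An empty dictionary means every listed version matches the VM.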

Running the `!pip install --editable {DIR_BASE}` command installs this project repository in development mode, which puts its modules on the Python path. This is the recommended way to resolve `PYTHONPATH` issues, and is preferable to editing the `~/.bashrc` file or modifying the `os.environ['PYTHONPATH']`/`sys.path` variables directly.
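
To confirm that the editable install worked, it can help to check whether a module resolves on the current Python path before importing it. This is a minimal sketch using the standard library's `importlib.util.find_spec`; `is_importable` is a hypothetical helper, and `json` stands in for whichever project module you actually want to check.

```python
import importlib.util


def is_importable(module_name):
    """Return True if `module_name` can be located on the current Python path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised e.g. when the parent package of a dotted name is missing
        return False


# `json` is a stand-in; substitute a module from this repository
print(is_importable("json"))
```

If the check returns `False` for a project module, re-running the editable install from the repository root is a reasonable first step.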

#### Data acquisition

The `download_process.py` script will fetch the `.tfrecord` files from GCS and store them locally in the `data/waymo_open_dataset/raw` subdirectory to be processed. The raw `.tfrecord` files are parsed in `process_tfr`; the images, bounding box labels and attribute data are stored in a dictionary-like object and converted to `tf.data.TFRecordDataset` instances. After the files have been converted, their originals are deleted from inside the `raw` folder.

Lastly, we will split the data we have collected into train, test and validation subsets. By default, 80% of the files go to train and 20% to test; 20% of that train portion is then withheld for the validation set. The split sizes can be customised inside the `configs/dataset/waymo_open_dataset.yaml` configuration file.
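
To make the two-stage split concrete, here is a small worked example. It is a sketch of the arithmetic only, not the code in `create_splits.py`; `split_counts` is a hypothetical helper and the 100-file input is an assumed figure chosen for easy numbers.

```python
def split_counts(n_files, train_test_split=0.8, train_val_split=0.8):
    """Return (n_train, n_val, n_test) for a two-stage split of `n_files`."""
    # Stage 1: separate the test set from the combined train+val portion
    n_train_val = round(n_files * train_test_split)
    n_test = n_files - n_train_val
    # Stage 2: withhold part of the train portion for validation
    n_train = round(n_train_val * train_val_split)
    n_val = n_train_val - n_train
    return n_train, n_val, n_test


print(split_counts(100))
# (64, 16, 20)
```

With the default ratios, the validation set therefore ends up as 16% of all files rather than 20%.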


**Disclaimer**: I (the author of this notebook, Jonathan L. Moran) have put a lot of effort into mitigating platform issues between macOS, Ubuntu and Google Colab instances. Many hurdles currently prevent the Udacity VM from being used to carry out the full extent of this project. I'm doing my best to work with the Udacity mentors/staff to resolve these issues as they come up. If you are able to execute the project on a local setup with GPU/TPU hardware acceleration, please let me know.

### 1.2. Training and Evaluation

TODO.

## 2. Programming Task

### 2.1. Data Acquisition

In [20]:
!python3 experiments/testing_configs.py --help

---------------------------------------------------------------------------
                    TensorFlow Object Detection API
---------------------------------------------------------------------------
Packaged with <3 by Jonathan L. Moran (jonathan.moran107@gmail.com).
Intended for use on the Waymo Open Dataset for the Perception-Sensor 2D task.


Training/evaluation
--------------------
For local training/evaluation run:

```python
python3 model_main_tf2.py
```

with none/any/all of the following parameters:
```
DIR_BASE:                               str         Path to the current `model` subdirectory.
MODEL_OUT:                              str         Path to the `/tmp/model_outputs` folder.
CHECKPOINT_DIR:                         str         Path to the pretrained weights/variables saved in `checkpoint` folder.
PIPELINE_CONFIG_PATH:                   str         Path to the `pipeline.config` file.
NUM_TRAIN_STEPS:                        int         Number 
```

**Note**: If you see the above help message/welcome screen -- congratulations! You've compiled the project successfully.

#### Downloading and processing

To download and process the `.tfrecord` files from Google Cloud Storage into `tf.data.TFRecordDataset` instances, run:

```python
python3 download_process.py
```
with none/any/all of the following parameters:
```
        DATA_DIR:        str         Path to the `data` directory to download files to.
        LABEL_MAP_PATH:  str         Path to the dataset `label_map.pbtxt` file.
        SIZE:            int         Number of `.tfrecord` files to download from GCS.
```
Overriding parameters globally is accomplished at runtime using the Basic Override syntax provided by Hydra:

```python
python3 download_process.py \
    dataset.data_dir={DATA_DIR} \
    dataset.label_map_path={LABEL_MAP_PATH} \
    dataset.size={SIZE}
```
See `configs/dataset/` for additional details on preconfigured values.
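
To illustrate how dotted `key=value` overrides map onto the nested configuration groups, the following sketch parses a list of override strings into a nested dictionary. This is plain Python written for illustration, not Hydra's implementation: `parse_overrides` is a hypothetical helper, the example path is made up, and values are left as strings, whereas Hydra also converts types and merges the result into the YAML defaults.

```python
def parse_overrides(overrides):
    """Turn ["a.b=1", "a.c=x"] into {"a": {"b": "1", "c": "x"}}."""
    config = {}
    for item in overrides:
        # Split on the first "=" only, so values may contain "=" themselves
        key, _, value = item.partition("=")
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


cfg = parse_overrides(["dataset.data_dir=/data/waymo", "dataset.size=1"])
print(cfg)
# {'dataset': {'data_dir': '/data/waymo', 'size': '1'}}
```

Each dotted prefix (here `dataset`) corresponds to a config group, which is why the overrides in this notebook are all namespaced.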

In [21]:
DATA_DIR = DIR_SRC
LABEL_MAP_PATH = os.path.join(DIR_SRC, 'waymo_open_dataset/label_map.pbtxt')
SIZE = 1

In [192]:
!python3 scripts/preprocessing/download_process.py \
    dataset.data_dir="{DATA_DIR}/waymo_open_dataset" \
    dataset.label_map_path={LABEL_MAP_PATH} \
    dataset.size={SIZE}

2022-10-13 22:59:35.676931: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2022-10-13 22:59:39,392 INFO     Downloading 1 files. Be patient, this will take a long time.
2022-10-13 22:59:39,393 INFO     Downloading segment-10017090168044687777_6380_000_6400_000_with_camera_labels.tfrecord
^C
Traceback (most recent call last):
  File "scripts/preprocessing/download_process.py", line 340, in <module>
    main()
  File "/root/miniconda3/lib/python3.7/site-packages/hydra/main.py", line 95, in decorated_main
    config_name=config_name,
  File "/root/miniconda3/lib/python3.7/site-packages/hydra/_internal/utils.py", line 396, in _run_hydra
    overrides=overrides,
  File "/root/miniconda3/lib/python3.7/site-packages/hydra/_internal/utils.py", line 453, in _run_app
    lambda: hydra.run(
  File "/root/miniconda3/lib/python3.7/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/roo

#### Splitting the dataset

To split the downloaded data into three subsets `train`, `val`, and `test`, run:
    
```python
python3 create_splits.py
```

with none/any/all of the following:
```
        DATA_DIR:           str         Path to the source `data` directory.
        TRAIN:              str         Path to the `train` data directory.
        TEST:               str         Path to the `test` data directory.
        VAL:                str         Path to the `val` data directory.
        TRAIN_TEST_SPLIT:   float       Fraction in [0, 1] of the data assigned to train; the remainder goes to test.
        TRAIN_VAL_SPLIT:    float       Fraction in [0, 1] of the train split kept for training; the remainder goes to val.
```
Overriding parameters globally is accomplished at runtime using the Basic Override syntax provided by Hydra:

```python
python3 create_splits.py \
    dataset.data_dir={DATA_DIR} \
    dataset.train={TRAIN} dataset.test={TEST} dataset.val={VAL} \
    dataset.train_test_split={TRAIN_TEST_SPLIT} \
    dataset.train_val_split={TRAIN_VAL_SPLIT}
```
See `configs/dataset/` for additional details on preconfigured values.

In [22]:
TRAIN = os.path.join(DIR_SRC, 'waymo_open_dataset/split/train')
TEST = os.path.join(DIR_SRC, 'waymo_open_dataset/split/test')
VAL = os.path.join(DIR_SRC, 'waymo_open_dataset/split/val')
TRAIN_TEST_SPLIT = 0.8    # 80/20 train/test split
TRAIN_VAL_SPLIT = 0.8     # 80/20 train/val split (performed on split train set)

In [None]:
!python3 create_splits.py \
    dataset.train={TRAIN} \
    dataset.test={TEST} \
    dataset.val={VAL} \
    dataset.train_test_split={TRAIN_TEST_SPLIT} \
    dataset.train_val_split={TRAIN_VAL_SPLIT}

**Note**: we will be skipping the data acquisition step for now, since the processed `.tfrecord` files have been provided to us in the `/home/workspace/data` folder.

Let's see how many files we have in each subset:

In [23]:
N_TRAIN = len(glob.glob(f"{TRAIN}/*.tfrecord"))
N_TEST = len(glob.glob(f"{TEST}/*.tfrecord"))
N_VAL = len(glob.glob(f"{VAL}/*.tfrecord"))
print('Number of training `segment` files:  ', N_TRAIN, 
      '\nNumber of testing `segment` files:   ', N_TEST,
      '\nNumber of validation `segment` files:', N_VAL)

Number of training `segment` files:   86 
Number of testing `segment` files:    3 
Number of validation `segment` files: 10


### 2.2. Training the model

#### Modifying the config file

To configure the model for training, run:

```python
python3 edit_config.py
```

with none/any/all of the following parameters:
```
    TRAIN:                                  str         Path to the `train` data directory.
    TEST:                                   str         Path to the `test` data directory.
    VAL:                                    str         Path to the `val` data directory.
    BATCH_SIZE:                             int         Number of examples to process per iteration.
    CHECKPOINT_DIR:                         str         Path to the pre-trained `checkpoint` folder.
    LABEL_MAP_PATH:                         str         Path to the dataset `label_map.pbtxt` file.
    PIPELINE_CONFIG_PATH:                   str         Path to the `pipeline.config` file to modify.
```

Overriding parameters globally is accomplished at runtime using the Basic Override syntax provided by Hydra:

```python
python3 edit_config.py \
    dataset.train={TRAIN} dataset.test={TEST} dataset.val={VAL} \
    dataset.label_map_path={LABEL_MAP_PATH} \
    hyperparameters.batch_size={BATCH_SIZE} \
    model.checkpoint_dir={CHECKPOINT_DIR} \
    model.pipeline_config_path={PIPELINE_CONFIG_PATH}
```
See `configs/model/` for additional details on preconfigured values.
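
For intuition about what editing the config involves, the sketch below rewrites two scalar fields in a text-format config string. It makes no claim about how `edit_config.py` works: `set_scalar_field` is a hypothetical helper, `SAMPLE_CONFIG` is a heavily trimmed stand-in for a real `pipeline.config`, and a regular-expression edit like this is far less robust than parsing the file with the protobuf text-format tools.

```python
import re

# Trimmed, illustrative stand-in for a real `pipeline.config`
SAMPLE_CONFIG = '''
train_config {
  batch_size: 64
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED"
}
'''


def set_scalar_field(config_text, field, value):
    """Replace the value of the first `field: <value>` entry in `config_text`."""
    pattern = re.compile(rf'(\b{re.escape(field)}\s*:\s*)("[^"]*"|\S+)')
    # Strings are written quoted; numbers and other scalars are written bare
    rendered = f'"{value}"' if isinstance(value, str) else str(value)
    new_text, n_subs = pattern.subn(
        lambda match: match.group(1) + rendered, config_text, count=1)
    if n_subs == 0:
        raise KeyError(field)
    return new_text


updated = set_scalar_field(SAMPLE_CONFIG, "batch_size", 2)
updated = set_scalar_field(updated, "fine_tune_checkpoint", "/tmp/ckpt-0")
print(updated)
```

The checkpoint path `/tmp/ckpt-0` is a placeholder; in this notebook the real values come from `BATCH_SIZE` and `CHECKPOINT_DIR`.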

In [24]:
BATCH_SIZE = 2     # Modify to something reasonable for your training setup
CHECKPOINT_DIR = os.path.join(DIR_MODEL, 'checkpoint')
PIPELINE_CONFIG_PATH = os.path.join(DIR_MODEL, 'pipeline.config')

In [25]:
PIPELINE_CONFIG_PATH

'/home/workspace/1-1-Object-Detection-in-Urban-Environments/experiments/pretrained_model/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/pipeline.config'

In [26]:
!python3 scripts/training/edit_config.py \
    dataset.label_map_path={LABEL_MAP_PATH} \
    hyperparameters.batch_size={BATCH_SIZE} \
    model.checkpoint_dir={CHECKPOINT_DIR} \
    model.pipeline_config_path={PIPELINE_CONFIG_PATH}

2022-10-14 19:32:31.523677: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1


#### Running the training loop

For local training/evaluation run:

```python
python3 model_main_tf2.py
```

with none/any/all of the following parameters:
```
    DIR_BASE:                               str         Path to the current `model` subdirectory.
    MODEL_OUT:                              str         Path to the `/tmp/model_outputs` folder.
    CHECKPOINT_DIR:                         str         Path to the pretrained weights/variables saved in `checkpoint` folder.
    PIPELINE_CONFIG_PATH:                   str         Path to the `pipeline.config` file.
    NUM_TRAIN_STEPS:                        int         Number of training steps (batch iterations) to perform. 
    EVAL_ON_TRAIN_DATA:                     bool        If True, will evaluate on training data (only supported in distributed training).
    SAMPLE_1_OF_N_EVAL_EXAMPLES:            int         Number of evaluation samples to skip (will sample 1 of every n samples per batch).
    SAMPLE_1_OF_N_EVAL_ON_TRAIN_EXAMPLES:   int         Number of training samples to skip (only used if `eval_on_train_data` is True).
    EVAL_TIMEOUT:                           int         Number of seconds to wait for an evaluation checkpoint before exiting.
    USE_TPU:                                bool        Whether or not the job is executing on a TPU.
    TPU_NAME:                               str         Name of the Cloud TPU for Cluster Resolvers.
    CHECKPOINT_EVERY_N:                     int         Defines how often to checkpoint (every n steps).
    RECORD_SUMMARIES:                       bool        Whether or not to record summaries defined by the model or training pipeline.
    NUM_WORKERS:                            int         When `num_workers` > 1, training uses 'MultiWorkerMirroredStrategy',
                                                        When `num_workers` = 1, training uses 'MirroredStrategy'.
```
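
The `NUM_WORKERS` rule above can be summarised as a small decision function. This is a sketch of the selection logic as described in the parameter list, returning strategy names as strings rather than constructing TensorFlow objects; `pick_strategy_name` is a hypothetical helper, and giving `use_tpu` priority over the worker count is an assumption.

```python
def pick_strategy_name(num_workers, use_tpu=False):
    """Return the name of the distribution strategy implied by the settings."""
    if use_tpu:
        # Assumption: a TPU job takes priority over the worker count
        return "TPUStrategy"
    if num_workers > 1:
        return "MultiWorkerMirroredStrategy"
    return "MirroredStrategy"


print(pick_strategy_name(num_workers=1))
# MirroredStrategy
```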

Overriding parameters globally is accomplished at runtime using the Basic Override syntax provided by Hydra:

```python
python3 model_main_tf2.py \
    model.pipeline_config_path={PIPELINE_CONFIG_PATH} \
    model.model_out={MODEL_OUT} model.num_train_steps={NUM_TRAIN_STEPS} \
    model.sample_1_of_n_eval_examples={SAMPLE_1_OF_N_EVAL_EXAMPLES} \
    ...
```
See `configs/model/` for additional details on preconfigured values if running without parameters.
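
When many parameters are passed at once, it can be convenient to build the override strings from Python values instead of typing them out. The sketch below is one way to do that; `to_override` is a hypothetical helper. Rendering `None` as `null` and booleans as lowercase `true`/`false` reflects my understanding of Hydra's override grammar and should be treated as an assumption to check against the installed Hydra version.

```python
import shlex


def to_override(key, value):
    """Render one `key=value` override, quoted so it is safe to pass to a shell."""
    if value is None:
        rendered = "null"            # assumed Hydra literal for None
    elif isinstance(value, bool):
        rendered = str(value).lower()
    else:
        rendered = str(value)
    return shlex.quote(f"{key}={rendered}")


params = {
    "model.num_train_steps": 1,
    "model.use_tpu": False,
    "model.tpu_name": None,
}
print(" ".join(to_override(k, v) for k, v in params.items()))
# model.num_train_steps=1 model.use_tpu=false model.tpu_name=null
```

Quoting with `shlex.quote` matters mainly for paths that contain spaces or other shell-special characters.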

In [59]:
DIR_BASE = DIR_MODEL
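# Note: '/tmp/model_outputs' is an absolute path, so `os.path.join` discards
# DIR_BASE and MODEL_OUT resolves to '/tmp/model_outputs'
MODEL_OUT = os.path.join(DIR_BASE, '/tmp/model_outputs')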
#EPOCHS = 25
#NUM_TRAIN_STEPS = N_TRAIN // BATCH_SIZE * EPOCHS
NUM_TRAIN_STEPS = 1    # Testing with 1 iteration right now
EVAL_ON_TRAIN_DATA = False
SAMPLE_1_OF_N_EVAL_EXAMPLES = 1
EVAL_TIMEOUT = 100
USE_TPU = False
TPU_NAME = None
RECORD_SUMMARIES = True
NUM_WORKERS = 1
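
The commented-out step formula in the cell above can be checked by hand using the 86 training files counted earlier, the batch size of 2 and the 25 epochs from the comment. This is a sketch of the arithmetic only, and `num_train_steps` is a hypothetical helper. Note that `N_TRAIN` counts `.tfrecord` segment files rather than individual frames, so the result likely underestimates the steps needed for a full pass over the images.

```python
def num_train_steps(n_examples, batch_size, epochs):
    """Steps per epoch (integer division) multiplied by the number of epochs."""
    return n_examples // batch_size * epochs


steps = num_train_steps(n_examples=86, batch_size=2, epochs=25)
print(steps)
# 1075
```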

In [28]:
os.makedirs(MODEL_OUT, exist_ok=True)

In [29]:
if USE_TPU:
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
        print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
    except ValueError as err:
        raise RuntimeError('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!') from err
    TPU_NAME = os.environ['COLAB_TPU_ADDR']
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)
    ### Updating training parameters to be multiples of TPU cores
    BATCH_SIZE = BATCH_SIZE * tpu_strategy.num_replicas_in_sync
    NUM_WORKERS = len(tf.config.list_logical_devices('TPU'))
    ### Check if batch size and learning rate are auto updated with USE_TPU
    ### Check if dataset call uses `prefetch` with AUTOTUNE

In [30]:
notebook.list() # View open TensorBoard instances

No known TensorBoard instances running.


In [31]:
%load_ext tensorboard

In [32]:
### Setting the logs directory for TensorBoard
logs_dir = os.path.join(DIR_MODEL, f"tmp/out/{datetime.datetime.now().strftime('%Y%m%d-%H%M%S')}")
logs_dir

'/home/workspace/1-1-Object-Detection-in-Urban-Environments/experiments/pretrained_model/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/tmp/out/20221014-193250'

In [33]:
os.makedirs(logs_dir, exist_ok=True)

In [34]:
### Run the training loop

In [41]:
# OC_CAUSE=1 asks OmegaConf to show the full backtrace of the underlying error
os.environ['OC_CAUSE'] = '1'

In [62]:
#%tensorboard --logdir {logs_dir}
!python3 experiments/model_main_tf2.py \
    model.dir_base={DIR_BASE} \
    model.model_out={MODEL_OUT} \
    model.checkpoint_dir={CHECKPOINT_DIR} \
    model.pipeline_config_path={PIPELINE_CONFIG_PATH} \
    model.num_train_steps={NUM_TRAIN_STEPS} \
    model.eval_on_train_data={EVAL_ON_TRAIN_DATA} \
    model.sample_1_of_n_eval_examples={SAMPLE_1_OF_N_EVAL_EXAMPLES} \
    model.eval_timeout={EVAL_TIMEOUT} \
    model.use_tpu={USE_TPU} \
    model.tpu_name={TPU_NAME} \
    model.record_summaries={RECORD_SUMMARIES} \
    model.num_workers={NUM_WORKERS}

2022-10-14 19:57:15.358405: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
['experiments/model_main_tf2.py', 'model.dir_base=/home/workspace/1-1-Object-Detection-in-Urban-Environments/experiments/pretrained_model/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8', 'model.model_out=/tmp/model_outputs', 'model.checkpoint_dir=/home/workspace/1-1-Object-Detection-in-Urban-Environments/experiments/pretrained_model/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint', 'model.pipeline_config_path=/home/workspace/1-1-Object-Detection-in-Urban-Environments/experiments/pretrained_model/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/pipeline.config', 'model.num_train_steps=1', 'model.eval_on_train_data=False', 'model.sample_1_of_n_eval_examples=1', 'model.eval_timeout=100', 'model.use_tpu=False', 'model.tpu_name=', 'model.record_summaries=True', 'model.num_workers=1']
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.7/

In [42]:
# Control TensorBoard display. If no port is provided, 
# the most recently launched TensorBoard is used
notebook.display(port=6006, height=1000)

In [None]:
#%tensorboard --logdir logs

#### Exporting the trained model

In [None]:
# TODO

### 2.3. Evaluation

## 3. Closing Remarks

##### Alternatives
* Run `download_process` script in different environments (e.g., Ubuntu, Google Colab, Google Compute Engine instance);
* Skip `download_process` and instead use the provided `.tfrecord` data to train model (see caveats [here](https://github.com/jonathanloganmoran/ND0013-Self-Driving-Car-Engineer/issues/21)).

##### Extensions of task
* Train model on additional data (more than 100 `.tfrecord` files);
* Compare pre-trained model performance against other models on TensorFlow's [Object Detection Model Zoo](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md);
* Customise the `pipeline.config` file to include e.g., additional [data augmentation](https://github.com/tensorflow/models/blob/master/research/object_detection/protos/preprocessor.proto) strategies (see [Exercise 1.4.3](https://github.com/jonathanloganmoran/ND0013-Self-Driving-Car-Engineer/blob/934e20c38832186c534846ba1eaaaa3abdead499/1-Computer-Vision/Exercises/1-4-3-Image-Augmentations/2022-09-19-Image-Augmentations.ipynb) for domain-specific examples).

## 4. Future Work
- [ ] Train model and evaluate using the Udacity-provided `.tfrecord` data;
- [ ] Compare training and inference times between Udacity VM GPU and Google TPU cluster (5 v3-8 and v2-8 nodes, 100 v2-8 nodes).

## Credits

This assignment was prepared by Thomas Hossler, Michael Virgo et al., Winter 2021 (link [here](https://github.com/udacity/nd013-c1-vision-starter)).


References
* [1] Sun, Pei, et al., Scalability in Perception for Autonomous Driving: Waymo Open Dataset. arXiv. 2019. [doi: 10.48550/ARXIV.1912.04838](https://arxiv.org/abs/1912.04838).


Helpful explanations:
* [Training Custom Object Detector by L. Vladimirov | TensorFlow 2 Object Detection API tutorial](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html).