# TensorFlow Object Detection API and AWS SageMaker

In this notebook, you will train and evaluate different models using the [TensorFlow Object Detection API](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/) and [AWS SageMaker](https://aws.amazon.com/sagemaker/).

If you ever feel stuck, you can refer to this [tutorial](https://aws.amazon.com/blogs/machine-learning/training-and-deploying-models-using-tensorflow-2-with-the-object-detection-api-on-amazon-sagemaker/).

## Dataset

We are using the [Waymo Open Dataset](https://waymo.com/open/) for this project. The dataset has already been exported to the TFRecord format, following the layout described [here](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#create-tensorflow-records). The data is stored on [AWS S3](https://aws.amazon.com/s3/), the AWS object storage service. The images are saved at a resolution of 640x640.

In [1]:
%%capture
%pip install tensorflow_io sagemaker -U

In [2]:
import os
import sagemaker
from sagemaker.estimator import Estimator
from framework import CustomFramework

Save the IAM role in a variable called `role`. It is needed later when launching the training job.

In [3]:
role = sagemaker.get_execution_role()
print(role)

arn:aws:iam::476375380884:role/service-role/AmazonSageMaker-ExecutionRole-20230227T161354


In [4]:
# The train and val paths below are public S3 buckets created by Udacity for this project
inputs = {'train': 's3://cd2688-object-detection-tf2/train/', 
        'val': 's3://cd2688-object-detection-tf2/val/'} 

# Insert path of a folder in your personal S3 bucket to store tensorboard logs.
tensorboard_s3_prefix = 's3://object-detection-project-jckuri/logs/'
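A malformed S3 location usually only surfaces once the training job has started, so it can help to check the URIs up front. The following is a minimal sketch, not part of the project code: `split_s3_uri` is a hypothetical helper that relies only on the standard library and does not contact AWS.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/prefix/' into (bucket, prefix); raise on malformed input."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not a valid S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Channel locations copied from the cell above.
inputs = {
    "train": "s3://cd2688-object-detection-tf2/train/",
    "val": "s3://cd2688-object-detection-tf2/val/",
}

for channel, uri in inputs.items():
    bucket, prefix = split_s3_uri(uri)
    print(f"{channel}: bucket={bucket} prefix={prefix}")
```

The check is purely syntactic: it confirms the shape of the URI, not that the bucket exists or that the role can read it.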

## Container

To train the model, you will first need to build a [Docker](https://www.docker.com/) container with all the dependencies required by the TF Object Detection API. The code below does the following:
* clone the TensorFlow models repository
* get the exporter and training scripts from the repository
* build the Docker image and push it
* print the container name
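The build output recorded further down ends with `No space left on device`, so it may be worth checking free disk space before starting the build. This is a minimal sketch using only the standard library. The 20 GiB threshold is an assumption rather than a measured requirement, and the check only covers the filesystem that contains the given path, which may differ from the one Docker writes to.

```python
import shutil


def free_gib(path="."):
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


def enough_space(path=".", required_gib=20.0):
    """True when at least `required_gib` GiB are free (the default is a guess)."""
    return free_gib(path) >= required_gib


print(f"free space: {free_gib('.'):.1f} GiB")
if not enough_space("."):
    print("Low disk space: consider freeing space before building the image.")
```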

In [5]:
%%bash

# clone the repo and get the scripts
git clone https://github.com/tensorflow/models.git docker/models

# get model_main and exporter_main files from TF2 Object Detection GitHub repository
cp docker/models/research/object_detection/exporter_main_v2.py source_dir 
cp docker/models/research/object_detection/model_main_tf2.py source_dir

fatal: destination path 'docker/models' already exists and is not an empty directory.


In [6]:
# Build and push the Docker image. This cell can be commented out after it has been run once.
# The build takes around 10 minutes.
image_name = 'tf2-object-detection'
!sh ./docker/build_and_push.sh $image_name

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Building image with name tf2-object-detection
Sending build context to Docker daemon  722.9MB
Step 1/17 : FROM tensorflow/tensorflow:2.9.0-gpu
 ---> c8d9ee2a0ff4
Step 2/17 : ARG DEBIAN_FRONTEND=noninteractive
 ---> Running in de06bd2fc29c
Removing intermediate container de06bd2fc29c
 ---> aac1f932fc42
Step 3/17 : RUN rm /etc/apt/sources.list.d/cuda.list
 ---> Running in 786f369d2dcd
Removing intermediate container 786f369d2dcd
 ---> b42c60eb2007
Step 4/17 : RUN apt-key del 7fa2af80
 ---> Running in 3dd02a7e927d
OK
Removing intermediate container 3dd02a7e927d
 ---> 5242f9953824
Step 5/17 : RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
 ---> Running in 95e954b6211d
Executing: /tmp/apt-key-gpghome.J0ngiDoAdo/gpg.1.sh --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub

Get:3 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gpg-wks-server amd64 2.2.19-3ubuntu2.2 [90.2 kB]
Get:4 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gnupg-utils amd64 2.2.19-3ubuntu2.2 [481 kB]
Get:5 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gpg-agent amd64 2.2.19-3ubuntu2.2 [232 kB]
Get:6 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gpg amd64 2.2.19-3ubuntu2.2 [482 kB]
Get:7 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gpgconf amd64 2.2.19-3ubuntu2.2 [124 kB]
Get:8 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gnupg-l10n all 2.2.19-3ubuntu2.2 [51.7 kB]
Get:9 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gnupg all 2.2.19-3ubuntu2.2 [259 kB]
Get:10 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gpgsm amd64 2.2.19-3ubuntu2.2 [217 kB]
Get:11 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gpgv amd64 2.2.19-3ubuntu2.2 [200 kB]
Get:12 http://archive.ubuntu.com/ubuntu focal-upda

Get:89 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libx11-dev amd64 2:1.6.9-2ubuntu1.2 [647 kB]
Get:90 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libglx-dev amd64 1.3.2-1~ubuntu0.20.04.2 [14.0 kB]
Get:91 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libgl-dev amd64 1.3.2-1~ubuntu0.20.04.2 [97.8 kB]
Get:92 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libegl-dev amd64 1.3.2-1~ubuntu0.20.04.2 [17.2 kB]
Get:93 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libjbig0 amd64 2.1-3.1ubuntu0.20.04.1 [27.3 kB]
Get:94 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libwebp6 amd64 0.6.1-2ubuntu0.20.04.1 [185 kB]
Get:95 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libtiff5 amd64 4.1.0+git191117-2ubuntu0.20.04.8 [163 kB]
Get:96 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libgdk-pixbuf2.0-common all 2.40.0+dfsg-3ubuntu0.4 [4592 B]
Get:97 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libgdk

Selecting previously unselected package libxau6:amd64.
Preparing to unpack .../010-libxau6_1%3a1.0.9-0ubuntu1_amd64.deb ...
Unpacking libxau6:amd64 (1:1.0.9-0ubuntu1) ...
Selecting previously unselected package libxdmcp6:amd64.
Preparing to unpack .../011-libxdmcp6_1%3a1.1.3-0ubuntu1_amd64.deb ...
Unpacking libxdmcp6:amd64 (1:1.1.3-0ubuntu1) ...
Selecting previously unselected package libxcb1:amd64.
Preparing to unpack .../012-libxcb1_1.14-2_amd64.deb ...
Unpacking libxcb1:amd64 (1.14-2) ...
Selecting previously unselected package libx11-data.
Preparing to unpack .../013-libx11-data_2%3a1.6.9-2ubuntu1.2_all.deb ...
Unpacking libx11-data (2:1.6.9-2ubuntu1.2) ...
Selecting previously unselected package libx11-6:amd64.
Preparing to unpack .../014-libx11-6_2%3a1.6.9-2ubuntu1.2_amd64.deb ...
Unpacking libx11-6:amd64 (2:1.6.9-2ubuntu1.2) ...
Selecting previously unselected package libxext6:amd64.
Preparing to unpack .../015-libxext6_2%3a1.3.4-0ubuntu1_amd64.deb ...
Unpacking libxext6:amd64 (

Selecting previously unselected package libxshmfence1:amd64.
Preparing to unpack .../055-libxshmfence1_1.3-1_amd64.deb ...
Unpacking libxshmfence1:amd64 (1.3-1) ...
Selecting previously unselected package libegl-mesa0:amd64.
Preparing to unpack .../056-libegl-mesa0_21.2.6-0ubuntu0.1~20.04.2_amd64.deb ...
Unpacking libegl-mesa0:amd64 (21.2.6-0ubuntu0.1~20.04.2) ...
Selecting previously unselected package libegl1:amd64.
Preparing to unpack .../057-libegl1_1.3.2-1~ubuntu0.20.04.2_amd64.deb ...
Unpacking libegl1:amd64 (1.3.2-1~ubuntu0.20.04.2) ...
Selecting previously unselected package libxcb-glx0:amd64.
Preparing to unpack .../058-libxcb-glx0_1.14-2_amd64.deb ...
Unpacking libxcb-glx0:amd64 (1.14-2) ...
Selecting previously unselected package libxfixes3:amd64.
Preparing to unpack .../059-libxfixes3_1%3a5.0.3-2_amd64.deb ...
Unpacking libxfixes3:amd64 (1:5.0.3-2) ...
Selecting previously unselected package libxxf86vm1:amd64.
Preparing to unpack .../060-libxxf86vm1_1%3a1.1.4-1build1_amd64.

Selecting previously unselected package libprotoc17:amd64.
Preparing to unpack .../096-libprotoc17_3.6.1.3-2ubuntu5.2_amd64.deb ...
Unpacking libprotoc17:amd64 (3.6.1.3-2ubuntu5.2) ...
Selecting previously unselected package libwebpdemux2:amd64.
Preparing to unpack .../097-libwebpdemux2_0.6.1-2ubuntu0.20.04.1_amd64.deb ...
Unpacking libwebpdemux2:amd64 (0.6.1-2ubuntu0.20.04.1) ...
Selecting previously unselected package libwebpmux3:amd64.
Preparing to unpack .../098-libwebpmux3_0.6.1-2ubuntu0.20.04.1_amd64.deb ...
Unpacking libwebpmux3:amd64 (0.6.1-2ubuntu0.20.04.1) ...
tar: ./triggers: Cannot open: No space left on device
tar: Exiting with failure status due to previous errors
dpkg-deb: error: tar subprocess returned error exit status 2
dpkg: error processing archive /tmp/apt-dpkg-install-7g8wth/099-libxcb-randr0_1.14-2_amd64.deb (--unpack):
 dpkg-deb --control subprocess returned error exit status 2
tar: ./triggers: Cannot open: No space left on device
tar: Exiting with failure statu

To verify that the image was correctly pushed to the [Elastic Container Registry](https://aws.amazon.com/ecr/), you can look it up in the AWS console. The example below shows three different images pushed to ECR; you should see only one, called `tf2-object-detection`.
![ECR Example](../data/example_ecr.png)


In [7]:
# display the container name
with open(os.path.join('docker', 'ecr_image_fullname.txt'), 'r') as f:
    # strip() drops the trailing newline without cutting the name if the newline is missing
    container = f.readline().strip()

print(container)

476375380884.dkr.ecr.us-east-1.amazonaws.com/tf2-object-detection:20230324215902


## Pre-trained model from model zoo

As is common practice, we do not train from scratch but start from a pretrained model from the TF Object Detection model zoo. You can find pretrained checkpoints [here](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md). Because your time is limited for this project, we recommend experimenting only with the following models:
* SSD MobileNet V2 FPNLite 640x640
* SSD ResNet50 V1 FPN 640x640 (RetinaNet50)
* Faster R-CNN ResNet50 V1 640x640
* EfficientDet D1 640x640
* Faster R-CNN ResNet152 V1 640x640

In the code below, the SSD MobileNet V2 FPNLite checkpoint is downloaded and extracted; the equivalent EfficientDet D1 commands are left commented out. This code should be adjusted if you experiment with other architectures.
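Both the active and the commented-out commands in the next cell follow the same pattern, so switching architectures mostly means changing the archive name. The sketch below only builds the command strings and runs nothing. `checkpoint_commands` is a hypothetical helper, and the assumption that other zoo models live under the same `20200711` release directory should be checked against the model zoo page.

```python
ZOO_BASE = "http://download.tensorflow.org/models/object_detection/tf2"


def checkpoint_commands(model_name, release="20200711", dest="source_dir/checkpoint"):
    """Return the wget and tar commands that fetch and unpack one zoo checkpoint."""
    archive = f"/tmp/{model_name}.tar.gz"
    url = f"{ZOO_BASE}/{release}/{model_name}.tar.gz"
    return [
        f"wget -O {archive} {url}",
        f"tar -zxvf {archive} --strip-components 2 "
        f"--directory {dest} {model_name}/checkpoint",
    ]


for command in checkpoint_commands("ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8"):
    print(command)
```

Unlike the cell below, this sketch names the local archive after the model, which avoids overwriting one download with another.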

In [8]:
%%bash
mkdir /tmp/checkpoint
mkdir source_dir/checkpoint

#wget -O /tmp/efficientdet.tar.gz http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz
wget -O /tmp/ssd_mobilenet.tar.gz http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz

#tar -zxvf /tmp/efficientdet.tar.gz --strip-components 2 --directory source_dir/checkpoint efficientdet_d1_coco17_tpu-32/checkpoint
tar -zxvf /tmp/ssd_mobilenet.tar.gz --strip-components 2 --directory source_dir/checkpoint ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint

#tar -zxvf /tmp/ssd_mobilenet.tar.gz --strip-components 2 --directory . ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/pipeline.config


ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/ckpt-0.data-00000-of-00001
ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/checkpoint
ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/ckpt-0.index


mkdir: cannot create directory ‘/tmp/checkpoint’: File exists
mkdir: cannot create directory ‘source_dir/checkpoint’: File exists
--2023-03-24 21:59:36--  http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 142.250.31.128, 2607:f8b0:4004:c19::80
Connecting to download.tensorflow.org (download.tensorflow.org)|142.250.31.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20518283 (20M) [application/x-tar]
Saving to: ‘/tmp/ssd_mobilenet.tar.gz’

     0K .......... .......... .......... .......... ..........  0% 12.8M 2s
    50K .......... .......... .......... .......... ..........  0% 26.4M 1s
   100K .......... .......... .......... .......... ..........  0% 25.4M 1s
   150K .......... .......... .......... .......... ..........  0% 64.9M 1s
   200K .......... .......... .......... .......... ..........  1%  115M 1s
   250K .......

## Edit pipeline.config file

The [`pipeline.config`](source_dir/pipeline.config) in the `source_dir` folder should be updated when you experiment with different models. The different config files are available [here](https://github.com/tensorflow/models/tree/master/research/object_detection/configs/tf2).

>Note: The provided `pipeline.config` file works well with the `EfficientDet` model. You will need to modify it when working with other models.
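When adapting a config from the repository, a handful of scalar fields usually need new values. The sketch below edits them with a regular expression on the text of the file, which avoids importing the Object Detection API. `set_field` is a hypothetical helper, the sample text is a shortened stand-in for a real config, and the new values (three classes, a `checkpoint/ckpt-0` path) are illustrative assumptions, not the settings used in this project.

```python
import re


def set_field(config_text, field, value):
    """Replace the value of every `field: ...` line in a text-format config."""
    pattern = re.compile(
        rf"^([ \t]*{re.escape(field)}[ \t]*:[ \t]*).*$", flags=re.MULTILINE
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + value, config_text)
    if count == 0:
        raise KeyError(f"field {field!r} not found")
    return new_text


# Shortened stand-in for a pipeline.config file.
sample = """\
model {
  ssd {
    num_classes: 90
  }
}
train_config {
  batch_size: 128
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED"
  fine_tune_checkpoint_type: "classification"
}
"""

updated = set_field(sample, "num_classes", "3")
updated = set_field(updated, "fine_tune_checkpoint", '"checkpoint/ckpt-0"')
updated = set_field(updated, "fine_tune_checkpoint_type", '"detection"')
print(updated)
```

A regex edit is fragile for nested or repeated fields; for anything beyond simple scalars, editing the file by hand is the safer route.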

## Launch Training Job

Now that we have a dataset, a Docker image and some pretrained model weights, we can launch the training job. To do so, we create a [SageMaker Framework](https://sagemaker.readthedocs.io/en/stable/frameworks/index.html), where we specify the container name, the name of the config file, the number of training steps, and so on.

The `run_training.sh` script does the following:
* train the model for `num_train_steps` 
* evaluate over the val dataset
* export the model

Different metrics will be displayed during the evaluation phase, including the mean average precision. These metrics can be used to quantify your model's performance and to compare it across iterations.

You can also monitor the training progress by navigating to **Training -> Training Jobs** from the Amazon SageMaker dashboard in the Web UI.
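The hyperparameters passed to the estimator in the next cell use the same names as command-line flags of `model_main_tf2.py` (`model_dir`, `pipeline_config_path`, `num_train_steps`, `sample_1_of_n_eval_examples`). The sketch below shows how such a dictionary maps onto `--name value` arguments. How `run_training.sh` actually forwards them is not shown in this notebook, so treat the mapping as an assumption; `to_cli_args` is a hypothetical helper.

```python
def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into a ['--name', 'value', ...] list."""
    args = []
    for name, value in hyperparameters.items():
        args.extend([f"--{name}", str(value)])
    return args


hyperparameters = {
    "model_dir": "/opt/training",
    "pipeline_config_path": "pipeline.config",
    "num_train_steps": "10000",
    "sample_1_of_n_eval_examples": "1",
}

print(" ".join(to_cli_args(hyperparameters)))
```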

In [9]:
tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=tensorboard_s3_prefix,
    container_local_output_path='/opt/training/'
)

estimator = CustomFramework(
    role=role,
    image_uri=container,
    entry_point='run_training.sh',
    source_dir='source_dir/',
    hyperparameters={
        "model_dir":"/opt/training",        
        "pipeline_config_path": "pipeline.config",
        "num_train_steps": "10000",    
        "sample_1_of_n_eval_examples": "1"
    },
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    tensorboard_output_config=tensorboard_output_config,
    disable_profiler=True,
    base_job_name='tf2-object-detection'
)

estimator.fit(inputs)

INFO:sagemaker:Creating training-job with name: tf2-object-detection-2023-03-24-21-59-37-771


2023-03-24 21:59:39 Starting - Starting the training job.........
2023-03-24 22:01:20 Starting - Preparing the instances for training......
2023-03-24 22:02:22 Downloading - Downloading input data...
2023-03-24 22:02:42 Training - Downloading the training image...............
2023-03-24 22:05:08 Training - Training image download completed. Training in progress...
2023-03-24 22:05:37,455 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-03-24 22:05:37,489 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-03-24 22:05:37,522 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-03-24 22:05:37,536 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "train": "/opt/ml/input/data/train",
        "val": "/opt/ml/input/da

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0324 22:05:45.869502 139860041893696 mirrored_strategy.py:374] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: 10000
I0324 22:05:45.873974 139860041893696 config_util.py:552] Maybe overwriting train_steps: 10000
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0324 22:05:45.874117 139860041893696 config_util.py:552] Maybe overwriting use_bfloat16: False
Instructions for updating:
rename to distribute_datasets_from_function
W0324 22:05:45.905601 139860041893696 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future versi

INFO:tensorflow:Step 100 per-step time 0.638s
I0324 22:07:31.836306 139860041893696 model_lib_v2.py:705] Step 100 per-step time 0.638s
INFO:tensorflow:{'Loss/classification_loss': 0.27433845,
 'Loss/localization_loss': 0.45661312,
 'Loss/regularization_loss': 0.14830008,
 'Loss/total_loss': 0.8792516,
 'learning_rate': 0.0005}
I0324 22:07:31.836710 139860041893696 model_lib_v2.py:708] {'Loss/classification_loss': 0.27433845,
 'Loss/localization_loss': 0.45661312,
 'Loss/regularization_loss': 0.14830008,
 'Loss/total_loss': 0.8792516,
 'learning_rate': 0.0005}
INFO:tensorflow:Step 200 per-step time 0.196s
I0324 22:07:51.418323 139860041893696 model_lib_v2.py:705] Step 200 per-step time 0.196s
INFO:tensorflow:{'Loss/classification_loss': 0.22640929,
 'Loss/localization_loss': 0.3284973,
 'Loss/regularization_loss': 0.14691202,
 'Loss/total_loss': 0.70181865,
 'learning_rate': 0.0005}
I0324 22:07:51.418702 139860041893696

INFO:tensorflow:Step 1500 per-step time 0.197s
I0324 22:12:10.302374 139860041893696 model_lib_v2.py:705] Step 1500 per-step time 0.197s
INFO:tensorflow:{'Loss/classification_loss': 0.14305441,
 'Loss/localization_loss': 0.21356262,
 'Loss/regularization_loss': 0.13264716,
 'Loss/total_loss': 0.4892642,
 'learning_rate': 0.0005}
I0324 22:12:10.302746 139860041893696 model_lib_v2.py:708] {'Loss/classification_loss': 0.14305441,
 'Loss/localization_loss': 0.21356262,
 'Loss/regularization_loss': 0.13264716,
 'Loss/total_loss': 0.4892642,
 'learning_rate': 0.0005}
INFO:tensorflow:Step 1600 per-step time 0.198s
I0324 22:12:30.053666 139860041893696 model_lib_v2.py:705] Step 1600 per-step time 0.198s
INFO:tensorflow:{'Loss/classification_loss': 0.14027408,
 'Loss/localization_loss': 0.15374522,
 'Loss/regularization_loss': 0.13184035,
 'Loss/total_loss': 0.42585963,
 'learning_rate': 0.0005}
I0324 22:12:30.053987 1398600418

INFO:tensorflow:Step 2900 per-step time 0.199s
I0324 22:16:49.402247 139860041893696 model_lib_v2.py:705] Step 2900 per-step time 0.199s
INFO:tensorflow:{'Loss/classification_loss': 0.12295121,
 'Loss/localization_loss': 0.18839437,
 'Loss/regularization_loss': 0.125071,
 'Loss/total_loss': 0.43641657,
 'learning_rate': 1e-04}
I0324 22:16:49.402628 139860041893696 model_lib_v2.py:708] {'Loss/classification_loss': 0.12295121,
 'Loss/localization_loss': 0.18839437,
 'Loss/regularization_loss': 0.125071,
 'Loss/total_loss': 0.43641657,
 'learning_rate': 1e-04}
INFO:tensorflow:Step 3000 per-step time 0.197s
I0324 22:17:09.058851 139860041893696 model_lib_v2.py:705] Step 3000 per-step time 0.197s
INFO:tensorflow:{'Loss/classification_loss': 0.12847449,
 'Loss/localization_loss': 0.17361991,
 'Loss/regularization_loss': 0.12479824,
 'Loss/total_loss': 0.42689264,
 'learning_rate': 1e-04}
I0324 22:17:09.059189 139860041893696

INFO:tensorflow:Step 4300 per-step time 0.198s
I0324 22:21:29.528682 139860041893696 model_lib_v2.py:705] Step 4300 per-step time 0.198s
INFO:tensorflow:{'Loss/classification_loss': 0.12251886,
 'Loss/localization_loss': 0.16919112,
 'Loss/regularization_loss': 0.121373296,
 'Loss/total_loss': 0.41308329,
 'learning_rate': 1e-04}
I0324 22:21:29.528993 139860041893696 model_lib_v2.py:708] {'Loss/classification_loss': 0.12251886,
 'Loss/localization_loss': 0.16919112,
 'Loss/regularization_loss': 0.121373296,
 'Loss/total_loss': 0.41308329,
 'learning_rate': 1e-04}
INFO:tensorflow:Step 4400 per-step time 0.196s
I0324 22:21:49.151745 139860041893696 model_lib_v2.py:705] Step 4400 per-step time 0.196s
INFO:tensorflow:{'Loss/classification_loss': 0.11309688,
 'Loss/localization_loss': 0.15531996,
 'Loss/regularization_loss': 0.12113383,
 'Loss/total_loss': 0.38955066,
 'learning_rate': 1e-04}
I0324 22:21:49.152104 139860041

INFO:tensorflow:Step 5700 per-step time 0.198s
I0324 22:26:08.111973 139860041893696 model_lib_v2.py:705] Step 5700 per-step time 0.198s
INFO:tensorflow:{'Loss/classification_loss': 0.09271584,
 'Loss/localization_loss': 0.109492145,
 'Loss/regularization_loss': 0.11891102,
 'Loss/total_loss': 0.321119,
 'learning_rate': 5e-05}
I0324 22:26:08.112334 139860041893696 model_lib_v2.py:708] {'Loss/classification_loss': 0.09271584,
 'Loss/localization_loss': 0.109492145,
 'Loss/regularization_loss': 0.11891102,
 'Loss/total_loss': 0.321119,
 'learning_rate': 5e-05}
INFO:tensorflow:Step 5800 per-step time 0.198s
I0324 22:26:27.955020 139860041893696 model_lib_v2.py:705] Step 5800 per-step time 0.198s
INFO:tensorflow:{'Loss/classification_loss': 0.11338841,
 'Loss/localization_loss': 0.13082312,
 'Loss/regularization_loss': 0.118792295,
 'Loss/total_loss': 0.36300382,
 'learning_rate': 5e-05}
I0324 22:26:27.955332 139860041893

INFO:tensorflow:Step 7100 per-step time 0.217s
I0324 22:30:49.517207 139860041893696 model_lib_v2.py:705] Step 7100 per-step time 0.217s
INFO:tensorflow:{'Loss/classification_loss': 0.10446587,
 'Loss/localization_loss': 0.14171587,
 'Loss/regularization_loss': 0.11734128,
 'Loss/total_loss': 0.363523,
 'learning_rate': 5e-05}
I0324 22:30:49.517566 139860041893696 model_lib_v2.py:708] {'Loss/classification_loss': 0.10446587,
 'Loss/localization_loss': 0.14171587,
 'Loss/regularization_loss': 0.11734128,
 'Loss/total_loss': 0.363523,
 'learning_rate': 5e-05}
INFO:tensorflow:Step 7200 per-step time 0.199s
I0324 22:31:09.388001 139860041893696 model_lib_v2.py:705] Step 7200 per-step time 0.199s
INFO:tensorflow:{'Loss/classification_loss': 0.08250409,
 'Loss/localization_loss': 0.070177905,
 'Loss/regularization_loss': 0.11723688,
 'Loss/total_loss': 0.2699189,
 'learning_rate': 5e-05}
I0324 22:31:09.388311 139860041893696

INFO:tensorflow:Step 8500 per-step time 0.198s
I0324 22:35:28.429512 139860041893696 model_lib_v2.py:705] Step 8500 per-step time 0.198s
INFO:tensorflow:{'Loss/classification_loss': 0.085484244,
 'Loss/localization_loss': 0.112347655,
 'Loss/regularization_loss': 0.116702035,
 'Loss/total_loss': 0.31453395,
 'learning_rate': 1e-05}
I0324 22:35:28.429796 139860041893696 model_lib_v2.py:708] {'Loss/classification_loss': 0.085484244,
 'Loss/localization_loss': 0.112347655,
 'Loss/regularization_loss': 0.116702035,
 'Loss/total_loss': 0.31453395,
 'learning_rate': 1e-05}
INFO:tensorflow:Step 8600 per-step time 0.198s
I0324 22:35:48.235233 139860041893696 model_lib_v2.py:705] Step 8600 per-step time 0.198s
INFO:tensorflow:{'Loss/classification_loss': 0.093264736,
 'Loss/localization_loss': 0.11470249,
 'Loss/regularization_loss': 0.11667985,
 'Loss/total_loss': 0.32464707,
 'learning_rate': 1e-05}
I0324 22:35:48.235552 1398

INFO:tensorflow:Step 9900 per-step time 0.197s
I0324 22:40:06.366255 139860041893696 model_lib_v2.py:705] Step 9900 per-step time 0.197s
INFO:tensorflow:{'Loss/classification_loss': 0.083399944,
 'Loss/localization_loss': 0.10496391,
 'Loss/regularization_loss': 0.11639039,
 'Loss/total_loss': 0.30475426,
 'learning_rate': 1e-05}
I0324 22:40:06.366652 139860041893696 model_lib_v2.py:708] {'Loss/classification_loss': 0.083399944,
 'Loss/localization_loss': 0.10496391,
 'Loss/regularization_loss': 0.11639039,
 'Loss/total_loss': 0.30475426,
 'learning_rate': 1e-05}
INFO:tensorflow:Step 10000 per-step time 0.197s
I0324 22:40:26.076305 139860041893696 model_lib_v2.py:705] Step 10000 per-step time 0.197s
INFO:tensorflow:{'Loss/classification_loss': 0.115165,
 'Loss/localization_loss': 0.14037831,
 'Loss/regularization_loss': 0.1163688,
 'Loss/total_loss': 0.37191212,
 'learning_rate': 1e-05}
I0324 22:40:26.076652 1398600418

INFO:tensorflow:Finished eval step 100
I0324 22:41:18.314298 140197592700736 model_lib_v2.py:966] Finished eval step 100
INFO:tensorflow:Finished eval step 200
I0324 22:41:23.552092 140197592700736 model_lib_v2.py:966] Finished eval step 200
INFO:tensorflow:Performing evaluation on 258 images.
I0324 22:41:26.500505 140197592700736 coco_evaluation.py:293] Performing evaluation on 258 images.
INFO:tensorflow:Loading and preparing annotation results...
I0324 22:41:26.505082 140197592700736 coco_tools.py:116] Loading and preparing annotation results...
INFO:tensorflow:DONE (t=0.02s)
I0324 22:41:26.520691 140197592700736 coco_tools.py:138] DONE (t=0.02s)
INFO:tensorflow:Eval metrics at step 10000
I0324 22:41:39.781741 140197592700736 model_lib_v2.py:1015] Eval metrics at step 10000
INFO:tensorflow:#011+ DetectionBoxes_Precision/mAP: 0.123258
I0324 22:41:39.795719 1401975

W0324 22:42:28.828684 139847185581888 save.py:271] Found untraced functions such as WeightSharedConvolutionalBoxPredictor_layer_call_fn, WeightSharedConvolutionalBoxPredictor_layer_call_and_return_conditional_losses, WeightSharedConvolutionalBoxHead_layer_call_fn, WeightSharedConvolutionalBoxHead_layer_call_and_return_conditional_losses, WeightSharedConvolutionalClassHead_layer_call_fn while saving (showing 5 of 173). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: /tmp/exported/saved_model/assets
I0324 22:42:35.196313 139847185581888 builder_impl.py:797] Assets written to: /tmp/exported/saved_model/assets
INFO:tensorflow:Writing pipeline config file to /tmp/exported/pipeline.config
I0324 22:42:36.584451 139847185581888 config_util.py:253] Writing pipeline config file to /tmp/exported/pipeline.config
2023-03-24 22:42:37,960 sagemaker-training-toolkit INFO     Reporting training SUCCESS

You should be able to see your training job in the AWS console, as shown below:
![ECR Example](../data/example_trainings.png)
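To compare runs without scrolling through the raw output, the loss values and the final mAP can be pulled out of the job log with a few regular expressions. The sketch below assumes the output has been saved as plain text and that its lines look like the ones recorded in this notebook. `parse_training_log` is a hypothetical helper, and the sample contains only a few lines copied from that output.

```python
import re

STEP_RE = re.compile(r"Step (\d+) per-step time ([\d.]+)s")
LOSS_RE = re.compile(r"'Loss/total_loss': ([\d.eE+-]+)")
MAP_RE = re.compile(r"DetectionBoxes_Precision/mAP: ([\d.]+)")


def parse_training_log(text):
    """Return ({step: total_loss}, mAP or None) extracted from the log text."""
    losses = {}
    current_step = None
    for line in text.splitlines():
        step_match = STEP_RE.search(line)
        if step_match:
            current_step = int(step_match.group(1))
            continue
        loss_match = LOSS_RE.search(line)
        if loss_match and current_step is not None:
            losses[current_step] = float(loss_match.group(1))
    map_match = MAP_RE.search(text)
    return losses, (float(map_match.group(1)) if map_match else None)


sample_log = """\
INFO:tensorflow:Step 100 per-step time 0.638s
INFO:tensorflow:{'Loss/classification_loss': 0.27433845,
 'Loss/total_loss': 0.8792516,
 'learning_rate': 0.0005}
INFO:tensorflow:Step 10000 per-step time 0.197s
INFO:tensorflow:{'Loss/classification_loss': 0.115165,
 'Loss/total_loss': 0.37191212,
 'learning_rate': 1e-05}
INFO:tensorflow:#011+ DetectionBoxes_Precision/mAP: 0.123258
"""

losses, mean_ap = parse_training_log(sample_log)
print(losses, mean_ap)
```

Each message appears twice in the real output (once per logger), which is harmless here because the second occurrence simply overwrites the same dictionary entry.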


In [10]:
import sagemaker
sagemaker_session = sagemaker.Session()
aws_region = sagemaker_session.boto_region_name
print('aws_region={}'.format(aws_region))

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


aws_region=us-east-1


In [11]:
job_artifacts_path = estimator.latest_job_tensorboard_artifacts_path()

!echo "pip install 'tensorflow<2.4'"
!echo "pip install 'tensorflow-io<2.4'"
!echo "pip install 'tensorboard<2.4'"
!echo "AWS_REGION={aws_region}"
!echo "tensorboard --logdir={job_artifacts_path}"

pip install 'tensorflow<2.4'
pip install 'tensorflow-io<2.4'
pip install 'tensorboard<2.4'
AWS_REGION=us-east-1
tensorboard --logdir=s3://object-detection-project-jckuri/logs/tf2-object-detection-2023-03-24-21-59-37-771/tensorboard-output


In [12]:
class URL:
    
    def __init__(self, url):
        self.url = url
    
    def _repr_html_(self):
        return '<a href="{}">{}</a>'.format(self.url, self.url)

In [13]:
# https://<notebook instance hostname>/proxy/6006/
jupyter_notebook_url = 'https://object-detection-project-a5lc.notebook.us-east-1.sagemaker.aws'
url = '{}/proxy/6006/'.format(jupyter_notebook_url)

o = URL(url)
o

## Improve on the initial model

Most likely, this initial experiment did not yield optimal results. However, you can make multiple changes to the `pipeline.config` file to improve the model. One obvious change is to improve the data augmentation strategy. The [`preprocessor.proto`](https://github.com/tensorflow/models/blob/master/research/object_detection/protos/preprocessor.proto) file lists the data augmentation methods available in the TF Object Detection API. Justify your choice of augmentations in the writeup.
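Augmentations are declared as `data_augmentation_options` blocks inside the `train_config` section of `pipeline.config`. The sketch below renders such blocks from a dictionary so that variants are easy to generate and compare. `render_augmentations` is a hypothetical helper, and the selection of options and their values is only an illustration, not a recommendation for this dataset.

```python
def render_augmentations(options):
    """Render {name: {field: value}} as `data_augmentation_options` blocks."""
    blocks = []
    for name, fields in options.items():
        lines = ["  data_augmentation_options {", f"    {name} {{"]
        lines += [f"      {key}: {value}" for key, value in fields.items()]
        lines += ["    }", "  }"]
        blocks.append("\n".join(lines))
    return "\n".join(blocks)


# Option and field names as defined in preprocessor.proto; values are illustrative.
augmentations = {
    "random_horizontal_flip": {},
    "random_adjust_brightness": {"max_delta": 0.2},
    "random_adjust_contrast": {"min_delta": 0.8, "max_delta": 1.25},
    "random_rgb_to_gray": {"probability": 0.1},
}

print(render_augmentations(augmentations))
```

The printed text can be pasted into `train_config { ... }` by hand; check each field name against `preprocessor.proto` before training.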

Keep in mind that the following options are also available:
* experiment with the optimizer: type of optimizer, learning rate, scheduler, etc.
* experiment with the architecture. The TF Object Detection API model zoo offers many architectures. Keep in mind that the `pipeline.config` file is specific to each architecture, so you will have to edit it.
* visualize results on the test frames using the `2_deploy_model` notebook available in this repository.
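The learning rate in the training output recorded above drops in discrete steps (0.0005, then 1e-04, 5e-05 and 1e-05), which is the behaviour of a piecewise-constant schedule such as the API's `manual_step_learning_rate`. The sketch below reproduces that behaviour in plain Python to help reason about alternative schedules. The boundary steps are hypothetical, since the output does not show where the rate changes, and `manual_step_lr` is an illustrative helper rather than API code.

```python
import bisect


def manual_step_lr(step, initial_lr, schedule):
    """Piecewise-constant rate; `schedule` is a sorted list of (start_step, rate)."""
    boundaries = [start for start, _ in schedule]
    index = bisect.bisect_right(boundaries, step)
    return initial_lr if index == 0 else schedule[index - 1][1]


# Rates match the values seen in the training output; boundaries are hypothetical.
schedule = [(2000, 1e-4), (5000, 5e-5), (8000, 1e-5)]

for step in (100, 2900, 5700, 10000):
    print(step, manual_step_lr(step, 5e-4, schedule))
```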

In the cell below, write down the approaches you experimented with, why you chose them, and what you would have done with more time and resources. Justify your choices using the TensorBoard visualizations (take screenshots and insert them in your writeup), the metrics on the evaluation set, and the animation you generated with [this tool](../2_run_inference/2_deploy_model.ipynb).

In [14]:
# your writeup goes here.