# TensorFlow Object Detection API and AWS SageMaker

In this notebook, you will train and evaluate different models using the [TensorFlow Object Detection API](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/) and [AWS SageMaker](https://aws.amazon.com/sagemaker/).

If you ever feel stuck, you can refer to this [tutorial](https://aws.amazon.com/blogs/machine-learning/training-and-deploying-models-using-tensorflow-2-with-the-object-detection-api-on-amazon-sagemaker/).

## Dataset

We are using the [Waymo Open Dataset](https://waymo.com/open/) for this project. The dataset has already been exported in the TFRecord format, following the layout described [here](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#create-tensorflow-records). The data is stored on [AWS S3](https://aws.amazon.com/s3/), the AWS object storage service. The images are saved at a resolution of 640x640.
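
To make the TFRecord container format more concrete, here is a minimal sketch that walks through the records of a local file using only the standard library. It assumes you have copied one of the dataset files to local disk, and it skips the CRC checks that a real reader performs. Each yielded payload is a serialized `tf.train.Example`; decoding it still requires TensorFlow (for instance via `tf.data.TFRecordDataset`). The function names are illustrative and not part of the project code.

```python
import struct
from typing import Iterator


def iter_tfrecord_payloads(path: str) -> Iterator[bytes]:
    """Yield the raw payload of each record in a TFRecord file.

    Record layout: uint64 length, uint32 CRC of the length,
    `length` payload bytes, uint32 CRC of the payload (little-endian).
    The two CRC fields are read but NOT verified in this sketch.
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # 8-byte length + 4-byte length CRC
            if not header:
                return  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header[:8])
            payload = f.read(length)
            footer = f.read(4)  # payload CRC, ignored here
            if len(payload) < length or len(footer) < 4:
                raise ValueError("truncated record body")
            yield payload


def count_records(path: str) -> int:
    """Count the records in a TFRecord file without decoding them."""
    return sum(1 for _ in iter_tfrecord_payloads(path))
```

Calling `count_records` on a downloaded file gives a quick sanity check of how many examples it contains before you launch a training job.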

In [10]:
%%capture
%pip install tensorflow_io sagemaker -U

In [11]:
import os
import sagemaker
from sagemaker.estimator import Estimator
from framework import CustomFramework

Save the IAM role in a variable called `role`. It will be needed when training the model.

In [12]:
role = sagemaker.get_execution_role()
print(role)

arn:aws:iam::251792761470:role/service-role/AmazonSageMaker-ExecutionRole-20240817T122151


In [13]:
# The train and val paths below are public S3 buckets created by Udacity for this project
inputs = {'train': 's3://cd2688-object-detection-tf2/train/', 
          'val': 's3://cd2688-object-detection-tf2/val/'} 

# Insert path of a folder in your personal S3 bucket to store tensorboard logs.
tensorboard_s3_prefix = 's3://object-detection-in-urban-env/logs/'
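
A malformed S3 path usually only surfaces once the training job has started, so it can be worth checking the URIs up front. The sketch below splits each URI into bucket and key prefix with the standard library; `split_s3_uri` is a hypothetical helper, and it only validates the syntax, not whether the bucket exists or is readable.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/prefix/' into (bucket, prefix); raise on other schemes."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Same channel names and paths as in the cell above
inputs = {'train': 's3://cd2688-object-detection-tf2/train/',
          'val': 's3://cd2688-object-detection-tf2/val/'}

for channel, uri in inputs.items():
    bucket, prefix = split_s3_uri(uri)
    print(f"{channel}: bucket={bucket} prefix={prefix}")
```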


## Container

To train the model, you first need to build a [Docker](https://www.docker.com/) container with all the dependencies required by the TF Object Detection API. The code below does the following:
* clones the TensorFlow models repository
* copies the exporter and training scripts from the repository
* builds the Docker image and pushes it to ECR
* prints the container name

In [14]:
%%bash

# clone the repo and get the scripts
git clone https://github.com/tensorflow/models.git docker/models

# get model_main and exporter_main files from TF2 Object Detection GitHub repository
cp docker/models/research/object_detection/exporter_main_v2.py source_dir 
cp docker/models/research/object_detection/model_main_tf2.py source_dir

fatal: destination path 'docker/models' already exists and is not an empty directory.


In [23]:
# build and push the docker image. This code can be commented out after being run once.
# This will take around 10 mins.
image_name = 'tf2-object-detection'
!sh ./docker/build_and_push.sh $image_name

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Building image with name tf2-object-detection
Sending build context to Docker daemon  736.8MB
Step 1/14 : FROM tensorflow/tensorflow:2.13.0-gpu
 ---> 6bdca089cc38
Step 2/14 : ARG DEBIAN_FRONTEND=noninteractive
 ---> Running in 69ac4a5c3861
Removing intermediate container 69ac4a5c3861
 ---> 4c389ce3d026
Step 3/14 : RUN apt-get update && apt-get install -y     git     gpg-agent     python3-cairocffi     protobuf-compiler     python3-pil     python3-lxml     python3-tk     libgl1-mesa-dev     wget
 ---> Running in 913224f1299b

    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-24.2
Removing intermediate container ffec4da38487
 ---> 2905cea0cb78
Step 11/14 : RUN python -m pip install .
 ---> Running in 88fbe929129a
Processing /home/tensorflow/models/research
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'

Removing intermediate container 88fbe929129a
 ---> d2bb07774c39
Step 12/14 : ENV TF_CPP_MIN_LOG_LEVEL 3
 ---> Running in d2311bac954b
Removing intermediate container d2311bac954b
 ---> 9fcbb4a632f0
Step 13/14 : RUN pip3 install sagemaker-training
 ---> Running in 363771f69805

Removing intermediate container a9a6ed1674bc
 ---> 2b56fa5aae8d
Successfully built 2b56fa5aae8d
Successfully tagged tf2-object-detection:latest
Pushing image to ECR 251792761470.dkr.ecr.us-east-1.amazonaws.com/tf2-object-detection:20240817150551
The push refers to repository [251792761470.dkr.ecr.us-east-1.amazonaws.com/tf2-object-detection]


To verify that the image was correctly pushed to the [Elastic Container Registry](https://aws.amazon.com/ecr/), you can look at it in the AWS console. For example, the screenshot below shows three different images pushed to ECR. You should see only one, called `tf2-object-detection`.
![ECR Example](../data/example_ecr.png)


In [24]:
# read and display the container name
with open(os.path.join('docker', 'ecr_image_fullname.txt'), 'r') as f:
    # strip() removes the trailing newline without cutting a character if there is none
    container = f.readline().strip()

print(container)

251792761470.dkr.ecr.us-east-1.amazonaws.com/tf2-object-detection:20240817150551
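
The container name printed above is a full ECR image URI. As a small illustration of how it is put together, the sketch below splits such a URI into account ID, region, repository, and tag. The regular expression is an assumption based on the URI shown in this notebook, not an official grammar, and `parse_ecr_uri` is a hypothetical helper.

```python
import re

# Pattern inferred from the URI printed above:
# <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_uri(uri):
    """Return a dict with the account, region, repository and tag of an ECR image URI."""
    match = ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"unexpected ECR image URI: {uri!r}")
    return match.groupdict()


parts = parse_ecr_uri(
    "251792761470.dkr.ecr.us-east-1.amazonaws.com/tf2-object-detection:20240817150551"
)
print(parts["repository"], parts["tag"])
```

Checking that the repository and tag match what you just built is a cheap way to avoid launching a job with a stale image.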


## Pre-trained model from model zoo

As is common practice, we are not training from scratch: we will use a pretrained model from the TF Object Detection model zoo. You can find pretrained checkpoints [here](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md). Because your time is limited for this project, we recommend experimenting only with the following models:
* SSD MobileNet V2 FPNLite 640x640
* SSD ResNet50 V1 FPN 640x640 (RetinaNet50)
* Faster R-CNN ResNet50 V1 640x640
* EfficientDet D1 640x640
* Faster R-CNN ResNet152 V1 640x640

In the code below, the EfficientDet D1 model is downloaded and extracted. Adjust this code if you experiment with other architectures.

In [None]:
%%bash

# Delete the checkpoint directories if they already exist, then recreate them.
# Inside a %%bash cell, commands must not be prefixed with "!".
rm -rf /tmp/checkpoint
rm -rf source_dir/checkpoint

mkdir -p /tmp/checkpoint
mkdir -p source_dir/checkpoint

# EfficientDet D1 640x640
wget -O /tmp/efficientdet.tar.gz http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz
tar -zxvf /tmp/efficientdet.tar.gz --strip-components 2 --directory source_dir/checkpoint efficientdet_d1_coco17_tpu-32/checkpoint

--2024-08-17 15:12:56--  http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 142.251.179.207, 172.253.62.207, 172.253.63.207, ...
Connecting to download.tensorflow.org (download.tensorflow.org)|142.251.179.207|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51839363 (49M) [application/x-tar]
Saving to: ‘/tmp/efficientdet.tar.gz’

efficientdet_d1_coco17_tpu-32/checkpoint/ckpt-0.data-00000-of-00001
efficientdet_d1_coco17_tpu-32/checkpoint/checkpoint
efficientdet_d1_coco17_tpu-32/checkpoint/ckpt-0.index
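
The `tar` call above keeps only the `checkpoint` folder of the archive and drops its two leading path components. If you prefer to do this step in Python, for example to loop over several architectures, the same idea can be sketched with the standard `tarfile` module. `extract_subdir` is a hypothetical helper, not part of the project code, and it copies regular files only.

```python
import shutil
import tarfile
from pathlib import Path, PurePosixPath


def extract_subdir(archive_path, subdir, dest):
    """Extract the regular files under `subdir` of a .tar.gz archive into `dest`.

    The `subdir` prefix is dropped from the extracted paths, which mirrors
    `tar --strip-components`. Returns the sorted relative paths that were written.
    """
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    prefix = PurePosixPath(subdir)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            path = PurePosixPath(member.name)
            if not member.isfile() or prefix not in path.parents:
                continue  # skip directories, links and files outside `subdir`
            relative = path.relative_to(prefix)
            if ".." in relative.parts:
                continue  # never write outside `dest`
            target = dest_path / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(str(relative))
    return sorted(extracted)
```

With the archive downloaded above, `extract_subdir('/tmp/efficientdet.tar.gz', 'efficientdet_d1_coco17_tpu-32/checkpoint', 'source_dir/checkpoint')` should produce the same three files as the `tar` command.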


## Edit pipeline.config file

The [`pipeline.config`](source_dir/pipeline.config) in the `source_dir` folder should be updated when you experiment with different models. The different config files are available [here](https://github.com/tensorflow/models/tree/master/research/object_detection/configs/tf2).
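
When switching architectures, the fields that most often need attention are `num_classes`, `batch_size`, `fine_tune_checkpoint`, and `fine_tune_checkpoint_type`. The sketch below patches such scalar fields in the text of a config file with a regular expression. It is a text-level shortcut under stated assumptions: it rewrites every occurrence of a field (for example `batch_size` in both the train and eval sections), and the sample config and the values used here are placeholders, not the project's actual settings. The Object Detection API ships its own config utilities, which are the more robust route.

```python
import re


def set_config_field(config_text, field, value):
    """Replace every `field: <old value>` in a pipeline.config text with `value`."""
    if isinstance(value, str):
        rendered = '"' + value + '"'          # strings are quoted in the config
    elif isinstance(value, bool):
        rendered = "true" if value else "false"
    else:
        rendered = str(value)
    # \b keeps `batch_size` from matching inside e.g. `eval_batch_size`
    pattern = re.compile(r"(\b" + re.escape(field) + r":\s*)(\"[^\"]*\"|\S+)")
    new_text, count = pattern.subn(lambda m: m.group(1) + rendered, config_text)
    if count == 0:
        raise KeyError(f"field not found: {field}")
    return new_text


# Placeholder config fragment, for illustration only
sample = '''
model { ssd { num_classes: 90 } }
train_config {
  batch_size: 128
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED"
  fine_tune_checkpoint_type: "classification"
}
'''

patched = sample
for field, value in [("num_classes", 3),
                     ("batch_size", 8),
                     ("fine_tune_checkpoint", "checkpoint/ckpt-0"),
                     ("fine_tune_checkpoint_type", "detection")]:
    patched = set_config_field(patched, field, value)
print(patched)
```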

>Note: The provided `pipeline.config` file works well with the `EfficientDet` model. You would need to modify it when working with other models.

## Launch Training Job

Now that we have a dataset, a Docker image, and pretrained model weights, we can launch the training job. To do so, we create a [SageMaker Framework](https://sagemaker.readthedocs.io/en/stable/frameworks/index.html), where we specify the container name, the name of the config file, the number of training steps, and so on.

The `run_training.sh` script does the following:
* train the model for `num_train_steps` 
* evaluate over the val dataset
* export the model

Different metrics will be displayed during the evaluation phase, including the mean average precision. You can use these metrics to quantify your model's performance and compare it across iterations.

You can also monitor the training progress by navigating to **Training -> Training Jobs** from the Amazon SageMaker dashboard in the Web UI.
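
The training job also prints per-step timings and losses to the notebook, in lines such as `Step 300 per-step time 0.687s` followed by a dictionary of loss values. If you want these numbers in a structured form, for example to plot a loss curve, a small parser is enough. The sketch below assumes the log format shown in this notebook's output and that ANSI color codes do not split the numbers; `parse_training_log` is a hypothetical helper.

```python
import re

STEP_RE = re.compile(r"Step (\d+) per-step time ([\d.]+)s")
METRIC_RE = re.compile(r"'(Loss/\w+|learning_rate)': ([\d.eE+-]+)")


def parse_training_log(text):
    """Return {step: {'per_step_time': float, '<metric name>': float, ...}}."""
    records = {}
    current = None
    for line in text.splitlines():
        step_match = STEP_RE.search(line)
        if step_match:
            # Each message is logged twice; setdefault merges the duplicates
            current = records.setdefault(int(step_match.group(1)), {})
            current["per_step_time"] = float(step_match.group(2))
        if current is not None:
            for name, value in METRIC_RE.findall(line):
                current[name] = float(value)
    return records


# Excerpt copied from the training output of this notebook
sample_log = """INFO:tensorflow:Step 300 per-step time 0.687s
INFO:tensorflow:{'Loss/classification_loss': 0.33960986,
 'Loss/localization_loss': 0.021012666,
 'Loss/regularization_loss': 0.029552536,
 'Loss/total_loss': 0.39017507,
 'learning_rate': 0.010480001}
"""

for step, metrics in parse_training_log(sample_log).items():
    print(step, metrics["Loss/total_loss"], metrics["per_step_time"])
```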

In [28]:
#Pre-setting Training job
!ls source_dir/checkpoint  #check Checkpoint


checkpoint  ckpt-0.data-00000-of-00001	ckpt-0.index


In [30]:

tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=tensorboard_s3_prefix,
    container_local_output_path='/opt/training/'
)

estimator = CustomFramework(
    role=role,
    image_uri=container,
    entry_point='run_training.sh',
    source_dir='source_dir/',
    hyperparameters={
        "model_dir": "/opt/training",        
        "pipeline_config_path": "Model_EfficientNet_pipeline.config",
        "num_train_steps": "2000",    
        "sample_1_of_n_eval_examples": "1"
    },
    instance_count=1,
    instance_type='ml.g5.xlarge',
    tensorboard_output_config=tensorboard_output_config,
    disable_profiler=True,
    base_job_name='tf2-object-detection'
)

estimator.fit(inputs)

INFO:sagemaker:Creating training-job with name: tf2-object-detection-2024-08-17-15-15-49-969


2024-08-17 15:15:52 Starting - Starting the training job
2024-08-17 15:15:52 Pending - Training job waiting for capacity.........
2024-08-17 15:17:24 Pending - Preparing the instances for training......
2024-08-17 15:18:25 Downloading - Downloading the training image..................
2024-08-17 15:21:02 Training - Training image download completed. Training in progress.[34m2024-08-17 15:21:18,954 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2024-08-17 15:21:18,988 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2024-08-17 15:21:19,023 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2024-08-17 15:21:19,039 sagemaker-training-toolkit INFO     Invoking user script[0m
[34mTraining Env:[0m
[34m{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "train": "/opt/ml/input/data/train",
        "val": "/opt/ml/i

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0817 15:21:26.956126 140065013004096 mirrored_strategy.py:419] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: 2000
I0817 15:21:27.267271 140065013004096 config_util.py:552] Maybe overwriting train_steps: 2000
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0817 15:21:27.267453 140065013004096 config_util.py:552] Maybe overwriting use_bfloat16: False
I0817 15:21:27.279422 140065013004096 ssd_efficientnet_bifpn_feature_extractor.py:161] EfficientDet EfficientNet backbone version: efficientnet-b1
I0817 15:21:27.279535 140065013004096 ssd_efficientnet_bifpn_feature_extractor.py:163] EfficientDet BiFPN num filters: 88
I0817 15:21:27.279612 140065013004096 ssd_efficientnet_bifpn_feature_extractor.py:164] EfficientDet BiFPN 

Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W0817 15:21:37.731389 140065013004096 deprecation.py:364] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
Instructions for updating:
Use `tf.cast` instead.
W0817 15:21:41.648863 140065013004096 deprecation.py:364] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
I0817 15:21:51.507500 140041682671360 api.py:460] feature_map_spatial_dims: [(80, 80), (40, 40), (20, 20), (

INFO:tensorflow:Step 300 per-step time 0.687s
I0817 15:27:10.603657 140065013004096 model_lib_v2.py:705] Step 300 per-step time 0.687s
INFO:tensorflow:{'Loss/classification_loss': 0.33960986,
 'Loss/localization_loss': 0.021012666,
 'Loss/regularization_loss': 0.029552536,
 'Loss/total_loss': 0.39017507,
 'learning_rate': 0.010480001}
I0817 15:27:10.604004 140065013004096 model_lib_v2.py:708] {'Loss/classification_loss': 0.33960986,
 'Loss/localization_loss': 0.021012666,
 'Loss/regularization_loss': 0.029552536,
 'Loss/total_loss': 0.39017507,
 'learning_rate': 0.010480001}
INFO:tensorflow:Step 400 per-step time 0.686s
I0817 15:28:19.241195 140065013004096 model_lib_v2.py:705] Step 400 per-step time 0.686s
INFO:tensorflow:{'Loss/classification_loss': 0.36486185,
 'Loss/localization_loss': 0.021611063,
 'Loss/regularization_loss': 0.029558122,
 'Loss/total_loss': 0.41603103,
 'learning_rate': 0.0136400005}
I0817 15:28:

INFO:tensorflow:Step 1700 per-step time 0.685s
I0817 15:43:13.969744 140065013004096 model_lib_v2.py:705] Step 1700 per-step time 0.685s
INFO:tensorflow:{'Loss/classification_loss': 0.28184512,
 'Loss/localization_loss': 0.016337572,
 'Loss/regularization_loss': 0.03022418,
 'Loss/total_loss': 0.32840687,
 'learning_rate': 0.05472}
I0817 15:43:13.970079 140065013004096 model_lib_v2.py:708] {'Loss/classification_loss': 0.28184512,
 'Loss/localization_loss': 0.016337572,
 'Loss/regularization_loss': 0.03022418,
 'Loss/total_loss': 0.32840687,
 'learning_rate': 0.05472}
INFO:tensorflow:Step 1800 per-step time 0.685s
I0817 15:44:22.479382 140065013004096 model_lib_v2.py:705] Step 1800 per-step time 0.685s
INFO:tensorflow:{'Loss/classification_loss': 0.22740836,
 'Loss/localization_loss': 0.012269204,
 'Loss/regularization_loss': 0.030331511,
 'Loss/total_loss': 0.27000907,
 'learning_rate': 0.05788}
I0817 15:44:22.479735 1

Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W0817 15:46:57.754302 139849489086272 deprecation.py:364] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
Instructions for updating:
Use `tf.cast` instead.
W0817 15:46:59.163041 139849489086272 deprecation.py:364] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
INFO:tensorflow:Waiting for new checkpoint at /opt/training
I0817 15:47:01.744731 139849489086272 c

INFO:tensorflow:Waiting for new checkpoint at /opt/training
I0817 15:52:01.845653 139849489086272 checkpoint_utils.py:168] Waiting for new checkpoint at /opt/training
INFO:tensorflow:Timed-out waiting for a checkpoint.
I0817 15:52:10.861068 139849489086272 checkpoint_utils.py:231] Timed-out waiting for a checkpoint.
creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=9.63s).
Accumulating evaluation results...
DONE (t=0.25s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.093
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.229
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.061
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.039
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=med

INFO:tensorflow:Assets written to: /tmp/exported/saved_model/assets
I0817 15:54:03.576756 140083899914048 builder_impl.py:804] Assets written to: /tmp/exported/saved_model/assets
I0817 15:54:04.613744 140083899914048 fingerprinting_utils.py:48] Writing fingerprint to /tmp/exported/saved_model/fingerprint.pb
INFO:tensorflow:Writing pipeline config file to /tmp/exported/pipeline.config
I0817 15:54:06.026856 140083899914048 config_util.py:253] Writing pipeline config file to /tmp/exported/pipeline.config
2024-08-17 15:54:08,293 sagemaker-training-toolkit INFO     Reporting training SUCCESS

2024-08-17 15:54:28 Uploading - Uploading generated training model
2024-08-17 15:54:28 Completed - Training job completed
Training seconds: 2183
Billable seconds: 2183
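
The training output above reports the step time and a loss dictionary every 100 steps. If you save that output to a text file, a small parser can turn it into numbers you can tabulate or plot. The following is a minimal sketch, not part of the project code: the function name `parse_training_log` is hypothetical, and the patterns assume the log lines look like the ones shown above. It also strips colour codes such as `[34m` in case your captured output contains them.

```python
import re

# Matches ANSI colour codes such as "[34m" or "[0m", with or without the ESC byte.
ANSI_RE = re.compile(r"\x1b?\[\d+m")
# Only the "INFO:tensorflow" copy of each message is used; the absl copy is ignored.
STEP_RE = re.compile(r"INFO:tensorflow:Step (\d+) per-step time ([\d.]+)s")
LOSS_RE = re.compile(r"'Loss/total_loss': ([\d.eE+-]+)")


def parse_training_log(text):
    """Return a list of (step, per_step_seconds, total_loss) tuples."""
    records = []
    pending = None  # (step, per_step_seconds) waiting for its loss value
    for raw_line in text.splitlines():
        line = ANSI_RE.sub("", raw_line)
        step_match = STEP_RE.search(line)
        if step_match:
            pending = (int(step_match.group(1)), float(step_match.group(2)))
            continue
        loss_match = LOSS_RE.search(line)
        if loss_match and pending is not None:
            records.append((pending[0], pending[1], float(loss_match.group(1))))
            pending = None  # the duplicated loss dictionary that follows is skipped
    return records
```

Calling `parse_training_log(open("training.log").read())` (with a file name of your choosing) gives one tuple per logged step, which is usually enough for a quick loss-versus-step plot alongside TensorBoard.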


You should be able to see your training job in the SageMaker console, as shown below:
![Training jobs example](../../data/example_trainings.png)


## Improve on the initial model

Most likely, this initial experiment did not yield optimal results. However, you can change the `pipeline.config` file in several ways to improve the model. One obvious change is to improve the data augmentation strategy. The [`preprocessor.proto`](https://github.com/tensorflow/models/blob/master/research/object_detection/protos/preprocessor.proto) file lists the data augmentation methods available in the TF Object Detection API. Justify your choice of augmentations in the write-up.
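
When justifying an augmentation, it helps to be clear about what it does to the labels as well as to the pixels. The sketch below illustrates the idea behind a horizontal flip (the `random_horizontal_flip` option in `preprocessor.proto`) in plain Python. It is an illustration only, not the API's implementation: the helper names are hypothetical, and boxes are assumed to be normalized `[ymin, xmin, ymax, xmax]` values in the range 0 to 1.

```python
def flip_image_horizontally(image_rows):
    """Mirror an image stored as a list of pixel rows (left becomes right)."""
    return [list(reversed(row)) for row in image_rows]


def flip_boxes_horizontally(boxes):
    """Mirror normalized [ymin, xmin, ymax, xmax] boxes left-to-right.

    The vertical coordinates are unchanged. The old right edge becomes the
    new left edge, measured from the opposite side of the image.
    """
    flipped = []
    for ymin, xmin, ymax, xmax in boxes:
        flipped.append([ymin, 1.0 - xmax, ymax, 1.0 - xmin])
    return flipped


# Toy example: a 2x3 "image" and one box in the left half of the frame.
image = [[1, 2, 3],
         [4, 5, 6]]
boxes = [[0.1, 0.25, 0.5, 0.5]]

print(flip_image_horizontally(image))
print(flip_boxes_horizontally(boxes))
```

A flip like this is a reasonable candidate for driving scenes because vehicles and pedestrians appear on both sides of the road, whereas a vertical flip would produce images the model never sees in practice.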

Keep in mind that the following options are also available:
* experiment with the optimizer: type of optimizer, learning rate, scheduler, etc.
* experiment with the architecture. The TF Object Detection API model zoo offers many architectures. Keep in mind that the `pipeline.config` file is unique to each architecture, so you will have to edit it.
* visualize results on the test frames using the `2_deploy_model` notebook available in this repository.
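
Before changing the scheduler, it is worth checking how much of the schedule a short run actually covers. In the training log above, the learning rate is about 0.01048 at step 300 and 0.05472 at step 1700, a steady rise of roughly 3.16e-5 per step, which is what a linear warmup looks like. The sketch below assumes a linear warmup followed by cosine decay. The values in `ASSUMED` are assumptions chosen because they reproduce the logged rates; they are not read from this project's `pipeline.config`, so check your own file before drawing conclusions.

```python
import math


def cosine_lr_with_warmup(step, base_lr, total_steps, warmup_lr, warmup_steps):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        slope = (base_lr - warmup_lr) / warmup_steps
        return warmup_lr + slope * step
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Assumed schedule parameters (hypothetical, not taken from the notebook).
ASSUMED = dict(base_lr=0.08, total_steps=300000, warmup_lr=0.001, warmup_steps=2500)

# Compare against the 'learning_rate' values printed in the training log.
for step in (300, 400, 1700, 1800, 2000):
    print(step, round(cosine_lr_with_warmup(step, **ASSUMED), 6))
```

If these assumptions hold for your configuration, a 2000-step run ends while the learning rate is still warming up, so shortening the warmup or adjusting the base learning rate are natural experiments to try and to justify in the write-up.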

In the cell below, write down all the approaches you have experimented with, why you chose them, and what you would have done with more time and resources. Justify your choices using the TensorBoard visualizations (take screenshots and insert them in your write-up), the metrics on the evaluation set, and the animation you generated with [this tool](../2_run_inference/2_deploy_model.ipynb).
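
To compare evaluation metrics across experiments, you can extract the COCO-style summary lines from each training log instead of copying them by hand. The following is a minimal sketch under the assumption that the lines have the same layout as the evaluation output shown earlier in this notebook; `parse_coco_metrics` is a hypothetical helper, not part of the TF Object Detection API.

```python
import re

# One COCO summary line, e.g.
#  Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.093
METRIC_RE = re.compile(
    r"Average (?:Precision|Recall)\s+\((AP|AR)\) @\[ IoU=([\d.:]+)\s*\| "
    r"area=\s*(\w+) \| maxDets=\s*(\d+) \] = (-?[\d.]+)"
)


def parse_coco_metrics(text):
    """Map (metric, iou_range, area, max_dets) to the reported value."""
    metrics = {}
    for match in METRIC_RE.finditer(text):
        name, iou, area, max_dets, value = match.groups()
        metrics[(name, iou, area, int(max_dets))] = float(value)
    return metrics


# Lines copied from the evaluation output of the initial experiment.
baseline_log = """
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.093
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.229
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.061
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.039
"""

for key, value in parse_coco_metrics(baseline_log).items():
    print(key, value)
```

Running the same helper on the log of each later experiment gives dictionaries with identical keys, which makes it straightforward to build a side-by-side table for the write-up.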