# TensorFlow Object Detection API and AWS SageMaker

In this notebook, you will train and evaluate different models using the [TensorFlow Object Detection API](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/) and [AWS SageMaker](https://aws.amazon.com/sagemaker/).

If you ever feel stuck, you can refer to this [tutorial](https://aws.amazon.com/blogs/machine-learning/training-and-deploying-models-using-tensorflow-2-with-the-object-detection-api-on-amazon-sagemaker/).

## Dataset

We are using the [Waymo Open Dataset](https://waymo.com/open/) for this project. The dataset has already been exported to the TFRecord format, following the layout described [here](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#create-tensorflow-records). The data is stored on [AWS S3](https://aws.amazon.com/s3/), the AWS object storage service. The images are saved at a resolution of 640x640.

In [28]:
%%capture
%pip install tensorflow_io sagemaker -U

In [29]:
import os
import sagemaker
from sagemaker.estimator import Estimator
from framework import CustomFramework

Save the IAM role in a variable called `role`. It will be needed when launching the training job.

In [30]:
role = sagemaker.get_execution_role()
print(role)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


arn:aws:iam::177455752734:role/service-role/AmazonSageMaker-ExecutionRole-20230325T002227


In [31]:
# The train and val paths below are public S3 buckets created by Udacity for this project
inputs = {'train': 's3://cd2688-object-detection-tf2/train/', 
        'val': 's3://cd2688-object-detection-tf2/val/'} 

# Insert path of a folder in your personal S3 bucket to store tensorboard logs.
tensorboard_s3_prefix = 's3://object-detection-project-s3/logs/'
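
A malformed S3 path only fails once the training job starts, so it can help to sanity-check the URIs first. The following is a small standard-library sketch; `split_s3_uri` is a hypothetical helper name, not part of the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/prefix/' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Same channel URIs as in the cell above.
channels = {
    "train": "s3://cd2688-object-detection-tf2/train/",
    "val": "s3://cd2688-object-detection-tf2/val/",
}
for name, uri in channels.items():
    bucket, prefix = split_s3_uri(uri)
    print(f"{name}: bucket={bucket} prefix={prefix}")
```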

## Container

To train the model, you first need to build a [Docker](https://www.docker.com/) container with all the dependencies required by the TF Object Detection API. The code below does the following:
* clones the TensorFlow models repository
* copies the exporter and training scripts from the repository
* builds the Docker image and pushes it
* prints the container name

In [32]:
%%bash

# clone the repo and get the scripts
git clone https://github.com/tensorflow/models.git docker/models

# get model_main and exporter_main files from TF2 Object Detection GitHub repository
cp docker/models/research/object_detection/exporter_main_v2.py source_dir 
cp docker/models/research/object_detection/model_main_tf2.py source_dir

fatal: destination path 'docker/models' already exists and is not an empty directory.


In [13]:
# Build and push the Docker image. This code can be commented out after it has been run once.
# This will take around 10 minutes.
image_name = 'tf2-object-detection'
!sh ./docker/build_and_push.sh $image_name

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Building image with name tf2-object-detection
Sending build context to Docker daemon  723.5MB
Step 1/17 : FROM tensorflow/tensorflow:2.9.0-gpu
2.9.0-gpu: Pulling from tensorflow/tensorflow

17ec1767: Pulling fs layer
9ecd2bff: Pulling fs layer
4ae53552: Pulling fs layer
2d09b8c4: Pulling fs layer
0d530989: Pulling fs layer
81af025b: Pulling fs layer
c129f45e: Pulling fs layer
8fcb70c6: Pulling fs layer
9aa4a247: Pulling fs layer
3100c8d1: Pulling fs layer
3a6b487b: Pulling fs layer
e8773234: Pulling fs layer
36c9476c: Pulling fs layer
e8773234: Extracting  497.5MB/583.3MB
1b420cea: Pull complete
Digest: sha256:aa9f4a6a7debc976135702118aedfd0d72bf9e495af6ecfd5a31d9714e335426
Status: Downloaded newer image for tensorflow/tensorflow:2.9.0-gpu
 ---> c8d9ee2a0ff4
Step 2/17 : ARG DEBIAN_FRONTEND=noninteractive
 ---> Running in c00e2f7fc2e8
Removing intermediate container c00e2f7fc2e8
 ---> f3279c0a7511
Step 3/17 : RUN rm /etc/apt/sources.list.d/cuda.list
 ---> Running in 632e404a79b0
Removing intermediate container 632e404a79b0
 ---> df2e8b8a46a1
Step 4/17 : RUN apt-key del 7fa2af80
 ---> Running in 457d2dffbfb3
OK
Removing intermediate container 457d2dffbfb3
 ---> 969490296e9e
Step 5/17 : RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863c

Get:2 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 dirmngr amd64 2.2.19-3ubuntu2.2 [330 kB]
Get:3 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gpg-wks-server amd64 2.2.19-3ubuntu2.2 [90.2 kB]
Get:4 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gnupg-utils amd64 2.2.19-3ubuntu2.2 [481 kB]
Get:5 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gpg-agent amd64 2.2.19-3ubuntu2.2 [232 kB]
Get:6 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gpg amd64 2.2.19-3ubuntu2.2 [482 kB]
Get:7 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gpgconf amd64 2.2.19-3ubuntu2.2 [124 kB]
Get:8 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gnupg-l10n all 2.2.19-3ubuntu2.2 [51.7 kB]
Get:9 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gnupg all 2.2.19-3ubuntu2.2 [259 kB]
Get:10 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gpgsm amd64 2.2.19-3ubuntu2.2 [217 kB]
Get:11 http://archive.ubuntu.com/ubuntu focal-up

Get:89 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libx11-dev amd64 2:1.6.9-2ubuntu1.2 [647 kB]
Get:90 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libglx-dev amd64 1.3.2-1~ubuntu0.20.04.2 [14.0 kB]
Get:91 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libgl-dev amd64 1.3.2-1~ubuntu0.20.04.2 [97.8 kB]
Get:92 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libegl-dev amd64 1.3.2-1~ubuntu0.20.04.2 [17.2 kB]
Get:93 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libjbig0 amd64 2.1-3.1ubuntu0.20.04.1 [27.3 kB]
Get:94 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libwebp6 amd64 0.6.1-2ubuntu0.20.04.1 [185 kB]
Get:95 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libtiff5 amd64 4.1.0+git191117-2ubuntu0.20.04.8 [163 kB]
Get:96 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libgdk-pixbuf2.0-common all 2.40.0+dfsg-3ubuntu0.4 [4592 B]
Get:97 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libgdk

Selecting previously unselected package libxau6:amd64.
Preparing to unpack .../010-libxau6_1%3a1.0.9-0ubuntu1_amd64.deb ...
Unpacking libxau6:amd64 (1:1.0.9-0ubuntu1) ...
Selecting previously unselected package libxdmcp6:amd64.
Preparing to unpack .../011-libxdmcp6_1%3a1.1.3-0ubuntu1_amd64.deb ...
Unpacking libxdmcp6:amd64 (1:1.1.3-0ubuntu1) ...
Selecting previously unselected package libxcb1:amd64.
Preparing to unpack .../012-libxcb1_1.14-2_amd64.deb ...
Unpacking libxcb1:amd64 (1.14-2) ...
Selecting previously unselected package libx11-data.
Preparing to unpack .../013-libx11-data_2%3a1.6.9-2ubuntu1.2_all.deb ...
Unpacking libx11-data (2:1.6.9-2ubuntu1.2) ...
Selecting previously unselected package libx11-6:amd64.
Preparing to unpack .../014-libx11-6_2%3a1.6.9-2ubuntu1.2_amd64.deb ...
Unpacking libx11-6:amd64 (2:1.6.9-2ubuntu1.2) ...
Selecting previously unselected package libxext6:amd64.
Preparing to unpack .../015-libxext6_2%3a1.3.4-0ubuntu1_amd64.deb ...
Unpacking libxext6:amd64 (

Unpacking libegl1:amd64 (1.3.2-1~ubuntu0.20.04.2) ...
Selecting previously unselected package libxcb-glx0:amd64.
Preparing to unpack .../058-libxcb-glx0_1.14-2_amd64.deb ...
Unpacking libxcb-glx0:amd64 (1.14-2) ...
Selecting previously unselected package libxfixes3:amd64.
Preparing to unpack .../059-libxfixes3_1%3a5.0.3-2_amd64.deb ...
Unpacking libxfixes3:amd64 (1:5.0.3-2) ...
Selecting previously unselected package libxxf86vm1:amd64.
Preparing to unpack .../060-libxxf86vm1_1%3a1.1.4-1build1_amd64.deb ...
Unpacking libxxf86vm1:amd64 (1:1.1.4-1build1) ...
Selecting previously unselected package libllvm12:amd64.
Preparing to unpack .../061-libllvm12_1%3a12.0.0-3ubuntu1~20.04.5_amd64.deb ...
Unpacking libllvm12:amd64 (1:12.0.0-3ubuntu1~20.04.5) ...
Selecting previously unselected package libsensors-config.
Preparing to unpack .../062-libsensors-config_1%3a3.6.0-2ubuntu1.1_all.deb ...
Unpacking libsensors-config (1:3.6.0-2ubuntu1.1) ...
Selecting previously unselected package libsensors5:

Selecting previously unselected package python3-soupsieve.
Preparing to unpack .../102-python3-soupsieve_1.9.5+dfsg-1_all.deb ...
Unpacking python3-soupsieve (1.9.5+dfsg-1) ...
Selecting previously unselected package python3-bs4.
Preparing to unpack .../103-python3-bs4_4.8.2-1_all.deb ...
Unpacking python3-bs4 (4.8.2-1) ...
Selecting previously unselected package python3-ply.
Preparing to unpack .../104-python3-ply_3.11-3ubuntu0.1_all.deb ...
Unpacking python3-ply (3.11-3ubuntu0.1) ...
Selecting previously unselected package python3-pycparser.
Preparing to unpack .../105-python3-pycparser_2.19-1ubuntu1_all.deb ...
Unpacking python3-pycparser (2.19-1ubuntu1) ...
Selecting previously unselected package python3-cffi.
Preparing to unpack .../106-python3-cffi_1.14.0-1build1_all.deb ...
Unpacking python3-cffi (1.14.0-1build1) ...
Selecting previously unselected package python3-xcffib.
Preparing to unpack .../107-python3-xcffib_0.8.1-0.8_amd64.deb ...
Unpacking python3-xcffib (0.8.1-0.8) ...


Setting up libdrm-intel1:amd64 (2.4.107-8ubuntu1~20.04.2) ...
Setting up libgl1-mesa-dri:amd64 (21.2.6-0ubuntu0.1~20.04.2) ...
Setting up libx11-dev:amd64 (2:1.6.9-2ubuntu1.2) ...
Setting up libxext6:amd64 (2:1.3.4-0ubuntu1) ...
Setting up libcairo2:amd64 (1.16.0-4ubuntu1) ...
Setting up libxxf86vm1:amd64 (1:1.1.4-1build1) ...
Setting up libegl-mesa0:amd64 (21.2.6-0ubuntu0.1~20.04.2) ...
Setting up libxfixes3:amd64 (1:5.0.3-2) ...
Setting up libgdk-pixbuf2.0-0:amd64 (2.40.0+dfsg-3ubuntu0.4) ...
Setting up python3-cairocffi (0.9.0-4) ...
Setting up xauth (1:1.1-0ubuntu1) ...
Setting up libgdk-pixbuf2.0-bin (2.40.0+dfsg-3ubuntu0.4) ...
Setting up libegl1:amd64 (1.3.2-1~ubuntu0.20.04.2) ...
Setting up gnupg (2.2.19-3ubuntu2.2) ...
Setting up libxss1:amd64 (1:1.2.3-1) ...
Setting up libxft2:amd64 (2.3.3-0ubuntu1) ...
Setting up libglx-mesa0:amd64 (21.2.6-0ubuntu0.1~20.04.2) ...
Setting up libglx0:amd64 (1.3.2-1~ubuntu0.20.04.2) ...
Setting up libtk8.6:amd64 (8.6.10-1) ...
Setting up libgl1

  Preparing metadata (setup.py): finished with status 'done'
Collecting google-api-python-client>=1.6.7
  Downloading google_api_python_client-2.83.0-py2.py3-none-any.whl (11.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11.2/11.2 MB 109.6 MB/s eta 0:00:00
Collecting opencv-python-headless
  Downloading opencv_python_headless-4.7.0.72-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (49.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49.2/49.2 MB 43.4 MB/s eta 0:00:00
Collecting tensorflow-hub>=0.6.0
  Downloading tensorflow_hub-0.13.0-py2.py3-none-any.whl (100 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.6/100.6 kB 18.3 MB/s eta 0:00:00
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 16.2 MB/s eta 0:00:00
Collecting psutil>=5.4.3
  Downloading psutil-5.9.4-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_6

Collecting flatbuffers>=2.0
  Downloading flatbuffers-23.3.3-py2.py3-none-any.whl (26 kB)
Collecting keras
  Downloading keras-2.11.0-py2.py3-none-any.whl (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 103.0 MB/s eta 0:00:00
Collecting tensorflow-estimator<2.12,>=2.11.0
  Downloading tensorflow_estimator-2.11.0-py2.py3-none-any.whl (439 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 439.2/439.2 kB 65.6 MB/s eta 0:00:00
Collecting protobuf<4,>3.12.2
  Downloading protobuf-3.19.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 106.0 MB/s eta 0:00:00
Collecting tensorboard<2.12,>=2.11
  Downloading tensorboard-2.11.2-py3-none-any.whl (6.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.0/6.0 MB 109.9 MB/s eta 0:00:00
Collecting dm-tree~=0.1.1
  Downloading dm_tree-0.1.8-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (152 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 152.9/152.9 k

  Building wheel for crcmod (setup.py): finished with status 'done'
  Created wheel for crcmod: filename=crcmod-1.7-cp38-cp38-linux_x86_64.whl size=36034 sha256=56c0e04b8228858e26b67076d252e8ca37eb96a1ed78d2ca6b0a738559ec5219
  Stored in directory: /root/.cache/pip/wheels/ca/5a/02/f3acf982a026f3319fb3e798a8dca2d48fafee7761788562e9
  Building wheel for dill (setup.py): started
  Building wheel for dill (setup.py): finished with status 'done'
  Created wheel for dill: filename=dill-0.3.1.1-py3-none-any.whl size=78543 sha256=ff3e1373aa6e2cd244b4c49dea9db7b4430daa6e04310f3a543441caccc31e6e
  Stored in directory: /root/.cache/pip/wheels/07/35/78/e9004fa30578734db7f10e7a211605f3f0778d2bdde38a239d
  Building wheel for kaggle (setup.py): started
  Building wheel for kaggle (setup.py): finished with status 'done'
  Created wheel for kaggle: filename=kaggle-1.5.13-py3-none-any.whl size=77714 sha256=9b96676a3dce6fb81b589dbb5903aebeded3292eca520ee5ce244d39f538a045
  Stored in directory: /root/.cac

Collecting retrying>=1.3.3
  Downloading retrying-1.3.4-py3-none-any.whl (11 kB)
Collecting gevent
  Downloading gevent-22.10.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.5/6.5 MB 90.8 MB/s eta 0:00:00
Collecting inotify_simple==1.2.1
  Downloading inotify_simple-1.2.1.tar.gz (7.9 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting paramiko>=2.4.2
  Downloading paramiko-3.1.0-py3-none-any.whl (211 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 211.2/211.2 kB 40.3 MB/s eta 0:00:00
Collecting cryptography>=3.3
  Downloading cryptography-40.0.1-cp36-abi3-manylinux_2_28_x86_64.whl (3.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.7/3.7 MB 106.9 MB/s eta 0:00:00
Collecting pynacl>=1.5
  Downloading PyNaCl-1.5.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (856 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 856.7/85

fae0f58c: Pushed   3.691GB/3.681GB

To verify that the image was correctly pushed to the [Elastic Container Registry](https://aws.amazon.com/ecr/), you can look at it in the AWS Management Console. For example, the screenshot below shows three different images pushed to ECR. You should only see one, called `tf2-object-detection`.
![ECR Example](../data/example_ecr.png)


In [33]:
# display the container name
with open(os.path.join('docker', 'ecr_image_fullname.txt'), 'r') as f:
    # strip() removes the trailing newline without cutting a character if none is present
    container = f.readline().strip()

print(container)

177455752734.dkr.ecr.us-east-1.amazonaws.com/tf2-object-detection:20230329000734
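
The image name printed above follows the usual ECR layout `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. As a quick check that the build script produced the expected repository and tag, the name can be split into its parts. This is a minimal sketch; `parse_ecr_uri` is a hypothetical helper and the pattern only covers tagged images in the standard AWS partition.

```python
import re

# Pattern for tagged images only; digests (@sha256:...) are not handled.
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_uri(uri):
    """Return the account, region, repository, and tag of an ECR image URI."""
    match = ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"not a tagged ECR image URI: {uri!r}")
    return match.groupdict()


parts = parse_ecr_uri(
    "177455752734.dkr.ecr.us-east-1.amazonaws.com/tf2-object-detection:20230329000734"
)
print(parts["region"], parts["repository"], parts["tag"])
```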


## Pre-trained model from model zoo

As is common practice, we are not training from scratch: we will use a pretrained model from the TF Object Detection model zoo. You can find pretrained checkpoints [here](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md). Because your time is limited for this project, we recommend experimenting only with the following models:
* SSD MobileNet V2 FPNLite 640x640	
* SSD ResNet50 V1 FPN 640x640 (RetinaNet50)	
* Faster R-CNN ResNet50 V1 640x640	
* EfficientDet D1 640x640	
* Faster R-CNN ResNet152 V1 640x640	
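
When switching architectures, the download URL has to change along with the config. The sketch below builds the URL for each recommended model, assuming every archive sits under the same base path as the EfficientDet D1 archive used in this notebook. Apart from the EfficientDet entry, the archive names are assumptions and should be checked against the model zoo page linked above before use.

```python
ZOO_BASE = "http://download.tensorflow.org/models/object_detection/tf2/20200711"

# Archive names other than EfficientDet D1 are assumptions; verify them
# against the TF2 detection model zoo listing.
ARCHIVES = {
    "SSD MobileNet V2 FPNLite 640x640": "ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8",
    "SSD ResNet50 V1 FPN 640x640 (RetinaNet50)": "ssd_resnet50_v1_fpn_640x640_coco17_tpu-8",
    "Faster R-CNN ResNet50 V1 640x640": "faster_rcnn_resnet50_v1_640x640_coco17_tpu-8",
    "EfficientDet D1 640x640": "efficientdet_d1_coco17_tpu-32",
    "Faster R-CNN ResNet152 V1 640x640": "faster_rcnn_resnet152_v1_640x640_coco17_tpu-8",
}


def checkpoint_url(model_name):
    """Return the download URL of the pretrained archive for a model."""
    return f"{ZOO_BASE}/{ARCHIVES[model_name]}.tar.gz"


for model_name in ARCHIVES:
    print(f"{model_name}: {checkpoint_url(model_name)}")
```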

The cell below downloads and extracts the EfficientDet D1 model. It should be adjusted if you experiment with other architectures.

In [15]:
%%bash
# -p keeps the cell re-runnable when the directories already exist
mkdir -p /tmp/checkpoint
mkdir -p source_dir/checkpoint
wget -O /tmp/efficientdet.tar.gz http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz
tar -zxvf /tmp/efficientdet.tar.gz --strip-components 2 --directory source_dir/checkpoint efficientdet_d1_coco17_tpu-32/checkpoint

efficientdet_d1_coco17_tpu-32/checkpoint/ckpt-0.data-00000-of-00001
efficientdet_d1_coco17_tpu-32/checkpoint/checkpoint
efficientdet_d1_coco17_tpu-32/checkpoint/ckpt-0.index


--2023-03-29 00:24:46--  http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 142.251.16.128, 2607:f8b0:4004:c19::80
Connecting to download.tensorflow.org (download.tensorflow.org)|142.251.16.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51839363 (49M) [application/x-tar]
Saving to: ‘/tmp/efficientdet.tar.gz’

     0K .......... .......... .......... .......... ..........  0% 16.5M 3s
    50K .......... .......... .......... .......... ..........  0% 31.4M 2s
   100K .......... .......... .......... .......... ..........  0% 28.8M 2s
   150K .......... .......... .......... .......... ..........  0% 23.1M 2s
   200K .......... .......... .......... .......... ..........  0% 32.3M 2s
   250K .......... .......... .......... .......... ..........  0% 30.4M 2s
   300K .......... .......... .......... .......... ..........  0% 29.9M 2s
   350K ..
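
The `tar` call above extracts only the `checkpoint` folder of the archive and drops the two leading path components. For readers who prefer to do this from Python, here is a rough standard-library equivalent. `extract_subdir` and `make_demo_archive` are hypothetical helpers, only regular files are handled, and the demo runs on a tiny placeholder archive rather than the real download.

```python
import io
import os
import tarfile
import tempfile


def extract_subdir(archive_path, member_prefix, dest_dir):
    """Extract regular files under member_prefix into dest_dir, dropping the prefix."""
    prefix = member_prefix.rstrip("/") + "/"
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if not (member.isfile() and member.name.startswith(prefix)):
                continue
            relative = member.name[len(prefix):]
            if not relative or ".." in relative.split("/"):
                continue  # skip empty or suspicious paths
            target = os.path.join(dest_dir, relative)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with archive.extractfile(member) as src, open(target, "wb") as dst:
                dst.write(src.read())
            extracted.append(relative)
    return sorted(extracted)


def make_demo_archive(path):
    """Write a tiny placeholder archive that mimics the checkpoint layout."""
    names = (
        "model/checkpoint/checkpoint",
        "model/checkpoint/ckpt-0.index",
        "model/saved_model/saved_model.pb",
    )
    with tarfile.open(path, "w:gz") as archive:
        for name in names:
            data = b"placeholder"
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))


with tempfile.TemporaryDirectory() as tmp:
    demo_archive = os.path.join(tmp, "demo.tar.gz")
    make_demo_archive(demo_archive)
    print(extract_subdir(demo_archive, "model/checkpoint", os.path.join(tmp, "out")))
```

For the real archive, the call would look like `extract_subdir('/tmp/efficientdet.tar.gz', 'efficientdet_d1_coco17_tpu-32/checkpoint', 'source_dir/checkpoint')`.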

## Edit pipeline.config file

The [`pipeline.config`](source_dir/pipeline.config) in the `source_dir` folder should be updated when you experiment with different models. Reference config files for each architecture are available [here](https://github.com/tensorflow/models/tree/master/research/object_detection/configs/tf2).

>Note: The provided `pipeline.config` file works well with the `EfficientDet` model. You will need to modify it when working with other models.
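
When adapting a reference config, a handful of fields usually need new values, such as `num_classes`, `batch_size`, `fine_tune_checkpoint`, and `fine_tune_checkpoint_type`. The sketch below rewrites such fields with plain text substitution on a small sample config. The sample text and the new values (3 classes, `checkpoint/ckpt-0`) are illustrative assumptions, and `set_field` is a hypothetical helper that replaces every occurrence of a field, so a field that appears in both the train and eval sections would be changed in both.

```python
import re


def set_field(config_text, field, value):
    """Replace the value of every `field: ...` line in a text-format config."""
    if isinstance(value, bool):
        rendered = "true" if value else "false"
    elif isinstance(value, str):
        rendered = f'"{value}"'
    else:
        rendered = str(value)
    pattern = re.compile(rf"^([ \t]*{re.escape(field)}[ \t]*:[ \t]*).*$", re.MULTILINE)
    new_text, hits = pattern.subn(lambda m: m.group(1) + rendered, config_text)
    if hits == 0:
        raise KeyError(f"field not found: {field}")
    return new_text


# Illustrative excerpt, not the notebook's actual pipeline.config.
sample = '''model {
  ssd {
    num_classes: 90
  }
}
train_config {
  batch_size: 128
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED"
  fine_tune_checkpoint_type: "classification"
}
'''

updated = sample
updated = set_field(updated, "num_classes", 3)
updated = set_field(updated, "batch_size", 8)
updated = set_field(updated, "fine_tune_checkpoint", "checkpoint/ckpt-0")
updated = set_field(updated, "fine_tune_checkpoint_type", "detection")
print(updated)
```

Text substitution keeps the example dependency-free. Inside the training container, parsing the file with the Object Detection API's own config utilities would be the more robust choice.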

## Launch Training Job

Now that we have a dataset, a Docker image, and pretrained model weights, we can launch the training job. To do so, we create a [SageMaker Framework](https://sagemaker.readthedocs.io/en/stable/frameworks/index.html), where we specify the container name, the name of the config file, the number of training steps, and so on.

The `run_training.sh` script does the following:
* train the model for `num_train_steps` 
* evaluate over the val dataset
* export the model
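
SageMaker training containers typically hand the job's hyperparameters to the entry point as `--name value` command-line flags. Whether `run_training.sh` consumes them in exactly this form is an assumption, since the script is not shown here. The sketch below only illustrates that mapping for the hyperparameters used in this notebook; `hyperparameters_to_args` is a hypothetical helper.

```python
def hyperparameters_to_args(hyperparameters):
    """Flatten a hyperparameter dict into a list of --name value arguments."""
    args = []
    for name, value in hyperparameters.items():
        args.extend([f"--{name}", str(value)])
    return args


hyperparameters = {
    "model_dir": "/opt/training",
    "pipeline_config_path": "pipeline.config",
    "num_train_steps": "2000",
    "sample_1_of_n_eval_examples": "1",
}
print(" ".join(hyperparameters_to_args(hyperparameters)))
```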

Several metrics are displayed during the evaluation phase, including the mean average precision (mAP). Use these metrics to quantify your model's performance and to compare results across experiments.
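
To compare experiments, it can be convenient to pull the COCO metrics out of the job log instead of reading them by eye. The following is a minimal sketch that parses summary lines of the form printed by the evaluation step. The sample lines are copied from this notebook's log output, and `parse_coco_summary` is a hypothetical helper.

```python
import re

COCO_LINE = re.compile(
    r"Average (?:Precision|Recall)\s+\((?P<metric>AP|AR)\)\s+@\[\s*IoU=(?P<iou>[\d.:]+)\s*\|"
    r"\s*area=\s*(?P<area>\w+)\s*\|\s*maxDets=\s*(?P<max_dets>\d+)\s*\]\s*=\s*(?P<value>-?\d+\.\d+)"
)


def parse_coco_summary(text):
    """Map (metric, iou, area, max_dets) to the reported value."""
    results = {}
    for match in COCO_LINE.finditer(text):
        key = (match["metric"], match["iou"], match["area"], int(match["max_dets"]))
        results[key] = float(match["value"])
    return results


log_excerpt = """
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.090
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.221
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.056
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.039
"""

metrics = parse_coco_summary(log_excerpt)
print("mAP@[0.50:0.95]:", metrics[("AP", "0.50:0.95", "all", 100)])
```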

You can also monitor the training progress by navigating to **Training -> Training Jobs** from the Amazon SageMaker dashboard in the web UI.
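
The training log reports a per-step time and a loss dictionary every 100 steps. As a rough sketch, the snippet below extracts the step, per-step time, and total loss from such log text and estimates the duration of a 2000-step run. The excerpt is taken from this notebook's log output, each message appears twice in the real log (once per logger), and `parse_training_log` is a hypothetical helper.

```python
import re

STEP = re.compile(r"Step (\d+) per-step time ([\d.]+)s")
TOTAL_LOSS = re.compile(r"'Loss/total_loss': ([-+\d.eE]+)")


def parse_training_log(text):
    """Return {step: (seconds_per_step, total_loss)} from training log text."""
    records = {}
    current = None
    for line in text.splitlines():
        step_match = STEP.search(line)
        if step_match:
            current = (int(step_match.group(1)), float(step_match.group(2)))
            continue
        loss_match = TOTAL_LOSS.search(line)
        if loss_match and current is not None:
            step, seconds = current
            # Duplicate log lines simply overwrite the same step entry.
            records[step] = (seconds, float(loss_match.group(1)))
    return records


log_excerpt = """
INFO:tensorflow:Step 300 per-step time 0.607s
INFO:tensorflow:{'Loss/classification_loss': 0.3351292,
 'Loss/localization_loss': 0.024426365,
 'Loss/regularization_loss': 0.029549818,
 'Loss/total_loss': 0.38910538,
 'learning_rate': 0.010480001}
INFO:tensorflow:Step 400 per-step time 0.606s
INFO:tensorflow:{'Loss/classification_loss': 0.2676249,
 'Loss/localization_loss': 0.016456617,
 'Loss/regularization_loss': 0.029560918,
 'Loss/total_loss': 0.31364244,
 'learning_rate': 0.0136400005}
"""

records = parse_training_log(log_excerpt)
mean_step_seconds = sum(seconds for seconds, _ in records.values()) / len(records)
estimated_minutes = mean_step_seconds * 2000 / 60
for step, (seconds, loss) in sorted(records.items()):
    print(f"step {step}: {seconds:.3f}s/step, total loss {loss:.4f}")
print(f"estimated time for 2000 steps: {estimated_minutes:.1f} min")
```

The estimate covers the training loop only. Instance start-up, image download, evaluation, and export add to the total job time.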

In [11]:
tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=tensorboard_s3_prefix,
    container_local_output_path='/opt/training/'
)

estimator = CustomFramework(
    role=role,
    image_uri=container,
    entry_point='run_training.sh',
    source_dir='source_dir/',
    hyperparameters={
        "model_dir":"/opt/training",        
        "pipeline_config_path": "pipeline.config",
        "num_train_steps": "2000",    
        "sample_1_of_n_eval_examples": "1"
    },
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    tensorboard_output_config=tensorboard_output_config,
    disable_profiler=True,
    base_job_name='tf2-object-detection'
)

estimator.fit(inputs)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker:Creating training-job with name: tf2-object-detection-2023-03-24-07-22-19-848


2023-03-24 07:22:28 Starting - Starting the training job...
2023-03-24 07:22:54 Starting - Preparing the instances for training......
2023-03-24 07:24:03 Downloading - Downloading input data...
2023-03-24 07:24:23 Training - Downloading the training image...............
2023-03-24 07:26:59 Training - Training image download completed. Training in progress....
2023-03-24 07:27:26,826 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-03-24 07:27:26,862 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-03-24 07:27:26,895 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-03-24 07:27:26,908 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "train": "/opt/ml/input/data/train",
        "val": "/opt/ml/input/data/va

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0324 07:27:35.289755 140326486873920 mirrored_strategy.py:374] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: 2000
I0324 07:27:35.293904 140326486873920 config_util.py:552] Maybe overwriting train_steps: 2000
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0324 07:27:35.294045 140326486873920 config_util.py:552] Maybe overwriting use_bfloat16: False
I0324 07:27:35.306915 140326486873920 ssd_efficientnet_bifpn_feature_extractor.py:150] EfficientDet EfficientNet backbone version: efficientnet-b1
I0324 07:27:35.307038 140326486873920 ssd_efficientnet_bifpn_feature_extractor.py:152] EfficientDet BiFPN num filters: 88
I0324 07:27:35.307148 140326486873920 ssd_efficientnet_bifpn_feature_extractor.py:153] EfficientDet BiFPN 

Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W0324 07:27:46.787175 140326486873920 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
Instructions for updating:
Use `tf.cast` instead.
W0324 07:27:50.928708 140326486873920 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
I0324 07:28:00.975277 140303428220672 api.py:459] feature_map_spatial_dims: [(80, 80), (40, 40), (20, 20), (

INFO:tensorflow:Step 300 per-step time 0.607s
I0324 07:33:13.129344 140326486873920 model_lib_v2.py:705] Step 300 per-step time 0.607s
INFO:tensorflow:{'Loss/classification_loss': 0.3351292,
 'Loss/localization_loss': 0.024426365,
 'Loss/regularization_loss': 0.029549818,
 'Loss/total_loss': 0.38910538,
 'learning_rate': 0.010480001}
I0324 07:33:13.129723 140326486873920 model_lib_v2.py:708] {'Loss/classification_loss': 0.3351292,
 'Loss/localization_loss': 0.024426365,
 'Loss/regularization_loss': 0.029549818,
 'Loss/total_loss': 0.38910538,
 'learning_rate': 0.010480001}
INFO:tensorflow:Step 400 per-step time 0.606s
I0324 07:34:13.779035 140326486873920 model_lib_v2.py:705] Step 400 per-step time 0.606s
INFO:tensorflow:{'Loss/classification_loss': 0.2676249,
 'Loss/localization_loss': 0.016456617,
 'Loss/regularization_loss': 0.029560918,
 'Loss/total_loss': 0.31364244,
 'learning_rate': 0.0136400005}
I0324 07:34:13.

INFO:tensorflow:Step 1700 per-step time 0.609s
I0324 07:47:26.288833 140326486873920 model_lib_v2.py:705] Step 1700 per-step time 0.609s
INFO:tensorflow:{'Loss/classification_loss': 0.27231342,
 'Loss/localization_loss': 0.01530519,
 'Loss/regularization_loss': 0.030304896,
 'Loss/total_loss': 0.31792352,
 'learning_rate': 0.05472}
I0324 07:47:26.289128 140326486873920 model_lib_v2.py:708] {'Loss/classification_loss': 0.27231342,
 'Loss/localization_loss': 0.01530519,
 'Loss/regularization_loss': 0.030304896,
 'Loss/total_loss': 0.31792352,
 'learning_rate': 0.05472}
INFO:tensorflow:Step 1800 per-step time 0.606s
I0324 07:48:26.910415 140326486873920 model_lib_v2.py:705] Step 1800 per-step time 0.606s
INFO:tensorflow:{'Loss/classification_loss': 0.24135278,
 'Loss/localization_loss': 0.015840685,
 'Loss/regularization_loss': 0.030396296,
 'Loss/total_loss': 0.28758976,
 'learning_rate': 0.05788}
I0324 07:48:26.910765 1

Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W0324 07:50:48.134898 140387232540480 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
Instructions for updating:
Use `tf.cast` instead.
W0324 07:50:49.623990 140387232540480 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
INFO:tensorflow:Waiting for new checkpoint at /opt/training
I0324 07:50:52.382055 140387232540480 c

INFO:tensorflow:Waiting for new checkpoint at /opt/training
I0324 07:55:52.479500 140387232540480 checkpoint_utils.py:140] Waiting for new checkpoint at /opt/training
INFO:tensorflow:Timed-out waiting for a checkpoint.
I0324 07:56:01.493650 140387232540480 checkpoint_utils.py:203] Timed-out waiting for a checkpoint.
creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=14.30s).
Accumulating evaluation results...
DONE (t=0.26s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.090
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.221
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.056
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.039
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=me

W0324 07:57:50.940387 140451362797376 save.py:271] Found untraced functions such as WeightSharedConvolutionalBoxPredictor_layer_call_fn, WeightSharedConvolutionalBoxPredictor_layer_call_and_return_conditional_losses, WeightSharedConvolutionalBoxHead_layer_call_fn, WeightSharedConvolutionalBoxHead_layer_call_and_return_conditional_losses, WeightSharedConvolutionalClassHead_layer_call_fn while saving (showing 5 of 535). These functions will not be directly callable after loading.
[34mINFO:tensorflow:Assets written to: /tmp/exported/saved_model/assets[0m
[34mI0324 07:58:27.486438 140451362797376 builder_impl.py:797] Assets written to: /tmp/exported/saved_model/assets[0m
[34mINFO:tensorflow:Writing pipeline config file to /tmp/exported/pipeline.config[0m
[34mI0324 07:58:32.158858 140451362797376 config_util.py:253] Writing pipeline config file to /tmp/exported/pipeline.config[0m
[34m2023-03-24 07:58:35,754 sagemaker-training-toolkit INFO     Reporting training SUCCESS[0m

You should be able to see your model training in the AWS webapp as shown below:
![ECR Example](../data/example_trainings.png)


## Improve on the initial model

Most likely, this initial experiment did not yield optimal results. However, you can make multiple changes to the `pipeline.config` file to improve this model. One obvious change is to improve the data augmentation strategy. The [`preprocessor.proto`](https://github.com/tensorflow/models/blob/master/research/object_detection/protos/preprocessor.proto) file lists the data augmentation methods available in the TF Object Detection API. Justify your choices of augmentations in the writeup.

Keep in mind that the following are also available:
* experiment with the optimizer: type of optimizer, learning rate, scheduler, etc.
* experiment with the architecture. The TF Object Detection API model zoo offers many architectures. Keep in mind that the `pipeline.config` file is unique for each architecture and you will have to edit it.
* visualize results on the test frames using the `2_deploy_model` notebook available in this repository.

In the cell below, write down all the different approaches you have experimented with, why you have chosen them and what you would have done if you had more time and resources. Justify your choices using the tensorboard visualizations (take screenshots and insert them in your writeup), the metrics on the evaluation set and the generated animation you have created with [this tool](../2_run_inference/2_deploy_model.ipynb).
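
The experiments below reuse the same estimator settings and change only the config file passed as `pipeline_config_path`. A small helper along these lines could keep the cells consistent. This is only a sketch: the name `make_hyperparameters` and the `experiments` dict are hypothetical and not part of the SageMaker SDK.

```python
def make_hyperparameters(pipeline_config_path, num_train_steps=2000):
    """Build the hyperparameter dict for one experiment."""
    return {
        "model_dir": "/opt/training",
        "pipeline_config_path": pipeline_config_path,
        # The cells in this notebook pass every hyperparameter as a string.
        "num_train_steps": str(num_train_steps),
        "sample_1_of_n_eval_examples": "1",
    }


# Config file names used by the experiments in this notebook.
experiments = {
    "total_steps": "pipeline_total_steps.config",
    "augment": "pipeline_augment.config",
    "mobilenet_v2": "pipeline_ssd_mobilenetv2.config",
    "resnet50": "pipeline_ssd_resnet50.config",
}

for name, config_path in experiments.items():
    print(name, make_hyperparameters(config_path))
```

The returned dict could be passed as `hyperparameters=` to `CustomFramework` in place of the literal dicts used in the cells.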

## Change `total_steps` and `warmup_steps` in the learning rate schedule

The baseline run used `num_train_steps` = 2000, but the learning rate schedule in the config had `total_steps` = 300000 and `warmup_steps` = 2500. Under these conditions, training stops while the `cosine_decay_learning_rate` schedule is still in its warmup region: the log above shows the learning rate still rising at step 1800 (0.05788), so the decay phase is never reached. This is not a good training condition, so I changed `total_steps` to 2000 and `warmup_steps` to 200.
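
To make the effect concrete, here is a small sketch of a linear warmup followed by cosine decay. The base rate of 0.08 and the warmup rate of 0.001 are assumptions inferred from the logged `learning_rate` values, not read from the config file, and the function name is illustrative.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps,
                     base_lr=0.08, warmup_lr=0.001):
    """Linear warmup to base_lr, then cosine decay to 0 at total_steps.

    base_lr and warmup_lr are assumed values inferred from the training logs.
    """
    if step < warmup_steps:
        # Warmup: interpolate linearly from warmup_lr up to base_lr.
        return warmup_lr + (base_lr - warmup_lr) * step / warmup_steps
    if step > total_steps:
        return 0.0
    # Decay: half a cosine period from base_lr down to 0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Original schedule: a 2000-step run never leaves the warmup ramp.
for step in (400, 1700, 1800):
    print(step, round(warmup_cosine_lr(step, 300000, 2500), 5))

# Adjusted schedule: warmup ends at step 200 and the rate decays afterwards.
for step in (200, 300, 1600):
    print(step, round(warmup_cosine_lr(step, 2000, 200), 5))
```

With these assumed values the sketch matches the rates in the logs, for example 0.05472 at step 1700 with the original schedule and about 0.00936 at step 1600 with the adjusted one.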

In [10]:
tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=tensorboard_s3_prefix,
    container_local_output_path='/opt/training/'
)

estimator = CustomFramework(
    role=role,
    image_uri=container,
    entry_point='run_training.sh',
    source_dir='source_dir/',
    hyperparameters={
        "model_dir":"/opt/training",        
        "pipeline_config_path": "pipeline_total_steps.config",
        "num_train_steps": "2000",    
        "sample_1_of_n_eval_examples": "1"
    },
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    tensorboard_output_config=tensorboard_output_config,
    disable_profiler=True,
    base_job_name='tf2-object-detection'
)

estimator.fit(inputs)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker:Creating training-job with name: tf2-object-detection-2023-03-24-13-21-00-977


2023-03-24 13:21:04 Starting - Starting the training job...
2023-03-24 13:21:40 Starting - Preparing the instances for training......
2023-03-24 13:22:37 Downloading - Downloading input data...
2023-03-24 13:22:57 Training - Downloading the training image...............
2023-03-24 13:25:23 Training - Training image download completed. Training in progress...[34m2023-03-24 13:25:51,110 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2023-03-24 13:25:51,142 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2023-03-24 13:25:51,175 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2023-03-24 13:25:51,189 sagemaker-training-toolkit INFO     Invoking user script[0m
[34mTraining Env:[0m
[34m{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "train": "/opt/ml/input/data/train",
        "val": "/opt/ml/input/data/val

[34mInstructions for updating:[0m
[34mCreate a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.[0m
[34mW0324 13:26:10.817227 140596730767168 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.[0m
[34mInstructions for updating:[0m
[34mCreate a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.[0m
[34mInstructions for updating:[0m
[34mkeep_dims is deprecated, use keepdims instead[0m
[34mW0324 13:26:12.472265 140596730767168 deprecation.py:554] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.[0m
[34mInstructions for updating:[0m
[34mkeep_dims is deprecated, use keepdims instead[0m
[34mInstructions for updating:[0m
[3

[34mINFO:tensorflow:Step 200 per-step time 0.587s[0m
[34mI0324 13:30:28.861446 140596730767168 model_lib_v2.py:705] Step 200 per-step time 0.587s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.38110533,
 'Loss/localization_loss': 0.020265257,
 'Loss/regularization_loss': 0.029979268,
 'Loss/total_loss': 0.43134984,
 'learning_rate': 0.08}[0m
[34mI0324 13:30:28.861738 140596730767168 model_lib_v2.py:708] {'Loss/classification_loss': 0.38110533,
 'Loss/localization_loss': 0.020265257,
 'Loss/regularization_loss': 0.029979268,
 'Loss/total_loss': 0.43134984,
 'learning_rate': 0.08}[0m
[34mINFO:tensorflow:Step 300 per-step time 0.586s[0m
[34mI0324 13:31:27.500921 140596730767168 model_lib_v2.py:705] Step 300 per-step time 0.586s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.40742272,
 'Loss/localization_loss': 0.029255258,
 'Loss/regularization_loss': 0.030287078,
 'Loss/total_loss': 0.46696508,
 'learning_rate': 0.07939231}[0m
[34mI0324 13:31:27.501222 140596

[34mINFO:tensorflow:Step 1600 per-step time 0.588s[0m
[34mI0324 13:44:14.981795 140596730767168 model_lib_v2.py:705] Step 1600 per-step time 0.588s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.19306538,
 'Loss/localization_loss': 0.012861852,
 'Loss/regularization_loss': 0.031162404,
 'Loss/total_loss': 0.23708963,
 'learning_rate': 0.009358215}[0m
[34mI0324 13:44:14.982129 140596730767168 model_lib_v2.py:708] {'Loss/classification_loss': 0.19306538,
 'Loss/localization_loss': 0.012861852,
 'Loss/regularization_loss': 0.031162404,
 'Loss/total_loss': 0.23708963,
 'learning_rate': 0.009358215}[0m
[34mINFO:tensorflow:Step 1700 per-step time 0.589s[0m
[34mI0324 13:45:13.832865 140596730767168 model_lib_v2.py:705] Step 1700 per-step time 0.589s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.20126282,
 'Loss/localization_loss': 0.011219826,
 'Loss/regularization_loss': 0.0311536,
 'Loss/total_loss': 0.24363625,
 'learning_rate': 0.0053589796}[0m
[34mI0324 13:4

[34mInstructions for updating:[0m
[34mCreate a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.[0m
[34mW0324 13:48:29.541598 139923438155584 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.[0m
[34mInstructions for updating:[0m
[34mCreate a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.[0m
[34mInstructions for updating:[0m
[34mUse `tf.cast` instead.[0m
[34mW0324 13:48:30.988099 139923438155584 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.[0m
[34mInstructions for updating:[0m
[34mUse `tf.cast` instead.[0m
[34mINFO:tensorflow:Waiting for new checkpoint at /opt/training[0m
[34mI0324 13:48:33.627784 139923438155584 c

[34mINFO:tensorflow:Waiting for new checkpoint at /opt/training[0m
[34mI0324 13:53:33.727694 139923438155584 checkpoint_utils.py:140] Waiting for new checkpoint at /opt/training[0m
[34mINFO:tensorflow:Timed-out waiting for a checkpoint.[0m
[34mI0324 13:53:42.743023 139923438155584 checkpoint_utils.py:203] Timed-out waiting for a checkpoint.[0m
[34mcreating index...[0m
[34mindex created![0m
[34mcreating index...[0m
[34mindex created![0m
[34mRunning per image evaluation...[0m
[34mEvaluate annotation type *bbox*[0m
[34mDONE (t=13.10s).[0m
[34mAccumulating evaluation results...[0m
[34mDONE (t=0.26s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.124
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.273
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.101
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.054
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=me

[34mW0324 13:55:32.986784 139860454094656 save.py:271] Found untraced functions such as WeightSharedConvolutionalBoxPredictor_layer_call_fn, WeightSharedConvolutionalBoxPredictor_layer_call_and_return_conditional_losses, WeightSharedConvolutionalBoxHead_layer_call_fn, WeightSharedConvolutionalBoxHead_layer_call_and_return_conditional_losses, WeightSharedConvolutionalClassHead_layer_call_fn while saving (showing 5 of 535). These functions will not be directly callable after loading.[0m
[34mINFO:tensorflow:Assets written to: /tmp/exported/saved_model/assets[0m
[34mI0324 13:56:08.469683 139860454094656 builder_impl.py:797] Assets written to: /tmp/exported/saved_model/assets[0m
[34mINFO:tensorflow:Writing pipeline config file to /tmp/exported/pipeline.config[0m
[34mI0324 13:56:13.033423 139860454094656 config_util.py:253] Writing pipeline config file to /tmp/exported/pipeline.config[0m
[34m2023-03-24 13:56:16,420 sagemaker-training-toolkit INFO     Reporting training SUCCESS[0m

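With this change, the classification, localization, and total losses decreased (at step 1700 the total loss went from 0.318 to 0.244), while the regularization loss stayed roughly the same. Average Precision (AP) @[ IoU=0.50:0.95 | area=all | maxDets=100 ] increased from 0.090 to 0.124, so this model is better than the baseline.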
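
The COCO summary lines in the job output can be compared without copying numbers by hand. The following is a minimal sketch that pulls the AP values out of log text with a regular expression. The function name is illustrative, and the sample strings are copied from the two logs above.

```python
import re

# Matches lines such as:
#  Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.124
AP_LINE = re.compile(
    r"Average Precision\s+\(AP\)\s+@\[\s*IoU=(?P<iou>[\d.:]+)\s*\|"
    r"\s*area=\s*(?P<area>\w+)\s*\|\s*maxDets=\s*(?P<dets>\d+)\s*\]"
    r"\s*=\s*(?P<value>-?\d+\.\d+)"
)


def parse_ap(log_text):
    """Return {(iou, area, max_dets): value} for every AP line in log_text."""
    return {
        (m["iou"], m["area"], int(m["dets"])): float(m["value"])
        for m in AP_LINE.finditer(log_text)
    }


baseline = parse_ap(
    " Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.090\n"
    " Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.221\n"
)
adjusted = parse_ap(
    " Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.124\n"
    " Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.273\n"
)

# Print each metric side by side for the two runs.
for key in baseline:
    print(key, baseline[key], "->", adjusted[key])
```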

## Additional Augmentation

I added a new augmentation, `random_rgb_to_gray`, which randomly converts training images to grayscale. The goal is to make the model less dependent on color, which should help in low-light situations.
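
As a rough illustration of what this kind of augmentation does, the sketch below converts an image to gray with a given probability while keeping three channels. It is a dependency-free approximation, not the API's implementation: the list-of-tuples image format, the luma weights, and the probability values are assumptions made for the example.

```python
import random


def rgb_to_gray_pixel(r, g, b):
    """Luma-weighted gray value (ITU-R BT.601 weights)."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def random_rgb_to_gray(image, probability=0.1, rng=random):
    """With the given probability, replace every pixel by its gray value.

    image is a list of rows, each row a list of (r, g, b) tuples. The output
    keeps three channels so the model input shape does not change.
    """
    if rng.random() >= probability:
        return image  # left unchanged most of the time
    return [
        [(gray, gray, gray) for gray in (rgb_to_gray_pixel(*px) for px in row)]
        for row in image
    ]


image = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (200, 200, 200)]]
# probability=1.0 forces the conversion so the effect is visible.
print(random_rgb_to_gray(image, probability=1.0))
```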

In [16]:
tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=tensorboard_s3_prefix,
    container_local_output_path='/opt/training/'
)

estimator = CustomFramework(
    role=role,
    image_uri=container,
    entry_point='run_training.sh',
    source_dir='source_dir/',
    hyperparameters={
        "model_dir":"/opt/training",        
        "pipeline_config_path": "pipeline_augment.config",
        "num_train_steps": "2000",    
        "sample_1_of_n_eval_examples": "1"
    },
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    tensorboard_output_config=tensorboard_output_config,
    disable_profiler=True,
    base_job_name='tf2-object-detection'
)

estimator.fit(inputs)

INFO:sagemaker:Creating training-job with name: tf2-object-detection-2023-03-29-00-26-19-505


2023-03-29 00:26:23 Starting - Starting the training job...
2023-03-29 00:26:50 Starting - Preparing the instances for training......
2023-03-29 00:27:54 Downloading - Downloading input data...
2023-03-29 00:28:15 Training - Downloading the training image...............
2023-03-29 00:30:41 Training - Training image download completed. Training in progress...[34m2023-03-29 00:31:09,687 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2023-03-29 00:31:09,719 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2023-03-29 00:31:09,750 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2023-03-29 00:31:09,764 sagemaker-training-toolkit INFO     Invoking user script[0m
[34mTraining Env:[0m
[34m{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "train": "/opt/ml/input/data/train",
        "val": "/opt/ml/input/data/val

[34mInstructions for updating:[0m
[34mLambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089[0m
[34mW0329 00:31:22.650742 140668385695552 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.[0m
[34mInstructions for updating:[0m
[34mLambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089[0m
[34mInstructions for updating:[0m
[34mCreate a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.[0m
[34mW0329 00:31:29.169249 140668385695552 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/d

[34mINFO:tensorflow:Step 100 per-step time 1.574s[0m
[34mI0329 00:34:53.548921 140668385695552 model_lib_v2.py:705] Step 100 per-step time 1.574s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.46041527,
 'Loss/localization_loss': 0.039735664,
 'Loss/regularization_loss': 0.02962241,
 'Loss/total_loss': 0.52977335,
 'learning_rate': 0.040499996}[0m
[34mI0329 00:34:53.549302 140668385695552 model_lib_v2.py:708] {'Loss/classification_loss': 0.46041527,
 'Loss/localization_loss': 0.039735664,
 'Loss/regularization_loss': 0.02962241,
 'Loss/total_loss': 0.52977335,
 'learning_rate': 0.040499996}[0m
[34mINFO:tensorflow:Step 200 per-step time 0.633s[0m
[34mI0329 00:35:56.853015 140668385695552 model_lib_v2.py:705] Step 200 per-step time 0.633s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.4177233,
 'Loss/localization_loss': 0.046582647,
 'Loss/regularization_loss': 0.029850258,
 'Loss/total_loss': 0.49415618,
 'learning_rate': 0.08}[0m
[34mI0329 00:35:56.853347 1

[34mINFO:tensorflow:Step 1500 per-step time 0.633s[0m
[34mI0329 00:49:41.167862 140668385695552 model_lib_v2.py:705] Step 1500 per-step time 0.633s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.20298663,
 'Loss/localization_loss': 0.0071899197,
 'Loss/regularization_loss': 0.031149004,
 'Loss/total_loss': 0.24132554,
 'learning_rate': 0.014288494}[0m
[34mI0329 00:49:41.168157 140668385695552 model_lib_v2.py:708] {'Loss/classification_loss': 0.20298663,
 'Loss/localization_loss': 0.0071899197,
 'Loss/regularization_loss': 0.031149004,
 'Loss/total_loss': 0.24132554,
 'learning_rate': 0.014288494}[0m
[34mINFO:tensorflow:Step 1600 per-step time 0.634s[0m
[34mI0329 00:50:44.617449 140668385695552 model_lib_v2.py:705] Step 1600 per-step time 0.634s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.19783974,
 'Loss/localization_loss': 0.008576823,
 'Loss/regularization_loss': 0.031137848,
 'Loss/total_loss': 0.23755442,
 'learning_rate': 0.009358215}[0m
[34mI0329 0

[34mI0329 00:55:32.070667 140586453665600 api.py:459] feature_map_spatial_dims: [(80, 80), (40, 40), (20, 20), (10, 10), (5, 5)][0m
[34mI0329 00:55:47.625029 140586453665600 api.py:459] feature_map_spatial_dims: [(80, 80), (40, 40), (20, 20), (10, 10), (5, 5)][0m
[34mInstructions for updating:[0m
[34mUse `tf.cast` instead.[0m
[34mW0329 00:55:56.584451 140586453665600 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.[0m
[34mInstructions for updating:[0m
[34mUse `tf.cast` instead.[0m
[34mINFO:tensorflow:Finished eval step 0[0m
[34mI0329 00:55:56.617400 140586453665600 model_lib_v2.py:966] Finished eval step 0[0m
[34mInstructions for updating:[0m
[34mtf.py_func is deprecated in TF V2. Instead, there are two
    options available in V2.
    - tf.py_function takes a python function which manipulates tf eager
    ten

[34mI0329 01:00:36.402286 140090219398976 ssd_efficientnet_bifpn_feature_extractor.py:150] EfficientDet EfficientNet backbone version: efficientnet-b1[0m
[34mI0329 01:00:36.402500 140090219398976 ssd_efficientnet_bifpn_feature_extractor.py:152] EfficientDet BiFPN num filters: 88[0m
[34mI0329 01:00:36.402580 140090219398976 ssd_efficientnet_bifpn_feature_extractor.py:153] EfficientDet BiFPN num iterations: 4[0m
[34mI0329 01:00:36.407388 140090219398976 efficientnet_model.py:143] round_filter input=32 output=32[0m
[34mI0329 01:00:36.448312 140090219398976 efficientnet_model.py:143] round_filter input=32 output=32[0m
[34mI0329 01:00:36.448451 140090219398976 efficientnet_model.py:143] round_filter input=16 output=16[0m
[34mI0329 01:00:36.630771 140090219398976 efficientnet_model.py:143] round_filter input=16 output=16[0m
[34mI0329 01:00:36.630934 140090219398976 efficientnet_model.py:143] round_filter input=24 output=24[0m
[34mI0329 01:00:36.959042 140090219398976 efficie

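With this change, the localization loss at step 1600 was lower than in the previous run (0.0086 vs 0.0129), while the classification and total losses were about the same (total 0.238 vs 0.237). Average Precision (AP) @[ IoU=0.50:0.95 | area=all | maxDets=100 ] increased from 0.124 to 0.129, a small improvement over the previous configuration.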

## MobileNet V2
In the code below, the MobileNet V2 model is downloaded and extracted.

In [17]:
%%bash
mkdir /tmp/checkpoint
mkdir source_dir/checkpoint
wget -O /tmp/mobilenet.tar.gz http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz
tar -zxvf /tmp/mobilenet.tar.gz --strip-components 2 --directory source_dir/checkpoint ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint

ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/ckpt-0.data-00000-of-00001
ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/checkpoint
ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/ckpt-0.index


mkdir: cannot create directory ‘/tmp/checkpoint’: File exists
mkdir: cannot create directory ‘source_dir/checkpoint’: File exists
--2023-03-29 01:23:18--  http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 142.251.111.128, 2607:f8b0:4004:c08::80
Connecting to download.tensorflow.org (download.tensorflow.org)|142.251.111.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20518283 (20M) [application/x-tar]
Saving to: ‘/tmp/mobilenet.tar.gz’

     0K .......... .......... .......... .......... ..........  0% 24.5M 1s
    50K .......... .......... .......... .......... ..........  0% 42.7M 1s
   100K .......... .......... .......... .......... ..........  0% 46.3M 1s
   150K .......... .......... .......... .......... ..........  0% 37.7M 1s
   200K .......... .......... .......... .......... ..........  1% 48.9M 1s
   250K .........
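
The `mkdir` errors in the output above only mean that the folders already existed from the earlier download. As an alternative to the shell commands, the extraction step could be written in Python so that re-running it does not raise that error. This is a sketch under assumptions: `extract_subdir` is a hypothetical helper, and it copies only regular files.

```python
import os
import tarfile


def extract_subdir(archive_path, subdir, dest_dir):
    """Extract the regular files under `subdir` of a tar archive into dest_dir.

    The `subdir` prefix is dropped from each path, similar in effect to
    `tar --strip-components` in the cell above.
    """
    os.makedirs(dest_dir, exist_ok=True)  # no error if the folder exists
    prefix = subdir.rstrip("/") + "/"
    extracted = []
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive.getmembers():
            if not (member.isfile() and member.name.startswith(prefix)):
                continue
            relative = member.name[len(prefix):]
            # Skip anything that would be written outside dest_dir.
            if os.path.isabs(relative) or ".." in relative.split("/"):
                continue
            target = os.path.join(dest_dir, relative)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with archive.extractfile(member) as src, open(target, "wb") as dst:
                dst.write(src.read())
            extracted.append(relative)
    return extracted


# Example call with the paths from the cell above (needs the downloaded file):
# extract_subdir("/tmp/mobilenet.tar.gz",
#                "ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint",
#                "source_dir/checkpoint")
```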

In the cell below, the training job for MobileNet V2 is created.

In [18]:
tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=tensorboard_s3_prefix,
    container_local_output_path='/opt/training/'
)

estimator = CustomFramework(
    role=role,
    image_uri=container,
    entry_point='run_training.sh',
    source_dir='source_dir/',
    hyperparameters={
        "model_dir":"/opt/training",        
        "pipeline_config_path": "pipeline_ssd_mobilenetv2.config",
        "num_train_steps": "2000",    
        "sample_1_of_n_eval_examples": "1"
    },
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    tensorboard_output_config=tensorboard_output_config,
    disable_profiler=True,
    base_job_name='tf2-object-detection'
)

estimator.fit(inputs)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker:Creating training-job with name: tf2-object-detection-2023-03-29-01-31-13-477


2023-03-29 01:31:17 Starting - Starting the training job...
2023-03-29 01:31:41 Starting - Preparing the instances for training.........
2023-03-29 01:33:17 Downloading - Downloading input data
2023-03-29 01:33:17 Training - Downloading the training image...............
2023-03-29 01:35:48 Training - Training image download completed. Training in progress....[34m2023-03-29 01:36:15,212 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2023-03-29 01:36:15,244 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2023-03-29 01:36:15,275 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2023-03-29 01:36:15,288 sagemaker-training-toolkit INFO     Invoking user script[0m
[34mTraining Env:[0m
[34m{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "train": "/opt/ml/input/data/train",
        "val": "/opt/ml/input/data/va

[34mInstructions for updating:[0m
[34mLambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089[0m
[34mW0329 01:36:24.132856 140217287337792 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.[0m
[34mInstructions for updating:[0m
[34mLambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089[0m
[34mInstructions for updating:[0m
[34mCreate a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.[0m
[34mW0329 01:36:30.525356 140217287337792 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/d

[34mINFO:tensorflow:Step 100 per-step time 0.603s[0m
[34mI0329 01:38:05.775221 140217287337792 model_lib_v2.py:705] Step 100 per-step time 0.603s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.31157234,
 'Loss/localization_loss': 0.061194032,
 'Loss/regularization_loss': 0.15140143,
 'Loss/total_loss': 0.5241678,
 'learning_rate': 0.040499996}[0m
[34mI0329 01:38:05.775599 140217287337792 model_lib_v2.py:708] {'Loss/classification_loss': 0.31157234,
 'Loss/localization_loss': 0.061194032,
 'Loss/regularization_loss': 0.15140143,
 'Loss/total_loss': 0.5241678,
 'learning_rate': 0.040499996}[0m
[34mINFO:tensorflow:Step 200 per-step time 0.195s[0m
[34mI0329 01:38:25.258503 140217287337792 model_lib_v2.py:705] Step 200 per-step time 0.195s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.70497596,
 'Loss/localization_loss': 0.023832431,
 'Loss/regularization_loss': 0.15123035,
 'Loss/total_loss': 0.88003874,
 'learning_rate': 0.08}[0m
[34mI0329 01:38:25.258819 140

[34mINFO:tensorflow:Step 1500 per-step time 0.195s[0m
[34mI0329 01:42:39.544528 140217287337792 model_lib_v2.py:705] Step 1500 per-step time 0.195s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.23300925,
 'Loss/localization_loss': 0.018062567,
 'Loss/regularization_loss': 0.14579183,
 'Loss/total_loss': 0.39686364,
 'learning_rate': 0.014288494}[0m
[34mI0329 01:42:39.544827 140217287337792 model_lib_v2.py:708] {'Loss/classification_loss': 0.23300925,
 'Loss/localization_loss': 0.018062567,
 'Loss/regularization_loss': 0.14579183,
 'Loss/total_loss': 0.39686364,
 'learning_rate': 0.014288494}[0m
[34mINFO:tensorflow:Step 1600 per-step time 0.195s[0m
[34mI0329 01:42:58.998591 140217287337792 model_lib_v2.py:705] Step 1600 per-step time 0.195s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.26016077,
 'Loss/localization_loss': 0.019853545,
 'Loss/regularization_loss': 0.14565557,
 'Loss/total_loss': 0.42566988,
 'learning_rate': 0.009358215}[0m
[34mI0329 01:42:

[34mI0329 01:44:39.563991 140666797791040 api.py:459] feature_map_spatial_dims: [(80, 80), (40, 40), (20, 20), (10, 10), (5, 5)][0m
[34mI0329 01:44:52.864710 140666797791040 api.py:459] feature_map_spatial_dims: [(80, 80), (40, 40), (20, 20), (10, 10), (5, 5)][0m
[34mInstructions for updating:[0m
[34mUse `tf.cast` instead.[0m
[34mW0329 01:44:59.251924 140666797791040 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.[0m
[34mInstructions for updating:[0m
[34mUse `tf.cast` instead.[0m
[34mINFO:tensorflow:Finished eval step 0[0m
[34mI0329 01:44:59.287539 140666797791040 model_lib_v2.py:966] Finished eval step 0[0m
[34mInstructions for updating:[0m
[34mtf.py_func is deprecated in TF V2. Instead, there are two
    options available in V2.
    - tf.py_function takes a python function which manipulates tf eager
    ten

[34mInstructions for updating:[0m
[34mLambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089[0m
[34mW0329 01:49:48.943148 140053732468544 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.[0m
[34mInstructions for updating:[0m
[34mLambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089[0m
[34mInstructions for updating:[0m
[34mback_prop=False is deprecated. Consider using tf.stop_gradient instead.[0m
[34mInstead of:[0m
[34mresults = tf.map_fn(fn, elems, back_prop=False)[0m
[34mUse:[0m
[34mresults = tf.nest.map_structure(tf

[34mW0329 01:50:30.558800 140053732468544 save.py:271] Found untraced functions such as WeightSharedConvolutionalBoxPredictor_layer_call_fn, WeightSharedConvolutionalBoxPredictor_layer_call_and_return_conditional_losses, WeightSharedConvolutionalBoxHead_layer_call_fn, WeightSharedConvolutionalBoxHead_layer_call_and_return_conditional_losses, WeightSharedConvolutionalClassHead_layer_call_fn while saving (showing 5 of 173). These functions will not be directly callable after loading.[0m
[34mINFO:tensorflow:Assets written to: /tmp/exported/saved_model/assets[0m
[34mI0329 01:50:36.909745 140053732468544 builder_impl.py:797] Assets written to: /tmp/exported/saved_model/assets[0m
[34mINFO:tensorflow:Writing pipeline config file to /tmp/exported/pipeline.config[0m
[34mI0329 01:50:38.340974 140053732468544 config_util.py:253] Writing pipeline config file to /tmp/exported/pipeline.config[0m
[34m2023-03-29 01:50:39,808 sagemaker-training-toolkit INFO     Reporting training SUCCESS[0m

## ResNet50
In the code below, the ResNet50 model is downloaded and extracted.

In [19]:
%%bash
mkdir /tmp/checkpoint
mkdir source_dir/checkpoint
wget -O /tmp/resnet.tar.gz http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz
tar -zxvf /tmp/resnet.tar.gz --strip-components 2 --directory source_dir/checkpoint ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint

ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0.data-00000-of-00001
ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/checkpoint
ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0.index


mkdir: cannot create directory ‘/tmp/checkpoint’: File exists
mkdir: cannot create directory ‘source_dir/checkpoint’: File exists
--2023-03-29 01:54:57--  http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 142.251.111.128, 2607:f8b0:4004:c17::80
Connecting to download.tensorflow.org (download.tensorflow.org)|142.251.111.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 244817203 (233M) [application/x-tar]
Saving to: ‘/tmp/resnet.tar.gz’

     0K .......... .......... .......... .......... ..........  0% 16.0M 15s
    50K .......... .......... .......... .......... ..........  0% 28.7M 11s
   100K .......... .......... .......... .......... ..........  0% 33.2M 10s
   150K .......... .......... .......... .......... ..........  0% 90.2M 8s
   200K .......... .......... .......... .......... ..........  0% 88.0M 7s
   250K .......... .

In the cell below, the training job for ResNet50 is created.

In [22]:
tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=tensorboard_s3_prefix,
    container_local_output_path='/opt/training/'
)

estimator = CustomFramework(
    role=role,
    image_uri=container,
    entry_point='run_training.sh',
    source_dir='source_dir/',
    hyperparameters={
        "model_dir":"/opt/training",        
        "pipeline_config_path": "pipeline_ssd_resnet50.config",
        "num_train_steps": "2000",    
        "sample_1_of_n_eval_examples": "1"
    },
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    tensorboard_output_config=tensorboard_output_config,
    disable_profiler=True,
    base_job_name='tf2-object-detection'
)

estimator.fit(inputs)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker:Creating training-job with name: tf2-object-detection-2023-03-29-02-35-34-176


2023-03-29 02:35:43 Starting - Starting the training job...
2023-03-29 02:36:09 Starting - Preparing the instances for training......
2023-03-29 02:37:21 Downloading - Downloading input data...
2023-03-29 02:37:47 Training - Downloading the training image...............
2023-03-29 02:40:13 Training - Training image download completed. Training in progress....
2023-03-29 02:40:43,524 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-03-29 02:40:43,563 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-03-29 02:40:43,600 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-03-29 02:40:43,615 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "train": "/opt/ml/input/data/train",
        "val": "/opt/ml/input/data/va

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0329 02:40:52.302202 139777608804160 mirrored_strategy.py:374] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: 2000
I0329 02:40:52.306302 139777608804160 config_util.py:552] Maybe overwriting train_steps: 2000
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0329 02:40:52.306449 139777608804160 config_util.py:552] Maybe overwriting use_bfloat16: False
Instructions for updating:
rename to distribute_datasets_from_function
W0329 02:40:52.336361 139777608804160 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version

I0329 02:41:48.271227 139754502862592 api.py:459] feature_map_spatial_dims: [(80, 80), (40, 40), (20, 20), (10, 10), (5, 5)]
I0329 02:41:56.263424 139754502862592 api.py:459] feature_map_spatial_dims: [(80, 80), (40, 40), (20, 20), (10, 10), (5, 5)]
I0329 02:42:05.060463 139754502862592 api.py:459] feature_map_spatial_dims: [(80, 80), (40, 40), (20, 20), (10, 10), (5, 5)]
INFO:tensorflow:Step 100 per-step time 0.855s
I0329 02:43:03.217629 139777608804160 model_lib_v2.py:705] Step 100 per-step time 0.855s
INFO:tensorflow:{'Loss/classification_loss': 0.5385508,
 'Loss/localization_loss': 0.04592109,
 'Loss/regularization_loss': 0.026066214,
 'Loss/total_loss': 0.6105381,
 'learning_rate': 0.040499996}
I0329 02:43:03.218054 139777608804160 model_lib_v2.py:708] {'Loss/classification_loss': 0.5385508,
 'Loss/localization_loss': 0.04592109,
 'Loss/regularization_loss': 0.026066214,
 'Loss/total_loss': 0.6105381,
 'learning_rate': 0.0

INFO:tensorflow:Step 1400 per-step time 0.376s
I0329 02:51:15.859798 139777608804160 model_lib_v2.py:705] Step 1400 per-step time 0.376s
INFO:tensorflow:{'Loss/classification_loss': 0.69222647,
 'Loss/localization_loss': 0.05906934,
 'Loss/regularization_loss': 13.253136,
 'Loss/total_loss': 14.004432,
 'learning_rate': 0.019999998}
I0329 02:51:15.860113 139777608804160 model_lib_v2.py:708] {'Loss/classification_loss': 0.69222647,
 'Loss/localization_loss': 0.05906934,
 'Loss/regularization_loss': 13.253136,
 'Loss/total_loss': 14.004432,
 'learning_rate': 0.019999998}
INFO:tensorflow:Step 1500 per-step time 0.377s
I0329 02:51:53.519311 139777608804160 model_lib_v2.py:705] Step 1500 per-step time 0.377s
INFO:tensorflow:{'Loss/classification_loss': 0.75248796,
 'Loss/localization_loss': 0.04130427,
 'Loss/regularization_loss': 13.23439,
 'Loss/total_loss': 14.028183,
 'learning_rate': 0.014288494}
I0329 02:51:53.519608 

Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W0329 02:55:16.801776 140285297522496 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
Instructions for updating:
Use `tf.cast` instead.
W0329 02:55:18.327437 140285297522496 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
INFO:tensorflow:Waiting for new checkpoint at /opt/training
I0329 02:55:21.317922 140285297522496 c

INFO:tensorflow:Waiting for new checkpoint at /opt/training
I0329 03:00:21.415199 140285297522496 checkpoint_utils.py:140] Waiting for new checkpoint at /opt/training
INFO:tensorflow:Timed-out waiting for a checkpoint.
I0329 03:00:30.429401 140285297522496 checkpoint_utils.py:203] Timed-out waiting for a checkpoint.
creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=15.03s).
Accumulating evaluation results...
DONE (t=0.29s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=me
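
The loss dictionaries in the training output above can be checked by hand: the logged total should equal the sum of the classification, localization, and regularization terms. The sketch below copies the values logged at steps 100, 1400, and 1500 and reports how much of the total comes from regularization. The helper name `regularization_share` is introduced here only for illustration.

```python
# Loss values copied from the SSD ResNet50 training log above.
logged_losses = {
    100: {"classification": 0.5385508, "localization": 0.04592109,
          "regularization": 0.026066214, "total": 0.6105381},
    1400: {"classification": 0.69222647, "localization": 0.05906934,
           "regularization": 13.253136, "total": 14.004432},
    1500: {"classification": 0.75248796, "localization": 0.04130427,
           "regularization": 13.23439, "total": 14.028183},
}


def regularization_share(losses):
    """Fraction of the total loss contributed by the regularization term."""
    return losses["regularization"] / losses["total"]


for step, losses in sorted(logged_losses.items()):
    # The three components should add up to the logged total loss.
    component_sum = (losses["classification"]
                     + losses["localization"]
                     + losses["regularization"])
    print(f"step {step}: components sum to {component_sum:.4f}, "
          f"logged total {losses['total']:.4f}, "
          f"regularization share {regularization_share(losses):.1%}")
```

In this run the regularization term grows from a few percent of the total at step 100 to more than 90% by step 1400, while the classification and localization losses stay below 1. That pattern is consistent with the AP of 0.000 reported by the evaluation.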

## Faster RCNN ResNet101
In the code below, the Faster RCNN ResNet101 model is downloaded and extracted.

In [36]:
%%bash
mkdir /tmp/checkpoint
mkdir source_dir/checkpoint
wget -O /tmp/resnet101.tar.gz http://download.tensorflow.org/models/object_detection/tf2/20200711/faster_rcnn_resnet101_v1_640x640_coco17_tpu-8.tar.gz
tar -zxvf /tmp/resnet101.tar.gz --strip-components 2 --directory source_dir/checkpoint faster_rcnn_resnet101_v1_640x640_coco17_tpu-8/checkpoint

faster_rcnn_resnet101_v1_640x640_coco17_tpu-8/checkpoint/ckpt-0.data-00000-of-00001
faster_rcnn_resnet101_v1_640x640_coco17_tpu-8/checkpoint/checkpoint
faster_rcnn_resnet101_v1_640x640_coco17_tpu-8/checkpoint/ckpt-0.index


mkdir: cannot create directory ‘/tmp/checkpoint’: File exists
mkdir: cannot create directory ‘source_dir/checkpoint’: File exists
--2023-03-29 05:17:31--  http://download.tensorflow.org/models/object_detection/tf2/20200711/faster_rcnn_resnet101_v1_640x640_coco17_tpu-8.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.253.63.128, 2607:f8b0:4004:c17::80
Connecting to download.tensorflow.org (download.tensorflow.org)|172.253.63.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 353643040 (337M) [application/x-tar]
Saving to: ‘/tmp/resnet101.tar.gz’

     0K .......... .......... .......... .......... ..........  0% 18.4M 18s
    50K .......... .......... .......... .......... ..........  0% 36.7M 14s
   100K .......... .......... .......... .......... ..........  0% 35.9M 12s
   150K .......... .......... .......... .......... ..........  0% 24.6M 13s
   200K .......... .......... .......... .......... ..........  0% 37.8M 12s
   250K ....
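
The `tar ... --strip-components 2` call above keeps only the files under the archive's `checkpoint/` folder and drops the two leading path components, so the checkpoint files land directly in `source_dir/checkpoint`. A minimal Python sketch of the same idea follows, assuming the archive layout shown in the output above. `extract_stripped` is a hypothetical helper, not part of the project code, and it creates the destination with `exist_ok=True`, so rerunning it does not produce the "File exists" messages that `mkdir` prints.

```python
import shutil
import tarfile
from pathlib import Path, PurePosixPath


def extract_stripped(archive_path, member_prefix, dest_dir, strip_components=2):
    """Extract regular files under member_prefix, dropping leading path parts.

    Hypothetical helper that mirrors `tar --strip-components N --directory DEST`.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
    extracted = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            # Keep only regular files inside the requested folder.
            if not member.isfile():
                continue
            if not member.name.startswith(member_prefix + "/"):
                continue
            parts = PurePosixPath(member.name).parts[strip_components:]
            # Skip entries that vanish after stripping or try to escape dest.
            if not parts or ".." in parts:
                continue
            target = dest.joinpath(*parts)
            target.parent.mkdir(parents=True, exist_ok=True)
            source = archive.extractfile(member)
            with source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            extracted.append(target)
    return extracted
```

With the paths from the cell above, a call would look like `extract_stripped("/tmp/resnet101.tar.gz", "faster_rcnn_resnet101_v1_640x640_coco17_tpu-8/checkpoint", "source_dir/checkpoint")`.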

In the cell below, create the training job for Faster RCNN ResNet101.

In [37]:
tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=tensorboard_s3_prefix,
    container_local_output_path='/opt/training/'
)

estimator = CustomFramework(
    role=role,
    image_uri=container,
    entry_point='run_training.sh',
    source_dir='source_dir/',
    hyperparameters={
        "model_dir":"/opt/training",        
        "pipeline_config_path": "pipeline_frcnn_resnet101.config",
        "num_train_steps": "2000",    
        "sample_1_of_n_eval_examples": "1"
    },
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    tensorboard_output_config=tensorboard_output_config,
    disable_profiler=True,
    base_job_name='tf2-object-detection'
)

estimator.fit(inputs)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker:Creating training-job with name: tf2-object-detection-2023-03-29-05-18-07-641


2023-03-29 05:18:24 Starting - Starting the training job...
2023-03-29 05:18:49 Starting - Preparing the instances for training......
2023-03-29 05:19:45 Downloading - Downloading input data...
2023-03-29 05:20:10 Training - Downloading the training image...............
2023-03-29 05:22:51 Training - Training image download completed. Training in progress....
2023-03-29 05:23:19,540 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-03-29 05:23:19,574 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-03-29 05:23:19,606 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-03-29 05:23:19,620 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "train": "/opt/ml/input/data/train",
        "val": "/opt/ml/input/data/va

Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W0329 05:23:34.730974 140184545793856 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
Instructions for updating:
Use `tf.cast` instead.
W0329 05:23:38.796266 140184545793856 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
INFO:tensorflow:depth of additional conv before box predictor: 0
I0329 05:23:49.974357 140161415374

INFO:tensorflow:Step 200 per-step time 0.247s
I0329 05:26:33.736022 140184545793856 model_lib_v2.py:705] Step 200 per-step time 0.247s
INFO:tensorflow:{'Loss/BoxClassifierLoss/classification_loss': 0.30467826,
 'Loss/BoxClassifierLoss/localization_loss': 0.5284495,
 'Loss/RPNLoss/localization_loss': 1.7026118,
 'Loss/RPNLoss/objectness_loss': 0.48177233,
 'Loss/regularization_loss': 0.0,
 'Loss/total_loss': 3.0175118,
 'learning_rate': 0.08}
I0329 05:26:33.736360 140184545793856 model_lib_v2.py:708] {'Loss/BoxClassifierLoss/classification_loss': 0.30467826,
 'Loss/BoxClassifierLoss/localization_loss': 0.5284495,
 'Loss/RPNLoss/localization_loss': 1.7026118,
 'Loss/RPNLoss/objectness_loss': 0.48177233,
 'Loss/regularization_loss': 0.0,
 'Loss/total_loss': 3.0175118,
 'learning_rate': 0.08}
INFO:tensorflow:Step 300 per-step time 0.249s
I0329 05:26:58.641317 140184545793856 model_lib_v2.py:705] Step 300 per-step time 0.249s
INFO:t

INFO:tensorflow:Step 1200 per-step time 0.245s
I0329 05:30:42.935031 140184545793856 model_lib_v2.py:705] Step 1200 per-step time 0.245s
INFO:tensorflow:{'Loss/BoxClassifierLoss/classification_loss': 0.21121594,
 'Loss/BoxClassifierLoss/localization_loss': 0.23883379,
 'Loss/RPNLoss/localization_loss': 0.89880604,
 'Loss/RPNLoss/objectness_loss': 0.13289595,
 'Loss/regularization_loss': 0.0,
 'Loss/total_loss': 1.4817517,
 'learning_rate': 0.033054072}
I0329 05:30:42.935390 140184545793856 model_lib_v2.py:708] {'Loss/BoxClassifierLoss/classification_loss': 0.21121594,
 'Loss/BoxClassifierLoss/localization_loss': 0.23883379,
 'Loss/RPNLoss/localization_loss': 0.89880604,
 'Loss/RPNLoss/objectness_loss': 0.13289595,
 'Loss/regularization_loss': 0.0,
 'Loss/total_loss': 1.4817517,
 'learning_rate': 0.033054072}
INFO:tensorflow:Step 1300 per-step time 0.245s
I0329 05:31:07.452829 140184545793856 model_lib_v2.py:705] Step 1300 per-step time 

INFO:tensorflow:Reading unweighted datasets: ['/opt/ml/input/data/val/*.tfrecord']
I0329 05:34:08.718782 140327633033024 dataset_builder.py:162] Reading unweighted datasets: ['/opt/ml/input/data/val/*.tfrecord']
INFO:tensorflow:Reading record datasets for input file: ['/opt/ml/input/data/val/*.tfrecord']
I0329 05:34:08.720362 140327633033024 dataset_builder.py:79] Reading record datasets for input file: ['/opt/ml/input/data/val/*.tfrecord']
INFO:tensorflow:Number of filenames to read: 13
I0329 05:34:08.720503 140327633033024 dataset_builder.py:80] Number of filenames to read: 13
W0329 05:34:08.720648 140327633033024 dataset_builder.py:86] num_readers has been reduced to 13 to match input file shards.
W0329 05:34:08.723127 140327633033024 dataset_builder.py:93] `shuffle` is false, but the input data stream is still slightly shuffled since `num_readers` > 1.
Instructions for updating:
Use `tf.data.Datas

INFO:tensorflow:Finished eval step 100
I0329 05:35:02.168758 140327633033024 model_lib_v2.py:966] Finished eval step 100
INFO:tensorflow:Finished eval step 200
I0329 05:35:09.897058 140327633033024 model_lib_v2.py:966] Finished eval step 200
INFO:tensorflow:Performing evaluation on 258 images.
I0329 05:35:14.480276 140327633033024 coco_evaluation.py:293] Performing evaluation on 258 images.
INFO:tensorflow:Loading and preparing annotation results...
I0329 05:35:14.485685 140327633033024 coco_tools.py:116] Loading and preparing annotation results...
INFO:tensorflow:DONE (t=0.03s)
I0329 05:35:14.514098 140327633033024 coco_tools.py:138] DONE (t=0.03s)
INFO:tensorflow:Eval metrics at step 2000
I0329 05:35:35.444257 140327633033024 model_lib_v2.py:1015] Eval metrics at step 2000
INFO:tensorflow:    + DetectionBoxes_Precision/mAP: 0.003169
I0329 05:35:35.457017 140327633

W0329 05:40:01.296068 139696016164672 save_impl.py:66] Skipping full serialization of Keras layer <object_detection.meta_architectures.faster_rcnn_meta_arch.FasterRCNNMetaArch object at 0x7f0cc15410d0>, because it is not built.
W0329 05:40:28.852456 139696016164672 save.py:271] Found untraced functions such as FirstStageBoxPredictor_layer_call_fn, FirstStageBoxPredictor_layer_call_and_return_conditional_losses, mask_rcnn_keras_box_predictor_layer_call_fn, mask_rcnn_keras_box_predictor_layer_call_and_return_conditional_losses, _jit_compiled_convolution_op while saving (showing 5 of 135). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: /tmp/exported/saved_model/assets
I0329 05:40:38.342046 139696016164672 builder_impl.py:797] Assets written to: /tmp/exported/saved_model/assets
INFO:tensorflow:Writing pipeline config file to /tmp/exported/pipeline.config
I0329 05:40:40.220364 139696016164672 
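
The evaluation metrics in the training output above are logged one per line in the form `+ <metric name>: <value>`, with the leading tab sometimes rendered as `#011`. The sketch below shows one way such lines could be collected into a dictionary for later comparison. The regular expression and the helper `parse_eval_metrics` are assumptions made for illustration, and the sample text contains only the single metric line visible in the log.

```python
import re

# Matches "+ name: value" preceded by a tab, a space, or the literal "#011"
# that stands in for a tab in the captured log output.
METRIC_LINE = re.compile(
    r"(?:#011|\s)\+\s(?P<name>[^:]+):\s(?P<value>-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def parse_eval_metrics(log_text):
    """Return {metric name: float value} for every metric line in log_text."""
    metrics = {}
    for line in log_text.splitlines():
        match = METRIC_LINE.search(line)
        if match:
            # Repeated lines (the log prints each metric twice) overwrite
            # the earlier entry with the same value.
            metrics[match.group("name").strip()] = float(match.group("value"))
    return metrics


sample_log = (
    "INFO:tensorflow:Eval metrics at step 2000\n"
    "INFO:tensorflow:#011+ DetectionBoxes_Precision/mAP: 0.003169\n"
)
print(parse_eval_metrics(sample_log))
```

Lines that do not follow the metric format, such as the "Eval metrics at step 2000" header, are skipped.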

## Summary
This is a summary of my experiments. SSD ResNet50 did not converge (its evaluation AP stayed at 0.000), so it is excluded from the comparison.
SSD EfficientDet D1 has the highest AP and AR, which makes it the best model in my experiments.

| Metric                                                              | SSD EfficientDet D1 | SSD MobileNet V2 | Faster RCNN ResNet101 |
| ------------------------------------------------------------------- | ------------------- | ---------------- | --------------------- |
| Average Precision  (AP) @[ IoU=0.50:0.95 ; area=all ; maxDets=100 ] | 0.129               | 0.103            | 0.003                 |
| Average Recall     (AR) @[ IoU=0.50:0.95 ; area=all ; maxDets=100 ] | 0.179               | 0.159            | 0.026                 |
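
The ranking in the table can be reproduced with a few lines of Python. This is only a sketch: the `results` dictionary is typed in by hand from the table above, and the key names `AP` and `AR` are shorthand chosen for this example.

```python
# AP and AR at IoU=0.50:0.95, area=all, maxDets=100, copied from the table.
results = {
    "SSD EfficientDet D1": {"AP": 0.129, "AR": 0.179},
    "SSD MobileNet V2": {"AP": 0.103, "AR": 0.159},
    "Faster RCNN ResNet101": {"AP": 0.003, "AR": 0.026},
}

# Rank the models from highest to lowest AP.
ranked = sorted(results, key=lambda name: results[name]["AP"], reverse=True)
best = ranked[0]

for name in ranked:
    print(f"{name}: AP={results[name]['AP']:.3f}, AR={results[name]['AR']:.3f}")
print(f"Best model by AP: {best}")
```

Sorting by AR instead of AP gives the same order for these three models.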