# TensorFlow Object Detection API and AWS SageMaker

In this notebook, you will train and evaluate different models using the [TensorFlow Object Detection API](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/) and [AWS SageMaker](https://aws.amazon.com/sagemaker/).

If you ever feel stuck, you can refer to this [tutorial](https://aws.amazon.com/blogs/machine-learning/training-and-deploying-models-using-tensorflow-2-with-the-object-detection-api-on-amazon-sagemaker/).

## Dataset

We are using the [Waymo Open Dataset](https://waymo.com/open/) for this project. The dataset has already been exported in the TFRecord format, following the layout described [here](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#create-tensorflow-records). The data is stored on [AWS S3](https://aws.amazon.com/s3/), the AWS object storage service. The images are saved with a resolution of 640x640.

In [1]:
%%capture
%pip install tensorflow_io sagemaker -U

In [2]:
import os
import sagemaker
from sagemaker.estimator import Estimator
from framework import CustomFramework

Save the IAM role in a variable called `role`. It is needed later when launching the training job.

In [3]:
role = sagemaker.get_execution_role()
print(role)

arn:aws:iam::476375380884:role/service-role/AmazonSageMaker-ExecutionRole-20230227T161354


In [4]:
# The train and val paths below are public S3 buckets created by Udacity for this project
inputs = {'train': 's3://cd2688-object-detection-tf2/train/', 
        'val': 's3://cd2688-object-detection-tf2/val/'} 

# Insert path of a folder in your personal S3 bucket to store tensorboard logs.
tensorboard_s3_prefix = 's3://object-detection-project-jckuri/logs/'
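
Both the dataset channels and the TensorBoard prefix are plain S3 URIs. As a minimal sketch (the helper name `split_s3_uri` is hypothetical and not part of the SageMaker SDK), the standard library can split such a URI into its bucket and key prefix, which helps catch a malformed path before a training job is launched:

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {uri!r}")
    # The parsed path keeps its leading slash; S3 keys do not start with one.
    return parsed.netloc, parsed.path.lstrip("/")


# Same values as in the cell above.
inputs = {
    "train": "s3://cd2688-object-detection-tf2/train/",
    "val": "s3://cd2688-object-detection-tf2/val/",
}

for channel, uri in inputs.items():
    bucket, prefix = split_s3_uri(uri)
    print(f"{channel}: bucket={bucket} prefix={prefix}")
```

This only checks the shape of the URI; it does not verify that the bucket exists or that the role can read it.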

## Container

To train the model, you first need to build a [Docker](https://www.docker.com/) container with all the dependencies required by the TF Object Detection API. The code below does the following:
* clones the TensorFlow models repository
* gets the exporter and training scripts from the repository
* builds the Docker image and pushes it
* prints the container name

In [5]:
%%bash

# clone the repo (skipped if it was already cloned) and get the scripts
[ -d docker/models ] || git clone https://github.com/tensorflow/models.git docker/models

# get model_main and exporter_main files from TF2 Object Detection GitHub repository
cp docker/models/research/object_detection/exporter_main_v2.py source_dir
cp docker/models/research/object_detection/model_main_tf2.py source_dir


In [6]:
# Build and push the Docker image. This cell can be commented out after it has been run once.
# This takes around 10 minutes.
image_name = 'tf2-object-detection'
!sh ./docker/build_and_push.sh $image_name

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Building image with name tf2-object-detection
Sending build context to Docker daemon  722.9MB
Step 1/17 : FROM tensorflow/tensorflow:2.9.0-gpu
 ---> c8d9ee2a0ff4
Step 2/17 : ARG DEBIAN_FRONTEND=noninteractive
 ---> Running in ca21e76d1416
Removing intermediate container ca21e76d1416
 ---> 063a581ec0fe
Step 3/17 : RUN rm /etc/apt/sources.list.d/cuda.list
 ---> Running in 4463e992a51e
Removing intermediate container 4463e992a51e
 ---> 8b94f7292444
Step 4/17 : RUN apt-key del 7fa2af80
 ---> Running in acc61013afc4
OK
Removing intermediate container acc61013afc4
 ---> 32004679cc8b
Step 5/17 : RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
 ---> Running in 7f62030de8b7
Executing: /tmp/apt-key-gpghome.nSpnYDwjjk/gpg.1.sh --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub

Get:3 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gpg-wks-server amd64 2.2.19-3ubuntu2.2 [90.2 kB]
Get:4 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gnupg-utils amd64 2.2.19-3ubuntu2.2 [481 kB]
Get:5 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gpg-agent amd64 2.2.19-3ubuntu2.2 [232 kB]
Get:6 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gpg amd64 2.2.19-3ubuntu2.2 [482 kB]
Get:7 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gpgconf amd64 2.2.19-3ubuntu2.2 [124 kB]
Get:8 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gnupg-l10n all 2.2.19-3ubuntu2.2 [51.7 kB]
Get:9 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gnupg all 2.2.19-3ubuntu2.2 [259 kB]
Get:10 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gpgsm amd64 2.2.19-3ubuntu2.2 [217 kB]
Get:11 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 gpgv amd64 2.2.19-3ubuntu2.2 [200 kB]
Get:12 http://archive.ubuntu.com/ubuntu focal-upda

Get:83 http://archive.ubuntu.com/ubuntu focal/main amd64 x11proto-core-dev all 2019.2-1ubuntu1 [2620 B]
Get:84 http://archive.ubuntu.com/ubuntu focal/main amd64 libxau-dev amd64 1:1.0.9-0ubuntu1 [9552 B]
Get:85 http://archive.ubuntu.com/ubuntu focal/main amd64 libxdmcp-dev amd64 1:1.1.3-0ubuntu1 [25.3 kB]
Get:86 http://archive.ubuntu.com/ubuntu focal/main amd64 xtrans-dev all 1.4.0-1 [68.9 kB]
Get:87 http://archive.ubuntu.com/ubuntu focal/main amd64 libpthread-stubs0-dev amd64 0.4-1 [5384 B]
Get:88 http://archive.ubuntu.com/ubuntu focal/main amd64 libxcb1-dev amd64 1.14-2 [80.5 kB]
Get:89 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libx11-dev amd64 2:1.6.9-2ubuntu1.2 [647 kB]
Get:90 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libglx-dev amd64 1.3.2-1~ubuntu0.20.04.2 [14.0 kB]
Get:91 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libgl-dev amd64 1.3.2-1~ubuntu0.20.04.2 [97.8 kB]
Get:92 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libeg

Unpacking libdrm-common (2.4.107-8ubuntu1~20.04.2) ...
Selecting previously unselected package libdrm2:amd64.
Preparing to unpack .../007-libdrm2_2.4.107-8ubuntu1~20.04.2_amd64.deb ...
Unpacking libdrm2:amd64 (2.4.107-8ubuntu1~20.04.2) ...
Selecting previously unselected package libedit2:amd64.
Preparing to unpack .../008-libedit2_3.1-20191231-1_amd64.deb ...
Unpacking libedit2:amd64 (3.1-20191231-1) ...
Selecting previously unselected package libfido2-1:amd64.
Preparing to unpack .../009-libfido2-1_1.3.1-1ubuntu2_amd64.deb ...
Unpacking libfido2-1:amd64 (1.3.1-1ubuntu2) ...
Selecting previously unselected package libxau6:amd64.
Preparing to unpack .../010-libxau6_1%3a1.0.9-0ubuntu1_amd64.deb ...
Unpacking libxau6:amd64 (1:1.0.9-0ubuntu1) ...
Selecting previously unselected package libxdmcp6:amd64.
Preparing to unpack .../011-libxdmcp6_1%3a1.1.3-0ubuntu1_amd64.deb ...
Unpacking libxdmcp6:amd64 (1:1.1.3-0ubuntu1) ...
Selecting previously unselected package libxcb1:amd64.
Preparing to un

Selecting previously unselected package libxcb-xfixes0:amd64.
Preparing to unpack .../054-libxcb-xfixes0_1.14-2_amd64.deb ...
Unpacking libxcb-xfixes0:amd64 (1.14-2) ...
Selecting previously unselected package libxshmfence1:amd64.
Preparing to unpack .../055-libxshmfence1_1.3-1_amd64.deb ...
Unpacking libxshmfence1:amd64 (1.3-1) ...
Selecting previously unselected package libegl-mesa0:amd64.
Preparing to unpack .../056-libegl-mesa0_21.2.6-0ubuntu0.1~20.04.2_amd64.deb ...
Unpacking libegl-mesa0:amd64 (21.2.6-0ubuntu0.1~20.04.2) ...
Selecting previously unselected package libegl1:amd64.
Preparing to unpack .../057-libegl1_1.3.2-1~ubuntu0.20.04.2_amd64.deb ...
Unpacking libegl1:amd64 (1.3.2-1~ubuntu0.20.04.2) ...
Selecting previously unselected package libxcb-glx0:amd64.
Preparing to unpack .../058-libxcb-glx0_1.14-2_amd64.deb ...
Unpacking libxcb-glx0:amd64 (1.14-2) ...
Selecting previously unselected package libxfixes3:amd64.
Preparing to unpack .../059-libxfixes3_1%3a5.0.3-2_amd64.deb 

Selecting previously unselected package libxslt1.1:amd64.
Preparing to unpack .../100-libxslt1.1_1.1.34-4ubuntu0.20.04.1_amd64.deb ...
Unpacking libxslt1.1:amd64 (1.1.34-4ubuntu0.20.04.1) ...
Selecting previously unselected package mesa-vulkan-drivers:amd64.
Preparing to unpack .../101-mesa-vulkan-drivers_21.2.6-0ubuntu0.1~20.04.2_amd64.deb ...
Unpacking mesa-vulkan-drivers:amd64 (21.2.6-0ubuntu0.1~20.04.2) ...
Selecting previously unselected package python3-soupsieve.
Preparing to unpack .../102-python3-soupsieve_1.9.5+dfsg-1_all.deb ...
Unpacking python3-soupsieve (1.9.5+dfsg-1) ...
Selecting previously unselected package python3-bs4.
Preparing to unpack .../103-python3-bs4_4.8.2-1_all.deb ...
Unpacking python3-bs4 (4.8.2-1) ...
Selecting previously unselected package python3-ply.
Preparing to unpack .../104-python3-ply_3.11-3ubuntu0.1_all.deb ...
Unpacking python3-ply (3.11-3ubuntu0.1) ...
Selecting previously unselected package python3-pycparser.
Preparing to unpack .../105-python3

Setting up libdrm-nouveau2:amd64 (2.4.107-8ubuntu1~20.04.2) ...
Setting up libxcb1-dev:amd64 (1.14-2) ...
Setting up gpg-wks-client (2.2.19-3ubuntu2.2) ...
Setting up libxrender1:amd64 (1:0.9.10-1) ...
Setting up libgbm1:amd64 (21.2.6-0ubuntu0.1~20.04.2) ...
Setting up libdrm-radeon1:amd64 (2.4.107-8ubuntu1~20.04.2) ...
Setting up openssh-client (1:8.2p1-4ubuntu0.5) ...
Setting up libdrm-intel1:amd64 (2.4.107-8ubuntu1~20.04.2) ...
Setting up libgl1-mesa-dri:amd64 (21.2.6-0ubuntu0.1~20.04.2) ...
Setting up libx11-dev:amd64 (2:1.6.9-2ubuntu1.2) ...
Setting up libxext6:amd64 (2:1.3.4-0ubuntu1) ...
Setting up libcairo2:amd64 (1.16.0-4ubuntu1) ...
Setting up libxxf86vm1:amd64 (1:1.1.4-1build1) ...
Setting up libegl-mesa0:amd64 (21.2.6-0ubuntu0.1~20.04.2) ...
Setting up libxfixes3:amd64 (1:5.0.3-2) ...
Setting up libgdk-pixbuf2.0-0:amd64 (2.40.0+dfsg-3ubuntu0.4) ...
Setting up python3-cairocffi (0.9.0-4) ...
Setting up xauth (1:1.1-0ubuntu1) ...
Setting up libgdk-pixbuf2.0-bin (2.40.0+dfsg-3

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.8/5.8 MB 117.9 MB/s eta 0:00:00
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.6/43.6 kB 8.9 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting immutabledict
  Downloading immutabledict-2.2.3-py3-none-any.whl (4.0 kB)
Collecting psutil>=5.4.3
  Downloading psutil-5.9.4-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (280 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 280.2/280.2 kB 58.4 MB/s eta 0:00:00
Collecting tensorflow-hub>=0.6.0
  Downloading tensorflow_hub-0.13.0-py2.py3-none-any.whl (100 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.6/100.6 kB 30.2 MB/s eta 0:00:00
Collecting gin-config
  Downloading gin_config-0.5.0-py3-none-any.whl (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.3/61.3 kB 17.8 MB/s eta 0:00:00
Collecting

Collecting charset-normalizer<4,>=2
  Downloading charset_normalizer-3.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (195 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 195.9/195.9 kB 47.5 MB/s eta 0:00:00
Collecting tensorflow-estimator<2.12,>=2.11.0
  Downloading tensorflow_estimator-2.11.0-py2.py3-none-any.whl (439 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 439.2/439.2 kB 29.6 MB/s eta 0:00:00
Collecting protobuf<4,>3.12.2
  Downloading protobuf-3.19.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 100.2 MB/s eta 0:00:00
Collecting flatbuffers>=2.0
  Downloading flatbuffers-23.3.3-py2.py3-none-any.whl (26 kB)
Collecting keras
  Downloading keras-2.11.0-py2.py3-none-any.whl (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 110.8 MB/s eta 0:00:00
Collecting tensorboard<2.12,>=2.11
  Downloading tensorboard-2.11.2-py3-none-any.whl (6.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Building wheel for pycocotools (pyproject.toml): finished with status 'done'
  Created wheel for pycocotools: filename=pycocotools-2.0.6-cp38-cp38-linux_x86_64.whl size=423704 sha256=698bc59abe237512633d41479015d816ab2768765227250eb3e1d2ed38550684
  Stored in directory: /root/.cache/pip/wheels/3e/08/ac/58126fe59992032701437336493f6132e1b72381a62d00b595
  Building wheel for crcmod (setup.py): started
  Building wheel for crcmod (setup.py): finished with status 'done'
  Created wheel for crcmod: filename=crcmod-1.7-cp38-cp38-linux_x86_64.whl size=36027 sha256=2a5d8459100b9a3e1016791677958a3e6ba62831c8453cbebd65bc1a3d1a5a6f
  Stored in directory: /root/.cache/pip/wheels/ca/5a/02/f3acf982a026f3319fb3e798a8dca2d48fafee7761788562e9
  Building wheel for dill (setup.py): started
  Building wheel for dill (setup.py): finished with status 'done'
  Created wheel for dill: filename=dill-0.3.1.1-py3-none-any.whl size=78543 sha256=6d518dc7c82d5bcaaacfc4c9d1c642b7ebc16b9946e58def53fb9a544d8ad0d3
  

Collecting boto3
  Downloading boto3-1.26.100-py3-none-any.whl (135 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 135.5/135.5 kB 24.5 MB/s eta 0:00:00
Collecting retrying>=1.3.3
  Downloading retrying-1.3.4-py3-none-any.whl (11 kB)
Collecting gevent
  Downloading gevent-22.10.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.5/6.5 MB 105.5 MB/s eta 0:00:00
Collecting inotify_simple==1.2.1
  Downloading inotify_simple-1.2.1.tar.gz (7.9 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting paramiko>=2.4.2
  Downloading paramiko-3.1.0-py3-none-any.whl (211 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 211.2/211.2 kB 35.3 MB/s eta 0:00:00
Collecting pynacl>=1.5
  Downloading PyNaCl-1.5.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (856 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 856.7/856.7 kB 80.8 MB/s eta 0:00:00
Colle

a77cab5: Pushing  1.486GB/3.357GB

a77cab5: Pushed   3.388GB/3.357GB

To verify that the image was correctly pushed to the [Elastic Container Registry](https://aws.amazon.com/ecr/), you can look it up in the AWS console. The example below shows three different images pushed to ECR; you should only see one, called `tf2-object-detection`.
![ECR Example](../data/example_ecr.png)


In [7]:
# display the container name
with open(os.path.join('docker', 'ecr_image_fullname.txt'), 'r') as f:
    # the first line holds the full image URI; strip the trailing newline
    container = f.readline().strip()

print(container)

476375380884.dkr.ecr.us-east-1.amazonaws.com/tf2-object-detection:20230328055858
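
The printed name follows the general ECR pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. The sketch below splits it into those parts, which can help when checking that the image landed in the expected region and repository. The helper `parse_ecr_image` is hypothetical, and reading the 14-digit tag as a `YYYYmmddHHMMSS` timestamp is an assumption about what `build_and_push.sh` writes.

```python
import re
from datetime import datetime

ECR_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image(image_uri):
    """Return a dict with the account, region, repository and tag of an ECR image URI."""
    match = ECR_PATTERN.match(image_uri)
    if match is None:
        raise ValueError(f"Not an ECR image URI: {image_uri!r}")
    return match.groupdict()


# Value printed by the cell above.
container = "476375380884.dkr.ecr.us-east-1.amazonaws.com/tf2-object-detection:20230328055858"
parts = parse_ecr_image(container)
print(parts)

# Assumption: the tag is a build timestamp in YYYYmmddHHMMSS form.
built_at = datetime.strptime(parts["tag"], "%Y%m%d%H%M%S")
print(built_at.isoformat())
```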


## Pre-trained model from model zoo

As is common practice, we do not train from scratch; instead we use a pretrained model from the TF Object Detection model zoo. You can find pretrained checkpoints [here](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md). Because your time is limited for this project, we recommend experimenting only with the following models:
* SSD MobileNet V2 FPNLite 640x640	
* SSD ResNet50 V1 FPN 640x640 (RetinaNet50)	
* Faster R-CNN ResNet50 V1 640x640	
* EfficientDet D1 640x640	
* Faster R-CNN ResNet152 V1 640x640	

In the code below, the SSD MobileNet V2 FPNLite model is downloaded and extracted; the equivalent commands for EfficientDet D1 are left commented out. This code should be adjusted if you experiment with other architectures.

In [8]:
%%bash
# -p avoids an error when the directories already exist
mkdir -p /tmp/checkpoint
mkdir -p source_dir/checkpoint

# download the pretrained checkpoint (EfficientDet D1 alternative kept for reference)
#wget -O /tmp/efficientdet.tar.gz http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz
wget -O /tmp/ssd_mobilenet.tar.gz http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz

# extract only the checkpoint files into source_dir/checkpoint
#tar -zxvf /tmp/efficientdet.tar.gz --strip-components 2 --directory source_dir/checkpoint efficientdet_d1_coco17_tpu-32/checkpoint
tar -zxvf /tmp/ssd_mobilenet.tar.gz --strip-components 2 --directory source_dir/checkpoint ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint

#tar -zxvf /tmp/ssd_mobilenet.tar.gz --strip-components 2 --directory . ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/pipeline.config


ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/ckpt-0.data-00000-of-00001
ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/checkpoint
ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/ckpt-0.index

--2023-03-28 06:06:50--  http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.253.63.128, 2607:f8b0:4004:c17::80
Connecting to download.tensorflow.org (download.tensorflow.org)|172.253.63.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20518283 (20M) [application/x-tar]
Saving to: ‘/tmp/ssd_mobilenet.tar.gz’

     0K .......... .......... .......... .......... ..........  0% 12.7M 2s
    50K .......... .......... .......... .......... ..........  0% 25.1M 1s
   100K .......... .......... .......... .......... ..........  0% 22.8M 1s
   150K .......... .......... .......... .......... ..........  0% 19.4M 1s
   200K .......... .......... .......... .......... ..........  1% 24.5M 1s
   250K .......
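
When switching architectures, only the archive name changes in the commands above. The sketch below builds the `wget` and `tar` command lines from a model name, following the pattern used in the cell. The helper `checkpoint_commands` is hypothetical, and it assumes the other archives sit under the same `20200711` directory and contain a `<name>/checkpoint` folder; check the model zoo page for the exact archive name before relying on it.

```python
BASE_URL = "http://download.tensorflow.org/models/object_detection/tf2"


def checkpoint_commands(model_name, release="20200711", target_dir="source_dir/checkpoint"):
    """Return the shell commands that download and unpack a model zoo checkpoint."""
    archive = f"/tmp/{model_name}.tar.gz"
    url = f"{BASE_URL}/{release}/{model_name}.tar.gz"
    return [
        f"wget -O {archive} {url}",
        # --strip-components 2 drops '<model_name>/checkpoint/' from the extracted paths
        f"tar -zxvf {archive} --strip-components 2 --directory {target_dir} {model_name}/checkpoint",
    ]


for command in checkpoint_commands("ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8"):
    print(command)
```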

## Edit pipeline.config file

The [`pipeline.config`](source_dir/pipeline.config) file in the `source_dir` folder should be updated whenever you experiment with a different model. Config files for the different models are available [here](https://github.com/tensorflow/models/tree/master/research/object_detection/configs/tf2).

>Note: The provided `pipeline.config` file works well with the `EfficientDet` model. You will need to modify it when working with other models.
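
The fields that typically change between models are `num_classes`, `batch_size`, `fine_tune_checkpoint` and `fine_tune_checkpoint_type`. As a rough sketch, the snippet below rewrites those fields in a config fragment with regular expressions. The fragment, the helper `set_field`, and every value used (3 classes, batch size 8, the `checkpoint/ckpt-0` path) are illustrative assumptions, not the settings of the provided file.

```python
import re

# Hypothetical fragment standing in for source_dir/pipeline.config.
SAMPLE_CONFIG = '''model {
  ssd {
    num_classes: 90
  }
}
train_config {
  batch_size: 128
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED"
  fine_tune_checkpoint_type: "classification"
}
'''


def set_field(config_text, name, value):
    """Replace the value of every `name: ...` line; raise if the field is absent."""
    pattern = re.compile(rf"^([ \t]*{re.escape(name)}:[ \t]*).*$", flags=re.MULTILINE)
    # A function replacement keeps `value` literal (no backslash interpretation).
    new_text, count = pattern.subn(lambda m: m.group(1) + value, config_text)
    if count == 0:
        raise KeyError(name)
    return new_text


config = SAMPLE_CONFIG
config = set_field(config, "num_classes", "3")  # assumed class count
config = set_field(config, "batch_size", "8")  # assumed batch size
config = set_field(config, "fine_tune_checkpoint", '"checkpoint/ckpt-0"')  # assumed path
config = set_field(config, "fine_tune_checkpoint_type", '"detection"')
print(config)
```

A text-level edit like this is easy to review in a diff, but it does not validate the result; the training script will still reject a config whose fields do not match the chosen architecture.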

## Launch Training Job

Now that we have a dataset, a Docker image and pretrained model weights, we can launch the training job. To do so, we create a [SageMaker Framework](https://sagemaker.readthedocs.io/en/stable/frameworks/index.html), where we specify the container name, the name of the config file, the number of training steps, and so on.

The `run_training.sh` script does the following:
* train the model for `num_train_steps` 
* evaluate over the val dataset
* export the model

Different metrics are displayed during the evaluation phase, including the mean average precision. These metrics can be used to quantify your model's performance and to compare different iterations.

You can also monitor the training progress by navigating to **Training -> Training Jobs** from the Amazon SageMaker dashboard in the Web UI.
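
The hyperparameters set on the estimator in the next cell reach `run_training.sh` as command-line arguments. The sketch below mimics that mapping, assuming the SageMaker training toolkit's usual `--name value` convention; the helper `to_cli_args` is hypothetical, and `run_training.sh` itself is not shown in this notebook, so how it consumes the flags is not covered here.

```python
import shlex


def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into ['--name', 'value', ...]."""
    args = []
    for name, value in hyperparameters.items():
        args.extend([f"--{name}", str(value)])
    return args


# Same hyperparameters as the estimator in the next cell.
hyperparameters = {
    "model_dir": "/opt/training",
    "pipeline_config_path": "pipeline.config",
    "num_train_steps": "10000",
    "sample_1_of_n_eval_examples": "1",
}

print("run_training.sh " + shlex.join(to_cli_args(hyperparameters)))
```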

In [9]:
tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=tensorboard_s3_prefix,
    container_local_output_path='/opt/training/'
)

estimator = CustomFramework(
    role=role,
    image_uri=container,
    entry_point='run_training.sh',
    source_dir='source_dir/',
    hyperparameters={
        "model_dir":"/opt/training",        
        "pipeline_config_path": "pipeline.config",
        "num_train_steps": "10000",    
        "sample_1_of_n_eval_examples": "1"
    },
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    tensorboard_output_config=tensorboard_output_config,
    disable_profiler=True,
    base_job_name='tf2-object-detection'
)

estimator.fit(inputs)

INFO:sagemaker:Creating training-job with name: tf2-object-detection-2023-03-28-06-06-51-015


2023-03-28 06:06:53 Starting - Starting the training job...
2023-03-28 06:07:32 Starting - Preparing the instances for training......
2023-03-28 06:08:30 Downloading - Downloading input data...
2023-03-28 06:08:49 Training - Downloading the training image...............
2023-03-28 06:11:25 Training - Training image download completed. Training in progress....[34m2023-03-28 06:11:51,785 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2023-03-28 06:11:51,816 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2023-03-28 06:11:51,847 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2023-03-28 06:11:51,861 sagemaker-training-toolkit INFO     Invoking user script[0m
[34mTraining Env:[0m
[34m{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "train": "/opt/ml/input/data/train",
        "val": "/opt/ml/input/data/va

[34mInstructions for updating:[0m
[34mCreate a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.[0m
[34mW0328 06:12:07.401397 140083402434368 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.[0m
[34mInstructions for updating:[0m
[34mCreate a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.[0m
[34mInstructions for updating:[0m
[34mkeep_dims is deprecated, use keepdims instead[0m
[34mW0328 06:12:10.100912 140083402434368 deprecation.py:554] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.[0m
[34mInstructions for updating:[0m
[34mkeep_dims is deprecated, use keepdims instead[0m
[34mInstructions for updating:[0m
[3

[34mINFO:tensorflow:Step 400 per-step time 0.197s[0m
[34mI0328 06:14:44.315681 140083402434368 model_lib_v2.py:705] Step 400 per-step time 0.197s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.20762832,
 'Loss/localization_loss': 0.33142215,
 'Loss/regularization_loss': 0.14429398,
 'Loss/total_loss': 0.6833444,
 'learning_rate': 0.0005}[0m
[34mI0328 06:14:44.316056 140083402434368 model_lib_v2.py:708] {'Loss/classification_loss': 0.20762832,
 'Loss/localization_loss': 0.33142215,
 'Loss/regularization_loss': 0.14429398,
 'Loss/total_loss': 0.6833444,
 'learning_rate': 0.0005}[0m
[34mINFO:tensorflow:Step 500 per-step time 0.199s[0m
[34mI0328 06:15:04.262911 140083402434368 model_lib_v2.py:705] Step 500 per-step time 0.199s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.19174273,
 'Loss/localization_loss': 0.269517,
 'Loss/regularization_loss': 0.14297979,
 'Loss/total_loss': 0.6042395,
 'learning_rate': 0.0005}[0m
[34mI0328 06:15:04.263198 140083402434368 m

[34mINFO:tensorflow:Step 1800 per-step time 0.198s[0m
[34mI0328 06:19:23.649404 140083402434368 model_lib_v2.py:705] Step 1800 per-step time 0.198s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.15266459,
 'Loss/localization_loss': 0.21546625,
 'Loss/regularization_loss': 0.13031332,
 'Loss/total_loss': 0.49844414,
 'learning_rate': 0.0005}[0m
[34mI0328 06:19:23.649739 140083402434368 model_lib_v2.py:708] {'Loss/classification_loss': 0.15266459,
 'Loss/localization_loss': 0.21546625,
 'Loss/regularization_loss': 0.13031332,
 'Loss/total_loss': 0.49844414,
 'learning_rate': 0.0005}[0m
[34mINFO:tensorflow:Step 1900 per-step time 0.197s[0m
[34mI0328 06:19:43.367184 140083402434368 model_lib_v2.py:705] Step 1900 per-step time 0.197s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.18649651,
 'Loss/localization_loss': 0.2537909,
 'Loss/regularization_loss': 0.12961614,
 'Loss/total_loss': 0.56990355,
 'learning_rate': 0.0005}[0m
[34mI0328 06:19:43.367499 140083402

[34mINFO:tensorflow:Step 3200 per-step time 0.198s[0m
[34mI0328 06:24:04.004481 140083402434368 model_lib_v2.py:705] Step 3200 per-step time 0.198s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.12721327,
 'Loss/localization_loss': 0.20037253,
 'Loss/regularization_loss': 0.12417373,
 'Loss/total_loss': 0.45175955,
 'learning_rate': 1e-04}[0m
[34mI0328 06:24:04.004794 140083402434368 model_lib_v2.py:708] {'Loss/classification_loss': 0.12721327,
 'Loss/localization_loss': 0.20037253,
 'Loss/regularization_loss': 0.12417373,
 'Loss/total_loss': 0.45175955,
 'learning_rate': 1e-04}[0m
[34mINFO:tensorflow:Step 3300 per-step time 0.198s[0m
[34mI0328 06:24:23.791681 140083402434368 model_lib_v2.py:705] Step 3300 per-step time 0.198s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.10752064,
 'Loss/localization_loss': 0.13682666,
 'Loss/regularization_loss': 0.12390746,
 'Loss/total_loss': 0.36825478,
 'learning_rate': 1e-04}[0m
[34mI0328 06:24:23.792006 14008340243

[34mINFO:tensorflow:Step 4600 per-step time 0.197s[0m
[34mI0328 06:28:42.717858 140083402434368 model_lib_v2.py:705] Step 4600 per-step time 0.197s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.100091666,
 'Loss/localization_loss': 0.11157066,
 'Loss/regularization_loss': 0.12064162,
 'Loss/total_loss': 0.33230394,
 'learning_rate': 1e-04}[0m
[34mI0328 06:28:42.718199 140083402434368 model_lib_v2.py:708] {'Loss/classification_loss': 0.100091666,
 'Loss/localization_loss': 0.11157066,
 'Loss/regularization_loss': 0.12064162,
 'Loss/total_loss': 0.33230394,
 'learning_rate': 1e-04}[0m
[34mINFO:tensorflow:Step 4700 per-step time 0.198s[0m
[34mI0328 06:29:02.542382 140083402434368 model_lib_v2.py:705] Step 4700 per-step time 0.198s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.11069843,
 'Loss/localization_loss': 0.14835526,
 'Loss/regularization_loss': 0.120415956,
 'Loss/total_loss': 0.37946966,
 'learning_rate': 1e-04}[0m
[34mI0328 06:29:02.542722 14008340

[34mINFO:tensorflow:Step 6000 per-step time 0.198s[0m
[34mI0328 06:33:21.738929 140083402434368 model_lib_v2.py:705] Step 6000 per-step time 0.198s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.10756228,
 'Loss/localization_loss': 0.14833282,
 'Loss/regularization_loss': 0.11860817,
 'Loss/total_loss': 0.37450328,
 'learning_rate': 5e-05}[0m
[34mI0328 06:33:21.739262 140083402434368 model_lib_v2.py:708] {'Loss/classification_loss': 0.10756228,
 'Loss/localization_loss': 0.14833282,
 'Loss/regularization_loss': 0.11860817,
 'Loss/total_loss': 0.37450328,
 'learning_rate': 5e-05}[0m
[34mINFO:tensorflow:Step 6100 per-step time 0.215s[0m
[34mI0328 06:33:43.233663 140083402434368 model_lib_v2.py:705] Step 6100 per-step time 0.215s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.108089484,
 'Loss/localization_loss': 0.14087132,
 'Loss/regularization_loss': 0.1184957,
 'Loss/total_loss': 0.3674565,
 'learning_rate': 5e-05}[0m
[34mI0328 06:33:43.233988 140083402434

[34mINFO:tensorflow:Step 7400 per-step time 0.198s[0m
[34mI0328 06:38:02.846490 140083402434368 model_lib_v2.py:705] Step 7400 per-step time 0.198s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.08257485,
 'Loss/localization_loss': 0.11392988,
 'Loss/regularization_loss': 0.11712201,
 'Loss/total_loss': 0.31362677,
 'learning_rate': 5e-05}[0m
[34mI0328 06:38:02.846834 140083402434368 model_lib_v2.py:708] {'Loss/classification_loss': 0.08257485,
 'Loss/localization_loss': 0.11392988,
 'Loss/regularization_loss': 0.11712201,
 'Loss/total_loss': 0.31362677,
 'learning_rate': 5e-05}[0m
[34mINFO:tensorflow:Step 7500 per-step time 0.197s[0m
[34mI0328 06:38:22.589185 140083402434368 model_lib_v2.py:705] Step 7500 per-step time 0.197s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.09261198,
 'Loss/localization_loss': 0.110857464,
 'Loss/regularization_loss': 0.117027335,
 'Loss/total_loss': 0.3204968,
 'learning_rate': 1e-05}[0m
[34mI0328 06:38:22.589524 1400834024

[34mINFO:tensorflow:Step 8800 per-step time 0.198s[0m
[34mI0328 06:42:42.426987 140083402434368 model_lib_v2.py:705] Step 8800 per-step time 0.198s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.0946676,
 'Loss/localization_loss': 0.10392599,
 'Loss/regularization_loss': 0.11674377,
 'Loss/total_loss': 0.31533736,
 'learning_rate': 1e-05}[0m
[34mI0328 06:42:42.427338 140083402434368 model_lib_v2.py:708] {'Loss/classification_loss': 0.0946676,
 'Loss/localization_loss': 0.10392599,
 'Loss/regularization_loss': 0.11674377,
 'Loss/total_loss': 0.31533736,
 'learning_rate': 1e-05}[0m
[34mINFO:tensorflow:Step 8900 per-step time 0.199s[0m
[34mI0328 06:43:02.319526 140083402434368 model_lib_v2.py:705] Step 8900 per-step time 0.199s[0m
[34mINFO:tensorflow:{'Loss/classification_loss': 0.06862148,
 'Loss/localization_loss': 0.08142805,
 'Loss/regularization_loss': 0.11672192,
 'Loss/total_loss': 0.26677147,
 'learning_rate': 1e-05}[0m
[34mI0328 06:43:02.319877 1400834024343

[34mInstructions for updating:[0m
[34mCreate a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.[0m
[34mW0328 06:46:54.756629 139632741836608 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.[0m
[34mInstructions for updating:[0m
[34mCreate a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.[0m
[34mInstructions for updating:[0m
[34mUse `tf.cast` instead.[0m
[34mW0328 06:46:55.884582 139632741836608 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.[0m
[34mInstructions for updating:[0m
[34mUse `tf.cast` instead.[0m
[34mINFO:tensorflow:Waiting for new checkpoint at /opt/training[0m
[34mI0328 06:46:58.552043 139632741836608 c

Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089
W0328 06:47:59.711834 140348559726400 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.
Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.map_fn(fn, elems, back_prop=False)
Use:
results = tf.nest.map_structure(tf

W0328 06:48:41.331298 140348559726400 save.py:271] Found untraced functions such as WeightSharedConvolutionalBoxPredictor_layer_call_fn, WeightSharedConvolutionalBoxPredictor_layer_call_and_return_conditional_losses, WeightSharedConvolutionalBoxHead_layer_call_fn, WeightSharedConvolutionalBoxHead_layer_call_and_return_conditional_losses, WeightSharedConvolutionalClassHead_layer_call_fn while saving (showing 5 of 173). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: /tmp/exported/saved_model/assets
I0328 06:48:47.651884 140348559726400 builder_impl.py:797] Assets written to: /tmp/exported/saved_model/assets
INFO:tensorflow:Writing pipeline config file to /tmp/exported/pipeline.config
I0328 06:48:49.053254 140348559726400 config_util.py:253] Writing pipeline config file to /tmp/exported/pipeline.config
2023-03-28 06:48:50,453 sagemaker-training-toolkit INFO     Reporting training SUCCESS

You should be able to see your training job in the AWS SageMaker console, as shown below:
![Training jobs example](../data/example_trainings.png)


In [10]:
# Inspired by: https://stackoverflow.com/questions/36209068/boto3-grabbing-only-selected-objects-from-the-s3-resource

from urllib.parse import urlparse

import boto3

output_path = estimator.output_path
print('Output path:\n{}\n'.format(output_path))

# Parse the bucket name from the S3 URI instead of slicing by fixed offsets,
# so this works with or without a trailing slash or key prefix.
bucket_name = urlparse(output_path).netloc
s3 = boto3.resource('s3')
my_bucket = s3.Bucket(bucket_name)

# Keep only the model archives written by the training jobs.
prefix = 'tf2-object-detection-'
model_objects = [
    object_summary
    for object_summary in my_bucket.objects.filter(Prefix=prefix)
    if object_summary.key.endswith('model.tar.gz')
]
if not model_objects:
    raise RuntimeError('No model.tar.gz found under prefix {!r}'.format(prefix))

# Job names end with a timestamp, so the largest key belongs to the latest job.
last_model_object = max(model_objects, key=lambda obj: obj.key)
last_model_uri = 's3://{}/{}'.format(last_model_object.bucket_name, last_model_object.key)
print('Last model URI:\n{}'.format(last_model_uri))

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


Output path:
s3://sagemaker-us-east-1-476375380884/

Last model URI:
s3://sagemaker-us-east-1-476375380884/tf2-object-detection-2023-03-28-06-06-51-015/output/model.tar.gz
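
The selection logic in the cell above can be separated from the S3 calls, which makes it possible to check it without AWS credentials. The following is a minimal sketch under the assumption that job folders carry a sortable timestamp suffix, as in the output above. `split_s3_uri` and `latest_model_key` are hypothetical helper names (not part of boto3 or the SageMaker SDK), and the example keys are made up.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('Not an S3 URI: {}'.format(uri))
    return parsed.netloc, parsed.path.lstrip('/')


def latest_model_key(keys, prefix='tf2-object-detection-'):
    """Return the lexicographically last key ending in model.tar.gz, or None."""
    candidates = [
        key for key in keys
        if key.startswith(prefix) and key.endswith('model.tar.gz')
    ]
    return max(candidates) if candidates else None


# Made-up keys for illustration only (hypothetical job names).
example_keys = [
    'tf2-object-detection-2023-01-01-00-00-00-000/output/model.tar.gz',
    'tf2-object-detection-2023-01-02-00-00-00-000/output/model.tar.gz',
    'tf2-object-detection-2023-01-02-00-00-00-000/debug/log.txt',
]
bucket, key = split_s3_uri('s3://example-bucket/')
print(bucket, repr(key))
print(latest_model_key(example_keys))
```

Sorting by key only gives the most recent job when every job name follows the same timestamp pattern. If that is not guaranteed, comparing the `last_modified` attribute of each object summary may be a safer choice.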


In [11]:
import sagemaker
sagemaker_session = sagemaker.Session()
aws_region = sagemaker_session.boto_region_name
print('aws_region={}'.format(aws_region))

aws_region=us-east-1


In [12]:
job_artifacts_path = estimator.latest_job_tensorboard_artifacts_path()

!echo "pip install 'tensorflow<2.4'"
!echo "pip install 'tensorflow-io<2.4'"
!echo "pip install 'tensorboard<2.4'"
!echo "AWS_REGION={aws_region}"
!echo "tensorboard --logdir={job_artifacts_path}"

pip install 'tensorflow<2.4'
pip install 'tensorflow-io<2.4'
pip install 'tensorboard<2.4'
AWS_REGION=us-east-1
tensorboard --logdir=s3://object-detection-project-jckuri/logs/tf2-object-detection-2023-03-28-06-06-51-015/tensorboard-output


In [13]:
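class URL:
    """Wrap a URL so that Jupyter renders it as a clickable link."""

    def __init__(self, url):
        self.url = url

    def _repr_html_(self):
        # Jupyter calls _repr_html_ to display the object as rich HTML.
        return '<a href="{}">{}</a>'.format(self.url, self.url)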

In [14]:
# https://<notebook instance hostname>/proxy/6006/
jupyter_notebook_url = 'https://object-detection-project-a5lc.notebook.us-east-1.sagemaker.aws'
url = '{}/proxy/6006/'.format(jupyter_notebook_url)

o = URL(url)
o
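
The proxy address above follows the pattern `https://<notebook instance hostname>/proxy/<port>/`. As a small sketch, the construction can be wrapped in a function that tolerates a trailing slash on the base URL. `tensorboard_proxy_url` is a hypothetical helper name and the hostname in the example is a placeholder.

```python
def tensorboard_proxy_url(notebook_url, port=6006):
    """Build '<notebook_url>/proxy/<port>/' without doubling slashes."""
    return '{}/proxy/{}/'.format(notebook_url.rstrip('/'), port)


# Placeholder hostname; replace it with your own notebook instance URL.
print(tensorboard_proxy_url('https://example-notebook.notebook.us-east-1.sagemaker.aws/'))
```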

## Improve on the initial model

Most likely, this initial experiment did not yield optimal results. However, you can make multiple changes to the `pipeline.config` file to improve this model. One obvious change is to improve the data augmentation strategy. The [`preprocessor.proto`](https://github.com/tensorflow/models/blob/master/research/object_detection/protos/preprocessor.proto) file lists the data augmentation methods available in the TF Object Detection API. Justify your choice of augmentations in the writeup.
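
As an illustration of what such a change might look like, the sketch below renders `data_augmentation_options` blocks in the text format used by `pipeline.config`, where they belong inside the `train_config` section. The option names are taken from `preprocessor.proto`; the parameter values are illustrative assumptions rather than recommendations, and `render_augmentations` is a hypothetical helper, not part of the API.

```python
def render_augmentations(options):
    """Render {option_name: {field: value}} as pipeline.config text blocks."""
    blocks = []
    for name, fields in options.items():
        lines = ['data_augmentation_options {', '  {} {{'.format(name)]
        for field, value in fields.items():
            lines.append('    {}: {}'.format(field, value))
        lines.extend(['  }', '}'])
        blocks.append('\n'.join(lines))
    return '\n'.join(blocks)


# Illustrative values only; tune them against your own validation metrics.
augmentations = {
    'random_horizontal_flip': {},
    'random_adjust_brightness': {'max_delta': 0.2},
    'random_adjust_contrast': {'min_delta': 0.8, 'max_delta': 1.25},
}
print(render_augmentations(augmentations))
```

The printed text can be pasted into `pipeline.config` by hand. Generating it from a dictionary mainly helps when comparing several augmentation sets across experiments.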

Keep in mind that the following are also available:
* experiment with the optimizer: type of optimizer, learning rate, scheduler, etc.
* experiment with the architecture. The TF Object Detection API model zoo offers many architectures. Keep in mind that the `pipeline.config` file is unique to each architecture, so you will have to edit it.
* visualize results on the test frames using the `2_deploy_model` notebook available in this repository.
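
When experimenting with the learning rate scheduler, it can help to tabulate the schedule before launching a training job. The sketch below is a generic cosine decay with linear warmup. Its parameter names are modeled on the `cosine_decay_learning_rate` block commonly found in `pipeline.config` files, which is an assumption here, and the API's own implementation may differ in details. All numeric values are illustrative.

```python
import math


def cosine_lr_with_warmup(step, base_lr, total_steps, warmup_lr, warmup_steps):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        # Ramp linearly from warmup_lr up to base_lr.
        slope = (base_lr - warmup_lr) / warmup_steps
        return warmup_lr + slope * step
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Illustrative settings; not the values used in the training run above.
for step in (0, 100, 200, 1100, 2000):
    lr = cosine_lr_with_warmup(
        step, base_lr=0.04, total_steps=2000, warmup_lr=0.01, warmup_steps=200
    )
    print(step, lr)
```

Printing a few points like this makes it easy to see how much of the training budget is spent at a near-zero learning rate, which is worth checking when the total number of steps is changed.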

In the cell below, write down all the approaches you experimented with, why you chose them, and what you would have done with more time and resources. Justify your choices using the TensorBoard visualizations (take screenshots and insert them in your writeup), the metrics on the evaluation set, and the animation you generated with [this tool](../2_run_inference/2_deploy_model.ipynb).

In [15]:
# your writeup goes here.

**I preferred to write all my documentation in the file <big><big>[README.md](../README.md)</big></big>.<br/>
Please open this file and read it, because it addresses every item in the rubric.**