# Detect dark vessel fishing activity in satellite imagery on Amazon SageMaker

## Background

In this notebook, we demonstrate how to train, evaluate, and deploy a Detectron2 model that detects vessels in xView3 satellite imagery using Amazon SageMaker.

Steps:
* ~~Setup~~
* Data preparation:
    * ~~prepare labels csv dataframe~~
        * ~~merge train + val~~
        * ~~create new val, new train, and mini train for hyperparameter exploration~~
    * ~~create annotations~~ 
        * ~~training:full scene~~
        *  ~~full scene validation set (no need to do for public leaderboard)~~
    * ~~SM processing jobs:~~
        * ~~convert public and validation set to hdf5 for faster inference (it’s about 10x faster dataloading w/ hdf5 input than geotif images)~~
        * ~~[optional] chip scenes, demo on tiny.~~
    * Modify dataset dict w/ path?, check file paths and upload dataset dict to S3.
    * Modify dataset dict for scenes/files actually present on machine when `s3shardedbykey=True`
* Train
* Inference on leaderboard images & scoring
    * Batch Transform for inference
        * time speed to convert to hdf5 from tif for speedup?
    * SM processing for evaluation
* Visualize -> move this to separate notebook?, this would not be part of the pipeline.

## Setup

#### Configure docker

In [2]:
%%writefile /home/ec2-user/SageMaker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-shm-size": "200G",
    "data-root": "/home/ec2-user/SageMaker/docker"
} 

Writing /home/ec2-user/SageMaker/daemon.json


In [3]:
%%bash
sudo service docker stop
mkdir -p /home/ec2-user/SageMaker/docker
sudo rsync -aqxP /var/lib/docker/ /home/ec2-user/SageMaker/docker
sudo mv /var/lib/docker /var/lib/docker.old
sudo mv /home/ec2-user/SageMaker/daemon.json /etc/docker/
sudo service docker start

Redirecting to /bin/systemctl stop docker.service
  docker.socket
Redirecting to /bin/systemctl start docker.service


#### Imports

In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
import sys
from datetime import datetime
from pathlib import Path

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
from sagemaker.processing import ProcessingInput, ProcessingOutput, Processor

sys.path.append('tools/')
from docker_utils import build_and_push_docker_image

Define execution role, S3 bucket in account, and SageMaker session.

In [3]:
role = get_execution_role()
region = boto3.Session().region_name
bucket = 'xview3-blog-sagemaker'
sagemaker_session = sagemaker.Session(default_bucket=bucket)
account = sagemaker_session.account_id()
tags = [{'Key': 'project', 'Value': 'xview3-blog'}]

In [4]:
USE_TINY = False

## Dataset Creation with SageMaker Processing
In this section we create the following from the xView3 challenge dataset:
1. A new `train` and `valid` split, created after merging the train and validation sets provided by the challenge.
2. Detectron2-compatible dataset dicts to be used for training.


#### Merge & split data labels. 
The xView3 challenge provided detection labels for each scene in `train.csv`, `validation.csv`, and `public.csv`. 
We will merge the `train.csv` and `validation.csv` and create a new `train` and `validation` set for training. The `public` leaderboard set will remain fixed.
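
The merge-and-split idea can be sketched with the standard library alone. This is only an illustration: the row fields (`scene_id`, `detect_id`), the 80/20 ratio, and the function name `split_by_scene` are assumptions, and the split logic actually used by the processing script may differ. Splitting by scene rather than by row keeps all detections from one scene in the same subset.

```python
import random


def split_by_scene(rows, valid_fraction=0.2, seed=0):
    """Split label rows into train/valid so that no scene spans both sets."""
    scene_ids = sorted({row["scene_id"] for row in rows})
    rng = random.Random(seed)  # seeded for a reproducible split
    rng.shuffle(scene_ids)
    n_valid = max(1, int(len(scene_ids) * valid_fraction))
    valid_scenes = set(scene_ids[:n_valid])
    train = [r for r in rows if r["scene_id"] not in valid_scenes]
    valid = [r for r in rows if r["scene_id"] in valid_scenes]
    return train, valid


# Hypothetical label rows standing in for train.csv and validation.csv.
train_rows = [{"scene_id": f"scene{i}", "detect_id": f"t{i}_{j}"}
              for i in range(8) for j in range(3)]
valid_rows = [{"scene_id": f"scene{i}", "detect_id": f"v{i}_{j}"}
              for i in range(8, 10) for j in range(3)]

# Merge both label sets, then draw a fresh split over scenes.
merged = train_rows + valid_rows
new_train, new_valid = split_by_scene(merged, valid_fraction=0.2, seed=46998886)
```
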

#### Create Detectron2 Datasets
Here we create the Detectron2-compatible [dataset dicts](https://detectron2.readthedocs.io/en/latest/tutorials/datasets.html) used for training models in Detectron2. The format of the dataset is a list of dictionaries with each dict containing information for one image with at least the following fields:
- `file_name`: str
- `height`:int
- `width`: int
- `image_id`:str or int
- `annotations`: list[dict]

For more information on how to generate the dataset dict, see the [Detectron2 docs](https://detectron2.readthedocs.io/en/latest/tutorials/datasets.html#standard-dataset-dicts).
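
To make the format concrete, here is a small sketch of one record. Key names follow Detectron2's standard dataset dict, where the image path key is spelled `file_name`. The path, image size, box coordinates, and the `check_record` helper are made up for illustration.

```python
REQUIRED_KEYS = {"file_name", "height", "width", "image_id", "annotations"}


def check_record(record):
    """Return the required top-level keys that are missing from a record."""
    return sorted(REQUIRED_KEYS - record.keys())


record = {
    "file_name": "/data/scenes/example_scene.h5",  # hypothetical path
    "height": 2560,
    "width": 2560,
    "image_id": "example_scene",
    "annotations": [
        {
            "bbox": [100.0, 120.0, 130.0, 150.0],  # made-up box
            "bbox_mode": 0,  # 0 corresponds to BoxMode.XYXY_ABS in Detectron2
            "category_id": 0,
        }
    ],
}

# A dataset is simply a list of such records, one per image.
dataset_dicts = [record]
missing = [check_record(r) for r in dataset_dicts]
```
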

Our dataset dict is generated from the information provided in the label `csv` files used in the previous section. Depending on whether we train our models with inputs originating from image chips (tiles) or from the full scene, we use one of two functions in the `xview3_d2` package: `create_xview3_chipped_scene_annotations` or `create_xview3_full_scene_annotations`, respectively.

### Build base image for SM Processing tasks.
For convenience, we build a base processing container which handles package installations. We can build a subsequent image from this base container to include the code we want to run. Here is what the base processing container looks like.

For building and pushing the containers, we use the helper function `build_and_push_docker_image` in `tools/docker_utils.py`.
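
The helper itself is not shown in this notebook. Judging from the build logs, it runs `docker build` (passing `--build-arg BASE_IMAGE=...` when a base image is given) and pushes the result to ECR. The sketch below only composes such commands as lists and does not run Docker; `ecr_uri`, `compose_build_command`, and the account id are hypothetical and are not the helper's actual API.

```python
def ecr_uri(account, region, name_and_tag):
    """Build the ECR image URI for a repository name and tag."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{name_and_tag}"


def compose_build_command(name_and_tag, dockerfile, base_image=None):
    """Compose a `docker build` command, optionally passing a base image."""
    cmd = ["docker", "build", "-t", name_and_tag, "-f", dockerfile, "."]
    if base_image is not None:
        cmd += ["--build-arg", f"BASE_IMAGE={base_image}"]
    return cmd


account, region = "123456789012", "us-east-1"  # placeholder values
base_uri = ecr_uri(account, region, "xview3-processing:base")
main_uri = ecr_uri(account, region, "xview3-processing:main")

build_cmd = compose_build_command("xview3-processing:main",
                                  "docker/processing/main.Dockerfile",
                                  base_image=base_uri)
tag_cmd = ["docker", "tag", "xview3-processing:main", main_uri]
push_cmd = ["docker", "push", main_uri]
```
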

In [6]:
!pygmentize -l docker docker/processing/base.Dockerfile

FROM ubuntu:20.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update \
    && apt-get -y install python3 python3-pip vim nano git

COPY requirements_cpu.txt .
RUN pip3 install -r requirements_cpu.txt

RUN python3 -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

WORKDIR /opt/ml/code
COPY src /opt/ml/code/
RUN pip install /opt/ml/code/src

# Make sure python doesn't buffer stdout so we get logs ASAP.
ENV PYTHONUNBUFFERED=TRUE


Let's build and push the base processing container.

In [4]:
processing_base_name = 'xview3-processing:base'
base_image = build_and_push_docker_image(processing_base_name, 
                                         dockerfile='docker/processing/base.Dockerfile')

Building docker image xview3-processing:base from docker/processing/base.Dockerfile
$ docker build -t xview3-processing:base -f docker/processing/base.Dockerfile .
Sending build context to Docker daemon  736.3kB
Step 1/8 : FROM ubuntu:20.04
 ---> 20fffa419e3a
Step 2/8 : ENV DEBIAN_FRONTEND=noninteractive
 ---> Using cache
 ---> 0654e794ab1a
Step 3/8 : RUN apt-get update     && apt-get -y install python3 python3-pip vim nano git
 ---> Using cache
 ---> ccba904afcad
Step 4/8 : COPY requirements_cpu.txt .
 ---> b25e8ef26586
Step 5/8 : RUN pip3 install -r requirements_cpu.txt
 ---> Running in fe81aadb268a
Collecting h5py==3.6.0
  Downloading h5py-3.6.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (4.5 MB)
Collecting iopath==0.1.9
  Downloading iopath-0.1.9-py3-none-any.whl (27 kB)
Collecting numpy==1.23.1
  Downloading numpy-1.23.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Collecting opencv_python_headless>=4.5
  Downloading opencv_python_headless-4.6.0

Here's the main processing container, which copies the `.py` scripts in `tools/`. In each processing job that follows, we can specify which `.py` script to run as the entrypoint.

In [12]:
!pygmentize -l docker docker/processing/main.Dockerfile

ARG BASE_IMAGE
FROM ${BASE_IMAGE}
WORKDIR /opt/ml/code

COPY src /opt/ml/code/src
RUN pip install /opt/ml/code/src

COPY tools/ /opt/ml/code/


#### Build and push main processing container.

In [5]:
processing_base_image = f'{account}.dkr.ecr.{region}.amazonaws.com/xview3-processing:base'

In [13]:
processing_main_name = 'xview3-processing:main'
processing_main_image = build_and_push_docker_image(processing_main_name, 
                                                    dockerfile='docker/processing/main.Dockerfile', 
                                                    base_image=base_image)

Building docker image xview3-processing:main from docker/processing/main.Dockerfile
$ docker build -t xview3-processing:main -f docker/processing/main.Dockerfile . --build-arg BASE_IMAGE=869814743361.dkr.ecr.us-east-1.amazonaws.com/xview3-processing:base
Sending build context to Docker daemon  736.3kB
Step 1/6 : ARG BASE_IMAGE
Step 2/6 : FROM ${BASE_IMAGE}
 ---> fe12288332e8
Step 3/6 : WORKDIR /opt/ml/code
 ---> Using cache
 ---> 6476897a5667
Step 4/6 : COPY src /opt/ml/code/src
 ---> 305c10561b09
Step 5/6 : RUN pip install /opt/ml/code/src
 ---> Running in 3f46a00d489f
Processing ./src
Building wheels for collected packages: xview3-d2
  Building wheel for xview3-d2 (setup.py): started
  Building wheel for xview3-d2 (setup.py): finished with status 'done'
  Created wheel for xview3-d2: filename=xview3_d2-1.0-py3-none-any.whl size=52530 sha256=7919bf72c0be99974bfdf3dd02ea037ed65e9b238e4d2c09e4061a02dba90785
  Stored in directory: /tmp/pip-ephem-wheel-cache-0k0xse6n/wheels/4d/ae/7d/713a7

In [113]:
processing_main_image = f'{account}.dkr.ecr.{region}.amazonaws.com/xview3-processing:main'

#### Launch SageMaker Processing job for dataset preparation. 
The SageMaker Processing task will run `tools/create_xview3_dataset_dict.py`. This script creates a detectron2-compatible dataset dict for full scene imagery or chipped scenes. Optionally, this script will merge train and validation csvs and create a new split. 

Let's look at the arguments this script accepts:

In [15]:
!pygmentize -l python tools/create_xview3_dataset_dict.py

"""Create detectron2 dataset dict for xView3."""

import os
import pandas as pd
from xview3_d2.data.datasets.xview3 import (
    create_xview3_full_scene_annotations,
    create_xview3_chipped_scene_annotations,
    CHANNELS,
    create_data_split,
)
from xview3_d2.utils import save_dataset, configure_logging
from typing import Union, Tuple, List, Callable
from functools import partial
from pathlib import Path
from

#### Initialize SM Processing job. 
We only need 1 instance for this task.

In [None]:
instance_type = 'ml.t3.xlarge'
volume_size_in_gb = 30 
instance_count = 1
base_job_name = 'xview3-dataset-prep'
                      
dataset_processor = Processor(image_uri=processing_main_image,
                              role=role,
                              instance_count=instance_count,
                              base_job_name=base_job_name,
                              instance_type=instance_type, 
                              volume_size_in_gb=volume_size_in_gb, 
                              entrypoint=['python3', 'create_xview3_dataset_dict.py'],
                              sagemaker_session=sagemaker_session, 
                              tags=tags)

#### Specify inputs and run processing job. 

`tools/create_xview3_dataset_dict.py` has several defaults, which can be overridden by providing the relevant argument in the processor `arguments`. The cells below launch a processing job that creates a new data split and a dataset dict for full scenes. To create a dataset dict for chipped scenes, change `--dataset-type` to `chipped` and provide additional inputs and/or arguments such as `--shoreline-dir`.
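
How the `arguments` list reaches the script can be illustrated with `argparse`: SageMaker Processing passes the list to the container entrypoint, so the script sees it as ordinary command-line arguments. The parser below is a hypothetical stand-in. The flag names come from the cells in this notebook, but the defaults and choices are assumptions, not the script's real values.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring a few flags used in this notebook."""
    parser = argparse.ArgumentParser(description="Hypothetical dataset-dict CLI")
    parser.add_argument("--dataset-type", choices=["full", "chipped"], default="full")
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--output-dir", default="/opt/ml/processing/output/prepared/")
    parser.add_argument("--shoreline-dir", default=None)
    return parser


# The same kind of list that is passed as `arguments=` to `Processor.run`.
arguments = ["--dataset-type", "chipped",
             "--seed", "46998886",
             "--shoreline-dir", "/opt/ml/processing/input/shoreline/"]
args = build_parser().parse_args(arguments)

# Flags that are not supplied keep their defaults (here: --output-dir).
print(args.dataset_type, args.seed, args.output_dir)
```
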

In [6]:
override = False
current_timestamp = '202207250702'
SEED = 46998886

if override:
    current_timestamp = datetime.now().strftime("%Y%m%d%H%M")  # year, month, day, hour, minute


In [14]:
input_labels = ProcessingInput(source='data/labels/', 
                               destination='/opt/ml/processing/input/labels',
                              input_name='labels')
input_stats = ProcessingInput(source='data/scene-stats.csv', 
                              destination='/opt/ml/processing/input/scene-stats',
                             input_name='stats')

job_output = ProcessingOutput(source='/opt/ml/processing/output/prepared/',  
                              destination=f's3://xview3-blog/data/processing/{current_timestamp}',
                              output_name='prepared-dataset')

dataset_processor.run(inputs=[input_labels, input_stats], 
              outputs=[job_output],
              arguments=["--dataset-type", "full", 
                         "--train-labels-csv", f"{input_labels.destination}/train.csv",
                         "--valid-labels-csv", f"{input_labels.destination}/validation.csv",
                         "--tiny-labels-csv", f"{input_labels.destination}/tiny.csv",
                         "--scene-stats-csv", f"{input_stats.destination}/scene-stats.csv",
                         "--seed", str(SEED), 
                         "--output-dir", job_output.source,
              ],
              wait=True,
              logs=True)


Job Name:  xview3-dataset-prep-2022-08-01-18-02-06-937
Inputs:  [{'InputName': 'labels', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://xview3-blog/xview3-dataset-prep-2022-08-01-18-02-06-937/input/labels', 'LocalPath': '/opt/ml/processing/input/labels', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'stats', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://xview3-blog/xview3-dataset-prep-2022-08-01-18-02-06-937/input/stats/scene-stats.csv', 'LocalPath': '/opt/ml/processing/input/scene-stats', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'prepared-dataset', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://xview3-blog/data/processing/202207250702', 'LocalPath': '/opt/ml/processing/output/prepared/', 'S3UploadMode': 'EndOfJob'}}]
................................[34mINFO:Create-xview3-dataset

#### [Optional] Run processing job to create dataset dict for chipped scenes.

In [9]:
processing_image_name = f'{account}.dkr.ecr.{region}.amazonaws.com/xview3-processing:main'

In [10]:
instance_type = 'ml.t3.xlarge'
volume_size_in_gb = 30 
instance_count = 1
base_job_name = 'xview3-dataset-prep'
                      
dataset_processor = Processor(image_uri=processing_image_name,
                              role=role,
                              instance_count=instance_count,
                              base_job_name=base_job_name,
                              instance_type=instance_type, 
                              volume_size_in_gb=volume_size_in_gb, 
                              entrypoint=['python3', 'create_xview3_dataset_dict.py'],
                              sagemaker_session=sagemaker_session, 
                              tags=tags)

In [12]:
s3_destination_uri = f's3://xview3-blog/data/processing/{current_timestamp}'

input_stats = ProcessingInput(source='data/scene-stats.csv', 
                              destination='/opt/ml/processing/input/scene-stats',
                              input_name='stats')
input_label_trn = ProcessingInput(source=f'{s3_destination_uri}/labels/train.csv',
                                  destination='/opt/ml/processing/input/labels/train',
                                  input_name='trn-labels')
input_labels_tiny = ProcessingInput(source=f'{s3_destination_uri}/labels/tiny-train.csv',
                                    destination='/opt/ml/processing/input/labels/tiny',
                                    input_name='tiny-labels')
inputs_shoreline = ProcessingInput(source='s3://xview3-blog/data/shoreline/trainval/', 
                                  destination='/opt/ml/processing/input/shoreline/')

job_output = ProcessingOutput(source='/opt/ml/processing/output/prepared/',  
                              destination=s3_destination_uri,
                              output_name='prepared-dataset')

dataset_processor.run(inputs=[input_label_trn, input_labels_tiny, input_stats, inputs_shoreline], 
                      outputs=[job_output],
                      arguments=["--dataset-type", "chipped", 
                                 "--scene-stats-csv", f"{input_stats.destination}/scene-stats.csv",
                                 "--seed", str(SEED), 
                                 "--output-dir", job_output.source, 
                                 "--shoreline-dir", inputs_shoreline.destination,
                                 "--gt-labels-dir", str(Path(input_label_trn.destination).parent)],
                      wait=True,
                      logs=True)


Job Name:  xview3-dataset-prep-2022-08-04-18-23-04-146
Inputs:  [{'InputName': 'trn-labels', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://xview3-blog/data/processing/202207250702/labels/train.csv', 'LocalPath': '/opt/ml/processing/input/labels/train', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'tiny-labels', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://xview3-blog/data/processing/202207250702/labels/tiny-train.csv', 'LocalPath': '/opt/ml/processing/input/labels/tiny', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'stats', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://xview3-blog-sagemaker/xview3-dataset-prep-2022-08-04-18-23-04-146/input/stats/scene-stats.csv', 'LocalPath': '/opt/ml/processing/input/scene-stats', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'Ful

## Imagery Preparation with SageMaker Processing
We use SageMaker Processing to prepare our imagery for training. 
The imagery data will be uploaded to the SageMaker session S3 bucket under `imagery`.

### a. Save native scene imagery in file storage
For dynamically sampling from full scene imagery, we observed that we can speed up training and evaluation by a factor of about 10 when the scene imagery is stored in `hdf5` format, compared to loading the provided GeoTIFF (georeferenced Tagged Image File Format) imagery with `rasterio`. The speedup comes from faster data loading, so it also helps during inference for evaluation.

Let's kick off a SageMaker Processing job to convert the imagery to `hdf5`. This only needs to be done once.

In [61]:
instance_type = 'ml.t3.xlarge'
volume_size_in_gb = 300 
instance_count = 75
                      
s3_uri_source = 's3://xview3-blog/data/raw'
s3_uri_imagery = f'{s3_destination_uri}/imagery'

storage_processor = Processor(image_uri=processing_image_name,
                              role=role,
                              instance_count=instance_count, 
                              base_job_name='xview3-storage',
                              instance_type=instance_type, 
                              volume_size_in_gb=volume_size_in_gb,
                              entrypoint=['python3', 'store_xview3_imagery.py'],
                              sagemaker_session=sagemaker_session,
                              tags=tags,)

storage_processor.run(inputs=[ProcessingInput(source=s3_uri_source, 
                                              destination='/opt/ml/processing/input/',
                                              s3_data_distribution_type='ShardedByS3Key')], 
                      outputs=[ProcessingOutput(source='/opt/ml/processing/output/imagery/', 
                                                destination=s3_uri_imagery,
                                                output_name='imagery',
                                                s3_upload_mode="Continuous")],
                      arguments=["--store-format", "hdf5"],
                      wait=False,
                      logs=False)


Job Name:  xview3-storage-2022-07-26-16-53-55-910
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://xview3-blog/data/raw', 'LocalPath': '/opt/ml/processing/input/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'imagery', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://xview3-blog/data/processing/202207250702/imagery', 'LocalPath': '/opt/ml/processing/output/imagery/', 'S3UploadMode': 'Continuous'}}]




### b. [Optional] Image chipping 
If we decide to train with image chips, we can also use SageMaker Processing to generate image chips using the dataset dict created in the previous section.


In [13]:
s3_destination_uri = f's3://xview3-blog/data/processing/{current_timestamp}'

In [14]:
s3_uri_imagery = f'{s3_destination_uri}/imagery'
s3_uri_imagery

's3://xview3-blog/data/processing/202207250702/imagery'

In [16]:
processing_image_name = f'{account}.dkr.ecr.{region}.amazonaws.com/xview3-processing:main'

In [62]:
s3_uri_destination_base = f"{s3_uri_imagery}/chipped-scenes"
s3_uri_source_base = "s3://xview3-blog/data/raw"
s3_uri_d2_datasets = f'{s3_destination_uri}/detectron2_dataset/'


d2_dataset_fn = f"xview3-chipped_2560x2560-{'tiny' if USE_TINY else 'train'}.dataset"
num_instances = 2 if USE_TINY else 50 
s3_uri_imagery_source = f"{s3_uri_source_base}/{'tiny' if USE_TINY else 'trainval'}"
s3_uri_destination = f"{s3_uri_destination_base}/{'tiny' if USE_TINY else 'train'}"

# specify local input data for SageMaker Processing job.
input_scenes = ProcessingInput(source=s3_uri_imagery_source, 
                               destination='/opt/ml/processing/input/scenes/', 
                               s3_data_distribution_type='ShardedByS3Key')

input_d2_dataset = ProcessingInput(source=s3_uri_d2_datasets, 
                                   destination='/opt/ml/processing/input/datasets/',)
                                                
job_output = ProcessingOutput(source='/opt/ml/processing/output/', 
                              destination=s3_uri_destination, 
                              s3_upload_mode="Continuous",)

The chipping job needs a CPU instance with at least 32 GB of memory.

In [64]:
chip_processor = Processor(image_uri=processing_image_name,
                           role=role,
                           instance_count=num_instances, 
                           base_job_name=f"xview3-chip-scenes-{'tiny' if USE_TINY else 'train'}", 
                           instance_type='ml.t3.2xlarge',  # alternative: 'ml.r5.xlarge'
                           volume_size_in_gb=1024, 
                           entrypoint=['python3', 'chip_scenes_from_annotations.py'],
                           sagemaker_session=sagemaker_session, 
                           tags=tags)

chip_processor.run(inputs=[input_scenes, input_d2_dataset], 
                   outputs=[job_output],
                   arguments=['--scenes-input-dir', input_scenes.destination,
                              '--d2-dataset', f"{input_d2_dataset.destination}/{d2_dataset_fn}",],
                   wait=USE_TINY,
                   logs=USE_TINY)


Job Name:  xview3-chip-scenes-train-2022-08-04-22-26-45-367
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://xview3-blog/data/raw/trainval', 'LocalPath': '/opt/ml/processing/input/scenes/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'input-2', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://xview3-blog/data/processing/202207250702/detectron2_dataset/', 'LocalPath': '/opt/ml/processing/input/datasets/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'output-1', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://xview3-blog/data/processing/202207250702/imagery/chipped-scenes/train', 'LocalPath': '/opt/ml/processing/output/', 'S3UploadMode': 'Continuous'}}]
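
Because the scenes input uses `ShardedByS3Key` while the dataset dict is fully replicated, each instance receives the whole dataset dict but only a subset of the scenes. One way to handle this, sketched below with the standard library, is to keep only the records whose scene is present on the local disk. The `scene_id` field, the one-directory-per-scene layout, and the function name are assumptions for illustration, not the behavior of `chip_scenes_from_annotations.py`.

```python
import tempfile
from pathlib import Path


def filter_records_to_local_scenes(dataset_dicts, scenes_dir):
    """Keep only records whose scene directory exists under scenes_dir."""
    local_scenes = {p.name for p in Path(scenes_dir).iterdir() if p.is_dir()}
    return [rec for rec in dataset_dicts if rec["scene_id"] in local_scenes]


# Simulate one instance that received two of the three scenes.
with tempfile.TemporaryDirectory() as tmp:
    for scene in ("scene_a", "scene_c"):
        (Path(tmp) / scene).mkdir()
    dataset_dicts = [{"scene_id": s} for s in ("scene_a", "scene_b", "scene_c")]
    local_records = filter_records_to_local_scenes(dataset_dicts, tmp)
```
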


## Train

In [22]:
from dataclasses import dataclass

from sagemaker.inputs import TrainingInput

In [34]:
USE_CHIPPED = False
LOCAL = False 

In [24]:
base_train_dockerfile = str(Path("docker/training/base.Dockerfile").resolve())
train_dockerfile = str(Path("docker/training/main.Dockerfile").resolve())

In [75]:
!pygmentize -l docker {base_train_dockerfile}

ARG REGION
FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04
LABEL author="kachio@amazon.com"

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update \
  && apt-get -y install python3 python3-pip git python3-setuptools \
  && rm -rf /var/lib/apt/lists/*

RUN pip install -U sagemaker
RUN pip install -U --upgrade torch==1.10.0+cu102 torchvision==0.11.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html
RUN pip install -U --no-cache-dir detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu102/torch1.10/index.html
RUN pip install -U boto3==1.17.18 pandas rasterio zarr
RUN pip

#### Build Base Training Container

In [36]:
training_base_name = 'xview3-training:base'

base_image_uri = build_and_push_docker_image(training_base_name,  
                                             dockerfile=str(base_train_dockerfile),)
print(f'Base image: {base_image_uri}')

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Logged into ECR
Building docker image xview3-training:base from /home/ec2-user/SageMaker/xview3-blog/docker/training/base.Dockerfile
$ docker build -t xview3-training:base -f /home/ec2-user/SageMaker/xview3-blog/docker/training/base.Dockerfile . --build-arg BASE_IMAGE=869814743361.dkr.ecr.us-east-1.amazonaws.com/xview3-training:base
Sending build context to Docker daemon  965.6kB
Step 1/13 : FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04
1.9.1-gpu-py38-cu111-ubuntu20.04: Pulling from pytorch-training
d5fd17ec1767: Pulling fs layer
d5f48f468589: Pulling fs layer
1600774dceb6: Pulling fs layer
d97603d2ab53: Pulling fs layer
679ab295ca40: Pulling fs layer
48d246cb90c2: Pulling fs layer
bb41250418ce: Pulling fs layer
44b4c20eb03f: Pulling fs layer
c44412e812a9: Pulling fs layer
752f0917fc9a: Pulling fs layer
8b9d79ac0c89: Pulling fs layer
5f10c

#### Build Training Container

In [20]:
!pygmentize -l docker {train_dockerfile}

ARG BASE_IMAGE
FROM ${BASE_IMAGE}


WORKDIR /opt/ml/code

COPY tools/xview3_train_net.py /opt/ml/code/
COPY configs/xview3 /opt/ml/code/configs
COPY src /opt/ml/code/src
RUN pip install /opt/ml/code/src


ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
ENV SAGEMAKER_PROGRAM xview3_train_net.py


In [37]:
training_base_name = 'xview3-training:base'
base_image_uri = f'{account}.dkr.ecr.{region}.amazonaws.com/{training_base_name}'
training_main_name = 'xview3-training:train'

In [90]:
training_image_uri = build_and_push_docker_image(training_main_name, 
                                                 dockerfile=str(train_dockerfile),
                                                 base_image=base_image_uri)
print(f'Training image: {training_image_uri}')

Building docker image xview3-training:train from /home/ec2-user/SageMaker/xview3-blog/docker/training/main.Dockerfile
$ docker build -t xview3-training:train -f /home/ec2-user/SageMaker/xview3-blog/docker/training/main.Dockerfile . --build-arg BASE_IMAGE=869814743361.dkr.ecr.us-east-1.amazonaws.com/xview3-training:base
Sending build context to Docker daemon  1.013MB
Step 1/11 : ARG BASE_IMAGE
Step 2/11 : FROM ${BASE_IMAGE}
 ---> 6005678c93a5
Step 3/11 : WORKDIR /opt/ml/code
 ---> Using cache
 ---> de1eaaf98385
Step 4/11 : COPY tools/train_net.py /opt/ml/code/
 ---> Using cache
 ---> a0479f110d75
Step 5/11 : COPY configs/xview3 /opt/ml/code/configs
 ---> e9adeed1503b
Step 6/11 : COPY src /opt/ml/code/src
 ---> b0c9e24c9558
Step 7/11 : RUN python3 -m pip install /opt/ml/code/src
 ---> Running in b94b455d8649
Processing ./src
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: xview3-d2
  Building wh

In [25]:
training_image_uri = f'{account}.dkr.ecr.{region}.amazonaws.com/xview3-training:train'

In [26]:
output_dir = '/opt/ml/model/FRCNN/auto'
shoreline_dir = '/opt/ml/input/data/shoreline/'

metrics = [
    {"Name": "training:loss", "Regex": "total_loss: ([0-9\\.]+)",},
    {"Name": "training:loss_cls", "Regex": "loss_cls: ([0-9\\.]+)",},
    {"Name": "training:loss_box_reg", "Regex": "loss_box_reg: ([0-9\\.]+)",},
    {"Name": "training:loss_rpn_cls", "Regex": "loss_rpn_cls: ([0-9\\.]+)",},
    {"Name": "training:loss_rpn_loc", "Regex": "loss_rpn_loc: ([0-9\\.]+)",},
    {"Name": "training:loss_length_reg", "Regex": "loss_length_reg: ([0-9\\.]+)",},
    {"Name": "training:lr", "Regex": "lr: ([0-9\\.]+)"},
    {"Name": "training:dataloader_time", "Regex": "data_time: ([0-9\\.]+)"},
    {"Name": "training:time", "Regex": "time: ([0-9\\.]+)"},
    {"Name": "validation:aggregate", "Regex": "aggregate=([0-9\\.]+)",},
    {"Name": "validation:loc_fscore", "Regex": "loc_fscore=([0-9\\.]+)",},
    {"Name": "validation:loc_fscore_shore", "Regex": "loc_fscore_shore=([0-9\\.]+)",},
    {"Name": "validation:vessel_fscore", "Regex": "vessel_fscore=([0-9\\.]+)",},
    {"Name": "validation:fishing_fscore", "Regex": "fishing_fscore=([0-9\\.]+)",},
    {"Name": "validation:length_acc", "Regex": "length_acc=([0-9\\.]+)",},
]
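
A quick way to sanity-check metric regexes is to run them against a sample log line. The sketch below uses Python's `re` as a stand-in for however SageMaker evaluates the patterns, and the log line is made up in the style these regexes expect.

```python
import re

# Made-up log line; real training logs may be formatted differently.
log_line = ("iter: 19  total_loss: 1.234  loss_cls: 0.456  "
            "time: 0.3012  data_time: 0.0154  lr: 0.002")

patterns = {
    "training:loss": "total_loss: ([0-9\\.]+)",
    "training:time": "time: ([0-9\\.]+)",
    "training:dataloader_time": "data_time: ([0-9\\.]+)",
}

# Collect every capture for each pattern to expose overlapping matches.
matches = {name: re.findall(rx, log_line) for name, rx in patterns.items()}
for name, values in matches.items():
    print(name, values)
```

In this sample, the `time: ` pattern also matches the tail of `data_time: `, so it captures two values. Depending on how matches are collected, `training:time` could mix both quantities; a more specific pattern (for example one that requires a leading space) would be one way to separate them.
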

In [27]:
def compute_iterations_from_epochs(epochs, bs, num_annotations, max_evals, warmup_prop, num_gpus=1):
    iter_max = int(num_annotations / (num_gpus * bs) * epochs)
    eval_period = iter_max//max_evals
    iter_warmup = int(iter_max * warmup_prop)
    
    return iter_max, eval_period, iter_warmup
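
As a worked example of this arithmetic, the sketch below repeats the function and evaluates it with values that appear elsewhere in this notebook (6 epochs, batch size 12, 54360 annotations, 5 evaluations, 20% warmup). The choice of 8 GPUs is an assumption matching an `ml.p3.16xlarge`. Note that the formula treats `bs` as a per-GPU batch size, since one iteration consumes `num_gpus * bs` annotations.

```python
def compute_iterations_from_epochs(epochs, bs, num_annotations, max_evals,
                                   warmup_prop, num_gpus=1):
    # Total iterations needed to pass over the annotations `epochs` times.
    iter_max = int(num_annotations / (num_gpus * bs) * epochs)
    # Evaluate `max_evals` times, evenly spaced over training.
    eval_period = iter_max // max_evals
    # Warm up the learning rate for a fixed proportion of training.
    iter_warmup = int(iter_max * warmup_prop)
    return iter_max, eval_period, iter_warmup


# 54360 / (8 * 12) = 566.25 iterations per epoch; times 6 epochs = 3397.5 -> 3397
iter_max, eval_period, iter_warmup = compute_iterations_from_epochs(
    epochs=6, bs=12, num_annotations=54360,
    max_evals=5, warmup_prop=0.2, num_gpus=8)
print(iter_max, eval_period, iter_warmup)  # 3397 679 679
```

Passing the arguments by keyword avoids mixing up `max_evals`, `warmup_prop`, and `num_gpus`, which are easy to swap when passed positionally.
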

In [28]:
@dataclass(order=True)
class Instances:
    name: str
    num_gpus: int = 1
    instance_limit: int = 1
    num_workers: int = 4
    batch_size: int = 12
    volume: int = 2048

In [29]:
instance_members = [Instances('local_gpu', num_gpus=4),
                    Instances('ml.p3.2xlarge'), 
                    Instances('ml.p3.8xlarge', 4, 4, 16), 
                    Instances('ml.p3.16xlarge', 8, 2, 32),
                    Instances('ml.p3dn.24xlarge', 8, num_workers=48, batch_size=24, volume=1800)]

In [73]:
NUM_ANNOTS = {'tiny': 1679, 
              'train': 54360}

if USE_CHIPPED:
    NUM_ANNOTS['tiny'] = 1907
    NUM_ANNOTS['train'] = 62766

In [82]:
instance = instance_members[-2]
instance

Instances(name='ml.p3.16xlarge', num_gpus=8, instance_limit=2, num_workers=32, batch_size=12, volume=2048)

In [83]:
epochs = 6
num_annotations = NUM_ANNOTS['tiny'] if USE_TINY else NUM_ANNOTS['train']
bs = instance.batch_size
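num_gpus = 1  # set to instance.num_gpus for multi-GPU training
max_evals = 5
max_checkpoints = max_evals * 2
warmup_prop = 0.2

# Pass keyword arguments so the values line up with the function signature.
max_iter, eval_period, warmup_iter = compute_iterations_from_epochs(
    epochs, bs, num_annotations,
    max_evals=max_evals, warmup_prop=warmup_prop, num_gpus=num_gpus)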
checkpoint_period = eval_period // 2
print(max_iter, eval_period, warmup_iter, checkpoint_period)

54360 10872 10872 5436


In [84]:
# Datasets
mode = "tiny" if USE_TINY else "trainval"
imagery_s3_uri = f's3://xview3-blog/data/processing/202207250702/imagery/hdf5/{mode}/'

if USE_CHIPPED:
    imagery_s3_uri = f's3://xview3-blog/data/processing/202207250702/imagery/chipped-scenes/{mode}/xview3_chipped_2560x2560_{mode.replace("val", "")}/'
    val_imagery_s3_uri = f's3://xview3-blog/data/processing/202207250702/imagery/hdf5/{mode}/'
    s3_channel_valid_imagery = TrainingInput(val_imagery_s3_uri, 
                                             distribution='FullyReplicated', 
                                             s3_data_type='S3Prefix',
                                             input_mode='FastFile')
    
shoreline_s3_uri = 's3://xview3-blog/data/shoreline/trainval/'
datasets_s3_uri = 's3://xview3-blog/data/processing/202207250702/detectron2_dataset/'

s3_channel_imagery = TrainingInput(imagery_s3_uri, 
                                   distribution='FullyReplicated', 
                                   s3_data_type='S3Prefix',
                                   input_mode='FastFile')
s3_channel_shoreline = TrainingInput(shoreline_s3_uri, 
                                     distribution='FullyReplicated', 
                                     s3_data_type='S3Prefix', 
                                     input_mode='FastFile')
s3_channel_datasets = TrainingInput(datasets_s3_uri, 
                                    distribution='FullyReplicated', 
                                    s3_data_type='S3Prefix',
                                    input_mode='FastFile')

train_inputs = {'imagery': s3_channel_imagery, 
                'shoreline': s3_channel_shoreline, 
                'datasets': s3_channel_datasets}
if USE_CHIPPED:
    train_inputs['valid_imagery'] = s3_channel_valid_imagery

# Use EFS if local
if LOCAL:
    train_inputs['imagery'] = 'file:///home/ec2-user/SageMaker/xview3-blog/data/imagery/hdf5/tiny/'
    train_inputs['shoreline'] = 'file:///home/ec2-user/SageMaker/xview3-blog/data/shoreline/trainval/'
    train_inputs['datasets'] = 'file:///home/ec2-user/SageMaker/xview3-blog/data/detectron2_datasets/new/'
    

In [88]:
# Alternatives: 'frcnn_R101_FPN_full.yaml', 'frcnn_R101_FPN_full_VH3.yaml'
config_file = 'frcnn_X101_32x8d_FPN_full.yaml'
if USE_CHIPPED:
    config_file = 'frcnn_R101_FPN_chipped_histeq.yaml'

config_params = [f'OUTPUT_DIR {output_dir}',
                 f'TEST.INPUT.SHORELINE_DIR {shoreline_dir}',
                 f'INPUT.DATA.SHORELINE_DIR {shoreline_dir}',
                 f"SOLVER.IMS_PER_BATCH {bs}",
                 f"TEST.EVAL_PERIOD {eval_period}",
                 f"SOLVER.WARMUP_ITERS {warmup_iter}",
                 f"SOLVER.MAX_ITER {max_iter}",
                 f"SOLVER.CHECKPOINT_PERIOD {checkpoint_period}",
                 f"DATALOADER.NUM_WORKERS {instance.num_workers}",
                 "SOLVER.LR_SCHEDULER_NAME WarmupCosineLR",
                 "SOLVER.BASE_LR 0.005",
                ]

training_job_hp = {'config-file': f'/opt/ml/code/configs/{config_file}',
                   'imagery-dir': '/opt/ml/input/data/imagery',
                   'd2-dataset-dir': '/opt/ml/input/data/datasets',
                   'zopts': ' '.join(config_params)}

if USE_CHIPPED:
    training_job_hp['valid-imagery-dir'] = '/opt/ml/input/data/valid_imagery'
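
The directory hyperparameters above follow SageMaker's convention of exposing each input channel inside the training container at `/opt/ml/input/data/<channel name>`. The sketch below derives those paths from the channel names instead of hard-coding them; `channel_dir` and `SM_INPUT_DATA_ROOT` are hypothetical names, not part of the SageMaker SDK.

```python
import posixpath

# SageMaker exposes each input channel under this prefix inside the training container.
SM_INPUT_DATA_ROOT = '/opt/ml/input/data'


def channel_dir(channel_name):
    """Return the container path for a named input channel (hypothetical helper)."""
    return posixpath.join(SM_INPUT_DATA_ROOT, channel_name)


# Channel names used as keys of `train_inputs` in this notebook.
channel_names = ['imagery', 'shoreline', 'datasets', 'valid_imagery']
container_paths = {name: channel_dir(name) for name in channel_names}
print(container_paths['imagery'])
```

Deriving the paths this way keeps the keys of `train_inputs` and the directories passed as hyperparameters from drifting apart.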

In [89]:
config_params

['OUTPUT_DIR /opt/ml/output/FRCNN/auto',
 'TEST.INPUT.SHORELINE_DIR /opt/ml/input/data/shoreline/',
 'INPUT.DATA.SHORELINE_DIR /opt/ml/input/data/shoreline/',
 'SOLVER.IMS_PER_BATCH 6',
 'TEST.EVAL_PERIOD 10872',
 'SOLVER.WARMUP_ITERS 10872',
 'SOLVER.MAX_ITER 54360',
 'SOLVER.CHECKPOINT_PERIOD 5436',
 'DATALOADER.NUM_WORKERS 32',
 'SOLVER.LR_SCHEDULER_NAME WarmupCosineLR',
 'SOLVER.BASE_LR 0.005']
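
Because `zopts` is a single space-joined string, each override has to be a `KEY VALUE` pair with no spaces inside either part. The sketch below shows one way a training script could split the string back into pairs. How the container's entry point actually consumes `zopts` is not shown in this notebook, so `parse_zopts` is a hypothetical helper, and the sample values are copied from the output above.

```python
def parse_zopts(zopts):
    """Split 'KEY VALUE KEY VALUE ...' into a dict of string overrides (hypothetical helper)."""
    tokens = zopts.split()
    if len(tokens) % 2 != 0:
        raise ValueError('zopts must contain an even number of tokens')
    # Even positions are keys, odd positions are the matching values.
    return dict(zip(tokens[0::2], tokens[1::2]))


sample_params = [
    'SOLVER.IMS_PER_BATCH 6',
    'SOLVER.MAX_ITER 54360',
    'SOLVER.LR_SCHEDULER_NAME WarmupCosineLR',
]
overrides = parse_zopts(' '.join(sample_params))
print(overrides)
```

The flat token list from `zopts.split()` has the alternating key/value shape that Detectron2's `cfg.merge_from_list` accepts, which is presumably how the overrides reach the config.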

In [91]:
#base_job_name = f"xview3-{'chipped' if USE_CHIPPED else 'full'}-{'tiny' if USE_TINY else 'trainval'}"
base_job_name = f"xview3-{config_file.split('.')[0].replace('_', '-')}"

training_instance = instance.name
num_instances = 1
training_session = sagemaker_session


if training_instance.startswith("local"):
    training_session = sagemaker.LocalSession()
    training_session.config = {"local": {"local_code": True}}
    LOCAL = True

d2_estimator = Estimator(image_uri=training_image_uri,
                         role=role, 
                         sagemaker_session=training_session, 
                         instance_count=num_instances, 
                         instance_type=training_instance, 
                         volume_size=instance.volume,
                         metric_definitions=metrics, 
                         hyperparameters=training_job_hp,
                         base_job_name=base_job_name, 
                         max_retry_attempts=30, 
                         max_run=432000,
                         checkpoint_local_path=None if LOCAL else '/opt/ml/checkpoints/',
                         checkpoint_s3_uri=None if LOCAL else 's3://xview3-blog-sagemaker/checkpoints/',
                         disable_profiler=True,
                         debugger_hook_config=False,
                         tags=tags)

d2_estimator.fit(inputs=train_inputs, 
                 wait=bool(USE_TINY),  # block on the job only for the quick tiny run
                 logs="All")

`/tmp/tmp7dix_o_f/algo-1-2i620`
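
SageMaker training job names are limited to 63 characters drawn from letters, digits, and hyphens, which is why the cell above replaces the underscores in the config file name. The sketch below applies the same derivation and checks the result; `is_valid_job_name` is a hypothetical helper, and the config file name is repeated from the earlier cell so the block runs on its own.

```python
import re

# Pattern and 63-character limit documented for SageMaker training job names.
JOB_NAME_PATTERN = re.compile(r'[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}')


def is_valid_job_name(name):
    """Check a candidate name against the job-name pattern (hypothetical helper)."""
    return len(name) <= 63 and JOB_NAME_PATTERN.fullmatch(name) is not None


config_file = 'frcnn_X101_32x8d_FPN_full.yaml'
base_job_name = f"xview3-{config_file.split('.')[0].replace('_', '-')}"
print(base_job_name, is_valid_job_name(base_job_name))
```

The SDK appends a timestamp suffix to `base_job_name` when it creates the job, so a shorter prefix leaves more of the name readable in the console.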

In [107]:
len(data)

66320