## Welcome to the Detectron2 SageMaker Demo!

In [245]:
# Define IAM role
import boto3
import re
import sys
import os
import time
import json
import numpy as np
import pandas as pd
import sagemaker
from time import gmtime, strftime
from sagemaker import get_execution_role
from sagemaker.session import Session
from sagemaker.pytorch import estimator, PyTorchModel, PyTorchPredictor, PyTorch

# get our execution role giving us permissions to do things like launch training jobs
role = get_execution_role()
session = boto3.session.Session()
sess = sagemaker.Session() # can use LocalSession() to run container locally

bucket = 'privisaa-bucket-2' # sess.default_bucket()
region = "us-east-1"
account = sess.boto_session.client('sts').get_caller_identity()['Account']
prefix_input = 'detectron2-input'
prefix_output = 'detectron2-output'

Install SageMaker Experiments

In [7]:
!{sys.executable} -m pip install sagemaker-experiments==0.1.24

You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pytorch_latest_p36/bin/python -m pip install --upgrade pip' command.[0m


# Upload data for training

We are grabbing data from COCO, decompressing the data, and then sending it to s3. In this notebook we have two examples, one using s3 and one using EFS, if you want to utilize the EFS example, you'll need to mount your EFS volume.

In [3]:
! ./upload_coco2017_to_s3.sh {bucket} coco

Create stage directory: /home/ec2-user/SageMaker/coco-2017-2020-10-15-00-24-32
--2020-10-15 00:24:32--  http://images.cocodataset.org/zips/train2017.zip
Resolving images.cocodataset.org (images.cocodataset.org)... 52.217.44.140
Connecting to images.cocodataset.org (images.cocodataset.org)|52.217.44.140|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19336861798 (18G) [application/zip]
Saving to: ‘/home/ec2-user/SageMaker/coco-2017-2020-10-15-00-24-32/train2017.zip’


2020-10-15 00:28:32 (77.0 MB/s) - ‘/home/ec2-user/SageMaker/coco-2017-2020-10-15-00-24-32/train2017.zip’ saved [19336861798/19336861798]

Extracting /home/ec2-user/SageMaker/coco-2017-2020-10-15-00-24-32/train2017.zip
--2020-10-15 00:30:14--  http://images.cocodataset.org/zips/val2017.zip
Resolving images.cocodataset.org (images.cocodataset.org)... 54.231.33.163
Connecting to images.cocodataset.org (images.cocodataset.org)|54.231.33.163|:80... connected.
HTTP request sent, awaiting response... 200 

If using EFS, run the following:

In [None]:
!bash mount_efs.sh 

In [None]:
!bash send_data_to_efs.sh

## Push Docker image to registry

For this training, we'll extend the [Sagemaker PyTorch Container](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html) with Detectron2 dependencies (using official [D2 Dockerfile](https://github.com/facebookresearch/detectron2/blob/master/docker/Dockerfile)) as baseline. See Dockerfile below.

You are in no means limited to using our containers, for examples of jobs using outside containers see:

[SageMaker Nvidia NGC Examples ](https://github.com/aws-samples/amazon-sagemaker-nvidia-ngc-examples)

In [301]:
!pygmentize Dockerfile

[37m# Build an image of Detectron2 that can do distributing training on Amazon Sagemaker [39;49;00m

[37m# using Sagemaker PyTorch container as base image[39;49;00m
[37m# https://github.com/aws/sagemaker-pytorch-container/blob/master/docker/1.4.0/py3/Dockerfile.gpu[39;49;00m
[34mFROM[39;49;00m [33m763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-gpu-py36-cu101-ubuntu16.04[39;49;00m
[34mLABEL[39;49;00m [31mauthor[39;49;00m=[33m"vadimd@amazon.com"[39;49;00m

[37m############# Installing latest builds ############[39;49;00m

[37m# This is to fix issue: https://github.com/pytorch/vision/issues/1489[39;49;00m
[34mRUN[39;49;00m pip install --upgrade --force-reinstall [31mtorch[39;49;00m==[34m1[39;49;00m.5.1 [31mtorchvision[39;49;00m==[34m0[39;49;00m.6.1 cython
[37m# RUN pip install torchvision==0.7.0[39;49;00m

[37m############# D2 section ##############[39;49;00m

[37m# installing dependecies for D2 https://github.com/facebookresearch/de

You'll need to build your container from this Dockerfile and push it to Amazon Elastic Container Registry using the `build_and_push.sh` script. But you'll need to loging to Sagemaker ECR and your private ECR first.

In [4]:
# loging to Sagemaker ECR with Deep Learning Containers
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin 763104351884.dkr.ecr.{region}.amazonaws.com
# loging to your private ECR
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin 553020858742.dkr.ecr.{region}.amazonaws.com

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


Now you can push your D2 container to Amazon Elastic Container Registry (ECR)

In [288]:
! ./build_and_push.sh d2-sm-coco distributed

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Sending build context to Docker daemon  511.6MB
Step 1/18 : FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-gpu-py36-cu101-ubuntu16.04
 ---> 2aa16a1b866d
Step 2/18 : LABEL author="vadimd@amazon.com"
 ---> Using cache
 ---> 5e2766146cbe
Step 3/18 : RUN pip install --upgrade --force-reinstall torch==1.5.1 torchvision==0.6.1 cython
 ---> Using cache
 ---> 24840acf45f9
Step 4/18 : RUN pip install 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
 ---> Using cache
 ---> 07b33cfeee9a
Step 5/18 : RUN pip install 'git+https://github.com/facebookresearch/fvcore'
 ---> Using cache
 ---> 17df2af93165
Step 6/18 : ENV FORCE_CUDA="1"
 ---> Using cache
 ---> 3d6698291fd7
Step 7/18 : ENV TORCH_CUDA_ARCH_LIST="Volta"
 ---> Using cache
 ---> d78460d1df58
Step 8/18 : ENV FVCORE_CACHE="/tmp"
 ---> Using cache
 ---> 646f1ef9d106
Step 9/18 : ENV DETECTRON2_DATASETS="/

This is a variation on the main container designed to work with EFS instead of s3

In [271]:
! ./build_and_push.sh d2-sm-coco2 distributed Dockerfile.efs

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Sending build context to Docker daemon  511.6MB
Step 1/17 : FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04
 ---> 47cd15520b75
Step 2/17 : LABEL author="vadimd@amazon.com"
 ---> Using cache
 ---> c7249177e518
Step 3/17 : RUN pip install 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
 ---> Using cache
 ---> d83fd986a49e
Step 4/17 : RUN pip install 'git+https://github.com/facebookresearch/fvcore'
 ---> Using cache
 ---> a540b481c57c
Step 5/17 : ENV FORCE_CUDA="1"
 ---> Using cache
 ---> e6686d2a4ec6
Step 6/17 : ENV TORCH_CUDA_ARCH_LIST="Volta"
 ---> Using cache
 ---> c3d011fc9e21
Step 7/17 : ENV FVCORE_CACHE="/tmp"
 ---> Using cache
 ---> 96c48ab9df44
Step 8/17 : ENV DETECTRON2_DATASETS="/opt/ml/input/data/train"
 ---> Using cache
 ---> cd61b288f019
Step 9/17 : COPY container_training /opt/ml/code
 ---> 9cf5053f94ad
St

# Train your model

Let's define some algorithm metrics. SageMaker will scrape the logs from our training job and render them in the training job console. The metrics we are defining are pretty standard, you just need to define the regex to find them, feel free to define your own!

In [5]:
container = "d2-sm-coco" # your container name
tag = "distributed"
image = f'{account}.dkr.ecr.{region}.amazonaws.com/{container}:{tag}'

In [215]:
metric_definitions=[{
        "Name": "total_loss",
        "Regex": ".*total_loss:\s([0-9\\.]+)\s*"
    },
    {
        "Name": "loss_cls",
        "Regex": ".*loss_cls:\s([0-9\\.]+)\s*"
    },
    {
        "Name": "loss_box_reg",
        "Regex": ".*loss_box_reg:\s([0-9\\.]+)\s*"
    },
    {
        "Name": "loss_mask",
        "Regex": ".*loss_mask:\s([0-9\\.]+)\s*"
    },
    {
        "Name": "loss_rpn_cls",
        "Regex": ".*loss_rpn_cls:\s([0-9\\.]+)\s*"
    },
    {
        "Name": "loss_rpn_loc",
        "Regex": ".*loss_rpn_loc:\s([0-9\\.]+)\s*"
    }, 
    {
        "Name": "overall_training_speed",
        "Regex": ".*Overall training speed:\s([0-9\\.]+)\s*"
    },
    {
        "Name": "lr",  
        "Regex": ".*lr:\s([0-9\\.]+)\s*"
    },
    {
        "Name": "iter",  
        "Regex": ".*iter:\s([0-9\\.]+)\s*"
    }
]

metric_path = 'metric_defs'

with open(metric_path, 'w') as f:
    for met in metric_definitions:
        f.write(json.dumps(met))
        f.write('\n')
        
metric_definitions

[{'Name': 'total_loss', 'Regex': '.*total_loss:\\s([0-9\\.]+)\\s*'},
 {'Name': 'loss_cls', 'Regex': '.*loss_cls:\\s([0-9\\.]+)\\s*'},
 {'Name': 'loss_box_reg', 'Regex': '.*loss_box_reg:\\s([0-9\\.]+)\\s*'},
 {'Name': 'loss_mask', 'Regex': '.*loss_mask:\\s([0-9\\.]+)\\s*'},
 {'Name': 'loss_rpn_cls', 'Regex': '.*loss_rpn_cls:\\s([0-9\\.]+)\\s*'},
 {'Name': 'loss_rpn_loc', 'Regex': '.*loss_rpn_loc:\\s([0-9\\.]+)\\s*'},
 {'Name': 'overall_training_speed',
  'Regex': '.*Overall training speed:\\s([0-9\\.]+)\\s*'},
 {'Name': 'lr', 'Regex': '.*lr:\\s([0-9\\.]+)\\s*'},
 {'Name': 'iter', 'Regex': '.*iter:\\s([0-9\\.]+)\\s*'}]

## SageMaker Experiments

SageMaker experiments needs some setup before we can hook it into our estimators. We first are going to create our tracker and within our tracker, create an experiment.

In [296]:
# create d2 experiment

from sagemaker.analytics import ExperimentAnalytics
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

b3sess = boto3.Session()
sm = b3sess.client('sagemaker')

with Tracker.create(display_name="Preprocessing", sagemaker_boto_client=sm) as tracker:
    tracker.log_parameters({
        "normalization_mean": 0.1307,
        "normalization_std": 0.3081,
    })
    # we can log the s3 uri to the dataset we just uploaded
#     tracker.log_input(name="d2-dataset", media_type="s3/uri", value=inputs)

d2_experiment = Experiment.create(
    experiment_name=f"d2-coco-demo-{int(time.time())}", 
    description="Detectron2 training on COCO2017", 
    sagemaker_boto_client=sm)
print(d2_experiment)


Experiment(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f3f88ed37b8>,experiment_name='d2-coco-demo-1603299483',description='Detectron2 training on COCO2017',tags=None,experiment_arn='arn:aws:sagemaker:us-east-1:209419068016:experiment/d2-coco-demo-1603299483',response_metadata={'RequestId': '9f26cc0d-76ce-4af3-9220-c6b2300c174e', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '9f26cc0d-76ce-4af3-9220-c6b2300c174e', 'content-type': 'application/x-amz-json-1.1', 'content-length': '95', 'date': 'Wed, 21 Oct 2020 16:58:03 GMT'}, 'RetryAttempts': 0})


Now we will create a trial within our experiment

In [297]:
hidden_channel_trial_name_map = {}
preprocessing_trial_component = tracker.trial_component

trial_name = f"d2-demo-training-job-{int(time.time())}"
d2_trial = Trial.create(
    trial_name=trial_name, 
    experiment_name=d2_experiment.experiment_name,
    sagemaker_boto_client=sm,
)
hidden_channel_trial_name_map[0] = trial_name

# associate the proprocessing trial component with the current trial
d2_trial.add_trial_component(preprocessing_trial_component)

# Set Hyperparameters

Let's set our hyperparameters

In [299]:
# !wget https://dl.fbaipublicfiles.com/detectron2/ImageNetPretrained/MSRA/R-101.pkl .
# !aws s3 cp R-101.pkl s3://privisaa-bucket-2/models/mask_rcnn_R_101_C4_3x/R-101.pkl

hyperparameters = {"config-file":"COCO-InstanceSegmentation/mask_rcnn_R_101_C4_3x.yaml", 
                   #"local-config-file" : "config.yaml", # if you'd like to supply custom config file, please add it in container_training folder, and provide file name here
                   "resume":"True", # whether to re-use weights from pre-trained model
                   "eval-only":"False", # whether to perform only D2 model evaluation
                  # opts are D2 model configuration as defined here: https://detectron2.readthedocs.io/modules/config.html#config-references
                  # this is a way to override individual parameters in D2 configuration from Sagemaker API
                   "opts": "SOLVER.MAX_ITER 1000",
                   "spot_ckpt":''
                   }

with open('hyperparams.json', 'w') as f:
    json.dump(hyperparameters, f)

# Launch a Job via Notebook

In [333]:
# sessLocal = sagemaker.LocalSession() # can use LocalSession()
    
d2 = PyTorch(      image_name = image,
                   role=role,
                   entry_point='/home/ec2-user/SageMaker/detectron2-sagemaker/container_training/train_coco.py',
                   train_instance_count=2, 
                   train_instance_type= 'ml.p3dn.24xlarge',
#                     train_instance_type="local_gpu", # use local_gpu for quick troubleshooting
                   train_volume_size=100,
                   framework_version='1.5.1',
                   source_dir='/home/ec2-user/SageMaker/detectron2-sagemaker/',
                   output_path=f"s3://{bucket}/{prefix_output}",
                   metric_definitions = metric_definitions,
                   hyperparameters = hyperparameters, 
                   sagemaker_session=sess,
                   profiler_config=profiler_config)

# d2 = sagemaker.estimator.Estimator(image_name=image,
#                                    role=role,
#                                    train_instance_count=2, 
#                                    train_instance_type= 'ml.p3.16xlarge',
# #                                   train_instance_type="local_gpu", # use local_gpu for quick troubleshooting
#                                    train_volume_size=100,
#                                    output_path="s3://{}/{}".format(bucket, prefix_output),
#                                    metric_definitions = metric_definitions,
#                                    hyperparameters = hyperparameters, 
#                                    sagemaker_session=sess,
#                                   profiler_config=profiler_config)

d2.fit({'training':f"s3://{bucket}/train-coco"},
       job_name = "2-nodes-max-iter-1000-demo2-p3dn",
       wait=False,
              experiment_config={
            "TrialName": d2_trial.trial_name,
            "TrialComponentDisplayName": "Training",
        }) 

INFO:sagemaker:Creating training-job with name: 2-nodes-max-iter-1000-demo2-p3dn


# EFS Setup

In order to use EFS with SageMaker training we need to setup a pointer to our file system

In [298]:
# for EFS

from sagemaker.inputs import FileSystemInput

# Specify EFS ile system id.
file_system_id = 'fs----------'
print(f"EFS file-system-id: {file_system_id}")

# Specify directory path for input data on the file system. 
# You need to provide normalized and absolute path below.
file_system_directory_path = '/training' # sagemaker/input/train
print(f'EFS file-system data input path: {file_system_directory_path}')

home_dir='/home/ec2-user/SageMaker/' #os.environ['HOME']
local_efs_path = os.path.join(home_dir, 'efs', file_system_directory_path) # 'efs',
print(f"Creating log directory on EFS: {local_efs_path}")

# Specify the access mode of the mount of the directory associated with the file system. 
# Directory must be mounted  'ro'(read-only).
file_system_access_mode = 'ro'

# Specify your file system type
file_system_type = 'EFS'

train = FileSystemInput(file_system_id=file_system_id,
                                    file_system_type=file_system_type,
                                    directory_path=file_system_directory_path,
                                    file_system_access_mode=file_system_access_mode)

EFS file-system-id: fs-d120c724
EFS file-system data input path: /training
Creating log directory on EFS: /training


# Launch a job using Spot and EFS

In [239]:
train_use_spot_instances = True
train_max_run=10000
train_max_wait = 10000 if train_use_spot_instances else None

import uuid
checkpoint_suffix = str(uuid.uuid4())[:8]
checkpoint_s3_uri = 's3://{}/artifacts/d2-efs-checkpoint-{}/'.format(bucket, checkpoint_suffix) if train_use_spot_instances else None

Setup separate trial for EFS training

In [326]:
hidden_channel_trial_name_map = {}
preprocessing_trial_component = tracker.trial_component

trial_name = f"d2-demo-efs-training-job-{int(time.time())}"
d2_trial = Trial.create(
    trial_name=trial_name, 
    experiment_name=d2_experiment.experiment_name,
    sagemaker_boto_client=sm,
)
hidden_channel_trial_name_map[0] = trial_name

# associate the proprocessing trial component with the current trial
d2_trial.add_trial_component(preprocessing_trial_component)

In [328]:
hyperparameters = {"config-file":"COCO-InstanceSegmentation/mask_rcnn_R_101_C4_3x.yaml", 
                   #"local-config-file" : "config.yaml", # if you'd like to supply custom config file, please add it in container_training folder, and provide file name here
                   "resume":"True", # whether to re-use weights from pre-trained model
                   "eval-only":"False", # whether to perform only D2 model evaluation
                  # opts are D2 model configuration as defined here: https://detectron2.readthedocs.io/modules/config.html#config-references
                  # this is a way to override individual parameters in D2 configuration from Sagemaker API
                   "opts": "SOLVER.MAX_ITER 5000 SOLVER.BASE_LR 0.002 SOLVER.CHECKPOINT_PERIOD 1000",
                   "spot_ckpt":'s3://privisaa-bucket-2/artifacts/d2-efs-checkpoint-57a35863/model_final.pth'
                   }


In [330]:
# Give Amazon SageMaker Training Jobs Access to FileSystem Resources in Your Amazon VPC.

security_group_ids = ['sg-317ad11a'] # ['sg-xxxxxxxx']
subnets =  ['subnet-9e445ef9'] # ['subnet-xxxxxxx', 'subnet-xxxxxxx', 'subnet-xxxxxxx']
sagemaker_session = sagemaker.session.Session(boto_session=session)

d2_efs = PyTorch(image_name = "209419068016.dkr.ecr.us-east-1.amazonaws.com/d2-sm-coco2:distributed",
                                   role=role,
                                   entry_point='/home/ec2-user/SageMaker/detectron2-sagemaker/container_training/train_coco_efs.py',
                                   train_instance_count=2, 
                                   train_instance_type= 'ml.p3.8xlarge',
#                                   train_instance_type="local_gpu", # use local_gpu for quick troubleshooting
                                   train_volume_size=100,
                                   framework_version='1.5.0',
                                   source_dir='/home/ec2-user/SageMaker/detectron2-sagemaker',
                                   output_path="s3://{}/{}".format(bucket, prefix_output),
                                   metric_definitions = metric_definitions,
                                   hyperparameters = hyperparameters, 
                                   sagemaker_session=sess,
                                   train_use_spot_instances=train_use_spot_instances,
                                   train_max_run=train_max_run,
                                   train_max_wait=train_max_wait,
                                   checkpoint_s3_uri=checkpoint_s3_uri,
                                   profiler_config=profiler_config,
                                    subnets=subnets,
                                    security_group_ids=security_group_ids)


job_name=f'd2-efstraining-spotckpt-recovery2-{int(time.time())}'
print(f"Launching Training Job: {job_name}")
data_channels = {'train': train}


# set wait=True below if you want to print logs in cell output
d2_efs.fit(inputs=data_channels, job_name=job_name, logs="All", wait=False)



Launching Training Job: d2-efstraining-spotckpt-recovery2-1603306370


INFO:sagemaker:Creating training-job with name: d2-efstraining-spotckpt-recovery2-1603306370


# Launch via CLI

In [332]:
!{sys.executable} launch_coco_train_boto3.py run-d2-sm --help

usage: launch_coco_train_boto3.py run-d2-sm [-h] [--bucket BUCKET]
                                            [--image_name IMAGE_NAME]
                                            [--metric_path METRIC_PATH]
                                            [--job_name JOB_NAME]
                                            [--region REGION]
                                            [--prefix_input PREFIX_INPUT]
                                            [--prefix_output PREFIX_OUTPUT]
                                            [--instance_count INSTANCE_COUNT]
                                            [--data_prefix DATA_PREFIX]
                                            [--instance_type INSTANCE_TYPE]
                                            [--volume_size VOLUME_SIZE]
                                            [--use_spot] [--role ROLE]
                                            [--max_run_time MAX_RUN_TIME]
                                            [--max_wait_time MAX_WAIT_

In [313]:
!{sys.executable} launch_coco_train_boto3.py run-d2-sm --bucket privisaa-bucket-2 --job_name d2-cli-job2 --region us-east-1 --metric_path metric_defs --hyperparam_path hyperparams.json


Job launched!


In [318]:
!{sys.executable} launch_coco_train_boto3.py check-d2-sm --job_name d2-cli-job2

Job ARN:  arn:aws:sagemaker:us-east-1:209419068016:training-job/d2-cli-job2
Job Status:  InProgress
Hyperparameters:  {'config-file': 'COCO-InstanceSegmentation/mask_rcnn_R_101_C4_3x.yaml', 'eval-only': 'False', 'opts': 'SOLVER.MAX_ITER 1000', 'resume': 'True'}


## Training with Spot Instance with s3

In [32]:
train_use_spot_instances = True
train_max_run=21600
train_max_wait = 30000 if train_use_spot_instances else None

import uuid
checkpoint_suffix = str(uuid.uuid4())[:8]
checkpoint_s3_uri = 's3://{}/artifacts/d2-checkpoint-{}/'.format(bucket, checkpoint_suffix) if train_use_spot_instances else None

In [219]:
container = "d2-sm-coco-custom" # your container name
image = '{}.dkr.ecr.{}.amazonaws.com/d2-sm-coco:distributed'.format(account, region)

hyperparameters = {"config-file":"COCO-InstanceSegmentation/mask_rcnn_R_101_C4_3x.yaml", 
                   "resume":"True", # whether to re-use weights from pre-trained model
                   "eval-only":"False", # whether to perform only D2 model evaluation
                   "opts": "SOLVER.MAX_ITER 1000" #  MODEL.WEIGHTS 
                   }


d2 = sagemaker.estimator.Estimator(image,
                                   role=role,
                                   train_instance_count=2, 
                                   train_instance_type='ml.p3.2xlarge',
                                   train_volume_size=100,
                                   output_path="s3://{}/{}".format(bucket, prefix_output),
                                   metric_definitions = metric_definitions,
                                   hyperparameters = hyperparameters, 
                                   sagemaker_session=sess,
                                   train_use_spot_instances=train_use_spot_instances,
                                   train_max_run=train_max_run,
                                   train_max_wait=train_max_wait,
                                   checkpoint_s3_uri=checkpoint_s3_uri,
                                   profiler_config=profiler_config
                                  )

d2.fit({'training':f"s3://{bucket}/train-coco"},
       job_name = "2-nodes-max-iter-2000-genest-prof-spot6",
       wait=False,
              experiment_config={
            "TrialName": d2_trial.trial_name,
            "TrialComponentDisplayName": "Training",
        }) 

INFO:sagemaker:Creating training-job with name: 2-nodes-max-iter-2000-genest-prof-spot6


In [325]:
search_expression = {
    "Filters":[
        {
            "Name": "DisplayName",
            "Operator": "Equals",
            "Value": "Training",
        }
    ],
}

trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=Session(b3sess, sm), 
    experiment_name=d2_experiment.experiment_name,
    search_expression=search_expression,
    sort_by="metrics.test:accuracy.max",
    sort_order="Descending",
    metric_names=['test:accuracy'],
)

trial_component_analytics.dataframe()

Unnamed: 0,TrialComponentName,DisplayName,SourceArn,SageMaker.ImageUri,SageMaker.InstanceCount,SageMaker.InstanceType,SageMaker.VolumeSizeInGB,config-file,eval-only,opts,resume,sagemaker_container_log_level,sagemaker_enable_cloudwatch_metrics,sagemaker_job_name,sagemaker_program,sagemaker_region,sagemaker_submit_directory,spot_ckpt
0,2-nodes-max-iter-1000-ptest-newfpath-resnestno...,Training,arn:aws:sagemaker:us-east-1:209419068016:train...,209419068016.dkr.ecr.us-east-1.amazonaws.com/d...,2.0,ml.p3.16xlarge,100.0,"""COCO-InstanceSegmentation/mask_rcnn_R_101_C4_...","""False""","""SOLVER.MAX_ITER 1000""","""True""",20.0,False,"""2-nodes-max-iter-1000-ptest-newfpath-resnestn...","""/home/ec2-user/SageMaker/detectron2-sagemaker...","""us-east-1""","""s3://privisaa-bucket-2/2-nodes-max-iter-1000-...",""""""


In [294]:
sm_client.describe_training_job(TrainingJobName='d2-efs-efstraining-prevd2-spotfix-spotckptquotes-1603293318')

{'TrainingJobName': 'd2-efs-efstraining-prevd2-spotfix-spotckptquotes-1603293318',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:209419068016:training-job/d2-efs-efstraining-prevd2-spotfix-spotckptquotes-1603293318',
 'ModelArtifacts': {'S3ModelArtifacts': 's3://privisaa-bucket-2/detectron2-output/d2-efs-efstraining-prevd2-spotfix-spotckptquotes-1603293318/output/model.tar.gz'},
 'TrainingJobStatus': 'Completed',
 'SecondaryStatus': 'Completed',
 'HyperParameters': {'config-file': '"COCO-InstanceSegmentation/mask_rcnn_R_101_C4_3x.yaml"',
  'eval-only': '"False"',
  'opts': '"SOLVER.MAX_ITER 1000"',
  'resume': '"True"',
  'sagemaker_container_log_level': '20',
  'sagemaker_enable_cloudwatch_metrics': 'false',
  'sagemaker_job_name': '"d2-efs-efstraining-prevd2-spotfix-spotckptquotes-1603293318"',
  'sagemaker_program': '"/home/ec2-user/SageMaker/detectron2-sagemaker/container_training/train_coco_efs.py"',
  'sagemaker_region': '"us-east-1"',
  'sagemaker_submit_directory': '"s3://pri

# Profiler Setup

In order to use SageMaker Debugger's Profiler we need to specify a configuration we will hook into our estimator

In [8]:
# setup profiler hooks

from sagemaker.profiler import ProfilerConfig 

profiler_config = ProfilerConfig(
    profiling_interval_millis=500,
    profiling_parameters={
        "ProfilerEnabled": str(True),
        "GeneralMetricsConfig": "{\"StartStep\": \"2\", \"NumSteps\": \"2\"}"
   }
)