# Dog Breed Classifier: HPO, Profiling, and Debugging
----

## Do not use SageMaker in dark mode with this notebook!
#### IPython HTML reporting changes output field colors and some text boxes may be white text / white bg.

This notebook builds a classifier layer on top of a CNN with PyTorch, using AWS SageMaker tools and the SageMaker Python API.

ResNet152 with pretrained weights serves as the CNN, with a drop-in classifier to be trained and tested.

First there's a brief run of SageMaker's `HyperparameterTuner`, searching for good values of learning rate, batch size, dropout rate, and number of hidden units.

Then the model is retrained using the findings from HPO, this time logging utilization and training metrics with SageMaker's profiler and debugger.

In [99]:
# Running on data science 2.0 kernel / python 3.8

!pip install -U pip
!pip install smdebug
!pip install -U awscli
!pip install -U sagemaker==2.103.0
!pip install -U boto3
!pip install torchvision


Collecting awscli
  Downloading awscli-1.25.47-py3-none-any.whl (3.9 MB)
Installing collected packages: awscli
  Attempting uninstall: awscli
    Found existing installation: awscli 1.24.2
    Uninstalling awscli-1.24.2:
      Successfully uninstalled awscli-1.24.2
Successfully installed awscli-1.25.47

In [3]:
!pip list

Package                              Version
------------------------------------ --------------------
alabaster                            0.7.12
anaconda-client                      1.9.0
anaconda-project                     0.10.1
anyio                                3.6.1
appdirs                              1.4.4
argh                                 0.26.2
argon2-cffi                          20.1.0
arrow                                0.13.1
asn1crypto                           1.4.0
astroid                              2.6.6
astropy                              4.3.1
asttokens                            2.0.5
async-generator                      1.10
atomicwrites                         1.4.0
attrs                                20.3.0
autopep8                             1.5.7
autovizwidget                        0.20.0
awscli                               1.24.2
Babel                                2.9.1
backcall                             0.2.0
backports.shutil-get-terminal-

In [3]:
# SageMaker SDK, Boto3, the PyTorch estimator, and the tuner parameter types
import sagemaker
import boto3
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

In [6]:
sagemaker.pytorch.PyTorch.__dict__

mappingproxy({'__module__': 'sagemaker.pytorch.estimator',
              '__doc__': 'Handle end-to-end training and deployment of custom PyTorch code.',
              '_framework_name': 'pytorch',
              'LAUNCH_PYTORCH_DDP_ENV_NAME': 'sagemaker_pytorch_ddp_enabled',
              'INSTANCE_TYPE_ENV_NAME': 'sagemaker_instance_type',
              '__init__': <function sagemaker.pytorch.estimator.PyTorch.__init__(self, entry_point: Union[str, sagemaker.workflow.entities.PipelineVariable], framework_version=None, py_version=None, source_dir: Union[str, sagemaker.workflow.entities.PipelineVariable, NoneType] = None, hyperparameters=None, image_uri=None, distribution=None, **kwargs)>,
              '_pytorch_distribution_configuration': <function sagemaker.pytorch.estimator.PyTorch._pytorch_distribution_configuration(self, distribution)>,
              'hyperparameters': <function sagemaker.pytorch.estimator.PyTorch.hyperparameters(self)>,
              'create_model': <function sag

## Dataset

This notebook uses the dog breed classification dataset offered by Udacity. The dataset includes ~8300 images across 133 labeled breeds. Though labeled, a number of the pictures include humans, puppies, multiple dogs, etc.

The images come in a variety of resolutions. They are scaled down to 224 × 224 pixels for training and inference.
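To make the scaling step concrete, here is a small pure-Python sketch of the geometry behind the `Resize(256)` and `CenterCrop(224)` pair used in the inference cell later in this notebook. The helper name `resize_then_center_crop` is hypothetical, and the rounding rules reflect my understanding of torchvision's behavior, so treat it as an approximation rather than the library's exact code.

```python
def resize_then_center_crop(width, height, resize_to=256, crop=224):
    """Return the resized (width, height) and the crop box (left, top, right, bottom).

    Assumption: the shorter side is scaled to `resize_to` with the aspect ratio
    preserved, then a `crop` x `crop` window is taken from the center.
    """
    if width <= height:
        new_w, new_h = resize_to, int(resize_to * height / width)
    else:
        new_w, new_h = int(resize_to * width / height), resize_to

    left = int(round((new_w - crop) / 2.0))
    top = int(round((new_h - crop) / 2.0))
    return (new_w, new_h), (left, top, left + crop, top + crop)


# A landscape photo: the short side (480) becomes 256, then the center is cropped.
resized, box = resize_then_center_crop(640, 480)
print(resized, box)
```

Whatever the input resolution, the crop box is always 224 pixels wide and 224 pixels tall, which is the fixed input size the network sees.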

In [133]:
# Command to download and unzip data
# !wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
# !unzip dogImages.zip

dogImagesFolder = 's3://sagemaker-us-east-1-155306683617/dogImages'

In [22]:
hyperparameter_ranges = None
estimator = None
tuner = None
# print(datapaths)

## Hyperparameter Tuning

The tuner searches over learning rate, batch size, hidden units, and dropout rate.

Since I'll probably be sticking to a small number of epochs for this project, I doubt dropout will be necessary or helpful.

In a real-world scenario I might widen the search ranges here and use faster hardware.

In [23]:
hyperparameter_ranges = {
    'learning-rate': ContinuousParameter(0.0001, 0.02),
    'batch-size': CategoricalParameter([32, 64, 128]),
    'hidden-units': CategoricalParameter([128, 256]),
    'dropout': ContinuousParameter(0.0, 0.5)
}

objective_metric_name = 'test set loss'
objective_type = 'Minimize'
metric_definitions = [{'Name': 'test set loss', 'Regex': 'Test set: Average loss: ([0-9.]+)'}]
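As a quick sanity check on the metric definition, the sketch below applies the same regex to a log line with Python's `re` module. The log line is a made-up example of what `hpo.py` might print (the real script's exact output is not shown here); the tuner reads the objective value from the first capture group.

```python
import re

# Same pattern as in metric_definitions above.
pattern = re.compile(r'Test set: Average loss: ([0-9.]+)')

# Hypothetical log line; the real training script may format it differently.
log_line = 'Test set: Average loss: 0.4321, Accuracy: 812/836 (97%)'

match = pattern.search(log_line)
loss = float(match.group(1)) if match else None
print(loss)
```

If the training script's print statement drifts from this pattern, the regex stops matching and the tuner has no objective value to compare, so it is worth testing the two together.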

In [24]:
sagemaker_role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

estimator = PyTorch(
    entry_point='hpo.py',
    role=sagemaker_role,
    py_version='py38',
    framework_version='1.11',
    instance_count=1,
    instance_type='ml.g4dn.xlarge'   # cheapest GPU ... 4 vcpu / 16GB RAM
    # instance_type='ml.m5.large' # crazy slow
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=10,
    max_parallel_jobs=2,
    objective_type=objective_type,
)

In [25]:
datapaths = {
    'train': dogImagesFolder + '/train',       
    'test': dogImagesFolder + '/test',      
    'valid': dogImagesFolder + '/valid'
}

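# tuner.fit(datapaths)
# attach() is a classmethod that returns a new tuner, so keep the result
tuner = HyperparameterTuner.attach('pytorch-training-220807-1319')
tuner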

<sagemaker.tuner.HyperparameterTuner at 0x7f5b47900850>

In [30]:
best_estimator = tuner.best_estimator()
# best_training_job = tuner.best_training_job()

#Get the hyperparameters of the best trained model
best_hyperparams = best_estimator.hyperparameters()

# print(best_estimator)
# print(best_training_job)
print(best_hyperparams)

{'_tuning_objective_metric': '"test set loss"', 'batch-size': '"64"', 'dropout': '0.021019058153043346', 'hidden-units': '"256"', 'learning-rate': '0.0009069108358995359', 'sagemaker_container_log_level': '20', 'sagemaker_estimator_class_name': '"PyTorch"', 'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"', 'sagemaker_job_name': '"pytorch-training-2022-08-07-13-19-27-741"', 'sagemaker_program': '"hpo.py"', 'sagemaker_region': '"us-east-1"', 'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-155306683617/pytorch-training-2022-08-07-13-19-27-741/source/sourcedir.tar.gz"'}
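The values in that dictionary are JSON-encoded strings (note the nested quotes around `"64"`), so they need decoding before they can be reused. The following is a minimal sketch of one way to do that; `decode_hyperparameters` is a hypothetical helper, not part of the SageMaker SDK, and the sample input is an abbreviated copy of the output above.

```python
import json

raw = {
    '_tuning_objective_metric': '"test set loss"',
    'batch-size': '"64"',
    'dropout': '0.021019058153043346',
    'hidden-units': '"256"',
    'learning-rate': '0.0009069108358995359',
    'sagemaker_program': '"hpo.py"',
}


def decode_hyperparameters(raw):
    """Drop SageMaker bookkeeping keys and decode the remaining values."""
    decoded = {}
    for name, value in raw.items():
        if name.startswith(('sagemaker_', '_tuning')):
            continue
        parsed = json.loads(value)
        if isinstance(parsed, str):
            # Categorical values arrive as quoted strings, e.g. '"64"'.
            try:
                parsed = int(parsed)
            except ValueError:
                pass  # leave non-numeric strings untouched
        decoded[name] = parsed
    return decoded


best = decode_hyperparameters(raw)
print(best)
```

The decoded values are what the retraining cell below rounds by hand (learning rate to 0.001, dropout to 0.0).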


## Model Profiling and Debugging

Using the best hyperparameters, create and fine-tune a new model.

In [16]:
import pip
import sys

def import_or_install(package):
    try:
        __import__(package)
    except ImportError:
        !{sys.executable} -m pip install {package}
        
required_packages=['smdebug', 'pytest']

for package in required_packages:
    import_or_install(package)

In [187]:
sagemaker_role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

debug_output_path = 's3://' + bucket + '/smdebug'
print(debug_output_path)

s3://sagemaker-us-east-1-155306683617/smdebug


In [193]:
# Set up debugging and profiling rules and hooks

from sagemaker.debugger import Rule, ProfilerRule, rule_configs, DebuggerHookConfig
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport())
]

hook_config = DebuggerHookConfig(
    hook_parameters={
        'train.save_interval': '100',
        'eval.save_interval': '10'
    }
)

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(num_steps=10)
)

hyperparameters = {
    'epochs': 6,
    'hidden-units': 256,
    'learning-rate': 0.001, # rounded up from 0.0009
    'batch-size': 64,
    'dropout': 0.0 # might as well just round this to nothing
}


In [194]:
dogImagesFolder = 's3://sagemaker-us-east-1-155306683617/dogImages'

datapaths = {
    'train': dogImagesFolder + '/train',       
    'test': dogImagesFolder + '/test',      
    'valid': dogImagesFolder + '/valid'
}

print(datapaths)

{'train': 's3://sagemaker-us-east-1-155306683617/dogImages/train', 'test': 's3://sagemaker-us-east-1-155306683617/dogImages/test', 'valid': 's3://sagemaker-us-east-1-155306683617/dogImages/valid'}


In [195]:
# Create and fit an estimator

estimator = PyTorch(
    entry_point='train_model.py',
    base_job_name='sm-dbc-pytorch',
    role=sagemaker_role,
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
    hyperparameters=hyperparameters,
    py_version='py38',
    framework_version='1.11',
    profiler_config=profiler_config,
    rules=rules,
    debugger_hook_config=hook_config,
)

estimator.fit(datapaths, wait=False)

In [198]:
import pprint

job_name = estimator.latest_training_job.name
client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=estimator.latest_training_job.name)

print(job_name)
print(client)
pprint.pprint(description)

sm-dbc-pytorch-2022-08-09-13-58-06-101
<botocore.client.SageMaker object at 0x7f92ee2a0ca0>
{'AlgorithmSpecification': {'EnableSageMakerMetricsTimeSeries': True,
                            'TrainingImage': '763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.11-gpu-py38',
                            'TrainingInputMode': 'File'},
 'BillableTimeInSeconds': 1448,
 'CreationTime': datetime.datetime(2022, 8, 9, 13, 58, 6, 559000, tzinfo=tzlocal()),
 'DebugHookConfig': {'CollectionConfigurations': [{'CollectionName': 'losses',
                                                   'CollectionParameters': {'save_interval': '500'}},
                                                  {'CollectionName': 'relu_input',
                                                   'CollectionParameters': {'include_regex': '.*relu_input',
                                                                            'save_interval': '500'}},
                                                  {'CollectionNa

In [199]:
"""
from https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-debugger/pytorch_model_debugging/pytorch_script_change_smdebug.ipynb
"""

import time
from IPython import display

%matplotlib inline

# while description["SecondaryStatus"] not in {"Stopped", "Completed"}:
description = client.describe_training_job(TrainingJobName=job_name)
primary_status = description["TrainingJobStatus"]
secondary_status = description["SecondaryStatus"]
print("====================================================================")
print("TrainingJobStatus: ", primary_status, " | SecondaryStatus: ", secondary_status)
print("====================================================================")
# Fetch the rule summary once instead of on every loop iteration
rule_summary = estimator.latest_training_job.rule_job_summary()
for rule in rule_summary:
    print(rule["RuleConfigurationName"], ": ", rule["RuleEvaluationStatus"])
    if rule["RuleEvaluationStatus"] == "IssuesFound":
        print(rule["StatusDetails"])
    print("====================================================================")
print("Current time: ", time.asctime())
#     display.clear_output(wait=True)
#     time.sleep(100)

TrainingJobStatus:  Completed  | SecondaryStatus:  Completed
LossNotDecreasing :  NoIssuesFound
VanishingGradient :  NoIssuesFound
Overfit :  NoIssuesFound
Overtraining :  NoIssuesFound
PoorWeightInitialization :  NoIssuesFound
LowGPUUtilization :  IssuesFound
RuleEvaluationConditionMet: Evaluation of the rule LowGPUUtilization at step 9 resulted in the condition being met

ProfilerReport :  IssuesFound
RuleEvaluationConditionMet: Evaluation of the rule ProfilerReport at step 24 resulted in the condition being met

Current time:  Tue Aug  9 14:30:17 2022
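When there are many rules, it can help to pull out only the ones that fired. The sketch below does that on a hand-written sample shaped like the `rule_job_summary()` entries used above (only the three keys this notebook reads are assumed); `rules_with_issues` is a hypothetical helper name.

```python
def rules_with_issues(rule_summary):
    """Map each rule that reported IssuesFound to its status details."""
    return {
        rule["RuleConfigurationName"]: rule.get("StatusDetails", "")
        for rule in rule_summary
        if rule["RuleEvaluationStatus"] == "IssuesFound"
    }


# Sample data mirroring the statuses printed above (details shortened).
sample_summary = [
    {"RuleConfigurationName": "LossNotDecreasing", "RuleEvaluationStatus": "NoIssuesFound"},
    {"RuleConfigurationName": "Overfit", "RuleEvaluationStatus": "NoIssuesFound"},
    {"RuleConfigurationName": "LowGPUUtilization", "RuleEvaluationStatus": "IssuesFound",
     "StatusDetails": "condition met at step 9"},
    {"RuleConfigurationName": "ProfilerReport", "RuleEvaluationStatus": "IssuesFound",
     "StatusDetails": "condition met at step 24"},
]

flagged = rules_with_issues(sample_summary)
for name, details in flagged.items():
    print(name, "->", details)
```

With a live job, the same function could be called on `estimator.latest_training_job.rule_job_summary()`.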


In [200]:
"""
https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-debugger/pytorch_model_debugging/pytorch_script_change_smdebug.ipynb
"""

def _get_rule_job_name(training_job_name, rule_configuration_name, rule_job_arn):
    '''Helper function to get the rule job name with correct casing'''
    return '{}-{}-{}'.format(
        training_job_name[:26], rule_configuration_name[:26], rule_job_arn[-8:]
    )


def _get_cw_url_for_rule_job(rule_job_name, region):
    return 'https://{}.console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix'.format(
        region, region, rule_job_name
    )


def get_rule_jobs_cw_urls(estimator):
    region = boto3.Session().region_name
    training_job = estimator.latest_training_job
    training_job_name = training_job.describe()['TrainingJobName']
    rule_eval_statuses = training_job.describe()['DebugRuleEvaluationStatuses']

    result = {}
    for status in rule_eval_statuses:
        if status.get('RuleEvaluationJobArn', None) is not None:
            rule_job_name = _get_rule_job_name(
                training_job_name, status['RuleConfigurationName'], status['RuleEvaluationJobArn']
            )
            result[status['RuleConfigurationName']] = _get_cw_url_for_rule_job(
                rule_job_name, region
            )
    return result


get_rule_jobs_cw_urls(estimator)


{'LossNotDecreasing': 'https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logStream:group=/aws/sagemaker/ProcessingJobs;prefix=sm-dbc-pytorch-2022-08-09--LossNotDecreasing-244dbc3e;streamFilter=typeLogStreamPrefix',
 'VanishingGradient': 'https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logStream:group=/aws/sagemaker/ProcessingJobs;prefix=sm-dbc-pytorch-2022-08-09--VanishingGradient-28a9557d;streamFilter=typeLogStreamPrefix',
 'Overfit': 'https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logStream:group=/aws/sagemaker/ProcessingJobs;prefix=sm-dbc-pytorch-2022-08-09--Overfit-fb3aca78;streamFilter=typeLogStreamPrefix',
 'Overtraining': 'https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logStream:group=/aws/sagemaker/ProcessingJobs;prefix=sm-dbc-pytorch-2022-08-09--Overtraining-4040c1ff;streamFilter=typeLogStreamPrefix',
 'PoorWeightInitialization': 'https://us-east-1.console.aws.amazon.com/clo
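The doubled hyphen in those log-stream prefixes comes from the truncation in `_get_rule_job_name`: the training job name is cut to 26 characters, which happens to end on a hyphen. Here is a standalone sketch of that slicing. The ARN is a made-up placeholder; only its last eight characters matter, and they are chosen to match the first URL above.

```python
def rule_job_name(training_job_name, rule_configuration_name, rule_job_arn):
    """Same slicing as _get_rule_job_name above, repeated so this runs on its own."""
    return "{}-{}-{}".format(
        training_job_name[:26], rule_configuration_name[:26], rule_job_arn[-8:]
    )


# Hypothetical ARN for illustration only.
example_arn = "arn:aws:sagemaker:us-east-1:000000000000:processing-job/example-244dbc3e"

name = rule_job_name(
    "sm-dbc-pytorch-2022-08-09-13-58-06-101", "LossNotDecreasing", example_arn
)
print(name)
```

A longer `base_job_name` would push the date out of the 26-character window entirely, which is worth keeping in mind when searching CloudWatch for rule logs.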

In [201]:
import sagemaker
import boto3
import pprint
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role
from sagemaker.debugger import Rule, ProfilerRule, rule_configs, DebuggerHookConfig
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
from smdebug.trials import create_trial
from smdebug.core.modes import ModeKeys
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

session = boto3.session.Session()
sagemaker_role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()
sagemaker_bucket = sagemaker_session.default_bucket()
region = session.region_name

# job_name = 'smdebugger-doggers-pytorch-2022-08-08-03-00-41-895'
# estimator = PyTorch.attach(job_name)

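# `trial` must exist before it is used; load it from the job's debugger artifacts
trial = create_trial(estimator.latest_job_debugger_artifacts_path())
print(trial.tensor_names())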

['CrossEntropyLoss_output_0', 'gradient/MyClassifier_fc1.bias', 'gradient/MyClassifier_fc1.weight', 'gradient/MyClassifier_fc2.bias', 'gradient/MyClassifier_fc2.weight', 'gradient/MyClassifier_fc3.bias', 'gradient/MyClassifier_fc3.weight']


In [202]:
# pprint.pprint(estimator.__dict__)

trial = create_trial(estimator.latest_job_debugger_artifacts_path())

print(trial.tensor_names())
print(len(trial.tensor('CrossEntropyLoss_output_0').steps(mode=ModeKeys.TRAIN)))
print(len(trial.tensor('CrossEntropyLoss_output_0').steps(mode=ModeKeys.EVAL)))

training_job_name = estimator.latest_training_job.name
training_job = TrainingJob(training_job_name, region)
training_job.wait_for_sys_profiling_data_to_be_available()

system_metrics_reader = training_job.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + '/rule-output'
print(f'You will find the profiler report in {rule_output_path}')

view_timeline_charts = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader=None,
    select_dimensions=['CPU', 'GPU'],
    select_events=['total'],
)

[2022-08-09 14:30:35.877 sagemaker-data-scienc-ml-t3-medium-5812005f1de07e20cb211bb2dcf1:17 INFO s3_trial.py:42] Loading trial debug-output at path s3://sagemaker-us-east-1-155306683617/sm-dbc-pytorch-2022-08-09-13-58-06-101/debug-output
['CrossEntropyLoss_output_0', 'gradient/MyClassifier_fc1.bias', 'gradient/MyClassifier_fc1.weight', 'gradient/MyClassifier_fc2.bias', 'gradient/MyClassifier_fc2.weight', 'gradient/MyClassifier_fc3.bias', 'gradient/MyClassifier_fc3.weight']
2
4
ProfilerConfig:{'S3OutputPath': 's3://sagemaker-us-east-1-155306683617/', 'ProfilingIntervalInMilliseconds': 500, 'ProfilingParameters': {'DataloaderProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "MetricsRegex": ".*", }', 'DetailedProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'FileOpenFailThreshold': '50', 'HorovodProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'LocalPath': '/opt/ml/output/profiler', 'PythonProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "ProfilerName": "cprofile", "cPr

In [203]:
! aws s3 ls {rule_output_path} --recursive
! aws s3 cp {rule_output_path} ./ --recursive

import os

# get the autogenerated folder name of profiler report
profiler_report_name = [
    rule["RuleConfigurationName"]
    for rule in estimator.latest_training_job.rule_job_summary()
    if "Profiler" in rule["RuleConfigurationName"]
][0]




2022-08-09 14:23:30     406853 sm-dbc-pytorch-2022-08-09-13-58-06-101/rule-output/ProfilerReport/profiler-output/profiler-report.html
2022-08-09 14:23:29     260561 sm-dbc-pytorch-2022-08-09-13-58-06-101/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb
2022-08-09 14:23:25        192 sm-dbc-pytorch-2022-08-09-13-58-06-101/rule-output/ProfilerReport/profiler-output/profiler-reports/BatchSize.json
2022-08-09 14:23:25      18517 sm-dbc-pytorch-2022-08-09-13-58-06-101/rule-output/ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json
2022-08-09 14:23:25       2148 sm-dbc-pytorch-2022-08-09-13-58-06-101/rule-output/ProfilerReport/profiler-output/profiler-reports/Dataloader.json
2022-08-09 14:23:25        330 sm-dbc-pytorch-2022-08-09-13-58-06-101/rule-output/ProfilerReport/profiler-output/profiler-reports/GPUMemoryIncrease.json
2022-08-09 14:23:25       3659 sm-dbc-pytorch-2022-08-09-13-58-06-101/rule-output/ProfilerReport/profiler-output/profiler-reports/IOBottle

In [204]:
import IPython

IPython.display.HTML(filename=profiler_report_name + "/profiler-output/profiler-report.html")

| Rule | Description | Recommendation | Number of times rule triggered | Number of datapoints | Rule parameters |
|---|---|---|---|---|---|
| GPUMemoryIncrease | Measures the average GPU memory footprint and triggers if there is a large increase. | Choose a larger instance type with more memory if footprint is close to maximum available memory. | 267 | 2740 | increase:5, patience:1000, window:10 |
| LowGPUUtilization | Checks if the GPU utilization is low or fluctuating. This can happen due to bottlenecks, blocking calls for synchronizations, or a small batch size. | Check if there are bottlenecks, minimize blocking calls, change distributed training strategy, or increase the batch size. | 15 | 2740 | threshold_p95:70, threshold_p5:10, window:500, patience:1000 |
| StepOutlier | Detects outliers in step duration. The step duration for forward and backward pass should be roughly the same throughout the training. If there are significant outliers, it may indicate a system stall or bottleneck issues. | Check if there are any bottlenecks (CPU, I/O) correlated to the step outliers. | 14 | 2402 | threshold:3, mode:None, n_outliers:10, stddev:3 |
| MaxInitializationTime | Checks if the time spent on initialization exceeds a threshold percent of the total training time. The rule waits until the first step of training loop starts. The initialization can take longer if downloading the entire dataset from Amazon S3 in File mode. The default threshold is 20 minutes. | Initialization takes too long. If using File mode, consider switching to Pipe mode in case you are using TensorFlow framework. | 0 | 2402 | threshold:20 |
| IOBottleneck | Checks if the data I/O wait time is high and the GPU utilization is low. It might indicate IO bottlenecks where GPU is waiting for data to arrive from storage. The rule evaluates the I/O and GPU utilization rates and triggers the issue if the time spent on the IO bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. | Pre-fetch data or choose different file formats, such as binary formats that improve I/O performance. | 0 | 3669 | threshold:50, io_threshold:50, gpu_threshold:10, patience:1000 |
| BatchSize | Checks if GPUs are underutilized because the batch size is too small. To detect this problem, the rule analyzes the average GPU memory footprint, the CPU and the GPU utilization. | The batch size is too small, and GPUs are underutilized. Consider running on a smaller instance type or increasing the batch size. | 0 | 2716 | cpu_threshold_p95:70, gpu_threshold_p95:70, gpu_memory_threshold_p95:70, patience:1000, window:500 |
| Dataloader | Checks how many data loaders are running in parallel and whether the total number is equal the number of available CPU cores. The rule triggers if number is much smaller or larger than the number of available cores. If too small, it might lead to low GPU utilization. If too large, it might impact other compute intensive operations on CPU. | Change the number of data loader processes. | 0 | 644 | min_threshold:70, max_threshold:200 |
| LoadBalancing | Detects workload balancing issues across GPUs. Workload imbalance can occur in training jobs with data parallelism. The gradients are accumulated on a primary GPU, and this GPU might be overused with regard to other GPUs, resulting in reducing the efficiency of data parallelization. | Choose a different distributed training strategy or a different distributed training framework. | 0 | 2740 | threshold:0.2, patience:1000 |
| CPUBottleneck | Checks if the CPU utilization is high and the GPU utilization is low. It might indicate CPU bottlenecks, where the GPUs are waiting for data to arrive from the CPUs. The rule evaluates the CPU and GPU utilization rates, and triggers the issue if the time spent on the CPU bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. | Consider increasing the number of data loaders or applying data pre-fetching. | 0 | 3669 | threshold:50, cpu_threshold:90, gpu_threshold:10, patience:1000 |

| Metric | mean | max | p99 | p95 | p50 | min |
|---|---|---|---|---|---|---|
| Step Durations in [s] | 0.0 | 0.04 | 0.01 | 0.01 | 0.0 | 0.0 |


## Anomalous Behavior

I've done quite a few revisions trying to get the predictor to the point where it's working. Right now I'm fairly happy with the performance data. It's a simple model.

In my first runs **the GPU was heavily underutilized**. This is not visible in the report above because it was fixed before the later runs.

The cause was the dataloader feeding the GPU one batch at a time: it was not configured with `num_workers`. I increased the number of dataloader workers from 1 to 4 to match the CPU cores, and the model now trains much faster.

I've had some bad experiences crashing PyTorch when increasing the dataloader worker count on my home workstation, but it seems to work more reliably in this environment.
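For reference, here is a minimal sketch of how the worker count could be chosen so it never exceeds the cores on the instance. `loader_kwargs` is a hypothetical helper and not taken from `train_model.py`; the dictionary it returns uses standard `torch.utils.data.DataLoader` keyword names, and the usage line is left as a comment so the snippet runs without PyTorch installed.

```python
import os


def loader_kwargs(batch_size=64, max_workers=4):
    """Build DataLoader keyword arguments with a worker count capped by CPU cores."""
    cpu_count = os.cpu_count() or 1  # os.cpu_count() can return None
    return {
        "batch_size": batch_size,
        "shuffle": True,
        "num_workers": min(max_workers, cpu_count),
    }


kwargs = loader_kwargs()
print(kwargs)

# Assumed usage inside the training script (requires torch):
#   train_loader = torch.utils.data.DataLoader(train_dataset, **kwargs)
```

On the 4-vCPU `ml.g4dn.xlarge` used here this resolves to 4 workers, while a smaller machine falls back to however many cores it has.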

## Model Deployment

In [274]:
from sagemaker.pytorch.model import PyTorchModel

pytorch_model = PyTorchModel(
    model_data='s3://sagemaker-us-east-1-155306683617/sm-dbc-pytorch-2022-08-09-13-58-06-101/output/model.tar.gz',
    entry_point='infer.py',
    role=sagemaker_role,
    py_version='py38',
    framework_version='1.11')

predictor = pytorch_model.deploy(instance_type='ml.g4dn.xlarge', initial_instance_count=1)

----------!

In [275]:
from PIL import Image
import torch
import torchvision.transforms as T
import numpy as np
import s3fs
import os

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224)
])

process_tensor = T.Compose([
    T.ToTensor(),
    T.Normalize(
        [0.485, 0.456, 0.406],
        [0.229, 0.224, 0.225]),
    
])

fs = s3fs.S3FileSystem()

dogImagesFolder = 's3://sagemaker-us-east-1-155306683617/dogImages/test'
test_file = '014.Basenji/Basenji_00955.jpg'

def get_img(test_file):
    with fs.open(os.path.join(dogImagesFolder, test_file)) as f:
        img = preprocess(Image.open(f))
        img_np = np.transpose(img, (2, 0, 1))
        # img_t = process_tensor(img)
        # img = img_t.numpy()
        return img

test_images = ['014.Basenji/Basenji_00955.jpg', '103.Mastiff/Mastiff_06825.jpg',
               '116.Parson_russell_terrier/Parson_russell_terrier_07549.jpg',
               '123.Pomeranian/Pomeranian_07873.jpg']

targets = [14, 103, 116, 123]

# One request per test image; `response` is what the next cell iterates over
response = [predictor.predict(get_img(x)) for x in test_images]

In [327]:
import torch
import torch.nn.functional as F
import numpy as np

print('PREDICTED\n-----\n')

for i, pic in enumerate(response):
    print(test_images[i])
    pt = torch.from_numpy(pic[0])
    probs = F.softmax(pt, dim=0)
    # print(pt)
    # print(probs)
    top_p, top_class = probs.topk(1)
    print(f'CLASS {int(top_class + 1)} : confidence {100*float(top_p):.2f}%\n')


PREDICTED
-----

014.Basenji/Basenji_00955.jpg
CLASS 14 : confidence 97.33%

103.Mastiff/Mastiff_06825.jpg
CLASS 103 : confidence 89.66%

116.Parson_russell_terrier/Parson_russell_terrier_07549.jpg
CLASS 116 : confidence 80.23%

123.Pomeranian/Pomeranian_07873.jpg
CLASS 123 : confidence 99.05%
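The post-processing in that cell is just a softmax followed by a top-1 pick, with 1 added because the breed folders are numbered from 001. Below is a dependency-free sketch of the same arithmetic on a toy three-class output; the logits are invented for illustration and `top1` is a hypothetical helper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1(logits):
    """Return the 1-based class label and its probability."""
    probs = softmax(logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    return index + 1, probs[index]


# Toy logits for three classes; the real model emits 133 values per image.
label, confidence = top1([0.5, 3.0, 1.0])
print(f'CLASS {label} : confidence {100 * confidence:.2f}%')
```

The confidence is a softmax probability, so it says how strongly the model prefers one class over the others and is not a calibrated accuracy estimate.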



In [278]:
predictor.delete_endpoint()